* [RFC PATCH 00/11] bpf, trace, dtrace: DTrace BPF program type implementation and sample use
@ 2019-05-20 23:47 Kris Van Hees
  2019-05-21 17:56 ` Alexei Starovoitov
                   ` (12 more replies)
  0 siblings, 13 replies; 54+ messages in thread
From: Kris Van Hees @ 2019-05-20 23:47 UTC (permalink / raw)
  To: netdev, bpf, dtrace-devel, linux-kernel
  Cc: rostedt, mhiramat, acme, ast, daniel

This patch set is also available, applied to bpf-next, at the following URL:

	https://github.com/oracle/dtrace-linux-kernel/tree/dtrace-bpf

The patches in this set are part of a larger effort to re-implement DTrace
based on existing Linux kernel features wherever possible.  This allows
existing DTrace scripts to run without modification on Linux, and lets people
write new scripts using a tracing tool they may already be familiar with.
This set of patches is posted as an RFC.  I am soliciting feedback on the
patches, especially because they cross boundaries between tracing and BPF.
Some of the features might be combined with existing more specialized forms
of similar functionality, and perhaps some functionality could be moved to
other parts of the code.

This set of patches provides the initial core to make it possible to execute
DTrace BPF programs as probe actions, triggered from existing probes in the
kernel (right now just kprobe, but more will be added in followup patches).
The DTrace BPF programs run in a specific DTrace context that is independent
of the probe-specific BPF program context, because DTrace actions are
implemented based on a general probe concept (an abstraction of the various
specific probe types).

It also provides a mechanism to store probe data in output buffers directly
from BPF programs, using direct store instructions.  Finally, it provides a
simple sample userspace tool to load programs, collect data, and print out the
data.  This little tool is currently hardcoded to process a single test case,
showing how the BPF program is to be constructed and how to retrieve data
from the output buffers.

The work presented here would not be possible without the effort many people
have put into tracing features on Linux.  BPF in particular has been
instrumental to this project because it provides a safe and fast virtual
execution engine that can be leveraged to execute probe actions in an
elegant manner.  The perf_event ring-buffer output mechanism has also proven
very beneficial for starting a re-implementation of DTrace on Linux,
especially because it avoids adding yet another buffer implementation to the
kernel, and being able to re-use that functionality helped a great deal.

The patch set provides the following patches:

    1. bpf: context casting for tail call

	This patch adds the ability to tail-call into a BPF program of a
	different type than the one initiating the call.  It provides two
	program type specific operations: is_valid_tail_call (to validate
	whether the tail-call between the source type and target type is
	allowed) and convert_ctx (to create a context for the target type
	based on the context of the source type).  It also provides a
	bpf_finalize_context() helper function prototype.  BPF program types
	should implement this helper to perform any final context setup that
	may need to be done within the execution context of the program type.
	This helper is typically invoked as the first statement in an eBPF
	program that can be tail-called from another type.

    2. bpf: add BPF_PROG_TYPE_DTRACE

	This patch adds BPF_PROG_TYPE_DTRACE as a new BPF program type, without
	actually providing an implementation.  The actual implementation is
	added in patch 4 (see below).  We do it this way because the
	implementation is being added to the tracing subsystem as a component
	that I would be happy to maintain (if merged) whereas the declaration
	of the program type must be in the bpf subsystem.  Since the two
	subsystems are maintained by different people, we split the
	implementing patches across maintainer boundaries while ensuring that
	the kernel remains buildable between patches.

    3. bpf: export proto for bpf_perf_event_output helper

	This patch makes a prototype available for the bpf_perf_event_output
	helper so that program types outside of the base tracing eBPF code can
	make use of it.

    4. trace: initial implementation of DTrace based on kernel facilities

	This patch provides the most basic implementation of the DTrace
	execution core based on eBPF and other kernel facilities.  This
	version only supports kprobes.  It makes use of the cross-program-type
	tail-call support added in patch 1 (see above).

    5. trace: update Kconfig and Makefile to include DTrace

	This patch adds DTrace to the kernel config system and ensures that
	if CONFIG_DTRACE is set, the implementation of the DTrace core is
	compiled into the kernel.

    6. dtrace: tiny userspace tool to exercise DTrace support features

	This patch provides a tiny userspace DTrace consumer as a
	proof-of-concept and to test the DTrace eBPF program type and its use
	by the DTrace core.

    7. bpf: implement writable buffers in contexts

	This patch adds the ability to specify writable buffers in an eBPF
	program type context.  The public context declaration should provide
	<buf> and <buf>_end members (<buf> can be any valid identifier) for
	each buffer.  The is_valid_access() function for the program type
	should force the register type of read access to <buf> as
	PTR_TO_BUFFER whereas reading <buf>_end should yield register type
	PTR_TO_BUFFER_END.  The functionality is nearly identical to
	PTR_TO_PACKET and PTR_TO_PACKET_END.  Contexts can have multiple
	writable buffers, distinguished from one another by a new buf_id
	member in the bpf_reg_state struct.  For every writable buffer, both
	<buf> and <buf>_end must provide the same buf_id value (using
	offsetof(context, <buf>) is a good and convenient choice).

    8. perf: add perf_output_begin_forward_in_page

	This patch introduces a new function to commence the process of
	writing data to a perf_event ring-buffer.  This variant enforces the
	requirement that the data to be written cannot cross a page boundary.
	It will fill the remainder of the current page with zeros and allocate
	space for the data in the next page if the remainder of the current
	page is too small.  This is necessary to allow eBPF programs to write
	to the buffer space directly with statements like: buf[offset] = value.

    9. bpf: mark helpers explicitly whether they may change the context

	This patch changes the way BPF determines whether a helper may change
	the content of the context (i.e. if it does, any range information
	related to pointers in the context must be invalidated).  The original
	implementation contained a hard-coded list of helpers that change the
	context.  The new implementation adds a new field to the helper proto
	struct (ctx_update, default false).

    10. bpf: add bpf_buffer_reserve and bpf_buffer_commit helpers

	This patch adds two new helpers: bpf_buffer_reserve (to set up a
	specific buffer in the context as writable space of a given size) and
	bpf_buffer_commit (to finalize the data written to the buffer prepared
	with bpf_buffer_reserve).
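	(A combined sketch of items 7 through 10 follows below this list.)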

    11. dtrace: make use of writable buffers in BPF

	This patch updates the initial implementation of the DTrace core and
	the proof-of-concept utility to make use of the writable-buffer support
	and the bpf_buffer_reserve and bpf_buffer_commit helpers.

(More detailed descriptions can be found in the individual commit messages.)
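
To make items 7 through 10 a bit more concrete, here is a sketch of how a
DTrace BPF program would use a writable context buffer.  The member names
and helper signatures below are assumptions based on the descriptions
above, not the final API; the bounds-check pattern mirrors the existing
PTR_TO_PACKET handling:

	/* Reserve room for a 16-byte record in the output buffer. */
	if (bpf_buffer_reserve(ctx, 16) < 0)
		return 0;

	/*
	 * Re-read the buffer bounds: bpf_buffer_reserve is marked as
	 * ctx_update, so any ranges derived from the context earlier
	 * have been invalidated.
	 */
	buf = (void *)(long)ctx->buf;
	buf_end = (void *)(long)ctx->buf_end;

	/* The verifier requires a bounds check before direct stores. */
	if (buf + 16 > buf_end)
		return 0;
	*(u64 *)buf = probe_id;			/* direct store */
	*(u64 *)(buf + 8) = arg0;

	/* Finalize the record written to the reserved space. */
	bpf_buffer_commit(ctx);

The page-boundary rule of item 8 amounts to the following (illustrative
pseudo-code, not the actual kernel implementation):

	room = PAGE_SIZE - (head & (PAGE_SIZE - 1));
	if (room < size) {
		memset(base + head, 0, room);	/* zero-fill rest of page */
		head += room;			/* record starts on next page */
	}
	/* reserve size bytes at head; the record never crosses a page */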

The road ahead is roughly as follows:

    - Adding support for DTrace-specific probe metadata to be available to
      DTrace BPF programs
    - Adding support for other probe types
    - Adding support for probe arguments
    - Adding support for the DTrace probe naming mechanism to map DTrace-style
      probe names to the actual defined probes in the kernel
    - Adding support for DTrace features for which the kernel currently offers
      no existing functionality
    - Reworking the existing dtrace utility to make use of the new
      implementation
    - Continuing to add features to the DTrace system

	Cheers,
	Kris


* Re: [RFC PATCH 00/11] bpf, trace, dtrace: DTrace BPF program type implementation and sample use
  2019-05-20 23:47 [RFC PATCH 00/11] bpf, trace, dtrace: DTrace BPF program type implementation and sample use Kris Van Hees
@ 2019-05-21 17:56 ` Alexei Starovoitov
  2019-05-21 18:41   ` Kris Van Hees
  2019-05-22 14:25   ` Peter Zijlstra
  2019-05-21 20:39 ` [RFC PATCH 01/11] bpf: context casting for tail call Kris Van Hees
                   ` (11 subsequent siblings)
  12 siblings, 2 replies; 54+ messages in thread
From: Alexei Starovoitov @ 2019-05-21 17:56 UTC (permalink / raw)
  To: Kris Van Hees
  Cc: netdev, bpf, dtrace-devel, linux-kernel, rostedt, mhiramat, acme,
	ast, daniel, peterz

On Mon, May 20, 2019 at 11:47:00PM +0000, Kris Van Hees wrote:
> 
>     2. bpf: add BPF_PROG_TYPE_DTRACE
> 
> 	This patch adds BPF_PROG_TYPE_DTRACE as a new BPF program type, without
> 	actually providing an implementation.  The actual implementation is
> 	added in patch 4 (see below).  We do it this way because the
> 	implementation is being added to the tracing subsystem as a component
> 	that I would be happy to maintain (if merged) whereas the declaration
> 	of the program type must be in the bpf subsystem.  Since the two
> 	subsystems are maintained by different people, we split the
> 	implementing patches across maintainer boundaries while ensuring that
> 	the kernel remains buildable between patches.

None of these kernel patches are necessary for what you want to achieve.
Feel free to add tools/dtrace/ directory and maintain it though.

The new dtrace_buffer doesn't need to replicate existing bpf+kernel functionality
and no changes are necessary in kernel/events/ring_buffer.c either.
tools/dtrace/ user space component can use either per-cpu array map
or hash map as a buffer to store arbitrary data into and use
existing bpf_perf_event_output() to send it to user space via perf ring buffer.

See, for example, how bpftrace does that.
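
For reference, that pattern looks roughly like this (a minimal sketch, not
bpftrace's actual code; 'scratch' stands for a BPF_MAP_TYPE_PERCPU_ARRAY
map and 'events' for a BPF_MAP_TYPE_PERF_EVENT_ARRAY map):

	struct data_t {
		u64	ts;
		u32	pid;
	};

	SEC("kprobe/sys_write") int bpf_prog(struct pt_regs *ctx)
	{
		u32 zero = 0;
		struct data_t *d;

		/* Use the per-cpu array map as scratch space. */
		d = bpf_map_lookup_elem(&scratch, &zero);
		if (!d)
			return 0;

		d->ts = bpf_ktime_get_ns();
		d->pid = bpf_get_current_pid_tgid() >> 32;

		/* Hand the assembled record to the perf ring buffer. */
		bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU,
				      d, sizeof(*d));
		return 0;
	}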



* Re: [RFC PATCH 00/11] bpf, trace, dtrace: DTrace BPF program type implementation and sample use
  2019-05-21 17:56 ` Alexei Starovoitov
@ 2019-05-21 18:41   ` Kris Van Hees
  2019-05-21 20:55     ` Alexei Starovoitov
  2019-05-22 14:25   ` Peter Zijlstra
  1 sibling, 1 reply; 54+ messages in thread
From: Kris Van Hees @ 2019-05-21 18:41 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Kris Van Hees, netdev, bpf, dtrace-devel, linux-kernel, rostedt,
	mhiramat, acme, ast, daniel, peterz

On Tue, May 21, 2019 at 10:56:18AM -0700, Alexei Starovoitov wrote:
> On Mon, May 20, 2019 at 11:47:00PM +0000, Kris Van Hees wrote:
> > 
> >     2. bpf: add BPF_PROG_TYPE_DTRACE
> > 
> > 	This patch adds BPF_PROG_TYPE_DTRACE as a new BPF program type, without
> > 	actually providing an implementation.  The actual implementation is
> > 	added in patch 4 (see below).  We do it this way because the
> > 	implementation is being added to the tracing subsystem as a component
> > 	that I would be happy to maintain (if merged) whereas the declaration
> > 	of the program type must be in the bpf subsystem.  Since the two
> > 	subsystems are maintained by different people, we split the
> > 	implementing patches across maintainer boundaries while ensuring that
> > 	the kernel remains buildable between patches.
> 
> None of these kernel patches are necessary for what you want to achieve.

I disagree.  The current support for BPF programs for probes associates a
specific BPF program type with a specific set of probes, which means that I
cannot write BPF programs based on a more general concept of a 'DTrace probe'
and provide functionality based on that.  It also means that if I have a D
clause (DTrace probe action code associated with probes) that is to be executed
for a list of probes of different types, I need to duplicate the program
because I cannot cross program type boundaries.

By implementing a program type for DTrace, and making it possible for
tail-calls to be made from various probe-specific program types to the DTrace
program type, I can accomplish what I described above.  More details are in
the cover letter and the commit messages of the individual patches.

The reason for these patches is that I cannot do the same with the existing
implementation.  Yes, I can do some of it, or use workarounds to accomplish
something similar, but then I end up with some kind of best-effort
alternative instead of what I actually need.  That is not the goal here.

> Feel free to add tools/dtrace/ directory and maintain it though.

Thank you.

> The new dtrace_buffer doesn't need to replicate existing bpf+kernel functionality
> and no changes are necessary in kernel/events/ring_buffer.c either.
> tools/dtrace/ user space component can use either per-cpu array map
> or hash map as a buffer to store arbitrary data into and use
> existing bpf_perf_event_output() to send it to user space via perf ring buffer.
> 
> See, for example, how bpftrace does that.

When using bpf_perf_event_output() you need to construct the sample first,
and then send it off to user space using the perf ring-buffer.  That is
unnecessary extra work.  Also, storing arbitrary data from userspace in maps
is not relevant here because this is about data that is generated at the level
of the kernel and sent to userspace as part of the probe action that is
executed when the probe fires.

Bpftrace indeed uses maps and ways to construct the sample and then uses the
perf ring-buffer to pass data to userspace.  But that is not the way DTrace
works, and it is not the mechanism that we need here.  So, while this may be
satisfactory for bpftrace, it is not for DTrace.  We need more fine-grained
control over how we write data to the buffer (doing direct stores from BPF
code), without the overhead of constructing a complete sample that can just
be handed over to bpf_perf_event_output().
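
To make the difference concrete (sketch):

	/* bpftrace-style: assemble a record, then copy it out wholesale */
	bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU,
			      &rec, sizeof(rec));

	/* DTrace-style: store fields straight into the output buffer */
	buf[offset] = value;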

Also, please note that I am not duplicating any kernel functionality when it
comes to buffer handling, and in fact, I found it very easy to be able to
tap into the perf event ring-buffer implementation and add a feature that I
need for DTrace.  That was a very pleasant experience for sure!

Kris


* [RFC PATCH 01/11] bpf: context casting for tail call
  2019-05-20 23:47 [RFC PATCH 00/11] bpf, trace, dtrace: DTrace BPF program type implementation and sample use Kris Van Hees
  2019-05-21 17:56 ` Alexei Starovoitov
@ 2019-05-21 20:39 ` Kris Van Hees
  2019-05-21 20:39 ` [RFC PATCH 02/11] bpf: add BPF_PROG_TYPE_DTRACE Kris Van Hees
                   ` (10 subsequent siblings)
  12 siblings, 0 replies; 54+ messages in thread
From: Kris Van Hees @ 2019-05-21 20:39 UTC (permalink / raw)
  To: netdev, bpf, dtrace-devel, linux-kernel
  Cc: rostedt, mhiramat, acme, ast, daniel

Currently BPF programs are executed with a context that is provided by
code that initiates the execution.  Tracing tools that want to make use
of existing probes and events that allow BPF programs to be attached to
them are thus limited to the context information provided by the probe
or event source.  Often, more context is needed to give tracing tools the
ability to implement more complex constructs (e.g. more stateful tracing).

This patch extends the tail-call mechanism to allow a BPF program of
one type to call a BPF program of another type.

BPF program types can specify two new operations in struct bpf_prog_ops:
- bool is_valid_tail_call(enum bpf_prog_type stype)
    This function is called from bpf_prog_array_valid_tail_call()
            which is called from bpf_check_tail_call()
            which is called from bpf_prog_select_runtime()
            which is called from bpf_prog_load() right after the
    verifier finishes processing the program.  It is called for every
    map of type BPF_MAP_TYPE_PROG_ARRAY, and is passed the type of the
    program that is being loaded and therefore will be the origin of
    tail calls.  It returns true if tail calls from the source BPF
    program type to the implementing program type are allowed.

- void *convert_ctx(enum bpf_prog_type stype, void *ctx)
    This function is called during the execution of a BPF tail-call.
    It returns a valid context for the implementing BPF program type,
    based on the passed context pointer (ctx) for BPF program type
    stype.

The program array holding BPF programs that you can tail-call into
continues to require that all programs are of the same type.  But when
a compatibility check is made in a program that performs a tail-call,
the is_valid_tail_call() function is called (if available) to allow
the target type to determine whether it can handle the conversion of
a context from the source type to the target type.  If the function is
not implemented by the program type, casting is denied.

During execution, the convert_ctx() function is called (if available)
to perform the conversion of the current context to the context that the
target type expects.  Since the program type of the executing BPF program
is not explicitly known during execution, the verifier inserts an
instruction right before the tail-call to assign the current BPF program
type to R4.

The interpreter calls convert_ctx() using the program type in R4 as
source program type, the program type associated with the program array
as target program type, and the context as provided in R1.

A helper (finalize_context) is added to allow tail called programs to
perform context setup based on information that is passed in from the
calling program by means of a map that is indexed by CPU id.  The actual
content of the map is defined by the BPF program type implementation
for the program type that is being called.
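
For illustration, the expected shape of such a tail-called program is
roughly as follows (a minimal sketch, with probemap standing for the
CPU-indexed map mentioned above; the DTrace implementation patch carries
a complete sample):

	SEC("dtrace/1") int dt_prog(struct dtrace_bpf_context *ctx)
	{
		/*
		 * Must run first: complete the context using the map
		 * shared with the calling (probe) program.
		 */
		bpf_finalize_context(ctx, &probemap);

		/* ... probe action code ... */
		return 0;
	}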

The bpf_prog_types array is now being exposed to the rest of the BPF
code (where before it was local to just the syscall handling) because
the is_valid_tail_call() and convert_ctx() operations need to be
accessible.

There is no noticeable effect on BPF program types that do not implement
this new feature.

A JIT implementation is not available yet in this first iteration.

v2: Fixed compilation when CONFIG_BPF_SYSCALL=n.
    Fixed casting issue on platforms with 32-bit pointers.

v3: Renamed the new program type operations to be more descriptive.
    Added finalize_context() helper.

Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com>
Reviewed-by: Nick Alcock <nick.alcock@oracle.com>
---
 include/linux/bpf.h                       |  3 +++
 include/uapi/linux/bpf.h                  | 11 ++++++++-
 kernel/bpf/core.c                         | 29 ++++++++++++++++++++++-
 kernel/bpf/syscall.c                      |  2 +-
 kernel/bpf/verifier.c                     | 16 +++++++++----
 tools/include/uapi/linux/bpf.h            | 11 ++++++++-
 tools/testing/selftests/bpf/bpf_helpers.h |  2 ++
 7 files changed, 66 insertions(+), 8 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 59631dd0777c..7a40a3cd7ff2 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -294,6 +294,8 @@ bpf_ctx_record_field_size(struct bpf_insn_access_aux *aux, u32 size)
 struct bpf_prog_ops {
 	int (*test_run)(struct bpf_prog *prog, const union bpf_attr *kattr,
 			union bpf_attr __user *uattr);
+	bool (*is_valid_tail_call)(enum bpf_prog_type stype);
+	void *(*convert_ctx)(enum bpf_prog_type stype, void *ctx);
 };
 
 struct bpf_verifier_ops {
@@ -571,6 +573,7 @@ extern const struct file_operations bpf_prog_fops;
 #undef BPF_PROG_TYPE
 #undef BPF_MAP_TYPE
 
+extern const struct bpf_prog_ops * const bpf_prog_types[];
 extern const struct bpf_prog_ops bpf_offload_prog_ops;
 extern const struct bpf_verifier_ops tc_cls_act_analyzer_ops;
 extern const struct bpf_verifier_ops xdp_analyzer_ops;
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 63e0cf66f01a..61abe6b56948 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2672,6 +2672,14 @@ union bpf_attr {
  *		0 on success.
  *
  *		**-ENOENT** if the bpf-local-storage cannot be found.
+ *
+ * int bpf_finalize_context(void *ctx, struct bpf_map *map)
+ *	Description
+ *		Perform any final context setup after a tail call took
+ *		place from another BPF program type into a program of
+ *		the implementing program type.
+ *	Return
+ *		0 on success, or a negative error in case of failure.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -2782,7 +2790,8 @@ union bpf_attr {
 	FN(strtol),			\
 	FN(strtoul),			\
 	FN(sk_storage_get),		\
-	FN(sk_storage_delete),
+	FN(sk_storage_delete),		\
+	FN(finalize_context),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 242a643af82f..225b1be766b0 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -1456,10 +1456,12 @@ static u64 ___bpf_prog_run(u64 *regs, const struct bpf_insn *insn, u64 *stack)
 		CONT;
 
 	JMP_TAIL_CALL: {
+		void *ctx = (void *) (unsigned long) BPF_R1;
 		struct bpf_map *map = (struct bpf_map *) (unsigned long) BPF_R2;
 		struct bpf_array *array = container_of(map, struct bpf_array, map);
 		struct bpf_prog *prog;
 		u32 index = BPF_R3;
+		u32 type = BPF_R4;
 
 		if (unlikely(index >= array->map.max_entries))
 			goto out;
@@ -1471,6 +1473,13 @@ static u64 ___bpf_prog_run(u64 *regs, const struct bpf_insn *insn, u64 *stack)
 		prog = READ_ONCE(array->ptrs[index]);
 		if (!prog)
 			goto out;
+		if (prog->aux->ops->convert_ctx) {
+			ctx = prog->aux->ops->convert_ctx(type, ctx);
+			if (!ctx)
+				goto out;
+
+			BPF_R1 = (u64) (uintptr_t) ctx;
+		}
 
 		/* ARG1 at this point is guaranteed to point to CTX from
 		 * the verifier side due to the fact that the tail call is
@@ -1667,6 +1676,23 @@ bool bpf_prog_array_compatible(struct bpf_array *array,
 	       array->owner_jited == fp->jited;
 }
 
+bool bpf_prog_array_valid_tail_call(struct bpf_array *array,
+				    const struct bpf_prog *fp)
+{
+#ifdef CONFIG_BPF_SYSCALL
+	const struct bpf_prog_ops *ops;
+
+	if (array->owner_jited != fp->jited)
+		return false;
+
+	ops = bpf_prog_types[array->owner_prog_type];
+	if (ops->is_valid_tail_call)
+		return ops->is_valid_tail_call(fp->type);
+#endif
+
+	return false;
+}
+
 static int bpf_check_tail_call(const struct bpf_prog *fp)
 {
 	struct bpf_prog_aux *aux = fp->aux;
@@ -1680,7 +1706,8 @@ static int bpf_check_tail_call(const struct bpf_prog *fp)
 			continue;
 
 		array = container_of(map, struct bpf_array, map);
-		if (!bpf_prog_array_compatible(array, fp))
+		if (!bpf_prog_array_compatible(array, fp) &&
+		    !bpf_prog_array_valid_tail_call(array, fp))
 			return -EINVAL;
 	}
 
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index ad3ccf82f31d..f76fd30ad372 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1179,7 +1179,7 @@ static int map_freeze(const union bpf_attr *attr)
 	return err;
 }
 
-static const struct bpf_prog_ops * const bpf_prog_types[] = {
+const struct bpf_prog_ops * const bpf_prog_types[] = {
 #define BPF_PROG_TYPE(_id, _name) \
 	[_id] = & _name ## _prog_ops,
 #define BPF_MAP_TYPE(_id, _ops)
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 95f9354495ad..f9e5536fd1af 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -7982,9 +7982,10 @@ static int fixup_bpf_calls(struct bpf_verifier_env *env)
 			insn->imm = 0;
 			insn->code = BPF_JMP | BPF_TAIL_CALL;
 
+			cnt = 0;
 			aux = &env->insn_aux_data[i + delta];
 			if (!bpf_map_ptr_unpriv(aux))
-				continue;
+				goto privileged;
 
 			/* instead of changing every JIT dealing with tail_call
 			 * emit two extra insns:
@@ -7999,13 +8000,20 @@ static int fixup_bpf_calls(struct bpf_verifier_env *env)
 
 			map_ptr = BPF_MAP_PTR(aux->map_state);
 			insn_buf[0] = BPF_JMP_IMM(BPF_JGE, BPF_REG_3,
-						  map_ptr->max_entries, 2);
+						  map_ptr->max_entries, 3);
 			insn_buf[1] = BPF_ALU32_IMM(BPF_AND, BPF_REG_3,
 						    container_of(map_ptr,
 								 struct bpf_array,
 								 map)->index_mask);
-			insn_buf[2] = *insn;
-			cnt = 3;
+			cnt = 2;
+
+privileged:
+			/* store the BPF program type of the current program in
+			 * R4 so it is known in case this tail call requires
+			 * casting the context to a different program type
+			 */
+			insn_buf[cnt++] = BPF_MOV64_IMM(BPF_REG_4, prog->type);
+			insn_buf[cnt++] = *insn;
 			new_prog = bpf_patch_insn_data(env, i + delta, insn_buf, cnt);
 			if (!new_prog)
 				return -ENOMEM;
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 63e0cf66f01a..61abe6b56948 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -2672,6 +2672,14 @@ union bpf_attr {
  *		0 on success.
  *
  *		**-ENOENT** if the bpf-local-storage cannot be found.
+ *
+ * int bpf_finalize_context(void *ctx, struct bpf_map *map)
+ *	Description
+ *		Perform any final context setup after a tail call took
+ *		place from another BPF program type into a program of
+ *		the implementing program type.
+ *	Return
+ *		0 on success, or a negative error in case of failure.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -2782,7 +2790,8 @@ union bpf_attr {
 	FN(strtol),			\
 	FN(strtoul),			\
 	FN(sk_storage_get),		\
-	FN(sk_storage_delete),
+	FN(sk_storage_delete),		\
+	FN(finalize_context),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/tools/testing/selftests/bpf/bpf_helpers.h b/tools/testing/selftests/bpf/bpf_helpers.h
index 6e80b66d7fb1..d98a62b3b56c 100644
--- a/tools/testing/selftests/bpf/bpf_helpers.h
+++ b/tools/testing/selftests/bpf/bpf_helpers.h
@@ -216,6 +216,8 @@ static void *(*bpf_sk_storage_get)(void *map, struct bpf_sock *sk,
 	(void *) BPF_FUNC_sk_storage_get;
 static int (*bpf_sk_storage_delete)(void *map, struct bpf_sock *sk) =
 	(void *)BPF_FUNC_sk_storage_delete;
+static int (*bpf_finalize_context)(void *ctx, void *map) =
+	(void *) BPF_FUNC_finalize_context;
 
 /* llvm builtin functions that eBPF C program may use to
  * emit BPF_LD_ABS and BPF_LD_IND instructions
-- 
2.20.1



* [RFC PATCH 02/11] bpf: add BPF_PROG_TYPE_DTRACE
  2019-05-20 23:47 [RFC PATCH 00/11] bpf, trace, dtrace: DTrace BPF program type implementation and sample use Kris Van Hees
  2019-05-21 17:56 ` Alexei Starovoitov
  2019-05-21 20:39 ` [RFC PATCH 01/11] bpf: context casting for tail call Kris Van Hees
@ 2019-05-21 20:39 ` Kris Van Hees
  2019-05-21 20:39 ` [RFC PATCH 03/11] bpf: export proto for bpf_perf_event_output helper Kris Van Hees
                   ` (9 subsequent siblings)
  12 siblings, 0 replies; 54+ messages in thread
From: Kris Van Hees @ 2019-05-21 20:39 UTC (permalink / raw)
  To: netdev, bpf, dtrace-devel, linux-kernel
  Cc: rostedt, mhiramat, acme, ast, daniel

Add a new BPF program type for DTrace.  The program type is not compiled
right now because the CONFIG_DTRACE option does not exist yet.  It will
be added in a following commit.

Three commits are involved here:

1. add the BPF program type (conditional on a to-be-added option)
2. add the BPF_PROG_TYPE_DTRACE implementation (building not enabled)
3. add the CONFIG_DTRACE option and enable compilation of the prog type
   implementation

The reason for this sequence is to ensure that the kernel tree remains
buildable between these commits.

Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com>
Reviewed-by: Nick Alcock <nick.alcock@oracle.com>
---
 include/linux/bpf_types.h      |  3 +++
 include/uapi/linux/bpf.h       |  1 +
 samples/bpf/bpf_load.c         | 10 +++++++---
 tools/include/uapi/linux/bpf.h |  1 +
 tools/lib/bpf/libbpf.c         |  2 ++
 tools/lib/bpf/libbpf_probes.c  |  1 +
 6 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index 5a9975678d6f..908f2e4f597e 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -26,6 +26,9 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_TRACEPOINT, tracepoint)
 BPF_PROG_TYPE(BPF_PROG_TYPE_PERF_EVENT, perf_event)
 BPF_PROG_TYPE(BPF_PROG_TYPE_RAW_TRACEPOINT, raw_tracepoint)
 BPF_PROG_TYPE(BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE, raw_tracepoint_writable)
+#ifdef CONFIG_DTRACE
+BPF_PROG_TYPE(BPF_PROG_TYPE_DTRACE, dtrace)
+#endif
 #endif
 #ifdef CONFIG_CGROUP_BPF
 BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_DEVICE, cg_dev)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 61abe6b56948..7bcb707539d1 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -170,6 +170,7 @@ enum bpf_prog_type {
 	BPF_PROG_TYPE_FLOW_DISSECTOR,
 	BPF_PROG_TYPE_CGROUP_SYSCTL,
 	BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE,
+	BPF_PROG_TYPE_DTRACE,
 };
 
 enum bpf_attach_type {
diff --git a/samples/bpf/bpf_load.c b/samples/bpf/bpf_load.c
index eae7b635343d..4812295484a1 100644
--- a/samples/bpf/bpf_load.c
+++ b/samples/bpf/bpf_load.c
@@ -87,6 +87,7 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
 	bool is_sockops = strncmp(event, "sockops", 7) == 0;
 	bool is_sk_skb = strncmp(event, "sk_skb", 6) == 0;
 	bool is_sk_msg = strncmp(event, "sk_msg", 6) == 0;
+	bool is_dtrace = strncmp(event, "dtrace", 6) == 0;
 	size_t insns_cnt = size / sizeof(struct bpf_insn);
 	enum bpf_prog_type prog_type;
 	char buf[256];
@@ -120,6 +121,8 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
 		prog_type = BPF_PROG_TYPE_SK_SKB;
 	} else if (is_sk_msg) {
 		prog_type = BPF_PROG_TYPE_SK_MSG;
+	} else if (is_dtrace) {
+		prog_type = BPF_PROG_TYPE_DTRACE;
 	} else {
 		printf("Unknown event '%s'\n", event);
 		return -1;
@@ -140,8 +143,8 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
 	if (is_xdp || is_perf_event || is_cgroup_skb || is_cgroup_sk)
 		return 0;
 
-	if (is_socket || is_sockops || is_sk_skb || is_sk_msg) {
-		if (is_socket)
+	if (is_socket || is_sockops || is_sk_skb || is_sk_msg || is_dtrace) {
+		if (is_socket || is_dtrace)
 			event += 6;
 		else
 			event += 7;
@@ -643,7 +646,8 @@ static int do_load_bpf_file(const char *path, fixup_map_cb fixup_map)
 		    memcmp(shname, "cgroup/", 7) == 0 ||
 		    memcmp(shname, "sockops", 7) == 0 ||
 		    memcmp(shname, "sk_skb", 6) == 0 ||
-		    memcmp(shname, "sk_msg", 6) == 0) {
+		    memcmp(shname, "sk_msg", 6) == 0 ||
+		    memcmp(shname, "dtrace", 6) == 0) {
 			ret = load_and_attach(shname, data->d_buf,
 					      data->d_size);
 			if (ret != 0)
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 61abe6b56948..7bcb707539d1 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -170,6 +170,7 @@ enum bpf_prog_type {
 	BPF_PROG_TYPE_FLOW_DISSECTOR,
 	BPF_PROG_TYPE_CGROUP_SYSCTL,
 	BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE,
+	BPF_PROG_TYPE_DTRACE,
 };
 
 enum bpf_attach_type {
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 7e3b79d7c25f..44704a7d395d 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -2269,6 +2269,7 @@ static bool bpf_prog_type__needs_kver(enum bpf_prog_type type)
 	case BPF_PROG_TYPE_CGROUP_SYSCTL:
 		return false;
 	case BPF_PROG_TYPE_KPROBE:
+	case BPF_PROG_TYPE_DTRACE:
 	default:
 		return true;
 	}
@@ -3209,6 +3210,7 @@ static const struct {
 						BPF_CGROUP_UDP6_SENDMSG),
 	BPF_EAPROG_SEC("cgroup/sysctl",		BPF_PROG_TYPE_CGROUP_SYSCTL,
 						BPF_CGROUP_SYSCTL),
+	BPF_PROG_SEC("dtrace/",			BPF_PROG_TYPE_DTRACE),
 };
 
 #undef BPF_PROG_SEC_IMPL
diff --git a/tools/lib/bpf/libbpf_probes.c b/tools/lib/bpf/libbpf_probes.c
index 5e2aa83f637a..544d8530915e 100644
--- a/tools/lib/bpf/libbpf_probes.c
+++ b/tools/lib/bpf/libbpf_probes.c
@@ -101,6 +101,7 @@ probe_load(enum bpf_prog_type prog_type, const struct bpf_insn *insns,
 	case BPF_PROG_TYPE_SK_REUSEPORT:
 	case BPF_PROG_TYPE_FLOW_DISSECTOR:
 	case BPF_PROG_TYPE_CGROUP_SYSCTL:
+	case BPF_PROG_TYPE_DTRACE:
 	default:
 		break;
 	}
-- 
2.20.1



* [RFC PATCH 03/11] bpf: export proto for bpf_perf_event_output helper
  2019-05-20 23:47 [RFC PATCH 00/11] bpf, trace, dtrace: DTrace BPF program type implementation and sample use Kris Van Hees
                   ` (2 preceding siblings ...)
  2019-05-21 20:39 ` [RFC PATCH 02/11] bpf: add BPF_PROG_TYPE_DTRACE Kris Van Hees
@ 2019-05-21 20:39 ` Kris Van Hees
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 54+ messages in thread
From: Kris Van Hees @ 2019-05-21 20:39 UTC (permalink / raw)
  To: netdev, bpf, dtrace-devel, linux-kernel
  Cc: rostedt, mhiramat, acme, ast, daniel

The bpf_perf_event_output helper is used by various tracer BPF program
types, but it was not visible outside of bpf_trace.c.  In order to make
it available to tracer BPF program types that are implemented elsewhere,
a function is added similar to bpf_get_trace_printk_proto() to query the
prototype (bpf_get_perf_event_output_proto()).

Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com>
Reviewed-by: Nick Alcock <nick.alcock@oracle.com>
---
 include/linux/bpf.h      | 1 +
 kernel/trace/bpf_trace.c | 6 ++++++
 2 files changed, 7 insertions(+)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 7a40a3cd7ff2..e4bcb79656c4 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -479,6 +479,7 @@ bool bpf_prog_array_compatible(struct bpf_array *array, const struct bpf_prog *f
 int bpf_prog_calc_tag(struct bpf_prog *fp);
 
 const struct bpf_func_proto *bpf_get_trace_printk_proto(void);
+const struct bpf_func_proto *bpf_get_perf_event_output_proto(void);
 
 typedef unsigned long (*bpf_ctx_copy_t)(void *dst, const void *src,
 					unsigned long off, unsigned long len);
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index b496ffdf5f36..3d812238bc40 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -473,6 +473,12 @@ static const struct bpf_func_proto bpf_perf_event_output_proto = {
 	.arg5_type	= ARG_CONST_SIZE_OR_ZERO,
 };
 
+const struct bpf_func_proto *bpf_get_perf_event_output_proto(void)
+{
+	return &bpf_perf_event_output_proto;
+}
+
+
 static DEFINE_PER_CPU(struct pt_regs, bpf_pt_regs);
 static DEFINE_PER_CPU(struct perf_sample_data, bpf_misc_sd);
 
-- 
2.20.1



* [RFC PATCH 04/11] trace: initial implementation of DTrace based on kernel facilities
@ 2019-05-21 20:39   ` Kris Van Hees
  0 siblings, 0 replies; 54+ messages in thread
From: Kris Van Hees @ 2019-05-21 20:39 UTC (permalink / raw)
  To: netdev, bpf, dtrace-devel, linux-kernel
  Cc: rostedt, mhiramat, acme, ast, daniel

This patch adds an implementation for BPF_PROG_TYPE_DTRACE, making use of
the tail-call mechanism to allow tail-calls between programs of different
BPF program types.  A new config option (DTRACE) is added to control
whether to include this new feature.

The DTRACE BPF program type provides an environment for executing probe
actions within the generalized trace context, independent of the type of
probe that triggered the program.  Probe types support specific BPF program
types, and this implementation uses the tail-call mechanism to call into
the DTRACE BPF program type from probe BPF program types.  This initial
implementation provides support for the KPROBE type only - more will be
added in the near future.

The implementation provides:
 - dtrace_get_func_proto() as helper validator
 - dtrace_is_valid_access() as context access validator
 - dtrace_convert_ctx_access() as context access rewriter
 - dtrace_is_valid_tail_call() to validate the calling program type
 - dtrace_convert_ctx() to convert the context of the calling program into
   a DTRACE BPF program type context
 - dtrace_finalize_context() as bpf_finalize_context() helper for the
   DTRACE BPF program type

The dtrace_bpf_ctx struct defines the DTRACE BPF program type context at
the kernel level, and stores the following members:

	struct pt_regs *regs		- register state when probe fired
	u32 ecb_id			- probe enabling ID
	u32 probe_id			- probe ID
	struct task_struct *task	- executing task when probe fired

The regs and task members are populated from dtrace_convert_ctx() which is
called during the tail-call processing.  The ecb_id and probe_id are
populated from dtrace_finalize_context().

Sample use:

	#include <linux/dtrace.h>

	/*
	 * Map to store DTRACE BPF programs that can be called using
	 * the tail-call mechanism from other BPF program types.
	 */
	struct bpf_map_def SEC("maps") progs = {
		.type = BPF_MAP_TYPE_PROG_ARRAY,
		.key_size = sizeof(u32),
		.value_size = sizeof(u32),
		.max_entries = 8192,
	};

	/*
	 * Map to store DTrace probe specific information and share
	 * it across program boundaries.  This makes it possible for
	 * DTRACE BPF program to know what probe caused them to get
	 * called.
	 */
	struct bpf_map_def SEC("maps") probemap = {
		.type = BPF_MAP_TYPE_ARRAY,
		.key_size = sizeof(u32),
		.value_size = sizeof(struct dtrace_ecb),
		.max_entries = NR_CPUS,
	};

	SEC("dtrace/1") int dt_probe1(struct dtrace_bpf_context *ctx)
	{
		char			fmt[] = "EPID %d PROBE %d\n";

		bpf_finalize_context(ctx, &probemap);
		bpf_trace_printk(fmt, sizeof(fmt),
				 ctx->ecb_id, ctx->probe_id);

		return 0;
	}

	SEC("kprobe/sys_write") int bpf_prog1(struct pt_regs *ctx)
	{
		struct dtrace_ecb	ecb;
		int			cpu;

		cpu = bpf_get_smp_processor_id();
		ecb.id = 3;
		ecb.probe_id = 123;

		/* Store the ECB. */
		bpf_map_update_elem(&probemap, &cpu, &ecb, BPF_ANY);

		/* Issue tail-call into DTRACE BPF program. */
		bpf_tail_call(ctx, &progs, 1);

		/* fall through -> program not found or call failed */
		return 0;
	}

This patch also adds DTrace as a new subsystem in the MAINTAINERS file,
with me as current maintainer and our development mailing list for
specific development discussions.

Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com>
Reviewed-by: Nick Alcock <nick.alcock@oracle.com>
---
 MAINTAINERS                  |   7 +
 include/uapi/linux/dtrace.h  |  50 ++++++
 kernel/trace/dtrace/Kconfig  |   7 +
 kernel/trace/dtrace/Makefile |   3 +
 kernel/trace/dtrace/bpf.c    | 321 +++++++++++++++++++++++++++++++++++
 5 files changed, 388 insertions(+)
 create mode 100644 include/uapi/linux/dtrace.h
 create mode 100644 kernel/trace/dtrace/Kconfig
 create mode 100644 kernel/trace/dtrace/Makefile
 create mode 100644 kernel/trace/dtrace/bpf.c

diff --git a/MAINTAINERS b/MAINTAINERS
index ce573aaa04df..07da7cc69f23 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5450,6 +5450,13 @@ W:	https://linuxtv.org
 S:	Odd Fixes
 F:	drivers/media/pci/dt3155/
 
+DTRACE
+M:	Kris Van Hees <kris.van.hees@oracle.com>
+L:	dtrace-devel@oss.oracle.com
+S:	Maintained
+F:	include/uapi/linux/dtrace.h
+F:	kernel/trace/dtrace
+
 DVB_USB_AF9015 MEDIA DRIVER
 M:	Antti Palosaari <crope@iki.fi>
 L:	linux-media@vger.kernel.org
diff --git a/include/uapi/linux/dtrace.h b/include/uapi/linux/dtrace.h
new file mode 100644
index 000000000000..bbe2562c11f2
--- /dev/null
+++ b/include/uapi/linux/dtrace.h
@@ -0,0 +1,50 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+/*
+ * Copyright (c) 2019, Oracle and/or its affiliates. All rights reserved.
+ */
+#ifndef _UAPI_LINUX_DTRACE_H
+#define _UAPI_LINUX_DTRACE_H
+
+#include <linux/types.h>
+#include <linux/bpf.h>
+#include <asm/bpf_perf_event.h>
+
+/*
+ * Public definition of the BPF context for DTrace BPF programs.  It stores
+ * probe firing state, probe definition information, and current task state.
+ */
+struct dtrace_bpf_context {
+	/* CPU registers */
+	bpf_user_pt_regs_t regs;
+
+	/* Probe info */
+	u32 ecb_id;
+	u32 probe_id;
+
+	/* Current task info */
+	u64 task;	/* current */
+	u64 state;	/* current->state */
+	u32 prio;	/* current->prio */
+	u32 cpu;	/* current->cpu or current->thread_info->cpu */
+	u32 tid;	/* current->pid */
+	u32 pid;	/* current->tgid */
+	u32 ppid;	/* current->real_parent->tgid */
+	u32 uid;	/* from_kuid(&init_user_ns, current_real_cred()->uid */
+	u32 gid;	/* from_kgid(&init_user_ns, current_real_cred()->gid */
+	u32 euid;	/* from_kuid(&init_user_ns, current_real_cred()->euid */
+	u32 egid;	/* from_kgid(&init_user_ns, current_real_cred()->egid */
+};
+
+/*
+ * Struct to identify BPF programs attached to probes.  The BPF program should
+ * populate a dtrace_ecb struct with a unique ID and the ID by which the probe
+ * is known to DTrace.  The struct will be stored in a map by the BPF program
+ * attached to the probe, and it is retrieved by the DTrace BPF program that
+ * implements the actual probe actions.
+ */
+struct dtrace_ecb {
+	u32	id;
+	u32	probe_id;
+};
+
+#endif /* _UAPI_LINUX_DTRACE_H */
diff --git a/kernel/trace/dtrace/Kconfig b/kernel/trace/dtrace/Kconfig
new file mode 100644
index 000000000000..e94af706ae70
--- /dev/null
+++ b/kernel/trace/dtrace/Kconfig
@@ -0,0 +1,7 @@
+config DTRACE
+	bool "DTrace"
+	depends on BPF_EVENTS
+	help
+	  Enable DTrace support.  This version of DTrace is implemented using
+	  existing kernel facilities such as BPF and the perf event output
+	  buffer.
diff --git a/kernel/trace/dtrace/Makefile b/kernel/trace/dtrace/Makefile
new file mode 100644
index 000000000000..d04a8d7be577
--- /dev/null
+++ b/kernel/trace/dtrace/Makefile
@@ -0,0 +1,3 @@
+# SPDX-License-Identifier: GPL-2.0
+
+obj-y += bpf.o
diff --git a/kernel/trace/dtrace/bpf.c b/kernel/trace/dtrace/bpf.c
new file mode 100644
index 000000000000..95f4103d749e
--- /dev/null
+++ b/kernel/trace/dtrace/bpf.c
@@ -0,0 +1,321 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2019, Oracle and/or its affiliates. All rights reserved.
+ */
+#include <linux/bpf.h>
+#include <linux/dtrace.h>
+#include <linux/filter.h>
+#include <linux/ptrace.h>
+#include <linux/sched.h>
+
+/*
+ * Actual kernel definition of the DTrace BPF context.
+ */
+struct dtrace_bpf_ctx {
+	struct pt_regs			*regs;
+	u32				ecb_id;
+	u32				probe_id;
+	struct task_struct		*task;
+};
+
+/*
+ * Helper to complete the setup of the BPF context for DTrace BPF programs.  It
+ * is to be called at the very beginning of a BPF function that is getting
+ * tail-called from another BPF program type.
+ *
+ * The provided map should be a bpf_array that holds dtrace_ecb structs as
+ * elements and should be indexed by CPU id.
+ */
+BPF_CALL_2(dtrace_finalize_context, struct dtrace_bpf_ctx *, ctx,
+	   struct bpf_map *, map)
+{
+	struct bpf_array	*arr = container_of(map, struct bpf_array, map);
+	struct dtrace_ecb	*ecb;
+	unsigned int		cpu = smp_processor_id();
+
+	/*
+	 * There is no way to ensure that we were called with the correct map.
+	 * Perform sanity checking on the map, and ensure that the index is
+	 * not out of range.
+	 * This won't guarantee that the content is meaningful, but at least we
+	 * can ensure that accessing the map is safe.
+	 * If the content is garbage, the resulting context will be garbage
+	 * also - but it won't be unsafe.
+	 */
+	if (unlikely(map->map_type != BPF_MAP_TYPE_ARRAY))
+		return -EINVAL;
+	if (unlikely(map->value_size != sizeof(*ecb)))
+		return -EINVAL;
+	if (unlikely(cpu >= map->max_entries))
+		return -E2BIG;
+
+	ecb = READ_ONCE(arr->ptrs[cpu]);
+	if (!ecb)
+		return -ENOENT;
+
+	ctx->ecb_id = ecb->id;
+	ctx->probe_id = ecb->probe_id;
+
+	return 0;
+}
+
+static const struct bpf_func_proto dtrace_finalize_context_proto = {
+	.func           = dtrace_finalize_context,
+	.gpl_only       = false,
+	.ret_type       = RET_INTEGER,
+	.arg1_type      = ARG_PTR_TO_CTX,		/* ctx */
+	.arg2_type      = ARG_CONST_MAP_PTR,		/* map */
+};
+
+static const struct bpf_func_proto *
+dtrace_get_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+{
+	switch (func_id) {
+	case BPF_FUNC_finalize_context:
+		return &dtrace_finalize_context_proto;
+	case BPF_FUNC_perf_event_output:
+		return bpf_get_perf_event_output_proto();
+	case BPF_FUNC_trace_printk:
+		return bpf_get_trace_printk_proto();
+	case BPF_FUNC_get_smp_processor_id:
+		return &bpf_get_smp_processor_id_proto;
+	case BPF_FUNC_map_lookup_elem:
+		return &bpf_map_lookup_elem_proto;
+	case BPF_FUNC_map_update_elem:
+		return &bpf_map_update_elem_proto;
+	case BPF_FUNC_map_delete_elem:
+		return &bpf_map_delete_elem_proto;
+	default:
+		return NULL;
+	}
+}
+
+/*
+ * Verify access to context data members.
+ */
+static bool dtrace_is_valid_access(int off, int size, enum bpf_access_type type,
+				   const struct bpf_prog *prog,
+				   struct bpf_insn_access_aux *info)
+{
+	/* Ensure offset is within the context structure. */
+	if (off < 0 || off >= sizeof(struct dtrace_bpf_context))
+		return false;
+
+	/* Only READ access is allowed. */
+	if (type != BPF_READ)
+		return false;
+
+	/* Ensure offset is aligned (verifier guarantees size > 0). */
+	if (off % size != 0)
+		return false;
+
+	switch (off) {
+	case bpf_ctx_range(struct dtrace_bpf_context, task):
+	case bpf_ctx_range(struct dtrace_bpf_context, state):
+		bpf_ctx_record_field_size(info, sizeof(u64));
+		if (bpf_ctx_narrow_access_ok(off, size, sizeof(u64)))
+			return true;
+		break;
+	case bpf_ctx_range(struct dtrace_bpf_context, ecb_id):
+	case bpf_ctx_range(struct dtrace_bpf_context, probe_id):
+	case bpf_ctx_range(struct dtrace_bpf_context, prio):
+	case bpf_ctx_range(struct dtrace_bpf_context, cpu):
+	case bpf_ctx_range(struct dtrace_bpf_context, tid):
+	case bpf_ctx_range(struct dtrace_bpf_context, pid):
+	case bpf_ctx_range(struct dtrace_bpf_context, ppid):
+	case bpf_ctx_range(struct dtrace_bpf_context, uid):
+	case bpf_ctx_range(struct dtrace_bpf_context, gid):
+	case bpf_ctx_range(struct dtrace_bpf_context, euid):
+	case bpf_ctx_range(struct dtrace_bpf_context, egid):
+		bpf_ctx_record_field_size(info, sizeof(u32));
+		if (bpf_ctx_narrow_access_ok(off, size, sizeof(u32)))
+			return true;
+		break;
+	default:
+		if (size == sizeof(unsigned long))
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * A set of macros to make the access conversion code a little easier to read:
+ *
+ *  BPF_LDX_CTX_PTR(type, member, si)
+ *	si->dst_reg = ((type *)si->src_reg)->member	[member must be a ptr]
+ *
+ *  BPF_LDX_LNK_PTR(type, member, si)
+ *	si->dst_reg = ((type *)si->dst_reg)->member	[member must be a ptr]
+ *
+ *  BPF_LDX_CTX_FIELD(type, member, si, target_size)
+ *	si->dst_reg = ((type *)si->src_reg)->member
+ *	target_size = sizeof(((type *)si->src_reg)->member)
+ *
+ *  BPF_LDX_LNK_FIELD(type, member, si, target_size)
+ *	si->dst_reg = ((type *)si->dst_reg)->member
+ *	target_size = sizeof(((type *)si->dst_reg)->member)
+ *
+ * BPF_LDX_LNK_PTR must be preceded by BPF_LDX_CTX_PTR or BPF_LDX_LNK_PTR.
+ * BPF_LDX_LNK_FIELD must be preceded by BPF_LDX_CTX_PTR or BPF_LDX_LNK_PTR.
+ */
+#define BPF_LDX_CTX_PTR(type, member, si) \
+	BPF_LDX_MEM(BPF_FIELD_SIZEOF(type, member), \
+		    (si)->dst_reg, (si)->src_reg, offsetof(type, member))
+#define BPF_LDX_LNK_PTR(type, member, si) \
+	BPF_LDX_MEM(BPF_FIELD_SIZEOF(type, member), \
+		    (si)->dst_reg, (si)->dst_reg, offsetof(type, member))
+#define BPF_LDX_CTX_FIELD(type, member, si, target_size) \
+	BPF_LDX_MEM(BPF_FIELD_SIZEOF(type, member), \
+		    (si)->dst_reg, (si)->src_reg, \
+		    ({ \
+			*(target_size) = FIELD_SIZEOF(type, member); \
+			offsetof(type, member); \
+		    }))
+#define BPF_LDX_LNK_FIELD(type, member, si, target_size) \
+	BPF_LDX_MEM(BPF_FIELD_SIZEOF(type, member), \
+		    (si)->dst_reg, (si)->dst_reg, \
+		    ({ \
+			*(target_size) = FIELD_SIZEOF(type, member); \
+			offsetof(type, member); \
+		    }))
+
+/*
+ * Generate BPF instructions to retrieve the actual value for a member in the
+ * public BPF context, based on the kernel implementation of the context.
+ */
+static u32 dtrace_convert_ctx_access(enum bpf_access_type type,
+				     const struct bpf_insn *si,
+				     struct bpf_insn *insn_buf,
+				     struct bpf_prog *prog, u32 *target_size)
+{
+	struct bpf_insn *insn = insn_buf;
+
+	switch (si->off) {
+	case offsetof(struct dtrace_bpf_context, ecb_id):
+		*insn++ = BPF_LDX_CTX_FIELD(struct dtrace_bpf_ctx, ecb_id, si,
+					    target_size);
+		break;
+	case offsetof(struct dtrace_bpf_context, probe_id):
+		*insn++ = BPF_LDX_CTX_FIELD(struct dtrace_bpf_ctx, probe_id, si,
+					    target_size);
+		break;
+	case offsetof(struct dtrace_bpf_context, task):
+		*insn++ = BPF_LDX_CTX_FIELD(struct dtrace_bpf_ctx, task, si,
+					    target_size);
+		break;
+	case offsetof(struct dtrace_bpf_context, state):
+		*insn++ = BPF_LDX_CTX_PTR(struct dtrace_bpf_ctx, task, si);
+		*insn++ = BPF_LDX_LNK_FIELD(struct task_struct, state, si,
+					    target_size);
+		break;
+	case offsetof(struct dtrace_bpf_context, prio):
+		*insn++ = BPF_LDX_CTX_PTR(struct dtrace_bpf_ctx, task, si);
+		*insn++ = BPF_LDX_LNK_FIELD(struct task_struct, prio, si,
+					    target_size);
+		break;
+	case offsetof(struct dtrace_bpf_context, cpu):
+		*insn++ = BPF_LDX_CTX_PTR(struct dtrace_bpf_ctx, task, si);
+#ifdef CONFIG_THREAD_INFO_IN_TASK
+		*insn++ = BPF_LDX_LNK_FIELD(struct task_struct, cpu, si,
+					    target_size);
+#else
+		*insn++ = BPF_LDX_LNK_PTR(struct task_struct, stack, si);
+		*insn++ = BPF_LDX_LNK_FIELD(struct thread_info, cpu, si,
+					    target_size);
+#endif
+		break;
+	case offsetof(struct dtrace_bpf_context, tid):
+		*insn++ = BPF_LDX_CTX_PTR(struct dtrace_bpf_ctx, task, si);
+		*insn++ = BPF_LDX_LNK_FIELD(struct task_struct, pid, si,
+					    target_size);
+		break;
+	case offsetof(struct dtrace_bpf_context, pid):
+		*insn++ = BPF_LDX_CTX_PTR(struct dtrace_bpf_ctx, task, si);
+		*insn++ = BPF_LDX_LNK_FIELD(struct task_struct, tgid, si,
+					    target_size);
+		break;
+	case offsetof(struct dtrace_bpf_context, ppid):
+		*insn++ = BPF_LDX_CTX_PTR(struct dtrace_bpf_ctx, task, si);
+		*insn++ = BPF_LDX_LNK_PTR(struct task_struct, real_parent, si);
+		*insn++ = BPF_LDX_LNK_FIELD(struct task_struct, tgid, si,
+					    target_size);
+		break;
+	case offsetof(struct dtrace_bpf_context, uid):
+		*insn++ = BPF_LDX_CTX_PTR(struct dtrace_bpf_ctx, task, si);
+		*insn++ = BPF_LDX_LNK_PTR(struct task_struct, cred, si);
+		*insn++ = BPF_LDX_LNK_FIELD(struct cred, uid, si, target_size);
+		break;
+	case offsetof(struct dtrace_bpf_context, gid):
+		*insn++ = BPF_LDX_CTX_PTR(struct dtrace_bpf_ctx, task, si);
+		*insn++ = BPF_LDX_LNK_PTR(struct task_struct, cred, si);
+		*insn++ = BPF_LDX_LNK_FIELD(struct cred, gid, si, target_size);
+		break;
+	case offsetof(struct dtrace_bpf_context, euid):
+		*insn++ = BPF_LDX_CTX_PTR(struct dtrace_bpf_ctx, task, si);
+		*insn++ = BPF_LDX_LNK_PTR(struct task_struct, cred, si);
+		*insn++ = BPF_LDX_LNK_FIELD(struct cred, euid, si, target_size);
+		break;
+	case offsetof(struct dtrace_bpf_context, egid):
+		*insn++ = BPF_LDX_CTX_PTR(struct dtrace_bpf_ctx, task, si);
+		*insn++ = BPF_LDX_LNK_PTR(struct task_struct, cred, si);
+		*insn++ = BPF_LDX_LNK_FIELD(struct cred, egid, si, target_size);
+		break;
+	default:
+		*insn++ = BPF_LDX_CTX_PTR(struct dtrace_bpf_ctx, regs, si);
+		*insn++ = BPF_LDX_MEM(BPF_SIZEOF(long), si->dst_reg, si->dst_reg,
+				      si->off);
+		break;
+	}
+
+	return insn - insn_buf;
+}
+
+const struct bpf_verifier_ops dtrace_verifier_ops = {
+	.get_func_proto		= dtrace_get_func_proto,
+	.is_valid_access	= dtrace_is_valid_access,
+	.convert_ctx_access	= dtrace_convert_ctx_access,
+};
+
+/*
+ * Verify whether BPF programs of the given program type can tail-call DTrace
+ * BPF programs.
+ */
+static bool dtrace_is_valid_tail_call(enum bpf_prog_type stype)
+{
+	if (stype == BPF_PROG_TYPE_KPROBE)
+		return true;
+
+	return false;
+}
+
+/*
+ * Only one BPF program can be running on a given CPU at a time, and tail-call
+ * execution is effectively a jump (no return possible).  Therefore, we never
+ * need more than one DTrace BPF context per CPU.
+ */
+DEFINE_PER_CPU(struct dtrace_bpf_ctx, dtrace_ctx);
+
+/*
+ * Create a DTrace BPF program execution context based on the provided context
+ * for the given BPF program type.
+ */
+static void *dtrace_convert_ctx(enum bpf_prog_type stype, void *ctx)
+{
+	struct dtrace_bpf_ctx *gctx;
+
+	if (stype == BPF_PROG_TYPE_KPROBE) {
+		gctx = this_cpu_ptr(&dtrace_ctx);
+		gctx->regs = (struct pt_regs *)ctx;
+		gctx->task = current;
+
+		return gctx;
+	}
+
+	return NULL;
+}
+
+const struct bpf_prog_ops dtrace_prog_ops = {
+	.is_valid_tail_call	= dtrace_is_valid_tail_call,
+	.convert_ctx		= dtrace_convert_ctx,
+};
-- 
2.20.1



* [RFC PATCH 05/11] trace: update Kconfig and Makefile to include DTrace
  2019-05-20 23:47 [RFC PATCH 00/11] bpf, trace, dtrace: DTrace BPF program type implementation and sample use Kris Van Hees
                   ` (4 preceding siblings ...)
@ 2019-05-21 20:39 ` Kris Van Hees
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 54+ messages in thread
From: Kris Van Hees @ 2019-05-21 20:39 UTC (permalink / raw)
  To: netdev, bpf, dtrace-devel, linux-kernel
  Cc: rostedt, mhiramat, acme, ast, daniel

This commit adds the dtrace implementation in kernel/trace/dtrace to
the trace Kconfig and Makefile.

Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com>
Reviewed-by: Nick Alcock <nick.alcock@oracle.com>
---
 kernel/trace/Kconfig  | 2 ++
 kernel/trace/Makefile | 1 +
 2 files changed, 3 insertions(+)

diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index 5d965cef6c77..59c3bdfbaffc 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -786,6 +786,8 @@ config GCOV_PROFILE_FTRACE
 	  Note that on a kernel compiled with this config, ftrace will
 	  run significantly slower.
 
+source "kernel/trace/dtrace/Kconfig"
+
 endif # FTRACE
 
 endif # TRACING_SUPPORT
diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
index c2b2148bb1d2..e643c4eac8f6 100644
--- a/kernel/trace/Makefile
+++ b/kernel/trace/Makefile
@@ -82,6 +82,7 @@ endif
 obj-$(CONFIG_DYNAMIC_EVENTS) += trace_dynevent.o
 obj-$(CONFIG_PROBE_EVENTS) += trace_probe.o
 obj-$(CONFIG_UPROBE_EVENTS) += trace_uprobe.o
+obj-$(CONFIG_DTRACE) += dtrace/
 
 obj-$(CONFIG_TRACEPOINT_BENCHMARK) += trace_benchmark.o
 
-- 
2.20.1



* [RFC PATCH 06/11] dtrace: tiny userspace tool to exercise DTrace support features
@ 2019-05-21 20:39   ` Kris Van Hees
  0 siblings, 0 replies; 54+ messages in thread
From: Kris Van Hees @ 2019-05-21 20:39 UTC (permalink / raw)
  To: netdev, bpf, dtrace-devel, linux-kernel
  Cc: rostedt, mhiramat, acme, ast, daniel

This commit provides a small tool that makes use of the following new
features in BPF, as a sample of how the DTrace userspace code will
interact with the kernel:

- It uses the ability to tail-call into a BPF program of a different
  type (as long as the proper context conversion is implemented).  This
  is used to attach a BPF kprobe program to a kprobe and have it
  tail-call a BPF dtrace program.  We do this so the probe action can
  execute in a tracer-specific context rather than in a probe context.
  This way, probes of different types can all execute the probe actions
  in the same probe-independent context.

- It uses the new bpf_finalize_context() helper to retrieve data that is
  set in the BPF kprobe program attached to the probe, and uses that data
  to further populate the tracer context for the BPF dtrace program.

Output is generated using the bpf_perf_event_output() helper.  This tiny
proof-of-concept tool demonstrates that the tail-call mechanism into a
different BPF program type works correctly, and that it is possible to
create a new context that contains information that is currently not
available to BPF programs from either a context or by means of a helper
(aside from using bpf_probe_read() on an address that is derived from
the current task).
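
For reference, the consumer side reads the perf ring-buffer roughly along
these lines (a simplified sketch of the mmap ABI usage, not the actual
dt_buffer.c code; base, buf_size, and page_size come from the mmap()
setup, and process_event() is a placeholder):

	struct perf_event_mmap_page *hdr = base;	/* mmap()ed ring */
	u64 head = hdr->data_head;	/* needs a load-acquire in real code */

	while (hdr->data_tail < head) {
		u64 off = hdr->data_tail & (buf_size - 1);
		struct perf_event_header *ev = base + page_size + off;

		/* PERF_RECORD_SAMPLE events carry the probe data. */
		process_event(ev);
		hdr->data_tail += ev->size;	/* consume the record */
	}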

Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com>
Reviewed-by: Nick Alcock <nick.alcock@oracle.com>
---
 MAINTAINERS               |   1 +
 tools/dtrace/.gitignore   |   1 +
 tools/dtrace/Makefile     |  79 ++++++++
 tools/dtrace/dt_bpf.c     |  15 ++
 tools/dtrace/dt_buffer.c  | 386 ++++++++++++++++++++++++++++++++++++++
 tools/dtrace/dt_utils.c   | 132 +++++++++++++
 tools/dtrace/dtrace.c     |  38 ++++
 tools/dtrace/dtrace.h     |  44 +++++
 tools/dtrace/probe1_bpf.c | 100 ++++++++++
 9 files changed, 796 insertions(+)
 create mode 100644 tools/dtrace/.gitignore
 create mode 100644 tools/dtrace/Makefile
 create mode 100644 tools/dtrace/dt_bpf.c
 create mode 100644 tools/dtrace/dt_buffer.c
 create mode 100644 tools/dtrace/dt_utils.c
 create mode 100644 tools/dtrace/dtrace.c
 create mode 100644 tools/dtrace/dtrace.h
 create mode 100644 tools/dtrace/probe1_bpf.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 07da7cc69f23..6d934c9f5f93 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5456,6 +5456,7 @@ L:	dtrace-devel@oss.oracle.com
 S:	Maintained
 F:	include/uapi/linux/dtrace.h
 F:	kernel/trace/dtrace
+F:	tools/dtrace
 
 DVB_USB_AF9015 MEDIA DRIVER
 M:	Antti Palosaari <crope@iki.fi>
diff --git a/tools/dtrace/.gitignore b/tools/dtrace/.gitignore
new file mode 100644
index 000000000000..d60e73526296
--- /dev/null
+++ b/tools/dtrace/.gitignore
@@ -0,0 +1 @@
+dtrace
diff --git a/tools/dtrace/Makefile b/tools/dtrace/Makefile
new file mode 100644
index 000000000000..c2ee3fb2576f
--- /dev/null
+++ b/tools/dtrace/Makefile
@@ -0,0 +1,79 @@
+# SPDX-License-Identifier: GPL-2.0
+#
+# This Makefile is shamelessly copied from samples/bpf and modified to support
+# building this prototype tracing tool.
+
+DTRACE_PATH		?= $(abspath $(srctree)/$(src))
+TOOLS_PATH		:= $(DTRACE_PATH)/..
+SAMPLES_PATH		:= $(DTRACE_PATH)/../../samples
+
+hostprogs-y		:= dtrace
+
+LIBBPF			:= $(TOOLS_PATH)/lib/bpf/libbpf.a
+OBJS			:= ../../samples/bpf/bpf_load.o dt_bpf.o dt_buffer.o dt_utils.o
+
+dtrace-objs		:= $(OBJS) dtrace.o
+
+always			:= $(hostprogs-y)
+always			+= probe1_bpf.o
+
+KBUILD_HOSTCFLAGS	+= -I$(srctree)/tools/lib
+KBUILD_HOSTCFLAGS	+= -I$(srctree)/tools/perf
+KBUILD_HOSTCFLAGS	+= -I$(srctree)/tools/include/uapi
+KBUILD_HOSTCFLAGS	+= -I$(srctree)/tools/include/
+KBUILD_HOSTCFLAGS	+= -I$(srctree)/usr/include
+
+KBUILD_HOSTLDLIBS	:= $(LIBBPF) -lelf
+
+LLC			?= llc
+CLANG			?= clang
+LLVM_OBJCOPY		?= llvm-objcopy
+
+ifdef CROSS_COMPILE
+HOSTCC			= $(CROSS_COMPILE)gcc
+CLANG_ARCH_ARGS		= -target $(ARCH)
+endif
+
+all:
+	$(MAKE) -C ../../ $(CURDIR)/ DTRACE_PATH=$(CURDIR)
+
+clean:
+	$(MAKE) -C ../../ M=$(CURDIR) clean
+	@rm -f *~
+
+$(LIBBPF): FORCE
+	$(MAKE) -C $(dir $@) RM='rm -rf' LDFLAGS= srctree=$(DTRACE_PATH)/../../ O=
+
+FORCE:
+
+.PHONY: verify_cmds verify_target_bpf $(CLANG) $(LLC)
+
+verify_cmds: $(CLANG) $(LLC)
+	@for TOOL in $^ ; do \
+		if ! (which -- "$${TOOL}" > /dev/null 2>&1); then \
+			echo "*** ERROR: Cannot find LLVM tool $${TOOL}" ;\
+			exit 1; \
+		else true; fi; \
+	done
+
+verify_target_bpf: verify_cmds
+	@if ! (${LLC} -march=bpf -mattr=help > /dev/null 2>&1); then \
+		echo "*** ERROR: LLVM (${LLC}) does not support 'bpf' target" ;\
+		echo "   NOTICE: LLVM version >= 3.7.1 required" ;\
+		exit 2; \
+	else true; fi
+
+$(DTRACE_PATH)/*.c: verify_target_bpf $(LIBBPF)
+$(src)/*.c: verify_target_bpf $(LIBBPF)
+
+$(obj)/%.o: $(src)/%.c
+	@echo "  CLANG-bpf " $@
+	$(Q)$(CLANG) $(NOSTDINC_FLAGS) $(LINUXINCLUDE) $(EXTRA_CFLAGS) -I$(obj) \
+		-I$(srctree)/tools/testing/selftests/bpf/ \
+		-D__KERNEL__ -D__BPF_TRACING__ -Wno-unused-value -Wno-pointer-sign \
+		-D__TARGET_ARCH_$(ARCH) -Wno-compare-distinct-pointer-types \
+		-Wno-gnu-variable-sized-type-not-at-end \
+		-Wno-address-of-packed-member -Wno-tautological-compare \
+		-Wno-unknown-warning-option $(CLANG_ARCH_ARGS) \
+		-I$(srctree)/samples/bpf/ -include asm_goto_workaround.h \
+		-O2 -emit-llvm -c $< -o -| $(LLC) -march=bpf $(LLC_FLAGS) -filetype=obj -o $@
diff --git a/tools/dtrace/dt_bpf.c b/tools/dtrace/dt_bpf.c
new file mode 100644
index 000000000000..7919fc070685
--- /dev/null
+++ b/tools/dtrace/dt_bpf.c
@@ -0,0 +1,15 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2019, Oracle and/or its affiliates. All rights reserved.
+ */
+#include <stdio.h>
+
+#include "dtrace.h"
+
+/*
+ * Load the given BPF ELF object file, and apply any necessary BPF map fixups.
+ */
+int dt_bpf_load_file(char *fn)
+{
+	return load_bpf_file_fixup_map(fn, dt_buffer_fixup_map);
+}
diff --git a/tools/dtrace/dt_buffer.c b/tools/dtrace/dt_buffer.c
new file mode 100644
index 000000000000..65c107ca8ac4
--- /dev/null
+++ b/tools/dtrace/dt_buffer.c
@@ -0,0 +1,386 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2019, Oracle and/or its affiliates. All rights reserved.
+ */
+#include <errno.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <syscall.h>
+#include <unistd.h>
+#include <sys/epoll.h>
+#include <sys/ioctl.h>
+#include <sys/mman.h>
+#include <linux/perf_event.h>
+
+#include "../../include/uapi/linux/dtrace.h"
+#include "dtrace.h"
+
+/*
+ * Probe data is recorded in per-CPU perf ring buffers.
+ */
+struct dtrace_buffer {
+	int	cpu;			/* ID of CPU that uses this buffer */
+	int	fd;			/* fd of perf output buffer */
+	size_t	page_size;		/* size of each page in buffer */
+	size_t	data_size;		/* total buffer size */
+	void	*base;			/* address of buffer */
+};
+
+static struct dtrace_buffer	*dt_buffers;
+
+/*
+ * File descriptor for the BPF map that holds the buffers for the online CPUs.
+ * The map is a bpf_array indexed by CPU id, and it stores a file descriptor as
+ * value (the fd for the perf_event that represents the CPU buffer).
+ */
+static int			bufmap_fd = -1;
+
+/*
+ * Create the BPF map (bpf_array) between the CPU id and the fd for the
+ * perf_event that owns the buffer for that CPU.  If the fd is 0 for a CPU id,
+ * that CPU is not participating in the tracing session.
+ *
+ * BPF programs must use this definition of the map:
+ *
+ *	struct bpf_map_def SEC("maps") buffer_map = {
+ *		.type = BPF_MAP_TYPE_PERF_EVENT_ARRAY,
+ *		.key_size = sizeof(int),
+ *		.value_size = sizeof(u32),
+ *	};
+ *
+ * The maximum number of entries need not be specified in the BPF program
+ * because the map is created here, rather than by the program loader.
+ */
+static int create_buffer_map(void)
+{
+	union bpf_attr	attr;
+
+	memset(&attr, 0, sizeof(attr));
+
+	attr.map_type = BPF_MAP_TYPE_PERF_EVENT_ARRAY;
+	memcpy(attr.map_name, "buffer_map", 11);
+	attr.key_size = sizeof(u32);
+	attr.value_size = sizeof(u32);
+	attr.max_entries = dt_maxcpuid + 1;	/* CPU ids run 0..dt_maxcpuid */
+	attr.map_flags = 0;
+
+	return syscall(__NR_bpf, BPF_MAP_CREATE, &attr, sizeof(attr));
+}
+
+/*
+ * Store the (key, value) pair in the map referenced by the given fd.
+ */
+static int bpf_map_update_elem(int fd, const void *key, const void *value,
+			       u64 flags)
+{
+	union bpf_attr	attr;
+
+	memset(&attr, 0, sizeof(attr));
+
+	attr.map_fd = fd;
+	attr.key = (u64)(unsigned long)key;
+	attr.value = (u64)(unsigned long)value;
+	attr.flags = flags;
+
+	return syscall(__NR_bpf, BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr));
+}
+
+/*
+ * Provide the fd of pre-created BPF maps that BPF programs refer to.
+ */
+void dt_buffer_fixup_map(struct bpf_map_data *map, int idx)
+{
+	if (!strcmp("buffer_map", map->name))
+		map->fd = bufmap_fd;
+}
+
+/*
+ * Create a perf_event buffer for the given DTrace buffer.  This will create
+ * a perf_event ring_buffer, mmap it, and enable the perf_event that owns the
+ * buffer.
+ */
+static int perf_buffer_open(struct dtrace_buffer *buf)
+{
+	int			pefd;
+	struct perf_event_attr	attr = {};
+
+	/*
+	 * Event configuration for BPF-generated output in perf_event ring
+	 * buffers.
+	 */
+	attr.config = PERF_COUNT_SW_BPF_OUTPUT;
+	attr.type = PERF_TYPE_SOFTWARE;
+	attr.sample_type = PERF_SAMPLE_RAW;
+	attr.sample_period = 1;
+	attr.wakeup_events = 1;
+	pefd = syscall(__NR_perf_event_open, &attr, -1, buf->cpu, -1,
+		       PERF_FLAG_FD_CLOEXEC);
+	if (pefd < 0) {
+		fprintf(stderr, "perf_event_open(cpu %d): %s\n", buf->cpu,
+			strerror(errno));
+		goto fail;
+	}
+
+	buf->fd = pefd;
+	buf->base = mmap(NULL, buf->page_size + buf->data_size,
+			 PROT_READ | PROT_WRITE, MAP_SHARED, buf->fd, 0);
+	if (buf->base == MAP_FAILED) {
+		/* mmap() signals failure with MAP_FAILED, not NULL. */
+		buf->base = NULL;
+		goto fail;
+	}
+
+	if (ioctl(pefd, PERF_EVENT_IOC_ENABLE, 0) < 0) {
+		fprintf(stderr, "PERF_EVENT_IOC_ENABLE(cpu %d): %s\n",
+			buf->cpu, strerror(errno));
+		goto fail;
+	}
+
+	return 0;
+
+fail:
+	if (buf->base) {
+		munmap(buf->base, buf->page_size + buf->data_size);
+		buf->base = NULL;
+	}
+	if (buf->fd) {
+		close(buf->fd);
+		buf->fd = -1;
+	}
+
+	return -1;
+}
+
+/*
+ * Close the given DTrace buffer.  This function disables the perf_event that
+ * owns the buffer, munmaps the memory space, and closes the perf buffer fd.
+ */
+static void perf_buffer_close(struct dtrace_buffer *buf)
+{
+	/*
+	 * If the perf buffer failed to open, there is no need to close it.
+	 */
+	if (buf->fd < 0)
+		return;
+
+	if (ioctl(buf->fd, PERF_EVENT_IOC_DISABLE, 0) < 0)
+		fprintf(stderr, "PERF_EVENT_IOC_DISABLE(cpu %d): %s\n",
+			buf->cpu, strerror(errno));
+
+	munmap(buf->base, buf->page_size + buf->data_size);
+
+	if (close(buf->fd))
+		fprintf(stderr, "perf buffer close(cpu %d): %s\n",
+			buf->cpu, strerror(errno));
+
+	buf->base = NULL;
+	buf->fd = -1;
+}
+
+/*
+ * Initialize the probe data buffers (one per online CPU).  Each buffer will
+ * contain the given number of pages (i.e. total size of each buffer will be
+ * num_pages * getpagesize()).  This function also sets up an event polling
+ * descriptor that monitors all CPU buffers at once.
+ */
+int dt_buffer_init(int num_pages)
+{
+	int	i;
+	int	epoll_fd;
+
+	/* Set up the buffer BPF map. */
+	bufmap_fd = create_buffer_map();
+	if (bufmap_fd < 0)
+		return -EINVAL;
+
+	/* Allocate the per-CPU buffer structs. */
+	dt_buffers = calloc(dt_numcpus, sizeof(struct dtrace_buffer));
+	if (dt_buffers == NULL)
+		return -ENOMEM;
+
+	/* Set up the event polling file descriptor. */
+	epoll_fd = epoll_create1(EPOLL_CLOEXEC);
+	if (epoll_fd < 0) {
+		free(dt_buffers);
+		return -errno;
+	}
+
+	for (i = 0; i < dt_numcpus; i++) {
+		int			cpu = dt_cpuids[i];
+		struct epoll_event	ev;
+		struct dtrace_buffer	*buf = &dt_buffers[i];
+
+		/*
+		 * We allocate a number of pages that is a power of 2, and add
+		 * one extra page as the reader page.
+		 */
+		buf->cpu = cpu;
+		buf->page_size = getpagesize();
+		buf->data_size = num_pages * buf->page_size;
+
+		/* Try to create the perf buffer for this DTrace buffer. */
+		if (perf_buffer_open(buf) == -1)
+			continue;
+
+		/* Store the perf buffer fd in the buffer map. */
+		bpf_map_update_elem(bufmap_fd, &cpu, &buf->fd, 0);
+
+		/* Add the buffer to the event polling descriptor. */
+		ev.events = EPOLLIN;
+		ev.data.ptr = buf;
+		if (epoll_ctl(epoll_fd, EPOLL_CTL_ADD, buf->fd, &ev) == -1) {
+			fprintf(stderr, "EPOLL_CTL_ADD(cpu %d): %s\n",
+				buf->cpu, strerror(errno));
+			continue;
+		}
+	}
+
+	return epoll_fd;
+}
+
+/*
+ * Clean up the buffers.
+ */
+void dt_buffer_exit(int epoll_fd)
+{
+	int	i;
+
+	for (i = 0; i < dt_numcpus; i++)
+		perf_buffer_close(&dt_buffers[i]);
+
+	free(dt_buffers);
+	close(epoll_fd);
+}
+
+/*
+ * Read the data_head offset from the header page of the ring buffer.  The
+ * argument is declared 'volatile' because it references a memory mapped page
+ * that the kernel may be writing to while we access it here.
+ */
+static u64 read_rb_head(volatile struct perf_event_mmap_page *rb_page)
+{
+	u64	head = rb_page->data_head;
+
+	asm volatile("" ::: "memory");
+
+	return head;
+}
+
+/*
+ * Write the data_tail offset in the header page of the ring buffer.  The
+ * argument is declared 'volatile' because it references a memory mapped page
+ * that the kernel may be writing to while we access it here.
+ */
+static void write_rb_tail(volatile struct perf_event_mmap_page *rb_page,
+			  u64 tail)
+{
+	asm volatile("" ::: "memory");
+
+	rb_page->data_tail = tail;
+}
+
+/*
+ * Process and output the probe data at the supplied address.
+ */
+static int output_event(u64 *buf)
+{
+	u8				*data = (u8 *)buf;
+	struct perf_event_header	*hdr;
+	u32				size;
+	u64				probe_id, task;
+	u32				pid, ppid, cpu, euid, egid, tag;
+
+	hdr = (struct perf_event_header *)data;
+	data += sizeof(struct perf_event_header);
+
+	if (hdr->type != PERF_RECORD_SAMPLE)
+		return 1;
+
+	size = *(u32 *)data;
+	data += sizeof(u32);
+
+	/*
+	 * The sample should only take up 48 bytes, but as a result of how the
+	 * BPF program stores the data (filling in a struct that resides on the
+	 * stack, and sending that off using bpf_perf_event_output()), there is
+	 * some internal padding.
+	 */
+	if (size != 52) {
+		printf("Sample size is wrong (%d vs expected %d)\n", size, 52);
+		goto out;
+	}
+
+	probe_id = *(u64 *)&(data[0]);
+	pid = *(u32 *)&(data[8]);
+	ppid = *(u32 *)&(data[12]);
+	cpu = *(u32 *)&(data[16]);
+	euid = *(u32 *)&(data[20]);
+	egid = *(u32 *)&(data[24]);
+	task = *(u64 *)&(data[32]);
+	tag = *(u32 *)&(data[40]);
+
+	if (probe_id != 123)
+		printf("Corrupted data (probe_id = %ld)\n", probe_id);
+	if (tag != 0xdace)
+		printf("Corrupted data (tag = %x)\n", tag);
+
+	printf("CPU-%d: EPID %ld PID %d PPID %d EUID %d EGID %d TASK %08lx\n",
+	       cpu, probe_id, pid, ppid, euid, egid, task);
+
+out:
+	/*
+	 * We processed the perf_event_header, the size, and 'size' bytes of
+	 * probe data.
+	 */
+	return sizeof(struct perf_event_header) + sizeof(u32) + size;
+}
+
+/*
+ * Process the available probe data in the given buffer.
+ */
+static void process_data(struct dtrace_buffer *buf)
+{
+	/* This is volatile because the kernel may be updating the content. */
+	volatile struct perf_event_mmap_page	*rb_page = buf->base;
+	u8					*base = (u8 *)buf->base +
+							buf->page_size;
+	u64					head = read_rb_head(rb_page);
+
+	while (rb_page->data_tail != head) {
+		u64	tail = rb_page->data_tail;
+		u64	*ptr = (u64 *)(base + tail % buf->data_size);
+		int	len;
+
+		/*
+		 * Ensure that the buffer contains enough data for at least one
+		 * sample (header + sample size + sample data).
+		 */
+		if (head - tail < sizeof(struct perf_event_header) +
+				  sizeof(u32) + 48)
+			break;
+
+		if (*ptr)
+			len = output_event(ptr);
+		else
+			len = sizeof(*ptr);
+
+		write_rb_tail(rb_page, tail + len);
+		head = read_rb_head(rb_page);
+	}
+}
+
+/*
+ * Wait for data to become available in any of the buffers.
+ */
+int dt_buffer_poll(int epoll_fd, int timeout)
+{
+	struct epoll_event	events[dt_numcpus];
+	int			i, cnt;
+
+	cnt = epoll_wait(epoll_fd, events, dt_numcpus, timeout);
+	if (cnt < 0)
+		return -errno;
+
+	for (i = 0; i < cnt; i++)
+		process_data((struct dtrace_buffer *)events[i].data.ptr);
+
+	return cnt;
+}
diff --git a/tools/dtrace/dt_utils.c b/tools/dtrace/dt_utils.c
new file mode 100644
index 000000000000..e434a8a4769b
--- /dev/null
+++ b/tools/dtrace/dt_utils.c
@@ -0,0 +1,132 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2019, Oracle and/or its affiliates. All rights reserved.
+ */
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+
+#include "dtrace.h"
+
+#define BUF_SIZE	1024		/* max size for online cpu data */
+
+int	dt_numcpus;			/* number of online CPUs */
+int	dt_maxcpuid;			/* highest CPU id */
+int	*dt_cpuids;			/* list of CPU ids */
+
+/*
+ * Populate the online CPU id information from sysfs data.  We only do this
+ * once because we do not care about CPUs coming online after we started
+ * tracing.  If a CPU goes offline during tracing, we do not care either
+ * because that simply means that it won't be writing any new probe data into
+ * its buffer.
+ */
+void cpu_list_populate(void)
+{
+	char buf[BUF_SIZE];
+	int fd, cnt, start, end, i;
+	int *cpu;
+	char *p, *q;
+
+	fd = open("/sys/devices/system/cpu/online", O_RDONLY);
+	if (fd < 0)
+		goto fail;
+	cnt = read(fd, buf, sizeof(buf) - 1);
+	close(fd);
+	if (cnt <= 0)
+		goto fail;
+
+	/*
+	 * NUL-terminate the data, and strip the trailing newline that should
+	 * always be there.
+	 */
+	buf[cnt] = 0;
+	if (buf[cnt - 1] == '\n')
+		buf[--cnt] = 0;
+
+	/*
+	 * Count how many CPUs we have.
+	 */
+	dt_numcpus = 0;
+	p = buf;
+	do {
+		start = (int)strtol(p, &q, 10);
+		switch (*q) {
+		case '-':		/* range */
+			p = q + 1;
+			end = (int)strtol(p, &q, 10);
+			dt_numcpus += end - start + 1;
+			if (*q == 0) {	/* end of string */
+				p = q;
+				break;
+			}
+			if (*q != ',')
+				goto fail;
+			p = q + 1;
+			break;
+		case 0:			/* end of string */
+			dt_numcpus++;
+			p = q;
+			break;
+		case ',':	/* gap  */
+			dt_numcpus++;
+			p = q + 1;
+			break;
+		}
+	} while (*p != 0);
+
+	dt_cpuids = calloc(dt_numcpus, sizeof(int));
+	if (dt_cpuids == NULL)
+		goto fail;
+	cpu = dt_cpuids;
+
+	/*
+	 * Fill in the CPU ids.
+	 */
+	p = buf;
+	do {
+		start = (int)strtol(p, &q, 10);
+		switch (*q) {
+		case '-':		/* range */
+			p = q + 1;
+			end = (int)strtol(p, &q, 10);
+			for (i = start; i <= end; i++)
+				*cpu++ = i;
+			if (*q == 0) {	/* end of string */
+				p = q;
+				break;
+			}
+			if (*q != ',')
+				goto fail;
+			p = q + 1;
+			break;
+		case 0:			/* end of string */
+			*cpu++ = start;
+			p = q;
+			break;
+		case ',':	/* gap  */
+			*cpu++ = start;
+			p = q + 1;
+			break;
+		}
+	} while (*p != 0);
+
+	/* Record the highest CPU id of the set of online CPUs. */
+	dt_maxcpuid = *(cpu - 1);
+
+	return;
+fail:
+	if (dt_cpuids)
+		free(dt_cpuids);
+
+	dt_numcpus = 0;
+	dt_maxcpuid = 0;
+	dt_cpuids = NULL;
+}
+
+void cpu_list_free(void)
+{
+	free(dt_cpuids);
+	dt_numcpus = 0;
+	dt_maxcpuid = 0;
+	dt_cpuids = NULL;
+}
diff --git a/tools/dtrace/dtrace.c b/tools/dtrace/dtrace.c
new file mode 100644
index 000000000000..6a6af8e3123a
--- /dev/null
+++ b/tools/dtrace/dtrace.c
@@ -0,0 +1,38 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <stdio.h>
+#include <stdlib.h>
+
+#include "dtrace.h"
+
+int main(int argc, char *argv[])
+{
+	int	epoll_fd;
+	int	cnt;
+
+	/* Get the list of online CPUs. */
+	cpu_list_populate();
+
+	/* Initialize buffers. */
+	epoll_fd = dt_buffer_init(32);
+	if (epoll_fd < 0) {
+		perror("dt_buffer_init");
+		exit(1);
+	}
+
+	/* Load the BPF program. */
+	if (argc < 2 || dt_bpf_load_file(argv[1]))
+		goto out;
+
+	printf("BPF loaded from %s...\n", argv[1]);
+
+	/* Process probe data. */
+	do {
+		cnt = dt_buffer_poll(epoll_fd, 100);
+	} while (cnt >= 0);
+
+out:
+	dt_buffer_exit(epoll_fd);
+	cpu_list_free();
+
+	exit(0);
+}
diff --git a/tools/dtrace/dtrace.h b/tools/dtrace/dtrace.h
new file mode 100644
index 000000000000..708b1477d39e
--- /dev/null
+++ b/tools/dtrace/dtrace.h
@@ -0,0 +1,44 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2019, Oracle and/or its affiliates. All rights reserved.
+ */
+#ifndef _DTRACE_H
+#define _DTRACE_H
+
+extern int	dt_numcpus;
+extern int	dt_maxcpuid;
+extern int	*dt_cpuids;
+
+extern void cpu_list_populate(void);
+extern void cpu_list_free(void);
+
+extern int dt_bpf_load_file(char *fn);
+
+extern int dt_buffer_init(int num_pages);
+extern int dt_buffer_poll(int epoll_fd, int timeout);
+extern void dt_buffer_exit(int epoll_fd);
+
+struct bpf_load_map_def {
+	unsigned int type;
+	unsigned int key_size;
+	unsigned int value_size;
+	unsigned int max_entries;
+	unsigned int map_flags;
+	unsigned int inner_map_idx;
+	unsigned int numa_node;
+};
+
+struct bpf_map_data {
+	int fd;
+	char *name;
+	size_t elf_offset;
+	struct bpf_load_map_def def;
+};
+
+typedef void (*fixup_map_cb)(struct bpf_map_data *map, int idx);
+
+extern void dt_buffer_fixup_map(struct bpf_map_data *map, int idx);
+
+extern int load_bpf_file_fixup_map(const char *path, fixup_map_cb fixup_map);
+
+#endif /* _DTRACE_H */
diff --git a/tools/dtrace/probe1_bpf.c b/tools/dtrace/probe1_bpf.c
new file mode 100644
index 000000000000..5b34edb61412
--- /dev/null
+++ b/tools/dtrace/probe1_bpf.c
@@ -0,0 +1,100 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2019, Oracle and/or its affiliates. All rights reserved.
+ *
+ * This sample BPF program was inspired by samples/bpf/tracex5_kern.c:
+ *   Copyright (c) 2015 PLUMgrid, http://plumgrid.com
+ */
+#include <uapi/linux/bpf.h>
+#include <linux/dtrace.h>
+#include <linux/version.h>
+#include <uapi/linux/unistd.h>
+#include "bpf_helpers.h"
+
+struct bpf_map_def SEC("maps") progs = {
+	.type = BPF_MAP_TYPE_PROG_ARRAY,
+	.key_size = sizeof(u32),
+	.value_size = sizeof(u32),
+	.max_entries = 8192,
+};
+
+struct bpf_map_def SEC("maps") probemap = {
+	.type = BPF_MAP_TYPE_HASH,
+	.key_size = sizeof(u32),
+	.value_size = sizeof(struct dtrace_ecb),
+	.max_entries = NR_CPUS,
+};
+
+/*
+ * Here so we have the map specification - it actually gets created by the
+ * userspace component of DTrace, and the loader code simply modifies the
+ * code by inserting the correct fd value.
+ */
+struct bpf_map_def SEC("maps") buffer_map = {
+	.type = BPF_MAP_TYPE_PERF_EVENT_ARRAY,
+	.key_size = sizeof(int),
+	.value_size = sizeof(u32),
+	.max_entries = 2,
+};
+
+struct sample {
+	u64 probe_id;
+	u32 pid;
+	u32 ppid;
+	u32 cpu;
+	u32 euid;
+	u32 egid;
+	u64 task;
+	u32 tag;
+};
+
+#define DPROG(F)	SEC("dtrace/"__stringify(F)) int bpf_func_##F
+
+/* we jump here when syscall number == __NR_write */
+DPROG(__NR_write)(struct dtrace_bpf_context *ctx)
+{
+	int			cpu = bpf_get_smp_processor_id();
+	struct dtrace_ecb	*ecb;
+	struct sample		smpl;
+
+	bpf_finalize_context(ctx, &probemap);
+
+	ecb = bpf_map_lookup_elem(&probemap, &cpu);
+	if (!ecb)
+		return 0;
+
+	memset(&smpl, 0, sizeof(smpl));
+	smpl.probe_id = ecb->probe_id;
+	smpl.pid = ctx->pid;
+	smpl.ppid = ctx->ppid;
+	smpl.cpu = ctx->cpu;
+	smpl.euid = ctx->euid;
+	smpl.egid = ctx->egid;
+	smpl.task = ctx->task;
+	smpl.tag = 0xdace;
+
+	bpf_perf_event_output(ctx, &buffer_map, cpu, &smpl, sizeof(smpl));
+
+	return 0;
+}
+
+SEC("kprobe/sys_write")
+int bpf_prog1(struct pt_regs *ctx)
+{
+	struct dtrace_ecb	ecb;
+	int			cpu = bpf_get_smp_processor_id();
+
+	ecb.id = 1;
+	ecb.probe_id = 123;
+
+	bpf_map_update_elem(&probemap, &cpu, &ecb, BPF_ANY);
+
+	/* dispatch into next BPF program depending on syscall number */
+	bpf_tail_call(ctx, &progs, __NR_write);
+
+	/* fall through -> unknown syscall */
+	return 0;
+}
+
+char _license[] SEC("license") = "GPL";
+u32 _version SEC("version") = LINUX_VERSION_CODE;
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [RFC PATCH 07/11] bpf: implement writable buffers in contexts
  2019-05-20 23:47 [RFC PATCH 00/11] bpf, trace, dtrace: DTrace BPF program type implementation and sample use Kris Van Hees
                   ` (6 preceding siblings ...)
       [not found] ` <features>
@ 2019-05-21 20:39 ` Kris Van Hees
  2019-05-21 20:39 ` [RFC PATCH 08/11] perf: add perf_output_begin_forward_in_page Kris Van Hees
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 54+ messages in thread
From: Kris Van Hees @ 2019-05-21 20:39 UTC (permalink / raw)
  To: netdev, bpf, dtrace-devel, linux-kernel
  Cc: rostedt, mhiramat, acme, ast, daniel

Currently, BPF supports writes to packet data in very specific cases.
That mechanism is more generally useful and can be extended to any
number of writable buffers in a context.  This patch adds two new
register types: PTR_TO_BUFFER and PTR_TO_BUFFER_END, similar to the types
PTR_TO_PACKET and PTR_TO_PACKET_END.  In addition, a field 'buf_id' is
added to the reg_state structure as a way to distinguish between different
buffers in a single context.

Buffers are specified in the context by a pair of members:
- a pointer to the start of the buffer (type PTR_TO_BUFFER)
- a pointer to the first byte beyond the buffer (type PTR_TO_BUFFER_END)

A context can contain multiple buffers.  Each buffer/buffer_end pair is
identified by a unique id (buf_id).  The offset of the start-of-buffer
member within the context is typically used as this identifier.

The semantics for using a writable buffer are the same as for packet data.
The BPF program must contain a range test (buf + num > buf_end) to ensure
that the verifier can verify that offsets are within the allowed range.
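
A minimal sketch of the pattern (the context type and its buf/buf_end
members are hypothetical here; this patch only adds the register types
that such members would map to):

	int prog(struct my_ctx *ctx)
	{
		u8 *buf = ctx->buf;		/* PTR_TO_BUFFER */
		u8 *end = ctx->buf_end;		/* PTR_TO_BUFFER_END */

		if (buf + 8 > end)		/* range test for the verifier */
			return 0;

		*(u64 *)buf = 0;		/* provably within the buffer */
		return 0;
	}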

Whenever a helper is called that might update the content of the context,
all range information for registers that hold pointers to a buffer is
cleared, just as it is done for packet pointers.

Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com>
Reviewed-by: Nick Alcock <nick.alcock@oracle.com>
---
 include/linux/bpf.h          |   3 +
 include/linux/bpf_verifier.h |   4 +-
 kernel/bpf/verifier.c        | 198 ++++++++++++++++++++++++-----------
 3 files changed, 145 insertions(+), 60 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index e4bcb79656c4..fc3eda0192fb 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -275,6 +275,8 @@ enum bpf_reg_type {
 	PTR_TO_TCP_SOCK,	 /* reg points to struct tcp_sock */
 	PTR_TO_TCP_SOCK_OR_NULL, /* reg points to struct tcp_sock or NULL */
 	PTR_TO_TP_BUFFER,	 /* reg points to a writable raw tp's buffer */
+	PTR_TO_BUFFER,		 /* reg points to ctx buffer */
+	PTR_TO_BUFFER_END,	 /* reg points to ctx buffer end */
 };
 
 /* The information passed from prog-specific *_is_valid_access
@@ -283,6 +285,7 @@ enum bpf_reg_type {
 struct bpf_insn_access_aux {
 	enum bpf_reg_type reg_type;
 	int ctx_field_size;
+	u32 buf_id;
 };
 
 static inline void
diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index 1305ccbd8fe6..3538382184f3 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -45,7 +45,7 @@ struct bpf_reg_state {
 	/* Ordering of fields matters.  See states_equal() */
 	enum bpf_reg_type type;
 	union {
-		/* valid when type == PTR_TO_PACKET */
+		/* valid when type == PTR_TO_PACKET | PTR_TO_BUFFER */
 		u16 range;
 
 		/* valid when type == CONST_PTR_TO_MAP | PTR_TO_MAP_VALUE |
@@ -132,6 +132,8 @@ struct bpf_reg_state {
 	 */
 	u32 frameno;
 	enum bpf_reg_liveness live;
+	/* For PTR_TO_BUFFER, to identify distinct buffers in a context. */
+	u32 buf_id;
 };
 
 enum bpf_stack_slot_type {
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index f9e5536fd1af..5fba4e6f5424 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -406,6 +406,8 @@ static const char * const reg_type_str[] = {
 	[PTR_TO_TCP_SOCK]	= "tcp_sock",
 	[PTR_TO_TCP_SOCK_OR_NULL] = "tcp_sock_or_null",
 	[PTR_TO_TP_BUFFER]	= "tp_buffer",
+	[PTR_TO_BUFFER]		= "buf",
+	[PTR_TO_BUFFER_END]	= "buf_end",
 };
 
 static char slot_type_char[] = {
@@ -467,6 +469,9 @@ static void print_verifier_state(struct bpf_verifier_env *env,
 				verbose(env, ",off=%d", reg->off);
 			if (type_is_pkt_pointer(t))
 				verbose(env, ",r=%d", reg->range);
+			else if (t == PTR_TO_BUFFER)
+				verbose(env, ",r=%d,bid=%d", reg->range,
+					reg->buf_id);
 			else if (t == CONST_PTR_TO_MAP ||
 				 t == PTR_TO_MAP_VALUE ||
 				 t == PTR_TO_MAP_VALUE_OR_NULL)
@@ -855,6 +860,12 @@ static bool reg_is_pkt_pointer_any(const struct bpf_reg_state *reg)
 	       reg->type == PTR_TO_PACKET_END;
 }
 
+static bool reg_is_buf_pointer_any(const struct bpf_reg_state *reg)
+{
+	return reg_is_pkt_pointer_any(reg) ||
+	       reg->type == PTR_TO_BUFFER || reg->type == PTR_TO_BUFFER_END;
+}
+
 /* Unmodified PTR_TO_PACKET[_META,_END] register from ctx access. */
 static bool reg_is_init_pkt_pointer(const struct bpf_reg_state *reg,
 				    enum bpf_reg_type which)
@@ -1550,7 +1561,7 @@ static int check_map_access(struct bpf_verifier_env *env, u32 regno,
 	return err;
 }
 
-#define MAX_PACKET_OFF 0xffff
+#define MAX_BUFFER_OFF 0xffff
 
 static bool may_access_direct_pkt_data(struct bpf_verifier_env *env,
 				       const struct bpf_call_arg_meta *meta,
@@ -1585,7 +1596,7 @@ static bool may_access_direct_pkt_data(struct bpf_verifier_env *env,
 	}
 }
 
-static int __check_packet_access(struct bpf_verifier_env *env, u32 regno,
+static int __check_buffer_access(struct bpf_verifier_env *env, u32 regno,
 				 int off, int size, bool zero_size_allowed)
 {
 	struct bpf_reg_state *regs = cur_regs(env);
@@ -1593,14 +1604,15 @@ static int __check_packet_access(struct bpf_verifier_env *env, u32 regno,
 
 	if (off < 0 || size < 0 || (size == 0 && !zero_size_allowed) ||
 	    (u64)off + size > reg->range) {
-		verbose(env, "invalid access to packet, off=%d size=%d, R%d(id=%d,off=%d,r=%d)\n",
-			off, size, regno, reg->id, reg->off, reg->range);
+		verbose(env, "invalid access to %s, off=%d size=%d, R%d(id=%d,off=%d,r=%d)\n",
+			reg_is_pkt_pointer(reg) ? "packet" : "buffer", off,
+			size, regno, reg->id, reg->off, reg->range);
 		return -EACCES;
 	}
 	return 0;
 }
 
-static int check_packet_access(struct bpf_verifier_env *env, u32 regno, int off,
+static int check_buffer_access(struct bpf_verifier_env *env, u32 regno, int off,
 			       int size, bool zero_size_allowed)
 {
 	struct bpf_reg_state *regs = cur_regs(env);
@@ -1620,35 +1632,37 @@ static int check_packet_access(struct bpf_verifier_env *env, u32 regno, int off,
 			regno);
 		return -EACCES;
 	}
-	err = __check_packet_access(env, regno, off, size, zero_size_allowed);
+	err = __check_buffer_access(env, regno, off, size, zero_size_allowed);
 	if (err) {
-		verbose(env, "R%d offset is outside of the packet\n", regno);
+		verbose(env, "R%d offset is outside of the %s\n",
+			regno, reg_is_pkt_pointer(reg) ? "packet" : "buffer");
 		return err;
 	}
 
-	/* __check_packet_access has made sure "off + size - 1" is within u16.
-	 * reg->umax_value can't be bigger than MAX_PACKET_OFF which is 0xffff,
-	 * otherwise find_good_pkt_pointers would have refused to set range info
-	 * that __check_packet_access would have rejected this pkt access.
-	 * Therefore, "off + reg->umax_value + size - 1" won't overflow u32.
-	 */
-	env->prog->aux->max_pkt_offset =
-		max_t(u32, env->prog->aux->max_pkt_offset,
-		      off + reg->umax_value + size - 1);
+	if (reg_is_pkt_pointer(reg)) {
+		/* __check_buffer_access ensures "off + size - 1" is within u16.
+		 * reg->umax_value can't be bigger than MAX_BUFFER_OFF which
+		 * is 0xffff, otherwise find_good_buf_pointers would have
+		 * refused to set range info and __check_buffer_access would
+		 * have rejected this buffer access.
+		 * Therefore, "off + reg->umax_value + size - 1" won't overflow
+		 * u32.
+		 */
+		env->prog->aux->max_pkt_offset =
+			max_t(u32, env->prog->aux->max_pkt_offset,
+			      off + reg->umax_value + size - 1);
+	}
 
 	return err;
 }
 
 /* check access to 'struct bpf_context' fields.  Supports fixed offsets only */
-static int check_ctx_access(struct bpf_verifier_env *env, int insn_idx, int off, int size,
-			    enum bpf_access_type t, enum bpf_reg_type *reg_type)
+static int check_ctx_access(struct bpf_verifier_env *env, int insn_idx,
+			    int off, int size, enum bpf_access_type t,
+			    struct bpf_insn_access_aux *info)
 {
-	struct bpf_insn_access_aux info = {
-		.reg_type = *reg_type,
-	};
-
 	if (env->ops->is_valid_access &&
-	    env->ops->is_valid_access(off, size, t, env->prog, &info)) {
+	    env->ops->is_valid_access(off, size, t, env->prog, info)) {
 		/* A non zero info.ctx_field_size indicates that this field is a
 		 * candidate for later verifier transformation to load the whole
 		 * field and then apply a mask when accessed with a narrower
@@ -1656,9 +1670,7 @@ static int check_ctx_access(struct bpf_verifier_env *env, int insn_idx, int off,
 		 * will only allow for whole field access and rejects any other
 		 * type of narrower access.
 		 */
-		*reg_type = info.reg_type;
-
-		env->insn_aux_data[insn_idx].ctx_field_size = info.ctx_field_size;
+		env->insn_aux_data[insn_idx].ctx_field_size = info->ctx_field_size;
 		/* remember the offset of last byte accessed in ctx */
 		if (env->prog->aux->max_ctx_offset < off + size)
 			env->prog->aux->max_ctx_offset = off + size;
@@ -1870,6 +1882,10 @@ static int check_ptr_alignment(struct bpf_verifier_env *env,
 	case PTR_TO_TCP_SOCK:
 		pointer_desc = "tcp_sock ";
 		break;
+	case PTR_TO_BUFFER:
+		pointer_desc = "buffer ";
+		strict = true;
+		break;
 	default:
 		break;
 	}
@@ -2084,7 +2100,11 @@ static int check_mem_access(struct bpf_verifier_env *env, int insn_idx, u32 regn
 			mark_reg_unknown(env, regs, value_regno);
 
 	} else if (reg->type == PTR_TO_CTX) {
-		enum bpf_reg_type reg_type = SCALAR_VALUE;
+		struct bpf_insn_access_aux info = {
+			.reg_type = SCALAR_VALUE,
+			.buf_id = 0,
+		};
+
 
 		if (t == BPF_WRITE && value_regno >= 0 &&
 		    is_pointer_value(env, value_regno)) {
@@ -2096,21 +2116,22 @@ static int check_mem_access(struct bpf_verifier_env *env, int insn_idx, u32 regn
 		if (err < 0)
 			return err;
 
-		err = check_ctx_access(env, insn_idx, off, size, t, &reg_type);
+		err = check_ctx_access(env, insn_idx, off, size, t, &info);
 		if (!err && t == BPF_READ && value_regno >= 0) {
 			/* ctx access returns either a scalar, or a
 			 * PTR_TO_PACKET[_META,_END]. In the latter
 			 * case, we know the offset is zero.
 			 */
-			if (reg_type == SCALAR_VALUE) {
+			if (info.reg_type == SCALAR_VALUE) {
 				mark_reg_unknown(env, regs, value_regno);
 			} else {
 				mark_reg_known_zero(env, regs,
 						    value_regno);
-				if (reg_type_may_be_null(reg_type))
+				if (reg_type_may_be_null(info.reg_type))
 					regs[value_regno].id = ++env->id_gen;
 			}
-			regs[value_regno].type = reg_type;
+			regs[value_regno].type = info.reg_type;
+			regs[value_regno].buf_id = info.buf_id;
 		}
 
 	} else if (reg->type == PTR_TO_STACK) {
@@ -2141,7 +2162,17 @@ static int check_mem_access(struct bpf_verifier_env *env, int insn_idx, u32 regn
 				value_regno);
 			return -EACCES;
 		}
-		err = check_packet_access(env, regno, off, size, false);
+		err = check_buffer_access(env, regno, off, size, false);
+		if (!err && t == BPF_READ && value_regno >= 0)
+			mark_reg_unknown(env, regs, value_regno);
+	} else if (reg->type == PTR_TO_BUFFER) {
+		if (t == BPF_WRITE && value_regno >= 0 &&
+		    is_pointer_value(env, value_regno)) {
+			verbose(env, "R%d leaks addr into buffer\n",
+				value_regno);
+			return -EACCES;
+		}
+		err = check_buffer_access(env, regno, off, size, false);
 		if (!err && t == BPF_READ && value_regno >= 0)
 			mark_reg_unknown(env, regs, value_regno);
 	} else if (reg->type == PTR_TO_FLOW_KEYS) {
@@ -2382,7 +2413,7 @@ static int check_helper_mem_access(struct bpf_verifier_env *env, int regno,
 	switch (reg->type) {
 	case PTR_TO_PACKET:
 	case PTR_TO_PACKET_META:
-		return check_packet_access(env, regno, reg->off, access_size,
+		return check_buffer_access(env, regno, reg->off, access_size,
 					   zero_size_allowed);
 	case PTR_TO_MAP_VALUE:
 		if (check_map_access_type(env, regno, reg->off, access_size,
@@ -2962,34 +2993,35 @@ static int check_func_proto(const struct bpf_func_proto *fn, int func_id)
 	       check_refcount_ok(fn, func_id) ? 0 : -EINVAL;
 }
 
-/* Packet data might have moved, any old PTR_TO_PACKET[_META,_END]
- * are now invalid, so turn them into unknown SCALAR_VALUE.
+/* Packet or buffer data might have moved, any old PTR_TO_PACKET[_META,_END]
+ * and/or PTR_TO_BUFFER[_END] are now invalid, so turn them into unknown
+ * SCALAR_VALUE.
  */
-static void __clear_all_pkt_pointers(struct bpf_verifier_env *env,
+static void __clear_all_buf_pointers(struct bpf_verifier_env *env,
 				     struct bpf_func_state *state)
 {
 	struct bpf_reg_state *regs = state->regs, *reg;
 	int i;
 
 	for (i = 0; i < MAX_BPF_REG; i++)
-		if (reg_is_pkt_pointer_any(&regs[i]))
+		if (reg_is_buf_pointer_any(&regs[i]))
 			mark_reg_unknown(env, regs, i);
 
 	bpf_for_each_spilled_reg(i, state, reg) {
 		if (!reg)
 			continue;
-		if (reg_is_pkt_pointer_any(reg))
+		if (reg_is_buf_pointer_any(reg))
 			__mark_reg_unknown(reg);
 	}
 }
 
-static void clear_all_pkt_pointers(struct bpf_verifier_env *env)
+static void clear_all_buf_pointers(struct bpf_verifier_env *env)
 {
 	struct bpf_verifier_state *vstate = env->cur_state;
 	int i;
 
 	for (i = 0; i <= vstate->curframe; i++)
-		__clear_all_pkt_pointers(env, vstate->frame[i]);
+		__clear_all_buf_pointers(env, vstate->frame[i]);
 }
 
 static void release_reg_references(struct bpf_verifier_env *env,
@@ -3417,7 +3449,7 @@ static int check_helper_call(struct bpf_verifier_env *env, int func_id, int insn
 	}
 
 	if (changes_data)
-		clear_all_pkt_pointers(env);
+		clear_all_buf_pointers(env);
 	return 0;
 }
 
@@ -4349,7 +4381,7 @@ static int check_alu_op(struct bpf_verifier_env *env, struct bpf_insn *insn)
 	return 0;
 }
 
-static void __find_good_pkt_pointers(struct bpf_func_state *state,
+static void __find_good_buf_pointers(struct bpf_func_state *state,
 				     struct bpf_reg_state *dst_reg,
 				     enum bpf_reg_type type, u16 new_range)
 {
@@ -4358,7 +4390,11 @@ static void __find_good_pkt_pointers(struct bpf_func_state *state,
 
 	for (i = 0; i < MAX_BPF_REG; i++) {
 		reg = &state->regs[i];
-		if (reg->type == type && reg->id == dst_reg->id)
+		if (reg->type != type)
+			continue;
+		if (type == PTR_TO_BUFFER && reg->buf_id != dst_reg->buf_id)
+			continue;
+		if (reg->id == dst_reg->id)
 			/* keep the maximum range already checked */
 			reg->range = max(reg->range, new_range);
 	}
@@ -4366,12 +4402,16 @@ static void __find_good_pkt_pointers(struct bpf_func_state *state,
 	bpf_for_each_spilled_reg(i, state, reg) {
 		if (!reg)
 			continue;
-		if (reg->type == type && reg->id == dst_reg->id)
+		if (reg->type != type)
+			continue;
+		if (type == PTR_TO_BUFFER && reg->buf_id != dst_reg->buf_id)
+			continue;
+		if (reg->id == dst_reg->id)
 			reg->range = max(reg->range, new_range);
 	}
 }
 
-static void find_good_pkt_pointers(struct bpf_verifier_state *vstate,
+static void find_good_buf_pointers(struct bpf_verifier_state *vstate,
 				   struct bpf_reg_state *dst_reg,
 				   enum bpf_reg_type type,
 				   bool range_right_open)
@@ -4384,8 +4424,8 @@ static void find_good_pkt_pointers(struct bpf_verifier_state *vstate,
 		/* This doesn't give us any range */
 		return;
 
-	if (dst_reg->umax_value > MAX_PACKET_OFF ||
-	    dst_reg->umax_value + dst_reg->off > MAX_PACKET_OFF)
+	if (dst_reg->umax_value > MAX_BUFFER_OFF ||
+	    dst_reg->umax_value + dst_reg->off > MAX_BUFFER_OFF)
 		/* Risk of overflow.  For instance, ptr + (1<<63) may be less
 		 * than pkt_end, but that's because it's also less than pkt.
 		 */
@@ -4440,10 +4480,10 @@ static void find_good_pkt_pointers(struct bpf_verifier_state *vstate,
 	/* If our ids match, then we must have the same max_value.  And we
 	 * don't care about the other reg's fixed offset, since if it's too big
 	 * the range won't allow anything.
-	 * dst_reg->off is known < MAX_PACKET_OFF, therefore it fits in a u16.
+	 * dst_reg->off is known < MAX_BUFFER_OFF, therefore it fits in a u16.
 	 */
 	for (i = 0; i <= vstate->curframe; i++)
-		__find_good_pkt_pointers(vstate->frame[i], dst_reg, type,
+		__find_good_buf_pointers(vstate->frame[i], dst_reg, type,
 					 new_range);
 }
 
@@ -4934,7 +4974,7 @@ static void __mark_ptr_or_null_regs(struct bpf_func_state *state, u32 id,
 	}
 }
 
-/* The logic is similar to find_good_pkt_pointers(), both could eventually
+/* The logic is similar to find_good_buf_pointers(), both could eventually
  * be folded together at some point.
  */
 static void mark_ptr_or_null_regs(struct bpf_verifier_state *vstate, u32 regno,
@@ -4977,14 +5017,24 @@ static bool try_match_pkt_pointers(const struct bpf_insn *insn,
 		    (dst_reg->type == PTR_TO_PACKET_META &&
 		     reg_is_init_pkt_pointer(src_reg, PTR_TO_PACKET))) {
 			/* pkt_data' > pkt_end, pkt_meta' > pkt_data */
-			find_good_pkt_pointers(this_branch, dst_reg,
+			find_good_buf_pointers(this_branch, dst_reg,
 					       dst_reg->type, false);
 		} else if ((dst_reg->type == PTR_TO_PACKET_END &&
 			    src_reg->type == PTR_TO_PACKET) ||
 			   (reg_is_init_pkt_pointer(dst_reg, PTR_TO_PACKET) &&
 			    src_reg->type == PTR_TO_PACKET_META)) {
 			/* pkt_end > pkt_data', pkt_data > pkt_meta' */
-			find_good_pkt_pointers(other_branch, src_reg,
+			find_good_buf_pointers(other_branch, src_reg,
+					       src_reg->type, true);
+		} else if (dst_reg->type == PTR_TO_BUFFER &&
+			   src_reg->type == PTR_TO_BUFFER_END) {
+			/* buf' > buf_end */
+			find_good_buf_pointers(this_branch, dst_reg,
+					       dst_reg->type, false);
+		} else if (dst_reg->type == PTR_TO_BUFFER_END &&
+			   src_reg->type == PTR_TO_BUFFER) {
+			/* buf_end > buf' */
+			find_good_buf_pointers(other_branch, src_reg,
 					       src_reg->type, true);
 		} else {
 			return false;
@@ -4996,14 +5046,24 @@ static bool try_match_pkt_pointers(const struct bpf_insn *insn,
 		    (dst_reg->type == PTR_TO_PACKET_META &&
 		     reg_is_init_pkt_pointer(src_reg, PTR_TO_PACKET))) {
 			/* pkt_data' < pkt_end, pkt_meta' < pkt_data */
-			find_good_pkt_pointers(other_branch, dst_reg,
+			find_good_buf_pointers(other_branch, dst_reg,
 					       dst_reg->type, true);
 		} else if ((dst_reg->type == PTR_TO_PACKET_END &&
 			    src_reg->type == PTR_TO_PACKET) ||
 			   (reg_is_init_pkt_pointer(dst_reg, PTR_TO_PACKET) &&
 			    src_reg->type == PTR_TO_PACKET_META)) {
 			/* pkt_end < pkt_data', pkt_data > pkt_meta' */
-			find_good_pkt_pointers(this_branch, src_reg,
+			find_good_buf_pointers(this_branch, src_reg,
+					       src_reg->type, false);
+		} else if (dst_reg->type == PTR_TO_BUFFER &&
+			   src_reg->type == PTR_TO_BUFFER_END) {
+			/* buf' < buf_end */
+			find_good_buf_pointers(other_branch, dst_reg,
+					       dst_reg->type, true);
+		} else if (dst_reg->type == PTR_TO_BUFFER_END &&
+			   src_reg->type == PTR_TO_BUFFER) {
+			/* buf_end < buf' */
+			find_good_buf_pointers(this_branch, src_reg,
 					       src_reg->type, false);
 		} else {
 			return false;
@@ -5015,14 +5075,24 @@ static bool try_match_pkt_pointers(const struct bpf_insn *insn,
 		    (dst_reg->type == PTR_TO_PACKET_META &&
 		     reg_is_init_pkt_pointer(src_reg, PTR_TO_PACKET))) {
 			/* pkt_data' >= pkt_end, pkt_meta' >= pkt_data */
-			find_good_pkt_pointers(this_branch, dst_reg,
+			find_good_buf_pointers(this_branch, dst_reg,
 					       dst_reg->type, true);
 		} else if ((dst_reg->type == PTR_TO_PACKET_END &&
 			    src_reg->type == PTR_TO_PACKET) ||
 			   (reg_is_init_pkt_pointer(dst_reg, PTR_TO_PACKET) &&
 			    src_reg->type == PTR_TO_PACKET_META)) {
 			/* pkt_end >= pkt_data', pkt_data >= pkt_meta' */
-			find_good_pkt_pointers(other_branch, src_reg,
+			find_good_buf_pointers(other_branch, src_reg,
+					       src_reg->type, false);
+		} else if (dst_reg->type == PTR_TO_BUFFER &&
+			   src_reg->type == PTR_TO_BUFFER_END) {
+			/* buf' >= buf_end */
+			find_good_buf_pointers(this_branch, dst_reg,
+					       dst_reg->type, true);
+		} else if (dst_reg->type == PTR_TO_BUFFER_END &&
+			   src_reg->type == PTR_TO_BUFFER) {
+			/* buf_end >= buf' */
+			find_good_buf_pointers(other_branch, src_reg,
 					       src_reg->type, false);
 		} else {
 			return false;
@@ -5034,15 +5104,25 @@ static bool try_match_pkt_pointers(const struct bpf_insn *insn,
 		    (dst_reg->type == PTR_TO_PACKET_META &&
 		     reg_is_init_pkt_pointer(src_reg, PTR_TO_PACKET))) {
 			/* pkt_data' <= pkt_end, pkt_meta' <= pkt_data */
-			find_good_pkt_pointers(other_branch, dst_reg,
+			find_good_buf_pointers(other_branch, dst_reg,
 					       dst_reg->type, false);
 		} else if ((dst_reg->type == PTR_TO_PACKET_END &&
 			    src_reg->type == PTR_TO_PACKET) ||
 			   (reg_is_init_pkt_pointer(dst_reg, PTR_TO_PACKET) &&
 			    src_reg->type == PTR_TO_PACKET_META)) {
 			/* pkt_end <= pkt_data', pkt_data <= pkt_meta' */
-			find_good_pkt_pointers(this_branch, src_reg,
+			find_good_buf_pointers(this_branch, src_reg,
 					       src_reg->type, true);
+		} else if (dst_reg->type == PTR_TO_BUFFER &&
+			   src_reg->type == PTR_TO_BUFFER_END) {
+			/* buf' <= buf_end */
+			find_good_buf_pointers(other_branch, dst_reg,
+					       dst_reg->type, true);
+		} else if (dst_reg->type == PTR_TO_BUFFER_END &&
+			   src_reg->type == PTR_TO_BUFFER) {
+			/* buf_end <= buf' */
+			find_good_buf_pointers(this_branch, src_reg,
+					       src_reg->type, false);
 		} else {
 			return false;
 		}
@@ -7972,7 +8052,7 @@ static int fixup_bpf_calls(struct bpf_verifier_env *env)
 			 */
 			prog->cb_access = 1;
 			env->prog->aux->stack_depth = MAX_BPF_STACK;
-			env->prog->aux->max_pkt_offset = MAX_PACKET_OFF;
+			env->prog->aux->max_pkt_offset = MAX_BUFFER_OFF;
 
 			/* mark bpf_tail_call as different opcode to avoid
 			 * conditional branch in the interpeter for every normal
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [RFC PATCH 08/11] perf: add perf_output_begin_forward_in_page
  2019-05-20 23:47 [RFC PATCH 00/11] bpf, trace, dtrace: DTrace BPF program type implementation and sample use Kris Van Hees
                   ` (7 preceding siblings ...)
  2019-05-21 20:39 ` [RFC PATCH 07/11] bpf: implement writable buffers in contexts Kris Van Hees
@ 2019-05-21 20:39 ` Kris Van Hees
       [not found] ` <the>
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 54+ messages in thread
From: Kris Van Hees @ 2019-05-21 20:39 UTC (permalink / raw)
  To: netdev, bpf, dtrace-devel, linux-kernel
  Cc: rostedt, mhiramat, acme, ast, daniel

Right now, BPF programs can only write to a perf event ring buffer by
constructing a sample (as an arbitrary chunk of memory of a given size),
and calling perf_event_output() to have it written to the ring buffer.

A new implementation of DTrace (based on BPF) avoids constructing the
data sample prior to writing it to the ring buffer.  Instead, it expects
to be able to reserve a block of memory of a given size, write to that
memory region as it sees fit, and then finalize the written data (making
it available for reading from userspace).

This can (in part) be accomplished as follows (see the combined sketch
after the list):
1. reserve buffer space
    Call perf_output_begin_forward_in_page(&handle, event, size) passing
    in a handle to be used for this data output session, an event that
    identifies the output buffer, and the size (in bytes) to set aside.

2. write data
    Perform store operations to the buffer space that was set aside.
    The buffer is a writable buffer in the BPF program context, which
    means that operations like *(u32 *)&buf[offset] = val can be used.

3. finalize the output session
    Call perf_output_end(&handle) to finalize the output and make the
    new data available for reading from userspace by updating the head
    of the ring buffer.
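
Taken together, a minimal sketch of the kernel-side sequence (error
handling elided; 'event', 'probe_id', and 'some_value' are illustrative
placeholders, and the 16-byte size is arbitrary):

	struct perf_output_handle handle;
	u64 *buf;

	/* 1. reserve 16 bytes that will not cross a page boundary */
	if (perf_output_begin_forward_in_page(&handle, event, 16))
		return -ENOSPC;		/* no room in the ring buffer */

	/* 2. store directly into the reserved block */
	buf = handle.addr;
	buf[0] = probe_id;
	buf[1] = some_value;

	/* 3. publish the data by advancing the ring buffer head */
	perf_output_end(&handle);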

The one caveat is that ring buffers may be allocated from non-contiguous
pages in kernel memory.  This means that a reserved block of memory could
be spread across two non-consecutive pages, and accessing the buffer
space using buf[offset] is no longer safe.  Forcing the ring buffer to be
allocated using vmalloc would avoid this problem, but that would impose
a limitation on all perf event output buffers which is not an acceptable
cost.

The solution implemented here adds a flag to the __perf_output_begin()
function that performs the reserving of buffer space.  The new flag
(stay_in_page) indicates whether the requested chunk of memory must be
on a single page.  In this case, the requested size cannot exceed the
page size.  If the request cannot be satisfied within the current page,
the unused portion of the current page is filled with 0s.
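
For example, with 4096-byte pages, a 48-byte request arriving at offset
4072 within a page finds only 24 bytes left; those 24 bytes are filled
with 0s and the record is placed at the start of the next page, so the
head advances by 72 bytes in total.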

A new function perf_output_begin_forward_in_page() is to be used to
commence output that cannot cross page boundaries.

Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com>
Reviewed-by: Nick Alcock <nick.alcock@oracle.com>
---
 include/linux/perf_event.h  |  3 ++
 kernel/events/ring_buffer.c | 65 ++++++++++++++++++++++++++++++++-----
 2 files changed, 59 insertions(+), 9 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 15a82ff0aefe..2b35d1ce61f8 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1291,6 +1291,9 @@ extern int perf_output_begin(struct perf_output_handle *handle,
 extern int perf_output_begin_forward(struct perf_output_handle *handle,
 				    struct perf_event *event,
 				    unsigned int size);
+extern int perf_output_begin_forward_in_page(struct perf_output_handle *handle,
+					     struct perf_event *event,
+					     unsigned int size);
 extern int perf_output_begin_backward(struct perf_output_handle *handle,
 				      struct perf_event *event,
 				      unsigned int size);
diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index 674b35383491..01ba540e3ee0 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -116,9 +116,11 @@ ring_buffer_has_space(unsigned long head, unsigned long tail,
 static __always_inline int
 __perf_output_begin(struct perf_output_handle *handle,
 		    struct perf_event *event, unsigned int size,
-		    bool backward)
+		    bool backward, bool stay_in_page)
 {
 	struct ring_buffer *rb;
+	unsigned int adj_size;
+	unsigned int gap_size;
 	unsigned long tail, offset, head;
 	int have_lost, page_shift;
 	struct {
@@ -144,6 +146,13 @@ __perf_output_begin(struct perf_output_handle *handle,
 		goto out;
 	}
 
+	page_shift = PAGE_SHIFT + page_order(rb);
+
+	if (unlikely(stay_in_page)) {
+		if (size > (1UL << page_shift))
+			goto out;
+	}
+
 	handle->rb    = rb;
 	handle->event = event;
 
@@ -156,13 +165,24 @@ __perf_output_begin(struct perf_output_handle *handle,
 
 	perf_output_get_handle(handle);
 
+	gap_size = 0;
+	adj_size = size;
 	do {
 		tail = READ_ONCE(rb->user_page->data_tail);
 		offset = head = local_read(&rb->head);
+
+		if (unlikely(stay_in_page)) {
+			gap_size = (1UL << page_shift) -
+				   (offset & ((1UL << page_shift) - 1));
+			if (gap_size < size)
+				adj_size = size + gap_size;
+			else
+				gap_size = 0;	/* sample fits; no gap to fill */
+		}
+
 		if (!rb->overwrite) {
 			if (unlikely(!ring_buffer_has_space(head, tail,
 							    perf_data_size(rb),
-							    size, backward)))
+							    adj_size,
+							    backward)))
 				goto fail;
 		}
 
@@ -179,9 +199,9 @@ __perf_output_begin(struct perf_output_handle *handle,
 		 */
 
 		if (!backward)
-			head += size;
+			head += adj_size;
 		else
-			head -= size;
+			head -= adj_size;
 	} while (local_cmpxchg(&rb->head, offset, head) != offset);
 
 	if (backward) {
@@ -189,6 +209,22 @@ __perf_output_begin(struct perf_output_handle *handle,
 		head = (u64)(-head);
 	}
 
+	/*
+	 * If we had to skip over the remainder of the current page because it
+	 * is not large enough to hold the sample and the sample is not allowed
+	 * to cross a page boundary, we need to clear the remainder of the page
+	 * (fill it with 0s so it is clear we skipped it), and adjust the start
+	 * of the sample (offset).
+	 */
+	if (stay_in_page && gap_size > 0) {
+		int page = (offset >> page_shift) & (rb->nr_pages - 1);
+
+		offset &= (1UL << page_shift) - 1;
+		memset(rb->data_pages[page] + offset, 0, gap_size);
+
+		offset = head - size;
+	}
+
 	/*
 	 * We rely on the implied barrier() by local_cmpxchg() to ensure
 	 * none of the data stores below can be lifted up by the compiler.
@@ -197,8 +233,6 @@ __perf_output_begin(struct perf_output_handle *handle,
 	if (unlikely(head - local_read(&rb->wakeup) > rb->watermark))
 		local_add(rb->watermark, &rb->wakeup);
 
-	page_shift = PAGE_SHIFT + page_order(rb);
-
 	handle->page = (offset >> page_shift) & (rb->nr_pages - 1);
 	offset &= (1UL << page_shift) - 1;
 	handle->addr = rb->data_pages[handle->page] + offset;
@@ -233,13 +267,26 @@ __perf_output_begin(struct perf_output_handle *handle,
 int perf_output_begin_forward(struct perf_output_handle *handle,
 			     struct perf_event *event, unsigned int size)
 {
-	return __perf_output_begin(handle, event, size, false);
+	return __perf_output_begin(handle, event, size, false, false);
+}
+
+/*
+ * Prepare the ring buffer for 'size' bytes of output for the given event.
+ * This particular version is used when the event data is not allowed to cross
+ * a page boundary.  This means size cannot be more than PAGE_SIZE.  It also
+ * ensures that any unused portion of a page is filled with zeros.
+ */
+int perf_output_begin_forward_in_page(struct perf_output_handle *handle,
+				      struct perf_event *event,
+				      unsigned int size)
+{
+	return __perf_output_begin(handle, event, size, false, true);
 }
 
 int perf_output_begin_backward(struct perf_output_handle *handle,
 			       struct perf_event *event, unsigned int size)
 {
-	return __perf_output_begin(handle, event, size, true);
+	return __perf_output_begin(handle, event, size, true, false);
 }
 
 int perf_output_begin(struct perf_output_handle *handle,
@@ -247,7 +294,7 @@ int perf_output_begin(struct perf_output_handle *handle,
 {
 
 	return __perf_output_begin(handle, event, size,
-				   unlikely(is_write_backward(event)));
+				   unlikely(is_write_backward(event)), false);
 }
 
 unsigned int perf_output_copy(struct perf_output_handle *handle,
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [RFC PATCH 09/11] bpf: mark helpers explicitly whether they may change
       [not found]   ` <context>
@ 2019-05-21 20:39     ` Kris Van Hees
  0 siblings, 0 replies; 54+ messages in thread
From: Kris Van Hees @ 2019-05-21 20:39 UTC (permalink / raw)
  To: netdev, bpf, dtrace-devel, linux-kernel
  Cc: rostedt, mhiramat, acme, ast, daniel

Some helpers may update the context.  Right now, various network filter
helpers may make changes to the packet data.  The verifier detects this
by passing the helper's function pointer to bpf_helper_changes_pkt_data().

This function resides in net/core/filter.c and needs to be updated for any
helper function that modifies packet data.  To allow for other helpers
(possibly not part of the network filter code) to do the same, this patch
replaces the central list with a per-helper flag (ctx_update) that marks
each individual helper that may change the context data.  This way,
whenever a new helper is added that may change the content of the context,
there is no need to update a hardcoded list of functions.
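
With this in place, a helper declares the behavior in its own
bpf_func_proto.  A sketch for a hypothetical helper (the helper itself
is made up; the flag usage mirrors the conversions below):

	BPF_CALL_2(bpf_my_rewrite, struct sk_buff *, skb, u32, len)
	{
		/* ...code that modifies packet data... */
		return 0;
	}

	static const struct bpf_func_proto bpf_my_rewrite_proto = {
		.func		= bpf_my_rewrite,
		.gpl_only	= false,
		.ctx_update	= true,	/* pkt/buffer pointers are
					 * invalidated after this call */
		.ret_type	= RET_INTEGER,
		.arg1_type	= ARG_PTR_TO_CTX,
		.arg2_type	= ARG_ANYTHING,
	};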

Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com>
Reviewed-by: Nick Alcock <nick.alcock@oracle.com>
---
 include/linux/bpf.h    |  1 +
 include/linux/filter.h |  1 -
 kernel/bpf/core.c      |  5 ----
 kernel/bpf/verifier.c  |  2 +-
 net/core/filter.c      | 59 ++++++++++++++++++------------------------
 5 files changed, 27 insertions(+), 41 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index fc3eda0192fb..9e255d5b1062 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -226,6 +226,7 @@ enum bpf_return_type {
 struct bpf_func_proto {
 	u64 (*func)(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
 	bool gpl_only;
+	bool ctx_update;
 	bool pkt_access;
 	enum bpf_return_type ret_type;
 	enum bpf_arg_type arg1_type;
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 7148bab96943..9dacca7d3ef6 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -811,7 +811,6 @@ u64 __bpf_call_base(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
 
 struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog);
 void bpf_jit_compile(struct bpf_prog *prog);
-bool bpf_helper_changes_pkt_data(void *func);
 
 static inline bool bpf_dump_raw_ok(void)
 {
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 225b1be766b0..8e9accf90c37 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -2112,11 +2112,6 @@ void __weak bpf_jit_compile(struct bpf_prog *prog)
 {
 }
 
-bool __weak bpf_helper_changes_pkt_data(void *func)
-{
-	return false;
-}
-
 /* To execute LD_ABS/LD_IND instructions __bpf_prog_run() may call
  * skb_copy_bits(), so provide a weak definition of it for NET-less config.
  */
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 5fba4e6f5424..90ae04b4d5c7 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -3283,7 +3283,7 @@ static int check_helper_call(struct bpf_verifier_env *env, int func_id, int insn
 	}
 
 	/* With LD_ABS/IND some JITs save/restore skb from r1. */
-	changes_data = bpf_helper_changes_pkt_data(fn->func);
+	changes_data = fn->ctx_update;
 	if (changes_data && fn->arg1_type != ARG_PTR_TO_CTX) {
 		verbose(env, "kernel subsystem misconfigured func %s#%d: r1 != ctx\n",
 			func_id_name(func_id), func_id);
diff --git a/net/core/filter.c b/net/core/filter.c
index 55bfc941d17a..a9e7d3174d36 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -1693,6 +1693,7 @@ BPF_CALL_5(bpf_skb_store_bytes, struct sk_buff *, skb, u32, offset,
 static const struct bpf_func_proto bpf_skb_store_bytes_proto = {
 	.func		= bpf_skb_store_bytes,
 	.gpl_only	= false,
+	.ctx_update	= true,
 	.ret_type	= RET_INTEGER,
 	.arg1_type	= ARG_PTR_TO_CTX,
 	.arg2_type	= ARG_ANYTHING,
@@ -1825,6 +1826,7 @@ BPF_CALL_2(bpf_skb_pull_data, struct sk_buff *, skb, u32, len)
 static const struct bpf_func_proto bpf_skb_pull_data_proto = {
 	.func		= bpf_skb_pull_data,
 	.gpl_only	= false,
+	.ctx_update	= true,
 	.ret_type	= RET_INTEGER,
 	.arg1_type	= ARG_PTR_TO_CTX,
 	.arg2_type	= ARG_ANYTHING,
@@ -1868,6 +1870,7 @@ BPF_CALL_2(sk_skb_pull_data, struct sk_buff *, skb, u32, len)
 static const struct bpf_func_proto sk_skb_pull_data_proto = {
 	.func		= sk_skb_pull_data,
 	.gpl_only	= false,
+	.ctx_update	= true,
 	.ret_type	= RET_INTEGER,
 	.arg1_type	= ARG_PTR_TO_CTX,
 	.arg2_type	= ARG_ANYTHING,
@@ -1909,6 +1912,7 @@ BPF_CALL_5(bpf_l3_csum_replace, struct sk_buff *, skb, u32, offset,
 static const struct bpf_func_proto bpf_l3_csum_replace_proto = {
 	.func		= bpf_l3_csum_replace,
 	.gpl_only	= false,
+	.ctx_update	= true,
 	.ret_type	= RET_INTEGER,
 	.arg1_type	= ARG_PTR_TO_CTX,
 	.arg2_type	= ARG_ANYTHING,
@@ -1962,6 +1966,7 @@ BPF_CALL_5(bpf_l4_csum_replace, struct sk_buff *, skb, u32, offset,
 static const struct bpf_func_proto bpf_l4_csum_replace_proto = {
 	.func		= bpf_l4_csum_replace,
 	.gpl_only	= false,
+	.ctx_update	= true,
 	.ret_type	= RET_INTEGER,
 	.arg1_type	= ARG_PTR_TO_CTX,
 	.arg2_type	= ARG_ANYTHING,
@@ -2145,6 +2150,7 @@ BPF_CALL_3(bpf_clone_redirect, struct sk_buff *, skb, u32, ifindex, u64, flags)
 static const struct bpf_func_proto bpf_clone_redirect_proto = {
 	.func           = bpf_clone_redirect,
 	.gpl_only       = false,
+	.ctx_update	= true,
 	.ret_type       = RET_INTEGER,
 	.arg1_type      = ARG_PTR_TO_CTX,
 	.arg2_type      = ARG_ANYTHING,
@@ -2337,6 +2343,7 @@ BPF_CALL_4(bpf_msg_pull_data, struct sk_msg *, msg, u32, start,
 static const struct bpf_func_proto bpf_msg_pull_data_proto = {
 	.func		= bpf_msg_pull_data,
 	.gpl_only	= false,
+	.ctx_update	= true,
 	.ret_type	= RET_INTEGER,
 	.arg1_type	= ARG_PTR_TO_CTX,
 	.arg2_type	= ARG_ANYTHING,
@@ -2468,6 +2475,7 @@ BPF_CALL_4(bpf_msg_push_data, struct sk_msg *, msg, u32, start,
 static const struct bpf_func_proto bpf_msg_push_data_proto = {
 	.func		= bpf_msg_push_data,
 	.gpl_only	= false,
+	.ctx_update	= true,
 	.ret_type	= RET_INTEGER,
 	.arg1_type	= ARG_PTR_TO_CTX,
 	.arg2_type	= ARG_ANYTHING,
@@ -2636,6 +2644,7 @@ BPF_CALL_4(bpf_msg_pop_data, struct sk_msg *, msg, u32, start,
 static const struct bpf_func_proto bpf_msg_pop_data_proto = {
 	.func		= bpf_msg_pop_data,
 	.gpl_only	= false,
+	.ctx_update	= true,
 	.ret_type	= RET_INTEGER,
 	.arg1_type	= ARG_PTR_TO_CTX,
 	.arg2_type	= ARG_ANYTHING,
@@ -2738,6 +2747,7 @@ BPF_CALL_3(bpf_skb_vlan_push, struct sk_buff *, skb, __be16, vlan_proto,
 static const struct bpf_func_proto bpf_skb_vlan_push_proto = {
 	.func           = bpf_skb_vlan_push,
 	.gpl_only       = false,
+	.ctx_update	= true,
 	.ret_type       = RET_INTEGER,
 	.arg1_type      = ARG_PTR_TO_CTX,
 	.arg2_type      = ARG_ANYTHING,
@@ -2759,6 +2769,7 @@ BPF_CALL_1(bpf_skb_vlan_pop, struct sk_buff *, skb)
 static const struct bpf_func_proto bpf_skb_vlan_pop_proto = {
 	.func           = bpf_skb_vlan_pop,
 	.gpl_only       = false,
+	.ctx_update	= true,
 	.ret_type       = RET_INTEGER,
 	.arg1_type      = ARG_PTR_TO_CTX,
 };
@@ -2962,6 +2973,7 @@ BPF_CALL_3(bpf_skb_change_proto, struct sk_buff *, skb, __be16, proto,
 static const struct bpf_func_proto bpf_skb_change_proto_proto = {
 	.func		= bpf_skb_change_proto,
 	.gpl_only	= false,
+	.ctx_update	= true,
 	.ret_type	= RET_INTEGER,
 	.arg1_type	= ARG_PTR_TO_CTX,
 	.arg2_type	= ARG_ANYTHING,
@@ -3198,6 +3210,7 @@ BPF_CALL_4(bpf_skb_adjust_room, struct sk_buff *, skb, s32, len_diff,
 static const struct bpf_func_proto bpf_skb_adjust_room_proto = {
 	.func		= bpf_skb_adjust_room,
 	.gpl_only	= false,
+	.ctx_update	= true,
 	.ret_type	= RET_INTEGER,
 	.arg1_type	= ARG_PTR_TO_CTX,
 	.arg2_type	= ARG_ANYTHING,
@@ -3285,6 +3298,7 @@ BPF_CALL_3(bpf_skb_change_tail, struct sk_buff *, skb, u32, new_len,
 static const struct bpf_func_proto bpf_skb_change_tail_proto = {
 	.func		= bpf_skb_change_tail,
 	.gpl_only	= false,
+	.ctx_update	= true,
 	.ret_type	= RET_INTEGER,
 	.arg1_type	= ARG_PTR_TO_CTX,
 	.arg2_type	= ARG_ANYTHING,
@@ -3303,6 +3317,7 @@ BPF_CALL_3(sk_skb_change_tail, struct sk_buff *, skb, u32, new_len,
 static const struct bpf_func_proto sk_skb_change_tail_proto = {
 	.func		= sk_skb_change_tail,
 	.gpl_only	= false,
+	.ctx_update	= true,
 	.ret_type	= RET_INTEGER,
 	.arg1_type	= ARG_PTR_TO_CTX,
 	.arg2_type	= ARG_ANYTHING,
@@ -3351,6 +3366,7 @@ BPF_CALL_3(bpf_skb_change_head, struct sk_buff *, skb, u32, head_room,
 static const struct bpf_func_proto bpf_skb_change_head_proto = {
 	.func		= bpf_skb_change_head,
 	.gpl_only	= false,
+	.ctx_update	= true,
 	.ret_type	= RET_INTEGER,
 	.arg1_type	= ARG_PTR_TO_CTX,
 	.arg2_type	= ARG_ANYTHING,
@@ -3369,6 +3385,7 @@ BPF_CALL_3(sk_skb_change_head, struct sk_buff *, skb, u32, head_room,
 static const struct bpf_func_proto sk_skb_change_head_proto = {
 	.func		= sk_skb_change_head,
 	.gpl_only	= false,
+	.ctx_update	= true,
 	.ret_type	= RET_INTEGER,
 	.arg1_type	= ARG_PTR_TO_CTX,
 	.arg2_type	= ARG_ANYTHING,
@@ -3403,6 +3420,7 @@ BPF_CALL_2(bpf_xdp_adjust_head, struct xdp_buff *, xdp, int, offset)
 static const struct bpf_func_proto bpf_xdp_adjust_head_proto = {
 	.func		= bpf_xdp_adjust_head,
 	.gpl_only	= false,
+	.ctx_update	= true,
 	.ret_type	= RET_INTEGER,
 	.arg1_type	= ARG_PTR_TO_CTX,
 	.arg2_type	= ARG_ANYTHING,
@@ -3427,6 +3445,7 @@ BPF_CALL_2(bpf_xdp_adjust_tail, struct xdp_buff *, xdp, int, offset)
 static const struct bpf_func_proto bpf_xdp_adjust_tail_proto = {
 	.func		= bpf_xdp_adjust_tail,
 	.gpl_only	= false,
+	.ctx_update	= true,
 	.ret_type	= RET_INTEGER,
 	.arg1_type	= ARG_PTR_TO_CTX,
 	.arg2_type	= ARG_ANYTHING,
@@ -3455,6 +3474,7 @@ BPF_CALL_2(bpf_xdp_adjust_meta, struct xdp_buff *, xdp, int, offset)
 static const struct bpf_func_proto bpf_xdp_adjust_meta_proto = {
 	.func		= bpf_xdp_adjust_meta,
 	.gpl_only	= false,
+	.ctx_update	= true,
 	.ret_type	= RET_INTEGER,
 	.arg1_type	= ARG_PTR_TO_CTX,
 	.arg2_type	= ARG_ANYTHING,
@@ -4987,6 +5007,7 @@ BPF_CALL_4(bpf_lwt_xmit_push_encap, struct sk_buff *, skb, u32, type,
 static const struct bpf_func_proto bpf_lwt_in_push_encap_proto = {
 	.func		= bpf_lwt_in_push_encap,
 	.gpl_only	= false,
+	.ctx_update	= true,
 	.ret_type	= RET_INTEGER,
 	.arg1_type	= ARG_PTR_TO_CTX,
 	.arg2_type	= ARG_ANYTHING,
@@ -4997,6 +5018,7 @@ static const struct bpf_func_proto bpf_lwt_in_push_encap_proto = {
 static const struct bpf_func_proto bpf_lwt_xmit_push_encap_proto = {
 	.func		= bpf_lwt_xmit_push_encap,
 	.gpl_only	= false,
+	.ctx_update	= true,
 	.ret_type	= RET_INTEGER,
 	.arg1_type	= ARG_PTR_TO_CTX,
 	.arg2_type	= ARG_ANYTHING,
@@ -5040,6 +5062,7 @@ BPF_CALL_4(bpf_lwt_seg6_store_bytes, struct sk_buff *, skb, u32, offset,
 static const struct bpf_func_proto bpf_lwt_seg6_store_bytes_proto = {
 	.func		= bpf_lwt_seg6_store_bytes,
 	.gpl_only	= false,
+	.ctx_update	= true,
 	.ret_type	= RET_INTEGER,
 	.arg1_type	= ARG_PTR_TO_CTX,
 	.arg2_type	= ARG_ANYTHING,
@@ -5128,6 +5151,7 @@ BPF_CALL_4(bpf_lwt_seg6_action, struct sk_buff *, skb,
 static const struct bpf_func_proto bpf_lwt_seg6_action_proto = {
 	.func		= bpf_lwt_seg6_action,
 	.gpl_only	= false,
+	.ctx_update	= true,
 	.ret_type	= RET_INTEGER,
 	.arg1_type	= ARG_PTR_TO_CTX,
 	.arg2_type	= ARG_ANYTHING,
@@ -5188,6 +5212,7 @@ BPF_CALL_3(bpf_lwt_seg6_adjust_srh, struct sk_buff *, skb, u32, offset,
 static const struct bpf_func_proto bpf_lwt_seg6_adjust_srh_proto = {
 	.func		= bpf_lwt_seg6_adjust_srh,
 	.gpl_only	= false,
+	.ctx_update	= true,
 	.ret_type	= RET_INTEGER,
 	.arg1_type	= ARG_PTR_TO_CTX,
 	.arg2_type	= ARG_ANYTHING,
@@ -5756,40 +5781,6 @@ static const struct bpf_func_proto bpf_tcp_check_syncookie_proto = {
 
 #endif /* CONFIG_INET */
 
-bool bpf_helper_changes_pkt_data(void *func)
-{
-	if (func == bpf_skb_vlan_push ||
-	    func == bpf_skb_vlan_pop ||
-	    func == bpf_skb_store_bytes ||
-	    func == bpf_skb_change_proto ||
-	    func == bpf_skb_change_head ||
-	    func == sk_skb_change_head ||
-	    func == bpf_skb_change_tail ||
-	    func == sk_skb_change_tail ||
-	    func == bpf_skb_adjust_room ||
-	    func == bpf_skb_pull_data ||
-	    func == sk_skb_pull_data ||
-	    func == bpf_clone_redirect ||
-	    func == bpf_l3_csum_replace ||
-	    func == bpf_l4_csum_replace ||
-	    func == bpf_xdp_adjust_head ||
-	    func == bpf_xdp_adjust_meta ||
-	    func == bpf_msg_pull_data ||
-	    func == bpf_msg_push_data ||
-	    func == bpf_msg_pop_data ||
-	    func == bpf_xdp_adjust_tail ||
-#if IS_ENABLED(CONFIG_IPV6_SEG6_BPF)
-	    func == bpf_lwt_seg6_store_bytes ||
-	    func == bpf_lwt_seg6_adjust_srh ||
-	    func == bpf_lwt_seg6_action ||
-#endif
-	    func == bpf_lwt_in_push_encap ||
-	    func == bpf_lwt_xmit_push_encap)
-		return true;
-
-	return false;
-}
-
 static const struct bpf_func_proto *
 bpf_base_func_proto(enum bpf_func_id func_id)
 {
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [RFC PATCH 10/11] bpf: add bpf_buffer_reserve and bpf_buffer_commit
       [not found] ` <helpers>
@ 2019-05-21 20:39   ` Kris Van Hees
  0 siblings, 0 replies; 54+ messages in thread
From: Kris Van Hees @ 2019-05-21 20:39 UTC (permalink / raw)
  To: netdev, bpf, dtrace-devel, linux-kernel
  Cc: rostedt, mhiramat, acme, ast, daniel

Add two helpers that are primarily used in combination with the
writable-buffer support.  The bpf_buffer_reserve() helper sets aside
a chunk of buffer space that can be written to, and once all data
has been written, the bpf_buffer_commit() helper is used to make the
data in the ring buffer visible to userspace.
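
For illustration, the intended calling sequence from a BPF program looks
roughly as follows (a sketch only: BUF_ID, probe_id, and the 48-byte
record size are placeholders, and the complete proof-of-concept program
follows in the next patch):

	err = bpf_buffer_reserve(ctx, BUF_ID, &buffer_map, 48);
	if (err < 0)
		return -1;

	buf = ctx->buf;
	buf_end = ctx->buf_end;
	if (buf + 48 > buf_end)		/* bounds check for the verifier */
		return -1;

	*(u32 *)&buf[0] = probe_id;	/* direct stores into the buffer */
	/* ... */

	bpf_buffer_commit(ctx, BUF_ID, &buffer_map);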

Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com>
Reviewed-by: Nick Alcock <nick.alcock@oracle.com>
---
 include/uapi/linux/bpf.h                  | 39 ++++++++++++++++++++++-
 kernel/bpf/verifier.c                     |  6 +++-
 tools/include/uapi/linux/bpf.h            | 39 ++++++++++++++++++++++-
 tools/testing/selftests/bpf/bpf_helpers.h |  4 +++
 4 files changed, 85 insertions(+), 3 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 7bcb707539d1..2b7772aa00b6 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2681,6 +2681,41 @@ union bpf_attr {
  *		the implementing program type.
  *	Return
  *		0 on success, or a negative error in case of failure.
+ *
+ * int bpf_buffer_reserve(void *ctx, int id, struct bpf_map *map, int size)
+ *	Description
+ *		Reserve *size* bytes in the output buffer for the special
+ *		BPF perf event referenced by *map*, a BPF map of type
+ *		**BPF_MAP_TYPE_PERF_EVENT_ARRAY**. The perf event must have
+ *		the attributes: **PERF_SAMPLE_RAW** as **sample_type**,
+ *		**PERF_TYPE_SOFTWARE** as **type**, and
+ *		**PERF_COUNT_SW_BPF_OUTPUT** as **config**.  The reserved space
+ *		will be available as the writable buffer identified with
+ *		numeric ID **id** in the context.
+ *
+ *		The number of reserved bytes cannot exceed the page size.
+ *		The chunk of buffer space will be reserved within a single
+ *		page, and if this results in unused space at the end of the
+ *		previous page in the ring-buffer, that unused space will be
+ *		filled with zeros.
+ *	Return
+ *		0 on success, or a negative error in case of failure.
+ *
+ * int bpf_buffer_commit(void *ctx, int id, struct bpf_map *map)
+ *	Description
+ *		Finalize the previously reserved space in the output buffer
+ *		for the special BPF perf event referenced by *map*, a BPF map
+ *		of type **BPF_MAP_TYPE_PERF_EVENT_ARRAY**. The perf event must
+ *		have the attributes: **PERF_SAMPLE_RAW** as **sample_type**,
+ *		**PERF_TYPE_SOFTWARE** as **type**, and
+ *		**PERF_COUNT_SW_BPF_OUTPUT** as **config**.
+ *
+ *		The writable buffer identified with numeric ID **id** in the
+ *		context will be invalidated, and can no longer be written
+ *		to until a new **bpf_buffer_reserve**\ () has been
+ *		invoked.
+ *	Return
+ *		0 on success, or a negative error in case of failure.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -2792,7 +2827,9 @@ union bpf_attr {
 	FN(strtoul),			\
 	FN(sk_storage_get),		\
 	FN(sk_storage_delete),		\
-	FN(finalize_context),
+	FN(finalize_context),		\
+	FN(buffer_reserve),		\
+	FN(buffer_commit),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 90ae04b4d5c7..ff73ed743a58 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -2763,7 +2763,9 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
 	case BPF_MAP_TYPE_PERF_EVENT_ARRAY:
 		if (func_id != BPF_FUNC_perf_event_read &&
 		    func_id != BPF_FUNC_perf_event_output &&
-		    func_id != BPF_FUNC_perf_event_read_value)
+		    func_id != BPF_FUNC_perf_event_read_value &&
+		    func_id != BPF_FUNC_buffer_reserve &&
+		    func_id != BPF_FUNC_buffer_commit)
 			goto error;
 		break;
 	case BPF_MAP_TYPE_STACK_TRACE:
@@ -2848,6 +2850,8 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
 	case BPF_FUNC_perf_event_read:
 	case BPF_FUNC_perf_event_output:
 	case BPF_FUNC_perf_event_read_value:
+	case BPF_FUNC_buffer_reserve:
+	case BPF_FUNC_buffer_commit:
 		if (map->map_type != BPF_MAP_TYPE_PERF_EVENT_ARRAY)
 			goto error;
 		break;
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 7bcb707539d1..2b7772aa00b6 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -2681,6 +2681,41 @@ union bpf_attr {
  *		the implementing program type.
  *	Return
  *		0 on success, or a negative error in case of failure.
+ *
+ * int bpf_buffer_reserve(void *ctx, int id, struct bpf_map *map, int size)
+ *	Description
+ *		Reserve *size* bytes in the output buffer for the special
+ *		BPF perf event referenced by *map*, a BPF map of type
+ *		**BPF_MAP_TYPE_PERF_EVENT_ARRAY**. The perf event must have
+ *		the attributes: **PERF_SAMPLE_RAW** as **sample_type**,
+ *		**PERF_TYPE_SOFTWARE** as **type**, and
+ *		**PERF_COUNT_SW_BPF_OUTPUT** as **config**.  The reserved space
+ *		will be available as the writable buffer identified with
+ *		numeric ID **id** in the context.
+ *
+ *		The number of reserved bytes cannot exceed the page size.
+ *		The chunk of buffer space will be reserved within a single
+ *		page, and if this results in unused space at the end of the
+ *		previous page in the ring-buffer, that unused space will be
+ *		filled with zeros.
+ *	Return
+ *		0 on success, or a negative error in case of failure.
+ *
+ * int bpf_buffer_commit(void *ctx, int id, struct bpf_map *map)
+ *	Description
+ *		Finalize the previously reserved space in the output buffer
+ *		for the special BPF perf event referenced by *map*, a BPF map
+ *		of type **BPF_MAP_TYPE_PERF_EVENT_ARRAY**. The perf event must
+ *		have the attributes: **PERF_SAMPLE_RAW** as **sample_type**,
+ *		**PERF_TYPE_SOFTWARE** as **type**, and
+ *		**PERF_COUNT_SW_BPF_OUTPUT** as **config**.
+ *
+ *		The writable buffer identified with numeric ID **id** in the
+ *		context will be invalidated, and can no longer be written
+ *		to until a new **bpf_buffer_reserve**\ () has been
+ *		invoked.
+ *	Return
+ *		0 on success, or a negative error in case of failure.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -2792,7 +2827,9 @@ union bpf_attr {
 	FN(strtoul),			\
 	FN(sk_storage_get),		\
 	FN(sk_storage_delete),		\
-	FN(finalize_context),
+	FN(finalize_context),		\
+	FN(buffer_reserve),		\
+	FN(buffer_commit),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/tools/testing/selftests/bpf/bpf_helpers.h b/tools/testing/selftests/bpf/bpf_helpers.h
index d98a62b3b56c..72af8157d4db 100644
--- a/tools/testing/selftests/bpf/bpf_helpers.h
+++ b/tools/testing/selftests/bpf/bpf_helpers.h
@@ -218,6 +218,10 @@ static int (*bpf_sk_storage_delete)(void *map, struct bpf_sock *sk) =
 	(void *)BPF_FUNC_sk_storage_delete;
 static int (*bpf_finalize_context)(void *ctx, void *map) =
 	(void *) BPF_FUNC_finalize_context;
+static int (*bpf_buffer_reserve)(void *ctx, int id, void *map, int size) =
+	(void *) BPF_FUNC_buffer_reserve;
+static int (*bpf_buffer_commit)(void *ctx, int id, void *map) =
+	(void *) BPF_FUNC_buffer_commit;
 
 /* llvm builtin functions that eBPF C program may use to
  * emit BPF_LD_ABS and BPF_LD_IND instructions
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [RFC PATCH 11/11] dtrace: make use of writable buffers in BPF
  2019-05-20 23:47 [RFC PATCH 00/11] bpf, trace, dtrace: DTrace BPF program type implementation and sample use Kris Van Hees
                   ` (10 preceding siblings ...)
       [not found] ` <helpers>
@ 2019-05-21 20:40 ` Kris Van Hees
  2019-05-21 20:48 ` [RFC PATCH 00/11] bpf, trace, dtrace: DTrace BPF program type implementation and sample use Kris Van Hees
  12 siblings, 0 replies; 54+ messages in thread
From: Kris Van Hees @ 2019-05-21 20:40 UTC (permalink / raw)
  To: netdev, bpf, dtrace-devel, linux-kernel
  Cc: rostedt, mhiramat, acme, ast, daniel

This commit modifies the tiny proof-of-concept DTrace utility to use
the writable-buffer support in BPF along with the new helpers for
buffer reservation and commit.  The dtrace_finalize_context() helper
is updated and is now marked with ctx_update because it sets the
buffer pointer to NULL (and size 0).
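
For reference, the fixed 48-byte record written by probe1_bpf.c and read
back by dt_buffer.c is laid out as follows (the gaps at offsets 28 and 44
are alignment padding):

	offset  0: u32 probe_id
	offset  4: u32 flags
	offset  8: u32 pid
	offset 12: u32 ppid
	offset 16: u32 cpu
	offset 20: u32 euid
	offset 24: u32 egid
	offset 32: u64 task
	offset 40: u32 tag (0xdace)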

Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com>
Reviewed-by: Nick Alcock <nick.alcock@oracle.com>
---
 include/uapi/linux/dtrace.h |   4 +
 kernel/trace/dtrace/bpf.c   | 150 ++++++++++++++++++++++++++++++++++++
 tools/dtrace/dt_buffer.c    |  54 +++++--------
 tools/dtrace/probe1_bpf.c   |  47 ++++++-----
 4 files changed, 198 insertions(+), 57 deletions(-)

diff --git a/include/uapi/linux/dtrace.h b/include/uapi/linux/dtrace.h
index bbe2562c11f2..3fcc075a429f 100644
--- a/include/uapi/linux/dtrace.h
+++ b/include/uapi/linux/dtrace.h
@@ -33,6 +33,10 @@ struct dtrace_bpf_context {
 	u32 gid;	/* from_kgid(&init_user_ns, current_real_cred()->gid */
 	u32 euid;	/* from_kuid(&init_user_ns, current_real_cred()->euid */
 	u32 egid;	/* from_kgid(&init_user_ns, current_real_cred()->egid */
+
+	/* General output buffer */
+	__bpf_md_ptr(u8 *, buf);
+	__bpf_md_ptr(u8 *, buf_end);
 };
 
 /*
diff --git a/kernel/trace/dtrace/bpf.c b/kernel/trace/dtrace/bpf.c
index 95f4103d749e..93bd2f0319cc 100644
--- a/kernel/trace/dtrace/bpf.c
+++ b/kernel/trace/dtrace/bpf.c
@@ -7,6 +7,7 @@
 #include <linux/filter.h>
 #include <linux/ptrace.h>
 #include <linux/sched.h>
+#include <linux/perf_event.h>
 
 /*
  * Actual kernel definition of the DTrace BPF context.
@@ -16,6 +17,9 @@ struct dtrace_bpf_ctx {
 	u32				ecb_id;
 	u32				probe_id;
 	struct task_struct		*task;
+	struct perf_output_handle	handle;
+	u64				buf_len;
+	u8				*buf;
 };
 
 /*
@@ -55,6 +59,8 @@ BPF_CALL_2(dtrace_finalize_context, struct dtrace_bpf_ctx *, ctx,
 
 	ctx->ecb_id = ecb->id;
 	ctx->probe_id = ecb->probe_id;
+	ctx->buf_len = 0;
+	ctx->buf = NULL;
 
 	return 0;
 }
@@ -62,17 +68,119 @@ BPF_CALL_2(dtrace_finalize_context, struct dtrace_bpf_ctx *, ctx,
 static const struct bpf_func_proto dtrace_finalize_context_proto = {
 	.func           = dtrace_finalize_context,
 	.gpl_only       = false,
+	.ctx_update	= true,
 	.ret_type       = RET_INTEGER,
 	.arg1_type      = ARG_PTR_TO_CTX,		/* ctx */
 	.arg2_type      = ARG_CONST_MAP_PTR,		/* map */
 };
 
+BPF_CALL_4(dtrace_buffer_reserve, struct dtrace_bpf_ctx *, ctx,
+				  int, id, struct bpf_map *, map, int, size)
+{
+	struct bpf_array	*arr = container_of(map, struct bpf_array, map);
+	int			cpu = smp_processor_id();
+	struct bpf_event_entry	*ee;
+	struct perf_event	*ev;
+	int			err;
+
+	/*
+	 * Make sure the writable-buffer id is valid.  We use the default which
+	 * is the offset of the start-of-buffer pointer in the public context.
+	 */
+	if (id != offsetof(struct dtrace_bpf_context, buf))
+		return -EINVAL;
+
+	/*
+	 * Verify whether we have an uncommitted reserve.  If so, we deny this
+	 * request.
+	 */
+	if (ctx->handle.rb)
+		return -EBUSY;
+
+	/*
+	 * Perform sanity checks.
+	 */
+	if (cpu >= arr->map.max_entries)
+		return -E2BIG;
+	ee = READ_ONCE(arr->ptrs[cpu]);
+	if (!ee)
+		return -ENOENT;
+	ev = ee->event;
+	if (unlikely(ev->attr.type != PERF_TYPE_SOFTWARE ||
+		     ev->attr.config != PERF_COUNT_SW_BPF_OUTPUT))
+		return -EINVAL;
+	if (unlikely(ev->oncpu != cpu))
+		return -EOPNOTSUPP;
+
+	size = round_up(size, sizeof(u64));
+
+	err = perf_output_begin_forward_in_page(&ctx->handle, ev, size);
+	if (err < 0)
+		return err;
+
+	ctx->buf_len = size;
+	ctx->buf = ctx->handle.addr;
+
+	return 0;
+}
+
+static const struct bpf_func_proto dtrace_buffer_reserve_proto = {
+	.func           = dtrace_buffer_reserve,
+	.gpl_only       = false,
+	.ctx_update	= true,
+	.ret_type       = RET_INTEGER,
+	.arg1_type      = ARG_PTR_TO_CTX,		/* ctx */
+	.arg2_type      = ARG_ANYTHING,			/* id */
+	.arg3_type      = ARG_CONST_MAP_PTR,		/* map */
+	.arg4_type      = ARG_ANYTHING,			/* size */
+};
+
+BPF_CALL_3(dtrace_buffer_commit, struct dtrace_bpf_ctx *, ctx,
+				 int, id, struct bpf_map *, map)
+{
+	/*
+	 * Make sure the writable-buffer id is valid.  We use the default which
+	 * is the offset of the start-of-buffer pointer in the public context.
+	 */
+	if (id != offsetof(struct dtrace_bpf_context, buf))
+		return -EINVAL;
+
+	/*
+	 * Verify that we have an uncommitted reserve.  If not, there is really
+	 * nothing to be done here.
+	 */
+	if (!ctx->handle.rb)
+		return 0;
+
+	perf_output_end(&ctx->handle);
+
+	ctx->handle.rb = NULL;
+	ctx->buf_len = 0;
+	ctx->buf = NULL;
+
+	return 0;
+}
+
+static const struct bpf_func_proto dtrace_buffer_commit_proto = {
+	.func           = dtrace_buffer_commit,
+	.gpl_only       = false,
+	.ctx_update	= true,
+	.ret_type       = RET_INTEGER,
+	.arg1_type      = ARG_PTR_TO_CTX,		/* ctx */
+	.arg2_type      = ARG_ANYTHING,			/* id */
+	.arg3_type      = ARG_CONST_MAP_PTR,		/* map */
+};
+
 static const struct bpf_func_proto *
 dtrace_get_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 {
 	switch (func_id) {
 	case BPF_FUNC_finalize_context:
 		return &dtrace_finalize_context_proto;
+	case BPF_FUNC_buffer_reserve:
+		return &dtrace_buffer_reserve_proto;
+	case BPF_FUNC_buffer_commit:
+		return &dtrace_buffer_commit_proto;
 	case BPF_FUNC_perf_event_output:
 		return bpf_get_perf_event_output_proto();
 	case BPF_FUNC_trace_printk:
@@ -131,6 +239,22 @@ static bool dtrace_is_valid_access(int off, int size, enum bpf_access_type type,
 		if (bpf_ctx_narrow_access_ok(off, size, sizeof(u32)))
 			return true;
 		break;
+	case bpf_ctx_range(struct dtrace_bpf_context, buf):
+		info->reg_type = PTR_TO_BUFFER;
+		info->buf_id = offsetof(struct dtrace_bpf_context, buf);
+
+		bpf_ctx_record_field_size(info, sizeof(u64));
+		if (bpf_ctx_narrow_access_ok(off, size, sizeof(u64)))
+			return true;
+		break;
+	case bpf_ctx_range(struct dtrace_bpf_context, buf_end):
+		info->reg_type = PTR_TO_BUFFER_END;
+		info->buf_id = offsetof(struct dtrace_bpf_context, buf);
+
+		bpf_ctx_record_field_size(info, sizeof(u64));
+		if (bpf_ctx_narrow_access_ok(off, size, sizeof(u64)))
+			return true;
+		break;
 	default:
 		if (size == sizeof(unsigned long))
 			return true;
@@ -152,6 +276,10 @@ static bool dtrace_is_valid_access(int off, int size, enum bpf_access_type type,
  *	si->dst_reg = ((type *)si->src_reg)->member
  *	target_size = sizeof(((type *)si->src_reg)->member)
  *
+ *  BPF_LDX_CTX_FIELD_DST(type, member, dst, si, target_size)
+ *	dst = ((type *)si->src_reg)->member
+ *	target_size = sizeof(((type *)si->src_reg)->member)
+ *
  *  BPF_LDX_LNK_FIELD(type, member, si, target_size)
  *	si->dst_reg = ((type *)si->dst_reg)->member
  *	target_size = sizeof(((type *)si->dst_reg)->member)
@@ -172,6 +300,13 @@ static bool dtrace_is_valid_access(int off, int size, enum bpf_access_type type,
 			*(target_size) = FIELD_SIZEOF(type, member); \
 			offsetof(type, member); \
 		    }))
+#define BPF_LDX_CTX_FIELD_DST(type, member, dst, si, target_size) \
+	BPF_LDX_MEM(BPF_FIELD_SIZEOF(type, member), \
+		    (dst), (si)->src_reg, \
+		    ({ \
+			*(target_size) = FIELD_SIZEOF(type, member); \
+			offsetof(type, member); \
+		    }))
 #define BPF_LDX_LNK_FIELD(type, member, si, target_size) \
 	BPF_LDX_MEM(BPF_FIELD_SIZEOF(type, member), \
 		    (si)->dst_reg, (si)->dst_reg, \
@@ -261,6 +396,18 @@ static u32 dtrace_convert_ctx_access(enum bpf_access_type type,
 		*insn++ = BPF_LDX_LNK_PTR(struct task_struct, cred, si);
 		*insn++ = BPF_LDX_LNK_FIELD(struct cred, egid, si, target_size);
 		break;
+	case offsetof(struct dtrace_bpf_context, buf):
+		*insn++ = BPF_LDX_CTX_FIELD(struct dtrace_bpf_ctx, buf, si,
+					    target_size);
+		break;
+	case offsetof(struct dtrace_bpf_context, buf_end):
+		/* buf_end = ctx->buf + ctx->buf_len */
+		*insn++ = BPF_LDX_CTX_FIELD(struct dtrace_bpf_ctx, buf, si,
+					    target_size);
+		*insn++ = BPF_LDX_CTX_FIELD_DST(struct dtrace_bpf_ctx, buf_len,
+						BPF_REG_AX, si, target_size);
+		*insn++ = BPF_ALU64_REG(BPF_ADD, si->dst_reg, BPF_REG_AX);
+		break;
 	default:
 		*insn++ = BPF_LDX_CTX_PTR(struct dtrace_bpf_ctx, regs, si);
 		*insn++ = BPF_LDX_MEM(BPF_SIZEOF(long), si->dst_reg, si->dst_reg,
@@ -308,6 +455,9 @@ static void *dtrace_convert_ctx(enum bpf_prog_type stype, void *ctx)
 		gctx = this_cpu_ptr(&dtrace_ctx);
 		gctx->regs = (struct pt_regs *)ctx;
 		gctx->task = current;
+		gctx->handle.rb = NULL;
+		gctx->buf_len = 0;
+		gctx->buf = NULL;
 
 		return gctx;
 	}
diff --git a/tools/dtrace/dt_buffer.c b/tools/dtrace/dt_buffer.c
index 65c107ca8ac4..28fac9036d69 100644
--- a/tools/dtrace/dt_buffer.c
+++ b/tools/dtrace/dt_buffer.c
@@ -282,33 +282,27 @@ static void write_rb_tail(volatile struct perf_event_mmap_page *rb_page,
  */
 static int output_event(u64 *buf)
 {
-	u8				*data = (u8 *)buf;
-	struct perf_event_header	*hdr;
-	u32				size;
-	u64				probe_id, task;
-	u32				pid, ppid, cpu, euid, egid, tag;
+	u8	*data = (u8 *)buf;
+	u32	probe_id;
+	u32	flags;
+	u64	task;
+	u32	pid, ppid, cpu, euid, egid, tag;
 
-	hdr = (struct perf_event_header *)data;
-	data += sizeof(struct perf_event_header);
+	probe_id = *(u32 *)&(data[0]);
 
-	if (hdr->type != PERF_RECORD_SAMPLE)
-		return 1;
+	if (probe_id == PERF_RECORD_LOST) {
+		u16	size;
+		u64	lost;
 
-	size = *(u32 *)data;
-	data += sizeof(u32);
+		size = *(u16 *)&(data[6]);
+		lost = *(u64 *)&(data[16]);
 
-	/*
-	 * The sample should only take up 48 bytes, but as a result of how the
-	 * BPF program stores the data (filling in a struct that resides on the
-	 * stack, and sending that off using bpf_perf_event_output()), there is
-	 * some internal padding
-	 */
-	if (size != 52) {
-		printf("Sample size is wrong (%d vs expected %d)\n", size, 52);
-		goto out;
+		printf("[%lu probes dropped]\n", lost);
+
+		return size;
 	}
 
-	probe_id = *(u64 *)&(data[0]);
+	flags = *(u32 *)&(data[4]);
 	pid = *(u32 *)&(data[8]);
 	ppid = *(u32 *)&(data[12]);
 	cpu = *(u32 *)&(data[16]);
@@ -318,19 +312,14 @@ static int output_event(u64 *buf)
 	tag = *(u32 *)&(data[40]);
 
 	if (probe_id != 123)
-		printf("Corrupted data (probe_id = %ld)\n", probe_id);
+		printf("Corrupted data (probe_id = %d)\n", probe_id);
 	if (tag != 0xdace)
 		printf("Corrupted data (tag = %x)\n", tag);
 
-	printf("CPU-%d: EPID %ld PID %d PPID %d EUID %d EGID %d TASK %08lx\n",
-	       cpu, probe_id, pid, ppid, euid, egid, task);
+	printf("CPU-%d: [%d/%d] PID %d PPID %d EUID %d EGID %d TASK %08lx\n",
+	       cpu, probe_id, flags, pid, ppid, euid, egid, task);
 
-out:
-	/*
-	 * We processed the perf_event_header, the size, and ;size; bytes of
-	 * probe data.
-	 */
-	return sizeof(struct perf_event_header) + sizeof(u32) + size;
+	return 48;
 }
 
 /*
@@ -351,10 +340,9 @@ static void process_data(struct dtrace_buffer *buf)
 
 		/*
 		 * Ensure that the buffer contains enough data for at least one
-		 * sample (header + sample size + sample data).
+		 * sample.
 		 */
-		if (head - tail < sizeof(struct perf_event_header) +
-				  sizeof(u32) + 48)
+		if (head - tail < 48)
 			break;
 
 		if (*ptr)
diff --git a/tools/dtrace/probe1_bpf.c b/tools/dtrace/probe1_bpf.c
index 5b34edb61412..a3196261e66e 100644
--- a/tools/dtrace/probe1_bpf.c
+++ b/tools/dtrace/probe1_bpf.c
@@ -37,25 +37,16 @@ struct bpf_map_def SEC("maps") buffer_map = {
 	.max_entries = 2,
 };
 
-struct sample {
-	u64 probe_id;
-	u32 pid;
-	u32 ppid;
-	u32 cpu;
-	u32 euid;
-	u32 egid;
-	u64 task;
-	u32 tag;
-};
-
 #define DPROG(F)	SEC("dtrace/"__stringify(F)) int bpf_func_##F
+#define BUF_ID		offsetof(struct dtrace_bpf_context, buf)
 
 /* we jump here when syscall number == __NR_write */
 DPROG(__NR_write)(struct dtrace_bpf_context *ctx)
 {
 	int			cpu = bpf_get_smp_processor_id();
 	struct dtrace_ecb	*ecb;
-	struct sample		smpl;
+	u8			*buf, *buf_end;
+	int			err;
 
 	bpf_finalize_context(ctx, &probemap);
 
@@ -63,17 +54,25 @@ DPROG(__NR_write)(struct dtrace_bpf_context *ctx)
 	if (!ecb)
 		return 0;
 
-	memset(&smpl, 0, sizeof(smpl));
-	smpl.probe_id = ecb->probe_id;
-	smpl.pid = ctx->pid;
-	smpl.ppid = ctx->ppid;
-	smpl.cpu = ctx->cpu;
-	smpl.euid = ctx->euid;
-	smpl.egid = ctx->egid;
-	smpl.task = ctx->task;
-	smpl.tag = 0xdace;
-
-	bpf_perf_event_output(ctx, &buffer_map, cpu, &smpl, sizeof(smpl));
+	err = bpf_buffer_reserve(ctx, BUF_ID, &buffer_map, 48);
+	if (err < 0)
+		return -1;
+	buf = ctx->buf;
+	buf_end = ctx->buf_end;
+	if (buf + 48 > buf_end)
+		return -1;
+
+	*(u32 *)(&buf[0]) = ecb->probe_id;
+	*(u32 *)(&buf[4]) = 0;
+	*(u32 *)(&buf[8]) = ctx->pid;
+	*(u32 *)(&buf[12]) = ctx->ppid;
+	*(u32 *)(&buf[16]) = ctx->cpu;
+	*(u32 *)(&buf[20]) = ctx->euid;
+	*(u32 *)(&buf[24]) = ctx->egid;
+	*(u64 *)(&buf[32]) = ctx->task;
+	*(u32 *)(&buf[40]) = 0xdace;
+
+	bpf_buffer_commit(ctx, BUF_ID, &buffer_map);
 
 	return 0;
 }
@@ -84,7 +83,7 @@ int bpf_prog1(struct pt_regs *ctx)
 	struct dtrace_ecb	ecb;
 	int			cpu = bpf_get_smp_processor_id();
 
-	ecb.id = 1;
+	ecb.id = 3;
 	ecb.probe_id = 123;
 
 	bpf_map_update_elem(&probemap, &cpu, &ecb, BPF_ANY);
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 00/11] bpf, trace, dtrace: DTrace BPF program type implementation and sample use
  2019-05-20 23:47 [RFC PATCH 00/11] bpf, trace, dtrace: DTrace BPF program type implementation and sample use Kris Van Hees
                   ` (11 preceding siblings ...)
  2019-05-21 20:40 ` [RFC PATCH 11/11] dtrace: make use of writable buffers in BPF Kris Van Hees
@ 2019-05-21 20:48 ` Kris Van Hees
  2019-05-21 20:54   ` Steven Rostedt
  2019-05-21 20:56   ` Alexei Starovoitov
  12 siblings, 2 replies; 54+ messages in thread
From: Kris Van Hees @ 2019-05-21 20:48 UTC (permalink / raw)
  To: dtrace-devel
  Cc: netdev, bpf, linux-kernel, daniel, acme, mhiramat, rostedt, ast

As suggested, I resent the patch set as replies to the cover letter post
to support threaded access to the patches.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 00/11] bpf, trace, dtrace: DTrace BPF program type implementation and sample use
  2019-05-21 20:48 ` [RFC PATCH 00/11] bpf, trace, dtrace: DTrace BPF program type implementation and sample use Kris Van Hees
@ 2019-05-21 20:54   ` Steven Rostedt
  2019-05-21 20:56   ` Alexei Starovoitov
  1 sibling, 0 replies; 54+ messages in thread
From: Steven Rostedt @ 2019-05-21 20:54 UTC (permalink / raw)
  To: Kris Van Hees
  Cc: dtrace-devel, netdev, bpf, linux-kernel, daniel, acme, mhiramat, ast

On Tue, 21 May 2019 16:48:48 -0400
Kris Van Hees <kris.van.hees@oracle.com> wrote:

> As suggested, I resent the patch set as replies to the cover letter post
> to support threaded access to the patches.

Note, you should also have added a v2 in the subject:

[RFC PATCH 00/11 v2] ...

The next one should have v3.

Cheers,

-- Steve

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 00/11] bpf, trace, dtrace: DTrace BPF program type implementation and sample use
  2019-05-21 18:41   ` Kris Van Hees
@ 2019-05-21 20:55     ` Alexei Starovoitov
  2019-05-21 21:36       ` Steven Rostedt
  2019-05-21 21:36       ` Kris Van Hees
  0 siblings, 2 replies; 54+ messages in thread
From: Alexei Starovoitov @ 2019-05-21 20:55 UTC (permalink / raw)
  To: Kris Van Hees
  Cc: netdev, bpf, dtrace-devel, linux-kernel, rostedt, mhiramat, acme,
	ast, daniel, peterz

On Tue, May 21, 2019 at 02:41:37PM -0400, Kris Van Hees wrote:
> On Tue, May 21, 2019 at 10:56:18AM -0700, Alexei Starovoitov wrote:
> > On Mon, May 20, 2019 at 11:47:00PM +0000, Kris Van Hees wrote:
> > > 
> > >     2. bpf: add BPF_PROG_TYPE_DTRACE
> > > 
> > > 	This patch adds BPF_PROG_TYPE_DTRACE as a new BPF program type, without
> > > 	actually providing an implementation.  The actual implementation is
> > > 	added in patch 4 (see below).  We do it this way because the
> > > 	implementation is being added to the tracing subsystem as a component
> > > 	that I would be happy to maintain (if merged) whereas the declaration
> > > 	of the program type must be in the bpf subsystem.  Since the two
> > > 	subsystems are maintained by different people, we split the
> > > 	implementing patches across maintainer boundaries while ensuring that
> > > 	the kernel remains buildable between patches.
> > 
> > None of these kernel patches are necessary for what you want to achieve.
> 
> I disagree.  The current support for BPF programs for probes associates a
> specific BPF program type with a specific set of probes, which means that I
> cannot write BPF programs based on a more general concept of a 'DTrace probe'
> and provide functionality based on that.  It also means that if I have a D
> clause (DTrace probe action code associated with probes) that is to be executed
> for a list of probes of different types, I need to duplicate the program
> because I cannot cross program type boundaries.

tracepoint vs kprobe vs raw_tracepoint vs perf event work on different input.
There is no common denominator to them that can serve as a single 'generic' context.
We're working on the concept of bpf libraries where different bpf programs
with different types can call a single bpf function with arbitrary arguments.
This concept already works in bpf2bpf calls. We're working on extending it
to different program types.
You're more than welcome to help in that direction,
but type casting a tracepoint into a kprobe is a no-go.
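
For readers unfamiliar with bpf2bpf calls, a minimal sketch of the pattern
(all names are illustrative): a shared function that multiple entry points
of the same program type call today, which the library work described above
aims to extend across program types:

	static __attribute__((noinline))
	int emit_record(void *ctx, __u32 probe_id)
	{
		/* shared action logic, called from several entry points */
		return probe_id ? 0 : -1;
	}

	SEC("kprobe/ksys_write")
	int trace_write(struct pt_regs *ctx)
	{
		return emit_record(ctx, 1);
	}

	SEC("kprobe/ksys_read")
	int trace_read(struct pt_regs *ctx)
	{
		return emit_record(ctx, 2);
	}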

> The reasons for these patches is because I cannot do the same with the existing
> implementation.  Yes, I can do some of it or use some workarounds to accomplish
> kind of the same thing, but at the expense of not being able to do what I need
> to do but rather do some kind of best effort alternative.  That is not the goal
> here.

what you call 'workaround' other people call 'feature'.
The kernel community doesn't accept extra code into the kernel
when user space can do the same.

> 
> > Feel free to add tools/dtrace/ directory and maintain it though.
> 
> Thank you.
> 
> > The new dtrace_buffer doesn't need to replicate existing bpf+kernel functionality
> > and no changes are necessary in kernel/events/ring_buffer.c either.
> > tools/dtrace/ user space component can use either per-cpu array map
> > or hash map as a buffer to store arbitrary data into and use
> > existing bpf_perf_event_output() to send it to user space via perf ring buffer.
> > 
> > See, for example, how bpftrace does that.
> 
> When using bpf_perf_event_output() you need to construct the sample first,
> and then send it off to user space using the perf ring-buffer.  That is extra
> work that is unnecessary.  Also, storing arbitrary data from userspace in maps
> is not relevant here because this is about data that is generated at the level
> of the kernel and sent to userspace as part of the probe action that is
> executed when the probe fires.
> 
> Bpftrace indeed uses maps and ways to construct the sample and then uses the
> perf ring-buffer to pass data to userspace.  And that is not the way DTrace
> works, and that is not the mechanism that we need here.  So, while this may be
> satisfactory for bpftrace, it is not for DTrace.  We need more fine-grained
> control over how we write data to the buffer (doing direct stores from BPF
> code) and without the overhead of constructing a complete sample that can just
> be handed over to bpf_perf_event_output().

I think we're not on the same page vs how bpftrace and bpf_perf_event_output work.
What you're proposing in these patches is _slower_ than existing mechanism.

> 
> Also, please note that I am not duplicating any kernel functionality when it
> comes to buffer handling, and in fact, I found it very easy to be able to
> tap into the perf event ring-buffer implementation and add a feature that I
> need for DTrace.  That was a very pleasant experience for sure!

Let's agree to disagree. All I see is a code duplication and lack of understanding
of existing bpf features.


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 00/11] bpf, trace, dtrace: DTrace BPF program type implementation and sample use
  2019-05-21 20:48 ` [RFC PATCH 00/11] bpf, trace, dtrace: DTrace BPF program type implementation and sample use Kris Van Hees
  2019-05-21 20:54   ` Steven Rostedt
@ 2019-05-21 20:56   ` Alexei Starovoitov
  1 sibling, 0 replies; 54+ messages in thread
From: Alexei Starovoitov @ 2019-05-21 20:56 UTC (permalink / raw)
  To: Kris Van Hees
  Cc: dtrace-devel, netdev, bpf, linux-kernel, daniel, acme, mhiramat,
	rostedt, ast

On Tue, May 21, 2019 at 04:48:48PM -0400, Kris Van Hees wrote:
> As suggested, I resent the patch set as replies to the cover letter post
> to support threaded access to the patches.

As explained in the other email it's a Nack.
Please stop this email spam.


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 00/11] bpf, trace, dtrace: DTrace BPF program type implementation and sample use
  2019-05-21 20:55     ` Alexei Starovoitov
@ 2019-05-21 21:36       ` Steven Rostedt
  2019-05-21 21:43         ` Alexei Starovoitov
  2019-05-21 21:36       ` Kris Van Hees
  1 sibling, 1 reply; 54+ messages in thread
From: Steven Rostedt @ 2019-05-21 21:36 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Kris Van Hees, netdev, bpf, dtrace-devel, linux-kernel, mhiramat,
	acme, ast, daniel, peterz

On Tue, 21 May 2019 13:55:34 -0700
Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> > The reasons for these patches is because I cannot do the same with the existing
> > implementation.  Yes, I can do some of it or use some workarounds to accomplish
> > kind of the same thing, but at the expense of not being able to do what I need
> > to do but rather do some kind of best effort alternative.  That is not the goal
> > here.  
> 
> what you call 'workaround' other people call 'feature'.
> The kernel community doesn't accept extra code into the kernel
> when user space can do the same.

If that were really true, all file systems would be implemented on
FUSE ;-)

I was just at a technical conference that was not Linux focused, and I
talked to a lot of admins that said they would love to have Dtrace
scripts working on Linux unmodified.

I need to start getting more familiar with the workings of eBPF and
then look at what Dtrace has to see where something like this can be
achieved, but right now just NACKing patches outright isn't being
helpful. If you are not happy with this direction, I would love to see
conversations where Kris shows you exactly what is required (from a
feature perspective, not an implementation one) and we come up with a
solution.

-- Steve

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 00/11] bpf, trace, dtrace: DTrace BPF program type implementation and sample use
  2019-05-21 20:55     ` Alexei Starovoitov
  2019-05-21 21:36       ` Steven Rostedt
@ 2019-05-21 21:36       ` Kris Van Hees
  2019-05-21 23:26         ` Alexei Starovoitov
  1 sibling, 1 reply; 54+ messages in thread
From: Kris Van Hees @ 2019-05-21 21:36 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Kris Van Hees, netdev, bpf, dtrace-devel, linux-kernel, rostedt,
	mhiramat, acme, ast, daniel, peterz

On Tue, May 21, 2019 at 01:55:34PM -0700, Alexei Starovoitov wrote:
> On Tue, May 21, 2019 at 02:41:37PM -0400, Kris Van Hees wrote:
> > On Tue, May 21, 2019 at 10:56:18AM -0700, Alexei Starovoitov wrote:
> > > On Mon, May 20, 2019 at 11:47:00PM +0000, Kris Van Hees wrote:
> > > > 
> > > >     2. bpf: add BPF_PROG_TYPE_DTRACE
> > > > 
> > > > 	This patch adds BPF_PROG_TYPE_DTRACE as a new BPF program type, without
> > > > 	actually providing an implementation.  The actual implementation is
> > > > 	added in patch 4 (see below).  We do it this way because the
> > > > 	implementation is being added to the tracing subsystem as a component
> > > > 	that I would be happy to maintain (if merged) whereas the declaration
> > > > 	of the program type must be in the bpf subsystem.  Since the two
> > > > 	subsystems are maintained by different people, we split the
> > > > 	implementing patches across maintainer boundaries while ensuring that
> > > > 	the kernel remains buildable between patches.
> > > 
> > > None of these kernel patches are necessary for what you want to achieve.
> > 
> > I disagree.  The current support for BPF programs for probes associates a
> > specific BPF program type with a specific set of probes, which means that I
> > cannot write BPF programs based on a more general concept of a 'DTrace probe'
> > and provide functionality based on that.  It also means that if I have a D
> > clause (DTrace probe action code associated with probes) that is to be executed
> > for a list of probes of different types, I need to duplicate the program
> > because I cannot cross program type boundaries.
> 
> tracepoint vs kprobe vs raw_tracepoint vs perf event work on different input.
> There is no common denominator to them that can serve as a single 'generic' context.
> We're working on the concept of bpf libraries where different bpf programs
> with different types can call a single bpf function with arbitrary arguments.
> This concept already works in bpf2bpf calls. We're working on extending it
> to different program types.
> You're more than welcome to help in that direction,
> but type casting a tracepoint into a kprobe is a no-go.

I am happy to hear about the direction you are going in adding functionality.
Please note though that I am not type casting tracepoint into kprobe or
anything like that.  I am making it possible to transfer execution from
tracepoint, kprobe, raw-tracepoint, perf event, etc into a BPF program of
a different type (BPF_PROG_TYPE_DTRACE) which operates as a general probe
action execution program type.  It provides functionality that is used to
implement actions to be executed when a probe fires, independent of the
actual probe type that fired.

What you describe seems to me to be rather equivalent to what I already
implement in my patch.

> > The reasons for these patches is because I cannot do the same with the existing
> > implementation.  Yes, I can do some of it or use some workarounds to accomplish
> > kind of the same thing, but at the expense of not being able to do what I need
> > to do but rather do some kind of best effort alternative.  That is not the goal
> > here.
> 
> what you call 'workaround' other people call 'feature'.
> The kernel community doesn't accept extra code into the kernel
> when user space can do the same.

Sure, but userspace cannot do the same because in the case of DTrace much
of this needs to execute at the kernel level within the context of the probe
firing, because once you get back to userspace, the system has moved on.  We
need to capture information and perform processing of that information at the
time of probe firing.  I am spending quite a lot of my time in the design of
DTrace based on BPF and other kernel features to avoid adding more to the
kernel than is really needed, to certainly also to avoid duplicating code.

But I am not designing and implementing a new tracer - I am making an
existing one available based on existing features (as much as possible).  So,
something that comes close but doesn't quite do what we need is not a
solution.

> > > Feel free to add tools/dtrace/ directory and maintain it though.
> > 
> > Thank you.
> > 
> > > The new dtrace_buffer doesn't need to replicate existing bpf+kernel functionality
> > > and no changes are necessary in kernel/events/ring_buffer.c either.
> > > tools/dtrace/ user space component can use either per-cpu array map
> > > or hash map as a buffer to store arbitrary data into and use
> > > existing bpf_perf_event_output() to send it to user space via perf ring buffer.
> > > 
> > > See, for example, how bpftrace does that.
> > 
> > When using bpf_perf_event_output() you need to construct the sample first,
> > and then send it off to user space using the perf ring-buffer.  That is extra
> > work that is unnecessary.  Also, storing arbitrary data from userspace in maps
> > is not relevant here because this is about data that is generated at the level
> > of the kernel and sent to userspace as part of the probe action that is
> > executed when the probe fires.
> > 
> > Bpftrace indeed uses maps and ways to construct the sample and then uses the
> > perf ring-buffer to pass data to userspace.  And that is not the way DTrace
> > works, and that is not the mechanism that we need here.  So, while this may be
> > satisfactory for bpftrace, it is not for DTrace.  We need more fine-grained
> > control over how we write data to the buffer (doing direct stores from BPF
> > code) and without the overhead of constructing a complete sample that can just
> > be handed over to bpf_perf_event_output().
> 
> I think we're not on the same page vs how bpftrace and bpf_perf_event_output work.
> What you're proposing in these patches is _slower_ than existing mechanism.

How can it be slower?  Is a sequence of BPF store instructions, writing
directly to memory in the ring-buffer, slower than using BPF store instructions
to write data into a temporary location from which data is then copied into
the ring-buffer by bpf_perf_event_output()?

Other than this, my implementation uses exactly the same functions at the
perf ring-buffer level as bpf_perf_event_output() does.  In my case, the
buffer reserve work is done with one helper, and the final commit is done
with another helper.  So yes, I use two helper calls vs one helper call if
you use bpf_perf_event_output() but as I mention above, I avoid the creation
and copying of the sample data.

> > Also, please note that I am not duplicating any kernel functionality when it
> > comes to buffer handling, and in fact, I found it very easy to be able to
> > tap into the perf event ring-buffer implementation and add a feature that I
> > need for DTrace.  That was a very pleasant experience for sure!
> 
> Let's agree to disagree. All I see is a code duplication and lack of understanding
> of existing bpf features.

Could you point out to me where you believe I am duplicating code?  I'd really
like to address that.

	Kris

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 00/11] bpf, trace, dtrace: DTrace BPF program type implementation and sample use
  2019-05-21 21:36       ` Steven Rostedt
@ 2019-05-21 21:43         ` Alexei Starovoitov
  2019-05-21 21:48           ` Steven Rostedt
  0 siblings, 1 reply; 54+ messages in thread
From: Alexei Starovoitov @ 2019-05-21 21:43 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Kris Van Hees, netdev, bpf, dtrace-devel, linux-kernel, mhiramat,
	acme, ast, daniel, peterz

On Tue, May 21, 2019 at 05:36:18PM -0400, Steven Rostedt wrote:
> On Tue, 21 May 2019 13:55:34 -0700
> Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> 
> > > The reasons for these patches is because I cannot do the same with the existing
> > > implementation.  Yes, I can do some of it or use some workarounds to accomplish
> > > kind of the same thing, but at the expense of not being able to do what I need
> > > to do but rather do some kind of best effort alternative.  That is not the goal
> > > here.  
> > 
> > what you call 'workaround' other people call 'feature'.
> > The kernel community doesn't accept extra code into the kernel
> > when user space can do the same.
> 
> If that was really true, all file systems would be implemented on
> FUSE ;-)
> 
> I was just at a technical conference that was not Linux focused, and I
> talked to a lot of admins that said they would love to have Dtrace
> scripts working on Linux unmodified.
> 
> I need to start getting more familiar with the workings of eBPF and
> then look at what Dtrace has to see where something like this can be
> achieved, but right now just NACKing patches outright isn't being
> helpful. If you are not happy with this direction, I would love to see
> conversations where Kris shows you exactly what is required (from a
> feature perspective, not an implementation one) and we come up with a
> solution.

Steve,
sounds like you've missed all prior threads.
The feedback given to Kris was very clear:
implement dtrace the same way as bpftrace is working with bpf.
No changes are necessary to dtrace scripts
and no kernel changes are necessary.


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 00/11] bpf, trace, dtrace: DTrace BPF program type implementation and sample use
  2019-05-21 21:43         ` Alexei Starovoitov
@ 2019-05-21 21:48           ` Steven Rostedt
  2019-05-22  5:23             ` Kris Van Hees
  0 siblings, 1 reply; 54+ messages in thread
From: Steven Rostedt @ 2019-05-21 21:48 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Kris Van Hees, netdev, bpf, dtrace-devel, linux-kernel, mhiramat,
	acme, ast, daniel, peterz

On Tue, 21 May 2019 14:43:26 -0700
Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> Steve,
> sounds like you've missed all prior threads.

I probably have missed them ;-)

> The feedback given to Kris was very clear:
> implement dtrace the same way as bpftrace is working with bpf.
> No changes are necessary to dtrace scripts
> and no kernel changes are necessary.

Kris, I haven't been keeping up on all the discussions. But what
exactly is the issue where Dtrace can't be done the same way as
bpftrace is done?

-- Steve

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 00/11] bpf, trace, dtrace: DTrace BPF program type implementation and sample use
  2019-05-21 21:36       ` Kris Van Hees
@ 2019-05-21 23:26         ` Alexei Starovoitov
  2019-05-22  4:12           ` Kris Van Hees
  0 siblings, 1 reply; 54+ messages in thread
From: Alexei Starovoitov @ 2019-05-21 23:26 UTC (permalink / raw)
  To: Kris Van Hees
  Cc: netdev, bpf, dtrace-devel, linux-kernel, rostedt, mhiramat, acme,
	ast, daniel, peterz

On Tue, May 21, 2019 at 05:36:49PM -0400, Kris Van Hees wrote:
> On Tue, May 21, 2019 at 01:55:34PM -0700, Alexei Starovoitov wrote:
> > On Tue, May 21, 2019 at 02:41:37PM -0400, Kris Van Hees wrote:
> > > On Tue, May 21, 2019 at 10:56:18AM -0700, Alexei Starovoitov wrote:
> > > > On Mon, May 20, 2019 at 11:47:00PM +0000, Kris Van Hees wrote:
> > > > > 
> > > > >     2. bpf: add BPF_PROG_TYPE_DTRACE
> > > > > 
> > > > > 	This patch adds BPF_PROG_TYPE_DTRACE as a new BPF program type, without
> > > > > 	actually providing an implementation.  The actual implementation is
> > > > > 	added in patch 4 (see below).  We do it this way because the
> > > > > 	implementation is being added to the tracing subsystem as a component
> > > > > 	that I would be happy to maintain (if merged) whereas the declaration
> > > > > 	of the program type must be in the bpf subsystem.  Since the two
> > > > > 	subsystems are maintained by different people, we split the
> > > > > 	implementing patches across maintainer boundaries while ensuring that
> > > > > 	the kernel remains buildable between patches.
> > > > 
> > > > None of these kernel patches are necessary for what you want to achieve.
> > > 
> > > I disagree.  The current support for BPF programs for probes associates a
> > > specific BPF program type with a specific set of probes, which means that I
> > > cannot write BPF programs based on a more general concept of a 'DTrace probe'
> > > and provide functionality based on that.  It also means that if I have a D
> > > clause (DTrace probe action code associated with probes) that is to be executed
> > > for a list of probes of different types, I need to duplicate the program
> > > because I cannot cross program type boundaries.
> > 
> > tracepoint vs kprobe vs raw_tracepoint vs perf event work on different input.
> > There is no common denominator to them that can serve as a single 'generic' context.
> > We're working on the concept of bpf libraries where different bpf programs
> > with different types can call a single bpf function with arbitrary arguments.
> > This concept already works in bpf2bpf calls. We're working on extending it
> > to different program types.
> > You're more than welcome to help in that direction,
> > but type casting a tracepoint into a kprobe is a no-go.
> 
> I am happy to hear about the direction you are going in adding functionality.
> Please note though that I am not type casting tracepoint into kprobe or
> anything like that.  I am making it possible to transfer execution from
> tracepoint, kprobe, raw-tracepoint, perf event, etc into a BPF program of
> a different type (BPF_PROG_TYPE_DTRACE) which operates as a general probe
> action execution program type.  It provides functionality that is used to
> implement actions to be executed when a probe fires, independent of the
> actual probe type that fired.
> 
> What you describe seems to me to be rather equivalent to what I already
> implement in my patch.

except they're not.
You're converting to one new prog type only that no one else can use,
whereas bpf infra is aiming to be as generic as possible and to
fit networking, tracing, and security use cases all at once.

> > > The reasons for these patches is because I cannot do the same with the existing
> > > implementation.  Yes, I can do some of it or use some workarounds to accomplish
> > > kind of the same thing, but at the expense of not being able to do what I need
> > > to do but rather do some kind of best effort alternative.  That is not the goal
> > > here.
> > 
> > what you call 'workaround' other people call 'feature'.
> > The kernel community doesn't accept extra code into the kernel
> > when user space can do the same.
> 
> Sure, but userspace cannot do the same because in the case of DTrace much
> of this needs to execute at the kernel level within the context of the probe
> firing, because once you get back to userspace, the system has moved on.  We
> need to capture information and perform processing of that information at the
> time of probe firing.  I am spending quite a lot of my time in the design of
> DTrace based on BPF and other kernel features to avoid adding more to the
> kernel than is really needed, and certainly also to avoid duplicating code.
> 
> But I am not designing and implementing a new tracer - I am making an
> existing one available based on existing features (as much as possible).  So,
> something that comes close but doesn't quite do what we need is not a
> solution.

Your patches disagree with your words.
This dtrace buffer is a redundant feature.
A per-cpu array plus perf_event_output achieves _exactly_ the same.

> 
> > > > Feel free to add tools/dtrace/ directory and maintain it though.
> > > 
> > > Thank you.
> > > 
> > > > The new dtrace_buffer doesn't need to replicate existing bpf+kernel functionality
> > > > and no changes are necessary in kernel/events/ring_buffer.c either.
> > > > tools/dtrace/ user space component can use either per-cpu array map
> > > > or hash map as a buffer to store arbitrary data into and use
> > > > existing bpf_perf_event_output() to send it to user space via perf ring buffer.
> > > > 
> > > > See, for example, how bpftrace does that.
> > > 
> > > When using bpf_perf_event_output() you need to construct the sample first,
> > > and then send it off to user space using the perf ring-buffer.  That is extra
> > > work that is unnecessary.  Also, storing arbitrary data from userspace in maps
> > > is not relevant here because this is about data that is generated at the level
> > > of the kernel and sent to userspace as part of the probe action that is
> > > executed when the probe fires.
> > > 
> > > Bpftrace indeed uses maps and ways to construct the sample and then uses the
> > > perf ring-buffer to pass data to userspace.  And that is not the way DTrace
> > > works, and that is not the mechanism that we need here.  So, while this may be
> > > satisfactory for bpftrace, it is not for DTrace.  We need more fine-grained
> > > control over how we write data to the buffer (doing direct stores from BPF
> > > code) and without the overhead of constructing a complete sample that can just
> > > be handed over to bpf_perf_event_output().
> > 
> > I think we're not on the same page vs how bpftrace and bpf_perf_event_output work.
> > What you're proposing in these patches is _slower_ than existing mechanism.
> 
> How can it be slower?  Is a sequence of BPF store instructions, writing
> directly to memory in the ring-buffer slower than using BPF store instructions
> to write data into a temporary location from which data is then copied into
> the ring-buffer by bpf_perf_event_output()?
> 
> Other than this, my implementation uses exactly the same functions at the
> perf ring-buffer level as bpf_perf_event_output() does.  In my case, the
> buffer reserve work is done with one helper, and the final commit is done
> with another helper.  So yes, I use two helper calls vs one helper call if
> you use bpf_perf_event_output() but as I mention above, I avoid the creation
> and copying of the sample data.

What stops you from using per-cpu array and perf_event_output?
No 'reserve' call necessary. lookup from per-cpu array gives a pointer
to a large buffer that can be fed into perf_event_output.
It's also faster for small buffers and has no issues with multi-page.
No hacks on perf side necessary.
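
In rough BPF C, that pattern looks like this (map shapes and sizes here
are illustrative only, not taken from any existing tool):

#include <linux/bpf.h>
#include <linux/ptrace.h>
#include "bpf_helpers.h"

#define RECORD_SZ 64

/* per-cpu scratch space: one large buffer per cpu */
struct bpf_map_def SEC("maps") scratch = {
	.type = BPF_MAP_TYPE_PERCPU_ARRAY,
	.key_size = sizeof(__u32),
	.value_size = RECORD_SZ,
	.max_entries = 1,
};

/* perf ring buffers that user space mmap()s and reads */
struct bpf_map_def SEC("maps") events = {
	.type = BPF_MAP_TYPE_PERF_EVENT_ARRAY,
	.key_size = sizeof(int),
	.value_size = sizeof(__u32),
	.max_entries = 128,	/* >= number of cpus */
};

SEC("kprobe/__set_task_comm")
int bpf_prog(struct pt_regs *ctx)
{
	__u32 key = 0;
	char *buf = bpf_map_lookup_elem(&scratch, &key);

	if (!buf)
		return 0;

	/* build the record with direct stores into the scratch buffer */
	*(__u64 *)buf = bpf_ktime_get_ns();
	*(__u32 *)(buf + 8) = (__u32)bpf_get_current_pid_tgid();

	/* one copy from scratch into the perf ring buffer */
	bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU,
			      buf, RECORD_SZ);
	return 0;
}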

> 
> > > Also, please note that I am not duplicating any kernel functionality when it
> > > comes to buffer handling, and in fact, I found it very easy to be able to
> > > tap into the perf event ring-buffer implementation and add a feature that I
> > > need for DTrace.  That was a very pleasant experience for sure!
> > 
> > Let's agree to disagree. All I see is a code duplication and lack of understanding
> > of existing bpf features.
> 
> Could you point out to me where you believe I am duplicating code?  I'd really
> like to address that.

see above.



* Re: [RFC PATCH 00/11] bpf, trace, dtrace: DTrace BPF program type implementation and sample use
  2019-05-21 23:26         ` Alexei Starovoitov
@ 2019-05-22  4:12           ` Kris Van Hees
  2019-05-22 20:16             ` Alexei Starovoitov
  0 siblings, 1 reply; 54+ messages in thread
From: Kris Van Hees @ 2019-05-22  4:12 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Kris Van Hees, netdev, bpf, dtrace-devel, linux-kernel, rostedt,
	mhiramat, acme, ast, daniel, peterz

On Tue, May 21, 2019 at 04:26:19PM -0700, Alexei Starovoitov wrote:
> On Tue, May 21, 2019 at 05:36:49PM -0400, Kris Van Hees wrote:
> > On Tue, May 21, 2019 at 01:55:34PM -0700, Alexei Starovoitov wrote:
> > > On Tue, May 21, 2019 at 02:41:37PM -0400, Kris Van Hees wrote:
> > > > On Tue, May 21, 2019 at 10:56:18AM -0700, Alexei Starovoitov wrote:
> > > > > On Mon, May 20, 2019 at 11:47:00PM +0000, Kris Van Hees wrote:
> > > > > > 
> > > > > >     2. bpf: add BPF_PROG_TYPE_DTRACE
> > > > > > 
> > > > > > 	This patch adds BPF_PROG_TYPE_DTRACE as a new BPF program type, without
> > > > > > 	actually providing an implementation.  The actual implementation is
> > > > > > 	added in patch 4 (see below).  We do it this way because the
> > > > > > 	implementation is being added to the tracing subsystem as a component
> > > > > > 	that I would be happy to maintain (if merged) whereas the declaration
> > > > > > 	of the program type must be in the bpf subsystem.  Since the two
> > > > > > 	subsystems are maintained by different people, we split the
> > > > > > 	implementing patches across maintainer boundaries while ensuring that
> > > > > > 	the kernel remains buildable between patches.
> > > > > 
> > > > > None of these kernel patches are necessary for what you want to achieve.
> > > > 
> > > > I disagree.  The current support for BPF programs for probes associates a
> > > > specific BPF program type with a specific set of probes, which means that I
> > > > cannot write BPF programs based on a more general concept of a 'DTrace probe'
> > > > and provide functionality based on that.  It also means that if I have a D
> > > > clause (DTrace probe action code associated with probes) that is to be executed
> > > > for a list of probes of different types, I need to duplicate the program
> > > > because I cannot cross program type boundaries.
> > > 
> > > tracepoint vs kprobe vs raw_tracepoint vs perf event work on different input.
> > > There is no common denominator to them that can serve as single 'generic' context.
> > > We're working on the concept of bpf libraries where different bpf program
> > > with different types can call single bpf function with arbitrary arguments.
> > > This concept already works in bpf2bpf calls. We're working on extending it
> > > to different program types.
> > > You're more than welcome to help in that direction,
> > > but type casting of tracepoint into kprobe is no go.
> > 
> > I am happy to hear about the direction you are going in adding functionality.
> > Please note though that I am not type casting tracepoint into kprobe or
> > anything like that.  I am making it possible to transfer execution from
> > tracepoint, kprobe, raw-tracepoint, perf event, etc into a BPF program of
> > a different type (BPF_PROG_TYPE_DTRACE) which operates as a general probe
> > action execution program type.  It provides functionality that is used to
> > implement actions to be executed when a probe fires, independent of the
> > actual probe type that fired.
> > 
> > What you describe seems to me to be rather equivalent to what I already
> > implement in my patch.
> 
> except they're not.
> you're converting to one new prog type only that no one else can use.
> Whereas bpf infra is aiming to be as generic as possible and
> fit networking, tracing, and security use cases all at once.

Two points here...  the patch that implements cross-prog type tail-call support
is not tied to *any* specific prog type.  Each prog type can specify which
(if any) prog types it can receive calls from (and it can implement context
conversion code to carry any relevant info from the caller context into the
context for the callee).  There is nothing in that patch that is specific to
DTrace or any other prog type.

Then I also introduce a new prog type (not tied to any specific probe type) to
provide the ability to execute programs in a probe type independent context,
and it makes use of the cross-prog-type tail-call support in order to be able
to invoke programs in that probe-independent context from probe-specific BPF
programs.  And there is nothing that prevents anyone from using that new prog
type as well - it is available for use just like any other prog type that
already exists.

But I am confused...  the various probes you mentioned a few emails back
(kprobe, tracepoint, raw_tracepoint, perf event) each have their own BPF
program type associated with them (raw_tracepoint has two program types
serving it), which doesn't sound very generic.  But you are objecting to the
introduction of a generic prog type that can be used to execute programs
regardless of the probe type that caused the invocation because the bpf
infrastructure is aimed at being as generic as possible.

Could you elaborate on why you believe my patches are not adding generic
features?  I can certainly agree that the DTrace-specific portions are less
generic (although they are certainly available for anyone to use), but I
don't quite understand why the new features are deemed non-generic and why
you believe no one else can use this?

> > > > The reason for these patches is that I cannot do the same with the existing
> > > > implementation.  Yes, I can do some of it or use some workarounds to accomplish
> > > > kind of the same thing, but at the expense of not being able to do what I need
> > > > to do but rather do some kind of best effort alternative.  That is not the goal
> > > > here.
> > > 
> > > what you call 'workaround' other people call 'feature'.
> > > The kernel community doesn't accept extra code into the kernel
> > > when user space can do the same.
> > 
> > Sure, but userspace cannot do the same because in the case of DTrace much
> > of this needs to execute at the kernel level within the context of the probe
> > firing, because once you get back to userspace, the system has moved on.  We
> > need to capture information and perform processing of that information at the
> > time of probe firing.  I am spending quite a lot of my time in the design of
> > DTrace based on BPF and other kernel features to avoid adding more to the
> > kernel than is really needed, and certainly also to avoid duplicating code.
> > 
> > But I am not designing and implementing a new tracer - I am making an
> > existing one available based on existing features (as much as possible).  So,
> > something that comes close but doesn't quite do what we need is not a
> > solution.
> 
> Your patches disagree with your words.
> This dtrace buffer is a redundant feature.
> per-cpu array plus perf_event_output achieve _exactly_ the same.

How can it be exactly the same when the per-cpu array plus perf_event_output
approach relies on memory to be allocated in the per-cpu array to be used as
scratch space for constructing the sample, and then that sample data gets
copied from the per-cpu array memory into the memory that was allocated for
the perf ring-buffer?  And my patch provides a way to write the data directly
into the perf ring-buffer, without the need for a scratch area to be allocated,
and without needing to copy the data from one memory chunk into another.

> > > > > Feel free to add tools/dtrace/ directory and maintain it though.
> > > > 
> > > > Thank you.
> > > > 
> > > > > The new dtrace_buffer doesn't need to replicate existing bpf+kernel functionality
> > > > > and no changes are necessary in kernel/events/ring_buffer.c either.
> > > > > tools/dtrace/ user space component can use either per-cpu array map
> > > > > or hash map as a buffer to store arbitrary data into and use
> > > > > existing bpf_perf_event_output() to send it to user space via perf ring buffer.
> > > > > 
> > > > > See, for example, how bpftrace does that.
> > > > 
> > > > When using bpf_perf_event_output() you need to construct the sample first,
> > > > and then send it off to user space using the perf ring-buffer.  That is extra
> > > > work that is unnecessary.  Also, storing arbitrary data from userspace in maps
> > > > is not relevant here because this is about data that is generated at the level
> > > > of the kernel and sent to userspace as part of the probe action that is
> > > > executed when the probe fires.
> > > > 
> > > > Bpftrace indeed uses maps and ways to construct the sample and then uses the
> > > > perf ring-buffer to pass data to userspace.  And that is not the way DTrace
> > > > works, and that is not the mechanism that we need here.  So, while this may be
> > > > satisfactory for bpftrace, it is not for DTrace.  We need more fine-grained
> > > > control over how we write data to the buffer (doing direct stores from BPF
> > > > code) and without the overhead of constructing a complete sample that can just
> > > > be handed over to bpf_perf_event_output().
> > > 
> > > I think we're not on the same page vs how bpftrace and bpf_perf_event_output work.
> > > What you're proposing in these patches is _slower_ than existing mechanism.
> > 
> > How can it be slower?  Is a sequence of BPF store instructions, writing
> > directly to memory in the ring-buffer slower than using BPF store instructions
> > to write data into a temporary location from which data is then copied into
> > the ring-buffer by bpf_perf_event_output()?
> > 
> > Other than this, my implementation uses exactly the same functions at the
> > perf ring-buffer level as bpf_perf_event_output() does.  In my case, the
> > buffer reserve work is done with one helper, and the final commit is done
> > with another helper.  So yes, I use two helper calls vs one helper call if
> > you use bpf_perf_event_output() but as I mention above, I avoid the creation
> > and copying of the sample data.
> 
> What stops you from using per-cpu array and perf_event_output?
> No 'reserve' call necessary. lookup from per-cpu array gives a pointer
> to a large buffer that can be fed into perf_event_output.
> It's also faster for small buffers and has no issues with multi-page.
> No hacks on perf side necessary.

Please see my comments above.  And please note that aside from the overhead of
making one extra helper call (buffer_reserve), my implementation uses the very
functions that are used to implement perf_event_output.  The only difference
is that the first half of perf_event_output (reserving the needed space in the
ring-buffer - not something I came up with - it already gets done for every
write operation to the ring-buffer) gets done from buffer_reserve, and the last
part (recording the new head in the ring-buffer so userspace can see it) is
done from buffer_commit.  Yes, there is a little bit of extra code involved
because the ring-buffer is usually comprised of non-contiguous pages, but
that extra code is minimal.  The real difference with just using
perf_event_output is that perf_event_output copies a chunk of data from a
given memory location into the ring-buffer whereas my implementation places
the data into the ring-buffer directly using BPF store instructions.

The DTrace userspace implementation has an established format in which the
probe data is expected to be found in the buffer.  My proposed (minimal)
extension to the perf ring-buffer code makes it possible to write data into
the ring-buffer in the expected format.  This is not possible by simply using
perf_event_output because that adds a header to the sample data.
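
As a sketch, a probe action using these two helpers would look roughly
like this (the helper names and section name below are placeholders,
not the exact identifiers from the patch set):

/* sketch only: placeholder declarations for the reserve/commit
 * helpers; the real names, helper IDs and signatures are those
 * defined by the patches */
static void *(*dtrace_buffer_reserve)(void *ctx, __u32 size);
static void (*dtrace_buffer_commit)(void *ctx, void *rec);

SEC("dtrace/kprobe")
int probe_action(void *ctx)
{
	__u64 *rec = dtrace_buffer_reserve(ctx, 16);

	if (!rec)
		return 0;

	/* direct stores into the ring buffer, in the layout DTrace
	 * userspace expects - no scratch buffer, no extra copy, and
	 * no perf sample header */
	rec[0] = bpf_ktime_get_ns();
	rec[1] = bpf_get_current_pid_tgid();

	/* record the new head so userspace can consume the data */
	dtrace_buffer_commit(ctx, rec);
	return 0;
}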

> > > > Also, please note that I am not duplicating any kernel functionality when it
> > > > comes to buffer handling, and in fact, I found it very easy to be able to
> > > > tap into the perf event ring-buffer implementation and add a feature that I
> > > > need for DTrace.  That was a very pleasant experience for sure!
> > > 
> > > Let's agree to disagree. All I see is a code duplication and lack of understanding
> > > of existing bpf features.
> > 
> > Could you point out to me where you believe I am duplicating code?  I'd really
> > like to address that.
> 
> see above.


* Re: [RFC PATCH 00/11] bpf, trace, dtrace: DTrace BPF program type implementation and sample use
  2019-05-21 21:48           ` Steven Rostedt
@ 2019-05-22  5:23             ` Kris Van Hees
  2019-05-22 20:53               ` Alexei Starovoitov
  0 siblings, 1 reply; 54+ messages in thread
From: Kris Van Hees @ 2019-05-22  5:23 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Alexei Starovoitov, Kris Van Hees, netdev, bpf, dtrace-devel,
	linux-kernel, mhiramat, acme, ast, daniel, peterz

On Tue, May 21, 2019 at 05:48:11PM -0400, Steven Rostedt wrote:
> On Tue, 21 May 2019 14:43:26 -0700
> Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> 
> > Steve,
> > sounds like you've missed all prior threads.
> 
> I probably have missed them ;-)
> 
> > The feedback was given to Kris it was very clear:
> > implement dtrace the same way as bpftrace is working with bpf.
> > No changes are necessary to dtrace scripts
> > and no kernel changes are necessary.
> 
> Kris, I haven't been keeping up on all the discussions. But what
> exactly is the issue where Dtrace can't be done the same way as the
> bpftrace is done?

There are several issues (and I keep finding new ones as I move forward) but
the biggest one is that I am not trying to re-design (and re-implement) DTrace
from the ground up.  We have an existing userspace component that is getting
modified to work with a new kernel implementation (based on BPF and various
other kernel features that are thankfully available these days).  But we need
to ensure that the userspace component continues to function exactly as one
would expect.  There should be no need to modify DTrace scripts.  Perhaps
bpftrace could be taught to parse DTrace scripts (i.e. implement the D script
language with all its bells and whistles) but it currently cannot and DTrace
obviously can.  It seems to be a better use of resources to focus on the
kernel component, where we can really provide a much cleaner implementation
for DTrace probe execution because BPF is available and very powerful.

Userspace aside, there are various features that are not currently available
such as retrieving the ppid of the current task, and various other data items
that relate to the current task that triggered a probe.  There are ways to
work around it (using the bpf_probe_read() helper, which actually performs a
probe_kernel_read()) but that is rather clunky and definitely shouldn't be
something that can be done from a BPF program if we're doing unprivileged
tracing (which is a goal that is important for us).  New helpers can be added
for things like this, but the list grows large very quickly once you look at
what information DTrace scripts tend to use.
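
To make the clunkiness concrete, getting the ppid today means chasing
task_struct pointers by hand, roughly like this (a sketch using only
existing helpers):

#include <linux/bpf.h>
#include <linux/ptrace.h>
#include <linux/sched.h>
#include "bpf_helpers.h"

SEC("kprobe/__set_task_comm")
int get_ppid(struct pt_regs *ctx)
{
	struct task_struct *task, *parent;
	pid_t ppid = 0;
	char fmt[] = "ppid=%d\n";

	task = (struct task_struct *)bpf_get_current_task();

	/* two raw kernel reads, tied to the task_struct layout */
	bpf_probe_read(&parent, sizeof(parent), &task->real_parent);
	bpf_probe_read(&ppid, sizeof(ppid), &parent->tgid);

	bpf_trace_printk(fmt, sizeof(fmt), ppid);
	return 0;
}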

One of the benefits of DTrace is that probes are largely abstracted entities
when you get to the script level.  While different probes provide different
data, they are all represented as probe arguments and they are accessed in a
very consistent manner that is independent from the actual kind of probe that
triggered the execution.  Often, a single DTrace clause is associated with
multiple probes, of different types.  Probes in the kernel (kprobe, perf event,
tracepoint, ...) are associated with their own BPF program type, so it is not
possible to load the DTrace clause (translated into BPF code) once and
associate it with probes of different types.  Instead, I'd have to load it
as a BPF_PROG_TYPE_KPROBE program to associate it with a kprobe, and I'd have
to load it as a BPF_PROG_TYPE_TRACEPOINT program to associate it with a
tracepoint, and so on.  This also means that I suddenly have to add code to
the userspace component to know about the different program types with more
detail, like what helpers are available to specific program types.

Another advantage of being able to operate on a more abstract probe concept
that is not tied to a specific probe type is that the userspace component does
not need to know about the implementation details of the specific probes.
This avoids a tight coupling between the userspace component and the kernel
implementation.

Another feature that is currently not supported is speculative tracing.  This
is a feature that is not as commonly used (although I personally have found it
to be very useful in the past couple of years), but it is quite powerful because
it allows probe data to be recorded while the decision on whether it
is to be made available to userspace is postponed to a later event.  At that time,
the data can be discarded or committed.
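
In BPF terms the concept would look roughly like this (a sketch of the
idea only; the map shapes are made up and this is not how the DTrace
implementation works):

#include <linux/bpf.h>
#include <linux/ptrace.h>
#include "bpf_helpers.h"

#define SPEC_SZ 64

/* per-cpu staging area for speculatively recorded data */
struct bpf_map_def SEC("maps") spec_buf = {
	.type = BPF_MAP_TYPE_PERCPU_ARRAY,
	.key_size = sizeof(__u32),
	.value_size = SPEC_SZ,
	.max_entries = 1,
};

struct bpf_map_def SEC("maps") events = {
	.type = BPF_MAP_TYPE_PERF_EVENT_ARRAY,
	.key_size = sizeof(int),
	.value_size = sizeof(__u32),
	.max_entries = 128,
};

/* probe firing: record data speculatively, do not publish yet */
SEC("kprobe/__set_task_comm")
int speculate(struct pt_regs *ctx)
{
	__u32 key = 0;
	__u64 *rec = bpf_map_lookup_elem(&spec_buf, &key);

	if (rec)
		rec[0] = bpf_ktime_get_ns();
	return 0;
}

/* later event: decide to commit the staged data to user space
 * (or simply overwrite it, i.e. discard) */
SEC("kprobe/do_exit")
int commit(struct pt_regs *ctx)
{
	__u32 key = 0;
	__u64 *rec = bpf_map_lookup_elem(&spec_buf, &key);

	if (rec && rec[0])
		bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU,
				      rec, SPEC_SZ);
	return 0;
}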

These are just some examples of issues I have been working on.  I spent quite
a bit of time to look for ways to implement what we need for DTrace with a
minimal number of patches to the kernel because there really isn't any point
in doing unnecessary work.  I do not doubt that there are ways to
somehow get around some of these issues with clever hacks and
workarounds, but I am not trying to hack something together that hopefully
will be close enough to the expected functionality.

DTrace has proven itself to be quite useful and dependable as a tracing
solution, and I am working on continuing to deliver on that while recognizing
the significant work that others have put into advancing the tracing
infrastructure in Linux in recent years.  So many people have contributed
excellent features - and I am making use of those features as much as I can.
But as is often the case, not everything that I need is currently implemented.
As I expressed during last year's Plumbers in Vancouver, I am putting a very
strong emphasis on ensuring that what I propose as contributions is not
limited to just DTrace.  My goal is to work in an open, collaborative manner,
providing features that anyone can use if they want to.

I wish that the assertion that "no changes are necessary to dtrace scripts and
no kernel changes are necessary" were true, but my own findings contradict
that.  To my knowledge no tool exists right now that can execute any and all
valid DTrace scripts without any changes to the scripts and without any changes
to the kernel.  The only tool I know that can execute DTrace scripts right now
does require rather extensive kernel changes, and the work I am doing right now
is aimed at doing much better than that.

	Kris


* Re: [RFC PATCH 00/11] bpf, trace, dtrace: DTrace BPF program type implementation and sample use
  2019-05-21 17:56 ` Alexei Starovoitov
  2019-05-21 18:41   ` Kris Van Hees
@ 2019-05-22 14:25   ` Peter Zijlstra
  2019-05-22 18:22     ` Kris Van Hees
  1 sibling, 1 reply; 54+ messages in thread
From: Peter Zijlstra @ 2019-05-22 14:25 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Kris Van Hees, netdev, bpf, dtrace-devel, linux-kernel, rostedt,
	mhiramat, acme, ast, daniel

On Tue, May 21, 2019 at 10:56:18AM -0700, Alexei Starovoitov wrote:

> and no changes are necessary in kernel/events/ring_buffer.c either.

Let me just NAK them on the principle that I don't see them in my inbox.

Let me further NAK it for adding all sorts of garbage to the code --
we're not going to do gaps and stay_in_page nonsense.


* Re: [RFC PATCH 00/11] bpf, trace, dtrace: DTrace BPF program type implementation and sample use
  2019-05-22 14:25   ` Peter Zijlstra
@ 2019-05-22 18:22     ` Kris Van Hees
  2019-05-22 19:55       ` Alexei Starovoitov
  2019-05-24  7:27       ` Peter Zijlstra
  0 siblings, 2 replies; 54+ messages in thread
From: Kris Van Hees @ 2019-05-22 18:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Alexei Starovoitov, Kris Van Hees, netdev, bpf, dtrace-devel,
	linux-kernel, rostedt, mhiramat, acme, ast, daniel

On Wed, May 22, 2019 at 04:25:32PM +0200, Peter Zijlstra wrote:
> On Tue, May 21, 2019 at 10:56:18AM -0700, Alexei Starovoitov wrote:
> 
> > and no changes are necessary in kernel/events/ring_buffer.c either.
> 
> Let me just NAK them on the principle that I don't see them in my inbox.

My apologies for failing to include you on the Cc for the patches.  That was
an oversight on my end and certainly not intentional.

> Let me further NAK it for adding all sorts of garbage to the code --
> we're not going to do gaps and stay_in_page nonsense.

Could you give some guidance in terms of an alternative?  The ring buffer code
provides both non-contiguous page allocation support and a vmalloc-based
allocation, and the vmalloc version certainly would avoid the entire gap and
page boundary stuff.  But since the allocator is chosen at build time based on
the arch capabilities, there is no way to select a specific memory allocator.
I'd be happy to use an alternative approach that allows direct writing into
the ring buffer.

	Thanks,
	Kris


* Re: [RFC PATCH 00/11] bpf, trace, dtrace: DTrace BPF program type implementation and sample use
  2019-05-22 18:22     ` Kris Van Hees
@ 2019-05-22 19:55       ` Alexei Starovoitov
  2019-05-22 20:20         ` David Miller
  2019-05-23  5:19         ` Kris Van Hees
  2019-05-24  7:27       ` Peter Zijlstra
  1 sibling, 2 replies; 54+ messages in thread
From: Alexei Starovoitov @ 2019-05-22 19:55 UTC (permalink / raw)
  To: Kris Van Hees
  Cc: Peter Zijlstra, netdev, bpf, dtrace-devel, linux-kernel, rostedt,
	mhiramat, acme, ast, daniel

On Wed, May 22, 2019 at 02:22:15PM -0400, Kris Van Hees wrote:
> On Wed, May 22, 2019 at 04:25:32PM +0200, Peter Zijlstra wrote:
> > On Tue, May 21, 2019 at 10:56:18AM -0700, Alexei Starovoitov wrote:
> > 
> > > and no changes are necessary in kernel/events/ring_buffer.c either.
> > 
> > Let me just NAK them on the principle that I don't see them in my inbox.
> 
> My apologies for failing to include you on the Cc for the patches.  That was
> an oversight on my end and certainly not intentional.
> 
> > Let me further NAK it for adding all sorts of garbage to the code --
> > we're not going to do gaps and stay_in_page nonsense.
> 
> Could you give some guidance in terms of an alternative?  The ring buffer code
> provides both non-contiguous page allocation support and a vmalloc-based
> allocation, and the vmalloc version certainly would avoid the entire gap and
> page boundary stuff.  But since the allocator is chosen at build time based on
> the arch capabilities, there is no way to select a specific memory allocator.
> I'd be happy to use an alternative approach that allows direct writing into
> the ring buffer.

You do not _need_ direct write from bpf prog.
dtrace language doesn't mandate direct write.
'direct write into ring buffer from bpf prog' is an interesting idea and
may be a nice performance optimization, but it is in no way a blocker for dtrace scripts.
Also it's far from clear that it actually brings performance benefits.
Letting bpf progs write directly into ring buffer comes with
a lot of corner cases. It's something to carefully analyze.
I suggest to proceed with user space dtrace conversion to bpf
without introducing kernel changes.



* Re: [RFC PATCH 00/11] bpf, trace, dtrace: DTrace BPF program type implementation and sample use
  2019-05-22  4:12           ` Kris Van Hees
@ 2019-05-22 20:16             ` Alexei Starovoitov
  2019-05-23  5:16               ` Kris Van Hees
  0 siblings, 1 reply; 54+ messages in thread
From: Alexei Starovoitov @ 2019-05-22 20:16 UTC (permalink / raw)
  To: Kris Van Hees
  Cc: netdev, bpf, dtrace-devel, linux-kernel, rostedt, mhiramat, acme,
	ast, daniel, peterz

On Wed, May 22, 2019 at 12:12:53AM -0400, Kris Van Hees wrote:
> 
> Could you elaborate on why you believe my patches are not adding generic
> features?  I can certainly agree that the DTrace-specific portions are less
> generic (although they are certainly available for anyone to use), but I
> don't quite understand why the new features are deemed non-generic and why
> you believe no one else can use this?

And once again your statement above contradicts your own patches.
The patch 2 adds new prog type BPF_PROG_TYPE_DTRACE and the rest of the patches
are tying everything to it.
This approach contradicts the bpf philosophy of being a generic execution engine
and not favoring one program type over another.

I have nothing against dtrace language and dtrace scripts.
Go ahead and compile them into bpf.
All patches to improve bpf infrastructure are very welcomed.

In particular you brought up a good point that there is a use case
for sharing a piece of bpf program between kprobe and tracepoint events.
The better way to do that is via bpf2bpf call.
Example:
void bpf_subprog(arbitrary args)
{
}

SEC("kprobe/__set_task_comm")
int bpf_prog_kprobe(struct pt_regs *ctx)
{
  bpf_subprog(...);
}

SEC("tracepoint/sched/sched_switch")
int bpf_prog_tracepoint(struct sched_switch_args *ctx)
{
  bpf_subprog(...);
}

Such configuration is not supported by the verifier yet.
We've been discussing it for some time, but no work has started,
since there was no concrete use case.
If you can work on adding support for it everyone will benefit.

Could you please consider doing that as a step forward?



* Re: [RFC PATCH 00/11] bpf, trace, dtrace: DTrace BPF program type implementation and sample use
  2019-05-22 19:55       ` Alexei Starovoitov
@ 2019-05-22 20:20         ` David Miller
  2019-05-23  5:19         ` Kris Van Hees
  1 sibling, 0 replies; 54+ messages in thread
From: David Miller @ 2019-05-22 20:20 UTC (permalink / raw)
  To: alexei.starovoitov
  Cc: kris.van.hees, peterz, netdev, bpf, dtrace-devel, linux-kernel,
	rostedt, mhiramat, acme, ast, daniel

From: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Date: Wed, 22 May 2019 12:55:27 -0700

> I suggest to proceed with user space dtrace conversion to bpf
> without introducing kernel changes.

Yes, please...

+1


* Re: [RFC PATCH 00/11] bpf, trace, dtrace: DTrace BPF program type implementation and sample use
  2019-05-22  5:23             ` Kris Van Hees
@ 2019-05-22 20:53               ` Alexei Starovoitov
  2019-05-23  5:46                 ` Kris Van Hees
  0 siblings, 1 reply; 54+ messages in thread
From: Alexei Starovoitov @ 2019-05-22 20:53 UTC (permalink / raw)
  To: Kris Van Hees
  Cc: Steven Rostedt, netdev, bpf, dtrace-devel, linux-kernel,
	mhiramat, acme, ast, daniel, peterz

On Wed, May 22, 2019 at 01:23:27AM -0400, Kris Van Hees wrote:
> 
> Userspace aside, there are various features that are not currently available
> such as retrieving the ppid of the current task, and various other data items
> that relate to the current task that triggered a probe.  There are ways to
> work around it (using the bpf_probe_read() helper, which actually performs a
> probe_kernel_read()) but that is rather clunky

Sounds like you're admitting that the access to all kernel data structures
is actually available, but you don't want to change user space to use it?

> triggered the execution.  Often, a single DTrace clause is associated with
> multiple probes, of different types.  Probes in the kernel (kprobe, perf event,
> tracepoint, ...) are associated with their own BPF program type, so it is not
> possible to load the DTrace clause (translated into BPF code) once and
> associate it with probes of different types.  Instead, I'd have to load it
> as a BPF_PROG_TYPE_KPROBE program to associate it with a kprobe, and I'd have
> to load it as a BPF_PROG_TYPE_TRACEPOINT program to associate it with a
> tracepoint, and so on.  This also means that I suddenly have to add code to
> the userspace component to know about the different program types with more
> detail, like what helpers are available to specific program types.

That also sounds like there is a solution, but you don't want to change user space?

> Another advantage of being able to operate on a more abstract probe concept
> that is not tied to a specific probe type is that the userspace component does
> not need to know about the implementation details of the specific probes.

If that is indeed the case, then dtrace is broken _by design_
and nothing on the kernel side can fix it.

bpf prog attached to NMI is running in NMI.
That is very different execution context vs kprobe.
kprobe execution context is also different from syscall.

The user writing the script has to be aware in what context
that script will be executing.


* Re: [RFC PATCH 00/11] bpf, trace, dtrace: DTrace BPF program type implementation and sample use
  2019-05-22 20:16             ` Alexei Starovoitov
@ 2019-05-23  5:16               ` Kris Van Hees
  2019-05-23 20:28                 ` Alexei Starovoitov
  0 siblings, 1 reply; 54+ messages in thread
From: Kris Van Hees @ 2019-05-23  5:16 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Kris Van Hees, netdev, bpf, dtrace-devel, linux-kernel, rostedt,
	mhiramat, acme, ast, daniel, peterz

On Wed, May 22, 2019 at 01:16:25PM -0700, Alexei Starovoitov wrote:
> On Wed, May 22, 2019 at 12:12:53AM -0400, Kris Van Hees wrote:
> > 
> > Could you elaborate on why you believe my patches are not adding generic
> > features?  I can certainly agree that the DTrace-specific portions are less
> > generic (although they are certainly available for anyone to use), but I
> > don't quite understand why the new features are deemed non-generic and why
> > you believe no one else can use this?
> 
> And once again your statement above contradicts your own patches.
> The patch 2 adds new prog type BPF_PROG_TYPE_DTRACE and the rest of the patches
> are tying everything to it.
> This approach contradicts the bpf philosophy of being a generic execution engine
> and not favoring one program type over another.

I am not sure I understand where you see a contradiction.  What I posted is
a generic feature, and sample code that demonstrates how it can be used based
on the use-case that I am currently working on.  So yes, the sample code is
very specific but it does not restrict the use of the cross-prog-type tail-call
feature.  That feature is designed to be generic.

Probes come in different types (kprobe, tracepoint, perf event, ...) and they
each have their own very specific data associated with them.  I agree 100%
with you on that.  And sometimes tracing makes use of those specifics.  But
even from looking at the implementation of the various probe-related prog
types (and e.g. the list of helpers they each support) it is clear that there is
a lot of commonality as well.  That common functionality is common to all the
probe program types, and that is where I suggest introducing a program type
that captures the common concept of a probe, so perhaps a better name would
be BPF_PROG_TYPE_PROBE.

The principle remains the same though...  I am proposing adding support for
program types that provide common functionality so that programs for various
program types can make use of the more generic programs stored in prog arrays.

> I have nothing against dtrace language and dtrace scripts.
> Go ahead and compile them into bpf.
> All patches to improve bpf infrastructure are very welcomed.
> 
> In particular you brought up a good point that there is a use case
> for sharing a piece of bpf program between kprobe and tracepoint events.
> The better way to do that is via bpf2bpf call.
> Example:
> void bpf_subprog(arbitrary args)
> {
> }
> 
> SEC("kprobe/__set_task_comm")
> int bpf_prog_kprobe(struct pt_regs *ctx)
> {
>   bpf_subprog(...);
> }
> 
> SEC("tracepoint/sched/sched_switch")
> int bpf_prog_tracepoint(struct sched_switch_args *ctx)
> {
>   bpf_subprog(...);
> }
> 
> Such configuration is not supported by the verifier yet.
> We've been discussing it for some time, but no work has started,
> since there was no concrete use case.
> If you can work on adding support for it everyone will benefit.
> 
> Could you please consider doing that as a step forward?

This definitely looks to be an interesting addition and I am happy to look into
that further.  I have a few questions that I hope you can shed light on...

1. What context would bpf_subprog execute with?  If it can be called from
   multiple different prog types, would it see whichever context the caller
   is executing with?  Or would you envision bpf_subprog to not be allowed to
   access the execution context because it cannot know which one is in use?

2. Given that BPF programs are loaded with a specification of the prog type, 
   how would one load a code construct like the one you outline above?  How can
   you load a BPF function and have it be used as subprog from programs that
   are loaded separately?  I.e. in the sample above, if bpf_subprog is loaded
   as part of loading bpf_prog_kprobe (prog type KPROBE), how can it be
   referenced from bpf_prog_tracepoint (prog type TRACEPOINT) which would be
   loaded separately?

	Cheers,
	Kris


* Re: [RFC PATCH 00/11] bpf, trace, dtrace: DTrace BPF program type implementation and sample use
  2019-05-22 19:55       ` Alexei Starovoitov
  2019-05-22 20:20         ` David Miller
@ 2019-05-23  5:19         ` Kris Van Hees
  1 sibling, 0 replies; 54+ messages in thread
From: Kris Van Hees @ 2019-05-23  5:19 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Kris Van Hees, Peter Zijlstra, netdev, bpf, dtrace-devel,
	linux-kernel, rostedt, mhiramat, acme, ast, daniel

On Wed, May 22, 2019 at 12:55:27PM -0700, Alexei Starovoitov wrote:
> On Wed, May 22, 2019 at 02:22:15PM -0400, Kris Van Hees wrote:
> > On Wed, May 22, 2019 at 04:25:32PM +0200, Peter Zijlstra wrote:
> > > On Tue, May 21, 2019 at 10:56:18AM -0700, Alexei Starovoitov wrote:
> > > 
> > > > and no changes are necessary in kernel/events/ring_buffer.c either.
> > > 
> > > Let me just NAK them on the principle that I don't see them in my inbox.
> > 
> > My apologies for failing to include you on the Cc for the patches.  That was
> > an oversight on my end and certainly not intentional.
> > 
> > > Let me further NAK it for adding all sorts of garbage to the code --
> > > we're not going to do gaps and stay_in_page nonsense.
> > 
> > Could you give some guidance in terms of an alternative?  The ring buffer code
> > provides both non-contiguous page allocation support and a vmalloc-based
> > allocation, and the vmalloc version certainly would avoid the entire gap and
> > page boundary stuff.  But since the allocator is chosen at build time based on
> > the arch capabilities, there is no way to select a specific memory allocator.
> > I'd be happy to use an alternative approach that allows direct writing into
> > the ring buffer.
> 
> You do not _need_ direct write from bpf prog.
> dtrace language doesn't mandate direct write.
> 'direct write into ring buffer from bpf prog' is an interesting idea and
> may be a nice performance optimization, but it is in no way a blocker for dtrace scripts.
> Also it's far from clear that it actually brings performance benefits.
> Letting bpf progs write directly into ring buffer comes with
> a lot of corner cases. It's something to carefully analyze.

I agree that doing direct writes is something that can be deferred right now,
especially because there are more fundamental things to focus on.  Thank you
for your acknowledgement of the idea, and I certainly look forward to exploring
this further at a later time.

> I suggest to proceed with user space dtrace conversion to bpf
> without introducing kernel changes.


* Re: [RFC PATCH 00/11] bpf, trace, dtrace: DTrace BPF program type implementation and sample use
  2019-05-22 20:53               ` Alexei Starovoitov
@ 2019-05-23  5:46                 ` Kris Van Hees
  2019-05-23 21:13                   ` Alexei Starovoitov
  0 siblings, 1 reply; 54+ messages in thread
From: Kris Van Hees @ 2019-05-23  5:46 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Kris Van Hees, Steven Rostedt, netdev, bpf, dtrace-devel,
	linux-kernel, mhiramat, acme, ast, daniel, peterz

On Wed, May 22, 2019 at 01:53:31PM -0700, Alexei Starovoitov wrote:
> On Wed, May 22, 2019 at 01:23:27AM -0400, Kris Van Hees wrote:
> > 
> > Userspace aside, there are various features that are not currently available
> > such as retrieving the ppid of the current task, and various other data items
> > that relate to the current task that triggered a probe.  There are ways to
> > work around it (using the bpf_probe_read() helper, which actually performs a
> > probe_kernel_read()) but that is rather clunky
> 
> Sounds like you're admitting that the access to all kernel data structures
> is actually available, but you don't want to change user space to use it?

I of course agree that access to all kernel structures can be achieved using the
bpf_probe_read() helper.  But I hope you agree that the availability of that
helper doesn't mean that there is no room for more elegant ways to access
information.  There are already helpers (e.g. bpf_get_current_pid_tgid) that
could be replaced by BPF code that uses bpf_probe_read to accomplish the same
thing.

> > triggered the execution.  Often, a single DTrace clause is associated with
> > multiple probes, of different types.  Probes in the kernel (kprobe, perf event,
> > tracepoint, ...) are associated with their own BPF program type, so it is not
> > possible to load the DTrace clause (translated into BPF code) once and
> > associate it with probes of different types.  Instead, I'd have to load it
> > as a BPF_PROG_TYPE_KPROBE program to associate it with a kprobe, and I'd have
> > to load it as a BPF_PROG_TYPE_TRACEPOINT program to associate it with a
> > tracepoint, and so on.  This also means that I suddenly have to add code to
> > the userspace component to know about the different program types with more
> > detail, like what helpers are available to specific program types.
> 
> That also sounds like there is a solution, but you don't want to change user space?

I think there is a difference between a solution and a good solution.  Adding
a lot of knowledge in the userspace component about how things are implemented
at the kernel level makes for a more fragile infrastructure and involves
breaking down well established boundaries in DTrace that are part of the design
specifically to ensure that userspace doesn't need to depend on such intimate
knowledge.

> > Another advantage of being able to operate on a more abstract probe concept
> > that is not tied to a specific probe type is that the userspace component does
> > not need to know about the implementation details of the specific probes.
> 
> If that is indeed the case, then dtrace is broken _by design_
> and nothing on the kernel side can fix it.
> 
> bpf prog attached to NMI is running in NMI.
> That is very different execution context vs kprobe.
> kprobe execution context is also different from syscall.
> 
> The user writing the script has to be aware in what context
> that script will be executing.

The design behind DTrace definitely recognizes that different types of probes
operate in different ways and have different data associated with them.  That
is why probes (in legacy DTrace) are managed by providers, one for each type
of probe.  The providers handle the specifics of a probe type, and provide a
generic probe API to the processing component of DTrace:

    SDT probes -----> SDT provider -------+
                                          |
    FBT probes -----> FBT provider -------+--> DTrace engine
                                          |
    syscall probes -> systrace provider --+

This means that the DTrace processing component can be implemented based on a
generic probe concept, and the providers will take care of the specifics.  In
that sense, it is similar to so many other parts of the kernel where a generic
API is exposed so that higher level components don't need to know implementation
details.

In DTrace, people write scripts based on UAPI-style interfaces and they don't
have to concern themselves with e.g. knowing how to get the value of the 3rd
argument that was passed by the firing probe.  All they need to know is that
the probe will have a 3rd argument, and that the 3rd argument to *any* probe
can be accessed as 'arg2' (or args[2] for typed arguments, if the provider is
capable of providing that).  Different probes have different ways of passing
arguments, and only the provider code for each probe type needs to know how
to retrieve the argument values.
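
As an illustration in raw BPF C: the same logical 'count' argument of a
write is retrieved very differently depending on the probe type (the
tracepoint context layout below mirrors the tracepoint's format file,
reproduced here from memory, so double-check it against the actual
format):

#include <linux/bpf.h>
#include <linux/ptrace.h>
#include "bpf_helpers.h"

/* kprobe: arguments are dug out of the saved registers */
SEC("kprobe/ksys_write")
int arg2_kprobe(struct pt_regs *ctx)
{
	__u64 count = PT_REGS_PARM3(ctx);
	char fmt[] = "count=%llu\n";

	bpf_trace_printk(fmt, sizeof(fmt), count);
	return 0;
}

/* tracepoint: the same argument comes from a per-event context
 * struct instead */
struct sys_enter_write_args {
	__u64 pad;		/* common trace event fields */
	int syscall_nr;
	__u64 fd;
	__u64 buf;
	__u64 count;
};

SEC("tracepoint/syscalls/sys_enter_write")
int arg2_tracepoint(struct sys_enter_write_args *ctx)
{
	char fmt[] = "count=%llu\n";

	bpf_trace_printk(fmt, sizeof(fmt), ctx->count);
	return 0;
}

In a D script, both of these are simply 'arg2'.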

Does this help bring clarity to the reasons why an abstract (generic) probe
concept is part of DTrace's design?


* Re: [RFC PATCH 00/11] bpf, trace, dtrace: DTrace BPF program type implementation and sample use
  2019-05-23  5:16               ` Kris Van Hees
@ 2019-05-23 20:28                 ` Alexei Starovoitov
  2019-05-30 16:15                   ` Kris Van Hees
  2019-06-18  1:25                   ` Kris Van Hees
  0 siblings, 2 replies; 54+ messages in thread
From: Alexei Starovoitov @ 2019-05-23 20:28 UTC (permalink / raw)
  To: Kris Van Hees
  Cc: netdev, bpf, dtrace-devel, linux-kernel, rostedt, mhiramat, acme,
	ast, daniel, peterz

On Thu, May 23, 2019 at 01:16:08AM -0400, Kris Van Hees wrote:
> On Wed, May 22, 2019 at 01:16:25PM -0700, Alexei Starovoitov wrote:
> > On Wed, May 22, 2019 at 12:12:53AM -0400, Kris Van Hees wrote:
> > > 
> > > Could you elaborate on why you believe my patches are not adding generic
> > > features?  I can certainly agree that the DTrace-specific portions are less
> > > generic (although they are certainly available for anyone to use), but I
> > > don't quite understand why the new features are deemed non-generic and why
> > > you believe no one else can use this?
> > 
> > And once again your statement above contradicts your own patches.
> > The patch 2 adds new prog type BPF_PROG_TYPE_DTRACE and the rest of the patches
> > are tying everything to it.
> > This approach contradicts the bpf philosophy of being a generic execution engine
> > and not favoring one program type over another.
> 
> I am not sure I understand where you see a contradiction.  What I posted is
> a generic feature, and sample code that demonstrates how it can be used based
> on the use-case that I am currently working on.  So yes, the sample code is
> very specific but it does not restrict the use of the cross-prog-type tail-call
> feature.  That feature is designed to be generic.
> 
> Probes come in different types (kprobe, tracepoint, perf event, ...) and they
> each have their own very specific data associated with them.  I agree 100%
> with you on that.  And sometimes tracing makes use of those specifics.  But
> even from looking at the implementation of the various probe-related prog
> types (and e.g. the list of helpers they each support) it is clear that there is
> a lot of commonality as well.  That common functionality is common to all the
> probe program types, and that is where I suggest introducing a program type
> that captures the common concept of a probe, so perhaps a better name would
> be BPF_PROG_TYPE_PROBE.
> 
> The principle remains the same though...  I am proposing adding support for
> program types that provide common functionality so that programs for various
> program types can make use of the more generic programs stored in prog arrays.

Except that prog array is indirect call based and got awfully slow due
to retpoline, and we're trying to redesign the whole tail_call approach.
So more extensions to the tail_call facility go in the opposite direction.

> > I have nothing against dtrace language and dtrace scripts.
> > Go ahead and compile them into bpf.
> > All patches to improve bpf infrastructure are very welcomed.
> > 
> > In particular you brought up a good point that there is a use case
> > for sharing a piece of bpf program between kprobe and tracepoint events.
> > The better way to do that is via bpf2bpf call.
> > Example:
> > void bpf_subprog(arbitrary args)
> > {
> > }
> > 
> > SEC("kprobe/__set_task_comm")
> > int bpf_prog_kprobe(struct pt_regs *ctx)
> > {
> >   bpf_subprog(...);
> > }
> > 
> > SEC("tracepoint/sched/sched_switch")
> > int bpf_prog_tracepoint(struct sched_switch_args *ctx)
> > {
> >   bpf_subprog(...);
> > }
> > 
> > Such configuration is not supported by the verifier yet.
> > We've been discussing it for some time, but no work has started,
> > since there was no concrete use case.
> > If you can work on adding support for it everyone will benefit.
> > 
> > Could you please consider doing that as a step forward?
> 
> This definitely looks to be an interesting addition and I am happy to look into
> that further.  I have a few questions that I hope you can shed light on...
> 
> 1. What context would bpf_subprog execute with?  If it can be called from
>    multiple different prog types, would it see whichever context the caller
>    is executing with?  Or would you envision bpf_subprog to not be allowed to
>    access the execution context because it cannot know which one is in use?

bpf_subprog() won't be able to access 'ctx' pointer _if_ it's ambiguous.
The verifier is already smart enough to track all the data flow, so it's fine to
pass 'struct pt_regs *ctx' as long as it's accessed safely.
For example:
void bpf_subprog(int kind, struct pt_regs *ctx1, struct sched_switch_args *ctx2)
{
  if (kind == 1)
     bpf_printk("%d", ctx1->pc);
  if (kind == 2)
     bpf_printk("%d", ctx2->next_pid);
}

SEC("kprobe/__set_task_comm")
int bpf_prog_kprobe(struct pt_regs *ctx)
{
  bpf_subprog(1, ctx, NULL);
}

SEC("tracepoint/sched/sched_switch")
int bpf_prog_tracepoint(struct sched_switch_args *ctx)
{
  bpf_subprog(2, NULL, ctx);
}

The verifier should be able to prove that the above is correct.
It can do so already if s/ctx1/map_value1/, s/ctx2/map_value2/
What's missing is an ability to have more than one 'starting' or 'root caller'
program.

Now replace SEC("tracepoint/sched/sched_switch") with SEC("cgroup/ingress")
and it becomes clear that the BPF_PROG_TYPE_PROBE approach is not good enough, right?
Folks are already sharing the bpf progs between kprobe and networking.
Currently it's done via code duplication and actual sharing happens via maps.
That's not ideal, hence we've been discussing a 'shared library' approach for
quite some time. We need a way to support common bpf functions that can be called
from networking and from tracing programs.

> 2. Given that BPF programs are loaded with a specification of the prog type, 
>    how would one load a code construct like the one you outline above?  How can
>    you load a BPF function and have it be used as subprog from programs that
>    are loaded separately?  I.e. in the sample above, if bpf_subprog is loaded
>    as part of loading bpf_prog_kprobe (prog type KPROBE), how can it be
>    referenced from bpf_prog_tracepoint (prog type TRACEPOINT) which would be
>    loaded separately?

The api to support shared libraries was discussed, but not yet implemented.
We've discussed an 'FD + name' approach.
FD identifies a loaded program (which is a root program + a set of subprogs)
and other programs can be loaded at any time later. The BPF_CALL instructions
in such a later program would refer to older subprogs via FD + name.
Note that both tracing and networking progs can be part of a single elf file.
libbpf has to be smart enough to load progs into the kernel step by step,
reusing subprogs that are already loaded.

Note that libbpf work for such a feature can begin _without_ kernel changes.
libbpf can pass bpf_prog_kprobe+bpf_subprog as a single program first,
then pass bpf_prog_tracepoint+bpf_subprog second (as a separate program).
The bpf_subprog will be duplicated and JITed twice, but sharing will happen
because data structures (maps, global and static data) will be shared.
This way the support for 'pseudo shared libraries' can begin.
(later accompanied by FD+name kernel support)
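
For the 'sharing happens via maps' part that works today, a minimal
sketch (the names are made up): the function body is duplicated into
each program when they are loaded separately, but both copies operate
on the same map:

#include <linux/bpf.h>
#include <linux/ptrace.h>
#include "bpf_helpers.h"

struct bpf_map_def SEC("maps") shared_cnt = {
	.type = BPF_MAP_TYPE_ARRAY,
	.key_size = sizeof(__u32),
	.value_size = sizeof(__u64),
	.max_entries = 1,
};

/* duplicated into each program today; the shared state is the map */
static __always_inline void bpf_subprog(void)
{
	__u32 key = 0;
	__u64 *cnt = bpf_map_lookup_elem(&shared_cnt, &key);

	if (cnt)
		__sync_fetch_and_add(cnt, 1);
}

SEC("kprobe/__set_task_comm")
int bpf_prog_kprobe(struct pt_regs *ctx)
{
	bpf_subprog();
	return 0;
}

SEC("tracepoint/sched/sched_switch")
int bpf_prog_tracepoint(void *ctx)
{
	bpf_subprog();
	return 0;
}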

There are other things we discussed. Ideally the body of bpf_subprog()
wouldn't need to be kept around for future verification when this bpf
function is called by a different program. The idea was to
use BTF and a mechanism similar to the ongoing 'bounded loop' work.
So the verifier can analyze bpf_subprog() once and reuse that knowledge
for dynamic linking with progs that will be loaded later.
This is more long term work.
A simple short-term approach would be to verify the full call chain every time
the subprog (bpf function) is reused.

All that aside, kernel support for shared libraries is an awesome
feature to have and a bunch of folks want to see it happen, but
it's not a blocker for 'dtrace to bpf' user space work.
libbpf can be taught to do this 'pseudo shared library' feature
while the 'dtrace to bpf' side doesn't need to do anything special.
It can generate a normal elf file with bpf functions calling each other
and have tracing, kprobes, etc in one .c file.
Or don't generate a .c file if you don't want to use clang/llvm.
If you think "dtrace to bpf" can generate bpf directly then go for that.
All such decisions are in user space and there is a freedom to course
correct when direct bpf generation will turn out to be underperforming
comparing to llvm generated code.



* Re: [RFC PATCH 00/11] bpf, trace, dtrace: DTrace BPF program type implementation and sample use
  2019-05-23  5:46                 ` Kris Van Hees
@ 2019-05-23 21:13                   ` Alexei Starovoitov
  2019-05-23 23:02                     ` Steven Rostedt
  2019-05-24  4:05                     ` Kris Van Hees
  0 siblings, 2 replies; 54+ messages in thread
From: Alexei Starovoitov @ 2019-05-23 21:13 UTC (permalink / raw)
  To: Kris Van Hees
  Cc: Steven Rostedt, netdev, bpf, dtrace-devel, linux-kernel,
	mhiramat, acme, ast, daniel, peterz

On Thu, May 23, 2019 at 01:46:10AM -0400, Kris Van Hees wrote:
> 
> I think there is a difference between a solution and a good solution.  Adding
> a lot of knowledge in the userspace component about how things are implemented
> at the kernel level makes for a more fragile infrastructure and involves
> breaking down well established boundaries in DTrace that are part of the design
> specifically to ensure that userspace doesn't need to depend on such intimate
> knowledge.

argh. see more below. This is fundamental disagreement.

> > > Another advantage of being able to operate on a more abstract probe concept
> > > that is not tied to a specific probe type is that the userspace component does
> > > not need to know about the implementation details of the specific probes.
> > 
> > If that is indeed the case, then dtrace is broken _by design_
> > and nothing on the kernel side can fix it.
> > 
> > bpf prog attached to NMI is running in NMI.
> > That is very different execution context vs kprobe.
> > kprobe execution context is also different from syscall.
> > 
> > The user writing the script has to be aware in what context
> > that script will be executing.
> 
> The design behind DTrace definitely recognizes that different types of probes
> operate in different ways and have different data associated with them.  That
> is why probes (in legacy DTrace) are managed by providers, one for each type
> of probe.  The providers handle the specifics of a probe type, and provide a
> generic probe API to the processing component of DTrace:
> 
>     SDT probes -----> SDT provider -------+
>                                           |
>     FBT probes -----> FBT provider -------+--> DTrace engine
>                                           |
>     syscall probes -> systrace provider --+
> 
> This means that the DTrace processing component can be implemented based on a
> generic probe concept, and the providers will take care of the specifics.  In
> that sense, it is similar to so many other parts of the kernel where a generic
> API is exposed so that higher level components don't need to know implementation
> details.
> 
> In DTrace, people write scripts based on UAPI-style interfaces and they don't
> have to concern themselves with e.g. knowing how to get the value of the 3rd
> argument that was passed by the firing probe.  All they need to know is that
> the probe will have a 3rd argument, and that the 3rd argument to *any* probe
> can be accessed as 'arg2' (or args[2] for typed arguments, if the provider is
> capable of providing that).  Different probes have different ways of passing
> arguments, and only the provider code for each probe type needs to know how
> to retrieve the argument values.
> 
> Does this help bring clarity to the reasons why an abstract (generic) probe
> concept is part of DTrace's design?

It actually sounds worse than I thought.
If a dtrace script reads some kernel field, it's considered to be uapi?! ouch.
It means the dtrace development philosophy is incompatible with the linux kernel.
There is no way the kernel is going to bend itself to make dtrace scripts
runnable if that means that all dtrace-accessible fields become uapi.

In stark contrast to dtrace, all bpf tracing scripts (bcc scripts
and bpftrace scripts) are written for a specific kernel with intimate
knowledge of kernel details. They do break all the time when the kernel changes.
kprobe and tracepoints are NOT uapi. All of them can change.
tracepoints are a bit more stable than kprobes, but they are not uapi.



* Re: [RFC PATCH 00/11] bpf, trace, dtrace: DTrace BPF program type implementation and sample use
  2019-05-23 21:13                   ` Alexei Starovoitov
@ 2019-05-23 23:02                     ` Steven Rostedt
  2019-05-24  0:31                       ` Alexei Starovoitov
  2019-05-24  5:10                       ` Kris Van Hees
  2019-05-24  4:05                     ` Kris Van Hees
  1 sibling, 2 replies; 54+ messages in thread
From: Steven Rostedt @ 2019-05-23 23:02 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Kris Van Hees, netdev, bpf, dtrace-devel, linux-kernel, mhiramat,
	acme, ast, daniel, peterz

On Thu, 23 May 2019 14:13:31 -0700
Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> > In DTrace, people write scripts based on UAPI-style interfaces and they don't
> > have to concern themselves with e.g. knowing how to get the value of the 3rd
> > argument that was passed by the firing probe.  All they need to know is that
> > the probe will have a 3rd argument, and that the 3rd argument to *any* probe
> > can be accessed as 'arg2' (or args[2] for typed arguments, if the provider is
> > capable of providing that).  Different probes have different ways of passing
> > arguments, and only the provider code for each probe type needs to know how
> > to retrieve the argument values.
> > 
> > Does this help bring clarity to the reasons why an abstract (generic) probe
> > concept is part of DTrace's design?  
> 
> It actually sounds worse than I thought.
> If dtrace script reads some kernel field it's considered to be uapi?! ouch.
> It means dtrace development philosophy is incompatible with the linux kernel.
> There is no way kernel is going to bend itself to make dtrace scripts
> runnable if that means that all dtrace accessible fields become uapi.

Now from what I'm reading, it seems that the Dtrace layer may be
abstracting out fields from the kernel. This is actually something I
have been thinking about to solve the "tracepoint abi" issue. There are
usually some basic ideas that recur: an interrupt goes off, there's a
handler, etc. We could abstract that out so that we trace when an
interrupt goes off and the handler happens, and record the vector
number, and/or what device it was for. We have tracepoints in the
kernel that do this, but they do depend a bit on the implementation.
Now, if we could get a layer that abstracts this information away from
the implementation, then I think that's a *good* thing.
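
To make that concrete, here is a minimal sketch of the kind of stable
record such a layer could export (every name below is hypothetical,
invented purely for illustration; the implementation-specific
tracepoints would merely fill it in):

/*
 * Hypothetical sketch only: a stable record for "an interrupt fired",
 * decoupled from whichever implementation-specific tracepoint supplied
 * the data.  None of these names exist in the kernel today.
 */
struct abstract_irq_event {
	unsigned int	vector;		/* vector number that fired */
	unsigned int	cpu;		/* CPU that took the interrupt */
	char		device[16];	/* device name, if known */
};

/* One mapping per implementation; e.g. from irq_handler_entry. */
static void fill_abstract_irq_event(struct abstract_irq_event *ev,
				    int irq, const char *name)
{
	ev->vector = irq;
	ev->cpu = smp_processor_id();
	strscpy(ev->device, name, sizeof(ev->device));
}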


> 
> In stark contrast to dtrace all of bpf tracing scripts (bcc scripts
> and bpftrace scripts) are written for specific kernel with intimate
> knowledge of kernel details. They do break all the time when kernel changes.
> kprobe and tracepoints are NOT uapi. All of them can change.
> tracepoints are a bit more stable than kprobes, but they are not uapi.

I wish that was totally true, but tracepoints *can* be an abi. I had
code reverted because powertop required one to be in a specific format.
To this day, the wakeup event has a "success" field that writes in a
hardcoded "1", because there are tools that depend on it, and they only
work if there's a success field and the value is 1.

I do definitely agree with you that the Dtrace code shall *never* keep
the kernel from changing. That is, if Dtrace depends on something that
changes (let's say we record priority of a task, but someday priority
is replaced by something else), then Dtrace must cope with it. It must
not be a blocker like user space applications can be.


-- Steve

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 00/11] bpf, trace, dtrace: DTrace BPF program type implementation and sample use
  2019-05-23 23:02                     ` Steven Rostedt
@ 2019-05-24  0:31                       ` Alexei Starovoitov
  2019-05-24  1:57                         ` Steven Rostedt
  2019-05-24  5:10                       ` Kris Van Hees
  1 sibling, 1 reply; 54+ messages in thread
From: Alexei Starovoitov @ 2019-05-24  0:31 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Kris Van Hees, netdev, bpf, dtrace-devel, linux-kernel, mhiramat,
	acme, ast, daniel, peterz

On Thu, May 23, 2019 at 07:02:43PM -0400, Steven Rostedt wrote:
> On Thu, 23 May 2019 14:13:31 -0700
> Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> 
> > > In DTrace, people write scripts based on UAPI-style interfaces and they don't
> > > have to concern themselves with e.g. knowing how to get the value of the 3rd
> > > argument that was passed by the firing probe.  All they need to know is that
> > > the probe will have a 3rd argument, and that the 3rd argument to *any* probe
> > > can be accessed as 'arg2' (or args[2] for typed arguments, if the provider is
> > > capable of providing that).  Different probes have different ways of passing
> > > arguments, and only the provider code for each probe type needs to know how
> > > to retrieve the argument values.
> > > 
> > > Does this help bring clarity to the reasons why an abstract (generic) probe
> > > concept is part of DTrace's design?  
> > 
> > It actually sounds worse than I thought.
> > If dtrace script reads some kernel field it's considered to be uapi?! ouch.
> > It means dtrace development philosophy is incompatible with the linux kernel.
> > There is no way kernel is going to bend itself to make dtrace scripts
> > runnable if that means that all dtrace accessible fields become uapi.
> 
> Now from what I'm reading, it seams that the Dtrace layer may be
> abstracting out fields from the kernel. This is actually something I
> have been thinking about to solve the "tracepoint abi" issue. There's
> usually basic ideas that happen. An interrupt goes off, there's a
> handler, etc. We could abstract that out that we trace when an
> interrupt goes off and the handler happens, and record the vector
> number, and/or what device it was for. We have tracepoints in the
> kernel that do this, but they do depend a bit on the implementation.
> Now, if we could get a layer that abstracts this information away from
> the implementation, then I think that's a *good* thing.

I don't like this deferred irq idea at all.
Abstracting details from the users is _never_ a good idea.
A ton of people use bcc scripts and bpftrace because they want those details.
They need to know what the kernel is doing to make better decisions.
Delaying the irq record is the opposite.

> > 
> > In stark contrast to dtrace all of bpf tracing scripts (bcc scripts
> > and bpftrace scripts) are written for specific kernel with intimate
> > knowledge of kernel details. They do break all the time when kernel changes.
> > kprobe and tracepoints are NOT uapi. All of them can change.
> > tracepoints are a bit more stable than kprobes, but they are not uapi.
> 
> I wish that was totally true, but tracepoints *can* be an abi. I had
> code reverted because powertop required one to be a specific format. To
> this day, the wakeup event has a "success" field that writes in a
> hardcoded "1", because there's tools that depend on it, and they only
> work if there's a success field and the value is 1.

I really think that you should put powertop nightmares to rest.
That was long ago. The kernel is different now.
Linus made it clear several times that it is ok to change _all_ tracepoints.
Period. Some maintainers somehow still don't believe that they can do it.

Some tracepoints are used more than others and more people will
complain: "ohh I need to change my script" when that tracepoint changes.
But the kernel development is not going to be hampered by a tracepoint.
No matter how widespread its usage in scripts.

Sometimes that pain of change can be mitigated a bit. Like that
'success' field example, but tracepoints still change.
Meaningful value before vs hardcoded constant is still a breakage for
some scripts.

> I do definitely agree with you that the Dtrace code shall *never* keep
> the kernel from changing. That is, if Dtrace depends on something that
> changes (let's say we record priority of a task, but someday priority
> is replaced by something else), then Dtrace must cope with it. It must
> not be a blocker like user space applications can be.
> 
> 
> -- Steve

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 00/11] bpf, trace, dtrace: DTrace BPF program type implementation and sample use
  2019-05-24  0:31                       ` Alexei Starovoitov
@ 2019-05-24  1:57                         ` Steven Rostedt
  2019-05-24  2:08                           ` Alexei Starovoitov
  0 siblings, 1 reply; 54+ messages in thread
From: Steven Rostedt @ 2019-05-24  1:57 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Kris Van Hees, netdev, bpf, dtrace-devel, linux-kernel, mhiramat,
	acme, ast, daniel, peterz

On Thu, 23 May 2019 17:31:50 -0700
Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:


> > Now from what I'm reading, it seems that the Dtrace layer may be
> > abstracting out fields from the kernel. This is actually something I
> > have been thinking about to solve the "tracepoint abi" issue. There's
> > usually basic ideas that happen. An interrupt goes off, there's a
> > handler, etc. We could abstract that out that we trace when an
> > interrupt goes off and the handler happens, and record the vector
> > number, and/or what device it was for. We have tracepoints in the
> > kernel that do this, but they do depend a bit on the implementation.
> > Now, if we could get a layer that abstracts this information away from
> > the implementation, then I think that's a *good* thing.  
> 
> I don't like this deferred irq idea at all.

What do you mean deferred?

> Abstracting details from the users is _never_ a good idea.

Really? Most everything we do is to abstract details from the user. The
key is to make the abstraction more meaningful than the raw data.

> A ton of people use bcc scripts and bpftrace because they want those details.
> They need to know what kernel is doing to make better decisions.
> Delaying irq record is the opposite.

I never said anything about delaying the record. Just getting the
information that is needed.

> > 
> > I wish that was totally true, but tracepoints *can* be an abi. I had
> > code reverted because powertop required one to be a specific
> > format. To this day, the wakeup event has a "success" field that
> > writes in a hardcoded "1", because there's tools that depend on it,
> > and they only work if there's a success field and the value is 1.  
> 
> I really think that you should put powertop nightmares to rest.
> That was long ago. The kernel is different now.

Is it?

> Linus made it clear several times that it is ok to change _all_
> tracepoints. Period. Some maintainers somehow still don't believe
> that they can do it.

From what I remember him saying several times, you can change
all tracepoints, but if it breaks a tool that is useful, then that
change will get reverted. He will allow you to go and fix that tool and
bring back the change (which was the solution to powertop).

> 
> Some tracepoints are used more than others and more people will
> complain: "ohh I need to change my script" when that tracepoint
> changes. But the kernel development is not going to be hampered by a
> tracepoint. No matter how widespread its usage in scripts.

That's because we'll treat bpf (and Dtrace) scripts like modules (no
abi), at least we better. But if there's a tool that doesn't use the
script and reads the tracepoint directly via perf, then that's a
different story.

-- Steve

> 
> Sometimes that pain of change can be mitigated a bit. Like that
> 'success' field example, but tracepoints still change.
> Meaningful value before vs hardcoded constant is still a breakage for
> some scripts.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 00/11] bpf, trace, dtrace: DTrace BPF program type implementation and sample use
  2019-05-24  1:57                         ` Steven Rostedt
@ 2019-05-24  2:08                           ` Alexei Starovoitov
  2019-05-24  2:40                             ` Steven Rostedt
  2019-05-24  5:26                             ` Kris Van Hees
  0 siblings, 2 replies; 54+ messages in thread
From: Alexei Starovoitov @ 2019-05-24  2:08 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Kris Van Hees, netdev, bpf, dtrace-devel, linux-kernel, mhiramat,
	acme, ast, daniel, peterz

On Thu, May 23, 2019 at 09:57:37PM -0400, Steven Rostedt wrote:
> On Thu, 23 May 2019 17:31:50 -0700
> Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> 
> 
> > > Now from what I'm reading, it seems that the Dtrace layer may be
> > > abstracting out fields from the kernel. This is actually something I
> > > have been thinking about to solve the "tracepoint abi" issue. There's
> > > usually basic ideas that happen. An interrupt goes off, there's a
> > > handler, etc. We could abstract that out that we trace when an
> > > interrupt goes off and the handler happens, and record the vector
> > > number, and/or what device it was for. We have tracepoints in the
> > > kernel that do this, but they do depend a bit on the implementation.
> > > Now, if we could get a layer that abstracts this information away from
> > > the implementation, then I think that's a *good* thing.  
> > 
> > I don't like this deferred irq idea at all.
> 
> What do you mean deferred?

that's how I interpreted your proposal:
"interrupt goes off and the handler happens, and record the vector number"
It's not a good thing to tell about the irq later.
Just like saying let's record a perf counter event and report it later.

> > Abstracting details from the users is _never_ a good idea.
> 
> Really? Most everything we do is to abstract details from the user. The
> key is to make the abstraction more meaningful than the raw data.
> 
> > A ton of people use bcc scripts and bpftrace because they want those details.
> > They need to know what kernel is doing to make better decisions.
> > Delaying irq record is the opposite.
> 
> I never said anything about delaying the record. Just getting the
> information that is needed.
> 
> > > 
> > > I wish that was totally true, but tracepoints *can* be an abi. I had
> > > code reverted because powertop required one to be a specific
> > > format. To this day, the wakeup event has a "success" field that
> > > writes in a hardcoded "1", because there's tools that depend on it,
> > > and they only work if there's a success field and the value is 1.  
> > 
> > I really think that you should put powertop nightmares to rest.
> > That was long ago. The kernel is different now.
> 
> Is it?
> 
> > Linus made it clear several times that it is ok to change _all_
> > tracepoints. Period. Some maintainers somehow still don't believe
> > that they can do it.
> 
> From what I remember him saying several times, is that you can change
> all tracepoints, but if it breaks a tool that is useful, then that
> change will get reverted. He will allow you to go and fix that tool and
> bring back the change (which was the solution to powertop).

my interpretation is different.
We changed tracepoints. It broke scripts. People changed scripts.

> 
> > 
> > Some tracepoints are used more than others and more people will
> > complain: "ohh I need to change my script" when that tracepoint
> > changes. But the kernel development is not going to be hampered by a
> > tracepoint. No matter how widespread its usage in scripts.
> 
> That's because we'll treat bpf (and Dtrace) scripts like modules (no
> abi), at least we better. But if there's a tool that doesn't use the
> script and reads the tracepoint directly via perf, then that's a
> different story.

absolutely not.
tracepoint is a tracepoint. It can change regardless of what
is using it and how.


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 00/11] bpf, trace, dtrace: DTrace BPF program type implementation and sample use
  2019-05-24  2:08                           ` Alexei Starovoitov
@ 2019-05-24  2:40                             ` Steven Rostedt
  2019-05-24  5:26                             ` Kris Van Hees
  1 sibling, 0 replies; 54+ messages in thread
From: Steven Rostedt @ 2019-05-24  2:40 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Kris Van Hees, netdev, bpf, dtrace-devel, linux-kernel, mhiramat,
	acme, ast, daniel, peterz, Linus Torvalds, Al Viro


[ Added Linus and Al ]

On Thu, 23 May 2019 19:08:51 -0700
Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> > > > 
> > > > I wish that was totally true, but tracepoints *can* be an abi. I had
> > > > code reverted because powertop required one to be a specific
> > > > format. To this day, the wakeup event has a "success" field that
> > > > writes in a hardcoded "1", because there's tools that depend on it,
> > > > and they only work if there's a success field and the value is 1.    
> > > 
> > > I really think that you should put powertop nightmares to rest.
> > > That was long ago. The kernel is different now.  
> > 
> > Is it?
> >   
> > > Linus made it clear several times that it is ok to change _all_
> > > tracepoints. Period. Some maintainers somehow still don't believe
> > > that they can do it.  
> > 
> > From what I remember him saying several times, is that you can change
> > all tracepoints, but if it breaks a tool that is useful, then that
> > change will get reverted. He will allow you to go and fix that tool and
> > bring back the change (which was the solution to powertop).  
> 
> my interpretation is different.
> We changed tracepoints. It broke scripts. People changed scripts.

Scripts are different than binary tools.

> 
> >   
> > > 
> > > Some tracepoints are used more than others and more people will
> > > complain: "ohh I need to change my script" when that tracepoint
> > > changes. But the kernel development is not going to be hampered by a
> > > tracepoint. No matter how widespread its usage in scripts.  
> > 
> > That's because we'll treat bpf (and Dtrace) scripts like modules (no
> > abi), at least we better. But if there's a tool that doesn't use the
> > script and reads the tracepoint directly via perf, then that's a
> > different story.  
> 
> absolutely not.
> tracepoint is a tracepoint. It can change regardless of what
> and how is using it.

Instead of putting words into Linus's mouth, I'll just let him speak
for himself. If a useful tool that reads a tracepoint breaks because we
changed the tracepoint, and Linus is fine with that, then great: we can
start adding them to VFS and not worry about them being an ABI.

-- Steve



^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 00/11] bpf, trace, dtrace: DTrace BPF program type implementation and sample use
  2019-05-23 21:13                   ` Alexei Starovoitov
  2019-05-23 23:02                     ` Steven Rostedt
@ 2019-05-24  4:05                     ` Kris Van Hees
  2019-05-24 13:28                       ` Steven Rostedt
  1 sibling, 1 reply; 54+ messages in thread
From: Kris Van Hees @ 2019-05-24  4:05 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Kris Van Hees, Steven Rostedt, netdev, bpf, dtrace-devel,
	linux-kernel, mhiramat, acme, ast, daniel, peterz

On Thu, May 23, 2019 at 02:13:31PM -0700, Alexei Starovoitov wrote:
> On Thu, May 23, 2019 at 01:46:10AM -0400, Kris Van Hees wrote:
> > 
> > I think there is a difference between a solution and a good solution.  Adding
> > a lot of knowledge in the userspace component about how things are implemented
> > at the kernel level makes for a more fragile infrastructure and involves
> > breaking down well established boundaries in DTrace that are part of the design
> > specifically to ensure that userspace doesn't need to depend on such intimate
> > knowledge.
> 
> argh. see more below. This is fundamental disagreement.
> 
> > > > Another advantage of being able to operate on a more abstract probe concept
> > > > that is not tied to a specific probe type is that the userspace component does
> > > > not need to know about the implementation details of the specific probes.
> > > 
> > > If that is indeed the case that dtrace is broken _by design_
> > > and nothing on the kernel side can fix it.
> > > 
> > > bpf prog attached to NMI is running in NMI.
> > > That is very different execution context vs kprobe.
> > > kprobe execution context is also different from syscall.
> > > 
> > > The user writing the script has to be aware in what context
> > > that script will be executing.
> > 
> > The design behind DTrace definitely recognizes that different types of probes
> > operate in different ways and have different data associated with them.  That
> > is why probes (in legacy DTrace) are managed by providers, one for each type
> > of probe.  The providers handle the specifics of a probe type, and provide a
> > generic probe API to the processing component of DTrace:
> > 
> >     SDT probes -----> SDT provider -------+
> >                                           |
> >     FBT probes -----> FBT provider -------+--> DTrace engine
> >                                           |
> >     syscall probes -> systrace provider --+
> > 
> > This means that the DTrace processing component can be implemented based on a
> > generic probe concept, and the providers will take care of the specifics.  In
> > that sense, it is similar to so many other parts of the kernel where a generic
> > API is exposed so that higher level components don't need to know implementation
> > details.
> > 
> > In DTrace, people write scripts based on UAPI-style interfaces and they don't
> > have to concern themselves with e.g. knowing how to get the value of the 3rd
> > argument that was passed by the firing probe.  All they need to know is that
> > the probe will have a 3rd argument, and that the 3rd argument to *any* probe
> > can be accessed as 'arg2' (or args[2] for typed arguments, if the provider is
> > capable of providing that).  Different probes have different ways of passing
> > arguments, and only the provider code for each probe type needs to know how
> > to retrieve the argument values.
> > 
> > Does this help bring clarity to the reasons why an abstract (generic) probe
> > concept is part of DTrace's design?
> 
> It actually sounds worse than I thought.
> If dtrace script reads some kernel field it's considered to be uapi?! ouch.
> It means dtrace development philosophy is incompatible with the linux kernel.
> There is no way kernel is going to bend itself to make dtrace scripts
> runnable if that means that all dtrace accessible fields become uapi.

No, no, that is not at all what I am saying.  In DTrace, the particulars of
how you get to e.g. probe arguments or current task information are not
something that script writers need to concern themselves about.  Similar to
how BPF contexts have a public (uapi) declaration and a kernel-level context
declaration that is used to actually implement accessing the data (using the
is_valid_access and convert_ctx_access functions that prog types implement).
DTrace exposes an abstract probe entity to script writers where they can
access probe arguments as arg0 through arg9.  Nothing in the userspace needs
to know how you obtain the value of those arguments.  So, scripts can be
written for any kind of probe, and the only information that is used to
verify programs is obtained from the abstract probe description (things like
its unique id, number of arguments, and possible type information for each
argument).  The knowledge of how to get to the value of the probe arguments
is only known at the level of the kernel, so that when the implementation of
the probe in the kernel is modified, the mapping from actual probe to abstract
representation of the probe (in the kernel) can be modified along with it,
and userspace won't even notice that anything changed.
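
To make that split concrete, a minimal sketch (the struct and field
names here are illustrative only, not taken from the posted patches):

/*
 * Illustrative sketch: what script writers would program against, an
 * abstract probe plus its arguments, independent of the probe type
 * that fired.
 */
struct dtrace_bpf_context {
	__u32	probe_id;	/* which probe fired */
	__u64	argv[10];	/* arg0 .. arg9 */
};

/*
 * Illustrative sketch: what the kernel would actually use.
 * is_valid_access() admits loads from the public struct above, and
 * convert_ctx_access() rewrites them into accesses to the
 * probe-type-specific state kept here.
 */
struct dtrace_bpf_ctx_kern {
	struct pt_regs	*regs;		/* e.g. from a kprobe */
	u64		*tp_args;	/* e.g. from a tracepoint */
};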

Many parts of the kernel work the same way.  E.g. file system implementations
change, yet the API to use the file systems remains the same.

> In stark contrast to dtrace all of bpf tracing scripts (bcc scripts
> and bpftrace scripts) are written for specific kernel with intimate
> knowledge of kernel details. They do break all the time when kernel changes.
> kprobe and tracepoints are NOT uapi. All of them can change.
> tracepoints are a bit more stable than kprobes, but they are not uapi.

Sure, and I understand why.  And in DTrace, there is an extra layer in the
design of the tracing framework that isolates implementation changes at the
level of the probes from what is exposed to userspace.  That way changes can
be made at the kernel level without worrying about the implications for
userspace.  Of course one can simply not care about userspace altogether,
whether there is an abstraction in place or not, but the added bonus of the
abstraction is that not caring about userspace won't affect userspace much :)

By the way, the point behind this design is also that it doesn't enforce the
use of that abstraction.  Nothing prevents people from using probes directly.
But it provides a generic probe concept that isolates the probe implementation
details from the tracing tool.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 00/11] bpf, trace, dtrace: DTrace BPF program type implementation and sample use
  2019-05-23 23:02                     ` Steven Rostedt
  2019-05-24  0:31                       ` Alexei Starovoitov
@ 2019-05-24  5:10                       ` Kris Van Hees
  1 sibling, 0 replies; 54+ messages in thread
From: Kris Van Hees @ 2019-05-24  5:10 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Alexei Starovoitov, Kris Van Hees, netdev, bpf, dtrace-devel,
	linux-kernel, mhiramat, acme, ast, daniel, peterz

On Thu, May 23, 2019 at 07:02:43PM -0400, Steven Rostedt wrote:
> On Thu, 23 May 2019 14:13:31 -0700
> Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> 
> > > In DTrace, people write scripts based on UAPI-style interfaces and they don't
> > > have to concern themselves with e.g. knowing how to get the value of the 3rd
> > > argument that was passed by the firing probe.  All they need to know is that
> > > the probe will have a 3rd argument, and that the 3rd argument to *any* probe
> > > can be accessed as 'arg2' (or args[2] for typed arguments, if the provider is
> > > capable of providing that).  Different probes have different ways of passing
> > > arguments, and only the provider code for each probe type needs to know how
> > > to retrieve the argument values.
> > > 
> > > Does this help bring clarity to the reasons why an abstract (generic) probe
> > > concept is part of DTrace's design?  
> > 
> > It actually sounds worse than I thought.
> > If dtrace script reads some kernel field it's considered to be uapi?! ouch.
> > It means dtrace development philosophy is incompatible with the linux kernel.
> > There is no way kernel is going to bend itself to make dtrace scripts
> > runnable if that means that all dtrace accessible fields become uapi.
> 
> Now from what I'm reading, it seems that the Dtrace layer may be
> abstracting out fields from the kernel. This is actually something I
> have been thinking about to solve the "tracepoint abi" issue. There's
> usually basic ideas that happen. An interrupt goes off, there's a
> handler, etc. We could abstract that out that we trace when an
> interrupt goes off and the handler happens, and record the vector
> number, and/or what device it was for. We have tracepoints in the
> kernel that do this, but they do depend a bit on the implementation.
> Now, if we could get a layer that abstracts this information away from
> the implementation, then I think that's a *good* thing.

This is indeed what DTrace uses.  When a probe triggers (be it kprobe, network
event, tracepoint, etc), the core execution component is invoked with a probe
id, and a set of data items.  In its current implementation (not BPF based),
the probe triggers which causes a probe type specific handler to be called in
the provider module for that probe type.  The handler determines the probe id
(e.g. for a kprobe that might be based on the program counter value), and it
also prepares the list of data items (which we call arguments to the probe).
It then calls the execution component with the probe id and arguments.

All probe types are handled by a provider, and each provider has a handler
that determines the probe id and arguments, and then calls the execution
component.  So, at the level of the execution component all probes look the
same.
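
As a sketch of that flow for a function-boundary provider (all names
here are hypothetical, for illustration only; instruction_pointer()
and regs_get_kernel_argument() are existing kernel helpers):

/*
 * Hypothetical sketch: probe-type-specific on the way in, generic on
 * the way out.  Every provider handler ends in the same call.
 */
static void fbt_probe_handler(struct pt_regs *regs)
{
	u32 probe_id;
	u64 argv[10] = { 0 };

	/* Probe-type-specific part: identify the probe that fired... */
	probe_id = fbt_lookup_probe_id(instruction_pointer(regs));

	/* ...and marshal its arguments from the register state. */
	argv[0] = regs_get_kernel_argument(regs, 0);
	argv[1] = regs_get_kernel_argument(regs, 1);
	argv[2] = regs_get_kernel_argument(regs, 2);

	/* Generic part: the execution component sees only this. */
	dtrace_probe(probe_id, argv);
}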

Scripts commonly operate on the abstract probe, but script writers can opt
to do more fancy things that do depend on probe implementation details.  In
that case, there is of course no guarantee that the script will keep working
as kernel releases change.

> > In stark contrast to dtrace all of bpf tracing scripts (bcc scripts
> > and bpftrace scripts) are written for specific kernel with intimate
> > knowledge of kernel details. They do break all the time when kernel changes.
> > kprobe and tracepoints are NOT uapi. All of them can change.
> > tracepoints are a bit more stable than kprobes, but they are not uapi.
> 
> I wish that was totally true, but tracepoints *can* be an abi. I had
> code reverted because powertop required one to be a specific format. To
> this day, the wakeup event has a "success" field that writes in a
> hardcoded "1", because there's tools that depend on it, and they only
> work if there's a success field and the value is 1.
> 
> I do definitely agree with you that the Dtrace code shall *never* keep
> the kernel from changing. That is, if Dtrace depends on something that
> changes (let's say we record priority of a task, but someday priority
> is replaced by something else), then Dtrace must cope with it. It must
> not be a blocker like user space applications can be.

I fully agree that DTrace or any other tool should never prevent changes from
happening at the kernel level.  Even in its current (non-BPF) implementation
it has had to cope with changes.  The abstraction through the providers has
been a real benefit for that because changes to probe mechanisms can be dealt
with at the level of the providers, and everything else can remain the same
because the abstraction "hides" the implementation details.

	Kris

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 00/11] bpf, trace, dtrace: DTrace BPF program type implementation and sample use
  2019-05-24  2:08                           ` Alexei Starovoitov
  2019-05-24  2:40                             ` Steven Rostedt
@ 2019-05-24  5:26                             ` Kris Van Hees
  1 sibling, 0 replies; 54+ messages in thread
From: Kris Van Hees @ 2019-05-24  5:26 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Steven Rostedt, Kris Van Hees, netdev, bpf, dtrace-devel,
	linux-kernel, mhiramat, acme, ast, daniel, peterz

On Thu, May 23, 2019 at 07:08:51PM -0700, Alexei Starovoitov wrote:
> On Thu, May 23, 2019 at 09:57:37PM -0400, Steven Rostedt wrote:
> > On Thu, 23 May 2019 17:31:50 -0700
> > Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> > 
> > 
> > > > Now from what I'm reading, it seems that the Dtrace layer may be
> > > > abstracting out fields from the kernel. This is actually something I
> > > > have been thinking about to solve the "tracepoint abi" issue. There's
> > > > usually basic ideas that happen. An interrupt goes off, there's a
> > > > handler, etc. We could abstract that out that we trace when an
> > > > interrupt goes off and the handler happens, and record the vector
> > > > number, and/or what device it was for. We have tracepoints in the
> > > > kernel that do this, but they do depend a bit on the implementation.
> > > > Now, if we could get a layer that abstracts this information away from
> > > > the implementation, then I think that's a *good* thing.  
> > > 
> > > I don't like this deferred irq idea at all.
> > 
> > What do you mean deferred?
> 
> that's how I interpreted your proposal: 
> "interrupt goes off and the handler happens, and record the vector number"
> It's not a good thing to tell about the irq later.
> Just like saying let's record a perf counter event and report it later.

The abstraction I mentioned does not defer anything - it merely provides a way
for all probe events to be processed as a generic probe with a set of values
associated with it (e.g. syscall arguments for a syscall entry probe).  The
program that implements what needs to happen when that probe fires still does
whatever is necessary to collect information, and dump data in the output
buffers before execution continues.

I could trace entry into a syscall by using a syscall entry tracepoint or by
putting a kprobe on the syscall function itself.  I am usually interested in
whether the syscall was called, what the arguments were, and perhaps I need to
collect some other data related to it.  More often than not, both probes would
get the job done.  With an abstraction that hides the implementation details
of the probe mechanism itself, both cases are essentially the same.
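
In code, a sketch of what that sameness buys (using an abstract
context with a probe_id and argv[] as sketched earlier in this thread;
record() is a made-up helper):

/*
 * Illustrative only: one action, attachable behind either probe type,
 * because it consumes nothing but the abstract probe view.
 */
int dtrace_action_write(struct dtrace_bpf_context *ctx)
{
	u64 fd    = ctx->argv[0];	/* write(fd, buf, count) args, */
	u64 buf   = ctx->argv[1];	/* identical whether a kprobe or */
	u64 count = ctx->argv[2];	/* a tracepoint fired */

	return record(ctx->probe_id, fd, buf, count);
}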

> > > Abstracting details from the users is _never_ a good idea.
> > 
> > Really? Most everything we do is to abstract details from the user. The
> > key is to make the abstraction more meaningful than the raw data.
> > 
> > > A ton of people use bcc scripts and bpftrace because they want those details.
> > > They need to know what kernel is doing to make better decisions.
> > > Delaying irq record is the opposite.
> > 
> > I never said anything about delaying the record. Just getting the
> > information that is needed.
> > 
> > > > 
> > > > I wish that was totally true, but tracepoints *can* be an abi. I had
> > > > code reverted because powertop required one to be a specific
> > > > format. To this day, the wakeup event has a "success" field that
> > > > writes in a hardcoded "1", because there's tools that depend on it,
> > > > and they only work if there's a success field and the value is 1.  
> > > 
> > > I really think that you should put powertop nightmares to rest.
> > > That was long ago. The kernel is different now.
> > 
> > Is it?
> > 
> > > Linus made it clear several times that it is ok to change _all_
> > > tracepoints. Period. Some maintainers somehow still don't believe
> > > that they can do it.
> > 
> > From what I remember him saying several times, is that you can change
> > all tracepoints, but if it breaks a tool that is useful, then that
> > change will get reverted. He will allow you to go and fix that tool and
> > bring back the change (which was the solution to powertop).
> 
> my interpretation is different.
> We changed tracepoints. It broke scripts. People changed scripts.

In my world, the sequence is more like: tracepoints get changed, scripts
break, I fix the provider (abstraction), scripts work again.  Users really
appreciate that aspect because many of our users are not kernel experts.

> > > Some tracepoints are used more than others and more people will
> > > complain: "ohh I need to change my script" when that tracepoint
> > > changes. But the kernel development is not going to be hampered by a
> > > tracepoint. No matter how widespread its usage in scripts.
> > 
> > That's because we'll treat bpf (and Dtrace) scripts like modules (no
> > abi), at least we better. But if there's a tool that doesn't use the
> > script and reads the tracepoint directly via perf, then that's a
> > different story.
> 
> absolutely not.
> tracepoint is a tracepoint. It can change regardless of what
> and how is using it.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 00/11] bpf, trace, dtrace: DTrace BPF program type implementation and sample use
  2019-05-22 18:22     ` Kris Van Hees
  2019-05-22 19:55       ` Alexei Starovoitov
@ 2019-05-24  7:27       ` Peter Zijlstra
  1 sibling, 0 replies; 54+ messages in thread
From: Peter Zijlstra @ 2019-05-24  7:27 UTC (permalink / raw)
  To: Kris Van Hees
  Cc: Alexei Starovoitov, netdev, bpf, dtrace-devel, linux-kernel,
	rostedt, mhiramat, acme, ast, daniel

On Wed, May 22, 2019 at 02:22:15PM -0400, Kris Van Hees wrote:

> > Let me further NAK it for adding all sorts of garbage to the code --
> > we're not going to do gaps and stay_in_page nonsense.
> 
> Could you give some guidance in terms of an alternative?  The ring buffer code
> provides both non-contiguous page allocation support and a vmalloc-based
> allocation, and the vmalloc version certainly would avoid the entire gap and
> page boundary stuff.  But since the allocator is chosen at build time based on
> the arch capabilities, there is no way to select a specific memory allocator.
> I'd be happy to use an alternative approach that allows direct writing into
> the ring buffer.

So why can't you do what the regular perf does? Use an output iterator
that knows about the page breaks? See perf_output_put() for example.
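
For reference, a minimal sketch of that pattern (the record type is
made up; the handle/begin/put/end calls are the existing perf API):

struct my_record {
	u32	probe_id;
	u64	argv[10];
};

static int emit_record(struct perf_event *event, struct my_record *rec)
{
	struct perf_output_handle handle;
	int ret;

	/* Reserve space in the ring buffer... */
	ret = perf_output_begin(&handle, event, sizeof(*rec));
	if (ret)
		return ret;

	/* ...copy; the iterator deals with page boundaries... */
	perf_output_put(&handle, *rec);

	/* ...and commit. */
	perf_output_end(&handle);
	return 0;
}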

Anyway, I agree with Alexei and DaveM, get it working without/minimal
kernel changes first, and then we can talk about possible optimizations.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 00/11] bpf, trace, dtrace: DTrace BPF program type implementation and sample use
  2019-05-24  4:05                     ` Kris Van Hees
@ 2019-05-24 13:28                       ` Steven Rostedt
  0 siblings, 0 replies; 54+ messages in thread
From: Steven Rostedt @ 2019-05-24 13:28 UTC (permalink / raw)
  To: Kris Van Hees
  Cc: Alexei Starovoitov, netdev, bpf, dtrace-devel, linux-kernel,
	mhiramat, acme, ast, daniel, peterz

On Fri, 24 May 2019 00:05:27 -0400
Kris Van Hees <kris.van.hees@oracle.com> wrote:

> No, no, that is not at all what I am saying.  In DTrace, the particulars of
> how you get to e.g. probe arguments or current task information are not
> something that script writers need to concern themselves about.  Similar to
> how BPF contexts have a public (uapi) declaration and a kernel-level context
> declaration taht is used to actually implement accessing the data (using the
> is_valid_access and convert_ctx_access functions that prog types implement).
> DTrace exposes an abstract probe entity to script writers where they can
> access probe arguments as arg0 through arg9.  Nothing in the userspace needs
> to know how you obtain the value of those arguments.  So, scripts can be
> written for any kind of probe, and the only information that is used to
> verify programs is obtained from the abstract probe description (things like
> its unique id, number of arguments, and possible type information for each
> argument).  The knowledge of how to get to the value of the probe arguments
> is only known at the level of the kernel, so that when the implementation of
> the probe in the kernel is modified, the mapping from actual probe to abstract
> representation of the probe (in the kernel) can be modified along with it,
> and userspace won't even notice that anything changed.
> 
> Many parts of the kernel work the same way.  E.g. file system implementations
> change, yet the API to use the file systems remains the same.

Another example is actually the tracefs events directory. It represents
normal trace events (tracepoints), kprobes, uprobes, and synthetic
events. You don't need to know what they are to use them as soon as
they are created. You can even add triggers and such on top of each,
and there shouldn't be any difference.

-- Steve

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 00/11] bpf, trace, dtrace: DTrace BPF program type implementation and sample use
  2019-05-23 20:28                 ` Alexei Starovoitov
@ 2019-05-30 16:15                   ` Kris Van Hees
  2019-05-31 15:25                     ` Chris Mason
  2019-06-18  1:25                   ` Kris Van Hees
  1 sibling, 1 reply; 54+ messages in thread
From: Kris Van Hees @ 2019-05-30 16:15 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Kris Van Hees, netdev, bpf, dtrace-devel, linux-kernel, rostedt,
	mhiramat, acme, ast, daniel, peterz

On Thu, May 23, 2019 at 01:28:44PM -0700, Alexei Starovoitov wrote:
> On Thu, May 23, 2019 at 01:16:08AM -0400, Kris Van Hees wrote:
> > On Wed, May 22, 2019 at 01:16:25PM -0700, Alexei Starovoitov wrote:
> > > On Wed, May 22, 2019 at 12:12:53AM -0400, Kris Van Hees wrote:
> > > > 
> > > > Could you elaborate on why you believe my patches are not adding generic
> > > > features?  I can certainly agree that the DTrace-specific portions are less
> > > > generic (although they are certainly available for anyone to use), but I
> > > > don't quite understand why the new features are deemed non-generic and why
> > > > you believe no one else can use this?
> > > 
> > > And once again your statement above contradicts your own patches.
> > > The patch 2 adds new prog type BPF_PROG_TYPE_DTRACE and the rest of the patches
> > > are tying everything to it.
> > > This approach contradicts the bpf philosophy of being a generic execution engine
> > > and not favoring one program type vs another.
> > 
> > I am not sure I understand where you see a contradiction.  What I posted is
> > a generic feature, and sample code that demonstrates how it can be used based
> > on the use-case that I am currently working on.  So yes, the sample code is
> > very specific but it does not restrict the use of the cross-prog-type tail-call
> > feature.  That feature is designed to be generic.
> > 
> > Probes come in different types (kprobe, tracepoint, perf event, ...) and they
> > each have their own very specific data associated with them.  I agree 100%
> > with you on that.  And sometimes tracing makes use of those specifics.  But
> > even from looking at the implementation of the various probe related prog
> > types (and e.g. the list of helpers they each support) it shows that there is
> > a lot of commonality as well.  That common functionality is common to all the
> > probe program types, and that is where I suggest introducing a program type
> > that captures the common concept of a probe, so perhaps a better name would
> > be BPF_PROG_TYPE_PROBE.
> > 
> > The principle remains the same though...  I am proposing adding support for
> > program types that provide common functionality so that programs for various
> > program types can make use of the more generic programs stored in prog arrays.
> 
> Except that prog array is indirect call based and got awfully slow due
> to retpoline and we're trying to redesign the whole tail_call approach.
> So more extensions to tail_call facility is the opposite of that direction.

OK, I see the point of retpoline having slowed down tail_call.  Do you have
any suggestions on how to accomplish the concept that I am proposing in a
different way?  I believe that the discussion that has been going on in other
emails has shown that while introducing a program type that provides a
generic (abstracted) context is a different approach from what has been done
so far, it is a new use case that provides for additional ways in which BPF
can be used.

> > > I have nothing against dtrace language and dtrace scripts.
> > > Go ahead and compile them into bpf.
> > > All patches to improve bpf infrastructure are very welcomed.
> > > 
> > > In particular you brought up a good point that there is a use case
> > > for sharing a piece of bpf program between kprobe and tracepoint events.
> > > The better way to do that is via bpf2bpf call.
> > > Example:
> > > void bpf_subprog(arbitrary args)
> > > {
> > > }
> > > 
> > > SEC("kprobe/__set_task_comm")
> > > int bpf_prog_kprobe(struct pt_regs *ctx)
> > > {
> > >   bpf_subprog(...);
> > > }
> > > 
> > > SEC("tracepoint/sched/sched_switch")
> > > int bpf_prog_tracepoint(struct sched_switch_args *ctx)
> > > {
> > >   bpf_subprog(...);
> > > }
> > > 
> > > Such configuration is not supported by the verifier yet.
> > > We've been discussing it for some time, but no work has started,
> > > since there was no concrete use case.
> > > If you can work on adding support for it everyone will benefit.
> > > 
> > > Could you please consider doing that as a step forward?
> > 
> > This definitely looks to be an interesting addition and I am happy to look into
> > that further.  I have a few questions that I hope you can shed light on...
> > 
> > 1. What context would bpf_subprog execute with?  If it can be called from
> >    multiple different prog types, would it see whichever context the caller
> >    is executing with?  Or would you envision bpf_subprog to not be allowed to
> >    access the execution context because it cannot know which one is in use?
> 
> bpf_subprog() won't be able to access 'ctx' pointer _if_ it's ambiguous.
> The verifier is already smart enough to track all the data flow, so it's fine to
> pass 'struct pt_regs *ctx' as long as it's accessed safely.
> For example:
> void bpf_subprog(int kind, struct pt_regs *ctx1, struct sched_switch_args *ctx2)
> {
>   if (kind == 1)
>      bpf_printk("%d", ctx1->pc);
>   if (kind == 2)
>      bpf_printk("%d", ctx2->next_pid);
> }
> 
> SEC("kprobe/__set_task_comm")
> int bpf_prog_kprobe(struct pt_regs *ctx)
> {
>   bpf_subprog(1, ctx, NULL);
> }
> 
> SEC("tracepoint/sched/sched_switch")
> int bpf_prog_tracepoint(struct sched_switch_args *ctx)
> {
>   bpf_subprog(2, NULL, ctx);
> }
> 
> The verifier should be able to prove that the above is correct.
> It can do so already if s/ctx1/map_value1/, s/ctx2/map_value2/
> What's missing is an ability to have more than one 'starting' or 'root caller'
> program.
> 
> Now replace SEC("tracepoint/sched/sched_switch") with SEC("cgroup/ingress")
> and it's becoming clear that BPF_PROG_TYPE_PROBE approach is not good enough, right?

Yes and no.  It depends on what you are trying to do with the BPF program that
is attached to the different events.  From a tracing perspective, providing a
single BPF program with an abstract context would make sense in your example
when you are collecting the same kind of information about the task and system
state at the time the event happens.

In the tracing model that provides the use cases I am concerned with, a probe
or event triggering execution is equivalent to making a function call like
(in pseudo-code):

    process_probe(probe_id, args, ...)

where the probe_id identifies the actual probe that fired (and can be used to
access metadata about the probe, etc.) and args captures the parameters that
are provided by the probe.

In this model kprobe/ksys_write and tracepoint/syscalls/sys_enter_write are
equivalent for most tracing purposes (because we provide function arguments
as args in the first one, and we provide the tracepoint parameters as args
in the second one).  When you are tracing the use of writes, it doesn't really
matter which of these two you attach the program to.

> Folks are already sharing the bpf progs between kprobe and networking.
> Currently it's done via code duplication and actual sharing happens via maps.
> That's not ideal, hence we've been discussing 'shared library' approach for
> quite some time. We need a way to support common bpf functions that can be called
> from networking and from tracing programs.

I agree with what you are saying but I am presenting an additional use case
that goes beyond providing a library of functions (though I
definitely have a use for that also).  I am hoping you have some suggestions
on how to accomplish that in view of your comment that tail_call isn't the way
to go. 

> > 2. Given that BPF programs are loaded with a specification of the prog type, 
> >    how would one load a code construct as the one you outline above?  How can
> >    you load a BPF function and have it be used as subprog from programs that
> >    are loaded separately?  I.e. in the sample above, if bpf_subprog is loaded
> >    as part of loading bpf_prog_kprobe (prog type KPROBE), how can it be
> >    referenced from bpf_prog_tracepoint (prog type TRACEPOINT) which would be
> >    loaded separately?
> 
> The api to support shared libraries was discussed, but not yet implemented.
> We've discussed 'FD + name' approach.
> FD identifies a loaded program (which is root program + a set of subprogs)
> and other programs can be loaded at any time later. The BPF_CALL instructions
> in such later program would refer to older subprogs via FD + name.
> Note that both tracing and networking progs can be part of single elf file.
> libbpf has to be smart to load progs into kernel step by step
> and reusing subprogs that are already loaded.

OK.

> Note that libbpf work for such feature can begin _without_ kernel changes.
> libbpf can pass bpf_prog_kprobe+bpf_subprog as a single program first,
> then pass bpf_prog_tracepoint+bpf_subprog second (as a separate program).
> The bpf_subprog will be duplicated and JITed twice, but sharing will happen
> because data structures (maps, global and static data) will be shared.
> This way the support for 'pseudo shared libraries' can begin.
> (later accompanied by FD+name kernel support)

Makes sense.

> There are other things we discussed. Ideally the body of bpf_subprog()
> wouldn't need to be kept around for future verification when this bpf
> function is called by a different program. The idea was to
> use BTF and similar mechanism to ongoing 'bounded loop' work.
> So the verifier can analyze bpf_subprog() once and reuse that knowledge
> for dynamic linking with progs that will be loaded later.
> This is more long term work.

Hm, yes.  You should be able to get away with just storing the access
constraints of pointer (ctx, map_value) arguments passed to the functions,
and verifying whether they are compatible with the information obtained while
running the verifier on the caller.  For loop detection you're likely to
need more information as well though.  Definitely longer term work.

> A simple short-term approach would be to verify the full call chain every time
> the subprog (bpf function) is reused.
> 
> All that aside the kernel support for shared libraries is an awesome
> feature to have and a bunch of folks want to see it happen, but
> it's not a blocker for 'dtrace to bpf' user space work.
> libbpf can be taught to do this 'pseudo shared library' feature
> while 'dtrace to bpf' side doesn't need to do anything special.
> It can generate normal elf file with bpf functions calling each other
> and have tracing, kprobes, etc in one .c file.
> Or don't generate .c file if you don't want to use clang/llvm.
> If you think "dtrace to bpf" can generate bpf directly then go for that.
> All such decisions are in user space and there is a freedom to course
> correct when direct bpf generation turns out to be underperforming
> compared to llvm-generated code.

So you are basically saying that I should redesign DTrace?  The ability to
use shared functions is not sufficient for this use case.  It is also putting
a burden on the users where a single piece of script code may have to be
compiled into different BPF code because it is going to be attached to two
different probe types.

In my example earlier in this email, a simple script that would collect the
3 arguments to a write would (right now) require very different BPF code to
get to those arguments (let's assume we're on x86):

  kprobe: get them from the pt_regs structure and possibly the stack
  tracepoint: get them from the parameters stored in the context
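
A sketch of those two shapes (libbpf-style; PT_REGS_PARMn comes from
bpf_tracing.h, the tracepoint args struct mirrors the event's format
file, and trace_write() is a made-up shared helper):

SEC("kprobe/ksys_write")
int write_kprobe(struct pt_regs *ctx)
{
	/* Arguments live in registers (x86-64 calling convention). */
	u64 fd    = PT_REGS_PARM1(ctx);
	u64 buf   = PT_REGS_PARM2(ctx);
	u64 count = PT_REGS_PARM3(ctx);

	return trace_write(fd, buf, count);
}

struct sys_enter_write_args {
	unsigned long long unused;	/* common tracepoint fields */
	long syscall_nr;
	long fd, buf, count;
};

SEC("tracepoint/syscalls/sys_enter_write")
int write_tracepoint(struct sys_enter_write_args *ctx)
{
	/* Arguments come from the event record laid out above. */
	u64 fd    = ctx->fd;
	u64 buf   = ctx->buf;
	u64 count = ctx->count;

	return trace_write(fd, buf, count);
}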

If something breaks, you suddenly put the burden on the user to try to debug
two (or more) generated BPF programs, despite the fact that they came from
the same source code.  And on top of that the problem may turn out to be that
the tracepoint changed in the kernel, and the code generation wasn't updated
yet.

As you said before...  userspace shouldn't block kernel changes.  In addition,
I think that isolating kernel changes from userspace changes is a benefit to
both camps.  That's why the bpf syscall is a great benefit because you can
change how things are done at the kernel level, and userspace often doesn't
need to know you changed the implementation of e.g. attaching programs.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 00/11] bpf, trace, dtrace: DTrace BPF program type implementation and sample use
  2019-05-30 16:15                   ` Kris Van Hees
@ 2019-05-31 15:25                     ` Chris Mason
  2019-06-06 20:58                       ` Kris Van Hees
  0 siblings, 1 reply; 54+ messages in thread
From: Chris Mason @ 2019-05-31 15:25 UTC (permalink / raw)
  To: Kris Van Hees
  Cc: Alexei Starovoitov, netdev, bpf, dtrace-devel, linux-kernel,
	rostedt, mhiramat, acme, ast, daniel, peterz


I'm being pretty liberal with chopping down quoted material to help 
emphasize a particular opinion about how to bootstrap existing 
out-of-tree projects into the kernel.  My goal here is to talk more 
about the process and less about the technical details, so please 
forgive me if I've ignored or changed the technical meaning of anything 
below.

On 30 May 2019, at 12:15, Kris Van Hees wrote:

> On Thu, May 23, 2019 at 01:28:44PM -0700, Alexei Starovoitov wrote:
>
> ... I believe that the discussion that has been going on in other
> emails has shown that while introducing a program type that provides a
> generic (abstracted) context is a different approach from what has 
> been done
> so far, it is a new use case that provides for additional ways in 
> which BPF
> can be used.
>

[ ... ]

>
> Yes and no.  It depends on what you are trying to do with the BPF 
> program that
> is attached to the different events.  From a tracing perspective, 
> providing a
> single BPF program with an abstract context would ...

[ ... ]

>
> In this model kprobe/ksys_write and 
> tracepoint/syscalls/sys_enter_write are
> equivalent for most tracing purposes ...

[ ... ]

>
> I agree with what you are saying but I am presenting an additional use 
> case

[ ... ]

>>
>> All that aside the kernel support for shared libraries is an awesome
>> feature to have and a bunch of folks want to see it happen, but
>> it's not a blocker for 'dtrace to bpf' user space work.
>> libbpf can be taught to do this 'pseudo shared library' feature
>> while 'dtrace to bpf' side doesn't need to do anything special.

[ ... ]

This thread intermixes some abstract conceptual changes with smaller 
technical improvements, and in general it follows a familiar pattern 
other out-of-tree projects have hit while trying to adapt the kernel to 
their existing code.  Just from this one email, I quoted the abstract 
models with use cases etc, and this is often where the discussions get 
sidetracked into less productive areas.

>
> So you are basically saying that I should redesign DTrace?

In your place, I would have removed features and adapted dtrace as much 
as possible to require the absolute minimum of kernel patches, or even 
better, no patches at all.  I'd document all of the features that worked 
as expected, and underline anything either missing or suboptimal that 
needed additional kernel changes.  Then I'd focus on expanding the 
community of people using dtrace against the mainline kernel, and work 
through the series of features and improvements one by one upstream over 
time.

Your current approach relies on an all-or-nothing landing of patches 
upstream, and this consistently leads to conflict every time a project 
tries it.  A more incremental approach will require bigger changes on 
the dtrace application side, but over time it'll be much easier to 
justify your kernel changes.  You won't have to talk in abstract models, 
and you'll have many more concrete examples of people asking for dtrace 
features against mainline.  Most importantly, you'll make dtrace 
available on more kernels than just the absolute latest mainline, and 
removing dependencies makes the project much easier for new users to 
try.

-chris

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 00/11] bpf, trace, dtrace: DTrace BPF program type implementation and sample use
  2019-05-31 15:25                     ` Chris Mason
@ 2019-06-06 20:58                       ` Kris Van Hees
  0 siblings, 0 replies; 54+ messages in thread
From: Kris Van Hees @ 2019-06-06 20:58 UTC (permalink / raw)
  To: Chris Mason
  Cc: Kris Van Hees, Alexei Starovoitov, netdev, bpf, dtrace-devel,
	linux-kernel, rostedt, mhiramat, acme, ast, daniel, peterz

On Fri, May 31, 2019 at 03:25:25PM +0000, Chris Mason wrote:
> 
> I'm being pretty liberal with chopping down quoted material to help 
> emphasize a particular opinion about how to bootstrap existing 
> out-of-tree projects into the kernel.  My goal here is to talk more 
> about the process and less about the technical details, so please 
> forgive me if I've ignored or changed the technical meaning of anything 
> below.
> 
> On 30 May 2019, at 12:15, Kris Van Hees wrote:
> 
> > On Thu, May 23, 2019 at 01:28:44PM -0700, Alexei Starovoitov wrote:
> >
> > ... I believe that the discussion that has been going on in other
> > emails has shown that while introducing a program type that provides a
> > generic (abstracted) context is a different approach from what has 
> > been done
> > so far, it is a new use case that provides for additional ways in 
> > which BPF
> > can be used.
> >
> 
> [ ... ]
> 
> >
> > Yes and no.  It depends on what you are trying to do with the BPF 
> > program that
> > is attached to the different events.  From a tracing perspective, 
> > providing a
> > single BPF program with an abstract context would ...
> 
> [ ... ]
> 
> >
> > In this model kprobe/ksys_write and 
> > tracepoint/syscalls/sys_enter_write are
> > equivalent for most tracing purposes ...
> 
> [ ... ]
> 
> >
> > I agree with what you are saying but I am presenting an additional use 
> > case
> 
> [ ... ]
> 
> >>
> >> All that aside the kernel support for shared libraries is an awesome
> >> feature to have and a bunch of folks want to see it happen, but
> >> it's not a blocker for 'dtrace to bpf' user space work.
> >> libbpf can be taught to do this 'pseudo shared library' feature
> >> while 'dtrace to bpf' side doesn't need to do anything special.
> 
> [ ... ]
> 
> This thread intermixes some abstract conceptual changes with smaller 
> technical improvements, and in general it follows a familiar pattern 
> other out-of-tree projects have hit while trying to adapt the kernel to 
> their existing code.  Just from this one email, I quoted the abstract 
> models with use cases etc., and this is often where the discussions 
> sidetrack into less productive areas.
> 
> >
> > So you are basically saying that I should redesign DTrace?
> 
> In your place, I would have removed features and adapted dtrace as much 
> as possible to require the absolute minimum of kernel patches, or even 
> better, no patches at all.  I'd document all of the features that worked 
> as expected, and underline anything either missing or suboptimal that 
> needed additional kernel changes.  Then I'd focus on expanding the 
> community of people using dtrace against the mainline kernel, and work 
> through the series of features and improvements one by one upstream over 
> time.

Well, that is actually what I am doing, in the sense that the proposed patches
are quite minimal and lie at the core of the style of tracing that we need to
support.  So I definitely agree with your statement.  The code I posted
implements a minimal set of features (hardly any at all), although as Peter
pointed out, some more can be stripped from it, and I have already done that
in a revision of the patchset I am preparing.

> Your current approach relies on an all-or-nothing landing of patches 
> upstream, and this consistently leads to conflict every time a project 
> tries it.  A more incremental approach will require bigger changes on 
> the dtrace application side, but over time it'll be much easier to 
> justify your kernel changes.  You won't have to talk in abstract models, 
> and you'll have many more concrete examples of people asking for dtrace 
> features against mainline.  Most importantly, you'll make dtrace 
> available on more kernels than just the absolute latest mainline, and 
> removing dependencies makes the project much easier for new users to 
> try.

I am not sure where I gave the impression that my approach relies on an
all-or-nothing landing of patches.  My intent (and I think the content of
the patches reflects it) was to work from a minimal base and build on that,
adding things as needed.  Granted, it depends on a rather crucial design
feature that apparently should also be avoided for now, and I can certainly
work around that.  But I hope that it is clear from the patch set I posted
that an incremental approach is indeed what I intend to do.

Thank you for putting it in clear terms and explaining the pitfalls that have
been observed with projects in the past.  I will proceed with an even more
minimalist approach.

To that end, could you advise on who the patches should be Cc'd to in order
to have the first minimal code submitted to a tools/dtrace directory in the
kernel tree?

	Kris

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 00/11] bpf, trace, dtrace: DTrace BPF program type implementation and sample use
  2019-05-23 20:28                 ` Alexei Starovoitov
  2019-05-30 16:15                   ` Kris Van Hees
@ 2019-06-18  1:25                   ` Kris Van Hees
  2019-06-18  1:32                     ` Alexei Starovoitov
  1 sibling, 1 reply; 54+ messages in thread
From: Kris Van Hees @ 2019-06-18  1:25 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Kris Van Hees, netdev, bpf, dtrace-devel, linux-kernel, rostedt,
	mhiramat, acme, ast, daniel, peterz

On Thu, May 23, 2019 at 01:28:44PM -0700, Alexei Starovoitov wrote:

<< stuff skipped because it is not relevant to the technical discussion... >>

> > > In particular you brought up a good point that there is a use case
> > > for sharing a piece of bpf program between kprobe and tracepoint events.
> > > The better way to do that is via bpf2bpf call.
> > > Example:
> > > void bpf_subprog(arbitrary args)
> > > {
> > > }
> > > 
> > > SEC("kprobe/__set_task_comm")
> > > int bpf_prog_kprobe(struct pt_regs *ctx)
> > > {
> > >   bpf_subprog(...);
> > > }
> > > 
> > > SEC("tracepoint/sched/sched_switch")
> > > int bpf_prog_tracepoint(struct sched_switch_args *ctx)
> > > {
> > >   bpf_subprog(...);
> > > }
> > > 
> > > Such configuration is not supported by the verifier yet.
> > > We've been discussing it for some time, but no work has started,
> > > since there was no concrete use case.
> > > If you can work on adding support for it everyone will benefit.
> > > 
> > > Could you please consider doing that as a step forward?
> > 
> > This definitely looks to be an interesting addition and I am happy to look into
> > that further.  I have a few questions that I hope you can shed light on...
> > 
> > 1. What context would bpf_subprog execute with?  If it can be called from
> >    multiple different prog types, would it see whichever context the caller
> >    is executing with?  Or would you envision bpf_subprog to not be allowed to
> >    access the execution context because it cannot know which one is in use?
> 
> bpf_subprog() won't be able to access 'ctx' pointer _if_ it's ambiguous.
> The verifier is already smart enough to track all the data flow, so it's fine
> to pass 'struct pt_regs *ctx' as long as it's accessed safely.
> For example:
> void bpf_subprog(int kind, struct pt_regs *ctx1, struct sched_switch_args *ctx2)
> {
>   if (kind == 1)
>      bpf_printk("%d", ctx1->pc);
>   if (kind == 2)
>      bpf_printk("%d", ctx2->next_pid);
> }
> 
> SEC("kprobe/__set_task_comm")
> int bpf_prog_kprobe(struct pt_regs *ctx)
> {
>   bpf_subprog(1, ctx, NULL);
> }
> 
> SEC("tracepoint/sched/sched_switch")
> int bpf_prog_tracepoint(struct sched_switch_args *ctx)
> {
>   bpf_subprog(2, NULL, ctx);
> }
> 
> The verifier should be able to prove that the above is correct.
> It can do so already if s/ctx1/map_value1/, s/ctx2/map_value2/
> What's missing is an ability to have more than one 'starting' or 'root caller'
> program.
> 
> Now replace SEC("tracepoint/sched/sched_switch") with SEC("cgroup/ingress")
> and it's becoming clear that BPF_PROG_TYPE_PROBE approach is not good enough, right?
> Folks are already sharing the bpf progs between kprobe and networking.
> Currently it's done via code duplication and actual sharing happens via maps.
> That's not ideal, hence we've been discussing 'shared library' approach for
> quite some time. We need a way to support common bpf functions that can be called
> from networking and from tracing programs.
> 
> > 2. Given that BPF programs are loaded with a specification of the prog type, 
> >    how would one load a code construct such as the one you outline above?  How can
> >    you load a BPF function and have it be used as subprog from programs that
> >    are loaded separately?  I.e. in the sample above, if bpf_subprog is loaded
> >    as part of loading bpf_prog_kprobe (prog type KPROBE), how can it be
> >    referenced from bpf_prog_tracepoint (prog type TRACEPOINT) which would be
> >    loaded separately?
> 
> The api to support shared libraries was discussed, but not yet implemented.
> We've discussed 'FD + name' approach.
> FD identifies a loaded program (which is root program + a set of subprogs)
> and other programs can be loaded at any time later. The BPF_CALL instructions
> in such later program would refer to older subprogs via FD + name.
> Note that both tracing and networking progs can be part of a single ELF file.
> libbpf has to be smart enough to load progs into the kernel step by step,
> reusing subprogs that are already loaded.
> 
> Note that libbpf work for such feature can begin _without_ kernel changes.
> libbpf can pass bpf_prog_kprobe+bpf_subprog as a single program first,
> then pass bpf_prog_tracepoint+bpf_subprog second (as a separate program).
> The bpf_subprog will be duplicated and JITed twice, but sharing will happen
> because data structures (maps, global and static data) will be shared.
> This way the support for 'pseudo shared libraries' can begin.
> (later accompanied by FD+name kernel support)

As far as I can determine, the current libbpf implementation is already able
to do the duplication of the called function, even when the ELF object contains
programs of different program types.  I.e. the example you give at the top
of the email actually seems to work already.  Right?

In that case, I am a bit unsure what more can be done on the libbpf side
without needing kernel changes.

> There are other things we discussed. Ideally the body of bpf_subprog()
> wouldn't need to be kept around for future verification when this bpf
> function is called by a different program. The idea was to
> use BTF and a mechanism similar to the ongoing 'bounded loop' work.
> So the verifier can analyze bpf_subprog() once and reuse that knowledge
> for dynamic linking with progs that will be loaded later.
> This is more long term work.
> A simple short term approach would be to verify the full call chain every
> time the subprog (bpf function) is reused.
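
To make the map-based sharing pattern described above concrete, here is a
minimal sketch.  It is an illustration, not code from this thread: the map
and function names are invented, and it assumes the same samples/bpf-style
headers (bpf_helpers.h) as the examples elsewhere in the thread.  The shared
logic is duplicated into each program by forced inlining, while both program
types update the same map, so only the data is actually shared:

struct bpf_map_def SEC("maps") counts = {
        .type        = BPF_MAP_TYPE_ARRAY,
        .key_size    = sizeof(u32),
        .value_size  = sizeof(u64),
        .max_entries = 1,
};

/* Duplicated into every caller by forced inlining; the shared state
 * lives in the 'counts' map, not in the code.
 */
static __always_inline void bump_count(void)
{
        u32 key = 0;
        u64 *val = bpf_map_lookup_elem(&counts, &key);

        if (val)
                __sync_fetch_and_add(val, 1);
}

SEC("kprobe/ksys_write")
int count_writes(struct pt_regs *ctx)
{
        bump_count();
        return 0;
}

SEC("cgroup_skb/ingress")
int count_ingress(struct __sk_buff *skb)
{
        bump_count();
        return 1;       /* allow the packet */
}

char _license[] SEC("license") = "GPL";

The per-program duplication of bump_count() is exactly the cost that the
'shared library' work discussed above would remove.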

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 00/11] bpf, trace, dtrace: DTrace BPF program type implementation and sample use
  2019-06-18  1:25                   ` Kris Van Hees
@ 2019-06-18  1:32                     ` Alexei Starovoitov
  2019-06-18  1:54                       ` Kris Van Hees
  0 siblings, 1 reply; 54+ messages in thread
From: Alexei Starovoitov @ 2019-06-18  1:32 UTC (permalink / raw)
  To: Kris Van Hees
  Cc: Network Development, bpf, dtrace-devel, LKML, Steven Rostedt,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Alexei Starovoitov,
	Daniel Borkmann, Peter Zijlstra

On Mon, Jun 17, 2019 at 6:25 PM Kris Van Hees <kris.van.hees@oracle.com> wrote:
>
> On Thu, May 23, 2019 at 01:28:44PM -0700, Alexei Starovoitov wrote:
>
> << stuff skipped because it is not relevant to the technical discussion... >>
>
> > > > In particular you brought up a good point that there is a use case
> > > > for sharing a piece of bpf program between kprobe and tracepoint events.
> > > > The better way to do that is via bpf2bpf call.
> > > > Example:
> > > > void bpf_subprog(arbitrary args)
> > > > {
> > > > }
> > > >
> > > > SEC("kprobe/__set_task_comm")
> > > > int bpf_prog_kprobe(struct pt_regs *ctx)
> > > > {
> > > >   bpf_subprog(...);
> > > > }
> > > >
> > > > SEC("tracepoint/sched/sched_switch")
> > > > int bpf_prog_tracepoint(struct sched_switch_args *ctx)
> > > > {
> > > >   bpf_subprog(...);
> > > > }
> > > >
> > > > Such configuration is not supported by the verifier yet.
> > > > We've been discussing it for some time, but no work has started,
> > > > since there was no concrete use case.
> > > > If you can work on adding support for it everyone will benefit.
> > > >
> > > > Could you please consider doing that as a step forward?
> > >
> > > This definitely looks to be an interesting addition and I am happy to look into
> > > that further.  I have a few questions that I hope you can shed light on...
> > >
> > > 1. What context would bpf_subprog execute with?  If it can be called from
> > >    multiple different prog types, would it see whichever context the caller
> > >    is executing with?  Or would you envision bpf_subprog to not be allowed to
> > >    access the execution context because it cannot know which one is in use?
> >
> > bpf_subprog() won't be able to access 'ctx' pointer _if_ it's ambiguous.
> > The verifier is already smart enough to track all the data flow, so it's
> > fine to pass 'struct pt_regs *ctx' as long as it's accessed safely.
> > For example:
> > void bpf_subprog(int kind, struct pt_regs *ctx1, struct sched_switch_args *ctx2)
> > {
> >   if (kind == 1)
> >      bpf_printk("%d", ctx1->pc);
> >   if (kind == 2)
> >      bpf_printk("%d", ctx2->next_pid);
> > }
> >
> > SEC("kprobe/__set_task_comm")
> > int bpf_prog_kprobe(struct pt_regs *ctx)
> > {
> >   bpf_subprog(1, ctx, NULL);
> > }
> >
> > SEC("tracepoint/sched/sched_switch")
> > int bpf_prog_tracepoint(struct sched_switch_args *ctx)
> > {
> >   bpf_subprog(2, NULL, ctx);
> > }
> >
> > The verifier should be able to prove that the above is correct.
> > It can do so already if s/ctx1/map_value1/, s/ctx2/map_value2/
> > What's missing is an ability to have more than one 'starting' or 'root caller'
> > program.
> >
> > Now replace SEC("tracepoint/sched/sched_switch") with SEC("cgroup/ingress")
> > and it's becoming clear that BPF_PROG_TYPE_PROBE approach is not good enough, right?
> > Folks are already sharing the bpf progs between kprobe and networking.
> > Currently it's done via code duplication and actual sharing happens via maps.
> > That's not ideal, hence we've been discussing 'shared library' approach for
> > quite some time. We need a way to support common bpf functions that can be called
> > from networking and from tracing programs.
> >
> > > 2. Given that BPF programs are loaded with a specification of the prog type,
> > >    how would one load a code construct such as the one you outline above?  How can
> > >    you load a BPF function and have it be used as subprog from programs that
> > >    are loaded separately?  I.e. in the sample above, if bpf_subprog is loaded
> > >    as part of loading bpf_prog_kprobe (prog type KPROBE), how can it be
> > >    referenced from bpf_prog_tracepoint (prog type TRACEPOINT) which would be
> > >    loaded separately?
> >
> > The api to support shared libraries was discussed, but not yet implemented.
> > We've discussed 'FD + name' approach.
> > FD identifies a loaded program (which is root program + a set of subprogs)
> > and other programs can be loaded at any time later. The BPF_CALL instructions
> > in such later program would refer to older subprogs via FD + name.
> > Note that both tracing and networking progs can be part of a single ELF file.
> > libbpf has to be smart enough to load progs into the kernel step by step,
> > reusing subprogs that are already loaded.
> >
> > Note that libbpf work for such feature can begin _without_ kernel changes.
> > libbpf can pass bpf_prog_kprobe+bpf_subprog as a single program first,
> > then pass bpf_prog_tracepoint+bpf_subprog second (as a separate program).
> > The bpf_subprog will be duplicated and JITed twice, but sharing will happen
> > because data structures (maps, global and static data) will be shared.
> > This way the support for 'pseudo shared libraries' can begin.
> > (later accompanied by FD+name kernel support)
>
> As far as I can determine, the current libbpf implementation is already able
> to do the duplication of the called function, even when the ELF object contains
> programs of different program types.  I.e. the example you give at the top
> of the email actually seems to work already.  Right?

Have you tried it?

> In that case, I am a bit unsure what more can be done on the libbpf side
> without needing kernel changes.

it's a bit weird to discuss hypothetical kernel changes when the first step
of changing libbpf wasn't even attempted.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 00/11] bpf, trace, dtrace: DTrace BPF program type implementation and sample use
  2019-06-18  1:32                     ` Alexei Starovoitov
@ 2019-06-18  1:54                       ` Kris Van Hees
  2019-06-18  3:01                         ` Alexei Starovoitov
  0 siblings, 1 reply; 54+ messages in thread
From: Kris Van Hees @ 2019-06-18  1:54 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Kris Van Hees, Network Development, bpf, dtrace-devel, LKML,
	Steven Rostedt, Masami Hiramatsu, Arnaldo Carvalho de Melo,
	Alexei Starovoitov, Daniel Borkmann, Peter Zijlstra

On Mon, Jun 17, 2019 at 06:32:22PM -0700, Alexei Starovoitov wrote:
> On Mon, Jun 17, 2019 at 6:25 PM Kris Van Hees <kris.van.hees@oracle.com> wrote:
> >
> > On Thu, May 23, 2019 at 01:28:44PM -0700, Alexei Starovoitov wrote:
> >
> > << stuff skipped because it is not relevant to the technical discussion... >>
> >
> > > > > In particular you brought up a good point that there is a use case
> > > > > for sharing a piece of bpf program between kprobe and tracepoint events.
> > > > > The better way to do that is via bpf2bpf call.
> > > > > Example:
> > > > > void bpf_subprog(arbitrary args)
> > > > > {
> > > > > }
> > > > >
> > > > > SEC("kprobe/__set_task_comm")
> > > > > int bpf_prog_kprobe(struct pt_regs *ctx)
> > > > > {
> > > > >   bpf_subprog(...);
> > > > > }
> > > > >
> > > > > SEC("tracepoint/sched/sched_switch")
> > > > > int bpf_prog_tracepoint(struct sched_switch_args *ctx)
> > > > > {
> > > > >   bpf_subprog(...);
> > > > > }
> > > > >
> > > > > Such configuration is not supported by the verifier yet.
> > > > > We've been discussing it for some time, but no work has started,
> > > > > since there was no concrete use case.
> > > > > If you can work on adding support for it everyone will benefit.
> > > > >
> > > > > Could you please consider doing that as a step forward?
> > > >
> > > > This definitely looks to be an interesting addition and I am happy to look into
> > > > that further.  I have a few questions that I hope you can shed light on...
> > > >
> > > > 1. What context would bpf_subprog execute with?  If it can be called from
> > > >    multiple different prog types, would it see whichever context the caller
> > > >    is executing with?  Or would you envision bpf_subprog to not be allowed to
> > > >    access the execution context because it cannot know which one is in use?
> > >
> > > bpf_subprog() won't be able to access 'ctx' pointer _if_ it's ambiguous.
> > > The verifier is already smart enough to track all the data flow, so it's
> > > fine to pass 'struct pt_regs *ctx' as long as it's accessed safely.
> > > For example:
> > > void bpf_subprog(int kind, struct pt_regs *ctx1, struct sched_switch_args *ctx2)
> > > {
> > >   if (kind == 1)
> > >      bpf_printk("%d", ctx1->pc);
> > >   if (kind == 2)
> > >      bpf_printk("%d", ctx2->next_pid);
> > > }
> > >
> > > SEC("kprobe/__set_task_comm")
> > > int bpf_prog_kprobe(struct pt_regs *ctx)
> > > {
> > >   bpf_subprog(1, ctx, NULL);
> > > }
> > >
> > > SEC("tracepoint/sched/sched_switch")
> > > int bpf_prog_tracepoint(struct sched_switch_args *ctx)
> > > {
> > >   bpf_subprog(2, NULL, ctx);
> > > }
> > >
> > > The verifier should be able to prove that the above is correct.
> > > It can do so already if s/ctx1/map_value1/, s/ctx2/map_value2/
> > > What's missing is an ability to have more than one 'starting' or 'root caller'
> > > program.
> > >
> > > Now replace SEC("tracepoint/sched/sched_switch") with SEC("cgroup/ingress")
> > > and it's becoming clear that BPF_PROG_TYPE_PROBE approach is not good enough, right?
> > > Folks are already sharing the bpf progs between kprobe and networking.
> > > Currently it's done via code duplication and actual sharing happens via maps.
> > > That's not ideal, hence we've been discussing 'shared library' approach for
> > > quite some time. We need a way to support common bpf functions that can be called
> > > from networking and from tracing programs.
> > >
> > > > 2. Given that BPF programs are loaded with a specification of the prog type,
> > > >    how would one load a code construct such as the one you outline above?  How can
> > > >    you load a BPF function and have it be used as subprog from programs that
> > > >    are loaded separately?  I.e. in the sample above, if bpf_subprog is loaded
> > > >    as part of loading bpf_prog_kprobe (prog type KPROBE), how can it be
> > > >    referenced from bpf_prog_tracepoint (prog type TRACEPOINT) which would be
> > > >    loaded separately?
> > >
> > > The api to support shared libraries was discussed, but not yet implemented.
> > > We've discussed 'FD + name' approach.
> > > FD identifies a loaded program (which is root program + a set of subprogs)
> > > and other programs can be loaded at any time later. The BPF_CALL instructions
> > > in such later program would refer to older subprogs via FD + name.
> > > Note that both tracing and networking progs can be part of a single ELF file.
> > > libbpf has to be smart enough to load progs into the kernel step by step,
> > > reusing subprogs that are already loaded.
> > >
> > > Note that libbpf work for such feature can begin _without_ kernel changes.
> > > libbpf can pass bpf_prog_kprobe+bpf_subprog as a single program first,
> > > then pass bpf_prog_tracepoint+bpf_subprog second (as a separate program).
> > > The bpf_subprog will be duplicated and JITed twice, but sharing will happen
> > > because data structures (maps, global and static data) will be shared.
> > > This way the support for 'pseudo shared libraries' can begin.
> > > (later accompanied by FD+name kernel support)
> >
> > As far as I can determine, the current libbpf implementation is already able
> > to do the duplication of the called function, even when the ELF object contains
> > programs of different program types.  I.e. the example you give at the top
> > of the email actually seems to work already.  Right?
> 
> Have you tried it?

Yes, of course.  I wouldn't want to make an unfounded claim.

> > In that case, I am a bit unsure what more can be done on the libbpf side
> > without needing kernel changes.
> 
> it's a bit weird to discuss hypothetical kernel changes when the first step
> of changing libbpf wasn't even attempted.

It is not hypothetical.  The following example works fine:

static int noinline bpf_action(void *ctx, long fd, long buf, long count)
{
        int                     cpu = bpf_get_smp_processor_id();
        struct data {
                u64     arg0;
                u64     arg1;
                u64     arg2;
        }                       rec;

        memset(&rec, 0, sizeof(rec));

        rec.arg0 = fd;
        rec.arg1 = buf;
        rec.arg2 = count;

        bpf_perf_event_output(ctx, &buffers, cpu, &rec, sizeof(rec));

        return 0;
}

SEC("kprobe/ksys_write")
int bpf_kprobe(struct pt_regs *ctx)
{
        return bpf_action(ctx, ctx->di, ctx->si, ctx->dx);
}

SEC("tracepoint/syscalls/sys_enter_write")
int bpf_tp(struct syscalls_enter_write_args *ctx)
{
        return bpf_action(ctx, ctx->fd, ctx->buf, ctx->count);
}

char _license[] SEC("license") = "GPL";
u32 _version SEC("version") = LINUX_VERSION_CODE;
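
For completeness: bpf_perf_event_output() requires a map of type
BPF_MAP_TYPE_PERF_EVENT_ARRAY, so the 'buffers' map referenced above (and the
tracepoint context struct) would be defined along these lines.  This is a
sketch in samples/bpf style; the exact field layout is an assumption derived
from the tracepoint's events/syscalls/sys_enter_write/format file:

struct bpf_map_def SEC("maps") buffers = {
        .type        = BPF_MAP_TYPE_PERF_EVENT_ARRAY,
        .key_size    = sizeof(int),
        .value_size  = sizeof(u32),
        .max_entries = 64,      /* at least the number of possible CPUs */
};

/* Mirrors /sys/kernel/debug/tracing/events/syscalls/sys_enter_write/format:
 * 8 bytes of common fields, then the syscall number and its arguments.
 */
struct syscalls_enter_write_args {
        unsigned long long unused;
        long syscall_nr;
        long fd;
        long buf;
        long count;
};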

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 00/11] bpf, trace, dtrace: DTrace BPF program type implementation and sample use
  2019-06-18  1:54                       ` Kris Van Hees
@ 2019-06-18  3:01                         ` Alexei Starovoitov
  2019-06-18  3:19                           ` Kris Van Hees
  0 siblings, 1 reply; 54+ messages in thread
From: Alexei Starovoitov @ 2019-06-18  3:01 UTC (permalink / raw)
  To: Kris Van Hees
  Cc: Network Development, bpf, dtrace-devel, LKML, Steven Rostedt,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Alexei Starovoitov,
	Daniel Borkmann, Peter Zijlstra

On Mon, Jun 17, 2019 at 6:54 PM Kris Van Hees <kris.van.hees@oracle.com> wrote:
>
> It is not hypothetical.  The following example works fine:
>
> static int noinline bpf_action(void *ctx, long fd, long buf, long count)
> {
>         int                     cpu = bpf_get_smp_processor_id();
>         struct data {
>                 u64     arg0;
>                 u64     arg1;
>                 u64     arg2;
>         }                       rec;
>
>         memset(&rec, 0, sizeof(rec));
>
>         rec.arg0 = fd;
>         rec.arg1 = buf;
>         rec.arg2 = count;
>
>         bpf_perf_event_output(ctx, &buffers, cpu, &rec, sizeof(rec));
>
>         return 0;
> }
>
> SEC("kprobe/ksys_write")
> int bpf_kprobe(struct pt_regs *ctx)
> {
>         return bpf_action(ctx, ctx->di, ctx->si, ctx->dx);
> }
>
> SEC("tracepoint/syscalls/sys_enter_write")
> int bpf_tp(struct syscalls_enter_write_args *ctx)
> {
>         return bpf_action(ctx, ctx->fd, ctx->buf, ctx->count);
> }
>
> char _license[] SEC("license") = "GPL";
> u32 _version SEC("version") = LINUX_VERSION_CODE;

Great. Then you're all set to proceed with user space dtrace tooling, right?

What you'll discover, though, is that it works only for the simplest things
like the above. libbpf assumes that everything in a single ELF will be used
and passes the whole thing to the kernel.
The verifier removes dead code only from a single program.
It disallows unused functions. Hence libbpf needs to start doing
more "linker work" than it does today.
When it loads a .o it needs to pass to the kernel only the functions
that are used by the program.
This work should be straightforward to implement.
Unfortunately no one had time to do it.
It's also going to be the first step to multi-elf support.
libbpf would need to do the same "linker work" across .o-s.
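
As a rough illustration of that "linker work" (a conceptual model only, an
assumption about the approach rather than actual libbpf code): starting from
the entry program, walk the call graph and mark the functions it reaches, so
that only those are copied into the instruction stream passed to the kernel.
The call graph is reduced to an explicit callee table here; a real pass would
decode BPF_PSEUDO_CALL instructions and ELF relocations instead.

#include <stdbool.h>

#define MAX_CALLEES 8

/* Hypothetical model of one function in a BPF ELF object. */
struct subprog {
        int callees[MAX_CALLEES];       /* indices of direct bpf2bpf callees */
        int ncallees;
        bool used;
};

/* Depth-first marking from the entry program.  Subprogs left unmarked
 * would simply be omitted, instead of being sent to the kernel and
 * rejected by the verifier as unused functions.
 */
static void mark_used(struct subprog *progs, int idx)
{
        int i;

        if (progs[idx].used)
                return;
        progs[idx].used = true;
        for (i = 0; i < progs[idx].ncallees; i++)
                mark_used(progs, progs[idx].callees[i]);
}

Loading bpf_kprobe from the earlier example would then amount to calling
mark_used() on its index and emitting only the marked functions.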

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 00/11] bpf, trace, dtrace: DTrace BPF program type implementation and sample use
  2019-06-18  3:01                         ` Alexei Starovoitov
@ 2019-06-18  3:19                           ` Kris Van Hees
  0 siblings, 0 replies; 54+ messages in thread
From: Kris Van Hees @ 2019-06-18  3:19 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Kris Van Hees, Network Development, bpf, dtrace-devel, LKML,
	Steven Rostedt, Masami Hiramatsu, Arnaldo Carvalho de Melo,
	Alexei Starovoitov, Daniel Borkmann, Peter Zijlstra

On Mon, Jun 17, 2019 at 08:01:52PM -0700, Alexei Starovoitov wrote:
> On Mon, Jun 17, 2019 at 6:54 PM Kris Van Hees <kris.van.hees@oracle.com> wrote:
> >
> > It is not hypothetical.  The following example works fine:
> >
> > static int noinline bpf_action(void *ctx, long fd, long buf, long count)
> > {
> >         int                     cpu = bpf_get_smp_processor_id();
> >         struct data {
> >                 u64     arg0;
> >                 u64     arg1;
> >                 u64     arg2;
> >         }                       rec;
> >
> >         memset(&rec, 0, sizeof(rec));
> >
> >         rec.arg0 = fd;
> >         rec.arg1 = buf;
> >         rec.arg2 = count;
> >
> >         bpf_perf_event_output(ctx, &buffers, cpu, &rec, sizeof(rec));
> >
> >         return 0;
> > }
> >
> > SEC("kprobe/ksys_write")
> > int bpf_kprobe(struct pt_regs *ctx)
> > {
> >         return bpf_action(ctx, ctx->di, ctx->si, ctx->dx);
> > }
> >
> > SEC("tracepoint/syscalls/sys_enter_write")
> > int bpf_tp(struct syscalls_enter_write_args *ctx)
> > {
> >         return bpf_action(ctx, ctx->fd, ctx->buf, ctx->count);
> > }
> >
> > char _license[] SEC("license") = "GPL";
> > u32 _version SEC("version") = LINUX_VERSION_CODE;
> 
> Great. Then you're all set to proceed with user space dtrace tooling, right?

I can indeed proceed with the initial basics, yes, and have started.  I hope
to have a first bare-bones patch for review sometime next week.

> What you'll discover, though, is that it works only for the simplest things
> like the above. libbpf assumes that everything in a single ELF will be used
> and passes the whole thing to the kernel.
> The verifier removes dead code only from a single program.
> It disallows unused functions. Hence libbpf needs to start doing
> more "linker work" than it does today.
> When it loads a .o it needs to pass to the kernel only the functions
> that are used by the program.
> This work should be straightforward to implement.
> Unfortunately no one had time to do it.

Ah yes, I see what you mean.  I'll work on that next since I will definitely
be needing that.

> It's also going to be the first step to multi-elf support.
> libbpf would need to do the same "linker work" across .o-s.

^ permalink raw reply	[flat|nested] 54+ messages in thread

end of thread, other threads:[~2019-06-18  3:21 UTC | newest]

Thread overview: 54+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-05-20 23:47 [RFC PATCH 00/11] bpf, trace, dtrace: DTrace BPF program type implementation and sample use Kris Van Hees
2019-05-21 17:56 ` Alexei Starovoitov
2019-05-21 18:41   ` Kris Van Hees
2019-05-21 20:55     ` Alexei Starovoitov
2019-05-21 21:36       ` Steven Rostedt
2019-05-21 21:43         ` Alexei Starovoitov
2019-05-21 21:48           ` Steven Rostedt
2019-05-22  5:23             ` Kris Van Hees
2019-05-22 20:53               ` Alexei Starovoitov
2019-05-23  5:46                 ` Kris Van Hees
2019-05-23 21:13                   ` Alexei Starovoitov
2019-05-23 23:02                     ` Steven Rostedt
2019-05-24  0:31                       ` Alexei Starovoitov
2019-05-24  1:57                         ` Steven Rostedt
2019-05-24  2:08                           ` Alexei Starovoitov
2019-05-24  2:40                             ` Steven Rostedt
2019-05-24  5:26                             ` Kris Van Hees
2019-05-24  5:10                       ` Kris Van Hees
2019-05-24  4:05                     ` Kris Van Hees
2019-05-24 13:28                       ` Steven Rostedt
2019-05-21 21:36       ` Kris Van Hees
2019-05-21 23:26         ` Alexei Starovoitov
2019-05-22  4:12           ` Kris Van Hees
2019-05-22 20:16             ` Alexei Starovoitov
2019-05-23  5:16               ` Kris Van Hees
2019-05-23 20:28                 ` Alexei Starovoitov
2019-05-30 16:15                   ` Kris Van Hees
2019-05-31 15:25                     ` Chris Mason
2019-06-06 20:58                       ` Kris Van Hees
2019-06-18  1:25                   ` Kris Van Hees
2019-06-18  1:32                     ` Alexei Starovoitov
2019-06-18  1:54                       ` Kris Van Hees
2019-06-18  3:01                         ` Alexei Starovoitov
2019-06-18  3:19                           ` Kris Van Hees
2019-05-22 14:25   ` Peter Zijlstra
2019-05-22 18:22     ` Kris Van Hees
2019-05-22 19:55       ` Alexei Starovoitov
2019-05-22 20:20         ` David Miller
2019-05-23  5:19         ` Kris Van Hees
2019-05-24  7:27       ` Peter Zijlstra
2019-05-21 20:39 ` [RFC PATCH 01/11] bpf: context casting for tail call Kris Van Hees
2019-05-21 20:39 ` [RFC PATCH 02/11] bpf: add BPF_PROG_TYPE_DTRACE Kris Van Hees
2019-05-21 20:39 ` [RFC PATCH 03/11] bpf: export proto for bpf_perf_event_output helper Kris Van Hees
2019-05-21 20:39 ` [RFC PATCH 04/11] trace: initial implementation of DTrace based on kernel facilities Kris Van Hees
2019-05-21 20:39 ` [RFC PATCH 05/11] trace: update Kconfig and Makefile to include DTrace Kris Van Hees
2019-05-21 20:39 ` [RFC PATCH 06/11] dtrace: tiny userspace tool to exercise DTrace support features Kris Van Hees
2019-05-21 20:39 ` [RFC PATCH 07/11] bpf: implement writable buffers in contexts Kris Van Hees
2019-05-21 20:39 ` [RFC PATCH 08/11] perf: add perf_output_begin_forward_in_page Kris Van Hees
2019-05-21 20:39 ` [RFC PATCH 09/11] bpf: mark helpers explicitly whether they may change the context Kris Van Hees
2019-05-21 20:39 ` [RFC PATCH 10/11] bpf: add bpf_buffer_reserve and bpf_buffer_commit helpers Kris Van Hees
2019-05-21 20:40 ` [RFC PATCH 11/11] dtrace: make use of writable buffers in BPF Kris Van Hees
2019-05-21 20:48 ` [RFC PATCH 00/11] bpf, trace, dtrace: DTrace BPF program type implementation and sample use Kris Van Hees
2019-05-21 20:54   ` Steven Rostedt
2019-05-21 20:56   ` Alexei Starovoitov
