Re: [RFC PATCH bpf-next 2/5] bpf: add get_reg_val helper

From: David Vernet <void@manifault.com>
To: Dave Marchevsky <davemarchevsky@fb.com>
Cc: bpf@vger.kernel.org, Alexei Starovoitov <ast@kernel.org>,
	Andrii Nakryiko <andrii@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Rik van Riel <riel@surriel.com>,
	Ilya Leoshkevich <iii@linux.ibm.com>, Yonghong Song <yhs@fb.com>,
	kernel-team@fb.com
Subject: Re: [RFC PATCH bpf-next 2/5] bpf: add get_reg_val helper
Date: Thu, 12 May 2022 08:29:38 -0700	[thread overview]
Message-ID: <20220512152938.hfm64odsrrqlvfiy@dev0025.ash9.facebook.com> (raw)
In-Reply-To: <20220512074321.2090073-3-davemarchevsky@fb.com>

On Thu, May 12, 2022 at 12:43:18AM -0700, Dave Marchevsky wrote:
> Add a helper which reads the value of specified register into memory.
> 
> Currently, bpf programs only have access to general-purpose registers
> via struct pt_regs. Other registers, like SSE regs %xmm0-15, are
> inaccessible, which makes some tracing usecases impossible. For example,
> User Statically-Defined Tracing (USDT) probes may use SSE registers to
> pass their arguments on x86. While this patch adds support for %xmm0-15
> only, the helper is meant to be generic enough to support fetching any
> reg.
> 
> A useful "value of register" definition for bpf programs is "value of
> register before control transfer to kernel". pt_regs gives us this
> currently, so it's the default behavior of the new helper. Fetching the
> actual _current_ reg value is possible, though, by passing
> BPF_GETREG_F_CURRENT flag as part of input.
> 
> For SSE regs we try to avoid digging around in task's fpu state by first
> reading _current_ value, then checking to see if the state of cpu's
> floating point regs matches task's view of them. If so, we can just
> return _current_ value.
> 
> Further usecases which are straightforward to support, but
> unimplemented:
>   * using the helper to fetch general-purpose register value.
>   currently-unused pt_regs parameter exists for this reason.
> 
>   * fetching rdtsc (w/ BPF_GETREG_F_CURRENT)
> 
>   * other architectures. s390 specifically might benefit from similar
>   fpu reg fetching as USDT library was recently updated to support that
>   architecture.
> 
> Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
> ---
>  include/uapi/linux/bpf.h       |  40 +++++++++
>  kernel/trace/bpf_trace.c       | 148 +++++++++++++++++++++++++++++++++
>  kernel/trace/bpf_trace.h       |   1 +
>  tools/include/uapi/linux/bpf.h |  40 +++++++++
>  4 files changed, 229 insertions(+)
> 
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 444fe6f1cf35..3ef8f683ed9e 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -5154,6 +5154,18 @@ union bpf_attr {
>   *		if not NULL, is a reference which must be released using its
>   *		corresponding release function, or moved into a BPF map before
>   *		program exit.
> + *
> + * long bpf_get_reg_val(void *dst, u32 size, u64 getreg_spec, struct pt_regs *regs, struct task_struct *tsk)
> + *	Description
> + *		Store the value of a SSE register specified by *getreg_spec*

Even though this patch only adds support for SSE, if the helper is meant to
be generic enough to support fetching any register, should the description
be updated to not imply that it's only meant for fetching SSE? IMO the
example below is sufficient to indicate that it can be used to fetch SSE
regs.

> + *		into memory region of size *size* specified by *dst*. *getreg_spec*
> + *		is a combination of BPF_GETREG enum AND BPF_GETREG_F flag e.g.
> + *		(BPF_GETREG_X86_XMM0 << 32) | BPF_GETREG_F_CURRENT.*
> + *	Return
> + *		0 on success
> + *		**-ENOENT** if the system architecture does not have requested reg
> + *		**-EINVAL** if *getreg_spec* is invalid
> + *		**-EINVAL** if *size* != bytes necessary to store requested reg val
>   */
>  #define __BPF_FUNC_MAPPER(FN)		\
>  	FN(unspec),			\
> @@ -5351,6 +5363,7 @@ union bpf_attr {
>  	FN(skb_set_tstamp),		\
>  	FN(ima_file_hash),		\
>  	FN(kptr_xchg),			\
> +	FN(get_reg_val),		\
>  	/* */
>  
>  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
> @@ -6318,6 +6331,33 @@ struct bpf_perf_event_value {
>  	__u64 running;
>  };
>  
> +/* bpf_get_reg_val register enum */
> +enum {
> +	BPF_GETREG_X86_XMM0 = 0,

I know we do this in a few places in bpf.h, so please feel free to ignore
this, but the C standard (section 6.7.2.2.1) formally states that if no
value is specified for the first enumerator that its value is 0, so
specifying the value here is strictly unnecessary. We're inconsistent in
how we apply this in bpf.h, but IMHO if we're adding new enums, we should
do the "standard" thing and only define the first element if it's nonzero.

> +	BPF_GETREG_X86_XMM1,
> +	BPF_GETREG_X86_XMM2,
> +	BPF_GETREG_X86_XMM3,
> +	BPF_GETREG_X86_XMM4,
> +	BPF_GETREG_X86_XMM5,
> +	BPF_GETREG_X86_XMM6,
> +	BPF_GETREG_X86_XMM7,
> +	BPF_GETREG_X86_XMM8,
> +	BPF_GETREG_X86_XMM9,
> +	BPF_GETREG_X86_XMM10,
> +	BPF_GETREG_X86_XMM11,
> +	BPF_GETREG_X86_XMM12,
> +	BPF_GETREG_X86_XMM13,
> +	BPF_GETREG_X86_XMM14,
> +	BPF_GETREG_X86_XMM15,
> +	__MAX_BPF_GETREG,
> +};
> +
> +/* bpf_get_reg_val flags */
> +enum {
> +	BPF_GETREG_F_NONE = 0,
> +	BPF_GETREG_F_CURRENT = (1U << 0),
> +};

Can you add a comment specifying what the BPF_GETREG_F_CURRENT flag does?
The commit summary is very helpful, but it would be good to persist this in
code as well.

> +
>  enum {
>  	BPF_DEVCG_ACC_MKNOD	= (1ULL << 0),
>  	BPF_DEVCG_ACC_READ	= (1ULL << 1),
> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> index f15b826f9899..0de7d6b3af5b 100644
> --- a/kernel/trace/bpf_trace.c
> +++ b/kernel/trace/bpf_trace.c
> @@ -28,6 +28,10 @@
>  
>  #include <asm/tlb.h>
>  
> +#ifdef CONFIG_X86
> +#include <asm/fpu/context.h>
> +#endif
> +
>  #include "trace_probe.h"
>  #include "trace.h"
>  
> @@ -1166,6 +1170,148 @@ static const struct bpf_func_proto bpf_get_func_arg_cnt_proto = {
>  	.arg1_type	= ARG_PTR_TO_CTX,
>  };
>  
> +#define XMM_REG_SZ 16
> +
> +#define __xmm_space_off(regno)				\
> +	case BPF_GETREG_X86_XMM ## regno:		\
> +		xmm_space_off = regno * 16;		\
> +		break;
> +
> +static long getreg_read_xmm_fxsave(u32 reg, struct task_struct *tsk,
> +				   void *data)
> +{
> +	struct fxregs_state *fxsave;
> +	u32 xmm_space_off;
> +
> +	switch (reg) {
> +	__xmm_space_off(0);
> +	__xmm_space_off(1);
> +	__xmm_space_off(2);
> +	__xmm_space_off(3);
> +	__xmm_space_off(4);
> +	__xmm_space_off(5);
> +	__xmm_space_off(6);
> +	__xmm_space_off(7);
> +#ifdef	CONFIG_X86_64
> +	__xmm_space_off(8);
> +	__xmm_space_off(9);
> +	__xmm_space_off(10);
> +	__xmm_space_off(11);
> +	__xmm_space_off(12);
> +	__xmm_space_off(13);
> +	__xmm_space_off(14);
> +	__xmm_space_off(15);
> +#endif
> +	default:
> +		return -EINVAL;
> +	}
> +
> +	fxsave = &tsk->thread.fpu.fpstate->regs.fxsave;
> +	memcpy(data, (void *)&fxsave->xmm_space + xmm_space_off, XMM_REG_SZ);
> +	return 0;
> +}

Does any of this also need to be wrapped in CONFIG_X86? IIUC, everything in
struct thread_struct is arch specific, so I think this may fail to compile
on a number of other architectures. Per my suggestion below, maybe we
should just compile all of this logic out if we're not on x86, and update
bpf_get_reg_val() to only call bpf_read_sse_reg() on x86?

> +
> +#undef __xmm_space_off
> +
> +static bool getreg_is_xmm(u32 reg)
> +{
> +	return reg >= BPF_GETREG_X86_XMM0 && reg <= BPF_GETREG_X86_XMM15;

I think it's a bit confusing that we have a function here which confirms
that a register is xmm, but then we have ifdef CONFIG_X86_64 in large
switch statements in functions where we actually read the register and then
return -EINVAL.  Should we just update this to do the CONFIG_X6_64
preprocessor check, and then we can assume in getreg_read_xmm_fxsave() and
bpf_read_sse_reg() that the register is a valid xmm register, and avoid
having to do these switch statements at all? Note that this wouldn't change
the existing behavior at all, as we'd still be returning -EINVAL on 32-bit
x86 in either case.

> +}
> +
> +#define __bpf_sse_read(regno)							\
> +	case BPF_GETREG_X86_XMM ## regno:					\
> +		asm("movdqa %%xmm" #regno ", %0" : "=m"(*(char *)data));	\
> +		break;
> +
> +static long bpf_read_sse_reg(u32 reg, u32 flags, struct task_struct *tsk,
> +			     void *data)
> +{
> +#ifdef CONFIG_X86
> +	unsigned long irq_flags;
> +	long err;
> +
> +	switch (reg) {
> +	__bpf_sse_read(0);
> +	__bpf_sse_read(1);
> +	__bpf_sse_read(2);
> +	__bpf_sse_read(3);
> +	__bpf_sse_read(4);
> +	__bpf_sse_read(5);
> +	__bpf_sse_read(6);
> +	__bpf_sse_read(7);
> +#ifdef CONFIG_X86_64
> +	__bpf_sse_read(8);
> +	__bpf_sse_read(9);
> +	__bpf_sse_read(10);
> +	__bpf_sse_read(11);
> +	__bpf_sse_read(12);
> +	__bpf_sse_read(13);
> +	__bpf_sse_read(14);
> +	__bpf_sse_read(15);
> +#endif /* CONFIG_X86_64 */
> +	default:
> +		return -EINVAL;
> +	}
> +
> +	if (flags & BPF_GETREG_F_CURRENT)
> +		return 0;
> +
> +	if (!fpregs_state_valid(&tsk->thread.fpu, smp_processor_id())) {
> +		local_irq_save(irq_flags);
> +		err = getreg_read_xmm_fxsave(reg, tsk, data);
> +		local_irq_restore(irq_flags);
> +		return err;
> +	}

Should we move the checks for current and fpregs_state_valid() above where
we actually read the registers? That way we can avoid doing the xmm read if
we'd have to read the fxsave region anyways. Not sure if that's common in
practice or really necessary at all. What you have here seems fine as well.

> +
> +	return 0;
> +#else
> +	return -ENOENT;
> +#endif /* CONFIG_X86 */
> +}
> +
> +#undef __bpf_sse_read
> +
> +BPF_CALL_5(get_reg_val, void *, dst, u32, size,
> +	   u64, getreg_spec, struct pt_regs *, regs,
> +	   struct task_struct *, tsk)
> +{
> +	u32 reg, flags;
> +
> +	reg = (u32)(getreg_spec >> 32);
> +	flags = (u32)getreg_spec;
> +	if (reg >= __MAX_BPF_GETREG)
> +		return -EINVAL;
> +
> +	if (getreg_is_xmm(reg)) {
> +#ifndef CONFIG_X86
> +		return -ENOENT;

On CONFIG_X86 but !CONFIG_X86_64, we return -EINVAL if we try to access the
wrong xmm register. Should we just change this to be return -EINVAL to keep
the return value consistent between architectures? Or we should update the
32 bit x86 case to return -ENOENT as well, and probably update the last
return -EINVAL statement in the function to be return -ENOENT. In general,
I'd say that returning -ENOENT if a register is specified that's
< __MAX_BPF_GETREG seems like the most intuitive API.

> +#else

Is it necessary to have this ifdef check here if you also have it in
bpf_read_sse_reg()? Maybe it makes more sense to keep this preprocessor
check, and compile out bpf_read_sse_reg() altogether on other
architectures? I think that probably makes sense given that we likely also
want to wrap __bpf_sse_read() in an ifdef given that it emits x86 asm, and
getreg_read_xmm_fxsave() which relies on the x86 definition of struct
thread_struct.

> +		if (size != XMM_REG_SZ)
> +			return -EINVAL;
> +
> +		return bpf_read_sse_reg(reg, flags, tsk, dst);
> +	}
> +
> +	return -EINVAL;
> +#endif
> +}
> +
> +BTF_ID_LIST(bpf_get_reg_val_ids)
> +BTF_ID(struct, pt_regs)
> +
> +static const struct bpf_func_proto bpf_get_reg_val_proto = {
> +	.func	= get_reg_val,
> +	.ret_type	= RET_INTEGER,
> +	.arg1_type	= ARG_PTR_TO_UNINIT_MEM,
> +	.arg2_type	= ARG_CONST_SIZE,
> +	.arg3_type	= ARG_ANYTHING,
> +	.arg4_type	= ARG_PTR_TO_BTF_ID_OR_NULL,
> +	.arg4_btf_id	= &bpf_get_reg_val_ids[0],
> +	.arg5_type	= ARG_PTR_TO_BTF_ID_OR_NULL,
> +	.arg5_btf_id	= &btf_tracing_ids[BTF_TRACING_TYPE_TASK],
> +};
> +
>  static const struct bpf_func_proto *
>  bpf_tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
>  {
> @@ -1287,6 +1433,8 @@ bpf_tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
>  		return &bpf_find_vma_proto;
>  	case BPF_FUNC_trace_vprintk:
>  		return bpf_get_trace_vprintk_proto();
> +	case BPF_FUNC_get_reg_val:
> +		return &bpf_get_reg_val_proto;
>  	default:
>  		return bpf_base_func_proto(func_id);
>  	}
> diff --git a/kernel/trace/bpf_trace.h b/kernel/trace/bpf_trace.h
> index 9acbc11ac7bb..b4b55706c2dd 100644
> --- a/kernel/trace/bpf_trace.h
> +++ b/kernel/trace/bpf_trace.h
> @@ -29,6 +29,7 @@ TRACE_EVENT(bpf_trace_printk,
>  
>  #undef TRACE_INCLUDE_PATH
>  #define TRACE_INCLUDE_PATH .
> +#undef TRACE_INCLUDE_FILE
>  #define TRACE_INCLUDE_FILE bpf_trace
>  
>  #include <trace/define_trace.h>
> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> index 444fe6f1cf35..3ef8f683ed9e 100644
> --- a/tools/include/uapi/linux/bpf.h
> +++ b/tools/include/uapi/linux/bpf.h
> @@ -5154,6 +5154,18 @@ union bpf_attr {
>   *		if not NULL, is a reference which must be released using its
>   *		corresponding release function, or moved into a BPF map before
>   *		program exit.
> + *
> + * long bpf_get_reg_val(void *dst, u32 size, u64 getreg_spec, struct pt_regs *regs, struct task_struct *tsk)
> + *	Description
> + *		Store the value of a SSE register specified by *getreg_spec*
> + *		into memory region of size *size* specified by *dst*. *getreg_spec*
> + *		is a combination of BPF_GETREG enum AND BPF_GETREG_F flag e.g.
> + *		(BPF_GETREG_X86_XMM0 << 32) | BPF_GETREG_F_CURRENT.*
> + *	Return
> + *		0 on success
> + *		**-ENOENT** if the system architecture does not have requested reg
> + *		**-EINVAL** if *getreg_spec* is invalid
> + *		**-EINVAL** if *size* != bytes necessary to store requested reg val
>   */
>  #define __BPF_FUNC_MAPPER(FN)		\
>  	FN(unspec),			\
> @@ -5351,6 +5363,7 @@ union bpf_attr {
>  	FN(skb_set_tstamp),		\
>  	FN(ima_file_hash),		\
>  	FN(kptr_xchg),			\
> +	FN(get_reg_val),		\
>  	/* */
>  
>  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
> @@ -6318,6 +6331,33 @@ struct bpf_perf_event_value {
>  	__u64 running;
>  };
>  
> +/* bpf_get_reg_val register enum */
> +enum {
> +	BPF_GETREG_X86_XMM0 = 0,
> +	BPF_GETREG_X86_XMM1,
> +	BPF_GETREG_X86_XMM2,
> +	BPF_GETREG_X86_XMM3,
> +	BPF_GETREG_X86_XMM4,
> +	BPF_GETREG_X86_XMM5,
> +	BPF_GETREG_X86_XMM6,
> +	BPF_GETREG_X86_XMM7,
> +	BPF_GETREG_X86_XMM8,
> +	BPF_GETREG_X86_XMM9,
> +	BPF_GETREG_X86_XMM10,
> +	BPF_GETREG_X86_XMM11,
> +	BPF_GETREG_X86_XMM12,
> +	BPF_GETREG_X86_XMM13,
> +	BPF_GETREG_X86_XMM14,
> +	BPF_GETREG_X86_XMM15,
> +	__MAX_BPF_GETREG,
> +};
> +
> +/* bpf_get_reg_val flags */
> +enum {
> +	BPF_GETREG_F_NONE = 0,
> +	BPF_GETREG_F_CURRENT = (1U << 0),
> +};
> +
>  enum {
>  	BPF_DEVCG_ACC_MKNOD	= (1ULL << 0),
>  	BPF_DEVCG_ACC_READ	= (1ULL << 1),
> -- 
> 2.30.2
>