Subject: [PATCH bpf-next 1/2] bpf: switch most helper return values from 32-bit int to 64-bit long
From: Andrii Nakryiko @ 2020-06-17 20:21 UTC
  To: bpf, netdev, ast, daniel; +Cc: andrii.nakryiko, kernel-team, Andrii Nakryiko

Switch most BPF helper definitions from returning int to returning long.
These definitions come from comments in the BPF UAPI header and are used to
generate bpf_helper_defs.h (under libbpf), which is later included and used
from BPF programs.
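
For illustration, libbpf's generated bpf_helper_defs.h declares each helper
as a static function pointer initialized to the helper's ID (113 and 115
below correspond to the `call 113`/`call 115` instructions in the assembly
listings further down). This is only a rough sketch of what affected entries
look like after the switch; exact prototypes and formatting of the generated
file may differ:

  /* sketch of generated bpf_helper_defs.h entries, not verbatim output */
  static long (*bpf_probe_read_kernel)(void *dst, __u32 size,
                const void *unsafe_ptr) = (void *) 113;
  static long (*bpf_probe_read_kernel_str)(void *dst, __u32 size,
                const void *unsafe_ptr) = (void *) 115;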

In the actual in-kernel implementation, all helpers are defined as returning
u64, but for historical reasons most of them are declared in the UAPI as
returning int (usually to return 0 on success and a negative value on
error).

This often causes Clang to generate sub-optimal code, because the compiler
believes the return value is 32-bit and, in a lot of cases, has to up-convert
it (usually with a pair of 32-bit bit shifts) to a 64-bit value before it can
be used further in BPF code.

Besides just "polluting" the code, these 32-bit shifts quite often cause
problems in cases where the return value matters. This is especially true
for the family of bpf_probe_read_str() functions. There are a few other
similar helpers (e.g., bpf_read_branch_records()) whose return value is used
by BPF program logic to record variable-length data and process it. For such
cases, BPF program logic carefully manages offsets within some array or map
to read variable-length data. For such uses, it's crucial for the BPF
verifier to track the possible range of register values to prove that all
accesses happen within the given memory bounds. The extraneous
zero-extending bit shifts inserted by Clang (quite often interleaved with
other code, which makes the issue even more challenging and sometimes
requires employing extra per-variable compiler barriers) throw off the
verifier logic and make it mark registers as having an unknown variable
offset. We'll study this pattern a bit later below.

Another common pattern is to check the return value of a BPF helper for
non-zero state to detect error conditions and attempt alternative actions in
such cases. Even in this simple and straightforward case, the 32-bit vs BPF's
native 64-bit mismatch quite often leads to sub-optimal and unnecessary extra
code. We'll look at this pattern as well.

Clang's BPF target supports two modes of code generation: ALU32, in which it
is capable of using the lower 32-bit halves of registers, and no-ALU32, in
which only full 64-bit registers are used. ALU32 mode somewhat mitigates the
problems described above, but not in all cases.

This patch switches all the cases in which BPF helpers return 0 or a negative
error from returning int to returning long. It is shown below that such a
change in definition leads to equivalent or better code. No-ALU32 mode
benefits more, while ALU32 mode either doesn't degrade or still gets improved
code generation.

Another class of cases switched from int to long are the
bpf_probe_read_str()-like helpers, which encode the success case as a
non-negative value, while still returning a negative value for errors.

In all such cases, correctness is preserved due to the two's complement
encoding of negative values and the fact that all helpers return values whose
absolute value fits into 32 bits. Two's complement ensures that for negative
values the upper 32 bits are all ones, so truncation leaves a valid negative
32-bit value of the same magnitude. Non-negative values have the upper 32
bits set to zero and similarly preserve their value when the upper 32 bits
are truncated. This means that just casting to int/u32 is correct and
efficient (and in ALU32 mode doesn't require any extra shifts).
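
As a quick stand-alone illustration of this argument (not part of the patch),
a user-space snippet can show that truncating such 64-bit values to 32 bits
preserves both negative errors and small non-negative lengths:

  #include <stdio.h>
  #include <stdint.h>

  int main(void)
  {
      /* helper-style results: a negative error and a small length */
      int64_t err = -14;   /* e.g. -EFAULT */
      int64_t len = 128;   /* e.g. number of bytes copied */

      /* two's complement: upper 32 bits of a negative value are all ones,
       * so the truncated value is the same negative number */
      int32_t err32 = (int32_t)err;
      /* non-negative values fitting into 32 bits truncate losslessly too */
      int32_t len32 = (int32_t)len;

      printf("err=%lld err32=%d\n", (long long)err, err32); /* -14 -14 */
      printf("len=%lld len32=%d\n", (long long)len, len32); /* 128 128 */
      return 0;
  }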

To minimize the chances of regressions, the two code patterns mentioned above
were investigated. For both patterns, BPF assembly was analyzed in
ALU32/NO-ALU32 compiler modes, both with the current 32-bit int return type
and the new 64-bit long return type.

Case 1. Variable-length data reading and concatenation. This is a quite
ubiquitous pattern in tracing/monitoring applications, reading data like a
process's environment variables, file path, etc. In such cases, many pieces
of string-like variable-length data are read into a single big buffer, and at
the end of the process, only the part of the array containing actual data is
sent to user-space for further processing. This case is tested in the
test_varlen.c selftest (in the next patch). Code flow is roughly as follows:

  void *payload = &sample->payload;
  u64 len;

  len = bpf_probe_read_kernel_str(payload, MAX_SZ1, &source_data1);
  if (len <= MAX_SZ1) {
      payload += len;
      sample->len1 = len;
  }
  len = bpf_probe_read_kernel_str(payload, MAX_SZ2, &source_data2);
  if (len <= MAX_SZ2) {
      payload += len;
      sample->len2 = len;
  }
  /* and so on */
  sample->total_len = payload - &sample->payload;
  /* send over, e.g., perf buffer */

There could be two variations with slightly different code generated: when
len is a 64-bit integer and when it is a 32-bit integer. Both variations were
analyzed. The BPF assembly instructions between two successive invocations of
bpf_probe_read_kernel_str() were used to check for code regressions. Results
are below, followed by a short analysis. The left side uses helpers with the
int return type, the right one is after the switch to long.

ALU32 + INT                                ALU32 + LONG
===========                                ============

64-BIT (13 insns):                         64-BIT (10 insns):
------------------------------------       ------------------------------------
  17:   call 115                             17:   call 115
  18:   if w0 > 256 goto +9 <LBB0_4>         18:   if r0 > 256 goto +6 <LBB0_4>
  19:   w1 = w0                              19:   r1 = 0 ll
  20:   r1 <<= 32                            21:   *(u64 *)(r1 + 0) = r0
  21:   r1 s>>= 32                           22:   r6 = 0 ll
  22:   r2 = 0 ll                            24:   r6 += r0
  24:   *(u64 *)(r2 + 0) = r1              00000000000000c8 <LBB0_4>:
  25:   r6 = 0 ll                            25:   r1 = r6
  27:   r6 += r1                             26:   w2 = 256
00000000000000e0 <LBB0_4>:                   27:   r3 = 0 ll
  28:   r1 = r6                              29:   call 115
  29:   w2 = 256
  30:   r3 = 0 ll
  32:   call 115

32-BIT (11 insns):                         32-BIT (12 insns):
------------------------------------       ------------------------------------
  17:   call 115                             17:   call 115
  18:   if w0 > 256 goto +7 <LBB1_4>         18:   if w0 > 256 goto +8 <LBB1_4>
  19:   r1 = 0 ll                            19:   r1 = 0 ll
  21:   *(u32 *)(r1 + 0) = r0                21:   *(u32 *)(r1 + 0) = r0
  22:   w1 = w0                              22:   r0 <<= 32
  23:   r6 = 0 ll                            23:   r0 >>= 32
  25:   r6 += r1                             24:   r6 = 0 ll
00000000000000d0 <LBB1_4>:                   26:   r6 += r0
  26:   r1 = r6                            00000000000000d8 <LBB1_4>:
  27:   w2 = 256                             27:   r1 = r6
  28:   r3 = 0 ll                            28:   w2 = 256
  30:   call 115                             29:   r3 = 0 ll
                                             31:   call 115

In ALU32 mode, the variant using a 64-bit length variable clearly wins and
avoids unnecessary zero-extension bit shifts. In practice, this is even more
important, because the BPF code won't need to do extra checks to "prove" that
payload/len are within good bounds.

The 32-bit len variant is one instruction longer. Clang decided to do the
64-to-32 cast with two bit shifts instead of the equivalent `w1 = w0`
assignment. The former uses an extra register. The latter might potentially
lose some range information, but not for a 32-bit value. So in this case, the
verifier infers that r0 is [0, 256] after the check at 18:, and shifting 32
bits left/right keeps that range intact. We should probably look into Clang's
logic and see why it chooses bit shifts over sub-register assignments for
this.

NO-ALU32 + INT                             NO-ALU32 + LONG
==============                             ===============

64-BIT (14 insns):                         64-BIT (10 insns):
------------------------------------       ------------------------------------
  17:   call 115                             17:   call 115
  18:   r0 <<= 32                            18:   if r0 > 256 goto +6 <LBB0_4>
  19:   r1 = r0                              19:   r1 = 0 ll
  20:   r1 >>= 32                            21:   *(u64 *)(r1 + 0) = r0
  21:   if r1 > 256 goto +7 <LBB0_4>         22:   r6 = 0 ll
  22:   r0 s>>= 32                           24:   r6 += r0
  23:   r1 = 0 ll                          00000000000000c8 <LBB0_4>:
  25:   *(u64 *)(r1 + 0) = r0                25:   r1 = r6
  26:   r6 = 0 ll                            26:   r2 = 256
  28:   r6 += r0                             27:   r3 = 0 ll
00000000000000e8 <LBB0_4>:                   29:   call 115
  29:   r1 = r6
  30:   r2 = 256
  31:   r3 = 0 ll
  33:   call 115

32-BIT (13 insns):                         32-BIT (13 insns):
------------------------------------       ------------------------------------
  17:   call 115                             17:   call 115
  18:   r1 = r0                              18:   r1 = r0
  19:   r1 <<= 32                            19:   r1 <<= 32
  20:   r1 >>= 32                            20:   r1 >>= 32
  21:   if r1 > 256 goto +6 <LBB1_4>         21:   if r1 > 256 goto +6 <LBB1_4>
  22:   r2 = 0 ll                            22:   r2 = 0 ll
  24:   *(u32 *)(r2 + 0) = r0                24:   *(u32 *)(r2 + 0) = r0
  25:   r6 = 0 ll                            25:   r6 = 0 ll
  27:   r6 += r1                             27:   r6 += r1
00000000000000e0 <LBB1_4>:                 00000000000000e0 <LBB1_4>:
  28:   r1 = r6                              28:   r1 = r6
  29:   r2 = 256                             29:   r2 = 256
  30:   r3 = 0 ll                            30:   r3 = 0 ll
  32:   call 115                             32:   call 115

In NO-ALU32 mode, for the case of a 64-bit len variable, Clang generates much
better code, as expected, eliminating the unnecessary bit shifts. For 32-bit
len, the code is identical.

So overall, only the ALU32 32-bit len case is more-or-less equivalent, and
the difference stems from an internal Clang decision, rather than from the
compiler lacking enough information about types.

Case 2. Let's look at the simpler case of checking the return value of a BPF
helper for errors. The code is very simple:

  long bla;
  if (bpf_probe_read_kernel(&bla, sizeof(bla), 0))
      return 1;
  else
      return 0;

ALU32 + CHECK (9 insns)                    ALU32 + CHECK (9 insns)
====================================       ====================================
  0:    r1 = r10                             0:    r1 = r10
  1:    r1 += -8                             1:    r1 += -8
  2:    w2 = 8                               2:    w2 = 8
  3:    r3 = 0                               3:    r3 = 0
  4:    call 113                             4:    call 113
  5:    w1 = w0                              5:    r1 = r0
  6:    w0 = 1                               6:    w0 = 1
  7:    if w1 != 0 goto +1 <LBB2_2>          7:    if r1 != 0 goto +1 <LBB2_2>
  8:    w0 = 0                               8:    w0 = 0
0000000000000048 <LBB2_2>:                 0000000000000048 <LBB2_2>:
  9:    exit                                 9:    exit

The code is almost identical; the only difference is the use of a full
register assignment (r1 = r0) vs half-registers (w1 = w0) in instruction #5.
On 32-bit architectures, the new BPF assembly might be slightly less optimal,
in theory. But one can argue that's not a big issue, given that the use of
full registers is still prevalent (e.g., for parameter passing).

NO-ALU32 + CHECK (11 insns)                NO-ALU32 + CHECK (9 insns)
====================================       ====================================
  0:    r1 = r10                             0:    r1 = r10
  1:    r1 += -8                             1:    r1 += -8
  2:    r2 = 8                               2:    r2 = 8
  3:    r3 = 0                               3:    r3 = 0
  4:    call 113                             4:    call 113
  5:    r1 = r0                              5:    r1 = r0
  6:    r1 <<= 32                            6:    r0 = 1
  7:    r1 >>= 32                            7:    if r1 != 0 goto +1 <LBB2_2>
  8:    r0 = 1                               8:    r0 = 0
  9:    if r1 != 0 goto +1 <LBB2_2>        0000000000000048 <LBB2_2>:
 10:    r0 = 0                               9:    exit
0000000000000058 <LBB2_2>:
 11:    exit

The NO-ALU32 case is a clear improvement, getting rid of the unnecessary
zero-extension bit shifts.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
---
 include/uapi/linux/bpf.h       | 192 ++++++++++++++++-----------------
 tools/include/uapi/linux/bpf.h | 192 ++++++++++++++++-----------------
 2 files changed, 192 insertions(+), 192 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 19684813faae..be0efee49093 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -653,7 +653,7 @@ union bpf_attr {
  * 		Map value associated to *key*, or **NULL** if no entry was
  * 		found.
  *
- * int bpf_map_update_elem(struct bpf_map *map, const void *key, const void *value, u64 flags)
+ * long bpf_map_update_elem(struct bpf_map *map, const void *key, const void *value, u64 flags)
  * 	Description
  * 		Add or update the value of the entry associated to *key* in
  * 		*map* with *value*. *flags* is one of:
@@ -671,13 +671,13 @@ union bpf_attr {
  * 	Return
  * 		0 on success, or a negative error in case of failure.
  *
- * int bpf_map_delete_elem(struct bpf_map *map, const void *key)
+ * long bpf_map_delete_elem(struct bpf_map *map, const void *key)
  * 	Description
  * 		Delete entry with *key* from *map*.
  * 	Return
  * 		0 on success, or a negative error in case of failure.
  *
- * int bpf_probe_read(void *dst, u32 size, const void *unsafe_ptr)
+ * long bpf_probe_read(void *dst, u32 size, const void *unsafe_ptr)
  * 	Description
  * 		For tracing programs, safely attempt to read *size* bytes from
  * 		kernel space address *unsafe_ptr* and store the data in *dst*.
@@ -695,7 +695,7 @@ union bpf_attr {
  * 	Return
  * 		Current *ktime*.
  *
- * int bpf_trace_printk(const char *fmt, u32 fmt_size, ...)
+ * long bpf_trace_printk(const char *fmt, u32 fmt_size, ...)
  * 	Description
  * 		This helper is a "printk()-like" facility for debugging. It
  * 		prints a message defined by format *fmt* (of size *fmt_size*)
@@ -775,7 +775,7 @@ union bpf_attr {
  * 	Return
  * 		The SMP id of the processor running the program.
  *
- * int bpf_skb_store_bytes(struct sk_buff *skb, u32 offset, const void *from, u32 len, u64 flags)
+ * long bpf_skb_store_bytes(struct sk_buff *skb, u32 offset, const void *from, u32 len, u64 flags)
  * 	Description
  * 		Store *len* bytes from address *from* into the packet
  * 		associated to *skb*, at *offset*. *flags* are a combination of
@@ -792,7 +792,7 @@ union bpf_attr {
  * 	Return
  * 		0 on success, or a negative error in case of failure.
  *
- * int bpf_l3_csum_replace(struct sk_buff *skb, u32 offset, u64 from, u64 to, u64 size)
+ * long bpf_l3_csum_replace(struct sk_buff *skb, u32 offset, u64 from, u64 to, u64 size)
  * 	Description
  * 		Recompute the layer 3 (e.g. IP) checksum for the packet
  * 		associated to *skb*. Computation is incremental, so the helper
@@ -817,7 +817,7 @@ union bpf_attr {
  * 	Return
  * 		0 on success, or a negative error in case of failure.
  *
- * int bpf_l4_csum_replace(struct sk_buff *skb, u32 offset, u64 from, u64 to, u64 flags)
+ * long bpf_l4_csum_replace(struct sk_buff *skb, u32 offset, u64 from, u64 to, u64 flags)
  * 	Description
  * 		Recompute the layer 4 (e.g. TCP, UDP or ICMP) checksum for the
  * 		packet associated to *skb*. Computation is incremental, so the
@@ -849,7 +849,7 @@ union bpf_attr {
  * 	Return
  * 		0 on success, or a negative error in case of failure.
  *
- * int bpf_tail_call(void *ctx, struct bpf_map *prog_array_map, u32 index)
+ * long bpf_tail_call(void *ctx, struct bpf_map *prog_array_map, u32 index)
  * 	Description
  * 		This special helper is used to trigger a "tail call", or in
  * 		other words, to jump into another eBPF program. The same stack
@@ -880,7 +880,7 @@ union bpf_attr {
  * 	Return
  * 		0 on success, or a negative error in case of failure.
  *
- * int bpf_clone_redirect(struct sk_buff *skb, u32 ifindex, u64 flags)
+ * long bpf_clone_redirect(struct sk_buff *skb, u32 ifindex, u64 flags)
  * 	Description
  * 		Clone and redirect the packet associated to *skb* to another
  * 		net device of index *ifindex*. Both ingress and egress
@@ -916,7 +916,7 @@ union bpf_attr {
  * 		A 64-bit integer containing the current GID and UID, and
  * 		created as such: *current_gid* **<< 32 \|** *current_uid*.
  *
- * int bpf_get_current_comm(void *buf, u32 size_of_buf)
+ * long bpf_get_current_comm(void *buf, u32 size_of_buf)
  * 	Description
  * 		Copy the **comm** attribute of the current task into *buf* of
  * 		*size_of_buf*. The **comm** attribute contains the name of
@@ -953,7 +953,7 @@ union bpf_attr {
  * 	Return
  * 		The classid, or 0 for the default unconfigured classid.
  *
- * int bpf_skb_vlan_push(struct sk_buff *skb, __be16 vlan_proto, u16 vlan_tci)
+ * long bpf_skb_vlan_push(struct sk_buff *skb, __be16 vlan_proto, u16 vlan_tci)
  * 	Description
  * 		Push a *vlan_tci* (VLAN tag control information) of protocol
  * 		*vlan_proto* to the packet associated to *skb*, then update
@@ -969,7 +969,7 @@ union bpf_attr {
  * 	Return
  * 		0 on success, or a negative error in case of failure.
  *
- * int bpf_skb_vlan_pop(struct sk_buff *skb)
+ * long bpf_skb_vlan_pop(struct sk_buff *skb)
  * 	Description
  * 		Pop a VLAN header from the packet associated to *skb*.
  *
@@ -981,7 +981,7 @@ union bpf_attr {
  * 	Return
  * 		0 on success, or a negative error in case of failure.
  *
- * int bpf_skb_get_tunnel_key(struct sk_buff *skb, struct bpf_tunnel_key *key, u32 size, u64 flags)
+ * long bpf_skb_get_tunnel_key(struct sk_buff *skb, struct bpf_tunnel_key *key, u32 size, u64 flags)
  * 	Description
  * 		Get tunnel metadata. This helper takes a pointer *key* to an
  * 		empty **struct bpf_tunnel_key** of **size**, that will be
@@ -1032,7 +1032,7 @@ union bpf_attr {
  * 	Return
  * 		0 on success, or a negative error in case of failure.
  *
- * int bpf_skb_set_tunnel_key(struct sk_buff *skb, struct bpf_tunnel_key *key, u32 size, u64 flags)
+ * long bpf_skb_set_tunnel_key(struct sk_buff *skb, struct bpf_tunnel_key *key, u32 size, u64 flags)
  * 	Description
  * 		Populate tunnel metadata for packet associated to *skb.* The
  * 		tunnel metadata is set to the contents of *key*, of *size*. The
@@ -1098,7 +1098,7 @@ union bpf_attr {
  * 		The value of the perf event counter read from the map, or a
  * 		negative error code in case of failure.
  *
- * int bpf_redirect(u32 ifindex, u64 flags)
+ * long bpf_redirect(u32 ifindex, u64 flags)
  * 	Description
  * 		Redirect the packet to another net device of index *ifindex*.
  * 		This helper is somewhat similar to **bpf_clone_redirect**\
@@ -1145,7 +1145,7 @@ union bpf_attr {
  * 		The realm of the route for the packet associated to *skb*, or 0
  * 		if none was found.
  *
- * int bpf_perf_event_output(void *ctx, struct bpf_map *map, u64 flags, void *data, u64 size)
+ * long bpf_perf_event_output(void *ctx, struct bpf_map *map, u64 flags, void *data, u64 size)
  * 	Description
  * 		Write raw *data* blob into a special BPF perf event held by
  * 		*map* of type **BPF_MAP_TYPE_PERF_EVENT_ARRAY**. This perf
@@ -1190,7 +1190,7 @@ union bpf_attr {
  * 	Return
  * 		0 on success, or a negative error in case of failure.
  *
- * int bpf_skb_load_bytes(const void *skb, u32 offset, void *to, u32 len)
+ * long bpf_skb_load_bytes(const void *skb, u32 offset, void *to, u32 len)
  * 	Description
  * 		This helper was provided as an easy way to load data from a
  * 		packet. It can be used to load *len* bytes from *offset* from
@@ -1207,7 +1207,7 @@ union bpf_attr {
  * 	Return
  * 		0 on success, or a negative error in case of failure.
  *
- * int bpf_get_stackid(void *ctx, struct bpf_map *map, u64 flags)
+ * long bpf_get_stackid(void *ctx, struct bpf_map *map, u64 flags)
  * 	Description
  * 		Walk a user or a kernel stack and return its id. To achieve
  * 		this, the helper needs *ctx*, which is a pointer to the context
@@ -1276,7 +1276,7 @@ union bpf_attr {
  * 		The checksum result, or a negative error code in case of
  * 		failure.
  *
- * int bpf_skb_get_tunnel_opt(struct sk_buff *skb, void *opt, u32 size)
+ * long bpf_skb_get_tunnel_opt(struct sk_buff *skb, void *opt, u32 size)
  * 	Description
  * 		Retrieve tunnel options metadata for the packet associated to
  * 		*skb*, and store the raw tunnel option data to the buffer *opt*
@@ -1294,7 +1294,7 @@ union bpf_attr {
  * 	Return
  * 		The size of the option data retrieved.
  *
- * int bpf_skb_set_tunnel_opt(struct sk_buff *skb, void *opt, u32 size)
+ * long bpf_skb_set_tunnel_opt(struct sk_buff *skb, void *opt, u32 size)
  * 	Description
  * 		Set tunnel options metadata for the packet associated to *skb*
  * 		to the option data contained in the raw buffer *opt* of *size*.
@@ -1304,7 +1304,7 @@ union bpf_attr {
  * 	Return
  * 		0 on success, or a negative error in case of failure.
  *
- * int bpf_skb_change_proto(struct sk_buff *skb, __be16 proto, u64 flags)
+ * long bpf_skb_change_proto(struct sk_buff *skb, __be16 proto, u64 flags)
  * 	Description
  * 		Change the protocol of the *skb* to *proto*. Currently
  * 		supported are transition from IPv4 to IPv6, and from IPv6 to
@@ -1331,7 +1331,7 @@ union bpf_attr {
  * 	Return
  * 		0 on success, or a negative error in case of failure.
  *
- * int bpf_skb_change_type(struct sk_buff *skb, u32 type)
+ * long bpf_skb_change_type(struct sk_buff *skb, u32 type)
  * 	Description
  * 		Change the packet type for the packet associated to *skb*. This
  * 		comes down to setting *skb*\ **->pkt_type** to *type*, except
@@ -1358,7 +1358,7 @@ union bpf_attr {
  * 	Return
  * 		0 on success, or a negative error in case of failure.
  *
- * int bpf_skb_under_cgroup(struct sk_buff *skb, struct bpf_map *map, u32 index)
+ * long bpf_skb_under_cgroup(struct sk_buff *skb, struct bpf_map *map, u32 index)
  * 	Description
  * 		Check whether *skb* is a descendant of the cgroup2 held by
  * 		*map* of type **BPF_MAP_TYPE_CGROUP_ARRAY**, at *index*.
@@ -1389,7 +1389,7 @@ union bpf_attr {
  * 	Return
  * 		A pointer to the current task struct.
  *
- * int bpf_probe_write_user(void *dst, const void *src, u32 len)
+ * long bpf_probe_write_user(void *dst, const void *src, u32 len)
  * 	Description
  * 		Attempt in a safe way to write *len* bytes from the buffer
  * 		*src* to *dst* in memory. It only works for threads that are in
@@ -1408,7 +1408,7 @@ union bpf_attr {
  * 	Return
  * 		0 on success, or a negative error in case of failure.
  *
- * int bpf_current_task_under_cgroup(struct bpf_map *map, u32 index)
+ * long bpf_current_task_under_cgroup(struct bpf_map *map, u32 index)
  * 	Description
  * 		Check whether the probe is being run is the context of a given
  * 		subset of the cgroup2 hierarchy. The cgroup2 to test is held by
@@ -1420,7 +1420,7 @@ union bpf_attr {
  * 		* 1, if the *skb* task does not belong to the cgroup2.
  * 		* A negative error code, if an error occurred.
  *
- * int bpf_skb_change_tail(struct sk_buff *skb, u32 len, u64 flags)
+ * long bpf_skb_change_tail(struct sk_buff *skb, u32 len, u64 flags)
  * 	Description
  * 		Resize (trim or grow) the packet associated to *skb* to the
  * 		new *len*. The *flags* are reserved for future usage, and must
@@ -1444,7 +1444,7 @@ union bpf_attr {
  * 	Return
  * 		0 on success, or a negative error in case of failure.
  *
- * int bpf_skb_pull_data(struct sk_buff *skb, u32 len)
+ * long bpf_skb_pull_data(struct sk_buff *skb, u32 len)
  * 	Description
  * 		Pull in non-linear data in case the *skb* is non-linear and not
  * 		all of *len* are part of the linear section. Make *len* bytes
@@ -1500,7 +1500,7 @@ union bpf_attr {
  * 		recalculation the next time the kernel tries to access this
  * 		hash or when the **bpf_get_hash_recalc**\ () helper is called.
  *
- * int bpf_get_numa_node_id(void)
+ * long bpf_get_numa_node_id(void)
  * 	Description
  * 		Return the id of the current NUMA node. The primary use case
  * 		for this helper is the selection of sockets for the local NUMA
@@ -1511,7 +1511,7 @@ union bpf_attr {
  * 	Return
  * 		The id of current NUMA node.
  *
- * int bpf_skb_change_head(struct sk_buff *skb, u32 len, u64 flags)
+ * long bpf_skb_change_head(struct sk_buff *skb, u32 len, u64 flags)
  * 	Description
  * 		Grows headroom of packet associated to *skb* and adjusts the
  * 		offset of the MAC header accordingly, adding *len* bytes of
@@ -1532,7 +1532,7 @@ union bpf_attr {
  * 	Return
  * 		0 on success, or a negative error in case of failure.
  *
- * int bpf_xdp_adjust_head(struct xdp_buff *xdp_md, int delta)
+ * long bpf_xdp_adjust_head(struct xdp_buff *xdp_md, int delta)
  * 	Description
  * 		Adjust (move) *xdp_md*\ **->data** by *delta* bytes. Note that
  * 		it is possible to use a negative value for *delta*. This helper
@@ -1547,7 +1547,7 @@ union bpf_attr {
  * 	Return
  * 		0 on success, or a negative error in case of failure.
  *
- * int bpf_probe_read_str(void *dst, u32 size, const void *unsafe_ptr)
+ * long bpf_probe_read_str(void *dst, u32 size, const void *unsafe_ptr)
  * 	Description
  * 		Copy a NUL terminated string from an unsafe kernel address
  * 		*unsafe_ptr* to *dst*. See **bpf_probe_read_kernel_str**\ () for
@@ -1595,14 +1595,14 @@ union bpf_attr {
  * 		is returned (note that **overflowuid** might also be the actual
  * 		UID value for the socket).
  *
- * u32 bpf_set_hash(struct sk_buff *skb, u32 hash)
+ * long bpf_set_hash(struct sk_buff *skb, u32 hash)
  * 	Description
  * 		Set the full hash for *skb* (set the field *skb*\ **->hash**)
  * 		to value *hash*.
  * 	Return
  * 		0
  *
- * int bpf_setsockopt(void *bpf_socket, int level, int optname, void *optval, int optlen)
+ * long bpf_setsockopt(void *bpf_socket, int level, int optname, void *optval, int optlen)
  * 	Description
  * 		Emulate a call to **setsockopt()** on the socket associated to
  * 		*bpf_socket*, which must be a full socket. The *level* at
@@ -1630,7 +1630,7 @@ union bpf_attr {
  * 	Return
  * 		0 on success, or a negative error in case of failure.
  *
- * int bpf_skb_adjust_room(struct sk_buff *skb, s32 len_diff, u32 mode, u64 flags)
+ * long bpf_skb_adjust_room(struct sk_buff *skb, s32 len_diff, u32 mode, u64 flags)
  * 	Description
  * 		Grow or shrink the room for data in the packet associated to
  * 		*skb* by *len_diff*, and according to the selected *mode*.
@@ -1676,7 +1676,7 @@ union bpf_attr {
  * 	Return
  * 		0 on success, or a negative error in case of failure.
  *
- * int bpf_redirect_map(struct bpf_map *map, u32 key, u64 flags)
+ * long bpf_redirect_map(struct bpf_map *map, u32 key, u64 flags)
  * 	Description
  * 		Redirect the packet to the endpoint referenced by *map* at
  * 		index *key*. Depending on its type, this *map* can contain
@@ -1697,7 +1697,7 @@ union bpf_attr {
  * 		**XDP_REDIRECT** on success, or the value of the two lower bits
  * 		of the *flags* argument on error.
  *
- * int bpf_sk_redirect_map(struct sk_buff *skb, struct bpf_map *map, u32 key, u64 flags)
+ * long bpf_sk_redirect_map(struct sk_buff *skb, struct bpf_map *map, u32 key, u64 flags)
  * 	Description
  * 		Redirect the packet to the socket referenced by *map* (of type
  * 		**BPF_MAP_TYPE_SOCKMAP**) at index *key*. Both ingress and
@@ -1708,7 +1708,7 @@ union bpf_attr {
  * 	Return
  * 		**SK_PASS** on success, or **SK_DROP** on error.
  *
- * int bpf_sock_map_update(struct bpf_sock_ops *skops, struct bpf_map *map, void *key, u64 flags)
+ * long bpf_sock_map_update(struct bpf_sock_ops *skops, struct bpf_map *map, void *key, u64 flags)
  * 	Description
  * 		Add an entry to, or update a *map* referencing sockets. The
  * 		*skops* is used as a new value for the entry associated to
@@ -1727,7 +1727,7 @@ union bpf_attr {
  * 	Return
  * 		0 on success, or a negative error in case of failure.
  *
- * int bpf_xdp_adjust_meta(struct xdp_buff *xdp_md, int delta)
+ * long bpf_xdp_adjust_meta(struct xdp_buff *xdp_md, int delta)
  * 	Description
  * 		Adjust the address pointed by *xdp_md*\ **->data_meta** by
  * 		*delta* (which can be positive or negative). Note that this
@@ -1756,7 +1756,7 @@ union bpf_attr {
  * 	Return
  * 		0 on success, or a negative error in case of failure.
  *
- * int bpf_perf_event_read_value(struct bpf_map *map, u64 flags, struct bpf_perf_event_value *buf, u32 buf_size)
+ * long bpf_perf_event_read_value(struct bpf_map *map, u64 flags, struct bpf_perf_event_value *buf, u32 buf_size)
  * 	Description
  * 		Read the value of a perf event counter, and store it into *buf*
  * 		of size *buf_size*. This helper relies on a *map* of type
@@ -1806,7 +1806,7 @@ union bpf_attr {
  * 	Return
  * 		0 on success, or a negative error in case of failure.
  *
- * int bpf_perf_prog_read_value(struct bpf_perf_event_data *ctx, struct bpf_perf_event_value *buf, u32 buf_size)
+ * long bpf_perf_prog_read_value(struct bpf_perf_event_data *ctx, struct bpf_perf_event_value *buf, u32 buf_size)
  * 	Description
  * 		For en eBPF program attached to a perf event, retrieve the
  * 		value of the event counter associated to *ctx* and store it in
@@ -1817,7 +1817,7 @@ union bpf_attr {
  * 	Return
  * 		0 on success, or a negative error in case of failure.
  *
- * int bpf_getsockopt(void *bpf_socket, int level, int optname, void *optval, int optlen)
+ * long bpf_getsockopt(void *bpf_socket, int level, int optname, void *optval, int optlen)
  * 	Description
  * 		Emulate a call to **getsockopt()** on the socket associated to
  * 		*bpf_socket*, which must be a full socket. The *level* at
@@ -1842,7 +1842,7 @@ union bpf_attr {
  * 	Return
  * 		0 on success, or a negative error in case of failure.
  *
- * int bpf_override_return(struct pt_regs *regs, u64 rc)
+ * long bpf_override_return(struct pt_regs *regs, u64 rc)
  * 	Description
  * 		Used for error injection, this helper uses kprobes to override
  * 		the return value of the probed function, and to set it to *rc*.
@@ -1867,7 +1867,7 @@ union bpf_attr {
  * 	Return
  * 		0
  *
- * int bpf_sock_ops_cb_flags_set(struct bpf_sock_ops *bpf_sock, int argval)
+ * long bpf_sock_ops_cb_flags_set(struct bpf_sock_ops *bpf_sock, int argval)
  * 	Description
  * 		Attempt to set the value of the **bpf_sock_ops_cb_flags** field
  * 		for the full TCP socket associated to *bpf_sock_ops* to
@@ -1911,7 +1911,7 @@ union bpf_attr {
  * 		be set is returned (which comes down to 0 if all bits were set
  * 		as required).
  *
- * int bpf_msg_redirect_map(struct sk_msg_buff *msg, struct bpf_map *map, u32 key, u64 flags)
+ * long bpf_msg_redirect_map(struct sk_msg_buff *msg, struct bpf_map *map, u32 key, u64 flags)
  * 	Description
  * 		This helper is used in programs implementing policies at the
  * 		socket level. If the message *msg* is allowed to pass (i.e. if
@@ -1925,7 +1925,7 @@ union bpf_attr {
  * 	Return
  * 		**SK_PASS** on success, or **SK_DROP** on error.
  *
- * int bpf_msg_apply_bytes(struct sk_msg_buff *msg, u32 bytes)
+ * long bpf_msg_apply_bytes(struct sk_msg_buff *msg, u32 bytes)
  * 	Description
  * 		For socket policies, apply the verdict of the eBPF program to
  * 		the next *bytes* (number of bytes) of message *msg*.
@@ -1959,7 +1959,7 @@ union bpf_attr {
  * 	Return
  * 		0
  *
- * int bpf_msg_cork_bytes(struct sk_msg_buff *msg, u32 bytes)
+ * long bpf_msg_cork_bytes(struct sk_msg_buff *msg, u32 bytes)
  * 	Description
  * 		For socket policies, prevent the execution of the verdict eBPF
  * 		program for message *msg* until *bytes* (byte number) have been
@@ -1977,7 +1977,7 @@ union bpf_attr {
  * 	Return
  * 		0
  *
- * int bpf_msg_pull_data(struct sk_msg_buff *msg, u32 start, u32 end, u64 flags)
+ * long bpf_msg_pull_data(struct sk_msg_buff *msg, u32 start, u32 end, u64 flags)
  * 	Description
  * 		For socket policies, pull in non-linear data from user space
  * 		for *msg* and set pointers *msg*\ **->data** and *msg*\
@@ -2008,7 +2008,7 @@ union bpf_attr {
  * 	Return
  * 		0 on success, or a negative error in case of failure.
  *
- * int bpf_bind(struct bpf_sock_addr *ctx, struct sockaddr *addr, int addr_len)
+ * long bpf_bind(struct bpf_sock_addr *ctx, struct sockaddr *addr, int addr_len)
  * 	Description
  * 		Bind the socket associated to *ctx* to the address pointed by
  * 		*addr*, of length *addr_len*. This allows for making outgoing
@@ -2026,7 +2026,7 @@ union bpf_attr {
  * 	Return
  * 		0 on success, or a negative error in case of failure.
  *
- * int bpf_xdp_adjust_tail(struct xdp_buff *xdp_md, int delta)
+ * long bpf_xdp_adjust_tail(struct xdp_buff *xdp_md, int delta)
  * 	Description
  * 		Adjust (move) *xdp_md*\ **->data_end** by *delta* bytes. It is
  * 		possible to both shrink and grow the packet tail.
@@ -2040,7 +2040,7 @@ union bpf_attr {
  * 	Return
  * 		0 on success, or a negative error in case of failure.
  *
- * int bpf_skb_get_xfrm_state(struct sk_buff *skb, u32 index, struct bpf_xfrm_state *xfrm_state, u32 size, u64 flags)
+ * long bpf_skb_get_xfrm_state(struct sk_buff *skb, u32 index, struct bpf_xfrm_state *xfrm_state, u32 size, u64 flags)
  * 	Description
  * 		Retrieve the XFRM state (IP transform framework, see also
  * 		**ip-xfrm(8)**) at *index* in XFRM "security path" for *skb*.
@@ -2056,7 +2056,7 @@ union bpf_attr {
  * 	Return
  * 		0 on success, or a negative error in case of failure.
  *
- * int bpf_get_stack(void *ctx, void *buf, u32 size, u64 flags)
+ * long bpf_get_stack(void *ctx, void *buf, u32 size, u64 flags)
  * 	Description
  * 		Return a user or a kernel stack in bpf program provided buffer.
  * 		To achieve this, the helper needs *ctx*, which is a pointer
@@ -2089,7 +2089,7 @@ union bpf_attr {
  * 		A non-negative value equal to or less than *size* on success,
  * 		or a negative error in case of failure.
  *
- * int bpf_skb_load_bytes_relative(const void *skb, u32 offset, void *to, u32 len, u32 start_header)
+ * long bpf_skb_load_bytes_relative(const void *skb, u32 offset, void *to, u32 len, u32 start_header)
  * 	Description
  * 		This helper is similar to **bpf_skb_load_bytes**\ () in that
  * 		it provides an easy way to load *len* bytes from *offset*
@@ -2111,7 +2111,7 @@ union bpf_attr {
  * 	Return
  * 		0 on success, or a negative error in case of failure.
  *
- * int bpf_fib_lookup(void *ctx, struct bpf_fib_lookup *params, int plen, u32 flags)
+ * long bpf_fib_lookup(void *ctx, struct bpf_fib_lookup *params, int plen, u32 flags)
  *	Description
  *		Do FIB lookup in kernel tables using parameters in *params*.
  *		If lookup is successful and result shows packet is to be
@@ -2142,7 +2142,7 @@ union bpf_attr {
  *		* > 0 one of **BPF_FIB_LKUP_RET_** codes explaining why the
  *		  packet is not forwarded or needs assist from full stack
  *
- * int bpf_sock_hash_update(struct bpf_sock_ops *skops, struct bpf_map *map, void *key, u64 flags)
+ * long bpf_sock_hash_update(struct bpf_sock_ops *skops, struct bpf_map *map, void *key, u64 flags)
  *	Description
  *		Add an entry to, or update a sockhash *map* referencing sockets.
  *		The *skops* is used as a new value for the entry associated to
@@ -2161,7 +2161,7 @@ union bpf_attr {
  *	Return
  *		0 on success, or a negative error in case of failure.
  *
- * int bpf_msg_redirect_hash(struct sk_msg_buff *msg, struct bpf_map *map, void *key, u64 flags)
+ * long bpf_msg_redirect_hash(struct sk_msg_buff *msg, struct bpf_map *map, void *key, u64 flags)
  *	Description
  *		This helper is used in programs implementing policies at the
  *		socket level. If the message *msg* is allowed to pass (i.e. if
@@ -2175,7 +2175,7 @@ union bpf_attr {
  *	Return
  *		**SK_PASS** on success, or **SK_DROP** on error.
  *
- * int bpf_sk_redirect_hash(struct sk_buff *skb, struct bpf_map *map, void *key, u64 flags)
+ * long bpf_sk_redirect_hash(struct sk_buff *skb, struct bpf_map *map, void *key, u64 flags)
  *	Description
  *		This helper is used in programs implementing policies at the
  *		skb socket level. If the sk_buff *skb* is allowed to pass (i.e.
@@ -2189,7 +2189,7 @@ union bpf_attr {
  *	Return
  *		**SK_PASS** on success, or **SK_DROP** on error.
  *
- * int bpf_lwt_push_encap(struct sk_buff *skb, u32 type, void *hdr, u32 len)
+ * long bpf_lwt_push_encap(struct sk_buff *skb, u32 type, void *hdr, u32 len)
  *	Description
  *		Encapsulate the packet associated to *skb* within a Layer 3
  *		protocol header. This header is provided in the buffer at
@@ -2226,7 +2226,7 @@ union bpf_attr {
  *	Return
  * 		0 on success, or a negative error in case of failure.
  *
- * int bpf_lwt_seg6_store_bytes(struct sk_buff *skb, u32 offset, const void *from, u32 len)
+ * long bpf_lwt_seg6_store_bytes(struct sk_buff *skb, u32 offset, const void *from, u32 len)
  *	Description
  *		Store *len* bytes from address *from* into the packet
  *		associated to *skb*, at *offset*. Only the flags, tag and TLVs
@@ -2241,7 +2241,7 @@ union bpf_attr {
  *	Return
  * 		0 on success, or a negative error in case of failure.
  *
- * int bpf_lwt_seg6_adjust_srh(struct sk_buff *skb, u32 offset, s32 delta)
+ * long bpf_lwt_seg6_adjust_srh(struct sk_buff *skb, u32 offset, s32 delta)
  *	Description
  *		Adjust the size allocated to TLVs in the outermost IPv6
  *		Segment Routing Header contained in the packet associated to
@@ -2257,7 +2257,7 @@ union bpf_attr {
  *	Return
  * 		0 on success, or a negative error in case of failure.
  *
- * int bpf_lwt_seg6_action(struct sk_buff *skb, u32 action, void *param, u32 param_len)
+ * long bpf_lwt_seg6_action(struct sk_buff *skb, u32 action, void *param, u32 param_len)
  *	Description
  *		Apply an IPv6 Segment Routing action of type *action* to the
  *		packet associated to *skb*. Each action takes a parameter
@@ -2286,7 +2286,7 @@ union bpf_attr {
  *	Return
  * 		0 on success, or a negative error in case of failure.
  *
- * int bpf_rc_repeat(void *ctx)
+ * long bpf_rc_repeat(void *ctx)
  *	Description
  *		This helper is used in programs implementing IR decoding, to
  *		report a successfully decoded repeat key message. This delays
@@ -2305,7 +2305,7 @@ union bpf_attr {
  *	Return
  *		0
  *
- * int bpf_rc_keydown(void *ctx, u32 protocol, u64 scancode, u32 toggle)
+ * long bpf_rc_keydown(void *ctx, u32 protocol, u64 scancode, u32 toggle)
  *	Description
  *		This helper is used in programs implementing IR decoding, to
  *		report a successfully decoded key press with *scancode*,
@@ -2370,7 +2370,7 @@ union bpf_attr {
  *	Return
  *		A pointer to the local storage area.
  *
- * int bpf_sk_select_reuseport(struct sk_reuseport_md *reuse, struct bpf_map *map, void *key, u64 flags)
+ * long bpf_sk_select_reuseport(struct sk_reuseport_md *reuse, struct bpf_map *map, void *key, u64 flags)
  *	Description
  *		Select a **SO_REUSEPORT** socket from a
  *		**BPF_MAP_TYPE_REUSEPORT_ARRAY** *map*.
@@ -2471,7 +2471,7 @@ union bpf_attr {
  *		result is from *reuse*\ **->socks**\ [] using the hash of the
  *		tuple.
  *
- * int bpf_sk_release(struct bpf_sock *sock)
+ * long bpf_sk_release(struct bpf_sock *sock)
  *	Description
  *		Release the reference held by *sock*. *sock* must be a
  *		non-**NULL** pointer that was returned from
@@ -2479,7 +2479,7 @@ union bpf_attr {
  *	Return
  *		0 on success, or a negative error in case of failure.
  *
- * int bpf_map_push_elem(struct bpf_map *map, const void *value, u64 flags)
+ * long bpf_map_push_elem(struct bpf_map *map, const void *value, u64 flags)
  * 	Description
  * 		Push an element *value* in *map*. *flags* is one of:
  *
@@ -2489,19 +2489,19 @@ union bpf_attr {
  * 	Return
  * 		0 on success, or a negative error in case of failure.
  *
- * int bpf_map_pop_elem(struct bpf_map *map, void *value)
+ * long bpf_map_pop_elem(struct bpf_map *map, void *value)
  * 	Description
  * 		Pop an element from *map*.
  * 	Return
  * 		0 on success, or a negative error in case of failure.
  *
- * int bpf_map_peek_elem(struct bpf_map *map, void *value)
+ * long bpf_map_peek_elem(struct bpf_map *map, void *value)
  * 	Description
  * 		Get an element from *map* without removing it.
  * 	Return
  * 		0 on success, or a negative error in case of failure.
  *
- * int bpf_msg_push_data(struct sk_msg_buff *msg, u32 start, u32 len, u64 flags)
+ * long bpf_msg_push_data(struct sk_msg_buff *msg, u32 start, u32 len, u64 flags)
  *	Description
  *		For socket policies, insert *len* bytes into *msg* at offset
  *		*start*.
@@ -2517,7 +2517,7 @@ union bpf_attr {
  *	Return
  *		0 on success, or a negative error in case of failure.
  *
- * int bpf_msg_pop_data(struct sk_msg_buff *msg, u32 start, u32 len, u64 flags)
+ * long bpf_msg_pop_data(struct sk_msg_buff *msg, u32 start, u32 len, u64 flags)
  *	Description
  *		Will remove *len* bytes from a *msg* starting at byte *start*.
  *		This may result in **ENOMEM** errors under certain situations if
@@ -2529,7 +2529,7 @@ union bpf_attr {
  *	Return
  *		0 on success, or a negative error in case of failure.
  *
- * int bpf_rc_pointer_rel(void *ctx, s32 rel_x, s32 rel_y)
+ * long bpf_rc_pointer_rel(void *ctx, s32 rel_x, s32 rel_y)
  *	Description
  *		This helper is used in programs implementing IR decoding, to
  *		report a successfully decoded pointer movement.
@@ -2543,7 +2543,7 @@ union bpf_attr {
  *	Return
  *		0
  *
- * int bpf_spin_lock(struct bpf_spin_lock *lock)
+ * long bpf_spin_lock(struct bpf_spin_lock *lock)
  *	Description
  *		Acquire a spinlock represented by the pointer *lock*, which is
  *		stored as part of a value of a map. Taking the lock allows to
@@ -2591,7 +2591,7 @@ union bpf_attr {
  *	Return
  *		0
  *
- * int bpf_spin_unlock(struct bpf_spin_lock *lock)
+ * long bpf_spin_unlock(struct bpf_spin_lock *lock)
  *	Description
  *		Release the *lock* previously locked by a call to
  *		**bpf_spin_lock**\ (\ *lock*\ ).
@@ -2614,7 +2614,7 @@ union bpf_attr {
  *		A **struct bpf_tcp_sock** pointer on success, or **NULL** in
  *		case of failure.
  *
- * int bpf_skb_ecn_set_ce(struct sk_buff *skb)
+ * long bpf_skb_ecn_set_ce(struct sk_buff *skb)
  *	Description
  *		Set ECN (Explicit Congestion Notification) field of IP header
  *		to **CE** (Congestion Encountered) if current value is **ECT**
@@ -2651,7 +2651,7 @@ union bpf_attr {
  *		result is from *reuse*\ **->socks**\ [] using the hash of the
  *		tuple.
  *
- * int bpf_tcp_check_syncookie(struct bpf_sock *sk, void *iph, u32 iph_len, struct tcphdr *th, u32 th_len)
+ * long bpf_tcp_check_syncookie(struct bpf_sock *sk, void *iph, u32 iph_len, struct tcphdr *th, u32 th_len)
  * 	Description
  * 		Check whether *iph* and *th* contain a valid SYN cookie ACK for
  * 		the listening socket in *sk*.
@@ -2666,7 +2666,7 @@ union bpf_attr {
  * 		0 if *iph* and *th* are a valid SYN cookie ACK, or a negative
  * 		error otherwise.
  *
- * int bpf_sysctl_get_name(struct bpf_sysctl *ctx, char *buf, size_t buf_len, u64 flags)
+ * long bpf_sysctl_get_name(struct bpf_sysctl *ctx, char *buf, size_t buf_len, u64 flags)
  *	Description
  *		Get name of sysctl in /proc/sys/ and copy it into provided by
  *		program buffer *buf* of size *buf_len*.
@@ -2682,7 +2682,7 @@ union bpf_attr {
  *		**-E2BIG** if the buffer wasn't big enough (*buf* will contain
  *		truncated name in this case).
  *
- * int bpf_sysctl_get_current_value(struct bpf_sysctl *ctx, char *buf, size_t buf_len)
+ * long bpf_sysctl_get_current_value(struct bpf_sysctl *ctx, char *buf, size_t buf_len)
  *	Description
  *		Get current value of sysctl as it is presented in /proc/sys
  *		(incl. newline, etc), and copy it as a string into provided
@@ -2701,7 +2701,7 @@ union bpf_attr {
  *		**-EINVAL** if current value was unavailable, e.g. because
  *		sysctl is uninitialized and read returns -EIO for it.
  *
- * int bpf_sysctl_get_new_value(struct bpf_sysctl *ctx, char *buf, size_t buf_len)
+ * long bpf_sysctl_get_new_value(struct bpf_sysctl *ctx, char *buf, size_t buf_len)
  *	Description
  *		Get new value being written by user space to sysctl (before
  *		the actual write happens) and copy it as a string into
@@ -2718,7 +2718,7 @@ union bpf_attr {
  *
  *		**-EINVAL** if sysctl is being read.
  *
- * int bpf_sysctl_set_new_value(struct bpf_sysctl *ctx, const char *buf, size_t buf_len)
+ * long bpf_sysctl_set_new_value(struct bpf_sysctl *ctx, const char *buf, size_t buf_len)
  *	Description
  *		Override new value being written by user space to sysctl with
  *		value provided by program in buffer *buf* of size *buf_len*.
@@ -2735,7 +2735,7 @@ union bpf_attr {
  *
  *		**-EINVAL** if sysctl is being read.
  *
- * int bpf_strtol(const char *buf, size_t buf_len, u64 flags, long *res)
+ * long bpf_strtol(const char *buf, size_t buf_len, u64 flags, long *res)
  *	Description
  *		Convert the initial part of the string from buffer *buf* of
  *		size *buf_len* to a long integer according to the given base
@@ -2759,7 +2759,7 @@ union bpf_attr {
  *
  *		**-ERANGE** if resulting value was out of range.
  *
- * int bpf_strtoul(const char *buf, size_t buf_len, u64 flags, unsigned long *res)
+ * long bpf_strtoul(const char *buf, size_t buf_len, u64 flags, unsigned long *res)
  *	Description
  *		Convert the initial part of the string from buffer *buf* of
  *		size *buf_len* to an unsigned long integer according to the
@@ -2810,7 +2810,7 @@ union bpf_attr {
  *		**NULL** if not found or there was an error in adding
  *		a new bpf-local-storage.
  *
- * int bpf_sk_storage_delete(struct bpf_map *map, struct bpf_sock *sk)
+ * long bpf_sk_storage_delete(struct bpf_map *map, struct bpf_sock *sk)
  *	Description
  *		Delete a bpf-local-storage from a *sk*.
  *	Return
@@ -2818,7 +2818,7 @@ union bpf_attr {
  *
  *		**-ENOENT** if the bpf-local-storage cannot be found.
  *
- * int bpf_send_signal(u32 sig)
+ * long bpf_send_signal(u32 sig)
  *	Description
  *		Send signal *sig* to the process of the current task.
  *		The signal may be delivered to any of this process's threads.
@@ -2859,7 +2859,7 @@ union bpf_attr {
  *
  *		**-EPROTONOSUPPORT** IP packet version is not 4 or 6
  *
- * int bpf_skb_output(void *ctx, struct bpf_map *map, u64 flags, void *data, u64 size)
+ * long bpf_skb_output(void *ctx, struct bpf_map *map, u64 flags, void *data, u64 size)
  * 	Description
  * 		Write raw *data* blob into a special BPF perf event held by
  * 		*map* of type **BPF_MAP_TYPE_PERF_EVENT_ARRAY**. This perf
@@ -2883,21 +2883,21 @@ union bpf_attr {
  * 	Return
  * 		0 on success, or a negative error in case of failure.
  *
- * int bpf_probe_read_user(void *dst, u32 size, const void *unsafe_ptr)
+ * long bpf_probe_read_user(void *dst, u32 size, const void *unsafe_ptr)
  * 	Description
  * 		Safely attempt to read *size* bytes from user space address
  * 		*unsafe_ptr* and store the data in *dst*.
  * 	Return
  * 		0 on success, or a negative error in case of failure.
  *
- * int bpf_probe_read_kernel(void *dst, u32 size, const void *unsafe_ptr)
+ * long bpf_probe_read_kernel(void *dst, u32 size, const void *unsafe_ptr)
  * 	Description
  * 		Safely attempt to read *size* bytes from kernel space address
  * 		*unsafe_ptr* and store the data in *dst*.
  * 	Return
  * 		0 on success, or a negative error in case of failure.
  *
- * int bpf_probe_read_user_str(void *dst, u32 size, const void *unsafe_ptr)
+ * long bpf_probe_read_user_str(void *dst, u32 size, const void *unsafe_ptr)
  * 	Description
  * 		Copy a NUL terminated string from an unsafe user address
  * 		*unsafe_ptr* to *dst*. The *size* should include the
@@ -2941,7 +2941,7 @@ union bpf_attr {
  * 		including the trailing NUL character. On error, a negative
  * 		value.
  *
- * int bpf_probe_read_kernel_str(void *dst, u32 size, const void *unsafe_ptr)
+ * long bpf_probe_read_kernel_str(void *dst, u32 size, const void *unsafe_ptr)
  * 	Description
  * 		Copy a NUL terminated string from an unsafe kernel address *unsafe_ptr*
  * 		to *dst*. Same semantics as with **bpf_probe_read_user_str**\ () apply.
@@ -2949,14 +2949,14 @@ union bpf_attr {
  * 		On success, the strictly positive length of the string, including
  * 		the trailing NUL character. On error, a negative value.
  *
- * int bpf_tcp_send_ack(void *tp, u32 rcv_nxt)
+ * long bpf_tcp_send_ack(void *tp, u32 rcv_nxt)
  *	Description
  *		Send out a tcp-ack. *tp* is the in-kernel struct **tcp_sock**.
  *		*rcv_nxt* is the ack_seq to be sent out.
  *	Return
  *		0 on success, or a negative error in case of failure.
  *
- * int bpf_send_signal_thread(u32 sig)
+ * long bpf_send_signal_thread(u32 sig)
  *	Description
  *		Send signal *sig* to the thread corresponding to the current task.
  *	Return
@@ -2976,7 +2976,7 @@ union bpf_attr {
  *	Return
  *		The 64 bit jiffies
  *
- * int bpf_read_branch_records(struct bpf_perf_event_data *ctx, void *buf, u32 size, u64 flags)
+ * long bpf_read_branch_records(struct bpf_perf_event_data *ctx, void *buf, u32 size, u64 flags)
  *	Description
  *		For an eBPF program attached to a perf event, retrieve the
  *		branch records (**struct perf_branch_entry**) associated to *ctx*
@@ -2995,7 +2995,7 @@ union bpf_attr {
  *
  *		**-ENOENT** if architecture does not support branch records.
  *
- * int bpf_get_ns_current_pid_tgid(u64 dev, u64 ino, struct bpf_pidns_info *nsdata, u32 size)
+ * long bpf_get_ns_current_pid_tgid(u64 dev, u64 ino, struct bpf_pidns_info *nsdata, u32 size)
  *	Description
  *		Returns 0 on success, values for *pid* and *tgid* as seen from the current
  *		*namespace* will be returned in *nsdata*.
@@ -3007,7 +3007,7 @@ union bpf_attr {
  *
  *		**-ENOENT** if pidns does not exists for the current task.
  *
- * int bpf_xdp_output(void *ctx, struct bpf_map *map, u64 flags, void *data, u64 size)
+ * long bpf_xdp_output(void *ctx, struct bpf_map *map, u64 flags, void *data, u64 size)
  *	Description
  *		Write raw *data* blob into a special BPF perf event held by
  *		*map* of type **BPF_MAP_TYPE_PERF_EVENT_ARRAY**. This perf
@@ -3062,7 +3062,7 @@ union bpf_attr {
  * 	Return
  * 		The id is returned or 0 in case the id could not be retrieved.
  *
- * int bpf_sk_assign(struct sk_buff *skb, struct bpf_sock *sk, u64 flags)
+ * long bpf_sk_assign(struct sk_buff *skb, struct bpf_sock *sk, u64 flags)
  *	Description
  *		Assign the *sk* to the *skb*. When combined with appropriate
  *		routing configuration to receive the packet towards the socket,
@@ -3097,7 +3097,7 @@ union bpf_attr {
  * 	Return
  * 		Current *ktime*.
  *
- * int bpf_seq_printf(struct seq_file *m, const char *fmt, u32 fmt_size, const void *data, u32 data_len)
+ * long bpf_seq_printf(struct seq_file *m, const char *fmt, u32 fmt_size, const void *data, u32 data_len)
  * 	Description
  * 		**bpf_seq_printf**\ () uses seq_file **seq_printf**\ () to print
  * 		out the format string.
@@ -3126,7 +3126,7 @@ union bpf_attr {
  *
  *		**-EOVERFLOW** if an overflow happened: The same object will be tried again.
  *
- * int bpf_seq_write(struct seq_file *m, const void *data, u32 len)
+ * long bpf_seq_write(struct seq_file *m, const void *data, u32 len)
  * 	Description
  * 		**bpf_seq_write**\ () uses seq_file **seq_write**\ () to write the data.
  * 		The *m* represents the seq_file. The *data* and *len* represent the
@@ -3221,7 +3221,7 @@ union bpf_attr {
  *	Return
  *		Requested value, or 0, if flags are not recognized.
  *
- * int bpf_csum_level(struct sk_buff *skb, u64 level)
+ * long bpf_csum_level(struct sk_buff *skb, u64 level)
  * 	Description
  * 		Change the skbs checksum level by one layer up or down, or
  * 		reset it entirely to none in order to have the stack perform
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 19684813faae..be0efee49093 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -653,7 +653,7 @@ union bpf_attr {
  * 		Map value associated to *key*, or **NULL** if no entry was
  * 		found.
  *
- * int bpf_map_update_elem(struct bpf_map *map, const void *key, const void *value, u64 flags)
+ * long bpf_map_update_elem(struct bpf_map *map, const void *key, const void *value, u64 flags)
  * 	Description
  * 		Add or update the value of the entry associated to *key* in
  * 		*map* with *value*. *flags* is one of:
@@ -671,13 +671,13 @@ union bpf_attr {
  * 	Return
  * 		0 on success, or a negative error in case of failure.
  *
- * int bpf_map_delete_elem(struct bpf_map *map, const void *key)
+ * long bpf_map_delete_elem(struct bpf_map *map, const void *key)
  * 	Description
  * 		Delete entry with *key* from *map*.
  * 	Return
  * 		0 on success, or a negative error in case of failure.
  *
- * int bpf_probe_read(void *dst, u32 size, const void *unsafe_ptr)
+ * long bpf_probe_read(void *dst, u32 size, const void *unsafe_ptr)
  * 	Description
  * 		For tracing programs, safely attempt to read *size* bytes from
  * 		kernel space address *unsafe_ptr* and store the data in *dst*.
@@ -695,7 +695,7 @@ union bpf_attr {
  * 	Return
  * 		Current *ktime*.
  *
- * int bpf_trace_printk(const char *fmt, u32 fmt_size, ...)
+ * long bpf_trace_printk(const char *fmt, u32 fmt_size, ...)
  * 	Description
  * 		This helper is a "printk()-like" facility for debugging. It
  * 		prints a message defined by format *fmt* (of size *fmt_size*)
@@ -775,7 +775,7 @@ union bpf_attr {
  * 	Return
  * 		The SMP id of the processor running the program.
  *
- * int bpf_skb_store_bytes(struct sk_buff *skb, u32 offset, const void *from, u32 len, u64 flags)
+ * long bpf_skb_store_bytes(struct sk_buff *skb, u32 offset, const void *from, u32 len, u64 flags)
  * 	Description
  * 		Store *len* bytes from address *from* into the packet
  * 		associated to *skb*, at *offset*. *flags* are a combination of
@@ -792,7 +792,7 @@ union bpf_attr {
  * 	Return
  * 		0 on success, or a negative error in case of failure.
  *
- * int bpf_l3_csum_replace(struct sk_buff *skb, u32 offset, u64 from, u64 to, u64 size)
+ * long bpf_l3_csum_replace(struct sk_buff *skb, u32 offset, u64 from, u64 to, u64 size)
  * 	Description
  * 		Recompute the layer 3 (e.g. IP) checksum for the packet
  * 		associated to *skb*. Computation is incremental, so the helper
@@ -817,7 +817,7 @@ union bpf_attr {
  * 	Return
  * 		0 on success, or a negative error in case of failure.
  *
- * int bpf_l4_csum_replace(struct sk_buff *skb, u32 offset, u64 from, u64 to, u64 flags)
+ * long bpf_l4_csum_replace(struct sk_buff *skb, u32 offset, u64 from, u64 to, u64 flags)
  * 	Description
  * 		Recompute the layer 4 (e.g. TCP, UDP or ICMP) checksum for the
  * 		packet associated to *skb*. Computation is incremental, so the
@@ -849,7 +849,7 @@ union bpf_attr {
  * 	Return
  * 		0 on success, or a negative error in case of failure.
  *
- * int bpf_tail_call(void *ctx, struct bpf_map *prog_array_map, u32 index)
+ * long bpf_tail_call(void *ctx, struct bpf_map *prog_array_map, u32 index)
  * 	Description
  * 		This special helper is used to trigger a "tail call", or in
  * 		other words, to jump into another eBPF program. The same stack
@@ -880,7 +880,7 @@ union bpf_attr {
  * 	Return
  * 		0 on success, or a negative error in case of failure.
  *
- * int bpf_clone_redirect(struct sk_buff *skb, u32 ifindex, u64 flags)
+ * long bpf_clone_redirect(struct sk_buff *skb, u32 ifindex, u64 flags)
  * 	Description
  * 		Clone and redirect the packet associated to *skb* to another
  * 		net device of index *ifindex*. Both ingress and egress
@@ -916,7 +916,7 @@ union bpf_attr {
  * 		A 64-bit integer containing the current GID and UID, and
  * 		created as such: *current_gid* **<< 32 \|** *current_uid*.
  *
- * int bpf_get_current_comm(void *buf, u32 size_of_buf)
+ * long bpf_get_current_comm(void *buf, u32 size_of_buf)
  * 	Description
  * 		Copy the **comm** attribute of the current task into *buf* of
  * 		*size_of_buf*. The **comm** attribute contains the name of
@@ -953,7 +953,7 @@ union bpf_attr {
  * 	Return
  * 		The classid, or 0 for the default unconfigured classid.
  *
- * int bpf_skb_vlan_push(struct sk_buff *skb, __be16 vlan_proto, u16 vlan_tci)
+ * long bpf_skb_vlan_push(struct sk_buff *skb, __be16 vlan_proto, u16 vlan_tci)
  * 	Description
  * 		Push a *vlan_tci* (VLAN tag control information) of protocol
  * 		*vlan_proto* to the packet associated to *skb*, then update
@@ -969,7 +969,7 @@ union bpf_attr {
  * 	Return
  * 		0 on success, or a negative error in case of failure.
  *
- * int bpf_skb_vlan_pop(struct sk_buff *skb)
+ * long bpf_skb_vlan_pop(struct sk_buff *skb)
  * 	Description
  * 		Pop a VLAN header from the packet associated to *skb*.
  *
@@ -981,7 +981,7 @@ union bpf_attr {
  * 	Return
  * 		0 on success, or a negative error in case of failure.
  *
- * int bpf_skb_get_tunnel_key(struct sk_buff *skb, struct bpf_tunnel_key *key, u32 size, u64 flags)
+ * long bpf_skb_get_tunnel_key(struct sk_buff *skb, struct bpf_tunnel_key *key, u32 size, u64 flags)
  * 	Description
  * 		Get tunnel metadata. This helper takes a pointer *key* to an
  * 		empty **struct bpf_tunnel_key** of **size**, that will be
@@ -1032,7 +1032,7 @@ union bpf_attr {
  * 	Return
  * 		0 on success, or a negative error in case of failure.
  *
- * int bpf_skb_set_tunnel_key(struct sk_buff *skb, struct bpf_tunnel_key *key, u32 size, u64 flags)
+ * long bpf_skb_set_tunnel_key(struct sk_buff *skb, struct bpf_tunnel_key *key, u32 size, u64 flags)
  * 	Description
  * 		Populate tunnel metadata for packet associated to *skb.* The
  * 		tunnel metadata is set to the contents of *key*, of *size*. The
@@ -1098,7 +1098,7 @@ union bpf_attr {
  * 		The value of the perf event counter read from the map, or a
  * 		negative error code in case of failure.
  *
- * int bpf_redirect(u32 ifindex, u64 flags)
+ * long bpf_redirect(u32 ifindex, u64 flags)
  * 	Description
  * 		Redirect the packet to another net device of index *ifindex*.
  * 		This helper is somewhat similar to **bpf_clone_redirect**\
@@ -1145,7 +1145,7 @@ union bpf_attr {
  * 		The realm of the route for the packet associated to *skb*, or 0
  * 		if none was found.
  *
- * int bpf_perf_event_output(void *ctx, struct bpf_map *map, u64 flags, void *data, u64 size)
+ * long bpf_perf_event_output(void *ctx, struct bpf_map *map, u64 flags, void *data, u64 size)
  * 	Description
  * 		Write raw *data* blob into a special BPF perf event held by
  * 		*map* of type **BPF_MAP_TYPE_PERF_EVENT_ARRAY**. This perf
@@ -1190,7 +1190,7 @@ union bpf_attr {
  * 	Return
  * 		0 on success, or a negative error in case of failure.
  *
- * int bpf_skb_load_bytes(const void *skb, u32 offset, void *to, u32 len)
+ * long bpf_skb_load_bytes(const void *skb, u32 offset, void *to, u32 len)
  * 	Description
  * 		This helper was provided as an easy way to load data from a
  * 		packet. It can be used to load *len* bytes from *offset* from
@@ -1207,7 +1207,7 @@ union bpf_attr {
  * 	Return
  * 		0 on success, or a negative error in case of failure.
  *
- * int bpf_get_stackid(void *ctx, struct bpf_map *map, u64 flags)
+ * long bpf_get_stackid(void *ctx, struct bpf_map *map, u64 flags)
  * 	Description
  * 		Walk a user or a kernel stack and return its id. To achieve
  * 		this, the helper needs *ctx*, which is a pointer to the context
@@ -1276,7 +1276,7 @@ union bpf_attr {
  * 		The checksum result, or a negative error code in case of
  * 		failure.
  *
- * int bpf_skb_get_tunnel_opt(struct sk_buff *skb, void *opt, u32 size)
+ * long bpf_skb_get_tunnel_opt(struct sk_buff *skb, void *opt, u32 size)
  * 	Description
  * 		Retrieve tunnel options metadata for the packet associated to
  * 		*skb*, and store the raw tunnel option data to the buffer *opt*
@@ -1294,7 +1294,7 @@ union bpf_attr {
  * 	Return
  * 		The size of the option data retrieved.
  *
- * int bpf_skb_set_tunnel_opt(struct sk_buff *skb, void *opt, u32 size)
+ * long bpf_skb_set_tunnel_opt(struct sk_buff *skb, void *opt, u32 size)
  * 	Description
  * 		Set tunnel options metadata for the packet associated to *skb*
  * 		to the option data contained in the raw buffer *opt* of *size*.
@@ -1304,7 +1304,7 @@ union bpf_attr {
  * 	Return
  * 		0 on success, or a negative error in case of failure.
  *
- * int bpf_skb_change_proto(struct sk_buff *skb, __be16 proto, u64 flags)
+ * long bpf_skb_change_proto(struct sk_buff *skb, __be16 proto, u64 flags)
  * 	Description
  * 		Change the protocol of the *skb* to *proto*. Currently
  * 		supported are transition from IPv4 to IPv6, and from IPv6 to
@@ -1331,7 +1331,7 @@ union bpf_attr {
  * 	Return
  * 		0 on success, or a negative error in case of failure.
  *
- * int bpf_skb_change_type(struct sk_buff *skb, u32 type)
+ * long bpf_skb_change_type(struct sk_buff *skb, u32 type)
  * 	Description
  * 		Change the packet type for the packet associated to *skb*. This
  * 		comes down to setting *skb*\ **->pkt_type** to *type*, except
@@ -1358,7 +1358,7 @@ union bpf_attr {
  * 	Return
  * 		0 on success, or a negative error in case of failure.
  *
- * int bpf_skb_under_cgroup(struct sk_buff *skb, struct bpf_map *map, u32 index)
+ * long bpf_skb_under_cgroup(struct sk_buff *skb, struct bpf_map *map, u32 index)
  * 	Description
  * 		Check whether *skb* is a descendant of the cgroup2 held by
  * 		*map* of type **BPF_MAP_TYPE_CGROUP_ARRAY**, at *index*.
@@ -1389,7 +1389,7 @@ union bpf_attr {
  * 	Return
  * 		A pointer to the current task struct.
  *
- * int bpf_probe_write_user(void *dst, const void *src, u32 len)
+ * long bpf_probe_write_user(void *dst, const void *src, u32 len)
  * 	Description
  * 		Attempt in a safe way to write *len* bytes from the buffer
  * 		*src* to *dst* in memory. It only works for threads that are in
@@ -1408,7 +1408,7 @@ union bpf_attr {
  * 	Return
  * 		0 on success, or a negative error in case of failure.
  *
- * int bpf_current_task_under_cgroup(struct bpf_map *map, u32 index)
+ * long bpf_current_task_under_cgroup(struct bpf_map *map, u32 index)
  * 	Description
  * 		Check whether the probe is being run in the context of a given
  * 		subset of the cgroup2 hierarchy. The cgroup2 to test is held by
@@ -1420,7 +1420,7 @@ union bpf_attr {
  * 		* 1, if the *skb* task does not belong to the cgroup2.
  * 		* A negative error code, if an error occurred.
  *
- * int bpf_skb_change_tail(struct sk_buff *skb, u32 len, u64 flags)
+ * long bpf_skb_change_tail(struct sk_buff *skb, u32 len, u64 flags)
  * 	Description
  * 		Resize (trim or grow) the packet associated to *skb* to the
  * 		new *len*. The *flags* are reserved for future usage, and must
@@ -1444,7 +1444,7 @@ union bpf_attr {
  * 	Return
  * 		0 on success, or a negative error in case of failure.
  *
- * int bpf_skb_pull_data(struct sk_buff *skb, u32 len)
+ * long bpf_skb_pull_data(struct sk_buff *skb, u32 len)
  * 	Description
  * 		Pull in non-linear data in case the *skb* is non-linear and not
  * 		all of *len* are part of the linear section. Make *len* bytes
@@ -1500,7 +1500,7 @@ union bpf_attr {
  * 		recalculation the next time the kernel tries to access this
  * 		hash or when the **bpf_get_hash_recalc**\ () helper is called.
  *
- * int bpf_get_numa_node_id(void)
+ * long bpf_get_numa_node_id(void)
  * 	Description
  * 		Return the id of the current NUMA node. The primary use case
  * 		for this helper is the selection of sockets for the local NUMA
@@ -1511,7 +1511,7 @@ union bpf_attr {
  * 	Return
  * 		The id of current NUMA node.
  *
- * int bpf_skb_change_head(struct sk_buff *skb, u32 len, u64 flags)
+ * long bpf_skb_change_head(struct sk_buff *skb, u32 len, u64 flags)
  * 	Description
  * 		Grows headroom of packet associated to *skb* and adjusts the
  * 		offset of the MAC header accordingly, adding *len* bytes of
@@ -1532,7 +1532,7 @@ union bpf_attr {
  * 	Return
  * 		0 on success, or a negative error in case of failure.
  *
- * int bpf_xdp_adjust_head(struct xdp_buff *xdp_md, int delta)
+ * long bpf_xdp_adjust_head(struct xdp_buff *xdp_md, int delta)
  * 	Description
  * 		Adjust (move) *xdp_md*\ **->data** by *delta* bytes. Note that
  * 		it is possible to use a negative value for *delta*. This helper
@@ -1547,7 +1547,7 @@ union bpf_attr {
  * 	Return
  * 		0 on success, or a negative error in case of failure.
  *
- * int bpf_probe_read_str(void *dst, u32 size, const void *unsafe_ptr)
+ * long bpf_probe_read_str(void *dst, u32 size, const void *unsafe_ptr)
  * 	Description
  * 		Copy a NUL terminated string from an unsafe kernel address
  * 		*unsafe_ptr* to *dst*. See **bpf_probe_read_kernel_str**\ () for
@@ -1595,14 +1595,14 @@ union bpf_attr {
  * 		is returned (note that **overflowuid** might also be the actual
  * 		UID value for the socket).
  *
- * u32 bpf_set_hash(struct sk_buff *skb, u32 hash)
+ * long bpf_set_hash(struct sk_buff *skb, u32 hash)
  * 	Description
  * 		Set the full hash for *skb* (set the field *skb*\ **->hash**)
  * 		to value *hash*.
  * 	Return
  * 		0
  *
- * int bpf_setsockopt(void *bpf_socket, int level, int optname, void *optval, int optlen)
+ * long bpf_setsockopt(void *bpf_socket, int level, int optname, void *optval, int optlen)
  * 	Description
  * 		Emulate a call to **setsockopt()** on the socket associated to
  * 		*bpf_socket*, which must be a full socket. The *level* at
@@ -1630,7 +1630,7 @@ union bpf_attr {
  * 	Return
  * 		0 on success, or a negative error in case of failure.
  *
- * int bpf_skb_adjust_room(struct sk_buff *skb, s32 len_diff, u32 mode, u64 flags)
+ * long bpf_skb_adjust_room(struct sk_buff *skb, s32 len_diff, u32 mode, u64 flags)
  * 	Description
  * 		Grow or shrink the room for data in the packet associated to
  * 		*skb* by *len_diff*, and according to the selected *mode*.
@@ -1676,7 +1676,7 @@ union bpf_attr {
  * 	Return
  * 		0 on success, or a negative error in case of failure.
  *
- * int bpf_redirect_map(struct bpf_map *map, u32 key, u64 flags)
+ * long bpf_redirect_map(struct bpf_map *map, u32 key, u64 flags)
  * 	Description
  * 		Redirect the packet to the endpoint referenced by *map* at
  * 		index *key*. Depending on its type, this *map* can contain
@@ -1697,7 +1697,7 @@ union bpf_attr {
  * 		**XDP_REDIRECT** on success, or the value of the two lower bits
  * 		of the *flags* argument on error.
  *
- * int bpf_sk_redirect_map(struct sk_buff *skb, struct bpf_map *map, u32 key, u64 flags)
+ * long bpf_sk_redirect_map(struct sk_buff *skb, struct bpf_map *map, u32 key, u64 flags)
  * 	Description
  * 		Redirect the packet to the socket referenced by *map* (of type
  * 		**BPF_MAP_TYPE_SOCKMAP**) at index *key*. Both ingress and
@@ -1708,7 +1708,7 @@ union bpf_attr {
  * 	Return
  * 		**SK_PASS** on success, or **SK_DROP** on error.
  *
- * int bpf_sock_map_update(struct bpf_sock_ops *skops, struct bpf_map *map, void *key, u64 flags)
+ * long bpf_sock_map_update(struct bpf_sock_ops *skops, struct bpf_map *map, void *key, u64 flags)
  * 	Description
  * 		Add an entry to, or update a *map* referencing sockets. The
  * 		*skops* is used as a new value for the entry associated to
@@ -1727,7 +1727,7 @@ union bpf_attr {
  * 	Return
  * 		0 on success, or a negative error in case of failure.
  *
- * int bpf_xdp_adjust_meta(struct xdp_buff *xdp_md, int delta)
+ * long bpf_xdp_adjust_meta(struct xdp_buff *xdp_md, int delta)
  * 	Description
  * 		Adjust the address pointed by *xdp_md*\ **->data_meta** by
  * 		*delta* (which can be positive or negative). Note that this
@@ -1756,7 +1756,7 @@ union bpf_attr {
  * 	Return
  * 		0 on success, or a negative error in case of failure.
  *
- * int bpf_perf_event_read_value(struct bpf_map *map, u64 flags, struct bpf_perf_event_value *buf, u32 buf_size)
+ * long bpf_perf_event_read_value(struct bpf_map *map, u64 flags, struct bpf_perf_event_value *buf, u32 buf_size)
  * 	Description
  * 		Read the value of a perf event counter, and store it into *buf*
  * 		of size *buf_size*. This helper relies on a *map* of type
@@ -1806,7 +1806,7 @@ union bpf_attr {
  * 	Return
  * 		0 on success, or a negative error in case of failure.
  *
- * int bpf_perf_prog_read_value(struct bpf_perf_event_data *ctx, struct bpf_perf_event_value *buf, u32 buf_size)
+ * long bpf_perf_prog_read_value(struct bpf_perf_event_data *ctx, struct bpf_perf_event_value *buf, u32 buf_size)
  * 	Description
  * 		For an eBPF program attached to a perf event, retrieve the
  * 		value of the event counter associated to *ctx* and store it in
@@ -1817,7 +1817,7 @@ union bpf_attr {
  * 	Return
  * 		0 on success, or a negative error in case of failure.
  *
- * int bpf_getsockopt(void *bpf_socket, int level, int optname, void *optval, int optlen)
+ * long bpf_getsockopt(void *bpf_socket, int level, int optname, void *optval, int optlen)
  * 	Description
  * 		Emulate a call to **getsockopt()** on the socket associated to
  * 		*bpf_socket*, which must be a full socket. The *level* at
@@ -1842,7 +1842,7 @@ union bpf_attr {
  * 	Return
  * 		0 on success, or a negative error in case of failure.
  *
- * int bpf_override_return(struct pt_regs *regs, u64 rc)
+ * long bpf_override_return(struct pt_regs *regs, u64 rc)
  * 	Description
  * 		Used for error injection, this helper uses kprobes to override
  * 		the return value of the probed function, and to set it to *rc*.
@@ -1867,7 +1867,7 @@ union bpf_attr {
  * 	Return
  * 		0
  *
- * int bpf_sock_ops_cb_flags_set(struct bpf_sock_ops *bpf_sock, int argval)
+ * long bpf_sock_ops_cb_flags_set(struct bpf_sock_ops *bpf_sock, int argval)
  * 	Description
  * 		Attempt to set the value of the **bpf_sock_ops_cb_flags** field
  * 		for the full TCP socket associated to *bpf_sock_ops* to
@@ -1911,7 +1911,7 @@ union bpf_attr {
  * 		be set is returned (which comes down to 0 if all bits were set
  * 		as required).
  *
- * int bpf_msg_redirect_map(struct sk_msg_buff *msg, struct bpf_map *map, u32 key, u64 flags)
+ * long bpf_msg_redirect_map(struct sk_msg_buff *msg, struct bpf_map *map, u32 key, u64 flags)
  * 	Description
  * 		This helper is used in programs implementing policies at the
  * 		socket level. If the message *msg* is allowed to pass (i.e. if
@@ -1925,7 +1925,7 @@ union bpf_attr {
  * 	Return
  * 		**SK_PASS** on success, or **SK_DROP** on error.
  *
- * int bpf_msg_apply_bytes(struct sk_msg_buff *msg, u32 bytes)
+ * long bpf_msg_apply_bytes(struct sk_msg_buff *msg, u32 bytes)
  * 	Description
  * 		For socket policies, apply the verdict of the eBPF program to
  * 		the next *bytes* (number of bytes) of message *msg*.
@@ -1959,7 +1959,7 @@ union bpf_attr {
  * 	Return
  * 		0
  *
- * int bpf_msg_cork_bytes(struct sk_msg_buff *msg, u32 bytes)
+ * long bpf_msg_cork_bytes(struct sk_msg_buff *msg, u32 bytes)
  * 	Description
  * 		For socket policies, prevent the execution of the verdict eBPF
  * 		program for message *msg* until *bytes* (byte number) have been
@@ -1977,7 +1977,7 @@ union bpf_attr {
  * 	Return
  * 		0
  *
- * int bpf_msg_pull_data(struct sk_msg_buff *msg, u32 start, u32 end, u64 flags)
+ * long bpf_msg_pull_data(struct sk_msg_buff *msg, u32 start, u32 end, u64 flags)
  * 	Description
  * 		For socket policies, pull in non-linear data from user space
  * 		for *msg* and set pointers *msg*\ **->data** and *msg*\
@@ -2008,7 +2008,7 @@ union bpf_attr {
  * 	Return
  * 		0 on success, or a negative error in case of failure.
  *
- * int bpf_bind(struct bpf_sock_addr *ctx, struct sockaddr *addr, int addr_len)
+ * long bpf_bind(struct bpf_sock_addr *ctx, struct sockaddr *addr, int addr_len)
  * 	Description
  * 		Bind the socket associated to *ctx* to the address pointed by
  * 		*addr*, of length *addr_len*. This allows for making outgoing
@@ -2026,7 +2026,7 @@ union bpf_attr {
  * 	Return
  * 		0 on success, or a negative error in case of failure.
  *
- * int bpf_xdp_adjust_tail(struct xdp_buff *xdp_md, int delta)
+ * long bpf_xdp_adjust_tail(struct xdp_buff *xdp_md, int delta)
  * 	Description
  * 		Adjust (move) *xdp_md*\ **->data_end** by *delta* bytes. It is
  * 		possible to both shrink and grow the packet tail.
@@ -2040,7 +2040,7 @@ union bpf_attr {
  * 	Return
  * 		0 on success, or a negative error in case of failure.
  *
- * int bpf_skb_get_xfrm_state(struct sk_buff *skb, u32 index, struct bpf_xfrm_state *xfrm_state, u32 size, u64 flags)
+ * long bpf_skb_get_xfrm_state(struct sk_buff *skb, u32 index, struct bpf_xfrm_state *xfrm_state, u32 size, u64 flags)
  * 	Description
  * 		Retrieve the XFRM state (IP transform framework, see also
  * 		**ip-xfrm(8)**) at *index* in XFRM "security path" for *skb*.
@@ -2056,7 +2056,7 @@ union bpf_attr {
  * 	Return
  * 		0 on success, or a negative error in case of failure.
  *
- * int bpf_get_stack(void *ctx, void *buf, u32 size, u64 flags)
+ * long bpf_get_stack(void *ctx, void *buf, u32 size, u64 flags)
  * 	Description
  * 		Return a user or a kernel stack in bpf program provided buffer.
  * 		To achieve this, the helper needs *ctx*, which is a pointer
@@ -2089,7 +2089,7 @@ union bpf_attr {
  * 		A non-negative value equal to or less than *size* on success,
  * 		or a negative error in case of failure.
  *
- * int bpf_skb_load_bytes_relative(const void *skb, u32 offset, void *to, u32 len, u32 start_header)
+ * long bpf_skb_load_bytes_relative(const void *skb, u32 offset, void *to, u32 len, u32 start_header)
  * 	Description
  * 		This helper is similar to **bpf_skb_load_bytes**\ () in that
  * 		it provides an easy way to load *len* bytes from *offset*
@@ -2111,7 +2111,7 @@ union bpf_attr {
  * 	Return
  * 		0 on success, or a negative error in case of failure.
  *
- * int bpf_fib_lookup(void *ctx, struct bpf_fib_lookup *params, int plen, u32 flags)
+ * long bpf_fib_lookup(void *ctx, struct bpf_fib_lookup *params, int plen, u32 flags)
  *	Description
  *		Do FIB lookup in kernel tables using parameters in *params*.
  *		If lookup is successful and result shows packet is to be
@@ -2142,7 +2142,7 @@ union bpf_attr {
  *		* > 0 one of **BPF_FIB_LKUP_RET_** codes explaining why the
  *		  packet is not forwarded or needs assist from full stack
  *
- * int bpf_sock_hash_update(struct bpf_sock_ops *skops, struct bpf_map *map, void *key, u64 flags)
+ * long bpf_sock_hash_update(struct bpf_sock_ops *skops, struct bpf_map *map, void *key, u64 flags)
  *	Description
  *		Add an entry to, or update a sockhash *map* referencing sockets.
  *		The *skops* is used as a new value for the entry associated to
@@ -2161,7 +2161,7 @@ union bpf_attr {
  *	Return
  *		0 on success, or a negative error in case of failure.
  *
- * int bpf_msg_redirect_hash(struct sk_msg_buff *msg, struct bpf_map *map, void *key, u64 flags)
+ * long bpf_msg_redirect_hash(struct sk_msg_buff *msg, struct bpf_map *map, void *key, u64 flags)
  *	Description
  *		This helper is used in programs implementing policies at the
  *		socket level. If the message *msg* is allowed to pass (i.e. if
@@ -2175,7 +2175,7 @@ union bpf_attr {
  *	Return
  *		**SK_PASS** on success, or **SK_DROP** on error.
  *
- * int bpf_sk_redirect_hash(struct sk_buff *skb, struct bpf_map *map, void *key, u64 flags)
+ * long bpf_sk_redirect_hash(struct sk_buff *skb, struct bpf_map *map, void *key, u64 flags)
  *	Description
  *		This helper is used in programs implementing policies at the
  *		skb socket level. If the sk_buff *skb* is allowed to pass (i.e.
@@ -2189,7 +2189,7 @@ union bpf_attr {
  *	Return
  *		**SK_PASS** on success, or **SK_DROP** on error.
  *
- * int bpf_lwt_push_encap(struct sk_buff *skb, u32 type, void *hdr, u32 len)
+ * long bpf_lwt_push_encap(struct sk_buff *skb, u32 type, void *hdr, u32 len)
  *	Description
  *		Encapsulate the packet associated to *skb* within a Layer 3
  *		protocol header. This header is provided in the buffer at
@@ -2226,7 +2226,7 @@ union bpf_attr {
  *	Return
  * 		0 on success, or a negative error in case of failure.
  *
- * int bpf_lwt_seg6_store_bytes(struct sk_buff *skb, u32 offset, const void *from, u32 len)
+ * long bpf_lwt_seg6_store_bytes(struct sk_buff *skb, u32 offset, const void *from, u32 len)
  *	Description
  *		Store *len* bytes from address *from* into the packet
  *		associated to *skb*, at *offset*. Only the flags, tag and TLVs
@@ -2241,7 +2241,7 @@ union bpf_attr {
  *	Return
  * 		0 on success, or a negative error in case of failure.
  *
- * int bpf_lwt_seg6_adjust_srh(struct sk_buff *skb, u32 offset, s32 delta)
+ * long bpf_lwt_seg6_adjust_srh(struct sk_buff *skb, u32 offset, s32 delta)
  *	Description
  *		Adjust the size allocated to TLVs in the outermost IPv6
  *		Segment Routing Header contained in the packet associated to
@@ -2257,7 +2257,7 @@ union bpf_attr {
  *	Return
  * 		0 on success, or a negative error in case of failure.
  *
- * int bpf_lwt_seg6_action(struct sk_buff *skb, u32 action, void *param, u32 param_len)
+ * long bpf_lwt_seg6_action(struct sk_buff *skb, u32 action, void *param, u32 param_len)
  *	Description
  *		Apply an IPv6 Segment Routing action of type *action* to the
  *		packet associated to *skb*. Each action takes a parameter
@@ -2286,7 +2286,7 @@ union bpf_attr {
  *	Return
  * 		0 on success, or a negative error in case of failure.
  *
- * int bpf_rc_repeat(void *ctx)
+ * long bpf_rc_repeat(void *ctx)
  *	Description
  *		This helper is used in programs implementing IR decoding, to
  *		report a successfully decoded repeat key message. This delays
@@ -2305,7 +2305,7 @@ union bpf_attr {
  *	Return
  *		0
  *
- * int bpf_rc_keydown(void *ctx, u32 protocol, u64 scancode, u32 toggle)
+ * long bpf_rc_keydown(void *ctx, u32 protocol, u64 scancode, u32 toggle)
  *	Description
  *		This helper is used in programs implementing IR decoding, to
  *		report a successfully decoded key press with *scancode*,
@@ -2370,7 +2370,7 @@ union bpf_attr {
  *	Return
  *		A pointer to the local storage area.
  *
- * int bpf_sk_select_reuseport(struct sk_reuseport_md *reuse, struct bpf_map *map, void *key, u64 flags)
+ * long bpf_sk_select_reuseport(struct sk_reuseport_md *reuse, struct bpf_map *map, void *key, u64 flags)
  *	Description
  *		Select a **SO_REUSEPORT** socket from a
  *		**BPF_MAP_TYPE_REUSEPORT_ARRAY** *map*.
@@ -2471,7 +2471,7 @@ union bpf_attr {
  *		result is from *reuse*\ **->socks**\ [] using the hash of the
  *		tuple.
  *
- * int bpf_sk_release(struct bpf_sock *sock)
+ * long bpf_sk_release(struct bpf_sock *sock)
  *	Description
  *		Release the reference held by *sock*. *sock* must be a
  *		non-**NULL** pointer that was returned from
@@ -2479,7 +2479,7 @@ union bpf_attr {
  *	Return
  *		0 on success, or a negative error in case of failure.
  *
- * int bpf_map_push_elem(struct bpf_map *map, const void *value, u64 flags)
+ * long bpf_map_push_elem(struct bpf_map *map, const void *value, u64 flags)
  * 	Description
  * 		Push an element *value* in *map*. *flags* is one of:
  *
@@ -2489,19 +2489,19 @@ union bpf_attr {
  * 	Return
  * 		0 on success, or a negative error in case of failure.
  *
- * int bpf_map_pop_elem(struct bpf_map *map, void *value)
+ * long bpf_map_pop_elem(struct bpf_map *map, void *value)
  * 	Description
  * 		Pop an element from *map*.
  * 	Return
  * 		0 on success, or a negative error in case of failure.
  *
- * int bpf_map_peek_elem(struct bpf_map *map, void *value)
+ * long bpf_map_peek_elem(struct bpf_map *map, void *value)
  * 	Description
  * 		Get an element from *map* without removing it.
  * 	Return
  * 		0 on success, or a negative error in case of failure.
  *
- * int bpf_msg_push_data(struct sk_msg_buff *msg, u32 start, u32 len, u64 flags)
+ * long bpf_msg_push_data(struct sk_msg_buff *msg, u32 start, u32 len, u64 flags)
  *	Description
  *		For socket policies, insert *len* bytes into *msg* at offset
  *		*start*.
@@ -2517,7 +2517,7 @@ union bpf_attr {
  *	Return
  *		0 on success, or a negative error in case of failure.
  *
- * int bpf_msg_pop_data(struct sk_msg_buff *msg, u32 start, u32 len, u64 flags)
+ * long bpf_msg_pop_data(struct sk_msg_buff *msg, u32 start, u32 len, u64 flags)
  *	Description
  *		Will remove *len* bytes from a *msg* starting at byte *start*.
  *		This may result in **ENOMEM** errors under certain situations if
@@ -2529,7 +2529,7 @@ union bpf_attr {
  *	Return
  *		0 on success, or a negative error in case of failure.
  *
- * int bpf_rc_pointer_rel(void *ctx, s32 rel_x, s32 rel_y)
+ * long bpf_rc_pointer_rel(void *ctx, s32 rel_x, s32 rel_y)
  *	Description
  *		This helper is used in programs implementing IR decoding, to
  *		report a successfully decoded pointer movement.
@@ -2543,7 +2543,7 @@ union bpf_attr {
  *	Return
  *		0
  *
- * int bpf_spin_lock(struct bpf_spin_lock *lock)
+ * long bpf_spin_lock(struct bpf_spin_lock *lock)
  *	Description
  *		Acquire a spinlock represented by the pointer *lock*, which is
  *		stored as part of a value of a map. Taking the lock allows to
@@ -2591,7 +2591,7 @@ union bpf_attr {
  *	Return
  *		0
  *
- * int bpf_spin_unlock(struct bpf_spin_lock *lock)
+ * long bpf_spin_unlock(struct bpf_spin_lock *lock)
  *	Description
  *		Release the *lock* previously locked by a call to
  *		**bpf_spin_lock**\ (\ *lock*\ ).
@@ -2614,7 +2614,7 @@ union bpf_attr {
  *		A **struct bpf_tcp_sock** pointer on success, or **NULL** in
  *		case of failure.
  *
- * int bpf_skb_ecn_set_ce(struct sk_buff *skb)
+ * long bpf_skb_ecn_set_ce(struct sk_buff *skb)
  *	Description
  *		Set ECN (Explicit Congestion Notification) field of IP header
  *		to **CE** (Congestion Encountered) if current value is **ECT**
@@ -2651,7 +2651,7 @@ union bpf_attr {
  *		result is from *reuse*\ **->socks**\ [] using the hash of the
  *		tuple.
  *
- * int bpf_tcp_check_syncookie(struct bpf_sock *sk, void *iph, u32 iph_len, struct tcphdr *th, u32 th_len)
+ * long bpf_tcp_check_syncookie(struct bpf_sock *sk, void *iph, u32 iph_len, struct tcphdr *th, u32 th_len)
  * 	Description
  * 		Check whether *iph* and *th* contain a valid SYN cookie ACK for
  * 		the listening socket in *sk*.
@@ -2666,7 +2666,7 @@ union bpf_attr {
  * 		0 if *iph* and *th* are a valid SYN cookie ACK, or a negative
  * 		error otherwise.
  *
- * int bpf_sysctl_get_name(struct bpf_sysctl *ctx, char *buf, size_t buf_len, u64 flags)
+ * long bpf_sysctl_get_name(struct bpf_sysctl *ctx, char *buf, size_t buf_len, u64 flags)
  *	Description
  *		Get name of sysctl in /proc/sys/ and copy it into provided by
  *		program buffer *buf* of size *buf_len*.
@@ -2682,7 +2682,7 @@ union bpf_attr {
  *		**-E2BIG** if the buffer wasn't big enough (*buf* will contain
  *		truncated name in this case).
  *
- * int bpf_sysctl_get_current_value(struct bpf_sysctl *ctx, char *buf, size_t buf_len)
+ * long bpf_sysctl_get_current_value(struct bpf_sysctl *ctx, char *buf, size_t buf_len)
  *	Description
  *		Get current value of sysctl as it is presented in /proc/sys
  *		(incl. newline, etc), and copy it as a string into provided
@@ -2701,7 +2701,7 @@ union bpf_attr {
  *		**-EINVAL** if current value was unavailable, e.g. because
  *		sysctl is uninitialized and read returns -EIO for it.
  *
- * int bpf_sysctl_get_new_value(struct bpf_sysctl *ctx, char *buf, size_t buf_len)
+ * long bpf_sysctl_get_new_value(struct bpf_sysctl *ctx, char *buf, size_t buf_len)
  *	Description
  *		Get new value being written by user space to sysctl (before
  *		the actual write happens) and copy it as a string into
@@ -2718,7 +2718,7 @@ union bpf_attr {
  *
  *		**-EINVAL** if sysctl is being read.
  *
- * int bpf_sysctl_set_new_value(struct bpf_sysctl *ctx, const char *buf, size_t buf_len)
+ * long bpf_sysctl_set_new_value(struct bpf_sysctl *ctx, const char *buf, size_t buf_len)
  *	Description
  *		Override new value being written by user space to sysctl with
  *		value provided by program in buffer *buf* of size *buf_len*.
@@ -2735,7 +2735,7 @@ union bpf_attr {
  *
  *		**-EINVAL** if sysctl is being read.
  *
- * int bpf_strtol(const char *buf, size_t buf_len, u64 flags, long *res)
+ * long bpf_strtol(const char *buf, size_t buf_len, u64 flags, long *res)
  *	Description
  *		Convert the initial part of the string from buffer *buf* of
  *		size *buf_len* to a long integer according to the given base
@@ -2759,7 +2759,7 @@ union bpf_attr {
  *
  *		**-ERANGE** if resulting value was out of range.
  *
- * int bpf_strtoul(const char *buf, size_t buf_len, u64 flags, unsigned long *res)
+ * long bpf_strtoul(const char *buf, size_t buf_len, u64 flags, unsigned long *res)
  *	Description
  *		Convert the initial part of the string from buffer *buf* of
  *		size *buf_len* to an unsigned long integer according to the
@@ -2810,7 +2810,7 @@ union bpf_attr {
  *		**NULL** if not found or there was an error in adding
  *		a new bpf-local-storage.
  *
- * int bpf_sk_storage_delete(struct bpf_map *map, struct bpf_sock *sk)
+ * long bpf_sk_storage_delete(struct bpf_map *map, struct bpf_sock *sk)
  *	Description
  *		Delete a bpf-local-storage from a *sk*.
  *	Return
@@ -2818,7 +2818,7 @@ union bpf_attr {
  *
  *		**-ENOENT** if the bpf-local-storage cannot be found.
  *
- * int bpf_send_signal(u32 sig)
+ * long bpf_send_signal(u32 sig)
  *	Description
  *		Send signal *sig* to the process of the current task.
  *		The signal may be delivered to any of this process's threads.
@@ -2859,7 +2859,7 @@ union bpf_attr {
  *
  *		**-EPROTONOSUPPORT** IP packet version is not 4 or 6
  *
- * int bpf_skb_output(void *ctx, struct bpf_map *map, u64 flags, void *data, u64 size)
+ * long bpf_skb_output(void *ctx, struct bpf_map *map, u64 flags, void *data, u64 size)
  * 	Description
  * 		Write raw *data* blob into a special BPF perf event held by
  * 		*map* of type **BPF_MAP_TYPE_PERF_EVENT_ARRAY**. This perf
@@ -2883,21 +2883,21 @@ union bpf_attr {
  * 	Return
  * 		0 on success, or a negative error in case of failure.
  *
- * int bpf_probe_read_user(void *dst, u32 size, const void *unsafe_ptr)
+ * long bpf_probe_read_user(void *dst, u32 size, const void *unsafe_ptr)
  * 	Description
  * 		Safely attempt to read *size* bytes from user space address
  * 		*unsafe_ptr* and store the data in *dst*.
  * 	Return
  * 		0 on success, or a negative error in case of failure.
  *
- * int bpf_probe_read_kernel(void *dst, u32 size, const void *unsafe_ptr)
+ * long bpf_probe_read_kernel(void *dst, u32 size, const void *unsafe_ptr)
  * 	Description
  * 		Safely attempt to read *size* bytes from kernel space address
  * 		*unsafe_ptr* and store the data in *dst*.
  * 	Return
  * 		0 on success, or a negative error in case of failure.
  *
- * int bpf_probe_read_user_str(void *dst, u32 size, const void *unsafe_ptr)
+ * long bpf_probe_read_user_str(void *dst, u32 size, const void *unsafe_ptr)
  * 	Description
  * 		Copy a NUL terminated string from an unsafe user address
  * 		*unsafe_ptr* to *dst*. The *size* should include the
@@ -2941,7 +2941,7 @@ union bpf_attr {
  * 		including the trailing NUL character. On error, a negative
  * 		value.
  *
- * int bpf_probe_read_kernel_str(void *dst, u32 size, const void *unsafe_ptr)
+ * long bpf_probe_read_kernel_str(void *dst, u32 size, const void *unsafe_ptr)
  * 	Description
  * 		Copy a NUL terminated string from an unsafe kernel address *unsafe_ptr*
  * 		to *dst*. Same semantics as with **bpf_probe_read_user_str**\ () apply.
@@ -2949,14 +2949,14 @@ union bpf_attr {
  * 		On success, the strictly positive length of the string, including
  * 		the trailing NUL character. On error, a negative value.
  *
- * int bpf_tcp_send_ack(void *tp, u32 rcv_nxt)
+ * long bpf_tcp_send_ack(void *tp, u32 rcv_nxt)
  *	Description
  *		Send out a tcp-ack. *tp* is the in-kernel struct **tcp_sock**.
  *		*rcv_nxt* is the ack_seq to be sent out.
  *	Return
  *		0 on success, or a negative error in case of failure.
  *
- * int bpf_send_signal_thread(u32 sig)
+ * long bpf_send_signal_thread(u32 sig)
  *	Description
  *		Send signal *sig* to the thread corresponding to the current task.
  *	Return
@@ -2976,7 +2976,7 @@ union bpf_attr {
  *	Return
  *		The 64 bit jiffies
  *
- * int bpf_read_branch_records(struct bpf_perf_event_data *ctx, void *buf, u32 size, u64 flags)
+ * long bpf_read_branch_records(struct bpf_perf_event_data *ctx, void *buf, u32 size, u64 flags)
  *	Description
  *		For an eBPF program attached to a perf event, retrieve the
  *		branch records (**struct perf_branch_entry**) associated to *ctx*
@@ -2995,7 +2995,7 @@ union bpf_attr {
  *
  *		**-ENOENT** if architecture does not support branch records.
  *
- * int bpf_get_ns_current_pid_tgid(u64 dev, u64 ino, struct bpf_pidns_info *nsdata, u32 size)
+ * long bpf_get_ns_current_pid_tgid(u64 dev, u64 ino, struct bpf_pidns_info *nsdata, u32 size)
  *	Description
  *		Returns 0 on success, values for *pid* and *tgid* as seen from the current
  *		*namespace* will be returned in *nsdata*.
@@ -3007,7 +3007,7 @@ union bpf_attr {
  *
  *		**-ENOENT** if pidns does not exist for the current task.
  *
- * int bpf_xdp_output(void *ctx, struct bpf_map *map, u64 flags, void *data, u64 size)
+ * long bpf_xdp_output(void *ctx, struct bpf_map *map, u64 flags, void *data, u64 size)
  *	Description
  *		Write raw *data* blob into a special BPF perf event held by
  *		*map* of type **BPF_MAP_TYPE_PERF_EVENT_ARRAY**. This perf
@@ -3062,7 +3062,7 @@ union bpf_attr {
  * 	Return
  * 		The id is returned or 0 in case the id could not be retrieved.
  *
- * int bpf_sk_assign(struct sk_buff *skb, struct bpf_sock *sk, u64 flags)
+ * long bpf_sk_assign(struct sk_buff *skb, struct bpf_sock *sk, u64 flags)
  *	Description
  *		Assign the *sk* to the *skb*. When combined with appropriate
  *		routing configuration to receive the packet towards the socket,
@@ -3097,7 +3097,7 @@ union bpf_attr {
  * 	Return
  * 		Current *ktime*.
  *
- * int bpf_seq_printf(struct seq_file *m, const char *fmt, u32 fmt_size, const void *data, u32 data_len)
+ * long bpf_seq_printf(struct seq_file *m, const char *fmt, u32 fmt_size, const void *data, u32 data_len)
  * 	Description
  * 		**bpf_seq_printf**\ () uses seq_file **seq_printf**\ () to print
  * 		out the format string.
@@ -3126,7 +3126,7 @@ union bpf_attr {
  *
  *		**-EOVERFLOW** if an overflow happened: The same object will be tried again.
  *
- * int bpf_seq_write(struct seq_file *m, const void *data, u32 len)
+ * long bpf_seq_write(struct seq_file *m, const void *data, u32 len)
  * 	Description
  * 		**bpf_seq_write**\ () uses seq_file **seq_write**\ () to write the data.
  * 		The *m* represents the seq_file. The *data* and *len* represent the
@@ -3221,7 +3221,7 @@ union bpf_attr {
  *	Return
  *		Requested value, or 0, if flags are not recognized.
  *
- * int bpf_csum_level(struct sk_buff *skb, u64 level)
+ * long bpf_csum_level(struct sk_buff *skb, u64 level)
  * 	Description
  * 		Change the skb's checksum level by one layer up or down, or
  * 		reset it entirely to none in order to have the stack perform
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH bpf-next 2/2] selftests/bpf: add variable-length data concatenation pattern test
  2020-06-17 20:21 [PATCH bpf-next 1/2] bpf: switch most helper return values from 32-bit int to 64-bit long Andrii Nakryiko
@ 2020-06-17 20:21 ` Andrii Nakryiko
  2020-06-18  6:49 ` [PATCH bpf-next 1/2] bpf: switch most helper return values from 32-bit int to 64-bit long John Fastabend
  1 sibling, 0 replies; 18+ messages in thread
From: Andrii Nakryiko @ 2020-06-17 20:21 UTC (permalink / raw)
  To: bpf, netdev, ast, daniel; +Cc: andrii.nakryiko, kernel-team, Andrii Nakryiko

Add a selftest that validates variable-length data reading and concatenation
into one big shared data array. This is a common pattern in production use for
monitoring and tracing applications, which can potentially read a lot of data,
but usually read much less. Such a pattern allows determining precisely how
much data needs to be sent over perfbuf/ringbuf, maximizing efficiency.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
---

N.B. This patch requires 02553b91da5d ("bpf: bpf_probe_read_kernel_str() has to return amount of data read on success"),
currently in bpf tree only.
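
For context, here is a minimal, self-contained sketch of the end-to-end pattern
this selftest exercises, including the final variable-sized submission over a
perf buffer that the bounded total length enables. It is purely illustrative and
not part of the patch: the 'events' map, the buffer names, and the section name
are made up for the example.

  #include "vmlinux.h"
  #include <bpf/bpf_helpers.h>

  #define MAX_LEN 256

  char buf_in1[MAX_LEN] = {};
  char buf_in2[MAX_LEN] = {};
  char payload[2 * MAX_LEN] = {};

  /* hypothetical perf event array used to ship data to user-space */
  struct {
  	__uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
  	__uint(key_size, sizeof(int));
  	__uint(value_size, sizeof(int));
  } events SEC(".maps");

  SEC("raw_tp/sys_enter")
  int handler(void *ctx)
  {
  	void *p = payload;
  	u64 len, total;

  	len = bpf_probe_read_kernel_str(p, MAX_LEN, buf_in1);
  	if (len <= MAX_LEN)
  		p += len;

  	len = bpf_probe_read_kernel_str(p, MAX_LEN, buf_in2);
  	if (len <= MAX_LEN)
  		p += len;

  	/* thanks to the len <= MAX_LEN checks above, total is provably
  	 * bounded by 2 * MAX_LEN, so the verifier accepts the
  	 * variable-sized output and only the used part of the buffer
  	 * gets copied out
  	 */
  	total = p - (void *)payload;
  	bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU, payload, total);
  	return 0;
  }

  char LICENSE[] SEC("license") = "GPL";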

 .../testing/selftests/bpf/prog_tests/varlen.c | 56 +++++++++++
 .../testing/selftests/bpf/progs/test_varlen.c | 96 +++++++++++++++++++
 2 files changed, 152 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/varlen.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_varlen.c

diff --git a/tools/testing/selftests/bpf/prog_tests/varlen.c b/tools/testing/selftests/bpf/prog_tests/varlen.c
new file mode 100644
index 000000000000..7533565e096d
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/varlen.c
@@ -0,0 +1,56 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2020 Facebook */
+
+#include <test_progs.h>
+#include <time.h>
+#include "test_varlen.skel.h"
+
+#define CHECK_VAL(got, exp) \
+	CHECK((got) != (exp), "check", "got %ld != exp %ld\n", \
+	      (long)(got), (long)(exp))
+
+void test_varlen(void)
+{
+	int duration = 0, err;
+	struct test_varlen* skel;
+	struct test_varlen__bss *bss;
+	struct test_varlen__data *data;
+	const char str1[] = "Hello, ";
+	const char str2[] = "World!";
+	const char exp_str[] = "Hello, \0World!\0";
+	const int size1 = sizeof(str1);
+	const int size2 = sizeof(str2);
+
+	skel = test_varlen__open_and_load();
+	if (CHECK(!skel, "skel_open", "failed to open skeleton\n"))
+		return;
+	bss = skel->bss;
+	data = skel->data;
+
+	err = test_varlen__attach(skel);
+	if (CHECK(err, "skel_attach", "skeleton attach failed: %d\n", err))
+		goto cleanup;
+
+	bss->test_pid = getpid();
+
+	/* trigger everything */
+	memcpy(bss->buf_in1, str1, size1);
+	memcpy(bss->buf_in2, str2, size2);
+	bss->capture = true;
+	usleep(1);
+	bss->capture = false;
+
+	CHECK_VAL(bss->payload1_len1, size1);
+	CHECK_VAL(bss->payload1_len2, size2);
+	CHECK_VAL(bss->total1, size1 + size2);
+	CHECK(memcmp(bss->payload1, exp_str, size1 + size2), "content_check",
+	      "doesn't match!");
+
+	CHECK_VAL(data->payload2_len1, size1);
+	CHECK_VAL(data->payload2_len2, size2);
+	CHECK_VAL(data->total2, size1 + size2);
+	CHECK(memcmp(data->payload2, exp_str, size1 + size2), "content_check",
+	      "doesn't match!");
+cleanup:
+	test_varlen__destroy(skel);
+}
diff --git a/tools/testing/selftests/bpf/progs/test_varlen.c b/tools/testing/selftests/bpf/progs/test_varlen.c
new file mode 100644
index 000000000000..09691852debf
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_varlen.c
@@ -0,0 +1,96 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2020 Facebook */
+
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+#include <bpf/bpf_core_read.h>
+
+#define MAX_LEN 256
+
+char buf_in1[MAX_LEN] = {};
+char buf_in2[MAX_LEN] = {};
+
+int test_pid = 0;
+bool capture = false;
+
+/* .bss */
+long payload1_len1 = 0;
+long payload1_len2 = 0;
+long total1 = 0;
+char payload1[MAX_LEN + MAX_LEN] = {};
+
+/* .data */
+int payload2_len1 = -1;
+int payload2_len2 = -1;
+int total2 = -1;
+char payload2[MAX_LEN + MAX_LEN] = { 1 };
+
+SEC("raw_tp/sys_enter")
+int handler64(void *regs)
+{
+	int pid = bpf_get_current_pid_tgid() >> 32;
+	void *payload = payload1;
+	u64 len;
+
+	/* ignore irrelevant invocations */
+	if (test_pid != pid || !capture)
+		return 0;
+
+	len = bpf_probe_read_kernel_str(payload, MAX_LEN, &buf_in1[0]);
+	if (len <= MAX_LEN) {
+		payload += len;
+		payload1_len1 = len;
+	}
+
+	len = bpf_probe_read_kernel_str(payload, MAX_LEN, &buf_in2[0]);
+	if (len <= MAX_LEN) {
+		payload += len;
+		payload1_len2 = len;
+	}
+
+	total1 = payload - (void *)payload1;
+
+	return 0;
+}
+
+SEC("tp_btf/sys_enter")
+int handler32(void *regs)
+{
+	int pid = bpf_get_current_pid_tgid() >> 32;
+	void *payload = payload2;
+	u32 len;
+
+	/* ignore irrelevant invocations */
+	if (test_pid != pid || !capture)
+		return 0;
+
+	len = bpf_probe_read_kernel_str(payload, MAX_LEN, &buf_in1[0]);
+	if (len <= MAX_LEN) {
+		payload += len;
+		payload2_len1 = len;
+	}
+
+	len = bpf_probe_read_kernel_str(payload, MAX_LEN, &buf_in2[0]);
+	if (len <= MAX_LEN) {
+		payload += len;
+		payload2_len2 = len;
+	}
+
+	total2 = payload - (void *)payload2;
+
+	return 0;
+}
+
+SEC("tp_btf/sys_exit")
+int handler_exit(void *regs)
+{
+	long bla;
+
+	if (bpf_probe_read_kernel(&bla, sizeof(bla), 0))
+		return 1;
+	else
+		return 0;
+}
+
+char LICENSE[] SEC("license") = "GPL";
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* RE: [PATCH bpf-next 1/2] bpf: switch most helper return values from 32-bit int to 64-bit long
  2020-06-17 20:21 [PATCH bpf-next 1/2] bpf: switch most helper return values from 32-bit int to 64-bit long Andrii Nakryiko
  2020-06-17 20:21 ` [PATCH bpf-next 2/2] selftests/bpf: add variable-length data concatenation pattern test Andrii Nakryiko
@ 2020-06-18  6:49 ` John Fastabend
  2020-06-18  7:30   ` Andrii Nakryiko
  1 sibling, 1 reply; 18+ messages in thread
From: John Fastabend @ 2020-06-18  6:49 UTC (permalink / raw)
  To: Andrii Nakryiko, bpf, netdev, ast, daniel
  Cc: andrii.nakryiko, kernel-team, Andrii Nakryiko

Andrii Nakryiko wrote:
> Switch most of BPF helper definitions from returning int to long. These
> definitions are coming from comments in BPF UAPI header and are used to
> generate bpf_helper_defs.h (under libbpf) to be later included and used from
> BPF programs.
> 
> In actual in-kernel implementation, all the helpers are defined as returning
> u64, but due to some historical reasons, most of them are actually defined as
> returning int in UAPI (usually, to return 0 on success, and negative value on
> error).

Could we change the helper side to return the correct types now? Meaning, if
the UAPI claims it's an int, let's actually return the int.

> 
> This actually causes Clang to quite often generate sub-optimal code, because
> compiler believes that return value is 32-bit, and in a lot of cases has to be
> up-converted (usually with a pair of 32-bit bit shifts) to 64-bit values,
> before they can be used further in BPF code.
> 
> Besides just "polluting" the code, these 32-bit shifts quite often cause
> problems for cases in which return value matters. This is especially the case
> for the family of bpf_probe_read_str() functions. There are few other similar
> helpers (e.g., bpf_read_branch_records()), in which return value is used by
> BPF program logic to record variable-length data and process it. For such
> cases, BPF program logic carefully manages offsets within some array or map to
> read variable-length data. For such uses, it's crucial for BPF verifier to
> track possible range of register values to prove that all the accesses happen
> within given memory bounds. Those extraneous zero-extending bit shifts,
> inserted by Clang (and quite often interleaved with other code, which makes
> the issues even more challenging and sometimes requires employing extra
> per-variable compiler barriers), throws off verifier logic and makes it mark
> registers as having unknown variable offset. We'll study this pattern a bit
> later below.

With the latest verifier, zext with alu32 support should be implemented as a
MOV insn.
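
For illustration (my sketch, not code from the patch), the C-level pattern in
question is roughly:

  char buf[64];
  u64 off;
  int len;	/* helper currently declared as returning int in the UAPI comment */

  len = bpf_probe_read_kernel_str(buf, sizeof(buf), some_kernel_ptr /* placeholder */);
  off = (u32)len;	/* zero-extension: with alu32 this can be a single w-register
  			 * move (32-bit ALU ops already clear the upper 32 bits)
  			 * instead of a <<32 / >>32 shift pair */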

> 
> Another common pattern is to check return of BPF helper for non-zero state to
> detect error conditions and attempt alternative actions in such case. Even in
> this simple and straightforward case, this 32-bit vs BPF's native 64-bit mode
> quite often leads to sub-optimal and unnecessary extra code. We'll look at
> this pattern as well.
> 
> Clang's BPF target supports two modes of code generation: ALU32, in which it
> is capable of using lower 32-bit parts of registers, and no-ALU32, in which
> only full 64-bit registers are being used. ALU32 mode somewhat mitigates the
> above described problems, but not in all cases.

A bit curious, do you see users running with no-ALU32 support? I have enabled
it by default now. It seems to generate better code, and with the latest 32-bit
bounds tracking I haven't hit any issues with the verifier.

> 
> This patch switches all the cases in which BPF helpers return 0 or negative
> error from returning int to returning long. It is shown below that such change
> in definition leads to equivalent or better code. No-ALU32 mode benefits more,
> but ALU32 mode doesn't degrade or still gets improved code generation.
> 
> Another class of cases switched from int to long are bpf_probe_read_str()-like
> helpers, which encode successful case as non-negative values, while still
> returning negative value for errors.
> 
> In all of such cases, correctness is preserved due to two's complement
> encoding of negative values and the fact that all helpers return values with
> 32-bit absolute value. Two's complement ensures that for negative values
> higher 32 bits are all ones and when truncated, leave valid negative 32-bit
> value with the same value. Non-negative values have upper 32 bits set to zero
> and similarly preserve value when high 32 bits are truncated. This means that
> just casting to int/u32 is correct and efficient (and in ALU32 mode doesn't
> require any extra shifts).
> 
> To minimize the chances of regressions, two code patterns were investigated,
> as mentioned above. For both patterns, BPF assembly was analyzed in
> ALU32/NO-ALU32 compiler modes, both with current 32-bit int return type and
> new 64-bit long return type.
> 
> Case 1. Variable-length data reading and concatenation. This is quite
> ubiquitous pattern in tracing/monitoring applications, reading data like
> process's environment variables, file path, etc. In such case, many pieces of
> string-like variable-length data are read into a single big buffer, and at the
> end of the process, only a part of array containing actual data is sent to
> user-space for further processing. This case is tested in test_varlen.c
> selftest (in the next patch). Code flow is roughly as follows:
> 
>   void *payload = &sample->payload;
>   u64 len;
> 
>   len = bpf_probe_read_kernel_str(payload, MAX_SZ1, &source_data1);
>   if (len <= MAX_SZ1) {
>       payload += len;
>       sample->len1 = len;
>   }
>   len = bpf_probe_read_kernel_str(payload, MAX_SZ2, &source_data2);
>   if (len <= MAX_SZ2) {
>       payload += len;
>       sample->len2 = len;
>   }
>   /* and so on */
>   sample->total_len = payload - &sample->payload;
>   /* send over, e.g., perf buffer */
> 
> There could be two variations with slightly different code generated: when len
> is 64-bit integer and when it is 32-bit integer. Both variations were analysed.
> BPF assembly instructions between two successive invocations of
> bpf_probe_read_kernel_str() were used to check code regressions. Results are
> below, followed by short analysis. Left side is using helpers with int return
> type, the right one is after the switch to long.
> 
> ALU32 + INT                                ALU32 + LONG
> ===========                                ============
> 
> 64-BIT (13 insns):                         64-BIT (10 insns):
> ------------------------------------       ------------------------------------
>   17:   call 115                             17:   call 115
>   18:   if w0 > 256 goto +9 <LBB0_4>         18:   if r0 > 256 goto +6 <LBB0_4>
>   19:   w1 = w0                              19:   r1 = 0 ll
>   20:   r1 <<= 32                            21:   *(u64 *)(r1 + 0) = r0
>   21:   r1 s>>= 32                           22:   r6 = 0 ll

What version of clang is this? That is probably a zext in llvm-ir for which,
in the latest version, the 'w1 = w0' alone should be sufficient. I'm guessing
(hoping?) you might not have the latest?

>   22:   r2 = 0 ll                            24:   r6 += r0
>   24:   *(u64 *)(r2 + 0) = r1              00000000000000c8 <LBB0_4>:
>   25:   r6 = 0 ll                            25:   r1 = r6
>   27:   r6 += r1                             26:   w2 = 256
> 00000000000000e0 <LBB0_4>:                   27:   r3 = 0 ll
>   28:   r1 = r6                              29:   call 115
>   29:   w2 = 256
>   30:   r3 = 0 ll
>   32:   call 115
> 
> 32-BIT (11 insns):                         32-BIT (12 insns):
> ------------------------------------       ------------------------------------
>   17:   call 115                             17:   call 115
>   18:   if w0 > 256 goto +7 <LBB1_4>         18:   if w0 > 256 goto +8 <LBB1_4>
>   19:   r1 = 0 ll                            19:   r1 = 0 ll
>   21:   *(u32 *)(r1 + 0) = r0                21:   *(u32 *)(r1 + 0) = r0
>   22:   w1 = w0                              22:   r0 <<= 32
>   23:   r6 = 0 ll                            23:   r0 >>= 32
>   25:   r6 += r1                             24:   r6 = 0 ll
> 00000000000000d0 <LBB1_4>:                   26:   r6 += r0
>   26:   r1 = r6                            00000000000000d8 <LBB1_4>:
>   27:   w2 = 256                             27:   r1 = r6
>   28:   r3 = 0 ll                            28:   w2 = 256
>   30:   call 115                             29:   r3 = 0 ll
>                                              31:   call 115
> 
> In ALU32 mode, the variant using 64-bit length variable clearly wins and
> avoids unnecessary zero-extension bit shifts. In practice, this is even more
> important and good, because BPF code won't need to do extra checks to "prove"
> that payload/len are within good bounds.

I bet with the latest clang the shifts are removed. But if not, we probably
should fix clang regardless of whether helpers return longs or ints.

> 
> 32-bit len is one instruction longer. Clang decided to do 64-to-32 casting
> with two bit shifts, instead of equivalent `w1 = w0` assignment. The former
> uses extra register. The latter might potentially lose some range information,
> but not for 32-bit value. So in this case, verifier infers that r0 is [0, 256]
> after check at 18:, and shifting 32 bits left/right keeps that range intact.
> We should probably look into Clang's logic and see why it chooses bitshifts
> over sub-register assignments for this.
> 
> NO-ALU32 + INT                             NO-ALU32 + LONG
> ==============                             ===============
> 
> 64-BIT (14 insns):                         64-BIT (10 insns):
> ------------------------------------       ------------------------------------
>   17:   call 115                             17:   call 115
>   18:   r0 <<= 32                            18:   if r0 > 256 goto +6 <LBB0_4>
>   19:   r1 = r0                              19:   r1 = 0 ll
>   20:   r1 >>= 32                            21:   *(u64 *)(r1 + 0) = r0
>   21:   if r1 > 256 goto +7 <LBB0_4>         22:   r6 = 0 ll
>   22:   r0 s>>= 32                           24:   r6 += r0
>   23:   r1 = 0 ll                          00000000000000c8 <LBB0_4>:
>   25:   *(u64 *)(r1 + 0) = r0                25:   r1 = r6
>   26:   r6 = 0 ll                            26:   r2 = 256
>   28:   r6 += r0                             27:   r3 = 0 ll
> 00000000000000e8 <LBB0_4>:                   29:   call 115
>   29:   r1 = r6
>   30:   r2 = 256
>   31:   r3 = 0 ll
>   33:   call 115
> 
> 32-BIT (13 insns):                         32-BIT (13 insns):
> ------------------------------------       ------------------------------------
>   17:   call 115                             17:   call 115
>   18:   r1 = r0                              18:   r1 = r0
>   19:   r1 <<= 32                            19:   r1 <<= 32
>   20:   r1 >>= 32                            20:   r1 >>= 32
>   21:   if r1 > 256 goto +6 <LBB1_4>         21:   if r1 > 256 goto +6 <LBB1_4>
>   22:   r2 = 0 ll                            22:   r2 = 0 ll
>   24:   *(u32 *)(r2 + 0) = r0                24:   *(u32 *)(r2 + 0) = r0
>   25:   r6 = 0 ll                            25:   r6 = 0 ll
>   27:   r6 += r1                             27:   r6 += r1
> 00000000000000e0 <LBB1_4>:                 00000000000000e0 <LBB1_4>:
>   28:   r1 = r6                              28:   r1 = r6
>   29:   r2 = 256                             29:   r2 = 256
>   30:   r3 = 0 ll                            30:   r3 = 0 ll
>   32:   call 115                             32:   call 115
> 
> In NO-ALU32 mode, for the case of 64-bit len variable, Clang generates much
> superior code, as expected, eliminating unnecessary bit shifts. For 32-bit
> len, code is identical.

Right, I can't think of any way clang can avoid it here. OTOH, I fix this
by enabling alu32 ;)

> 
> So overall, only ALU-32 32-bit len case is more-or-less equivalent and the
> difference stems from internal Clang decision, rather than compiler lacking
> enough information about types.
> 
> Case 2. Let's look at the simpler case of checking return result of BPF helper
> for errors. The code is very simple:
> 
>   long bla;
>   if (bpf_probe_read_kernel(&bla, sizeof(bla), 0))
>       return 1;
>   else
>       return 0;
> 
> ALU32 + CHECK (9 insns)                    ALU32 + CHECK (9 insns)
> ====================================       ====================================
>   0:    r1 = r10                             0:    r1 = r10
>   1:    r1 += -8                             1:    r1 += -8
>   2:    w2 = 8                               2:    w2 = 8
>   3:    r3 = 0                               3:    r3 = 0
>   4:    call 113                             4:    call 113
>   5:    w1 = w0                              5:    r1 = r0
>   6:    w0 = 1                               6:    w0 = 1
>   7:    if w1 != 0 goto +1 <LBB2_2>          7:    if r1 != 0 goto +1 <LBB2_2>
>   8:    w0 = 0                               8:    w0 = 0
> 0000000000000048 <LBB2_2>:                 0000000000000048 <LBB2_2>:
>   9:    exit                                 9:    exit
> 
> Almost identical code, the only difference is the use of full register
> assignment (r1 = r0) vs half-registers (w1 = w0) in instruction #5. On 32-bit
> architectures, new BPF assembly might be slightly less optimal, in theory. But
> one can argue that's not a big issue, given that use of full registers is
> still prevalent (e.g., for parameter passing).
> 
> NO-ALU32 + CHECK (11 insns)                NO-ALU32 + CHECK (9 insns)
> ====================================       ====================================
>   0:    r1 = r10                             0:    r1 = r10
>   1:    r1 += -8                             1:    r1 += -8
>   2:    r2 = 8                               2:    r2 = 8
>   3:    r3 = 0                               3:    r3 = 0
>   4:    call 113                             4:    call 113
>   5:    r1 = r0                              5:    r1 = r0
>   6:    r1 <<= 32                            6:    r0 = 1
>   7:    r1 >>= 32                            7:    if r1 != 0 goto +1 <LBB2_2>
>   8:    r0 = 1                               8:    r0 = 0
>   9:    if r1 != 0 goto +1 <LBB2_2>        0000000000000048 <LBB2_2>:
>  10:    r0 = 0                               9:    exit
> 0000000000000058 <LBB2_2>:
>  11:    exit
> 
> NO-ALU32 is a clear improvement, getting rid of unnecessary zero-extension bit
> shifts.

It seems a win for the NO-ALU32 case, but for the +ALU32 case I think it's
the same with the latest clang, although I haven't tried yet. I was actually
considering going the other way and avoiding always returning u64 on
the other side. From a purely aesthetic point of view I prefer the
int type because it seems more clear/standard C. I'm also not so interested
in optimizing the no-alu32 case, but curious if there is a use case for
that?

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH bpf-next 1/2] bpf: switch most helper return values from 32-bit int to 64-bit long
  2020-06-18  6:49 ` [PATCH bpf-next 1/2] bpf: switch most helper return values from 32-bit int to 64-bit long John Fastabend
@ 2020-06-18  7:30   ` Andrii Nakryiko
  2020-06-18 18:58     ` John Fastabend
  0 siblings, 1 reply; 18+ messages in thread
From: Andrii Nakryiko @ 2020-06-18  7:30 UTC (permalink / raw)
  To: John Fastabend
  Cc: Andrii Nakryiko, bpf, Networking, Alexei Starovoitov,
	Daniel Borkmann, Kernel Team

On Wed, Jun 17, 2020 at 11:49 PM John Fastabend
<john.fastabend@gmail.com> wrote:
>
> Andrii Nakryiko wrote:
> > Switch most of BPF helper definitions from returning int to long. These
> > definitions are coming from comments in BPF UAPI header and are used to
> > generate bpf_helper_defs.h (under libbpf) to be later included and used from
> > BPF programs.
> >
> > In actual in-kernel implementation, all the helpers are defined as returning
> > u64, but due to some historical reasons, most of them are actually defined as
> > returning int in UAPI (usually, to return 0 on success, and negative value on
> > error).
>
> Could we change the helpers side to return correct types now? Meaning if the
> UAPI claims its an int lets actually return the int.

I'm not sure how exactly you see this being done. BPF ABI dictates
that the helper's result is passed in a full 64-bit r0 register. Are
you suggesting that in addition to RET_ANYTHING we should add
RET_ANYTHING32 and teach verifier that higher 32 bits of r0 are
guaranteed to be zero? And then make helpers actually return 32-bit
values without up-casting them to u64?
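
For context, the kernel side looks roughly like this (a sketch from
memory rather than the exact source, error handling elided): helpers are
defined via BPF_CALL_n() macros that expand to u64-returning functions,
so whatever the UAPI comment says, the value in r0 is always 64-bit.

  /* sketch only; the real helper also zeroes dst on error */
  BPF_CALL_3(bpf_probe_read_kernel, void *, dst, u32, size,
             const void *, unsafe_ptr)
  {
          int ret = probe_kernel_read(dst, unsafe_ptr, size);

          return ret; /* int result is widened into the 64-bit r0 */
  }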

>
> >
> > This actually causes Clang to quite often generate sub-optimal code, because
> > compiler believes that return value is 32-bit, and in a lot of cases has to be
> > up-converted (usually with a pair of 32-bit bit shifts) to 64-bit values,
> > before they can be used further in BPF code.
> >
> > Besides just "polluting" the code, these 32-bit shifts quite often cause
> > problems for cases in which return value matters. This is especially the case
> > for the family of bpf_probe_read_str() functions. There are few other similar
> > helpers (e.g., bpf_read_branch_records()), in which return value is used by
> > BPF program logic to record variable-length data and process it. For such
> > cases, BPF program logic carefully manages offsets within some array or map to
> > read variable-length data. For such uses, it's crucial for BPF verifier to
> > track possible range of register values to prove that all the accesses happen
> > within given memory bounds. Those extraneous zero-extending bit shifts,
> > inserted by Clang (and quite often interleaved with other code, which makes
> > the issues even more challenging and sometimes requires employing extra
> > per-variable compiler barriers), throws off verifier logic and makes it mark
> > registers as having unknown variable offset. We'll study this pattern a bit
> > later below.
>
> With latest verifier zext with alu32 support should be implemented as a
> MOV insn.

Code generation is independent of verifier version or am I not getting
what you are saying? Also all this code was compiled with up-to-date
Clang.

>
> >
> > Another common pattern is to check return of BPF helper for non-zero state to
> > detect error conditions and attempt alternative actions in such case. Even in
> > this simple and straightforward case, this 32-bit vs BPF's native 64-bit mode
> > quite often leads to sub-optimal and unnecessary extra code. We'll look at
> > this pattern as well.
> >
> > Clang's BPF target supports two modes of code generation: ALU32, in which it
> > is capable of using lower 32-bit parts of registers, and no-ALU32, in which
> > only full 64-bit registers are being used. ALU32 mode somewhat mitigates the
> > above described problems, but not in all cases.
>
> A bit curious, do you see users running with no-ALU32 support? I have enabled
> it by default now. It seems to generate better code and with latest 32-bit
> bounds tracking I haven't hit any issues with verifier.

Yes, all Facebook apps are built with no-ALU32. And those apps have to
run on quite old kernels as well, so relying on latest bug fixes in
kernel is not an option right now.

>
> >
> > This patch switches all the cases in which BPF helpers return 0 or negative
> > error from returning int to returning long. It is shown below that such change
> > in definition leads to equivalent or better code. No-ALU32 mode benefits more,
> > but ALU32 mode doesn't degrade or still gets improved code generation.
> >
> > Another class of cases switched from int to long are bpf_probe_read_str()-like
> > helpers, which encode successful case as non-negative values, while still
> > returning negative value for errors.
> >
> > In all of such cases, correctness is preserved due to two's complement
> > encoding of negative values and the fact that all helpers return values with
> > 32-bit absolute value. Two's complement ensures that for negative values
> > higher 32 bits are all ones and when truncated, leave valid negative 32-bit
> > value with the same value. Non-negative values have upper 32 bits set to zero
> > and similarly preserve value when high 32 bits are truncated. This means that
> > just casting to int/u32 is correct and efficient (and in ALU32 mode doesn't
> > require any extra shifts).
> >
> > To minimize the chances of regressions, two code patterns were investigated,
> > as mentioned above. For both patterns, BPF assembly was analyzed in
> > ALU32/NO-ALU32 compiler modes, both with current 32-bit int return type and
> > new 64-bit long return type.
> >
> > Case 1. Variable-length data reading and concatenation. This is quite
> > ubiquitous pattern in tracing/monitoring applications, reading data like
> > process's environment variables, file path, etc. In such case, many pieces of
> > string-like variable-length data are read into a single big buffer, and at the
> > end of the process, only a part of array containing actual data is sent to
> > user-space for further processing. This case is tested in test_varlen.c
> > selftest (in the next patch). Code flow is roughly as follows:
> >
> >   void *payload = &sample->payload;
> >   u64 len;
> >
> >   len = bpf_probe_read_kernel_str(payload, MAX_SZ1, &source_data1);
> >   if (len <= MAX_SZ1) {
> >       payload += len;
> >       sample->len1 = len;
> >   }
> >   len = bpf_probe_read_kernel_str(payload, MAX_SZ2, &source_data2);
> >   if (len <= MAX_SZ2) {
> >       payload += len;
> >       sample->len2 = len;
> >   }
> >   /* and so on */
> >   sample->total_len = payload - &sample->payload;
> >   /* send over, e.g., perf buffer */
> >
> > There could be two variations with slightly different code generated: when len
> > is 64-bit integer and when it is 32-bit integer. Both variations were analysed.
> > BPF assembly instructions between two successive invocations of
> > bpf_probe_read_kernel_str() were used to check code regressions. Results are
> > below, followed by short analysis. Left side is using helpers with int return
> > type, the right one is after the switch to long.
> >
> > ALU32 + INT                                ALU32 + LONG
> > ===========                                ============
> >
> > 64-BIT (13 insns):                         64-BIT (10 insns):
> > ------------------------------------       ------------------------------------
> >   17:   call 115                             17:   call 115
> >   18:   if w0 > 256 goto +9 <LBB0_4>         18:   if r0 > 256 goto +6 <LBB0_4>
> >   19:   w1 = w0                              19:   r1 = 0 ll
> >   20:   r1 <<= 32                            21:   *(u64 *)(r1 + 0) = r0
> >   21:   r1 s>>= 32                           22:   r6 = 0 ll
>
> What version of clang is this? That is probably a zext in llvm-ir that in
> latest should be sufficient with the 'w1=w0'. I'm guessing (hoping?) you
> might not have latest?

Just double-checked, very latest Clang, built today. Still generates
the same code.

But I think this makes sense, because r1 is u64, and it gets assigned
from int, so int first has to be converted to s64, then casted to u64.
So sign extension is necessary. I've confirmed with this simple
program:

$ cat bla.c
#include <stdio.h>

int main() {
        int a = -1;
        unsigned long b = a;
        printf("%lx\n", b);
        return 0;
}
$ clang bla.c -o test && ./test
ffffffffffffffff

^^^^^^^^--- not zeroes

So I don't think it's a bug or inefficiency, C language requires that.

>
> >   22:   r2 = 0 ll                            24:   r6 += r0
> >   24:   *(u64 *)(r2 + 0) = r1              00000000000000c8 <LBB0_4>:
> >   25:   r6 = 0 ll                            25:   r1 = r6
> >   27:   r6 += r1                             26:   w2 = 256
> > 00000000000000e0 <LBB0_4>:                   27:   r3 = 0 ll
> >   28:   r1 = r6                              29:   call 115
> >   29:   w2 = 256
> >   30:   r3 = 0 ll
> >   32:   call 115
> >
> > 32-BIT (11 insns):                         32-BIT (12 insns):
> > ------------------------------------       ------------------------------------
> >   17:   call 115                             17:   call 115
> >   18:   if w0 > 256 goto +7 <LBB1_4>         18:   if w0 > 256 goto +8 <LBB1_4>
> >   19:   r1 = 0 ll                            19:   r1 = 0 ll
> >   21:   *(u32 *)(r1 + 0) = r0                21:   *(u32 *)(r1 + 0) = r0
> >   22:   w1 = w0                              22:   r0 <<= 32
> >   23:   r6 = 0 ll                            23:   r0 >>= 32
> >   25:   r6 += r1                             24:   r6 = 0 ll
> > 00000000000000d0 <LBB1_4>:                   26:   r6 += r0
> >   26:   r1 = r6                            00000000000000d8 <LBB1_4>:
> >   27:   w2 = 256                             27:   r1 = r6
> >   28:   r3 = 0 ll                            28:   w2 = 256
> >   30:   call 115                             29:   r3 = 0 ll
> >                                              31:   call 115
> >
> > In ALU32 mode, the variant using 64-bit length variable clearly wins and
> > avoids unnecessary zero-extension bit shifts. In practice, this is even more
> > important and good, because BPF code won't need to do extra checks to "prove"
> > that payload/len are within good bounds.
>
> I bet with latest clang the shifts are removed. But if not we probably
> should fix clang regardless of if helpers return longs or ints.

are we still talking about bit shifts for INT HELPER + U64 len case?
Or now about bit shifts in LONG HELPER + U32 len case?

>
> >
> > 32-bit len is one instruction longer. Clang decided to do 64-to-32 casting
> > with two bit shifts, instead of equivalent `w1 = w0` assignment. The former
> > uses extra register. The latter might potentially lose some range information,
> > but not for 32-bit value. So in this case, verifier infers that r0 is [0, 256]
> > after check at 18:, and shifting 32 bits left/right keeps that range intact.
> > We should probably look into Clang's logic and see why it chooses bitshifts
> > over sub-register assignments for this.
> >
> > NO-ALU32 + INT                             NO-ALU32 + LONG
> > ==============                             ===============
> >
> > 64-BIT (14 insns):                         64-BIT (10 insns):
> > ------------------------------------       ------------------------------------
> >   17:   call 115                             17:   call 115
> >   18:   r0 <<= 32                            18:   if r0 > 256 goto +6 <LBB0_4>
> >   19:   r1 = r0                              19:   r1 = 0 ll
> >   20:   r1 >>= 32                            21:   *(u64 *)(r1 + 0) = r0
> >   21:   if r1 > 256 goto +7 <LBB0_4>         22:   r6 = 0 ll
> >   22:   r0 s>>= 32                           24:   r6 += r0
> >   23:   r1 = 0 ll                          00000000000000c8 <LBB0_4>:
> >   25:   *(u64 *)(r1 + 0) = r0                25:   r1 = r6
> >   26:   r6 = 0 ll                            26:   r2 = 256
> >   28:   r6 += r0                             27:   r3 = 0 ll
> > 00000000000000e8 <LBB0_4>:                   29:   call 115
> >   29:   r1 = r6
> >   30:   r2 = 256
> >   31:   r3 = 0 ll
> >   33:   call 115
> >
> > 32-BIT (13 insns):                         32-BIT (13 insns):
> > ------------------------------------       ------------------------------------
> >   17:   call 115                             17:   call 115
> >   18:   r1 = r0                              18:   r1 = r0
> >   19:   r1 <<= 32                            19:   r1 <<= 32
> >   20:   r1 >>= 32                            20:   r1 >>= 32
> >   21:   if r1 > 256 goto +6 <LBB1_4>         21:   if r1 > 256 goto +6 <LBB1_4>
> >   22:   r2 = 0 ll                            22:   r2 = 0 ll
> >   24:   *(u32 *)(r2 + 0) = r0                24:   *(u32 *)(r2 + 0) = r0
> >   25:   r6 = 0 ll                            25:   r6 = 0 ll
> >   27:   r6 += r1                             27:   r6 += r1
> > 00000000000000e0 <LBB1_4>:                 00000000000000e0 <LBB1_4>:
> >   28:   r1 = r6                              28:   r1 = r6
> >   29:   r2 = 256                             29:   r2 = 256
> >   30:   r3 = 0 ll                            30:   r3 = 0 ll
> >   32:   call 115                             32:   call 115
> >
> > In NO-ALU32 mode, for the case of 64-bit len variable, Clang generates much
> > superior code, as expected, eliminating unnecessary bit shifts. For 32-bit
> > len, code is identical.
>
> Right I can't think of any way clang can avoid it here. OTOH I fix this
> by enabling alu32 ;)
>
> >
> > So overall, only ALU-32 32-bit len case is more-or-less equivalent and the
> > difference stems from internal Clang decision, rather than compiler lacking
> > enough information about types.
> >
> > Case 2. Let's look at the simpler case of checking return result of BPF helper
> > for errors. The code is very simple:
> >
> >   long bla;
> >   if (bpf_probe_read_kernel(&bla, sizeof(bla), 0))
> >       return 1;
> >   else
> >       return 0;
> >
> > ALU32 + CHECK (9 insns)                    ALU32 + CHECK (9 insns)
> > ====================================       ====================================
> >   0:    r1 = r10                             0:    r1 = r10
> >   1:    r1 += -8                             1:    r1 += -8
> >   2:    w2 = 8                               2:    w2 = 8
> >   3:    r3 = 0                               3:    r3 = 0
> >   4:    call 113                             4:    call 113
> >   5:    w1 = w0                              5:    r1 = r0
> >   6:    w0 = 1                               6:    w0 = 1
> >   7:    if w1 != 0 goto +1 <LBB2_2>          7:    if r1 != 0 goto +1 <LBB2_2>
> >   8:    w0 = 0                               8:    w0 = 0
> > 0000000000000048 <LBB2_2>:                 0000000000000048 <LBB2_2>:
> >   9:    exit                                 9:    exit
> >
> > Almost identical code, the only difference is the use of full register
> > assignment (r1 = r0) vs half-registers (w1 = w0) in instruction #5. On 32-bit
> > architectures, new BPF assembly might be slightly less optimal, in theory. But
> > one can argue that's not a big issue, given that use of full registers is
> > still prevalent (e.g., for parameter passing).
> >
> > NO-ALU32 + CHECK (11 insns)                NO-ALU32 + CHECK (9 insns)
> > ====================================       ====================================
> >   0:    r1 = r10                             0:    r1 = r10
> >   1:    r1 += -8                             1:    r1 += -8
> >   2:    r2 = 8                               2:    r2 = 8
> >   3:    r3 = 0                               3:    r3 = 0
> >   4:    call 113                             4:    call 113
> >   5:    r1 = r0                              5:    r1 = r0
> >   6:    r1 <<= 32                            6:    r0 = 1
> >   7:    r1 >>= 32                            7:    if r1 != 0 goto +1 <LBB2_2>
> >   8:    r0 = 1                               8:    r0 = 0
> >   9:    if r1 != 0 goto +1 <LBB2_2>        0000000000000048 <LBB2_2>:
> >  10:    r0 = 0                               9:    exit
> > 0000000000000058 <LBB2_2>:
> >  11:    exit
> >
> > NO-ALU32 is a clear improvement, getting rid of unnecessary zero-extension bit
> > shifts.
>
> It seems a win for the NO-ALU32 case, but for the +ALU32 case I think it's
> the same with the latest clang, although I haven't tried yet. I was actually
> considering going the other way and avoiding always returning u64 on
> the other side. From a purely aesthetic point of view I prefer the
> int type because it seems more clear/standard C. I'm also not so interested
> in optimizing the no-alu32 case, but I'm curious if there is a use case for
> that?

My point was that this int -> long switch doesn't degrade ALU32 and
helps no-ALU32, and thus is good :)

Overall, long as a return type matches reality and the BPF ABI
specification. BTW, one of the varlen programs from patch 2 doesn't
even validate successfully on the latest kernel with the latest Clang right
now, if helpers return int, even though it's completely correct code.
That's a real problem we have to deal with in a few major BPF
applications right now, and we have to use inline assembly to force
Clang to do the right thing. A bunch of those problems are simply
avoided with correct return types for helpers.
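
Concretely, on the program-facing side the switch is just the return
type in the generated bpf_helper_defs.h declaration, roughly like this
(a sketch, using the helper ID visible in the asm above):

  /* before */
  static int (*bpf_probe_read_kernel_str)(void *dst, __u32 size,
                                          const void *unsafe_ptr) = (void *) 115;

  /* after */
  static long (*bpf_probe_read_kernel_str)(void *dst, __u32 size,
                                           const void *unsafe_ptr) = (void *) 115;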

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH bpf-next 1/2] bpf: switch most helper return values from 32-bit int to 64-bit long
  2020-06-18  7:30   ` Andrii Nakryiko
@ 2020-06-18 18:58     ` John Fastabend
  2020-06-18 21:41       ` Andrii Nakryiko
  0 siblings, 1 reply; 18+ messages in thread
From: John Fastabend @ 2020-06-18 18:58 UTC (permalink / raw)
  To: Andrii Nakryiko, John Fastabend
  Cc: Andrii Nakryiko, bpf, Networking, Alexei Starovoitov,
	Daniel Borkmann, Kernel Team

Andrii Nakryiko wrote:
> On Wed, Jun 17, 2020 at 11:49 PM John Fastabend
> <john.fastabend@gmail.com> wrote:
> >
> > Andrii Nakryiko wrote:
> > > Switch most of BPF helper definitions from returning int to long. These
> > > definitions are coming from comments in BPF UAPI header and are used to
> > > generate bpf_helper_defs.h (under libbpf) to be later included and used from
> > > BPF programs.
> > >
> > > In actual in-kernel implementation, all the helpers are defined as returning
> > > u64, but due to some historical reasons, most of them are actually defined as
> > > returning int in UAPI (usually, to return 0 on success, and negative value on
> > > error).
> >
> > Could we change the helpers side to return correct types now? Meaning if the
> > UAPI claims its an int lets actually return the int.
> 
> I'm not sure how exactly you see this being done. BPF ABI dictates
> that the helper's result is passed in a full 64-bit r0 register. Are
> you suggesting that in addition to RET_ANYTHING we should add
> RET_ANYTHING32 and teach verifier that higher 32 bits of r0 are
> guaranteed to be zero? And then make helpers actually return 32-bit
> values without up-casting them to u64?

Yes, this is what I was thinking: having a RET_ANYTHING32, but I would
assume the upper 32 bits could be anything, not zeroed. For +alu32
and a programmer using correct types I would expect clang to generate
good code here and mostly not need to zero the upper bits.

I think this discussion can be independent of your changes though, and
it's not at the top of my todo list, so I probably won't get to
investigating more for a while.

> 
> >
> > >
> > > This actually causes Clang to quite often generate sub-optimal code, because
> > > compiler believes that return value is 32-bit, and in a lot of cases has to be
> > > up-converted (usually with a pair of 32-bit bit shifts) to 64-bit values,
> > > before they can be used further in BPF code.
> > >
> > > Besides just "polluting" the code, these 32-bit shifts quite often cause
> > > problems for cases in which return value matters. This is especially the case
> > > for the family of bpf_probe_read_str() functions. There are few other similar
> > > helpers (e.g., bpf_read_branch_records()), in which return value is used by
> > > BPF program logic to record variable-length data and process it. For such
> > > cases, BPF program logic carefully manages offsets within some array or map to
> > > read variable-length data. For such uses, it's crucial for BPF verifier to
> > > track possible range of register values to prove that all the accesses happen
> > > within given memory bounds. Those extraneous zero-extending bit shifts,
> > > inserted by Clang (and quite often interleaved with other code, which makes
> > > the issues even more challenging and sometimes requires employing extra
> > > per-variable compiler barriers), throws off verifier logic and makes it mark
> > > registers as having unknown variable offset. We'll study this pattern a bit
> > > later below.
> >
> > With latest verifier zext with alu32 support should be implemented as a
> > MOV insn.
> 
> Code generation is independent of verifier version or am I not getting
> what you are saying? Also all this code was compiled with up-to-date
> Clang.

Agh, sorry, I read the example too fast and too late. The above is a typo;
what I meant is "With the latest _clang_, zext with alu32 support...". But
I also read your code example wrong, and the left/right shift pattern here,

> > >   19:   w1 = w0                              19:   r1 = 0 ll
> > >   20:   r1 <<= 32
> > >   21:   r1 s>>= 32

above is a _sign_ extension to deal with the types. OK, thanks for bearing with
me on this one. Some more thoughts below.

> 
> >
> > >
> > > Another common pattern is to check return of BPF helper for non-zero state to
> > > detect error conditions and attempt alternative actions in such case. Even in
> > > this simple and straightforward case, this 32-bit vs BPF's native 64-bit mode
> > > quite often leads to sub-optimal and unnecessary extra code. We'll look at
> > > this pattern as well.
> > >
> > > Clang's BPF target supports two modes of code generation: ALU32, in which it
> > > is capable of using lower 32-bit parts of registers, and no-ALU32, in which
> > > only full 64-bit registers are being used. ALU32 mode somewhat mitigates the
> > > above described problems, but not in all cases.
> >
> > A bit curious, do you see users running with no-ALU32 support? I have enabled
> > it by default now. It seems to generate better code and with latest 32-bit
> > bounds tracking I haven't hit any issues with verifier.
> 
> Yes, all Facebook apps are built with no-ALU32. And those apps have to
> run on quite old kernels as well, so relying on latest bug fixes in
> kernel is not an option right now.

OK got it.

> 
> >
> > >
> > > This patch switches all the cases in which BPF helpers return 0 or negative
> > > error from returning int to returning long. It is shown below that such change
> > > in definition leads to equivalent or better code. No-ALU32 mode benefits more,
> > > but ALU32 mode doesn't degrade or still gets improved code generation.
> > >
> > > Another class of cases switched from int to long are bpf_probe_read_str()-like
> > > helpers, which encode successful case as non-negative values, while still
> > > returning negative value for errors.
> > >
> > > In all of such cases, correctness is preserved due to two's complement
> > > encoding of negative values and the fact that all helpers return values with
> > > 32-bit absolute value. Two's complement ensures that for negative values
> > > higher 32 bits are all ones and when truncated, leave valid negative 32-bit
> > > value with the same value. Non-negative values have upper 32 bits set to zero
> > > and similarly preserve value when high 32 bits are truncated. This means that
> > > just casting to int/u32 is correct and efficient (and in ALU32 mode doesn't
> > > require any extra shifts).
> > >
> > > To minimize the chances of regressions, two code patterns were investigated,
> > > as mentioned above. For both patterns, BPF assembly was analyzed in
> > > ALU32/NO-ALU32 compiler modes, both with current 32-bit int return type and
> > > new 64-bit long return type.
> > >
> > > Case 1. Variable-length data reading and concatenation. This is quite
> > > ubiquitous pattern in tracing/monitoring applications, reading data like
> > > process's environment variables, file path, etc. In such case, many pieces of
> > > string-like variable-length data are read into a single big buffer, and at the
> > > end of the process, only a part of array containing actual data is sent to
> > > user-space for further processing. This case is tested in test_varlen.c
> > > selftest (in the next patch). Code flow is roughly as follows:
> > >
> > >   void *payload = &sample->payload;
> > >   u64 len;
> > >
> > >   len = bpf_probe_read_kernel_str(payload, MAX_SZ1, &source_data1);
> > >   if (len <= MAX_SZ1) {
> > >       payload += len;
> > >       sample->len1 = len;
> > >   }
> > >   len = bpf_probe_read_kernel_str(payload, MAX_SZ2, &source_data2);
> > >   if (len <= MAX_SZ2) {
> > >       payload += len;
> > >       sample->len2 = len;
> > >   }
> > >   /* and so on */
> > >   sample->total_len = payload - &sample->payload;
> > >   /* send over, e.g., perf buffer */
> > >
> > > There could be two variations with slightly different code generated: when len
> > > is 64-bit integer and when it is 32-bit integer. Both variations were analysed.
> > > BPF assembly instructions between two successive invocations of
> > > bpf_probe_read_kernel_str() were used to check code regressions. Results are
> > > below, followed by short analysis. Left side is using helpers with int return
> > > type, the right one is after the switch to long.
> > >
> > > ALU32 + INT                                ALU32 + LONG
> > > ===========                                ============
> > >
> > > 64-BIT (13 insns):                         64-BIT (10 insns):
> > > ------------------------------------       ------------------------------------
> > >   17:   call 115                             17:   call 115
> > >   18:   if w0 > 256 goto +9 <LBB0_4>         18:   if r0 > 256 goto +6 <LBB0_4>
> > >   19:   w1 = w0                              19:   r1 = 0 ll
> > >   20:   r1 <<= 32                            21:   *(u64 *)(r1 + 0) = r0
> > >   21:   r1 s>>= 32                           22:   r6 = 0 ll
> >
> > What version of clang is this? That is probably a zext in llvm-ir that in
> > latest should be sufficient with the 'w1=w0'. I'm guessing (hoping?) you
> > might not have latest?
> 
> Just double-checked, very latest Clang, built today. Still generates
> the same code.
> 
> But I think this makes sense, because r1 is u64, and it gets assigned
> from int, so int first has to be converted to s64, then casted to u64.
> So sign extension is necessary. I've confirmed with this simple
> program:
> 
> $ cat bla.c
> #include <stdio.h>
> 
> int main() {
>         int a = -1;
>         unsigned long b = a;
>         printf("%lx\n", b);
>         return 0;
> }
> $ clang bla.c -o test && ./test
> ffffffffffffffff
> 
> ^^^^^^^^--- not zeroes
> 
> So I don't think it's a bug or inefficiency, C language requires that.

Agreed. Sorry for the confusion on my side. I poked at this a bit more this
morning, trying to see why I don't hit the same pattern when we have many
cases very similar to the above.

In your C code you never check for negative return codes? Oops, this
means you can walk backwards off the front of payload? This is probably
not valid from the program logic side, and/or the verifier will probably
yell. I commented where I think you want a <0 check here:

 void *payload = &sample->payload;
 u64 len; // but how do you get a len < 0 check now with u64 type?

 len = bpf_probe_read_kernel_str(payload, MAX_SZ1, &source_data1);
 // should insert a negative case check here
 // if (len < 0) goto abort;
 if (len <= MAX_SZ1) {
	payload += len;
	sample->len1 = len;
 }
 // without the <0 case you may have walked backwards in payload?
 len = bpf_probe_read_kernel_str(payload, MAX_SZ2, &source_data2);

So in all of my C code I have something like this,

 int myfunc(void *a, void *b, void *c) {
	void *payload = a;
	int len;

	len = probe_read_str(payload, 1000, b);
	if (len < 0) return len;
	if (len <= 1000) { // yeah we have room
		payload += len;
	}
	len = probe_read_str(payload, 1000, c);
	[...]
 }

And then when I look at generated code I get this,

; int myfunc(void *a, void *b, void *c) {
       4:	call 45
; 	if (len < 0) return len;
       5:	if w0 s< 0 goto +9 <LBB0_4>
; 	if (len <= 1000) {
       6:	w2 = w0
       7:	r1 = r7
       8:	r1 += r2
       9:	if w0 s< 1001 goto +1 <LBB0_3>
      10:	r1 = r7

0000000000000058 <LBB0_3>:
; 	len = probe_read_str(payload, 1000, b);
      11:	w2 = 1000
      12:	r3 = r6
      13:	call 45


Here the <0 check means we can skip the sext and do a
simple zext which is just a w2=w0 by bpf zero extension
rules.

> 
> >
> > >   22:   r2 = 0 ll                            24:   r6 += r0
> > >   24:   *(u64 *)(r2 + 0) = r1              00000000000000c8 <LBB0_4>:
> > >   25:   r6 = 0 ll                            25:   r1 = r6
> > >   27:   r6 += r1                             26:   w2 = 256
> > > 00000000000000e0 <LBB0_4>:                   27:   r3 = 0 ll
> > >   28:   r1 = r6                              29:   call 115
> > >   29:   w2 = 256
> > >   30:   r3 = 0 ll
> > >   32:   call 115
> > >
> > > 32-BIT (11 insns):                         32-BIT (12 insns):
> > > ------------------------------------       ------------------------------------
> > >   17:   call 115                             17:   call 115
> > >   18:   if w0 > 256 goto +7 <LBB1_4>         18:   if w0 > 256 goto +8 <LBB1_4>
> > >   19:   r1 = 0 ll                            19:   r1 = 0 ll
> > >   21:   *(u32 *)(r1 + 0) = r0                21:   *(u32 *)(r1 + 0) = r0
> > >   22:   w1 = w0                              22:   r0 <<= 32
> > >   23:   r6 = 0 ll                            23:   r0 >>= 32
> > >   25:   r6 += r1                             24:   r6 = 0 ll
> > > 00000000000000d0 <LBB1_4>:                   26:   r6 += r0
> > >   26:   r1 = r6                            00000000000000d8 <LBB1_4>:
> > >   27:   w2 = 256                             27:   r1 = r6
> > >   28:   r3 = 0 ll                            28:   w2 = 256
> > >   30:   call 115                             29:   r3 = 0 ll
> > >                                              31:   call 115
> > >
> > > In ALU32 mode, the variant using 64-bit length variable clearly wins and
> > > avoids unnecessary zero-extension bit shifts. In practice, this is even more
> > > important and good, because BPF code won't need to do extra checks to "prove"
> > > that payload/len are within good bounds.
> >
> > I bet with latest clang the shifts are removed. But if not we probably
> > should fix clang regardless of if helpers return longs or ints.
> 
> are we still talking about bit shifts for INT HELPER + U64 len case?
> Or now about bit shifts in LONG HELPER + U32 len case?

Forget it, I was confused. I put the <0 checks in there mentally and it
messed up how I read your code.

> 
> >
> > >
> > > 32-bit len is one instruction longer. Clang decided to do 64-to-32 casting
> > > with two bit shifts, instead of equivalent `w1 = w0` assignment. The former
> > > uses extra register. The latter might potentially lose some range information,
> > > but not for 32-bit value. So in this case, verifier infers that r0 is [0, 256]
> > > after check at 18:, and shifting 32 bits left/right keeps that range intact.
> > > We should probably look into Clang's logic and see why it chooses bitshifts
> > > over sub-register assignments for this.
> > >
> > > NO-ALU32 + INT                             NO-ALU32 + LONG
> > > ==============                             ===============
> > >
> > > 64-BIT (14 insns):                         64-BIT (10 insns):
> > > ------------------------------------       ------------------------------------
> > >   17:   call 115                             17:   call 115
> > >   18:   r0 <<= 32                            18:   if r0 > 256 goto +6 <LBB0_4>
> > >   19:   r1 = r0                              19:   r1 = 0 ll
> > >   20:   r1 >>= 32                            21:   *(u64 *)(r1 + 0) = r0
> > >   21:   if r1 > 256 goto +7 <LBB0_4>         22:   r6 = 0 ll
> > >   22:   r0 s>>= 32                           24:   r6 += r0
> > >   23:   r1 = 0 ll                          00000000000000c8 <LBB0_4>:
> > >   25:   *(u64 *)(r1 + 0) = r0                25:   r1 = r6
> > >   26:   r6 = 0 ll                            26:   r2 = 256
> > >   28:   r6 += r0                             27:   r3 = 0 ll
> > > 00000000000000e8 <LBB0_4>:                   29:   call 115
> > >   29:   r1 = r6
> > >   30:   r2 = 256
> > >   31:   r3 = 0 ll
> > >   33:   call 115
> > >
> > > 32-BIT (13 insns):                         32-BIT (13 insns):
> > > ------------------------------------       ------------------------------------
> > >   17:   call 115                             17:   call 115
> > >   18:   r1 = r0                              18:   r1 = r0
> > >   19:   r1 <<= 32                            19:   r1 <<= 32
> > >   20:   r1 >>= 32                            20:   r1 >>= 32
> > >   21:   if r1 > 256 goto +6 <LBB1_4>         21:   if r1 > 256 goto +6 <LBB1_4>
> > >   22:   r2 = 0 ll                            22:   r2 = 0 ll
> > >   24:   *(u32 *)(r2 + 0) = r0                24:   *(u32 *)(r2 + 0) = r0
> > >   25:   r6 = 0 ll                            25:   r6 = 0 ll
> > >   27:   r6 += r1                             27:   r6 += r1
> > > 00000000000000e0 <LBB1_4>:                 00000000000000e0 <LBB1_4>:
> > >   28:   r1 = r6                              28:   r1 = r6
> > >   29:   r2 = 256                             29:   r2 = 256
> > >   30:   r3 = 0 ll                            30:   r3 = 0 ll
> > >   32:   call 115                             32:   call 115
> > >
> > > In NO-ALU32 mode, for the case of 64-bit len variable, Clang generates much
> > > superior code, as expected, eliminating unnecessary bit shifts. For 32-bit
> > > len, code is identical.
> >
> > Right I can't think of any way clang can avoid it here. OTOH I fix this
> > by enabling alu32 ;)
> >
> > >
> > > So overall, only ALU-32 32-bit len case is more-or-less equivalent and the
> > > difference stems from internal Clang decision, rather than compiler lacking
> > > enough information about types.
> > >
> > > Case 2. Let's look at the simpler case of checking return result of BPF helper
> > > for errors. The code is very simple:
> > >
> > >   long bla;
> > >   if (bpf_probe_read_kernel(&bla, sizeof(bla), 0))
> > >       return 1;
> > >   else
> > >       return 0;
> > >
> > > ALU32 + CHECK (9 insns)                    ALU32 + CHECK (9 insns)
> > > ====================================       ====================================
> > >   0:    r1 = r10                             0:    r1 = r10
> > >   1:    r1 += -8                             1:    r1 += -8
> > >   2:    w2 = 8                               2:    w2 = 8
> > >   3:    r3 = 0                               3:    r3 = 0
> > >   4:    call 113                             4:    call 113
> > >   5:    w1 = w0                              5:    r1 = r0
> > >   6:    w0 = 1                               6:    w0 = 1
> > >   7:    if w1 != 0 goto +1 <LBB2_2>          7:    if r1 != 0 goto +1 <LBB2_2>
> > >   8:    w0 = 0                               8:    w0 = 0
> > > 0000000000000048 <LBB2_2>:                 0000000000000048 <LBB2_2>:
> > >   9:    exit                                 9:    exit
> > >
> > > Almost identical code, the only difference is the use of full register
> > > assignment (r1 = r0) vs half-registers (w1 = w0) in instruction #5. On 32-bit
> > > architectures, new BPF assembly might be slightly less optimal, in theory. But
> > > one can argue that's not a big issue, given that use of full registers is
> > > still prevalent (e.g., for parameter passing).
> > >
> > > NO-ALU32 + CHECK (11 insns)                NO-ALU32 + CHECK (9 insns)
> > > ====================================       ====================================
> > >   0:    r1 = r10                             0:    r1 = r10
> > >   1:    r1 += -8                             1:    r1 += -8
> > >   2:    r2 = 8                               2:    r2 = 8
> > >   3:    r3 = 0                               3:    r3 = 0
> > >   4:    call 113                             4:    call 113
> > >   5:    r1 = r0                              5:    r1 = r0
> > >   6:    r1 <<= 32                            6:    r0 = 1
> > >   7:    r1 >>= 32                            7:    if r1 != 0 goto +1 <LBB2_2>
> > >   8:    r0 = 1                               8:    r0 = 0
> > >   9:    if r1 != 0 goto +1 <LBB2_2>        0000000000000048 <LBB2_2>:
> > >  10:    r0 = 0                               9:    exit
> > > 0000000000000058 <LBB2_2>:
> > >  11:    exit
> > >
> > > NO-ALU32 is a clear improvement, getting rid of unnecessary zero-extension bit
> > > shifts.
> >
> > It seems a win for the NO-ALU32 case, but for the +ALU32 case I think it's
> > the same with the latest clang, although I haven't tried yet. I was actually
> > considering going the other way and avoiding always returning u64 on
> > the other side. From a purely aesthetic point of view I prefer the
> > int type because it seems more clear/standard C. I'm also not so interested
> > in optimizing the no-alu32 case, but I'm curious if there is a use case for
> > that?
> 
> My point was that this int -> long switch doesn't degrade ALU32 and
> helps no-ALU32, and thus is good :)

With the long vs int I do see worse code when using the <0 check.
I'm using the C function below, which I took from some real code and
renamed the variables in.

int myfunc(void *a, void *b, void *c) {
	void *payload = a;
	int len;

	len = probe_read_str(payload, 1000, a);
	if (len < 0) return len;
	if (len <= 1000) {
		payload += len;
	}
	len = probe_read_str(payload, 1000, b);
	if (len <= 1000) {
		payload += len;
	}
	return 1;
}

Then here is the side-by-side of generated code, with +ALU32.

  int BPF_FUNC(probe_read, ...                  long BPF_FUNC(probe_read, ...
-------------------------------                ---------------------------------
       0:	r6 = r2                         0:	r6 = r2
       1:	r7 = r1                         1:	r7 = r1
       2:	w2 = 1000                       2:	w2 = 1000
       3:	r3 = r7                         3:	r3 = r7
       4:	call 45                         4:	call 45
       5:	if w0 s< 0 goto +9 <LBB0_4>     5:      r2 = r0
       6:	w2 = w0                         6:	if w0 s< 0 goto +10 <LBB0_4>
       7:	r1 = r7                         7:	r2 <<= 32
       8:	r1 += r2                        8:	r2 s>>= 32
       9:	if w0 s< 1001 goto +1 <LBB0_3>  9:	r1 = r7
      10:	r1 = r7                        10:	r1 += r2
      11:	w2 = 1000                      11:	if w0 s< 1001 goto +1 <LBB0_3>
      12:	r3 = r6                        12:	r1 = r7
      13:	call 45                        13:	w2 = 1000
      14:	w0 = 1                         14:	r3 = r6
      15:       exit                           15:	call 45
                                               16:	w0 = 1
                                               17:      exit

So a couple of extra instructions, but more concerning, we created a
<<=, s>>= pattern. I'll need to do some more tests, but my concern
is that it could break the verifier for real programs we have. I guess
it didn't in the selftests? Surely, this thread has at least
pointed out some gaps in our test cases. I guess I _could_ make
len a u64 type to remove the sext, but then a <0 check on a u64?!

> 
> Overall, long as a return type matches reality and the BPF ABI
> specification. BTW, one of the varlen programs from patch 2 doesn't
> even validate successfully on the latest kernel with the latest Clang right
> now, if helpers return int, even though it's completely correct code.
> That's a real problem we have to deal with in a few major BPF
> applications right now, and we have to use inline assembly to force
> Clang to do the right thing. A bunch of those problems are simply
> avoided with correct return types for helpers.

Do the real programs check <0? Did I miss something? I'll try
applying your patch to our real code base and see what happens.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH bpf-next 1/2] bpf: switch most helper return values from 32-bit int to 64-bit long
  2020-06-18 18:58     ` John Fastabend
@ 2020-06-18 21:41       ` Andrii Nakryiko
  2020-06-18 23:30         ` John Fastabend
  0 siblings, 1 reply; 18+ messages in thread
From: Andrii Nakryiko @ 2020-06-18 21:41 UTC (permalink / raw)
  To: John Fastabend
  Cc: Andrii Nakryiko, bpf, Networking, Alexei Starovoitov,
	Daniel Borkmann, Kernel Team

On Thu, Jun 18, 2020 at 11:58 AM John Fastabend
<john.fastabend@gmail.com> wrote:
>
> Andrii Nakryiko wrote:
> > On Wed, Jun 17, 2020 at 11:49 PM John Fastabend
> > <john.fastabend@gmail.com> wrote:
> > >
> > > Andrii Nakryiko wrote:
> > > > Switch most of BPF helper definitions from returning int to long. These
> > > > definitions are coming from comments in BPF UAPI header and are used to
> > > > generate bpf_helper_defs.h (under libbpf) to be later included and used from
> > > > BPF programs.
> > > >
> > > > In actual in-kernel implementation, all the helpers are defined as returning
> > > > u64, but due to some historical reasons, most of them are actually defined as
> > > > returning int in UAPI (usually, to return 0 on success, and negative value on
> > > > error).
> > >
> > > Could we change the helpers side to return correct types now? Meaning if the
> > > UAPI claims its an int lets actually return the int.
> >
> > I'm not sure how exactly you see this being done. BPF ABI dictates
> > that the helper's result is passed in a full 64-bit r0 register. Are
> > you suggesting that in addition to RET_ANYTHING we should add
> > RET_ANYTHING32 and teach verifier that higher 32 bits of r0 are
> > guaranteed to be zero? And then make helpers actually return 32-bit
> > values without up-casting them to u64?
>
> Yes, this is what I was thinking: having a RET_ANYTHING32, but I would
> assume the upper 32 bits could be anything, not zeroed. For +alu32
> and a programmer using correct types I would expect clang to generate
> good code here and mostly not need to zero the upper bits.
>
> I think this discussion can be independent of your changes though, and
> it's not at the top of my todo list, so I probably won't get to
> investigating more for a while.

I'm confused. If the verifier doesn't make any assumptions about the upper
32 bits for RET_ANYTHING32, how is it different from RET_ANYTHING and
today's logic? What you described is exactly what happens when
bpf_helper_defs.h has BPF helpers defined as returning int.

>
> >
> > >
> > > >
> > > > This actually causes Clang to quite often generate sub-optimal code, because
> > > > compiler believes that return value is 32-bit, and in a lot of cases has to be
> > > > up-converted (usually with a pair of 32-bit bit shifts) to 64-bit values,
> > > > before they can be used further in BPF code.
> > > >
> > > > Besides just "polluting" the code, these 32-bit shifts quite often cause
> > > > problems for cases in which return value matters. This is especially the case
> > > > for the family of bpf_probe_read_str() functions. There are few other similar
> > > > helpers (e.g., bpf_read_branch_records()), in which return value is used by
> > > > BPF program logic to record variable-length data and process it. For such
> > > > cases, BPF program logic carefully manages offsets within some array or map to
> > > > read variable-length data. For such uses, it's crucial for BPF verifier to
> > > > track possible range of register values to prove that all the accesses happen
> > > > within given memory bounds. Those extraneous zero-extending bit shifts,
> > > > inserted by Clang (and quite often interleaved with other code, which makes
> > > > the issues even more challenging and sometimes requires employing extra
> > > > per-variable compiler barriers), throws off verifier logic and makes it mark
> > > > registers as having unknown variable offset. We'll study this pattern a bit
> > > > later below.
> > >
> > > With latest verifier zext with alu32 support should be implemented as a
> > > MOV insn.
> >
> > Code generation is independent of verifier version or am I not getting
> > what you are saying? Also all this code was compiled with up-to-date
> > Clang.
>
> Agh, sorry, I read the example too fast and too late. The above is a typo;
> what I meant is "With the latest _clang_, zext with alu32 support...". But
> I also read your code example wrong, and the left/right shift pattern here,
>
> > > >   19:   w1 = w0                              19:   r1 = 0 ll
> > > >   20:   r1 <<= 32
> > > >   21:   r1 s>>= 32
>
> above is a _sign_ extension to deal with the types. OK, thanks for bearing
> with me on this one. Some more thoughts below.

No worries, just trying to ensure we are on the same page in discussion :)

>
> >
> > >
> > > >
> > > > Another common pattern is to check return of BPF helper for non-zero state to
> > > > detect error conditions and attempt alternative actions in such case. Even in
> > > > this simple and straightforward case, this 32-bit vs BPF's native 64-bit mode
> > > > quite often leads to sub-optimal and unnecessary extra code. We'll look at
> > > > this pattern as well.
> > > >
> > > > Clang's BPF target supports two modes of code generation: ALU32, in which it
> > > > is capable of using lower 32-bit parts of registers, and no-ALU32, in which
> > > > only full 64-bit registers are being used. ALU32 mode somewhat mitigates the
> > > > above described problems, but not in all cases.
> > >
> > > A bit curious, do you see users running with no-ALU32 support? I have enabled
> > > it by default now. It seems to generate better code and with latest 32-bit
> > > bounds tracking I haven't hit any issues with verifier.
> >
> > Yes, all Facebook apps are built with no-ALU32. And those apps have to
> > run on quite old kernels as well, so relying on latest bug fixes in
> > kernel is not an option right now.
>
> OK got it.
>
> >
> > >
> > > >
> > > > This patch switches all the cases in which BPF helpers return 0 or negative
> > > > error from returning int to returning long. It is shown below that such change
> > > > in definition leads to equivalent or better code. No-ALU32 mode benefits more,
> > > > but ALU32 mode doesn't degrade or still gets improved code generation.
> > > >
> > > > Another class of cases switched from int to long are bpf_probe_read_str()-like
> > > > helpers, which encode successful case as non-negative values, while still
> > > > returning negative value for errors.
> > > >
> > > > In all of such cases, correctness is preserved due to two's complement
> > > > encoding of negative values and the fact that all helpers return values with
> > > > 32-bit absolute value. Two's complement ensures that for negative values
> > > > higher 32 bits are all ones and when truncated, leave valid negative 32-bit
> > > > value with the same value. Non-negative values have upper 32 bits set to zero
> > > > and similarly preserve value when high 32 bits are truncated. This means that
> > > > just casting to int/u32 is correct and efficient (and in ALU32 mode doesn't
> > > > require any extra shifts).
> > > >
> > > > To minimize the chances of regressions, two code patterns were investigated,
> > > > as mentioned above. For both patterns, BPF assembly was analyzed in
> > > > ALU32/NO-ALU32 compiler modes, both with current 32-bit int return type and
> > > > new 64-bit long return type.
> > > >
> > > > Case 1. Variable-length data reading and concatenation. This is quite
> > > > ubiquitous pattern in tracing/monitoring applications, reading data like
> > > > process's environment variables, file path, etc. In such case, many pieces of
> > > > string-like variable-length data are read into a single big buffer, and at the
> > > > end of the process, only a part of array containing actual data is sent to
> > > > user-space for further processing. This case is tested in test_varlen.c
> > > > selftest (in the next patch). Code flow is roughly as follows:
> > > >
> > > >   void *payload = &sample->payload;
> > > >   u64 len;
> > > >
> > > >   len = bpf_probe_read_kernel_str(payload, MAX_SZ1, &source_data1);
> > > >   if (len <= MAX_SZ1) {
> > > >       payload += len;
> > > >       sample->len1 = len;
> > > >   }
> > > >   len = bpf_probe_read_kernel_str(payload, MAX_SZ2, &source_data2);
> > > >   if (len <= MAX_SZ2) {
> > > >       payload += len;
> > > >       sample->len2 = len;
> > > >   }
> > > >   /* and so on */
> > > >   sample->total_len = payload - &sample->payload;
> > > >   /* send over, e.g., perf buffer */
> > > >
> > > > There could be two variations with slightly different code generated: when len
> > > > is 64-bit integer and when it is 32-bit integer. Both variations were analysed.
> > > > BPF assembly instructions between two successive invocations of
> > > > bpf_probe_read_kernel_str() were used to check code regressions. Results are
> > > > below, followed by short analysis. Left side is using helpers with int return
> > > > type, the right one is after the switch to long.
> > > >
> > > > ALU32 + INT                                ALU32 + LONG
> > > > ===========                                ============
> > > >
> > > > 64-BIT (13 insns):                         64-BIT (10 insns):
> > > > ------------------------------------       ------------------------------------
> > > >   17:   call 115                             17:   call 115
> > > >   18:   if w0 > 256 goto +9 <LBB0_4>         18:   if r0 > 256 goto +6 <LBB0_4>
> > > >   19:   w1 = w0                              19:   r1 = 0 ll
> > > >   20:   r1 <<= 32                            21:   *(u64 *)(r1 + 0) = r0
> > > >   21:   r1 s>>= 32                           22:   r6 = 0 ll
> > >
> > > What version of clang is this? That is probably a zext in llvm-ir that in
> > > latest should be sufficient with the 'w1=w0'. I'm guessing (hoping?) you
> > > might not have latest?
> >
> > Just double-checked, very latest Clang, built today. Still generates
> > the same code.
> >
> > But I think this makes sense, because r1 is u64, and it gets assigned
> > from int, so int first has to be converted to s64, then casted to u64.
> > So sign extension is necessary. I've confirmed with this simple
> > program:
> >
> > $ cat bla.c
> > #include <stdio.h>
> >
> > int main() {
> >         int a = -1;
> >         unsigned long b = a;
> >         printf("%lx\n", b);
> >         return 0;
> > }
> > $ clang bla.c -o test && ./test
> > ffffffffffffffff
> >
> > ^^^^^^^^--- not zeroes
> >
> > So I don't think it's a bug or inefficiency, C language requires that.
>
> Agreed. Sorry for the confusion on my side. I poked at this a bit more this
> morning, trying to see why I don't hit the same pattern when we have many
> cases very similar to the above.
>
> In your C code you never check for negative return codes? Oops, this
> means you can walk backwards off the front of payload? This is probably
> not valid from the program logic side, and/or the verifier will probably
> yell. I commented where I think you want a <0 check here:

You are missing that I'm using an unsigned u64, so (s64)-1 ==
(u64)0xFFFFFFFFFFFFFFFF. Negative errors are effectively turned
into a too-large length, and I filter them out with the same (len >
MAX_SZ) check. This allows doing just one comparison instead of two,
and also helps avoid some Clang optimizations that Yonghong is trying
to undo right now (if (a > X && a < Y) turned into if (x < Y - X),
with assembly that the verifier can't verify). So no bug there, a very
deliberate choice of types.
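
Spelled out with a concrete value (purely illustrative): if the helper
fails with, say, -EFAULT, then as a u64 that's 0xfffffffffffffff2, which
is certainly > MAX_SZ1, so the single unsigned compare rejects errors
and too-long strings alike:

  u64 len = bpf_probe_read_kernel_str(payload, MAX_SZ1, &source_data1);
  /* error (e.g. (u64)-EFAULT) and "too long" both fail this check */
  if (len <= MAX_SZ1) {
          payload += len;
          sample->len1 = len;
  }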

>
>  void *payload = &sample->payload;
>  u64 len; // but how do you get a len < 0 check now with u64 type?
>
>  len = bpf_probe_read_kernel_str(payload, MAX_SZ1, &source_data1);
>  // should insert a negative case check here
>  // if (len < 0) goto abort;
>  if (len <= MAX_SZ1) {
>         payload += len;
>         sample->len1 = len;
>  }
>  // without the <0 case you may have walked backwards in payload?
>  len = bpf_probe_read_kernel_str(payload, MAX_SZ2, &source_data2);
>
> So in all of my C code I have something like this,
>
>  int myfunc(void *a, void *b, void *c) {
>         void *payload = a;
>         int len;
>
>         len = probe_read_str(payload, 1000, b);
>         if (len < 0) return len;
>         if (len <= 1000) { // yeah we have room
>                 payload += len;
>         }
>         len = probe_read_str(payload, 1000, c);
>         [...]
>  }
>
> And then when I look at generated code I get this,
>
> ; int myfunc(void *a, void *b, void *c) {
>        4:       call 45
> ;       if (len < 0) return len;
>        5:       if w0 s< 0 goto +9 <LBB0_4>
> ;       if (len <= 1000) {
>        6:       w2 = w0
>        7:       r1 = r7
>        8:       r1 += r2
>        9:       if w0 s< 1001 goto +1 <LBB0_3>
>       10:       r1 = r7
>
> 0000000000000058 <LBB0_3>:
> ;       len = probe_read_str(payload, 1000, b);
>       11:       w2 = 1000
>       12:       r3 = r6
>       13:       call 45
>
>
> Here the <0 check means we can skip the sext and do a
> simple zext which is just a w2=w0 by bpf zero extension
> rules.

See above. In practice (it might be a no-ALU32-only thing, I don't know),
doing two ifs is both less efficient and quite often leads to
unverifiable code. Users have to resort to hacks that complicate control
flow enough to prevent Clang from doing Hi/Lo combining. I learned a new
inline assembly trick recently to prevent this, but either way it's
unpleasant and unnecessary.
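
For reference, the kind of per-variable barrier I mean is roughly this
(a minimal sketch with an illustrative macro name, not the exact code
from those applications):

  /* empty asm that forces "len" through a register, so Clang can't
   * merge or reorder the range checks around it */
  #define barrier_var(var) asm volatile("" : "+r"(var))

  len = bpf_probe_read_kernel_str(payload, MAX_SZ1, &source_data1);
  barrier_var(len);
  if (len <= MAX_SZ1)
          payload += len;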

>
> >
> > >
> > > >   22:   r2 = 0 ll                            24:   r6 += r0
> > > >   24:   *(u64 *)(r2 + 0) = r1              00000000000000c8 <LBB0_4>:
> > > >   25:   r6 = 0 ll                            25:   r1 = r6
> > > >   27:   r6 += r1                             26:   w2 = 256
> > > > 00000000000000e0 <LBB0_4>:                   27:   r3 = 0 ll
> > > >   28:   r1 = r6                              29:   call 115
> > > >   29:   w2 = 256
> > > >   30:   r3 = 0 ll
> > > >   32:   call 115
> > > >
> > > > 32-BIT (11 insns):                         32-BIT (12 insns):
> > > > ------------------------------------       ------------------------------------
> > > >   17:   call 115                             17:   call 115
> > > >   18:   if w0 > 256 goto +7 <LBB1_4>         18:   if w0 > 256 goto +8 <LBB1_4>
> > > >   19:   r1 = 0 ll                            19:   r1 = 0 ll
> > > >   21:   *(u32 *)(r1 + 0) = r0                21:   *(u32 *)(r1 + 0) = r0
> > > >   22:   w1 = w0                              22:   r0 <<= 32
> > > >   23:   r6 = 0 ll                            23:   r0 >>= 32
> > > >   25:   r6 += r1                             24:   r6 = 0 ll
> > > > 00000000000000d0 <LBB1_4>:                   26:   r6 += r0
> > > >   26:   r1 = r6                            00000000000000d8 <LBB1_4>:
> > > >   27:   w2 = 256                             27:   r1 = r6
> > > >   28:   r3 = 0 ll                            28:   w2 = 256
> > > >   30:   call 115                             29:   r3 = 0 ll
> > > >                                              31:   call 115
> > > >
> > > > In ALU32 mode, the variant using 64-bit length variable clearly wins and
> > > > avoids unnecessary zero-extension bit shifts. In practice, this is even more
> > > > important and good, because BPF code won't need to do extra checks to "prove"
> > > > that payload/len are within good bounds.
> > >
> > > I bet with latest clang the shifts are removed. But if not we probably
> > > should fix clang regardless of if helpers return longs or ints.
> >
> > are we still talking about bit shifts for INT HELPER + U64 len case?
> > Or now about bit shifts in LONG HELPER + U32 len case?
>
> Forget it, I was confused. I put the <0 checks in there mentally and it
> messed up how I read your code.
>
> >
> > >
> > > >
> > > > 32-bit len is one instruction longer. Clang decided to do 64-to-32 casting
> > > > with two bit shifts, instead of equivalent `w1 = w0` assignment. The former
> > > > uses extra register. The latter might potentially lose some range information,
> > > > but not for 32-bit value. So in this case, verifier infers that r0 is [0, 256]
> > > > after check at 18:, and shifting 32 bits left/right keeps that range intact.
> > > > We should probably look into Clang's logic and see why it chooses bitshifts
> > > > over sub-register assignments for this.
> > > >
> > > > NO-ALU32 + INT                             NO-ALU32 + LONG
> > > > ==============                             ===============
> > > >
> > > > 64-BIT (14 insns):                         64-BIT (10 insns):
> > > > ------------------------------------       ------------------------------------
> > > >   17:   call 115                             17:   call 115
> > > >   18:   r0 <<= 32                            18:   if r0 > 256 goto +6 <LBB0_4>
> > > >   19:   r1 = r0                              19:   r1 = 0 ll
> > > >   20:   r1 >>= 32                            21:   *(u64 *)(r1 + 0) = r0
> > > >   21:   if r1 > 256 goto +7 <LBB0_4>         22:   r6 = 0 ll
> > > >   22:   r0 s>>= 32                           24:   r6 += r0
> > > >   23:   r1 = 0 ll                          00000000000000c8 <LBB0_4>:
> > > >   25:   *(u64 *)(r1 + 0) = r0                25:   r1 = r6
> > > >   26:   r6 = 0 ll                            26:   r2 = 256
> > > >   28:   r6 += r0                             27:   r3 = 0 ll
> > > > 00000000000000e8 <LBB0_4>:                   29:   call 115
> > > >   29:   r1 = r6
> > > >   30:   r2 = 256
> > > >   31:   r3 = 0 ll
> > > >   33:   call 115
> > > >
> > > > 32-BIT (13 insns):                         32-BIT (13 insns):
> > > > ------------------------------------       ------------------------------------
> > > >   17:   call 115                             17:   call 115
> > > >   18:   r1 = r0                              18:   r1 = r0
> > > >   19:   r1 <<= 32                            19:   r1 <<= 32
> > > >   20:   r1 >>= 32                            20:   r1 >>= 32
> > > >   21:   if r1 > 256 goto +6 <LBB1_4>         21:   if r1 > 256 goto +6 <LBB1_4>
> > > >   22:   r2 = 0 ll                            22:   r2 = 0 ll
> > > >   24:   *(u32 *)(r2 + 0) = r0                24:   *(u32 *)(r2 + 0) = r0
> > > >   25:   r6 = 0 ll                            25:   r6 = 0 ll
> > > >   27:   r6 += r1                             27:   r6 += r1
> > > > 00000000000000e0 <LBB1_4>:                 00000000000000e0 <LBB1_4>:
> > > >   28:   r1 = r6                              28:   r1 = r6
> > > >   29:   r2 = 256                             29:   r2 = 256
> > > >   30:   r3 = 0 ll                            30:   r3 = 0 ll
> > > >   32:   call 115                             32:   call 115
> > > >
> > > > In NO-ALU32 mode, for the case of a 64-bit len variable, Clang generates far
> > > > superior code, as expected, eliminating the unnecessary bit shifts. For 32-bit
> > > > len, the code is identical.
> > >
> > > Right I can't think of any way clang can avoid it here. OTOH I fix this
> > > by enabling alu32 ;)
> > >
> > > >
> > > > So overall, only the ALU32 32-bit len case is more-or-less equivalent, and the
> > > > difference stems from an internal Clang decision rather than the compiler lacking
> > > > enough information about types.
> > > >
> > > > Case 2. Let's look at the simpler case of checking return result of BPF helper
> > > > for errors. The code is very simple:
> > > >
> > > >   long bla;
> > > >   if (bpf_probe_read_kernel(&bla, sizeof(bla), 0))
> > > >       return 1;
> > > >   else
> > > >       return 0;
> > > >
> > > > ALU32 + CHECK (9 insns)                    ALU32 + CHECK (9 insns)
> > > > ====================================       ====================================
> > > >   0:    r1 = r10                             0:    r1 = r10
> > > >   1:    r1 += -8                             1:    r1 += -8
> > > >   2:    w2 = 8                               2:    w2 = 8
> > > >   3:    r3 = 0                               3:    r3 = 0
> > > >   4:    call 113                             4:    call 113
> > > >   5:    w1 = w0                              5:    r1 = r0
> > > >   6:    w0 = 1                               6:    w0 = 1
> > > >   7:    if w1 != 0 goto +1 <LBB2_2>          7:    if r1 != 0 goto +1 <LBB2_2>
> > > >   8:    w0 = 0                               8:    w0 = 0
> > > > 0000000000000048 <LBB2_2>:                 0000000000000048 <LBB2_2>:
> > > >   9:    exit                                 9:    exit
> > > >
> > > > Almost identical code; the only difference is the use of a full-register
> > > > assignment (r1 = r0) vs half-registers (w1 = w0) in instruction #5. On 32-bit
> > > > architectures, the new BPF assembly might in theory be slightly less optimal. But
> > > > one can argue that's not a big issue, given that the use of full registers is
> > > > still prevalent (e.g., for parameter passing).
> > > >
> > > > NO-ALU32 + CHECK (11 insns)                NO-ALU32 + CHECK (9 insns)
> > > > ====================================       ====================================
> > > >   0:    r1 = r10                             0:    r1 = r10
> > > >   1:    r1 += -8                             1:    r1 += -8
> > > >   2:    r2 = 8                               2:    r2 = 8
> > > >   3:    r3 = 0                               3:    r3 = 0
> > > >   4:    call 113                             4:    call 113
> > > >   5:    r1 = r0                              5:    r1 = r0
> > > >   6:    r1 <<= 32                            6:    r0 = 1
> > > >   7:    r1 >>= 32                            7:    if r1 != 0 goto +1 <LBB2_2>
> > > >   8:    r0 = 1                               8:    r0 = 0
> > > >   9:    if r1 != 0 goto +1 <LBB2_2>        0000000000000048 <LBB2_2>:
> > > >  10:    r0 = 0                               9:    exit
> > > > 0000000000000058 <LBB2_2>:
> > > >  11:    exit
> > > >
> > > > NO-ALU32 is a clear improvement, getting rid of unnecessary zero-extension bit
> > > > shifts.
> > >
> > > It seems a win for the NO-ALU32 case, but for the +ALU32 case I think it's
> > > the same with latest clang, although I haven't tried yet. I was actually
> > > considering going the other way and avoiding always returning u64 on
> > > the other side. From a purely aesthetic point of view I prefer the
> > > int type because it seems more clear/standard C. I'm also not so interested
> > > in optimizing the no-alu32 case, but curious if there is a use case for
> > > that?
> >
> > My point was that this int -> long switch doesn't degrade ALU32 and
> > helps no-ALU32, and thus is good :)
>
> With the long vs int change I do see worse code when using the <0 check.
> I'm using the C function below, which I took from some real code and renamed
> the variables.
>
> int myfunc(void *a, void *b, void *c) {
>         void *payload = a;
>         int len;
>
>         len = probe_read_str(payload, 1000, a);
>         if (len < 0) return len;
>         if (len <= 1000) {
>                 payload += len;
>         }
>         len = probe_read_str(payload, 1000, b);
>         if (len <= 1000) {
>                 payload += len;
>         }
>         return 1;
> }
>
> Then here is the side-by-side of generated code, with +ALU32.
>
>   int BPF_FUNC(probe_read, ...                  long BPF_FUNC(probe_read, ...
> -------------------------------                ---------------------------------
>        0:       r6 = r2                         0:      r6 = r2
>        1:       r7 = r1                         1:      r7 = r1
>        2:       w2 = 1000                       2:      w2 = 1000
>        3:       r3 = r7                         3:      r3 = r7
>        4:       call 45                         4:      call 45
>        5:       if w0 s< 0 goto +9 <LBB0_4>     5:      r2 = r0
>        6:       w2 = w0                         6:      if w0 s< 0 goto +10 <LBB0_4>
>        7:       r1 = r7                         7:      r2 <<= 32
>        8:       r1 += r2                        8:      r2 s>>= 32
>        9:       if w0 s< 1001 goto +1 <LBB0_3>  9:      r1 = r7
>       10:       r1 = r7                        10:      r1 += r2
>       11:       w2 = 1000                      11:      if w0 s< 1001 goto +1 <LBB0_3>
>       12:       r3 = r6                        12:      r1 = r7
>       13:       call 45                        13:      w2 = 1000
>       14:       w0 = 1                         14:      r3 = r6
>       15:       exit                           15:      call 45
>                                                16:      w0 = 1
>                                                17:      exit
>
> So a couple of extra instructions, but more concerning, we created a
> <<=,s>>= pattern. I'll need to do some more tests, but my concern
> is that this could break the verifier for real programs we have. I guess
> it didn't in the selftests? Surely, this thread has at least
> pointed out some gaps in our test cases. I guess I _could_ make
> len a u64 type to remove the sext, but then a <0 check on a u64?!

I addressed <0 check above. As for <<=,s>>=, I wish Clang was a bit
smarter and just did w2 = w2 or something like that, given we just
checked that w0 is non-negative. But then again, I wouldn't do two ifs
and wouldn't use signed int for len.
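
For illustration, a minimal sketch of the alternative being alluded to, a
single unsigned comparison instead of two signed checks (buffer, section name,
and source pointer are illustrative, not taken from the actual programs in
this thread):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

#define MAX_LEN 256

char buf[MAX_LEN];

SEC("raw_tp/sys_enter")
int read_str_u64(void *ctx)
{
        /* u64 len: a negative error from the helper wraps to a huge
         * unsigned value, so one comparison rejects both errors and
         * overly long results, and the verifier sees len bounded to
         * [0, MAX_LEN] afterwards.
         */
        __u64 len;

        len = bpf_probe_read_kernel_str(buf, MAX_LEN, ctx /* illustrative source */);
        if (len > MAX_LEN)
                return 0;
        /* len is provably in bounds here and can be used as an offset */
        return 0;
}

char LICENSE[] SEC("license") = "GPL";

Since the helper's 64-bit result is used as-is, no truncation or extension
instructions are needed and the range information survives for the verifier.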

>
> >
> > Overall, long as a return type matches reality and BPF ABI
> > specification. BTW, one of the varlen programs from patch 2 doesn't
> > even validate successfully on latest kernel with latest Clang right
> > now, if helpers return int, even though it's completely correct code.
> > That's a real problem we have to deal with in few major BPF
> > applications right now, and we have to use inline assembly to enforce
> > Clang to do the right thing. A bunch of those problems are simply
> > avoided with correct return types for helpers.
>
> Do the real programs check <0? Did I miss something? I'll try
> applying your patch to our real code base and see what happens.

That would be great. Self-tests do work, but having more testing with
real-world application would certainly help as well.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH bpf-next 1/2] bpf: switch most helper return values from 32-bit int to 64-bit long
  2020-06-18 21:41       ` Andrii Nakryiko
@ 2020-06-18 23:30         ` John Fastabend
  2020-06-19  0:39           ` John Fastabend
  0 siblings, 1 reply; 18+ messages in thread
From: John Fastabend @ 2020-06-18 23:30 UTC (permalink / raw)
  To: Andrii Nakryiko, John Fastabend
  Cc: Andrii Nakryiko, bpf, Networking, Alexei Starovoitov,
	Daniel Borkmann, Kernel Team

Andrii Nakryiko wrote:
> On Thu, Jun 18, 2020 at 11:58 AM John Fastabend
> <john.fastabend@gmail.com> wrote:
> >
> > Andrii Nakryiko wrote:
> > > On Wed, Jun 17, 2020 at 11:49 PM John Fastabend
> > > <john.fastabend@gmail.com> wrote:
> > > >
> > > > Andrii Nakryiko wrote:
> > > > > Switch most of BPF helper definitions from returning int to long. These
> > > > > definitions are coming from comments in BPF UAPI header and are used to
> > > > > generate bpf_helper_defs.h (under libbpf) to be later included and used from
> > > > > BPF programs.
> > > > >
> > > > > In actual in-kernel implementation, all the helpers are defined as returning
> > > > > u64, but due to some historical reasons, most of them are actually defined as
> > > > > returning int in UAPI (usually, to return 0 on success, and negative value on
> > > > > error).
> > > >
> > > > Could we change the helpers side to return correct types now? Meaning if the
> > > > UAPI claims it's an int, let's actually return the int.
> > >
> > > I'm not sure how exactly you see this being done. BPF ABI dictates
> > > that the helper's result is passed in a full 64-bit r0 register. Are
> > > you suggesting that in addition to RET_ANYTHING we should add
> > > RET_ANYTHING32 and teach verifier that higher 32 bits of r0 are
> > > guaranteed to be zero? And then make helpers actually return 32-bit
> > > values without up-casting them to u64?
> >
> > Yes, this is what I was thinking, having a RET_ANYTHING32, but I would
> > assume the upper 32 bits could be anything, not zeroed. For +alu32
> > and a programmer using correct types, I would expect clang to generate
> > good code here and mostly not need to zero the upper bits.
> >
> > I think this discussion can be independent of your changes though, and
> > it's not at the top of my todo list, so I probably won't get to investigating
> > more for a while.
> 
> I'm confused. If the verifier doesn't make any assumptions about the upper
> 32 bits for RET_ANYTHING32, how is it different from RET_ANYTHING and
> today's logic? What you described is exactly what is happening when
> bpf_helper_defs.h has BPF helpers defined as returning int.
> 

Agreed. I recall it helping the 32-bit bounds on the verifier side
somewhere. But let's drop it; maybe it really is not useful. I'll go
try and recall the details later.

[...] Aggressively pruning

> >
> > Agreed. Sorry for the confusion on my side. Poked at this a bit more this
> > morning trying to see why I don't hit the same pattern when we have many
> > cases very similar to above.
> >
> > In your C code you never check for negative return codes? Oops, this
> > means you can walk backwards off the front of payload? This is probably
> > not valid either from program logic side and/or verifier will probably
> > yell. Commented where I think you want a <0 check here,
> 
> You are missing that I'm using an unsigned u64. So (s64)-1 ==
> (u64)0xFFFFFFFFFFFFFFFF. So negative errors are effectively turned
> into a too-large length and I filter them out with the same (len >
> MAX_SZ) check. This allows doing just one comparison instead of two,
> and also helps avoid some Clang optimizations that Yonghong is trying
> to undo right now (if (a > X && a < Y) turned into if (x < Y - X),
> with assembly that the verifier can't verify). So no bug there, very
> deliberate choice of types.

I caught it just after I sent the above ;) In our codebase we do need to
handle errors and truncated strings differently, so we really do need the
two conditions. I guess we could find some clever way around it, but
in practice on latest kernels we've not seen much trouble around
these with +alu32.
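
Concretely, the two-condition shape being described looks roughly like the
sketch below (names are illustrative; how a truncated string is reported by
the _str helpers should be double-checked against their documentation):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

#define MAX_LEN 256

static char *read_and_advance(char *payload, const void *src)
{
        long len;

        len = bpf_probe_read_kernel_str(payload, MAX_LEN, src);
        if (len < 0)            /* hard failure: report the error */
                return payload;
        if (len <= MAX_LEN)     /* bounded length: safe to advance */
                payload += len;
        /* a result that fills the whole buffer can additionally be
         * flagged as possibly truncated, separately from the error path
         */
        return payload;
}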

Interesting about the optimization; I've not seen that one yet.

[...]

> See above. In practice (it might be no-ALU32-only thing, don't know),
> doing two ifs is both less efficient and quite often leads to
> unverifiable code. Users have to do hacks to complicate control flow
> enough to prevent Clang from doing Hi/Lo combining. I learned a new
> inlined assembly trick recently to prevent this, but either way it's
> unpleasant and unnecessary.

In the end we also run on ancient kernels so have lots of tricks.

[...] more pruning

> > > My point was that this int -> long switch doesn't degrade ALU32 and
> > > helps no-ALU32, and thus is good :)
> >
> > With the long vs int change I do see worse code when using the <0 check.
> > I'm using the C function below, which I took from some real code and renamed
> > the variables.
> >
> > int myfunc(void *a, void *b, void *c) {
> >         void *payload = a;
> >         int len;
> >
> >         len = probe_read_str(payload, 1000, a);
> >         if (len < 0) return len;
> >         if (len <= 1000) {
> >                 payload += len;
> >         }
> >         len = probe_read_str(payload, 1000, b);
> >         if (len <= 1000) {
> >                 payload += len;
> >         }
> >         return 1;
> > }
> >
> > Then here is the side-by-side of generated code, with +ALU32.
> >
> >   int BPF_FUNC(probe_read, ...                  long BPF_FUNC(probe_read, ...
> > -------------------------------                ---------------------------------
> >        0:       r6 = r2                         0:      r6 = r2
> >        1:       r7 = r1                         1:      r7 = r1
> >        2:       w2 = 1000                       2:      w2 = 1000
> >        3:       r3 = r7                         3:      r3 = r7
> >        4:       call 45                         4:      call 45
> >        5:       if w0 s< 0 goto +9 <LBB0_4>     5:      r2 = r0
> >        6:       w2 = w0                         6:      if w0 s< 0 goto +10 <LBB0_4>
> >        7:       r1 = r7                         7:      r2 <<= 32
> >        8:       r1 += r2                        8:      r2 s>>= 32
> >        9:       if w0 s< 1001 goto +1 <LBB0_3>  9:      r1 = r7
> >       10:       r1 = r7                        10:      r1 += r2
> >       11:       w2 = 1000                      11:      if w0 s< 1001 goto +1 <LBB0_3>
> >       12:       r3 = r6                        12:      r1 = r7
> >       13:       call 45                        13:      w2 = 1000
> >       14:       w0 = 1                         14:      r3 = r6
> >       15:       exit                           15:      call 45
> >                                                16:      w0 = 1
> >                                                17:      exit
> >
> > So a couple of extra instructions, but more concerning, we created a
> > <<=,s>>= pattern. I'll need to do some more tests, but my concern
> > is that this could break the verifier for real programs we have. I guess
> > it didn't in the selftests? Surely, this thread has at least
> > pointed out some gaps in our test cases. I guess I _could_ make
> > len a u64 type to remove the sext, but then a <0 check on a u64?!
> 
> I addressed <0 check above. As for <<=,s>>=, I wish Clang was a bit
> smarter and just did w2 = w2 or something like that, given we just
> checked that w0 is non-negative. But then again, I wouldn't do two ifs
> and wouldn't use signed int for len.

It is smart enough once you get all the types aligned. So after pulling
in the int->long change, ideally we would change our codebase to use long types
as well. Maybe we should modify the tests in selftests as well; OTOH
it's nice to test what happens when folks leave the return types as int.

> 
> >
> > >
> > > Overall, long as a return type matches reality and BPF ABI
> > > specification. BTW, one of the varlen programs from patch 2 doesn't
> > > even validate successfully on latest kernel with latest Clang right
> > > now, if helpers return int, even though it's completely correct code.
> > > That's a real problem we have to deal with in few major BPF
> > > applications right now, and we have to use inline assembly to enforce
> > > Clang to do the right thing. A bunch of those problems are simply
> > > avoided with correct return types for helpers.
> >
> > Do the real programs check <0? Did I miss something? I'll try
> > applying your patch to our real code base and see what happens.
> 
> That would be great. Self-tests do work, but having more testing with
> real-world application would certainly help as well.

Thanks for all the follow up.

I ran the change through some CI on my side and it passed so I can
complain about a few shifts here and there or just update my code or
just not change the return types on my side but I'm convinced it's OK
in most cases and helps in some so...

Acked-by: John Fastabend <john.fastabend@gmail.com>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH bpf-next 1/2] bpf: switch most helper return values from 32-bit int to 64-bit long
  2020-06-18 23:30         ` John Fastabend
@ 2020-06-19  0:39           ` John Fastabend
  2020-06-19 13:08             ` Daniel Borkmann
  0 siblings, 1 reply; 18+ messages in thread
From: John Fastabend @ 2020-06-19  0:39 UTC (permalink / raw)
  To: John Fastabend, Andrii Nakryiko, John Fastabend
  Cc: Andrii Nakryiko, bpf, Networking, Alexei Starovoitov,
	Daniel Borkmann, Kernel Team

John Fastabend wrote:
> Andrii Nakryiko wrote:
> > On Thu, Jun 18, 2020 at 11:58 AM John Fastabend
> > <john.fastabend@gmail.com> wrote:

[...]

> > That would be great. Self-tests do work, but having more testing with
> > real-world application would certainly help as well.
> 
> Thanks for all the follow up.
> 
> I ran the change through some CI on my side and it passed so I can
> complain about a few shifts here and there or just update my code or
> just not change the return types on my side but I'm convinced it's OK
> in most cases and helps in some so...
> 
> Acked-by: John Fastabend <john.fastabend@gmail.com>

I'll follow this up with a few more selftests to capture a couple of our
patterns. These changes are subtle and I worry a bit that additional
<<,s>> pattern could have the potential to break something.

Another one we didn't discuss that I found in our code base is feeding
the output of a probe_* helper back into the size field (after some
alu ops) of subsequent probe_* call. Unfortunately, the tests I ran
today didn't cover that case.
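
Roughly, that feedback pattern looks like the sketch below (the helpers are
the real bpf_probe_read_kernel{,_str} ones; the function and buffer names are
illustrative):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

#define MAX_LEN 256

static long copy_pair(char *buf1, char *buf2, const void *src1, const void *src2)
{
        long len;

        len = bpf_probe_read_kernel_str(buf1, MAX_LEN, src1);
        /* the verifier must see len bounded to [0, MAX_LEN] at this point;
         * stray zero/sign-extension shifts can destroy that range and make
         * the next call unverifiable
         */
        if (len < 0 || len > MAX_LEN)
                return 0;
        /* feed the first result back in as the size of the next read */
        return bpf_probe_read_kernel(buf2, len, src2);
}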

I'll put it on the list tomorrow and encode these in selftests. I'll
let the maintainers decide if they want to wait for those or not.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH bpf-next 1/2] bpf: switch most helper return values from 32-bit int to 64-bit long
  2020-06-19  0:39           ` John Fastabend
@ 2020-06-19 13:08             ` Daniel Borkmann
  2020-06-19 18:41               ` Andrii Nakryiko
  0 siblings, 1 reply; 18+ messages in thread
From: Daniel Borkmann @ 2020-06-19 13:08 UTC (permalink / raw)
  To: John Fastabend, Andrii Nakryiko
  Cc: Andrii Nakryiko, bpf, Networking, Alexei Starovoitov, Kernel Team

On 6/19/20 2:39 AM, John Fastabend wrote:
> John Fastabend wrote:
>> Andrii Nakryiko wrote:
>>> On Thu, Jun 18, 2020 at 11:58 AM John Fastabend
>>> <john.fastabend@gmail.com> wrote:
> 
> [...]
> 
>>> That would be great. Self-tests do work, but having more testing with
>>> real-world application would certainly help as well.
>>
>> Thanks for all the follow up.
>>
>> I ran the change through some CI on my side and it passed so I can
>> complain about a few shifts here and there or just update my code or
>> just not change the return types on my side but I'm convinced it's OK
>> in most cases and helps in some so...
>>
>> Acked-by: John Fastabend <john.fastabend@gmail.com>
> 
> I'll follow this up with a few more selftests to capture a couple of our
> patterns. These changes are subtle and I worry a bit that additional
> <<,s>> pattern could have the potential to break something.
> 
> Another one we didn't discuss that I found in our code base is feeding
> the output of a probe_* helper back into the size field (after some
> alu ops) of subsequent probe_* call. Unfortunately, the tests I ran
> today didn't cover that case.
> 
> I'll put it on the list tomorrow and encode these in selftests. I'll
> let the maintainers decide if they want to wait for those or not.

Given potential fragility on verifier side, my preference would be that we
have the known variations all covered in selftests before moving forward in
order to make sure they don't break in any way. Back in [0] I've seen mostly
similar cases in the way John mentioned in other projects, iirc, sysdig was
another one. If both of you could hack up the remaining cases we need to
cover and then submit a combined series, that would be great. I don't think
we need to rush this optimization w/o necessary selftests.

Thanks everyone,
Daniel

   [0] https://lore.kernel.org/bpf/20200421125822.14073-1-daniel@iogearbox.net/

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH bpf-next 1/2] bpf: switch most helper return values from 32-bit int to 64-bit long
  2020-06-19 13:08             ` Daniel Borkmann
@ 2020-06-19 18:41               ` Andrii Nakryiko
  2020-06-19 22:21                 ` Daniel Borkmann
  0 siblings, 1 reply; 18+ messages in thread
From: Andrii Nakryiko @ 2020-06-19 18:41 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: John Fastabend, Andrii Nakryiko, bpf, Networking,
	Alexei Starovoitov, Kernel Team

On Fri, Jun 19, 2020 at 6:08 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> On 6/19/20 2:39 AM, John Fastabend wrote:
> > John Fastabend wrote:
> >> Andrii Nakryiko wrote:
> >>> On Thu, Jun 18, 2020 at 11:58 AM John Fastabend
> >>> <john.fastabend@gmail.com> wrote:
> >
> > [...]
> >
> >>> That would be great. Self-tests do work, but having more testing with
> >>> real-world application would certainly help as well.
> >>
> >> Thanks for all the follow up.
> >>
> >> I ran the change through some CI on my side and it passed so I can
> >> complain about a few shifts here and there or just update my code or
> >> just not change the return types on my side but I'm convinced it's OK
> >> in most cases and helps in some so...
> >>
> >> Acked-by: John Fastabend <john.fastabend@gmail.com>
> >
> > I'll follow this up with a few more selftests to capture a couple of our
> > patterns. These changes are subtle and I worry a bit that additional
> > <<,s>> pattern could have the potential to break something.
> >
> > Another one we didn't discuss that I found in our code base is feeding
> > the output of a probe_* helper back into the size field (after some
> > alu ops) of subsequent probe_* call. Unfortunately, the tests I ran
> > today didn't cover that case.
> >
> > I'll put it on the list tomorrow and encode these in selftests. I'll
> > let the maintainers decide if they want to wait for those or not.
>
> Given potential fragility on verifier side, my preference would be that we
> have the known variations all covered in selftests before moving forward in
> order to make sure they don't break in any way. Back in [0] I've seen mostly
> similar cases in the way John mentioned in other projects, iirc, sysdig was
> another one. If both of you could hack up the remaining cases we need to
> cover and then submit a combined series, that would be great. I don't think
> we need to rush this optimization w/o necessary selftests.

There is no rush, but there is also no reason to delay it. I'd rather
land it early in the libbpf release cycle and let people try it in
their prod environments, for those concerned about such code patterns.

I don't have a list of all the patterns that we might need to test.
Going through all open-source BPF source code to identify possible
patterns and then coding them up in minimal selftests is a bit too
much for me, honestly. Additionally, some of those patterns will most
probably be broken in no-ALU32 and making them work with assembly and
other clever tricks is actually where the majority of time usually
goes. Also, simple selftests might not actually trigger pathological
codegen cases (because in a lot of cases register spill/pressure
triggers different codegen patterns). So I just don't believe we can
have full peace of mind, regardless of how many selftests we add.
This test_varlen selftest is a simplification of a production code
we've had for a long while. We never bothered to contribute it as a
selftest before, which I'd say is our fault as users of BPF. Anyone
interested in ensuring regressions get detected for the way they write
BPF code should distill them into selftests and contribute to our
test suite (like we did with PyPerf, Strobemeta, and how Jakub
Sitnicki did recently with his program).

So sure, maintainers might decide to not land this because of
potential regressions, but I tried to do my best to explain why there
shouldn't really be regressions (after all, int -> long reflects
*reality*, where everything is u64/s64 on return from BPF helper),
apart from actually testing for two patterns I knew about.

After all, even in case of regression, doing `int bla =
(int)bpf_helper_whatever(...);` is in theory equivalent to what we had
before, so it's an easy fix. Reality might require an extra compiler
barrier after that to force Clang to emit casting instructions sooner,
but that's another story.
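
As an illustration of what such a cast-plus-barrier workaround could look like
(this is an assumption about its shape, not necessarily the exact trick meant
here; the empty inline asm is a standard compiler barrier):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

#define MAX_LEN 256

static int read_with_int_len(char *dst, const void *src)
{
        int len = (int)bpf_probe_read_kernel_str(dst, MAX_LEN, src);

        /* force Clang to materialize the truncated 32-bit value here
         * instead of sinking the conversion past the checks below
         */
        asm volatile("" : "+r"(len));

        if (len < 0)
                return 0;
        return len;
}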

>
> Thanks everyone,
> Daniel
>
>    [0] https://lore.kernel.org/bpf/20200421125822.14073-1-daniel@iogearbox.net/

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH bpf-next 1/2] bpf: switch most helper return values from 32-bit int to 64-bit long
  2020-06-19 18:41               ` Andrii Nakryiko
@ 2020-06-19 22:21                 ` Daniel Borkmann
  2020-06-19 23:11                   ` Andrii Nakryiko
  0 siblings, 1 reply; 18+ messages in thread
From: Daniel Borkmann @ 2020-06-19 22:21 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: John Fastabend, Andrii Nakryiko, bpf, Networking,
	Alexei Starovoitov, Kernel Team

On 6/19/20 8:41 PM, Andrii Nakryiko wrote:
> On Fri, Jun 19, 2020 at 6:08 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
>> On 6/19/20 2:39 AM, John Fastabend wrote:
>>> John Fastabend wrote:
>>>> Andrii Nakryiko wrote:
>>>>> On Thu, Jun 18, 2020 at 11:58 AM John Fastabend
>>>>> <john.fastabend@gmail.com> wrote:
>>>
>>> [...]
>>>
>>>>> That would be great. Self-tests do work, but having more testing with
>>>>> real-world application would certainly help as well.
>>>>
>>>> Thanks for all the follow up.
>>>>
>>>> I ran the change through some CI on my side and it passed so I can
>>>> complain about a few shifts here and there or just update my code or
>>>> just not change the return types on my side but I'm convinced it's OK
>>>> in most cases and helps in some so...
>>>>
>>>> Acked-by: John Fastabend <john.fastabend@gmail.com>
>>>
>>> I'll follow this up with a few more selftests to capture a couple of our
>>> patterns. These changes are subtle and I worry a bit that additional
>>> <<,s>> pattern could have the potential to break something.
>>>
>>> Another one we didn't discuss that I found in our code base is feeding
>>> the output of a probe_* helper back into the size field (after some
>>> alu ops) of subsequent probe_* call. Unfortunately, the tests I ran
>>> today didn't cover that case.
>>>
>>> I'll put it on the list tomorrow and encode these in selftests. I'll
>>> let the maintainers decide if they want to wait for those or not.
>>
>> Given potential fragility on verifier side, my preference would be that we
>> have the known variations all covered in selftests before moving forward in
>> order to make sure they don't break in any way. Back in [0] I've seen mostly
>> similar cases in the way John mentioned in other projects, iirc, sysdig was
>> another one. If both of you could hack up the remaining cases we need to
>> cover and then submit a combined series, that would be great. I don't think
>> we need to rush this optimization w/o necessary selftests.
> 
> There is no rush, but there is also no reason to delay it. I'd rather
> land it early in the libbpf release cycle and let people try it in
> their prod environments, for those concerned about such code patterns.

Andrii, define 'delay'. John mentioned above to put together few more
selftests today so that there is better coverage at least, why is that
an 'issue'? I'm not sure how you read 'late in release cycle' out of it,
it's still as early. The unsigned optimization for len <= MAX_LEN is
reasonable and makes sense, but it's still one [specific] variant only.

> I don't have a list of all the patterns that we might need to test.
> Going through all open-source BPF source code to identify possible
> patterns and then coding them up in minimal selftests is a bit too
> much for me, honestly.

I think we're probably talking past each other. John wrote above:

 >>> I'll follow this up with a few more selftests to capture a couple of our
 >>> patterns. These changes are subtle and I worry a bit that additional
 >>> <<,s>> pattern could have the potential to break something.

So submitting this as a full series makes absolute sense to me; there's maybe
not perfect but certainly more confidence that other patterns, where the shifts
optimized out in one case then appear in another, are tested on a best-effort
basis and run through our kselftest suite.

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH bpf-next 1/2] bpf: switch most helper return values from 32-bit int to 64-bit long
  2020-06-19 22:21                 ` Daniel Borkmann
@ 2020-06-19 23:11                   ` Andrii Nakryiko
  2020-06-22 18:30                     ` John Fastabend
  0 siblings, 1 reply; 18+ messages in thread
From: Andrii Nakryiko @ 2020-06-19 23:11 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: John Fastabend, Andrii Nakryiko, bpf, Networking,
	Alexei Starovoitov, Kernel Team

On Fri, Jun 19, 2020 at 3:21 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> On 6/19/20 8:41 PM, Andrii Nakryiko wrote:
> > On Fri, Jun 19, 2020 at 6:08 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
> >> On 6/19/20 2:39 AM, John Fastabend wrote:
> >>> John Fastabend wrote:
> >>>> Andrii Nakryiko wrote:
> >>>>> On Thu, Jun 18, 2020 at 11:58 AM John Fastabend
> >>>>> <john.fastabend@gmail.com> wrote:
> >>>
> >>> [...]
> >>>
> >>>>> That would be great. Self-tests do work, but having more testing with
> >>>>> real-world application would certainly help as well.
> >>>>
> >>>> Thanks for all the follow up.
> >>>>
> >>>> I ran the change through some CI on my side and it passed so I can
> >>>> complain about a few shifts here and there or just update my code or
> > >> just not change the return types on my side but I'm convinced it's OK
> >>>> in most cases and helps in some so...
> >>>>
> >>>> Acked-by: John Fastabend <john.fastabend@gmail.com>
> >>>
> >>> I'll follow this up with a few more selftests to capture a couple of our
> >>> patterns. These changes are subtle and I worry a bit that additional
> >>> <<,s>> pattern could have the potential to break something.
> >>>
> >>> Another one we didn't discuss that I found in our code base is feeding
> >>> the output of a probe_* helper back into the size field (after some
> >>> alu ops) of subsequent probe_* call. Unfortunately, the tests I ran
> >>> today didn't cover that case.
> >>>
> >>> I'll put it on the list tomorrow and encode these in selftests. I'll
> >>> let the maintainers decide if they want to wait for those or not.
> >>
> >> Given potential fragility on verifier side, my preference would be that we
> >> have the known variations all covered in selftests before moving forward in
> >> order to make sure they don't break in any way. Back in [0] I've seen mostly
> >> similar cases in the way John mentioned in other projects, iirc, sysdig was
> >> another one. If both of you could hack up the remaining cases we need to
> >> cover and then submit a combined series, that would be great. I don't think
> >> we need to rush this optimization w/o necessary selftests.
> >
> > There is no rush, but there is also no reason to delay it. I'd rather
> > land it early in the libbpf release cycle and let people try it in
> > their prod environments, for those concerned about such code patterns.
>
> Andrii, define 'delay'. John mentioned above to put together few more
> selftests today so that there is better coverage at least, why is that
> an 'issue'? I'm not sure how you read 'late in release cycle' out of it,
> it's still as early. The unsigned optimization for len <= MAX_LEN is
> reasonable and makes sense, but it's still one [specific] variant only.

I'm totally fine waiting for John's tests, but I read your reply as a
request to go dig up some more examples from sysdig and other
projects, which I don't think I can commit to. So if it's just about
waiting for John's examples, that's fine and sorry for
misunderstanding.

>
> > I don't have a list of all the patterns that we might need to test.
> > Going through all open-source BPF source code to identify possible
> > patterns and then coding them up in minimal selftests is a bit too
> > much for me, honestly.
>
> I think we're probably talking past each other. John wrote above:

Yep, sorry, I assumed more general context, not specifically John's reply.

>
>  >>> I'll follow this up with a few more selftests to capture a couple of our
>  >>> patterns. These changes are subtle and I worry a bit that additional
>  >>> <<,s>> pattern could have the potential to break something.
>
> So submitting this as a full series makes absolute sense to me; there's maybe
> not perfect but certainly more confidence that other patterns, where the shifts
> optimized out in one case then appear in another, are tested on a best-effort
> basis and run through our kselftest suite.
>
> Thanks,
> Daniel

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH bpf-next 1/2] bpf: switch most helper return values from 32-bit int to 64-bit long
  2020-06-19 23:11                   ` Andrii Nakryiko
@ 2020-06-22 18:30                     ` John Fastabend
  2020-06-22 19:16                       ` Andrii Nakryiko
  0 siblings, 1 reply; 18+ messages in thread
From: John Fastabend @ 2020-06-22 18:30 UTC (permalink / raw)
  To: Andrii Nakryiko, Daniel Borkmann
  Cc: John Fastabend, Andrii Nakryiko, bpf, Networking,
	Alexei Starovoitov, Kernel Team

Andrii Nakryiko wrote:
> On Fri, Jun 19, 2020 at 3:21 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
> >
> > On 6/19/20 8:41 PM, Andrii Nakryiko wrote:
> > > On Fri, Jun 19, 2020 at 6:08 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
> > >> On 6/19/20 2:39 AM, John Fastabend wrote:
> > >>> John Fastabend wrote:
> > >>>> Andrii Nakryiko wrote:
> > >>>>> On Thu, Jun 18, 2020 at 11:58 AM John Fastabend
> > >>>>> <john.fastabend@gmail.com> wrote:
> > >>>
> > >>> [...]
> > >>>
> > >>>>> That would be great. Self-tests do work, but having more testing with
> > >>>>> real-world application would certainly help as well.
> > >>>>
> > >>>> Thanks for all the follow up.
> > >>>>
> > >>>> I ran the change through some CI on my side and it passed so I can
> > >>>> complain about a few shifts here and there or just update my code or
> > >>>> just not change the return types on my side but I'm convinced it's OK
> > >>>> in most cases and helps in some so...
> > >>>>
> > >>>> Acked-by: John Fastabend <john.fastabend@gmail.com>
> > >>>
> > >>> I'll follow this up with a few more selftests to capture a couple of our
> > >>> patterns. These changes are subtle and I worry a bit that additional
> > >>> <<,s>> pattern could have the potential to break something.
> > >>>
> > >>> Another one we didn't discuss that I found in our code base is feeding
> > >>> the output of a probe_* helper back into the size field (after some
> > >>> alu ops) of subsequent probe_* call. Unfortunately, the tests I ran
> > >>> today didn't cover that case.
> > >>>
> > >>> I'll put it on the list tomorrow and encode these in selftests. I'll
> > >>> let the maintainers decide if they want to wait for those or not.
> > >>
> > >> Given potential fragility on verifier side, my preference would be that we
> > >> have the known variations all covered in selftests before moving forward in
> > >> order to make sure they don't break in any way. Back in [0] I've seen mostly
> > >> similar cases in the way John mentioned in other projects, iirc, sysdig was
> > >> another one. If both of you could hack up the remaining cases we need to
> > >> cover and then submit a combined series, that would be great. I don't think
> > >> we need to rush this optimization w/o necessary selftests.
> > >
> > > There is no rush, but there is also no reason to delay it. I'd rather
> > > land it early in the libbpf release cycle and let people try it in
> > > their prod environments, for those concerned about such code patterns.
> >
> > Andrii, define 'delay'. John mentioned above to put together few more
> > selftests today so that there is better coverage at least, why is that
> > an 'issue'? I'm not sure how you read 'late in release cycle' out of it,
> > it's still as early. The unsigned optimization for len <= MAX_LEN is
> > reasonable and makes sense, but it's still one [specific] variant only.
> 
> I'm totally fine waiting for John's tests, but I read your reply as a
> request to go dig up some more examples from sysdig and other
> projects, which I don't think I can commit to. So if it's just about
> waiting for John's examples, that's fine and sorry for
> misunderstanding.
> 
> >
> > > I don't have a list of all the patterns that we might need to test.
> > > Going through all open-source BPF source code to identify possible
> > > patterns and then coding them up in minimal selftests is a bit too
> > > much for me, honestly.
> >
> > I think we're probably talking past each other. John wrote above:
> 
> Yep, sorry, I assumed more general context, not specifically John's reply.
> 
> >
> >  >>> I'll follow this up with a few more selftests to capture a couple of our
> >  >>> patterns. These changes are subtle and I worry a bit that additional
> >  >>> <<,s>> pattern could have the potential to break something.
> >
> > > So submitting this as a full series makes absolute sense to me; there's maybe
> > > not perfect but certainly more confidence that other patterns, where the shifts
> > > optimized out in one case then appear in another, are tested on a best-effort
> > > basis and run through our kselftest suite.
> >
> > Thanks,
> > Daniel

Hi Andrii,

How about adding this on top of your selftests patch? It will cover the
cases we have now with the 'len < 0' check vs 'len > MAX'. I had another case
where we feed the output 'len' back into other probes, but this requires more
hackery than I'm willing to encode in a selftest. There seem to be
some better options to improve the clang side + verifier and get a clean
working version in the future.

On the clang/verifier side, though, I think the root cause is that we do a poor
job of tracking the <<= 32, s>>= 32 case. How about we add a sign-extend instruction
to BPF? Then clang can emit BPF_SEXT_32_64, the verifier can correctly
account for it, and finally backends can generate better code. This
will help here, but also any other place we hit the sext codegen.

Alexei, Yonghong, any opinions for/against adding new insn? I think we
talked about doing it earlier.
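
For context, a small sketch of where that sign extension comes from at the C
level (names are illustrative); the final pointer adjustment is what currently
compiles to the `<<= 32; s>>= 32` pair and is what a BPF_SEXT_32_64-style
instruction would express in a single insn the verifier can model:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

#define MAX_LEN 256

static char *append_str(char *payload, const void *src)
{
        long ret = bpf_probe_read_kernel_str(payload, MAX_LEN, src);
        int len = ret;          /* 64-bit helper result truncated to int */

        if (len < 0)
                return payload;
        /* int added to a pointer: 32->64 sign extension, today emitted
         * as the left-shift/arithmetic-right-shift pair
         */
        return payload + len;
}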

---

selftests/bpf: add variable-length data concat pattern less than test

Extend the original variable-length tests with a case to catch a common
existing pattern of testing for < 0 for errors. Note that because the
verifier also tracks upper bounds and we know it cannot be greater
than MAX_LEN here, we can skip the upper bound check.

In ALU64-enabled compilation, converting from long->int return types
in probe helpers results in an extra instruction pattern, <<= 32, s>>= 32.
The trade-off is that the non-ALU64 case works. If you really care about
every extra insn (XDP case?) then you probably should be using the original
int type.

In addition adding a sext insn to bpf might help the verifier in the
general case to avoid these types of tricks.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 tools/testing/selftests/bpf/prog_tests/varlen.c |   13 +++
 tools/testing/selftests/bpf/progs/test_varlen.c |   90 ++++++++++++++++++++++-
 2 files changed, 99 insertions(+), 4 deletions(-)

diff --git a/tools/testing/selftests/bpf/prog_tests/varlen.c b/tools/testing/selftests/bpf/prog_tests/varlen.c
index 7533565..e1499f7 100644
--- a/tools/testing/selftests/bpf/prog_tests/varlen.c
+++ b/tools/testing/selftests/bpf/prog_tests/varlen.c
@@ -51,6 +51,19 @@ void test_varlen(void)
 	CHECK_VAL(data->total2, size1 + size2);
 	CHECK(memcmp(data->payload2, exp_str, size1 + size2), "content_check",
 	      "doesn't match!");
+
+	CHECK_VAL(data->payload3_len1, size1);
+	CHECK_VAL(data->payload3_len2, size2);
+	CHECK_VAL(data->total3, size1 + size2);
+	CHECK(memcmp(data->payload3, exp_str, size1 + size2), "content_check",
+	      "doesn't match!");
+
+	CHECK_VAL(data->payload4_len1, size1);
+	CHECK_VAL(data->payload4_len2, size2);
+	CHECK_VAL(data->total4, size1 + size2);
+	CHECK(memcmp(data->payload4, exp_str, size1 + size2), "content_check",
+	      "doesn't match!");
+
 cleanup:
 	test_varlen__destroy(skel);
 }
diff --git a/tools/testing/selftests/bpf/progs/test_varlen.c b/tools/testing/selftests/bpf/progs/test_varlen.c
index 0969185..dfe3a32 100644
--- a/tools/testing/selftests/bpf/progs/test_varlen.c
+++ b/tools/testing/selftests/bpf/progs/test_varlen.c
@@ -26,8 +26,18 @@ int payload2_len2 = -1;
 int total2 = -1;
 char payload2[MAX_LEN + MAX_LEN] = { 1 };
 
-SEC("raw_tp/sys_enter")
-int handler64(void *regs)
+int payload3_len1 = -1;
+int payload3_len2 = -1;
+int total3 = -1;
+char payload3[MAX_LEN + MAX_LEN] = { 1 };
+
+int payload4_len1 = -1;
+int payload4_len2 = -1;
+int total4 = -1;
+char payload4[MAX_LEN + MAX_LEN] = { 1 };
+
+static __always_inline
+int handler64_gt(void *regs)
 {
 	int pid = bpf_get_current_pid_tgid() >> 32;
 	void *payload = payload1;
@@ -54,8 +64,44 @@ int handler64(void *regs)
 	return 0;
 }
 
-SEC("tp_btf/sys_enter")
-int handler32(void *regs)
+static __always_inline
+int handler64_lt(void *regs)
+{
+	int pid = bpf_get_current_pid_tgid() >> 32;
+	void *payload = payload3;
+	long len;
+
+	/* ignore irrelevant invocations */
+	if (test_pid != pid || !capture)
+		return 0;
+
+	len = bpf_probe_read_kernel_str(payload, MAX_LEN, &buf_in1[0]);
+	if (len < 0)
+		goto next_lt_long;
+	payload += len;
+	payload3_len1 = len;
+next_lt_long:
+	len = bpf_probe_read_kernel_str(payload, MAX_LEN, &buf_in2[0]);
+	if (len < 0)
+		goto done_lt_long;
+	payload += len;
+	payload3_len2 = len;
+done_lt_long:
+	total3 = payload - (void *)payload3;
+
+	return 0;
+}
+
+SEC("raw_tp/sys_enter")
+int handler64(void *regs)
+{
+	handler64_gt(regs);
+	handler64_lt(regs);
+	return 0;
+}
+
+static __always_inline
+int handler32_gt(void *regs)
 {
 	int pid = bpf_get_current_pid_tgid() >> 32;
 	void *payload = payload2;
@@ -82,6 +128,42 @@ int handler32(void *regs)
 	return 0;
 }
 
+static __always_inline
+int handler32_lt(void *regs)
+{
+	int pid = bpf_get_current_pid_tgid() >> 32;
+	void *payload = payload4;
+	int len;
+
+	/* ignore irrelevant invocations */
+	if (test_pid != pid || !capture)
+		return 0;
+
+	len = bpf_probe_read_kernel_str(payload, MAX_LEN, &buf_in1[0]);
+	if (len < 0)
+		goto next_lt_int;
+	payload += len;
+	payload4_len1 = len;
+next_lt_int:
+	len = bpf_probe_read_kernel_str(payload, MAX_LEN, &buf_in2[0]);
+	if (len < 0)
+		goto done_lt_int;
+	payload += len;
+	payload4_len2 = len;
+done_lt_int:
+	total4 = payload - (void *)payload4;
+
+	return 0;
+}
+
+SEC("tp_btf/sys_enter")
+int handler32(void *regs)
+{
+	handler32_gt(regs);
+	handler32_lt(regs);
+	return 0;
+}
+
 SEC("tp_btf/sys_exit")
 int handler_exit(void *regs)
 {

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH bpf-next 1/2] bpf: switch most helper return values from 32-bit int to 64-bit long
  2020-06-22 18:30                     ` John Fastabend
@ 2020-06-22 19:16                       ` Andrii Nakryiko
  2020-06-22 19:42                         ` John Fastabend
  0 siblings, 1 reply; 18+ messages in thread
From: Andrii Nakryiko @ 2020-06-22 19:16 UTC (permalink / raw)
  To: John Fastabend
  Cc: Daniel Borkmann, Andrii Nakryiko, bpf, Networking,
	Alexei Starovoitov, Kernel Team

On Mon, Jun 22, 2020 at 11:30 AM John Fastabend
<john.fastabend@gmail.com> wrote:
>
> Andrii Nakryiko wrote:
> > On Fri, Jun 19, 2020 at 3:21 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
> > >
> > > On 6/19/20 8:41 PM, Andrii Nakryiko wrote:
> > > > On Fri, Jun 19, 2020 at 6:08 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
> > > >> On 6/19/20 2:39 AM, John Fastabend wrote:
> > > >>> John Fastabend wrote:
> > > >>>> Andrii Nakryiko wrote:
> > > >>>>> On Thu, Jun 18, 2020 at 11:58 AM John Fastabend
> > > >>>>> <john.fastabend@gmail.com> wrote:
> > > >>>
> > > >>> [...]
> > > >>>
> > > >>>>> That would be great. Self-tests do work, but having more testing with
> > > >>>>> real-world application would certainly help as well.
> > > >>>>
> > > >>>> Thanks for all the follow up.
> > > >>>>
> > > >>>> I ran the change through some CI on my side and it passed so I can
> > > >>>> complain about a few shifts here and there or just update my code or
> > > >>>> just not change the return types on my side but I'm convinced it's OK
> > > >>>> in most cases and helps in some so...
> > > >>>>
> > > >>>> Acked-by: John Fastabend <john.fastabend@gmail.com>
> > > >>>
> > > >>> I'll follow this up with a few more selftests to capture a couple of our
> > > >>> patterns. These changes are subtle and I worry a bit that additional
> > > >>> <<,s>> pattern could have the potential to break something.
> > > >>>
> > > >>> Another one we didn't discuss that I found in our code base is feeding
> > > >>> the output of a probe_* helper back into the size field (after some
> > > >>> alu ops) of subsequent probe_* call. Unfortunately, the tests I ran
> > > >>> today didn't cover that case.
> > > >>>
> > > >>> I'll put it on the list tomorrow and encode these in selftests. I'll
> > > >>> let the mainainers decide if they want to wait for those or not.
> > > >>
> > > >> Given potential fragility on verifier side, my preference would be that we
> > > >> have the known variations all covered in selftests before moving forward in
> > > >> order to make sure they don't break in any way. Back in [0] I've seen mostly
> > > >> similar cases in the way John mentioned in other projects, iirc, sysdig was
> > > >> another one. If both of you could hack up the remaining cases we need to
> > > >> cover and then submit a combined series, that would be great. I don't think
> > > >> we need to rush this optimization w/o necessary selftests.
> > > >
> > > > There is no rush, but there is also no reason to delay it. I'd rather
> > > > land it early in the libbpf release cycle and let people try it in
> > > > their prod environments, for those concerned about such code patterns.
> > >
> > > Andrii, define 'delay'. John mentioned above to put together few more
> > > selftests today so that there is better coverage at least, why is that
> > > an 'issue'? I'm not sure how you read 'late in release cycle' out of it,
> > > it's still as early. The unsigned optimization for len <= MAX_LEN is
> > > reasonable and makes sense, but it's still one [specific] variant only.
> >
> > I'm totally fine waiting for John's tests, but I read your reply as a
> > request to go dig up some more examples from sysdig and other
> > projects, which I don't think I can commit to. So if it's just about
> > waiting for John's examples, that's fine and sorry for
> > misunderstanding.
> >
> > >
> > > > I don't have a list of all the patterns that we might need to test.
> > > > Going through all open-source BPF source code to identify possible
> > > > patterns and then coding them up in minimal selftests is a bit too
> > > > much for me, honestly.
> > >
> > > I think we're probably talking past each other. John wrote above:
> >
> > Yep, sorry, I assumed more general context, not specifically John's reply.
> >
> > >
> > >  >>> I'll follow this up with a few more selftests to capture a couple of our
> > >  >>> patterns. These changes are subtle and I worry a bit that additional
> > >  >>> <<,s>> pattern could have the potential to break something.
> > >
> > > So submitting this as a full series makes absolute sense to me; there's maybe
> > > not perfect but certainly more confidence that other patterns, where the shifts
> > > optimized out in one case then appear in another, are tested on a best-effort
> > > basis and run through our kselftest suite.
> > >
> > > Thanks,
> > > Daniel
>
> Hi Andrii,
>
> How about adding this on top of your selftests patch? It will cover the
> cases we have now with the 'len < 0' check vs 'len > MAX'. I had another case
> where we feed the output 'len' back into other probes, but this requires more
> hackery than I'm willing to encode in a selftest. There seem to be
> some better options to improve the clang side + verifier and get a clean
> working version in the future.

Ok, sounds good. I'll add it as an extra patch. Not sure about all the
conventions with preserving Signed-off-by. Would just keeping your
Signed-off-by be ok? If you don't mind, though, I'll keep each
handler{32,64}_{gt,lt} as 4 independent BPF programs, so that if any
of them is unverifiable, it's easier to inspect the BPF assembly. Yell
if you don't like this.

>
> On the clang/verifier side, though, I think the root cause is that we do a poor
> job of tracking the <<= 32, s>>= 32 case. How about we add a sign-extend instruction
> to BPF? Then clang can emit BPF_SEXT_32_64, the verifier can correctly
> account for it, and finally backends can generate better code. This
> will help here, but also any other place we hit the sext codegen.
>
> Alexei, Yonghong, any opinions for/against adding new insn? I think we
> talked about doing it earlier.

Seems like overkill to me, honestly. I'd rather spend effort on
teaching Clang to always generate `w1 = w0` for such a case (for
alu32). For no-ALU32, the recommendation would be to switch to ALU32 if
you want to work with int instead of long and care about the two bitshift
operations. If you can stick to longs on no-ALU32, then no harm, no
foul.


>
> ---
>
> selftests/bpf: add variable-length data concat pattern less than test
>
> Extend original variable-length tests with a case to catch a common
> existing pattern of testing for < 0 for errors. Note because
> verifier also tracks upper bounds and we know it can not be greater
> than MAX_LEN here we can skip upper bound check.
>
> In ALU64 enabled compilation converting from long->int return types
> in probe helpers results in extra instruction pattern, <<= 32, s >>= 32.
> The trade-off is the non-ALU64 case works. If you really care about
> every extra insn (XDP case?) then you probably should be using original
> int type.
>
> In addition adding a sext insn to bpf might help the verifier in the
> general case to avoid these types of tricks.
>
> Signed-off-by: John Fastabend <john.fastabend@gmail.com>
> ---
>  tools/testing/selftests/bpf/prog_tests/varlen.c |   13 +++
>  tools/testing/selftests/bpf/progs/test_varlen.c |   90 ++++++++++++++++++++++-
>  2 files changed, 99 insertions(+), 4 deletions(-)
>
> diff --git a/tools/testing/selftests/bpf/prog_tests/varlen.c b/tools/testing/selftests/bpf/prog_tests/varlen.c
> index 7533565..e1499f7 100644
> --- a/tools/testing/selftests/bpf/prog_tests/varlen.c
> +++ b/tools/testing/selftests/bpf/prog_tests/varlen.c
> @@ -51,6 +51,19 @@ void test_varlen(void)
>         CHECK_VAL(data->total2, size1 + size2);
>         CHECK(memcmp(data->payload2, exp_str, size1 + size2), "content_check",
>               "doesn't match!");
> +
> +       CHECK_VAL(data->payload3_len1, size1);
> +       CHECK_VAL(data->payload3_len2, size2);
> +       CHECK_VAL(data->total3, size1 + size2);
> +       CHECK(memcmp(data->payload3, exp_str, size1 + size2), "content_check",
> +             "doesn't match!");
> +
> +       CHECK_VAL(data->payload4_len1, size1);
> +       CHECK_VAL(data->payload4_len2, size2);
> +       CHECK_VAL(data->total4, size1 + size2);
> +       CHECK(memcmp(data->payload4, exp_str, size1 + size2), "content_check",
> +             "doesn't match!");
> +
>  cleanup:
>         test_varlen__destroy(skel);
>  }
> diff --git a/tools/testing/selftests/bpf/progs/test_varlen.c b/tools/testing/selftests/bpf/progs/test_varlen.c
> index 0969185..dfe3a32 100644
> --- a/tools/testing/selftests/bpf/progs/test_varlen.c
> +++ b/tools/testing/selftests/bpf/progs/test_varlen.c
> @@ -26,8 +26,18 @@ int payload2_len2 = -1;
>  int total2 = -1;
>  char payload2[MAX_LEN + MAX_LEN] = { 1 };
>
> -SEC("raw_tp/sys_enter")
> -int handler64(void *regs)
> +int payload3_len1 = -1;
> +int payload3_len2 = -1;
> +int total3 = -1;
> +char payload3[MAX_LEN + MAX_LEN] = { 1 };
> +
> +int payload4_len1 = -1;
> +int payload4_len2 = -1;
> +int total4 = -1;
> +char payload4[MAX_LEN + MAX_LEN] = { 1 };
> +
> +static __always_inline
> +int handler64_gt(void *regs)
>  {
>         int pid = bpf_get_current_pid_tgid() >> 32;
>         void *payload = payload1;
> @@ -54,8 +64,44 @@ int handler64(void *regs)
>         return 0;
>  }
>
> -SEC("tp_btf/sys_enter")
> -int handler32(void *regs)
> +static __always_inline
> +int handler64_lt(void *regs)
> +{
> +       int pid = bpf_get_current_pid_tgid() >> 32;
> +       void *payload = payload3;
> +       long len;
> +
> +       /* ignore irrelevant invocations */
> +       if (test_pid != pid || !capture)
> +               return 0;
> +
> +       len = bpf_probe_read_kernel_str(payload, MAX_LEN, &buf_in1[0]);
> +       if (len < 0)
> +               goto next_lt_long;
> +       payload += len;
> +       payload3_len1 = len;
> +next_lt_long:
> +       len = bpf_probe_read_kernel_str(payload, MAX_LEN, &buf_in2[0]);
> +       if (len < 0)
> +               goto done_lt_long;
> +       payload += len;
> +       payload3_len2 = len;
> +done_lt_long:
> +       total3 = payload - (void *)payload3;
> +
> +       return 0;
> +}
> +
> +SEC("raw_tp/sys_enter")
> +int handler64(void *regs)
> +{
> +       handler64_gt(regs);
> +       handler64_lt(regs);
> +       return 0;
> +}
> +
> +static __always_inline
> +int handler32_gt(void *regs)
>  {
>         int pid = bpf_get_current_pid_tgid() >> 32;
>         void *payload = payload2;
> @@ -82,6 +128,42 @@ int handler32(void *regs)
>         return 0;
>  }
>
> +static __always_inline
> +int handler32_lt(void *regs)
> +{
> +       int pid = bpf_get_current_pid_tgid() >> 32;
> +       void *payload = payload4;
> +       int len;
> +
> +       /* ignore irrelevant invocations */
> +       if (test_pid != pid || !capture)
> +               return 0;
> +
> +       len = bpf_probe_read_kernel_str(payload, MAX_LEN, &buf_in1[0]);
> +       if (len < 0)
> +               goto next_lt_int;
> +       payload += len;
> +       payload4_len1 = len;
> +next_lt_int:
> +       len = bpf_probe_read_kernel_str(payload, MAX_LEN, &buf_in2[0]);
> +       if (len < 0)
> +               goto done_lt_int;
> +       payload += len;
> +       payload4_len2 = len;
> +done_lt_int:
> +       total4 = payload - (void *)payload4;
> +
> +       return 0;
> +}
> +
> +SEC("tp_btf/sys_enter")
> +int handler32(void *regs)
> +{
> +       handler32_gt(regs);
> +       handler32_lt(regs);
> +       return 0;
> +}
> +
>  SEC("tp_btf/sys_exit")
>  int handler_exit(void *regs)
>  {

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH bpf-next 1/2] bpf: switch most helper return values from 32-bit int to 64-bit long
  2020-06-22 19:16                       ` Andrii Nakryiko
@ 2020-06-22 19:42                         ` John Fastabend
  2020-06-22 21:00                           ` Andrii Nakryiko
  0 siblings, 1 reply; 18+ messages in thread
From: John Fastabend @ 2020-06-22 19:42 UTC (permalink / raw)
  To: Andrii Nakryiko, John Fastabend
  Cc: Daniel Borkmann, Andrii Nakryiko, bpf, Networking,
	Alexei Starovoitov, Kernel Team

Andrii Nakryiko wrote:
> On Mon, Jun 22, 2020 at 11:30 AM John Fastabend
> <john.fastabend@gmail.com> wrote:
> >
> > Andrii Nakryiko wrote:
> > > On Fri, Jun 19, 2020 at 3:21 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
> > > >
> > > > On 6/19/20 8:41 PM, Andrii Nakryiko wrote:
> > > > > On Fri, Jun 19, 2020 at 6:08 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
> > > > >> On 6/19/20 2:39 AM, John Fastabend wrote:
> > > > >>> John Fastabend wrote:
> > > > >>>> Andrii Nakryiko wrote:
> > > > >>>>> On Thu, Jun 18, 2020 at 11:58 AM John Fastabend
> > > > >>>>> <john.fastabend@gmail.com> wrote:
> > > > >>>
> > > > >>> [...]
> > > > >>>
> > > > >>>>> That would be great. Self-tests do work, but having more testing with
> > > > >>>>> real-world application would certainly help as well.
> > > > >>>>
> > > > >>>> Thanks for all the follow up.
> > > > >>>>
> > > > >>>> I ran the change through some CI on my side and it passed so I can
> > > > >>>> complain about a few shifts here and there or just update my code or
> > > > >>>> just not change the return types on my side but I'm convinced it's OK
> > > > >>>> in most cases and helps in some so...
> > > > >>>>
> > > > >>>> Acked-by: John Fastabend <john.fastabend@gmail.com>
> > > > >>>
> > > > >>> I'll follow this up with a few more selftests to capture a couple of our
> > > > >>> patterns. These changes are subtle and I worry a bit that additional
> > > > >>> <<,s>> pattern could have the potential to break something.
> > > > >>>
> > > > >>> Another one we didn't discuss that I found in our code base is feeding
> > > > >>> the output of a probe_* helper back into the size field (after some
> > > > >>> alu ops) of subsequent probe_* call. Unfortunately, the tests I ran
> > > > >>> today didn't cover that case.
> > > > >>>
> > > > >>> I'll put it on the list tomorrow and encode these in selftests. I'll
> > > > >>> let the maintainers decide if they want to wait for those or not.
> > > > >>
> > > > >> Given potential fragility on verifier side, my preference would be that we
> > > > >> have the known variations all covered in selftests before moving forward in
> > > > >> order to make sure they don't break in any way. Back in [0] I've seen mostly
> > > > >> similar cases in the way John mentioned in other projects, iirc, sysdig was
> > > > >> another one. If both of you could hack up the remaining cases we need to
> > > > >> cover and then submit a combined series, that would be great. I don't think
> > > > >> we need to rush this optimization w/o necessary selftests.
> > > > >
> > > > > There is no rush, but there is also no reason to delay it. I'd rather
> > > > > land it early in the libbpf release cycle and let people try it in
> > > > > their prod environments, for those concerned about such code patterns.
> > > >
> > > > Andrii, define 'delay'. John mentioned above to put together few more
> > > > selftests today so that there is better coverage at least, why is that
> > > > an 'issue'? I'm not sure how you read 'late in release cycle' out of it,
> > > > it's still as early. The unsigned optimization for len <= MAX_LEN is
> > > > reasonable and makes sense, but it's still one [specific] variant only.
> > >
> > > I'm totally fine waiting for John's tests, but I read your reply as a
> > > request to go dig up some more examples from sysdig and other
> > > projects, which I don't think I can commit to. So if it's just about
> > > waiting for John's examples, that's fine and sorry for
> > > misunderstanding.
> > >
> > > >
> > > > > I don't have a list of all the patterns that we might need to test.
> > > > > Going through all open-source BPF source code to identify possible
> > > > > patterns and then coding them up in minimal selftests is a bit too
> > > > > much for me, honestly.
> > > >
> > > > I think we're probably talking past each other. John wrote above:
> > >
> > > Yep, sorry, I assumed more general context, not specifically John's reply.
> > >
> > > >
> > > >  >>> I'll follow this up with a few more selftests to capture a couple of our
> > > >  >>> patterns. These changes are subtle and I worry a bit that additional
> > > >  >>> <<,s>> pattern could have the potential to break something.
> > > >
> > > > So submitting this as a full series together makes absolute sense to me,
> > > > so there's maybe not perfect but certainly more confidence that other
> > > > patterns, where the shifts optimized out in one case then appear in
> > > > another, are tested on a best-effort basis and run through our kselftest suite.
> > > >
> > > > Thanks,
> > > > Daniel
> >
> > Hi Andrii,
> >
> > How about adding this on top of your selftests patch? It will cover the
> > cases we have now with 'len < 0' check vs 'len > MAX'. I had another case
> > where we feed the output 'len' back into other probes, but this requires more
> > hackery than I'm willing to encode in a selftest. There seem to be
> > some better options to improve the clang side + verifier and get a clean
> > working version in the future.
> 
> Ok, sounds good. I'll add it as an extra patch. Not sure about all the
> conventions with preserving Signed-off-by. Would just keeping your
> Signed-off-by be ok? If you don't mind, though, I'll keep each
> handler{32,64}_{gt,lt} as 4 independent BPF programs, so that if any
> of them is unverifiable, it's easier to inspect the BPF assembly. Yell
> if you don't like this.

works for me, go for it.

> 
> >
> > On the clang/verifier side though I think the root cause is we do a poor
> > job of tracking >>32, s<<32 case. How about we add a sign-extend instruction
> > to BPF? Then clang can emit BPF_SEXT_32_64 and verifier can correctly
> > account for it and finally backends can generate better code. This
> > will help here, but also any other place we hit the sext codegen.
> >
> > Alexei, Yonghong, any opinions for/against adding new insn? I think we
> > talked about doing it earlier.
> 
> Seems like an overkill to me, honestly. I'd rather spend effort on
> teaching Clang to always generate `w1 = w0` for such a case (for
> alu32). For no-ALU32 recommendation would be to switch to ALU32, if
> you want to work with int instead of long and care about two bitshift
> operations. If you can stick to longs on no-ALU32, then no harm, no
> foul.
> 

Do you have an example of where clang doesn't generate just `w1 = w0`
for the alu32 case? It really should at this point; I'm not aware of
any cases where it doesn't. I think you might have mentioned this
earlier but I'm not seeing it.

There are other cases where sext gets generated in normal code and
it would be nice to not always have to work around it.
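
For reference, a minimal sketch of the kind of ordinary code that needs a
32->64 sign extension (not taken from this thread; identifiers are
illustrative):

/* A signed 32-bit delta added to a 64-bit pointer: clang has to
 * sign-extend 'delta', and with no sign-extending move in BPF today
 * that typically becomes the "r <<= 32; r s>>= 32" shift pair.
 */
static __always_inline const char *skip(const char *p, int delta)
{
        return p + delta;
}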


* Re: [PATCH bpf-next 1/2] bpf: switch most helper return values from 32-bit int to 64-bit long
  2020-06-22 19:42                         ` John Fastabend
@ 2020-06-22 21:00                           ` Andrii Nakryiko
  2020-06-22 21:18                             ` John Fastabend
  0 siblings, 1 reply; 18+ messages in thread
From: Andrii Nakryiko @ 2020-06-22 21:00 UTC (permalink / raw)
  To: John Fastabend
  Cc: Daniel Borkmann, Andrii Nakryiko, bpf, Networking,
	Alexei Starovoitov, Kernel Team

On Mon, Jun 22, 2020 at 12:42 PM John Fastabend
<john.fastabend@gmail.com> wrote:
>
> Andrii Nakryiko wrote:
> > On Mon, Jun 22, 2020 at 11:30 AM John Fastabend
> > <john.fastabend@gmail.com> wrote:
> > >
> > > Andrii Nakryiko wrote:
> > > > On Fri, Jun 19, 2020 at 3:21 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
> > > > >
> > > > > On 6/19/20 8:41 PM, Andrii Nakryiko wrote:
> > > > > > On Fri, Jun 19, 2020 at 6:08 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
> > > > > >> On 6/19/20 2:39 AM, John Fastabend wrote:
> > > > > >>> John Fastabend wrote:
> > > > > >>>> Andrii Nakryiko wrote:
> > > > > >>>>> On Thu, Jun 18, 2020 at 11:58 AM John Fastabend
> > > > > >>>>> <john.fastabend@gmail.com> wrote:
> > > > > >>>
> > > > > >>> [...]
> > > > > >>>
> > > > > >>>>> That would be great. Self-tests do work, but having more testing with
> > > > > >>>>> real-world application would certainly help as well.
> > > > > >>>>
> > > > > >>>> Thanks for all the follow up.
> > > > > >>>>
> > > > > >>>> I ran the change through some CI on my side and it passed so I can
> > > > > >>>> complain about a few shifts here and there or just update my code or
> > > > > >>>> just not change the return types on my side but I'm convinced it's OK
> > > > > >>>> in most cases and helps in some so...
> > > > > >>>>
> > > > > >>>> Acked-by: John Fastabend <john.fastabend@gmail.com>
> > > > > >>>
> > > > > >>> I'll follow this up with a few more selftests to capture a couple of our
> > > > > >>> patterns. These changes are subtle and I worry a bit that additional
> > > > > >>> <<,s>> pattern could have the potential to break something.
> > > > > >>>
> > > > > >>> Another one we didn't discuss that I found in our code base is feeding
> > > > > >>> the output of a probe_* helper back into the size field (after some
> > > > > >>> alu ops) of subsequent probe_* call. Unfortunately, the tests I ran
> > > > > >>> today didn't cover that case.
> > > > > >>>
> > > > > >>> I'll put it on the list tomorrow and encode these in selftests. I'll
> > > > > >>> let the maintainers decide if they want to wait for those or not.
> > > > > >>
> > > > > >> Given potential fragility on verifier side, my preference would be that we
> > > > > >> have the known variations all covered in selftests before moving forward in
> > > > > >> order to make sure they don't break in any way. Back in [0] I've seen mostly
> > > > > >> similar cases in the way John mentioned in other projects, iirc, sysdig was
> > > > > >> another one. If both of you could hack up the remaining cases we need to
> > > > > >> cover and then submit a combined series, that would be great. I don't think
> > > > > >> we need to rush this optimization w/o necessary selftests.
> > > > > >
> > > > > > There is no rush, but there is also no reason to delay it. I'd rather
> > > > > > land it early in the libbpf release cycle and let people try it in
> > > > > > their prod environments, for those concerned about such code patterns.
> > > > >
> > > > > Andrii, define 'delay'. John mentioned above to put together few more
> > > > > selftests today so that there is better coverage at least, why is that
> > > > > an 'issue'? I'm not sure how you read 'late in release cycle' out of it,
> > > > > it's still as early. The unsigned optimization for len <= MAX_LEN is
> > > > > reasonable and makes sense, but it's still one [specific] variant only.
> > > >
> > > > I'm totally fine waiting for John's tests, but I read your reply as a
> > > > request to go dig up some more examples from sysdig and other
> > > > projects, which I don't think I can commit to. So if it's just about
> > > > waiting for John's examples, that's fine and sorry for
> > > > misunderstanding.
> > > >
> > > > >
> > > > > > I don't have a list of all the patterns that we might need to test.
> > > > > > Going through all open-source BPF source code to identify possible
> > > > > > patterns and then coding them up in minimal selftests is a bit too
> > > > > > much for me, honestly.
> > > > >
> > > > > I think we're probably talking past each other. John wrote above:
> > > >
> > > > Yep, sorry, I assumed more general context, not specifically John's reply.
> > > >
> > > > >
> > > > >  >>> I'll follow this up with a few more selftests to capture a couple of our
> > > > >  >>> patterns. These changes are subtle and I worry a bit that additional
> > > > >  >>> <<,s>> pattern could have the potential to break something.
> > > > >
> > > > > So submitting this as a full series together makes absolute sense to me,
> > > > > so there's maybe not perfect but certainly more confidence that other
> > > > > patterns, where the shifts optimized out in one case then appear in
> > > > > another, are tested on a best-effort basis and run through our kselftest suite.
> > > > >
> > > > > Thanks,
> > > > > Daniel
> > >
> > > Hi Andrii,
> > >
> > > How about adding this on top of your selftests patch? It will cover the
> > > cases we have now with 'len < 0' check vs 'len > MAX'. I had another case
> > > where we feed the output 'len' back into other probes, but this requires more
> > > hackery than I'm willing to encode in a selftest. There seem to be
> > > some better options to improve the clang side + verifier and get a clean
> > > working version in the future.
> >
> > Ok, sounds good. I'll add it as an extra patch. Not sure about all the
> > conventions with preserving Signed-off-by. Would just keeping your
> > Signed-off-by be ok? If you don't mind, though, I'll keep each
> > handler{32,64}_{gt,lt} as 4 independent BPF programs, so that if any
> > of them is unverifiable, it's easier to inspect the BPF assembly. Yell
> > if you don't like this.
>
> works for me, go for it.
>
> >
> > >
> > > On the clang/verifier side though I think the root cause is we do a poor
> > > job of tracking >>32, s<<32 case. How about we add a sign-extend instruction
> > > to BPF? Then clang can emit BPF_SEXT_32_64 and verifier can correctly
> > > account for it and finally backends can generate better code. This
> > > will help here, but also any other place we hit the sext codegen.
> > >
> > > Alexei, Yonghong, any opinions for/against adding new insn? I think we
> > > talked about doing it earlier.
> >
> > Seems like an overkill to me, honestly. I'd rather spend effort on
> > teaching Clang to always generate `w1 = w0` for such a case (for
> > alu32). For no-ALU32 recommendation would be to switch to ALU32, if
> > you want to work with int instead of long and care about two bitshift
> > operations. If you can stick to longs on no-ALU32, then no harm, no
> > foul.
> >
>
> Do you have an example of where clang doesn't generate just `w1 = w0`
> > for the alu32 case? It really should at this point; I'm not aware of
> any cases where it doesn't. I think you might have mentioned this
> earlier but I'm not seeing it.

Yeah, ALU32 + LONG for helpers + u32 for len variable. I actually call
this out explicitly in the commit message for this patch.
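
For readers without the commit message at hand, here is a minimal sketch of
that combination (illustrative names, assuming the usual selftest includes
such as bpf_helpers.h and linux/types.h):

static __always_inline void *append(void *dst, const void *unsafe_src)
{
        __u32 len;      /* u32 on purpose; per the discussion, keeping this
                         * as 'long' avoids the conversion entirely */

        len = bpf_probe_read_kernel_str(dst, 256, unsafe_src); /* returns long */
        if (len <= 256)
                dst += len;     /* widening the u32 for the 64-bit pointer add
                                 * is where the r0 <<= 32 / r0 >>= 32 pair
                                 * shows up instead of a plain w1 = w0 */
        return dst;
}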

>
> There are other cases where sext gets generated in normal code and
> it would be nice to not always have to work around it.


* Re: [PATCH bpf-next 1/2] bpf: switch most helper return values from 32-bit int to 64-bit long
  2020-06-22 21:00                           ` Andrii Nakryiko
@ 2020-06-22 21:18                             ` John Fastabend
  2020-06-22 23:03                               ` Andrii Nakryiko
  0 siblings, 1 reply; 18+ messages in thread
From: John Fastabend @ 2020-06-22 21:18 UTC (permalink / raw)
  To: Andrii Nakryiko, John Fastabend
  Cc: Daniel Borkmann, Andrii Nakryiko, bpf, Networking,
	Alexei Starovoitov, Kernel Team

Andrii Nakryiko wrote:
> On Mon, Jun 22, 2020 at 12:42 PM John Fastabend
> <john.fastabend@gmail.com> wrote:
> >
> > Andrii Nakryiko wrote:
> > > On Mon, Jun 22, 2020 at 11:30 AM John Fastabend
> > > <john.fastabend@gmail.com> wrote:
> > > >
> > > > Andrii Nakryiko wrote:
> > > > > On Fri, Jun 19, 2020 at 3:21 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
> > > > > >
> > > > > > On 6/19/20 8:41 PM, Andrii Nakryiko wrote:
> > > > > > > On Fri, Jun 19, 2020 at 6:08 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
> > > > > > >> On 6/19/20 2:39 AM, John Fastabend wrote:
> > > > > > >>> John Fastabend wrote:
> > > > > > >>>> Andrii Nakryiko wrote:
> > > > > > >>>>> On Thu, Jun 18, 2020 at 11:58 AM John Fastabend
> > > > > > >>>>> <john.fastabend@gmail.com> wrote:
> > > > > > >>>
> > > > > > >>> [...]
> > > > > > >>>
> > > > > > >>>>> That would be great. Self-tests do work, but having more testing with
> > > > > > >>>>> real-world application would certainly help as well.
> > > > > > >>>>
> > > > > > >>>> Thanks for all the follow up.
> > > > > > >>>>
> > > > > > >>>> I ran the change through some CI on my side and it passed so I can
> > > > > > >>>> complain about a few shifts here and there or just update my code or
> > > > > > >>>> just not change the return types on my side but I'm convinced it's OK
> > > > > > >>>> in most cases and helps in some so...
> > > > > > >>>>
> > > > > > >>>> Acked-by: John Fastabend <john.fastabend@gmail.com>
> > > > > > >>>
> > > > > > >>> I'll follow this up with a few more selftests to capture a couple of our
> > > > > > >>> patterns. These changes are subtle and I worry a bit that additional
> > > > > > >>> <<,s>> pattern could have the potential to break something.
> > > > > > >>>
> > > > > > >>> Another one we didn't discuss that I found in our code base is feeding
> > > > > > >>> the output of a probe_* helper back into the size field (after some
> > > > > > >>> alu ops) of subsequent probe_* call. Unfortunately, the tests I ran
> > > > > > >>> today didn't cover that case.
> > > > > > >>>
> > > > > > >>> I'll put it on the list tomorrow and encode these in selftests. I'll
> > > > > > >>> let the maintainers decide if they want to wait for those or not.
> > > > > > >>
> > > > > > >> Given potential fragility on verifier side, my preference would be that we
> > > > > > >> have the known variations all covered in selftests before moving forward in
> > > > > > >> order to make sure they don't break in any way. Back in [0] I've seen mostly
> > > > > > >> similar cases in the way John mentioned in other projects, iirc, sysdig was
> > > > > > >> another one. If both of you could hack up the remaining cases we need to
> > > > > > >> cover and then submit a combined series, that would be great. I don't think
> > > > > > >> we need to rush this optimization w/o necessary selftests.
> > > > > > >
> > > > > > > There is no rush, but there is also no reason to delay it. I'd rather
> > > > > > > land it early in the libbpf release cycle and let people try it in
> > > > > > > their prod environments, for those concerned about such code patterns.
> > > > > >
> > > > > > Andrii, define 'delay'. John mentioned above to put together few more
> > > > > > selftests today so that there is better coverage at least, why is that
> > > > > > an 'issue'? I'm not sure how you read 'late in release cycle' out of it,
> > > > > > it's still as early. The unsigned optimization for len <= MAX_LEN is
> > > > > > reasonable and makes sense, but it's still one [specific] variant only.
> > > > >
> > > > > I'm totally fine waiting for John's tests, but I read your reply as a
> > > > > request to go dig up some more examples from sysdig and other
> > > > > projects, which I don't think I can commit to. So if it's just about
> > > > > waiting for John's examples, that's fine and sorry for
> > > > > misunderstanding.
> > > > >
> > > > > >
> > > > > > > I don't have a list of all the patterns that we might need to test.
> > > > > > > Going through all open-source BPF source code to identify possible
> > > > > > > patterns and then coding them up in minimal selftests is a bit too
> > > > > > > much for me, honestly.
> > > > > >
> > > > > > I think we're probably talking past each other. John wrote above:
> > > > >
> > > > > Yep, sorry, I assumed more general context, not specifically John's reply.
> > > > >
> > > > > >
> > > > > >  >>> I'll follow this up with a few more selftests to capture a couple of our
> > > > > >  >>> patterns. These changes are subtle and I worry a bit that additional
> > > > > >  >>> <<,s>> pattern could have the potential to break something.
> > > > > >
> > > > > > So submitting this as a full series together makes absolute sense to me,
> > > > > > so there's maybe not perfect but certainly more confidence that other
> > > > > > patterns, where the shifts optimized out in one case then appear in
> > > > > > another, are tested on a best-effort basis and run through our kselftest suite.
> > > > > >
> > > > > > Thanks,
> > > > > > Daniel
> > > >
> > > > Hi Andrii,
> > > >
> > > > How about adding this on top of your selftests patch? It will cover the
> > > > cases we have now with 'len < 0' check vs 'len > MAX'. I had another case
> > > > where we feed the output 'len' back into other probes, but this requires more
> > > > hackery than I'm willing to encode in a selftest. There seem to be
> > > > some better options to improve the clang side + verifier and get a clean
> > > > working version in the future.
> > >
> > > Ok, sounds good. I'll add it as an extra patch. Not sure about all the
> > > conventions with preserving Signed-off-by. Would just keeping your
> > > Signed-off-by be ok? If you don't mind, though, I'll keep each
> > > handler{32,64}_{gt,lt} as 4 independent BPF programs, so that if any
> > > of them is unverifiable, it's easier to inspect the BPF assembly. Yell
> > > if you don't like this.
> >
> > works for me, go for it.
> >
> > >
> > > >
> > > > On the clang/verifier side though I think the root cause is we do a poor
> > > > job of tracking >>32, s<<32 case. How about we add a sign-extend instruction
> > > > to BPF? Then clang can emit BPF_SEXT_32_64 and verifier can correctly
> > > > account for it and finally backends can generate better code. This
> > > > will help here, but also any other place we hit the sext codegen.
> > > >
> > > > Alexei, Yonghong, any opinions for/against adding new insn? I think we
> > > > talked about doing it earlier.
> > >
> > > Seems like an overkill to me, honestly. I'd rather spend effort on
> > > teaching Clang to always generate `w1 = w0` for such a case (for
> > > alu32). For no-ALU32 recommendation would be to switch to ALU32, if
> > > you want to work with int instead of long and care about two bitshift
> > > operations. If you can stick to longs on no-ALU32, then no harm, no
> > > foul.
> > >
> >
> > Do you have an example of where clang doesn't generate just `w1 = w0`
> > for the alu32 case? It really should at this point; I'm not aware of
> > any cases where it doesn't. I think you might have mentioned this
> > earlier but I'm not seeing it.
> 
> Yeah, ALU32 + LONG for helpers + u32 for len variable. I actually call
> this out explicitly in the commit message for this patch.
> 

Maybe we are just saying the same thing, but the <<32, s>>32 pattern
from the ALU32 + LONG for helpers + u32 case is there because LLVM
generated an IR sext instruction. We need the sext because it's promoting
a u32 type to a long. We can't just replace those with MOV instructions
like we do with zext, giving the `w1 = w0`. We would have to "know" that
the helper call zeroed the upper bits, but that isn't guaranteed by the
C standard. My suggestion to fix this is to generate a BPF_SEXT and then
let the verifier handle it and the JITs generate good code for it. On x86
we have a sign-extending move, MOVSX, for example.

Trying to go the other way and enforce that callees zero the upper bits
of the return register seems inconsistent and more difficult to implement.
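
To make the proposal concrete, a purely illustrative sketch (BPF_SEXT_32_64
is hypothetical at this point, and the assembly in the comment is schematic,
not taken from a real listing):

/* Widening a possibly-negative 32-bit value to 64 bits.  Today clang
 * lowers the sext to the "r0 <<= 32; r0 s>>= 32" shift pair, which the
 * verifier tracks poorly; a dedicated sign-extend insn would be a single
 * move that the verifier and the JITs (MOVSX on x86, say) could handle
 * precisely.
 */
static __always_inline long widen(int len)
{
        return (long)len;
}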

> >
> > There are other cases where sext gets generated in normal code and
> > it would be nice to not always have to work around it.




* Re: [PATCH bpf-next 1/2] bpf: switch most helper return values from 32-bit int to 64-bit long
  2020-06-22 21:18                             ` John Fastabend
@ 2020-06-22 23:03                               ` Andrii Nakryiko
  0 siblings, 0 replies; 18+ messages in thread
From: Andrii Nakryiko @ 2020-06-22 23:03 UTC (permalink / raw)
  To: John Fastabend
  Cc: Daniel Borkmann, Andrii Nakryiko, bpf, Networking,
	Alexei Starovoitov, Kernel Team

On Mon, Jun 22, 2020 at 2:18 PM John Fastabend <john.fastabend@gmail.com> wrote:
>
> Andrii Nakryiko wrote:
> > On Mon, Jun 22, 2020 at 12:42 PM John Fastabend
> > <john.fastabend@gmail.com> wrote:
> > >
> > > Andrii Nakryiko wrote:
> > > > On Mon, Jun 22, 2020 at 11:30 AM John Fastabend
> > > > <john.fastabend@gmail.com> wrote:
> > > > >
> > > > > Andrii Nakryiko wrote:
> > > > > > On Fri, Jun 19, 2020 at 3:21 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
> > > > > > >
> > > > > > > On 6/19/20 8:41 PM, Andrii Nakryiko wrote:
> > > > > > > > On Fri, Jun 19, 2020 at 6:08 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
> > > > > > > >> On 6/19/20 2:39 AM, John Fastabend wrote:
> > > > > > > >>> John Fastabend wrote:
> > > > > > > >>>> Andrii Nakryiko wrote:
> > > > > > > >>>>> On Thu, Jun 18, 2020 at 11:58 AM John Fastabend
> > > > > > > >>>>> <john.fastabend@gmail.com> wrote:
> > > > > > > >>>
> > > > > > > >>> [...]
> > > > > > > >>>
> > > > > > > >>>>> That would be great. Self-tests do work, but having more testing with
> > > > > > > >>>>> real-world application would certainly help as well.
> > > > > > > >>>>
> > > > > > > >>>> Thanks for all the follow up.
> > > > > > > >>>>
> > > > > > > >>>> I ran the change through some CI on my side and it passed so I can
> > > > > > > >>>> complain about a few shifts here and there or just update my code or
> > > > > > > >>>> just not change the return types on my side but I'm convinced it's OK
> > > > > > > >>>> in most cases and helps in some so...
> > > > > > > >>>>
> > > > > > > >>>> Acked-by: John Fastabend <john.fastabend@gmail.com>
> > > > > > > >>>
> > > > > > > >>> I'll follow this up with a few more selftests to capture a couple of our
> > > > > > > >>> patterns. These changes are subtle and I worry a bit that additional
> > > > > > > >>> <<,s>> pattern could have the potential to break something.
> > > > > > > >>>
> > > > > > > >>> Another one we didn't discuss that I found in our code base is feeding
> > > > > > > >>> the output of a probe_* helper back into the size field (after some
> > > > > > > >>> alu ops) of subsequent probe_* call. Unfortunately, the tests I ran
> > > > > > > >>> today didn't cover that case.
> > > > > > > >>>
> > > > > > > >>> I'll put it on the list tomorrow and encode these in selftests. I'll
> > > > > > > >>> let the maintainers decide if they want to wait for those or not.
> > > > > > > >>
> > > > > > > >> Given potential fragility on verifier side, my preference would be that we
> > > > > > > >> have the known variations all covered in selftests before moving forward in
> > > > > > > >> order to make sure they don't break in any way. Back in [0] I've seen mostly
> > > > > > > >> similar cases in the way John mentioned in other projects, iirc, sysdig was
> > > > > > > >> another one. If both of you could hack up the remaining cases we need to
> > > > > > > >> cover and then submit a combined series, that would be great. I don't think
> > > > > > > >> we need to rush this optimization w/o necessary selftests.
> > > > > > > >
> > > > > > > > There is no rush, but there is also no reason to delay it. I'd rather
> > > > > > > > land it early in the libbpf release cycle and let people try it in
> > > > > > > > their prod environments, for those concerned about such code patterns.
> > > > > > >
> > > > > > > Andrii, define 'delay'. John mentioned above to put together few more
> > > > > > > selftests today so that there is better coverage at least, why is that
> > > > > > > an 'issue'? I'm not sure how you read 'late in release cycle' out of it,
> > > > > > > it's still as early. The unsigned optimization for len <= MAX_LEN is
> > > > > > > reasonable and makes sense, but it's still one [specific] variant only.
> > > > > >
> > > > > > I'm totally fine waiting for John's tests, but I read your reply as a
> > > > > > request to go dig up some more examples from sysdig and other
> > > > > > projects, which I don't think I can commit to. So if it's just about
> > > > > > waiting for John's examples, that's fine and sorry for
> > > > > > misunderstanding.
> > > > > >
> > > > > > >
> > > > > > > > I don't have a list of all the patterns that we might need to test.
> > > > > > > > Going through all open-source BPF source code to identify possible
> > > > > > > > patterns and then coding them up in minimal selftests is a bit too
> > > > > > > > much for me, honestly.
> > > > > > >
> > > > > > > I think we're probably talking past each other. John wrote above:
> > > > > >
> > > > > > Yep, sorry, I assumed more general context, not specifically John's reply.
> > > > > >
> > > > > > >
> > > > > > >  >>> I'll follow this up with a few more selftests to capture a couple of our
> > > > > > >  >>> patterns. These changes are subtle and I worry a bit that additional
> > > > > > >  >>> <<,s>> pattern could have the potential to break something.
> > > > > > >
> > > > > > > So submitting this as a full series together makes absolute sense to me,
> > > > > > > so there's maybe not perfect but certainly more confidence that other
> > > > > > > patterns, where the shifts optimized out in one case then appear in
> > > > > > > another, are tested on a best-effort basis and run through our kselftest suite.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Daniel
> > > > >
> > > > > Hi Andrii,
> > > > >
> > > > > How about adding this on top of your selftests patch? It will cover the
> > > > > cases we have now with 'len < 0' check vs 'len > MAX'. I had another case
> > > > > where we feed the output 'len' back into other probes, but this requires more
> > > > > hackery than I'm willing to encode in a selftest. There seem to be
> > > > > some better options to improve the clang side + verifier and get a clean
> > > > > working version in the future.
> > > >
> > > > Ok, sounds good. I'll add it as an extra patch. Not sure about all the
> > > > conventions with preserving Signed-off-by. Would just keeping your
> > > > Signed-off-by be ok? If you don't mind, though, I'll keep each
> > > > handler{32,64}_{gt,lt} as 4 independent BPF programs, so that if any
> > > > of them is unverifiable, it's easier to inspect the BPF assembly. Yell
> > > > if you don't like this.
> > >
> > > works for me, go for it.
> > >
> > > >
> > > > >
> > > > > On the clang/verifier side though I think the root cause is we do a poor
> > > > > job of tracking >>32, s<<32 case. How about we add a sign-extend instruction
> > > > > to BPF? Then clang can emit BPF_SEXT_32_64 and verifier can correctly
> > > > > account for it and finally backends can generate better code. This
> > > > > will help here, but also any other place we hit the sext codegen.
> > > > >
> > > > > Alexei, Yonghong, any opinions for/against adding new insn? I think we
> > > > > talked about doing it earlier.
> > > >
> > > > Seems like an overkill to me, honestly. I'd rather spend effort on
> > > > teaching Clang to always generate `w1 = w0` for such a case (for
> > > > alu32). For no-ALU32 recommendation would be to switch to ALU32, if
> > > > you want to work with int instead of long and care about two bitshift
> > > > operations. If you can stick to longs on no-ALU32, then no harm, no
> > > > foul.
> > > >
> > >
> > > Do you have an example of where clang doesn't generate just `w1 = w0`
> > > for the alu32 case? It really should at this point; I'm not aware of
> > > any cases where it doesn't. I think you might have mentioned this
> > > earlier but I'm not seeing it.
> >
> > Yeah, ALU32 + LONG for helpers + u32 for len variable. I actually call
> > this out explicitly in the commit message for this patch.
> >
>
> Maybe we are just saying the same thing, but the <<32, s>>32 pattern
> from the ALU32 + LONG for helpers + u32 case is there because LLVM
> generated an IR sext instruction. We need the sext because it's promoting
> a u32 type to a long. We can't just replace those with MOV instructions
> like we do with zext, giving the `w1 = w0`. We would have to "know" that
> the helper call zeroed the upper bits, but that isn't guaranteed by the
> C standard. My suggestion to fix this is to generate a BPF_SEXT and then
> let the verifier handle it and the JITs generate good code for it. On x86
> we have a sign-extending move, MOVSX, for example.
>

There is no sign extension there; we convert a long into a u32, so we're
just truncating the upper 32 bits and leaving the lower 32 bits intact.
Here are the relevant listings:

ALU32 + INT                                ALU32 + LONG
===========                                ============

32-BIT (11 insns):                         32-BIT (12 insns):
------------------------------------       ------------------------------------
  17:   call 115                             17:   call 115
  18:   if w0 > 256 goto +7 <LBB1_4>         18:   if w0 > 256 goto +8 <LBB1_4>
  19:   r1 = 0 ll                            19:   r1 = 0 ll
  21:   *(u32 *)(r1 + 0) = r0                21:   *(u32 *)(r1 + 0) = r0
  22:   w1 = w0                              22:   r0 <<= 32
  23:   r6 = 0 ll                            23:   r0 >>= 32
  25:   r6 += r1                             24:   r6 = 0 ll
00000000000000d0 <LBB1_4>:                   26:   r6 += r0
  26:   r1 = r6                            00000000000000d8 <LBB1_4>:
  27:   w2 = 256                             27:   r1 = r6
  28:   r3 = 0 ll                            28:   w2 = 256
  30:   call 115                             29:   r3 = 0 ll
                                             31:   call 115

See insns 22-23 on the right side and the equivalent insn 22 (`w1 = w0`)
on the left side. It's otherwise identical code; the only difference is
that the single 32-bit move becomes this pair of bit shifts. It still
seems to me a suboptimal choice on Clang's part.
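
To tie the columns back to source form, a rough reconstruction of the code
shape both listings come from (hedged: names are illustrative, 'captured_len'
is a stand-in for the selftest's length global, and the only difference
between left and right is whether the helper is declared as returning int
or long):

__u32 captured_len;

static __always_inline void *append_one(void *dst, const void *unsafe_src)
{
        __u32 len;

        len = bpf_probe_read_kernel_str(dst, 256, unsafe_src); /* insn 17: call 115         */
        if (len <= 256) {                                      /* insn 18: if w0 > 256 goto */
                captured_len = len;                            /* insn 21: 32-bit store     */
                dst += len;     /* left (int helper):   insn 22,     w1 = w0
                                 * right (long helper): insns 22-23, r0 <<= 32; r0 >>= 32
                                 * -- both just zero-extend w0, no sign extension involved */
        }
        return dst;             /* the next probe then reads into 'dst' (insns 26-31) */
}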


> Trying to go the other way and enforce that callees zero the upper bits
> of the return register seems inconsistent and more difficult to implement.
>
> > >
> > > There are other cases where sext gets generated in normal code and
> > > it would be nice to not always have to work around it.
>
>



Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-06-17 20:21 [PATCH bpf-next 1/2] bpf: switch most helper return values from 32-bit int to 64-bit long Andrii Nakryiko
2020-06-17 20:21 ` [PATCH bpf-next 2/2] selftests/bpf: add variable-length data concatenation pattern test Andrii Nakryiko
2020-06-18  6:49 ` [PATCH bpf-next 1/2] bpf: switch most helper return values from 32-bit int to 64-bit long John Fastabend
2020-06-18  7:30   ` Andrii Nakryiko
2020-06-18 18:58     ` John Fastabend
2020-06-18 21:41       ` Andrii Nakryiko
2020-06-18 23:30         ` John Fastabend
2020-06-19  0:39           ` John Fastabend
2020-06-19 13:08             ` Daniel Borkmann
2020-06-19 18:41               ` Andrii Nakryiko
2020-06-19 22:21                 ` Daniel Borkmann
2020-06-19 23:11                   ` Andrii Nakryiko
2020-06-22 18:30                     ` John Fastabend
2020-06-22 19:16                       ` Andrii Nakryiko
2020-06-22 19:42                         ` John Fastabend
2020-06-22 21:00                           ` Andrii Nakryiko
2020-06-22 21:18                             ` John Fastabend
2020-06-22 23:03                               ` Andrii Nakryiko
