Re: [PATCH RFC bpf-next 1/3] bpf: Fix certain narrow loads with offsets

From: Jakub Sitnicki <jakub@cloudflare.com>
To: Ilya Leoshkevich <iii@linux.ibm.com>
Cc: Alexei Starovoitov <ast@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Andrii Nakryiko <andrii.nakryiko@gmail.com>,
	bpf@vger.kernel.org, Heiko Carstens <hca@linux.ibm.com>,
	Vasily Gorbik <gor@linux.ibm.com>
Subject: Re: [PATCH RFC bpf-next 1/3] bpf: Fix certain narrow loads with offsets
Date: Tue, 08 Mar 2022 16:01:05 +0100	[thread overview]
Message-ID: <87bkygzbg5.fsf@cloudflare.com> (raw)
In-Reply-To: <20220222182559.2865596-2-iii@linux.ibm.com>

On Tue, Feb 22, 2022 at 07:25 PM +01, Ilya Leoshkevich wrote:
> Verifier treats bpf_sk_lookup.remote_port as a 32-bit field for
> backward compatibility, regardless of what the uapi headers say.
> This field is mapped onto the 16-bit bpf_sk_lookup_kern.sport field.
> Therefore, accessing the most significant 16 bits of
> bpf_sk_lookup.remote_port must produce 0, which is currently not
> the case.
>
> The problem is that narrow loads with offset - commit 46f53a65d2de
> ("bpf: Allow narrow loads with offset > 0"), don't play nicely with
> the masking optimization - commit 239946314e57 ("bpf: possibly avoid
> extra masking for narrower load in verifier"). In particular, when we
> suppress extra masking, we suppress shifting as well, which is not
> correct.
>
> Fix by moving the masking suppression check to BPF_AND generation.
>
> Fixes: 46f53a65d2de ("bpf: Allow narrow loads with offset > 0")
> Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
> ---
>  kernel/bpf/verifier.c | 14 +++++++++-----
>  1 file changed, 9 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index d7473fee247c..195f2e9b5a47 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -12848,7 +12848,7 @@ static int convert_ctx_accesses(struct bpf_verifier_env *env)
>  			return -EINVAL;
>  		}
>  
> -		if (is_narrower_load && size < target_size) {
> +		if (is_narrower_load) {
>  			u8 shift = bpf_ctx_narrow_access_offset(
>  				off, size, size_default) * 8;
>  			if (shift && cnt + 1 >= ARRAY_SIZE(insn_buf)) {
> @@ -12860,15 +12860,19 @@ static int convert_ctx_accesses(struct bpf_verifier_env *env)
>  					insn_buf[cnt++] = BPF_ALU32_IMM(BPF_RSH,
>  									insn->dst_reg,
>  									shift);
> -				insn_buf[cnt++] = BPF_ALU32_IMM(BPF_AND, insn->dst_reg,
> -								(1 << size * 8) - 1);
> +				if (size < target_size)
> +					insn_buf[cnt++] = BPF_ALU32_IMM(
> +						BPF_AND, insn->dst_reg,
> +						(1 << size * 8) - 1);
>  			} else {
>  				if (shift)
>  					insn_buf[cnt++] = BPF_ALU64_IMM(BPF_RSH,
>  									insn->dst_reg,
>  									shift);
> -				insn_buf[cnt++] = BPF_ALU64_IMM(BPF_AND, insn->dst_reg,
> -								(1ULL << size * 8) - 1);
> +				if (size < target_size)
> +					insn_buf[cnt++] = BPF_ALU64_IMM(
> +						BPF_AND, insn->dst_reg,
> +						(1ULL << size * 8) - 1);
>  			}
>  		}

Thanks for patience. I'm coming back to this.

This fix affects the 2-byte load from bpf_sk_lookup.remote_port.
Dumping the xlated BPF code confirms it.

On LE (x86-64) things look well.

Before this patch:

* size=2, offset=0, 0: (69) r2 = *(u16 *)(r1 +36)
   0: (69) r2 = *(u16 *)(r1 +4)
   1: (b7) r0 = 0
   2: (95) exit

* size=2, offset=2, 0: (69) r2 = *(u16 *)(r1 +38)
   0: (69) r2 = *(u16 *)(r1 +4)
   1: (b7) r0 = 0
   2: (95) exit

After this patch:

* size=2, offset=0, 0: (69) r2 = *(u16 *)(r1 +36)
   0: (69) r2 = *(u16 *)(r1 +4)
   1: (b7) r0 = 0
   2: (95) exit

* size=2, offset=2, 0: (69) r2 = *(u16 *)(r1 +38)
   0: (69) r2 = *(u16 *)(r1 +4)
   1: (74) w2 >>= 16
   2: (b7) r0 = 0
   3: (95) exit

Which works great because the JIT generates a zero-extended load movzwq:

* size=2, offset=0, 0: (69) r2 = *(u16 *)(r1 +36)
bpf_prog_5e4fe3dbdcb18fd3:
   0:   nopl   0x0(%rax,%rax,1)
   5:   xchg   %ax,%ax
   7:   push   %rbp
   8:   mov    %rsp,%rbp
   b:   movzwq 0x4(%rdi),%rsi
  10:   xor    %eax,%eax
  12:   leave
  13:   ret

* size=2, offset=2, 0: (69) r2 = *(u16 *)(r1 +38)
bpf_prog_4a6336c64a340b96:
   0:   nopl   0x0(%rax,%rax,1)
   5:   xchg   %ax,%ax
   7:   push   %rbp
   8:   mov    %rsp,%rbp
   b:   movzwq 0x4(%rdi),%rsi
  10:   shr    $0x10,%esi
  13:   xor    %eax,%eax
  15:   leave
  16:   ret

Runtime checks for bpf_sk_lookup.remote_port load and the 2-bytes of
zero padding following it, like below, pass with flying colors:

	ok = ctx->remote_port == bpf_htons(8008);
	if (!ok)
		return SK_DROP;
	ok = *((__u16 *)&ctx->remote_port + 1) == 0;
	if (!ok)
		return SK_DROP;

(The above checks compile to half-word (2-byte) loads.)

On BE (s390x) things look different:

Before the patch:

* size=2, offset=0, 0: (69) r2 = *(u16 *)(r1 +36)
   0: (69) r2 = *(u16 *)(r1 +4)
   1: (bc) w2 = w2
   2: (b7) r0 = 0
   3: (95) exit

* size=2, offset=2, 0: (69) r2 = *(u16 *)(r1 +38)
   0: (69) r2 = *(u16 *)(r1 +4)
   1: (bc) w2 = w2
   2: (b7) r0 = 0
   3: (95) exit

After the patch:

* size=2, offset=0, 0: (69) r2 = *(u16 *)(r1 +36)
   0: (69) r2 = *(u16 *)(r1 +4)
   1: (bc) w2 = w2
   2: (74) w2 >>= 16
   3: (bc) w2 = w2
   4: (b7) r0 = 0
   5: (95) exit

* size=2, offset=2, 0: (69) r2 = *(u16 *)(r1 +38)
   0: (69) r2 = *(u16 *)(r1 +4)
   1: (bc) w2 = w2
   2: (b7) r0 = 0
   3: (95) exit

These compile to:

* size=2, offset=0, 0: (69) r2 = *(u16 *)(r1 +36)
bpf_prog_fdd58b8caca29f00:
   0:   j       0x0000000000000006
   4:   nopr
   6:   stmg    %r11,%r15,112(%r15)
   c:   la      %r13,64(%r15)
  10:   aghi    %r15,-96
  14:   llgh    %r3,4(%r2,%r0)
  1a:   srl     %r3,16
  1e:   llgfr   %r3,%r3
  22:   lgfi    %r14,0
  28:   lgr     %r2,%r14
  2c:   lmg     %r11,%r15,208(%r15)
  32:   br      %r14

* size=2, offset=2, 0: (69) r2 = *(u16 *)(r1 +38)
bpf_prog_5e3d8e92223c6841:
   0:   j       0x0000000000000006
   4:   nopr
   6:   stmg    %r11,%r15,112(%r15)
   c:   la      %r13,64(%r15)
  10:   aghi    %r15,-96
  14:   llgh    %r3,4(%r2,%r0)
  1a:   lgfi    %r14,0
  20:   lgr     %r2,%r14
  24:   lmg     %r11,%r15,208(%r15)
  2a:   br      %r14

Now, we right shift the value when loading

  *(u16 *)(r1 +36)

which in C BPF is equivalent to

  *((__u16 *)&ctx->remote_port + 0)

due to how the shift is calculated by bpf_ctx_narrow_access_offset().

This makes the expected typical use-case

  ctx->remote_port == bpf_htons(8008)

fail on s390x because llgh (Load Logical Halfword (64<-16)) seems to lay
out the data in the destination register so that it holds
0x0000_0000_0000_1f48.

I don't know that was the intention here, as it makes the BPF C code
non-portable.

WDYT?

BTW. Out of curiosity, how does a Logical Load Halfword (llgh) differ
differ from a non-logical Load Halfword (lgh) on s390x? Compiler
Explorer generates a non-logical load for similar C code.