Re: [RFC bpf-next 1/2] bpf: verify scalar ids mapping in regsafe() using check_ids()

From: Eduard Zingerman <eddyz87@gmail.com>
To: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Cc: bpf@vger.kernel.org, ast@kernel.org, andrii@kernel.org,
	daniel@iogearbox.net, kernel-team@fb.com, yhs@fb.com
Subject: Re: [RFC bpf-next 1/2] bpf: verify scalar ids mapping in regsafe() using check_ids()
Date: Thu, 01 Dec 2022 20:33:23 +0200	[thread overview]
Message-ID: <8c5560dd1e27ed9eee57ca24b4e6a7304875c669.camel@gmail.com> (raw)
In-Reply-To: <859d531ef1e2b4dab103d316e6f109958f3f1bad.camel@gmail.com>

On Thu, 2022-12-01 at 03:14 +0200, Eduard Zingerman wrote:
> On Wed, 2022-11-30 at 16:26 -0800, Andrii Nakryiko wrote:
> > On Mon, Nov 28, 2022 at 8:35 AM Eduard Zingerman <eddyz87@gmail.com> wrote:
> > > 
> > > Prior to this commit the following unsafe example passed verification:
> > > 
> > > 1: r9 = ... some pointer with range X ...
> > > 2: r6 = ... unbound scalar ID=a ...
> > > 3: r7 = ... unbound scalar ID=b ...
> > > 4: if (r6 > r7) goto +1
> > > 5: r6 = r7
> > > 6: if (r6 > X) goto ...   ; <-- suppose checkpoint state is created here
> > > 7: r9 += r7
> > > 8: *(u64 *)r9 = Y
> > > 
> > > This example is unsafe because not all execution paths verify r7 range.
> > > Because of the jump at (4) the verifier would arrive at (6) in two states:
> > > I.  r6{.id=b}, r7{.id=b} via path 1-6;
> > > II. r6{.id=a}, r7{.id=b} via path 1-4, 6.
> > > 
> > > Currently regsafe() does not call check_ids() for scalar registers,
> > > thus from POV of regsafe() states (I) and (II) are identical. If the
> > > path 1-6 is taken by verifier first and checkpoint is created at (6)
> > > the path 1-4, 6 would be considered safe.
> > > 
> > > This commit makes the following changes:
> > > - a call to check_ids() is added in regsafe() for scalar registers case;
> > > - a function mark_equal_scalars_as_read() is added to ensure that
> > >   registers with identical IDs are preserved in the checkpoint states
> > >   in case when find_equal_scalars() updates register range for several
> > >   registers sharing the same ID.
> > > 
> > 
> > Fixes tag missing?
> > 
> > These are tricky changes with subtle details. Let's split check_ids()
> > change and all the liveness manipulations into separate patches? They
> > are conceptually completely independent, right?
> > 
> > 
> > > Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
> > > ---
> > >  kernel/bpf/verifier.c | 87 ++++++++++++++++++++++++++++++++++++++++++-
> > >  1 file changed, 85 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> > > index 6599d25dae38..8a5b7192514a 100644
> > > --- a/kernel/bpf/verifier.c
> > > +++ b/kernel/bpf/verifier.c
> > > @@ -10638,10 +10638,12 @@ static int check_alu_op(struct bpf_verifier_env *env, struct bpf_insn *insn)
> > >                                 /* case: R1 = R2
> > >                                  * copy register state to dest reg
> > >                                  */
> > > -                               if (src_reg->type == SCALAR_VALUE && !src_reg->id)
> > > +                               if (src_reg->type == SCALAR_VALUE && !src_reg->id &&
> > > +                                   !tnum_is_const(src_reg->var_off))
> > >                                         /* Assign src and dst registers the same ID
> > >                                          * that will be used by find_equal_scalars()
> > >                                          * to propagate min/max range.
> > > +                                        * Skip constants to avoid allocation of useless ID.
> > >                                          */
> > >                                         src_reg->id = ++env->id_gen;
> > >                                 *dst_reg = *src_reg;
> > > @@ -11446,16 +11448,86 @@ static bool try_match_pkt_pointers(const struct bpf_insn *insn,
> > >         return true;
> > >  }
> > > 
> > > +/* Scalar ID generation in check_alu_op() and logic of
> > > + * find_equal_scalars() make the following pattern possible:
> > > + *
> > > + * 1: r9 = ... some pointer with range X ...
> > > + * 2: r6 = ... unbound scalar ID=a ...
> > > + * 3: r7 = ... unbound scalar ID=b ...
> > > + * 4: if (r6 > r7) goto +1
> > > + * 5: r6 = r7
> > > + * 6: if (r6 > X) goto ...   ; <-- suppose checkpoint state is created here
> > > + * 7: r9 += r7
> > > + * 8: *(u64 *)r9 = Y
> > > + *
> > > + * Because of the jump at (4) the verifier would arrive at (6) in two states:
> > > + * I.  r6{.id=b}, r7{.id=b}
> > > + * II. r6{.id=a}, r7{.id=b}
> > > + *
> > > + * Relevant facts:
> > > + * - regsafe() matches ID mappings for scalars using check_ids(), this makes
> > > + *   states (I) and (II) non-equal;
> > > + * - clean_func_state() removes registers not marked as REG_LIVE_READ from
> > > + *   checkpoint states;
> > > + * - mark_reg_read() modifies reg->live for reg->parent (and it's parents);
> > > + * - when r6 = r7 is process the bpf_reg_state is copied in full, meaning
> > > + *   that parent pointers are copied as well.
> > 
> > not too familiar with liveness handling, but is this correct and
> > expected? Should this be fixed instead of REG_LIVE_READ manipulations?

TLDR:

- looks like it is safe to avoid bpf_reg_state->parent update when
  registers are copied, but it has some performance impact. If this
  impact is acceptable I'd like to move this way are remove
  mark_equal_scalars_as_read().

- unrelated question: does anything has to be done to
  __mark_chain_precision to make it know about shared scalar IDs?
  E.g. in the following case:
    ...
  --- checkpoint 1 --- r6 would be marked as precise here
  r6 = r7
    ...
  --- checkpoint 2 --- r6 won't be marked as precise here
    ...
  if r6 < 10
  --- checkpoint 3 ---
  fp[r7] = 42

  The additional precision marks could be inferred if additional info
  is added the jump stack.

-----
Long version about bpf_reg_state->parent.

All functions that copy register states:
- save_register_state()
  Register is copied to stack spill location.

- check_kfunc_mem_size_reg()
  Makes a tmp register copy, harmless.

- sanitize_ptr_alu()
  Register copy is visible in the speculative branch.

- check_alu_op()
  Register is copied when reg to reg MOV is processed.

- find_equal_scalars()
  Register is copied to the registers with the same ID.

Original commit introducing bpf_reg_state->parent:
- 679c782de14bd48c19dd74cd1af20a2bc05dd936 2018-08-30
  bpf/verifier: per-register parent pointers

  Updates mark_reg_read() and removes mark_stack_slot_read().
  Previous version accessed parent state registers directly using
  regno w/o any additional manipulations.

  In this commit:
  - save_register_state() - does not exist yet, its function is
    performed by check_stack_write(), has a direct register state copy
    as in save_register_state();
  - check_kfunc_mem_size_reg() - does not exist yet;
  - sanitize_ptr_alu() - does not exist yet;
  - check_alu_op() - has a direct register state copy unchanged for
    reg to reg MOV.

Commit introducing find_equal_scalars():
- 75748837b7e56919679e02163f45d5818c644d03 2020-10-09
  bpf: Propagate scalar ranges through register assignments.

  Has a direct register copy *reg = *known_reg.

It looks like that for places where registers are copied ->parent
change is an unintentional side effect. Removal of this side effect
might in theory lead to some additional register read marks, e.g. in
this case:

  r6 = r7
  if (r6 > X) goto +42
  r5 = r7              ; this would lead to a read mark on r7 not
                       ; present previously.

This should not hinder correctness.
However, there is some negative performance impact:

$ ./veristat -e file,prog,states -f 'states_pct!=0' -C master-baseline.log current.log 
File                      Program                           States (A)  States (B)  States  (DIFF)
------------------------  --------------------------------  ----------  ----------  --------------
bpf_host.o                cil_to_netdev                            358         455   +97 (+27.09%)
bpf_host.o                tail_handle_nat_fwd_ipv4                1746        1891   +145 (+8.30%)
bpf_host.o                tail_handle_nat_fwd_ipv6                 709         717     +8 (+1.13%)
bpf_host.o                tail_nodeport_ipv4_dsr                    31          42   +11 (+35.48%)
bpf_host.o                tail_nodeport_nat_egress_ipv4           2269        2274     +5 (+0.22%)
bpf_host.o                tail_nodeport_nat_ingress_ipv4           276         316   +40 (+14.49%)
bpf_host.o                tail_nodeport_nat_ingress_ipv6           243         254    +11 (+4.53%)
bpf_lxc.o                 tail_handle_nat_fwd_ipv4                1746        1891   +145 (+8.30%)
bpf_lxc.o                 tail_handle_nat_fwd_ipv6                 709         717     +8 (+1.13%)
bpf_lxc.o                 tail_ipv4_ct_egress                      248         251     +3 (+1.21%)
bpf_lxc.o                 tail_ipv4_ct_ingress                     248         251     +3 (+1.21%)
bpf_lxc.o                 tail_ipv4_ct_ingress_policy_only         248         251     +3 (+1.21%)
bpf_lxc.o                 tail_nodeport_ipv4_dsr                    31          42   +11 (+35.48%)
bpf_lxc.o                 tail_nodeport_nat_ingress_ipv4           276         316   +40 (+14.49%)
bpf_lxc.o                 tail_nodeport_nat_ingress_ipv6           243         254    +11 (+4.53%)
bpf_overlay.o             tail_handle_nat_fwd_ipv4                1082        1109    +27 (+2.50%)
bpf_overlay.o             tail_nodeport_ipv4_dsr                    31          42   +11 (+35.48%)
bpf_overlay.o             tail_nodeport_nat_egress_ipv4           2238        2243     +5 (+0.22%)
bpf_overlay.o             tail_nodeport_nat_ingress_ipv4           276         316   +40 (+14.49%)
bpf_overlay.o             tail_nodeport_nat_ingress_ipv6           243         254    +11 (+4.53%)
bpf_sock.o                cil_sock4_connect                         47          64   +17 (+36.17%)
bpf_sock.o                cil_sock4_sendmsg                         45          62   +17 (+37.78%)
bpf_xdp.o                 tail_handle_nat_fwd_ipv4                1461        1912  +451 (+30.87%)
bpf_xdp.o                 tail_lb_ipv4                            4643        4738    +95 (+2.05%)
bpf_xdp.o                 tail_nodeport_nat_egress_ipv4           1066        1069     +3 (+0.28%)
bpf_xdp.o                 tail_rev_nodeport_lb4                    404         411     +7 (+1.73%)
bpf_xdp.o                 tail_rev_nodeport_lb6                   1076        1083     +7 (+0.65%)
pyperf600_bpf_loop.bpf.o  on_event                                 285         287     +2 (+0.70%)
xdp_synproxy_kern.bpf.o   syncookie_tc                           22513       22564    +51 (+0.23%)
xdp_synproxy_kern.bpf.o   syncookie_xdp                          22207       24206  +1999 (+9.00%)
------------------------  --------------------------------  ----------  ----------  --------------

> 
> Well, that's what I wanted to ask, actually :)
> Here is how current logic works:
> - is_state_visited() has the following two loops in the end:
> 
> 	for (j = 0; j <= cur->curframe; j++) {
> 		for (i = j < cur->curframe ? BPF_REG_6 : 0; i < BPF_REG_FP; i++)
> 			cur->frame[j]->regs[i].parent = &new->frame[j]->regs[i];
> 		for (i = 0; i < BPF_REG_FP; i++)
> 			cur->frame[j]->regs[i].live = REG_LIVE_NONE;
> 	}
> 
> 	/* all stack frames are accessible from callee, clear them all */
> 	for (j = 0; j <= cur->curframe; j++) {
> 		struct bpf_func_state *frame = cur->frame[j];
> 		struct bpf_func_state *newframe = new->frame[j];
> 
> 		for (i = 0; i < frame->allocated_stack / BPF_REG_SIZE; i++) {
> 			frame->stack[i].spilled_ptr.live = REG_LIVE_NONE;
> 			frame->stack[i].spilled_ptr.parent =
> 						&newframe->stack[i].spilled_ptr;
> 		}
> 	}
> 
>   These connect the bpf_reg_state members of the new state with
>   corresponding (index-wise) members of the parent state.
> - find_equal_scalars() looks as follows:
>   static void find_equal_scalars(struct bpf_verifier_state *vstate,
>                                struct bpf_reg_state *known_reg)
>   {
> 	struct bpf_func_state *state;
> 	struct bpf_reg_state *reg;
> 
> 	bpf_for_each_reg_in_vstate(vstate, state, reg, ({
> 		if (reg->type == SCALAR_VALUE && reg->id == known_reg->id)
> 			*reg = *known_reg;  // <--- full copy, incl. parent pointer
> 	}));
>   }
> - mark_reg_read() updates the ->live field of the *parent* register
>   when called only if ->live field of the *current* register is not
>   marked as written.
> - in case if register is overwritten it's ->live field is marked as
>   written, e.g. see check_stack_read_fixed_off().
>   
> Suppose we have an example:
> 
> ---- checkpoint ----
> r1 = r0               ; now r1.parent == &checkpoint->regs[0]
> r2 = r1               ; now r2.parent == &checkpoint->regs[0]
> if (r1 == 0) goto +42
> ...
> 
> Given the above logic only &checkpoint->regs[0] would receive read
> marks. Although I'm not the original author but this behavior seem to
> make sense.
> 
> > 
> > > + *
> > > + * Thus, for execution path 1-6:
> > > + * - both r6->parent and r7->parent point to the same register in the parent state (r7);
> > > + * - only *one* register in the checkpoint state would receive REG_LIVE_READ mark;
> > 
> > I'm trying to understand this. Clearly both r6 and r7 are read. r6 for
> > if (r6 > X) check, r7 for r9 manipulations. Why do we end up not
> > marking one of them as read using a normal logic?
> 
> When (r6 > X) is processed find_equal_scalars() updates parent
> pointers for all registers with the same ID as r6, in this case only
> for r7. So, after find_equal_scalars() is done both current state r6
> and r7 ->parent point to the r6 of the latest checkpoint state.
> 
> > 
> > I have this bad feeling I'm missing something very important here or
> > we have some bug somewhere else. So please help me understand which
> > one it is. This special liveness manipulation seems wrong.
> > 
> > My concern is that if I have some code like
> > 
> > r6 = r7;
> > r9 += r6;
> > 
> > and I never use r7 anymore after that, then we should be able to
> > forget r7 and treat it as NOT_INIT. But you are saying it's unsafe
> > right now and that doesn't make much sense to me.
> 
> It is unsafe because of the "spooky action at a distance" produced by
> a combination of:
> - allocation of scalar IDs for moves, see check_alu_op() case for
>   64-bit move;
> - find_equal_scalars() that propagates range, this one is only
>   executed for conditional jumps.
> 
> > 
> > 
> > > + * - clean_func_state() would remove r6 from checkpoint state (mark it NOT_INIT).
> > > + *
> > > + * Consequently, when execution path 1-4, 6 reaches (6) in state (II)
> > > + * regsafe() won't be able to see a mismatch in ID mappings.
> > > + *
> > > + * To avoid this issue mark_equal_scalars_as_read() conservatively
> > > + * marks all registers with matching ID as REG_LIVE_READ, thus
> > > + * preserving r6 and r7 in the checkpoint state for the example above.
> > > + */
> > > +static void mark_equal_scalars_as_read(struct bpf_verifier_state *vstate, int id)
> > > +{
> > > +       struct bpf_verifier_state *st;
> > > +       struct bpf_func_state *state;
> > > +       struct bpf_reg_state *reg;
> > > +       bool move_up;
> > > +       int i = 0;
> > > +
> > > +       for (st = vstate, move_up = true; st && move_up; st = st->parent) {
> > > +               move_up = false;
> > > +               bpf_for_each_reg_in_vstate(st, state, reg, ({
> > > +                       if (reg->type == SCALAR_VALUE && reg->id == id &&
> > > +                           !(reg->live & REG_LIVE_READ)) {
> > > +                               reg->live |= REG_LIVE_READ;
> > > +                               move_up = true;
> > > +                       }
> > > +                       ++i;
> > > +               }));
> > > +       }
> > > +}
> > > +
> > >  static void find_equal_scalars(struct bpf_verifier_state *vstate,
> > >                                struct bpf_reg_state *known_reg)
> > >  {
> > >         struct bpf_func_state *state;
> > >         struct bpf_reg_state *reg;
> > > +       int count = 0;
> > > 
> > >         bpf_for_each_reg_in_vstate(vstate, state, reg, ({
> > > -               if (reg->type == SCALAR_VALUE && reg->id == known_reg->id)
> > > +               if (reg->type == SCALAR_VALUE && reg->id == known_reg->id) {
> > >                         *reg = *known_reg;
> > > +                       ++count;
> > > +               }
> > >         }));
> > > +
> > > +       /* Count equal to 1 means that find_equal_scalars have not
> > > +        * found any registers with the same ID (except self), thus
> > > +        * the range knowledge have not been transferred and there is
> > > +        * no need to preserve registers with the same ID in a parent
> > > +        * state.
> > > +        */
> > > +       if (count > 1)
> > > +               mark_equal_scalars_as_read(vstate->parent, known_reg->id);
> > >  }
> > > 
> > >  static int check_cond_jmp_op(struct bpf_verifier_env *env,
> > > @@ -12878,6 +12950,12 @@ static bool regsafe(struct bpf_verifier_env *env, struct bpf_reg_state *rold,
> > >                  */
> > >                 return equal && rold->frameno == rcur->frameno;
> > > 
> > > +       /* even if two registers are identical the id mapping might diverge
> > > +        * e.g. rold{.id=1}, rcur{.id=1}, idmap{1->2}
> > > +        */
> > > +       if (equal && rold->type == SCALAR_VALUE && rold->id)
> > > +               return check_ids(rold->id, rcur->id, idmap);
> > 
> > nit: let's teach check_ids() to handle the id == 0 case properly
> > instead of guarding everything with `if (rold->id)`?
> > 
> > but also I think this applies not just to SCALARs, right? the memcmp()
> > check above has to be augmented with check_ids() for id and ref_obj_id
> 
> Yes, it is the same issue as described in [1] as you pointed out.
> I'll updated it for other branches, but I want the main issue to
> be sorted out first.
> 
> [1] https://lore.kernel.org/bpf/CAEf4BzbFB5g4oUfyxk9rHy-PJSLQ3h8q9mV=rVoXfr_JVm8+1Q@mail.gmail.com/
> 
> > 
> > > +
> > >         if (equal)
> > >                 return true;
> > > 
> > > @@ -12891,6 +12969,11 @@ static bool regsafe(struct bpf_verifier_env *env, struct bpf_reg_state *rold,
> > >                 if (env->explore_alu_limits)
> > >                         return false;
> > >                 if (rcur->type == SCALAR_VALUE) {
> > > +                       /* id relations must be preserved, see comment in
> > > +                        * mark_equal_scalars_as_read() for SCALAR_VALUE example.
> > > +                        */
> > > +                       if (rold->id && !check_ids(rold->id, rcur->id, idmap))
> > > +                               return false;
> > >                         if (!rold->precise)
> > >                                 return true;
> > >                         /* new val must satisfy old val knowledge */
> > > --
> > > 2.34.1
> > > 
>