* [RFC bpf-next] Speedup for verifier.c:do_misc_fixups
@ 2022-06-21 23:30 Eduard Zingerman
  2022-06-23  3:11 ` Alexei Starovoitov
  0 siblings, 1 reply; 5+ messages in thread
From: Eduard Zingerman @ 2022-06-21 23:30 UTC (permalink / raw)
  To: bpf, ast, andrii, daniel, kernel-team

Hi Everyone,

Below I describe a scenario in which the function `verifier.c:do_misc_fixups`
exhibits O(n**2) running time for certain BPF programs. I also describe
a possible adjustment to `verifier.c:do_misc_fixups` that avoids
this behavior. I can work on this adjustment in my spare time over the
next few weeks if the community is interested.

The function `verifier.c:do_misc_fixups` uses
`verifier.c:bpf_patch_insn_data` to replace single instructions with a
series of instructions. `verifier.c:bpf_patch_insn_data` is a
wrapper for the function `core.c:bpf_patch_insn_single`. The latter
function operates in the following manner:
- allocates a new instruction array of size `old size + patch size`;
- copies the old instructions;
- copies the patch instructions;
- adjusts the offset fields of all instructions.
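
Roughly, each such patch application amounts to the following (a
simplified user-space sketch, not the actual kernel code; the function
name is made up):

/* Illustrative sketch of the per-patch cost: every patch allocates a
 * new array, copies the whole program, and is followed by a full pass
 * over all instructions to adjust jump offsets crossing the patch.
 */
static struct bpf_insn *patch_single_sketch(struct bpf_insn *prog, int prog_len,
                                            int off, const struct bpf_insn *patch,
                                            int patch_len)
{
  int new_len = prog_len + patch_len - 1; /* one insn is replaced */
  struct bpf_insn *new_prog = calloc(new_len, sizeof(*new_prog));

  if (!new_prog)
    return NULL;
  memcpy(new_prog, prog, off * sizeof(*prog));                /* head  */
  memcpy(new_prog + off, patch, patch_len * sizeof(*patch));  /* patch */
  memcpy(new_prog + off + patch_len, prog + off + 1,
         (prog_len - off - 1) * sizeof(*prog));               /* tail  */
  /* ... then an O(prog_len) offset-adjustment pass over new_prog ... */
  return new_prog;
}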

This means that for pathological BPF programs
`verifier.c:do_misc_fixups` exhibits running time
proportional to the square of the program length.
An example of such a program looks as follows:

BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_jiffies64);
... N times ...
BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_jiffies64);
BPF_ALU64_IMM(BPF_MOV, BPF_REG_0, 0);
BPF_EXIT_INSN();

`verifier.c:do_misc_fixups` replaces each call to jiffies with 3
instructions. Hence the code above leads to the copying of roughly
(N + (N+2) + (N+4) + ... + (N+2N)) instructions, which is O(n**2).
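
(As a rough closed form: that sum has N+1 terms and equals
N*(N+1) + 2*(1 + 2 + ... + N) = 2*N*(N+1), i.e. about 2*N**2 copied
instructions, or roughly 16*N**2 bytes at 8 bytes per instruction,
before the per-patch offset-adjustment passes are counted.)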

Experimental results confirm this observation.  Here is the output of
the demo program:

  prog len     verif time usec
       128          1461
       256 x2       4132 x2.8
       512 x2      12510 x3.0
      1024 x2      44022 x3.5
      2048 x2     170479 x3.9
      4096 x2     646766 x3.8
      8192 x2    2557379 x4.0

The source code for this program is at the end of this email.

The following technique could be used to improve the running time of
`verifier.c:do_misc_fixups`:
- patches are not applied immediately but are accumulated;
- patches are stored in an intermediate array; the size of this array
  is doubled when the capacity for a new patch is insufficient;
- all patches are applied at once at the end of `verifier.c:do_misc_fixups`;
- an intermediate data structure is constructed to efficiently map
  between old and new instruction indexes;
- instruction offset fields are updated using this data structure.

In terms of C code, this could look as follows:

/* Describes a single patch:
 * BPF instruction at index `offset` is replaced by
 * a series of the instructions pointed to by `insns`.
 */
struct bpf_patch {
  int offset;             /* index inside the original program,
                           * the instruction at this index would be replaced.
                           */
  int insns_count;        /* size of the patch */
  int delta;              /* difference in indexes between original program and
                           * new program after application of all patches up to
                           * and including this one.
                           */
  struct bpf_insn *insns; /* patch instructions */
};

/* Used to accumulate patches */
struct bpf_patching_state {
  struct bpf_patch *patches; /* array of all patches */
  int patches_count;         /* current number of patches */
  int patches_capacity;      /* total capacity of the patches array */
  struct bpf_insn *insns;    /* array of patch instructions,
                              * bpf_patch::insns points to the middle of this array
                              */
  int insns_count;           /* first free index in the instructions array */
  int insns_capacity;        /* total capacity of the instructions array */
};

/* Allocates `patches` and `insns` arrays with some initial capacity */
static int init_patching_state(struct bpf_patching_state *state)
{ ... }

/* Adds a patch record to the `bpf_patching_state::patches` array and
 * copies `insns` to the end of the `bpf_patching_state::insns` array.
 * Whenever the capacity of either array is insufficient, its size is
 * doubled.
 *
 * It is assumed that patches are added in increasing order of
 * the `bpf_patch::offset` field.
 */
static int add_patch(struct bpf_patching_state *state,
                     int offset,
                     int insns_count,
                     struct bpf_insn *insns)
{ ... }

/* - fills in the `bpf_patch::delta` fields for all patches in `state`.
 * - allocates new program
 * - copies program and patch instructions using the `patch_and_copy_insn` function
 */
static struct bpf_insn* apply_patches(struct bpf_patching_state *state,
                                      struct bpf_insn *prog,
                                      int *prog_size)
{
  int delta = 0;
  for (int i = 0; i < state->patches_count; ++i) {
    delta += state->patches[i].insns_count - 1;
    state->patches[i].delta = delta;
  }
  ...
}

/* Uses binary search to find `bpf_patch::delta` corresponding to `index`.
 * `index` stands for the index of an instruction in the original program.
 * Looks for the closest `state->patches[?]->offset <= index` and returns its `delta`.
 */
static int lookup_delta_for_index(struct bpf_patching_state *state, int index)
{ ... }

/* Copies instruction at `src` to instruction at `dst` and adjusts `dst->off` field.
 * `pc` stands for the instruction index of `src` in the original program.
 */
static void patch_and_copy_insn(struct bpf_patching_state *state,
                                int pc,
                                struct bpf_insn *dst,
                                struct bpf_insn *src)
{
  int new_pc = pc + lookup_delta_for_index(state, pc);
  int new_dst = pc + dst->off + lookup_delta_for_index(state, pc + dst->off);

  memcpy(dst, src, sizeof(struct bpf_insn));
  dst->off = new_dst - new_pc;
}
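
For context, the intended call sequence inside
`verifier.c:do_misc_fixups` would be roughly the following (a sketch
only; `insns`, `insn_cnt`, `needs_rewrite` and `emit_rewrite` are
illustrative stand-ins for the existing verifier state and rewrite
logic):

/* Hypothetical call sequence (illustrative only): collect all patches
 * first, then rebuild the program and fix up offsets exactly once.
 */
struct bpf_patching_state state;

if (init_patching_state(&state))
  return -ENOMEM;
for (int i = 0; i < insn_cnt; i++) {
  struct bpf_insn patch[16]; /* illustrative upper bound on patch size */
  int patch_len;

  if (!needs_rewrite(&insns[i]))              /* illustrative predicate */
    continue;
  patch_len = emit_rewrite(&insns[i], patch); /* illustrative emitter */
  if (add_patch(&state, i, patch_len, patch))
    return -ENOMEM;
}
insns = apply_patches(&state, insns, &insn_cnt); /* single rebuild pass */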

Please let me know what you think about this change and whether or not
it should be implemented.

Best regards,
Eduard Zingerman

---
// Demo program
---

#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/resource.h>
#include <errno.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#include <bpf/bpf.h>
#include <bpf/libbpf.h>
#include <linux/filter.h>

#define BPF_FUNC_jiffies64 118

const int AVG_TRIES = 50;
const int MAX_ITERS = 6;
int verbose = 0;

static void stop(char* description) {
  fprintf(stderr, "%s: %s\n", description, strerror(errno));
  exit(1);
}

static struct bpf_insn *gen_prog(int size, int *real_size) {
  int i = 0;
  *real_size = size + 2;
  struct bpf_insn *prog = calloc(*real_size, sizeof(struct bpf_insn));

  if (!prog)
    stop("can't allocate prog");

  while (i < size)
    prog[i++] = BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_jiffies64);

  prog[i++] = BPF_ALU64_IMM(BPF_MOV, BPF_REG_0, 0);
  prog[i++] = BPF_EXIT_INSN();

  return prog;
}

static int get_verif_time_usec(struct bpf_insn *prog, int real_prog_len) {
  LIBBPF_OPTS(bpf_prog_load_opts, opts);
  char log[4096];

  opts.log_level = verbose ? 1 | 4 : 4;
  opts.log_buf = log;
  opts.log_size = sizeof(log);
 
  int prog_fd = bpf_prog_load(BPF_PROG_TYPE_TRACEPOINT,
                              NULL, "GPL", prog, real_prog_len, &opts);
  log[sizeof(log)-1] = 0;

  if (verbose)
    printf("--- log start ---\n%s--- log end   ---\n", log);

  if (prog_fd < 0)
    stop("can't load prog");
  
  close(prog_fd);

  int verif_time_usec = 0;
  if (sscanf(log, "verification time %d usec", &verif_time_usec) != 1)
    stop("can't get verification time from log");

  return verif_time_usec;
}

static int cmpint(const void *a, const void *b) {
  return *(int*)a - *(int*)b;
}

static int get_avg_verif_time_usec(int prog_len) {
  int results[AVG_TRIES];
  int real_prog_len;
  struct bpf_insn *prog = gen_prog(prog_len, &real_prog_len);
  for (int i = 0; i < AVG_TRIES; ++i) {
    results[i] = get_verif_time_usec(prog, real_prog_len);
  }
  free(prog);
  qsort(results, AVG_TRIES, sizeof(int), cmpint);
  int total_usec = 0;
  int samples_num = 0;
  for (int i = AVG_TRIES / 4; i < (3 * AVG_TRIES / 4); ++i) {
    total_usec += results[i];
    samples_num += 1;
  }
  return total_usec / samples_num;
}

int libbpf_print_fn(enum libbpf_print_level level, const char *format, va_list args) {
  return vfprintf(stderr, format, args);
}

int main(int argc, char **argv) {
  libbpf_set_print(libbpf_print_fn);
  printf("%10s     %s\n", "prog len", "verif time usec");
  int prev_prog_len = -1;
  int prev_time = -1;
  for (int i = 0; i <= MAX_ITERS; ++i) {
    int prog_len = (1 << i) * 128;
    int avg_verif_time_usec = get_avg_verif_time_usec(prog_len);
    if (prev_prog_len >= 0)
      printf("%10d x%1.f %10d x%1.1f\n",
             prog_len,
             ((double)prog_len / prev_prog_len),
             avg_verif_time_usec,
             ((double)avg_verif_time_usec / prev_time));
    else
      printf("%10d    %10d\n", prog_len, avg_verif_time_usec);
    prev_prog_len = prog_len;
    prev_time = avg_verif_time_usec;
  }
  return 0;
}



* Re: [RFC bpf-next] Speedup for verifier.c:do_misc_fixups
  2022-06-21 23:30 [RFC bpf-next] Speedup for verifier.c:do_misc_fixups Eduard Zingerman
@ 2022-06-23  3:11 ` Alexei Starovoitov
  2022-06-23 14:02   ` Eduard Zingerman
  0 siblings, 1 reply; 5+ messages in thread
From: Alexei Starovoitov @ 2022-06-23  3:11 UTC (permalink / raw)
  To: Eduard Zingerman; +Cc: bpf, ast, andrii, daniel, kernel-team

On Wed, Jun 22, 2022 at 02:30:21AM +0300, Eduard Zingerman wrote:
> Hi Everyone,
> 
> Below I describe a scenario when function `verifier.c:do_misc_fixups`
> exhibits O(n**2) performance for certain BPF programs. I also describe
> a possible adjustment for the `verifier.c:do_misc_fixups` to avoid
> such behavior. I can work on this adjustment in my spare time for a
> few weeks if the community is interested.
> 
> The function `verifier.c:do_misc_fixups` uses
> `verifier.c:bpf_patch_insn_data` to replace single instructions with a
> series of instructions. The `verifier.c:bpf_patch_insn_data` is a
> wrapper for the function `core.c:bpf_patch_insn_single`. The latter
> function operates in the following manner:
> - allocates new instructions array of size `old size + patch size`;
> - copies old instructions;
> - copies patch instructions;
> - adjusts offset fields for all instructions.
> 
> This algorithm means that for pathological BPF programs the
> `verifier.c:do_misc_fixups` would demonstrate running time
> proportional to the square of the program length.
> An example of such program looks as follows:
> 
> BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_jiffies64);
> ... N times ...
> BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_jiffies64);
> BPF_ALU64_IMM(BPF_MOV, BPF_REG_0, 0);
> BPF_EXIT_INSN();
> 
> `verifier.c:do_misc_fixups` replaces each call to jiffies by 3
> instructions. Hence the code above would lead to the copying of
> (N + N+2 + N+4 ... + N+2N) bytes, which is O(n**2).
> 
> Experimental results confirm this observation.  Here is the output of
> the demo program:
> 
>   prog len     verif time usec
>        128          1461
>        256 x2       4132 x2.8
>        512 x2      12510 x3.0
>       1024 x2      44022 x3.5
>       2048 x2     170479 x3.9
>       4096 x2     646766 x3.8
>       8192 x2    2557379 x4.0
> 
> The source code for this program is at the end of this email.
> 
> The following technique could be used to improve the running time of
> the `verifier.c:do_misc_fixups`:
> - patches are not applied immediately but are accumulated;
> - patches are stored in the intermediate array; the size of this array
>   is doubled when the capacity for the new patch is insufficient;
> - patches are applied at once at the end of the `verifier.c:do_misc_fixups`;
> - intermediate data structure is constructed to efficiently map
>   between old and new instruction indexes;
> - instruction offset fields are updated using this data structure.
> 
> In terms of the C code, this could look as follows:
> 
> /* Describes a single patch:
>  * BPF instruction at index `offset` is replaced by
>  * a series of the instructions pointed to by `insns`.
>  */
> struct bpf_patch {
>   int offset;             /* index inside the original program,
>                            * the instruction at this index would be replaced.
>                            */
>   int insns_count;        /* size of the patch */
>   int delta;              /* difference in indexes between original program and
>                            * new program after application of all patches up to
>                            * and including this one.
>                            */
>   struct bpf_insn *insns; /* patch instructions */
> };
> 
> /* Used to accumulate patches */
> struct bpf_patching_state {
>   struct bpf_patch *patches; /* array of all patches*/
>   int patches_count;         /* current amount of patches */
>   int patches_capacity;      /* total capacity of the patches array */
>   struct bpf_insn *insns;    /* array of patch instructions,
>                               * bpf_patch::insns points to the middle of this array
>                               */
>   int insns_count;           /* first free index in the instructions array */
>   int insns_capacity;        /* total capacity of the instructions array */
> };
> 
> /* Allocates `patches` and `insns` arrays with some initial capacity */
> static int init_patching_state(struct bpf_patching_state *state)
> { ... }
> 
> /* Adds a patch record to the `bpf_patching_state::patches` array.
>  * If array capacity is insufficient, its size is doubled.
>  * Copies `insns` to the end of the `bpf_patching_state::insns`.
>  * If array capacity is insufficient, its size is doubled.
>  *
>  * It is assumed that patches are added in increasing order of
>  * the `bpf_patch::offset` field.
>  */
> static int add_patch(struct bpf_patching_state *state,
>                      int offset,
>                      int insns_count,
>                      struct bpf_insn *insns)
> { ... }
> 
> /* - fills in the `bpf_patch::delta` fields for all patches in `state`.
>  * - allocates new program
>  * - copies program and patch instructions using the `patch_and_copy_insn` function
>  */
> static struct bpf_insn* apply_patches(struct bpf_patching_state *state,
>                                       struct bpf_insn *prog,
>                                       int *prog_size)
> {
>   int delta = 0;
>   for (int i = 0; i < state->patches_count; ++i) {
>     delta += state->patches[i].insns_count - 1;
>     state->patches[i].delta = delta;
>   }
>   ...
> }
> 
> /* Uses binary search to find `bpf_patch::delta` corresponding to `index`.
>  * `index` stands for the index of instruction in the original program.
>  * Looks for the closest `state->patches[?]->offset <= index` and returns it's `delta`.
>  */
> static int lookup_delta_for_index(struct bpf_patching_state *state, int index)
> { ... }
> 
> /* Copies instruction at `src` to instruction at `dst` and adjusts `dst->off` field.
>  * `pc` stands for the instruction index of `src` in the original program.
>  */
> static void patch_and_copy_insn(struct bpf_patching_state *state,
>                                 int pc,
>                                 struct bpf_insn *dst,
>                                 struct bpf_insn *src)
> {
>   int new_pc = pc + lookup_delta_for_index(state, pc);
>   int new_dst = pc + dst->off + lookup_delta_for_index(state, pc + dst->off);
> 
>   memcpy(dst, src, sizeof(struct bpf_insn));
>   dst->off = new_dst - new_pc;
> }
> 
> Please let me know what you think about this change and whether or not
> it should be implemented.

tbh sounds scary. We had so many bugs in the patch_insn over years.
The code is subtle.
The main overhead is from bpf_adj_branches() as you noticed.
We might even do it twice for programs larger than S16_MAX resulting
in O(n^3) behavior.
There is also bpf_adj_linfo, adjust_insn_aux_data, adjust_subprog_starts,
adjust_poke_descs.
Every time we patch.
It's certainly something worth optimizing if we can.
Before proceeding too far based on artificial test please collect
the number of patches and their sizes we currently do across all progs
in selftests. Maybe it's not that bad in real code.

As far as algo the patch_and_copy_insn() assumes that 'dst' insn is a branch?
I guess I'm missing some parts of the proposed algo.

Instead of delaying everything maybe we can do all patching inline
except bpf_adj_branches?
Remember only:
   int offset;             /* index inside the original program,
                            * the instruction at this index would be replaced.
                            */
   int insns_count;        /* size of the patch */
   int delta;              /* difference in indexes between original program and
                            * new program after application of all patches up to
                            * and including this one.
                            */
and do single bpf_adj_branches at the end ?


* Re: [RFC bpf-next] Speedup for verifier.c:do_misc_fixups
  2022-06-23  3:11 ` Alexei Starovoitov
@ 2022-06-23 14:02   ` Eduard Zingerman
  2022-06-24 18:39     ` Alexei Starovoitov
  0 siblings, 1 reply; 5+ messages in thread
From: Eduard Zingerman @ 2022-06-23 14:02 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: bpf, ast, andrii, daniel, kernel-team

> On Wed, 2022-06-22 at 20:11 -0700, Alexei Starovoitov wrote:
[...]
> tbh sounds scary. We had so many bugs in the patch_insn over years.
> The code is subtle.

In terms of testing strategy the following could be done:
- use pseudo-random testing to verify that `bpf_patch_insn_data` and
  the mechanism suggested in my email produce identical code.
- use pseudo-random testing to verify that offsets after rewrite point
  to expected instructions (e.g. use .imm field as a unique marker for
  the instruction).
- hand-written tests for corner cases.
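
A sketch of the differential check could look as follows (user-space
pseudo-code; every helper name below is a placeholder, not an existing
API):

/* Pseudo-random differential test (all helpers are placeholders):
 * both patching mechanisms must produce byte-identical programs.
 */
for (int trial = 0; trial < NUM_TRIALS; trial++) {
	int len_a, len_b;
	struct test_case tc = gen_random_prog_and_patches(&rng);

	struct bpf_insn *a = apply_with_bpf_patch_insn_data(&tc, &len_a);
	struct bpf_insn *b = apply_with_accumulated_patches(&tc, &len_b);

	assert(len_a == len_b);
	assert(memcmp(a, b, len_a * sizeof(*a)) == 0);
	free(a);
	free(b);
}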

Would you consider this to be sufficient?  (Probably does not sound
too reassuring coming from the person [me] who submits patches with
trivial use-after-free errors :)

[...]

> Before proceeding too far based on artificial test please collect
> the number of patches and their sizes we currently do across all progs
> in selftests. Maybe it's not that bad in real code.

I will collect and share the stats on Saturday or Sunday.

> As far as algo the patch_and_copy_insn() assumes that 'dst' insn is a branch?
> I guess I'm missing some parts of the proposed algo.

Sorry, I made a mistake in the original email, the code for
patch_and_copy_insn() should look as follows:

static void patch_and_copy_insn(struct bpf_patching_state *state, int pc,
				struct bpf_insn *dst, struct bpf_insn *src) {
	memcpy(dst, src, sizeof(struct bpf_insn));
	// TODO: other classes
	// TODO: also adjust imm for calls
	if (BPF_CLASS(src->code) == BPF_JMP) {
		int new_pc = pc + lookup_delta_for_index(state, pc);
		int dst_pc = pc + src->off + 1;
		int new_dst = dst_pc + lookup_delta_for_index(state, dst_pc);
		dst->off = new_dst - new_pc - 1;
	}
}


The logic is as follows:
- compute new instruction index for the old pc
- compute new instruction index for the (old pc + offset)
- use these values to compute the new offset
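
For example (made-up numbers): take a jump at pc=4 with off=3, i.e. the
old target is 4 + 3 + 1 = 8. Suppose a 3-insn patch was recorded at
index 1 (cumulative delta 2) and a 2-insn patch at index 6 (cumulative
delta 3). Then new_pc = 4 + 2 = 6, new_dst = 8 + 3 = 11, and the
rewritten offset is 11 - 6 - 1 = 4, which again points from the
relocated jump to the relocated target.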

The draft implementation of this algorithm is at the end of this
email.

> Instead of delaying everything maybe we can do all patching inline
> except bpf_adj_branches?
> Remember only:
>    int offset;             /* index inside the original program,
>                             * the instruction at this index would be replaced.
>                             */
>    int insns_count;        /* size of the patch */
>    int delta;              /* difference in indexes between original program and
>                             * new program after application of all patches up to
>                             * and including this one.
>                             */
> and do single bpf_adj_branches at the end ?

The algorithm consists of two parts:
- To avoid excessive copying, patches are accumulated in a separate
  array, and the size of this array is doubled each time its capacity
  is not sufficient to fit a new patch. Doubling keeps the number of
  reallocations at O(log(n)), and the total amount of data copied
  across all reallocations is bounded by the final array size,
  i.e. O(n).
- To avoid excessive branch adjustment passes, a single branch
  adjustment pass is performed at the end. This pass visits each
  instruction only once; however, for each instruction it has to
  query the delta value in a sorted array, so the overall
  complexity of this pass is O(n*log(n)). It is possible to
  adjust some relevant fields in `env` during this pass as well.

If the first part is skipped and patching is done inline, then for
each patch:
- `realloc` is invoked (relatively cheap?)
- `memcpy` is invoked to copy the head and tail of the current
  program.  This has O(n**2) worst case complexity in terms of copied
  memory.

Thanks,
Eduard

---

/* Draft implementation of the proposed algorithm, written as user
 * space code, thus it calls bzero, calloc, etc. It probably still has
 * bugs, but I'll figure them out. */
   

#define _GNU_SOURCE /* for reallocarray() */

#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <strings.h>

#define BPF_CLASS(code) ((code)&0x07)
#define BPF_JMP 0x05

struct bpf_insn {
	uint8_t code;
	uint8_t dst_reg : 4;
	uint8_t src_reg : 4;
	int16_t off;
	int32_t imm;
};

struct bpf_patch {
	int offset;
	int insns_count;
	int delta;
	struct bpf_insn *insns;
};

struct bpf_patching_state {
	struct bpf_patch *patches;
	int patches_count;
	int patches_capacity;
	struct bpf_insn *insns;
	int insns_count;
	int insns_capacity;
};

static int init_patching_state(struct bpf_patching_state *state) {
	bzero(state, sizeof(*state));
	state->patches_count = 0;
	state->patches_capacity = 1; // small number, just for testing
	state->patches = calloc(state->patches_capacity, sizeof(struct bpf_patch));
	if (!state->patches)
		goto err;
	state->insns_count = 0;
	state->insns_capacity = 2; // small number, just for testing
	state->insns = calloc(state->insns_capacity, sizeof(struct bpf_insn));
	if (!state->insns)
		goto err;

	return 0;

err:
	if (state->patches)
		free(state->patches);
	if (state->insns)
		free(state->insns);
	return -ENOMEM;
}

static void free_patching_state(struct bpf_patching_state *state) {
	free(state->patches);
	free(state->insns);
	bzero(state, sizeof(*state));
}

static int add_patch(struct bpf_patching_state *state, int offset,
		     int insns_count, struct bpf_insn *insns)
{
	if (state->patches_count == state->patches_capacity) {
		struct bpf_patch *new_patches =
			reallocarray(state->patches, state->patches_capacity * 2,
				         sizeof(struct bpf_patch));
		if (!new_patches)
			return -ENOMEM;
		state->patches = new_patches;
		state->patches_capacity *= 2;
	}

	int insns_available = state->insns_capacity - state->insns_count;
	if (insns_available < insns_count) {
		int cur_capacity = (state->insns_capacity > insns_count
							? state->insns_capacity
							: insns_count);
		int new_capacity = cur_capacity * 2;
		struct bpf_insn *new_insns =
			reallocarray(state->insns, new_capacity, sizeof(struct bpf_insn));
		if (!new_insns)
			return -ENOMEM;
		state->insns = new_insns;
		state->insns_capacity = new_capacity;
	}

	struct bpf_patch *patch = &state->patches[state->patches_count];
	struct bpf_insn *dest_insns = &state->insns[state->insns_count];
	state->patches_count += 1;
	state->insns_count += insns_count;
	patch->offset = offset;
	patch->insns_count = insns_count;
	/* Note: this keeps pointing at the caller's buffer; the copy made
	 * into state->insns below is not yet used by apply_patches() in
	 * this draft.
	 */
	patch->insns = insns;
	patch->delta = 0;
	memcpy(dest_insns, insns, insns_count * sizeof(struct bpf_insn));

	return 0;
}

static int lookup_delta_for_index(struct bpf_patching_state *state, int index)
{
	struct bpf_patch *patches = state->patches;

	if (state->patches_count == 0 || index < patches[0].offset)
		return 0;

	if (state->patches_count == 1)
		return patches[0].delta;

	int l = 0;
	int r = state->patches_count - 1;

	while (r - l != 1) {
		int m = l + (r - l) / 2;
		if (patches[m].offset < index)
			l = m;
		else
			r = m;
	}

	if (patches[r].offset < index)
		return patches[r].delta;
	else
		return patches[l].delta;
}

static void patch_and_copy_insn(struct bpf_patching_state *state, int pc,
				struct bpf_insn *dst, struct bpf_insn *src) {
	memcpy(dst, src, sizeof(struct bpf_insn));
	// TODO: other classes
	// TODO: also adjust imm for calls
	if (BPF_CLASS(src->code) == BPF_JMP) {
		int new_pc = pc + lookup_delta_for_index(state, pc);
		int dst_pc = pc + src->off + 1;
		int new_dst = dst_pc + lookup_delta_for_index(state, dst_pc);
		dst->off = new_dst - new_pc - 1;
	}
}

static struct bpf_insn *apply_patches(struct bpf_patching_state *state,
				      struct bpf_insn *prog, int *prog_size)
{
	int delta = 0;
	for (int i = 0; i < state->patches_count; ++i) {
		delta += state->patches[i].insns_count - 1;
		state->patches[i].delta = delta;
	}
	int old_prog_size = *prog_size;
	int new_prog_size = old_prog_size + delta;
	struct bpf_insn *new_prog = calloc(new_prog_size, sizeof(struct bpf_insn));
	if (!new_prog)
		return NULL;

	*prog_size = new_prog_size;

	struct bpf_insn *old_insn = prog;
	struct bpf_insn *new_insn = new_prog;
	int next_patch = 0;
	for (int i = 0; i < old_prog_size; ++i) {
		// TODO: deal with double-word immediate values correctly
		if (next_patch < state->patches_count &&
		    state->patches[next_patch].offset == i) {
			struct bpf_patch *patch = &state->patches[next_patch];
			for (int j = 0; j < patch->insns_count; ++j) {
				patch_and_copy_insn(state, i + j,
									new_insn, &patch->insns[j]);
				++new_insn;
			}
			++next_patch;
		} else {
			patch_and_copy_insn(state, i, new_insn, old_insn);
			++new_insn;
		}
		++old_insn;
	}

	return new_prog;
}

static void show_prog(struct bpf_insn *prog, int cnt) {
	for (int i = 0; i < cnt; ++i) {
		printf("%02d: .code = %02d, .off = %+03d, .imm = %+03d", i, prog[i].code,
		       prog[i].off, prog[i].imm);
		if (BPF_CLASS(prog[i].code) == BPF_JMP) {
			int jmp_idx = i + prog[i].off + 1;
			printf(", jmp to %02d .imm = %+03d", jmp_idx, prog[jmp_idx].imm);
		}
		printf("\n");
	}
}

int main(int argc, char **argv) {
	struct bpf_insn prog[] = {
		{.code = 0, .imm = 0},
		{.code = BPF_JMP, .off = 3, .imm = 1},
		{.code = 0, .imm = 2},
		{.code = 0, .imm = 3},
		{.code = BPF_JMP, .off = 1, .imm = 4},
		{.code = 0, .imm = 5},
		{.code = 0, .imm = 6},
		{.code = BPF_JMP, .off = -5, .imm = 7},
		{.code = 0, .imm = 8},
		{.code = BPF_JMP, .off = 0, .imm = 9},
		{.code = 0, .imm = 10},
		{.code = BPF_JMP, .off = -11, .imm = 11},
	};
	int cnt = sizeof(prog) / sizeof(prog[0]);
	struct bpf_patching_state st;
	init_patching_state(&st);
	struct bpf_insn p1[] = {{.imm = -10}, {.imm = -11}};
	add_patch(&st, 0, 2, p1);
	struct bpf_insn p2[] = {
		{.imm = -20},
		{.imm = -21},
		{.imm = -22},
	};
	add_patch(&st, 2, 3, p2);
	struct bpf_insn p3[] = {
		{.imm = -30},
		{.imm = -31},
		{.code = BPF_JMP, .off = 1, .imm = -32}, // jump out of patch by 1 insn
		{.imm = -33}
	};
	add_patch(&st, 8, 4, p3);
	int new_prog_cnt = cnt;
	struct bpf_insn *new_prog = apply_patches(&st, prog, &new_prog_cnt);
	free_patching_state(&st);

	printf("Prog before:\n");
	show_prog(prog, cnt);
	printf("\nProg after:\n");
	show_prog(new_prog, new_prog_cnt);
	free(new_prog);

	return 0;
}


* Re: [RFC bpf-next] Speedup for verifier.c:do_misc_fixups
  2022-06-23 14:02   ` Eduard Zingerman
@ 2022-06-24 18:39     ` Alexei Starovoitov
  2022-06-24 18:57       ` Andrii Nakryiko
  0 siblings, 1 reply; 5+ messages in thread
From: Alexei Starovoitov @ 2022-06-24 18:39 UTC (permalink / raw)
  To: Eduard Zingerman
  Cc: bpf, ast, andrii, daniel, kernel-team, john.fastabend, yhs

On Thu, Jun 23, 2022 at 05:02:34PM +0300, Eduard Zingerman wrote:
> > On Wed, 2022-06-22 at 20:11 -0700, Alexei Starovoitov wrote:
> [...]
> > tbh sounds scary. We had so many bugs in the patch_insn over years.
> > The code is subtle.
> 
> In terms of testing strategy the following could be done:
> - use pseudo-random testing to verify that `bpf_patch_insn_data` and
>   the mechanism suggested in my email produce identical code.
> - use pseudo-random testing to verify that offsets after rewrite point
>   to expected instructions (e.g. use .imm field as a unique marker for
>   the instruction).
> - hand-written tests for corner cases.
> 
> Would you consider this to be sufficient?  (Probably does not sound
> too reassuring from the person[me] who submits patches with trivial
> use after free errors :)
> 
> [...]
> 
> > Before proceeding too far based on artificial test please collect
> > the number of patches and their sizes we currently do across all progs
> > in selftests. Maybe it's not that bad in real code.
> 
> I will collect and share the stats on Saturday or Sunday.
> 
> > As far as algo the patch_and_copy_insn() assumes that 'dst' insn is a branch?
> > I guess I'm missing some parts of the proposed algo.
> 
> Sorry, I made a mistake in the original email, the code for
> patch_and_copy_insn() should look as follows:
> 
> static void patch_and_copy_insn(struct bpf_patching_state *state, int pc,
> 				struct bpf_insn *dst, struct bpf_insn *src) {
> 	memcpy(dst, src, sizeof(struct bpf_insn));
> 	// TODO: other classes
> 	// TODO: also adjust imm for calls
> 	if (BPF_CLASS(src->code) == BPF_JMP) {
> 		int new_pc = pc + lookup_delta_for_index(state, pc);
> 		int dst_pc = pc + src->off + 1;
> 		int new_dst = dst_pc + lookup_delta_for_index(state, dst_pc);
> 		dst->off = new_dst - new_pc - 1;
> 	}
> }
> 
> 
> The logic is as follows:
> - compute new instruction index for the old pc
> - compute new instruction index for the (old pc + offset)
> - use these values to compute the new offset
> 
> The draft implementation of this algorithm is at the end of this
> email.
> 
> > Instead of delaying everything maybe we can do all patching inline
> > except bpf_adj_branches?
> > Remember only:
> >    int offset;             /* index inside the original program,
> >                             * the instruction at this index would be replaced.
> >                             */
> >    int insns_count;        /* size of the patch */
> >    int delta;              /* difference in indexes between original program and
> >                             * new program after application of all patches up to
> >                             * and including this one.
> >                             */
> > and do single bpf_adj_branches at the end ?
> 
> The algorithm consists of two parts:
> - To avoid excessive copying patches are accumulated in a separate
>   array and size of this array is doubled each time the capacity is
>   not sufficient to fit a new patch. This should have O(log(n))
>   complexity in terms of copied bytes.
> - To avoid excessive branch adjustment passes a single branch
>   adjustment pass is performed at the end. This pass visits each
>   instruction only once, however, for each instruction it will have
>   to query the delta value in a sorted array. Thus the overall
>   complexity of this pass would be O(n*log(n)). It is possible to
>   adjust some relevant fields in `env` during this pass as well.

Before jumping into coding let's explore the alternatives.
Can we use some of the compiler techniques?
Like:
- split a set of insn into basic blocks (BB)
- convert each jmp from relative offset to fixed bb index
- perform patching of insns. No jmp adjustments are necessary.
  Every patch applies within bb. Minimal realloc+copy of insns within bb.
- reconstruct a set of insn from a set of bb-s.

John, Yonghong, everyone,
may be other ideas?
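
For illustration only, such a representation might look roughly like
this (made-up structs, not a concrete proposal):

/* Rough sketch of a basic-block representation (illustrative only):
 * jumps are stored as basic-block indexes, so growing one block does
 * not disturb offsets anywhere else; instruction offsets are
 * recomputed once, when the blocks are linearized back into a flat
 * program.
 */
struct bpf_bb {
	struct bpf_insn *insns;  /* instructions of this basic block */
	int insns_count;
	int insns_capacity;      /* a patch grows only this block */
	int fallthrough_bb;      /* successors as basic-block indexes ... */
	int jump_bb;             /* ... instead of relative insn offsets */
};

struct bpf_bb_prog {
	struct bpf_bb *bbs;
	int bb_count;
};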


* Re: [RFC bpf-next] Speedup for verifier.c:do_misc_fixups
  2022-06-24 18:39     ` Alexei Starovoitov
@ 2022-06-24 18:57       ` Andrii Nakryiko
  0 siblings, 0 replies; 5+ messages in thread
From: Andrii Nakryiko @ 2022-06-24 18:57 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Eduard Zingerman, bpf, Alexei Starovoitov, Andrii Nakryiko,
	Daniel Borkmann, Kernel Team, john fastabend, Yonghong Song

On Fri, Jun 24, 2022 at 11:39 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Thu, Jun 23, 2022 at 05:02:34PM +0300, Eduard Zingerman wrote:
> > > On Wed, 2022-06-22 at 20:11 -0700, Alexei Starovoitov wrote:
> > [...]
> > > tbh sounds scary. We had so many bugs in the patch_insn over years.
> > > The code is subtle.
> >
> > In terms of testing strategy the following could be done:
> > - use pseudo-random testing to verify that `bpf_patch_insn_data` and
> >   the mechanism suggested in my email produce identical code.
> > - use pseudo-random testing to verify that offsets after rewrite point
> >   to expected instructions (e.g. use .imm field as a unique marker for
> >   the instruction).
> > - hand-written tests for corner cases.
> >
> > Would you consider this to be sufficient?  (Probably does not sound
> > too reassuring from the person[me] who submits patches with trivial
> > use after free errors :)
> >
> > [...]
> >
> > > Before proceeding too far based on artificial test please collect
> > > the number of patches and their sizes we currently do across all progs
> > > in selftests. Maybe it's not that bad in real code.
> >
> > I will collect and share the stats on Saturday or Sunday.
> >
> > > As far as algo the patch_and_copy_insn() assumes that 'dst' insn is a branch?
> > > I guess I'm missing some parts of the proposed algo.
> >
> > Sorry, I made a mistake in the original email, the code for
> > patch_and_copy_insn() should look as follows:
> >
> > static void patch_and_copy_insn(struct bpf_patching_state *state, int pc,
> >                               struct bpf_insn *dst, struct bpf_insn *src) {
> >       memcpy(dst, src, sizeof(struct bpf_insn));
> >       // TODO: other classes
> >       // TODO: also adjust imm for calls
> >       if (BPF_CLASS(src->code) == BPF_JMP) {
> >               int new_pc = pc + lookup_delta_for_index(state, pc);
> >               int dst_pc = pc + src->off + 1;
> >               int new_dst = dst_pc + lookup_delta_for_index(state, dst_pc);
> >               dst->off = new_dst - new_pc - 1;
> >       }
> > }
> >
> >
> > The logic is as follows:
> > - compute new instruction index for the old pc
> > - compute new instruction index for the (old pc + offset)
> > - use these values to compute the new offset
> >
> > The draft implementation of this algorithm is at the end of this
> > email.
> >
> > > Instead of delaying everything maybe we can do all patching inline
> > > except bpf_adj_branches?
> > > Remember only:
> > >    int offset;             /* index inside the original program,
> > >                             * the instruction at this index would be replaced.
> > >                             */
> > >    int insns_count;        /* size of the patch */
> > >    int delta;              /* difference in indexes between original program and
> > >                             * new program after application of all patches up to
> > >                             * and including this one.
> > >                             */
> > > and do single bpf_adj_branches at the end ?
> >
> > The algorithm consists of two parts:
> > - To avoid excessive copying patches are accumulated in a separate
> >   array and size of this array is doubled each time the capacity is
> >   not sufficient to fit a new patch. This should have O(log(n))
> >   complexity in terms of copied bytes.
> > - To avoid excessive branch adjustment passes a single branch
> >   adjustment pass is performed at the end. This pass visits each
> >   instruction only once, however, for each instruction it will have
> >   to query the delta value in a sorted array. Thus the overall
> >   complexity of this pass would be O(n*log(n)). It is possible to
> >   adjust some relevant fields in `env` during this pass as well.
>
> Before jumping into coding let's explore the alternatives.
> Can we use some of the compiler techniques?
> Like:
> - split a set of insn into basic blocks (BB)
> - convert each jmp from relative offset to fixed bb index
> - perform patching of insns. No jmp adjustments are necessary.
>   Every patch applies within bb. Minimal realloc+copy of insns within bb.
> - reconstruct a set of insn from a set of bb-s.
>
> John, Yonghong, everyone,
> may be other ideas?

We've had a similar discussion ages (~3 years) ago. I had much more
context at that time, but I'll link to my latest proposal ([0]) within
that thread.

Roughly, the idea was along the lines of what Alexei proposed above,
but instead of basic blocks we kept all instructions as individual
linked-list nodes. So we'd need something like 2x the memory relative
to the current array of struct bpf_insns, but that seems totally
acceptable. And then, instead of updating relative indices, we'd keep a
mapping from the original instruction index to the latest real
instruction index. All this is pretty hazy for me at this point, but I
thought I'd link to that old discussion for completeness.

  [0] https://lore.kernel.org/bpf/CAEf4BzYDAVUgajz4=dRTu5xQDddp5pi2s=T1BdFmRLZjOwGypQ@mail.gmail.com/
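
For illustration only, the shape of such a linked-list representation
might be (made-up structs; not what the old proposal actually
specified):

/* Rough sketch (illustrative only): patching becomes node splicing,
 * and branch offsets are recomputed once at the end through the
 * original-index -> node mapping.
 */
struct bpf_insn_node {
	struct bpf_insn insn;
	struct bpf_insn_node *next;
};

struct bpf_insn_list {
	struct bpf_insn_node *head;
	struct bpf_insn_node **by_orig_idx; /* original insn index -> current node */
	int orig_len;
};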

