All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 4/4] file: Fix file descriptor leak in copy_fd_bitmaps()
@ 2022-03-26 11:40 Fedor Pchelkin
  2022-03-26 14:17 ` Alexey Khoroshilov
  0 siblings, 1 reply; 20+ messages in thread
From: Fedor Pchelkin @ 2022-03-26 11:40 UTC (permalink / raw)
  To: Alexander Viro
  Cc: Fedor Pchelkin, linux-fsdevel, linux-kernel, Alexey Khoroshilov

If count argument in copy_fd_bitmaps() is not a multiple of
BITS_PER_BYTE, then one byte is lost and is not used in further
manipulations with cpy value in memcpy() and memset()
causing a leak. The leak was introduced with close_range() call
using CLOSE_RANGE_UNSHARE flag.

The patch suggests implementing an indicator (named add_byte)
of count being multiple of BITS_PER_BYTE and adding it to the
cpy value.

Found by Syzkaller (https://github.com/google/syzkaller).

Signed-off-by: Fedor Pchelkin <aissur0002@gmail.com>
Signed-off-by: Alexey Khoroshilov <khoroshilov@ispras.ru>
---
 fs/file.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/fs/file.c b/fs/file.c
index 3ef1479df203..3c64a6423604 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -56,10 +56,8 @@ static void copy_fd_bitmaps(struct fdtable *nfdt, struct fdtable *ofdt,
 {
 	unsigned int cpy, set;
 	unsigned int add_byte = 0;
-	
 	if (count % BITS_PER_BYTE != 0)
 		add_byte = 1;
-	
 	cpy = count / BITS_PER_BYTE + add_byte;
 	set = (nfdt->max_fds - count) / BITS_PER_BYTE;
 	memcpy(nfdt->open_fds, ofdt->open_fds, cpy);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH 4/4] file: Fix file descriptor leak in copy_fd_bitmaps()
  2022-03-26 11:40 [PATCH 4/4] file: Fix file descriptor leak in copy_fd_bitmaps() Fedor Pchelkin
@ 2022-03-26 14:17 ` Alexey Khoroshilov
  2022-03-26 22:15   ` Linus Torvalds
  0 siblings, 1 reply; 20+ messages in thread
From: Alexey Khoroshilov @ 2022-03-26 14:17 UTC (permalink / raw)
  To: Fedor Pchelkin, Alexander Viro
  Cc: linux-fsdevel, linux-kernel, Linus Torvalds

Looks like bfp has a set of macro suitable for such cases:

#define BITS_PER_BYTE_MASKED(bits) ((bits) & BITS_PER_BYTE_MASK)
#define BITS_ROUNDDOWN_BYTES(bits) ((bits) >> 3)
#define BITS_ROUNDUP_BYTES(bits) \
	(BITS_ROUNDDOWN_BYTES(bits) + !!BITS_PER_BYTE_MASKED(bits))

May be it makes sense to move them to a generic header and to use here?

--
Alexey Khoroshilov


On 26.03.2022 14:40, Fedor Pchelkin wrote:
> If count argument in copy_fd_bitmaps() is not a multiple of
> BITS_PER_BYTE, then one byte is lost and is not used in further
> manipulations with cpy value in memcpy() and memset()
> causing a leak. The leak was introduced with close_range() call
> using CLOSE_RANGE_UNSHARE flag.
> 
> The patch suggests implementing an indicator (named add_byte)
> of count being multiple of BITS_PER_BYTE and adding it to the
> cpy value.
> 
> Found by Syzkaller (https://github.com/google/syzkaller).
> 
> Signed-off-by: Fedor Pchelkin <aissur0002@gmail.com>
> Signed-off-by: Alexey Khoroshilov <khoroshilov@ispras.ru>
> ---
>  fs/file.c | 2 --
>  1 file changed, 2 deletions(-)
> 
> diff --git a/fs/file.c b/fs/file.c
> index 3ef1479df203..3c64a6423604 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -56,10 +56,8 @@ static void copy_fd_bitmaps(struct fdtable *nfdt, struct fdtable *ofdt,
>  {
>  	unsigned int cpy, set;
>  	unsigned int add_byte = 0;
> -	
>  	if (count % BITS_PER_BYTE != 0)
>  		add_byte = 1;
> -	
>  	cpy = count / BITS_PER_BYTE + add_byte;
>  	set = (nfdt->max_fds - count) / BITS_PER_BYTE;
>  	memcpy(nfdt->open_fds, ofdt->open_fds, cpy);
> 


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 4/4] file: Fix file descriptor leak in copy_fd_bitmaps()
  2022-03-26 14:17 ` Alexey Khoroshilov
@ 2022-03-26 22:15   ` Linus Torvalds
  2022-03-26 22:37     ` Linus Torvalds
  0 siblings, 1 reply; 20+ messages in thread
From: Linus Torvalds @ 2022-03-26 22:15 UTC (permalink / raw)
  To: Alexey Khoroshilov, Eric Biggers, Christian Brauner
  Cc: Fedor Pchelkin, Alexander Viro, linux-fsdevel, Linux Kernel Mailing List

Sorry, quoting everything below to bring in Eric Biggers because he
touched that particular code last.

And Christian Brauner, because he worked on all teh bitmap code with
the whole close_range thing.

I think this is all ok because the number of files aren't just
byte-aligned, they are long-aligned:

         * We make sure that nr remains a multiple of BITS_PER_LONG - otherwise
         * bitmaps handling below becomes unpleasant, to put it mildly...

but maybe I'm missing something.

The fact that there's a

     Found by Syzkaller (https://github.com/google/syzkaller).

thing in that suggested commit message makes me think there _is_
something I'm missing.

Certainly NR_OPEN_DEFAULT, sane_fdtable_size() and max_fds should
always be a multiple of BITS_PER_LONG.

So I don't _think_ there is any bug here, although it might be good to

 (a) document that "we explicitly do things in BITS_PER_LONG chunks"
even more in places

 (b) have people double-check my thinking because clearly that
syzcaller thing implies I'm full of crap

Eric, Christian?

Can somebody point to the actual syzkaller report?

                Linus

On Sat, Mar 26, 2022 at 7:17 AM Alexey Khoroshilov
<khoroshilov@ispras.ru> wrote:
>
> Looks like bfp has a set of macro suitable for such cases:
>
> #define BITS_PER_BYTE_MASKED(bits) ((bits) & BITS_PER_BYTE_MASK)
> #define BITS_ROUNDDOWN_BYTES(bits) ((bits) >> 3)
> #define BITS_ROUNDUP_BYTES(bits) \
>         (BITS_ROUNDDOWN_BYTES(bits) + !!BITS_PER_BYTE_MASKED(bits))
>
> May be it makes sense to move them to a generic header and to use here?
>
> --
> Alexey Khoroshilov
>
>
> On 26.03.2022 14:40, Fedor Pchelkin wrote:
> > If count argument in copy_fd_bitmaps() is not a multiple of
> > BITS_PER_BYTE, then one byte is lost and is not used in further
> > manipulations with cpy value in memcpy() and memset()
> > causing a leak. The leak was introduced with close_range() call
> > using CLOSE_RANGE_UNSHARE flag.
> >
> > The patch suggests implementing an indicator (named add_byte)
> > of count being multiple of BITS_PER_BYTE and adding it to the
> > cpy value.
> >
> > Found by Syzkaller (https://github.com/google/syzkaller).
> >
> > Signed-off-by: Fedor Pchelkin <aissur0002@gmail.com>
> > Signed-off-by: Alexey Khoroshilov <khoroshilov@ispras.ru>
> > ---
> >  fs/file.c | 2 --
> >  1 file changed, 2 deletions(-)
> >
> > diff --git a/fs/file.c b/fs/file.c
> > index 3ef1479df203..3c64a6423604 100644
> > --- a/fs/file.c
> > +++ b/fs/file.c
> > @@ -56,10 +56,8 @@ static void copy_fd_bitmaps(struct fdtable *nfdt, struct fdtable *ofdt,
> >  {
> >       unsigned int cpy, set;
> >       unsigned int add_byte = 0;
> > -
> >       if (count % BITS_PER_BYTE != 0)
> >               add_byte = 1;
> > -
> >       cpy = count / BITS_PER_BYTE + add_byte;
> >       set = (nfdt->max_fds - count) / BITS_PER_BYTE;
> >       memcpy(nfdt->open_fds, ofdt->open_fds, cpy);
> >
>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 4/4] file: Fix file descriptor leak in copy_fd_bitmaps()
  2022-03-26 22:15   ` Linus Torvalds
@ 2022-03-26 22:37     ` Linus Torvalds
  2022-03-27 21:54       ` aissur0002
  0 siblings, 1 reply; 20+ messages in thread
From: Linus Torvalds @ 2022-03-26 22:37 UTC (permalink / raw)
  To: Alexey Khoroshilov, Eric Biggers, Christian Brauner
  Cc: Fedor Pchelkin, Alexander Viro, linux-fsdevel, Linux Kernel Mailing List

On Sat, Mar 26, 2022 at 3:15 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> but maybe I'm missing something.

Side note: the thing I'm _definitely_ missing is context for this patch.

I'm seeing "4/4" in the subject, but I can't find 1-3.

And the patch most definitely doesn't apply as-is, because that
'add_byte' logic that it removes lines around doesn't exist in my tree
at least.

And I _do_ think that regardless of anything else, having those
calculations use BITS_PER_BYTE - as if byte level operations were
valid - is misleading.

I do find something dodgy: alloc_fdtable().

That function very much tries to keep things to that multiple of
BITS_PER_LONG, and even has a comment to that effect, but I wonder if
it is broken.

In particular, that

  nr = ... | (BITS_PER_LONG - 1)) + 1;

case is only done for the "nr > sysctl_nr_open" case. The other case
*DOES* align things up, but not necessarily sufficiently, in that it
does

        nr /= (1024 / sizeof(struct file *));
        nr = roundup_pow_of_two(nr + 1);
        nr *= (1024 / sizeof(struct file *));

and I think that despite the round-up, it could easily be a smaller
power-of-two than BITS_PER_LONG.

So I think that code _intended_ for things to always be a multiple of
BITS_PER_LONG, but I'm not convinced it is.

It would be a good idea to just make it very explicit that it's
properly aligned, using

    --- a/fs/file.c
    +++ b/fs/file.c
    @@ -111,7 +111,8 @@ static struct fdtable * alloc_fdtable(unsigned int nr)
         * bitmaps handling below becomes unpleasant, to put it mildly...
         */
        if (unlikely(nr > sysctl_nr_open))
    -           nr = ((sysctl_nr_open - 1) | (BITS_PER_LONG - 1)) + 1;
    +           nr = sysctl_nr_open;
    +   nr = ALIGN(nr, BITS_PER_LONG);

        fdt = kmalloc(sizeof(struct fdtable), GFP_KERNEL_ACCOUNT);
        if (!fdt)

although that would perhaps break the whole "we try to allocate in
comfortable page-tuned chunks" code, so that obvious patch may be
doing non-obvious bad things.

Maybe it's the roundup_pow_of_two() that should be fixed to be that
proper BITS_PER_LONG alignment instead?

And related to this all, we do have that BITS_PER_BYTE thing in a few
places, and they are all a bit dodgy:

 - copy_fd_bitmaps() uses BITS_PER_BYTE, but I do think that all the
values it operates on _should_ be BITS_PER_LONG aligned

 - alloc_fdtable() again does "2 * nr / BITS_PER_BYTE" for the bitmap
allocations

and we should make really really sure that we're always doing
BITS_PER_LONG, and that alloc_fdtable() calculation should be checked.

Hmm?

                   Linus

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 4/4] file: Fix file descriptor leak in copy_fd_bitmaps()
  2022-03-26 22:37     ` Linus Torvalds
@ 2022-03-27 21:54       ` aissur0002
  2022-03-27 22:21         ` Linus Torvalds
  0 siblings, 1 reply; 20+ messages in thread
From: aissur0002 @ 2022-03-27 21:54 UTC (permalink / raw)
  To: Alexey Khoroshilov, Eric Biggers, Christian Brauner, Linus Torvalds
  Cc: Fedor Pchelkin, Alexander Viro, linux-fsdevel, Linux Kernel Mailing List

[-- Attachment #1: Type: text/plain, Size: 1667 bytes --]

I am sorry, that was my first attempt to contribute to the kernel and
I messed up a little bit with the patch tag: it is actually a single, 
standalone patch and there has been nothing posted previously.

Moreover, I should have provided you with the context of discovering 
the bug: a Syzkaller bug reproducer and CrashReport are attached to 
the mail.

In few words, an error occurs while executing close_range() call with
CLOSE_RANGE_UNSHARE flag: in __close_range() the value of
max_unshare_fds (later used as max_fds in dup_fd() and copy_fd_bitmaps())
can be an arbitrary number.

  if (max_fd >= last_fd(files_fdtable(cur_fds)))
    max_unshare_fds = fd;

and not be a multiple of BITS_PER_LONG or BITS_PER_BYTE. In case of
Syzkaller reproducer this value equals 69 - three standard file descriptors, 
66 pipe descriptors and 1 for /dev/ptmx; nr value used in alloc_fdtable()
is 128 (that' okay). But here is the weird place which causes the leak
described in the patch. KASAN detected that leak and possibly the
location of the leak is in copy_fd_bitmaps(): I experimented with
possible solutions and if add_byte logic is implemented, then KASAN does 
not find any leaks. Of course, the problem can be somewhere else and I 
don't notice it.

On Sat, Mar 26, 2022 at 3:15 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> Certainly NR_OPEN_DEFAULT, sane_fdtable_size() and max_fds should
> always be a multiple of BITS_PER_LONG.

I totally agree with you but in the reproducer case the max_fds
value does not follow these rules.

I think there is probably something wrong in dup_fd() when getting
open_files value and passing it to copy_fd_bitmaps().


[-- Attachment #2: repro.c --]
[-- Type: text/x-csrc, Size: 7831 bytes --]

// https://None.appspot.com/bug?id=691717b7bf277fca0588ce460186987a805e5def
// autogenerated by syzkaller (https://github.com/google/syzkaller)

#define _GNU_SOURCE

#include <dirent.h>
#include <endian.h>
#include <errno.h>
#include <fcntl.h>
#include <pthread.h>
#include <signal.h>
#include <stdarg.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/prctl.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

#include <linux/futex.h>

static void sleep_ms(uint64_t ms)
{
  usleep(ms * 1000);
}

static uint64_t current_time_ms(void)
{
  struct timespec ts;
  if (clock_gettime(CLOCK_MONOTONIC, &ts))
    exit(1);
  return (uint64_t)ts.tv_sec * 1000 + (uint64_t)ts.tv_nsec / 1000000;
}

static void thread_start(void* (*fn)(void*), void* arg)
{
  pthread_t th;
  pthread_attr_t attr;
  pthread_attr_init(&attr);
  pthread_attr_setstacksize(&attr, 128 << 10);
  int i = 0;
  for (; i < 100; i++) {
    if (pthread_create(&th, &attr, fn, arg) == 0) {
      pthread_attr_destroy(&attr);
      return;
    }
    if (errno == EAGAIN) {
      usleep(50);
      continue;
    }
    break;
  }
  exit(1);
}

typedef struct {
  int state;
} event_t;

static void event_init(event_t* ev)
{
  ev->state = 0;
}

static void event_reset(event_t* ev)
{
  ev->state = 0;
}

static void event_set(event_t* ev)
{
  if (ev->state)
    exit(1);
  __atomic_store_n(&ev->state, 1, __ATOMIC_RELEASE);
  syscall(SYS_futex, &ev->state, FUTEX_WAKE | FUTEX_PRIVATE_FLAG, 1000000);
}

static void event_wait(event_t* ev)
{
  while (!__atomic_load_n(&ev->state, __ATOMIC_ACQUIRE))
    syscall(SYS_futex, &ev->state, FUTEX_WAIT | FUTEX_PRIVATE_FLAG, 0, 0);
}

static int event_isset(event_t* ev)
{
  return __atomic_load_n(&ev->state, __ATOMIC_ACQUIRE);
}

static int event_timedwait(event_t* ev, uint64_t timeout)
{
  uint64_t start = current_time_ms();
  uint64_t now = start;
  for (;;) {
    uint64_t remain = timeout - (now - start);
    struct timespec ts;
    ts.tv_sec = remain / 1000;
    ts.tv_nsec = (remain % 1000) * 1000 * 1000;
    syscall(SYS_futex, &ev->state, FUTEX_WAIT | FUTEX_PRIVATE_FLAG, 0, &ts);
    if (__atomic_load_n(&ev->state, __ATOMIC_ACQUIRE))
      return 1;
    now = current_time_ms();
    if (now - start > timeout)
      return 0;
  }
}

static bool write_file(const char* file, const char* what, ...)
{
  char buf[1024];
  va_list args;
  va_start(args, what);
  vsnprintf(buf, sizeof(buf), what, args);
  va_end(args);
  buf[sizeof(buf) - 1] = 0;
  int len = strlen(buf);
  int fd = open(file, O_WRONLY | O_CLOEXEC);
  if (fd == -1)
    return false;
  if (write(fd, buf, len) != len) {
    int err = errno;
    close(fd);
    errno = err;
    return false;
  }
  close(fd);
  return true;
}

static void kill_and_wait(int pid, int* status)
{
  kill(-pid, SIGKILL);
  kill(pid, SIGKILL);
  for (int i = 0; i < 100; i++) {
    if (waitpid(-1, status, WNOHANG | __WALL) == pid)
      return;
    usleep(1000);
  }
  DIR* dir = opendir("/sys/fs/fuse/connections");
  if (dir) {
    for (;;) {
      struct dirent* ent = readdir(dir);
      if (!ent)
        break;
      if (strcmp(ent->d_name, ".") == 0 || strcmp(ent->d_name, "..") == 0)
        continue;
      char abort[300];
      snprintf(abort, sizeof(abort), "/sys/fs/fuse/connections/%s/abort",
               ent->d_name);
      int fd = open(abort, O_WRONLY);
      if (fd == -1) {
        continue;
      }
      if (write(fd, abort, 1) < 0) {
      }
      close(fd);
    }
    closedir(dir);
  } else {
  }
  while (waitpid(-1, status, __WALL) != pid) {
  }
}

static void setup_test()
{
  prctl(PR_SET_PDEATHSIG, SIGKILL, 0, 0, 0);
  setpgrp();
  write_file("/proc/self/oom_score_adj", "1000");
}

#define KMEMLEAK_FILE "/sys/kernel/debug/kmemleak"

static void setup_leak()
{
  if (!write_file(KMEMLEAK_FILE, "scan"))
    exit(1);
  sleep(5);
  if (!write_file(KMEMLEAK_FILE, "scan"))
    exit(1);
  if (!write_file(KMEMLEAK_FILE, "clear"))
    exit(1);
}

static void check_leaks(void)
{
  int fd = open(KMEMLEAK_FILE, O_RDWR);
  if (fd == -1)
    exit(1);
  uint64_t start = current_time_ms();
  if (write(fd, "scan", 4) != 4)
    exit(1);
  sleep(1);
  while (current_time_ms() - start < 4 * 1000)
    sleep(1);
  if (write(fd, "scan", 4) != 4)
    exit(1);
  static char buf[128 << 10];
  ssize_t n = read(fd, buf, sizeof(buf) - 1);
  if (n < 0)
    exit(1);
  int nleaks = 0;
  if (n != 0) {
    sleep(1);
    if (write(fd, "scan", 4) != 4)
      exit(1);
    if (lseek(fd, 0, SEEK_SET) < 0)
      exit(1);
    n = read(fd, buf, sizeof(buf) - 1);
    if (n < 0)
      exit(1);
    buf[n] = 0;
    char* pos = buf;
    char* end = buf + n;
    while (pos < end) {
      char* next = strstr(pos + 1, "unreferenced object");
      if (!next)
        next = end;
      char prev = *next;
      *next = 0;
      fprintf(stderr, "BUG: memory leak\n%s\n", pos);
      *next = prev;
      pos = next;
      nleaks++;
    }
  }
  if (write(fd, "clear", 5) != 5)
    exit(1);
  close(fd);
  if (nleaks)
    exit(1);
}

struct thread_t {
  int created, call;
  event_t ready, done;
};

static struct thread_t threads[16];
static void execute_call(int call);
static int running;

static void* thr(void* arg)
{
  struct thread_t* th = (struct thread_t*)arg;
  for (;;) {
    event_wait(&th->ready);
    event_reset(&th->ready);
    execute_call(th->call);
    __atomic_fetch_sub(&running, 1, __ATOMIC_RELAXED);
    event_set(&th->done);
  }
  return 0;
}

static void execute_one(void)
{
  int i, call, thread;
  for (call = 0; call < 4; call++) {
    for (thread = 0; thread < (int)(sizeof(threads) / sizeof(threads[0]));
         thread++) {
      struct thread_t* th = &threads[thread];
      if (!th->created) {
        th->created = 1;
        event_init(&th->ready);
        event_init(&th->done);
        event_set(&th->done);
        thread_start(thr, th);
      }
      if (!event_isset(&th->done))
        continue;
      event_reset(&th->done);
      th->call = call;
      __atomic_fetch_add(&running, 1, __ATOMIC_RELAXED);
      event_set(&th->ready);
      event_timedwait(&th->done, 50);
      break;
    }
  }
  for (i = 0; i < 100 && __atomic_load_n(&running, __ATOMIC_RELAXED); i++)
    sleep_ms(1);
}

static void execute_one(void);

#define WAIT_FLAGS __WALL

static void loop(void)
{
  int iter = 0;
  for (;; iter++) {
    int pid = fork();
    if (pid < 0)
      exit(1);
    if (pid == 0) {
      setup_test();
      execute_one();
      exit(0);
    }
    int status = 0;
    uint64_t start = current_time_ms();
    for (;;) {
      if (waitpid(-1, &status, WNOHANG | WAIT_FLAGS) == pid)
        break;
      sleep_ms(1);
      if (current_time_ms() - start < 5000)
        continue;
      kill_and_wait(pid, &status);
      break;
    }
    check_leaks();
  }
}

#ifndef __NR_close_range
#define __NR_close_range 436
#endif

uint64_t r[1] = {0xffffffffffffffff};

void execute_call(int call)
{
  intptr_t res = 0;
  switch (call) {
  case 0:
    syscall(__NR_pipe, 0x200000c0ul);
    {
      int i;
      for (i = 0; i < 32; i++) {
        syscall(__NR_pipe, 0x200000c0ul);
      }
    }
    break;
  case 1:
    memcpy((void*)0x20000140, "/dev/ptmx\000", 10);
    res = syscall(__NR_openat, 0xffffffffffffff9cul, 0x20000140ul, 1ul, 0ul);
    if (res != -1)
      r[0] = res;
    break;
  case 2:
    syscall(__NR_ioctl, -1, 0xc020f509, 0ul);
    break;
  case 3:
    syscall(__NR_close_range, r[0], -1, 2ul);
    break;
  }
}
int main(void)
{
  syscall(__NR_mmap, 0x1ffff000ul, 0x1000ul, 0ul, 0x32ul, -1, 0ul);
  syscall(__NR_mmap, 0x20000000ul, 0x1000000ul, 7ul, 0x32ul, -1, 0ul);
  syscall(__NR_mmap, 0x21000000ul, 0x1000ul, 0ul, 0x32ul, -1, 0ul);
  setup_leak();
  loop();
  return 0;
}


[-- Attachment #3: CrashLog.txt --]
[-- Type: text/plain, Size: 15024 bytes --]

Warning: Permanently added '[localhost]:61462' (ED25519) to the list of known hosts.
syzkaller login: [  122.778438] audit: type=1400 audit(1644766535.275:6): avc:  denied  { execmem } for  pid=269 comm="syz-executor771" scontext=system_u:system_r:kernel_t:s0 tcontext=system_u:system_r:kernel_t:s0 tclass=process permissive=1
executing program
executing program
[  144.039046] kmemleak: 24 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
[  152.581891] kmemleak: 27 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
BUG: memory leak
unreferenced object 0xffff888114246200 (size 168):
  comm "syz-executor771", pid 269, jiffies 4294800484 (age 19.246s)
  hex dump (first 32 bytes):
    05 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<000000008d05afee>] prepare_creds+0x25/0x5e0
    [<00000000dc1ed5dc>] copy_creds+0x72/0x580
    [<000000006f959e2f>] copy_process+0xee4/0x66c0
    [<00000000bd3ea815>] kernel_clone+0xe7/0xa20
    [<00000000ebcdd13c>] __do_sys_clone+0xc8/0x110
    [<00000000bf0d741e>] do_syscall_64+0x33/0x40
    [<000000008651067b>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

BUG: memory leak
unreferenced object 0xffff88810b1a6d40 (size 32):
  comm "syz-executor771", pid 269, jiffies 4294800484 (age 19.246s)
  hex dump (first 32 bytes):
    01 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<00000000a1a7c9d2>] security_prepare_creds+0x10a/0x180
    [<000000004a0fcc72>] prepare_creds+0x458/0x5e0
    [<00000000dc1ed5dc>] copy_creds+0x72/0x580
    [<000000006f959e2f>] copy_process+0xee4/0x66c0
    [<00000000bd3ea815>] kernel_clone+0xe7/0xa20
    [<00000000ebcdd13c>] __do_sys_clone+0xc8/0x110
    [<00000000bf0d741e>] do_syscall_64+0x33/0x40
    [<000000008651067b>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

BUG: memory leak
unreferenced object 0xffff888008312718 (size 984):
  comm "syz-executor771", pid 271, jiffies 4294800488 (age 19.242s)
  hex dump (first 32 bytes):
    80 11 04 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
  backtrace:
    [<00000000fe9451d1>] alloc_inode+0x169/0x230
    [<000000000feec68a>] new_inode_pseudo+0x14/0xe0
    [<0000000086d833dc>] create_pipe_files+0x4d/0x890
    [<00000000dd797bae>] do_pipe2+0x96/0x1b0
    [<00000000f55f306a>] __x64_sys_pipe+0x2f/0x40
    [<00000000bf0d741e>] do_syscall_64+0x33/0x40
    [<000000008651067b>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

BUG: memory leak
unreferenced object 0xffff888106a17ac8 (size 120):
  comm "syz-executor771", pid 271, jiffies 4294800488 (age 19.242s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    18 27 31 08 80 88 ff ff e0 7a a1 06 81 88 ff ff  .'1......z......
  backtrace:
    [<000000001d4cf14c>] security_inode_alloc+0x34/0x160
    [<000000004aa5c9b3>] inode_init_always+0x507/0xc10
    [<00000000efd3fd5f>] alloc_inode+0x84/0x230
    [<000000000feec68a>] new_inode_pseudo+0x14/0xe0
    [<0000000086d833dc>] create_pipe_files+0x4d/0x890
    [<00000000dd797bae>] do_pipe2+0x96/0x1b0
    [<00000000f55f306a>] __x64_sys_pipe+0x2f/0x40
    [<00000000bf0d741e>] do_syscall_64+0x33/0x40
    [<000000008651067b>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

BUG: memory leak
unreferenced object 0xffff88810697e400 (size 512):
  comm "syz-executor771", pid 271, jiffies 4294800488 (age 19.242s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 ad 4e ad de  .............N..
    ff ff ff ff 00 00 00 00 ff ff ff ff ff ff ff ff  ................
  backtrace:
    [<0000000019157336>] alloc_pipe_info+0x105/0x580
    [<00000000f7c3e5f1>] create_pipe_files+0x8d/0x890
    [<00000000dd797bae>] do_pipe2+0x96/0x1b0
    [<00000000f55f306a>] __x64_sys_pipe+0x2f/0x40
    [<00000000bf0d741e>] do_syscall_64+0x33/0x40
    [<000000008651067b>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

BUG: memory leak
unreferenced object 0xffff8881055c4800 (size 1024):
  comm "syz-executor771", pid 271, jiffies 4294800488 (age 19.242s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<00000000fddbfdd0>] alloc_pipe_info+0x1e0/0x580
    [<00000000f7c3e5f1>] create_pipe_files+0x8d/0x890
    [<00000000dd797bae>] do_pipe2+0x96/0x1b0
    [<00000000f55f306a>] __x64_sys_pipe+0x2f/0x40
    [<00000000bf0d741e>] do_syscall_64+0x33/0x40
    [<000000008651067b>] entry_SYSCALL_64_after_hwframe+0x44/0xa9


VM DIAGNOSIS:
18:36:05  Registers:
info registers vcpu 0
RAX=ffffffff83c4e670 RBX=ffffffff84c34280 RCX=ffffffff83c36748 RDX=0000000000000000
RSI=0000000000000000 RDI=ffffffff83c4ea1f RBP=fffffbfff0986850 RSP=ffffffff84c07e40
R8 =0000000000000001 R9 =ffff88811ae3c06b R10=ffffed10235c780d R11=0000000000000001
R12=0000000000000000 R13=ffffffff85453b48 R14=0000000000000000 R15=dffffc0000000000
RIP=ffffffff83c4e67e RFL=00000206 [-----P-] CPL=0 II=0 A20=1 SMM=0 HLT=1
ES =0000 0000000000000000 ffffffff 00000000
CS =0010 0000000000000000 ffffffff 00a09b00 DPL=0 CS64 [-RA]
SS =0018 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
DS =0000 0000000000000000 ffffffff 00000000
FS =0000 0000000000000000 ffffffff 00000000
GS =0000 ffff88811ae00000 ffffffff 00000000
LDT=0000 0000000000000000 000fffff 00000000
TR =0040 fffffe0000003000 00004087 00008b00 DPL=0 TSS64-busy
GDT=     fffffe0000001000 0000007f
IDT=     fffffe0000000000 00000fff
CR0=80050033 CR2=00005649d7ed412c CR3=00000001074dc000 CR4=00350ef0
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000 
DR6=00000000fffe0ff0 DR7=0000000000000400
EFER=0000000000000d01
FCW=037f FSW=0000 [ST=0] FTW=00 MXCSR=00001f80
FPR0=0000000000000000 0000 FPR1=0000000000000000 0000
FPR2=0000000000000000 0000 FPR3=0000000000000000 0000
FPR4=0000000000000000 0000 FPR5=0000000000000000 0000
FPR6=0000000000000000 0000 FPR7=0000000000000000 0000
YMM00=0000000000000000 0000000000000000 0000000000000000 0000000000000000
YMM01=0000000000000000 0000000000000000 7465677261742e79 636e656772656d65
YMM02=0000000000000000 0000000000000000 ffffff0f0e0d0c0b 0a09080706050403
YMM03=0000000000000000 0000000000000000 0000000000000021 0065636900656369
YMM04=0000000000000000 0000000000000000 2e2e2e2e2e2e2e2e 2e2e2e2e2e2e2e2e
YMM05=0000000000000000 0000000000000000 00005649d7eef0f0 0000000000000000
YMM06=0000000000000000 0000000000000000 0000000000000021 0000000000000000
YMM07=0000000000000000 0000000000000000 0000000000000000 0000000000000000
YMM08=0000000000000000 0000000000000000 732f6563696c732e 6d65747379732f3a
YMM09=0000000000000000 0000000000000000 0000000000000000 0000000000000000
YMM10=0000000000000000 0000000000000000 425b5a024a5b4e4b 5f5a024b424a5b5c
YMM11=0000000000000000 0000000000000000 0000000000000000 0000000000000000
YMM12=0000000000000000 0000000000000000 0000000000000000 0000000000000000
YMM13=0000000000000000 0000000000000000 0000000000000000 0000000000000000
YMM14=0000000000000000 0000000000000000 0000000000000000 0000000000000000
YMM15=0000000000000000 0000000000000000 0000000000000000 0000000000000000
info registers vcpu 1
RAX=0000000000000000 RBX=0000000000000001 RCX=ffffffff8145e7e9 RDX=ffff88810bcb8000
RSI=ffffffff8145e7f2 RDI=0000000000000005 RBP=ffff88810030ad80 RSP=ffff88810bd27cd8
R8 =0000000000000000 R9 =ffff88810030ad83 R10=0000000000000000 R11=0000000000000001
R12=1ffff110217a4fa5 R13=ffff88810bd27ed0 R14=dffffc0000000000 R15=0000000000000001
RIP=ffffffff813fa790 RFL=00000046 [---Z-P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0000 0000000000000000 000fffff 00000000
CS =0010 0000000000000000 ffffffff 00a09b00 DPL=0 CS64 [-RA]
SS =0018 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
DS =0000 0000000000000000 000fffff 00000000
FS =0000 00000000006fc3c0 000fffff 00000000
GS =0000 ffff88811ae80000 000fffff 00000000
LDT=0000 0000000000000000 00000000 00000000
TR =0040 fffffe000004a000 00004087 00008b00 DPL=0 TSS64-busy
GDT=     fffffe0000048000 0000007f
IDT=     fffffe0000000000 00000fff
CR0=80050033 CR2=00000000004f72c0 CR3=00000001057c6000 CR4=00350ee0
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000 
DR6=00000000fffe0ff0 DR7=0000000000000400
EFER=0000000000000d01
FCW=037f FSW=0000 [ST=0] FTW=00 MXCSR=00001f80
FPR0=0000000000000000 0000 FPR1=0000000000000000 0000
FPR2=0000000000000000 0000 FPR3=0000000000000000 0000
FPR4=0000000000000000 0000 FPR5=0000000000000000 0000
FPR6=0000000000000000 0000 FPR7=0000000000000000 0000
YMM00=0000000000000000 0000000000000000 2525252525252525 2525252525252525
YMM01=0000000000000000 0000000000000000 00000000000000ff ffffffffffffff00
YMM02=0000000000000000 0000000000000000 00000000000000ff ffffffffffffff00
YMM03=0000000000000000 0000000000000000 207365696666696a 202c313732206469
YMM04=0000000000000000 0000000000000000 6a626f206465636e 6572656665726e75
YMM05=0000000000000000 0000000000000000 32303120657a6973 2820303038346335
YMM06=0000000000000000 0000000000000000 6970202c22313737 726f747563657865
YMM07=0000000000000000 0000000000000000 3120656761282038 3834303038343932
YMM08=0000000000000000 0000000000000000 73662f7379732f00 000000000000000a
YMM09=0000000000000000 0000000000000000 0000000000000000 0000000000000000
YMM10=0000000000000000 0000000000000000 5e415a5a165e450b 030f1c081906311c
YMM11=0000000000000000 0000000000000000 0000000000000000 0000000000000000
YMM12=0000000000000000 0000000000000000 0000000000000000 0000000000000000
YMM13=0000000000000000 0000000000000000 0000000000000000 0000000000000000
YMM14=0000000000000000 0000000000000000 0000000000000000 0000000000000000
YMM15=0000000000000000 0000000000000000 0000000000000000 0000000000000000
info registers vcpu 2
RAX=ffffffff83c4e670 RBX=ffff8881009799c0 RCX=ffffffff83c36748 RDX=0000000000000000
RSI=0000000000000000 RDI=ffffffff83c4ea1f RBP=ffffed102012f338 RSP=ffff88810098fe78
R8 =0000000000000001 R9 =ffff88811af3c06b R10=ffffed10235e780d R11=0000000000000001
R12=0000000000000002 R13=ffffffff85453b48 R14=0000000000000000 R15=dffffc0000000000
RIP=ffffffff83c4e67e RFL=00000202 [-------] CPL=0 II=0 A20=1 SMM=0 HLT=1
ES =0000 0000000000000000 ffffffff 00000000
CS =0010 0000000000000000 ffffffff 00a09b00 DPL=0 CS64 [-RA]
SS =0018 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
DS =0000 0000000000000000 ffffffff 00000000
FS =0000 0000000000000000 ffffffff 00000000
GS =0000 ffff88811af00000 ffffffff 00000000
LDT=0000 0000000000000000 00000000 00000000
TR =0040 fffffe0000091000 00004087 00008b00 DPL=0 TSS64-busy
GDT=     fffffe000008f000 0000007f
IDT=     fffffe0000000000 00000fff
CR0=80050033 CR2=00007f3699c8dd50 CR3=00000001095a2000 CR4=00350ee0
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000 
DR6=00000000fffe0ff0 DR7=0000000000000400
EFER=0000000000000d01
FCW=037f FSW=0000 [ST=0] FTW=00 MXCSR=00001f80
FPR0=0000000000000000 0000 FPR1=0000000000000000 0000
FPR2=0000000000000000 0000 FPR3=0000000000000000 0000
FPR4=0000000000000000 0000 FPR5=0000000000000000 0000
FPR6=0000000000000000 0000 FPR7=0000000000000000 0000
YMM00=0000000000000000 0000000000000000 0000000000000000 0000000000000000
YMM01=0000000000000000 0000000000000000 f58e76c330976d97 c4490268c130e4bb
YMM02=0000000000000000 0000000000000000 5627cf12a5e52d08 00000000000ae9e8
YMM03=0000000000000000 0000000000000000 80f09f19808d26a3 00000000000aec28
YMM04=0000000000000000 0000000000000000 07959290bdff0429 000000000012cf90
YMM05=0000000000000000 0000000000000000 d3fdd5f48436fbd7 00000000000aea90
YMM06=0000000000000000 0000000000000000 8e7c8dff50f68680 00000000000ae948
YMM07=0000000000000000 0000000000000000 a1fcdcf819d7e1e5 00000000000ae728
YMM08=0000000000000000 0000000000000000 44495f474f4c5359 530069253d595449
YMM09=0000000000000000 0000000000000000 0000000000000000 0000000000000000
YMM10=0000000000000000 0000000000000000 0750515151515168 5b0707241100226b
YMM11=0000000000000000 0000000000000000 0000000000000000 0000000000000000
YMM12=0000000000000000 0000000000000000 0000000000000000 0000000000000000
YMM13=0000000000000000 0000000000000000 0000000000000000 0000000000000000
YMM14=0000000000000000 0000000000000000 0000000000000000 0000000000000000
YMM15=0000000000000000 0000000000000000 0000000000000000 0000000000000000
info registers vcpu 3
RAX=ffffffff83c4e670 RBX=ffff88810097b380 RCX=ffffffff83c36748 RDX=0000000000000000
RSI=0000000000000000 RDI=ffffffff83c4ea1f RBP=ffffed102012f670 RSP=ffff88810099fe78
R8 =0000000000000001 R9 =ffff88811afbc06b R10=ffffed10235f780d R11=0000000000000001
R12=0000000000000003 R13=ffffffff85453b48 R14=0000000000000000 R15=dffffc0000000000
RIP=ffffffff83c4e67e RFL=00000202 [-------] CPL=0 II=0 A20=1 SMM=0 HLT=1
ES =0000 0000000000000000 ffffffff 00c00000
CS =0010 0000000000000000 ffffffff 00a09b00 DPL=0 CS64 [-RA]
SS =0018 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
DS =0000 0000000000000000 ffffffff 00c00000
FS =0000 0000000000000000 ffffffff 00c00000
GS =0000 ffff88811af80000 ffffffff 00c00000
LDT=0000 0000000000000000 00000000 00000000
TR =0040 fffffe00000d8000 00004087 00008b00 DPL=0 TSS64-busy
GDT=     fffffe00000d6000 0000007f
IDT=     fffffe0000000000 00000fff
CR0=80050033 CR2=00005578bf028dd8 CR3=00000001150b8000 CR4=00350ee0
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000 
DR6=00000000fffe0ff0 DR7=0000000000000400
EFER=0000000000000d01
FCW=037f FSW=0000 [ST=0] FTW=00 MXCSR=00001f80
FPR0=0000000000000000 0000 FPR1=0000000000000000 0000
FPR2=0000000000000000 0000 FPR3=0000000000000000 0000
FPR4=0000000000000000 0000 FPR5=0000000000000000 0000
FPR6=0000000000000000 0000 FPR7=0000000000000000 0000
YMM00=0000000000000000 0000000000000000 0000000000000000 0000000000000000
YMM01=0000000000000000 0000000000000000 206465636e657265 6665726e750a6b61
YMM02=0000000000000000 0000000000000000 20657a6973282030 3032363432343131
YMM03=0000000000000000 0000000000000000 202c22313737726f 7475636578652d7a
YMM04=0000000000000000 0000000000000000 2e2e2e2e20203030 2030302030302030
YMM05=0000000000000000 0000000000000000 2030302030302030 3020303020303020
YMM06=0000000000000000 0000000000000000 6220323320747372 69662820706d7564
YMM07=0000000000000000 0000000000000000 6567612820343834 3030383439323420
YMM08=0000000000000000 0000000000000000 0000000000000000 0000000000000000
YMM09=0000000000000000 0000000000000000 0000000000000000 0000000000000000
YMM10=0000000000000000 0000000000000000 0000000000000000 0000000000000000
YMM11=0000000000000000 0000000000000000 0000000000000000 0000000000000000
YMM12=0000000000000000 0000000000000000 0000000000000000 0000000000000000
YMM13=0000000000000000 0000000000000000 0000000000000000 0000000000000000
YMM14=0000000000000000 0000000000000000 0000000000000000 0000000000000000
YMM15=0000000000000000 0000000000000000 0000000000000000 0000000000000000


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 4/4] file: Fix file descriptor leak in copy_fd_bitmaps()
  2022-03-27 21:54       ` aissur0002
@ 2022-03-27 22:21         ` Linus Torvalds
  2022-03-29 10:23           ` Christian Brauner
                             ` (2 more replies)
  0 siblings, 3 replies; 20+ messages in thread
From: Linus Torvalds @ 2022-03-27 22:21 UTC (permalink / raw)
  To: Fedor Pchelkin
  Cc: Alexey Khoroshilov, Eric Biggers, Christian Brauner,
	Alexander Viro, linux-fsdevel, Linux Kernel Mailing List

On Sun, Mar 27, 2022 at 2:53 PM <aissur0002@gmail.com> wrote:
>
> I am sorry, that was my first attempt to contribute to the kernel and
> I messed up a little bit with the patch tag: it is actually a single,
> standalone patch and there has been nothing posted previously.

No problem, thanks for clarifying.

But the patch itself in that case is missing some detail, since it
clearly doesn't apply to upstream.

Anyway:

> In few words, an error occurs while executing close_range() call with
> CLOSE_RANGE_UNSHARE flag: in __close_range() the value of
> max_unshare_fds (later used as max_fds in dup_fd() and copy_fd_bitmaps())
> can be an arbitrary number.
>
>   if (max_fd >= last_fd(files_fdtable(cur_fds)))
>     max_unshare_fds = fd;
>
> and not be a multiple of BITS_PER_LONG or BITS_PER_BYTE.

Very good, that's the piece I was missing. I only looked in fs/file.c,
and missed how that max_unshare_fds gets passed from __close_range()
into fork.c for unshare_fd() and then back to file.c and dup_fd(). And
then affects sane_fdtable_size().

I _think_ it should be sufficient to just do something like

       max_fds = ALIGN(max_fds, BITS_PER_LONG)

in sane_fdtable_size(), but honestly, I haven't actually thought about
it all that much. That's just a gut feel without really analyzing
things - sane_fdtable_size() really should never return a value that
isn't BITS_PER_LONG aligned.

And that whole close_range() is why I added Christian Brauner to the
participant list, though, so let's see if Christian has any comments.

Christian?

Btw, do you have a pointer to the syzbot report? I see the repro and
the crashlog you attached, but it would be good to have that pointer
to the syzbot original too.

Or did you just do this by running syzkaller yourself and there is no
external report?

                 Linus

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 4/4] file: Fix file descriptor leak in copy_fd_bitmaps()
  2022-03-27 22:21         ` Linus Torvalds
@ 2022-03-29 10:23           ` Christian Brauner
  2022-03-29 14:40             ` Christian Brauner
  2022-03-29 20:44           ` aissur0002
  2022-03-29 23:02           ` Alexey Khoroshilov
  2 siblings, 1 reply; 20+ messages in thread
From: Christian Brauner @ 2022-03-29 10:23 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Fedor Pchelkin, Alexey Khoroshilov, Eric Biggers, Alexander Viro,
	linux-fsdevel, Linux Kernel Mailing List

On Sun, Mar 27, 2022 at 03:21:18PM -0700, Linus Torvalds wrote:
> On Sun, Mar 27, 2022 at 2:53 PM <aissur0002@gmail.com> wrote:
> >
> > I am sorry, that was my first attempt to contribute to the kernel and
> > I messed up a little bit with the patch tag: it is actually a single,
> > standalone patch and there has been nothing posted previously.
> 
> No problem, thanks for clarifying.
> 
> But the patch itself in that case is missing some detail, since it
> clearly doesn't apply to upstream.
> 
> Anyway:
> 
> > In few words, an error occurs while executing close_range() call with
> > CLOSE_RANGE_UNSHARE flag: in __close_range() the value of
> > max_unshare_fds (later used as max_fds in dup_fd() and copy_fd_bitmaps())
> > can be an arbitrary number.
> >
> >   if (max_fd >= last_fd(files_fdtable(cur_fds)))
> >     max_unshare_fds = fd;
> >
> > and not be a multiple of BITS_PER_LONG or BITS_PER_BYTE.
> 
> Very good, that's the piece I was missing. I only looked in fs/file.c,
> and missed how that max_unshare_fds gets passed from __close_range()
> into fork.c for unshare_fd() and then back to file.c and dup_fd(). And
> then affects sane_fdtable_size().
> 
> I _think_ it should be sufficient to just do something like
> 
>        max_fds = ALIGN(max_fds, BITS_PER_LONG)
> 
> in sane_fdtable_size(), but honestly, I haven't actually thought about
> it all that much. That's just a gut feel without really analyzing
> things - sane_fdtable_size() really should never return a value that
> isn't BITS_PER_LONG aligned.
> 
> And that whole close_range() is why I added Christian Brauner to the
> participant list, though, so let's see if Christian has any comments.
> 
> Christian?

(Sorry, I was heads-deep in some other fs work and went into airplaine
mode. I'm back.)

So I investigated a similar report a little while back and I spent quite
a lot of time trying to track this down but didn't find the cause.
If you'd call:

close_range(131, -1, CLOSE_RANGE_UNSHARE);

for an fdtable that is smaller than 131 then we'd call:

unshare_fd(..., 131)
\dup_fd(..., 131)
  \sane_fdtable_size(..., 131)

So sane_fdtable_size() would return 131 which is not aligned. This
couldn't happen before CLOSE_RANGE_UNSHARE afaict. I'll try to do a
repro with this with your suggested fix applied.

Christian

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 4/4] file: Fix file descriptor leak in copy_fd_bitmaps()
  2022-03-29 10:23           ` Christian Brauner
@ 2022-03-29 14:40             ` Christian Brauner
  2022-03-29 21:28               ` Linus Torvalds
  0 siblings, 1 reply; 20+ messages in thread
From: Christian Brauner @ 2022-03-29 14:40 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Fedor Pchelkin, Alexey Khoroshilov, Eric Biggers, Alexander Viro,
	linux-fsdevel, Linux Kernel Mailing List

On Tue, Mar 29, 2022 at 12:23:47PM +0200, Christian Brauner wrote:
> On Sun, Mar 27, 2022 at 03:21:18PM -0700, Linus Torvalds wrote:
> > On Sun, Mar 27, 2022 at 2:53 PM <aissur0002@gmail.com> wrote:
> > >
> > > I am sorry, that was my first attempt to contribute to the kernel and
> > > I messed up a little bit with the patch tag: it is actually a single,
> > > standalone patch and there has been nothing posted previously.
> > 
> > No problem, thanks for clarifying.
> > 
> > But the patch itself in that case is missing some detail, since it
> > clearly doesn't apply to upstream.
> > 
> > Anyway:
> > 
> > > In few words, an error occurs while executing close_range() call with
> > > CLOSE_RANGE_UNSHARE flag: in __close_range() the value of
> > > max_unshare_fds (later used as max_fds in dup_fd() and copy_fd_bitmaps())
> > > can be an arbitrary number.
> > >
> > >   if (max_fd >= last_fd(files_fdtable(cur_fds)))
> > >     max_unshare_fds = fd;
> > >
> > > and not be a multiple of BITS_PER_LONG or BITS_PER_BYTE.
> > 
> > Very good, that's the piece I was missing. I only looked in fs/file.c,
> > and missed how that max_unshare_fds gets passed from __close_range()
> > into fork.c for unshare_fd() and then back to file.c and dup_fd(). And
> > then affects sane_fdtable_size().
> > 
> > I _think_ it should be sufficient to just do something like
> > 
> >        max_fds = ALIGN(max_fds, BITS_PER_LONG)
> > 
> > in sane_fdtable_size(), but honestly, I haven't actually thought about
> > it all that much. That's just a gut feel without really analyzing
> > things - sane_fdtable_size() really should never return a value that
> > isn't BITS_PER_LONG aligned.
> > 
> > And that whole close_range() is why I added Christian Brauner to the
> > participant list, though, so let's see if Christian has any comments.
> > 
> > Christian?
> 
> (Sorry, I was heads-deep in some other fs work and went into airplaine
> mode. I'm back.)
> 
> So I investigated a similar report a little while back and I spent quite
> a lot of time trying to track this down but didn't find the cause.
> If you'd call:
> 
> close_range(131, -1, CLOSE_RANGE_UNSHARE);
> 
> for an fdtable that is smaller than 131 then we'd call:
> 
> unshare_fd(..., 131)
> \dup_fd(..., 131)
>   \sane_fdtable_size(..., 131)
> 
> So sane_fdtable_size() would return 131 which is not aligned. This
> couldn't happen before CLOSE_RANGE_UNSHARE afaict. I'll try to do a
> repro with this with your suggested fix applied.

Ok, I managed to repro this issue on an upstream 5.17 kernel. You'll
need kmemleak enabled:

CONFIG_MEMCG_KMEM=y
CONFIG_HAVE_DEBUG_KMEMLEAK=y
CONFIG_DEBUG_KMEMLEAK=y
CONFIG_DEBUG_KMEMLEAK_MEM_POOL_SIZE=16000
# CONFIG_DEBUG_KMEMLEAK_TEST is not set
# CONFIG_DEBUG_KMEMLEAK_DEFAULT_OFF is not set
CONFIG_DEBUG_KMEMLEAK_AUTO_SCAN=y

The reproducer at [1] I'm using for this is super ugly fwiw. Compile with

gcc -pthread <bla>.c -o <bla>

The repro should trigger in about 2-3 iterations once the fdtable has
grown sufficiently for fd_start to become the upper limit instead of the
actual fdtable size when calling close_range(131, -1, CLOSE_RANGE_UNSHARE).

I don't think this actually leads to any fd leaks afaict as they should
all be properly closed after all sane_fdtable_size() does return the
number of open fds. But this is indeed triggered by the missing
BITS_PER_LONG alignment.

Your patch fixes this, Linus. I've compiled a kernel with your patch in
sane_fdtable_size() applied running with the repro for a few hours now
and no leaks. Feel free to take my ack.

I think alloc_fdtable() does everything correctly. The issue imho is
sane_fdtable_size() not aligning and then below when we do:
	for (i = open_files; i != 0; i--) {
in dup_fd() it looks to kmemleak like it's leaking (I'm not completely
sure it is actually.).

[1]:
#define _GNU_SOURCE 
 
#include <dirent.h>
#include <endian.h>
#include <errno.h>
#include <fcntl.h>
#include <pthread.h>
#include <signal.h>
#include <stdarg.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/prctl.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>
 
#include <linux/futex.h>
 
#ifndef __NR_close_range
#define __NR_close_range 436
#endif
 
#ifndef SYS_gettid
#error "SYS_gettid unavailable on this system"
#endif
 
#define gettid() ((pid_t)syscall(SYS_gettid))
 
static void sleep_ms(uint64_t ms)
{
  usleep(ms * 1000);
}
 
static uint64_t current_time_ms(void)
{
  struct timespec ts;
  if (clock_gettime(CLOCK_MONOTONIC, &ts))
  exit(1);
  return (uint64_t)ts.tv_sec * 1000 + (uint64_t)ts.tv_nsec / 1000000;
}
 
static void thread_start(void* (*fn)(void*), void* arg)
{
  pid_t pid = getpid();
  pid_t tid = gettid();
 
  pthread_t th;
  pthread_attr_t attr;
  pthread_attr_init(&attr);
  pthread_attr_setstacksize(&attr, 128 << 10);
  int i = 0;
  for (; i < 100; i++) {
    if (pthread_create(&th, &attr, fn, arg) == 0) {
      pthread_attr_destroy(&attr);
      printf("%d %d created thread\n", pid, tid);
      return;
    }
    if (errno == EAGAIN) {
      usleep(50);
      continue;
    }
    break;
  }
  exit(1);
}
 
typedef struct {
  int state;
} event_t;
 
static void event_init(event_t* ev)
{
  ev->state = 0;
}
 
static void event_reset(event_t* ev)
{
  ev->state = 0;
}
 
static void event_set(event_t* ev)
{
  if (ev->state)
  exit(1);
  __atomic_store_n(&ev->state, 1, __ATOMIC_RELEASE);
  syscall(SYS_futex, &ev->state, FUTEX_WAKE | FUTEX_PRIVATE_FLAG, 1000000);
}
 
static void event_wait(event_t* ev)
{
  while (!__atomic_load_n(&ev->state, __ATOMIC_ACQUIRE))
    syscall(SYS_futex, &ev->state, FUTEX_WAIT | FUTEX_PRIVATE_FLAG, 0, 0);
}
 
static int event_isset(event_t* ev)
{
  return __atomic_load_n(&ev->state, __ATOMIC_ACQUIRE);
}
 
static int event_timedwait(event_t* ev, uint64_t timeout)
{
  uint64_t start = current_time_ms();
  uint64_t now = start;
  for (;;) {
    uint64_t remain = timeout - (now - start);
    struct timespec ts;
    ts.tv_sec = remain / 1000;
    ts.tv_nsec = (remain % 1000) * 1000 * 1000;
    syscall(SYS_futex, &ev->state, FUTEX_WAIT | FUTEX_PRIVATE_FLAG, 0, &ts);
    if (__atomic_load_n(&ev->state, __ATOMIC_ACQUIRE))
      return 1;
    now = current_time_ms();
    if (now - start > timeout)
      return 0;
  }
}
 
static bool write_file(const char* file, const char* what, ...)
{
  char buf[1024];
  va_list args;
  va_start(args, what);
  vsnprintf(buf, sizeof(buf), what, args);
  va_end(args);
  buf[sizeof(buf) - 1] = 0;
  int len = strlen(buf);
  int fd = open(file, O_WRONLY | O_CLOEXEC);
  if (fd == -1)
    return false;
  if (write(fd, buf, len) != len) {
    int err = errno;
    close(fd);
    errno = err;
    return false;
  }
  close(fd);
  return true;
}
 
static void kill_and_wait(int pid, int* status)
{
  kill(-pid, SIGKILL);
  kill(pid, SIGKILL);
  for (int i = 0; i < 100; i++) {
    if (waitpid(-1, status, WNOHANG | __WALL) == pid)
      return;
    usleep(1000);
  }
  DIR* dir = opendir("/sys/fs/fuse/connections");
  if (dir) {
    for (;;) {
      struct dirent* ent = readdir(dir);
      if (!ent)
        break;
      if (strcmp(ent->d_name, ".") == 0 || strcmp(ent->d_name, "..") == 0)
        continue;
      char abort[300];
      snprintf(abort, sizeof(abort), "/sys/fs/fuse/connections/%s/abort", ent->d_name);
      int fd = open(abort, O_WRONLY);
      if (fd == -1) {
        continue;
      }
      if (write(fd, abort, 1) < 0) {
      }
      close(fd);
    }
    closedir(dir);
  } else {
  }
  while (waitpid(-1, status, __WALL) != pid) {
  }
}
 
static void setup_test()
{
  prctl(PR_SET_PDEATHSIG, SIGKILL, 0, 0, 0);
  setpgrp();
  write_file("/proc/self/oom_score_adj", "1000");
}
 
#define KMEMLEAK_FILE "/sys/kernel/debug/kmemleak"
 
static void setup_leak()
{
  if (!write_file(KMEMLEAK_FILE, "scan"))
    exit(1);
  sleep(5);
  if (!write_file(KMEMLEAK_FILE, "scan"))
    exit(1);
  if (!write_file(KMEMLEAK_FILE, "clear"))
    exit(1);
}
 
static void check_leaks(void)
{
  int fd = open(KMEMLEAK_FILE, O_RDWR);
  if (fd == -1)
  exit(1);
  uint64_t start = current_time_ms();
  if (write(fd, "scan", 4) != 4)
  exit(1);
  sleep(1);
  while (current_time_ms() - start < 4 * 1000)
    sleep(1);
  if (write(fd, "scan", 4) != 4)
  exit(1);
  static char buf[128 << 10];
  ssize_t n = read(fd, buf, sizeof(buf) - 1);
  if (n < 0)
  exit(1);
  int nleaks = 0;
  if (n != 0) {
    sleep(1);
    if (write(fd, "scan", 4) != 4)
  exit(1);
    if (lseek(fd, 0, SEEK_SET) < 0)
  exit(1);
    n = read(fd, buf, sizeof(buf) - 1);
    if (n < 0)
  exit(1);
    buf[n] = 0;
    char* pos = buf;
    char* end = buf + n;
    while (pos < end) {
      char* next = strstr(pos + 1, "unreferenced object");
      if (!next)
        next = end;
      char prev = *next;
      *next = 0;
      fprintf(stderr, "BUG: memory leak\n%s\n", pos);
      *next = prev;
      pos = next;
      nleaks++;
    }
  }
  if (write(fd, "clear", 5) != 5)
  exit(1);
  close(fd);
  if (nleaks)
    exit(1);
}
 
struct thread_t {
  int created, call;
  event_t ready, done;
};
 
static struct thread_t threads[16];
static void execute_call(int call);
static int running;
 
static void* thr(void* arg)
{
  struct thread_t* th = (struct thread_t*)arg;
  for (;;) {
    event_wait(&th->ready);
    event_reset(&th->ready);
    execute_call(th->call);
    __atomic_fetch_sub(&running, 1, __ATOMIC_RELAXED);
    event_set(&th->done);
  }
  return 0;
}
 
static void execute_one(void)
{
  sleep_ms(100);
  int i, call, thread;
  for (call = 0; call < 2; call++) {
    for (thread = 0; thread < (int)(sizeof(threads) / sizeof(threads[0])); thread++) {
      struct thread_t* th = &threads[thread];
      if (!th->created) {
        th->created = 1;
        event_init(&th->ready);
        event_init(&th->done);
        event_set(&th->done);
        thread_start(thr, th);
      }
      if (!event_isset(&th->done))
        continue;
      event_reset(&th->done);
      th->call = call;
      __atomic_fetch_add(&running, 1, __ATOMIC_RELAXED);
      event_set(&th->ready);
      event_timedwait(&th->done, 50);
      break;
    }
  }
  for (i = 0; i < 100 && __atomic_load_n(&running, __ATOMIC_RELAXED); i++)
    sleep_ms(1);
}
 
static void execute_one(void);
 
#define WAIT_FLAGS __WALL
 
static void loop(void)
{
  pid_t parentPID = getpid();
  pid_t parentTID = gettid();
  int iter = 0;
  for (;; iter++) {
    printf("%d %d forking\n", parentPID, parentTID);
    int pid = fork();
    if (pid < 0)
      exit(1);
    if (pid == 0) {
      setup_test();
      execute_one();
      exit(0);
    }
    printf("%d %d forked child %d\n", parentPID, parentTID, pid);
    int status = 0;
    uint64_t start = current_time_ms();
    for (;;) {
      if (waitpid(-1, &status, WNOHANG | WAIT_FLAGS) == pid)
        break;
      sleep_ms(1);
      if (current_time_ms() - start < 5000)
        continue;
      kill_and_wait(pid, &status);
      break;
    }
    printf("%d %d checking leak after executor exited\n", parentPID, parentTID);
    check_leaks();
  }
}
 
uint64_t r[1] = {0xffffffffffffffff};
 
void execute_call(int call)
{
  pid_t pid = getpid();
  pid_t tid = gettid();
 
  intptr_t res = 0;
  switch (call) {
  case 0:
    printf("%d %d pipe2\n", pid, tid);
    res = syscall(__NR_pipe2, 0x20000080ul, 0ul);
    {
      int i;
      for(i = 0; i < 64 /* 32 also triggers the bug, but 16 doesn't */; i++) {
        syscall(__NR_pipe2, 0x20000080ul, 0ul);
      }
    }
    if (res != -1)
      r[0] = *(uint32_t*)0x20000080;
    break;
  case 1:
    printf("%d %d close_range\n", pid, tid);
    syscall(__NR_close_range, r[0], -1, 2ul /* CLOSE_RANGE_UNSHARE */);
    break;
  }
 
}
 
int main(void)
{
  syscall(__NR_mmap, 0x1ffff000ul, 0x1000ul, 0ul, 0x32ul, -1, 0ul);
  syscall(__NR_mmap, 0x20000000ul, 0x1000000ul, 7ul, 0x32ul, -1, 0ul);
  syscall(__NR_mmap, 0x21000000ul, 0x1000ul, 0ul, 0x32ul, -1, 0ul);
  setup_leak();
  printf("%d %d starting loop\n", getpid(), gettid());
  loop();
  return 0;
}

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 4/4] file: Fix file descriptor leak in copy_fd_bitmaps()
  2022-03-27 22:21         ` Linus Torvalds
  2022-03-29 10:23           ` Christian Brauner
@ 2022-03-29 20:44           ` aissur0002
  2022-03-29 21:02             ` Linus Torvalds
  2022-03-29 23:02           ` Alexey Khoroshilov
  2 siblings, 1 reply; 20+ messages in thread
From: aissur0002 @ 2022-03-29 20:44 UTC (permalink / raw)
  To: Fedor Pchelkin, Linus Torvalds
  Cc: Alexey Khoroshilov, Eric Biggers, Christian Brauner,
	Alexander Viro, linux-fsdevel, Linux Kernel Mailing List

> Btw, do you have a pointer to the syzbot report? I see the repro and
> the crashlog you attached, but it would be good to have that pointer
> to the syzbot original too.
> 
> Or did you just do this by running syzkaller yourself and there is no
> external report?

Alexey V. Khoroshilov (<khoroshilov@ispras.ru>) will soon answer about
the syzbot original, I suppose. Personally, I possess  only Crashlog and
repro.c file which I ran on a local machine and I don't know whether
there is an external report.

As for the solution you proposed, I agree with it: definitely the problem
was caused by an incorrect alignment of max_fds. Frankly speaking, I
didn't know that
> sane_fdtable_size() really should never return a value that
> isn't BITS_PER_LONG aligned 
because there is no explicit alignment of max_fds value in the code as
I can see.   





^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 4/4] file: Fix file descriptor leak in copy_fd_bitmaps()
  2022-03-29 20:44           ` aissur0002
@ 2022-03-29 21:02             ` Linus Torvalds
  2022-03-29 22:18               ` Linus Torvalds
  0 siblings, 1 reply; 20+ messages in thread
From: Linus Torvalds @ 2022-03-29 21:02 UTC (permalink / raw)
  To: Fedor Pchelkin
  Cc: Alexey Khoroshilov, Eric Biggers, Christian Brauner,
	Alexander Viro, linux-fsdevel, Linux Kernel Mailing List

On Tue, Mar 29, 2022 at 1:43 PM <aissur0002@gmail.com> wrote:
>
> As for the solution you proposed, I agree with it: definitely the problem
> was caused by an incorrect alignment of max_fds. Frankly speaking, I
> didn't know that
> > sane_fdtable_size() really should never return a value that
> > isn't BITS_PER_LONG aligned
> because there is no explicit alignment of max_fds value in the code as
> I can see.

Yeah, I think a lot of it is implicit and historical knowledge. Much
of it is basically just part of the whole "all bitmap operations act
on arrays of 'unsigned long'".

That whole bitmap base type is perhaps not as well known as it should
be, but it's one reason why the allocation granularity really *cannot*
be a byte - because on big-endian machines, the next bits you need is
not "one more byte". So on a 64-bit big-endian machine, the least
significant bits are not one byte away, but seven bytes away.

Of course, big-endian is fairly rare these days, so your "copy one
more byte" would have worked in practice on most machines out there.
Which together with "it's hard to hit this situation in the first
place" would have made it really hard to notice that it didn't
_really_ work.

I will apply that ALIGN() thing since Christian could confirm it fixes
things, and try to add a few more comments about how bitmaps are
fundamentally in chunks of BITS_PER_LONG.

             Linus

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 4/4] file: Fix file descriptor leak in copy_fd_bitmaps()
  2022-03-29 14:40             ` Christian Brauner
@ 2022-03-29 21:28               ` Linus Torvalds
  0 siblings, 0 replies; 20+ messages in thread
From: Linus Torvalds @ 2022-03-29 21:28 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Fedor Pchelkin, Alexey Khoroshilov, Eric Biggers, Alexander Viro,
	linux-fsdevel, Linux Kernel Mailing List

On Tue, Mar 29, 2022 at 7:40 AM Christian Brauner <brauner@kernel.org> wrote:
>
> I think alloc_fdtable() does everything correctly.

So I agree that alloc_fdtable() _works_, but it's certainly a bit opaque.

The reason it works is that

        nr *= (1024 / sizeof(struct file *));

and in practice, since a "sizeof(struct file *)" is either 4 or 8,
that will multiply 'nr' by either 256 or 128.

End result: 'nr' will most definitely always be a multiple of
BITS_PER_LONG, at least until we end up with 128-bit machines, in
which case the multiplier will be only 64.

So I'm inclined to add an explicit ALIGN() there too. I just verified
that at least clang notices that "hey, after multiplying by 128 and
the subsequent ALIGN() is a no-op".

Sadly, gcc seems to be too stupid to realize that. I'm surprised.

But it seems to be because gcc has terminally confused itself and
actually combined the shifts in roundup_pow_of_two() with the
multiply, and as a result has lost sight of the fact that we just
shifted by 7.

So that's a bit sad, but the extra two instructions gcc inserts are
pretty harmless in the end, and the clarification is worth it. And
maybe some day gcc will figure it out.

               Linus

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 4/4] file: Fix file descriptor leak in copy_fd_bitmaps()
  2022-03-29 21:02             ` Linus Torvalds
@ 2022-03-29 22:18               ` Linus Torvalds
  2022-03-29 22:23                 ` Linus Torvalds
  2022-03-30  5:21                 ` Jason A. Donenfeld
  0 siblings, 2 replies; 20+ messages in thread
From: Linus Torvalds @ 2022-03-29 22:18 UTC (permalink / raw)
  To: Fedor Pchelkin
  Cc: Alexey Khoroshilov, Eric Biggers, Christian Brauner,
	Alexander Viro, linux-fsdevel, Linux Kernel Mailing List

On Tue, Mar 29, 2022 at 2:02 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> I will apply that ALIGN() thing since Christian could confirm it fixes
> things, and try to add a few more comments about how bitmaps are
> fundamentally in chunks of BITS_PER_LONG.

Ok, applied as commit 1c24a186398f ("fs: fd tables have to be
multiples of BITS_PER_LONG").

I tried to add reasonable commentary too, both to the commit and the
file.c code.

Christian already tested that added ALIGN() for his test-case (that's
obviously assuming the obvious patch was identical - it can go in
multiple places), but the more the merrier, of course.

                 Linus

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 4/4] file: Fix file descriptor leak in copy_fd_bitmaps()
  2022-03-29 22:18               ` Linus Torvalds
@ 2022-03-29 22:23                 ` Linus Torvalds
  2022-03-30  7:47                   ` Christian Brauner
  2022-03-30  5:21                 ` Jason A. Donenfeld
  1 sibling, 1 reply; 20+ messages in thread
From: Linus Torvalds @ 2022-03-29 22:23 UTC (permalink / raw)
  To: Fedor Pchelkin
  Cc: Alexey Khoroshilov, Eric Biggers, Christian Brauner,
	Alexander Viro, linux-fsdevel, Linux Kernel Mailing List

On Tue, Mar 29, 2022 at 3:18 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Ok, applied as commit 1c24a186398f ("fs: fd tables have to be
> multiples of BITS_PER_LONG").

Oh, forgot to mention...

Christian - it strikes me that the whole "min(count, max_fds)" in
sane_fdtable_size() is a bit stupid.

A _smarter_ approach might be to pass in 'max_fds' to
count_open_files(), and simply not count past that value.

I didn't do that, because I wanted to keep the patch obvious. And it
probably doesn't matter, but it's kind of silly to possibly count a
lot of open files that we want to close anyway, when we already know
the place we should stop counting.

Whatever. I just wanted to mention it in case you decide you want to
clean that part up. This is mostly your code anyway.

               Linus

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 4/4] file: Fix file descriptor leak in copy_fd_bitmaps()
  2022-03-27 22:21         ` Linus Torvalds
  2022-03-29 10:23           ` Christian Brauner
  2022-03-29 20:44           ` aissur0002
@ 2022-03-29 23:02           ` Alexey Khoroshilov
  2 siblings, 0 replies; 20+ messages in thread
From: Alexey Khoroshilov @ 2022-03-29 23:02 UTC (permalink / raw)
  To: Linus Torvalds, Fedor Pchelkin
  Cc: Eric Biggers, Christian Brauner, Alexander Viro, linux-fsdevel,
	Linux Kernel Mailing List

On 28.03.2022 01:21, Linus Torvalds wrote:

> Btw, do you have a pointer to the syzbot report? I see the repro and
> the crashlog you attached, but it would be good to have that pointer
> to the syzbot original too.
> 
> Or did you just do this by running syzkaller yourself and there is no
> external report?

We are deploying a syzkaller instance in Linux Verification Center of
ISPRAS (linuxtesting.org), but we have no yet a public web server to be
referred.

--
Alexey Khoroshilov


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 4/4] file: Fix file descriptor leak in copy_fd_bitmaps()
  2022-03-29 22:18               ` Linus Torvalds
  2022-03-29 22:23                 ` Linus Torvalds
@ 2022-03-30  5:21                 ` Jason A. Donenfeld
  2022-03-30  6:08                   ` Linus Torvalds
  1 sibling, 1 reply; 20+ messages in thread
From: Jason A. Donenfeld @ 2022-03-30  5:21 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Fedor Pchelkin, Alexey Khoroshilov, Eric Biggers,
	Christian Brauner, Alexander Viro, linux-fsdevel,
	Linux Kernel Mailing List

Hi Linus,

On Tue, Mar 29, 2022 at 03:18:56PM -0700, Linus Torvalds wrote:
> On Tue, Mar 29, 2022 at 2:02 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > I will apply that ALIGN() thing since Christian could confirm it fixes
> > things, and try to add a few more comments about how bitmaps are
> > fundamentally in chunks of BITS_PER_LONG.
> 
> Ok, applied as commit 1c24a186398f ("fs: fd tables have to be
> multiples of BITS_PER_LONG").

This broke the WireGuard test suite, <https://www.wireguard.com/build-status/>,
on 32-bit archs with a line like:

[+] NS1: wg set wg0 private-key /dev/fd/63 listen-port 1 peer xb6I3yo5N/A9PXGeqSVdMywrogPz82Ug5vWTdqQJRF8= preshared-key /dev/fd/62 allowed-ips 192.168.241.2/32,fd00::2/128
fopen: No such file or directory

Those /dev/fd/63 and /dev/fd/62 are coming from bash process
redirection:

n1 wg set wg0 private-key <(echo "$key1") peer "$pub2" preshared-key <(echo "$psk") allowed-ips 192.168.241.2/32 endpoint 127.0.0.1:2

Peppering some printks, it looks like in `max_fds = ALIGN(max_fds,
BITS_PER_LONG);`, max_fds is sometimes 4294967295 before the call, and
that ALIGN winds up bringing it to 0.

Jason

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 4/4] file: Fix file descriptor leak in copy_fd_bitmaps()
  2022-03-30  5:21                 ` Jason A. Donenfeld
@ 2022-03-30  6:08                   ` Linus Torvalds
  2022-03-30  6:21                     ` Jason A. Donenfeld
  0 siblings, 1 reply; 20+ messages in thread
From: Linus Torvalds @ 2022-03-30  6:08 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: Fedor Pchelkin, Alexey Khoroshilov, Eric Biggers,
	Christian Brauner, Alexander Viro, linux-fsdevel,
	Linux Kernel Mailing List

On Tue, Mar 29, 2022 at 10:21 PM Jason A. Donenfeld <Jason@zx2c4.com> wrote:
>
> Peppering some printks, it looks like in `max_fds = ALIGN(max_fds,
> BITS_PER_LONG);`, max_fds is sometimes 4294967295 before the call, and
> that ALIGN winds up bringing it to 0.

Gaah. I actually went back and forth on the location of the ALIGN().

And then ended up putting it where it is for just a "smallest patch"
reason, which was clearly not the right thing to do.

The easier and more obvious fix was to just make the ALIGN() be at the
final 'return' statement, but I ended up moving it up just because I
didn't like how complicated the expression looked.

That was obviously very very wrong of me.

So does it help if you just remove that

        max_fds = ALIGN(max_fds, BITS_PER_LONG);

and instead make the final return be

        return ALIGN(min(count, max_fds), BITS_PER_LONG);

instead?

And now I feel like I should as penance just do what I tried to get
Christian to do, which was to just integrate the whole "don't even
bother looking past that passed-in max_fds" in count_open_files().

The whole organization of that "calculate current highest fd, only to
then ignore it if we didn't want that many file descriptors" is just
historical baggage.

                 Linus

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 4/4] file: Fix file descriptor leak in copy_fd_bitmaps()
  2022-03-30  6:08                   ` Linus Torvalds
@ 2022-03-30  6:21                     ` Jason A. Donenfeld
  2022-03-30  6:28                       ` Linus Torvalds
  0 siblings, 1 reply; 20+ messages in thread
From: Jason A. Donenfeld @ 2022-03-30  6:21 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Fedor Pchelkin, Alexey Khoroshilov, Eric Biggers,
	Christian Brauner, Alexander Viro, linux-fsdevel,
	Linux Kernel Mailing List

Hi Linus,

On Tue, Mar 29, 2022 at 11:08:55PM -0700, Linus Torvalds wrote:
> So does it help if you just remove that
> 
>         max_fds = ALIGN(max_fds, BITS_PER_LONG);
> 
> and instead make the final return be
> 
>         return ALIGN(min(count, max_fds), BITS_PER_LONG);
> 
> instead?

Yep, that works:

diff --git a/fs/file.c b/fs/file.c
index c01c29417ae6..ee9317346702 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -303,10 +303,9 @@ static unsigned int sane_fdtable_size(struct fdtable *fdt, unsigned int max_fds)
 	unsigned int count;

 	count = count_open_files(fdt);
-	max_fds = ALIGN(max_fds, BITS_PER_LONG);
 	if (max_fds < NR_OPEN_DEFAULT)
 		max_fds = NR_OPEN_DEFAULT;
-	return min(count, max_fds);
+	return ALIGN(min(count, max_fds), BITS_PER_LONG);
 }

 /*

Jason

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH 4/4] file: Fix file descriptor leak in copy_fd_bitmaps()
  2022-03-30  6:21                     ` Jason A. Donenfeld
@ 2022-03-30  6:28                       ` Linus Torvalds
  2022-03-30  6:43                         ` Linus Torvalds
  0 siblings, 1 reply; 20+ messages in thread
From: Linus Torvalds @ 2022-03-30  6:28 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: Fedor Pchelkin, Alexey Khoroshilov, Eric Biggers,
	Christian Brauner, Alexander Viro, linux-fsdevel,
	Linux Kernel Mailing List

On Tue, Mar 29, 2022 at 11:21 PM Jason A. Donenfeld <Jason@zx2c4.com> wrote:
>
> Yep, that works:

Ok, I'll apply that asap.

I think the "do this better" version is to get rid of
count_open_files() and moving all the logic into sane_fdtable_size(),
and you end up with something like this:

    static unsigned int sane_fdtable_size(struct fdtable *fdt,
unsigned int max_fds)
    {
        unsigned int i;

        max_fds = min(max_fds, fdt->max_fds);
        i = (max_fds + BITS_PER_LONG - 1) / BITS_PER_LONG;
        while (i > 0) {
                if (fdt->open_fds[--i])
                        break;
        }
        return (i + 1) * BITS_PER_LONG;
    }

but considering that I couldn't even get the simple ALIGN() in the
right place, I think going with the obvious version I should have
picked originally is the way to go..

              Linus

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 4/4] file: Fix file descriptor leak in copy_fd_bitmaps()
  2022-03-30  6:28                       ` Linus Torvalds
@ 2022-03-30  6:43                         ` Linus Torvalds
  0 siblings, 0 replies; 20+ messages in thread
From: Linus Torvalds @ 2022-03-30  6:43 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: Fedor Pchelkin, Alexey Khoroshilov, Eric Biggers,
	Christian Brauner, Alexander Viro, linux-fsdevel,
	Linux Kernel Mailing List

On Tue, Mar 29, 2022 at 11:28 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Ok, I'll apply that asap.

Ok, pushed out. This time with no last-minute patch cleanup.

              Linus

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 4/4] file: Fix file descriptor leak in copy_fd_bitmaps()
  2022-03-29 22:23                 ` Linus Torvalds
@ 2022-03-30  7:47                   ` Christian Brauner
  0 siblings, 0 replies; 20+ messages in thread
From: Christian Brauner @ 2022-03-30  7:47 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Fedor Pchelkin, Alexey Khoroshilov, Eric Biggers, Alexander Viro,
	linux-fsdevel, Linux Kernel Mailing List

On Tue, Mar 29, 2022 at 03:23:10PM -0700, Linus Torvalds wrote:
> On Tue, Mar 29, 2022 at 3:18 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > Ok, applied as commit 1c24a186398f ("fs: fd tables have to be
> > multiples of BITS_PER_LONG").
> 
> Oh, forgot to mention...
> 
> Christian - it strikes me that the whole "min(count, max_fds)" in
> sane_fdtable_size() is a bit stupid.
> 
> A _smarter_ approach might be to pass in 'max_fds' to
> count_open_files(), and simply not count past that value.
> 
> I didn't do that, because I wanted to keep the patch obvious. And it
> probably doesn't matter, but it's kind of silly to possibly count a
> lot of open files that we want to close anyway, when we already know
> the place we should stop counting.
> 
> Whatever. I just wanted to mention it in case you decide you want to
> clean that part up. This is mostly your code anyway.

I put it on my TODO and will look at this soon!

Thank you!
Christian

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2022-03-30  7:47 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-03-26 11:40 [PATCH 4/4] file: Fix file descriptor leak in copy_fd_bitmaps() Fedor Pchelkin
2022-03-26 14:17 ` Alexey Khoroshilov
2022-03-26 22:15   ` Linus Torvalds
2022-03-26 22:37     ` Linus Torvalds
2022-03-27 21:54       ` aissur0002
2022-03-27 22:21         ` Linus Torvalds
2022-03-29 10:23           ` Christian Brauner
2022-03-29 14:40             ` Christian Brauner
2022-03-29 21:28               ` Linus Torvalds
2022-03-29 20:44           ` aissur0002
2022-03-29 21:02             ` Linus Torvalds
2022-03-29 22:18               ` Linus Torvalds
2022-03-29 22:23                 ` Linus Torvalds
2022-03-30  7:47                   ` Christian Brauner
2022-03-30  5:21                 ` Jason A. Donenfeld
2022-03-30  6:08                   ` Linus Torvalds
2022-03-30  6:21                     ` Jason A. Donenfeld
2022-03-30  6:28                       ` Linus Torvalds
2022-03-30  6:43                         ` Linus Torvalds
2022-03-29 23:02           ` Alexey Khoroshilov

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.