linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* fd == 0 means AT_FDCWD BPF_OBJ_GET commands
       [not found]     ` <CAEf4BzafCCeRm9M8pPzpwexadKy5OAEmrYcnVpKmqNJ2tnSVuw@mail.gmail.com>
@ 2023-05-17  9:11       ` Christian Brauner
  2023-05-17 12:05         ` Christoph Hellwig
  2023-05-18 21:56         ` Andrii Nakryiko
  0 siblings, 2 replies; 21+ messages in thread
From: Christian Brauner @ 2023-05-17  9:11 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Andrii Nakryiko, bpf, ast, daniel, martin.lau, cyphar, lennart,
	linux-fsdevel, Christoph Hellwig, Al Viro

On Tue, May 16, 2023 at 11:02:42AM -0700, Andrii Nakryiko wrote:
> On Tue, May 16, 2023 at 2:07 AM Christian Brauner <brauner@kernel.org> wrote:
> >
> > On Mon, May 15, 2023 at 05:13:46PM -0700, Andrii Nakryiko wrote:
> > > Current UAPI of BPF_OBJ_PIN and BPF_OBJ_GET commands of bpf() syscall
> > > forces users to specify pinning location as a string-based absolute or
> > > relative (to current working directory) path. This has various
> > > implications related to security (e.g., symlink-based attacks), forces
> > > BPF FS to be exposed in the file system, which can cause races with
> > > other applications.
> > >
> > > One of the feedbacks we got from folks working with containers heavily
> > > was that inability to use purely FD-based location specification was an
> > > unfortunate limitation and hindrance for BPF_OBJ_PIN and BPF_OBJ_GET
> > > commands. This patch closes this oversight, adding path_fd field to
> >
> > Cool!
> >
> > > BPF_OBJ_PIN and BPF_OBJ_GET UAPI, following conventions established by
> > > *at() syscalls for dirfd + pathname combinations.
> > >
> > > This now allows interesting possibilities like working with detached BPF
> > > FS mount (e.g., to perform multiple pinnings without running a risk of
> > > someone interfering with them), and generally making pinning/getting
> > > more secure and not prone to any races and/or security attacks.
> > >
> > > This is demonstrated by a selftest added in subsequent patch that takes
> > > advantage of new mount APIs (fsopen, fsconfig, fsmount) to demonstrate
> > > creating detached BPF FS mount, pinning, and then getting BPF map out of
> > > it, all while never exposing this private instance of BPF FS to outside
> > > worlds.
> > >
> > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> > > ---
> > >  include/linux/bpf.h            |  4 ++--
> > >  include/uapi/linux/bpf.h       |  5 +++++
> > >  kernel/bpf/inode.c             | 16 ++++++++--------
> > >  kernel/bpf/syscall.c           |  8 +++++---
> > >  tools/include/uapi/linux/bpf.h |  5 +++++
> > >  5 files changed, 25 insertions(+), 13 deletions(-)
> > >
> > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > > index 36e4b2d8cca2..f58895830ada 100644
> > > --- a/include/linux/bpf.h
> > > +++ b/include/linux/bpf.h
> > > @@ -2077,8 +2077,8 @@ struct file *bpf_link_new_file(struct bpf_link *link, int *reserved_fd);
> > >  struct bpf_link *bpf_link_get_from_fd(u32 ufd);
> > >  struct bpf_link *bpf_link_get_curr_or_next(u32 *id);
> > >
> > > -int bpf_obj_pin_user(u32 ufd, const char __user *pathname);
> > > -int bpf_obj_get_user(const char __user *pathname, int flags);
> > > +int bpf_obj_pin_user(u32 ufd, int path_fd, const char __user *pathname);
> > > +int bpf_obj_get_user(int path_fd, const char __user *pathname, int flags);
> > >
> > >  #define BPF_ITER_FUNC_PREFIX "bpf_iter_"
> > >  #define DEFINE_BPF_ITER_FUNC(target, args...)                        \
> > > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > > index 1bb11a6ee667..db2870a52ce0 100644
> > > --- a/include/uapi/linux/bpf.h
> > > +++ b/include/uapi/linux/bpf.h
> > > @@ -1420,6 +1420,11 @@ union bpf_attr {
> > >               __aligned_u64   pathname;
> > >               __u32           bpf_fd;
> > >               __u32           file_flags;
> > > +             /* same as dirfd in openat() syscall; see openat(2)
> > > +              * manpage for details of dirfd/path_fd and pathname semantics;
> > > +              * zero path_fd implies AT_FDCWD behavior
> > > +              */
> > > +             __u32           path_fd;
> > >       };
> >
> > So 0 is a valid file descriptor and can trivially be created and made to
> > refer to any file. Is this a conscious decision to have a zero value
> > imply AT_FDCWD and have you done this somewhere else in bpf already?
> > Because that's contrary to how any file descriptor based apis work.
> >
> > How this is usually solved for extensible structs is to have a flag
> > field that raises a flag to indicate that the fd fiel is set and thus 0
> > can be used as a valid value.
> >
> > The way you're doing it right now is very counterintuitive to userspace
> > and pretty much guaranteed to cause subtle bugs.
> 
> Yes, it's a very bpf()-specific convention we've settled on a while
> ago. It allows a cleaner and simpler backwards compatibility story
> without having to introduce new flags every single time. Most of BPF
> UAPI by now dictates that (otherwise valid) FD 0 can't be used to pass
> it to bpf() syscall. Most of the time users will be blissfully unaware
> because libbpf and other BPF libraries are checking for fd == 0 and
> dup()'ing them to avoid ever returning FD 0 to the user.
> 
> tl;dr, a conscious decision consistent with the rest of BPF UAPI. It
> is a bpf() peculiarity, yes.

Adding fsdevel so we're aware of this quirk.

So I'm not sure whether this was ever discussed on fsdevel when you took
the decision to treat fd 0 as AT_FDCWD or in general treat fd 0 as an
invalid value.

If it was discussed then great but if not then I would like to make it
very clear that if in the future you decide to introduce custom
semantics for vfs provided infrastructure - especially when exposed to
userspace - that you please Cc us.

You often make it very clear on the list that you don't like it when
anything that touches bpf code doesn't end up getting sent to the bpf
mailing list. It is exactly the same for us.

This is not a rant I'm really just trying to make sure that we agree on
common ground when it comes to touching each others code or semantic
assumptions.

I personally find this extremely weird to treat fd 0 as anything other
than a random fd number as it goes against any userspace assumptions and
drastically deviates from basically every file descriptor interface we
have. I mean, you're not just saying fd 0 is invalid you're even saying
it means AT_FDCWD.

For every other interface, including those that pass fds in structs
whose extensibility is premised on unknown fields being set to zero,
have ways to make fd 0 work just fine. You could've done that to without
inventing custom fd semantics.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: fd == 0 means AT_FDCWD BPF_OBJ_GET commands
  2023-05-17  9:11       ` fd == 0 means AT_FDCWD BPF_OBJ_GET commands Christian Brauner
@ 2023-05-17 12:05         ` Christoph Hellwig
  2023-05-17 16:17           ` Alexei Starovoitov
  2023-05-18 21:56         ` Andrii Nakryiko
  1 sibling, 1 reply; 21+ messages in thread
From: Christoph Hellwig @ 2023-05-17 12:05 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Andrii Nakryiko, Andrii Nakryiko, bpf, ast, daniel, martin.lau,
	cyphar, lennart, linux-fsdevel, Christoph Hellwig, Al Viro

On Wed, May 17, 2023 at 11:11:24AM +0200, Christian Brauner wrote:
> Adding fsdevel so we're aware of this quirk.
> 
> So I'm not sure whether this was ever discussed on fsdevel when you took
> the decision to treat fd 0 as AT_FDCWD or in general treat fd 0 as an
> invalid value.

I've never heard of this before, and I think it is compltely
unacceptable. 0 ist just a normal FD, although one that happens to
have specific meaning in userspace as stdin.

> 
> If it was discussed then great but if not then I would like to make it
> very clear that if in the future you decide to introduce custom
> semantics for vfs provided infrastructure - especially when exposed to
> userspace - that you please Cc us.

I don't think it's just the future.  We really need to undo this ASAP.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: fd == 0 means AT_FDCWD BPF_OBJ_GET commands
  2023-05-17 12:05         ` Christoph Hellwig
@ 2023-05-17 16:17           ` Alexei Starovoitov
  2023-05-17 21:48             ` Alexei Starovoitov
  2023-05-18  8:38             ` Christian Brauner
  0 siblings, 2 replies; 21+ messages in thread
From: Alexei Starovoitov @ 2023-05-17 16:17 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Christian Brauner, Andrii Nakryiko, Andrii Nakryiko, bpf,
	Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau,
	Aleksa Sarai, Lennart Poettering, Linux-Fsdevel, Al Viro

On Wed, May 17, 2023 at 5:05 AM Christoph Hellwig <hch@lst.de> wrote:
>
> On Wed, May 17, 2023 at 11:11:24AM +0200, Christian Brauner wrote:
> > Adding fsdevel so we're aware of this quirk.
> >
> > So I'm not sure whether this was ever discussed on fsdevel when you took
> > the decision to treat fd 0 as AT_FDCWD or in general treat fd 0 as an
> > invalid value.
>
> I've never heard of this before, and I think it is compltely
> unacceptable. 0 ist just a normal FD, although one that happens to
> have specific meaning in userspace as stdin.
>
> >
> > If it was discussed then great but if not then I would like to make it
> > very clear that if in the future you decide to introduce custom
> > semantics for vfs provided infrastructure - especially when exposed to
> > userspace - that you please Cc us.
>
> I don't think it's just the future.  We really need to undo this ASAP.

Christian is not correct in stating that treatment of fd==0 as invalid
bpf object applies to vfs fd-s.
The path_fd addition in this patch is really the very first one of this kind.
At the same time bpf anon fd-s (progs, maps, links, btfs) with fd == 0
are invalid and this is not going to change. It's been uapi for a long time.

More so fd-s 0,1,2 are not "normal FDs".
Unix has made two mistakes:
1. fd==0 being valid fd
2. establishing convention that fd-s 0,1,2 are stdin, stdout, stderr.

The first mistake makes it hard to pass FD without an extra flag.
The 2nd mistake is just awful.
We've seen plenty of severe datacenter wide issues because some
library or piece of software assumes stdin/out/err.
Various services have been hurt badly by this "convention".
In libbpf we added ensure_good_fd() to make sure none of bpf objects
(progs, maps, etc) are ever seen with fd=0,1,2.
Other pieces of datacenter software enforce the same.

In other words fds=0,1,2 are taken. They must not be anything but
stdin/out/err or gutted to /dev/null.
Otherwise expect horrible bugs and multi day debugging.

Because of that, several years ago, we've decided to fix unix mistake #1
when it comes to bpf objects and started reserving fd=0 as invalid.
This patch is proposing to do the same for path_fd (normal vfs fd) when
it is passed to bpf syscall. I think it's a good trade-off and fits
the rest of bpf uapi.

Everyone who's hiding behind statements: but POSIX is a standard..
or this is how we've been doing things... are ignoring the practical
situation at hand. fd-s 0,1,2 are taken. Make sure your sw never produces them.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: fd == 0 means AT_FDCWD BPF_OBJ_GET commands
  2023-05-17 16:17           ` Alexei Starovoitov
@ 2023-05-17 21:48             ` Alexei Starovoitov
  2023-05-18  8:38             ` Christian Brauner
  1 sibling, 0 replies; 21+ messages in thread
From: Alexei Starovoitov @ 2023-05-17 21:48 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Christian Brauner, Andrii Nakryiko, Andrii Nakryiko, bpf,
	Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau,
	Aleksa Sarai, Lennart Poettering, Linux-Fsdevel, Al Viro

On Wed, May 17, 2023 at 9:17 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Wed, May 17, 2023 at 5:05 AM Christoph Hellwig <hch@lst.de> wrote:
> >
> > On Wed, May 17, 2023 at 11:11:24AM +0200, Christian Brauner wrote:
> > > Adding fsdevel so we're aware of this quirk.
> > >
> > > So I'm not sure whether this was ever discussed on fsdevel when you took
> > > the decision to treat fd 0 as AT_FDCWD or in general treat fd 0 as an
> > > invalid value.
> >
> > I've never heard of this before, and I think it is compltely
> > unacceptable. 0 ist just a normal FD, although one that happens to
> > have specific meaning in userspace as stdin.
> >
> > >
> > > If it was discussed then great but if not then I would like to make it
> > > very clear that if in the future you decide to introduce custom
> > > semantics for vfs provided infrastructure - especially when exposed to
> > > userspace - that you please Cc us.
> >
> > I don't think it's just the future.  We really need to undo this ASAP.
>
> Christian is not correct in stating that treatment of fd==0 as invalid
> bpf object applies to vfs fd-s.
> The path_fd addition in this patch is really the very first one of this kind.
> At the same time bpf anon fd-s (progs, maps, links, btfs) with fd == 0
> are invalid and this is not going to change. It's been uapi for a long time.
>
> More so fd-s 0,1,2 are not "normal FDs".
> Unix has made two mistakes:
> 1. fd==0 being valid fd
> 2. establishing convention that fd-s 0,1,2 are stdin, stdout, stderr.
>
> The first mistake makes it hard to pass FD without an extra flag.
> The 2nd mistake is just awful.
> We've seen plenty of severe datacenter wide issues because some
> library or piece of software assumes stdin/out/err.
> Various services have been hurt badly by this "convention".
> In libbpf we added ensure_good_fd() to make sure none of bpf objects
> (progs, maps, etc) are ever seen with fd=0,1,2.
> Other pieces of datacenter software enforce the same.
>
> In other words fds=0,1,2 are taken. They must not be anything but
> stdin/out/err or gutted to /dev/null.
> Otherwise expect horrible bugs and multi day debugging.
>
> Because of that, several years ago, we've decided to fix unix mistake #1
> when it comes to bpf objects and started reserving fd=0 as invalid.
> This patch is proposing to do the same for path_fd (normal vfs fd) when
> it is passed to bpf syscall. I think it's a good trade-off and fits
> the rest of bpf uapi.
>
> Everyone who's hiding behind statements: but POSIX is a standard..
> or this is how we've been doing things... are ignoring the practical
> situation at hand. fd-s 0,1,2 are taken. Make sure your sw never produces them.

Summarizing an offlist discussion with Christian and Andrii.

The key issue is that fd == 0 must not mean AT_FDCWD and that's clear.
We'll respin with an extra flag to indicate that path_fd should be used.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: fd == 0 means AT_FDCWD BPF_OBJ_GET commands
  2023-05-17 16:17           ` Alexei Starovoitov
  2023-05-17 21:48             ` Alexei Starovoitov
@ 2023-05-18  8:38             ` Christian Brauner
  2023-05-18 14:30               ` Theodore Ts'o
  2023-05-18 16:25               ` Alexei Starovoitov
  1 sibling, 2 replies; 21+ messages in thread
From: Christian Brauner @ 2023-05-18  8:38 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Christoph Hellwig, Andrii Nakryiko, Andrii Nakryiko, bpf,
	Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau,
	Aleksa Sarai, Lennart Poettering, Linux-Fsdevel, Al Viro

On Wed, May 17, 2023 at 09:17:36AM -0700, Alexei Starovoitov wrote:
> On Wed, May 17, 2023 at 5:05 AM Christoph Hellwig <hch@lst.de> wrote:
> >
> > On Wed, May 17, 2023 at 11:11:24AM +0200, Christian Brauner wrote:
> > > Adding fsdevel so we're aware of this quirk.
> > >
> > > So I'm not sure whether this was ever discussed on fsdevel when you took
> > > the decision to treat fd 0 as AT_FDCWD or in general treat fd 0 as an
> > > invalid value.
> >
> > I've never heard of this before, and I think it is compltely
> > unacceptable. 0 ist just a normal FD, although one that happens to
> > have specific meaning in userspace as stdin.
> >
> > >
> > > If it was discussed then great but if not then I would like to make it
> > > very clear that if in the future you decide to introduce custom
> > > semantics for vfs provided infrastructure - especially when exposed to
> > > userspace - that you please Cc us.
> >
> > I don't think it's just the future.  We really need to undo this ASAP.
> 
> Christian is not correct in stating that treatment of fd==0 as invalid
> bpf object applies to vfs fd-s.
> The path_fd addition in this patch is really the very first one of this kind.
> At the same time bpf anon fd-s (progs, maps, links, btfs) with fd == 0
> are invalid and this is not going to change. It's been uapi for a long time.
> 
> More so fd-s 0,1,2 are not "normal FDs".
> Unix has made two mistakes:
> 1. fd==0 being valid fd
> 2. establishing convention that fd-s 0,1,2 are stdin, stdout, stderr.
> 
> The first mistake makes it hard to pass FD without an extra flag.
> The 2nd mistake is just awful.
> We've seen plenty of severe datacenter wide issues because some
> library or piece of software assumes stdin/out/err.
> Various services have been hurt badly by this "convention".
> In libbpf we added ensure_good_fd() to make sure none of bpf objects
> (progs, maps, etc) are ever seen with fd=0,1,2.
> Other pieces of datacenter software enforce the same.
> 
> In other words fds=0,1,2 are taken. They must not be anything but
> stdin/out/err or gutted to /dev/null.
> Otherwise expect horrible bugs and multi day debugging.
> 
> Because of that, several years ago, we've decided to fix unix mistake #1
> when it comes to bpf objects and started reserving fd=0 as invalid.
> This patch is proposing to do the same for path_fd (normal vfs fd) when

It isn't as you now realized but I'm glad we cleared that up off-list.

> it is passed to bpf syscall. I think it's a good trade-off and fits
> the rest of bpf uapi.
> 
> Everyone who's hiding behind statements: but POSIX is a standard..
> or this is how we've been doing things... are ignoring the practical
> situation at hand. fd-s 0,1,2 are taken. Make sure your sw never produces them.

(Prefix: Imagine me calmly writing this and in a relaxed tone.)

Just to clarify. I do think that deciding that 0 is an invalid file
descriptor number is weird and I wish you'd have discussed this with us
before you took that decision. You've seen the reaction that other
low-level userspace people you talked to had to these news...

I'm not sure what to make of the POSIX excursion. I think that it is a
complete sideshow to the issue here in a way. But fwiw...

We don't follow arbitrary conventions such as 0, 1, and 2 because we all
have sworn allegiance to The Secret Order of the POSIX but because the
alternative is that one subsystem finds it neat to use fd 0 to refer to
AT_FDCWD and another one to AT_MY_CUSTOM_FD0_MEANING. Which is exactly
what would've happened if this patch would have made it unnoticed.

This doesn't scale and our interfaces aren't designed around
Shakespeare's dictum "What's in a name?".

This will quickly devolve into a situation similar to letting a step on
a staircase be off by a few millimeters. See those users falling.

I'm glad we cleared this up. My main issue is indeed that fd 0 now isn't
just forbidden it would be given an entirely different meaning which is
not acceptable from the vfs perspective.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: fd == 0 means AT_FDCWD BPF_OBJ_GET commands
  2023-05-18  8:38             ` Christian Brauner
@ 2023-05-18 14:30               ` Theodore Ts'o
  2023-05-18 16:25               ` Alexei Starovoitov
  1 sibling, 0 replies; 21+ messages in thread
From: Theodore Ts'o @ 2023-05-18 14:30 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Alexei Starovoitov, Christoph Hellwig, Andrii Nakryiko,
	Andrii Nakryiko, bpf, Alexei Starovoitov, Daniel Borkmann,
	Martin KaFai Lau, Aleksa Sarai, Lennart Poettering,
	Linux-Fsdevel, Al Viro

The other thing to note is that while the *convention* may be that 0,
1, and 2 are for stdin, stdout, and stderr, this is a *userspae*
convention.  After all, system daemons like getty, gnome-terminal,
et. al, need to be able to open file descriptors for stdin, stdout,
and stderr, and it would be.....highly undesirable for the kernel to
have to special case those processes from being able to open those
file descriptors.  So in the eyes of Kernel to Userspace API's we
should not specially privilege the meaning of file descriptors 0, 1,
and 2.

Besides, we have a perfectly good way of expressing "not a FD" and
that is negative values!  File descriptors, after all, are signed
integers.

Finally, by having some kernel subsystem have a different meaning for
fd 0 means that there are potential security vulernabilities.  It may
be the case that userspace *SHOULD* not use fd 0 for anythingn other
than stdin, and that should be something which should be handed to it
by its parent process.

However, consider what might happen if a malicious program where to
exec a process, perhaps a setuid process, with fd 0 closed.  Now the
first file opened by that program will be assigned fd 0, and if that
gets passed to BPF, something surprising and wonderous --- but
hopefully not something that can be leveraged to be a high severity
security vulnerability --- may very well happen.

So if there is anyway to that we can change the BPF API's to change to
use negative values for special case meanings, we should do it.
Certainly for any new API's, and even for old API's, Linus has always
said that there are some special case times when we can break the
userspace ABI --- and security vulnerabilites are certainly one of
them.

Best regards,

					- Ted

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: fd == 0 means AT_FDCWD BPF_OBJ_GET commands
  2023-05-18  8:38             ` Christian Brauner
  2023-05-18 14:30               ` Theodore Ts'o
@ 2023-05-18 16:25               ` Alexei Starovoitov
  2023-05-18 16:33                 ` Matthew Wilcox
  2023-05-18 17:20                 ` Christian Brauner
  1 sibling, 2 replies; 21+ messages in thread
From: Alexei Starovoitov @ 2023-05-18 16:25 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Christoph Hellwig, Andrii Nakryiko, Andrii Nakryiko, bpf,
	Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau,
	Aleksa Sarai, Lennart Poettering, Linux-Fsdevel, Al Viro

On Thu, May 18, 2023 at 10:38:46AM +0200, Christian Brauner wrote:
> On Wed, May 17, 2023 at 09:17:36AM -0700, Alexei Starovoitov wrote:
> > On Wed, May 17, 2023 at 5:05 AM Christoph Hellwig <hch@lst.de> wrote:
> > >
> > > On Wed, May 17, 2023 at 11:11:24AM +0200, Christian Brauner wrote:
> > > > Adding fsdevel so we're aware of this quirk.
> > > >
> > > > So I'm not sure whether this was ever discussed on fsdevel when you took
> > > > the decision to treat fd 0 as AT_FDCWD or in general treat fd 0 as an
> > > > invalid value.
> > >
> > > I've never heard of this before, and I think it is compltely
> > > unacceptable. 0 ist just a normal FD, although one that happens to
> > > have specific meaning in userspace as stdin.
> > >
> > > >
> > > > If it was discussed then great but if not then I would like to make it
> > > > very clear that if in the future you decide to introduce custom
> > > > semantics for vfs provided infrastructure - especially when exposed to
> > > > userspace - that you please Cc us.
> > >
> > > I don't think it's just the future.  We really need to undo this ASAP.
> > 
> > Christian is not correct in stating that treatment of fd==0 as invalid
> > bpf object applies to vfs fd-s.
> > The path_fd addition in this patch is really the very first one of this kind.
> > At the same time bpf anon fd-s (progs, maps, links, btfs) with fd == 0
> > are invalid and this is not going to change. It's been uapi for a long time.
> > 
> > More so fd-s 0,1,2 are not "normal FDs".
> > Unix has made two mistakes:
> > 1. fd==0 being valid fd
> > 2. establishing convention that fd-s 0,1,2 are stdin, stdout, stderr.
> > 
> > The first mistake makes it hard to pass FD without an extra flag.
> > The 2nd mistake is just awful.
> > We've seen plenty of severe datacenter wide issues because some
> > library or piece of software assumes stdin/out/err.
> > Various services have been hurt badly by this "convention".
> > In libbpf we added ensure_good_fd() to make sure none of bpf objects
> > (progs, maps, etc) are ever seen with fd=0,1,2.
> > Other pieces of datacenter software enforce the same.
> > 
> > In other words fds=0,1,2 are taken. They must not be anything but
> > stdin/out/err or gutted to /dev/null.
> > Otherwise expect horrible bugs and multi day debugging.
> > 
> > Because of that, several years ago, we've decided to fix unix mistake #1
> > when it comes to bpf objects and started reserving fd=0 as invalid.
> > This patch is proposing to do the same for path_fd (normal vfs fd) when
> 
> It isn't as you now realized but I'm glad we cleared that up off-list.
> 
> > it is passed to bpf syscall. I think it's a good trade-off and fits
> > the rest of bpf uapi.
> > 
> > Everyone who's hiding behind statements: but POSIX is a standard..
> > or this is how we've been doing things... are ignoring the practical
> > situation at hand. fd-s 0,1,2 are taken. Make sure your sw never produces them.
> 
> (Prefix: Imagine me calmly writing this and in a relaxed tone.)
> 
> Just to clarify. I do think that deciding that 0 is an invalid file

We're still talking past each other.
0 is an invalid bpf object. Not file.
There is a difference.
The kernel is breaking user space by returning non-file FDs in 0,1,2.
Especially as fd = 1 and 2.
ensure_good_fd() in libbpf is a library workaround to make sure bpf objects
are not the reason for user app brekage.
I firmly believe that making kernel return socket FDs and other special FDs with fd >=3
(under new sysctl, for example) will prevent user space breakage.

And to answer Ted's question..
Yes. It's a security issue, but it's the other way around.
The kernel returning non vfs file FD in [0,1,2] range is a security issue.
I'm proposing to fix it with new sysctl or boot flag.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: fd == 0 means AT_FDCWD BPF_OBJ_GET commands
  2023-05-18 16:25               ` Alexei Starovoitov
@ 2023-05-18 16:33                 ` Matthew Wilcox
  2023-05-18 17:22                   ` Christian Brauner
  2023-05-18 17:20                 ` Christian Brauner
  1 sibling, 1 reply; 21+ messages in thread
From: Matthew Wilcox @ 2023-05-18 16:33 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Christian Brauner, Christoph Hellwig, Andrii Nakryiko,
	Andrii Nakryiko, bpf, Alexei Starovoitov, Daniel Borkmann,
	Martin KaFai Lau, Aleksa Sarai, Lennart Poettering,
	Linux-Fsdevel, Al Viro

On Thu, May 18, 2023 at 09:25:08AM -0700, Alexei Starovoitov wrote:
> We're still talking past each other.
> 0 is an invalid bpf object. Not file.
> There is a difference.
> The kernel is breaking user space by returning non-file FDs in 0,1,2.
> Especially as fd = 1 and 2.
> ensure_good_fd() in libbpf is a library workaround to make sure bpf objects
> are not the reason for user app brekage.
> I firmly believe that making kernel return socket FDs and other special FDs with fd >=3
> (under new sysctl, for example) will prevent user space breakage.

Wait, why are socket FDs special?  I shouldn't be able to have anything
but chardev fds, pipes and regular files as fd 0,1,2?  I agree that having
directory fds and blockdev fds as fd 0,1,2 are confusing and pointless,
but I see the value in having a TCP socket as stdin/stdout/stderr.

If a fd shouldn't be used for stdio, having an ioctl to enable it
and read/write return errors until/unless it's enabled makes sense.
But now we have to label each fd as safe/not-safe for stdio, which we
can as easily do by setting up our fops appropriately.  So I'm not sure
what you're trying to accomplish here.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: fd == 0 means AT_FDCWD BPF_OBJ_GET commands
  2023-05-18 16:25               ` Alexei Starovoitov
  2023-05-18 16:33                 ` Matthew Wilcox
@ 2023-05-18 17:20                 ` Christian Brauner
  2023-05-18 17:33                   ` Linus Torvalds
  2023-05-18 18:26                   ` Alexei Starovoitov
  1 sibling, 2 replies; 21+ messages in thread
From: Christian Brauner @ 2023-05-18 17:20 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Christoph Hellwig, Andrii Nakryiko, Andrii Nakryiko, bpf,
	Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau,
	Aleksa Sarai, Lennart Poettering, Linux-Fsdevel, Al Viro,
	Linus Torvalds

On Thu, May 18, 2023 at 09:25:08AM -0700, Alexei Starovoitov wrote:
> On Thu, May 18, 2023 at 10:38:46AM +0200, Christian Brauner wrote:
> > On Wed, May 17, 2023 at 09:17:36AM -0700, Alexei Starovoitov wrote:
> > > On Wed, May 17, 2023 at 5:05 AM Christoph Hellwig <hch@lst.de> wrote:
> > > >
> > > > On Wed, May 17, 2023 at 11:11:24AM +0200, Christian Brauner wrote:
> > > > > Adding fsdevel so we're aware of this quirk.
> > > > >
> > > > > So I'm not sure whether this was ever discussed on fsdevel when you took
> > > > > the decision to treat fd 0 as AT_FDCWD or in general treat fd 0 as an
> > > > > invalid value.
> > > >
> > > > I've never heard of this before, and I think it is compltely
> > > > unacceptable. 0 ist just a normal FD, although one that happens to
> > > > have specific meaning in userspace as stdin.
> > > >
> > > > >
> > > > > If it was discussed then great but if not then I would like to make it
> > > > > very clear that if in the future you decide to introduce custom
> > > > > semantics for vfs provided infrastructure - especially when exposed to
> > > > > userspace - that you please Cc us.
> > > >
> > > > I don't think it's just the future.  We really need to undo this ASAP.
> > > 
> > > Christian is not correct in stating that treatment of fd==0 as invalid
> > > bpf object applies to vfs fd-s.
> > > The path_fd addition in this patch is really the very first one of this kind.
> > > At the same time bpf anon fd-s (progs, maps, links, btfs) with fd == 0
> > > are invalid and this is not going to change. It's been uapi for a long time.
> > > 
> > > More so fd-s 0,1,2 are not "normal FDs".
> > > Unix has made two mistakes:
> > > 1. fd==0 being valid fd
> > > 2. establishing convention that fd-s 0,1,2 are stdin, stdout, stderr.
> > > 
> > > The first mistake makes it hard to pass FD without an extra flag.
> > > The 2nd mistake is just awful.
> > > We've seen plenty of severe datacenter wide issues because some
> > > library or piece of software assumes stdin/out/err.
> > > Various services have been hurt badly by this "convention".
> > > In libbpf we added ensure_good_fd() to make sure none of bpf objects
> > > (progs, maps, etc) are ever seen with fd=0,1,2.
> > > Other pieces of datacenter software enforce the same.
> > > 
> > > In other words fds=0,1,2 are taken. They must not be anything but
> > > stdin/out/err or gutted to /dev/null.
> > > Otherwise expect horrible bugs and multi day debugging.
> > > 
> > > Because of that, several years ago, we've decided to fix unix mistake #1
> > > when it comes to bpf objects and started reserving fd=0 as invalid.
> > > This patch is proposing to do the same for path_fd (normal vfs fd) when
> > 
> > It isn't as you now realized but I'm glad we cleared that up off-list.
> > 
> > > it is passed to bpf syscall. I think it's a good trade-off and fits
> > > the rest of bpf uapi.
> > > 
> > > Everyone who's hiding behind statements: but POSIX is a standard..
> > > or this is how we've been doing things... are ignoring the practical
> > > situation at hand. fd-s 0,1,2 are taken. Make sure your sw never produces them.
> > 
> > (Prefix: Imagine me calmly writing this and in a relaxed tone.)
> > 
> > Just to clarify. I do think that deciding that 0 is an invalid file

descriptor

> 
> We're still talking past each other.
> 0 is an invalid bpf object. Not file.
> There is a difference.

You cut of a word above. I can't follow your argument.
File descriptor numbers are free to refer to whatever we want.
They don't care about what type of object they refer to and they
better not.

> The kernel is breaking user space by returning non-file FDs in 0,1,2.
> Especially as fd = 1 and 2.

This has a strong aura of a strawman argument. ;)

> ensure_good_fd() in libbpf is a library workaround to make sure bpf objects
> are not the reason for user app brekage.
> I firmly believe that making kernel return socket FDs and other special FDs with fd >=3
> (under new sysctl, for example) will prevent user space breakage.
> 
> And to answer Ted's question..
> Yes. It's a security issue, but it's the other way around.
> The kernel returning non vfs file FD in [0,1,2] range is a security issue.
> I'm proposing to fix it with new sysctl or boot flag.

That's just completely weird. We can see what Linus thinks but I think
that's a somewhat outlandish proposal that I wouldn't support.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: fd == 0 means AT_FDCWD BPF_OBJ_GET commands
  2023-05-18 16:33                 ` Matthew Wilcox
@ 2023-05-18 17:22                   ` Christian Brauner
  0 siblings, 0 replies; 21+ messages in thread
From: Christian Brauner @ 2023-05-18 17:22 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Alexei Starovoitov, Christoph Hellwig, Andrii Nakryiko,
	Andrii Nakryiko, bpf, Alexei Starovoitov, Daniel Borkmann,
	Martin KaFai Lau, Aleksa Sarai, Lennart Poettering,
	Linux-Fsdevel, Al Viro

On Thu, May 18, 2023 at 05:33:32PM +0100, Matthew Wilcox wrote:
> On Thu, May 18, 2023 at 09:25:08AM -0700, Alexei Starovoitov wrote:
> > We're still talking past each other.
> > 0 is an invalid bpf object. Not file.
> > There is a difference.
> > The kernel is breaking user space by returning non-file FDs in 0,1,2.
> > Especially as fd = 1 and 2.
> > ensure_good_fd() in libbpf is a library workaround to make sure bpf objects
> > are not the reason for user app brekage.
> > I firmly believe that making kernel return socket FDs and other special FDs with fd >=3
> > (under new sysctl, for example) will prevent user space breakage.
> 
> Wait, why are socket FDs special?  I shouldn't be able to have anything
> but chardev fds, pipes and regular files as fd 0,1,2?  I agree that having
> directory fds and blockdev fds as fd 0,1,2 are confusing and pointless,
> but I see the value in having a TCP socket as stdin/stdout/stderr.
> 
> If a fd shouldn't be used for stdio, having an ioctl to enable it
> and read/write return errors until/unless it's enabled makes sense.
> But now we have to label each fd as safe/not-safe for stdio, which we
> can as easily do by setting up our fops appropriately.  So I'm not sure
> what you're trying to accomplish here.

Yeah, I don't think we want weird ioctl()s to restrict file descriptor
ranges in any way. This all sounds pretty weird to me and I don't even
want to imagine the semantical oddness of suddenly restricting the
kernels ability to return some fds.

Honestly, most of the time sysctls such as this are the equivalent of
throwing the hands up in the air and leaving the room.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: fd == 0 means AT_FDCWD BPF_OBJ_GET commands
  2023-05-18 17:20                 ` Christian Brauner
@ 2023-05-18 17:33                   ` Linus Torvalds
  2023-05-18 18:21                     ` Christian Brauner
  2023-05-18 18:26                   ` Alexei Starovoitov
  1 sibling, 1 reply; 21+ messages in thread
From: Linus Torvalds @ 2023-05-18 17:33 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Alexei Starovoitov, Christoph Hellwig, Andrii Nakryiko,
	Andrii Nakryiko, bpf, Alexei Starovoitov, Daniel Borkmann,
	Martin KaFai Lau, Aleksa Sarai, Lennart Poettering,
	Linux-Fsdevel, Al Viro

On Thu, May 18, 2023 at 10:20 AM Christian Brauner <brauner@kernel.org> wrote:
>
> That's just completely weird. We can see what Linus thinks but I think
> that's a somewhat outlandish proposal that I wouldn't support.

I have no idea of the background here.

But fd 0 is in absolutely no way special. Anything that thinks that a
zero fd is invalid or in any way different from (say) fd 5 is
completely and utterly buggy by definition.

Now, fd 0 can obviously be invalid in the sense that it may not be
open, exactly the same way fd 100 may not be open. So in *that* sense
we can have an invalid fd 0, and system calls might return EBADF for
trying to access it if somebody has closed it.

If bpf thinks that 0 is not a file descriptor, then bpf is simply
wrong. No ifs, buts or maybes about it. It's like saying "1 is not a
number". It's nonsensical garbage.

But maybe I misunderstand the issue.

              Linus

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: fd == 0 means AT_FDCWD BPF_OBJ_GET commands
  2023-05-18 17:33                   ` Linus Torvalds
@ 2023-05-18 18:21                     ` Christian Brauner
  0 siblings, 0 replies; 21+ messages in thread
From: Christian Brauner @ 2023-05-18 18:21 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alexei Starovoitov, Christoph Hellwig, Andrii Nakryiko,
	Andrii Nakryiko, bpf, Alexei Starovoitov, Daniel Borkmann,
	Martin KaFai Lau, Aleksa Sarai, Lennart Poettering,
	Linux-Fsdevel, Al Viro

On Thu, May 18, 2023 at 10:33:30AM -0700, Linus Torvalds wrote:
> On Thu, May 18, 2023 at 10:20 AM Christian Brauner <brauner@kernel.org> wrote:
> >
> > That's just completely weird. We can see what Linus thinks but I think
> > that's a somewhat outlandish proposal that I wouldn't support.
> 
> I have no idea of the background here.
> 
> But fd 0 is in absolutely no way special. Anything that thinks that a
> zero fd is invalid or in any way different from (say) fd 5 is
> completely and utterly buggy by definition.
> 
> Now, fd 0 can obviously be invalid in the sense that it may not be
> open, exactly the same way fd 100 may not be open. So in *that* sense
> we can have an invalid fd 0, and system calls might return EBADF for
> trying to access it if somebody has closed it.
> 
> If bpf thinks that 0 is not a file descriptor, then bpf is simply
> wrong. No ifs, buts or maybes about it. It's like saying "1 is not a
> number". It's nonsensical garbage.
> 
> But maybe I misunderstand the issue.

TL;DR bpf has considerd fd 0 to be an invalid fd value for a long time.
We can't exactly follow the motiviation:
https://lore.kernel.org/bpf/CAADnVQLitLUc1SozzKjBgq6HGTchE1cO+e4j8eDgtE0zFn5VEw@mail.gmail.com
and it's probably to late to change this.

Yes, passing fds in extensible structs is inconvenient because you need
to pass an indicator alongside an fd to indicate that the zero value is
anactual file descriptor. Because unknown fields need to be initialzied
to zero so that old kernels can ignore larger structs. So what we
usually have to do:

mount_setattr()

attr.set |= MOUNT_ATTR_IDMAP;
attr.userns_fd = 0;

We just found out about fd 0 being invalid because a new bpf feature
proposal now tried to reuse fd 0 to mean AT_FDCWD when passed through a
new bpf feature; which I said isn't acceptable.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: fd == 0 means AT_FDCWD BPF_OBJ_GET commands
  2023-05-18 17:20                 ` Christian Brauner
  2023-05-18 17:33                   ` Linus Torvalds
@ 2023-05-18 18:26                   ` Alexei Starovoitov
       [not found]                     ` <CAHk-=whg-ygwrxm3GZ_aNXO=srH9sZ3NmFqu0KkyWw+wgEsi6g@mail.gmail.com>
  1 sibling, 1 reply; 21+ messages in thread
From: Alexei Starovoitov @ 2023-05-18 18:26 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Christoph Hellwig, Andrii Nakryiko, Andrii Nakryiko, bpf,
	Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau,
	Aleksa Sarai, Lennart Poettering, Linux-Fsdevel, Al Viro,
	Linus Torvalds

On Thu, May 18, 2023 at 07:20:29PM +0200, Christian Brauner wrote:
> On Thu, May 18, 2023 at 09:25:08AM -0700, Alexei Starovoitov wrote:
> > On Thu, May 18, 2023 at 10:38:46AM +0200, Christian Brauner wrote:
> > > On Wed, May 17, 2023 at 09:17:36AM -0700, Alexei Starovoitov wrote:
> > > > On Wed, May 17, 2023 at 5:05 AM Christoph Hellwig <hch@lst.de> wrote:
> > > > >
> > > > > On Wed, May 17, 2023 at 11:11:24AM +0200, Christian Brauner wrote:
> > > > > > Adding fsdevel so we're aware of this quirk.
> > > > > >
> > > > > > So I'm not sure whether this was ever discussed on fsdevel when you took
> > > > > > the decision to treat fd 0 as AT_FDCWD or in general treat fd 0 as an
> > > > > > invalid value.
> > > > >
> > > > > I've never heard of this before, and I think it is compltely
> > > > > unacceptable. 0 ist just a normal FD, although one that happens to
> > > > > have specific meaning in userspace as stdin.
> > > > >
> > > > > >
> > > > > > If it was discussed then great but if not then I would like to make it
> > > > > > very clear that if in the future you decide to introduce custom
> > > > > > semantics for vfs provided infrastructure - especially when exposed to
> > > > > > userspace - that you please Cc us.
> > > > >
> > > > > I don't think it's just the future.  We really need to undo this ASAP.
> > > > 
> > > > Christian is not correct in stating that treatment of fd==0 as invalid
> > > > bpf object applies to vfs fd-s.
> > > > The path_fd addition in this patch is really the very first one of this kind.
> > > > At the same time bpf anon fd-s (progs, maps, links, btfs) with fd == 0
> > > > are invalid and this is not going to change. It's been uapi for a long time.
> > > > 
> > > > More so fd-s 0,1,2 are not "normal FDs".
> > > > Unix has made two mistakes:
> > > > 1. fd==0 being valid fd
> > > > 2. establishing convention that fd-s 0,1,2 are stdin, stdout, stderr.
> > > > 
> > > > The first mistake makes it hard to pass FD without an extra flag.
> > > > The 2nd mistake is just awful.
> > > > We've seen plenty of severe datacenter wide issues because some
> > > > library or piece of software assumes stdin/out/err.
> > > > Various services have been hurt badly by this "convention".
> > > > In libbpf we added ensure_good_fd() to make sure none of bpf objects
> > > > (progs, maps, etc) are ever seen with fd=0,1,2.
> > > > Other pieces of datacenter software enforce the same.
> > > > 
> > > > In other words fds=0,1,2 are taken. They must not be anything but
> > > > stdin/out/err or gutted to /dev/null.
> > > > Otherwise expect horrible bugs and multi day debugging.
> > > > 
> > > > Because of that, several years ago, we've decided to fix unix mistake #1
> > > > when it comes to bpf objects and started reserving fd=0 as invalid.
> > > > This patch is proposing to do the same for path_fd (normal vfs fd) when
> > > 
> > > It isn't as you now realized but I'm glad we cleared that up off-list.
> > > 
> > > > it is passed to bpf syscall. I think it's a good trade-off and fits
> > > > the rest of bpf uapi.
> > > > 
> > > > Everyone who's hiding behind statements: but POSIX is a standard..
> > > > or this is how we've been doing things... are ignoring the practical
> > > > situation at hand. fd-s 0,1,2 are taken. Make sure your sw never produces them.
> > > 
> > > (Prefix: Imagine me calmly writing this and in a relaxed tone.)
> > > 
> > > Just to clarify. I do think that deciding that 0 is an invalid file
> 
> descriptor
> 
> > 
> > We're still talking past each other.
> > 0 is an invalid bpf object. Not file.
> > There is a difference.
> 
> You cut of a word above. I can't follow your argument.
> File descriptor numbers are free to refer to whatever we want.
> They don't care about what type of object they refer to and they
> better not.

User space cares what they refer to.
Unix made integer FD into broken abstraction.
regular files need one set of syscalls to work with such 'integer FDs'.
sockets need another set of syscalls.
All other anon-inodes need another set of syscalls.
While user space sees an integer without type and that a root cause of the bugs.
And that's particularly bad for integers 0,1,2 where user space assumes that
regular files are behind them.

Imagine C++ project where base class is called 'FD' while children
implement their own access methods that don't overlap with each other.
Now one application passes this 'FD' class to another.
That other app can only do trial and error to figure out what it got.
Any user space developer would reject such class hierarchy design,
but kernel folks are so proud of this broken abstraction.
FDs are not special.. POSIX is the standard... Right.
That's why user space keeps their workarounds, because kernel folks
are not empathic to user space needs.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: fd == 0 means AT_FDCWD BPF_OBJ_GET commands
  2023-05-17  9:11       ` fd == 0 means AT_FDCWD BPF_OBJ_GET commands Christian Brauner
  2023-05-17 12:05         ` Christoph Hellwig
@ 2023-05-18 21:56         ` Andrii Nakryiko
  1 sibling, 0 replies; 21+ messages in thread
From: Andrii Nakryiko @ 2023-05-18 21:56 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Andrii Nakryiko, bpf, ast, daniel, martin.lau, cyphar, lennart,
	linux-fsdevel, Christoph Hellwig, Al Viro

On Wed, May 17, 2023 at 2:11 AM Christian Brauner <brauner@kernel.org> wrote:
>
> On Tue, May 16, 2023 at 11:02:42AM -0700, Andrii Nakryiko wrote:
> > On Tue, May 16, 2023 at 2:07 AM Christian Brauner <brauner@kernel.org> wrote:
> > >
> > > On Mon, May 15, 2023 at 05:13:46PM -0700, Andrii Nakryiko wrote:
> > > > Current UAPI of BPF_OBJ_PIN and BPF_OBJ_GET commands of bpf() syscall
> > > > forces users to specify pinning location as a string-based absolute or
> > > > relative (to current working directory) path. This has various
> > > > implications related to security (e.g., symlink-based attacks), forces
> > > > BPF FS to be exposed in the file system, which can cause races with
> > > > other applications.
> > > >
> > > > One of the feedbacks we got from folks working with containers heavily
> > > > was that inability to use purely FD-based location specification was an
> > > > unfortunate limitation and hindrance for BPF_OBJ_PIN and BPF_OBJ_GET
> > > > commands. This patch closes this oversight, adding path_fd field to
> > >
> > > Cool!
> > >
> > > > BPF_OBJ_PIN and BPF_OBJ_GET UAPI, following conventions established by
> > > > *at() syscalls for dirfd + pathname combinations.
> > > >
> > > > This now allows interesting possibilities like working with detached BPF
> > > > FS mount (e.g., to perform multiple pinnings without running a risk of
> > > > someone interfering with them), and generally making pinning/getting
> > > > more secure and not prone to any races and/or security attacks.
> > > >
> > > > This is demonstrated by a selftest added in subsequent patch that takes
> > > > advantage of new mount APIs (fsopen, fsconfig, fsmount) to demonstrate
> > > > creating detached BPF FS mount, pinning, and then getting BPF map out of
> > > > it, all while never exposing this private instance of BPF FS to outside
> > > > worlds.
> > > >
> > > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> > > > ---
> > > >  include/linux/bpf.h            |  4 ++--
> > > >  include/uapi/linux/bpf.h       |  5 +++++
> > > >  kernel/bpf/inode.c             | 16 ++++++++--------
> > > >  kernel/bpf/syscall.c           |  8 +++++---
> > > >  tools/include/uapi/linux/bpf.h |  5 +++++
> > > >  5 files changed, 25 insertions(+), 13 deletions(-)
> > > >
> > > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > > > index 36e4b2d8cca2..f58895830ada 100644
> > > > --- a/include/linux/bpf.h
> > > > +++ b/include/linux/bpf.h
> > > > @@ -2077,8 +2077,8 @@ struct file *bpf_link_new_file(struct bpf_link *link, int *reserved_fd);
> > > >  struct bpf_link *bpf_link_get_from_fd(u32 ufd);
> > > >  struct bpf_link *bpf_link_get_curr_or_next(u32 *id);
> > > >
> > > > -int bpf_obj_pin_user(u32 ufd, const char __user *pathname);
> > > > -int bpf_obj_get_user(const char __user *pathname, int flags);
> > > > +int bpf_obj_pin_user(u32 ufd, int path_fd, const char __user *pathname);
> > > > +int bpf_obj_get_user(int path_fd, const char __user *pathname, int flags);
> > > >
> > > >  #define BPF_ITER_FUNC_PREFIX "bpf_iter_"
> > > >  #define DEFINE_BPF_ITER_FUNC(target, args...)                        \
> > > > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > > > index 1bb11a6ee667..db2870a52ce0 100644
> > > > --- a/include/uapi/linux/bpf.h
> > > > +++ b/include/uapi/linux/bpf.h
> > > > @@ -1420,6 +1420,11 @@ union bpf_attr {
> > > >               __aligned_u64   pathname;
> > > >               __u32           bpf_fd;
> > > >               __u32           file_flags;
> > > > +             /* same as dirfd in openat() syscall; see openat(2)
> > > > +              * manpage for details of dirfd/path_fd and pathname semantics;
> > > > +              * zero path_fd implies AT_FDCWD behavior
> > > > +              */
> > > > +             __u32           path_fd;
> > > >       };
> > >
> > > So 0 is a valid file descriptor and can trivially be created and made to
> > > refer to any file. Is this a conscious decision to have a zero value
> > > imply AT_FDCWD and have you done this somewhere else in bpf already?
> > > Because that's contrary to how any file descriptor based apis work.
> > >
> > > How this is usually solved for extensible structs is to have a flag
> > > field that raises a flag to indicate that the fd fiel is set and thus 0
> > > can be used as a valid value.
> > >
> > > The way you're doing it right now is very counterintuitive to userspace
> > > and pretty much guaranteed to cause subtle bugs.
> >
> > Yes, it's a very bpf()-specific convention we've settled on a while
> > ago. It allows a cleaner and simpler backwards compatibility story
> > without having to introduce new flags every single time. Most of BPF
> > UAPI by now dictates that (otherwise valid) FD 0 can't be used to pass
> > it to bpf() syscall. Most of the time users will be blissfully unaware
> > because libbpf and other BPF libraries are checking for fd == 0 and
> > dup()'ing them to avoid ever returning FD 0 to the user.
> >
> > tl;dr, a conscious decision consistent with the rest of BPF UAPI. It
> > is a bpf() peculiarity, yes.
>
> Adding fsdevel so we're aware of this quirk.
>
> So I'm not sure whether this was ever discussed on fsdevel when you took
> the decision to treat fd 0 as AT_FDCWD or in general treat fd 0 as an
> invalid value.
>
> If it was discussed then great but if not then I would like to make it
> very clear that if in the future you decide to introduce custom
> semantics for vfs provided infrastructure - especially when exposed to
> userspace - that you please Cc us.

Yep, I'll remember to cc linux-fsdevel@vger.kernel.org for future
patches touching on vfs-related concepts, no problem. I wasn't trying
to sneak it in or anything, it just never occurred to me, sorry about
that.

>
> You often make it very clear on the list that you don't like it when
> anything that touches bpf code doesn't end up getting sent to the bpf
> mailing list. It is exactly the same for us.

That's a fair request, ack.

>
> This is not a rant I'm really just trying to make sure that we agree on
> common ground when it comes to touching each others code or semantic
> assumptions.
>
> I personally find this extremely weird to treat fd 0 as anything other
> than a random fd number as it goes against any userspace assumptions and
> drastically deviates from basically every file descriptor interface we
> have. I mean, you're not just saying fd 0 is invalid you're even saying
> it means AT_FDCWD.

Agreed, I can see how this could have undesirable (not just
surprising) implications if, say, open(O_PATH) returned fd=0. I just
sent a v2 with a new flag that needs to be specified to be able to use
path FD (and falling back to backwards-compatible AT_FDCWD behavior
otherwise).

>
> For every other interface, including those that pass fds in structs
> whose extensibility is premised on unknown fields being set to zero,
> have ways to make fd 0 work just fine. You could've done that to without
> inventing custom fd semantics.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: fd == 0 means AT_FDCWD BPF_OBJ_GET commands
       [not found]                     ` <CAHk-=whg-ygwrxm3GZ_aNXO=srH9sZ3NmFqu0KkyWw+wgEsi6g@mail.gmail.com>
@ 2023-05-19  4:44                       ` Alexei Starovoitov
  2023-05-19  8:13                         ` Christian Brauner
                                           ` (3 more replies)
  0 siblings, 4 replies; 21+ messages in thread
From: Alexei Starovoitov @ 2023-05-19  4:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Christian Brauner, Christoph Hellwig, Andrii Nakryiko,
	Andrii Nakryiko, bpf, Alexei Starovoitov, Daniel Borkmann,
	Martin KaFai Lau, Aleksa Sarai, Lennart Poettering,
	Linux-Fsdevel, Al Viro

On Thu, May 18, 2023 at 11:57:14AM -0700, Linus Torvalds wrote:
> That is nobody's fault but your own, and you should just admit it rather
> than trying to double down on being wrong.

You're correct. I was indeed doubling down on that.
Thanks for putting it straight like that.

> The 0/1/2 file descriptors are not at all special. They are a shell
> pipeline default, nothing more. They are not the argument your think they
> are, and you should stop trying to make them an argument.

I'm well aware that any file type is allowed to be in FDs 0,1,2 and
some user space is using it that way, like old inetd:
https://github.com/guillemj/inetutils/blob/master/src/inetd.c#L428
That puts the same socket into 0,1,2 before exec-ing new process.

My point that the kernel has to assist user space instead of
stubbornly sticking to POSIX and saying all FDs are equal.

Most user space developers know that care should be taken with FDs 0,1,2,
but it's still easy to make a mistake.

To explain the motivation a bit of background:
"folly" is a core C++ library for fb apps. Like libstdc++ and a lot more.
Until this commit in 2021:
https://github.com/facebook/folly/commit/cc9032a0e41a0cba9aa93240c483cfceb0ff44ea
the user could launch a new process with flag "folly::Subprocess::CLOSE".
It's useful for the cases when child doesn't want to inherit stdin/out/err.
There is also GLOG. google's logging library that can be configured to log to stderr.
Both libraries are well written with the high code quality.
In a big app multiple people use different pieces and may not be aware
how all pieces are put together. You can guess the rest...
Important service used a library that used another library that started a
process with folly::Subprocess::CLOSE. That process opened network connections
and used glog. It was "working" for some time, because sys_write() to a socket
is a valid command, but when TCP buffers got full synchronous innocuous logging
prevented parent from making progress.

That footgun was removed from folly in 2021, but we still see this issue from time to time.
My point that the kernel can help here.
Since folks don't like sysctl to control FD assignment how about something like this:

diff --git a/fs/file.c b/fs/file.c
index 7893ea161d77..896e79433f61 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -554,9 +554,15 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
        return error;
 }

+__weak noinline u32 get_start_fd(void)
+{
+       return 0;
+}
+/* mark it as BPF_MODIFY_RETURN to let bpf progs adjust return value */
+
 int __get_unused_fd_flags(unsigned flags, unsigned long nofile)
 {
-       return alloc_fd(0, nofile, flags);
+       return alloc_fd(get_start_fd(), nofile, flags);
 }

Then we can enforce fd >= 3 for a certain container or for a particular app.

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: fd == 0 means AT_FDCWD BPF_OBJ_GET commands
  2023-05-19  4:44                       ` Alexei Starovoitov
@ 2023-05-19  8:13                         ` Christian Brauner
  2023-05-19 14:27                           ` Theodore Ts'o
  2023-05-19 17:51                         ` Linus Torvalds
                                           ` (2 subsequent siblings)
  3 siblings, 1 reply; 21+ messages in thread
From: Christian Brauner @ 2023-05-19  8:13 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Linus Torvalds, Christoph Hellwig, Andrii Nakryiko,
	Andrii Nakryiko, bpf, Alexei Starovoitov, Daniel Borkmann,
	Martin KaFai Lau, Aleksa Sarai, Lennart Poettering,
	Linux-Fsdevel, Al Viro

On Thu, May 18, 2023 at 09:44:33PM -0700, Alexei Starovoitov wrote:
> On Thu, May 18, 2023 at 11:57:14AM -0700, Linus Torvalds wrote:
> > That is nobody's fault but your own, and you should just admit it rather
> > than trying to double down on being wrong.
> 
> You're correct. I was indeed doubling down on that.
> Thanks for putting it straight like that.
> 
> > The 0/1/2 file descriptors are not at all special. They are a shell
> > pipeline default, nothing more. They are not the argument your think they
> > are, and you should stop trying to make them an argument.
> 
> I'm well aware that any file type is allowed to be in FDs 0,1,2 and
> some user space is using it that way, like old inetd:
> https://github.com/guillemj/inetutils/blob/master/src/inetd.c#L428
> That puts the same socket into 0,1,2 before exec-ing new process.
> 
> My point that the kernel has to assist user space instead of
> stubbornly sticking to POSIX and saying all FDs are equal.
> 
> Most user space developers know that care should be taken with FDs 0,1,2,
> but it's still easy to make a mistake.
> 
> To explain the motivation a bit of background:
> "folly" is a core C++ library for fb apps. Like libstdc++ and a lot more.
> Until this commit in 2021:
> https://github.com/facebook/folly/commit/cc9032a0e41a0cba9aa93240c483cfceb0ff44ea
> the user could launch a new process with flag "folly::Subprocess::CLOSE".
> It's useful for the cases when child doesn't want to inherit stdin/out/err.
> There is also GLOG. google's logging library that can be configured to log to stderr.
> Both libraries are well written with the high code quality.
> In a big app multiple people use different pieces and may not be aware
> how all pieces are put together. You can guess the rest...
> Important service used a library that used another library that started a
> process with folly::Subprocess::CLOSE. That process opened network connections
> and used glog. It was "working" for some time, because sys_write() to a socket
> is a valid command, but when TCP buffers got full synchronous innocuous logging
> prevented parent from making progress.
> 
> That footgun was removed from folly in 2021, but we still see this issue from time to time.
> My point that the kernel can help here.
> Since folks don't like sysctl to control FD assignment how about something like this:
> 
> diff --git a/fs/file.c b/fs/file.c
> index 7893ea161d77..896e79433f61 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -554,9 +554,15 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
>         return error;
>  }
> 
> +__weak noinline u32 get_start_fd(void)
> +{
> +       return 0;
> +}
> +/* mark it as BPF_MODIFY_RETURN to let bpf progs adjust return value */
> +
>  int __get_unused_fd_flags(unsigned flags, unsigned long nofile)
>  {
> -       return alloc_fd(0, nofile, flags);
> +       return alloc_fd(get_start_fd(), nofile, flags);

I'm sorry but I really don't think this is a good idea. We're not going
to run BPF programs in core file code. That stuff is sensitive and
complex enough as it is without having to take into account that a bpf
program can modify behavior. It's also completely unclear whether that's
safe to do as this would allow to change fd allocation across the whole
kernel.

This idea that fd 0, 1, and 2 or any other fd deserve special treatment
by the kernel needs to die; and quickly at that.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: fd == 0 means AT_FDCWD BPF_OBJ_GET commands
  2023-05-19  8:13                         ` Christian Brauner
@ 2023-05-19 14:27                           ` Theodore Ts'o
  0 siblings, 0 replies; 21+ messages in thread
From: Theodore Ts'o @ 2023-05-19 14:27 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Alexei Starovoitov, Linus Torvalds, Christoph Hellwig,
	Andrii Nakryiko, Andrii Nakryiko, bpf, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, Aleksa Sarai,
	Lennart Poettering, Linux-Fsdevel, Al Viro

On Fri, May 19, 2023 at 10:13:09AM +0200, Christian Brauner wrote:
> > I'm well aware that any file type is allowed to be in FDs 0,1,2 and
> > some user space is using it that way, like old inetd:
> > https://github.com/guillemj/inetutils/blob/master/src/inetd.c#L428
> > That puts the same socket into 0,1,2 before exec-ing new process.

This is a *feature*.  I've seen, and actually written shell scripts
which have been wired into /etc/inetd.conf. amd so the fact that shell
script can send stdout out to a incoming TCP connection.  It should be
possible to implement the finger protocol (RFC 1288) as a shell or
python script, *precisely* because having inetd connect a socket to
FDs 0, 1, and 2 is a good and useful thing to do.

> > My point that the kernel has to assist user space instead of
> > stubbornly sticking to POSIX and saying all FDs are equal.

This is not a matter of adhering to Posix.  It's about the fundamental
Unix philosophy.  Not everything needs to be implemented in a
complicated C++ program....


> > To explain the motivation a bit of background:
> > "folly" is a core C++ library for fb apps. Like libstdc++ and a lot more.
> > Until this commit in 2021:
> > https://github.com/facebook/folly/commit/cc9032a0e41a0cba9aa93240c483cfceb0ff44ea
> > the user could launch a new process with flag "folly::Subprocess::CLOSE".
> > It's useful for the cases when child doesn't want to inherit stdin/out/err.

Yeah, sorry, that's just simple bug in the Folly library (which I
guess was well named).  Closing all of the file descriptors and then
opening 0, 1, and 2 using /dev/null is a pretty basic.  In fact,
there's a convenient daemon(3) will do this for you.  No muss, no
fuss, no dirty dishes.

> I'm sorry but I really don't think this is a good idea. We're not going
> to run BPF programs in core file code. That stuff is sensitive and
> complex enough as it is without having to take into account that a bpf
> program can modify behavior. It's also completely unclear whether that's
> safe to do as this would allow to change fd allocation across the whole
> kernel.
> 
> This idea that fd 0, 1, and 2 or any other fd deserve special treatment
> by the kernel needs to die; and quickly at that.

+1.

Making fundamentally violent changes to core Unix design and
philosophy just to accomodate incompetent user space programmers is
IMHO a really bad idea.

						- Ted

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: fd == 0 means AT_FDCWD BPF_OBJ_GET commands
  2023-05-19  4:44                       ` Alexei Starovoitov
  2023-05-19  8:13                         ` Christian Brauner
@ 2023-05-19 17:51                         ` Linus Torvalds
  2023-05-23  7:49                         ` Lennart Poettering
  2023-08-26  4:27                         ` Al Viro
  3 siblings, 0 replies; 21+ messages in thread
From: Linus Torvalds @ 2023-05-19 17:51 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Christian Brauner, Christoph Hellwig, Andrii Nakryiko,
	Andrii Nakryiko, bpf, Alexei Starovoitov, Daniel Borkmann,
	Martin KaFai Lau, Aleksa Sarai, Lennart Poettering,
	Linux-Fsdevel, Al Viro

On Thu, May 18, 2023 at 9:44 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> My point that the kernel can help here.
> Since folks don't like sysctl to control FD assignment how about something like this:

No. Really.

The only thing that needs to happen is that people need to *know* that
fd's 0/1/2 are not at all special.

The thing is, it's even *traditional* to do something like

        close(0);
        close(1);
        close(2);

and fork() twice (exiting the first child after the second fork).
Usually you'd also make sure to re-set umask, and do a setsid() to
make sure you're not part of the controlling terminal any more.

Now, some people would then redirect /dev/null to those file
descriptors, but not always: file descriptors used to be a fairly
limited resource. So people would *want* to use all the file
descriptors they could for whatever server connections they
implemented. Including, very much, fd 0.

So really. There is not a way in hell we will *EVER* consider fd 0 to
be special. It isn't, it never has been, and it never shall be.

Instead, just spread the word that fd 0 isn't special.

The fact that you think that some completely broken C++ code was
"written with high quality", and you think that is an excuse for
garbage is just sad.

Those C++ libraries WERE INCREDIBLE CRAP. They were buggy garbage. And
no, they are in no way going to affect the kernel and make the kernel
do stupid and wrong things.

                    Linus

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: fd == 0 means AT_FDCWD BPF_OBJ_GET commands
  2023-05-19  4:44                       ` Alexei Starovoitov
  2023-05-19  8:13                         ` Christian Brauner
  2023-05-19 17:51                         ` Linus Torvalds
@ 2023-05-23  7:49                         ` Lennart Poettering
  2023-05-23 17:25                           ` Andrii Nakryiko
  2023-08-26  4:27                         ` Al Viro
  3 siblings, 1 reply; 21+ messages in thread
From: Lennart Poettering @ 2023-05-23  7:49 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Linus Torvalds, Christian Brauner, Christoph Hellwig,
	Andrii Nakryiko, Andrii Nakryiko, bpf, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, Aleksa Sarai, Linux-Fsdevel,
	Al Viro

On Do, 18.05.23 21:44, Alexei Starovoitov (alexei.starovoitov@gmail.com) wrote:

> > The 0/1/2 file descriptors are not at all special. They are a shell
> > pipeline default, nothing more. They are not the argument your think they
> > are, and you should stop trying to make them an argument.
>
> I'm well aware that any file type is allowed to be in FDs 0,1,2 and
> some user space is using it that way, like old inetd:
> https://github.com/guillemj/inetutils/blob/master/src/inetd.c#L428
> That puts the same socket into 0,1,2 before exec-ing new process.
>
> My point that the kernel has to assist user space instead of
> stubbornly sticking to POSIX and saying all FDs are equal.
>
> Most user space developers know that care should be taken with FDs 0,1,2,
> but it's still easy to make a mistake.

If I look at libbpf, which supposedly gets the fd handling right I
can't find any hint it actually moves the fds it gets from open() to
an fd > 2, though?

i.e. the code that invokes open() calls in the libbpf codebase happily
just accepts an fd < 2, including fd == 0, and this is then later
passed back into the kernel in various bpf() syscall invocations,
which should refuse it, no? So what's going on there?

I did find this though:

<snip>
	new_fd = open("/", O_RDONLY | O_CLOEXEC);
	if (new_fd < 0) {
		err = -errno;
		goto err_free_new_name;
	}

	new_fd = dup3(fd, new_fd, O_CLOEXEC);
	if (new_fd < 0) {
		err = -errno;
		goto err_close_new_fd;
	}
</snip>

(This is from libbpf.c, bpf_map__reuse_fd(), i.e. https://github.com/libbpf/libbpf/blob/master/src/libbpf.c)

Not sure what's going on here, what is this about? you allocate an fd
you then immediately replace? Is this done to move the fd away from
fd=0?  but that doesn't work that way, in case fd 0 is closed when
entering this function.

Or is this about dup'ping with O_CLOEXEC?

Please be aware that F_DUPFD_CLOEXEC exists, which allows you to
easily move some fd above some treshold, *and* set O_CLOEXEC at the
same time. In the systemd codebase we call this frequently for code
that ends up being loaded in arbitrary processes (glibc NSS modules,
PAM modules), in order to ensure we get out of the fd < 3 territory
quickly.

(btw, if you do care about O_CLOEXEC – which is great – then you also
want to replace a bunch of fopen(…, "r") with fopen(…, "re") in your
codebase)

Lennart

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: fd == 0 means AT_FDCWD BPF_OBJ_GET commands
  2023-05-23  7:49                         ` Lennart Poettering
@ 2023-05-23 17:25                           ` Andrii Nakryiko
  0 siblings, 0 replies; 21+ messages in thread
From: Andrii Nakryiko @ 2023-05-23 17:25 UTC (permalink / raw)
  To: Lennart Poettering
  Cc: Alexei Starovoitov, Linus Torvalds, Christian Brauner,
	Christoph Hellwig, Andrii Nakryiko, bpf, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, Aleksa Sarai, Linux-Fsdevel,
	Al Viro

On Tue, May 23, 2023 at 12:49 AM Lennart Poettering
<lennart@poettering.net> wrote:
>
> On Do, 18.05.23 21:44, Alexei Starovoitov (alexei.starovoitov@gmail.com) wrote:
>
> > > The 0/1/2 file descriptors are not at all special. They are a shell
> > > pipeline default, nothing more. They are not the argument your think they
> > > are, and you should stop trying to make them an argument.
> >
> > I'm well aware that any file type is allowed to be in FDs 0,1,2 and
> > some user space is using it that way, like old inetd:
> > https://github.com/guillemj/inetutils/blob/master/src/inetd.c#L428
> > That puts the same socket into 0,1,2 before exec-ing new process.
> >
> > My point that the kernel has to assist user space instead of
> > stubbornly sticking to POSIX and saying all FDs are equal.
> >
> > Most user space developers know that care should be taken with FDs 0,1,2,
> > but it's still easy to make a mistake.
>
> If I look at libbpf, which supposedly gets the fd handling right I
> can't find any hint it actually moves the fds it gets from open() to
> an fd > 2, though?
>
> i.e. the code that invokes open() calls in the libbpf codebase happily
> just accepts an fd < 2, including fd == 0, and this is then later
> passed back into the kernel in various bpf() syscall invocations,
> which should refuse it, no? So what's going on there?

libbpf's attempt to ensure that fd>2 applies mostly to BPF objects:
maps, progs, btfs, links, which are always returned from bpf()
syscall. That's where we use ensure_good_fd(). The snippet you found
in bpf_map__reuse_fd() is problematic and slipped through the cracks,
I'll fix it, thanks for bringing this to my attention. I'll also check
if there are any other places where we should use ensure_good_fd() as
well.

>
> I did find this though:
>
> <snip>
>         new_fd = open("/", O_RDONLY | O_CLOEXEC);
>         if (new_fd < 0) {
>                 err = -errno;
>                 goto err_free_new_name;
>         }
>
>         new_fd = dup3(fd, new_fd, O_CLOEXEC);
>         if (new_fd < 0) {
>                 err = -errno;
>                 goto err_close_new_fd;
>         }
> </snip>
>
> (This is from libbpf.c, bpf_map__reuse_fd(), i.e. https://github.com/libbpf/libbpf/blob/master/src/libbpf.c)
>
> Not sure what's going on here, what is this about? you allocate an fd
> you then immediately replace? Is this done to move the fd away from
> fd=0?  but that doesn't work that way, in case fd 0 is closed when
> entering this function.
>
> Or is this about dup'ping with O_CLOEXEC?


This code predates me, so I don't know the exact reasons why it's
implemented exactly like this. It seems like instead of doing
open()+dup3() we should do dup3()+ensure_good_fd(). I need to look at
this carefully, this is not the code I work with regularly.

>
> Please be aware that F_DUPFD_CLOEXEC exists, which allows you to
> easily move some fd above some treshold, *and* set O_CLOEXEC at the
> same time. In the systemd codebase we call this frequently for code
> that ends up being loaded in arbitrary processes (glibc NSS modules,
> PAM modules), in order to ensure we get out of the fd < 3 territory
> quickly.

nice, thanks

>
> (btw, if you do care about O_CLOEXEC – which is great – then you also
> want to replace a bunch of fopen(…, "r") with fopen(…, "re") in your
> codebase)

I will check this as well, thanks for the hints :)

>
> Lennart

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: fd == 0 means AT_FDCWD BPF_OBJ_GET commands
  2023-05-19  4:44                       ` Alexei Starovoitov
                                           ` (2 preceding siblings ...)
  2023-05-23  7:49                         ` Lennart Poettering
@ 2023-08-26  4:27                         ` Al Viro
  3 siblings, 0 replies; 21+ messages in thread
From: Al Viro @ 2023-08-26  4:27 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Linus Torvalds, Christian Brauner, Christoph Hellwig,
	Andrii Nakryiko, Andrii Nakryiko, bpf, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, Aleksa Sarai,
	Lennart Poettering, Linux-Fsdevel

On Thu, May 18, 2023 at 09:44:33PM -0700, Alexei Starovoitov wrote:

> That footgun was removed from folly in 2021, but we still see this issue from time to time.
> My point that the kernel can help here.
> Since folks don't like sysctl to control FD assignment how about something like this:
> 
> diff --git a/fs/file.c b/fs/file.c
> index 7893ea161d77..896e79433f61 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -554,9 +554,15 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
>         return error;
>  }
> 
> +__weak noinline u32 get_start_fd(void)
> +{
> +       return 0;
> +}
> +/* mark it as BPF_MODIFY_RETURN to let bpf progs adjust return value */
> +
>  int __get_unused_fd_flags(unsigned flags, unsigned long nofile)
>  {
> -       return alloc_fd(0, nofile, flags);
> +       return alloc_fd(get_start_fd(), nofile, flags);
>  }
> 
> Then we can enforce fd >= 3 for a certain container or for a particular app.

[an extremely belated reply - had been net.dead since mid-May, just got to
that thread]

As far as I'm concerned, the main conclusion is that BPF handling of file
descriptors needs a fairly hostile code review, regarding the interactions
with dup2(), shared descriptor tables, SCM_RIGHTS, etc.  Rationale:
demonstrated utter lack of clue about the nature of file descriptors,
along with a weird mental model of how they are used, complete with
"if they are used not in the way we expect, let's shove a hook
somewhere to enforce The Right Way(tm)".  Will do...

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2023-08-26  4:28 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20230516001348.286414-1-andrii@kernel.org>
     [not found] ` <20230516001348.286414-2-andrii@kernel.org>
     [not found]   ` <20230516-briefe-blutzellen-0432957bdd15@brauner>
     [not found]     ` <CAEf4BzafCCeRm9M8pPzpwexadKy5OAEmrYcnVpKmqNJ2tnSVuw@mail.gmail.com>
2023-05-17  9:11       ` fd == 0 means AT_FDCWD BPF_OBJ_GET commands Christian Brauner
2023-05-17 12:05         ` Christoph Hellwig
2023-05-17 16:17           ` Alexei Starovoitov
2023-05-17 21:48             ` Alexei Starovoitov
2023-05-18  8:38             ` Christian Brauner
2023-05-18 14:30               ` Theodore Ts'o
2023-05-18 16:25               ` Alexei Starovoitov
2023-05-18 16:33                 ` Matthew Wilcox
2023-05-18 17:22                   ` Christian Brauner
2023-05-18 17:20                 ` Christian Brauner
2023-05-18 17:33                   ` Linus Torvalds
2023-05-18 18:21                     ` Christian Brauner
2023-05-18 18:26                   ` Alexei Starovoitov
     [not found]                     ` <CAHk-=whg-ygwrxm3GZ_aNXO=srH9sZ3NmFqu0KkyWw+wgEsi6g@mail.gmail.com>
2023-05-19  4:44                       ` Alexei Starovoitov
2023-05-19  8:13                         ` Christian Brauner
2023-05-19 14:27                           ` Theodore Ts'o
2023-05-19 17:51                         ` Linus Torvalds
2023-05-23  7:49                         ` Lennart Poettering
2023-05-23 17:25                           ` Andrii Nakryiko
2023-08-26  4:27                         ` Al Viro
2023-05-18 21:56         ` Andrii Nakryiko

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).