Re: [PATCH bpf 1/3] bpf: syscall to start at file-descriptor 1

From: Andrii Nakryiko <andrii.nakryiko@gmail.com>
To: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: David Ahern <dsahern@gmail.com>, bpf <bpf@vger.kernel.org>,
	Alexei Starovoitov <alexei.starovoitov@gmail.com>,
	Networking <netdev@vger.kernel.org>,
	Daniel Borkmann <borkmann@iogearbox.net>,
	Lorenzo Bianconi <lorenzo@kernel.org>
Subject: Re: [PATCH bpf 1/3] bpf: syscall to start at file-descriptor 1
Date: Mon, 8 Jun 2020 13:05:18 -0700	[thread overview]
Message-ID: <CAEf4Bzah_5jrFcvbk+mLa31KQjY77JdoXw=cPGAYw92X8nN=gw@mail.gmail.com> (raw)
In-Reply-To: <20200608214437.5f7766ab@carbon>

-- Andrii

On Mon, Jun 8, 2020 at 12:45 PM Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
>
> On Mon, 8 Jun 2020 11:36:33 -0700
> Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
>
> > On Mon, Jun 8, 2020 at 9:51 AM Jesper Dangaard Brouer <brouer@redhat.com> wrote:
> > >
> > > This patch change BPF syscall to avoid returning file descriptor value zero.
> > >
> > > As mentioned in cover letter, it is very impractical when extending kABI
> > > that the file-descriptor value 'zero' is valid, as this requires new fields
> > > must be initialised as minus-1. First step is to change the kernel such that
> > > BPF-syscall simply doesn't return value zero as a FD number.
> > >
> > > This patch achieves this by similar code to anon_inode_getfd(), with the
> > > exception of getting unused FD starting from 1. The kernel already supports
> > > starting from a specific FD value, as this is used by f_dupfd(). It seems
> > > simpler to replicate part of anon_inode_getfd() code and use this start from
> > > offset feature, instead of using f_dupfd() handling afterwards.
> >
> > Wouldn't it be better to just handle that on libbpf side? That way it
> > works on all kernels and doesn't require this duplication of logic
> > inside kernel?
>
> IMHO this is needed on the kernel side, to pair it with the API change.
> I don't see how doing this in libbpf can cover all cases.
>

In practice, it's going to be very rare that fd=0 is not already
allocated for application. So even not doing anything is going to work
in 99.9% of cases.

Think about this, any realistic BPF program will have a map or global
variable associated with it. So in the rare case where we have FD 0
not used, map will get that FD. Even if not, if you load your program
from ELF file, that ELF file will get FD 0. Even if not, modern BPF
programs will have BTF object associated, which will get FD 0. And so
on... Even daemons will probably already have some FD open for
whatever logging/output it needs to do (it doesn't have to be stdout).
So this FD=0 is very hypothetical case, very. You have to essentially
stage it consciously, to actually hit it.

Second of all, this BPF-specific FD allocation logic is just that --
duplication and extra code to maintain. Other folks in kernel will
eventually just be "huh? bpf needs its own anon_file API, why?!..." Do
we really want more of that?

Third, you already missed anon_inode_getfile use in bpf_link_prime(),
and in the future we'll undoubtedly will miss something else, so this
FD >= 1 from kernel will work only sometimes and no one will notice
until it breaks for someone, which won't happen for a while (because
it works as is most of the time, see above).

I'm not even talking about the fact that we are also subverting
standard Linux promise that a user gets the smallest available FD in
the system...

So I think libbpf can be kind to user and do:

int fd = bpf_whatever();
if (fd == 0) {
    fd = dup(0);
    close(0);
}

But even if it doesn't, not a big deal and probably no one ever will
have this problem with FD 0 for BPF program.

> First of all, some users might not be using libbpf.
>
> Second, a userspace application could be using an older version of
> libbpf on a newer kernel. (Notice this is not only due to older
> distros, but also because projects using git submodule libbpf will
> freeze at some point in time that worked).

Theoretically this is a problem, in practice libbpf is always more
up-to-date compared to kernel... so I don't buy this argument,
honestly :)

>
> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer
>