Re: handling unsupported optlen in cgroup bpf getsockopt: (was [PATCH net-next v4 2/4] net: socket: add sockopts blacklist for BPF cgroup hook)

From: Stanislav Fomichev <sdf@google.com>
To: Kui-Feng Lee <sinquersw@gmail.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>,
	Eric Dumazet <edumazet@google.com>,
	davem@davemloft.net, linux-kernel@vger.kernel.org,
	netdev@vger.kernel.org, daniel@iogearbox.net,
	Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>,
	Leon Romanovsky <leon@kernel.org>,
	David Ahern <dsahern@kernel.org>, Arnd Bergmann <arnd@arndb.de>,
	Kees Cook <keescook@chromium.org>,
	Christian Brauner <brauner@kernel.org>,
	Kuniyuki Iwashima <kuniyu@amazon.com>,
	Lennart Poettering <mzxreary@0pointer.de>,
	linux-arch@vger.kernel.org,
	Aleksandr Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>,
	bpf <bpf@vger.kernel.org>
Subject: Re: handling unsupported optlen in cgroup bpf getsockopt: (was [PATCH net-next v4 2/4] net: socket: add sockopts blacklist for BPF cgroup hook)
Date: Wed, 26 Apr 2023 08:50:53 -0700	[thread overview]
Message-ID: <CAKH8qBu3d+DpnyaCusRV3Q5xPCjuC4ym0PfJJ7pAOov+Mg2ayw@mail.gmail.com> (raw)
In-Reply-To: <927ddd10-ae5b-886c-6725-3daf04456e52@gmail.com>

On Tue, Apr 25, 2023 at 2:28 PM Kui-Feng Lee <sinquersw@gmail.com> wrote:
>
>
>
> On 4/25/23 11:42, Stanislav Fomichev wrote:
> > On Tue, Apr 25, 2023 at 10:59 AM Kui-Feng Lee <sinquersw@gmail.com> wrote:
> >>
> >>
> >>
> >> On 4/18/23 09:47, Stanislav Fomichev wrote:
> >>> On 04/17, Martin KaFai Lau wrote:
> >>>> On 4/14/23 6:55 PM, Stanislav Fomichev wrote:
> >>>>> On 04/13, Stanislav Fomichev wrote:
> >>>>>> On Thu, Apr 13, 2023 at 7:38 AM Aleksandr Mikhalitsyn
> >>>>>> <aleksandr.mikhalitsyn@canonical.com> wrote:
> >>>>>>>
> >>>>>>> On Thu, Apr 13, 2023 at 4:22 PM Eric Dumazet <edumazet@google.com> wrote:
> >>>>>>>>
> >>>>>>>> On Thu, Apr 13, 2023 at 3:35 PM Alexander Mikhalitsyn
> >>>>>>>> <aleksandr.mikhalitsyn@canonical.com> wrote:
> >>>>>>>>>
> >>>>>>>>> During work on SO_PEERPIDFD, it was discovered (thanks to Christian),
> >>>>>>>>> that bpf cgroup hook can cause FD leaks when used with sockopts which
> >>>>>>>>> install FDs into the process fdtable.
> >>>>>>>>>
> >>>>>>>>> After some offlist discussion it was proposed to add a blacklist of
> >>>>>>>>
> >>>>>>>> We try to replace this word by either denylist or blocklist, even in changelogs.
> >>>>>>>
> >>>>>>> Hi Eric,
> >>>>>>>
> >>>>>>> Oh, I'm sorry about that. :( Sure.
> >>>>>>>
> >>>>>>>>
> >>>>>>>>> socket options those can cause troubles when BPF cgroup hook is enabled.
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> Can we find the appropriate Fixes: tag to help stable teams ?
> >>>>>>>
> >>>>>>> Sure, I will add next time.
> >>>>>>>
> >>>>>>> Fixes: 0d01da6afc54 ("bpf: implement getsockopt and setsockopt hooks")
> >>>>>>>
> >>>>>>> I think it's better to add Stanislav Fomichev to CC.
> >>>>>>
> >>>>>> Can we use 'struct proto' bpf_bypass_getsockopt instead? We already
> >>>>>> use it for tcp zerocopy, I'm assuming it should work in this case as
> >>>>>> well?
> >>>>>
> >>>>> Jakub reminded me of the other things I wanted to ask here bug forgot:
> >>>>>
> >>>>> - setsockopt is probably not needed, right? setsockopt hook triggers
> >>>>>      before the kernel and shouldn't leak anything
> >>>>> - for getsockopt, instead of bypassing bpf completely, should we instead
> >>>>>      ignore the error from the bpf program? that would still preserve
> >>>>>      the observability aspect
> >>>>
> >>>> stealing this thread to discuss the optlen issue which may make sense to
> >>>> bypass also.
> >>>>
> >>>> There has been issue with optlen. Other than this older post related to
> >>>> optlen > PAGE_SIZE:
> >>>> https://lore.kernel.org/bpf/5c8b7d59-1f28-2284-f7b9-49d946f2e982@linux.dev/,
> >>>> the recent one related to optlen that we have seen is
> >>>> NETLINK_LIST_MEMBERSHIPS. The userspace passed in optlen == 0 and the kernel
> >>>> put the expected optlen (> 0) and 'return 0;' to userspace. The userspace
> >>>> intention is to learn the expected optlen. This makes 'ctx.optlen >
> >>>> max_optlen' and __cgroup_bpf_run_filter_getsockopt() ends up returning
> >>>> -EFAULT to the userspace even the bpf prog has not changed anything.
> >>>
> >>> (ignoring -EFAULT issue) this seems like it needs to be
> >>>
> >>>        if (optval && (ctx.optlen > max_optlen || ctx.optlen < 0)) {
> >>>                /* error */
> >>>        }
> >>>
> >>> ?
> >>>
> >>>> Does it make sense to also bypass the bpf prog when 'ctx.optlen >
> >>>> max_optlen' for now (and this can use a separate patch which as usual
> >>>> requires a bpf selftests)?
> >>>
> >>> Yeah, makes sense. Replacing this -EFAULT with WARN_ON_ONCE or something
> >>> seems like the way to go. It caused too much trouble already :-(
> >>>
> >>> Should I prepare a patch or do you want to take a stab at it?
> >>>
> >>>> In the future, does it make sense to have a specific cgroup-bpf-prog (a
> >>>> specific attach type?) that only uses bpf_dynptr kfunc to access the optval
> >>>> such that it can enforce read-only for some optname and potentially also
> >>>> track if bpf-prog has written a new optval? The bpf-prog can only return 1
> >>>> (OK) and only allows using bpf_set_retval() instead. Likely there is still
> >>>> holes but could be a seed of thought to continue polishing the idea.
> >>>
> >>> Ack, let's think about it.
> >>>
> >>> Maybe we should re-evaluate 'getsockopt-happens-after-the-kernel' idea
> >>> as well? If we can have a sleepable hook that can copy_from_user/copy_to_user,
> >>> and we have a mostly working bpf_getsockopt (after your refactoring),
> >>> I don't see why we need to continue the current scheme of triggering
> >>> after the kernel?
> >>
> >> Since a sleepable hook would cause some restrictions, perhaps, we could
> >> introduce something like the promise pattern.  In our case here, BPF
> >> program call an async version of copy_from_user()/copy_to_user() to
> >> return a promise.
> >
> > Having a promise might work. This is essentially what we already do
> > with sockets/etc with acquire/release pattern.
>
> Would you mind to give me some context of the socket things?

I'm mostly referring to the infra around KF_ACQUIRE/KF_RELEASE where
the verifier already has some resource tracking functionality. We can
probably extend it to verify that the program does copy_to_user
equivalent at the end of the run (or somehow specifically marks that
it's not needed).

> >
> > What are the sleepable restrictions you're hinting about? I feel like
> > with the sleepable bpf, we can also remove all the temporary buffer
> > management / extra copies which sounds like a win to me. (we have this
> > ugly heuristics with BPF_SOCKOPT_KERN_BUF_SIZE) The program can
> > allocate temporary buffers if needed..
> >
> >>>>> - or maybe we can even have a per-proto bpf_getsockopt_cleanup call that
> >>>>>      gets called whenever bpf returns an error to make sure protocols have
> >>>>>      a chance to handle that condition (and free the fd)
> >>>>>
> >>>>
> >>>>