bpf.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Nitin Hande <nitin.hande@gmail.com>
To: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Lorenz Bauer <lmb@cloudflare.com>,
	Joe Stringer <joe@isovalent.com>,
	Networking <netdev@vger.kernel.org>,
	bpf@vger.kernel.org, Martin KaFai Lau <kafai@fb.com>,
	Daniel Borkmann <daniel@iogearbox.net>,
	edumazet@google.com
Subject: Re: RFC: Fixing SK_REUSEPORT from sk_lookup_* helpers
Date: Thu, 16 May 2019 16:38:19 -0700	[thread overview]
Message-ID: <CAGUcTrqnrE+9BGsuc3sf_DpzsD01wP6h3PbK3-u6hk=6wM0zGg@mail.gmail.com> (raw)
In-Reply-To: <20190516203325.uhg7c5sr45od7lzm@ast-mbp>

On Thu, May 16, 2019 at 2:57 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Thu, May 16, 2019 at 09:41:34AM +0100, Lorenz Bauer wrote:
> > On Wed, 15 May 2019 at 18:16, Joe Stringer <joe@isovalent.com> wrote:
> > >
> > > On Wed, May 15, 2019 at 8:11 AM Lorenz Bauer <lmb@cloudflare.com> wrote:
> > > >
> > > > In the BPF-based TPROXY session with Joe Stringer [1], I mentioned
> > > > that the sk_lookup_* helpers currently return inconsistent results if
> > > > SK_REUSEPORT programs are in play.
> > > >
> > > > SK_REUSEPORT programs are a hook point in inet_lookup. They get access
> > > > to the full packet
> > > > that triggered the look up. To support this, inet_lookup gained a new
> > > > skb argument to provide such context. If skb is NULL, the SK_REUSEPORT
> > > > program is skipped and instead the socket is selected by its hash.
> > > >
> > > > The first problem is that not all callers to inet_lookup from BPF have
> > > > an skb, e.g. XDP. This means that a look up from XDP gives an
> > > > incorrect result. For now that is not a huge problem. However, once we
> > > > get sk_assign as proposed by Joe, we can end up circumventing
> > > > SK_REUSEPORT.
> > >
> > > To clarify a bit, the reason this is a problem is that a
> > > straightforward implementation may just consider passing the skb
> > > context into the sk_lookup_*() and through to the inet_lookup() so
> > > that it would run the SK_REUSEPORT BPF program for socket selection on
> > > the skb when the packet-path BPF program performs the socket lookup.
> > > However, as this paragraph describes, the skb context is not always
> > > available.
> > >
> > > > At the conference, someone suggested using a similar approach to the
> > > > work done on the flow dissector by Stanislav: create a dedicated
> > > > context sk_reuseport which can either take an skb or a plain pointer.
> > > > Patch up load_bytes to deal with both. Pass the context to
> > > > inet_lookup.
> > > >
> > > > This is when we hit the second problem: using the skb or XDP context
> > > > directly is incorrect, because it assumes that the relevant protocol
> > > > headers are at the start of the buffer. In our use case, the correct
> > > > headers are at an offset since we're inspecting encapsulated packets.
> > > >
> > > > The best solution I've come up with is to steal 17 bits from the flags
> > > > argument to sk_lookup_*, 1 bit for BPF_F_HEADERS_AT_OFFSET, 16bit for
> > > > the offset itself.
> > >
> > > FYI there's also the upper 32 bits of the netns_id parameter, another
> > > option would be to steal 16 bits from there.
> >
> > Or len, which is only 16 bits realistically. The offset doesn't really fit into
> > either of them very well, using flags seemed the cleanest to me.
> > Is there some best practice around this?
> >
> > >
> > > > Thoughts?
> > >
> > > Internally with skbs, we use `skb_pull()` to manage header offsets,
> > > could we do something similar with `bpf_xdp_adjust_head()` prior to
> > > the call to `bpf_sk_lookup_*()`?
> >
> > That would only work if it retained the contents of the skipped
> > buffer, and if there
> > was a way to undo the adjustment later. We're doing the sk_lookup to
> > decide whether to
> > accept or forward the packet, so at the point of the call we might still need
> > that data. Is that feasible with skb / XDP ctx?
>
> While discussing the solution for reuseport I propose to use
> progs/test_select_reuseport_kern.c as an example of realistic program.
> It reads tcp/udp header directly via ctx->data or via bpf_skb_load_bytes()
> including payload after the header.
> It also uses bpf_skb_load_bytes_relative() to fetch IP.
> I think if we're fixing the sk_lookup from XDP the above program
> would need to work.
>
> And I think we can make it work by adding new requirement that
> 'struct bpf_sock_tuple *' argument to bpf_sk_lookup_* must be
> a pointer to the packet and not a pointer to bpf program stack.
> Then helper can construct a fake skb and assign
> fake_skb->data = &bpf_sock_tuple_arg.sport
> It can check that struct bpf_sock_tuple * pointer is within 100-ish bytes
> from xdp->data and within xdp->data_end
> This way the reuseport program's assumption that ctx->data points to tcp/udp
> will be preserved and it can access it all including payload.
>
> This approach doesn't need to mess with xdp_adjust_head and adjust uapi to pass length.
> Existing progs/test_sk_lookup_kern.c will magically start working with XDP
> even when reuseport prog is attached.
> Thoughts?

I like this approach. A fake_skb approach will normalize the bpf_sk_lookup_*()
API peering into the kernel API between TC and XDP invocation. Just one question
that comes, I remember one of the comments I received during my XDP commit
was the stateless nature of XDP services and providing a fake_skb may bring
some potential side-effects to the desire of statelessness. Is that
still a possibility?
How do we guard against it?

Thanks
Nitin

>

  reply	other threads:[~2019-05-16 23:38 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-05-15 15:11 RFC: Fixing SK_REUSEPORT from sk_lookup_* helpers Lorenz Bauer
2019-05-15 17:16 ` Joe Stringer
2019-05-16  8:41   ` Lorenz Bauer
2019-05-16 20:33     ` Alexei Starovoitov
2019-05-16 23:38       ` Nitin Hande [this message]
2019-05-21 15:47         ` Lorenz Bauer
2019-05-21 21:39           ` Alexei Starovoitov
2019-05-17 14:15       ` Lorenz Bauer
2019-05-19  1:20         ` Joe Stringer

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAGUcTrqnrE+9BGsuc3sf_DpzsD01wP6h3PbK3-u6hk=6wM0zGg@mail.gmail.com' \
    --to=nitin.hande@gmail.com \
    --cc=alexei.starovoitov@gmail.com \
    --cc=bpf@vger.kernel.org \
    --cc=daniel@iogearbox.net \
    --cc=edumazet@google.com \
    --cc=joe@isovalent.com \
    --cc=kafai@fb.com \
    --cc=lmb@cloudflare.com \
    --cc=netdev@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).