BPF Archive on lore.kernel.org
 help / Atom feed
* RFC: Fixing SK_REUSEPORT from sk_lookup_* helpers
@ 2019-05-15 15:11 Lorenz Bauer
  2019-05-15 17:16 ` Joe Stringer
  0 siblings, 1 reply; 9+ messages in thread
From: Lorenz Bauer @ 2019-05-15 15:11 UTC (permalink / raw)
  To: Networking, bpf; +Cc: Joe Stringer

In the BPF-based TPROXY session with Joe Stringer [1], I mentioned
that the sk_lookup_* helpers currently return inconsistent results if
SK_REUSEPORT programs are in play.

SK_REUSEPORT programs are a hook point in inet_lookup. They get access
to the full packet
that triggered the look up. To support this, inet_lookup gained a new
skb argument to provide such context. If skb is NULL, the SK_REUSEPORT
program is skipped and instead the socket is selected by its hash.

The first problem is that not all callers to inet_lookup from BPF have
an skb, e.g. XDP. This means that a look up from XDP gives an
incorrect result. For now that is not a huge problem. However, once we
get sk_assign as proposed by Joe, we can end up circumventing
SK_REUSEPORT.

At the conference, someone suggested using a similar approach to the
work done on the flow dissector by Stanislav: create a dedicated
context sk_reuseport which can either take an skb or a plain pointer.
Patch up load_bytes to deal with both. Pass the context to
inet_lookup.

This is when we hit the second problem: using the skb or XDP context
directly is incorrect, because it assumes that the relevant protocol
headers are at the start of the buffer. In our use case, the correct
headers are at an offset since we're inspecting encapsulated packets.

The best solution I've come up with is to steal 17 bits from the flags
argument to sk_lookup_*, 1 bit for BPF_F_HEADERS_AT_OFFSET, 16bit for
the offset itself.

Thoughts?

1: http://vger.kernel.org/bpfconf2019.html#session-7
-- 
Lorenz Bauer  |  Systems Engineer
6th Floor, County Hall/The Riverside Building, SE1 7PB, UK

www.cloudflare.com

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: RFC: Fixing SK_REUSEPORT from sk_lookup_* helpers
  2019-05-15 15:11 RFC: Fixing SK_REUSEPORT from sk_lookup_* helpers Lorenz Bauer
@ 2019-05-15 17:16 ` Joe Stringer
  2019-05-16  8:41   ` Lorenz Bauer
  0 siblings, 1 reply; 9+ messages in thread
From: Joe Stringer @ 2019-05-15 17:16 UTC (permalink / raw)
  To: Lorenz Bauer; +Cc: Networking, bpf

On Wed, May 15, 2019 at 8:11 AM Lorenz Bauer <lmb@cloudflare.com> wrote:
>
> In the BPF-based TPROXY session with Joe Stringer [1], I mentioned
> that the sk_lookup_* helpers currently return inconsistent results if
> SK_REUSEPORT programs are in play.
>
> SK_REUSEPORT programs are a hook point in inet_lookup. They get access
> to the full packet
> that triggered the look up. To support this, inet_lookup gained a new
> skb argument to provide such context. If skb is NULL, the SK_REUSEPORT
> program is skipped and instead the socket is selected by its hash.
>
> The first problem is that not all callers to inet_lookup from BPF have
> an skb, e.g. XDP. This means that a look up from XDP gives an
> incorrect result. For now that is not a huge problem. However, once we
> get sk_assign as proposed by Joe, we can end up circumventing
> SK_REUSEPORT.

To clarify a bit, the reason this is a problem is that a
straightforward implementation may just consider passing the skb
context into the sk_lookup_*() and through to the inet_lookup() so
that it would run the SK_REUSEPORT BPF program for socket selection on
the skb when the packet-path BPF program performs the socket lookup.
However, as this paragraph describes, the skb context is not always
available.

> At the conference, someone suggested using a similar approach to the
> work done on the flow dissector by Stanislav: create a dedicated
> context sk_reuseport which can either take an skb or a plain pointer.
> Patch up load_bytes to deal with both. Pass the context to
> inet_lookup.
>
> This is when we hit the second problem: using the skb or XDP context
> directly is incorrect, because it assumes that the relevant protocol
> headers are at the start of the buffer. In our use case, the correct
> headers are at an offset since we're inspecting encapsulated packets.
>
> The best solution I've come up with is to steal 17 bits from the flags
> argument to sk_lookup_*, 1 bit for BPF_F_HEADERS_AT_OFFSET, 16bit for
> the offset itself.

FYI there's also the upper 32 bits of the netns_id parameter, another
option would be to steal 16 bits from there.

> Thoughts?

Internally with skbs, we use `skb_pull()` to manage header offsets,
could we do something similar with `bpf_xdp_adjust_head()` prior to
the call to `bpf_sk_lookup_*()`?

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: RFC: Fixing SK_REUSEPORT from sk_lookup_* helpers
  2019-05-15 17:16 ` Joe Stringer
@ 2019-05-16  8:41   ` Lorenz Bauer
  2019-05-16 20:33     ` Alexei Starovoitov
  0 siblings, 1 reply; 9+ messages in thread
From: Lorenz Bauer @ 2019-05-16  8:41 UTC (permalink / raw)
  To: Joe Stringer; +Cc: Networking, bpf

On Wed, 15 May 2019 at 18:16, Joe Stringer <joe@isovalent.com> wrote:
>
> On Wed, May 15, 2019 at 8:11 AM Lorenz Bauer <lmb@cloudflare.com> wrote:
> >
> > In the BPF-based TPROXY session with Joe Stringer [1], I mentioned
> > that the sk_lookup_* helpers currently return inconsistent results if
> > SK_REUSEPORT programs are in play.
> >
> > SK_REUSEPORT programs are a hook point in inet_lookup. They get access
> > to the full packet
> > that triggered the look up. To support this, inet_lookup gained a new
> > skb argument to provide such context. If skb is NULL, the SK_REUSEPORT
> > program is skipped and instead the socket is selected by its hash.
> >
> > The first problem is that not all callers to inet_lookup from BPF have
> > an skb, e.g. XDP. This means that a look up from XDP gives an
> > incorrect result. For now that is not a huge problem. However, once we
> > get sk_assign as proposed by Joe, we can end up circumventing
> > SK_REUSEPORT.
>
> To clarify a bit, the reason this is a problem is that a
> straightforward implementation may just consider passing the skb
> context into the sk_lookup_*() and through to the inet_lookup() so
> that it would run the SK_REUSEPORT BPF program for socket selection on
> the skb when the packet-path BPF program performs the socket lookup.
> However, as this paragraph describes, the skb context is not always
> available.
>
> > At the conference, someone suggested using a similar approach to the
> > work done on the flow dissector by Stanislav: create a dedicated
> > context sk_reuseport which can either take an skb or a plain pointer.
> > Patch up load_bytes to deal with both. Pass the context to
> > inet_lookup.
> >
> > This is when we hit the second problem: using the skb or XDP context
> > directly is incorrect, because it assumes that the relevant protocol
> > headers are at the start of the buffer. In our use case, the correct
> > headers are at an offset since we're inspecting encapsulated packets.
> >
> > The best solution I've come up with is to steal 17 bits from the flags
> > argument to sk_lookup_*, 1 bit for BPF_F_HEADERS_AT_OFFSET, 16bit for
> > the offset itself.
>
> FYI there's also the upper 32 bits of the netns_id parameter, another
> option would be to steal 16 bits from there.

Or len, which is only 16 bits realistically. The offset doesn't really fit into
either of them very well, using flags seemed the cleanest to me.
Is there some best practice around this?

>
> > Thoughts?
>
> Internally with skbs, we use `skb_pull()` to manage header offsets,
> could we do something similar with `bpf_xdp_adjust_head()` prior to
> the call to `bpf_sk_lookup_*()`?

That would only work if it retained the contents of the skipped
buffer, and if there
was a way to undo the adjustment later. We're doing the sk_lookup to
decide whether to
accept or forward the packet, so at the point of the call we might still need
that data. Is that feasible with skb / XDP ctx?

-- 
Lorenz Bauer  |  Systems Engineer
6th Floor, County Hall/The Riverside Building, SE1 7PB, UK

www.cloudflare.com

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: RFC: Fixing SK_REUSEPORT from sk_lookup_* helpers
  2019-05-16  8:41   ` Lorenz Bauer
@ 2019-05-16 20:33     ` Alexei Starovoitov
  2019-05-16 23:38       ` Nitin Hande
  2019-05-17 14:15       ` Lorenz Bauer
  0 siblings, 2 replies; 9+ messages in thread
From: Alexei Starovoitov @ 2019-05-16 20:33 UTC (permalink / raw)
  To: Lorenz Bauer; +Cc: Joe Stringer, Networking, bpf, kafai, daniel, edumazet

On Thu, May 16, 2019 at 09:41:34AM +0100, Lorenz Bauer wrote:
> On Wed, 15 May 2019 at 18:16, Joe Stringer <joe@isovalent.com> wrote:
> >
> > On Wed, May 15, 2019 at 8:11 AM Lorenz Bauer <lmb@cloudflare.com> wrote:
> > >
> > > In the BPF-based TPROXY session with Joe Stringer [1], I mentioned
> > > that the sk_lookup_* helpers currently return inconsistent results if
> > > SK_REUSEPORT programs are in play.
> > >
> > > SK_REUSEPORT programs are a hook point in inet_lookup. They get access
> > > to the full packet
> > > that triggered the look up. To support this, inet_lookup gained a new
> > > skb argument to provide such context. If skb is NULL, the SK_REUSEPORT
> > > program is skipped and instead the socket is selected by its hash.
> > >
> > > The first problem is that not all callers to inet_lookup from BPF have
> > > an skb, e.g. XDP. This means that a look up from XDP gives an
> > > incorrect result. For now that is not a huge problem. However, once we
> > > get sk_assign as proposed by Joe, we can end up circumventing
> > > SK_REUSEPORT.
> >
> > To clarify a bit, the reason this is a problem is that a
> > straightforward implementation may just consider passing the skb
> > context into the sk_lookup_*() and through to the inet_lookup() so
> > that it would run the SK_REUSEPORT BPF program for socket selection on
> > the skb when the packet-path BPF program performs the socket lookup.
> > However, as this paragraph describes, the skb context is not always
> > available.
> >
> > > At the conference, someone suggested using a similar approach to the
> > > work done on the flow dissector by Stanislav: create a dedicated
> > > context sk_reuseport which can either take an skb or a plain pointer.
> > > Patch up load_bytes to deal with both. Pass the context to
> > > inet_lookup.
> > >
> > > This is when we hit the second problem: using the skb or XDP context
> > > directly is incorrect, because it assumes that the relevant protocol
> > > headers are at the start of the buffer. In our use case, the correct
> > > headers are at an offset since we're inspecting encapsulated packets.
> > >
> > > The best solution I've come up with is to steal 17 bits from the flags
> > > argument to sk_lookup_*, 1 bit for BPF_F_HEADERS_AT_OFFSET, 16bit for
> > > the offset itself.
> >
> > FYI there's also the upper 32 bits of the netns_id parameter, another
> > option would be to steal 16 bits from there.
> 
> Or len, which is only 16 bits realistically. The offset doesn't really fit into
> either of them very well, using flags seemed the cleanest to me.
> Is there some best practice around this?
> 
> >
> > > Thoughts?
> >
> > Internally with skbs, we use `skb_pull()` to manage header offsets,
> > could we do something similar with `bpf_xdp_adjust_head()` prior to
> > the call to `bpf_sk_lookup_*()`?
> 
> That would only work if it retained the contents of the skipped
> buffer, and if there
> was a way to undo the adjustment later. We're doing the sk_lookup to
> decide whether to
> accept or forward the packet, so at the point of the call we might still need
> that data. Is that feasible with skb / XDP ctx?

While discussing the solution for reuseport I propose to use
progs/test_select_reuseport_kern.c as an example of realistic program.
It reads tcp/udp header directly via ctx->data or via bpf_skb_load_bytes()
including payload after the header.
It also uses bpf_skb_load_bytes_relative() to fetch IP.
I think if we're fixing the sk_lookup from XDP the above program
would need to work.

And I think we can make it work by adding new requirement that
'struct bpf_sock_tuple *' argument to bpf_sk_lookup_* must be
a pointer to the packet and not a pointer to bpf program stack.
Then helper can construct a fake skb and assign
fake_skb->data = &bpf_sock_tuple_arg.sport
It can check that struct bpf_sock_tuple * pointer is within 100-ish bytes
from xdp->data and within xdp->data_end
This way the reuseport program's assumption that ctx->data points to tcp/udp
will be preserved and it can access it all including payload.

This approach doesn't need to mess with xdp_adjust_head and adjust uapi to pass length.
Existing progs/test_sk_lookup_kern.c will magically start working with XDP
even when reuseport prog is attached.
Thoughts?


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: RFC: Fixing SK_REUSEPORT from sk_lookup_* helpers
  2019-05-16 20:33     ` Alexei Starovoitov
@ 2019-05-16 23:38       ` Nitin Hande
  2019-05-21 15:47         ` Lorenz Bauer
  2019-05-17 14:15       ` Lorenz Bauer
  1 sibling, 1 reply; 9+ messages in thread
From: Nitin Hande @ 2019-05-16 23:38 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Lorenz Bauer, Joe Stringer, Networking, bpf, Martin KaFai Lau,
	Daniel Borkmann, edumazet

On Thu, May 16, 2019 at 2:57 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Thu, May 16, 2019 at 09:41:34AM +0100, Lorenz Bauer wrote:
> > On Wed, 15 May 2019 at 18:16, Joe Stringer <joe@isovalent.com> wrote:
> > >
> > > On Wed, May 15, 2019 at 8:11 AM Lorenz Bauer <lmb@cloudflare.com> wrote:
> > > >
> > > > In the BPF-based TPROXY session with Joe Stringer [1], I mentioned
> > > > that the sk_lookup_* helpers currently return inconsistent results if
> > > > SK_REUSEPORT programs are in play.
> > > >
> > > > SK_REUSEPORT programs are a hook point in inet_lookup. They get access
> > > > to the full packet
> > > > that triggered the look up. To support this, inet_lookup gained a new
> > > > skb argument to provide such context. If skb is NULL, the SK_REUSEPORT
> > > > program is skipped and instead the socket is selected by its hash.
> > > >
> > > > The first problem is that not all callers to inet_lookup from BPF have
> > > > an skb, e.g. XDP. This means that a look up from XDP gives an
> > > > incorrect result. For now that is not a huge problem. However, once we
> > > > get sk_assign as proposed by Joe, we can end up circumventing
> > > > SK_REUSEPORT.
> > >
> > > To clarify a bit, the reason this is a problem is that a
> > > straightforward implementation may just consider passing the skb
> > > context into the sk_lookup_*() and through to the inet_lookup() so
> > > that it would run the SK_REUSEPORT BPF program for socket selection on
> > > the skb when the packet-path BPF program performs the socket lookup.
> > > However, as this paragraph describes, the skb context is not always
> > > available.
> > >
> > > > At the conference, someone suggested using a similar approach to the
> > > > work done on the flow dissector by Stanislav: create a dedicated
> > > > context sk_reuseport which can either take an skb or a plain pointer.
> > > > Patch up load_bytes to deal with both. Pass the context to
> > > > inet_lookup.
> > > >
> > > > This is when we hit the second problem: using the skb or XDP context
> > > > directly is incorrect, because it assumes that the relevant protocol
> > > > headers are at the start of the buffer. In our use case, the correct
> > > > headers are at an offset since we're inspecting encapsulated packets.
> > > >
> > > > The best solution I've come up with is to steal 17 bits from the flags
> > > > argument to sk_lookup_*, 1 bit for BPF_F_HEADERS_AT_OFFSET, 16bit for
> > > > the offset itself.
> > >
> > > FYI there's also the upper 32 bits of the netns_id parameter, another
> > > option would be to steal 16 bits from there.
> >
> > Or len, which is only 16 bits realistically. The offset doesn't really fit into
> > either of them very well, using flags seemed the cleanest to me.
> > Is there some best practice around this?
> >
> > >
> > > > Thoughts?
> > >
> > > Internally with skbs, we use `skb_pull()` to manage header offsets,
> > > could we do something similar with `bpf_xdp_adjust_head()` prior to
> > > the call to `bpf_sk_lookup_*()`?
> >
> > That would only work if it retained the contents of the skipped
> > buffer, and if there
> > was a way to undo the adjustment later. We're doing the sk_lookup to
> > decide whether to
> > accept or forward the packet, so at the point of the call we might still need
> > that data. Is that feasible with skb / XDP ctx?
>
> While discussing the solution for reuseport I propose to use
> progs/test_select_reuseport_kern.c as an example of realistic program.
> It reads tcp/udp header directly via ctx->data or via bpf_skb_load_bytes()
> including payload after the header.
> It also uses bpf_skb_load_bytes_relative() to fetch IP.
> I think if we're fixing the sk_lookup from XDP the above program
> would need to work.
>
> And I think we can make it work by adding new requirement that
> 'struct bpf_sock_tuple *' argument to bpf_sk_lookup_* must be
> a pointer to the packet and not a pointer to bpf program stack.
> Then helper can construct a fake skb and assign
> fake_skb->data = &bpf_sock_tuple_arg.sport
> It can check that struct bpf_sock_tuple * pointer is within 100-ish bytes
> from xdp->data and within xdp->data_end
> This way the reuseport program's assumption that ctx->data points to tcp/udp
> will be preserved and it can access it all including payload.
>
> This approach doesn't need to mess with xdp_adjust_head and adjust uapi to pass length.
> Existing progs/test_sk_lookup_kern.c will magically start working with XDP
> even when reuseport prog is attached.
> Thoughts?

I like this approach. A fake_skb approach will normalize the bpf_sk_lookup_*()
API peering into the kernel API between TC and XDP invocation. Just one question
that comes, I remember one of the comments I received during my XDP commit
was the stateless nature of XDP services and providing a fake_skb may bring
some potential side-effects to the desire of statelessness. Is that
still a possibility?
How do we guard against it?

Thanks
Nitin

>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: RFC: Fixing SK_REUSEPORT from sk_lookup_* helpers
  2019-05-16 20:33     ` Alexei Starovoitov
  2019-05-16 23:38       ` Nitin Hande
@ 2019-05-17 14:15       ` Lorenz Bauer
  2019-05-19  1:20         ` Joe Stringer
  1 sibling, 1 reply; 9+ messages in thread
From: Lorenz Bauer @ 2019-05-17 14:15 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Joe Stringer, Networking, bpf, Martin Lau, Daniel Borkmann, edumazet

On Thu, 16 May 2019 at 21:33, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Thu, May 16, 2019 at 09:41:34AM +0100, Lorenz Bauer wrote:
> > On Wed, 15 May 2019 at 18:16, Joe Stringer <joe@isovalent.com> wrote:
> > >
> > > On Wed, May 15, 2019 at 8:11 AM Lorenz Bauer <lmb@cloudflare.com> wrote:
> > > >
> > > > In the BPF-based TPROXY session with Joe Stringer [1], I mentioned
> > > > that the sk_lookup_* helpers currently return inconsistent results if
> > > > SK_REUSEPORT programs are in play.
> > > >
> > > > SK_REUSEPORT programs are a hook point in inet_lookup. They get access
> > > > to the full packet
> > > > that triggered the look up. To support this, inet_lookup gained a new
> > > > skb argument to provide such context. If skb is NULL, the SK_REUSEPORT
> > > > program is skipped and instead the socket is selected by its hash.
> > > >
> > > > The first problem is that not all callers to inet_lookup from BPF have
> > > > an skb, e.g. XDP. This means that a look up from XDP gives an
> > > > incorrect result. For now that is not a huge problem. However, once we
> > > > get sk_assign as proposed by Joe, we can end up circumventing
> > > > SK_REUSEPORT.
> > >
> > > To clarify a bit, the reason this is a problem is that a
> > > straightforward implementation may just consider passing the skb
> > > context into the sk_lookup_*() and through to the inet_lookup() so
> > > that it would run the SK_REUSEPORT BPF program for socket selection on
> > > the skb when the packet-path BPF program performs the socket lookup.
> > > However, as this paragraph describes, the skb context is not always
> > > available.
> > >
> > > > At the conference, someone suggested using a similar approach to the
> > > > work done on the flow dissector by Stanislav: create a dedicated
> > > > context sk_reuseport which can either take an skb or a plain pointer.
> > > > Patch up load_bytes to deal with both. Pass the context to
> > > > inet_lookup.
> > > >
> > > > This is when we hit the second problem: using the skb or XDP context
> > > > directly is incorrect, because it assumes that the relevant protocol
> > > > headers are at the start of the buffer. In our use case, the correct
> > > > headers are at an offset since we're inspecting encapsulated packets.
> > > >
> > > > The best solution I've come up with is to steal 17 bits from the flags
> > > > argument to sk_lookup_*, 1 bit for BPF_F_HEADERS_AT_OFFSET, 16bit for
> > > > the offset itself.
> > >
> > > FYI there's also the upper 32 bits of the netns_id parameter, another
> > > option would be to steal 16 bits from there.
> >
> > Or len, which is only 16 bits realistically. The offset doesn't really fit into
> > either of them very well, using flags seemed the cleanest to me.
> > Is there some best practice around this?
> >
> > >
> > > > Thoughts?
> > >
> > > Internally with skbs, we use `skb_pull()` to manage header offsets,
> > > could we do something similar with `bpf_xdp_adjust_head()` prior to
> > > the call to `bpf_sk_lookup_*()`?
> >
> > That would only work if it retained the contents of the skipped
> > buffer, and if there
> > was a way to undo the adjustment later. We're doing the sk_lookup to
> > decide whether to
> > accept or forward the packet, so at the point of the call we might still need
> > that data. Is that feasible with skb / XDP ctx?
>
> While discussing the solution for reuseport I propose to use
> progs/test_select_reuseport_kern.c as an example of realistic program.
> It reads tcp/udp header directly via ctx->data or via bpf_skb_load_bytes()
> including payload after the header.
> It also uses bpf_skb_load_bytes_relative() to fetch IP.
> I think if we're fixing the sk_lookup from XDP the above program
> would need to work.

Agreed.

> And I think we can make it work by adding new requirement that
> 'struct bpf_sock_tuple *' argument to bpf_sk_lookup_* must be
> a pointer to the packet and not a pointer to bpf program stack.

This would break existing users, no? The sk_assign use case Joe Stringer
is working on would also break, because its impossible to look up a tuple
that hasn't come from the network.

It occurs to me that it's impossible to reconcile this use case with
SK_REUSEPORT in general. It would be great if we could return an
error in such case.

> Then helper can construct a fake skb and assign
> fake_skb->data = &bpf_sock_tuple_arg.sport

That isn't valid if the packet contains IP options or extension headers, because
the offset of sport is variable.

> It can check that struct bpf_sock_tuple * pointer is within 100-ish bytes
> from xdp->data and within xdp->data_end

Why the 100-byte limitation?

> This way the reuseport program's assumption that ctx->data points to tcp/udp
> will be preserved and it can access it all including payload.

How about the following:

    sk_lookup(ctx, &saddr, len, netns, BPF_F_IPV4 |
BPF_F_OFFSET(offsetof(sport))

SK_REUSEPORT can then access from saddr+offsetof(sport) to saddr+len.
The helper uses
offsetof(sport) to retrieve the tuple.

- Works with stack, map, packet pointers
- The verifier does bounds checking on the buffer for us due to ARG_CONST_SIZE
- If no BPF_F_IPV? is present, we fall back to current behaviour

>
> This approach doesn't need to mess with xdp_adjust_head and adjust uapi to pass length.
> Existing progs/test_sk_lookup_kern.c will magically start working with XDP
> even when reuseport prog is attached.
> Thoughts?
>


-- 
Lorenz Bauer  |  Systems Engineer
6th Floor, County Hall/The Riverside Building, SE1 7PB, UK

www.cloudflare.com

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: RFC: Fixing SK_REUSEPORT from sk_lookup_* helpers
  2019-05-17 14:15       ` Lorenz Bauer
@ 2019-05-19  1:20         ` Joe Stringer
  0 siblings, 0 replies; 9+ messages in thread
From: Joe Stringer @ 2019-05-19  1:20 UTC (permalink / raw)
  To: Lorenz Bauer
  Cc: Alexei Starovoitov, Networking, bpf, Martin Lau, Daniel Borkmann,
	edumazet

On Fri, May 17, 2019 at 7:15 AM Lorenz Bauer <lmb@cloudflare.com> wrote:
>
> On Thu, 16 May 2019 at 21:33, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Thu, May 16, 2019 at 09:41:34AM +0100, Lorenz Bauer wrote:
> > > On Wed, 15 May 2019 at 18:16, Joe Stringer <joe@isovalent.com> wrote:
> > > >
> > > > On Wed, May 15, 2019 at 8:11 AM Lorenz Bauer <lmb@cloudflare.com> wrote:
> > > > >
> > > > > In the BPF-based TPROXY session with Joe Stringer [1], I mentioned
> > > > > that the sk_lookup_* helpers currently return inconsistent results if
> > > > > SK_REUSEPORT programs are in play.
> > > > >
> > > > > SK_REUSEPORT programs are a hook point in inet_lookup. They get access
> > > > > to the full packet
> > > > > that triggered the look up. To support this, inet_lookup gained a new
> > > > > skb argument to provide such context. If skb is NULL, the SK_REUSEPORT
> > > > > program is skipped and instead the socket is selected by its hash.
> > > > >
> > > > > The first problem is that not all callers to inet_lookup from BPF have
> > > > > an skb, e.g. XDP. This means that a look up from XDP gives an
> > > > > incorrect result. For now that is not a huge problem. However, once we
> > > > > get sk_assign as proposed by Joe, we can end up circumventing
> > > > > SK_REUSEPORT.
> > > >
> > > > To clarify a bit, the reason this is a problem is that a
> > > > straightforward implementation may just consider passing the skb
> > > > context into the sk_lookup_*() and through to the inet_lookup() so
> > > > that it would run the SK_REUSEPORT BPF program for socket selection on
> > > > the skb when the packet-path BPF program performs the socket lookup.
> > > > However, as this paragraph describes, the skb context is not always
> > > > available.
> > > >
> > > > > At the conference, someone suggested using a similar approach to the
> > > > > work done on the flow dissector by Stanislav: create a dedicated
> > > > > context sk_reuseport which can either take an skb or a plain pointer.
> > > > > Patch up load_bytes to deal with both. Pass the context to
> > > > > inet_lookup.
> > > > >
> > > > > This is when we hit the second problem: using the skb or XDP context
> > > > > directly is incorrect, because it assumes that the relevant protocol
> > > > > headers are at the start of the buffer. In our use case, the correct
> > > > > headers are at an offset since we're inspecting encapsulated packets.
> > > > >
> > > > > The best solution I've come up with is to steal 17 bits from the flags
> > > > > argument to sk_lookup_*, 1 bit for BPF_F_HEADERS_AT_OFFSET, 16bit for
> > > > > the offset itself.
> > > >
> > > > FYI there's also the upper 32 bits of the netns_id parameter, another
> > > > option would be to steal 16 bits from there.
> > >
> > > Or len, which is only 16 bits realistically. The offset doesn't really fit into
> > > either of them very well, using flags seemed the cleanest to me.
> > > Is there some best practice around this?
> > >
> > > >
> > > > > Thoughts?
> > > >
> > > > Internally with skbs, we use `skb_pull()` to manage header offsets,
> > > > could we do something similar with `bpf_xdp_adjust_head()` prior to
> > > > the call to `bpf_sk_lookup_*()`?
> > >
> > > That would only work if it retained the contents of the skipped
> > > buffer, and if there
> > > was a way to undo the adjustment later. We're doing the sk_lookup to
> > > decide whether to
> > > accept or forward the packet, so at the point of the call we might still need
> > > that data. Is that feasible with skb / XDP ctx?
> >
> > While discussing the solution for reuseport I propose to use
> > progs/test_select_reuseport_kern.c as an example of realistic program.
> > It reads tcp/udp header directly via ctx->data or via bpf_skb_load_bytes()
> > including payload after the header.
> > It also uses bpf_skb_load_bytes_relative() to fetch IP.
> > I think if we're fixing the sk_lookup from XDP the above program
> > would need to work.
>
> Agreed.
>
> > And I think we can make it work by adding new requirement that
> > 'struct bpf_sock_tuple *' argument to bpf_sk_lookup_* must be
> > a pointer to the packet and not a pointer to bpf program stack.
>
> This would break existing users, no? The sk_assign use case Joe Stringer
> is working on would also break, because its impossible to look up a tuple
> that hasn't come from the network.

Right, in practice the bpf prog sk lookups for tproxy use case look
like first, look up with packet tuple, then if no tproxied socket is
found, substitute the destination port with the pre-configured tproxy
port and look up. Requiring packet pointer means rewriting the packet
for this case (not to mention the ext hdrs case that Lorenz mentions
below).

> It occurs to me that it's impossible to reconcile this use case with
> SK_REUSEPORT in general. It would be great if we could return an
> error in such case.
>
> > Then helper can construct a fake skb and assign
> > fake_skb->data = &bpf_sock_tuple_arg.sport
>
> That isn't valid if the packet contains IP options or extension headers, because
> the offset of sport is variable.
>
> > It can check that struct bpf_sock_tuple * pointer is within 100-ish bytes
> > from xdp->data and within xdp->data_end
>
> Why the 100-byte limitation?
>
> > This way the reuseport program's assumption that ctx->data points to tcp/udp
> > will be preserved and it can access it all including payload.
>
> How about the following:
>
>     sk_lookup(ctx, &saddr, len, netns, BPF_F_IPV4 |
> BPF_F_OFFSET(offsetof(sport))
>
> SK_REUSEPORT can then access from saddr+offsetof(sport) to saddr+len.
> The helper uses
> offsetof(sport) to retrieve the tuple.
>
> - Works with stack, map, packet pointers
> - The verifier does bounds checking on the buffer for us due to ARG_CONST_SIZE
> - If no BPF_F_IPV? is present, we fall back to current behaviour
>
> >
> > This approach doesn't need to mess with xdp_adjust_head and adjust uapi to pass length.
> > Existing progs/test_sk_lookup_kern.c will magically start working with XDP
> > even when reuseport prog is attached.
> > Thoughts?
> >
>
>
> --
> Lorenz Bauer  |  Systems Engineer
> 6th Floor, County Hall/The Riverside Building, SE1 7PB, UK
>
> www.cloudflare.com

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: RFC: Fixing SK_REUSEPORT from sk_lookup_* helpers
  2019-05-16 23:38       ` Nitin Hande
@ 2019-05-21 15:47         ` Lorenz Bauer
  2019-05-21 21:39           ` Alexei Starovoitov
  0 siblings, 1 reply; 9+ messages in thread
From: Lorenz Bauer @ 2019-05-21 15:47 UTC (permalink / raw)
  To: Nitin Hande
  Cc: Alexei Starovoitov, Joe Stringer, Networking, bpf,
	Martin KaFai Lau, Daniel Borkmann, Eric Dumazet

On Fri, 17 May 2019 at 00:38, Nitin Hande <nitin.hande@gmail.com> wrote:
>
> On Thu, May 16, 2019 at 2:57 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Thu, May 16, 2019 at 09:41:34AM +0100, Lorenz Bauer wrote:
> > > On Wed, 15 May 2019 at 18:16, Joe Stringer <joe@isovalent.com> wrote:
> > > >
> > > > On Wed, May 15, 2019 at 8:11 AM Lorenz Bauer <lmb@cloudflare.com> wrote:
> > > > >
> > > > > In the BPF-based TPROXY session with Joe Stringer [1], I mentioned
> > > > > that the sk_lookup_* helpers currently return inconsistent results if
> > > > > SK_REUSEPORT programs are in play.
> > > > >
> > > > > SK_REUSEPORT programs are a hook point in inet_lookup. They get access
> > > > > to the full packet
> > > > > that triggered the look up. To support this, inet_lookup gained a new
> > > > > skb argument to provide such context. If skb is NULL, the SK_REUSEPORT
> > > > > program is skipped and instead the socket is selected by its hash.
> > > > >
> > > > > The first problem is that not all callers to inet_lookup from BPF have
> > > > > an skb, e.g. XDP. This means that a look up from XDP gives an
> > > > > incorrect result. For now that is not a huge problem. However, once we
> > > > > get sk_assign as proposed by Joe, we can end up circumventing
> > > > > SK_REUSEPORT.
> > > >
> > > > To clarify a bit, the reason this is a problem is that a
> > > > straightforward implementation may just consider passing the skb
> > > > context into the sk_lookup_*() and through to the inet_lookup() so
> > > > that it would run the SK_REUSEPORT BPF program for socket selection on
> > > > the skb when the packet-path BPF program performs the socket lookup.
> > > > However, as this paragraph describes, the skb context is not always
> > > > available.
> > > >
> > > > > At the conference, someone suggested using a similar approach to the
> > > > > work done on the flow dissector by Stanislav: create a dedicated
> > > > > context sk_reuseport which can either take an skb or a plain pointer.
> > > > > Patch up load_bytes to deal with both. Pass the context to
> > > > > inet_lookup.
> > > > >
> > > > > This is when we hit the second problem: using the skb or XDP context
> > > > > directly is incorrect, because it assumes that the relevant protocol
> > > > > headers are at the start of the buffer. In our use case, the correct
> > > > > headers are at an offset since we're inspecting encapsulated packets.
> > > > >
> > > > > The best solution I've come up with is to steal 17 bits from the flags
> > > > > argument to sk_lookup_*, 1 bit for BPF_F_HEADERS_AT_OFFSET, 16bit for
> > > > > the offset itself.
> > > >
> > > > FYI there's also the upper 32 bits of the netns_id parameter, another
> > > > option would be to steal 16 bits from there.
> > >
> > > Or len, which is only 16 bits realistically. The offset doesn't really fit into
> > > either of them very well, using flags seemed the cleanest to me.
> > > Is there some best practice around this?
> > >
> > > >
> > > > > Thoughts?
> > > >
> > > > Internally with skbs, we use `skb_pull()` to manage header offsets,
> > > > could we do something similar with `bpf_xdp_adjust_head()` prior to
> > > > the call to `bpf_sk_lookup_*()`?
> > >
> > > That would only work if it retained the contents of the skipped
> > > buffer, and if there
> > > was a way to undo the adjustment later. We're doing the sk_lookup to
> > > decide whether to
> > > accept or forward the packet, so at the point of the call we might still need
> > > that data. Is that feasible with skb / XDP ctx?
> >
> > While discussing the solution for reuseport I propose to use
> > progs/test_select_reuseport_kern.c as an example of realistic program.
> > It reads tcp/udp header directly via ctx->data or via bpf_skb_load_bytes()
> > including payload after the header.
> > It also uses bpf_skb_load_bytes_relative() to fetch IP.
> > I think if we're fixing the sk_lookup from XDP the above program
> > would need to work.
> >
> > And I think we can make it work by adding new requirement that
> > 'struct bpf_sock_tuple *' argument to bpf_sk_lookup_* must be
> > a pointer to the packet and not a pointer to bpf program stack.
> > Then helper can construct a fake skb and assign
> > fake_skb->data = &bpf_sock_tuple_arg.sport
> > It can check that struct bpf_sock_tuple * pointer is within 100-ish bytes
> > from xdp->data and within xdp->data_end
> > This way the reuseport program's assumption that ctx->data points to tcp/udp
> > will be preserved and it can access it all including payload.
> >
> > This approach doesn't need to mess with xdp_adjust_head and adjust uapi to pass length.
> > Existing progs/test_sk_lookup_kern.c will magically start working with XDP
> > even when reuseport prog is attached.
> > Thoughts?
>
> I like this approach. A fake_skb approach will normalize the bpf_sk_lookup_*()
> API peering into the kernel API between TC and XDP invocation. Just one question
> that comes, I remember one of the comments I received during my XDP commit
> was the stateless nature of XDP services and providing a fake_skb may bring
> some potential side-effects to the desire of statelessness. Is that
> still a possibility?
> How do we guard against it?

To follow up on this, I'm also not sure how to tackle a "fake skb". If
I remember this
came up during the flow dissector series, and wasn't met with
enthusiasm. Granted,
replacing the skb argument to the lookup functions seems even harder, so maybe
this is the lesser evil?

>
> Thanks
> Nitin
>
> >



-- 
Lorenz Bauer  |  Systems Engineer
6th Floor, County Hall/The Riverside Building, SE1 7PB, UK

www.cloudflare.com

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: RFC: Fixing SK_REUSEPORT from sk_lookup_* helpers
  2019-05-21 15:47         ` Lorenz Bauer
@ 2019-05-21 21:39           ` Alexei Starovoitov
  0 siblings, 0 replies; 9+ messages in thread
From: Alexei Starovoitov @ 2019-05-21 21:39 UTC (permalink / raw)
  To: Lorenz Bauer
  Cc: Nitin Hande, Joe Stringer, Networking, bpf, Martin KaFai Lau,
	Daniel Borkmann, Eric Dumazet

On Tue, May 21, 2019 at 04:47:58PM +0100, Lorenz Bauer wrote:
> On Fri, 17 May 2019 at 00:38, Nitin Hande <nitin.hande@gmail.com> wrote:
> >
> > On Thu, May 16, 2019 at 2:57 PM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Thu, May 16, 2019 at 09:41:34AM +0100, Lorenz Bauer wrote:
> > > > On Wed, 15 May 2019 at 18:16, Joe Stringer <joe@isovalent.com> wrote:
> > > > >
> > > > > On Wed, May 15, 2019 at 8:11 AM Lorenz Bauer <lmb@cloudflare.com> wrote:
> > > > > >
> > > > > > In the BPF-based TPROXY session with Joe Stringer [1], I mentioned
> > > > > > that the sk_lookup_* helpers currently return inconsistent results if
> > > > > > SK_REUSEPORT programs are in play.
> > > > > >
> > > > > > SK_REUSEPORT programs are a hook point in inet_lookup. They get access
> > > > > > to the full packet
> > > > > > that triggered the look up. To support this, inet_lookup gained a new
> > > > > > skb argument to provide such context. If skb is NULL, the SK_REUSEPORT
> > > > > > program is skipped and instead the socket is selected by its hash.
> > > > > >
> > > > > > The first problem is that not all callers to inet_lookup from BPF have
> > > > > > an skb, e.g. XDP. This means that a look up from XDP gives an
> > > > > > incorrect result. For now that is not a huge problem. However, once we
> > > > > > get sk_assign as proposed by Joe, we can end up circumventing
> > > > > > SK_REUSEPORT.
> > > > >
> > > > > To clarify a bit, the reason this is a problem is that a
> > > > > straightforward implementation may just consider passing the skb
> > > > > context into the sk_lookup_*() and through to the inet_lookup() so
> > > > > that it would run the SK_REUSEPORT BPF program for socket selection on
> > > > > the skb when the packet-path BPF program performs the socket lookup.
> > > > > However, as this paragraph describes, the skb context is not always
> > > > > available.
> > > > >
> > > > > > At the conference, someone suggested using a similar approach to the
> > > > > > work done on the flow dissector by Stanislav: create a dedicated
> > > > > > context sk_reuseport which can either take an skb or a plain pointer.
> > > > > > Patch up load_bytes to deal with both. Pass the context to
> > > > > > inet_lookup.
> > > > > >
> > > > > > This is when we hit the second problem: using the skb or XDP context
> > > > > > directly is incorrect, because it assumes that the relevant protocol
> > > > > > headers are at the start of the buffer. In our use case, the correct
> > > > > > headers are at an offset since we're inspecting encapsulated packets.
> > > > > >
> > > > > > The best solution I've come up with is to steal 17 bits from the flags
> > > > > > argument to sk_lookup_*, 1 bit for BPF_F_HEADERS_AT_OFFSET, 16bit for
> > > > > > the offset itself.
> > > > >
> > > > > FYI there's also the upper 32 bits of the netns_id parameter, another
> > > > > option would be to steal 16 bits from there.
> > > >
> > > > Or len, which is only 16 bits realistically. The offset doesn't really fit into
> > > > either of them very well, using flags seemed the cleanest to me.
> > > > Is there some best practice around this?
> > > >
> > > > >
> > > > > > Thoughts?
> > > > >
> > > > > Internally with skbs, we use `skb_pull()` to manage header offsets,
> > > > > could we do something similar with `bpf_xdp_adjust_head()` prior to
> > > > > the call to `bpf_sk_lookup_*()`?
> > > >
> > > > That would only work if it retained the contents of the skipped
> > > > buffer, and if there
> > > > was a way to undo the adjustment later. We're doing the sk_lookup to
> > > > decide whether to
> > > > accept or forward the packet, so at the point of the call we might still need
> > > > that data. Is that feasible with skb / XDP ctx?
> > >
> > > While discussing the solution for reuseport I propose to use
> > > progs/test_select_reuseport_kern.c as an example of realistic program.
> > > It reads tcp/udp header directly via ctx->data or via bpf_skb_load_bytes()
> > > including payload after the header.
> > > It also uses bpf_skb_load_bytes_relative() to fetch IP.
> > > I think if we're fixing the sk_lookup from XDP the above program
> > > would need to work.
> > >
> > > And I think we can make it work by adding new requirement that
> > > 'struct bpf_sock_tuple *' argument to bpf_sk_lookup_* must be
> > > a pointer to the packet and not a pointer to bpf program stack.
> > > Then helper can construct a fake skb and assign
> > > fake_skb->data = &bpf_sock_tuple_arg.sport
> > > It can check that struct bpf_sock_tuple * pointer is within 100-ish bytes
> > > from xdp->data and within xdp->data_end
> > > This way the reuseport program's assumption that ctx->data points to tcp/udp
> > > will be preserved and it can access it all including payload.
> > >
> > > This approach doesn't need to mess with xdp_adjust_head and adjust uapi to pass length.
> > > Existing progs/test_sk_lookup_kern.c will magically start working with XDP
> > > even when reuseport prog is attached.
> > > Thoughts?
> >
> > I like this approach. A fake_skb approach will normalize the bpf_sk_lookup_*()
> > API peering into the kernel API between TC and XDP invocation. Just one question
> > that comes, I remember one of the comments I received during my XDP commit
> > was the stateless nature of XDP services and providing a fake_skb may bring
> > some potential side-effects to the desire of statelessness. Is that
> > still a possibility?
> > How do we guard against it?
> 
> To follow up on this, I'm also not sure how to tackle a "fake skb". If
> I remember this
> came up during the flow dissector series, and wasn't met with
> enthusiasm. Granted,
> replacing the skb argument to the lookup functions seems even harder, so maybe
> this is the lesser evil?

flow_dissector pretends to have 'skb' as bpf prog argument.
It doesn't allocate actual 'struct sk_buff' on the kernel side.
The verifier rewrites __sk_buff->foo accesses into
'struct bpf_flow_dissector'->bar access.
I was hoping similar idea can apply here.
BPF_PROG_TYPE_SK_REUSEPORT type takes 'struct sk_reuseport_md *'
so we don't need real skb there anyway.
But for older socket_filter based so_reuseport progs it's more difficult,
since at the verification time they're generic socket_filter progs.
Right now I don't have a better idea than to do static per_cpu struct sk_buff.


^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, back to index

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-05-15 15:11 RFC: Fixing SK_REUSEPORT from sk_lookup_* helpers Lorenz Bauer
2019-05-15 17:16 ` Joe Stringer
2019-05-16  8:41   ` Lorenz Bauer
2019-05-16 20:33     ` Alexei Starovoitov
2019-05-16 23:38       ` Nitin Hande
2019-05-21 15:47         ` Lorenz Bauer
2019-05-21 21:39           ` Alexei Starovoitov
2019-05-17 14:15       ` Lorenz Bauer
2019-05-19  1:20         ` Joe Stringer

BPF Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/bpf/0 bpf/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 bpf bpf/ https://lore.kernel.org/bpf \
		bpf@vger.kernel.org bpf@archiver.kernel.org
	public-inbox-index bpf


Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.bpf


AGPL code for this site: git clone https://public-inbox.org/ public-inbox