* [RFC] Socket termination for policy enforcement and load-balancing
@ 2022-08-31 16:37 Aditi Ghag
  2022-08-31 22:02 ` sdf
  2022-08-31 23:01 ` Martin KaFai Lau
  0 siblings, 2 replies; 10+ messages in thread
From: Aditi Ghag @ 2022-08-31 16:37 UTC (permalink / raw)
  To: netdev, bpf; +Cc: Daniel Borkmann

This is an RFC on forcefully terminating sockets with intent. We have two
prominent use cases in Cilium [1] where we need a way to identify and
forcefully terminate a set of sockets so that they can reconnect. Cilium
uses eBPF cgroup hooks for load-balancing: it translates a service VIP to
one of the service backend IP addresses at socket connect time for TCP
and connected UDP. Client applications are typically unaware that the
remote containers they are connected to have been deleted, and are left
hanging when the remotes go away (long-running UDP applications in
particular). For the policy enforcement use case, users may want to
enforce policies on the fly that redirect all client application traffic,
including established connections, to a subset of destinations.
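
For illustration, a minimal sketch of the connect-time translation
described above; the map layout and names are illustrative, not Cilium's
actual implementation:

  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>

  struct svc_key {
      __u32 vip;   /* network byte order */
      __u32 port;  /* network byte order, as in ctx->user_port */
  };

  struct svc_backend {
      __u32 ip;
      __u32 port;
  };

  struct {
      __uint(type, BPF_MAP_TYPE_HASH);
      __uint(max_entries, 1024);
      __type(key, struct svc_key);
      __type(value, struct svc_backend);
  } svc_map SEC(".maps");

  SEC("cgroup/connect4")
  int svc_connect4(struct bpf_sock_addr *ctx)
  {
      struct svc_key key = {
          .vip  = ctx->user_ip4,
          .port = ctx->user_port,
      };
      struct svc_backend *be = bpf_map_lookup_elem(&svc_map, &key);

      if (be) {
          /* Rewrite the destination before the connect() proceeds;
           * the client application still believes it talks to the VIP. */
          ctx->user_ip4  = be->ip;
          ctx->user_port = be->port;
      }
      return 1; /* allow the connect() */
  }

  char _license[] SEC("license") = "GPL";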

We evaluated the following ways to identify and forcefully terminate
sockets (illustrative sketches of each approach follow the list):

- The sock_destroy API, added for similar Android use cases, is
effective in tearing down sockets. The API is behind the
CONFIG_INET_DIAG_DESTROY config, which is disabled by default, and is
currently exposed to userspace via the SOCK_DIAG netlink infrastructure.
The sock destroy handlers for the TCP and UDP protocols report an
ECONNABORTED error on the affected sockets, per the abort semantics
described in RFC 793.

- Add unreachable routes for deleted backends. I experimented with this
approach with my colleague Nikolay Aleksandrov. We found that TCP and
connected UDP sockets in the established state simply ignore the ICMP
error messages and continue to send data in the presence of such routes.
My read is that applications are ignoring the ICMP errors reported on
their sockets [2].

- Use the BPF (sockets) iterator to identify sockets connected to a
deleted backend. The BPF (sockets) iterator is network namespace aware,
so we would either need to enter every possible container network
namespace to identify the affected connections, or adapt the iterator to
drop its netns checks [3]. This was discussed with my colleague Daniel
Borkmann based on the feedback he shared from the LSFMMBPF conference
discussions.

- Use the INET_DIAG infrastructure to filter and destroy sockets
connected to stale backends. This approach involves first sending a
netlink message of type SOCK_DIAG_BY_FAMILY to filter sockets connected
to a destination IP address/port, and then using the query results to
send another message of type SOCK_DESTROY that actually destroys the
sockets. The SOCK_DIAG infrastructure, like the BPF iterators, is
network namespace aware.
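
To make the first and fourth approaches concrete, a minimal userspace
sketch of the SOCK_DESTROY message; the preceding SOCK_DIAG_BY_FAMILY
query that finds the 4-tuples and all error handling are omitted, and it
assumes CONFIG_INET_DIAG_DESTROY=y plus CAP_NET_ADMIN:

  #include <unistd.h>
  #include <sys/socket.h>
  #include <netinet/in.h>
  #include <linux/types.h>
  #include <linux/netlink.h>
  #include <linux/sock_diag.h>
  #include <linux/inet_diag.h>

  /* Ask the kernel to abort the TCP socket matching a 4-tuple; all
   * addresses and ports are expected in network byte order. */
  static int destroy_sock(__be32 src, __be16 sport, __be32 dst, __be16 dport)
  {
      struct {
          struct nlmsghdr nlh;
          struct inet_diag_req_v2 req;
      } msg = {
          .nlh = {
              .nlmsg_len   = sizeof(msg),
              .nlmsg_type  = SOCK_DESTROY,
              .nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK,
          },
          .req = {
              .sdiag_family   = AF_INET,
              .sdiag_protocol = IPPROTO_TCP,
              .id = {
                  .idiag_sport  = sport,
                  .idiag_dport  = dport,
                  .idiag_src    = { src },
                  .idiag_dst    = { dst },
                  .idiag_cookie = { INET_DIAG_NOCOOKIE,
                                    INET_DIAG_NOCOOKIE },
              },
          },
      };
      int ret, fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_SOCK_DIAG);

      if (fd < 0)
          return -1;
      /* A real version would also read back the netlink ACK/error. */
      ret = send(fd, &msg, sizeof(msg), 0) < 0 ? -1 : 0;
      close(fd);
      return ret;
  }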
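
For the second approach, the ICMP failure is typically only left as a
pending (soft) error on the socket; a sketch of the check an application
would have to make, and usually doesn't:

  #include <errno.h>
  #include <stdio.h>
  #include <string.h>
  #include <sys/socket.h>

  /* Surface any soft error (e.g. from an ICMP unreachable) pending on a
   * connected socket.  Reading SO_ERROR also clears the pending error. */
  static void check_soft_error(int fd)
  {
      int err = 0;
      socklen_t len = sizeof(err);

      if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) == 0 && err)
          fprintf(stderr, "pending socket error: %s\n", strerror(err));
  }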
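
And for the third approach, a sketch of a bpf_iter program that
identifies TCP sockets connected to a given backend; today it can only
report them, and it only sees the netns it was opened in:

  #include "vmlinux.h"
  #include <bpf/bpf_helpers.h>
  #include <bpf/bpf_tracing.h>

  const volatile __u32 backend_ip;   /* network byte order, set by loader */
  const volatile __u16 backend_port; /* network byte order */

  SEC("iter/tcp")
  int find_backend_socks(struct bpf_iter__tcp *ctx)
  {
      struct sock_common *skc = ctx->sk_common;

      if (!skc)
          return 0;
      if (skc->skc_daddr != backend_ip || skc->skc_dport != backend_port)
          return 0;
      /* Destroying the socket from here is the missing piece this RFC
       * is about; today the program can only report the match. */
      BPF_SEQ_PRINTF(ctx->meta->seq, "match: local port %u\n",
                     (__u32)skc->skc_num);
      return 0;
  }

  char _license[] SEC("license") = "GPL";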

We are currently leaning towards invoking the sock_destroy API directly
from BPF programs. This gives us an effective mechanism without having
to enter every possible container network namespace on a node, while
relying on the CONFIG_INET_DIAG_DESTROY config with the right
permissions. BPF programs attached to cgroup hooks can store the client
sockets connected to a backend, and invoke the destroy API when backends
are deleted (a sketch of the bookkeeping half follows).
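
For illustration, a sketch of that bookkeeping using only existing
helpers; the destroy step appears as a comment since the helper is what
this RFC proposes, and the single-cookie map value is a simplification
(a real version must track many clients per backend):

  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>

  struct backend {
      __u32 ip;   /* network byte order */
      __u32 port;
  };

  struct {
      __uint(type, BPF_MAP_TYPE_HASH);
      __uint(max_entries, 65536);
      __type(key, struct backend);
      __type(value, __u64); /* cookie of one connected client socket */
  } clients SEC(".maps");

  SEC("cgroup/connect4")
  int record_client(struct bpf_sock_addr *ctx)
  {
      struct backend be = {
          .ip   = ctx->user_ip4,
          .port = ctx->user_port,
      };
      __u64 cookie = bpf_get_socket_cookie(ctx);

      bpf_map_update_elem(&clients, &be, &cookie, BPF_ANY);
      /* When the backend is deleted, the proposed destroy helper would
       * be invoked on the recorded sockets. */
      return 1;
  }

  char _license[] SEC("license") = "GPL";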

To that end, I'm in the process of adding a new BPF helper for the
sock_destroy kernel functionality, similar to the sock_diag_destroy
function [4], and am soliciting early feedback on the evaluated and
selected approaches. Happy to share more context.

[1] https://github.com/cilium/cilium
[2] https://github.com/torvalds/linux/blob/master/net/ipv4/tcp_ipv4.c#L464
[3] https://github.com/torvalds/linux/blob/master/net/ipv4/udp.c#L3011
[4] https://github.com/torvalds/linux/blob/master/net/core/sock_diag.c#L298


* Re: [RFC] Socket termination for policy enforcement and load-balancing
  2022-08-31 16:37 [RFC] Socket termination for policy enforcement and load-balancing Aditi Ghag
@ 2022-08-31 22:02 ` sdf
  2022-09-04 17:41   ` Aditi Ghag
  2022-08-31 23:01 ` Martin KaFai Lau
  1 sibling, 1 reply; 10+ messages in thread
From: sdf @ 2022-08-31 22:02 UTC (permalink / raw)
  To: Aditi Ghag; +Cc: netdev, bpf, Daniel Borkmann

On 08/31, Aditi Ghag wrote:
> This is an RFC on forcefully terminating sockets with intent. We have two
> prominent use cases in Cilium [1] where we need a way to identify and
> forcefully terminate a set of sockets so that they can reconnect. Cilium
> uses eBPF cgroup hooks for load-balancing: it translates a service VIP to
> one of the service backend IP addresses at socket connect time for TCP
> and connected UDP. Client applications are typically unaware that the
> remote containers they are connected to have been deleted, and are left
> hanging when the remotes go away (long-running UDP applications in
> particular). For the policy enforcement use case, users may want to
> enforce policies on the fly that redirect all client application traffic,
> including established connections, to a subset of destinations.

> We evaluated the following ways to identify and forcefully terminate
> sockets:

> - The sock_destroy API, added for similar Android use cases, is
> effective in tearing down sockets. The API is behind the
> CONFIG_INET_DIAG_DESTROY config, which is disabled by default, and is
> currently exposed to userspace via the SOCK_DIAG netlink infrastructure.
> The sock destroy handlers for the TCP and UDP protocols report an
> ECONNABORTED error on the affected sockets, per the abort semantics
> described in RFC 793.

> - Add unreachable routes for deleted backends. I experimented with this
> approach with my colleague Nikolay Aleksandrov. We found that TCP and
> connected UDP sockets in the established state simply ignore the ICMP
> error messages and continue to send data in the presence of such routes.
> My read is that applications are ignoring the ICMP errors reported on
> their sockets [2].

[..]

> - Use the BPF (sockets) iterator to identify sockets connected to a
> deleted backend. The BPF (sockets) iterator is network namespace aware,
> so we would either need to enter every possible container network
> namespace to identify the affected connections, or adapt the iterator to
> drop its netns checks [3]. This was discussed with my colleague Daniel
> Borkmann based on the feedback he shared from the LSFMMBPF conference
> discussions.

Maybe something worth fixing as well even if you end up using netlink?
Having to manually go over all networking namespaces (if I want
to iterate over all sockets on the host) doesn't seem feasible?

> - Use the INET_DIAG infrastructure to filter and destroy sockets
> connected to stale backends. This approach involves first sending a
> netlink message of type SOCK_DIAG_BY_FAMILY to filter sockets connected
> to a destination IP address/port, and then using the query results to
> send another message of type SOCK_DESTROY that actually destroys the
> sockets. The SOCK_DIAG infrastructure, like the BPF iterators, is
> network namespace aware.

> We are currently leaning towards invoking the sock_destroy API directly
> from BPF programs. This gives us an effective mechanism without having
> to enter every possible container network namespace on a node, while
> relying on the CONFIG_INET_DIAG_DESTROY config with the right
> permissions. BPF programs attached to cgroup hooks can store the client
> sockets connected to a backend, and invoke the destroy API when backends
> are deleted.

> To that end, I'm in the process of adding a new BPF helper for the
> sock_destroy kernel functionality, similar to the sock_diag_destroy
> function [4], and am soliciting early feedback on the evaluated and
> selected approaches. Happy to share more context.

> [1] https://github.com/cilium/cilium
> [2] https://github.com/torvalds/linux/blob/master/net/ipv4/tcp_ipv4.c#L464
> [3] https://github.com/torvalds/linux/blob/master/net/ipv4/udp.c#L3011
> [4] https://github.com/torvalds/linux/blob/master/net/core/sock_diag.c#L298


* Re: [RFC] Socket termination for policy enforcement and load-balancing
  2022-08-31 16:37 [RFC] Socket termination for policy enforcement and load-balancing Aditi Ghag
  2022-08-31 22:02 ` sdf
@ 2022-08-31 23:01 ` Martin KaFai Lau
  2022-08-31 23:43   ` Kuniyuki Iwashima
  2022-09-04 18:14   ` Aditi Ghag
  1 sibling, 2 replies; 10+ messages in thread
From: Martin KaFai Lau @ 2022-08-31 23:01 UTC (permalink / raw)
  To: Aditi Ghag; +Cc: netdev, bpf, Daniel Borkmann, Yonghong Song, Kuniyuki Iwashima

On Wed, Aug 31, 2022 at 09:37:41AM -0700, Aditi Ghag wrote:
> - Use the BPF (sockets) iterator to identify sockets connected to a
> deleted backend. The BPF (sockets) iterator is network namespace aware,
> so we would either need to enter every possible container network
> namespace to identify the affected connections, or adapt the iterator to
> drop its netns checks [3]. This was discussed with my colleague Daniel
> Borkmann based on the feedback he shared from the LSFMMBPF conference
> discussions.
Being able to iterate all sockets across different netns will
be useful.

It should be doable to ignore the netns check.  For udp, a quick
thought is to have another iter target. eg. "udp_all_netns".
From the sk, the bpf prog should be able to learn the netns and
the bpf prog can filter the netns by itself.
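
A sketch of what that in-program filtering could look like, assuming the
hypothetical all-netns "udp" target existed (with today's per-netns
iter/udp the check below is redundant); names are illustrative:

  #include "vmlinux.h"
  #include <bpf/bpf_helpers.h>
  #include <bpf/bpf_tracing.h>

  const volatile __u32 target_netns_inum; /* set by the loader */

  SEC("iter/udp")
  int filter_netns(struct bpf_iter__udp *ctx)
  {
      struct udp_sock *udp = ctx->udp_sk;

      if (!udp)
          return 0;
      /* Compare the inode of the netns this socket belongs to against
       * the one we care about, and skip everything else. */
      if (udp->inet.sk.__sk_common.skc_net.net->ns.inum != target_netns_inum)
          return 0;
      BPF_SEQ_PRINTF(ctx->meta->seq, "sk in target netns\n");
      return 0;
  }

  char _license[] SEC("license") = "GPL";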

The TCP side is going to have an 'optional' per netns ehash table [0] soon,
not lhash2 (listening hash) though.  Ideally, the same bpf
all-netns iter interface should work similarly for both udp and
tcp case.  Thus, both should be considered and work at the same time.

For udp, something more useful than plain udp_abort() could potentially
be done.  eg. directly connect to another backend (by bpf kfunc?).
There may be some details in socket locking...etc but should
be doable and the bpf-iter program could be sleepable also.
fwiw, we are iterating the tcp socket to retire some older
bpf-tcp-cc (congestion control) on the long-lived connections
by bpf_setsockopt(TCP_CONGESTION).
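
For reference, a minimal sketch of that cc-retirement pattern, loosely
modeled on the kernel selftests; the congestion control names are
placeholders:

  #include "vmlinux.h"
  #include <bpf/bpf_helpers.h>

  #define SOL_TCP        6   /* UAPI defines are not in vmlinux.h */
  #define TCP_CONGESTION 13

  static const char old_cc[] = "old_bpf_cc"; /* placeholder cc names */
  static const char new_cc[] = "cubic";

  SEC("iter/tcp")
  int retire_old_cc(struct bpf_iter__tcp *ctx)
  {
      struct sock_common *skc = ctx->sk_common;
      char cur[16]; /* TCP_CA_NAME_MAX */
      struct tcp_sock *tp;
      int i;

      if (!skc)
          return 0;
      tp = bpf_skc_to_tcp_sock(skc);
      if (!tp)
          return 0;
      if (bpf_getsockopt(tp, SOL_TCP, TCP_CONGESTION, cur, sizeof(cur)))
          return 0;
      for (i = 0; i < sizeof(old_cc); i++)
          if (cur[i] != old_cc[i])
              return 0; /* some other cc; leave the socket alone */
      bpf_setsockopt(tp, SOL_TCP, TCP_CONGESTION,
                     (void *)new_cc, sizeof(new_cc));
      return 0;
  }

  char _license[] SEC("license") = "GPL";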

Also, potentially, instead of iterating all,
a more selective case can be done by
bpf_prog_test_run()+bpf_sk_lookup_*()+udp_abort().

[0]: https://lore.kernel.org/netdev/20220830191518.77083-1-kuniyu@amazon.com/


* Re: [RFC] Socket termination for policy enforcement and load-balancing
  2022-08-31 23:01 ` Martin KaFai Lau
@ 2022-08-31 23:43   ` Kuniyuki Iwashima
  2022-09-04 18:14   ` Aditi Ghag
  1 sibling, 0 replies; 10+ messages in thread
From: Kuniyuki Iwashima @ 2022-08-31 23:43 UTC (permalink / raw)
  To: kafai; +Cc: aditivghag, bpf, daniel, kuniyu, netdev, yhs

Thanks for CCing, Martin.

Date:   Wed, 31 Aug 2022 16:01:57 -0700
From:   Martin KaFai Lau <kafai@fb.com>
> On Wed, Aug 31, 2022 at 09:37:41AM -0700, Aditi Ghag wrote:
> > - Use the BPF (sockets) iterator to identify sockets connected to a
> > deleted backend. The BPF (sockets) iterator is network namespace aware,
> > so we would either need to enter every possible container network
> > namespace to identify the affected connections, or adapt the iterator to
> > drop its netns checks [3]. This was discussed with my colleague Daniel
> > Borkmann based on the feedback he shared from the LSFMMBPF conference
> > discussions.
> Being able to iterate all sockets across different netns will
> be useful.
> 
> It should be doable to ignore the netns check.  For udp, a quick
> thought is to have another iter target. eg. "udp_all_netns".
> From the sk, the bpf prog should be able to learn the netns and
> the bpf prog can filter the netns by itself.
> 
> The TCP side is going to have an 'optional' per netns ehash table [0] soon,
> not lhash2 (listening hash) though.  Ideally, the same bpf
> all-netns iter interface should work similarly for both udp and
> tcp case.  Thus, both should be considered and work at the same time.

I'm going to add optional hash tables for UDP as well.  The first series [1]
had both TCP and UDP stuff and was split, and the UDP part is pending for now.

So, if both series were merged, the TCP/UDP all-netns iter would have
similar logic.

[1]: https://lore.kernel.org/netdev/20220826000445.46552-14-kuniyu@amazon.com/


> 
> For udp, something more useful than plain udp_abort() could potentially
> be done.  eg. directly connect to another backend (by bpf kfunc?).
> There may be some details in socket locking...etc but should
> be doable and the bpf-iter program could be sleepable also.
> fwiw, we are iterating the tcp socket to retire some older
> bpf-tcp-cc (congestion control) on the long-lived connections
> by bpf_setsockopt(TCP_CONGESTION).
> 
> Also, potentially, instead of iterating all,
> a more selective case can be done by
> bpf_prog_test_run()+bpf_sk_lookup_*()+udp_abort().
> 
> [0]: https://lore.kernel.org/netdev/20220830191518.77083-1-kuniyu@amazon.com/


* Re: [RFC] Socket termination for policy enforcement and load-balancing
  2022-08-31 22:02 ` sdf
@ 2022-09-04 17:41   ` Aditi Ghag
  2022-09-06 16:28     ` Stanislav Fomichev
  0 siblings, 1 reply; 10+ messages in thread
From: Aditi Ghag @ 2022-09-04 17:41 UTC (permalink / raw)
  To: sdf; +Cc: netdev, bpf, Daniel Borkmann

On Wed, Aug 31, 2022 at 3:02 PM <sdf@google.com> wrote:
>
> On 08/31, Aditi Ghag wrote:
> [...]
>
> > - The sock_destroy API, added for similar Android use cases, is
> > effective in tearing down sockets. The API is behind the
> > CONFIG_INET_DIAG_DESTROY config, which is disabled by default, and is
> > currently exposed to userspace via the SOCK_DIAG netlink infrastructure.
> > The sock destroy handlers for the TCP and UDP protocols report an
> > ECONNABORTED error on the affected sockets, per the abort semantics
> > described in RFC 793.
>
> > - Add unreachable routes for deleted backends. I experimented with this
> > approach with my colleague Nikolay Aleksandrov. We found that TCP and
> > connected UDP sockets in the established state simply ignore the ICMP
> > error messages and continue to send data in the presence of such routes.
> > My read is that applications are ignoring the ICMP errors reported on
> > their sockets [2].
>
> [..]
>
> > - Use the BPF (sockets) iterator to identify sockets connected to a
> > deleted backend. The BPF (sockets) iterator is network namespace aware,
> > so we would either need to enter every possible container network
> > namespace to identify the affected connections, or adapt the iterator to
> > drop its netns checks [3]. This was discussed with my colleague Daniel
> > Borkmann based on the feedback he shared from the LSFMMBPF conference
> > discussions.
>
> Maybe something worth fixing as well even if you end up using netlink?
> Having to manually go over all networking namespaces (if I want
> to iterate over all sockets on the host) doesn't seem feasible?

The SOCK_DIAG netlink infrastructure also has similar netns checks. The
iterator approach would allow us to invoke the sock destroy handlers
from BPF, though.

>
> [...]
>
> > [1] https://github.com/cilium/cilium
> > [2] https://github.com/torvalds/linux/blob/master/net/ipv4/tcp_ipv4.c#L464
> > [3] https://github.com/torvalds/linux/blob/master/net/ipv4/udp.c#L3011
> > [4]
> > https://github.com/torvalds/linux/blob/master/net/core/sock_diag.c#L298


* Re: [RFC] Socket termination for policy enforcement and load-balancing
  2022-08-31 23:01 ` Martin KaFai Lau
  2022-08-31 23:43   ` Kuniyuki Iwashima
@ 2022-09-04 18:14   ` Aditi Ghag
  2022-09-04 21:24     ` Kumar Kartikeya Dwivedi
  1 sibling, 1 reply; 10+ messages in thread
From: Aditi Ghag @ 2022-09-04 18:14 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: netdev, bpf, Daniel Borkmann, Yonghong Song, Kuniyuki Iwashima

On Wed, Aug 31, 2022 at 4:02 PM Martin KaFai Lau <kafai@fb.com> wrote:
>
> On Wed, Aug 31, 2022 at 09:37:41AM -0700, Aditi Ghag wrote:
> > - Use the BPF (sockets) iterator to identify sockets connected to a
> > deleted backend. The BPF (sockets) iterator is network namespace aware,
> > so we would either need to enter every possible container network
> > namespace to identify the affected connections, or adapt the iterator to
> > drop its netns checks [3]. This was discussed with my colleague Daniel
> > Borkmann based on the feedback he shared from the LSFMMBPF conference
> > discussions.
> Being able to iterate all sockets across different netns will
> be useful.
>
> It should be doable to ignore the netns check.  For udp, a quick
> thought is to have another iter target. eg. "udp_all_netns".
> From the sk, the bpf prog should be able to learn the netns and
> the bpf prog can filter the netns by itself.
>
> The TCP side is going to have an 'optional' per netns ehash table [0] soon,
> not lhash2 (listening hash) though.  Ideally, the same bpf
> all-netns iter interface should work similarly for both udp and
> tcp case.  Thus, both should be considered and work at the same time.
>
> For udp, something more useful than plain udp_abort() could potentially
> be done.  eg. directly connect to another backend (by bpf kfunc?).
> There may be some details in socket locking...etc but should
> be doable and the bpf-iter program could be sleepable also.

This won't be effective for connected udp though, will it? Interesting thought
around using bpf kfunc!

> fwiw, we are iterating the tcp socket to retire some older
> bpf-tcp-cc (congestion control) on the long-lived connections
> by bpf_setsockopt(TCP_CONGESTION).
>
> Also, potentially, instead of iterating all,
> a more selective case can be done by
> bpf_prog_test_run()+bpf_sk_lookup_*()+udp_abort().

Can you elaborate more on the more selective iterator approach?

On a similar note, are there better alternatives to the sockets
iterator approach? Since we have BPF programs executed on cgroup BPF
hooks (e.g., connect), we already know which client sockets are
connected to a backend. Can we somehow store these socket pointers in a
regular BPF map, and when a backend is deleted, use a regular map
iterator to invoke sock_destroy() for these sockets? Does anyone have
experience using the "typed pointer support in BPF maps" APIs [0]?

[0] https://lwn.net/ml/bpf/20220424214901.2743946-1-memxor@gmail.com/


* Re: [RFC] Socket termination for policy enforcement and load-balancing
  2022-09-04 18:14   ` Aditi Ghag
@ 2022-09-04 21:24     ` Kumar Kartikeya Dwivedi
  2022-09-08  2:26       ` Martin KaFai Lau
  0 siblings, 1 reply; 10+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-09-04 21:24 UTC (permalink / raw)
  To: Aditi Ghag
  Cc: Martin KaFai Lau, netdev, bpf, Daniel Borkmann, Yonghong Song,
	Kuniyuki Iwashima

On Sun, 4 Sept 2022 at 20:55, Aditi Ghag <aditivghag@gmail.com> wrote:
>
> On Wed, Aug 31, 2022 at 4:02 PM Martin KaFai Lau <kafai@fb.com> wrote:
> >
> > On Wed, Aug 31, 2022 at 09:37:41AM -0700, Aditi Ghag wrote:
> > > - Use the BPF (sockets) iterator to identify sockets connected to a
> > > deleted backend. The BPF (sockets) iterator is network namespace aware,
> > > so we would either need to enter every possible container network
> > > namespace to identify the affected connections, or adapt the iterator to
> > > drop its netns checks [3]. This was discussed with my colleague Daniel
> > > Borkmann based on the feedback he shared from the LSFMMBPF conference
> > > discussions.
> > Being able to iterate all sockets across different netns will
> > be useful.
> >
> > It should be doable to ignore the netns check.  For udp, a quick
> > thought is to have another iter target. eg. "udp_all_netns".
> > From the sk, the bpf prog should be able to learn the netns and
> > the bpf prog can filter the netns by itself.
> >
> > The TCP side is going to have an 'optional' per netns ehash table [0] soon,
> > not lhash2 (listening hash) though.  Ideally, the same bpf
> > all-netns iter interface should work similarly for both udp and
> > tcp case.  Thus, both should be considered and work at the same time.
> >
> > For udp, something more useful than plain udp_abort() could potentially
> > be done.  eg. directly connect to another backend (by bpf kfunc?).
> > There may be some details in socket locking...etc but should
> > be doable and the bpf-iter program could be sleepable also.
>
> This won't be effective for connected udp though, will it? Interesting thought
> around using bpf kfunc!
>
> > fwiw, we are iterating the tcp socket to retire some older
> > bpf-tcp-cc (congestion control) on the long-lived connections
> > by bpf_setsockopt(TCP_CONGESTION).
> >
> > Also, potentially, instead of iterating all,
> > a more selective case can be done by
> > bpf_prog_test_run()+bpf_sk_lookup_*()+udp_abort().
>
> Can you elaborate more on the more selective iterator approach?
>
> On a similar note, are there better alternatives to the sockets
> iterator approach? Since we have BPF programs executed on cgroup BPF
> hooks (e.g., connect), we already know which client sockets are
> connected to a backend. Can we somehow store these socket pointers in a
> regular BPF map, and when a backend is deleted, use a regular map
> iterator to invoke sock_destroy() for these sockets? Does anyone have
> experience using the "typed pointer support in BPF maps" APIs [0]?

I am not very familiar with how socket lifetime is managed; it may not
be possible if the lifetime is managed by RCU only, or due to other
limitations. Martin will probably be able to comment more on that.

Apart from that, from the BPF side, the referenced kptr won't work out
of the box; you will need to add support for each type you want to
support.

But the way you're describing it should work well. Ideally you would
inc a ref and move it into the map from the hook program, then just xchg
the sk to destroy out of the map value during iteration and pass it to
the sock_destroy helper to release it (instead of sk_release).

The first task for this will be teaching kptr_xchg to work with
non-PTR_TO_BTF_ID arguments. You can reuse the way reg2btf_ids in
kernel/bpf/btf.c translates socket types to in-kernel PTR_TO_BTF_ID for
kfuncs. Socket registers will usually be PTR_TO_SOCKET, PTR_TO_TCP_SOCK,
etc.; they can be mapped using that table to the BTF id of the in-kernel
type they shadow.

From there, it will be about writing and registering the right dtor for
the socket type, one that works in all the contexts the map
implementations call it from, and probably also restricting kptr_xchg
for sockets to certain known contexts to make life easier (a
hypothetical sketch of the flow follows).
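
A purely hypothetical sketch of that flow, assuming socket kptrs and a
destroy kfunc existed (neither does at the time of this thread, which is
exactly the work being described); bpf_tcp_sock_destroy and the
__kptr_ref usage on sockets are invented for illustration:

  #include "vmlinux.h"
  #include <bpf/bpf_helpers.h>

  /* matches the annotation used by the series linked at [0] */
  #define __kptr_ref __attribute__((btf_type_tag("kptr_ref")))

  struct map_val {
      struct tcp_sock __kptr_ref *sk; /* hypothetical: sockets are not
                                       * an allowed kptr type today */
  };

  struct {
      __uint(type, BPF_MAP_TYPE_HASH);
      __uint(max_entries, 1024);
      __type(key, __u64);           /* e.g. a backend identifier */
      __type(value, struct map_val);
  } stash SEC(".maps");

  /* hypothetical destroy kfunc that consumes the reference */
  extern void bpf_tcp_sock_destroy(struct tcp_sock *tp) __ksym;

  struct cmd { __u64 backend_id; };

  SEC("syscall") /* run on demand via BPF_PROG_TEST_RUN */
  int destroy_stashed(struct cmd *ctx)
  {
      struct map_val *v = bpf_map_lookup_elem(&stash, &ctx->backend_id);
      struct tcp_sock *tp;

      if (!v)
          return 0;
      tp = bpf_kptr_xchg(&v->sk, NULL); /* take ownership of the ref */
      if (tp)
          bpf_tcp_sock_destroy(tp);     /* releases the reference */
      return 0;
  }

  char _license[] SEC("license") = "GPL";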

>
> [0] https://lwn.net/ml/bpf/20220424214901.2743946-1-memxor@gmail.com/


* Re: [RFC] Socket termination for policy enforcement and load-balancing
  2022-09-04 17:41   ` Aditi Ghag
@ 2022-09-06 16:28     ` Stanislav Fomichev
  0 siblings, 0 replies; 10+ messages in thread
From: Stanislav Fomichev @ 2022-09-06 16:28 UTC (permalink / raw)
  To: Aditi Ghag; +Cc: netdev, bpf, Daniel Borkmann

On Sun, Sep 4, 2022 at 10:41 AM Aditi Ghag <aditivghag@gmail.com> wrote:
>
> On Wed, Aug 31, 2022 at 3:02 PM <sdf@google.com> wrote:
> >
> > On 08/31, Aditi Ghag wrote:
> > [...]
> >
> > > - The sock_destroy API, added for similar Android use cases, is
> > > effective in tearing down sockets. The API is behind the
> > > CONFIG_INET_DIAG_DESTROY config, which is disabled by default, and is
> > > currently exposed to userspace via the SOCK_DIAG netlink infrastructure.
> > > The sock destroy handlers for the TCP and UDP protocols report an
> > > ECONNABORTED error on the affected sockets, per the abort semantics
> > > described in RFC 793.
> >
> > > - Add unreachable routes for deleted backends. I experimented with this
> > > approach with my colleague Nikolay Aleksandrov. We found that TCP and
> > > connected UDP sockets in the established state simply ignore the ICMP
> > > error messages and continue to send data in the presence of such routes.
> > > My read is that applications are ignoring the ICMP errors reported on
> > > their sockets [2].
> >
> > [..]
> >
> > > - Use the BPF (sockets) iterator to identify sockets connected to a
> > > deleted backend. The BPF (sockets) iterator is network namespace aware,
> > > so we would either need to enter every possible container network
> > > namespace to identify the affected connections, or adapt the iterator to
> > > drop its netns checks [3]. This was discussed with my colleague Daniel
> > > Borkmann based on the feedback he shared from the LSFMMBPF conference
> > > discussions.
> >
> > Maybe something worth fixing as well even if you end up using netlink?
> > Having to manually go over all networking namespaces (if I want
> > to iterate over all sockets on the host) doesn't seem feasible?
>
> The SOCK_DIAG netlink infrastructure also has similar netns checks. The
> iterator approach would allow us to invoke the sock destroy handlers
> from BPF, though.

Sorry, I think I wasn't clear enough.
I meant that having a mode to iterate over all sockets on the host
regardless of the namespace might be useful.
Martin suggests the same in [1]; I'll follow that thread :-)

[1] https://lore.kernel.org/bpf/20220831230157.7lchomcdxmvq3qqw@kafai-mbp.dhcp.thefacebook.com


* Re: [RFC] Socket termination for policy enforcement and load-balancing
  2022-09-04 21:24     ` Kumar Kartikeya Dwivedi
@ 2022-09-08  2:26       ` Martin KaFai Lau
  2022-09-12 10:01         ` Aditi Ghag
  0 siblings, 1 reply; 10+ messages in thread
From: Martin KaFai Lau @ 2022-09-08  2:26 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi, Aditi Ghag
  Cc: Martin KaFai Lau, netdev, bpf, Daniel Borkmann, Yonghong Song,
	Kuniyuki Iwashima

On 9/4/22 2:24 PM, Kumar Kartikeya Dwivedi wrote:
> On Sun, 4 Sept 2022 at 20:55, Aditi Ghag <aditivghag@gmail.com> wrote:
>>
>> On Wed, Aug 31, 2022 at 4:02 PM Martin KaFai Lau <kafai@fb.com> wrote:
>>>
>>> On Wed, Aug 31, 2022 at 09:37:41AM -0700, Aditi Ghag wrote:
>>>> - Use the BPF (sockets) iterator to identify sockets connected to a
>>>> deleted backend. The BPF (sockets) iterator is network namespace aware,
>>>> so we would either need to enter every possible container network
>>>> namespace to identify the affected connections, or adapt the iterator to
>>>> drop its netns checks [3]. This was discussed with my colleague Daniel
>>>> Borkmann based on the feedback he shared from the LSFMMBPF conference
>>>> discussions.
>>> Being able to iterate all sockets across different netns will
>>> be useful.
>>>
>>> It should be doable to ignore the netns check.  For udp, a quick
>>> thought is to have another iter target. eg. "udp_all_netns".
>>> From the sk, the bpf prog should be able to learn the netns and
>>> the bpf prog can filter the netns by itself.
>>>
>>> The TCP side is going to have an 'optional' per netns ehash table [0] soon,
>>> not lhash2 (listening hash) though.  Ideally, the same bpf
>>> all-netns iter interface should work similarly for both udp and
>>> tcp case.  Thus, both should be considered and work at the same time.
>>>
>>> For udp, something more useful than plain udp_abort() could potentially
>>> be done.  eg. directly connect to another backend (by bpf kfunc?).
>>> There may be some details in socket locking...etc but should
>>> be doable and the bpf-iter program could be sleepable also.
>>
>> This won't be effective for connected udp though, will it? Interesting thought
>> around using bpf kfunc!
hmm... why won't the bpf-prog doing the udp re-connect() be effective?
I suspect we are talking about different things.

Regardless, for tcp, I think userspace needs to handle the tcp
aborted-error by redoing the connect().  Thus, let's stay with
{tcp,udp}_abort() for now.  Try to expose {tcp,udp}_abort() as a kfunc
instead of a new bpf_helper.

>>
>>> fwiw, we are iterating the tcp socket to retire some older
>>> bpf-tcp-cc (congestion control) on the long-lived connections
>>> by bpf_setsockopt(TCP_CONGESTION).
>>>
>>> Also, potentially, instead of iterating all,
>>> a more selective case can be done by
>>> bpf_prog_test_run()+bpf_sk_lookup_*()+udp_abort().
>>
>> Can you elaborate more on the more selective iterator approach?
If the 4-tuple (src/dst ip/port) is known, bpf_sk_lookup_*() can look up
a sk from the tcp_hashinfo or udp_table.  bpf_sk_lookup_*() also takes a
netns_id argument.  However, yeah, it still comes back to the need to
get all netns, so it may not work well in the RFC case here (a sketch
follows).
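
For illustration, the lookup half of that idea as a tc program runnable
via BPF_PROG_TEST_RUN; destroying the found socket would still require
the helper/kfunc discussed in this thread, so this sketch only releases
the reference:

  #include <linux/bpf.h>
  #include <linux/pkt_cls.h>
  #include <bpf/bpf_helpers.h>

  const volatile __u32 saddr, daddr; /* 4-tuple in network byte order, */
  const volatile __u16 sport, dport; /* filled in by the loader        */

  SEC("tc")
  int lookup_one(struct __sk_buff *skb)
  {
      struct bpf_sock_tuple tuple = {
          .ipv4 = {
              .saddr = saddr,
              .daddr = daddr,
              .sport = sport,
              .dport = dport,
          },
      };
      struct bpf_sock *sk;

      /* BPF_F_CURRENT_NETNS: look up in the caller's netns only */
      sk = bpf_sk_lookup_tcp(skb, &tuple, sizeof(tuple.ipv4),
                             BPF_F_CURRENT_NETNS, 0);
      if (sk)
          /* found it; a destroy helper/kfunc call would go here */
          bpf_sk_release(sk);
      return TC_ACT_OK;
  }

  char _license[] SEC("license") = "GPL";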

>>
>> On a similar note, are there better alternatives to the sockets
>> iterator approach? Since we have BPF programs executed on cgroup BPF
>> hooks (e.g., connect), we already know which client sockets are
>> connected to a backend. Can we somehow store these socket pointers in a
>> regular BPF map, and when a backend is deleted, use a regular map
>> iterator to invoke sock_destroy() for these sockets? Does anyone have
>> experience using the "typed pointer support in BPF maps" APIs [0]?
> 
> I am not very familiar with how socket lifetime is managed; it may not
> be possible if the lifetime is managed by RCU only, or due to other
> limitations. Martin will probably be able to comment more on that.
sk is the usual refcnt+rcu_reader pattern.  afaik, the use case here is
that the sk should be removed from the map when there is a tcp_close()
or udp_lib_close().  There are sock_map and sock_hash to store a sk as
the map value.  iirc the sk will be automatically removed from the map
during tcp_close() and udp_lib_close().  The sock_map and sock_hash have
a bpf iterator also, meaning a bpf-iter prog can iterate the sock_map or
sock_hash and then do an abort on each sk, so it looks like most of the
pieces are in place (a sketch follows).
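
A sketch of how those pieces could fit together for TCP, with the abort
left as a comment since the destroy helper/kfunc is the missing piece
(the sock_hash insert happens in sockops at established time, as
sock_addr hooks can't update sock maps directly, afaik):

  #include "vmlinux.h"
  #include <bpf/bpf_helpers.h>
  #include <bpf/bpf_tracing.h>

  struct {
      __uint(type, BPF_MAP_TYPE_SOCKHASH);
      __uint(max_entries, 65536);
      __type(key, __u64);   /* socket cookie */
      __type(value, __u64);
  } clients SEC(".maps");

  SEC("sockops")
  int track_client(struct bpf_sock_ops *skops)
  {
      __u64 cookie;

      if (skops->op != BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB)
          return 0;
      cookie = bpf_get_socket_cookie(skops);
      /* The kernel drops the entry automatically when the socket is
       * closed, so no extra sock_release hook is needed. */
      bpf_sock_hash_update(skops, &clients, &cookie, BPF_NOEXIST);
      return 0;
  }

  SEC("iter/sockmap")
  int walk_clients(struct bpf_iter__sockmap *ctx)
  {
      if (ctx->sk)
          /* the abort/destroy of ctx->sk would go here once a
           * destroy helper or kfunc exists */
          BPF_SEQ_PRINTF(ctx->meta->seq, "sk found\n");
      return 0;
  }

  char _license[] SEC("license") = "GPL";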



* Re: [RFC] Socket termination for policy enforcement and load-balancing
  2022-09-08  2:26       ` Martin KaFai Lau
@ 2022-09-12 10:01         ` Aditi Ghag
  0 siblings, 0 replies; 10+ messages in thread
From: Aditi Ghag @ 2022-09-12 10:01 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Kumar Kartikeya Dwivedi, Martin KaFai Lau, netdev, bpf,
	Daniel Borkmann, Yonghong Song, Kuniyuki Iwashima

On Thu, Sep 8, 2022 at 3:26 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 9/4/22 2:24 PM, Kumar Kartikeya Dwivedi wrote:
> > On Sun, 4 Sept 2022 at 20:55, Aditi Ghag <aditivghag@gmail.com> wrote:
> >>
> >> On Wed, Aug 31, 2022 at 4:02 PM Martin KaFai Lau <kafai@fb.com> wrote:
> >>>
> >>> On Wed, Aug 31, 2022 at 09:37:41AM -0700, Aditi Ghag wrote:
>>>> [...]
> >>
> >> On a similar note, are there better alternatives to the sockets
> >> iterator approach? Since we have BPF programs executed on cgroup BPF
> >> hooks (e.g., connect), we already know which client sockets are
> >> connected to a backend. Can we somehow store these socket pointers in a
> >> regular BPF map, and when a backend is deleted, use a regular map
> >> iterator to invoke sock_destroy() for these sockets? Does anyone have
> >> experience using the "typed pointer support in BPF maps" APIs [0]?
> >
> > I am not very familiar with how socket lifetime is managed; it may not
> > be possible if the lifetime is managed by RCU only, or due to other
> > limitations. Martin will probably be able to comment more on that.
> sk is the usual refcnt+rcu_reader pattern.  afaik, the use case here is
> that the sk should be removed from the map when there is a tcp_close()
> or udp_lib_close().  There are sock_map and sock_hash to store a sk as
> the map value.  iirc the sk will be automatically removed from the map
> during tcp_close() and udp_lib_close().  The sock_map and sock_hash have
> a bpf iterator also, meaning a bpf-iter prog can iterate the sock_map or
> sock_hash and then do an abort on each sk, so it looks like most of the
> pieces are in place.
>

Yes, I did consider using a sock_hash type map, but for some reason I
thought it would be accessible only from bpf_prog_type_sk_skb or sockops
programs. It looks like the regular map helpers can be used for
update/lookup operations on a sock_hash map. It's great to know that
sock refcounting in the sock map/hash types is automatically accounted
for, so we don't have to add additional INET sock_release hooks.


end of thread [newest: 2022-09-12 10:01 UTC]

Thread overview: 10+ messages
2022-08-31 16:37 [RFC] Socket termination for policy enforcement and load-balancing Aditi Ghag
2022-08-31 22:02 ` sdf
2022-09-04 17:41   ` Aditi Ghag
2022-09-06 16:28     ` Stanislav Fomichev
2022-08-31 23:01 ` Martin KaFai Lau
2022-08-31 23:43   ` Kuniyuki Iwashima
2022-09-04 18:14   ` Aditi Ghag
2022-09-04 21:24     ` Kumar Kartikeya Dwivedi
2022-09-08  2:26       ` Martin KaFai Lau
2022-09-12 10:01         ` Aditi Ghag
