Re: [PATCH RFC bpf-next 1/6] bpf: Hooks for sys_bind

* Re: [PATCH RFC bpf-next 1/6] bpf: Hooks for sys_bind
@ 2018-03-14 18:41 Alexei Starovoitov
  2018-03-15  0:17 ` Eric Dumazet
  0 siblings, 1 reply; 13+ messages in thread
From: Alexei Starovoitov @ 2018-03-14 18:41 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alexei Starovoitov, David S. Miller, Daniel Borkmann,
	Network Development, Kernel Team

On Wed, Mar 14, 2018 at 11:00 AM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
>> It seems this is exactly the case where a netns would be the correct answer.
>
> Unfortuantely that's not the case. That's what I tried to explain
> in the cover letter:
> "The setup involves per-container IPs, policy, etc, so traditional
> network-only solutions that involve VRFs, netns, acls are not applicable."
> To elaborate more on that:
> netns is l2 isolation.
> vrf is l3 isolation.
> whereas to containerize an application we need to punch connectivity holes
> in these layered techniques.
> We also considered resurrecting Hannes's afnetns work
> and even went as far as designing a new namespace for L4 isolation.
> Unfortunately all hierarchical namespace abstraction don't work.
> To run an application inside cgroup container that was not written
> with containers in mind we have to make an illusion of running
> in non-containerized environment.
> In some cases we remember the port and container id in the post-bind hook
> in a bpf map and when some other task in a different container is trying
> to connect to a service we need to know where this service is running.
> It can be remote and can be local. Both client and service may or may not
> be written with containers in mind and this sockaddr rewrite is providing
> connectivity and load balancing feature that you simply cannot do
> with hierarchical networking primitives.

have to explain this a bit further...
We also considered hacking these 'connectivity holes' in
netns and/or vrf, but that would be real layering violation,
since clean l2, l3 abstraction would suddenly support
something that breaks through the layers.
Just like many consider ipvlan a bad hack that punches
through the layers and connects l2 abstraction of netns
at l3 layer, this is not something kernel should ever do.
We really didn't want another ipvlan-like hack in the kernel.
Instead bpf programs at bind/connect time _help_
applications discover and connect to each other.
All containers are running in init_nens and there are no vrfs.
After bind/connect the normal fib/neighbor core networking
logic works as it should always do. The whole system is
clean from network point of view.

^ permalink raw reply	[flat|nested] 13+ messages in thread