All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v2] bpf: Restrict cgroup bpf hooks to the init netns
@ 2017-01-23 20:36 Andy Lutomirski
  2017-01-23 21:03 ` David Ahern
  2017-01-24  2:09 ` Alexei Starovoitov
  0 siblings, 2 replies; 24+ messages in thread
From: Andy Lutomirski @ 2017-01-23 20:36 UTC (permalink / raw)
  To: Network Development, David S. Miller
  Cc: Andy Lutomirski, Daniel Borkmann, Alexei Starovoitov, David Ahern

To see how cgroup+bpf interacts with network namespaces, I wrote a
little program called show_bind that calls getsockopt(...,
SO_BINDTODEVICE, ...) and prints the result.  It did this:

 # ./ip link add dev vrf0 type vrf table 10
 # ./ip vrf exec vrf0 ./show_bind
 Default binding is "vrf0"
 # ./ip vrf exec vrf0 unshare -n ./show_bind
 show_bind: getsockopt: No such device

What's happening here is that "ip vrf" looks up vrf0's ifindex in
the init netns and installs a hook that binds sockets to that
ifindex.  When the hook runs in a different netns, it sets
sk_bound_dev_if to an ifindex from the wrong netns, resulting in
incorrect behavior.  In this particular example, the ifindex was 4
and there was no ifindex 4 in the new netns.  If there had been,
this test would have malfunctioned differently

Since it's rather late in the release cycle, let's punt.  This patch
makes it impossible to install cgroup+bpf hooks outside the init
netns and makes them not run on sockets that aren't in the init
netns.

In a future release, it should be relatively straightforward to make
these hooks be local to a netns and, if needed, to add a flag so
that hooks can be made global if necessary.  Global hooks should
presumably be constrained so that they can't write to any ifindex
fields.

Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---

DaveM, this mitigates a bug in a feature that's new in 4.10, and the
bug can be hit using current iproute2 -git.  please consider this for
-net.

Changes from v1:
 - Fix the commit message.  'git commit' was very clever and thought that
   all the interesting bits of the test case were intended to be comments
   and stripped them.  Whoops!

kernel/bpf/cgroup.c  | 21 +++++++++++++++++++++
 kernel/bpf/syscall.c | 11 +++++++++++
 2 files changed, 32 insertions(+)

diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
index a515f7b007c6..a824f543de69 100644
--- a/kernel/bpf/cgroup.c
+++ b/kernel/bpf/cgroup.c
@@ -143,6 +143,17 @@ int __cgroup_bpf_run_filter_skb(struct sock *sk,
 	if (!sk || !sk_fullsock(sk))
 		return 0;
 
+	/*
+	 * For now, socket bpf hooks attached to cgroups can only be
+	 * installed in the init netns and only affect the init netns.
+	 * This could be relaxed in the future once some semantic issues
+	 * are resolved.  For example, ifindexes belonging to one netns
+	 * should probably not be visible to hooks installed by programs
+	 * running in a different netns.
+	 */
+	if (sock_net(sk) != &init_net)
+		return 0;
+
 	if (sk->sk_family != AF_INET &&
 	    sk->sk_family != AF_INET6)
 		return 0;
@@ -186,6 +197,16 @@ int __cgroup_bpf_run_filter_sk(struct sock *sk,
 	struct bpf_prog *prog;
 	int ret = 0;
 
+	/*
+	 * For now, socket bpf hooks attached to cgroups can only be
+	 * installed in the init netns and only affect the init netns.
+	 * This could be relaxed in the future once some semantic issues
+	 * are resolved.  For example, ifindexes belonging to one netns
+	 * should probably not be visible to hooks installed by programs
+	 * running in a different netns.
+	 */
+	if (sock_net(sk) != &init_net)
+		return 0;
 
 	rcu_read_lock();
 
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index e89acea22ecf..c0bbc55e244d 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -902,6 +902,17 @@ static int bpf_prog_attach(const union bpf_attr *attr)
 	struct cgroup *cgrp;
 	enum bpf_prog_type ptype;
 
+	/*
+	 * For now, socket bpf hooks attached to cgroups can only be
+	 * installed in the init netns and only affect the init netns.
+	 * This could be relaxed in the future once some semantic issues
+	 * are resolved.  For example, ifindexes belonging to one netns
+	 * should probably not be visible to hooks installed by programs
+	 * running in a different netns.
+	 */
+	if (current->nsproxy->net_ns != &init_net)
+		return -EINVAL;
+
 	if (!capable(CAP_NET_ADMIN))
 		return -EPERM;
 
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH v2] bpf: Restrict cgroup bpf hooks to the init netns
  2017-01-23 20:36 [PATCH v2] bpf: Restrict cgroup bpf hooks to the init netns Andy Lutomirski
@ 2017-01-23 21:03 ` David Ahern
  2017-01-23 21:49   ` Andy Lutomirski
  2017-01-24  2:09 ` Alexei Starovoitov
  1 sibling, 1 reply; 24+ messages in thread
From: David Ahern @ 2017-01-23 21:03 UTC (permalink / raw)
  To: Andy Lutomirski, Network Development, David S. Miller
  Cc: Daniel Borkmann, Alexei Starovoitov

On 1/23/17 1:36 PM, Andy Lutomirski wrote:
> To see how cgroup+bpf interacts with network namespaces, I wrote a
> little program called show_bind that calls getsockopt(...,
> SO_BINDTODEVICE, ...) and prints the result.  It did this:
> 
>  # ./ip link add dev vrf0 type vrf table 10
>  # ./ip vrf exec vrf0 ./show_bind
>  Default binding is "vrf0"
>  # ./ip vrf exec vrf0 unshare -n ./show_bind
>  show_bind: getsockopt: No such device
> 
> What's happening here is that "ip vrf" looks up vrf0's ifindex in
> the init netns and installs a hook that binds sockets to that

It looks up the device name in the current namespace.

> ifindex.  When the hook runs in a different netns, it sets
> sk_bound_dev_if to an ifindex from the wrong netns, resulting in
> incorrect behavior.  In this particular example, the ifindex was 4
> and there was no ifindex 4 in the new netns.  If there had been,
> this test would have malfunctioned differently

While the cgroups and network namespace interaction needs improvement, a management tool can workaround the deficiencies:

A shell in the default namespace, mgmt vrf (PS1 tells me the network context):
dsa@kenny:mgmt:~$ 

Switch to a different namespace (one that I run VMs for network testing):
dsa@kenny:mgmt:~$ sudo ip netns exec vms su - dsa

And then bind the shell to vrf2
dsa@kenny:vms:~$ sudo ip vrf exec vrf2 su - dsa
dsa@kenny:vms:vrf2:~$

Or I can go straight to vrf2:
dsa@kenny:mgmt:~$ sudo ip netns exec vms ip vrf exec vrf2 su - dsa
dsa@kenny:vms:vrf2:~$


I am testing additional iproute2 cleanups which will be sent before 4.10 is released.

-----8<-----

> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index e89acea22ecf..c0bbc55e244d 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -902,6 +902,17 @@ static int bpf_prog_attach(const union bpf_attr *attr)
>  	struct cgroup *cgrp;
>  	enum bpf_prog_type ptype;
>  
> +	/*
> +	 * For now, socket bpf hooks attached to cgroups can only be
> +	 * installed in the init netns and only affect the init netns.
> +	 * This could be relaxed in the future once some semantic issues
> +	 * are resolved.  For example, ifindexes belonging to one netns
> +	 * should probably not be visible to hooks installed by programs
> +	 * running in a different netns.
> +	 */
> +	if (current->nsproxy->net_ns != &init_net)
> +		return -EINVAL;
> +
>  	if (!capable(CAP_NET_ADMIN))
>  		return -EPERM;
>  

But should this patch be taken, shouldn't the EPERM out rank the namespace check.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2] bpf: Restrict cgroup bpf hooks to the init netns
  2017-01-23 21:03 ` David Ahern
@ 2017-01-23 21:49   ` Andy Lutomirski
  0 siblings, 0 replies; 24+ messages in thread
From: Andy Lutomirski @ 2017-01-23 21:49 UTC (permalink / raw)
  To: David Ahern
  Cc: Andy Lutomirski, Network Development, David S. Miller,
	Daniel Borkmann, Alexei Starovoitov

On Mon, Jan 23, 2017 at 1:03 PM, David Ahern <dsa@cumulusnetworks.com> wrote:
> On 1/23/17 1:36 PM, Andy Lutomirski wrote:
>> To see how cgroup+bpf interacts with network namespaces, I wrote a
>> little program called show_bind that calls getsockopt(...,
>> SO_BINDTODEVICE, ...) and prints the result.  It did this:
>>
>>  # ./ip link add dev vrf0 type vrf table 10
>>  # ./ip vrf exec vrf0 ./show_bind
>>  Default binding is "vrf0"
>>  # ./ip vrf exec vrf0 unshare -n ./show_bind
>>  show_bind: getsockopt: No such device
>>
>> What's happening here is that "ip vrf" looks up vrf0's ifindex in
>> the init netns and installs a hook that binds sockets to that
>
> It looks up the device name in the current namespace.
>
>> ifindex.  When the hook runs in a different netns, it sets
>> sk_bound_dev_if to an ifindex from the wrong netns, resulting in
>> incorrect behavior.  In this particular example, the ifindex was 4
>> and there was no ifindex 4 in the new netns.  If there had been,
>> this test would have malfunctioned differently
>
> While the cgroups and network namespace interaction needs improvement, a management tool can workaround the deficiencies:
>
> A shell in the default namespace, mgmt vrf (PS1 tells me the network context):
> dsa@kenny:mgmt:~$
>
> Switch to a different namespace (one that I run VMs for network testing):
> dsa@kenny:mgmt:~$ sudo ip netns exec vms su - dsa
>
> And then bind the shell to vrf2
> dsa@kenny:vms:~$ sudo ip vrf exec vrf2 su - dsa
> dsa@kenny:vms:vrf2:~$
>
> Or I can go straight to vrf2:
> dsa@kenny:mgmt:~$ sudo ip netns exec vms ip vrf exec vrf2 su - dsa
> dsa@kenny:vms:vrf2:~$

Indeed, if you're careful to set up the vrf cgroup in the same netns
that you end up using it in, it'll work.  But there's a bigger footgun
there than I think is warranted, and I'm not sure how iproute2 is
supposed to do all that much better given that the eBPF program can
neither see what namespace a socket is bound to nor can it act in a
way that works correctly in any namespace.

Long-term, I think the real fix is to make the hook work on a
per-netns basis and, if needed, add an interface for a cross-netns
hook to work sensibly.  But I think it's a bit late to do that for
4.10, so instead I'm proposing to limit the API to the case where it
works and the semantics are unambiguous and to leave further
improvements for later.

It's a bit unfortunate that there seems to be an impedance mismatch in
that "ip vrf" acts on cgroups and that cgroups are somewhat orthogonal
to network namespaces.

>
>
> I am testing additional iproute2 cleanups which will be sent before 4.10 is released.
>
> -----8<-----
>
>> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
>> index e89acea22ecf..c0bbc55e244d 100644
>> --- a/kernel/bpf/syscall.c
>> +++ b/kernel/bpf/syscall.c
>> @@ -902,6 +902,17 @@ static int bpf_prog_attach(const union bpf_attr *attr)
>>       struct cgroup *cgrp;
>>       enum bpf_prog_type ptype;
>>
>> +     /*
>> +      * For now, socket bpf hooks attached to cgroups can only be
>> +      * installed in the init netns and only affect the init netns.
>> +      * This could be relaxed in the future once some semantic issues
>> +      * are resolved.  For example, ifindexes belonging to one netns
>> +      * should probably not be visible to hooks installed by programs
>> +      * running in a different netns.
>> +      */
>> +     if (current->nsproxy->net_ns != &init_net)
>> +             return -EINVAL;
>> +
>>       if (!capable(CAP_NET_ADMIN))
>>               return -EPERM;
>>
>
> But should this patch be taken, shouldn't the EPERM out rank the namespace check.
>

I could see that going either way.  If the hook becomes per-netns,
then the capable() check could potentially become ns_capable() and it
would start succeeding.  I'd be happy to change it, though.

--Andy
-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2] bpf: Restrict cgroup bpf hooks to the init netns
  2017-01-23 20:36 [PATCH v2] bpf: Restrict cgroup bpf hooks to the init netns Andy Lutomirski
  2017-01-23 21:03 ` David Ahern
@ 2017-01-24  2:09 ` Alexei Starovoitov
  2017-01-24  2:31   ` David Ahern
  2017-01-24  2:42   ` Andy Lutomirski
  1 sibling, 2 replies; 24+ messages in thread
From: Alexei Starovoitov @ 2017-01-24  2:09 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Network Development, David S. Miller, Daniel Borkmann,
	Alexei Starovoitov, David Ahern

On Mon, Jan 23, 2017 at 12:36:08PM -0800, Andy Lutomirski wrote:
> To see how cgroup+bpf interacts with network namespaces, I wrote a
> little program called show_bind that calls getsockopt(...,
> SO_BINDTODEVICE, ...) and prints the result.  It did this:
> 
>  # ./ip link add dev vrf0 type vrf table 10
>  # ./ip vrf exec vrf0 ./show_bind
>  Default binding is "vrf0"
>  # ./ip vrf exec vrf0 unshare -n ./show_bind
>  show_bind: getsockopt: No such device
> 
> What's happening here is that "ip vrf" looks up vrf0's ifindex in
> the init netns and installs a hook that binds sockets to that
> ifindex.  When the hook runs in a different netns, it sets
> sk_bound_dev_if to an ifindex from the wrong netns, resulting in
> incorrect behavior.  In this particular example, the ifindex was 4
> and there was no ifindex 4 in the new netns.  If there had been,
> this test would have malfunctioned differently
> 
> Since it's rather late in the release cycle, let's punt.  This patch
> makes it impossible to install cgroup+bpf hooks outside the init
> netns and makes them not run on sockets that aren't in the init
> netns.
> 
> In a future release, it should be relatively straightforward to make
> these hooks be local to a netns and, if needed, to add a flag so
> that hooks can be made global if necessary.  Global hooks should
> presumably be constrained so that they can't write to any ifindex
> fields.
> 
> Cc: Daniel Borkmann <daniel@iogearbox.net>
> Cc: Alexei Starovoitov <ast@kernel.org>
> Cc: David Ahern <dsa@cumulusnetworks.com>
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> ---
> 
> DaveM, this mitigates a bug in a feature that's new in 4.10, and the
> bug can be hit using current iproute2 -git.  please consider this for
> -net.
> 
> Changes from v1:
>  - Fix the commit message.  'git commit' was very clever and thought that
>    all the interesting bits of the test case were intended to be comments
>    and stripped them.  Whoops!
> 
> kernel/bpf/cgroup.c  | 21 +++++++++++++++++++++
>  kernel/bpf/syscall.c | 11 +++++++++++
>  2 files changed, 32 insertions(+)
> 
> diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
> index a515f7b007c6..a824f543de69 100644
> --- a/kernel/bpf/cgroup.c
> +++ b/kernel/bpf/cgroup.c
> @@ -143,6 +143,17 @@ int __cgroup_bpf_run_filter_skb(struct sock *sk,
>  	if (!sk || !sk_fullsock(sk))
>  		return 0;
>  
> +	/*
> +	 * For now, socket bpf hooks attached to cgroups can only be
> +	 * installed in the init netns and only affect the init netns.
> +	 * This could be relaxed in the future once some semantic issues
> +	 * are resolved.  For example, ifindexes belonging to one netns
> +	 * should probably not be visible to hooks installed by programs
> +	 * running in a different netns.
> +	 */
> +	if (sock_net(sk) != &init_net)
> +		return 0;

Nack.
Such check will create absolutely broken abi that we won't be able to fix later.

> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index e89acea22ecf..c0bbc55e244d 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -902,6 +902,17 @@ static int bpf_prog_attach(const union bpf_attr *attr)
>  	struct cgroup *cgrp;
>  	enum bpf_prog_type ptype;
>  
> +	/*
> +	 * For now, socket bpf hooks attached to cgroups can only be
> +	 * installed in the init netns and only affect the init netns.
> +	 * This could be relaxed in the future once some semantic issues
> +	 * are resolved.  For example, ifindexes belonging to one netns
> +	 * should probably not be visible to hooks installed by programs
> +	 * running in a different netns.

the comment is bogus and shows complete lack of understanding
how bpf is working and how it's being used.
See SKF_AD_IFINDEX that was in classic bpf forever.

> +	 */
> +	if (current->nsproxy->net_ns != &init_net)
> +		return -EINVAL;

this restriction I actually don't mind, since it indeed can be
relaxed later, but please make it proper with net_eq()

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2] bpf: Restrict cgroup bpf hooks to the init netns
  2017-01-24  2:09 ` Alexei Starovoitov
@ 2017-01-24  2:31   ` David Ahern
  2017-01-24  2:39     ` Andy Lutomirski
  2017-01-24  2:42   ` Andy Lutomirski
  1 sibling, 1 reply; 24+ messages in thread
From: David Ahern @ 2017-01-24  2:31 UTC (permalink / raw)
  To: Alexei Starovoitov, Andy Lutomirski
  Cc: Network Development, David S. Miller, Daniel Borkmann,
	Alexei Starovoitov

On 1/23/17 7:09 PM, Alexei Starovoitov wrote:
>> +	 */
>> +	if (current->nsproxy->net_ns != &init_net)
>> +		return -EINVAL;
> 
> this restriction I actually don't mind, since it indeed can be
> relaxed later, but please make it proper with net_eq()
> 

I do mind. Why have different restrictions for the skb and sk filters?

I have already shown that ip can alleviate the management aspects of the implementation -- just like ip netns does.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2] bpf: Restrict cgroup bpf hooks to the init netns
  2017-01-24  2:31   ` David Ahern
@ 2017-01-24  2:39     ` Andy Lutomirski
  2017-01-24  4:10       ` David Ahern
  0 siblings, 1 reply; 24+ messages in thread
From: Andy Lutomirski @ 2017-01-24  2:39 UTC (permalink / raw)
  To: David Ahern
  Cc: Alexei Starovoitov, Andy Lutomirski, Network Development,
	David S. Miller, Daniel Borkmann, Alexei Starovoitov

On Mon, Jan 23, 2017 at 6:31 PM, David Ahern <dsa@cumulusnetworks.com> wrote:
> On 1/23/17 7:09 PM, Alexei Starovoitov wrote:
>>> +     */
>>> +    if (current->nsproxy->net_ns != &init_net)
>>> +            return -EINVAL;
>>
>> this restriction I actually don't mind, since it indeed can be
>> relaxed later, but please make it proper with net_eq()
>>
>
> I do mind. Why have different restrictions for the skb and sk filters?
>
> I have already shown that ip can alleviate the management aspects of the implementation -- just like ip netns does.

I'm not sure I followed what you meant.  If I understood right (which
is a big if) you're saying that ip vrf when run in a netns works
correctly in that netns.  I agree, but I think that it would continue
to work (even more reliably) if the hooks only executed in the netns
in which they were created.  I think that would probably be a good
improvement, and it could be done on top of my patch, but I'm not
about to write that code for 4.10.

--Andy

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2] bpf: Restrict cgroup bpf hooks to the init netns
  2017-01-24  2:09 ` Alexei Starovoitov
  2017-01-24  2:31   ` David Ahern
@ 2017-01-24  2:42   ` Andy Lutomirski
  2017-01-24  3:13     ` Alexei Starovoitov
  1 sibling, 1 reply; 24+ messages in thread
From: Andy Lutomirski @ 2017-01-24  2:42 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andy Lutomirski, Network Development, David S. Miller,
	Daniel Borkmann, Alexei Starovoitov, David Ahern

On Mon, Jan 23, 2017 at 6:09 PM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> On Mon, Jan 23, 2017 at 12:36:08PM -0800, Andy Lutomirski wrote:
>> To see how cgroup+bpf interacts with network namespaces, I wrote a
>> little program called show_bind that calls getsockopt(...,
>> SO_BINDTODEVICE, ...) and prints the result.  It did this:
>>
>>  # ./ip link add dev vrf0 type vrf table 10
>>  # ./ip vrf exec vrf0 ./show_bind
>>  Default binding is "vrf0"
>>  # ./ip vrf exec vrf0 unshare -n ./show_bind
>>  show_bind: getsockopt: No such device
>>
>> What's happening here is that "ip vrf" looks up vrf0's ifindex in
>> the init netns and installs a hook that binds sockets to that
>> ifindex.  When the hook runs in a different netns, it sets
>> sk_bound_dev_if to an ifindex from the wrong netns, resulting in
>> incorrect behavior.  In this particular example, the ifindex was 4
>> and there was no ifindex 4 in the new netns.  If there had been,
>> this test would have malfunctioned differently
>>
>> Since it's rather late in the release cycle, let's punt.  This patch
>> makes it impossible to install cgroup+bpf hooks outside the init
>> netns and makes them not run on sockets that aren't in the init
>> netns.
>>
>> In a future release, it should be relatively straightforward to make
>> these hooks be local to a netns and, if needed, to add a flag so
>> that hooks can be made global if necessary.  Global hooks should
>> presumably be constrained so that they can't write to any ifindex
>> fields.
>>
>> Cc: Daniel Borkmann <daniel@iogearbox.net>
>> Cc: Alexei Starovoitov <ast@kernel.org>
>> Cc: David Ahern <dsa@cumulusnetworks.com>
>> Signed-off-by: Andy Lutomirski <luto@kernel.org>
>> ---
>>
>> DaveM, this mitigates a bug in a feature that's new in 4.10, and the
>> bug can be hit using current iproute2 -git.  please consider this for
>> -net.
>>
>> Changes from v1:
>>  - Fix the commit message.  'git commit' was very clever and thought that
>>    all the interesting bits of the test case were intended to be comments
>>    and stripped them.  Whoops!
>>
>> kernel/bpf/cgroup.c  | 21 +++++++++++++++++++++
>>  kernel/bpf/syscall.c | 11 +++++++++++
>>  2 files changed, 32 insertions(+)
>>
>> diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
>> index a515f7b007c6..a824f543de69 100644
>> --- a/kernel/bpf/cgroup.c
>> +++ b/kernel/bpf/cgroup.c
>> @@ -143,6 +143,17 @@ int __cgroup_bpf_run_filter_skb(struct sock *sk,
>>       if (!sk || !sk_fullsock(sk))
>>               return 0;
>>
>> +     /*
>> +      * For now, socket bpf hooks attached to cgroups can only be
>> +      * installed in the init netns and only affect the init netns.
>> +      * This could be relaxed in the future once some semantic issues
>> +      * are resolved.  For example, ifindexes belonging to one netns
>> +      * should probably not be visible to hooks installed by programs
>> +      * running in a different netns.
>> +      */
>> +     if (sock_net(sk) != &init_net)
>> +             return 0;
>
> Nack.
> Such check will create absolutely broken abi that we won't be able to fix later.

Please explain how the change results in a broken ABI and how the
current ABI is better.  I gave a fully worked out example of how the
current ABI misbehaves using only standard Linux networking tools.

>
>> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
>> index e89acea22ecf..c0bbc55e244d 100644
>> --- a/kernel/bpf/syscall.c
>> +++ b/kernel/bpf/syscall.c
>> @@ -902,6 +902,17 @@ static int bpf_prog_attach(const union bpf_attr *attr)
>>       struct cgroup *cgrp;
>>       enum bpf_prog_type ptype;
>>
>> +     /*
>> +      * For now, socket bpf hooks attached to cgroups can only be
>> +      * installed in the init netns and only affect the init netns.
>> +      * This could be relaxed in the future once some semantic issues
>> +      * are resolved.  For example, ifindexes belonging to one netns
>> +      * should probably not be visible to hooks installed by programs
>> +      * running in a different netns.
>
> the comment is bogus and shows complete lack of understanding
> how bpf is working and how it's being used.
> See SKF_AD_IFINDEX that was in classic bpf forever.
>

I think I disagree completely.

With classic BPF, a program creates a socket and binds a bpf program
to it.  That program can see the ifindex, which is fine because that
ifindex is scoped to the socket's netns, which is (unless the caller
uses setns() or unshare()) the same as the caller's netns, so the
ifindex means exactly what the caller thinks it means.

With cgroup+bpf, the program installing the hook can be completely
unrelated to the program whose sockets are subject to the hook, and,
if they're using different namespaces, it breaks.

--Andy

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2] bpf: Restrict cgroup bpf hooks to the init netns
  2017-01-24  2:42   ` Andy Lutomirski
@ 2017-01-24  3:13     ` Alexei Starovoitov
  2017-01-24  3:37       ` Andy Lutomirski
  0 siblings, 1 reply; 24+ messages in thread
From: Alexei Starovoitov @ 2017-01-24  3:13 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, Network Development, David S. Miller,
	Daniel Borkmann, Alexei Starovoitov, David Ahern

On Mon, Jan 23, 2017 at 06:42:27PM -0800, Andy Lutomirski wrote:
> On Mon, Jan 23, 2017 at 6:09 PM, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> > On Mon, Jan 23, 2017 at 12:36:08PM -0800, Andy Lutomirski wrote:
> >> To see how cgroup+bpf interacts with network namespaces, I wrote a
> >> little program called show_bind that calls getsockopt(...,
> >> SO_BINDTODEVICE, ...) and prints the result.  It did this:
> >>
> >>  # ./ip link add dev vrf0 type vrf table 10
> >>  # ./ip vrf exec vrf0 ./show_bind
> >>  Default binding is "vrf0"
> >>  # ./ip vrf exec vrf0 unshare -n ./show_bind
> >>  show_bind: getsockopt: No such device
> >>
> >> What's happening here is that "ip vrf" looks up vrf0's ifindex in
> >> the init netns and installs a hook that binds sockets to that
> >> ifindex.  When the hook runs in a different netns, it sets
> >> sk_bound_dev_if to an ifindex from the wrong netns, resulting in
> >> incorrect behavior.  In this particular example, the ifindex was 4
> >> and there was no ifindex 4 in the new netns.  If there had been,
> >> this test would have malfunctioned differently
> >>
> >> Since it's rather late in the release cycle, let's punt.  This patch
> >> makes it impossible to install cgroup+bpf hooks outside the init
> >> netns and makes them not run on sockets that aren't in the init
> >> netns.
> >>
> >> In a future release, it should be relatively straightforward to make
> >> these hooks be local to a netns and, if needed, to add a flag so
> >> that hooks can be made global if necessary.  Global hooks should
> >> presumably be constrained so that they can't write to any ifindex
> >> fields.
> >>
> >> Cc: Daniel Borkmann <daniel@iogearbox.net>
> >> Cc: Alexei Starovoitov <ast@kernel.org>
> >> Cc: David Ahern <dsa@cumulusnetworks.com>
> >> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> >> ---
> >>
> >> DaveM, this mitigates a bug in a feature that's new in 4.10, and the
> >> bug can be hit using current iproute2 -git.  please consider this for
> >> -net.
> >>
> >> Changes from v1:
> >>  - Fix the commit message.  'git commit' was very clever and thought that
> >>    all the interesting bits of the test case were intended to be comments
> >>    and stripped them.  Whoops!
> >>
> >> kernel/bpf/cgroup.c  | 21 +++++++++++++++++++++
> >>  kernel/bpf/syscall.c | 11 +++++++++++
> >>  2 files changed, 32 insertions(+)
> >>
> >> diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
> >> index a515f7b007c6..a824f543de69 100644
> >> --- a/kernel/bpf/cgroup.c
> >> +++ b/kernel/bpf/cgroup.c
> >> @@ -143,6 +143,17 @@ int __cgroup_bpf_run_filter_skb(struct sock *sk,
> >>       if (!sk || !sk_fullsock(sk))
> >>               return 0;
> >>
> >> +     /*
> >> +      * For now, socket bpf hooks attached to cgroups can only be
> >> +      * installed in the init netns and only affect the init netns.
> >> +      * This could be relaxed in the future once some semantic issues
> >> +      * are resolved.  For example, ifindexes belonging to one netns
> >> +      * should probably not be visible to hooks installed by programs
> >> +      * running in a different netns.
> >> +      */
> >> +     if (sock_net(sk) != &init_net)
> >> +             return 0;
> >
> > Nack.
> > Such check will create absolutely broken abi that we won't be able to fix later.
> 
> Please explain how the change results in a broken ABI and how the
> current ABI is better.  I gave a fully worked out example of how the
> current ABI misbehaves using only standard Linux networking tools.

I gave an example in the other thread that demonstrated
cgroup escape by the appliction when it changes netns.
If application became part of cgroup, it has to stay there,
no matter setns() calls afterwards.

> >
> >> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> >> index e89acea22ecf..c0bbc55e244d 100644
> >> --- a/kernel/bpf/syscall.c
> >> +++ b/kernel/bpf/syscall.c
> >> @@ -902,6 +902,17 @@ static int bpf_prog_attach(const union bpf_attr *attr)
> >>       struct cgroup *cgrp;
> >>       enum bpf_prog_type ptype;
> >>
> >> +     /*
> >> +      * For now, socket bpf hooks attached to cgroups can only be
> >> +      * installed in the init netns and only affect the init netns.
> >> +      * This could be relaxed in the future once some semantic issues
> >> +      * are resolved.  For example, ifindexes belonging to one netns
> >> +      * should probably not be visible to hooks installed by programs
> >> +      * running in a different netns.
> >
> > the comment is bogus and shows complete lack of understanding
> > how bpf is working and how it's being used.
> > See SKF_AD_IFINDEX that was in classic bpf forever.
> >
> 
> I think I disagree completely.
> 
> With classic BPF, a program creates a socket and binds a bpf program
> to it.  That program can see the ifindex, which is fine because that
> ifindex is scoped to the socket's netns, which is (unless the caller
> uses setns() or unshare()) the same as the caller's netns, so the
> ifindex means exactly what the caller thinks it means.
> 
> With cgroup+bpf, the program installing the hook can be completely
> unrelated to the program whose sockets are subject to the hook, and,
> if they're using different namespaces, it breaks.

Please also see ingress_ifindex, ifindex, bpf_redirect(), bpf_clone_redirect()
that all operate on the ifindex while the program can be detached from
the application. In general bpf program and application that loaded and
attached it are mostly disjoint in case of networking. We have tail_call
mechanism and master bpf prog may call programs installed by other apps
that may have exited already.
cls_bpf is scoped by netdev that belongs to some netns.
If after attaching a program with hardcoded if(ifindex==3) check
to such netdev, this netdev is moved into different netns, this 'if' check
and the program become bogus and won't do what program author
expected. It is expected behavior. The same thing with current 'ip vrf'
implementation that Dave is adjusting. It's not a bug in cgroup+bpf behavior.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2] bpf: Restrict cgroup bpf hooks to the init netns
  2017-01-24  3:13     ` Alexei Starovoitov
@ 2017-01-24  3:37       ` Andy Lutomirski
  2017-01-24  4:05         ` David Ahern
  0 siblings, 1 reply; 24+ messages in thread
From: Andy Lutomirski @ 2017-01-24  3:37 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andy Lutomirski, Network Development, David S. Miller,
	Daniel Borkmann, Alexei Starovoitov, David Ahern

On Mon, Jan 23, 2017 at 7:13 PM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> On Mon, Jan 23, 2017 at 06:42:27PM -0800, Andy Lutomirski wrote:
>> Please explain how the change results in a broken ABI and how the
>> current ABI is better.  I gave a fully worked out example of how the
>> current ABI misbehaves using only standard Linux networking tools.
>
> I gave an example in the other thread that demonstrated
> cgroup escape by the appliction when it changes netns.
> If application became part of cgroup, it has to stay there,
> no matter setns() calls afterwards.

The other thread is out of control.  Can you restate your example?
Please keep in mind that uprivileged programs can unshare their netns.

>
>> >
>> >> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
>> >> index e89acea22ecf..c0bbc55e244d 100644
>> >> --- a/kernel/bpf/syscall.c
>> >> +++ b/kernel/bpf/syscall.c
>> >> @@ -902,6 +902,17 @@ static int bpf_prog_attach(const union bpf_attr *attr)
>> >>       struct cgroup *cgrp;
>> >>       enum bpf_prog_type ptype;
>> >>
>> >> +     /*
>> >> +      * For now, socket bpf hooks attached to cgroups can only be
>> >> +      * installed in the init netns and only affect the init netns.
>> >> +      * This could be relaxed in the future once some semantic issues
>> >> +      * are resolved.  For example, ifindexes belonging to one netns
>> >> +      * should probably not be visible to hooks installed by programs
>> >> +      * running in a different netns.
>> >
>> > the comment is bogus and shows complete lack of understanding
>> > how bpf is working and how it's being used.
>> > See SKF_AD_IFINDEX that was in classic bpf forever.
>> >
>>
>> I think I disagree completely.
>>
>> With classic BPF, a program creates a socket and binds a bpf program
>> to it.  That program can see the ifindex, which is fine because that
>> ifindex is scoped to the socket's netns, which is (unless the caller
>> uses setns() or unshare()) the same as the caller's netns, so the
>> ifindex means exactly what the caller thinks it means.
>>
>> With cgroup+bpf, the program installing the hook can be completely
>> unrelated to the program whose sockets are subject to the hook, and,
>> if they're using different namespaces, it breaks.
>
> Please also see ingress_ifindex, ifindex, bpf_redirect(), bpf_clone_redirect()
> that all operate on the ifindex while the program can be detached from
> the application. In general bpf program and application that loaded and
> attached it are mostly disjoint in case of networking. We have tail_call
> mechanism and master bpf prog may call programs installed by other apps
> that may have exited already.

If program A acquires a BPF object from program B where program B runs
in a different netns from program A, and program B uses or tail-calls
that BPF object, then A is either doing it intentionally and is
netns-aware or it is terminally buggy and deserves what it gets.

> cls_bpf is scoped by netdev that belongs to some netns.
> If after attaching a program with hardcoded if(ifindex==3) check
> to such netdev, this netdev is moved into different netns, this 'if' check
> and the program become bogus and won't do what program author
> expected. It is expected behavior. The same thing with current 'ip vrf'
> implementation that Dave is adjusting. It's not a bug in cgroup+bpf behavior.
>

Yes, it is a bug because cgroup+bpf causes unwitting programs to be
subject to BPF code installed by a different, potentially unrelated
process.  That's a new situation.  The failure can happen when a
privileged supervisor (whoever runs ip vrf) runs a clueless or
unprivileged program (the thing calling unshare()).

If the way cgroup+bpf worked was that a program did a prctl() or
similar to opt in to being managed by a provided cgroup-aware BPF
program, then I'd agree with everything you're saying.  But that's not
at all how this code works.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2] bpf: Restrict cgroup bpf hooks to the init netns
  2017-01-24  3:37       ` Andy Lutomirski
@ 2017-01-24  4:05         ` David Ahern
  2017-01-24  4:32           ` Andy Lutomirski
  0 siblings, 1 reply; 24+ messages in thread
From: David Ahern @ 2017-01-24  4:05 UTC (permalink / raw)
  To: Andy Lutomirski, Alexei Starovoitov
  Cc: Andy Lutomirski, Network Development, David S. Miller,
	Daniel Borkmann, Alexei Starovoitov

On 1/23/17 8:37 PM, Andy Lutomirski wrote:
> Yes, it is a bug because cgroup+bpf causes unwitting programs to be
> subject to BPF code installed by a different, potentially unrelated
> process.  That's a new situation.  The failure can happen when a
> privileged supervisor (whoever runs ip vrf) runs a clueless or
> unprivileged program (the thing calling unshare()).

There are many, many ways to misconfigure networking and to run programs in a context or with an input argument that causes the program to not work at all, not work as expected or stop working. This situation is no different. 

For example, the only aspect of BPF_PROG_TYPE_CGROUP_SOCK filters that are namespace based is the ifindex. You brought up the example of changing namespaces where the ifindex is not defined. Alexei mentioned an example where interfaces can be moved to another namespace breaking any ifindex based programs. Another example is the interface can be deleted. Deleting an interface with sockets bound to it does not impact the program in any way - no notification, no wakeup, nothing. The sockets just don't work.

The point of using a management tool like ip is to handle the details to make things sane -- which is the consistent with the whole point of ip netns vs running unshare -n.

> 
> If the way cgroup+bpf worked was that a program did a prctl() or
> similar to opt in to being managed by a provided cgroup-aware BPF
> program, then I'd agree with everything you're saying.  But that's not
> at all how this code works.

I don't want an opt-in approach. The program does not need to know or even care that it has some restriction. Today, someone runs 'ip netns exec NAME COMMAND' the command does not need to know or care that the network namespace was changed on it. Same with 'ip vrf exec' -- it is a network policy that is forced on the program by the user.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2] bpf: Restrict cgroup bpf hooks to the init netns
  2017-01-24  2:39     ` Andy Lutomirski
@ 2017-01-24  4:10       ` David Ahern
  2017-01-24  4:16         ` Andy Lutomirski
  0 siblings, 1 reply; 24+ messages in thread
From: David Ahern @ 2017-01-24  4:10 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Alexei Starovoitov, Andy Lutomirski, Network Development,
	David S. Miller, Daniel Borkmann, Alexei Starovoitov

On 1/23/17 7:39 PM, Andy Lutomirski wrote:
> I'm not sure I followed what you meant.  If I understood right (which
> is a big if) you're saying that ip vrf when run in a netns works
> correctly in that netns.  I agree, but I think that it would continue
> to work (even more reliably) if the hooks only executed in the netns
> in which they were created.  I think that would probably be a good
> improvement, and it could be done on top of my patch, but I'm not
> about to write that code for 4.10.

Sounds like an efficiency on the implementation that does not require limiting the code today to just init namespace.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2] bpf: Restrict cgroup bpf hooks to the init netns
  2017-01-24  4:10       ` David Ahern
@ 2017-01-24  4:16         ` Andy Lutomirski
  0 siblings, 0 replies; 24+ messages in thread
From: Andy Lutomirski @ 2017-01-24  4:16 UTC (permalink / raw)
  To: David Ahern
  Cc: Alexei Starovoitov, Andy Lutomirski, Network Development,
	David S. Miller, Daniel Borkmann, Alexei Starovoitov

On Mon, Jan 23, 2017 at 8:10 PM, David Ahern <dsa@cumulusnetworks.com> wrote:
> On 1/23/17 7:39 PM, Andy Lutomirski wrote:
>> I'm not sure I followed what you meant.  If I understood right (which
>> is a big if) you're saying that ip vrf when run in a netns works
>> correctly in that netns.  I agree, but I think that it would continue
>> to work (even more reliably) if the hooks only executed in the netns
>> in which they were created.  I think that would probably be a good
>> improvement, and it could be done on top of my patch, but I'm not
>> about to write that code for 4.10.
>
> Sounds like an efficiency on the implementation that does not require limiting the code today to just init namespace.
>

It's an ABI change, not an implementation change.  If someone wants to
make the code fully netns-aware in time for 4.10, that sounds great.
I'm not going to do that.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2] bpf: Restrict cgroup bpf hooks to the init netns
  2017-01-24  4:05         ` David Ahern
@ 2017-01-24  4:32           ` Andy Lutomirski
  2017-01-24 17:48             ` Alexei Starovoitov
  0 siblings, 1 reply; 24+ messages in thread
From: Andy Lutomirski @ 2017-01-24  4:32 UTC (permalink / raw)
  To: David Ahern
  Cc: Alexei Starovoitov, Andy Lutomirski, Network Development,
	David S. Miller, Daniel Borkmann, Alexei Starovoitov

On Mon, Jan 23, 2017 at 8:05 PM, David Ahern <dsa@cumulusnetworks.com> wrote:
> On 1/23/17 8:37 PM, Andy Lutomirski wrote:
>> Yes, it is a bug because cgroup+bpf causes unwitting programs to be
>> subject to BPF code installed by a different, potentially unrelated
>> process.  That's a new situation.  The failure can happen when a
>> privileged supervisor (whoever runs ip vrf) runs a clueless or
>> unprivileged program (the thing calling unshare()).
>
> There are many, many ways to misconfigure networking and to run programs in a context or with an input argument that causes the program to not work at all, not work as expected or stop working. This situation is no different.
>
> For example, the only aspect of BPF_PROG_TYPE_CGROUP_SOCK filters that are namespace based is the ifindex. You brought up the example of changing namespaces where the ifindex is not defined. Alexei mentioned an example where interfaces can be moved to another namespace breaking any ifindex based programs. Another example is the interface can be deleted. Deleting an interface with sockets bound to it does not impact the program in any way - no notification, no wakeup, nothing. The sockets just don't work.

And if you use 'ip vrf' to bind to a vrf with ifindex 4 and a program
unshares netns and creates an interface with ifindex 4, then that
program will end up with its sockets magically bound to ifindex 4 and
will silently malfunction.

I can think of multiple ways to address this problem.  You could scope
the hooks to a netns (which is sort of what my patch does).  You could
find a way to force programs in a given cgroup to only execute in a
single netns, although that would probably cause other breakage.  You
could improve the BPF hook API to be netns-aware, which could plausbly
address issues related to unshare() but might get very tricky when
setns() is involved.

My point is that, in 4.10-rc, it doesn't work right, and I doubt this
problem is restricted to just 'ip vrf'.  Without some kind of change
to the way that netns and cgroup+bpf interact, anything that uses
sk_bound_dev_if or reads the ifindex on an skb will be subject to a
huge footgun that unprivileged programs can trigger and any future
attempt to make the cgroup+bpf work for unprivileged users is going to
be more complicated than it deserves to be.

(Aside: I wonder if a similar goal to 'ip vrf' could be achieved by
moving the vrf to a different network namespace and putting programs
that you want to be subject to the VRF into that namespace.)

>
> The point of using a management tool like ip is to handle the details to make things sane -- which is the consistent with the whole point of ip netns vs running unshare -n.

I still don't understand how you're going to make 'ip netns' make this
problem go away.  Keep in mind that there are real programs that call
the unshare() syscall directly.

>
>>
>> If the way cgroup+bpf worked was that a program did a prctl() or
>> similar to opt in to being managed by a provided cgroup-aware BPF
>> program, then I'd agree with everything you're saying.  But that's not
>> at all how this code works.
>
> I don't want an opt-in approach. The program does not need to know or even care that it has some restriction. Today, someone runs 'ip netns exec NAME COMMAND' the command does not need to know or care that the network namespace was changed on it. Same with 'ip vrf exec' -- it is a network policy that is forced on the program by the user.

I absolutely agree.  My point is that cgroup+bpf is *not* opt-in, so
the bar should be higher.  You can't blame the evil program that
called unshare() for breaking things when 'ip vrf' unavoidably sets
things up so that unshare will malfunction.  It should avoid
malfunctioning when running Linux programs that are unaware of it.
This should include programs like unshare and 'ip netns'.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2] bpf: Restrict cgroup bpf hooks to the init netns
  2017-01-24  4:32           ` Andy Lutomirski
@ 2017-01-24 17:48             ` Alexei Starovoitov
  2017-01-24 18:54               ` Andy Lutomirski
  0 siblings, 1 reply; 24+ messages in thread
From: Alexei Starovoitov @ 2017-01-24 17:48 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: David Ahern, Andy Lutomirski, Network Development,
	David S. Miller, Daniel Borkmann, Alexei Starovoitov

On Mon, Jan 23, 2017 at 08:32:02PM -0800, Andy Lutomirski wrote:
> On Mon, Jan 23, 2017 at 8:05 PM, David Ahern <dsa@cumulusnetworks.com> wrote:
> > On 1/23/17 8:37 PM, Andy Lutomirski wrote:
> >> Yes, it is a bug because cgroup+bpf causes unwitting programs to be
> >> subject to BPF code installed by a different, potentially unrelated
> >> process.  That's a new situation.  The failure can happen when a
> >> privileged supervisor (whoever runs ip vrf) runs a clueless or
> >> unprivileged program (the thing calling unshare()).
> >
> > There are many, many ways to misconfigure networking and to run programs in a context or with an input argument that causes the program to not work at all, not work as expected or stop working. This situation is no different.
> >
> > For example, the only aspect of BPF_PROG_TYPE_CGROUP_SOCK filters that are namespace based is the ifindex. You brought up the example of changing namespaces where the ifindex is not defined. Alexei mentioned an example where interfaces can be moved to another namespace breaking any ifindex based programs. Another example is the interface can be deleted. Deleting an interface with sockets bound to it does not impact the program in any way - no notification, no wakeup, nothing. The sockets just don't work.
> 
> And if you use 'ip vrf' to bind to a vrf with ifindex 4 and a program
> unshares netns and creates an interface with ifindex 4, then that
> program will end up with its sockets magically bound to ifindex 4 and
> will silently malfunction.
> 
> I can think of multiple ways to address this problem.  You could scope
> the hooks to a netns (which is sort of what my patch does).  You could
> find a way to force programs in a given cgroup to only execute in a
> single netns, although that would probably cause other breakage.  You
> could improve the BPF hook API to be netns-aware, which could plausbly
> address issues related to unshare() but might get very tricky when
> setns() is involved.

scoping cgroup to netns will create weird combination of cgroup and netns
which was never done before. cgroup and netns scopes have to be able
to overlap in arbitrary way. Application shouldn't not be able to escape
cgroup scope by changing netns. They are orthogonal scoping constructs.
What we can do instead is to do add bpf helper function to retrieve
the netns id, so that program created for specific netns and specific
ifindex will be more resilient to netns changes.
Note that this ifindex behavior has been present since classic bpf
and its shortcomings are understood. We need to improve it for all bpf
users. It really has nothing to do with cgroups.
cls_bpf filters can do packet redirect into ifindex and they can break
if user process changes.
The 'malfunction' scenario describe above is no different than
cls_bpf is using bpf_skb_under_cgroup() to scope particular
skb->sk->cgroup and doing redirect into ifindex.
It will 'malfunction' exactly the same way if application stays
in the same cgroup, but moves from one netns into another.

> My point is that, in 4.10-rc, it doesn't work right, and I doubt this
> problem is restricted to just 'ip vrf'.  Without some kind of change
> to the way that netns and cgroup+bpf interact, anything that uses
> sk_bound_dev_if or reads the ifindex on an skb will be subject to a
> huge footgun that unprivileged programs can trigger and any future
> attempt to make the cgroup+bpf work for unprivileged users is going to
> be more complicated than it deserves to be.

For n-th time, the current BPF_PROG_TYPE_CGROUP* is root only and
speculation about unprivileged usage are not helping the discussion.

> things up so that unshare will malfunction.  It should avoid
> malfunctioning when running Linux programs that are unaware of it.

I agree that for VRF use case it will help to make programs netns
aware by adding new bpf_get_current_netns_id() or something helper,
but it's up to the program to function properly or be broken.
Adding weird netns restrictions is not going to make bpf programs
any more sane. The program author can always shoot in the foot.

I will work on the patch to add that.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2] bpf: Restrict cgroup bpf hooks to the init netns
  2017-01-24 17:48             ` Alexei Starovoitov
@ 2017-01-24 18:54               ` Andy Lutomirski
  2017-01-24 20:29                 ` David Ahern
  2017-01-24 22:18                 ` Tejun Heo
  0 siblings, 2 replies; 24+ messages in thread
From: Andy Lutomirski @ 2017-01-24 18:54 UTC (permalink / raw)
  To: Alexei Starovoitov, Tejun Heo
  Cc: David Ahern, Andy Lutomirski, Network Development,
	David S. Miller, Daniel Borkmann, Alexei Starovoitov

On Tue, Jan 24, 2017 at 9:48 AM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> On Mon, Jan 23, 2017 at 08:32:02PM -0800, Andy Lutomirski wrote:
>> On Mon, Jan 23, 2017 at 8:05 PM, David Ahern <dsa@cumulusnetworks.com> wrote:
>> > On 1/23/17 8:37 PM, Andy Lutomirski wrote:
>> >> Yes, it is a bug because cgroup+bpf causes unwitting programs to be
>> >> subject to BPF code installed by a different, potentially unrelated
>> >> process.  That's a new situation.  The failure can happen when a
>> >> privileged supervisor (whoever runs ip vrf) runs a clueless or
>> >> unprivileged program (the thing calling unshare()).
>> >
>> > There are many, many ways to misconfigure networking and to run programs in a context or with an input argument that causes the program to not work at all, not work as expected or stop working. This situation is no different.
>> >
>> > For example, the only aspect of BPF_PROG_TYPE_CGROUP_SOCK filters that are namespace based is the ifindex. You brought up the example of changing namespaces where the ifindex is not defined. Alexei mentioned an example where interfaces can be moved to another namespace breaking any ifindex based programs. Another example is the interface can be deleted. Deleting an interface with sockets bound to it does not impact the program in any way - no notification, no wakeup, nothing. The sockets just don't work.
>>
>> And if you use 'ip vrf' to bind to a vrf with ifindex 4 and a program
>> unshares netns and creates an interface with ifindex 4, then that
>> program will end up with its sockets magically bound to ifindex 4 and
>> will silently malfunction.
>>
>> I can think of multiple ways to address this problem.  You could scope
>> the hooks to a netns (which is sort of what my patch does).  You could
>> find a way to force programs in a given cgroup to only execute in a
>> single netns, although that would probably cause other breakage.  You
>> could improve the BPF hook API to be netns-aware, which could plausbly
>> address issues related to unshare() but might get very tricky when
>> setns() is involved.
>
> scoping cgroup to netns will create weird combination of cgroup and netns
> which was never done before. cgroup and netns scopes have to be able
> to overlap in arbitrary way. Application shouldn't not be able to escape
> cgroup scope by changing netns.

I had assumed that too, but I'm not longer at all convinced that this
is a problem.  It's certainly the case that, if you put an application
into a restrictive cgroup, it shouldn't be able to bypass those
restrictions using unshare() or setns().  But I don't think this is
really possible regardless.  The easy way to try to escape is using
unshare(), but this doesn't actually buy you anything.

$ unshare -Urn ip link
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group
default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00

Maybe I just escaped the cgroup restrictions on ingress and egress,
but that doesn't seem to matter because I also no longer have any
interfaces that have any traffic.  And, if I created a socket before
unsharing, it's still safe because that socket is bound to the old
netns and would still be subject to the cgroup rules with my patch
applied.

setns() is a bit more complicated, but it should still be fine.
netns_install() requires CAP_NET_ADMIN over the target netns, so you
can only switch in to a netns if you already have privilege in that
netns.

>> My point is that, in 4.10-rc, it doesn't work right, and I doubt this
>> problem is restricted to just 'ip vrf'.  Without some kind of change
>> to the way that netns and cgroup+bpf interact, anything that uses
>> sk_bound_dev_if or reads the ifindex on an skb will be subject to a
>> huge footgun that unprivileged programs can trigger and any future
>> attempt to make the cgroup+bpf work for unprivileged users is going to
>> be more complicated than it deserves to be.
>
> For n-th time, the current BPF_PROG_TYPE_CGROUP* is root only and
> speculation about unprivileged usage are not helping the discussion.

...which has nothing to do with my example of how it's broken.  I used
'ip vrf' the way it was intended to be used and then I ran code in it
that uses only APIs that predate eBPF.  The result was that the
program obtained a broken socket that had sk_bound_dev_if filled in
with a nonsense index.  There's no speculation -- it's just broken.

Maybe you can argue that this is a missing feature in cgroup+bpf (no
API to query which netns is in use) and a corresponding bug in 'ip
vrf', but I see this as evidence that cgroup+bpf as it exists in 4.10
is not carefully enough throught through.  The only non-example user
of it that I can find (ip vrf) is buggy and can't really be fixed
using mechanisms that exist in 4.10-rc.

>
>> things up so that unshare will malfunction.  It should avoid
>> malfunctioning when running Linux programs that are unaware of it.
>
> I agree that for VRF use case it will help to make programs netns
> aware by adding new bpf_get_current_netns_id() or something helper,
> but it's up to the program to function properly or be broken.

This will cause David's code to run slower, and I think he wants very
high performance.

> I will work on the patch to add that.
>

I think that may be a reasonable thing to do, but I think this may be
better as a non-default option.

Tejun, I can see two basic ways that cgroup+bpf delegation could work
down the road.  Both depend on unprivileged users being able to load
BPF programs of the correct type, but that's mostly a matter of
auditing the code and doesn't have any particularly interesting design
challenges as far as I know.

Approach 1: Scope cgroup+bpf hooks to a netns and require
ns_capable(CAP_NET_ADMIN), fs permissions, and (optionally) delegation
to be enabled to install them.  This shouldn't have any particularly
dangerous security implications because ns_capable(CAP_NET_ADMIN)
means you already own the netns.

Approach 2: Keep cgroup+bpf hooks global.  This makes the delegation
story much trickier.  There needs to be some mechanism to prevent some
program that has delegated cgroup control from affecting the behavior
of outside programs.  This isn't so easy.  Imagine that you've
delegated /cgroup/foo to UID 1000.  A program with UID 1000 can now
install a hook on /cgroup/foo, but there needs to be a mechanism to
prevent UID 1000 from running in /cgroup/foo in the global netns and
then running a tool like sudo.  This *might* be possible using just
existing cgroup mechanisms, but it's going to be quite tricky to make
sure it's done completely correctly and without dangerous races.

If cgroup+bpf stays global, then I think you're basically committing
to approach 2.

I would suggest doing approach 1 (i.e. apply my patch) and then, if
truly needed for some use case, add an option so that globally
privileged programs can create hooks that affect all namespaces.

Alexei, do you have an actual use case in mind that requires hooks to
be global?  The only current uses I'm aware of seem to work better if
they're local to a netns.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2] bpf: Restrict cgroup bpf hooks to the init netns
  2017-01-24 18:54               ` Andy Lutomirski
@ 2017-01-24 20:29                 ` David Ahern
  2017-01-24 21:24                   ` Andy Lutomirski
  2017-01-24 22:18                 ` Tejun Heo
  1 sibling, 1 reply; 24+ messages in thread
From: David Ahern @ 2017-01-24 20:29 UTC (permalink / raw)
  To: Andy Lutomirski, Alexei Starovoitov, Tejun Heo
  Cc: Andy Lutomirski, Network Development, David S. Miller,
	Daniel Borkmann, Alexei Starovoitov

On 1/24/17 11:54 AM, Andy Lutomirski wrote:
>>> My point is that, in 4.10-rc, it doesn't work right, and I doubt this
>>> problem is restricted to just 'ip vrf'.  Without some kind of change
>>> to the way that netns and cgroup+bpf interact, anything that uses
>>> sk_bound_dev_if or reads the ifindex on an skb will be subject to a
>>> huge footgun that unprivileged programs can trigger and any future
>>> attempt to make the cgroup+bpf work for unprivileged users is going to
>>> be more complicated than it deserves to be.
>>
>> For n-th time, the current BPF_PROG_TYPE_CGROUP* is root only and
>> speculation about unprivileged usage are not helping the discussion.
> 
> ...which has nothing to do with my example of how it's broken.  I used
> 'ip vrf' the way it was intended to be used and then I ran code in it
> that uses only APIs that predate eBPF.  The result was that the
> program obtained a broken socket that had sk_bound_dev_if filled in
> with a nonsense index.  There's no speculation -- it's just broken.

Again, there are many ways to arrive at this point where a socket is bound to a device index that is invalid. Nothing stops you from downloading a bpf program with an invalid ifindex. That is up to the user.

Users do not run around exec'ing commands in random network contexts (namespace, vrf, device, whatever) and expect them to just work.

> 
> Maybe you can argue that this is a missing feature in cgroup+bpf (no
> API to query which netns is in use) and a corresponding bug in 'ip
> vrf', but I see this as evidence that cgroup+bpf as it exists in 4.10
> is not carefully enough throught through.  The only non-example user
> of it that I can find (ip vrf) is buggy and can't really be fixed
> using mechanisms that exist in 4.10-rc.

The argument is that cgroups and namespaces are completely disjoint infrastructure and that users need to know what they are doing. If one uses a management tool (ip in this case) in a consistent manner it will work. If 'ip netns' is used to change namespaces, vrf is reset on the switch. And, I have a patch now to properly nest namespaces and vrfs so that mgmt vrf in multiple namespaces will work properly.

> 
>>
>>> things up so that unshare will malfunction.  It should avoid
>>> malfunctioning when running Linux programs that are unaware of it.
>>
>> I agree that for VRF use case it will help to make programs netns
>> aware by adding new bpf_get_current_netns_id() or something helper,
>> but it's up to the program to function properly or be broken.
> 
> This will cause David's code to run slower, and I think he wants very
> high performance.

This is a socket create path not a packet path. While overhead should be contained, a few extra cycles should be fine.

Adding the capability to allow users to check the netns id would offer a solution to the namespace problem, but there is nothing that *requires* a bpf program to do it.

Who's to say an admin does not *want* all processes in a group to have sockets bound to a non-existent device? 'ip vrf' restricts the device index to a VRF device because as a management tool I want it to be user friendly, but generically the BPF code does not have any restrictions. ifindex can be any u32 value.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2] bpf: Restrict cgroup bpf hooks to the init netns
  2017-01-24 20:29                 ` David Ahern
@ 2017-01-24 21:24                   ` Andy Lutomirski
  2017-01-24 21:34                     ` David Ahern
  2017-01-25  0:11                     ` Alexei Starovoitov
  0 siblings, 2 replies; 24+ messages in thread
From: Andy Lutomirski @ 2017-01-24 21:24 UTC (permalink / raw)
  To: David Ahern
  Cc: Alexei Starovoitov, Tejun Heo, Andy Lutomirski,
	Network Development, David S. Miller, Daniel Borkmann,
	Alexei Starovoitov

On Tue, Jan 24, 2017 at 12:29 PM, David Ahern <dsa@cumulusnetworks.com> wrote:
>
> Users do not run around exec'ing commands in random network contexts (namespace, vrf, device, whatever) and expect them to just work.

I worked on some code once (deployed in production, even!) that calls
unshare() to make a netns and creates an interface and runs code in
there and expects it to just work.  It wouldn't work the outer program
were run under current ip vrf.

>
>>
>> Maybe you can argue that this is a missing feature in cgroup+bpf (no
>> API to query which netns is in use) and a corresponding bug in 'ip
>> vrf', but I see this as evidence that cgroup+bpf as it exists in 4.10
>> is not carefully enough throught through.  The only non-example user
>> of it that I can find (ip vrf) is buggy and can't really be fixed
>> using mechanisms that exist in 4.10-rc.
>
> The argument is that cgroups and namespaces are completely disjoint infrastructure and that users need to know what they are doing.

But perhaps they should be less disjoint.  As far as I know,
cgroup+bpf is the *only* network configuration that is being set up to
run across all network namespaces.  No one has said why this is okay,
let alone why it's preferable to making it work per-netns just like
basically all other Linux network configuration.

>
>>
>>>
>>>> things up so that unshare will malfunction.  It should avoid
>>>> malfunctioning when running Linux programs that are unaware of it.
>>>
>>> I agree that for VRF use case it will help to make programs netns
>>> aware by adding new bpf_get_current_netns_id() or something helper,
>>> but it's up to the program to function properly or be broken.
>>
>> This will cause David's code to run slower, and I think he wants very
>> high performance.
>
> This is a socket create path not a packet path. While overhead should be contained, a few extra cycles should be fine.
>
> Adding the capability to allow users to check the netns id would offer a solution to the namespace problem, but there is nothing that *requires* a bpf program to do it.
>
> Who's to say an admin does not *want* all processes in a group to have sockets bound to a non-existent device? 'ip vrf' restricts the device index to a VRF device because as a management tool I want it to be user friendly, but generically the BPF code does not have any restrictions. ifindex can be any u32 value.

I was hoping for an actual likely use case for the bpf hooks to be run
in all namespaces.  You're arguing that iproute2 can be made to work
mostly okay if bpf hooks can run in all namespaces, but the use case
of intentionally making sk_bound_dev_if invalid across all namespaces
seems dubious.

But all of what you're suggesting doing would still work fine and
would result in less kernel code *and* less eBPF code if the hooks
were per netns.

--Andy

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2] bpf: Restrict cgroup bpf hooks to the init netns
  2017-01-24 21:24                   ` Andy Lutomirski
@ 2017-01-24 21:34                     ` David Ahern
  2017-01-25  0:11                     ` Alexei Starovoitov
  1 sibling, 0 replies; 24+ messages in thread
From: David Ahern @ 2017-01-24 21:34 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Alexei Starovoitov, Tejun Heo, Andy Lutomirski,
	Network Development, David S. Miller, Daniel Borkmann,
	Alexei Starovoitov

On 1/24/17 2:24 PM, Andy Lutomirski wrote:
> I was hoping for an actual likely use case for the bpf hooks to be run
> in all namespaces.  You're arguing that iproute2 can be made to work
> mostly okay if bpf hooks can run in all namespaces, but the use case
> of intentionally making sk_bound_dev_if invalid across all namespaces
> seems dubious.

you can use the bpf hook to deny socket create based on family and/or protocol. 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2] bpf: Restrict cgroup bpf hooks to the init netns
  2017-01-24 18:54               ` Andy Lutomirski
  2017-01-24 20:29                 ` David Ahern
@ 2017-01-24 22:18                 ` Tejun Heo
  1 sibling, 0 replies; 24+ messages in thread
From: Tejun Heo @ 2017-01-24 22:18 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Alexei Starovoitov, David Ahern, Andy Lutomirski,
	Network Development, David S. Miller, Daniel Borkmann,
	Alexei Starovoitov

Hello, Andy.

On Tue, Jan 24, 2017 at 10:54:03AM -0800, Andy Lutomirski wrote:
> Tejun, I can see two basic ways that cgroup+bpf delegation could work
> down the road.  Both depend on unprivileged users being able to load
> BPF programs of the correct type, but that's mostly a matter of
> auditing the code and doesn't have any particularly interesting design
> challenges as far as I know.
> 
> Approach 1: Scope cgroup+bpf hooks to a netns and require
> ns_capable(CAP_NET_ADMIN), fs permissions, and (optionally) delegation
> to be enabled to install them.  This shouldn't have any particularly
> dangerous security implications because ns_capable(CAP_NET_ADMIN)
> means you already own the netns.
> 
> Approach 2: Keep cgroup+bpf hooks global.  This makes the delegation
> story much trickier.  There needs to be some mechanism to prevent some
> program that has delegated cgroup control from affecting the behavior
> of outside programs.  This isn't so easy.  Imagine that you've
> delegated /cgroup/foo to UID 1000.  A program with UID 1000 can now
> install a hook on /cgroup/foo, but there needs to be a mechanism to
> prevent UID 1000 from running in /cgroup/foo in the global netns and
> then running a tool like sudo.  This *might* be possible using just
> existing cgroup mechanisms, but it's going to be quite tricky to make
> sure it's done completely correctly and without dangerous races.
> 
> If cgroup+bpf stays global, then I think you're basically committing
> to approach 2.
> 
> I would suggest doing approach 1 (i.e. apply my patch) and then, if
> truly needed for some use case, add an option so that globally
> privileged programs can create hooks that affect all namespaces.

I thought about restricting according to namespaces and your ifindex
example.  While the ifindex issue seems broken and on the surface
seems to indicate that we need to segregate cgroup bpf programs per
netns as iptables does, the more I think about it, the less convinced
I'm that that is the only right way forward.

I'm not too familiar with netns but if you think about other
namespaces, !root namespaces nest inside the root one.  When you enter
a pid namespace, it adds its pid namespace on top of the global one.
It doesn't replace that, and that is an important characteristic which
allows the root namespace to maintain control over the nested ones.
This is the same with cgroupns, which basically is just nesting, or
mount namespace.  Namespaced objects don't disappear in the eyes of
the root namespace.  They just get additional scoped names.

Back to netns, if I'm understanding correctly, this doesn't seem to be
the case.  An interface belongs to a namespace and can be moved around
but it disappears from the root namespace once it enters a non-root
one.  I guess there are reasons, even if historical, for it but this
is what is breaking your ifindex example.  It's fine for an interface
to get a new scoped ifindex inside a netns, but that shouldn't
override the global one.  This, I think, should be fixable without
breaking existing APIs by introducing a new global index.

My opinion on this isn't very strong but what iptables is doing
actually seems wrong to me in that it actually precludes any way of
implementing global policies which encompasses the whole system.
Again, this is a fixable problem if ever necessary by introducing
properly global rule tables or even just adding a flag.

So, I don't think the situation is as clear as "ifindex is broken, so
cgroup bpf programs should be segregated per netns".

David, in another reply, you mentioned about fixing ifindex.  If I
understood correctly, we're talking about the same thing, right?  If
so, is this something easy to implement?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2] bpf: Restrict cgroup bpf hooks to the init netns
  2017-01-24 21:24                   ` Andy Lutomirski
  2017-01-24 21:34                     ` David Ahern
@ 2017-01-25  0:11                     ` Alexei Starovoitov
  2017-01-25  0:19                       ` David Ahern
  2017-01-25 20:28                       ` Andy Lutomirski
  1 sibling, 2 replies; 24+ messages in thread
From: Alexei Starovoitov @ 2017-01-25  0:11 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: David Ahern, Tejun Heo, Andy Lutomirski, Network Development,
	David S. Miller, Daniel Borkmann, Alexei Starovoitov

On Tue, Jan 24, 2017 at 01:24:54PM -0800, Andy Lutomirski wrote:
> On Tue, Jan 24, 2017 at 12:29 PM, David Ahern <dsa@cumulusnetworks.com> wrote:
> >
> > Users do not run around exec'ing commands in random network contexts (namespace, vrf, device, whatever) and expect them to just work.
> 
> I worked on some code once (deployed in production, even!) that calls
> unshare() to make a netns and creates an interface and runs code in
> there and expects it to just work.

that is exactly the counter argument to your proposal which is making cgroups
ignore bpf programs acting on sockets/apps that are no longer in the root ns.
If app can escape cgroup by switching netns, it breaks the scoping assumption
in a way that user space control plane that uses cgroup+bpf cannot oversee.
The program should just work and stay visible to bpf program within
that cgroup scope. More below.

> setns() is a bit more complicated, but it should still be fine.
> netns_install() requires CAP_NET_ADMIN over the target netns, so you
> can only switch in to a netns if you already have privilege in that
> netns.

it's not fine. Consider our main use case which is cgroup-scoped traffic
monitoring and enforcement for well behaving apps.
The container management framework cannot use sandboxing and monitor
all syscalls, because it's unacceptable overhead for production apps.
These apps do unknown things including netns. Some of them run as root too.
The container management needs to see what these apps are doing from
networking point of view with lowest possible overhead. While
cgroup+bpf hook is independent from netns, it's all fine. Initially
the program will simply count bytes/packets and associate them with jobs
where job is a collection of processes. In some cases we need to
restrict whom these jobs can talk to and amount of traffic they send.
Container management doesn't place jobs into netns-es today, because
veth overhead is too high. Something like afnetns might be an option.
Whether it's going to be netns or afnetns should be orthogonal to
cgroup scoped monitoring. A job per cgroup may have multiple
netns inside and the other way around. Multiple cgroup scopes
within single netns. I don't think there is a case where
cgroup/netns scopes will be misaligned, like:
cgroup A + netns 1 and 2
cgroup B + netns 2 and 3
but I don't see a reason to restrict that either.

> But perhaps they should be less disjoint.  As far as I know,
> cgroup+bpf is the *only* network configuration that is being set up to
> run across all network namespaces.  No one has said why this is okay,
> let alone why it's preferable to making it work per-netns just like
> basically all other Linux network configuration.

Single cls_bpf program attached to all netdevs in tc egress
hook can call bpf_skb_under_cgroup() to check whether given skb
is under given cgroup and afterwards decide to drop/redirect.
In terms of 'run across all netns' it's exactly the same.
Our monitoring use case came out of it.
Unfortunately bpf_skb_under_cgroup() in cls_bpf is not scalable.
It works fine for quick on the box monitoring where user logs in
to the box and can see what particular job is doing,
but it's not usable when there are thousands of cgroups and
we need to monitor them all.

> I was hoping for an actual likely use case for the bpf hooks to be run
> in all namespaces.  You're arguing that iproute2 can be made to work
> mostly okay if bpf hooks can run in all namespaces, but the use case
> of intentionally making sk_bound_dev_if invalid across all namespaces
> seems dubious.

I think what Tejun is proposing regarding adding global_ifindex
is a great idea and it will make ifindex interaction cleaner not only
for cgroup+bpf, but for socket and cls_bpf programs too.
I think it would be ok to disallow unprivileged programs (whenever
they come) to use of bpf_sock->bound_dev_if and only
allow bpf_sock->global_bound_dev_if and that should solve
your security concern for future unpriv programs.

I think bpf_get_sock_netns_id() helper or sk->netns_id field would
be good addition as well, since it will allow 'ip vrf' to be smarter today.
It's also more flexible, since bpf_type_cgroup_sock program can make
its own decision when netns_id != expecte_id instead of hard coded.
Today the ip vrf use case does: 'sk->bound_dev_if = idx; return OK;'
it will be able to:
  if (sk->netns_id != expected_id)
    return DROP;
  sk->bound_dev_if = idx;
  return OK;
or
  if (sk->netns_id != expected_id)
    return OK;
  sk->bound_dev_if = idx;
  return OK;
or it will be able to bpf_trace_printk() or bpf_perf_event_output()
to send notification to vrf user space daemon and so on.
Such checks will run at the same speed as checks inside
__cgroup_bpf_run_filter_sk(), but all other users won't pay
for them when not in use. imo it's a win-win.

In parallel I'm planning to work on 'disallow override' flag
for bpf_prog_attach. That should solve remaining concern.
I still disagree about urgency of implementing it, but
it's simple enough, so why not now.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2] bpf: Restrict cgroup bpf hooks to the init netns
  2017-01-25  0:11                     ` Alexei Starovoitov
@ 2017-01-25  0:19                       ` David Ahern
  2017-01-25 20:28                       ` Andy Lutomirski
  1 sibling, 0 replies; 24+ messages in thread
From: David Ahern @ 2017-01-25  0:19 UTC (permalink / raw)
  To: Alexei Starovoitov, Andy Lutomirski
  Cc: Tejun Heo, Andy Lutomirski, Network Development, David S. Miller,
	Daniel Borkmann, Alexei Starovoitov

On 1/24/17 5:11 PM, Alexei Starovoitov wrote:
> I think bpf_get_sock_netns_id() helper or sk->netns_id field would
> be good addition as well, since it will allow 'ip vrf' to be smarter today.
> It's also more flexible, since bpf_type_cgroup_sock program can make
> its own decision when netns_id != expecte_id instead of hard coded.
> Today the ip vrf use case does: 'sk->bound_dev_if = idx; return OK;'
> it will be able to:
>   if (sk->netns_id != expected_id)
>     return DROP;
>   sk->bound_dev_if = idx;
>   return OK;
> or
>   if (sk->netns_id != expected_id)
>     return OK;
>   sk->bound_dev_if = idx;
>   return OK;

I agree

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2] bpf: Restrict cgroup bpf hooks to the init netns
  2017-01-25  0:11                     ` Alexei Starovoitov
  2017-01-25  0:19                       ` David Ahern
@ 2017-01-25 20:28                       ` Andy Lutomirski
  2017-01-26  0:45                         ` Eric W. Biederman
  1 sibling, 1 reply; 24+ messages in thread
From: Andy Lutomirski @ 2017-01-25 20:28 UTC (permalink / raw)
  To: Alexei Starovoitov, Eric W. Biederman
  Cc: David Ahern, Tejun Heo, Andy Lutomirski, Network Development,
	David S. Miller, Daniel Borkmann, Alexei Starovoitov

On Tue, Jan 24, 2017 at 4:11 PM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> On Tue, Jan 24, 2017 at 01:24:54PM -0800, Andy Lutomirski wrote:
>> On Tue, Jan 24, 2017 at 12:29 PM, David Ahern <dsa@cumulusnetworks.com> wrote:
>> >
>> > Users do not run around exec'ing commands in random network contexts (namespace, vrf, device, whatever) and expect them to just work.
>>

>> setns() is a bit more complicated, but it should still be fine.
>> netns_install() requires CAP_NET_ADMIN over the target netns, so you
>> can only switch in to a netns if you already have privilege in that
>> netns.
>
> it's not fine. Consider our main use case which is cgroup-scoped traffic
> monitoring and enforcement for well behaving apps.

[...]

That does indeed sound like a sensible use case.  Thanks!

>
>> But perhaps they should be less disjoint.  As far as I know,
>> cgroup+bpf is the *only* network configuration that is being set up to
>> run across all network namespaces.  No one has said why this is okay,
>> let alone why it's preferable to making it work per-netns just like
>> basically all other Linux network configuration.
>
> Single cls_bpf program attached to all netdevs in tc egress
> hook can call bpf_skb_under_cgroup() to check whether given skb
> is under given cgroup and afterwards decide to drop/redirect.
> In terms of 'run across all netns' it's exactly the same.
> Our monitoring use case came out of it.
> Unfortunately bpf_skb_under_cgroup() in cls_bpf is not scalable.
> It works fine for quick on the box monitoring where user logs in
> to the box and can see what particular job is doing,
> but it's not usable when there are thousands of cgroups and
> we need to monitor them all.
>
>> I was hoping for an actual likely use case for the bpf hooks to be run
>> in all namespaces.  You're arguing that iproute2 can be made to work
>> mostly okay if bpf hooks can run in all namespaces, but the use case
>> of intentionally making sk_bound_dev_if invalid across all namespaces
>> seems dubious.
>
> I think what Tejun is proposing regarding adding global_ifindex
> is a great idea and it will make ifindex interaction cleaner not only
> for cgroup+bpf, but for socket and cls_bpf programs too.
> I think it would be ok to disallow unprivileged programs (whenever
> they come) to use of bpf_sock->bound_dev_if and only
> allow bpf_sock->global_bound_dev_if and that should solve
> your security concern for future unpriv programs.
>
> I think bpf_get_sock_netns_id() helper or sk->netns_id field would
> be good addition as well, since it will allow 'ip vrf' to be smarter today.
> It's also more flexible, since bpf_type_cgroup_sock program can make
> its own decision when netns_id != expecte_id instead of hard coded.
> Today the ip vrf use case does: 'sk->bound_dev_if = idx; return OK;'
> it will be able to:
>   if (sk->netns_id != expected_id)
>     return DROP;
>   sk->bound_dev_if = idx;
>   return OK;
> or
>   if (sk->netns_id != expected_id)
>     return OK;
>   sk->bound_dev_if = idx;
>   return OK;
> or it will be able to bpf_trace_printk() or bpf_perf_event_output()
> to send notification to vrf user space daemon and so on.
> Such checks will run at the same speed as checks inside
> __cgroup_bpf_run_filter_sk(), but all other users won't pay
> for them when not in use. imo it's a win-win.

Eric, does this sound okay to you?  You're the authority on exposing
things like namespace ids to users.

>
> In parallel I'm planning to work on 'disallow override' flag
> for bpf_prog_attach. That should solve remaining concern.
> I still disagree about urgency of implementing it, but
> it's simple enough, so why not now.
>

Can you describe the exact semantics?  I really think there should be
room for adding a mode where all the relevant programs are invoked and
that this should eventually be the default.  It's the obvious behavior
and it has many fewer footguns.

I would propose the following:

 - By default, socket creation and egress filters call all registered
programs, innermost cgroup first.  If any of them return a reject
code, stop processing further filters and reject.  By default, ingress
filters call all registered programs, innermost cgroup first.  If any
of them return a reject code, stop processing further filters and
reject.  This is indended to Do The Right Thing (TM) for both
monitoring systems and for sandboxing.  For monitoring, if you stick
your hooks in the /a cgroup, you presumably want to see all traffic
that actually goes in and out.  For ingress, you want to see what's
really there before it gets further modified by hooks set up by the
monitored application (which Alexei noted above can run as root).  For
egress, you want to see what really egresses the machine, after any
modifications made by BPF programs set up by the application.  In
particular, if the application sends a packet but then rejects it via
its own hook, you *don't* want to see it because it won't actually go
out on the wire and you shouldn't charge the application for it.  For
socket creation, the outer manager should have the last word on things
like sk_bound_dev_if.

 - As an option, if needed for some application, let a hook give some
indication that it is permissible for it to be overridden.  It's not
entirely clear to me what the exact semantics should be (do only other
hooks with the special flag set override it?  Do all hooks override
it?)  If you do this, give some thought to a workload in which an
outer accounting system is using an overridable hook (for whatever
reason) and an inner application running as root has installed BPF
programs for its own purposes.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2] bpf: Restrict cgroup bpf hooks to the init netns
  2017-01-25 20:28                       ` Andy Lutomirski
@ 2017-01-26  0:45                         ` Eric W. Biederman
  2017-01-28 14:43                           ` Tejun Heo
  0 siblings, 1 reply; 24+ messages in thread
From: Eric W. Biederman @ 2017-01-26  0:45 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Alexei Starovoitov, David Ahern, Tejun Heo, Andy Lutomirski,
	Network Development, David S. Miller, Daniel Borkmann,
	Alexei Starovoitov

Andy Lutomirski <luto@amacapital.net> writes:

> On Tue, Jan 24, 2017 at 4:11 PM, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
>> On Tue, Jan 24, 2017 at 01:24:54PM -0800, Andy Lutomirski wrote:
>>> On Tue, Jan 24, 2017 at 12:29 PM, David Ahern <dsa@cumulusnetworks.com> wrote:
>>> >
>>> > Users do not run around exec'ing commands in random network contexts (namespace, vrf, device, whatever) and expect them to just work.
>>>
>
>>> setns() is a bit more complicated, but it should still be fine.
>>> netns_install() requires CAP_NET_ADMIN over the target netns, so you
>>> can only switch in to a netns if you already have privilege in that
>>> netns.
>>
>> it's not fine. Consider our main use case which is cgroup-scoped traffic
>> monitoring and enforcement for well behaving apps.
>
> [...]
>
> That does indeed sound like a sensible use case.  Thanks!
>
>>
>>> But perhaps they should be less disjoint.  As far as I know,
>>> cgroup+bpf is the *only* network configuration that is being set up to
>>> run across all network namespaces.  No one has said why this is okay,
>>> let alone why it's preferable to making it work per-netns just like
>>> basically all other Linux network configuration.
>>
>> Single cls_bpf program attached to all netdevs in tc egress
>> hook can call bpf_skb_under_cgroup() to check whether given skb
>> is under given cgroup and afterwards decide to drop/redirect.
>> In terms of 'run across all netns' it's exactly the same.
>> Our monitoring use case came out of it.
>> Unfortunately bpf_skb_under_cgroup() in cls_bpf is not scalable.
>> It works fine for quick on the box monitoring where user logs in
>> to the box and can see what particular job is doing,
>> but it's not usable when there are thousands of cgroups and
>> we need to monitor them all.
>>
>>> I was hoping for an actual likely use case for the bpf hooks to be run
>>> in all namespaces.  You're arguing that iproute2 can be made to work
>>> mostly okay if bpf hooks can run in all namespaces, but the use case
>>> of intentionally making sk_bound_dev_if invalid across all namespaces
>>> seems dubious.
>>
>> I think what Tejun is proposing regarding adding global_ifindex
>> is a great idea and it will make ifindex interaction cleaner not only
>> for cgroup+bpf, but for socket and cls_bpf programs too.
>> I think it would be ok to disallow unprivileged programs (whenever
>> they come) to use of bpf_sock->bound_dev_if and only
>> allow bpf_sock->global_bound_dev_if and that should solve
>> your security concern for future unpriv programs.
>>
>> I think bpf_get_sock_netns_id() helper or sk->netns_id field would
>> be good addition as well, since it will allow 'ip vrf' to be smarter today.
>> It's also more flexible, since bpf_type_cgroup_sock program can make
>> its own decision when netns_id != expecte_id instead of hard coded.
>> Today the ip vrf use case does: 'sk->bound_dev_if = idx; return OK;'
>> it will be able to:
>>   if (sk->netns_id != expected_id)
>>     return DROP;
>>   sk->bound_dev_if = idx;
>>   return OK;
>> or
>>   if (sk->netns_id != expected_id)
>>     return OK;
>>   sk->bound_dev_if = idx;
>>   return OK;
>> or it will be able to bpf_trace_printk() or bpf_perf_event_output()
>> to send notification to vrf user space daemon and so on.
>> Such checks will run at the same speed as checks inside
>> __cgroup_bpf_run_filter_sk(), but all other users won't pay
>> for them when not in use. imo it's a win-win.
>
> Eric, does this sound okay to you?  You're the authority on exposing
> things like namespace ids to users.

*Boggle*  Things that run across all network namespaces break any kind
 of sense I have about thinking about them.

Running across more than one network namespace by default seems very
broken to me.

ifindex is definitely a per network namespace property.

We do have a netns_id in the network stack, but that is also
network namespace local.  It is just the local name of another network
namespace.

I can't imagine how any of those identifiers would make any sense
in a context that was not network namespace specific.  It has to be at a
minimum be interpreted in the context of the network namespace things
are made in.

There are also some huge security and maintenance implications if there
is a filter that can run acrosss all network namespaces.  As it sounds
like it can grant processes visibility into network namespaces they do
not already have.  Which can easily break things like Checkpoint/Restart
as well as having rather large implications for information leaks
and interference in namespaces that you a process would not otherwise
have access to.

When I have looked cgroups + network stack has never worked well in
any use case I have seen.  As cgroups are per process and that makes
things like that only appropriate for very close to the process
transmit decissions.  AKA is it ok to send this process to a socket
controlled by this PID.

I don't have the full context but I am have a hard time imagining
anything sensible going on.   I like the subject.  Let's restrict this
to the initial network namespace so things have a chance of being well
defined.

Eric

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2] bpf: Restrict cgroup bpf hooks to the init netns
  2017-01-26  0:45                         ` Eric W. Biederman
@ 2017-01-28 14:43                           ` Tejun Heo
  0 siblings, 0 replies; 24+ messages in thread
From: Tejun Heo @ 2017-01-28 14:43 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andy Lutomirski, Alexei Starovoitov, David Ahern,
	Andy Lutomirski, Network Development, David S. Miller,
	Daniel Borkmann, Alexei Starovoitov

Hello, Eric.

On Thu, Jan 26, 2017 at 01:45:07PM +1300, Eric W. Biederman wrote:
> > Eric, does this sound okay to you?  You're the authority on exposing
> > things like namespace ids to users.
> 
> *Boggle*  Things that run across all network namespaces break any kind
>  of sense I have about thinking about them.
> 
> Running across more than one network namespace by default seems very
> broken to me.

Can you explain why that is?  Other namespaces don't behave this way.
For example, a PID namespace doesn't hide the processes at the system
level.  It just gives additional nested names to the namespaced
processes and having objects visible at the system level is very
useful for monitoring and management.  Are there inherent reasons why
network namespace should be very different from other namespaces in
this regard?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2017-01-28 14:44 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-01-23 20:36 [PATCH v2] bpf: Restrict cgroup bpf hooks to the init netns Andy Lutomirski
2017-01-23 21:03 ` David Ahern
2017-01-23 21:49   ` Andy Lutomirski
2017-01-24  2:09 ` Alexei Starovoitov
2017-01-24  2:31   ` David Ahern
2017-01-24  2:39     ` Andy Lutomirski
2017-01-24  4:10       ` David Ahern
2017-01-24  4:16         ` Andy Lutomirski
2017-01-24  2:42   ` Andy Lutomirski
2017-01-24  3:13     ` Alexei Starovoitov
2017-01-24  3:37       ` Andy Lutomirski
2017-01-24  4:05         ` David Ahern
2017-01-24  4:32           ` Andy Lutomirski
2017-01-24 17:48             ` Alexei Starovoitov
2017-01-24 18:54               ` Andy Lutomirski
2017-01-24 20:29                 ` David Ahern
2017-01-24 21:24                   ` Andy Lutomirski
2017-01-24 21:34                     ` David Ahern
2017-01-25  0:11                     ` Alexei Starovoitov
2017-01-25  0:19                       ` David Ahern
2017-01-25 20:28                       ` Andy Lutomirski
2017-01-26  0:45                         ` Eric W. Biederman
2017-01-28 14:43                           ` Tejun Heo
2017-01-24 22:18                 ` Tejun Heo

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.