On 17/12/2016 19:18, Andy Lutomirski wrote: > Hi all- > > I apologize for being rather late with this. I didn't realize that > cgroup-bpf was going to be submitted for Linux 4.10, and I didn't see > it on the linux-api list, so I missed the discussion. > > I think that the inet ingress, egress etc filters are a neat feature, > but I think the API has some issues that will bite us down the road > if it becomes stable in its current form. > > Most of the problems I see are summarized in this transcript: > > # mkdir cg2 > # mount -t cgroup2 none cg2 > # mkdir cg2/nosockets > # strace cgrp_socket_rule cg2/nosockets/ 0 > ... > open("cg2/nosockets/", O_RDONLY|O_DIRECTORY) = 3 > > ^^^^ You can modify a cgroup after opening it O_RDONLY? I sent a patch to check the cgroup.procs permission before attaching a BPF program to it [1], but it was not merged because not part of the current security model (which may not be crystal clear). The thing is that the current socket/BPF/cgroup feature is only available to a process with the *global CAP_NET_ADMIN* and such a process can already modify the network for every processes, so it doesn't make much sense to check if it can modify the network for a subset of this processes. [1] https://lkml.org/lkml/2016/9/19/854 However, needing a process to open a cgroup *directory* in write mode may not make sense because the process does not modify the content of the cgroup but only use it as a *reference* in the network stack. Forcing an open with write mode may forbid to use this kind of network-filtering feature in a read-only file-system but not necessarily read-only *network configuration*. Another point of view is that the CAP_NET_ADMIN may be an unneeded privilege if the cgroup migration is using a no_new_privs-like feature as I proposed with Landlock [2] (with an extra ptrace_may_access() check). The new capability proposition for cgroup may be interesting too. [2] https://lkml.org/lkml/2016/9/14/82 > > bpf(BPF_PROG_LOAD, {prog_type=0x9 /* BPF_PROG_TYPE_??? */, insn_cnt=2, > insns=0x7fffe3568c10, license="GPL", log_level=1, log_size=262144, > log_buf=0x6020c0, kern_version=0}, 48) = 4 > > ^^^^ This is fine. The bpf() syscall manipulates bpf objects. > > bpf(0x8 /* BPF_??? */, 0x7fffe3568bf0, 48) = 0 > > ^^^^ This is not so good: > ^^^^ > ^^^^ a) The bpf() syscall is supposed to manipulate bpf objects. This > ^^^^ is manipulating a cgroup. There's no reason that a socket creation > ^^^^ filter couldn't be written in a different language (new iptables > ^^^^ table? Simple list of address families?), but if that happened, > ^^^^ then using bpf() to install it would be entirely nonsensical. Another point of view is to say that the BPF program (called by the network stack) is using a reference to a set of processes thanks to a cgroup. > ^^^^ > ^^^^ b) This is starting to be an excessively ugly multiplexer. Among > ^^^^ other things, it's very unfriendly to seccomp. FWIW, Landlock will have the capability to filter this kind of action. > > # echo $$ >cg2/nosockets/cgroup.procs > # ping 127.0.0.1 > ping: socket: Operation not permitted > # ls cg2/nosockets/ > cgroup.controllers cgroup.events cgroup.procs cgroup.subtree_control > # cat cg2/nosockets/cgroup.controllers > > ^^^^ Something in cgroupfs should give an indication that this cgroup > ^^^^ filters socket creation, but there's nothing there. You should also > ^^^^ be able to turn the filter off from cgroupfs. Right. Everybody was OK at LPC to add such an information but it is not there yet. > > # mkdir cg2/nosockets/sockets > # /home/luto/apps/linux/samples/bpf/cgrp_socket_rule cg2/nosockets/sockets/ 1 > > ^^^^ This succeeded, which means that, if this feature is enabled in 4.10, > ^^^^ then we're stuck with its semantics. If it returned -EINVAL instead, > ^^^^ there would be a chance to refine it. This is indeed unfortunate. > > # echo $$ >cg2/nosockets/sockets/cgroup.procs > # ping 127.0.0.1 > PING 127.0.0.1 (127.0.0.1) 56(84) bytes of data. > 64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.029 ms > ^C > --- 127.0.0.1 ping statistics --- > 1 packets transmitted, 1 received, 0% packet loss, time 0ms > rtt min/avg/max/mdev = 0.029/0.029/0.029/0.000 ms > > ^^^^ Bash was inside a cgroup that disallowed socket creation, but socket > ^^^^ creation wasn't disallowed. This means that the obvious use of socket > ^^^^ creation filters in nestable constainers fails insecurely. > > > There's also a subtle but nasty potential security problem here. > In 4.9 and before, cgroups has only one real effect in the kernel: > resource control. A process in a malicious cgroup could be DoSed, > but that was about the extent of the damage that a malicious cgroup > could do. > > In 4.10 with With CONFIG_CGROUP_BPF=y, a cgroup can have bpf > programs attached that can do things if various events occur. (Right > now, this means socket operations, but there are plans in the works > to do this for LSM hooks too.) These bpf programs can say yes or no, > but they can also read out various data (including socket payloads!) > and save them away where an attacker can find them. This sounds a > lot like seccomp with a narrower scope but a much stronger ability to > exfiltrate private information. > > Unfortunately, while seccomp is very, very careful to prevent > injection of a privileged victim into a malicious sandbox, the > CGROUP_BPF mechanism appears to have no real security model. There > is nothing to prevent a program that's in a malicious cgroup from > running a setuid binary, and there is nothing to prevent a program > that has the ability to move itself or another program into a > malicious cgroup from doing so and then, if needed for exploitation, > exec a setuid binary. > > This isn't much of a problem yet because you currently need > CAP_NET_ADMIN to create a malicious sandbox in the first place. I'm > sure that, in the near future, someone will want to make this stuff > work in containers with delegated cgroup hierarchies, and then there > may be a real problem here. > > > I've included a few security people on this thread. The current API > looks abusable, and it would be nice to find all the holes before > 4.10 comes out. > > > (The cgrp_socket_rule source is attached. You can build it by sticking it > in samples/bpf and doing: > > $ make headers_install > $ cd samples/bpf > $ gcc -o cgrp_socket_rule cgrp_socket_rule.c libbpf.c -I../../usr/include > ) > > --Andy > Right. Because this feature doesn't handle namespaces (only global CAP_NET_ADMIN), nested containers should not be allowed to use it at all. If we want this kind of feature to be usable by something other than a process with a global capability, then we need an inheritance mechanism similar to what I designed for Landlock. I think it could be added later. Regards, Mickaël