linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Potential issues (security and otherwise) with the current cgroup-bpf API
@ 2016-12-17 18:18 Andy Lutomirski
  2016-12-17 19:26 ` Mickaël Salaün
  2016-12-19 20:56 ` Alexei Starovoitov
  0 siblings, 2 replies; 48+ messages in thread
From: Andy Lutomirski @ 2016-12-17 18:18 UTC (permalink / raw)
  To: Daniel Mack, Alexei Starovoitov, Mickaël Salaün,
	Kees Cook, Jann Horn, Tejun Heo, David Ahern, David S. Miller,
	Thomas Graf, Michael Kerrisk, Peter Zijlstra
  Cc: Linux API, linux-kernel, Network Development

[-- Attachment #1: Type: text/plain, Size: 4679 bytes --]

Hi all-

I apologize for being rather late with this.  I didn't realize that
cgroup-bpf was going to be submitted for Linux 4.10, and I didn't see
it on the linux-api list, so I missed the discussion.

I think that the inet ingress, egress etc filters are a neat feature,
but I think the API has some issues that will bite us down the road
if it becomes stable in its current form.

Most of the problems I see are summarized in this transcript:

# mkdir cg2
# mount -t cgroup2 none cg2
# mkdir cg2/nosockets
# strace cgrp_socket_rule cg2/nosockets/ 0
...
open("cg2/nosockets/", O_RDONLY|O_DIRECTORY) = 3

^^^^ You can modify a cgroup after opening it O_RDONLY?

bpf(BPF_PROG_LOAD, {prog_type=0x9 /* BPF_PROG_TYPE_??? */, insn_cnt=2,
insns=0x7fffe3568c10, license="GPL", log_level=1, log_size=262144,
log_buf=0x6020c0, kern_version=0}, 48) = 4

^^^^ This is fine.  The bpf() syscall manipulates bpf objects.

bpf(0x8 /* BPF_??? */, 0x7fffe3568bf0, 48) = 0

^^^^ This is not so good:
^^^^
^^^^ a) The bpf() syscall is supposed to manipulate bpf objects.  This
^^^^    is manipulating a cgroup.  There's no reason that a socket creation
^^^^    filter couldn't be written in a different language (new iptables
^^^^    table?  Simple list of address families?), but if that happened,
^^^^    then using bpf() to install it would be entirely nonsensical.
^^^^
^^^^ b) This is starting to be an excessively ugly multiplexer.  Among
^^^^    other things, it's very unfriendly to seccomp.

# echo $$ >cg2/nosockets/cgroup.procs
# ping 127.0.0.1
ping: socket: Operation not permitted
# ls cg2/nosockets/
cgroup.controllers  cgroup.events  cgroup.procs  cgroup.subtree_control
# cat cg2/nosockets/cgroup.controllers

^^^^ Something in cgroupfs should give an indication that this cgroup
^^^^ filters socket creation, but there's nothing there.  You should also
^^^^ be able to turn the filter off from cgroupfs.

# mkdir cg2/nosockets/sockets
# /home/luto/apps/linux/samples/bpf/cgrp_socket_rule cg2/nosockets/sockets/ 1

^^^^ This succeeded, which means that, if this feature is enabled in 4.10,
^^^^ then we're stuck with its semantics.  If it returned -EINVAL instead,
^^^^ there would be a chance to refine it.

# echo $$ >cg2/nosockets/sockets/cgroup.procs
# ping 127.0.0.1
PING 127.0.0.1 (127.0.0.1) 56(84) bytes of data.
64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.029 ms
^C
--- 127.0.0.1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.029/0.029/0.029/0.000 ms

^^^^ Bash was inside a cgroup that disallowed socket creation, but socket
^^^^ creation wasn't disallowed.  This means that the obvious use of socket
^^^^ creation filters in nestable constainers fails insecurely.


There's also a subtle but nasty potential security problem here.
In 4.9 and before, cgroups has only one real effect in the kernel:
resource control. A process in a malicious cgroup could be DoSed,
but that was about the extent of the damage that a malicious cgroup
could do.

In 4.10 with With CONFIG_CGROUP_BPF=y, a cgroup can have bpf
programs attached that can do things if various events occur. (Right
now, this means socket operations, but there are plans in the works
to do this for LSM hooks too.) These bpf programs can say yes or no,
but they can also read out various data (including socket payloads!)
and save them away where an attacker can find them. This sounds a
lot like seccomp with a narrower scope but a much stronger ability to
exfiltrate private information.

Unfortunately, while seccomp is very, very careful to prevent
injection of a privileged victim into a malicious sandbox, the
CGROUP_BPF mechanism appears to have no real security model. There
is nothing to prevent a program that's in a malicious cgroup from
running a setuid binary, and there is nothing to prevent a program
that has the ability to move itself or another program into a
malicious cgroup from doing so and then, if needed for exploitation,
exec a setuid binary.

This isn't much of a problem yet because you currently need
CAP_NET_ADMIN to create a malicious sandbox in the first place.  I'm
sure that, in the near future, someone will want to make this stuff
work in containers with delegated cgroup hierarchies, and then there
may be a real problem here.


I've included a few security people on this thread.  The current API
looks abusable, and it would be nice to find all the holes before
4.10 comes out.


(The cgrp_socket_rule source is attached.  You can build it by sticking it
 in samples/bpf and doing:

 $ make headers_install
 $ cd samples/bpf
 $ gcc -o cgrp_socket_rule cgrp_socket_rule.c libbpf.c -I../../usr/include
)

--Andy

[-- Attachment #2: cgrp_socket_rule.c --]
[-- Type: text/x-csrc, Size: 1545 bytes --]

/* eBPF example program:
 *
 * - Loads eBPF program
 *
 *   The eBPF program sets the sk_bound_dev_if index in new AF_INET{6}
 *   sockets opened by processes in the cgroup.
 *
 * - Attaches the new program to a cgroup using BPF_PROG_ATTACH
 */

#define _GNU_SOURCE

#include <stdio.h>
#include <stdlib.h>
#include <stddef.h>
#include <string.h>
#include <unistd.h>
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <net/if.h>
#include <linux/bpf.h>

#include "libbpf.h"

static int prog_load(int value)
{
	struct bpf_insn prog[] = {
		BPF_MOV64_IMM(BPF_REG_0, value), /* r0 = verdict */
		BPF_EXIT_INSN(),
	};

	return bpf_prog_load(BPF_PROG_TYPE_CGROUP_SOCK, prog, sizeof(prog),
			     "GPL", 0);
}

static int usage(const char *argv0)
{
	printf("Usage: %s cg-path value\n", argv0);
	return EXIT_FAILURE;
}

int main(int argc, char **argv)
{
	int cg_fd, prog_fd, value, ret;

	if (argc < 2)
		return usage(argv[0]);

	cg_fd = open(argv[1], O_DIRECTORY | O_RDONLY);
	if (cg_fd < 0) {
		printf("Failed to open cgroup path: '%s'\n", strerror(errno));
		return EXIT_FAILURE;
	}

	value = atoi(argv[2]);

	prog_fd = prog_load(value);
	/* printf("Output from kernel verifier:\n%s\n-------\n", bpf_log_buf); */

	if (prog_fd < 0) {
		printf("Failed to load prog: '%s'\n", strerror(errno));
		return EXIT_FAILURE;
	}

	ret = bpf_prog_attach(prog_fd, cg_fd, BPF_CGROUP_INET_SOCK_CREATE);
	if (ret < 0) {
		printf("Failed to attach prog to cgroup: '%s'\n",
		       strerror(errno));
		return EXIT_FAILURE;
	}

	return EXIT_SUCCESS;
}

^ permalink raw reply	[flat|nested] 48+ messages in thread
* (no subject)
@ 2017-01-17  5:18 Andy Lutomirski
  2017-01-18 22:41 ` Potential issues (security and otherwise) with the current cgroup-bpf API Tejun Heo
  0 siblings, 1 reply; 48+ messages in thread
From: Andy Lutomirski @ 2017-01-17  5:18 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Michal Hocko, Peter Zijlstra, David Ahern, Alexei Starovoitov,
	Andy Lutomirski, Daniel Mack, Mickaël Salaün,
	Kees Cook, Jann Horn, David S. Miller, Thomas Graf,
	Michael Kerrisk, Linux API, linux-kernel, Network Development

;; This buffer is for text that is not saved, and for Lisp evaluation.
;; To create a file, visit it with C-x C-f and enter text in its buffer.



On Sun, Jan 15, 2017 at 5:19 PM, Tejun Heo <tj@kernel.org> wrote:
> Hello,
>
> Sorry about the delay.  Some fire fighthing followed the holidays.
>
> On Tue, Jan 03, 2017 at 11:25:59AM +0100, Michal Hocko wrote:
>> > So from what I understand the proposed cgroup is not in fact
>> > hierarchical at all.
>> >
>> > @TJ, I thought you were enforcing all new cgroups to be properly
>> > hierarchical, that would very much include this one.
>>
>> I would be interested in that as well. We have made that mistake in
>> memcg v1 where hierarchy could be disabled for performance reasons and
>> that turned out to be major PITA in the end. Why do we want to repeat
>> the same mistake here?
>
> Across the different threads on this subject, there have been multiple
> explanations but I'll try to sum it up more clearly.
>
> The big issue here is whether this is a cgroup thing or a bpf thing.
> I don't think there's anything inherently wrong with one approach or
> the other.  Forget about the proposed cgroup bpf extentions but thinkg
> about how iptables does cgroups.  Whether it's the netcls/netprio in
> v1 or direct membership matching in v2, it is the network side testing
> for cgroup membership one way or the other.  The only part where
> cgroup is involved in is answering that test.
>

[...]

>
> None of the issues that people have been raising here is actually an
> issue if one thinks of it as a part of bpf.  Its security model is
> exactly the same as any other bpf programs.  Recursive behavior is
> exactly the same as how other external cgroup descendant membership
> testing work.  There is no issue here whatsoever.

After sleeping on this, here are my thoughts:

First, there are three ways to think about this, not just two.  It
could be a BPF feature, a net feature, or a cgroup feature.

I think that calling it a BPF feature is a cop-out.  BPF is an
assembly language and a mechanism for importing little programs into
the kernel.  BPF maps are a BPF feature.  These new hooks are a
feature that actively changes the behavior of other parts of the
kernel.  I don't see how calling this new feature a "BPF" feature
excuses it from playing by the expected rules of the subsystems it
affects or from generally working well with the rest of the kernel.

Perhaps this is a net feature, though, not a cgroup feature.  This
would IMO make a certain amount of sense.  Iptables cgroup matches,
for example, logically are an iptables (i.e., net) feature.  The
problem here is that network configuration (and this explicitly
includes iptables) is done on a per-netns level, whereas these new
hooks entirely ignore network namespaces.  I've pointed out that
trying to enable them in a namespaced world is problematic (e.g.
switching to ns_capable() will cause serious problems), and Alexei
seems to think that this will never happen.  So I don't think we can
really call this a net feature that works the way that other net
features work.

(Suppose that this actually tried to be netns-enabled.  Then you'd
have to map from (netns, cgroup) -> hook, and this would probably be
slow and messy.)

So I think that leaves it being a cgroup feature.  And it definitely
does *not* play by the rules of cgroups right now.

> I'm sure we'll
> eventually get there but from what I hear it isn't something we can
> pull off in a restricted timeframe.

To me, this sounds like "the API isn't that great, we know how to do
better, but we really really want this feature ASAP so we want to ship
it with a sub-par API."  I think that's a bad idea.

> This also holds true for the perf controller.  While it is implemented
> as a controller, it isn't visible to cgroup users in any way and the
> only function it serves is providing the membership test to perf
> subsystem.  perf is the one which decides whether and how it is to be
> used.  cgroup providing membership test to other subsystems is
> completely acceptable and established.

Unless I'm mistaken, "perf_event" is an actual cgroup controller, and
it explicitly respects the cgroup hierarchy.  It shows up in
/proc/cgroups, and I had no problem mounting a cgroupfs instance with
perf_event enabled.  So I'm not sure what you mean.

^ permalink raw reply	[flat|nested] 48+ messages in thread

end of thread, other threads:[~2017-02-04 17:11 UTC | newest]

Thread overview: 48+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-12-17 18:18 Potential issues (security and otherwise) with the current cgroup-bpf API Andy Lutomirski
2016-12-17 19:26 ` Mickaël Salaün
2016-12-17 20:02   ` Andy Lutomirski
2016-12-19 20:56 ` Alexei Starovoitov
2016-12-19 21:23   ` Andy Lutomirski
2016-12-20  0:02     ` Alexei Starovoitov
2016-12-20  0:25       ` Andy Lutomirski
2016-12-20  1:43         ` Andy Lutomirski
2016-12-20  1:44         ` David Ahern
2016-12-20  1:56           ` Andy Lutomirski
2016-12-20  2:52             ` David Ahern
2016-12-20  3:12               ` Andy Lutomirski
2016-12-20  4:44                 ` Alexei Starovoitov
2016-12-20  5:27                   ` Andy Lutomirski
2016-12-20  5:32                     ` Alexei Starovoitov
2016-12-20  9:11             ` Peter Zijlstra
2017-01-03 10:25               ` Michal Hocko
2017-01-16  1:19                 ` Tejun Heo
2017-01-17 13:03                   ` Michal Hocko
2017-01-17 13:32                     ` Peter Zijlstra
2017-01-17 13:58                       ` Michal Hocko
2017-01-17 20:23                         ` Andy Lutomirski
2017-01-18 22:18                         ` Tejun Heo
2017-01-19  9:00                           ` Michal Hocko
2016-12-20  3:18         ` Alexei Starovoitov
2016-12-20  3:50           ` Andy Lutomirski
2016-12-20  4:41             ` Alexei Starovoitov
2016-12-20 10:21             ` Daniel Mack
2016-12-20 17:23               ` Andy Lutomirski
2016-12-20 18:36                 ` Daniel Mack
2016-12-20 18:49                   ` Andy Lutomirski
2016-12-21  4:01                     ` Alexei Starovoitov
2016-12-20  1:34       ` David Miller
2016-12-20  1:40         ` Andy Lutomirski
2016-12-20  4:51           ` Alexei Starovoitov
2016-12-20  5:26             ` Andy Lutomirski
2017-01-17  5:18 Andy Lutomirski
2017-01-18 22:41 ` Potential issues (security and otherwise) with the current cgroup-bpf API Tejun Heo
2017-01-19  0:18   ` Andy Lutomirski
2017-01-19  0:59     ` Tejun Heo
2017-01-19  2:29       ` Andy Lutomirski
2017-01-20  2:39         ` Alexei Starovoitov
2017-01-20  4:04           ` Andy Lutomirski
2017-01-23  4:31             ` Alexei Starovoitov
2017-01-23 20:20               ` Andy Lutomirski
2017-02-03 21:07                 ` Andy Lutomirski
2017-02-03 23:21                   ` Alexei Starovoitov
2017-02-04 17:10                     ` Andy Lutomirski
2017-01-19  1:01     ` Mickaël Salaün

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).