* [RFC 0/4] RFC: Add Checmate, BPF-driven minor LSM
@ 2016-08-04  7:11 Sargun Dhillon
  2016-08-04  8:41 ` Richard Weinberger
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Sargun Dhillon @ 2016-08-04  7:11 UTC (permalink / raw)
  To: linux-kernel; +Cc: alexei.starovoitov, daniel, linux-security-module, netdev

I distributed this patchset to linux-security-module@vger.kernel.org earlier, 
but based on the fact that the archive is down, and this is a fairly 
broad-sweeping proposal, I figured I'd grow the audience a little bit. Sorry
if you received this multiple times.

I've begun building out the skeleton of a Linux Security Module, and I'd like to 
get feedback on it. It's a skeleton, and I've only populated a few hooks, so I'm 
mostly looking for input on the general proposal, interest, and design. It's a 
minor LSM. My particular use case is one in which containers are being 
dynamically deployed to machines by internal developers in a different group. 
The point of Checmate is to act as an extensible bed for _safe_, complex 
security policies. It enables dynamic security policies that can be defined in
C and changed as necessary, without ever having to patch or rebuild the kernel.

For many of these containers, the security policies can be fairly nuanced. One
in particular is network security. Oftentimes, administrators want to prevent
ingress and egress connectivity except to and from a few select IPs. Egress
filtering can be managed using net_cls, but without
modifying running software, it's non-trivial to attach a filter to all sockets 
being created within a container. The inet_conn_request, socket_recvmsg, 
socket_sock_rcv_skb hooks make this trivial to implement. 

Other times, containers need to be throttled in places where there's not really 
a good place to impose that policy for software which isn't built in-house.  If 
one wants to limit file creations/sec, or reject I/O under certain 
characteristics, there's not a great place to do it now. This gives engineers a 
mechanism to write those policies. 
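
As a rough illustration of the file-creation throttle mentioned above, here is
a userspace sketch of a fixed one-second window counter. It is not code from
the patches: struct create_budget, check_file_create() and the idea of keeping
the counter per container are all assumptions made for the example. In a real
BPF program the state would presumably live in a map.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Hypothetical per-container state; a BPF program would keep this in a map. */
struct create_budget {
	uint64_t window_start_ns; /* start of the current one-second window */
	uint32_t used;            /* creations already seen in this window */
	uint32_t limit;           /* creations allowed per window */
};

/* Returns 0 to allow the file creation, -EPERM to reject it. */
int check_file_create(struct create_budget *b, uint64_t now_ns)
{
	assert(b != NULL);

	/* Start a fresh window once a full second has elapsed. */
	if (now_ns - b->window_start_ns >= NSEC_PER_SEC) {
		b->window_start_ns = now_ns;
		b->used = 0;
	}

	if (b->used >= b->limit)
		return -EPERM;

	b->used++;
	return 0;
}
```

A fixed window is the simplest choice; a token bucket would smooth bursts at
the window boundary, at the cost of a little more state.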

This same flexibility can be used to take existing programs and enable safe BPF 
helpers to modify memory to allow rules to pass. One example that I prototyped 
was Docker's port mapping, which has an overhead (DNAT), and there's some loss 
of fidelity in the BSD Socket API to identify what's going on. Instead, we can 
just rewrite the port in a bind, based upon some data in a BPF map, and a cgroup 
match.
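
To make the bind rewrite more concrete, here is a userspace sketch of the
lookup-and-rewrite step. It is only a model of the idea: the port_map_entry
layout, the rewrite_bind_port() name and the use of a numeric cgroup id as the
match key are assumptions for illustration, not what the prototype does. A
plain array stands in for the BPF map.

```c
#include <assert.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/socket.h>

/* Hypothetical map entry: (cgroup, port the app asked for) -> host port. */
struct port_map_entry {
	uint64_t cgroup_id;
	uint16_t requested_port; /* host byte order */
	uint16_t host_port;      /* host byte order */
};

/*
 * Rewrites addr->sin_port in place if an entry matches the caller's cgroup
 * and the requested port. Returns 1 if the port was rewritten, 0 otherwise.
 */
int rewrite_bind_port(const struct port_map_entry *map, size_t n,
		      uint64_t cgroup_id, struct sockaddr_in *addr)
{
	assert(map != NULL && addr != NULL);

	if (addr->sin_family != AF_INET)
		return 0;

	for (size_t i = 0; i < n; i++) {
		if (map[i].cgroup_id == cgroup_id &&
		    map[i].requested_port == ntohs(addr->sin_port)) {
			addr->sin_port = htons(map[i].host_port);
			return 1;
		}
	}
	return 0;
}
```

The application still believes it bound to the port it asked for, which is the
point: no DNAT rule is needed on the data path.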

I can actually see other minor security modules being implemented in Checmate, 
for example, Yama, or the recently proposed Hardchroot could be reimplemented in 
BPF. Potentially, they could even be API compatible.

Although, at first, much of this sounds like seccomp, it's quite different. For
one, what we can do in the security hooks is more complex (access to kernel
pointers). The other side of this is that we can have effects at a system-wide
or cgroup level. This also circumvents the need for CRIU-friendly policies.

Lastly, the flexibility of this mechanism allows for prevention of security
vulnerabilities which are often complex in nature and require the interaction
of multiple hooks (CVE-2014-9717 is a good example), and although ksplice,
and livepatch exist, they're not always easy to use, as compared to loading
a single bpf program across all kernels.

The user-facing API is exposed via prctl, as it's meant to be very simple (at
least the kernel components). It only has three operations. For a given security
hook, you can attach a BPF program to it, which adds it to the set of programs
that are executed when the hook is hit. You can reset a hook, which removes all
programs associated with that hook, and you can set a deny_reset flag on a hook
to prevent anyone from resetting it. It's likely that an individual would want
to set this in any production use case.
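
The semantics of the three operations can be modelled in plain userspace C.
This is a sketch under assumptions, not the prctl interface from the patches:
the struct hook layout, the hook_* function names, the MAX_PROGS limit and the
"first denial wins" rule for combining program results are all made up for the
example. Function pointers stand in for loaded BPF programs.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_PROGS 8 /* arbitrary limit for this sketch */

/* Stand-in for a BPF program: 0 allows, a negative errno denies. */
typedef int (*hook_prog_fn)(const void *ctx);

struct hook {
	hook_prog_fn progs[MAX_PROGS];
	size_t nprogs;
	bool deny_reset;
};

/* Operation 1: add a program to the set run when the hook is hit. */
int hook_attach(struct hook *h, hook_prog_fn prog)
{
	assert(h != NULL && prog != NULL);
	if (h->nprogs == MAX_PROGS)
		return -ENOSPC;
	h->progs[h->nprogs++] = prog;
	return 0;
}

/* Operation 2: remove every program, unless resets have been locked out. */
int hook_reset(struct hook *h)
{
	assert(h != NULL);
	if (h->deny_reset)
		return -EPERM;
	h->nprogs = 0;
	return 0;
}

/* Operation 3: one-way switch that makes later resets fail. */
void hook_set_deny_reset(struct hook *h)
{
	assert(h != NULL);
	h->deny_reset = true;
}

/* Runs the attached programs in order; assumed policy: first denial wins. */
int hook_run(const struct hook *h, const void *ctx)
{
	assert(h != NULL);
	for (size_t i = 0; i < h->nprogs; i++) {
		int ret = h->progs[i](ctx);
		if (ret != 0)
			return ret;
	}
	return 0;
}

/* Two trivial example programs. */
int prog_allow(const void *ctx) { (void)ctx; return 0; }
int prog_deny(const void *ctx)  { (void)ctx; return -EPERM; }
```

Note that deny_reset has no inverse in this model, which matches the intent of
locking a policy in place for production use.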

On the BPF side of it, all that's involved in the work in progress is to
move some of the tracing helpers into the shared helpers. For example,
it's very valuable to have access to current when enforcing a hook.
BPF programs also have access to maps, which somewhat works around
the need for security blobs in some cases.

I would love to know what y'all think.

Sargun Dhillon (4):
  bpf: move tracing helpers to shared helpers
  bpf, security: Add Checmate
  security/checmate: Add Checmate sample
  bpf: Restrict Checmate bpf programs to current kernel ABI

 include/linux/bpf.h              |   2 +
 include/linux/checmate.h         |  38 +++++
 include/uapi/linux/Kbuild        |   1 +
 include/uapi/linux/bpf.h         |   1 +
 include/uapi/linux/checmate.h    |  65 +++++++++
 include/uapi/linux/prctl.h       |   3 +
 kernel/bpf/helpers.c             |  34 +++++
 kernel/bpf/syscall.c             |   2 +-
 kernel/trace/bpf_trace.c         |  33 -----
 samples/bpf/Makefile             |   4 +
 samples/bpf/bpf_load.c           |  11 +-
 samples/bpf/checmate1_kern.c     |  28 ++++
 samples/bpf/checmate1_user.c     |  54 +++++++
 security/Kconfig                 |   1 +
 security/Makefile                |   2 +
 security/checmate/Kconfig        |   6 +
 security/checmate/Makefile       |   3 +
 security/checmate/checmate_bpf.c |  67 +++++++++
 security/checmate/checmate_lsm.c | 304 +++++++++++++++++++++++++++++++++++++++
 19 files changed, 622 insertions(+), 37 deletions(-)
 create mode 100644 include/linux/checmate.h
 create mode 100644 include/uapi/linux/checmate.h
 create mode 100644 samples/bpf/checmate1_kern.c
 create mode 100644 samples/bpf/checmate1_user.c
 create mode 100644 security/checmate/Kconfig
 create mode 100644 security/checmate/Makefile
 create mode 100644 security/checmate/checmate_bpf.c
 create mode 100644 security/checmate/checmate_lsm.c

-- 
2.7.4


* Re: [RFC 0/4] RFC: Add Checmate, BPF-driven minor LSM
  2016-08-04  7:11 [RFC 0/4] RFC: Add Checmate, BPF-driven minor LSM Sargun Dhillon
@ 2016-08-04  8:41 ` Richard Weinberger
  2016-08-04  9:24   ` Sargun Dhillon
  2016-08-04  9:45 ` Daniel Borkmann
  2016-08-08 23:44 ` Kees Cook
  2 siblings, 1 reply; 12+ messages in thread
From: Richard Weinberger @ 2016-08-04  8:41 UTC (permalink / raw)
  To: Sargun Dhillon; +Cc: LKML, alexei.starovoitov, Daniel Borkmann, LSM, netdev

Sargun,

On Thu, Aug 4, 2016 at 9:11 AM, Sargun Dhillon <sargun@sargun.me> wrote:
> I distributed this patchset to linux-security-module@vger.kernel.org earlier,
> but based on the fact that the archive is down, and this is a fairly
> broad-sweeping proposal, I figured I'd grow the audience a little bit. Sorry
> if you received this multiple times.
>
> I've begun building out the skeleton of a Linux Security Module, and I'd like to
> get feedback on it. It's a skeleton, and I've only populated a few hooks, so I'm
> mostly looking for input on the general proposal, interest, and design. It's a
> minor LSM. My particular use case is one in which containers are being
> dynamically deployed to machines by internal developers in a different group.
> The point of Checmate is to act as an extensible bed for _safe_, complex
> security policies. It's nice to enable dynamic security policies that can be
> defined in C, and change as necessary, without ever having to patch, or rebuild
> the kernel.
>
> For many of these containers, the security policies can be fairly nuanced. One
> particular one to take into account is network security. Often times,
> administrators want to prevent ingress, and egress connectivity except from a
> few select IPs. Egress filtering can be managed using net_cls, but without
> modifying running software, it's non-trivial to attach a filter to all sockets
> being created within a container. The inet_conn_request, socket_recvmsg,
> socket_sock_rcv_skb hooks make this trivial to implement.

What is wrong with having firewall rules per container?
Either by matching the container IP or an interface...

> Other times, containers need to be throttled in places where there's not really
> a good place to impose that policy for software which isn't built in-house.  If
> one wants to limit file creations/sec, or reject I/O under certain
> characteristics, there's not a great place to do it now. This gives engineers a
> mechanism to write those policies.

Hmm, not sure if resource control is something we want to do with an LSM.

> This same flexibility can be used to take existing programs and enable safe BPF
> helpers to modify memory to allow rules to pass. One example that I prototyped
> was Docker's port mapping, which has an overhead (DNAT), and there's some loss
> of fidelity in the BSD Socket API to identify what's going on. Instead, we can
> just rewrite the port in a bind, based upon some data in a BPF map, and a cgroup
> match.
>
> I can actually see other minor security modules being implemented in Checmate,
> for example, Yama, or the recently proposed Hardchroot could be reimplemented in
> BPF. Potentially, they could even be API compatible.
>
> Although, at first, much of this sounds like seccomp, it's quite different. For
> one, what we can do in the security hooks is more complex (access to kernel
> pointers). The other side of this is we can have effects on a system-wide,
> or cgroup level. This also circumvents the need for CRIU-friendly policies.

It is like seccomp except that you have a single rule set and target LSM hooks
instead of syscalls, right?

-- 
Thanks,
//richard


* Re: [RFC 0/4] RFC: Add Checmate, BPF-driven minor LSM
  2016-08-04  8:41 ` Richard Weinberger
@ 2016-08-04  9:24   ` Sargun Dhillon
  0 siblings, 0 replies; 12+ messages in thread
From: Sargun Dhillon @ 2016-08-04  9:24 UTC (permalink / raw)
  To: Richard Weinberger; +Cc: LKML, alexei.starovoitov, Daniel Borkmann, LSM, netdev

On Thu, Aug 04, 2016 at 10:41:17AM +0200, Richard Weinberger wrote:
> Sargun,
> 
> On Thu, Aug 4, 2016 at 9:11 AM, Sargun Dhillon <sargun@sargun.me> wrote:
> > I distributed this patchset to linux-security-module@vger.kernel.org earlier,
> > but based on the fact that the archive is down, and this is a fairly
> > broad-sweeping proposal, I figured I'd grow the audience a little bit. Sorry
> > if you received this multiple times.
> >
> > I've begun building out the skeleton of a Linux Security Module, and I'd like to
> > get feedback on it. It's a skeleton, and I've only populated a few hooks, so I'm
> > mostly looking for input on the general proposal, interest, and design. It's a
> > minor LSM. My particular use case is one in which containers are being
> > dynamically deployed to machines by internal developers in a different group.
> > The point of Checmate is to act as an extensible bed for _safe_, complex
> > security policies. It's nice to enable dynamic security policies that can be
> > defined in C, and change as necessary, without ever having to patch, or rebuild
> > the kernel.
> >
> > For many of these containers, the security policies can be fairly nuanced. One
> > particular one to take into account is network security. Often times,
> > administrators want to prevent ingress, and egress connectivity except from a
> > few select IPs. Egress filtering can be managed using net_cls, but without
> > modifying running software, it's non-trivial to attach a filter to all sockets
> > being created within a container. The inet_conn_request, socket_recvmsg,
> > socket_sock_rcv_skb hooks make this trivial to implement.
> 
> What is wrong with having firewall rules per container?
> Either by matching the container IP or an interface...
> 
This requires infrastructure that's not always available. For one, this approach 
typically requires a network namespace per container, and therefore a dedicated 
IP. It's pretty common [1][2] to not have an IP/container solution, nor a 
network namespace per container solution. The alternatives that keep a network
namespace without an IP per container typically involve bifurcating traffic
using TC mirred actions, and friends. This isn't great for debuggability. Twitter
does this with their Mesos network isolator [3]. Cgroups / net_cls is great for 
egress traffic, but not ingress.

> > Other times, containers need to be throttled in places where there's not really
> > a good place to impose that policy for software which isn't built in-house.  If
> > one wants to limit file creations/sec, or reject I/O under certain
> > characteristics, there's not a great place to do it now. This gives engineers a
> > mechanism to write those policies.
> 
> Hmm, not sure if resource control is something we want to do with an LSM.
> 
This is just an example I brought up. I know of a fairly large security vendor 
that has abuse "patterns", and locks software down if it looks "abusive". They 
do it for VMs, but it'd be nice to do something similar for containers.

> > This same flexibility can be used to take existing programs and enable safe BPF
> > helpers to modify memory to allow rules to pass. One example that I prototyped
> > was Docker's port mapping, which has an overhead (DNAT), and there's some loss
> > of fidelity in the BSD Socket API to identify what's going on. Instead, we can
> > just rewrite the port in a bind, based upon some data in a BPF map, and a cgroup
> > match.
> >
> > I can actually see other minor security modules being implemented in Checmate,
> > for example, Yama, or the recently proposed Hardchroot could be reimplemented in
> > BPF. Potentially, they could even be API compatible.
> >
> > Although, at first, much of this sounds like seccomp, it's quite different. For
> > one, what we can do in the security hooks is more complex (access to kernel
> > pointers). The other side of this is we can have effects on a system-wide,
> > or cgroup level. This also circumvents the need for CRIU-friendly policies.
> 
> It is like seccomp except that you have a single rule set and target LSM hooks
> instead of syscalls, right?
You're right, it's very similar. I like to think of Checmate as nftables for 
syscalls.

It turns out having this on LSM hooks is a very big difference. Since LSM hooks 
are executed after data is copied to the kernel, you can safely dereference 
pointers and inspect the user's intentions. In one of the attached patches, I 
block traffic to AF_INET, port 1 -- there's no way to do that with seccomp(-bpf) 
today. This could also be used for things like filesystem path based filtering.
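
For readers who don't want to open the sample patch, the check amounts to
something like the following. This is a userspace sketch, not the sample
itself: deny_connect_port1() is a hypothetical name, and it takes the address
and length directly rather than a hook context. The key point is that the
sockaddr has already been copied into kernel memory when the hook runs, so it
can be inspected safely.

```c
#include <assert.h>
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>

/* Returns -EPERM for an AF_INET connect to port 1, 0 (allow) otherwise. */
int deny_connect_port1(const struct sockaddr *address, int addrlen)
{
	struct sockaddr_in in;

	assert(address != NULL);

	/* Too short to be a sockaddr_in: not ours to judge, allow. */
	if (addrlen < (int)sizeof(in))
		return 0;

	/* Copy out instead of casting, to stay clear of alignment issues. */
	memcpy(&in, address, sizeof(in));

	if (in.sin_family != AF_INET)
		return 0;

	return ntohs(in.sin_port) == 1 ? -EPERM : 0;
}
```

A seccomp filter only sees the raw pointer argument to connect(), so it cannot
make this decision without racing against userspace.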

You also have full access to the gamut of eBPF, as opposed to seccomp's cBPF.
This allows you to do a variety of things, like write your programs in C and
compile them down to BPF via LLVM. You also have access to maps to share
information between programs, and tail calls to chain together policies. seccomp
cannot easily do this because of the checkpoint requirement [4].

> 
> -- 
> Thanks,
> //richard

[1] https://docs.docker.com/engine/userguide/networking/dockernetworks/
[2] http://research.google.com/pubs/pub41684.html (Google Omega)
[3] http://mesos.readthedocs.io/en/0.22.2/network-monitoring/
[4] https://lwn.net/Articles/658422/


* Re: [RFC 0/4] RFC: Add Checmate, BPF-driven minor LSM
  2016-08-04  7:11 [RFC 0/4] RFC: Add Checmate, BPF-driven minor LSM Sargun Dhillon
  2016-08-04  8:41 ` Richard Weinberger
@ 2016-08-04  9:45 ` Daniel Borkmann
  2016-08-04 10:12   ` Sargun Dhillon
  2016-08-08 23:44 ` Kees Cook
  2 siblings, 1 reply; 12+ messages in thread
From: Daniel Borkmann @ 2016-08-04  9:45 UTC (permalink / raw)
  To: Sargun Dhillon, linux-kernel
  Cc: alexei.starovoitov, linux-security-module, netdev

Hi Sargun,

On 08/04/2016 09:11 AM, Sargun Dhillon wrote:
[...]
> [It's a] minor LSM. My particular use case is one in which containers are being
> dynamically deployed to machines by internal developers in a different group.
[...]
> For many of these containers, the security policies can be fairly nuanced. One
> particular one to take into account is network security. Often times,
> administrators want to prevent ingress, and egress connectivity except from a
> few select IPs. Egress filtering can be managed using net_cls, but without
> modifying running software, it's non-trivial to attach a filter to all sockets
> being created within a container. The inet_conn_request, socket_recvmsg,
> socket_sock_rcv_skb hooks make this trivial to implement.

I'm not too familiar with LSMs, but afaik, when you install such policies they
are effectively global, right? How would you install/manage such policies per
container?

On a quick glance, this would then be the job of the BPF proglet from the global
hook, no? If yes, then the BPF contexts the BPF prog works with seem rather quite
limited ...

+struct checmate_file_open_ctx {
+	struct file *file;
+	const struct cred *cred;
+};
+
+struct checmate_task_create_ctx {
+	unsigned long clone_flags;
+};
+
+struct checmate_task_free_ctx {
+	struct task_struct *task;
+};
+
+struct checmate_socket_connect_ctx {
+	struct socket *sock;
+	struct sockaddr *address;
+	int addrlen;
+};

... or are you using bpf_probe_read() in some way to walk 'current' to retrieve
a namespace from there somehow? Like via nsproxy? But how do you make sense of
this for defining a per container policy?

Do you see a way where we don't need to define so many different ctx each time?

My other concern from a security PoV is that when using things like bpf_probe_read()
we're dependent on kernel structs, and there's a risk that when people migrate such
policies, expectations break because the underlying structs have changed. I see you've
addressed that in patch 4 to place a small stone in the way, yeah kinda works. It's
mostly a reminder that this is not stable ABI.

Thanks,
Daniel


* Re: [RFC 0/4] RFC: Add Checmate, BPF-driven minor LSM
  2016-08-04  9:45 ` Daniel Borkmann
@ 2016-08-04 10:12   ` Sargun Dhillon
  0 siblings, 0 replies; 12+ messages in thread
From: Sargun Dhillon @ 2016-08-04 10:12 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: linux-kernel, alexei.starovoitov, linux-security-module, netdev

On Thu, Aug 04, 2016 at 11:45:08AM +0200, Daniel Borkmann wrote:
> Hi Sargun,
> 
> On 08/04/2016 09:11 AM, Sargun Dhillon wrote:
> [...]
> >[It's a] minor LSM. My particular use case is one in which containers are being
> >dynamically deployed to machines by internal developers in a different group.
> [...]
> >For many of these containers, the security policies can be fairly nuanced. One
> >particular one to take into account is network security. Often times,
> >administrators want to prevent ingress, and egress connectivity except from a
> >few select IPs. Egress filtering can be managed using net_cls, but without
> >modifying running software, it's non-trivial to attach a filter to all sockets
> >being created within a container. The inet_conn_request, socket_recvmsg,
> >socket_sock_rcv_skb hooks make this trivial to implement.
> 
> I'm not too familiar with LSMs, but afaik, when you install such policies they
> are effectively global, right? How would you install/manage such policies per
> container?
> 
> On a quick glance, this would then be the job of the BPF proglet from the global
> hook, no? If yes, then the BPF contexts the BPF prog works with seem rather quite
> limited ...
You're right. They are global hooks. If you want the policy to be specific to
a given cgroup or namespace, you have to introduce a level of indirection
through a prog_array, or some such. There are still cases (the CVE and the
Docker bind rewrite) where you want global isolation. The other big aspect is
being able to implement application-specific LSMs without requiring kmods (a la
hardchroot).

> 
> +struct checmate_file_open_ctx {
> +	struct file *file;
> +	const struct cred *cred;
> +};
> +
> +struct checmate_task_create_ctx {
> +	unsigned long clone_flags;
> +};
> +
> +struct checmate_task_free_ctx {
> +	struct task_struct *task;
> +};
> +
> +struct checmate_socket_connect_ctx {
> +	struct socket *sock;
> +	struct sockaddr *address;
> +	int addrlen;
> +};
> 
> ... or are you using bpf_probe_read() in some way to walk 'current' to retrieve
> a namespace from there somehow? Like via nsproxy? But how you make sense of this
> for defining a per container policy?
In my prototype code, I'm using uts namespace + hostname, and I'm extracting 
that via the bpf_probe_read walk. You're right, that's less than awesome. In the 
longer-term, I'd hope we'd be able to add a helper like bpf_current_in_cgroup (a 
la bpf_skb_in_cgroup). The idea is that we'd add enough helpers to avoid this. I 
can submit some more example BPF programs if that'd help.  Off the top of my 
head:

* current_in_cgroup 
* introduce struct pid map 
* introduce helpers to inspect common datatypes passed to the helper -- if you 
  look at something like the net hooks, there aren't actually that many
  datatypes being passed around
* Introduce an example top-level cgroup that maps cgroup -> tail_call into
  other programs
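
To sketch the last item, the top-level program would do little more than a
lookup and a jump. The following userspace model uses a table of function
pointers where the real thing would use a prog_array and a tail call. All of
the names here (cgroup_policy, dispatch_by_cgroup, the example policies) are
hypothetical, and "allow when no entry matches" is an assumed default.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for a per-cgroup BPF program: 0 allows, negative errno denies. */
typedef int (*policy_fn)(const void *ctx);

/* Hypothetical table entry; models one slot of a cgroup-keyed prog_array. */
struct cgroup_policy {
	uint64_t cgroup_id;
	policy_fn policy;
};

/*
 * Top-level hook: find the policy registered for the current cgroup and
 * hand control to it, as a tail call would. No match means allow.
 */
int dispatch_by_cgroup(const struct cgroup_policy *table, size_t n,
		       uint64_t current_cgroup, const void *ctx)
{
	assert(table != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		if (table[i].cgroup_id == current_cgroup)
			return table[i].policy(ctx);
	}
	return 0;
}

/* Two trivial example policies. */
int policy_deny_all(const void *ctx)  { (void)ctx; return -EPERM; }
int policy_allow_all(const void *ctx) { (void)ctx; return 0; }
```

With a helper like current_in_cgroup, the lookup key would come from the
kernel rather than from a bpf_probe_read walk.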

> 
> Do you see a way where we don't need to define so many different ctx each time?
> 
> My other concern from a security PoV is that when using things like bpf_probe_read()
> we're dependent on kernel structs and there's a risk that when people migrate such
> policies that expectations break due to underlying structs changed. I see you've
> addressed that in patch 4 to place a small stone in the way, yeah kinda works. It's
> mostly a reminder that this is not stable ABI.
> 
> Thanks,
> Daniel


* Re: [RFC 0/4] RFC: Add Checmate, BPF-driven minor LSM
  2016-08-04  7:11 [RFC 0/4] RFC: Add Checmate, BPF-driven minor LSM Sargun Dhillon
  2016-08-04  8:41 ` Richard Weinberger
  2016-08-04  9:45 ` Daniel Borkmann
@ 2016-08-08 23:44 ` Kees Cook
  2016-08-09  0:00   ` Sargun Dhillon
  2 siblings, 1 reply; 12+ messages in thread
From: Kees Cook @ 2016-08-08 23:44 UTC (permalink / raw)
  To: Sargun Dhillon
  Cc: LKML, Alexei Starovoitov, Daniel Borkmann, linux-security-module,
	Network Development, Reshetova, Elena

On Thu, Aug 4, 2016 at 12:11 AM, Sargun Dhillon <sargun@sargun.me> wrote:
> I distributed this patchset to linux-security-module@vger.kernel.org earlier,
> but based on the fact that the archive is down, and this is a fairly
> broad-sweeping proposal, I figured I'd grow the audience a little bit. Sorry
> if you received this multiple times.
>
> I've begun building out the skeleton of a Linux Security Module, and I'd like to
> get feedback on it. It's a skeleton, and I've only populated a few hooks, so I'm
> mostly looking for input on the general proposal, interest, and design. It's a
> minor LSM. My particular use case is one in which containers are being
> dynamically deployed to machines by internal developers in a different group.
> The point of Checmate is to act as an extensible bed for _safe_, complex
> security policies. It's nice to enable dynamic security policies that can be
> defined in C, and change as necessary, without ever having to patch, or rebuild
> the kernel.
>
> For many of these containers, the security policies can be fairly nuanced. One
> particular one to take into account is network security. Often times,
> administrators want to prevent ingress, and egress connectivity except from a
> few select IPs. Egress filtering can be managed using net_cls, but without
> modifying running software, it's non-trivial to attach a filter to all sockets
> being created within a container. The inet_conn_request, socket_recvmsg,
> socket_sock_rcv_skb hooks make this trivial to implement.
>
> Other times, containers need to be throttled in places where there's not really
> a good place to impose that policy for software which isn't built in-house.  If
> one wants to limit file creations/sec, or reject I/O under certain
> characteristics, there's not a great place to do it now. This gives engineers a
> mechanism to write those policies.
>
> This same flexibility can be used to take existing programs and enable safe BPF
> helpers to modify memory to allow rules to pass. One example that I prototyped
> was Docker's port mapping, which has an overhead (DNAT), and there's some loss
> of fidelity in the BSD Socket API to identify what's going on. Instead, we can
> just rewrite the port in a bind, based upon some data in a BPF map, and a cgroup
> match.
>
> I can actually see other minor security modules being implemented in Checmate,
> for example, Yama, or the recently proposed Hardchroot could be reimplemented in
> BPF. Potentially, they could even be API compatible.
>
> Although, at first, much of this sounds like seccomp, it's quite different. For
> one, what we can do in the security hooks is more complex (access to kernel
> pointers). The other side of this is we can have effects on a system-wide,
> or cgroup level. This also circumvents the need for CRIU-friendly policies.
>
> Lastly, the flexibility of this mechanism allows for prevention of security
> vulnerabilities which are often complex in nature and require the interaction
> of multiple hooks (CVE-2014-9717 is a good example), and although ksplice,
> and livepatch exist, they're not always easy to use, as compared to loading
> a single bpf program across all kernels.
>
> The user-facing API is exposed via prctl as it's meant to be very simple (at
> least the kernel components). It only has three operations. For a given security
> hook, you can attach a BPF program to it, which will add it to the set of
> programs that are executed when the hook is hit. You can reset a hook,
> which removes all programs associated with a given hook, and you can set a
> deny_reset flag on a hook to prevent anyone from resetting it. It's likely that
> an individual would want to set this in any production use case.

One fairly serious problem that seccomp had to overcome was dealing
with exec+setuid in the face of an attacker. The main example is "what
if we refuse to allow a program to drop privileges via a filter rule?"
For seccomp, no-new-privs was introduced for non-root users of
seccomp. Programmatic syscall (or LSM) filters need to deal with this,
and it's a bit ungainly. :)

Also, if you have a prctl API that already has 3 operations, you might
want to use a new syscall anyway. :)

> On the BPF side of it, all that's involved in the work in progress is to
> move some of the tracing helpers into the shared helpers. For example,
> it's very valuable to have access to current when enforcing a hook.
> BPF programs also have access to maps, which somewhat works around
> the need for security blobs in some cases.

Just from a compatibility perspective, doesn't this end up exposing
kernel structures to userspace? What happens when the structures
change?

And from a security perspective, programmatic examination of kernel
structures means you can trivially leak kernel memory locations and
contents. Resisting these sorts of leaks needs to be addressed too.

This looks like a subset of kprobes but available to non-root users,
which looks rather scary to me at first glance. :)

-Kees

>
> I would love to know what y'all think.
>
> Sargun Dhillon (4):
>   bpf: move tracing helpers to shared helpers
>   bpf, security: Add Checmate
>   security/checmate: Add Checmate sample
>   bpf: Restrict Checmate bpf programs to current kernel ABI
>
>  include/linux/bpf.h              |   2 +
>  include/linux/checmate.h         |  38 +++++
>  include/uapi/linux/Kbuild        |   1 +
>  include/uapi/linux/bpf.h         |   1 +
>  include/uapi/linux/checmate.h    |  65 +++++++++
>  include/uapi/linux/prctl.h       |   3 +
>  kernel/bpf/helpers.c             |  34 +++++
>  kernel/bpf/syscall.c             |   2 +-
>  kernel/trace/bpf_trace.c         |  33 -----
>  samples/bpf/Makefile             |   4 +
>  samples/bpf/bpf_load.c           |  11 +-
>  samples/bpf/checmate1_kern.c     |  28 ++++
>  samples/bpf/checmate1_user.c     |  54 +++++++
>  security/Kconfig                 |   1 +
>  security/Makefile                |   2 +
>  security/checmate/Kconfig        |   6 +
>  security/checmate/Makefile       |   3 +
>  security/checmate/checmate_bpf.c |  67 +++++++++
>  security/checmate/checmate_lsm.c | 304 +++++++++++++++++++++++++++++++++++++++
>  19 files changed, 622 insertions(+), 37 deletions(-)
>  create mode 100644 include/linux/checmate.h
>  create mode 100644 include/uapi/linux/checmate.h
>  create mode 100644 samples/bpf/checmate1_kern.c
>  create mode 100644 samples/bpf/checmate1_user.c
>  create mode 100644 security/checmate/Kconfig
>  create mode 100644 security/checmate/Makefile
>  create mode 100644 security/checmate/checmate_bpf.c
>  create mode 100644 security/checmate/checmate_lsm.c
>
> --
> 2.7.4
>



-- 
Kees Cook
Nexus Security


* Re: [RFC 0/4] RFC: Add Checmate, BPF-driven minor LSM
  2016-08-08 23:44 ` Kees Cook
@ 2016-08-09  0:00   ` Sargun Dhillon
  2016-08-09  0:22     ` Kees Cook
  0 siblings, 1 reply; 12+ messages in thread
From: Sargun Dhillon @ 2016-08-09  0:00 UTC (permalink / raw)
  To: Kees Cook
  Cc: LKML, Alexei Starovoitov, Daniel Borkmann, linux-security-module,
	Network Development, Reshetova, Elena

On Mon, Aug 08, 2016 at 04:44:02PM -0700, Kees Cook wrote:
> On Thu, Aug 4, 2016 at 12:11 AM, Sargun Dhillon <sargun@sargun.me> wrote:
> > I distributed this patchset to linux-security-module@vger.kernel.org earlier,
> > but based on the fact that the archive is down, and this is a fairly
> > broad-sweeping proposal, I figured I'd grow the audience a little bit. Sorry
> > if you received this multiple times.
> >
> > I've begun building out the skeleton of a Linux Security Module, and I'd like to
> > get feedback on it. It's a skeleton, and I've only populated a few hooks, so I'm
> > mostly looking for input on the general proposal, interest, and design. It's a
> > minor LSM. My particular use case is one in which containers are being
> > dynamically deployed to machines by internal developers in a different group.
> > The point of Checmate is to act as an extensible bed for _safe_, complex
> > security policies. It's nice to enable dynamic security policies that can be
> > defined in C, and change as necessary, without ever having to patch, or rebuild
> > the kernel.
> >
> > For many of these containers, the security policies can be fairly nuanced. One
> > particular one to take into account is network security. Often times,
> > administrators want to prevent ingress, and egress connectivity except from a
> > few select IPs. Egress filtering can be managed using net_cls, but without
> > modifying running software, it's non-trivial to attach a filter to all sockets
> > being created within a container. The inet_conn_request, socket_recvmsg,
> > socket_sock_rcv_skb hooks make this trivial to implement.
> >
> > Other times, containers need to be throttled in places where there's not really
> > a good place to impose that policy for software which isn't built in-house.  If
> > one wants to limit file creations/sec, or reject I/O under certain
> > characteristics, there's not a great place to do it now. This gives engineers a
> > mechanism to write those policies.
> >
> > This same flexibility can be used to take existing programs and enable safe BPF
> > helpers to modify memory to allow rules to pass. One example that I prototyped
> > was Docker's port mapping, which has an overhead (DNAT), and there's some loss
> > of fidelity in the BSD Socket API to identify what's going on. Instead, we can
> > just rewrite the port in a bind, based upon some data in a BPF map, and a cgroup
> > match.
> >
> > I can actually see other minor security modules being implemented in Checmate,
> > for example, Yama, or the recently proposed Hardchroot could be reimplemented in
> > BPF. Potentially, they could even be API compatible.
> >
> > Although, at first, much of this sounds like seccomp, it's quite different. For
> > one, what we can do in the security hooks is more complex (access to kernel
> > pointers). The other side of this is we can have effects on a system-wide,
> > or cgroup level. This also circumvents the need for CRIU-friendly policies.
> >
> > Lastly, the flexibility of this mechanism allows for prevention of security
> > vulnerabilities which are often complex in nature and require the interaction
> > of multiple hooks (CVE-2014-9717 is a good example), and although ksplice,
> > and livepatch exist, they're not always easy to use, as compared to loading
> > a single bpf program across all kernels.
> >
> > The user-facing API is exposed via prctl as it's meant to be very simple (at
> > least the kernel components). It only has three operations. For a given security
> > hook, you can attach a BPF program to it, which will add it to the set of
> > programs that are executed over when the hook is hit. You can reset a hook,
> > which removes all program associated with a given hook, and you can set a
> > deny_reset flag on a hook to prevent anyone from resetting it. It's likely that
> > an individual would want to set this in any production use case.
> 
> One fairly serious problem that seccomp had to overcome was dealing
> with exec+setuid in the face of an attacker. The main example is "what
> if we refuse to allow a program to drop privileges via a filter rule?"
> For seccomp, no-new-privs was introduced for non-root users of
> seccomp. Programmatic syscall (or LSM) filters need to deal with this,
> and it's a bit ungainly. :)
> 
Couldn't someone do the same with SELinux or AppArmor?

> Also, if you have a prctl API that already has 3 operations, you might
> want to use a new syscall anyway. :)
> 
Looking at other LSMs, they appear to expose their API via a virtual filesystem
or prctl. I followed the model of Yama. I think there may be two more operations
(detach a program, and mark a hook as append-only / read-only / disabled). It
seems like overkill to implement my own syscall.
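
As a rough illustration of the three operations from the cover letter (attach,
reset, deny_reset), here is a minimal userspace model in C. It is not the code
from the patchset: struct hook, hook_attach, hook_reset, hook_deny_reset and
hook_run are hypothetical names, a plain function pointer stands in for a BPF
program, and the "first program that denies wins" ordering is an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_PROGS 8   /* arbitrary cap chosen for this sketch */

/* Stand-in for a BPF program: returns 0 to allow, nonzero to deny. */
typedef int (*prog_fn)(const void *ctx);

struct hook {
    prog_fn progs[MAX_PROGS];
    size_t  nprogs;
    bool    deny_reset;   /* once set, hook_reset() is refused */
};

/* Two trivial example "programs". */
static int allow_all(const void *ctx) { (void)ctx; return 0; }
static int deny_all(const void *ctx)  { (void)ctx; return -EPERM; }

/* Attach: append a program to the set run when the hook is hit. */
static int hook_attach(struct hook *h, prog_fn p)
{
    assert(h != NULL && p != NULL);
    if (h->nprogs == MAX_PROGS)
        return -ENOSPC;
    h->progs[h->nprogs++] = p;
    return 0;
}

/* Reset: remove every program, unless deny_reset has been set. */
static int hook_reset(struct hook *h)
{
    if (h->deny_reset)
        return -EPERM;
    h->nprogs = 0;
    return 0;
}

/* deny_reset: a one-way latch; this sketch offers no way to clear it. */
static void hook_deny_reset(struct hook *h)
{
    h->deny_reset = true;
}

/* Run programs in attach order; the first nonzero result denies. */
static int hook_run(const struct hook *h, const void *ctx)
{
    for (size_t i = 0; i < h->nprogs; i++) {
        int rc = h->progs[i](ctx);
        if (rc != 0)
            return rc;
    }
    return 0;
}
```

In this model a production setup would attach its programs and then call
hook_deny_reset(), after which hook_reset() fails with -EPERM.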

> > On the BPF side of it, all that's involved in the work in progress is to
> > move some of the tracing helpers into the shared helpers. For example,
> > it's very valuable to have access to current when enforcing a hook.
> > BPF programs also have access to maps, which somewhat works around
> > the need for security blobs in some cases.
> 
> Just from a compatibility perspective, doesn't this end up exposing
> kernel structures to userspace? What happens when the structures
> change?
> 
I wouldn't consider BPF userspace. Although it executes in the kernel, I 
wouldn't really consider it kernel space either as it's restricted to safe 
operations.

As far as addressing this issue -- a significant part of the LSM hooks API is 
tied to the syscall, giving stability to those data structures. If you look at 
the API itself, a significant part of it has been untouched for 3+ years, and 
it's been even longer since there has been an API-breaking change. On the other 
hand, the developer has the ability to perform arbitrary reads of kernel space 
using bpf_probe_read.

This is addressed in the 4th patch, which requires that the BPF program be 
compiled against the current kernel version. The userspace policy orchestration 
code should recompile the BPF program on the fly to match the current kernel's 
data structures. There's a certain level of rope here given to the operator,
and it's expected that they use it carefully. Similarly, folks could load
kprobes, kmods, and other programs that have the same issues.
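
To make the idea concrete, a load-time version gate could look roughly like the
sketch below. This is a standalone illustration under stated assumptions, not
the code in patch 4: make_version and check_kern_version are hypothetical
names, and the encoding simply mirrors the (major << 16) + (minor << 8) + patch
layout that the kernel's KERNEL_VERSION() macro has traditionally used.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Pack a version the way KERNEL_VERSION(a, b, c) traditionally does. */
static uint32_t make_version(uint32_t major, uint32_t minor, uint32_t patch)
{
    /* Each of minor and patch must fit in one byte for this encoding. */
    assert(minor < 256 && patch < 256);
    return (major << 16) + (minor << 8) + patch;
}

/*
 * Hypothetical load-time check: refuse a program whose embedded version
 * differs from the running kernel's, since struct layouts may not match.
 */
static int check_kern_version(uint32_t prog_version, uint32_t running_version)
{
    if (prog_version != running_version)
        return -EINVAL;
    return 0;
}
```

With a check like this, the orchestration code has to rebuild the program for
each kernel it targets; a stale object is rejected instead of silently reading
the wrong offsets.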

> And from a security perspective, programmatic examination of kernel
> structures means you can trivially leak kernel memory locations and
> contents. Resisting these sorts of leaks needs to be addressed too.
> 
I'm not sure that unintentional exfiltration of kernel memory locations is 
possible. You may be able to do it via a BPF map or something similar (logging). 
What kinds of attacks are you thinking about specifically?

> This looks like a subset of kprobes but available to non-root users,
> which looks rather scary to me at first glance. :)
You need CAP_SYS_ADMIN to touch this. The people who hold it are the same ones 
who control SELinux and AppArmor.

> 
> -Kees
> 
> >
> > I would love to know what y'all think.
> >
> > Sargun Dhillon (4):
> >   bpf: move tracing helpers to shared helpers
> >   bpf, security: Add Checmate
> >   security/checmate: Add Checmate sample
> >   bpf: Restrict Checmate bpf programs to current kernel ABI
> >
> >  include/linux/bpf.h              |   2 +
> >  include/linux/checmate.h         |  38 +++++
> >  include/uapi/linux/Kbuild        |   1 +
> >  include/uapi/linux/bpf.h         |   1 +
> >  include/uapi/linux/checmate.h    |  65 +++++++++
> >  include/uapi/linux/prctl.h       |   3 +
> >  kernel/bpf/helpers.c             |  34 +++++
> >  kernel/bpf/syscall.c             |   2 +-
> >  kernel/trace/bpf_trace.c         |  33 -----
> >  samples/bpf/Makefile             |   4 +
> >  samples/bpf/bpf_load.c           |  11 +-
> >  samples/bpf/checmate1_kern.c     |  28 ++++
> >  samples/bpf/checmate1_user.c     |  54 +++++++
> >  security/Kconfig                 |   1 +
> >  security/Makefile                |   2 +
> >  security/checmate/Kconfig        |   6 +
> >  security/checmate/Makefile       |   3 +
> >  security/checmate/checmate_bpf.c |  67 +++++++++
> >  security/checmate/checmate_lsm.c | 304 +++++++++++++++++++++++++++++++++++++++
> >  19 files changed, 622 insertions(+), 37 deletions(-)
> >  create mode 100644 include/linux/checmate.h
> >  create mode 100644 include/uapi/linux/checmate.h
> >  create mode 100644 samples/bpf/checmate1_kern.c
> >  create mode 100644 samples/bpf/checmate1_user.c
> >  create mode 100644 security/checmate/Kconfig
> >  create mode 100644 security/checmate/Makefile
> >  create mode 100644 security/checmate/checmate_bpf.c
> >  create mode 100644 security/checmate/checmate_lsm.c
> >
> > --
> > 2.7.4
> >
> 
> 
> 
> -- 
> Kees Cook
> Nexus Security


* Re: [RFC 0/4] RFC: Add Checmate, BPF-driven minor LSM
  2016-08-09  0:00   ` Sargun Dhillon
@ 2016-08-09  0:22     ` Kees Cook
  2016-08-14 22:57       ` Mickaël Salaün
  0 siblings, 1 reply; 12+ messages in thread
From: Kees Cook @ 2016-08-09  0:22 UTC (permalink / raw)
  To: Sargun Dhillon
  Cc: LKML, Alexei Starovoitov, Daniel Borkmann, linux-security-module,
	Network Development, Reshetova, Elena

On Mon, Aug 8, 2016 at 5:00 PM, Sargun Dhillon <sargun@sargun.me> wrote:
> On Mon, Aug 08, 2016 at 04:44:02PM -0700, Kees Cook wrote:
>> On Thu, Aug 4, 2016 at 12:11 AM, Sargun Dhillon <sargun@sargun.me> wrote:
>> > I distributed this patchset to linux-security-module@vger.kernel.org earlier,
>> > but based on the fact that the archive is down, and this is a fairly
>> > broad-sweeping proposal, I figured I'd grow the audience a little bit. Sorry
>> > if you received this multiple times.
>> >
>> > I've begun building out the skeleton of a Linux Security Module, and I'd like to
>> > get feedback on it. It's a skeleton, and I've only populated a few hooks, so I'm
>> > mostly looking for input on the general proposal, interest, and design. It's a
>> > minor LSM. My particular use case is one in which containers are being
>> > dynamically deployed to machines by internal developers in a different group.
>> > The point of Checmate is to act as an extensible bed for _safe_, complex
>> > security policies. It's nice to enable dynamic security policies that can be
>> > defined in C, and change as necessary, without ever having to patch, or rebuild
>> > the kernel.
>> >
>> > For many of these containers, the security policies can be fairly nuanced. One
>> > particular one to take into account is network security. Often times,
>> > administrators want to prevent ingress, and egress connectivity except from a
>> > few select IPs. Egress filtering can be managed using net_cls, but without
>> > modifying running software, it's non-trivial to attach a filter to all sockets
>> > being created within a container. The inet_conn_request, socket_recvmsg,
>> > socket_sock_rcv_skb hooks make this trivial to implement.
>> >
>> > Other times, containers need to be throttled in places where there's not really
>> > a good place to impose that policy for software which isn't built in-house.  If
>> > one wants to limit file creations/sec, or reject I/O under certain
>> > characteristics, there's not a great place to do it now. This gives engineers a
>> > mechanism to write those policies.
>> >
>> > This same flexibility can be used to take existing programs and enable safe BPF
>> > helpers to modify memory to allow rules to pass. One example that I prototyped
>> > was Docker's port mapping, which has an overhead (DNAT), and there's some loss
>> > of fidelity in the BSD Socket API to identify what's going on. Instead, we can
>> > just rewrite the port in a bind, based upon some data in a BPF map, and a cgroup
>> > match.
>> >
>> > I can actually see other minor security modules being implemented in Checmate,
>> > for example, Yama, or the recently proposed Hardchroot could be reimplemented in
>> > BPF. Potentially, they could even be API compatible.
>> >
>> > Although, at first, much of this sounds like seccomp, it's quite different. For
>> > one, what we can do in the security hooks is more complex (access to kernel
>> > pointers). The other side of this is we can have effects on a system-wide,
>> > or cgroup level. This also circumvents the need for CRIU-friendly policies.
>> >
>> > Lastly, the flexibility of this mechanism allows for prevention of security
>> > vulnerabilities which are often complex in nature and require the interaction
>> > of multiple hooks (CVE-2014-9717 is a good example), and although ksplice,
>> > and livepatch exist, they're not always easy to use, as compared to loading
>> > a single bpf program across all kernels.
>> >
>> > The user-facing API is exposed via prctl as it's meant to be very simple (at
>> > least the kernel components). It only has three operations. For a given security
>> > hook, you can attach a BPF program to it, which will add it to the set of
>> > programs that are executed over when the hook is hit. You can reset a hook,
>> > which removes all program associated with a given hook, and you can set a
>> > deny_reset flag on a hook to prevent anyone from resetting it. It's likely that
>> > an individual would want to set this in any production use case.
>>
>> One fairly serious problem that seccomp had to overcome was dealing
>> with exec+setuid in the face of an attacker. The main example is "what
>> if we refuse to allow a program to drop privileges via a filter rule?"
>> For seccomp, no-new-privs was introduced for non-root users of
>> seccomp. Programmatic syscall (or LSM) filters need to deal with this,
>> and it's a bit ungainly. :)
>>
> Couldn't someone do the same with SELinux, or Apparmor?

The "big" LSMs aren't defined programmatically by non-root users, so
there is no risk of elevating privileges (they are already root).

>> Also, if you have a prctl API that already has 3 operations, you might
>> want to use a new syscall anyway. :)
>>
> Looking at other LSMs, they appear to expose their API via a virtual filesystem,
> or prctl. I followed the model of YAMA. I think there may be two more operations
> (detach program, and mark a hook as append-only / read-only / disabled). It
> seems like overkill to implement my own syscall.
>
>> > On the BPF side of it, all that's involved in the work in progress is to
>> > move some of the tracing helpers into the shared helpers. For example,
>> > it's very valuable to have access to current when enforcing a hook.
>> > BPF programs also have access to maps, which somewhat works around
>> > the need for security blobs in some cases.
>>
>> Just from a compatibility perspective, doesn't this end up exposing
>> kernel structures to userspace? What happens when the structures
>> change?
>>
> I wouldn't consider BPF userspace. Although it executes in the kernel, I
> wouldn't really consider it kernel space either as it's restricted to safe
> operations.
>
> As far as addressing this issue -- A significant part of the LSM hooks API is
> tied to the syscall, giving stability to those datastructures.

Just for the sake of clarity: they're tied to internal callers,
usually near syscall entry points; LSMs can't filter syscalls.

> If you look at
> the API itself a significant part of it has been untouched for 3+ years, and
> it's been even longer since there has been an API breaking change. On the other
> hand, the developer has the ability to perform arbitrary reads of kernel space
> using bpf_probe_read.

What's hilarious is that the syscall API is unchanged, but the LSM API keeps
shifting around a little at a time. So, same issues as with kprobes,
etc., as you mention.

FWIW, I'd much rather have an LSM that reacts to seccomp filters and
maps syscall arguments to in-kernel data structures that can be
examined during an LSM hook. Then we'd have both a stable API and a
programmatic filtering of data structures.

> This is addressed in the 4th patch, which requires the BPF program is compiled
> against the current kernel version. The userspace policy orchestration code
> should recompile the BPF program on the fly matching the current kernel's
> datastructures. There's a certain level of rope here given to the operator,
> and it's expected that they use it carefully. Similarly, folks could load
> kprobes, kmods, and other programs that have the same issues.

Right, perhaps I misunderstood the privilege level you were targeting.
:) Did you intend for unprivileged users to use this, or just the
init-ns root user?

>
>> And from a security perspective, programmatic examination of kernel
>> structures means you can trivially leak kernel memory locations and
>> contents. Resisting these sorts of leaks needs to be addressed too.
>>
> I'm not sure that unintentional exfiltration of kernel memory locations is
> possible. You may be able to via a BPF map or similar (logging). What kinds of
> attacks are you thinking about specifically?

Well, I was looking at the example you sent, and it seemed like it had
raw access to kernel pointers, which means it could be programmed to
leak the values.

>> This looks like a subset of kprobes but available to non-root users,
>> which looks rather scary to me at first glance. :)
> You need CAP_SYS_ADMIN to touch this. These folks are the same ones that control
> SELinux, and Apparmor.

Ah-ha, missed that. Still, we want to keep a bright line between uid-0
and ring-0, and to make sure this is just init-ns CAP_SYS_ADMIN.

-Kees

>
>>
>> -Kees
>>
>> >
>> > I would love to know what y'all think.
>> >
>> > Sargun Dhillon (4):
>> >   bpf: move tracing helpers to shared helpers
>> >   bpf, security: Add Checmate
>> >   security/checmate: Add Checmate sample
>> >   bpf: Restrict Checmate bpf programs to current kernel ABI
>> >
>> >  include/linux/bpf.h              |   2 +
>> >  include/linux/checmate.h         |  38 +++++
>> >  include/uapi/linux/Kbuild        |   1 +
>> >  include/uapi/linux/bpf.h         |   1 +
>> >  include/uapi/linux/checmate.h    |  65 +++++++++
>> >  include/uapi/linux/prctl.h       |   3 +
>> >  kernel/bpf/helpers.c             |  34 +++++
>> >  kernel/bpf/syscall.c             |   2 +-
>> >  kernel/trace/bpf_trace.c         |  33 -----
>> >  samples/bpf/Makefile             |   4 +
>> >  samples/bpf/bpf_load.c           |  11 +-
>> >  samples/bpf/checmate1_kern.c     |  28 ++++
>> >  samples/bpf/checmate1_user.c     |  54 +++++++
>> >  security/Kconfig                 |   1 +
>> >  security/Makefile                |   2 +
>> >  security/checmate/Kconfig        |   6 +
>> >  security/checmate/Makefile       |   3 +
>> >  security/checmate/checmate_bpf.c |  67 +++++++++
>> >  security/checmate/checmate_lsm.c | 304 +++++++++++++++++++++++++++++++++++++++
>> >  19 files changed, 622 insertions(+), 37 deletions(-)
>> >  create mode 100644 include/linux/checmate.h
>> >  create mode 100644 include/uapi/linux/checmate.h
>> >  create mode 100644 samples/bpf/checmate1_kern.c
>> >  create mode 100644 samples/bpf/checmate1_user.c
>> >  create mode 100644 security/checmate/Kconfig
>> >  create mode 100644 security/checmate/Makefile
>> >  create mode 100644 security/checmate/checmate_bpf.c
>> >  create mode 100644 security/checmate/checmate_lsm.c
>> >
>> > --
>> > 2.7.4
>> >
>>
>>
>>
>> --
>> Kees Cook
>> Nexus Security



-- 
Kees Cook
Nexus Security


* Re: [RFC 0/4] RFC: Add Checmate, BPF-driven minor LSM
  2016-08-09  0:22     ` Kees Cook
@ 2016-08-14 22:57       ` Mickaël Salaün
  2016-08-15  3:09         ` Sargun Dhillon
  0 siblings, 1 reply; 12+ messages in thread
From: Mickaël Salaün @ 2016-08-14 22:57 UTC (permalink / raw)
  To: Sargun Dhillon
  Cc: Kees Cook, LKML, Alexei Starovoitov, Daniel Borkmann,
	linux-security-module, Network Development, Reshetova, Elena



Hi,

I've been working on an extension to seccomp-bpf since last year and published a first RFC about it [1]. I'm working on a second RFC/PoC which uses eBPF instead of cBPF and is closer to a common LSM than the first RFC. I plan to publish this second RFC by the end of the month.

Our approaches have some common points (i.e. use eBPF in an LSM, stacked filters like seccomp) but I'm focused on a kind of unprivileged LSM (i.e. no CAP_SYS_ADMIN), to make standalone sandboxes, which brings more constraints (e.g. no use of unsafe functions like bpf_probe_read(), take care of privacy, SUID exec, stable ABI…). However, I don't want to handle resource limits, which should be the job of cgroups.

For now, I'm focusing on file-system access control, which is one of the more complex systems to filter properly. I also plan to support basic network access control.

What you are trying to accomplish seems more related to a Netfilter extension (something like ipset but with eBPF maybe?).

 Mickaël


[1] http://www.openwall.com/lists/kernel-hardening/2016/03/24/2



On 09/08/2016 02:22, Kees Cook wrote:
> On Mon, Aug 8, 2016 at 5:00 PM, Sargun Dhillon <sargun@sargun.me> wrote:
>> On Mon, Aug 08, 2016 at 04:44:02PM -0700, Kees Cook wrote:
>>> On Thu, Aug 4, 2016 at 12:11 AM, Sargun Dhillon <sargun@sargun.me> wrote:
>>>> I distributed this patchset to linux-security-module@vger.kernel.org earlier,
>>>> but based on the fact that the archive is down, and this is a fairly
>>>> broad-sweeping proposal, I figured I'd grow the audience a little bit. Sorry
>>>> if you received this multiple times.
>>>>
>>>> I've begun building out the skeleton of a Linux Security Module, and I'd like to
>>>> get feedback on it. It's a skeleton, and I've only populated a few hooks, so I'm
>>>> mostly looking for input on the general proposal, interest, and design. It's a
>>>> minor LSM. My particular use case is one in which containers are being
>>>> dynamically deployed to machines by internal developers in a different group.
>>>> The point of Checmate is to act as an extensible bed for _safe_, complex
>>>> security policies. It's nice to enable dynamic security policies that can be
>>>> defined in C, and change as necessary, without ever having to patch, or rebuild
>>>> the kernel.
>>>>
>>>> For many of these containers, the security policies can be fairly nuanced. One
>>>> particular one to take into account is network security. Often times,
>>>> administrators want to prevent ingress, and egress connectivity except from a
>>>> few select IPs. Egress filtering can be managed using net_cls, but without
>>>> modifying running software, it's non-trivial to attach a filter to all sockets
>>>> being created within a container. The inet_conn_request, socket_recvmsg,
>>>> socket_sock_rcv_skb hooks make this trivial to implement.
>>>>
>>>> Other times, containers need to be throttled in places where there's not really
>>>> a good place to impose that policy for software which isn't built in-house.  If
>>>> one wants to limit file creations/sec, or reject I/O under certain
>>>> characteristics, there's not a great place to do it now. This gives engineers a
>>>> mechanism to write those policies.
>>>>
>>>> This same flexibility can be used to take existing programs and enable safe BPF
>>>> helpers to modify memory to allow rules to pass. One example that I prototyped
>>>> was Docker's port mapping, which has an overhead (DNAT), and there's some loss
>>>> of fidelity in the BSD Socket API to identify what's going on. Instead, we can
>>>> just rewrite the port in a bind, based upon some data in a BPF map, and a cgroup
>>>> match.
>>>>
>>>> I can actually see other minor security modules being implemented in Checmate,
>>>> for example, Yama, or the recently proposed Hardchroot could be reimplemented in
>>>> BPF. Potentially, they could even be API compatible.
>>>>
>>>> Although, at first, much of this sounds like seccomp, it's quite different. For
>>>> one, what we can do in the security hooks is more complex (access to kernel
>>>> pointers). The other side of this is we can have effects on a system-wide,
>>>> or cgroup level. This also circumvents the need for CRIU-friendly policies.
>>>>
>>>> Lastly, the flexibility of this mechanism allows for prevention of security
>>>> vulnerabilities which are often complex in nature and require the interaction
>>>> of multiple hooks (CVE-2014-9717 is a good example), and although ksplice,
>>>> and livepatch exist, they're not always easy to use, as compared to loading
>>>> a single bpf program across all kernels.
>>>>
>>>> The user-facing API is exposed via prctl as it's meant to be very simple (at
>>>> least the kernel components). It only has three operations. For a given security
>>>> hook, you can attach a BPF program to it, which will add it to the set of
>>>> programs that are executed over when the hook is hit. You can reset a hook,
>>>> which removes all program associated with a given hook, and you can set a
>>>> deny_reset flag on a hook to prevent anyone from resetting it. It's likely that
>>>> an individual would want to set this in any production use case.
>>>
>>> One fairly serious problem that seccomp had to overcome was dealing
>>> with exec+setuid in the face of an attacker. The main example is "what
>>> if we refuse to allow a program to drop privileges via a filter rule?"
>>> For seccomp, no-new-privs was introduced for non-root users of
>>> seccomp. Programmatic syscall (or LSM) filters need to deal with this,
>>> and it's a bit ungainly. :)
>>>
>> Couldn't someone do the same with SELinux, or Apparmor?
> 
> The "big" LSMs aren't defined programmatically by non-root users, so
> there is no risk of elevating privileges (they are already root).
> 
>>> Also, if you have a prctl API that already has 3 operations, you might
>>> want to use a new syscall anyway. :)
>>>
>> Looking at other LSMs, they appear to expose their API via a virtual filesystem,
>> or prctl. I followed the model of YAMA. I think there may be two more operations
>> (detach program, and mark a hook as append-only / read-only / disabled). It
>> seems like overkill to implement my own syscall.
>>
>>>> On the BPF side of it, all that's involved in the work in progress is to
>>>> move some of the tracing helpers into the shared helpers. For example,
>>>> it's very valuable to have access to current when enforcing a hook.
>>>> BPF programs also have access to maps, which somewhat works around
>>>> the need for security blobs in some cases.
>>>
>>> Just from a compatibility perspective, doesn't this end up exposing
>>> kernel structures to userspace? What happens when the structures
>>> change?
>>>
>> I wouldn't consider BPF userspace. Although it executes in the kernel, I
>> wouldn't really consider it kernel space either as it's restricted to safe
>> operations.
>>
>> As far as addressing this issue -- A significant part of the LSM hooks API is
>> tied to the syscall, giving stability to those datastructures.
> 
> Just for the sake of clarity: they're tied to internal callers,
> usually near syscall entry points; LSMs can't filter syscalls.
> 
>> If you look at
>> the API itself a significant part of it has been untouched for 3+ years, and
>> it's been even longer since there has been an API breaking change. On the other
>> hand, the developer has the ability to perform arbitrary reads of kernel space
>> using bpf_probe_read.
> 
> What's hilarious is that syscall API is unchanged, but LSM API keeps
> shifting around a little at a time. So, same issues as with kprobes,
> etc, as you mention.
> 
> FWIW, I'd much rather have an LSM that reacts to seccomp filters and
> maps syscall arguments to in-kernel data structures that can be
> examined during an LSM hook. Then we'd have both a stable API and a
> programmatic filtering of data structures.
> 
>> This is addressed in the 4th patch, which requires the BPF program is compiled
>> against the current kernel version. The userspace policy orchestration code
>> should recompile the BPF program on the fly matching the current kernel's
>> datastructures. There's a certain level of rope here given to the operator,
>> and it's expected that they use it carefully. Similarly, folks could load
>> kprobes, kmods, and other programs that have the same issues.
> 
> Right, perhaps I misunderstood the privilege level you were targeting.
> :) Did you intend for unprivileged users to use this, or just the
> init-ns root user?
> 
>>
>>> And from a security perspective, programmatic examination of kernel
>>> structures means you can trivially leak kernel memory locations and
>>> contents. Resisting these sorts of leaks needs to be addressed too.
>>>
>> I'm not sure that unintentional exfiltration of kernel memory locations is
>> possible. You may be able to via a BPF map or similar (logging). What kinds of
>> attacks are you thinking about specifically?
> 
> Well, I was looking at the example you sent, and it seemed like it had
> raw access to kernel pointers, which means it could be programmed to
> leak the values.
> 
>>> This looks like a subset of kprobes but available to non-root users,
>>> which looks rather scary to me at first glance. :)
>> You need CAP_SYS_ADMIN to touch this. These folks are the same ones that control
>> SELinux, and Apparmor.
> 
> Ah-ha, missed that. Still, we want to keep a bright line between uid-0
> and ring-0, and to make sure this is just init-ns CAP_SYS_ADMIN.
> 
> -Kees
> 
>>
>>>
>>> -Kees
>>>
>>>>
>>>> I would love to know what y'all think.
>>>>
>>>> Sargun Dhillon (4):
>>>>   bpf: move tracing helpers to shared helpers
>>>>   bpf, security: Add Checmate
>>>>   security/checmate: Add Checmate sample
>>>>   bpf: Restrict Checmate bpf programs to current kernel ABI
>>>>
>>>>  include/linux/bpf.h              |   2 +
>>>>  include/linux/checmate.h         |  38 +++++
>>>>  include/uapi/linux/Kbuild        |   1 +
>>>>  include/uapi/linux/bpf.h         |   1 +
>>>>  include/uapi/linux/checmate.h    |  65 +++++++++
>>>>  include/uapi/linux/prctl.h       |   3 +
>>>>  kernel/bpf/helpers.c             |  34 +++++
>>>>  kernel/bpf/syscall.c             |   2 +-
>>>>  kernel/trace/bpf_trace.c         |  33 -----
>>>>  samples/bpf/Makefile             |   4 +
>>>>  samples/bpf/bpf_load.c           |  11 +-
>>>>  samples/bpf/checmate1_kern.c     |  28 ++++
>>>>  samples/bpf/checmate1_user.c     |  54 +++++++
>>>>  security/Kconfig                 |   1 +
>>>>  security/Makefile                |   2 +
>>>>  security/checmate/Kconfig        |   6 +
>>>>  security/checmate/Makefile       |   3 +
>>>>  security/checmate/checmate_bpf.c |  67 +++++++++
>>>>  security/checmate/checmate_lsm.c | 304 +++++++++++++++++++++++++++++++++++++++
>>>>  19 files changed, 622 insertions(+), 37 deletions(-)
>>>>  create mode 100644 include/linux/checmate.h
>>>>  create mode 100644 include/uapi/linux/checmate.h
>>>>  create mode 100644 samples/bpf/checmate1_kern.c
>>>>  create mode 100644 samples/bpf/checmate1_user.c
>>>>  create mode 100644 security/checmate/Kconfig
>>>>  create mode 100644 security/checmate/Makefile
>>>>  create mode 100644 security/checmate/checmate_bpf.c
>>>>  create mode 100644 security/checmate/checmate_lsm.c
>>>>
>>>> --
>>>> 2.7.4
>>>>
>>>
>>>
>>>
>>> --
>>> Kees Cook
>>> Nexus Security
> 
> 
> 





* Re: [RFC 0/4] RFC: Add Checmate, BPF-driven minor LSM
  2016-08-14 22:57       ` Mickaël Salaün
@ 2016-08-15  3:09         ` Sargun Dhillon
  2016-08-15 10:59           ` Mickaël Salaün
  0 siblings, 1 reply; 12+ messages in thread
From: Sargun Dhillon @ 2016-08-15  3:09 UTC (permalink / raw)
  To: Mickaël Salaün
  Cc: Kees Cook, LKML, Alexei Starovoitov, Daniel Borkmann,
	linux-security-module, Network Development, Reshetova, Elena

On Mon, Aug 15, 2016 at 12:57:44AM +0200, Mickaël Salaün wrote:
> Hi,
> 
> I've been working on an extension to seccomp-bpf since last year and published 
> a first RFC about it [1]. I'm working on a second RFC/PoC which uses eBPF 
> instead of cBPF and is closer to a common LSM than the first RFC. I plan 
> to publish this second RFC by the end of the month.
> 
Interesting. I plan on dropping another RFC close to the end of the month as 
well.

> Our approaches have some common points (i.e. use eBPF in an LSM, stacked 
> filters like seccomp) but I'm focused on a kind of unprivileged LSM (i.e. no 
> CAP_SYS_ADMIN), to make standalone sandboxes, which brings more constraints 
> (e.g. no use of unsafe functions like bpf_probe_read(), take care of privacy, 
> SUID exec, stable ABI…). However, I don't want to handle resource limits, 
> which should be the job of cgroups.
> 
Kind of. Sometimes describing these resource limits is difficult. For example, I 
have a customer who is trying to stop containers from burning up all the 
ephemeral ports on the machine. To do this, they have an incredibly elaborate 
chain of wiring to prevent a given container from connecting to the same (proto, 
destip, destport) more than 1000 times.

I'm unsure of how you'd model that in a cgroup. 
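
For what it's worth, the policy itself is small when expressed as code. The
sketch below is plain userspace C, not BPF and not anything from the patchset:
flow_key, flow_table and flow_connect are hypothetical names, the fixed-size
open-addressed table stands in for a BPF map, and decrementing the count when a
connection closes is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define TABLE_SLOTS 1024   /* arbitrary table size for this sketch */
#define CONN_LIMIT  1000   /* connections allowed per destination tuple */

struct flow_key {
    uint8_t  proto;
    uint32_t dest_ip;
    uint16_t dest_port;
};

struct flow_slot {
    struct flow_key key;
    uint32_t count;
    int used;
};

struct flow_table {
    struct flow_slot slots[TABLE_SLOTS];
};

static int key_eq(const struct flow_key *a, const struct flow_key *b)
{
    return a->proto == b->proto && a->dest_ip == b->dest_ip &&
           a->dest_port == b->dest_port;
}

static uint32_t key_hash(const struct flow_key *k)
{
    uint32_t h = k->dest_ip * 2654435761u;   /* multiplicative hash */
    h ^= ((uint32_t)k->dest_port << 8) | k->proto;
    return h % TABLE_SLOTS;
}

/*
 * Called where a connect hook would fire. Returns 0 to allow, -EPERM once
 * the tuple has reached CONN_LIMIT, and -ENOSPC if the table is full.
 */
static int flow_connect(struct flow_table *t, const struct flow_key *k)
{
    assert(t != NULL && k != NULL);
    uint32_t start = key_hash(k);

    for (uint32_t i = 0; i < TABLE_SLOTS; i++) {
        struct flow_slot *s = &t->slots[(start + i) % TABLE_SLOTS];

        if (!s->used) {              /* first connection to this tuple */
            s->used = 1;
            s->key = *k;
            s->count = 1;
            return 0;
        }
        if (key_eq(&s->key, k)) {
            if (s->count >= CONN_LIMIT)
                return -EPERM;
            s->count++;
            return 0;
        }
    }
    return -ENOSPC;
}
```

A per-container limit would add the container (or cgroup) identity to the key;
that is omitted here to keep the example short.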

> For now, I'm focusing on file-system access control, which is one of the more 
> complex systems to filter properly. I also plan to support basic network access 
> control.
> 
> What you are trying to accomplish seems more related to a Netfilter extension 
> (something like ipset but with eBPF maybe?).
> 
I don't only want to do network access control, I also want to write to the 
value once it's copied into kernel space. There are a lot of benefits to doing 
this at the syscall level, but the two primary ones are performance and 
capability. 

One of the biggest complaints with our current approach to filtering & load 
balancing (iptables) is that it hides information. When people connect through 
the load balancer, they want to find out who they connected to, and without some 
application-level mechanism, this isn't possible. On the other hand, if we 
just rewrite the destination address in the connect hook, we can pretty easily
allow them to do getpeername.
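
A minimal sketch of that rewrite, in userspace C rather than BPF: the
destination passed to connect is compared against a table of virtual addresses
and, on a match, overwritten with the chosen backend. vip_mapping and
rewrite_connect_dest are hypothetical names, the array stands in for a BPF map,
and only IPv4 is handled.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stddef.h>
#include <sys/socket.h>

/* One load-balancer entry; all fields are in network byte order. */
struct vip_mapping {
    struct in_addr vip;       /* address the application connects to */
    in_port_t      vport;
    struct in_addr backend;   /* address it should actually reach */
    in_port_t      bport;
};

/*
 * Rewrite dst in place if it matches a virtual address.
 * Returns 1 if the destination was rewritten, 0 if left untouched.
 */
static int rewrite_connect_dest(struct sockaddr_in *dst,
                                const struct vip_mapping *map, size_t n)
{
    assert(dst != NULL);
    if (dst->sin_family != AF_INET)
        return 0;

    for (size_t i = 0; i < n; i++) {
        if (dst->sin_addr.s_addr == map[i].vip.s_addr &&
            dst->sin_port == map[i].vport) {
            dst->sin_addr = map[i].backend;
            dst->sin_port = map[i].bport;
            return 1;
        }
    }
    return 0;
}
```

Because the socket then connects to the backend directly, the peer address it
records is the real one, which is what getpeername reports.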

I'm curious about your filesystem access limiter. Do you have a way to make it so
that a given container can only write, say, 100 MB of data to disk? 


>  Mickaël
> 
> 
> [1] http://www.openwall.com/lists/kernel-hardening/2016/03/24/2
> 
> 
> 

> On 09/08/2016 02:22, Kees Cook wrote:
> > On Mon, Aug 8, 2016 at 5:00 PM, Sargun Dhillon <sargun@sargun.me> wrote:
> >> On Mon, Aug 08, 2016 at 04:44:02PM -0700, Kees Cook wrote:
> >>> On Thu, Aug 4, 2016 at 12:11 AM, Sargun Dhillon <sargun@sargun.me> wrote:
> >>>> I distributed this patchset to linux-security-module@vger.kernel.org earlier,
> >>>> but based on the fact that the archive is down, and this is a fairly
> >>>> broad-sweeping proposal, I figured I'd grow the audience a little bit. Sorry
> >>>> if you received this multiple times.
> >>>>
> >>>> I've begun building out the skeleton of a Linux Security Module, and I'd like to
> >>>> get feedback on it. It's a skeleton, and I've only populated a few hooks, so I'm
> >>>> mostly looking for input on the general proposal, interest, and design. It's a
> >>>> minor LSM. My particular use case is one in which containers are being
> >>>> dynamically deployed to machines by internal developers in a different group.
> >>>> The point of Checmate is to act as an extensible bed for _safe_, complex
> >>>> security policies. It's nice to enable dynamic security policies that can be
> >>>> defined in C, and change as neccessary, without ever having to patch, or rebuild
> >>>> the kernel.
> >>>>
> >>>> For many of these containers, the security policies can be fairly nuanced. One
> >>>> particular one to take into account is network security. Often times,
> >>>> administrators want to prevent ingress, and egress connectivity except from a
> >>>> few select IPs. Egress filtering can be managed using net_cls, but without
> >>>> modifying running software, it's non-trivial to attach a filter to all sockets
> >>>> being created within a container. The inet_conn_request, socket_recvmsg,
> >>>> socket_sock_rcv_skb hooks make this trivial to implement.
> >>>>
> >>>> Other times, containers need to be throttled in places where there's not really
> >>>> a good place to impose that policy for software which isn't built in-house.  If
> >>>> one wants to limit file creations/sec, or reject I/O under certain
> >>>> characteristics, there's not a great place to do it now. This gives engineers a
> >>>> mechanism to write those policies.
> >>>>
> >>>> This same flexibility can be used to take existing programs and enable safe BPF
> >>>> helpers to modify memory to allow rules to pass. One example that I prototyped
> >>>> was Docker's port mapping, which has an overhead (DNAT), and there's some loss
> >>>> of fidelity in the BSD Socket API to identify what's going on. Instead, we can
> >>>> just rewrite the port in a bind, based upon some data in a BPF map, and a cgroup
> >>>> match.
> >>>>
> >>>> I can actually see other minor security modules being implemented in Checmate,
> >>>> for example, Yama, or the recently proposed Hardchroot could be reimplemented in
> >>>> BPF. Potentially, they could even be API compatible.
> >>>>
> >>>> Although, at first, much of this sounds like seccomp, it's quite different. For
> >>>> one, what we can do in the security hooks is more complex (access to kernel
> >>>> pointers). The other side of this is we can have effects on a system-wide,
> >>>> or cgroup level. This also circumvents the need for CRIU-friendly policies.
> >>>>
> >>>> Lastly, the flexibility of this mechanism allows for prevention of security
> >>>> vulnerabilities which are often complex in nature and require the interaction
> >>>> of multiple hooks (CVE-2014-9717 is a good example), and although ksplice,
> >>>> and livepatch exist, they're not always easy to use, as compared to loading
> >>>> a single bpf program across all kernels.
> >>>>
> >>>> The user-facing API is exposed via prctl as it's meant to be very simple (at
> >>>> least the kernel components). It only has three operations. For a given security
> >>>> hook, you can attach a BPF program to it, which will add it to the set of
> >>>> programs that are executed over when the hook is hit. You can reset a hook,
> >>>> which removes all program associated with a given hook, and you can set a
> >>>> deny_reset flag on a hook to prevent anyone from resetting it. It's likely that
> >>>> an individual would want to set this in any production use case.
> >>>
> >>> One fairly serious problem that seccomp had to overcome was dealing
> >>> with exec+setuid in the face of an attacker. The main example is "what
> >>> if we refuse to allow a program to drop privileges via a filter rule?"
> >>> For seccomp, no-new-privs was introduced for non-root users of
> >>> seccomp. Programmatic syscall (or LSM) filters need to deal with this,
> >>> and it's a bit ungainly. :)
> >>>
> >> Couldn't someone do the same with SELinux, or Apparmor?
> > 
> > The "big" LSMs aren't defined programmatically by non-root users, so
> > there is no risk of elevating privileges (they are already root).
> > 
> >>> Also, if you have a prctl API that already has 3 operations, you might
> >>> want to use a new syscall anyway. :)
> >>>
> >> Looking at other LSMs, they appear to expose their API via a virtual filesystem,
> >> or prctl. I followed the model of YAMA. I think there may be two more operations
> >> (detach program, and mark a hook as append-only / read-only / disabled). It
> >> seems like overkill to implement my own syscall.
> >>
> >>>> On the BPF side of it, all that's involved in the work in progress is to
> >>>> move some of the tracing helpers into the shared helpers. For example,
> >>>> it's very valuable to have access to current when enforcing a hook.
> >>>> BPF programs also have access to maps, which somewhat works around
> >>>> the need for security blobs in some cases.
> >>>
> >>> Just from a compatibility perspective, doesn't this end up exposing
> >>> kernel structures to userspace? What happens when the structures
> >>> change?
> >>>
> >> I wouldn't consider BPF userspace. Although it executes in the kernel, I
> >> wouldn't really consider it kernel space either as it's restricted to safe
> >> operations.
> >>
> >> As far as addressing this issue -- A significant part of the LSM hooks API is
> >> tied to the syscall, giving stability to those datastructures.
> > 
> > Just for the sake of clarity: they're tied to internal callers,
> > usually near syscall entry points; LSMs can't filter syscalls.
> > 
> >> If you look at
> >> the API itself a significant part of it has been untouched for 3+ years, and
> >> it's been even longer since there has been an API breaking change. On the other
> >> hand, the developer has the ability to perform arbitrary reads of kernel space
> >> using bpf_probe_read.
> > 
> > What's hilarious is that syscall API is unchanged, but LSM API keeps
> > shifting around a little at a time. So, same issues as with kprobes,
> > etc, as you mention.
> > 
> > FWIW, I'd much rather have an LSM that reacts to seccomp filters and
> > maps syscall arguments to in-kernel data structures that can be
> > examined during an LSM hook. Then we'd have both a stable API and a
> > programmatic filtering of data structures.
> > 
> >> This is addressed in the 4th patch, which requires the BPF program is compiled
> >> against the current kernel version. The userspace policy orchestration code
> >> should recompile the BPF program on the fly matching the current kernel's
> >> datastructures. There's a certain level of rope here given to the operator,
> >> and it's expected that they use it carefully. Similarly, folks could load
> >> kprobes, kmods, and other programs that have the same issues.
> > 
> > Right, perhaps I misunderstood the privilege level you were targeting.
> > :) Did you intend for unprivileged users to use this, or just the
> > init-ns root user?
> > 
> >>
> >>> And from a security perspective, programmatic examination of kernel
> >>> structures means you can trivially leak kernel memory locations and
> >>> contents. Resisting these sorts of leaks needs to be addressed too.
> >>>
> >> I'm unsure of that unintentional exfiltration of kernel memory locations is
> >> possible. You may be able to via a BPF map or similar (logging). What kinds of
> >> attacks are you thinking about specifically?
> > 
> > Well, I was looking at the example you sent, and it seemed like it had
> > raw access to kernel pointers, which means it could be programmed to
> > leak the values.
> > 
> >>> This looks like a subset of kprobes but available to non-root users,
> >>> which looks rather scary to me at first glance. :)
> >> You need CAP_SYS_ADMIN to touch this. These folks are the same ones that control
> >> SELinux, and Apparmor.
> > 
> > Ah-ha, missed that. Still, we want to keep a bright line between uid-0
> > and ring-0, and to make sure this is just init-ns CAP_SYS_ADMIN.
> > 
> > -Kees
> > 
> >>
> >>>
> >>> -Kees
> >>>
> >>>>
> >>>> I would love to know what y'all think.
> >>>>
> >>>> Sargun Dhillon (4):
> >>>>   bpf: move tracing helpers to shared helpers
> >>>>   bpf, security: Add Checmate
> >>>>   security/checmate: Add Checmate sample
> >>>>   bpf: Restrict Checmate bpf programs to current kernel ABI
> >>>>
> >>>>  include/linux/bpf.h              |   2 +
> >>>>  include/linux/checmate.h         |  38 +++++
> >>>>  include/uapi/linux/Kbuild        |   1 +
> >>>>  include/uapi/linux/bpf.h         |   1 +
> >>>>  include/uapi/linux/checmate.h    |  65 +++++++++
> >>>>  include/uapi/linux/prctl.h       |   3 +
> >>>>  kernel/bpf/helpers.c             |  34 +++++
> >>>>  kernel/bpf/syscall.c             |   2 +-
> >>>>  kernel/trace/bpf_trace.c         |  33 -----
> >>>>  samples/bpf/Makefile             |   4 +
> >>>>  samples/bpf/bpf_load.c           |  11 +-
> >>>>  samples/bpf/checmate1_kern.c     |  28 ++++
> >>>>  samples/bpf/checmate1_user.c     |  54 +++++++
> >>>>  security/Kconfig                 |   1 +
> >>>>  security/Makefile                |   2 +
> >>>>  security/checmate/Kconfig        |   6 +
> >>>>  security/checmate/Makefile       |   3 +
> >>>>  security/checmate/checmate_bpf.c |  67 +++++++++
> >>>>  security/checmate/checmate_lsm.c | 304 +++++++++++++++++++++++++++++++++++++++
> >>>>  19 files changed, 622 insertions(+), 37 deletions(-)
> >>>>  create mode 100644 include/linux/checmate.h
> >>>>  create mode 100644 include/uapi/linux/checmate.h
> >>>>  create mode 100644 samples/bpf/checmate1_kern.c
> >>>>  create mode 100644 samples/bpf/checmate1_user.c
> >>>>  create mode 100644 security/checmate/Kconfig
> >>>>  create mode 100644 security/checmate/Makefile
> >>>>  create mode 100644 security/checmate/checmate_bpf.c
> >>>>  create mode 100644 security/checmate/checmate_lsm.c
> >>>>
> >>>> --
> >>>> 2.7.4
> >>>>
> >>>
> >>>
> >>>
> >>> --
> >>> Kees Cook
> >>> Nexus Security
> > 
> > 
> > 
> 
> 


* Re: [RFC 0/4] RFC: Add Checmate, BPF-driven minor LSM
  2016-08-15  3:09         ` Sargun Dhillon
@ 2016-08-15 10:59           ` Mickaël Salaün
  2016-08-15 17:03             ` Sargun Dhillon
  0 siblings, 1 reply; 12+ messages in thread
From: Mickaël Salaün @ 2016-08-15 10:59 UTC (permalink / raw)
  To: Sargun Dhillon
  Cc: Kees Cook, LKML, Alexei Starovoitov, Daniel Borkmann,
	linux-security-module, Network Development, Reshetova, Elena




On 15/08/2016 05:09, Sargun Dhillon wrote:
> On Mon, Aug 15, 2016 at 12:57:44AM +0200, Mickaël Salaün wrote:
>> Our approaches have some common points (i.e. use eBPF in an LSM, stacked 
>> filters like seccomp) but I'm focused on a kind of unprivileged LSM (i.e. no 
>> CAP_SYS_ADMIN), to make standalone sandboxes, which brings more constraints 
>> (e.g. no use of unsafe functions like bpf_probe_read(), take care of privacy, 
>> SUID exec, stable ABI…). However, I don't want to handle resource limits, 
>> which should be the job of cgroups.
>>
> Kind of. Sometimes describing these resource limits is difficult. For example, I 
> have a customer who is trying to restrict containers from burning up all the 
> ephemeral ports on the machine. In this, they have an incredibly elaborate chain 
> of wiring to prevent a given container from connecting to the same (proto, 
> destip, destport) more than 1000 times.
> 
> I'm unsure of how you'd model that in a cgroup. 

This looks like a Netfilter rule. Have you tried applying this limitation with the connlimit module?


> 
>> For now, I'm focusing on file-system access control which is one of the more 
>> complex system to properly filter. I also plan to support basic network access 
>> control.
>>
>> What you are trying to accomplish seems more related to a Netfilter extension 
>> (something like ipset but with eBPF maybe?).
>>
> I don't only want to do network access control, I also want to write to the 
> value once it's copied into kernel space. There are lot of benefits of doing 
> this at the syscall level, but the two primary ones are performance, and 
> capability. 
> 
> One of the biggest complaints with our current approach to filtering & load 
> balancing (iptables) is that it hides information. When people connect through 
> the load balancer, they want to find out who they connected to, and without some 
> high application-level mechanism, this isn't possible. On the other hand, if we 
> just rewrite the destination address in the connect hook, we can pretty easily
> allow them to do getpeername.

What exactly is not doable with Netfilter (e.g. REDIRECT or TPROXY)?


> 
> I'm curious about your filesystem access limiter. Do you have a way to make it so
> that a given container can only write, say, 100mb of data to disk? 

It's a filesystem access control. It doesn't deal with quota and is not focused on containers but on process hierarchies (which is more generic).

What is not doable with a quota mount option? It may be more appropriate to enhance the VFS (or overlayfs) to apply this kind of limitation, if needed.




* Re: [RFC 0/4] RFC: Add Checmate, BPF-driven minor LSM
  2016-08-15 10:59           ` Mickaël Salaün
@ 2016-08-15 17:03             ` Sargun Dhillon
  0 siblings, 0 replies; 12+ messages in thread
From: Sargun Dhillon @ 2016-08-15 17:03 UTC (permalink / raw)
  To: Mickaël Salaün
  Cc: Kees Cook, LKML, Alexei Starovoitov, Daniel Borkmann,
	linux-security-module, Network Development, Reshetova, Elena

On Mon, Aug 15, 2016 at 12:59:13PM +0200, Mickaël Salaün wrote:
> 
> On 15/08/2016 05:09, Sargun Dhillon wrote:
> > On Mon, Aug 15, 2016 at 12:57:44AM +0200, Mickaël Salaün wrote:
> >> Our approaches have some common points (i.e. use eBPF in an LSM, stacked 
> >> filters like seccomp) but I'm focused on a kind of unprivileged LSM (i.e. no 
> >> CAP_SYS_ADMIN), to make standalone sandboxes, which brings more constraints 
> >> (e.g. no use of unsafe functions like bpf_probe_read(), take care of privacy, 
> >> SUID exec, stable ABI…). However, I don't want to handle resource limits, 
> >> which should be the job of cgroups.
> >>
> > Kind of. Sometimes describing these resource limits is difficult. For example, I 
> > have a customer who is trying to restrict containers from burning up all the 
> > ephemeral ports on the machine. In this, they have an incredibly elaborate chain 
> > of wiring to prevent a given container from connecting to the same (proto, 
> > destip, destport) more than 1000 times.
> > 
> > I'm unsure of how you'd model that in a cgroup. 
> 
> This looks like a Netfilter rule. Have you tried applying this limitation with the connlimit module?
> 
> 
I could do this by adding a new Netfilter match. With the existing matches,
though, the ones that can select by cgroup2 cannot also apply a connlimit per
cgroup. Potentially, I could wire up something with the cgroup2 match, but this
comes with a lot of overhead. If you know of a low-overhead way of doing this,
I'd love to hear it.

Have you ever used Kubernetes (http://kubernetes.io/docs/whatisk8s/)? You
usually have a bunch of independent systems running together under what's called
a "Pod". You can think of this as an old style "lxc" container, or a VM. Within
each of these pods there is nesting: you want to limit not only the pod's
resources, but also the resources of each application. Doing this without some
layer of programmability in the resource management layer can be difficult.

> > 
> >> For now, I'm focusing on file-system access control which is one of the more 
> >> complex system to properly filter. I also plan to support basic network access 
> >> control.
> >>
> >> What you are trying to accomplish seems more related to a Netfilter extension 
> >> (something like ipset but with eBPF maybe?).
> >>
> > I don't only want to do network access control, I also want to write to the 
> > value once it's copied into kernel space. There are lot of benefits of doing 
> > this at the syscall level, but the two primary ones are performance, and 
> > capability. 
> > 
> > One of the biggest complaints with our current approach to filtering & load 
> > balancing (iptables) is that it hides information. When people connect through 
> > the load balancer, they want to find out who they connected to, and without some 
> > high application-level mechanism, this isn't possible. On the other hand, if we 
> > just rewrite the destination address in the connect hook, we can pretty easily
> > allow them to do getpeername.
> 
> What exactly is not doable with Netfilter (e.g. REDIRECT or TPROXY)?
> 
> 
Is there a way to "load balance" or "proxy" a connection where getpeername() 
tells you the real IP of the node you're connected to?

> > 
> > I'm curious about your filesystem access limiter. Do you have a way to make it so
> > that a given container can only write, say, 100mb of data to disk? 
> 
> It's a filesystem access control. It doesn't deal with quota and is not focused on container but process hierarchies (which is more generic).
> 
> What is not doable with a quota mount option? It may be more appropriate to enhance the VFS (or overlayfs) to apply this kind of limitation, if needed.
> 
Your overlayfs suggestion is on point. Since a lot of my containers look similar
to Kubernetes though, quota isn't very well aligned with them (within a Pod,
there are usually a bunch of independent things that need their usage limited).
I think quota / overlayfs with labeling that comes from an LSM, or some other
smart classifier would be ideal.
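
To sketch what label-driven accounting could look like, here is a small
userspace C example that charges writes against a per-label byte budget. It
only illustrates the bookkeeping; it is not how quota or overlayfs work today.
The label strings, quota_set(), and quota_charge() are hypothetical, and the
choice to leave unlabeled writes unlimited is an assumption.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_LABELS 64
#define LABEL_LEN  32

/* Hypothetical per-label budget, e.g. one label per application in a pod. */
struct label_quota {
	char label[LABEL_LEN];
	uint64_t limit_bytes;
	uint64_t used_bytes;
};

static struct label_quota quotas[MAX_LABELS];
static size_t nquotas;

static struct label_quota *quota_find(const char *label)
{
	for (size_t i = 0; i < nquotas; i++)
		if (strcmp(quotas[i].label, label) == 0)
			return &quotas[i];
	return NULL;
}

/* Create or update the budget for a label. Returns 0 on success, -1 on error. */
int quota_set(const char *label, uint64_t limit_bytes)
{
	assert(label != NULL);
	struct label_quota *q = quota_find(label);

	if (q == NULL) {
		if (nquotas == MAX_LABELS || strlen(label) >= LABEL_LEN)
			return -1;
		q = &quotas[nquotas++];
		strcpy(q->label, label); /* length checked above */
		q->used_bytes = 0;
	}
	q->limit_bytes = limit_bytes;
	return 0;
}

/*
 * Charge a write of nbytes to a label. Returns 0 to allow, -1 to deny.
 * Labels without a budget are not limited (an assumption of this sketch).
 */
int quota_charge(const char *label, uint64_t nbytes)
{
	assert(label != NULL);
	struct label_quota *q = quota_find(label);

	if (q == NULL)
		return 0;
	if (q->used_bytes > q->limit_bytes ||
	    nbytes > q->limit_bytes - q->used_bytes)
		return -1;
	q->used_bytes += nbytes;
	return 0;
}
```

Two applications in the same pod would simply carry different labels, so each
gets its own budget regardless of which mount they write to.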



Thread overview: 12+ messages
2016-08-04  7:11 [RFC 0/4] RFC: Add Checmate, BPF-driven minor LSM Sargun Dhillon
2016-08-04  8:41 ` Richard Weinberger
2016-08-04  9:24   ` Sargun Dhillon
2016-08-04  9:45 ` Daniel Borkmann
2016-08-04 10:12   ` Sargun Dhillon
2016-08-08 23:44 ` Kees Cook
2016-08-09  0:00   ` Sargun Dhillon
2016-08-09  0:22     ` Kees Cook
2016-08-14 22:57       ` Mickaël Salaün
2016-08-15  3:09         ` Sargun Dhillon
2016-08-15 10:59           ` Mickaël Salaün
2016-08-15 17:03             ` Sargun Dhillon
