* [net-next 0/2] BPF, kprobes: Add current_in_cgroup helper
@ 2016-08-07  4:06 Sargun Dhillon
  2016-08-07  4:32 ` Alexei Starovoitov
  0 siblings, 1 reply; 8+ messages in thread
From: Sargun Dhillon @ 2016-08-07  4:06 UTC (permalink / raw)
  To: netdev; +Cc: alexei.starovoitov, daniel

This patchset includes a helper and an example to determine whether the kprobe 
is currently executing in the context of a specific cgroup based on a cgroup
bpf map / array. 
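
The rough usage model: a kprobe program declares a cgroup array map, userspace
stores a cgroup fd in it, and the program bails out early when current is not
in that cgroup. A minimal sketch of the kprobe side (the helper signature and
map type shown here are assumptions for illustration, not lifted verbatim from
the patches):

#include <linux/ptrace.h>
#include <uapi/linux/bpf.h>
#include <linux/version.h>
#include "bpf_helpers.h"

struct bpf_map_def SEC("maps") cgroup_map = {
	.type		= BPF_MAP_TYPE_CGROUP_ARRAY,
	.key_size	= sizeof(u32),
	.value_size	= sizeof(u32),
	.max_entries	= 1,	/* userspace stores one cgroup fd at index 0 */
};

SEC("kprobe/sys_open")
int trace_open(struct pt_regs *ctx)
{
	char fmt[] = "open() in watched cgroup\n";

	/* skip events from tasks outside the cgroup stored at index 0 */
	if (!bpf_current_in_cgroup(&cgroup_map, 0))
		return 0;

	bpf_trace_printk(fmt, sizeof(fmt));
	return 0;
}

char _license[] SEC("license") = "GPL";
u32 _version SEC("version") = LINUX_VERSION_CODE;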

Sargun Dhillon (2):
  bpf: Add bpf_current_in_cgroup helper
  samples/bpf: Add example using current_in_cgroup

 include/linux/bpf.h                        | 24 +++++++++++
 include/uapi/linux/bpf.h                   | 11 +++++
 kernel/bpf/arraymap.c                      |  2 +-
 kernel/bpf/verifier.c                      |  4 +-
 kernel/trace/bpf_trace.c                   | 34 +++++++++++++++
 net/core/filter.c                          | 11 ++---
 samples/bpf/Makefile                       |  4 ++
 samples/bpf/bpf_helpers.h                  |  2 +
 samples/bpf/trace_current_in_cgroup_kern.c | 44 ++++++++++++++++++++
 samples/bpf/trace_current_in_cgroup_user.c | 66 ++++++++++++++++++++++++++++++
 10 files changed, 193 insertions(+), 9 deletions(-)
 create mode 100644 samples/bpf/trace_current_in_cgroup_kern.c
 create mode 100644 samples/bpf/trace_current_in_cgroup_user.c

-- 
2.7.4


* Re: [net-next 0/2] BPF, kprobes: Add current_in_cgroup helper
  2016-08-07  4:06 [net-next 0/2] BPF, kprobes: Add current_in_cgroup helper Sargun Dhillon
@ 2016-08-07  4:32 ` Alexei Starovoitov
  2016-08-07  4:56   ` Sargun Dhillon
  0 siblings, 1 reply; 8+ messages in thread
From: Alexei Starovoitov @ 2016-08-07  4:32 UTC (permalink / raw)
  To: Sargun Dhillon; +Cc: netdev, daniel

On Sat, Aug 06, 2016 at 09:06:53PM -0700, Sargun Dhillon wrote:
> This patchset includes a helper and an example to determine whether the kprobe 
> is currently executing in the context of a specific cgroup based on a cgroup
> bpf map / array. 

description is too short to understand how this new helper is going to be used.
depending on the kprobe, current is not always valid.
what are you trying to achieve?
This looks like an alternative to the lsm patches submitted earlier?
btw net-next is closed and no new features are accepted at the moment.


* Re: [net-next 0/2] BPF, kprobes: Add current_in_cgroup helper
  2016-08-07  4:32 ` Alexei Starovoitov
@ 2016-08-07  4:56   ` Sargun Dhillon
  2016-08-08  0:52     ` Alexei Starovoitov
  0 siblings, 1 reply; 8+ messages in thread
From: Sargun Dhillon @ 2016-08-07  4:56 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: netdev, daniel

On Sat, Aug 06, 2016 at 09:32:05PM -0700, Alexei Starovoitov wrote:
> On Sat, Aug 06, 2016 at 09:06:53PM -0700, Sargun Dhillon wrote:
> > This patchset includes a helper and an example to determine whether the kprobe 
> > is currently executing in the context of a specific cgroup based on a cgroup
> > bpf map / array. 
> 
> description is too short to understand how this new helper is going to be used.
> depending on kprobe current is not always valid.
Anything not in in_interrupt() should have a current, right?

> what are you trying to achieve?
This is primarily to help troubleshoot containers (Docker, and now systemd). A
lot of the time we want to determine what's going on in a given container
(opening files, connecting to systems, etc.). There's not really a great way
to restrict this to containers except by manually walking data structures to
check for the right cgroup. This seems like a better alternative.

> This looks like an alternative to lsm patches submitted earlier?
No. But I would like to use this helper in the LSM patches I'm working on. For 
now, with those patches, and this helper, I can create a map of size 1, and add 
the cgroup I care about to it. Given I can add as many bpf programs to an LSM
hook as I want, I can use this mechanism to "attach BPF programs to cgroups" -- 
I put that in quotes because you're not really attaching it to a cgroup,
but just burning some instructions on checking it.
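
For reference, the userspace side of that "map of size 1" idea is tiny;
something like the following, using the bpf syscall wrappers from samples/bpf
(the cgroup path and error handling are illustrative only):

#include <fcntl.h>
#include <unistd.h>
#include <linux/bpf.h>
#include "libbpf.h"

int setup_cgroup_map(const char *cgroup_path)
{
	int map_fd, cg_fd, key = 0;

	/* array of cgroup fds; the kprobe/LSM program checks index 0 */
	map_fd = bpf_create_map(BPF_MAP_TYPE_CGROUP_ARRAY,
				sizeof(key), sizeof(int), 1, 0);
	if (map_fd < 0)
		return -1;

	/* e.g. /sys/fs/cgroup/unified/<container scope> */
	cg_fd = open(cgroup_path, O_RDONLY);
	if (cg_fd < 0)
		return -1;

	if (bpf_update_elem(map_fd, &key, &cg_fd, BPF_ANY))
		return -1;

	close(cg_fd);
	return map_fd;
}

The fd-to-cgroup resolution happens in the kernel at update time, so the BPF
program itself never has to deal with paths.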

In my mind it seems better than making cgroup-attachment a first-class part
of the checmate work since I still want to make globally available hooks
possible.

> btw net-next is closed and no new features accepted at the moment.
Sorry, I didn't realize that. I'd still love to get feedback.
> 


* Re: [net-next 0/2] BPF, kprobes: Add current_in_cgroup helper
  2016-08-07  4:56   ` Sargun Dhillon
@ 2016-08-08  0:52     ` Alexei Starovoitov
  2016-08-08  3:08       ` Sargun Dhillon
  0 siblings, 1 reply; 8+ messages in thread
From: Alexei Starovoitov @ 2016-08-08  0:52 UTC (permalink / raw)
  To: Sargun Dhillon; +Cc: netdev, daniel, daniel

On Sat, Aug 06, 2016 at 09:56:06PM -0700, Sargun Dhillon wrote:
> On Sat, Aug 06, 2016 at 09:32:05PM -0700, Alexei Starovoitov wrote:
> > On Sat, Aug 06, 2016 at 09:06:53PM -0700, Sargun Dhillon wrote:
> > > This patchset includes a helper and an example to determine whether the kprobe 
> > > is currently executing in the context of a specific cgroup based on a cgroup
> > > bpf map / array. 
> > 
> > description is too short to understand how this new helper is going to be used.
> > depending on kprobe current is not always valid.
> Anything not in in_interrupt() should have a current, right?
> 
> > what are you trying to achieve?
> This is primarily to help troubleshoot containers (Docker, and now systemd). A 
> lot of the time we want to determine what's going on in a given container 
> (opening files, connecting to systems, etc...). There's not really a great way 
> to restrict to containers except by manually walking datastructures to check for 
> the right cgroup. This seems like a better alternative.

so it's about restricting or determining?
In other words if it's analytics/tracing that's one thing, but
enforcement/restriction is quite different.
For analytics one can walk task_css_set(current)->dfl_cgrp and remember
that pointer in a map or something for stats collection and similar.
If it's restricting apps in containers then the kprobe approach
is not usable. I don't think you'd want to build an enforcement system
on an unstable api that can vary kernel-to-kernel.
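
Something like this rough sketch is what I mean for the analytics-only case
(it assumes the bpf_get_current_task() helper is available and reads kernel
structs with bpf_probe_read(), so the layout is whatever the running kernel
has -- fine for tracing, not for enforcement):

#include <uapi/linux/bpf.h>
#include <linux/ptrace.h>
#include <linux/version.h>
#include <linux/sched.h>
#include <linux/cgroup-defs.h>
#include "bpf_helpers.h"

struct bpf_map_def SEC("maps") per_cgroup_opens = {
	.type		= BPF_MAP_TYPE_HASH,
	.key_size	= sizeof(void *),	/* dfl_cgrp pointer as the key */
	.value_size	= sizeof(u64),
	.max_entries	= 1024,
};

SEC("kprobe/sys_open")
int count_opens(struct pt_regs *ctx)
{
	struct task_struct *task = (void *) bpf_get_current_task();
	struct css_set *cset;
	struct cgroup *dfl_cgrp;
	u64 init = 1, *val;

	/* walk current->cgroups->dfl_cgrp and use the pointer as a key */
	bpf_probe_read(&cset, sizeof(cset), &task->cgroups);
	bpf_probe_read(&dfl_cgrp, sizeof(dfl_cgrp), &cset->dfl_cgrp);

	val = bpf_map_lookup_elem(&per_cgroup_opens, &dfl_cgrp);
	if (val)
		(*val)++;
	else
		bpf_map_update_elem(&per_cgroup_opens, &dfl_cgrp, &init, BPF_ANY);
	return 0;
}

char _license[] SEC("license") = "GPL";
u32 _version SEC("version") = LINUX_VERSION_CODE;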

> > This looks like an alternative to lsm patches submitted earlier?
> No. But I would like to use this helper in the LSM patches I'm working on. For 
> now, with those patches, and this helper, I can create a map sized 1, and add 
> the cgroup I care about to it. Given I can add as many bpf programs to an LSM
> hook I want, I can use this mechanism to "attach BPF programs to cgroups" -- 
> I put that in quotes because you're not really attaching it to a cgroup,
> but just burning some instructions on checking it. 

how many cgroups will you need to check? The current bpf_skb_in_cgroup()
suffers from similar scaling issues.
I think the proper restriction/enforcement could be done via attaching a bpf
program to a cgroup. These patches are being worked on by Daniel Mack (cc-ed).
Then the bpf program will be able to enforce networking behavior of applications
in cgroups.
For global container analytics I think we need something that converts
current to a cgroup_id or cgroup_handle. I don't think a descendant check
can scale for such a use case.


* Re: [net-next 0/2] BPF, kprobes: Add current_in_cgroup helper
  2016-08-08  0:52     ` Alexei Starovoitov
@ 2016-08-08  3:08       ` Sargun Dhillon
  2016-08-08  3:52         ` Alexei Starovoitov
  0 siblings, 1 reply; 8+ messages in thread
From: Sargun Dhillon @ 2016-08-08  3:08 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: netdev, daniel, daniel

Thanks for your feedback Alexei,
I really appreciate it.

On Sun, Aug 07, 2016 at 05:52:36PM -0700, Alexei Starovoitov wrote:
> On Sat, Aug 06, 2016 at 09:56:06PM -0700, Sargun Dhillon wrote:
> > On Sat, Aug 06, 2016 at 09:32:05PM -0700, Alexei Starovoitov wrote:
> > > On Sat, Aug 06, 2016 at 09:06:53PM -0700, Sargun Dhillon wrote:
> > > > This patchset includes a helper and an example to determine whether the kprobe 
> > > > is currently executing in the context of a specific cgroup based on a cgroup
> > > > bpf map / array. 
> > > 
> > > description is too short to understand how this new helper is going to be used.
> > > depending on kprobe current is not always valid.
> > Anything not in in_interrupt() should have a current, right?
> > 
> > > what are you trying to achieve?
> > This is primarily to help troubleshoot containers (Docker, and now systemd). A 
> > lot of the time we want to determine what's going on in a given container 
> > (opening files, connecting to systems, etc...). There's not really a great way 
> > to restrict to containers except by manually walking datastructures to check for 
> > the right cgroup. This seems like a better alternative.
> 
> so it's about restricting or determining?
> In other words if it's analytics/tracing that's one thing, but
> enforcement/restriction is quite different.
> For analytics one can walk task_css_set(current)->dfl_cgrp and remember
> that pointer in a map or something for stats collections and similar.
> If it's restricting apps in containers then kprobe approach
> is not usable. I don't think you'd want to built an enforcement system
> on an unstable api then can vary kernel-to-kernel.
> 
The first real-world use case is to implement something like Sysdig. Often the
team running the containers doesn't know what's inside of them, so they want to
be able to view network, I/O, and other activity by container. Right now, the
lowest common denominator between all of the containerization techniques is
cgroups. We've seen examples where an admin is unsure of the workload, and
would love to use opensnoop, but there are too many workloads on the machine.

Unfortunately, I don't think that it's possible just to check 
task_css_set(current)->dfl_cgrp in a bpf program. Containers, especially
containers with sidecars (what Kubernetes calls Pods, I believe), tend to have
multiple nested cgroups inside of them. If you had a way to convert cgroup array
entries to pointers, I imagine you could write an unrolled loop to check for
ownership within a limited range.
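
To make that concrete, the unrolled check I have in mind would be roughly the
following fragment, reusing a cgroup array populated from userspace and the
assumed helper signature from the cover letter (purely illustrative):

#define NR_WATCHED_CGROUPS 4	/* container root plus a few nested cgroups */

/* returns 1 if current is in any of the first NR_WATCHED_CGROUPS entries */
static inline int in_any_watched_cgroup(void)
{
	int i;

#pragma unroll
	for (i = 0; i < NR_WATCHED_CGROUPS; i++) {
		if (bpf_current_in_cgroup(&cgroup_map, i))
			return 1;
	}
	return 0;
}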

I'm still looking for comments from the LSM folks on Checmate[1]. It appears 
that there has been very little API-breaking churn in the LSM hooks. Many of
the syscall hooks are closely tied to the syscall API, so they can't really
change too much. I think that with a toolkit like iovisor, or another userland
translation layer, these hooks could be very powerful. I would love to hear
feedback from the LSM folks.

My plan with those patches is to reimplement Yama and Hardchroot as BPF 
programs to show off the potential capabilities of Checmate. I'd also like to 
create some example programs blocking CVEs that have popped up. I think of the 
idea as something like nftables for kernel syscalls, storage, and the network stack.

The other example I want to show is implementing Docker-bridge style network 
isolation with Checmate. Most folks use the bridge to map ports and to restrict
binding to specific ports, not for the dedicated network namespace or loopback
interface. It turns out that for some applications this comes at a pretty
significant performance hit[2][3], as well as awkward upper bounds based on
conntrack.

> > > This looks like an alternative to lsm patches submitted earlier?
> > No. But I would like to use this helper in the LSM patches I'm working on. For 
> > now, with those patches, and this helper, I can create a map sized 1, and add 
> > the cgroup I care about to it. Given I can add as many bpf programs to an LSM
> > hook I want, I can use this mechanism to "attach BPF programs to cgroups" -- 
> > I put that in quotes because you're not really attaching it to a cgroup,
> > but just burning some instructions on checking it. 
> 
> how many cgroups will you need to check? The current bpf_skb_in_cgroup()
> suffers similar scaling issues.
> I think the proper restriction/enforcement could be done via attaching bpf
> program to a cgroup. These patches are being worked on Daniel Mack cc-ed.
> Then bpf program will be able to enforce networking behavior of applications
> in cgroups.
> For global container analytics I think we need something that converts
> current to cgroup_id or cgroup_handle. I don't think descendant check
> can scale for such use case.
> 
Usually there's a top-level cgroup for a container, then a cgroup for each 
subprocess, and maybe a third level if that fans out to multiple workers (see: 
unicorn). I see your point though about scalability or performance issues. I 
still think a current_is_cgroup (vs in_cgroup) call would be really nice. 
Though, if we have a current_cgroup_id helper, it introduces the problem that if 
there is churn in cgroups, the ID may be reassigned. There still needs to be a 
way to keep the reference, and perhaps we just make a helper to convert cgroup 
map entries into IDs.

The approach I took in the Checmate patches allows for "attachment" to a uts 
namespace, which is perhaps the lightest and simplest namespace. Maybe that's 
the right direction to go, but I'm looking forward to seeing Daniel's patches.

-Thanks,
Sargun

[1] https://lkml.org/lkml/2016/8/4/58
[2] https://www.percona.com/blog/2016/02/11/measuring-docker-io-overhead/ 
[3] http://blog.pierreroudier.net/wp-content/uploads/2015/08/rc25482.pdf (warning: PDF)


* Re: [net-next 0/2] BPF, kprobes: Add current_in_cgroup helper
  2016-08-08  3:08       ` Sargun Dhillon
@ 2016-08-08  3:52         ` Alexei Starovoitov
  2016-08-08  9:27           ` Daniel Borkmann
  0 siblings, 1 reply; 8+ messages in thread
From: Alexei Starovoitov @ 2016-08-08  3:52 UTC (permalink / raw)
  To: Sargun Dhillon; +Cc: netdev, daniel, daniel, Thomas Graf

On Sun, Aug 07, 2016 at 08:08:19PM -0700, Sargun Dhillon wrote:
> Thanks for your feedback Alexei,
> I really appreciate it.
> 
> On Sun, Aug 07, 2016 at 05:52:36PM -0700, Alexei Starovoitov wrote:
> > On Sat, Aug 06, 2016 at 09:56:06PM -0700, Sargun Dhillon wrote:
> > > On Sat, Aug 06, 2016 at 09:32:05PM -0700, Alexei Starovoitov wrote:
> > > > On Sat, Aug 06, 2016 at 09:06:53PM -0700, Sargun Dhillon wrote:
> > > > > This patchset includes a helper and an example to determine whether the kprobe 
> > > > > is currently executing in the context of a specific cgroup based on a cgroup
> > > > > bpf map / array. 
> > > > 
> > > > description is too short to understand how this new helper is going to be used.
> > > > depending on kprobe current is not always valid.
> > > Anything not in in_interrupt() should have a current, right?
> > > 
> > > > what are you trying to achieve?
> > > This is primarily to help troubleshoot containers (Docker, and now systemd). A 
> > > lot of the time we want to determine what's going on in a given container 
> > > (opening files, connecting to systems, etc...). There's not really a great way 
> > > to restrict to containers except by manually walking datastructures to check for 
> > > the right cgroup. This seems like a better alternative.
> > 
> > so it's about restricting or determining?
> > In other words if it's analytics/tracing that's one thing, but
> > enforcement/restriction is quite different.
> > For analytics one can walk task_css_set(current)->dfl_cgrp and remember
> > that pointer in a map or something for stats collections and similar.
> > If it's restricting apps in containers then kprobe approach
> > is not usable. I don't think you'd want to built an enforcement system
> > on an unstable api then can vary kernel-to-kernel.
> > 
> The first real-world use case are to implement something like Sysdig. Often the 
> team running the team running the containers don't always know what's inside of 
> them, so they want to be able to view network, I/O, and other activity by 
> container. Right now, the lowest common denominator between all of the 
> containerization techniques is cgroups. We've seen examples of where a admin is 
> unsure of the workload, and would love to use opensnoop, but there are too many 
> workloads on the machine.

Indeed it would be a useful feature to teach opensnoop to filter by a cgroup
and all descendants of it. If you can prepare a patch for it, that would be
a strong use case for this bpf_current_in_cgroup helper and solid justification
to accept it in the kernel.
Something like a cgroupv2 string path as an argument?

> Unfortunately, I don't think that it's possible just to check 
> task_css_set(current)->dfl_cgrp in a bpf program. The container, especially 
> containers with sidecars (what Kubernetes calls Pods, I believe?) tend to have 
> multiple nested cgroups inside of them. If you had a way to convert cgroup array 
> entries to pointers, I imagine you could write an unrolled loop to check for 
> ownership within a limited range.
> 
> I'm still looking for comments from the LSM folks on Checmate[1]. It appears 
> that there has been very little churn in the LSM hooks API that's API-breaking. 
> For many of syscall hooks, they're closely tied to the syscall API, so they 
> can't really change too much. I think that a toolkit like iovisor, or another 
> userland translation layer, these hooks could be very powerful. I would love to 
> hear feedback from the LSM folks.
> 
> My plan with those patches is to reimplement Yama, and Hardchroot in BPF 
> programs to show off the potential capabilities of Checmate. I'd also like to 
> create some example programs blocking CVEs that have popped up. I think of the 
> idea like nftables for kernel syscalls, storage, and the network stack.

looking forward to more details on checmate, so far I'm convinced we need it.

> The other example I want to show is implementing Docker-bridge style network 
> isolation with Checmate. Most folks use it to map ports, and to restrict binding 
> to specific ports, and not the dedicated network namespace, or loopback 
> interface. It turns out for some applications this comes at a pretty significant 
> hit[2][3], as well as awkward upper bounds based on conntrack.

the default nat setup of docker is obviously slow, but that doesn't mean the
kernel needs anything more than it already has.
If you're at linuxcon this year, Thomas's talk [4] shouldn't be missed.

> > > > This looks like an alternative to lsm patches submitted earlier?
> > > No. But I would like to use this helper in the LSM patches I'm working on. For 
> > > now, with those patches, and this helper, I can create a map sized 1, and add 
> > > the cgroup I care about to it. Given I can add as many bpf programs to an LSM
> > > hook I want, I can use this mechanism to "attach BPF programs to cgroups" -- 
> > > I put that in quotes because you're not really attaching it to a cgroup,
> > > but just burning some instructions on checking it. 
> > 
> > how many cgroups will you need to check? The current bpf_skb_in_cgroup()
> > suffers similar scaling issues.
> > I think the proper restriction/enforcement could be done via attaching bpf
> > program to a cgroup. These patches are being worked on Daniel Mack cc-ed.
> > Then bpf program will be able to enforce networking behavior of applications
> > in cgroups.
> > For global container analytics I think we need something that converts
> > current to cgroup_id or cgroup_handle. I don't think descendant check
> > can scale for such use case.
> > 
> Usually there's a top level cgroup for a container, and then cgroup for each 
> subprocess, and maybe a third level if that fans out to multiple workers (See: 
> unicorn). I see your point though about scalability, or performance issues. I 
> still think a current_is_cgroup (vs in_cgroup) call would be really nice. 
> Though, if we have a current_cgroup_id helper, it introduces the problem that if 
> there is churn in cgroups, the ID may be reassigned. There still needs to be a 
> way to keep the reference, and perhaps we just make a helper to convert cgroup 
> map entires into IDs.

agree. good points.
Looking forward to the opensnoop+bpf_current_in_cgroup patch.
Naming-wise, maybe bpf_current_task_in_cgroup is a better name?

> The approach I took in the Checmate patches allows for "attachment" to a uts 
> namespace, which are perhaps the lightest, and simplest namespaces. Maybe that's 
> the right direction to go, but I'm looking forward to seeing Daniel's patches.
> 
> -Thanks,
> Sargun
> 
> [1] https://lkml.org/lkml/2016/8/4/58
> [2] https://www.percona.com/blog/2016/02/11/measuring-docker-io-overhead/ 
> [3] http://blog.pierreroudier.net/wp-content/uploads/2015/08/rc25482.pdf (warning: PDF)

[4] https://lcccna2016.sched.org/event/7JUl/fast-ipv6-only-networking-for-containers-based-on-bpf-and-xdp-thomas-graf-cisco


* Re: [net-next 0/2] BPF, kprobes: Add current_in_cgroup helper
  2016-08-08  3:52         ` Alexei Starovoitov
@ 2016-08-08  9:27           ` Daniel Borkmann
  2016-08-08 18:09             ` Sargun Dhillon
  0 siblings, 1 reply; 8+ messages in thread
From: Daniel Borkmann @ 2016-08-08  9:27 UTC (permalink / raw)
  To: Alexei Starovoitov, Sargun Dhillon; +Cc: netdev, daniel, Thomas Graf, aravinda

On 08/08/2016 05:52 AM, Alexei Starovoitov wrote:
> On Sun, Aug 07, 2016 at 08:08:19PM -0700, Sargun Dhillon wrote:
>> Thanks for your feedback Alexei,
>> I really appreciate it.
>>
>> On Sun, Aug 07, 2016 at 05:52:36PM -0700, Alexei Starovoitov wrote:
>>> On Sat, Aug 06, 2016 at 09:56:06PM -0700, Sargun Dhillon wrote:
>>>> On Sat, Aug 06, 2016 at 09:32:05PM -0700, Alexei Starovoitov wrote:
>>>>> On Sat, Aug 06, 2016 at 09:06:53PM -0700, Sargun Dhillon wrote:
>>>>>> This patchset includes a helper and an example to determine whether the kprobe
>>>>>> is currently executing in the context of a specific cgroup based on a cgroup
>>>>>> bpf map / array.
>>>>>
>>>>> description is too short to understand how this new helper is going to be used.
>>>>> depending on kprobe current is not always valid.
>>>> Anything not in in_interrupt() should have a current, right?
>>>>
>>>>> what are you trying to achieve?
>>>> This is primarily to help troubleshoot containers (Docker, and now systemd). A
>>>> lot of the time we want to determine what's going on in a given container
>>>> (opening files, connecting to systems, etc...). There's not really a great way
>>>> to restrict to containers except by manually walking datastructures to check for
>>>> the right cgroup. This seems like a better alternative.
>>>
>>> so it's about restricting or determining?
>>> In other words if it's analytics/tracing that's one thing, but
>>> enforcement/restriction is quite different.
>>> For analytics one can walk task_css_set(current)->dfl_cgrp and remember
>>> that pointer in a map or something for stats collections and similar.
>>> If it's restricting apps in containers then kprobe approach
>>> is not usable. I don't think you'd want to built an enforcement system
>>> on an unstable api then can vary kernel-to-kernel.
>>>
>> The first real-world use case are to implement something like Sysdig. Often the
>> team running the team running the containers don't always know what's inside of
>> them, so they want to be able to view network, I/O, and other activity by
>> container. Right now, the lowest common denominator between all of the
>> containerization techniques is cgroups. We've seen examples of where a admin is
>> unsure of the workload, and would love to use opensnoop, but there are too many
>> workloads on the machine.
>
> Indeed it would be a useful feature to teach opensnoop to filter by a cgroup
> and all descentants of it. If you can prepare a patch for it that would be
> a strong use case for this bpf_current_in_cgroup helper and solid justification
> to accept it in the kernel.
> Something like cgroupv2 string path as an argument ?

How does this integrate with cgroup namespaces? Your current helper would only
look at the cgroup in your current namespace, no? Or would the program populating
the map temporarily switch into other namespaces?

What about cases where a cgroup could be shared among other (net, ..) namespaces?
The BPF program would still not be namespace aware to sort these things out.

You'll also have the issue, for example, that bpf_perf_event_read() counters
are global; combining them with the cgroups helper in a program would lead to
false expectations (in the sense that they might also be assumed to be for that
cgroup). Or do you have a way to tackle that as well (at least for SW events,
since HW should not be possible)?

Btw, there's slightly related work from IBM folks (but to run it from within a
container; there was a v2 recently I recall):

   https://lkml.org/lkml/2016/6/14/547

>> Unfortunately, I don't think that it's possible just to check
>> task_css_set(current)->dfl_cgrp in a bpf program. The container, especially
>> containers with sidecars (what Kubernetes calls Pods, I believe?) tend to have
>> multiple nested cgroups inside of them. If you had a way to convert cgroup array
>> entries to pointers, I imagine you could write an unrolled loop to check for
>> ownership within a limited range.
>>
>> I'm still looking for comments from the LSM folks on Checmate[1]. It appears
>> that there has been very little churn in the LSM hooks API that's API-breaking.
>> For many of syscall hooks, they're closely tied to the syscall API, so they
>> can't really change too much. I think that a toolkit like iovisor, or another
>> userland translation layer, these hooks could be very powerful. I would love to
>> hear feedback from the LSM folks.
>>
>> My plan with those patches is to reimplement Yama, and Hardchroot in BPF
>> programs to show off the potential capabilities of Checmate. I'd also like to
>> create some example programs blocking CVEs that have popped up. I think of the
>> idea like nftables for kernel syscalls, storage, and the network stack.
>
> looking forward to more details on checmate, so far I'm convinced we need it.
>
>> The other example I want to show is implementing Docker-bridge style network
>> isolation with Checmate. Most folks use it to map ports, and to restrict binding
>> to specific ports, and not the dedicated network namespace, or loopback
>> interface. It turns out for some applications this comes at a pretty significant
>> hit[2][3], as well as awkward upper bounds based on conntrack.
>
> the default nat setup of docker is obviously slow, but that doesn't mean
> kernel needs anything more than it already has.
> If you're at linuxcon this year, Thomas's talk [4] shouldn't be missed.
>
>>>>> This looks like an alternative to lsm patches submitted earlier?
>>>> No. But I would like to use this helper in the LSM patches I'm working on. For
>>>> now, with those patches, and this helper, I can create a map sized 1, and add
>>>> the cgroup I care about to it. Given I can add as many bpf programs to an LSM
>>>> hook I want, I can use this mechanism to "attach BPF programs to cgroups" --
>>>> I put that in quotes because you're not really attaching it to a cgroup,
>>>> but just burning some instructions on checking it.
>>>
>>> how many cgroups will you need to check? The current bpf_skb_in_cgroup()
>>> suffers similar scaling issues.
>>> I think the proper restriction/enforcement could be done via attaching bpf
>>> program to a cgroup. These patches are being worked on Daniel Mack cc-ed.
>>> Then bpf program will be able to enforce networking behavior of applications
>>> in cgroups.
>>> For global container analytics I think we need something that converts
>>> current to cgroup_id or cgroup_handle. I don't think descendant check
>>> can scale for such use case.
>>>
>> Usually there's a top level cgroup for a container, and then cgroup for each
>> subprocess, and maybe a third level if that fans out to multiple workers (See:
>> unicorn). I see your point though about scalability, or performance issues. I
>> still think a current_is_cgroup (vs in_cgroup) call would be really nice.
>> Though, if we have a current_cgroup_id helper, it introduces the problem that if
>> there is churn in cgroups, the ID may be reassigned. There still needs to be a
>> way to keep the reference, and perhaps we just make a helper to convert cgroup
>> map entires into IDs.
>
> agree. good points.
> Looking forward for opensnoop+bpf_current_in_cgroup patch.
> Naming-wise may be bpf_current_task_in_cgroup is a better name?
>
>> The approach I took in the Checmate patches allows for "attachment" to a uts
>> namespace, which are perhaps the lightest, and simplest namespaces. Maybe that's
>> the right direction to go, but I'm looking forward to seeing Daniel's patches.
>>
>> -Thanks,
>> Sargun
>>
>> [1] https://lkml.org/lkml/2016/8/4/58
>> [2] https://www.percona.com/blog/2016/02/11/measuring-docker-io-overhead/
>> [3] http://blog.pierreroudier.net/wp-content/uploads/2015/08/rc25482.pdf (warning: PDF)
>
> [4] https://lcccna2016.sched.org/event/7JUl/fast-ipv6-only-networking-for-containers-based-on-bpf-and-xdp-thomas-graf-cisco
>


* Re: [net-next 0/2] BPF, kprobes: Add current_in_cgroup helper
  2016-08-08  9:27           ` Daniel Borkmann
@ 2016-08-08 18:09             ` Sargun Dhillon
  0 siblings, 0 replies; 8+ messages in thread
From: Sargun Dhillon @ 2016-08-08 18:09 UTC (permalink / raw)
  To: Daniel Borkmann; +Cc: Alexei Starovoitov, netdev, daniel, Thomas Graf, aravinda

On Mon, Aug 08, 2016 at 11:27:32AM +0200, Daniel Borkmann wrote:
> On 08/08/2016 05:52 AM, Alexei Starovoitov wrote:
> >On Sun, Aug 07, 2016 at 08:08:19PM -0700, Sargun Dhillon wrote:
> >>Thanks for your feedback Alexei,
> >>I really appreciate it.
> >>
> >>On Sun, Aug 07, 2016 at 05:52:36PM -0700, Alexei Starovoitov wrote:
> >>>On Sat, Aug 06, 2016 at 09:56:06PM -0700, Sargun Dhillon wrote:
> >>>>On Sat, Aug 06, 2016 at 09:32:05PM -0700, Alexei Starovoitov wrote:
> >>>>>On Sat, Aug 06, 2016 at 09:06:53PM -0700, Sargun Dhillon wrote:
> >>>>>>This patchset includes a helper and an example to determine whether the kprobe
> >>>>>>is currently executing in the context of a specific cgroup based on a cgroup
> >>>>>>bpf map / array.
> >>>>>
> >>>>>description is too short to understand how this new helper is going to be used.
> >>>>>depending on kprobe current is not always valid.
> >>>>Anything not in in_interrupt() should have a current, right?
> >>>>
> >>>>>what are you trying to achieve?
> >>>>This is primarily to help troubleshoot containers (Docker, and now systemd). A
> >>>>lot of the time we want to determine what's going on in a given container
> >>>>(opening files, connecting to systems, etc...). There's not really a great way
> >>>>to restrict to containers except by manually walking datastructures to check for
> >>>>the right cgroup. This seems like a better alternative.
> >>>
> >>>so it's about restricting or determining?
> >>>In other words if it's analytics/tracing that's one thing, but
> >>>enforcement/restriction is quite different.
> >>>For analytics one can walk task_css_set(current)->dfl_cgrp and remember
> >>>that pointer in a map or something for stats collections and similar.
> >>>If it's restricting apps in containers then kprobe approach
> >>>is not usable. I don't think you'd want to built an enforcement system
> >>>on an unstable api then can vary kernel-to-kernel.
> >>>
> >>The first real-world use case are to implement something like Sysdig. Often the
> >>team running the team running the containers don't always know what's inside of
> >>them, so they want to be able to view network, I/O, and other activity by
> >>container. Right now, the lowest common denominator between all of the
> >>containerization techniques is cgroups. We've seen examples of where a admin is
> >>unsure of the workload, and would love to use opensnoop, but there are too many
> >>workloads on the machine.
> >
> >Indeed it would be a useful feature to teach opensnoop to filter by a cgroup
> >and all descentants of it. If you can prepare a patch for it that would be
> >a strong use case for this bpf_current_in_cgroup helper and solid justification
> >to accept it in the kernel.
> >Something like cgroupv2 string path as an argument ?
> 
> How does this integrate with cgroup namespaces? Your current helper would only
> look at the cgroup in your current namespace, no? Or would the program populating
> the map temporarily switch into other namespaces?
> 
The BPF program is namespace-oblivious. If you had multiple cgroup namespaces, 
you'd have to open an fd for the other namespace's cgroup to populate the map. I 
see this as more of a userspace problem.

> What about cases where cgroup could be shared among other (net, ..) namespaces,
> BPF program would still not be namespace aware to sort these things out?
> 
I'm not sure what you're getting at. It sounds like being "namespace aware" 
either means that during probe installation you restrict the probe to a given 
namespace, or you have another helper that allows you to check the namespace 
you're in. Would a second helper and arraymap type address this? If so, I'd 
rather that be separate work.

> You'll also have the issue, for example, that bpf_perf_event_read() counters
> are global, combining them with cgroups helper in a program would lead to false
> expectations (in the sense that they might also be assumed for that cgroup), or
> do you have a way to tackle that as well (at least SW events, since HW should not
> be possible)?
> 
> Btw, there's slightly related work from IBM folks (but to run it from within a
> container; there was a v2 recently I recall):
> 
>   https://lkml.org/lkml/2016/6/14/547
> 
I'm not sure how to avoid the aforementioned problem, but I'm not really sure 
it's a problem. Perhaps perf namespaces are the right way to go, but do you have 
a suggestion for the opensnoop-style problem?

> >>Unfortunately, I don't think that it's possible just to check
> >>task_css_set(current)->dfl_cgrp in a bpf program. The container, especially
> >>containers with sidecars (what Kubernetes calls Pods, I believe?) tend to have
> >>multiple nested cgroups inside of them. If you had a way to convert cgroup array
> >>entries to pointers, I imagine you could write an unrolled loop to check for
> >>ownership within a limited range.
> >>
> >>I'm still looking for comments from the LSM folks on Checmate[1]. It appears
> >>that there has been very little churn in the LSM hooks API that's API-breaking.
> >>For many of syscall hooks, they're closely tied to the syscall API, so they
> >>can't really change too much. I think that a toolkit like iovisor, or another
> >>userland translation layer, these hooks could be very powerful. I would love to
> >>hear feedback from the LSM folks.
> >>
> >>My plan with those patches is to reimplement Yama, and Hardchroot in BPF
> >>programs to show off the potential capabilities of Checmate. I'd also like to
> >>create some example programs blocking CVEs that have popped up. I think of the
> >>idea like nftables for kernel syscalls, storage, and the network stack.
> >
> >looking forward to more details on checmate, so far I'm convinced we need it.
> >
> >>The other example I want to show is implementing Docker-bridge style network
> >>isolation with Checmate. Most folks use it to map ports, and to restrict binding
> >>to specific ports, and not the dedicated network namespace, or loopback
> >>interface. It turns out for some applications this comes at a pretty significant
> >>hit[2][3], as well as awkward upper bounds based on conntrack.
> >
> >the default nat setup of docker is obviously slow, but that doesn't mean
> >kernel needs anything more than it already has.
> >If you're at linuxcon this year, Thomas's talk [4] shouldn't be missed.
> >
> >>>>>This looks like an alternative to lsm patches submitted earlier?
> >>>>No. But I would like to use this helper in the LSM patches I'm working on. For
> >>>>now, with those patches, and this helper, I can create a map sized 1, and add
> >>>>the cgroup I care about to it. Given I can add as many bpf programs to an LSM
> >>>>hook I want, I can use this mechanism to "attach BPF programs to cgroups" --
> >>>>I put that in quotes because you're not really attaching it to a cgroup,
> >>>>but just burning some instructions on checking it.
> >>>
> >>>how many cgroups will you need to check? The current bpf_skb_in_cgroup()
> >>>suffers similar scaling issues.
> >>>I think the proper restriction/enforcement could be done via attaching bpf
> >>>program to a cgroup. These patches are being worked on Daniel Mack cc-ed.
> >>>Then bpf program will be able to enforce networking behavior of applications
> >>>in cgroups.
> >>>For global container analytics I think we need something that converts
> >>>current to cgroup_id or cgroup_handle. I don't think descendant check
> >>>can scale for such use case.
> >>>
> >>Usually there's a top level cgroup for a container, and then cgroup for each
> >>subprocess, and maybe a third level if that fans out to multiple workers (See:
> >>unicorn). I see your point though about scalability, or performance issues. I
> >>still think a current_is_cgroup (vs in_cgroup) call would be really nice.
> >>Though, if we have a current_cgroup_id helper, it introduces the problem that if
> >>there is churn in cgroups, the ID may be reassigned. There still needs to be a
> >>way to keep the reference, and perhaps we just make a helper to convert cgroup
> >>map entires into IDs.
> >
> >agree. good points.
> >Looking forward for opensnoop+bpf_current_in_cgroup patch.
> >Naming-wise may be bpf_current_task_in_cgroup is a better name?
> >
> >>The approach I took in the Checmate patches allows for "attachment" to a uts
> >>namespace, which are perhaps the lightest, and simplest namespaces. Maybe that's
> >>the right direction to go, but I'm looking forward to seeing Daniel's patches.
> >>
> >>-Thanks,
> >>Sargun
> >>
> >>[1] https://lkml.org/lkml/2016/8/4/58
> >>[2] https://www.percona.com/blog/2016/02/11/measuring-docker-io-overhead/
> >>[3] http://blog.pierreroudier.net/wp-content/uploads/2015/08/rc25482.pdf (warning: PDF)
> >
> >[4] https://lcccna2016.sched.org/event/7JUl/fast-ipv6-only-networking-for-containers-based-on-bpf-and-xdp-thomas-graf-cisco
> >
> 
