linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* (no subject)
@ 2017-01-17  5:18 Andy Lutomirski
  2017-01-18 22:41 ` Potential issues (security and otherwise) with the current cgroup-bpf API Tejun Heo
  0 siblings, 1 reply; 13+ messages in thread
From: Andy Lutomirski @ 2017-01-17  5:18 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Michal Hocko, Peter Zijlstra, David Ahern, Alexei Starovoitov,
	Andy Lutomirski, Daniel Mack, Mickaël Salaün,
	Kees Cook, Jann Horn, David S. Miller, Thomas Graf,
	Michael Kerrisk, Linux API, linux-kernel, Network Development

;; This buffer is for text that is not saved, and for Lisp evaluation.
;; To create a file, visit it with C-x C-f and enter text in its buffer.



On Sun, Jan 15, 2017 at 5:19 PM, Tejun Heo <tj@kernel.org> wrote:
> Hello,
>
> Sorry about the delay.  Some fire fighthing followed the holidays.
>
> On Tue, Jan 03, 2017 at 11:25:59AM +0100, Michal Hocko wrote:
>> > So from what I understand the proposed cgroup is not in fact
>> > hierarchical at all.
>> >
>> > @TJ, I thought you were enforcing all new cgroups to be properly
>> > hierarchical, that would very much include this one.
>>
>> I would be interested in that as well. We have made that mistake in
>> memcg v1 where hierarchy could be disabled for performance reasons and
>> that turned out to be major PITA in the end. Why do we want to repeat
>> the same mistake here?
>
> Across the different threads on this subject, there have been multiple
> explanations but I'll try to sum it up more clearly.
>
> The big issue here is whether this is a cgroup thing or a bpf thing.
> I don't think there's anything inherently wrong with one approach or
> the other.  Forget about the proposed cgroup bpf extentions but thinkg
> about how iptables does cgroups.  Whether it's the netcls/netprio in
> v1 or direct membership matching in v2, it is the network side testing
> for cgroup membership one way or the other.  The only part where
> cgroup is involved in is answering that test.
>

[...]

>
> None of the issues that people have been raising here is actually an
> issue if one thinks of it as a part of bpf.  Its security model is
> exactly the same as any other bpf programs.  Recursive behavior is
> exactly the same as how other external cgroup descendant membership
> testing work.  There is no issue here whatsoever.

After sleeping on this, here are my thoughts:

First, there are three ways to think about this, not just two.  It
could be a BPF feature, a net feature, or a cgroup feature.

I think that calling it a BPF feature is a cop-out.  BPF is an
assembly language and a mechanism for importing little programs into
the kernel.  BPF maps are a BPF feature.  These new hooks are a
feature that actively changes the behavior of other parts of the
kernel.  I don't see how calling this new feature a "BPF" feature
excuses it from playing by the expected rules of the subsystems it
affects or from generally working well with the rest of the kernel.

Perhaps this is a net feature, though, not a cgroup feature.  This
would IMO make a certain amount of sense.  Iptables cgroup matches,
for example, logically are an iptables (i.e., net) feature.  The
problem here is that network configuration (and this explicitly
includes iptables) is done on a per-netns level, whereas these new
hooks entirely ignore network namespaces.  I've pointed out that
trying to enable them in a namespaced world is problematic (e.g.
switching to ns_capable() will cause serious problems), and Alexei
seems to think that this will never happen.  So I don't think we can
really call this a net feature that works the way that other net
features work.

(Suppose that this actually tried to be netns-enabled.  Then you'd
have to map from (netns, cgroup) -> hook, and this would probably be
slow and messy.)

So I think that leaves it being a cgroup feature.  And it definitely
does *not* play by the rules of cgroups right now.

> I'm sure we'll
> eventually get there but from what I hear it isn't something we can
> pull off in a restricted timeframe.

To me, this sounds like "the API isn't that great, we know how to do
better, but we really really want this feature ASAP so we want to ship
it with a sub-par API."  I think that's a bad idea.

> This also holds true for the perf controller.  While it is implemented
> as a controller, it isn't visible to cgroup users in any way and the
> only function it serves is providing the membership test to perf
> subsystem.  perf is the one which decides whether and how it is to be
> used.  cgroup providing membership test to other subsystems is
> completely acceptable and established.

Unless I'm mistaken, "perf_event" is an actual cgroup controller, and
it explicitly respects the cgroup hierarchy.  It shows up in
/proc/cgroups, and I had no problem mounting a cgroupfs instance with
perf_event enabled.  So I'm not sure what you mean.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Potential issues (security and otherwise) with the current cgroup-bpf API
  2017-01-17  5:18 Andy Lutomirski
@ 2017-01-18 22:41 ` Tejun Heo
  2017-01-19  0:18   ` Andy Lutomirski
  0 siblings, 1 reply; 13+ messages in thread
From: Tejun Heo @ 2017-01-18 22:41 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Michal Hocko, Peter Zijlstra, David Ahern, Alexei Starovoitov,
	Andy Lutomirski, Daniel Mack, Mickaël Salaün,
	Kees Cook, Jann Horn, David S. Miller, Thomas Graf,
	Michael Kerrisk, Linux API, linux-kernel, Network Development

Helloo, Andy.

On Mon, Jan 16, 2017 at 09:18:36PM -0800, Andy Lutomirski wrote:
> Perhaps this is a net feature, though, not a cgroup feature.  This
> would IMO make a certain amount of sense.  Iptables cgroup matches,
> for example, logically are an iptables (i.e., net) feature.  The

Yeap.

> problem here is that network configuration (and this explicitly
> includes iptables) is done on a per-netns level, whereas these new
> hooks entirely ignore network namespaces.  I've pointed out that
> trying to enable them in a namespaced world is problematic (e.g.
> switching to ns_capable() will cause serious problems), and Alexei
> seems to think that this will never happen.  So I don't think we can
> really call this a net feature that works the way that other net
> features work.
> 
> (Suppose that this actually tried to be netns-enabled.  Then you'd
> have to map from (netns, cgroup) -> hook, and this would probably be
> slow and messy.)

I'm not sure how important netns support for these bpf programs
actually is but I have no objection to making it behave in an
equivalent manner to iptables where appropriate.  This is kinda
trivial to do too.  For now, we can simply not run the programs if the
associated socket belongs to non-root netns.  Later, if necessary, we
can add per-ns bpf programs.  And I can't think of a reason why
(netns, cgroup) -> function mapping would be slow and messy.  Why
would that be?

> So I think that leaves it being a cgroup feature.  And it definitely
> does *not* play by the rules of cgroups right now.

Because this in no way is a cgroup feature.  As you wrote above, it is
something similar to iptables with lacking netns considerations.
Let's address that and make it a proper network thing.

> > I'm sure we'll
> > eventually get there but from what I hear it isn't something we can
> > pull off in a restricted timeframe.
> 
> To me, this sounds like "the API isn't that great, we know how to do
> better, but we really really want this feature ASAP so we want to ship
> it with a sub-par API."  I think that's a bad idea.

I see your point but this isn't something which is black and white.
There are arguments for both directions.  I'm leaning the other
direction because

* My impression with bpf is that while delegation is something
  theoretically possible it is not something which is gonna take place
  in any immediate time frame.  If I'm wrong on this, please feel free
  to correct me.

* There are a lot of use cases which can immediately take advantage of
  cgroup-aware bpf programs operating as proposed.  The model is
  semantically equivalent to iptables (let's address the netns part if
  that's an issue) which net people are familiar with.

* It isn't exclusive with adding cgroup bpf controller down the road
  if necessary and practical.  Sure, this isn't the perfect situation
  but it seems like an acceptable trade-off to me.  What ever is
  perfect?

> > This also holds true for the perf controller.  While it is implemented
> > as a controller, it isn't visible to cgroup users in any way and the
> > only function it serves is providing the membership test to perf
> > subsystem.  perf is the one which decides whether and how it is to be
> > used.  cgroup providing membership test to other subsystems is
> > completely acceptable and established.
> 
> Unless I'm mistaken, "perf_event" is an actual cgroup controller, and

Yeah, it is implemented as a controller but in essence what it does is
just tagging the hierarchy to tell perf to use that hierarchy for
membership test purposes.

> it explicitly respects the cgroup hierarchy.  It shows up in
> /proc/cgroups, and I had no problem mounting a cgroupfs instance with
> perf_event enabled.  So I'm not sure what you mean.

That all it's doing is providing membership information.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Potential issues (security and otherwise) with the current cgroup-bpf API
  2017-01-18 22:41 ` Potential issues (security and otherwise) with the current cgroup-bpf API Tejun Heo
@ 2017-01-19  0:18   ` Andy Lutomirski
  2017-01-19  0:59     ` Tejun Heo
  2017-01-19  1:01     ` Mickaël Salaün
  0 siblings, 2 replies; 13+ messages in thread
From: Andy Lutomirski @ 2017-01-19  0:18 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Michal Hocko, Peter Zijlstra, David Ahern, Alexei Starovoitov,
	Andy Lutomirski, Daniel Mack, Mickaël Salaün,
	Kees Cook, Jann Horn, David S. Miller, Thomas Graf,
	Michael Kerrisk, Linux API, linux-kernel, Network Development

On Wed, Jan 18, 2017 at 2:41 PM, Tejun Heo <tj@kernel.org> wrote:
> Helloo, Andy.
>
> On Mon, Jan 16, 2017 at 09:18:36PM -0800, Andy Lutomirski wrote:
>> Perhaps this is a net feature, though, not a cgroup feature.  This
>> would IMO make a certain amount of sense.  Iptables cgroup matches,
>> for example, logically are an iptables (i.e., net) feature.  The
>
> Yeap.
>
>> problem here is that network configuration (and this explicitly
>> includes iptables) is done on a per-netns level, whereas these new
>> hooks entirely ignore network namespaces.  I've pointed out that
>> trying to enable them in a namespaced world is problematic (e.g.
>> switching to ns_capable() will cause serious problems), and Alexei
>> seems to think that this will never happen.  So I don't think we can
>> really call this a net feature that works the way that other net
>> features work.
>>
>> (Suppose that this actually tried to be netns-enabled.  Then you'd
>> have to map from (netns, cgroup) -> hook, and this would probably be
>> slow and messy.)
>
> I'm not sure how important netns support for these bpf programs
> actually is but I have no objection to making it behave in an
> equivalent manner to iptables where appropriate.  This is kinda
> trivial to do too.  For now, we can simply not run the programs if the
> associated socket belongs to non-root netns.  Later, if necessary, we
> can add per-ns bpf programs.  And I can't think of a reason why
> (netns, cgroup) -> function mapping would be slow and messy.  Why
> would that be?

To map cgroup -> hook, a simple array in each cgroup structure works.
To map (cgroup, netns) -> hook function, the natural approach would be
to have some kind of hash, and that would be slower.  This code is
intended to be *fast*.

I suppose you could have each cgroup structure contain an array where
each element is (netns, hook function) and just skip the ones that are
the wrong netns.  Perhaps it would be rare to have many of those.

>
>> So I think that leaves it being a cgroup feature.  And it definitely
>> does *not* play by the rules of cgroups right now.
>
> Because this in no way is a cgroup feature.  As you wrote above, it is
> something similar to iptables with lacking netns considerations.
> Let's address that and make it a proper network thing.

I think it's currently a broken cgroup feature.  I think it could be
made into a properly working cgroup feature (which I favor) or a
properly working net feature.  Can you articulate why you prefer
having it be a net feature instead of a cgroup feature?  I tend to
favor making it be a working cgroup feature (by giving it a file or
files in cgroupfs and maybe even a controller) because the result
would have very obvious semantics.

But maybe this is just a false dichotomy.  Could this feature be a
per-netns configuration where you can give instructions like "run this
hook if you're in such-and-such netns and in a given cgroup or one of
its descendents"?  This would prevent it from being a direct analogue
to systemd's RestrictAddressFamilies= option, but that may be okay.
This would side-step issues in the current code where a hook can't
rely on ifindex meaning what the hook thinks it means.

Actually, I think I like that approach.  What if it we had a "socket"
controller and files like "socket.create_socket", "socket.ingress" and
"socket.egress"?  You could use the bpf() syscall to install a bpf
filter into "socket.egress" and that filter would filter egress for
the target cgroup and its descendents on the current netns.  As a
first pass, the netns issue could be sidestepped by making it only
work in the root netns (i.e. the bpf() call would fail if you're not
in the root netns and the hooks wouldn't run if you're not in the root
netns.)  It *might* even be possible to retrofit in the socket
controller by saying that the default hierarchy is used if the socket
controller isn't mounted.

What *isn't* possible to cleanly fix after the fact is the current
semantics that cgroup hooks override the hooks in their ancestors.
IMO that is simply broken.  The example you gave (perf_event) is very
careful not to have this problem.

>
>> > I'm sure we'll
>> > eventually get there but from what I hear it isn't something we can
>> > pull off in a restricted timeframe.
>>
>> To me, this sounds like "the API isn't that great, we know how to do
>> better, but we really really want this feature ASAP so we want to ship
>> it with a sub-par API."  I think that's a bad idea.
>
> I see your point but this isn't something which is black and white.
> There are arguments for both directions.  I'm leaning the other
> direction because
>
> * My impression with bpf is that while delegation is something
>   theoretically possible it is not something which is gonna take place
>   in any immediate time frame.  If I'm wrong on this, please feel free
>   to correct me.

But the issue isn't *BPF* delegation.  It's cgroup delegation or netns
creation, both of which exist today.

>
> * There are a lot of use cases which can immediately take advantage of
>   cgroup-aware bpf programs operating as proposed.  The model is
>   semantically equivalent to iptables (let's address the netns part if
>   that's an issue) which net people are familiar with.

That sounds like a great argument for those users to patch their
kernels to gain experience with this feature.

>
> * It isn't exclusive with adding cgroup bpf controller down the road
>   if necessary and practical.  Sure, this isn't the perfect situation
>   but it seems like an acceptable trade-off to me.  What ever is
>   perfect?

I think it more or less is exclusive.  I can imagine davem accepting a
patch to make the sock_cgroup_data think point at a new "socket"
cgroup type.  I can't imagine him accepting a patch to have sockets
have more than one pointer to a cgroup, with good reason.

I don't see what's added by having a "bpf" cgroup controller, though
-- it's just too broad.  If the Landlock stuff goes in, that should
presumably be either tied to the default hierarchy or use an "lsm"
controller.  A "bpf" controller for that makes no sense to me.

I can see *huge* value in having some combination of BPF and the
perf_event controller deciding whether to log perf events, for
example.  But this would be the perf_event controller, not a
hypothetical "bpf" controller.

>
>> > This also holds true for the perf controller.  While it is implemented
>> > as a controller, it isn't visible to cgroup users in any way and the
>> > only function it serves is providing the membership test to perf
>> > subsystem.  perf is the one which decides whether and how it is to be
>> > used.  cgroup providing membership test to other subsystems is
>> > completely acceptable and established.
>>
>> Unless I'm mistaken, "perf_event" is an actual cgroup controller, and
>
> Yeah, it is implemented as a controller but in essence what it does is
> just tagging the hierarchy to tell perf to use that hierarchy for
> membership test purposes.

It's also a rather innocuous controller in that it doesn't change the
behavior of the controlled tasks.  The bpf hooks definitely do change
behavior.

>
>> it explicitly respects the cgroup hierarchy.  It shows up in
>> /proc/cgroups, and I had no problem mounting a cgroupfs instance with
>> perf_event enabled.  So I'm not sure what you mean.
>
> That all it's doing is providing membership information.

But it's doing it wrong!  Even perf_event tests for membership in a
given cgroup *or one of its descendents*.  This code does not.

I think the moral of the story here is that there are lots of open
questions and design work to be done and that this feature really
isn't ready to be stable.  For Landlock, I believe that it really
needs to be done right and I will put my foot down and NAK any effort
to have Landlock available in a released kernel without resolving
these types of issues first.  Does anyone really want Landlock to work
differently than the net hooks simply because the net hooks were in a
rush?

--Andy

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Potential issues (security and otherwise) with the current cgroup-bpf API
  2017-01-19  0:18   ` Andy Lutomirski
@ 2017-01-19  0:59     ` Tejun Heo
  2017-01-19  2:29       ` Andy Lutomirski
  2017-01-19  1:01     ` Mickaël Salaün
  1 sibling, 1 reply; 13+ messages in thread
From: Tejun Heo @ 2017-01-19  0:59 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Michal Hocko, Peter Zijlstra, David Ahern, Alexei Starovoitov,
	Andy Lutomirski, Daniel Mack, Mickaël Salaün,
	Kees Cook, Jann Horn, David S. Miller, Thomas Graf,
	Michael Kerrisk, Linux API, linux-kernel, Network Development

Hello, Andy.

On Wed, Jan 18, 2017 at 04:18:04PM -0800, Andy Lutomirski wrote:
> To map cgroup -> hook, a simple array in each cgroup structure works.
> To map (cgroup, netns) -> hook function, the natural approach would be
> to have some kind of hash, and that would be slower.  This code is
> intended to be *fast*.

Updating these programs isn't a frequent operation.  We can easily
calculate a perfect (or acceptable) hash per-cgroup and rcu swap the
new hashtable.

> I suppose you could have each cgroup structure contain an array where
> each element is (netns, hook function) and just skip the ones that are
> the wrong netns.  Perhaps it would be rare to have many of those.

Yeah, or maybe even that.  It just isn't a fundamentally difficult
problem.

> I think it's currently a broken cgroup feature.  I think it could be
> made into a properly working cgroup feature (which I favor) or a
> properly working net feature.  Can you articulate why you prefer
> having it be a net feature instead of a cgroup feature?  I tend to

I thought that's what I was doing in the previous message.

> favor making it be a working cgroup feature (by giving it a file or
> files in cgroupfs and maybe even a controller) because the result
> would have very obvious semantics.

I'm fine with both directions but one seems far out.

> But maybe this is just a false dichotomy.  Could this feature be a
> per-netns configuration where you can give instructions like "run this
> hook if you're in such-and-such netns and in a given cgroup or one of
> its descendents"?  This would prevent it from being a direct analogue
> to systemd's RestrictAddressFamilies= option, but that may be okay.
> This would side-step issues in the current code where a hook can't
> rely on ifindex meaning what the hook thinks it means.

How is this different from making the current code netns aware?

> Actually, I think I like that approach.  What if it we had a "socket"
> controller and files like "socket.create_socket", "socket.ingress" and
> "socket.egress"?  You could use the bpf() syscall to install a bpf
> filter into "socket.egress" and that filter would filter egress for
> the target cgroup and its descendents on the current netns.  As a
> first pass, the netns issue could be sidestepped by making it only
> work in the root netns (i.e. the bpf() call would fail if you're not
> in the root netns and the hooks wouldn't run if you're not in the root
> netns.)  It *might* even be possible to retrofit in the socket
> controller by saying that the default hierarchy is used if the socket
> controller isn't mounted.

I don't know.  You're throwing out too many ideas too fast and it's
difficult to keep up with what the differences are between those
ideas.  But if we're doing cgroup controllers, shouldn't cgroup ns
support be sufficient?  We can consider the product of cgroup and net
namespaces but that doesn't seem necessary given that people usually
set up these namespaces in conjunction.

> What *isn't* possible to cleanly fix after the fact is the current
> semantics that cgroup hooks override the hooks in their ancestors.
> IMO that is simply broken.  The example you gave (perf_event) is very
> careful not to have this problem.

That's like saying installing an iptables rule for a more specific
target is broken.  As a cgroup controller, it is not an acceptable
behavior given how delegation works.  As something similar to
iptables, it is completely fine.

> > * My impression with bpf is that while delegation is something
> >   theoretically possible it is not something which is gonna take place
> >   in any immediate time frame.  If I'm wrong on this, please feel free
> >   to correct me.
> 
> But the issue isn't *BPF* delegation.  It's cgroup delegation or netns
> creation, both of which exist today.

No, the issue is bpf delegation.  If bpf were fully delegatable in
practical sense, we could just do straight forward cgroup bpf
controller.  Well, we'll have to think about how to chain the programs
which would differ on program type but that shouldn't be too hard.

> > * There are a lot of use cases which can immediately take advantage of
> >   cgroup-aware bpf programs operating as proposed.  The model is
> >   semantically equivalent to iptables (let's address the netns part if
> >   that's an issue) which net people are familiar with.
> 
> That sounds like a great argument for those users to patch their
> kernels to gain experience with this feature.

I don't get why this would particularly point to out-of-tree usage.
Why is that?

> > * It isn't exclusive with adding cgroup bpf controller down the road
> >   if necessary and practical.  Sure, this isn't the perfect situation
> >   but it seems like an acceptable trade-off to me.  What ever is
> >   perfect?
> 
> I think it more or less is exclusive.  I can imagine davem accepting a
> patch to make the sock_cgroup_data think point at a new "socket"
> cgroup type.  I can't imagine him accepting a patch to have sockets
> have more than one pointer to a cgroup, with good reason.

Why would it ever need multiple pointers?  Socket is associated with
one cgroup whether it's this or controller thing.  Membership part
doesn't change.  It can share everything.

> I don't see what's added by having a "bpf" cgroup controller, though
> -- it's just too broad.  If the Landlock stuff goes in, that should
> presumably be either tied to the default hierarchy or use an "lsm"
> controller.  A "bpf" controller for that makes no sense to me.
>
> I can see *huge* value in having some combination of BPF and the
> perf_event controller deciding whether to log perf events, for
> example.  But this would be the perf_event controller, not a
> hypothetical "bpf" controller.

Let's not spiral out.  This isn't relevant to the discussion.  Call it
whatever you want.

> But it's doing it wrong!  Even perf_event tests for membership in a
> given cgroup *or one of its descendents*.  This code does not.
> 
> I think the moral of the story here is that there are lots of open
> questions and design work to be done and that this feature really
> isn't ready to be stable.  For Landlock, I believe that it really
> needs to be done right and I will put my foot down and NAK any effort
> to have Landlock available in a released kernel without resolving
> these types of issues first.  Does anyone really want Landlock to work
> differently than the net hooks simply because the net hooks were in a
> rush?

No idea about landlock but as for the cgroup aware network bpf
programs, let's please try to narrow down the arguments.  Here's one
question.

* If we make the current thing netns aware, how is it different from
  iptables?  Note that at least future-proofing for netns is trivial.
  We can simply skip running the bpf programs for non-root ns for now.

If the answer is "they aren't that different", is the reason that
you're against the current code because you think that it being a part
of cgroup would be more useful?  I'm not necessarily against that
point, I just think that this is a reasonable trade-off given the
circumstances especially given that this doens't block having the
cgroup things in the future.

I'd really appreciate some inputs from bpf folks.  I'm basing my
judgement primarily on the impression that it is very difficult to
make bpf programs practically delegatable.  If that's not the case, we
can easily go the cgroup controller route.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Potential issues (security and otherwise) with the current cgroup-bpf API
  2017-01-19  0:18   ` Andy Lutomirski
  2017-01-19  0:59     ` Tejun Heo
@ 2017-01-19  1:01     ` Mickaël Salaün
  1 sibling, 0 replies; 13+ messages in thread
From: Mickaël Salaün @ 2017-01-19  1:01 UTC (permalink / raw)
  To: Andy Lutomirski, Tejun Heo
  Cc: Michal Hocko, Peter Zijlstra, David Ahern, Alexei Starovoitov,
	Andy Lutomirski, Daniel Mack, Kees Cook, Jann Horn,
	David S. Miller, Thomas Graf, Michael Kerrisk, Linux API,
	linux-kernel, Network Development


[-- Attachment #1.1: Type: text/plain, Size: 2265 bytes --]


On 19/01/2017 01:18, Andy Lutomirski wrote:
>>> it explicitly respects the cgroup hierarchy.  It shows up in
>>> /proc/cgroups, and I had no problem mounting a cgroupfs instance with
>>> perf_event enabled.  So I'm not sure what you mean.
>>
>> That all it's doing is providing membership information.
> 
> But it's doing it wrong!  Even perf_event tests for membership in a
> given cgroup *or one of its descendents*.  This code does not.
> 
> I think the moral of the story here is that there are lots of open
> questions and design work to be done and that this feature really
> isn't ready to be stable.  For Landlock, I believe that it really
> needs to be done right and I will put my foot down and NAK any effort
> to have Landlock available in a released kernel without resolving
> these types of issues first.  Does anyone really want Landlock to work
> differently than the net hooks simply because the net hooks were in a
> rush?

About Landlock, there is two use cases:

The first is to allow unprivileged users to tie eBPF programs (rules) to
processes. This is the (final) goal. In this case, a (cgroup) hierarchy
is mandatory, otherwise it would be trivial to defeat any access rule.
This is the same issue with namespaces.

The second use case is to only allow privileged users to tie eBPF
programs to processes. As discussed, this will be the next series,
preceding the unprivileged series. In this privilege case, only root
(global CAP_SYS_ADMIN, no namespaces) may be able to use Landlock. Not
having a hierarchy is not a security issue (only a practical one).

The first/next Landlock series (in February) will focus on process
hierarchies (without cgroup), a la seccomp-bpf. It would be too
confusing to not use an inheritable hierarchy like seccomp does, even in
a privileged-only first approach. The inherited rules should then behave
similarly to the seccomp-bpf filters.

However, the following series focusing on cgroup could keep the current
cgroup-bpf behavior, without hierarchy. I don't like the non-hierarchy
approach very much because it add another (less flexible and more
confusing) way to do things (for Landlock at least), but I'm willing to
do it if needed.

Regards,
 Mickaël


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Potential issues (security and otherwise) with the current cgroup-bpf API
  2017-01-19  0:59     ` Tejun Heo
@ 2017-01-19  2:29       ` Andy Lutomirski
  2017-01-20  2:39         ` Alexei Starovoitov
  0 siblings, 1 reply; 13+ messages in thread
From: Andy Lutomirski @ 2017-01-19  2:29 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Michal Hocko, Peter Zijlstra, David Ahern, Alexei Starovoitov,
	Andy Lutomirski, Daniel Mack, Mickaël Salaün,
	Kees Cook, Jann Horn, David S. Miller, Thomas Graf,
	Michael Kerrisk, Linux API, linux-kernel, Network Development

On Wed, Jan 18, 2017 at 4:59 PM, Tejun Heo <tj@kernel.org> wrote:
> Hello, Andy.
>
> On Wed, Jan 18, 2017 at 04:18:04PM -0800, Andy Lutomirski wrote:
>> To map cgroup -> hook, a simple array in each cgroup structure works.
>> To map (cgroup, netns) -> hook function, the natural approach would be
>> to have some kind of hash, and that would be slower.  This code is
>> intended to be *fast*.
>
> Updating these programs isn't a frequent operation.  We can easily
> calculate a perfect (or acceptable) hash per-cgroup and rcu swap the
> new hashtable.
>

Fair enough.

>
>> I think it's currently a broken cgroup feature.  I think it could be
>> made into a properly working cgroup feature (which I favor) or a
>> properly working net feature.  Can you articulate why you prefer
>> having it be a net feature instead of a cgroup feature?  I tend to
>
> I thought that's what I was doing in the previous message.
>
>> favor making it be a working cgroup feature (by giving it a file or
>> files in cgroupfs and maybe even a controller) because the result
>> would have very obvious semantics.
>
> I'm fine with both directions but one seems far out.

I thought you were saying why you thought it wasn't a cgroup feature,
but I'm not sure I understand why you thought it shouldn't be a cgroup
feature.

When I chatted with Alexei, I had the impression that you and he had
wanted it to be a cgroup controller but thought it wouldn't work well.
I think it could work by making a single socket cgroup controller that
handles all cgroup things that are bound to a socket.  Using
individual files would play nicer with cgroup delegation within a
single netns.

>
>> But maybe this is just a false dichotomy.  Could this feature be a
>> per-netns configuration where you can give instructions like "run this
>> hook if you're in such-and-such netns and in a given cgroup or one of
>> its descendents"?  This would prevent it from being a direct analogue
>> to systemd's RestrictAddressFamilies= option, but that may be okay.
>> This would side-step issues in the current code where a hook can't
>> rely on ifindex meaning what the hook thinks it means.
>
> How is this different from making the current code netns aware?

The descendents part is important.

>
>> Actually, I think I like that approach.  What if it we had a "socket"
>> controller and files like "socket.create_socket", "socket.ingress" and
>> "socket.egress"?  You could use the bpf() syscall to install a bpf
>> filter into "socket.egress" and that filter would filter egress for
>> the target cgroup and its descendents on the current netns.  As a
>> first pass, the netns issue could be sidestepped by making it only
>> work in the root netns (i.e. the bpf() call would fail if you're not
>> in the root netns and the hooks wouldn't run if you're not in the root
>> netns.)  It *might* even be possible to retrofit in the socket
>> controller by saying that the default hierarchy is used if the socket
>> controller isn't mounted.
>
> I don't know.  You're throwing out too many ideas too fast and it's
> difficult to keep up with what the differences are between those
> ideas.  But if we're doing cgroup controllers, shouldn't cgroup ns
> support be sufficient?  We can consider the product of cgroup and net
> namespaces but that doesn't seem necessary given that people usually
> set up these namespaces in conjunction.

Fair enough.  See way below.

>
>> What *isn't* possible to cleanly fix after the fact is the current
>> semantics that cgroup hooks override the hooks in their ancestors.
>> IMO that is simply broken.  The example you gave (perf_event) is very
>> careful not to have this problem.
>
> That's like saying installing an iptables rule for a more specific
> target is broken.  As a cgroup controller, it is not an acceptable
> behavior given how delegation works.  As something similar to
> iptables, it is completely fine.

Even the xt_cgroup code checks cgroup_is_descendent().

>
>> > * My impression with bpf is that while delegation is something
>> >   theoretically possible it is not something which is gonna take place
>> >   in any immediate time frame.  If I'm wrong on this, please feel free
>> >   to correct me.
>>
>> But the issue isn't *BPF* delegation.  It's cgroup delegation or netns
>> creation, both of which exist today.
>
> No, the issue is bpf delegation.  If bpf were fully delegatable in
> practical sense, we could just do straight forward cgroup bpf
> controller.  Well, we'll have to think about how to chain the programs
> which would differ on program type but that shouldn't be too hard.

What do you mean by "if bpf were fully delegatable"?  I don't know
what it would even mean to delegate bpf.  I do know what it means to
allow programs to create new network namespaces and for cgroup
sub-hierarchies to be delegated.

>
>> > * There are a lot of use cases which can immediately take advantage of
>> >   cgroup-aware bpf programs operating as proposed.  The model is
>> >   semantically equivalent to iptables (let's address the netns part if
>> >   that's an issue) which net people are familiar with.
>>
>> That sounds like a great argument for those users to patch their
>> kernels to gain experience with this feature.
>
> I don't get why this would particularly point to out-of-tree usage.
> Why is that?

Because I think that the API currently has some issues that will be
difficult to fix without either breaking the API or supporting
multiple APIs going forward.

>
>> > * It isn't exclusive with adding cgroup bpf controller down the road
>> >   if necessary and practical.  Sure, this isn't the perfect situation
>> >   but it seems like an acceptable trade-off to me.  What ever is
>> >   perfect?
>>
>> I think it more or less is exclusive.  I can imagine davem accepting a
>> patch to make the sock_cgroup_data think point at a new "socket"
>> cgroup type.  I can't imagine him accepting a patch to have sockets
>> have more than one pointer to a cgroup, with good reason.
>
> Why would it ever need multiple pointers?  Socket is associated with
> one cgroup whether it's this or controller thing.  Membership part
> doesn't change.  It can share everything.

Exactly.  If it starts out using the default hierarchy and then
switches to being a socket controller, there will still be old
deployed code that tries to attach bpf programs to the default
hierarchy.  That will require defining some semantics for it.

> No idea about landlock but as for the cgroup aware network bpf
> programs, let's please try to narrow down the arguments.  Here's one
> question.
>
> * If we make the current thing netns aware, how is it different from
>   iptables?  Note that at least future-proofing for netns is trivial.
>   We can simply skip running the bpf programs for non-root ns for now.
>
> If the answer is "they aren't that different", is the reason that
> you're against the current code because you think that it being a part
> of cgroup would be more useful?  I'm not necessarily against that
> point, I just think that this is a reasonable trade-off given the
> circumstances especially given that this doens't block having the
> cgroup things in the future.

Having thought about this some more, I think that making it would
alleviate a bunch of my concerns, as it would make the semantics if
the capable() check were relaxed to ns_capable() be sane.  Here's what
I currently should happen before bpf+cgroup is enabled in a release:

1. Make it netns-aware.  This could be as simple as making it only
work in the root netns because then real netns awareness can be added
later without breaking anything.  The current situation is bad in that
network namespaces are just ignored and it's plausible that people
will start writing user code that depends on having network namespaces
be ignored.

2. Make it inherit properly.  Inner cgroups should not override outer
hooks.  As in (1), this could be simplified by preventing the same
hook from being configured in both an ancestor and a descendent
cgroup.  Then inheritance could be added for real later on.

3. Give cgroup delegation support some serious thought.  In
particular, if delegation would be straightforward but the current API
wouldn't work well with delegation, then at least consider whether the
API should change before it becomes stable so that two APIs don't need
to be supported going forward.

>From the bpf side, I think that the only thing needed to get
delegation working is to make inheritance work right -- if the same
hook is set up multiple places in the hierarchy, call all of the in an
appropriate order.  The rest is probably all on the cgroup side --
setting up a way to enable delegation (subtree_control?), making sure
that filesystem permissions are checked when configuring hooks, and
perhaps dealing with setuid issues.

--Andy

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Potential issues (security and otherwise) with the current cgroup-bpf API
  2017-01-19  2:29       ` Andy Lutomirski
@ 2017-01-20  2:39         ` Alexei Starovoitov
  2017-01-20  4:04           ` Andy Lutomirski
  0 siblings, 1 reply; 13+ messages in thread
From: Alexei Starovoitov @ 2017-01-20  2:39 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Tejun Heo, Michal Hocko, Peter Zijlstra, David Ahern,
	Andy Lutomirski, Daniel Mack, Mickaël Salaün,
	Kees Cook, Jann Horn, David S. Miller, Thomas Graf,
	Michael Kerrisk, Linux API, linux-kernel, Network Development

On Wed, Jan 18, 2017 at 06:29:22PM -0800, Andy Lutomirski wrote:
> I think it could work by making a single socket cgroup controller that
> handles all cgroup things that are bound to a socket.  Using

Such 'socket cgroup controller' would limit usability of the feature
to sockets and force all other use cases like landlock to invent
their own wheel, which is undesirable. Everyone will be
inventing new 'foo cgroup controller', while all of them
are really bpf features. They are different bpf program
types that attach to different hooks and use cgroup for scoping.

> Having thought about this some more, I think that making it would
> alleviate a bunch of my concerns, as it would make the semantics if
> the capable() check were relaxed to ns_capable() be sane.  Here's what

here we're on the same page. For any meaningful discussion about
'bpf cgroup controller' to happen bpf itself needs to become
delegatable in cgroup sense. In other words BPF_PROG_TYPE_CGROUP*
program types need to become available for unprivileged users.
The only unprivileged prog type today is BPF_PROG_TYPE_SOCKET_FILTER.
To make it secure we severely limited its functionality.
All bpf advances since then (like new map types and verifier extensions)
were done for root only. If early on the priv vs unpriv bpf features
were 80/20. Now it's close to 95/5. No work has been done to
make socket filter type more powerful. It still has to use
slow-ish ld_abs skb access while tc/xdp have direct packet access.
Things like register value tracking is root only as well and so on
and so forth.
We cannot just flip the switch and allow type_cgroup* to unpriv
and I don't see any volunteers willing to do this work.
Until that happens there is no point coming up with designs
for 'cgroup bpf controller'... whatever that means.

> I currently should happen before bpf+cgroup is enabled in a release:
> 
> 1. Make it netns-aware.  This could be as simple as making it only
> work in the root netns because then real netns awareness can be added
> later without breaking anything.  The current situation is bad in that
> network namespaces are just ignored and it's plausible that people
> will start writing user code that depends on having network namespaces
> be ignored.

nothing in bpf today is netns-aware and frankly I don't see
how cgroup+bpf has anything to do with netns.
For regular sockets+bpf we don't check netns.
When tcpdump opens raw socket and attaches bpf there are no netns
checks, since socket itself gives a scope for the program to run.
Same thing applies to cgroup+bpf. cgroup gives a scope for the program.
But, say, we indeed add 'if !root ns' check to BPF_CGROUP_INET_*
hooks. Then if the hooks are used for security, the process
only needs to do setns() to escape security sandbox. Obviously
broken semantics.

> 2. Make it inherit properly.  Inner cgroups should not override outer
> hooks.  As in (1), this could be simplified by preventing the same
> hook from being configured in both an ancestor and a descendent
> cgroup.  Then inheritance could be added for real later on.

In general it sounds fine, but it seems the reasoning to add
such restriction now (instead of later), so that program chain can
be added without breaking abi, since if we don't restrict it now
there will be no way to add it without breaking abi?!
That is incorrect assumption. We can add chaining and can add
'do not override' logic without breaking existing semantics.
For example, we can add 'priority' field to
struct { /* anonymous struct used by BPF_PROG_ATTACH/DETACH commands */}
which would indicate relative position of multiple chained programs
applied to the same cgroup+hook pair. Multiple programs with
the same priority will be executed in the order they were added.
Programs with different priorities will execute in the priority order.
Such scheme will be more generic and flexible than earlier proposals.
Similarly we can add another flag that will say 'dissallow override
of bpf program in descendent cgroup'. It's all trivial to do,
since bpf syscall was designed for extensibility.

Also until bpf_type_cgroup* becomes unprivileged there is no reason
to add this 'priority/prog chaining' feature, since if it's
used for security the root can always override it no matter cgroup
hierarchy.

> 3. Give cgroup delegation support some serious thought.  In
> particular, if delegation would be straightforward but the current API
> wouldn't work well with delegation, then at least consider whether the
> API should change before it becomes stable so that two APIs don't need
> to be supported going forward.

please see example above. Since we went with bpf syscall (instead of
inextensible ioctl) we can add any new cgroup+bpf logic without
breaking current abi.
No matter how you twist it the cgroup+bpf is bpf specific feature.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Potential issues (security and otherwise) with the current cgroup-bpf API
  2017-01-20  2:39         ` Alexei Starovoitov
@ 2017-01-20  4:04           ` Andy Lutomirski
  2017-01-23  4:31             ` Alexei Starovoitov
  0 siblings, 1 reply; 13+ messages in thread
From: Andy Lutomirski @ 2017-01-20  4:04 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Tejun Heo, Michal Hocko, Peter Zijlstra, David Ahern,
	Andy Lutomirski, Daniel Mack, Mickaël Salaün,
	Kees Cook, Jann Horn, David S. Miller, Thomas Graf,
	Michael Kerrisk, Linux API, linux-kernel, Network Development

On Thu, Jan 19, 2017 at 6:39 PM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> On Wed, Jan 18, 2017 at 06:29:22PM -0800, Andy Lutomirski wrote:
>> I think it could work by making a single socket cgroup controller that
>> handles all cgroup things that are bound to a socket.  Using
>
> Such 'socket cgroup controller' would limit usability of the feature
> to sockets and force all other use cases like landlock to invent
> their own wheel, which is undesirable. Everyone will be
> inventing new 'foo cgroup controller', while all of them
> are really bpf features. They are different bpf program
> types that attach to different hooks and use cgroup for scoping.

Can you elaborate on why that would be a problem?  In a cgroup v1
world, users who want different hierarchies for different types of
control could easily want one hierarchy for socket hooks and a
different hierarchy for lsm hooks.  In a cgroup v2 delegation world, I
could easily imagine the decision to delegate socket hooks being
different from the decision to delegate lsm hooks.  Almost all of the
code would be shared between different bpf-using cgroup controllers.

>
>> Having thought about this some more, I think that making it would
>> alleviate a bunch of my concerns, as it would make the semantics if
>> the capable() check were relaxed to ns_capable() be sane.  Here's what
>
> here we're on the same page. For any meaningful discussion about
> 'bpf cgroup controller' to happen bpf itself needs to become
> delegatable in cgroup sense. In other words BPF_PROG_TYPE_CGROUP*
> program types need to become available for unprivileged users.
> The only unprivileged prog type today is BPF_PROG_TYPE_SOCKET_FILTER.
> To make it secure we severely limited its functionality.
> All bpf advances since then (like new map types and verifier extensions)
> were done for root only. If early on the priv vs unpriv bpf features
> were 80/20. Now it's close to 95/5. No work has been done to
> make socket filter type more powerful. It still has to use
> slow-ish ld_abs skb access while tc/xdp have direct packet access.
> Things like register value tracking is root only as well and so on
> and so forth.
> We cannot just flip the switch and allow type_cgroup* to unpriv
> and I don't see any volunteers willing to do this work.
> Until that happens there is no point coming up with designs
> for 'cgroup bpf controller'... whatever that means.

Sure there is.  If delegation can be turned on without changing the
API, then the result will be easier to work with and have fewer
compatibility issues.

>
>> I currently should happen before bpf+cgroup is enabled in a release:
>>
>> 1. Make it netns-aware.  This could be as simple as making it only
>> work in the root netns because then real netns awareness can be added
>> later without breaking anything.  The current situation is bad in that
>> network namespaces are just ignored and it's plausible that people
>> will start writing user code that depends on having network namespaces
>> be ignored.
>
> nothing in bpf today is netns-aware and frankly I don't see
> how cgroup+bpf has anything to do with netns.
> For regular sockets+bpf we don't check netns.
> When tcpdump opens raw socket and attaches bpf there are no netns
> checks, since socket itself gives a scope for the program to run.
> Same thing applies to cgroup+bpf. cgroup gives a scope for the program.
> But, say, we indeed add 'if !root ns' check to BPF_CGROUP_INET_*
> hooks.


Here I completely disagree with you.  tcpdump sees packets in its
network namespace.  Regular sockets apply bpf filters to the packets
seen by that socket, and the socket itself is scoped to a netns.

Meanwhile, cgroup+bpf actually appears to be buggy in this regard even
regardless of what semantics you think are better.  sk_bound_dev_if is
exposed as a u32 value, but sk_bound_dev_if only has meaning within a
given netns.  The "ip vrf" stuff will straight-up malfunction if a
process affected by its hook runs in a different netns from the netns
that "ip vrf" was run in.

IOW, the current code is buggy.

> Then if the hooks are used for security, the process
> only needs to do setns() to escape security sandbox. Obviously
> broken semantics.

This could go both ways.  If the goal is to filter packets, then it's
not really important to have the filter keep working if the sandboxed
task unshares netns -- in the new netns, there isn't any access to the
network at all.  If the goal is to reduce attack surface by
restricting the types of sockets that can be created, then you do want
the filter to work across namespaces, but seccomp can do that too and
the current code doesn't handle netns correctly.

>
>> 2. Make it inherit properly.  Inner cgroups should not override outer
>> hooks.  As in (1), this could be simplified by preventing the same
>> hook from being configured in both an ancestor and a descendent
>> cgroup.  Then inheritance could be added for real later on.
>
> In general it sounds fine, but it seems the reasoning to add
> such restriction now (instead of later), so that program chain can
> be added without breaking abi, since if we don't restrict it now
> there will be no way to add it without breaking abi?!

Adding it the straightforward way (by simply making all the hooks in
scope run) would break ABI.  Of course there are more complicated ways
to do it, but those are more complicated and messier.  Do actually
have a use case in which overriding of hooks is a good thing?  As far
as I can tell, the original version of the patch set didn't have hooks
get overridden by descendents, and I don't know why this changed in
the first place.

>
> Also until bpf_type_cgroup* becomes unprivileged there is no reason
> to add this 'priority/prog chaining' feature, since if it's
> used for security the root can always override it no matter cgroup
> hierarchy.
>
>> 3. Give cgroup delegation support some serious thought.  In
>> particular, if delegation would be straightforward but the current API
>> wouldn't work well with delegation, then at least consider whether the
>> API should change before it becomes stable so that two APIs don't need
>> to be supported going forward.
>
> please see example above. Since we went with bpf syscall (instead of
> inextensible ioctl) we can add any new cgroup+bpf logic without
> breaking current abi.

But then you end up with two ABIs.  Ideally delegation would just work
-- then programs written now could run unmodified in unprivileged
containers in future kernels.

--Andy

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Potential issues (security and otherwise) with the current cgroup-bpf API
  2017-01-20  4:04           ` Andy Lutomirski
@ 2017-01-23  4:31             ` Alexei Starovoitov
  2017-01-23 20:20               ` Andy Lutomirski
  0 siblings, 1 reply; 13+ messages in thread
From: Alexei Starovoitov @ 2017-01-23  4:31 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Tejun Heo, Michal Hocko, Peter Zijlstra, David Ahern,
	Andy Lutomirski, Daniel Mack, Mickaël Salaün,
	Kees Cook, Jann Horn, David S. Miller, Thomas Graf,
	Michael Kerrisk, Linux API, linux-kernel, Network Development

On Thu, Jan 19, 2017 at 08:04:59PM -0800, Andy Lutomirski wrote:
> On Thu, Jan 19, 2017 at 6:39 PM, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> > On Wed, Jan 18, 2017 at 06:29:22PM -0800, Andy Lutomirski wrote:
> >> I think it could work by making a single socket cgroup controller that
> >> handles all cgroup things that are bound to a socket.  Using
> >
> > Such 'socket cgroup controller' would limit usability of the feature
> > to sockets and force all other use cases like landlock to invent
> > their own wheel, which is undesirable. Everyone will be
> > inventing new 'foo cgroup controller', while all of them
> > are really bpf features. They are different bpf program
> > types that attach to different hooks and use cgroup for scoping.
> 
> Can you elaborate on why that would be a problem?  In a cgroup v1
> world, users who want different hierarchies for different types of
> control could easily want one hierarchy for socket hooks and a
> different hierarchy for lsm hooks.  In a cgroup v2 delegation world, I
> could easily imagine the decision to delegate socket hooks being
> different from the decision to delegate lsm hooks.  Almost all of the
> code would be shared between different bpf-using cgroup controllers.

how do you think it can be enforced when directory is chowned?

> >> Having thought about this some more, I think that making it would
> >> alleviate a bunch of my concerns, as it would make the semantics if
> >> the capable() check were relaxed to ns_capable() be sane.  Here's what
> >
> > here we're on the same page. For any meaningful discussion about
> > 'bpf cgroup controller' to happen bpf itself needs to become
> > delegatable in cgroup sense. In other words BPF_PROG_TYPE_CGROUP*
> > program types need to become available for unprivileged users.
> > The only unprivileged prog type today is BPF_PROG_TYPE_SOCKET_FILTER.
> > To make it secure we severely limited its functionality.
> > All bpf advances since then (like new map types and verifier extensions)
> > were done for root only. If early on the priv vs unpriv bpf features
> > were 80/20. Now it's close to 95/5. No work has been done to
> > make socket filter type more powerful. It still has to use
> > slow-ish ld_abs skb access while tc/xdp have direct packet access.
> > Things like register value tracking is root only as well and so on
> > and so forth.
> > We cannot just flip the switch and allow type_cgroup* to unpriv
> > and I don't see any volunteers willing to do this work.
> > Until that happens there is no point coming up with designs
> > for 'cgroup bpf controller'... whatever that means.
> 
> Sure there is.  If delegation can be turned on without changing the
> API, then the result will be easier to work with and have fewer
> compatibility issues.

... and open() of the directory done by the current api will preserve
cgroup delegation when and only when bpf_prog_type_cgroup_*
becomes unprivileged.
I'm not proposing creating new api here.

> >
> >> I currently should happen before bpf+cgroup is enabled in a release:
> >>
> >> 1. Make it netns-aware.  This could be as simple as making it only
> >> work in the root netns because then real netns awareness can be added
> >> later without breaking anything.  The current situation is bad in that
> >> network namespaces are just ignored and it's plausible that people
> >> will start writing user code that depends on having network namespaces
> >> be ignored.
> >
> > nothing in bpf today is netns-aware and frankly I don't see
> > how cgroup+bpf has anything to do with netns.
> > For regular sockets+bpf we don't check netns.
> > When tcpdump opens raw socket and attaches bpf there are no netns
> > checks, since socket itself gives a scope for the program to run.
> > Same thing applies to cgroup+bpf. cgroup gives a scope for the program.
> > But, say, we indeed add 'if !root ns' check to BPF_CGROUP_INET_*
> > hooks.
> 
> 
> Here I completely disagree with you.  tcpdump sees packets in its
> network namespace.  Regular sockets apply bpf filters to the packets
> seen by that socket, and the socket itself is scoped to a netns.
> 
> Meanwhile, cgroup+bpf actually appears to be buggy in this regard even
> regardless of what semantics you think are better.  sk_bound_dev_if is
> exposed as a u32 value, but sk_bound_dev_if only has meaning within a
> given netns.  The "ip vrf" stuff will straight-up malfunction if a
> process affected by its hook runs in a different netns from the netns
> that "ip vrf" was run in.

how is that any different from normal 'ip netns exec'?
that is expected user behavior.

> IOW, the current code is buggy.
> 
> > Then if the hooks are used for security, the process
> > only needs to do setns() to escape security sandbox. Obviously
> > broken semantics.
> 
> This could go both ways.  If the goal is to filter packets, then it's
> not really important to have the filter keep working if the sandboxed
> task unshares netns -- in the new netns, there isn't any access to the
> network at all.  If the goal is to reduce attack surface by

in container networking based on netns all netns-es are connected
at l2 or l3, whereas socket is l4+ abstraction. Therefore doing
'if (!root_ns) allow' at socket layer is simply broken from security
point of view.

> restricting the types of sockets that can be created, then you do want
> the filter to work across namespaces, but seccomp can do that too and
> the current code doesn't handle netns correctly.

are you saying that seccomp supports netns filtering? please show the proof.

> >> 2. Make it inherit properly.  Inner cgroups should not override outer
> >> hooks.  As in (1), this could be simplified by preventing the same
> >> hook from being configured in both an ancestor and a descendent
> >> cgroup.  Then inheritance could be added for real later on.
> >
> > In general it sounds fine, but it seems the reasoning to add
> > such restriction now (instead of later), so that program chain can
> > be added without breaking abi, since if we don't restrict it now
> > there will be no way to add it without breaking abi?!
> 
> Adding it the straightforward way (by simply making all the hooks in
> scope run) would break ABI.  Of course there are more complicated ways
> to do it, but those are more complicated and messier.  Do actually

adding 'prog override is not allowed' flag will obviously _not_ break abi.
If program is attached to a cgroup in this mode, the descendants won't
be able to attach. Trivial boolean flag check at attach time.
Just like 'priority' mode will not break it.

> have a use case in which overriding of hooks is a good thing?  As far
> as I can tell, the original version of the patch set didn't have hooks
> get overridden by descendents, and I don't know why this changed in
> the first place.

what's been missing since the beginning of the thread that cgroup+bpf is
the bpf feature and on bpf side we can handle program chaining with priority
in flexible way. bpf prog needs to be able to attach to descendent
and make sure that the attached program is the only one that makes the
decision. Our main use case is application network analytics and
enforcement when apps are not malicious. In this case the container
management framework will attach one program to a particular
part of the hierarchy and will create whatever necessary combinations
of the programs for descendents, since it's one common framework
for them all and running single program is faster than walking link lists
and doing parsing and IP lookups several times. It's often the case that
prep work like skb parsing and map lookups are common across programs, but
filtering/monitoring logic is different, so dumb link list and call
all programs one by one is inferior. Whereas multiple program combining on
the user space side can be done efficiently.

To summarize, I think the 'prog override is not allowed' flag would be
ok feature to have and I wouldn't mind making it the default when no 'flag'
field is passed to bpf syscall, but it's not acceptable to remove current
'prog override is allowed' behavior.
So if you insist on changing the default, please implement the flag asap.
Though from security point of view and ABI point of view there is absolutely
no difference whether this flag is added today or a year later while
the default is kept as-is.

> > Also until bpf_type_cgroup* becomes unprivileged there is no reason
> > to add this 'priority/prog chaining' feature, since if it's
> > used for security the root can always override it no matter cgroup
> > hierarchy.
> >
> >> 3. Give cgroup delegation support some serious thought.  In
> >> particular, if delegation would be straightforward but the current API
> >> wouldn't work well with delegation, then at least consider whether the
> >> API should change before it becomes stable so that two APIs don't need
> >> to be supported going forward.
> >
> > please see example above. Since we went with bpf syscall (instead of
> > inextensible ioctl) we can add any new cgroup+bpf logic without
> > breaking current abi.
> 
> But then you end up with two ABIs.  Ideally delegation would just work
> -- then programs written now could run unmodified in unprivileged
> containers in future kernels.

At this point I don't see a reason to have two APIs.
When bpf_type_cgroup* becomes available to unprivileged users
the same api will work as-is.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Potential issues (security and otherwise) with the current cgroup-bpf API
  2017-01-23  4:31             ` Alexei Starovoitov
@ 2017-01-23 20:20               ` Andy Lutomirski
  2017-02-03 21:07                 ` Andy Lutomirski
  0 siblings, 1 reply; 13+ messages in thread
From: Andy Lutomirski @ 2017-01-23 20:20 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Tejun Heo, Michal Hocko, Peter Zijlstra, David Ahern,
	Andy Lutomirski, Daniel Mack, Mickaël Salaün,
	Kees Cook, Jann Horn, David S. Miller, Thomas Graf,
	Michael Kerrisk, Linux API, linux-kernel, Network Development

On Sun, Jan 22, 2017 at 8:31 PM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> On Thu, Jan 19, 2017 at 08:04:59PM -0800, Andy Lutomirski wrote:
>> On Thu, Jan 19, 2017 at 6:39 PM, Alexei Starovoitov
>> <alexei.starovoitov@gmail.com> wrote:
>> > On Wed, Jan 18, 2017 at 06:29:22PM -0800, Andy Lutomirski wrote:
>> >> I think it could work by making a single socket cgroup controller that
>> >> handles all cgroup things that are bound to a socket.  Using
>> >
>> > Such 'socket cgroup controller' would limit usability of the feature
>> > to sockets and force all other use cases like landlock to invent
>> > their own wheel, which is undesirable. Everyone will be
>> > inventing new 'foo cgroup controller', while all of them
>> > are really bpf features. They are different bpf program
>> > types that attach to different hooks and use cgroup for scoping.
>>
>> Can you elaborate on why that would be a problem?  In a cgroup v1
>> world, users who want different hierarchies for different types of
>> control could easily want one hierarchy for socket hooks and a
>> different hierarchy for lsm hooks.  In a cgroup v2 delegation world, I
>> could easily imagine the decision to delegate socket hooks being
>> different from the decision to delegate lsm hooks.  Almost all of the
>> code would be shared between different bpf-using cgroup controllers.
>
> how do you think it can be enforced when directory is chowned?

A combination of delegation policy (a la subtree_control in the
parent) and MAC policy.  I'm quite confident that apparmor can apply
policy to cgroupfs and I'm reasonably confident (although slightly
less so) that selinux can as well.  The docs suck :(

But if there's only one file in there to apply MAC policy to, the
ability to differentiate between subsystems is quite limited.   In the
current API, there are no files to apply policy to, so it won't work
at all.

>
> ... and open() of the directory done by the current api will preserve
> cgroup delegation when and only when bpf_prog_type_cgroup_*
> becomes unprivileged.
> I'm not proposing creating new api here.

I don't know what you mean.  open() of the directory can't check
anything useful because it has to work for programs that just want to
read the directory.

>> Meanwhile, cgroup+bpf actually appears to be buggy in this regard even
>> regardless of what semantics you think are better.  sk_bound_dev_if is
>> exposed as a u32 value, but sk_bound_dev_if only has meaning within a
>> given netns.  The "ip vrf" stuff will straight-up malfunction if a
>> process affected by its hook runs in a different netns from the netns
>> that "ip vrf" was run in.
>
> how is that any different from normal 'ip netns exec'?
> that is expected user behavior.

# ./ip link add dev vrf0 type vrf table 10
# ./ip vrf exec vrf0 ./show_bind
Default binding is "vrf0"
# ./ip vrf exec vrf0 unshare -n ./show_bind
show_bind: getsockopt: No such device

The expected behavior to me is that ip netns exec (or equivalently
unshare -n) gives a clean slate.  Actually malfunctioning (which this
example using the latest iproute2 and linux just did) is not expected.

I'm done with this part of this thread and I'm sending a patch.

>
>> restricting the types of sockets that can be created, then you do want
>> the filter to work across namespaces, but seccomp can do that too and
>> the current code doesn't handle netns correctly.
>
> are you saying that seccomp supports netns filtering? please show the proof.

It can trivially restruct the types of sockets that are created by
filtering on socket(2) syscall parameters, at least on sane
architectures that don't use socketcall().

> To summarize, I think the 'prog override is not allowed' flag would be
> ok feature to have and I wouldn't mind making it the default when no 'flag'
> field is passed to bpf syscall, but it's not acceptable to remove current
> 'prog override is allowed' behavior.
> So if you insist on changing the default, please implement the flag asap.
> Though from security point of view and ABI point of view there is absolutely
> no difference whether this flag is added today or a year later while
> the default is kept as-is.

It's too late and I have too little time.  I'll try to write a patch
to change the default to just deny attempts to override.  Better
behavior can be added later.

IMO your suggestions about priorities are overcomplicated.  For your
instrumentation needs, I can imagine that a simple "make this hook not
run if a descendent has a hook" would do it.  For everything else, run
them all in tree order (depending on filter type) and let the eBPF
code do whatever it wants to do.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Potential issues (security and otherwise) with the current cgroup-bpf API
  2017-01-23 20:20               ` Andy Lutomirski
@ 2017-02-03 21:07                 ` Andy Lutomirski
  2017-02-03 23:21                   ` Alexei Starovoitov
  0 siblings, 1 reply; 13+ messages in thread
From: Andy Lutomirski @ 2017-02-03 21:07 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Tejun Heo, Michal Hocko, Peter Zijlstra, David Ahern,
	Andy Lutomirski, Daniel Mack, Mickaël Salaün,
	Kees Cook, Jann Horn, David S. Miller, Thomas Graf,
	Michael Kerrisk, Linux API, linux-kernel, Network Development

On Mon, Jan 23, 2017 at 12:20 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Sun, Jan 22, 2017 at 8:31 PM, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
>> On Thu, Jan 19, 2017 at 08:04:59PM -0800, Andy Lutomirski wrote:
>>> restricting the types of sockets that can be created, then you do want
>>> the filter to work across namespaces, but seccomp can do that too and
>>> the current code doesn't handle netns correctly.
>>
>> are you saying that seccomp supports netns filtering? please show the proof.
>
> It can trivially restruct the types of sockets that are created by
> filtering on socket(2) syscall parameters, at least on sane
> architectures that don't use socketcall().

I think this is actually wrong -- the socket creation filter appears
to be called only on inet sockets.  Is there a good reason for that?

>
>> To summarize, I think the 'prog override is not allowed' flag would be
>> ok feature to have and I wouldn't mind making it the default when no 'flag'
>> field is passed to bpf syscall, but it's not acceptable to remove current
>> 'prog override is allowed' behavior.
>> So if you insist on changing the default, please implement the flag asap.
>> Though from security point of view and ABI point of view there is absolutely
>> no difference whether this flag is added today or a year later while
>> the default is kept as-is.
>
> It's too late and I have too little time.  I'll try to write a patch
> to change the default to just deny attempts to override.  Better
> behavior can be added later.
>
> IMO your suggestions about priorities are overcomplicated.  For your
> instrumentation needs, I can imagine that a simple "make this hook not
> run if a descendent has a hook" would do it.  For everything else, run
> them all in tree order (depending on filter type) and let the eBPF
> code do whatever it wants to do.

Is there any plan to address this?  If not, I'll try to write that
patch this weekend.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Potential issues (security and otherwise) with the current cgroup-bpf API
  2017-02-03 21:07                 ` Andy Lutomirski
@ 2017-02-03 23:21                   ` Alexei Starovoitov
  2017-02-04 17:10                     ` Andy Lutomirski
  0 siblings, 1 reply; 13+ messages in thread
From: Alexei Starovoitov @ 2017-02-03 23:21 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Tejun Heo, Michal Hocko, Peter Zijlstra, David Ahern,
	Andy Lutomirski, Daniel Mack, Mickaël Salaün,
	Kees Cook, Jann Horn, David S. Miller, Thomas Graf,
	Michael Kerrisk, Linux API, linux-kernel, Network Development

On Fri, Feb 03, 2017 at 01:07:39PM -0800, Andy Lutomirski wrote:
> 
> Is there any plan to address this?  If not, I'll try to write that
> patch this weekend.

yes. I'm working on 'disallow program override' flag.
It got stalled, because netns discussion got stalled.
Later today will send a patch for dev_id+inode and
will continue on the flag patch.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Potential issues (security and otherwise) with the current cgroup-bpf API
  2017-02-03 23:21                   ` Alexei Starovoitov
@ 2017-02-04 17:10                     ` Andy Lutomirski
  0 siblings, 0 replies; 13+ messages in thread
From: Andy Lutomirski @ 2017-02-04 17:10 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Tejun Heo, Michal Hocko, Peter Zijlstra, David Ahern,
	Andy Lutomirski, Daniel Mack, Mickaël Salaün,
	Kees Cook, Jann Horn, David S. Miller, Thomas Graf,
	Michael Kerrisk, Linux API, linux-kernel, Network Development

On Fri, Feb 3, 2017 at 3:21 PM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> On Fri, Feb 03, 2017 at 01:07:39PM -0800, Andy Lutomirski wrote:
>>
>> Is there any plan to address this?  If not, I'll try to write that
>> patch this weekend.
>
> yes. I'm working on 'disallow program override' flag.
> It got stalled, because netns discussion got stalled.
> Later today will send a patch for dev_id+inode and
> will continue on the flag patch.
>

Would it make sense to try to document what your proposal does before
writing the code?  I don't yet see how to get semantics that are both
simple and sensible with a "disallow override" flag.

I *do* see how to get simple, sensible semantics with an approach
where all the programs in scope for the cgroup in question get called.
If needed, I can imagine a special "overridable" program that would
not be run if the socket in question is bound to a descendent cgroup
that also has an "overridable" program but would still let all the
normal hierarchical programs in scope get called.

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2017-02-04 17:11 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-01-17  5:18 Andy Lutomirski
2017-01-18 22:41 ` Potential issues (security and otherwise) with the current cgroup-bpf API Tejun Heo
2017-01-19  0:18   ` Andy Lutomirski
2017-01-19  0:59     ` Tejun Heo
2017-01-19  2:29       ` Andy Lutomirski
2017-01-20  2:39         ` Alexei Starovoitov
2017-01-20  4:04           ` Andy Lutomirski
2017-01-23  4:31             ` Alexei Starovoitov
2017-01-23 20:20               ` Andy Lutomirski
2017-02-03 21:07                 ` Andy Lutomirski
2017-02-03 23:21                   ` Alexei Starovoitov
2017-02-04 17:10                     ` Andy Lutomirski
2017-01-19  1:01     ` Mickaël Salaün

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).