* Re: New namespace design and clone(2) flags exhaustion
       [not found] ` <CAN101LiTFwmiMMmLK93QMtNcczqm1mmK7EmPDpDYgtLtzkc8JA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2016-06-10 19:32   ` Eric W. Biederman
       [not found]     ` <87shwk7scl.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
  0 siblings, 1 reply; 5+ messages in thread
From: Eric W. Biederman @ 2016-06-10 19:32 UTC (permalink / raw)
  To: Albert Lee
  Cc: Andrew Morton, Pat Norton, Linux Containers, Nahum Shalman,
	Josh Lohrman, Pavel Emelianov


Adding the containers list as this is essentially a public question
and I figure having conversations as much as possible in public helps at
least in principle to reduce repeating oneself.

Albert Lee <trisk-EuoJsN+J0o7QT0dZR+AlfA@public.gmane.org> writes:

> Hello!
> We are building a platform that uses namespaces and cgroups for
> process group isolation and resource control and ZFS (a pooled
> storage, CoW, filesystem) for storage. [1]
> We wish to delegate administration for subsets of ZFS datasets to
> groups of processes on Linux, based on existing support in OpenZFS for
> illumos zones. Our initial approach introduces a new namespace, which
> allows arbitrary modules to be notified about new instances of this
> namespace. [2]

ZFS being licensed under the CDDL, which is GPL incompatible, isn't my
favorite subject to talk about.  But I think we are talking about a
general question.

Last I looked, Solaris/Illumos zones are a rather different concept from
namespaces: a top-down big switch rather than a bottom-up,
one-component-at-a-time kind of concept.

I don't think cgroups are at all interesting here; from what little I
can understand of what you are doing, cgroups are not a particularly
good fit.

I actually don't think you need a new namespace either.

This sounds like a job for mount options.  I know btrfs can mount
different subvolumes based on different mount options, and that sounds
like what you are doing here.

But I could easily be missing something.  What is it you are actually
trying to do?  Even the idea of your previous work, a delegation
namespace, is meaningless to me.  It sounds like you just wanted a giant
hook in the kernel so you could implement a hack.  Random hooks for
out-of-tree hacks are neither maintainable nor supportable, so I do not
encourage that approach.

Meanwhile there is a fair amount of work going on to allow unprivileged
FUSE mounts, which may dovetail with what you are trying to accomplish.

Eric


> During the initial investigation we noticed clone(2) has almost no
> available bits in its flags parameter to specify additional
> namespaces. We were re-using the former CLONE_STOPPED value, as
> proposed namespaces have also done. [3] This appears to stem from the
> mount namespace's design not having consideration for future
> namespaces, making it more work than necessary to implement any
> additional namespaces.
>
> Given introducing any new namespace in the existing model would
> exacerbate the problem, we're open to different options:
> * Not relying on namespaces but perhaps using cgroups instead. I'm not
> convinced the cgroup semantics make more sense for our use case.
> * Trying to upstream some form of our initial implementation by making
> it useful for other consumers. We've tried to make this
> "delegation namespace" as generic as possible.
> * Attempt to address the root issue by making namespaces "pluggable",
> in theory allowing them to be implemented in modules. This obviously
> requires a system call interface change as well as alterations to the
> structure attached to proc.
>
> The options are discussed in a lot more detail here:
> https://github.com/cerana/cerana/issues/143
>
> As you are some of the key people involved in the current
> implementations of namespaces, we would love to hear any comments you
> have, especially any opinions on the best course of action.
>
> Thanks in advance,
> -Albert
>
>  [1] https://cerana.org/
>  [2] https://github.com/cerana/linux-stable/tree/delegns
>  [3] https://lkml.org/lkml/2016/1/29/116

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: New namespace design and clone(2) flags exhaustion
       [not found]     ` <87shwk7scl.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
@ 2016-06-10 21:04       ` Albert Lee
       [not found]         ` <CAN101Lj5tenJRyRS1GipoP8G1KtRtNJMvFenNNZ8NoUCYj0dWA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 5+ messages in thread
From: Albert Lee @ 2016-06-10 21:04 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrew Morton, Pat Norton, Linux Containers, Nahum Shalman,
	Albert Lee, Josh Lohrman, Pavel Emelianov

On Fri, Jun 10, 2016 at 2:32 PM, Eric W. Biederman
<ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
>
> Adding the containers list as this is essentially a public question
> and I figure having conversations as much as possible in public helps at
> least in principle to reduce repeating oneself.
>
> Albert Lee <trisk-EuoJsN+J0o7QT0dZR+AlfA@public.gmane.org> writes:
>
>> Hello!
>> We are building a platform that uses namespaces and cgroups for
>> process group isolation and resource control and ZFS (a pooled
>> storage, CoW, filesystem) for storage. [1]
>> We wish to delegate administration for subsets of ZFS datasets to
>> groups of processes on Linux, based on existing support in OpenZFS for
>> illumos zones. Our initial approach introduces a new namespace, which
>> allows arbitrary modules to be notified about new instances of this
>> namespace. [2]
>
> ZFS being licensed under the CDDL, which is GPL incompatible, isn't my
> favorite subject to talk about.  But I think we are talking about a
> general question.
>
> Last I looked, Solaris/Illumos zones are a rather different concept from
> namespaces: a top-down big switch rather than a bottom-up,
> one-component-at-a-time kind of concept.
>

Right, zones exist as first-class objects that all subsystems can
associate with resources. The motivation for introducing (yet another)
new namespace is that we don't want to conflate the resources that
we're isolating with those associated with an existing namespace.

> I don't think cgroups are at all interesting here; from what little I
> can understand of what you are doing, cgroups are not a particularly
> good fit.
>
> I actually don't think you need a new namespace either.
>
> This sounds like a job for mount options.  I know btrfs can mount
> different subvolumes based on different mount options, and that sounds
> like what you are doing here.
>
> But I could easily be missing something.  What is it you are actually
> trying to do?  Even the idea of your previous work, a delegation
> namespace, is meaningless to me.  It sounds like you just wanted a giant
> hook in the kernel so you could implement a hack.  Random hooks for
> out-of-tree hacks are neither maintainable nor supportable, so I do not
> encourage that approach.
>
> Meanwhile there is a fair amount of work going on to allow unprivileged
> FUSE mounts, which may dovetail with what you are trying to accomplish.
>

Some background on the immediate problem we were trying to solve,
which is largely orthogonal to mounts: Storage pools in ZFS are a tree
of datasets, roughly analogous to btrfs subvolumes. Datasets can
expose either POSIX filesystem or block device semantics.
Administrative operations on datasets include creating and destroying
children or clones/snapshots, sending and receiving snapshots, and
setting properties.

In the zones model, these privileges can be delegated to a specific
zone, such that processes in those zones only see a subset of the
available datasets. Those datasets are still subject to quotas and
other resource limits in their parents. Processes have full access to
dataset operations if sufficiently privileged, as interpreted by the
zone. (Further down, delegation to unprivileged processes running as
specific users and groups within a zone is also possible, though
that's outside the immediate scope.) This allows a multitenant system
to provide storage management to each tenant.

We want to provide this functionality to groups of processes in Linux.
Initially the target is simple logical containers, but ideally it
should not restrict full namespace flexibility and extend to even
nested or disjoint mount (and possibly user) namespaces. Hence, we
don't want to rely on the mount namespace as the reference object for
granting delegation.

Our initial attempt was chosen for simplicity as a proof of concept,
and while we tried to make it less specific to our consumer, I'm not
particularly happy with the design. (Our consumer in the Solaris
Porting Layer actually manages zone objects that are then made visible
to ZFS.) If we have to introduce any changes upstream, it's only
feasible to do it in a way that is useful to other consumers.

Running out of clone(2) flags and the namespace implementations
generally not being very extensible present obstacles for us, but
suggest that it might be possible to address this in a way that
could both improve things in general and solve our own problem. (The
third proposal along those lines in
https://github.com/cerana/cerana/issues/143 is a way for modules to
implement new namespaces.) I haven't seen previous mentions of these
things as problems, though, and I'm not convinced I'm not totally
crazy either. :)
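
To make the flag-space problem concrete, here is a rough sketch that
counts the unassigned bits in the 32-bit clone(2) flags word. The table
is transcribed by hand from include/uapi/linux/sched.h as of roughly
Linux 4.6, so treat the values as an assumption and check them against
the headers you actually build with. The helper name free_clone_bits is
made up for this illustration.

```python
# Flag values transcribed by hand from include/uapi/linux/sched.h
# (roughly Linux 4.6). Assumption: verify against your own kernel headers.
CSIGNAL = 0x000000FF  # low byte carries the exit signal, not flags

CLONE_FLAGS = {
    "CLONE_VM": 0x00000100,
    "CLONE_FS": 0x00000200,
    "CLONE_FILES": 0x00000400,
    "CLONE_SIGHAND": 0x00000800,
    "CLONE_PTRACE": 0x00002000,
    "CLONE_VFORK": 0x00004000,
    "CLONE_PARENT": 0x00008000,
    "CLONE_THREAD": 0x00010000,
    "CLONE_NEWNS": 0x00020000,
    "CLONE_SYSVSEM": 0x00040000,
    "CLONE_SETTLS": 0x00080000,
    "CLONE_PARENT_SETTID": 0x00100000,
    "CLONE_CHILD_CLEARTID": 0x00200000,
    "CLONE_DETACHED": 0x00400000,  # ignored by the kernel but still reserved
    "CLONE_UNTRACED": 0x00800000,
    "CLONE_CHILD_SETTID": 0x01000000,
    "CLONE_NEWCGROUP": 0x02000000,  # the value CLONE_STOPPED used to have
    "CLONE_NEWUTS": 0x04000000,
    "CLONE_NEWIPC": 0x08000000,
    "CLONE_NEWUSER": 0x10000000,
    "CLONE_NEWPID": 0x20000000,
    "CLONE_NEWNET": 0x40000000,
    "CLONE_IO": 0x80000000,
}


def free_clone_bits(flags=CLONE_FLAGS, width=32):
    """Return the single-bit values in a `width`-bit word that no flag uses."""
    used = CSIGNAL
    for value in flags.values():
        used |= value
    return [1 << bit for bit in range(width) if not used & (1 << bit)]


if __name__ == "__main__":
    print([hex(bit) for bit in free_clone_bits()])
```

If the table above is right, only one bit is left once the former
CLONE_STOPPED value is taken, which is why every additional namespace
type ends up competing for the same slot.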

Thanks,
-Albert



* Re: New namespace design and clone(2) flags exhaustion
       [not found]         ` <CAN101Lj5tenJRyRS1GipoP8G1KtRtNJMvFenNNZ8NoUCYj0dWA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2016-06-10 21:28           ` Eric W. Biederman
       [not found]             ` <878tyc3fa0.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
  0 siblings, 1 reply; 5+ messages in thread
From: Eric W. Biederman @ 2016-06-10 21:28 UTC (permalink / raw)
  To: Albert Lee
  Cc: Andrew Morton, Pat Norton, Linux Containers, Nahum Shalman,
	Josh Lohrman, Pavel Emelianov

Albert Lee <trisk-EuoJsN+J0o7QT0dZR+AlfA@public.gmane.org> writes:

> Some background on the immediate problem we were trying to solve,
> which is largely orthogonal to mounts: Storage pools in ZFS are a tree
> of datasets, roughly analogous to btrfs subvolumes. Datasets can
> expose either POSIX filesystem or block device semantics.
> Administrative operations on datasets include creating and destroying
> children or clones/snapshots, sending and receiving snapshots, and
> setting properties.
>
> In the zones model, these privileges can be delegated to a specific
> zone, such that processes in those zones only see a subset of the
> available datasets. Those datasets are still subject to quotas and
> other resource limits in their parents. Processes have full access to
> dataset operations if sufficiently privileged, as interpreted by the
> zone. (Further down, delegation to unprivileged processes running as
> specific users and groups within a zone is also possible, though
> that's outside the immediate scope.) This allows a multitenant system
> to provide storage management to each tenant.
>
> We want to provide this functionality to groups of processes in Linux.
> Initially the target is simple logical containers, but ideally it
> should not restrict full namespace flexibility and extend to even
> nested or disjoint mount (and possibly user) namespaces. Hence, we
> don't want to rely on the mount namespace as the reference object for
> granting delegation.
>
> Our initial attempt was chosen for simplicity as a proof of concept,
> and while we tried to make it less specific to our consumer, I'm not
> particularly happy with the design. (Our consumer in the Solaris
> Porting Layer actually manages zone objects that are then made visible
> to ZFS.) If we have to introduce any changes upstream, it's only
> feasible to do it in a way that is useful to other consumers.
>
> Running out of clone(2) flags and the namespace implementations
> generally not being very extensible present obstacles for us, but
> suggest that it might be possible to address this in a way that
> could both improve things in general and solve our own problem. (The
> third proposal along those lines in
> https://github.com/cerana/cerana/issues/143 is a way for modules to
> implement new namespaces.) I haven't seen previous mentions of these
> things as problems, though, and I'm not convinced I'm not totally
> crazy either. :)

So as I understand it the issue is one of permissions.  Permissions by
and large are the domain of the user namespace.

At a very rough level what you want to do is to delegate permissions to
a user namespace possibly including the ability to further delegate
permission.

Possibly this should happen in a persistent fashion.

In various cases today, in-kernel data structures (especially
other namespaces) have been given a user namespace owner.  Currently
there is work under way to give filesystems a user namespace owner to
clean up the semantics, and allow for things such as unprivileged mounts
of the FUSE filesystem.
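
As a very rough illustration of what "a user namespace owner" buys you,
the sketch below models the kind of walk a capability check does over
the user namespace hierarchy. It is a simplified toy, not kernel code:
the UserNS and Cred classes, their fields, and this ns_capable function
are stand-ins invented for the example, and applying the check to a ZFS
dataset is purely hypothetical.

```python
class UserNS:
    """Toy user namespace: a parent link plus the uid that created it."""

    def __init__(self, parent=None, owner_uid=0):
        self.parent = parent
        self.owner_uid = owner_uid


class Cred:
    """Toy credentials: the task's user namespace, euid and capabilities."""

    def __init__(self, user_ns, euid, caps=()):
        self.user_ns = user_ns
        self.euid = euid
        self.caps = frozenset(caps)


def ns_capable(cred, target_ns, cap):
    """Simplified model: does `cred` hold `cap` over objects owned by `target_ns`?"""
    ns = target_ns
    while ns is not None:
        if ns is cred.user_ns:
            # Same namespace: the task needs the capability itself.
            return cap in cred.caps
        if ns.parent is cred.user_ns and ns.owner_uid == cred.euid:
            # The creator of a namespace is privileged over it from the parent.
            return True
        ns = ns.parent  # keep walking toward the initial namespace
    return False


if __name__ == "__main__":
    init_ns = UserNS()
    tenant_ns = UserNS(parent=init_ns, owner_uid=1000)
    tenant_root = Cred(tenant_ns, euid=0, caps={"CAP_SYS_ADMIN"})
    # An object owned by tenant_ns can be administered by tenant_root,
    # while an object owned by init_ns cannot.
    print(ns_capable(tenant_root, tenant_ns, "CAP_SYS_ADMIN"))
    print(ns_capable(tenant_root, init_ns, "CAP_SYS_ADMIN"))
```

The point of the model is only that once an object records which user
namespace owns it, privilege inside that namespace (or any ancestor) is
enough to decide who may administer it.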

Perhaps I am a man with a hammer seeing every problem as a nail, but it
sounds to me like your ZFS work fits fairly nicely into this model.  I
can't comment on the dangers of extending the ability to manipulate
datasets to less privileged users.  Things like that always seem to
extend the kernel attack surface and have to be done carefully, but it
always seems to be doable.

Eric


* Re: New namespace design and clone(2) flags exhaustion
       [not found]             ` <878tyc3fa0.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
@ 2016-06-10 22:17               ` James Bottomley
  2016-06-15 15:40               ` Albert Lee
  1 sibling, 0 replies; 5+ messages in thread
From: James Bottomley @ 2016-06-10 22:17 UTC (permalink / raw)
  To: Eric W. Biederman, Albert Lee
  Cc: Andrew Morton, Pat Norton, Miriam Zohar, Linux Containers,
	Nahum Shalman, Josh Lohrman, Pavel Emelianov

On Fri, 2016-06-10 at 16:28 -0500, Eric W. Biederman wrote:
> So as I understand it the issue is one of permissions.  Permissions by
> and large are the domain of the user namespace.
> 
> At a very rough level what you want to do is to delegate permissions to
> a user namespace possibly including the ability to further delegate
> permission.
> 
> Possibly this should happen in a persistent fashion.
> 
> In various cases today, in-kernel data structures (especially
> other namespaces) have been given a user namespace owner.  Currently
> there is work under way to give filesystems a user namespace owner to
> clean up the semantics, and allow for things such as unprivileged mounts
> of the FUSE filesystem.
> 
> Perhaps I am a man with a hammer seeing every problem as a nail, but it
> sounds to me like your ZFS work fits fairly nicely into this model.  I
> can't comment on the dangers of extending the ability to manipulate
> datasets to less privileged users.  Things like that always seem to
> extend the kernel attack surface and have to be done carefully, but it
> always seems to be doable.

So, just on the "I've seen this request pattern before" front: IMA wants
a way of virtualizing the kernel keyrings so they can be different for
different containers.  Again, this is a property delegation pattern. If
we come up with an extensible mechanism for adding delegations to the
user namespace, I think that would work for IMA as well.

However, let me play devil's advocate for a bit.

The reason we might want a separate and extensible delegate namespace
(which would do IMA, ZFS and any other delegation-to-container issues
that come up) is the enforced sharing use case.  That's where the
administrator of the orchestration system wants to force sharing of the
delegated system amongst multiple containers.  For the IMA use case
this might be a set of containers which share the same keyrings.  If we
have a delegate namespace separate from the user namespace, we can do
this, whereas if we chain the delegation to the userns, we're forced to
have a new delegate for every userns and the sharing would be broken
(they can all have identical copies of whatever is being delegated, but
if one container changes its copy, the others won't see the change).
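
A toy model of the difference, in case it helps. Nothing here
corresponds to a real kernel interface: DelegateNS, Container,
build_shared and build_per_userns are hypothetical names, and a plain
dict stands in for whatever is being delegated (a keyring in the IMA
case).

```python
class DelegateNS:
    """Hypothetical delegate namespace holding the delegated state."""

    def __init__(self, keyring):
        self.keyring = dict(keyring)  # private copy of the initial contents


class Container:
    """A container that refers to exactly one delegate namespace."""

    def __init__(self, name, delegate):
        self.name = name
        self.delegate = delegate


def build_shared(names, keyring):
    """Separate delegate namespace: every container references one object."""
    shared = DelegateNS(keyring)
    return [Container(name, shared) for name in names]


def build_per_userns(names, keyring):
    """Delegation chained to the userns: each container gets its own copy."""
    return [Container(name, DelegateNS(keyring)) for name in names]


if __name__ == "__main__":
    a, b = build_shared(["a", "b"], {"ima": "key-v1"})
    a.delegate.keyring["ima"] = "key-v2"
    print("shared:", b.delegate.keyring["ima"])  # b sees the update made via a

    c, d = build_per_userns(["c", "d"], {"ima": "key-v1"})
    c.delegate.keyring["ima"] = "key-v2"
    print("per-userns:", d.delegate.keyring["ima"])  # d still has its old copy
```

In the first case a change made through one container is visible to the
other; in the second the copies start out identical and then drift
apart, which is the broken sharing described above.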

So I think my fundamental reason to have a new delegate namespace
separate from the userns is this sharing, but another one is that if
it's a separate NS, it behaves like a regular namespace for creation,
pinning, etc.  If it's a property of the userns, then it doesn't and
thus it will have subtly different semantics, which may also end up
biting us in the long run.

James


* Re: New namespace design and clone(2) flags exhaustion
       [not found]             ` <878tyc3fa0.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
  2016-06-10 22:17               ` James Bottomley
@ 2016-06-15 15:40               ` Albert Lee
  1 sibling, 0 replies; 5+ messages in thread
From: Albert Lee @ 2016-06-15 15:40 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrew Morton, Pat Norton, Linux Containers, Nahum Shalman,
	Josh Lohrman, Pavel Emelianov

On Fri, Jun 10, 2016 at 5:28 PM, Eric W. Biederman
<ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
> So as I understand it the issue is one of permissions.  Permissions by
> and large are the domain of the user namespace.
>
> At a very rough level what you want to do is to delegate permissions to
> a user namespace possibly including the ability to further delegate
> permission.
>
> Possibly this should happen in a persistent fashion.

Well, it's about creating a new administrative domain for a specific
group of resources from the set of available ones, of which there may
be an unlimited number. Ideally we want the flexibility to share and
nest (subdivide) these groupings:

Example:
* Parent process can access:
  pool1/dataset1 pool2
* Child process can access:
  pool1/dataset1/child1 pool2/dataset2
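
The nesting rule in that example can be sketched as a simple path
check: a child grouping is valid only if each of its datasets sits at
or below something the parent can already access. This is only an
illustration of the intended semantics, not how our implementation or
ZFS does it; is_within and can_delegate are names made up for the
sketch.

```python
def is_within(dataset, root):
    """True if `dataset` is `root` itself or one of its descendants."""
    return dataset == root or dataset.startswith(root + "/")


def can_delegate(parent_roots, child_roots):
    """True if every child dataset falls under some dataset the parent holds."""
    return all(
        any(is_within(child, parent) for parent in parent_roots)
        for child in child_roots
    )


if __name__ == "__main__":
    parent = {"pool1/dataset1", "pool2"}
    child = {"pool1/dataset1/child1", "pool2/dataset2"}
    print(can_delegate(parent, child))  # the subdivision from the example
    print(can_delegate(child, parent))  # the reverse direction
```

Matching on "root + /" rather than a bare prefix keeps a sibling such
as pool1/dataset10 from being treated as a child of pool1/dataset1.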

We are interested in persistence for the groupings. Since the
namespaces themselves are ephemeral, our current plan is to have
upper-level orchestration software manage the associations. The
implementation sets a property (key/value pair) on the related
datasets which can be stored persistently, although we recreate it
from external configuration.

>
> In various cases today, in-kernel data structures (especially
> other namespaces) have been given a user namespace owner.  Currently
> there is work under way to give filesystems a user namespace owner to
> clean up the semantics, and allow for things such as unprivileged mounts
> of the FUSE filesystem.
>

Something attached to the VFS probably won't be a good fit, as there's
no requirement that datasets be mounted and visible to the VFS, and
datasets with block device semantics (zvols) aren't treated like
filesystems.

> Perhaps I am a man with a hammer seeing every problem as a nail, but it
> sounds to me like your ZFS work fits fairly nicely into this model.  I
> can't comment on the dangers of extending the ability to manipulate
> datasets to less privileged users.  Things like that always seem to
> extend the kernel attack surface and have to be done carefully, but it
> always seems to be doable.
>

I think tying a specific user namespace to the admin domain for our
resources is at least the simpler model, but may have undesirable side
effects on other resources we may manage since it constrains the
boundaries for sharing and nesting (as James' reply also mentions).

-Albert


