containers.lists.linux.dev archive mirror
* device namespaces
@ 2021-06-08  9:38 Enrico Weigelt, metux IT consult
  2021-06-08 12:30 ` Christian Brauner
  0 siblings, 1 reply; 48+ messages in thread
From: Enrico Weigelt, metux IT consult @ 2021-06-08  9:38 UTC (permalink / raw)
  To: containers, linux-kernel

Hello folks,


I'm going to implement device namespaces, where containers can get an
entirely different view of the devices in the machine (usually just a
specific subset, but possibly additional virtual devices).

For a start I'd like to add a simple mapping of dev maj/min (leaving aside
sysfs, udev, etc). An important requirement for me is that the parent ns
can choose to delegate devices from those it has full access to (child
namespaces can do the same to their children), and the assignment can
change (for simplicity ignoring the case of removing devices that are
already opened by some process - haven't decided yet whether they should
be forcefully closed or whether keeping them open is a valid use case).

The big question for me now is how exactly to do the table maintenance
from userland. We already have entries in /proc/<pid>/ns/*. I'm thinking
about using them as command channel, like this:

* new child namespaces are created with empty mapping
* mapping manipulation is done by just writing commands to the ns file
* access is only granted if the writing process itself is in the
  parent's device ns and has CAP_SYS_ADMIN (or maybe there could be some
  admin user for the ns ? or the 'root' of the corresponding user_ns ?)
* if the caller has some restrictions on some particular device, these
  are automatically added (eg. if you're restricted to readonly, you
  can't give rw to the child ns).
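
The rules in the list above can be modelled in a few lines of Python. This
is only a userspace sketch under stated assumptions: the class and method
names (DeviceNamespace, new_child, delegate) are hypothetical, the
CAP_SYS_ADMIN check is left out, and what happens to already-open devices
on reassignment stays undecided, as in the text.

```python
from dataclasses import dataclass, field


@dataclass
class DeviceNamespace:
    """Hypothetical model of one device namespace and its mapping table."""
    parent: "DeviceNamespace | None" = None
    # (major, minor) -> set of access flags, e.g. {"r", "w"}
    table: dict = field(default_factory=dict)

    def new_child(self):
        # New child namespaces are created with an empty mapping.
        return DeviceNamespace(parent=self)

    def delegate(self, child, dev, access):
        """Give `child` access to `dev`, capped by this namespace's own access."""
        if child.parent is not self:
            raise PermissionError("only the direct parent may edit the mapping")
        own = self.table.get(dev)
        if own is None:
            raise PermissionError("device is not visible in the delegating ns")
        # Restrictions of the caller are inherited: a read-only parent
        # cannot hand out rw, so the request is intersected with `own`.
        granted = set(access) & own
        child.table[dev] = granted
        return granted


host = DeviceNamespace(table={(8, 0): {"r", "w"}, (1, 3): {"r"}})
container = host.new_child()
print(host.delegate(container, (1, 3), "rw"))  # request is capped to read-only
```

The intersection in delegate() is one possible reading of "these are
automatically added"; rejecting the request outright would be another.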

Is this a good way to go ? Or what would be a better one ?


--mtx

-- 
---
Note: unencrypted e-mails can easily be intercepted and manipulated!
For confidential communication, please send your GPG/PGP key.
---
Enrico Weigelt, metux IT consult
Free software and Linux embedded engineering
info@metux.net -- +49-151-27565287


* Re: device namespaces
  2021-06-08  9:38 device namespaces Enrico Weigelt, metux IT consult
@ 2021-06-08 12:30 ` Christian Brauner
  2021-06-08 12:41   ` Greg Kroah-Hartman
  0 siblings, 1 reply; 48+ messages in thread
From: Christian Brauner @ 2021-06-08 12:30 UTC (permalink / raw)
  To: Enrico Weigelt, metux IT consult, Greg Kroah-Hartman
  Cc: containers, linux-kernel

On Tue, Jun 08, 2021 at 11:38:16AM +0200, Enrico Weigelt, metux IT consult wrote:
> Hello folks,
> 
> 
> I'm going to implement device namespaces, where containers can get an
> entirely different view of the devices in the machine (usually just a
> specific subset, but possibly additional virtual devices).
> 
> For a start I'd like to add a simple mapping of dev maj/min (leaving aside
> sysfs, udev, etc). An important requirement for me is that the parent ns
> can choose to delegate devices from those it has full access to (child
> namespaces can do the same to their children), and the assignment can
> change (for simplicity ignoring the case of removing devices that are
> already opened by some process - haven't decided yet whether they should
> be forcefully closed or whether keeping them open is a valid use case).
> 
> The big question for me now is how exactly to do the table maintenance
> from userland. We already have entries in /proc/<pid>/ns/*. I'm thinking
> about using them as command channel, like this:
> 
> * new child namespaces are created with empty mapping
> * mapping manipulation is done by just writing commands to the ns file
> * access is only granted if the writing process itself is in the
>  parent's device ns and has CAP_SYS_ADMIN (or maybe there could be some
>  admin user for the ns ? or the 'root' of the corresponding user_ns ?)
> * if the caller has some restrictions on some particular device, these
>  are automatically added (eg. if you're restricted to readonly, you
>  can't give rw to the child ns).
> 
> Is this a good way to go ? Or what would be a better one ?

Ccing Greg. Without addressing specific problems, I should warn you that
this idea is not new and the plan is unlikely to go anywhere. Especially
not without support from Greg.

Also note that I have done work to make it possible to do sufficient
device management in containers. There's a longer series associated with
this but the gist is 692ec06d7c92 ("netns: send uevent messages") where
you can forward uevents to containers. I spoke about this at Plumbers in
2018 or so too. For example, LXD makes use of this. When you hotplug a
device into a container LXD will forward the generated uevents to the
container making it possible for the container to manage those devices.
That's fully under control of userspace and means we don't need to
burden the kernel with this.
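
For readers who have not looked at uevents from userspace: before a
container manager can forward anything, it has to read the kernel's uevent
datagrams on the host. The following is a minimal Python sketch of that
reading side only, not of LXD's implementation and not of the injection
path. The helper names (parse_uevent, open_uevent_socket) are made up for
illustration; the protocol number 15 is NETLINK_KOBJECT_UEVENT from
<linux/netlink.h>, and multicast group 1 carries the kernel's own events.

```python
import socket

NETLINK_KOBJECT_UEVENT = 15  # from <linux/netlink.h>


def parse_uevent(payload: bytes) -> dict:
    """Parse a kernel uevent datagram into its KEY=VALUE fields.

    The kernel sends "action@devpath" followed by NUL-separated
    KEY=VALUE pairs; the header line is skipped because ACTION and
    DEVPATH are repeated in the pairs.
    """
    fields = {}
    for item in payload.split(b"\0")[1:]:
        key, sep, value = item.decode(errors="replace").partition("=")
        if sep:
            fields[key] = value
    return fields


def open_uevent_socket() -> socket.socket:
    """Open a netlink socket subscribed to kernel uevents (Linux only)."""
    sock = socket.socket(socket.AF_NETLINK, socket.SOCK_DGRAM,
                         NETLINK_KOBJECT_UEVENT)
    sock.bind((0, 1))  # pid 0: kernel picks an id; group 1: kernel events
    return sock


sample = (b"add@/devices/virtual/block/loop0\0ACTION=add\0"
          b"DEVPATH=/devices/virtual/block/loop0\0SUBSYSTEM=block\0"
          b"MAJOR=7\0MINOR=0\0")
event = parse_uevent(sample)
print(event["ACTION"], event["SUBSYSTEM"], event["MAJOR"], event["MINOR"])
```

A manager would loop on open_uevent_socket().recv(), decide which
container a device belongs to, and then re-send the event into that
container's network namespace; that last step is not shown here.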


* Re: device namespaces
  2021-06-08 12:30 ` Christian Brauner
@ 2021-06-08 12:41   ` Greg Kroah-Hartman
  2021-06-08 14:10     ` Hannes Reinecke
  0 siblings, 1 reply; 48+ messages in thread
From: Greg Kroah-Hartman @ 2021-06-08 12:41 UTC (permalink / raw)
  To: Enrico Weigelt, metux IT consult
  Cc: Christian Brauner, containers, linux-kernel

On Tue, Jun 08, 2021 at 02:30:50PM +0200, Christian Brauner wrote:
> On Tue, Jun 08, 2021 at 11:38:16AM +0200, Enrico Weigelt, metux IT consult wrote:
> > Hello folks,
> > 
> > 
> > I'm going to implement device namespaces, where containers can get an
> > entirely different view of the devices in the machine (usually just a
> > specific subset, but possibly additional virtual devices).
> > 
> > For a start I'd like to add a simple mapping of dev maj/min (leaving aside
> > sysfs, udev, etc). An important requirement for me is that the parent ns
> > can choose to delegate devices from those it has full access to (child
> > namespaces can do the same to their children), and the assignment can
> > change (for simplicity ignoring the case of removing devices that are
> > already opened by some process - haven't decided yet whether they should
> > be forcefully closed or whether keeping them open is a valid use case).
> > 
> > The big question for me now is how exactly to do the table maintenance
> > from userland. We already have entries in /proc/<pid>/ns/*. I'm thinking
> > about using them as command channel, like this:
> > 
> > * new child namespaces are created with empty mapping
> > * mapping manipulation is done by just writing commands to the ns file
> > * access is only granted if the writing process itself is in the
> >  parent's device ns and has CAP_SYS_ADMIN (or maybe there could be some
> >  admin user for the ns ? or the 'root' of the corresponding user_ns ?)
> > * if the caller has some restrictions on some particular device, these
> >  are automatically added (eg. if you're restricted to readonly, you
> >  can't give rw to the child ns).
> > 
> > Is this a good way to go ? Or what would be a better one ?
> 
> Ccing Greg. Without addressing specific problems, I should warn you that
> this idea is not new and the plan is unlikely to go anywhere. Especially
> not without support from Greg.

Hah, yeah, this is a non-starter.

Enrico, what real problem are you trying to solve by doing this?  And
have you tried anything with this yet?  We almost never talk about
"proposals" without seeing real code as it's pointless to discuss things
when you haven't even proven that it can work.

So let's see code before even talking about this...

And as Christian points out, you can do this today without any kernel
changes, so to think you need to modify the kernel means that you
haven't even tried this at all?

greg k-h


* Re: device namespaces
  2021-06-08 12:41   ` Greg Kroah-Hartman
@ 2021-06-08 14:10     ` Hannes Reinecke
  2021-06-08 14:29       ` Christian Brauner
  0 siblings, 1 reply; 48+ messages in thread
From: Hannes Reinecke @ 2021-06-08 14:10 UTC (permalink / raw)
  To: gregkh; +Cc: christian.brauner, containers, linux-kernel, lkml

On Tue, Jun 08, 2021 Greg-KH wrote:
> On Tue, Jun 08, 2021 at 02:30:50PM +0200, Christian Brauner wrote:
>> On Tue, Jun 08, 2021 at 11:38:16AM +0200, Enrico Weigelt,
>> metux IT consult wrote:
>>> Hello folks,
>>>
>>>
>>> I'm going to implement device namespaces, where containers can get
>>> an entirely different view of the devices in the machine (usually
>>> just a specific subset, but possibly additional virtual devices).
>>>
[ .. ]
>>> Is this a good way to go ? Or what would be a better one ?
>>
>> Ccing Greg. Without addressing specific problems, I should warn you
>> that this idea is not new and the plan is unlikely to go anywhere.
>> Especially not without support from Greg.
>
> Hah, yeah, this is a non-starter.
>
> Enrico, what real problem are you trying to solve by doing this?  And
> have you tried anything with this yet?  We almost never talk about
> "proposals" without seeing real code as it's pointless to discuss
> things when you haven't even proven that it can work.
>
> So let's see code before even talking about this...
>
> And as Christian points out, you can do this today without any kernel
> changes, so to think you need to modify the kernel means that you
> haven't even tried this at all?
>
Curious, I had been looking into this, too.
And I have to side with Greg and Christian that your proposal should
already be possible today (cf device groups, which curiously has a
near-identical interface to what you proposed).
Also, I think that a generic 'device namespace' is too broad a scope;
some subsystems like net already inherited namespace support, and it
turns out to be not exactly trivial to implement.
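
Assuming "device groups" here refers to the cgroup v1 devices controller,
its rules are written to devices.allow / devices.deny as
"<type> <major>:<minor> <access>", with type a, b or c, "*" as a wildcard
and access drawn from r, w and m (mknod). A small Python sketch of parsing
and matching such a rule follows; DeviceRule and parse_rule are
illustrative names, not part of any kernel or library API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeviceRule:
    dev_type: str      # "a" (all), "b" (block) or "c" (char)
    major: str         # decimal number or "*"
    minor: str         # decimal number or "*"
    access: frozenset  # subset of {"r", "w", "m"}

    def allows(self, dev_type, major, minor, access):
        """True if this rule covers the device and every requested flag."""
        return (self.dev_type in ("a", dev_type)
                and self.major in ("*", str(major))
                and self.minor in ("*", str(minor))
                and set(access) <= self.access)


def parse_rule(text):
    """Parse one rule line such as 'c 1:3 rwm' or 'b 8:* r'."""
    dev_type, numbers, access = text.split()
    major, minor = numbers.split(":")
    if dev_type not in ("a", "b", "c") or not set(access) <= set("rwm"):
        raise ValueError(f"malformed device rule: {text!r}")
    return DeviceRule(dev_type, major, minor, frozenset(access))


rule = parse_rule("b 8:* rw")
print(rule.allows("b", 8, 16, "r"), rule.allows("b", 8, 16, "m"))
```

The similarity to the proposal is the (type, major:minor, access) triple;
the controller restricts access but does not remap or hide device numbers.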

What I'm looking at, though, is to implement 'block' namespaces, to
restrict access to _new_ block devices to any given namespace.
Case in point: if a container creates a ramdisk it's questionable
whether other containers should even see it. iSCSI devices are a similar
case; when starting iSCSI devices from containers their use should be
restricted to that container.
And that's not only the device node in /dev, but would also entail sysfs
access, which from my understanding is not modified with the current code.

uevent redirection would help here, but from what I've seen it's only
for net devices; feels a bit awkward to have a network namespace to get
uevents for block devices, but then I'll have to test.
And, of course, that also doesn't change the sysfs layout.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		           Kernel Storage Architect
hare@suse.de			                  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Felix Imendörffer


* Re: device namespaces
  2021-06-08 14:10     ` Hannes Reinecke
@ 2021-06-08 14:29       ` Christian Brauner
  2021-06-08 15:54         ` Hannes Reinecke
  0 siblings, 1 reply; 48+ messages in thread
From: Christian Brauner @ 2021-06-08 14:29 UTC (permalink / raw)
  To: Hannes Reinecke; +Cc: gregkh, containers, linux-kernel, lkml

On Tue, Jun 08, 2021 at 04:10:08PM +0200, Hannes Reinecke wrote:
> On Tue, Jun 08, 2021 Greg-KH wrote:
> > On Tue, Jun 08, 2021 at 02:30:50PM +0200, Christian Brauner wrote:
> >> On Tue, Jun 08, 2021 at 11:38:16AM +0200, Enrico Weigelt,
> >> metux IT consult wrote:
> >>> Hello folks,
> >>>
> >>>
> >>> I'm going to implement device namespaces, where containers can get
> >>> an entirely different view of the devices in the machine (usually
> >>> just a specific subset, but possibly additional virtual devices).
> >>>
> [ .. ]
> >>> Is this a good way to go ? Or what would be a better one ?
> >>
> >> Ccing Greg. Without addressing specific problems, I should warn you
> >> that this idea is not new and the plan is unlikely to go anywhere.
> >> Especially not without support from Greg.
> >
> > Hah, yeah, this is a non-starter.
> >
> > Enrico, what real problem are you trying to solve by doing this?  And
> > have you tried anything with this yet?  We almost never talk about
> > "proposals" without seeing real code as it's pointless to discuss
> > things when you haven't even proven that it can work.
> >
> > So let's see code before even talking about this...
> >
> > And as Christian points out, you can do this today without any kernel
> > changes, so to think you need to modify the kernel means that you
> > haven't even tried this at all?
> >
> Curious, I had been looking into this, too.
> And I have to side with Greg and Christian that your proposal should
> already be possible today (cf device groups, which curiously has a
> near-identical interface to what you proposed).
> Also, I think that a generic 'device namespace' is too broad a scope;
> some subsystems like net already inherited namespace support, and it
> turns out to be not exactly trivial to implement.
> 
> What I'm looking at, though, is to implement 'block' namespaces, to
> restrict access to _new_ block devices to any given namespace.
> Case in point: if a container creates a ramdisk it's questionable
> whether other containers should even see it. iSCSI devices are a similar
> case; when starting iSCSI devices from containers their use should be
> restricted to that container.
> And that's not only the device node in /dev, but would also entail sysfs
> access, which from my understanding is not modified with the current code.

Hey Hannes. :)

It isn't and we likely shouldn't. You'd likely need to get into the
business of namespacing devtmpfs one way or the other which Seth Forshee
and I once did. But that's really not needed anymore imho. Device
management, i.e. creating device nodes should be the job of a container
manager. We already do that for example (Hotplugging devices ranging
from net devices, to disks, to GPUs.) and it works great.

To make this really clean you will likely have to significantly rework
sysfs too and I don't think that churn is worth it and introduces a
layer of complexity I find outright nakable. And ignoring sysfs or
hacking around it is also not an option I find tasteful.

> 
> uevent redirection would help here, but from what I've seen it's only
> for net devices; feels a bit awkward to have a network namespace to get
> uevents for block devices, but then I'll have to test.

Just to get everyone on the same page. This is not specific to network
devices actually.

You are right though that network devices are correctly namespaced.
Specifically you only get uevents in the network namespace that network
device is moved into. The sysfs permissions for network devices were
correct if you created that network device in the network namespace but
they were wrong when you moved a network device between network
namespaces (with different owning user namespaces). That led to all
kinds of weird issues. I fixed that a while back.

Uevent messages (and therefore injection of uevents) are not tied to
network devices. They are tied to network namespaces simply because the
transport layer is Netlink but that's about it.


* Re: device namespaces
  2021-06-08 14:29       ` Christian Brauner
@ 2021-06-08 15:54         ` Hannes Reinecke
  2021-06-08 17:16           ` Eric W. Biederman
  0 siblings, 1 reply; 48+ messages in thread
From: Hannes Reinecke @ 2021-06-08 15:54 UTC (permalink / raw)
  To: Christian Brauner; +Cc: gregkh, containers, linux-kernel, lkml

On 6/8/21 4:29 PM, Christian Brauner wrote:
> On Tue, Jun 08, 2021 at 04:10:08PM +0200, Hannes Reinecke wrote:
>> On Tue, Jun 08, 2021 Greg-KH wrote:
>>> On Tue, Jun 08, 2021 at 02:30:50PM +0200, Christian Brauner wrote:
>>>> On Tue, Jun 08, 2021 at 11:38:16AM +0200, Enrico Weigelt,
>>>> metux IT consult wrote:
>>>>> Hello folks,
>>>>>
>>>>>
>>>>> I'm going to implement device namespaces, where containers can get
>>>>> an entirely different view of the devices in the machine (usually
>>>>> just a specific subset, but possibly additional virtual devices).
>>>>>
>> [ .. ]
>>>>> Is this a good way to go ? Or what would be a better one ?
>>>>
>>>> Ccing Greg. Without addressing specific problems, I should warn you
>>>> that this idea is not new and the plan is unlikely to go anywhere.
>>>> Especially not without support from Greg.
>>>
>>> Hah, yeah, this is a non-starter.
>>>
>>> Enrico, what real problem are you trying to solve by doing this?  And
>>> have you tried anything with this yet?  We almost never talk about
>>> "proposals" without seeing real code as it's pointless to discuss
>>> things when you haven't even proven that it can work.
>>>
>>> So let's see code before even talking about this...
>>>
>>> And as Christian points out, you can do this today without any kernel
>>> changes, so to think you need to modify the kernel means that you
>>> haven't even tried this at all?
>>>
>> Curious, I had been looking into this, too.
>> And I have to side with Greg and Christian that your proposal should
>> already be possible today (cf device groups, which curiously has a
>> near-identical interface to what you proposed).
>> Also, I think that a generic 'device namespace' is too broad a scope;
>> some subsystems like net already inherited namespace support, and it
>> turns out to be not exactly trivial to implement.
>>
>> What I'm looking at, though, is to implement 'block' namespaces, to
>> restrict access to _new_ block devices to any given namespace.
>> Case in point: if a container creates a ramdisk it's questionable
>> whether other containers should even see it. iSCSI devices are a similar
>> case; when starting iSCSI devices from containers their use should be
>> restricted to that container.
>> And that's not only the device node in /dev, but would also entail sysfs
>> access, which from my understanding is not modified with the current code.
> 
> Hey Hannes. :)
> 
> It isn't and we likely shouldn't. You'd likely need to get into the
> business of namespacing devtmpfs one way or the other which Seth Forshee
> and I once did. But that's really not needed anymore imho. Device
> management, i.e. creating device nodes should be the job of a container
> manager. We already do that for example (Hotplugging devices ranging
> from net devices, to disks, to GPUs.) and it works great.
> 
Right; clearly you can do that within the container.
But my main grudge here is not the container but rather the system
_hosting_ the container.
That is typically using devtmpfs and hence will see _all_ devices, even
those belonging to the container.
This is causing grief to no end if eg the host system starts activating
LVM on devices which are passed to the qemu instance running within a
container ...

> To make this really clean you will likely have to significantly rework
> sysfs too and I don't think that churn is worth it and introduces a
> layer of complexity I find outright nakable. And ignoring sysfs or
> hacking around it is also not an option I find tasteful.
> 
Network namespaces already have the bits and pieces to modify sysfs, so
we should be able to leverage that for block, too.
And I think by restricting it to 'block' devices we should be able to keep
the required sysfs modifications at a manageable level.

>>
>> uevent redirection would help here, but from what I've seen it's only
>> for net devices; feels a bit awkward to have a network namespace to get
>> uevents for block devices, but then I'll have to test.
> 
> Just to get everyone on the same page. This is not specific to network
> devices actually.
> 
> You are right though that network devices are correctly namespaced.
> Specifically you only get uevents in the network namespace that network
> device is moved into. The sysfs permissions for network devices were
> correct if you created that network device in the network namespace but
> they were wrong when you moved a network device between network
> namespaces (with different owning user namespaces). That led to all
> kinds of weird issues. I fixed that a while back.
> 
Granted, modifying sysfs layout is not something for the faint-hearted,
and one really has to look closely to ensure you end up with a
consistent layout afterwards.

But let's see how things go; might well be that it turns out to be too
complex to consider. Can't tell yet.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		           Kernel Storage Architect
hare@suse.de			                  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Felix Imendörffer


* Re: device namespaces
  2021-06-08 15:54         ` Hannes Reinecke
@ 2021-06-08 17:16           ` Eric W. Biederman
  2021-06-09  6:38             ` Christian Brauner
  0 siblings, 1 reply; 48+ messages in thread
From: Eric W. Biederman @ 2021-06-08 17:16 UTC (permalink / raw)
  To: Hannes Reinecke; +Cc: Christian Brauner, gregkh, containers, linux-kernel, lkml

Hannes Reinecke <hare@suse.de> writes:

> On 6/8/21 4:29 PM, Christian Brauner wrote:
>> On Tue, Jun 08, 2021 at 04:10:08PM +0200, Hannes Reinecke wrote:
>>> On Tue, Jun 08, 2021 Greg-KH wrote:
>>>> On Tue, Jun 08, 2021 at 02:30:50PM +0200, Christian Brauner wrote:
>>>>> On Tue, Jun 08, 2021 at 11:38:16AM +0200, Enrico Weigelt,
>>>>> metux IT consult wrote:
>>>>>> Hello folks,
>>>>>>
>>>>>>
>>>>>> I'm going to implement device namespaces, where containers can get
>>>>>> an entirely different view of the devices in the machine (usually
>>>>>> just a specific subset, but possibly additional virtual devices).
>>>>>>
>>> [ .. ]
>>>>>> Is this a good way to go ? Or what would be a better one ?
>>>>>
>>>>> Ccing Greg. Without addressing specific problems, I should warn you
>>>>> that this idea is not new and the plan is unlikely to go anywhere.
>>>>> Especially not without support from Greg.
>>>>
>>>> Hah, yeah, this is a non-starter.
>>>>
>>>> Enrico, what real problem are you trying to solve by doing this?  And
>>>> have you tried anything with this yet?  We almost never talk about
>>>> "proposals" without seeing real code as it's pointless to discuss
>>>> things when you haven't even proven that it can work.
>>>>
>>>> So let's see code before even talking about this...
>>>>
>>>> And as Christian points out, you can do this today without any kernel
>>>> changes, so to think you need to modify the kernel means that you
>>>> haven't even tried this at all?
>>>>
>>> Curious, I had been looking into this, too.
>>> And I have to side with Greg and Christian that your proposal should
>>> already be possible today (cf device groups, which curiously has a
>>> near-identical interface to what you proposed).
>>> Also, I think that a generic 'device namespace' is too broad a scope;
>>> some subsystems like net already inherited namespace support, and it
>>> turns out to be not exactly trivial to implement.
>>>
>>> What I'm looking at, though, is to implement 'block' namespaces, to
>>> restrict access to _new_ block devices to any given namespace.
>>> Case in point: if a container creates a ramdisk it's questionable
>>> whether other containers should even see it. iSCSI devices are a similar
>>> case; when starting iSCSI devices from containers their use should be
>>> restricted to that container.
>>> And that's not only the device node in /dev, but would also entail sysfs
>>> access, which from my understanding is not modified with the current code.
>> 
>> Hey Hannes. :)
>> 
>> It isn't and we likely shouldn't. You'd likely need to get into the
>> business of namespacing devtmpfs one way or the other which Seth Forshee
>> and I once did. But that's really not needed anymore imho. Device
>> management, i.e. creating device nodes should be the job of a container
>> manager. We already do that for example (Hotplugging devices ranging
>> from net devices, to disks, to GPUs.) and it works great.
>> 
> Right; clearly you can do that within the container.
> But my main grudge here is not the container but rather the system
> _hosting_ the container.
> That is typically using devtmpfs and hence will see _all_ devices, even
> those belonging to the container.
> This is causing grief to no end if eg the host system starts activating
> LVM on devices which are passed to the qemu instance running within a
> container ...
>
>> To make this really clean you will likely have to significantly rework
>> sysfs too and I don't think that churn is worth it and introduces a
>> layer of complexity I find outright nakable. And ignoring sysfs or
>> hacking around it is also not an option I find tasteful.
>> 
> Network namespaces already have the bits and pieces to modify sysfs, so
> we should be able to leverage that for block, too.
> And I think by restricting it to 'block' devices we should be able to keep
> the required sysfs modifications at a manageable level.
>
>>>
>>> uevent redirection would help here, but from what I've seen it's only
>>> for net devices; feels a bit awkward to have a network namespace to get
>>> uevents for block devices, but then I'll have to test.
>> 
>> Just to get everyone on the same page. This is not specific to network
>> devices actually.
>> 
>> You are right though that network devices are correctly namespaced.
>> Specifically you only get uevents in the network namespace that network
>> device is moved into. The sysfs permissions for network devices were
>> correct if you created that network device in the network namespace but
>> they were wrong when you moved a network device between network
>> namespaces (with different owning user namespaces). That led to all
>> kinds of weird issues. I fixed that a while back.
>> 
> Granted, modifying sysfs layout is not something for the faint-hearted,
> and one really has to look closely to ensure you end up with a
> consistent layout afterwards.
>
> But let's see how things go; might well be that it turns out to be too
> complex to consider. Can't tell yet.

I would suggest aiming for something like devptsfs without the
complication of /dev/ptmx.

That is a pseudo filesystem that has a control node and virtual block
devices that were created using that control node.

That is the cleanest solution I know and is not strictly limited to use
with containers so it can also gain greater traction.  The interaction
with devtmpfs should be simply having devtmpfs create a mount point for
that filesystem.

This could be a new cleaner api for things like loopback devices.
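
To illustrate the shape of that idea (and only the shape), here is a toy
Python model of one mounted instance of such a filesystem: each instance
has a control node, and devices created through it are visible only in
that instance. Everything here is hypothetical, including the class name
PseudoBlockFs and the node name "control"; it models visibility, not any
existing kernel interface.

```python
import itertools


class PseudoBlockFs:
    """Toy model of a single mount of a devpts-like block filesystem."""

    CONTROL = "control"  # hypothetical name of the control node

    def __init__(self):
        self._minors = itertools.count()  # per-instance minor allocation
        self._nodes = {}                  # device name -> minor number

    def listdir(self):
        # Only the control node plus devices created in *this* instance.
        return [self.CONTROL] + sorted(self._nodes)

    def control_add(self, name):
        """Create a virtual block device via the control node."""
        if name == self.CONTROL or name in self._nodes:
            raise FileExistsError(name)
        minor = next(self._minors)
        self._nodes[name] = minor
        return minor

    def control_remove(self, name):
        """Remove a device previously created in this instance."""
        del self._nodes[name]


container_a = PseudoBlockFs()
container_b = PseudoBlockFs()
container_a.control_add("loop0")
print(container_a.listdir(), container_b.listdir())
```

Because each mount carries its own device list, isolation between
containers falls out of ordinary mount handling rather than a new
namespace type.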

However the limitation for block devices that I am aware of is that we
don't currently have any filesystems in the kernel that are written
robustly enough that we can be expected to be secure when mounted on top
of an evil block device.  Some of the network filesystems are built
to withstand evil network packets, and possibly evil servers.  So with
care we can probably allow for unprivileged mounts there.


Eric


* Re: device namespaces
  2021-06-08 17:16           ` Eric W. Biederman
@ 2021-06-09  6:38             ` Christian Brauner
  2021-06-09  7:02               ` Hannes Reinecke
  0 siblings, 1 reply; 48+ messages in thread
From: Christian Brauner @ 2021-06-09  6:38 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Hannes Reinecke, gregkh, containers, linux-kernel, lkml

On Tue, Jun 08, 2021 at 12:16:43PM -0500, Eric W. Biederman wrote:
> Hannes Reinecke <hare@suse.de> writes:
> 
> > On 6/8/21 4:29 PM, Christian Brauner wrote:
> >> On Tue, Jun 08, 2021 at 04:10:08PM +0200, Hannes Reinecke wrote:
> >>> On Tue, Jun 08, 2021 Greg-KH wrote:
> >>>> On Tue, Jun 08, 2021 at 02:30:50PM +0200, Christian Brauner wrote:
> >>>>> On Tue, Jun 08, 2021 at 11:38:16AM +0200, Enrico Weigelt,
> >>>>> metux IT consult wrote:
> >>>>>> Hello folks,
> >>>>>>
> >>>>>>
> >>>>>> I'm going to implement device namespaces, where containers can get
> >>>>>> an entirely different view of the devices in the machine (usually
> >>>>>> just a specific subset, but possibly additional virtual devices).
> >>>>>>
> >>> [ .. ]
> >>>>>> Is this a good way to go ? Or what would be a better one ?
> >>>>>
> >>>>> Ccing Greg. Without addressing specific problems, I should warn you
> >>>>> that this idea is not new and the plan is unlikely to go anywhere.
> >>>>> Especially not without support from Greg.
> >>>>
> >>>> Hah, yeah, this is a non-starter.
> >>>>
> >>>> Enrico, what real problem are you trying to solve by doing this?  And
> >>>> have you tried anything with this yet?  We almost never talk about
> >>>> "proposals" without seeing real code as it's pointless to discuss
> >>>> things when you haven't even proven that it can work.
> >>>>
> >>>> So let's see code before even talking about this...
> >>>>
> >>>> And as Christian points out, you can do this today without any kernel
> >>>> changes, so to think you need to modify the kernel means that you
> >>>> haven't even tried this at all?
> >>>>
> >>> Curious, I had been looking into this, too.
> >>> And I have to side with Greg and Christian that your proposal should
> >>> already be possible today (cf device groups, which curiously has a
> >>> near-identical interface to what you proposed).
> >>> Also, I think that a generic 'device namespace' is too broad a scope;
> >>> some subsystems like net already inherited namespace support, and it
> >>> turns out to be not exactly trivial to implement.
> >>>
> >>> What I'm looking at, though, is to implement 'block' namespaces, to
> >>> restrict access to _new_ block devices to any given namespace.
> >>> Case in point: if a container creates a ramdisk it's questionable
> >>> whether other containers should even see it. iSCSI devices are a similar
> >>> case; when starting iSCSI devices from containers their use should be
> >>> restricted to that container.
> >>> And that's not only the device node in /dev, but would also entail sysfs
> >>> access, which from my understanding is not modified with the current code.
> >> 
> >> Hey Hannes. :)
> >> 
> >> It isn't and we likely shouldn't. You'd likely need to get into the
> >> business of namespacing devtmpfs one way or the other which Seth Forshee
> >> and I once did. But that's really not needed anymore imho. Device
> >> management, i.e. creating device nodes should be the job of a container
> >> manager. We already do that for example (Hotplugging devices ranging
> >> from net devices, to disks, to GPUs.) and it works great.
> >> 
> > Right; clearly you can do that within the container.
> > But my main grudge here is not the container but rather the system
> > _hosting_ the container.
> > That is typically using devtmpfs and hence will see _all_ devices, even
> > those belonging to the container.
> > This is causing grief to no end if eg the host system starts activating
> > LVM on devices which are passed to the qemu instance running within a
> > container ...
> >
> >> To make this really clean you will likely have to significantly rework
> >> sysfs too and I don't think that churn is worth it and introduces a
> >> layer of complexity I find outright nakable. And ignoring sysfs or
> >> hacking around it is also not an option I find tasteful.
> >> 
> > Network namespaces already have the bits and pieces to modify sysfs, so
> > we should be able to leverage that for block, too.
> > And I think by restricting it to 'block' devices we should be able to keep
> > the required sysfs modifications at a manageable level.
> >
> >>>
> >>> uevent redirection would help here, but from what I've seen it's only
> >>> for net devices; feels a bit awkward to have a network namespace to get
> >>> uevents for block devices, but then I'll have to test.
> >> 
> >> Just to get everyone on the same page. This is not specific to network
> >> devices actually.
> >> 
> >> You are right though that network devices are correctly namespaced.
> >> Specifically you only get uevents in the network namespace that network
> >> device is moved into. The sysfs permissions for network devices were
> >> correct if you created that network device in the network namespace but
> >> they were wrong when you moved a network device between network
> >> namespaces (with different owning user namespaces). That led to all
> >> kinds of weird issues. I fixed that a while back.
> >> 
> > Granted, modifying sysfs layout is not something for the faint-hearted,
> > and one really has to look closely to ensure you end up with a
> > consistent layout afterwards.
> >
> > But let's see how things go; might well be that it turns out to be too
> > complex to consider. Can't tell yet.
> 
> I would suggest aiming for something like devptsfs without the
> complication of /dev/ptmx.
> 
> That is a pseudo filesystem that has a control node and virtual block
> devices that were created using that control node.

Also see android/binder/binderfs.c

> 
> That is the cleanest solution I know and is not strictly limited to use
> with containers so it can also gain greater traction.  The interaction
> with devtmpfs should be simply having devtmpfs create a mount point for
> that filesystem.
> 
> This could be a new cleaner api for things like loopback devices.

I sent a patchset that implemented this last year.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: device namespaces
  2021-06-09  6:38             ` Christian Brauner
@ 2021-06-09  7:02               ` Hannes Reinecke
  2021-06-09  7:21                 ` Christian Brauner
  0 siblings, 1 reply; 48+ messages in thread
From: Hannes Reinecke @ 2021-06-09  7:02 UTC (permalink / raw)
  To: Christian Brauner, Eric W. Biederman
  Cc: gregkh, containers, linux-kernel, lkml

On 6/9/21 8:38 AM, Christian Brauner wrote:
> On Tue, Jun 08, 2021 at 12:16:43PM -0500, Eric W. Biederman wrote:
>> Hannes Reinecke <hare@suse.de> writes:
>>
>>> On 6/8/21 4:29 PM, Christian Brauner wrote:
>>>> On Tue, Jun 08, 2021 at 04:10:08PM +0200, Hannes Reinecke wrote:
[ .. ]
>>> Granted, modifying sysfs layout is not something for the faint-hearted,
>>> and one really has to look closely to ensure you end up with a
>>> consistent layout afterwards.
>>>
>>> But let's see how things go; might well be that it turns out to be too
>>> complex to consider. Can't tell yet.
>>
>> I would suggest aiming for something like devptsfs without the
>> complication of /dev/ptmx.
>>
>> That is a pseudo filesystem that has a control node and virtual block
>> devices that were created using that control node.
> 
> Also see android/binder/binderfs.c
> 
Ah. Will have a look.

>>
>> That is the cleanest solution I know and is not strictly limited to use
>> with containers so it can also gain greater traction.  The interaction
>> with devtmpfs should be simply having devtmpfs create a mount point for
>> that filesystem.
>>
>> This could be a new cleaner api for things like loopback devices.
> 
> I sent a patchset that implemented this last year.
> 
Do you have a pointer/commit hash for this?

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: device namespaces
  2021-06-09  7:02               ` Hannes Reinecke
@ 2021-06-09  7:21                 ` Christian Brauner
  2021-06-09  7:54                   ` Hannes Reinecke
  0 siblings, 1 reply; 48+ messages in thread
From: Christian Brauner @ 2021-06-09  7:21 UTC (permalink / raw)
  To: Hannes Reinecke; +Cc: Eric W. Biederman, gregkh, containers, linux-kernel, lkml

On Wed, Jun 09, 2021 at 09:02:36AM +0200, Hannes Reinecke wrote:
> On 6/9/21 8:38 AM, Christian Brauner wrote:
> > On Tue, Jun 08, 2021 at 12:16:43PM -0500, Eric W. Biederman wrote:
> > > Hannes Reinecke <hare@suse.de> writes:
> > > 
> > > > On 6/8/21 4:29 PM, Christian Brauner wrote:
> > > > > On Tue, Jun 08, 2021 at 04:10:08PM +0200, Hannes Reinecke wrote:
> [ .. ]
> > > > Granted, modifying sysfs layout is not something for the faint-hearted,
> > > > and one really has to look closely to ensure you end up with a
> > > > consistent layout afterwards.
> > > > 
> > > > But let's see how things go; might well be that it turns out to be too
> > > > complex to consider. Can't tell yet.
> > > 
> > > I would suggest aiming for something like devptsfs without the
> > > complication of /dev/ptmx.
> > > 
> > > That is a pseudo filesystem that has a control node and virtual block
> > > devices that were created using that control node.
> > 
> > Also see android/binder/binderfs.c
> > 
> Ah. Will have a look.

I implemented this a few years back and I think it should've made it
onto Android by default now. So that approach does indeed work well, it
seems:
https://chromium.googlesource.com/aosp/platform/system/core/+/master/rootdir/init.rc#257

This should be easier to follow than the devpts case because you don't
need to wade through the {t,p}ty layer.

> 
> > > 
> > > That is the cleanest solution I know and is not strictly limited to use
> > > with containers so it can also gain greater traction.  The interaction
> > > with devtmpfs should be simply having devtmpfs create a mount point for
> > > that filesystem.
> > > 
> > > This could be a new cleaner api for things like loopback devices.
> > 
> > I sent a patchset that implemented this last year.
> > 
> Do you have a pointer/commit hash for this?

Yes, sure:
https://lore.kernel.org/linux-block/20200424162052.441452-1-christian.brauner@ubuntu.com/

You can also just pull my branch. I think it's still based on v5.7 or something:
https://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux.git/log/?h=loopfs

I'm happy to collaborate on this too.

Christian

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: device namespaces
  2021-06-09  7:21                 ` Christian Brauner
@ 2021-06-09  7:54                   ` Hannes Reinecke
  2021-06-09  8:09                     ` Christian Brauner
  0 siblings, 1 reply; 48+ messages in thread
From: Hannes Reinecke @ 2021-06-09  7:54 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Eric W. Biederman, gregkh, containers, linux-kernel, lkml

On 6/9/21 9:21 AM, Christian Brauner wrote:
> On Wed, Jun 09, 2021 at 09:02:36AM +0200, Hannes Reinecke wrote:
>> On 6/9/21 8:38 AM, Christian Brauner wrote:
>>> On Tue, Jun 08, 2021 at 12:16:43PM -0500, Eric W. Biederman wrote:
>>>> Hannes Reinecke <hare@suse.de> writes:
>>>>
>>>>> On 6/8/21 4:29 PM, Christian Brauner wrote:
>>>>>> On Tue, Jun 08, 2021 at 04:10:08PM +0200, Hannes Reinecke wrote:
>> [ .. ]
>>>>> Granted, modifying sysfs layout is not something for the faint-hearted,
>>>>> and one really has to look closely to ensure you end up with a
>>>>> consistent layout afterwards.
>>>>>
>>>>> But let's see how things go; might well be that it turns out to be too
>>>>> complex to consider. Can't tell yet.
>>>>
>>>> I would suggest aiming for something like devptsfs without the
>>>> complication of /dev/ptmx.
>>>>
>>>> That is a pseudo filesystem that has a control node and virtual block
>>>> devices that were created using that control node.
>>>
>>> Also see android/binder/binderfs.c
>>>
>> Ah. Will have a look.
> 
> I implemented this a few years back and I think it should've made it
> onto Android by default now. So that approach does indeed work well, it
> seems:
> https://chromium.googlesource.com/aosp/platform/system/core/+/master/rootdir/init.rc#257
> 
> This should be easier to follow than the devpts case because you don't
> need to wade through the {t,p}ty layer.
> 
>>
>>>>
>>>> That is the cleanest solution I know and is not strictly limited to use
>>>> with containers so it can also gain greater traction.  The interaction
>>>> with devtmpfs should be simply having devtmpfs create a mount point for
>>>> that filesystem.
>>>>
>>>> This could be a new cleaner api for things like loopback devices.
>>>
>>> I sent a patchset that implemented this last year.
>>>
>> Do you have a pointer/commit hash for this?
> 
> Yes, sure:
> https://lore.kernel.org/linux-block/20200424162052.441452-1-christian.brauner@ubuntu.com/
> 
> You can also just pull my branch. I think it's still based on v5.7 or something:
> https://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux.git/log/?h=loopfs
> 
> I'm happy to collaborate on this too.
>
How _very_ curious. 'kernfs: handle multiple namespace tags' and 'loop:
preserve sysfs backwards compatibility' are essentially the same patches I
did for my block namespaces prototype; I named it 'KOBJ_NS_TYPE_BLK', not
'KOBJ_NS_TYPE_USER', though :-)

Guess we really should cooperate.

Speaking of which: why did you name it 'user' namespace?
There already is a generic 'user_namespace' in
include/linux/user_namespace.h, serving as a container for all
namespaces; as such it probably should include this 'user' namespace,
leading to quite some confusion.

Or did I misunderstand something here?

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		           Kernel Storage Architect
hare@suse.de			                  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Felix Imendörffer

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: device namespaces
  2021-06-09  7:54                   ` Hannes Reinecke
@ 2021-06-09  8:09                     ` Christian Brauner
  2021-06-11 18:14                       ` Eric W. Biederman
  0 siblings, 1 reply; 48+ messages in thread
From: Christian Brauner @ 2021-06-09  8:09 UTC (permalink / raw)
  To: Hannes Reinecke; +Cc: Eric W. Biederman, gregkh, containers, linux-kernel, lkml

On Wed, Jun 09, 2021 at 09:54:05AM +0200, Hannes Reinecke wrote:
> On 6/9/21 9:21 AM, Christian Brauner wrote:
> > On Wed, Jun 09, 2021 at 09:02:36AM +0200, Hannes Reinecke wrote:
> >> On 6/9/21 8:38 AM, Christian Brauner wrote:
> >>> On Tue, Jun 08, 2021 at 12:16:43PM -0500, Eric W. Biederman wrote:
> >>>> Hannes Reinecke <hare@suse.de> writes:
> >>>>
> >>>>> On 6/8/21 4:29 PM, Christian Brauner wrote:
> >>>>>> On Tue, Jun 08, 2021 at 04:10:08PM +0200, Hannes Reinecke wrote:
> >> [ .. ]
> >>>>> Granted, modifying sysfs layout is not something for the faint-hearted,
> >>>>> and one really has to look closely to ensure you end up with a
> >>>>> consistent layout afterwards.
> >>>>>
> >>>>> But let's see how things go; might well be that it turns out to be too
> >>>>> complex to consider. Can't tell yet.
> >>>>
> >>>> I would suggest aiming for something like devptsfs without the
> >>>> complication of /dev/ptmx.
> >>>>
> >>>> That is a pseudo filesystem that has a control node and virtual block
> >>>> devices that were created using that control node.
> >>>
> >>> Also see android/binder/binderfs.c
> >>>
> >> Ah. Will have a look.
> > 
> > I implemented this a few years back and I think it should've made it
> > onto Android by default now. So that approach does indeed work well, it
> > seems:
> > https://chromium.googlesource.com/aosp/platform/system/core/+/master/rootdir/init.rc#257
> > 
> > This should be easier to follow than the devpts case because you don't
> > need to wade through the {t,p}ty layer.
> > 
> >>
> >>>>
> >>>> That is the cleanest solution I know and is not strictly limited to use
> >>>> with containers so it can also gain greater traction.  The interaction
> >>>> with devtmpfs should be simply having devtmpfs create a mount point for
> >>>> that filesystem.
> >>>>
> >>>> This could be a new cleaner api for things like loopback devices.
> >>>
> >>> I sent a patchset that implemented this last year.
> >>>
> >> Do you have a pointer/commit hash for this?
> > 
> > Yes, sure:
> > https://lore.kernel.org/linux-block/20200424162052.441452-1-christian.brauner@ubuntu.com/
> > 
> > You can also just pull my branch. I think it's still based on v5.7 or something:
> > https://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux.git/log/?h=loopfs
> > 
> > I'm happy to collaborate on this too.
> >
> How _very_ curious. 'kernfs: handle multiple namespace tags' and 'loop:
> preserve sysfs backwards compatibility' are essentially the same patches I
> did for my block namespaces prototype; I named it 'KOBJ_NS_TYPE_BLK', not
> 'KOBJ_NS_TYPE_USER', though :-)
> 
> Guess we really should cooperate.
> 
> Speaking of which: why did you name it 'user' namespace?
> There already is a generic 'user_namespace' in
> include/linux/user_namespace.h, serving as a container for all
> namespaces; as such it probably should include this 'user' namespace,
> leading to quite some confusion.
> 
> Or did I misunderstand something here?

Ah yes, you misunderstand. The KOBJ_NS_TYPE_* tags are namespace tags.
So KOBJ_NS_TYPE_NET is a network namespace tag, and KOBJ_NS_TYPE_USER is
a user namespace tag, not a completely new namespace. The idea very
roughly being that devices such as loop devices are ultimately filtered
by user namespace which is taken from the s_user_ns the loopfs instance
is mounted in. We should compare notes.
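
To make the tagging idea concrete, here is a minimal userspace toy model in Python. It is not kernel code: the `Entry` class, the `visible_entries` helper and the tag strings are hypothetical names chosen for illustration, and treating untagged entries as visible everywhere is a simplifying assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Entry:
    """One sysfs-like directory entry (hypothetical model, not a kernel type)."""
    name: str
    ns_tag: Optional[str] = None  # None: untagged, shown in every view


def visible_entries(entries, mount_tag):
    """Names visible through a mount whose superblock carries mount_tag."""
    return [e.name for e in entries if e.ns_tag is None or e.ns_tag == mount_tag]


# Assumed example data: two loop devices owned by different user namespaces.
ENTRIES = [
    Entry("loop0", ns_tag="userns-host"),
    Entry("loop1", ns_tag="userns-container"),
    Entry("sda"),  # untagged physical disk
]

print(visible_entries(ENTRIES, "userns-container"))
```

In this model the tag plays the role that s_user_ns of the mounted instance plays in the description above: the same list of entries yields a different view depending on which tag the mount carries.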

Christian

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: device namespaces
  2021-06-09  8:09                     ` Christian Brauner
@ 2021-06-11 18:14                       ` Eric W. Biederman
  2021-06-14  7:49                         ` Enrico Weigelt, metux IT consult
  0 siblings, 1 reply; 48+ messages in thread
From: Eric W. Biederman @ 2021-06-11 18:14 UTC (permalink / raw)
  To: Christian Brauner; +Cc: Hannes Reinecke, gregkh, containers, linux-kernel, lkml

Christian Brauner <christian.brauner@ubuntu.com> writes:

> On Wed, Jun 09, 2021 at 09:54:05AM +0200, Hannes Reinecke wrote:
>> On 6/9/21 9:21 AM, Christian Brauner wrote:
>> > On Wed, Jun 09, 2021 at 09:02:36AM +0200, Hannes Reinecke wrote:
>> >> On 6/9/21 8:38 AM, Christian Brauner wrote:
>> >>> On Tue, Jun 08, 2021 at 12:16:43PM -0500, Eric W. Biederman wrote:
>> >>>> Hannes Reinecke <hare@suse.de> writes:
>> >>>>
>> >>>>> On 6/8/21 4:29 PM, Christian Brauner wrote:
>> >>>>>> On Tue, Jun 08, 2021 at 04:10:08PM +0200, Hannes Reinecke wrote:
>> >> [ .. ]
>> >>>>> Granted, modifying sysfs layout is not something for the faint-hearted,
>> >>>>> and one really has to look closely to ensure you end up with a
>> >>>>> consistent layout afterwards.
>> >>>>>
>> >>>>> But let's see how things go; might well be that it turns out to be too
>> >>>>> complex to consider. Can't tell yet.
>> >>>>
>> >>>> I would suggest aiming for something like devptsfs without the
>> >>>> complication of /dev/ptmx.
>> >>>>
>> >>>> That is a pseudo filesystem that has a control node and virtual block
>> >>>> devices that were created using that control node.
>> >>>
>> >>> Also see android/binder/binderfs.c
>> >>>
>> >> Ah. Will have a look.
>> > 
>> > I implemented this a few years back and I think it should've made it
>> > onto Android by default now. So that approach does indeed work well, it
>> > seems:
>> > https://chromium.googlesource.com/aosp/platform/system/core/+/master/rootdir/init.rc#257
>> > 
>> > This should be easier to follow than the devpts case because you don't
>> > need to wade through the {t,p}ty layer.
>> > 
>> >>
>> >>>>
>> >>>> That is the cleanest solution I know and is not strictly limited to use
>> >>>> with containers so it can also gain greater traction.  The interaction
>> >>>> with devtmpfs should be simply having devtmpfs create a mount point for
>> >>>> that filesystem.
>> >>>>
>> >>>> This could be a new cleaner api for things like loopback devices.
>> >>>
>> >>> I sent a patchset that implemented this last year.
>> >>>
>> >> Do you have a pointer/commit hash for this?
>> > 
>> > Yes, sure:
>> > https://lore.kernel.org/linux-block/20200424162052.441452-1-christian.brauner@ubuntu.com/
>> > 
>> > You can also just pull my branch. I think it's still based on v5.7 or something:
>> > https://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux.git/log/?h=loopfs
>> > 
>> > I'm happy to collaborate on this too.
>> >
>> How _very_ curious. 'kernfs: handle multiple namespace tags' and 'loop:
>> preserve sysfs backwards compatibility' are essentially the same patches I
>> did for my block namespaces prototype; I named it 'KOBJ_NS_TYPE_BLK', not
>> 'KOBJ_NS_TYPE_USER', though :-)
>> 
>> Guess we really should cooperate.
>> 
>> Speaking of which: why did you name it 'user' namespace?
>> There already is a generic 'user_namespace' in
>> include/linux/user_namespace.h, serving as a container for all
>> namespaces; as such it probably should include this 'user' namespace,
>> leading to quite some confusion.
>> 
>> Or did I misunderstand something here?
>
> Ah yes, you misunderstand. The KOBJ_NS_TYPE_* tags are namespace tags.
> So KOBJ_NS_TYPE_NET is a network namespace tag. So KOBJ_NS_TYPE_USER is
> a user namespace tag not a completely new namespace. The idea very
> roughly being that devices such as loop devices are ultimately filtered
> by user namespace which is taken from the s_user_ns the loopfs instance
> is mounted in. We should compare notes.

There are two easy possibilities.

- All of the devices on the filesystem show up in sysfs with unique
  major minor numbers.
- None of the devices on the filesystem show up in sysfs.
  (Which I believe is what devpts does).

I favor none of the virtual devices showing up in sysfs.  Maybe existing
userspace needs the devices in sysfs, but if the solution is simply to
skip sysfs for virtual devices that is much simpler.

Eric


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: device namespaces
  2021-06-11 18:14                       ` Eric W. Biederman
@ 2021-06-14  7:49                         ` Enrico Weigelt, metux IT consult
  2021-06-14  8:22                           ` Greg KH
  2021-06-14 17:36                           ` Eric W. Biederman
  0 siblings, 2 replies; 48+ messages in thread
From: Enrico Weigelt, metux IT consult @ 2021-06-14  7:49 UTC (permalink / raw)
  To: Eric W. Biederman, Christian Brauner
  Cc: Hannes Reinecke, gregkh, containers, linux-kernel

On 11.06.21 20:14, Eric W. Biederman wrote:

Hi,

> I favor none of the virtual devices showing up in sysfs.  Maybe existing
> userspace needs the devices in sysfs, but if the solution is simply to
> skip sysfs for virtual devices that is much simpler.

Sorry for being a little bit confused, but by virtual devices do you mean
things like ptys, or all the other stuff we already see under
/sys/devices/virtual ?

I'm not yet sure which way is better. If we're just talking about ptys
specifically, I could maybe live with treating them like a "special sort
of pipes", but I guess that would require some extra magic.

If I'm not mistaken, the whole sysfs stuff is handled automatically by
device classes and buses - it seems that ttys are also class devs.

How would you skip the virtual devices from sysfs ? Adding some filter
into sysfs that looks at the device class (or some flag within it) ?

--mtx

-- 
---
Hinweis: unverschlüsselte E-Mails können leicht abgehört und manipuliert
werden ! Für eine vertrauliche Kommunikation senden Sie bitte ihren
GPG/PGP-Schlüssel zu.
---
Enrico Weigelt, metux IT consult
Free software and Linux embedded engineering
info@metux.net -- +49-151-27565287

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: device namespaces
  2021-06-14  7:49                         ` Enrico Weigelt, metux IT consult
@ 2021-06-14  8:22                           ` Greg KH
  2021-06-14 17:36                           ` Eric W. Biederman
  1 sibling, 0 replies; 48+ messages in thread
From: Greg KH @ 2021-06-14  8:22 UTC (permalink / raw)
  To: Enrico Weigelt, metux IT consult
  Cc: Eric W. Biederman, Christian Brauner, Hannes Reinecke,
	containers, linux-kernel

On Mon, Jun 14, 2021 at 09:49:22AM +0200, Enrico Weigelt, metux IT consult wrote:
> On 11.06.21 20:14, Eric W. Biederman wrote:
> 
> Hi,
> 
> > I favor none of the virtual devices showing up in sysfs.  Maybe existing
> > userspace needs the devices in sysfs, but if the solution is simply to
> > skip sysfs for virtual devices that is much simpler.
> 
> Sorry for being a little bit confused, but by virtual devices do you mean
> things like ptys, or all the other stuff we already see under
> /sys/devices/virtual ?
> 
> I'm not yet sure which way is better. If we're just talking about ptys
> specifically, I could maybe live with treating them like a "special sort
> of pipes", but I guess that would require some extra magic.
> 
> If I'm not mistaken, the whole sysfs stuff is handled automatically by
> device classes and buses - it seems that ttys are also class devs.
> 
> How would you skip the virtual devices from sysfs ? Adding some filter
> into sysfs that looks at the device class (or some flag within it) ?

Wait, step back.  What _EXACTLY_ are you wanting to do here?  If you
have not looked at how sysfs handles devices today, that leads me to
believe that you do not have a real model in place.

Again, spend some time and write some code please before continuing this
thread.  We don't like to talk about vague things when you do not even
have an idea of what you want.

good luck!

greg k-h

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: device namespaces
  2021-06-14  7:49                         ` Enrico Weigelt, metux IT consult
  2021-06-14  8:22                           ` Greg KH
@ 2021-06-14 17:36                           ` Eric W. Biederman
  2021-06-15 11:24                             ` Enrico Weigelt, metux IT consult
  1 sibling, 1 reply; 48+ messages in thread
From: Eric W. Biederman @ 2021-06-14 17:36 UTC (permalink / raw)
  To: Enrico Weigelt, metux IT consult
  Cc: Christian Brauner, Hannes Reinecke, gregkh, containers, linux-kernel

"Enrico Weigelt, metux IT consult" <lkml@metux.net> writes:

> On 11.06.21 20:14, Eric W. Biederman wrote:
>
> Hi,
>
>> I favor none of the virtual devices showing up in sysfs.  Maybe existing
>> userspace needs the devices in sysfs, but if the solution is simply to
>> skip sysfs for virtual devices that is much simpler.
>
> Sorry for being a little bit confused, but by virtual devices do you mean
> things like ptys, or all the other stuff we already see under
> /sys/devices/virtual ?

By virtual devices I mean all devices that are not physical pieces
of hardware.  For block devices I mean devices such as loopback
devices that are created on demand.  Ramdisks that started this
conversation could also be considered virtual devices.
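
On current kernels, on-demand block devices such as loop devices and ramdisks are usually listed under /sys/devices/virtual/block. A minimal sketch for looking at them from userspace, assuming sysfs is mounted at /sys (the helper name `virtual_block_devices` is made up for illustration):

```python
import os


def virtual_block_devices(sysfs_root="/sys"):
    """Return names under <sysfs_root>/devices/virtual/block, or [] if absent."""
    path = os.path.join(sysfs_root, "devices", "virtual", "block")
    try:
        return sorted(os.listdir(path))
    except OSError:
        # sysfs not mounted here, or no virtual block devices registered
        return []


print(virtual_block_devices())
```

What the list contains depends entirely on the machine; inside a container without sysfs it is simply empty.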

> How would you skip the virtual devices from sysfs ? Adding some filter
> into sysfs that looks at the device class (or some flag within it) ?

I would just not run the code to create sysfs entries when the virtual
devices are created.

If you have virtual devices showing up in their own filesystem they
don't even need major or minor numbers.  You can just have files
that accept ioctls like device nodes.  In principle it is
possible to skip a lot of the historical infrastructure.  If the
infrastructure is not needed it is worth skipping.

I haven't dug into the block layer recently enough to say what is needed
or not.  I think there are some things, such as stat on a mounted
filesystem, that need major and minor numbers.  Which probably means
you have to use major and minor numbers.  By virtue of using common
infrastructure that implies showing up in sysfs and devtmpfs.  Things
would be limited just by not mounting devtmpfs in a container.
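
The stat dependency mentioned above is easy to see from userspace. A minimal sketch, assuming a Unix-like system; the two helper names are hypothetical, while `os.stat`, `os.major` and `os.minor` are the standard Python wrappers:

```python
import os
import stat


def fs_device_numbers(path):
    """(major, minor) of the device backing the filesystem that holds path."""
    st = os.stat(path)
    return os.major(st.st_dev), os.minor(st.st_dev)


def node_device_numbers(path):
    """(major, minor) a device node refers to, or None if path is not a device node."""
    st = os.stat(path)
    if stat.S_ISBLK(st.st_mode) or stat.S_ISCHR(st.st_mode):
        return os.major(st.st_rdev), os.minor(st.st_rdev)
    return None


print("filesystem holding / :", fs_device_numbers("/"))
```

Every file reports st_dev, so a filesystem mounted from a block device exposes that device's numbers to any program that calls stat, whether or not a device node for it is visible.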

It is worth checking how much of the common infrastructure you need when
you start creating virtual devices.

The only reason the network devices need changes to sysfs is to allow
different network devices with the same name to show up in different
network namespaces.

If you can fundamentally avoid the problem of devices with the same
name needing to show up in sysfs and devtmpfs by using filesystems
then sysfs and devtmpfs need no changes.

Hotplug is sufficiently widespread now that it should be possible
to avoid the hard problem of having duplicate names for block devices,
one way or another.  Thus talking of changing sysfs seems completely
unnecessary.

Eric

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: device namespaces
  2021-06-14 17:36                           ` Eric W. Biederman
@ 2021-06-15 11:24                             ` Enrico Weigelt, metux IT consult
  2021-06-15 11:33                               ` Greg KH
  0 siblings, 1 reply; 48+ messages in thread
From: Enrico Weigelt, metux IT consult @ 2021-06-15 11:24 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Christian Brauner, Hannes Reinecke, gregkh, containers, linux-kernel

On 14.06.21 19:36, Eric W. Biederman wrote:

> By virtual devices I mean all devices that are not physical pieces
> of hardware.  For block devices I mean devices such as loopback
> devices that are created on demand.  Ramdisks that started this
> conversation could also be considered virtual devices.

Ok. Do you also count partitions in here ?

IMHO we've got another category to look at: devices that (can) create
more (sub)devices. Examples that come to mind are loopdev, ptmx,
partitions, etc.

The big problem here: first we'd need to be clear on the actual
semantics in a namespaced context, for example:

* what happens when you talk to /dev/loop0 and create a new loopdev
   inside a container - shall it ever be visible on the host ?

* what if you want to create a loopdev on some file that's only visible
   to the host, but that loopdev shall appear inside a container ?
   ("virtual disk" scenario)

>> How would you skip the virtual devices from sysfs ? Adding some filter
>> into sysfs that looks at the device class (or some flag within it) ?
> 
> I would just not run the code to create sysfs entries when the virtual
> devices are created.

Oh, that would most likely make userland unhappy.

Besides, that won't be so trivial due to the way sysfs works, because
sysfs more or less just presents kobjs. Each kobj may have attributes,
a parent, and a list of children. A device is a kobj, and it needs to
be registered into the device hierarchy to work at all. Sysfs itself
doesn't really know whether something is a virtual device (or a device
at all) - it just calls some functions from kobject_type for things like
reading/writing attributes, etc. But I don't see anything where
kobject_types can implement their own iterators.

As things are right now, not registering a device in sysfs means not
registering it at all.

By the way: I'm just wondering whether it would make sense to give
kobject_type its own iteration and lookup functions. Unless I'm fully
mistaken, that could help solve several other problems, e.g. device
renaming (currently *very* tricky and only works to some extent for
network devices).

IMHO, we could then eg. fetch the device names (/sys/devices/...)
directly from the struct device instead of the kset (perhaps a simple
list instead of kset would also do here), and also create the symlinks
(e.g. /sys/class/.../) on the fly. Once that's done, renaming a device
should become rather simple.

At that point, adding multiple views of certain parts of sysfs (e.g. the
devices hierarchy) could perhaps be done by implementing special
iterators that take the view criteria into account.
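
A rough userspace toy model of that iterator idea, in Python rather than kernel C. Everything here is hypothetical: `KObj`, `owner_ns` and `iter_children` are illustrative names, and nothing like this exists in the kernel's kobject API today.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterator, List, Optional


@dataclass
class KObj:
    """Toy stand-in for a kobject: a name, a parent and a list of children."""
    name: str
    parent: Optional["KObj"] = field(default=None, repr=False)
    children: List["KObj"] = field(default_factory=list)
    owner_ns: str = "host"  # hypothetical view criterion

    def add(self, child: "KObj") -> "KObj":
        """Link child below this object and return it."""
        child.parent = self
        self.children.append(child)
        return child


def iter_children(kobj: KObj, view: Callable[[KObj], bool]) -> Iterator[KObj]:
    """Yield only the children that the given view predicate accepts."""
    return (c for c in kobj.children if view(c))


block = KObj("block")
block.add(KObj("loop0", owner_ns="host"))
block.add(KObj("loop1", owner_ns="container-a"))


def container_view(k: KObj) -> bool:
    return k.owner_ns == "container-a"


print([c.name for c in iter_children(block, container_view)])
```

The hierarchy itself stays untouched; only the iteration step decides what a given viewer gets to see, which is the property the proposal relies on.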

@Greg: what's your take on that iterator idea ?

> If you have virtual devices showing up in their own filesystem they
> don't even need major or minor numbers.  You can just have files
> that accept ioctls like device nodes.  In principle it is
> possible to skip a lot of the historical infrastructure.  If the
> infrastructure is not needed it is worth skipping.

Ah, I see where you're going. You want to completely drop these virtual
devices and replace them with a synthetic fs that *looks* like it
contains devices ? Well, theoretically it should be possible, since
filesystems may handle opening device nodes completely on their own,
instead of calling generic code (is there any that actually does ?).

BUT: in that case we have to really make sure that processes inside the
container cannot ever open any device node outside that special fs.

> I haven't dug into the block layer recently enough to say what is needed
> or not.  I think there are some things, such as stat on a mounted
> filesystem, that need major and minor numbers.  Which probably means
> you have to use major and minor numbers.  By virtue of using common
> infrastructure that implies showing up in sysfs and devtmpfs.  Things
> would be limited just by not mounting devtmpfs in a container.

Note that this approach also needs to support things like dynamically
creating new device nodes (inside the container), udev, ... otherwise
you'd need very special handling in userland again (lxc folks would
become very unhappy ;-))

> It is worth checking how much of the common infrastructure you need when
> you start creating virtual devices.

s/virtual devices/synthetic filesystems/;

Your approach goes very much in the Plan9 direction (which in general I'd
love to see). But whatever we do here needs to remain compatible
with what existing userland expects - we've got a lot of Unix tradition
to keep here.

OR: we'd have to declare that (once inside the devns) we throw it all away
and create something entirely new that's more like a Plan9 subsystem
than a Linux container. Also interesting, but not what I started
this discussion for.

> The only reason the network devices need changes to sysfs is to allow
> different network devices with the same name to show up in different
> network namespaces.
> 
> If you can fundamentally avoid the problem of devices with the same
> name needing to show up in sysfs and devtmpfs by using filesystems
> then sysfs and devtmpfs need no changes.

Well, that's only for the sysfs part. Network devices still need to
be namespaced in other places (sockets, etc) - which is already done by
netns.

But yes, it would be nice if we had entirely separate namespaces for
network device names (e.g. any of the host's network devices could
appear simply as "eth0" inside a container, if you want that).


--mtx

-- 
---
Hinweis: unverschlüsselte E-Mails können leicht abgehört und manipuliert
werden ! Für eine vertrauliche Kommunikation senden Sie bitte ihren
GPG/PGP-Schlüssel zu.
---
Enrico Weigelt, metux IT consult
Free software and Linux embedded engineering
info@metux.net -- +49-151-27565287

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: device namespaces
  2021-06-15 11:24                             ` Enrico Weigelt, metux IT consult
@ 2021-06-15 11:33                               ` Greg KH
  0 siblings, 0 replies; 48+ messages in thread
From: Greg KH @ 2021-06-15 11:33 UTC (permalink / raw)
  To: Enrico Weigelt, metux IT consult
  Cc: Eric W. Biederman, Christian Brauner, Hannes Reinecke,
	containers, linux-kernel

On Tue, Jun 15, 2021 at 01:24:24PM +0200, Enrico Weigelt, metux IT consult wrote:
> @Greg: what's your take on that iterator idea ?

I want you to stop talking about ideas, and try to implement them before
this conversation wastes anyone else's time and energy.

There is a good reason we do not do this type of "let's discuss things!"
in the kernel community, and that is because almost none of it matters
without working code.

So please, let's see some patches that implement your ideas and then we
can discuss them.

Until then, consider this thread ignored from me.

greg k-h

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Device Namespaces
       [not found]                   ` <87bo3gshz5.fsf_-_-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
  2013-09-26  5:33                     ` Greg Kroah-Hartman
@ 2013-10-28 23:31                     ` Andrey Wagin
  1 sibling, 0 replies; 48+ messages in thread
From: Andrey Wagin @ 2013-10-28 23:31 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Greg Kroah-Hartman, Linux Containers, Kay Sievers,
	Andy Lutomirski, devel-GEFAQzZX7r8dnm+yROfE0A, lxc-devel,
	mhw-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Stephane Graber

2013/9/26 Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
>
>
> From conversations at Linux Plumbers Conference it became fairly clear
> that one of, if not the, roughest edges on containers today is dealing
> with devices.
>
> - Hotplug does not work.
> - There seems to be no implementation that does much beyond setting up
>   a static set of /dev entries today.
> - Containers do not see the appropriate uevents for their container.
>
> One of the more compelling cases I heard was of someone who was running
> a Linux desktop in a container and wanted to just let that container
> see the devices needed for his desktop, and not everything else.

I have experience implementing this functionality in the OpenVZ kernel.
I was required not to modify user-space tools, so that implementation
looks like a dirty hack, but even device hotplug works there.

....

>
> So the big issues for a device namespace to solve are filtering which
> devices a container has access to and being able to dynamically change
> which devices those are at run time (aka hotplug).
>
> After having thought about this for a bit I don't know if a pure
> userspace solution is sufficient or actually a good idea.

I would prefer to think a bit more about a userspace solution. We can
try to extend udev's functionality.

>
> - We can manually manage a tmpfs with device nodes in userspace.
>   (But that is deprecated functionality in the mainstream kernel).
> - We can manually export a subset of sysfs with bind mounts.
>   (But that feels hacky, and is essentially incompatible with hotplug).
> - We can relay a call of /sbin/hotplug from outside of a container
>   to inside of a container based on policy.
>   (But no one uses /sbin/hotplug anymore).
> - There is no way to fake netlink uevents for a container to see them.
>   (The best we could do is replace udev everywhere with something that
>    listens on a unix domain socket).

Or we can teach udev to listen on a unix domain socket.

The host udev listens on netlink. When it gets an event about a new
device, it decides which containers the device must be available to,
performs all required actions and sends events into those containers.
The notification protocol should probably be unified for all udev-like
services.
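
(Editorial illustration, not part of the original message.) The relay
described above needs to inspect each kernel uevent before deciding
which container should see it. The kernel broadcasts uevents on
NETLINK_KOBJECT_UEVENT as an "ACTION@DEVPATH" header followed by
NUL-terminated "KEY=VALUE" strings. A minimal sketch of the parsing
step, assuming that raw kernel format; the function names and the
per-subsystem policy are hypothetical:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Look up KEY in a raw kernel uevent payload of LEN bytes.
 * Returns a pointer to the NUL-terminated VALUE inside buf,
 * or NULL if the key is absent.
 */
const char *uevent_get(const char *buf, size_t len, const char *key)
{
	size_t klen, off = 0;

	assert(buf != NULL && key != NULL);
	klen = strlen(key);

	while (off < len) {
		const char *field = buf + off;
		const char *end = memchr(field, '\0', len - off);
		size_t flen;

		if (end == NULL)
			break;	/* unterminated trailing field: ignore it */
		flen = (size_t)(end - field);
		if (flen > klen && field[klen] == '=' &&
		    strncmp(field, key, klen) == 0)
			return field + klen + 1;
		off += flen + 1;
	}
	return NULL;
}

/* Hypothetical policy hook: forward only events of one subsystem. */
int uevent_matches_subsystem(const char *buf, size_t len,
			     const char *subsystem)
{
	const char *v = uevent_get(buf, len, "SUBSYSTEM");

	return v != NULL && strcmp(v, subsystem) == 0;
}
```

In a real daemon, buf would come from recv() on a
socket(AF_NETLINK, SOCK_DGRAM, NETLINK_KOBJECT_UEVENT) bound to
multicast group 1; how the event is then passed into the container
(for example over a unix domain socket) is left open here, as it is in
the thread.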

>
> - It would be nice to replace the device cgroup with a comprehensive
>   solution that really works. (Among other things the device cgroup
>   does not work in terms of struct device the underlying kernel
>   abstraction for devices).
>
> We must manage sysfs entries as well as device nodes because:
> - Seeing more than we should has the real potential to confuse
>   userspace, especially a userspace that replays uevents.
> - Some device control must happen through writing to sysfs files and
>   if we don't remove all root privileges from a container only by
>   exporting a subset of sysfs to that container can we limit which
>   sysfs nodes can be written to.

Sorry if the following idea sounds crazy. Can we use FUSE
filesystems for filtering sysfs and devtmpfs? When a CT mounts sysfs,
it will mount fuse-sysfs, which is implemented by a userspace program on
the host system.

* This approach lets us emulate the behavior of uevent files in
containers, if we use unix sockets between udev services.
* A userspace daemon will probably be more flexible and customizable
than something in the kernel.

Do we have a use case where sysfs performance is critical?

Thanks,
Andrey


* Re: Device Namespaces
       [not found]         ` <CAA2m6vc3OFmS9VwiTavRzPqhn+qoe6vDCO2sitXpEQ8a1JVyfg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2013-10-03  9:17           ` Eric W. Biederman
  0 siblings, 0 replies; 48+ messages in thread
From: Eric W. Biederman @ 2013-10-03  9:17 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Greg Kroah-Hartman, Linux Containers, Kay Sievers,
	Andy Lutomirski, devel-GEFAQzZX7r8dnm+yROfE0A, lxc-devel,
	mhw-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Stephane Graber

Amir Goldstein <amir@cellrox.com> writes:

> Excellent! Let's focus the discussion on a new device driver we want
> to write which is namespace aware. Let's call this device driver
> valarm-dev. Similarly to Android's alarm-dev, valarm-dev can be used
> to request RTC wakeup calls from user space and get/set RTC values,
> but with valarm-dev, every container may use different values for
> current time.
>
> As you can see in our patch set, we already have a version of
> alarm-dev that maintains its state inside a context, instead of in
> a global variable, so it is capable of providing a different context
> per namespace.
>
> And now for the 1M$ question: per *which* namespace do we attribute
> the current realtime clock time?

To none of them.  Just use a different minor per instance, then you
don't have a hard question to answer.

> To UTS namespace (because T historically stands for Time)? To device
> namespace?
> Even if a device namespace existed, we do not want to tie the policy
> decision of "separate time" to a very wide definition of "separate
> devices".
>
> So what we want to create is an API for device driver writers that
> makes it possible to write a namespace-aware device and allows
> userspace to configure when the namespace-aware device context is
> unshared.


> We would like to share with you our very initial thoughts about how
> this will be implemented:
> - Extend the register_pernet_subsys/device(ops) API to a
>   register_perns_subsys/device(nstype, ops) API
> - Extend pernet_operations to perns_operations that include optional
>   migrate() and/or unshare() ops
> - Let valarm-dev register_peruser_subsys/device(&alarm_userns_ops)

For the network subsystem that makes sense.  But it doesn't make sense
for devices.  It is just an unneeded extra complication.

> - Implement a new syscall (or netlink command if it makes more sense)
> setdevns(int dev_fd, int ns_fd, int nstype, int flags)

ioctl?  master device? How do people communicate with raw devices these
days?

> - Unlike the netlink set netns case, this API is not used solely to
>   *move* a device to a different namespace, but also to *unshare* a
>   device context between namespaces, for those devices that
>   registered unshare() ops.

I really think this all makes the most sense one virtual driver at a
time.

> This is our missing piece of the puzzle.
> After that, whether we make changes to existing drivers (e.g. evdev)
> or write new virtualized drivers (e.g. vevdev) is a technicality. We
> do not care which way we go, whichever way seems more maintainable.
>
> What do you think of this master plan?

I think by making your devices' behavior depend on which namespace they
are in you are making the drivers unnecessarily fragile, and
unnecessarily unusable.

I think the code will be simpler/cleaner/better if you don't need to
have context outside of your drivers.

> P.S. Please try to refrain from addressing the validity of the use
> case of alarm-dev in particular, as we do not wish to engage in
> "Android sucks" wars.
> We simply want to present the case for improving the namespace
> infrastructure to cater to the needs of device driver writers who
> wish to tailor their drivers for container-based products.

I think this is a driver interface problem, not a namespace problem.
None of the similar drivers that exist in the network namespace
change their behavior depending on which namespace they are in.

The two practical choices I see are:
1) Use a bunch of minors for your driver.
2) Act roughly like /dev/pts and use different mounts of the filesystem
   to create new instances.

I think different minors is probably easier, but we have two successful
models I am aware of so I have mentioned both.
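
(Editorial illustration, not part of the original message.) The
"bunch of minors" option keeps all context inside the driver: each
minor number selects an independent instance, and the container
manager decides which device node a container gets. A userspace
sketch of that lookup, assuming a hypothetical valarm driver that
stores a per-instance clock offset; the names, the instance limit and
the major number used in the usage note are made up for illustration:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/sysmacros.h>
#include <sys/types.h>

#define VALARM_MAX_INSTANCES 8	/* hypothetical limit */

struct valarm_instance {
	int64_t offset_sec;	/* offset of this instance from the real RTC */
};

static struct valarm_instance valarm_instances[VALARM_MAX_INSTANCES];

/* Select the instance from the minor number of the opened node. */
struct valarm_instance *valarm_lookup(dev_t dev)
{
	unsigned int m = minor(dev);

	return m < VALARM_MAX_INSTANCES ? &valarm_instances[m] : NULL;
}

/* Set the time seen through one node without touching the others. */
int valarm_settime(dev_t dev, int64_t rtc_sec, int64_t wanted_sec)
{
	struct valarm_instance *inst = valarm_lookup(dev);

	if (inst == NULL)
		return -1;
	inst->offset_sec = wanted_sec - rtc_sec;
	return 0;
}

/* Read the time as seen through one node. */
int valarm_gettime(dev_t dev, int64_t rtc_sec, int64_t *out)
{
	struct valarm_instance *inst = valarm_lookup(dev);

	assert(out != NULL);
	if (inst == NULL)
		return -1;
	*out = rtc_sec + inst->offset_sec;
	return 0;
}
```

With nodes created as, say, makedev(240, 0) for the host and
makedev(240, 1) for a container, setting the time through minor 1
leaves the value read through minor 0 unchanged, and no namespace
lookup is involved.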

Eric

_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/containers


* Re: Device Namespaces
       [not found]     ` <87a9iri3ot.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
  2013-10-03  0:59       ` Eric W. Biederman
@ 2013-10-03  8:58       ` Amir Goldstein
       [not found]         ` <CAA2m6vc3OFmS9VwiTavRzPqhn+qoe6vDCO2sitXpEQ8a1JVyfg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  1 sibling, 1 reply; 48+ messages in thread
From: Amir Goldstein @ 2013-10-03  8:58 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Greg Kroah-Hartman, Linux Containers, Kay Sievers,
	Andy Lutomirski, devel-GEFAQzZX7r8dnm+yROfE0A, lxc-devel,
	mhw-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Stephane Graber

On Thu, Oct 3, 2013 at 3:44 AM, Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:

> Amir Goldstein <amir-3AfRa/s5aFdBDgjK7y7TUQ@public.gmane.org> writes:
>
> > What we really like to see is a setns() style API that can be used to
> > add a device in the context of a namespace in either a "shared" or
> > "private" mode.
>
> I think you mean an "ip link set dev FOO netns XXX" style API.
>

correct.


>
> Right now one of the best suggestions on the table is:
>
> mkdir -p /dev/container/X
> ln /dev/zero /dev/container/X/zero
> ln /dev/null /dev/container/X/null
> ...
>
> With /dev/container/X mounted on /dev for container X.
>
> Which seems to cover putting a device in a namespace, while allowing
> things to still be reasonably managed.
>
> There are a few other variations on that scheme but nothing that says we
> must have kernel support or to create any kind of kernel context beyond
> which directory the device nodes live in.
>
> > This kind of API is a required building block for us to write device
> > drivers that are namespace aware in a way that userspace will have
> > enough flexibility for dynamic configuration.
> >
> > We are trying to come up with a proposal for that sort of API.  When
> > we have something decent, we shall post it.
>
> I really think what you need to write are special drivers that
> facilitate your use case.
>
> For the networking stack we wound up adding veth pairs, and macvlan
> devices, to handle the common sharing modes.
>
> Outside of your sharing situation I am not seeing any need or any
> advantage of creating devices that are modified to be sharable and I am
> seeing a lot of disadvantages to implementing things that way.  The
> biggest is that you seem to be working independently of the subsystem
> maintainers of those devices which is generally a poor idea.
>
> Unprivileged creation of device nodes we can handle if it can be shown
> that it is safe to create device nodes.
>
> As I understand your problem you are trying to multiplex a device by
> building a device with a built in stop light.  Where one opener can
> write and the other openers are stopped/dropped.  That sounds very
> similar to macvlan, or ethernet bridging.   From the patches you have
> floated I suspect it would be very simple to build and just need a
> little bit of glue.
>

Excellent! Let's focus the discussion on a new device driver we want to
write which is namespace aware. Let's call this device driver valarm-dev.
Similarly to Android's alarm-dev, valarm-dev can be used to request RTC
wakeup calls from user space and get/set RTC values, but with valarm-dev,
every container may use different values for current time.

As you can see in our patch set, we already have a version of alarm-dev
that maintains its state inside a context, instead of in a global
variable, so it is capable of providing a different context per
namespace.

And now for the 1M$ question: per *which* namespace do we attribute the
current realtime clock time?
To UTS namespace (because T historically stands for Time)? To device
namespace?
Even if a device namespace existed, we do not want to tie the policy
decision of "separate time" to a very wide definition of "separate
devices".

So what we want to create is an API for device driver writers that
makes it possible to write a namespace-aware device and allows userspace
to configure when the namespace-aware device context is unshared.

We would like to share with you our very initial thoughts about how this
will be implemented:
- Extend the register_pernet_subsys/device(ops) API to a
  register_perns_subsys/device(nstype, ops) API
- Extend pernet_operations to perns_operations that include optional
  migrate() and/or unshare() ops
- Let valarm-dev register_peruser_subsys/device(&alarm_userns_ops)
- Implement a new syscall (or netlink command if it makes more sense)
  setdevns(int dev_fd, int ns_fd, int nstype, int flags)
- Unlike the netlink set netns case, this API is not used solely to
  *move* a device to a different namespace, but also to *unshare* a
  device context between namespaces, for those devices that registered
  unshare() ops.

This is our missing piece of the puzzle.
After that, whether we make changes to existing drivers (e.g. evdev) or
write new virtualized drivers (e.g. vevdev) is a technicality. We do not
care which way we go, whichever way seems more maintainable.

What do you think of this master plan?

P.S. Please try to refrain from addressing the validity of the use case
of alarm-dev in particular, as we do not wish to engage in "Android
sucks" wars.
We simply want to present the case for improving the namespace
infrastructure to cater to the needs of device driver writers who wish
to tailor their drivers for container-based products.

Cheers,
Amir.



>
> Eric
>


* Re: Device Namespaces
       [not found]     ` <87a9iri3ot.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
@ 2013-10-03  0:59       ` Eric W. Biederman
  2013-10-03  8:58       ` Amir Goldstein
  1 sibling, 0 replies; 48+ messages in thread
From: Eric W. Biederman @ 2013-10-03  0:59 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Greg Kroah-Hartman, Linux Containers, Kay Sievers,
	Andy Lutomirski, devel-GEFAQzZX7r8dnm+yROfE0A, lxc-devel,
	mhw-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Stephane Graber

ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org (Eric W. Biederman) writes:

>> This kind of API is a required building block for us to write device
>> drivers that are namespace aware in a way that userspace will have
>> enough flexibility for dynamic configuration.
>>
>> We are trying to come up with a proposal for that sort of API.  When
>> we have something decent, we shall post it.
>
> I really think what you need to write are special drivers that
> facilitate your use case.

Even more practically if you can write special drivers it removes a
level of policy from the kernel, and allows those special drivers to
be used at other times for other occasions.

Eric


* Re: Device Namespaces
       [not found] ` <CAA2m6veny-7_ONMA973Wu36U4kz4gAuw0dpodkb8+GZDv6VNBQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2013-09-29 20:06   ` Greg Kroah-Hartman
@ 2013-10-03  0:44   ` Eric W. Biederman
       [not found]     ` <87a9iri3ot.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
  1 sibling, 1 reply; 48+ messages in thread
From: Eric W. Biederman @ 2013-10-03  0:44 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Greg Kroah-Hartman, Linux Containers, Kay Sievers,
	Andy Lutomirski, devel-GEFAQzZX7r8dnm+yROfE0A, lxc-devel,
	mhw-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Stephane Graber

Amir Goldstein <amir-3AfRa/s5aFdBDgjK7y7TUQ@public.gmane.org> writes:

> What we really like to see is a setns() style API that can be used to
> add a device in the context of a namespace in either a "shared" or
> "private" mode.

I think you mean an "ip link set dev FOO netns XXX" style API.

Right now one of the best suggestions on the table is:

mkdir -p /dev/container/X
ln /dev/zero /dev/container/X/zero
ln /dev/null /dev/container/X/null
...

With /dev/container/X mounted on /dev for container X.

Which seems to cover putting a device in a namespace, while allowing
things to still be reasonably managed.

There are a few other variations on that scheme but nothing that says we
must have kernel support or to create any kind of kernel context beyond
which directory the device nodes live in.
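
(Editorial illustration, not part of the original message.) The
mkdir/ln scheme above can also be driven from a container manager
written in C. A minimal sketch, assuming the source and destination
directories live on the same filesystem, since link(2) fails with
EXDEV otherwise; the function name is hypothetical, and only the last
directory component is created (unlike "mkdir -p"):

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hard-link src_dir/name to dst_dir/name, creating dst_dir if needed.
 * Returns 0 on success, -1 with errno set on failure.
 */
int expose_node(const char *src_dir, const char *dst_dir, const char *name)
{
	char src[4096], dst[4096];
	int n;

	assert(src_dir != NULL && dst_dir != NULL && name != NULL);

	n = snprintf(src, sizeof(src), "%s/%s", src_dir, name);
	if (n < 0 || (size_t)n >= sizeof(src)) {
		errno = ENAMETOOLONG;
		return -1;
	}
	n = snprintf(dst, sizeof(dst), "%s/%s", dst_dir, name);
	if (n < 0 || (size_t)n >= sizeof(dst)) {
		errno = ENAMETOOLONG;
		return -1;
	}

	/* An already existing destination directory is fine. */
	if (mkdir(dst_dir, 0755) != 0 && errno != EEXIST)
		return -1;

	return link(src, dst);
}
```

Calling expose_node("/dev", "/dev/container/X", "null") would
correspond to the "ln /dev/null /dev/container/X/null" line, provided
/dev/container already exists and the caller has sufficient privilege.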

> This kind of API is a required building block for us to write device
> drivers that are namespace aware in a way that userspace will have
> enough flexibility for dynamic configuration.
>
> We are trying to come up with a proposal for that sort of API.  When
> we have something decent, we shall post it.

I really think what you need to write are special drivers that
facilitate your use case.

For the networking stack we wound up adding veth pairs, and macvlan
devices, to handle the common sharing modes.

Outside of your sharing situation I am not seeing any need or any
advantage of creating devices that are modified to be sharable and I am
seeing a lot of disadvantages to implementing things that way.  The
biggest is that you seem to be working independently of the subsystem
maintainers of those devices which is generally a poor idea.

Unprivileged creation of device nodes we can handle if it can be shown
that it is safe to create device nodes.

As I understand your problem you are trying to multiplex a device by
building a device with a built in stop light.  Where one opener can
write and the other openers are stopped/dropped.  That sounds very
similar to macvlan, or ethernet bridging.   From the patches you have
floated I suspect it would be very simple to build and just need a
little bit of glue.

Eric


* Re: Device Namespaces
  2013-10-01 20:46                                         ` Serge Hallyn
@ 2013-10-02 22:55                                           ` Eric W. Biederman
  0 siblings, 0 replies; 48+ messages in thread
From: Eric W. Biederman @ 2013-10-02 22:55 UTC (permalink / raw)
  To: Serge Hallyn
  Cc: Greg Kroah-Hartman, Linux Containers, Kay Sievers,
	Stephane Graber, Andy Lutomirski, lxc-devel, mhw, devel

Serge Hallyn <serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org> writes:

>> Glossing over the details.  The general problem is some policy exists
>> outside of the container that decides if and when a container gets a
>> serial port and stuffs it in.
>> 
>> The expectation is that system containers will then run the udev
>> rules and send the libuevent event.  
>
> I thought the suggestion was that udev on the host would be given
> container-specific rules, saying "plop this device into /dev/container1/"
> (with /dev/container1 being bind-mounted to $container1_rootfs/dev).

That is what I was trying to describe.  We still need something that
lets the software in the container know it needs to do something.

I may be blind but right now short of replacing the internal udev, or
modifying the kernel I don't see a solution for letting software in a
container know there is a new device it can use.

Once we get the notification issue sorted out I think we have enough to
bring up a full desktop environment in a container and be able to say we
don't need anything else from devices unless someone discovers that
checkpoint/restart actually needs minor numbers to be preserved.

Eric


* Re: Device Namespaces
       [not found]                                           ` <20131001205718.GA17036-U8xfFu+wG4EAvxtiuMwx3w@public.gmane.org>
@ 2013-10-02 22:45                                             ` Eric W. Biederman
  0 siblings, 0 replies; 48+ messages in thread
From: Eric W. Biederman @ 2013-10-02 22:45 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Linux Containers, Kay Sievers, Andy Lutomirski, devel, lxc-devel,
	mhw, Stephane Graber


I think libudev is a solution to a completely different problem.  It is
possible I am blind but I just don't see how libudev even attempts to
solve the problem.

The desire is to plop a distro install into a subdirectory.  Fire up a
container around it, and let the distro's userspace do its thing to
manage hotplug events.

devtmpfs can be faked fairly easily.
I don't know about sysfs.

Sending events that say you have hotplugged is the largest practical
problem.

On the minimal side I think the patch below is enough to let us fake up
uevents for the container and make things work.  I have heard it said
that faking uevents is a bad thing.  But I have not heard a reason or seen
any attempt at explanation.  My guess is that we are simply talking
about different problems.

I would like to see someone wire up all of the userspace bits and see
how well hotplug can be made to work before I walk down the path
represented by this patch but it seems reasonable.  But I do have
anecdotal reports from someone who walked a similar path that this is
enough to bring up a full desktop system in a container.

Eric


diff --git a/include/linux/netlink.h b/include/linux/netlink.h
index 7a6c396a263b..46d05783da82 100644
--- a/include/linux/netlink.h
+++ b/include/linux/netlink.h
@@ -38,6 +38,7 @@ extern void netlink_table_ungrab(void);
 
 #define NL_CFG_F_NONROOT_RECV	(1 << 0)
 #define NL_CFG_F_NONROOT_SEND	(1 << 1)
+#define NL_CFG_F_IMPERSONATE_KERN (1 << 2)
 
 /* optional Netlink kernel configuration parameters */
 struct netlink_kernel_cfg {
diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
index 52e5abbc41db..f75e34397df8 100644
--- a/lib/kobject_uevent.c
+++ b/lib/kobject_uevent.c
@@ -375,9 +375,12 @@ static int uevent_net_init(struct net *net)
 	struct uevent_sock *ue_sk;
 	struct netlink_kernel_cfg cfg = {
 		.groups	= 1,
-		.flags	= NL_CFG_F_NONROOT_RECV,
+		.flags	= NL_CFG_F_NONROOT_RECV | NL_CFG_F_IMPERSONATE_KERN,
 	};
 
+	if (net->user_ns != &init_user_ns)
+		return 0;
+
 	ue_sk = kzalloc(sizeof(*ue_sk), GFP_KERNEL);
 	if (!ue_sk)
 		return -ENOMEM;
@@ -399,6 +402,9 @@ static void uevent_net_exit(struct net *net)
 {
 	struct uevent_sock *ue_sk;
 
+	if (net->user_ns != &init_user_ns)
+		return;
+
 	mutex_lock(&uevent_sock_mutex);
 	list_for_each_entry(ue_sk, &uevent_sock_list, list) {
 		if (sock_net(ue_sk->sk) == net)
diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 0c61b59175dc..71863cc465eb 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -1252,7 +1252,7 @@ static int netlink_release(struct socket *sock)
 
 	skb_queue_purge(&sk->sk_write_queue);
 
-	if (nlk->portid) {
+	if (sk_hashed(sk)) {
 		struct netlink_notify n = {
 						.net = sock_net(sk),
 						.protocol = sk->sk_protocol,
@@ -1409,11 +1409,21 @@ static int netlink_bind(struct socket *sock, struct sockaddr *addr,
 			return err;
 	}
 
-	if (nlk->portid) {
+	if (sk_hashed(sk)) {
 		if (nladdr->nl_pid != nlk->portid)
 			return -EINVAL;
 	} else {
-		err = nladdr->nl_pid ?
+		bool autobind = nladdr->nl_pid == 0;
+		if (nladdr->nl_pid == 0 && (nladdr->nl_pad == 0xffff)) {
+			if (!(nl_table[sk->sk_protocol].flags & NL_CFG_F_IMPERSONATE_KERN))
+				return -EPERM;
+			if (net->user_ns == &init_user_ns)
+				return -EPERM;
+			if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
+				return -EPERM;
+			autobind = false;
+		}
+		err = !autobind ?
 			netlink_insert(sk, net, nladdr->nl_pid) :
 			netlink_autobind(sock);
 		if (err)
@@ -1467,7 +1477,7 @@ static int netlink_connect(struct socket *sock, struct sockaddr *addr,
 	if (nladdr->nl_groups && !netlink_capable(sock, NL_CFG_F_NONROOT_SEND))
 		return -EPERM;
 
-	if (!nlk->portid)
+	if (!sk_hashed(sk))
 		err = netlink_autobind(sock);
 
 	if (err == 0) {
@@ -2228,7 +2238,7 @@ static int netlink_sendmsg(struct kiocb *kiocb, struct socket *sock,
 		dst_group = nlk->dst_group;
 	}
 
-	if (!nlk->portid) {
+	if (!sk_hashed(sk)) {
 		err = netlink_autobind(sock);
 		if (err)
 			goto out;
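
(Editorial illustration, not part of the original message.) With the
patch above applied, a privileged process in a non-initial user
namespace would request the kernel's port id by binding with
nl_pid == 0 and the magic nl_pad value 0xffff that netlink_bind()
checks for. A sketch of how userspace might fill in that address; this
interface exists only in this proposed patch, and the helper and
macro names are hypothetical:

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/netlink.h>

/* Magic nl_pad value tested by the proposed netlink_bind() change. */
#define UEVENT_IMPERSONATE_PAD 0xffff

/* Fill ADDR so that bind() asks for port id 0 instead of autobind. */
void uevent_impersonate_addr(struct sockaddr_nl *addr)
{
	assert(addr != NULL);
	memset(addr, 0, sizeof(*addr));
	addr->nl_family = AF_NETLINK;
	addr->nl_pid = 0;			/* the kernel's port id */
	addr->nl_pad = UEVENT_IMPERSONATE_PAD;	/* opt in to impersonation */
}
```

The address would then be passed to bind() on a
socket(AF_NETLINK, SOCK_DGRAM, NETLINK_KOBJECT_UEVENT) created inside
the container's network namespace, and faked uevents sent to multicast
group 1. According to the checks in the patch, the bind is refused
with EPERM unless the caller has CAP_NET_ADMIN in the user namespace
owning that network namespace, and always in the initial user
namespace.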


* Re: Device Namespaces
       [not found]                                       ` <87had0wz07.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
  2013-10-01 20:46                                         ` Serge Hallyn
  2013-10-01 20:57                                         ` Greg Kroah-Hartman
@ 2013-10-01 22:19                                         ` Michael H. Warfield
  2 siblings, 0 replies; 48+ messages in thread
From: Michael H. Warfield @ 2013-10-01 22:19 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Kay Sievers, Linux Containers, lxc-devel, Andy Lutomirski,
	Greg Kroah-Hartman, Stephane Graber, devel



On Tue, 2013-10-01 at 12:51 -0700, Eric W. Biederman wrote: 
> "Serge E. Hallyn" <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> writes:
> 
> > Quoting Andy Lutomirski (luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org):
> >> On Tue, Oct 1, 2013 at 7:19 AM, Janne Karhunen <janne.karhunen@gmail.com> wrote:
> >> > On Thu, Sep 26, 2013 at 8:33 AM, Greg Kroah-Hartman
> >> > <gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r@public.gmane.org> wrote:
> >> >
> >> >>> - We can relay a call of /sbin/hotplug from outside of a container
> >> >>>   to inside of a container based on policy.
> >> >>>   (But no one uses /sbin/hotplug anymore).
> >> >>
> >> >> That's right, they should be listening to libudev events, so why can't
> >> >> your daemon shuffle them off to the proper container, all in userspace?
> >> >
> >> > Which reminds me, one potential reason being..
> >> > http://lists.linuxfoundation.org/pipermail/containers/2013-May/032591.html
> >> >
> >> 
> >> Can't the daemon live outside the container and shuffle stuff in?
> >
> > That's exactly what Michael Warfield is suggesting, fwiw.

> Michael Warfield's example of dynamically assigning serial ports to
> containers is a pretty good test case.  Serial ports are extremely well
> known kernel objects whose evolution effectively stopped long ago.  When
> we need it we have ptys to virtualize serial ports, but in
> general unprivileged users are safe to directly use a serial port
> device.

> Glossing over the details.  The general problem is some policy exists
> outside of the container that decides if and when a container gets a
> serial port and stuffs it in.

Actually, I don't necessarily see that as a problem as much as a
necessity.  If a container can decide when it gets a serial port or
other device, I would think that would constitute a security issue and
container isolation violation.  Restricting which container can have
access to what has to be determined in the host and, once you've drunk
that koolaid, you might as well stuff it in somewhere.  Policy has to be
in the host or you will never get the security corner cases right.

Ultimately, it is the host which is in charge of the hardware and is
managing the containers (it can start them up, shut them down, or manage
them) so, at its base level, it is the responsibility of the host to
manage those devices between the containers.

That being said, there is the additional issue of, what does the
container do when we hand it a device and how do we let it know.  That's
now classically the issue of udev and formerly hotplug and their
predecessors...

> The expectation is that system containers will then run the udev
> rules and send the libuevent event.  

Which makes sense.  Something along the line of a socket into the
container to send selected events from the user space daemon in the host
would make some sense there.

> To make that all work without kernel modifications requires placing
> a faux-udev in the container, that listens for a device assignment from
> outside the container and then does exactly what udev would have done.

> The problems with this that I see are:

> - udev is a moving target making it hard to build a faux-udev that will
>   work everywhere.

Well, it is and it isn't.  Yeah the rules have been changing (I'm getting
tired of the "deprecated" rule warnings) but I've seen worse, much
worse.

> - On distros running systemd, udev integration is sufficiently tight
>   that I am not certain a faux-udev is possible or will continue to be
>   possible.

Actually, I think that's a non-issue.  IIRC, systemd (now) discontinues
its udev operation when it detects it's in a container.  That was at the
heart of the entire Fedora 15/16 in a container meltdown with the broken
versions of systemd trying to run udev in the container.  What do we do
in place of it?  I don't know.

> - There are two other widely deployed solutions for managing hotplug
>   devices besides udev.

> So given these difficulties I do not believe that the evolution of linux
> device management is done, and I believe that patches to udev, the kernel
> or both will be needed.  While it would be good for testing and
> understanding the problem I don't think a faux-udev will be a long term
> maintainable solution.

> I also understand the point that we aren't talking patches yet and just
> discussing ideas.  Right now it is my hope that if we talk this out we
> can figure out a general direction that has a hope of working.

> From where I am standing faking uevents instead of replacing
> udev/mdev/whatever looks simpler and more maintainable.

> Eric

Mike
-- 
Michael H. Warfield (AI4NB) | (770) 985-6132 |  mhw-BetbSzk+GohWk0Htik3J/w@public.gmane.org
   /\/\|=mhw=|\/\/          | (678) 463-0932 |  http://www.wittsend.com/mhw/
   NIC whois: MHW9          | An optimist believes we live in the best of all
 PGP Key: 0x674627FF        | possible worlds.  A pessimist is sure of it!



* Re: Device Namespaces
       [not found]                                       ` <87had0wz07.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
  2013-10-01 20:46                                         ` Serge Hallyn
@ 2013-10-01 20:57                                         ` Greg Kroah-Hartman
       [not found]                                           ` <20131001205718.GA17036-U8xfFu+wG4EAvxtiuMwx3w@public.gmane.org>
  2013-10-01 22:19                                         ` Michael H. Warfield
  2 siblings, 1 reply; 48+ messages in thread
From: Greg Kroah-Hartman @ 2013-10-01 20:57 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linux Containers, Kay Sievers, Andy Lutomirski, devel, lxc-devel,
	mhw, Stephane Graber

On Tue, Oct 01, 2013 at 12:51:36PM -0700, Eric W. Biederman wrote:
> "Serge E. Hallyn" <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> writes:
> 
> > Quoting Andy Lutomirski (luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org):
> >> On Tue, Oct 1, 2013 at 7:19 AM, Janne Karhunen <janne.karhunen-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> >> > On Thu, Sep 26, 2013 at 8:33 AM, Greg Kroah-Hartman
> >> > <gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r@public.gmane.org> wrote:
> >> >
> >> >>> - We can relay a call of /sbin/hotplug from outside of a container
> >> >>>   to inside of a container based on policy.
> >> >>>   (But no one uses /sbin/hotplug anymore).
> >> >>
> >> >> That's right, they should be listening to libudev events, so why can't
> >> >> your daemon shuffle them off to the proper container, all in userspace?
> >> >
> >> > Which reminds me, one potential reason being..
> >> > http://lists.linuxfoundation.org/pipermail/containers/2013-May/032591.html
> >> >
> >> 
> >> Can't the daemon live outside the container and shuffle stuff in?
> >
> > That's exactly what Michael Warfield is suggesting, fwiw.
> 
> Michael Warfield's example of dynamically assigning serial ports to
> containers is a pretty good test case.  Serial ports are extremely well
> known kernel objects whose evolution effectively stopped long ago.  When
> we need it we have ptys to virtualize serial ports, but in
> general unprivileged users are safe to directly use a serial port
> device.
> 
> Glossing over the details.  The general problem is some policy exists
> outside of the container that decides if and when a container gets a
> serial port and stuffs it in.
> 
> The expectation is that system containers will then run the udev
> rules and send the libuevent event.  
> 
> To make that all work without kernel modifications requires placing
> a faux-udev in the container, that listens for a device assignment from
> outside the container and then does exactly what udev would have done.
> 
> The problems with this that I see are:
> 
> - udev is a moving target making it hard to build a faux-udev that will
>   work everywhere.

How is udev a moving target?  Use libudev and all should be fine, that's
an ABI you can rely on, right?

Or, if you don't like/want udev, use mdev in your container.  Or
something else, what does this have to do with the kernel?

> - On distros running systemd, the udev integration is sufficiently tight
>   that I am not certain a faux-udev is possible or will continue to be
>   possible.

That's not a kernel issue, that's a "ouch, this is hard, let's give up"
issue.

Or perhaps it is a "maybe I shouldn't even be trying to do this" type
issue... :)

> - There are two other widely deployed solutions for managing hotplug
>   devices besides udev.

I know of mdev, what's the other one?  The hacked-up shell script that
Android uses?  Or something else?

> So given these difficulties I do not believe that the evolution of linux
> device management is done, and I believe that patches to udev, the
> kernel, or both will be needed.  While it would be good for testing and
> understanding the problem, I don't think a faux-udev will be a long term
> maintainable solution.

You are saying that for some reason you feel helpless with the way
userspace is going, so we have to change the kernel.  That's horrible,
and is not going to be a reason I accept to change the kernel, sorry.

> I also understand the point that we aren't talking patches yet and just
> discussing ideas.  Right now it is my hope that if we talk this out we
> can figure out a general direction that has a hope of working.
> 
> From where I am standing faking uevents instead of replacing
> udev/mdev/whatever looks simpler and more maintainable.

Have you really looked into this?  Numerous people, who understand this
code path and userspace issues, have said it is not a good idea at all.

But hey, what do I know...

I still have yet to see a reason why you can't use libudev today for
something like this.

Anyway, I'm done discussing this as it's pointless this early.  I'm going
to refrain from any more pithy comments until someone posts some code, as
this is just wasting people's time at the moment.

greg k-h

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Device Namespaces
       [not found]                                       ` <87had0wz07.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
@ 2013-10-01 20:46                                         ` Serge Hallyn
  2013-10-02 22:55                                           ` Eric W. Biederman
  2013-10-01 20:57                                         ` Greg Kroah-Hartman
  2013-10-01 22:19                                         ` Michael H. Warfield
  2 siblings, 1 reply; 48+ messages in thread
From: Serge Hallyn @ 2013-10-01 20:46 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Greg Kroah-Hartman, Linux Containers, Kay Sievers,
	Stephane Graber, Andy Lutomirski, lxc-devel, mhw, devel

Quoting Eric W. Biederman (ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org):
> "Serge E. Hallyn" <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> writes:
> 
> > Quoting Andy Lutomirski (luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org):
> >> On Tue, Oct 1, 2013 at 7:19 AM, Janne Karhunen <janne.karhunen-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> >> > On Thu, Sep 26, 2013 at 8:33 AM, Greg Kroah-Hartman
> >> > <gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r@public.gmane.org> wrote:
> >> >
> >> >>> - We can relay a call of /sbin/hotplug from outside of a container
> >> >>>   to inside of a container based on policy.
> >> >>>   (But no one uses /sbin/hotplug anymore).
> >> >>
> >> >> That's right, they should be listening to libudev events, so why can't
> >> >> your daemon shuffle them off to the proper container, all in userspace?
> >> >
> >> > Which reminds me, one potential reason being..
> >> > http://lists.linuxfoundation.org/pipermail/containers/2013-May/032591.html
> >> >
> >> 
> >> Can't the daemon live outside the container and shuffle stuff in?
> >
> > That's exactly what Michael Warfield is suggesting, fwiw.
> 
> Michael Warfield's example of dynamically assigning serial ports to
> containers is a pretty good test case.  Serial ports are extremely well
> known kernel objects whose evolution effectively stopped long ago.  When
> we need to, we have ptys to virtualize serial ports, but in general
> unprivileged users are safe to directly use a serial port device.
> 
> Glossing over the details, the general problem is that some policy
> exists outside of the container that decides if and when a container
> gets a serial port and stuffs it in.
> 
> The expectation is that system containers will then run the udev
> rules and send the libuevent event.  

I thought the suggestion was that udev on the host would be given
container-specific rules, saying "plop this device into /dev/container1/"
(with /dev/container1 being bind-mounted to $container1_rootfs/dev).

-serge

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Device Namespaces
       [not found]                                   ` <20131001175345.GA4145-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
@ 2013-10-01 19:51                                     ` Eric W. Biederman
       [not found]                                       ` <87had0wz07.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 48+ messages in thread
From: Eric W. Biederman @ 2013-10-01 19:51 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Kay Sievers, Linux Containers, lxc-devel, Andy Lutomirski, devel,
	Greg Kroah-Hartman, mhw, Stephane Graber

"Serge E. Hallyn" <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> writes:

> Quoting Andy Lutomirski (luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org):
>> On Tue, Oct 1, 2013 at 7:19 AM, Janne Karhunen <janne.karhunen-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>> > On Thu, Sep 26, 2013 at 8:33 AM, Greg Kroah-Hartman
>> > <gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r@public.gmane.org> wrote:
>> >
>> >>> - We can relay a call of /sbin/hotplug from outside of a container
>> >>>   to inside of a container based on policy.
>> >>>   (But no one uses /sbin/hotplug anymore).
>> >>
>> >> That's right, they should be listening to libudev events, so why can't
>> >> your daemon shuffle them off to the proper container, all in userspace?
>> >
>> > Which reminds me, one potential reason being..
>> > http://lists.linuxfoundation.org/pipermail/containers/2013-May/032591.html
>> >
>> 
>> Can't the daemon live outside the container and shuffle stuff in?
>
> That's exactly what Michael Warfield is suggesting, fwiw.

Michael Warfield's example of dynamically assigning serial ports to
containers is a pretty good test case.  Serial ports are extremely well
known kernel objects whose evolution effectively stopped long ago.  When
we need to, we have ptys to virtualize serial ports, but in general
unprivileged users are safe to directly use a serial port device.

Glossing over the details, the general problem is that some policy
exists outside of the container that decides if and when a container
gets a serial port and stuffs it in.

The expectation is that system containers will then run the udev
rules and send the libuevent event.  

To make that all work without kernel modifications requires placing
a faux-udev in the container, that listens for a device assignment from
outside the container and then does exactly what udev would have done.

The problems with this that I see are:

- udev is a moving target making it hard to build a faux-udev that will
  work everywhere.

- On distros running systemd, the udev integration is sufficiently tight
  that I am not certain a faux-udev is possible or will continue to be
  possible.

- There are two other widely deployed solutions for managing hotplug
  devices besides udev.

So given these difficulties I do not believe that the evolution of linux
device management is done, and I believe that patches to udev, the
kernel, or both will be needed.  While it would be good for testing and
understanding the problem, I don't think a faux-udev will be a long term
maintainable solution.

I also understand the point that we aren't talking patches yet and just
discussing ideas.  Right now it is my hope that if we talk this out we
can figure out a general direction that has a hope of working.

From where I am standing faking uevents instead of replacing
udev/mdev/whatever looks simpler and more maintainable.
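
As a rough illustration of what relaying a uevent from the host would
involve, here is a minimal sketch in Python.  It assumes the usual kernel
uevent layout (an "ACTION@DEVPATH" header followed by NUL-separated
KEY=VALUE fields).  The device path, the POLICY table and the route()
helper are hypothetical and only stand in for whatever policy lives
outside the container; nothing here is taken from an actual patch.

```python
def parse_uevent(payload: bytes) -> dict:
    """Split a kernel-style uevent message into a dict of its KEY=VALUE fields."""
    header, *fields = payload.rstrip(b"\0").split(b"\0")
    event = {"HEADER": header.decode()}
    for field in fields:
        key, _, value = field.decode().partition("=")
        event[key] = value
    return event


# Hypothetical policy table: which container (if any) should see a device.
POLICY = {"ttyUSB0": "romulus", "ttyUSB1": "remus"}


def route(event: dict, policy: dict = POLICY):
    """Return the container a uevent should be relayed to, or None to drop it."""
    devname = event.get("DEVNAME", "").rsplit("/", 1)[-1]
    return policy.get(devname)


# Illustrative message only; the device path is made up.
sample = (b"add@/devices/pci0000:00/usb1/1-1/ttyUSB0/tty/ttyUSB0\0"
          b"ACTION=add\0SUBSYSTEM=tty\0DEVNAME=ttyUSB0\0MAJOR=188\0MINOR=0\0")
event = parse_uevent(sample)
print(event["ACTION"], route(event))
```

The parsing is the easy part.  The open question in this thread is who
is allowed to inject the relayed event into the container, and how.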

Eric

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Device Namespaces
       [not found]                               ` <CALCETrWWoHzuJcnfEUY+cFpOgT5gnG8U1cVbCW0_8V7Z_v6DJw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2013-10-01 17:53                                 ` Serge E. Hallyn
@ 2013-10-01 18:36                                 ` Janne Karhunen
  1 sibling, 0 replies; 48+ messages in thread
From: Janne Karhunen @ 2013-10-01 18:36 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Greg Kroah-Hartman, Linux Containers, Kay Sievers,
	Stephane Graber, Eric W. Biederman, lxc-devel, mhw, devel

On Tue, Oct 1, 2013 at 8:27 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:

>> http://lists.linuxfoundation.org/pipermail/containers/2013-May/032591.html
>
> Can't the daemon live outside the container and shuffle stuff in?
> IOW, there seems to be little point in containerizing things if you're
> just going to punch a privilege hole in the namespace.

Yeah. I will try to experiment just how much can be 'stuffed
in' without effective caps. It certainly would be better this way.


> FWIW, I think that the capability evolution rules are crap, but
> changing them is a can of worms, and enough people seem to think the
> status quo is acceptable that this is unlikely to ever get fixed.

I have noted (Casey almost tried to strangle me during the
last security summit for even daring to talk about it).


-- 
Janne

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Device Namespaces
       [not found]                               ` <20131001173342.GA19267-U8xfFu+wG4EAvxtiuMwx3w@public.gmane.org>
@ 2013-10-01 18:23                                 ` Janne Karhunen
  0 siblings, 0 replies; 48+ messages in thread
From: Janne Karhunen @ 2013-10-01 18:23 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Linux Containers, Kay Sievers, Andy Lutomirski, devel,
	Eric W. Biederman, lxc-devel, mhw, Stephane Graber

On Tue, Oct 1, 2013 at 8:33 PM, Greg Kroah-Hartman
<gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r@public.gmane.org> wrote:

>> > That's right, they should be listening to libudev events, so why can't
>> > your daemon shuffle them off to the proper container, all in userspace?
>>
>> Which reminds me, one potential reason being..
>> http://lists.linuxfoundation.org/pipermail/containers/2013-May/032591.html
>
> I really wish I had never seen that patch, and I am glad it was
> rejected.

Thanks, I agree. Just wanted to point out the reason and
bring up the discussion.


-- 
Janne

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Device Namespaces
       [not found]                               ` <CALCETrWWoHzuJcnfEUY+cFpOgT5gnG8U1cVbCW0_8V7Z_v6DJw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2013-10-01 17:53                                 ` Serge E. Hallyn
       [not found]                                   ` <20131001175345.GA4145-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
  2013-10-01 18:36                                 ` Janne Karhunen
  1 sibling, 1 reply; 48+ messages in thread
From: Serge E. Hallyn @ 2013-10-01 17:53 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Kay Sievers, Linux Containers, lxc-devel, Stephane Graber,
	Eric W. Biederman, Greg Kroah-Hartman, mhw, devel

Quoting Andy Lutomirski (luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org):
> On Tue, Oct 1, 2013 at 7:19 AM, Janne Karhunen <janne.karhunen-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> > On Thu, Sep 26, 2013 at 8:33 AM, Greg Kroah-Hartman
> > <gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r@public.gmane.org> wrote:
> >
> >>> - We can relay a call of /sbin/hotplug from outside of a container
> >>>   to inside of a container based on policy.
> >>>   (But no one uses /sbin/hotplug anymore).
> >>
> >> That's right, they should be listening to libudev events, so why can't
> >> your daemon shuffle them off to the proper container, all in userspace?
> >
> > Which reminds me, one potential reason being..
> > http://lists.linuxfoundation.org/pipermail/containers/2013-May/032591.html
> >
> 
> Can't the daemon live outside the container and shuffle stuff in?

That's exactly what Michael Warfield is suggesting, fwiw.

-serge

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Device Namespaces
       [not found]                           ` <CAE=NcrYV2RiMV7PcwEjFGFRBrz9XdZGs86Wau2a+6xpYN2aEHA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2013-10-01 17:27                             ` Andy Lutomirski
@ 2013-10-01 17:33                             ` Greg Kroah-Hartman
       [not found]                               ` <20131001173342.GA19267-U8xfFu+wG4EAvxtiuMwx3w@public.gmane.org>
  1 sibling, 1 reply; 48+ messages in thread
From: Greg Kroah-Hartman @ 2013-10-01 17:33 UTC (permalink / raw)
  To: Janne Karhunen
  Cc: Linux Containers, Kay Sievers, Andy Lutomirski, devel,
	Eric W. Biederman, lxc-devel, mhw, Stephane Graber

On Tue, Oct 01, 2013 at 09:19:58AM +0300, Janne Karhunen wrote:
> On Thu, Sep 26, 2013 at 8:33 AM, Greg Kroah-Hartman
> <gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r@public.gmane.org> wrote:
> 
> >> - We can relay a call of /sbin/hotplug from outside of a container
> >>   to inside of a container based on policy.
> >>   (But no one uses /sbin/hotplug anymore).
> >
> > That's right, they should be listening to libudev events, so why can't
> > your daemon shuffle them off to the proper container, all in userspace?
> 
> Which reminds me, one potential reason being..
> http://lists.linuxfoundation.org/pipermail/containers/2013-May/032591.html

I really wish I had never seen that patch, and I am glad it was
rejected.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Device Namespaces
       [not found]                           ` <CAE=NcrYV2RiMV7PcwEjFGFRBrz9XdZGs86Wau2a+6xpYN2aEHA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2013-10-01 17:27                             ` Andy Lutomirski
       [not found]                               ` <CALCETrWWoHzuJcnfEUY+cFpOgT5gnG8U1cVbCW0_8V7Z_v6DJw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2013-10-01 17:33                             ` Greg Kroah-Hartman
  1 sibling, 1 reply; 48+ messages in thread
From: Andy Lutomirski @ 2013-10-01 17:27 UTC (permalink / raw)
  To: Janne Karhunen
  Cc: Greg Kroah-Hartman, Linux Containers, Kay Sievers,
	Stephane Graber, Eric W. Biederman, lxc-devel, mhw, devel

On Tue, Oct 1, 2013 at 7:19 AM, Janne Karhunen <janne.karhunen-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> On Thu, Sep 26, 2013 at 8:33 AM, Greg Kroah-Hartman
> <gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r@public.gmane.org> wrote:
>
>>> - We can relay a call of /sbin/hotplug from outside of a container
>>>   to inside of a container based on policy.
>>>   (But no one uses /sbin/hotplug anymore).
>>
>> That's right, they should be listening to libudev events, so why can't
>> your daemon shuffle them off to the proper container, all in userspace?
>
> Which reminds me, one potential reason being..
> http://lists.linuxfoundation.org/pipermail/containers/2013-May/032591.html
>

Can't the daemon live outside the container and shuffle stuff in?
IOW, there seems to be little point in containerizing things if you're
just going to punch a privilege hole in the namespace.

FWIW, I think that the capability evolution rules are crap, but
changing them is a can of worms, and enough people seem to think the
status quo is acceptable that this is unlikely to ever get fixed.

--Andy

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Device Namespaces
       [not found]                       ` <20130926053320.GB3725-U8xfFu+wG4EAvxtiuMwx3w@public.gmane.org>
  2013-09-26  8:25                         ` Janne Karhunen
@ 2013-10-01  6:19                         ` Janne Karhunen
       [not found]                           ` <CAE=NcrYV2RiMV7PcwEjFGFRBrz9XdZGs86Wau2a+6xpYN2aEHA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  1 sibling, 1 reply; 48+ messages in thread
From: Janne Karhunen @ 2013-10-01  6:19 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Linux Containers, Kay Sievers, Andy Lutomirski, devel,
	Eric W. Biederman, lxc-devel, mhw, Stephane Graber

On Thu, Sep 26, 2013 at 8:33 AM, Greg Kroah-Hartman
<gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r@public.gmane.org> wrote:

>> - We can relay a call of /sbin/hotplug from outside of a container
>>   to inside of a container based on policy.
>>   (But no one uses /sbin/hotplug anymore).
>
> That's right, they should be listening to libudev events, so why can't
> your daemon shuffle them off to the proper container, all in userspace?

Which reminds me, one potential reason being..
http://lists.linuxfoundation.org/pipermail/containers/2013-May/032591.html


-- 
Janne

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Device Namespaces
       [not found]                                               ` <20130930161117.GA26459-U8xfFu+wG4EAvxtiuMwx3w@public.gmane.org>
@ 2013-09-30 16:33                                                 ` James Bottomley
  0 siblings, 0 replies; 48+ messages in thread
From: James Bottomley @ 2013-09-30 16:33 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Linux Containers, Kay Sievers, Stephane Graber, Andy Lutomirski,
	Eric W. Biederman, lxc-devel, mhw, devel

On Mon, 2013-09-30 at 09:11 -0700, Greg Kroah-Hartman wrote:
> On Mon, Sep 30, 2013 at 08:37:19AM -0700, James Bottomley wrote:
> > On Thu, 2013-09-26 at 10:07 -0700, Greg Kroah-Hartman wrote:
> > > On Thu, Sep 26, 2013 at 08:01:31PM +0300, Janne Karhunen wrote:
> > > > That being said, our wish would be to support any combination of
> > > > OS's and frankly, I'd be slightly annoyed to tell the customer that
> > > > they can't do two Androids or we magically run out of bits.
> > > 
> > > If you want to support "any" combination of operating systems, then use
> > > a hypervisor, that's what they are there for :)
> > 
> > No that's not quite the right way to think about it: The correct
> > statement is only use a hypervisor if you need different kernels.  With
> > Windows, it happens to be true that you need a different kernel for each
> > different OS version.  However; with Linux, thanks to strong ABI
> > backwards compatibility, you mostly don't.  The way OpenVZ works today
> > is that it installs a modified kernel which can then bring up every
> > Linux OS in a separate container.  Our use case is the hosters that give
> > you root login to a virtual private server and allow you to upgrade it
> > on your own.  The reason for using a container rather than a hypervisor
> > is the old density and elasticity one:  3x the density (i.e. 1/3 the
> > overhead cost to the hoster) and the boot only needs to start at init,
> > not bring up of virtual hardware and booting a second kernel.
> 
> I understand that some people really like the idea of using OpenVZ for
> various things like this, but to claim that because of it we need to
> hack up the driver core in the kernel into unimaginable pieces is not
> necessarily something that I'll agree with.

I don't believe I claimed that.  In fact, from 3.9 we can already bring
up an OpenVZ containerised system running different versions of Linux
that you can give someone root access to with no kernel modifications
whatsoever.  The user space solution currently works for us because
we're handing out server VPSs, so the device configuration is fixed as
we init the container.  However, we do have use cases for dynamic
instead of static device configurations, which is why we're
participating in the debate.

> But all of this is just words, I have yet to see any patches for any of
> this, so I'll just wait until that happens before worrying about it...

Well, that's because we're still debating what the best approach is.  If
you want a historical parallel: the comments you make above (hack up
the ... kernel into unimaginable pieces) are an almost exact mirror of
the comments that were made rejecting the in-kernel Checkpoint/Restore
patches at the 2010 Kernel Summit ... yet we have it fully functional
today in a form that proved acceptable.

James

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Device Namespaces
       [not found]                                           ` <1380555439.2161.5.camel-sFMDBYUN5F8GjUHQrlYNx2Wm91YjaHnnhRte9Li2A+AAvxtiuMwx3w@public.gmane.org>
@ 2013-09-30 16:11                                             ` Greg Kroah-Hartman
       [not found]                                               ` <20130930161117.GA26459-U8xfFu+wG4EAvxtiuMwx3w@public.gmane.org>
  0 siblings, 1 reply; 48+ messages in thread
From: Greg Kroah-Hartman @ 2013-09-30 16:11 UTC (permalink / raw)
  To: James Bottomley
  Cc: Linux Containers, Kay Sievers, Stephane Graber, Andy Lutomirski,
	Eric W. Biederman, lxc-devel, mhw, devel

On Mon, Sep 30, 2013 at 08:37:19AM -0700, James Bottomley wrote:
> On Thu, 2013-09-26 at 10:07 -0700, Greg Kroah-Hartman wrote:
> > On Thu, Sep 26, 2013 at 08:01:31PM +0300, Janne Karhunen wrote:
> > > That being said, our wish would be to support any combination of
> > > OS's and frankly, I'd be slightly annoyed to tell the customer that
> > > they can't do two Androids or we magically run out of bits.
> > 
> > If you want to support "any" combination of operating systems, then use
> > a hypervisor, that's what they are there for :)
> 
> No that's not quite the right way to think about it: The correct
> statement is only use a hypervisor if you need different kernels.  With
> Windows, it happens to be true that you need a different kernel for each
> different OS version.  However; with Linux, thanks to strong ABI
> backwards compatibility, you mostly don't.  The way OpenVZ works today
> is that it installs a modified kernel which can then bring up every
> Linux OS in a separate container.  Our use case is the hosters that give
> you root login to a virtual private server and allow you to upgrade it
> on your own.  The reason for using a container rather than a hypervisor
> is the old density and elasticity one:  3x the density (i.e. 1/3 the
> overhead cost to the hoster) and the boot only needs to start at init,
> not bring up of virtual hardware and booting a second kernel.

I understand that some people really like the idea of using OpenVZ for
various things like this, but to claim that because of it we need to
hack up the driver core in the kernel into unimaginable pieces is not
necessarily something that I'll agree with.

But all of this is just words, I have yet to see any patches for any of
this, so I'll just wait until that happens before worrying about it...

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Device Namespaces
       [not found]                                       ` <20130926170757.GA9345-U8xfFu+wG4EAvxtiuMwx3w@public.gmane.org>
  2013-09-26 17:56                                         ` Janne Karhunen
@ 2013-09-30 15:37                                         ` James Bottomley
       [not found]                                           ` <1380555439.2161.5.camel-sFMDBYUN5F8GjUHQrlYNx2Wm91YjaHnnhRte9Li2A+AAvxtiuMwx3w@public.gmane.org>
  1 sibling, 1 reply; 48+ messages in thread
From: James Bottomley @ 2013-09-30 15:37 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Linux Containers, Kay Sievers, Stephane Graber, Andy Lutomirski,
	Eric W. Biederman, lxc-devel, mhw, devel

On Thu, 2013-09-26 at 10:07 -0700, Greg Kroah-Hartman wrote:
> On Thu, Sep 26, 2013 at 08:01:31PM +0300, Janne Karhunen wrote:
> > That being said, our wish would be to support any combination of
> > OS's and frankly, I'd be slightly annoyed to tell the customer that
> > they can't do two Androids or we magically run out of bits.
> 
> If you want to support "any" combination of operating systems, then use
> a hypervisor, that's what they are there for :)

No, that's not quite the right way to think about it: The correct
statement is only use a hypervisor if you need different kernels.  With
Windows, it happens to be true that you need a different kernel for each
different OS version.  However, with Linux, thanks to strong ABI
backwards compatibility, you mostly don't.  The way OpenVZ works today
is that it installs a modified kernel which can then bring up every
Linux OS in a separate container.  Our use case is the hosters that give
you root login to a virtual private server and allow you to upgrade it
on your own.  The reason for using a container rather than a hypervisor
is the old density and elasticity one:  3x the density (i.e. 1/3 the
overhead cost to the hoster) and the boot only needs to start at init,
not bring up virtual hardware and boot a second kernel.

James

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Device Namespaces
       [not found]     ` <20130929200620.GA31304-U8xfFu+wG4EAvxtiuMwx3w@public.gmane.org>
@ 2013-09-30 15:36       ` Michael H. Warfield
  0 siblings, 0 replies; 48+ messages in thread
From: Michael H. Warfield @ 2013-09-30 15:36 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Linux Containers, Kay Sievers, Andy Lutomirski,
	Eric W. Biederman, lxc-devel, Stephane Graber,
	devel-GEFAQzZX7r8dnm+yROfE0A


[-- Attachment #1.1: Type: text/plain, Size: 10741 bytes --]

On Sun, 2013-09-29 at 13:06 -0700, Greg Kroah-Hartman wrote: 
> On Sun, Sep 29, 2013 at 10:28:55PM +0300, Amir Goldstein wrote:
> > 
> > On Thu, Sep 26, 2013 at 8:33 AM, Greg Kroah-Hartman
> > <gregkh@linuxfoundation.org> wrote:
> > 
> >     On Wed, Sep 25, 2013 at 02:34:54PM -0700, Eric W. Biederman wrote:
> >     > So the big issues for a device namespace to solve are filtering which
> >     > devices a container has access to and being able to dynamically change
> >     > which devices those are at run time (aka hotplug).
> > 
> >     As _all_ devices are hotpluggable now (look, there's no CONFIG_HOTPLUG
> >     anymore, because it was redundant), I think you need to really think
> >     this through better (pci, memory, cpus, etc.) before you do anything in
> >     the kernel.
> > 
> >     > After having thought about this for a bit I don't know if a pure
> >     > userspace solution is sufficient or actually a good idea.
> >     >
> >     > - We can manually manage a tmpfs with device nodes in userspace.
> >     >   (But that is deprecated functionality in the mainstream kernel).
> > 
> >     Yes, but I'm not going to namespace devtmpfs, as that is going to be an
> >     impossible task, right?
> > 
> > 
> > That sounds like a challenge ;-)
> > Seriously, as Serge correctly noted, it would not be that different from devpts
> > if you start from an empty devtmpfs and populate it with devices that are
> > "added in the context of that namespace".  The semantics in which
> > devices are "added in the context of a namespace" is the missing piece
> > of the puzzle.

> And the fact that these devices are almost all created before userspace
> starts up, is a non-trivial "piece of the puzzle" :)

That's putting it mildly.  As I said in the Containers session at Linux
Plumbers, I agree with you (wrt device namespaces), but we do have (a)
problem(s) to solve.  The more I've thought on this, the more I agree
with you and that there's got to be a better way.

I'm not going to address the Android use case issues here which Janne
raised (which are very valid), since I've got other fish to fry and I
haven't even begun to look at the complexities of Android in an LXC
container on a non-android host, much less Android on Android or other
on Android.  This may have some applicability to the Android case, I
just haven't thought it through yet.  Anything on a common kernel should
work and standard distributions seem to be no problem now, but Android
is a rather unique beast, to say the least.

I will disagree with you on one point, though, from that session.  When
I mentioned both persistent and dynamic devices, you said they were
mutually exclusive.  It may be a difference in semantics or terminology
but I would beg to differ there, so I'll explain that too...

In my "worst case, real world, right now" scenario of the USB sharing
device and multiple USB serial adapters for serial consoles, I have
several different issues that are illustrative of several problems I'm
trying to overcome.

With this sharing device, you get a "/dev/usbshare" HID device on all
the connected hosts which do NOT have the USB bus that's being shared.
The device that has control of the bus does NOT see the /dev/usbshare
device but does see all the USB devices (the serial port adapters
- /dev/ttyUSB* - in this case) which are connected to it.

So, when you switch the sharing from system A to system B, all the
shared serial devices disappear from A and the /dev/usbshare device
appears, while the usbshare device disappears from system B and all the
usb serial devices appear together.  Either system may (and does) have
other static usb serial devices attached so the numbering and order
of /dev/ttyUSB* may vary and can even change depending on whether a host
had been booted with the usb bus shared to it or not.

Ok...  That's the "dynamic" devices I was referring to.  They come and
go and may have differing names under differing circumstances.  Very
real world dynamic.

Now...  For consistency, I have udev rules that map those serial devices
to other names, based on their device USB serial numbers.  That naming
convention remains persistent on that system as the devices come and go
and remains consistent between the systems with those rules.
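
To make the "persistent names on dynamic devices" idea concrete, here is
a minimal sketch in Python of the lookup such rules encode.  The serial
numbers, the stable names and the persistent_links() helper are all
hypothetical; the real mapping is done by udev rules, not by code like
this.

```python
# Hypothetical serial-number -> stable-name table (what the udev rules encode).
STABLE_NAMES = {"A1001": "console-romulus", "A1002": "console-remus"}


def persistent_links(enumerated: dict, names: dict = STABLE_NAMES) -> dict:
    """Map each stable name to whatever kernel name the device has right now.

    `enumerated` maps the current kernel name (e.g. "ttyUSB3") to the
    adapter's USB serial number; unknown serials get no stable name.
    """
    return {names[serial]: kernel
            for kernel, serial in enumerated.items() if serial in names}


# The same two adapters, enumerated in a different order on two occasions.
first = persistent_links({"ttyUSB0": "A1001", "ttyUSB1": "A1002"})
second = persistent_links({"ttyUSB0": "B9999", "ttyUSB1": "A1002",
                           "ttyUSB2": "A1001"})
print(first)
print(second)
```

The stable names are identical in both runs even though the ttyUSB
numbering moved, which is the property that has to survive when the
devices are handed to a container.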

So that's my "dynamic" with "persistent" devices.  I have persistent
names on dynamic devices.  Perhaps I could have chosen my terminology
better but, that's what I was arguing for in that Plumbers session when
I used those terms.

Now, for the complications...  If I wish to (and I most certainly do)
divvy up these serial devices between containers, I have several things
which need to be managed.

The /dev/usbshare device needs to be mapped to ALL containers which may
wish to request the shared bus (plus the host).  It's generally only a
very momentary device access and collisions would be extremely rare and
non-harmful in any case (two containers both wanting the bus on the same
host - shrug...).  It's actually far less confusing and difficult than
merely the collisions and contention between systems, and that's been
easily manageable, given the rarity of cross serial console access (the
real world use case).

The /dev/ttyUSB* devices need to be mapped to their specific containers
with or without removing them from the host and possibly allowing for
multiple containers.  Device access is easily managed by the device
driver for multiple access (EBUSY) and not a problem.  This could be
more complicated if, for example, we were talking about USB drives, loop
devices, or other devices which multiple access, but that's another
layer of complication.

The "persistent" udev symlinks also need to be mapped to the containers.
I think I can do this equally well in the host as the real devices...

> Good luck,

I'm scratching on an idea that started forming just after that session.
I told Serge that "I think I can do it and it will (should) suck less."
Basically, it exploits some of the properties of devtmpfs to accomplish
some of our goals.

You're right about the user space problem.  Something needs to manage
the devices in a coherent manner as devices come and go and as
containers come and go in an asynchronous manner.  In my mind, the only
place for that is in the host.  "Non trivial" is a jaw dropping
understatement and I can see where you feel it would be impossible to
manage in applying namespaces to devtmpfs.  That leaves the user space
in the host.  I can see where it would be intractable in the kernel.

I may get beat mercilessly for suggesting this but, just as with
cgroups, if we create a subdirectory in devtmpfs for the subsystem (LXC) and
container, we can then bind mount that subtree off of devtmpfs to the
container and then the host can map and manipulate the device subtree
into the container (even if the container is denied mknod capability).
That leaves the host to manage all the devices, which actually makes a
LOT of sense (to me) since it should be responsible for the devices and
the overall kernel operations.  That would be no different than needing
to configure device passthroughs for KVM / VirtualBox / VMware
hypervisors.

Example...  In the host I would have something like this...

/dev/lxc/
    romulus
    remus
    gemini
    janus

And then bind mount each of those subdirectories
to /var/lib/lxc/${Container}/rootfs/dev directory.  Then map the devices
from the host /dev to the container /dev with mknod in the host and
relative symlinks.
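A minimal sketch of the host-side bookkeeping for one device, assuming
the /dev/lxc/<container> layout above.  The function name and the
example major/minor are illustrative only, and nothing is executed: the
function just returns the steps a privileged helper on the host would
carry out.

```python
import posixpath

def plan_mapping(container, devname, major, minor, alias=None, base="/dev/lxc"):
    """Return the host-side steps to expose one char device to a container.

    The result is a list of tuples describing a mknod and, optionally, a
    relative symlink.  Nothing is created here.
    """
    node = posixpath.join(base, container, devname)
    steps = [("mknod", node, "c", major, minor)]
    if alias:
        link = posixpath.join(base, container, alias)
        # Relative target, so the link still resolves after the subtree is
        # bind mounted onto the container's /dev.
        target = posixpath.relpath(node, posixpath.dirname(link))
        steps.append(("symlink", target, link))
    return steps
```

For example, plan_mapping("romulus", "ttyUSB0", 188, 0, alias="console/main")
yields a mknod for /dev/lxc/romulus/ttyUSB0 and a symlink at
/dev/lxc/romulus/console/main pointing at ../ttyUSB0.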

That also (I think) helps me deal with some of the (mis)behavior of
systemd where it contains unconfigurable behavior (mounting devtmpfs)
controlled by "magic cookies" (/dev mounted on another major/minor
from / to disable it mounting devtmpfs).  I initially recoiled in horror
at the thought of overloading the devtmpfs subtree with container based
subdirectories, devices, and symlinks but the idea grew on me that this
might be better than what we're dealing with now of mounting tmpfs on
the /dev mount point in all these containers and then having to
populate them just to prevent systemd from creating collisions with
devtmpfs and the resulting violation of the container isolation.

It DOES still leave the problem of dealing with udev rules in the
container and subsidiary device symlinks in the container which may not
correspond to the rules in the host.  That's still a problem in my mind
(but already present and minuscule compared to what we would be solving).  I
could pattern match everything coming out of udev in a trigger and map
devices and symlinks into the new subtree in the host but I have no way
to manage propagating the rules in the container down into the processor
in the host or a way to trigger those udev rules in the containers.
Suggestions there might be nice (as well as the cat calls).  I'm not
sure I have it clear in my head yet how I would deal with bringing up a
container and then mapping all the required existing devices over to it.
That's your user space problem in a nutshell.  That's easy to handle
with udev as things come and go but, when the user space comes after and
udev isn't processing triggers, how do I handle the mappings?  That's
also non-trivial in my mind.
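One piece of that start-up problem, finding out which device nodes
already exist when a container comes up after the fact, could be
sketched like this.  It is a sketch only: the function name is made up,
it looks at a single directory level, and deciding which of the nodes
belong to which container is still left to host policy.

```python
import os
import stat

def existing_device_nodes(dev_root="/dev"):
    """List (path, type, major, minor) for device nodes directly under dev_root."""
    found = []
    for entry in sorted(os.listdir(dev_root)):
        path = os.path.join(dev_root, entry)
        st = os.lstat(path)  # lstat: look at the entry itself, not a link target
        if stat.S_ISCHR(st.st_mode):
            kind = "c"
        elif stat.S_ISBLK(st.st_mode):
            kind = "b"
        else:
            continue  # directories, symlinks, sockets and regular files are skipped
        found.append((path, kind, os.major(st.st_rdev), os.minor(st.st_rdev)))
    return found
```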

Device creation would seem to be pretty trivial.  Device removal, not so
much.  If I create another node on devtmpfs and that major/minor gets
removed, will it also get removed?  I also have to remove the symlinks.
The removal process just feels more complicated in my mind.
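The symlink half of that cleanup can at least be sketched.  Assuming the
node itself has already gone away, whether by the kernel or by the host
removing it, the following walks a container subtree and reports
symlinks whose targets no longer exist.  The function name is
hypothetical and the sketch does not answer the question of who removes
the extra node.

```python
import os

def dangling_symlinks(subtree):
    """Return the symlinks under subtree whose targets no longer exist."""
    stale = []
    for dirpath, dirnames, filenames in os.walk(subtree):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # islink() inspects the link itself; exists() follows it.
            if os.path.islink(path) and not os.path.exists(path):
                stale.append(path)
    return sorted(stale)
```

The caller would then unlink whatever is returned, for example from a
udev "remove" trigger in the host.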

Greg, I think you are absolutely right, this needs to be managed in user
space and not in kernel space and we do have the tools to do it.  I
think I can do some of it in a way that will suck less compared to how
we're (LXC is) doing it now.  I'm just not so sure how comprehensive the
solution will be or how well it will work.

I've still got several other takeaways from that session to put a bow on
before really testing this idea further.  I really have not fully
fleshed this idea out and it's going to take me some time.  There may
also be some other corner cases I haven't considered.  And then there's
Android.  Sigh...

And maybe I'm just totally off base and crazy.  Wouldn't be the first
time, won't be the last time.

> greg k-h

Regards,
Mike
-- 
Michael H. Warfield (AI4NB) | (770) 985-6132 |  mhw-BetbSzk+GohWk0Htik3J/w@public.gmane.org
   /\/\|=mhw=|\/\/          | (678) 463-0932 |  http://www.wittsend.com/mhw/
   NIC whois: MHW9          | An optimist believes we live in the best of all
 PGP Key: 0x674627FF        | possible worlds.  A pessimist is sure of it!

[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 482 bytes --]

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Device Namespaces
       [not found] ` <CAA2m6veny-7_ONMA973Wu36U4kz4gAuw0dpodkb8+GZDv6VNBQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2013-09-29 20:06   ` Greg Kroah-Hartman
       [not found]     ` <20130929200620.GA31304-U8xfFu+wG4EAvxtiuMwx3w@public.gmane.org>
  2013-10-03  0:44   ` Eric W. Biederman
  1 sibling, 1 reply; 48+ messages in thread
From: Greg Kroah-Hartman @ 2013-09-29 20:06 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Linux Containers, Kay Sievers, Andy Lutomirski,
	devel-GEFAQzZX7r8dnm+yROfE0A, Eric W. Biederman, lxc-devel,
	mhw-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Stephane Graber

On Sun, Sep 29, 2013 at 10:28:55PM +0300, Amir Goldstein wrote:
> 
> 
> 
> On Thu, Sep 26, 2013 at 8:33 AM, Greg Kroah-Hartman
> <gregkh@linuxfoundation.org> wrote:
> 
>     On Wed, Sep 25, 2013 at 02:34:54PM -0700, Eric W. Biederman wrote:
>     > So the big issues for a device namespace to solve are filtering which
>     > devices a container has access to and being able to dynamically change
>     > which devices those are at run time (aka hotplug).
> 
>     As _all_ devices are hotpluggable now (look, there's no CONFIG_HOTPLUG
>     anymore, because it was redundant), I think you need to really think
>     this through better (pci, memory, cpus, etc.) before you do anything in
>     the kernel.
> 
>     > After having thought about this for a bit I don't know if a pure
>     > userspace solution is sufficient or actually a good idea.
>     >
>     > - We can manually manage a tmpfs with device nodes in userspace.
>     >   (But that is deprecated functionality in the mainstream kernel).
> 
>     Yes, but I'm not going to namespace devtmpfs, as that is going to be an
>     impossible task, right?
> 
> 
> That sounds like a challenge ;-)
> Seriously, as Serge correctly noted, it would not be that different from devpts
> if you start from an empty devtmpfs and populate it with devices that are
> "added in the context of that namespace".  The semantics in which
> devices are "added in the context of a namespace" is the missing piece
> of the puzzle.

And the fact that these devices are almost all created before userspace
starts up, is a non-trivial "piece of the puzzle" :)

Good luck,

greg k-h

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Device Namespaces
@ 2013-09-29 19:28 Amir Goldstein
       [not found] ` <CAA2m6veny-7_ONMA973Wu36U4kz4gAuw0dpodkb8+GZDv6VNBQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 48+ messages in thread
From: Amir Goldstein @ 2013-09-29 19:28 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Linux Containers, Kay Sievers, Andy Lutomirski,
	devel-GEFAQzZX7r8dnm+yROfE0A, Eric W. Biederman, lxc-devel,
	mhw-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Stephane Graber

On Thu, Sep 26, 2013 at 8:33 AM, Greg Kroah-Hartman
<gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r@public.gmane.org> wrote:

> On Wed, Sep 25, 2013 at 02:34:54PM -0700, Eric W. Biederman wrote:
> > So the big issues for a device namespace to solve are filtering which
> > devices a container has access to and being able to dynamically change
> > which devices those are at run time (aka hotplug).
>
> As _all_ devices are hotpluggable now (look, there's no CONFIG_HOTPLUG
> anymore, because it was redundant), I think you need to really think
> this through better (pci, memory, cpus, etc.) before you do anything in
> the kernel.
>
> > After having thought about this for a bit I don't know if a pure
> > userspace solution is sufficient or actually a good idea.
> >
> > - We can manually manage a tmpfs with device nodes in userspace.
> >   (But that is deprecated functionality in the mainstream kernel).
>
> Yes, but I'm not going to namespace devtmpfs, as that is going to be an
> impossible task, right?
>

That sounds like a challenge ;-)
Seriously, as Serge correctly noted, it would not be that different from
devpts if you start from an empty devtmpfs and populate it with devices
that are "added in the context of that namespace".
The semantics in which devices are "added in the context of a namespace"
is the missing piece of the puzzle.

What we would really like to see is a setns() style API that can be used to
add a device in the context of a namespace in either a "shared" or "private"
mode.
This kind of API is a required building block for us to write device drivers
that are namespace aware in a way that gives userspace enough
flexibility for dynamic configuration.

We are trying to come up with a proposal for that sort of API.
When we have something decent, we shall post it.


> And remember, udev doesn't create device nodes anymore...
>
> > - We can manually export a subset of sysfs with bind mounts.
> >   (But that feels hacky, and is essentially incompatible with hotplug).
>
> True.
>
> > - We can relay a call of /sbin/hotplug from outside of a container
> >   to inside of a container based on policy.
> >   (But no one uses /sbin/hotplug anymore).
>
> That's right, they should be listening to libudev events, so why can't
> your daemon shuffle them off to the proper container, all in userspace?
>
> > - There is no way to fake netlink uevents for a container to see them.
> >   (The best we could do is replace udev everywhere with something that
> >    listens on a unix domain socket).
>
> You shouldn't need to do this.
>
> > - It would be nice to replace the device cgroup with a comprehensive
> >   solution that really works. (Among other things the device cgroup
> >   does not work in terms of struct device the underlying kernel
> >   abstraction for devices).
>
> I didn't even know there was a device cgroup.
>
> Which means that if there is one, odds are it's useless.
>
> > We must manage sysfs entries as well as device nodes because:
> > - Seeing more than we should has the real potential to confuse
> >   userspace, especially a userspace that replays uevents.
>
> You should never replay uevents.  If you don't do that, why can't you
> see all of sysfs?
>
> > - Some device control must happen through writing to sysfs files and
> >   if we don't remove all root privileges from a container only by
> >   exporting a subset of sysfs to that container can we limit which
> >   sysfs nodes can be written to.
>
> But you have the issue of controlling devices in a "shared" way, which
> isn't going to be usable for almost all devices.
>
> > The current kernel tagged sysfs entry support does not look like a good
> > match for implementing device filtering.  The common case will
> > be allowing devices like /dev/zero, and /dev/null that live in
> > /sys/devices/virtual and are the devices we are most likely to care
> > about.  Those devices need to live in multiple device namespaces so
> > everyone can use them.  Perhaps exclusive assignment will be the more
> > common paradigm for device namespaces like it is for network devices in
> > the network namespace but from what little I can see of this problem right
> > now I don't think so.
> >
> > I definitely think we should hold off on a kernel level implementation
> > until we really understand the issues and are ready to implement device
> > namespaces correctly.
>
> I agree, especially as I don't think this will ever work.
>
> > A userspace implementation looks like it can only do about 95% of what
> > is really needed, but at the same time looks like an easy way to
> > experiment until the problem is sufficiently well understood.
>
> 95% is probably way better than what you have today, and will fit the
> needs of almost everyone today, so why not do it?
>
> I'd argue that those last 5% either are custom solutions that never get
> merged, or candidates for true virtualization.
>
> > In summary the situation with device hotplug and containers sucks today,
> > and we need to do something.  Running a linux desktop in a container is
> > a reasonably good example use case.
>
> No it isn't.  I'd argue that this is a horrible use case, one that you
> shouldn't do.  Why not just use multi-head machines like people do who
> really want to do this, relying on user separation?  That's a workable
> solution that is quite common and works very well today.
>
> > Having one standard common maintainable implementation would be very
> > useful and the most logical place for that would be in the kernel.
> > For now we should focus on simple device filtering and hotplug.
>
> Just listen for libudev stuff, don't try to filter them, or ever
> "replay" them, that way lies madness, and lots of nasty race conditions
> that are guaranteed to break things.
>
> good luck,
>
> greg k-h
>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Device Namespaces
       [not found]                                       ` <20130926170757.GA9345-U8xfFu+wG4EAvxtiuMwx3w@public.gmane.org>
@ 2013-09-26 17:56                                         ` Janne Karhunen
  2013-09-30 15:37                                         ` James Bottomley
  1 sibling, 0 replies; 48+ messages in thread
From: Janne Karhunen @ 2013-09-26 17:56 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Linux Containers, Kay Sievers, Andy Lutomirski, devel,
	Eric W. Biederman, lxc-devel, mhw, Stephane Graber

On Thu, Sep 26, 2013 at 8:07 PM, Greg Kroah-Hartman
<gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r@public.gmane.org> wrote:

>> That being said, our wish would be to support any combination of
>> OS's and frankly, I'd be slightly annoyed to tell the customer that
>> they can't do two Androids or we magically run out of bits.
>
> If you want to support "any" combination of operating systems, then use
> a hypervisor, that's what they are there for :)

Only relevant mobile OS's are of interest ;)


-- 
Janne

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Device Namespaces
       [not found]                                   ` <CAE=NcrY3xC1AF_GV2b1KsF7AwYZTuGBuKLS5yBUWoWcmKU4YBg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2013-09-26 17:07                                     ` Greg Kroah-Hartman
       [not found]                                       ` <20130926170757.GA9345-U8xfFu+wG4EAvxtiuMwx3w@public.gmane.org>
  0 siblings, 1 reply; 48+ messages in thread
From: Greg Kroah-Hartman @ 2013-09-26 17:07 UTC (permalink / raw)
  To: Janne Karhunen
  Cc: Linux Containers, Kay Sievers, Andy Lutomirski, devel,
	Eric W. Biederman, lxc-devel, mhw, Stephane Graber

On Thu, Sep 26, 2013 at 08:01:31PM +0300, Janne Karhunen wrote:
> That being said, our wish would be to support any combination of
> OS's and frankly, I'd be slightly annoyed to tell the customer that
> they can't do two Androids or we magically run out of bits.

If you want to support "any" combination of operating systems, then use
a hypervisor, that's what they are there for :)

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Device Namespaces
       [not found]                               ` <20130926135604.GA16624-U8xfFu+wG4EAvxtiuMwx3w@public.gmane.org>
@ 2013-09-26 17:01                                 ` Janne Karhunen
       [not found]                                   ` <CAE=NcrY3xC1AF_GV2b1KsF7AwYZTuGBuKLS5yBUWoWcmKU4YBg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 48+ messages in thread
From: Janne Karhunen @ 2013-09-26 17:01 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Linux Containers, Kay Sievers, Andy Lutomirski, devel,
	Eric W. Biederman, lxc-devel, mhw, Stephane Graber

On Thu, Sep 26, 2013 at 4:56 PM, Greg Kroah-Hartman
<gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r@public.gmane.org> wrote:

>> I suppose so, but now you take the assumption that there is no
>> need for running multiple Linux variants on the same host (say
>> Ubuntu and Android side by side). Is this something you would
>> not like to see done?
>
> You can do that today without any need for device namespaces, so why is
> this an issue here?

I think you misunderstood me. I wasn't so much advocating for the
device namespace part, just the issue at hand (device access
filtering based on which ns happens to be 'active'). We are already
trying to do this in userspace, let's see how that goes.

That being said, our wish would be to support any combination of
OS's and frankly, I'd be slightly annoyed to tell the customer that
they can't do two Androids or we magically run out of bits.


-- 
Janne

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Device Namespaces
       [not found]                           ` <CAE=NcrbPXGWU8FUgwchXyL5HjXf+4AKbgUWGe1ZO=Xcq=iV-Lg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2013-09-26 13:56                             ` Greg Kroah-Hartman
       [not found]                               ` <20130926135604.GA16624-U8xfFu+wG4EAvxtiuMwx3w@public.gmane.org>
  0 siblings, 1 reply; 48+ messages in thread
From: Greg Kroah-Hartman @ 2013-09-26 13:56 UTC (permalink / raw)
  To: Janne Karhunen
  Cc: Linux Containers, Kay Sievers, Andy Lutomirski,
	devel-GEFAQzZX7r8dnm+yROfE0A, Eric W. Biederman, lxc-devel,
	mhw-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Stephane Graber

On Thu, Sep 26, 2013 at 11:25:56AM +0300, Janne Karhunen wrote:
> On Thu, Sep 26, 2013 at 8:33 AM, Greg Kroah-Hartman
> <gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r@public.gmane.org> wrote:
> 
> >> In summary the situation with device hotplug and containers sucks today,
> >> and we need to do something.  Running a linux desktop in a container is
> >> a reasonably good example use case.
> >
> > No it isn't.  I'd argue that this is a horrible use case, one that you
> > shouldn't do.  Why not just use multi-head machines like people do who
> > really want to do this, relying on user separation?  That's a workable
> > solution that is quite common and works very well today.
> 
> I suppose so, but now you take the assumption that there is no
> need for running multiple Linux variants on the same host (say
> Ubuntu and Android side by side). Is this something you would
> not like to see done?

You can do that today without any need for device namespaces, so why is
this an issue here?

greg k-h

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Device Namespaces
       [not found]                       ` <20130926053320.GB3725-U8xfFu+wG4EAvxtiuMwx3w@public.gmane.org>
@ 2013-09-26  8:25                         ` Janne Karhunen
       [not found]                           ` <CAE=NcrbPXGWU8FUgwchXyL5HjXf+4AKbgUWGe1ZO=Xcq=iV-Lg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2013-10-01  6:19                         ` Janne Karhunen
  1 sibling, 1 reply; 48+ messages in thread
From: Janne Karhunen @ 2013-09-26  8:25 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Linux Containers, Kay Sievers, Andy Lutomirski,
	devel-GEFAQzZX7r8dnm+yROfE0A, Eric W. Biederman, lxc-devel,
	mhw-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Stephane Graber

On Thu, Sep 26, 2013 at 8:33 AM, Greg Kroah-Hartman
<gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r@public.gmane.org> wrote:

>> In summary the situation with device hotplug and containers sucks today,
>> and we need to do something.  Running a linux desktop in a container is
>> a reasonably good example use case.
>
> No it isn't.  I'd argue that this is a horrible use case, one that you
> shouldn't do.  Why not just use multi-head machines like people do who
> really want to do this, relying on user separation?  That's a workable
> solution that is quite common and works very well today.

I suppose so, but now you take the assumption that there is no
need for running multiple Linux variants on the same host (say
Ubuntu and Android side by side). Is this something you would
not like to see done?


-- 
Janne

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Device Namespaces
       [not found]                   ` <87bo3gshz5.fsf_-_-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
@ 2013-09-26  5:33                     ` Greg Kroah-Hartman
       [not found]                       ` <20130926053320.GB3725-U8xfFu+wG4EAvxtiuMwx3w@public.gmane.org>
  2013-10-28 23:31                     ` Andrey Wagin
  1 sibling, 1 reply; 48+ messages in thread
From: Greg Kroah-Hartman @ 2013-09-26  5:33 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linux Containers, Kay Sievers, Andy Lutomirski,
	devel-GEFAQzZX7r8dnm+yROfE0A, lxc-devel,
	mhw-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Stephane Graber

On Wed, Sep 25, 2013 at 02:34:54PM -0700, Eric W. Biederman wrote:
> So the big issues for a device namespace to solve are filtering which
> devices a container has access to and being able to dynamically change
> which devices those are at run time (aka hotplug).

As _all_ devices are hotpluggable now (look, there's no CONFIG_HOTPLUG
anymore, because it was redundant), I think you need to really think
this through better (pci, memory, cpus, etc.) before you do anything in
the kernel.

> After having thought about this for a bit I don't know if a pure
> userspace solution is sufficient or actually a good idea.
> 
> - We can manually manage a tmpfs with device nodes in userspace.
>   (But that is deprecated functionality in the mainstream kernel).

Yes, but I'm not going to namespace devtmpfs, as that is going to be an
impossible task, right?

And remember, udev doesn't create device nodes anymore...

> - We can manually export a subset of sysfs with bind mounts.
>   (But that feels hacky, and is essentially incompatible with hotplug).

True.

> - We can relay a call of /sbin/hotplug from outside of a container
>   to inside of a container based on policy.
>   (But no one uses /sbin/hotplug anymore).

That's right, they should be listening to libudev events, so why can't
your daemon shuffle them off to the proper container, all in userspace?
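A rough sketch of the decision such a host daemon would make per event.
Assumptions, stated loudly: the payload layout parsed here is the raw
kernel uevent form ("action@devpath" followed by NUL-separated KEY=VALUE
fields), not the libudev monitor format; the function names and the
policy table are hypothetical; and the sketch only picks destination
containers, it does not deliver or replay anything.

```python
import fnmatch

def parse_uevent(payload: bytes) -> dict:
    """Parse a kernel uevent payload into a dict of its KEY=VALUE fields."""
    fields = [f for f in payload.split(b"\0") if f]
    event = {}
    for field in fields[1:]:  # fields[0] is the "action@devpath" header
        key, sep, value = field.partition(b"=")
        if sep:
            event[key.decode()] = value.decode()
    return event

def route(event: dict, policy: dict) -> list:
    """Return the containers whose glob patterns match the event's DEVNAME."""
    devname = event.get("DEVNAME", "")
    return sorted(
        container
        for container, patterns in policy.items()
        if any(fnmatch.fnmatchcase(devname, p) for p in patterns)
    )
```

With a policy such as {"romulus": ["ttyUSB*"]}, an "add" event for
ttyUSB0 would be routed to romulus and to nobody else.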

> - There is no way to fake netlink uevents for a container to see them.
>   (The best we could do is replace udev everywhere with something that
>    listens on a unix domain socket).

You shouldn't need to do this.

> - It would be nice to replace the device cgroup with a comprehensive
>   solution that really works. (Among other things the device cgroup
>   does not work in terms of struct device the underlying kernel
>   abstraction for devices).

I didn't even know there was a device cgroup.

Which means that if there is one, odds are it's useless.

> We must manage sysfs entries as well as device nodes because:
> - Seeing more than we should has the real potential to confuse
>   userspace, especially a userspace that replays uevents.

You should never replay uevents.  If you don't do that, why can't you
see all of sysfs?

> - Some device control must happen through writing to sysfs files and
>   if we don't remove all root privileges from a container only by
>   exporting a subset of sysfs to that container can we limit which
>   sysfs nodes can be written to.

But you have the issue of controlling devices in a "shared" way, which
isn't going to be usable for almost all devices.

> The current kernel tagged sysfs entry support does not look like a good
> match for implementing device filtering.  The common case will
> be allowing devices like /dev/zero, and /dev/null that live in
> /sys/devices/virtual and are the devices we are most likely to care
> about.  Those devices need to live in multiple device namespaces so
> everyone can use them.  Perhaps exclusive assignment will be the more
> common paradigm for device namespaces like it is for network devices in
> the network namespace but from what little I can see of this problem right
> now I don't think so.
> 
> I definitely think we should hold off on a kernel level implementation
> until we really understand the issues and are ready to implement device
> namespaces correctly.

I agree, especially as I don't think this will ever work.

> A userspace implementation looks like it can only do about 95% of what
> is really needed, but at the same time looks like an easy way to
> experiment until the problem is sufficiently well understood.

95% is probably way better than what you have today, and will fit the
needs of almost everyone today, so why not do it?

I'd argue that those last 5% either are custom solutions that never get
merged, or candidates for true virtualization.

> In summary the situation with device hotplug and containers sucks today,
> and we need to do something.  Running a linux desktop in a container is
> a reasonably good example use case.

No it isn't.  I'd argue that this is a horrible use case, one that you
shouldn't do.  Why not just use multi-head machines like people do who
really want to do this, relying on user separation?  That's a workable
solution that is quite common and works very well today.

> Having one standard common maintainable implementation would be very
> useful and the most logical place for that would be in the kernel.
> For now we should focus on simple device filtering and hotplug.

Just listen for libudev stuff, don't try to filter them, or ever
"replay" them, that way lies madness, and lots of nasty race conditions
that are guaranteed to break things.

good luck,

greg k-h

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Device Namespaces
       [not found]               ` <CAE=NcrbyFFoMn2nfBA_=ZtwD=eGLvqK=L-U9MuGrtJFLZfZppw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2013-09-25 21:34                 ` Eric W. Biederman
       [not found]                   ` <87bo3gshz5.fsf_-_-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 48+ messages in thread
From: Eric W. Biederman @ 2013-09-25 21:34 UTC (permalink / raw)
  To: Linux Containers
  Cc: Greg Kroah-Hartman, mhw-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	Kay Sievers, Andy Lutomirski, lxc-devel, Stephane Graber,
	devel-GEFAQzZX7r8dnm+yROfE0A


From conversations at Linux Plumbers Conference it became fairly clear
that one of, if not the, roughest edges on containers today is dealing with
devices.

- Hotplug does not work.
- There seems to be no implementation that does much beyond setting up
  a static set of /dev entries today.
- Containers do not see the appropriate uevents for their container.

One of the more compelling cases I heard was of someone who was running
a Linux desktop in a container and wanted to just let that container
see the devices needed for his desktop, and not everything else.

Talking with the OpenVZ folks it appears that preserving device numbers
across checkpoint/restart is not currently an issue.  However they reuse
the same loopback minor number when they can which would hide this
issue.   So while it is clear we don't need to worry about migrating
an application that cares about major/minor numbers of filesystems right
now, as the set of applications that are migrated increases that situation
may change.  As the case with the network device ifindex has shown it is
possible to implement filtering now and later when there is a usecase it
is possible to expand filtering to actual namespace local identifiers.

Thinking about it for the case of container migration the simplest
solution for the rare application that needs something more may be to
figure out how to send a kernel hotplug event.  Something to think about
when we encounter them.

So the big issues for a device namespace to solve are filtering which
devices a container has access to and being able to dynamically change
which devices those are at run time (aka hotplug).

After having thought about this for a bit I don't know if a pure
userspace solution is sufficient or actually a good idea.

- We can manually manage a tmpfs with device nodes in userspace.
  (But that is deprecated functionality in the mainstream kernel).
- We can manually export a subset of sysfs with bind mounts.
  (But that feels hacky, and is essentially incompatible with hotplug).
- We can relay a call of /sbin/hotplug from outside of a container
  to inside of a container based on policy.
  (But no one uses /sbin/hotplug anymore).
- There is no way to fake netlink uevents for a container to see them.
  (The best we could do is replace udev everywhere with something that
   listens on a unix domain socket).
- It would be nice to replace the device cgroup with a comprehensive
  solution that really works. (Among other things the device cgroup
  does not work in terms of struct device the underlying kernel
  abstraction for devices).
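For reference, the existing device cgroup (cgroup v1) is driven by
entries of the form "type major:minor access" written to devices.allow
or devices.deny, which is why it knows nothing about struct device.  A
minimal sketch of building such an entry follows; the helper name is
made up, and only the entry format itself is taken from the cgroup v1
devices controller.

```python
def devices_allow_entry(kind: str, major, minor, access: str = "rwm") -> str:
    """Format one entry for the cgroup v1 devices.allow / devices.deny files.

    kind is 'a' (all), 'b' (block) or 'c' (char); major and minor may be
    integers or '*'; access is any combination of r, w and m (mknod).
    """
    if kind not in ("a", "b", "c"):
        raise ValueError("kind must be 'a', 'b' or 'c'")
    if not access or set(access) - set("rwm"):
        raise ValueError("access must be drawn from 'r', 'w' and 'm'")
    return f"{kind} {major}:{minor} {access}"
```

Allowing /dev/null (char 1:3) for a container would then amount to
writing "c 1:3 rwm" into that container's devices.allow file.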

We must manage sysfs entries as well as device nodes because:
- Seeing more than we should has the real potential to confuse
  userspace, especially a userspace that replays uevents.
- Some device control must happen through writing to sysfs files and
  if we don't remove all root privileges from a container, only by
  exporting a subset of sysfs to that container can we limit which
  sysfs nodes can be written to.

The current kernel tagged sysfs entry support does not look like a good
match for implementing device filtering.  The common case will
be allowing devices like /dev/zero, and /dev/null that live in
/sys/devices/virtual and are the devices we are most likely to care
about.  Those devices need to live in multiple device namespaces so
everyone can use them.  Perhaps exclusive assignment will be the more
common paradigm for device namespaces like it is for network devices in
the network namespace but from what little I can see of this problem right
now I don't think so.

I definitely think we should hold off on a kernel level implementation
until we really understand the issues and are ready to implement device
namespaces correctly.

A userspace implementation looks like it can only do about 95% of what
is really needed, but at the same time looks like an easy way to
experiment until the problem is sufficiently well understood.

At the end of the day we need to filter the devices a set of userspace
processes can use and be able to change that set of devices dynamically.
All of the rest of the infrastructure for that lives in the kernel, and
keeping all of the infrastructure in one place where it can be
maintained together is likely to be most maintainable.  It looks like
the code is just complicated enough and the use cases just boring enough
that spreading the code to perform container device hotplug and
container device filtering between a dozen userspace tools, and a handful
of userspace device managers will not be particularly manageable at the
end of the day.

In summary the situation with device hotplug and containers sucks today,
and we need to do something.  Running a linux desktop in a container is
a reasonably good example use case.  Having one standard common
maintainable implementation would be very useful and the most logical
place for that would be in the kernel.  For now we should focus on
simple device filtering and hotplug.

Eric

^ permalink raw reply	[flat|nested] 48+ messages in thread
