* Re: [lxc-devel] device namespaces
[not found] <CALRD3qKmpzJCRszkG_S9Z3XgoTGWVMFd7FqeJh+W-9pZqPVhCg@mail.gmail.com>
@ 2014-09-24 5:04 ` Eric W. Biederman
[not found] ` <CALRD3qKPJHmmY2DSNNfNKzmLihDLm9fgBQprCXNMHVOArV4iuw@mail.gmail.com>
2014-09-24 16:38 ` Serge Hallyn
0 siblings, 2 replies; 11+ messages in thread
From: Eric W. Biederman @ 2014-09-24 5:04 UTC (permalink / raw)
To: LXC development mailing-list
Cc: linux-kernel, Miklos Szeredi, fuse-devel, Tejun Heo,
Seth Forshee, serge.hallyn
riya khanna <riyakhanna1983@gmail.com> writes:
> (Please pardon multiple emails, artifact of merging all separate conversations)
>
> Thanks for your feedback!
>
> Letting the kernel know about what devices a container could access (based on
> device cgroups) and having devtmpfs in the kernel create device nodes for a
> container that map to corresponding CUSE nodes is what I thought of. For
> example, "echo 29:0 > /proc/<pid>/devices" would prepare a virtual framebuffer
> (based on real fb0 SCREENINFO properties) for this process provided permissions
> allow this operation. To view the framebuffer, the CUSE based virtual device
> would talk to the actual hardware. Since namespaces would have different view of
> the underlying devices, "sysfs" has to be made aware of this as well.
>
> Please let me know your inputs. Thanks again!
The solution hugely depends on what you are trying to do with it.
The situation today is that device nodes are slowly fading out. In
another 20 years linux may not have any device nodes at all.
Therefore the question becomes what are you trying to support.
If it is just filtering of existing device nodes, we can do a pretty
good approximation with bind mounts.
If you want to emulate a device you can use normal fuse (not cuse),
as a normal fuse file will support arbitrary ioctls.
There are a few cases where it is desirable to emulate what devpts
does to allow arbitrary users to create virtual devices in the
kernel. Loop devices in particular.
Ultimately given the existence of device hotplug I don't see any call
for being able to create device nodes with well known device numbers
(fundamentally what a device namespace would be about).
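For readers unfamiliar with the term, a "well known device number" is a fixed (major, minor) pair; the "29:0" in the quoted example is conventionally the first framebuffer. The following is only a minimal Python sketch of how such a pair maps to the single device number stored in a node's st_rdev. The helper names parse_devnum and to_dev_t are invented for this illustration and do not come from the thread.

```python
import os


def parse_devnum(text):
    """Split a 'major:minor' string such as '29:0' into two integers."""
    major_s, minor_s = text.split(":")
    return int(major_s), int(minor_s)


def to_dev_t(text):
    """Combine major and minor into one device number (dev_t)."""
    major, minor = parse_devnum(text)
    return os.makedev(major, minor)


if __name__ == "__main__":
    dev = to_dev_t("29:0")
    # Round-trip the combined number back into its two halves.
    print(os.major(dev), os.minor(dev))
```

A device namespace in the sense discussed here would let the same pair refer to different devices in different containers, which is the part the message argues nobody needs once hotplug assigns numbers dynamically.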
The conversation last year was about people wanting to multiplex devices
that don't have multiplexer support in the kernel. If that is your
desire, I think it is entirely reasonable to add multiplexing support
to the kernel device type by device type, or potentially just to use
fuse or cuse to implement your multiplexer in userspace, though that
has the potential to be unusably slow.
Eric
^ permalink raw reply [flat|nested] 11+ messages in thread
[parent not found: <CALRD3qKPJHmmY2DSNNfNKzmLihDLm9fgBQprCXNMHVOArV4iuw@mail.gmail.com>]
* Re: [lxc-devel] device namespaces
[not found] ` <CALRD3qKPJHmmY2DSNNfNKzmLihDLm9fgBQprCXNMHVOArV4iuw@mail.gmail.com>
@ 2014-09-24 16:37 ` Serge Hallyn
2014-09-24 17:43 ` Using devices in Containers (was: [lxc-devel] device namespaces) Eric W. Biederman
2014-09-24 19:07 ` [lxc-devel] device namespaces Riya Khanna
0 siblings, 2 replies; 11+ messages in thread
From: Serge Hallyn @ 2014-09-24 16:37 UTC (permalink / raw)
To: riya khanna
Cc: Eric W. Biederman, LXC development mailing-list, Miklos Szeredi,
fuse-devel, Tejun Heo, Seth Forshee, linux-kernel
Isolation is provided by the devices cgroup. You want something more
than isolation.
Quoting riya khanna (riyakhanna1983@gmail.com):
> My use case for having device namespaces is device isolation. Isn't that what
> namespaces are there for (as I understand)? Not everything should be
> accessible (or even visible) from a container all the time (we have seen
> people come up with different use cases for this). However, bind-mounting
> takes away this flexibility. I agree that assigning fixed device numbers is
> clearly not a long-term solution. Emulation for safe and flexible
> multiplexing, like you suggested either using CUSE/FUSE or something like
> devpts, is what I'm exploring.
>
> On Wed, Sep 24, 2014 at 12:04 AM, Eric W. Biederman <ebiederm@xmission.com>
> wrote:
>
> > riya khanna <riyakhanna1983@gmail.com> writes:
> >
> > > (Please pardon multiple emails, artifact of merging all separate
> > conversations)
> > >
> > > Thanks for your feedback!
> > >
> > > Letting the kernel know about what devices a container could access
> > (based on
> > > device cgroups) and having devtmpfs in the kernel create device nodes
> > for a
> > > container that map to corresponding CUSE nodes is what I thought of. For
> > > example, "echo 29:0 > /proc/<pid>/devices" would prepare a virtual
> > framebuffer
> > > (based on real fb0 SCREENINFO properties) for this process provided
> > permissions
> > > allow this operation. To view the framebuffer, the CUSE based virtual
> > device
> > > would talk to the actual hardware. Since namespaces would have different
> > view of
> > > the underlying devices, "sysfs" has to be made aware of this as well.
> > >
> > > Please let me know your inputs. Thanks again!
> >
> > The solution hugely depends on what you are trying to do with it.
> >
> > The situation today is that device nodes are slowly fading out. In
> > another 20 years linux may not have any device nodes at all.
> >
> > Therefore the question becomes what are you trying to support.
> >
> > If it is just filtering of existing device nodes, we can do a pretty
> > good approximation with bind mounts.
> >
> > If you want to emulate a device you can use normal fuse (not cuse),
> > as a normal fuse file will support arbitrary ioctls.
> >
> > There are a few cases where it is desirable to emulate what devpts
> > does to allow arbitrary users to create virtual devices in the
> > kernel. Loop devices in particular.
> >
> > Ultimately given the existence of device hotplug I don't see any call
> > for being able to create device nodes with well known device numbers
> > (fundamentally what a device namespace would be about).
> >
> > The conversation last year was about people wanting to multiplex devices
> > that don't have multiplexer support in the kernel. If that is your
> > desire I think it is entirely reasonable to device type by device type
> > add support for multiplexing that device type to the kernel, or
> > potentially just use fuse or cuse to implement your multiplexer in
> > userspace but that has the potential to be unusably slow.
> >
> > Eric
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at http://vger.kernel.org/majordomo-info.html
> > Please read the FAQ at http://www.tux.org/lkml/
> >
^ permalink raw reply [flat|nested] 11+ messages in thread
* Using devices in Containers (was: [lxc-devel] device namespaces)
2014-09-24 16:37 ` Serge Hallyn
@ 2014-09-24 17:43 ` Eric W. Biederman
2014-09-24 19:30 ` Riya Khanna
2014-09-24 19:07 ` [lxc-devel] device namespaces Riya Khanna
1 sibling, 1 reply; 11+ messages in thread
From: Eric W. Biederman @ 2014-09-24 17:43 UTC (permalink / raw)
To: riya khanna
Cc: LXC development mailing-list, Miklos Szeredi, fuse-devel,
Tejun Heo, Seth Forshee, linux-kernel, Serge Hallyn
Serge Hallyn <serge.hallyn@ubuntu.com> writes:
> Isolation is provided by the devices cgroup. You want something more
> than isolation.
>
> Quoting riya khanna (riyakhanna1983@gmail.com):
>> My use case for having device namespaces is device isolation. Isn't that what
>> namespaces are there for (as I understand)?
Namespaces fundamentally provide for using the same ``global'' name
in different contexts. This allows them to be used for isolation
and process migration (because you can take the same name from
machine to machine).
Unless someone cares about device numbers at a namespace level
the work is done.
The mount namespace exists to deal with file names.
The devices cgroup will limit which devices you can access (although
I can't ever imagine a case where the mount namespace would be
insufficient).
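To make the devices cgroup remark concrete, here is a small Python sketch that assumes the cgroup v1 devices controller, where access is granted or denied by writing entries of the form "type major:minor access" to a group's devices.allow or devices.deny file. The function names and the cgroup directory passed to apply_rule are hypothetical and chosen only for this example.

```python
import os


def device_rule(kind, major, minor, access):
    """Format one entry, e.g. 'c 29:0 rwm'.

    kind is 'a' (all), 'b' (block) or 'c' (char); major and minor may be
    integers or '*'; access is any combination of r, w and m (mknod).
    """
    if kind not in ("a", "b", "c"):
        raise ValueError("kind must be 'a', 'b' or 'c'")
    if not access or set(access) - set("rwm"):
        raise ValueError("access must be a combination of r, w and m")
    return f"{kind} {major}:{minor} {access}"


def apply_rule(cgroup_dir, rule, allow=True):
    """Write the rule into the group's allow or deny file (needs privileges)."""
    name = "devices.allow" if allow else "devices.deny"
    with open(os.path.join(cgroup_dir, name), "w") as handle:
        handle.write(rule)


if __name__ == "__main__":
    print(device_rule("c", 29, 0, "rwm"))
```

This only limits which device numbers a group of processes may open; it does not change which names are visible, which is the job the message assigns to the mount namespace.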
>> Not everything should be
>> accessible (or even visible) from a container all the time (we have seen
>> people come up with different use cases for this). However, bind-mounting
>> takes away this flexibility.
I don't see how. If they are mounts that propagate into the container
and are controlled from outside you can do whatever you want. (I am
imagining device by device bind mounts here). It should be trivial
to have a directory tree that propagates into a container and works.
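One way to picture the propagated directory tree is the sketch below. It is written under stated assumptions rather than taken from the thread: the tree path is hypothetical, the container is assumed to receive the tree as a shared or slave mount so that later mount events reach it, and the functions only build command lines (they would need root to run). It uses only the standard mount --bind, mount --make-shared and umount invocations.

```python
import os


def setup_commands(tree):
    """Commands that turn `tree` into a shared mount point on the host."""
    return [
        ["mount", "--bind", tree, tree],
        ["mount", "--make-shared", tree],
    ]


def grant_commands(tree, device):
    """Commands that expose one host device node under `tree`."""
    target = os.path.join(tree, os.path.basename(device))
    return [
        ["touch", target],  # a file bind mount needs an existing target
        ["mount", "--bind", device, target],
    ]


def revoke_commands(tree, device):
    """Commands that remove the device node from `tree` again."""
    target = os.path.join(tree, os.path.basename(device))
    return [["umount", target]]


if __name__ == "__main__":
    for cmd in setup_commands("/srv/c1/dev") + grant_commands("/srv/c1/dev", "/dev/fb0"):
        print(" ".join(cmd))
```

Granting and revoking are then ordinary mount operations performed outside the container, one device at a time.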
>> I agree that assigning fixed device numbers is
>> clearly not a long-term solution. Emulation for safe and flexible
>> multiplexing, like you suggested either using CUSE/FUSE or something like
>> devpts, is what I'm exploring.
Is the problem you actually care about multiplexing devices?
I think there is quite a bit of room to talk about how to safely
and effectively use devices in containers. So let's make that the
discussion. No one actually wants device number namespaces and talking
about them only muddies the waters.
Eric
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Using devices in Containers (was: [lxc-devel] device namespaces)
2014-09-24 17:43 ` Using devices in Containers (was: [lxc-devel] device namespaces) Eric W. Biederman
@ 2014-09-24 19:30 ` Riya Khanna
2014-09-24 22:38 ` Using devices in Containers Eric W. Biederman
0 siblings, 1 reply; 11+ messages in thread
From: Riya Khanna @ 2014-09-24 19:30 UTC (permalink / raw)
To: Eric W. Biederman
Cc: LXC development mailing-list, Miklos Szeredi, fuse-devel,
Tejun Heo, Seth Forshee, linux-kernel, Serge Hallyn
On Sep 24, 2014, at 12:43 PM, Eric W. Biederman <ebiederm@xmission.com> wrote:
> Serge Hallyn <serge.hallyn@ubuntu.com> writes:
>
>> Isolation is provided by the devices cgroup. You want something more
>> than isolation.
>>
>> Quoting riya khanna (riyakhanna1983@gmail.com):
>>> My use case for having device namespaces is device isolation. Isn't that what
>>> namespaces are there for (as I understand)?
>
> Namespaces fundamentally provide for using the same ``global'' name
> in different contexts. This allows them to be used for isolation
> and process migration (because you can take the same name from
> machine to machine).
>
> Unless someone cares about device numbers at a namespace level
> the work is done.
>
> The mount namespace exists to deal with file names.
> The devices cgroup will limit which devices you can access (although
> I can't ever imagine a case where the mount namespace would be
> insufficient).
>
>>> Not everything should be
>>> accessible (or even visible) from a container all the time (we have seen
>>> people come up with different use cases for this). However, bind-mounting
>>> takes away this flexibility.
>
> I don't see how. If they are mounts that propagate into the container
> and are controlled from outside you can do whatever you want. (I am
> imagining device by device bind mounts here). It should be trivial
> to have a directory tree that propagates into a container and works.
>
Device-by-device bind mounts can grant/revoke access to real individual devices as and when needed. However, revoking the access to real devices could break the applications if there’s no transparent mechanism to back up the propagated (but now revoked) device bind mounts that could fool the apps into believing that they are working with real devices. Frame buffer is one such example, where safe multiplexing could be applied.
>>> I agree that assigning fixed device numbers is
>>> clearly not a long-term solution. Emulation for safe and flexible
>>> multiplexing, like you suggested either using CUSE/FUSE or something like
>>> devpts, is what I'm exploring.
>
> Is the problem you actually care about multiplexing devices?
>
The problem I care about is access to real devices, such as input, fb, loop, etc. as and when needed, thereby having native I/O performance - either through secure multiplexing or exclusive ownership, whatever makes sense according to the device type.
> I think there is quite a bit of room to talk about how to safely
> and effectively use devices in containers. So let's make that the
> discussion. No one actually wants device number namespaces and talking
> about them only muddies the waters.
>
I cannot agree more. Let’s restrict the discussion to it.
Thanks,
Riya
> Eric
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Using devices in Containers
2014-09-24 19:30 ` Riya Khanna
@ 2014-09-24 22:38 ` Eric W. Biederman
[not found] ` <CALRD3qLYAc+K8e1xYb27ipi4KyGRmTxokPCHN0L_zta=Cy9sCQ@mail.gmail.com>
0 siblings, 1 reply; 11+ messages in thread
From: Eric W. Biederman @ 2014-09-24 22:38 UTC (permalink / raw)
To: Riya Khanna
Cc: LXC development mailing-list, Miklos Szeredi, fuse-devel,
Tejun Heo, Seth Forshee, linux-kernel, Serge Hallyn
Riya Khanna <riyakhanna1983@gmail.com> writes:
> On Sep 24, 2014, at 12:43 PM, Eric W. Biederman <ebiederm@xmission.com> wrote:
>
>> Serge Hallyn <serge.hallyn@ubuntu.com> writes:
>>
>>> Isolation is provided by the devices cgroup. You want something more
>>> than isolation.
>>>
>>> Quoting riya khanna (riyakhanna1983@gmail.com):
>>>> My use case for having device namespaces is device isolation. Isn't that what
>>>> namespaces are there for (as I understand)?
>>
>> Namespaces fundamentally provide for using the same ``global'' name
>> in different contexts. This allows them to be used for isolation
>> and process migration (because you can take the same name from
>> machine to machine).
>>
>> Unless someone cares about device numbers at a namespace level
>> the work is done.
>>
>> The mount namespace exists to deal with file names.
>> The devices cgroup will limit which devices you can access (although
>> I can't ever imagine a case where the mount namespace would be
>> insufficient).
>>
>>>> Not everything should be
>>>> accessible (or even visible) from a container all the time (we have seen
>>>> people come up with different use cases for this). However, bind-mounting
>>>> takes away this flexibility.
>>
>> I don't see how. If they are mounts that propagate into the container
>> and are controlled from outside you can do whatever you want. (I am
>> imagining device by device bind mounts here). It should be trivial
>> to have a directory tree that propagates into a container and works.
>>
>
> Device-by-device bind mounts can grant/revoke access to real
> individual devices as and when needed. However, revoking the access to
> real devices could break the applications if there’s no transparent
> mechanism to back up the propagated (but now revoked) device bind
> mounts that could fool the apps into believing that they are working
> with real devices. Frame buffer is one such example, where safe
> multiplexing could be applied.
>
>>>> I agree that assigning fixed device numbers is
>>>> clearly not a long-term solution. Emulation for safe and flexible
>>>> multiplexing, like you suggested either using CUSE/FUSE or something like
>>>> devpts, is what I'm exploring.
>>
>> Is the problem you actually care about multiplexing devices?
>>
>
> The problem I care about is access to real devices, such as input, fb,
> loop, etc. as and when needed, thereby having native I/O performance -
> either through secure multiplexing or exclusive ownership, whatever
> makes sense according to the device type.
Riya Khanna <riyakhanna1983@gmail.com> writes:
> I guess policy-based multiplexing (or exclusive ownership) is the
> usage. What kind of devices (loop, fb, etc.) this is needed for
> depends on the usage. If there are multiple FBs, then each container
> could potentially own one. One may want to provide exclusive ownership
> of input devices to one container at a time to avoid information
> leakage. Like we saw at LPC last year, this applies to sensors (gps,
> accelerometer, etc.) on mobile devices as well.
Allowing multiplexing of those devices seems reasonable.
Where the discussion ran into problems last time was that people did not
want to use any of the existing Linux solutions for multiplexing those
kinds of things and wanted to invent something new.
Inventing something new is fine if the extra code maintenance can be
justified, or if the invention is just a better solution for all users and
new code can just start using that in general.
The old solution to your problem of multiplexing devices is
allocating a virtual terminal and sending signals to coordinate
cooperatively sharing those resources.
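The cooperative scheme referred to here is, as far as I understand it, the console's VT_PROCESS mode: a program asks the kernel to send it one signal before its virtual terminal is switched away and another when it is switched back, and it must acknowledge both. The Python sketch below assumes the ioctl numbers and struct vt_mode layout from <linux/vt.h>; the choice of SIGUSR1 and SIGUSR2 and all function names are illustrative only.

```python
import fcntl
import signal
import struct

# Console ioctl numbers and mode values as defined in <linux/vt.h>.
VT_SETMODE = 0x5602
VT_RELDISP = 0x5605
VT_PROCESS = 1
VT_ACKACQ = 2


def pack_vt_mode(relsig=signal.SIGUSR1, acqsig=signal.SIGUSR2):
    """Pack struct vt_mode {char mode, waitv; short relsig, acqsig, frsig;}."""
    return struct.pack("bbhhh", VT_PROCESS, 0, int(relsig), int(acqsig), 0)


def take_control(tty_fd):
    """Ask the kernel to signal this process around every VT switch."""
    fcntl.ioctl(tty_fd, VT_SETMODE, pack_vt_mode())


def release_display(tty_fd):
    """Call from the release-signal handler once device state is saved."""
    fcntl.ioctl(tty_fd, VT_RELDISP, 1)


def acknowledge_acquire(tty_fd):
    """Call from the acquire-signal handler before touching the device."""
    fcntl.ioctl(tty_fd, VT_RELDISP, VT_ACKACQ)


if __name__ == "__main__":
    print(len(pack_vt_mode()))
```

The switch only completes when the current owner calls release_display, which is why this model depends on every participant behaving well.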
If you want some sort of preemptive multitasking, that requires
a bit more effort, and work in the device abstractions.
You may be able to share concepts and library code but I don't believe
there is something you can just paint on top of devices and make it
happen. Certainly in the bad old days of X terminal switching the
cooperation was necessary because when a video card was yanked from an
application writing directly to that video card the application would
need to restore the video card to a known state so the next application
would have a chance of making sense of it. Furthermore most devices
are not safe to let unprivileged users access their control registers
directly.
All of which boils down to the simple fact that for each type of device you
would like to share it is necessary to update the subsystem to support
arbitrary numbers of virtual devices that you can talk to.
The macvlan driver in the networking stack is a rough example of what I
expect you would like. Something that takes one real physical device
and turns it into N virtual devices each of which runs at effectively
full speed. Along with some kind of new master interface for
controlling when the multiplexing takes place.
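For comparison, this is roughly what the macvlan model looks like from userspace: one physical interface, N virtual interfaces created on top of it with iproute2. The sketch only builds the "ip link add ... type macvlan" command lines; the parent interface name, the "mv" prefix and the function name are assumptions made for the example, and running the commands would require root.

```python
def macvlan_commands(parent, count, prefix="mv"):
    """Build one 'ip link add' command per virtual interface on `parent`."""
    commands = []
    for index in range(count):
        name = f"{prefix}{index}"
        commands.append([
            "ip", "link", "add", "link", parent,
            "name", name, "type", "macvlan", "mode", "bridge",
        ])
    return commands


if __name__ == "__main__":
    for cmd in macvlan_commands("eth0", 2):
        print(" ".join(cmd))
```

Each resulting interface can then be handed to a different container, which is the property the message suggests other device subsystems would need to grow individually.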
I think we do most of this in software today and arguably for a lot of
devices the overhead is small enough that a software solution is fine.
So perhaps all you need is a fuse interface to the existing software
multiplexers so that weird legacy code can be made to run.
Now I suspect part of doing this right will be getting proper video
drivers on Android. I assume that Android is the platform you care
about.
Eric
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [lxc-devel] device namespaces
2014-09-24 16:37 ` Serge Hallyn
2014-09-24 17:43 ` Using devices in Containers (was: [lxc-devel] device namespaces) Eric W. Biederman
@ 2014-09-24 19:07 ` Riya Khanna
1 sibling, 0 replies; 11+ messages in thread
From: Riya Khanna @ 2014-09-24 19:07 UTC (permalink / raw)
To: Serge Hallyn
Cc: Eric W. Biederman, LXC development mailing-list, Miklos Szeredi,
fuse-devel, Tejun Heo, Seth Forshee, linux-kernel
I guess policy-based multiplexing (or exclusive ownership) is the usage. What kind of devices (loop, fb, etc.) this is needed for depends on the usage. If there are multiple FBs, then each container could potentially own one. One may want to provide exclusive ownership of input devices to one container at a time to avoid information leakage. Like we saw at LPC last year, this applies to sensors (gps, accelerometer, etc.) on mobile devices as well.
On Sep 24, 2014, at 11:37 AM, Serge Hallyn <serge.hallyn@ubuntu.com> wrote:
> Isolation is provided by the devices cgroup. You want something more
> than isolation.
>
> Quoting riya khanna (riyakhanna1983@gmail.com):
>> My use case for having device namespaces is device isolation. Isn't that what
>> namespaces are there for (as I understand)? Not everything should be
>> accessible (or even visible) from a container all the time (we have seen
>> people come up with different use cases for this). However, bind-mounting
>> takes away this flexibility. I agree that assigning fixed device numbers is
>> clearly not a long-term solution. Emulation for safe and flexible
>> multiplexing, like you suggested either using CUSE/FUSE or something like
>> devpts, is what I'm exploring.
>>
>> On Wed, Sep 24, 2014 at 12:04 AM, Eric W. Biederman <ebiederm@xmission.com>
>> wrote:
>>
>>> riya khanna <riyakhanna1983@gmail.com> writes:
>>>
>>>> (Please pardon multiple emails, artifact of merging all separate
>>> conversations)
>>>>
>>>> Thanks for your feedback!
>>>>
>>>> Letting the kernel know about what devices a container could access
>>> (based on
>>>> device cgroups) and having devtmpfs in the kernel create device nodes
>>> for a
>>>> container that map to corresponding CUSE nodes is what I thought of. For
>>>> example, "echo 29:0 > /proc/<pid>/devices" would prepare a virtual
>>> framebuffer
>>>> (based on real fb0 SCREENINFO properties) for this process provided
>>> permissions
>>>> allow this operation. To view the framebuffer, the CUSE based virtual
>>> device
>>>> would talk to the actual hardware. Since namespaces would have different
>>> view of
>>>> the underlying devices, "sysfs" has to be made aware of this as well.
>>>>
>>>> Please let me know your inputs. Thanks again!
>>>
>>> The solution hugely depends on what you are trying to do with it.
>>>
>>> The situation today is that device nodes are slowly fading out. In
>>> another 20 years linux may not have any device nodes at all.
>>>
>>> Therefore the question becomes what are you trying to support.
>>>
>>> If it is just filtering of existing device nodes, we can do a pretty
>>> good approximation with bind mounts.
>>>
>>> If you want to emulate a device you can use normal fuse (not cuse),
>>> as a normal fuse file will support arbitrary ioctls.
>>>
>>> There are a few cases where it is desirable to emulate what devpts
>>> does to allow arbitrary users to create virtual devices in the
>>> kernel. Loop devices in particular.
>>>
>>> Ultimately given the existence of device hotplug I don't see any call
>>> for being able to create device nodes with well known device numbers
>>> (fundamentally what a device namespace would be about).
>>>
>>> The conversation last year was about people wanting to multiplex devices
>>> that don't have multiplexer support in the kernel. If that is your
>>> desire I think it is entirely reasonable to device type by device type
>>> add support for multiplexing that device type to the kernel, or
>>> potentially just use fuse or cuse to implement your multiplexer in
>>> userspace but that has the potential to be unusably slow.
>>>
>>> Eric
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [lxc-devel] device namespaces
2014-09-24 5:04 ` [lxc-devel] device namespaces Eric W. Biederman
[not found] ` <CALRD3qKPJHmmY2DSNNfNKzmLihDLm9fgBQprCXNMHVOArV4iuw@mail.gmail.com>
@ 2014-09-24 16:38 ` Serge Hallyn
1 sibling, 0 replies; 11+ messages in thread
From: Serge Hallyn @ 2014-09-24 16:38 UTC (permalink / raw)
To: Eric W. Biederman
Cc: LXC development mailing-list, linux-kernel, Miklos Szeredi,
fuse-devel, Tejun Heo, Seth Forshee
Quoting Eric W. Biederman (ebiederm@xmission.com):
> riya khanna <riyakhanna1983@gmail.com> writes:
>
> > (Please pardon multiple emails, artifact of merging all separate conversations)
> >
> > Thanks for your feedback!
> >
> > Letting the kernel know about what devices a container could access (based on
> > device cgroups) and having devtmpfs in the kernel create device nodes for a
> > container that map to corresponding CUSE nodes is what I thought of. For
> > example, "echo 29:0 > /proc/<pid>/devices" would prepare a virtual framebuffer
> > (based on real fb0 SCREENINFO properties) for this process provided permissions
> > allow this operation. To view the framebuffer, the CUSE based virtual device
> > would talk to the actual hardware. Since namespaces would have different view of
> > the underlying devices, "sysfs" has to be made aware of this as well.
> >
> > Please let me know your inputs. Thanks again!
>
> The solution hugely depends on what you are trying to do with it.
>
> The situation today is that device nodes are slowly fading out. In
> another 20 years linux may not have any device nodes at all.
>
> Therefore the question becomes what are you trying to support.
>
> If it is just filtering of existing device nodes, we can do a pretty
> good approximation with bind mounts.
>
> If you want to emulate a device you can use normal fuse (not cuse),
> as a normal fuse file will support arbitrary ioctls.
>
> There are a few cases where it is desirable to emulate what devpts
> does to allow arbitrary users to create virtual devices in the
> kernel. Loop devices in particular.
>
> Ultimately given the existence of device hotplug I don't see any call
> for being able to create device nodes with well known device numbers
> (fundamentally what a device namespace would be about).
>
> The conversation last year was about people wanting to multiplex devices
> that don't have multiplexer support in the kernel. If that is your
> desire I think it is entirely reasonable to device type by device type
> add support for multiplexing that device type to the kernel, or
> potentially just use fuse or cuse to implement your multiplexer in
> userspace but that has the potential to be unusably slow.
It would be helpful to have a list of devices that may want that
multiplexing. Is it really just loop and graphics drivers?
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: RFC: Device Namespaces
@ 2013-09-08 12:28 Amir Goldstein
2013-09-09 0:51 ` Eric W. Biederman
0 siblings, 1 reply; 11+ messages in thread
From: Amir Goldstein @ 2013-09-08 12:28 UTC (permalink / raw)
To: Eric W. Biederman; +Cc: Linux Containers, lxc-devel
On Fri, Sep 6, 2013 at 7:50 PM, Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
> Oren Laadan <orenl-3AfRa/s5aFdBDgjK7y7TUQ@public.gmane.org> writes:
>
> > Hi Serge,
> >
> >
> > On Thu, Aug 22, 2013 at 2:21 PM, Serge Hallyn <serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org
> >wrote:
> >
> >> Quoting Oren Laadan (orenl-3AfRa/s5aFdBDgjK7y7TUQ@public.gmane.org):
> >> > Hi everyone!
> >> >
> >> > We [1] have been working on bringing lightweight virtualization to
> >> > Linux-based mobile devices like Android (or other Linux-based devices
> >> with
> >> > diverse I/O) and want to share our solution: device namespaces.
> >> >
> >> > Imagine you could run several instances of your favorite mobile OS or
> >> other
> >> > distributions in isolated containers, each under the impression of
> having
> >> > exclusive access to device drivers; Interact and switch between them
> >> within
> >> > a blink, no flashing, no reboot.
> >> >
> >> > Device namespaces are an extension to existing Linux kernel namespaces
> >> that
> >> > brings lightweight virtualization to Linux-based end-user devices,
> >> > primarily mobile devices.
> >> > Device namespaces introduce a private and virtual namespace for device
> >> > drivers to create the illusion for a process group that it interacts
> >> > exclusively with a set of drivers. Device namespaces also introduce
> the
> >> > concepts of an “active” namespace with which a user interacts, vs
> >> > “non-active” namespaces that run in the background, and the ability to
> >> > switch between them.[2]
> >>
> >> Note that unless I'm misunderstanding what you're saying here, this is
> >> also what net_ns does. A netns can exist with no processes so long as
> >> you've bound its /proc/$$/ns/net somewhere. You can then re-enter that
> >> ns using ns_attach. I haven't looked closely enough yet to see whether
> >> you should be (or are) using the same interface.
> >>
> >>
> > To illustrate the need for device namespaces, consider the use case of
> > running two containers of your favorite OS (say, Android), on a single
> > physical phone. As a user, you either work in one container, or in the
> > other, and you will want to be able to switch between them (just like
> with
> > apps on mobile devices: you interact with one application at a time, and
> > switch between them).
> >
> > See here for a demo of how it works: http://vimeo.com/60113683
> >
> > To accomplish this, device namespaces solve two shortcomings of existing
> > namespaces:
> >
> > 1. A namespace for device drivers: each (Android) container needs a
> > private view of all devices. This includes logical drivers, like binder
> (in
> > Android) but also loop device; and physical devices, like the framebuffer
> > and the touch-screen.
> >
> > In other words, device namespaces virtualize the _major/minor_ and the
> > _state_ of device drivers. With the exception of VFS, network, and PTY
> > (note: all three offer/are virtual devices), device drivers are otherwise
> > not isolated between containers.
> >
> > 2. A namespace for interactive scenarios: a namespace can be "active" -
> it
> > has access to the hardware, e.g. display and touch-screen. This will be
> the
> > container with which the user is interacting right now. Otherwise a
> > namespace is "non-active" - it still runs in the background, but can
> > neither alter the display nor receive input from the touch-screen.
> > Switching to another container means a context switch in the relevant
> > drivers, so that they restore the state and now "obey" the other
> namespace.
> >
> > You can also think about the "active" namespace as foreground, and the
> > "non-active" as background, akin to foreground/background processes in a
> > terminal with job-control. Similar to how a terminal delivers input to
> the
> > foreground task only but not to the background tasks - this is enforced
> by
> > the new device namespace.
> >
> > More details on this use-case are in the wiki:
> > https://github.com/Cellrox/devns-patches/wiki/Thinvisor).
>
> I think this is going to take some talking, and looking at code.
>
>
Hi Eric,
If we can get people to take a quick look at the code before LPC
that could make the LPC discussions more effective.
Even looking at one of the subsystem patches can give a basic
idea of the work we have done:
https://github.com/Cellrox/linux/commits/devns-goldfish-3.4
> I think you are talking about having wrappers around your devices so you
> can share. Which is not quite the same problem the rest of us have been
> thinking of when talking about a device namespace.
>
We are interested in all problems related to a virtualized view of devices
inside a container, so let our work so far be a starting point to discuss
all of them.
>
> My first impression is that this is better solved with more appropriate
> abstractions in userspace or in the kernel.
>
> But we can talk at LPC and see what we can hash out.
>
Looking forward to that :-)
Amir.
>
> Eric
> _______________________________________________
> Containers mailing list
> Containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org
> https://lists.linuxfoundation.org/mailman/listinfo/containers
>
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: RFC: Device Namespaces
@ 2013-09-09 0:51 ` Eric W. Biederman
2013-09-10 7:09 ` Amir Goldstein
0 siblings, 1 reply; 11+ messages in thread
From: Eric W. Biederman @ 2013-09-09 0:51 UTC (permalink / raw)
To: Amir Goldstein; +Cc: Linux Containers, lxc-devel
Amir Goldstein <amir@cellrox.com> writes:
> On Fri, Sep 6, 2013 at 7:50 PM, Eric W. Biederman
> <ebiederm@xmission.com> wrote:
>
> Hi Eric,
>
> If we can get people to take a quick look at the code before LPC
> that could make the LPC discussions more effective.
> Even looking at one of the subsystem patches can give a basic
> idea of the work we have done:
> https://github.com/Cellrox/linux/commits/devns-goldfish-3.4
>
> I think you are talking about having wrappers around your devices
> so you
> can share. Which is not quite the same problem the rest of us
> have been
> thinking of when talking about a device namespace.
>
> We are interested in all problems related to a virtualized view of
> devices
> inside a container, so let our work so far be a starting point to
> discuss all of them.
>
> My first impression is that this is better solved with more
> appropriate
> abstractions in userspace or in the kernel.
As I read your code, you are solving the problem of only one opener of a
device, among a group of openers, being able to access the device at a time.
Which leads to the question why can't the multiplexing happen in
userspace?
I think with your design it would not be possible to play a song in one
device namespace while doing work in the other. As a security model
that isn't wrong but as someone trying to get work done that could be a
real pain.
The more common concern is to have devices we can use all of the time.
There may be a need for a device namespace and multiplexing access to
hardware devices makes that clearer. So far nothing has risen to the
level of "we actually need a device namespace to do X", especially in an
era of hotplug and dynamic device numbers.
It is arguable that you could do your kind of device multiplexing with a
fuse device in userspace that implements your desired policy.
And policy is where the cell situation seems to fall down, because it
hard-codes one specific policy into the kernel, and a policy most situations
don't find useful.
Eric
_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/containers
* Re: RFC: Device Namespaces
@ 2013-09-10 7:09 ` Amir Goldstein
2013-09-25 11:05 ` Janne Karhunen
0 siblings, 1 reply; 11+ messages in thread
From: Amir Goldstein @ 2013-09-10 7:09 UTC (permalink / raw)
To: Eric W. Biederman; +Cc: Linux Containers, lxc-devel
On Mon, Sep 9, 2013 at 2:51 AM, Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
> Amir Goldstein <amir-3AfRa/s5aFdBDgjK7y7TUQ@public.gmane.org> writes:
>
> > On Fri, Sep 6, 2013 at 7:50 PM, Eric W. Biederman
> > <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
> >
> > Hi Eric,
> >
> > If we can get people to take a quick look at the code before LPC
> > that could make the LPC discussions more effective.
> > Even looking at one of the subsystem patches can give a basic
> > idea of the work we have done:
> > https://github.com/Cellrox/linux/commits/devns-goldfish-3.4
> >
> > I think you are talking about having wrappers around your devices
> > so you can share. Which is not quite the same problem the rest of us
> > have been thinking of when talking about a device namespace.
> >
> > We are interested in all problems related to virtualized views of
> > devices inside a container, so let our work so far be a starting point
> > to discuss all of them.
> >
> > My first impression is that this is better solved with more
> > appropriate abstractions in userspace or in the kernel.
>
> As I read your code, you are solving the problem of one opener of a
> device among a group of openers being able to access a device at a time.
> Which leads to the question why can't the multiplexing happen in
> userspace?
>
> I think with your design it would not be possible to play a song in one
> device namespace while doing work in the other. As a security model
> that isn't wrong but as someone trying to get work done that could be a
> real pain.
>
As a matter of fact, in our multi-persona phone, you *can* hear music played
from a background persona, but you *cannot* see images drawn from a background
persona.
> The more common concern is to have devices we can use all of the time.
>
> There may be a need for a device namespace and multiplexing access to
> hardware devices makes that clearer. So far nothing has risen to the
> level of "we actually need a device namespace to do X", especially in an
> era of hotplug and dynamic device numbers.
>
> It is arguable that you could do your kind of device multiplexing with a
> fuse device in userspace that implements your desired policy.
>
I agree about it being arguable :-)
We shall present our arguments at LPC.
>
> And policy is where the cell situation seems to fall down, because it
> hard-codes one specific policy into the kernel, and a policy most situations
> don't find useful.
>
>
It's true that for our product, we have made hardcoded policy decisions in
our kernel patches, but that was just as a proof of concept for the technique.
We do envision being able to dynamically assign a device to a specific devns
(e.g. block, loop), keep a device shared between multiple devns (e.g. audio),
and in addition to that, being able to multiplex a device between multiple
devns (e.g. framebuffer).
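A minimal sketch of the three modes just described (exclusive assignment,
sharing, and multiplexing), written in Python purely for illustration. The
names Sharing, POLICY and may_access, and the persona labels, are
hypothetical and are not taken from the Cellrox patches; the point is only
that the policy can be a table rather than hardcoded per driver.

```python
from enum import Enum


class Sharing(Enum):
    EXCLUSIVE = "exclusive"      # assigned to exactly one devns (e.g. block, loop)
    SHARED = "shared"            # usable from every devns at once (e.g. audio)
    MULTIPLEXED = "multiplexed"  # usable only by the active devns (e.g. framebuffer)


# Hypothetical policy table: device name -> (sharing mode, owning devns or None)
POLICY = {
    "loop0": (Sharing.EXCLUSIVE, "persona-a"),
    "audio": (Sharing.SHARED, None),
    "fb0": (Sharing.MULTIPLEXED, None),
}


def may_access(device, devns, active_devns):
    """Return True if `devns` may use `device` while `active_devns` is in the foreground."""
    entry = POLICY.get(device)
    if entry is None:
        return False  # devices without a policy entry are denied by default
    mode, owner = entry
    if mode is Sharing.EXCLUSIVE:
        return devns == owner
    if mode is Sharing.SHARED:
        return True
    return devns == active_devns  # MULTIPLEXED: only the foreground devns
```

With this table, a background persona can still use "audio" but is refused
"fb0" until it becomes the active devns.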
> Eric
>
* Re: RFC: Device Namespaces
@ 2013-09-25 11:05 ` Janne Karhunen
2013-09-25 21:34 ` Eric W. Biederman
0 siblings, 1 reply; 11+ messages in thread
From: Janne Karhunen @ 2013-09-25 11:05 UTC (permalink / raw)
To: Amir Goldstein; +Cc: Linux Containers, Eric W. Biederman, lxc-devel
On Tue, Sep 10, 2013 at 10:09 AM, Amir Goldstein <amir-3AfRa/s5aFdBDgjK7y7TUQ@public.gmane.org> wrote:
> On Mon, Sep 9, 2013 at 2:51 AM, Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
>
>> Amir Goldstein <amir-3AfRa/s5aFdBDgjK7y7TUQ@public.gmane.org> writes:
>>
>> > On Fri, Sep 6, 2013 at 7:50 PM, Eric W. Biederman
>> > <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
>> >
>> > Hi Eric,
>> >
>> > If we can get people to take a quick look at the code before LPC
>> > that could make the LPC discussions more effective.
Hi,
I think we are curious enough to experiment with Eric's idea of
implementing a basic 'device namespace' in userspace (never miss an
opportunity to throw away kernel code). Can anyone point out any
obvious reason why this would not work if we consider the bulk of the work
to be plain access filtering?
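To make "plain access filtering" concrete, here is a rough Python sketch of
an allowlist check. The rule strings follow the "type major:minor access"
shape used by the device cgroup; the helper names parse_rule and allowed,
and the example ALLOWLIST, are assumptions made for illustration and not an
existing tool.

```python
def parse_rule(rule):
    """Parse 'c 1:3 rwm' into (type, major, minor, access-set); '*' is a wildcard."""
    dev_type, numbers, access = rule.split()
    major, minor = numbers.split(":")
    return dev_type, major, minor, set(access)


def allowed(rules, dev_type, major, minor, access):
    """Return True if any rule grants `access` ('r', 'w' or 'm') to the device."""
    for rule in rules:
        r_type, r_major, r_minor, r_access = parse_rule(rule)
        if r_type not in ("a", dev_type):   # 'a' matches both char and block
            continue
        if r_major not in ("*", str(major)):
            continue
        if r_minor not in ("*", str(minor)):
            continue
        if access in r_access:
            return True
    return False


# Example allowlist: /dev/null (c 1:3), /dev/zero (c 1:5), any loop device (b 7:*)
ALLOWLIST = ["c 1:3 rwm", "c 1:5 rwm", "b 7:* rw"]
```

A userspace filter built this way only answers "may this container touch
this node"; it says nothing about uevents or sysfs visibility.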
That being said, is there a valid reason why binder is part of device
namespace here instead of IPC?
--
Janne
* Device Namespaces
@ 2013-09-25 21:34 ` Eric W. Biederman
2013-09-26 5:33 ` Greg Kroah-Hartman
0 siblings, 1 reply; 11+ messages in thread
From: Eric W. Biederman @ 2013-09-25 21:34 UTC (permalink / raw)
To: Linux Containers
Cc: Greg Kroah-Hartman, mhw-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
Kay Sievers, Andy Lutomirski, lxc-devel, Stephane Graber,
devel-GEFAQzZX7r8dnm+yROfE0A
From conversations at Linux Plumbers Conference it became fairly clear
that one of, if not the, roughest edges on containers today is dealing with
devices.
- Hotplug does not work.
- There seems to be no implementation that does much beyond
setting up a static set of /dev entries today.
- Containers do not see the appropriate uevents for their container.
One of the more compelling cases I heard was of someone who was running
a Linux desktop in a container and wanted to just let that container
see the devices needed for his desktop, and not everything else.
Talking with the OpenVZ folks it appears that preserving device numbers
across checkpoint/restart is not currently an issue. However they reuse
the same loopback minor number when they can which would hide this
issue. So while it is clear we don't need to worry about migrating
an application that cares about major/minor numbers of filesystems right
now, as the set of applications that are migrated increases that situation
may change. As the case with the network device ifindex has shown it is
possible to implement filtering now and later when there is a usecase it
is possible to expand filtering to actual namespace local identifiers.
Thinking about it for the case of container migration the simplest
solution for the rare application that needs something more may be to
figure out how to send a kernel hotplug event. Something to think about
when we encounter them.
So the big issues for a device namespace to solve are filtering which
devices a container has access to and being able to dynamically change
which devices those are at run time (aka hotplug).
After having thought about this for a bit I don't know if a pure
userspace solution is sufficient or actually a good idea.
- We can manually manage a tmpfs with device nodes in userspace.
(But that is deprecated functionality in the mainstream kernel).
- We can manually export a subset of sysfs with bind mounts.
(But that feels hacky, and is essentially incompatible with hotplug).
- We can relay a call of /sbin/hotplug from outside of a container
to inside of a container based on policy.
(But no one uses /sbin/hotplug anymore).
- There is no way to fake netlink uevents for a container to see them.
(The best we could do is replace udev everywhere with something that
listens on a unix domain socket).
- It would be nice to replace the device cgroup with a comprehensive
solution that really works. (Among other things the device cgroup
does not work in terms of struct device the underlying kernel
abstraction for devices).
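As a rough illustration of what relaying or filtering uevents in userspace
would involve, the Python sketch below parses a kernel-format uevent
payload (an "action@devpath" header followed by NUL-separated KEY=VALUE
pairs) and applies a per-container check. The function names, the policy
check, and the sample payload are hypothetical; opening the netlink socket
and delivering the event into a container are deliberately left out.

```python
def parse_uevent(payload):
    """Split a raw kernel uevent into its 'action@devpath' header and KEY=VALUE fields."""
    parts = [p.decode("utf-8", "replace") for p in payload.split(b"\0") if p]
    header, fields = parts[0], parts[1:]
    action, _, devpath = header.partition("@")
    env = dict(f.split("=", 1) for f in fields if "=" in f)
    return action, devpath, env


def visible_to(container_devices, env):
    """Hypothetical policy check: forward only events for devices the container owns."""
    return env.get("DEVNAME") in container_devices


# Hand-written sample payload in the kernel's uevent layout (values are made up).
sample = (
    b"add@/devices/platform/serial8250/tty/ttyS1\0"
    b"ACTION=add\0DEVPATH=/devices/platform/serial8250/tty/ttyS1\0"
    b"SUBSYSTEM=tty\0MAJOR=4\0MINOR=65\0DEVNAME=ttyS1\0SEQNUM=1234\0"
)
action, devpath, env = parse_uevent(sample)
```

Parsing and filtering are the easy part; the open question in the list
above is how the filtered event reaches a listener inside the container.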
We must manage sysfs entries as well as device nodes because:
- Seeing more than we should has the real potential to confuse
userspace, especially a userspace that replays uevents.
- Some device control must happen through writing to sysfs files and
if we don't remove all root privileges from a container, only by
exporting a subset of sysfs to that container can we limit which
sysfs nodes can be written to.
The current kernel tagged sysfs entry support does not look like a good
match for implementing device filtering. The common case will
be allowing devices like /dev/zero, and /dev/null that live in
/sys/devices/virtual and are the devices we are most likely to care
about. Those devices need to live in multiple device namespaces so
everyone can use them. Perhaps exclusive assignment will be the more
common paradigm for device namespaces like it is for network devices in
the network namespace but from what little I can see of this problem right now I
don't think so.
I definitely think we should hold off on a kernel level implementation
until we really understand the issues and are ready to implement device
namespaces correctly.
A userspace implementation looks like it can only do about 95% of what
is really needed, but at the same time looks like an easy way to
experiment until the problem is sufficiently well understood.
At the end of the day we need to filter the devices a set of userspace
processes can use and be able to change that set of devices dynamically.
All of the rest of the infrastructure for that lives in the kernel, and
keeping all of the infrastructure in one place where it can be
maintained together is likely to be most maintainable. It looks like
the code is just complicated enough and the use cases just boring enough
that spreading the code to perform container device hotplug and
container device filtering between a dozen userspace tools, and a handful
of userspace device managers will not be particularly manageable at the
end of the day.
In summary the situation with device hotplug and containers sucks today,
and we need to do something. Running a linux desktop in a container is
a reasonably good example use case. Having one standard common
maintainable implementation would be very useful and the most logical
place for that would be in the kernel. For now we should focus on
simple device filtering and hotplug.
Eric
* Re: Device Namespaces
@ 2013-09-26 5:33 ` Greg Kroah-Hartman
2013-10-01 6:19 ` Janne Karhunen
0 siblings, 1 reply; 11+ messages in thread
From: Greg Kroah-Hartman @ 2013-09-26 5:33 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Linux Containers, Kay Sievers, Andy Lutomirski,
devel-GEFAQzZX7r8dnm+yROfE0A, lxc-devel,
mhw-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Stephane Graber
On Wed, Sep 25, 2013 at 02:34:54PM -0700, Eric W. Biederman wrote:
> So the big issues for a device namespace to solve are filtering which
> devices a container has access to and being able to dynamically change
> which devices those are at run time (aka hotplug).
As _all_ devices are hotpluggable now (look, there's no CONFIG_HOTPLUG
anymore, because it was redundant), I think you need to really think
this through better (pci, memory, cpus, etc.) before you do anything in
the kernel.
> After having thought about this for a bit I don't know if a pure
> userspace solution is sufficient or actually a good idea.
>
> - We can manually manage a tmpfs with device nodes in userspace.
> (But that is deprecated functionality in the mainstream kernel).
Yes, but I'm not going to namespace devtmpfs, as that is going to be an
impossible task, right?
And remember, udev doesn't create device nodes anymore...
> - We can manually export a subset of sysfs with bind mounts.
> (But that feels hacky, and is essentially incompatible with hotplug).
True.
> - We can relay a call of /sbin/hotplug from outside of a container
> to inside of a container based on policy.
> (But no one uses /sbin/hotplug anymore).
That's right, they should be listening to libudev events, so why can't
your daemon shuffle them off to the proper container, all in userspace?
> - There is no way to fake netlink uevents for a container to see them.
> (The best we could do is replace udev everywhere with something that
> listens on a unix domain socket).
You shouldn't need to do this.
> - It would be nice to replace the device cgroup with a comprehensive
> solution that really works. (Among other things the device cgroup
> does not work in terms of struct device the underlying kernel
> abstraction for devices).
I didn't even know there was a device cgroup.
Which means that if there is one, odds are it's useless.
> We must manage sysfs entries as well as device nodes because:
> - Seeing more than we should has the real potential to confuse
> userspace, especially a userspace that replays uevents.
You should never replay uevents. If you don't do that, why can't you
see all of sysfs?
> - Some device control must happen through writing to sysfs files and
> if we don't remove all root privileges from a container only by
> exporting a subset of sysfs to that container can we limit which
> sysfs nodes can be written to.
But you have the issue of controlling devices in a "shared" way, which
isn't going to be usable for almost all devices.
> The current kernel tagged sysfs entry support does not look like a good
> match for implementing device filtering. The common case will
> be allowing devices like /dev/zero, and /dev/null that live in
> /sys/devices/virtual and are the devices we are most likely to care
> about. Those devices need to live in multiple device namespaces so
> everyone can use them. Perhaps exclusive assignment will be the more
> common paradigm for device namespaces like it is for network devices in
> the network namespace but from what little I can see of this problem right now I
> don't think so.
>
> I definitely think we should hold off on a kernel level implementation
> until we really understand the issues and are ready to implement device
> namespaces correctly.
I agree, especially as I don't think this will ever work.
> A userspace implementation looks like it can only do about 95% of what
> is really needed, but at the same time looks like an easy way to
> experiment until the problem is sufficiently well understood.
95% is probably way better than what you have today, and will fit the
needs of almost everyone today, so why not do it?
I'd argue that those last 5% either are custom solutions that never get
merged, or candidates for true virtualization.
> In summary the situation with device hotplug and containers sucks today,
> and we need to do something. Running a linux desktop in a container is
> a reasonably good example use case.
No it isn't. I'd argue that this is a horrible use case, one that you
shouldn't do. Why not just use multi-head machines like people do who
really want to do this, relying on user separation? That's a workable
solution that is quite common and works very well today.
> Having one standard common maintainable implementation would be very
> useful and the most logical place for that would be in the kernel.
> For now we should focus on simple device filtering and hotplug.
Just listen for libudev stuff, don't try to filter them, or ever
"replay" them, that way lies madness, and lots of nasty race conditions
that are guaranteed to break things.
good luck,
greg k-h
* Re: Device Namespaces
@ 2013-10-01 6:19 ` Janne Karhunen
2013-10-01 17:27 ` Andy Lutomirski
0 siblings, 1 reply; 11+ messages in thread
From: Janne Karhunen @ 2013-10-01 6:19 UTC (permalink / raw)
To: Greg Kroah-Hartman
Cc: Linux Containers, Kay Sievers, Andy Lutomirski, devel,
Eric W. Biederman, lxc-devel, mhw, Stephane Graber
On Thu, Sep 26, 2013 at 8:33 AM, Greg Kroah-Hartman
<gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r@public.gmane.org> wrote:
>> - We can relay a call of /sbin/hotplug from outside of a container
>> to inside of a container based on policy.
>> (But no one uses /sbin/hotplug anymore).
>
> That's right, they should be listening to libudev events, so why can't
> your daemon shuffle them off to the proper container, all in userspace?
Which reminds me, one potential reason being..
http://lists.linuxfoundation.org/pipermail/containers/2013-May/032591.html
--
Janne
* Re: Device Namespaces
@ 2013-10-01 17:27 ` Andy Lutomirski
2013-10-01 17:53 ` Serge E. Hallyn
0 siblings, 1 reply; 11+ messages in thread
From: Andy Lutomirski @ 2013-10-01 17:27 UTC (permalink / raw)
To: Janne Karhunen
Cc: Greg Kroah-Hartman, Linux Containers, Kay Sievers,
Stephane Graber, Eric W. Biederman, lxc-devel, mhw, devel
On Tue, Oct 1, 2013 at 7:19 AM, Janne Karhunen <janne.karhunen-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> On Thu, Sep 26, 2013 at 8:33 AM, Greg Kroah-Hartman
> <gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r@public.gmane.org> wrote:
>
>>> - We can relay a call of /sbin/hotplug from outside of a container
>>> to inside of a container based on policy.
>>> (But no one uses /sbin/hotplug anymore).
>>
>> That's right, they should be listening to libudev events, so why can't
>> your daemon shuffle them off to the proper container, all in userspace?
>
> Which reminds me, one potential reason being..
> http://lists.linuxfoundation.org/pipermail/containers/2013-May/032591.html
>
Can't the daemon live outside the container and shuffle stuff in?
IOW, there seems to be little point in containerizing things if you're
just going to punch a privilege hole in the namespace.
FWIW, I think that the capability evolution rules are crap, but
changing them is a can of worms, and enough people seem to think the
status quo is acceptable that this is unlikely to ever get fixed.
--Andy
* Re: Device Namespaces
@ 2013-10-01 17:53 ` Serge E. Hallyn
2013-10-01 19:51 ` Eric W. Biederman
0 siblings, 1 reply; 11+ messages in thread
From: Serge E. Hallyn @ 2013-10-01 17:53 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Kay Sievers, Linux Containers, lxc-devel, Stephane Graber,
Eric W. Biederman, Greg Kroah-Hartman, mhw, devel
Quoting Andy Lutomirski (luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org):
> On Tue, Oct 1, 2013 at 7:19 AM, Janne Karhunen <janne.karhunen-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> > On Thu, Sep 26, 2013 at 8:33 AM, Greg Kroah-Hartman
> > <gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r@public.gmane.org> wrote:
> >
> >>> - We can relay a call of /sbin/hotplug from outside of a container
> >>> to inside of a container based on policy.
> >>> (But no one uses /sbin/hotplug anymore).
> >>
> >> That's right, they should be listening to libudev events, so why can't
> >> your daemon shuffle them off to the proper container, all in userspace?
> >
> > Which reminds me, one potential reason being..
> > http://lists.linuxfoundation.org/pipermail/containers/2013-May/032591.html
> >
>
> Can't the daemon live outside the container and shuffle stuff in?
That's exactly what Michael Warfield is suggesting, fwiw.
-serge
* Re: Device Namespaces
@ 2013-10-01 19:51 ` Eric W. Biederman
2013-10-01 20:46 ` Serge Hallyn
0 siblings, 1 reply; 11+ messages in thread
From: Eric W. Biederman @ 2013-10-01 19:51 UTC (permalink / raw)
To: Serge E. Hallyn
Cc: Kay Sievers, Linux Containers, lxc-devel, Andy Lutomirski, devel,
Greg Kroah-Hartman, mhw, Stephane Graber
"Serge E. Hallyn" <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> writes:
> Quoting Andy Lutomirski (luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org):
>> On Tue, Oct 1, 2013 at 7:19 AM, Janne Karhunen <janne.karhunen-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>> > On Thu, Sep 26, 2013 at 8:33 AM, Greg Kroah-Hartman
>> > <gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r@public.gmane.org> wrote:
>> >
>> >>> - We can relay a call of /sbin/hotplug from outside of a container
>> >>> to inside of a container based on policy.
>> >>> (But no one uses /sbin/hotplug anymore).
>> >>
>> >> That's right, they should be listening to libudev events, so why can't
>> >> your daemon shuffle them off to the proper container, all in userspace?
>> >
>> > Which reminds me, one potential reason being..
>> > http://lists.linuxfoundation.org/pipermail/containers/2013-May/032591.html
>> >
>>
>> Can't the daemon live outside the container and shuffle stuff in?
>
> That's exactly what Michael Warfield is suggesting, fwiw.
Michael Warfield's example of dynamically assigning serial ports to
containers is a pretty good test case. Serial ports are extremely well
known kernel objects whose evolution effectively stopped long ago. When
we need it we have ptys to virtualize serial ports, but in
general unprivileged users are safe to directly use a serial port
device.
Glossing over the details, the general problem is that some policy exists
outside of the container that decides if and when a container gets a
serial port and stuffs it in.
The expectation is that system containers will then run the udev
rules and send the libuevent event.
To make that all work without kernel modifications requires placing
a faux-udev in the container, that listens for a device assignment from
outside the container and then does exactly what udev would have done.
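A rough sketch of the handoff such a faux-udev would need, in Python. It
assumes a made-up wire format (one JSON object per line) over a socket
shared between the host-side policy daemon and the listener inside the
container; socket.socketpair() stands in for whatever channel would really
be used, and what the listener then does with the message (running rules,
emitting events) is exactly the hard part and is not shown.

```python
import json
import socket


def send_assignment(sock, action, devname, major, minor):
    """Host side: announce that a device is being added to or removed from the container."""
    msg = {"action": action, "devname": devname, "major": major, "minor": minor}
    sock.sendall(json.dumps(msg).encode() + b"\n")


def receive_assignment(sock):
    """Container side: read exactly one newline-terminated assignment message."""
    buf = bytearray()
    while True:
        byte = sock.recv(1)  # byte-at-a-time keeps message framing trivial in a sketch
        if not byte:
            raise ConnectionError("host closed the channel")
        if byte == b"\n":
            break
        buf += byte
    return json.loads(buf.decode())


# Both ends live in one process here only so the sketch is self-contained.
host_end, container_end = socket.socketpair()
send_assignment(host_end, "add", "ttyS1", 4, 65)
event = receive_assignment(container_end)
host_end.close()
container_end.close()
```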
The problems with this that I see are:
- udev is a moving target making it hard to build a faux-udev that will
work everywhere.
- On distros running systemd, the systemd and udev integration is
sufficiently tight that I am not certain a faux-udev is possible or will
continue to be possible.
- There are two other widely deployed solutions for managing hotplug
devices besides udev.
So given these difficulties I do not believe that the evolution of linux
device management is done, and I expect that patches to udev, the kernel
or both will be needed. While it would be good for testing and understanding
the problem I don't think a faux-udev will be a long term maintainable
solution.
I also understand the point that we aren't talking patches yet and just
discussing ideas. Right now it is my hope that if we talk this out we
can figure out a general direction that has a hope of working.
From where I am standing faking uevents instead of replacing
udev/mdev/whatever looks simpler and more maintainable.
Eric
* Re: Device Namespaces
@ 2013-10-01 20:46 ` Serge Hallyn
2013-10-01 22:59 ` [lxc-devel] " Michael H. Warfield
0 siblings, 1 reply; 11+ messages in thread
From: Serge Hallyn @ 2013-10-01 20:46 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Greg Kroah-Hartman, Linux Containers, Kay Sievers,
Stephane Graber, Andy Lutomirski, lxc-devel, mhw, devel
Quoting Eric W. Biederman (ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org):
> "Serge E. Hallyn" <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> writes:
>
> > Quoting Andy Lutomirski (luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org):
> >> On Tue, Oct 1, 2013 at 7:19 AM, Janne Karhunen <janne.karhunen-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> >> > On Thu, Sep 26, 2013 at 8:33 AM, Greg Kroah-Hartman
> >> > <gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r@public.gmane.org> wrote:
> >> >
> >> >>> - We can relay a call of /sbin/hotplug from outside of a container
> >> >>> to inside of a container based on policy.
> >> >>> (But no one uses /sbin/hotplug anymore).
> >> >>
> >> >> That's right, they should be listening to libudev events, so why can't
> >> >> your daemon shuffle them off to the proper container, all in userspace?
> >> >
> >> > Which reminds me, one potential reason being..
> >> > http://lists.linuxfoundation.org/pipermail/containers/2013-May/032591.html
> >> >
> >>
> >> Can't the daemon live outside the container and shuffle stuff in?
> >
> > That's exactly what Michael Warfield is suggesting, fwiw.
>
> Michael Warfield's example of dynamically assigning serial ports to
> containers is a pretty good test case. Serial ports are extremely well
> known kernel objects whose evolution effectively stopped long ago. When
> we need it we have ptys to virtualize serial ports, but in
> general unprivileged users are safe to directly use a serial port
> device.
>
> Glossing over the details, the general problem is that some policy exists
> outside of the container that decides if and when a container gets a
> serial port and stuffs it in.
>
> The expectation is that system containers will then run the udev
> rules and send the libuevent event.
I thought the suggestion was that udev on the host would be given
container-specific rules, saying "plop this device into /dev/container1/"
(with /dev/container1 being bind-mounted to $container1_rootfs/dev).
-serge
* Re: [lxc-devel] Device Namespaces
2013-10-01 20:46 ` Serge Hallyn
@ 2013-10-01 22:59 ` Michael H. Warfield
0 siblings, 0 replies; 11+ messages in thread
From: Michael H. Warfield @ 2013-10-01 22:59 UTC (permalink / raw)
To: Serge Hallyn
Cc: Greg Kroah-Hartman, Michael H.Warfield, Kay Sievers,
Andy Lutomirski, Eric W. Biederman, lxc-devel, Linux Containers,
devel
On Tue, 2013-10-01 at 15:46 -0500, Serge Hallyn wrote:
> Quoting Eric W. Biederman (ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org):
> > "Serge E. Hallyn" <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> writes:
> >
> > > Quoting Andy Lutomirski (luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org):
> > >> On Tue, Oct 1, 2013 at 7:19 AM, Janne Karhunen <janne.karhunen@gmail.com> wrote:
> > >> > On Thu, Sep 26, 2013 at 8:33 AM, Greg Kroah-Hartman
> > >> > <gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r@public.gmane.org> wrote:
> > >> >
> > >> >>> - We can relay a call of /sbin/hotplug from outside of a container
> > >> >>> to inside of a container based on policy.
> > >> >>> (But no one uses /sbin/hotplug anymore).
> > >> >>
> > >> >> That's right, they should be listening to libudev events, so why can't
> > >> >> your daemon shuffle them off to the proper container, all in userspace?
> > >> >
> > >> > Which reminds me, one potential reason being..
> > >> > http://lists.linuxfoundation.org/pipermail/containers/2013-May/032591.html
> > >> >
> > >>
> > >> Can't the daemon live outside the container and shuffle stuff in?
> > >
> > > That's exactly what Michael Warfield is suggesting, fwiw.
> >
> > Michael Warfield's example of dynamically assigning serial ports to
> > containers is a pretty good test case. Serial ports are extremely well
> > known kernel objects whose evolution effectively stopped long ago. When
> > we need it we have ptys to virtualize serial ports, but in
> > general unprivileged users are safe to directly use a serial port
> > device.
> >
> > Glossing over the details, the general problem is that some policy exists
> > outside of the container that decides if and when a container gets a
> > serial port and stuffs it in.
> >
> > The expectation is that system containers will then run the udev
> > rules and send the libuevent event.
> I thought the suggestion was that udev on the host would be given
> container-specific rules, saying "plop this device into /dev/container1/"
> (with /dev/container1 being bind-mounted to $container1_rootfs/dev).
I think that the "given container-specific rules, saying..." thing was
on my chart of options as the one with the big cloudy shaped object in
the lower right corner labeled "and then a miracle occurs".
The basic part is the mapping from /dev into /dev/lxc/container. That
should be doable based on the rules in the host and a basic udev trigger
along with a simple mapping configuration. The "given
container-specific" part becomes a morass if it gets complicated enough.
What I was envisioning was a very simple system of container specific
{match} and {map} objects. If a name or symlink passed to the daemon
from a udev trigger matched a match, then the name and symlinks and
additional maps would be mapped into the appropriate container
subdirectory. That works real well if the container and host udev rules
are congruent.
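The {match} and {map} idea above could look roughly like the following
Python sketch. The configuration layout, the glob patterns, and the
function name map_device are assumptions made for illustration; only the
/dev/lxc/<container> target comes from the description above. It covers
the congruent case only and computes the paths without creating anything.

```python
import fnmatch
import os

# Hypothetical per-container configuration: glob patterns to match against
# the device name or its symlinks, and the container's subdirectory on the host.
CONTAINERS = {
    "container1": {
        "match": ["ttyUSB*", "serial/by-id/*"],
        "root": "/dev/lxc/container1",
    },
}


def map_device(name, symlinks, containers=CONTAINERS):
    """Return {container: [host paths to create]} for a device reported by a udev trigger."""
    result = {}
    candidates = [name] + list(symlinks)
    for cname, cfg in containers.items():
        if any(fnmatch.fnmatch(c, pat) for c in candidates for pat in cfg["match"]):
            # On a match, the name and all of its symlinks are mapped together.
            result[cname] = [os.path.join(cfg["root"], c) for c in candidates]
    return result
```

For example, map_device("ttyUSB0", ["serial/by-id/usb-x"]) maps both the
node name and its symlink under /dev/lxc/container1, while a device that
matches no pattern maps nowhere.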
The tough part is the "container-specific" rules which was the part I
specifically mentioned that I had no clue how to make happen. That's a
non-trivial task if the container is allowed to make arbitrary udev rule
changes based on what they are allowed to receive from the host (and how
do we trigger the changes in the host when a change is made in the
container).
It's easily doable where the container rules are congruent with the host
rules. Where they are orthogonal gets much more complicated. But...
All that being said, I will take the congruent solution as a starting
point (and that will not be an 80% solution - it will be more like a 95%
solution) and we can argue about the corner cases and deltas after that.
Doable, yes, for some value of doable.
I like what Greg was saying about using libudev but I'm totally in the
dark as to how to effectively hook that or if it would even work in the
container. That one is not in my realm.
> -serge
Regards,
Mike
--
Michael H. Warfield (AI4NB) | Desk: (404) 236-2807
Senior Researcher - X-Force | Cell: (678) 463-0932
IBM Security Services | mhw-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org mhw-UGBql2FAF+1Wk0Htik3J/w@public.gmane.org
6303 Barfield Road | http://www.iss.net/
Atlanta, Georgia 30328 | http://www.wittsend.com/mhw/
| PGP Key: 0x674627FF
Thread overview: 11+ messages
[not found] <CALRD3qKmpzJCRszkG_S9Z3XgoTGWVMFd7FqeJh+W-9pZqPVhCg@mail.gmail.com>
2014-09-24 5:04 ` [lxc-devel] device namespaces Eric W. Biederman
[not found] ` <CALRD3qKPJHmmY2DSNNfNKzmLihDLm9fgBQprCXNMHVOArV4iuw@mail.gmail.com>
2014-09-24 16:37 ` Serge Hallyn
2014-09-24 17:43 ` Using devices in Containers (was: [lxc-devel] device namespaces) Eric W. Biederman
2014-09-24 19:30 ` Riya Khanna
2014-09-24 22:38 ` Using devices in Containers Eric W. Biederman
[not found] ` <CALRD3qLYAc+K8e1xYb27ipi4KyGRmTxokPCHN0L_zta=Cy9sCQ@mail.gmail.com>
2014-09-25 15:40 ` riya khanna
2014-09-25 18:09 ` Eric W. Biederman
2014-09-25 18:21 ` Eric W. Biederman
2014-09-24 19:07 ` [lxc-devel] device namespaces Riya Khanna
2014-09-24 16:38 ` Serge Hallyn
2013-09-08 12:28 RFC: Device Namespaces Amir Goldstein
2013-09-09 0:51 ` Eric W. Biederman
2013-09-10 7:09 ` Amir Goldstein
2013-09-25 11:05 ` Janne Karhunen
2013-09-25 21:34 ` Eric W. Biederman
2013-09-26 5:33 ` Greg Kroah-Hartman
2013-10-01 6:19 ` Janne Karhunen
2013-10-01 17:27 ` Andy Lutomirski
2013-10-01 17:53 ` Serge E. Hallyn
2013-10-01 19:51 ` Eric W. Biederman
2013-10-01 20:46 ` Serge Hallyn
2013-10-01 22:59 ` [lxc-devel] " Michael H. Warfield