* RFC: Device Namespaces
From: Oren Laadan @ 2013-08-22 17:43 UTC
To: Linux Containers; +Cc: lxc-devel

Hi everyone!

We [1] have been working on bringing lightweight virtualization to
Linux-based mobile devices like Android (or other Linux-based devices with
diverse I/O) and want to share our solution: device namespaces.

Imagine you could run several instances of your favorite mobile OS or other
distributions in isolated containers, each under the impression of having
exclusive access to device drivers; interact and switch between them within
a blink, no flashing, no reboot.

Device namespaces are an extension to existing Linux kernel namespaces that
brings lightweight virtualization to Linux-based end-user devices,
primarily mobile devices.
Device namespaces introduce a private and virtual namespace for device
drivers to create the illusion for a process group that it interacts
exclusively with a set of drivers. Device namespaces also introduce the
concepts of an “active” namespace with which a user interacts, vs
“non-active” namespaces that run in the background, and the ability to
switch between them. [2]

We are planning to prepare individual patches to be submitted to the
relevant maintainers and mailing lists. In the meantime, we already want to
share a set of patches on top of the Android goldfish Kernel 3.4 as well as
a user-space demo, so you can see where we are heading, get an overview of
the approach, and see how it works.

We are aware that the patches are not ready for submission in their current
state, and we'd highly appreciate any feedback or suggestions that come to
mind once you have a look [3]. Of particular interest is working out a
proper userspace API with respect to existing and future use cases.

To illustrate a simple use case we also provide a simple userspace demo for
Android [4].

I will be presenting "The Case for Linux Device Namespace" [5] at LinuxCon
North America 2013 [6]. We will also be attending the Containers Track [7]
at LPC 2013 to present the current state of the patches and discuss the
best course to proceed.

We are looking forward to hearing from you!

Thanks,

Oren.

1: http://www.cellrox.com/
2: https://github.com/Cellrox/devns-patches/wiki/DeviceNamespace
3: https://github.com/Cellrox/devns-patches
4: https://github.com/Cellrox/devns-demo
5: http://sched.co/1asN1v7
6: http://events.linuxfoundation.org/events/linuxcon-north-america
7: http://www.linuxplumbersconf.org/2013/ocw/events/LPC2013/tracks/153

--
Oren Laadan
Cellrox Ltd.
* Re: RFC: Device Namespaces
From: Serge Hallyn @ 2013-08-22 18:21 UTC
To: Oren Laadan; +Cc: Linux Containers, lxc-devel

Quoting Oren Laadan (orenl@cellrox.com):
> Device namespaces introduce a private and virtual namespace for device
> drivers to create the illusion for a process group that it interacts
> exclusively with a set of drivers. Device namespaces also introduce the
> concepts of an “active” namespace with which a user interacts, vs
> “non-active” namespaces that run in the background, and the ability to
> switch between them.[2]

Note that unless I'm misunderstanding what you're saying here, this is
also what net_ns does. A netns can exist with no processes so long as
you've bind-mounted its /proc/$$/ns/net somewhere. You can then re-enter
that ns using setns(2). I haven't looked closely enough yet to see whether
you should be (or are) using the same interface.

> We are planning to prepare individual patches to be submitted to the

Looking forward to it, and seeing you at the containers track :)

> 2: https://github.com/Cellrox/devns-patches/wiki/DeviceNamespace
> 3: https://github.com/Cellrox/devns-patches
> 4: https://github.com/Cellrox/devns-demo

(Have looked over the wiki, will look over the patches as well)

-serge
_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/containers
* Re: RFC: Device Namespaces
From: Oren Laadan @ 2013-08-26 10:11 UTC
Cc: Linux Containers, lxc-devel

Hi Serge,

On Thu, Aug 22, 2013 at 2:21 PM, Serge Hallyn <serge.hallyn@ubuntu.com> wrote:

> Quoting Oren Laadan (orenl@cellrox.com):
> > Device namespaces introduce a private and virtual namespace for device
> > drivers to create the illusion for a process group that it interacts
> > exclusively with a set of drivers. Device namespaces also introduce the
> > concepts of an “active” namespace with which a user interacts, vs
> > “non-active” namespaces that run in the background, and the ability to
> > switch between them.[2]
>
> Note that unless I'm misunderstanding what you're saying here, this is
> also what net_ns does. A netns can exist with no processes so long as
> you've bind-mounted its /proc/$$/ns/net somewhere. You can then re-enter
> that ns using setns(2). I haven't looked closely enough yet to see whether
> you should be (or are) using the same interface.

To illustrate the need for device namespaces, consider the use case of
running two containers of your favorite OS (say, Android) on a single
physical phone. As a user, you either work in one container or in the
other, and you will want to be able to switch between them (just like with
apps on mobile devices: you interact with one application at a time, and
switch between them).

See here for a demo of how it works: http://vimeo.com/60113683

To accomplish this, device namespaces solve two shortcomings of existing
namespaces:

1. A namespace for device drivers: each (Android) container needs a
private view of all devices. This includes logical drivers, like binder (in
Android) but also the loop device; and physical devices, like the
framebuffer and the touch-screen.

In other words, device namespaces virtualize the _major/minor_ and the
_state_ of device drivers. With the exception of VFS, network, and PTY
(note: all three offer/are virtual devices), device drivers are otherwise
not isolated between containers.

2. A namespace for interactive scenarios: a namespace can be "active", in
which case it has access to the hardware, e.g. display and touch-screen.
This will be the container with which the user is interacting right now.
Otherwise a namespace is "non-active": it still runs in the background, but
can neither alter the display nor receive input from the touch-screen.
Switching to another container means a context switch in the relevant
drivers, so that they restore the state and now "obey" the other namespace.

You can also think about the "active" namespace as foreground, and the
"non-active" as background, akin to foreground/background processes in a
terminal with job control. Similar to how a terminal delivers input to the
foreground task only and not to the background tasks, this is enforced by
the new device namespace.

More details on this use case are in the wiki:
https://github.com/Cellrox/devns-patches/wiki/Thinvisor

> > We are planning to prepare individual patches to be submitted to the
>
> Looking forward to it, and seeing you at the containers track :)

Same here!

Thanks,

Oren.

--
Oren Laadan
Cellrox Ltd.
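The active/non-active behaviour described in the message above can be pictured with a small toy model. This is only a sketch under stated assumptions, not the code from the devns patches: the class `ToyDevice`, its methods, and the namespace names are all hypothetical, and a real driver would save and restore hardware state rather than a string. It shows per-namespace state being kept for every namespace while only the active one reaches the "hardware", and input being delivered to the foreground namespace only.

```python
class ToyDevice:
    """Hypothetical model of a display/input driver that is devns-aware."""

    def __init__(self):
        self.active = None   # namespace currently in the foreground
        self.saved = {}      # per-namespace virtual state (last frame drawn)
        self.screen = None   # what the physical display shows right now

    def switch_to(self, ns):
        # "Context switch" in the driver: restore the new namespace's state.
        self.active = ns
        self.screen = self.saved.get(ns)

    def draw(self, ns, frame):
        # Every namespace keeps its own virtual state ...
        self.saved[ns] = frame
        # ... but only the active namespace may alter the display.
        if ns == self.active:
            self.screen = frame

    def deliver_input(self, event):
        # Input goes to the foreground namespace only, like a terminal
        # delivering keystrokes to the foreground job.
        return (self.active, event)


dev = ToyDevice()
dev.switch_to("personal")
dev.draw("personal", "home screen")
dev.draw("work", "inbox")   # background namespace: recorded, not displayed
```

After these calls the display still shows the "personal" frame; calling `dev.switch_to("work")` would restore the frame that the background namespace drew earlier.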
* Re: RFC: Device Namespaces
From: Eric W. Biederman @ 2013-09-06 17:50 UTC
To: Oren Laadan; +Cc: Linux Containers, lxc-devel

Oren Laadan <orenl@cellrox.com> writes:

> More details on this use-case are in the wiki:
> https://github.com/Cellrox/devns-patches/wiki/Thinvisor

I think this is going to take some talking, and looking at code.

I think you are talking about having wrappers around your devices so you
can share, which is not quite the same problem the rest of us have been
thinking of when talking about a device namespace.

My first impression is that this is better solved with more appropriate
abstractions in userspace or in the kernel.

But we can talk at LPC and see what we can hash out.

Eric
* Re: RFC: Device Namespaces
From: Amir Goldstein @ 2013-09-08 12:28 UTC
To: Eric W. Biederman; +Cc: Linux Containers, lxc-devel

On Fri, Sep 6, 2013 at 7:50 PM, Eric W. Biederman <ebiederm@xmission.com> wrote:

> I think this is going to take some talking, and looking at code.

Hi Eric,

If we can get people to take a quick look at the code before LPC, that
could make the LPC discussions more effective.
Even looking at one of the subsystem patches can give a basic idea of the
work we have done:
https://github.com/Cellrox/linux/commits/devns-goldfish-3.4

> I think you are talking about having wrappers around your devices so you
> can share, which is not quite the same problem the rest of us have been
> thinking of when talking about a device namespace.

We are interested in all problems related to a virtualized view of devices
inside a container, so let our work so far be a starting point to discuss
all of them.

> My first impression is that this is better solved with more appropriate
> abstractions in userspace or in the kernel.
>
> But we can talk at LPC and see what we can hash out.

Looking forward to that :-)

Amir.
* Re: RFC: Device Namespaces
From: Eric W. Biederman @ 2013-09-09 0:51 UTC
To: Amir Goldstein; +Cc: Linux Containers, lxc-devel

Amir Goldstein <amir@cellrox.com> writes:

> Hi Eric,
>
> If we can get people to take a quick look at the code before LPC
> that could make the LPC discussions more effective.
> Even looking at one of the subsystem patches can give a basic
> idea of the work we have done:
> https://github.com/Cellrox/linux/commits/devns-goldfish-3.4
>
> We are interested in all problems related to a virtualized view of
> devices inside a container, so let our work so far be a starting point to
> discuss all of them.
>
>> My first impression is that this is better solved with more
>> appropriate abstractions in userspace or in the kernel.

As I read your code, you are solving the problem of one opener of a
device among a group of openers being able to access a device at a time.
Which leads to the question: why can't the multiplexing happen in
userspace?

I think with your design it would not be possible to play a song in one
device namespace while doing work in the other. As a security model
that isn't wrong, but as someone trying to get work done that could be a
real pain.

The more common concern is to have devices we can use all of the time.

There may be a need for a device namespace, and multiplexing access to
hardware devices makes that clearer. So far nothing has risen to the
level of "we actually need a device namespace to do X", especially in an
era of hotplug and dynamic device numbers.

It is arguable that you could do your kind of device multiplexing with a
fuse device in userspace that implements your desired policy.

And policy is where the cell situation seems to fall down, because it hard
codes one specific policy into the kernel, and a policy most situations
don't find useful.

Eric
* Re: RFC: Device Namespaces
From: Amir Goldstein @ 2013-09-10 7:09 UTC
To: Eric W. Biederman; +Cc: Linux Containers, lxc-devel

On Mon, Sep 9, 2013 at 2:51 AM, Eric W. Biederman <ebiederm@xmission.com> wrote:

> As I read your code, you are solving the problem of one opener of a
> device among a group of openers being able to access a device at a time.
> Which leads to the question: why can't the multiplexing happen in
> userspace?
>
> I think with your design it would not be possible to play a song in one
> device namespace while doing work in the other. As a security model
> that isn't wrong, but as someone trying to get work done that could be a
> real pain.

As a matter of fact, in our multi-persona phone, you *can* hear music
played from a background persona, but you *cannot* see images drawn from a
background persona.

> The more common concern is to have devices we can use all of the time.
>
> There may be a need for a device namespace, and multiplexing access to
> hardware devices makes that clearer. So far nothing has risen to the
> level of "we actually need a device namespace to do X", especially in an
> era of hotplug and dynamic device numbers.
>
> It is arguable that you could do your kind of device multiplexing with a
> fuse device in userspace that implements your desired policy.

I agree about it being arguable :-)
We shall present our arguments at LPC.

> And policy is where the cell situation seems to fall down, because it hard
> codes one specific policy into the kernel, and a policy most situations
> don't find useful.

It's true that for our product we have made hardcoded policy decisions in
our kernel patches, but that was just a proof of concept for the
technique.
We do envision being able to dynamically assign a device to a specific
devns (e.g. block, loop), keep a device shared between multiple devns
(e.g. audio) and, in addition to that, multiplex a device between multiple
devns (e.g. framebuffer).
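The three kinds of device handling envisioned in the message above (assigned, shared, multiplexed) can be written down as a small decision function. This is a sketch under stated assumptions and not an interface from the devns patches: the names `Policy` and `may_access` and the namespace labels are hypothetical, chosen only to make the distinction concrete.

```python
from enum import Enum


class Policy(Enum):
    ASSIGNED = "assigned"        # owned by exactly one devns (e.g. a loop device)
    SHARED = "shared"            # usable by every devns at once (e.g. audio)
    MULTIPLEXED = "multiplexed"  # only the active devns reaches it (e.g. framebuffer)


def may_access(policy, ns, owner=None, active=None):
    """Return True if namespace `ns` may use a device under `policy`.

    `owner` is the devns an ASSIGNED device belongs to; `active` is the
    foreground devns, which matters only for MULTIPLEXED devices.
    """
    if policy is Policy.ASSIGNED:
        return ns == owner
    if policy is Policy.SHARED:
        return True
    return ns == active


# A background persona can still play music but cannot draw to the screen.
audio_ok = may_access(Policy.SHARED, "work", active="personal")
fb_ok = may_access(Policy.MULTIPLEXED, "work", active="personal")
```

Under this model `audio_ok` is true and `fb_ok` is false, which matches the multi-persona behaviour described above: audible from the background, not visible.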
* Re: RFC: Device Namespaces
From: Janne Karhunen @ 2013-09-25 11:05 UTC
To: Amir Goldstein; +Cc: Linux Containers, Eric W. Biederman, lxc-devel

On Tue, Sep 10, 2013 at 10:09 AM, Amir Goldstein <amir@cellrox.com> wrote:
> On Mon, Sep 9, 2013 at 2:51 AM, Eric W. Biederman <ebiederm@xmission.com> wrote:
>> Amir Goldstein <amir@cellrox.com> writes:
>> > If we can get people to take a quick look at the code before LPC
>> > that could make the LPC discussions more effective.

Hi,

I think we are curious enough to experiment with Eric's idea of
implementing a basic 'device namespace' in userspace (never miss an
opportunity to throw away kernel code). Can anyone point out any obvious
reason why this would not work if we consider the bulk of the work to be
plain access filtering?

That being said, is there a valid reason why binder is part of the device
namespace here instead of IPC?

--
Janne
* Re: RFC: Device Namespaces
From: Eric W. Biederman @ 2013-09-25 20:23 UTC
To: Janne Karhunen; +Cc: Linux Containers, lxc-devel

Janne Karhunen writes:

> That being said, is there a valid reason why binder is part of device
> namespace here instead of IPC?

I think the practical issue with binder was simply that binder only
allows for a single instance and thus it does not play nicely with
containers.

Eric
* Re: [lxc-devel] RFC: Device Namespaces
From: Jeremy Andrus @ 2013-09-25 21:17 UTC
To: Eric W. Biederman; +Cc: Linux Containers, lxc-devel

On Sep 25, 2013, at 4:23 PM, Eric W. Biederman <ebiederm@xmission.com> wrote:

> Janne Karhunen writes:
>
>> That being said, is there a valid reason why binder is part of device
>> namespace here instead of IPC?
>
> I think the practical issue with binder was simply that binder only
> allows for a single instance and thus it does not play nicely with
> containers.

It's true that there was a singleton paradigm in binder that had to be
overcome, but I agree with Janne. It really belongs in the IPC namespace,
and I don't see any technical reason not to move it there.

-Jeremy
* Re: [lxc-devel] RFC: Device Namespaces
From: Eric W. Biederman @ 2013-09-25 21:47 UTC
To: Jeremy Andrus; +Cc: Linux Containers, lxc-devel

Jeremy Andrus writes:

> It's true that there was a singleton paradigm in binder that had to be
> overcome, but I agree with Janne. It really belongs in the IPC namespace,
> and I don't see any technical reason not to move it there.

*Blink* I missed the IPC namespace suggestion.

The IPC namespace sounds reasonable.

Of course binder is still in staging because it has implementation and
ABI problems. Little things like a 64-bit kernel and a 32-bit userspace
don't work particularly well. So while fixing those problems it might
be possible to fix the single-container problem as well. It would be a
weird direction for cleanup of binder to come from, but I don't see why
that wouldn't work.

Personally, until binder is out of staging it seems reasonable to push
for an API that sucks less, or for a more general solution that Android
could use instead of binder.

One of the uses of namespaces is to clean up after problematic kernel
design decisions. If we still have the option, I would rather fix the
problems than clean up after them.

Eric
[parent not found: <8738osr2ue.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>]
* Re: [lxc-devel] RFC: Device Namespaces [not found] ` <8738osr2ue.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> @ 2013-09-29 17:56 ` Amir Goldstein 0 siblings, 0 replies; 38+ messages in thread
From: Amir Goldstein @ 2013-09-29 17:56 UTC
To: Eric W. Biederman; +Cc: Greg Kroah-Hartman, Linux Containers, lxc-devel

On Thu, Sep 26, 2013 at 12:47 AM, Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:

> Jeremy Andrus <jeremya-eQaUEPhvms7ENvBUuze7eA@public.gmane.org> writes:
>
> > On Sep 25, 2013, at 4:23 PM, Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
> >
> >> Janne Karhunen <janne.karhunen-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> writes:
> >>
> >>> That being said, is there a valid reason why binder is part of device
> >>> namespace here instead of IPC?
> >>
> >> I think the practical issue with binder was simply that binder only
> >> allows for a single instance and thus it does not play nicely with
> >> containers.
> >
> > It's true that there was a singleton paradigm in binder that had to be
> > overcome, but I agree with Janne. It really belongs in the IPC namespace,
> > and I don't see any technical reason not to move it there.
>
> *Blink* I missed the IPC namespace suggestion.
>
> The IPC namespace sounds reasonable.

A binder rewrite for the IPC namespace is in the works (by Oren).
We discussed this with Greg, and adding namespace support to binder
(in staging) seemed reasonable to him as well.

> Of course binder is still in staging because it has implementation and
> ABI problems. Little things like a 64bit kernel and a 32bit userspace
> don't work particularly well. So while fixing those problems it might
> be possible to fix the single container problem as well. It would be a
> weird direction for cleanup of binder to come from but I don't see why
> that wouldn't work.
>
> Personally until binder is out of staging it seems reasonable to push
> for an API that sucks less, or for a more general solution that Android
> could use instead of binder.
>
> One of the uses of namespaces is to clean up after problematic kernel
> design decisions. If we still have the option I would rather fix the
> problems than clean up after them.
>
> Eric
* Device Namespaces [not found] ` <CAE=NcrbyFFoMn2nfBA_=ZtwD=eGLvqK=L-U9MuGrtJFLZfZppw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2013-09-25 20:23 ` Eric W. Biederman @ 2013-09-25 21:34 ` Eric W. Biederman [not found] ` <87bo3gshz5.fsf_-_-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> 1 sibling, 1 reply; 38+ messages in thread
From: Eric W. Biederman @ 2013-09-25 21:34 UTC
To: Linux Containers
Cc: Greg Kroah-Hartman, mhw-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Kay Sievers, Andy Lutomirski, lxc-devel, Stephane Graber, devel-GEFAQzZX7r8dnm+yROfE0A

From conversations at the Linux Plumbers Conference it became fairly
clear that one of, if not the, roughest edges on containers today is
dealing with devices.

- Hotplug does not work.
- There seems to be no implementation today that does much beyond
  setting up a static set of /dev entries.
- Containers do not see the appropriate uevents for their container.

One of the more compelling cases I heard was of someone who was running
a Linux desktop in a container and wanted to just let that container see
the devices needed for his desktop, and not everything else.

Talking with the OpenVZ folks it appears that preserving device numbers
across checkpoint/restart is not currently an issue. However they reuse
the same loopback minor number when they can, which would hide this
issue. So while it is clear we don't need to worry about migrating an
application that cares about major/minor numbers of filesystems right
now, as the set of applications that are migrated increases that
situation may change.

As the case with the network device ifindex has shown, it is possible to
implement filtering now and later, when there is a use case, to expand
filtering to actual namespace-local identifiers.

Thinking about it, for the case of container migration the simplest
solution for the rare application that needs something more may be to
figure out how to send a kernel hotplug event. Something to think about
when we encounter them.

So the big issues for a device namespace to solve are filtering which
devices a container has access to and being able to dynamically change
which devices those are at run time (aka hotplug).

After having thought about this for a bit I don't know if a pure
userspace solution is sufficient or actually a good idea.

- We can manually manage a tmpfs with device nodes in userspace.
  (But that is deprecated functionality in the mainstream kernel).
- We can manually export a subset of sysfs with bind mounts.
  (But that feels hacky, and is essentially incompatible with hotplug).
- We can relay a call of /sbin/hotplug from outside of a container
  to inside of a container based on policy.
  (But no one uses /sbin/hotplug anymore).
- There is no way to fake netlink uevents for a container to see them.
  (The best we could do is replace udev everywhere with something that
  listens on a unix domain socket).
- It would be nice to replace the device cgroup with a comprehensive
  solution that really works. (Among other things the device cgroup
  does not work in terms of struct device, the underlying kernel
  abstraction for devices).

We must manage sysfs entries as well as device nodes because:

- Seeing more than we should has the real potential to confuse
  userspace, especially a userspace that replays uevents.
- Some device control must happen through writing to sysfs files, and
  if we don't remove all root privileges from a container, only by
  exporting a subset of sysfs to that container can we limit which
  sysfs nodes can be written to.

The current kernel tagged sysfs entry support does not look like a good
match for implementing device filtering. The common case will be
allowing devices like /dev/zero and /dev/null that live in
/sys/devices/virtual and are the devices we are most likely to care
about. Those devices need to live in multiple device namespaces so
everyone can use them. Perhaps exclusive assignment will be the more
common paradigm for device namespaces, like it is for network devices in
the network namespace, but from what little I can see of this problem
right now I don't think so.

I definitely think we should hold off on a kernel level implementation
until we really understand the issues and are ready to implement device
namespaces correctly.

A userspace implementation looks like it can only do about 95% of what
is really needed, but at the same time looks like an easy way to
experiment until the problem is sufficiently well understood.

At the end of the day we need to filter the devices a set of userspace
processes can use and be able to change that set of devices dynamically.
All of the rest of the infrastructure for that lives in the kernel, and
keeping all of the infrastructure in one place where it can be
maintained together is likely to be most maintainable. It looks like
the code is just complicated enough and the use cases just boring enough
that spreading the code to perform container device hotplug and
container device filtering between a dozen userspace tools and a handful
of userspace device managers will not be particularly manageable at the
end of the day.

In summary, the situation with device hotplug and containers sucks
today, and we need to do something. Running a Linux desktop in a
container is a reasonably good example use case.

Having one standard common maintainable implementation would be very
useful and the most logical place for that would be in the kernel.
For now we should focus on simple device filtering and hotplug.

Eric
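
A minimal sketch of the two requirements named in the message above
(filtering which devices a container may use, and changing that set at
run time), assuming a purely in-memory policy table. The class and method
names (DevicePolicy, grant, revoke) are hypothetical and do not come from
any posted patch. The rule strings follow the cgroup v1 devices
controller format ("c 1:3 rwm"); 1:3, 1:5 and 4:64 are the usual numbers
for /dev/null, /dev/zero and /dev/ttyS0.

```python
from collections import defaultdict


class DevicePolicy:
    """Hypothetical per-container device allow-list (illustration only)."""

    def __init__(self):
        # container name -> set of (type, major, minor) tuples
        self._allowed = defaultdict(set)

    def grant(self, container, dev_type, major, minor):
        """Allow a container to use a device. The same device may be
        granted to several containers (shared, not exclusive)."""
        self._allowed[container].add((dev_type, major, minor))

    def revoke(self, container, dev_type, major, minor):
        """Remove a device from a container at run time."""
        self._allowed[container].discard((dev_type, major, minor))

    def is_allowed(self, container, dev_type, major, minor):
        return (dev_type, major, minor) in self._allowed.get(container, set())

    def cgroup_rules(self, container):
        """Render the allow-list as cgroup v1 devices.allow lines."""
        return [f"{t} {ma}:{mi} rwm"
                for t, ma, mi in sorted(self._allowed.get(container, set()))]


policy = DevicePolicy()
for name in ("desktop", "server"):
    policy.grant(name, "c", 1, 3)    # /dev/null, shared by every container
    policy.grant(name, "c", 1, 5)    # /dev/zero, shared by every container
policy.grant("desktop", "c", 4, 64)  # /dev/ttyS0, granted to one container
```

The sketch only models the policy. It says nothing about sysfs
visibility or uevent delivery, which are the parts the message argues a
userspace-only approach handles poorly.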
[parent not found: <87bo3gshz5.fsf_-_-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>]
* Re: Device Namespaces [not found] ` <87bo3gshz5.fsf_-_-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> @ 2013-09-26 5:33 ` Greg Kroah-Hartman [not found] ` <20130926053320.GB3725-U8xfFu+wG4EAvxtiuMwx3w@public.gmane.org> 2013-10-28 23:31 ` Andrey Wagin 1 sibling, 1 reply; 38+ messages in thread
From: Greg Kroah-Hartman @ 2013-09-26 5:33 UTC
To: Eric W. Biederman
Cc: Linux Containers, Kay Sievers, Andy Lutomirski, devel-GEFAQzZX7r8dnm+yROfE0A, lxc-devel, mhw-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Stephane Graber

On Wed, Sep 25, 2013 at 02:34:54PM -0700, Eric W. Biederman wrote:
> So the big issues for a device namespace to solve are filtering which
> devices a container has access to and being able to dynamically change
> which devices those are at run time (aka hotplug).

As _all_ devices are hotpluggable now (look, there's no CONFIG_HOTPLUG
anymore, because it was redundant), I think you need to really think
this through better (pci, memory, cpus, etc.) before you do anything in
the kernel.

> After having thought about this for a bit I don't know if a pure
> userspace solution is sufficient or actually a good idea.
>
> - We can manually manage a tmpfs with device nodes in userspace.
>   (But that is deprecated functionality in the mainstream kernel).

Yes, but I'm not going to namespace devtmpfs, as that is going to be an
impossible task, right?

And remember, udev doesn't create device nodes anymore...

> - We can manually export a subset of sysfs with bind mounts.
>   (But that feels hacky, and is essentially incompatible with hotplug).

True.

> - We can relay a call of /sbin/hotplug from outside of a container
>   to inside of a container based on policy.
>   (But no one uses /sbin/hotplug anymore).

That's right, they should be listening to libudev events, so why can't
your daemon shuffle them off to the proper container, all in userspace?

> - There is no way to fake netlink uevents for a container to see them.
>   (The best we could do is replace udev everywhere with something that
>   listens on a unix domain socket).

You shouldn't need to do this.

> - It would be nice to replace the device cgroup with a comprehensive
>   solution that really works. (Among other things the device cgroup
>   does not work in terms of struct device, the underlying kernel
>   abstraction for devices).

I didn't even know there was a device cgroup. Which means that if there
is one, odds are it's useless.

> We must manage sysfs entries as well as device nodes because:
> - Seeing more than we should has the real potential to confuse
>   userspace, especially a userspace that replays uevents.

You should never replay uevents. If you don't do that, why can't you
see all of sysfs?

> - Some device control must happen through writing to sysfs files and
>   if we don't remove all root privileges from a container only by
>   exporting a subset of sysfs to that container can we limit which
>   sysfs nodes can be written to.

But you have the issue of controlling devices in a "shared" way, which
isn't going to be usable for almost all devices.

> The current kernel tagged sysfs entry support does not look like a good
> match for implementing device filtering. The common case will
> be allowing devices like /dev/zero, and /dev/null that live in
> /sys/devices/virtual and are the devices we are most likely to care
> about. Those devices need to live in multiple device namespaces so
> everyone can use them. Perhaps exclusive assignment will be the more
> common paradigm for device namespaces like it is for network devices in
> the network namespace but from what little I can see of this problem
> right now I don't think so.
>
> I definitely think we should hold off on a kernel level implementation
> until we really understand the issues and are ready to implement device
> namespaces correctly.

I agree, especially as I don't think this will ever work.

> A userspace implementation looks like it can only do about 95% of what
> is really needed, but at the same time looks like an easy way to
> experiment until the problem is sufficiently well understood.

95% is probably way better than what you have today, and will fit the
needs of almost everyone today, so why not do it? I'd argue that those
last 5% either are custom solutions that never get merged, or candidates
for true virtualization.

> In summary the situation with device hotplug and containers sucks today,
> and we need to do something. Running a linux desktop in a container is
> a reasonably good example use case.

No it isn't. I'd argue that this is a horrible use case, one that you
shouldn't do. Why not just use multi-head machines like people do who
really want to do this, relying on user separation? That's a workable
solution that is quite common and works very well today.

> Having one standard common maintainable implementation would be very
> useful and the most logical place for that would be in the kernel.
> For now we should focus on simple device filtering and hotplug.

Just listen for libudev stuff, don't try to filter them, or ever
"replay" them, that way lies madness, and lots of nasty race conditions
that are guaranteed to break things.

good luck,

greg k-h
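
A minimal sketch of the routing decision a userspace daemon of the kind
discussed above would have to make, assuming the raw kernel uevent
payload layout ("action@devpath" followed by NUL-separated KEY=VALUE
fields). The function names, the sample payload, and the
per-container policy table keyed on SUBSYSTEM are all hypothetical; how
the event would actually be delivered into a container is deliberately
left out.

```python
def parse_uevent(raw: bytes) -> dict:
    """Parse a kernel uevent payload into a dict of its KEY=VALUE fields."""
    fields = [f for f in raw.split(b"\0") if f]
    event = {}
    for field in fields[1:]:      # fields[0] is the "action@devpath" summary
        key, sep, value = field.partition(b"=")
        if sep:
            event[key.decode()] = value.decode()
    return event


def route_uevent(event: dict, policy: dict) -> list:
    """Return the containers whose policy admits this event.

    `policy` maps a container name to the set of SUBSYSTEM values it may
    see; this mapping is a stand-in for whatever real policy would exist.
    """
    subsystem = event.get("SUBSYSTEM")
    return sorted(name for name, allowed in policy.items()
                  if subsystem in allowed)


# Made-up payload for illustration; not captured from a real system.
sample = (b"add@/devices/platform/serial8250/tty/ttyS0\0"
          b"ACTION=add\0"
          b"DEVPATH=/devices/platform/serial8250/tty/ttyS0\0"
          b"SUBSYSTEM=tty\0"
          b"MAJOR=4\0MINOR=64\0DEVNAME=ttyS0\0SEQNUM=1234\0")
policy = {"desktop": {"tty", "input"}, "server": {"block"}}
event = parse_uevent(sample)
targets = route_uevent(event, policy)
```

With the sample above, the event would be routed to the "desktop"
container only. Whether such per-container selection counts as the
"filtering" warned against in the reply is exactly the open question in
this thread.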
[parent not found: <20130926053320.GB3725-U8xfFu+wG4EAvxtiuMwx3w@public.gmane.org>]
* Re: Device Namespaces [not found] ` <20130926053320.GB3725-U8xfFu+wG4EAvxtiuMwx3w@public.gmane.org> @ 2013-09-26 8:25 ` Janne Karhunen [not found] ` <CAE=NcrbPXGWU8FUgwchXyL5HjXf+4AKbgUWGe1ZO=Xcq=iV-Lg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2013-10-01 6:19 ` Janne Karhunen 1 sibling, 1 reply; 38+ messages in thread From: Janne Karhunen @ 2013-09-26 8:25 UTC (permalink / raw) To: Greg Kroah-Hartman Cc: Linux Containers, Kay Sievers, Andy Lutomirski, devel-GEFAQzZX7r8dnm+yROfE0A, Eric W. Biederman, lxc-devel, mhw-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Stephane Graber On Thu, Sep 26, 2013 at 8:33 AM, Greg Kroah-Hartman <gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r@public.gmane.org> wrote: >> In summary the situation with device hoptlug and containers sucks today, >> and we need to do something. Running a linux desktop in a container is >> a reasonably good example use case. > > No it isn't. I'd argue that this is a horrible use case, one that you > shouldn't do. Why not just use multi-head machines like people do who > really want to do this, relying on user separation? That's a workable > solution that is quite common and works very well today. I suppose so, but now you take the assumption that there is no need for running multiple Linux variants on the same host (say Ubuntu and Android side by side). Is this something you would not like to see done? -- Janne ^ permalink raw reply [flat|nested] 38+ messages in thread
[parent not found: <CAE=NcrbPXGWU8FUgwchXyL5HjXf+4AKbgUWGe1ZO=Xcq=iV-Lg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: Device Namespaces [not found] ` <CAE=NcrbPXGWU8FUgwchXyL5HjXf+4AKbgUWGe1ZO=Xcq=iV-Lg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2013-09-26 13:56 ` Greg Kroah-Hartman [not found] ` <20130926135604.GA16624-U8xfFu+wG4EAvxtiuMwx3w@public.gmane.org> 0 siblings, 1 reply; 38+ messages in thread From: Greg Kroah-Hartman @ 2013-09-26 13:56 UTC (permalink / raw) To: Janne Karhunen Cc: Linux Containers, Kay Sievers, Andy Lutomirski, devel-GEFAQzZX7r8dnm+yROfE0A, Eric W. Biederman, lxc-devel, mhw-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Stephane Graber On Thu, Sep 26, 2013 at 11:25:56AM +0300, Janne Karhunen wrote: > On Thu, Sep 26, 2013 at 8:33 AM, Greg Kroah-Hartman > <gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r@public.gmane.org> wrote: > > >> In summary the situation with device hoptlug and containers sucks today, > >> and we need to do something. Running a linux desktop in a container is > >> a reasonably good example use case. > > > > No it isn't. I'd argue that this is a horrible use case, one that you > > shouldn't do. Why not just use multi-head machines like people do who > > really want to do this, relying on user separation? That's a workable > > solution that is quite common and works very well today. > > I suppose so, but now you take the assumption that there is no > need for running multiple Linux variants on the same host (say > Ubuntu and Android side by side). Is this something you would > not like to see done? You can do that today without any need for device namespaces, so why is this an issue here? greg k-h ^ permalink raw reply [flat|nested] 38+ messages in thread
[parent not found: <20130926135604.GA16624-U8xfFu+wG4EAvxtiuMwx3w@public.gmane.org>]
* Re: Device Namespaces [not found] ` <20130926135604.GA16624-U8xfFu+wG4EAvxtiuMwx3w@public.gmane.org> @ 2013-09-26 17:01 ` Janne Karhunen [not found] ` <CAE=NcrY3xC1AF_GV2b1KsF7AwYZTuGBuKLS5yBUWoWcmKU4YBg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 0 siblings, 1 reply; 38+ messages in thread From: Janne Karhunen @ 2013-09-26 17:01 UTC (permalink / raw) To: Greg Kroah-Hartman Cc: Linux Containers, Kay Sievers, Andy Lutomirski, devel, Eric W. Biederman, lxc-devel, mhw, Stephane Graber On Thu, Sep 26, 2013 at 4:56 PM, Greg Kroah-Hartman <gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r@public.gmane.org> wrote: >> I suppose so, but now you take the assumption that there is no >> need for running multiple Linux variants on the same host (say >> Ubuntu and Android side by side). Is this something you would >> not like to see done? > > You can do that today without any need for device namespaces, so why is > this an issue here? I think you misunderstood me. I wasn't so much advocating on the device namespace part, just the issue at hand (device access filtering based on which ns happens to be 'active'). We are already trying to do this in userspace, let's see how that goes. That being said, our wish would be to support any combination of OS's and frankly, I'd be slightly annoyed to tell the customer that they can't do two Androids or we magically run out of bits. -- Janne ^ permalink raw reply [flat|nested] 38+ messages in thread
[parent not found: <CAE=NcrY3xC1AF_GV2b1KsF7AwYZTuGBuKLS5yBUWoWcmKU4YBg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: Device Namespaces [not found] ` <CAE=NcrY3xC1AF_GV2b1KsF7AwYZTuGBuKLS5yBUWoWcmKU4YBg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2013-09-26 17:07 ` Greg Kroah-Hartman [not found] ` <20130926170757.GA9345-U8xfFu+wG4EAvxtiuMwx3w@public.gmane.org> 0 siblings, 1 reply; 38+ messages in thread From: Greg Kroah-Hartman @ 2013-09-26 17:07 UTC (permalink / raw) To: Janne Karhunen Cc: Linux Containers, Kay Sievers, Andy Lutomirski, devel, Eric W. Biederman, lxc-devel, mhw, Stephane Graber On Thu, Sep 26, 2013 at 08:01:31PM +0300, Janne Karhunen wrote: > That being said, our wish would be to support any combination of > OS's and frankly, I'd be slightly annoyed to tell the customer that > they can't do two Androids or we magically run out of bits. If you want to support "any" combination of operating systems, then use a hypervisor, that's what they are there for :) ^ permalink raw reply [flat|nested] 38+ messages in thread
[parent not found: <20130926170757.GA9345-U8xfFu+wG4EAvxtiuMwx3w@public.gmane.org>]
* Re: Device Namespaces [not found] ` <20130926170757.GA9345-U8xfFu+wG4EAvxtiuMwx3w@public.gmane.org> @ 2013-09-26 17:56 ` Janne Karhunen 2013-09-30 15:37 ` James Bottomley 1 sibling, 0 replies; 38+ messages in thread From: Janne Karhunen @ 2013-09-26 17:56 UTC (permalink / raw) To: Greg Kroah-Hartman Cc: Linux Containers, Kay Sievers, Andy Lutomirski, devel, Eric W. Biederman, lxc-devel, mhw, Stephane Graber On Thu, Sep 26, 2013 at 8:07 PM, Greg Kroah-Hartman <gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r@public.gmane.org> wrote: >> That being said, our wish would be to support any combination of >> OS's and frankly, I'd be slightly annoyed to tell the customer that >> they can't do two Androids or we magically run out of bits. > > If you want to support "any" combination of operating systems, then use > a hypervisor, that's what they are there for :) Only relevant mobile OS's are of interest ;) -- Janne ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Device Namespaces [not found] ` <20130926170757.GA9345-U8xfFu+wG4EAvxtiuMwx3w@public.gmane.org> 2013-09-26 17:56 ` Janne Karhunen @ 2013-09-30 15:37 ` James Bottomley [not found] ` <1380555439.2161.5.camel-sFMDBYUN5F8GjUHQrlYNx2Wm91YjaHnnhRte9Li2A+AAvxtiuMwx3w@public.gmane.org> 1 sibling, 1 reply; 38+ messages in thread
From: James Bottomley @ 2013-09-30 15:37 UTC
To: Greg Kroah-Hartman
Cc: Linux Containers, Kay Sievers, Stephane Graber, Andy Lutomirski, Eric W. Biederman, lxc-devel, mhw, devel

On Thu, 2013-09-26 at 10:07 -0700, Greg Kroah-Hartman wrote:
> On Thu, Sep 26, 2013 at 08:01:31PM +0300, Janne Karhunen wrote:
> > That being said, our wish would be to support any combination of
> > OS's and frankly, I'd be slightly annoyed to tell the customer that
> > they can't do two Androids or we magically run out of bits.
>
> If you want to support "any" combination of operating systems, then use
> a hypervisor, that's what they are there for :)

No, that's not quite the right way to think about it: the correct
statement is only use a hypervisor if you need different kernels. With
Windows, it happens to be true that you need a different kernel for each
different OS version. However, with Linux, thanks to strong ABI
backwards compatibility, you mostly don't. The way OpenVZ works today
is that it installs a modified kernel which can then bring up every
Linux OS in a separate container. Our use case is the hosters that give
you root login to a virtual private server and allow you to upgrade it
on your own. The reason for using a container rather than a hypervisor
is the old density and elasticity one: 3x the density (i.e. 1/3 the
overhead cost to the hoster), and the boot only needs to start at init
rather than bringing up virtual hardware and booting a second kernel.

James
[parent not found: <1380555439.2161.5.camel-sFMDBYUN5F8GjUHQrlYNx2Wm91YjaHnnhRte9Li2A+AAvxtiuMwx3w@public.gmane.org>]
* Re: Device Namespaces [not found] ` <1380555439.2161.5.camel-sFMDBYUN5F8GjUHQrlYNx2Wm91YjaHnnhRte9Li2A+AAvxtiuMwx3w@public.gmane.org> @ 2013-09-30 16:11 ` Greg Kroah-Hartman [not found] ` <20130930161117.GA26459-U8xfFu+wG4EAvxtiuMwx3w@public.gmane.org> 0 siblings, 1 reply; 38+ messages in thread From: Greg Kroah-Hartman @ 2013-09-30 16:11 UTC (permalink / raw) To: James Bottomley Cc: Linux Containers, Kay Sievers, Stephane Graber, Andy Lutomirski, Eric W. Biederman, lxc-devel, mhw, devel On Mon, Sep 30, 2013 at 08:37:19AM -0700, James Bottomley wrote: > On Thu, 2013-09-26 at 10:07 -0700, Greg Kroah-Hartman wrote: > > On Thu, Sep 26, 2013 at 08:01:31PM +0300, Janne Karhunen wrote: > > > That being said, our wish would be to support any combination of > > > OS's and frankly, I'd be slightly annoyed to tell the customer that > > > they can't do two Androids or we magically run out of bits. > > > > If you want to support "any" combination of operating systems, then use > > a hypervisor, that's what they are there for :) > > No that's not quite the right way to think about it: The correct > statement is only use a hypervisor if you need different kernels. With > Windows, it happens to be true that you need a different kernel for each > different OS version. However; with Linux, thanks to strong ABI > backwards compatibility, you mostly don't. The way OpenVZ works today > is that it installs a modified kernel which can then bring up every > Linux OS in a separate container. Our use case is the hosters that give > you root login to a virtual private server and allow you to upgrade it > on your own. The reason for using a container rather than a hypervisor > is the old density and elasticity one: 3x the density (i.e. 1/3 the > overhead cost to the hoster) and the boot only needs to start at init, > not bring up of virtual hardware and booting a second kernel. 
I understand that some people really like the idea of using OpenVZ for various things like this, but to claim that because of it we need to hack up the driver core in the kernel into unimaginable pieces is not necessarily something that I'll agree with. But all of this is just words, I have yet to see any patches for any of this, so I'll just wait until that happens before worrying about it... thanks, greg k-h ^ permalink raw reply [flat|nested] 38+ messages in thread
[parent not found: <20130930161117.GA26459-U8xfFu+wG4EAvxtiuMwx3w@public.gmane.org>]
* Re: Device Namespaces [not found] ` <20130930161117.GA26459-U8xfFu+wG4EAvxtiuMwx3w@public.gmane.org> @ 2013-09-30 16:33 ` James Bottomley 0 siblings, 0 replies; 38+ messages in thread From: James Bottomley @ 2013-09-30 16:33 UTC (permalink / raw) To: Greg Kroah-Hartman Cc: Linux Containers, Kay Sievers, Stephane Graber, Andy Lutomirski, Eric W. Biederman, lxc-devel, mhw, devel On Mon, 2013-09-30 at 09:11 -0700, Greg Kroah-Hartman wrote: > On Mon, Sep 30, 2013 at 08:37:19AM -0700, James Bottomley wrote: > > On Thu, 2013-09-26 at 10:07 -0700, Greg Kroah-Hartman wrote: > > > On Thu, Sep 26, 2013 at 08:01:31PM +0300, Janne Karhunen wrote: > > > > That being said, our wish would be to support any combination of > > > > OS's and frankly, I'd be slightly annoyed to tell the customer that > > > > they can't do two Androids or we magically run out of bits. > > > > > > If you want to support "any" combination of operating systems, then use > > > a hypervisor, that's what they are there for :) > > > > No that's not quite the right way to think about it: The correct > > statement is only use a hypervisor if you need different kernels. With > > Windows, it happens to be true that you need a different kernel for each > > different OS version. However; with Linux, thanks to strong ABI > > backwards compatibility, you mostly don't. The way OpenVZ works today > > is that it installs a modified kernel which can then bring up every > > Linux OS in a separate container. Our use case is the hosters that give > > you root login to a virtual private server and allow you to upgrade it > > on your own. The reason for using a container rather than a hypervisor > > is the old density and elasticity one: 3x the density (i.e. 1/3 the > > overhead cost to the hoster) and the boot only needs to start at init, > > not bring up of virtual hardware and booting a second kernel. 
> > I understand that some people really like the idea of using OpenVZ for > various things like this, but to claim that because of it we need to > hack up the driver core in the kernel into unimaginable pieces is not > necessarily something that I'll agree with. I don't believe I claimed that. In fact, from 3.9 we can already bring up an OpenVZ containerised system running different versions of Linux that you can give someone root access to with no kernel modifications whatsoever. The user space solution currently works for us because we're handing out server VPSs, so the device configuration is fixed as we init the container. However, we do have use cases for dynamic instead of static device configurations, which is why we're participating in the debate. > But all of this is just words, I have yet to see any patches for any of > this, so I'll just wait until that happens before worrying about it... Well, that's because we're still debating what the best approach is. If you want a historical parallel: the comments you make above (hack up the ... kernel into unimaginable pieces) is an almost exact mirror of the comments that were made rejecting the in-kernel Checkpoint/Restore patches at the 2010 Kernel Summit ... yet we have it fully functional today in a form that proved acceptable. James ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Device Namespaces [not found] ` <20130926053320.GB3725-U8xfFu+wG4EAvxtiuMwx3w@public.gmane.org> 2013-09-26 8:25 ` Janne Karhunen @ 2013-10-01 6:19 ` Janne Karhunen [not found] ` <CAE=NcrYV2RiMV7PcwEjFGFRBrz9XdZGs86Wau2a+6xpYN2aEHA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 1 sibling, 1 reply; 38+ messages in thread From: Janne Karhunen @ 2013-10-01 6:19 UTC (permalink / raw) To: Greg Kroah-Hartman Cc: Linux Containers, Kay Sievers, Andy Lutomirski, devel, Eric W. Biederman, lxc-devel, mhw, Stephane Graber On Thu, Sep 26, 2013 at 8:33 AM, Greg Kroah-Hartman <gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r@public.gmane.org> wrote: >> - We can relay a call of /sbin/hotplug from outside of a container >> to inside of a container based on policy. >> (But no one uses /sbin/hotplug anymore). > > That's right, they should be listening to libudev events, so why can't > your daemon shuffle them off to the proper container, all in userspace? Which reminds me, one potential reason being.. http://lists.linuxfoundation.org/pipermail/containers/2013-May/032591.html -- Janne ^ permalink raw reply [flat|nested] 38+ messages in thread
[parent not found: <CAE=NcrYV2RiMV7PcwEjFGFRBrz9XdZGs86Wau2a+6xpYN2aEHA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: Device Namespaces [not found] ` <CAE=NcrYV2RiMV7PcwEjFGFRBrz9XdZGs86Wau2a+6xpYN2aEHA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2013-10-01 17:27 ` Andy Lutomirski [not found] ` <CALCETrWWoHzuJcnfEUY+cFpOgT5gnG8U1cVbCW0_8V7Z_v6DJw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2013-10-01 17:33 ` Greg Kroah-Hartman 1 sibling, 1 reply; 38+ messages in thread
From: Andy Lutomirski @ 2013-10-01 17:27 UTC
To: Janne Karhunen
Cc: Greg Kroah-Hartman, Linux Containers, Kay Sievers, Stephane Graber, Eric W. Biederman, lxc-devel, mhw, devel

On Tue, Oct 1, 2013 at 7:19 AM, Janne Karhunen <janne.karhunen-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> On Thu, Sep 26, 2013 at 8:33 AM, Greg Kroah-Hartman
> <gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r@public.gmane.org> wrote:
>
>>> - We can relay a call of /sbin/hotplug from outside of a container
>>>   to inside of a container based on policy.
>>>   (But no one uses /sbin/hotplug anymore).
>>
>> That's right, they should be listening to libudev events, so why can't
>> your daemon shuffle them off to the proper container, all in userspace?
>
> Which reminds me, one potential reason being..
> http://lists.linuxfoundation.org/pipermail/containers/2013-May/032591.html
>

Can't the daemon live outside the container and shuffle stuff in? IOW,
there seems to be little point in containerizing things if you're just
going to punch a privilege hole in the namespace.

FWIW, I think that the capability evolution rules are crap, but changing
them is a can of worms, and enough people seem to think the status quo
is acceptable that this is unlikely to ever get fixed.

--Andy
[parent not found: <CALCETrWWoHzuJcnfEUY+cFpOgT5gnG8U1cVbCW0_8V7Z_v6DJw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: Device Namespaces [not found] ` <CALCETrWWoHzuJcnfEUY+cFpOgT5gnG8U1cVbCW0_8V7Z_v6DJw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2013-10-01 17:53 ` Serge E. Hallyn [not found] ` <20131001175345.GA4145-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org> 2013-10-01 18:36 ` Janne Karhunen 1 sibling, 1 reply; 38+ messages in thread From: Serge E. Hallyn @ 2013-10-01 17:53 UTC (permalink / raw) To: Andy Lutomirski Cc: Kay Sievers, Linux Containers, lxc-devel, Stephane Graber, Eric W. Biederman, Greg Kroah-Hartman, mhw, devel Quoting Andy Lutomirski (luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org): > On Tue, Oct 1, 2013 at 7:19 AM, Janne Karhunen <janne.karhunen-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote: > > On Thu, Sep 26, 2013 at 8:33 AM, Greg Kroah-Hartman > > <gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r@public.gmane.org> wrote: > > > >>> - We can relay a call of /sbin/hotplug from outside of a container > >>> to inside of a container based on policy. > >>> (But no one uses /sbin/hotplug anymore). > >> > >> That's right, they should be listening to libudev events, so why can't > >> your daemon shuffle them off to the proper container, all in userspace? > > > > Which reminds me, one potential reason being.. > > http://lists.linuxfoundation.org/pipermail/containers/2013-May/032591.html > > > > Can't the daemon live outside the container and shuffle stuff in? That's exactly what Michael Warfield is suggesting, fwiw. -serge ^ permalink raw reply [flat|nested] 38+ messages in thread
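
A minimal sketch of the "daemon outside the container shuffles stuff in"
idea, assuming the daemon simply creates a device node under the
container's /dev directory as seen from the host. The function names and
the container rootfs path are hypothetical; os.mknod(), os.makedev() and
stat.S_IFCHR are standard Python, and 4:64 is the usual number pair for
/dev/ttyS0.

```python
import os
import stat


def plan_device_node(container_dev_dir, name, major, minor, perms=0o660):
    """Compute the arguments an outside daemon would pass to os.mknod()
    to create a character device node inside a container's /dev."""
    path = os.path.join(container_dev_dir, name)
    mode = stat.S_IFCHR | perms
    return path, mode, os.makedev(major, minor)


def inject_device_node(container_dev_dir, name, major, minor):
    """Create the node. This needs CAP_MKNOD, which is why the sketch
    assumes it runs on the host and not inside the container."""
    path, mode, dev = plan_device_node(container_dev_dir, name, major, minor)
    os.mknod(path, mode, dev)
    return path


# Hypothetical container rootfs path, used only to show the computed values.
path, mode, dev = plan_device_node("/var/lib/lxc/c1/rootfs/dev", "ttyS0", 4, 64)
```

Creating the node is only half of the job: the container would still
need to learn that the device appeared, which is the uevent side of the
discussion.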
[parent not found: <20131001175345.GA4145-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>]
* Re: Device Namespaces [not found] ` <20131001175345.GA4145-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org> @ 2013-10-01 19:51 ` Eric W. Biederman [not found] ` <87had0wz07.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> 0 siblings, 1 reply; 38+ messages in thread From: Eric W. Biederman @ 2013-10-01 19:51 UTC (permalink / raw) To: Serge E. Hallyn Cc: Kay Sievers, Linux Containers, lxc-devel, Andy Lutomirski, devel, Greg Kroah-Hartman, mhw, Stephane Graber "Serge E. Hallyn" <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> writes: > Quoting Andy Lutomirski (luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org): >> On Tue, Oct 1, 2013 at 7:19 AM, Janne Karhunen <janne.karhunen-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote: >> > On Thu, Sep 26, 2013 at 8:33 AM, Greg Kroah-Hartman >> > <gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r@public.gmane.org> wrote: >> > >> >>> - We can relay a call of /sbin/hotplug from outside of a container >> >>> to inside of a container based on policy. >> >>> (But no one uses /sbin/hotplug anymore). >> >> >> >> That's right, they should be listening to libudev events, so why can't >> >> your daemon shuffle them off to the proper container, all in userspace? >> > >> > Which reminds me, one potential reason being.. >> > http://lists.linuxfoundation.org/pipermail/containers/2013-May/032591.html >> > >> >> Can't the daemon live outside the container and shuffle stuff in? > > That's exactly what Michael Warfield is suggesting, fwiw. Michael Warfield's example of dynamically assigning serial ports to containers is a pretty good test case. Serial ports are extremely well-known kernel objects whose evolution effectively stopped long ago. When we need it we have ptys to virtualize serial ports, but in general unprivileged users are safe to directly use a serial port device. Glossing over the details, the general problem is that some policy exists outside of the container that decides if and when a container gets a serial port and stuffs it in. 
The expectation is that system containers will then run the udev rules and send the libudev event. To make that all work without kernel modifications requires placing a faux-udev in the container that listens for a device assignment from outside the container and then does exactly what udev would have done. The problems with this that I see are: - udev is a moving target, making it hard to build a faux-udev that will work everywhere. - On distros running systemd, the udev integration is sufficiently tight that I am not certain a faux-udev is possible or will continue to be possible. - There are two other widely deployed solutions for managing hotplug devices besides udev. So given these difficulties I do not believe that the evolution of Linux device management is done, and I believe that patches to udev, the kernel, or both will be needed. While it would be good for testing and understanding the problem, I don't think a faux-udev will be a long-term maintainable solution. I also understand the point that we aren't talking patches yet and are just discussing ideas. Right now it is my hope that if we talk this out we can figure out a general direction that has a hope of working. From where I am standing, faking uevents instead of replacing udev/mdev/whatever looks simpler and more maintainable. Eric ^ permalink raw reply [flat|nested] 38+ messages in thread
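As a small aid for anyone experimenting with the faux-udev idea described in the message above, here is a minimal sketch in C of the parsing side only. It relies on the layout the kernel uses for uevent messages on NETLINK_KOBJECT_UEVENT: an "action@devpath" header followed by NUL-separated KEY=VALUE strings. The function name uevent_get() is invented for illustration and is not part of udev, libudev, or the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not from udev or the kernel): look up KEY in a raw
 * uevent buffer of the form "action@devpath\0KEY=VALUE\0KEY=VALUE\0...".
 * Returns a pointer to the NUL-terminated value inside buf, or NULL if the
 * key is not present.
 */
const char *uevent_get(const char *buf, size_t len, const char *key)
{
	size_t klen, off = 0;

	assert(buf != NULL && key != NULL);
	klen = strlen(key);

	while (off < len) {
		const char *field = buf + off;
		const char *end = memchr(field, '\0', len - off);
		size_t flen;

		if (end == NULL)
			break;	/* unterminated trailing field: ignore it */
		flen = (size_t)(end - field);
		/* Match "KEY=" exactly; flen > klen keeps field[klen] in range. */
		if (flen > klen && field[klen] == '=' &&
		    strncmp(field, key, klen) == 0)
			return field + klen + 1;
		off += flen + 1;
	}
	return NULL;
}
```

A faux-udev, or a host-side relay, could use something like this to pick out SUBSYSTEM or DEVNAME before deciding what to do with an event. Everything beyond the message layout is an assumption of this sketch.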
[parent not found: <87had0wz07.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>]
* Re: Device Namespaces [not found] ` <87had0wz07.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> @ 2013-10-01 20:46 ` Serge Hallyn 2013-10-01 22:59 ` [lxc-devel] " Michael H. Warfield 2013-10-02 22:55 ` Eric W. Biederman 2013-10-01 20:57 ` Greg Kroah-Hartman 2013-10-01 22:19 ` Michael H. Warfield 2 siblings, 2 replies; 38+ messages in thread From: Serge Hallyn @ 2013-10-01 20:46 UTC (permalink / raw) To: Eric W. Biederman Cc: Greg Kroah-Hartman, Linux Containers, Kay Sievers, Stephane Graber, Andy Lutomirski, lxc-devel, mhw, devel Quoting Eric W. Biederman (ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org): > "Serge E. Hallyn" <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> writes: > > > Quoting Andy Lutomirski (luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org): > >> On Tue, Oct 1, 2013 at 7:19 AM, Janne Karhunen <janne.karhunen-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote: > >> > On Thu, Sep 26, 2013 at 8:33 AM, Greg Kroah-Hartman > >> > <gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r@public.gmane.org> wrote: > >> > > >> >>> - We can relay a call of /sbin/hotplug from outside of a container > >> >>> to inside of a container based on policy. > >> >>> (But no one uses /sbin/hotplug anymore). > >> >> > >> >> That's right, they should be listening to libudev events, so why can't > >> >> your daemon shuffle them off to the proper container, all in userspace? > >> > > >> > Which reminds me, one potential reason being.. > >> > http://lists.linuxfoundation.org/pipermail/containers/2013-May/032591.html > >> > > >> > >> Can't the daemon live outside the container and shuffle stuff in? > > > > That's exactly what Michael Warfield is suggesting, fwiw. > > Michael Warfields example of dynamically assigning serial ports to > containers is a pretty good test case. Serial ports are extremely well > known kernel objects who evolution effectively stopped long ago. 
When > we need it we have ptys to virtual serial ports when we need it, but in > general unprivileged users are safe to directly use a serial port > device. > > Glossing over the details. The general problem is some policy exists > outside of the container that deciedes if an when a container gets a > serial port and stuffs it in. > > The expectation is that system containers will then run the udev > rules and send the libuevent event. I thought the suggestion was that udev on the host would be given container-specific rules, saying "plop this device into /dev/container1/" (with /dev/container1 being bind-mounted to $container1_rootfs/dev). -serge ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [lxc-devel] Device Namespaces 2013-10-01 20:46 ` Serge Hallyn @ 2013-10-01 22:59 ` Michael H. Warfield 2013-10-02 22:55 ` Eric W. Biederman 1 sibling, 0 replies; 38+ messages in thread From: Michael H. Warfield @ 2013-10-01 22:59 UTC (permalink / raw) To: Serge Hallyn Cc: Greg Kroah-Hartman, Michael H.Warfield, Kay Sievers, Andy Lutomirski, Eric W. Biederman, lxc-devel, Linux Containers, devel [-- Attachment #1.1: Type: text/plain, Size: 4401 bytes --] On Tue, 2013-10-01 at 15:46 -0500, Serge Hallyn wrote: > Quoting Eric W. Biederman (ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org): > > "Serge E. Hallyn" <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> writes: > > > > > Quoting Andy Lutomirski (luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org): > > >> On Tue, Oct 1, 2013 at 7:19 AM, Janne Karhunen <janne.karhunen@gmail.com> wrote: > > >> > On Thu, Sep 26, 2013 at 8:33 AM, Greg Kroah-Hartman > > >> > <gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r@public.gmane.org> wrote: > > >> > > > >> >>> - We can relay a call of /sbin/hotplug from outside of a container > > >> >>> to inside of a container based on policy. > > >> >>> (But no one uses /sbin/hotplug anymore). > > >> >> > > >> >> That's right, they should be listening to libudev events, so why can't > > >> >> your daemon shuffle them off to the proper container, all in userspace? > > >> > > > >> > Which reminds me, one potential reason being.. > > >> > http://lists.linuxfoundation.org/pipermail/containers/2013-May/032591.html > > >> > > > >> > > >> Can't the daemon live outside the container and shuffle stuff in? > > > > > > That's exactly what Michael Warfield is suggesting, fwiw. > > > > Michael Warfields example of dynamically assigning serial ports to > > containers is a pretty good test case. Serial ports are extremely well > > known kernel objects who evolution effectively stopped long ago. 
When > > we need it we have ptys to virtual serial ports when we need it, but in > > general unprivileged users are safe to directly use a serial port > > device. > > > > Glossing over the details. The general problem is some policy exists > > outside of the container that deciedes if an when a container gets a > > serial port and stuffs it in. > > > > The expectation is that system containers will then run the udev > > rules and send the libuevent event. > I thought the suggestion was that udev on the host would be given > container-specific rules, saying "plop this device into /dev/container1/" > (with /dev/container1 being bind-mounted to $container1_rootfs/dev). I think that the "given container-specific rules, saying..." thing was on my chart of options as the one with the big cloudy shaped object in the lower right corner labeled "and then a miracle occurs". The basic part is the mapping from /dev into /dev/lxc/container. That should be doable based on the rules in the host and a basic udev trigger along with a simple mapping configuration. The "given container-specific" part becomes a morass if it gets complicated enough. What I was envisioning was a very simple system of container specific {match} and {map} objects. If a name or symlink passed to the daemon from a udev trigger matched a match, then the name and symlinks and additional maps would be mapped into the appropriate container subdirectory. That works real well if the container and host udev rules are congruent. The tough part is the "container-specific" rules which was the part I specifically mentioned that I had no clue how to make happen. That's a non-trivial task if the container is allowed to make arbitrary udev rule changes based on what they are allowed to receive from the host (and how do we trigger the changes in the host when a change is made in the container). It's easily doable where the container rules are congruent with the host rules. 
Where they are orthogonal gets much more complicated. But... All that being said, I will take the congruent solution as a starting point (and that will not be an 80% solution - it will be more like a 95% solution) and we can argue about the corner cases and deltas after that. Doable, yes, for some value of doable. I like what Greg was saying about using libudev but I'm totally in the dark as to how to effectively hook that or if it would even work in the container. That one is not in my realm. > -serge Regards, Mike -- Michael H. Warfield (AI4NB) | Desk: (404) 236-2807 Senior Researcher - X-Force | Cell: (678) 463-0932 IBM Security Services | mhw-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org mhw-UGBql2FAF+1Wk0Htik3J/w@public.gmane.org 6303 Barfield Road | http://www.iss.net/ Atlanta, Georgia 30328 | http://www.wittsend.com/mhw/ | PGP Key: 0x674627FF [-- Attachment #1.2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 482 bytes --] [-- Attachment #2: Type: text/plain, Size: 205 bytes --] _______________________________________________ Containers mailing list Containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org https://lists.linuxfoundation.org/mailman/listinfo/containers ^ permalink raw reply [flat|nested] 38+ messages in thread
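The {match}/{map} scheme described in the message above can be sketched roughly as follows. This is only an illustration of the "congruent rules" case: the struct and function names are hypothetical, the patterns are plain shell globs handled by POSIX fnmatch(), and the /dev/lxc/<container>/ layout is taken from the message rather than from any existing tool.

```c
#include <assert.h>
#include <fnmatch.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical per-container rule: device names matching `match` are
 * mapped into /dev/lxc/<container>/. */
struct devmap_rule {
	const char *match;	/* shell-style pattern, e.g. "ttyUSB*" */
	const char *container;	/* container name */
};

/*
 * Find the first rule matching devname and write the target path to out.
 * Returns 0 on success, -1 if no rule matches, the name looks unsafe,
 * or out is too small.
 */
int devmap_lookup(const struct devmap_rule *rules, size_t nrules,
		  const char *devname, char *out, size_t outlen)
{
	size_t i;

	assert(rules != NULL && devname != NULL && out != NULL);

	/* Refuse names that could escape the container's subdirectory. */
	if (devname[0] == '\0' || strstr(devname, "..") != NULL)
		return -1;

	for (i = 0; i < nrules; i++) {
		if (fnmatch(rules[i].match, devname, 0) == 0) {
			int n = snprintf(out, outlen, "/dev/lxc/%s/%s",
					 rules[i].container, devname);
			return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
		}
	}
	return -1;
}
```

A host daemon triggered by a udev rule could call such a lookup for the device name and each symlink, then create the node or link at the returned path. How container-side rule changes would feed back to the host is not addressed here, just as it is left open in the message.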
* Re: Device Namespaces 2013-10-01 20:46 ` Serge Hallyn 2013-10-01 22:59 ` [lxc-devel] " Michael H. Warfield @ 2013-10-02 22:55 ` Eric W. Biederman 1 sibling, 0 replies; 38+ messages in thread From: Eric W. Biederman @ 2013-10-02 22:55 UTC (permalink / raw) To: Serge Hallyn Cc: Greg Kroah-Hartman, Linux Containers, Kay Sievers, Stephane Graber, Andy Lutomirski, lxc-devel, mhw, devel Serge Hallyn <serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org> writes: >> Glossing over the details. The general problem is some policy exists >> outside of the container that deciedes if an when a container gets a >> serial port and stuffs it in. >> >> The expectation is that system containers will then run the udev >> rules and send the libuevent event. > > I thought the suggestion was that udev on the host would be given > container-specific rules, saying "plop this device into /dev/container1/" > (with /dev/container1 being bind-mounted to $container1_rootfs/dev). That is what I was trying to describe. We still need something that lets the software in the container know it needs to do something. I may be blind but right now short of replacing the internal udev, or modifying the kernel I don't see a solution for letting software in a container know there is a new device it can use. Once we get the notification issue sorted out I think we have enough to bring up a full desktop environment in a container and be able to say we don't need anything else from devices unless someone discovers that checkpoint/restart actually needs minor numbers to be preserved. Eric ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Device Namespaces [not found] ` <87had0wz07.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> 2013-10-01 20:46 ` Serge Hallyn @ 2013-10-01 20:57 ` Greg Kroah-Hartman [not found] ` <20131001205718.GA17036-U8xfFu+wG4EAvxtiuMwx3w@public.gmane.org> 2013-10-01 22:19 ` Michael H. Warfield 2 siblings, 1 reply; 38+ messages in thread From: Greg Kroah-Hartman @ 2013-10-01 20:57 UTC (permalink / raw) To: Eric W. Biederman Cc: Linux Containers, Kay Sievers, Andy Lutomirski, devel, lxc-devel, mhw, Stephane Graber On Tue, Oct 01, 2013 at 12:51:36PM -0700, Eric W. Biederman wrote: > "Serge E. Hallyn" <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> writes: > > > Quoting Andy Lutomirski (luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org): > >> On Tue, Oct 1, 2013 at 7:19 AM, Janne Karhunen <janne.karhunen-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote: > >> > On Thu, Sep 26, 2013 at 8:33 AM, Greg Kroah-Hartman > >> > <gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r@public.gmane.org> wrote: > >> > > >> >>> - We can relay a call of /sbin/hotplug from outside of a container > >> >>> to inside of a container based on policy. > >> >>> (But no one uses /sbin/hotplug anymore). > >> >> > >> >> That's right, they should be listening to libudev events, so why can't > >> >> your daemon shuffle them off to the proper container, all in userspace? > >> > > >> > Which reminds me, one potential reason being.. > >> > http://lists.linuxfoundation.org/pipermail/containers/2013-May/032591.html > >> > > >> > >> Can't the daemon live outside the container and shuffle stuff in? > > > > That's exactly what Michael Warfield is suggesting, fwiw. > > Michael Warfields example of dynamically assigning serial ports to > containers is a pretty good test case. Serial ports are extremely well > known kernel objects who evolution effectively stopped long ago. When > we need it we have ptys to virtual serial ports when we need it, but in > general unprivileged users are safe to directly use a serial port > device. 
> > Glossing over the details. The general problem is some policy exists > outside of the container that deciedes if an when a container gets a > serial port and stuffs it in. > > The expectation is that system containers will then run the udev > rules and send the libuevent event. > > To make that all work without kernel modifications requires placing > a faux-udev in the container, that listens for a device assignment from > outside the container and then does exactly what udev would have done. > > The problems with this that I see are: > > - udev is a moving target making it hard to build a faux-udev that will > work everywhere. How is udev a moving target? Use libudev and all should be fine, that's an ABI you can rely on, right? Or, if you don't like/want udev, use mdev in your container. Or something else, what does this have to do with the kernel? > - On distro's running systemd and udev integration is sufficiently tight > that I am not certain a faux-udev is possible or will continue to be > possible. That's not a kernel issue, that's a "ouch, this is hard, let's give up" issue. Or perhaps it is a "maybe I shouldn't even be trying to do this" type issue... :) > - There are two other widely deployed solutions for managing hotplug > devices besides udev. I know of mdev, what's the other one? The hacked-up shell script that Android uses? Or something else? > So given these difficulties I do not believe that the evolution of linux > device management is done, and that patches to udev, the kernel or both > will be needed. While it would be good for testing and understanding > the problem I don't think a faux-udev will be a long term maintainable > solution. You are saying that for some reason you feel helpless with the way userspace is going, so we have to change the kernel. That's horrible, and is not going to be a reason I accept to change the kernel, sorry. > I also understand the point that we aren't talking patches yet and just > discussing ideas. 
Right now it is my hope that if we talk this out we > can figure out a general direction that has a hope of working. > > From where I am standing faking uevents instead of replacing > udev/mdev/whatever looks simpler and more maintainable. Have you really looked into this? Numerous people, who understand this code path and userspace issues, have said it is not a good idea at all. But hey, what do I know... I still have yet to see a reason why you can't use libudev today for something like this. Anyway, I'm done discussing this as it's pointless this early; I'm going to refrain from any more pithy comments until someone posts some code, as this is just wasting people's time at the moment. greg k-h ^ permalink raw reply [flat|nested] 38+ messages in thread
[parent not found: <20131001205718.GA17036-U8xfFu+wG4EAvxtiuMwx3w@public.gmane.org>]
* Re: Device Namespaces [not found] ` <20131001205718.GA17036-U8xfFu+wG4EAvxtiuMwx3w@public.gmane.org> @ 2013-10-02 22:45 ` Eric W. Biederman 0 siblings, 0 replies; 38+ messages in thread From: Eric W. Biederman @ 2013-10-02 22:45 UTC (permalink / raw) To: Greg Kroah-Hartman Cc: Linux Containers, Kay Sievers, Andy Lutomirski, devel, lxc-devel, mhw, Stephane Graber I think libudev is a solution to a completely different problem. It is possible I am blind but I just don't see how libudev even attempts to solve the problem. The desire is to plop a distro install into a subdirectory, fire up a container around it, and let the distro's userspace do its thing to manage hotplug events. devtmpfs can be faked fairly easily. I don't know about sysfs. Sending events that say you have hotplugged is the largest practical problem. On the minimal side I think the patch below is enough to let us fake up uevents for the container and make things work. I have heard the words that faking uevents is a bad thing, but I have not heard a reason or seen any attempt at an explanation. My guess is that we are simply talking about different problems. I would like to see someone wire up all of the userspace bits and see how well hotplug can be made to work before I walk down the path represented by this patch, but it seems reasonable. I do have anecdotal reports from someone who walked a similar path that this is enough to bring up a full desktop system in a container. 
Eric

diff --git a/include/linux/netlink.h b/include/linux/netlink.h
index 7a6c396a263b..46d05783da82 100644
--- a/include/linux/netlink.h
+++ b/include/linux/netlink.h
@@ -38,6 +38,7 @@ extern void netlink_table_ungrab(void);
 
 #define NL_CFG_F_NONROOT_RECV	(1 << 0)
 #define NL_CFG_F_NONROOT_SEND	(1 << 1)
+#define NL_CFG_F_IMPERSONATE_KERN (1 << 2)
 
 /* optional Netlink kernel configuration parameters */
 struct netlink_kernel_cfg {
diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
index 52e5abbc41db..f75e34397df8 100644
--- a/lib/kobject_uevent.c
+++ b/lib/kobject_uevent.c
@@ -375,9 +375,12 @@ static int uevent_net_init(struct net *net)
 	struct uevent_sock *ue_sk;
 	struct netlink_kernel_cfg cfg = {
 		.groups	= 1,
-		.flags	= NL_CFG_F_NONROOT_RECV,
+		.flags	= NL_CFG_F_NONROOT_RECV | NL_CFG_F_IMPERSONATE_KERN,
 	};
 
+	if (net->user_ns != &init_user_ns)
+		return 0;
+
 	ue_sk = kzalloc(sizeof(*ue_sk), GFP_KERNEL);
 	if (!ue_sk)
 		return -ENOMEM;
@@ -399,6 +402,9 @@ static void uevent_net_exit(struct net *net)
 {
 	struct uevent_sock *ue_sk;
 
+	if (net->user_ns != &init_user_ns)
+		return;
+
 	mutex_lock(&uevent_sock_mutex);
 	list_for_each_entry(ue_sk, &uevent_sock_list, list) {
 		if (sock_net(ue_sk->sk) == net)
diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 0c61b59175dc..71863cc465eb 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -1252,7 +1252,7 @@ static int netlink_release(struct socket *sock)
 
 	skb_queue_purge(&sk->sk_write_queue);
 
-	if (nlk->portid) {
+	if (sk_hashed(sk)) {
 		struct netlink_notify n = {
 			.net = sock_net(sk),
 			.protocol = sk->sk_protocol,
@@ -1409,11 +1409,21 @@ static int netlink_bind(struct socket *sock, struct sockaddr *addr,
 			return err;
 	}
 
-	if (nlk->portid) {
+	if (sk_hashed(sk)) {
 		if (nladdr->nl_pid != nlk->portid)
 			return -EINVAL;
 	} else {
-		err = nladdr->nl_pid ?
+		bool autobind = nladdr->nl_pid == 0;
+		if (nladdr->nl_pid == 0 && (nladdr->nl_pad == 0xffff)) {
+			if (!(nl_table[sk->sk_protocol].flags & NL_CFG_F_IMPERSONATE_KERN))
+				return -EPERM;
+			if (net->user_ns == &init_user_ns)
+				return -EPERM;
+			if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
+				return -EPERM;
+			autobind = false;
+		}
+		err = !autobind ?
 			netlink_insert(sk, net, nladdr->nl_pid) :
 			netlink_autobind(sock);
 		if (err)
@@ -1467,7 +1477,7 @@ static int netlink_connect(struct socket *sock, struct sockaddr *addr,
 	if (nladdr->nl_groups && !netlink_capable(sock, NL_CFG_F_NONROOT_SEND))
 		return -EPERM;
 
-	if (!nlk->portid)
+	if (!sk_hashed(sk))
 		err = netlink_autobind(sock);
 
 	if (err == 0) {
@@ -2228,7 +2238,7 @@ static int netlink_sendmsg(struct kiocb *kiocb, struct socket *sock,
 		dst_group = nlk->dst_group;
 	}
 
-	if (!nlk->portid) {
+	if (!sk_hashed(sk)) {
 		err = netlink_autobind(sock);
 		if (err)
 			goto out;

^ permalink raw reply related [flat|nested] 38+ messages in thread
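To make the userspace side of the patch above a little more concrete, here is a hedged sketch of how a container manager might request the kernel's port id when binding a uevent socket. It follows only what the netlink_bind() hunk checks (nl_pid == 0 together with nl_pad == 0xffff); the macro and function names are invented for this example, and none of it has been run against a kernel carrying the patch.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>
#include <linux/netlink.h>

/* Value the patched netlink_bind() looks for in nl_pad. The macro name is
 * made up here; the patch itself uses the bare literal 0xffff. */
#define UEVENT_NL_PAD_IMPERSONATE 0xffff

/* Fill in a bind address that asks for the kernel's port id (0). */
void uevent_impersonate_addr(struct sockaddr_nl *addr)
{
	assert(addr != NULL);
	memset(addr, 0, sizeof(*addr));
	addr->nl_family = AF_NETLINK;
	addr->nl_pid = 0;				/* kernel's port id */
	addr->nl_pad = UEVENT_NL_PAD_IMPERSONATE;	/* patch-specific request */
}

/*
 * Open a uevent socket in the caller's network namespace and try to bind
 * it as port id 0. Returns the fd, or -1 on failure.
 * Assumption: this is only meaningful on a kernel with the patch applied.
 * Per the patch, the bind is refused with EPERM in the initial user
 * namespace or without CAP_NET_ADMIN in the namespace that owns the netns.
 * An unpatched kernel is expected to ignore nl_pad and simply autobind.
 */
int uevent_open_as_kernel(void)
{
	struct sockaddr_nl addr;
	int fd = socket(AF_NETLINK, SOCK_DGRAM, NETLINK_KOBJECT_UEVENT);

	if (fd < 0)
		return -1;
	uevent_impersonate_addr(&addr);
	if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}
```

The intent, as far as the patch suggests, is that a manager holding such a socket inside the container's network namespace could then send uevent-formatted messages that listeners in the container would treat as coming from the kernel. The sending side is left out because the thread does not spell it out.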
* Re: Device Namespaces [not found] ` <87had0wz07.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> 2013-10-01 20:46 ` Serge Hallyn 2013-10-01 20:57 ` Greg Kroah-Hartman @ 2013-10-01 22:19 ` Michael H. Warfield 2 siblings, 0 replies; 38+ messages in thread From: Michael H. Warfield @ 2013-10-01 22:19 UTC (permalink / raw) To: Eric W. Biederman Cc: Kay Sievers, Linux Containers, lxc-devel, Andy Lutomirski, Greg Kroah-Hartman, Stephane Graber, devel [-- Attachment #1.1: Type: text/plain, Size: 5040 bytes --] On Tue, 2013-10-01 at 12:51 -0700, Eric W. Biederman wrote: > "Serge E. Hallyn" <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> writes: > > > Quoting Andy Lutomirski (luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org): > >> On Tue, Oct 1, 2013 at 7:19 AM, Janne Karhunen <janne.karhunen@gmail.com> wrote: > >> > On Thu, Sep 26, 2013 at 8:33 AM, Greg Kroah-Hartman > >> > <gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r@public.gmane.org> wrote: > >> > > >> >>> - We can relay a call of /sbin/hotplug from outside of a container > >> >>> to inside of a container based on policy. > >> >>> (But no one uses /sbin/hotplug anymore). > >> >> > >> >> That's right, they should be listening to libudev events, so why can't > >> >> your daemon shuffle them off to the proper container, all in userspace? > >> > > >> > Which reminds me, one potential reason being.. > >> > http://lists.linuxfoundation.org/pipermail/containers/2013-May/032591.html > >> > > >> > >> Can't the daemon live outside the container and shuffle stuff in? > > > > That's exactly what Michael Warfield is suggesting, fwiw. > Michael Warfields example of dynamically assigning serial ports to > containers is a pretty good test case. Serial ports are extremely well > known kernel objects who evolution effectively stopped long ago. When > we need it we have ptys to virtual serial ports when we need it, but in > general unprivileged users are safe to directly use a serial port > device. > Glossing over the details. 
The general problem is some policy exists > outside of the container that deciedes if an when a container gets a > serial port and stuffs it in. Actually, I don't necessarily see that as a problem as much as a necessity. If a container can decide when it gets a serial port or other device, I would think that would constitute a security issue and a container isolation violation. Restricting which container can have access to what has to be determined in the host and, once you've drunk that Kool-Aid, you might as well stuff it in somewhere. Policy has to be in the host or you will never get the security corner cases right. Ultimately, it is the host which is in charge of the hardware and is managing the containers (it can start them up, shut them down, or manage them) so, at its base level, it is the responsibility of the host to manage those devices between the containers. That being said, there is the additional issue of what the container does when we hand it a device and how we let it know. That's now classically the issue of udev and formerly hotplug and their predecessors... > The expectation is that system containers will then run the udev > rules and send the libuevent event. Which makes sense. Something along the lines of a socket into the container to send selected events from the user space daemon in the host would make some sense there. > To make that all work without kernel modifications requires placing > a faux-udev in the container, that listens for a device assignment from > outside the container and then does exactly what udev would have done. > The problems with this that I see are: > - udev is a moving target making it hard to build a faux-udev that will > work everywhere. Well, it is and it isn't. Yeah, the rules have been changing (I'm getting tired of the "deprecated" rule warnings) but I've seen worse, much worse. 
> - On distro's running systemd and udev integration is sufficiently tight > that I am not certain a faux-udev is possible or will continue to be > possible. Actually, I think that's a non-issue. IIRC, systemd (now) discontinues its udev operation when it detects it's in a container. That was at the heart of the entire Fedora 15/16 in a container meltdown with the broken versions of systemd trying to run udev in the container. What do we do in place of it? I don't know. > - There are two other widely deployed solutions for managing hotplug > devices besides udev. > So given these difficulties I do not believe that the evolution of linux > device management is done, and that patches to udev, the kernel or both > will be needed. While it would be good for testing and understanding > the problem I don't think a faux-udev will be a long term maintainable > solution. > I also understand the point that we aren't talking patches yet and just > discussing ideas. Right now it is my hope that if we talk this out we > can figure out a general direction that has a hope of working. > From where I am standing faking uevents instead of replacing > udev/mdev/whatever looks simpler and more maintainable. > Eric Mike -- Michael H. Warfield (AI4NB) | (770) 985-6132 | mhw-BetbSzk+GohWk0Htik3J/w@public.gmane.org /\/\|=mhw=|\/\/ | (678) 463-0932 | http://www.wittsend.com/mhw/ NIC whois: MHW9 | An optimist believes we live in the best of all PGP Key: 0x674627FF | possible worlds. A pessimist is sure of it! ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Device Namespaces [not found] ` <CALCETrWWoHzuJcnfEUY+cFpOgT5gnG8U1cVbCW0_8V7Z_v6DJw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2013-10-01 17:53 ` Serge E. Hallyn @ 2013-10-01 18:36 ` Janne Karhunen 1 sibling, 0 replies; 38+ messages in thread From: Janne Karhunen @ 2013-10-01 18:36 UTC (permalink / raw) To: Andy Lutomirski Cc: Greg Kroah-Hartman, Linux Containers, Kay Sievers, Stephane Graber, Eric W. Biederman, lxc-devel, mhw, devel On Tue, Oct 1, 2013 at 8:27 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote: >> http://lists.linuxfoundation.org/pipermail/containers/2013-May/032591.html > > Can't the daemon live outside the container and shuffle stuff in? > IOW, there seems to be little point in containerizing things if you're > just going to punch a privilege hole in the namespace. Yeah. I will try to experiment just how much can be 'stuffed in' without effective caps. It certainly would be better this way. > FWIW, I think that the capability evolution rules are crap, but > changing them is a can of worms, and enough people seem to thing the > status quo is acceptable that this is unlikely to ever get fixed. I have noted (Casey almost tried to strangle me during the last security summit for even daring to talk about it). -- Janne ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Device Namespaces [not found] ` <CAE=NcrYV2RiMV7PcwEjFGFRBrz9XdZGs86Wau2a+6xpYN2aEHA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2013-10-01 17:27 ` Andy Lutomirski @ 2013-10-01 17:33 ` Greg Kroah-Hartman [not found] ` <20131001173342.GA19267-U8xfFu+wG4EAvxtiuMwx3w@public.gmane.org> 1 sibling, 1 reply; 38+ messages in thread From: Greg Kroah-Hartman @ 2013-10-01 17:33 UTC (permalink / raw) To: Janne Karhunen Cc: Linux Containers, Kay Sievers, Andy Lutomirski, devel, Eric W. Biederman, lxc-devel, mhw, Stephane Graber On Tue, Oct 01, 2013 at 09:19:58AM +0300, Janne Karhunen wrote: > On Thu, Sep 26, 2013 at 8:33 AM, Greg Kroah-Hartman > <gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r@public.gmane.org> wrote: > > >> - We can relay a call of /sbin/hotplug from outside of a container > >> to inside of a container based on policy. > >> (But no one uses /sbin/hotplug anymore). > > > > That's right, they should be listening to libudev events, so why can't > > your daemon shuffle them off to the proper container, all in userspace? > > Which reminds me, one potential reason being.. > http://lists.linuxfoundation.org/pipermail/containers/2013-May/032591.html I really wish I had never seen that patch, and I am glad it was rejected. ^ permalink raw reply [flat|nested] 38+ messages in thread
[parent not found: <20131001173342.GA19267-U8xfFu+wG4EAvxtiuMwx3w@public.gmane.org>]
* Re: Device Namespaces [not found] ` <20131001173342.GA19267-U8xfFu+wG4EAvxtiuMwx3w@public.gmane.org> @ 2013-10-01 18:23 ` Janne Karhunen 0 siblings, 0 replies; 38+ messages in thread From: Janne Karhunen @ 2013-10-01 18:23 UTC (permalink / raw) To: Greg Kroah-Hartman Cc: Linux Containers, Kay Sievers, Andy Lutomirski, devel, Eric W. Biederman, lxc-devel, mhw, Stephane Graber On Tue, Oct 1, 2013 at 8:33 PM, Greg Kroah-Hartman <gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r@public.gmane.org> wrote: >> > That's right, they should be listening to libudev events, so why can't >> > your daemon shuffle them off to the proper container, all in userspace? >> >> Which reminds me, one potential reason being.. >> http://lists.linuxfoundation.org/pipermail/containers/2013-May/032591.html > > I really wish I had never seen that patch, and I am glad it was > rejected. Thanks, I agree. Just wanted to point out the reason and bring up the discussion. -- Janne ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Device Namespaces [not found] ` <87bo3gshz5.fsf_-_-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> 2013-09-26 5:33 ` Greg Kroah-Hartman @ 2013-10-28 23:31 ` Andrey Wagin 1 sibling, 0 replies; 38+ messages in thread From: Andrey Wagin @ 2013-10-28 23:31 UTC (permalink / raw) To: Eric W. Biederman Cc: Greg Kroah-Hartman, Linux Containers, Kay Sievers, Andy Lutomirski, devel-GEFAQzZX7r8dnm+yROfE0A, lxc-devel, mhw-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Stephane Graber 2013/9/26 Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> > > > From conversations at Linux Plumbers Converence it became fairly clear > that one if not the roughest edge on containers today is dealing with > devices. > > - Hotplug does not work. > - There seems to be no implementation that does a much beyond creating > setting up a static set of /dev entries today. > - Containers do not see the appropriate uevents for their container. > > One of the more compelling cases I heard was of someone who was running > the a Linux Desktop in container and wanted to just let that container > see the devices needed for his desktop, and not everything else. I have experience implementing this functionality in the OpenVZ kernel. I was required not to modify user-space tools, so the implementation looks like a dirty hack, but even hotplug of devices works there. .... > > So the big issues for a device namespace to solve are filtering which > devices a container has access to and being able to dynamically change > which devices those are at run time (aka hotplug). > > After having thought about this for a bit I don't know if a pure > userspace solution is sufficient or actually a good idea. I would prefer to think a bit more about a userspace solution. We can try to extend udev functionality. > > - We can manually manage a tmpfs with device nodes in userspace. > (But that is deprecated functionality in the mainstream kernel). > - We can manually export a subset of sysfs with bind mounts. 
> (But that feels hacky, and is essentially incompatible with hotplug). > - We can relay a call of /sbin/hotplug from outside of a container > to inside of a container based on policy. > (But no one uses /sbin/hotplug anymore). > - There is no way to fake netlink uevents for a container to see them. > (The best we could do is replace udev everywhere with something that > listens on a unix domain socket). or we can teach udev to listens on a unix domain socket. The host udev listens netlink. When it gets an event about a new device, it decides for which containers it must be avaliable, does all required actions and sends events in containers. Probably the protocol of notifications must be unified for all udev-like services. > > - It would be nice to replace the device cgroup with a comprehensive > solution that really works. (Among other things the device cgroup > does not work in terms of struct device the underlying kernel > abstraction for devices). > > We must manage sysfs entries as well device nodes because: > - Seeing more than we should has the real potential to confuse > userspace, especially a userspace that replays uevents. > - Some device control must happens through writing to sysfs files and > if we don't remove all root privileges from a container only by > exporting a subset of sysfs to that container can we limit which > sysfs nodes can be written to. Sorry if a following idea will sound crazy. Can we use fuse filesystems for filtering sysfs and devtmpfs? When a CT mounts sysfs, it will mount fuse-sysfs, which is implemented by userspace program on host system. * This way allows to emulate the behavior of uevent files in containers, if we will use unix sockets between udev services. * Probably a userspace daemon will be more flexible and customizable than something in kernel Do we have a use case when a perfomance of sysfs is critical? Thanks, Andrey ^ permalink raw reply [flat|nested] 38+ messages in thread
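The userspace relay described in the message above (host daemon reads kernel uevents from netlink, decides which containers should see each one, and forwards it over a unix domain socket) could be sketched roughly as follows. This is only an illustration, not code from the thread or from udev: the function names, the `policy` mapping from container name to allowed subsystems, and the per-container sockets are all hypothetical. The sketch assumes the kernel's uevent wire format (`ACTION@DEVPATH` followed by NUL-separated `KEY=VALUE` pairs) and does not handle libudev-formatted messages.

```python
import socket

NETLINK_KOBJECT_UEVENT = 15  # netlink protocol number for kernel uevents


def open_kernel_uevent_socket():
    """Open the host-side netlink socket (Linux only, usually needs root)."""
    sock = socket.socket(socket.AF_NETLINK, socket.SOCK_DGRAM,
                         NETLINK_KOBJECT_UEVENT)
    sock.bind((0, 1))  # pid 0: kernel picks an id; group 1: kernel uevents
    return sock


def parse_uevent(payload):
    """Parse a kernel uevent payload into a dict of its KEY=VALUE fields."""
    event = {}
    # The first field is the "ACTION@DEVPATH" summary; the rest are KEY=VALUE.
    for field in payload.split(b"\0")[1:]:
        key, sep, value = field.partition(b"=")
        if sep:
            event[key.decode()] = value.decode()
    return event


def containers_for(event, policy):
    """Hypothetical policy: a container sees an event if it allows the subsystem."""
    return sorted(name for name, subsystems in policy.items()
                  if event.get("SUBSYSTEM") in subsystems)


def relay(payload, policy, container_socks):
    """Forward one raw uevent to every container the policy selects."""
    targets = containers_for(parse_uevent(payload), policy)
    for name in targets:
        container_socks[name].send(payload)
    return targets
```

In a real deployment the host loop would call `relay(sock.recv(65536), ...)` on the socket returned by `open_kernel_uevent_socket()`, and each entry in `container_socks` would be a connected unix socket whose other end lives inside the container. Creating device nodes and the "required actions" mentioned above are deliberately left out.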
* Re: RFC: Device Namespaces
  [not found] ` <CAA4jN2aw4zEW=UfKCyqaOvXnbiRb_J9srfCn4OXTFzc6vWBM4A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2013-08-22 18:21 ` Serge Hallyn
@ 2013-08-29 19:06 ` Andy Lutomirski
  [not found] ` <521F9BBE.2070505-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org>
  1 sibling, 1 reply; 38+ messages in thread
From: Andy Lutomirski @ 2013-08-29 19:06 UTC
To: Oren Laadan
Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, lxc-devel

On 08/22/2013 10:43 AM, Oren Laadan wrote:
> Hi everyone!
>
> We [1] have been working on bringing lightweight virtualization to
> Linux-based mobile devices like Android (or other Linux-based devices with
> diverse I/O) and want to share our solution: device namespaces.

Have you looked at systemd-logind? It seems to do something similar.
* Re: [lxc-devel] RFC: Device Namespaces
  [not found] ` <521F9BBE.2070505-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org>
@ 2013-09-03 19:35 ` Stéphane Graber
  0 siblings, 0 replies; 38+ messages in thread
From: Stéphane Graber @ 2013-09-03 19:35 UTC
To: Andy Lutomirski
Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, lxc-devel

On Thu, Aug 29, 2013 at 12:06:38PM -0700, Andy Lutomirski wrote:
> On 08/22/2013 10:43 AM, Oren Laadan wrote:
> > Hi everyone!
> >
> > We [1] have been working on bringing lightweight virtualization to
> > Linux-based mobile devices like Android (or other Linux-based devices with
> > diverse I/O) and want to share our solution: device namespaces.
>
> Have you looked at systemd-logind? It seems to do something similar.

logind can tell you which user sessions exist and which of them have
console access. It also creates and, to some extent, manages cgroups, but
it does nothing of what the device namespace would do at the kernel level.

The main benefit of having a device namespace in the kernel would be that
a container only gets the uevents and device access for devices that are
either owned by it or shared with it.

Being able to have fake devices replace some of the standard ones would
also be nice to have.

--
Stéphane Graber
Ubuntu developer
http://www.ubuntu.com
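The "owned by or shared with the container" rule from the message above can be written down as a small model. This is a toy sketch under stated assumptions, not a description of any kernel interface: the `Device` record, its `owner` and `shared_with` fields, and the function names are hypothetical, and the replacement of standard devices by fake ones is not modeled.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class Device:
    """Hypothetical record of who may see a device."""
    devpath: str
    owner: Optional[str] = None              # container that owns the device
    shared_with: FrozenSet[str] = frozenset()  # containers it is shared with


def visible_to(device: Device, container: str) -> bool:
    """A container sees a device only if it owns it or it is shared with it."""
    return device.owner == container or container in device.shared_with


def visible_devices(devices: List[Device], container: str) -> List[str]:
    """Devpaths whose uevents and device nodes the container should get."""
    return [d.devpath for d in devices if visible_to(d, container)]
```

For example, a framebuffer owned by one container and an audio device shared between two would each show up only for the containers named in their records; everything else stays invisible.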