containers.lists.linux.dev archive mirror
* device namespaces
@ 2021-06-08  9:38 Enrico Weigelt, metux IT consult
  2021-06-08 12:30 ` Christian Brauner
  0 siblings, 1 reply; 48+ messages in thread
From: Enrico Weigelt, metux IT consult @ 2021-06-08  9:38 UTC
  To: containers, linux-kernel

Hello folks,


I'm going to implement device namespaces, where containers can get an
entirely different view of the devices in the machine (usually just a
specific subset, but possibly additional virtual devices).

To start, I'd like to add a simple mapping of dev maj/min (leaving aside
sysfs, udev, etc). An important requirement for me is that the parent ns
can choose to delegate devices from those it has full access to (child
namespaces can do the same for their children), and that the assignment
can change. For simplicity I'm ignoring the case of removing devices that
are already opened by some process; I haven't decided yet whether they
should be forcefully closed or whether keeping them open is a valid use
case.
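
A rough userspace model of such a maj/min mapping might look like the
following. This is only a sketch to make the idea concrete: the DevMap
class, its method names, and the char/block tag in the key are assumptions
for illustration, not a proposed kernel interface.

```python
import os


class DevMap:
    """Hypothetical per-namespace table: device number seen inside the
    namespace -> device number on the host."""

    def __init__(self):
        # Empty table: nothing is visible until something is mapped.
        self.table = {}

    def add(self, kind, inner, host):
        """Expose host (major, minor) as inner (major, minor).
        kind is "c" or "b", since char and block numbers overlap."""
        self.table[(kind, os.makedev(*inner))] = os.makedev(*host)

    def translate(self, kind, inner):
        """Return the host (major, minor), or None if the device is unmapped."""
        host = self.table.get((kind, os.makedev(*inner)))
        if host is None:
            return None
        return os.major(host), os.minor(host)


ns = DevMap()
ns.add("c", inner=(1, 3), host=(1, 3))    # /dev/null passed through unchanged
ns.add("b", inner=(8, 0), host=(8, 16))   # first disk inside is the host's second

print(ns.translate("b", (8, 0)))   # (8, 16)
print(ns.translate("b", (8, 1)))   # None: not mapped, so not visible
```

Lookups for anything that was never mapped simply fail, which matches the
"specific subset" view described above.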

The big question for me now is how exactly to do the table maintenance
from userland. We already have entries in /proc/<pid>/ns/*. I'm thinking
about using them as command channel, like this:

* new child namespaces are created with an empty mapping
* mapping manipulation is done by just writing commands to the ns file
* access is only granted if the writing process itself is in the
  parent's device ns and has CAP_SYS_ADMIN (or maybe there could be some
  admin user for the ns? or the 'root' of the corresponding user_ns?)
* if the caller has restrictions on some particular device, these
  are automatically added (e.g. if you're restricted to read-only, you
  can't give rw to the child ns).
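
The rules in the list above can be modelled in a few lines. This is a
sketch under stated assumptions: DevNS, delegate() and the READ/WRITE bits
are made-up names, the CAP_SYS_ADMIN check is reduced to a boolean, and the
command syntax written to the ns file is left out because it has not been
defined.

```python
READ, WRITE = 1, 2  # hypothetical permission bits


class DevNS:
    """Hypothetical device namespace: (major, minor) -> permission bits."""

    def __init__(self, parent=None):
        self.parent = parent
        self.access = {}  # new namespaces start with an empty mapping


def delegate(caller_ns, has_cap_sys_admin, child, dev, wanted):
    """Give `child` access to `dev`, capped by what the caller itself holds."""
    # Only a privileged process in the parent's device ns may edit the mapping.
    if child.parent is not caller_ns or not has_cap_sys_admin:
        raise PermissionError("caller may not edit this namespace's mapping")
    # The caller's own restrictions carry over: read-only cannot become rw.
    granted = wanted & caller_ns.access.get(dev, 0)
    if not granted:
        raise PermissionError("caller holds none of the requested access")
    child.access[dev] = granted
    return granted


host = DevNS()
host.access[(1, 3)] = READ | WRITE
host.access[(8, 0)] = READ            # host side is restricted to read-only

container = DevNS(parent=host)
granted = delegate(host, True, container, (8, 0), READ | WRITE)
print(granted == READ)                # True: the write bit was dropped
```

Because the grant is the intersection of the request and the caller's own
access, the same function covers nested delegation from a child to its own
children.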

Is this a good way to go? Or what would be a better one?


--mtx

-- 
---
Note: unencrypted e-mails can easily be intercepted and manipulated!
For confidential communication, please send your GPG/PGP key.
---
Enrico Weigelt, metux IT consult
Free software and Linux embedded engineering
info@metux.net -- +49-151-27565287

* Re: Device Namespaces
@ 2013-09-29 19:28 Amir Goldstein
       [not found] ` <CAA2m6veny-7_ONMA973Wu36U4kz4gAuw0dpodkb8+GZDv6VNBQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 48+ messages in thread
From: Amir Goldstein @ 2013-09-29 19:28 UTC
  To: Greg Kroah-Hartman
  Cc: Linux Containers, Kay Sievers, Andy Lutomirski,
	devel-GEFAQzZX7r8dnm+yROfE0A, Eric W. Biederman, lxc-devel,
	mhw-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Stephane Graber

On Thu, Sep 26, 2013 at 8:33 AM, Greg Kroah-Hartman
<gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r@public.gmane.org> wrote:

> On Wed, Sep 25, 2013 at 02:34:54PM -0700, Eric W. Biederman wrote:
> > So the big issues for a device namespace to solve are filtering which
> > devices a container has access to and being able to dynamically change
> > which devices those are at run time (aka hotplug).
>
> As _all_ devices are hotpluggable now (look, there's no CONFIG_HOTPLUG
> anymore, because it was redundant), I think you need to really think
> this through better (pci, memory, cpus, etc.) before you do anything in
> the kernel.
>
> > After having thought about this for a bit I don't know if a pure
> > userspace solution is sufficient or actually a good idea.
> >
> > - We can manually manage a tmpfs with device nodes in userspace.
> >   (But that is deprecated functionality in the mainstream kernel).
>
> Yes, but I'm not going to namespace devtmpfs, as that is going to be an
> impossible task, right?
>

That sounds like a challenge ;-)
Seriously, as Serge correctly noted, it would not be that different from
devpts if you start from an empty devtmpfs and populate it with devices
that are "added in the context of that namespace".
The semantics by which devices are "added in the context of a namespace"
are the missing piece of the puzzle.

What we would really like to see is a setns() style API that can be used
to add a device in the context of a namespace in either a "shared" or
"private" mode.
This kind of API is a required building block for us to write device
drivers that are namespace aware in a way that gives userspace enough
flexibility for dynamic configuration.

We are trying to come up with a proposal for that sort of API.
When we have something decent, we shall post it.


> And remember, udev doesn't create device nodes anymore...
>
> > - We can manually export a subset of sysfs with bind mounts.
> >   (But that feels hacky, and is essentially incompatible with hotplug).
>
> True.
>
> > - We can relay a call of /sbin/hotplug from outside of a container
> >   to inside of a container based on policy.
> >   (But no one uses /sbin/hotplug anymore).
>
> That's right, they should be listening to libudev events, so why can't
> your daemon shuffle them off to the proper container, all in userspace?
>
> > - There is no way to fake netlink uevents for a container to see them.
> >   (The best we could do is replace udev everywhere with something that
> >    listens on a unix domain socket).
>
> You shouldn't need to do this.
>
> > - It would be nice to replace the device cgroup with a comprehensive
> >   solution that really works. (Among other things the device cgroup
> >   does not work in terms of struct device the underlying kernel
> >   abstraction for devices).
>
> I didn't even know there was a device cgroup.
>
> Which means that if there is one, odds are it's useless.
>
> > We must manage sysfs entries as well as device nodes because:
> > - Seeing more than we should has the real potential to confuse
> >   userspace, especially a userspace that replays uevents.
>
> You should never replay uevents.  If you don't do that, why can't you
> see all of sysfs?
>
> > - Some device control must happen through writing to sysfs files, and
> >   if we don't remove all root privileges from a container, only by
> >   exporting a subset of sysfs to that container can we limit which
> >   sysfs nodes can be written to.
>
> But you have the issue of controlling devices in a "shared" way, which
> isn't going to be usable for almost all devices.
>
> > The current kernel tagged sysfs entry support does not look like a good
> > match for implementing device filtering.  The common case will
> > be allowing devices like /dev/zero, and /dev/null that live in
> > /sys/devices/virtual and are the devices we are most likely to care
> > about.  Those devices need to live in multiple device namespaces so
> > everyone can use them.  Perhaps exclusive assignment will be the more
> > common paradigm for device namespaces like it is for network devices in
> > the network namespace but from what little I can see of this problem
> > right now I don't think so.
> >
> > I definitely think we should hold off on a kernel level implementation
> > until we really understand the issues and are ready to implement device
> > namespaces correctly.
>
> I agree, especially as I don't think this will ever work.
>
> > A userspace implementation looks like it can only do about 95% of what
> > is really needed, but at the same time looks like an easy way to
> > experiment until the problem is sufficiently well understood.
>
> 95% is probably way better than what you have today, and will fit the
> needs of almost everyone today, so why not do it?
>
> I'd argue that those last 5% either are custom solutions that never get
> merged, or candidates for true virtualization.
>
> > In summary the situation with device hotplug and containers sucks today,
> > and we need to do something.  Running a linux desktop in a container is
> > a reasonably good example use case.
>
> No it isn't.  I'd argue that this is a horrible use case, one that you
> shouldn't do.  Why not just use multi-head machines like people do who
> really want to do this, relying on user separation?  That's a workable
> solution that is quite common and works very well today.
>
> > Having one standard common maintainable implementation would be very
> > useful and the most logical place for that would be in the kernel.
> > For now we should focus on simple device filtering and hotplug.
>
> Just listen for libudev stuff, don't try to filter them, or ever
> "replay" them, that way lies madness, and lots of nasty race conditions
> that are guaranteed to break things.
>
> good luck,
>
> greg k-h
>

* RFC: Device Namespaces
@ 2013-08-22 17:43 Oren Laadan
  2013-08-22 18:21 ` Serge Hallyn
  0 siblings, 1 reply; 48+ messages in thread
From: Oren Laadan @ 2013-08-22 17:43 UTC
  To: Linux Containers; +Cc: lxc-devel

Hi everyone!

We [1] have been working on bringing lightweight virtualization to
Linux-based mobile devices like Android (or other Linux-based devices with
diverse I/O) and want to share our solution: device namespaces.

Imagine you could run several instances of your favorite mobile OS or other
distributions in isolated containers, each under the impression of having
exclusive access to device drivers, and interact with and switch between
them in the blink of an eye: no flashing, no reboot.

Device namespaces are an extension to existing Linux kernel namespaces that
brings lightweight virtualization to Linux-based end-user devices,
primarily mobile devices.
Device namespaces introduce a private and virtual namespace for device
drivers to create the illusion for a process group that it interacts
exclusively with a set of drivers. Device namespaces also introduce the
concepts of an “active” namespace with which a user interacts, vs
“non-active” namespaces that run in the background, and the ability to
switch between them.[2]
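
As a toy illustration of the active/non-active idea, consider the model
below. It is a sketch only: the DevNsSwitcher class is a made-up name, and
the rule that input events reach only the active namespace is an assumption
made for the example, not a description of how the patches work.

```python
class DevNsSwitcher:
    """Hypothetical model: several device namespaces, exactly one active."""

    def __init__(self, names):
        # Per-namespace record of the input events it received.
        self.inbox = {name: [] for name in names}
        # The first namespace starts out as the active one.
        self.active = next(iter(self.inbox))

    def switch(self, name):
        """Make another namespace the one the user interacts with."""
        if name not in self.inbox:
            raise KeyError(name)
        self.active = name

    def input_event(self, event):
        """Deliver an input event to the active namespace only."""
        self.inbox[self.active].append(event)


sw = DevNsSwitcher(["personal", "work"])
sw.input_event("tap")      # goes to "personal"
sw.switch("work")          # "personal" keeps running in the background
sw.input_event("swipe")    # goes to "work"
```

Each namespace sees only the events delivered while it was active, which is
the "illusion of exclusive access" in miniature.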

We are planning to prepare individual patches to be submitted to the
relevant maintainers and mailing lists. In the meantime, we want to share
a set of patches on top of the Android goldfish kernel 3.4 as well as a
user-space demo, so you can see where we are heading and how the approach
works.

We are aware that the patches are not ready for submission in their current
state, and we'd highly appreciate any feedback or suggestions that come to
mind once you have had a look [3]. We are particularly interested in working
out a proper userspace API that covers existing and future use cases. To
illustrate a simple use case we also provide a small userspace demo for
Android [4].

I will be presenting "The Case for Linux Device Namespace" [5] at LinuxCon
North America 2013 [6]. We will also be attending the Containers Track [7]
at LPC 2013 to present the current state of the patches and discuss the
best course to proceed.

We are looking forward to hearing from you!

Thanks,

Oren.


1: http://www.cellrox.com/
2: https://github.com/Cellrox/devns-patches/wiki/DeviceNamespace
3: https://github.com/Cellrox/devns-patches
4: https://github.com/Cellrox/devns-demo
5: http://sched.co/1asN1v7
6: http://events.linuxfoundation.org/events/linuxcon-north-america
7: http://www.linuxplumbersconf.org/2013/ocw/events/LPC2013/tracks/153

-- 
 Oren Laadan
 Cellrox Ltd.


end of thread, other threads:[~2021-06-15 11:33 UTC | newest]

Thread overview: 48+ messages
2021-06-08  9:38 device namespaces Enrico Weigelt, metux IT consult
2021-06-08 12:30 ` Christian Brauner
2021-06-08 12:41   ` Greg Kroah-Hartman
2021-06-08 14:10     ` Hannes Reinecke
2021-06-08 14:29       ` Christian Brauner
2021-06-08 15:54         ` Hannes Reinecke
2021-06-08 17:16           ` Eric W. Biederman
2021-06-09  6:38             ` Christian Brauner
2021-06-09  7:02               ` Hannes Reinecke
2021-06-09  7:21                 ` Christian Brauner
2021-06-09  7:54                   ` Hannes Reinecke
2021-06-09  8:09                     ` Christian Brauner
2021-06-11 18:14                       ` Eric W. Biederman
2021-06-14  7:49                         ` Enrico Weigelt, metux IT consult
2021-06-14  8:22                           ` Greg KH
2021-06-14 17:36                           ` Eric W. Biederman
2021-06-15 11:24                             ` Enrico Weigelt, metux IT consult
2021-06-15 11:33                               ` Greg KH
  -- strict thread matches above, loose matches on Subject: below --
2013-09-29 19:28 Device Namespaces Amir Goldstein
     [not found] ` <CAA2m6veny-7_ONMA973Wu36U4kz4gAuw0dpodkb8+GZDv6VNBQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-09-29 20:06   ` Greg Kroah-Hartman
     [not found]     ` <20130929200620.GA31304-U8xfFu+wG4EAvxtiuMwx3w@public.gmane.org>
2013-09-30 15:36       ` Michael H. Warfield
2013-10-03  0:44   ` Eric W. Biederman
     [not found]     ` <87a9iri3ot.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
2013-10-03  0:59       ` Eric W. Biederman
2013-10-03  8:58       ` Amir Goldstein
     [not found]         ` <CAA2m6vc3OFmS9VwiTavRzPqhn+qoe6vDCO2sitXpEQ8a1JVyfg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-10-03  9:17           ` Eric W. Biederman
2013-08-22 17:43 RFC: " Oren Laadan
2013-08-22 18:21 ` Serge Hallyn
2013-08-26 10:11   ` Oren Laadan
2013-09-06 17:50     ` Eric W. Biederman
2013-09-08 12:28       ` Amir Goldstein
2013-09-09  0:51         ` Eric W. Biederman
2013-09-10  7:09           ` Amir Goldstein
2013-09-25 11:05             ` Janne Karhunen
     [not found]               ` <CAE=NcrbyFFoMn2nfBA_=ZtwD=eGLvqK=L-U9MuGrtJFLZfZppw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-09-25 21:34                 ` Eric W. Biederman
     [not found]                   ` <87bo3gshz5.fsf_-_-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
2013-09-26  5:33                     ` Greg Kroah-Hartman
     [not found]                       ` <20130926053320.GB3725-U8xfFu+wG4EAvxtiuMwx3w@public.gmane.org>
2013-09-26  8:25                         ` Janne Karhunen
     [not found]                           ` <CAE=NcrbPXGWU8FUgwchXyL5HjXf+4AKbgUWGe1ZO=Xcq=iV-Lg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-09-26 13:56                             ` Greg Kroah-Hartman
     [not found]                               ` <20130926135604.GA16624-U8xfFu+wG4EAvxtiuMwx3w@public.gmane.org>
2013-09-26 17:01                                 ` Janne Karhunen
     [not found]                                   ` <CAE=NcrY3xC1AF_GV2b1KsF7AwYZTuGBuKLS5yBUWoWcmKU4YBg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-09-26 17:07                                     ` Greg Kroah-Hartman
     [not found]                                       ` <20130926170757.GA9345-U8xfFu+wG4EAvxtiuMwx3w@public.gmane.org>
2013-09-26 17:56                                         ` Janne Karhunen
2013-09-30 15:37                                         ` James Bottomley
     [not found]                                           ` <1380555439.2161.5.camel-sFMDBYUN5F8GjUHQrlYNx2Wm91YjaHnnhRte9Li2A+AAvxtiuMwx3w@public.gmane.org>
2013-09-30 16:11                                             ` Greg Kroah-Hartman
     [not found]                                               ` <20130930161117.GA26459-U8xfFu+wG4EAvxtiuMwx3w@public.gmane.org>
2013-09-30 16:33                                                 ` James Bottomley
2013-10-01  6:19                         ` Janne Karhunen
     [not found]                           ` <CAE=NcrYV2RiMV7PcwEjFGFRBrz9XdZGs86Wau2a+6xpYN2aEHA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-10-01 17:27                             ` Andy Lutomirski
     [not found]                               ` <CALCETrWWoHzuJcnfEUY+cFpOgT5gnG8U1cVbCW0_8V7Z_v6DJw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-10-01 17:53                                 ` Serge E. Hallyn
     [not found]                                   ` <20131001175345.GA4145-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
2013-10-01 19:51                                     ` Eric W. Biederman
     [not found]                                       ` <87had0wz07.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
2013-10-01 20:46                                         ` Serge Hallyn
2013-10-02 22:55                                           ` Eric W. Biederman
2013-10-01 20:57                                         ` Greg Kroah-Hartman
     [not found]                                           ` <20131001205718.GA17036-U8xfFu+wG4EAvxtiuMwx3w@public.gmane.org>
2013-10-02 22:45                                             ` Eric W. Biederman
2013-10-01 22:19                                         ` Michael H. Warfield
2013-10-01 18:36                                 ` Janne Karhunen
2013-10-01 17:33                             ` Greg Kroah-Hartman
     [not found]                               ` <20131001173342.GA19267-U8xfFu+wG4EAvxtiuMwx3w@public.gmane.org>
2013-10-01 18:23                                 ` Janne Karhunen
2013-10-28 23:31                     ` Andrey Wagin
