containers.lists.linux.dev archive mirror
 help / color / mirror / Atom feed
From: ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org (Eric W. Biederman)
To: Linux Containers
	<containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org>
Cc: Greg Kroah-Hartman
	<gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r@public.gmane.org>,
	mhw-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org,
	Kay Sievers <kay.sievers-tD+1rO4QERM@public.gmane.org>,
	Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org>,
	lxc-devel
	<lxc-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org>,
	Stephane Graber
	<stgraber-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org>,
	devel-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org
Subject: Device Namespaces
Date: Wed, 25 Sep 2013 14:34:54 -0700	[thread overview]
Message-ID: <87bo3gshz5.fsf_-_@xmission.com> (raw)
In-Reply-To: <CAE=NcrbyFFoMn2nfBA_=ZtwD=eGLvqK=L-U9MuGrtJFLZfZppw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> (Janne Karhunen's message of "Wed, 25 Sep 2013 14:05:32 +0300")


From conversations at Linux Plumbers Converence it became fairly clear
that one if not the roughest edge on containers today is dealing with
devices.

- Hotplug does not work.
- There seems to be no implementation that does a much beyond creating
  setting up a static set of /dev entries today.
- Containers do not see the appropriate uevents for their container.

One of the more compelling cases I heard was of someone who was running
the a Linux Desktop in container and wanted to just let that container
see the devices needed for his desktop, and not everything else.

Talking with the OpenVZ folks it appears that preserving device numbers
across checkpoint/restart is not currently an issue.  However they reuse
the same loopback minor number when they can which would hide this
issue.   So while it is clear we don't need to worry about migrating
an application that cares about major/minor numbers of filesystems right
now as the set of application that are migrated increases that situation
may change.  As the case with the network device ifindex has shown it is
possible to implement filtering now and later when there is a usecase it
is possible to expand filtering to actual namespace local identifiers.

Thinking about it for the case of container migration the simplest
solution for the rare application that needs something more may be to
figure out how to send a kernel hotplug event.  Something to think about
when we encounter them.

So the big issues for a device namespace to solve are filtering which
devices a container has access to and being able to dynamically change
which devices those are at run time (aka hotplug).

After having thought about this for a bit I don't know if a pure
userspace solution is sufficient or actually a good idea.

- We can manually manage a tmpfs with device nodes in userspace.
  (But that is deprecated functionality in the mainstream kernel).
- We can manually export a subset of sysfs with bind mounts.
  (But that feels hacky, and is essentially incompatible with hotplug).
- We can relay a call of /sbin/hotplug from outside of a container
  to inside of a container based on policy.
  (But no one uses /sbin/hotplug anymore).
- There is no way to fake netlink uevents for a container to see them.
  (The best we could do is replace udev everywhere with something that
   listens on a unix domain socket).
- It would be nice to replace the device cgroup with a comprehensive
  solution that really works. (Among other things the device cgroup
  does not work in terms of struct device the underlying kernel
  abstraction for devices).

We must manage sysfs entries as well device nodes because:
- Seeing more than we should has the real potential to confuse
  userspace, especially a userspace that replays uevents.
- Some device control must happens through writing to sysfs files and
  if we don't remove all root privileges from a container only by
  exporting a subset of sysfs to that container can we limit which
  sysfs nodes can be written to.

The current kernel tagged sysfs entry support does not look like a good
match for the impelementing device filtering.   The common case will
be allowing devices like /dev/zero, and /dev/null that live in
/sys/devices/virtual and are the devices we are most likely to care
about.  Those devices need to live in multiple device namespaces so
everyone can use them.  Perhaps exclusive assignment will be the more
common paradigm for device namespaces like it is for network devices in
the network namespace but from what little I can of this problem right now I
don't think so.

I definitely think we should hold off on a kernel level implementation
until we really understand the issues and are ready to implement device
namespaces correctly.

A userspace implementation looks like it can only do about 95% of what
is really needed, but at the same time looks like an easy way to
experiment until the problem is sufficiently well understood.

At the end of the day we need to filter the devices a set of userspace
processes can use and be able to change that set of devices dynamically.
All of the rest of the infrastructure for that lives in the kernel, and
keeping all of the infrastructure in one place where it can be
maintained together is likely to be most maintainable.  It looks like
the code is just complicated enough and the use cases just boring enough
that spreading the code to perform container device hotplug and
container device filtering between a dozen userspace tools, and a hadful
of userspace device managers will not be particularly managable at the
end of the day.

In summary the situation with device hoptlug and containers sucks today,
and we need to do something.  Running a linux desktop in a container is
a reasonably good example use case.  Having one standard common
maintainable implementation would be very useful and the most logical
place for that would be in the kernel.  For now we should focus on
simple device filtering and hotplug.

Eric

  parent reply	other threads:[~2013-09-25 21:34 UTC|newest]

Thread overview: 63+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-08-22 17:43 RFC: Device Namespaces Oren Laadan
     [not found] ` <CAA4jN2aw4zEW=UfKCyqaOvXnbiRb_J9srfCn4OXTFzc6vWBM4A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-08-22 18:21   ` Serge Hallyn
2013-08-26 10:11     ` Oren Laadan
     [not found]       ` <CAA4jN2YL7Lfu2+DW-i+MovFxWEhJfT4aBBKREU_vy7JX9TKGHA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-09-06 17:50         ` Eric W. Biederman
     [not found]           ` <8761udlu0d.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
2013-09-08 12:28             ` Amir Goldstein
     [not found]               ` <CAA2m6vexArJ+6jFbK80Amstk=LK30=XDNHdBHSswP=LgpSP-6A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-09-09  0:51                 ` Eric W. Biederman
     [not found]                   ` <871u4yddg4.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
2013-09-10  7:09                     ` Amir Goldstein
     [not found]                       ` <CAA2m6vc_kWWGDWcdjk26N3YvTqZySLFxPQRjOD9_ypBOka2+GQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-09-25 11:05                         ` Janne Karhunen
     [not found]                           ` <CAE=NcrbyFFoMn2nfBA_=ZtwD=eGLvqK=L-U9MuGrtJFLZfZppw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-09-25 20:23                             ` Eric W. Biederman
     [not found]                               ` <87ioxo4pm5.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
2013-09-25 21:17                                 ` [lxc-devel] " Jeremy Andrus
     [not found]                                   ` <AD5F7BD2-0166-46BD-AB14-463C0E88BC92-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2013-09-25 21:47                                     ` Eric W. Biederman
     [not found]                                       ` <8738osr2ue.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
2013-09-29 17:56                                         ` Amir Goldstein
2013-09-25 21:34                             ` Eric W. Biederman [this message]
     [not found]                               ` <87bo3gshz5.fsf_-_-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
2013-09-26  5:33                                 ` Greg Kroah-Hartman
     [not found]                                   ` <20130926053320.GB3725-U8xfFu+wG4EAvxtiuMwx3w@public.gmane.org>
2013-09-26  8:25                                     ` Janne Karhunen
     [not found]                                       ` <CAE=NcrbPXGWU8FUgwchXyL5HjXf+4AKbgUWGe1ZO=Xcq=iV-Lg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-09-26 13:56                                         ` Greg Kroah-Hartman
     [not found]                                           ` <20130926135604.GA16624-U8xfFu+wG4EAvxtiuMwx3w@public.gmane.org>
2013-09-26 17:01                                             ` Janne Karhunen
     [not found]                                               ` <CAE=NcrY3xC1AF_GV2b1KsF7AwYZTuGBuKLS5yBUWoWcmKU4YBg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-09-26 17:07                                                 ` Greg Kroah-Hartman
     [not found]                                                   ` <20130926170757.GA9345-U8xfFu+wG4EAvxtiuMwx3w@public.gmane.org>
2013-09-26 17:56                                                     ` Janne Karhunen
2013-09-30 15:37                                                     ` James Bottomley
     [not found]                                                       ` <1380555439.2161.5.camel-sFMDBYUN5F8GjUHQrlYNx2Wm91YjaHnnhRte9Li2A+AAvxtiuMwx3w@public.gmane.org>
2013-09-30 16:11                                                         ` Greg Kroah-Hartman
     [not found]                                                           ` <20130930161117.GA26459-U8xfFu+wG4EAvxtiuMwx3w@public.gmane.org>
2013-09-30 16:33                                                             ` James Bottomley
2013-10-01  6:19                                     ` Janne Karhunen
     [not found]                                       ` <CAE=NcrYV2RiMV7PcwEjFGFRBrz9XdZGs86Wau2a+6xpYN2aEHA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-10-01 17:27                                         ` Andy Lutomirski
     [not found]                                           ` <CALCETrWWoHzuJcnfEUY+cFpOgT5gnG8U1cVbCW0_8V7Z_v6DJw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-10-01 17:53                                             ` Serge E. Hallyn
     [not found]                                               ` <20131001175345.GA4145-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
2013-10-01 19:51                                                 ` Eric W. Biederman
     [not found]                                                   ` <87had0wz07.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
2013-10-01 20:46                                                     ` Serge Hallyn
2013-10-01 22:59                                                       ` [lxc-devel] " Michael H. Warfield
2013-10-02 22:55                                                       ` Eric W. Biederman
2013-10-01 20:57                                                     ` Greg Kroah-Hartman
     [not found]                                                       ` <20131001205718.GA17036-U8xfFu+wG4EAvxtiuMwx3w@public.gmane.org>
2013-10-02 22:45                                                         ` Eric W. Biederman
2013-10-01 22:19                                                     ` Michael H. Warfield
2013-10-01 18:36                                             ` Janne Karhunen
2013-10-01 17:33                                         ` Greg Kroah-Hartman
     [not found]                                           ` <20131001173342.GA19267-U8xfFu+wG4EAvxtiuMwx3w@public.gmane.org>
2013-10-01 18:23                                             ` Janne Karhunen
2013-10-28 23:31                                 ` Andrey Wagin
2013-08-29 19:06   ` RFC: " Andy Lutomirski
     [not found]     ` <521F9BBE.2070505-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org>
2013-09-03 19:35       ` [lxc-devel] " Stéphane Graber
2013-09-29 19:28 Amir Goldstein
     [not found] ` <CAA2m6veny-7_ONMA973Wu36U4kz4gAuw0dpodkb8+GZDv6VNBQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-09-29 20:06   ` Greg Kroah-Hartman
     [not found]     ` <20130929200620.GA31304-U8xfFu+wG4EAvxtiuMwx3w@public.gmane.org>
2013-09-30 15:36       ` Michael H. Warfield
2013-10-03  0:44   ` Eric W. Biederman
     [not found]     ` <87a9iri3ot.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
2013-10-03  0:59       ` Eric W. Biederman
2013-10-03  8:58       ` Amir Goldstein
     [not found]         ` <CAA2m6vc3OFmS9VwiTavRzPqhn+qoe6vDCO2sitXpEQ8a1JVyfg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-10-03  9:17           ` Eric W. Biederman
2021-06-08  9:38 device namespaces Enrico Weigelt, metux IT consult
2021-06-08 12:30 ` Christian Brauner
2021-06-08 12:41   ` Greg Kroah-Hartman
2021-06-08 14:10     ` Hannes Reinecke
2021-06-08 14:29       ` Christian Brauner
2021-06-08 15:54         ` Hannes Reinecke
2021-06-08 17:16           ` Eric W. Biederman
2021-06-09  6:38             ` Christian Brauner
2021-06-09  7:02               ` Hannes Reinecke
2021-06-09  7:21                 ` Christian Brauner
2021-06-09  7:54                   ` Hannes Reinecke
2021-06-09  8:09                     ` Christian Brauner
2021-06-11 18:14                       ` Eric W. Biederman
2021-06-14  7:49                         ` Enrico Weigelt, metux IT consult
2021-06-14  8:22                           ` Greg KH
2021-06-14 17:36                           ` Eric W. Biederman
2021-06-15 11:24                             ` Enrico Weigelt, metux IT consult
2021-06-15 11:33                               ` Greg KH

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87bo3gshz5.fsf_-_@xmission.com \
    --to=ebiederm-as9lmozglivwk0htik3j/w@public.gmane.org \
    --cc=containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org \
    --cc=devel-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org \
    --cc=gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r@public.gmane.org \
    --cc=kay.sievers-tD+1rO4QERM@public.gmane.org \
    --cc=luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org \
    --cc=lxc-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org \
    --cc=mhw-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org \
    --cc=stgraber-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).