From: "Enrico Weigelt, metux IT consult" <lkml@metux.net>
To: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Christian Brauner <christian.brauner@ubuntu.com>,
	Hannes Reinecke <hare@suse.de>,
	gregkh@linuxfoundation.org, containers@lists.linux.dev,
	linux-kernel@vger.kernel.org
Subject: Re: device namespaces
Date: Tue, 15 Jun 2021 13:24:24 +0200
Message-ID: <2bee9206-ca4a-8456-fabf-9557db599545@metux.net>
In-Reply-To: <874ke0s60c.fsf@disp2133>

On 14.06.21 19:36, Eric W. Biederman wrote:

> By virtual devices I mean all devices that are not physical pieces
> of hardware.  For block devices I mean devices such as loopback
> devices that are created on demand.  Ramdisks that start this
> conversation could also be considered virtual devices.

OK. Do you also count partitions here?

IMHO we've got another category to look at: devices that (can) create
more (sub)devices. Examples that come to mind are loopdev, ptmx,
partitions, etc.

The big problem here: first we'd need to be clear on the actual
semantics in a namespaced context, for example:

* what happens when you talk to /dev/loop0 and create a new loopdev
   inside a container - should it ever be visible on the host?

* what if you want to create a loopdev on some file that's only visible
   to the host, but that loopdev should appear inside a container?
   ("virtual disk" scenario)

>> How would you skip the virtual devices from sysfs ? Adding some filter
>> into sysfs that looks at the device class (or some flag within it) ?
> 
> I would just not run the code to create sysfs entries when the virtual
> devices are created.

Oh, that would most likely make userland unhappy.

Besides, that won't be so trivial due to the way sysfs works, because
sysfs more or less just presents kobjs. Each kobj may have attributes,
a parent, and a list of children. A device is a kobj, and it needs to
be registered into the device hierarchy to work at all. Sysfs itself
doesn't really know whether something is a virtual device (or a device
at all) - it just calls some functions from kobj_type for things like
reading/writing attributes, etc. But I don't see anything where
kobj_types can implement their own iterators.

As things are right now, not registering a device in sysfs means not
registering it at all.
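
To make the model described above concrete, here is a small userspace
sketch. It is a toy, not kernel code: the Kobj and KobjType classes and
the path()/walk() helpers are names made up for illustration, and the
real structures carry far more state. It only shows the one property
that matters here: the walker knows nothing about devices, so an object
that was never registered under a parent simply does not appear.

```python
# Toy userspace model, NOT kernel code: every name here is hypothetical.

class KobjType:
    """Per-type callbacks; the only thing the walker ever calls into."""
    def __init__(self, show):
        self.show = show  # show(kobj, attr_name) -> str


class Kobj:
    def __init__(self, name, ktype, attrs=()):
        self.name = name
        self.ktype = ktype
        self.attrs = list(attrs)
        self.parent = None
        self.children = []

    def add(self, child):
        """Register child below this object; only then is it reachable."""
        child.parent = self
        self.children.append(child)
        return child


def path(kobj):
    """Build a sysfs-style path by following the parent pointers."""
    parts = []
    while kobj is not None:
        parts.append(kobj.name)
        kobj = kobj.parent
    return "/" + "/".join(reversed(parts))


def walk(root):
    """Yield the path of every object reachable from root."""
    yield path(root)
    for child in root.children:
        yield from walk(child)


plain = KobjType(show=lambda kobj, attr: f"{kobj.name}:{attr}")
devices = Kobj("devices", plain)
virtual = devices.add(Kobj("virtual", plain))
loop0 = virtual.add(Kobj("loop0", plain, attrs=["size"]))
ram0 = Kobj("ram0", plain)  # created, but never registered anywhere

print(list(walk(devices)))              # ram0 is absent from the listing
print(loop0.ktype.show(loop0, "size"))  # attribute read goes via the type
```

In this model, hiding ram0 and not registering ram0 are the same thing,
which is the problem described above.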

By the way: I'm just wondering whether it would make sense to give
kobj_type its own iteration and lookup functions. Unless I'm fully
mistaken, that could help solve several other problems, e.g. device
renaming (currently *very* tricky and only works to some extent for
network devices).

IMHO, we could then e.g. fetch the device names (/sys/devices/...)
directly from the struct device instead of the kset (perhaps a simple
list instead of kset would also do here), and also create the symlinks
(e.g. /sys/class/.../) on the fly. Once that's done, renaming a device
should become rather simple.

At that point, adding multiple views of certain parts of sysfs (e.g. the
devices hierarchy) could perhaps be done by implementing special
iterators that take the view criteria into account.
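
A rough userspace sketch of what such a per-type iterator might look
like. Again this is a toy, not kernel code: the iterate() hook, the "ns"
tag and the "view" argument are all assumptions invented for this
example, and the policy it applies (untagged objects are visible in
every view, tagged ones only in their own) is just one of several
possible answers to the semantics questions raised earlier.

```python
# Toy userspace model, NOT kernel code. The iterate() hook stands in for
# the hypothetical per-type iterator; all names are made up.

class KobjType:
    def __init__(self, iterate=None):
        # Types without their own iterator simply list every child.
        self.iterate = iterate or (lambda kobj, view: iter(kobj.children))


class Kobj:
    def __init__(self, name, ktype, ns=None):
        self.name = name
        self.ktype = ktype
        self.ns = ns          # owning namespace tag; None means untagged
        self.children = []

    def add(self, child):
        self.children.append(child)
        return child


def ns_filtered(kobj, view):
    """Iterator that hides children tagged with a different namespace."""
    for child in kobj.children:
        if child.ns is None or child.ns == view:
            yield child


def listdir(kobj, view):
    """What a directory listing of kobj would contain for the given view."""
    return [child.name for child in kobj.ktype.iterate(kobj, view)]


leaf = KobjType()
block_dir = Kobj("block", KobjType(iterate=ns_filtered))
block_dir.add(Kobj("sda", leaf))                     # untagged
block_dir.add(Kobj("loop0", leaf, ns="container-a"))
block_dir.add(Kobj("loop1", leaf, ns="container-b"))

for view in ("host", "container-a", "container-b"):
    print(view, listdir(block_dir, view))
```

The objects stay registered in one place; only the listing differs per
view. Whether the host view should also see loop0 and loop1 is exactly
the kind of policy decision that would have to be settled first.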

@Greg: what's your take on that iterator idea ?

> If you have virtual devices showing up in their own filesystem they
> don't even need major or minor numbers.  You can just have files
> that accept ioctls like device nodes.  In principle it is
> possible to skip a lot of the historical infrastructure.  If the
> infrastructure is not needed it is worth skipping.

Ah, I see where you're going. You want to completely drop these virtual
devices and replace them with a synthetic fs that *looks* like it
contains devices? Well, theoretically it should be possible, since
filesystems may handle opening device nodes completely on their own,
instead of calling generic code (is there any that actually does?).

BUT: in that case we have to really make sure that processes inside the
container cannot ever open any device node outside that special fs.

> I haven't dug into the block layer recently enough to say what is needed
> or not.  I think there are some thing such as stat on a mounted
> filesystem that need a major and minor numbers.  Which probably means
> you have to use major and minor numbers.  By virtue of using common
> infrastructure that implies showing up in sysfs and devtmpfs.  Things
> would be limited just by not mounting devtmpfs in a container.

Note that this approach also needs to support things like dynamically
creating new device nodes (inside the container), udev, ... otherwise
you'd need very special handling in userland again (lxc folks would
become very unhappy ;-))
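
As a side note on the quoted point about stat: the dependency on major
and minor numbers is easy to observe from userland. The sketch below
uses only standard library calls (os.stat, os.major, os.minor); the
helper name dev_numbers() is made up for this example.

```python
import os


def dev_numbers(path):
    """Return (major, minor) of the device the filesystem at path lives on."""
    st = os.stat(path)
    return os.major(st.st_dev), os.minor(st.st_dev)


major, minor = dev_numbers("/")
print(f"/ is on device {major}:{minor}")
```

Any tool that compares st_dev values to detect filesystem boundaries
relies on these numbers being present, whatever ends up providing the
device.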

> It is worth checking how much of the common infrastructure you need when
> you start creating virtual devices.

s/virtual devices/synthetic filesystems/;

Your approach goes very much in the Plan9 direction (which in general I'd
love to see). But whatever we do here needs to remain compatible
with what existing userland expects - we've got a lot of Unix tradition
to keep here.

OR: we'd have to declare that (once inside the devns) we throw it all away
and create something entirely new that's more like a Plan9 subsystem
than a Linux container. Also interesting, but not what I started
this discussion for.

> The only reason the network devices need changes to sysfs is to allow
> different network devices with the same name to show up in different
> network namespaces.
> 
> If you can fundamentally avoid the problem of devices with the same
> name needing to show up in sysfs and devtmpfs by using filesystems
> then sysfs and devtmpfs needs no changes.

Well, that's only for the sysfs part. Network devices still need to
be namespaced in other places (sockets, etc.) - which is already done by
netns.

But yes, it would be nice if we had entirely separate namespaces for
network device names (e.g. any of the host's network devices could
appear simply as "eth0" inside a container, if you want).


--mtx

-- 
---
Note: unencrypted e-mails can easily be intercepted and manipulated!
For confidential communication, please send your GPG/PGP key.
---
Enrico Weigelt, metux IT consult
Free software and Linux embedded engineering
info@metux.net -- +49-151-27565287
