* device namespaces
@ 2021-06-08 9:38 Enrico Weigelt, metux IT consult
2021-06-08 12:30 ` Christian Brauner
0 siblings, 1 reply; 48+ messages in thread
From: Enrico Weigelt, metux IT consult @ 2021-06-08 9:38 UTC (permalink / raw)
To: containers, linux-kernel
Hello folks,
I'm going to implement device namespaces, where containers can get an
entirely different view of the devices in the machine (usually just a
specific subset, but possibly additional virtual devices).
For start I'd like to add a simple mapping of dev maj/min (leaving aside
sysfs, udev, etc). An important requirement for me is that the parent ns
can choose to delegate devices from those it has full access to (child
namespaces can do the same to their children), and the assignment can
change (for simplicity ignoring the case of removing devices that are
already opened by some process - haven't decided yet whether they should
be forcefully closed or whether keeping them open is a valid use case).
The big question for me now is how exactly to do the table maintenance
from userland. We already have entries in /proc/<pid>/ns/*. I'm thinking
about using them as command channel, like this:
* new child namespaces are created with empty mapping
* mapping manipulation is done by just writing commands to the ns file
* access is only granted if the writing process itself is in the
parent's device ns and has CAP_SYS_ADMIN (or maybe there could be some
admin user for the ns ? or the 'root' of the corresponding user_ns ?)
* if the caller has some restrictions on some particular device, these
are automatically added (eg. if you're restricted to readonly, you
can't give rw to the child ns).
Is this a good way to go ? Or what would be a better one ?
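A minimal sketch of the delegation rules listed above, written as a
userspace model rather than kernel code. The class name DeviceNamespace,
the method names, and the use of "r"/"w" letters for access are all
assumptions made for illustration; nothing here is an existing kernel
interface.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set, Tuple

Dev = Tuple[int, int]  # (major, minor)


@dataclass
class DeviceNamespace:
    """Hypothetical model of a per-namespace device table."""
    parent: Optional["DeviceNamespace"] = None
    # (major, minor) -> set of access letters, e.g. {"r", "w"}
    mapping: Dict[Dev, Set[str]] = field(default_factory=dict)

    def child(self) -> "DeviceNamespace":
        # New child namespaces are created with an empty mapping.
        return DeviceNamespace(parent=self)

    def delegate(self, child: "DeviceNamespace", dev: Dev, access: str) -> Set[str]:
        """Grant `access` on `dev` to a direct child, capped by our own access."""
        if child.parent is not self:
            raise PermissionError("only the parent namespace may delegate")
        own = self.mapping.get(dev)
        if own is None:
            raise PermissionError("device not visible in the delegating namespace")
        # The caller's own restrictions carry over: a read-only parent
        # cannot hand out write access.
        granted = set(access) & own
        child.mapping[dev] = granted
        return granted
```

For example, a host that only holds {"r"} on (8, 0) and asks to delegate
"rw" ends up granting {"r"} to the child.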
--mtx
--
---
Note: unencrypted e-mails can easily be intercepted and manipulated!
For confidential communication, please send your GPG/PGP key.
---
Enrico Weigelt, metux IT consult
Free software and Linux embedded engineering
info@metux.net -- +49-151-27565287
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: device namespaces
2021-06-08 9:38 device namespaces Enrico Weigelt, metux IT consult
@ 2021-06-08 12:30 ` Christian Brauner
2021-06-08 12:41 ` Greg Kroah-Hartman
0 siblings, 1 reply; 48+ messages in thread
From: Christian Brauner @ 2021-06-08 12:30 UTC (permalink / raw)
To: Enrico Weigelt, metux IT consult, Greg Kroah-Hartman
Cc: containers, linux-kernel
On Tue, Jun 08, 2021 at 11:38:16AM +0200, Enrico Weigelt, metux IT consult wrote:
> Hello folks,
>
>
> I'm going to implement device namespaces, where containers can get an
> entirely different view of the devices in the machine (usually just a
> specific subset, but possibly additional virtual devices).
>
> For start I'd like to add a simple mapping of dev maj/min (leaving aside
> sysfs, udev, etc). An important requirement for me is that the parent ns
> can choose to delegate devices from those it has full access to (child
> namespaces can do the same to their children), and the assignment can
> change (for simplicity ignoring the case of removing devices that are
> already opened by some process - haven't decided yet whether they should
> be forcefully closed or whether keeping them open is a valid use case).
>
> The big question for me now is how exactly to do the table maintenance
> from userland. We already have entries in /proc/<pid>/ns/*. I'm thinking
> about using them as command channel, like this:
>
> * new child namespaces are created with empty mapping
> * mapping manipulation is done by just writing commands to the ns file
> * access is only granted if the writing process itself is in the
> parent's device ns and has CAP_SYS_ADMIN (or maybe there could be some
> admin user for the ns ? or the 'root' of the corresponding user_ns ?)
> * if the caller has some restrictions on some particular device, these
> are automatically added (eg. if you're restricted to readonly, you
> can't give rw to the child ns).
>
> Is this a good way to go ? Or what would be a better one ?
Ccing Greg. Without addressing specific problems, I should warn you that
this idea is not new and the plan is unlikely to go anywhere. Especially
not without support from Greg.
Also note that I have done work to make it possible to do sufficient
device management in containers. There's a longer series associated with
this but the gist is 692ec06d7c92 ("netns: send uevent messages") where
you can forward uevents to containers. I spoke about this at Plumbers in
2018 or so too. For example, LXD makes use of this. When you hotplug a
device into a container LXD will forward the generated uevents to the
container making it possible for the container to manage those devices.
That's fully under control of userspace and means we don't need to
burden the kernel with this.
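To make the uevent part concrete, here is a rough sketch of how a
userspace process can receive and decode kernel uevents. It assumes the
plain kernel message layout ("action@devpath" followed by NUL-separated
KEY=VALUE pairs) on multicast group 1 of a NETLINK_KOBJECT_UEVENT socket.
The function names are made up for this example, and it does not show how
LXD itself forwards events.

```python
import socket

NETLINK_KOBJECT_UEVENT = 15  # value from <linux/netlink.h>


def parse_uevent(data: bytes):
    """Split a raw kernel uevent into its header and KEY=VALUE fields."""
    parts = [p.decode("utf-8", "replace") for p in data.split(b"\0") if p]
    if not parts:
        return "", {}
    header, fields = parts[0], {}
    for item in parts[1:]:
        key, sep, value = item.partition("=")
        if sep:  # skip anything that is not KEY=VALUE
            fields[key] = value
    return header, fields


def listen_for_uevents():
    """Yield parsed uevents seen in the caller's network namespace (Linux only)."""
    sock = socket.socket(socket.AF_NETLINK, socket.SOCK_DGRAM,
                         NETLINK_KOBJECT_UEVENT)
    sock.bind((0, 1))  # port id 0: kernel assigns one; group 1: kernel uevents
    while True:
        yield parse_uevent(sock.recv(8192))
```

Because the socket belongs to a network namespace, a listener like this
only sees the events delivered to the namespace it was opened in.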
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: device namespaces
2021-06-08 12:30 ` Christian Brauner
@ 2021-06-08 12:41 ` Greg Kroah-Hartman
2021-06-08 14:10 ` Hannes Reinecke
0 siblings, 1 reply; 48+ messages in thread
From: Greg Kroah-Hartman @ 2021-06-08 12:41 UTC (permalink / raw)
To: Enrico Weigelt, metux IT consult
Cc: Christian Brauner, containers, linux-kernel
On Tue, Jun 08, 2021 at 02:30:50PM +0200, Christian Brauner wrote:
> On Tue, Jun 08, 2021 at 11:38:16AM +0200, Enrico Weigelt, metux IT consult wrote:
> > Hello folks,
> >
> >
> > I'm going to implement device namespaces, where containers can get an
> > entirely different view of the devices in the machine (usually just a
> > specific subset, but possibly additional virtual devices).
> >
> > For start I'd like to add a simple mapping of dev maj/min (leaving aside
> > sysfs, udev, etc). An important requirement for me is that the parent ns
> > can choose to delegate devices from those it has full access to (child
> > namespaces can do the same to their children), and the assignment can
> > change (for simplicity ignoring the case of removing devices that are
> > already opened by some process - haven't decided yet whether they should
> > be forcefully closed or whether keeping them open is a valid use case).
> >
> > The big question for me now is how exactly to do the table maintenance
> > from userland. We already have entries in /proc/<pid>/ns/*. I'm thinking
> > about using them as command channel, like this:
> >
> > * new child namespaces are created with empty mapping
> > * mapping manipulation is done by just writing commands to the ns file
> > * access is only granted if the writing process itself is in the
> > parent's device ns and has CAP_SYS_ADMIN (or maybe there could be some
> > admin user for the ns ? or the 'root' of the corresponding user_ns ?)
> > * if the caller has some restrictions on some particular device, these
> > are automatically added (eg. if you're restricted to readonly, you
> > can't give rw to the child ns).
> >
> > Is this a good way to go ? Or what would be a better one ?
>
> Ccing Greg. Without addressing specific problems, I should warn you that
> this idea is not new and the plan is unlikely to go anywhere. Especially
> not without support from Greg.
Hah, yeah, this is a non-starter.
Enrico, what real problem are you trying to solve by doing this? And
have you tried anything with this yet? We almost never talk about
"proposals" without seeing real code as it's pointless to discuss things
when you haven't even proven that it can work.
So let's see code before even talking about this...
And as Christian points out, you can do this today without any kernel
changes, so to think you need to modify the kernel means that you
haven't even tried this at all?
greg k-h
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: device namespaces
2021-06-08 12:41 ` Greg Kroah-Hartman
@ 2021-06-08 14:10 ` Hannes Reinecke
2021-06-08 14:29 ` Christian Brauner
0 siblings, 1 reply; 48+ messages in thread
From: Hannes Reinecke @ 2021-06-08 14:10 UTC (permalink / raw)
To: gregkh; +Cc: christian.brauner, containers, linux-kernel, lkml
On Tue, Jun 08, 2021 Greg-KH wrote:
> On Tue, Jun 08, 2021 at 02:30:50PM +0200, Christian Brauner wrote:
>> On Tue, Jun 08, 2021 at 11:38:16AM +0200, Enrico Weigelt,
>> metux IT consult wrote:
>>> Hello folks,
>>>
>>>
>>> I'm going to implement device namespaces, where containers can get
>>> an entirely different view of the devices in the machine (usually
>>> just a specific subset, but possibly additional virtual devices).
>>>
[ .. ]
>>> Is this a good way to go ? Or what would be a better one ?
>>
>> Ccing Greg. Without addressing specific problems, I should warn you
>> that this idea is not new and the plan is unlikely to go anywhere.
>> Especially not without support from Greg.
>
> Hah, yeah, this is a non-starter.
>
> Enrico, what real problem are you trying to solve by doing this? And
> have you tried anything with this yet? We almost never talk about
> "proposals" without seeing real code as it's pointless to discuss
> things when you haven't even proven that it can work.
>
> So let's see code before even talking about this...
>
> And as Christian points out, you can do this today without any kernel
> changes, so to think you need to modify the kernel means that you
> haven't even tried this at all?
>
Curious, I had been looking into this, too.
And I have to side with Greg and Christian that your proposal should
already be possible today (cf. device cgroups, which curiously have a
near-identical interface to what you proposed).
Also, I think that a generic 'device namespace' is too broad a scope;
some subsystems like net already inherited namespace support, and it
turns out to be not exactly trivial to implement.
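Assuming the interface referred to above is the cgroup v1 devices
controller, its rules are single text lines of the form
"type major:minor access" written to devices.allow or devices.deny. The
helper below only formats such a line; its name is made up for this
example, and cgroup v2 uses an attached BPF program instead of these files.

```python
def devcg_rule(dev_type: str, major, minor, access: str) -> str:
    """Format one rule in the cgroup v1 devices.allow / devices.deny syntax."""
    if dev_type not in ("a", "b", "c"):  # all, block, char
        raise ValueError("type must be 'a', 'b' or 'c'")
    if not access or set(access) - set("rwm"):  # read, write, mknod
        raise ValueError("access must be a combination of r, w and m")
    # major and minor may be integers or the wildcard "*"
    return f"{dev_type} {major}:{minor} {access}"
```

A container manager would write the resulting string, e.g. "b 8:0 rw",
into the devices.allow file of the container's cgroup.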
What I'm looking at, though, is to implement 'block' namespaces, to
restrict access to _new_ block devices to any given namespace.
Case in point: if a container creates a ramdisk it's questionable
whether other containers should even see it. iSCSI devices are a similar
case; when starting iSCSI devices from containers their use should be
restricted to that container.
And that's not only the device node in /dev, but would also entail sysfs
access, which from my understanding is not modified with the current code.
uevent redirection would help here, but from what I've seen it's only
for net devices; feels a bit awkward to have a network namespace to get
uevents for block devices, but then I'll have to test.
And, of course, that also doesn't change the sysfs layout.
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Felix Imendörffer
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: device namespaces
2021-06-08 14:10 ` Hannes Reinecke
@ 2021-06-08 14:29 ` Christian Brauner
2021-06-08 15:54 ` Hannes Reinecke
0 siblings, 1 reply; 48+ messages in thread
From: Christian Brauner @ 2021-06-08 14:29 UTC (permalink / raw)
To: Hannes Reinecke; +Cc: gregkh, containers, linux-kernel, lkml
On Tue, Jun 08, 2021 at 04:10:08PM +0200, Hannes Reinecke wrote:
> On Tue, Jun 08, 2021 Greg-KH wrote:
> > On Tue, Jun 08, 2021 at 02:30:50PM +0200, Christian Brauner wrote:
> >> On Tue, Jun 08, 2021 at 11:38:16AM +0200, Enrico Weigelt,
> >> metux IT consult wrote:
> >>> Hello folks,
> >>>
> >>>
> >>> I'm going to implement device namespaces, where containers can get
> >>> an entirely different view of the devices in the machine (usually
> >>> just a specific subset, but possibly additional virtual devices).
> >>>
> [ .. ]
> >>> Is this a good way to go ? Or what would be a better one ?
> >>
> >> Ccing Greg. Without addressing specific problems, I should warn you
> >> that this idea is not new and the plan is unlikely to go anywhere.
> >> Especially not without support from Greg.
> >
> > Hah, yeah, this is a non-starter.
> >
> > Enrico, what real problem are you trying to solve by doing this? And
> > have you tried anything with this yet? We almost never talk about
> > "proposals" without seeing real code as it's pointless to discuss
> > things when you haven't even proven that it can work.
> >
> > So let's see code before even talking about this...
> >
> > And as Christian points out, you can do this today without any kernel
> > changes, so to think you need to modify the kernel means that you
> > haven't even tried this at all?
> >
> Curious, I had been looking into this, too.
> And I have to side with Greg and Christian that your proposal should
> already be possible today (cf device groups, which curiously has a
> near-identical interface to what you proposed).
> Also, I think that a generic 'device namespace' is too broad a scope;
> some subsystems like net already inherited namespace support, and it
> turns out to be not exactly trivial to implement.
>
> What I'm looking at, though, is to implement 'block' namespaces, to
> restrict access to _new_ block devices to any given namespace.
> Case in point: if a container creates a ramdisk it's questionable
> whether other containers should even see it. iSCSI devices are a similar
> case; when starting iSCSI devices from containers their use should be
> restricted to that container.
> And that's not only the device node in /dev, but would also entail sysfs
> access, which from my understanding is not modified with the current code.
Hey Hannes. :)
It isn't and we likely shouldn't. You'd likely need to get into the
business of namespacing devtmpfs one way or the other which Seth Forshee
and I once did. But that's really not needed anymore imho. Device
management, i.e. creating device nodes should be the job of a container
manager. We already do that for example (Hotplugging devices ranging
from net devices, to disks, to GPUs.) and it works great.
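As a rough illustration of what "creating device nodes" means for a
container manager, the sketch below builds the arguments for mknod(2) and
then creates the node. The helper names and the default permission bits
are assumptions for this example, not what any particular container
manager does.

```python
import os
import stat


def device_node_args(path, kind, major, minor, perm=0o660):
    """Return the (path, mode, dev) triple that os.mknod() expects."""
    type_bits = {"b": stat.S_IFBLK, "c": stat.S_IFCHR}[kind]
    return path, type_bits | perm, os.makedev(major, minor)


def create_device_node(path, kind, major, minor, perm=0o660):
    """Create a block or character device node at `path`.

    Needs CAP_MKNOD; a container manager would typically run this on the
    host, against the container's /dev directory.
    """
    os.mknod(*device_node_args(path, kind, major, minor, perm))
```

Whether the container can then open the node is still subject to whatever
device access policy is applied to it.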
To make this really clean you will likely have to significantly rework
sysfs too and I don't think that churn is worth it and introduces a
layer of complexity I find outright nackable. And ignoring sysfs or
hacking around it is also not an option I find tasteful.
>
> uevent redirection would help here, but from what I've seen it's only
> for net devices; feels a bit awkward to have a network namespace to get
> uevents for block devices, but then I'll have to test.
Just to get everyone on the same page. This is not specific to network
devices actually.
You are right though that network devices are correctly namespaced.
Specifically you only get uevents in the network namespace that network
device is moved into. The sysfs permissions for network devices were
correct if you created that network device in the network namespace but
they were wrong when you moved a network device between network
namespaces (with different owning user namespaces). That led to all
kinds of weird issues. I fixed that a while back.
Uevent messages (and therefore injection of uevents) are not tied to
network devices. They are tied to network namespaces simply because the
transport layer is Netlink but that's about it.
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: device namespaces
2021-06-08 14:29 ` Christian Brauner
@ 2021-06-08 15:54 ` Hannes Reinecke
2021-06-08 17:16 ` Eric W. Biederman
0 siblings, 1 reply; 48+ messages in thread
From: Hannes Reinecke @ 2021-06-08 15:54 UTC (permalink / raw)
To: Christian Brauner; +Cc: gregkh, containers, linux-kernel, lkml
On 6/8/21 4:29 PM, Christian Brauner wrote:
> On Tue, Jun 08, 2021 at 04:10:08PM +0200, Hannes Reinecke wrote:
>> On Tue, Jun 08, 2021 Greg-KH wrote:
>>> On Tue, Jun 08, 2021 at 02:30:50PM +0200, Christian Brauner wrote:
>>>> On Tue, Jun 08, 2021 at 11:38:16AM +0200, Enrico Weigelt,
>>>> metux IT consult wrote:
>>>>> Hello folks,
>>>>>
>>>>>
>>>>> I'm going to implement device namespaces, where containers can get
>>>>> an entirely different view of the devices in the machine (usually
>>>>> just a specific subset, but possibly additional virtual devices).
>>>>>
>> [ .. ]
>>>>> Is this a good way to go ? Or what would be a better one ?
>>>>
>>>> Ccing Greg. Without addressing specific problems, I should warn you
>>>> that this idea is not new and the plan is unlikely to go anywhere.
>>>> Especially not without support from Greg.
>>>
>>> Hah, yeah, this is a non-starter.
>>>
>>> Enrico, what real problem are you trying to solve by doing this? And
>>> have you tried anything with this yet? We almost never talk about
>>> "proposals" without seeing real code as it's pointless to discuss
>>> things when you haven't even proven that it can work.
>>>
>>> So let's see code before even talking about this...
>>>
>>> And as Christian points out, you can do this today without any kernel
>>> changes, so to think you need to modify the kernel means that you
>>> haven't even tried this at all?
>>>
>> Curious, I had been looking into this, too.
>> And I have to side with Greg and Christian that your proposal should
>> already be possible today (cf device groups, which curiously has a
>> near-identical interface to what you proposed).
>> Also, I think that a generic 'device namespace' is too broad a scope;
>> some subsystems like net already inherited namespace support, and it
>> turns out to be not exactly trivial to implement.
>>
>> What I'm looking at, though, is to implement 'block' namespaces, to
>> restrict access to _new_ block devices to any given namespace.
>> Case in point: if a container creates a ramdisk it's questionable
>> whether other containers should even see it. iSCSI devices are a similar
>> case; when starting iSCSI devices from containers their use should be
>> restricted to that container.
>> And that's not only the device node in /dev, but would also entail sysfs
>> access, which from my understanding is not modified with the current code.
>
> Hey Hannes. :)
>
> It isn't and we likely shouldn't. You'd likely need to get into the
> business of namespacing devtmpfs one way or the other which Seth Forshee
> and I once did. But that's really not needed anymore imho. Device
> management, i.e. creating device nodes should be the job of a container
> manager. We already do that for example (Hotplugging devices ranging
> from net devices, to disks, to GPUs.) and it works great.
>
Right; clearly you can do that within the container.
But my main grudge here is not the container but rather the system
_hosting_ the container.
That is typically using devtmpfs and hence will see _all_ devices, even
those belonging to the container.
This is causing grief to no end if eg the host system starts activating
LVM on devices which are passed to the qemu instance running within a
container ...
> To make this really clean you will likely have to significantly rework
> sysfs too and I don't think that churn is worth it and introduces a
> layer of complexity I find outright nakable. And ignoring sysfs or
> hacking around it is also not an option I find tasteful.
>
Network namespaces already have the bits and pieces to modify sysfs, so
we should be able to leverage that for block, too.
And I think by restricting it to 'block' devices we should be able to keep
the required sysfs modifications at a manageable level.
>>
>> uevent redirection would help here, but from what I've seen it's only
>> for net devices; feels a bit awkward to have a network namespace to get
>> uevents for block devices, but then I'll have to test.
>
> Just to move everyone on the same page. This is not specific to network
> devices actually.
>
> You are right though that network devices are correctly namespaced.
> Specifically you only get uevents in the network namespace that network
> device is moved into. The sysfs permissions for network devices were
> correct if you created that network device in the network namespace but
> they were wrong when you moved a network device between network
> namespaces (with different owning user namespaces). That led to all
> kinds of weird issues. I fixed that a while back.
>
Granted, modifying sysfs layout is not something for the faint-hearted,
and one really has to look closely to ensure you end up with a
consistent layout afterwards.
But let's see how things go; might well be that it turns out to be too
complex to consider. Can't tell yet.
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Felix Imendörffer
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: device namespaces
2021-06-08 15:54 ` Hannes Reinecke
@ 2021-06-08 17:16 ` Eric W. Biederman
2021-06-09 6:38 ` Christian Brauner
0 siblings, 1 reply; 48+ messages in thread
From: Eric W. Biederman @ 2021-06-08 17:16 UTC (permalink / raw)
To: Hannes Reinecke; +Cc: Christian Brauner, gregkh, containers, linux-kernel, lkml
Hannes Reinecke <hare@suse.de> writes:
> On 6/8/21 4:29 PM, Christian Brauner wrote:
>> On Tue, Jun 08, 2021 at 04:10:08PM +0200, Hannes Reinecke wrote:
>>> On Tue, Jun 08, 2021 Greg-KH wrote:
>>>> On Tue, Jun 08, 2021 at 02:30:50PM +0200, Christian Brauner wrote:
>>>>> On Tue, Jun 08, 2021 at 11:38:16AM +0200, Enrico Weigelt,
>>>>> metux IT consult wrote:
>>>>>> Hello folks,
>>>>>>
>>>>>>
>>>>>> I'm going to implement device namespaces, where containers can get
>>>>>> an entirely different view of the devices in the machine (usually
>>>>>> just a specific subset, but possibly additional virtual devices).
>>>>>>
>>> [ .. ]
>>>>>> Is this a good way to go ? Or what would be a better one ?
>>>>>
>>>>> Ccing Greg. Without addressing specific problems, I should warn you
>>>>> that this idea is not new and the plan is unlikely to go anywhere.
>>>>> Especially not without support from Greg.
>>>>
>>>> Hah, yeah, this is a non-starter.
>>>>
>>>> Enrico, what real problem are you trying to solve by doing this? And
>>>> have you tried anything with this yet? We almost never talk about
>>>> "proposals" without seeing real code as it's pointless to discuss
>>>> things when you haven't even proven that it can work.
>>>>
>>>> So let's see code before even talking about this...
>>>>
>>>> And as Christian points out, you can do this today without any kernel
>>>> changes, so to think you need to modify the kernel means that you
>>>> haven't even tried this at all?
>>>>
>>> Curious, I had been looking into this, too.
>>> And I have to side with Greg and Christian that your proposal should
>>> already be possible today (cf device groups, which curiously has a
>>> near-identical interface to what you proposed).
>>> Also, I think that a generic 'device namespace' is too broad a scope;
>>> some subsystems like net already inherited namespace support, and it
>>> turns out to be not exactly trivial to implement.
>>>
>>> What I'm looking at, though, is to implement 'block' namespaces, to
>>> restrict access to _new_ block devices to any given namespace.
>>> Case in point: if a container creates a ramdisk it's questionable
>>> whether other containers should even see it. iSCSI devices are a similar
>>> case; when starting iSCSI devices from containers their use should be
>>> restricted to that container.
>>> And that's not only the device node in /dev, but would also entail sysfs
>>> access, which from my understanding is not modified with the current code.
>>
>> Hey Hannes. :)
>>
>> It isn't and we likely shouldn't. You'd likely need to get into the
>> business of namespacing devtmpfs one way or the other which Seth Forshee
>> and I once did. But that's really not needed anymore imho. Device
>> management, i.e. creating device nodes should be the job of a container
>> manager. We already do that for example (Hotplugging devices ranging
>> from net devices, to disks, to GPUs.) and it works great.
>>
> Right; clearly you can do that within the container.
> But my main grudge here is not the container but rather the system
> _hosting_ the container.
> That is typically using devtmpfs and hence will see _all_ devices, even
> those belonging to the container.
> This is causing grief to no end if eg the host system starts activating
> LVM on devices which are passed to the qemu instance running within a
> container ...
>
>> To make this really clean you will likely have to significantly rework
>> sysfs too and I don't think that churn is worth it and introduces a
>> layer of complexity I find outright nakable. And ignoring sysfs or
>> hacking around it is also not an option I find tasteful.
>>
> Network namespaces already have the bits and pieces to modify sysfs, so
> we should be able to leverage that for block, too.
> And I think by restricting it to 'block' devices we should be able to keep
> the required sysfs modifications at a manageable level.
>
>>>
>>> uevent redirection would help here, but from what I've seen it's only
>>> for net devices; feels a bit awkward to have a network namespace to get
>>> uevents for block devices, but then I'll have to test.
>>
>> Just to move everyone on the same page. This is not specific to network
>> devices actually.
>>
>> You are right though that network devices are correctly namespaced.
>> Specifically you only get uevents in the network namespace that network
>> device is moved into. The sysfs permissions for network devices were
>> correct if you created that network device in the network namespace but
>> they were wrong when you moved a network device between network
>> namespaces (with different owning user namespaces). That led to all
>> kinds of weird issues. I fixed that a while back.
>>
> Granted, modifying sysfs layout is not something for the faint-hearted,
> and one really has to look closely to ensure you end up with a
> consistent layout afterwards.
>
> But let's see how things go; might well be that it turns out to be too
> complex to consider. Can't tell yet.
I would suggest aiming for something like devptsfs without the
complication of /dev/ptmx.
That is a pseudo filesystem that has a control node and virtual block
devices that were created using that control node.
That is the cleanest solution I know and is not strictly limited to use
with containers so it can also gain greater traction. The interaction
with devtmpfs should be simply having devtmpfs create a mount point for
that filesystem.
This could be a new cleaner api for things like loopback devices.
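For comparison, loop devices already have a control node today,
/dev/loop-control, which hands out device indices through an ioctl. The
sketch below only queries it for a free index. It shows the existing
"control node" pattern, not the pseudo filesystem suggested above, and the
function name is made up for this example.

```python
import fcntl
import os

LOOP_CTL_GET_FREE = 0x4C82  # value from <linux/loop.h>


def free_loop_index(control="/dev/loop-control"):
    """Ask the loop control node for an unused loop index, or None on failure."""
    try:
        fd = os.open(control, os.O_RDWR)
    except OSError:
        return None  # no control node here, or not enough privilege
    try:
        # Returns the index N of a free /dev/loopN device.
        return fcntl.ioctl(fd, LOOP_CTL_GET_FREE)
    except OSError:
        return None
    finally:
        os.close(fd)
```

A filesystem-based variant would presumably expose one such control node
per mount, so that each mount only lists the devices created through it.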
However the limitation for block devices that I am aware of is that we
don't currently have any filesystems in the kernel that are written
robustly enough that we can be expected to be secure when mounted on top
of an evil block device. Some of the network filesystems are built
to withstand evil network packets, and possibly evil servers. So with
care we can probably allow for unprivileged mounts there.
Eric
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: device namespaces
2021-06-08 17:16 ` Eric W. Biederman
@ 2021-06-09 6:38 ` Christian Brauner
2021-06-09 7:02 ` Hannes Reinecke
0 siblings, 1 reply; 48+ messages in thread
From: Christian Brauner @ 2021-06-09 6:38 UTC (permalink / raw)
To: Eric W. Biederman; +Cc: Hannes Reinecke, gregkh, containers, linux-kernel, lkml
On Tue, Jun 08, 2021 at 12:16:43PM -0500, Eric W. Biederman wrote:
> Hannes Reinecke <hare@suse.de> writes:
>
> > On 6/8/21 4:29 PM, Christian Brauner wrote:
> >> On Tue, Jun 08, 2021 at 04:10:08PM +0200, Hannes Reinecke wrote:
> >>> On Tue, Jun 08, 2021 Greg-KH wrote:
> >>>> On Tue, Jun 08, 2021 at 02:30:50PM +0200, Christian Brauner wrote:
> >>>>> On Tue, Jun 08, 2021 at 11:38:16AM +0200, Enrico Weigelt,
> >>>>> metux IT consult wrote:
> >>>>>> Hello folks,
> >>>>>>
> >>>>>>
> >>>>>> I'm going to implement device namespaces, where containers can get
> >>>>>> an entirely different view of the devices in the machine (usually
> >>>>>> just a specific subset, but possibly additional virtual devices).
> >>>>>>
> >>> [ .. ]
> >>>>>> Is this a good way to go ? Or what would be a better one ?
> >>>>>
> >>>>> Ccing Greg. Without addressing specific problems, I should warn you
> >>>>> that this idea is not new and the plan is unlikely to go anywhere.
> >>>>> Especially not without support from Greg.
> >>>>
> >>>> Hah, yeah, this is a non-starter.
> >>>>
> >>>> Enrico, what real problem are you trying to solve by doing this? And
> >>>> have you tried anything with this yet? We almost never talk about
> >>>> "proposals" without seeing real code as it's pointless to discuss
> >>>> things when you haven't even proven that it can work.
> >>>>
> >>>> So let's see code before even talking about this...
> >>>>
> >>>> And as Christian points out, you can do this today without any kernel
> >>>> changes, so to think you need to modify the kernel means that you
> >>>> haven't even tried this at all?
> >>>>
> >>> Curious, I had been looking into this, too.
> >>> And I have to side with Greg and Christian that your proposal should
> >>> already be possible today (cf device groups, which curiously has a
> >>> near-identical interface to what you proposed).
> >>> Also, I think that a generic 'device namespace' is too broad a scope;
> >>> some subsystems like net already inherited namespace support, and it
> >>> turns out to be not exactly trivial to implement.
> >>>
> >>> What I'm looking at, though, is to implement 'block' namespaces, to
> >>> restrict access to _new_ block devices to any given namespace.
> >>> Case in point: if a container creates a ramdisk it's questionable
> >>> whether other containers should even see it. iSCSI devices are a similar
> >>> case; when starting iSCSI devices from containers their use should be
> >>> restricted to that container.
> >>> And that's not only the device node in /dev, but would also entail sysfs
> >>> access, which from my understanding is not modified with the current code.
> >>
> >> Hey Hannes. :)
> >>
> >> It isn't and we likely shouldn't. You'd likely need to get into the
> >> business of namespacing devtmpfs one way or the other which Seth Forshee
> >> and I once did. But that's really not needed anymore imho. Device
> >> management, i.e. creating device nodes should be the job of a container
> >> manager. We already do that for example (Hotplugging devices ranging
> >> from net devices, to disks, to GPUs.) and it works great.
> >>
> > Right; clearly you can do that within the container.
> > But my main grudge here is not the container but rather the system
> > _hosting_ the container.
> > That is typically using devtmpfs and hence will see _all_ devices, even
> > those belonging to the container.
> > This is causing grief to no end if eg the host system starts activating
> > LVM on devices which are passed to the qemu instance running within a
> > container ...
> >
> >> To make this really clean you will likely have to significantly rework
> >> sysfs too and I don't think that churn is worth it and introduces a
> >> layer of complexity I find outright nakable. And ignoring sysfs or
> >> hacking around it is also not an option I find tasteful.
> >>
> > Network namespaces already have the bits and pieces to modify sysfs, so
> > we should be able to leverage that for block, too.
> > And I think by restricting it to 'block' devices we should be able to keep
> > the required sysfs modifications at a manageable level.
> >
> >>>
> >>> uevent redirection would help here, but from what I've seen it's only
> >>> for net devices; feels a bit awkward to have a network namespace to get
> >>> uevents for block devices, but then I'll have to test.
> >>
> >> Just to move everyone on the same page. This is not specific to network
> >> devices actually.
> >>
> >> You are right though that network devices are correctly namespaced.
> >> Specifically you only get uevents in the network namespace that network
> >> device is moved into. The sysfs permissions for network devices were
> >> correct if you created that network device in the network namespace but
> >> they were wrong when you moved a network device between network
> >> namespaces (with different owning user namespaces). That led to all
> >> kinds of weird issues. I fixed that a while back.
> >>
> > Granted, modifying sysfs layout is not something for the faint-hearted,
> > and one really has to look closely to ensure you end up with a
> > consistent layout afterwards.
> >
> > But let's see how things go; might well be that it turns out to be too
> > complex to consider. Can't tell yet.
>
> I would suggest aiming for something like devptsfs without the
> complication of /dev/ptmx.
>
> That is a pseudo filesystem that has a control node and virtual block
> devices that were created using that control node.
Also see android/binder/binderfs.c
>
> That is the cleanest solution I know and is not strictly limited to use
> with containers so it can also gain greater traction. The interaction
> with devtmpfs should be simply having devtmpfs create a mount point for
> that filesystem.
>
> This could be a new cleaner api for things like loopback devices.
I sent a patchset that implemented this last year.
* Re: device namespaces
2021-06-09 6:38 ` Christian Brauner
@ 2021-06-09 7:02 ` Hannes Reinecke
2021-06-09 7:21 ` Christian Brauner
0 siblings, 1 reply; 48+ messages in thread
From: Hannes Reinecke @ 2021-06-09 7:02 UTC (permalink / raw)
To: Christian Brauner, Eric W. Biederman
Cc: gregkh, containers, linux-kernel, lkml
On 6/9/21 8:38 AM, Christian Brauner wrote:
> On Tue, Jun 08, 2021 at 12:16:43PM -0500, Eric W. Biederman wrote:
>> Hannes Reinecke <hare@suse.de> writes:
>>
>>> On 6/8/21 4:29 PM, Christian Brauner wrote:
>>>> On Tue, Jun 08, 2021 at 04:10:08PM +0200, Hannes Reinecke wrote:
[ .. ]
>>> Granted, modifying sysfs layout is not something for the faint-hearted,
>>> and one really has to look closely to ensure you end up with a
>>> consistent layout afterwards.
>>>
>>> But let's see how things go; might well be that it turns out to be too
>>> complex to consider. Can't tell yet.
>>
>> I would suggest aiming for something like devptsfs without the
>> complication of /dev/ptmx.
>>
>> That is a pseudo filesystem that has a control node and virtual block
>> devices that were created using that control node.
>
> Also see android/binder/binderfs.c
>
Ah. Will have a look.
>>
>> That is the cleanest solution I know and is not strictly limited to use
>> with containers so it can also gain greater traction. The interaction
>> with devtmpfs should be simply having devtmpfs create a mount point for
>> that filesystem.
>>
>> This could be a new cleaner api for things like loopback devices.
>
> I sent a patchset that implemented this last year.
>
Do you have a pointer/commit hash for this?
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer
* Re: device namespaces
2021-06-09 7:02 ` Hannes Reinecke
@ 2021-06-09 7:21 ` Christian Brauner
2021-06-09 7:54 ` Hannes Reinecke
0 siblings, 1 reply; 48+ messages in thread
From: Christian Brauner @ 2021-06-09 7:21 UTC (permalink / raw)
To: Hannes Reinecke; +Cc: Eric W. Biederman, gregkh, containers, linux-kernel, lkml
On Wed, Jun 09, 2021 at 09:02:36AM +0200, Hannes Reinecke wrote:
> On 6/9/21 8:38 AM, Christian Brauner wrote:
> > On Tue, Jun 08, 2021 at 12:16:43PM -0500, Eric W. Biederman wrote:
> > > Hannes Reinecke <hare@suse.de> writes:
> > >
> > > > On 6/8/21 4:29 PM, Christian Brauner wrote:
> > > > > On Tue, Jun 08, 2021 at 04:10:08PM +0200, Hannes Reinecke wrote:
> [ .. ]
> > > > Granted, modifying sysfs layout is not something for the faint-hearted,
> > > > and one really has to look closely to ensure you end up with a
> > > > consistent layout afterwards.
> > > >
> > > > But let's see how things go; might well be that it turns out to be too
> > > > complex to consider. Can't tell yet.
> > >
> > > I would suggest aiming for something like devptsfs without the
> > > complication of /dev/ptmx.
> > >
> > > That is a pseudo filesystem that has a control node and virtual block
> > > devices that were created using that control node.
> >
> > Also see android/binder/binderfs.c
> >
> Ah. Will have a look.
I implemented this a few years back and I think it should've made it
onto Android by default now. So that approach does indeed work well, it
seems:
https://chromium.googlesource.com/aosp/platform/system/core/+/master/rootdir/init.rc#257
This should be easier to follow than the devpts case because you don't
need to wade through the {t,p}ty layer.
>
> > >
> > > That is the cleanest solution I know and is not strictly limited to use
> > > with containers so it can also gain greater traction. The interaction
> > > with devtmpfs should be simply having devtmpfs create a mount point for
> > > that filesystem.
> > >
> > > This could be a new cleaner api for things like loopback devices.
> >
> > I sent a patchset that implemented this last year.
> >
> Do you have a pointer/commit hash for this?
Yes, sure:
https://lore.kernel.org/linux-block/20200424162052.441452-1-christian.brauner@ubuntu.com/
You can also just pull my branch. I think it's still based on v5.7 or something:
https://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux.git/log/?h=loopfs
I'm happy to collaborate on this too.
Christian
* Re: device namespaces
2021-06-09 7:21 ` Christian Brauner
@ 2021-06-09 7:54 ` Hannes Reinecke
2021-06-09 8:09 ` Christian Brauner
0 siblings, 1 reply; 48+ messages in thread
From: Hannes Reinecke @ 2021-06-09 7:54 UTC (permalink / raw)
To: Christian Brauner
Cc: Eric W. Biederman, gregkh, containers, linux-kernel, lkml
On 6/9/21 9:21 AM, Christian Brauner wrote:
> On Wed, Jun 09, 2021 at 09:02:36AM +0200, Hannes Reinecke wrote:
>> On 6/9/21 8:38 AM, Christian Brauner wrote:
>>> On Tue, Jun 08, 2021 at 12:16:43PM -0500, Eric W. Biederman wrote:
>>>> Hannes Reinecke <hare@suse.de> writes:
>>>>
>>>>> On 6/8/21 4:29 PM, Christian Brauner wrote:
>>>>>> On Tue, Jun 08, 2021 at 04:10:08PM +0200, Hannes Reinecke wrote:
>> [ .. ]
>>>>> Granted, modifying sysfs layout is not something for the faint-hearted,
>>>>> and one really has to look closely to ensure you end up with a
>>>>> consistent layout afterwards.
>>>>>
>>>>> But let's see how things go; might well be that it turns out to be too
>>>>> complex to consider. Can't tell yet.
>>>>
>>>> I would suggest aiming for something like devptsfs without the
>>>> complication of /dev/ptmx.
>>>>
>>>> That is a pseudo filesystem that has a control node and virtual block
>>>> devices that were created using that control node.
>>>
>>> Also see android/binder/binderfs.c
>>>
>> Ah. Will have a look.
>
> I implemented this a few years back and I think it should've made it
> onto Android by default now. So that approach does indeed work well, it
> seems:
> https://chromium.googlesource.com/aosp/platform/system/core/+/master/rootdir/init.rc#257
>
> This should be easier to follow than the devpts case because you don't
> need to wade through the {t,p}ty layer.
>
>>
>>>>
>>>> That is the cleanest solution I know and is not strictly limited to use
>>>> with containers so it can also gain greater traction. The interaction
>>>> with devtmpfs should be simply having devtmpfs create a mount point for
>>>> that filesystem.
>>>>
>>>> This could be a new cleaner api for things like loopback devices.
>>>
>>> I sent a patchset that implemented this last year.
>>>
>> Do you have a pointer/commit hash for this?
>
> Yes, sure:
> https://lore.kernel.org/linux-block/20200424162052.441452-1-christian.brauner@ubuntu.com/
>
> You can also just pull my branch. I think it's still based on v5.7 or sm:
> https://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux.git/log/?h=loopfs
>
> I'm happy to collaborate on this too.
>
How _very_ curious. 'kernfs: handle multiple namespace tags' and 'loop:
preserve sysfs backwards compatibility' are essentially the same patches I
did for my block namespaces prototype; I named it 'KOBJ_NS_TYPE_BLK', not
'KOBJ_NS_TYPE_USER', though :-)
Guess we really should cooperate.
Speaking of which: why did you name it 'user' namespace?
There already is a generic 'user_namespace' in
include/linux/user_namespace.h, serving as a container for all
namespaces; as such it probably should include this 'user' namespace,
leading to quite some confusion.
Or did I misunderstand something here?
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Felix Imendörffer
* Re: device namespaces
2021-06-09 7:54 ` Hannes Reinecke
@ 2021-06-09 8:09 ` Christian Brauner
2021-06-11 18:14 ` Eric W. Biederman
0 siblings, 1 reply; 48+ messages in thread
From: Christian Brauner @ 2021-06-09 8:09 UTC (permalink / raw)
To: Hannes Reinecke; +Cc: Eric W. Biederman, gregkh, containers, linux-kernel, lkml
On Wed, Jun 09, 2021 at 09:54:05AM +0200, Hannes Reinecke wrote:
> On 6/9/21 9:21 AM, Christian Brauner wrote:
> > On Wed, Jun 09, 2021 at 09:02:36AM +0200, Hannes Reinecke wrote:
> >> On 6/9/21 8:38 AM, Christian Brauner wrote:
> >>> On Tue, Jun 08, 2021 at 12:16:43PM -0500, Eric W. Biederman wrote:
> >>>> Hannes Reinecke <hare@suse.de> writes:
> >>>>
> >>>>> On 6/8/21 4:29 PM, Christian Brauner wrote:
> >>>>>> On Tue, Jun 08, 2021 at 04:10:08PM +0200, Hannes Reinecke wrote:
> >> [ .. ]
> >>>>> Granted, modifying sysfs layout is not something for the faint-hearted,
> >>>>> and one really has to look closely to ensure you end up with a
> >>>>> consistent layout afterwards.
> >>>>>
> >>>>> But let's see how things go; might well be that it turns out to be too
> >>>>> complex to consider. Can't tell yet.
> >>>>
> >>>> I would suggest aiming for something like devptsfs without the
> >>>> complication of /dev/ptmx.
> >>>>
> >>>> That is a pseudo filesystem that has a control node and virtual block
> >>>> devices that were created using that control node.
> >>>
> >>> Also see android/binder/binderfs.c
> >>>
> >> Ah. Will have a look.
> >
> > I implemented this a few years back and I think it should've made it
> > onto Android by default now. So that approach does indeed work well, it
> > seems:
> > https://chromium.googlesource.com/aosp/platform/system/core/+/master/rootdir/init.rc#257
> >
> > This should be easier to follow than the devpts case because you don't
> > need to wade through the {t,p}ty layer.
> >
> >>
> >>>>
> >>>> That is the cleanest solution I know and is not strictly limited to use
> >>>> with containers so it can also gain greater traction. The interaction
> >>>> with devtmpfs should be simply having devtmpfs create a mount point for
> >>>> that filesystem.
> >>>>
> >>>> This could be a new cleaner api for things like loopback devices.
> >>>
> >>> I sent a patchset that implemented this last year.
> >>>
> >> Do you have a pointer/commit hash for this?
> >
> > Yes, sure:
> > https://lore.kernel.org/linux-block/20200424162052.441452-1-christian.brauner@ubuntu.com/
> >
> > You can also just pull my branch. I think it's still based on v5.7 or sm:
> > https://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux.git/log/?h=loopfs
> >
> > I'm happy to collaborate on this too.
> >
> How _very_ curious. 'kernfs: handle multiple namespace tags' and 'loop:
> preserve sysfs backwards compability' are essentially the same patches I
> did for my block namespaces prototyp; I named it 'KOBJ_NS_TYPE_BLK', not
> 'KOBJ_NS_TYPE_USER', though :-)
>
> Guess we really should cooperate.
>
> Speaking of which: why did you name it 'user' namespace?
> There already is a generic 'user_namespace' in
> include/linux/user_namespace.h, serving as a container for all
> namespaces; as such it probably should include this 'user' namespace,
> leading to quite some confusion.
>
> Or did I misunderstood something here?
Ah yes, you misunderstand. The KOBJ_NS_TYPE_* tags are namespace tags.
So KOBJ_NS_TYPE_NET is a network namespace tag. So KOBJ_NS_TYPE_USER is
a user namespace tag, not a completely new namespace. The idea, very
roughly, is that devices such as loop devices are ultimately filtered
by user namespace, which is taken from the s_user_ns the loopfs instance
is mounted in. We should compare notes.
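To make the tagging idea concrete, here is a small userspace model in
Python. It is only a sketch, not kernel code: `Entry`, `visible_entries`
and the rule that untagged entries are visible to everyone are assumptions
made for illustration, and plain Python objects stand in for the user
namespace pointers the kernel would use as tags.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """One directory entry in a (hypothetical) tagged sysfs directory."""
    name: str
    ns_tag: object  # None means "untagged": shown in every view (assumption)


def visible_entries(entries, viewer_tag):
    """Return the names a viewer holding `viewer_tag` would see."""
    return [
        e.name
        for e in entries
        if e.ns_tag is None or e.ns_tag is viewer_tag
    ]


# Two stand-in "user namespaces"; identity comparison mimics pointer tags.
host_userns = object()
container_userns = object()

block_dir = [
    Entry("sda", None),
    Entry("loop0", host_userns),
    Entry("loop1", container_userns),
]

print(visible_entries(block_dir, host_userns))       # ['sda', 'loop0']
print(visible_entries(block_dir, container_userns))  # ['sda', 'loop1']
```

The point of the model is that the tag selects a view of one shared
directory; it does not create a second directory or a new kind of namespace.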
Christian
* Re: device namespaces
2021-06-09 8:09 ` Christian Brauner
@ 2021-06-11 18:14 ` Eric W. Biederman
2021-06-14 7:49 ` Enrico Weigelt, metux IT consult
0 siblings, 1 reply; 48+ messages in thread
From: Eric W. Biederman @ 2021-06-11 18:14 UTC (permalink / raw)
To: Christian Brauner; +Cc: Hannes Reinecke, gregkh, containers, linux-kernel, lkml
Christian Brauner <christian.brauner@ubuntu.com> writes:
> On Wed, Jun 09, 2021 at 09:54:05AM +0200, Hannes Reinecke wrote:
>> On 6/9/21 9:21 AM, Christian Brauner wrote:
>> > On Wed, Jun 09, 2021 at 09:02:36AM +0200, Hannes Reinecke wrote:
>> >> On 6/9/21 8:38 AM, Christian Brauner wrote:
>> >>> On Tue, Jun 08, 2021 at 12:16:43PM -0500, Eric W. Biederman wrote:
>> >>>> Hannes Reinecke <hare@suse.de> writes:
>> >>>>
>> >>>>> On 6/8/21 4:29 PM, Christian Brauner wrote:
>> >>>>>> On Tue, Jun 08, 2021 at 04:10:08PM +0200, Hannes Reinecke wrote:
>> >> [ .. ]
>> >>>>> Granted, modifying sysfs layout is not something for the faint-hearted,
>> >>>>> and one really has to look closely to ensure you end up with a
>> >>>>> consistent layout afterwards.
>> >>>>>
>> >>>>> But let's see how things go; might well be that it turns out to be too
>> >>>>> complex to consider. Can't tell yet.
>> >>>>
>> >>>> I would suggest aiming for something like devptsfs without the
>> >>>> complication of /dev/ptmx.
>> >>>>
>> >>>> That is a pseudo filesystem that has a control node and virtual block
>> >>>> devices that were created using that control node.
>> >>>
>> >>> Also see android/binder/binderfs.c
>> >>>
>> >> Ah. Will have a look.
>> >
>> > I implemented this a few years back and I think it should've made it
>> > onto Android by default now. So that approach does indeed work well, it
>> > seems:
>> > https://chromium.googlesource.com/aosp/platform/system/core/+/master/rootdir/init.rc#257
>> >
>> > This should be easier to follow than the devpts case because you don't
>> > need to wade through the {t,p}ty layer.
>> >
>> >>
>> >>>>
>> >>>> That is the cleanest solution I know and is not strictly limited to use
>> >>>> with containers so it can also gain greater traction. The interaction
>> >>>> with devtmpfs should be simply having devtmpfs create a mount point for
>> >>>> that filesystem.
>> >>>>
>> >>>> This could be a new cleaner api for things like loopback devices.
>> >>>
>> >>> I sent a patchset that implemented this last year.
>> >>>
>> >> Do you have a pointer/commit hash for this?
>> >
>> > Yes, sure:
>> > https://lore.kernel.org/linux-block/20200424162052.441452-1-christian.brauner@ubuntu.com/
>> >
>> > You can also just pull my branch. I think it's still based on v5.7 or sm:
>> > https://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux.git/log/?h=loopfs
>> >
>> > I'm happy to collaborate on this too.
>> >
>> How _very_ curious. 'kernfs: handle multiple namespace tags' and 'loop:
>> preserve sysfs backwards compability' are essentially the same patches I
>> did for my block namespaces prototyp; I named it 'KOBJ_NS_TYPE_BLK', not
>> 'KOBJ_NS_TYPE_USER', though :-)
>>
>> Guess we really should cooperate.
>>
>> Speaking of which: why did you name it 'user' namespace?
>> There already is a generic 'user_namespace' in
>> include/linux/user_namespace.h, serving as a container for all
>> namespaces; as such it probably should include this 'user' namespace,
>> leading to quite some confusion.
>>
>> Or did I misunderstood something here?
>
> Ah yes, you misunderstand. The KOBJ_NS_TYPE_* tags are namespace tags.
> So KOBJ_NS_TYPE_NET is a network namespace tag. So KOBJ_NS_TYPE_USER is
> a user namespace tag not a completely new namespace. The idea very
> roughly being that devices such as loop devices are ultimately filtered
> by user namespace which is taken from the s_user_ns the loopfs instance
> is mounted in. We should compare notes.
There are two easy possibilities.
- All of the devices on the filesystem show up in sysfs with unique
major/minor numbers.
- None of the devices on the filesystem show up in sysfs.
(Which I believe is what devpts does).
I favor none of the virtual devices showing up in sysfs. Maybe existing
userspace needs the devices in sysfs, but if the solution is simply to
skip sysfs for virtual devices that is much simpler.
Eric
* Re: device namespaces
2021-06-11 18:14 ` Eric W. Biederman
@ 2021-06-14 7:49 ` Enrico Weigelt, metux IT consult
2021-06-14 8:22 ` Greg KH
2021-06-14 17:36 ` Eric W. Biederman
0 siblings, 2 replies; 48+ messages in thread
From: Enrico Weigelt, metux IT consult @ 2021-06-14 7:49 UTC (permalink / raw)
To: Eric W. Biederman, Christian Brauner
Cc: Hannes Reinecke, gregkh, containers, linux-kernel
On 11.06.21 20:14, Eric W. Biederman wrote:
Hi,
> I favor none of the virtual devices showing up in sysfs. Maybe existing
> userspace needs the devices in sysfs, but if the solution is simply to
> skip sysfs for virtual devices that is much simpler.
Sorry for being a little bit confused, but by virtual devices do you mean
things like ptys, or all the other stuff we already see under
/sys/devices/virtual ?
I'm not yet sure which way is better. If we're just talking about ptys
specifically, I could maybe live with treating them like a "special sort
of pipe", but I guess that would require some extra magic.
If I'm not mistaken, the whole sysfs stuff is automatically handled by
device classes and buses - it seems that ttys are also class devs.
How would you skip the virtual devices from sysfs ? Adding some filter
into sysfs that looks at the device class (or some flag within it) ?
--mtx
--
---
Hinweis: unverschlüsselte E-Mails können leicht abgehört und manipuliert
werden ! Für eine vertrauliche Kommunikation senden Sie bitte ihren
GPG/PGP-Schlüssel zu.
---
Enrico Weigelt, metux IT consult
Free software and Linux embedded engineering
info@metux.net -- +49-151-27565287
* Re: device namespaces
2021-06-14 7:49 ` Enrico Weigelt, metux IT consult
@ 2021-06-14 8:22 ` Greg KH
2021-06-14 17:36 ` Eric W. Biederman
1 sibling, 0 replies; 48+ messages in thread
From: Greg KH @ 2021-06-14 8:22 UTC (permalink / raw)
To: Enrico Weigelt, metux IT consult
Cc: Eric W. Biederman, Christian Brauner, Hannes Reinecke,
containers, linux-kernel
On Mon, Jun 14, 2021 at 09:49:22AM +0200, Enrico Weigelt, metux IT consult wrote:
> On 11.06.21 20:14, Eric W. Biederman wrote:
>
> Hi,
>
> > I favor none of the virtual devices showing up in sysfs. Maybe existing
> > userspace needs the devices in sysfs, but if the solution is simply to
> > skip sysfs for virtual devices that is much simpler.
>
> Sorry for being a little bit confused, but by virtual devices you mean
> things like pty's or all the other stuff we already see under
> /sys/device/virtual ?
>
> I'm yet unsure what the better way is. If we're just talking about pty's
> specifically, I maybe could live with threating them like "special sort
> of pipes", but I guess that would require some extra magic.
>
> If I'm not mistaken, the whole sysfs stuff is automatically handled
> device classes and bus'es - seems that tty's are also class devs.
>
> How would you skip the virtual devices from sysfs ? Adding some filter
> into sysfs that looks at the device class (or some flag within it) ?
Wait, step back. What _EXACTLY_ are you wanting to do here? If you
have not looked at how sysfs handles devices today, that leads me to
believe that you do not have a real model in place.
Again, spend some time and write some code please before continuing this
thread. We don't like to talk about vague things when you do not even
have an idea of what you want.
good luck!
greg k-h
* Re: device namespaces
2021-06-14 7:49 ` Enrico Weigelt, metux IT consult
2021-06-14 8:22 ` Greg KH
@ 2021-06-14 17:36 ` Eric W. Biederman
2021-06-15 11:24 ` Enrico Weigelt, metux IT consult
1 sibling, 1 reply; 48+ messages in thread
From: Eric W. Biederman @ 2021-06-14 17:36 UTC (permalink / raw)
To: Enrico Weigelt, metux IT consult
Cc: Christian Brauner, Hannes Reinecke, gregkh, containers, linux-kernel
"Enrico Weigelt, metux IT consult" <lkml@metux.net> writes:
> On 11.06.21 20:14, Eric W. Biederman wrote:
>
> Hi,
>
>> I favor none of the virtual devices showing up in sysfs. Maybe existing
>> userspace needs the devices in sysfs, but if the solution is simply to
>> skip sysfs for virtual devices that is much simpler.
>
> Sorry for being a little bit confused, but by virtual devices you mean
> things like pty's or all the other stuff we already see under
> /sys/device/virtual ?
By virtual devices I mean all devices that are not physical pieces
of hardware. For block devices I mean devices such as loopback
devices that are created on demand. Ramdisks, which started this
conversation, could also be considered virtual devices.
> How would you skip the virtual devices from sysfs ? Adding some filter
> into sysfs that looks at the device class (or some flag within it) ?
I would just not run the code to create sysfs entries when the virtual
devices are created.
If you have virtual devices showing up in their own filesystem they
don't even need major or minor numbers. You can just have files
that accept ioctls like device nodes. In principle it is
possible to skip a lot of the historical infrastructure. If the
infrastructure is not needed it is worth skipping.
I haven't dug into the block layer recently enough to say what is needed
or not. I think there are some things, such as stat on a mounted
filesystem, that need major and minor numbers. Which probably means
you have to use major and minor numbers. By virtue of using common
infrastructure that implies showing up in sysfs and devtmpfs. Things
would be limited just by not mounting devtmpfs in a container.
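The device numbers that stat reports for a mounted filesystem can be
inspected from userspace. A minimal Python sketch, assuming a Linux or
other Unix system; `backing_device` is a helper name made up here:

```python
import os


def backing_device(path):
    """Return (major, minor) of the device backing the filesystem at `path`."""
    st = os.stat(path)
    # st_dev identifies the device the file lives on; split it into its parts.
    return os.major(st.st_dev), os.minor(st.st_dev)


major, minor = backing_device("/")
print(f"/ is backed by device {major}:{minor}")
```

For a filesystem on a block device this is that device's major and minor;
pseudo filesystems typically report an anonymous device number instead.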
It is worth checking how much of the common infrastructure you need when
you start creating virtual devices.
The only reason the network devices need changes to sysfs is to allow
different network devices with the same name to show up in different
network namespaces.
If you can fundamentally avoid the problem of devices with the same
name needing to show up in sysfs and devtmpfs by using filesystems
then sysfs and devtmpfs need no changes.
Hotplug is sufficiently widespread now that it should be possible
to avoid the hard problem of having duplicate names for block devices,
one way or another. Thus talking of changing sysfs seems completely
unnecessary.
Eric
* Re: device namespaces
2021-06-14 17:36 ` Eric W. Biederman
@ 2021-06-15 11:24 ` Enrico Weigelt, metux IT consult
2021-06-15 11:33 ` Greg KH
0 siblings, 1 reply; 48+ messages in thread
From: Enrico Weigelt, metux IT consult @ 2021-06-15 11:24 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Christian Brauner, Hannes Reinecke, gregkh, containers, linux-kernel
On 14.06.21 19:36, Eric W. Biederman wrote:
> By virtual devices I mean all devices that are not physical pieces
> of hardware. For block devices I mean devices such as loopback
> devices that are created on demand. Ramdisks that start this
> conversation could also be considered virtual devices.
Ok. Do you also count partitions in here ?
IMHO we've got another category to look up: devices that (can) create
more (sub)devices. Examples coming into my head are loopdev, ptmx,
partitions, etc.
The big problem here: first we'd need to be clear on the actual
semantics in a namespaced context, for example:
* what happens when you talk to /dev/loop0 and create a new loopdev
inside a container - shall it ever be visible on the host ?
* what if you want to create a loopdev on some file that's only visible
to the host, but that loopdev shall appear inside a container ?
("virtual disk" scenario)
>> How would you skip the virtual devices from sysfs ? Adding some filter
>> into sysfs that looks at the device class (or some flag within it) ?
>
> I would just not run the code to create sysfs entries when the virtual
> devices are created.
Oh, that would most likely make userland unhappy.
Besides, that won't be so trivial due to the way sysfs works, because
sysfs more or less just presents kobjs. Each kobj may have attributes,
a parent, and a list of children. A device is a kobj, and it needs to
be registered into the device hierarchy to work at all. Sysfs itself
doesn't really know whether something is a virtual device (or a device
at all) - it just calls some functions from kobject_type for things like
reading/writing attributes, etc. But I don't see anything where
kobject_types can implement their own iterators.
As things are right now, not registering a device in sysfs means not
registering it at all.
By the way: I'm just wondering whether it would make sense to give
kobject_type its own iteration and lookup functions. Unless I'm fully
mistaken, that could help solve several other problems, e.g. device
renaming (currently *very* tricky and only works to some extent for
network devices).
IMHO, we could then eg. fetch the device names (/sys/devices/...)
directly from the struct device instead of the kset (perhaps a simple
list instead of kset would also do here), and also create the symlinks
(e.g. /sys/class/.../) on the fly. Once that's done, renaming a device
should become rather simple.
At that point, adding multiple views of certain parts of sysfs (e.g. the
devices hierarchy) could perhaps be done by implementing special
iterators that take the view criteria into account.
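A toy model of what such a view-aware iterator could look like, in Python.
This is only a sketch under assumptions: `Node`, its `owner` tag, `walk`
and `view` are hypothetical names, and nothing like this exists in sysfs
or the kobject API today.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterator, List, Optional


@dataclass
class Node:
    """Stand-in for a kobject: a name, children, and a hypothetical view tag."""
    name: str
    owner: Optional[str] = None  # None = shared between all views (assumption)
    children: List["Node"] = field(default_factory=list)

    def add(self, child: "Node") -> "Node":
        self.children.append(child)
        return child


def walk(node: Node, allowed: Callable[[Node], bool], prefix: str = "") -> Iterator[str]:
    """Yield paths depth-first, descending only into children the view allows."""
    path = f"{prefix}/{node.name}"
    yield path
    for child in node.children:
        if allowed(child):
            yield from walk(child, allowed, path)


def view(tag: str) -> Callable[[Node], bool]:
    """Build a view predicate: shared nodes plus nodes owned by `tag`."""
    return lambda n: n.owner is None or n.owner == tag


root = Node("devices")
block = root.add(Node("virtual")).add(Node("block"))
block.add(Node("loop0", owner="host"))
block.add(Node("loop1", owner="c1"))

print(list(walk(root, view("c1"))))
# ['/devices', '/devices/virtual', '/devices/virtual/block',
#  '/devices/virtual/block/loop1']
```

Here the filtering lives in the iterator rather than in the tree, so the
same hierarchy can be presented differently to each viewer.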
@Greg: what's your take on that iterator idea ?
> If you have virtual devices showing up in their own filesystem they
> don't even need major or minor numbers. You can just have files
> that accept ioctls like device nodes. In principle it is
> possible to skip a lot of the historical infrastructure. If the
> infrastructure is not needed it is worth skipping.
Ah, I see where you're going. You want to completely drop these virtual
devices and replace them with a synthetic fs that *looks* like it
contains devices ? Well, theoretically it should be possible, since
filesystems may handle opening device nodes completely on their own,
instead of calling generic code (is there any that actually does ?).
BUT: in that case we have to really make sure that processes inside the
container cannot ever open any device node outside that special fs.
> I haven't dug into the block layer recently enough to say what is needed
> or not. I think there are some thing such as stat on a mounted
> filesystem that need a major and minor numbers. Which probably means
> you have to use major and minor numbers. By virtue of using common
> infrastructure that implies showing up in sysfs and devtmpfs. Things
> would be limited just by not mounting devtmpfs in a container.
Note that this approach also needs to support things like dynamically
creating new device nodes (inside the container), udev, ... otherwise
you'd need very special handling in userland again (lxc folks would
become very unhappy ;-))
> It is worth checking how much of the common infrastructure you need when
> you start creating virtual devices.
s/virtual devices/synthetic filesystems/;
Your approach goes very much in the Plan9 direction (which in general I'd
love to see). But whatever we're going to do here needs to remain compatible
with what existing userland expects - we've got a lot of Unix tradition
to keep here.
OR: we'd have to declare that (once inside the devns) we throw it all away
and create something entirely new that's more like a Plan9 subsystem
than a Linux container. Also interesting, but not what I started
this discussion for.
> The only reason the network devices need changes to sysfs is to allow
> different network devices with the same name to show up in different
> network namespaces.
>
> If you can fundamentally avoid the problem of devices with the same
> name needing to show up in sysfs and devtmpfs by using filesystems
> then sysfs and devtmpfs needs no changes.
Well, that's only for the sysfs part. Network devices still need to
be namespaced in other places (sockets, etc.) - which is already done by
netns.
But yes, it would be nice if we had entirely different namespaces for
network device names (e.g. any of the host's network devices could
appear simply as "eth0" inside a container, if you want).
--mtx
--
---
Hinweis: unverschlüsselte E-Mails können leicht abgehört und manipuliert
werden ! Für eine vertrauliche Kommunikation senden Sie bitte ihren
GPG/PGP-Schlüssel zu.
---
Enrico Weigelt, metux IT consult
Free software and Linux embedded engineering
info@metux.net -- +49-151-27565287
* Re: device namespaces
2021-06-15 11:24 ` Enrico Weigelt, metux IT consult
@ 2021-06-15 11:33 ` Greg KH
0 siblings, 0 replies; 48+ messages in thread
From: Greg KH @ 2021-06-15 11:33 UTC (permalink / raw)
To: Enrico Weigelt, metux IT consult
Cc: Eric W. Biederman, Christian Brauner, Hannes Reinecke,
containers, linux-kernel
On Tue, Jun 15, 2021 at 01:24:24PM +0200, Enrico Weigelt, metux IT consult wrote:
> @Greg: what's your take on that iterator idea ?
I want you to stop talking about ideas, and try to implement them before
this conversation wastes anyone else's time and energy.
There is a good reason we do not do this type of "let's discuss things!"
in the kernel community, and that is because almost none of it matters
without working code.
So please, let's see some patches that implement your ideas and then we
can discuss them.
Until then, consider this thread ignored from me.
greg k-h
* Re: Device Namespaces
@ 2013-09-29 19:28 Amir Goldstein
[not found] ` <CAA2m6veny-7_ONMA973Wu36U4kz4gAuw0dpodkb8+GZDv6VNBQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
0 siblings, 1 reply; 48+ messages in thread
From: Amir Goldstein @ 2013-09-29 19:28 UTC (permalink / raw)
To: Greg Kroah-Hartman
Cc: Linux Containers, Kay Sievers, Andy Lutomirski,
devel-GEFAQzZX7r8dnm+yROfE0A, Eric W. Biederman, lxc-devel,
mhw-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Stephane Graber
On Thu, Sep 26, 2013 at 8:33 AM, Greg Kroah-Hartman <
gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r@public.gmane.org> wrote:
> On Wed, Sep 25, 2013 at 02:34:54PM -0700, Eric W. Biederman wrote:
> > So the big issues for a device namespace to solve are filtering which
> > devices a container has access to and being able to dynamically change
> > which devices those are at run time (aka hotplug).
>
> As _all_ devices are hotpluggable now (look, there's no CONFIG_HOTPLUG
> anymore, because it was redundant), I think you need to really think
> this through better (pci, memory, cpus, etc.) before you do anything in
> the kernel.
>
> > After having thought about this for a bit I don't know if a pure
> > userspace solution is sufficient or actually a good idea.
> >
> > - We can manually manage a tmpfs with device nodes in userspace.
> > (But that is deprecated functionality in the mainstream kernel).
>
> Yes, but I'm not going to namespace devtmpfs, as that is going to be an
> impossible task, right?
>
That sounds like a challenge ;-)
Seriously, as Serge correctly noted, it would not be that different from
devpts if you start from an empty devtmpfs and populate it with devices
that are "added in the context of that namespace".
The semantics in which devices are "added in the context of a namespace"
is the missing piece of the puzzle.
What we would really like to see is a setns() style API that can be used to
add a device in the context of a namespace in either a "shared" or "private"
mode.
This kind of API is a required building block for us to write device drivers
that are namespace aware in a way that gives userspace enough flexibility
for dynamic configuration.
We are trying to come up with a proposal for that sort of API.
When we have something decent, we shall post it.
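As a rough illustration of the "shared" versus "private" distinction, here is
a toy userspace model in Python. It is only a sketch under stated assumptions:
DevNamespace, Mode and add_device are hypothetical names made up for this
example and do not correspond to any existing or proposed kernel interface.

```python
from dataclasses import dataclass, field
from enum import Enum


class Mode(Enum):
    SHARED = "shared"    # device stays visible wherever it already was
    PRIVATE = "private"  # device becomes visible only in the target namespace


@dataclass
class DevNamespace:
    """Hypothetical per-namespace device view; a new one starts out empty."""
    name: str
    nodes: set = field(default_factory=set)


def add_device(dev, target, all_namespaces, mode):
    """Add `dev` to `target`; PRIVATE also removes it from every other view."""
    if mode is Mode.PRIVATE:
        for ns in all_namespaces:
            if ns is not target:
                ns.nodes.discard(dev)
    target.nodes.add(dev)
```

In this model a container's view starts empty, much like a fresh devpts
instance, and only grows through explicit add_device() calls.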
> And remember, udev doesn't create device nodes anymore...
>
> > - We can manually export a subset of sysfs with bind mounts.
> > (But that feels hacky, and is essentially incompatible with hotplug).
>
> True.
>
> > - We can relay a call of /sbin/hotplug from outside of a container
> > to inside of a container based on policy.
> > (But no one uses /sbin/hotplug anymore).
>
> That's right, they should be listening to libudev events, so why can't
> your daemon shuffle them off to the proper container, all in userspace?
>
> > - There is no way to fake netlink uevents for a container to see them.
> > (The best we could do is replace udev everywhere with something that
> > listens on a unix domain socket).
>
> You shouldn't need to do this.
>
> > - It would be nice to replace the device cgroup with a comprehensive
> > solution that really works. (Among other things the device cgroup
> > does not work in terms of struct device the underlying kernel
> > abstraction for devices).
>
> I didn't even know there was a device cgroup.
>
> Which means that if there is one, odds are it's useless.
>
> > We must manage sysfs entries as well as device nodes because:
> > - Seeing more than we should has the real potential to confuse
> > userspace, especially a userspace that replays uevents.
>
> You should never replay uevents. If you don't do that, why can't you
> see all of sysfs?
>
> > - Some device control must happen through writing to sysfs files and
> > if we don't remove all root privileges from a container only by
> > exporting a subset of sysfs to that container can we limit which
> > sysfs nodes can be written to.
>
> But you have the issue of controlling devices in a "shared" way, which
> isn't going to be usable for almost all devices.
>
> > The current kernel tagged sysfs entry support does not look like a good
> > match for implementing device filtering. The common case will
> > be allowing devices like /dev/zero, and /dev/null that live in
> > /sys/devices/virtual and are the devices we are most likely to care
> > about. Those devices need to live in multiple device namespaces so
> > everyone can use them. Perhaps exclusive assignment will be the more
> > common paradigm for device namespaces like it is for network devices in
> > the network namespace but from what little I can of this problem right now I
> > don't think so.
> >
> > I definitely think we should hold off on a kernel level implementation
> > until we really understand the issues and are ready to implement device
> > namespaces correctly.
>
> I agree, especially as I don't think this will ever work.
>
> > A userspace implementation looks like it can only do about 95% of what
> > is really needed, but at the same time looks like an easy way to
> > experiment until the problem is sufficiently well understood.
>
> 95% is probably way better than what you have today, and will fit the
> needs of almost everyone today, so why not do it?
>
> I'd argue that those last 5% either are custom solutions that never get
> merged, or candidates for true virtualization.
>
> > In summary the situation with device hotplug and containers sucks today,
> > and we need to do something. Running a linux desktop in a container is
> > a reasonably good example use case.
>
> No it isn't. I'd argue that this is a horrible use case, one that you
> shouldn't do. Why not just use multi-head machines like people do who
> really want to do this, relying on user separation? That's a workable
> solution that is quite common and works very well today.
>
> > Having one standard common maintainable implementation would be very
> > useful and the most logical place for that would be in the kernel.
> > For now we should focus on simple device filtering and hotplug.
>
> Just listen for libudev stuff, don't try to filter them, or ever
> "replay" them, that way lies madness, and lots of nasty race conditions
> that is guaranteed to break things.
>
> good luck,
>
> greg k-h
>
^ permalink raw reply [flat|nested] 48+ messages in thread
* RFC: Device Namespaces
@ 2013-08-22 17:43 Oren Laadan
2013-08-22 18:21 ` Serge Hallyn
0 siblings, 1 reply; 48+ messages in thread
From: Oren Laadan @ 2013-08-22 17:43 UTC (permalink / raw)
To: Linux Containers; +Cc: lxc-devel
Hi everyone!
We [1] have been working on bringing lightweight virtualization to
Linux-based mobile devices like Android (or other Linux-based devices with
diverse I/O) and want to share our solution: device namespaces.
Imagine you could run several instances of your favorite mobile OS or other
distributions in isolated containers, each under the impression of having
exclusive access to device drivers, and interact with and switch between them
in the blink of an eye: no flashing, no reboot.
Device namespaces are an extension to existing Linux kernel namespaces that
brings lightweight virtualization to Linux-based end-user devices,
primarily mobile devices.
Device namespaces introduce a private and virtual namespace for device
drivers to create the illusion for a process group that it interacts
exclusively with a set of drivers. Device namespaces also introduce the
concepts of an “active” namespace with which a user interacts, vs
“non-active” namespaces that run in the background, and the ability to
switch between them.[2]
We are planning to prepare individual patches to be submitted to the
relevant maintainers and mailing lists. In the meantime, we already want to
share a set of patches on top of the Android goldfish Kernel 3.4 as well as
a user-space demo, so you can see where we are heading and get an overview
of the approach and see how it works.
We are aware that the patches are not ready for submission in their current
state, and we'd highly appreciate any feedback or suggestions which may
come to your mind once you have a look [3]. We are particularly interested in
working out a proper userspace API with respect to existing and future
use-cases. To illustrate a simple use-case we also provide a simple
userspace demo for Android [4].
I will be presenting "The Case for Linux Device Namespace" [5] at LinuxCon
North America 2013 [6]. We will also be attending the Containers Track [7]
at LPC 2013 to present the current state of the patches and discuss the
best course to proceed.
We are looking forward to hearing from you!
Thanks,
Oren.
1: http://www.cellrox.com/
2: https://github.com/Cellrox/devns-patches/wiki/DeviceNamespace
3: https://github.com/Cellrox/devns-patches
4: https://github.com/Cellrox/devns-demo
5: http://sched.co/1asN1v7
6: http://events.linuxfoundation.org/events/linuxcon-north-america
7: http://www.linuxplumbersconf.org/2013/ocw/events/LPC2013/tracks/153
--
Oren Laadan
Cellrox Ltd.
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: RFC: Device Namespaces
@ 2013-08-22 18:21 ` Serge Hallyn
2013-08-26 10:11 ` Oren Laadan
0 siblings, 1 reply; 48+ messages in thread
From: Serge Hallyn @ 2013-08-22 18:21 UTC (permalink / raw)
To: Oren Laadan; +Cc: Linux Containers, lxc-devel
Quoting Oren Laadan (orenl@cellrox.com):
> Hi everyone!
>
> We [1] have been working on bringing lightweight virtualization to
> Linux-based mobile devices like Android (or other Linux-based devices with
> diverse I/O) and want to share our solution: device namespaces.
>
> Imagine you could run several instances of your favorite mobile OS or other
> distributions in isolated containers, each under the impression of having
> exclusive access to device drivers; Interact and switch between them within
> a blink, no flashing, no reboot.
>
> Device namespaces are an extension to existing Linux kernel namespaces that
> brings lightweight virtualization to Linux-based end-user devices,
> primarily mobile devices.
> Device namespaces introduce a private and virtual namespace for device
> drivers to create the illusion for a process group that it interacts
> exclusively with a set of drivers. Device namespaces also introduce the
> concepts of an “active” namespace with which a user interacts, vs
> “non-active” namespaces that run in the background, and the ability to
> switch between them.[2]
Note that unless I'm misunderstanding what you're saying here, this is
also what net_ns does. A netns can exist with no processes so long as
you've bound its /proc/$$/ns/net somewhere. You can then re-enter that
ns using ns_attach. I haven't looked closely enough yet to see whether
you should be (or are) using the same interface.
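For readers unfamiliar with the /proc/<pid>/ns/ files mentioned above: each
entry is a symbolic link whose target names the namespace type and an inode
number, for example "net:[4026531992]". The sketch below only identifies a
namespace from that link; actually re-entering one is done with the setns(2)
system call on a file descriptor opened from such a path (or from a bind
mount of it). The helper names parse_ns_link and current_ns are made up for
this example.

```python
import os
import re

_NS_LINK = re.compile(r"^(\w+):\[(\d+)\]$")


def parse_ns_link(target):
    """Split a link target such as 'net:[4026531992]' into (type, inode)."""
    m = _NS_LINK.match(target)
    if m is None:
        raise ValueError(f"not a namespace link: {target!r}")
    return m.group(1), int(m.group(2))


def current_ns(kind="net", pid="self"):
    """Identify the namespace of `pid` via /proc/<pid>/ns/<kind> (Linux only)."""
    return parse_ns_link(os.readlink(f"/proc/{pid}/ns/{kind}"))
```

Two processes are in the same namespace of a given type exactly when these
links resolve to the same inode number.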
> We are planning to prepare individual patches to be submitted to the
Looking forward to it, and seeing you at the containers track :)
> 2: https://github.com/Cellrox/devns-patches/wiki/DeviceNamespace
> 3: https://github.com/Cellrox/devns-patches
> 4: https://github.com/Cellrox/devns-demo
(Have looked over the wiki, will look over the patches as well)
-serge
_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/containers
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: RFC: Device Namespaces
2013-08-22 18:21 ` Serge Hallyn
@ 2013-08-26 10:11 ` Oren Laadan
2013-09-06 17:50 ` Eric W. Biederman
0 siblings, 1 reply; 48+ messages in thread
From: Oren Laadan @ 2013-08-26 10:11 UTC (permalink / raw)
Cc: Linux Containers, lxc-devel
Hi Serge,
On Thu, Aug 22, 2013 at 2:21 PM, Serge Hallyn <serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org>wrote:
> Quoting Oren Laadan (orenl-3AfRa/s5aFdBDgjK7y7TUQ@public.gmane.org):
> > Hi everyone!
> >
> > We [1] have been working on bringing lightweight virtualization to
> > Linux-based mobile devices like Android (or other Linux-based devices with
> > diverse I/O) and want to share our solution: device namespaces.
> >
> > Imagine you could run several instances of your favorite mobile OS or other
> > distributions in isolated containers, each under the impression of having
> > exclusive access to device drivers; Interact and switch between them within
> > a blink, no flashing, no reboot.
> >
> > Device namespaces are an extension to existing Linux kernel namespaces that
> > brings lightweight virtualization to Linux-based end-user devices,
> > primarily mobile devices.
> > Device namespaces introduce a private and virtual namespace for device
> > drivers to create the illusion for a process group that it interacts
> > exclusively with a set of drivers. Device namespaces also introduce the
> > concepts of an “active” namespace with which a user interacts, vs
> > “non-active” namespaces that run in the background, and the ability to
> > switch between them.[2]
>
> Note that unless I'm misunderstanding what you're saying here, this is
> also what net_ns does. A netns can exist with no processes so long as
> you've bound its /proc/$$/ns/net somewhere. You can then re-enter that
> ns using ns_attach. I haven't looked closely enough yet to see whether
> you should be (or are) using the same interface.
>
>
To illustrate the need for device namespaces, consider the use case of
running two containers of your favorite OS (say, Android), on a single
physical phone. As a user, you either work in one container, or in the
other, and you will want to be able to switch between them (just like with
apps on mobile devices: you interact with one application at a time, and
switch between them).
See here for a demo of how it works: http://vimeo.com/60113683
To accomplish this, device namespaces solve two shortcomings of existing
namespaces:
1. A namespace for device drivers: each (Android) container needs a
private view of all devices. This includes logical drivers, like binder (in
Android) but also the loop device; and physical devices, like the framebuffer
and the touch-screen.
In other words, device namespaces virtualize the _major/minor_ and the
_state_ of device drivers. With the exception of VFS, network, and PTY
(note: all three offer/are virtual devices), device drivers are otherwise
not isolated between containers.
2. A namespace for interactive scenarios: a namespace can be "active" - it
has access to the hardware, e.g. display and touch-screen. This will be the
container with which the user is interacting right now. Otherwise a
namespace is "non-active" - it still runs in the background, but can
neither alter the display nor receive input from the touch-screen.
Switching to another container means a context switch in the relevant
drivers, so that they restore the state and now "obey" the other namespace.
You can also think about the "active" namespace as foreground, and the
"non-active" as background, akin to foreground/background processes in a
terminal with job-control. Similar to how a terminal delivers input to the
foreground task only but not to the background tasks - this is enforced by
the new device namespace.
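The "active" versus "non-active" behaviour described above can be modelled in
a few lines of Python. This is a toy sketch, not the code from the patches:
the class VirtualFramebuffer and its methods are hypothetical names, and a
real driver would save and restore hardware state rather than a Python value.

```python
class VirtualFramebuffer:
    """Toy model: one saved screen per namespace, one shared hardware screen."""

    def __init__(self):
        self.saved = {}       # namespace name -> last contents it drew
        self.active = None    # namespace that currently owns the hardware
        self.hardware = None  # what is actually on the display

    def draw(self, ns, contents):
        # Every namespace may draw, but only the active one reaches the hardware.
        self.saved[ns] = contents
        if ns == self.active:
            self.hardware = contents

    def switch_to(self, ns):
        # "Context switch": make ns active and restore its saved state.
        self.active = ns
        self.hardware = self.saved.get(ns)
```

A background namespace keeps running and drawing into its own saved state;
its output only becomes visible after switch_to() makes it the active one.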
More details on this use-case are in the wiki:
https://github.com/Cellrox/devns-patches/wiki/Thinvisor
> We are planning to prepare individual patches to be submitted to the
>
> Looking forward to it, and seeing you at the containers track :)
>
Same here!
>
> > 2: https://github.com/Cellrox/devns-patches/wiki/DeviceNamespace
> > 3: https://github.com/Cellrox/devns-patches
> > 4: https://github.com/Cellrox/devns-demo
>
> (Have looked over the wiki, will look over the patches as well)
>
> -serge
>
Thanks,
Oren.
--
Oren Laadan
Cellrox Ltd.
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: RFC: Device Namespaces
@ 2013-09-06 17:50 ` Eric W. Biederman
2013-09-08 12:28 ` Amir Goldstein
0 siblings, 1 reply; 48+ messages in thread
From: Eric W. Biederman @ 2013-09-06 17:50 UTC (permalink / raw)
To: Oren Laadan; +Cc: Linux Containers, lxc-devel
Oren Laadan <orenl@cellrox.com> writes:
> Hi Serge,
>
>
> On Thu, Aug 22, 2013 at 2:21 PM, Serge Hallyn <serge.hallyn@ubuntu.com>wrote:
>
>> Quoting Oren Laadan (orenl@cellrox.com):
>> > Hi everyone!
>> >
>> > We [1] have been working on bringing lightweight virtualization to
>> > Linux-based mobile devices like Android (or other Linux-based devices with
>> > diverse I/O) and want to share our solution: device namespaces.
>> >
>> > Imagine you could run several instances of your favorite mobile OS or other
>> > distributions in isolated containers, each under the impression of having
>> > exclusive access to device drivers; Interact and switch between them within
>> > a blink, no flashing, no reboot.
>> >
>> > Device namespaces are an extension to existing Linux kernel namespaces that
>> > brings lightweight virtualization to Linux-based end-user devices,
>> > primarily mobile devices.
>> > Device namespaces introduce a private and virtual namespace for device
>> > drivers to create the illusion for a process group that it interacts
>> > exclusively with a set of drivers. Device namespaces also introduce the
>> > concepts of an “active” namespace with which a user interacts, vs
>> > “non-active” namespaces that run in the background, and the ability to
>> > switch between them.[2]
>>
>> Note that unless I'm misunderstanding what you're saying here, this is
>> also what net_ns does. A netns can exist with no processes so long as
>> you've bound its /proc/$$/ns/net somewhere. You can then re-enter that
>> ns using ns_attach. I haven't looked closely enough yet to see whether
>> you should be (or are) using the same interface.
>>
>>
> To illustrate the need for device namespaces, consider the use case of
> running two containers of your favorite OS (say, Android), on a single
> physical phone. As a user, you either work in one container, or in the
> other, and you will want to be able to switch between them (just like with
> apps on mobile devices: you interact with one application at a time, and
> switch between them).
>
> See here for a demo of how it works: http://vimeo.com/60113683
>
> To accomplish this, device namespaces solve two shortcomings of existing
> namespaces:
>
> 1. A namespace for device drivers: each (Android) container needs a
> private view of all devices. This includes logical drivers, like binder (in
> Android) but also loop device; and physical devices, like the framebuffer
> and the touch-screen.
>
> In other words, device namespaces virtualize the _major/minor_ and the
> _state_ of device drivers. With the exception of VFS, network, and PTY
> (note: all three offer/are virtual devices), device drivers are otherwise
> not isolated between containers.
>
> 2. A namespace for interactive scenarios: a namespace can be "active" - it
> has access to the hardware, e.g. display and touch-screen. This will be the
> container with which the user is interacting right now. Otherwise a
> namespace is "non-active" - it still runs in the background, but can
> neither alter the display nor receive input from the touch-screen.
> Switching to another container means a context switch in the relevant
> drivers, so that they restore the state and now "obey" the other namespace.
>
> You can also think about the "active" namespace as foreground, and the
> "non-active" as background, akin to foreground/background processes in a
> terminal with job-control. Similar to how a terminal delivers input to the
> foreground task only but not to the background tasks - this is enforced by
> the new device namespace.
>
> More details on this use-case are in the wiki:
> https://github.com/Cellrox/devns-patches/wiki/Thinvisor).
I think this is going to take some talking, and looking at code.
I think you are talking about having wrappers around your devices so you
can share. Which is not quite the same problem the rest of us have been
thinking of when talking about a device namespace.
My first impression is that this is better solved with more appropriate
abstractions in userspace or in the kernel.
But we can talk at LPC and see what we can hash out.
Eric
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: RFC: Device Namespaces
@ 2013-09-08 12:28 ` Amir Goldstein
2013-09-09 0:51 ` Eric W. Biederman
0 siblings, 1 reply; 48+ messages in thread
From: Amir Goldstein @ 2013-09-08 12:28 UTC (permalink / raw)
To: Eric W. Biederman; +Cc: Linux Containers, lxc-devel
On Fri, Sep 6, 2013 at 7:50 PM, Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>wrote:
> Oren Laadan <orenl-3AfRa/s5aFdBDgjK7y7TUQ@public.gmane.org> writes:
>
> > Hi Serge,
> >
> >
> > On Thu, Aug 22, 2013 at 2:21 PM, Serge Hallyn <serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org> wrote:
> >
> >> Quoting Oren Laadan (orenl-3AfRa/s5aFdBDgjK7y7TUQ@public.gmane.org):
> >> > Hi everyone!
> >> >
> >> > We [1] have been working on bringing lightweight virtualization to
> >> > Linux-based mobile devices like Android (or other Linux-based devices with
> >> > diverse I/O) and want to share our solution: device namespaces.
> >> >
> >> > Imagine you could run several instances of your favorite mobile OS or other
> >> > distributions in isolated containers, each under the impression of having
> >> > exclusive access to device drivers; Interact and switch between them within
> >> > a blink, no flashing, no reboot.
> >> >
> >> > Device namespaces are an extension to existing Linux kernel namespaces that
> >> > brings lightweight virtualization to Linux-based end-user devices,
> >> > primarily mobile devices.
> >> > Device namespaces introduce a private and virtual namespace for device
> >> > drivers to create the illusion for a process group that it interacts
> >> > exclusively with a set of drivers. Device namespaces also introduce the
> >> > concepts of an “active” namespace with which a user interacts, vs
> >> > “non-active” namespaces that run in the background, and the ability to
> >> > switch between them.[2]
> >>
> >> Note that unless I'm misunderstanding what you're saying here, this is
> >> also what net_ns does. A netns can exist with no processes so long as
> >> you've bound its /proc/$$/ns/net somewhere. You can then re-enter that
> >> ns using ns_attach. I haven't looked closely enough yet to see whether
> >> you should be (or are) using the same interface.
> >>
> >>
> > To illustrate the need for device namespaces, consider the use case of
> > running two containers of your favorite OS (say, Android), on a single
> > physical phone. As a user, you either work in one container, or in the
> > other, and you will want to be able to switch between them (just like with
> > apps on mobile devices: you interact with one application at a time, and
> > switch between them).
> >
> > See here for a demo of how it works: http://vimeo.com/60113683
> >
> > To accomplish this, device namespaces solve two shortcomings of existing
> > namespaces:
> >
> > 1. A namespace for device drivers: each (Android) container needs a
> > private view of all devices. This includes logical drivers, like binder (in
> > Android) but also loop device; and physical devices, like the framebuffer
> > and the touch-screen.
> >
> > In other words, device namespaces virtualize the _major/minor_ and the
> > _state_ of device drivers. With the exception of VFS, network, and PTY
> > (note: all three offer/are virtual devices), device drivers are otherwise
> > not isolated between containers.
> >
> > 2. A namespace for interactive scenarios: a namespace can be "active" - it
> > has access to the hardware, e.g. display and touch-screen. This will be the
> > container with which the user is interacting right now. Otherwise a
> > namespace is "non-active" - it still runs in the background, but can
> > neither alter the display nor receive input from the touch-screen.
> > Switching to another container means a context switch in the relevant
> > drivers, so that they restore the state and now "obey" the other namespace.
> >
> > You can also think about the "active" namespace as foreground, and the
> > "non-active" as background, akin to foreground/background processes in a
> > terminal with job-control. Similar to how a terminal delivers input to the
> > foreground task only but not to the background tasks - this is enforced by
> > the new device namespace.
> >
> > More details on this use-case are in the wiki:
> > https://github.com/Cellrox/devns-patches/wiki/Thinvisor).
>
> I think this is going to take some talking, and looking at code.
>
>
Hi Eric,
If we can get people to take a quick look at the code before LPC
that could make the LPC discussions more effective.
Even looking at one of the subsystem patches can give a basic
idea of the work we have done:
https://github.com/Cellrox/linux/commits/devns-goldfish-3.4
> I think you are talking about having wrappers around your devices so you
> can share. Which is not the quite same problem the rest of us have been
> thinking of when talking about a device namespace.
>
We are interested in all problems related to a virtualized view of devices
inside a container, so let our work so far be a starting point to discuss
all of them.
>
> My first impression is that this is better solved with more appropriate
> abstractions in userspace or in the kernel.
>
> But we can talk at LPC and see what we can hash out.
>
Looking forward to that :-)
Amir.
>
> Eric
>
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: RFC: Device Namespaces
@ 2013-09-09 0:51 ` Eric W. Biederman
2013-09-10 7:09 ` Amir Goldstein
0 siblings, 1 reply; 48+ messages in thread
From: Eric W. Biederman @ 2013-09-09 0:51 UTC (permalink / raw)
To: Amir Goldstein; +Cc: Linux Containers, lxc-devel
Amir Goldstein <amir@cellrox.com> writes:
> On Fri, Sep 6, 2013 at 7:50 PM, Eric W. Biederman
> <ebiederm@xmission.com> wrote:
>
> Hi Eric,
>
> If we can get people to take a quick look at the code before LPC
> that could make the LPC discussions more effective.
> Even looking at one of the subsystem patches can give a basic
> idea of the work we have done:
> https://github.com/Cellrox/linux/commits/devns-goldfish-3.4
>
> I think you are talking about having wrappers around your devices
> so you
> can share. Which is not the quite same problem the rest of us
> have been
> thinking of when talking about a device namespace.
>
> We are interested in all problems related to virtualizated view of
> devices
> inside a container, so let our work so far be a starting point to
> discuss all of them.
>
> My first impression is that this is better solved with more
> appropriate
> abstractions in userspace or in the kernel.
As I read your code, you are solving the problem of one opener of a
device among a group of openers being able to access a device at a time.
Which leads to the question why can't the multiplexing happen in
userspace?
I think with your design it would not be possible to play a song in one
device namespace while doing work in the other. As a security model
that isn't wrong but as someone trying to get work done that could be a
real pain.
The more common concern is to have devices we can use all of the time.
There may be a need for a device namespace, and multiplexing access to
hardware devices makes that clearer. So far nothing has risen to the
level of "we actually need a device namespace to do X", especially in an
era of hotplug and dynamic device numbers.
It is arguable that you could do your kind of device multiplexing with a
fuse device in userspace that implements your desired policy.
And policy is where the cell situation seems to fall down, because it
hard-codes one specific policy into the kernel, and a policy most situations
don't find useful.
Eric
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: RFC: Device Namespaces
@ 2013-09-10 7:09 ` Amir Goldstein
2013-09-25 11:05 ` Janne Karhunen
0 siblings, 1 reply; 48+ messages in thread
From: Amir Goldstein @ 2013-09-10 7:09 UTC (permalink / raw)
To: Eric W. Biederman; +Cc: Linux Containers, lxc-devel
On Mon, Sep 9, 2013 at 2:51 AM, Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>wrote:
> Amir Goldstein <amir-3AfRa/s5aFdBDgjK7y7TUQ@public.gmane.org> writes:
>
> > On Fri, Sep 6, 2013 at 7:50 PM, Eric W. Biederman
> > <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
> >
> > Hi Eric,
> >
> > If we can get people to take a quick look at the code before LPC
> > that could make the LPC discussions more effective.
> > Even looking at one of the subsystem patches can give a basic
> > idea of the work we have done:
> > https://github.com/Cellrox/linux/commits/devns-goldfish-3.4
> >
> > I think you are talking about having wrappers around your devices
> > so you
> > can share. Which is not the quite same problem the rest of us
> > have been
> > thinking of when talking about a device namespace.
> >
> > We are interested in all problems related to virtualizated view of
> > devices
> > inside a container, so let our work so far be a starting point to
> > discuss all of them.
> >
> > My first impression is that this is better solved with more
> > appropriate
> > abstractions in userspace or in the kernel.
>
> As I read your code, you are solving the problem of one opener of a
> device among a group of openers being able to access a device at a time.
> Which leads to the question why can't the multiplexing happen in
> userspace?
>
> I think with your design it would not be possible to play a song in one
> device namespace while doing work in the other. As a security model
> that isn't wrong but as someone trying to get work done that could be a
> real pain.
>
As a matter of fact, in our multi-persona phone, you *can* hear music played
from a background persona, but you *cannot* see images drawn from a background
persona.
> The more common concern is to have devices we can use all of the time.
>
> There may be a need for a device namespace and multiplexing access to
> hardware devices makes that clearer. So far nothing has risen to the
> level of we actually need a device namespace to do X. Especially in an
> erra of hotplug and dynamic device numbers.
>
> It is arguable that you could do your kind of device multiplexing with a
> fuse device in userspace that implements your desired policy.
>
I agree about it being arguable :-)
We shall present our arguments at LPC.
>
> And policy is where cell situtation seems to fall down because it hard
> codes one specific policy into the kernel, and a policy most situations
> don't find useful.
>
>
It's true that for our product, we have made hardcoded policy decisions in
our kernel patches, but that was just as a proof of concept for the technique.
We do envision being able to dynamically assign a device to a specific devns
(e.g. block, loop), keep a device shared between multiple devns (e.g. audio),
and, in addition to that, multiplex a device between multiple devns
(e.g. framebuffer).
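The three per-device behaviours envisioned above (assigned, shared,
multiplexed) can be written down as a small access check. This is a sketch
for discussion only: Policy and may_access are hypothetical names, and
namespaces are represented by plain strings rather than kernel objects.

```python
from enum import Enum


class Policy(Enum):
    ASSIGNED = 1     # owned by exactly one devns (e.g. a block or loop device)
    SHARED = 2       # usable from every devns at once (e.g. audio)
    MULTIPLEXED = 3  # usable only from the currently active devns (e.g. framebuffer)


def may_access(policy, caller_ns, owner_ns=None, active_ns=None):
    """Return True if a process in caller_ns may use a device with this policy."""
    if policy is Policy.SHARED:
        return True
    if policy is Policy.ASSIGNED:
        return caller_ns == owner_ns
    return caller_ns == active_ns  # MULTIPLEXED
```

With such a table, audio from a background persona is allowed while its
framebuffer output is not, which matches the behaviour described earlier in
this message.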
> Eric
>
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: RFC: Device Namespaces
@ 2013-09-25 11:05 ` Janne Karhunen
[not found] ` <CAE=NcrbyFFoMn2nfBA_=ZtwD=eGLvqK=L-U9MuGrtJFLZfZppw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
0 siblings, 1 reply; 48+ messages in thread
From: Janne Karhunen @ 2013-09-25 11:05 UTC (permalink / raw)
To: Amir Goldstein; +Cc: Linux Containers, Eric W. Biederman, lxc-devel
On Tue, Sep 10, 2013 at 10:09 AM, Amir Goldstein <amir-3AfRa/s5aFdBDgjK7y7TUQ@public.gmane.org> wrote:
> On Mon, Sep 9, 2013 at 2:51 AM, Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>wrote:
>
>> Amir Goldstein <amir-3AfRa/s5aFdBDgjK7y7TUQ@public.gmane.org> writes:
>>
>> > On Fri, Sep 6, 2013 at 7:50 PM, Eric W. Biederman
>> > <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
>> >
>> > Hi Eric,
>> >
>> > If we can get people to take a quick look at the code before LPC
>> > that could make the LPC discussions more effective.
Hi,
I think we are curious enough to experiment with Eric's idea of
implementing a basic 'device namespace' in userspace (never miss an
opportunity to throw away kernel code). Can anyone point out any
obvious reason why this would not work if we consider the bulk of the
work to be plain access filtering?
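To make "plain access filtering" concrete, here is a minimal default-deny
filter keyed on device type and major:minor numbers. The rule syntax
("c 1:3 rwm") is modelled loosely on the device cgroup's allow list; the
names Rule, parse_rule and is_allowed are assumptions made for this sketch
and are not taken from any patch in this thread.

```python
from typing import NamedTuple, Optional


class Rule(NamedTuple):
    dev_type: str          # 'c' (char), 'b' (block) or 'a' (all)
    major: Optional[int]   # None stands for the '*' wildcard
    minor: Optional[int]
    access: str            # subset of 'rwm': read, write, mknod


def parse_rule(text):
    """Parse a rule written like 'c 1:3 rwm' or 'b 7:* r'."""
    dev_type, numbers, access = text.split()
    major, minor = (None if part == "*" else int(part)
                    for part in numbers.split(":"))
    return Rule(dev_type, major, minor, access)


def is_allowed(rules, dev_type, major, minor, access):
    """Default-deny: some rule must cover the device and every access bit."""
    for r in rules:
        if r.dev_type not in ("a", dev_type):
            continue
        if r.major is not None and r.major != major:
            continue
        if r.minor is not None and r.minor != minor:
            continue
        if set(access) <= set(r.access):
            return True
    return False
```

For example, a rule list containing "c 1:3 rwm" would let a container use
/dev/null (character device 1:3) while refusing every device not listed.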
That being said, is there a valid reason why binder is part of device
namespace here instead of IPC?
--
Janne
^ permalink raw reply [flat|nested] 48+ messages in thread
end of thread, other threads:[~2021-06-15 11:33 UTC | newest]
Thread overview: 48+ messages
2021-06-08 9:38 device namespaces Enrico Weigelt, metux IT consult
2021-06-08 12:30 ` Christian Brauner
2021-06-08 12:41 ` Greg Kroah-Hartman
2021-06-08 14:10 ` Hannes Reinecke
2021-06-08 14:29 ` Christian Brauner
2021-06-08 15:54 ` Hannes Reinecke
2021-06-08 17:16 ` Eric W. Biederman
2021-06-09 6:38 ` Christian Brauner
2021-06-09 7:02 ` Hannes Reinecke
2021-06-09 7:21 ` Christian Brauner
2021-06-09 7:54 ` Hannes Reinecke
2021-06-09 8:09 ` Christian Brauner
2021-06-11 18:14 ` Eric W. Biederman
2021-06-14 7:49 ` Enrico Weigelt, metux IT consult
2021-06-14 8:22 ` Greg KH
2021-06-14 17:36 ` Eric W. Biederman
2021-06-15 11:24 ` Enrico Weigelt, metux IT consult
2021-06-15 11:33 ` Greg KH
-- strict thread matches above, loose matches on Subject: below --
2013-09-29 19:28 Device Namespaces Amir Goldstein
[not found] ` <CAA2m6veny-7_ONMA973Wu36U4kz4gAuw0dpodkb8+GZDv6VNBQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-09-29 20:06 ` Greg Kroah-Hartman
[not found] ` <20130929200620.GA31304-U8xfFu+wG4EAvxtiuMwx3w@public.gmane.org>
2013-09-30 15:36 ` Michael H. Warfield
2013-10-03 0:44 ` Eric W. Biederman
[not found] ` <87a9iri3ot.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
2013-10-03 0:59 ` Eric W. Biederman
2013-10-03 8:58 ` Amir Goldstein
[not found] ` <CAA2m6vc3OFmS9VwiTavRzPqhn+qoe6vDCO2sitXpEQ8a1JVyfg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-10-03 9:17 ` Eric W. Biederman
2013-08-22 17:43 RFC: " Oren Laadan
2013-08-22 18:21 ` Serge Hallyn
2013-08-26 10:11 ` Oren Laadan
2013-09-06 17:50 ` Eric W. Biederman
2013-09-08 12:28 ` Amir Goldstein
2013-09-09 0:51 ` Eric W. Biederman
2013-09-10 7:09 ` Amir Goldstein
2013-09-25 11:05 ` Janne Karhunen
[not found] ` <CAE=NcrbyFFoMn2nfBA_=ZtwD=eGLvqK=L-U9MuGrtJFLZfZppw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-09-25 21:34 ` Eric W. Biederman
[not found] ` <87bo3gshz5.fsf_-_-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
2013-09-26 5:33 ` Greg Kroah-Hartman
[not found] ` <20130926053320.GB3725-U8xfFu+wG4EAvxtiuMwx3w@public.gmane.org>
2013-09-26 8:25 ` Janne Karhunen
[not found] ` <CAE=NcrbPXGWU8FUgwchXyL5HjXf+4AKbgUWGe1ZO=Xcq=iV-Lg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-09-26 13:56 ` Greg Kroah-Hartman
[not found] ` <20130926135604.GA16624-U8xfFu+wG4EAvxtiuMwx3w@public.gmane.org>
2013-09-26 17:01 ` Janne Karhunen
[not found] ` <CAE=NcrY3xC1AF_GV2b1KsF7AwYZTuGBuKLS5yBUWoWcmKU4YBg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-09-26 17:07 ` Greg Kroah-Hartman
[not found] ` <20130926170757.GA9345-U8xfFu+wG4EAvxtiuMwx3w@public.gmane.org>
2013-09-26 17:56 ` Janne Karhunen
2013-09-30 15:37 ` James Bottomley
[not found] ` <1380555439.2161.5.camel-sFMDBYUN5F8GjUHQrlYNx2Wm91YjaHnnhRte9Li2A+AAvxtiuMwx3w@public.gmane.org>
2013-09-30 16:11 ` Greg Kroah-Hartman
[not found] ` <20130930161117.GA26459-U8xfFu+wG4EAvxtiuMwx3w@public.gmane.org>
2013-09-30 16:33 ` James Bottomley
2013-10-01 6:19 ` Janne Karhunen
[not found] ` <CAE=NcrYV2RiMV7PcwEjFGFRBrz9XdZGs86Wau2a+6xpYN2aEHA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-10-01 17:27 ` Andy Lutomirski
[not found] ` <CALCETrWWoHzuJcnfEUY+cFpOgT5gnG8U1cVbCW0_8V7Z_v6DJw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-10-01 17:53 ` Serge E. Hallyn
[not found] ` <20131001175345.GA4145-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
2013-10-01 19:51 ` Eric W. Biederman
[not found] ` <87had0wz07.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
2013-10-01 20:46 ` Serge Hallyn
2013-10-02 22:55 ` Eric W. Biederman
2013-10-01 20:57 ` Greg Kroah-Hartman
[not found] ` <20131001205718.GA17036-U8xfFu+wG4EAvxtiuMwx3w@public.gmane.org>
2013-10-02 22:45 ` Eric W. Biederman
2013-10-01 22:19 ` Michael H. Warfield
2013-10-01 18:36 ` Janne Karhunen
2013-10-01 17:33 ` Greg Kroah-Hartman
[not found] ` <20131001173342.GA19267-U8xfFu+wG4EAvxtiuMwx3w@public.gmane.org>
2013-10-01 18:23 ` Janne Karhunen
2013-10-28 23:31 ` Andrey Wagin