linux-scsi.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [LSF/MM/BPF TOPIC] block namespaces
@ 2021-05-27  8:01 Hannes Reinecke
  2021-06-09 18:36 ` James Bottomley
  0 siblings, 1 reply; 5+ messages in thread
From: Hannes Reinecke @ 2021-05-27  8:01 UTC (permalink / raw)
  To: lsf-pc, linux-block, linux-scsi, Linux NVMe Mailinglist

Hi all,

I guess it's time to tick off yet another item on my long-term to-do list:

Block namespaces
----------------

Idea is similar to what network already does: allowing each user
namespace to have a different 'view' on the existing block devices.
EG if the admin creates a ramdisk in one namespace this device should
not be visible to other namespaces.
But for me the most important use-case would be qemu; currently the
devices need to be set up in the host, even though the host has no
business touching it as they really belong to the qemu instance. This is
causing quite some irritation eg when this device has LVM or MD metadata
and udev is trying to activate it on the host.

Overall plan is to restrict views of '/dev', '/sys/dev/block' and
'/sys/block' to only present the devices 'visible' for this namespace.
Initially the drivers would keep their global enumeration, but plan is
to make the drivers namespace-aware, too, such that each namespace could
have its own driver-specific device enumeration.

Goal of this topic is to get a consensus on whether block namespaces are
a feature which would find interest, and also to discuss some design
details here:
- Only in certain cases can a namespace be assigned (eg by calling
'modprobe', starting iscsiadm, or calling nvme-cli); how do we handle
devices for which no namespace can be identified?
- Shall we allow for different device enumeration per namespace?
- Into which level should we go with hiding sysfs structures?
  Is blanking out the higher-level interfaces in /dev and /sys/block
  enough?

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		        Kernel Storage Architect
hare@suse.de			               +49 911 74053 688
SUSE Software Solutions Germany GmbH, 90409 Nürnberg
GF: F. Imendörffer, HRB 36809 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [LSF/MM/BPF TOPIC] block namespaces
  2021-05-27  8:01 [LSF/MM/BPF TOPIC] block namespaces Hannes Reinecke
@ 2021-06-09 18:36 ` James Bottomley
  2021-06-10  5:49   ` Hannes Reinecke
  0 siblings, 1 reply; 5+ messages in thread
From: James Bottomley @ 2021-06-09 18:36 UTC (permalink / raw)
  To: Hannes Reinecke, lsf-pc, linux-block, linux-scsi, Linux NVMe Mailinglist

On Thu, 2021-05-27 at 10:01 +0200, Hannes Reinecke wrote:
> Hi all,
> 
> I guess it's time to tick off yet another item on my long-term to-do
> list:
> 
> Block namespaces
> ----------------
> 
> Idea is similar to what network already does: allowing each user
> namespace to have a different 'view' on the existing block devices.
> EG if the admin creates a ramdisk in one namespace this device should
> not be visible to other namespaces.
> But for me the most important use-case would be qemu; currently the
> devices need to be set up in the host, even though the host has no
> business touching it as they really belong to the qemu instance. This
> is causing quite some irritation eg when this device has LVM or MD
> metadata and udev is trying to activate it on the host.

I suppose the first question is "why block only?"  There are several
existing device namespace proposals which would be more generic.

> Overall plan is to restrict views of '/dev', '/sys/dev/block' and
> '/sys/block' to only present the devices 'visible' for this
> namespace.

We actually already have a devices cgroup that does some of this:

https://www.kernel.org/doc/Documentation/cgroup-v1/devices.txt

However, visibility isn't the only problem, for direct passthrough
there's also uevent handling and people have even asked about module
loading.

>  Initially the drivers would keep their global enumeration, but plan
> is to make the drivers namespace-aware, too, such that each namespace
> could have its own driver-specific device enumeration.

I really wouldn't do this.  Namespace/Cgroup separation should be kept
as high as possible.  If it leaks into the drivers it will become
unmaintainable.  Why do you think you need the drivers to be aware?  If
it's just enumeration, that should all be doable with the visibility
driver unless you want to do things like compact numbering?

> Goal of this topic is to get a consensus on whether block namespaces
> are a feature which would find interest, and also to discuss some
> design details here:
> - Only in certain cases can a namespace be assigned (eg by calling
> 'modprobe', starting iscsiadm, or calling nvme-cli); how do we handle
> devices for which no namespace can be identified?
> - Shall we allow for different device enumeration per namespace?
> - Into which level should we go with hiding sysfs structures?
>   Is blanking out the higher-level interfaces in /dev and /sys/block
>   enough?

First question is does the device cgroup do enough for you and if not
what's missing?

James



^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [LSF/MM/BPF TOPIC] block namespaces
  2021-06-09 18:36 ` James Bottomley
@ 2021-06-10  5:49   ` Hannes Reinecke
  2021-06-10 14:29     ` James Bottomley
  0 siblings, 1 reply; 5+ messages in thread
From: Hannes Reinecke @ 2021-06-10  5:49 UTC (permalink / raw)
  To: James Bottomley, lsf-pc, linux-block, linux-scsi, Linux NVMe Mailinglist

On 6/9/21 8:36 PM, James Bottomley wrote:
> On Thu, 2021-05-27 at 10:01 +0200, Hannes Reinecke wrote:
>> Hi all,
>>
>> I guess it's time to tick off yet another item on my long-term to-do
>> list:
>>
>> Block namespaces
>> ----------------
>>
>> Idea is similar to what network already does: allowing each user
>> namespace to have a different 'view' on the existing block devices.
>> EG if the admin creates a ramdisk in one namespace this device should
>> not be visible to other namespaces.
>> But for me the most important use-case would be qemu; currently the
>> devices need to be set up in the host, even though the host has no
>> business touching it as they really belong to the qemu instance. This
>> is causing quite some irritation eg when this device has LVM or MD
>> metadata and udev is trying to activate it on the host.
> 
> I suppose the first question is "why block only?"  There are several
> existing device namespace proposals which would be more generic.
> 

Well; I'm more of a storage person, and do know the needs and 
shortcomings in that area. Less well so in other areas...

>> Overall plan is to restrict views of '/dev', '/sys/dev/block' and
>> '/sys/block' to only present the devices 'visible' for this
>> namespace.
> 
> We actually already have a devices cgroup that does some of this:
> 
> https://www.kernel.org/doc/Documentation/cgroup-v1/devices.txt
> 
I know. But this essentially is a filter on '/dev' only, and needs to be 
configured. Which makes it very unwieldy to use.
And the contents of sysfs are not modified, so there's a mismatch 
between contents in /dev and /sys.
Which might cause issues with monitoring tools.

> However, visibility isn't the only problem, for direct passthrough
> there's also uevent handling and people have even asked about module
> loading.
> 
I am aware, and that's another reason why device cgroup doesn't cut it.

>>   Initially the drivers would keep their global enumeration, but plan
>> is to make the drivers namespace-aware, too, such that each namespace
>> could have its own driver-specific device enumeration.
> 
> I really wouldn't do this.  Namespace/Cgroup separation should be kept
> as high as possible.  If it leaks into the drivers it will become
> unmaintainable.  Why do you think you need the drivers to be aware?  If
> it's just enumeration, that should all be doable with the visibility
> driver unless you want to do things like compact numbering?
> 
Which is precisely why I mentioned device modifications.
On a generic level we can influence the visibility of devices in 
relation to namespaces, we cannot influence the devices themselves.
This will lead to namespaces seeing disjunct device numbers (ie 8:0 and 
8:8 on ns 1, 8:4 on ns 2). Not that I think that will be an issue, but 
certainly a change in behaviour.

>> Goal of this topic is to get a consensus on whether block namespaces
>> are a feature which would find interest, and also to discuss some
>> design details here:
>> - Only in certain cases can a namespace be assigned (eg by calling
>> 'modprobe', starting iscsiadm, or calling nvme-cli); how do we handle
>> devices for which no namespace can be identified?
>> - Shall we allow for different device enumeration per namespace?
>> - Into which level should we go with hiding sysfs structures?
>>    Is blanking out the higher-level interfaces in /dev and /sys/block
>>    enough?
> 
> First question is does the device cgroup do enough for you and if not
> what's missing?
> 
See above. sysfs modifications and uevent filtering are missing.
This infrastructure for that is already in place thanks to network 
namespaces, we 'just' need to make use of it.
Additional drawback is the manual configuration of device-cgroup.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [LSF/MM/BPF TOPIC] block namespaces
  2021-06-10  5:49   ` Hannes Reinecke
@ 2021-06-10 14:29     ` James Bottomley
  2021-06-10 15:05       ` Hannes Reinecke
  0 siblings, 1 reply; 5+ messages in thread
From: James Bottomley @ 2021-06-10 14:29 UTC (permalink / raw)
  To: Hannes Reinecke, lsf-pc, linux-block, linux-scsi, Linux NVMe Mailinglist

On Thu, 2021-06-10 at 07:49 +0200, Hannes Reinecke wrote:
> On 6/9/21 8:36 PM, James Bottomley wrote:
> > On Thu, 2021-05-27 at 10:01 +0200, Hannes Reinecke wrote:
> > > Hi all,
> > > 
> > > I guess it's time to tick off yet another item on my long-term
> > > to-do list:
> > > 
> > > Block namespaces
> > > ----------------
> > > 
> > > Idea is similar to what network already does: allowing each user
> > > namespace to have a different 'view' on the existing block
> > > devices.  EG if the admin creates a ramdisk in one namespace this
> > > device should not be visible to other namespaces.  But for me the
> > > most important use-case would be qemu; currently the devices need
> > > to be set up in the host, even though the host has no business
> > > touching it as they really belong to the qemu instance.  This
> > > is causing quite some irritation eg when this device has LVM or
> > > MD metadata and udev is trying to activate it on the host.
> > 
> > I suppose the first question is "why block only?"  There are
> > several existing device namespace proposals which would be more
> > generic.
> > 
> 
> Well; I'm more of a storage person, and do know the needs and 
> shortcomings in that area. Less well so in other areas...

OK, but this should work for all devices, just like the device cgroup
if it's going to be an adjunct to it.

> > > Overall plan is to restrict views of '/dev', '/sys/dev/block' and
> > > '/sys/block' to only present the devices 'visible' for this
> > > namespace.
> > 
> > We actually already have a devices cgroup that does some of this:
> > 
> > https://www.kernel.org/doc/Documentation/cgroup-v1/devices.txt
> > 
> I know. But this essentially is a filter on '/dev' only, and needs to
> be configured. Which makes it very unwieldy to use.
> And the contents of sysfs are not modified, so there's a mismatch 
> between contents in /dev and /sys.
> Which might cause issues with monitoring tools.

Firstly, since it does part of what you want, we at least need to
understand why you think it can't be enhanced to do everything.

The /sys problem has been discussed many times.  GregKH really doesn't
like the idea of filtering /sys (and most container people agree), so
the options available seem to be don't mount /sys in a container or
emulate it via fuse.  If you pick either does the device cgroup now
work for you?

> > However, visibility isn't the only problem, for direct passthrough
> > there's also uevent handling and people have even asked about
> > module loading.
> > 
> I am aware, and that's another reason why device cgroup doesn't cut
> it.

Christian Brauner is looking at this ... apparently Ubuntu has some
thumb drive inside lxc container use case that needs it.

> > >   Initially the drivers would keep their global enumeration, but
> > > plan is to make the drivers namespace-aware, too, such that each
> > > namespace could have its own driver-specific device enumeration.
> > 
> > I really wouldn't do this.  Namespace/Cgroup separation should be
> > kept as high as possible.  If it leaks into the drivers it will
> > become unmaintainable.  Why do you think you need the drivers to be
> > aware?  If it's just enumeration, that should all be doable with
> > the visibility driver unless you want to do things like compact
> > numbering?
> > 
> Which is precisely why I mentioned device modifications.
> On a generic level we can influence the visibility of devices in 
> relation to namespaces, we cannot influence the devices themselves.
> This will lead to namespaces seeing disjunct device numbers (ie 8:0
> and 8:8 on ns 1, 8:4 on ns 2). Not that I think that will be an
> issue, but  certainly a change in behaviour.

Well, not necessarily, the pid namespace is an example of a remapping
namespace.  The same thing could be done for device numbering, but
there really needs to be a compelling case.  Given our use of hotplug,
why would any tool assume compact numbering?  And if you don't need
compact numbering there's no need to bother with remapping.

> > > Goal of this topic is to get a consensus on whether block
> > > namespaces are a feature which would find interest, and also to
> > > discuss some design details here:
> > > - Only in certain cases can a namespace be assigned (eg by
> > > calling
> > > 'modprobe', starting iscsiadm, or calling nvme-cli); how do we
> > > handle
> > > devices for which no namespace can be identified?
> > > - Shall we allow for different device enumeration per namespace?
> > > - Into which level should we go with hiding sysfs structures?
> > >    Is blanking out the higher-level interfaces in /dev and
> > > /sys/block    enough?
> > 
> > First question is does the device cgroup do enough for you and if
> > not what's missing?
> > 
> See above. sysfs modifications and uevent filtering are missing.
> This infrastructure for that is already in place thanks to network 
> namespaces, we 'just' need to make use of it.
> Additional drawback is the manual configuration of device-cgroup.

OK, so still, why not fix or enhance the device cgroup?  The current
proposal seems to want to duplicate it as a namespace.

James



^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [LSF/MM/BPF TOPIC] block namespaces
  2021-06-10 14:29     ` James Bottomley
@ 2021-06-10 15:05       ` Hannes Reinecke
  0 siblings, 0 replies; 5+ messages in thread
From: Hannes Reinecke @ 2021-06-10 15:05 UTC (permalink / raw)
  To: James Bottomley, lsf-pc, linux-block, linux-scsi, Linux NVMe Mailinglist

On 6/10/21 4:29 PM, James Bottomley wrote:
> On Thu, 2021-06-10 at 07:49 +0200, Hannes Reinecke wrote:
>> On 6/9/21 8:36 PM, James Bottomley wrote:
>>> On Thu, 2021-05-27 at 10:01 +0200, Hannes Reinecke wrote:
>>>> Hi all,
>>>>
>>>> I guess it's time to tick off yet another item on my long-term
>>>> to-do list:
>>>>
>>>> Block namespaces
>>>> ----------------
>>>>
>>>> Idea is similar to what network already does: allowing each user
>>>> namespace to have a different 'view' on the existing block
>>>> devices.  EG if the admin creates a ramdisk in one namespace this
>>>> device should not be visible to other namespaces.  But for me the
>>>> most important use-case would be qemu; currently the devices need
>>>> to be set up in the host, even though the host has no business
>>>> touching it as they really belong to the qemu instance.  This
>>>> is causing quite some irritation eg when this device has LVM or
>>>> MD metadata and udev is trying to activate it on the host.
>>>
>>> I suppose the first question is "why block only?"  There are
>>> several existing device namespace proposals which would be more
>>> generic.
>>>
>>
>> Well; I'm more of a storage person, and do know the needs and 
>> shortcomings in that area. Less well so in other areas...
> 
> OK, but this should work for all devices, just like the device cgroup
> if it's going to be an adjunct to it.
> 
>>>> Overall plan is to restrict views of '/dev', '/sys/dev/block' and
>>>> '/sys/block' to only present the devices 'visible' for this
>>>> namespace.
>>>
>>> We actually already have a devices cgroup that does some of this:
>>>
>>> https://www.kernel.org/doc/Documentation/cgroup-v1/devices.txt
>>>
>> I know. But this essentially is a filter on '/dev' only, and needs to
>> be configured. Which makes it very unwieldy to use.
>> And the contents of sysfs are not modified, so there's a mismatch 
>> between contents in /dev and /sys.
>> Which might cause issues with monitoring tools.
> 
> Firstly, since it does part of what you want, we at least need to
> understand why you think it can't be enhanced to do everything.
> 
> The /sys problem has been discussed many times.  GregKH really doesn't
> like the idea of filtering /sys (and most container people agree), so
> the options available seem to be don't mount /sys in a container or
> emulate it via fuse.  If you pick either does the device cgroup now
> work for you?
> 

... begging the question why he relented for network namespaces.
They _do_ modify sysfs, and that via generic hooks (ie making struct
class namespace-aware).
And as this is a public interface I don't see why I can't be using it...

>>> However, visibility isn't the only problem, for direct passthrough
>>> there's also uevent handling and people have even asked about
>>> module loading.
>>>
>> I am aware, and that's another reason why device cgroup doesn't cut
>> it.
> 
> Christian Brauner is looking at this ... apparently Ubuntu has some
> thumb drive inside lxc container use case that needs it.
> 

Guess with whom I'm discussing on how to implement this.

>>>>   Initially the drivers would keep their global enumeration, but
>>>> plan is to make the drivers namespace-aware, too, such that each
>>>> namespace could have its own driver-specific device enumeration.
>>>
>>> I really wouldn't do this.  Namespace/Cgroup separation should be
>>> kept as high as possible.  If it leaks into the drivers it will
>>> become unmaintainable.  Why do you think you need the drivers to be
>>> aware?  If it's just enumeration, that should all be doable with
>>> the visibility driver unless you want to do things like compact
>>> numbering?
>>>
>> Which is precisely why I mentioned device modifications.
>> On a generic level we can influence the visibility of devices in 
>> relation to namespaces, we cannot influence the devices themselves.
>> This will lead to namespaces seeing disjunct device numbers (ie 8:0
>> and 8:8 on ns 1, 8:4 on ns 2). Not that I think that will be an
>> issue, but  certainly a change in behaviour.
> 
> Well, not necessarily, the pid namespace is an example of a remapping
> namespace.  The same thing could be done for device numbering, but
> there really needs to be a compelling case.  Given our use of hotplug,
> why would any tool assume compact numbering?  And if you don't need
> compact numbering there's no need to bother with remapping.
> 

Which is my assumption, too. I have just mentioned it here for
completeness sake. It's not that I'll be implementing it for now.

>>>> Goal of this topic is to get a consensus on whether block
>>>> namespaces are a feature which would find interest, and also to
>>>> discuss some design details here:
>>>> - Only in certain cases can a namespace be assigned (eg by
>>>> calling
>>>> 'modprobe', starting iscsiadm, or calling nvme-cli); how do we
>>>> handle
>>>> devices for which no namespace can be identified?
>>>> - Shall we allow for different device enumeration per namespace?
>>>> - Into which level should we go with hiding sysfs structures?
>>>>    Is blanking out the higher-level interfaces in /dev and
>>>> /sys/block    enough?
>>>
>>> First question is does the device cgroup do enough for you and if
>>> not what's missing?
>>>
>> See above. sysfs modifications and uevent filtering are missing.
>> This infrastructure for that is already in place thanks to network 
>> namespaces, we 'just' need to make use of it.
>> Additional drawback is the manual configuration of device-cgroup.
> 
> OK, so still, why not fix or enhance the device cgroup?  The current
> proposal seems to want to duplicate it as a namespace.
> 
Primarily because device-group is really limited in its functionality,
and the interface into the device core is, quite frankly, horrible.
(Implementing it under security/device_cgroup.c, and having direct
function calls from fs/block_dev.c and fs/namei.c?
hch will yet at me if I attempted that.)

Re-implementing that as a namespace is the cleaner solution.
Actually, it should be possible to re-implementing device-cgroup on to
of the block namespaces; for that we'll have to expand it to cover all
devices, but you're arguing for that anyway.
So, let's see.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		        Kernel Storage Architect
hare@suse.de			               +49 911 74053 688
SUSE Software Solutions Germany GmbH, 90409 Nürnberg
GF: F. Imendörffer, HRB 36809 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2021-06-10 15:05 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-05-27  8:01 [LSF/MM/BPF TOPIC] block namespaces Hannes Reinecke
2021-06-09 18:36 ` James Bottomley
2021-06-10  5:49   ` Hannes Reinecke
2021-06-10 14:29     ` James Bottomley
2021-06-10 15:05       ` Hannes Reinecke

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).