From: Hannes Reinecke <hare@suse.de>
To: James Bottomley <James.Bottomley@HansenPartnership.com>,
	"lsf-pc@lists.linux-foundation.org" 
	<lsf-pc@lists.linux-foundation.org>,
	"linux-block@vger.kernel.org" <linux-block@vger.kernel.org>,
	"linux-scsi@vger.kernel.org" <linux-scsi@vger.kernel.org>,
	Linux NVMe Mailinglist <linux-nvme@lists.infradead.org>
Subject: Re: [LSF/MM/BPF TOPIC] block namespaces
Date: Thu, 10 Jun 2021 17:05:07 +0200
Message-ID: <33f720ce-484b-0c42-b14a-0a7df3e09e81@suse.de>
In-Reply-To: <f31fbb6d2f374a39d22e3fca122d757e905d2711.camel@HansenPartnership.com>

On 6/10/21 4:29 PM, James Bottomley wrote:
> On Thu, 2021-06-10 at 07:49 +0200, Hannes Reinecke wrote:
>> On 6/9/21 8:36 PM, James Bottomley wrote:
>>> On Thu, 2021-05-27 at 10:01 +0200, Hannes Reinecke wrote:
>>>> Hi all,
>>>>
>>>> I guess it's time to tick off yet another item on my long-term
>>>> to-do list:
>>>>
>>>> Block namespaces
>>>> ----------------
>>>>
>>>> Idea is similar to what network already does: allowing each user
>>>> namespace to have a different 'view' on the existing block
>>>> devices.  EG if the admin creates a ramdisk in one namespace this
>>>> device should not be visible to other namespaces.  But for me the
>>>> most important use-case would be qemu; currently the devices need
>>>> to be set up in the host, even though the host has no business
>>>> touching it as they really belong to the qemu instance.  This
>>>> is causing quite some irritation eg when this device has LVM or
>>>> MD metadata and udev is trying to activate it on the host.
>>>
>>> I suppose the first question is "why block only?"  There are
>>> several existing device namespace proposals which would be more
>>> generic.
>>>
>>
>> Well; I'm more of a storage person, and do know the needs and 
>> shortcomings in that area. Less well so in other areas...
> 
> OK, but this should work for all devices, just like the device cgroup
> if it's going to be an adjunct to it.
> 
>>>> Overall plan is to restrict views of '/dev', '/sys/dev/block' and
>>>> '/sys/block' to only present the devices 'visible' for this
>>>> namespace.
>>>
>>> We actually already have a devices cgroup that does some of this:
>>>
>>> https://www.kernel.org/doc/Documentation/cgroup-v1/devices.txt
>>>
>> I know. But this essentially is a filter on '/dev' only, and needs to
>> be configured. Which makes it very unwieldy to use.
>> And the contents of sysfs are not modified, so there's a mismatch 
>> between contents in /dev and /sys.
>> Which might cause issues with monitoring tools.
> 
> Firstly, since it does part of what you want, we at least need to
> understand why you think it can't be enhanced to do everything.
> 
> The /sys problem has been discussed many times.  GregKH really doesn't
> like the idea of filtering /sys (and most container people agree), so
> the options available seem to be don't mount /sys in a container or
> emulate it via fuse.  If you pick either does the device cgroup now
> work for you?
> 

... begging the question why he relented for network namespaces.
They _do_ modify sysfs, and they do so via generic hooks (i.e. by making
struct class namespace-aware).
And as this is a public interface I don't see why I can't be using it...
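
To illustrate the kind of tagging meant here, a toy userspace model in
Python (not kernel code; `Device`, `visible_in` and the namespace labels
are made up for illustration). Each device carries a namespace tag, and a
listing for a given namespace only returns the entries whose tag matches,
which is roughly how network devices end up filtered per namespace in
/sys/class/net.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    name: str    # device name as it would appear in a listing, e.g. "sda"
    ns_tag: str  # hypothetical label of the owning namespace


def visible_in(devices, ns_tag):
    """Return the names a listing would show for the given namespace tag."""
    return sorted(d.name for d in devices if d.ns_tag == ns_tag)


devices = [
    Device("sda", "host"),
    Device("ram0", "container-a"),
    Device("nvme0n1", "qemu-1"),
]

print(visible_in(devices, "host"))    # ['sda']
print(visible_in(devices, "qemu-1"))  # ['nvme0n1']
```

The point of the model is only that the filter sits in the generic
listing path, keyed on a tag; nothing device-specific is involved.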

>>> However, visibility isn't the only problem, for direct passthrough
>>> there's also uevent handling and people have even asked about
>>> module loading.
>>>
>> I am aware, and that's another reason why device cgroup doesn't cut
>> it.
> 
> Christian Brauner is looking at this ... apparently Ubuntu has some
> thumb drive inside lxc container use case that needs it.
> 

Guess with whom I'm discussing how to implement this.

>>>>   Initially the drivers would keep their global enumeration, but
>>>> plan is to make the drivers namespace-aware, too, such that each
>>>> namespace could have its own driver-specific device enumeration.
>>>
>>> I really wouldn't do this.  Namespace/Cgroup separation should be
>>> kept as high as possible.  If it leaks into the drivers it will
>>> become unmaintainable.  Why do you think you need the drivers to be
>>> aware?  If it's just enumeration, that should all be doable with
>>> the visibility driver unless you want to do things like compact
>>> numbering?
>>>
>> Which is precisely why I mentioned device modifications.
>> On a generic level we can influence the visibility of devices in 
>> relation to namespaces, we cannot influence the devices themselves.
>> This will lead to namespaces seeing disjoint device numbers (ie 8:0
>> and 8:8 on ns 1, 8:4 on ns 2). Not that I think that will be an
>> issue, but certainly a change in behaviour.
> 
> Well, not necessarily, the pid namespace is an example of a remapping
> namespace.  The same thing could be done for device numbering, but
> there really needs to be a compelling case.  Given our use of hotplug,
> why would any tool assume compact numbering?  And if you don't need
> compact numbering there's no need to bother with remapping.
> 

Which is my assumption, too. I have just mentioned it here for
completeness' sake. I am not planning to implement it for now.
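
For reference, the difference between the two options can be sketched
with a toy Python model (purely illustrative; the function names and the
ownership table are assumptions, and the device numbers are the ones from
the example quoted above). Visibility-only filtering leaves each
namespace with its slice of the global numbers, while a pid-namespace
style remapping would hand out dense per-namespace minors and keep a
table pointing back to the global ones.

```python
def visible_numbers(owner_by_devt, ns):
    """Visibility only: a namespace sees the global numbers of its devices."""
    return sorted(devt for devt, owner in owner_by_devt.items() if owner == ns)


def compact_remap(owner_by_devt, ns):
    """Remapping: dense per-namespace minors, mapped back to global numbers."""
    next_minor = {}  # next free local minor, per major
    mapping = {}     # (major, local minor) -> (major, global minor)
    for major, minor in visible_numbers(owner_by_devt, ns):
        local = next_minor.get(major, 0)
        mapping[(major, local)] = (major, minor)
        next_minor[major] = local + 1
    return mapping


# Hypothetical ownership table: (major, minor) -> namespace label
owners = {(8, 0): "ns1", (8, 4): "ns2", (8, 8): "ns1"}

print(visible_numbers(owners, "ns1"))  # [(8, 0), (8, 8)]
print(compact_remap(owners, "ns1"))    # {(8, 0): (8, 0), (8, 1): (8, 8)}
```

The remapping variant needs the extra translation table on every lookup,
which is the cost that has to be justified by a real use-case.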

>>>> Goal of this topic is to get a consensus on whether block
>>>> namespaces are a feature which would find interest, and also to
>>>> discuss some design details here:
>>>> - Only in certain cases can a namespace be assigned (eg by calling
>>>>   'modprobe', starting iscsiadm, or calling nvme-cli); how do we
>>>>   handle devices for which no namespace can be identified?
>>>> - Shall we allow for different device enumeration per namespace?
>>>> - Into which level should we go with hiding sysfs structures?
>>>>   Is blanking out the higher-level interfaces in /dev and
>>>>   /sys/block enough?
>>>
>>> First question is does the device cgroup do enough for you and if
>>> not what's missing?
>>>
>> See above. sysfs modifications and uevent filtering are missing.
>> The infrastructure for that is already in place thanks to network
>> namespaces, we 'just' need to make use of it.
>> Additional drawback is the manual configuration of device-cgroup.
> 
> OK, so still, why not fix or enhance the device cgroup?  The current
> proposal seems to want to duplicate it as a namespace.
> 
Primarily because device-cgroup is really limited in its functionality,
and the interface into the device core is, quite frankly, horrible.
(Implementing it under security/device_cgroup.c, and having direct
function calls from fs/block_dev.c and fs/namei.c?
hch would yell at me if I attempted that.)

Re-implementing that as a namespace is the cleaner solution.
Actually, it should be possible to re-implement device-cgroup on top
of the block namespaces; for that we'll have to expand the namespace to
cover all devices, but you're arguing for that anyway.
So, let's see.
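
A rough idea of what such layering could look like, again as a toy Python
model and not as a proposed kernel interface (`parse_rule`,
`rule_matches`, `may_access` and the device set are all hypothetical).
The rule syntax follows the device cgroup entries ('type major:minor
access', with 'a' for all types, '*' as wildcard, and access composed of
r, w and m); the namespace check runs first, and the allow list is only
consulted for devices the namespace can see at all.

```python
def parse_rule(text):
    """Parse a device-cgroup style entry such as 'b 8:* rw'."""
    dev_type, numbers, access = text.split()
    major, minor = numbers.split(":")
    return dev_type, major, minor, set(access)


def rule_matches(rule, dev_type, major, minor, access):
    """True if the rule covers this device and every requested access bit."""
    r_type, r_major, r_minor, r_access = rule
    return (
        r_type in ("a", dev_type)
        and r_major in ("*", str(major))
        and r_minor in ("*", str(minor))
        and set(access) <= r_access
    )


def may_access(ns_devices, rules, dev_type, major, minor, access):
    """Namespace visibility first, then the cgroup-style allow list."""
    if (dev_type, major, minor) not in ns_devices:
        return False
    return any(rule_matches(r, dev_type, major, minor, access) for r in rules)


# Hypothetical setup: the namespace owns two block devices, read-only rule
ns_devices = {("b", 8, 0), ("b", 8, 8)}
rules = [parse_rule("b 8:* r")]

print(may_access(ns_devices, rules, "b", 8, 0, "r"))  # True
print(may_access(ns_devices, rules, "b", 8, 0, "w"))  # False
print(may_access(ns_devices, rules, "b", 8, 4, "r"))  # False
```

In this arrangement the allow list never has to know about devices
outside the namespace, so the manual configuration shrinks to the access
bits.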

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		        Kernel Storage Architect
hare@suse.de			               +49 911 74053 688
SUSE Software Solutions Germany GmbH, 90409 Nürnberg
GF: F. Imendörffer, HRB 36809 (AG Nürnberg)
