From: Jeff Mahoney <jeffm@suse.com>
To: Dave Chinner <david@fromorbit.com>, Mark Fasheh <mfasheh@suse.de>
Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-btrfs@vger.kernel.org
Subject: Re: [RFC][PATCH 0/76] vfs: 'views' for filesystems with more than one root
Date: Tue, 8 May 2018 22:06:44 -0400
Message-ID: <e6698201-a4a6-2146-d4c9-9e26c18ad828@suse.com>
In-Reply-To: <20180508233840.GM10363@dastard>


On 5/8/18 7:38 PM, Dave Chinner wrote:
> On Tue, May 08, 2018 at 11:03:20AM -0700, Mark Fasheh wrote:
>> Hi,
>>
>> The VFS's super_block covers a variety of filesystem functionality. In
>> particular we have a single structure representing both I/O and
>> namespace domains.
>>
>> There are requirements to de-couple this functionality. For example,
>> filesystems with more than one root (such as btrfs subvolumes) can
>> have multiple inode namespaces. This starts to confuse userspace when
>> it notices multiple inodes with the same inode/device tuple on a
>> filesystem.
> 
> Devil's Advocate - I'm not looking at the code, I'm commenting on
> architectural issues I see here.
> 
> The XFS subvolume work I've been doing explicitly uses a superblock
> per subvolume. That's because subvolumes are designed to be
> completely independent of the backing storage - they know nothing
> about the underlying storage except to share a BDI for writeback
> purposes and write to whatever block device the remapping layer
> gives them at IO time.  Hence XFS subvolumes have (at this point)
> their own unique s_dev, on-disk format configuration, journal, space
> accounting, etc. i.e. They are fully independent filesystems in
> their own right, and as such we do not have multiple inode
> namespaces per superblock.

That's a fundamental difference between how your XFS subvolumes and
btrfs subvolumes work.  There is no independence among btrfs
subvolumes.  When a snapshot is created, it has a few new blocks but
otherwise shares the metadata of the source subvolume.  The metadata
trees are shared across all of the subvolumes and there are several
internal trees used to manage all of it.  It's a single storage pool and
a single transaction engine.  There are housekeeping and maintenance
tasks that operate across the entire file system internally.  I
understand that there are several problems you need to solve at the VFS
layer to get your version of subvolumes up and running, but trying to
shoehorn one into the other is bound to fail.

> So this doesn't sound like a "subvolume problem" - it's a "how do we
> sanely support multiple independent namespaces per superblock"
> problem. AFAICT, this same problem exists with bind mounts and mount
> namespaces - they are effectively multiple roots on a single
> superblock, but it's done at the vfsmount level and so the
> superblock knows nothing about them.

In this case, you're talking about the user-visible file system
hierarchy namespace that has no bearing on the underlying file system
outside of per-mount flags.  It makes sense for that to be above the
superblock because the file system doesn't care about them.  We're
interested in the inode namespace, which for every other file system can
be described using an inode and a superblock pair, but btrfs has another
layer in the middle: inode -> btrfs_root -> superblock.  The lifetime
rules for e.g. the s_dev follow that middle layer, and a vfsmount can
disappear well before the inode does.
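
To make that layering concrete, here's a rough sketch of the extra
level.  It's simplified and the field names are from memory rather
than copied out of the btrfs headers, but the shape is the point:

struct btrfs_root {
	struct btrfs_fs_info	*fs_info;	/* shared pool and transaction state */
	dev_t			anon_dev;	/* per-subvolume device number */
	/* ... */
};

struct btrfs_inode {
	struct btrfs_root	*root;		/* subvolume this inode belongs to */
	/* ... */
	struct inode		vfs_inode;	/* vfs_inode.i_sb is the one shared superblock */
};

Every inode in the pool reaches the same superblock through
vfs_inode.i_sb; the only per-subvolume state lives in the btrfs_root
in the middle, which is why the s_dev lifetime follows it.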

> So this kinda feels like there's still an impedance mismatch between
> btrfs subvolumes being mounted as subtrees on the underlying root
> vfsmount rather than being created as truly independent vfs
> namespaces that share a superblock. To put that as a question: why
> aren't btrfs subvolumes vfsmounts in their own right, with the unique
> subvolume information stored in (or obtained from)
> the vfsmount?

Those are two separate problems.  Using a vfsmount to export the
btrfs_root is on my roadmap.  I have a WIP patch set that automounts
subvolumes when stepping into a new one, but that's to fix a
longstanding UX wart.  Ultimately, vfsmounts are at the wrong level to
solve the
inode namespace problem.  Again, there's the lifetime issue.  There are
also many places where we only have an inode and need the s_dev
associated with it.  Most of these sites are far removed from having
access to a vfsmount, and pinning one just to pass it around would
carry no other benefit.
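
As a rough illustration of the kind of accessor this enables - these
are sketches of the idea, not necessarily the exact helpers in Mark's
series, and the i_view field and its members are assumptions on my
part - code holding nothing but an inode can still resolve the right
superblock and device:

static inline struct super_block *inode_sb(const struct inode *inode)
{
	return inode->i_view->v_sb;	/* the single shared superblock */
}

static inline dev_t inode_dev(const struct inode *inode)
{
	return inode->i_view->v_dev;	/* per-subvolume device number */
}

Nothing here needs a vfsmount to be looked up, pinned, or passed down
the call chain just so stat() or audit can report the right device.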

>> In addition, it's currently impossible for a filesystem subvolume to
>> have a different security context from its parent. If we could allow
>> for subvolumes to optionally specify their own security context, we
>> could use them as containers directly instead of having to go through
>> an overlay.
> 
> Again, XFS subvolumes don't have this problem. So really we need to
> frame this discussion in terms of supporting multiple namespaces
> within a superblock sanely, not subvolumes.
> 
>> I ran into this particular problem with respect to Btrfs some years
>> ago and sent out a very naive set of patches which were (rightfully)
>> not incorporated:
>>
>> https://marc.info/?l=linux-btrfs&m=130074451403261&w=2
>> https://marc.info/?l=linux-btrfs&m=130532890824992&w=2
>>
>> During the discussion, one question did come up - why can't
>> filesystems like Btrfs use a superblock per subvolume? There's a
>> couple of problems with that:
>>
>> - It's common for a single Btrfs filesystem to have thousands of
>>   subvolumes. So keeping a superblock for each subvol in memory would
>>   get prohibitively expensive - imagine having 8000 copies of struct
>>   super_block for a file system just because we wanted some separation
>>   of say, s_dev.
> 
> That's no different to using individual overlay mounts for the
> thousands of containers that are on the system. This doesn't seem to
> be a major problem...

Overlay mounts are independent of one another and don't need any
coordination among them.  The memory usage is relatively unimportant.
The real problem is having a bunch of superblocks that all correspond
to the same resources and having to coordinate them at the VFS level.
Your assumptions
below follow how your XFS subvolumes work, where there's a clear
hierarchy between the subvolumes and the master filesystem with a
mapping layer between them.  Btrfs subvolumes have no such hierarchy.
Everything is shared.  So while we could use a writeback hierarchy to
merge all of the inode lists before doing writeback on the 'master'
superblock, we'd gain nothing by it.  Handling anything involving
s_umount with a superblock per subvolume would be a nightmare.
Ultimately, it would be a ton of effort that amounts to working around
the VFS instead of with it.

>> - Writeback would also have to walk all of these superblocks -
>>   again not very good for system performance.
> 
> Background writeback is backing device focussed, not superblock
> focussed. It will only iterate the superblocks that have dirty
> inodes on the bdi writeback lists, not all the superblocks on the
> bdi. IOWs, this isn't a major problem except for sync() operations
> that iterate superblocks.....
> 
>> - Anyone wanting to lock down I/O on a filesystem would have to
>> freeze all the superblocks. This goes for most things related to
>> I/O really - we simply can't afford to have the kernel walking
>> thousands of superblocks to sync a single fs.
> 
> Not with XFS subvolumes. Freezing the underlying parent filesystem
> will effectively stop all IO from the mounted subvolumes by freezing
> remapping calls before IO. Sure, those subvolumes aren't in a
> consistent state, but we don't freeze userspace so none of the
> application data is ever in a consistent state when filesystems are
> frozen.
> 
> So, again, I'm not sure there's /subvolume/ problem here. There's
> definitely a "freeze hierarchy" problem, but that already exists and
> it's something we talked about at LSFMM because we need to solve it
> for reliable hibernation.

There's only a freeze hierarchy problem if we have to use multiple
superblocks.  Otherwise, we freeze the whole thing or we don't.  Trying
to freeze a single subvolume would be an illusion for the user since the
underlying file system would still be active underneath it.  Under the
hood, things like relocation don't even look at which subvolume owns a
particular extent until they must.  So we'd be coordinating thousands
of superblocks to do what a single lock does now, and for what benefit?

>> It's far more efficient then to pull those fields we need for a
>> subvolume namespace into their own structure.
>
> I'm not convinced yet - it still feels like it's the wrong layer to
> be solving the multiple namespace per superblock problem....

The new layer needs to sit between the inode and the superblock.  Even
with multiple user-visible namespaces, each one still gets the same
underlying file system namespace.
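
For reference, a minimal sketch of what that in-between object could
look like.  This is my shorthand for the idea with guessed field
names, not the literal struct fs_view from Mark's patches:

struct fs_view {
	struct super_block	*v_sb;	/* the one real superblock for the pool */
	dev_t			v_dev;	/* per-subvolume device number */
	/* room for e.g. a per-subvolume security context later */
};

A single-root filesystem would point every inode at one default view
embedded in its superblock; btrfs would hang a view off each
btrfs_root and point that subvolume's inodes at it, so the lifetime
follows the btrfs_root rather than a vfsmount.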

-Jeff

-- 
Jeff Mahoney
SUSE Labs

