linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Christian Brauner <brauner@kernel.org>
To: John Groves <John@groves.net>
Cc: John Groves <jgroves@micron.com>,
	Jonathan Corbet <corbet@lwn.net>,
	 Dan Williams <dan.j.williams@intel.com>,
	Vishal Verma <vishal.l.verma@intel.com>,
	 Dave Jiang <dave.jiang@intel.com>,
	Alexander Viro <viro@zeniv.linux.org.uk>,
	 Jan Kara <jack@suse.cz>, Matthew Wilcox <willy@infradead.org>,
	linux-cxl@vger.kernel.org,  linux-fsdevel@vger.kernel.org,
	linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	 nvdimm@lists.linux.dev, john@jagalactic.com,
	Dave Chinner <david@fromorbit.com>,
	 Christoph Hellwig <hch@infradead.org>,
	dave.hansen@linux.intel.com, gregory.price@memverge.com
Subject: Re: [RFC PATCH 11/20] famfs: Add fs_context_operations
Date: Wed, 28 Feb 2024 11:07:20 +0100	[thread overview]
Message-ID: <20240228-zecken-zonen-d07e89c6f536@brauner> (raw)
In-Reply-To: <6jrtl2vc4dmi5b6db6tte2ckiyjmiwezbtlwrtmm464v65wkhj@znzv2mwjfgsk>

> I wasn't aware of the new fsconfig interface. Is there documentation or a
> file sytsem that already uses it that I should refer to? I didn't find an
> obvious candidate, but it might be me. If it should be obvious from the
> example above, tell me and I'll try harder.
> 
> My famfs code above was copied from ramfs. If you point me to 

Ok, but that's the wrong filesystem to use as a model imho. Because it
really doesn't deal with devices at all. That's why it uses
get_tree_nodev() with "nodev" as in "no device" kinda. So ramfs doesn't
have any of these issues. Whereas your filesystems is dealing with
devices dax (or pmem).

> documentation I might send you a ramfs fsconfig patch too :D.

So the manpages are at:

https://github.com/brauner/man-pages-md

But really, there shouldn't be anything that needs to change for ramfs.

> > What errno is EALREADY? Isn't that socket stuff. In any case, it seems
> > you want EBUSY?
> 
> Thanks... That should probaby be EBUSY. But the whole famfs_context_list
> should probably also be removed. More below...
> 
> > 
> > But bigger picture I'm lost. And why do you keep that list based on
> > strings? What if I do:
> > 
> > mount -t famfs /dev/pmem1234 /mnt # succeeds
> > 
> > mount -t famfs /dev/pmem1234 /opt # ah, fsck me, this fails.. But wait a minute....
> > 
> > mount --bind /dev/pmem1234 /evil-masterplan
> > 
> > mount -t famfs /evil-masterplan /opt # succeeds. YAY
> > 
> > I believe that would trivially defeat your check.
> > 
> 
> And I suspect this is related to the get_tree issue you noticed below.
> 
> This famfs code was working in 6.5 without keeping the linked list of devices,
> but in 6.6/6.7/6.8 it works provided you don't try to repeat a mount command
> that has already succeeded. I'm not sure why 6.5 protected me from that,
> but the later versions don't. In 6.6+ That hits a BUG_ON (have specifics on 
> that but not handy right now).

get_tree_nodev() by default will always allocate a new superblock. This
is how tmpfs and ramfs work. If you do:

mount -t tmpfs tmpfs /mnt
mount -t tmpfs tmpfs /opt

You get two new, independent superblocks. This is what you want for
these multi-instance filesystems: each new mount creates a new instance.

If famfs doesn't want to allow reusing devices - which I very much think
it wants to prevent - then it cannot use get_tree_nodev() directly
without having a hack like you did. Because you'll get a new superblock
no problem. So the fact that it did work somehow likely was a bug in
your code.

The reason your code causes crashes is very likely this:

struct famfs_fs_info *fsi = sb->s_fs_info;
handlep = bdev_open_by_path(fc->source, FAMFS_BLKDEV_MODE, fsi, &fs_holder_ops);

If you look at Documentation/filesystems/porting.rst you should see that
if you use @fs_holder_ops then your holder should be the struct
super_block, not your personal fsinfo.

> So for a while we just removed repeated mount requests from the famfs smoke
> tests, but eventually I implemented the list above, which - though you're right
> it would be easy to circumvent and therefore is not right - it did solve the
> problem that we were testing for.
> 
> I suspect that correctly handling get_tree might solve this problem.
> 
> Please assume that linked list will be removed - it was not the right solution.
> 
> More below...
> 
> > > +		}
> > > +	}
> > > +
> > > +	list_add(&fsi->fsi_list, &famfs_context_list);
> > > +	mutex_unlock(&famfs_context_mutex);
> > > +
> > > +	return get_tree_nodev(fc, famfs_fill_super);
> > 
> > So why isn't this using get_tree_bdev()? Note that a while ago I
> > added FSCONFIG_CMD_CREAT_EXCL which prevents silent superblock reuse. To
> > implement that I added fs_context->exclusive. If you unconditionally set
> > fc->exclusive = 1 in your famfs_init_fs_context() and use
> > get_tree_bdev() it will give you EBUSY if fc->source is already in use -
> > including other famfs instances.
> > 
> > I also fail to yet understand how that function which actually opens the block
> > device and gets the dax device figures into this. It's a bit hard to follow
> > what's going on since you add all those unused functions and types so there's
> > never a wider context to see that stuff in.
> 
> Clearly that's a bug in my code. That get_tree_nodev() is from ramfs, which
> was the starting point for famfs.
> 
> I'm wondering if doing this correctly (get_tree_bdev() when it's pmem) would
> have solved my double mount problem on 6.6 onward.
> 
> However, there's another wrinkle: I'm concluding
> (see https://lore.kernel.org/linux-fsdevel/ups6cvjw6bx5m3hotn452brbbcgemnarsasre6ep2lbe4tpjsy@ezp6oh5c72ur/)
> that famfs should drop block support and just work with /dev/dax. So famfs 
> may be the first file system to be hosted on a character device? Certainly 
> first on character dax. 

Ugh, ok. I defer to others whether that makes sense or not. It would be
a lot easier for you if you used pmem block devices, I guess because it
would be easy to detect reuse in common infrastructure.

But also, I'm looking at your code a bit closer. There's a bit of a
wrinkle the way it's currently written...

Say someone went a bit weird and did:

mount -t xfs xfs /dev/sda /my/xfs-filesystem
mknod DAX_DEVICE /my/xfs-filesystem/dax1234

and then did:

mount -t famfs famfs /my/xfs-filesystem/dax1234 /mnt

Internally in famfs you do:

fsi->dax_filp = filp_open(fc->source, O_RDWR, 0);

and you stash that file... Which means that you are pinning that xfs
filesystems implicitly. IOW, if someone does:

umount /my/xfs-filesystem

they get EBUSY for completely opaque reasons. And if they did:

umount -l /my/xfs-filesystem

followed by mounting that xfs filesystem again they'd get the same
superblock for that xfs filesystem.

What I'm trying to say is that I think you cannot pin another filesystem
like this when you open that device.

IOW, you either need to stash the plain dax device or dax needs to
become it's own tiny internal pseudo fs such that we can open dax
devices internally just like files. Which might actually also be worth
doing. But I'm not the maintainer of that.

> 
> Given that, what variant of get_tree() should it call? Should it add 
> get_tree_dax()? I'm not yet familiar enough with that code to have a worthy 
> opinion on this.

I don't think we need a common helper if famfs would be the only user of this.
But maybe I'm wrong. But roughly you'd need something similar to what we
do for block devices, I'd reckon. So lookup_daxdev() which is similar to
lookup_bdev() and allows you to translate from path to dax device
number maybe.

lookup_daxdev(const char *name, struct dax_dev? *daxdev)
{
	/* Don't actually open the dax device pointlessly */
	kern_path(fc->source, LOOKUP_FOLLOW, path);
	if (!S_ISCHR(inode->i_mode))
		// fail
	if (!may_open_dev(&path))
		// fail

	// check dax device and pin

	// get rid of path references
	path_put(&path);
}

famfs_get_tree(/* broken broken broken */)
{

	lookup_daxdev(fc->source, &ddev);

	sb = sget_fc(fc, famfs_test_super, set_anon_super_fc)
	if (IS_ERR(sb))
		// Error here may mean (aside from memory):
		// * superblock incompatible bc of read-write vs read-only
		// * non-matching user namespace
		// * FSCONFIG_CMD_CREATE_EXCL requested by mounter

	if (!sb->s_root) {
		// fill_super; new sb; dax device currently not used.
	} else {
		// A superblock for that dax device already exists and
		// may be reused. Any additional rejection reasons for
		// such an sb are up to the filesystem.
	}

	// Now really open or claim the dax device.
	// If you fail get rid of the superblock
	// (deactivate_locked_super()).

All handwavy, I know and probably I forgot details. But for you to fill
that in. ;)

  parent reply	other threads:[~2024-02-28 10:07 UTC|newest]

Thread overview: 106+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-02-23 17:41 [RFC PATCH 00/20] Introduce the famfs shared-memory file system John Groves
2024-02-23 17:41 ` [RFC PATCH 01/20] famfs: Documentation John Groves
2024-02-23 17:41 ` [RFC PATCH 02/20] dev_dax_iomap: Add fs_dax_get() func to prepare dax for fs-dax usage John Groves
2024-02-26 12:05   ` Jonathan Cameron
2024-02-26 15:00     ` John Groves
2024-02-23 17:41 ` [RFC PATCH 03/20] dev_dax_iomap: Move dax_pgoff_to_phys from device.c to bus.c since both need it now John Groves
2024-02-26 12:10   ` Jonathan Cameron
2024-02-26 15:13     ` John Groves
2024-02-23 17:41 ` [RFC PATCH 04/20] dev_dax_iomap: Save the kva from memremap John Groves
2024-02-26 12:21   ` Jonathan Cameron
2024-02-26 15:48     ` John Groves
2024-02-23 17:41 ` [RFC PATCH 05/20] dev_dax_iomap: Add dax_operations for use by fs-dax on devdax John Groves
2024-02-26 12:32   ` Jonathan Cameron
2024-02-26 16:09     ` John Groves
2024-02-23 17:41 ` [RFC PATCH 06/20] dev_dax_iomap: Add CONFIG_DEV_DAX_IOMAP kernel build parameter John Groves
2024-02-26 12:34   ` Jonathan Cameron
2024-02-26 16:12     ` John Groves
2024-02-23 17:41 ` [RFC PATCH 07/20] famfs: Add include/linux/famfs_ioctl.h John Groves
2024-02-24  1:39   ` Randy Dunlap
2024-02-24  2:23     ` John Groves
2024-02-24  3:27       ` Randy Dunlap
2024-02-24 23:32         ` John Groves
2024-02-24 23:40           ` Randy Dunlap
2024-02-26 12:39   ` Jonathan Cameron
2024-02-26 16:44     ` John Groves
2024-02-26 16:56       ` Jonathan Cameron
2024-02-26 18:04         ` John Groves
2024-02-23 17:41 ` [RFC PATCH 08/20] famfs: Add famfs_internal.h John Groves
2024-02-26 12:48   ` Jonathan Cameron
2024-02-26 17:35     ` John Groves
2024-02-27 10:28       ` Jonathan Cameron
2024-02-28  1:06         ` John Groves
2024-02-27 13:38   ` Christian Brauner
2024-02-27 14:12     ` John Groves
2024-02-23 17:41 ` [RFC PATCH 09/20] famfs: Add super_operations John Groves
2024-02-26 12:51   ` Jonathan Cameron
2024-02-26 21:47     ` John Groves
2024-02-27 10:34       ` Jonathan Cameron
2024-02-27 17:48     ` John Groves
2024-02-23 17:41 ` [RFC PATCH 10/20] famfs: famfs_open_device() & dax_holder_operations John Groves
2024-02-26 12:56   ` Jonathan Cameron
2024-02-26 22:22     ` John Groves
2024-02-27 13:39   ` Christian Brauner
2024-02-27 18:38     ` John Groves
2024-02-23 17:41 ` [RFC PATCH 11/20] famfs: Add fs_context_operations John Groves
2024-02-26 13:20   ` Jonathan Cameron
2024-02-26 22:43     ` John Groves
2024-02-27 13:41   ` Christian Brauner
2024-02-28  0:59     ` John Groves
2024-02-28  1:49       ` Randy Dunlap
2024-02-28  8:17         ` Christian Brauner
2024-02-28 10:07       ` Christian Brauner [this message]
2024-02-28 12:01         ` Christian Brauner
2024-02-23 17:41 ` [RFC PATCH 12/20] famfs: Add inode_operations and file_system_type John Groves
2024-02-26 13:25   ` Jonathan Cameron
2024-02-26 22:53     ` John Groves
2024-02-23 17:41 ` [RFC PATCH 13/20] famfs: Add iomap_ops John Groves
2024-02-26 13:30   ` Jonathan Cameron
2024-02-26 23:00     ` John Groves
2024-02-23 17:41 ` [RFC PATCH 14/20] famfs: Add struct file_operations John Groves
2024-02-26 13:32   ` Jonathan Cameron
2024-02-26 23:09     ` John Groves
2024-02-23 17:41 ` [RFC PATCH 15/20] famfs: Add ioctl to file_operations John Groves
2024-02-26 13:44   ` Jonathan Cameron
2024-02-23 17:42 ` [RFC PATCH 16/20] famfs: Add fault counters John Groves
2024-02-23 18:23   ` Dave Hansen
2024-02-23 19:56     ` John Groves
2024-02-23 20:04       ` Dan Williams
2024-02-23 20:39         ` John Groves
2024-02-23 21:19           ` Dave Hansen
2024-02-23 23:50             ` Dan Williams
2024-02-24  3:59               ` Matthew Wilcox
2024-02-24  4:30                 ` Dan Williams
2024-02-23 17:42 ` [RFC PATCH 17/20] famfs: Add module stuff John Groves
2024-02-26 13:47   ` Jonathan Cameron
2024-02-27 22:15     ` John Groves
2024-02-23 17:42 ` [RFC PATCH 18/20] famfs: Support character dax via the dev_dax_iomap patch John Groves
2024-02-26 13:52   ` Jonathan Cameron
2024-02-27 22:27     ` John Groves
2024-02-23 17:42 ` [RFC PATCH 19/20] famfs: Update MAINTAINERS file John Groves
2024-02-23 17:42 ` [RFC PATCH 20/20] famfs: Add Kconfig and Makefile plumbing John Groves
2024-02-24  1:50   ` Randy Dunlap
2024-02-24  2:24     ` John Groves
2024-02-24  0:07 ` [RFC PATCH 00/20] Introduce the famfs shared-memory file system Luis Chamberlain
2024-02-26 13:27   ` John Groves
2024-02-26 15:53     ` Luis Chamberlain
2024-02-26 21:16       ` John Groves
2024-02-27  0:58         ` Luis Chamberlain
2024-02-27  2:05           ` John Groves
2024-02-29  2:15             ` Dave Chinner
2024-02-29 14:52               ` John Groves
2024-03-11  1:29                 ` Dave Chinner
2024-02-29  6:52 ` Amir Goldstein
2024-02-29 22:16   ` John Groves
2024-05-17  9:55   ` Miklos Szeredi
2024-05-19  5:59     ` Amir Goldstein
2024-05-22  2:05       ` John Groves
2024-05-22  8:58         ` Miklos Szeredi
2024-05-22 10:16           ` Amir Goldstein
2024-05-22 11:28             ` Miklos Szeredi
2024-05-22 13:41               ` Amir Goldstein
2024-05-23  2:49           ` John Groves
2024-05-23 13:57             ` Miklos Szeredi
2024-05-24  0:47               ` John Groves
2024-05-24  7:55                 ` Miklos Szeredi
2024-05-25 22:54                   ` Dave Chinner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20240228-zecken-zonen-d07e89c6f536@brauner \
    --to=brauner@kernel.org \
    --cc=John@groves.net \
    --cc=corbet@lwn.net \
    --cc=dan.j.williams@intel.com \
    --cc=dave.hansen@linux.intel.com \
    --cc=dave.jiang@intel.com \
    --cc=david@fromorbit.com \
    --cc=gregory.price@memverge.com \
    --cc=hch@infradead.org \
    --cc=jack@suse.cz \
    --cc=jgroves@micron.com \
    --cc=john@jagalactic.com \
    --cc=linux-cxl@vger.kernel.org \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=nvdimm@lists.linux.dev \
    --cc=viro@zeniv.linux.org.uk \
    --cc=vishal.l.verma@intel.com \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).