Linux-Fsdevel Archive on lore.kernel.org
 help / color / Atom feed
From: Carlos Maiolino <cmaiolino@redhat.com>
To: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: linux-fsdevel@vger.kernel.org, hch@lst.de, adilger@dilger.ca,
	sandeen@redhat.com, david@fromorbit.com
Subject: Re: [PATCH 09/10 V2] Use FIEMAP for FIBMAP calls
Date: Fri, 8 Feb 2019 09:58:34 +0100
Message-ID: <20190208085834.sfhgrn4z5wwvavoy@hades.usersys.redhat.com> (raw)
In-Reply-To: <20190207181655.GC27972@magnolia>

On Thu, Feb 07, 2019 at 10:16:55AM -0800, Darrick J. Wong wrote:
> On Thu, Feb 07, 2019 at 01:36:41PM +0100, Carlos Maiolino wrote:
> > Apologies, I forgot to mention another thing..
> > 
> > On Wed, Feb 06, 2019 at 12:44:31PM -0800, Darrick J. Wong wrote:
> > > On Wed, Feb 06, 2019 at 02:37:53PM +0100, Carlos Maiolino wrote:
> > > > > > > In any case, I think a better solution to the multi-device problem is to
> > > > > > > start returning device information via struct fiemap_extent, at least
> > > > > > > inside the kernel.  Use one of the reserved fields to declare a new
> > > > > > > '__u32 fe_device' field in struct fiemap_extent which can be the dev_t
> > > > > > > device number, and then you can check that against inode->i_sb->s_bdev
> > > > > > > to avoid returning results for the non-primary device of a multi-device
> > > > > > > filesystem.
> > > > > > 
> > > > > > I agree we should address it here, but I don't think fiemap_extent is the right
> > > > > > place for it, it is linked to the UAPI, and changing it is usually not a good
> > > > > > idea.
> > > > > 
> > > > > Adding a FIEMAP_EXTENT flag or two to turn one of the fe_reserved fields
> > > > > into some sort of dev_t/per-device cookie should be fine.  Userspace
> > > > > shouldn't be expecting any meaning in reserved areas.
> > > > > 
> > > > > > I think I got your idea anyway, but, what if, instead returning the bdev in
> > > > > > fiemap_extent, we instead, send a flag (via fi_flags) to the filesystem, to
> > > > > > idenfify a FIBMAP or a FIEMAP call, and let the filesystem decide what to do
> > > > > > with such information?
> > > > > 
> > > > > I don't like the idea of adding a FIEMAP_FLAG to distinguish callers.
> > > > 
> > > > Ok, may I ask why not?
> > > 
> > > I think it's a bad idea to add a flag to FIEMAP to change its behavior
> > > to suit an older and even crappier legacy interface (i.e. FIBMAP).
> > > 
> > > FIBMAP is architecturally broken in that we can't /ever/ provide the
> > > context of "which device does this map to?"
> > > 
> > > FIEMAP is architecturally deficient as well, but its ioctl structure
> > > definition is flexible enough that we can report "which device does this
> > > map to".
> > > 
> > > I want to enhance FIEMAP to deal with multi-device filesystems
> > > correctly, and as much as I want to kill FIBMAP, I can't because of zipl
> > > and *lilo.
> > > 
> > > > My apologies if I am wrong, but, per my understanding, there is
> > > > nothing today, which tells userspace which device belongs the extent
> > > > map reported by FIEMAP.
> > > 
> > > Right...
> > > 
> > > > If it belongs to the RT device in XFS, or whatever disk in a raid in
> > > > BTRFS, we simply do not provide such information.
> > > 
> > > Right...
> > > 
> > > > So, the goal is to provide a way to tell the filesystem if a FIEMAP or
> > > > a FIBMAP has been requested, so the current behavior of both ioctls
> > > > won't change.
> > > 
> > > ...but from my point of view, the FIEMAP behavior *ought* to change to
> > > be more expressive.  Once that's done, we can use the more expressive
> > > FIEMAP output to solve the problem of FIBMAP vs. multi-disk filesystems.
> > > 
> > > The whole point of having fe_reserved* fields in struct fiemap_extent is
> > > so that we can add a new FIEMAP_EXTENT_ flag so that the filesystem can
> > > start returning data in a reserved field.  New userspace programs that
> > > know about the flag can start reading information from the new field if
> > > they see the flag, and old userspace programs don't know about the flag
> > > and won't be any worse off.
> > > 
> > > > Enabling filesystems to return device information into fiemap_extent
> > > > requires modification of all filesystems to provide such information,
> > > > which will not have any use other than matching the mounted device to
> > > > the device where the extent is.
> > > 
> > > Perhaps it would help for me to present a more concrete proposal:
> > > 
> > > --- a/include/uapi/linux/fiemap.h	2019-01-18 10:53:44.000000000 -0800
> > > +++ b/include/uapi/linux/fiemap.h	2019-02-06 12:25:52.813935941 -0800
> > > @@ -22,7 +22,19 @@ struct fiemap_extent {
> > >  	__u64 fe_length;   /* length in bytes for this extent */
> > >  	__u64 fe_reserved64[2];
> > >  	__u32 fe_flags;    /* FIEMAP_EXTENT_* flags for this extent */
> > > -	__u32 fe_reserved[3];
> > > +
> > > +	/*
> > > +	 * Underlying device that this extent is stored on.
> > > +	 *
> > > +	 * If FIEMAP_EXTENT_DEV_T is set, this field is a dev_t containing the
> > > +	 * major and minor numbers of a device.  If FIEMAP_EXTENT_DEV_COOKIE is
> > > +	 * set, this field is a 32-bit cookie that can be used to distinguish
> > > +	 * between backing devices but has no intrinsic meaning.  If neither
> > > +	 * EXTENT_DEV flag is set, this field is meaningless.  Only one of the
> > > +	 * EXTENT_DEV flags may be set at any time.
> > > +	 */
> > > +	__u32 fe_device;
> > > +	__u32 fe_reserved[2];
> > >  };
> > >  
> > >  struct fiemap {
> > > @@ -66,5 +78,14 @@ struct fiemap {
> > >  						    * merged for efficiency. */
> > >  #define FIEMAP_EXTENT_SHARED		0x00002000 /* Space shared with other
> > >  						    * files. */
> > > +#define FIEMAP_EXTENT_DEV_T		0x00004000 /* fe_device is a dev_t
> > > +						    * structure containing the
> > > +						    * major and minor numbers
> > > +						    * of a block device. */
> > > +#define FIEMAP_EXTENT_DEV_COOKIE	0x00008000 /* fe_device is a 32-bit
> > > +						    * cookie that can be used
> > > +						    * to distinguish physical
> > > +						    * devices but otherwise
> > > +						    * has no meaning. */
> > >  
> > >  #endif /* _LINUX_FIEMAP_H */
> > > 
> > > Under this scheme, XFS can set FIEMAP_EXTENT_DEV_T in fe_flags and start
> > > encoding fe_device = new_encode_dev(xfs_get_device_for_file()).
> > 
> > Here, I believe you are forgetting that filesystems do not touch fiemap_extent
> > directly. We call fiemap_fell_next_extent() helper to fill each extent found by
> > fiemap. So, in either way, we'd need to modify fiemap_fill_next_extent() and the
> > callbacks being used to accommodate this new field or create a new helper to
> > modify the device which doesn't sound reasonable. So, either way, we will end up
> > needing to modify all filesystems.
> 
> Yep.  Drat.  I guess you could add a bdev parameter to
> fiemap_fill_next_extent, and we'd use that to encode fe_device.  If the
> fs passes NULL then we just get it from the superblock or something.
> 
> > So, although I really like the idea of improving the FIEMAP interface, I'm
> > starting to consider another patchset for it. I think it requires an interface
> > change big enough to fit in this patchset, which actually has a different
> > purpose. Or, maybe, address this at the end of this patchset, leaving different
> > interface changes in different patchsets, instead of making many changes all at
> > once, mixed together.
> 
> <nod> I think you're right, fiemap upgrades as one series and then
> fibmap-via-fiemap as the second one.
> 

Ok, fair enough, looks like we have an agreement :P I'll work on this direction
now, and set aside this patchset while we can improve FIEMAP to return the
device id, and then rebase this patchset on top of that.

Thanks for the review

> --D
> 
> > > 
> > > Some clustered filesystem or whatever could set FIEMAP_EXTENT_DEV_COOKIE
> > > and encode the replica number in fe_device.
> > > 
> > > Existing filesystems can be left unchanged, in which case neither
> > > EXTENT_DEV flag is set in fe_flags and the bits in fe_device are
> > > meaningless, the same as they are today.  Reporting fe_device is entirely
> > > optional.
> > > 
> > > Userspace programs will now be able to tell which device the file data
> > > lives on, which has been sort-of requested for years, if the filesystem
> > > chooses to start exporting that information.
> > > 
> > > Your FIBMAP-via-FIEMAP backend can do something like:
> > > 
> > > /* FIBMAP only returns results for the same block device backing the fs. */
> > > if ((fe->fe_flags & EXTENT_DEV_T) && fe->fe_device != inode->i_sb->sb_device)
> > > 	return 0;
> > > 
> > > /* Can't tell what is the backing device, bail out. */
> > > if (fe->fe_flags & EXTENT_DEV_COOKIE)
> > > 	return 0;
> > > 
> > > /*
> > >  * Either fe_device matches the backing device or the implementation
> > >  * doesn't tell us about the backing device, so assume it's ok.
> > >  */
> > > <return FIBMAP results>
> > > 
> > > So that's how I'd solve a longstanding design problem of FIEMAP and then
> > > take advantage of that solution to remedy my objections to the proposed
> > > "Use FIEMAP for FIBMAP" series.  It doesn't require a FIEMAP_FLAG
> > > behavior flag that userspace knows about but isn't allowed to pass in.
> > > 
> > > > A FIEMAP_FLAG will also require FS changes, but IMHO, less intrusive
> > > > than the device id in fiemap_extent. I don't see much advantage in
> > > > adding the device id instead of using the flag.
> > > > 
> > > > A problem I see using a new FIEMAP_FLAG, is it 'could' be also passed via
> > > > userspace, so, it would require a check to make sure it didn't come from
> > > > userspace if ioctl_fiemap() was used.
> > > > 
> > > > I think there are 2 other possibilities which can be used to fix this.
> > > > 
> > > > - Use a boolean inside fiemap_extent_info to identify a fibmap call, or,
> > > > - If the device id is a must for you, maybe add the device id into
> > > >   fiemap_extent_info instead of fiemap_extent.
> > > 
> > > That won't work with btrfs, which can store file extents on multiple
> > > different physical devices.
> > > 
> > > >   So we don't mess with a UAPI exported data structure and still
> > > >   provides a way to the filesystems to provide which device the mapped
> > > >   extent is in.
> > > > 
> > > > What you think?
> > > > 
> > > > Cheers
> > > > 
> > > > 
> > > > > 
> > > > > --D
> > > > > 
> > > > > > > 
> > > > > > > > +
> > > > > > > > +	return error;
> > > > > > > > +}
> > > > > > > > +
> > > > > > > >  /**
> > > > > > > >   *	bmap	- find a block number in a file
> > > > > > > >   *	@inode:  inode owning the block number being requested
> > > > > > > > @@ -1594,10 +1628,14 @@ EXPORT_SYMBOL(iput);
> > > > > > > >   */
> > > > > > > >  int bmap(struct inode *inode, sector_t *block)
> > > > > > > >  {
> > > > > > > > -	if (!inode->i_mapping->a_ops->bmap)
> > > > > > > > +	if (inode->i_op->fiemap)
> > > > > > > > +		return bmap_fiemap(inode, block);
> > > > > > > > +	else if (inode->i_mapping->a_ops->bmap)
> > > > > > > > +		*block = inode->i_mapping->a_ops->bmap(inode->i_mapping,
> > > > > > > > +						       *block);
> > > > > > > > +	else
> > > > > > > >  		return -EINVAL;
> > > > > > > 
> > > > > > > Waitaminute.  btrfs currently supports fiemap but not bmap, and now
> > > > > > > suddenly it will support this legacy interface they've never supported
> > > > > > > before.  Are they on board with this?
> > > > > > > 
> > > > > > > --D
> > > > > > > 
> > > > > > > >  
> > > > > > > > -	*block = inode->i_mapping->a_ops->bmap(inode->i_mapping, *block);
> > > > > > > >  	return 0;
> > > > > > > >  }
> > > > > > > >  EXPORT_SYMBOL(bmap);
> > > > > > > > diff --git a/fs/ioctl.c b/fs/ioctl.c
> > > > > > > > index 6086978fe01e..bfa59df332bf 100644
> > > > > > > > --- a/fs/ioctl.c
> > > > > > > > +++ b/fs/ioctl.c
> > > > > > > > @@ -116,6 +116,38 @@ int fiemap_fill_user_extent(struct fiemap_extent_info *fieinfo, u64 logical,
> > > > > > > >  	return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
> > > > > > > >  }
> > > > > > > >  
> > > > > > > > +int fiemap_fill_kernel_extent(struct fiemap_extent_info *fieinfo, u64 logical,
> > > > > > > > +			    u64 phys, u64 len, u32 flags)
> > > > > > > > +{
> > > > > > > > +	struct fiemap_extent *extent = fieinfo->fi_extents_start;
> > > > > > > > +
> > > > > > > > +	/* only count the extents */
> > > > > > > > +	if (fieinfo->fi_extents_max == 0) {
> > > > > > > > +		fieinfo->fi_extents_mapped++;
> > > > > > > > +		return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
> > > > > > > > +	}
> > > > > > > > +
> > > > > > > > +	if (fieinfo->fi_extents_mapped >= fieinfo->fi_extents_max)
> > > > > > > > +		return 1;
> > > > > > > > +
> > > > > > > > +	if (flags & SET_UNKNOWN_FLAGS)
> > > > > > > > +		flags |= FIEMAP_EXTENT_UNKNOWN;
> > > > > > > > +	if (flags & SET_NO_UNMOUNTED_IO_FLAGS)
> > > > > > > > +		flags |= FIEMAP_EXTENT_ENCODED;
> > > > > > > > +	if (flags & SET_NOT_ALIGNED_FLAGS)
> > > > > > > > +		flags |= FIEMAP_EXTENT_NOT_ALIGNED;
> > > > > > > > +
> > > > > > > > +	extent->fe_logical = logical;
> > > > > > > > +	extent->fe_physical = phys;
> > > > > > > > +	extent->fe_length = len;
> > > > > > > > +	extent->fe_flags = flags;
> > > > > > > > +
> > > > > > > > +	fieinfo->fi_extents_mapped++;
> > > > > > > > +
> > > > > > > > +	if (fieinfo->fi_extents_mapped == fieinfo->fi_extents_max)
> > > > > > > > +		return 1;
> > > > > > > > +	return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
> > > > > > > > +}
> > > > > > > >  /**
> > > > > > > >   * fiemap_fill_next_extent - Fiemap helper function
> > > > > > > >   * @fieinfo:	Fiemap context passed into ->fiemap
> > > > > > > > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > > > > > > > index 7a434979201c..28bb523d532a 100644
> > > > > > > > --- a/include/linux/fs.h
> > > > > > > > +++ b/include/linux/fs.h
> > > > > > > > @@ -1711,6 +1711,8 @@ struct fiemap_extent_info {
> > > > > > > >  	fiemap_fill_cb	fi_cb;
> > > > > > > >  };
> > > > > > > >  
> > > > > > > > +int fiemap_fill_kernel_extent(struct fiemap_extent_info *info, u64 logical,
> > > > > > > > +			      u64 phys, u64 len, u32 flags);
> > > > > > > >  int fiemap_fill_next_extent(struct fiemap_extent_info *info, u64 logical,
> > > > > > > >  			    u64 phys, u64 len, u32 flags);
> > > > > > > >  int fiemap_check_flags(struct fiemap_extent_info *fieinfo, u32 fs_flags);
> > > > > > > > -- 
> > > > > > > > 2.17.2
> > > > > > > > 
> > > > > > 
> > > > > > -- 
> > > > > > Carlos
> > > > 
> > > > -- 
> > > > Carlos
> > 
> > -- 
> > Carlos

-- 
Carlos

  reply index

Thread overview: 53+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-12-05  9:17 [PATCH 00/10 V2] New ->fiemap infrastructure and ->bmap removal Carlos Maiolino
2018-12-05  9:17 ` [PATCH 01/10] fs: Enable bmap() function to properly return errors Carlos Maiolino
2018-12-05  9:17 ` [PATCH 02/10] cachefiles: drop direct usage of ->bmap method Carlos Maiolino
2018-12-05  9:17 ` [PATCH 03/10] ecryptfs: drop direct calls to ->bmap Carlos Maiolino
2018-12-05  9:17 ` [PATCH 04/10 V2] fibmap: Use bmap instead of ->bmap method in ioctl_fibmap Carlos Maiolino
2019-01-14 16:49   ` Christoph Hellwig
2019-02-04 11:34     ` Carlos Maiolino
2018-12-05  9:17 ` [PATCH 05/10] fs: Move start and length fiemap fields into fiemap_extent_info Carlos Maiolino
2019-01-14 16:50   ` Christoph Hellwig
2018-12-05  9:17 ` [PATCH 06/10] iomap: Remove length and start fields from iomap_fiemap Carlos Maiolino
2019-01-14 16:51   ` Christoph Hellwig
2018-12-05  9:17 ` [PATCH 07/10] fs: Use a void pointer to store fiemap_extent Carlos Maiolino
2019-01-14 16:53   ` Christoph Hellwig
2018-12-05  9:17 ` [PATCH 08/10 V2] fiemap: Use a callback to fill fiemap extents Carlos Maiolino
2019-01-14 16:53   ` Christoph Hellwig
2018-12-05  9:17 ` [PATCH 09/10 V2] Use FIEMAP for FIBMAP calls Carlos Maiolino
2018-12-05 17:36   ` Darrick J. Wong
2018-12-07  9:09     ` Carlos Maiolino
2018-12-07 20:14       ` Andreas Dilger
2019-02-04 15:11     ` Carlos Maiolino
2019-02-04 18:27       ` Darrick J. Wong
2019-02-06 13:37         ` Carlos Maiolino
2019-02-06 20:44           ` Darrick J. Wong
2019-02-06 21:13             ` Andreas Dilger
2019-02-07  9:52               ` Carlos Maiolino
2019-02-08  8:43                 ` Christoph Hellwig
2019-02-11 12:57                   ` Christoph Hellwig
2019-02-11 16:21                     ` Carlos Maiolino
2019-02-11 16:48                       ` Christoph Hellwig
2019-02-07 11:59             ` Carlos Maiolino
2019-02-07 17:02               ` Darrick J. Wong
2019-02-07 21:25                 ` Andreas Dilger
2019-02-08  8:46                   ` Christoph Hellwig
2019-02-08 10:36                     ` Carlos Maiolino
2019-02-08 21:03                       ` Andreas Dilger
2019-02-08  9:08                   ` Carlos Maiolino
2019-02-08  9:03                 ` Carlos Maiolino
2019-02-07 12:36             ` Carlos Maiolino
2019-02-07 18:16               ` Darrick J. Wong
2019-02-08  8:58                 ` Carlos Maiolino [this message]
2019-02-06 21:04           ` Andreas Dilger
2019-01-14 16:56   ` Christoph Hellwig
2019-02-05  9:56     ` Carlos Maiolino
2019-02-05 18:25       ` Christoph Hellwig
2019-02-06  9:50         ` Carlos Maiolino
2018-12-05  9:17 ` [PATCH 10/10] xfs: Get rid of ->bmap Carlos Maiolino
2018-12-05 17:37   ` Darrick J. Wong
2018-12-06 13:06     ` Carlos Maiolino
2018-12-06 18:56 ` [PATCH 00/10 V2] New ->fiemap infrastructure and ->bmap removal Andreas Grünbacher
2018-12-07  9:34   ` Carlos Maiolino
2019-01-14 16:50     ` Christoph Hellwig
2019-01-14 17:56       ` Andreas Grünbacher
2019-01-14 17:58         ` Christoph Hellwig

Reply instructions:

You may reply publically to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20190208085834.sfhgrn4z5wwvavoy@hades.usersys.redhat.com \
    --to=cmaiolino@redhat.com \
    --cc=adilger@dilger.ca \
    --cc=darrick.wong@oracle.com \
    --cc=david@fromorbit.com \
    --cc=hch@lst.de \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=sandeen@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Linux-Fsdevel Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-fsdevel/0 linux-fsdevel/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-fsdevel linux-fsdevel/ https://lore.kernel.org/linux-fsdevel \
		linux-fsdevel@vger.kernel.org linux-fsdevel@archiver.kernel.org
	public-inbox-index linux-fsdevel


Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-fsdevel


AGPL code for this site: git clone https://public-inbox.org/ public-inbox