Linux-Fsdevel Archive on lore.kernel.org
 help / color / Atom feed
From: Carlos Maiolino <cmaiolino@redhat.com>
To: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: linux-fsdevel@vger.kernel.org, hch@lst.de, adilger@dilger.ca,
	sandeen@redhat.com, david@fromorbit.com
Subject: Re: [PATCH 09/10 V2] Use FIEMAP for FIBMAP calls
Date: Fri, 8 Feb 2019 10:03:40 +0100
Message-ID: <20190208090340.tfyu3gjvgztr26et@hades.usersys.redhat.com> (raw)
In-Reply-To: <20190207170210.GB27972@magnolia>

On Thu, Feb 07, 2019 at 09:02:10AM -0800, Darrick J. Wong wrote:
> On Thu, Feb 07, 2019 at 12:59:54PM +0100, Carlos Maiolino wrote:
> > On Wed, Feb 06, 2019 at 12:44:31PM -0800, Darrick J. Wong wrote:
> > > On Wed, Feb 06, 2019 at 02:37:53PM +0100, Carlos Maiolino wrote:
> > > > > > > In any case, I think a better solution to the multi-device problem is to
> > > > > > > start returning device information via struct fiemap_extent, at least
> > > > > > > inside the kernel.  Use one of the reserved fields to declare a new
> > > > > > > '__u32 fe_device' field in struct fiemap_extent which can be the dev_t
> > > > > > > device number, and then you can check that against inode->i_sb->s_bdev
> > > > > > > to avoid returning results for the non-primary device of a multi-device
> > > > > > > filesystem.
> > > > > > 
> > > > > > I agree we should address it here, but I don't think fiemap_extent is the right
> > > > > > place for it, it is linked to the UAPI, and changing it is usually not a good
> > > > > > idea.
> > > > > 
> > > > > Adding a FIEMAP_EXTENT flag or two to turn one of the fe_reserved fields
> > > > > into some sort of dev_t/per-device cookie should be fine.  Userspace
> > > > > shouldn't be expecting any meaning in reserved areas.
> > > > > 
> > > > > > I think I got your idea anyway, but, what if, instead returning the bdev in
> > > > > > fiemap_extent, we instead, send a flag (via fi_flags) to the filesystem, to
> > > > > > idenfify a FIBMAP or a FIEMAP call, and let the filesystem decide what to do
> > > > > > with such information?
> > > > > 
> > > > > I don't like the idea of adding a FIEMAP_FLAG to distinguish callers.
> > > > 
> > > > Ok, may I ask why not?
> > > 
> > > I think it's a bad idea to add a flag to FIEMAP to change its behavior
> > > to suit an older and even crappier legacy interface (i.e. FIBMAP).
> > > 
> > > FIBMAP is architecturally broken in that we can't /ever/ provide the
> > > context of "which device does this map to?"
> > > 
> > > FIEMAP is architecturally deficient as well, but its ioctl structure
> > > definition is flexible enough that we can report "which device does this
> > > map to".
> > > 
> > > I want to enhance FIEMAP to deal with multi-device filesystems
> > > correctly, and as much as I want to kill FIBMAP, I can't because of zipl
> > > and *lilo.
> > > 
> > > > My apologies if I am wrong, but, per my understanding, there is
> > > > nothing today, which tells userspace which device belongs the extent
> > > > map reported by FIEMAP.
> > > 
> > > Right...
> > > 
> > > > If it belongs to the RT device in XFS, or whatever disk in a raid in
> > > > BTRFS, we simply do not provide such information.
> > > 
> > > Right...
> > > 
> > > > So, the goal is to provide a way to tell the filesystem if a FIEMAP or
> > > > a FIBMAP has been requested, so the current behavior of both ioctls
> > > > won't change.
> > > 
> > > ...but from my point of view, the FIEMAP behavior *ought* to change to
> > > be more expressive.  Once that's done, we can use the more expressive
> > > FIEMAP output to solve the problem of FIBMAP vs. multi-disk filesystems.
> > > 
> > > The whole point of having fe_reserved* fields in struct fiemap_extent is
> > > so that we can add a new FIEMAP_EXTENT_ flag so that the filesystem can
> > > start returning data in a reserved field.  New userspace programs that
> > > know about the flag can start reading information from the new field if
> > > they see the flag, and old userspace programs don't know about the flag
> > > and won't be any worse off.
> > > 
> > > > Enabling filesystems to return device information into fiemap_extent
> > > > requires modification of all filesystems to provide such information,
> > > > which will not have any use other than matching the mounted device to
> > > > the device where the extent is.
> > > 
> > > Perhaps it would help for me to present a more concrete proposal:
> > > 
> > > --- a/include/uapi/linux/fiemap.h	2019-01-18 10:53:44.000000000 -0800
> > > +++ b/include/uapi/linux/fiemap.h	2019-02-06 12:25:52.813935941 -0800
> > > @@ -22,7 +22,19 @@ struct fiemap_extent {
> > >  	__u64 fe_length;   /* length in bytes for this extent */
> > >  	__u64 fe_reserved64[2];
> > >  	__u32 fe_flags;    /* FIEMAP_EXTENT_* flags for this extent */
> > > -	__u32 fe_reserved[3];
> > > +
> > > +	/*
> > > +	 * Underlying device that this extent is stored on.
> > > +	 *
> > > +	 * If FIEMAP_EXTENT_DEV_T is set, this field is a dev_t containing the
> > > +	 * major and minor numbers of a device.  If FIEMAP_EXTENT_DEV_COOKIE is
> > > +	 * set, this field is a 32-bit cookie that can be used to distinguish
> > > +	 * between backing devices but has no intrinsic meaning.  If neither
> > > +	 * EXTENT_DEV flag is set, this field is meaningless.  Only one of the
> > > +	 * EXTENT_DEV flags may be set at any time.
> > > +	 */
> > > +	__u32 fe_device;
> > > +	__u32 fe_reserved[2];
> > >  };
> > >  
> > >  struct fiemap {
> > > @@ -66,5 +78,14 @@ struct fiemap {
> > >  						    * merged for efficiency. */
> > >  #define FIEMAP_EXTENT_SHARED		0x00002000 /* Space shared with other
> > >  						    * files. */
> > > +#define FIEMAP_EXTENT_DEV_T		0x00004000 /* fe_device is a dev_t
> > > +						    * structure containing the
> > > +						    * major and minor numbers
> > > +						    * of a block device. */
> > > +#define FIEMAP_EXTENT_DEV_COOKIE	0x00008000 /* fe_device is a 32-bit
> > > +						    * cookie that can be used
> > > +						    * to distinguish physical
> > > +						    * devices but otherwise
> > > +						    * has no meaning. */
> > >  
> > >  #endif /* _LINUX_FIEMAP_H */
> > > 
> > > Under this scheme, XFS can set FIEMAP_EXTENT_DEV_T in fe_flags and start
> > > encoding fe_device = new_encode_dev(xfs_get_device_for_file()).
> > > 
> > > Some clustered filesystem or whatever could set FIEMAP_EXTENT_DEV_COOKIE
> > > and encode the replica number in fe_device.
> > > 
> > 
> > All of this makes sense, but I'm struggling to understand what you mean by
> > replica number here, and why it justify a second flag.
> 
> I left in the "device cookie" thing in the proposal to accomodate a
> request from the Lustre folks to be able to report which replica is
> storing a particular extent map.  Apparently the replica id is simply a
> 32-bit number that isn't inherently useful, hence the vagueness around
> what "cookie" really means...
> 
> ...oh, right, lustre fell out of drivers/staging/.  You could probably
> leave it out then.
> 
> > > Existing filesystems can be left unchanged, in which case neither
> > > EXTENT_DEV flag is set in fe_flags and the bits in fe_device are
> > > meaningless, the same as they are today.  Reporting fe_device is entirely
> > > optional.
> > > 
> > > Userspace programs will now be able to tell which device the file data
> > > lives on, which has been sort-of requested for years, if the filesystem
> > > chooses to start exporting that information.
> > > 
> > > Your FIBMAP-via-FIEMAP backend can do something like:
> > > 
> > > /* FIBMAP only returns results for the same block device backing the fs. */
> > > if ((fe->fe_flags & EXTENT_DEV_T) && fe->fe_device != inode->i_sb->sb_device)
> > > 	return 0;
> > > 
> > > /* Can't tell what is the backing device, bail out. */
> > > if (fe->fe_flags & EXTENT_DEV_COOKIE)
> > > 	return 0;
> > > 
> > 
> > Ok, the first conditional, is ok, the second one is not making sense to me.
> > Looks like you are basically using it to flag the filesystem can't tell
> > exactly which device the current extent is, let's say for example, distributed
> > filesystems, where the physical extent can actually be on a different machine.
> > But I can't say for sure, can you give me more details about what you are trying
> > to achieve here?
> 
> You've understood me correctly. :)
> 
> > > /*
> > >  * Either fe_device matches the backing device or the implementation
> > >  * doesn't tell us about the backing device, so assume it's ok.
> > >  */
> > > <return FIBMAP results>
> > >
> > 
> > This actually looks to contradict what you have been complaining, about some
> > filesystems which doesn't support FIBMAP currently, will now suddenly start to
> > support. Assuming it's ok if the implementation doesn't tell us about the
> > backing device, will simply make FIBMAP work. Let's say BTRFS doesn't report the
> > backing device, assuming it's ok will just fall into your first complain.
> 
> Sorry, this thread has been going on so long that I forgot your goal for
> this series. :/
> 
> Specifically, I had forgotten that you're removing the ->bmap pointer,
> which means that filesystems don't have any particular way to signal
> "Yes on FIEMAP, no on FIBMAP".  Somehow I had thought that you were
> merely creating a generic_file_bmap() that would call FIEMAP and ripping
> out all the adhoc bmap implementations.
> 
> Hmm, how many filesystems support FIEMAP and not FIBMAP?
> 
> btrfs, nilfs2, and overlayfs.  Also bad_inode.c...?
> 
> Hmm, how many filesystems support FIBMAP and not FIEMAP?
> 
> adfs, affs, befs, bfs, efs, exofs, fat, freevxfs, fuse(?), nfs, hfsplus,
> isofs, jfs, minixfs, ntfs, qnx[46], reiserfs, sysv, udf, and ufs.

Eh, that's why we should keep:

if (inode->i_op->fiemap)
	return bmap_fiemap(inode, block);
else if (..a_ops->bmap)
	->a_ops->bmap(...)
else
	return -EINVAL;


> 
> > Anyway, I think I need to understand more your usage idea for EXTENT_DEV_COOKIE
> > you mentioned.
> 
> I think you've understood it about as well as I can explain it.  Maybe
> Andreas will have more to say about the lustre replica id, but OTOH it's
> gone and so there's no user of it, so we could just drop it until lustre
> comes back.
>

Ok, thanks for confirming.

> > > So that's how I'd solve a longstanding design problem of FIEMAP and then
> > > take advantage of that solution to remedy my objections to the proposed
> > > "Use FIEMAP for FIBMAP" series.  It doesn't require a FIEMAP_FLAG
> > > behavior flag that userspace knows about but isn't allowed to pass in.
> > >
> > 
> > > > A FIEMAP_FLAG will also require FS changes, but IMHO, less intrusive
> > > > than the device id in fiemap_extent. I don't see much advantage in
> > > > adding the device id instead of using the flag.
> > > > 
> > > > A problem I see using a new FIEMAP_FLAG, is it 'could' be also passed via
> > > > userspace, so, it would require a check to make sure it didn't come from
> > > > userspace if ioctl_fiemap() was used.
> > > > 
> > > > I think there are 2 other possibilities which can be used to fix this.
> > > > 
> > > > - Use a boolean inside fiemap_extent_info to identify a fibmap call, or,
> > > > - If the device id is a must for you, maybe add the device id into
> > > >   fiemap_extent_info instead of fiemap_extent.
> > > 
> > > That won't work with btrfs, which can store file extents on multiple
> > > different physical devices.
> > > 
> > > >   So we don't mess with a UAPI exported data structure and still
> > > >   provides a way to the filesystems to provide which device the mapped
> > > >   extent is in.
> > > > 
> > > > What you think?
> > > > 
> > > > Cheers
> > > > 
> > > > 
> > > > > 
> > > > > --D
> > > > > 
> > > > > > > 
> > > > > > > > +
> > > > > > > > +	return error;
> > > > > > > > +}
> > > > > > > > +
> > > > > > > >  /**
> > > > > > > >   *	bmap	- find a block number in a file
> > > > > > > >   *	@inode:  inode owning the block number being requested
> > > > > > > > @@ -1594,10 +1628,14 @@ EXPORT_SYMBOL(iput);
> > > > > > > >   */
> > > > > > > >  int bmap(struct inode *inode, sector_t *block)
> > > > > > > >  {
> > > > > > > > -	if (!inode->i_mapping->a_ops->bmap)
> > > > > > > > +	if (inode->i_op->fiemap)
> > > > > > > > +		return bmap_fiemap(inode, block);
> > > > > > > > +	else if (inode->i_mapping->a_ops->bmap)
> > > > > > > > +		*block = inode->i_mapping->a_ops->bmap(inode->i_mapping,
> > > > > > > > +						       *block);
> > > > > > > > +	else
> > > > > > > >  		return -EINVAL;
> > > > > > > 
> > > > > > > Waitaminute.  btrfs currently supports fiemap but not bmap, and now
> > > > > > > suddenly it will support this legacy interface they've never supported
> > > > > > > before.  Are they on board with this?
> > > > > > > 
> > > > > > > --D
> > > > > > > 
> > > > > > > >  
> > > > > > > > -	*block = inode->i_mapping->a_ops->bmap(inode->i_mapping, *block);
> > > > > > > >  	return 0;
> > > > > > > >  }
> > > > > > > >  EXPORT_SYMBOL(bmap);
> > > > > > > > diff --git a/fs/ioctl.c b/fs/ioctl.c
> > > > > > > > index 6086978fe01e..bfa59df332bf 100644
> > > > > > > > --- a/fs/ioctl.c
> > > > > > > > +++ b/fs/ioctl.c
> > > > > > > > @@ -116,6 +116,38 @@ int fiemap_fill_user_extent(struct fiemap_extent_info *fieinfo, u64 logical,
> > > > > > > >  	return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
> > > > > > > >  }
> > > > > > > >  
> > > > > > > > +int fiemap_fill_kernel_extent(struct fiemap_extent_info *fieinfo, u64 logical,
> > > > > > > > +			    u64 phys, u64 len, u32 flags)
> > > > > > > > +{
> > > > > > > > +	struct fiemap_extent *extent = fieinfo->fi_extents_start;
> > > > > > > > +
> > > > > > > > +	/* only count the extents */
> > > > > > > > +	if (fieinfo->fi_extents_max == 0) {
> > > > > > > > +		fieinfo->fi_extents_mapped++;
> > > > > > > > +		return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
> > > > > > > > +	}
> > > > > > > > +
> > > > > > > > +	if (fieinfo->fi_extents_mapped >= fieinfo->fi_extents_max)
> > > > > > > > +		return 1;
> > > > > > > > +
> > > > > > > > +	if (flags & SET_UNKNOWN_FLAGS)
> > > > > > > > +		flags |= FIEMAP_EXTENT_UNKNOWN;
> > > > > > > > +	if (flags & SET_NO_UNMOUNTED_IO_FLAGS)
> > > > > > > > +		flags |= FIEMAP_EXTENT_ENCODED;
> > > > > > > > +	if (flags & SET_NOT_ALIGNED_FLAGS)
> > > > > > > > +		flags |= FIEMAP_EXTENT_NOT_ALIGNED;
> > > > > > > > +
> > > > > > > > +	extent->fe_logical = logical;
> > > > > > > > +	extent->fe_physical = phys;
> > > > > > > > +	extent->fe_length = len;
> > > > > > > > +	extent->fe_flags = flags;
> > > > > > > > +
> > > > > > > > +	fieinfo->fi_extents_mapped++;
> > > > > > > > +
> > > > > > > > +	if (fieinfo->fi_extents_mapped == fieinfo->fi_extents_max)
> > > > > > > > +		return 1;
> > > > > > > > +	return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
> > > > > > > > +}
> > > > > > > >  /**
> > > > > > > >   * fiemap_fill_next_extent - Fiemap helper function
> > > > > > > >   * @fieinfo:	Fiemap context passed into ->fiemap
> > > > > > > > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > > > > > > > index 7a434979201c..28bb523d532a 100644
> > > > > > > > --- a/include/linux/fs.h
> > > > > > > > +++ b/include/linux/fs.h
> > > > > > > > @@ -1711,6 +1711,8 @@ struct fiemap_extent_info {
> > > > > > > >  	fiemap_fill_cb	fi_cb;
> > > > > > > >  };
> > > > > > > >  
> > > > > > > > +int fiemap_fill_kernel_extent(struct fiemap_extent_info *info, u64 logical,
> > > > > > > > +			      u64 phys, u64 len, u32 flags);
> > > > > > > >  int fiemap_fill_next_extent(struct fiemap_extent_info *info, u64 logical,
> > > > > > > >  			    u64 phys, u64 len, u32 flags);
> > > > > > > >  int fiemap_check_flags(struct fiemap_extent_info *fieinfo, u32 fs_flags);
> > > > > > > > -- 
> > > > > > > > 2.17.2
> > > > > > > > 
> > > > > > 
> > > > > > -- 
> > > > > > Carlos
> > > > 
> > > > -- 
> > > > Carlos
> > 
> > -- 
> > Carlos

-- 
Carlos

  parent reply index

Thread overview: 53+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-12-05  9:17 [PATCH 00/10 V2] New ->fiemap infrastructure and ->bmap removal Carlos Maiolino
2018-12-05  9:17 ` [PATCH 01/10] fs: Enable bmap() function to properly return errors Carlos Maiolino
2018-12-05  9:17 ` [PATCH 02/10] cachefiles: drop direct usage of ->bmap method Carlos Maiolino
2018-12-05  9:17 ` [PATCH 03/10] ecryptfs: drop direct calls to ->bmap Carlos Maiolino
2018-12-05  9:17 ` [PATCH 04/10 V2] fibmap: Use bmap instead of ->bmap method in ioctl_fibmap Carlos Maiolino
2019-01-14 16:49   ` Christoph Hellwig
2019-02-04 11:34     ` Carlos Maiolino
2018-12-05  9:17 ` [PATCH 05/10] fs: Move start and length fiemap fields into fiemap_extent_info Carlos Maiolino
2019-01-14 16:50   ` Christoph Hellwig
2018-12-05  9:17 ` [PATCH 06/10] iomap: Remove length and start fields from iomap_fiemap Carlos Maiolino
2019-01-14 16:51   ` Christoph Hellwig
2018-12-05  9:17 ` [PATCH 07/10] fs: Use a void pointer to store fiemap_extent Carlos Maiolino
2019-01-14 16:53   ` Christoph Hellwig
2018-12-05  9:17 ` [PATCH 08/10 V2] fiemap: Use a callback to fill fiemap extents Carlos Maiolino
2019-01-14 16:53   ` Christoph Hellwig
2018-12-05  9:17 ` [PATCH 09/10 V2] Use FIEMAP for FIBMAP calls Carlos Maiolino
2018-12-05 17:36   ` Darrick J. Wong
2018-12-07  9:09     ` Carlos Maiolino
2018-12-07 20:14       ` Andreas Dilger
2019-02-04 15:11     ` Carlos Maiolino
2019-02-04 18:27       ` Darrick J. Wong
2019-02-06 13:37         ` Carlos Maiolino
2019-02-06 20:44           ` Darrick J. Wong
2019-02-06 21:13             ` Andreas Dilger
2019-02-07  9:52               ` Carlos Maiolino
2019-02-08  8:43                 ` Christoph Hellwig
2019-02-11 12:57                   ` Christoph Hellwig
2019-02-11 16:21                     ` Carlos Maiolino
2019-02-11 16:48                       ` Christoph Hellwig
2019-02-07 11:59             ` Carlos Maiolino
2019-02-07 17:02               ` Darrick J. Wong
2019-02-07 21:25                 ` Andreas Dilger
2019-02-08  8:46                   ` Christoph Hellwig
2019-02-08 10:36                     ` Carlos Maiolino
2019-02-08 21:03                       ` Andreas Dilger
2019-02-08  9:08                   ` Carlos Maiolino
2019-02-08  9:03                 ` Carlos Maiolino [this message]
2019-02-07 12:36             ` Carlos Maiolino
2019-02-07 18:16               ` Darrick J. Wong
2019-02-08  8:58                 ` Carlos Maiolino
2019-02-06 21:04           ` Andreas Dilger
2019-01-14 16:56   ` Christoph Hellwig
2019-02-05  9:56     ` Carlos Maiolino
2019-02-05 18:25       ` Christoph Hellwig
2019-02-06  9:50         ` Carlos Maiolino
2018-12-05  9:17 ` [PATCH 10/10] xfs: Get rid of ->bmap Carlos Maiolino
2018-12-05 17:37   ` Darrick J. Wong
2018-12-06 13:06     ` Carlos Maiolino
2018-12-06 18:56 ` [PATCH 00/10 V2] New ->fiemap infrastructure and ->bmap removal Andreas Grünbacher
2018-12-07  9:34   ` Carlos Maiolino
2019-01-14 16:50     ` Christoph Hellwig
2019-01-14 17:56       ` Andreas Grünbacher
2019-01-14 17:58         ` Christoph Hellwig

Reply instructions:

You may reply publically to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20190208090340.tfyu3gjvgztr26et@hades.usersys.redhat.com \
    --to=cmaiolino@redhat.com \
    --cc=adilger@dilger.ca \
    --cc=darrick.wong@oracle.com \
    --cc=david@fromorbit.com \
    --cc=hch@lst.de \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=sandeen@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Linux-Fsdevel Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-fsdevel/0 linux-fsdevel/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-fsdevel linux-fsdevel/ https://lore.kernel.org/linux-fsdevel \
		linux-fsdevel@vger.kernel.org linux-fsdevel@archiver.kernel.org
	public-inbox-index linux-fsdevel


Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-fsdevel


AGPL code for this site: git clone https://public-inbox.org/ public-inbox