From: Andreas Dilger <adilger@dilger.ca>
To: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Carlos Maiolino <cmaiolino@redhat.com>,
linux-fsdevel <linux-fsdevel@vger.kernel.org>,
Christoph Hellwig <hch@lst.de>, Eric Sandeen <sandeen@redhat.com>,
david@fromorbit.com
Subject: Re: [PATCH 09/10 V2] Use FIEMAP for FIBMAP calls
Date: Wed, 6 Feb 2019 14:13:32 -0700
Message-ID: <0258844F-A305-4744-8C70-B27A3E49ADEC@dilger.ca>
In-Reply-To: <20190206204431.GB32119@magnolia>
On Feb 6, 2019, at 1:44 PM, Darrick J. Wong <darrick.wong@oracle.com> wrote:
>
> On Wed, Feb 06, 2019 at 02:37:53PM +0100, Carlos Maiolino wrote:
>>>>> In any case, I think a better solution to the multi-device problem is to
>>>>> start returning device information via struct fiemap_extent, at least
>>>>> inside the kernel. Use one of the reserved fields to declare a new
>>>>> '__u32 fe_device' field in struct fiemap_extent which can be the dev_t
>>>>> device number, and then you can check that against inode->i_sb->s_bdev
>>>>> to avoid returning results for the non-primary device of a multi-device
>>>>> filesystem.
>>>>
>>>> I agree we should address it here, but I don't think fiemap_extent is the right
>>>> place for it, it is linked to the UAPI, and changing it is usually not a good
>>>> idea.
>>>
>>> Adding a FIEMAP_EXTENT flag or two to turn one of the fe_reserved fields
>>> into some sort of dev_t/per-device cookie should be fine. Userspace
>>> shouldn't be expecting any meaning in reserved areas.
>>>
>>>> I think I got your idea anyway, but what if, instead of returning the bdev in
>>>> fiemap_extent, we send a flag (via fi_flags) to the filesystem to
>>>> identify a FIBMAP or a FIEMAP call, and let the filesystem decide what to do
>>>> with such information?
>>>
>>> I don't like the idea of adding a FIEMAP_FLAG to distinguish callers.
>>
>> Ok, may I ask why not?
>
> I think it's a bad idea to add a flag to FIEMAP to change its behavior
> to suit an older and even crappier legacy interface (i.e. FIBMAP).
>
> FIBMAP is architecturally broken in that we can't /ever/ provide the
> context of "which device does this map to?"
>
> FIEMAP is architecturally deficient as well, but its ioctl structure
> definition is flexible enough that we can report "which device does this
> map to".
>
> I want to enhance FIEMAP to deal with multi-device filesystems
> correctly, and as much as I want to kill FIBMAP, I can't because of zipl
> and *lilo.
>
>> My apologies if I am wrong, but, per my understanding, there is
>> nothing today that tells userspace which device the extent map
>> reported by FIEMAP belongs to.
>
> Right...
>
>> If it belongs to the RT device in XFS, or whatever disk in a raid in
>> BTRFS, we simply do not provide such information.
>
> Right...
>
>> So, the goal is to provide a way to tell the filesystem if a FIEMAP or
>> a FIBMAP has been requested, so the current behavior of both ioctls
>> won't change.
>
> ...but from my point of view, the FIEMAP behavior *ought* to change to
> be more expressive. Once that's done, we can use the more expressive
> FIEMAP output to solve the problem of FIBMAP vs. multi-disk filesystems.
>
> The whole point of having fe_reserved* fields in struct fiemap_extent is
> so that we can add a new FIEMAP_EXTENT_ flag so that the filesystem can
> start returning data in a reserved field. New userspace programs that
> know about the flag can start reading information from the new field if
> they see the flag, and old userspace programs don't know about the flag
> and won't be any worse off.
Exactly correct.
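
To make the compatibility argument concrete, here is a minimal userspace
sketch of a "new" program. It assumes the field layout and the flag values
Darrick proposes below; neither exists in any released <linux/fiemap.h>, so
the struct is re-declared locally under a made-up name
(fiemap_extent_proposed), and extent_device() is a hypothetical helper, not
an existing API.

```c
#include <stdint.h>

/* Values from the proposal in this thread; NOT in any released header. */
#define FIEMAP_EXTENT_DEV_T      0x00004000u
#define FIEMAP_EXTENT_DEV_COOKIE 0x00008000u

/*
 * Local mirror of struct fiemap_extent with the proposed fe_device field
 * carved out of fe_reserved[3]. The overall size stays at 56 bytes.
 */
struct fiemap_extent_proposed {
	uint64_t fe_logical;
	uint64_t fe_physical;
	uint64_t fe_length;
	uint64_t fe_reserved64[2];
	uint32_t fe_flags;
	uint32_t fe_device;	/* previously fe_reserved[0] */
	uint32_t fe_reserved[2];
};

enum extent_dev_kind {
	EXTENT_DEV_NONE,	/* fe_device carries no meaning */
	EXTENT_DEV_DEVT,	/* fe_device is an encoded dev_t */
	EXTENT_DEV_COOKIE,	/* fe_device is an opaque per-device cookie */
};

/*
 * Classify fe_device for one extent. *dev is written only when exactly one
 * of the two flags is set; otherwise it is left untouched.
 */
static enum extent_dev_kind
extent_device(const struct fiemap_extent_proposed *fe, uint32_t *dev)
{
	uint32_t kind = fe->fe_flags &
			(FIEMAP_EXTENT_DEV_T | FIEMAP_EXTENT_DEV_COOKIE);

	if (kind == FIEMAP_EXTENT_DEV_T) {
		*dev = fe->fe_device;
		return EXTENT_DEV_DEVT;
	}
	if (kind == FIEMAP_EXTENT_DEV_COOKIE) {
		*dev = fe->fe_device;
		return EXTENT_DEV_COOKIE;
	}
	/* Neither flag (old kernel or unchanged fs) or both (invalid). */
	return EXTENT_DEV_NONE;
}
```

An old program never looks at either flag and keeps treating the field as
reserved, so nothing changes for it. Because the device is reported per
extent, the same helper would also cope with a file whose extents live on
different devices.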
>> Enabling filesystems to return device information into fiemap_extent
>> requires modification of all filesystems to provide such information,
>> which will not have any use other than matching the mounted device to
>> the device where the extent is.
>
> Perhaps it would help for me to present a more concrete proposal:
>
> --- a/include/uapi/linux/fiemap.h 2019-01-18 10:53:44.000000000 -0800
> +++ b/include/uapi/linux/fiemap.h 2019-02-06 12:25:52.813935941 -0800
> @@ -22,7 +22,19 @@ struct fiemap_extent {
>  	__u64 fe_length; /* length in bytes for this extent */
>  	__u64 fe_reserved64[2];
>  	__u32 fe_flags; /* FIEMAP_EXTENT_* flags for this extent */
> -	__u32 fe_reserved[3];
> +
> +	/*
> +	 * Underlying device that this extent is stored on.
> +	 *
> +	 * If FIEMAP_EXTENT_DEV_T is set, this field is a dev_t containing the
> +	 * major and minor numbers of a device. If FIEMAP_EXTENT_DEV_COOKIE is
> +	 * set, this field is a 32-bit cookie that can be used to distinguish
> +	 * between backing devices but has no intrinsic meaning. If neither
> +	 * EXTENT_DEV flag is set, this field is meaningless. Only one of the
> +	 * EXTENT_DEV flags may be set at any time.
> +	 */
> +	__u32 fe_device;
> +	__u32 fe_reserved[2];
>  };
>
> struct fiemap {
> @@ -66,5 +78,14 @@ struct fiemap {
> * merged for efficiency. */
> #define FIEMAP_EXTENT_SHARED 0x00002000 /* Space shared with other
> * files. */
> +#define FIEMAP_EXTENT_DEV_T 0x00004000 /* fe_device is a dev_t
> + * structure containing the
> + * major and minor numbers
> + * of a block device. */
> +#define FIEMAP_EXTENT_DEV_COOKIE 0x00008000 /* fe_device is a 32-bit
> + * cookie that can be used
> + * to distinguish physical
> + * devices but otherwise
> + * has no meaning. */
>
> #endif /* _LINUX_FIEMAP_H */
>
> Under this scheme, XFS can set FIEMAP_EXTENT_DEV_T in fe_flags and start
> encoding:
>
> fe_device = new_encode_dev(xfs_get_device_for_file());
>
> Some clustered filesystem or whatever could set FIEMAP_EXTENT_DEV_COOKIE
> and encode the replica number in fe_device.
>
> Existing filesystems can be left unchanged, in which case neither
> EXTENT_DEV flag is set in fe_flags and the bits in fe_device are
> meaningless, the same as they are today. Reporting fe_device is entirely
> optional.
I like this better than my plain "FIEMAP_EXTENT_DEVICE" proposal, since it
allows userspace to distinguish between an actual dev_t and a
unique-but-locally-meaningless identifier that is needed for network filesystems.
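
The practical difference for userspace is that a DEV_T value can be taken
apart into major and minor numbers, while a cookie can only be compared for
equality. As a rough sketch, assuming the field really is filled with
new_encode_dev() as in Darrick's XFS example, the 32-bit layout could be
handled like this. The function names are made up for illustration; only the
bit layout follows the kernel helper.

```c
#include <stdint.h>

/*
 * Userspace re-implementation of the bit layout used by the kernel's
 * new_encode_dev(): minor bits 0-7 in bits 0-7, the 12-bit major in
 * bits 8-19, and minor bits 8-19 in bits 20-31.
 */
static uint32_t encode_dev32(unsigned int major, unsigned int minor)
{
	return (minor & 0xffu) |
	       ((major & 0xfffu) << 8) |
	       ((minor & 0xfff00u) << 12);
}

/* Extract the major number from an encoded 32-bit device value. */
static unsigned int dev32_major(uint32_t dev)
{
	return (dev & 0xfff00u) >> 8;
}

/* Extract the minor number from an encoded 32-bit device value. */
static unsigned int dev32_minor(uint32_t dev)
{
	return (dev & 0xffu) | ((dev >> 12) & 0xfff00u);
}
```

None of this applies to a DEV_COOKIE value: there the only sensible
operation is checking whether two extents carry the same cookie.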
Cheers, Andreas
> Userspace programs will now be able to tell which device the file data
> lives on, which has been sort-of requested for years, if the filesystem
> chooses to start exporting that information.
>
> Your FIBMAP-via-FIEMAP backend can do something like:
>
> 	/* FIBMAP only returns results for the same block device backing the fs. */
> 	if ((fe->fe_flags & FIEMAP_EXTENT_DEV_T) &&
> 	    fe->fe_device != new_encode_dev(inode->i_sb->s_dev))
> 		return 0;
>
> 	/* Can't tell what is the backing device, bail out. */
> 	if (fe->fe_flags & FIEMAP_EXTENT_DEV_COOKIE)
> 		return 0;
>
> 	/*
> 	 * Either fe_device matches the backing device or the implementation
> 	 * doesn't tell us about the backing device, so assume it's ok.
> 	 */
> 	<return FIBMAP results>
>
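
For what it's worth, the decision above boils down to a small pure function
that can be exercised outside the kernel. This is only a sketch under the
assumptions of the proposal: the flag values are the ones suggested in this
thread, fibmap_extent_usable() is a hypothetical name, and both device
arguments are assumed to already use the same 32-bit encoding.

```c
#include <stdbool.h>
#include <stdint.h>

/* Values from the proposal in this thread; NOT in any released header. */
#define FIEMAP_EXTENT_DEV_T      0x00004000u
#define FIEMAP_EXTENT_DEV_COOKIE 0x00008000u

/*
 * Decide whether a FIBMAP-style caller may report this extent.
 * fe_device and sb_device must use the same 32-bit encoding.
 */
static bool fibmap_extent_usable(uint32_t fe_flags, uint32_t fe_device,
				 uint32_t sb_device)
{
	/* An opaque cookie cannot be matched against the fs device. */
	if (fe_flags & FIEMAP_EXTENT_DEV_COOKIE)
		return false;

	/* A real dev_t must match the device backing the filesystem. */
	if ((fe_flags & FIEMAP_EXTENT_DEV_T) && fe_device != sb_device)
		return false;

	/* Matching device, or the filesystem reports no device at all. */
	return true;
}
```

Filesystems that set neither flag fall through to the last case, so their
FIBMAP results stay exactly as they are today.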
> So that's how I'd solve a longstanding design problem of FIEMAP and then
> take advantage of that solution to remedy my objections to the proposed
> "Use FIEMAP for FIBMAP" series. It doesn't require a FIEMAP_FLAG
> behavior flag that userspace knows about but isn't allowed to pass in.
>
>> A FIEMAP_FLAG will also require FS changes, but IMHO, less intrusive
>> than the device id in fiemap_extent. I don't see much advantage in
>> adding the device id instead of using the flag.
>>
>> A problem I see using a new FIEMAP_FLAG, is it 'could' be also passed via
>> userspace, so, it would require a check to make sure it didn't come from
>> userspace if ioctl_fiemap() was used.
>>
>> I think there are 2 other possibilities which can be used to fix this.
>>
>> - Use a boolean inside fiemap_extent_info to identify a fibmap call, or,
>> - If the device id is a must for you, maybe add the device id into
>> fiemap_extent_info instead of fiemap_extent.
>
> That won't work with btrfs, which can store file extents on multiple
> different physical devices.
>
>> So we don't mess with a UAPI exported data structure and still
>> provide a way for the filesystems to report which device the mapped
>> extent is on.
>>
>> What do you think?
>>
>> Cheers
>>
>>
>>>
>>> --D
>>>
>>>>>
>>>>>> +
>>>>>> + return error;
>>>>>> +}
>>>>>> +
>>>>>> /**
>>>>>> * bmap - find a block number in a file
>>>>>> * @inode: inode owning the block number being requested
>>>>>> @@ -1594,10 +1628,14 @@ EXPORT_SYMBOL(iput);
>>>>>> */
>>>>>> int bmap(struct inode *inode, sector_t *block)
>>>>>> {
>>>>>> - if (!inode->i_mapping->a_ops->bmap)
>>>>>> + if (inode->i_op->fiemap)
>>>>>> + return bmap_fiemap(inode, block);
>>>>>> + else if (inode->i_mapping->a_ops->bmap)
>>>>>> + *block = inode->i_mapping->a_ops->bmap(inode->i_mapping,
>>>>>> + *block);
>>>>>> + else
>>>>>> return -EINVAL;
>>>>>
>>>>> Waitaminute. btrfs currently supports fiemap but not bmap, and now
>>>>> suddenly it will support this legacy interface they've never supported
>>>>> before. Are they on board with this?
>>>>>
>>>>> --D
>>>>>
>>>>>>
>>>>>> - *block = inode->i_mapping->a_ops->bmap(inode->i_mapping, *block);
>>>>>> return 0;
>>>>>> }
>>>>>> EXPORT_SYMBOL(bmap);
>>>>>> diff --git a/fs/ioctl.c b/fs/ioctl.c
>>>>>> index 6086978fe01e..bfa59df332bf 100644
>>>>>> --- a/fs/ioctl.c
>>>>>> +++ b/fs/ioctl.c
>>>>>> @@ -116,6 +116,38 @@ int fiemap_fill_user_extent(struct fiemap_extent_info *fieinfo, u64 logical,
>>>>>> return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
>>>>>> }
>>>>>>
>>>>>> +int fiemap_fill_kernel_extent(struct fiemap_extent_info *fieinfo, u64 logical,
>>>>>> + u64 phys, u64 len, u32 flags)
>>>>>> +{
>>>>>> + struct fiemap_extent *extent = fieinfo->fi_extents_start;
>>>>>> +
>>>>>> + /* only count the extents */
>>>>>> + if (fieinfo->fi_extents_max == 0) {
>>>>>> + fieinfo->fi_extents_mapped++;
>>>>>> + return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
>>>>>> + }
>>>>>> +
>>>>>> + if (fieinfo->fi_extents_mapped >= fieinfo->fi_extents_max)
>>>>>> + return 1;
>>>>>> +
>>>>>> + if (flags & SET_UNKNOWN_FLAGS)
>>>>>> + flags |= FIEMAP_EXTENT_UNKNOWN;
>>>>>> + if (flags & SET_NO_UNMOUNTED_IO_FLAGS)
>>>>>> + flags |= FIEMAP_EXTENT_ENCODED;
>>>>>> + if (flags & SET_NOT_ALIGNED_FLAGS)
>>>>>> + flags |= FIEMAP_EXTENT_NOT_ALIGNED;
>>>>>> +
>>>>>> + extent->fe_logical = logical;
>>>>>> + extent->fe_physical = phys;
>>>>>> + extent->fe_length = len;
>>>>>> + extent->fe_flags = flags;
>>>>>> +
>>>>>> + fieinfo->fi_extents_mapped++;
>>>>>> +
>>>>>> + if (fieinfo->fi_extents_mapped == fieinfo->fi_extents_max)
>>>>>> + return 1;
>>>>>> + return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
>>>>>> +}
>>>>>> /**
>>>>>> * fiemap_fill_next_extent - Fiemap helper function
>>>>>> * @fieinfo: Fiemap context passed into ->fiemap
>>>>>> diff --git a/include/linux/fs.h b/include/linux/fs.h
>>>>>> index 7a434979201c..28bb523d532a 100644
>>>>>> --- a/include/linux/fs.h
>>>>>> +++ b/include/linux/fs.h
>>>>>> @@ -1711,6 +1711,8 @@ struct fiemap_extent_info {
>>>>>> fiemap_fill_cb fi_cb;
>>>>>> };
>>>>>>
>>>>>> +int fiemap_fill_kernel_extent(struct fiemap_extent_info *info, u64 logical,
>>>>>> + u64 phys, u64 len, u32 flags);
>>>>>> int fiemap_fill_next_extent(struct fiemap_extent_info *info, u64 logical,
>>>>>> u64 phys, u64 len, u32 flags);
>>>>>> int fiemap_check_flags(struct fiemap_extent_info *fieinfo, u32 fs_flags);
>>>>>> --
>>>>>> 2.17.2
>>>>>>
>>>>
>>>> --
>>>> Carlos
>>
>> --
>> Carlos
Cheers, Andreas