On Feb 6, 2019, at 1:44 PM, Darrick J. Wong wrote:
>
> On Wed, Feb 06, 2019 at 02:37:53PM +0100, Carlos Maiolino wrote:
>>>>> In any case, I think a better solution to the multi-device problem is to
>>>>> start returning device information via struct fiemap_extent, at least
>>>>> inside the kernel. Use one of the reserved fields to declare a new
>>>>> '__u32 fe_device' field in struct fiemap_extent which can be the dev_t
>>>>> device number, and then you can check that against inode->i_sb->s_bdev
>>>>> to avoid returning results for the non-primary device of a multi-device
>>>>> filesystem.
>>>>
>>>> I agree we should address it here, but I don't think fiemap_extent is the right
>>>> place for it, it is linked to the UAPI, and changing it is usually not a good
>>>> idea.
>>>
>>> Adding a FIEMAP_EXTENT flag or two to turn one of the fe_reserved fields
>>> into some sort of dev_t/per-device cookie should be fine. Userspace
>>> shouldn't be expecting any meaning in reserved areas.
>>>
>>>> I think I got your idea anyway, but, what if, instead of returning the bdev in
>>>> fiemap_extent, we instead, send a flag (via fi_flags) to the filesystem, to
>>>> identify a FIBMAP or a FIEMAP call, and let the filesystem decide what to do
>>>> with such information?
>>>
>>> I don't like the idea of adding a FIEMAP_FLAG to distinguish callers.
>>
>> Ok, may I ask why not?
>
> I think it's a bad idea to add a flag to FIEMAP to change its behavior
> to suit an older and even crappier legacy interface (i.e. FIBMAP).
>
> FIBMAP is architecturally broken in that we can't /ever/ provide the
> context of "which device does this map to?"
>
> FIEMAP is architecturally deficient as well, but its ioctl structure
> definition is flexible enough that we can report "which device does this
> map to".
>
> I want to enhance FIEMAP to deal with multi-device filesystems
> correctly, and as much as I want to kill FIBMAP, I can't because of zipl
> and *lilo.
>
>> My apologies if I am wrong, but, per my understanding, there is
>> nothing today, which tells userspace which device belongs the extent
>> map reported by FIEMAP.
>
> Right...
>
>> If it belongs to the RT device in XFS, or whatever disk in a raid in
>> BTRFS, we simply do not provide such information.
>
> Right...
>
>> So, the goal is to provide a way to tell the filesystem if a FIEMAP or
>> a FIBMAP has been requested, so the current behavior of both ioctls
>> won't change.
>
> ...but from my point of view, the FIEMAP behavior *ought* to change to
> be more expressive. Once that's done, we can use the more expressive
> FIEMAP output to solve the problem of FIBMAP vs. multi-disk filesystems.
>
> The whole point of having fe_reserved* fields in struct fiemap_extent is
> so that we can add a new FIEMAP_EXTENT_ flag so that the filesystem can
> start returning data in a reserved field. New userspace programs that
> know about the flag can start reading information from the new field if
> they see the flag, and old userspace programs don't know about the flag
> and won't be any worse off.

Exactly correct.

>> Enabling filesystems to return device information into fiemap_extent
>> requires modification of all filesystems to provide such information,
>> which will not have any use other than matching the mounted device to
>> the device where the extent is.
>
> Perhaps it would help for me to present a more concrete proposal:
>
> --- a/include/uapi/linux/fiemap.h 2019-01-18 10:53:44.000000000 -0800
> +++ b/include/uapi/linux/fiemap.h 2019-02-06 12:25:52.813935941 -0800
> @@ -22,7 +22,19 @@ struct fiemap_extent {
>          __u64 fe_length; /* length in bytes for this extent */
>          __u64 fe_reserved64[2];
>          __u32 fe_flags; /* FIEMAP_EXTENT_* flags for this extent */
> -        __u32 fe_reserved[3];
> +
> +        /*
> +         * Underlying device that this extent is stored on.
> +         *
> +         * If FIEMAP_EXTENT_DEV_T is set, this field is a dev_t containing the
> +         * major and minor numbers of a device.
> +         * If FIEMAP_EXTENT_DEV_COOKIE is
> +         * set, this field is a 32-bit cookie that can be used to distinguish
> +         * between backing devices but has no intrinsic meaning. If neither
> +         * EXTENT_DEV flag is set, this field is meaningless. Only one of the
> +         * EXTENT_DEV flags may be set at any time.
> +         */
> +        __u32 fe_device;
> +        __u32 fe_reserved[2];
>  };
>
>  struct fiemap {
> @@ -66,5 +78,14 @@ struct fiemap {
>                                                      * merged for efficiency. */
>  #define FIEMAP_EXTENT_SHARED            0x00002000 /* Space shared with other
>                                                      * files. */
> +#define FIEMAP_EXTENT_DEV_T             0x00004000 /* fe_device is a dev_t
> +                                                    * structure containing the
> +                                                    * major and minor numbers
> +                                                    * of a block device. */
> +#define FIEMAP_EXTENT_DEV_COOKIE        0x00008000 /* fe_device is a 32-bit
> +                                                    * cookie that can be used
> +                                                    * to distinguish physical
> +                                                    * devices but otherwise
> +                                                    * has no meaning. */
>
>  #endif /* _LINUX_FIEMAP_H */
>
> Under this scheme, XFS can set FIEMAP_EXTENT_DEV_T in fe_flags and start
> encoding:
>
>         fe_device = new_encode_dev(xfs_get_device_for_file());
>
> Some clustered filesystem or whatever could set FIEMAP_EXTENT_DEV_COOKIE
> and encode the replica number in fe_device.
>
> Existing filesystems can be left unchanged, in which case neither
> EXTENT_DEV flag is set in fe_flags and the bits in fe_device are
> meaningless, the same as they are today. Reporting fe_device is entirely
> optional.

I like this better than my plain "FIEMAP_EXTENT_DEVICE" proposal, since it
allows userspace to distinguish between an actual dev_t and a
unique-but-locally-meaningless identifier that is needed for network
filesystems.

Cheers, Andreas

> Userspace programs will now be able to tell which device the file data
> lives on, which has been sort-of requested for years, if the filesystem
> chooses to start exporting that information.
>
> Your FIBMAP-via-FIEMAP backend can do something like:
>
>         /* FIBMAP only returns results for the same block device backing the fs.
>          */
>         if ((fe->fe_flags & EXTENT_DEV_T) && fe->fe_device != inode->i_sb->sb_device)
>                 return 0;
>
>         /* Can't tell what is the backing device, bail out. */
>         if (fe->fe_flags & EXTENT_DEV_COOKIE)
>                 return 0;
>
>         /*
>          * Either fe_device matches the backing device or the implementation
>          * doesn't tell us about the backing device, so assume it's ok.
>          */
>
>
> So that's how I'd solve a longstanding design problem of FIEMAP and then
> take advantage of that solution to remedy my objections to the proposed
> "Use FIEMAP for FIBMAP" series. It doesn't require a FIEMAP_FLAG
> behavior flag that userspace knows about but isn't allowed to pass in.
>
>> A FIEMAP_FLAG will also require FS changes, but IMHO, less intrusive
>> than the device id in fiemap_extent. I don't see much advantage in
>> adding the device id instead of using the flag.
>>
>> A problem I see using a new FIEMAP_FLAG, is it 'could' be also passed via
>> userspace, so, it would require a check to make sure it didn't come from
>> userspace if ioctl_fiemap() was used.
>>
>> I think there are 2 other possibilities which can be used to fix this.
>>
>> - Use a boolean inside fiemap_extent_info to identify a fibmap call, or,
>> - If the device id is a must for you, maybe add the device id into
>>   fiemap_extent_info instead of fiemap_extent.
>
> That won't work with btrfs, which can store file extents on multiple
> different physical devices.
>
>> So we don't mess with a UAPI exported data structure and still
>> provides a way to the filesystems to provide which device the mapped
>> extent is in.
>>
>> What you think?
>>
>> Cheers
>>
>>
>>>
>>> --D
>>>
>>>>>
>>>>>> +
>>>>>> +        return error;
>>>>>> +}
>>>>>> +
>>>>>>  /**
>>>>>>   * bmap - find a block number in a file
>>>>>>   * @inode: inode owning the block number being requested
>>>>>> @@ -1594,10 +1628,14 @@ EXPORT_SYMBOL(iput);
>>>>>>   */
>>>>>>  int bmap(struct inode *inode, sector_t *block)
>>>>>>  {
>>>>>> -        if (!inode->i_mapping->a_ops->bmap)
>>>>>> +        if (inode->i_op->fiemap)
>>>>>> +                return bmap_fiemap(inode, block);
>>>>>> +        else if (inode->i_mapping->a_ops->bmap)
>>>>>> +                *block = inode->i_mapping->a_ops->bmap(inode->i_mapping,
>>>>>> +                                                       *block);
>>>>>> +        else
>>>>>>                  return -EINVAL;
>>>>>
>>>>> Waitaminute. btrfs currently supports fiemap but not bmap, and now
>>>>> suddenly it will support this legacy interface they've never supported
>>>>> before. Are they on board with this?
>>>>>
>>>>> --D
>>>>>
>>>>>>
>>>>>> -        *block = inode->i_mapping->a_ops->bmap(inode->i_mapping, *block);
>>>>>>          return 0;
>>>>>>  }
>>>>>>  EXPORT_SYMBOL(bmap);
>>>>>> diff --git a/fs/ioctl.c b/fs/ioctl.c
>>>>>> index 6086978fe01e..bfa59df332bf 100644
>>>>>> --- a/fs/ioctl.c
>>>>>> +++ b/fs/ioctl.c
>>>>>> @@ -116,6 +116,38 @@ int fiemap_fill_user_extent(struct fiemap_extent_info *fieinfo, u64 logical,
>>>>>>          return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
>>>>>>  }
>>>>>>
>>>>>> +int fiemap_fill_kernel_extent(struct fiemap_extent_info *fieinfo, u64 logical,
>>>>>> +                              u64 phys, u64 len, u32 flags)
>>>>>> +{
>>>>>> +        struct fiemap_extent *extent = fieinfo->fi_extents_start;
>>>>>> +
>>>>>> +        /* only count the extents */
>>>>>> +        if (fieinfo->fi_extents_max == 0) {
>>>>>> +                fieinfo->fi_extents_mapped++;
>>>>>> +                return (flags & FIEMAP_EXTENT_LAST) ?
>>>>>> +                        1 : 0;
>>>>>> +        }
>>>>>> +
>>>>>> +        if (fieinfo->fi_extents_mapped >= fieinfo->fi_extents_max)
>>>>>> +                return 1;
>>>>>> +
>>>>>> +        if (flags & SET_UNKNOWN_FLAGS)
>>>>>> +                flags |= FIEMAP_EXTENT_UNKNOWN;
>>>>>> +        if (flags & SET_NO_UNMOUNTED_IO_FLAGS)
>>>>>> +                flags |= FIEMAP_EXTENT_ENCODED;
>>>>>> +        if (flags & SET_NOT_ALIGNED_FLAGS)
>>>>>> +                flags |= FIEMAP_EXTENT_NOT_ALIGNED;
>>>>>> +
>>>>>> +        extent->fe_logical = logical;
>>>>>> +        extent->fe_physical = phys;
>>>>>> +        extent->fe_length = len;
>>>>>> +        extent->fe_flags = flags;
>>>>>> +
>>>>>> +        fieinfo->fi_extents_mapped++;
>>>>>> +
>>>>>> +        if (fieinfo->fi_extents_mapped == fieinfo->fi_extents_max)
>>>>>> +                return 1;
>>>>>> +        return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
>>>>>> +}
>>>>>>  /**
>>>>>>   * fiemap_fill_next_extent - Fiemap helper function
>>>>>>   * @fieinfo: Fiemap context passed into ->fiemap
>>>>>> diff --git a/include/linux/fs.h b/include/linux/fs.h
>>>>>> index 7a434979201c..28bb523d532a 100644
>>>>>> --- a/include/linux/fs.h
>>>>>> +++ b/include/linux/fs.h
>>>>>> @@ -1711,6 +1711,8 @@ struct fiemap_extent_info {
>>>>>>          fiemap_fill_cb fi_cb;
>>>>>>  };
>>>>>>
>>>>>> +int fiemap_fill_kernel_extent(struct fiemap_extent_info *info, u64 logical,
>>>>>> +                              u64 phys, u64 len, u32 flags);
>>>>>>  int fiemap_fill_next_extent(struct fiemap_extent_info *info, u64 logical,
>>>>>>                              u64 phys, u64 len, u32 flags);
>>>>>>  int fiemap_check_flags(struct fiemap_extent_info *fieinfo, u32 fs_flags);
>>>>>> --
>>>>>> 2.17.2
>>>>>>
>>>>
>>>> --
>>>> Carlos
>>
>> --
>> Carlos

Cheers, Andreas