On Feb 6, 2019, at 6:37 AM, Carlos Maiolino <cmaiolino@redhat.com> wrote:
>>> On Wed, Dec 05, 2018 at 09:36:50AM -0800, Darrick J. Wong wrote:
>>>> In any case, I think a better solution to the multi-device problem is to
>>>> start returning device information via struct fiemap_extent, at least
>>>> inside the kernel.  Use one of the reserved fields to declare a new
>>>> '__u32 fe_device' field in struct fiemap_extent which can be the dev_t
>>>> device number, and then you can check that against inode->i_sb->s_bdev
>>>> to avoid returning results for the non-primary device of a multi-device
>>>> filesystem.
>>> 
>>> I agree we should address it here, but I don't think fiemap_extent is the right
>>> place for it, it is linked to the UAPI, and changing it is usually not a good
>>> idea.
>> 
>> Adding a FIEMAP_EXTENT flag or two to turn one of the fe_reserved fields
>> into some sort of dev_t/per-device cookie should be fine.  Userspace
>> shouldn't be expecting any meaning in reserved areas.

We are already using the __u32 fiemap_extent::fe_reserved[0] as fe_device for
Lustre, to return the server index to userspace for filefrag with suitable
patches.  That is needed because a single file may be striped across multiple
servers, and could instead return the dev_t for local multi-device filesystems.

>>> I think I got your idea anyway, but, what if, instead returning the bdev in
>>> fiemap_extent, we instead, send a flag (via fi_flags) to the filesystem, to
>>> idenfify a FIBMAP or a FIEMAP call, and let the filesystem decide what to do
>>> with such information?
>> 
>> I don't like the idea of adding a FIEMAP_FLAG to distinguish callers.
> 
> Ok, may I ask why not?
> 
> My apologies if I am wrong, but, per my understanding, there is nothing today,
> which tells userspace which device belongs the extent map reported by FIEMAP.
> If it belongs to the RT device in XFS, or whatever disk in a raid in BTRFS, we
> simply do not provide such information. So, the goal is to provide a way to tell
> the filesystem if a FIEMAP or a FIBMAP has been requested, so the current
> behavior of both ioctls won't change.
> 
> Enabling filesystems to return device information into fiemap_extent requires
> modification of all filesystems to provide such information, which will not have
> any use other than matching the mounted device to the device where the extent
> is.

Filling in the fe_device field is not harmful for existing filesystems, since it
has virtually zero cost (not more than zeroing the field to avoid leaking kernel
data) and older userspace tools would just ignore it.  What would be better than
just filling in the fe_device field would be to also add:

    #define FIEMAP_EXTENT_DEVICE 0x2000

to indicate that fe_device contains a valid value.  That tells userspace that the
filesystem is filling in the field, and allows compatibility with older kernels
and allows incremental addition for filesystems that can handle this (XFS, BtrFS).

We haven't added the FIEMAP_EXTENT_DEVICE flag for Lustre, but it would make sense
to do so.

> A FIEMAP_FLAG will also require FS changes, but IMHO, less intrusive than the
> device id in fiemap_extent. I don't see much advantage in adding the device id
> instead of using the flag.

We also have for Lustre:

    #define FIEMAP_FLAG_DEVICE_ORDER 0x40000000

which requests that the kernel FIEMAP return the extents for each block device
first rather than in file logical block order.  That avoids interleaving the
extents across all of the devices in e.g. 1MB chunks (think RAID-0) which would
force the maximum returned extent size to 1MB even though there are much larger
contiguous extents allocated on each device.  Instead, DEVICE_ORDER returns
all of the extents for device 0 first, then device 1 next, etc.  This shows if
the on-disk allocation is good or bad, and also fills in the fe_device field.

> A problem I see using a new FIEMAP_FLAG, is it 'could' be also passed via
> userspace, so, it would require a check to make sure it didn't come from
> userspace if ioctl_fiemap() was used.

Are you talking about a FIEMAP_FLAG_FIBMAP flag, or about returning the fe_device
field?  I think that passing a flag like FIEMAP_FLAG_DEVICE_ORDER from userspace
is fine in this case, because it has a concrete meaning and is not just an internal
flag.

> I think there are 2 other possibilities which can be used to fix this.
> 
> - Use a boolean inside fiemap_extent_info to identify a fibmap call, or,
> - If the device id is a must for you, maybe add the device id into
>  fiemap_extent_info instead of fiemap_extent. So we don't mess with a UAPI
>  exported data structure and still provides a way to the filesystems to provide
>  which device the mapped extent is in.

No, that would mean all of the change, without making it more useful to userspace.

Also, if with only a device per fiemap_extent_info then it won't handle filesystems
that may allocate a single file on multiple devices, such as BtrFS and Lustre.

Cheers, Andreas

>>>> 
>>>>> +
>>>>> +	return error;
>>>>> +}
>>>>> +
>>>>> /**
>>>>>  *	bmap	- find a block number in a file
>>>>>  *	@inode:  inode owning the block number being requested
>>>>> @@ -1594,10 +1628,14 @@ EXPORT_SYMBOL(iput);
>>>>>  */
>>>>> int bmap(struct inode *inode, sector_t *block)
>>>>> {
>>>>> -	if (!inode->i_mapping->a_ops->bmap)
>>>>> +	if (inode->i_op->fiemap)
>>>>> +		return bmap_fiemap(inode, block);
>>>>> +	else if (inode->i_mapping->a_ops->bmap)
>>>>> +		*block = inode->i_mapping->a_ops->bmap(inode->i_mapping,
>>>>> +						       *block);
>>>>> +	else
>>>>> 		return -EINVAL;
>>>> 
>>>> Waitaminute.  btrfs currently supports fiemap but not bmap, and now
>>>> suddenly it will support this legacy interface they've never supported
>>>> before.  Are they on board with this?
>>>> 
>>>> --D
>>>> 
>>>>> 
>>>>> -	*block = inode->i_mapping->a_ops->bmap(inode->i_mapping, *block);
>>>>> 	return 0;
>>>>> }
>>>>> EXPORT_SYMBOL(bmap);
>>>>> diff --git a/fs/ioctl.c b/fs/ioctl.c
>>>>> index 6086978fe01e..bfa59df332bf 100644
>>>>> --- a/fs/ioctl.c
>>>>> +++ b/fs/ioctl.c
>>>>> @@ -116,6 +116,38 @@ int fiemap_fill_user_extent(struct fiemap_extent_info *fieinfo, u64 logical,
>>>>> 	return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
>>>>> }
>>>>> 
>>>>> +int fiemap_fill_kernel_extent(struct fiemap_extent_info *fieinfo, u64 logical,
>>>>> +			    u64 phys, u64 len, u32 flags)
>>>>> +{
>>>>> +	struct fiemap_extent *extent = fieinfo->fi_extents_start;
>>>>> +
>>>>> +	/* only count the extents */
>>>>> +	if (fieinfo->fi_extents_max == 0) {
>>>>> +		fieinfo->fi_extents_mapped++;
>>>>> +		return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
>>>>> +	}
>>>>> +
>>>>> +	if (fieinfo->fi_extents_mapped >= fieinfo->fi_extents_max)
>>>>> +		return 1;
>>>>> +
>>>>> +	if (flags & SET_UNKNOWN_FLAGS)
>>>>> +		flags |= FIEMAP_EXTENT_UNKNOWN;
>>>>> +	if (flags & SET_NO_UNMOUNTED_IO_FLAGS)
>>>>> +		flags |= FIEMAP_EXTENT_ENCODED;
>>>>> +	if (flags & SET_NOT_ALIGNED_FLAGS)
>>>>> +		flags |= FIEMAP_EXTENT_NOT_ALIGNED;
>>>>> +
>>>>> +	extent->fe_logical = logical;
>>>>> +	extent->fe_physical = phys;
>>>>> +	extent->fe_length = len;
>>>>> +	extent->fe_flags = flags;
>>>>> +
>>>>> +	fieinfo->fi_extents_mapped++;
>>>>> +
>>>>> +	if (fieinfo->fi_extents_mapped == fieinfo->fi_extents_max)
>>>>> +		return 1;
>>>>> +	return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
>>>>> +}
>>>>> /**
>>>>>  * fiemap_fill_next_extent - Fiemap helper function
>>>>>  * @fieinfo:	Fiemap context passed into ->fiemap
>>>>> diff --git a/include/linux/fs.h b/include/linux/fs.h
>>>>> index 7a434979201c..28bb523d532a 100644
>>>>> --- a/include/linux/fs.h
>>>>> +++ b/include/linux/fs.h
>>>>> @@ -1711,6 +1711,8 @@ struct fiemap_extent_info {
>>>>> 	fiemap_fill_cb	fi_cb;
>>>>> };
>>>>> 
>>>>> +int fiemap_fill_kernel_extent(struct fiemap_extent_info *info, u64 logical,
>>>>> +			      u64 phys, u64 len, u32 flags);
>>>>> int fiemap_fill_next_extent(struct fiemap_extent_info *info, u64 logical,
>>>>> 			    u64 phys, u64 len, u32 flags);
>>>>> int fiemap_check_flags(struct fiemap_extent_info *fieinfo, u32 fs_flags);
>>>>> --
>>>>> 2.17.2
>>>>> 
>>> 
>>> --
>>> Carlos
> 
> --
> Carlos


Cheers, Andreas