From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-5.6 required=3.0 tests=DKIMWL_WL_HIGH,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_PASS,UNPARSEABLE_RELAY,USER_AGENT_MUTT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id A316AC282C2 for ; Thu, 7 Feb 2019 18:17:14 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id 5D45E2086C for ; Thu, 7 Feb 2019 18:17:14 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=oracle.com header.i=@oracle.com header.b="hf4jiygf" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726774AbfBGSRN (ORCPT ); Thu, 7 Feb 2019 13:17:13 -0500 Received: from userp2120.oracle.com ([156.151.31.85]:51148 "EHLO userp2120.oracle.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726319AbfBGSRN (ORCPT ); Thu, 7 Feb 2019 13:17:13 -0500 Received: from pps.filterd (userp2120.oracle.com [127.0.0.1]) by userp2120.oracle.com (8.16.0.27/8.16.0.27) with SMTP id x17IEOuc196420; Thu, 7 Feb 2019 18:17:04 GMT DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=oracle.com; h=date : from : to : cc : subject : message-id : references : mime-version : content-type : in-reply-to; s=corp-2018-07-02; bh=XRqUPKzPikZhBIbfrHX5Bo86lpQICSoqoLioo2hRU/Q=; b=hf4jiygf55GeYB6+oP+fJgmsXboiPIXVU4yv/fdoMFIUHjRbSceP9LroV55uRNVU0oDP BvXliOl9JNmrDzIyS1mRkNsMYvlwIJB4PV9F2Y+ovtEngauZha0ka53GlXwBeV6+h3hy t9wTKUDh2XP1pKrveXG7n4smL6WDHGZpa75vYlodwmAoFZ/4B6jSx2TyG6EV11sSRfTH VfirNs8hdwWX7XYakeSh1npgP1TleFYAqx8FnWxObWULKWpbjNTK01r6eejeVkWFbrRg 9jabJe9/GyTifBnN9yLFD0TBuarcAzT/7WYOegAfRRs6m3KTlPRWZRDMeE0zXICiQOp4 2g== Received: from userv0021.oracle.com (userv0021.oracle.com [156.151.31.71]) by userp2120.oracle.com with ESMTP id 2qd98ngkp4-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Thu, 07 Feb 2019 18:17:03 +0000 Received: from aserv0122.oracle.com (aserv0122.oracle.com [141.146.126.236]) by userv0021.oracle.com (8.14.4/8.14.4) with ESMTP id x17IGwT3026612 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Thu, 7 Feb 2019 18:16:58 GMT Received: from abhmp0018.oracle.com (abhmp0018.oracle.com [141.146.116.24]) by aserv0122.oracle.com (8.14.4/8.14.4) with ESMTP id x17IGvhE008913; Thu, 7 Feb 2019 18:16:57 GMT Received: from localhost (/67.169.218.210) by default (Oracle Beehive Gateway v4.0) with ESMTP ; Thu, 07 Feb 2019 18:16:57 +0000 Date: Thu, 7 Feb 2019 10:16:55 -0800 From: "Darrick J. Wong" To: Carlos Maiolino Cc: linux-fsdevel@vger.kernel.org, hch@lst.de, adilger@dilger.ca, sandeen@redhat.com, david@fromorbit.com Subject: Re: [PATCH 09/10 V2] Use FIEMAP for FIBMAP calls Message-ID: <20190207181655.GC27972@magnolia> References: <20181205091728.29903-1-cmaiolino@redhat.com> <20181205091728.29903-10-cmaiolino@redhat.com> <20181205173650.GA8112@magnolia> <20190204151147.rra4n7k56ec4ndob@hades.usersys.redhat.com> <20190204182722.GA32119@magnolia> <20190206133753.oqpw7citye6apdpr@hades.usersys.redhat.com> <20190206204431.GB32119@magnolia> <20190207123641.irplijomcbni7bu2@hades.usersys.redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20190207123641.irplijomcbni7bu2@hades.usersys.redhat.com> User-Agent: Mutt/1.9.4 (2018-02-28) X-Proofpoint-Virus-Version: vendor=nai engine=5900 definitions=9160 signatures=668682 X-Proofpoint-Spam-Details: rule=notspam policy=default score=0 priorityscore=1501 malwarescore=0 suspectscore=0 phishscore=0 bulkscore=0 spamscore=0 clxscore=1015 lowpriorityscore=0 mlxscore=0 impostorscore=0 mlxlogscore=999 adultscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.0.1-1810050000 definitions=main-1902070137 Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org On Thu, Feb 07, 2019 at 01:36:41PM +0100, Carlos Maiolino wrote: > Apologies, I forgot to mention another thing.. > > On Wed, Feb 06, 2019 at 12:44:31PM -0800, Darrick J. Wong wrote: > > On Wed, Feb 06, 2019 at 02:37:53PM +0100, Carlos Maiolino wrote: > > > > > > In any case, I think a better solution to the multi-device problem is to > > > > > > start returning device information via struct fiemap_extent, at least > > > > > > inside the kernel. Use one of the reserved fields to declare a new > > > > > > '__u32 fe_device' field in struct fiemap_extent which can be the dev_t > > > > > > device number, and then you can check that against inode->i_sb->s_bdev > > > > > > to avoid returning results for the non-primary device of a multi-device > > > > > > filesystem. > > > > > > > > > > I agree we should address it here, but I don't think fiemap_extent is the right > > > > > place for it, it is linked to the UAPI, and changing it is usually not a good > > > > > idea. > > > > > > > > Adding a FIEMAP_EXTENT flag or two to turn one of the fe_reserved fields > > > > into some sort of dev_t/per-device cookie should be fine. Userspace > > > > shouldn't be expecting any meaning in reserved areas. > > > > > > > > > I think I got your idea anyway, but, what if, instead returning the bdev in > > > > > fiemap_extent, we instead, send a flag (via fi_flags) to the filesystem, to > > > > > idenfify a FIBMAP or a FIEMAP call, and let the filesystem decide what to do > > > > > with such information? > > > > > > > > I don't like the idea of adding a FIEMAP_FLAG to distinguish callers. > > > > > > Ok, may I ask why not? > > > > I think it's a bad idea to add a flag to FIEMAP to change its behavior > > to suit an older and even crappier legacy interface (i.e. FIBMAP). > > > > FIBMAP is architecturally broken in that we can't /ever/ provide the > > context of "which device does this map to?" > > > > FIEMAP is architecturally deficient as well, but its ioctl structure > > definition is flexible enough that we can report "which device does this > > map to". > > > > I want to enhance FIEMAP to deal with multi-device filesystems > > correctly, and as much as I want to kill FIBMAP, I can't because of zipl > > and *lilo. > > > > > My apologies if I am wrong, but, per my understanding, there is > > > nothing today, which tells userspace which device belongs the extent > > > map reported by FIEMAP. > > > > Right... > > > > > If it belongs to the RT device in XFS, or whatever disk in a raid in > > > BTRFS, we simply do not provide such information. > > > > Right... > > > > > So, the goal is to provide a way to tell the filesystem if a FIEMAP or > > > a FIBMAP has been requested, so the current behavior of both ioctls > > > won't change. > > > > ...but from my point of view, the FIEMAP behavior *ought* to change to > > be more expressive. Once that's done, we can use the more expressive > > FIEMAP output to solve the problem of FIBMAP vs. multi-disk filesystems. > > > > The whole point of having fe_reserved* fields in struct fiemap_extent is > > so that we can add a new FIEMAP_EXTENT_ flag so that the filesystem can > > start returning data in a reserved field. New userspace programs that > > know about the flag can start reading information from the new field if > > they see the flag, and old userspace programs don't know about the flag > > and won't be any worse off. > > > > > Enabling filesystems to return device information into fiemap_extent > > > requires modification of all filesystems to provide such information, > > > which will not have any use other than matching the mounted device to > > > the device where the extent is. > > > > Perhaps it would help for me to present a more concrete proposal: > > > > --- a/include/uapi/linux/fiemap.h 2019-01-18 10:53:44.000000000 -0800 > > +++ b/include/uapi/linux/fiemap.h 2019-02-06 12:25:52.813935941 -0800 > > @@ -22,7 +22,19 @@ struct fiemap_extent { > > __u64 fe_length; /* length in bytes for this extent */ > > __u64 fe_reserved64[2]; > > __u32 fe_flags; /* FIEMAP_EXTENT_* flags for this extent */ > > - __u32 fe_reserved[3]; > > + > > + /* > > + * Underlying device that this extent is stored on. > > + * > > + * If FIEMAP_EXTENT_DEV_T is set, this field is a dev_t containing the > > + * major and minor numbers of a device. If FIEMAP_EXTENT_DEV_COOKIE is > > + * set, this field is a 32-bit cookie that can be used to distinguish > > + * between backing devices but has no intrinsic meaning. If neither > > + * EXTENT_DEV flag is set, this field is meaningless. Only one of the > > + * EXTENT_DEV flags may be set at any time. > > + */ > > + __u32 fe_device; > > + __u32 fe_reserved[2]; > > }; > > > > struct fiemap { > > @@ -66,5 +78,14 @@ struct fiemap { > > * merged for efficiency. */ > > #define FIEMAP_EXTENT_SHARED 0x00002000 /* Space shared with other > > * files. */ > > +#define FIEMAP_EXTENT_DEV_T 0x00004000 /* fe_device is a dev_t > > + * structure containing the > > + * major and minor numbers > > + * of a block device. */ > > +#define FIEMAP_EXTENT_DEV_COOKIE 0x00008000 /* fe_device is a 32-bit > > + * cookie that can be used > > + * to distinguish physical > > + * devices but otherwise > > + * has no meaning. */ > > > > #endif /* _LINUX_FIEMAP_H */ > > > > Under this scheme, XFS can set FIEMAP_EXTENT_DEV_T in fe_flags and start > > encoding fe_device = new_encode_dev(xfs_get_device_for_file()). > > Here, I believe you are forgetting that filesystems do not touch fiemap_extent > directly. We call fiemap_fell_next_extent() helper to fill each extent found by > fiemap. So, in either way, we'd need to modify fiemap_fill_next_extent() and the > callbacks being used to accommodate this new field or create a new helper to > modify the device which doesn't sound reasonable. So, either way, we will end up > needing to modify all filesystems. Yep. Drat. I guess you could add a bdev parameter to fiemap_fill_next_extent, and we'd use that to encode fe_device. If the fs passes NULL then we just get it from the superblock or something. > So, although I really like the idea of improving the FIEMAP interface, I'm > starting to consider another patchset for it. I think it requires an interface > change big enough to fit in this patchset, which actually has a different > purpose. Or, maybe, address this at the end of this patchset, leaving different > interface changes in different patchsets, instead of making many changes all at > once, mixed together. I think you're right, fiemap upgrades as one series and then fibmap-via-fiemap as the second one. --D > > > > Some clustered filesystem or whatever could set FIEMAP_EXTENT_DEV_COOKIE > > and encode the replica number in fe_device. > > > > Existing filesystems can be left unchanged, in which case neither > > EXTENT_DEV flag is set in fe_flags and the bits in fe_device are > > meaningless, the same as they are today. Reporting fe_device is entirely > > optional. > > > > Userspace programs will now be able to tell which device the file data > > lives on, which has been sort-of requested for years, if the filesystem > > chooses to start exporting that information. > > > > Your FIBMAP-via-FIEMAP backend can do something like: > > > > /* FIBMAP only returns results for the same block device backing the fs. */ > > if ((fe->fe_flags & EXTENT_DEV_T) && fe->fe_device != inode->i_sb->sb_device) > > return 0; > > > > /* Can't tell what is the backing device, bail out. */ > > if (fe->fe_flags & EXTENT_DEV_COOKIE) > > return 0; > > > > /* > > * Either fe_device matches the backing device or the implementation > > * doesn't tell us about the backing device, so assume it's ok. > > */ > > > > > > So that's how I'd solve a longstanding design problem of FIEMAP and then > > take advantage of that solution to remedy my objections to the proposed > > "Use FIEMAP for FIBMAP" series. It doesn't require a FIEMAP_FLAG > > behavior flag that userspace knows about but isn't allowed to pass in. > > > > > A FIEMAP_FLAG will also require FS changes, but IMHO, less intrusive > > > than the device id in fiemap_extent. I don't see much advantage in > > > adding the device id instead of using the flag. > > > > > > A problem I see using a new FIEMAP_FLAG, is it 'could' be also passed via > > > userspace, so, it would require a check to make sure it didn't come from > > > userspace if ioctl_fiemap() was used. > > > > > > I think there are 2 other possibilities which can be used to fix this. > > > > > > - Use a boolean inside fiemap_extent_info to identify a fibmap call, or, > > > - If the device id is a must for you, maybe add the device id into > > > fiemap_extent_info instead of fiemap_extent. > > > > That won't work with btrfs, which can store file extents on multiple > > different physical devices. > > > > > So we don't mess with a UAPI exported data structure and still > > > provides a way to the filesystems to provide which device the mapped > > > extent is in. > > > > > > What you think? > > > > > > Cheers > > > > > > > > > > > > > > --D > > > > > > > > > > > > > > > > > + > > > > > > > + return error; > > > > > > > +} > > > > > > > + > > > > > > > /** > > > > > > > * bmap - find a block number in a file > > > > > > > * @inode: inode owning the block number being requested > > > > > > > @@ -1594,10 +1628,14 @@ EXPORT_SYMBOL(iput); > > > > > > > */ > > > > > > > int bmap(struct inode *inode, sector_t *block) > > > > > > > { > > > > > > > - if (!inode->i_mapping->a_ops->bmap) > > > > > > > + if (inode->i_op->fiemap) > > > > > > > + return bmap_fiemap(inode, block); > > > > > > > + else if (inode->i_mapping->a_ops->bmap) > > > > > > > + *block = inode->i_mapping->a_ops->bmap(inode->i_mapping, > > > > > > > + *block); > > > > > > > + else > > > > > > > return -EINVAL; > > > > > > > > > > > > Waitaminute. btrfs currently supports fiemap but not bmap, and now > > > > > > suddenly it will support this legacy interface they've never supported > > > > > > before. Are they on board with this? > > > > > > > > > > > > --D > > > > > > > > > > > > > > > > > > > > - *block = inode->i_mapping->a_ops->bmap(inode->i_mapping, *block); > > > > > > > return 0; > > > > > > > } > > > > > > > EXPORT_SYMBOL(bmap); > > > > > > > diff --git a/fs/ioctl.c b/fs/ioctl.c > > > > > > > index 6086978fe01e..bfa59df332bf 100644 > > > > > > > --- a/fs/ioctl.c > > > > > > > +++ b/fs/ioctl.c > > > > > > > @@ -116,6 +116,38 @@ int fiemap_fill_user_extent(struct fiemap_extent_info *fieinfo, u64 logical, > > > > > > > return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0; > > > > > > > } > > > > > > > > > > > > > > +int fiemap_fill_kernel_extent(struct fiemap_extent_info *fieinfo, u64 logical, > > > > > > > + u64 phys, u64 len, u32 flags) > > > > > > > +{ > > > > > > > + struct fiemap_extent *extent = fieinfo->fi_extents_start; > > > > > > > + > > > > > > > + /* only count the extents */ > > > > > > > + if (fieinfo->fi_extents_max == 0) { > > > > > > > + fieinfo->fi_extents_mapped++; > > > > > > > + return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0; > > > > > > > + } > > > > > > > + > > > > > > > + if (fieinfo->fi_extents_mapped >= fieinfo->fi_extents_max) > > > > > > > + return 1; > > > > > > > + > > > > > > > + if (flags & SET_UNKNOWN_FLAGS) > > > > > > > + flags |= FIEMAP_EXTENT_UNKNOWN; > > > > > > > + if (flags & SET_NO_UNMOUNTED_IO_FLAGS) > > > > > > > + flags |= FIEMAP_EXTENT_ENCODED; > > > > > > > + if (flags & SET_NOT_ALIGNED_FLAGS) > > > > > > > + flags |= FIEMAP_EXTENT_NOT_ALIGNED; > > > > > > > + > > > > > > > + extent->fe_logical = logical; > > > > > > > + extent->fe_physical = phys; > > > > > > > + extent->fe_length = len; > > > > > > > + extent->fe_flags = flags; > > > > > > > + > > > > > > > + fieinfo->fi_extents_mapped++; > > > > > > > + > > > > > > > + if (fieinfo->fi_extents_mapped == fieinfo->fi_extents_max) > > > > > > > + return 1; > > > > > > > + return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0; > > > > > > > +} > > > > > > > /** > > > > > > > * fiemap_fill_next_extent - Fiemap helper function > > > > > > > * @fieinfo: Fiemap context passed into ->fiemap > > > > > > > diff --git a/include/linux/fs.h b/include/linux/fs.h > > > > > > > index 7a434979201c..28bb523d532a 100644 > > > > > > > --- a/include/linux/fs.h > > > > > > > +++ b/include/linux/fs.h > > > > > > > @@ -1711,6 +1711,8 @@ struct fiemap_extent_info { > > > > > > > fiemap_fill_cb fi_cb; > > > > > > > }; > > > > > > > > > > > > > > +int fiemap_fill_kernel_extent(struct fiemap_extent_info *info, u64 logical, > > > > > > > + u64 phys, u64 len, u32 flags); > > > > > > > int fiemap_fill_next_extent(struct fiemap_extent_info *info, u64 logical, > > > > > > > u64 phys, u64 len, u32 flags); > > > > > > > int fiemap_check_flags(struct fiemap_extent_info *fieinfo, u32 fs_flags); > > > > > > > -- > > > > > > > 2.17.2 > > > > > > > > > > > > > > > > > -- > > > > > Carlos > > > > > > -- > > > Carlos > > -- > Carlos