From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-4.0 required=3.0 tests=DKIMWL_WL_MED,DKIM_SIGNED, DKIM_VALID,HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_PATCH,MAILING_LIST_MULTI, SPF_PASS autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id AE4DFC282C2 for ; Thu, 7 Feb 2019 21:25:08 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id 4FB9721721 for ; Thu, 7 Feb 2019 21:25:08 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=dilger-ca.20150623.gappssmtp.com header.i=@dilger-ca.20150623.gappssmtp.com header.b="ou1I3IDc" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726650AbfBGVZH (ORCPT ); Thu, 7 Feb 2019 16:25:07 -0500 Received: from mail-pf1-f193.google.com ([209.85.210.193]:44252 "EHLO mail-pf1-f193.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726512AbfBGVZH (ORCPT ); Thu, 7 Feb 2019 16:25:07 -0500 Received: by mail-pf1-f193.google.com with SMTP id u6so554502pfh.11 for ; Thu, 07 Feb 2019 13:25:06 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=dilger-ca.20150623.gappssmtp.com; s=20150623; h=from:message-id:mime-version:subject:date:in-reply-to:cc:to :references; bh=YCJFA/RBQds0afcE/W762mvNAHPaFU869aKBn0RRlno=; b=ou1I3IDc7V6iFlinOwko8cudLhK64U9b3jicgHYNLjRAuHzTh/RKl7rSlPlar/cNFa huXYZGYq3RLbUo2+kTKCp0Kb/VF8N2wbOwNwVwp2tTCo2XpTuM1otdtNns8pDO6N2ssc UTeZ+vc+OpVPN4+1IlD5fmve1Fk8KFELcsNcvjr20jphSBPSGC4yGC8ce7NjIIhNl2S3 ovvTIKQTy72rfURBmwMS4lM2DtwHw04R+Q626mtMUrUuGH3A4P/YuHknwtX4M7hBwfxo U0BuTloJQrhPiWmbYKBKDXX/dQYY6YRuiEt+W/GIPhflN0xOxswjjiPwdLkEr6db05wm nP5Q== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:message-id:mime-version:subject:date :in-reply-to:cc:to:references; bh=YCJFA/RBQds0afcE/W762mvNAHPaFU869aKBn0RRlno=; b=RonCM1eqjGgpUCoZAcDThB8ubQ3YcsiYncF70hBzES9CCWKj0fT9Vdv23DPEL6JNFi PLv3UZK3keTtqzmvk3hsJzfL3/JsaKT+zpplVFRPSHYJBjyIoPi7F4AIIfli32XSZ+Qa hUH8uGMQiP6cGJCde/HIlm8dGJx+g7ySUU4/K4oO7U0jtTZsXCWctsXiaeG0xojSIi0t IwrZFDp6pdJvSqxEG3AlZE/nH5bHU6v4rouWk6pLjHytPc2BpTJGXxx3FUCt0G+G1BNx SzX8zNMoHUPMLXvz6PGvzeq2paTrh6v/LfPsrdP0544C8NcD52u/HxQ8qGL+ltdx4wh4 YxCA== X-Gm-Message-State: AHQUAubbey/5z4NtmeaSbSuXHsKEOCuUH5teUMqxyklp61IvYN/inJt7 fyEeKeTLWmPRcgT7arjGsJPNCw== X-Google-Smtp-Source: AHgI3Iacvw51ETIWk966N/Kiwku03Itn++23ovOJijWMWCyBiWmf2nYwwIaFQMzR9DN13CLnToUkrw== X-Received: by 2002:a62:4886:: with SMTP id q6mr18843916pfi.182.1549574705760; Thu, 07 Feb 2019 13:25:05 -0800 (PST) Received: from cabot-wlan.adilger.int (S0106a84e3fe4b223.cg.shawcable.net. [70.77.216.213]) by smtp.gmail.com with ESMTPSA id t90sm88437pfj.23.2019.02.07.13.25.03 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Thu, 07 Feb 2019 13:25:04 -0800 (PST) From: Andreas Dilger Message-Id: Content-Type: multipart/signed; boundary="Apple-Mail=_96F9B488-9A64-45D3-B289-0AF4F45F063D"; protocol="application/pgp-signature"; micalg=pgp-sha256 Mime-Version: 1.0 (Mac OS X Mail 10.3 \(3273\)) Subject: Re: [PATCH 09/10 V2] Use FIEMAP for FIBMAP calls Date: Thu, 7 Feb 2019 14:25:01 -0700 In-Reply-To: <20190207170210.GB27972@magnolia> Cc: Carlos Maiolino , linux-fsdevel , Christoph Hellwig , Eric Sandeen , david@fromorbit.com To: "Darrick J. Wong" References: <20181205091728.29903-1-cmaiolino@redhat.com> <20181205091728.29903-10-cmaiolino@redhat.com> <20181205173650.GA8112@magnolia> <20190204151147.rra4n7k56ec4ndob@hades.usersys.redhat.com> <20190204182722.GA32119@magnolia> <20190206133753.oqpw7citye6apdpr@hades.usersys.redhat.com> <20190206204431.GB32119@magnolia> <20190207115954.jfkf4fcoyfxqjue6@hades.usersys.redhat.com> <20190207170210.GB27972@magnolia> X-Mailer: Apple Mail (2.3273) Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org --Apple-Mail=_96F9B488-9A64-45D3-B289-0AF4F45F063D Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset=us-ascii On Feb 7, 2019, at 10:02 AM, Darrick J. Wong = wrote: > On Thu, Feb 07, 2019 at 12:59:54PM +0100, Carlos Maiolino wrote: >> On Wed, Feb 06, 2019 at 12:44:31PM -0800, Darrick J. Wong wrote: >>>=20 >>> ...but from my point of view, the FIEMAP behavior *ought* to change = to >>> be more expressive. Once that's done, we can use the more = expressive >>> FIEMAP output to solve the problem of FIBMAP vs. multi-disk = filesystems. >>>=20 >>> The whole point of having fe_reserved* fields in struct = fiemap_extent is >>> so that we can add a new FIEMAP_EXTENT_ flag so that the filesystem = can >>> start returning data in a reserved field. New userspace programs = that >>> know about the flag can start reading information from the new field = if >>> they see the flag, and old userspace programs don't know about the = flag >>> and won't be any worse off. >>>=20 >>> Perhaps it would help for me to present a more concrete proposal: >>>=20 >>> --- a/include/uapi/linux/fiemap.h 2019-01-18 10:53:44.000000000 = -0800 >>> +++ b/include/uapi/linux/fiemap.h 2019-02-06 12:25:52.813935941 = -0800 >>> @@ -22,7 +22,19 @@ struct fiemap_extent { >>> __u64 fe_length; /* length in bytes for this extent */ >>> __u64 fe_reserved64[2]; >>> __u32 fe_flags; /* FIEMAP_EXTENT_* flags for this extent */ >>> - __u32 fe_reserved[3]; >>> + >>> + /* >>> + * Underlying device that this extent is stored on. >>> + * >>> + * If FIEMAP_EXTENT_DEV_T is set, this field is a dev_t = containing the >>> + * major and minor numbers of a device. If = FIEMAP_EXTENT_DEV_COOKIE is >>> + * set, this field is a 32-bit cookie that can be used to = distinguish >>> + * between backing devices but has no intrinsic meaning. If = neither >>> + * EXTENT_DEV flag is set, this field is meaningless. Only one = of the >>> + * EXTENT_DEV flags may be set at any time. >>> + */ >>> + __u32 fe_device; >>> + __u32 fe_reserved[2]; >>> }; >>>=20 >>> struct fiemap { >>> @@ -66,5 +78,14 @@ struct fiemap { >>> * merged for = efficiency. */ >>> #define FIEMAP_EXTENT_SHARED 0x00002000 /* Space = shared with other >>> * files. */ >>> +#define FIEMAP_EXTENT_DEV_T 0x00004000 /* fe_device = is a dev_t >>> + * structure = containing the >>> + * major and minor = numbers >>> + * of a block device. = */ >>> +#define FIEMAP_EXTENT_DEV_COOKIE 0x00008000 /* fe_device is a = 32-bit >>> + * cookie that can be = used >>> + * to distinguish = physical >>> + * devices but = otherwise >>> + * has no meaning. */ >>>=20 >>> #endif /* _LINUX_FIEMAP_H */ >>>=20 >>> Under this scheme, XFS can set FIEMAP_EXTENT_DEV_T in fe_flags and = start >>> encoding fe_device =3D new_encode_dev(xfs_get_device_for_file()). >>>=20 >>> Some clustered filesystem or whatever could set = FIEMAP_EXTENT_DEV_COOKIE >>> and encode the replica number in fe_device. >>>=20 >>=20 >> All of this makes sense, but I'm struggling to understand what you = mean by >> replica number here, and why it justify a second flag. >=20 > I left in the "device cookie" thing in the proposal to accomodate a > request from the Lustre folks to be able to report which replica is > storing a particular extent map. Apparently the replica id is simply = a > 32-bit number that isn't inherently useful, hence the vagueness around > what "cookie" really means... >=20 > ...oh, right, lustre fell out of drivers/staging/. You could probably > leave it out then. Do we really need to be this way, about reserving a single flag for = Lustre, which will likely also be useful for other filesystems? It's not like Lustre is some closed-source binary module for which we need to make = life difficult, it is used by many thousands of the largest computers at labs and universities and companies around the world. We are working to = clean up the code outside the staging tree and resubmit it. Not reserving a = flag just means we will continue to use random values in Lustre before it can be merged, which will make life harder when we try to merge again. In the case of Lustre, the proposed DEV_COOKIE would indicate fe_device = is the integer index number of the server on which each extent of the file = is located (Darrick's "replica number" term is not really correct). The = server index is familiar to all Lustre users, so having filefrag print out the device number of "0000" or "0009" is totally clear to them. For pNFS or Ceph or other network filesystems (if they implement = filefrag) it could be an index or some other number (e.g. the IP address of the = server or low bits of the UUID or whatever). Reading back in the archives of = the original FIEMAP discussion, it seems BtrFS would prefer to use = DEV_COOKIE instead of DEV_T, because it uses internal RAID encoding and not plain = block devices, but I'm not familiar with the details there. Alternately, or in addition to, a DEV_COOKIE flag which indicates that = the same fe_device field is "not a device", it would be possible to add: #define FIEMAP_NO_DIRECT 0x40000000 and/or: #define FIEMAP_EXTENT_NET 0x80000000 /* Data stored remotely. * Sets NO_DIRECT flag */ returned by the filesystem that indicates the extent blocks are not = local to the node, so FIBMAP should return an error (-EOPNOTSUP or -EREMOTE or whatever) because the file can't be booted from. In that case, we could return FIEMAP_EXTENT_DEVICE to indicate the fe_device field is valid, = and return FIEMAP_EXTENT_NET to indicate the values in fe_device are not = local block devices, just filesystem-specific values to distinguish devices. However, I'm open to both _DEV_COOKIE and _NET flag if that is = preferred, since I think the two are somewhat complementary. >> This actually looks to contradict what you have been complaining, = about some >> filesystems which doesn't support FIBMAP currently, will now suddenly = start to >> support. Assuming it's ok if the implementation doesn't tell us about = the >> backing device, will simply make FIBMAP work. Let's say BTRFS doesn't = report the >> backing device, assuming it's ok will just fall into your first = complain. >=20 > Sorry, this thread has been going on so long that I forgot your goal = for > this series. :/ >=20 > Specifically, I had forgotten that you're removing the ->bmap pointer, > which means that filesystems don't have any particular way to signal > "Yes on FIEMAP, no on FIBMAP". Somehow I had thought that you were > merely creating a generic_file_bmap() that would call FIEMAP and = ripping > out all the adhoc bmap implementations. Just a reminder here, you should set FIEMAP_FLAG_SYNC when mapping = FIBMAP to FIEMAP so that the data on that file is flushed to disk before = returning, since the block mapping may not be assigned yet or may be unstable, = which could lead to an unbootable system if used for LILO. Cheers, Andreas >>> Existing filesystems can be left unchanged, in which case neither >>> EXTENT_DEV flag is set in fe_flags and the bits in fe_device are >>> meaningless, the same as they are today. Reporting fe_device is = entirely >>> optional. >>>=20 >>> Userspace programs will now be able to tell which device the file = data >>> lives on, which has been sort-of requested for years, if the = filesystem >>> chooses to start exporting that information. >>>=20 >>> Your FIBMAP-via-FIEMAP backend can do something like: >>>=20 >>> /* FIBMAP only returns results for the same block device backing the = fs. */ >>> if ((fe->fe_flags & EXTENT_DEV_T) && fe->fe_device !=3D = inode->i_sb->sb_device) >>> return 0; >>>=20 >>> /* Can't tell what is the backing device, bail out. */ >>> if (fe->fe_flags & EXTENT_DEV_COOKIE) >>> return 0; >>>=20 >>=20 >> Ok, the first conditional, is ok, the second one is not making sense = to me. >> Looks like you are basically using it to flag the filesystem can't = tell >> exactly which device the current extent is, let's say for example, = distributed >> filesystems, where the physical extent can actually be on a different = machine. >> But I can't say for sure, can you give me more details about what you = are trying >> to achieve here? >=20 > You've understood me correctly. :) >=20 >>> /* >>> * Either fe_device matches the backing device or the implementation >>> * doesn't tell us about the backing device, so assume it's ok. >>> */ >>> >>>=20 >>=20 >> Anyway, I think I need to understand more your usage idea for = EXTENT_DEV_COOKIE >> you mentioned. >=20 > I think you've understood it about as well as I can explain it. Maybe > Andreas will have more to say about the lustre replica id, but OTOH = it's > gone and so there's no user of it, so we could just drop it until = lustre > comes back. >=20 >>> So that's how I'd solve a longstanding design problem of FIEMAP and = then >>> take advantage of that solution to remedy my objections to the = proposed >>> "Use FIEMAP for FIBMAP" series. It doesn't require a FIEMAP_FLAG >>> behavior flag that userspace knows about but isn't allowed to pass = in. >>>=20 >>=20 >>>> A FIEMAP_FLAG will also require FS changes, but IMHO, less = intrusive >>>> than the device id in fiemap_extent. I don't see much advantage in >>>> adding the device id instead of using the flag. >>>>=20 >>>> A problem I see using a new FIEMAP_FLAG, is it 'could' be also = passed via >>>> userspace, so, it would require a check to make sure it didn't come = from >>>> userspace if ioctl_fiemap() was used. >>>>=20 >>>> I think there are 2 other possibilities which can be used to fix = this. >>>>=20 >>>> - Use a boolean inside fiemap_extent_info to identify a fibmap = call, or, >>>> - If the device id is a must for you, maybe add the device id into >>>> fiemap_extent_info instead of fiemap_extent. >>>=20 >>> That won't work with btrfs, which can store file extents on multiple >>> different physical devices. >>>=20 >>>> So we don't mess with a UAPI exported data structure and still >>>> provides a way to the filesystems to provide which device the = mapped >>>> extent is in. >>>>=20 >>>> What you think? >>>>=20 >>>> Cheers >>>>=20 >>>>=20 >>>>>=20 >>>>> --D >>>>>=20 >>>>>>>=20 >>>>>>>> + >>>>>>>> + return error; >>>>>>>> +} >>>>>>>> + >>>>>>>> /** >>>>>>>> * bmap - find a block number in a file >>>>>>>> * @inode: inode owning the block number being requested >>>>>>>> @@ -1594,10 +1628,14 @@ EXPORT_SYMBOL(iput); >>>>>>>> */ >>>>>>>> int bmap(struct inode *inode, sector_t *block) >>>>>>>> { >>>>>>>> - if (!inode->i_mapping->a_ops->bmap) >>>>>>>> + if (inode->i_op->fiemap) >>>>>>>> + return bmap_fiemap(inode, block); >>>>>>>> + else if (inode->i_mapping->a_ops->bmap) >>>>>>>> + *block =3D = inode->i_mapping->a_ops->bmap(inode->i_mapping, >>>>>>>> + *block); >>>>>>>> + else >>>>>>>> return -EINVAL; >>>>>>>=20 >>>>>>> Waitaminute. btrfs currently supports fiemap but not bmap, and = now >>>>>>> suddenly it will support this legacy interface they've never = supported >>>>>>> before. Are they on board with this? >>>>>>>=20 >>>>>>> --D >>>>>>>=20 >>>>>>>>=20 >>>>>>>> - *block =3D = inode->i_mapping->a_ops->bmap(inode->i_mapping, *block); >>>>>>>> return 0; >>>>>>>> } >>>>>>>> EXPORT_SYMBOL(bmap); >>>>>>>> diff --git a/fs/ioctl.c b/fs/ioctl.c >>>>>>>> index 6086978fe01e..bfa59df332bf 100644 >>>>>>>> --- a/fs/ioctl.c >>>>>>>> +++ b/fs/ioctl.c >>>>>>>> @@ -116,6 +116,38 @@ int fiemap_fill_user_extent(struct = fiemap_extent_info *fieinfo, u64 logical, >>>>>>>> return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0; >>>>>>>> } >>>>>>>>=20 >>>>>>>> +int fiemap_fill_kernel_extent(struct fiemap_extent_info = *fieinfo, u64 logical, >>>>>>>> + u64 phys, u64 len, u32 flags) >>>>>>>> +{ >>>>>>>> + struct fiemap_extent *extent =3D = fieinfo->fi_extents_start; >>>>>>>> + >>>>>>>> + /* only count the extents */ >>>>>>>> + if (fieinfo->fi_extents_max =3D=3D 0) { >>>>>>>> + fieinfo->fi_extents_mapped++; >>>>>>>> + return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0; >>>>>>>> + } >>>>>>>> + >>>>>>>> + if (fieinfo->fi_extents_mapped >=3D = fieinfo->fi_extents_max) >>>>>>>> + return 1; >>>>>>>> + >>>>>>>> + if (flags & SET_UNKNOWN_FLAGS) >>>>>>>> + flags |=3D FIEMAP_EXTENT_UNKNOWN; >>>>>>>> + if (flags & SET_NO_UNMOUNTED_IO_FLAGS) >>>>>>>> + flags |=3D FIEMAP_EXTENT_ENCODED; >>>>>>>> + if (flags & SET_NOT_ALIGNED_FLAGS) >>>>>>>> + flags |=3D FIEMAP_EXTENT_NOT_ALIGNED; >>>>>>>> + >>>>>>>> + extent->fe_logical =3D logical; >>>>>>>> + extent->fe_physical =3D phys; >>>>>>>> + extent->fe_length =3D len; >>>>>>>> + extent->fe_flags =3D flags; >>>>>>>> + >>>>>>>> + fieinfo->fi_extents_mapped++; >>>>>>>> + >>>>>>>> + if (fieinfo->fi_extents_mapped =3D=3D = fieinfo->fi_extents_max) >>>>>>>> + return 1; >>>>>>>> + return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0; >>>>>>>> +} >>>>>>>> /** >>>>>>>> * fiemap_fill_next_extent - Fiemap helper function >>>>>>>> * @fieinfo: Fiemap context passed into ->fiemap >>>>>>>> diff --git a/include/linux/fs.h b/include/linux/fs.h >>>>>>>> index 7a434979201c..28bb523d532a 100644 >>>>>>>> --- a/include/linux/fs.h >>>>>>>> +++ b/include/linux/fs.h >>>>>>>> @@ -1711,6 +1711,8 @@ struct fiemap_extent_info { >>>>>>>> fiemap_fill_cb fi_cb; >>>>>>>> }; >>>>>>>>=20 >>>>>>>> +int fiemap_fill_kernel_extent(struct fiemap_extent_info *info, = u64 logical, >>>>>>>> + u64 phys, u64 len, u32 flags); >>>>>>>> int fiemap_fill_next_extent(struct fiemap_extent_info *info, = u64 logical, >>>>>>>> u64 phys, u64 len, u32 flags); >>>>>>>> int fiemap_check_flags(struct fiemap_extent_info *fieinfo, u32 = fs_flags); >>>>>>>> -- >>>>>>>> 2.17.2 >>>>>>>>=20 >>>>>>=20 >>>>>> -- >>>>>> Carlos >>>>=20 >>>> -- >>>> Carlos >>=20 >> -- >> Carlos Cheers, Andreas --Apple-Mail=_96F9B488-9A64-45D3-B289-0AF4F45F063D Content-Transfer-Encoding: 7bit Content-Disposition: attachment; filename=signature.asc Content-Type: application/pgp-signature; name=signature.asc Content-Description: Message signed with OpenPGP -----BEGIN PGP SIGNATURE----- Comment: GPGTools - http://gpgtools.org iQIzBAEBCAAdFiEEDb73u6ZejP5ZMprvcqXauRfMH+AFAlxcoi0ACgkQcqXauRfM H+Br1xAAsaoIq0oEek4GRkq9imu5m1RWe9hgbYQ9BOuO44yLjF0lqZbuJO0d7trM yFLJAX0X4SId5TVeQOHFwSBj4YyVb//Gnlr+tv5zGZhBCTRGZ9cI7pHAurcLG3Rc dLmzFuAgMnymEZxSrvrJzlI9NfjVxH3m/b60eDmDh84TxUAUXZG7+EdL0xKDz5Xc bMvuYTnkYaLPenFU/cEUfzogLBo4Jg6/V0VLmTc2f3eOPQwRCieuDgfdIRqvyQKH 6i30HBVSmxmoy0MN3Clh9/4Ca4t0SfU9xYAhwxxhsFRy/ye8wAnQV/g4rJl+43V+ jVDh9rolQ/hwjRFQwoCtKXWM4aBofneHSAS6A70UGZDOcBvZ+3hsglZKzrq3z8OX HNuSq9hDzXoEMJZb2A5f9ZlP/XOflcx4aLJDEYnnlaMqmfN4Hv3OujsUkhAIE5EZ 5mwVNiB4slVnU65S+YzjVk9YxOIjnc+1buehiLG4IxjKlXBMn9brOlFOPBpNwiMf 4K+LPNjdbhdFnGitkwp0TYGO1xKFQS3jaUJAXCkf3iYgYT46qfDs3SzBYX5rcwZ1 ogQVx7YyVK+qkBOR+no44IndRGXU8nafNTTd1lPNHcy6d7FSUFUWw6H4mNzZae6n kHwXur4KgbOWuDoUTQnvh6p+GDrC03YwmAZkyPJMjVL5gSiLaoA= =VlEb -----END PGP SIGNATURE----- --Apple-Mail=_96F9B488-9A64-45D3-B289-0AF4F45F063D--