From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-4.0 required=3.0 tests=DKIMWL_WL_MED,DKIM_SIGNED, DKIM_VALID,HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_PATCH,MAILING_LIST_MULTI, SPF_PASS autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 21959C169C4 for ; Wed, 6 Feb 2019 21:13:39 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id D687C218D9 for ; Wed, 6 Feb 2019 21:13:38 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=dilger-ca.20150623.gappssmtp.com header.i=@dilger-ca.20150623.gappssmtp.com header.b="TVI40QYT" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727099AbfBFVNi (ORCPT ); Wed, 6 Feb 2019 16:13:38 -0500 Received: from mail-pl1-f196.google.com ([209.85.214.196]:32818 "EHLO mail-pl1-f196.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726596AbfBFVNi (ORCPT ); Wed, 6 Feb 2019 16:13:38 -0500 Received: by mail-pl1-f196.google.com with SMTP id z8so1074836plo.0 for ; Wed, 06 Feb 2019 13:13:37 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=dilger-ca.20150623.gappssmtp.com; s=20150623; h=from:message-id:mime-version:subject:date:in-reply-to:cc:to :references; bh=xIsozkgaLtOklhvJLegsn3klrqLCUJqu0KIkO06yRzo=; b=TVI40QYTTJp9xBeJR4GyEAfjRZUaBCtgA47Ae7RGP6KPWRbduETewyxSHobo01q4BX Ky8zh7AZsGPdcRV+jn+pPwcWXNNG0sZagvrqHMOZTShQp8sjF2K5n3D+hD4khm6gOw9V u8qhMFNfgSkaWJV1sdCsx/G9qHOBkp+6yd7OudoRUo5SKr3nbpGbbCBuMuX/Z/h46URg /10HjlDGU40cM05SVggB+J6M6v7bFAvKF6+RIFtX82KUp012A8ugnHSQCylGx6Avq+nI WZPUvD0TmJezj9KZZsjo+dvXZMre0xfyKFnsXR0mQ8/GB3mspWXqFq49Y2yqp19I1d9g kOeA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:message-id:mime-version:subject:date :in-reply-to:cc:to:references; bh=xIsozkgaLtOklhvJLegsn3klrqLCUJqu0KIkO06yRzo=; b=mBllCBFztjFwS1owI3ZDy2jPX+3xbXZgEPC7naTVnMB/dgCfBl+Vj7X+KdcPS4lAEP Aw6ShNL9s1B7ZqQJrU+LegkKTnjF3uJ63RHNAKtTjfdY8S6lTH4pNMOWjxryEFUwZy7s Ts7kWTsjejPrm8R/NZHoeZnCRiCeyi+paly5uhQAozAh0+cRkwWpoWQCT6ePVmt45s+p GW+k9jdGmGMJmGROVJmVEMclcL4+As585pc6Mn+jmqDgSlb2SkzcHOIr3gZSc8CILDIM lOkPjJj9E63s5oHpAnPWx41F8xH1CV/wMP4eR5O06h6tVPh0fXUYSy4KPjGV+/dZb8Om SIFw== X-Gm-Message-State: AHQUAuYEuKsraCQJyaeCI/xIGnOM/rljPd/oFhiJx8iadxkCHsq6HVmz UZAC2sN5QrT6rPyv5XJ7X7AUSw== X-Google-Smtp-Source: AHgI3IZEkDjMCrspPhEc8Y4TYwOsTqki7mCHiKMMynqIokBGYEvqol6z8RU/kKOY1cF3DrKI36jtEQ== X-Received: by 2002:a17:902:b096:: with SMTP id p22mr12725040plr.105.1549487617053; Wed, 06 Feb 2019 13:13:37 -0800 (PST) Received: from cabot-wlan.adilger.int (S0106a84e3fe4b223.cg.shawcable.net. [70.77.216.213]) by smtp.gmail.com with ESMTPSA id u87sm14085691pfi.2.2019.02.06.13.13.35 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Wed, 06 Feb 2019 13:13:36 -0800 (PST) From: Andreas Dilger Message-Id: <0258844F-A305-4744-8C70-B27A3E49ADEC@dilger.ca> Content-Type: multipart/signed; boundary="Apple-Mail=_4B837B82-EC15-453B-8C8F-D42D8D27B2E2"; protocol="application/pgp-signature"; micalg=pgp-sha256 Mime-Version: 1.0 (Mac OS X Mail 10.3 \(3273\)) Subject: Re: [PATCH 09/10 V2] Use FIEMAP for FIBMAP calls Date: Wed, 6 Feb 2019 14:13:32 -0700 In-Reply-To: <20190206204431.GB32119@magnolia> Cc: Carlos Maiolino , linux-fsdevel , Christoph Hellwig , Eric Sandeen , david@fromorbit.com To: "Darrick J. Wong" References: <20181205091728.29903-1-cmaiolino@redhat.com> <20181205091728.29903-10-cmaiolino@redhat.com> <20181205173650.GA8112@magnolia> <20190204151147.rra4n7k56ec4ndob@hades.usersys.redhat.com> <20190204182722.GA32119@magnolia> <20190206133753.oqpw7citye6apdpr@hades.usersys.redhat.com> <20190206204431.GB32119@magnolia> X-Mailer: Apple Mail (2.3273) Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org --Apple-Mail=_4B837B82-EC15-453B-8C8F-D42D8D27B2E2 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset=us-ascii On Feb 6, 2019, at 1:44 PM, Darrick J. Wong = wrote: >=20 > On Wed, Feb 06, 2019 at 02:37:53PM +0100, Carlos Maiolino wrote: >>>>> In any case, I think a better solution to the multi-device problem = is to >>>>> start returning device information via struct fiemap_extent, at = least >>>>> inside the kernel. Use one of the reserved fields to declare a = new >>>>> '__u32 fe_device' field in struct fiemap_extent which can be the = dev_t >>>>> device number, and then you can check that against = inode->i_sb->s_bdev >>>>> to avoid returning results for the non-primary device of a = multi-device >>>>> filesystem. >>>>=20 >>>> I agree we should address it here, but I don't think fiemap_extent = is the right >>>> place for it, it is linked to the UAPI, and changing it is usually = not a good >>>> idea. >>>=20 >>> Adding a FIEMAP_EXTENT flag or two to turn one of the fe_reserved = fields >>> into some sort of dev_t/per-device cookie should be fine. Userspace >>> shouldn't be expecting any meaning in reserved areas. >>>=20 >>>> I think I got your idea anyway, but, what if, instead returning the = bdev in >>>> fiemap_extent, we instead, send a flag (via fi_flags) to the = filesystem, to >>>> idenfify a FIBMAP or a FIEMAP call, and let the filesystem decide = what to do >>>> with such information? >>>=20 >>> I don't like the idea of adding a FIEMAP_FLAG to distinguish = callers. >>=20 >> Ok, may I ask why not? >=20 > I think it's a bad idea to add a flag to FIEMAP to change its behavior > to suit an older and even crappier legacy interface (i.e. FIBMAP). >=20 > FIBMAP is architecturally broken in that we can't /ever/ provide the > context of "which device does this map to?" >=20 > FIEMAP is architecturally deficient as well, but its ioctl structure > definition is flexible enough that we can report "which device does = this > map to". >=20 > I want to enhance FIEMAP to deal with multi-device filesystems > correctly, and as much as I want to kill FIBMAP, I can't because of = zipl > and *lilo. >=20 >> My apologies if I am wrong, but, per my understanding, there is >> nothing today, which tells userspace which device belongs the extent >> map reported by FIEMAP. >=20 > Right... >=20 >> If it belongs to the RT device in XFS, or whatever disk in a raid in >> BTRFS, we simply do not provide such information. >=20 > Right... >=20 >> So, the goal is to provide a way to tell the filesystem if a FIEMAP = or >> a FIBMAP has been requested, so the current behavior of both ioctls >> won't change. >=20 > ...but from my point of view, the FIEMAP behavior *ought* to change to > be more expressive. Once that's done, we can use the more expressive > FIEMAP output to solve the problem of FIBMAP vs. multi-disk = filesystems. >=20 > The whole point of having fe_reserved* fields in struct fiemap_extent = is > so that we can add a new FIEMAP_EXTENT_ flag so that the filesystem = can > start returning data in a reserved field. New userspace programs that > know about the flag can start reading information from the new field = if > they see the flag, and old userspace programs don't know about the = flag > and won't be any worse off. Exactly correct. >> Enabling filesystems to return device information into fiemap_extent >> requires modification of all filesystems to provide such information, >> which will not have any use other than matching the mounted device to >> the device where the extent is. >=20 > Perhaps it would help for me to present a more concrete proposal: >=20 > --- a/include/uapi/linux/fiemap.h 2019-01-18 10:53:44.000000000 = -0800 > +++ b/include/uapi/linux/fiemap.h 2019-02-06 12:25:52.813935941 = -0800 > @@ -22,7 +22,19 @@ struct fiemap_extent { > __u64 fe_length; /* length in bytes for this extent */ > __u64 fe_reserved64[2]; > __u32 fe_flags; /* FIEMAP_EXTENT_* flags for this extent */ > - __u32 fe_reserved[3]; > + > + /* > + * Underlying device that this extent is stored on. > + * > + * If FIEMAP_EXTENT_DEV_T is set, this field is a dev_t = containing the > + * major and minor numbers of a device. If = FIEMAP_EXTENT_DEV_COOKIE is > + * set, this field is a 32-bit cookie that can be used to = distinguish > + * between backing devices but has no intrinsic meaning. If = neither > + * EXTENT_DEV flag is set, this field is meaningless. Only one = of the > + * EXTENT_DEV flags may be set at any time. > + */ > + __u32 fe_device; > + __u32 fe_reserved[2]; > }; >=20 > struct fiemap { > @@ -66,5 +78,14 @@ struct fiemap { > * merged for = efficiency. */ > #define FIEMAP_EXTENT_SHARED 0x00002000 /* Space shared with = other > * files. */ > +#define FIEMAP_EXTENT_DEV_T 0x00004000 /* fe_device is a = dev_t > + * structure = containing the > + * major and minor = numbers > + * of a block device. = */ > +#define FIEMAP_EXTENT_DEV_COOKIE 0x00008000 /* fe_device is a = 32-bit > + * cookie that can be = used > + * to distinguish = physical > + * devices but = otherwise > + * has no meaning. */ >=20 > #endif /* _LINUX_FIEMAP_H */ >=20 > Under this scheme, XFS can set FIEMAP_EXTENT_DEV_T in fe_flags and = start > encoding: >=20 > fe_device =3D new_encode_dev(xfs_get_device_for_file()); >=20 > Some clustered filesystem or whatever could set = FIEMAP_EXTENT_DEV_COOKIE > and encode the replica number in fe_device. >=20 > Existing filesystems can be left unchanged, in which case neither > EXTENT_DEV flag is set in fe_flags and the bits in fe_device are > meaningless, the same as they are today. Reporting fe_device is = entirely > optional. I like this better than my plain "FIEMAP_EXTENT_DEVICE" proposal, since = it allows userspace to distinguish between an actual dev_t a unique-but- locally-meaninless identifier that is needed for network filesystems. Cheers, Andreas > Userspace programs will now be able to tell which device the file data > lives on, which has been sort-of requested for years, if the = filesystem > chooses to start exporting that information. >=20 > Your FIBMAP-via-FIEMAP backend can do something like: >=20 > /* FIBMAP only returns results for the same block device backing = the fs. */ > if ((fe->fe_flags & EXTENT_DEV_T) && fe->fe_device !=3D = inode->i_sb->sb_device) > return 0; >=20 > /* Can't tell what is the backing device, bail out. */ > if (fe->fe_flags & EXTENT_DEV_COOKIE) > return 0; >=20 > /* > * Either fe_device matches the backing device or the = implementation > * doesn't tell us about the backing device, so assume it's ok. > */ > >=20 > So that's how I'd solve a longstanding design problem of FIEMAP and = then > take advantage of that solution to remedy my objections to the = proposed > "Use FIEMAP for FIBMAP" series. It doesn't require a FIEMAP_FLAG > behavior flag that userspace knows about but isn't allowed to pass in. >=20 >> A FIEMAP_FLAG will also require FS changes, but IMHO, less intrusive >> than the device id in fiemap_extent. I don't see much advantage in >> adding the device id instead of using the flag. >>=20 >> A problem I see using a new FIEMAP_FLAG, is it 'could' be also passed = via >> userspace, so, it would require a check to make sure it didn't come = from >> userspace if ioctl_fiemap() was used. >>=20 >> I think there are 2 other possibilities which can be used to fix = this. >>=20 >> - Use a boolean inside fiemap_extent_info to identify a fibmap call, = or, >> - If the device id is a must for you, maybe add the device id into >> fiemap_extent_info instead of fiemap_extent. >=20 > That won't work with btrfs, which can store file extents on multiple > different physical devices. >=20 >> So we don't mess with a UAPI exported data structure and still >> provides a way to the filesystems to provide which device the mapped >> extent is in. >>=20 >> What you think? >>=20 >> Cheers >>=20 >>=20 >>>=20 >>> --D >>>=20 >>>>>=20 >>>>>> + >>>>>> + return error; >>>>>> +} >>>>>> + >>>>>> /** >>>>>> * bmap - find a block number in a file >>>>>> * @inode: inode owning the block number being requested >>>>>> @@ -1594,10 +1628,14 @@ EXPORT_SYMBOL(iput); >>>>>> */ >>>>>> int bmap(struct inode *inode, sector_t *block) >>>>>> { >>>>>> - if (!inode->i_mapping->a_ops->bmap) >>>>>> + if (inode->i_op->fiemap) >>>>>> + return bmap_fiemap(inode, block); >>>>>> + else if (inode->i_mapping->a_ops->bmap) >>>>>> + *block =3D = inode->i_mapping->a_ops->bmap(inode->i_mapping, >>>>>> + *block); >>>>>> + else >>>>>> return -EINVAL; >>>>>=20 >>>>> Waitaminute. btrfs currently supports fiemap but not bmap, and = now >>>>> suddenly it will support this legacy interface they've never = supported >>>>> before. Are they on board with this? >>>>>=20 >>>>> --D >>>>>=20 >>>>>>=20 >>>>>> - *block =3D = inode->i_mapping->a_ops->bmap(inode->i_mapping, *block); >>>>>> return 0; >>>>>> } >>>>>> EXPORT_SYMBOL(bmap); >>>>>> diff --git a/fs/ioctl.c b/fs/ioctl.c >>>>>> index 6086978fe01e..bfa59df332bf 100644 >>>>>> --- a/fs/ioctl.c >>>>>> +++ b/fs/ioctl.c >>>>>> @@ -116,6 +116,38 @@ int fiemap_fill_user_extent(struct = fiemap_extent_info *fieinfo, u64 logical, >>>>>> return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0; >>>>>> } >>>>>>=20 >>>>>> +int fiemap_fill_kernel_extent(struct fiemap_extent_info = *fieinfo, u64 logical, >>>>>> + u64 phys, u64 len, u32 flags) >>>>>> +{ >>>>>> + struct fiemap_extent *extent =3D = fieinfo->fi_extents_start; >>>>>> + >>>>>> + /* only count the extents */ >>>>>> + if (fieinfo->fi_extents_max =3D=3D 0) { >>>>>> + fieinfo->fi_extents_mapped++; >>>>>> + return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0; >>>>>> + } >>>>>> + >>>>>> + if (fieinfo->fi_extents_mapped >=3D = fieinfo->fi_extents_max) >>>>>> + return 1; >>>>>> + >>>>>> + if (flags & SET_UNKNOWN_FLAGS) >>>>>> + flags |=3D FIEMAP_EXTENT_UNKNOWN; >>>>>> + if (flags & SET_NO_UNMOUNTED_IO_FLAGS) >>>>>> + flags |=3D FIEMAP_EXTENT_ENCODED; >>>>>> + if (flags & SET_NOT_ALIGNED_FLAGS) >>>>>> + flags |=3D FIEMAP_EXTENT_NOT_ALIGNED; >>>>>> + >>>>>> + extent->fe_logical =3D logical; >>>>>> + extent->fe_physical =3D phys; >>>>>> + extent->fe_length =3D len; >>>>>> + extent->fe_flags =3D flags; >>>>>> + >>>>>> + fieinfo->fi_extents_mapped++; >>>>>> + >>>>>> + if (fieinfo->fi_extents_mapped =3D=3D = fieinfo->fi_extents_max) >>>>>> + return 1; >>>>>> + return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0; >>>>>> +} >>>>>> /** >>>>>> * fiemap_fill_next_extent - Fiemap helper function >>>>>> * @fieinfo: Fiemap context passed into ->fiemap >>>>>> diff --git a/include/linux/fs.h b/include/linux/fs.h >>>>>> index 7a434979201c..28bb523d532a 100644 >>>>>> --- a/include/linux/fs.h >>>>>> +++ b/include/linux/fs.h >>>>>> @@ -1711,6 +1711,8 @@ struct fiemap_extent_info { >>>>>> fiemap_fill_cb fi_cb; >>>>>> }; >>>>>>=20 >>>>>> +int fiemap_fill_kernel_extent(struct fiemap_extent_info *info, = u64 logical, >>>>>> + u64 phys, u64 len, u32 flags); >>>>>> int fiemap_fill_next_extent(struct fiemap_extent_info *info, u64 = logical, >>>>>> u64 phys, u64 len, u32 flags); >>>>>> int fiemap_check_flags(struct fiemap_extent_info *fieinfo, u32 = fs_flags); >>>>>> -- >>>>>> 2.17.2 >>>>>>=20 >>>>=20 >>>> -- >>>> Carlos >>=20 >> -- >> Carlos Cheers, Andreas --Apple-Mail=_4B837B82-EC15-453B-8C8F-D42D8D27B2E2 Content-Transfer-Encoding: 7bit Content-Disposition: attachment; filename=signature.asc Content-Type: application/pgp-signature; name=signature.asc Content-Description: Message signed with OpenPGP -----BEGIN PGP SIGNATURE----- Comment: GPGTools - http://gpgtools.org iQIzBAEBCAAdFiEEDb73u6ZejP5ZMprvcqXauRfMH+AFAlxbTf0ACgkQcqXauRfM H+B/Bg//b5GsZZQOE+vV4G6bFVvwJhyI6ijhVlNkJYVpTFJ6P7SH6CpmkudO5TRg SjVydquhuc9ILM376y1Q+pSlLLnNbVFSEik1NMmp6WjFqnzMN+yCDtHCaLPwfjkm 7y7UqRD359IB6X6l9gZYpmDrWoAe6RKQKCkWK5/cPBVEXI3hzhLF9JUQxgEK4fcU Jr9qC6tvuVLFsV094/BnIbubD87Uo5pWFpYAK/aAIlkiFznyOrYcmmQrxHJXDWni IqOVdr3piL5GtygHO0kfqsNmIiz9NQGEmpqbTnhpLANrkOY8uKV6UeUH0qPuX/61 XWX/rPM39FtUNNJYlwLKHfUBbr8dmejjZMrVZzXWMBb2lBplG50PSLIJf5GbaJYA CwEaWpffz6qg6hneLhLNwvD9nQPd0qzy6235iF/Q3Eah0+l1yA9BhtnGw5dPkQSH BOIxQSkm14bmzXMIJj2giE/7cb1hDKw1PIeu04e/nN0vU1mlfxk302ax2zm/g3V0 1yOw/mrozpogZf8yVJERurkc4Y+6VXDTzdYVu9bDC5hhV6dtH9aLVsBiJw+H9+Pf r4f9JctWXpQwJKLpxrGqEIb+L/bv7QH8DE3MR6ePkGnGmw4/m/vkA0XmJ36CRXKR 1wgf/Ce0PZIXAiwdgop3TI3E+TGWaoHZIXQoFhfaQgPrKpHEme4= =q9Af -----END PGP SIGNATURE----- --Apple-Mail=_4B837B82-EC15-453B-8C8F-D42D8D27B2E2--