From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-4.0 required=3.0 tests=DKIMWL_WL_MED,DKIM_SIGNED, DKIM_VALID,HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_PATCH,MAILING_LIST_MULTI, SPF_PASS autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 7446BC169C4 for ; Wed, 6 Feb 2019 21:06:13 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id 1A1BE218D3 for ; Wed, 6 Feb 2019 21:06:13 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=dilger-ca.20150623.gappssmtp.com header.i=@dilger-ca.20150623.gappssmtp.com header.b="LRH9CsPV" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726861AbfBFVGM (ORCPT ); Wed, 6 Feb 2019 16:06:12 -0500 Received: from mail-pg1-f193.google.com ([209.85.215.193]:44184 "EHLO mail-pg1-f193.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726528AbfBFVGM (ORCPT ); Wed, 6 Feb 2019 16:06:12 -0500 Received: by mail-pg1-f193.google.com with SMTP id y1so3451809pgk.11 for ; Wed, 06 Feb 2019 13:06:11 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=dilger-ca.20150623.gappssmtp.com; s=20150623; h=from:message-id:mime-version:subject:date:in-reply-to:cc:to :references; bh=B36JmPpa0yHcWbKu5XQ5IhBVvTg2Bl/TVuJ90zXdIuw=; b=LRH9CsPVID2Lfi73mTFLVDT6UseovB/mLTmjR2TBBngSGiJYi5dBiu98knqqf4xojU UlZQzMTjV2ELwAalsfVycwAW1DsY+/eXqb7NdDoihaL7Y10MA3ulH0d6zuz0ku5Lkyce JhR3ZQ6Alx8sxOa4qUWw0xaH5XhmYe/pp1i0oRB7fQQAkZNUOcQGuUoFl5fSKC2ByZZH P7K3QQqueIAngQTXgl/EtYoTQsIPZCElYhulA6TW00zX0H6jaG476dL2d/CB6myznNNB oVzuQmiu63RknN8dkGRw60SloyhhleFU7LwmIgFo1WUE3LFfe10lvYSR3M+TCRcUPwsm 3/Cg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:message-id:mime-version:subject:date :in-reply-to:cc:to:references; bh=B36JmPpa0yHcWbKu5XQ5IhBVvTg2Bl/TVuJ90zXdIuw=; b=JasivFGtaVQ09ewl3Ft3XMYDT2VjPrb3G0bybljbv9AGu1gLQsSB+po4XN9zxvqq5l T72ydZLGvQP46OVl/sHkXC1UqHwikuoKES92snpyRnXj/XVbJmfa9OA7eCsyuVyQtlky t+H4xkLAjyirwPyJWrd0iduXffLhVzKFjVYVx2p66rE1xgwooQT2InpWa0GpVB4yFCbS mzST3qludobdXwmLUGfX2pJzieM/ViBoqxmDb0EFOtNKJlCz+Ke56ZglukgfwD74Cw3I NIAq8qoZ8sm37DCukBLsx3nQ+K8qNrui3q9sp6YkztdKkjP4oWdbHhyYY9EvnDVSWk2F XfCg== X-Gm-Message-State: AHQUAuahHhqrtzs6jzopzFuHbynFl8e4Cpch7GWG1mLCIBVg7xM8MZaa Ky955wm6tV2udXAvYjHb4CM/2Q== X-Google-Smtp-Source: AHgI3IbBZ+FDwmLV9vQCYOp7jgnxdBLeZcmYpp/3ew+VkPf881Lte63M/2x18ysGHMhLas9AHYZKNQ== X-Received: by 2002:a62:9719:: with SMTP id n25mr12924539pfe.240.1549487171086; Wed, 06 Feb 2019 13:06:11 -0800 (PST) Received: from cabot-wlan.adilger.int (S0106a84e3fe4b223.cg.shawcable.net. [70.77.216.213]) by smtp.gmail.com with ESMTPSA id v11sm146774pfa.49.2019.02.06.13.04.43 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Wed, 06 Feb 2019 13:06:09 -0800 (PST) From: Andreas Dilger Message-Id: <6476E6C9-B896-44B9-A5C6-07A5AB752D9F@dilger.ca> Content-Type: multipart/signed; boundary="Apple-Mail=_C9A997A8-D985-49B1-B5F9-9EDBED80AB16"; protocol="application/pgp-signature"; micalg=pgp-sha256 Mime-Version: 1.0 (Mac OS X Mail 10.3 \(3273\)) Subject: Re: [PATCH 09/10 V2] Use FIEMAP for FIBMAP calls Date: Wed, 6 Feb 2019 14:04:34 -0700 In-Reply-To: <20190206133753.oqpw7citye6apdpr@hades.usersys.redhat.com> Cc: "Darrick J. Wong" , linux-fsdevel , Christoph Hellwig , Eric Sandeen , david@fromorbit.com To: Carlos Maiolino References: <20181205091728.29903-1-cmaiolino@redhat.com> <20181205091728.29903-10-cmaiolino@redhat.com> <20181205173650.GA8112@magnolia> <20190204151147.rra4n7k56ec4ndob@hades.usersys.redhat.com> <20190204182722.GA32119@magnolia> <20190206133753.oqpw7citye6apdpr@hades.usersys.redhat.com> X-Mailer: Apple Mail (2.3273) Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org --Apple-Mail=_C9A997A8-D985-49B1-B5F9-9EDBED80AB16 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset=us-ascii On Feb 6, 2019, at 6:37 AM, Carlos Maiolino = wrote: >>> On Wed, Dec 05, 2018 at 09:36:50AM -0800, Darrick J. Wong wrote: >>>> In any case, I think a better solution to the multi-device problem = is to >>>> start returning device information via struct fiemap_extent, at = least >>>> inside the kernel. Use one of the reserved fields to declare a new >>>> '__u32 fe_device' field in struct fiemap_extent which can be the = dev_t >>>> device number, and then you can check that against = inode->i_sb->s_bdev >>>> to avoid returning results for the non-primary device of a = multi-device >>>> filesystem. >>>=20 >>> I agree we should address it here, but I don't think fiemap_extent = is the right >>> place for it, it is linked to the UAPI, and changing it is usually = not a good >>> idea. >>=20 >> Adding a FIEMAP_EXTENT flag or two to turn one of the fe_reserved = fields >> into some sort of dev_t/per-device cookie should be fine. Userspace >> shouldn't be expecting any meaning in reserved areas. We are already using the __u32 fiemap_extent::fe_reserved[0] as = fe_device for Lustre, to return the server index to userspace for filefrag with = suitable patches. That is needed because a single file may be striped across = multiple servers, and could instead return the dev_t for local multi-device = filesystems. >>> I think I got your idea anyway, but, what if, instead returning the = bdev in >>> fiemap_extent, we instead, send a flag (via fi_flags) to the = filesystem, to >>> idenfify a FIBMAP or a FIEMAP call, and let the filesystem decide = what to do >>> with such information? >>=20 >> I don't like the idea of adding a FIEMAP_FLAG to distinguish callers. >=20 > Ok, may I ask why not? >=20 > My apologies if I am wrong, but, per my understanding, there is = nothing today, > which tells userspace which device belongs the extent map reported by = FIEMAP. > If it belongs to the RT device in XFS, or whatever disk in a raid in = BTRFS, we > simply do not provide such information. So, the goal is to provide a = way to tell > the filesystem if a FIEMAP or a FIBMAP has been requested, so the = current > behavior of both ioctls won't change. >=20 > Enabling filesystems to return device information into fiemap_extent = requires > modification of all filesystems to provide such information, which = will not have > any use other than matching the mounted device to the device where the = extent > is. Filling in the fe_device field is not harmful for existing filesystems, = since it has virtually zero cost (not more than zeroing the field to avoid = leaking kernel data) and older userspace tools would just ignore it. What would be = better than just filling in the fe_device field would be to also add: #define FIEMAP_EXTENT_DEVICE 0x2000 to indicate that fe_device contains a valid value. That tells userspace = that the filesystem is filling in the field, and allows compatibility with older = kernels and allows incremental addition for filesystems that can handle this = (XFS, BtrFS). We haven't added the FIEMAP_EXTENT_DEVICE flag for Lustre, but it would = make sense to do so. > A FIEMAP_FLAG will also require FS changes, but IMHO, less intrusive = than the > device id in fiemap_extent. I don't see much advantage in adding the = device id > instead of using the flag. We also have for Lustre: #define FIEMAP_FLAG_DEVICE_ORDER 0x40000000 which requests that the kernel FIEMAP return the extents for each block = device first rather than in file logical block order. That avoids interleaving = the extents across all of the devices in e.g. 1MB chunks (think RAID-0) = which would force the maximum returned extent size to 1MB even though there are much = larger contiguous extents allocated on each device. Instead, DEVICE_ORDER = returns all of the extents for device 0 first, then device 1 next, etc. This = shows if the on-disk allocation is good or bad, and also fills in the fe_device = field. > A problem I see using a new FIEMAP_FLAG, is it 'could' be also passed = via > userspace, so, it would require a check to make sure it didn't come = from > userspace if ioctl_fiemap() was used. Are you talking about a FIEMAP_FLAG_FIBMAP flag, or about returning the = fe_device field? I think that passing a flag like FIEMAP_FLAG_DEVICE_ORDER from = userspace is fine in this case, because it has a concrete meaning and is not just = an internal flag. > I think there are 2 other possibilities which can be used to fix this. >=20 > - Use a boolean inside fiemap_extent_info to identify a fibmap call, = or, > - If the device id is a must for you, maybe add the device id into > fiemap_extent_info instead of fiemap_extent. So we don't mess with a = UAPI > exported data structure and still provides a way to the filesystems = to provide > which device the mapped extent is in. No, that would mean all of the change, without making it more useful to = userspace. Also, if with only a device per fiemap_extent_info then it won't handle = filesystems that may allocate a single file on multiple devices, such as BtrFS and = Lustre. Cheers, Andreas >>>>=20 >>>>> + >>>>> + return error; >>>>> +} >>>>> + >>>>> /** >>>>> * bmap - find a block number in a file >>>>> * @inode: inode owning the block number being requested >>>>> @@ -1594,10 +1628,14 @@ EXPORT_SYMBOL(iput); >>>>> */ >>>>> int bmap(struct inode *inode, sector_t *block) >>>>> { >>>>> - if (!inode->i_mapping->a_ops->bmap) >>>>> + if (inode->i_op->fiemap) >>>>> + return bmap_fiemap(inode, block); >>>>> + else if (inode->i_mapping->a_ops->bmap) >>>>> + *block =3D = inode->i_mapping->a_ops->bmap(inode->i_mapping, >>>>> + *block); >>>>> + else >>>>> return -EINVAL; >>>>=20 >>>> Waitaminute. btrfs currently supports fiemap but not bmap, and now >>>> suddenly it will support this legacy interface they've never = supported >>>> before. Are they on board with this? >>>>=20 >>>> --D >>>>=20 >>>>>=20 >>>>> - *block =3D inode->i_mapping->a_ops->bmap(inode->i_mapping, = *block); >>>>> return 0; >>>>> } >>>>> EXPORT_SYMBOL(bmap); >>>>> diff --git a/fs/ioctl.c b/fs/ioctl.c >>>>> index 6086978fe01e..bfa59df332bf 100644 >>>>> --- a/fs/ioctl.c >>>>> +++ b/fs/ioctl.c >>>>> @@ -116,6 +116,38 @@ int fiemap_fill_user_extent(struct = fiemap_extent_info *fieinfo, u64 logical, >>>>> return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0; >>>>> } >>>>>=20 >>>>> +int fiemap_fill_kernel_extent(struct fiemap_extent_info *fieinfo, = u64 logical, >>>>> + u64 phys, u64 len, u32 flags) >>>>> +{ >>>>> + struct fiemap_extent *extent =3D fieinfo->fi_extents_start; >>>>> + >>>>> + /* only count the extents */ >>>>> + if (fieinfo->fi_extents_max =3D=3D 0) { >>>>> + fieinfo->fi_extents_mapped++; >>>>> + return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0; >>>>> + } >>>>> + >>>>> + if (fieinfo->fi_extents_mapped >=3D fieinfo->fi_extents_max) >>>>> + return 1; >>>>> + >>>>> + if (flags & SET_UNKNOWN_FLAGS) >>>>> + flags |=3D FIEMAP_EXTENT_UNKNOWN; >>>>> + if (flags & SET_NO_UNMOUNTED_IO_FLAGS) >>>>> + flags |=3D FIEMAP_EXTENT_ENCODED; >>>>> + if (flags & SET_NOT_ALIGNED_FLAGS) >>>>> + flags |=3D FIEMAP_EXTENT_NOT_ALIGNED; >>>>> + >>>>> + extent->fe_logical =3D logical; >>>>> + extent->fe_physical =3D phys; >>>>> + extent->fe_length =3D len; >>>>> + extent->fe_flags =3D flags; >>>>> + >>>>> + fieinfo->fi_extents_mapped++; >>>>> + >>>>> + if (fieinfo->fi_extents_mapped =3D=3D fieinfo->fi_extents_max) >>>>> + return 1; >>>>> + return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0; >>>>> +} >>>>> /** >>>>> * fiemap_fill_next_extent - Fiemap helper function >>>>> * @fieinfo: Fiemap context passed into ->fiemap >>>>> diff --git a/include/linux/fs.h b/include/linux/fs.h >>>>> index 7a434979201c..28bb523d532a 100644 >>>>> --- a/include/linux/fs.h >>>>> +++ b/include/linux/fs.h >>>>> @@ -1711,6 +1711,8 @@ struct fiemap_extent_info { >>>>> fiemap_fill_cb fi_cb; >>>>> }; >>>>>=20 >>>>> +int fiemap_fill_kernel_extent(struct fiemap_extent_info *info, = u64 logical, >>>>> + u64 phys, u64 len, u32 flags); >>>>> int fiemap_fill_next_extent(struct fiemap_extent_info *info, u64 = logical, >>>>> u64 phys, u64 len, u32 flags); >>>>> int fiemap_check_flags(struct fiemap_extent_info *fieinfo, u32 = fs_flags); >>>>> -- >>>>> 2.17.2 >>>>>=20 >>>=20 >>> -- >>> Carlos >=20 > -- > Carlos Cheers, Andreas --Apple-Mail=_C9A997A8-D985-49B1-B5F9-9EDBED80AB16 Content-Transfer-Encoding: 7bit Content-Disposition: attachment; filename=signature.asc Content-Type: application/pgp-signature; name=signature.asc Content-Description: Message signed with OpenPGP -----BEGIN PGP SIGNATURE----- Comment: GPGTools - http://gpgtools.org iQIzBAEBCAAdFiEEDb73u6ZejP5ZMprvcqXauRfMH+AFAlxbS+IACgkQcqXauRfM H+DPxA/+K/yUI47RofoKb9oYSVcYIc9PVICZ5mLDKK9v5wtntSLkcdW+im1VFqjm 3KaEbdD1hzo1GPFmNOlrfqhFbMh11eUgfD+a2zIySzA5pKvW2ZOEDSZUrFXJTdx5 zRRceziadllSdjMBHxtE1RAFqhuVa92eNGIzR9P4ZufU5dgxao9CsLvOmHgI3SNe ZioWTNfqm5hbVImlouNYjkj9R0O28URvjfa8zvvdlICBQiYiy+U3mHwuwgrnqgut MuZJBinNzpQ5EAwN1Ar6gB3Z/O8s6IKGmItbol4uvWwYArdXOmV2R1cVeMGZiY6K QQkSOJNGQzF1l9ogio3JBhexAFaPqXZ9jVQgYvwl79ofurlFzNdKCeb7rej7n/JH LQmjngI2QS/FwZExk5EgnJVWekg82ImoqhWhqef/o9EXATTBQrlagKXvWEr/A7oG bJuBuvayego1VzVxDqknpRB1bMByd7XvaWpANY21UKCPCjcgfpTZJfpXgMUNXSsB CQVr4Rd6zsnYzNxraXfMNdiJnkicDDQ42mIbv2N7weqBIBl+IdEnQdBKeLT4ANNT 1sY9yXOq3sN1bmw+VWwQjFnBM/9PedeEBblkIpHgsOnhjNEsUv2SDE8sfXzRTU1h cHFgyaSQQEaZX/JsQmqdCNZSn1BH5dMGIsHRFMyDmY1FuOPQDto= =U0Mb -----END PGP SIGNATURE----- --Apple-Mail=_C9A997A8-D985-49B1-B5F9-9EDBED80AB16--