Date: Thu, 7 Feb 2019 10:52:33 +0100
From: Carlos Maiolino
To: Andreas Dilger
Cc: "Darrick J. Wong", linux-fsdevel, Christoph Hellwig, Eric Sandeen, david@fromorbit.com
Subject: Re: [PATCH 09/10 V2] Use FIEMAP for FIBMAP calls
Message-ID: <20190207095233.imqha36zgoqilxx2@hades.usersys.redhat.com>
References: <20181205091728.29903-1-cmaiolino@redhat.com> <20181205091728.29903-10-cmaiolino@redhat.com> <20181205173650.GA8112@magnolia> <20190204151147.rra4n7k56ec4ndob@hades.usersys.redhat.com> <20190204182722.GA32119@magnolia> <20190206133753.oqpw7citye6apdpr@hades.usersys.redhat.com> <20190206204431.GB32119@magnolia> <0258844F-A305-4744-8C70-B27A3E49ADEC@dilger.ca>
In-Reply-To: <0258844F-A305-4744-8C70-B27A3E49ADEC@dilger.ca>
X-Mailing-List: linux-fsdevel@vger.kernel.org

> >> If it belongs to the RT device in XFS, or whatever disk in a raid in
> >> BTRFS, we simply do not provide such information.
> >
> > Right...
> >
> >> So, the goal is to provide a way to tell the filesystem if a FIEMAP or
> >> a FIBMAP has been requested, so the current behavior of both ioctls
> >> won't change.
> >
> > ...but from my point of view, the FIEMAP behavior *ought* to change to
> > be more expressive. Once that's done, we can use the more expressive
> > FIEMAP output to solve the problem of FIBMAP vs. multi-disk filesystems.
> >
> > The whole point of having fe_reserved* fields in struct fiemap_extent is
> > so that we can add a new FIEMAP_EXTENT_ flag so that the filesystem can
> > start returning data in a reserved field. New userspace programs that
> > know about the flag can start reading information from the new field if
> > they see the flag, and old userspace programs don't know about the flag
> > and won't be any worse off.
>

Btw, I am not saying I don't like the idea; I do like it. What I was trying to
do was avoid touching the UAPI in this patchset. But... I'll try to implement
your idea here, send it to the list, and raise my shields.

Thanks for the help, Andreas/Darrick.

> Exactly correct.
>
> >> Enabling filesystems to return device information into fiemap_extent
> >> requires modification of all filesystems to provide such information,
> >> which will not have any use other than matching the mounted device to
> >> the device where the extent is.
> >
> > Perhaps it would help for me to present a more concrete proposal:
> >
> > --- a/include/uapi/linux/fiemap.h	2019-01-18 10:53:44.000000000 -0800
> > +++ b/include/uapi/linux/fiemap.h	2019-02-06 12:25:52.813935941 -0800
> > @@ -22,7 +22,19 @@ struct fiemap_extent {
> >  	__u64 fe_length;   /* length in bytes for this extent */
> >  	__u64 fe_reserved64[2];
> >  	__u32 fe_flags;    /* FIEMAP_EXTENT_* flags for this extent */
> > -	__u32 fe_reserved[3];
> > +
> > +	/*
> > +	 * Underlying device that this extent is stored on.
> > +	 *
> > +	 * If FIEMAP_EXTENT_DEV_T is set, this field is a dev_t containing the
> > +	 * major and minor numbers of a device. If FIEMAP_EXTENT_DEV_COOKIE is
> > +	 * set, this field is a 32-bit cookie that can be used to distinguish
> > +	 * between backing devices but has no intrinsic meaning. If neither
> > +	 * EXTENT_DEV flag is set, this field is meaningless. Only one of the
> > +	 * EXTENT_DEV flags may be set at any time.
> > +	 */
> > +	__u32 fe_device;
> > +	__u32 fe_reserved[2];
> >  };
> >
> >  struct fiemap {
> > @@ -66,5 +78,14 @@ struct fiemap {
> >  						    * merged for efficiency. */
> >  #define FIEMAP_EXTENT_SHARED		0x00002000 /* Space shared with other
> >  						    * files. */
> > +#define FIEMAP_EXTENT_DEV_T		0x00004000 /* fe_device is a dev_t
> > +						    * structure containing the
> > +						    * major and minor numbers
> > +						    * of a block device. */
> > +#define FIEMAP_EXTENT_DEV_COOKIE	0x00008000 /* fe_device is a 32-bit
> > +						    * cookie that can be used
> > +						    * to distinguish physical
> > +						    * devices but otherwise
> > +						    * has no meaning. */
> >
> >  #endif /* _LINUX_FIEMAP_H */
> >
> > Under this scheme, XFS can set FIEMAP_EXTENT_DEV_T in fe_flags and start
> > encoding:
> >
> > 	fe_device = new_encode_dev(xfs_get_device_for_file());
> >
> > Some clustered filesystem or whatever could set FIEMAP_EXTENT_DEV_COOKIE
> > and encode the replica number in fe_device.
> >
> > Existing filesystems can be left unchanged, in which case neither
> > EXTENT_DEV flag is set in fe_flags and the bits in fe_device are
> > meaningless, the same as they are today. Reporting fe_device is entirely
> > optional.
>
> I like this better than my plain "FIEMAP_EXTENT_DEVICE" proposal, since it
> allows userspace to distinguish between an actual dev_t and a
> unique-but-locally-meaningless identifier that is needed for network
> filesystems.
>
> Cheers, Andreas
>
> > Userspace programs will now be able to tell which device the file data
> > lives on, which has been sort-of requested for years, if the filesystem
> > chooses to start exporting that information.
> >
> > Your FIBMAP-via-FIEMAP backend can do something like:
> >
> > 	/* FIBMAP only returns results for the same block device backing the fs. */
> > 	if ((fe->fe_flags & EXTENT_DEV_T) && fe->fe_device != inode->i_sb->sb_device)
> > 		return 0;
> >
> > 	/* Can't tell what is the backing device, bail out. */
> > 	if (fe->fe_flags & EXTENT_DEV_COOKIE)
> > 		return 0;
> >
> > 	/*
> > 	 * Either fe_device matches the backing device or the implementation
> > 	 * doesn't tell us about the backing device, so assume it's ok.
> > 	 */
> > 	
> >
> > So that's how I'd solve a longstanding design problem of FIEMAP and then
> > take advantage of that solution to remedy my objections to the proposed
> > "Use FIEMAP for FIBMAP" series. It doesn't require a FIEMAP_FLAG
> > behavior flag that userspace knows about but isn't allowed to pass in.
> >
> >> A FIEMAP_FLAG will also require FS changes, but IMHO, less intrusive
> >> than the device id in fiemap_extent. I don't see much advantage in
> >> adding the device id instead of using the flag.
> >>
> >> A problem I see using a new FIEMAP_FLAG, is it 'could' be also passed via
> >> userspace, so, it would require a check to make sure it didn't come from
> >> userspace if ioctl_fiemap() was used.
> >>
> >> I think there are 2 other possibilities which can be used to fix this.
> >>
> >> - Use a boolean inside fiemap_extent_info to identify a fibmap call, or,
> >> - If the device id is a must for you, maybe add the device id into
> >>   fiemap_extent_info instead of fiemap_extent.
> >
> > That won't work with btrfs, which can store file extents on multiple
> > different physical devices.
> >
> >> So we don't mess with a UAPI exported data structure and still
> >> provide a way for the filesystems to report which device the mapped
> >> extent is in.
> >>
> >> What do you think?
> >>
> >> Cheers
> >>
> >>
> >>>
> >>> --D
> >>>
> >>>>>
> >>>>>> +
> >>>>>> +	return error;
> >>>>>> +}
> >>>>>> +
> >>>>>>  /**
> >>>>>>   * bmap - find a block number in a file
> >>>>>>   * @inode: inode owning the block number being requested
> >>>>>> @@ -1594,10 +1628,14 @@ EXPORT_SYMBOL(iput);
> >>>>>>   */
> >>>>>>  int bmap(struct inode *inode, sector_t *block)
> >>>>>>  {
> >>>>>> -	if (!inode->i_mapping->a_ops->bmap)
> >>>>>> +	if (inode->i_op->fiemap)
> >>>>>> +		return bmap_fiemap(inode, block);
> >>>>>> +	else if (inode->i_mapping->a_ops->bmap)
> >>>>>> +		*block = inode->i_mapping->a_ops->bmap(inode->i_mapping,
> >>>>>> +						       *block);
> >>>>>> +	else
> >>>>>>  		return -EINVAL;
> >>>>>
> >>>>> Waitaminute. btrfs currently supports fiemap but not bmap, and now
> >>>>> suddenly it will support this legacy interface they've never supported
> >>>>> before. Are they on board with this?
> >>>>>
> >>>>> --D
> >>>>>
> >>>>>>
> >>>>>> -	*block = inode->i_mapping->a_ops->bmap(inode->i_mapping, *block);
> >>>>>>  	return 0;
> >>>>>>  }
> >>>>>>  EXPORT_SYMBOL(bmap);
> >>>>>> diff --git a/fs/ioctl.c b/fs/ioctl.c
> >>>>>> index 6086978fe01e..bfa59df332bf 100644
> >>>>>> --- a/fs/ioctl.c
> >>>>>> +++ b/fs/ioctl.c
> >>>>>> @@ -116,6 +116,38 @@ int fiemap_fill_user_extent(struct fiemap_extent_info *fieinfo, u64 logical,
> >>>>>>  	return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
> >>>>>>  }
> >>>>>>
> >>>>>> +int fiemap_fill_kernel_extent(struct fiemap_extent_info *fieinfo, u64 logical,
> >>>>>> +			    u64 phys, u64 len, u32 flags)
> >>>>>> +{
> >>>>>> +	struct fiemap_extent *extent = fieinfo->fi_extents_start;
> >>>>>> +
> >>>>>> +	/* only count the extents */
> >>>>>> +	if (fieinfo->fi_extents_max == 0) {
> >>>>>> +		fieinfo->fi_extents_mapped++;
> >>>>>> +		return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
> >>>>>> +	}
> >>>>>> +
> >>>>>> +	if (fieinfo->fi_extents_mapped >= fieinfo->fi_extents_max)
> >>>>>> +		return 1;
> >>>>>> +
> >>>>>> +	if (flags & SET_UNKNOWN_FLAGS)
> >>>>>> +		flags |= FIEMAP_EXTENT_UNKNOWN;
> >>>>>> +	if (flags & SET_NO_UNMOUNTED_IO_FLAGS)
> >>>>>> +		flags |= FIEMAP_EXTENT_ENCODED;
> >>>>>> +	if (flags & SET_NOT_ALIGNED_FLAGS)
> >>>>>> +		flags |= FIEMAP_EXTENT_NOT_ALIGNED;
> >>>>>> +
> >>>>>> +	extent->fe_logical = logical;
> >>>>>> +	extent->fe_physical = phys;
> >>>>>> +	extent->fe_length = len;
> >>>>>> +	extent->fe_flags = flags;
> >>>>>> +
> >>>>>> +	fieinfo->fi_extents_mapped++;
> >>>>>> +
> >>>>>> +	if (fieinfo->fi_extents_mapped == fieinfo->fi_extents_max)
> >>>>>> +		return 1;
> >>>>>> +	return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
> >>>>>> +}
> >>>>>>  /**
> >>>>>>   * fiemap_fill_next_extent - Fiemap helper function
> >>>>>>   * @fieinfo: Fiemap context passed into ->fiemap
> >>>>>> diff --git a/include/linux/fs.h b/include/linux/fs.h
> >>>>>> index 7a434979201c..28bb523d532a 100644
> >>>>>> --- a/include/linux/fs.h
> >>>>>> +++ b/include/linux/fs.h
> >>>>>> @@ -1711,6 +1711,8 @@ struct fiemap_extent_info {
> >>>>>>  	fiemap_fill_cb fi_cb;
> >>>>>>  };
> >>>>>>
> >>>>>> +int fiemap_fill_kernel_extent(struct fiemap_extent_info *info, u64 logical,
> >>>>>> +			      u64 phys, u64 len, u32 flags);
> >>>>>>  int fiemap_fill_next_extent(struct fiemap_extent_info *info, u64 logical,
> >>>>>>  			    u64 phys, u64 len, u32 flags);
> >>>>>>  int fiemap_check_flags(struct fiemap_extent_info *fieinfo, u32 fs_flags);
> >>>>>> --
> >>>>>> 2.17.2
> >>>>>>
> >>>>
> >>>> --
> >>>> Carlos
> >>
> >> --
> >> Carlos
>
>
> Cheers, Andreas
>
>
>
>
>

-- 
Carlos
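
P.S. For anyone following along, the FIBMAP-side check Darrick describes in
the quoted text could be modelled in plain C roughly as below. This is only a
sketch under stated assumptions: the two flag values are the ones proposed in
the quoted diff and are not in mainline, and the names sketch_extent,
SKETCH_EXTENT_DEV_T, SKETCH_EXTENT_DEV_COOKIE and sketch_fibmap_may_report()
are hypothetical, made up for illustration. It is not the code I intend to
post.

```c
#include <stdbool.h>
#include <stdint.h>

/* Flag values as proposed in this thread; NOT part of mainline fiemap.h. */
#define SKETCH_EXTENT_DEV_T      0x00004000u /* fe_device holds an encoded dev_t */
#define SKETCH_EXTENT_DEV_COOKIE 0x00008000u /* fe_device holds an opaque cookie */

/* Hypothetical, trimmed-down stand-in for struct fiemap_extent. */
struct sketch_extent {
	uint64_t fe_physical; /* physical offset of the extent, in bytes */
	uint32_t fe_flags;    /* extent flags, including the proposed ones */
	uint32_t fe_device;   /* meaningful only if one of the flags above is set */
};

/*
 * Decide whether a FIBMAP-style caller may report this extent.
 * fs_dev is the encoded device number of the device backing the filesystem.
 */
static bool sketch_fibmap_may_report(const struct sketch_extent *fe,
				     uint32_t fs_dev)
{
	/* The extent names a device, and it is not the one backing the fs. */
	if ((fe->fe_flags & SKETCH_EXTENT_DEV_T) && fe->fe_device != fs_dev)
		return false;

	/* An opaque cookie cannot be compared against a device number. */
	if (fe->fe_flags & SKETCH_EXTENT_DEV_COOKIE)
		return false;

	/*
	 * Either the device matches, or the filesystem reported nothing
	 * about the device, so assume the extent is on the backing device.
	 */
	return true;
}
```

A caller would map "false" onto FIBMAP's existing "no mapping" answer (block
0), so filesystems that set neither flag keep behaving as they do today.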