Linux-Fsdevel Archive on lore.kernel.org
 help / color / Atom feed
From: Andreas Dilger <adilger@dilger.ca>
To: Carlos Maiolino <cmaiolino@redhat.com>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>,
	linux-fsdevel@vger.kernel.org, hch@lst.de,
	Eric Sandeen <sandeen@redhat.com>,
	david@fromorbit.com
Subject: Re: [PATCH 09/10 V2] Use FIEMAP for FIBMAP calls
Date: Fri, 7 Dec 2018 13:14:53 -0700
Message-ID: <7C812054-F457-4EDD-A527-71052AFD7C3B@dilger.ca> (raw)
In-Reply-To: <20181207090920.jrnwwb35n6r5546c@hades.usersys.redhat.com>

[-- Attachment #1: Type: text/plain, Size: 4520 bytes --]

On Dec 7, 2018, at 2:09 AM, Carlos Maiolino <cmaiolino@redhat.com> wrote:
> On Wed, Dec 05, 2018 at 09:36:50AM -0800, Darrick J. Wong wrote:
>> 
>>> +
>>> +	*block = (fextent.fe_physical +
>>> +		  (start - fextent.fe_logical)) >> inode->i_blkbits;
>> 
>> Hmmm, so there's nothing here checking that the physical device fiemap
>> reports is the same device that was passed into the mount.  This is
>> trivially true for most of the filesystems that implement bmap and
>> fiemap, but definitely not true for xfs or btrfs.  I would bet most
>> userspace callers of bmap (since it's an ext2-era ioctl) make that
>> assumption and don't even know how to find the device.
>> 
>> On xfs, the bmap implementation won't return any results for realtime
>> files, but it looks as though we suddenly will start doing that here,
>> because in the new bmap implementation we will use fiemap, and fiemap
>> reports extents without providing any context about which device they're
>> on, and that context-less extent gets passed back to bmap_fiemap.
>> 
>> In any case, I think a better solution to the multi-device problem is to
>> start returning device information via struct fiemap_extent, at least
>> inside the kernel.  Use one of the reserved fields to declare a new
>> '__u32 fe_device' field in struct fiemap_extent which can be the dev_t
>> device number, and then you can check that against inode->i_sb->s_bdev
>> to avoid returning results for the non-primary device of a multi-device
>> filesystem.

We're using fe_device = fe_reserved[0] to return the device number for Lustre.
For Lustre, the "device number" is just a server index, since the server's
block device number is irrelevant on the client.  For local filesystems, it
should return the 32-bit st_rdev device number to distinguish devices.

I have patches for e2fsprogs filefrag to print the fe_device field.

> Yes, you are right, I haven't thought about multi-dev filesystems. I checked
> btrfs code and it doesn't even support fibmap, exactly because of this problem,
> I wonder though, why it does support FIEMAP then, maybe because the fiemap idea
> isn't provide a way to userspace do IO directly to the device?!
> 
> I'm not sure if crossing dev information is enough though, I did a quick read of
> btrfs code, and the assumption that the block/extent location won't change on
> time, could lead to a time bomb in the future. I wonder if it wouldn't maybe be
> better to add a flag, to identify the usage type, and the filesystem itself
> would define if it should return anything, or not. Like, for example, passing in
> fieinfo->fi_flags, something like FIEMAP_FIBMAP, and check inside the filesystem
> for it.

The FIEMAP_EXTENT_ENCODED flag is meant to be returned when the extent cannot be
read directly from the block device.

For FIBMAP, it should return an error if ENCODED is set, since this file is not
suitable for directly booting a kernel (LILO is the only user of FIBMAP that I'm
aware of).  The filefrag utility prefers to use FIEMAP for efficiency, and only
falls back to FIBMAP if FIEMAP fails.

> From my shallow understanding of btrfs, looks like the location of the data, can
> be moved inside the same device, so, even if the devices are the same as you
> suggested, there is no guarantee the offset will be the same.

On a related note, btrfs also supports compressed extents, which isn't handled
by the current FIEMAP ioctl properly.  There was a patch proposed ages ago to
add FIEMAP_EXTENT_DATA_COMPRESSED, but didn't _quite_ make it over the finish
line, https://www.spinics.net/lists/xfs/msg24629.html has the last discussion.

It added EXTENT_DATA_COMPRESSED and used fe_reserved64[0] as fe_phys_length to
return the on-disk extent size, while fe_length would rename to fe_logi_length
(with a compat macro) to still represented the logical extent length.

 #define FIEMAP_EXTENT_DATA_COMPRESSED	0x00000040 /* Data compressed by fs.
 						    * Sets EXTENT_DATA_ENCODED */
 -	__u64 fe_reserved64[2];
 +	__u64 fe_phys_length; /* physical length in bytes, undefined if
 +			       * DATA_COMPRESSED not set */

There was some discussion on whether there should be a second flag like
FIEMAP_EXTENT_PHYS_LENGTH that is set when the fe_phys_length field is
valid, independent of whether the data is compressed or not.

Since you are reworking the FIEMAP code anyway, would you be interested to
revive that patch series?

Cheers, Andreas






[-- Attachment #2: Message signed with OpenPGP --]
[-- Type: application/pgp-signature, Size: 873 bytes --]

  reply index

Thread overview: 53+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-12-05  9:17 [PATCH 00/10 V2] New ->fiemap infrastructure and ->bmap removal Carlos Maiolino
2018-12-05  9:17 ` [PATCH 01/10] fs: Enable bmap() function to properly return errors Carlos Maiolino
2018-12-05  9:17 ` [PATCH 02/10] cachefiles: drop direct usage of ->bmap method Carlos Maiolino
2018-12-05  9:17 ` [PATCH 03/10] ecryptfs: drop direct calls to ->bmap Carlos Maiolino
2018-12-05  9:17 ` [PATCH 04/10 V2] fibmap: Use bmap instead of ->bmap method in ioctl_fibmap Carlos Maiolino
2019-01-14 16:49   ` Christoph Hellwig
2019-02-04 11:34     ` Carlos Maiolino
2018-12-05  9:17 ` [PATCH 05/10] fs: Move start and length fiemap fields into fiemap_extent_info Carlos Maiolino
2019-01-14 16:50   ` Christoph Hellwig
2018-12-05  9:17 ` [PATCH 06/10] iomap: Remove length and start fields from iomap_fiemap Carlos Maiolino
2019-01-14 16:51   ` Christoph Hellwig
2018-12-05  9:17 ` [PATCH 07/10] fs: Use a void pointer to store fiemap_extent Carlos Maiolino
2019-01-14 16:53   ` Christoph Hellwig
2018-12-05  9:17 ` [PATCH 08/10 V2] fiemap: Use a callback to fill fiemap extents Carlos Maiolino
2019-01-14 16:53   ` Christoph Hellwig
2018-12-05  9:17 ` [PATCH 09/10 V2] Use FIEMAP for FIBMAP calls Carlos Maiolino
2018-12-05 17:36   ` Darrick J. Wong
2018-12-07  9:09     ` Carlos Maiolino
2018-12-07 20:14       ` Andreas Dilger [this message]
2019-02-04 15:11     ` Carlos Maiolino
2019-02-04 18:27       ` Darrick J. Wong
2019-02-06 13:37         ` Carlos Maiolino
2019-02-06 20:44           ` Darrick J. Wong
2019-02-06 21:13             ` Andreas Dilger
2019-02-07  9:52               ` Carlos Maiolino
2019-02-08  8:43                 ` Christoph Hellwig
2019-02-11 12:57                   ` Christoph Hellwig
2019-02-11 16:21                     ` Carlos Maiolino
2019-02-11 16:48                       ` Christoph Hellwig
2019-02-07 11:59             ` Carlos Maiolino
2019-02-07 17:02               ` Darrick J. Wong
2019-02-07 21:25                 ` Andreas Dilger
2019-02-08  8:46                   ` Christoph Hellwig
2019-02-08 10:36                     ` Carlos Maiolino
2019-02-08 21:03                       ` Andreas Dilger
2019-02-08  9:08                   ` Carlos Maiolino
2019-02-08  9:03                 ` Carlos Maiolino
2019-02-07 12:36             ` Carlos Maiolino
2019-02-07 18:16               ` Darrick J. Wong
2019-02-08  8:58                 ` Carlos Maiolino
2019-02-06 21:04           ` Andreas Dilger
2019-01-14 16:56   ` Christoph Hellwig
2019-02-05  9:56     ` Carlos Maiolino
2019-02-05 18:25       ` Christoph Hellwig
2019-02-06  9:50         ` Carlos Maiolino
2018-12-05  9:17 ` [PATCH 10/10] xfs: Get rid of ->bmap Carlos Maiolino
2018-12-05 17:37   ` Darrick J. Wong
2018-12-06 13:06     ` Carlos Maiolino
2018-12-06 18:56 ` [PATCH 00/10 V2] New ->fiemap infrastructure and ->bmap removal Andreas Grünbacher
2018-12-07  9:34   ` Carlos Maiolino
2019-01-14 16:50     ` Christoph Hellwig
2019-01-14 17:56       ` Andreas Grünbacher
2019-01-14 17:58         ` Christoph Hellwig

Reply instructions:

You may reply publically to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=7C812054-F457-4EDD-A527-71052AFD7C3B@dilger.ca \
    --to=adilger@dilger.ca \
    --cc=cmaiolino@redhat.com \
    --cc=darrick.wong@oracle.com \
    --cc=david@fromorbit.com \
    --cc=hch@lst.de \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=sandeen@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Linux-Fsdevel Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-fsdevel/0 linux-fsdevel/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-fsdevel linux-fsdevel/ https://lore.kernel.org/linux-fsdevel \
		linux-fsdevel@vger.kernel.org linux-fsdevel@archiver.kernel.org
	public-inbox-index linux-fsdevel


Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-fsdevel


AGPL code for this site: git clone https://public-inbox.org/ public-inbox