On Dec 7, 2018, at 2:09 AM, Carlos Maiolino wrote: > On Wed, Dec 05, 2018 at 09:36:50AM -0800, Darrick J. Wong wrote: >> >>> + >>> + *block = (fextent.fe_physical + >>> + (start - fextent.fe_logical)) >> inode->i_blkbits; >> >> Hmmm, so there's nothing here checking that the physical device fiemap >> reports is the same device that was passed into the mount. This is >> trivially true for most of the filesystems that implement bmap and >> fiemap, but definitely not true for xfs or btrfs. I would bet most >> userspace callers of bmap (since it's an ext2-era ioctl) make that >> assumption and don't even know how to find the device. >> >> On xfs, the bmap implementation won't return any results for realtime >> files, but it looks as though we suddenly will start doing that here, >> because in the new bmap implementation we will use fiemap, and fiemap >> reports extents without providing any context about which device they're >> on, and that context-less extent gets passed back to bmap_fiemap. >> >> In any case, I think a better solution to the multi-device problem is to >> start returning device information via struct fiemap_extent, at least >> inside the kernel. Use one of the reserved fields to declare a new >> '__u32 fe_device' field in struct fiemap_extent which can be the dev_t >> device number, and then you can check that against inode->i_sb->s_bdev >> to avoid returning results for the non-primary device of a multi-device >> filesystem. We're using fe_device = fe_reserved[0] to return the device number for Lustre. For Lustre, the "device number" is just a server index, since the server's block device number is irrelevant on the client. For local filesystems, it should return the 32-bit st_rdev device number to distinguish devices. I have patches for e2fsprogs filefrag to print the fe_device field. > Yes, you are right, I haven't thought about multi-dev filesystems. I checked > btrfs code and it doesn't even support fibmap, exactly because of this problem, > I wonder though, why it does support FIEMAP then, maybe because the fiemap idea > isn't provide a way to userspace do IO directly to the device?! > > I'm not sure if crossing dev information is enough though, I did a quick read of > btrfs code, and the assumption that the block/extent location won't change on > time, could lead to a time bomb in the future. I wonder if it wouldn't maybe be > better to add a flag, to identify the usage type, and the filesystem itself > would define if it should return anything, or not. Like, for example, passing in > fieinfo->fi_flags, something like FIEMAP_FIBMAP, and check inside the filesystem > for it. The FIEMAP_EXTENT_ENCODED flag is meant to be returned when the extent cannot be read directly from the block device. For FIBMAP, it should return an error if ENCODED is set, since this file is not suitable for directly booting a kernel (LILO is the only user of FIBMAP that I'm aware of). The filefrag utility prefers to use FIEMAP for efficiency, and only falls back to FIBMAP if FIEMAP fails. > From my shallow understanding of btrfs, looks like the location of the data, can > be moved inside the same device, so, even if the devices are the same as you > suggested, there is no guarantee the offset will be the same. On a related note, btrfs also supports compressed extents, which isn't handled by the current FIEMAP ioctl properly. There was a patch proposed ages ago to add FIEMAP_EXTENT_DATA_COMPRESSED, but didn't _quite_ make it over the finish line, https://www.spinics.net/lists/xfs/msg24629.html has the last discussion. It added EXTENT_DATA_COMPRESSED and used fe_reserved64[0] as fe_phys_length to return the on-disk extent size, while fe_length would rename to fe_logi_length (with a compat macro) to still represented the logical extent length. #define FIEMAP_EXTENT_DATA_COMPRESSED 0x00000040 /* Data compressed by fs. * Sets EXTENT_DATA_ENCODED */ - __u64 fe_reserved64[2]; + __u64 fe_phys_length; /* physical length in bytes, undefined if + * DATA_COMPRESSED not set */ There was some discussion on whether there should be a second flag like FIEMAP_EXTENT_PHYS_LENGTH that is set when the fe_phys_length field is valid, independent of whether the data is compressed or not. Since you are reworking the FIEMAP code anyway, would you be interested to revive that patch series? Cheers, Andreas