[PATCH 00/10 V2] New ->fiemap infrastructure and ->bmap removal

linux-fsdevel.vger.kernel.org archive mirror

* [PATCH 00/10 V2] New ->fiemap infrastructure and ->bmap removal
@ 2018-12-05  9:17 Carlos Maiolino
  2018-12-05  9:17 ` [PATCH 01/10] fs: Enable bmap() function to properly return errors Carlos Maiolino
                   ` (10 more replies)
  0 siblings, 11 replies; 53+ messages in thread
From: Carlos Maiolino @ 2018-12-05  9:17 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: hch, adilger, sandeen, david

Hi.

This is the second version of the complete series, whose goal is to remove the
->bmap interface completely in favor of FIEMAP.

This new version has been heavily modified compared with the first one, based
on comments from Christoph and Andreas, and it has also been simplified. My
apologies if I forgot to update anything based on the previous discussion.

Patches 1-3 are unchanged from the previous version, and Christoph's
Reviewed-by tags have been kept.

Patch 4 is a V2 of the one in the previous set, with the updates requested by
Christoph (a local sector_t variable, and moving the patch earlier in the
series).

Patches 5-9 are essentially the modification of ->fiemap and deprecation of
->bmap.

In this new version, I kept the current fiemap_extent_info structure and moved
into it all the data required by both FIEMAP and FIBMAP, instead of creating a
new data structure as in the old patch set.
This way, no heavy modification is required in individual filesystems. They
only need to update their fiemap methods to drop the start and len arguments,
since these are now passed in fiemap_extent_info.

The last patch removes the ->bmap method from XFS. I decided not to include the
previous patch for ext4 because it needs more ext4 internal knowledge, which I
don't have.

Comments are much appreciated.

Cheers

P.S.  I thought about merging patch 7 into patch 8, and patch 6 into patch 5,
but I decided to leave them as they are for now and get comments on the current
layout. If people agree, I'll merge them.

Carlos Maiolino (10):
  fs: Enable bmap() function to properly return errors
  cachefiles: drop direct usage of ->bmap method.
  ecryptfs: drop direct calls to ->bmap
  fibmap: Use bmap instead of ->bmap method in ioctl_fibmap
  fs: Move start and length fiemap fields into fiemap_extent_info
  iomap: Remove length and start fields from iomap_fiemap
  fs: Use a void pointer to store fiemap_extent
  fiemap: Use a callback to fill fiemap extents
  Use FIEMAP for FIBMAP calls
  xfs: Get rid of ->bmap

 drivers/md/md-bitmap.c |  16 +++---
 fs/bad_inode.c         |   3 +-
 fs/btrfs/inode.c       |   5 +-
 fs/cachefiles/rdwr.c   |  27 +++++-----
 fs/ecryptfs/mmap.c     |  16 +++---
 fs/ext2/ext2.h         |   3 +-
 fs/ext2/inode.c        |   6 +--
 fs/ext4/ext4.h         |   3 +-
 fs/ext4/extents.c      |   8 +--
 fs/f2fs/data.c         |   5 +-
 fs/f2fs/f2fs.h         |   3 +-
 fs/gfs2/inode.c        |   5 +-
 fs/hpfs/file.c         |   4 +-
 fs/inode.c             |  68 +++++++++++++++++++-----
 fs/ioctl.c             | 114 +++++++++++++++++++++++++++++------------
 fs/iomap.c             |   4 +-
 fs/jbd2/journal.c      |  22 +++++---
 fs/nilfs2/inode.c      |   5 +-
 fs/nilfs2/nilfs.h      |   3 +-
 fs/ocfs2/extent_map.c  |   5 +-
 fs/ocfs2/extent_map.h  |   3 +-
 fs/overlayfs/inode.c   |   5 +-
 fs/xfs/xfs_aops.c      |  24 ---------
 fs/xfs/xfs_iops.c      |  14 ++---
 fs/xfs/xfs_trace.h     |   1 -
 include/linux/fs.h     |  31 +++++++----
 include/linux/iomap.h  |   2 +-
 mm/page_io.c           |  11 ++--
 28 files changed, 248 insertions(+), 168 deletions(-)

-- 
2.17.2


* [PATCH 01/10] fs: Enable bmap() function to properly return errors
  2018-12-05  9:17 [PATCH 00/10 V2] New ->fiemap infrastructure and ->bmap removal Carlos Maiolino
@ 2018-12-05  9:17 ` Carlos Maiolino
  2018-12-05  9:17 ` [PATCH 02/10] cachefiles: drop direct usage of ->bmap method Carlos Maiolino
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 53+ messages in thread
From: Carlos Maiolino @ 2018-12-05  9:17 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: hch, adilger, sandeen, david

Currently, bmap() returns either the physical block number related to
the requested file offset, or 0 if an error occurs or the requested
offset maps into a hole.
This patch makes the changes needed for bmap() to properly return
errors: the return value now carries only the error status, and a
pointer must be passed to bmap() to be filled with the mapped physical
block.

This changes the behavior of bmap() on return:

- negative value in case of error
- zero on success or if the mapping fell into a hole

In case of a hole, *block will be zero too.

Since this is a prep patch, for now the only error return is -EINVAL if
->bmap doesn't exist.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
---
 drivers/md/md-bitmap.c | 16 ++++++++++------
 fs/inode.c             | 30 +++++++++++++++++-------------
 fs/jbd2/journal.c      | 22 +++++++++++++++-------
 include/linux/fs.h     |  2 +-
 mm/page_io.c           | 11 +++++++----
 5 files changed, 50 insertions(+), 31 deletions(-)
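
The calling convention described in the commit message can be sketched in
plain userspace C. This is only an illustration under stated assumptions, not
the kernel code: toy_inode, toy_bmap(), toy_linear_bmap() and toy_lookup() are
hypothetical names, and sector_t is assumed to be a 64-bit unsigned type.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t sector_t;	/* assumption: stand-in for the kernel type */

/* Hypothetical toy inode: it only carries an optional mapping method. */
struct toy_inode {
	sector_t (*bmap)(const struct toy_inode *inode, sector_t block);
};

/* New-style helper: status in the return value, result through *block. */
int toy_bmap(const struct toy_inode *inode, sector_t *block)
{
	assert(inode != NULL && block != NULL);

	if (!inode->bmap)
		return -EINVAL;		/* no mapping method at all */

	*block = inode->bmap(inode, *block);	/* 0 means "hole" */
	return 0;
}

/* Example method: file blocks 0..7 live at 100..107, the rest are holes. */
sector_t toy_linear_bmap(const struct toy_inode *inode, sector_t block)
{
	(void)inode;
	return block < 8 ? 100 + block : 0;
}

/* Caller pattern from the patch: an error and a hole are both unusable. */
int toy_lookup(const struct toy_inode *inode, sector_t file_block,
	       sector_t *phys)
{
	sector_t block = file_block;
	int ret = toy_bmap(inode, &block);

	if (ret || !block)
		return -EIO;

	*phys = block;
	return 0;
}
```

Callers that cannot use a hole, such as the journal and swapfile paths touched
here, test both the return value and the resulting block, which is what
toy_lookup() mimics.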

diff --git a/drivers/md/md-bitmap.c b/drivers/md/md-bitmap.c
index 1cd4f991792c..0668b2dd290e 100644
--- a/drivers/md/md-bitmap.c
+++ b/drivers/md/md-bitmap.c
@@ -363,7 +363,7 @@ static int read_page(struct file *file, unsigned long index,
 	int ret = 0;
 	struct inode *inode = file_inode(file);
 	struct buffer_head *bh;
-	sector_t block;
+	sector_t block, blk_cur;
 
 	pr_debug("read bitmap file (%dB @ %llu)\n", (int)PAGE_SIZE,
 		 (unsigned long long)index << PAGE_SHIFT);
@@ -374,17 +374,21 @@ static int read_page(struct file *file, unsigned long index,
 		goto out;
 	}
 	attach_page_buffers(page, bh);
-	block = index << (PAGE_SHIFT - inode->i_blkbits);
+	blk_cur = index << (PAGE_SHIFT - inode->i_blkbits);
 	while (bh) {
+		block = blk_cur;
+
 		if (count == 0)
 			bh->b_blocknr = 0;
 		else {
-			bh->b_blocknr = bmap(inode, block);
-			if (bh->b_blocknr == 0) {
-				/* Cannot use this file! */
+			ret = bmap(inode, &block);
+			if (ret || !block) {
 				ret = -EINVAL;
+				bh->b_blocknr = 0;
 				goto out;
 			}
+
+			bh->b_blocknr = block;
 			bh->b_bdev = inode->i_sb->s_bdev;
 			if (count < (1<<inode->i_blkbits))
 				count = 0;
@@ -398,7 +402,7 @@ static int read_page(struct file *file, unsigned long index,
 			set_buffer_mapped(bh);
 			submit_bh(REQ_OP_READ, 0, bh);
 		}
-		block++;
+		blk_cur++;
 		bh = bh->b_this_page;
 	}
 	page->index = index;
diff --git a/fs/inode.c b/fs/inode.c
index 0cd47fe0dbe5..db681d310465 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1580,21 +1580,25 @@ EXPORT_SYMBOL(iput);
 
 /**
  *	bmap	- find a block number in a file
- *	@inode: inode of file
- *	@block: block to find
- *
- *	Returns the block number on the device holding the inode that
- *	is the disk block number for the block of the file requested.
- *	That is, asked for block 4 of inode 1 the function will return the
- *	disk block relative to the disk start that holds that block of the
- *	file.
+ *	@inode:  inode owning the block number being requested
+ *	@block:  pointer containing the block to find
+ *
+ *	Replaces the value in *block with the block number, on the device
+ *	holding the inode, that corresponds to the requested block of the file.
+ *	That is, asked for block 4 of inode 1, the function will replace the
+ *	4 in *block with the disk block, relative to the disk start, that holds
+ *	that block of the file.
+ *
+ *	Returns -EINVAL in case of error, 0 otherwise. If the mapping falls
+ *	into a hole, returns 0 and *block is also set to 0.
  */
-sector_t bmap(struct inode *inode, sector_t block)
+int bmap(struct inode *inode, sector_t *block)
 {
-	sector_t res = 0;
-	if (inode->i_mapping->a_ops->bmap)
-		res = inode->i_mapping->a_ops->bmap(inode->i_mapping, block);
-	return res;
+	if (!inode->i_mapping->a_ops->bmap)
+		return -EINVAL;
+
+	*block = inode->i_mapping->a_ops->bmap(inode->i_mapping, *block);
+	return 0;
 }
 EXPORT_SYMBOL(bmap);
 
diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index 8ef6b6daaa7a..7acaf6f55404 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -814,18 +814,23 @@ int jbd2_journal_bmap(journal_t *journal, unsigned long blocknr,
 {
 	int err = 0;
 	unsigned long long ret;
+	sector_t block = 0;
 
 	if (journal->j_inode) {
-		ret = bmap(journal->j_inode, blocknr);
-		if (ret)
-			*retp = ret;
-		else {
+		block = blocknr;
+		ret = bmap(journal->j_inode, &block);
+
+		if (ret || !block) {
 			printk(KERN_ALERT "%s: journal block not found "
 					"at offset %lu on %s\n",
 			       __func__, blocknr, journal->j_devname);
 			err = -EIO;
 			__journal_abort_soft(journal, err);
+
+		} else {
+			*retp = block;
 		}
+
 	} else {
 		*retp = blocknr; /* +journal->j_blk_offset */
 	}
@@ -1251,11 +1256,14 @@ journal_t *jbd2_journal_init_dev(struct block_device *bdev,
 journal_t *jbd2_journal_init_inode(struct inode *inode)
 {
 	journal_t *journal;
+	sector_t blocknr;
 	char *p;
-	unsigned long long blocknr;
+	int err = 0;
+
+	blocknr = 0;
+	err = bmap(inode, &blocknr);
 
-	blocknr = bmap(inode, 0);
-	if (!blocknr) {
+	if (err || !blocknr) {
 		pr_err("%s: Cannot locate journal superblock\n",
 			__func__);
 		return NULL;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 5f7f67bd33a5..71dac6e00e27 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2811,7 +2811,7 @@ static inline ssize_t generic_write_sync(struct kiocb *iocb, ssize_t count)
 extern void emergency_sync(void);
 extern void emergency_remount(void);
 #ifdef CONFIG_BLOCK
-extern sector_t bmap(struct inode *, sector_t);
+extern int bmap(struct inode *, sector_t *);
 #endif
 extern int notify_change(struct dentry *, struct iattr *, struct inode **);
 extern int inode_permission(struct inode *, int);
diff --git a/mm/page_io.c b/mm/page_io.c
index 57572ff46016..b52e3e2ec13c 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -177,8 +177,9 @@ int generic_swapfile_activate(struct swap_info_struct *sis,
 
 		cond_resched();
 
-		first_block = bmap(inode, probe_block);
-		if (first_block == 0)
+		first_block = probe_block;
+		ret = bmap(inode, &first_block);
+		if (ret || !first_block)
 			goto bad_bmap;
 
 		/*
@@ -193,9 +194,11 @@ int generic_swapfile_activate(struct swap_info_struct *sis,
 					block_in_page++) {
 			sector_t block;
 
-			block = bmap(inode, probe_block + block_in_page);
-			if (block == 0)
+			block = probe_block + block_in_page;
+			ret = bmap(inode, &block);
+			if (ret || !block)
 				goto bad_bmap;
+
 			if (block != first_block + block_in_page) {
 				/* Discontiguity */
 				probe_block++;
-- 
2.17.2


* [PATCH 02/10] cachefiles: drop direct usage of ->bmap method.
  2018-12-05  9:17 [PATCH 00/10 V2] New ->fiemap infrastructure and ->bmap removal Carlos Maiolino
  2018-12-05  9:17 ` [PATCH 01/10] fs: Enable bmap() function to properly return errors Carlos Maiolino
@ 2018-12-05  9:17 ` Carlos Maiolino
  2018-12-05  9:17 ` [PATCH 03/10] ecryptfs: drop direct calls to ->bmap Carlos Maiolino
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 53+ messages in thread
From: Carlos Maiolino @ 2018-12-05  9:17 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: hch, adilger, sandeen, david

Replace the direct use of the ->bmap method with a bmap() call.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
---
 fs/cachefiles/rdwr.c | 27 ++++++++++++++-------------
 1 file changed, 14 insertions(+), 13 deletions(-)

diff --git a/fs/cachefiles/rdwr.c b/fs/cachefiles/rdwr.c
index 40f7595aad10..a3ee23fe9269 100644
--- a/fs/cachefiles/rdwr.c
+++ b/fs/cachefiles/rdwr.c
@@ -400,7 +400,7 @@ int cachefiles_read_or_alloc_page(struct fscache_retrieval *op,
 	struct cachefiles_object *object;
 	struct cachefiles_cache *cache;
 	struct inode *inode;
-	sector_t block0, block;
+	sector_t block;
 	unsigned shift;
 	int ret;
 
@@ -416,7 +416,6 @@ int cachefiles_read_or_alloc_page(struct fscache_retrieval *op,
 
 	inode = d_backing_inode(object->backer);
 	ASSERT(S_ISREG(inode->i_mode));
-	ASSERT(inode->i_mapping->a_ops->bmap);
 	ASSERT(inode->i_mapping->a_ops->readpages);
 
 	/* calculate the shift required to use bmap */
@@ -432,12 +431,14 @@ int cachefiles_read_or_alloc_page(struct fscache_retrieval *op,
 	 *   enough for this as it doesn't indicate errors, but it's all we've
 	 *   got for the moment
 	 */
-	block0 = page->index;
-	block0 <<= shift;
+	block = page->index;
+	block <<= shift;
+
+	ret = bmap(inode, &block);
+	ASSERT(!ret);
 
-	block = inode->i_mapping->a_ops->bmap(inode->i_mapping, block0);
 	_debug("%llx -> %llx",
-	       (unsigned long long) block0,
+	       (unsigned long long) (page->index << shift),
 	       (unsigned long long) block);
 
 	if (block) {
@@ -709,7 +710,6 @@ int cachefiles_read_or_alloc_pages(struct fscache_retrieval *op,
 
 	inode = d_backing_inode(object->backer);
 	ASSERT(S_ISREG(inode->i_mode));
-	ASSERT(inode->i_mapping->a_ops->bmap);
 	ASSERT(inode->i_mapping->a_ops->readpages);
 
 	/* calculate the shift required to use bmap */
@@ -726,7 +726,7 @@ int cachefiles_read_or_alloc_pages(struct fscache_retrieval *op,
 
 	ret = space ? -ENODATA : -ENOBUFS;
 	list_for_each_entry_safe(page, _n, pages, lru) {
-		sector_t block0, block;
+		sector_t block;
 
 		/* we assume the absence or presence of the first block is a
 		 * good enough indication for the page as a whole
@@ -734,13 +734,14 @@ int cachefiles_read_or_alloc_pages(struct fscache_retrieval *op,
 		 *   good enough for this as it doesn't indicate errors, but
 		 *   it's all we've got for the moment
 		 */
-		block0 = page->index;
-		block0 <<= shift;
+		block = page->index;
+		block <<= shift;
+
+		ret = bmap(inode, &block);
+		ASSERT(!ret);
 
-		block = inode->i_mapping->a_ops->bmap(inode->i_mapping,
-						      block0);
 		_debug("%llx -> %llx",
-		       (unsigned long long) block0,
+		       (unsigned long long) (page->index << shift),
 		       (unsigned long long) block);
 
 		if (block) {
-- 
2.17.2


* [PATCH 03/10] ecryptfs: drop direct calls to ->bmap
  2018-12-05  9:17 [PATCH 00/10 V2] New ->fiemap infrastructure and ->bmap removal Carlos Maiolino
  2018-12-05  9:17 ` [PATCH 01/10] fs: Enable bmap() function to properly return errors Carlos Maiolino
  2018-12-05  9:17 ` [PATCH 02/10] cachefiles: drop direct usage of ->bmap method Carlos Maiolino
@ 2018-12-05  9:17 ` Carlos Maiolino
  2018-12-05  9:17 ` [PATCH 04/10 V2] fibmap: Use bmap instead of ->bmap method in ioctl_fibmap Carlos Maiolino
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 53+ messages in thread
From: Carlos Maiolino @ 2018-12-05  9:17 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: hch, adilger, sandeen, david

Replace direct ->bmap calls with calls to the bmap() function.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
---
 fs/ecryptfs/mmap.c | 16 ++++++----------
 1 file changed, 6 insertions(+), 10 deletions(-)

diff --git a/fs/ecryptfs/mmap.c b/fs/ecryptfs/mmap.c
index cdf358b209d9..ff323dccef36 100644
--- a/fs/ecryptfs/mmap.c
+++ b/fs/ecryptfs/mmap.c
@@ -538,16 +538,12 @@ static int ecryptfs_write_end(struct file *file,
 
 static sector_t ecryptfs_bmap(struct address_space *mapping, sector_t block)
 {
-	int rc = 0;
-	struct inode *inode;
-	struct inode *lower_inode;
-
-	inode = (struct inode *)mapping->host;
-	lower_inode = ecryptfs_inode_to_lower(inode);
-	if (lower_inode->i_mapping->a_ops->bmap)
-		rc = lower_inode->i_mapping->a_ops->bmap(lower_inode->i_mapping,
-							 block);
-	return rc;
+	struct inode *lower_inode = ecryptfs_inode_to_lower(mapping->host);
+	int ret = bmap(lower_inode, &block);
+
+	if (ret)
+		return 0;
+	return block;
 }
 
 const struct address_space_operations ecryptfs_aops = {
-- 
2.17.2


* [PATCH 04/10 V2] fibmap: Use bmap instead of ->bmap method in ioctl_fibmap
  2018-12-05  9:17 [PATCH 00/10 V2] New ->fiemap infrastructure and ->bmap removal Carlos Maiolino
                   ` (2 preceding siblings ...)
  2018-12-05  9:17 ` [PATCH 03/10] ecryptfs: drop direct calls to ->bmap Carlos Maiolino
@ 2018-12-05  9:17 ` Carlos Maiolino
  2019-01-14 16:49   ` Christoph Hellwig
  2018-12-05  9:17 ` [PATCH 05/10] fs: Move start and length fiemap fields into fiemap_extent_info Carlos Maiolino
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 53+ messages in thread
From: Carlos Maiolino @ 2018-12-05  9:17 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: hch, adilger, sandeen, david

Now that bmap() can properly return errors, use the bmap() function in
ioctl_fibmap() instead of calling the ->bmap method directly.

V2:
	- Use a local sector_t variable to assign the block number
	  instead of using a direct cast.

Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
---
 fs/ioctl.c | 27 +++++++++++++++++----------
 1 file changed, 17 insertions(+), 10 deletions(-)

diff --git a/fs/ioctl.c b/fs/ioctl.c
index d64f622cac8b..e0cc0dd5f9aa 100644
--- a/fs/ioctl.c
+++ b/fs/ioctl.c
@@ -53,19 +53,26 @@ EXPORT_SYMBOL(vfs_ioctl);
 
 static int ioctl_fibmap(struct file *filp, int __user *p)
 {
-	struct address_space *mapping = filp->f_mapping;
-	int res, block;
+	struct inode *inode = file_inode(filp);
+	int error, usr_blk;
+	sector_t block;
 
-	/* do we support this mess? */
-	if (!mapping->a_ops->bmap)
-		return -EINVAL;
 	if (!capable(CAP_SYS_RAWIO))
 		return -EPERM;
-	res = get_user(block, p);
-	if (res)
-		return res;
-	res = mapping->a_ops->bmap(mapping, block);
-	return put_user(res, p);
+
+	error = get_user(usr_blk, p);
+	if (error)
+		return error;
+
+	block = usr_blk;
+	error = bmap(inode, &block);
+	if (error)
+		return error;
+	usr_blk = block;
+
+	error = put_user(usr_blk, p);
+
+	return error;
 }
 
 /**
-- 
2.17.2


* [PATCH 05/10] fs: Move start and length fiemap fields into fiemap_extent_info
  2018-12-05  9:17 [PATCH 00/10 V2] New ->fiemap infrastructure and ->bmap removal Carlos Maiolino
                   ` (3 preceding siblings ...)
  2018-12-05  9:17 ` [PATCH 04/10 V2] fibmap: Use bmap instead of ->bmap method in ioctl_fibmap Carlos Maiolino
@ 2018-12-05  9:17 ` Carlos Maiolino
  2019-01-14 16:50   ` Christoph Hellwig
  2018-12-05  9:17 ` [PATCH 06/10] iomap: Remove length and start fields from iomap_fiemap Carlos Maiolino
                   ` (5 subsequent siblings)
  10 siblings, 1 reply; 53+ messages in thread
From: Carlos Maiolino @ 2018-12-05  9:17 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: hch, adilger, sandeen, david

As part of the overall goal of deprecating FIBMAP, Christoph suggested
reworking the ->fiemap API so that a callback can be passed to it to fill the
fiemap structure (fiemap_fill_next_extent being one such callback).

To avoid adding several arguments to the ->fiemap method, aggregate
everything into a single data structure and pass that along.

This patch isn't supposed to introduce any functional change; it only updates
the filesystems that provide a ->fiemap() method.

Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
---
 fs/bad_inode.c        |  3 +--
 fs/btrfs/inode.c      |  5 +++--
 fs/ext2/ext2.h        |  3 +--
 fs/ext2/inode.c       |  6 ++----
 fs/ext4/ext4.h        |  3 +--
 fs/ext4/extents.c     |  8 ++++----
 fs/f2fs/data.c        |  5 +++--
 fs/f2fs/f2fs.h        |  3 +--
 fs/gfs2/inode.c       |  5 +++--
 fs/hpfs/file.c        |  4 ++--
 fs/ioctl.c            | 16 ++++++++++------
 fs/nilfs2/inode.c     |  5 +++--
 fs/nilfs2/nilfs.h     |  3 +--
 fs/ocfs2/extent_map.c |  5 +++--
 fs/ocfs2/extent_map.h |  3 +--
 fs/overlayfs/inode.c  |  5 ++---
 fs/xfs/xfs_iops.c     | 10 +++++-----
 include/linux/fs.h    | 21 +++++++++++----------
 18 files changed, 57 insertions(+), 56 deletions(-)

diff --git a/fs/bad_inode.c b/fs/bad_inode.c
index 8035d2a44561..21dfaf876814 100644
--- a/fs/bad_inode.c
+++ b/fs/bad_inode.c
@@ -120,8 +120,7 @@ static struct posix_acl *bad_inode_get_acl(struct inode *inode, int type)
 }
 
 static int bad_inode_fiemap(struct inode *inode,
-			    struct fiemap_extent_info *fieinfo, u64 start,
-			    u64 len)
+			    struct fiemap_extent_info *fieinfo)
 {
 	return -EIO;
 }
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 4a2f9f7fd96e..8afa1ac3d5e9 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8619,9 +8619,10 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 
 #define BTRFS_FIEMAP_FLAGS	(FIEMAP_FLAG_SYNC)
 
-static int btrfs_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
-		__u64 start, __u64 len)
+static int btrfs_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo)
 {
+	u64	start = fieinfo->fi_start;
+	u64	len = fieinfo->fi_len;
 	int	ret;
 
 	ret = fiemap_check_flags(fieinfo, BTRFS_FIEMAP_FLAGS);
diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h
index e770cd100a6a..368e3e1f201f 100644
--- a/fs/ext2/ext2.h
+++ b/fs/ext2/ext2.h
@@ -775,8 +775,7 @@ extern void ext2_evict_inode(struct inode *);
 extern int ext2_get_block(struct inode *, sector_t, struct buffer_head *, int);
 extern int ext2_setattr (struct dentry *, struct iattr *);
 extern void ext2_set_inode_flags(struct inode *inode);
-extern int ext2_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
-		       u64 start, u64 len);
+extern int ext2_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo);
 
 /* ioctl.c */
 extern long ext2_ioctl(struct file *, unsigned int, unsigned long);
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index e4bb9386c045..98c932167fa8 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -855,11 +855,9 @@ const struct iomap_ops ext2_iomap_ops = {
 const struct iomap_ops ext2_iomap_ops;
 #endif /* CONFIG_FS_DAX */
 
-int ext2_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
-		u64 start, u64 len)
+int ext2_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo)
 {
-	return generic_block_fiemap(inode, fieinfo, start, len,
-				    ext2_get_block);
+	return generic_block_fiemap(inode, fieinfo, ext2_get_block);
 }
 
 static int ext2_writepage(struct page *page, struct writeback_control *wbc)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 3f89d0ab08fc..4ff340688b7b 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -3140,8 +3140,7 @@ extern struct ext4_ext_path *ext4_find_extent(struct inode *, ext4_lblk_t,
 extern void ext4_ext_drop_refs(struct ext4_ext_path *);
 extern int ext4_ext_check_inode(struct inode *inode);
 extern ext4_lblk_t ext4_ext_next_allocated_block(struct ext4_ext_path *path);
-extern int ext4_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
-			__u64 start, __u64 len);
+extern int ext4_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo);
 extern int ext4_ext_precache(struct inode *inode);
 extern int ext4_collapse_range(struct inode *inode, loff_t offset, loff_t len);
 extern int ext4_insert_range(struct inode *inode, loff_t offset, loff_t len);
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 240b6dea5441..4efd8e5225ec 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -5045,9 +5045,10 @@ static int ext4_xattr_fiemap(struct inode *inode,
 	return (error < 0 ? error : 0);
 }
 
-int ext4_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
-		__u64 start, __u64 len)
+int ext4_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo)
 {
+	u64 start = fieinfo->fi_start;
+	u64 len = fieinfo->fi_len;
 	ext4_lblk_t start_blk;
 	int error = 0;
 
@@ -5069,8 +5070,7 @@ int ext4_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
 
 	/* fallback to generic here if not in extents fmt */
 	if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)))
-		return generic_block_fiemap(inode, fieinfo, start, len,
-			ext4_get_block);
+		return generic_block_fiemap(inode, fieinfo, ext4_get_block);
 
 	if (fiemap_check_flags(fieinfo, EXT4_FIEMAP_FLAGS))
 		return -EBADR;
diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
index b293cb3e27a2..3ef592ebab1c 100644
--- a/fs/f2fs/data.c
+++ b/fs/f2fs/data.c
@@ -1401,9 +1401,10 @@ static int f2fs_xattr_fiemap(struct inode *inode,
 	return (err < 0 ? err : 0);
 }
 
-int f2fs_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
-		u64 start, u64 len)
+int f2fs_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo)
 {
+	u64 start = fieinfo->fi_start;
+	u64 len = fieinfo->fi_len;
 	struct buffer_head map_bh;
 	sector_t start_blk, last_blk;
 	pgoff_t next_pgofs;
diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
index 1e031971a466..c2d0ab85d22a 100644
--- a/fs/f2fs/f2fs.h
+++ b/fs/f2fs/f2fs.h
@@ -3096,8 +3096,7 @@ int f2fs_do_write_data_page(struct f2fs_io_info *fio);
 void __do_map_lock(struct f2fs_sb_info *sbi, int flag, bool lock);
 int f2fs_map_blocks(struct inode *inode, struct f2fs_map_blocks *map,
 			int create, int flag);
-int f2fs_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
-			u64 start, u64 len);
+int f2fs_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo);
 bool f2fs_should_update_inplace(struct inode *inode, struct f2fs_io_info *fio);
 bool f2fs_should_update_outplace(struct inode *inode, struct f2fs_io_info *fio);
 void f2fs_invalidate_page(struct page *page, unsigned int offset,
diff --git a/fs/gfs2/inode.c b/fs/gfs2/inode.c
index 648f0ca1ad57..9669ad6224da 100644
--- a/fs/gfs2/inode.c
+++ b/fs/gfs2/inode.c
@@ -2004,9 +2004,10 @@ static int gfs2_getattr(const struct path *path, struct kstat *stat,
 	return 0;
 }
 
-static int gfs2_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
-		       u64 start, u64 len)
+static int gfs2_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo)
 {
+	u64 start = fieinfo->fi_start;
+	u64 len = fieinfo->fi_len;
 	struct gfs2_inode *ip = GFS2_I(inode);
 	struct gfs2_holder gh;
 	int ret;
diff --git a/fs/hpfs/file.c b/fs/hpfs/file.c
index 1ecec124e76f..0eece4ae1f11 100644
--- a/fs/hpfs/file.c
+++ b/fs/hpfs/file.c
@@ -190,9 +190,9 @@ static sector_t _hpfs_bmap(struct address_space *mapping, sector_t block)
 	return generic_block_bmap(mapping, block, hpfs_get_block);
 }
 
-static int hpfs_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo, u64 start, u64 len)
+static int hpfs_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo)
 {
-	return generic_block_fiemap(inode, fieinfo, start, len, hpfs_get_block);
+	return generic_block_fiemap(inode, fieinfo, hpfs_get_block);
 }
 
 const struct address_space_operations hpfs_aops = {
diff --git a/fs/ioctl.c b/fs/ioctl.c
index e0cc0dd5f9aa..a8eae8916e01 100644
--- a/fs/ioctl.c
+++ b/fs/ioctl.c
@@ -208,6 +208,8 @@ static int ioctl_fiemap(struct file *filp, unsigned long arg)
 	fieinfo.fi_flags = fiemap.fm_flags;
 	fieinfo.fi_extents_max = fiemap.fm_extent_count;
 	fieinfo.fi_extents_start = ufiemap->fm_extents;
+	fieinfo.fi_start = fiemap.fm_start;
+	fieinfo.fi_len = len;
 
 	if (fiemap.fm_extent_count != 0 &&
 	    !access_ok(VERIFY_WRITE, fieinfo.fi_extents_start,
@@ -217,7 +219,7 @@ static int ioctl_fiemap(struct file *filp, unsigned long arg)
 	if (fieinfo.fi_flags & FIEMAP_FLAG_SYNC)
 		filemap_write_and_wait(inode->i_mapping);
 
-	error = inode->i_op->fiemap(inode, &fieinfo, fiemap.fm_start, len);
+	error = inode->i_op->fiemap(inode, &fieinfo);
 	fiemap.fm_flags = fieinfo.fi_flags;
 	fiemap.fm_mapped_extents = fieinfo.fi_extents_mapped;
 	if (copy_to_user(ufiemap, &fiemap, sizeof(fiemap)))
@@ -294,9 +296,11 @@ static inline loff_t blk_to_logical(struct inode *inode, sector_t blk)
  */
 
 int __generic_block_fiemap(struct inode *inode,
-			   struct fiemap_extent_info *fieinfo, loff_t start,
-			   loff_t len, get_block_t *get_block)
+			   struct fiemap_extent_info *fieinfo,
+			   get_block_t *get_block)
 {
+	loff_t start = fieinfo->fi_start;
+	loff_t len = fieinfo->fi_len;
 	struct buffer_head map_bh;
 	sector_t start_blk, last_blk;
 	loff_t isize = i_size_read(inode);
@@ -453,12 +457,12 @@ EXPORT_SYMBOL(__generic_block_fiemap);
  */
 
 int generic_block_fiemap(struct inode *inode,
-			 struct fiemap_extent_info *fieinfo, u64 start,
-			 u64 len, get_block_t *get_block)
+			 struct fiemap_extent_info *fieinfo,
+			 get_block_t *get_block)
 {
 	int ret;
 	inode_lock(inode);
-	ret = __generic_block_fiemap(inode, fieinfo, start, len, get_block);
+	ret = __generic_block_fiemap(inode, fieinfo, get_block);
 	inode_unlock(inode);
 	return ret;
 }
diff --git a/fs/nilfs2/inode.c b/fs/nilfs2/inode.c
index 671085512e0f..1f37d086371c 100644
--- a/fs/nilfs2/inode.c
+++ b/fs/nilfs2/inode.c
@@ -992,9 +992,10 @@ void nilfs_dirty_inode(struct inode *inode, int flags)
 	nilfs_transaction_commit(inode->i_sb); /* never fails */
 }
 
-int nilfs_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
-		 __u64 start, __u64 len)
+int nilfs_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo)
 {
+	u64 start = fieinfo->fi_start;
+	u64 len = fieinfo->fi_len;
 	struct the_nilfs *nilfs = inode->i_sb->s_fs_info;
 	__u64 logical = 0, phys = 0, size = 0;
 	__u32 flags = 0;
diff --git a/fs/nilfs2/nilfs.h b/fs/nilfs2/nilfs.h
index a2f247b6a209..55d1307ed710 100644
--- a/fs/nilfs2/nilfs.h
+++ b/fs/nilfs2/nilfs.h
@@ -276,8 +276,7 @@ extern int nilfs_inode_dirty(struct inode *);
 int nilfs_set_file_dirty(struct inode *inode, unsigned int nr_dirty);
 extern int __nilfs_mark_inode_dirty(struct inode *, int);
 extern void nilfs_dirty_inode(struct inode *, int flags);
-int nilfs_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
-		 __u64 start, __u64 len);
+int nilfs_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo);
 static inline int nilfs_mark_inode_dirty(struct inode *inode)
 {
 	return __nilfs_mark_inode_dirty(inode, I_DIRTY);
diff --git a/fs/ocfs2/extent_map.c b/fs/ocfs2/extent_map.c
index 06cb96462bf9..e01fd38ea935 100644
--- a/fs/ocfs2/extent_map.c
+++ b/fs/ocfs2/extent_map.c
@@ -749,8 +749,7 @@ static int ocfs2_fiemap_inline(struct inode *inode, struct buffer_head *di_bh,
 
 #define OCFS2_FIEMAP_FLAGS	(FIEMAP_FLAG_SYNC)
 
-int ocfs2_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
-		 u64 map_start, u64 map_len)
+int ocfs2_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo)
 {
 	int ret, is_last;
 	u32 mapping_end, cpos;
@@ -759,6 +758,8 @@ int ocfs2_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
 	u64 len_bytes, phys_bytes, virt_bytes;
 	struct buffer_head *di_bh = NULL;
 	struct ocfs2_extent_rec rec;
+	u64 map_start = fieinfo->fi_start;
+	u64 map_len = fieinfo->fi_len;
 
 	ret = fiemap_check_flags(fieinfo, OCFS2_FIEMAP_FLAGS);
 	if (ret)
diff --git a/fs/ocfs2/extent_map.h b/fs/ocfs2/extent_map.h
index 1057586ec19f..793be96099c0 100644
--- a/fs/ocfs2/extent_map.h
+++ b/fs/ocfs2/extent_map.h
@@ -50,8 +50,7 @@ int ocfs2_get_clusters(struct inode *inode, u32 v_cluster, u32 *p_cluster,
 int ocfs2_extent_map_get_blocks(struct inode *inode, u64 v_blkno, u64 *p_blkno,
 				u64 *ret_count, unsigned int *extent_flags);
 
-int ocfs2_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
-		 u64 map_start, u64 map_len);
+int ocfs2_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo);
 
 int ocfs2_overwrite_io(struct inode *inode, struct buffer_head *di_bh,
 		       u64 map_start, u64 map_len);
diff --git a/fs/overlayfs/inode.c b/fs/overlayfs/inode.c
index 6bcc9dedc342..d69c2528673f 100644
--- a/fs/overlayfs/inode.c
+++ b/fs/overlayfs/inode.c
@@ -465,8 +465,7 @@ int ovl_update_time(struct inode *inode, struct timespec64 *ts, int flags)
 	return 0;
 }
 
-static int ovl_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
-		      u64 start, u64 len)
+static int ovl_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo)
 {
 	int err;
 	struct inode *realinode = ovl_inode_real(inode);
@@ -480,7 +479,7 @@ static int ovl_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
 	if (fieinfo->fi_flags & FIEMAP_FLAG_SYNC)
 		filemap_write_and_wait(realinode->i_mapping);
 
-	err = realinode->i_op->fiemap(realinode, fieinfo, start, len);
+	err = realinode->i_op->fiemap(realinode, fieinfo);
 	revert_creds(old_cred);
 
 	return err;
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index f48ffd7a8d3e..1040e8346286 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -1092,12 +1092,12 @@ xfs_vn_update_time(
 
 STATIC int
 xfs_vn_fiemap(
-	struct inode		*inode,
-	struct fiemap_extent_info *fieinfo,
-	u64			start,
-	u64			length)
+	struct inode		  *inode,
+	struct fiemap_extent_info *fieinfo)
 {
-	int			error;
+	u64	start = fieinfo->fi_start;
+	u64	length = fieinfo->fi_len;
+	int	error;
 
 	xfs_ilock(XFS_I(inode), XFS_IOLOCK_SHARED);
 	if (fieinfo->fi_flags & FIEMAP_FLAG_XATTR) {
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 71dac6e00e27..a7ca228bd191 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1696,11 +1696,14 @@ extern bool may_open_dev(const struct path *path);
  * VFS FS_IOC_FIEMAP helper definitions.
  */
 struct fiemap_extent_info {
-	unsigned int fi_flags;		/* Flags as passed from user */
-	unsigned int fi_extents_mapped;	/* Number of mapped extents */
-	unsigned int fi_extents_max;	/* Size of fiemap_extent array */
-	struct fiemap_extent __user *fi_extents_start; /* Start of
-							fiemap_extent array */
+	unsigned int	fi_flags;		/* Flags as passed from user */
+	u64		fi_start;
+	u64		fi_len;
+	unsigned int	fi_extents_mapped;	/* Number of mapped extents */
+	unsigned int	fi_extents_max;		/* Size of fiemap_extent array */
+	struct		fiemap_extent __user *fi_extents_start;	/* Start of
+								   fiemap_extent
+								   array */
 };
 int fiemap_fill_next_extent(struct fiemap_extent_info *info, u64 logical,
 			    u64 phys, u64 len, u32 flags);
@@ -1847,8 +1850,7 @@ struct inode_operations {
 	int (*setattr) (struct dentry *, struct iattr *);
 	int (*getattr) (const struct path *, struct kstat *, u32, unsigned int);
 	ssize_t (*listxattr) (struct dentry *, char *, size_t);
-	int (*fiemap)(struct inode *, struct fiemap_extent_info *, u64 start,
-		      u64 len);
+	int (*fiemap)(struct inode *, struct fiemap_extent_info *);
 	int (*update_time)(struct inode *, struct timespec64 *, int);
 	int (*atomic_open)(struct inode *, struct dentry *,
 			   struct file *, unsigned open_flag,
@@ -3207,11 +3209,10 @@ extern int vfs_readlink(struct dentry *, char __user *, int);
 
 extern int __generic_block_fiemap(struct inode *inode,
 				  struct fiemap_extent_info *fieinfo,
-				  loff_t start, loff_t len,
 				  get_block_t *get_block);
 extern int generic_block_fiemap(struct inode *inode,
-				struct fiemap_extent_info *fieinfo, u64 start,
-				u64 len, get_block_t *get_block);
+				struct fiemap_extent_info *fieinfo,
+				get_block_t *get_block);
 
 extern struct file_system_type *get_filesystem(struct file_system_type *fs);
 extern void put_filesystem(struct file_system_type *fs);
-- 
2.17.2

* [PATCH 06/10] iomap: Remove length and start fields from iomap_fiemap
  2018-12-05  9:17 [PATCH 00/10 V2] New ->fiemap infrastructure and ->bmap removal Carlos Maiolino
                   ` (4 preceding siblings ...)
  2018-12-05  9:17 ` [PATCH 05/10] fs: Move start and length fiemap fields into fiemap_extent_info Carlos Maiolino
@ 2018-12-05  9:17 ` Carlos Maiolino
  2019-01-14 16:51   ` Christoph Hellwig
  2018-12-05  9:17 ` [PATCH 07/10] fs: Use a void pointer to store fiemap_extent Carlos Maiolino
                   ` (4 subsequent siblings)
  10 siblings, 1 reply; 53+ messages in thread
From: Carlos Maiolino @ 2018-12-05  9:17 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: hch, adilger, sandeen, david

fiemap_extent_info now embeds the start and length parameters, so users of
iomap_fiemap() don't need to pass them individually anymore.

Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
---
 fs/gfs2/inode.c       | 4 +---
 fs/iomap.c            | 4 +++-
 fs/xfs/xfs_iops.c     | 8 ++------
 include/linux/iomap.h | 2 +-
 4 files changed, 7 insertions(+), 11 deletions(-)

diff --git a/fs/gfs2/inode.c b/fs/gfs2/inode.c
index 9669ad6224da..f728f00c9998 100644
--- a/fs/gfs2/inode.c
+++ b/fs/gfs2/inode.c
@@ -2006,8 +2006,6 @@ static int gfs2_getattr(const struct path *path, struct kstat *stat,
 
 static int gfs2_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo)
 {
-	u64 start = fieinfo->fi_start;
-	u64 len = fieinfo->fi_len;
 	struct gfs2_inode *ip = GFS2_I(inode);
 	struct gfs2_holder gh;
 	int ret;
@@ -2018,7 +2016,7 @@ static int gfs2_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo)
 	if (ret)
 		goto out;
 
-	ret = iomap_fiemap(inode, fieinfo, start, len, &gfs2_iomap_ops);
+	ret = iomap_fiemap(inode, fieinfo, &gfs2_iomap_ops);
 
 	gfs2_glock_dq_uninit(&gh);
 
diff --git a/fs/iomap.c b/fs/iomap.c
index b0462b363bad..64cee9b6f133 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -1156,9 +1156,11 @@ iomap_fiemap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 }
 
 int iomap_fiemap(struct inode *inode, struct fiemap_extent_info *fi,
-		loff_t start, loff_t len, const struct iomap_ops *ops)
+		 const struct iomap_ops *ops)
 {
 	struct fiemap_ctx ctx;
+	loff_t start = fi->fi_start;
+	loff_t len = fi->fi_len;
 	loff_t ret;
 
 	memset(&ctx, 0, sizeof(ctx));
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 1040e8346286..2b04f69c2c58 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -1095,18 +1095,14 @@ xfs_vn_fiemap(
 	struct inode		  *inode,
 	struct fiemap_extent_info *fieinfo)
 {
-	u64	start = fieinfo->fi_start;
-	u64	length = fieinfo->fi_len;
 	int	error;
 
 	xfs_ilock(XFS_I(inode), XFS_IOLOCK_SHARED);
 	if (fieinfo->fi_flags & FIEMAP_FLAG_XATTR) {
 		fieinfo->fi_flags &= ~FIEMAP_FLAG_XATTR;
-		error = iomap_fiemap(inode, fieinfo, start, length,
-				&xfs_xattr_iomap_ops);
+		error = iomap_fiemap(inode, fieinfo, &xfs_xattr_iomap_ops);
 	} else {
-		error = iomap_fiemap(inode, fieinfo, start, length,
-				&xfs_iomap_ops);
+		error = iomap_fiemap(inode, fieinfo, &xfs_iomap_ops);
 	}
 	xfs_iunlock(XFS_I(inode), XFS_IOLOCK_SHARED);
 
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 9a4258154b25..c2b5542694a2 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -145,7 +145,7 @@ int iomap_truncate_page(struct inode *inode, loff_t pos, bool *did_zero,
 vm_fault_t iomap_page_mkwrite(struct vm_fault *vmf,
 			const struct iomap_ops *ops);
 int iomap_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
-		loff_t start, loff_t len, const struct iomap_ops *ops);
+		const struct iomap_ops *ops);
 loff_t iomap_seek_hole(struct inode *inode, loff_t offset,
 		const struct iomap_ops *ops);
 loff_t iomap_seek_data(struct inode *inode, loff_t offset,
-- 
2.17.2

* [PATCH 07/10] fs: Use a void pointer to store fiemap_extent
  2018-12-05  9:17 [PATCH 00/10 V2] New ->fiemap infrastructure and ->bmap removal Carlos Maiolino
                   ` (5 preceding siblings ...)
  2018-12-05  9:17 ` [PATCH 06/10] iomap: Remove length and start fields from iomap_fiemap Carlos Maiolino
@ 2018-12-05  9:17 ` Carlos Maiolino
  2019-01-14 16:53   ` Christoph Hellwig
  2018-12-05  9:17 ` [PATCH 08/10 V2] fiemap: Use a callback to fill fiemap extents Carlos Maiolino
                   ` (3 subsequent siblings)
  10 siblings, 1 reply; 53+ messages in thread
From: Carlos Maiolino @ 2018-12-05  9:17 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: hch, adilger, sandeen, david

Since fieinfo will carry either a kernel pointer or a user pointer to the
location of struct fiemap_extent, use a void pointer to hold its
address.

Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
---
 include/linux/fs.h | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index a7ca228bd191..16a58dfe09cc 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1701,9 +1701,8 @@ struct fiemap_extent_info {
 	u64		fi_len;
 	unsigned int	fi_extents_mapped;	/* Number of mapped extents */
 	unsigned int	fi_extents_max;		/* Size of fiemap_extent array */
-	struct		fiemap_extent __user *fi_extents_start;	/* Start of
-								   fiemap_extent
-								   array */
+	void		*fi_extents_start;	/* Start of fiemap_extent
+						   array */
 };
 int fiemap_fill_next_extent(struct fiemap_extent_info *info, u64 logical,
 			    u64 phys, u64 len, u32 flags);
-- 
2.17.2

* [PATCH 08/10 V2] fiemap: Use a callback to fill fiemap extents
  2018-12-05  9:17 [PATCH 00/10 V2] New ->fiemap infrastructure and ->bmap removal Carlos Maiolino
                   ` (6 preceding siblings ...)
  2018-12-05  9:17 ` [PATCH 07/10] fs: Use a void pointer to store fiemap_extent Carlos Maiolino
@ 2018-12-05  9:17 ` Carlos Maiolino
  2019-01-14 16:53   ` Christoph Hellwig
  2018-12-05  9:17 ` [PATCH 09/10 V2] Use FIEMAP for FIBMAP calls Carlos Maiolino
                   ` (2 subsequent siblings)
  10 siblings, 1 reply; 53+ messages in thread
From: Carlos Maiolino @ 2018-12-05  9:17 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: hch, adilger, sandeen, david

To enable the fiemap infrastructure to be used by fibmap too, we need a
way to use different helpers to fill extent data, depending on its usage: one
helper to fill extent data stored in user address space (used in fiemap), and
another to fill extent data stored in kernel address space (to be used in
fibmap).

This patch sets up a callback to be used to fill in the extents.
It turns the current fiemap_fill_next_extent into a simple helper that calls
the callback, avoiding unneeded changes to any filesystem, and reuses the
original function as the callback used by FIEMAP.
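
As a rough userspace illustration of the dispatch described above (not the
kernel code itself): the struct and function names below are simplified
stand-ins invented for this sketch, and the filler shown does plain stores
where the FIEMAP callback would copy to user memory.

```c
#include <stdint.h>

struct extent_info;

/* Callback type, modelled on fiemap_fill_cb from the patch. */
typedef int (*fill_cb)(struct extent_info *info, uint64_t logical,
		       uint64_t phys, uint64_t len, uint32_t flags);

/* Simplified stand-in for struct fiemap_extent. */
struct extent {
	uint64_t logical, phys, len;
	uint32_t flags;
};

/* Simplified stand-in for struct fiemap_extent_info. */
struct extent_info {
	unsigned int mapped;	/* extents recorded so far */
	unsigned int max;	/* capacity of the array at start */
	void *start;		/* destination array, user or kernel */
	fill_cb cb;		/* how to store one extent */
};

/* Filler for a destination in ordinary memory: plain stores only. */
int fill_kernel_extent(struct extent_info *info, uint64_t logical,
		       uint64_t phys, uint64_t len, uint32_t flags)
{
	struct extent *dst = info->start;

	if (info->mapped >= info->max)
		return 1;
	dst[info->mapped] = (struct extent){ logical, phys, len, flags };
	info->mapped++;
	/* 1 tells the caller the array is now full. */
	return info->mapped == info->max ? 1 : 0;
}

/* What filesystems keep calling: a thin wrapper around the callback. */
int fill_next_extent(struct extent_info *info, uint64_t logical,
		     uint64_t phys, uint64_t len, uint32_t flags)
{
	return info->cb(info, logical, phys, len, flags);
}
```

The filesystem side only ever sees fill_next_extent(); which filler runs is
decided by whoever sets up the info structure.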

V2:
	- Now based on the rework on fiemap_extent_info (previous was
	  based on fiemap_ctx)

Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
---
 fs/ioctl.c         | 39 +++++++++++++++++++++++----------------
 include/linux/fs.h |  7 +++++++
 2 files changed, 30 insertions(+), 16 deletions(-)

diff --git a/fs/ioctl.c b/fs/ioctl.c
index a8eae8916e01..6086978fe01e 100644
--- a/fs/ioctl.c
+++ b/fs/ioctl.c
@@ -75,25 +75,10 @@ static int ioctl_fibmap(struct file *filp, int __user *p)
 	return error;
 }
 
-/**
- * fiemap_fill_next_extent - Fiemap helper function
- * @fieinfo:	Fiemap context passed into ->fiemap
- * @logical:	Extent logical start offset, in bytes
- * @phys:	Extent physical start offset, in bytes
- * @len:	Extent length, in bytes
- * @flags:	FIEMAP_EXTENT flags that describe this extent
- *
- * Called from file system ->fiemap callback. Will populate extent
- * info as passed in via arguments and copy to user memory. On
- * success, extent count on fieinfo is incremented.
- *
- * Returns 0 on success, -errno on error, 1 if this was the last
- * extent that will fit in user array.
- */
 #define SET_UNKNOWN_FLAGS	(FIEMAP_EXTENT_DELALLOC)
 #define SET_NO_UNMOUNTED_IO_FLAGS	(FIEMAP_EXTENT_DATA_ENCRYPTED)
 #define SET_NOT_ALIGNED_FLAGS	(FIEMAP_EXTENT_DATA_TAIL|FIEMAP_EXTENT_DATA_INLINE)
-int fiemap_fill_next_extent(struct fiemap_extent_info *fieinfo, u64 logical,
+int fiemap_fill_user_extent(struct fiemap_extent_info *fieinfo, u64 logical,
 			    u64 phys, u64 len, u32 flags)
 {
 	struct fiemap_extent extent;
@@ -130,6 +115,27 @@ int fiemap_fill_next_extent(struct fiemap_extent_info *fieinfo, u64 logical,
 		return 1;
 	return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
 }
+
+/**
+ * fiemap_fill_next_extent - Fiemap helper function
+ * @fieinfo:	Fiemap context passed into ->fiemap
+ * @logical:	Extent logical start offset, in bytes
+ * @phys:	Extent physical start offset, in bytes
+ * @len:	Extent length, in bytes
+ * @flags:	FIEMAP_EXTENT flags that describe this extent
+ *
+ * Called from file system ->fiemap callback. Will populate extent
+ * info as passed in via arguments and copy to user memory. On
+ * success, extent count on fieinfo is incremented.
+ *
+ * Returns 0 on success, -errno on error, 1 if this was the last
+ * extent that will fit in user array.
+ */
+int fiemap_fill_next_extent(struct fiemap_extent_info *fieinfo, u64 logical,
+			    u64 phys, u64 len, u32 flags)
+{
+	return fieinfo->fi_cb(fieinfo, logical, phys, len, flags);
+}
 EXPORT_SYMBOL(fiemap_fill_next_extent);
 
 /**
@@ -210,6 +216,7 @@ static int ioctl_fiemap(struct file *filp, unsigned long arg)
 	fieinfo.fi_extents_start = ufiemap->fm_extents;
 	fieinfo.fi_start = fiemap.fm_start;
 	fieinfo.fi_len = len;
+	fieinfo.fi_cb = fiemap_fill_user_extent;
 
 	if (fiemap.fm_extent_count != 0 &&
 	    !access_ok(VERIFY_WRITE, fieinfo.fi_extents_start,
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 16a58dfe09cc..7a434979201c 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -64,6 +64,7 @@ struct fscrypt_operations;
 struct fs_context;
 struct fs_parameter_description;
 struct fsinfo_kparams;
+struct fiemap_extent_info;
 enum fsinfo_attribute;
 
 extern void __init inode_init(void);
@@ -1695,6 +1696,10 @@ extern bool may_open_dev(const struct path *path);
 /*
  * VFS FS_IOC_FIEMAP helper definitions.
  */
+
+typedef int (*fiemap_fill_cb)(struct fiemap_extent_info *fieinfo, u64 logical,
+			      u64 phys, u64 len, u32 flags);
+
 struct fiemap_extent_info {
 	unsigned int	fi_flags;		/* Flags as passed from user */
 	u64		fi_start;
@@ -1703,7 +1708,9 @@ struct fiemap_extent_info {
 	unsigned int	fi_extents_max;		/* Size of fiemap_extent array */
 	void		*fi_extents_start;	/* Start of fiemap_extent
 						   array */
+	fiemap_fill_cb	fi_cb;
 };
+
 int fiemap_fill_next_extent(struct fiemap_extent_info *info, u64 logical,
 			    u64 phys, u64 len, u32 flags);
 int fiemap_check_flags(struct fiemap_extent_info *fieinfo, u32 fs_flags);
-- 
2.17.2

* [PATCH 09/10 V2] Use FIEMAP for FIBMAP calls
  2018-12-05  9:17 [PATCH 00/10 V2] New ->fiemap infrastructure and ->bmap removal Carlos Maiolino
                   ` (7 preceding siblings ...)
  2018-12-05  9:17 ` [PATCH 08/10 V2] fiemap: Use a callback to fill fiemap extents Carlos Maiolino
@ 2018-12-05  9:17 ` Carlos Maiolino
  2018-12-05 17:36   ` Darrick J. Wong
  2019-01-14 16:56   ` Christoph Hellwig
  2018-12-05  9:17 ` [PATCH 10/10] xfs: Get rid of ->bmap Carlos Maiolino
  2018-12-06 18:56 ` [PATCH 00/10 V2] New ->fiemap infrastructure and ->bmap removal Andreas Grünbacher
  10 siblings, 2 replies; 53+ messages in thread
From: Carlos Maiolino @ 2018-12-05  9:17 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: hch, adilger, sandeen, david

Enable the use of the FIEMAP ioctl infrastructure to handle FIBMAP calls.
From now on, ->bmap() methods can start to be removed from filesystems
which already provide ->fiemap().

This adds a new helper - bmap_fiemap() - which is used to fill in the
fiemap request, call into the underlying filesystem and check the flags
set in the extent requested.

Add a new fiemap fill extent callback to handle the in-kernel only
fiemap_extent structure used for FIBMAP.
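
For reference, the block translation done at the end of bmap_fiemap() can be
sketched in plain userspace C as below. The function name extent_to_block is
made up for this illustration; the arithmetic follows the patch, with all
offsets in bytes and blkbits standing in for inode->i_blkbits.

```c
#include <stdint.h>

/*
 * Translate a logical block number into a physical one, given a single
 * extent (fe_logical, fe_physical) that covers the queried offset.
 */
uint64_t extent_to_block(uint64_t block, unsigned int blkbits,
			 uint64_t fe_logical, uint64_t fe_physical)
{
	uint64_t start = block << blkbits;	/* byte offset being queried */

	/* Offset into the extent, added to its physical start. */
	return (fe_physical + (start - fe_logical)) >> blkbits;
}
```

With 4096-byte blocks (blkbits = 12), an extent starting at logical block 2
and physical block 256 maps logical block 5 to physical block 259.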

V2:
	- Now based on the updated fiemap_extent_info,
	- move the fiemap call itself to a new helper

Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
---
 fs/inode.c         | 42 ++++++++++++++++++++++++++++++++++++++++--
 fs/ioctl.c         | 32 ++++++++++++++++++++++++++++++++
 include/linux/fs.h |  2 ++
 3 files changed, 74 insertions(+), 2 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index db681d310465..f07cc183ddbd 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1578,6 +1578,40 @@ void iput(struct inode *inode)
 }
 EXPORT_SYMBOL(iput);
 
+static int bmap_fiemap(struct inode *inode, sector_t *block)
+{
+	struct fiemap_extent_info fieinfo = { 0, };
+	struct fiemap_extent fextent;
+	u64 start = *block << inode->i_blkbits;
+	int error = -EINVAL;
+
+	fextent.fe_logical = 0;
+	fextent.fe_physical = 0;
+	fieinfo.fi_extents_max = 1;
+	fieinfo.fi_extents_mapped = 0;
+	fieinfo.fi_extents_start = &fextent;
+	fieinfo.fi_start = start;
+	fieinfo.fi_len = 1 << inode->i_blkbits;
+	fieinfo.fi_flags = 0;
+	fieinfo.fi_cb = fiemap_fill_kernel_extent;
+
+	error = inode->i_op->fiemap(inode, &fieinfo);
+
+	if (error)
+		return error;
+
+	if (fieinfo.fi_flags & (FIEMAP_EXTENT_UNKNOWN |
+				FIEMAP_EXTENT_ENCODED |
+				FIEMAP_EXTENT_DATA_INLINE |
+				FIEMAP_EXTENT_UNWRITTEN))
+		return -EINVAL;
+
+	*block = (fextent.fe_physical +
+		  (start - fextent.fe_logical)) >> inode->i_blkbits;
+
+	return error;
+}
+
 /**
  *	bmap	- find a block number in a file
  *	@inode:  inode owning the block number being requested
@@ -1594,10 +1628,14 @@ EXPORT_SYMBOL(iput);
  */
 int bmap(struct inode *inode, sector_t *block)
 {
-	if (!inode->i_mapping->a_ops->bmap)
+	if (inode->i_op->fiemap)
+		return bmap_fiemap(inode, block);
+	else if (inode->i_mapping->a_ops->bmap)
+		*block = inode->i_mapping->a_ops->bmap(inode->i_mapping,
+						       *block);
+	else
 		return -EINVAL;
 
-	*block = inode->i_mapping->a_ops->bmap(inode->i_mapping, *block);
 	return 0;
 }
 EXPORT_SYMBOL(bmap);
diff --git a/fs/ioctl.c b/fs/ioctl.c
index 6086978fe01e..bfa59df332bf 100644
--- a/fs/ioctl.c
+++ b/fs/ioctl.c
@@ -116,6 +116,38 @@ int fiemap_fill_user_extent(struct fiemap_extent_info *fieinfo, u64 logical,
 	return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
 }
 
+int fiemap_fill_kernel_extent(struct fiemap_extent_info *fieinfo, u64 logical,
+			    u64 phys, u64 len, u32 flags)
+{
+	struct fiemap_extent *extent = fieinfo->fi_extents_start;
+
+	/* only count the extents */
+	if (fieinfo->fi_extents_max == 0) {
+		fieinfo->fi_extents_mapped++;
+		return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
+	}
+
+	if (fieinfo->fi_extents_mapped >= fieinfo->fi_extents_max)
+		return 1;
+
+	if (flags & SET_UNKNOWN_FLAGS)
+		flags |= FIEMAP_EXTENT_UNKNOWN;
+	if (flags & SET_NO_UNMOUNTED_IO_FLAGS)
+		flags |= FIEMAP_EXTENT_ENCODED;
+	if (flags & SET_NOT_ALIGNED_FLAGS)
+		flags |= FIEMAP_EXTENT_NOT_ALIGNED;
+
+	extent->fe_logical = logical;
+	extent->fe_physical = phys;
+	extent->fe_length = len;
+	extent->fe_flags = flags;
+
+	fieinfo->fi_extents_mapped++;
+
+	if (fieinfo->fi_extents_mapped == fieinfo->fi_extents_max)
+		return 1;
+	return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
+}
 /**
  * fiemap_fill_next_extent - Fiemap helper function
  * @fieinfo:	Fiemap context passed into ->fiemap
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 7a434979201c..28bb523d532a 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1711,6 +1711,8 @@ struct fiemap_extent_info {
 	fiemap_fill_cb	fi_cb;
 };
 
+int fiemap_fill_kernel_extent(struct fiemap_extent_info *info, u64 logical,
+			      u64 phys, u64 len, u32 flags);
 int fiemap_fill_next_extent(struct fiemap_extent_info *info, u64 logical,
 			    u64 phys, u64 len, u32 flags);
 int fiemap_check_flags(struct fiemap_extent_info *fieinfo, u32 fs_flags);
-- 
2.17.2

* [PATCH 10/10] xfs: Get rid of ->bmap
  2018-12-05  9:17 [PATCH 00/10 V2] New ->fiemap infrastructure and ->bmap removal Carlos Maiolino
                   ` (8 preceding siblings ...)
  2018-12-05  9:17 ` [PATCH 09/10 V2] Use FIEMAP for FIBMAP calls Carlos Maiolino
@ 2018-12-05  9:17 ` Carlos Maiolino
  2018-12-05 17:37   ` Darrick J. Wong
  2018-12-06 18:56 ` [PATCH 00/10 V2] New ->fiemap infrastructure and ->bmap removal Andreas Grünbacher
  10 siblings, 1 reply; 53+ messages in thread
From: Carlos Maiolino @ 2018-12-05  9:17 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: hch, adilger, sandeen, david

We don't need ->bmap anymore; its only user was FIBMAP, which no longer
calls it.

Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
---
 fs/xfs/xfs_aops.c  | 24 ------------------------
 fs/xfs/xfs_trace.h |  1 -
 2 files changed, 25 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 338b9d9984e0..26f5bb80d007 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -963,29 +963,6 @@ xfs_vm_releasepage(
 	return iomap_releasepage(page, gfp_mask);
 }
 
-STATIC sector_t
-xfs_vm_bmap(
-	struct address_space	*mapping,
-	sector_t		block)
-{
-	struct xfs_inode	*ip = XFS_I(mapping->host);
-
-	trace_xfs_vm_bmap(ip);
-
-	/*
-	 * The swap code (ab-)uses ->bmap to get a block mapping and then
-	 * bypasses the file system for actual I/O.  We really can't allow
-	 * that on reflinks inodes, so we have to skip out here.  And yes,
-	 * 0 is the magic code for a bmap error.
-	 *
-	 * Since we don't pass back blockdev info, we can't return bmap
-	 * information for rt files either.
-	 */
-	if (xfs_is_reflink_inode(ip) || XFS_IS_REALTIME_INODE(ip))
-		return 0;
-	return iomap_bmap(mapping, block, &xfs_iomap_ops);
-}
-
 STATIC int
 xfs_vm_readpage(
 	struct file		*unused,
@@ -1024,7 +1001,6 @@ const struct address_space_operations xfs_address_space_operations = {
 	.set_page_dirty		= iomap_set_page_dirty,
 	.releasepage		= xfs_vm_releasepage,
 	.invalidatepage		= xfs_vm_invalidatepage,
-	.bmap			= xfs_vm_bmap,
 	.direct_IO		= noop_direct_IO,
 	.migratepage		= iomap_migrate_page,
 	.is_partially_uptodate  = iomap_is_partially_uptodate,
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 3043e5ed6495..d836b9b84aae 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -618,7 +618,6 @@ DEFINE_INODE_EVENT(xfs_readdir);
 #ifdef CONFIG_XFS_POSIX_ACL
 DEFINE_INODE_EVENT(xfs_get_acl);
 #endif
-DEFINE_INODE_EVENT(xfs_vm_bmap);
 DEFINE_INODE_EVENT(xfs_file_ioctl);
 DEFINE_INODE_EVENT(xfs_file_compat_ioctl);
 DEFINE_INODE_EVENT(xfs_ioctl_setattr);
-- 
2.17.2

* Re: [PATCH 09/10 V2] Use FIEMAP for FIBMAP calls
  2018-12-05  9:17 ` [PATCH 09/10 V2] Use FIEMAP for FIBMAP calls Carlos Maiolino
@ 2018-12-05 17:36   ` Darrick J. Wong
  2018-12-07  9:09     ` Carlos Maiolino
  2019-02-04 15:11     ` Carlos Maiolino
  2019-01-14 16:56   ` Christoph Hellwig
  1 sibling, 2 replies; 53+ messages in thread
From: Darrick J. Wong @ 2018-12-05 17:36 UTC (permalink / raw)
  To: Carlos Maiolino; +Cc: linux-fsdevel, hch, adilger, sandeen, david

On Wed, Dec 05, 2018 at 10:17:27AM +0100, Carlos Maiolino wrote:
> Enables the usage of FIEMAP ioctl infrastructure to handle FIBMAP calls.
> From now on, ->bmap() methods can start to be removed from filesystems
> which already provides ->fiemap().
> 
> This adds a new helper - bmap_fiemap() - which is used to fill in the
> fiemap request, call into the underlying filesystem and check the flags
> set in the extent requested.
> 
> Add a new fiemap fill extent callback to handl the in-kernel only
> fiemap_extent structure used for FIBMAP.
> 
> V2:
> 	- Now based on the updated fiemap_extent_info,
> 	- move the fiemap call itself to a new helper
> 
> Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
> ---
>  fs/inode.c         | 42 ++++++++++++++++++++++++++++++++++++++++--
>  fs/ioctl.c         | 32 ++++++++++++++++++++++++++++++++
>  include/linux/fs.h |  2 ++
>  3 files changed, 74 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/inode.c b/fs/inode.c
> index db681d310465..f07cc183ddbd 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -1578,6 +1578,40 @@ void iput(struct inode *inode)
>  }
>  EXPORT_SYMBOL(iput);
>  
> +static int bmap_fiemap(struct inode *inode, sector_t *block)
> +{
> +	struct fiemap_extent_info fieinfo = { 0, };
> +	struct fiemap_extent fextent;
> +	u64 start = *block << inode->i_blkbits;
> +	int error = -EINVAL;
> +
> +	fextent.fe_logical = 0;
> +	fextent.fe_physical = 0;
> +	fieinfo.fi_extents_max = 1;
> +	fieinfo.fi_extents_mapped = 0;
> +	fieinfo.fi_extents_start = &fextent;
> +	fieinfo.fi_start = start;
> +	fieinfo.fi_len = 1 << inode->i_blkbits;
> +	fieinfo.fi_flags = 0;
> +	fieinfo.fi_cb = fiemap_fill_kernel_extent;
> +
> +	error = inode->i_op->fiemap(inode, &fieinfo);
> +
> +	if (error)
> +		return error;
> +
> +	if (fieinfo.fi_flags & (FIEMAP_EXTENT_UNKNOWN |
> +				FIEMAP_EXTENT_ENCODED |
> +				FIEMAP_EXTENT_DATA_INLINE |
> +				FIEMAP_EXTENT_UNWRITTEN))
> +		return -EINVAL;

AFAICT, three of the filesystems that support COW writes (xfs, ocfs2,
and btrfs) do not return bmap results for files with shared blocks.
This check here should include FIEMAP_EXTENT_SHARED since external
overwrites of a COW file block are bad news on btrfs (and ocfs2 and
xfs).
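
A rough sketch of such a mask, as standalone C rather than a change to the
patch: the EX_* names and their numeric values are stand-ins for this
illustration only, and it assumes the flags being tested are the ones
recorded on the returned extent.

```c
#include <errno.h>
#include <stdint.h>

/* Stand-in flag bits for this sketch; not taken from the patch. */
#define EX_UNKNOWN	0x0002u
#define EX_ENCODED	0x0008u
#define EX_DATA_INLINE	0x0200u
#define EX_UNWRITTEN	0x0800u
#define EX_SHARED	0x2000u

/* Extents carrying any of these bits are not safe to hand back via bmap. */
#define EX_BMAP_REJECT	(EX_UNKNOWN | EX_ENCODED | EX_DATA_INLINE | \
			 EX_UNWRITTEN | EX_SHARED)

/* Returns 0 if the extent may be reported, -EINVAL otherwise. */
int bmap_check_extent_flags(uint32_t fe_flags)
{
	return (fe_flags & EX_BMAP_REJECT) ? -EINVAL : 0;
}
```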

> +
> +	*block = (fextent.fe_physical +
> +		  (start - fextent.fe_logical)) >> inode->i_blkbits;

Hmmm, so there's nothing here checking that the physical device fiemap
reports is the same device that was passed into the mount.  This is
trivially true for most of the filesystems that implement bmap and
fiemap, but definitely not true for xfs or btrfs.  I would bet most
userspace callers of bmap (since it's an ext2-era ioctl) make that
assumption and don't even know how to find the device.

On xfs, the bmap implementation won't return any results for realtime
files, but it looks as though we suddenly will start doing that here,
because in the new bmap implementation we will use fiemap, and fiemap
reports extents without providing any context about which device they're
on, and that context-less extent gets passed back to bmap_fiemap.

In any case, I think a better solution to the multi-device problem is to
start returning device information via struct fiemap_extent, at least
inside the kernel.  Use one of the reserved fields to declare a new
'__u32 fe_device' field in struct fiemap_extent which can be the dev_t
device number, and then you can check that against inode->i_sb->s_bdev
to avoid returning results for the non-primary device of a multi-device
filesystem.
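
Something along these lines, purely as an illustration: struct sketch_extent
and check_extent_device() are hypothetical names, fe_device is the proposed
field and does not exist today, and primary_dev stands in for the device
number of inode->i_sb->s_bdev.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical extent record carrying the proposed device field. */
struct sketch_extent {
	uint64_t fe_logical;
	uint64_t fe_physical;
	uint32_t fe_device;	/* proposed: device number holding the extent */
};

/*
 * Refuse to report an extent that lives on anything other than the
 * primary device of the filesystem.
 */
int check_extent_device(const struct sketch_extent *ext, uint32_t primary_dev)
{
	return ext->fe_device == primary_dev ? 0 : -EINVAL;
}
```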

> +
> +	return error;
> +}
> +
>  /**
>   *	bmap	- find a block number in a file
>   *	@inode:  inode owning the block number being requested
> @@ -1594,10 +1628,14 @@ EXPORT_SYMBOL(iput);
>   */
>  int bmap(struct inode *inode, sector_t *block)
>  {
> -	if (!inode->i_mapping->a_ops->bmap)
> +	if (inode->i_op->fiemap)
> +		return bmap_fiemap(inode, block);
> +	else if (inode->i_mapping->a_ops->bmap)
> +		*block = inode->i_mapping->a_ops->bmap(inode->i_mapping,
> +						       *block);
> +	else
>  		return -EINVAL;

Waitaminute.  btrfs currently supports fiemap but not bmap, and now
suddenly it will support this legacy interface they've never supported
before.  Are they on board with this?

--D

>  
> -	*block = inode->i_mapping->a_ops->bmap(inode->i_mapping, *block);
>  	return 0;
>  }
>  EXPORT_SYMBOL(bmap);
> diff --git a/fs/ioctl.c b/fs/ioctl.c
> index 6086978fe01e..bfa59df332bf 100644
> --- a/fs/ioctl.c
> +++ b/fs/ioctl.c
> @@ -116,6 +116,38 @@ int fiemap_fill_user_extent(struct fiemap_extent_info *fieinfo, u64 logical,
>  	return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
>  }
>  
> +int fiemap_fill_kernel_extent(struct fiemap_extent_info *fieinfo, u64 logical,
> +			    u64 phys, u64 len, u32 flags)
> +{
> +	struct fiemap_extent *extent = fieinfo->fi_extents_start;
> +
> +	/* only count the extents */
> +	if (fieinfo->fi_extents_max == 0) {
> +		fieinfo->fi_extents_mapped++;
> +		return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
> +	}
> +
> +	if (fieinfo->fi_extents_mapped >= fieinfo->fi_extents_max)
> +		return 1;
> +
> +	if (flags & SET_UNKNOWN_FLAGS)
> +		flags |= FIEMAP_EXTENT_UNKNOWN;
> +	if (flags & SET_NO_UNMOUNTED_IO_FLAGS)
> +		flags |= FIEMAP_EXTENT_ENCODED;
> +	if (flags & SET_NOT_ALIGNED_FLAGS)
> +		flags |= FIEMAP_EXTENT_NOT_ALIGNED;
> +
> +	extent->fe_logical = logical;
> +	extent->fe_physical = phys;
> +	extent->fe_length = len;
> +	extent->fe_flags = flags;
> +
> +	fieinfo->fi_extents_mapped++;
> +
> +	if (fieinfo->fi_extents_mapped == fieinfo->fi_extents_max)
> +		return 1;
> +	return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
> +}
>  /**
>   * fiemap_fill_next_extent - Fiemap helper function
>   * @fieinfo:	Fiemap context passed into ->fiemap
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 7a434979201c..28bb523d532a 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1711,6 +1711,8 @@ struct fiemap_extent_info {
>  	fiemap_fill_cb	fi_cb;
>  };
>  
> +int fiemap_fill_kernel_extent(struct fiemap_extent_info *info, u64 logical,
> +			      u64 phys, u64 len, u32 flags);
>  int fiemap_fill_next_extent(struct fiemap_extent_info *info, u64 logical,
>  			    u64 phys, u64 len, u32 flags);
>  int fiemap_check_flags(struct fiemap_extent_info *fieinfo, u32 fs_flags);
> -- 
> 2.17.2
> 

* Re: [PATCH 10/10] xfs: Get rid of ->bmap
  2018-12-05  9:17 ` [PATCH 10/10] xfs: Get rid of ->bmap Carlos Maiolino
@ 2018-12-05 17:37   ` Darrick J. Wong
  2018-12-06 13:06     ` Carlos Maiolino
  0 siblings, 1 reply; 53+ messages in thread
From: Darrick J. Wong @ 2018-12-05 17:37 UTC (permalink / raw)
  To: Carlos Maiolino; +Cc: linux-fsdevel, hch, adilger, sandeen, david

On Wed, Dec 05, 2018 at 10:17:28AM +0100, Carlos Maiolino wrote:
> We don't need ->bmap anymore, only usage for it was FIBMAP, which is now
> gone.
> 
> Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
> ---
>  fs/xfs/xfs_aops.c  | 24 ------------------------
>  fs/xfs/xfs_trace.h |  1 -
>  2 files changed, 25 deletions(-)
> 
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index 338b9d9984e0..26f5bb80d007 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -963,29 +963,6 @@ xfs_vm_releasepage(
>  	return iomap_releasepage(page, gfp_mask);
>  }
>  
> -STATIC sector_t
> -xfs_vm_bmap(
> -	struct address_space	*mapping,
> -	sector_t		block)
> -{
> -	struct xfs_inode	*ip = XFS_I(mapping->host);
> -
> -	trace_xfs_vm_bmap(ip);
> -
> -	/*
> -	 * The swap code (ab-)uses ->bmap to get a block mapping and then
> -	 * bypasses the file system for actual I/O.  We really can't allow
> -	 * that on reflinks inodes, so we have to skip out here.  And yes,
> -	 * 0 is the magic code for a bmap error.
> -	 *
> -	 * Since we don't pass back blockdev info, we can't return bmap
> -	 * information for rt files either.
> -	 */
> -	if (xfs_is_reflink_inode(ip) || XFS_IS_REALTIME_INODE(ip))
> -		return 0;
> -	return iomap_bmap(mapping, block, &xfs_iomap_ops);

If you're going to delete this, you might as well kill iomap_bmap too
since xfs is the only user of it.

--D

> -}
> -
>  STATIC int
>  xfs_vm_readpage(
>  	struct file		*unused,
> @@ -1024,7 +1001,6 @@ const struct address_space_operations xfs_address_space_operations = {
>  	.set_page_dirty		= iomap_set_page_dirty,
>  	.releasepage		= xfs_vm_releasepage,
>  	.invalidatepage		= xfs_vm_invalidatepage,
> -	.bmap			= xfs_vm_bmap,
>  	.direct_IO		= noop_direct_IO,
>  	.migratepage		= iomap_migrate_page,
>  	.is_partially_uptodate  = iomap_is_partially_uptodate,
> diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> index 3043e5ed6495..d836b9b84aae 100644
> --- a/fs/xfs/xfs_trace.h
> +++ b/fs/xfs/xfs_trace.h
> @@ -618,7 +618,6 @@ DEFINE_INODE_EVENT(xfs_readdir);
>  #ifdef CONFIG_XFS_POSIX_ACL
>  DEFINE_INODE_EVENT(xfs_get_acl);
>  #endif
> -DEFINE_INODE_EVENT(xfs_vm_bmap);
>  DEFINE_INODE_EVENT(xfs_file_ioctl);
>  DEFINE_INODE_EVENT(xfs_file_compat_ioctl);
>  DEFINE_INODE_EVENT(xfs_ioctl_setattr);
> -- 
> 2.17.2
> 

* Re: [PATCH 10/10] xfs: Get rid of ->bmap
  2018-12-05 17:37   ` Darrick J. Wong
@ 2018-12-06 13:06     ` Carlos Maiolino
  0 siblings, 0 replies; 53+ messages in thread
From: Carlos Maiolino @ 2018-12-06 13:06 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-fsdevel, hch, adilger, sandeen, david

On Wed, Dec 05, 2018 at 09:37:55AM -0800, Darrick J. Wong wrote:
> On Wed, Dec 05, 2018 at 10:17:28AM +0100, Carlos Maiolino wrote:
> > We don't need ->bmap anymore, only usage for it was FIBMAP, which is now
> > gone.
> > 
> > Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
> > ---
> >  fs/xfs/xfs_aops.c  | 24 ------------------------
> >  fs/xfs/xfs_trace.h |  1 -
> >  2 files changed, 25 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> > index 338b9d9984e0..26f5bb80d007 100644
> > --- a/fs/xfs/xfs_aops.c
> > +++ b/fs/xfs/xfs_aops.c
> > @@ -963,29 +963,6 @@ xfs_vm_releasepage(
> >  	return iomap_releasepage(page, gfp_mask);
> >  }
> >  
> > -STATIC sector_t
> > -xfs_vm_bmap(
> > -	struct address_space	*mapping,
> > -	sector_t		block)
> > -{
> > -	struct xfs_inode	*ip = XFS_I(mapping->host);
> > -
> > -	trace_xfs_vm_bmap(ip);
> > -
> > -	/*
> > -	 * The swap code (ab-)uses ->bmap to get a block mapping and then
> > -	 * bypasses the file system for actual I/O.  We really can't allow
> > -	 * that on reflinks inodes, so we have to skip out here.  And yes,
> > -	 * 0 is the magic code for a bmap error.
> > -	 *
> > -	 * Since we don't pass back blockdev info, we can't return bmap
> > -	 * information for rt files either.
> > -	 */
> > -	if (xfs_is_reflink_inode(ip) || XFS_IS_REALTIME_INODE(ip))
> > -		return 0;
> > -	return iomap_bmap(mapping, block, &xfs_iomap_ops);
> 
> If you're going to delete this, you might as well kill iomap_bmap too
> since xfs is the only user of it.
> 

I can certainly do this. If I need to redo a whole V3, I'll add a patch for it;
otherwise, I'll do it after the patch gets integrated.

Thanks for the review Darrick.

> --D
> 
> > -}
> > -
> >  STATIC int
> >  xfs_vm_readpage(
> >  	struct file		*unused,
> > @@ -1024,7 +1001,6 @@ const struct address_space_operations xfs_address_space_operations = {
> >  	.set_page_dirty		= iomap_set_page_dirty,
> >  	.releasepage		= xfs_vm_releasepage,
> >  	.invalidatepage		= xfs_vm_invalidatepage,
> > -	.bmap			= xfs_vm_bmap,
> >  	.direct_IO		= noop_direct_IO,
> >  	.migratepage		= iomap_migrate_page,
> >  	.is_partially_uptodate  = iomap_is_partially_uptodate,
> > diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> > index 3043e5ed6495..d836b9b84aae 100644
> > --- a/fs/xfs/xfs_trace.h
> > +++ b/fs/xfs/xfs_trace.h
> > @@ -618,7 +618,6 @@ DEFINE_INODE_EVENT(xfs_readdir);
> >  #ifdef CONFIG_XFS_POSIX_ACL
> >  DEFINE_INODE_EVENT(xfs_get_acl);
> >  #endif
> > -DEFINE_INODE_EVENT(xfs_vm_bmap);
> >  DEFINE_INODE_EVENT(xfs_file_ioctl);
> >  DEFINE_INODE_EVENT(xfs_file_compat_ioctl);
> >  DEFINE_INODE_EVENT(xfs_ioctl_setattr);
> > -- 
> > 2.17.2
> > 

-- 
Carlos

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 00/10 V2] New ->fiemap infrastructure and ->bmap removal
  2018-12-05  9:17 [PATCH 00/10 V2] New ->fiemap infrastructure and ->bmap removal Carlos Maiolino
                   ` (9 preceding siblings ...)
  2018-12-05  9:17 ` [PATCH 10/10] xfs: Get rid of ->bmap Carlos Maiolino
@ 2018-12-06 18:56 ` Andreas Grünbacher
  2018-12-07  9:34   ` Carlos Maiolino
  10 siblings, 1 reply; 53+ messages in thread
From: Andreas Grünbacher @ 2018-12-06 18:56 UTC (permalink / raw)
  To: cmaiolino, Christoph Hellwig
  Cc: Linux FS-devel Mailing List, Andreas Dilger, sandeen, Dave Chinner

Hi,

Am Mi., 5. Dez. 2018 um 10:18 Uhr schrieb Carlos Maiolino
<cmaiolino@redhat.com>:
> This is the second version of the complete series with the goal to remove ->bmap
> interface completely, in lieu of FIEMAP.

I'm not thrilled by this approach. How about exposing the iomap
operations at the vfs layer (for example, in the super block) and
implementing bmap on top of that instead?

(I realize that xfs has separate iomap operations for xattrs, but that
is only used in its fiemap implementation.)

Thanks,
Andreas

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 09/10 V2] Use FIEMAP for FIBMAP calls
  2018-12-05 17:36   ` Darrick J. Wong
@ 2018-12-07  9:09     ` Carlos Maiolino
  2018-12-07 20:14       ` Andreas Dilger
  2019-02-04 15:11     ` Carlos Maiolino
  1 sibling, 1 reply; 53+ messages in thread
From: Carlos Maiolino @ 2018-12-07  9:09 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-fsdevel, hch, adilger, sandeen, david

On Wed, Dec 05, 2018 at 09:36:50AM -0800, Darrick J. Wong wrote:
> On Wed, Dec 05, 2018 at 10:17:27AM +0100, Carlos Maiolino wrote:
> > Enables the usage of FIEMAP ioctl infrastructure to handle FIBMAP calls.
> > From now on, ->bmap() methods can start to be removed from filesystems
> > which already provides ->fiemap().
> > 
> > This adds a new helper - bmap_fiemap() - which is used to fill in the
> > fiemap request, call into the underlying filesystem and check the flags
> > set in the extent requested.
> > 
> > Add a new fiemap fill extent callback to handl the in-kernel only
> > fiemap_extent structure used for FIBMAP.
> > 
> > V2:
> > 	- Now based on the updated fiemap_extent_info,
> > 	- move the fiemap call itself to a new helper
> > 
> > Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
> > ---
> >  fs/inode.c         | 42 ++++++++++++++++++++++++++++++++++++++++--
> >  fs/ioctl.c         | 32 ++++++++++++++++++++++++++++++++
> >  include/linux/fs.h |  2 ++
> >  3 files changed, 74 insertions(+), 2 deletions(-)
> > 
> > diff --git a/fs/inode.c b/fs/inode.c
> > index db681d310465..f07cc183ddbd 100644
> > --- a/fs/inode.c
> > +++ b/fs/inode.c
> > @@ -1578,6 +1578,40 @@ void iput(struct inode *inode)
> >  }
> >  EXPORT_SYMBOL(iput);
> >  
> > +static int bmap_fiemap(struct inode *inode, sector_t *block)
> > +{
> > +	struct fiemap_extent_info fieinfo = { 0, };
> > +	struct fiemap_extent fextent;
> > +	u64 start = *block << inode->i_blkbits;
> > +	int error = -EINVAL;
> > +
> > +	fextent.fe_logical = 0;
> > +	fextent.fe_physical = 0;
> > +	fieinfo.fi_extents_max = 1;
> > +	fieinfo.fi_extents_mapped = 0;
> > +	fieinfo.fi_extents_start = &fextent;
> > +	fieinfo.fi_start = start;
> > +	fieinfo.fi_len = 1 << inode->i_blkbits;
> > +	fieinfo.fi_flags = 0;
> > +	fieinfo.fi_cb = fiemap_fill_kernel_extent;
> > +
> > +	error = inode->i_op->fiemap(inode, &fieinfo);
> > +
> > +	if (error)
> > +		return error;
> > +
> > +	if (fieinfo.fi_flags & (FIEMAP_EXTENT_UNKNOWN |
> > +				FIEMAP_EXTENT_ENCODED |
> > +				FIEMAP_EXTENT_DATA_INLINE |
> > +				FIEMAP_EXTENT_UNWRITTEN))
> > +		return -EINVAL;
> 
> AFAICT, three of the filesystems that support COW writes (xfs, ocfs2,
> and btrfs) do not return bmap results for files with shared blocks.
> This check here should include FIEMAP_EXTENT_SHARED since external
> overwrites of a COW file block are bad news on btrfs (and ocfs2 and
> xfs).

Yes, it does need to check for FIEMAP_EXTENT_SHARED too. I had it in my plans
but forgot to add it when setting up the flags. Thanks for reminding me.
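
A minimal userspace sketch of the extended check, for illustration only. The
flag values mirror include/uapi/linux/fiemap.h; BMAP_UNSAFE_FLAGS,
struct sketch_extent and bmap_extent_ok() are hypothetical names that do not
exist in the patch. The sketch tests the flags stored in the extent itself,
which is an assumption about the intent of the check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Flag values mirrored from include/uapi/linux/fiemap.h. */
#define FIEMAP_EXTENT_UNKNOWN		0x00000002u
#define FIEMAP_EXTENT_ENCODED		0x00000008u
#define FIEMAP_EXTENT_DATA_INLINE	0x00000200u
#define FIEMAP_EXTENT_UNWRITTEN		0x00000800u
#define FIEMAP_EXTENT_SHARED		0x00002000u

/* Hypothetical name: extent flags that make a FIBMAP answer unsafe. */
#define BMAP_UNSAFE_FLAGS	(FIEMAP_EXTENT_UNKNOWN | \
				 FIEMAP_EXTENT_ENCODED | \
				 FIEMAP_EXTENT_DATA_INLINE | \
				 FIEMAP_EXTENT_UNWRITTEN | \
				 FIEMAP_EXTENT_SHARED)

/* Stand-in for the kernel-side extent; only the field used here. */
struct sketch_extent {
	uint32_t fe_flags;
};

/* Hypothetical helper: true if the extent may be reported through FIBMAP. */
static bool bmap_extent_ok(const struct sketch_extent *ext)
{
	assert(ext != NULL);
	return (ext->fe_flags & BMAP_UNSAFE_FLAGS) == 0;
}
```

With this mask, an extent carrying FIEMAP_EXTENT_SHARED is rejected in the same
way as an unwritten or inline one, so a caller would return -EINVAL for it.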

> 
> > +
> > +	*block = (fextent.fe_physical +
> > +		  (start - fextent.fe_logical)) >> inode->i_blkbits;
> 
> Hmmm, so there's nothing here checking that the physical device fiemap
> reports is the same device that was passed into the mount.  This is
> trivially true for most of the filesystems that implement bmap and
> fiemap, but definitely not true for xfs or btrfs.  I would bet most
> userspace callers of bmap (since it's an ext2-era ioctl) make that
> assumption and don't even know how to find the device.
> 
> On xfs, the bmap implementation won't return any results for realtime
> files, but it looks as though we suddenly will start doing that here,
> because in the new bmap implementation we will use fiemap, and fiemap
> reports extents without providing any context about which device they're
> on, and that context-less extent gets passed back to bmap_fiemap.
> 
> In any case, I think a better solution to the multi-device problem is to
> start returning device information via struct fiemap_extent, at least
> inside the kernel.  Use one of the reserved fields to declare a new
> '__u32 fe_device' field in struct fiemap_extent which can be the dev_t
> device number, and then you can check that against inode->i_sb->s_bdev
> to avoid returning results for the non-primary device of a multi-device
> filesystem.

Yes, you are right, I hadn't thought about multi-device filesystems. I checked
the btrfs code and it doesn't even support FIBMAP, exactly because of this
problem. I wonder, though, why it supports FIEMAP then; maybe because FIEMAP
isn't meant as a way for userspace to do I/O directly to the device?!

I'm not sure cross-checking the device information is enough, though. I did a
quick read of the btrfs code, and assuming that the block/extent location won't
change over time could turn into a time bomb in the future. I wonder if it
wouldn't be better to add a flag to identify the usage type, and let the
filesystem itself decide whether it should return anything or not. For example,
passing something like FIEMAP_FIBMAP in fieinfo->fi_flags, and checking for it
inside the filesystem.

From my shallow understanding of btrfs, it looks like the location of the data
can be moved within the same device, so even if the devices are the same, as you
suggested, there is no guarantee the offset will be the same.
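
A rough userspace sketch of that flag idea, under loud assumptions: the name
FIEMAP_FIBMAP comes from the paragraph above, but its value is made up here,
and the mock_* types and mock_fs_fiemap() are hypothetical stand-ins, not
kernel interfaces. It only shows a filesystem refusing a FIBMAP-originated
request for inodes it cannot answer safely.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical request flag; the value is invented for this sketch. */
#define FIEMAP_FIBMAP	0x10000000u

/* Mock inode state; the fields are assumptions for illustration. */
struct mock_inode {
	bool is_realtime;	/* data lives on a non-primary device */
	bool is_reflinked;	/* blocks may be shared with other files */
};

/* Mock request; only the flags field matters here. */
struct mock_fieinfo {
	uint32_t fi_flags;
};

/* Hypothetical ->fiemap body: reject FIBMAP callers where unsafe. */
static int mock_fs_fiemap(const struct mock_inode *inode,
			  const struct mock_fieinfo *fieinfo)
{
	assert(inode != NULL && fieinfo != NULL);

	if ((fieinfo->fi_flags & FIEMAP_FIBMAP) &&
	    (inode->is_realtime || inode->is_reflinked))
		return -EINVAL;

	/* A real implementation would map and report extents here. */
	return 0;
}
```

The point of the design is that a plain FIEMAP request keeps working for such
inodes, while the decision to refuse FIBMAP stays inside the filesystem.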

Cheers.

> 
> > +
> > +	return error;
> > +}
> > +
> >  /**
> >   *	bmap	- find a block number in a file
> >   *	@inode:  inode owning the block number being requested
> > @@ -1594,10 +1628,14 @@ EXPORT_SYMBOL(iput);
> >   */
> >  int bmap(struct inode *inode, sector_t *block)
> >  {
> > -	if (!inode->i_mapping->a_ops->bmap)
> > +	if (inode->i_op->fiemap)
> > +		return bmap_fiemap(inode, block);
> > +	else if (inode->i_mapping->a_ops->bmap)
> > +		*block = inode->i_mapping->a_ops->bmap(inode->i_mapping,
> > +						       *block);
> > +	else
> >  		return -EINVAL;
> 
> Waitaminute.  btrfs currently supports fiemap but not bmap, and now
> suddenly it will support this legacy interface they've never supported
> before.  Are they on board with this?
> 
> --D
> 
> >  
> > -	*block = inode->i_mapping->a_ops->bmap(inode->i_mapping, *block);
> >  	return 0;
> >  }
> >  EXPORT_SYMBOL(bmap);
> > diff --git a/fs/ioctl.c b/fs/ioctl.c
> > index 6086978fe01e..bfa59df332bf 100644
> > --- a/fs/ioctl.c
> > +++ b/fs/ioctl.c
> > @@ -116,6 +116,38 @@ int fiemap_fill_user_extent(struct fiemap_extent_info *fieinfo, u64 logical,
> >  	return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
> >  }
> >  
> > +int fiemap_fill_kernel_extent(struct fiemap_extent_info *fieinfo, u64 logical,
> > +			    u64 phys, u64 len, u32 flags)
> > +{
> > +	struct fiemap_extent *extent = fieinfo->fi_extents_start;
> > +
> > +	/* only count the extents */
> > +	if (fieinfo->fi_extents_max == 0) {
> > +		fieinfo->fi_extents_mapped++;
> > +		return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
> > +	}
> > +
> > +	if (fieinfo->fi_extents_mapped >= fieinfo->fi_extents_max)
> > +		return 1;
> > +
> > +	if (flags & SET_UNKNOWN_FLAGS)
> > +		flags |= FIEMAP_EXTENT_UNKNOWN;
> > +	if (flags & SET_NO_UNMOUNTED_IO_FLAGS)
> > +		flags |= FIEMAP_EXTENT_ENCODED;
> > +	if (flags & SET_NOT_ALIGNED_FLAGS)
> > +		flags |= FIEMAP_EXTENT_NOT_ALIGNED;
> > +
> > +	extent->fe_logical = logical;
> > +	extent->fe_physical = phys;
> > +	extent->fe_length = len;
> > +	extent->fe_flags = flags;
> > +
> > +	fieinfo->fi_extents_mapped++;
> > +
> > +	if (fieinfo->fi_extents_mapped == fieinfo->fi_extents_max)
> > +		return 1;
> > +	return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
> > +}
> >  /**
> >   * fiemap_fill_next_extent - Fiemap helper function
> >   * @fieinfo:	Fiemap context passed into ->fiemap
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index 7a434979201c..28bb523d532a 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -1711,6 +1711,8 @@ struct fiemap_extent_info {
> >  	fiemap_fill_cb	fi_cb;
> >  };
> >  
> > +int fiemap_fill_kernel_extent(struct fiemap_extent_info *info, u64 logical,
> > +			      u64 phys, u64 len, u32 flags);
> >  int fiemap_fill_next_extent(struct fiemap_extent_info *info, u64 logical,
> >  			    u64 phys, u64 len, u32 flags);
> >  int fiemap_check_flags(struct fiemap_extent_info *fieinfo, u32 fs_flags);
> > -- 
> > 2.17.2
> > 

-- 
Carlos

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 00/10 V2] New ->fiemap infrastructure and ->bmap removal
  2018-12-06 18:56 ` [PATCH 00/10 V2] New ->fiemap infrastructure and ->bmap removal Andreas Grünbacher
@ 2018-12-07  9:34   ` Carlos Maiolino
  2019-01-14 16:50     ` Christoph Hellwig
  0 siblings, 1 reply; 53+ messages in thread
From: Carlos Maiolino @ 2018-12-07  9:34 UTC (permalink / raw)
  To: Andreas Grünbacher
  Cc: Christoph Hellwig, Linux FS-devel Mailing List, Andreas Dilger,
	sandeen, Dave Chinner

On Thu, Dec 06, 2018 at 07:56:02PM +0100, Andreas Grünbacher wrote:
> Hi,
> 
> Am Mi., 5. Dez. 2018 um 10:18 Uhr schrieb Carlos Maiolino
> <cmaiolino@redhat.com>:
> > This is the second version of the complete series with the goal to remove ->bmap
> > interface completely, in lieu of FIEMAP.
> 
> I'm not thrilled by this approach. How about exposing the iomap
> operations at the vfs layer (for example, in the super block) and
> implementing bmap on top of that instead?
> 

Well, the idea is exactly to get rid of bmap, not to reimplement it. We can use
the same operation for both cases (fiemap+fibmap), so I honestly don't see what
advantage there would be in reimplementing it.

> (I realize that xfs has separate iomap operations for xattrs, but that
> is only used in its fiemap implementation.)
> 
> Thanks,
> Andreas

-- 
Carlos

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 09/10 V2] Use FIEMAP for FIBMAP calls
  2018-12-07  9:09     ` Carlos Maiolino
@ 2018-12-07 20:14       ` Andreas Dilger
  0 siblings, 0 replies; 53+ messages in thread
From: Andreas Dilger @ 2018-12-07 20:14 UTC (permalink / raw)
  To: Carlos Maiolino; +Cc: Darrick J. Wong, linux-fsdevel, hch, Eric Sandeen, david

[-- Attachment #1: Type: text/plain, Size: 4520 bytes --]

On Dec 7, 2018, at 2:09 AM, Carlos Maiolino <cmaiolino@redhat.com> wrote:
> On Wed, Dec 05, 2018 at 09:36:50AM -0800, Darrick J. Wong wrote:
>> 
>>> +
>>> +	*block = (fextent.fe_physical +
>>> +		  (start - fextent.fe_logical)) >> inode->i_blkbits;
>> 
>> Hmmm, so there's nothing here checking that the physical device fiemap
>> reports is the same device that was passed into the mount.  This is
>> trivially true for most of the filesystems that implement bmap and
>> fiemap, but definitely not true for xfs or btrfs.  I would bet most
>> userspace callers of bmap (since it's an ext2-era ioctl) make that
>> assumption and don't even know how to find the device.
>> 
>> On xfs, the bmap implementation won't return any results for realtime
>> files, but it looks as though we suddenly will start doing that here,
>> because in the new bmap implementation we will use fiemap, and fiemap
>> reports extents without providing any context about which device they're
>> on, and that context-less extent gets passed back to bmap_fiemap.
>> 
>> In any case, I think a better solution to the multi-device problem is to
>> start returning device information via struct fiemap_extent, at least
>> inside the kernel.  Use one of the reserved fields to declare a new
>> '__u32 fe_device' field in struct fiemap_extent which can be the dev_t
>> device number, and then you can check that against inode->i_sb->s_bdev
>> to avoid returning results for the non-primary device of a multi-device
>> filesystem.

We're using fe_device = fe_reserved[0] to return the device number for Lustre.
For Lustre, the "device number" is just a server index, since the server's
block device number is irrelevant on the client.  For local filesystems, it
should return the 32-bit st_rdev device number to distinguish devices.
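
A small userspace sketch of how such a device check might look. The struct
layout mirrors struct fiemap_extent from include/uapi/linux/fiemap.h; reading
the device number from fe_reserved[0] follows the description above, while
sketch_extent_device() and sketch_on_primary_device() are hypothetical helper
names, not existing kernel functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Same field order as struct fiemap_extent in the UAPI header. */
struct sketch_fiemap_extent {
	uint64_t fe_logical;
	uint64_t fe_physical;
	uint64_t fe_length;
	uint64_t fe_reserved64[2];
	uint32_t fe_flags;
	uint32_t fe_reserved[3];
};

/* Hypothetical accessor: device number carried in fe_reserved[0]. */
static uint32_t sketch_extent_device(const struct sketch_fiemap_extent *ext)
{
	assert(ext != NULL);
	return ext->fe_reserved[0];
}

/* Hypothetical check: does the extent live on the primary device? */
static bool sketch_on_primary_device(const struct sketch_fiemap_extent *ext,
				     uint32_t primary_dev)
{
	return sketch_extent_device(ext) == primary_dev;
}
```

A FIBMAP path built on FIEMAP could use a test like this to refuse extents that
sit on a non-primary device of a multi-device filesystem.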

I have patches for e2fsprogs filefrag to print the fe_device field.

> Yes, you are right, I haven't thought about multi-dev filesystems. I checked
> btrfs code and it doesn't even support fibmap, exactly because of this problem,
> I wonder though, why it does support FIEMAP then, maybe because the fiemap idea
> isn't provide a way to userspace do IO directly to the device?!
> 
> I'm not sure if crossing dev information is enough though, I did a quick read of
> btrfs code, and the assumption that the block/extent location won't change on
> time, could lead to a time bomb in the future. I wonder if it wouldn't maybe be
> better to add a flag, to identify the usage type, and the filesystem itself
> would define if it should return anything, or not. Like, for example, passing in
> fieinfo->fi_flags, something like FIEMAP_FIBMAP, and check inside the filesystem
> for it.

The FIEMAP_EXTENT_ENCODED flag is meant to be returned when the extent cannot be
read directly from the block device.

For FIBMAP, it should return an error if ENCODED is set, since this file is not
suitable for directly booting a kernel (LILO is the only user of FIBMAP that I'm
aware of).  The filefrag utility prefers to use FIEMAP for efficiency, and only
falls back to FIBMAP if FIEMAP fails.

> From my shallow understanding of btrfs, looks like the location of the data, can
> be moved inside the same device, so, even if the devices are the same as you
> suggested, there is no guarantee the offset will be the same.

On a related note, btrfs also supports compressed extents, which isn't handled
by the current FIEMAP ioctl properly.  There was a patch proposed ages ago to
add FIEMAP_EXTENT_DATA_COMPRESSED, but didn't _quite_ make it over the finish
line, https://www.spinics.net/lists/xfs/msg24629.html has the last discussion.

It added EXTENT_DATA_COMPRESSED and used fe_reserved64[0] as fe_phys_length to
return the on-disk extent size, while fe_length would be renamed to
fe_logi_length (with a compat macro) and still represent the logical extent
length.

 #define FIEMAP_EXTENT_DATA_COMPRESSED	0x00000040 /* Data compressed by fs.
 						    * Sets EXTENT_DATA_ENCODED */
 -	__u64 fe_reserved64[2];
 +	__u64 fe_phys_length; /* physical length in bytes, undefined if
 +			       * DATA_COMPRESSED not set */

There was some discussion on whether there should be a second flag like
FIEMAP_EXTENT_PHYS_LENGTH that is set when the fe_phys_length field is
valid, independent of whether the data is compressed or not.

Since you are reworking the FIEMAP code anyway, would you be interested to
revive that patch series?

Cheers, Andreas

[-- Attachment #2: Message signed with OpenPGP --]
[-- Type: application/pgp-signature, Size: 873 bytes --]

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 04/10 V2] fibmap: Use bmap instead of ->bmap method in ioctl_fibmap
  2018-12-05  9:17 ` [PATCH 04/10 V2] fibmap: Use bmap instead of ->bmap method in ioctl_fibmap Carlos Maiolino
@ 2019-01-14 16:49   ` Christoph Hellwig
  2019-02-04 11:34     ` Carlos Maiolino
  0 siblings, 1 reply; 53+ messages in thread
From: Christoph Hellwig @ 2019-01-14 16:49 UTC (permalink / raw)
  To: Carlos Maiolino; +Cc: linux-fsdevel, hch, adilger, sandeen, david

On Wed, Dec 05, 2018 at 10:17:22AM +0100, Carlos Maiolino wrote:
> Now we have the possibility of proper error return in bmap, use bmap()
> function in ioctl_fibmap() instead of calling ->bmap method directly.
> 
> V2:
> 	- Use a local sector_t variable to asign the block number
> 	  instead of using direct casting.
> 
> Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
> ---
>  fs/ioctl.c | 27 +++++++++++++++++----------
>  1 file changed, 17 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/ioctl.c b/fs/ioctl.c
> index d64f622cac8b..e0cc0dd5f9aa 100644
> --- a/fs/ioctl.c
> +++ b/fs/ioctl.c
> @@ -53,19 +53,26 @@ EXPORT_SYMBOL(vfs_ioctl);
>  
>  static int ioctl_fibmap(struct file *filp, int __user *p)
>  {
> +	struct inode *inode = file_inode(filp);
> +	int error, usr_blk;
> +	sector_t block;
>  
>  	if (!capable(CAP_SYS_RAWIO))
>  		return -EPERM;
> +
> +	error = get_user(usr_blk, p);
> +	if (error)
> +		return error;

Does get_user/put_user actually return an error?

All the code I know just does:

	if (get_user(...))
		return -EFAULT;

and co.

> +
> +	block = usr_blk;
> +	error = bmap(inode, &block);
> +	if (error)
> +		return error;
> +	usr_blk = block;

Nitpick: maybe I'd rename usr_blk to block and block to sector, which seems
to flow a little better.


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 00/10 V2] New ->fiemap infrastructure and ->bmap removal
  2018-12-07  9:34   ` Carlos Maiolino
@ 2019-01-14 16:50     ` Christoph Hellwig
  2019-01-14 17:56       ` Andreas Grünbacher
  0 siblings, 1 reply; 53+ messages in thread
From: Christoph Hellwig @ 2019-01-14 16:50 UTC (permalink / raw)
  To: Carlos Maiolino
  Cc: Andreas Grünbacher, Christoph Hellwig,
	Linux FS-devel Mailing List, Andreas Dilger, sandeen,
	Dave Chinner

On Fri, Dec 07, 2018 at 10:34:29AM +0100, Carlos Maiolino wrote:
> On Thu, Dec 06, 2018 at 07:56:02PM +0100, Andreas Grünbacher wrote:
> > Hi,
> > 
> > Am Mi., 5. Dez. 2018 um 10:18 Uhr schrieb Carlos Maiolino
> > <cmaiolino@redhat.com>:
> > > This is the second version of the complete series with the goal to remove ->bmap
> > > interface completely, in lieu of FIEMAP.
> > 
> > I'm not thrilled by this approach. How about exposing the iomap
> > operations at the vfs layer (for example, in the super block) and
> > implementing bmap on top of that instead?
> > 
> 
> Well, the idea is exactly to get rid of bmap, not reimplement it. We can use the
> same operation for both cases (fiemap+fibmap), so I honestly don't see which
> advantages would be by reimplementing it.

Exactly.  iomap really is just one possible implementation.  Every time we
exposed implementation details at the ops level, that created horrible
abuses.  The most important still-relevant example is
write_begin/write_end, which require fs-specific locking but are
exposed in a way where we can't easily enforce that.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 05/10] fs: Move start and length fiemap fields into fiemap_extent_info
  2018-12-05  9:17 ` [PATCH 05/10] fs: Move start and length fiemap fields into fiemap_extent_info Carlos Maiolino
@ 2019-01-14 16:50   ` Christoph Hellwig
  0 siblings, 0 replies; 53+ messages in thread
From: Christoph Hellwig @ 2019-01-14 16:50 UTC (permalink / raw)
  To: Carlos Maiolino; +Cc: linux-fsdevel, hch, adilger, sandeen, david

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 06/10] iomap: Remove length and start fields from iomap_fiemap
  2018-12-05  9:17 ` [PATCH 06/10] iomap: Remove length and start fields from iomap_fiemap Carlos Maiolino
@ 2019-01-14 16:51   ` Christoph Hellwig
  0 siblings, 0 replies; 53+ messages in thread
From: Christoph Hellwig @ 2019-01-14 16:51 UTC (permalink / raw)
  To: Carlos Maiolino; +Cc: linux-fsdevel, hch, adilger, sandeen, david

On Wed, Dec 05, 2018 at 10:17:24AM +0100, Carlos Maiolino wrote:
> fiemap_extent_info now embeds start and length parameters, users of
> iomap_fiemap() doesn't need to pass it individually anymore.
> 
> Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 07/10] fs: Use a void pointer to store fiemap_extent
  2018-12-05  9:17 ` [PATCH 07/10] fs: Use a void pointer to store fiemap_extent Carlos Maiolino
@ 2019-01-14 16:53   ` Christoph Hellwig
  0 siblings, 0 replies; 53+ messages in thread
From: Christoph Hellwig @ 2019-01-14 16:53 UTC (permalink / raw)
  To: Carlos Maiolino; +Cc: linux-fsdevel, hch, adilger, sandeen, david

>  	u64		fi_len;
>  	unsigned int	fi_extents_mapped;	/* Number of mapped extents */
>  	unsigned int	fi_extents_max;		/* Size of fiemap_extent array */
> -	struct		fiemap_extent __user *fi_extents_start;	/* Start of
> -								   fiemap_extent
> -								   array */
> +	void		*fi_extents_start;	/* Start of fiemap_extent
> +						   array */

I think this patch should be merged into the one passing the callback
as it is logically related.  I'd also rename fi_extents_start to
fi_cb_data to make the relation clear.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 08/10 V2] fiemap: Use a callback to fill fiemap extents
  2018-12-05  9:17 ` [PATCH 08/10 V2] fiemap: Use a callback to fill fiemap extents Carlos Maiolino
@ 2019-01-14 16:53   ` Christoph Hellwig
  0 siblings, 0 replies; 53+ messages in thread
From: Christoph Hellwig @ 2019-01-14 16:53 UTC (permalink / raw)
  To: Carlos Maiolino; +Cc: linux-fsdevel, hch, adilger, sandeen, david

On Wed, Dec 05, 2018 at 10:17:26AM +0100, Carlos Maiolino wrote:
> As a goal to enable fiemap infrastructure to be used by fibmap too, we need a
> way to use different helpers to fill extent data, depending on its usage. One
> helper to fill extent data stored in user address space (used in fiemap), and
> another fo fill extent data stored in kernel address space (will be used in
> fibmap).
> 
> This patch sets up the usage of a callback to be used to fill in the extents.
> It transforms the current fiemap_fill_next_extent, into a simple helper to call
> the callback, avoiding unneeded changes on any filesystem, and reutilizes the
> original function as the callback used by FIEMAP.

Looks good modulo the fact that the previous patch should be merged
into this one.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 09/10 V2] Use FIEMAP for FIBMAP calls
  2018-12-05  9:17 ` [PATCH 09/10 V2] Use FIEMAP for FIBMAP calls Carlos Maiolino
  2018-12-05 17:36   ` Darrick J. Wong
@ 2019-01-14 16:56   ` Christoph Hellwig
  2019-02-05  9:56     ` Carlos Maiolino
  1 sibling, 1 reply; 53+ messages in thread
From: Christoph Hellwig @ 2019-01-14 16:56 UTC (permalink / raw)
  To: Carlos Maiolino; +Cc: linux-fsdevel, hch, adilger, sandeen, david

> +int fiemap_fill_kernel_extent(struct fiemap_extent_info *fieinfo, u64 logical,
> +			    u64 phys, u64 len, u32 flags)

Any reason this function isn't in inode.c next to the caller and marked
static?

Otherwise looks fine except for the additional sanity checking pointed
out by Darrick.

> +	/* only count the extents */
> +	if (fieinfo->fi_extents_max == 0) {
> +		fieinfo->fi_extents_mapped++;
> +		return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;

Maybe do a 'goto out' here?

> +	return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;

And reuse this return.   Bonus points for using a good old
if here:

	if (flags & FIEMAP_EXTENT_LAST)
		return 1;
	return 0;
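
A userspace sketch of the restructuring suggested above, with a shared exit
label and a plain if at the end. The sketch_* types are hypothetical stand-ins
for the kernel structures, the FIEMAP_EXTENT_LAST value mirrors
include/uapi/linux/fiemap.h, and the SET_*_FLAGS translation from the patch is
left out for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Value mirrored from include/uapi/linux/fiemap.h. */
#define FIEMAP_EXTENT_LAST	0x00000001u

/* Stand-in for the in-kernel extent. */
struct sketch_extent {
	uint64_t fe_logical;
	uint64_t fe_physical;
	uint64_t fe_length;
	uint32_t fe_flags;
};

/* Stand-in for the fields of fiemap_extent_info used by the callback. */
struct sketch_extent_info {
	unsigned int fi_extents_mapped;
	unsigned int fi_extents_max;
	struct sketch_extent *fi_extents_start;
};

static int sketch_fill_kernel_extent(struct sketch_extent_info *fieinfo,
				     uint64_t logical, uint64_t phys,
				     uint64_t len, uint32_t flags)
{
	struct sketch_extent *extent;

	assert(fieinfo != NULL);
	extent = fieinfo->fi_extents_start;

	/* Only count the extents. */
	if (fieinfo->fi_extents_max == 0) {
		fieinfo->fi_extents_mapped++;
		goto out;
	}

	/* The single-slot buffer is already full. */
	if (fieinfo->fi_extents_mapped >= fieinfo->fi_extents_max)
		return 1;

	extent->fe_logical = logical;
	extent->fe_physical = phys;
	extent->fe_length = len;
	extent->fe_flags = flags;

	fieinfo->fi_extents_mapped++;

	if (fieinfo->fi_extents_mapped == fieinfo->fi_extents_max)
		return 1;
out:
	if (flags & FIEMAP_EXTENT_LAST)
		return 1;
	return 0;
}
```

Both the count-only path and the normal path now end at the same test of
FIEMAP_EXTENT_LAST, so the return logic is written once.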

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 00/10 V2] New ->fiemap infrastructure and ->bmap removal
  2019-01-14 16:50     ` Christoph Hellwig
@ 2019-01-14 17:56       ` Andreas Grünbacher
  2019-01-14 17:58         ` Christoph Hellwig
  0 siblings, 1 reply; 53+ messages in thread
From: Andreas Grünbacher @ 2019-01-14 17:56 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Carlos Maiolino, Linux FS-devel Mailing List, Andreas Dilger,
	sandeen, Dave Chinner

Am Mo., 14. Jan. 2019 um 17:50 Uhr schrieb Christoph Hellwig <hch@lst.de>:
> On Fri, Dec 07, 2018 at 10:34:29AM +0100, Carlos Maiolino wrote:
> > On Thu, Dec 06, 2018 at 07:56:02PM +0100, Andreas Grünbacher wrote:
> > > Hi,
> > >
> > > Am Mi., 5. Dez. 2018 um 10:18 Uhr schrieb Carlos Maiolino
> > > <cmaiolino@redhat.com>:
> > > > This is the second version of the complete series with the goal to remove ->bmap
> > > > interface completely, in lieu of FIEMAP.
> > >
> > > I'm not thrilled by this approach. How about exposing the iomap
> > > operations at the vfs layer (for example, in the super block) and
> > > implementing bmap on top of that instead?
> > >
> >
> > Well, the idea is exactly to get rid of bmap, not reimplement it. We can use the
> > same operation for both cases (fiemap+fibmap), so I honestly don't see which
> > advantages would be by reimplementing it.
>
> Exactly.  iomap really is a possibly implementation.  Everytime we
> exposed implementation details at the ops level that created horrible
> abuses.  The most important still relevant example is
> write_begin/write_end, which require fs specific locking but are
> exposed in a way where we can't easily enforce that.

Yes, locking. The fiemap_fill_cb callback hack still makes the fiemap
interface much uglier though. So couldn't the existing iop be used to
fill a kernel buffer in a way similar to what functions like
kernel_readv do? That would at least avoid wrecking an existing
interface.

Thanks,
Andreas

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 00/10 V2] New ->fiemap infrastructure and ->bmap removal
  2019-01-14 17:56       ` Andreas Grünbacher
@ 2019-01-14 17:58         ` Christoph Hellwig
  0 siblings, 0 replies; 53+ messages in thread
From: Christoph Hellwig @ 2019-01-14 17:58 UTC (permalink / raw)
  To: Andreas Grünbacher
  Cc: Christoph Hellwig, Carlos Maiolino, Linux FS-devel Mailing List,
	Andreas Dilger, sandeen, Dave Chinner

On Mon, Jan 14, 2019 at 06:56:16PM +0100, Andreas Grünbacher wrote:
> Yes, locking. The fiemap_fill_cb callback hack still makes the fiemap
> interface much uglier though. So couldn't the existing iop be used to
> fill a kernel buffer in a way similar to what functions like
> kernel_readv do? That would at least avoid wrecking an existing
> interface.

There is no file-system-visible change at all; the callback happens
entirely behind the scenes.  We could do a less extensible union-based version,
but I see absolutely no upside in that.

set_fs as in kernel_readv needs to go away, so no new users should be
added.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 04/10 V2] fibmap: Use bmap instead of ->bmap method in ioctl_fibmap
  2019-01-14 16:49   ` Christoph Hellwig
@ 2019-02-04 11:34     ` Carlos Maiolino
  0 siblings, 0 replies; 53+ messages in thread
From: Carlos Maiolino @ 2019-02-04 11:34 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-fsdevel, adilger, sandeen, david

Hi Christoph. Sorry for the delayed reply.

> > +	error = get_user(usr_blk, p);
> > +	if (error)
> > +		return error;
> 
> Does get_user/put_user actually return an error?
> 
> All the code I know just does:
> 
> 	if (get_user()))
> 		return -EFAULT;
> 
> and co.

According to the comment above it, it returns either zero on success or -EFAULT
on error. So both approaches are correct, either mine or the one you
mentioned. It just seems more logical to me to return whatever error code
the function (well, macro) returns, instead of hardcoding -EFAULT.
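
To illustrate why the two idioms behave the same, here is a small userspace
sketch. mock_get_user() is a hypothetical stand-in that follows the contract
described above (zero on success, -EFAULT on error); it is not the kernel
macro, and both reader functions are invented for this example.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for get_user(): 0 on success, -EFAULT on error. */
static int mock_get_user(int *dst, const int *src)
{
	assert(dst != NULL);
	if (src == NULL)
		return -EFAULT;
	*dst = *src;
	return 0;
}

/* Idiom from the patch: propagate whatever the helper returned. */
static int read_blk_propagate(const int *p, int *out)
{
	int error = mock_get_user(out, p);

	if (error)
		return error;
	return 0;
}

/* Idiom common elsewhere in the tree: hardcode -EFAULT on failure. */
static int read_blk_hardcoded(const int *p, int *out)
{
	if (mock_get_user(out, p))
		return -EFAULT;
	return 0;
}
```

As long as the helper can only fail with -EFAULT, callers cannot tell the two
versions apart; the difference is purely one of style.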

> 
> > +
> > +	block = usr_blk;
> > +	error = bmap(inode, &block);
> > +	if (error)
> > +		return error;
> > +	usr_blk = block;
> 
> Nitpick: maybe i'd name ur_block block and block sector, which seems
> to flow a little better.

Sure, that's fine with me, I'll change that in V3.
> 

-- 
Carlos

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 09/10 V2] Use FIEMAP for FIBMAP calls
  2018-12-05 17:36   ` Darrick J. Wong
  2018-12-07  9:09     ` Carlos Maiolino
@ 2019-02-04 15:11     ` Carlos Maiolino
  2019-02-04 18:27       ` Darrick J. Wong
  1 sibling, 1 reply; 53+ messages in thread
From: Carlos Maiolino @ 2019-02-04 15:11 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-fsdevel, hch, adilger, sandeen, david

Hi, sorry for the long delay, Darrick.

> > +	fextent.fe_logical = 0;
> > +	fextent.fe_physical = 0;
> > +	fieinfo.fi_extents_max = 1;
> > +	fieinfo.fi_extents_mapped = 0;
> > +	fieinfo.fi_extents_start = &fextent;
> > +	fieinfo.fi_start = start;
> > +	fieinfo.fi_len = 1 << inode->i_blkbits;
> > +	fieinfo.fi_flags = 0;
> > +	fieinfo.fi_cb = fiemap_fill_kernel_extent;
> > +
> > +	error = inode->i_op->fiemap(inode, &fieinfo);
> > +
> > +	if (error)
> > +		return error;
> > +
> > +	if (fieinfo.fi_flags & (FIEMAP_EXTENT_UNKNOWN |
> > +				FIEMAP_EXTENT_ENCODED |
> > +				FIEMAP_EXTENT_DATA_INLINE |
> > +				FIEMAP_EXTENT_UNWRITTEN))
> > +		return -EINVAL;
> 
> AFAICT, three of the filesystems that support COW writes (xfs, ocfs2,
> and btrfs) do not return bmap results for files with shared blocks.
> This check here should include FIEMAP_EXTENT_SHARED since external
> overwrites of a COW file block are bad news on btrfs (and ocfs2 and
> xfs).

ok, np
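
A minimal sketch of the extended check, assuming the flags being tested are the
ones the filesystem reported for the extent. The flag values mirror
include/uapi/linux/fiemap.h and are redefined only so the example stands alone;
FIBMAP_UNSAFE_FLAGS and fibmap_check_extent_flags() are made-up names for
illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Values mirror include/uapi/linux/fiemap.h; redefined to keep this standalone. */
#define FIEMAP_EXTENT_UNKNOWN      0x00000002
#define FIEMAP_EXTENT_ENCODED      0x00000008
#define FIEMAP_EXTENT_DATA_INLINE  0x00000200
#define FIEMAP_EXTENT_UNWRITTEN    0x00000800
#define FIEMAP_EXTENT_SHARED       0x00002000

/* Extent states for which a raw block number should not be handed out. */
#define FIBMAP_UNSAFE_FLAGS (FIEMAP_EXTENT_UNKNOWN | \
			     FIEMAP_EXTENT_ENCODED | \
			     FIEMAP_EXTENT_DATA_INLINE | \
			     FIEMAP_EXTENT_UNWRITTEN | \
			     FIEMAP_EXTENT_SHARED)

/* Hypothetical helper: 0 if the extent may be reported, -EINVAL otherwise. */
static int fibmap_check_extent_flags(uint32_t fe_flags)
{
	if (fe_flags & FIBMAP_UNSAFE_FLAGS)
		return -EINVAL;
	return 0;
}
```

With FIEMAP_EXTENT_SHARED in the mask, an extent shared between reflinked files
is rejected the same way an unwritten or inline extent is.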

> 
> > +
> > +	*block = (fextent.fe_physical +
> > +		  (start - fextent.fe_logical)) >> inode->i_blkbits;
> 
> Hmmm, so there's nothing here checking that the physical device fiemap
> reports is the same device that was passed into the mount.  This is
> trivially true for most of the filesystems that implement bmap and
> fiemap, but definitely not true for xfs or btrfs.  I would bet most
> userspace callers of bmap (since it's an ext2-era ioctl) make that
> assumption and don't even know how to find the device.

Makes sense.
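
For reference, the block calculation quoted above can be worked through in
userspace. This is a minimal sketch, assuming start is the byte offset of the
requested logical block; struct extent_sketch and extent_to_block() are made-up
names for illustration. Note that the result is a bare block number with no
indication of which device it refers to, which is the gap pointed out above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the fiemap_extent fields used by the calculation. */
struct extent_sketch {
	uint64_t logical;   /* byte offset of the extent within the file */
	uint64_t physical;  /* byte offset of the extent on the device */
	uint64_t length;    /* extent length in bytes */
};

/* Translate a byte offset inside the extent into a device block number. */
static uint64_t extent_to_block(const struct extent_sketch *e, uint64_t start,
				unsigned int blkbits)
{
	/* The requested offset must fall inside the extent. */
	assert(start >= e->logical && start - e->logical < e->length);

	return (e->physical + (start - e->logical)) >> blkbits;
}
```

For an extent at logical offset 8192 and physical offset 1048576 with 4 KiB
blocks (blkbits = 12), a request for offset 12288 gives
(1048576 + 4096) >> 12 = 257.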

> 
> On xfs, the bmap implementation won't return any results for realtime
> files, but it looks as though we suddenly will start doing that here,
> because in the new bmap implementation we will use fiemap, and fiemap
> reports extents without providing any context about which device they're
> on, and that context-less extent gets passed back to bmap_fiemap.
> 
> In any case, I think a better solution to the multi-device problem is to
> start returning device information via struct fiemap_extent, at least
> inside the kernel.  Use one of the reserved fields to declare a new
> '__u32 fe_device' field in struct fiemap_extent which can be the dev_t
> device number, and then you can check that against inode->i_sb->s_bdev
> to avoid returning results for the non-primary device of a multi-device
> filesystem.

I agree we should address it here, but I don't think fiemap_extent is the right
place for it: it is part of the UAPI, and changing it is usually not a good
idea.

I think I got your idea anyway, but what if, instead of returning the bdev in
fiemap_extent, we send a flag (via fi_flags) to the filesystem to identify a
FIBMAP or a FIEMAP call, and let the filesystem decide what to do with that
information?

> 
> > +
> > +	return error;
> > +}
> > +
> >  /**
> >   *	bmap	- find a block number in a file
> >   *	@inode:  inode owning the block number being requested
> > @@ -1594,10 +1628,14 @@ EXPORT_SYMBOL(iput);
> >   */
> >  int bmap(struct inode *inode, sector_t *block)
> >  {
> > -	if (!inode->i_mapping->a_ops->bmap)
> > +	if (inode->i_op->fiemap)
> > +		return bmap_fiemap(inode, block);
> > +	else if (inode->i_mapping->a_ops->bmap)
> > +		*block = inode->i_mapping->a_ops->bmap(inode->i_mapping,
> > +						       *block);
> > +	else
> >  		return -EINVAL;
> 
> Waitaminute.  btrfs currently supports fiemap but not bmap, and now
> suddenly it will support this legacy interface they've never supported
> before.  Are they on board with this?
> 
> --D
> 
> >  
> > -	*block = inode->i_mapping->a_ops->bmap(inode->i_mapping, *block);
> >  	return 0;
> >  }
> >  EXPORT_SYMBOL(bmap);
> > diff --git a/fs/ioctl.c b/fs/ioctl.c
> > index 6086978fe01e..bfa59df332bf 100644
> > --- a/fs/ioctl.c
> > +++ b/fs/ioctl.c
> > @@ -116,6 +116,38 @@ int fiemap_fill_user_extent(struct fiemap_extent_info *fieinfo, u64 logical,
> >  	return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
> >  }
> >  
> > +int fiemap_fill_kernel_extent(struct fiemap_extent_info *fieinfo, u64 logical,
> > +			    u64 phys, u64 len, u32 flags)
> > +{
> > +	struct fiemap_extent *extent = fieinfo->fi_extents_start;
> > +
> > +	/* only count the extents */
> > +	if (fieinfo->fi_extents_max == 0) {
> > +		fieinfo->fi_extents_mapped++;
> > +		return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
> > +	}
> > +
> > +	if (fieinfo->fi_extents_mapped >= fieinfo->fi_extents_max)
> > +		return 1;
> > +
> > +	if (flags & SET_UNKNOWN_FLAGS)
> > +		flags |= FIEMAP_EXTENT_UNKNOWN;
> > +	if (flags & SET_NO_UNMOUNTED_IO_FLAGS)
> > +		flags |= FIEMAP_EXTENT_ENCODED;
> > +	if (flags & SET_NOT_ALIGNED_FLAGS)
> > +		flags |= FIEMAP_EXTENT_NOT_ALIGNED;
> > +
> > +	extent->fe_logical = logical;
> > +	extent->fe_physical = phys;
> > +	extent->fe_length = len;
> > +	extent->fe_flags = flags;
> > +
> > +	fieinfo->fi_extents_mapped++;
> > +
> > +	if (fieinfo->fi_extents_mapped == fieinfo->fi_extents_max)
> > +		return 1;
> > +	return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
> > +}
> >  /**
> >   * fiemap_fill_next_extent - Fiemap helper function
> >   * @fieinfo:	Fiemap context passed into ->fiemap
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index 7a434979201c..28bb523d532a 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -1711,6 +1711,8 @@ struct fiemap_extent_info {
> >  	fiemap_fill_cb	fi_cb;
> >  };
> >  
> > +int fiemap_fill_kernel_extent(struct fiemap_extent_info *info, u64 logical,
> > +			      u64 phys, u64 len, u32 flags);
> >  int fiemap_fill_next_extent(struct fiemap_extent_info *info, u64 logical,
> >  			    u64 phys, u64 len, u32 flags);
> >  int fiemap_check_flags(struct fiemap_extent_info *fieinfo, u32 fs_flags);
> > -- 
> > 2.17.2
> > 

-- 
Carlos

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 09/10 V2] Use FIEMAP for FIBMAP calls
  2019-02-04 15:11     ` Carlos Maiolino
@ 2019-02-04 18:27       ` Darrick J. Wong
  2019-02-06 13:37         ` Carlos Maiolino
  0 siblings, 1 reply; 53+ messages in thread
From: Darrick J. Wong @ 2019-02-04 18:27 UTC (permalink / raw)
  To: Carlos Maiolino; +Cc: linux-fsdevel, hch, adilger, sandeen, david

On Mon, Feb 04, 2019 at 04:11:47PM +0100, Carlos Maiolino wrote:
> Hi, Sorry for the long delay Darrick.
> 
> > > +	fextent.fe_logical = 0;
> > > +	fextent.fe_physical = 0;
> > > +	fieinfo.fi_extents_max = 1;
> > > +	fieinfo.fi_extents_mapped = 0;
> > > +	fieinfo.fi_extents_start = &fextent;
> > > +	fieinfo.fi_start = start;
> > > +	fieinfo.fi_len = 1 << inode->i_blkbits;
> > > +	fieinfo.fi_flags = 0;
> > > +	fieinfo.fi_cb = fiemap_fill_kernel_extent;
> > > +
> > > +	error = inode->i_op->fiemap(inode, &fieinfo);
> > > +
> > > +	if (error)
> > > +		return error;
> > > +
> > > +	if (fieinfo.fi_flags & (FIEMAP_EXTENT_UNKNOWN |
> > > +				FIEMAP_EXTENT_ENCODED |
> > > +				FIEMAP_EXTENT_DATA_INLINE |
> > > +				FIEMAP_EXTENT_UNWRITTEN))
> > > +		return -EINVAL;
> > 
> > AFAICT, three of the filesystems that support COW writes (xfs, ocfs2,
> > and btrfs) do not return bmap results for files with shared blocks.
> > This check here should include FIEMAP_EXTENT_SHARED since external
> > overwrites of a COW file block are bad news on btrfs (and ocfs2 and
> > xfs).
> 
> ok, np
> 
> > 
> > > +
> > > +	*block = (fextent.fe_physical +
> > > +		  (start - fextent.fe_logical)) >> inode->i_blkbits;
> > 
> > Hmmm, so there's nothing here checking that the physical device fiemap
> > reports is the same device that was passed into the mount.  This is
> > trivially true for most of the filesystems that implement bmap and
> > fiemap, but definitely not true for xfs or btrfs.  I would bet most
> > userspace callers of bmap (since it's an ext2-era ioctl) make that
> > assumption and don't even know how to find the device.
> 
> Makes sense.
> 
> > 
> > On xfs, the bmap implementation won't return any results for realtime
> > files, but it looks as though we suddenly will start doing that here,
> > because in the new bmap implementation we will use fiemap, and fiemap
> > reports extents without providing any context about which device they're
> > on, and that context-less extent gets passed back to bmap_fiemap.
> > 
> > In any case, I think a better solution to the multi-device problem is to
> > start returning device information via struct fiemap_extent, at least
> > inside the kernel.  Use one of the reserved fields to declare a new
> > '__u32 fe_device' field in struct fiemap_extent which can be the dev_t
> > device number, and then you can check that against inode->i_sb->s_bdev
> > to avoid returning results for the non-primary device of a multi-device
> > filesystem.
> 
> I agree we should address it here, but I don't think fiemap_extent is the right
> place for it, it is linked to the UAPI, and changing it is usually not a good
> idea.

Adding a FIEMAP_EXTENT flag or two to turn one of the fe_reserved fields
into some sort of dev_t/per-device cookie should be fine.  Userspace
shouldn't be expecting any meaning in reserved areas.

> I think I got your idea anyway, but, what if, instead returning the bdev in
> fiemap_extent, we instead, send a flag (via fi_flags) to the filesystem, to
> identify a FIBMAP or a FIEMAP call, and let the filesystem decide what to do
> with such information?

I don't like the idea of adding a FIEMAP_FLAG to distinguish callers.

--D

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 09/10 V2] Use FIEMAP for FIBMAP calls
  2019-01-14 16:56   ` Christoph Hellwig
@ 2019-02-05  9:56     ` Carlos Maiolino
  2019-02-05 18:25       ` Christoph Hellwig
  0 siblings, 1 reply; 53+ messages in thread
From: Carlos Maiolino @ 2019-02-05  9:56 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-fsdevel, adilger, sandeen, david

Hi Christoph.

On Mon, Jan 14, 2019 at 05:56:17PM +0100, Christoph Hellwig wrote:
> > +int fiemap_fill_kernel_extent(struct fiemap_extent_info *fieinfo, u64 logical,
> > +			    u64 phys, u64 len, u32 flags)
> 
> Any reason this function isn't in inode.c next to the caller and marked
> static?
> 

No reason other than to keep it close to its peer fiemap_fill_user_extent(). I
honestly prefer to keep both together rather than in separate files, but I'm
happy to move it to fs/inode.c if required.

> Otherwise looks fine except for the additional sanity checking pointed
> out by Darrick.

Working on that.

> 
> > +	/* only count the extents */
> > +	if (fieinfo->fi_extents_max == 0) {
> > +		fieinfo->fi_extents_mapped++;
> > +		return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
> 
> Maybe do a 'goto out' here?
> 
> > +	return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
> 
> And reuse this return.   Bonus points for using a good old
> if here:
> 
> 	if (flags & FIEMAP_EXTENT_LAST)
> 		return 1;
> 	return 0;

Ok, will be in the new version, thanks for the review :)
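
A minimal userspace sketch of the suggested shape, with a single 'out' label
and a plain if for the FIEMAP_EXTENT_LAST test. The structures are hypothetical,
trimmed-down stand-ins for the kernel ones, and the SET_*_FLAGS handling from
the patch is left out to keep the control flow visible; this is not the code
that will appear in the next version.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FIEMAP_EXTENT_LAST 0x00000001 /* value as in include/uapi/linux/fiemap.h */

/* Hypothetical, trimmed-down stand-ins for the kernel structures. */
struct kernel_extent_sketch {
	uint64_t fe_logical;
	uint64_t fe_physical;
	uint64_t fe_length;
	uint32_t fe_flags;
};

struct extent_info_sketch {
	unsigned int fi_extents_mapped;
	unsigned int fi_extents_max;
	struct kernel_extent_sketch *fi_extents_start;
};

static int fill_kernel_extent_sketch(struct extent_info_sketch *info,
				     uint64_t logical, uint64_t phys,
				     uint64_t len, uint32_t flags)
{
	struct kernel_extent_sketch *extent = info->fi_extents_start;

	/* only count the extents */
	if (info->fi_extents_max == 0) {
		info->fi_extents_mapped++;
		goto out;
	}

	/* No room left: tell the caller to stop. */
	if (info->fi_extents_mapped >= info->fi_extents_max)
		return 1;

	extent->fe_logical = logical;
	extent->fe_physical = phys;
	extent->fe_length = len;
	extent->fe_flags = flags;

	info->fi_extents_mapped++;

	if (info->fi_extents_mapped == info->fi_extents_max)
		return 1;
out:
	if (flags & FIEMAP_EXTENT_LAST)
		return 1;
	return 0;
}
```

Both the counting path and the normal path now share the final LAST test, so
the return convention is written down in one place only.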

-- 
Carlos

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 09/10 V2] Use FIEMAP for FIBMAP calls
  2019-02-05  9:56     ` Carlos Maiolino
@ 2019-02-05 18:25       ` Christoph Hellwig
  2019-02-06  9:50         ` Carlos Maiolino
  0 siblings, 1 reply; 53+ messages in thread
From: Christoph Hellwig @ 2019-02-05 18:25 UTC (permalink / raw)
  To: Carlos Maiolino; +Cc: Christoph Hellwig, linux-fsdevel, adilger, sandeen, david

On Tue, Feb 05, 2019 at 10:56:01AM +0100, Carlos Maiolino wrote:
> > Any reason this function isn't in inode.c next to the caller and marked
> > static?
> > 
> 
> No reason other than to keep it close to its peer fiemap_fill_user_extent(), I
> honestly do prefer to keep both together than in separated files. But, I'm up
> to move it to fs/inode.c if required.

After your series fiemap_fill_user_extent should be static and close
to its caller, so with the kernel one in inode.c everything should
be neat and symmetric.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 09/10 V2] Use FIEMAP for FIBMAP calls
  2019-02-05 18:25       ` Christoph Hellwig
@ 2019-02-06  9:50         ` Carlos Maiolino
  0 siblings, 0 replies; 53+ messages in thread
From: Carlos Maiolino @ 2019-02-06  9:50 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-fsdevel, adilger, sandeen, david

On Tue, Feb 05, 2019 at 07:25:18PM +0100, Christoph Hellwig wrote:
> On Tue, Feb 05, 2019 at 10:56:01AM +0100, Carlos Maiolino wrote:
> > > Any reason this function isn't in inode.c next to the caller and marked
> > > static?
> > > 
> > 
> > No reason other than to keep it close to its peer fiemap_fill_user_extent(), I
> > honestly do prefer to keep both together than in separated files. But, I'm up
> > to move it to fs/inode.c if required.
> 
> After your series fiemap_fill_user_extent should be static and close
> to its caller, so with the kernel one in inode.c everything should
> be neat and symmetric.

You are right, I didn't pay attention to that. Thanks for the heads-up; I'll
fix it in the next version.

-- 
Carlos

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 09/10 V2] Use FIEMAP for FIBMAP calls
  2019-02-04 18:27       ` Darrick J. Wong
@ 2019-02-06 13:37         ` Carlos Maiolino
  2019-02-06 20:44           ` Darrick J. Wong
  2019-02-06 21:04           ` Andreas Dilger
  0 siblings, 2 replies; 53+ messages in thread
From: Carlos Maiolino @ 2019-02-06 13:37 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-fsdevel, hch, adilger, sandeen, david

> > > In any case, I think a better solution to the multi-device problem is to
> > > start returning device information via struct fiemap_extent, at least
> > > inside the kernel.  Use one of the reserved fields to declare a new
> > > '__u32 fe_device' field in struct fiemap_extent which can be the dev_t
> > > device number, and then you can check that against inode->i_sb->s_bdev
> > > to avoid returning results for the non-primary device of a multi-device
> > > filesystem.
> > 
> > I agree we should address it here, but I don't think fiemap_extent is the right
> > place for it, it is linked to the UAPI, and changing it is usually not a good
> > idea.
> 
> Adding a FIEMAP_EXTENT flag or two to turn one of the fe_reserved fields
> into some sort of dev_t/per-device cookie should be fine.  Userspace
> shouldn't be expecting any meaning in reserved areas.
> 
> > I think I got your idea anyway, but, what if, instead returning the bdev in
> > fiemap_extent, we instead, send a flag (via fi_flags) to the filesystem, to
> > > identify a FIBMAP or a FIEMAP call, and let the filesystem decide what to do
> > with such information?
> 
> I don't like the idea of adding a FIEMAP_FLAG to distinguish callers.

Ok, may I ask why not?

My apologies if I am wrong, but as I understand it, nothing today tells
userspace which device an extent reported by FIEMAP belongs to. Whether it is
on the RT device in XFS or on some disk of a RAID in BTRFS, we simply do not
provide that information. So the goal is to provide a way to tell the
filesystem whether a FIEMAP or a FIBMAP has been requested, so that the current
behavior of both ioctls won't change.

Enabling filesystems to return device information in fiemap_extent requires
modifying all filesystems to provide such information, which will have no use
other than matching the mounted device against the device where the extent
is.

A FIEMAP_FLAG will also require FS changes but, IMHO, less intrusive ones than
the device id in fiemap_extent. I don't see much advantage in adding the device
id instead of using the flag.

One problem I see with a new FIEMAP_FLAG is that it could also be passed in
from userspace, so it would require a check to make sure it didn't come from
userspace when ioctl_fiemap() was used.

I think there are 2 other possibilities which can be used to fix this.

- Use a boolean inside fiemap_extent_info to identify a fibmap call, or,
- If the device id is a must for you, maybe add the device id to
  fiemap_extent_info instead of fiemap_extent. That way we don't mess with a
  UAPI-exported data structure and still give the filesystems a way to report
  which device the mapped extent is on.

What do you think?

Cheers


-- 
Carlos

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 09/10 V2] Use FIEMAP for FIBMAP calls
  2019-02-06 13:37         ` Carlos Maiolino
@ 2019-02-06 20:44           ` Darrick J. Wong
  2019-02-06 21:13             ` Andreas Dilger
                               ` (2 more replies)
  2019-02-06 21:04           ` Andreas Dilger
  1 sibling, 3 replies; 53+ messages in thread
From: Darrick J. Wong @ 2019-02-06 20:44 UTC (permalink / raw)
  To: Carlos Maiolino; +Cc: linux-fsdevel, hch, adilger, sandeen, david

On Wed, Feb 06, 2019 at 02:37:53PM +0100, Carlos Maiolino wrote:
> > > > In any case, I think a better solution to the multi-device problem is to
> > > > start returning device information via struct fiemap_extent, at least
> > > > inside the kernel.  Use one of the reserved fields to declare a new
> > > > '__u32 fe_device' field in struct fiemap_extent which can be the dev_t
> > > > device number, and then you can check that against inode->i_sb->s_bdev
> > > > to avoid returning results for the non-primary device of a multi-device
> > > > filesystem.
> > > 
> > > I agree we should address it here, but I don't think fiemap_extent is the right
> > > place for it, it is linked to the UAPI, and changing it is usually not a good
> > > idea.
> > 
> > Adding a FIEMAP_EXTENT flag or two to turn one of the fe_reserved fields
> > into some sort of dev_t/per-device cookie should be fine.  Userspace
> > shouldn't be expecting any meaning in reserved areas.
> > 
> > > I think I got your idea anyway, but, what if, instead returning the bdev in
> > > fiemap_extent, we instead, send a flag (via fi_flags) to the filesystem, to
> > > > identify a FIBMAP or a FIEMAP call, and let the filesystem decide what to do
> > > with such information?
> > 
> > I don't like the idea of adding a FIEMAP_FLAG to distinguish callers.
> 
> Ok, may I ask why not?

I think it's a bad idea to add a flag to FIEMAP to change its behavior
to suit an older and even crappier legacy interface (i.e. FIBMAP).

FIBMAP is architecturally broken in that we can't /ever/ provide the
context of "which device does this map to?"

FIEMAP is architecturally deficient as well, but its ioctl structure
definition is flexible enough that we can report "which device does this
map to".

I want to enhance FIEMAP to deal with multi-device filesystems
correctly, and as much as I want to kill FIBMAP, I can't because of zipl
and *lilo.

> My apologies if I am wrong, but, per my understanding, there is
> nothing today, which tells userspace which device belongs the extent
> map reported by FIEMAP.

Right...

> If it belongs to the RT device in XFS, or whatever disk in a raid in
> BTRFS, we simply do not provide such information.

Right...

> So, the goal is to provide a way to tell the filesystem if a FIEMAP or
> a FIBMAP has been requested, so the current behavior of both ioctls
> won't change.

...but from my point of view, the FIEMAP behavior *ought* to change to
be more expressive.  Once that's done, we can use the more expressive
FIEMAP output to solve the problem of FIBMAP vs. multi-disk filesystems.

The whole point of having fe_reserved* fields in struct fiemap_extent is
so that we can add a new FIEMAP_EXTENT_ flag so that the filesystem can
start returning data in a reserved field.  New userspace programs that
know about the flag can start reading information from the new field if
they see the flag, and old userspace programs don't know about the flag
and won't be any worse off.

> Enabling filesystems to return device information into fiemap_extent
> requires modification of all filesystems to provide such information,
> which will not have any use other than matching the mounted device to
> the device where the extent is.

Perhaps it would help for me to present a more concrete proposal:

--- a/include/uapi/linux/fiemap.h	2019-01-18 10:53:44.000000000 -0800
+++ b/include/uapi/linux/fiemap.h	2019-02-06 12:25:52.813935941 -0800
@@ -22,7 +22,19 @@ struct fiemap_extent {
 	__u64 fe_length;   /* length in bytes for this extent */
 	__u64 fe_reserved64[2];
 	__u32 fe_flags;    /* FIEMAP_EXTENT_* flags for this extent */
-	__u32 fe_reserved[3];
+
+	/*
+	 * Underlying device that this extent is stored on.
+	 *
+	 * If FIEMAP_EXTENT_DEV_T is set, this field is a dev_t containing the
+	 * major and minor numbers of a device.  If FIEMAP_EXTENT_DEV_COOKIE is
+	 * set, this field is a 32-bit cookie that can be used to distinguish
+	 * between backing devices but has no intrinsic meaning.  If neither
+	 * EXTENT_DEV flag is set, this field is meaningless.  Only one of the
+	 * EXTENT_DEV flags may be set at any time.
+	 */
+	__u32 fe_device;
+	__u32 fe_reserved[2];
 };
 
 struct fiemap {
@@ -66,5 +78,14 @@ struct fiemap {
 						    * merged for efficiency. */
 #define FIEMAP_EXTENT_SHARED		0x00002000 /* Space shared with other
 						    * files. */
+#define FIEMAP_EXTENT_DEV_T		0x00004000 /* fe_device is a dev_t
+						    * structure containing the
+						    * major and minor numbers
+						    * of a block device. */
+#define FIEMAP_EXTENT_DEV_COOKIE	0x00008000 /* fe_device is a 32-bit
+						    * cookie that can be used
+						    * to distinguish physical
+						    * devices but otherwise
+						    * has no meaning. */
 
 #endif /* _LINUX_FIEMAP_H */

Under this scheme, XFS can set FIEMAP_EXTENT_DEV_T in fe_flags and start
encoding fe_device = new_encode_dev(xfs_get_device_for_file()).
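
As a rough illustration of what a 32-bit fe_device would hold in the
FIEMAP_EXTENT_DEV_T case, here is a userspace sketch of the bit packing that,
as far as I know, the kernel's new_encode_dev() and new_decode_dev() use. The
*_sketch function names are made up, and xfs_get_device_for_file() above is
left as the placeholder it is.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Pack major/minor the way new_encode_dev() does: the low 8 minor bits,
 * then 12 major bits, then the remaining high minor bits.
 */
static uint32_t encode_dev_sketch(unsigned int major, unsigned int minor)
{
	return (minor & 0xff) | (major << 8) | ((minor & ~0xffu) << 12);
}

/* Recover the major number from the packed value. */
static unsigned int decode_major_sketch(uint32_t dev)
{
	return (dev & 0xfff00) >> 8;
}

/* Recover the minor number (low 8 bits plus the relocated high bits). */
static unsigned int decode_minor_sketch(uint32_t dev)
{
	return (dev & 0xff) | ((dev >> 12) & 0xfff00);
}
```

A device with major 8 and minor 1 encodes to 0x801, and major 259 with minor
300 encodes to 0x11032c, so the whole device number fits in one __u32.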

Some clustered filesystem or whatever could set FIEMAP_EXTENT_DEV_COOKIE
and encode the replica number in fe_device.

Existing filesystems can be left unchanged, in which case neither
EXTENT_DEV flag is set in fe_flags and the bits in fe_device are
meaningless, the same as they are today.  Reporting fe_device is entirely
optional.

Userspace programs will now be able to tell which device the file data
lives on, which has been sort-of requested for years, if the filesystem
chooses to start exporting that information.

Your FIBMAP-via-FIEMAP backend can do something like:

/* FIBMAP only returns results for the same block device backing the fs. */
if ((fe->fe_flags & FIEMAP_EXTENT_DEV_T) &&
    fe->fe_device != new_encode_dev(inode->i_sb->s_dev))
	return 0;

/* Can't tell what is the backing device, bail out. */
if (fe->fe_flags & FIEMAP_EXTENT_DEV_COOKIE)
	return 0;

/*
 * Either fe_device matches the backing device or the implementation
 * doesn't tell us about the backing device, so assume it's ok.
 */
<return FIBMAP results>

So that's how I'd solve a longstanding design problem of FIEMAP and then
take advantage of that solution to remedy my objections to the proposed
"Use FIEMAP for FIBMAP" series.  It doesn't require a FIEMAP_FLAG
behavior flag that userspace knows about but isn't allowed to pass in.
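
The FIBMAP-via-FIEMAP check described above can be sketched as a standalone
userspace function. The two flag values are the ones proposed in this mail and
are not part of the existing UAPI; fibmap_may_report() and its primary_dev
parameter are made-up names, and a false result corresponds to the 'return 0'
(no mapping reported) branches.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Flag values as proposed in this mail; not part of the existing UAPI. */
#define FIEMAP_EXTENT_DEV_T       0x00004000
#define FIEMAP_EXTENT_DEV_COOKIE  0x00008000

/*
 * Hypothetical helper: may FIBMAP report this extent?
 * primary_dev is the encoded device number of the filesystem's main device.
 */
static bool fibmap_may_report(uint32_t fe_flags, uint32_t fe_device,
			      uint32_t primary_dev)
{
	/* The extent is known to live on some other device. */
	if ((fe_flags & FIEMAP_EXTENT_DEV_T) && fe_device != primary_dev)
		return false;

	/* An opaque cookie cannot be compared against the primary device. */
	if (fe_flags & FIEMAP_EXTENT_DEV_COOKIE)
		return false;

	/* Device matches, or the filesystem did not say: assume it is fine. */
	return true;
}
```

Filesystems that set neither flag fall through to the last branch, so their
FIBMAP behavior stays as it is today.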

> A FIEMAP_FLAG will also require FS changes, but IMHO, less intrusive
> than the device id in fiemap_extent. I don't see much advantage in
> adding the device id instead of using the flag.
> 
> A problem I see using a new FIEMAP_FLAG, is it 'could' be also passed via
> userspace, so, it would require a check to make sure it didn't come from
> userspace if ioctl_fiemap() was used.
> 
> I think there are 2 other possibilities which can be used to fix this.
> 
> - Use a boolean inside fiemap_extent_info to identify a fibmap call, or,
> - If the device id is a must for you, maybe add the device id into
>   fiemap_extent_info instead of fiemap_extent.

That won't work with btrfs, which can store file extents on multiple
different physical devices.

>   So we don't mess with a UAPI exported data structure and still
>   provides a way to the filesystems to provide which device the mapped
>   extent is in.
> 
> What you think?
> 
> Cheers


* Re: [PATCH 09/10 V2] Use FIEMAP for FIBMAP calls
  2019-02-06 13:37         ` Carlos Maiolino
  2019-02-06 20:44           ` Darrick J. Wong
@ 2019-02-06 21:04           ` Andreas Dilger
  1 sibling, 0 replies; 53+ messages in thread
From: Andreas Dilger @ 2019-02-06 21:04 UTC (permalink / raw)
  To: Carlos Maiolino
  Cc: Darrick J. Wong, linux-fsdevel, Christoph Hellwig, Eric Sandeen, david


On Feb 6, 2019, at 6:37 AM, Carlos Maiolino <cmaiolino@redhat.com> wrote:
>>> On Wed, Dec 05, 2018 at 09:36:50AM -0800, Darrick J. Wong wrote:
>>>> In any case, I think a better solution to the multi-device problem is to
>>>> start returning device information via struct fiemap_extent, at least
>>>> inside the kernel.  Use one of the reserved fields to declare a new
>>>> '__u32 fe_device' field in struct fiemap_extent which can be the dev_t
>>>> device number, and then you can check that against inode->i_sb->s_bdev
>>>> to avoid returning results for the non-primary device of a multi-device
>>>> filesystem.
>>> 
>>> I agree we should address it here, but I don't think fiemap_extent is the right
>>> place for it, it is linked to the UAPI, and changing it is usually not a good
>>> idea.
>> 
>> Adding a FIEMAP_EXTENT flag or two to turn one of the fe_reserved fields
>> into some sort of dev_t/per-device cookie should be fine.  Userspace
>> shouldn't be expecting any meaning in reserved areas.

We are already using the __u32 fiemap_extent::fe_reserved[0] as fe_device for
Lustre, to return the server index to userspace (filefrag can display it with
suitable patches).  That is needed because a single file may be striped across
multiple servers.  The same field could instead return the dev_t for local
multi-device filesystems.

>>> I think I got your idea anyway, but, what if, instead returning the bdev in
>>> fiemap_extent, we instead, send a flag (via fi_flags) to the filesystem, to
>>> identify a FIBMAP or a FIEMAP call, and let the filesystem decide what to do
>>> with such information?
>> 
>> I don't like the idea of adding a FIEMAP_FLAG to distinguish callers.
> 
> Ok, may I ask why not?
> 
> My apologies if I am wrong, but, per my understanding, there is nothing today
> that tells userspace which device the extent map reported by FIEMAP belongs to.
> If it belongs to the RT device in XFS, or whatever disk in a raid in BTRFS, we
> simply do not provide such information. So, the goal is to provide a way to tell
> the filesystem if a FIEMAP or a FIBMAP has been requested, so the current
> behavior of both ioctls won't change.
> 
> Enabling filesystems to return device information into fiemap_extent requires
> modification of all filesystems to provide such information, which will not have
> any use other than matching the mounted device to the device where the extent
> is.

Filling in the fe_device field is not harmful for existing filesystems, since it
has virtually zero cost (not more than zeroing the field to avoid leaking kernel
data) and older userspace tools would just ignore it.  What would be better than
just filling in the fe_device field would be to also add:

    #define FIEMAP_EXTENT_DEVICE 0x2000

to indicate that fe_device contains a valid value.  That tells userspace that the
filesystem is filling in the field, keeps compatibility with older kernels, and
allows incremental adoption by filesystems that can handle this (XFS, BtrFS).

We haven't added the FIEMAP_EXTENT_DEVICE flag for Lustre, but it would make sense
to do so.

> A FIEMAP_FLAG will also require FS changes, but IMHO, less intrusive than the
> device id in fiemap_extent. I don't see much advantage in adding the device id
> instead of using the flag.

We also have for Lustre:

    #define FIEMAP_FLAG_DEVICE_ORDER 0x40000000

which requests that the kernel FIEMAP return the extents for each block device
first rather than in file logical block order.  That avoids interleaving the
extents across all of the devices in e.g. 1MB chunks (think RAID-0) which would
force the maximum returned extent size to 1MB even though there are much larger
contiguous extents allocated on each device.  Instead, DEVICE_ORDER returns
all of the extents for device 0 first, then device 1 next, etc.  This shows if
the on-disk allocation is good or bad, and also fills in the fe_device field.
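
To illustrate the difference, here is a rough userspace sketch of device
ordering.  It is not the Lustre implementation; struct sketch_extent and
both function names are made up for this mail, and a merged record simply
keeps the file offset of its first chunk.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical in-memory extent record; not the UAPI struct fiemap_extent. */
struct sketch_extent {
	uint64_t logical;   /* file offset in bytes */
	uint64_t physical;  /* offset on the device in bytes */
	uint64_t length;    /* length in bytes */
	uint32_t device;    /* device index, as fe_device would carry it */
};

/* Order by device first, then by file offset within each device. */
static int sketch_cmp_device_order(const void *a, const void *b)
{
	const struct sketch_extent *x = a;
	const struct sketch_extent *y = b;

	if (x->device != y->device)
		return x->device < y->device ? -1 : 1;
	if (x->logical != y->logical)
		return x->logical < y->logical ? -1 : 1;
	return 0;
}

/*
 * Sort extents into device order and merge neighbours that are physically
 * contiguous on the same device.  Returns the number of extents left.
 */
static size_t sketch_device_order(struct sketch_extent *ext, size_t n)
{
	size_t out = 0;

	assert(ext != NULL || n == 0);
	if (n == 0)
		return 0;

	qsort(ext, n, sizeof(*ext), sketch_cmp_device_order);

	for (size_t i = 0; i < n; i++) {
		struct sketch_extent *prev = out ? &ext[out - 1] : NULL;

		if (prev && prev->device == ext[i].device &&
		    prev->physical + prev->length == ext[i].physical)
			prev->length += ext[i].length;  /* grow previous extent */
		else
			ext[out++] = ext[i];            /* start a new extent */
	}
	return out;
}
```

With a file striped in 1MB chunks over two devices, logical order yields
four 1MB extents, while the device-ordered view collapses to one 2MB extent
per device.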

> A problem I see using a new FIEMAP_FLAG, is it 'could' be also passed via
> userspace, so, it would require a check to make sure it didn't come from
> userspace if ioctl_fiemap() was used.

Are you talking about a FIEMAP_FLAG_FIBMAP flag, or about returning the fe_device
field?  I think that passing a flag like FIEMAP_FLAG_DEVICE_ORDER from userspace
is fine in this case, because it has a concrete meaning and is not just an internal
flag.

> I think there are 2 other possibilities which can be used to fix this.
> 
> - Use a boolean inside fiemap_extent_info to identify a fibmap call, or,
> - If the device id is a must for you, maybe add the device id into
>  fiemap_extent_info instead of fiemap_extent. So we don't mess with a UAPI
>  exported data structure and still provides a way to the filesystems to provide
>  which device the mapped extent is in.

No, that would require all of the same changes without making it any more useful
to userspace.

Also, with only one device per fiemap_extent_info, it won't handle filesystems
that may allocate a single file on multiple devices, such as BtrFS and Lustre.

Cheers, Andreas



* Re: [PATCH 09/10 V2] Use FIEMAP for FIBMAP calls
  2019-02-06 20:44           ` Darrick J. Wong
@ 2019-02-06 21:13             ` Andreas Dilger
  2019-02-07  9:52               ` Carlos Maiolino
  2019-02-07 11:59             ` Carlos Maiolino
  2019-02-07 12:36             ` Carlos Maiolino
  2 siblings, 1 reply; 53+ messages in thread
From: Andreas Dilger @ 2019-02-06 21:13 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Carlos Maiolino, linux-fsdevel, Christoph Hellwig, Eric Sandeen, david


On Feb 6, 2019, at 1:44 PM, Darrick J. Wong <darrick.wong@oracle.com> wrote:
> 
> On Wed, Feb 06, 2019 at 02:37:53PM +0100, Carlos Maiolino wrote:
>>>>> In any case, I think a better solution to the multi-device problem is to
>>>>> start returning device information via struct fiemap_extent, at least
>>>>> inside the kernel.  Use one of the reserved fields to declare a new
>>>>> '__u32 fe_device' field in struct fiemap_extent which can be the dev_t
>>>>> device number, and then you can check that against inode->i_sb->s_bdev
>>>>> to avoid returning results for the non-primary device of a multi-device
>>>>> filesystem.
>>>> 
>>>> I agree we should address it here, but I don't think fiemap_extent is the right
>>>> place for it, it is linked to the UAPI, and changing it is usually not a good
>>>> idea.
>>> 
>>> Adding a FIEMAP_EXTENT flag or two to turn one of the fe_reserved fields
>>> into some sort of dev_t/per-device cookie should be fine.  Userspace
>>> shouldn't be expecting any meaning in reserved areas.
>>> 
>>>> I think I got your idea anyway, but, what if, instead returning the bdev in
>>>> fiemap_extent, we instead, send a flag (via fi_flags) to the filesystem, to
>>>> identify a FIBMAP or a FIEMAP call, and let the filesystem decide what to do
>>>> with such information?
>>> 
>>> I don't like the idea of adding a FIEMAP_FLAG to distinguish callers.
>> 
>> Ok, may I ask why not?
> 
> I think it's a bad idea to add a flag to FIEMAP to change its behavior
> to suit an older and even crappier legacy interface (i.e. FIBMAP).
> 
> FIBMAP is architecturally broken in that we can't /ever/ provide the
> context of "which device does this map to?"
> 
> FIEMAP is architecturally deficient as well, but its ioctl structure
> definition is flexible enough that we can report "which device does this
> map to".
> 
> I want to enhance FIEMAP to deal with multi-device filesystems
> correctly, and as much as I want to kill FIBMAP, I can't because of zipl
> and *lilo.
> 
>> My apologies if I am wrong, but, per my understanding, there is
>> nothing today that tells userspace which device the extent map
>> reported by FIEMAP belongs to.
> 
> Right...
> 
>> If it belongs to the RT device in XFS, or whatever disk in a raid in
>> BTRFS, we simply do not provide such information.
> 
> Right...
> 
>> So, the goal is to provide a way to tell the filesystem if a FIEMAP or
>> a FIBMAP has been requested, so the current behavior of both ioctls
>> won't change.
> 
> ...but from my point of view, the FIEMAP behavior *ought* to change to
> be more expressive.  Once that's done, we can use the more expressive
> FIEMAP output to solve the problem of FIBMAP vs. multi-disk filesystems.
> 
> The whole point of having fe_reserved* fields in struct fiemap_extent is
> so that we can add a new FIEMAP_EXTENT_ flag so that the filesystem can
> start returning data in a reserved field.  New userspace programs that
> know about the flag can start reading information from the new field if
> they see the flag, and old userspace programs don't know about the flag
> and won't be any worse off.

Exactly correct.

>> Enabling filesystems to return device information into fiemap_extent
>> requires modification of all filesystems to provide such information,
>> which will not have any use other than matching the mounted device to
>> the device where the extent is.
> 
> Perhaps it would help for me to present a more concrete proposal:
> 
> --- a/include/uapi/linux/fiemap.h	2019-01-18 10:53:44.000000000 -0800
> +++ b/include/uapi/linux/fiemap.h	2019-02-06 12:25:52.813935941 -0800
> @@ -22,7 +22,19 @@ struct fiemap_extent {
> 	__u64 fe_length;   /* length in bytes for this extent */
> 	__u64 fe_reserved64[2];
> 	__u32 fe_flags;    /* FIEMAP_EXTENT_* flags for this extent */
> -	__u32 fe_reserved[3];
> +
> +	/*
> +	 * Underlying device that this extent is stored on.
> +	 *
> +	 * If FIEMAP_EXTENT_DEV_T is set, this field is a dev_t containing the
> +	 * major and minor numbers of a device.  If FIEMAP_EXTENT_DEV_COOKIE is
> +	 * set, this field is a 32-bit cookie that can be used to distinguish
> +	 * between backing devices but has no intrinsic meaning.  If neither
> +	 * EXTENT_DEV flag is set, this field is meaningless.  Only one of the
> +	 * EXTENT_DEV flags may be set at any time.
> +	 */
> +	__u32 fe_device;
> +	__u32 fe_reserved[2];
> };
> 
> struct fiemap {
> @@ -66,5 +78,14 @@ struct fiemap {
> 						    * merged for efficiency. */
> #define FIEMAP_EXTENT_SHARED		0x00002000 /* Space shared with other
> 						    * files. */
> +#define FIEMAP_EXTENT_DEV_T		0x00004000 /* fe_device is a dev_t
> +						    * structure containing the
> +						    * major and minor numbers
> +						    * of a block device. */
> +#define FIEMAP_EXTENT_DEV_COOKIE	0x00008000 /* fe_device is a 32-bit
> +						    * cookie that can be used
> +						    * to distinguish physical
> +						    * devices but otherwise
> +						    * has no meaning. */
> 
> #endif /* _LINUX_FIEMAP_H */
> 
> Under this scheme, XFS can set FIEMAP_EXTENT_DEV_T in fe_flags and start
> encoding:
> 
>         fe_device = new_encode_dev(xfs_get_device_for_file());
> 
> Some clustered filesystem or whatever could set FIEMAP_EXTENT_DEV_COOKIE
> and encode the replica number in fe_device.
> 
> Existing filesystems can be left unchanged, in which case neither
> EXTENT_DEV flag is set in fe_flags and the bits in fe_device are
> meaningless, the same as they are today.  Reporting fe_device is entirely
> optional.

I like this better than my plain "FIEMAP_EXTENT_DEVICE" proposal, since it
allows userspace to distinguish between an actual dev_t and a
unique-but-locally-meaningless identifier that is needed for network
filesystems.
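
For what it's worth, a userspace consumer of the two flags could look
roughly like the sketch below.  The flag values are copied from the proposed
patch above and are not in any released UAPI header; all sketch_* names are
made up for illustration, and the decoding assumes fe_device carries the
new_encode_dev() layout used in the XFS example (low 8 minor bits, then 12
major bits, then the remaining minor bits).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Values from the proposal in this thread; hypothetical until merged. */
#define SKETCH_EXTENT_DEV_T       0x00004000u
#define SKETCH_EXTENT_DEV_COOKIE  0x00008000u

enum sketch_dev_kind {
	SKETCH_DEV_NONE,     /* neither flag set: fe_device is meaningless */
	SKETCH_DEV_T,        /* fe_device is an encoded dev_t */
	SKETCH_DEV_COOKIE,   /* fe_device is an opaque per-device cookie */
	SKETCH_DEV_INVALID,  /* both flags set, which the proposal forbids */
};

struct sketch_dev {
	uint32_t major;   /* valid only for SKETCH_DEV_T */
	uint32_t minor;   /* valid only for SKETCH_DEV_T */
	uint32_t cookie;  /* valid only for SKETCH_DEV_COOKIE */
};

/* Classify fe_device according to fe_flags and decode it into *out. */
static enum sketch_dev_kind sketch_parse_device(uint32_t fe_flags,
						uint32_t fe_device,
						struct sketch_dev *out)
{
	const uint32_t both = SKETCH_EXTENT_DEV_T | SKETCH_EXTENT_DEV_COOKIE;

	assert(out != NULL);
	out->major = 0;
	out->minor = 0;
	out->cookie = 0;

	if ((fe_flags & both) == both)
		return SKETCH_DEV_INVALID;

	if (fe_flags & SKETCH_EXTENT_DEV_T) {
		/* Inverse of the new_encode_dev() bit layout. */
		out->major = (fe_device & 0xfff00u) >> 8;
		out->minor = (fe_device & 0xffu) |
			     ((fe_device >> 12) & 0xfff00u);
		return SKETCH_DEV_T;
	}

	if (fe_flags & SKETCH_EXTENT_DEV_COOKIE) {
		out->cookie = fe_device;
		return SKETCH_DEV_COOKIE;
	}

	return SKETCH_DEV_NONE;
}
```

An old tool that knows neither flag never looks at the field, which is the
compatibility property we want.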

Cheers, Andreas

> Userspace programs will now be able to tell which device the file data
> lives on, which has been sort-of requested for years, if the filesystem
> chooses to start exporting that information.
> 
> Your FIBMAP-via-FIEMAP backend can do something like:
> 
>     /* FIBMAP only returns results for the same block device backing the fs. */
>     if ((fe->fe_flags & FIEMAP_EXTENT_DEV_T) &&
>         fe->fe_device != new_encode_dev(inode->i_sb->s_dev))
>         return 0;
> 
>     /* Can't tell what is the backing device, bail out. */
>     if (fe->fe_flags & FIEMAP_EXTENT_DEV_COOKIE)
>         return 0;
> 
>     /*
>      * Either fe_device matches the backing device or the implementation
>      * doesn't tell us about the backing device, so assume it's ok.
>      */
>     <return FIBMAP results>
> 
> So that's how I'd solve a longstanding design problem of FIEMAP and then
> take advantage of that solution to remedy my objections to the proposed
> "Use FIEMAP for FIBMAP" series.  It doesn't require a FIEMAP_FLAG
> behavior flag that userspace knows about but isn't allowed to pass in.
> 
>> A FIEMAP_FLAG will also require FS changes, but IMHO, less intrusive
>> than the device id in fiemap_extent. I don't see much advantage in
>> adding the device id instead of using the flag.
>> 
>> A problem I see using a new FIEMAP_FLAG, is it 'could' be also passed via
>> userspace, so, it would require a check to make sure it didn't come from
>> userspace if ioctl_fiemap() was used.
>> 
>> I think there are 2 other possibilities which can be used to fix this.
>> 
>> - Use a boolean inside fiemap_extent_info to identify a fibmap call, or,
>> - If the device id is a must for you, maybe add the device id into
>>  fiemap_extent_info instead of fiemap_extent.
> 
> That won't work with btrfs, which can store file extents on multiple
> different physical devices.
> 
>>  So we don't mess with a UAPI exported data structure and still
>>  provides a way to the filesystems to provide which device the mapped
>>  extent is in.
>> 
>> What you think?
>> 
>> Cheers
>> 
>> 

Cheers, Andreas


* Re: [PATCH 09/10 V2] Use FIEMAP for FIBMAP calls
  2019-02-06 21:13             ` Andreas Dilger
@ 2019-02-07  9:52               ` Carlos Maiolino
  2019-02-08  8:43                 ` Christoph Hellwig
  0 siblings, 1 reply; 53+ messages in thread
From: Carlos Maiolino @ 2019-02-07  9:52 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Darrick J. Wong, linux-fsdevel, Christoph Hellwig, Eric Sandeen, david

> >> If it belongs to the RT device in XFS, or whatever disk in a raid in
> >> BTRFS, we simply do not provide such information.
> > 
> > Right...
> > 
> >> So, the goal is to provide a way to tell the filesystem if a FIEMAP or
> >> a FIBMAP has been requested, so the current behavior of both ioctls
> >> won't change.
> > 
> > ...but from my point of view, the FIEMAP behavior *ought* to change to
> > be more expressive.  Once that's done, we can use the more expressive
> > FIEMAP output to solve the problem of FIBMAP vs. multi-disk filesystems.
> > 
> > The whole point of having fe_reserved* fields in struct fiemap_extent is
> > so that we can add a new FIEMAP_EXTENT_ flag so that the filesystem can
> > start returning data in a reserved field.  New userspace programs that
> > know about the flag can start reading information from the new field if
> > they see the flag, and old userspace programs don't know about the flag
> > and won't be any worse off.
> 

Btw, I am not saying I don't like the idea; I do like it. What I was trying to do
was to avoid touching the UAPI in this patchset. But I'll try to implement your
idea here, send it to the list, and raise my shields.

Thanks for the help Andreas/Darrick.
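
As a first rough idea of the FIBMAP-side check, here is a userspace-testable
sketch of Darrick's pseudocode.  The names are made up and the flag values
are the ones from his proposal, so treat everything here as an assumption;
the real check would live somewhere in the bmap_fiemap() path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values from the proposal in this thread; hypothetical until merged. */
#define SKETCH_EXTENT_DEV_T       0x00004000u
#define SKETCH_EXTENT_DEV_COOKIE  0x00008000u

/*
 * Decide whether FIBMAP may report the physical block of an extent.
 * sb_device stands for the encoded device number of the block device
 * backing the superblock.  A false result means FIBMAP reports 0.
 */
static bool sketch_fibmap_may_report(uint32_t fe_flags, uint32_t fe_device,
				     uint32_t sb_device)
{
	const uint32_t both = SKETCH_EXTENT_DEV_T | SKETCH_EXTENT_DEV_COOKIE;

	/* The proposal allows at most one of the two flags per extent. */
	assert((fe_flags & both) != both);

	/* Extent lives on a device other than the one backing the fs. */
	if ((fe_flags & SKETCH_EXTENT_DEV_T) && fe_device != sb_device)
		return false;

	/* An opaque cookie cannot be matched against a block device. */
	if (fe_flags & SKETCH_EXTENT_DEV_COOKIE)
		return false;

	/* Device matches, or the filesystem does not report one at all. */
	return true;
}
```

Filesystems that never set either flag keep today's FIBMAP behavior, since
the last branch accepts them unchanged.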

> Exactly correct.
> 
> >> Enabling filesystems to return device information into fiemap_extent
> >> requires modification of all filesystems to provide such information,
> >> which will not have any use other than matching the mounted device to
> >> the device where the extent is.
> > 
> > Perhaps it would help for me to present a more concrete proposal:
> > 
> > --- a/include/uapi/linux/fiemap.h	2019-01-18 10:53:44.000000000 -0800
> > +++ b/include/uapi/linux/fiemap.h	2019-02-06 12:25:52.813935941 -0800
> > @@ -22,7 +22,19 @@ struct fiemap_extent {
> > 	__u64 fe_length;   /* length in bytes for this extent */
> > 	__u64 fe_reserved64[2];
> > 	__u32 fe_flags;    /* FIEMAP_EXTENT_* flags for this extent */
> > -	__u32 fe_reserved[3];
> > +
> > +	/*
> > +	 * Underlying device that this extent is stored on.
> > +	 *
> > +	 * If FIEMAP_EXTENT_DEV_T is set, this field is a dev_t containing the
> > +	 * major and minor numbers of a device.  If FIEMAP_EXTENT_DEV_COOKIE is
> > +	 * set, this field is a 32-bit cookie that can be used to distinguish
> > +	 * between backing devices but has no intrinsic meaning.  If neither
> > +	 * EXTENT_DEV flag is set, this field is meaningless.  Only one of the
> > +	 * EXTENT_DEV flags may be set at any time.
> > +	 */
> > +	__u32 fe_device;
> > +	__u32 fe_reserved[2];
> > };
> > 
> > struct fiemap {
> > @@ -66,5 +78,14 @@ struct fiemap {
> > 						    * merged for efficiency. */
> > #define FIEMAP_EXTENT_SHARED		0x00002000 /* Space shared with other
> > 						    * files. */
> > +#define FIEMAP_EXTENT_DEV_T		0x00004000 /* fe_device is a dev_t
> > +						    * structure containing the
> > +						    * major and minor numbers
> > +						    * of a block device. */
> > +#define FIEMAP_EXTENT_DEV_COOKIE	0x00008000 /* fe_device is a 32-bit
> > +						    * cookie that can be used
> > +						    * to distinguish physical
> > +						    * devices but otherwise
> > +						    * has no meaning. */
> > 
> > #endif /* _LINUX_FIEMAP_H */
> > 
> > Under this scheme, XFS can set FIEMAP_EXTENT_DEV_T in fe_flags and start
> > encoding:
> > 
> >         fe_device = new_encode_dev(xfs_get_device_for_file());
> > 
> > Some clustered filesystem or whatever could set FIEMAP_EXTENT_DEV_COOKIE
> > and encode the replica number in fe_device.
> > 
> > Existing filesystems can be left unchanged, in which case neither
> > EXTENT_DEV flag is set in fe_flags and the bits in fe_device are
> > meaningless, the same as they are today.  Reporting fe_device is entirely
> > optional.
> 
> I like this better than my plain "FIEMAP_EXTENT_DEVICE" proposal, since it
> allows userspace to distinguish between an actual dev_t and a
> unique-but-locally-meaningless identifier that is needed for network
> filesystems.
> 
> Cheers, Andreas
> 
> > Userspace programs will now be able to tell which device the file data
> > lives on, which has been sort-of requested for years, if the filesystem
> > chooses to start exporting that information.
> > 
> > Your FIBMAP-via-FIEMAP backend can do something like:
> > 
> >     /* FIBMAP only returns results for the same block device backing the fs. */
> >     if ((fe->fe_flags & FIEMAP_EXTENT_DEV_T) &&
> >         fe->fe_device != new_encode_dev(inode->i_sb->s_dev))
> >         return 0;
> > 
> >     /* Can't tell what is the backing device, bail out. */
> >     if (fe->fe_flags & FIEMAP_EXTENT_DEV_COOKIE)
> >         return 0;
> > 
> >     /*
> >      * Either fe_device matches the backing device or the implementation
> >      * doesn't tell us about the backing device, so assume it's ok.
> >      */
> >     <return FIBMAP results>
> > 
> > So that's how I'd solve a longstanding design problem of FIEMAP and then
> > take advantage of that solution to remedy my objections to the proposed
> > "Use FIEMAP for FIBMAP" series.  It doesn't require a FIEMAP_FLAG
> > behavior flag that userspace knows about but isn't allowed to pass in.
> > 
> >> A FIEMAP_FLAG will also require FS changes, but IMHO, less intrusive
> >> than the device id in fiemap_extent. I don't see much advantage in
> >> adding the device id instead of using the flag.
> >> 
> >> A problem I see using a new FIEMAP_FLAG, is it 'could' be also passed via
> >> userspace, so, it would require a check to make sure it didn't come from
> >> userspace if ioctl_fiemap() was used.
> >> 
> >> I think there are 2 other possibilities which can be used to fix this.
> >> 
> >> - Use a boolean inside fiemap_extent_info to identify a fibmap call, or,
> >> - If the device id is a must for you, maybe add the device id into
> >>  fiemap_extent_info instead of fiemap_extent.
> > 
> > That won't work with btrfs, which can store file extents on multiple
> > different physical devices.
> > 
> >>  So we don't mess with a UAPI exported data structure and still
> >>  provides a way to the filesystems to provide which device the mapped
> >>  extent is in.
> >> 
> >> What you think?
> >> 
> >> Cheers
> >> 

-- 
Carlos


* Re: [PATCH 09/10 V2] Use FIEMAP for FIBMAP calls
  2019-02-06 20:44           ` Darrick J. Wong
  2019-02-06 21:13             ` Andreas Dilger
@ 2019-02-07 11:59             ` Carlos Maiolino
  2019-02-07 17:02               ` Darrick J. Wong
  2019-02-07 12:36             ` Carlos Maiolino
  2 siblings, 1 reply; 53+ messages in thread
From: Carlos Maiolino @ 2019-02-07 11:59 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-fsdevel, hch, adilger, sandeen, david

On Wed, Feb 06, 2019 at 12:44:31PM -0800, Darrick J. Wong wrote:
> On Wed, Feb 06, 2019 at 02:37:53PM +0100, Carlos Maiolino wrote:
> > > > > In any case, I think a better solution to the multi-device problem is to
> > > > > start returning device information via struct fiemap_extent, at least
> > > > > inside the kernel.  Use one of the reserved fields to declare a new
> > > > > '__u32 fe_device' field in struct fiemap_extent which can be the dev_t
> > > > > device number, and then you can check that against inode->i_sb->s_bdev
> > > > > to avoid returning results for the non-primary device of a multi-device
> > > > > filesystem.
> > > > 
> > > > I agree we should address it here, but I don't think fiemap_extent is the right
> > > > place for it, it is linked to the UAPI, and changing it is usually not a good
> > > > idea.
> > > 
> > > Adding a FIEMAP_EXTENT flag or two to turn one of the fe_reserved fields
> > > into some sort of dev_t/per-device cookie should be fine.  Userspace
> > > shouldn't be expecting any meaning in reserved areas.
> > > 
> > > > I think I got your idea anyway, but, what if, instead returning the bdev in
> > > > fiemap_extent, we instead, send a flag (via fi_flags) to the filesystem, to
> > > > identify a FIBMAP or a FIEMAP call, and let the filesystem decide what to do
> > > > with such information?
> > > 
> > > I don't like the idea of adding a FIEMAP_FLAG to distinguish callers.
> > 
> > Ok, may I ask why not?
> 
> I think it's a bad idea to add a flag to FIEMAP to change its behavior
> to suit an older and even crappier legacy interface (i.e. FIBMAP).
> 
> FIBMAP is architecturally broken in that we can't /ever/ provide the
> context of "which device does this map to?"
> 
> FIEMAP is architecturally deficient as well, but its ioctl structure
> definition is flexible enough that we can report "which device does this
> map to".
> 
> I want to enhance FIEMAP to deal with multi-device filesystems
> correctly, and as much as I want to kill FIBMAP, I can't because of zipl
> and *lilo.
> 
> > My apologies if I am wrong, but, per my understanding, there is
> > nothing today, which tells userspace which device belongs the extent
> > map reported by FIEMAP.
> 
> Right...
> 
> > If it belongs to the RT device in XFS, or whatever disk in a raid in
> > BTRFS, we simply do not provide such information.
> 
> Right...
> 
> > So, the goal is to provide a way to tell the filesystem if a FIEMAP or
> > a FIBMAP has been requested, so the current behavior of both ioctls
> > won't change.
> 
> ...but from my point of view, the FIEMAP behavior *ought* to change to
> be more expressive.  Once that's done, we can use the more expressive
> FIEMAP output to solve the problem of FIBMAP vs. multi-disk filesystems.
> 
> The whole point of having fe_reserved* fields in struct fiemap_extent is
> so that we can add a new FIEMAP_EXTENT_ flag so that the filesystem can
> start returning data in a reserved field.  New userspace programs that
> know about the flag can start reading information from the new field if
> they see the flag, and old userspace programs don't know about the flag
> and won't be any worse off.
> 
> > Enabling filesystems to return device information into fiemap_extent
> > requires modification of all filesystems to provide such information,
> > which will not have any use other than matching the mounted device to
> > the device where the extent is.
> 
> Perhaps it would help for me to present a more concrete proposal:
> 
> --- a/include/uapi/linux/fiemap.h	2019-01-18 10:53:44.000000000 -0800
> +++ b/include/uapi/linux/fiemap.h	2019-02-06 12:25:52.813935941 -0800
> @@ -22,7 +22,19 @@ struct fiemap_extent {
>  	__u64 fe_length;   /* length in bytes for this extent */
>  	__u64 fe_reserved64[2];
>  	__u32 fe_flags;    /* FIEMAP_EXTENT_* flags for this extent */
> -	__u32 fe_reserved[3];
> +
> +	/*
> +	 * Underlying device that this extent is stored on.
> +	 *
> +	 * If FIEMAP_EXTENT_DEV_T is set, this field is a dev_t containing the
> +	 * major and minor numbers of a device.  If FIEMAP_EXTENT_DEV_COOKIE is
> +	 * set, this field is a 32-bit cookie that can be used to distinguish
> +	 * between backing devices but has no intrinsic meaning.  If neither
> +	 * EXTENT_DEV flag is set, this field is meaningless.  Only one of the
> +	 * EXTENT_DEV flags may be set at any time.
> +	 */
> +	__u32 fe_device;
> +	__u32 fe_reserved[2];
>  };
>  
>  struct fiemap {
> @@ -66,5 +78,14 @@ struct fiemap {
>  						    * merged for efficiency. */
>  #define FIEMAP_EXTENT_SHARED		0x00002000 /* Space shared with other
>  						    * files. */
> +#define FIEMAP_EXTENT_DEV_T		0x00004000 /* fe_device is a dev_t
> +						    * structure containing the
> +						    * major and minor numbers
> +						    * of a block device. */
> +#define FIEMAP_EXTENT_DEV_COOKIE	0x00008000 /* fe_device is a 32-bit
> +						    * cookie that can be used
> +						    * to distinguish physical
> +						    * devices but otherwise
> +						    * has no meaning. */
>  
>  #endif /* _LINUX_FIEMAP_H */
> 
> Under this scheme, XFS can set FIEMAP_EXTENT_DEV_T in fe_flags and start
> encoding fe_device = new_encode_dev(xfs_get_device_for_file()).
> 
> Some clustered filesystem or whatever could set FIEMAP_EXTENT_DEV_COOKIE
> and encode the replica number in fe_device.
> 

All of this makes sense, but I'm struggling to understand what you mean by
replica number here, and why it justifies a second flag.

> Existing filesystems can be left unchanged, in which case neither
> EXTENT_DEV flag is set in fe_flags and the bits in fe_device are
> meaningless, the same as they are today.  Reporting fe_device is entirely
> optional.
> 
> Userspace programs will now be able to tell which device the file data
> lives on, which has been sort-of requested for years, if the filesystem
> chooses to start exporting that information.
> 
> Your FIBMAP-via-FIEMAP backend can do something like:
> 
> /* FIBMAP only returns results for the same block device backing the fs. */
> if ((fe->fe_flags & EXTENT_DEV_T) && fe->fe_device != inode->i_sb->sb_device)
> 	return 0;
> 
> /* Can't tell what is the backing device, bail out. */
> if (fe->fe_flags & EXTENT_DEV_COOKIE)
> 	return 0;
> 

Ok, the first conditional is fine, but the second one doesn't make sense to me.
It looks like you are using it to flag that the filesystem can't tell exactly
which device the current extent is on, for example in distributed filesystems,
where the physical extent can actually be on a different machine. But I can't
say for sure; can you give me more details about what you are trying to achieve
here?
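
To check that I am reading the two conditionals the same way you are, here is a
minimal userspace sketch of them. The two flag values come from your proposed
patch; struct extent_model, fibmap_from_extent(), the fs_dev argument and the
blkbits shift are names and assumptions of mine for illustration only, not
kernel API.

```c
#include <stdint.h>

/* Flag values taken from the proposed uapi patch quoted above. */
#define FIEMAP_EXTENT_DEV_T      0x00004000u
#define FIEMAP_EXTENT_DEV_COOKIE 0x00008000u

/* Hypothetical, trimmed-down stand-in for struct fiemap_extent. */
struct extent_model {
	uint64_t fe_physical;	/* byte offset on the backing device */
	uint32_t fe_flags;	/* FIEMAP_EXTENT_* flags */
	uint32_t fe_device;	/* only meaningful if a DEV flag is set */
};

/*
 * Return the block number a FIBMAP caller would see, or 0 when the
 * extent must not be reported. fs_dev is the encoded device number of
 * the filesystem's primary block device (an assumption of this sketch).
 */
uint64_t fibmap_from_extent(const struct extent_model *fe,
			    uint32_t fs_dev, unsigned int blkbits)
{
	/* A dev_t was reported and it is not the primary device. */
	if ((fe->fe_flags & FIEMAP_EXTENT_DEV_T) && fe->fe_device != fs_dev)
		return 0;

	/* Only an opaque cookie was reported: it cannot be compared. */
	if (fe->fe_flags & FIEMAP_EXTENT_DEV_COOKIE)
		return 0;

	/* Matching dev_t, or no device information at all. */
	return fe->fe_physical >> blkbits;
}
```

With 4k blocks (blkbits = 12), an extent at byte 8192 maps to block 2 when the
device matches or is not reported, and to 0 in the other two cases.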



> /*
>  * Either fe_device matches the backing device or the implementation
>  * doesn't tell us about the backing device, so assume it's ok.
>  */
> <return FIBMAP results>
>

This actually looks to contradict what you have been complaining about: some
filesystems which don't support FIBMAP currently will now suddenly start to
support it. Assuming it's ok when the implementation doesn't tell us about the
backing device will simply make FIBMAP work. Let's say BTRFS doesn't report the
backing device; assuming it's ok will just fall into your first complaint.

Anyway, I think I need to better understand your usage idea for the
EXTENT_DEV_COOKIE you mentioned.
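
To spell out the concern, below is a small userspace sketch contrasting the
"assume it's ok" behaviour with a stricter one that only answers FIBMAP when
the filesystem positively reports a matching dev_t. The flag values are from
the proposed patch; the enum, the function names and the strict policy itself
are hypothetical, just to illustrate the difference.

```c
#include <stdbool.h>
#include <stdint.h>

/* Flag values taken from the proposed uapi patch quoted above. */
#define FIEMAP_EXTENT_DEV_T      0x00004000u
#define FIEMAP_EXTENT_DEV_COOKIE 0x00008000u

/* Hypothetical classification of what an extent says about its device. */
enum dev_state {
	DEV_MATCHES,	/* dev_t reported and equal to the fs device */
	DEV_REJECTED,	/* different dev_t, or an opaque cookie */
	DEV_UNREPORTED,	/* filesystem set neither EXTENT_DEV flag */
};

enum dev_state classify_extent_dev(uint32_t fe_flags, uint32_t fe_device,
				   uint32_t fs_dev)
{
	if (fe_flags & FIEMAP_EXTENT_DEV_COOKIE)
		return DEV_REJECTED;
	if (fe_flags & FIEMAP_EXTENT_DEV_T)
		return fe_device == fs_dev ? DEV_MATCHES : DEV_REJECTED;
	return DEV_UNREPORTED;
}

/* Permissive: what the quoted pseudo-code does. */
bool fibmap_allowed_permissive(enum dev_state s)
{
	return s != DEV_REJECTED;
}

/* Strict: a hypothetical alternative that needs a positive match. */
bool fibmap_allowed_strict(enum dev_state s)
{
	return s == DEV_MATCHES;
}
```

A filesystem that sets neither flag lands in DEV_UNREPORTED, so the permissive
policy lets FIBMAP through and the strict one does not. The strict variant, on
the other hand, would break FIBMAP for every existing filesystem until it
starts reporting fe_device.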

> So that's how I'd solve a longstanding design problem of FIEMAP and then
> take advantage of that solution to remedy my objections to the proposed
> "Use FIEMAP for FIBMAP" series.  It doesn't require a FIEMAP_FLAG
> behavior flag that userspace knows about but isn't allowed to pass in.
>

> > A FIEMAP_FLAG will also require FS changes, but IMHO, less intrusive
> > than the device id in fiemap_extent. I don't see much advantage in
> > adding the device id instead of using the flag.
> > 
> > A problem I see using a new FIEMAP_FLAG, is it 'could' be also passed via
> > userspace, so, it would require a check to make sure it didn't come from
> > userspace if ioctl_fiemap() was used.
> > 
> > I think there are 2 other possibilities which can be used to fix this.
> > 
> > - Use a boolean inside fiemap_extent_info to identify a fibmap call, or,
> > - If the device id is a must for you, maybe add the device id into
> >   fiemap_extent_info instead of fiemap_extent.
> 
> That won't work with btrfs, which can store file extents on multiple
> different physical devices.
> 
> >   So we don't mess with a UAPI exported data structure and still
> >   provides a way to the filesystems to provide which device the mapped
> >   extent is in.
> > 
> > What you think?
> > 
> > Cheers
> > 
> > 
> > > 
> > > --D
> > > 
> > > > > 
> > > > > > +
> > > > > > +	return error;
> > > > > > +}
> > > > > > +
> > > > > >  /**
> > > > > >   *	bmap	- find a block number in a file
> > > > > >   *	@inode:  inode owning the block number being requested
> > > > > > @@ -1594,10 +1628,14 @@ EXPORT_SYMBOL(iput);
> > > > > >   */
> > > > > >  int bmap(struct inode *inode, sector_t *block)
> > > > > >  {
> > > > > > -	if (!inode->i_mapping->a_ops->bmap)
> > > > > > +	if (inode->i_op->fiemap)
> > > > > > +		return bmap_fiemap(inode, block);
> > > > > > +	else if (inode->i_mapping->a_ops->bmap)
> > > > > > +		*block = inode->i_mapping->a_ops->bmap(inode->i_mapping,
> > > > > > +						       *block);
> > > > > > +	else
> > > > > >  		return -EINVAL;
> > > > > 
> > > > > Waitaminute.  btrfs currently supports fiemap but not bmap, and now
> > > > > suddenly it will support this legacy interface they've never supported
> > > > > before.  Are they on board with this?
> > > > > 
> > > > > --D
> > > > > 
> > > > > >  
> > > > > > -	*block = inode->i_mapping->a_ops->bmap(inode->i_mapping, *block);
> > > > > >  	return 0;
> > > > > >  }
> > > > > >  EXPORT_SYMBOL(bmap);
> > > > > > diff --git a/fs/ioctl.c b/fs/ioctl.c
> > > > > > index 6086978fe01e..bfa59df332bf 100644
> > > > > > --- a/fs/ioctl.c
> > > > > > +++ b/fs/ioctl.c
> > > > > > @@ -116,6 +116,38 @@ int fiemap_fill_user_extent(struct fiemap_extent_info *fieinfo, u64 logical,
> > > > > >  	return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
> > > > > >  }
> > > > > >  
> > > > > > +int fiemap_fill_kernel_extent(struct fiemap_extent_info *fieinfo, u64 logical,
> > > > > > +			    u64 phys, u64 len, u32 flags)
> > > > > > +{
> > > > > > +	struct fiemap_extent *extent = fieinfo->fi_extents_start;
> > > > > > +
> > > > > > +	/* only count the extents */
> > > > > > +	if (fieinfo->fi_extents_max == 0) {
> > > > > > +		fieinfo->fi_extents_mapped++;
> > > > > > +		return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
> > > > > > +	}
> > > > > > +
> > > > > > +	if (fieinfo->fi_extents_mapped >= fieinfo->fi_extents_max)
> > > > > > +		return 1;
> > > > > > +
> > > > > > +	if (flags & SET_UNKNOWN_FLAGS)
> > > > > > +		flags |= FIEMAP_EXTENT_UNKNOWN;
> > > > > > +	if (flags & SET_NO_UNMOUNTED_IO_FLAGS)
> > > > > > +		flags |= FIEMAP_EXTENT_ENCODED;
> > > > > > +	if (flags & SET_NOT_ALIGNED_FLAGS)
> > > > > > +		flags |= FIEMAP_EXTENT_NOT_ALIGNED;
> > > > > > +
> > > > > > +	extent->fe_logical = logical;
> > > > > > +	extent->fe_physical = phys;
> > > > > > +	extent->fe_length = len;
> > > > > > +	extent->fe_flags = flags;
> > > > > > +
> > > > > > +	fieinfo->fi_extents_mapped++;
> > > > > > +
> > > > > > +	if (fieinfo->fi_extents_mapped == fieinfo->fi_extents_max)
> > > > > > +		return 1;
> > > > > > +	return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
> > > > > > +}
> > > > > >  /**
> > > > > >   * fiemap_fill_next_extent - Fiemap helper function
> > > > > >   * @fieinfo:	Fiemap context passed into ->fiemap
> > > > > > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > > > > > index 7a434979201c..28bb523d532a 100644
> > > > > > --- a/include/linux/fs.h
> > > > > > +++ b/include/linux/fs.h
> > > > > > @@ -1711,6 +1711,8 @@ struct fiemap_extent_info {
> > > > > >  	fiemap_fill_cb	fi_cb;
> > > > > >  };
> > > > > >  
> > > > > > +int fiemap_fill_kernel_extent(struct fiemap_extent_info *info, u64 logical,
> > > > > > +			      u64 phys, u64 len, u32 flags);
> > > > > >  int fiemap_fill_next_extent(struct fiemap_extent_info *info, u64 logical,
> > > > > >  			    u64 phys, u64 len, u32 flags);
> > > > > >  int fiemap_check_flags(struct fiemap_extent_info *fieinfo, u32 fs_flags);
> > > > > > -- 
> > > > > > 2.17.2
> > > > > > 
> > > > 
> > > > -- 
> > > > Carlos
> > 
> > -- 
> > Carlos

-- 
Carlos


* Re: [PATCH 09/10 V2] Use FIEMAP for FIBMAP calls
  2019-02-06 20:44           ` Darrick J. Wong
  2019-02-06 21:13             ` Andreas Dilger
  2019-02-07 11:59             ` Carlos Maiolino
@ 2019-02-07 12:36             ` Carlos Maiolino
  2019-02-07 18:16               ` Darrick J. Wong
  2 siblings, 1 reply; 53+ messages in thread
From: Carlos Maiolino @ 2019-02-07 12:36 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-fsdevel, hch, adilger, sandeen, david

Apologies, I forgot to mention another thing..

On Wed, Feb 06, 2019 at 12:44:31PM -0800, Darrick J. Wong wrote:
> On Wed, Feb 06, 2019 at 02:37:53PM +0100, Carlos Maiolino wrote:
> > > > > In any case, I think a better solution to the multi-device problem is to
> > > > > start returning device information via struct fiemap_extent, at least
> > > > > inside the kernel.  Use one of the reserved fields to declare a new
> > > > > '__u32 fe_device' field in struct fiemap_extent which can be the dev_t
> > > > > device number, and then you can check that against inode->i_sb->s_bdev
> > > > > to avoid returning results for the non-primary device of a multi-device
> > > > > filesystem.
> > > > 
> > > > I agree we should address it here, but I don't think fiemap_extent is the right
> > > > place for it, it is linked to the UAPI, and changing it is usually not a good
> > > > idea.
> > > 
> > > Adding a FIEMAP_EXTENT flag or two to turn one of the fe_reserved fields
> > > into some sort of dev_t/per-device cookie should be fine.  Userspace
> > > shouldn't be expecting any meaning in reserved areas.
> > > 
> > > > I think I got your idea anyway, but, what if, instead returning the bdev in
> > > > fiemap_extent, we instead, send a flag (via fi_flags) to the filesystem, to
> > > > identify a FIBMAP or a FIEMAP call, and let the filesystem decide what to do
> > > > with such information?
> > > 
> > > I don't like the idea of adding a FIEMAP_FLAG to distinguish callers.
> > 
> > Ok, may I ask why not?
> 
> I think it's a bad idea to add a flag to FIEMAP to change its behavior
> to suit an older and even crappier legacy interface (i.e. FIBMAP).
> 
> FIBMAP is architecturally broken in that we can't /ever/ provide the
> context of "which device does this map to?"
> 
> FIEMAP is architecturally deficient as well, but its ioctl structure
> definition is flexible enough that we can report "which device does this
> map to".
> 
> I want to enhance FIEMAP to deal with multi-device filesystems
> correctly, and as much as I want to kill FIBMAP, I can't because of zipl
> and *lilo.
> 
> > My apologies if I am wrong, but, per my understanding, there is
> > nothing today, which tells userspace which device belongs the extent
> > map reported by FIEMAP.
> 
> Right...
> 
> > If it belongs to the RT device in XFS, or whatever disk in a raid in
> > BTRFS, we simply do not provide such information.
> 
> Right...
> 
> > So, the goal is to provide a way to tell the filesystem if a FIEMAP or
> > a FIBMAP has been requested, so the current behavior of both ioctls
> > won't change.
> 
> ...but from my point of view, the FIEMAP behavior *ought* to change to
> be more expressive.  Once that's done, we can use the more expressive
> FIEMAP output to solve the problem of FIBMAP vs. multi-disk filesystems.
> 
> The whole point of having fe_reserved* fields in struct fiemap_extent is
> so that we can add a new FIEMAP_EXTENT_ flag so that the filesystem can
> start returning data in a reserved field.  New userspace programs that
> know about the flag can start reading information from the new field if
> they see the flag, and old userspace programs don't know about the flag
> and won't be any worse off.
> 
> > Enabling filesystems to return device information into fiemap_extent
> > requires modification of all filesystems to provide such information,
> > which will not have any use other than matching the mounted device to
> > the device where the extent is.
> 
> Perhaps it would help for me to present a more concrete proposal:
> 
> --- a/include/uapi/linux/fiemap.h	2019-01-18 10:53:44.000000000 -0800
> +++ b/include/uapi/linux/fiemap.h	2019-02-06 12:25:52.813935941 -0800
> @@ -22,7 +22,19 @@ struct fiemap_extent {
>  	__u64 fe_length;   /* length in bytes for this extent */
>  	__u64 fe_reserved64[2];
>  	__u32 fe_flags;    /* FIEMAP_EXTENT_* flags for this extent */
> -	__u32 fe_reserved[3];
> +
> +	/*
> +	 * Underlying device that this extent is stored on.
> +	 *
> +	 * If FIEMAP_EXTENT_DEV_T is set, this field is a dev_t containing the
> +	 * major and minor numbers of a device.  If FIEMAP_EXTENT_DEV_COOKIE is
> +	 * set, this field is a 32-bit cookie that can be used to distinguish
> +	 * between backing devices but has no intrinsic meaning.  If neither
> +	 * EXTENT_DEV flag is set, this field is meaningless.  Only one of the
> +	 * EXTENT_DEV flags may be set at any time.
> +	 */
> +	__u32 fe_device;
> +	__u32 fe_reserved[2];
>  };
>  
>  struct fiemap {
> @@ -66,5 +78,14 @@ struct fiemap {
>  						    * merged for efficiency. */
>  #define FIEMAP_EXTENT_SHARED		0x00002000 /* Space shared with other
>  						    * files. */
> +#define FIEMAP_EXTENT_DEV_T		0x00004000 /* fe_device is a dev_t
> +						    * structure containing the
> +						    * major and minor numbers
> +						    * of a block device. */
> +#define FIEMAP_EXTENT_DEV_COOKIE	0x00008000 /* fe_device is a 32-bit
> +						    * cookie that can be used
> +						    * to distinguish physical
> +						    * devices but otherwise
> +						    * has no meaning. */
>  
>  #endif /* _LINUX_FIEMAP_H */
> 
> Under this scheme, XFS can set FIEMAP_EXTENT_DEV_T in fe_flags and start
> encoding fe_device = new_encode_dev(xfs_get_device_for_file()).

Here, I believe you are forgetting that filesystems do not touch fiemap_extent
directly. We call the fiemap_fill_next_extent() helper to fill each extent found
by fiemap. So we'd need to modify fiemap_fill_next_extent() and the callbacks
being used to accommodate this new field, or create a new helper to set the
device, which doesn't sound reasonable. Either way, we will end up needing to
modify all filesystems.
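
As a rough illustration of the helper change I mean, here is a userspace model
of a fill helper that takes the device as an extra argument. It follows the
return convention of the fill helpers in this series (1 means "stop"), but the
struct names, fill_next_extent_dev() and the dev argument are hypothetical, and
unlike the kernel-side helper in patch 9 it indexes into an array of extents.

```c
#include <stddef.h>
#include <stdint.h>

#define FIEMAP_EXTENT_LAST  0x00000001u
#define FIEMAP_EXTENT_DEV_T 0x00004000u	/* from the proposed patch */

/* Hypothetical stand-ins for fiemap_extent and fiemap_extent_info. */
struct extent_model {
	uint64_t fe_logical;
	uint64_t fe_physical;
	uint64_t fe_length;
	uint32_t fe_flags;
	uint32_t fe_device;
};

struct extent_info_model {
	unsigned int fi_extents_mapped;
	unsigned int fi_extents_max;
	struct extent_model *fi_extents_start;
};

/* Returns 1 when the caller should stop iterating, 0 otherwise. */
int fill_next_extent_dev(struct extent_info_model *info, uint64_t logical,
			 uint64_t phys, uint64_t len, uint32_t flags,
			 uint32_t dev)
{
	struct extent_model *dest;

	/* fi_extents_max == 0 means "only count the extents". */
	if (info->fi_extents_max == 0) {
		info->fi_extents_mapped++;
		return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
	}

	if (info->fi_extents_mapped >= info->fi_extents_max)
		return 1;

	dest = &info->fi_extents_start[info->fi_extents_mapped];
	dest->fe_logical = logical;
	dest->fe_physical = phys;
	dest->fe_length = len;
	dest->fe_flags = flags | FIEMAP_EXTENT_DEV_T;
	dest->fe_device = dev;	/* the new piece of information */

	info->fi_extents_mapped++;
	if (info->fi_extents_mapped == info->fi_extents_max)
		return 1;
	return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
}
```

Every ->fiemap implementation that calls the helper would have to pass the
extra argument, which is the "modify all filesystems" cost I am worried about.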

So, although I really like the idea of improving the FIEMAP interface, I'm
starting to consider another patchset for it. I think it requires an interface
change too big to fit in this patchset, which actually has a different
purpose. Or, maybe, address this at the end of this patchset, leaving different
interface changes in different patchsets, instead of making many changes all at
once, mixed together.

> 
> Some clustered filesystem or whatever could set FIEMAP_EXTENT_DEV_COOKIE
> and encode the replica number in fe_device.
> 
> Existing filesystems can be left unchanged, in which case neither
> EXTENT_DEV flag is set in fe_flags and the bits in fe_device are
> meaningless, the same as they are today.  Reporting fe_device is entirely
> optional.
> 
> Userspace programs will now be able to tell which device the file data
> lives on, which has been sort-of requested for years, if the filesystem
> chooses to start exporting that information.
> 
> Your FIBMAP-via-FIEMAP backend can do something like:
> 
> /* FIBMAP only returns results for the same block device backing the fs. */
> if ((fe->fe_flags & EXTENT_DEV_T) && fe->fe_device != inode->i_sb->sb_device)
> 	return 0;
> 
> /* Can't tell what is the backing device, bail out. */
> if (fe->fe_flags & EXTENT_DEV_COOKIE)
> 	return 0;
> 
> /*
>  * Either fe_device matches the backing device or the implementation
>  * doesn't tell us about the backing device, so assume it's ok.
>  */
> <return FIBMAP results>
> 
> So that's how I'd solve a longstanding design problem of FIEMAP and then
> take advantage of that solution to remedy my objections to the proposed
> "Use FIEMAP for FIBMAP" series.  It doesn't require a FIEMAP_FLAG
> behavior flag that userspace knows about but isn't allowed to pass in.
> 
> > A FIEMAP_FLAG will also require FS changes, but IMHO, less intrusive
> > than the device id in fiemap_extent. I don't see much advantage in
> > adding the device id instead of using the flag.
> > 
> > A problem I see using a new FIEMAP_FLAG, is it 'could' be also passed via
> > userspace, so, it would require a check to make sure it didn't come from
> > userspace if ioctl_fiemap() was used.
> > 
> > I think there are 2 other possibilities which can be used to fix this.
> > 
> > - Use a boolean inside fiemap_extent_info to identify a fibmap call, or,
> > - If the device id is a must for you, maybe add the device id into
> >   fiemap_extent_info instead of fiemap_extent.
> 
> That won't work with btrfs, which can store file extents on multiple
> different physical devices.
> 
> >   So we don't mess with a UAPI exported data structure and still
> >   provides a way to the filesystems to provide which device the mapped
> >   extent is in.
> > 
> > What you think?
> > 
> > Cheers
> > 
> > 
> > > 
> > > --D
> > > 
> > > > > 
> > > > > > +
> > > > > > +	return error;
> > > > > > +}
> > > > > > +
> > > > > >  /**
> > > > > >   *	bmap	- find a block number in a file
> > > > > >   *	@inode:  inode owning the block number being requested
> > > > > > @@ -1594,10 +1628,14 @@ EXPORT_SYMBOL(iput);
> > > > > >   */
> > > > > >  int bmap(struct inode *inode, sector_t *block)
> > > > > >  {
> > > > > > -	if (!inode->i_mapping->a_ops->bmap)
> > > > > > +	if (inode->i_op->fiemap)
> > > > > > +		return bmap_fiemap(inode, block);
> > > > > > +	else if (inode->i_mapping->a_ops->bmap)
> > > > > > +		*block = inode->i_mapping->a_ops->bmap(inode->i_mapping,
> > > > > > +						       *block);
> > > > > > +	else
> > > > > >  		return -EINVAL;
> > > > > 
> > > > > Waitaminute.  btrfs currently supports fiemap but not bmap, and now
> > > > > suddenly it will support this legacy interface they've never supported
> > > > > before.  Are they on board with this?
> > > > > 
> > > > > --D
> > > > > 
> > > > > >  
> > > > > > -	*block = inode->i_mapping->a_ops->bmap(inode->i_mapping, *block);
> > > > > >  	return 0;
> > > > > >  }
> > > > > >  EXPORT_SYMBOL(bmap);
> > > > > > diff --git a/fs/ioctl.c b/fs/ioctl.c
> > > > > > index 6086978fe01e..bfa59df332bf 100644
> > > > > > --- a/fs/ioctl.c
> > > > > > +++ b/fs/ioctl.c
> > > > > > @@ -116,6 +116,38 @@ int fiemap_fill_user_extent(struct fiemap_extent_info *fieinfo, u64 logical,
> > > > > >  	return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
> > > > > >  }
> > > > > >  
> > > > > > +int fiemap_fill_kernel_extent(struct fiemap_extent_info *fieinfo, u64 logical,
> > > > > > +			    u64 phys, u64 len, u32 flags)
> > > > > > +{
> > > > > > +	struct fiemap_extent *extent = fieinfo->fi_extents_start;
> > > > > > +
> > > > > > +	/* only count the extents */
> > > > > > +	if (fieinfo->fi_extents_max == 0) {
> > > > > > +		fieinfo->fi_extents_mapped++;
> > > > > > +		return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
> > > > > > +	}
> > > > > > +
> > > > > > +	if (fieinfo->fi_extents_mapped >= fieinfo->fi_extents_max)
> > > > > > +		return 1;
> > > > > > +
> > > > > > +	if (flags & SET_UNKNOWN_FLAGS)
> > > > > > +		flags |= FIEMAP_EXTENT_UNKNOWN;
> > > > > > +	if (flags & SET_NO_UNMOUNTED_IO_FLAGS)
> > > > > > +		flags |= FIEMAP_EXTENT_ENCODED;
> > > > > > +	if (flags & SET_NOT_ALIGNED_FLAGS)
> > > > > > +		flags |= FIEMAP_EXTENT_NOT_ALIGNED;
> > > > > > +
> > > > > > +	extent->fe_logical = logical;
> > > > > > +	extent->fe_physical = phys;
> > > > > > +	extent->fe_length = len;
> > > > > > +	extent->fe_flags = flags;
> > > > > > +
> > > > > > +	fieinfo->fi_extents_mapped++;
> > > > > > +
> > > > > > +	if (fieinfo->fi_extents_mapped == fieinfo->fi_extents_max)
> > > > > > +		return 1;
> > > > > > +	return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
> > > > > > +}
> > > > > >  /**
> > > > > >   * fiemap_fill_next_extent - Fiemap helper function
> > > > > >   * @fieinfo:	Fiemap context passed into ->fiemap
> > > > > > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > > > > > index 7a434979201c..28bb523d532a 100644
> > > > > > --- a/include/linux/fs.h
> > > > > > +++ b/include/linux/fs.h
> > > > > > @@ -1711,6 +1711,8 @@ struct fiemap_extent_info {
> > > > > >  	fiemap_fill_cb	fi_cb;
> > > > > >  };
> > > > > >  
> > > > > > +int fiemap_fill_kernel_extent(struct fiemap_extent_info *info, u64 logical,
> > > > > > +			      u64 phys, u64 len, u32 flags);
> > > > > >  int fiemap_fill_next_extent(struct fiemap_extent_info *info, u64 logical,
> > > > > >  			    u64 phys, u64 len, u32 flags);
> > > > > >  int fiemap_check_flags(struct fiemap_extent_info *fieinfo, u32 fs_flags);
> > > > > > -- 
> > > > > > 2.17.2
> > > > > > 
> > > > 
> > > > -- 
> > > > Carlos
> > 
> > -- 
> > Carlos

-- 
Carlos


* Re: [PATCH 09/10 V2] Use FIEMAP for FIBMAP calls
  2019-02-07 11:59             ` Carlos Maiolino
@ 2019-02-07 17:02               ` Darrick J. Wong
  2019-02-07 21:25                 ` Andreas Dilger
  2019-02-08  9:03                 ` Carlos Maiolino
  0 siblings, 2 replies; 53+ messages in thread
From: Darrick J. Wong @ 2019-02-07 17:02 UTC (permalink / raw)
  To: Carlos Maiolino; +Cc: linux-fsdevel, hch, adilger, sandeen, david

On Thu, Feb 07, 2019 at 12:59:54PM +0100, Carlos Maiolino wrote:
> On Wed, Feb 06, 2019 at 12:44:31PM -0800, Darrick J. Wong wrote:
> > On Wed, Feb 06, 2019 at 02:37:53PM +0100, Carlos Maiolino wrote:
> > > > > > In any case, I think a better solution to the multi-device problem is to
> > > > > > start returning device information via struct fiemap_extent, at least
> > > > > > inside the kernel.  Use one of the reserved fields to declare a new
> > > > > > '__u32 fe_device' field in struct fiemap_extent which can be the dev_t
> > > > > > device number, and then you can check that against inode->i_sb->s_bdev
> > > > > > to avoid returning results for the non-primary device of a multi-device
> > > > > > filesystem.
> > > > > 
> > > > > I agree we should address it here, but I don't think fiemap_extent is the right
> > > > > place for it, it is linked to the UAPI, and changing it is usually not a good
> > > > > idea.
> > > > 
> > > > Adding a FIEMAP_EXTENT flag or two to turn one of the fe_reserved fields
> > > > into some sort of dev_t/per-device cookie should be fine.  Userspace
> > > > shouldn't be expecting any meaning in reserved areas.
> > > > 
> > > > > I think I got your idea anyway, but, what if, instead returning the bdev in
> > > > > fiemap_extent, we instead, send a flag (via fi_flags) to the filesystem, to
> > > > > > identify a FIBMAP or a FIEMAP call, and let the filesystem decide what to do
> > > > > with such information?
> > > > 
> > > > I don't like the idea of adding a FIEMAP_FLAG to distinguish callers.
> > > 
> > > Ok, may I ask why not?
> > 
> > I think it's a bad idea to add a flag to FIEMAP to change its behavior
> > to suit an older and even crappier legacy interface (i.e. FIBMAP).
> > 
> > FIBMAP is architecturally broken in that we can't /ever/ provide the
> > context of "which device does this map to?"
> > 
> > FIEMAP is architecturally deficient as well, but its ioctl structure
> > definition is flexible enough that we can report "which device does this
> > map to".
> > 
> > I want to enhance FIEMAP to deal with multi-device filesystems
> > correctly, and as much as I want to kill FIBMAP, I can't because of zipl
> > and *lilo.
> > 
> > > My apologies if I am wrong, but, per my understanding, there is
> > > nothing today, which tells userspace which device belongs the extent
> > > map reported by FIEMAP.
> > 
> > Right...
> > 
> > > If it belongs to the RT device in XFS, or whatever disk in a raid in
> > > BTRFS, we simply do not provide such information.
> > 
> > Right...
> > 
> > > So, the goal is to provide a way to tell the filesystem if a FIEMAP or
> > > a FIBMAP has been requested, so the current behavior of both ioctls
> > > won't change.
> > 
> > ...but from my point of view, the FIEMAP behavior *ought* to change to
> > be more expressive.  Once that's done, we can use the more expressive
> > FIEMAP output to solve the problem of FIBMAP vs. multi-disk filesystems.
> > 
> > The whole point of having fe_reserved* fields in struct fiemap_extent is
> > so that we can add a new FIEMAP_EXTENT_ flag so that the filesystem can
> > start returning data in a reserved field.  New userspace programs that
> > know about the flag can start reading information from the new field if
> > they see the flag, and old userspace programs don't know about the flag
> > and won't be any worse off.
> > 
> > > Enabling filesystems to return device information into fiemap_extent
> > > requires modification of all filesystems to provide such information,
> > > which will not have any use other than matching the mounted device to
> > > the device where the extent is.
> > 
> > Perhaps it would help for me to present a more concrete proposal:
> > 
> > --- a/include/uapi/linux/fiemap.h	2019-01-18 10:53:44.000000000 -0800
> > +++ b/include/uapi/linux/fiemap.h	2019-02-06 12:25:52.813935941 -0800
> > @@ -22,7 +22,19 @@ struct fiemap_extent {
> >  	__u64 fe_length;   /* length in bytes for this extent */
> >  	__u64 fe_reserved64[2];
> >  	__u32 fe_flags;    /* FIEMAP_EXTENT_* flags for this extent */
> > -	__u32 fe_reserved[3];
> > +
> > +	/*
> > +	 * Underlying device that this extent is stored on.
> > +	 *
> > +	 * If FIEMAP_EXTENT_DEV_T is set, this field is a dev_t containing the
> > +	 * major and minor numbers of a device.  If FIEMAP_EXTENT_DEV_COOKIE is
> > +	 * set, this field is a 32-bit cookie that can be used to distinguish
> > +	 * between backing devices but has no intrinsic meaning.  If neither
> > +	 * EXTENT_DEV flag is set, this field is meaningless.  Only one of the
> > +	 * EXTENT_DEV flags may be set at any time.
> > +	 */
> > +	__u32 fe_device;
> > +	__u32 fe_reserved[2];
> >  };
> >  
> >  struct fiemap {
> > @@ -66,5 +78,14 @@ struct fiemap {
> >  						    * merged for efficiency. */
> >  #define FIEMAP_EXTENT_SHARED		0x00002000 /* Space shared with other
> >  						    * files. */
> > +#define FIEMAP_EXTENT_DEV_T		0x00004000 /* fe_device is a dev_t
> > +						    * structure containing the
> > +						    * major and minor numbers
> > +						    * of a block device. */
> > +#define FIEMAP_EXTENT_DEV_COOKIE	0x00008000 /* fe_device is a 32-bit
> > +						    * cookie that can be used
> > +						    * to distinguish physical
> > +						    * devices but otherwise
> > +						    * has no meaning. */
> >  
> >  #endif /* _LINUX_FIEMAP_H */
> > 
> > Under this scheme, XFS can set FIEMAP_EXTENT_DEV_T in fe_flags and start
> > encoding fe_device = new_encode_dev(xfs_get_device_for_file()).
> > 
> > Some clustered filesystem or whatever could set FIEMAP_EXTENT_DEV_COOKIE
> > and encode the replica number in fe_device.
> > 
> 
> All of this makes sense, but I'm struggling to understand what you mean by
> replica number here, and why it justifies a second flag.

I left in the "device cookie" thing in the proposal to accommodate a
request from the Lustre folks to be able to report which replica is
storing a particular extent map.  Apparently the replica id is simply a
32-bit number that isn't inherently useful, hence the vagueness around
what "cookie" really means...

...oh, right, lustre fell out of drivers/staging/.  You could probably
leave it out then.
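
For illustration only, here is a rough userspace sketch of how a consumer
might interpret fe_device under the proposal.  The two flag values are copied
from the diff quoted above and are not in any released header; the enum, the
struct and decode_fe_device() are made-up names.  The DEV_T branch assumes the
field was filled with new_encode_dev(), i.e. the kernel's 12-bit major /
20-bit minor packing.

```c
#include <assert.h>
#include <stdint.h>

/* Flag values copied from the proposal; not part of any released header. */
#define FIEMAP_EXTENT_DEV_T      0x00004000u
#define FIEMAP_EXTENT_DEV_COOKIE 0x00008000u

/* Hypothetical decoded view of fe_device. */
enum fe_dev_kind { FE_DEV_NONE, FE_DEV_BLOCK, FE_DEV_COOKIE };

struct fe_dev {
	enum fe_dev_kind kind;
	unsigned int major;	/* valid for FE_DEV_BLOCK */
	unsigned int minor;	/* valid for FE_DEV_BLOCK */
	uint32_t cookie;	/* valid for FE_DEV_COOKIE */
};

/* Interpret fe_device according to the EXTENT_DEV flag set in fe_flags. */
static struct fe_dev decode_fe_device(uint32_t fe_flags, uint32_t fe_device)
{
	struct fe_dev d = { FE_DEV_NONE, 0, 0, 0 };
	const uint32_t both = FIEMAP_EXTENT_DEV_T | FIEMAP_EXTENT_DEV_COOKIE;

	/* The proposal allows at most one EXTENT_DEV flag per extent. */
	assert((fe_flags & both) != both);

	if (fe_flags & FIEMAP_EXTENT_DEV_T) {
		/* Inverse of new_encode_dev(): 12-bit major, 20-bit minor. */
		d.kind = FE_DEV_BLOCK;
		d.major = (fe_device & 0xfff00u) >> 8;
		d.minor = (fe_device & 0xffu) | ((fe_device >> 12) & 0xfff00u);
	} else if (fe_flags & FIEMAP_EXTENT_DEV_COOKIE) {
		/* Opaque value: only useful for telling devices apart. */
		d.kind = FE_DEV_COOKIE;
		d.cookie = fe_device;
	}
	/* Neither flag set: fe_device is meaningless, report FE_DEV_NONE. */
	return d;
}
```

Old programs that know neither flag never look at the field, which is the
whole point of carving it out of fe_reserved.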

> > Existing filesystems can be left unchanged, in which case neither
> > EXTENT_DEV flag is set in fe_flags and the bits in fe_device are
> > meaningless, the same as they are today.  Reporting fe_device is entirely
> > optional.
> > 
> > Userspace programs will now be able to tell which device the file data
> > lives on, which has been sort-of requested for years, if the filesystem
> > chooses to start exporting that information.
> > 
> > Your FIBMAP-via-FIEMAP backend can do something like:
> > 
> > /* FIBMAP only returns results for the same block device backing the fs. */
> > if ((fe->fe_flags & FIEMAP_EXTENT_DEV_T) &&
> >     fe->fe_device != new_encode_dev(inode->i_sb->s_dev))
> > 	return 0;
> > 
> > /* Can't tell what is the backing device, bail out. */
> > if (fe->fe_flags & FIEMAP_EXTENT_DEV_COOKIE)
> > 	return 0;
> > 
> 
> Ok, the first conditional is fine, but the second one doesn't make sense to me.
> It looks like you are using it to flag that the filesystem can't tell exactly
> which device the current extent is on, for example in distributed
> filesystems, where the physical extent can actually be on a different machine.
> But I can't say for sure; can you give me more details about what you are trying
> to achieve here?

You've understood me correctly. :)
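
To make the two checks concrete, here is a small standalone sketch of the
decision.  It is not the series' code: fibmap_block_for_extent() is a
made-up name, the flag values come from the proposal, and sb_dev is assumed
to be the filesystem's primary device encoded the same way as fe_device.
It also assumes the usual unit difference, with FIEMAP reporting bytes and
FIBMAP reporting filesystem blocks.

```c
#include <assert.h>
#include <stdint.h>

/* Flag values copied from the proposal; not part of any released header. */
#define FIEMAP_EXTENT_DEV_T      0x00004000u
#define FIEMAP_EXTENT_DEV_COOKIE 0x00008000u

/*
 * Hypothetical helper: the block number a FIBMAP caller would be given for
 * one extent, or 0 (FIBMAP's "no mapping" answer) when the extent cannot be
 * tied to the filesystem's primary block device.
 */
static uint64_t fibmap_block_for_extent(uint32_t fe_flags, uint32_t fe_device,
					uint64_t fe_physical, uint32_t sb_dev,
					unsigned int blkbits)
{
	assert(blkbits < 64);

	/* A cookie is not a block device at all, so bail out. */
	if (fe_flags & FIEMAP_EXTENT_DEV_COOKIE)
		return 0;

	/* A real device, but not the one backing the superblock. */
	if ((fe_flags & FIEMAP_EXTENT_DEV_T) && fe_device != sb_dev)
		return 0;

	/* Matching device, or no device reported at all: assume it's ok. */
	return fe_physical >> blkbits;
}
```

Note that the last branch also fires for a filesystem that reports no device
at all, so such a filesystem would still hand out FIBMAP results.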

> > /*
> >  * Either fe_device matches the backing device or the implementation
> >  * doesn't tell us about the backing device, so assume it's ok.
> >  */
> > <return FIBMAP results>
> >
> 
> This actually seems to contradict what you have been complaining about: some
> filesystems which don't currently support FIBMAP will now suddenly start to
> support it. Assuming it's ok when the implementation doesn't tell us about the
> backing device will simply make FIBMAP work. Say BTRFS doesn't report the
> backing device; assuming it's ok just falls back into your first complaint.

Sorry, this thread has been going on so long that I forgot your goal for
this series. :/

Specifically, I had forgotten that you're removing the ->bmap pointer,
which means that filesystems don't have any particular way to signal
"Yes on FIEMAP, no on FIBMAP".  Somehow I had thought that you were
merely creating a generic_file_bmap() that would call FIEMAP and ripping
out all the ad hoc bmap implementations.

Hmm, how many filesystems support FIEMAP and not FIBMAP?

btrfs, nilfs2, and overlayfs.  Also bad_inode.c...?

Hmm, how many filesystems support FIBMAP and not FIEMAP?

adfs, affs, befs, bfs, efs, exofs, fat, freevxfs, fuse(?), nfs, hfsplus,
isofs, jfs, minixfs, ntfs, qnx[46], reiserfs, sysv, udf, and ufs.

> Anyway, I think I need to understand more your usage idea for EXTENT_DEV_COOKIE
> you mentioned.

I think you've understood it about as well as I can explain it.  Maybe
Andreas will have more to say about the lustre replica id, but OTOH it's
gone and so there's no user of it, so we could just drop it until lustre
comes back.

> > So that's how I'd solve a longstanding design problem of FIEMAP and then
> > take advantage of that solution to remedy my objections to the proposed
> > "Use FIEMAP for FIBMAP" series.  It doesn't require a FIEMAP_FLAG
> > behavior flag that userspace knows about but isn't allowed to pass in.
> >
> 
> > > A FIEMAP_FLAG will also require FS changes, but IMHO, less intrusive
> > > than the device id in fiemap_extent. I don't see much advantage in
> > > adding the device id instead of using the flag.
> > > 
> > > A problem I see using a new FIEMAP_FLAG, is it 'could' be also passed via
> > > userspace, so, it would require a check to make sure it didn't come from
> > > userspace if ioctl_fiemap() was used.
> > > 
> > > I think there are 2 other possibilities which can be used to fix this.
> > > 
> > > - Use a boolean inside fiemap_extent_info to identify a fibmap call, or,
> > > - If the device id is a must for you, maybe add the device id into
> > >   fiemap_extent_info instead of fiemap_extent.
> > 
> > That won't work with btrfs, which can store file extents on multiple
> > different physical devices.
> > 
> > >   So we don't mess with a UAPI exported data structure and still
> > >   provides a way to the filesystems to provide which device the mapped
> > >   extent is in.
> > > 
> > > What you think?
> > > 
> > > Cheers
> > > 
> > > 
> > > > 
> > > > --D
> > > > 
> > > > > > 
> > > > > > > +
> > > > > > > +	return error;
> > > > > > > +}
> > > > > > > +
> > > > > > >  /**
> > > > > > >   *	bmap	- find a block number in a file
> > > > > > >   *	@inode:  inode owning the block number being requested
> > > > > > > @@ -1594,10 +1628,14 @@ EXPORT_SYMBOL(iput);
> > > > > > >   */
> > > > > > >  int bmap(struct inode *inode, sector_t *block)
> > > > > > >  {
> > > > > > > -	if (!inode->i_mapping->a_ops->bmap)
> > > > > > > +	if (inode->i_op->fiemap)
> > > > > > > +		return bmap_fiemap(inode, block);
> > > > > > > +	else if (inode->i_mapping->a_ops->bmap)
> > > > > > > +		*block = inode->i_mapping->a_ops->bmap(inode->i_mapping,
> > > > > > > +						       *block);
> > > > > > > +	else
> > > > > > >  		return -EINVAL;
> > > > > > 
> > > > > > Waitaminute.  btrfs currently supports fiemap but not bmap, and now
> > > > > > suddenly it will support this legacy interface they've never supported
> > > > > > before.  Are they on board with this?
> > > > > > 
> > > > > > --D
> > > > > > 
> > > > > > >  
> > > > > > > -	*block = inode->i_mapping->a_ops->bmap(inode->i_mapping, *block);
> > > > > > >  	return 0;
> > > > > > >  }
> > > > > > >  EXPORT_SYMBOL(bmap);
> > > > > > > diff --git a/fs/ioctl.c b/fs/ioctl.c
> > > > > > > index 6086978fe01e..bfa59df332bf 100644
> > > > > > > --- a/fs/ioctl.c
> > > > > > > +++ b/fs/ioctl.c
> > > > > > > @@ -116,6 +116,38 @@ int fiemap_fill_user_extent(struct fiemap_extent_info *fieinfo, u64 logical,
> > > > > > >  	return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
> > > > > > >  }
> > > > > > >  
> > > > > > > +int fiemap_fill_kernel_extent(struct fiemap_extent_info *fieinfo, u64 logical,
> > > > > > > +			    u64 phys, u64 len, u32 flags)
> > > > > > > +{
> > > > > > > +	struct fiemap_extent *extent = fieinfo->fi_extents_start;
> > > > > > > +
> > > > > > > +	/* only count the extents */
> > > > > > > +	if (fieinfo->fi_extents_max == 0) {
> > > > > > > +		fieinfo->fi_extents_mapped++;
> > > > > > > +		return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
> > > > > > > +	}
> > > > > > > +
> > > > > > > +	if (fieinfo->fi_extents_mapped >= fieinfo->fi_extents_max)
> > > > > > > +		return 1;
> > > > > > > +
> > > > > > > +	if (flags & SET_UNKNOWN_FLAGS)
> > > > > > > +		flags |= FIEMAP_EXTENT_UNKNOWN;
> > > > > > > +	if (flags & SET_NO_UNMOUNTED_IO_FLAGS)
> > > > > > > +		flags |= FIEMAP_EXTENT_ENCODED;
> > > > > > > +	if (flags & SET_NOT_ALIGNED_FLAGS)
> > > > > > > +		flags |= FIEMAP_EXTENT_NOT_ALIGNED;
> > > > > > > +
> > > > > > > +	extent->fe_logical = logical;
> > > > > > > +	extent->fe_physical = phys;
> > > > > > > +	extent->fe_length = len;
> > > > > > > +	extent->fe_flags = flags;
> > > > > > > +
> > > > > > > +	fieinfo->fi_extents_mapped++;
> > > > > > > +
> > > > > > > +	if (fieinfo->fi_extents_mapped == fieinfo->fi_extents_max)
> > > > > > > +		return 1;
> > > > > > > +	return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
> > > > > > > +}
> > > > > > >  /**
> > > > > > >   * fiemap_fill_next_extent - Fiemap helper function
> > > > > > >   * @fieinfo:	Fiemap context passed into ->fiemap
> > > > > > > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > > > > > > index 7a434979201c..28bb523d532a 100644
> > > > > > > --- a/include/linux/fs.h
> > > > > > > +++ b/include/linux/fs.h
> > > > > > > @@ -1711,6 +1711,8 @@ struct fiemap_extent_info {
> > > > > > >  	fiemap_fill_cb	fi_cb;
> > > > > > >  };
> > > > > > >  
> > > > > > > +int fiemap_fill_kernel_extent(struct fiemap_extent_info *info, u64 logical,
> > > > > > > +			      u64 phys, u64 len, u32 flags);
> > > > > > >  int fiemap_fill_next_extent(struct fiemap_extent_info *info, u64 logical,
> > > > > > >  			    u64 phys, u64 len, u32 flags);
> > > > > > >  int fiemap_check_flags(struct fiemap_extent_info *fieinfo, u32 fs_flags);
> > > > > > > -- 
> > > > > > > 2.17.2
> > > > > > > 
> > > > > 
> > > > > -- 
> > > > > Carlos
> > > 
> > > -- 
> > > Carlos
> 
> -- 
> Carlos

* Re: [PATCH 09/10 V2] Use FIEMAP for FIBMAP calls
  2019-02-07 12:36             ` Carlos Maiolino
@ 2019-02-07 18:16               ` Darrick J. Wong
  2019-02-08  8:58                 ` Carlos Maiolino
  0 siblings, 1 reply; 53+ messages in thread
From: Darrick J. Wong @ 2019-02-07 18:16 UTC (permalink / raw)
  To: Carlos Maiolino; +Cc: linux-fsdevel, hch, adilger, sandeen, david

On Thu, Feb 07, 2019 at 01:36:41PM +0100, Carlos Maiolino wrote:
> Apologies, I forgot to mention another thing..
> 
> On Wed, Feb 06, 2019 at 12:44:31PM -0800, Darrick J. Wong wrote:
> > On Wed, Feb 06, 2019 at 02:37:53PM +0100, Carlos Maiolino wrote:
> > > > > > In any case, I think a better solution to the multi-device problem is to
> > > > > > start returning device information via struct fiemap_extent, at least
> > > > > > inside the kernel.  Use one of the reserved fields to declare a new
> > > > > > '__u32 fe_device' field in struct fiemap_extent which can be the dev_t
> > > > > > device number, and then you can check that against inode->i_sb->s_bdev
> > > > > > to avoid returning results for the non-primary device of a multi-device
> > > > > > filesystem.
> > > > > 
> > > > > I agree we should address it here, but I don't think fiemap_extent is the right
> > > > > place for it, it is linked to the UAPI, and changing it is usually not a good
> > > > > idea.
> > > > 
> > > > Adding a FIEMAP_EXTENT flag or two to turn one of the fe_reserved fields
> > > > into some sort of dev_t/per-device cookie should be fine.  Userspace
> > > > shouldn't be expecting any meaning in reserved areas.
> > > > 
> > > > > I think I got your idea anyway, but what if, instead of returning the bdev in
> > > > > fiemap_extent, we send a flag (via fi_flags) to the filesystem to
> > > > > identify a FIBMAP or a FIEMAP call, and let the filesystem decide what to do
> > > > > with such information?
> > > > 
> > > > I don't like the idea of adding a FIEMAP_FLAG to distinguish callers.
> > > 
> > > Ok, may I ask why not?
> > 
> > I think it's a bad idea to add a flag to FIEMAP to change its behavior
> > to suit an older and even crappier legacy interface (i.e. FIBMAP).
> > 
> > FIBMAP is architecturally broken in that we can't /ever/ provide the
> > context of "which device does this map to?"
> > 
> > FIEMAP is architecturally deficient as well, but its ioctl structure
> > definition is flexible enough that we can report "which device does this
> > map to".
> > 
> > I want to enhance FIEMAP to deal with multi-device filesystems
> > correctly, and as much as I want to kill FIBMAP, I can't because of zipl
> > and *lilo.
> > 
> > > My apologies if I am wrong, but, per my understanding, there is
> > > nothing today which tells userspace which device the extent map
> > > reported by FIEMAP belongs to.
> > 
> > Right...
> > 
> > > If it belongs to the RT device in XFS, or whatever disk in a raid in
> > > BTRFS, we simply do not provide such information.
> > 
> > Right...
> > 
> > > So, the goal is to provide a way to tell the filesystem if a FIEMAP or
> > > a FIBMAP has been requested, so the current behavior of both ioctls
> > > won't change.
> > 
> > ...but from my point of view, the FIEMAP behavior *ought* to change to
> > be more expressive.  Once that's done, we can use the more expressive
> > FIEMAP output to solve the problem of FIBMAP vs. multi-disk filesystems.
> > 
> > The whole point of having fe_reserved* fields in struct fiemap_extent is
> > so that we can add a new FIEMAP_EXTENT_ flag so that the filesystem can
> > start returning data in a reserved field.  New userspace programs that
> > know about the flag can start reading information from the new field if
> > they see the flag, and old userspace programs don't know about the flag
> > and won't be any worse off.
> > 
> > > Enabling filesystems to return device information into fiemap_extent
> > > requires modification of all filesystems to provide such information,
> > > which will not have any use other than matching the mounted device to
> > > the device where the extent is.
> > 
> > Perhaps it would help for me to present a more concrete proposal:
> > 
> > --- a/include/uapi/linux/fiemap.h	2019-01-18 10:53:44.000000000 -0800
> > +++ b/include/uapi/linux/fiemap.h	2019-02-06 12:25:52.813935941 -0800
> > @@ -22,7 +22,19 @@ struct fiemap_extent {
> >  	__u64 fe_length;   /* length in bytes for this extent */
> >  	__u64 fe_reserved64[2];
> >  	__u32 fe_flags;    /* FIEMAP_EXTENT_* flags for this extent */
> > -	__u32 fe_reserved[3];
> > +
> > +	/*
> > +	 * Underlying device that this extent is stored on.
> > +	 *
> > +	 * If FIEMAP_EXTENT_DEV_T is set, this field is a dev_t containing the
> > +	 * major and minor numbers of a device.  If FIEMAP_EXTENT_DEV_COOKIE is
> > +	 * set, this field is a 32-bit cookie that can be used to distinguish
> > +	 * between backing devices but has no intrinsic meaning.  If neither
> > +	 * EXTENT_DEV flag is set, this field is meaningless.  Only one of the
> > +	 * EXTENT_DEV flags may be set at any time.
> > +	 */
> > +	__u32 fe_device;
> > +	__u32 fe_reserved[2];
> >  };
> >  
> >  struct fiemap {
> > @@ -66,5 +78,14 @@ struct fiemap {
> >  						    * merged for efficiency. */
> >  #define FIEMAP_EXTENT_SHARED		0x00002000 /* Space shared with other
> >  						    * files. */
> > +#define FIEMAP_EXTENT_DEV_T		0x00004000 /* fe_device is a dev_t
> > +						    * structure containing the
> > +						    * major and minor numbers
> > +						    * of a block device. */
> > +#define FIEMAP_EXTENT_DEV_COOKIE	0x00008000 /* fe_device is a 32-bit
> > +						    * cookie that can be used
> > +						    * to distinguish physical
> > +						    * devices but otherwise
> > +						    * has no meaning. */
> >  
> >  #endif /* _LINUX_FIEMAP_H */
> > 
> > Under this scheme, XFS can set FIEMAP_EXTENT_DEV_T in fe_flags and start
> > encoding fe_device = new_encode_dev(xfs_get_device_for_file()).
> 
> Here, I believe you are forgetting that filesystems do not touch fiemap_extent
> directly. We call the fiemap_fill_next_extent() helper to fill each extent found by
> fiemap. So, either way, we'd need to modify fiemap_fill_next_extent() and the
> callbacks being used to accommodate this new field, or create a new helper to
> set the device, which doesn't sound reasonable. So, either way, we will end up
> needing to modify all filesystems.

Yep.  Drat.  I guess you could add a bdev parameter to
fiemap_fill_next_extent, and we'd use that to encode fe_device.  If the
fs passes NULL then we just get it from the superblock or something.
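
Something along these lines, as a userspace mock-up rather than real kernel
code.  The mock_* types stand in for struct block_device, struct super_block
and struct fiemap_extent, fill_extent_with_dev() is a made-up name, and the
flag value is the one from my proposal:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Flag value copied from the proposal; not part of any released header. */
#define FIEMAP_EXTENT_DEV_T 0x00004000u

/* Userspace stand-ins for kernel objects; every name here is hypothetical. */
struct mock_bdev { uint32_t encoded_dev; };	/* as if from new_encode_dev() */
struct mock_sb { const struct mock_bdev *s_bdev; };
struct mock_extent {
	uint64_t fe_logical, fe_physical, fe_length;
	uint32_t fe_flags, fe_device;
};

/*
 * Fill one extent.  A filesystem that knows which device holds the extent
 * passes it in bdev; one that does not care passes NULL and gets the
 * superblock's device instead.
 */
static void fill_extent_with_dev(struct mock_extent *fe,
				 const struct mock_sb *sb,
				 const struct mock_bdev *bdev,
				 uint64_t logical, uint64_t phys,
				 uint64_t len, uint32_t flags)
{
	assert(fe && sb);

	if (!bdev)
		bdev = sb->s_bdev;

	fe->fe_logical = logical;
	fe->fe_physical = phys;
	fe->fe_length = len;
	fe->fe_device = 0;

	/* Only advertise fe_device when there is a device to report. */
	if (bdev) {
		fe->fe_device = bdev->encoded_dev;
		flags |= FIEMAP_EXTENT_DEV_T;
	}
	fe->fe_flags = flags;
}
```

Every caller still has to grow the extra argument, so this doesn't get you
out of touching all the filesystems, but most of them would just pass NULL.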

> So, although I really like the idea of improving the FIEMAP interface, I'm
> starting to consider another patchset for it. I think it requires an interface
> change too big to fit in this patchset, which actually has a different
> purpose. Or, maybe, address this at the end of this patchset, leaving different
> interface changes in different patchsets, instead of making many changes all at
> once, mixed together.

<nod> I think you're right, fiemap upgrades as one series and then
fibmap-via-fiemap as the second one.

--D


* Re: [PATCH 09/10 V2] Use FIEMAP for FIBMAP calls
  2019-02-07 17:02               ` Darrick J. Wong
@ 2019-02-07 21:25                 ` Andreas Dilger
  2019-02-08  8:46                   ` Christoph Hellwig
  2019-02-08  9:08                   ` Carlos Maiolino
  2019-02-08  9:03                 ` Carlos Maiolino
  1 sibling, 2 replies; 53+ messages in thread
From: Andreas Dilger @ 2019-02-07 21:25 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Carlos Maiolino, linux-fsdevel, Christoph Hellwig, Eric Sandeen, david

On Feb 7, 2019, at 10:02 AM, Darrick J. Wong <darrick.wong@oracle.com> wrote:
> On Thu, Feb 07, 2019 at 12:59:54PM +0100, Carlos Maiolino wrote:
>> On Wed, Feb 06, 2019 at 12:44:31PM -0800, Darrick J. Wong wrote:
>>> 
>>> ...but from my point of view, the FIEMAP behavior *ought* to change to
>>> be more expressive.  Once that's done, we can use the more expressive
>>> FIEMAP output to solve the problem of FIBMAP vs. multi-disk filesystems.
>>> 
>>> The whole point of having fe_reserved* fields in struct fiemap_extent is
>>> so that we can add a new FIEMAP_EXTENT_ flag so that the filesystem can
>>> start returning data in a reserved field.  New userspace programs that
>>> know about the flag can start reading information from the new field if
>>> they see the flag, and old userspace programs don't know about the flag
>>> and won't be any worse off.
>>> 
>>> Perhaps it would help for me to present a more concrete proposal:
>>> 
>>> --- a/include/uapi/linux/fiemap.h	2019-01-18 10:53:44.000000000 -0800
>>> +++ b/include/uapi/linux/fiemap.h	2019-02-06 12:25:52.813935941 -0800
>>> @@ -22,7 +22,19 @@ struct fiemap_extent {
>>> 	__u64 fe_length;   /* length in bytes for this extent */
>>> 	__u64 fe_reserved64[2];
>>> 	__u32 fe_flags;    /* FIEMAP_EXTENT_* flags for this extent */
>>> -	__u32 fe_reserved[3];
>>> +
>>> +	/*
>>> +	 * Underlying device that this extent is stored on.
>>> +	 *
>>> +	 * If FIEMAP_EXTENT_DEV_T is set, this field is a dev_t containing the
>>> +	 * major and minor numbers of a device.  If FIEMAP_EXTENT_DEV_COOKIE is
>>> +	 * set, this field is a 32-bit cookie that can be used to distinguish
>>> +	 * between backing devices but has no intrinsic meaning.  If neither
>>> +	 * EXTENT_DEV flag is set, this field is meaningless.  Only one of the
>>> +	 * EXTENT_DEV flags may be set at any time.
>>> +	 */
>>> +	__u32 fe_device;
>>> +	__u32 fe_reserved[2];
>>> };
>>> 
>>> struct fiemap {
>>> @@ -66,5 +78,14 @@ struct fiemap {
>>> 						    * merged for efficiency. */
>>> #define FIEMAP_EXTENT_SHARED		0x00002000 /* Space shared with other
>>> 						    * files. */
>>> +#define FIEMAP_EXTENT_DEV_T		0x00004000 /* fe_device is a dev_t
>>> +						    * structure containing the
>>> +						    * major and minor numbers
>>> +						    * of a block device. */
>>> +#define FIEMAP_EXTENT_DEV_COOKIE	0x00008000 /* fe_device is a 32-bit
>>> +						    * cookie that can be used
>>> +						    * to distinguish physical
>>> +						    * devices but otherwise
>>> +						    * has no meaning. */
>>> 
>>> #endif /* _LINUX_FIEMAP_H */
>>> 
>>> Under this scheme, XFS can set FIEMAP_EXTENT_DEV_T in fe_flags and start
>>> encoding fe_device = new_encode_dev(xfs_get_device_for_file()).
>>> 
>>> Some clustered filesystem or whatever could set FIEMAP_EXTENT_DEV_COOKIE
>>> and encode the replica number in fe_device.
>>> 
>> 
>> All of this makes sense, but I'm struggling to understand what you mean by
>> replica number here, and why it justifies a second flag.
> 
> I left the "device cookie" thing in the proposal to accommodate a
> request from the Lustre folks to be able to report which replica is
> storing a particular extent map.  Apparently the replica id is simply a
> 32-bit number that isn't inherently useful, hence the vagueness around
> what "cookie" really means...
> 
> ...oh, right, lustre fell out of drivers/staging/.  You could probably
> leave it out then.

Do we really need to be this way, about reserving a single flag for Lustre,
which will likely also be useful for other filesystems?  It's not like
Lustre is some closed-source binary module for which we need to make life
difficult, it is used by many thousands of the largest computers at labs
and universities and companies around the world.  We are working to clean
up the code outside the staging tree and resubmit it.  Not reserving a flag
just means we will continue to use random values in Lustre before it can
be merged, which will make life harder when we try to merge again.


In the case of Lustre, the proposed DEV_COOKIE would indicate fe_device is
the integer index number of the server on which each extent of the file is
located (Darrick's "replica number" term is not really correct).  The server
index is familiar to all Lustre users, so having filefrag print out the
device number of "0000" or "0009" is totally clear to them.

For pNFS or Ceph or other network filesystems (if they implement filefrag)
it could be an index or some other number (e.g. the IP address of the server
or low bits of the UUID or whatever).  Reading back in the archives of the
original FIEMAP discussion, it seems Btrfs would prefer to use DEV_COOKIE
instead of DEV_T, because it uses internal RAID encoding and not plain block
devices, but I'm not familiar with the details there.


Alternately, or in addition to, a DEV_COOKIE flag which indicates that the
same fe_device field is "not a device", it would be possible to add:

    #define FIEMAP_NO_DIRECT      0x40000000

and/or:

    #define FIEMAP_EXTENT_NET     0x80000000   /* Data stored remotely.
                                                * Sets NO_DIRECT flag */

returned by the filesystem to indicate that the extent blocks are not local
to the node, so FIBMAP should return an error (-EOPNOTSUPP or -EREMOTE or
whatever) because the file can't be booted from.  In that case, we could
return FIEMAP_EXTENT_DEVICE to indicate the fe_device field is valid, and
return FIEMAP_EXTENT_NET to indicate the values in fe_device are not local
block devices, just filesystem-specific values to distinguish devices.

However, I'm open to both _DEV_COOKIE and _NET flag if that is preferred,
since I think the two are somewhat complementary.
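
As a rough sketch of how the FIBMAP side could use these (the two flag values
are the ones suggested above, the function names are invented, and -EREMOTE
is just one of the candidate error codes):

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Values as suggested above; neither is in a released kernel header. */
#define FIEMAP_NO_DIRECT  0x40000000u
#define FIEMAP_EXTENT_NET 0x80000000u	/* Data stored remotely. */

/* NET implies NO_DIRECT, in the way ENCODED implies other flags today. */
static uint32_t apply_implied_flags(uint32_t fe_flags)
{
	if (fe_flags & FIEMAP_EXTENT_NET)
		fe_flags |= FIEMAP_NO_DIRECT;
	return fe_flags;
}

/*
 * Hypothetical check for a FIBMAP-via-FIEMAP backend: returns 0 if the
 * extent can be reported as a local block number, -EREMOTE otherwise.
 */
static int fibmap_check_extent(uint32_t fe_flags)
{
	fe_flags = apply_implied_flags(fe_flags);

	/* After the fixup, a remote extent always carries NO_DIRECT. */
	assert(!(fe_flags & FIEMAP_EXTENT_NET) ||
	       (fe_flags & FIEMAP_NO_DIRECT));

	return (fe_flags & FIEMAP_NO_DIRECT) ? -EREMOTE : 0;
}
```

With this shape the FIBMAP path only ever tests NO_DIRECT, so a filesystem
with some other reason to forbid direct block access would not need a
FIBMAP-specific flag of its own.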

>> This actually seems to contradict what you have been complaining about: some
>> filesystems which don't currently support FIBMAP will now suddenly start to
>> support it. Assuming it's ok when the implementation doesn't tell us about the
>> backing device will simply make FIBMAP work. Say BTRFS doesn't report the
>> backing device; assuming it's ok just falls back into your first complaint.
> 
> Sorry, this thread has been going on so long that I forgot your goal for
> this series. :/
> 
> Specifically, I had forgotten that you're removing the ->bmap pointer,
> which means that filesystems don't have any particular way to signal
> "Yes on FIEMAP, no on FIBMAP".  Somehow I had thought that you were
> merely creating a generic_file_bmap() that would call FIEMAP and ripping
> out all the adhoc bmap implementations.

Just a reminder here, you should set FIEMAP_FLAG_SYNC when mapping FIBMAP
to FIEMAP so that the data on that file is flushed to disk before returning,
since the block mapping may not be assigned yet or may be unstable, which
could lead to an unbootable system if used for LILO.
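
For reference, this is roughly what the equivalent request looks like from
userspace today, using the existing FS_IOC_FIEMAP ioctl.  It is only a
sketch: map_one_extent_synced() is a made-up helper name, and the in-kernel
FIBMAP path would set the flag in its own fiemap_extent_info rather than go
through the ioctl.

```c
#include <assert.h>
#include <linux/fiemap.h>
#include <linux/fs.h>
#include <stdlib.h>
#include <sys/ioctl.h>

/*
 * Map the extent covering byte offset `off` of `fd`, asking the kernel to
 * sync the file first so the mapping is allocated and stable.
 * Returns 0 and fills *out on success, -1 on error or if nothing is mapped.
 */
static int map_one_extent_synced(int fd, __u64 off, struct fiemap_extent *out)
{
	size_t sz = sizeof(struct fiemap) + sizeof(struct fiemap_extent);
	struct fiemap *fm;
	int ret = -1;

	assert(out != NULL);

	fm = calloc(1, sz);
	if (!fm)
		return -1;

	fm->fm_start = off;
	fm->fm_length = 1;		/* one byte is enough to hit the extent */
	fm->fm_flags = FIEMAP_FLAG_SYNC;	/* flush before mapping */
	fm->fm_extent_count = 1;	/* room for exactly one extent */

	if (ioctl(fd, FS_IOC_FIEMAP, fm) == 0 && fm->fm_mapped_extents == 1) {
		*out = fm->fm_extents[0];
		ret = 0;
	}

	free(fm);
	return ret;
}
```

Without FIEMAP_FLAG_SYNC a freshly written file may still come back as
delalloc or unknown, which is exactly the case a boot loader cannot handle.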

Cheers, Andreas

>>> Existing filesystems can be left unchanged, in which case neither
>>> EXTENT_DEV flag is set in fe_flags and the bits in fe_device are
>>> meaningless, the same as they are today.  Reporting fe_device is entirely
>>> optional.
>>> 
>>> Userspace programs will now be able to tell which device the file data
>>> lives on, which has been sort-of requested for years, if the filesystem
>>> chooses to start exporting that information.
>>> 
>>> Your FIBMAP-via-FIEMAP backend can do something like:
>>> 
>>> /* FIBMAP only returns results for the same block device backing the fs. */
>>> if ((fe->fe_flags & FIEMAP_EXTENT_DEV_T) &&
>>>     fe->fe_device != new_encode_dev(inode->i_sb->s_dev))
>>> 	return 0;
>>> 
>>> /* Can't tell what is the backing device, bail out. */
>>> if (fe->fe_flags & FIEMAP_EXTENT_DEV_COOKIE)
>>> 	return 0;
>>> 
>> 
>> Ok, the first conditional is fine, but the second one doesn't make sense to me.
>> It looks like you are using it to flag that the filesystem can't tell exactly
>> which device the current extent is on, for example in distributed
>> filesystems, where the physical extent can actually be on a different machine.
>> But I can't say for sure; can you give me more details about what you are trying
>> to achieve here?
> 
> You've understood me correctly. :)
> 
>>> /*
>>> * Either fe_device matches the backing device or the implementation
>>> * doesn't tell us about the backing device, so assume it's ok.
>>> */
>>> <return FIBMAP results>
>>> 
>> 
>> Anyway, I think I need to understand more your usage idea for EXTENT_DEV_COOKIE
>> you mentioned.
> 
> I think you've understood it about as well as I can explain it.  Maybe
> Andreas will have more to say about the lustre replica id, but OTOH it's
> gone and so there's no user of it, so we could just drop it until lustre
> comes back.
> 

>>> So that's how I'd solve a longstanding design problem of FIEMAP and then
>>> take advantage of that solution to remedy my objections to the proposed
>>> "Use FIEMAP for FIBMAP" series.  It doesn't require a FIEMAP_FLAG
>>> behavior flag that userspace knows about but isn't allowed to pass in.
>>> 
>> 
>>>> A FIEMAP_FLAG will also require FS changes, but IMHO, less intrusive
>>>> than the device id in fiemap_extent. I don't see much advantage in
>>>> adding the device id instead of using the flag.
>>>> 
>>>> A problem I see using a new FIEMAP_FLAG, is it 'could' be also passed via
>>>> userspace, so, it would require a check to make sure it didn't come from
>>>> userspace if ioctl_fiemap() was used.
>>>> 
>>>> I think there are 2 other possibilities which can be used to fix this.
>>>> 
>>>> - Use a boolean inside fiemap_extent_info to identify a fibmap call, or,
>>>> - If the device id is a must for you, maybe add the device id into
>>>>  fiemap_extent_info instead of fiemap_extent.
>>> 
>>> That won't work with btrfs, which can store file extents on multiple
>>> different physical devices.
>>> 
>>>>  So we don't mess with a UAPI exported data structure and still
>>>>  provides a way to the filesystems to provide which device the mapped
>>>>  extent is in.
>>>> 
>>>> What you think?
>>>> 
>>>> Cheers
>>>> 
>>>> 
>>>>> 
>>>>> --D
>>>>> 
>>>>>>> 
>>>>>>>> +
>>>>>>>> +	return error;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> /**
>>>>>>>>  *	bmap	- find a block number in a file
>>>>>>>>  *	@inode:  inode owning the block number being requested
>>>>>>>> @@ -1594,10 +1628,14 @@ EXPORT_SYMBOL(iput);
>>>>>>>>  */
>>>>>>>> int bmap(struct inode *inode, sector_t *block)
>>>>>>>> {
>>>>>>>> -	if (!inode->i_mapping->a_ops->bmap)
>>>>>>>> +	if (inode->i_op->fiemap)
>>>>>>>> +		return bmap_fiemap(inode, block);
>>>>>>>> +	else if (inode->i_mapping->a_ops->bmap)
>>>>>>>> +		*block = inode->i_mapping->a_ops->bmap(inode->i_mapping,
>>>>>>>> +						       *block);
>>>>>>>> +	else
>>>>>>>> 		return -EINVAL;
>>>>>>> 
>>>>>>> Waitaminute.  btrfs currently supports fiemap but not bmap, and now
>>>>>>> suddenly it will support this legacy interface they've never supported
>>>>>>> before.  Are they on board with this?
>>>>>>> 
>>>>>>> --D
>>>>>>> 
>>>>>>>> 
>>>>>>>> -	*block = inode->i_mapping->a_ops->bmap(inode->i_mapping, *block);
>>>>>>>> 	return 0;
>>>>>>>> }
>>>>>>>> EXPORT_SYMBOL(bmap);
>>>>>>>> diff --git a/fs/ioctl.c b/fs/ioctl.c
>>>>>>>> index 6086978fe01e..bfa59df332bf 100644
>>>>>>>> --- a/fs/ioctl.c
>>>>>>>> +++ b/fs/ioctl.c
>>>>>>>> @@ -116,6 +116,38 @@ int fiemap_fill_user_extent(struct fiemap_extent_info *fieinfo, u64 logical,
>>>>>>>> 	return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
>>>>>>>> }
>>>>>>>> 
>>>>>>>> +int fiemap_fill_kernel_extent(struct fiemap_extent_info *fieinfo, u64 logical,
>>>>>>>> +			    u64 phys, u64 len, u32 flags)
>>>>>>>> +{
>>>>>>>> +	struct fiemap_extent *extent = fieinfo->fi_extents_start;
>>>>>>>> +
>>>>>>>> +	/* only count the extents */
>>>>>>>> +	if (fieinfo->fi_extents_max == 0) {
>>>>>>>> +		fieinfo->fi_extents_mapped++;
>>>>>>>> +		return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
>>>>>>>> +	}
>>>>>>>> +
>>>>>>>> +	if (fieinfo->fi_extents_mapped >= fieinfo->fi_extents_max)
>>>>>>>> +		return 1;
>>>>>>>> +
>>>>>>>> +	if (flags & SET_UNKNOWN_FLAGS)
>>>>>>>> +		flags |= FIEMAP_EXTENT_UNKNOWN;
>>>>>>>> +	if (flags & SET_NO_UNMOUNTED_IO_FLAGS)
>>>>>>>> +		flags |= FIEMAP_EXTENT_ENCODED;
>>>>>>>> +	if (flags & SET_NOT_ALIGNED_FLAGS)
>>>>>>>> +		flags |= FIEMAP_EXTENT_NOT_ALIGNED;
>>>>>>>> +
>>>>>>>> +	extent->fe_logical = logical;
>>>>>>>> +	extent->fe_physical = phys;
>>>>>>>> +	extent->fe_length = len;
>>>>>>>> +	extent->fe_flags = flags;
>>>>>>>> +
>>>>>>>> +	fieinfo->fi_extents_mapped++;
>>>>>>>> +
>>>>>>>> +	if (fieinfo->fi_extents_mapped == fieinfo->fi_extents_max)
>>>>>>>> +		return 1;
>>>>>>>> +	return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
>>>>>>>> +}
>>>>>>>> /**
>>>>>>>>  * fiemap_fill_next_extent - Fiemap helper function
>>>>>>>>  * @fieinfo:	Fiemap context passed into ->fiemap
>>>>>>>> diff --git a/include/linux/fs.h b/include/linux/fs.h
>>>>>>>> index 7a434979201c..28bb523d532a 100644
>>>>>>>> --- a/include/linux/fs.h
>>>>>>>> +++ b/include/linux/fs.h
>>>>>>>> @@ -1711,6 +1711,8 @@ struct fiemap_extent_info {
>>>>>>>> 	fiemap_fill_cb	fi_cb;
>>>>>>>> };
>>>>>>>> 
>>>>>>>> +int fiemap_fill_kernel_extent(struct fiemap_extent_info *info, u64 logical,
>>>>>>>> +			      u64 phys, u64 len, u32 flags);
>>>>>>>> int fiemap_fill_next_extent(struct fiemap_extent_info *info, u64 logical,
>>>>>>>> 			    u64 phys, u64 len, u32 flags);
>>>>>>>> int fiemap_check_flags(struct fiemap_extent_info *fieinfo, u32 fs_flags);
>>>>>>>> --
>>>>>>>> 2.17.2
>>>>>>>> 
>>>>>> 
>>>>>> --
>>>>>> Carlos
>>>> 
>>>> --
>>>> Carlos
>> 
>> --
>> Carlos


Cheers, Andreas






[-- Attachment #2: Message signed with OpenPGP --]
[-- Type: application/pgp-signature, Size: 873 bytes --]

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 09/10 V2] Use FIEMAP for FIBMAP calls
  2019-02-07  9:52               ` Carlos Maiolino
@ 2019-02-08  8:43                 ` Christoph Hellwig
  2019-02-11 12:57                   ` Christoph Hellwig
  0 siblings, 1 reply; 53+ messages in thread
From: Christoph Hellwig @ 2019-02-08  8:43 UTC (permalink / raw)
  To: Carlos Maiolino
  Cc: Andreas Dilger, Darrick J. Wong, linux-fsdevel,
	Christoph Hellwig, Eric Sandeen, david

On Thu, Feb 07, 2019 at 10:52:33AM +0100, Carlos Maiolino wrote:
> Btw, I am not saying I don't like the idea, I like it. What I was trying to do
> was to avoid touching UAPI in this patchset. But... I'll try to implement your
> idea here, send it to the list and raise my shields.

Agreed.  Please don't change the FIEMAP uapi.  If we need to check
for a request coming from bmap just define an internal FIEMAP flag
as the last available flag in the flags word, and reject it when
it comes from userspace in fiemap.
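
A minimal userspace sketch of that idea, for illustration only. Every
DEMO_* name and value below is a hypothetical stand-in, not the real
kernel definition: the top bit of the 32-bit flags word is reserved
for internal callers, and the ioctl entry point refuses it.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-ins for the user-visible FIEMAP request flags. */
#define DEMO_FIEMAP_FLAG_SYNC     0x00000001u
#define DEMO_FIEMAP_FLAG_XATTR    0x00000002u
#define DEMO_FIEMAP_USER_FLAGS    (DEMO_FIEMAP_FLAG_SYNC | DEMO_FIEMAP_FLAG_XATTR)

/* Hypothetical kernel-internal flag: the last bit of the flags word. */
#define DEMO_FIEMAP_KERNEL_FIBMAP 0x80000000u

/* Validate flags that arrived through the ioctl path. */
static int demo_check_user_flags(uint32_t fi_flags)
{
	/* The internal bit must never be accepted from userspace. */
	if (fi_flags & DEMO_FIEMAP_KERNEL_FIBMAP)
		return -EINVAL;
	/* Any other unknown bit is rejected as well. */
	if (fi_flags & ~DEMO_FIEMAP_USER_FLAGS)
		return -EINVAL;
	return 0;
}

/* The in-kernel bmap path sets the internal bit itself. */
static uint32_t demo_bmap_flags(void)
{
	return DEMO_FIEMAP_KERNEL_FIBMAP;
}

/* A filesystem can test the bit to learn the request came from bmap. */
static int demo_is_fibmap_request(uint32_t fi_flags)
{
	return (fi_flags & DEMO_FIEMAP_KERNEL_FIBMAP) != 0;
}
```

The point of the split is that only the bmap path ever produces the
internal bit, so a filesystem seeing it can trust where it came from.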

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 09/10 V2] Use FIEMAP for FIBMAP calls
  2019-02-07 21:25                 ` Andreas Dilger
@ 2019-02-08  8:46                   ` Christoph Hellwig
  2019-02-08 10:36                     ` Carlos Maiolino
  2019-02-08  9:08                   ` Carlos Maiolino
  1 sibling, 1 reply; 53+ messages in thread
From: Christoph Hellwig @ 2019-02-08  8:46 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Darrick J. Wong, Carlos Maiolino, linux-fsdevel,
	Christoph Hellwig, Eric Sandeen, david

On Thu, Feb 07, 2019 at 02:25:01PM -0700, Andreas Dilger wrote:
> Do we really need to be this way, about reserving a single flag for Lustre,
> which will likely also be useful for other filesystems?  It's not like
> Lustre is some closed-source binary module for which we need to make life
> difficult, it is used by many thousands of the largest computers at labs
> and universities and companies around the world.  We are working to clean
> up the code outside the staging tree and resubmit it.  Not reserving a flag
> just means we will continue to use random values in Lustre before it can
> be merged, which will make life harder when we try to merge again.

No, it is available in source, but otherwise just as bad.  And we generally
only define APIs for in-kernel usage.

If we can come up with a good API for in-kernel filesystems we can do
that, otherwise hell no.  And staging for that matter qualifies as out
of tree.

That being said I'm really worried about these FIEMAP extensions as
userspace has no business poking into details of the placement (vs
just the layout).

But all that belongs in a separate discussion instead of dragging down
this series where it does not belong at all.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 09/10 V2] Use FIEMAP for FIBMAP calls
  2019-02-07 18:16               ` Darrick J. Wong
@ 2019-02-08  8:58                 ` Carlos Maiolino
  0 siblings, 0 replies; 53+ messages in thread
From: Carlos Maiolino @ 2019-02-08  8:58 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-fsdevel, hch, adilger, sandeen, david

On Thu, Feb 07, 2019 at 10:16:55AM -0800, Darrick J. Wong wrote:
> On Thu, Feb 07, 2019 at 01:36:41PM +0100, Carlos Maiolino wrote:
> > Apologies, I forgot to mention another thing..
> > 
> > On Wed, Feb 06, 2019 at 12:44:31PM -0800, Darrick J. Wong wrote:
> > > On Wed, Feb 06, 2019 at 02:37:53PM +0100, Carlos Maiolino wrote:
> > > > > > > In any case, I think a better solution to the multi-device problem is to
> > > > > > > start returning device information via struct fiemap_extent, at least
> > > > > > > inside the kernel.  Use one of the reserved fields to declare a new
> > > > > > > '__u32 fe_device' field in struct fiemap_extent which can be the dev_t
> > > > > > > device number, and then you can check that against inode->i_sb->s_bdev
> > > > > > > to avoid returning results for the non-primary device of a multi-device
> > > > > > > filesystem.
> > > > > > 
> > > > > > I agree we should address it here, but I don't think fiemap_extent is the right
> > > > > > place for it, it is linked to the UAPI, and changing it is usually not a good
> > > > > > idea.
> > > > > 
> > > > > Adding a FIEMAP_EXTENT flag or two to turn one of the fe_reserved fields
> > > > > into some sort of dev_t/per-device cookie should be fine.  Userspace
> > > > > shouldn't be expecting any meaning in reserved areas.
> > > > > 
> > > > > > I think I got your idea anyway, but, what if, instead returning the bdev in
> > > > > > fiemap_extent, we instead, send a flag (via fi_flags) to the filesystem, to
> > > > > > > identify a FIBMAP or a FIEMAP call, and let the filesystem decide what to do
> > > > > > with such information?
> > > > > 
> > > > > I don't like the idea of adding a FIEMAP_FLAG to distinguish callers.
> > > > 
> > > > Ok, may I ask why not?
> > > 
> > > I think it's a bad idea to add a flag to FIEMAP to change its behavior
> > > to suit an older and even crappier legacy interface (i.e. FIBMAP).
> > > 
> > > FIBMAP is architecturally broken in that we can't /ever/ provide the
> > > context of "which device does this map to?"
> > > 
> > > FIEMAP is architecturally deficient as well, but its ioctl structure
> > > definition is flexible enough that we can report "which device does this
> > > map to".
> > > 
> > > I want to enhance FIEMAP to deal with multi-device filesystems
> > > correctly, and as much as I want to kill FIBMAP, I can't because of zipl
> > > and *lilo.
> > > 
> > > > My apologies if I am wrong, but, per my understanding, there is
> > > > nothing today, which tells userspace which device belongs the extent
> > > > map reported by FIEMAP.
> > > 
> > > Right...
> > > 
> > > > If it belongs to the RT device in XFS, or whatever disk in a raid in
> > > > BTRFS, we simply do not provide such information.
> > > 
> > > Right...
> > > 
> > > > So, the goal is to provide a way to tell the filesystem if a FIEMAP or
> > > > a FIBMAP has been requested, so the current behavior of both ioctls
> > > > won't change.
> > > 
> > > ...but from my point of view, the FIEMAP behavior *ought* to change to
> > > be more expressive.  Once that's done, we can use the more expressive
> > > FIEMAP output to solve the problem of FIBMAP vs. multi-disk filesystems.
> > > 
> > > The whole point of having fe_reserved* fields in struct fiemap_extent is
> > > so that we can add a new FIEMAP_EXTENT_ flag so that the filesystem can
> > > start returning data in a reserved field.  New userspace programs that
> > > know about the flag can start reading information from the new field if
> > > they see the flag, and old userspace programs don't know about the flag
> > > and won't be any worse off.
> > > 
> > > > Enabling filesystems to return device information into fiemap_extent
> > > > requires modification of all filesystems to provide such information,
> > > > which will not have any use other than matching the mounted device to
> > > > the device where the extent is.
> > > 
> > > Perhaps it would help for me to present a more concrete proposal:
> > > 
> > > --- a/include/uapi/linux/fiemap.h	2019-01-18 10:53:44.000000000 -0800
> > > +++ b/include/uapi/linux/fiemap.h	2019-02-06 12:25:52.813935941 -0800
> > > @@ -22,7 +22,19 @@ struct fiemap_extent {
> > >  	__u64 fe_length;   /* length in bytes for this extent */
> > >  	__u64 fe_reserved64[2];
> > >  	__u32 fe_flags;    /* FIEMAP_EXTENT_* flags for this extent */
> > > -	__u32 fe_reserved[3];
> > > +
> > > +	/*
> > > +	 * Underlying device that this extent is stored on.
> > > +	 *
> > > +	 * If FIEMAP_EXTENT_DEV_T is set, this field is a dev_t containing the
> > > +	 * major and minor numbers of a device.  If FIEMAP_EXTENT_DEV_COOKIE is
> > > +	 * set, this field is a 32-bit cookie that can be used to distinguish
> > > +	 * between backing devices but has no intrinsic meaning.  If neither
> > > +	 * EXTENT_DEV flag is set, this field is meaningless.  Only one of the
> > > +	 * EXTENT_DEV flags may be set at any time.
> > > +	 */
> > > +	__u32 fe_device;
> > > +	__u32 fe_reserved[2];
> > >  };
> > >  
> > >  struct fiemap {
> > > @@ -66,5 +78,14 @@ struct fiemap {
> > >  						    * merged for efficiency. */
> > >  #define FIEMAP_EXTENT_SHARED		0x00002000 /* Space shared with other
> > >  						    * files. */
> > > +#define FIEMAP_EXTENT_DEV_T		0x00004000 /* fe_device is a dev_t
> > > +						    * structure containing the
> > > +						    * major and minor numbers
> > > +						    * of a block device. */
> > > +#define FIEMAP_EXTENT_DEV_COOKIE	0x00008000 /* fe_device is a 32-bit
> > > +						    * cookie that can be used
> > > +						    * to distinguish physical
> > > +						    * devices but otherwise
> > > +						    * has no meaning. */
> > >  
> > >  #endif /* _LINUX_FIEMAP_H */
> > > 
> > > Under this scheme, XFS can set FIEMAP_EXTENT_DEV_T in fe_flags and start
> > > encoding fe_device = new_encode_dev(xfs_get_device_for_file()).
> > 
> > Here, I believe you are forgetting that filesystems do not touch fiemap_extent
> > directly. We call the fiemap_fill_next_extent() helper to fill each extent found by
> > fiemap. So, in either way, we'd need to modify fiemap_fill_next_extent() and the
> > callbacks being used to accommodate this new field or create a new helper to
> > modify the device which doesn't sound reasonable. So, either way, we will end up
> > needing to modify all filesystems.
> 
> Yep.  Drat.  I guess you could add a bdev parameter to
> fiemap_fill_next_extent, and we'd use that to encode fe_device.  If the
> fs passes NULL then we just get it from the superblock or something.
> 
> > So, although I really like the idea of improving the FIEMAP interface, I'm
> > starting to consider another patchset for it. I think it requires an interface
> > change big enough to fit in this patchset, which actually has a different
> > purpose. Or, maybe, address this at the end of this patchset, leaving different
> > interface changes in different patchsets, instead of making many changes all at
> > once, mixed together.
> 
> <nod> I think you're right, fiemap upgrades as one series and then
> fibmap-via-fiemap as the second one.
> 

Ok, fair enough, looks like we have an agreement :P I'll work in this direction
now: set this patchset aside until FIEMAP can return the device id, and then
rebase it on top of that.
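
For reference, a self-contained sketch of the device check such a
rebased bmap path might perform once extents carry a device id. It
follows the pseudocode quoted further down in this message. The DEMO_*
names, the struct, and the sb_dev parameter are assumptions for
illustration; the flag values are simply the ones from the proposal.

```c
#include <stdint.h>

/* Hypothetical flags mirroring the proposal quoted in this thread. */
#define DEMO_EXTENT_DEV_T      0x00004000u /* fe_device holds a dev_t */
#define DEMO_EXTENT_DEV_COOKIE 0x00008000u /* fe_device is an opaque cookie */

/* Cut-down, hypothetical stand-in for struct fiemap_extent. */
struct demo_extent {
	uint64_t fe_physical;
	uint32_t fe_flags;
	uint32_t fe_device;
};

/*
 * Return the physical offset FIBMAP could report, or 0 for "no mapping".
 * sb_dev stands in for the device number backing the superblock.
 */
static uint64_t demo_fibmap_result(const struct demo_extent *fe, uint32_t sb_dev)
{
	/* FIBMAP only returns results for the device backing the fs. */
	if ((fe->fe_flags & DEMO_EXTENT_DEV_T) && fe->fe_device != sb_dev)
		return 0;

	/* An opaque cookie cannot be compared with a device, bail out. */
	if (fe->fe_flags & DEMO_EXTENT_DEV_COOKIE)
		return 0;

	/* Device matches, or the filesystem reported no device at all. */
	return fe->fe_physical;
}
```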

Thanks for the review

> --D
> 
> > > 
> > > Some clustered filesystem or whatever could set FIEMAP_EXTENT_DEV_COOKIE
> > > and encode the replica number in fe_device.
> > > 
> > > Existing filesystems can be left unchanged, in which case neither
> > > EXTENT_DEV flag is set in fe_flags and the bits in fe_device are
> > > meaningless, the same as they are today.  Reporting fe_device is entirely
> > > optional.
> > > 
> > > Userspace programs will now be able to tell which device the file data
> > > lives on, which has been sort-of requested for years, if the filesystem
> > > chooses to start exporting that information.
> > > 
> > > Your FIBMAP-via-FIEMAP backend can do something like:
> > > 
> > > /* FIBMAP only returns results for the same block device backing the fs. */
> > > if ((fe->fe_flags & EXTENT_DEV_T) && fe->fe_device != inode->i_sb->sb_device)
> > > 	return 0;
> > > 
> > > /* Can't tell what is the backing device, bail out. */
> > > if (fe->fe_flags & EXTENT_DEV_COOKIE)
> > > 	return 0;
> > > 
> > > /*
> > >  * Either fe_device matches the backing device or the implementation
> > >  * doesn't tell us about the backing device, so assume it's ok.
> > >  */
> > > <return FIBMAP results>
> > > 
> > > So that's how I'd solve a longstanding design problem of FIEMAP and then
> > > take advantage of that solution to remedy my objections to the proposed
> > > "Use FIEMAP for FIBMAP" series.  It doesn't require a FIEMAP_FLAG
> > > behavior flag that userspace knows about but isn't allowed to pass in.
> > > 
> > > > A FIEMAP_FLAG will also require FS changes, but IMHO, less intrusive
> > > > than the device id in fiemap_extent. I don't see much advantage in
> > > > adding the device id instead of using the flag.
> > > > 
> > > > A problem I see using a new FIEMAP_FLAG, is it 'could' be also passed via
> > > > userspace, so, it would require a check to make sure it didn't come from
> > > > userspace if ioctl_fiemap() was used.
> > > > 
> > > > I think there are 2 other possibilities which can be used to fix this.
> > > > 
> > > > - Use a boolean inside fiemap_extent_info to identify a fibmap call, or,
> > > > - If the device id is a must for you, maybe add the device id into
> > > >   fiemap_extent_info instead of fiemap_extent.
> > > 
> > > That won't work with btrfs, which can store file extents on multiple
> > > different physical devices.
> > > 
> > > >   So we don't mess with a UAPI exported data structure and still
> > > >   provides a way to the filesystems to provide which device the mapped
> > > >   extent is in.
> > > > 
> > > > What you think?
> > > > 
> > > > Cheers
> > > > 
> > > > 
> > > > > 
> > > > > --D
> > > > > 
> > > > > > > 
> > > > > > > > +
> > > > > > > > +	return error;
> > > > > > > > +}
> > > > > > > > +
> > > > > > > >  /**
> > > > > > > >   *	bmap	- find a block number in a file
> > > > > > > >   *	@inode:  inode owning the block number being requested
> > > > > > > > @@ -1594,10 +1628,14 @@ EXPORT_SYMBOL(iput);
> > > > > > > >   */
> > > > > > > >  int bmap(struct inode *inode, sector_t *block)
> > > > > > > >  {
> > > > > > > > -	if (!inode->i_mapping->a_ops->bmap)
> > > > > > > > +	if (inode->i_op->fiemap)
> > > > > > > > +		return bmap_fiemap(inode, block);
> > > > > > > > +	else if (inode->i_mapping->a_ops->bmap)
> > > > > > > > +		*block = inode->i_mapping->a_ops->bmap(inode->i_mapping,
> > > > > > > > +						       *block);
> > > > > > > > +	else
> > > > > > > >  		return -EINVAL;
> > > > > > > 
> > > > > > > Waitaminute.  btrfs currently supports fiemap but not bmap, and now
> > > > > > > suddenly it will support this legacy interface they've never supported
> > > > > > > before.  Are they on board with this?
> > > > > > > 
> > > > > > > --D
> > > > > > > 
> > > > > > > >  
> > > > > > > > -	*block = inode->i_mapping->a_ops->bmap(inode->i_mapping, *block);
> > > > > > > >  	return 0;
> > > > > > > >  }
> > > > > > > >  EXPORT_SYMBOL(bmap);
> > > > > > > > diff --git a/fs/ioctl.c b/fs/ioctl.c
> > > > > > > > index 6086978fe01e..bfa59df332bf 100644
> > > > > > > > --- a/fs/ioctl.c
> > > > > > > > +++ b/fs/ioctl.c
> > > > > > > > @@ -116,6 +116,38 @@ int fiemap_fill_user_extent(struct fiemap_extent_info *fieinfo, u64 logical,
> > > > > > > >  	return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
> > > > > > > >  }
> > > > > > > >  
> > > > > > > > +int fiemap_fill_kernel_extent(struct fiemap_extent_info *fieinfo, u64 logical,
> > > > > > > > +			    u64 phys, u64 len, u32 flags)
> > > > > > > > +{
> > > > > > > > +	struct fiemap_extent *extent = fieinfo->fi_extents_start;
> > > > > > > > +
> > > > > > > > +	/* only count the extents */
> > > > > > > > +	if (fieinfo->fi_extents_max == 0) {
> > > > > > > > +		fieinfo->fi_extents_mapped++;
> > > > > > > > +		return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
> > > > > > > > +	}
> > > > > > > > +
> > > > > > > > +	if (fieinfo->fi_extents_mapped >= fieinfo->fi_extents_max)
> > > > > > > > +		return 1;
> > > > > > > > +
> > > > > > > > +	if (flags & SET_UNKNOWN_FLAGS)
> > > > > > > > +		flags |= FIEMAP_EXTENT_UNKNOWN;
> > > > > > > > +	if (flags & SET_NO_UNMOUNTED_IO_FLAGS)
> > > > > > > > +		flags |= FIEMAP_EXTENT_ENCODED;
> > > > > > > > +	if (flags & SET_NOT_ALIGNED_FLAGS)
> > > > > > > > +		flags |= FIEMAP_EXTENT_NOT_ALIGNED;
> > > > > > > > +
> > > > > > > > +	extent->fe_logical = logical;
> > > > > > > > +	extent->fe_physical = phys;
> > > > > > > > +	extent->fe_length = len;
> > > > > > > > +	extent->fe_flags = flags;
> > > > > > > > +
> > > > > > > > +	fieinfo->fi_extents_mapped++;
> > > > > > > > +
> > > > > > > > +	if (fieinfo->fi_extents_mapped == fieinfo->fi_extents_max)
> > > > > > > > +		return 1;
> > > > > > > > +	return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
> > > > > > > > +}
> > > > > > > >  /**
> > > > > > > >   * fiemap_fill_next_extent - Fiemap helper function
> > > > > > > >   * @fieinfo:	Fiemap context passed into ->fiemap
> > > > > > > > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > > > > > > > index 7a434979201c..28bb523d532a 100644
> > > > > > > > --- a/include/linux/fs.h
> > > > > > > > +++ b/include/linux/fs.h
> > > > > > > > @@ -1711,6 +1711,8 @@ struct fiemap_extent_info {
> > > > > > > >  	fiemap_fill_cb	fi_cb;
> > > > > > > >  };
> > > > > > > >  
> > > > > > > > +int fiemap_fill_kernel_extent(struct fiemap_extent_info *info, u64 logical,
> > > > > > > > +			      u64 phys, u64 len, u32 flags);
> > > > > > > >  int fiemap_fill_next_extent(struct fiemap_extent_info *info, u64 logical,
> > > > > > > >  			    u64 phys, u64 len, u32 flags);
> > > > > > > >  int fiemap_check_flags(struct fiemap_extent_info *fieinfo, u32 fs_flags);
> > > > > > > > -- 
> > > > > > > > 2.17.2
> > > > > > > > 
> > > > > > 
> > > > > > -- 
> > > > > > Carlos
> > > > 
> > > > -- 
> > > > Carlos
> > 
> > -- 
> > Carlos

-- 
Carlos

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 09/10 V2] Use FIEMAP for FIBMAP calls
  2019-02-07 17:02               ` Darrick J. Wong
  2019-02-07 21:25                 ` Andreas Dilger
@ 2019-02-08  9:03                 ` Carlos Maiolino
  1 sibling, 0 replies; 53+ messages in thread
From: Carlos Maiolino @ 2019-02-08  9:03 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-fsdevel, hch, adilger, sandeen, david

On Thu, Feb 07, 2019 at 09:02:10AM -0800, Darrick J. Wong wrote:
> On Thu, Feb 07, 2019 at 12:59:54PM +0100, Carlos Maiolino wrote:
> > On Wed, Feb 06, 2019 at 12:44:31PM -0800, Darrick J. Wong wrote:
> > > On Wed, Feb 06, 2019 at 02:37:53PM +0100, Carlos Maiolino wrote:
> > > > > > > In any case, I think a better solution to the multi-device problem is to
> > > > > > > start returning device information via struct fiemap_extent, at least
> > > > > > > inside the kernel.  Use one of the reserved fields to declare a new
> > > > > > > '__u32 fe_device' field in struct fiemap_extent which can be the dev_t
> > > > > > > device number, and then you can check that against inode->i_sb->s_bdev
> > > > > > > to avoid returning results for the non-primary device of a multi-device
> > > > > > > filesystem.
> > > > > > 
> > > > > > I agree we should address it here, but I don't think fiemap_extent is the right
> > > > > > place for it, it is linked to the UAPI, and changing it is usually not a good
> > > > > > idea.
> > > > > 
> > > > > Adding a FIEMAP_EXTENT flag or two to turn one of the fe_reserved fields
> > > > > into some sort of dev_t/per-device cookie should be fine.  Userspace
> > > > > shouldn't be expecting any meaning in reserved areas.
> > > > > 
> > > > > > I think I got your idea anyway, but, what if, instead returning the bdev in
> > > > > > fiemap_extent, we instead, send a flag (via fi_flags) to the filesystem, to
> > > > > > > identify a FIBMAP or a FIEMAP call, and let the filesystem decide what to do
> > > > > > with such information?
> > > > > 
> > > > > I don't like the idea of adding a FIEMAP_FLAG to distinguish callers.
> > > > 
> > > > Ok, may I ask why not?
> > > 
> > > I think it's a bad idea to add a flag to FIEMAP to change its behavior
> > > to suit an older and even crappier legacy interface (i.e. FIBMAP).
> > > 
> > > FIBMAP is architecturally broken in that we can't /ever/ provide the
> > > context of "which device does this map to?"
> > > 
> > > FIEMAP is architecturally deficient as well, but its ioctl structure
> > > definition is flexible enough that we can report "which device does this
> > > map to".
> > > 
> > > I want to enhance FIEMAP to deal with multi-device filesystems
> > > correctly, and as much as I want to kill FIBMAP, I can't because of zipl
> > > and *lilo.
> > > 
> > > > My apologies if I am wrong, but, per my understanding, there is
> > > > nothing today, which tells userspace which device belongs the extent
> > > > map reported by FIEMAP.
> > > 
> > > Right...
> > > 
> > > > If it belongs to the RT device in XFS, or whatever disk in a raid in
> > > > BTRFS, we simply do not provide such information.
> > > 
> > > Right...
> > > 
> > > > So, the goal is to provide a way to tell the filesystem if a FIEMAP or
> > > > a FIBMAP has been requested, so the current behavior of both ioctls
> > > > won't change.
> > > 
> > > ...but from my point of view, the FIEMAP behavior *ought* to change to
> > > be more expressive.  Once that's done, we can use the more expressive
> > > FIEMAP output to solve the problem of FIBMAP vs. multi-disk filesystems.
> > > 
> > > The whole point of having fe_reserved* fields in struct fiemap_extent is
> > > so that we can add a new FIEMAP_EXTENT_ flag so that the filesystem can
> > > start returning data in a reserved field.  New userspace programs that
> > > know about the flag can start reading information from the new field if
> > > they see the flag, and old userspace programs don't know about the flag
> > > and won't be any worse off.
> > > 
> > > > Enabling filesystems to return device information into fiemap_extent
> > > > requires modification of all filesystems to provide such information,
> > > > which will not have any use other than matching the mounted device to
> > > > the device where the extent is.
> > > 
> > > Perhaps it would help for me to present a more concrete proposal:
> > > 
> > > --- a/include/uapi/linux/fiemap.h	2019-01-18 10:53:44.000000000 -0800
> > > +++ b/include/uapi/linux/fiemap.h	2019-02-06 12:25:52.813935941 -0800
> > > @@ -22,7 +22,19 @@ struct fiemap_extent {
> > >  	__u64 fe_length;   /* length in bytes for this extent */
> > >  	__u64 fe_reserved64[2];
> > >  	__u32 fe_flags;    /* FIEMAP_EXTENT_* flags for this extent */
> > > -	__u32 fe_reserved[3];
> > > +
> > > +	/*
> > > +	 * Underlying device that this extent is stored on.
> > > +	 *
> > > +	 * If FIEMAP_EXTENT_DEV_T is set, this field is a dev_t containing the
> > > +	 * major and minor numbers of a device.  If FIEMAP_EXTENT_DEV_COOKIE is
> > > +	 * set, this field is a 32-bit cookie that can be used to distinguish
> > > +	 * between backing devices but has no intrinsic meaning.  If neither
> > > +	 * EXTENT_DEV flag is set, this field is meaningless.  Only one of the
> > > +	 * EXTENT_DEV flags may be set at any time.
> > > +	 */
> > > +	__u32 fe_device;
> > > +	__u32 fe_reserved[2];
> > >  };
> > >  
> > >  struct fiemap {
> > > @@ -66,5 +78,14 @@ struct fiemap {
> > >  						    * merged for efficiency. */
> > >  #define FIEMAP_EXTENT_SHARED		0x00002000 /* Space shared with other
> > >  						    * files. */
> > > +#define FIEMAP_EXTENT_DEV_T		0x00004000 /* fe_device is a dev_t
> > > +						    * structure containing the
> > > +						    * major and minor numbers
> > > +						    * of a block device. */
> > > +#define FIEMAP_EXTENT_DEV_COOKIE	0x00008000 /* fe_device is a 32-bit
> > > +						    * cookie that can be used
> > > +						    * to distinguish physical
> > > +						    * devices but otherwise
> > > +						    * has no meaning. */
> > >  
> > >  #endif /* _LINUX_FIEMAP_H */
> > > 
> > > Under this scheme, XFS can set FIEMAP_EXTENT_DEV_T in fe_flags and start
> > > encoding fe_device = new_encode_dev(xfs_get_device_for_file()).
> > > 
> > > Some clustered filesystem or whatever could set FIEMAP_EXTENT_DEV_COOKIE
> > > and encode the replica number in fe_device.
> > > 
> > 
> > All of this makes sense, but I'm struggling to understand what you mean by
> > replica number here, and why it justifies a second flag.
> 
> I left in the "device cookie" thing in the proposal to accommodate a
> request from the Lustre folks to be able to report which replica is
> storing a particular extent map.  Apparently the replica id is simply a
> 32-bit number that isn't inherently useful, hence the vagueness around
> what "cookie" really means...
> 
> ...oh, right, lustre fell out of drivers/staging/.  You could probably
> leave it out then.
> 
> > > Existing filesystems can be left unchanged, in which case neither
> > > EXTENT_DEV flag is set in fe_flags and the bits in fe_device are
> > > meaningless, the same as they are today.  Reporting fe_device is entirely
> > > optional.
> > > 
> > > Userspace programs will now be able to tell which device the file data
> > > lives on, which has been sort-of requested for years, if the filesystem
> > > chooses to start exporting that information.
> > > 
> > > Your FIBMAP-via-FIEMAP backend can do something like:
> > > 
> > > /* FIBMAP only returns results for the same block device backing the fs. */
> > > if ((fe->fe_flags & EXTENT_DEV_T) && fe->fe_device != inode->i_sb->sb_device)
> > > 	return 0;
> > > 
> > > /* Can't tell what is the backing device, bail out. */
> > > if (fe->fe_flags & EXTENT_DEV_COOKIE)
> > > 	return 0;
> > > 
> > 
> > Ok, the first conditional, is ok, the second one is not making sense to me.
> > Looks like you are basically using it to flag the filesystem can't tell
> > exactly which device the current extent is, let's say for example, distributed
> > filesystems, where the physical extent can actually be on a different machine.
> > But I can't say for sure, can you give me more details about what you are trying
> > to achieve here?
> 
> You've understood me correctly. :)
> 
> > > /*
> > >  * Either fe_device matches the backing device or the implementation
> > >  * doesn't tell us about the backing device, so assume it's ok.
> > >  */
> > > <return FIBMAP results>
> > >
> > 
> > This actually seems to contradict your earlier complaint that some
> > filesystems which don't support FIBMAP today would suddenly start to
> > support it. Assuming it's ok when the implementation doesn't tell us about
> > the backing device will simply make FIBMAP work. Say BTRFS doesn't report
> > the backing device: assuming it's ok runs straight into your first complaint.
> 
> Sorry, this thread has been going on so long that I forgot your goal for
> this series. :/
> 
> Specifically, I had forgotten that you're removing the ->bmap pointer,
> which means that filesystems don't have any particular way to signal
> "Yes on FIEMAP, no on FIBMAP".  Somehow I had thought that you were
> merely creating a generic_file_bmap() that would call FIEMAP and ripping
> out all the adhoc bmap implementations.
> 
> Hmm, how many filesystems support FIEMAP and not FIBMAP?
> 
> btrfs, nilfs2, and overlayfs.  Also bad_inode.c...?
> 
> Hmm, how many filesystems support FIBMAP and not FIEMAP?
> 
> adfs, affs, befs, bfs, efs, exofs, fat, freevxfs, fuse(?), nfs, hfsplus,
> isofs, jfs, minixfs, ntfs, qnx[46], reiserfs, sysv, udf, and ufs.

Eh, that's why we should keep:

if (inode->i_op->fiemap)
	return bmap_fiemap(inode, block);
else if (..a_ops->bmap)
	->a_ops->bmap(...)
else
	return -EINVAL;
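
Spelled out as a self-contained toy, assuming made-up types: the
demo_inode struct and both backends below are hypothetical and only
mimic the shape of the real hooks. The fallback order would look
roughly like this:

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t demo_sector_t;

/* Hypothetical inode carrying the two optional mapping hooks. */
struct demo_inode {
	/* non-NULL if the filesystem can answer bmap through fiemap */
	int (*fiemap_bmap)(struct demo_inode *inode, demo_sector_t *block);
	/* non-NULL if the filesystem still has a legacy bmap method */
	demo_sector_t (*legacy_bmap)(struct demo_inode *inode,
				     demo_sector_t block);
};

/* Prefer fiemap, fall back to legacy bmap, otherwise fail. */
static int demo_bmap(struct demo_inode *inode, demo_sector_t *block)
{
	if (inode->fiemap_bmap)
		return inode->fiemap_bmap(inode, block);
	if (inode->legacy_bmap) {
		*block = inode->legacy_bmap(inode, *block);
		return 0;
	}
	return -EINVAL;
}

/* Toy backends so the dispatch can be exercised. */
static int demo_fiemap_backend(struct demo_inode *inode, demo_sector_t *block)
{
	(void)inode;
	*block += 1000;
	return 0;
}

static demo_sector_t demo_legacy_backend(struct demo_inode *inode,
					 demo_sector_t block)
{
	(void)inode;
	return block + 2000;
}
```

With this ordering the filesystems that only have ->bmap keep working
unchanged, and *block is left untouched when neither hook exists.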


> 
> > Anyway, I think I need to understand more your usage idea for EXTENT_DEV_COOKIE
> > you mentioned.
> 
> I think you've understood it about as well as I can explain it.  Maybe
> Andreas will have more to say about the lustre replica id, but OTOH it's
> gone and so there's no user of it, so we could just drop it until lustre
> comes back.
>

Ok, thanks for confirming.

> > > So that's how I'd solve a longstanding design problem of FIEMAP and then
> > > take advantage of that solution to remedy my objections to the proposed
> > > "Use FIEMAP for FIBMAP" series.  It doesn't require a FIEMAP_FLAG
> > > behavior flag that userspace knows about but isn't allowed to pass in.
> > >
> > 
> > > > A FIEMAP_FLAG will also require FS changes, but IMHO, less intrusive
> > > > than the device id in fiemap_extent. I don't see much advantage in
> > > > adding the device id instead of using the flag.
> > > > 
> > > > A problem I see using a new FIEMAP_FLAG, is it 'could' be also passed via
> > > > userspace, so, it would require a check to make sure it didn't come from
> > > > userspace if ioctl_fiemap() was used.
> > > > 
> > > > I think there are 2 other possibilities which can be used to fix this.
> > > > 
> > > > - Use a boolean inside fiemap_extent_info to identify a fibmap call, or,
> > > > - If the device id is a must for you, maybe add the device id into
> > > >   fiemap_extent_info instead of fiemap_extent.
> > > 
> > > That won't work with btrfs, which can store file extents on multiple
> > > different physical devices.
> > > 
> > > >   So we don't mess with a UAPI exported data structure and still
> > > >   provides a way to the filesystems to provide which device the mapped
> > > >   extent is in.
> > > > 
> > > > What you think?
> > > > 
> > > > Cheers
> > > > 
> > > > 
> > > > > 
> > > > > --D
> > > > > 
> > > > > > > 
> > > > > > > > +
> > > > > > > > +	return error;
> > > > > > > > +}
> > > > > > > > +
> > > > > > > >  /**
> > > > > > > >   *	bmap	- find a block number in a file
> > > > > > > >   *	@inode:  inode owning the block number being requested
> > > > > > > > @@ -1594,10 +1628,14 @@ EXPORT_SYMBOL(iput);
> > > > > > > >   */
> > > > > > > >  int bmap(struct inode *inode, sector_t *block)
> > > > > > > >  {
> > > > > > > > -	if (!inode->i_mapping->a_ops->bmap)
> > > > > > > > +	if (inode->i_op->fiemap)
> > > > > > > > +		return bmap_fiemap(inode, block);
> > > > > > > > +	else if (inode->i_mapping->a_ops->bmap)
> > > > > > > > +		*block = inode->i_mapping->a_ops->bmap(inode->i_mapping,
> > > > > > > > +						       *block);
> > > > > > > > +	else
> > > > > > > >  		return -EINVAL;
> > > > > > > 
> > > > > > > Waitaminute.  btrfs currently supports fiemap but not bmap, and now
> > > > > > > suddenly it will support this legacy interface they've never supported
> > > > > > > before.  Are they on board with this?
> > > > > > > 
> > > > > > > --D
> > > > > > > 
> > > > > > > >  
> > > > > > > > -	*block = inode->i_mapping->a_ops->bmap(inode->i_mapping, *block);
> > > > > > > >  	return 0;
> > > > > > > >  }
> > > > > > > >  EXPORT_SYMBOL(bmap);
> > > > > > > > diff --git a/fs/ioctl.c b/fs/ioctl.c
> > > > > > > > index 6086978fe01e..bfa59df332bf 100644
> > > > > > > > --- a/fs/ioctl.c
> > > > > > > > +++ b/fs/ioctl.c
> > > > > > > > @@ -116,6 +116,38 @@ int fiemap_fill_user_extent(struct fiemap_extent_info *fieinfo, u64 logical,
> > > > > > > >  	return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
> > > > > > > >  }
> > > > > > > >  
> > > > > > > > +int fiemap_fill_kernel_extent(struct fiemap_extent_info *fieinfo, u64 logical,
> > > > > > > > +			    u64 phys, u64 len, u32 flags)
> > > > > > > > +{
> > > > > > > > +	struct fiemap_extent *extent = fieinfo->fi_extents_start;
> > > > > > > > +
> > > > > > > > +	/* only count the extents */
> > > > > > > > +	if (fieinfo->fi_extents_max == 0) {
> > > > > > > > +		fieinfo->fi_extents_mapped++;
> > > > > > > > +		return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
> > > > > > > > +	}
> > > > > > > > +
> > > > > > > > +	if (fieinfo->fi_extents_mapped >= fieinfo->fi_extents_max)
> > > > > > > > +		return 1;
> > > > > > > > +
> > > > > > > > +	if (flags & SET_UNKNOWN_FLAGS)
> > > > > > > > +		flags |= FIEMAP_EXTENT_UNKNOWN;
> > > > > > > > +	if (flags & SET_NO_UNMOUNTED_IO_FLAGS)
> > > > > > > > +		flags |= FIEMAP_EXTENT_ENCODED;
> > > > > > > > +	if (flags & SET_NOT_ALIGNED_FLAGS)
> > > > > > > > +		flags |= FIEMAP_EXTENT_NOT_ALIGNED;
> > > > > > > > +
> > > > > > > > +	extent->fe_logical = logical;
> > > > > > > > +	extent->fe_physical = phys;
> > > > > > > > +	extent->fe_length = len;
> > > > > > > > +	extent->fe_flags = flags;
> > > > > > > > +
> > > > > > > > +	fieinfo->fi_extents_mapped++;
> > > > > > > > +
> > > > > > > > +	if (fieinfo->fi_extents_mapped == fieinfo->fi_extents_max)
> > > > > > > > +		return 1;
> > > > > > > > +	return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
> > > > > > > > +}
> > > > > > > >  /**
> > > > > > > >   * fiemap_fill_next_extent - Fiemap helper function
> > > > > > > >   * @fieinfo:	Fiemap context passed into ->fiemap
> > > > > > > > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > > > > > > > index 7a434979201c..28bb523d532a 100644
> > > > > > > > --- a/include/linux/fs.h
> > > > > > > > +++ b/include/linux/fs.h
> > > > > > > > @@ -1711,6 +1711,8 @@ struct fiemap_extent_info {
> > > > > > > >  	fiemap_fill_cb	fi_cb;
> > > > > > > >  };
> > > > > > > >  
> > > > > > > > +int fiemap_fill_kernel_extent(struct fiemap_extent_info *info, u64 logical,
> > > > > > > > +			      u64 phys, u64 len, u32 flags);
> > > > > > > >  int fiemap_fill_next_extent(struct fiemap_extent_info *info, u64 logical,
> > > > > > > >  			    u64 phys, u64 len, u32 flags);
> > > > > > > >  int fiemap_check_flags(struct fiemap_extent_info *fieinfo, u32 fs_flags);
> > > > > > > > -- 
> > > > > > > > 2.17.2
> > > > > > > > 
> > > > > > 
> > > > > > -- 
> > > > > > Carlos
> > > > 
> > > > -- 
> > > > Carlos
> > 
> > -- 
> > Carlos

-- 
Carlos

* Re: [PATCH 09/10 V2] Use FIEMAP for FIBMAP calls
  2019-02-07 21:25                 ` Andreas Dilger
  2019-02-08  8:46                   ` Christoph Hellwig
@ 2019-02-08  9:08                   ` Carlos Maiolino
  1 sibling, 0 replies; 53+ messages in thread
From: Carlos Maiolino @ 2019-02-08  9:08 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Darrick J. Wong, linux-fsdevel, Christoph Hellwig, Eric Sandeen, david

On Thu, Feb 07, 2019 at 02:25:01PM -0700, Andreas Dilger wrote:
> On Feb 7, 2019, at 10:02 AM, Darrick J. Wong <darrick.wong@oracle.com> wrote:
> > On Thu, Feb 07, 2019 at 12:59:54PM +0100, Carlos Maiolino wrote:
> >> On Wed, Feb 06, 2019 at 12:44:31PM -0800, Darrick J. Wong wrote:
> >>> 
> >>> ...but from my point of view, the FIEMAP behavior *ought* to change to
> >>> be more expressive.  Once that's done, we can use the more expressive
> >>> FIEMAP output to solve the problem of FIBMAP vs. multi-disk filesystems.
> >>> 
> >>> The whole point of having fe_reserved* fields in struct fiemap_extent is
> >>> so that we can add a new FIEMAP_EXTENT_ flag so that the filesystem can
> >>> start returning data in a reserved field.  New userspace programs that
> >>> know about the flag can start reading information from the new field if
> >>> they see the flag, and old userspace programs don't know about the flag
> >>> and won't be any worse off.
> >>> 
> >>> Perhaps it would help for me to present a more concrete proposal:
> >>> 
> >>> --- a/include/uapi/linux/fiemap.h	2019-01-18 10:53:44.000000000 -0800
> >>> +++ b/include/uapi/linux/fiemap.h	2019-02-06 12:25:52.813935941 -0800
> >>> @@ -22,7 +22,19 @@ struct fiemap_extent {
> >>> 	__u64 fe_length;   /* length in bytes for this extent */
> >>> 	__u64 fe_reserved64[2];
> >>> 	__u32 fe_flags;    /* FIEMAP_EXTENT_* flags for this extent */
> >>> -	__u32 fe_reserved[3];
> >>> +
> >>> +	/*
> >>> +	 * Underlying device that this extent is stored on.
> >>> +	 *
> >>> +	 * If FIEMAP_EXTENT_DEV_T is set, this field is a dev_t containing the
> >>> +	 * major and minor numbers of a device.  If FIEMAP_EXTENT_DEV_COOKIE is
> >>> +	 * set, this field is a 32-bit cookie that can be used to distinguish
> >>> +	 * between backing devices but has no intrinsic meaning.  If neither
> >>> +	 * EXTENT_DEV flag is set, this field is meaningless.  Only one of the
> >>> +	 * EXTENT_DEV flags may be set at any time.
> >>> +	 */
> >>> +	__u32 fe_device;
> >>> +	__u32 fe_reserved[2];
> >>> };
> >>> 
> >>> struct fiemap {
> >>> @@ -66,5 +78,14 @@ struct fiemap {
> >>> 						    * merged for efficiency. */
> >>> #define FIEMAP_EXTENT_SHARED		0x00002000 /* Space shared with other
> >>> 						    * files. */
> >>> +#define FIEMAP_EXTENT_DEV_T		0x00004000 /* fe_device is a dev_t
> >>> +						    * structure containing the
> >>> +						    * major and minor numbers
> >>> +						    * of a block device. */
> >>> +#define FIEMAP_EXTENT_DEV_COOKIE	0x00008000 /* fe_device is a 32-bit
> >>> +						    * cookie that can be used
> >>> +						    * to distinguish physical
> >>> +						    * devices but otherwise
> >>> +						    * has no meaning. */
> >>> 
> >>> #endif /* _LINUX_FIEMAP_H */
> >>> 
> >>> Under this scheme, XFS can set FIEMAP_EXTENT_DEV_T in fe_flags and start
> >>> encoding fe_device = new_encode_dev(xfs_get_device_for_file()).
> >>> 
> >>> Some clustered filesystem or whatever could set FIEMAP_EXTENT_DEV_COOKIE
> >>> and encode the replica number in fe_device.
> >>> 
> >> 
> >> All of this makes sense, but I'm struggling to understand what you mean by
> >> replica number here, and why it justify a second flag.
> > 
> > I left in the "device cookie" thing in the proposal to accommodate a
> > request from the Lustre folks to be able to report which replica is
> > storing a particular extent map.  Apparently the replica id is simply a
> > 32-bit number that isn't inherently useful, hence the vagueness around
> > what "cookie" really means...
> > 
> > ...oh, right, lustre fell out of drivers/staging/.  You could probably
> > leave it out then.
> 
> Do we really need to be this way, about reserving a single flag for Lustre,
> which will likely also be useful for other filesystems?  It's not like
> Lustre is some closed-source binary module for which we need to make life
> difficult, it is used by many thousands of the largest computers at labs
> and universities and companies around the world.  We are working to clean
> up the code outside the staging tree and resubmit it.  Not reserving a flag
> just means we will continue to use random values in Lustre before it can
> be merged, which will make life harder when we try to merge again.
> 

Agreed, it's a flag that may benefit different filesystems, not only lustre.

> 
> In the case of Lustre, the proposed DEV_COOKIE would indicate fe_device is
> the integer index number of the server on which each extent of the file is
> located (Darrick's "replica number" term is not really correct).  The server
> index is familiar to all Lustre users, so having filefrag print out the
> device number of "0000" or "0009" is totally clear to them.
> 

Thanks for the info. /me doesn't know LustreFS.

> For pNFS or Ceph or other network filesystems (if they implement filefrag)
> it could be an index or some other number (e.g. the IP address of the server
> or low bits of the  UUID or whatever).  Reading back in the archives of the
> original FIEMAP discussion, it seems BtrFS would prefer to use DEV_COOKIE
> instead of DEV_T, because it uses internal RAID encoding and not plain block
> devices, but I'm not familiar with the details there.
> 
> 
> Alternately, or in addition to, a DEV_COOKIE flag which indicates that the
> same fe_device field is "not a device", it would be possible to add:
> 
>     #define FIEMAP_NO_DIRECT      0x40000000
> 
> and/or:
> 
>     #define FIEMAP_EXTENT_NET     0x80000000   /* Data stored remotely.
>                                                 * Sets NO_DIRECT flag */
> 
> returned by the filesystem that indicates the extent blocks are not local
> to the node, so FIBMAP should return an error (-EOPNOTSUPP or -EREMOTE or
> whatever) because the file can't be booted from.  In that case, we could
> return FIEMAP_EXTENT_DEVICE to indicate the fe_device field is valid, and
> return FIEMAP_EXTENT_NET to indicate the values in fe_device are not local
> block devices, just filesystem-specific values to distinguish devices.
> 
> However, I'm open to both _DEV_COOKIE and _NET flag if that is preferred,
> since I think the two are somewhat complementary.
> 

I'd rather not go this far down the rabbit hole. Once we have the fe_device
field and the basic flags, it will be relatively easy to propose new flags, but
I believe new flags should be discussed against an actual patch proposal.
Debating here which flags should or shouldn't be added would be pointless.

Let me work on the patchset to update FIEMAP, and then we can discuss this
there. I'll Cc you on the patches if you want.

Cheers

> >> This actually looks to contradict what you have been complaining, about some
> >> filesystems which doesn't support FIBMAP currently, will now suddenly start to
> >> support. Assuming it's ok if the implementation doesn't tell us about the
> >> backing device, will simply make FIBMAP work. Let's say BTRFS doesn't report the
> >> backing device, assuming it's ok will just fall into your first complain.
> > 
> > Sorry, this thread has been going on so long that I forgot your goal for
> > this series. :/
> > 
> > Specifically, I had forgotten that you're removing the ->bmap pointer,
> > which means that filesystems don't have any particular way to signal
> > "Yes on FIEMAP, no on FIBMAP".  Somehow I had thought that you were
> > merely creating a generic_file_bmap() that would call FIEMAP and ripping
> > out all the adhoc bmap implementations.
> 
> Just a reminder here, you should set FIEMAP_FLAG_SYNC when mapping FIBMAP
> to FIEMAP so that the data on that file is flushed to disk before returning,
> since the block mapping may not be assigned yet or may be unstable, which
> could lead to an unbootable system if used for LILO.
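
A rough sketch of what that reminder implies for the FIBMAP to FIEMAP translation, assuming a trimmed-down request structure: struct fibmap_request_sk and fibmap_to_fiemap_sk() are hypothetical names, not the series' fiemap_extent_info, and only the FIEMAP_FLAG_SYNC value is taken from the uapi header.

```c
#include <assert.h>
#include <stdint.h>

/* Value mirrors FIEMAP_FLAG_SYNC in include/uapi/linux/fiemap.h. */
#ifndef FIEMAP_FLAG_SYNC
#define FIEMAP_FLAG_SYNC 0x00000001u
#endif

/* Hypothetical request; only the fields a single-block lookup needs. */
struct fibmap_request_sk {
	uint64_t start;       /* byte offset of the requested block */
	uint64_t len;         /* exactly one block, in bytes */
	uint32_t flags;       /* FIEMAP_FLAG_* */
	uint32_t extents_max; /* FIBMAP only ever needs one extent */
};

/*
 * Translate "logical block N" into a one-extent FIEMAP request.
 * FIEMAP_FLAG_SYNC is always set so the file is flushed first and the
 * mapping reported for the block is already allocated and stable.
 */
static struct fibmap_request_sk fibmap_to_fiemap_sk(uint64_t block,
						    unsigned int blkbits)
{
	struct fibmap_request_sk req;

	assert(blkbits < 64);
	req.start = block << blkbits;
	req.len = (uint64_t)1 << blkbits;
	req.flags = FIEMAP_FLAG_SYNC;
	req.extents_max = 1;
	return req;
}
```

With 4096-byte blocks (blkbits = 12), block 3 becomes a request starting at byte 12288 with a length of 4096 bytes.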


> 
> Cheers, Andreas
> 
> >>> Existing filesystems can be left unchanged, in which case neither
> >>> EXTENT_DEV flag is set in fe_flags and the bits in fe_device are
> >>> meaningless, the same as they are today.  Reporting fe_device is entirely
> >>> optional.
> >>> 
> >>> Userspace programs will now be able to tell which device the file data
> >>> lives on, which has been sort-of requested for years, if the filesystem
> >>> chooses to start exporting that information.
> >>> 
> >>> Your FIBMAP-via-FIEMAP backend can do something like:
> >>> 
> >>> /* FIBMAP only returns results for the same block device backing the fs. */
> >>> if ((fe->fe_flags & EXTENT_DEV_T) && fe->fe_device != inode->i_sb->sb_device)
> >>> 	return 0;
> >>> 
> >>> /* Can't tell what is the backing device, bail out. */
> >>> if (fe->fe_flags & EXTENT_DEV_COOKIE)
> >>> 	return 0;
> >>> 
> >> 
> >> Ok, the first conditional, is ok, the second one is not making sense to me.
> >> Looks like you are basically using it to flag the filesystem can't tell
> >> exactly which device the current extent is, let's say for example, distributed
> >> filesystems, where the physical extent can actually be on a different machine.
> >> But I can't say for sure, can you give me more details about what you are trying
> >> to achieve here?
> > 
> > You've understood me correctly. :)
> > 
> >>> /*
> >>> * Either fe_device matches the backing device or the implementation
> >>> * doesn't tell us about the backing device, so assume it's ok.
> >>> */
> >>> <return FIBMAP results>
> >>> 
> >> 
> >> Anyway, I think I need to understand more your usage idea for EXTENT_DEV_COOKIE
> >> you mentioned.
> > 
> > I think you've understood it about as well as I can explain it.  Maybe
> > Andreas will have more to say about the lustre replica id, but OTOH it's
> > gone and so there's no user of it, so we could just drop it until lustre
> > comes back.
> > 
> 
> >>> So that's how I'd solve a longstanding design problem of FIEMAP and then
> >>> take advantage of that solution to remedy my objections to the proposed
> >>> "Use FIEMAP for FIBMAP" series.  It doesn't require a FIEMAP_FLAG
> >>> behavior flag that userspace knows about but isn't allowed to pass in.
> >>> 
> >> 
> >>>> A FIEMAP_FLAG will also require FS changes, but IMHO, less intrusive
> >>>> than the device id in fiemap_extent. I don't see much advantage in
> >>>> adding the device id instead of using the flag.
> >>>> 
> >>>> A problem I see using a new FIEMAP_FLAG, is it 'could' be also passed via
> >>>> userspace, so, it would require a check to make sure it didn't come from
> >>>> userspace if ioctl_fiemap() was used.
> >>>> 
> >>>> I think there are 2 other possibilities which can be used to fix this.
> >>>> 
> >>>> - Use a boolean inside fiemap_extent_info to identify a fibmap call, or,
> >>>> - If the device id is a must for you, maybe add the device id into
> >>>>  fiemap_extent_info instead of fiemap_extent.
> >>> 
> >>> That won't work with btrfs, which can store file extents on multiple
> >>> different physical devices.
> >>> 
> >>>>  So we don't mess with a UAPI exported data structure and still
> >>>>  provides a way to the filesystems to provide which device the mapped
> >>>>  extent is in.
> >>>> 
> >>>> What you think?
> >>>> 
> >>>> Cheers
> >>>> 
> >>>> 
> >>>>> 
> >>>>> --D
> >>>>> 
> >>>>>>> 
> >>>>>>>> +
> >>>>>>>> +	return error;
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> /**
> >>>>>>>>  *	bmap	- find a block number in a file
> >>>>>>>>  *	@inode:  inode owning the block number being requested
> >>>>>>>> @@ -1594,10 +1628,14 @@ EXPORT_SYMBOL(iput);
> >>>>>>>>  */
> >>>>>>>> int bmap(struct inode *inode, sector_t *block)
> >>>>>>>> {
> >>>>>>>> -	if (!inode->i_mapping->a_ops->bmap)
> >>>>>>>> +	if (inode->i_op->fiemap)
> >>>>>>>> +		return bmap_fiemap(inode, block);
> >>>>>>>> +	else if (inode->i_mapping->a_ops->bmap)
> >>>>>>>> +		*block = inode->i_mapping->a_ops->bmap(inode->i_mapping,
> >>>>>>>> +						       *block);
> >>>>>>>> +	else
> >>>>>>>> 		return -EINVAL;
> >>>>>>> 
> >>>>>>> Waitaminute.  btrfs currently supports fiemap but not bmap, and now
> >>>>>>> suddenly it will support this legacy interface they've never supported
> >>>>>>> before.  Are they on board with this?
> >>>>>>> 
> >>>>>>> --D
> >>>>>>> 
> >>>>>>>> 
> >>>>>>>> -	*block = inode->i_mapping->a_ops->bmap(inode->i_mapping, *block);
> >>>>>>>> 	return 0;
> >>>>>>>> }
> >>>>>>>> EXPORT_SYMBOL(bmap);
> >>>>>>>> diff --git a/fs/ioctl.c b/fs/ioctl.c
> >>>>>>>> index 6086978fe01e..bfa59df332bf 100644
> >>>>>>>> --- a/fs/ioctl.c
> >>>>>>>> +++ b/fs/ioctl.c
> >>>>>>>> @@ -116,6 +116,38 @@ int fiemap_fill_user_extent(struct fiemap_extent_info *fieinfo, u64 logical,
> >>>>>>>> 	return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
> >>>>>>>> }
> >>>>>>>> 
> >>>>>>>> +int fiemap_fill_kernel_extent(struct fiemap_extent_info *fieinfo, u64 logical,
> >>>>>>>> +			    u64 phys, u64 len, u32 flags)
> >>>>>>>> +{
> >>>>>>>> +	struct fiemap_extent *extent = fieinfo->fi_extents_start;
> >>>>>>>> +
> >>>>>>>> +	/* only count the extents */
> >>>>>>>> +	if (fieinfo->fi_extents_max == 0) {
> >>>>>>>> +		fieinfo->fi_extents_mapped++;
> >>>>>>>> +		return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
> >>>>>>>> +	}
> >>>>>>>> +
> >>>>>>>> +	if (fieinfo->fi_extents_mapped >= fieinfo->fi_extents_max)
> >>>>>>>> +		return 1;
> >>>>>>>> +
> >>>>>>>> +	if (flags & SET_UNKNOWN_FLAGS)
> >>>>>>>> +		flags |= FIEMAP_EXTENT_UNKNOWN;
> >>>>>>>> +	if (flags & SET_NO_UNMOUNTED_IO_FLAGS)
> >>>>>>>> +		flags |= FIEMAP_EXTENT_ENCODED;
> >>>>>>>> +	if (flags & SET_NOT_ALIGNED_FLAGS)
> >>>>>>>> +		flags |= FIEMAP_EXTENT_NOT_ALIGNED;
> >>>>>>>> +
> >>>>>>>> +	extent->fe_logical = logical;
> >>>>>>>> +	extent->fe_physical = phys;
> >>>>>>>> +	extent->fe_length = len;
> >>>>>>>> +	extent->fe_flags = flags;
> >>>>>>>> +
> >>>>>>>> +	fieinfo->fi_extents_mapped++;
> >>>>>>>> +
> >>>>>>>> +	if (fieinfo->fi_extents_mapped == fieinfo->fi_extents_max)
> >>>>>>>> +		return 1;
> >>>>>>>> +	return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
> >>>>>>>> +}
> >>>>>>>> /**
> >>>>>>>>  * fiemap_fill_next_extent - Fiemap helper function
> >>>>>>>>  * @fieinfo:	Fiemap context passed into ->fiemap
> >>>>>>>> diff --git a/include/linux/fs.h b/include/linux/fs.h
> >>>>>>>> index 7a434979201c..28bb523d532a 100644
> >>>>>>>> --- a/include/linux/fs.h
> >>>>>>>> +++ b/include/linux/fs.h
> >>>>>>>> @@ -1711,6 +1711,8 @@ struct fiemap_extent_info {
> >>>>>>>> 	fiemap_fill_cb	fi_cb;
> >>>>>>>> };
> >>>>>>>> 
> >>>>>>>> +int fiemap_fill_kernel_extent(struct fiemap_extent_info *info, u64 logical,
> >>>>>>>> +			      u64 phys, u64 len, u32 flags);
> >>>>>>>> int fiemap_fill_next_extent(struct fiemap_extent_info *info, u64 logical,
> >>>>>>>> 			    u64 phys, u64 len, u32 flags);
> >>>>>>>> int fiemap_check_flags(struct fiemap_extent_info *fieinfo, u32 fs_flags);
> >>>>>>>> --
> >>>>>>>> 2.17.2
> >>>>>>>> 
> >>>>>> 
> >>>>>> --
> >>>>>> Carlos
> >>>> 
> >>>> --
> >>>> Carlos
> >> 
> >> --
> >> Carlos
> 
> 
> Cheers, Andreas
> 
> 
> 
> 
> 



-- 
Carlos

* Re: [PATCH 09/10 V2] Use FIEMAP for FIBMAP calls
  2019-02-08  8:46                   ` Christoph Hellwig
@ 2019-02-08 10:36                     ` Carlos Maiolino
  2019-02-08 21:03                       ` Andreas Dilger
  0 siblings, 1 reply; 53+ messages in thread
From: Carlos Maiolino @ 2019-02-08 10:36 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Andreas Dilger, Darrick J. Wong, linux-fsdevel, Eric Sandeen, david

On Fri, Feb 08, 2019 at 09:46:12AM +0100, Christoph Hellwig wrote:
> On Thu, Feb 07, 2019 at 02:25:01PM -0700, Andreas Dilger wrote:
> > Do we really need to be this way, about reserving a single flag for Lustre,
> > which will likely also be useful for other filesystems?  It's not like
> > Lustre is some closed-source binary module for which we need to make life
> > difficult, it is used by many thousands of the largest computers at labs
> > and universities and companies around the world.  We are working to clean
> > up the code outside the staging tree and resubmit it.  Not reserving a flag
> > just means we will continue to use random values in Lustre before it can
> > be merged, which will make life harder when we try to merge again.
> 
> No, it is available in source, but otherwise just as bad.  And we generally
> only define APIs for in-kernel usage.
> 
> If we can come up with a good API for in-kernel filesystems we can do
> that, otherwise hell no.  And staging for that matter qualifies as out
> of tree.
> 
> That being said I'm really worried about these FIEMAP extensions as
> userspace has no business poking into details of the placement (vs
> just the layout).
> 
I tend to say that identifying which device an extent lives on is better than
simply saying 'it maps to physical blocks X-Z, but it's your problem to identify
which device X-Z belongs to'.

> But all that belongs in a separate discussion instead of dragging down
> this series where it does not belong at all.

Agreed, but now I'm at a kind of dead end :P

Darrick's concerns are valid, regarding letting currently unsupported
filesystems suddenly allow FIBMAP calls, but on the other hand, his proposed
solution, which is also valid, requires a new discussion/patchset to work out an
improvement of the FIEMAP infrastructure, and 'fix' the problem mentioned.
Using a flag to identify FIBMAP calls has been rejected. So, I'd welcome
suggestions on how to move this patch forward without requiring the
improvements suggested by Darrick, and without using a flag to tag FIBMAP
calls, as I suggested. I'm kind of running out of ideas by now :(

Cheers

-- 
Carlos

* Re: [PATCH 09/10 V2] Use FIEMAP for FIBMAP calls
  2019-02-08 10:36                     ` Carlos Maiolino
@ 2019-02-08 21:03                       ` Andreas Dilger
  0 siblings, 0 replies; 53+ messages in thread
From: Andreas Dilger @ 2019-02-08 21:03 UTC (permalink / raw)
  To: Carlos Maiolino
  Cc: Christoph Hellwig, Darrick J. Wong, linux-fsdevel, Eric Sandeen, david

On Feb 8, 2019, at 3:36 AM, Carlos Maiolino <cmaiolino@redhat.com> wrote:
> 
> On Fri, Feb 08, 2019 at 09:46:12AM +0100, Christoph Hellwig wrote:
>> On Thu, Feb 07, 2019 at 02:25:01PM -0700, Andreas Dilger wrote:
>>> Do we really need to be this way, about reserving a single flag for Lustre,
>>> which will likely also be useful for other filesystems?  It's not like
>>> Lustre is some closed-source binary module for which we need to make life
>>> difficult, it is used by many thousands of the largest computers at labs
>>> and universities and companies around the world.  We are working to clean
>>> up the code outside the staging tree and resubmit it.  Not reserving a flag
>>> just means we will continue to use random values in Lustre before it can
>>> be merged, which will make life harder when we try to merge again.
>> 
>> No, it is available in source, but otherwise just as bad.  And we generally
>> only define APIs for in-kernel usage.
>> 
>> If we can come up with a good API for in-kernel filesystems we can do
>> that, otherwise hell no.  And staging for that matter qualifies as out
>> of tree.
>> 
>> That being said I'm really worried about these FIEMAP extensions as
>> userspace has no business poking into details of the placement (vs
>> just the layout).
> 
> I tend to say that identifying which device an extent lives on is better than
> simply saying 'it maps to physical blocks X-Z, but it's your problem to identify
> which device X-Z belongs to'.
> 
>> But all that belongs in a separate discussion instead of dragging down
>> this series where it does not belong at all.
> 
> Agreed, but now I'm at a kind of dead end :P
> 
> Darrick's concerns are valid, regarding letting currently unsupported
> filesystems suddenly allow FIBMAP calls,

I don't think there is a huge danger of people suddenly moving to use LILO
on f2fs or Btrfs with new kernels.  In most cases, it _would_ just work,
but the FIBMAP->FIEMAP layer needs to check for the FIEMAP_EXTENT_NOT_ALIGNED
and FIEMAP_EXTENT_ENCODED extent flags, plus a device flag (such as the
proposed FIEMAP_EXTENT_DEV_T/_DEV_COOKIE), any of which would make the file
unsuitable for booting.
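
As a sketch of that check, assuming the translation layer has a single extent in hand: fibmap_block_from_extent_sk() is a hypothetical helper, the flag values mirror include/uapi/linux/fiemap.h, and also treating FIEMAP_EXTENT_UNKNOWN as unsuitable is an assumption beyond the flags named above.

```c
#include <assert.h>
#include <stdint.h>

/* Values mirror include/uapi/linux/fiemap.h. */
#define FIEMAP_EXTENT_LAST        0x00000001u
#define FIEMAP_EXTENT_UNKNOWN     0x00000002u
#define FIEMAP_EXTENT_ENCODED     0x00000008u
#define FIEMAP_EXTENT_NOT_ALIGNED 0x00000100u

/*
 * Convert one FIEMAP extent into a FIBMAP answer.  Returns the physical
 * block number, or 0 (what FIBMAP reports for "no usable mapping") when
 * the extent cannot be read directly from the block device.
 */
static uint64_t fibmap_block_from_extent_sk(uint64_t fe_physical,
					    uint32_t fe_flags,
					    unsigned int blkbits)
{
	const uint32_t unsuitable = FIEMAP_EXTENT_UNKNOWN |
				    FIEMAP_EXTENT_ENCODED |
				    FIEMAP_EXTENT_NOT_ALIGNED;

	assert(blkbits < 64);

	if (fe_flags & unsuitable)
		return 0;

	return fe_physical >> blkbits;
}
```

A plain extent at physical byte 8192 with 4096-byte blocks yields block 2; the same extent flagged ENCODED or NOT_ALIGNED yields 0.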

> but on the other hand, his proposed
> solution, which is also valid, requires a new discussion/patchset to work out an
> improvement of the FIEMAP infrastructure, and 'fix' the problem mentioned.
> Using a flag to identify FIBMAP calls has been rejected. So, I'd welcome
> suggestions on how to move this patch forward without requiring the
> improvements suggested by Darrick, and without using a flag to tag FIBMAP
> calls, as I suggested. I'm kind of running out of ideas by now :(

I think Darrick was against a flag like "FIEMAP_FLAG_FIBMAP" because it could
be specified from userspace, and it is a bit ugly and has no other value than
preventing FIBMAP from working on filesystems that don't support it today.

As Christoph mentioned, such a flag could be OK as long as it is masked from
userspace in the top-level ioctl_fiemap() handler (though to be honest, there
is no benefit for a userspace app to set this flag, it would just increase the
chance the ioctl(FIEMAP) call will fail).

     #define FIEMAP_FLAG_FIBMAP 0x80000000

Filesystems that don't want FIBMAP to work at all should return -ENOTTY from
their ->fiemap() handler.
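
One way the userspace side of this could look, as a sketch only: fiemap_from_user_sk() and fiemap_flags_for_fibmap_sk() are hypothetical helpers, FIEMAP_FLAG_FIBMAP is the value proposed above and not a mainline flag, and rejecting with -EINVAL (instead of silently clearing the bit) is an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define FIEMAP_FLAG_SYNC   0x00000001u /* mirrors the uapi header */
#define FIEMAP_FLAG_FIBMAP 0x80000000u /* proposed in this thread, kernel-internal */

/*
 * Entry point for ioctl(FIEMAP): the internal flag must never be accepted
 * from userspace, so a request carrying it is refused outright.
 */
static int fiemap_from_user_sk(uint32_t user_flags, uint32_t *kernel_flags)
{
	assert(kernel_flags);

	if (user_flags & FIEMAP_FLAG_FIBMAP)
		return -EINVAL;

	*kernel_flags = user_flags;
	return 0;
}

/* Entry point for ioctl(FIBMAP): the kernel sets the internal flag itself. */
static uint32_t fiemap_flags_for_fibmap_sk(void)
{
	return FIEMAP_FLAG_FIBMAP | FIEMAP_FLAG_SYNC;
}
```

Only the FIBMAP path can produce a request with FIEMAP_FLAG_FIBMAP set, so a filesystem's ->fiemap() handler can trust the bit when it sees it.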

That said, there is *still* a need for fe_device checking in ioctl_fibmap(),
because for filesystems that allow both FIEMAP and FIBMAP (i.e. the most common
ones like ext4 and XFS) there may still be reasons for FIBMAP to fail for some
files if they are unsuitable (e.g. data stored on multiple devices). That isn't
something that the filesystems should be checking themselves.
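
A sketch of such a generic check, under the assumption that Darrick's fe_device proposal were adopted: struct extent_sk and extent_on_fs_device_sk() are hypothetical, and FIEMAP_EXTENT_DEV_T / FIEMAP_EXTENT_DEV_COOKIE use the values from that proposal, which are not in mainline.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values from the fe_device proposal in this thread; not mainline flags. */
#define FIEMAP_EXTENT_DEV_T      0x00004000u
#define FIEMAP_EXTENT_DEV_COOKIE 0x00008000u

/* Hypothetical subset of struct fiemap_extent with the proposed field. */
struct extent_sk {
	uint64_t fe_physical;
	uint32_t fe_flags;
	uint32_t fe_device;
};

/*
 * FIBMAP results only make sense relative to the block device backing the
 * filesystem.  An opaque cookie cannot be compared against it, and a dev_t
 * that differs means the data lives elsewhere.  If neither flag is set the
 * filesystem did not report a device, so the extent is assumed to be fine.
 */
static bool extent_on_fs_device_sk(const struct extent_sk *fe, uint32_t fs_dev)
{
	assert(fe);

	if (fe->fe_flags & FIEMAP_EXTENT_DEV_COOKIE)
		return false;

	if ((fe->fe_flags & FIEMAP_EXTENT_DEV_T) && fe->fe_device != fs_dev)
		return false;

	return true;
}
```

The generic FIBMAP code would return 0 for any extent that fails this test, so individual filesystems would not need their own checks.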

Cheers, Andreas

* Re: [PATCH 09/10 V2] Use FIEMAP for FIBMAP calls
  2019-02-08  8:43                 ` Christoph Hellwig
@ 2019-02-11 12:57                   ` Christoph Hellwig
  2019-02-11 16:21                     ` Carlos Maiolino
  0 siblings, 1 reply; 53+ messages in thread
From: Christoph Hellwig @ 2019-02-11 12:57 UTC (permalink / raw)
  To: Carlos Maiolino
  Cc: Andreas Dilger, Darrick J. Wong, linux-fsdevel,
	Christoph Hellwig, Eric Sandeen, david

On Fri, Feb 08, 2019 at 09:43:52AM +0100, Christoph Hellwig wrote:
> Agreed.  Please don't change the FIEMAP uapi.  If we need to check
> for a request coming from bmap just define an internal FIEMAP flag
> as the last available flag in the flags word, and reject it when
> it comes from userspace in fiemap.

Based on the new thread you started it seems like this got lost.  I think
we are 99% done with the bmap through fiemap series, and it is just
missing this internal flag to make progress.  Let's finish this off
before starting another big project in the area.


* Re: [PATCH 09/10 V2] Use FIEMAP for FIBMAP calls
  2019-02-11 12:57                   ` Christoph Hellwig
@ 2019-02-11 16:21                     ` Carlos Maiolino
  2019-02-11 16:48                       ` Christoph Hellwig
  0 siblings, 1 reply; 53+ messages in thread
From: Carlos Maiolino @ 2019-02-11 16:21 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Andreas Dilger, Darrick J. Wong, linux-fsdevel, Eric Sandeen, david

On Mon, Feb 11, 2019 at 01:57:08PM +0100, Christoph Hellwig wrote:
> On Fri, Feb 08, 2019 at 09:43:52AM +0100, Christoph Hellwig wrote:
> > Agreed.  Please don't change the FIEMAP uapi.  If we need to check
> > for a request coming from bmap just define an internal FIEMAP flag
> > as the last available flag in the flags word, and reject it when
> > it comes from userspace in fiemap.
> 
> Based on the new thread you started it seems like this got lost.  I think
> we are 99% done with the bmap through fiemap series, and it is just
> missing this internal flag to make progress.  Let's finish this off
> before starting another big project in the area.

I'm more than happy to see your reply. I'd love to finish this, but I got stuck
on how to make some filesystems deny a FIBMAP call when they support FIEMAP
but not FIBMAP.

I have this patch almost ready to go anyway. Do you agree with keeping this flag
in the fi_flags field? Or maybe some other place, dunno, maybe a new fi_private
field.


> 

-- 
Carlos

* Re: [PATCH 09/10 V2] Use FIEMAP for FIBMAP calls
  2019-02-11 16:21                     ` Carlos Maiolino
@ 2019-02-11 16:48                       ` Christoph Hellwig
  0 siblings, 0 replies; 53+ messages in thread
From: Christoph Hellwig @ 2019-02-11 16:48 UTC (permalink / raw)
  To: Carlos Maiolino
  Cc: Christoph Hellwig, Andreas Dilger, Darrick J. Wong,
	linux-fsdevel, Eric Sandeen, david

On Mon, Feb 11, 2019 at 05:21:40PM +0100, Carlos Maiolino wrote:
> I'm more than happy to see your reply. I'd love to finish this, but I got stuck
> on how to make some filesystems deny a FIBMAP call when they support FIEMAP
> but not FIBMAP.
> 
> I have this patch almost ready to go anyway. Do you agree with keeping this flag
> in the fi_flags field? Or maybe some other place, dunno, maybe a new fi_private
> field.

Have it in fi_flags as a purely in-kernel flag, and then deny the
bmap calls through fiemap based on it.
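
A minimal sketch of that arrangement from the filesystem side, with hypothetical names throughout: struct fiemap_extent_info_sk stands in for fiemap_extent_info, nobmap_fs_fiemap_sk() for the ->fiemap method of a filesystem that never supported FIBMAP, and FIEMAP_FLAG_FIBMAP takes the value floated earlier in the thread. Returning -EINVAL is an assumption that matches what bmap() returned when ->bmap was missing; -ENOTTY was also suggested.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define FIEMAP_FLAG_FIBMAP 0x80000000u /* proposed, purely in-kernel */

/* Hypothetical subset of struct fiemap_extent_info. */
struct fiemap_extent_info_sk {
	uint32_t fi_flags;
};

/*
 * ->fiemap for a filesystem that supports FIEMAP but has never supported
 * FIBMAP: a request that arrived through bmap() is refused up front.
 */
static int nobmap_fs_fiemap_sk(const struct fiemap_extent_info_sk *fieinfo)
{
	assert(fieinfo);

	if (fieinfo->fi_flags & FIEMAP_FLAG_FIBMAP)
		return -EINVAL;

	/* ... the normal FIEMAP extent walk would go here ... */
	return 0;
}
```

Filesystems that are happy to serve FIBMAP through FIEMAP simply ignore the flag.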

Thread overview: 53+ messages
2018-12-05  9:17 [PATCH 00/10 V2] New ->fiemap infrastructure and ->bmap removal Carlos Maiolino
2018-12-05  9:17 ` [PATCH 01/10] fs: Enable bmap() function to properly return errors Carlos Maiolino
2018-12-05  9:17 ` [PATCH 02/10] cachefiles: drop direct usage of ->bmap method Carlos Maiolino
2018-12-05  9:17 ` [PATCH 03/10] ecryptfs: drop direct calls to ->bmap Carlos Maiolino
2018-12-05  9:17 ` [PATCH 04/10 V2] fibmap: Use bmap instead of ->bmap method in ioctl_fibmap Carlos Maiolino
2019-01-14 16:49   ` Christoph Hellwig
2019-02-04 11:34     ` Carlos Maiolino
2018-12-05  9:17 ` [PATCH 05/10] fs: Move start and length fiemap fields into fiemap_extent_info Carlos Maiolino
2019-01-14 16:50   ` Christoph Hellwig
2018-12-05  9:17 ` [PATCH 06/10] iomap: Remove length and start fields from iomap_fiemap Carlos Maiolino
2019-01-14 16:51   ` Christoph Hellwig
2018-12-05  9:17 ` [PATCH 07/10] fs: Use a void pointer to store fiemap_extent Carlos Maiolino
2019-01-14 16:53   ` Christoph Hellwig
2018-12-05  9:17 ` [PATCH 08/10 V2] fiemap: Use a callback to fill fiemap extents Carlos Maiolino
2019-01-14 16:53   ` Christoph Hellwig
2018-12-05  9:17 ` [PATCH 09/10 V2] Use FIEMAP for FIBMAP calls Carlos Maiolino
2018-12-05 17:36   ` Darrick J. Wong
2018-12-07  9:09     ` Carlos Maiolino
2018-12-07 20:14       ` Andreas Dilger
2019-02-04 15:11     ` Carlos Maiolino
2019-02-04 18:27       ` Darrick J. Wong
2019-02-06 13:37         ` Carlos Maiolino
2019-02-06 20:44           ` Darrick J. Wong
2019-02-06 21:13             ` Andreas Dilger
2019-02-07  9:52               ` Carlos Maiolino
2019-02-08  8:43                 ` Christoph Hellwig
2019-02-11 12:57                   ` Christoph Hellwig
2019-02-11 16:21                     ` Carlos Maiolino
2019-02-11 16:48                       ` Christoph Hellwig
2019-02-07 11:59             ` Carlos Maiolino
2019-02-07 17:02               ` Darrick J. Wong
2019-02-07 21:25                 ` Andreas Dilger
2019-02-08  8:46                   ` Christoph Hellwig
2019-02-08 10:36                     ` Carlos Maiolino
2019-02-08 21:03                       ` Andreas Dilger
2019-02-08  9:08                   ` Carlos Maiolino
2019-02-08  9:03                 ` Carlos Maiolino
2019-02-07 12:36             ` Carlos Maiolino
2019-02-07 18:16               ` Darrick J. Wong
2019-02-08  8:58                 ` Carlos Maiolino
2019-02-06 21:04           ` Andreas Dilger
2019-01-14 16:56   ` Christoph Hellwig
2019-02-05  9:56     ` Carlos Maiolino
2019-02-05 18:25       ` Christoph Hellwig
2019-02-06  9:50         ` Carlos Maiolino
2018-12-05  9:17 ` [PATCH 10/10] xfs: Get rid of ->bmap Carlos Maiolino
2018-12-05 17:37   ` Darrick J. Wong
2018-12-06 13:06     ` Carlos Maiolino
2018-12-06 18:56 ` [PATCH 00/10 V2] New ->fiemap infrastructure and ->bmap removal Andreas Grünbacher
2018-12-07  9:34   ` Carlos Maiolino
2019-01-14 16:50     ` Christoph Hellwig
2019-01-14 17:56       ` Andreas Grünbacher
2019-01-14 17:58         ` Christoph Hellwig
