* [PATCH 0/4] btrfs: fix cases of stat(2) reporting incorrect number of used blocks
@ 2020-11-04 11:07 fdmanana
  2020-11-04 11:07 ` [PATCH 1/4] btrfs: fix missing delalloc new bit for new delalloc ranges fdmanana
                   ` (4 more replies)
  0 siblings, 5 replies; 12+ messages in thread
From: fdmanana @ 2020-11-04 11:07 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

There are several cases where a stat(2) call can report an incorrect number
of used blocks. In some cases this can even result in reporting 0 used blocks,
which is an especially bad value to report when a file is not empty, and we
have had user bug reports in the past for such cases (the changelogs in the
patches point to one such report).

This patchset addresses all those cases. The third patch fixes a race in
defrag that, while it does not result in a functional problem (data loss or
corruption), leads to unnecessary IO and space allocation, and it is
necessary for the 4th and final patch to work as is.

A couple test cases for fstests will follow.

Filipe Manana (4):
  btrfs: fix missing delalloc new bit for new delalloc ranges
  btrfs: refactor btrfs_drop_extents() to make it easier to extend
  btrfs: fix race when defragging that leads to unnecessary IO
  btrfs: update the number of bytes used by an inode atomically

 fs/btrfs/btrfs_inode.h       |   3 +-
 fs/btrfs/ctree.h             |  71 ++++++++--
 fs/btrfs/extent-io-tree.h    |  16 ++-
 fs/btrfs/extent_io.c         |  10 +-
 fs/btrfs/file.c              | 246 +++++++++++++++--------------------
 fs/btrfs/inode.c             | 233 +++++++++++++++++++++++++++------
 fs/btrfs/ioctl.c             |  39 ++++++
 fs/btrfs/reflink.c           |   9 +-
 fs/btrfs/tests/inode-tests.c |  12 +-
 fs/btrfs/tree-log.c          |  32 +++--
 10 files changed, 458 insertions(+), 213 deletions(-)

-- 
2.28.0



* [PATCH 1/4] btrfs: fix missing delalloc new bit for new delalloc ranges
  2020-11-04 11:07 [PATCH 0/4] btrfs: fix cases of stat(2) reporting incorrect number of used blocks fdmanana
@ 2020-11-04 11:07 ` fdmanana
  2020-11-05 18:29   ` Josef Bacik
  2020-11-04 11:07 ` [PATCH 2/4] btrfs: refactor btrfs_drop_extents() to make it easier to extend fdmanana
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 12+ messages in thread
From: fdmanana @ 2020-11-04 11:07 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

When doing a buffered write through one of the write family syscalls, we
look for ranges that currently don't have allocated extents and set the
'delalloc new' bit on them, so that we can report a correct number of used
blocks to the stat(2) syscall until delalloc is flushed and the ordered
extents complete.

However, there are a few other places where we can do a buffered write
against a range that is mapped to a hole (no extent allocated) and where we
do not set the 'new delalloc' bit. Those places are:

- Doing a memory mapped write against a hole;

- Cloning an inline extent into a hole starting at file offset 0;

- Calling btrfs_cont_expand() when the i_size of the file is not aligned
  to the sector size and is located in a hole. For example when cloning
  to a destination offset beyond eof.

So after such cases, until the corresponding delalloc range is flushed and
the respective ordered extents complete, we can report an incorrect number
of blocks used through the stat(2) syscall.

In some cases we can end up reporting 0 used blocks to stat(2), which is a
particularly bad value to report, as it may mislead tools into thinking a
file is completely sparse when its i_size is not zero, making them skip
reading any data, an undesired consequence for tools such as archivers and
other backup tools, as reported a long time ago in the following thread (and
other past threads):

  https://lists.gnu.org/archive/html/bug-tar/2016-07/msg00001.html
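
As an illustration only (this is not the actual code of tar or any other
tool), the kind of heuristic such tools may rely on boils down to comparing
the st_blocks value from stat(2) against the file size:

  #include <stdbool.h>
  #include <sys/stat.h>

  /*
   * Hypothetical sketch: treat a file as fully sparse when it has a
   * non-zero size but stat(2) reports zero used blocks (st_blocks is
   * given in 512 byte units).
   */
  static bool looks_fully_sparse(const char *path)
  {
          struct stat st;

          if (stat(path, &st) == -1)
                  return false;

          return st.st_size > 0 && st.st_blocks == 0;
  }

A file wrongly reported with 0 used blocks would pass such a check and end
up being archived as a hole, with none of its data read.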

Example reproducer:

  $ cat reproducer.sh
  #!/bin/bash

  MNT=/mnt/sdi
  DEV=/dev/sdi

  mkfs.btrfs -f $DEV > /dev/null
  # mkfs.xfs -f $DEV > /dev/null
  # mkfs.ext4 -F $DEV > /dev/null
  # mkfs.f2fs -f $DEV > /dev/null
  mount $DEV $MNT

  xfs_io -f -c "truncate 64K"   \
      -c "mmap -w 0 64K"        \
      -c "mwrite -S 0xab 0 64K" \
      -c "munmap"               \
      $MNT/foo

  blocks_used=$(stat -c %b $MNT/foo)
  echo "blocks used: $blocks_used"

  if [ $blocks_used -eq 0 ]; then
      echo "ERROR: blocks used is 0"
  fi

  umount $DEV

  $ ./reproducer.sh
  blocks used: 0
  ERROR: blocks used is 0

So move the logic that decides to set the 'delalloc new' bit into the
function btrfs_set_extent_delalloc(), since that is what we use for all
those missing cases as well as for the cases that currently work well.

This change is also preparatory work for an upcoming patch that fixes
other problems related to tracking and reporting the number of bytes used
by an inode.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/file.c              | 56 -----------------------------------
 fs/btrfs/inode.c             | 57 ++++++++++++++++++++++++++++++++++++
 fs/btrfs/tests/inode-tests.c | 12 +++++---
 3 files changed, 65 insertions(+), 60 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 1e87267b9e21..56c1fe3988bd 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -452,45 +452,6 @@ static void btrfs_drop_pages(struct page **pages, size_t num_pages)
 	}
 }
 
-static int btrfs_find_new_delalloc_bytes(struct btrfs_inode *inode,
-					 const u64 start,
-					 const u64 len,
-					 struct extent_state **cached_state)
-{
-	u64 search_start = start;
-	const u64 end = start + len - 1;
-
-	while (search_start < end) {
-		const u64 search_len = end - search_start + 1;
-		struct extent_map *em;
-		u64 em_len;
-		int ret = 0;
-
-		em = btrfs_get_extent(inode, NULL, 0, search_start, search_len);
-		if (IS_ERR(em))
-			return PTR_ERR(em);
-
-		if (em->block_start != EXTENT_MAP_HOLE)
-			goto next;
-
-		em_len = em->len;
-		if (em->start < search_start)
-			em_len -= search_start - em->start;
-		if (em_len > search_len)
-			em_len = search_len;
-
-		ret = set_extent_bit(&inode->io_tree, search_start,
-				     search_start + em_len - 1,
-				     EXTENT_DELALLOC_NEW, cached_state, GFP_NOFS);
-next:
-		search_start = extent_map_end(em);
-		free_extent_map(em);
-		if (ret)
-			return ret;
-	}
-	return 0;
-}
-
 /*
  * after copy_from_user, pages need to be dirtied and we need to make
  * sure holes are created between the current EOF and the start of
@@ -533,23 +494,6 @@ int btrfs_dirty_pages(struct btrfs_inode *inode, struct page **pages,
 			 EXTENT_DELALLOC | EXTENT_DO_ACCOUNTING | EXTENT_DEFRAG,
 			 0, 0, cached);
 
-	if (!btrfs_is_free_space_inode(inode)) {
-		if (start_pos >= isize &&
-		    !(inode->flags & BTRFS_INODE_PREALLOC)) {
-			/*
-			 * There can't be any extents following eof in this case
-			 * so just set the delalloc new bit for the range
-			 * directly.
-			 */
-			extra_bits |= EXTENT_DELALLOC_NEW;
-		} else {
-			err = btrfs_find_new_delalloc_bytes(inode, start_pos,
-							    num_bytes, cached);
-			if (err)
-				return err;
-		}
-	}
-
 	err = btrfs_set_extent_delalloc(inode, start_pos, end_of_last_block,
 					extra_bits, cached);
 	if (err)
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index c54e0ed0b938..2fb3d2ea75fd 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2301,11 +2301,68 @@ static int add_pending_csums(struct btrfs_trans_handle *trans,
 	return 0;
 }
 
+static int btrfs_find_new_delalloc_bytes(struct btrfs_inode *inode,
+					 const u64 start,
+					 const u64 len,
+					 struct extent_state **cached_state)
+{
+	u64 search_start = start;
+	const u64 end = start + len - 1;
+
+	while (search_start < end) {
+		const u64 search_len = end - search_start + 1;
+		struct extent_map *em;
+		u64 em_len;
+		int ret = 0;
+
+		em = btrfs_get_extent(inode, NULL, 0, search_start, search_len);
+		if (IS_ERR(em))
+			return PTR_ERR(em);
+
+		if (em->block_start != EXTENT_MAP_HOLE)
+			goto next;
+
+		em_len = em->len;
+		if (em->start < search_start)
+			em_len -= search_start - em->start;
+		if (em_len > search_len)
+			em_len = search_len;
+
+		ret = set_extent_bit(&inode->io_tree, search_start,
+				     search_start + em_len - 1,
+				     EXTENT_DELALLOC_NEW, cached_state, GFP_NOFS);
+next:
+		search_start = extent_map_end(em);
+		free_extent_map(em);
+		if (ret)
+			return ret;
+	}
+	return 0;
+}
+
 int btrfs_set_extent_delalloc(struct btrfs_inode *inode, u64 start, u64 end,
 			      unsigned int extra_bits,
 			      struct extent_state **cached_state)
 {
 	WARN_ON(PAGE_ALIGNED(end));
+
+	if (start >= i_size_read(&inode->vfs_inode) &&
+	    !(inode->flags & BTRFS_INODE_PREALLOC)) {
+		/*
+		 * There can't be any extents following eof in this case so just
+		 * set the delalloc new bit for the range directly.
+		 */
+		extra_bits |= EXTENT_DELALLOC_NEW;
+	} else {
+		int ret;
+
+		ret = btrfs_find_new_delalloc_bytes(inode, start,
+						    end + 1 - start,
+						    cached_state);
+		if (ret)
+			return ret;
+	}
+
 	return set_extent_delalloc(&inode->io_tree, start, end, extra_bits,
 				   cached_state);
 }
diff --git a/fs/btrfs/tests/inode-tests.c b/fs/btrfs/tests/inode-tests.c
index e6719f7db386..04022069761d 100644
--- a/fs/btrfs/tests/inode-tests.c
+++ b/fs/btrfs/tests/inode-tests.c
@@ -983,7 +983,8 @@ static int test_extent_accounting(u32 sectorsize, u32 nodesize)
 	ret = clear_extent_bit(&BTRFS_I(inode)->io_tree,
 			       BTRFS_MAX_EXTENT_SIZE >> 1,
 			       (BTRFS_MAX_EXTENT_SIZE >> 1) + sectorsize - 1,
-			       EXTENT_DELALLOC | EXTENT_UPTODATE, 0, 0, NULL);
+			       EXTENT_DELALLOC | EXTENT_DELALLOC_NEW |
+			       EXTENT_UPTODATE, 0, 0, NULL);
 	if (ret) {
 		test_err("clear_extent_bit returned %d", ret);
 		goto out;
@@ -1050,7 +1051,8 @@ static int test_extent_accounting(u32 sectorsize, u32 nodesize)
 	ret = clear_extent_bit(&BTRFS_I(inode)->io_tree,
 			       BTRFS_MAX_EXTENT_SIZE + sectorsize,
 			       BTRFS_MAX_EXTENT_SIZE + 2 * sectorsize - 1,
-			       EXTENT_DELALLOC | EXTENT_UPTODATE, 0, 0, NULL);
+			       EXTENT_DELALLOC | EXTENT_DELALLOC_NEW |
+			       EXTENT_UPTODATE, 0, 0, NULL);
 	if (ret) {
 		test_err("clear_extent_bit returned %d", ret);
 		goto out;
@@ -1082,7 +1084,8 @@ static int test_extent_accounting(u32 sectorsize, u32 nodesize)
 
 	/* Empty */
 	ret = clear_extent_bit(&BTRFS_I(inode)->io_tree, 0, (u64)-1,
-			       EXTENT_DELALLOC | EXTENT_UPTODATE, 0, 0, NULL);
+			       EXTENT_DELALLOC | EXTENT_DELALLOC_NEW |
+			       EXTENT_UPTODATE, 0, 0, NULL);
 	if (ret) {
 		test_err("clear_extent_bit returned %d", ret);
 		goto out;
@@ -1097,7 +1100,8 @@ static int test_extent_accounting(u32 sectorsize, u32 nodesize)
 out:
 	if (ret)
 		clear_extent_bit(&BTRFS_I(inode)->io_tree, 0, (u64)-1,
-				 EXTENT_DELALLOC | EXTENT_UPTODATE, 0, 0, NULL);
+				 EXTENT_DELALLOC | EXTENT_DELALLOC_NEW |
+				 EXTENT_UPTODATE, 0, 0, NULL);
 	iput(inode);
 	btrfs_free_dummy_root(root);
 	btrfs_free_dummy_fs_info(fs_info);
-- 
2.28.0



* [PATCH 2/4] btrfs: refactor btrfs_drop_extents() to make it easier to extend
  2020-11-04 11:07 [PATCH 0/4] btrfs: fix cases of stat(2) reporting incorrect number of used blocks fdmanana
  2020-11-04 11:07 ` [PATCH 1/4] btrfs: fix missing delalloc new bit for new delalloc ranges fdmanana
@ 2020-11-04 11:07 ` fdmanana
  2020-11-05 18:39   ` Josef Bacik
  2020-11-04 11:07 ` [PATCH 3/4] btrfs: fix race when defragging that leads to unnecessary IO fdmanana
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 12+ messages in thread
From: fdmanana @ 2020-11-04 11:07 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

__btrfs_drop_extents() and its wrapper btrfs_drop_extents() take many
arguments, which makes it hard to add more and requires changing every
caller whenever we do. I added a couple of arguments myself back in 2014,
in commit 1acae57b161e ("Btrfs: faster file extent item replace operations"),
and therefore know firsthand that it is a bit cumbersome to add additional
arguments to these functions.

Since I will need to add more arguments in a subsequent bug fix, this
change is preparatory work and adds a data structure that holds all the
arguments, for both input and output, that are passed to this function,
with some comments in the structure's definition mentioning what each
field is and how it relates to other fields.

Callers of this function only need to zero out the contents of the
structure and set up only the fields they need. This also removes the need
to have both __btrfs_drop_extents() and btrfs_drop_extents(), so now we have
a single function named btrfs_drop_extents() that takes a pointer to the new
data structure (struct btrfs_drop_extents_args).
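
For illustration, a minimal sketch of a typical caller after this change,
based on the conversions further below in this patch (for example the one
in maybe_insert_hole()):

  struct btrfs_drop_extents_args drop_args = { 0 };
  int ret;

  /* Set only the fields we care about, everything else stays zeroed. */
  drop_args.start = offset;
  drop_args.end = offset + len;
  drop_args.drop_cache = true;

  ret = btrfs_drop_extents(trans, root, BTRFS_I(inode), &drop_args);

Callers that want to insert a replacement extent additionally set a non-NULL
'path', 'replace_extent' and 'extent_item_size', and then check
'extent_inserted' on return, as described in the comments of the structure's
definition.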

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/ctree.h    |  63 +++++++++++++++---
 fs/btrfs/file.c     | 155 +++++++++++++++++++++++---------------------
 fs/btrfs/inode.c    |  41 +++++++-----
 fs/btrfs/reflink.c  |   7 +-
 fs/btrfs/tree-log.c |  28 +++++---
 5 files changed, 187 insertions(+), 107 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index b46eecf882a1..c3b2e7f2d772 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1229,6 +1229,58 @@ struct btrfs_replace_extent_info {
 	int insertions;
 };
 
+/* Arguments for btrfs_drop_extents(). */
+struct btrfs_drop_extents_args {
+	/* Input parameters */
+
+	/*
+	 * If NULL, btrfs_drop_extents() will allocate and free its own path.
+	 * If 'replace_extent' is true, this must not be NULL. Also the path
+	 * is always released except if 'replace_extent' is true and
+	 * btrfs_drop_extents() sets 'extent_inserted' to true, in which case
+	 * the path is kept locked.
+	 */
+	struct btrfs_path *path;
+	/* Start offset of the range to drop extents from. */
+	u64 start;
+	/* End (exclusive, last byte + 1) of the range to drop extents from. */
+	u64 end;
+	/* If true drop all the extent maps in the range. */
+	bool drop_cache;
+	/*
+	 * If true it means we want to insert a new extent after dropping all
+	 * the extents in the range. If this is true, the 'extent_item_size'
+	 * parameter must be set as well and the 'extent_inserted' field will
+	 * be set to true by btrfs_drop_extents() if it could insert the new
+	 * extent.
+	 * Note: when this is set to true the path must not be NULL.
+	 */
+	bool replace_extent;
+	/*
+	 * Used if 'replace_extent' is true. Size of the file extent item to
+	 * insert after dropping all existing extents in the range.
+	 */
+	u32 extent_item_size;
+
+	/* Output parameters */
+
+	/*
+	 * Set to the minimum between the input parameter 'end' and the end
+	 * (exclusive, last byte + 1) of the last dropped extent. This is always
+	 * set even if btrfs_drop_extents() returns an error.
+	 */
+	u64 drop_end;
+	/*
+	 * Only set if 'replace_extent' is true. Set to true if we were able
+	 * to insert a replacement extent after dropping all extents in the
+	 * range, otherwise set to false by btrfs_drop_extents().
+	 * Also, if btrfs_drop_extents() has set this to true it means it
+	 * returned with the path locked, otherwise if it has set this to
+	 * false it has returned with the path released.
+	 */
+	bool extent_inserted;
+};
+
 struct btrfs_file_private {
 	void *filldir_buf;
 };
@@ -3116,16 +3168,9 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync);
 void btrfs_drop_extent_cache(struct btrfs_inode *inode, u64 start, u64 end,
 			     int skip_pinned);
 extern const struct file_operations btrfs_file_operations;
-int __btrfs_drop_extents(struct btrfs_trans_handle *trans,
-			 struct btrfs_root *root, struct btrfs_inode *inode,
-			 struct btrfs_path *path, u64 start, u64 end,
-			 u64 *drop_end, int drop_cache,
-			 int replace_extent,
-			 u32 extent_item_size,
-			 int *key_inserted);
 int btrfs_drop_extents(struct btrfs_trans_handle *trans,
-		       struct btrfs_root *root, struct inode *inode, u64 start,
-		       u64 end, int drop_cache);
+		       struct btrfs_root *root, struct btrfs_inode *inode,
+		       struct btrfs_drop_extents_args *args);
 int btrfs_replace_file_extents(struct inode *inode, struct btrfs_path *path,
 			   const u64 start, const u64 end,
 			   struct btrfs_replace_extent_info *extent_info,
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 56c1fe3988bd..68b96c6c9c4b 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -678,13 +678,9 @@ void btrfs_drop_extent_cache(struct btrfs_inode *inode, u64 start, u64 end,
  * it is either truncated or split.  Anything entirely inside the range
  * is deleted from the tree.
  */
-int __btrfs_drop_extents(struct btrfs_trans_handle *trans,
-			 struct btrfs_root *root, struct btrfs_inode *inode,
-			 struct btrfs_path *path, u64 start, u64 end,
-			 u64 *drop_end, int drop_cache,
-			 int replace_extent,
-			 u32 extent_item_size,
-			 int *key_inserted)
+int btrfs_drop_extents(struct btrfs_trans_handle *trans,
+		       struct btrfs_root *root, struct btrfs_inode *inode,
+		       struct btrfs_drop_extents_args *args)
 {
 	struct btrfs_fs_info *fs_info = root->fs_info;
 	struct extent_buffer *leaf;
@@ -694,12 +690,12 @@ int __btrfs_drop_extents(struct btrfs_trans_handle *trans,
 	struct btrfs_key new_key;
 	struct inode *vfs_inode = &inode->vfs_inode;
 	u64 ino = btrfs_ino(inode);
-	u64 search_start = start;
+	u64 search_start = args->start;
 	u64 disk_bytenr = 0;
 	u64 num_bytes = 0;
 	u64 extent_offset = 0;
 	u64 extent_end = 0;
-	u64 last_end = start;
+	u64 last_end = args->start;
 	int del_nr = 0;
 	int del_slot = 0;
 	int extent_type;
@@ -709,11 +705,25 @@ int __btrfs_drop_extents(struct btrfs_trans_handle *trans,
 	int update_refs;
 	int found = 0;
 	int leafs_visited = 0;
+	struct btrfs_path *path = args->path;
+
+	args->extent_inserted = false;
+
+	/* Must always have a path if ->replace_extent is true. */
+	ASSERT(!(args->replace_extent && !args->path));
+
+	if (!path) {
+		path = btrfs_alloc_path();
+		if (!path) {
+			ret = -ENOMEM;
+			goto out;
+		}
+	}
 
-	if (drop_cache)
-		btrfs_drop_extent_cache(inode, start, end - 1, 0);
+	if (args->drop_cache)
+		btrfs_drop_extent_cache(inode, args->start, args->end - 1, 0);
 
-	if (start >= inode->disk_i_size && !replace_extent)
+	if (args->start >= inode->disk_i_size && !args->replace_extent)
 		modify_tree = 0;
 
 	update_refs = (test_bit(BTRFS_ROOT_SHAREABLE, &root->state) ||
@@ -724,7 +734,7 @@ int __btrfs_drop_extents(struct btrfs_trans_handle *trans,
 					       search_start, modify_tree);
 		if (ret < 0)
 			break;
-		if (ret > 0 && path->slots[0] > 0 && search_start == start) {
+		if (ret > 0 && path->slots[0] > 0 && search_start == args->start) {
 			leaf = path->nodes[0];
 			btrfs_item_key_to_cpu(leaf, &key, path->slots[0] - 1);
 			if (key.objectid == ino &&
@@ -759,7 +769,7 @@ int __btrfs_drop_extents(struct btrfs_trans_handle *trans,
 			path->slots[0]++;
 			goto next_slot;
 		}
-		if (key.type > BTRFS_EXTENT_DATA_KEY || key.offset >= end)
+		if (key.type > BTRFS_EXTENT_DATA_KEY || key.offset >= args->end)
 			break;
 
 		fi = btrfs_item_ptr(leaf, path->slots[0],
@@ -801,7 +811,7 @@ int __btrfs_drop_extents(struct btrfs_trans_handle *trans,
 		}
 
 		found = 1;
-		search_start = max(key.offset, start);
+		search_start = max(key.offset, args->start);
 		if (recow || !modify_tree) {
 			modify_tree = -1;
 			btrfs_release_path(path);
@@ -812,7 +822,7 @@ int __btrfs_drop_extents(struct btrfs_trans_handle *trans,
 		 *     | - range to drop - |
 		 *  | -------- extent -------- |
 		 */
-		if (start > key.offset && end < extent_end) {
+		if (args->start > key.offset && args->end < extent_end) {
 			BUG_ON(del_nr > 0);
 			if (extent_type == BTRFS_FILE_EXTENT_INLINE) {
 				ret = -EOPNOTSUPP;
@@ -820,7 +830,7 @@ int __btrfs_drop_extents(struct btrfs_trans_handle *trans,
 			}
 
 			memcpy(&new_key, &key, sizeof(new_key));
-			new_key.offset = start;
+			new_key.offset = args->start;
 			ret = btrfs_duplicate_item(trans, root, path,
 						   &new_key);
 			if (ret == -EAGAIN) {
@@ -834,15 +844,15 @@ int __btrfs_drop_extents(struct btrfs_trans_handle *trans,
 			fi = btrfs_item_ptr(leaf, path->slots[0] - 1,
 					    struct btrfs_file_extent_item);
 			btrfs_set_file_extent_num_bytes(leaf, fi,
-							start - key.offset);
+							args->start - key.offset);
 
 			fi = btrfs_item_ptr(leaf, path->slots[0],
 					    struct btrfs_file_extent_item);
 
-			extent_offset += start - key.offset;
+			extent_offset += args->start - key.offset;
 			btrfs_set_file_extent_offset(leaf, fi, extent_offset);
 			btrfs_set_file_extent_num_bytes(leaf, fi,
-							extent_end - start);
+							extent_end - args->start);
 			btrfs_mark_buffer_dirty(leaf);
 
 			if (update_refs && disk_bytenr > 0) {
@@ -852,11 +862,11 @@ int __btrfs_drop_extents(struct btrfs_trans_handle *trans,
 				btrfs_init_data_ref(&ref,
 						root->root_key.objectid,
 						new_key.objectid,
-						start - extent_offset);
+						args->start - extent_offset);
 				ret = btrfs_inc_extent_ref(trans, &ref);
 				BUG_ON(ret); /* -ENOMEM */
 			}
-			key.offset = start;
+			key.offset = args->start;
 		}
 		/*
 		 * From here on out we will have actually dropped something, so
@@ -868,23 +878,24 @@ int __btrfs_drop_extents(struct btrfs_trans_handle *trans,
 		 *  | ---- range to drop ----- |
 		 *      | -------- extent -------- |
 		 */
-		if (start <= key.offset && end < extent_end) {
+		if (args->start <= key.offset && args->end < extent_end) {
 			if (extent_type == BTRFS_FILE_EXTENT_INLINE) {
 				ret = -EOPNOTSUPP;
 				break;
 			}
 
 			memcpy(&new_key, &key, sizeof(new_key));
-			new_key.offset = end;
+			new_key.offset = args->end;
 			btrfs_set_item_key_safe(fs_info, path, &new_key);
 
-			extent_offset += end - key.offset;
+			extent_offset += args->end - key.offset;
 			btrfs_set_file_extent_offset(leaf, fi, extent_offset);
 			btrfs_set_file_extent_num_bytes(leaf, fi,
-							extent_end - end);
+							extent_end - args->end);
 			btrfs_mark_buffer_dirty(leaf);
 			if (update_refs && disk_bytenr > 0)
-				inode_sub_bytes(vfs_inode, end - key.offset);
+				inode_sub_bytes(vfs_inode,
+						args->end - key.offset);
 			break;
 		}
 
@@ -893,7 +904,7 @@ int __btrfs_drop_extents(struct btrfs_trans_handle *trans,
 		 *       | ---- range to drop ----- |
 		 *  | -------- extent -------- |
 		 */
-		if (start > key.offset && end >= extent_end) {
+		if (args->start > key.offset && args->end >= extent_end) {
 			BUG_ON(del_nr > 0);
 			if (extent_type == BTRFS_FILE_EXTENT_INLINE) {
 				ret = -EOPNOTSUPP;
@@ -901,11 +912,12 @@ int __btrfs_drop_extents(struct btrfs_trans_handle *trans,
 			}
 
 			btrfs_set_file_extent_num_bytes(leaf, fi,
-							start - key.offset);
+							args->start - key.offset);
 			btrfs_mark_buffer_dirty(leaf);
 			if (update_refs && disk_bytenr > 0)
-				inode_sub_bytes(vfs_inode, extent_end - start);
-			if (end == extent_end)
+				inode_sub_bytes(vfs_inode,
+						extent_end - args->start);
+			if (args->end == extent_end)
 				break;
 
 			path->slots[0]++;
@@ -916,7 +928,7 @@ int __btrfs_drop_extents(struct btrfs_trans_handle *trans,
 		 *  | ---- range to drop ----- |
 		 *    | ------ extent ------ |
 		 */
-		if (start <= key.offset && end >= extent_end) {
+		if (args->start <= key.offset && args->end >= extent_end) {
 delete_extent_item:
 			if (del_nr == 0) {
 				del_slot = path->slots[0];
@@ -946,7 +958,7 @@ int __btrfs_drop_extents(struct btrfs_trans_handle *trans,
 						extent_end - key.offset);
 			}
 
-			if (end == extent_end)
+			if (args->end == extent_end)
 				break;
 
 			if (path->slots[0] + 1 < btrfs_header_nritems(leaf)) {
@@ -976,7 +988,7 @@ int __btrfs_drop_extents(struct btrfs_trans_handle *trans,
 		 * Set path->slots[0] to first slot, so that after the delete
 		 * if items are move off from our leaf to its immediate left or
 		 * right neighbor leafs, we end up with a correct and adjusted
-		 * path->slots[0] for our insertion (if replace_extent != 0).
+		 * path->slots[0] for our insertion (if args->replace_extent).
 		 */
 		path->slots[0] = del_slot;
 		ret = btrfs_del_items(trans, root, path, del_slot, del_nr);
@@ -990,15 +1002,15 @@ int __btrfs_drop_extents(struct btrfs_trans_handle *trans,
 	 * which case it unlocked our path, so check path->locks[0] matches a
 	 * write lock.
 	 */
-	if (!ret && replace_extent && leafs_visited == 1 &&
+	if (!ret && args->replace_extent && leafs_visited == 1 &&
 	    (path->locks[0] == BTRFS_WRITE_LOCK_BLOCKING ||
 	     path->locks[0] == BTRFS_WRITE_LOCK) &&
 	    btrfs_leaf_free_space(leaf) >=
-	    sizeof(struct btrfs_item) + extent_item_size) {
+	    sizeof(struct btrfs_item) + args->extent_item_size) {
 
 		key.objectid = ino;
 		key.type = BTRFS_EXTENT_DATA_KEY;
-		key.offset = start;
+		key.offset = args->start;
 		if (!del_nr && path->slots[0] < btrfs_header_nritems(leaf)) {
 			struct btrfs_key slot_key;
 
@@ -1006,30 +1018,18 @@ int __btrfs_drop_extents(struct btrfs_trans_handle *trans,
 			if (btrfs_comp_cpu_keys(&key, &slot_key) > 0)
 				path->slots[0]++;
 		}
-		setup_items_for_insert(root, path, &key, &extent_item_size, 1);
-		*key_inserted = 1;
+		setup_items_for_insert(root, path, &key,
+				       &args->extent_item_size, 1);
+		args->extent_inserted = true;
 	}
 
-	if (!replace_extent || !(*key_inserted))
+	if (!args->path)
+		btrfs_free_path(path);
+	else if (!args->extent_inserted)
 		btrfs_release_path(path);
-	if (drop_end)
-		*drop_end = found ? min(end, last_end) : end;
-	return ret;
-}
-
-int btrfs_drop_extents(struct btrfs_trans_handle *trans,
-		       struct btrfs_root *root, struct inode *inode, u64 start,
-		       u64 end, int drop_cache)
-{
-	struct btrfs_path *path;
-	int ret;
+out:
+	args->drop_end = found ? min(args->end, last_end) : args->end;
 
-	path = btrfs_alloc_path();
-	if (!path)
-		return -ENOMEM;
-	ret = __btrfs_drop_extents(trans, root, BTRFS_I(inode), path, start,
-				   end, NULL, drop_cache, 0, 0, NULL);
-	btrfs_free_path(path);
 	return ret;
 }
 
@@ -2602,6 +2602,7 @@ int btrfs_replace_file_extents(struct inode *inode, struct btrfs_path *path,
 			   struct btrfs_replace_extent_info *extent_info,
 			   struct btrfs_trans_handle **trans_out)
 {
+	struct btrfs_drop_extents_args drop_args = { 0 };
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	u64 min_size = btrfs_calc_insert_metadata_size(fs_info, 1);
 	u64 ino_size = round_up(inode->i_size, fs_info->sectorsize);
@@ -2610,7 +2611,6 @@ int btrfs_replace_file_extents(struct inode *inode, struct btrfs_path *path,
 	struct btrfs_block_rsv *rsv;
 	unsigned int rsv_count;
 	u64 cur_offset;
-	u64 drop_end;
 	u64 len = end - start;
 	int ret = 0;
 
@@ -2649,10 +2649,12 @@ int btrfs_replace_file_extents(struct inode *inode, struct btrfs_path *path,
 	trans->block_rsv = rsv;
 
 	cur_offset = start;
+	drop_args.path = path;
+	drop_args.end = end + 1;
+	drop_args.drop_cache = true;
 	while (cur_offset < end) {
-		ret = __btrfs_drop_extents(trans, root, BTRFS_I(inode), path,
-					   cur_offset, end + 1, &drop_end,
-					   1, 0, 0, NULL);
+		drop_args.start = cur_offset;
+		ret = btrfs_drop_extents(trans, root, BTRFS_I(inode), &drop_args);
 		if (ret != -ENOSPC) {
 			/*
 			 * When cloning we want to avoid transaction aborts when
@@ -2669,10 +2671,10 @@ int btrfs_replace_file_extents(struct inode *inode, struct btrfs_path *path,
 
 		trans->block_rsv = &fs_info->trans_block_rsv;
 
-		if (!extent_info && cur_offset < drop_end &&
+		if (!extent_info && cur_offset < drop_args.drop_end &&
 		    cur_offset < ino_size) {
 			ret = fill_holes(trans, BTRFS_I(inode), path,
-					cur_offset, drop_end);
+					 cur_offset, drop_args.drop_end);
 			if (ret) {
 				/*
 				 * If we failed then we didn't insert our hole
@@ -2683,7 +2685,7 @@ int btrfs_replace_file_extents(struct inode *inode, struct btrfs_path *path,
 				btrfs_abort_transaction(trans, ret);
 				break;
 			}
-		} else if (!extent_info && cur_offset < drop_end) {
+		} else if (!extent_info && cur_offset < drop_args.drop_end) {
 			/*
 			 * We are past the i_size here, but since we didn't
 			 * insert holes we need to clear the mapped area so we
@@ -2691,7 +2693,8 @@ int btrfs_replace_file_extents(struct inode *inode, struct btrfs_path *path,
 			 * file extent is inserted here.
 			 */
 			ret = btrfs_inode_clear_file_extent_range(BTRFS_I(inode),
-					cur_offset, drop_end - cur_offset);
+					cur_offset,
+					drop_args.drop_end - cur_offset);
 			if (ret) {
 				/*
 				 * We couldn't clear our area, so we could
@@ -2703,8 +2706,10 @@ int btrfs_replace_file_extents(struct inode *inode, struct btrfs_path *path,
 			}
 		}
 
-		if (extent_info && drop_end > extent_info->file_offset) {
-			u64 replace_len = drop_end - extent_info->file_offset;
+		if (extent_info &&
+		    drop_args.drop_end > extent_info->file_offset) {
+			u64 replace_len = drop_args.drop_end -
+				extent_info->file_offset;
 
 			ret = btrfs_insert_replace_extent(trans, inode, path,
 							extent_info, replace_len);
@@ -2717,7 +2722,7 @@ int btrfs_replace_file_extents(struct inode *inode, struct btrfs_path *path,
 			extent_info->file_offset += replace_len;
 		}
 
-		cur_offset = drop_end;
+		cur_offset = drop_args.drop_end;
 
 		ret = btrfs_update_inode(trans, root, inode);
 		if (ret)
@@ -2776,25 +2781,27 @@ int btrfs_replace_file_extents(struct inode *inode, struct btrfs_path *path,
 	 * will not record the existence of the hole region
 	 * [existing_hole_start, lockend].
 	 */
-	if (drop_end <= end)
-		drop_end = end + 1;
+	if (drop_args.drop_end <= end)
+		drop_args.drop_end = end + 1;
 	/*
 	 * Don't insert file hole extent item if it's for a range beyond eof
 	 * (because it's useless) or if it represents a 0 bytes range (when
 	 * cur_offset == drop_end).
 	 */
-	if (!extent_info && cur_offset < ino_size && cur_offset < drop_end) {
+	if (!extent_info && cur_offset < ino_size &&
+	    cur_offset < drop_args.drop_end) {
 		ret = fill_holes(trans, BTRFS_I(inode), path,
-				cur_offset, drop_end);
+				 cur_offset, drop_args.drop_end);
 		if (ret) {
 			/* Same comment as above. */
 			btrfs_abort_transaction(trans, ret);
 			goto out_trans;
 		}
-	} else if (!extent_info && cur_offset < drop_end) {
+	} else if (!extent_info && cur_offset < drop_args.drop_end) {
 		/* See the comment in the loop above for the reasoning here. */
 		ret = btrfs_inode_clear_file_extent_range(BTRFS_I(inode),
-					cur_offset, drop_end - cur_offset);
+					cur_offset,
+					drop_args.drop_end - cur_offset);
 		if (ret) {
 			btrfs_abort_transaction(trans, ret);
 			goto out_trans;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 2fb3d2ea75fd..58f667269e7e 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -202,7 +202,7 @@ static int btrfs_init_inode_security(struct btrfs_trans_handle *trans,
  * no overlapping inline items exist in the btree
  */
 static int insert_inline_extent(struct btrfs_trans_handle *trans,
-				struct btrfs_path *path, int extent_inserted,
+				struct btrfs_path *path, bool extent_inserted,
 				struct btrfs_root *root, struct inode *inode,
 				u64 start, size_t size, size_t compressed_size,
 				int compress_type,
@@ -317,6 +317,7 @@ static noinline int cow_file_range_inline(struct btrfs_inode *inode, u64 start,
 					  int compress_type,
 					  struct page **compressed_pages)
 {
+	struct btrfs_drop_extents_args drop_args = { 0 };
 	struct btrfs_root *root = inode->root;
 	struct btrfs_fs_info *fs_info = root->fs_info;
 	struct btrfs_trans_handle *trans;
@@ -327,8 +328,6 @@ static noinline int cow_file_range_inline(struct btrfs_inode *inode, u64 start,
 	u64 data_len = inline_len;
 	int ret;
 	struct btrfs_path *path;
-	int extent_inserted = 0;
-	u32 extent_item_size;
 
 	if (compressed_size)
 		data_len = compressed_size;
@@ -354,16 +353,20 @@ static noinline int cow_file_range_inline(struct btrfs_inode *inode, u64 start,
 	}
 	trans->block_rsv = &inode->block_rsv;
 
+	drop_args.path = path;
+	drop_args.start = start;
+	drop_args.end = aligned_end;
+	drop_args.drop_cache = true;
+	drop_args.replace_extent = true;
+
 	if (compressed_size && compressed_pages)
-		extent_item_size = btrfs_file_extent_calc_inline_size(
+		drop_args.extent_item_size = btrfs_file_extent_calc_inline_size(
 		   compressed_size);
 	else
-		extent_item_size = btrfs_file_extent_calc_inline_size(
+		drop_args.extent_item_size = btrfs_file_extent_calc_inline_size(
 		    inline_len);
 
-	ret = __btrfs_drop_extents(trans, root, inode, path, start, aligned_end,
-				   NULL, 1, 1, extent_item_size,
-				   &extent_inserted);
+	ret = btrfs_drop_extents(trans, root, inode, &drop_args);
 	if (ret) {
 		btrfs_abort_transaction(trans, ret);
 		goto out;
@@ -371,7 +374,7 @@ static noinline int cow_file_range_inline(struct btrfs_inode *inode, u64 start,
 
 	if (isize > actual_end)
 		inline_len = min_t(u64, isize, actual_end);
-	ret = insert_inline_extent(trans, path, extent_inserted,
+	ret = insert_inline_extent(trans, path, drop_args.extent_inserted,
 				   root, &inode->vfs_inode, start,
 				   inline_len, compressed_size,
 				   compress_type, compressed_pages);
@@ -2568,7 +2571,7 @@ static int insert_reserved_file_extent(struct btrfs_trans_handle *trans,
 	u64 disk_bytenr = btrfs_stack_file_extent_disk_bytenr(stack_fi);
 	u64 num_bytes = btrfs_stack_file_extent_num_bytes(stack_fi);
 	u64 ram_bytes = btrfs_stack_file_extent_ram_bytes(stack_fi);
-	int extent_inserted = 0;
+	struct btrfs_drop_extents_args drop_args = { 0 };
 	int ret;
 
 	path = btrfs_alloc_path();
@@ -2584,13 +2587,16 @@ static int insert_reserved_file_extent(struct btrfs_trans_handle *trans,
 	 * the caller is expected to unpin it and allow it to be merged
 	 * with the others.
 	 */
-	ret = __btrfs_drop_extents(trans, root, inode, path, file_pos,
-				   file_pos + num_bytes, NULL, 0,
-				   1, sizeof(*stack_fi), &extent_inserted);
+	drop_args.path = path;
+	drop_args.start = file_pos;
+	drop_args.end = file_pos + num_bytes;
+	drop_args.replace_extent = true;
+	drop_args.extent_item_size = sizeof(*stack_fi);
+	ret = btrfs_drop_extents(trans, root, inode, &drop_args);
 	if (ret)
 		goto out;
 
-	if (!extent_inserted) {
+	if (!drop_args.extent_inserted) {
 		ins.objectid = btrfs_ino(inode);
 		ins.offset = file_pos;
 		ins.type = BTRFS_EXTENT_DATA_KEY;
@@ -4752,6 +4758,7 @@ static int maybe_insert_hole(struct btrfs_root *root, struct inode *inode,
 {
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	struct btrfs_trans_handle *trans;
+	struct btrfs_drop_extents_args drop_args = { 0 };
 	int ret;
 
 	/*
@@ -4774,7 +4781,11 @@ static int maybe_insert_hole(struct btrfs_root *root, struct inode *inode,
 	if (IS_ERR(trans))
 		return PTR_ERR(trans);
 
-	ret = btrfs_drop_extents(trans, root, inode, offset, offset + len, 1);
+	drop_args.start = offset;
+	drop_args.end = offset + len;
+	drop_args.drop_cache = true;
+
+	ret = btrfs_drop_extents(trans, root, BTRFS_I(inode), &drop_args);
 	if (ret) {
 		btrfs_abort_transaction(trans, ret);
 		btrfs_end_transaction(trans);
diff --git a/fs/btrfs/reflink.c b/fs/btrfs/reflink.c
index 99aa87c08912..e65f8c9f8abb 100644
--- a/fs/btrfs/reflink.c
+++ b/fs/btrfs/reflink.c
@@ -163,6 +163,7 @@ static int clone_copy_inline_extent(struct inode *dst,
 	const u64 aligned_end = ALIGN(new_key->offset + datal,
 				      fs_info->sectorsize);
 	struct btrfs_trans_handle *trans = NULL;
+	struct btrfs_drop_extents_args drop_args = { 0 };
 	int ret;
 	struct btrfs_key key;
 
@@ -252,7 +253,11 @@ static int clone_copy_inline_extent(struct inode *dst,
 		trans = NULL;
 		goto out;
 	}
-	ret = btrfs_drop_extents(trans, root, dst, drop_start, aligned_end, 1);
+	drop_args.path = path;
+	drop_args.start = drop_start;
+	drop_args.end = aligned_end;
+	drop_args.drop_cache = true;
+	ret = btrfs_drop_extents(trans, root, BTRFS_I(dst), &drop_args);
 	if (ret)
 		goto out;
 	ret = btrfs_insert_empty_item(trans, root, path, new_key, size);
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 135cb40295c1..3d142114aa15 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -576,6 +576,7 @@ static noinline int replay_one_extent(struct btrfs_trans_handle *trans,
 				      struct extent_buffer *eb, int slot,
 				      struct btrfs_key *key)
 {
+	struct btrfs_drop_extents_args drop_args = { 0 };
 	struct btrfs_fs_info *fs_info = root->fs_info;
 	int found_type;
 	u64 extent_end;
@@ -653,7 +654,10 @@ static noinline int replay_one_extent(struct btrfs_trans_handle *trans,
 	btrfs_release_path(path);
 
 	/* drop any overlapping extents */
-	ret = btrfs_drop_extents(trans, root, inode, start, extent_end, 1);
+	drop_args.start = start;
+	drop_args.end = extent_end;
+	drop_args.drop_cache = true;
+	ret = btrfs_drop_extents(trans, root, BTRFS_I(inode), &drop_args);
 	if (ret)
 		goto out;
 
@@ -2576,6 +2580,7 @@ static int replay_one_buffer(struct btrfs_root *log, struct extent_buffer *eb,
 			 * those prealloc extents just after replaying them.
 			 */
 			if (S_ISREG(mode)) {
+				struct btrfs_drop_extents_args drop_args = { 0 };
 				struct inode *inode;
 				u64 from;
 
@@ -2586,8 +2591,12 @@ static int replay_one_buffer(struct btrfs_root *log, struct extent_buffer *eb,
 				}
 				from = ALIGN(i_size_read(inode),
 					     root->fs_info->sectorsize);
-				ret = btrfs_drop_extents(wc->trans, root, inode,
-							 from, (u64)-1, 1);
+				drop_args.start = from;
+				drop_args.end = (u64)-1;
+				drop_args.drop_cache = true;
+				ret = btrfs_drop_extents(wc->trans, root,
+							 BTRFS_I(inode),
+							 &drop_args);
 				if (!ret) {
 					/* Update the inode's nbytes. */
 					ret = btrfs_update_inode(wc->trans,
@@ -4186,6 +4195,7 @@ static int log_one_extent(struct btrfs_trans_handle *trans,
 			  struct btrfs_path *path,
 			  struct btrfs_log_ctx *ctx)
 {
+	struct btrfs_drop_extents_args drop_args = { 0 };
 	struct btrfs_root *log = root->log_root;
 	struct btrfs_file_extent_item *fi;
 	struct extent_buffer *leaf;
@@ -4194,19 +4204,21 @@ static int log_one_extent(struct btrfs_trans_handle *trans,
 	u64 extent_offset = em->start - em->orig_start;
 	u64 block_len;
 	int ret;
-	int extent_inserted = 0;
 
 	ret = log_extent_csums(trans, inode, log, em, ctx);
 	if (ret)
 		return ret;
 
-	ret = __btrfs_drop_extents(trans, log, inode, path, em->start,
-				   em->start + em->len, NULL, 0, 1,
-				   sizeof(*fi), &extent_inserted);
+	drop_args.path = path;
+	drop_args.start = em->start;
+	drop_args.end = em->start + em->len;
+	drop_args.replace_extent = true;
+	drop_args.extent_item_size = sizeof(*fi);
+	ret = btrfs_drop_extents(trans, log, inode, &drop_args);
 	if (ret)
 		return ret;
 
-	if (!extent_inserted) {
+	if (!drop_args.extent_inserted) {
 		key.objectid = btrfs_ino(inode);
 		key.type = BTRFS_EXTENT_DATA_KEY;
 		key.offset = em->start;
-- 
2.28.0



* [PATCH 3/4] btrfs: fix race when defragging that leads to unnecessary IO
  2020-11-04 11:07 [PATCH 0/4] btrfs: fix cases of stat(2) reporting incorrect number of used blocks fdmanana
  2020-11-04 11:07 ` [PATCH 1/4] btrfs: fix missing delalloc new bit for new delalloc ranges fdmanana
  2020-11-04 11:07 ` [PATCH 2/4] btrfs: refactor btrfs_drop_extents() to make it easier to extend fdmanana
@ 2020-11-04 11:07 ` fdmanana
  2020-11-05 18:44   ` Josef Bacik
  2020-11-04 11:07 ` [PATCH 4/4] btrfs: update the number of bytes used by an inode atomically fdmanana
  2020-11-10 21:46 ` [PATCH 0/4] btrfs: fix cases of stat(2) reporting incorrect number of used blocks David Sterba
  4 siblings, 1 reply; 12+ messages in thread
From: fdmanana @ 2020-11-04 11:07 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

When defragging we skip ranges that have holes or inline extents, so that
we don't do unnecessary IO and waste space. We do this check when calling
should_defrag_range() from btrfs_defrag_file(), but without holding the
inode's lock. We do it that way to avoid blocking, for too long, other
tasks that may want to operate on other file ranges, since after the call
to should_defrag_range() and before locking the inode we trigger a
synchronous page cache readahead. However, by the time we are able to lock
the inode, some other task might have punched a hole in our range, or we
may now have an inline extent there, in which case we should no longer set
the range for defrag, since that would cause unnecessary IO and waste space
(i.e. allocating extents to contain zeros for a hole).

So after locking the inode and the range in the iotree, check again if we
have holes or an inline extent, and if we do, just skip the range.

I hit this while testing my next patch, which fixes races when updating an
inode's number of bytes (subject "btrfs: update the number of bytes used
by an inode atomically"), and it depends on this change in order to work
correctly. Alternatively I could rework that other patch to detect holes
and flag their range with the 'new delalloc' bit, but this change itself
fixes an efficiency problem caused by a race that, from a functional point
of view, is not harmful (it could be triggered with btrfs/062 from fstests).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/ioctl.c | 39 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 39 insertions(+)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 69a384145dc6..7c44aa277e93 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -1275,6 +1275,7 @@ static int cluster_pages_for_defrag(struct inode *inode,
 	u64 page_end;
 	u64 page_cnt;
 	u64 start = (u64)start_index << PAGE_SHIFT;
+	u64 search_start;
 	int ret;
 	int i;
 	int i_done;
@@ -1371,6 +1372,40 @@ static int cluster_pages_for_defrag(struct inode *inode,
 
 	lock_extent_bits(&BTRFS_I(inode)->io_tree,
 			 page_start, page_end - 1, &cached_state);
+
+	/*
+	 * When defragging we skip ranges that have holes or inline extents,
+	 * (check should_defrag_range()), to avoid unnecessary IO and wasting
+	 * space. At btrfs_defrag_file(), we check if a range should be defragged
+	 * before locking the inode and then, if it should, we trigger a sync
+	 * page cache readahead - we lock the inode only after that to avoid
+	 * blocking for too long other tasks that possibly want to operate on
+	 * other file ranges. But before we were able to get the inode lock,
+	 * some other task may have punched a hole in the range, or we may have
+	 * now an inline extent, in which case we should not defrag. So check
+	 * for that here, where we have the inode and the range locked, and bail
+	 * out if that happened.
+	 */
+	search_start = page_start;
+	while (search_start < page_end) {
+		struct extent_map *em;
+
+		em = btrfs_get_extent(BTRFS_I(inode), NULL, 0, search_start,
+				      page_end - search_start);
+		if (IS_ERR(em)) {
+			ret = PTR_ERR(em);
+			goto out_unlock_range;
+		}
+		if (em->block_start >= EXTENT_MAP_LAST_BYTE) {
+			free_extent_map(em);
+			/* Ok, 0 means we did not defrag anything. */
+			ret = 0;
+			goto out_unlock_range;
+		}
+		search_start = extent_map_end(em);
+		free_extent_map(em);
+	}
+
 	clear_extent_bit(&BTRFS_I(inode)->io_tree, page_start,
 			  page_end - 1, EXTENT_DELALLOC | EXTENT_DO_ACCOUNTING |
 			  EXTENT_DEFRAG, 0, 0, &cached_state);
@@ -1401,6 +1436,10 @@ static int cluster_pages_for_defrag(struct inode *inode,
 	btrfs_delalloc_release_extents(BTRFS_I(inode), page_cnt << PAGE_SHIFT);
 	extent_changeset_free(data_reserved);
 	return i_done;
+
+out_unlock_range:
+	unlock_extent_cached(&BTRFS_I(inode)->io_tree,
+			     page_start, page_end - 1, &cached_state);
 out:
 	for (i = 0; i < i_done; i++) {
 		unlock_page(pages[i]);
-- 
2.28.0



* [PATCH 4/4] btrfs: update the number of bytes used by an inode atomically
  2020-11-04 11:07 [PATCH 0/4] btrfs: fix cases of stat(2) reporting incorrect number of used blocks fdmanana
                   ` (2 preceding siblings ...)
  2020-11-04 11:07 ` [PATCH 3/4] btrfs: fix race when defragging that leads to unnecessary IO fdmanana
@ 2020-11-04 11:07 ` fdmanana
  2020-11-05 19:24   ` Josef Bacik
  2020-11-09 10:34   ` Nikolay Borisov
  2020-11-10 21:46 ` [PATCH 0/4] btrfs: fix cases of stat(2) reporting incorrect number of used blocks David Sterba
  4 siblings, 2 replies; 12+ messages in thread
From: fdmanana @ 2020-11-04 11:07 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

There are several occasions where we do not update the inode's number of
used bytes atomically, which can result in a concurrent stat(2) syscall
reporting a number of used blocks that is not a valid value, that is, a
value that matches neither what we had before the operation nor what we
get after the operation completes.

In extreme cases it can result in stat(2) reporting zero used blocks, which
can cause problems for some userspace tools that may consider a file with a
non-zero size and zero used blocks as completely sparse and skip reading its
data, as reported/discussed a long time ago in some threads like the
following:

  https://lists.gnu.org/archive/html/bug-tar/2016-07/msg00001.html

The cases where this can happen are the following:

-> Case 1

If we do a write (buffered or direct IO) against a file region for which
there is already an allocated extent (or multiple extents), then we have a
short time window where we can report a number of used blocks to stat(2)
that does not take into account the file region being overwritten. This
short time window happens when completing the ordered extent(s).

This happens because when we drop the extents in the write range we
decrement the inode's number of bytes and later on when we insert the new
extent(s) we increment the number of bytes in the inode, resulting in a
short time window where a stat(2) syscall can get an incorrect number of
used blocks.

If we do writes that overwrite an entire file, then we have a short time
window where we report 0 used blocks to stat(2).

Example reproducer:

  $ cat reproducer-1.sh
  #!/bin/bash

  MNT=/mnt/sdi
  DEV=/dev/sdi

  stat_loop()
  {
      trap "wait; exit" SIGTERM
      local filepath=$1
      local expected=$2
      local got

      while :; do
          got=$(stat -c %b $filepath)
          if [ $got -ne $expected ]; then
             echo -n "ERROR: unexpected used blocks"
             echo " (got: $got expected: $expected)"
          fi
      done
  }

  mkfs.btrfs -f $DEV > /dev/null
  # mkfs.xfs -f $DEV > /dev/null
  # mkfs.ext4 -F $DEV > /dev/null
  # mkfs.f2fs -f $DEV > /dev/null
  # mkfs.reiserfs -f $DEV > /dev/null
  mount $DEV $MNT

  xfs_io -f -s -c "pwrite -b 64K 0 64K" $MNT/foobar >/dev/null
  expected=$(stat -c %b $MNT/foobar)

  # Create a process to keep calling stat(2) on the file and see if the
  # reported number of blocks used (disk space used) changes, it should
  # not because we are not increasing the file size nor punching holes.
  stat_loop $MNT/foobar $expected &
  loop_pid=$!

  for ((i = 0; i < 50000; i++)); do
      xfs_io -s -c "pwrite -b 64K 0 64K" $MNT/foobar >/dev/null
  done

  kill $loop_pid &> /dev/null
  wait

  umount $DEV

  $ ./reproducer-1.sh
  ERROR: unexpected used blocks (got: 0 expected: 128)
  ERROR: unexpected used blocks (got: 0 expected: 128)
  (...)

Note that since this is a short time window where the race can happen, the
reproducer may not always be able to trigger the bug in one run, or it may
trigger it multiple times.

-> Case 2

If we do a buffered write against a file region that does not have any
allocated extents, like a hole or beyond eof, then during ordered extent
completion we have a short time window where a concurrent stat(2) syscall
can report a number of used blocks that does not correspond to the value
before or after the write operation, a value that is actually larger than
the value after the write completes.

This happens because once we start a buffered write into an unallocated
file range we increment the inode's 'new_delalloc_bytes', to make sure
any stat(2) call gets a correct used blocks value before delalloc is
flushed and completes. However, at ordered extent completion, after we
have inserted the new extent, we increment the inode's number of bytes
used by the size of the new extent, and only later, when clearing the
range in the inode's iotree, do we decrement the inode's
'new_delalloc_bytes' counter by the size of the extent. This results in a
short time window where a concurrent stat(2) syscall can report a number
of used blocks that accounts for the new extent twice.

Example reproducer:

  $ cat reproducer-2.sh
  #!/bin/bash

  MNT=/mnt/sdi
  DEV=/dev/sdi

  stat_loop()
  {
      trap "wait; exit" SIGTERM
      local filepath=$1
      local expected=$2
      local got

      while :; do
          got=$(stat -c %b $filepath)
          if [ $got -ne $expected ]; then
              echo -n "ERROR: unexpected used blocks"
              echo " (got: $got expected: $expected)"
          fi
      done
  }

  mkfs.btrfs -f $DEV > /dev/null
  # mkfs.xfs -f $DEV > /dev/null
  # mkfs.ext4 -F $DEV > /dev/null
  # mkfs.f2fs -f $DEV > /dev/null
  # mkfs.reiserfs -f $DEV > /dev/null
  mount $DEV $MNT

  touch $MNT/foobar
  write_size=$((64 * 1024))
  for ((i = 0; i < 16384; i++)); do
     offset=$(($i * $write_size))
     xfs_io -c "pwrite -S 0xab $offset $write_size" $MNT/foobar >/dev/null
     blocks_used=$(stat -c %b $MNT/foobar)

     # Fsync the file to trigger writeback and keep calling stat(2) on it
     # to see if the number of blocks used changes.
     stat_loop $MNT/foobar $blocks_used &
     loop_pid=$!
     xfs_io -c "fsync" $MNT/foobar

     kill $loop_pid &> /dev/null
     wait $loop_pid
  done

  umount $DEV

  $ ./reproducer-2.sh
  ERROR: unexpected used blocks (got: 265472 expected: 265344)
  ERROR: unexpected used blocks (got: 284032 expected: 283904)
  (...)

Note that since this is a short time window where the race can happen, the
reproducer may not always be able to trigger the bug in one run, or it may
trigger it multiple times.

-> Case 3

Another case where such problems happen is during other operations that
replace extents in a file range with other extents. Those operations are
extent cloning, deduplication and fallocate's zero range operation.

The cause of the problem is similar to the first case. When we drop the
extents from a range, we decrement the inode's number of bytes, and later
on, after inserting the new extents, we increment it. Since this is not
done atomically, a concurrent stat(2) call can see and return a number of
used blocks that is smaller than it should be and that matches neither the
number of used blocks before nor after the clone/deduplication/zero
operation.

Like for the first case, when doing a clone, deduplication or zero range
operation against an entire file, we end up having a time window where we
can report 0 used blocks to a stat(2) call.

Example reproducer:

  $ cat reproducer-3.sh
  #!/bin/bash

  MNT=/mnt/sdi
  DEV=/dev/sdi

  mkfs.btrfs -f $DEV > /dev/null
  # mkfs.xfs -f -m reflink=1 $DEV > /dev/null
  mount $DEV $MNT

  extent_size=$((64 * 1024))
  num_extents=16384
  file_size=$(($extent_size * $num_extents))

  # File foo has many small extents.
  xfs_io -f -s -c "pwrite -S 0xab -b $extent_size 0 $file_size" $MNT/foo \
      > /dev/null
  # File bar has much less extents and has exactly the same data as foo.
  xfs_io -f -c "pwrite -S 0xab 0 $file_size" $MNT/bar > /dev/null

  expected=$(stat -c %b $MNT/foo)

  # Now deduplicate bar into foo. While the deduplication is in progress,
  # the number of used blocks/file size reported by stat should not change.
  xfs_io -c "dedupe $MNT/bar 0 0 $file_size" $MNT/foo > /dev/null  &
  dedupe_pid=$!
  while [ -n "$(ps -p $dedupe_pid -o pid=)" ]; do
      used=$(stat -c %b $MNT/foo)
      if [ $used -ne $expected ]; then
          echo "Unexpected blocks used: $used (expected: $expected)"
      fi
  done

  umount $DEV

  $ ./reproducer-3.sh
  Unexpected blocks used: 2076800 (expected: 2097152)
  Unexpected blocks used: 2097024 (expected: 2097152)
  Unexpected blocks used: 2079872 (expected: 2097152)
  (...)

Note that since this is a short time window where the race can happen, the
reproducer may not always be able to trigger the bug in one run, or it may
trigger it multiple times.

So fix this by:

1) Making btrfs_drop_extents() not decrement the vfs inode's number of
   bytes, and instead return, through the new 'bytes_found' field of its
   arguments structure, the number of allocated bytes it found in the
   dropped range;

2) Making any code that drops extents and adds new extents update the
   inode's number of bytes atomically, while holding the btrfs inode's
   spinlock, which is also used by the stat(2) callback to get the inode's
   number of bytes;

3) For ranges in the inode's iotree that are marked as 'delalloc new',
   corresponding to previously unallocated ranges, increment the inode's
   number of bytes when clearing the 'delalloc new' bit from the range,
   in the same critical section that decrements the inode's
   'new_delalloc_bytes' counter, delimited by the btrfs inode's spinlock.

An alternative would be to have btrfs_getattr() wait for any IO (ordered
extents in progress) and lock the whole range (0 to (u64)-1) while it
computes the number of blocks used. But that would mean blocking stat(2),
which is a very frequently used syscall that is expected to be fast, waiting
for writes, clone/dedupe, fallocate, page reads, fiemap, etc.
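
To illustrate point 2), a rough sketch of what the helper declared in
ctree.h by this patch (btrfs_update_inode_bytes()) is meant to do. The
actual implementation lives in the inode.c changes of this patch (not all
of which are quoted below), so take this only as an illustration of the
locking scheme and not as the literal code:

  void btrfs_update_inode_bytes(struct btrfs_inode *inode,
                                const u64 add_bytes,
                                const u64 del_bytes)
  {
          if (add_bytes == del_bytes)
                  return;

          /*
           * The btrfs inode's spinlock is also taken by the stat(2)
           * callback (btrfs_getattr()) when computing the number of used
           * blocks, so a concurrent stat(2) never observes a state where
           * only one of the two updates has been applied.
           */
          spin_lock(&inode->lock);
          if (del_bytes > 0)
                  inode_sub_bytes(&inode->vfs_inode, del_bytes);
          if (add_bytes > 0)
                  inode_add_bytes(&inode->vfs_inode, add_bytes);
          spin_unlock(&inode->lock);
  }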

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/btrfs_inode.h    |   3 +-
 fs/btrfs/ctree.h          |   8 +++
 fs/btrfs/extent-io-tree.h |  16 ++++-
 fs/btrfs/extent_io.c      |  10 +--
 fs/btrfs/file.c           |  43 +++++++-----
 fs/btrfs/inode.c          | 135 ++++++++++++++++++++++++++++++--------
 fs/btrfs/reflink.c        |   2 +-
 fs/btrfs/tree-log.c       |   4 +-
 8 files changed, 171 insertions(+), 50 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 8494f62f8aa4..34f5fc318fc1 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -50,7 +50,8 @@ struct btrfs_inode {
 	/*
 	 * Lock for counters and all fields used to determine if the inode is in
 	 * the log or not (last_trans, last_sub_trans, last_log_commit,
-	 * logged_trans).
+	 * logged_trans), to access/update new_delalloc_bytes and to update the
+	 * vfs' inode number of bytes used.
 	 */
 	spinlock_t lock;
 
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index c3b2e7f2d772..305cca7f3497 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1270,6 +1270,11 @@ struct btrfs_drop_extents_args {
 	 * set even if btrfs_drop_extents() returns an error.
 	 */
 	u64 drop_end;
+	/*
+	 * The number of allocated bytes found in the range. This can be smaller
+	 * than the range's length when there are holes in the range.
+	 */
+	u64 bytes_found;
 	/*
 	 * Only set if 'replace_extent' is true. Set to true if we were able
 	 * to insert a replacement extent after dropping all extents in the
@@ -3139,6 +3144,9 @@ extern const struct iomap_dio_ops btrfs_dio_ops;
 
 int btrfs_inode_lock(struct inode *inode, unsigned int ilock_flags);
 void btrfs_inode_unlock(struct inode *inode, unsigned int ilock_flags);
+void btrfs_update_inode_bytes(struct btrfs_inode *inode,
+			      const u64 add_bytes,
+			      const u64 del_bytes);
 
 /* ioctl.c */
 long btrfs_ioctl(struct file *file, unsigned int cmd, unsigned long arg);
diff --git a/fs/btrfs/extent-io-tree.h b/fs/btrfs/extent-io-tree.h
index cab4273ff8d3..754f71423d42 100644
--- a/fs/btrfs/extent-io-tree.h
+++ b/fs/btrfs/extent-io-tree.h
@@ -21,10 +21,24 @@ struct io_failure_record;
 #define EXTENT_NORESERVE	(1U << 11)
 #define EXTENT_QGROUP_RESERVED	(1U << 12)
 #define EXTENT_CLEAR_DATA_RESV	(1U << 13)
+/*
+ * Must be cleared only during ordered extent completion or on error paths if we
+ * did not manage to submit bios and create the ordered extents for the range.
+ * Should not be cleared during page release and page invalidation (if there is
+ * an ordered extent in flight), that is left for the ordered extent completion.
+ */
 #define EXTENT_DELALLOC_NEW	(1U << 14)
+/*
+ * When an ordered extent successfully completes for a region marked as a new
+ * delalloc range, use this flag when clearing a new delalloc range to indicate
+ * that the vfs' inode number of bytes should be incremented and the inode's new
+ * delalloc bytes decremented, in an atomic way to prevent races with stat(2).
+ */
+#define EXTENT_ADD_INODE_BYTES  (1U << 15)
 #define EXTENT_DO_ACCOUNTING    (EXTENT_CLEAR_META_RESV | \
 				 EXTENT_CLEAR_DATA_RESV)
-#define EXTENT_CTLBITS		(EXTENT_DO_ACCOUNTING)
+#define EXTENT_CTLBITS		(EXTENT_DO_ACCOUNTING | \
+				 EXTENT_ADD_INODE_BYTES)
 
 /*
  * Redefined bits above which are used only in the device allocation tree,
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index f3515d3c1321..e837063d0c15 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4414,12 +4414,14 @@ static int try_release_extent_state(struct extent_io_tree *tree,
 		ret = 0;
 	} else {
 		/*
-		 * at this point we can safely clear everything except the
-		 * locked bit and the nodatasum bit
+		 * At this point we can safely clear everything except the
+		 * locked bit, the nodatasum bit and the delalloc new bit.
+		 * The delalloc new bit will be cleared by ordered extent
+		 * completion.
 		 */
 		ret = __clear_extent_bit(tree, start, end,
-				 ~(EXTENT_LOCKED | EXTENT_NODATASUM),
-				 0, 0, NULL, mask, NULL);
+			 ~(EXTENT_LOCKED | EXTENT_NODATASUM | EXTENT_DELALLOC_NEW),
+			 0, 0, NULL, mask, NULL);
 
 		/* if clear_extent_bit failed for enomem reasons,
 		 * we can't allow the release to continue.
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 68b96c6c9c4b..cacfddb3ff7b 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -677,6 +677,12 @@ void btrfs_drop_extent_cache(struct btrfs_inode *inode, u64 start, u64 end,
  * If an extent intersects the range but is not entirely inside the range
  * it is either truncated or split.  Anything entirely inside the range
  * is deleted from the tree.
+ *
+ * Note: the vfs' inode number of bytes is not updated; it's up to the
+ * caller to deal with that. We set the field 'bytes_found' of the arguments
+ * structure with the number of allocated bytes found in the target range,
+ * so that the caller can update the inode's number of bytes in an atomic
+ * way when replacing extents in a range to avoid races with stat(2).
  */
 int btrfs_drop_extents(struct btrfs_trans_handle *trans,
 		       struct btrfs_root *root, struct btrfs_inode *inode,
@@ -688,7 +694,6 @@ int btrfs_drop_extents(struct btrfs_trans_handle *trans,
 	struct btrfs_ref ref = { 0 };
 	struct btrfs_key key;
 	struct btrfs_key new_key;
-	struct inode *vfs_inode = &inode->vfs_inode;
 	u64 ino = btrfs_ino(inode);
 	u64 search_start = args->start;
 	u64 disk_bytenr = 0;
@@ -707,6 +712,7 @@ int btrfs_drop_extents(struct btrfs_trans_handle *trans,
 	int leafs_visited = 0;
 	struct btrfs_path *path = args->path;
 
+	args->bytes_found = 0;
 	args->extent_inserted = false;
 
 	/* Must always have a path if ->replace_extent is true. */
@@ -894,8 +900,7 @@ int btrfs_drop_extents(struct btrfs_trans_handle *trans,
 							extent_end - args->end);
 			btrfs_mark_buffer_dirty(leaf);
 			if (update_refs && disk_bytenr > 0)
-				inode_sub_bytes(vfs_inode,
-						args->end - key.offset);
+				args->bytes_found += args->end - key.offset;
 			break;
 		}
 
@@ -915,8 +920,7 @@ int btrfs_drop_extents(struct btrfs_trans_handle *trans,
 							args->start - key.offset);
 			btrfs_mark_buffer_dirty(leaf);
 			if (update_refs && disk_bytenr > 0)
-				inode_sub_bytes(vfs_inode,
-						extent_end - args->start);
+				args->bytes_found += extent_end - args->start;
 			if (args->end == extent_end)
 				break;
 
@@ -940,8 +944,7 @@ int btrfs_drop_extents(struct btrfs_trans_handle *trans,
 
 			if (update_refs &&
 			    extent_type == BTRFS_FILE_EXTENT_INLINE) {
-				inode_sub_bytes(vfs_inode,
-						extent_end - key.offset);
+				args->bytes_found += extent_end - key.offset;
 				extent_end = ALIGN(extent_end,
 						   fs_info->sectorsize);
 			} else if (update_refs && disk_bytenr > 0) {
@@ -954,8 +957,7 @@ int btrfs_drop_extents(struct btrfs_trans_handle *trans,
 						key.offset - extent_offset);
 				ret = btrfs_free_extent(trans, &ref);
 				BUG_ON(ret); /* -ENOMEM */
-				inode_sub_bytes(vfs_inode,
-						extent_end - key.offset);
+				args->bytes_found += extent_end - key.offset;
 			}
 
 			if (args->end == extent_end)
@@ -2512,7 +2514,8 @@ static int btrfs_insert_replace_extent(struct btrfs_trans_handle *trans,
 				     struct inode *inode,
 				     struct btrfs_path *path,
 				     struct btrfs_replace_extent_info *extent_info,
-				     const u64 replace_len)
+				     const u64 replace_len,
+				     const u64 bytes_to_drop)
 {
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	struct btrfs_root *root = BTRFS_I(inode)->root;
@@ -2527,8 +2530,10 @@ static int btrfs_insert_replace_extent(struct btrfs_trans_handle *trans,
 		return 0;
 
 	if (extent_info->disk_offset == 0 &&
-	    btrfs_fs_incompat(fs_info, NO_HOLES))
+	    btrfs_fs_incompat(fs_info, NO_HOLES)) {
+		btrfs_update_inode_bytes(BTRFS_I(inode), 0, bytes_to_drop);
 		return 0;
+	}
 
 	key.objectid = btrfs_ino(BTRFS_I(inode));
 	key.type = BTRFS_EXTENT_DATA_KEY;
@@ -2557,10 +2562,12 @@ static int btrfs_insert_replace_extent(struct btrfs_trans_handle *trans,
 		return ret;
 
 	/* If it's a hole, nothing more needs to be done. */
-	if (extent_info->disk_offset == 0)
+	if (extent_info->disk_offset == 0) {
+		btrfs_update_inode_bytes(BTRFS_I(inode), 0, bytes_to_drop);
 		return 0;
+	}
 
-	inode_add_bytes(inode, replace_len);
+	btrfs_update_inode_bytes(BTRFS_I(inode), replace_len, bytes_to_drop);
 
 	if (extent_info->is_new_extent && extent_info->insertions == 0) {
 		key.objectid = extent_info->disk_offset;
@@ -2655,6 +2662,10 @@ int btrfs_replace_file_extents(struct inode *inode, struct btrfs_path *path,
 	while (cur_offset < end) {
 		drop_args.start = cur_offset;
 		ret = btrfs_drop_extents(trans, root, BTRFS_I(inode), &drop_args);
+		/* If we are punching a hole, decrement the inode's byte count. */
+		if (!extent_info)
+			btrfs_update_inode_bytes(BTRFS_I(inode), 0,
+						 drop_args.bytes_found);
 		if (ret != -ENOSPC) {
 			/*
 			 * When cloning we want to avoid transaction aborts when
@@ -2712,7 +2723,8 @@ int btrfs_replace_file_extents(struct inode *inode, struct btrfs_path *path,
 				extent_info->file_offset;
 
 			ret = btrfs_insert_replace_extent(trans, inode, path,
-							extent_info, replace_len);
+							extent_info, replace_len,
+							drop_args.bytes_found);
 			if (ret) {
 				btrfs_abort_transaction(trans, ret);
 				break;
@@ -2810,7 +2822,8 @@ int btrfs_replace_file_extents(struct inode *inode, struct btrfs_path *path,
 	}
 	if (extent_info) {
 		ret = btrfs_insert_replace_extent(trans, inode, path, extent_info,
-						extent_info->data_len);
+						extent_info->data_len,
+						drop_args.bytes_found);
 		if (ret) {
 			btrfs_abort_transaction(trans, ret);
 			goto out_trans;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 58f667269e7e..2a0f86d4297f 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -223,8 +223,6 @@ static int insert_inline_extent(struct btrfs_trans_handle *trans,
 	if (compressed_size && compressed_pages)
 		cur_size = compressed_size;
 
-	inode_add_bytes(inode, size);
-
 	if (!extent_inserted) {
 		struct btrfs_key key;
 		size_t datasize;
@@ -300,8 +298,6 @@ static int insert_inline_extent(struct btrfs_trans_handle *trans,
 	 * could end up racing with unlink.
 	 */
 	BTRFS_I(inode)->disk_i_size = inode->i_size;
-	ret = btrfs_update_inode(trans, root, inode);
-
 fail:
 	return ret;
 }
@@ -386,6 +382,10 @@ static noinline int cow_file_range_inline(struct btrfs_inode *inode, u64 start,
 		goto out;
 	}
 
+	btrfs_update_inode_bytes(inode, inline_len, drop_args.bytes_found);
+	ret = btrfs_update_inode(trans, root, &inode->vfs_inode);
+	if (ret)
+		goto out;
 	set_bit(BTRFS_INODE_NEEDS_FULL_SYNC, &inode->runtime_flags);
 	btrfs_drop_extent_cache(inode, start, aligned_end - 1, 0);
 out:
@@ -2145,6 +2145,8 @@ void btrfs_clear_delalloc_extent(struct inode *vfs_inode,
 		spin_lock(&inode->lock);
 		ASSERT(inode->new_delalloc_bytes >= len);
 		inode->new_delalloc_bytes -= len;
+		if (*bits & EXTENT_ADD_INODE_BYTES)
+			inode_add_bytes(&inode->vfs_inode, len);
 		spin_unlock(&inode->lock);
 	}
 }
@@ -2561,9 +2563,11 @@ int btrfs_writepage_cow_fixup(struct page *page, u64 start, u64 end)
 static int insert_reserved_file_extent(struct btrfs_trans_handle *trans,
 				       struct btrfs_inode *inode, u64 file_pos,
 				       struct btrfs_file_extent_item *stack_fi,
+				       const bool update_inode_bytes,
 				       u64 qgroup_reserved)
 {
 	struct btrfs_root *root = inode->root;
+	const u64 sectorsize = root->fs_info->sectorsize;
 	struct btrfs_path *path;
 	struct extent_buffer *leaf;
 	struct btrfs_key ins;
@@ -2616,7 +2620,24 @@ static int insert_reserved_file_extent(struct btrfs_trans_handle *trans,
 	btrfs_mark_buffer_dirty(leaf);
 	btrfs_release_path(path);
 
-	inode_add_bytes(&inode->vfs_inode, num_bytes);
+	/*
+	 * If we dropped an inline extent here, we know the range where it is
+	 * was not marked with the EXTENT_DELALLOC_NEW bit, so we update the
+	 * number of bytes only for that range containing the inline extent.
+	 * The remainder of the range will be processed when clearing the
+	 * EXTENT_DELALLOC_NEW bit through the ordered extent completion.
+	 */
+	if (file_pos == 0 && !IS_ALIGNED(drop_args.bytes_found, sectorsize)) {
+		u64 inline_size = round_down(drop_args.bytes_found, sectorsize);
+
+		inline_size = drop_args.bytes_found - inline_size;
+		btrfs_update_inode_bytes(inode, sectorsize, inline_size);
+		drop_args.bytes_found -= inline_size;
+		num_bytes -= sectorsize;
+	}
+
+	if (update_inode_bytes)
+		btrfs_update_inode_bytes(inode, num_bytes, drop_args.bytes_found);
 
 	ins.objectid = disk_bytenr;
 	ins.offset = disk_num_bytes;
@@ -2654,6 +2675,7 @@ static int insert_ordered_extent_file_extent(struct btrfs_trans_handle *trans,
 {
 	struct btrfs_file_extent_item stack_fi;
 	u64 logical_len;
+	bool update_inode_bytes;
 
 	memset(&stack_fi, 0, sizeof(stack_fi));
 	btrfs_set_stack_file_extent_type(&stack_fi, BTRFS_FILE_EXTENT_REG);
@@ -2669,9 +2691,18 @@ static int insert_ordered_extent_file_extent(struct btrfs_trans_handle *trans,
 	btrfs_set_stack_file_extent_compression(&stack_fi, oe->compress_type);
 	/* Encryption and other encoding is reserved and all 0 */
 
+	/*
+	 * For delalloc, when completing an ordered extent we update the inode's
+	 * bytes when clearing the range in the inode's io tree, so pass false
+	 * as the argument 'update_inode_bytes' to insert_reserved_file_extent(),
+	 * except if the ordered extent was truncated.
+	 */
+	update_inode_bytes = test_bit(BTRFS_ORDERED_DIRECT, &oe->flags) ||
+		test_bit(BTRFS_ORDERED_TRUNCATED, &oe->flags);
+
 	return insert_reserved_file_extent(trans, BTRFS_I(oe->inode),
 					   oe->file_offset, &stack_fi,
-					   oe->qgroup_rsv);
+					   update_inode_bytes, oe->qgroup_rsv);
 }
 
 /*
@@ -2693,10 +2724,8 @@ static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent)
 	u64 logical_len = ordered_extent->num_bytes;
 	bool freespace_inode;
 	bool truncated = false;
-	bool range_locked = false;
-	bool clear_new_delalloc_bytes = false;
 	bool clear_reserved_extent = true;
-	unsigned int clear_bits;
+	unsigned int clear_bits = EXTENT_DEFRAG;
 
 	start = ordered_extent->file_offset;
 	end = start + ordered_extent->num_bytes - 1;
@@ -2704,7 +2733,7 @@ static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent)
 	if (!test_bit(BTRFS_ORDERED_NOCOW, &ordered_extent->flags) &&
 	    !test_bit(BTRFS_ORDERED_PREALLOC, &ordered_extent->flags) &&
 	    !test_bit(BTRFS_ORDERED_DIRECT, &ordered_extent->flags))
-		clear_new_delalloc_bytes = true;
+		clear_bits |= EXTENT_DELALLOC_NEW;
 
 	freespace_inode = btrfs_is_free_space_inode(BTRFS_I(inode));
 
@@ -2743,7 +2772,7 @@ static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent)
 		goto out;
 	}
 
-	range_locked = true;
+	clear_bits |= EXTENT_LOCKED;
 	lock_extent_bits(io_tree, start, end, &cached_state);
 
 	if (freespace_inode)
@@ -2790,6 +2819,17 @@ static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent)
 		goto out;
 	}
 
+	/*
+	 * If this is a new delalloc range, clear its new delalloc flag to
+	 * update the inode's number of bytes. This needs to be done first
+	 * before updating the inode item.
+	 */
+	if ((clear_bits & EXTENT_DELALLOC_NEW) &&
+	    !test_bit(BTRFS_ORDERED_TRUNCATED, &ordered_extent->flags))
+		clear_extent_bit(&BTRFS_I(inode)->io_tree, start, end,
+				 EXTENT_DELALLOC_NEW | EXTENT_ADD_INODE_BYTES,
+				 0, 0, &cached_state);
+
 	btrfs_inode_safe_disk_i_size_write(inode, 0);
 	ret = btrfs_update_inode_fallback(trans, root, inode);
 	if (ret) { /* -ENOMEM or corruption */
@@ -2798,11 +2838,6 @@ static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent)
 	}
 	ret = 0;
 out:
-	clear_bits = EXTENT_DEFRAG;
-	if (range_locked)
-		clear_bits |= EXTENT_LOCKED;
-	if (clear_new_delalloc_bytes)
-		clear_bits |= EXTENT_DELALLOC_NEW;
 	clear_extent_bit(&BTRFS_I(inode)->io_tree, start, end, clear_bits,
 			 (clear_bits & EXTENT_LOCKED) ? 1 : 0, 0,
 			 &cached_state);
@@ -4794,10 +4829,13 @@ static int maybe_insert_hole(struct btrfs_root *root, struct inode *inode,
 
 	ret = btrfs_insert_file_extent(trans, root, btrfs_ino(BTRFS_I(inode)),
 			offset, 0, 0, len, 0, len, 0, 0, 0);
-	if (ret)
+	if (ret) {
 		btrfs_abort_transaction(trans, ret);
-	else
+	} else {
+		btrfs_update_inode_bytes(BTRFS_I(inode), 0,
+					 drop_args.bytes_found);
 		btrfs_update_inode(trans, root, inode);
+	}
 	btrfs_end_transaction(trans);
 	return ret;
 }
@@ -8121,6 +8159,8 @@ static void btrfs_invalidatepage(struct page *page, unsigned int offset,
 	u64 start;
 	u64 end;
 	int inode_evicting = inode->vfs_inode.i_state & I_FREEING;
+	bool found_ordered = false;
+	bool completed_ordered = false;
 
 	/*
 	 * we have the page locked, so new writeback can't start,
@@ -8142,15 +8182,17 @@ static void btrfs_invalidatepage(struct page *page, unsigned int offset,
 	start = page_start;
 	ordered = btrfs_lookup_ordered_range(inode, start, page_end - start + 1);
 	if (ordered) {
+		found_ordered = true;
 		end = min(page_end,
 			  ordered->file_offset + ordered->num_bytes - 1);
 		/*
-		 * IO on this page will never be started, so we need
-		 * to account for any ordered extents now
+		 * IO on this page will never be started, so we need to account
+		 * for any ordered extents now. Don't clear EXTENT_DELALLOC_NEW
+		 * here; that is left to the ordered extent completion.
 		 */
 		if (!inode_evicting)
 			clear_extent_bit(tree, start, end,
-					 EXTENT_DELALLOC | EXTENT_DELALLOC_NEW |
+					 EXTENT_DELALLOC |
 					 EXTENT_LOCKED | EXTENT_DO_ACCOUNTING |
 					 EXTENT_DEFRAG, 1, 0, &cached_state);
 		/*
@@ -8172,8 +8214,10 @@ static void btrfs_invalidatepage(struct page *page, unsigned int offset,
 
 			if (btrfs_dec_test_ordered_pending(inode, &ordered,
 							   start,
-							   end - start + 1, 1))
+							   end - start + 1, 1)) {
 				btrfs_finish_ordered_io(ordered);
+				completed_ordered = true;
+			}
 		}
 		btrfs_put_ordered_extent(ordered);
 		if (!inode_evicting) {
@@ -8202,10 +8246,23 @@ static void btrfs_invalidatepage(struct page *page, unsigned int offset,
 	 */
 	btrfs_qgroup_free_data(inode, NULL, page_start, PAGE_SIZE);
 	if (!inode_evicting) {
+		bool delete = true;
+
+		/*
+		 * If there's an ordered extent for this range and we have not
+		 * finished it ourselves, we must leave EXTENT_DELALLOC_NEW set
+		 * in the range for the ordered extent completion. We must also
+		 * not delete the range, otherwise we would lose that bit (and
+		 * any other bits set in the range). Make sure EXTENT_UPTODATE
+		 * is cleared if we don't delete, otherwise it can lead to
+		 * corruptions if the i_size is extended later.
+		 */
+		if (found_ordered && !completed_ordered)
+			delete = false;
 		clear_extent_bit(tree, page_start, page_end, EXTENT_LOCKED |
-				 EXTENT_DELALLOC | EXTENT_DELALLOC_NEW |
-				 EXTENT_DO_ACCOUNTING | EXTENT_DEFRAG, 1, 1,
-				 &cached_state);
+				 EXTENT_DELALLOC | EXTENT_UPTODATE |
+				 EXTENT_DO_ACCOUNTING | EXTENT_DEFRAG, 1,
+				 delete, &cached_state);
 
 		__btrfs_releasepage(page, GFP_NOFS);
 	}
@@ -8754,6 +8811,7 @@ static int btrfs_getattr(const struct path *path, struct kstat *stat,
 			 u32 request_mask, unsigned int flags)
 {
 	u64 delalloc_bytes;
+	u64 inode_bytes;
 	struct inode *inode = d_inode(path->dentry);
 	u32 blocksize = inode->i_sb->s_blocksize;
 	u32 bi_flags = BTRFS_I(inode)->flags;
@@ -8780,8 +8838,9 @@ static int btrfs_getattr(const struct path *path, struct kstat *stat,
 
 	spin_lock(&BTRFS_I(inode)->lock);
 	delalloc_bytes = BTRFS_I(inode)->new_delalloc_bytes;
+	inode_bytes = inode_get_bytes(inode);
 	spin_unlock(&BTRFS_I(inode)->lock);
-	stat->blocks = (ALIGN(inode_get_bytes(inode), blocksize) +
+	stat->blocks = (ALIGN(inode_bytes, blocksize) +
 			ALIGN(delalloc_bytes, blocksize)) >> 9;
 	return 0;
 }
@@ -9590,7 +9649,8 @@ static struct btrfs_trans_handle *insert_prealloc_file_extent(
 
 	if (trans) {
 		ret = insert_reserved_file_extent(trans, BTRFS_I(inode),
-						  file_offset, &stack_fi, ret);
+						  file_offset, &stack_fi,
+						  true, ret);
 		if (ret)
 			return ERR_PTR(ret);
 		return trans;
@@ -10206,6 +10266,27 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
 }
 #endif
 
+/*
+ * Update the number of bytes used in the VFS' inode. When we replace extents in
+ * a range (clone, dedupe, fallocate's zero range), we must update the number of
+ * bytes used by the inode in an atomic manner, so that concurrent stat(2) calls
+ * always get a correct value.
+ */
+void btrfs_update_inode_bytes(struct btrfs_inode *inode,
+			      const u64 add_bytes,
+			      const u64 del_bytes)
+{
+	if (add_bytes == del_bytes)
+		return;
+
+	spin_lock(&inode->lock);
+	if (del_bytes > 0)
+		inode_sub_bytes(&inode->vfs_inode, del_bytes);
+	if (add_bytes > 0)
+		inode_add_bytes(&inode->vfs_inode, add_bytes);
+	spin_unlock(&inode->lock);
+}
+
 static const struct inode_operations btrfs_dir_inode_operations = {
 	.getattr	= btrfs_getattr,
 	.lookup		= btrfs_lookup,
diff --git a/fs/btrfs/reflink.c b/fs/btrfs/reflink.c
index e65f8c9f8abb..866719f58e22 100644
--- a/fs/btrfs/reflink.c
+++ b/fs/btrfs/reflink.c
@@ -268,7 +268,7 @@ static int clone_copy_inline_extent(struct inode *dst,
 			    btrfs_item_ptr_offset(path->nodes[0],
 						  path->slots[0]),
 			    size);
-	inode_add_bytes(dst, datal);
+	btrfs_update_inode_bytes(BTRFS_I(dst), datal, drop_args.bytes_found);
 	set_bit(BTRFS_INODE_NEEDS_FULL_SYNC, &BTRFS_I(dst)->runtime_flags);
 	ret = btrfs_inode_set_file_extent_range(BTRFS_I(dst), 0, aligned_end);
 out:
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 3d142114aa15..fc799ed99304 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -832,8 +832,8 @@ static noinline int replay_one_extent(struct btrfs_trans_handle *trans,
 	if (ret)
 		goto out;
 
-	inode_add_bytes(inode, nbytes);
 update_inode:
+	btrfs_update_inode_bytes(BTRFS_I(inode), nbytes, drop_args.bytes_found);
 	ret = btrfs_update_inode(trans, root, inode);
 out:
 	if (inode)
@@ -2598,6 +2598,8 @@ static int replay_one_buffer(struct btrfs_root *log, struct extent_buffer *eb,
 							 BTRFS_I(inode),
 							 &drop_args);
 				if (!ret) {
+					inode_sub_bytes(inode,
+							drop_args.bytes_found);
 					/* Update the inode's nbytes. */
 					ret = btrfs_update_inode(wc->trans,
 								 root, inode);
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/4] btrfs: fix missing delalloc new bit for new delalloc ranges
  2020-11-04 11:07 ` [PATCH 1/4] btrfs: fix missing delalloc new bit for new delalloc ranges fdmanana
@ 2020-11-05 18:29   ` Josef Bacik
  0 siblings, 0 replies; 12+ messages in thread
From: Josef Bacik @ 2020-11-05 18:29 UTC (permalink / raw)
  To: fdmanana, linux-btrfs

On 11/4/20 6:07 AM, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> When doing a buffered write, through one of the write family syscalls, we
> look for ranges which currenly don't have allocated extents and set the
> 'delalloc new' bit on them, so that we can report a correct number of used
> blocks to the stat(2) syscall until delalloc is flushed and ordered extents
> complete.
> 
> However there are a few other places where we can do a buffered write
> against a range that is mapped to a hole (no extent allocated) and where
> we do not set the 'new delalloc' bit. Those places are:
> 
> - Doing a memory mapped write against a hole;
> 
> - Cloning an inline extent into a hole starting at file offset 0;
> 
> - Calling btrfs_cont_expand() when the i_size of the file is not aligned
>    to the sector size and is located in a hole. For example when cloning
>    to a destination offset beyond eof.
> 
> So after such cases, until the corresponding delalloc range is flushed and
> the respective ordered extents complete, we can report an incorrect number
> of blocks used through the stat(2) syscall.
> 
> In some cases we can end up reporting 0 used blocks to stat(2), which is a
> particular bad value to report as it may mislead tools to think a file is
> completely sparse when its i_size is not zero, making them skip reading
> any data, an undesired consequence for tools such as archivers and other
> backup tools, as reported a long time ago in the following thread (and
> other past threads):
> 
>    https://lists.gnu.org/archive/html/bug-tar/2016-07/msg00001.html
> 
> Example reproducer:
> 
>    $ cat reproducer.sh
>    #!/bin/bash
> 
>    MNT=/mnt/sdi
>    DEV=/dev/sdi
> 
>    mkfs.btrfs -f $DEV > /dev/null
>    # mkfs.xfs -f $DEV > /dev/null
>    # mkfs.ext4 -F $DEV > /dev/null
>    # mkfs.f2fs -f $DEV > /dev/null
>    mount $DEV $MNT
> 
>    xfs_io -f -c "truncate 64K"   \
>        -c "mmap -w 0 64K"        \
>        -c "mwrite -S 0xab 0 64K" \
>        -c "munmap"               \
>        $MNT/foo
> 
>    blocks_used=$(stat -c %b $MNT/foo)
>    echo "blocks used: $blocks_used"
> 
>    if [ $blocks_used -eq 0 ]; then
>        echo "ERROR: blocks used is 0"
>    fi
> 
>    umount $DEV
> 
>    $ ./reproducer.sh
>    blocks used: 0
>    ERROR: blocks used is 0
> 
> So move the logic that decides to set the 'delalloc bit' bit into the
> function btrfs_set_extent_delalloc(), since that is what we use for all
> those missing cases as well as for the cases that currently work well.
> 
> This change is also preparatory work for an upcoming patch that fixes
> other problems related to tracking and reporting the number of bytes used
> by an inode.
> 
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 2/4] btrfs: refactor btrfs_drop_extents() to make it easier to extend
  2020-11-04 11:07 ` [PATCH 2/4] btrfs: refactor btrfs_drop_extents() to make it easier to extend fdmanana
@ 2020-11-05 18:39   ` Josef Bacik
  0 siblings, 0 replies; 12+ messages in thread
From: Josef Bacik @ 2020-11-05 18:39 UTC (permalink / raw)
  To: fdmanana, linux-btrfs

On 11/4/20 6:07 AM, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> There are many arguments for __btrfs_drop_extents() and its wrapper
> btrfs_drop_extents(), which makes it hard to add more arguments to it and
> requires changing every caller. I have added a couple myself back in 2014
> commit 1acae57b161e ("Btrfs: faster file extent item replace operations")
> and therefore know firsthand that it is a bit cumbersome to add additional
> arguments to these functions.
> 
> Since I will need to add more arguments in a subsequent bug fix, this
> change is preparatory work and adds a data structure that holds all the
> arguments, for both input and output, that are passed to this function,
> with some comments in the structure's definition mentioning what each
> field is and how it relates to other fields.
> 
> Callers of this function need only to zero out the content of the
> structure and setup only the fields they need. This also removes the
> need to have both __btrfs_drop_extents() and btrfs_drop_extents(), so
> now we have a single function named btrfs_drop_extents() that takes a
> pointer to this new data structure (struct btrfs_drop_extents_args).
> 
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

^ permalink raw reply	[flat|nested] 12+ messages in thread
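
As a concrete illustration of the calling convention described in the
changelog quoted above, a hypothetical caller would look roughly like the
fragment below. Only struct fields that are visible elsewhere in this
series are used, error handling is trimmed, and the surrounding
transaction setup is assumed; this is a sketch, not code from the patch.

/*
 * Hypothetical helper, sketched from the description above; names other
 * than btrfs_drop_extents(), btrfs_update_inode_bytes() and the struct
 * fields shown in this series are assumptions.
 */
static int drop_range_sketch(struct btrfs_trans_handle *trans,
			     struct btrfs_root *root, struct inode *inode,
			     u64 start, u64 end)
{
	struct btrfs_drop_extents_args drop_args = { 0 };
	int ret;

	/* Drop all file extent items overlapping [start, end). */
	drop_args.start = start;
	drop_args.end = end;
	ret = btrfs_drop_extents(trans, root, BTRFS_I(inode), &drop_args);
	if (ret)
		return ret;

	/*
	 * After patch 4, btrfs_drop_extents() no longer adjusts the vfs
	 * inode's byte count; the caller accounts for the dropped bytes so
	 * the update can be made atomic with respect to stat(2).
	 */
	btrfs_update_inode_bytes(BTRFS_I(inode), 0, drop_args.bytes_found);
	return 0;
}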

* Re: [PATCH 3/4] btrfs: fix race when defragging that leads to unnecessary IO
  2020-11-04 11:07 ` [PATCH 3/4] btrfs: fix race when defragging that leads to unnecessary IO fdmanana
@ 2020-11-05 18:44   ` Josef Bacik
  0 siblings, 0 replies; 12+ messages in thread
From: Josef Bacik @ 2020-11-05 18:44 UTC (permalink / raw)
  To: fdmanana, linux-btrfs

On 11/4/20 6:07 AM, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> When defragging we skip ranges that have holes or inline extents, so that
> we don't do unnecessary IO and waste space. We do this check when calling
> should_defrag_range() at btrfs_defrag_file(). However we do it without
> holding the inode's lock. The reason we do it like this is to avoid
> blocking other tasks for too long, that possibly want to operate on other
> file ranges, since after the call to should_defrag_range() and before
> locking the inode, we trigger a synchronous page cache readahead. However
> before we were able to lock the inode, some other task might have punched
> a hole in our range, or we may now have an inline extent there, in which
> case we should not set the range for defrag anymore since that would cause
> unnecessary IO and make us waste space (i.e. allocating extents to contain
> zeros for a hole).
> 
> So after we locked the inode and the range in the iotree, check again if
> we have holes or an inline extent, and if we do, just skip the range.
> 
> I hit this while testing my next patch that fixes races when updating an
> inode's number of bytes (subject "btrfs: update the number of bytes used
> by an inode atomically"), and it depends on this change in order to work
> correctly. Alternatively I could rework that other patch to detect holes
> and flag their range with the 'new delalloc' bit, but this itself fixes
> an efficiency problem due a race that from a functional point of view is
> not harmful (it could be triggered with btrfs/062 from fstests).
> 
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

^ permalink raw reply	[flat|nested] 12+ messages in thread
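
The re-check described in the changelog quoted above boils down to looking
up the extent map for the range again once the inode and the range in the
io tree are locked, and skipping the range if it has meanwhile become a
hole or an inline extent. A rough sketch follows; the helper name and its
exact placement inside the defrag loop are assumptions, not the literal
patch.

/*
 * Illustrative re-check, assumed helper name: returns true if the range
 * still looks worth defragging after the inode and range locks are held.
 */
static bool defrag_range_still_useful(struct btrfs_inode *inode,
				      u64 start, u64 len)
{
	struct extent_map *em;
	bool ret = true;

	em = btrfs_get_extent(inode, NULL, 0, start, len);
	if (IS_ERR(em))
		return false;

	/* Holes and inline extents would only cause needless IO if defragged. */
	if (em->block_start == EXTENT_MAP_HOLE ||
	    em->block_start == EXTENT_MAP_INLINE)
		ret = false;

	free_extent_map(em);
	return ret;
}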

* Re: [PATCH 4/4] btrfs: update the number of bytes used by an inode atomically
  2020-11-04 11:07 ` [PATCH 4/4] btrfs: update the number of bytes used by an inode atomically fdmanana
@ 2020-11-05 19:24   ` Josef Bacik
  2020-11-09 10:34   ` Nikolay Borisov
  1 sibling, 0 replies; 12+ messages in thread
From: Josef Bacik @ 2020-11-05 19:24 UTC (permalink / raw)
  To: fdmanana, linux-btrfs

On 11/4/20 6:07 AM, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> There are several occasions where we do not update the inode's number of
> used bytes atomically, resulting in a concurrent stat(2) syscall to report
> a value of used blocks that does not correspond to a valid value, that is,
> a value that does not match neither what we had before the operation nor
> what we get after the operation completes.
> 
> In extreme cases it can result in stat(2) reporting zero used blocks, which
> can cause problems for some userspace tools where they can consider a file
> with a non-zero size and zero used blocks as completely sparse and skip
> reading data, as reported/discussed a long time ago in some threads like
> the following:
> 
>    https://lists.gnu.org/archive/html/bug-tar/2016-07/msg00001.html
> 
> The cases where this can happen are the following:
> 
> -> Case 1
> 
> If we do a write (buffered or direct IO) against a file region for which
> there is already an allocated extent (or multiple extents), then we have a
> short time window where we can report a number of used blocks to stat(2)
> that does not take into account the file region being overwritten. This
> short time window happens when completing the ordered extent(s).
> 
> This happens because when we drop the extents in the write range we
> decrement the inode's number of bytes and later on when we insert the new
> extent(s) we increment the number of bytes in the inode, resulting in a
> short time window where a stat(2) syscall can get an incorrect number of
> used blocks.
> 
> If we do writes that overwrite an entire file, then we have a short time
> window where we report 0 used blocks to stat(2).
> 
> Example reproducer:
> 
>    $ cat reproducer-1.sh
>    #!/bin/bash
> 
>    MNT=/mnt/sdi
>    DEV=/dev/sdi
> 
>    stat_loop()
>    {
>        trap "wait; exit" SIGTERM
>        local filepath=$1
>        local expected=$2
>        local got
> 
>        while :; do
>            got=$(stat -c %b $filepath)
>            if [ $got -ne $expected ]; then
>               echo -n "ERROR: unexpected used blocks"
>               echo " (got: $got expected: $expected)"
>            fi
>        done
>    }
> 
>    mkfs.btrfs -f $DEV > /dev/null
>    # mkfs.xfs -f $DEV > /dev/null
>    # mkfs.ext4 -F $DEV > /dev/null
>    # mkfs.f2fs -f $DEV > /dev/null
>    # mkfs.reiserfs -f $DEV > /dev/null
>    mount $DEV $MNT
> 
>    xfs_io -f -s -c "pwrite -b 64K 0 64K" $MNT/foobar >/dev/null
>    expected=$(stat -c %b $MNT/foobar)
> 
>    # Create a process to keep calling stat(2) on the file and see if the
>    # reported number of blocks used (disk space used) changes, it should
>    # not because we are not increasing the file size nor punching holes.
>    stat_loop $MNT/foobar $expected &
>    loop_pid=$!
> 
>    for ((i = 0; i < 50000; i++)); do
>        xfs_io -s -c "pwrite -b 64K 0 64K" $MNT/foobar >/dev/null
>    done
> 
>    kill $loop_pid &> /dev/null
>    wait
> 
>    umount $DEV
> 
>    $ ./reproducer-1.sh
>    ERROR: unexpected used blocks (got: 0 expected: 128)
>    ERROR: unexpected used blocks (got: 0 expected: 128)
>    (...)
> 
> Note that since this is a short time window where the race can happen, the
> reproducer may not be able to always trigger the bug in one run, or it may
> trigger it multiple times.
> 
> -> Case 2
> 
> If we do a buffered write against a file region that does not have any
> allocated extents, like a hole or beyond eof, then during ordered extent
> completion we have a short time window where a concurrent stat(2) syscall
> can report a number of used blocks that does not correspond to the value
> before or after the write operation, a value that is actually larger than
> the value after the write completes.
> 
> This happens because once we start a buffered write into an unallocated
> file range we increment the inode's 'new_delalloc_bytes', to make sure
> any stat(2) call gets a correct used blocks value before delalloc is
> flushed and completes. However at ordered extent completion, after we
> inserted the new extent, we increment the inode's number of bytes used
> with the size of the new extent, and only later, when clearing the range
> in the inode's iotree, we decrement the inode's 'new_delalloc_bytes'
> counter with the size of the extent. So this results in a short time
> window where a concurrent stat(2) syscall can report a number of used
> blocks that accounts for the new extent twice.
> 
> Example reproducer:
> 
>    $ cat reproducer-2.sh
>    #!/bin/bash
> 
>    MNT=/mnt/sdi
>    DEV=/dev/sdi
> 
>    stat_loop()
>    {
>        trap "wait; exit" SIGTERM
>        local filepath=$1
>        local expected=$2
>        local got
> 
>        while :; do
>            got=$(stat -c %b $filepath)
>            if [ $got -ne $expected ]; then
>                echo -n "ERROR: unexpected used blocks"
>                echo " (got: $got expected: $expected)"
>            fi
>        done
>    }
> 
>    mkfs.btrfs -f $DEV > /dev/null
>    # mkfs.xfs -f $DEV > /dev/null
>    # mkfs.ext4 -F $DEV > /dev/null
>    # mkfs.f2fs -f $DEV > /dev/null
>    # mkfs.reiserfs -f $DEV > /dev/null
>    mount $DEV $MNT
> 
>    touch $MNT/foobar
>    write_size=$((64 * 1024))
>    for ((i = 0; i < 16384; i++)); do
>       offset=$(($i * $write_size))
>       xfs_io -c "pwrite -S 0xab $offset $write_size" $MNT/foobar >/dev/null
>       blocks_used=$(stat -c %b $MNT/foobar)
> 
>       # Fsync the file to trigger writeback and keep calling stat(2) on it
>       # to see if the number of blocks used changes.
>       stat_loop $MNT/foobar $blocks_used &
>       loop_pid=$!
>       xfs_io -c "fsync" $MNT/foobar
> 
>       kill $loop_pid &> /dev/null
>       wait $loop_pid
>    done
> 
>    umount $DEV
> 
>    $ ./reproducer-2.sh
>    ERROR: unexpected used blocks (got: 265472 expected: 265344)
>    ERROR: unexpected used blocks (got: 284032 expected: 283904)
>    (...)
> 
> Note that since this is a short time window where the race can happen, the
> reproducer may not be able to always trigger the bug in one run, or it may
> trigger it multiple times.
> 
> -> Case 3
> 
> Another case where such problems happen is during other operations that
> replace extents in a file range with other extents. Those operations are
> extent cloning, deduplication and fallocate's zero range operation.
> 
> The cause of the problem is similar to the first case. When we drop the
> extents from a range, we decrement the inode's number of bytes, and later
> on, after inserting the new extents we increment it. Since this is not
> done atomically, a concurrent stat(2) call can see and return a number of
> used blocks that is smaller than it should be, does not match the number
> of used blocks before or after the clone/deduplication/zero operation.
> 
> Like for the first case, when doing a clone, deduplication or zero range
> operation against an entire file, we end up having a time window where we
> can report 0 used blocks to a stat(2) call.
> 
> Example reproducer:
> 
>    $ cat reproducer-3.sh
>    #!/bin/bash
> 
>    MNT=/mnt/sdi
>    DEV=/dev/sdi
> 
>    mkfs.btrfs -f $DEV > /dev/null
>    # mkfs.xfs -f -m reflink=1 $DEV > /dev/null
>    mount $DEV $MNT
> 
>    extent_size=$((64 * 1024))
>    num_extents=16384
>    file_size=$(($extent_size * $num_extents))
> 
>    # File foo has many small extents.
>    xfs_io -f -s -c "pwrite -S 0xab -b $extent_size 0 $file_size" $MNT/foo \
>        > /dev/null
>    # File bar has much less extents and has exactly the same data as foo.
>    xfs_io -f -c "pwrite -S 0xab 0 $file_size" $MNT/bar > /dev/null
> 
>    expected=$(stat -c %b $MNT/foo)
> 
>    # Now deduplicate bar into foo. While the deduplication is in progres,
>    # the number of used blocks/file size reported by stat should not change
>    xfs_io -c "dedupe $MNT/bar 0 0 $file_size" $MNT/foo > /dev/null  &
>    dedupe_pid=$!
>    while [ -n "$(ps -p $dedupe_pid -o pid=)" ]; do
>        used=$(stat -c %b $MNT/foo)
>        if [ $used -ne $expected ]; then
>            echo "Unexpected blocks used: $used (expected: $expected)"
>        fi
>    done
> 
>    umount $DEV
> 
>    $ ./reproducer-3.sh
>    Unexpected blocks used: 2076800 (expected: 2097152)
>    Unexpected blocks used: 2097024 (expected: 2097152)
>    Unexpected blocks used: 2079872 (expected: 2097152)
>    (...)
> 
> Note that since this is a short time window where the race can happen, the
> reproducer may not be able to always trigger the bug in one run, or it may
> trigger it multiple times.
> 
> So fix this by:
> 
> 1) Making btrfs_drop_extents() not decrement the vfs inode's number of
>     bytes, and instead return the number of bytes;
> 
> 2) Making any code that drops extents and adds new extents update the
>     inode's number of bytes atomically, while holding the btrfs inode's
>     spinlock, which is also used by the stat(2) callback to get the inode's
>     number of bytes;
> 
> 3) For ranges in the inode's iotree that are marked as 'delalloc new',
>     corresponding to previously unallocated ranges, increment the inode's
>     number of bytes when clearing the 'delalloc new' bit from the range,
>     in the same critical section that decrements the inode's
>     'new_delalloc_bytes' counter, delimited by the btrfs inode's spinlock.
> 
> An alternative would be to have btrfs_getattr() wait for any IO (ordered
> extents in progress) and locking the whole range (0 to (u64)-1) while it
> it computes the number of blocks used. But that would mean blocking
> stat(2), which is a very used syscall and expected to be fast, waiting
> for writes, clone/dedupe, fallocate, page reads, fiemap, etc.
> 
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 4/4] btrfs: update the number of bytes used by an inode atomically
  2020-11-04 11:07 ` [PATCH 4/4] btrfs: update the number of bytes used by an inode atomically fdmanana
  2020-11-05 19:24   ` Josef Bacik
@ 2020-11-09 10:34   ` Nikolay Borisov
  2020-11-09 11:10     ` Filipe Manana
  1 sibling, 1 reply; 12+ messages in thread
From: Nikolay Borisov @ 2020-11-09 10:34 UTC (permalink / raw)
  To: fdmanana, linux-btrfs



On 4.11.20 г. 13:07 ч., fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 

<snip>

> So fix this by:
> 
> 1) Making btrfs_drop_extents() not decrement the vfs inode's number of
>    bytes, and instead return the number of bytes;
> 
> 2) Making any code that drops extents and adds new extents update the
>    inode's number of bytes atomically, while holding the btrfs inode's
>    spinlock, which is also used by the stat(2) callback to get the inode's
>    number of bytes;

Since synchronization is going to be provided by the btrfs_inode's lock,
how about switching to the __inode_(sub|add)_bytes functions, which do
not take the vfs_inode's lock? Or do we need both (btrfs' and vfs' inode
locks)?

> 
> 3) For ranges in the inode's iotree that are marked as 'delalloc new',
>    corresponding to previously unallocated ranges, increment the inode's
>    number of bytes when clearing the 'delalloc new' bit from the range,
>    in the same critical section that decrements the inode's
>    'new_delalloc_bytes' counter, delimited by the btrfs inode's spinlock.
> 
> An alternative would be to have btrfs_getattr() wait for any IO (ordered
> extents in progress) and locking the whole range (0 to (u64)-1) while it
> it computes the number of blocks used. But that would mean blocking
> stat(2), which is a very used syscall and expected to be fast, waiting
> for writes, clone/dedupe, fallocate, page reads, fiemap, etc.
> 
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
<snip>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 4/4] btrfs: update the number of bytes used by an inode atomically
  2020-11-09 10:34   ` Nikolay Borisov
@ 2020-11-09 11:10     ` Filipe Manana
  0 siblings, 0 replies; 12+ messages in thread
From: Filipe Manana @ 2020-11-09 11:10 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: linux-btrfs

On Mon, Nov 9, 2020 at 10:34 AM Nikolay Borisov <nborisov@suse.com> wrote:
>
>
>
> On 4.11.20 г. 13:07 ч., fdmanana@kernel.org wrote:
> > From: Filipe Manana <fdmanana@suse.com>
> >
>
> <snip>
>
> > So fix this by:
> >
> > 1) Making btrfs_drop_extents() not decrement the vfs inode's number of
> >    bytes, and instead return the number of bytes;
> >
> > 2) Making any code that drops extents and adds new extents update the
> >    inode's number of bytes atomically, while holding the btrfs inode's
> >    spinlock, which is also used by the stat(2) callback to get the inode's
> >    number of bytes;
>
> Since synchronization is going to be provided by the btrfs_inode's lock,
> then how about switching to __inode_(sub|add)_bytes functions which do
> not take the vfs_inode's lock? Or do we need both (btrfs' and vfs' inode
> locks) ?

Because the synchronization is only meant for the races with stat(2),
and we don't want to break existing users of inode_get_bytes(), especially
those outside btrfs.

Thanks.

>
> >
> > 3) For ranges in the inode's iotree that are marked as 'delalloc new',
> >    corresponding to previously unallocated ranges, increment the inode's
> >    number of bytes when clearing the 'delalloc new' bit from the range,
> >    in the same critical section that decrements the inode's
> >    'new_delalloc_bytes' counter, delimited by the btrfs inode's spinlock.
> >
> > An alternative would be to have btrfs_getattr() wait for any IO (ordered
> > extents in progress) and locking the whole range (0 to (u64)-1) while it
> > it computes the number of blocks used. But that would mean blocking
> > stat(2), which is a very used syscall and expected to be fast, waiting
> > for writes, clone/dedupe, fallocate, page reads, fiemap, etc.
> >
> > Signed-off-by: Filipe Manana <fdmanana@suse.com>
> <snip>

^ permalink raw reply	[flat|nested] 12+ messages in thread
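
For reference on the locking question above: the generic VFS byte
accounting helpers come in locked and unlocked flavours, roughly as in the
simplified sketch below (paraphrased from fs/stat.c of this era, not the
exact upstream code). btrfs_update_inode_bytes() keeps using the locked
variants, so callers of inode_get_bytes() outside btrfs, which rely only
on i_lock, stay correct; the btrfs inode spinlock is taken on top of that
solely so stat(2) sees i_bytes and new_delalloc_bytes change together.

/* Simplified: serialized by the caller (normally inode->i_lock). */
void __inode_add_bytes(struct inode *inode, loff_t bytes)
{
	inode->i_blocks += bytes >> 9;
	bytes &= 511;
	inode->i_bytes += bytes;
	if (inode->i_bytes >= 512) {
		inode->i_blocks++;
		inode->i_bytes -= 512;
	}
}

/* Locked variant, used (indirectly) by btrfs_update_inode_bytes(). */
void inode_add_bytes(struct inode *inode, loff_t bytes)
{
	spin_lock(&inode->i_lock);
	__inode_add_bytes(inode, bytes);
	spin_unlock(&inode->i_lock);
}

/* Readers outside btrfs only take i_lock, via this helper. */
loff_t inode_get_bytes(struct inode *inode)
{
	loff_t ret;

	spin_lock(&inode->i_lock);
	ret = ((loff_t)inode->i_blocks << 9) + inode->i_bytes;
	spin_unlock(&inode->i_lock);
	return ret;
}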

* Re: [PATCH 0/4] btrfs: fix cases of stat(2) reporting incorrect number of used blocks
  2020-11-04 11:07 [PATCH 0/4] btrfs: fix cases of stat(2) reporting incorrect number of used blocks fdmanana
                   ` (3 preceding siblings ...)
  2020-11-04 11:07 ` [PATCH 4/4] btrfs: update the number of bytes used by an inode atomically fdmanana
@ 2020-11-10 21:46 ` David Sterba
  4 siblings, 0 replies; 12+ messages in thread
From: David Sterba @ 2020-11-10 21:46 UTC (permalink / raw)
  To: fdmanana; +Cc: linux-btrfs

On Wed, Nov 04, 2020 at 11:07:30AM +0000, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> There are several cases where a stat(2) call can report an incorrect number
> of used blocks. In some cases can even result in reporting 0 used blocks,
> which is a specially bad value to report when a file is not empty, and we
> had user bug reports in the past for such cases (the changelogs in the
> patches point to one such report).
> 
> This patchset addresses all those cases. The third patch fixes a race in
> defrag that while it does not result in a functional problem (data loss
> or some corruption), it leads to unnecessary IO and space allocation,
> and it's necessary for the 4th and final patch to work as it is.
> 
> A couple test cases for fstests will follow.
> 
> Filipe Manana (4):
>   btrfs: fix missing delalloc new bit for new delalloc ranges
>   btrfs: refactor btrfs_drop_extents() to make it easier to extend
>   btrfs: fix race when defragging that leads to unnecessary IO
>   btrfs: update the number of bytes used by an inode atomically

Thanks, now added to misc-next. As the fixes seem to be important, I've
checked which stable kernels are relevant: 4.19 and 5.4. The code does
not apply cleanly due to intermediate cleanups, but it should be easy
to resolve.

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2020-11-10 21:48 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-11-04 11:07 [PATCH 0/4] btrfs: fix cases of stat(2) reporting incorrect number of used blocks fdmanana
2020-11-04 11:07 ` [PATCH 1/4] btrfs: fix missing delalloc new bit for new delalloc ranges fdmanana
2020-11-05 18:29   ` Josef Bacik
2020-11-04 11:07 ` [PATCH 2/4] btrfs: refactor btrfs_drop_extents() to make it easier to extend fdmanana
2020-11-05 18:39   ` Josef Bacik
2020-11-04 11:07 ` [PATCH 3/4] btrfs: fix race when defragging that leads to unnecessary IO fdmanana
2020-11-05 18:44   ` Josef Bacik
2020-11-04 11:07 ` [PATCH 4/4] btrfs: update the number of bytes used by an inode atomically fdmanana
2020-11-05 19:24   ` Josef Bacik
2020-11-09 10:34   ` Nikolay Borisov
2020-11-09 11:10     ` Filipe Manana
2020-11-10 21:46 ` [PATCH 0/4] btrfs: fix cases of stat(2) reporting incorrect number of used blocks David Sterba
