linux-btrfs.vger.kernel.org archive mirror
* [PATCH v3 0/5] btrfs: qgroup: detect and fix leaked data reserved space
@ 2020-06-10  1:04 Qu Wenruo
  2020-06-10  1:04 ` [PATCH v3 1/5] btrfs: inode: refactor the parameters of insert_reserved_file_extent() Qu Wenruo
                   ` (4 more replies)
  0 siblings, 5 replies; 13+ messages in thread
From: Qu Wenruo @ 2020-06-10  1:04 UTC (permalink / raw)
  To: linux-btrfs

There is an internal report complaining that qgroup usage is only at
half of the limit, but EDQUOT errors are still returned.

With some extra debugging patch added, it turns out that even fsstress
with 10 steps can sometimes cause qgroup reserved data space to leak.

The root cause is the chaotic lifespan of qgroup data rsv.
Here is the chart explaining the difference between the old and new
lifespan of qgroup data rsv:
  ||: Qgroup data rsv is reserved
   |: Qgroup data rsv is released but not freed
    : Qgroup data rsv is freed

	The old		The new

	   TT		   TT		Page gets dirtied
 	   ||		   ||
           ||		   ||
  	   || ------------ || --------- btrfs_run_delalloc_range()
	   ||		    |		|- btrfs_add_ordered_extent()
	   ||		    |
	    | ------------  | --------- btrfs_finish_ordered_io()
	    |		    |
	      ------------    --------- btrfs_qgroup_account_extents()

There is a large window between btrfs_add_ordered_extent() and
btrfs_finish_ordered_io(), during which pages can be released, clearing
the QGROUP_RESERVED bit.

In fact, dio will flush the range and then invalidate the pages before
submitting the direct IO.

Since filemap_write_and_wait_range() only waits for the page writeback
bit to be cleared, not for btrfs_finish_ordered_io() to finish, the
pages can be released before btrfs_finish_ordered_io() gets executed,
clearing the QGROUP_RESERVED bit without triggering a qgroup rsv
release, leading to qgroup data rsv leakage.

With the new timing, the QGROUP_RESERVED bit is cleared, with a proper
qgroup rsv release, before filemap_write_and_wait_range() returns, so
there is no window to screw up the qgroup rsv anymore.

To accommodate the timing change, quite a few existing chaotic
btrfs_qgroup_release/free_data() calls must be modified or removed to
follow the stricter calling protocol.

But overall, this makes the qgroup data rsv lifespan much clearer, so
it should still be worth it.

After all the big timing changes and fixes, add an extra and hopefully
final safety net to catch qgroup data rsv leakage.
Now the extent io tree based QGROUP_RESERVED bit should detect cases
like a missing btrfs_qgroup_release/free_data() call, while the unmount
check should detect unexpected QGROUP_RESERVED bit clearing.

The existing test case btrfs/022 can already catch the bug pretty
reliably.

Changelog:
v2:
- Change the lifespan of qgroup data rsv
  From the original whack-a-mole method to a proper timing, using
  ordered extents as the owner of the qgroup data rsv.

- Add commit message for the final patch

- Add an extra refactor to make insert_reserved_file_extent() use fewer
  parameters

v3:
- Remove the redundant WARN() message

- Reduce the scope of struct btrfs_qgroup in btrfs_qgroup_has_leak()

Qu Wenruo (5):
  btrfs: inode: refactor the parameters of insert_reserved_file_extent()
  btrfs: inode: move the qgroup reserved data space release into the
    callers of insert_reserved_file_extent()
  btrfs: file: reserve qgroup space after the hole punch range locked
  btrfs: change the timing for qgroup reserved space for ordered extents
    to fix reserved space leak
  btrfs: qgroup: catch reserved space leakage at unmount time

 fs/btrfs/ctree.h        |   6 +-
 fs/btrfs/disk-io.c      |   5 ++
 fs/btrfs/file.c         |   8 +--
 fs/btrfs/inode.c        | 119 +++++++++++++++++++++++-----------------
 fs/btrfs/ordered-data.c |  22 +++++++-
 fs/btrfs/ordered-data.h |   3 +
 fs/btrfs/qgroup.c       |  43 +++++++++++++++
 fs/btrfs/qgroup.h       |   2 +-
 8 files changed, 151 insertions(+), 57 deletions(-)

-- 
2.26.2


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH v3 1/5] btrfs: inode: refactor the parameters of insert_reserved_file_extent()
  2020-06-10  1:04 [PATCH v3 0/5] btrfs: qgroup: detect and fix leaked data reserved space Qu Wenruo
@ 2020-06-10  1:04 ` Qu Wenruo
  2020-06-12 18:46   ` Josef Bacik
  2020-06-10  1:04 ` [PATCH v3 2/5] btrfs: inode: move the qgroup reserved data space release into the callers " Qu Wenruo
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 13+ messages in thread
From: Qu Wenruo @ 2020-06-10  1:04 UTC (permalink / raw)
  To: linux-btrfs

Function insert_reserved_file_extent() takes a long list of parameters,
which are all for btrfs_file_extent_item, even including two reserved
members, encryption and other_encoding.

This makes the parameter list unnecessarily long for a function which
only gets called twice.

This patch refactors the parameter list by passing a
btrfs_file_extent_item directly, greatly reducing the number of
parameters.

There are only two callers: one in btrfs_finish_ordered_io(), which
inserts the file extent for an ordered extent, and one in
__btrfs_prealloc_file_range().

These two call sites have completely different contexts: the ordered
extent can be compressed but will always be a regular extent, while the
preallocated one is never compressed and always has the PREALLOC type.

So add two small wrappers for these two call sites to improve
readability.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/ctree.h |  6 +++-
 fs/btrfs/inode.c | 94 ++++++++++++++++++++++++++++++++----------------
 2 files changed, 68 insertions(+), 32 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 161533040978..23f7e9d67bdb 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2231,7 +2231,8 @@ static inline unsigned int leaf_data_end(const struct extent_buffer *leaf)
 }
 
 /* struct btrfs_file_extent_item */
-BTRFS_SETGET_FUNCS(file_extent_type, struct btrfs_file_extent_item, type, 8);
+BTRFS_SETGET_STACK_FUNCS(stack_file_extent_type, struct btrfs_file_extent_item,
+			 type, 8);
 BTRFS_SETGET_STACK_FUNCS(stack_file_extent_disk_bytenr,
 			 struct btrfs_file_extent_item, disk_bytenr, 64);
 BTRFS_SETGET_STACK_FUNCS(stack_file_extent_offset,
@@ -2240,6 +2241,8 @@ BTRFS_SETGET_STACK_FUNCS(stack_file_extent_generation,
 			 struct btrfs_file_extent_item, generation, 64);
 BTRFS_SETGET_STACK_FUNCS(stack_file_extent_num_bytes,
 			 struct btrfs_file_extent_item, num_bytes, 64);
+BTRFS_SETGET_STACK_FUNCS(stack_file_extent_ram_bytes,
+			 struct btrfs_file_extent_item, ram_bytes, 64);
 BTRFS_SETGET_STACK_FUNCS(stack_file_extent_disk_num_bytes,
 			 struct btrfs_file_extent_item, disk_num_bytes, 64);
 BTRFS_SETGET_STACK_FUNCS(stack_file_extent_compression,
@@ -2256,6 +2259,7 @@ static inline u32 btrfs_file_extent_calc_inline_size(u32 datasize)
 	return BTRFS_FILE_EXTENT_INLINE_DATA_START + datasize;
 }
 
+BTRFS_SETGET_FUNCS(file_extent_type, struct btrfs_file_extent_item, type, 8);
 BTRFS_SETGET_FUNCS(file_extent_disk_bytenr, struct btrfs_file_extent_item,
 		   disk_bytenr, 64);
 BTRFS_SETGET_FUNCS(file_extent_generation, struct btrfs_file_extent_item,
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 1242d0aa108d..c076cbe9f492 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2449,16 +2449,16 @@ int btrfs_writepage_cow_fixup(struct page *page, u64 start, u64 end)
 
 static int insert_reserved_file_extent(struct btrfs_trans_handle *trans,
 				       struct inode *inode, u64 file_pos,
-				       u64 disk_bytenr, u64 disk_num_bytes,
-				       u64 num_bytes, u64 ram_bytes,
-				       u8 compression, u8 encryption,
-				       u16 other_encoding, int extent_type)
+				       struct btrfs_file_extent_item *stack_fi)
 {
 	struct btrfs_root *root = BTRFS_I(inode)->root;
-	struct btrfs_file_extent_item *fi;
 	struct btrfs_path *path;
 	struct extent_buffer *leaf;
 	struct btrfs_key ins;
+	u64 disk_num_bytes = btrfs_stack_file_extent_disk_num_bytes(stack_fi);
+	u64 disk_bytenr = btrfs_stack_file_extent_disk_bytenr(stack_fi);
+	u64 num_bytes = btrfs_stack_file_extent_num_bytes(stack_fi);
+	u64 ram_bytes = btrfs_stack_file_extent_ram_bytes(stack_fi);
 	u64 qg_released;
 	int extent_inserted = 0;
 	int ret;
@@ -2478,7 +2478,7 @@ static int insert_reserved_file_extent(struct btrfs_trans_handle *trans,
 	 */
 	ret = __btrfs_drop_extents(trans, root, inode, path, file_pos,
 				   file_pos + num_bytes, NULL, 0,
-				   1, sizeof(*fi), &extent_inserted);
+				   1, sizeof(*stack_fi), &extent_inserted);
 	if (ret)
 		goto out;
 
@@ -2489,23 +2489,15 @@ static int insert_reserved_file_extent(struct btrfs_trans_handle *trans,
 
 		path->leave_spinning = 1;
 		ret = btrfs_insert_empty_item(trans, root, path, &ins,
-					      sizeof(*fi));
+					      sizeof(*stack_fi));
 		if (ret)
 			goto out;
 	}
 	leaf = path->nodes[0];
-	fi = btrfs_item_ptr(leaf, path->slots[0],
-			    struct btrfs_file_extent_item);
-	btrfs_set_file_extent_generation(leaf, fi, trans->transid);
-	btrfs_set_file_extent_type(leaf, fi, extent_type);
-	btrfs_set_file_extent_disk_bytenr(leaf, fi, disk_bytenr);
-	btrfs_set_file_extent_disk_num_bytes(leaf, fi, disk_num_bytes);
-	btrfs_set_file_extent_offset(leaf, fi, 0);
-	btrfs_set_file_extent_num_bytes(leaf, fi, num_bytes);
-	btrfs_set_file_extent_ram_bytes(leaf, fi, ram_bytes);
-	btrfs_set_file_extent_compression(leaf, fi, compression);
-	btrfs_set_file_extent_encryption(leaf, fi, encryption);
-	btrfs_set_file_extent_other_encoding(leaf, fi, other_encoding);
+	btrfs_set_stack_file_extent_generation(stack_fi, trans->transid);
+	write_extent_buffer(leaf, stack_fi,
+			btrfs_item_ptr_offset(leaf, path->slots[0]),
+			sizeof(struct btrfs_file_extent_item));
 
 	btrfs_mark_buffer_dirty(leaf);
 	btrfs_release_path(path);
@@ -2553,7 +2545,33 @@ static void btrfs_release_delalloc_bytes(struct btrfs_fs_info *fs_info,
 	btrfs_put_block_group(cache);
 }
 
-/* as ordered data IO finishes, this gets called so we can finish
+static int insert_ordered_extent_file_extent(struct btrfs_trans_handle *trans,
+					     struct inode *inode,
+					     struct btrfs_ordered_extent *oe)
+{
+	struct btrfs_file_extent_item stack_fi;
+	u64 logical_len;
+
+	memset(&stack_fi, 0, sizeof(stack_fi));
+	btrfs_set_stack_file_extent_type(&stack_fi, BTRFS_FILE_EXTENT_REG);
+	btrfs_set_stack_file_extent_disk_bytenr(&stack_fi, oe->disk_bytenr);
+	btrfs_set_stack_file_extent_disk_num_bytes(&stack_fi,
+						   oe->disk_num_bytes);
+	if (test_bit(BTRFS_ORDERED_TRUNCATED, &oe->flags))
+		logical_len = oe->truncated_len;
+	else
+		logical_len = oe->num_bytes;
+	btrfs_set_stack_file_extent_num_bytes(&stack_fi, logical_len);
+	btrfs_set_stack_file_extent_ram_bytes(&stack_fi, logical_len);
+	btrfs_set_stack_file_extent_compression(&stack_fi, oe->compress_type);
+	/* Encryption and other encoding is reserved and all 0 */
+
+	return insert_reserved_file_extent(trans, inode, oe->file_offset,
+					   &stack_fi);
+}
+
+/*
+ * As ordered data IO finishes, this gets called so we can finish
  * an ordered extent if the range of bytes in the file it covers are
  * fully written.
  */
@@ -2655,12 +2673,8 @@ static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent)
 						logical_len);
 	} else {
 		BUG_ON(root == fs_info->tree_root);
-		ret = insert_reserved_file_extent(trans, inode, start,
-						ordered_extent->disk_bytenr,
-						ordered_extent->disk_num_bytes,
-						logical_len, logical_len,
-						compress_type, 0, 0,
-						BTRFS_FILE_EXTENT_REG);
+		ret = insert_ordered_extent_file_extent(trans, inode,
+							ordered_extent);
 		if (!ret) {
 			clear_reserved_extent = false;
 			btrfs_release_delalloc_bytes(fs_info,
@@ -9487,6 +9501,27 @@ static int btrfs_symlink(struct inode *dir, struct dentry *dentry,
 	return err;
 }
 
+static int insert_prealloc_file_extent(struct btrfs_trans_handle *trans,
+				       struct inode *inode, struct btrfs_key *ins,
+				       u64 file_offset)
+{
+	struct btrfs_file_extent_item stack_fi;
+	u64 start = ins->objectid;
+	u64 len = ins->offset;
+
+	memset(&stack_fi, 0, sizeof(stack_fi));
+
+	btrfs_set_stack_file_extent_type(&stack_fi, BTRFS_FILE_EXTENT_PREALLOC);
+	btrfs_set_stack_file_extent_disk_bytenr(&stack_fi, start);
+	btrfs_set_stack_file_extent_disk_num_bytes(&stack_fi, len);
+	btrfs_set_stack_file_extent_num_bytes(&stack_fi, len);
+	btrfs_set_stack_file_extent_ram_bytes(&stack_fi, len);
+	btrfs_set_stack_file_extent_compression(&stack_fi, BTRFS_COMPRESS_NONE);
+	/* Encryption and other encoding is reserved and all 0 */
+
+	return insert_reserved_file_extent(trans, inode, file_offset,
+					   &stack_fi);
+}
 static int __btrfs_prealloc_file_range(struct inode *inode, int mode,
 				       u64 start, u64 num_bytes, u64 min_size,
 				       loff_t actual_len, u64 *alloc_hint,
@@ -9545,11 +9580,8 @@ static int __btrfs_prealloc_file_range(struct inode *inode, int mode,
 		btrfs_dec_block_group_reservations(fs_info, ins.objectid);
 
 		last_alloc = ins.offset;
-		ret = insert_reserved_file_extent(trans, inode,
-						  cur_offset, ins.objectid,
-						  ins.offset, ins.offset,
-						  ins.offset, 0, 0, 0,
-						  BTRFS_FILE_EXTENT_PREALLOC);
+		ret = insert_prealloc_file_extent(trans, inode, &ins,
+						  cur_offset);
 		if (ret) {
 			btrfs_free_reserved_extent(fs_info, ins.objectid,
 						   ins.offset, 0);
-- 
2.26.2



* [PATCH v3 2/5] btrfs: inode: move the qgroup reserved data space release into the callers of insert_reserved_file_extent()
  2020-06-10  1:04 [PATCH v3 0/5] btrfs: qgroup: detect and fix leaked data reserved space Qu Wenruo
  2020-06-10  1:04 ` [PATCH v3 1/5] btrfs: inode: refactor the parameters of insert_reserved_file_extent() Qu Wenruo
@ 2020-06-10  1:04 ` Qu Wenruo
  2020-06-12 18:49   ` Josef Bacik
  2020-06-10  1:04 ` [PATCH v3 3/5] btrfs: file: reserve qgroup space after the hole punch range locked Qu Wenruo
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 13+ messages in thread
From: Qu Wenruo @ 2020-06-10  1:04 UTC (permalink / raw)
  To: linux-btrfs

This is to prepare for the incoming timing change of qgroup reserved
data space and ordered extents.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/inode.c | 26 +++++++++++++-------------
 1 file changed, 13 insertions(+), 13 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index c076cbe9f492..09e1724d620a 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2449,7 +2449,8 @@ int btrfs_writepage_cow_fixup(struct page *page, u64 start, u64 end)
 
 static int insert_reserved_file_extent(struct btrfs_trans_handle *trans,
 				       struct inode *inode, u64 file_pos,
-				       struct btrfs_file_extent_item *stack_fi)
+				       struct btrfs_file_extent_item *stack_fi,
+				       u64 qgroup_reserved)
 {
 	struct btrfs_root *root = BTRFS_I(inode)->root;
 	struct btrfs_path *path;
@@ -2459,7 +2460,6 @@ static int insert_reserved_file_extent(struct btrfs_trans_handle *trans,
 	u64 disk_bytenr = btrfs_stack_file_extent_disk_bytenr(stack_fi);
 	u64 num_bytes = btrfs_stack_file_extent_num_bytes(stack_fi);
 	u64 ram_bytes = btrfs_stack_file_extent_ram_bytes(stack_fi);
-	u64 qg_released;
 	int extent_inserted = 0;
 	int ret;
 
@@ -2513,17 +2513,9 @@ static int insert_reserved_file_extent(struct btrfs_trans_handle *trans,
 	if (ret)
 		goto out;
 
-	/*
-	 * Release the reserved range from inode dirty range map, as it is
-	 * already moved into delayed_ref_head
-	 */
-	ret = btrfs_qgroup_release_data(inode, file_pos, ram_bytes);
-	if (ret < 0)
-		goto out;
-	qg_released = ret;
 	ret = btrfs_alloc_reserved_file_extent(trans, root,
 					       btrfs_ino(BTRFS_I(inode)),
-					       file_pos, qg_released, &ins);
+					       file_pos, qgroup_reserved, &ins);
 out:
 	btrfs_free_path(path);
 
@@ -2551,6 +2543,7 @@ static int insert_ordered_extent_file_extent(struct btrfs_trans_handle *trans,
 {
 	struct btrfs_file_extent_item stack_fi;
 	u64 logical_len;
+	int ret;
 
 	memset(&stack_fi, 0, sizeof(stack_fi));
 	btrfs_set_stack_file_extent_type(&stack_fi, BTRFS_FILE_EXTENT_REG);
@@ -2566,8 +2559,11 @@ static int insert_ordered_extent_file_extent(struct btrfs_trans_handle *trans,
 	btrfs_set_stack_file_extent_compression(&stack_fi, oe->compress_type);
 	/* Encryption and other encoding is reserved and all 0 */
 
+	ret = btrfs_qgroup_release_data(inode, oe->file_offset, logical_len);
+	if (ret < 0)
+		return ret;
 	return insert_reserved_file_extent(trans, inode, oe->file_offset,
-					   &stack_fi);
+					   &stack_fi, ret);
 }
 
 /*
@@ -9508,6 +9504,7 @@ static int insert_prealloc_file_extent(struct btrfs_trans_handle *trans,
 	struct btrfs_file_extent_item stack_fi;
 	u64 start = ins->objectid;
 	u64 len = ins->offset;
+	int ret;
 
 	memset(&stack_fi, 0, sizeof(stack_fi));
 
@@ -9519,8 +9516,11 @@ static int insert_prealloc_file_extent(struct btrfs_trans_handle *trans,
 	btrfs_set_stack_file_extent_compression(&stack_fi, BTRFS_COMPRESS_NONE);
 	/* Encryption and other encoding is reserved and all 0 */
 
+	ret = btrfs_qgroup_release_data(inode, file_offset, len);
+	if (ret < 0)
+		return ret;
 	return insert_reserved_file_extent(trans, inode, file_offset,
-					   &stack_fi);
+					   &stack_fi, ret);
 }
 static int __btrfs_prealloc_file_range(struct inode *inode, int mode,
 				       u64 start, u64 num_bytes, u64 min_size,
-- 
2.26.2



* [PATCH v3 3/5] btrfs: file: reserve qgroup space after the hole punch range locked
  2020-06-10  1:04 [PATCH v3 0/5] btrfs: qgroup: detect and fix leaked data reserved space Qu Wenruo
  2020-06-10  1:04 ` [PATCH v3 1/5] btrfs: inode: refactor the parameters of insert_reserved_file_extent() Qu Wenruo
  2020-06-10  1:04 ` [PATCH v3 2/5] btrfs: inode: move the qgroup reserved data space release into the callers " Qu Wenruo
@ 2020-06-10  1:04 ` Qu Wenruo
  2020-06-12 18:49   ` Josef Bacik
  2020-06-10  1:04 ` [PATCH v3 4/5] btrfs: change the timing for qgroup reserved space for ordered extents to fix reserved space leak Qu Wenruo
  2020-06-10  1:04 ` [PATCH v3 5/5] btrfs: qgroup: catch reserved space leakage at unmount time Qu Wenruo
  4 siblings, 1 reply; 13+ messages in thread
From: Qu Wenruo @ 2020-06-10  1:04 UTC (permalink / raw)
  To: linux-btrfs

The incoming qgroup reserved space timing will move the data
reservation to the ordered extent completely.

However, btrfs_punch_hole_lock_range() will call
btrfs_invalidate_page(), which clears the QGROUP_RESERVED bit for the
range.

At the current stage this is OK, but once ordered extents handle the
reserved space, btrfs_punch_hole_lock_range() can clear the
QGROUP_RESERVED bit before we submit the ordered extent, leading to a
qgroup reserved space leak.

So change the timing to reserve the data space after
btrfs_punch_hole_lock_range().
The new timing is fine for both the current code and the new code.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/file.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index fde125616687..3f4b8f7666a9 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -3188,14 +3188,14 @@ static int btrfs_zero_range(struct inode *inode,
 		if (ret < 0)
 			goto out;
 		space_reserved = true;
-		ret = btrfs_qgroup_reserve_data(inode, &data_reserved,
-						alloc_start, bytes_to_reserve);
-		if (ret)
-			goto out;
 		ret = btrfs_punch_hole_lock_range(inode, lockstart, lockend,
 						  &cached_state);
 		if (ret)
 			goto out;
+		ret = btrfs_qgroup_reserve_data(inode, &data_reserved,
+						alloc_start, bytes_to_reserve);
+		if (ret)
+			goto out;
 		ret = btrfs_prealloc_file_range(inode, mode, alloc_start,
 						alloc_end - alloc_start,
 						i_blocksize(inode),
-- 
2.26.2



* [PATCH v3 4/5] btrfs: change the timing for qgroup reserved space for ordered extents to fix reserved space leak
  2020-06-10  1:04 [PATCH v3 0/5] btrfs: qgroup: detect and fix leaked data reserved space Qu Wenruo
                   ` (2 preceding siblings ...)
  2020-06-10  1:04 ` [PATCH v3 3/5] btrfs: file: reserve qgroup space after the hole punch range locked Qu Wenruo
@ 2020-06-10  1:04 ` Qu Wenruo
  2020-06-12 18:50   ` Josef Bacik
  2020-06-16 15:17   ` David Sterba
  2020-06-10  1:04 ` [PATCH v3 5/5] btrfs: qgroup: catch reserved space leakage at unmount time Qu Wenruo
  4 siblings, 2 replies; 13+ messages in thread
From: Qu Wenruo @ 2020-06-10  1:04 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Josef Bacik

[BUG]
The following simple workload from fsstress can lead to qgroup reserved
data space leakage:
  0/0: creat f0 x:0 0 0
  0/0: creat add id=0,parent=-1
  0/1: write f0[259 1 0 0 0 0] [600030,27288] 0
  0/4: dwrite - xfsctl(XFS_IOC_DIOINFO) f0[259 1 0 0 64 627318] return 25, fallback to stat()
  0/4: dwrite f0[259 1 0 0 64 627318] [610304,106496] 0

This would cause btrfs qgroup to leak 20480 bytes of data reserved
space.
If the btrfs qgroup limit is enabled, such leakage can lead to
unexpected early EDQUOT errors and unusable space.

[CAUSE]
When doing direct IO, the kernel will try to write back the existing
buffered page cache, then invalidate it:
  iomap_dio_rw()
  |- filemap_write_and_wait_range();
  |- invalidate_inode_pages2_range();

However for btrfs, the bi_end_io hook doesn't finish all its heavy work
right when the bio ends; it defers the work further:
  submit_extent_page(end_io_func=end_bio_extent_writepage);
  end_bio_extent_writepage()
  |- btrfs_writepage_endio_finish_ordered()
     |- btrfs_init_work(finish_ordered_fn);

  <<< Work queue execution >>>
  finish_ordered_fn()
  |- btrfs_finish_ordered_io();
     |- Clear qgroup bits

This means that when filemap_write_and_wait_range() returns,
btrfs_finish_ordered_io() is not guaranteed to have been executed, thus
the qgroup bits for the related range are not cleared.

Now onto how the leakage happens; this focuses only on the part where
the buffered and direct IO ranges overlap.

1. After buffered write
   The inode had the following range with QGROUP_RESERVED bit:
   	596K		616K
	|///////////////|
   Qgroup reserved data space: 20K

2. Writeback for range [596K, 616K)
   Writeback finished, but btrfs_finish_ordered_io() has not been
   called yet.
   So we still have:
   	596K		616K
	|///////////////|
   Qgroup reserved data space: 20K

3. Pages for range [596K, 616K) get released
   This clears all the qgroup bits, but doesn't update the reserved
   data space.
   So we have:
   	596K		616K
	|		|
   Qgroup reserved data space: 20K
   That number no longer matches the qgroup bit range.

4. Dio prepares space for range [596K, 700K)
   Qgroup reserves data space for that range; we get:
   	596K		616K			700K
	|///////////////|///////////////////////|
   Qgroup reserved data space: 20K + 104K = 124K

5. btrfs_finish_ordered_io() gets executed for range [596K, 616K)
   Qgroup frees the reserved space for that range; we get:
   	596K		616K			700K
	|		|///////////////////////|
   Qgroup reserved data space: 124K - 20K = 104K

6. btrfs_finish_ordered_io() gets executed for range [596K, 700K)
   However the qgroup bits for range [596K, 616K) were already cleared
   in the previous step, so we only free 84K of qgroup reserved space.
   	596K		616K			700K
	|		|			|
   Qgroup reserved data space: 104K - 84K = 20K

   Now there is no way to release that 20K unless we disable qgroups
   or unmount the fs.

[FIX]
This patch changes the timing of the btrfs_qgroup_release/free_data()
calls.
Buffered CoW write is used as the example here.

	The new timing			|	The old timing
----------------------------------------+---------------------------------------
 btrfs_buffered_write()			| btrfs_buffered_write()
 |- btrfs_qgroup_reserve_data() 	| |- btrfs_qgroup_reserve_data()
					|
 btrfs_run_delalloc_range()		| btrfs_run_delalloc_range()
 |- btrfs_add_ordered_extent()  	|
    |- btrfs_qgroup_release_data()	|
       The reserved space is passed	|
       into btrfs_ordered_extent	|
					|
 btrfs_finish_ordered_io()		| btrfs_finish_ordered_io()
 |- The reserved space is passed to 	| |- btrfs_qgroup_release_data()
    btrfs_qgroup_record			|    The reserved space is passed
					|    to btrfs_qgroup_record
					|
 btrfs_qgroup_account_extents()		| btrfs_qgroup_account_extents()
 |- btrfs_qgroup_free_refroot()		| |- btrfs_qgroup_free_refroot()

The point of this change is to ensure that, when ordered extents are
submitted, the qgroup reserved space has already been released, keeping
the timing aligned with filemap_write_and_wait_range().

So qgroup data reserved space is all bound to btrfs_ordered_extent,
solving the timing mismatch.

Suggested-by: Josef Bacik <josef@toxicpanda.com>
Fixes: f695fdcef83a ("btrfs: qgroup: Introduce functions to release/free qgroup reserve data space")
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/inode.c        | 15 +--------------
 fs/btrfs/ordered-data.c | 22 +++++++++++++++++++++-
 fs/btrfs/ordered-data.h |  3 +++
 3 files changed, 25 insertions(+), 15 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 09e1724d620a..094926cc4982 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2543,7 +2543,6 @@ static int insert_ordered_extent_file_extent(struct btrfs_trans_handle *trans,
 {
 	struct btrfs_file_extent_item stack_fi;
 	u64 logical_len;
-	int ret;
 
 	memset(&stack_fi, 0, sizeof(stack_fi));
 	btrfs_set_stack_file_extent_type(&stack_fi, BTRFS_FILE_EXTENT_REG);
@@ -2559,11 +2558,8 @@ static int insert_ordered_extent_file_extent(struct btrfs_trans_handle *trans,
 	btrfs_set_stack_file_extent_compression(&stack_fi, oe->compress_type);
 	/* Encryption and other encoding is reserved and all 0 */
 
-	ret = btrfs_qgroup_release_data(inode, oe->file_offset, logical_len);
-	if (ret < 0)
-		return ret;
 	return insert_reserved_file_extent(trans, inode, oe->file_offset,
-					   &stack_fi, ret);
+					   &stack_fi, oe->qgroup_rsv);
 }
 
 /*
@@ -2618,13 +2614,6 @@ static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent)
 	if (test_bit(BTRFS_ORDERED_NOCOW, &ordered_extent->flags)) {
 		BUG_ON(!list_empty(&ordered_extent->list)); /* Logic error */
 
-		/*
-		 * For mwrite(mmap + memset to write) case, we still reserve
-		 * space for NOCOW range.
-		 * As NOCOW won't cause a new delayed ref, just free the space
-		 */
-		btrfs_qgroup_free_data(inode, NULL, start,
-				       ordered_extent->num_bytes);
 		btrfs_inode_safe_disk_i_size_write(inode, 0);
 		if (freespace_inode)
 			trans = btrfs_join_transaction_spacecache(root);
@@ -2661,8 +2650,6 @@ static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent)
 		compress_type = ordered_extent->compress_type;
 	if (test_bit(BTRFS_ORDERED_PREALLOC, &ordered_extent->flags)) {
 		BUG_ON(compress_type);
-		btrfs_qgroup_free_data(inode, NULL, start,
-				       ordered_extent->num_bytes);
 		ret = btrfs_mark_extent_written(trans, BTRFS_I(inode),
 						ordered_extent->file_offset,
 						ordered_extent->file_offset +
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index e13b3d28c063..c8bd7a4e67bb 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -15,6 +15,7 @@
 #include "disk-io.h"
 #include "compression.h"
 #include "delalloc-space.h"
+#include "qgroup.h"
 
 static struct kmem_cache *btrfs_ordered_extent_cache;
 
@@ -152,7 +153,8 @@ static inline struct rb_node *tree_search(struct btrfs_ordered_inode_tree *tree,
 	return ret;
 }
 
-/* allocate and add a new ordered_extent into the per-inode tree.
+/*
+ * Allocate and add a new ordered_extent into the per-inode tree.
  *
  * The tree is given a single reference on the ordered extent that was
  * inserted.
@@ -167,7 +169,24 @@ static int __btrfs_add_ordered_extent(struct inode *inode, u64 file_offset,
 	struct btrfs_ordered_inode_tree *tree;
 	struct rb_node *node;
 	struct btrfs_ordered_extent *entry;
+	int ret;
 
+	if (type == BTRFS_ORDERED_NOCOW || type == BTRFS_ORDERED_PREALLOC) {
+		/* For nocow write, we can release the qgroup rsv right now */
+		ret = btrfs_qgroup_free_data(inode, NULL, file_offset,
+					     num_bytes);
+		if (ret < 0)
+			return ret;
+		ret = 0;
+	} else {
+		/*
+		 * The ordered extent has reserved qgroup space, release now
+		 * and pass the reserved number for qgroup_record to free.
+		 */
+		ret = btrfs_qgroup_release_data(inode, file_offset, num_bytes);
+		if (ret < 0)
+			return ret;
+	}
 	tree = &BTRFS_I(inode)->ordered_tree;
 	entry = kmem_cache_zalloc(btrfs_ordered_extent_cache, GFP_NOFS);
 	if (!entry)
@@ -181,6 +200,7 @@ static int __btrfs_add_ordered_extent(struct inode *inode, u64 file_offset,
 	entry->inode = igrab(inode);
 	entry->compress_type = compress_type;
 	entry->truncated_len = (u64)-1;
+	entry->qgroup_rsv = ret;
 	if (type != BTRFS_ORDERED_IO_DONE && type != BTRFS_ORDERED_COMPLETE)
 		set_bit(type, &entry->flags);
 
diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
index c01c9698250b..4a506c5598f8 100644
--- a/fs/btrfs/ordered-data.h
+++ b/fs/btrfs/ordered-data.h
@@ -92,6 +92,9 @@ struct btrfs_ordered_extent {
 	/* compression algorithm */
 	int compress_type;
 
+	/* Qgroup reserved space */
+	int qgroup_rsv;
+
 	/* reference count */
 	refcount_t refs;
 
-- 
2.26.2



* [PATCH v3 5/5] btrfs: qgroup: catch reserved space leakage at unmount time
  2020-06-10  1:04 [PATCH v3 0/5] btrfs: qgroup: detect and fix leaked data reserved space Qu Wenruo
                   ` (3 preceding siblings ...)
  2020-06-10  1:04 ` [PATCH v3 4/5] btrfs: change the timing for qgroup reserved space for ordered extents to fix reserved space leak Qu Wenruo
@ 2020-06-10  1:04 ` Qu Wenruo
  2020-06-12 18:51   ` Josef Bacik
  4 siblings, 1 reply; 13+ messages in thread
From: Qu Wenruo @ 2020-06-10  1:04 UTC (permalink / raw)
  To: linux-btrfs

Before this patch, btrfs qgroup completely relies on the per-inode
extent io tree to detect reserved data space leakage.

However, the previous bug has already shown how releasing pages before
btrfs_finish_ordered_io() runs can lead to leakage, and since the
QGROUP_RESERVED bit is cleared without triggering a qgroup rsv release,
such a leak can't be detected by the per-inode extent io tree.

So this patch adds another (and hopefully the final) safety net to
catch qgroup data reserved space leakage.

At least the new safety net caught all the leakage during development,
so it should be pretty useful in the real world.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/disk-io.c |  5 +++++
 fs/btrfs/qgroup.c  | 43 +++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/qgroup.h  |  2 +-
 3 files changed, 49 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index f8ec2d8606fd..aaecaa4c64f5 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -4058,6 +4058,11 @@ void __cold close_ctree(struct btrfs_fs_info *fs_info)
 	ASSERT(list_empty(&fs_info->delayed_iputs));
 	set_bit(BTRFS_FS_CLOSING_DONE, &fs_info->flags);
 
+	if (btrfs_qgroup_has_leak(fs_info)) {
+		WARN_ON(IS_ENABLED(CONFIG_BTRFS_DEBUG));
+		btrfs_err(fs_info, "qgroup reserved space leaked\n");
+	}
+
 	btrfs_free_qgroup_config(fs_info);
 	ASSERT(list_empty(&fs_info->delalloc_roots));
 
diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index 5bd4089ad0e1..f899b2167fa8 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -505,6 +505,49 @@ int btrfs_read_qgroup_config(struct btrfs_fs_info *fs_info)
 	return ret < 0 ? ret : 0;
 }
 
+static u64 btrfs_qgroup_subvolid(u64 qgroupid)
+{
+	return (qgroupid & ((1ULL << BTRFS_QGROUP_LEVEL_SHIFT) - 1));
+}
+/*
+ * Get called for close_ctree() when quota is still enabled.
+ * This verifies we don't leak some reserved space.
+ *
+ * Return false if no reserved space is left.
+ * Return true if some reserved space is leaked.
+ */
+bool btrfs_qgroup_has_leak(struct btrfs_fs_info *fs_info)
+{
+	struct rb_node *node;
+	bool ret = false;
+
+	if (!test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags))
+		return ret;
+	/*
+	 * Since we're unmounting, there is no race and no need to grab
+	 * qgroup lock.
+	 * And here we don't go post order to provide a more user friendly
+	 * sorted result.
+	 */
+	for (node = rb_first(&fs_info->qgroup_tree); node; node = rb_next(node)) {
+		struct btrfs_qgroup *qgroup;
+		int i;
+
+		qgroup = rb_entry(node, struct btrfs_qgroup, node);
+		for (i = 0; i < BTRFS_QGROUP_RSV_LAST; i++) {
+			if (qgroup->rsv.values[i]) {
+				ret = true;
+				btrfs_warn(fs_info,
+		"qgroup %llu/%llu has unreleased space, type=%d rsv=%llu",
+				   btrfs_qgroup_level(qgroup->qgroupid),
+				   btrfs_qgroup_subvolid(qgroup->qgroupid),
+				   i, qgroup->rsv.values[i]);
+			}
+		}
+	}
+	return ret;
+}
+
 /*
  * This is called from close_ctree() or open_ctree() or btrfs_quota_disable(),
  * first two are in single-threaded paths.And for the third one, we have set
diff --git a/fs/btrfs/qgroup.h b/fs/btrfs/qgroup.h
index 1bc654459469..e3e9f9df8320 100644
--- a/fs/btrfs/qgroup.h
+++ b/fs/btrfs/qgroup.h
@@ -415,5 +415,5 @@ int btrfs_qgroup_add_swapped_blocks(struct btrfs_trans_handle *trans,
 int btrfs_qgroup_trace_subtree_after_cow(struct btrfs_trans_handle *trans,
 		struct btrfs_root *root, struct extent_buffer *eb);
 void btrfs_qgroup_destroy_extent_records(struct btrfs_transaction *trans);
-
+bool btrfs_qgroup_has_leak(struct btrfs_fs_info *fs_info);
 #endif
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH v3 1/5] btrfs: inode: refactor the parameters of insert_reserved_file_extent()
  2020-06-10  1:04 ` [PATCH v3 1/5] btrfs: inode: refactor the parameters of insert_reserved_file_extent() Qu Wenruo
@ 2020-06-12 18:46   ` Josef Bacik
  0 siblings, 0 replies; 13+ messages in thread
From: Josef Bacik @ 2020-06-12 18:46 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

On 6/9/20 9:04 PM, Qu Wenruo wrote:
> Function insert_reserved_file_extent() takes a long list of parameters,
> which are all for btrfs_file_extent_item, even including two reserved
> members, encryption and other_encoding.
> 
> This makes the parameter list unnecessarily long for a function which
> only gets called twice.
> 
> So refactor the parameter list by passing btrfs_file_extent_item
> directly, hugely reducing the number of parameters.
> 
> Also, there are only two callers: one in btrfs_finish_ordered_io(),
> which inserts the file extent for an ordered extent, and one in
> __btrfs_prealloc_file_range().
> 
> These two call sites have completely different contexts: the ordered
> extent can be compressed but will always be a regular extent, while the
> preallocated one is never compressed and always has the PREALLOC type.
> 
> So use two small wrappers for these two different call sites to improve
> readability.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>

I like this,

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef


* Re: [PATCH v3 2/5] btrfs: inode: move the qgroup reserved data space release into the callers of insert_reserved_file_extent()
  2020-06-10  1:04 ` [PATCH v3 2/5] btrfs: inode: move the qgroup reserved data space release into the callers " Qu Wenruo
@ 2020-06-12 18:49   ` Josef Bacik
  0 siblings, 0 replies; 13+ messages in thread
From: Josef Bacik @ 2020-06-12 18:49 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

On 6/9/20 9:04 PM, Qu Wenruo wrote:
> This is to prepare for the incoming timing change of qgroup reserved
> data space and ordered extent.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef


* Re: [PATCH v3 3/5] btrfs: file: reserve qgroup space after the hole punch range locked
  2020-06-10  1:04 ` [PATCH v3 3/5] btrfs: file: reserve qgroup space after the hole punch range locked Qu Wenruo
@ 2020-06-12 18:49   ` Josef Bacik
  0 siblings, 0 replies; 13+ messages in thread
From: Josef Bacik @ 2020-06-12 18:49 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

On 6/9/20 9:04 PM, Qu Wenruo wrote:
> The incoming qgroup reserved space timing will move the data reserve
> to the ordered extent completely.
> 
> However btrfs_punch_hole_lock_range() will call btrfs_invalidate_page(),
> which will clear the QGROUP_RESERVED bit for the range.
> 
> At the current stage this is OK, but once we make ordered extents
> handle the reserved space, btrfs_punch_hole_lock_range() can clear the
> QGROUP_RESERVED bit before we submit the ordered extent, leading to a
> qgroup reserved space leak.
> 
> So change the timing to reserve the data space after
> btrfs_punch_hole_lock_range().
> The new timing is fine for either the current code or the new code.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef


* Re: [PATCH v3 4/5] btrfs: change the timing for qgroup reserved space for ordered extents to fix reserved space leak
  2020-06-10  1:04 ` [PATCH v3 4/5] btrfs: change the timing for qgroup reserved space for ordered extents to fix reserved space leak Qu Wenruo
@ 2020-06-12 18:50   ` Josef Bacik
  2020-06-16 15:17   ` David Sterba
  1 sibling, 0 replies; 13+ messages in thread
From: Josef Bacik @ 2020-06-12 18:50 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

On 6/9/20 9:04 PM, Qu Wenruo wrote:
> [BUG]
> The following simple workload from fsstress can lead to qgroup reserved
> data space leakage:
>    0/0: creat f0 x:0 0 0
>    0/0: creat add id=0,parent=-1
>    0/1: write f0[259 1 0 0 0 0] [600030,27288] 0
>    0/4: dwrite - xfsctl(XFS_IOC_DIOINFO) f0[259 1 0 0 64 627318] return 25, fallback to stat()
>    0/4: dwrite f0[259 1 0 0 64 627318] [610304,106496] 0
> 
> This would cause btrfs qgroup to leak 20480 bytes of data reserved
> space.
> If the btrfs qgroup limit is enabled, such leakage can lead to
> unexpected early EDQUOT errors and unusable space.
> 
> [CAUSE]
> When doing direct IO, the kernel will try to write back the existing
> buffered page cache, then invalidate it:
>    iomap_dio_rw()
>    |- filemap_write_and_wait_range();
>    |- invalidate_inode_pages2_range();
> 
> However for btrfs, the bi_end_io hook doesn't finish all of its heavy
> work right when the bio ends.
> In fact, it delays the work further:
>    submit_extent_page(end_io_func=end_bio_extent_writepage);
>    end_bio_extent_writepage()
>    |- btrfs_writepage_endio_finish_ordered()
>       |- btrfs_init_work(finish_ordered_fn);
> 
>    <<< Work queue execution >>>
>    finish_ordered_fn()
>    |- btrfs_finish_ordered_io();
>       |- Clear qgroup bits
> 
> This means that when filemap_write_and_wait_range() returns,
> btrfs_finish_ordered_io() is not guaranteed to have been executed, thus
> the qgroup bits for the related range are not cleared.
> 
> Now onto how the leakage happens; this will only focus on the
> overlapping part of the buffered and direct IO ranges.
> 
> 1. After buffered write
>     The inode had the following range with the QGROUP_RESERVED bit set:
>     	596K		616K
> 	|///////////////|
>     Qgroup reserved data space: 20K
> 
> 2. Writeback for range [596K, 616K)
>     Writeback finished, but btrfs_finish_ordered_io() has not been
>     called yet.
>     So we still have:
>     	596K		616K
> 	|///////////////|
>     Qgroup reserved data space: 20K
> 
> 3. Pages for range [596K, 616K) get released
>     This will clear all qgroup bits, but doesn't update the reserved
>     data space.
>     So we have:
>     	596K		616K
> 	|		|
>     Qgroup reserved data space: 20K
>     That number doesn't match with the qgroup bit range anymore.
> 
> 4. Dio prepares space for range [596K, 700K)
>     Qgroup reserves data space for that range, and we get:
>     	596K		616K			700K
> 	|///////////////|///////////////////////|
>     Qgroup reserved data space: 20K + 104K = 124K
> 
> 5. btrfs_finish_ordered_io() gets executed for range [596K, 616K)
>     Qgroup frees the reserved space for that range, and we get:
>     	596K		616K			700K
> 	|		|///////////////////////|
>     We need to free that range of reserved space.
>     Qgroup reserved data space: 124K - 20K = 104K
> 
> 6. btrfs_finish_ordered_io() gets executed for range [596K, 700K)
>     However the qgroup bit for range [596K, 616K) was already cleared
>     in the previous step, so we only free 84K of qgroup reserved space.
>     	596K		616K			700K
> 	|		|			|
>     We need to free that range of reserved space.
>     Qgroup reserved data space: 104K - 84K = 20K
> 
>     Now there is no way to release that 20K unless qgroups are disabled
>     or the fs is unmounted.
> 
> [FIX]
> This patch changes the timing of when btrfs_qgroup_release/free_data()
> gets called.
> Buffered CoW write is used as an example here.
> 
> 	The new timing			|	The old timing
> ----------------------------------------+---------------------------------------
>   btrfs_buffered_write()			| btrfs_buffered_write()
>   |- btrfs_qgroup_reserve_data() 	| |- btrfs_qgroup_reserve_data()
> 					|
>   btrfs_run_delalloc_range()		| btrfs_run_delalloc_range()
>   |- btrfs_add_ordered_extent()  	|
>      |- btrfs_qgroup_release_data()	|
>         The reserved is passed into	|
>         btrfs_ordered_extent structure	|
> 					|
>   btrfs_finish_ordered_io()		| btrfs_finish_ordered_io()
>   |- The reserved space is passed to 	| |- btrfs_qgroup_release_data()
>      btrfs_qgroup_record			|    The reserved space is passed
> 					|    to btrfs_qgroup_record
> 					|
>   btrfs_qgroup_account_extents()		| btrfs_qgroup_account_extents()
>   |- btrfs_qgroup_free_refroot()		| |- btrfs_qgroup_free_refroot()
> 
> The point of this change is to ensure that when ordered extents are
> submitted, the qgroup reserved space has already been released, keeping
> the timing aligned with filemap_write_and_wait_range().
> 
> That way all qgroup data reserved space is bound to a
> btrfs_ordered_extent, and the timing mismatch is solved.
> 
> Suggested-by: Josef Bacik <josef@toxicpanda.com>
> Fixes: f695fdcef83a ("btrfs: qgroup: Introduce functions to release/free qgroup reserve data space")
> Signed-off-by: Qu Wenruo <wqu@suse.com>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v3 5/5] btrfs: qgroup: catch reserved space leakage at unmount time
  2020-06-10  1:04 ` [PATCH v3 5/5] btrfs: qgroup: catch reserved space leakage at unmount time Qu Wenruo
@ 2020-06-12 18:51   ` Josef Bacik
  0 siblings, 0 replies; 13+ messages in thread
From: Josef Bacik @ 2020-06-12 18:51 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

On 6/9/20 9:04 PM, Qu Wenruo wrote:
> Before this patch, btrfs qgroup completely relies on the per-inode extent
> io tree to detect reserved data space leakage.
> 
> However the previous bug has already shown how releasing pages before
> btrfs_finish_ordered_io() could lead to leakage, and since the
> QGROUP_RESERVED bit is cleared without triggering a qgroup rsv release,
> such leakage can't be detected by the per-inode extent io tree.
> 
> So this patch adds another (and hopefully the final) safety net to catch
> qgroup data reserved space leakage.
> 
> At least the new safety net caught all the leakage during development, so
> it should be pretty useful in the real world.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef


* Re: [PATCH v3 4/5] btrfs: change the timing for qgroup reserved space for ordered extents to fix reserved space leak
  2020-06-10  1:04 ` [PATCH v3 4/5] btrfs: change the timing for qgroup reserved space for ordered extents to fix reserved space leak Qu Wenruo
  2020-06-12 18:50   ` Josef Bacik
@ 2020-06-16 15:17   ` David Sterba
  2020-06-16 23:55     ` Qu Wenruo
  1 sibling, 1 reply; 13+ messages in thread
From: David Sterba @ 2020-06-16 15:17 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, Josef Bacik

On Wed, Jun 10, 2020 at 09:04:43AM +0800, Qu Wenruo wrote:
> [BUG]
> The following simple workload from fsstress can lead to qgroup reserved
> data space leakage:
>   0/0: creat f0 x:0 0 0
>   0/0: creat add id=0,parent=-1
>   0/1: write f0[259 1 0 0 0 0] [600030,27288] 0
>   0/4: dwrite - xfsctl(XFS_IOC_DIOINFO) f0[259 1 0 0 64 627318] return 25, fallback to stat()
>   0/4: dwrite f0[259 1 0 0 64 627318] [610304,106496] 0
> 
> This would cause btrfs qgroup to leak 20480 bytes of data reserved
> space.
> If the btrfs qgroup limit is enabled, such leakage can lead to
> unexpected early EDQUOT errors and unusable space.
> 
> [CAUSE]
> When doing direct IO, the kernel will try to write back the existing
> buffered page cache, then invalidate it:
>   iomap_dio_rw()
>   |- filemap_write_and_wait_range();
>   |- invalidate_inode_pages2_range();

As the dio-iomap got reverted, can you please update the changelog and
review if the changes are still valid? The whole patchset is in
misc-next so I'll update the changelog in place if needed, or replace
the whole patchset. Thanks.


* Re: [PATCH v3 4/5] btrfs: change the timing for qgroup reserved space for ordered extents to fix reserved space leak
  2020-06-16 15:17   ` David Sterba
@ 2020-06-16 23:55     ` Qu Wenruo
  0 siblings, 0 replies; 13+ messages in thread
From: Qu Wenruo @ 2020-06-16 23:55 UTC (permalink / raw)
  To: dsterba, Qu Wenruo, linux-btrfs, Josef Bacik





On 2020/6/16 11:17 PM, David Sterba wrote:
> On Wed, Jun 10, 2020 at 09:04:43AM +0800, Qu Wenruo wrote:
>> [BUG]
>> The following simple workload from fsstress can lead to qgroup reserved
>> data space leakage:
>>   0/0: creat f0 x:0 0 0
>>   0/0: creat add id=0,parent=-1
>>   0/1: write f0[259 1 0 0 0 0] [600030,27288] 0
>>   0/4: dwrite - xfsctl(XFS_IOC_DIOINFO) f0[259 1 0 0 64 627318] return 25, fallback to stat()
>>   0/4: dwrite f0[259 1 0 0 64 627318] [610304,106496] 0
>>
>> This would cause btrfs qgroup to leak 20480 bytes of data reserved
>> space.
>> If the btrfs qgroup limit is enabled, such leakage can lead to
>> unexpected early EDQUOT errors and unusable space.
>>
>> [CAUSE]
>> When doing direct IO, the kernel will try to write back the existing
>> buffered page cache, then invalidate it:
>>   iomap_dio_rw()
>>   |- filemap_write_and_wait_range();
>>   |- invalidate_inode_pages2_range();
> 
> As the dio-iomap got reverted, can you please update the changelog and
> review if the changes are still valid? The whole patchset is in
> misc-next so I'll update the changelog in place if needed, or replace
> the whole patchset. Thanks.
> 
After reviewing the reverted code, the change is still valid here, as
filemap_write_and_wait_range() and invalidate_inode_pages2_range() are
still called in generic_file_direct_write().

And without the timing change patches, the safety net can still detect
the leakage, and my existing seeds reproduce the same problem.
So we still need the series.

For the changelog update, I'll send out v4 patches, but the changelog
modifications are pretty small.
I guess only the first patch and this one need small modifications
(the first needs some wording changes, while for this one only the
function name needs to be modified).

Thanks,
Qu




