[PATCH v2 0/2] btrfs: trim: Fix a bug certain range may not be trimmed properly

linux-btrfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v2 0/2] btrfs: trim: Fix a bug certain range may not be trimmed properly
@ 2019-10-23 12:56 Qu Wenruo
  2019-10-23 12:56 ` [PATCH v2 1/2] btrfs: volumes: Use more straightforward way to calculate map length Qu Wenruo
  2019-10-23 12:56 ` [PATCH v2 2/2] btrfs: extent-tree: Ensure we trim ranges across block group boundary Qu Wenruo
  0 siblings, 2 replies; 6+ messages in thread
From: Qu Wenruo @ 2019-10-23 12:56 UTC (permalink / raw)
  To: linux-btrfs

There is a bug report about discard mount option not trimming some range
properly, and causing unexpected space usage for thin device.

It turns out to be that, if there are pinned extents across block group
boundary, we will only trim to the end of current block group, skipping
the remaining.

This patchset will fix it by ensuring btrfs_discard_extent() will
iterate the full range before exiting.

The first patch is just a tiny readability improvement found during the
fix.
The second patch is the main body of the fix.

Meanwhile I'm still looking into how to craft such test case for btrfs,
so the test case may be late for several days.

Changelog:
v2:
- Fold the __btrfs_map_block_for_discard() @length change into the main
  patch
  Since the @length parameter change itself doesn't make much sense,
  folding it into the fix looks more reasonable.

- Split the tiny readability improvement patch into its own patch

Qu Wenruo (2):
  btrfs: volumes: Use more straightforward way to calculate map length
  btrfs: extent-tree: Ensure we trim ranges across block group boundary

 fs/btrfs/extent-tree.c | 40 ++++++++++++++++++++++++++++++----------
 fs/btrfs/volumes.c     |  8 +++++---
 2 files changed, 35 insertions(+), 13 deletions(-)

-- 
2.23.0

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [PATCH v2 1/2] btrfs: volumes: Use more straightforward way to calculate map length
  2019-10-23 12:56 [PATCH v2 0/2] btrfs: trim: Fix a bug certain range may not be trimmed properly Qu Wenruo
@ 2019-10-23 12:56 ` Qu Wenruo
  2019-10-23 13:11   ` Nikolay Borisov
  2019-10-23 12:56 ` [PATCH v2 2/2] btrfs: extent-tree: Ensure we trim ranges across block group boundary Qu Wenruo
  1 sibling, 1 reply; 6+ messages in thread
From: Qu Wenruo @ 2019-10-23 12:56 UTC (permalink / raw)
  To: linux-btrfs

The old code goes:

 	offset = logical - em->start;
	length = min_t(u64, em->len - offset, length);

Where @length calculate is dependent on offset, it can take reader
several more seconds to find it's just the same code as:

 	offset = logical - em->start;
	length = min_t(u64, em->start + em->len - logical, length);

Use above code to make the length calculate independent from other
variable, thus slightly increase the readability.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/volumes.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index cdd7af424033..a6db11e821a5 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5616,7 +5616,7 @@ static int __btrfs_map_block_for_discard(struct btrfs_fs_info *fs_info,
 	}
 
 	offset = logical - em->start;
-	length = min_t(u64, em->len - offset, length);
+	length = min_t(u64, em->start + em->len - logical, length);
 
 	stripe_len = map->stripe_len;
 	/*
-- 
2.23.0


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [PATCH v2 2/2] btrfs: extent-tree: Ensure we trim ranges across block group boundary
  2019-10-23 12:56 [PATCH v2 0/2] btrfs: trim: Fix a bug certain range may not be trimmed properly Qu Wenruo
  2019-10-23 12:56 ` [PATCH v2 1/2] btrfs: volumes: Use more straightforward way to calculate map length Qu Wenruo
@ 2019-10-23 12:56 ` Qu Wenruo
  2019-10-23 13:37   ` Nikolay Borisov
  1 sibling, 1 reply; 6+ messages in thread
From: Qu Wenruo @ 2019-10-23 12:56 UTC (permalink / raw)
  To: linux-btrfs

[BUG]
When deleting large files (which cross block group boundary) with discard
mount option, we find some btrfs_discard_extent() calls only trimmed part
of its space, not the whole range:

  btrfs_discard_extent: type=0x1 start=19626196992 len=2144530432 trimmed=1073741824 ratio=50%

type:		bbio->map_type, in above case, it's SINGLE DATA.
start:		Logical address of this trim
len:		Logical length of this trim
trimmed:	Physically trimmed bytes
ratio:		trimmed / len

Thus leading some unused space not discarded.

[CAUSE]
When discard mount option is specified, after a transaction is fully
committed (super block written to disk), we begin to cleanup pinned
extents in the following call chain:

btrfs_commit_transaction()
|- write_all_supers()
|- btrfs_finish_extent_commit()
   |- find_first_extent_bit(unpin, 0, &start, &end, EXTENT_DIRTY);
   |- btrfs_discard_extent()

However pinned extents are recorded in an extent_io_tree, which can
merge adjacent extent states.

When a large file get deleted and it has adjacent file extents across
block group boundary, we will get a large merged range.

Then when we pass the large range into btrfs_discard_extent(),
btrfs_discard_extent() will just trim the first part, without trimming
the remaining part.

Furthermore, this bug is not that reliably observed, as if the whole
block group is empty, there will be another trim for that block group.

So the most obvious way to find this missing trim needs to delete large
extents at block group boundary without empting involved block groups.

[FIX]
- Allow __btrfs_map_block_for_discard() to modify @length parameter
  btrfs_map_block() uses its @length paramter to notify the caller how
  many bytes are mapped in current call.
  With __btrfs_map_block_for_discard() also modifing the @length,
  btrfs_discard_extent() now understands if it needs to do next trim.

- Call btrfs_map_block() in a loop until we hit the range end
  Since we now know how many bytes are mapped each time, we can iterate
  through each block group boundary and issue correct trim for each
  range.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent-tree.c | 40 ++++++++++++++++++++++++++++++----------
 fs/btrfs/volumes.c     |  6 ++++--
 2 files changed, 34 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 49cb26fa7c63..45df45fa775b 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -1306,8 +1306,10 @@ static int btrfs_issue_discard(struct block_device *bdev, u64 start, u64 len,
 int btrfs_discard_extent(struct btrfs_fs_info *fs_info, u64 bytenr,
 			 u64 num_bytes, u64 *actual_bytes)
 {
-	int ret;
+	int ret = 0;
 	u64 discarded_bytes = 0;
+	u64 end = bytenr + num_bytes;
+	u64 cur = bytenr;
 	struct btrfs_bio *bbio = NULL;
 
 
@@ -1316,15 +1318,22 @@ int btrfs_discard_extent(struct btrfs_fs_info *fs_info, u64 bytenr,
 	 * associated to its stripes that don't go away while we are discarding.
 	 */
 	btrfs_bio_counter_inc_blocked(fs_info);
-	/* Tell the block device(s) that the sectors can be discarded */
-	ret = btrfs_map_block(fs_info, BTRFS_MAP_DISCARD, bytenr, &num_bytes,
-			      &bbio, 0);
-	/* Error condition is -ENOMEM */
-	if (!ret) {
-		struct btrfs_bio_stripe *stripe = bbio->stripes;
+	while (cur < end) {
+		struct btrfs_bio_stripe *stripe;
 		int i;
 
+		/* Tell the block device(s) that the sectors can be discarded */
+		ret = btrfs_map_block(fs_info, BTRFS_MAP_DISCARD, cur,
+				      &num_bytes, &bbio, 0);
+		/*
+		 * Error can be -ENOMEM, -ENOENT (no such chunk mapping) or
+		 * -EOPNOTSUPP. For any such error, @num_bytes is not updated,
+		 * thus we can't continue anyway.
+		 */
+		if (ret < 0)
+			goto out;
 
+		stripe = bbio->stripes;
 		for (i = 0; i < bbio->num_stripes; i++, stripe++) {
 			u64 bytes;
 			struct request_queue *req_q;
@@ -1341,10 +1350,19 @@ int btrfs_discard_extent(struct btrfs_fs_info *fs_info, u64 bytenr,
 						  stripe->physical,
 						  stripe->length,
 						  &bytes);
-			if (!ret)
+			if (!ret) {
 				discarded_bytes += bytes;
-			else if (ret != -EOPNOTSUPP)
-				break; /* Logic errors or -ENOMEM, or -EIO but I don't know how that could happen JDM */
+			} else if (ret != -EOPNOTSUPP) {
+				/*
+				 * Logic errors or -ENOMEM, or -EIO but I don't
+				 * know how that could happen JDM
+				 *
+				 * Ans since there are two loops, explicitly
+				 * goto out to avoid confusion.
+				 */
+				btrfs_put_bbio(bbio);
+				goto out;
+			}
 
 			/*
 			 * Just in case we get back EOPNOTSUPP for some reason,
@@ -1354,7 +1372,9 @@ int btrfs_discard_extent(struct btrfs_fs_info *fs_info, u64 bytenr,
 			ret = 0;
 		}
 		btrfs_put_bbio(bbio);
+		cur += num_bytes;
 	}
+out:
 	btrfs_bio_counter_dec(fs_info);
 
 	if (actual_bytes)
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index a6db11e821a5..f66bd0d03f44 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5578,12 +5578,13 @@ void btrfs_put_bbio(struct btrfs_bio *bbio)
  * replace.
  */
 static int __btrfs_map_block_for_discard(struct btrfs_fs_info *fs_info,
-					 u64 logical, u64 length,
+					 u64 logical, u64 *length_ret,
 					 struct btrfs_bio **bbio_ret)
 {
 	struct extent_map *em;
 	struct map_lookup *map;
 	struct btrfs_bio *bbio;
+	u64 length = *length_ret;
 	u64 offset;
 	u64 stripe_nr;
 	u64 stripe_nr_end;
@@ -5617,6 +5618,7 @@ static int __btrfs_map_block_for_discard(struct btrfs_fs_info *fs_info,
 
 	offset = logical - em->start;
 	length = min_t(u64, em->start + em->len - logical, length);
+	*length_ret = length;
 
 	stripe_len = map->stripe_len;
 	/*
@@ -6031,7 +6033,7 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info,
 
 	if (op == BTRFS_MAP_DISCARD)
 		return __btrfs_map_block_for_discard(fs_info, logical,
-						     *length, bbio_ret);
+						     length, bbio_ret);
 
 	ret = btrfs_get_io_geometry(fs_info, op, logical, *length, &geom);
 	if (ret < 0)
-- 
2.23.0


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH v2 1/2] btrfs: volumes: Use more straightforward way to calculate map length
  2019-10-23 12:56 ` [PATCH v2 1/2] btrfs: volumes: Use more straightforward way to calculate map length Qu Wenruo
@ 2019-10-23 13:11   ` Nikolay Borisov
  0 siblings, 0 replies; 6+ messages in thread
From: Nikolay Borisov @ 2019-10-23 13:11 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs



On 23.10.19 г. 15:56 ч., Qu Wenruo wrote:
> The old code goes:
> 
>  	offset = logical - em->start;
> 	length = min_t(u64, em->len - offset, length);
> 
> Where @length calculate is dependent on offset, it can take reader
> several more seconds to find it's just the same code as:
> 
>  	offset = logical - em->start;
> 	length = min_t(u64, em->start + em->len - logical, length);
> 
> Use above code to make the length calculate independent from other
> variable, thus slightly increase the readability.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>

I agree, this calculates whether we need to trim the rest of the BG or
not.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>


> ---
>  fs/btrfs/volumes.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index cdd7af424033..a6db11e821a5 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -5616,7 +5616,7 @@ static int __btrfs_map_block_for_discard(struct btrfs_fs_info *fs_info,
>  	}
>  
>  	offset = logical - em->start;
> -	length = min_t(u64, em->len - offset, length);
> +	length = min_t(u64, em->start + em->len - logical, length);
>  
>  	stripe_len = map->stripe_len;
>  	/*
> 

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH v2 2/2] btrfs: extent-tree: Ensure we trim ranges across block group boundary
  2019-10-23 12:56 ` [PATCH v2 2/2] btrfs: extent-tree: Ensure we trim ranges across block group boundary Qu Wenruo
@ 2019-10-23 13:37   ` Nikolay Borisov
  2019-10-23 13:48     ` Qu Wenruo
  0 siblings, 1 reply; 6+ messages in thread
From: Nikolay Borisov @ 2019-10-23 13:37 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs



On 23.10.19 г. 15:56 ч., Qu Wenruo wrote:
> [BUG]
> When deleting large files (which cross block group boundary) with discard
> mount option, we find some btrfs_discard_extent() calls only trimmed part
> of its space, not the whole range:
> 
>   btrfs_discard_extent: type=0x1 start=19626196992 len=2144530432 trimmed=1073741824 ratio=50%
> 
> type:		bbio->map_type, in above case, it's SINGLE DATA.
> start:		Logical address of this trim
> len:		Logical length of this trim
> trimmed:	Physically trimmed bytes
> ratio:		trimmed / len
> 
> Thus leading some unused space not discarded.
> 
> [CAUSE]
> When discard mount option is specified, after a transaction is fully
> committed (super block written to disk), we begin to cleanup pinned
> extents in the following call chain:
> 
> btrfs_commit_transaction()
> |- write_all_supers()
> |- btrfs_finish_extent_commit()
>    |- find_first_extent_bit(unpin, 0, &start, &end, EXTENT_DIRTY);
>    |- btrfs_discard_extent()
> 
> However pinned extents are recorded in an extent_io_tree, which can
> merge adjacent extent states.
> 
> When a large file get deleted and it has adjacent file extents across
> block group boundary, we will get a large merged range.
> 
> Then when we pass the large range into btrfs_discard_extent(),
> btrfs_discard_extent() will just trim the first part, without trimming
> the remaining part.
> 
> Furthermore, this bug is not that reliably observed, as if the whole
> block group is empty, there will be another trim for that block group.
> 
> So the most obvious way to find this missing trim needs to delete large
> extents at block group boundary without empting involved block groups.
> 
> [FIX]
> - Allow __btrfs_map_block_for_discard() to modify @length parameter
>   btrfs_map_block() uses its @length paramter to notify the caller how
>   many bytes are mapped in current call.
>   With __btrfs_map_block_for_discard() also modifing the @length,
>   btrfs_discard_extent() now understands if it needs to do next trim.
> 
> - Call btrfs_map_block() in a loop until we hit the range end
>   Since we now know how many bytes are mapped each time, we can iterate
>   through each block group boundary and issue correct trim for each
>   range.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>  fs/btrfs/extent-tree.c | 40 ++++++++++++++++++++++++++++++----------
>  fs/btrfs/volumes.c     |  6 ++++--
>  2 files changed, 34 insertions(+), 12 deletions(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 49cb26fa7c63..45df45fa775b 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -1306,8 +1306,10 @@ static int btrfs_issue_discard(struct block_device *bdev, u64 start, u64 len,
>  int btrfs_discard_extent(struct btrfs_fs_info *fs_info, u64 bytenr,
>  			 u64 num_bytes, u64 *actual_bytes)
>  {
> -	int ret;
> +	int ret = 0;
>  	u64 discarded_bytes = 0;
> +	u64 end = bytenr + num_bytes;
> +	u64 cur = bytenr;
>  	struct btrfs_bio *bbio = NULL;
>  
>  
> @@ -1316,15 +1318,22 @@ int btrfs_discard_extent(struct btrfs_fs_info *fs_info, u64 bytenr,
>  	 * associated to its stripes that don't go away while we are discarding.
>  	 */
>  	btrfs_bio_counter_inc_blocked(fs_info);
> -	/* Tell the block device(s) that the sectors can be discarded */
> -	ret = btrfs_map_block(fs_info, BTRFS_MAP_DISCARD, bytenr, &num_bytes,
> -			      &bbio, 0);
> -	/* Error condition is -ENOMEM */
> -	if (!ret) {
> -		struct btrfs_bio_stripe *stripe = bbio->stripes;
> +	while (cur < end) {
> +		struct btrfs_bio_stripe *stripe;
>  		int i;
>  
> +		/* Tell the block device(s) that the sectors can be discarded */
> +		ret = btrfs_map_block(fs_info, BTRFS_MAP_DISCARD, cur,
> +				      &num_bytes, &bbio, 0);
> +		/*
> +		 * Error can be -ENOMEM, -ENOENT (no such chunk mapping) or
> +		 * -EOPNOTSUPP. For any such error, @num_bytes is not updated,
> +		 * thus we can't continue anyway.
> +		 */
> +		if (ret < 0)
> +			goto out;
>  
> +		stripe = bbio->stripes;
>  		for (i = 0; i < bbio->num_stripes; i++, stripe++) {
>  			u64 bytes;
>  			struct request_queue *req_q;
> @@ -1341,10 +1350,19 @@ int btrfs_discard_extent(struct btrfs_fs_info *fs_info, u64 bytenr,
>  						  stripe->physical,
>  						  stripe->length,
>  						  &bytes);
> -			if (!ret)
> +			if (!ret) {
>  				discarded_bytes += bytes;
> -			else if (ret != -EOPNOTSUPP)
> -				break; /* Logic errors or -ENOMEM, or -EIO but I don't know how that could happen JDM */
> +			} else if (ret != -EOPNOTSUPP) {
> +				/*
> +				 * Logic errors or -ENOMEM, or -EIO but I don't
> +				 * know how that could happen JDM
> +				 *
> +				 * Ans since there are two loops, explicitly
> +				 * goto out to avoid confusion.
> +				 */
> +				btrfs_put_bbio(bbio);
> +				goto out;
> +			}
>  
>  			/*
>  			 * Just in case we get back EOPNOTSUPP for some reason,
> @@ -1354,7 +1372,9 @@ int btrfs_discard_extent(struct btrfs_fs_info *fs_info, u64 bytenr,
>  			ret = 0;
>  		}
>  		btrfs_put_bbio(bbio);
> +		cur += num_bytes;

Shouldn't  you keep the number of bytes requested to discard and deduct
from the actual amount being discarded.

Say you want to discard 1G but for whatever reason this has to be done
in, say, 1 chunk of 750mb and a second chunk of 250 mb. On the first go
you will discard 750mb and add them to cur, numbytes at this point are
750mb then you call btrfs_map_block with num_bytes being 750mb whereas
they should have been 250mb. And if the code manages to map the 2nd
750mb it could potentially trim more than requested and corrupt data.

>  	}
> +out:
>  	btrfs_bio_counter_dec(fs_info);
>  
>  	if (actual_bytes)
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index a6db11e821a5..f66bd0d03f44 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -5578,12 +5578,13 @@ void btrfs_put_bbio(struct btrfs_bio *bbio)
>   * replace.
>   */
>  static int __btrfs_map_block_for_discard(struct btrfs_fs_info *fs_info,
> -					 u64 logical, u64 length,
> +					 u64 logical, u64 *length_ret,
>  					 struct btrfs_bio **bbio_ret)
>  {
>  	struct extent_map *em;
>  	struct map_lookup *map;
>  	struct btrfs_bio *bbio;
> +	u64 length = *length_ret;
>  	u64 offset;
>  	u64 stripe_nr;
>  	u64 stripe_nr_end;
> @@ -5617,6 +5618,7 @@ static int __btrfs_map_block_for_discard(struct btrfs_fs_info *fs_info,
>  
>  	offset = logical - em->start;
>  	length = min_t(u64, em->start + em->len - logical, length);
> +	*length_ret = length;
>  
>  	stripe_len = map->stripe_len;
>  	/*
> @@ -6031,7 +6033,7 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info,
>  
>  	if (op == BTRFS_MAP_DISCARD)
>  		return __btrfs_map_block_for_discard(fs_info, logical,
> -						     *length, bbio_ret);
> +						     length, bbio_ret);
>  
>  	ret = btrfs_get_io_geometry(fs_info, op, logical, *length, &geom);
>  	if (ret < 0)
> 

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH v2 2/2] btrfs: extent-tree: Ensure we trim ranges across block group boundary
  2019-10-23 13:37   ` Nikolay Borisov
@ 2019-10-23 13:48     ` Qu Wenruo
  0 siblings, 0 replies; 6+ messages in thread
From: Qu Wenruo @ 2019-10-23 13:48 UTC (permalink / raw)
  To: Nikolay Borisov, Qu Wenruo, linux-btrfs



On 2019/10/23 下午9:37, Nikolay Borisov wrote:
>
>
> On 23.10.19 г. 15:56 ч., Qu Wenruo wrote:
>> [BUG]
>> When deleting large files (which cross block group boundary) with discard
>> mount option, we find some btrfs_discard_extent() calls only trimmed part
>> of its space, not the whole range:
>>
>>   btrfs_discard_extent: type=0x1 start=19626196992 len=2144530432 trimmed=1073741824 ratio=50%
>>
>> type:		bbio->map_type, in above case, it's SINGLE DATA.
>> start:		Logical address of this trim
>> len:		Logical length of this trim
>> trimmed:	Physically trimmed bytes
>> ratio:		trimmed / len
>>
>> Thus leading some unused space not discarded.
>>
>> [CAUSE]
>> When discard mount option is specified, after a transaction is fully
>> committed (super block written to disk), we begin to cleanup pinned
>> extents in the following call chain:
>>
>> btrfs_commit_transaction()
>> |- write_all_supers()
>> |- btrfs_finish_extent_commit()
>>    |- find_first_extent_bit(unpin, 0, &start, &end, EXTENT_DIRTY);
>>    |- btrfs_discard_extent()
>>
>> However pinned extents are recorded in an extent_io_tree, which can
>> merge adjacent extent states.
>>
>> When a large file get deleted and it has adjacent file extents across
>> block group boundary, we will get a large merged range.
>>
>> Then when we pass the large range into btrfs_discard_extent(),
>> btrfs_discard_extent() will just trim the first part, without trimming
>> the remaining part.
>>
>> Furthermore, this bug is not that reliably observed, as if the whole
>> block group is empty, there will be another trim for that block group.
>>
>> So the most obvious way to find this missing trim needs to delete large
>> extents at block group boundary without empting involved block groups.
>>
>> [FIX]
>> - Allow __btrfs_map_block_for_discard() to modify @length parameter
>>   btrfs_map_block() uses its @length paramter to notify the caller how
>>   many bytes are mapped in current call.
>>   With __btrfs_map_block_for_discard() also modifing the @length,
>>   btrfs_discard_extent() now understands if it needs to do next trim.
>>
>> - Call btrfs_map_block() in a loop until we hit the range end
>>   Since we now know how many bytes are mapped each time, we can iterate
>>   through each block group boundary and issue correct trim for each
>>   range.
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>>  fs/btrfs/extent-tree.c | 40 ++++++++++++++++++++++++++++++----------
>>  fs/btrfs/volumes.c     |  6 ++++--
>>  2 files changed, 34 insertions(+), 12 deletions(-)
>>
>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>> index 49cb26fa7c63..45df45fa775b 100644
>> --- a/fs/btrfs/extent-tree.c
>> +++ b/fs/btrfs/extent-tree.c
>> @@ -1306,8 +1306,10 @@ static int btrfs_issue_discard(struct block_device *bdev, u64 start, u64 len,
>>  int btrfs_discard_extent(struct btrfs_fs_info *fs_info, u64 bytenr,
>>  			 u64 num_bytes, u64 *actual_bytes)
>>  {
>> -	int ret;
>> +	int ret = 0;
>>  	u64 discarded_bytes = 0;
>> +	u64 end = bytenr + num_bytes;
>> +	u64 cur = bytenr;
>>  	struct btrfs_bio *bbio = NULL;
>>
>>
>> @@ -1316,15 +1318,22 @@ int btrfs_discard_extent(struct btrfs_fs_info *fs_info, u64 bytenr,
>>  	 * associated to its stripes that don't go away while we are discarding.
>>  	 */
>>  	btrfs_bio_counter_inc_blocked(fs_info);
>> -	/* Tell the block device(s) that the sectors can be discarded */
>> -	ret = btrfs_map_block(fs_info, BTRFS_MAP_DISCARD, bytenr, &num_bytes,
>> -			      &bbio, 0);
>> -	/* Error condition is -ENOMEM */
>> -	if (!ret) {
>> -		struct btrfs_bio_stripe *stripe = bbio->stripes;
>> +	while (cur < end) {
>> +		struct btrfs_bio_stripe *stripe;
>>  		int i;
>>
>> +		/* Tell the block device(s) that the sectors can be discarded */
>> +		ret = btrfs_map_block(fs_info, BTRFS_MAP_DISCARD, cur,
>> +				      &num_bytes, &bbio, 0);
>> +		/*
>> +		 * Error can be -ENOMEM, -ENOENT (no such chunk mapping) or
>> +		 * -EOPNOTSUPP. For any such error, @num_bytes is not updated,
>> +		 * thus we can't continue anyway.
>> +		 */
>> +		if (ret < 0)
>> +			goto out;
>>
>> +		stripe = bbio->stripes;
>>  		for (i = 0; i < bbio->num_stripes; i++, stripe++) {
>>  			u64 bytes;
>>  			struct request_queue *req_q;
>> @@ -1341,10 +1350,19 @@ int btrfs_discard_extent(struct btrfs_fs_info *fs_info, u64 bytenr,
>>  						  stripe->physical,
>>  						  stripe->length,
>>  						  &bytes);
>> -			if (!ret)
>> +			if (!ret) {
>>  				discarded_bytes += bytes;
>> -			else if (ret != -EOPNOTSUPP)
>> -				break; /* Logic errors or -ENOMEM, or -EIO but I don't know how that could happen JDM */
>> +			} else if (ret != -EOPNOTSUPP) {
>> +				/*
>> +				 * Logic errors or -ENOMEM, or -EIO but I don't
>> +				 * know how that could happen JDM
>> +				 *
>> +				 * Ans since there are two loops, explicitly
>> +				 * goto out to avoid confusion.
>> +				 */
>> +				btrfs_put_bbio(bbio);
>> +				goto out;
>> +			}
>>
>>  			/*
>>  			 * Just in case we get back EOPNOTSUPP for some reason,
>> @@ -1354,7 +1372,9 @@ int btrfs_discard_extent(struct btrfs_fs_info *fs_info, u64 bytenr,
>>  			ret = 0;
>>  		}
>>  		btrfs_put_bbio(bbio);
>> +		cur += num_bytes;
>
> Shouldn't  you keep the number of bytes requested to discard and deduct
> from the actual amount being discarded.
>
> Say you want to discard 1G but for whatever reason this has to be done
> in, say, 1 chunk of 750mb and a second chunk of 250 mb. On the first go
> you will discard 750mb and add them to cur, numbytes at this point are
> 750mb then you call btrfs_map_block with num_bytes being 750mb whereas
> they should have been 250mb. And if the code manages to map the 2nd
> 750mb it could potentially trim more than requested and corrupt data.

Oh! Great find!

That's indeed a problem! So every time before we call btrfs_map_block()
we should adjust @num_bytes to @end - @cur.

Thanks for catching this big bug!
Qu
>
>>  	}
>> +out:
>>  	btrfs_bio_counter_dec(fs_info);
>>
>>  	if (actual_bytes)
>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>> index a6db11e821a5..f66bd0d03f44 100644
>> --- a/fs/btrfs/volumes.c
>> +++ b/fs/btrfs/volumes.c
>> @@ -5578,12 +5578,13 @@ void btrfs_put_bbio(struct btrfs_bio *bbio)
>>   * replace.
>>   */
>>  static int __btrfs_map_block_for_discard(struct btrfs_fs_info *fs_info,
>> -					 u64 logical, u64 length,
>> +					 u64 logical, u64 *length_ret,
>>  					 struct btrfs_bio **bbio_ret)
>>  {
>>  	struct extent_map *em;
>>  	struct map_lookup *map;
>>  	struct btrfs_bio *bbio;
>> +	u64 length = *length_ret;
>>  	u64 offset;
>>  	u64 stripe_nr;
>>  	u64 stripe_nr_end;
>> @@ -5617,6 +5618,7 @@ static int __btrfs_map_block_for_discard(struct btrfs_fs_info *fs_info,
>>
>>  	offset = logical - em->start;
>>  	length = min_t(u64, em->start + em->len - logical, length);
>> +	*length_ret = length;
>>
>>  	stripe_len = map->stripe_len;
>>  	/*
>> @@ -6031,7 +6033,7 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info,
>>
>>  	if (op == BTRFS_MAP_DISCARD)
>>  		return __btrfs_map_block_for_discard(fs_info, logical,
>> -						     *length, bbio_ret);
>> +						     length, bbio_ret);
>>
>>  	ret = btrfs_get_io_geometry(fs_info, op, logical, *length, &geom);
>>  	if (ret < 0)
>>

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2019-10-23 13:49 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-10-23 12:56 [PATCH v2 0/2] btrfs: trim: Fix a bug certain range may not be trimmed properly Qu Wenruo
2019-10-23 12:56 ` [PATCH v2 1/2] btrfs: volumes: Use more straightforward way to calculate map length Qu Wenruo
2019-10-23 13:11   ` Nikolay Borisov
2019-10-23 12:56 ` [PATCH v2 2/2] btrfs: extent-tree: Ensure we trim ranges across block group boundary Qu Wenruo
2019-10-23 13:37   ` Nikolay Borisov
2019-10-23 13:48     ` Qu Wenruo

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).