linux-bcache.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v3] block: improve discard bio alignment in __blkdev_issue_discard()
@ 2020-05-30 13:52 Coly Li
  2020-05-30 13:52 ` Coly Li
                   ` (3 more replies)
  0 siblings, 4 replies; 7+ messages in thread
From: Coly Li @ 2020-05-30 13:52 UTC (permalink / raw)
  To: linux-block
  Cc: linux-bcache, Coly Li, Acshai Manoj, Bart Van Assche,
	Christoph Hellwig, Enzo Matsumiya, Hannes Reinecke, Jens Axboe,
	Ming Lei, Xiao Ni

This patch improves discard bio split for address and size alignment in
__blkdev_issue_discard(). The aligned discard bio may help underlying
device controller to perform better discard and internal garbage
collection, and avoid unnecessary internal fragment.

Current discard bio split algorithm in __blkdev_issue_discard() may have
non-discarded fregment on device even the discard bio LBA and size are
both aligned to device's discard granularity size.

Here is the example steps on how to reproduce the above problem.
- On a VMWare ESXi 6.5 update3 installation, create a 51GB virtual disk
  with thin mode and give it to a Linux virtual machine.
- Inside the Linux virtual machine, if the 50GB virtual disk shows up as
  /dev/sdb, fill data into the first 50GB by,
        # dd if=/dev/zero of=/dev/sdb bs=4096 count=13107200
- Discard the 50GB range from offset 0 on /dev/sdb,
        # blkdiscard /dev/sdb -o 0 -l 53687091200
- Observe the underlying mapping status of the device
        # sg_get_lba_status /dev/sdb -m 1048 --lba=0
  descriptor LBA: 0x0000000000000000  blocks: 2048  mapped (or unknown)
  descriptor LBA: 0x0000000000000800  blocks: 16773120  deallocated
  descriptor LBA: 0x0000000000fff800  blocks: 2048  mapped (or unknown)
  descriptor LBA: 0x0000000001000000  blocks: 8386560  deallocated
  descriptor LBA: 0x00000000017ff800  blocks: 2048  mapped (or unknown)
  descriptor LBA: 0x0000000001800000  blocks: 8386560  deallocated
  descriptor LBA: 0x0000000001fff800  blocks: 2048  mapped (or unknown)
  descriptor LBA: 0x0000000002000000  blocks: 8386560  deallocated
  descriptor LBA: 0x00000000027ff800  blocks: 2048  mapped (or unknown)
  descriptor LBA: 0x0000000002800000  blocks: 8386560  deallocated
  descriptor LBA: 0x0000000002fff800  blocks: 2048  mapped (or unknown)
  descriptor LBA: 0x0000000003000000  blocks: 8386560  deallocated
  descriptor LBA: 0x00000000037ff800  blocks: 2048  mapped (or unknown)
  descriptor LBA: 0x0000000003800000  blocks: 8386560  deallocated
  descriptor LBA: 0x0000000003fff800  blocks: 2048  mapped (or unknown)
  descriptor LBA: 0x0000000004000000  blocks: 8386560  deallocated
  descriptor LBA: 0x00000000047ff800  blocks: 2048  mapped (or unknown)
  descriptor LBA: 0x0000000004800000  blocks: 8386560  deallocated
  descriptor LBA: 0x0000000004fff800  blocks: 2048  mapped (or unknown)
  descriptor LBA: 0x0000000005000000  blocks: 8386560  deallocated
  descriptor LBA: 0x00000000057ff800  blocks: 2048  mapped (or unknown)
  descriptor LBA: 0x0000000005800000  blocks: 8386560  deallocated
  descriptor LBA: 0x0000000005fff800  blocks: 2048  mapped (or unknown)
  descriptor LBA: 0x0000000006000000  blocks: 6291456  deallocated
  descriptor LBA: 0x0000000006600000  blocks: 0  deallocated

Although the discard bio starts at LBA 0 and has 50<<30 bytes size which
are perfect aligned to the discard granularity, from the above list
these are many 1MB (2048 sectors) internal fragments exist unexpectedly.

The problem is in __blkdev_issue_discard(), an improper algorithm causes
an improper bio size which is not aligned.

 25 int __blkdev_issue_discard(struct block_device *bdev, sector_t sector,
 26                 sector_t nr_sects, gfp_t gfp_mask, int flags,
 27                 struct bio **biop)
 28 {
 29         struct request_queue *q = bdev_get_queue(bdev);
   [snipped]
 56
 57         while (nr_sects) {
 58                 sector_t req_sects = min_t(sector_t, nr_sects,
 59                                 bio_allowed_max_sectors(q));
 60
 61                 WARN_ON_ONCE((req_sects << 9) > UINT_MAX);
 62
 63                 bio = blk_next_bio(bio, 0, gfp_mask);
 64                 bio->bi_iter.bi_sector = sector;
 65                 bio_set_dev(bio, bdev);
 66                 bio_set_op_attrs(bio, op, 0);
 67
 68                 bio->bi_iter.bi_size = req_sects << 9;
 69                 sector += req_sects;
 70                 nr_sects -= req_sects;
   [snipped]
 79         }
 80
 81         *biop = bio;
 82         return 0;
 83 }
 84 EXPORT_SYMBOL(__blkdev_issue_discard);

At line 58-59, to discard a 50GB range, req_sects is set as return value
of bio_allowed_max_sectors(q), which is 8388607 sectors. In the above
case, the discard granularity is 2048 sectors, although the start LBA
and discard length are aligned to discard granularity, req_sects never
has chance to be aligned to discard granularity. This is why there are
some still-mapped 2048 sectors fragment in every 4 or 8 GB range.

If req_sects at line 58 is set to a value aligned to discard_granularity
and close to UNIT_MAX, then all consequent split bios inside device
driver are (almostly) aligned to discard_granularity of the device
queue. The 2048 sectors still-mapped fragment will disappear.

This patch introduces bio_aligned_discard_max_sectors() to return the
the value which is aligned to q->limits.discard_granularity and closest
to UINT_MAX. Then this patch replaces bio_allowed_max_sectors() with
this new routine to decide a more proper split bio length.

But we still need to handle the situation when discard start LBA is not
aligned to q->limits.discard_granularity, otherwise even the length is
aligned, current code may still leave 2048 fragment around every 4GB
range. Therefore, to calculate req_sects, firstly the start LBA of
discard range is checked, if it is not aligned to discard granularity,
the first split location should make sure following bio has bi_sector
aligned to discard granularity. Then there won't be still-mapped
fragment in the middle of the discard range.

The above is how this patch improves discard bio alignment in
__blkdev_issue_discard(). Now with this patch, after discard with same
command line mentiond previously, sg_get_lba_status returns,
descriptor LBA: 0x0000000000000000  blocks: 106954752  deallocated
descriptor LBA: 0x0000000006600000  blocks: 0  deallocated

We an see there is no 2048 sectors segment anymore, everything is clean.

Reported-by: Acshai Manoj <acshai.manoj@microfocus.com>
Signed-off-by: Coly Li <colyli@suse.de>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Enzo Matsumiya <ematsumiya@suse.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Xiao Ni <xni@redhat.com>
---
Changelog:
v2, the improved version with inspire from review comments by Bart,
    Ming and Xiao.
v1, the initial version.

 block/blk-lib.c | 25 +++++++++++++++++++++++--
 block/blk.h     | 14 ++++++++++++++
 2 files changed, 37 insertions(+), 2 deletions(-)

diff --git a/block/blk-lib.c b/block/blk-lib.c
index 5f2c429d4378..7bffdee63a20 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -55,8 +55,29 @@ int __blkdev_issue_discard(struct block_device *bdev, sector_t sector,
 		return -EINVAL;
 
 	while (nr_sects) {
-		sector_t req_sects = min_t(sector_t, nr_sects,
-				bio_allowed_max_sectors(q));
+		sector_t granularity_aligned_lba;
+		sector_t req_sects;
+
+		granularity_aligned_lba = round_up(sector,
+				q->limits.discard_granularity >> SECTOR_SHIFT);
+
+		/*
+		 * Check whether the discard bio starts at a discard_granularity
+		 * aligned LBA,
+		 * - If no: set (granularity_aligned_lba - sector) to bi_size of
+		 *   the first split bio, then the second bio will start at a
+		 *   discard_granularity aligned LBA.
+		 * - If yes: use bio_aligned_discard_max_sectors() as the max
+		 *   possible bi_size of the first split bio. Then when this bio
+		 *   is split in device drive, the split ones are very probably
+		 *   to be aligned to discard_granularity of the device's queue.
+		 */
+		if (granularity_aligned_lba == sector)
+			req_sects = min_t(sector_t, nr_sects,
+					  bio_aligned_discard_max_sectors(q));
+		else
+			req_sects = min_t(sector_t, nr_sects,
+					  granularity_aligned_lba - sector);
 
 		WARN_ON_ONCE((req_sects << 9) > UINT_MAX);
 
diff --git a/block/blk.h b/block/blk.h
index 0a94ec68af32..589007ac564e 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -292,6 +292,20 @@ static inline unsigned int bio_allowed_max_sectors(struct request_queue *q)
 	return round_down(UINT_MAX, queue_logical_block_size(q)) >> 9;
 }
 
+/*
+ * The max bio size which is aligned to q->limits.discard_granularity. This
+ * is a hint to split large discard bio in generic block layer, then if device
+ * driver needs to split the discard bio into smaller ones, their bi_size can
+ * be very probably and easily aligned to discard_granularity of the device's
+ * queue.
+ */
+static inline unsigned int bio_aligned_discard_max_sectors(
+					struct request_queue *q)
+{
+	return round_down(UINT_MAX, q->limits.discard_granularity) >>
+			SECTOR_SHIFT;
+}
+
 /*
  * Internal io_context interface
  */
-- 
2.25.0

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [PATCH v3] block: improve discard bio alignment in __blkdev_issue_discard()
  2020-05-30 13:52 [PATCH v3] block: improve discard bio alignment in __blkdev_issue_discard() Coly Li
@ 2020-05-30 13:52 ` Coly Li
  2020-05-30 16:45 ` Hannes Reinecke
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 7+ messages in thread
From: Coly Li @ 2020-05-30 13:52 UTC (permalink / raw)
  To: linux-block
  Cc: linux-bcache, Coly Li, Acshai Manoj, Bart Van Assche,
	Christoph Hellwig, Enzo Matsumiya, Hannes Reinecke, Jens Axboe,
	Ming Lei, Xiao Ni

This patch improves discard bio split for address and size alignment in
__blkdev_issue_discard(). The aligned discard bio may help underlying
device controller to perform better discard and internal garbage
collection, and avoid unnecessary internal fragment.

Current discard bio split algorithm in __blkdev_issue_discard() may have
non-discarded fregment on device even the discard bio LBA and size are
both aligned to device's discard granularity size.

Here is the example steps on how to reproduce the above problem.
- On a VMWare ESXi 6.5 update3 installation, create a 51GB virtual disk
  with thin mode and give it to a Linux virtual machine.
- Inside the Linux virtual machine, if the 50GB virtual disk shows up as
  /dev/sdb, fill data into the first 50GB by,
        # dd if=/dev/zero of=/dev/sdb bs=4096 count=13107200
- Discard the 50GB range from offset 0 on /dev/sdb,
        # blkdiscard /dev/sdb -o 0 -l 53687091200
- Observe the underlying mapping status of the device
        # sg_get_lba_status /dev/sdb -m 1048 --lba=0
  descriptor LBA: 0x0000000000000000  blocks: 2048  mapped (or unknown)
  descriptor LBA: 0x0000000000000800  blocks: 16773120  deallocated
  descriptor LBA: 0x0000000000fff800  blocks: 2048  mapped (or unknown)
  descriptor LBA: 0x0000000001000000  blocks: 8386560  deallocated
  descriptor LBA: 0x00000000017ff800  blocks: 2048  mapped (or unknown)
  descriptor LBA: 0x0000000001800000  blocks: 8386560  deallocated
  descriptor LBA: 0x0000000001fff800  blocks: 2048  mapped (or unknown)
  descriptor LBA: 0x0000000002000000  blocks: 8386560  deallocated
  descriptor LBA: 0x00000000027ff800  blocks: 2048  mapped (or unknown)
  descriptor LBA: 0x0000000002800000  blocks: 8386560  deallocated
  descriptor LBA: 0x0000000002fff800  blocks: 2048  mapped (or unknown)
  descriptor LBA: 0x0000000003000000  blocks: 8386560  deallocated
  descriptor LBA: 0x00000000037ff800  blocks: 2048  mapped (or unknown)
  descriptor LBA: 0x0000000003800000  blocks: 8386560  deallocated
  descriptor LBA: 0x0000000003fff800  blocks: 2048  mapped (or unknown)
  descriptor LBA: 0x0000000004000000  blocks: 8386560  deallocated
  descriptor LBA: 0x00000000047ff800  blocks: 2048  mapped (or unknown)
  descriptor LBA: 0x0000000004800000  blocks: 8386560  deallocated
  descriptor LBA: 0x0000000004fff800  blocks: 2048  mapped (or unknown)
  descriptor LBA: 0x0000000005000000  blocks: 8386560  deallocated
  descriptor LBA: 0x00000000057ff800  blocks: 2048  mapped (or unknown)
  descriptor LBA: 0x0000000005800000  blocks: 8386560  deallocated
  descriptor LBA: 0x0000000005fff800  blocks: 2048  mapped (or unknown)
  descriptor LBA: 0x0000000006000000  blocks: 6291456  deallocated
  descriptor LBA: 0x0000000006600000  blocks: 0  deallocated

Although the discard bio starts at LBA 0 and has 50<<30 bytes size which
are perfect aligned to the discard granularity, from the above list
these are many 1MB (2048 sectors) internal fragments exist unexpectedly.

The problem is in __blkdev_issue_discard(), an improper algorithm causes
an improper bio size which is not aligned.

 25 int __blkdev_issue_discard(struct block_device *bdev, sector_t sector,
 26                 sector_t nr_sects, gfp_t gfp_mask, int flags,
 27                 struct bio **biop)
 28 {
 29         struct request_queue *q = bdev_get_queue(bdev);
   [snipped]
 56
 57         while (nr_sects) {
 58                 sector_t req_sects = min_t(sector_t, nr_sects,
 59                                 bio_allowed_max_sectors(q));
 60
 61                 WARN_ON_ONCE((req_sects << 9) > UINT_MAX);
 62
 63                 bio = blk_next_bio(bio, 0, gfp_mask);
 64                 bio->bi_iter.bi_sector = sector;
 65                 bio_set_dev(bio, bdev);
 66                 bio_set_op_attrs(bio, op, 0);
 67
 68                 bio->bi_iter.bi_size = req_sects << 9;
 69                 sector += req_sects;
 70                 nr_sects -= req_sects;
   [snipped]
 79         }
 80
 81         *biop = bio;
 82         return 0;
 83 }
 84 EXPORT_SYMBOL(__blkdev_issue_discard);

At line 58-59, to discard a 50GB range, req_sects is set as return value
of bio_allowed_max_sectors(q), which is 8388607 sectors. In the above
case, the discard granularity is 2048 sectors, although the start LBA
and discard length are aligned to discard granularity, req_sects never
has chance to be aligned to discard granularity. This is why there are
some still-mapped 2048 sectors fragment in every 4 or 8 GB range.

If req_sects at line 58 is set to a value aligned to discard_granularity
and close to UNIT_MAX, then all consequent split bios inside device
driver are (almostly) aligned to discard_granularity of the device
queue. The 2048 sectors still-mapped fragment will disappear.

This patch introduces bio_aligned_discard_max_sectors() to return the
the value which is aligned to q->limits.discard_granularity and closest
to UINT_MAX. Then this patch replaces bio_allowed_max_sectors() with
this new routine to decide a more proper split bio length.

But we still need to handle the situation when discard start LBA is not
aligned to q->limits.discard_granularity, otherwise even the length is
aligned, current code may still leave 2048 fragment around every 4GB
range. Therefore, to calculate req_sects, firstly the start LBA of
discard range is checked, if it is not aligned to discard granularity,
the first split location should make sure following bio has bi_sector
aligned to discard granularity. Then there won't be still-mapped
fragment in the middle of the discard range.

The above is how this patch improves discard bio alignment in
__blkdev_issue_discard(). Now with this patch, after discard with same
command line mentiond previously, sg_get_lba_status returns,
descriptor LBA: 0x0000000000000000  blocks: 106954752  deallocated
descriptor LBA: 0x0000000006600000  blocks: 0  deallocated

We an see there is no 2048 sectors segment anymore, everything is clean.

Reported-by: Acshai Manoj <acshai.manoj@microfocus.com>
Signed-off-by: Coly Li <colyli@suse.de>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Enzo Matsumiya <ematsumiya@suse.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Xiao Ni <xni@redhat.com>
---
Changelog:
v2, the improved version with inspire from review comments by Bart,
    Ming and Xiao.
v1, the initial version.

 block/blk-lib.c | 25 +++++++++++++++++++++++--
 block/blk.h     | 14 ++++++++++++++
 2 files changed, 37 insertions(+), 2 deletions(-)

diff --git a/block/blk-lib.c b/block/blk-lib.c
index 5f2c429d4378..7bffdee63a20 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -55,8 +55,29 @@ int __blkdev_issue_discard(struct block_device *bdev, sector_t sector,
 		return -EINVAL;
 
 	while (nr_sects) {
-		sector_t req_sects = min_t(sector_t, nr_sects,
-				bio_allowed_max_sectors(q));
+		sector_t granularity_aligned_lba;
+		sector_t req_sects;
+
+		granularity_aligned_lba = round_up(sector,
+				q->limits.discard_granularity >> SECTOR_SHIFT);
+
+		/*
+		 * Check whether the discard bio starts at a discard_granularity
+		 * aligned LBA,
+		 * - If no: set (granularity_aligned_lba - sector) to bi_size of
+		 *   the first split bio, then the second bio will start at a
+		 *   discard_granularity aligned LBA.
+		 * - If yes: use bio_aligned_discard_max_sectors() as the max
+		 *   possible bi_size of the first split bio. Then when this bio
+		 *   is split in device drive, the split ones are very probably
+		 *   to be aligned to discard_granularity of the device's queue.
+		 */
+		if (granularity_aligned_lba == sector)
+			req_sects = min_t(sector_t, nr_sects,
+					  bio_aligned_discard_max_sectors(q));
+		else
+			req_sects = min_t(sector_t, nr_sects,
+					  granularity_aligned_lba - sector);
 
 		WARN_ON_ONCE((req_sects << 9) > UINT_MAX);
 
diff --git a/block/blk.h b/block/blk.h
index 0a94ec68af32..589007ac564e 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -292,6 +292,20 @@ static inline unsigned int bio_allowed_max_sectors(struct request_queue *q)
 	return round_down(UINT_MAX, queue_logical_block_size(q)) >> 9;
 }
 
+/*
+ * The max bio size which is aligned to q->limits.discard_granularity. This
+ * is a hint to split large discard bio in generic block layer, then if device
+ * driver needs to split the discard bio into smaller ones, their bi_size can
+ * be very probably and easily aligned to discard_granularity of the device's
+ * queue.
+ */
+static inline unsigned int bio_aligned_discard_max_sectors(
+					struct request_queue *q)
+{
+	return round_down(UINT_MAX, q->limits.discard_granularity) >>
+			SECTOR_SHIFT;
+}
+
 /*
  * Internal io_context interface
  */
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH v3] block: improve discard bio alignment in __blkdev_issue_discard()
  2020-05-30 13:52 [PATCH v3] block: improve discard bio alignment in __blkdev_issue_discard() Coly Li
  2020-05-30 13:52 ` Coly Li
@ 2020-05-30 16:45 ` Hannes Reinecke
  2020-06-01  5:55 ` Xiao Ni
  2020-06-01  7:15 ` Ming Lei
  3 siblings, 0 replies; 7+ messages in thread
From: Hannes Reinecke @ 2020-05-30 16:45 UTC (permalink / raw)
  To: Coly Li, linux-block
  Cc: linux-bcache, Acshai Manoj, Bart Van Assche, Christoph Hellwig,
	Enzo Matsumiya, Hannes Reinecke, Jens Axboe, Ming Lei, Xiao Ni

On 5/30/20 3:52 PM, Coly Li wrote:
> This patch improves discard bio split for address and size alignment in
> __blkdev_issue_discard(). The aligned discard bio may help underlying
> device controller to perform better discard and internal garbage
> collection, and avoid unnecessary internal fragment.
> 
> Current discard bio split algorithm in __blkdev_issue_discard() may have
> non-discarded fregment on device even the discard bio LBA and size are
> both aligned to device's discard granularity size.
> 
> Here is the example steps on how to reproduce the above problem.
> - On a VMWare ESXi 6.5 update3 installation, create a 51GB virtual disk
>    with thin mode and give it to a Linux virtual machine.
> - Inside the Linux virtual machine, if the 50GB virtual disk shows up as
>    /dev/sdb, fill data into the first 50GB by,
>          # dd if=/dev/zero of=/dev/sdb bs=4096 count=13107200
> - Discard the 50GB range from offset 0 on /dev/sdb,
>          # blkdiscard /dev/sdb -o 0 -l 53687091200
> - Observe the underlying mapping status of the device
>          # sg_get_lba_status /dev/sdb -m 1048 --lba=0
>    descriptor LBA: 0x0000000000000000  blocks: 2048  mapped (or unknown)
>    descriptor LBA: 0x0000000000000800  blocks: 16773120  deallocated
>    descriptor LBA: 0x0000000000fff800  blocks: 2048  mapped (or unknown)
>    descriptor LBA: 0x0000000001000000  blocks: 8386560  deallocated
>    descriptor LBA: 0x00000000017ff800  blocks: 2048  mapped (or unknown)
>    descriptor LBA: 0x0000000001800000  blocks: 8386560  deallocated
>    descriptor LBA: 0x0000000001fff800  blocks: 2048  mapped (or unknown)
>    descriptor LBA: 0x0000000002000000  blocks: 8386560  deallocated
>    descriptor LBA: 0x00000000027ff800  blocks: 2048  mapped (or unknown)
>    descriptor LBA: 0x0000000002800000  blocks: 8386560  deallocated
>    descriptor LBA: 0x0000000002fff800  blocks: 2048  mapped (or unknown)
>    descriptor LBA: 0x0000000003000000  blocks: 8386560  deallocated
>    descriptor LBA: 0x00000000037ff800  blocks: 2048  mapped (or unknown)
>    descriptor LBA: 0x0000000003800000  blocks: 8386560  deallocated
>    descriptor LBA: 0x0000000003fff800  blocks: 2048  mapped (or unknown)
>    descriptor LBA: 0x0000000004000000  blocks: 8386560  deallocated
>    descriptor LBA: 0x00000000047ff800  blocks: 2048  mapped (or unknown)
>    descriptor LBA: 0x0000000004800000  blocks: 8386560  deallocated
>    descriptor LBA: 0x0000000004fff800  blocks: 2048  mapped (or unknown)
>    descriptor LBA: 0x0000000005000000  blocks: 8386560  deallocated
>    descriptor LBA: 0x00000000057ff800  blocks: 2048  mapped (or unknown)
>    descriptor LBA: 0x0000000005800000  blocks: 8386560  deallocated
>    descriptor LBA: 0x0000000005fff800  blocks: 2048  mapped (or unknown)
>    descriptor LBA: 0x0000000006000000  blocks: 6291456  deallocated
>    descriptor LBA: 0x0000000006600000  blocks: 0  deallocated
> 
> Although the discard bio starts at LBA 0 and has 50<<30 bytes size which
> are perfect aligned to the discard granularity, from the above list
> these are many 1MB (2048 sectors) internal fragments exist unexpectedly.
> 
> The problem is in __blkdev_issue_discard(), an improper algorithm causes
> an improper bio size which is not aligned.
> 
>   25 int __blkdev_issue_discard(struct block_device *bdev, sector_t sector,
>   26                 sector_t nr_sects, gfp_t gfp_mask, int flags,
>   27                 struct bio **biop)
>   28 {
>   29         struct request_queue *q = bdev_get_queue(bdev);
>     [snipped]
>   56
>   57         while (nr_sects) {
>   58                 sector_t req_sects = min_t(sector_t, nr_sects,
>   59                                 bio_allowed_max_sectors(q));
>   60
>   61                 WARN_ON_ONCE((req_sects << 9) > UINT_MAX);
>   62
>   63                 bio = blk_next_bio(bio, 0, gfp_mask);
>   64                 bio->bi_iter.bi_sector = sector;
>   65                 bio_set_dev(bio, bdev);
>   66                 bio_set_op_attrs(bio, op, 0);
>   67
>   68                 bio->bi_iter.bi_size = req_sects << 9;
>   69                 sector += req_sects;
>   70                 nr_sects -= req_sects;
>     [snipped]
>   79         }
>   80
>   81         *biop = bio;
>   82         return 0;
>   83 }
>   84 EXPORT_SYMBOL(__blkdev_issue_discard);
> 
> At line 58-59, to discard a 50GB range, req_sects is set as return value
> of bio_allowed_max_sectors(q), which is 8388607 sectors. In the above
> case, the discard granularity is 2048 sectors, although the start LBA
> and discard length are aligned to discard granularity, req_sects never
> has chance to be aligned to discard granularity. This is why there are
> some still-mapped 2048 sectors fragment in every 4 or 8 GB range.
> 
> If req_sects at line 58 is set to a value aligned to discard_granularity
> and close to UNIT_MAX, then all consequent split bios inside device
> driver are (almostly) aligned to discard_granularity of the device
> queue. The 2048 sectors still-mapped fragment will disappear.
> 
> This patch introduces bio_aligned_discard_max_sectors() to return the
> the value which is aligned to q->limits.discard_granularity and closest
> to UINT_MAX. Then this patch replaces bio_allowed_max_sectors() with
> this new routine to decide a more proper split bio length.
> 
> But we still need to handle the situation when discard start LBA is not
> aligned to q->limits.discard_granularity, otherwise even the length is
> aligned, current code may still leave 2048 fragment around every 4GB
> range. Therefore, to calculate req_sects, firstly the start LBA of
> discard range is checked, if it is not aligned to discard granularity,
> the first split location should make sure following bio has bi_sector
> aligned to discard granularity. Then there won't be still-mapped
> fragment in the middle of the discard range.
> 
> The above is how this patch improves discard bio alignment in
> __blkdev_issue_discard(). Now with this patch, after discard with same
> command line mentiond previously, sg_get_lba_status returns,
> descriptor LBA: 0x0000000000000000  blocks: 106954752  deallocated
> descriptor LBA: 0x0000000006600000  blocks: 0  deallocated
> 
> We an see there is no 2048 sectors segment anymore, everything is clean.
> 
> Reported-by: Acshai Manoj <acshai.manoj@microfocus.com>
> Signed-off-by: Coly Li <colyli@suse.de>
> Cc: Bart Van Assche <bvanassche@acm.org>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Enzo Matsumiya <ematsumiya@suse.com>
> Cc: Hannes Reinecke <hare@suse.com>
> Cc: Jens Axboe <axboe@kernel.dk>
> Cc: Ming Lei <ming.lei@redhat.com>
> Cc: Xiao Ni <xni@redhat.com>
> ---
> Changelog:
> v2, the improved version with inspire from review comments by Bart,
>      Ming and Xiao.
> v1, the initial version.
> 
>   block/blk-lib.c | 25 +++++++++++++++++++++++--
>   block/blk.h     | 14 ++++++++++++++
>   2 files changed, 37 insertions(+), 2 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke            Teamlead Storage & Networking
hare@suse.de                               +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v3] block: improve discard bio alignment in __blkdev_issue_discard()
  2020-05-30 13:52 [PATCH v3] block: improve discard bio alignment in __blkdev_issue_discard() Coly Li
  2020-05-30 13:52 ` Coly Li
  2020-05-30 16:45 ` Hannes Reinecke
@ 2020-06-01  5:55 ` Xiao Ni
  2020-06-01  5:55   ` Xiao Ni
  2020-06-01  7:15 ` Ming Lei
  3 siblings, 1 reply; 7+ messages in thread
From: Xiao Ni @ 2020-06-01  5:55 UTC (permalink / raw)
  To: Coly Li, linux-block
  Cc: linux-bcache, Acshai Manoj, Bart Van Assche, Christoph Hellwig,
	Enzo Matsumiya, Hannes Reinecke, Jens Axboe, Ming Lei



On 05/30/2020 09:52 PM, Coly Li wrote:
> This patch improves discard bio split for address and size alignment in
> __blkdev_issue_discard(). The aligned discard bio may help underlying
> device controller to perform better discard and internal garbage
> collection, and avoid unnecessary internal fragment.
>
> Current discard bio split algorithm in __blkdev_issue_discard() may have
> non-discarded fregment on device even the discard bio LBA and size are
> both aligned to device's discard granularity size.
>
> Here is the example steps on how to reproduce the above problem.
> - On a VMWare ESXi 6.5 update3 installation, create a 51GB virtual disk
>    with thin mode and give it to a Linux virtual machine.
> - Inside the Linux virtual machine, if the 50GB virtual disk shows up as
>    /dev/sdb, fill data into the first 50GB by,
>          # dd if=/dev/zero of=/dev/sdb bs=4096 count=13107200
> - Discard the 50GB range from offset 0 on /dev/sdb,
>          # blkdiscard /dev/sdb -o 0 -l 53687091200
> - Observe the underlying mapping status of the device
>          # sg_get_lba_status /dev/sdb -m 1048 --lba=0
>    descriptor LBA: 0x0000000000000000  blocks: 2048  mapped (or unknown)
>    descriptor LBA: 0x0000000000000800  blocks: 16773120  deallocated
>    descriptor LBA: 0x0000000000fff800  blocks: 2048  mapped (or unknown)
>    descriptor LBA: 0x0000000001000000  blocks: 8386560  deallocated
>    descriptor LBA: 0x00000000017ff800  blocks: 2048  mapped (or unknown)
>    descriptor LBA: 0x0000000001800000  blocks: 8386560  deallocated
>    descriptor LBA: 0x0000000001fff800  blocks: 2048  mapped (or unknown)
>    descriptor LBA: 0x0000000002000000  blocks: 8386560  deallocated
>    descriptor LBA: 0x00000000027ff800  blocks: 2048  mapped (or unknown)
>    descriptor LBA: 0x0000000002800000  blocks: 8386560  deallocated
>    descriptor LBA: 0x0000000002fff800  blocks: 2048  mapped (or unknown)
>    descriptor LBA: 0x0000000003000000  blocks: 8386560  deallocated
>    descriptor LBA: 0x00000000037ff800  blocks: 2048  mapped (or unknown)
>    descriptor LBA: 0x0000000003800000  blocks: 8386560  deallocated
>    descriptor LBA: 0x0000000003fff800  blocks: 2048  mapped (or unknown)
>    descriptor LBA: 0x0000000004000000  blocks: 8386560  deallocated
>    descriptor LBA: 0x00000000047ff800  blocks: 2048  mapped (or unknown)
>    descriptor LBA: 0x0000000004800000  blocks: 8386560  deallocated
>    descriptor LBA: 0x0000000004fff800  blocks: 2048  mapped (or unknown)
>    descriptor LBA: 0x0000000005000000  blocks: 8386560  deallocated
>    descriptor LBA: 0x00000000057ff800  blocks: 2048  mapped (or unknown)
>    descriptor LBA: 0x0000000005800000  blocks: 8386560  deallocated
>    descriptor LBA: 0x0000000005fff800  blocks: 2048  mapped (or unknown)
>    descriptor LBA: 0x0000000006000000  blocks: 6291456  deallocated
>    descriptor LBA: 0x0000000006600000  blocks: 0  deallocated
>
> Although the discard bio starts at LBA 0 and has 50<<30 bytes size which
> are perfect aligned to the discard granularity, from the above list
> these are many 1MB (2048 sectors) internal fragments exist unexpectedly.
>
> The problem is in __blkdev_issue_discard(), an improper algorithm causes
> an improper bio size which is not aligned.
>
>   25 int __blkdev_issue_discard(struct block_device *bdev, sector_t sector,
>   26                 sector_t nr_sects, gfp_t gfp_mask, int flags,
>   27                 struct bio **biop)
>   28 {
>   29         struct request_queue *q = bdev_get_queue(bdev);
>     [snipped]
>   56
>   57         while (nr_sects) {
>   58                 sector_t req_sects = min_t(sector_t, nr_sects,
>   59                                 bio_allowed_max_sectors(q));
>   60
>   61                 WARN_ON_ONCE((req_sects << 9) > UINT_MAX);
>   62
>   63                 bio = blk_next_bio(bio, 0, gfp_mask);
>   64                 bio->bi_iter.bi_sector = sector;
>   65                 bio_set_dev(bio, bdev);
>   66                 bio_set_op_attrs(bio, op, 0);
>   67
>   68                 bio->bi_iter.bi_size = req_sects << 9;
>   69                 sector += req_sects;
>   70                 nr_sects -= req_sects;
>     [snipped]
>   79         }
>   80
>   81         *biop = bio;
>   82         return 0;
>   83 }
>   84 EXPORT_SYMBOL(__blkdev_issue_discard);
>
> At line 58-59, to discard a 50GB range, req_sects is set as return value
> of bio_allowed_max_sectors(q), which is 8388607 sectors. In the above
> case, the discard granularity is 2048 sectors, although the start LBA
> and discard length are aligned to discard granularity, req_sects never
> has chance to be aligned to discard granularity. This is why there are
> some still-mapped 2048 sectors fragment in every 4 or 8 GB range.
>
> If req_sects at line 58 is set to a value aligned to discard_granularity
> and close to UNIT_MAX, then all consequent split bios inside device
> driver are (almostly) aligned to discard_granularity of the device
> queue. The 2048 sectors still-mapped fragment will disappear.
>
> This patch introduces bio_aligned_discard_max_sectors() to return the
> the value which is aligned to q->limits.discard_granularity and closest
> to UINT_MAX. Then this patch replaces bio_allowed_max_sectors() with
> this new routine to decide a more proper split bio length.
>
> But we still need to handle the situation when discard start LBA is not
> aligned to q->limits.discard_granularity, otherwise even the length is
> aligned, current code may still leave 2048 fragment around every 4GB
> range. Therefore, to calculate req_sects, firstly the start LBA of
> discard range is checked, if it is not aligned to discard granularity,
> the first split location should make sure following bio has bi_sector
> aligned to discard granularity. Then there won't be still-mapped
> fragment in the middle of the discard range.
>
> The above is how this patch improves discard bio alignment in
> __blkdev_issue_discard(). Now with this patch, after discard with same
> command line mentiond previously, sg_get_lba_status returns,
> descriptor LBA: 0x0000000000000000  blocks: 106954752  deallocated
> descriptor LBA: 0x0000000006600000  blocks: 0  deallocated
>
> We an see there is no 2048 sectors segment anymore, everything is clean.
>
> Reported-by: Acshai Manoj <acshai.manoj@microfocus.com>
> Signed-off-by: Coly Li <colyli@suse.de>
> Cc: Bart Van Assche <bvanassche@acm.org>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Enzo Matsumiya <ematsumiya@suse.com>
> Cc: Hannes Reinecke <hare@suse.com>
> Cc: Jens Axboe <axboe@kernel.dk>
> Cc: Ming Lei <ming.lei@redhat.com>
> Cc: Xiao Ni <xni@redhat.com>
> ---
> Changelog:
> v2, the improved version with inspire from review comments by Bart,
>      Ming and Xiao.
> v1, the initial version.
>
>   block/blk-lib.c | 25 +++++++++++++++++++++++--
>   block/blk.h     | 14 ++++++++++++++
>   2 files changed, 37 insertions(+), 2 deletions(-)
>
> diff --git a/block/blk-lib.c b/block/blk-lib.c
> index 5f2c429d4378..7bffdee63a20 100644
> --- a/block/blk-lib.c
> +++ b/block/blk-lib.c
> @@ -55,8 +55,29 @@ int __blkdev_issue_discard(struct block_device *bdev, sector_t sector,
>   		return -EINVAL;
>   
>   	while (nr_sects) {
> -		sector_t req_sects = min_t(sector_t, nr_sects,
> -				bio_allowed_max_sectors(q));
> +		sector_t granularity_aligned_lba;
> +		sector_t req_sects;
> +
> +		granularity_aligned_lba = round_up(sector,
> +				q->limits.discard_granularity >> SECTOR_SHIFT);
> +
> +		/*
> +		 * Check whether the discard bio starts at a discard_granularity
> +		 * aligned LBA,
> +		 * - If no: set (granularity_aligned_lba - sector) to bi_size of
> +		 *   the first split bio, then the second bio will start at a
> +		 *   discard_granularity aligned LBA.
> +		 * - If yes: use bio_aligned_discard_max_sectors() as the max
> +		 *   possible bi_size of the first split bio. Then when this bio
> +		 *   is split in device drive, the split ones are very probably
> +		 *   to be aligned to discard_granularity of the device's queue.
> +		 */
> +		if (granularity_aligned_lba == sector)
> +			req_sects = min_t(sector_t, nr_sects,
> +					  bio_aligned_discard_max_sectors(q));
> +		else
> +			req_sects = min_t(sector_t, nr_sects,
> +					  granularity_aligned_lba - sector);
>   
>   		WARN_ON_ONCE((req_sects << 9) > UINT_MAX);
>   
> diff --git a/block/blk.h b/block/blk.h
> index 0a94ec68af32..589007ac564e 100644
> --- a/block/blk.h
> +++ b/block/blk.h
> @@ -292,6 +292,20 @@ static inline unsigned int bio_allowed_max_sectors(struct request_queue *q)
>   	return round_down(UINT_MAX, queue_logical_block_size(q)) >> 9;
>   }
>   
> +/*
> + * The max bio size which is aligned to q->limits.discard_granularity. This
> + * is a hint to split large discard bio in generic block layer, then if device
> + * driver needs to split the discard bio into smaller ones, their bi_size can
> + * be very probably and easily aligned to discard_granularity of the device's
> + * queue.
> + */
> +static inline unsigned int bio_aligned_discard_max_sectors(
> +					struct request_queue *q)
> +{
> +	return round_down(UINT_MAX, q->limits.discard_granularity) >>
> +			SECTOR_SHIFT;
> +}
> +
>   /*
>    * Internal io_context interface
>    */
Reviewed-by: Xiao Ni <xni@redhat.com>

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v3] block: improve discard bio alignment in __blkdev_issue_discard()
  2020-06-01  5:55 ` Xiao Ni
@ 2020-06-01  5:55   ` Xiao Ni
  0 siblings, 0 replies; 7+ messages in thread
From: Xiao Ni @ 2020-06-01  5:55 UTC (permalink / raw)
  To: Coly Li, linux-block
  Cc: linux-bcache, Acshai Manoj, Bart Van Assche, Christoph Hellwig,
	Enzo Matsumiya, Hannes Reinecke, Jens Axboe, Ming Lei



On 05/30/2020 09:52 PM, Coly Li wrote:
> This patch improves discard bio split for address and size alignment in
> __blkdev_issue_discard(). The aligned discard bio may help underlying
> device controller to perform better discard and internal garbage
> collection, and avoid unnecessary internal fragment.
>
> Current discard bio split algorithm in __blkdev_issue_discard() may have
> non-discarded fregment on device even the discard bio LBA and size are
> both aligned to device's discard granularity size.
>
> Here is the example steps on how to reproduce the above problem.
> - On a VMWare ESXi 6.5 update3 installation, create a 51GB virtual disk
>    with thin mode and give it to a Linux virtual machine.
> - Inside the Linux virtual machine, if the 50GB virtual disk shows up as
>    /dev/sdb, fill data into the first 50GB by,
>          # dd if=/dev/zero of=/dev/sdb bs=4096 count=13107200
> - Discard the 50GB range from offset 0 on /dev/sdb,
>          # blkdiscard /dev/sdb -o 0 -l 53687091200
> - Observe the underlying mapping status of the device
>          # sg_get_lba_status /dev/sdb -m 1048 --lba=0
>    descriptor LBA: 0x0000000000000000  blocks: 2048  mapped (or unknown)
>    descriptor LBA: 0x0000000000000800  blocks: 16773120  deallocated
>    descriptor LBA: 0x0000000000fff800  blocks: 2048  mapped (or unknown)
>    descriptor LBA: 0x0000000001000000  blocks: 8386560  deallocated
>    descriptor LBA: 0x00000000017ff800  blocks: 2048  mapped (or unknown)
>    descriptor LBA: 0x0000000001800000  blocks: 8386560  deallocated
>    descriptor LBA: 0x0000000001fff800  blocks: 2048  mapped (or unknown)
>    descriptor LBA: 0x0000000002000000  blocks: 8386560  deallocated
>    descriptor LBA: 0x00000000027ff800  blocks: 2048  mapped (or unknown)
>    descriptor LBA: 0x0000000002800000  blocks: 8386560  deallocated
>    descriptor LBA: 0x0000000002fff800  blocks: 2048  mapped (or unknown)
>    descriptor LBA: 0x0000000003000000  blocks: 8386560  deallocated
>    descriptor LBA: 0x00000000037ff800  blocks: 2048  mapped (or unknown)
>    descriptor LBA: 0x0000000003800000  blocks: 8386560  deallocated
>    descriptor LBA: 0x0000000003fff800  blocks: 2048  mapped (or unknown)
>    descriptor LBA: 0x0000000004000000  blocks: 8386560  deallocated
>    descriptor LBA: 0x00000000047ff800  blocks: 2048  mapped (or unknown)
>    descriptor LBA: 0x0000000004800000  blocks: 8386560  deallocated
>    descriptor LBA: 0x0000000004fff800  blocks: 2048  mapped (or unknown)
>    descriptor LBA: 0x0000000005000000  blocks: 8386560  deallocated
>    descriptor LBA: 0x00000000057ff800  blocks: 2048  mapped (or unknown)
>    descriptor LBA: 0x0000000005800000  blocks: 8386560  deallocated
>    descriptor LBA: 0x0000000005fff800  blocks: 2048  mapped (or unknown)
>    descriptor LBA: 0x0000000006000000  blocks: 6291456  deallocated
>    descriptor LBA: 0x0000000006600000  blocks: 0  deallocated
>
> Although the discard bio starts at LBA 0 and has 50<<30 bytes size which
> are perfect aligned to the discard granularity, from the above list
> these are many 1MB (2048 sectors) internal fragments exist unexpectedly.
>
> The problem is in __blkdev_issue_discard(), an improper algorithm causes
> an improper bio size which is not aligned.
>
>   25 int __blkdev_issue_discard(struct block_device *bdev, sector_t sector,
>   26                 sector_t nr_sects, gfp_t gfp_mask, int flags,
>   27                 struct bio **biop)
>   28 {
>   29         struct request_queue *q = bdev_get_queue(bdev);
>     [snipped]
>   56
>   57         while (nr_sects) {
>   58                 sector_t req_sects = min_t(sector_t, nr_sects,
>   59                                 bio_allowed_max_sectors(q));
>   60
>   61                 WARN_ON_ONCE((req_sects << 9) > UINT_MAX);
>   62
>   63                 bio = blk_next_bio(bio, 0, gfp_mask);
>   64                 bio->bi_iter.bi_sector = sector;
>   65                 bio_set_dev(bio, bdev);
>   66                 bio_set_op_attrs(bio, op, 0);
>   67
>   68                 bio->bi_iter.bi_size = req_sects << 9;
>   69                 sector += req_sects;
>   70                 nr_sects -= req_sects;
>     [snipped]
>   79         }
>   80
>   81         *biop = bio;
>   82         return 0;
>   83 }
>   84 EXPORT_SYMBOL(__blkdev_issue_discard);
>
> At line 58-59, to discard a 50GB range, req_sects is set as return value
> of bio_allowed_max_sectors(q), which is 8388607 sectors. In the above
> case, the discard granularity is 2048 sectors, although the start LBA
> and discard length are aligned to discard granularity, req_sects never
> has chance to be aligned to discard granularity. This is why there are
> some still-mapped 2048 sectors fragment in every 4 or 8 GB range.
>
> If req_sects at line 58 is set to a value aligned to discard_granularity
> and close to UNIT_MAX, then all consequent split bios inside device
> driver are (almostly) aligned to discard_granularity of the device
> queue. The 2048 sectors still-mapped fragment will disappear.
>
> This patch introduces bio_aligned_discard_max_sectors() to return the
> the value which is aligned to q->limits.discard_granularity and closest
> to UINT_MAX. Then this patch replaces bio_allowed_max_sectors() with
> this new routine to decide a more proper split bio length.
>
> But we still need to handle the situation when discard start LBA is not
> aligned to q->limits.discard_granularity, otherwise even the length is
> aligned, current code may still leave 2048 fragment around every 4GB
> range. Therefore, to calculate req_sects, firstly the start LBA of
> discard range is checked, if it is not aligned to discard granularity,
> the first split location should make sure following bio has bi_sector
> aligned to discard granularity. Then there won't be still-mapped
> fragment in the middle of the discard range.
>
> The above is how this patch improves discard bio alignment in
> __blkdev_issue_discard(). Now with this patch, after discard with same
> command line mentiond previously, sg_get_lba_status returns,
> descriptor LBA: 0x0000000000000000  blocks: 106954752  deallocated
> descriptor LBA: 0x0000000006600000  blocks: 0  deallocated
>
> We an see there is no 2048 sectors segment anymore, everything is clean.
>
> Reported-by: Acshai Manoj <acshai.manoj@microfocus.com>
> Signed-off-by: Coly Li <colyli@suse.de>
> Cc: Bart Van Assche <bvanassche@acm.org>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Enzo Matsumiya <ematsumiya@suse.com>
> Cc: Hannes Reinecke <hare@suse.com>
> Cc: Jens Axboe <axboe@kernel.dk>
> Cc: Ming Lei <ming.lei@redhat.com>
> Cc: Xiao Ni <xni@redhat.com>
> ---
> Changelog:
> v2, the improved version with inspire from review comments by Bart,
>      Ming and Xiao.
> v1, the initial version.
>
>   block/blk-lib.c | 25 +++++++++++++++++++++++--
>   block/blk.h     | 14 ++++++++++++++
>   2 files changed, 37 insertions(+), 2 deletions(-)
>
> diff --git a/block/blk-lib.c b/block/blk-lib.c
> index 5f2c429d4378..7bffdee63a20 100644
> --- a/block/blk-lib.c
> +++ b/block/blk-lib.c
> @@ -55,8 +55,29 @@ int __blkdev_issue_discard(struct block_device *bdev, sector_t sector,
>   		return -EINVAL;
>   
>   	while (nr_sects) {
> -		sector_t req_sects = min_t(sector_t, nr_sects,
> -				bio_allowed_max_sectors(q));
> +		sector_t granularity_aligned_lba;
> +		sector_t req_sects;
> +
> +		granularity_aligned_lba = round_up(sector,
> +				q->limits.discard_granularity >> SECTOR_SHIFT);
> +
> +		/*
> +		 * Check whether the discard bio starts at a discard_granularity
> +		 * aligned LBA,
> +		 * - If no: set (granularity_aligned_lba - sector) to bi_size of
> +		 *   the first split bio, then the second bio will start at a
> +		 *   discard_granularity aligned LBA.
> +		 * - If yes: use bio_aligned_discard_max_sectors() as the max
> +		 *   possible bi_size of the first split bio. Then when this bio
> +		 *   is split in device drive, the split ones are very probably
> +		 *   to be aligned to discard_granularity of the device's queue.
> +		 */
> +		if (granularity_aligned_lba == sector)
> +			req_sects = min_t(sector_t, nr_sects,
> +					  bio_aligned_discard_max_sectors(q));
> +		else
> +			req_sects = min_t(sector_t, nr_sects,
> +					  granularity_aligned_lba - sector);
>   
>   		WARN_ON_ONCE((req_sects << 9) > UINT_MAX);
>   
> diff --git a/block/blk.h b/block/blk.h
> index 0a94ec68af32..589007ac564e 100644
> --- a/block/blk.h
> +++ b/block/blk.h
> @@ -292,6 +292,20 @@ static inline unsigned int bio_allowed_max_sectors(struct request_queue *q)
>   	return round_down(UINT_MAX, queue_logical_block_size(q)) >> 9;
>   }
>   
> +/*
> + * The max bio size which is aligned to q->limits.discard_granularity. This
> + * is a hint to split large discard bio in generic block layer, then if device
> + * driver needs to split the discard bio into smaller ones, their bi_size can
> + * be very probably and easily aligned to discard_granularity of the device's
> + * queue.
> + */
> +static inline unsigned int bio_aligned_discard_max_sectors(
> +					struct request_queue *q)
> +{
> +	return round_down(UINT_MAX, q->limits.discard_granularity) >>
> +			SECTOR_SHIFT;
> +}
> +
>   /*
>    * Internal io_context interface
>    */
Reviewed-by: Xiao Ni <xni@redhat.com>


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v3] block: improve discard bio alignment in __blkdev_issue_discard()
  2020-05-30 13:52 [PATCH v3] block: improve discard bio alignment in __blkdev_issue_discard() Coly Li
                   ` (2 preceding siblings ...)
  2020-06-01  5:55 ` Xiao Ni
@ 2020-06-01  7:15 ` Ming Lei
  3 siblings, 0 replies; 7+ messages in thread
From: Ming Lei @ 2020-06-01  7:15 UTC (permalink / raw)
  To: Coly Li
  Cc: linux-block, linux-bcache, Acshai Manoj, Bart Van Assche,
	Christoph Hellwig, Enzo Matsumiya, Hannes Reinecke, Jens Axboe,
	Xiao Ni

On Sat, May 30, 2020 at 09:52:31PM +0800, Coly Li wrote:
> This patch improves discard bio split for address and size alignment in
> __blkdev_issue_discard(). The aligned discard bio may help underlying
> device controller to perform better discard and internal garbage
> collection, and avoid unnecessary internal fragment.
> 
> Current discard bio split algorithm in __blkdev_issue_discard() may have
> non-discarded fregment on device even the discard bio LBA and size are
> both aligned to device's discard granularity size.
> 
> Here is the example steps on how to reproduce the above problem.
> - On a VMWare ESXi 6.5 update3 installation, create a 51GB virtual disk
>   with thin mode and give it to a Linux virtual machine.
> - Inside the Linux virtual machine, if the 50GB virtual disk shows up as
>   /dev/sdb, fill data into the first 50GB by,
>         # dd if=/dev/zero of=/dev/sdb bs=4096 count=13107200
> - Discard the 50GB range from offset 0 on /dev/sdb,
>         # blkdiscard /dev/sdb -o 0 -l 53687091200
> - Observe the underlying mapping status of the device
>         # sg_get_lba_status /dev/sdb -m 1048 --lba=0
>   descriptor LBA: 0x0000000000000000  blocks: 2048  mapped (or unknown)
>   descriptor LBA: 0x0000000000000800  blocks: 16773120  deallocated
>   descriptor LBA: 0x0000000000fff800  blocks: 2048  mapped (or unknown)
>   descriptor LBA: 0x0000000001000000  blocks: 8386560  deallocated
>   descriptor LBA: 0x00000000017ff800  blocks: 2048  mapped (or unknown)
>   descriptor LBA: 0x0000000001800000  blocks: 8386560  deallocated
>   descriptor LBA: 0x0000000001fff800  blocks: 2048  mapped (or unknown)
>   descriptor LBA: 0x0000000002000000  blocks: 8386560  deallocated
>   descriptor LBA: 0x00000000027ff800  blocks: 2048  mapped (or unknown)
>   descriptor LBA: 0x0000000002800000  blocks: 8386560  deallocated
>   descriptor LBA: 0x0000000002fff800  blocks: 2048  mapped (or unknown)
>   descriptor LBA: 0x0000000003000000  blocks: 8386560  deallocated
>   descriptor LBA: 0x00000000037ff800  blocks: 2048  mapped (or unknown)
>   descriptor LBA: 0x0000000003800000  blocks: 8386560  deallocated
>   descriptor LBA: 0x0000000003fff800  blocks: 2048  mapped (or unknown)
>   descriptor LBA: 0x0000000004000000  blocks: 8386560  deallocated
>   descriptor LBA: 0x00000000047ff800  blocks: 2048  mapped (or unknown)
>   descriptor LBA: 0x0000000004800000  blocks: 8386560  deallocated
>   descriptor LBA: 0x0000000004fff800  blocks: 2048  mapped (or unknown)
>   descriptor LBA: 0x0000000005000000  blocks: 8386560  deallocated
>   descriptor LBA: 0x00000000057ff800  blocks: 2048  mapped (or unknown)
>   descriptor LBA: 0x0000000005800000  blocks: 8386560  deallocated
>   descriptor LBA: 0x0000000005fff800  blocks: 2048  mapped (or unknown)
>   descriptor LBA: 0x0000000006000000  blocks: 6291456  deallocated
>   descriptor LBA: 0x0000000006600000  blocks: 0  deallocated
> 
> Although the discard bio starts at LBA 0 and has 50<<30 bytes size which
> are perfect aligned to the discard granularity, from the above list
> these are many 1MB (2048 sectors) internal fragments exist unexpectedly.
> 
> The problem is in __blkdev_issue_discard(), an improper algorithm causes
> an improper bio size which is not aligned.
> 
>  25 int __blkdev_issue_discard(struct block_device *bdev, sector_t sector,
>  26                 sector_t nr_sects, gfp_t gfp_mask, int flags,
>  27                 struct bio **biop)
>  28 {
>  29         struct request_queue *q = bdev_get_queue(bdev);
>    [snipped]
>  56
>  57         while (nr_sects) {
>  58                 sector_t req_sects = min_t(sector_t, nr_sects,
>  59                                 bio_allowed_max_sectors(q));
>  60
>  61                 WARN_ON_ONCE((req_sects << 9) > UINT_MAX);
>  62
>  63                 bio = blk_next_bio(bio, 0, gfp_mask);
>  64                 bio->bi_iter.bi_sector = sector;
>  65                 bio_set_dev(bio, bdev);
>  66                 bio_set_op_attrs(bio, op, 0);
>  67
>  68                 bio->bi_iter.bi_size = req_sects << 9;
>  69                 sector += req_sects;
>  70                 nr_sects -= req_sects;
>    [snipped]
>  79         }
>  80
>  81         *biop = bio;
>  82         return 0;
>  83 }
>  84 EXPORT_SYMBOL(__blkdev_issue_discard);
> 
> At line 58-59, to discard a 50GB range, req_sects is set as return value
> of bio_allowed_max_sectors(q), which is 8388607 sectors. In the above
> case, the discard granularity is 2048 sectors, although the start LBA
> and discard length are aligned to discard granularity, req_sects never
> has chance to be aligned to discard granularity. This is why there are
> some still-mapped 2048 sectors fragment in every 4 or 8 GB range.
> 
> If req_sects at line 58 is set to a value aligned to discard_granularity
> and close to UNIT_MAX, then all consequent split bios inside device
> driver are (almostly) aligned to discard_granularity of the device
> queue. The 2048 sectors still-mapped fragment will disappear.
> 
> This patch introduces bio_aligned_discard_max_sectors() to return the
> the value which is aligned to q->limits.discard_granularity and closest
> to UINT_MAX. Then this patch replaces bio_allowed_max_sectors() with
> this new routine to decide a more proper split bio length.
> 
> But we still need to handle the situation when discard start LBA is not
> aligned to q->limits.discard_granularity, otherwise even the length is
> aligned, current code may still leave 2048 fragment around every 4GB
> range. Therefore, to calculate req_sects, firstly the start LBA of
> discard range is checked, if it is not aligned to discard granularity,
> the first split location should make sure following bio has bi_sector
> aligned to discard granularity. Then there won't be still-mapped
> fragment in the middle of the discard range.
> 
> The above is how this patch improves discard bio alignment in
> __blkdev_issue_discard(). Now with this patch, after discard with same
> command line mentiond previously, sg_get_lba_status returns,
> descriptor LBA: 0x0000000000000000  blocks: 106954752  deallocated
> descriptor LBA: 0x0000000006600000  blocks: 0  deallocated
> 
> We an see there is no 2048 sectors segment anymore, everything is clean.
> 
> Reported-by: Acshai Manoj <acshai.manoj@microfocus.com>
> Signed-off-by: Coly Li <colyli@suse.de>
> Cc: Bart Van Assche <bvanassche@acm.org>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Enzo Matsumiya <ematsumiya@suse.com>
> Cc: Hannes Reinecke <hare@suse.com>
> Cc: Jens Axboe <axboe@kernel.dk>
> Cc: Ming Lei <ming.lei@redhat.com>
> Cc: Xiao Ni <xni@redhat.com>
> ---
> Changelog:
> v2, the improved version with inspire from review comments by Bart,
>     Ming and Xiao.
> v1, the initial version.
> 
>  block/blk-lib.c | 25 +++++++++++++++++++++++--
>  block/blk.h     | 14 ++++++++++++++
>  2 files changed, 37 insertions(+), 2 deletions(-)
> 
> diff --git a/block/blk-lib.c b/block/blk-lib.c
> index 5f2c429d4378..7bffdee63a20 100644
> --- a/block/blk-lib.c
> +++ b/block/blk-lib.c
> @@ -55,8 +55,29 @@ int __blkdev_issue_discard(struct block_device *bdev, sector_t sector,
>  		return -EINVAL;
>  
>  	while (nr_sects) {
> -		sector_t req_sects = min_t(sector_t, nr_sects,
> -				bio_allowed_max_sectors(q));
> +		sector_t granularity_aligned_lba;
> +		sector_t req_sects;
> +
> +		granularity_aligned_lba = round_up(sector,
> +				q->limits.discard_granularity >> SECTOR_SHIFT);
> +
> +		/*
> +		 * Check whether the discard bio starts at a discard_granularity
> +		 * aligned LBA,
> +		 * - If no: set (granularity_aligned_lba - sector) to bi_size of
> +		 *   the first split bio, then the second bio will start at a
> +		 *   discard_granularity aligned LBA.
> +		 * - If yes: use bio_aligned_discard_max_sectors() as the max
> +		 *   possible bi_size of the first split bio. Then when this bio
> +		 *   is split in device drive, the split ones are very probably
> +		 *   to be aligned to discard_granularity of the device's queue.
> +		 */
> +		if (granularity_aligned_lba == sector)
> +			req_sects = min_t(sector_t, nr_sects,
> +					  bio_aligned_discard_max_sectors(q));
> +		else
> +			req_sects = min_t(sector_t, nr_sects,
> +					  granularity_aligned_lba - sector);
>  
>  		WARN_ON_ONCE((req_sects << 9) > UINT_MAX);
>  
> diff --git a/block/blk.h b/block/blk.h
> index 0a94ec68af32..589007ac564e 100644
> --- a/block/blk.h
> +++ b/block/blk.h
> @@ -292,6 +292,20 @@ static inline unsigned int bio_allowed_max_sectors(struct request_queue *q)
>  	return round_down(UINT_MAX, queue_logical_block_size(q)) >> 9;
>  }
>  
> +/*
> + * The max bio size which is aligned to q->limits.discard_granularity. This
> + * is a hint to split large discard bio in generic block layer, then if device
> + * driver needs to split the discard bio into smaller ones, their bi_size can
> + * be very probably and easily aligned to discard_granularity of the device's
> + * queue.
> + */
> +static inline unsigned int bio_aligned_discard_max_sectors(
> +					struct request_queue *q)
> +{
> +	return round_down(UINT_MAX, q->limits.discard_granularity) >>
> +			SECTOR_SHIFT;
> +}
> +

Looks fine, and the root cause is that max bio size is UINT_MAX,
otherwise blk_bio_discard_split can do perfect splitting:

Reviewed-by: Ming Lei <ming.lei@redhat.com>

BTW, in the future, maybe we can consider to make both bio->bi_iter.bi_size 
and rq->__data_len to store sector count instead of bytes. The whole
storage stack should operate on 512bytes boundary.


thanks,
Ming

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH v3] block: improve discard bio alignment in __blkdev_issue_discard()
@ 2020-07-16 17:43 Coly Li
  0 siblings, 0 replies; 7+ messages in thread
From: Coly Li @ 2020-07-16 17:43 UTC (permalink / raw)
  To: linux-block
  Cc: linux-kernel, linux-bcache, Coly Li, Acshai Manoj,
	Hannes Reinecke, Ming Lei, Xiao Ni, Bart Van Assche,
	Christoph Hellwig, Enzo Matsumiya, Jens Axboe,
	Martin K . Petersen

This patch improves discard bio split for address and size alignment in
__blkdev_issue_discard(). The aligned discard bio may help underlying
device controller to perform better discard and internal garbage
collection, and avoid unnecessary internal fragment.

Current discard bio split algorithm in __blkdev_issue_discard() may have
non-discarded fregment on device even the discard bio LBA and size are
both aligned to device's discard granularity size.

Here is the example steps on how to reproduce the above problem.
- On a VMWare ESXi 6.5 update3 installation, create a 51GB virtual disk
  with thin mode and give it to a Linux virtual machine.
- Inside the Linux virtual machine, if the 50GB virtual disk shows up as
  /dev/sdb, fill data into the first 50GB by,
        # dd if=/dev/zero of=/dev/sdb bs=4096 count=13107200
- Discard the 50GB range from offset 0 on /dev/sdb,
        # blkdiscard /dev/sdb -o 0 -l 53687091200
- Observe the underlying mapping status of the device
        # sg_get_lba_status /dev/sdb -m 1048 --lba=0
  descriptor LBA: 0x0000000000000000  blocks: 2048  mapped (or unknown)
  descriptor LBA: 0x0000000000000800  blocks: 16773120  deallocated
  descriptor LBA: 0x0000000000fff800  blocks: 2048  mapped (or unknown)
  descriptor LBA: 0x0000000001000000  blocks: 8386560  deallocated
  descriptor LBA: 0x00000000017ff800  blocks: 2048  mapped (or unknown)
  descriptor LBA: 0x0000000001800000  blocks: 8386560  deallocated
  descriptor LBA: 0x0000000001fff800  blocks: 2048  mapped (or unknown)
  descriptor LBA: 0x0000000002000000  blocks: 8386560  deallocated
  descriptor LBA: 0x00000000027ff800  blocks: 2048  mapped (or unknown)
  descriptor LBA: 0x0000000002800000  blocks: 8386560  deallocated
  descriptor LBA: 0x0000000002fff800  blocks: 2048  mapped (or unknown)
  descriptor LBA: 0x0000000003000000  blocks: 8386560  deallocated
  descriptor LBA: 0x00000000037ff800  blocks: 2048  mapped (or unknown)
  descriptor LBA: 0x0000000003800000  blocks: 8386560  deallocated
  descriptor LBA: 0x0000000003fff800  blocks: 2048  mapped (or unknown)
  descriptor LBA: 0x0000000004000000  blocks: 8386560  deallocated
  descriptor LBA: 0x00000000047ff800  blocks: 2048  mapped (or unknown)
  descriptor LBA: 0x0000000004800000  blocks: 8386560  deallocated
  descriptor LBA: 0x0000000004fff800  blocks: 2048  mapped (or unknown)
  descriptor LBA: 0x0000000005000000  blocks: 8386560  deallocated
  descriptor LBA: 0x00000000057ff800  blocks: 2048  mapped (or unknown)
  descriptor LBA: 0x0000000005800000  blocks: 8386560  deallocated
  descriptor LBA: 0x0000000005fff800  blocks: 2048  mapped (or unknown)
  descriptor LBA: 0x0000000006000000  blocks: 6291456  deallocated
  descriptor LBA: 0x0000000006600000  blocks: 0  deallocated

Although the discard bio starts at LBA 0 and has 50<<30 bytes size which
are perfect aligned to the discard granularity, from the above list
these are many 1MB (2048 sectors) internal fragments exist unexpectedly.

The problem is in __blkdev_issue_discard(), an improper algorithm causes
an improper bio size which is not aligned.

 25 int __blkdev_issue_discard(struct block_device *bdev, sector_t sector,
 26                 sector_t nr_sects, gfp_t gfp_mask, int flags,
 27                 struct bio **biop)
 28 {
 29         struct request_queue *q = bdev_get_queue(bdev);
   [snipped]
 56
 57         while (nr_sects) {
 58                 sector_t req_sects = min_t(sector_t, nr_sects,
 59                                 bio_allowed_max_sectors(q));
 60
 61                 WARN_ON_ONCE((req_sects << 9) > UINT_MAX);
 62
 63                 bio = blk_next_bio(bio, 0, gfp_mask);
 64                 bio->bi_iter.bi_sector = sector;
 65                 bio_set_dev(bio, bdev);
 66                 bio_set_op_attrs(bio, op, 0);
 67
 68                 bio->bi_iter.bi_size = req_sects << 9;
 69                 sector += req_sects;
 70                 nr_sects -= req_sects;
   [snipped]
 79         }
 80
 81         *biop = bio;
 82         return 0;
 83 }
 84 EXPORT_SYMBOL(__blkdev_issue_discard);

At line 58-59, to discard a 50GB range, req_sects is set as return value
of bio_allowed_max_sectors(q), which is 8388607 sectors. In the above
case, the discard granularity is 2048 sectors, although the start LBA
and discard length are aligned to discard granularity, req_sects never
has chance to be aligned to discard granularity. This is why there are
some still-mapped 2048 sectors fragment in every 4 or 8 GB range.

If req_sects at line 58 is set to a value aligned to discard_granularity
and close to UNIT_MAX, then all consequent split bios inside device
driver are (almostly) aligned to discard_granularity of the device
queue. The 2048 sectors still-mapped fragment will disappear.

This patch introduces bio_aligned_discard_max_sectors() to return the
the value which is aligned to q->limits.discard_granularity and closest
to UINT_MAX. Then this patch replaces bio_allowed_max_sectors() with
this new routine to decide a more proper split bio length.

But we still need to handle the situation when discard start LBA is not
aligned to q->limits.discard_granularity, otherwise even the length is
aligned, current code may still leave 2048 fragment around every 4GB
range. Therefore, to calculate req_sects, firstly the start LBA of
discard range is checked, if it is not aligned to discard granularity,
the first split location should make sure following bio has bi_sector
aligned to discard granularity. Then there won't be still-mapped
fragment in the middle of the discard range.

The above is how this patch improves discard bio alignment in
__blkdev_issue_discard(). Now with this patch, after discard with same
command line mentiond previously, sg_get_lba_status returns,
descriptor LBA: 0x0000000000000000  blocks: 106954752  deallocated
descriptor LBA: 0x0000000006600000  blocks: 0  deallocated

We an see there is no 2048 sectors segment anymore, everything is clean.

Reported-and-tested-by: Acshai Manoj <acshai.manoj@microfocus.com>
Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Enzo Matsumiya <ematsumiya@suse.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
---
Changelog:
v3: handle partition offset suggested by Martin.
v2: fix the overflow pointed by Ming Lei.
v1: initial version.
 block/blk-lib.c | 31 ++++++++++++++++++++++++++++---
 block/blk.h     | 14 ++++++++++++++
 2 files changed, 42 insertions(+), 3 deletions(-)

diff --git a/block/blk-lib.c b/block/blk-lib.c
index 5f2c429d4378..019e09bb9c0e 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -29,7 +29,7 @@ int __blkdev_issue_discard(struct block_device *bdev, sector_t sector,
 	struct request_queue *q = bdev_get_queue(bdev);
 	struct bio *bio = *biop;
 	unsigned int op;
-	sector_t bs_mask;
+	sector_t bs_mask, part_offset = 0;
 
 	if (!q)
 		return -ENXIO;
@@ -54,9 +54,34 @@ int __blkdev_issue_discard(struct block_device *bdev, sector_t sector,
 	if (!nr_sects)
 		return -EINVAL;
 
+	/* In case the discard request is in a partition */
+	if (bdev->bd_partno)
+		part_offset = bdev->bd_part->start_sect;
+
 	while (nr_sects) {
-		sector_t req_sects = min_t(sector_t, nr_sects,
-				bio_allowed_max_sectors(q));
+		sector_t granularity_aligned_lba, req_sects;
+		sector_t sector_mapped = sector + part_offset;
+
+		granularity_aligned_lba = round_up(sector_mapped,
+				q->limits.discard_granularity >> SECTOR_SHIFT);
+
+		/*
+		 * Check whether the discard bio starts at a discard_granularity
+		 * aligned LBA,
+		 * - If no: set (granularity_aligned_lba - sector_mapped) to
+		 *   bi_size of the first split bio, then the second bio will
+		 *   start at a discard_granularity aligned LBA on the device.
+		 * - If yes: use bio_aligned_discard_max_sectors() as the max
+		 *   possible bi_size of the first split bio. Then when this bio
+		 *   is split in device drive, the split ones are very probably
+		 *   to be aligned to discard_granularity of the device's queue.
+		 */
+		if (granularity_aligned_lba == sector_mapped)
+			req_sects = min_t(sector_t, nr_sects,
+					  bio_aligned_discard_max_sectors(q));
+		else
+			req_sects = min_t(sector_t, nr_sects,
+					  granularity_aligned_lba - sector_mapped);
 
 		WARN_ON_ONCE((req_sects << 9) > UINT_MAX);
 
diff --git a/block/blk.h b/block/blk.h
index b5d1f0fc6547..a80738581f84 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -281,6 +281,20 @@ static inline unsigned int bio_allowed_max_sectors(struct request_queue *q)
 	return round_down(UINT_MAX, queue_logical_block_size(q)) >> 9;
 }
 
+/*
+ * The max bio size which is aligned to q->limits.discard_granularity. This
+ * is a hint to split large discard bio in generic block layer, then if device
+ * driver needs to split the discard bio into smaller ones, their bi_size can
+ * be very probably and easily aligned to discard_granularity of the device's
+ * queue.
+ */
+static inline unsigned int bio_aligned_discard_max_sectors(
+					struct request_queue *q)
+{
+	return round_down(UINT_MAX, q->limits.discard_granularity) >>
+			SECTOR_SHIFT;
+}
+
 /*
  * Internal io_context interface
  */
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2020-07-16 17:43 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-05-30 13:52 [PATCH v3] block: improve discard bio alignment in __blkdev_issue_discard() Coly Li
2020-05-30 13:52 ` Coly Li
2020-05-30 16:45 ` Hannes Reinecke
2020-06-01  5:55 ` Xiao Ni
2020-06-01  5:55   ` Xiao Ni
2020-06-01  7:15 ` Ming Lei
2020-07-16 17:43 Coly Li

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).