linux-scsi.vger.kernel.org archive mirror
* [PATCH 1/3] block: try one write zeroes request before going further
@ 2020-12-06  5:53 Tom Yan
  2020-12-06  5:53 ` [PATCH 2/3] block: make __blkdev_issue_zero_pages() less confusing Tom Yan
                   ` (4 more replies)
  0 siblings, 5 replies; 23+ messages in thread
From: Tom Yan @ 2020-12-06  5:53 UTC (permalink / raw)
  To: linux-block; +Cc: linux-scsi, Tom Yan

At least the SCSI disk driver is "benevolent" when it tries to decide
whether the device actually supports write zeroes, i.e. unless the
device explicitly reports otherwise, it assumes at first that it does.

Therefore, before we pile up bios that would all fail in the end, we
try the command/request once, as not doing so could trigger quite a
disaster in at least certain cases. For example, the host controller
can be messed up entirely when one does `blkdiscard -z` on a UAS drive.

Signed-off-by: Tom Yan <tom.ty89@gmail.com>
---
 block/blk-lib.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/block/blk-lib.c b/block/blk-lib.c
index e90614fd8d6a..c1e9388a8fb8 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -250,6 +250,7 @@ static int __blkdev_issue_write_zeroes(struct block_device *bdev,
 	struct bio *bio = *biop;
 	unsigned int max_write_zeroes_sectors;
 	struct request_queue *q = bdev_get_queue(bdev);
+	int i = 0;
 
 	if (!q)
 		return -ENXIO;
@@ -264,7 +265,17 @@ static int __blkdev_issue_write_zeroes(struct block_device *bdev,
 		return -EOPNOTSUPP;
 
 	while (nr_sects) {
-		bio = blk_next_bio(bio, 0, gfp_mask);
+		if (i != 1) {
+			bio = blk_next_bio(bio, 0, gfp_mask);
+		} else {
+			submit_bio_wait(bio);
+			bio_put(bio);
+
+			if (bdev_write_zeroes_sectors(bdev) == 0)
+				return -EOPNOTSUPP;
+			else
+				bio = bio_alloc(gfp_mask, 0);
+		}
 		bio->bi_iter.bi_sector = sector;
 		bio_set_dev(bio, bdev);
 		bio->bi_opf = REQ_OP_WRITE_ZEROES;
@@ -280,6 +291,7 @@ static int __blkdev_issue_write_zeroes(struct block_device *bdev,
 			nr_sects = 0;
 		}
 		cond_resched();
+		i++;
 	}
 
 	*biop = bio;
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 2/3] block: make __blkdev_issue_zero_pages() less confusing
  2020-12-06  5:53 [PATCH 1/3] block: try one write zeroes request before going further Tom Yan
@ 2020-12-06  5:53 ` Tom Yan
  2020-12-06 11:29   ` Hannes Reinecke
  2020-12-07 13:34   ` Christoph Hellwig
  2020-12-06  5:53 ` [PATCH 3/3] block: set REQ_PREFLUSH to the final bio from __blkdev_issue_zero_pages() Tom Yan
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 23+ messages in thread
From: Tom Yan @ 2020-12-06  5:53 UTC (permalink / raw)
  To: linux-block; +Cc: linux-scsi, Tom Yan

Instead of using the same check for the two layers of loops, count
bio pages in the inner loop.

Signed-off-by: Tom Yan <tom.ty89@gmail.com>
---
 block/blk-lib.c | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/block/blk-lib.c b/block/blk-lib.c
index c1e9388a8fb8..354dcab760c7 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -318,7 +318,7 @@ static int __blkdev_issue_zero_pages(struct block_device *bdev,
 	struct request_queue *q = bdev_get_queue(bdev);
 	struct bio *bio = *biop;
 	int bi_size = 0;
-	unsigned int sz;
+	unsigned int sz, bio_nr_pages;
 
 	if (!q)
 		return -ENXIO;
@@ -327,19 +327,18 @@ static int __blkdev_issue_zero_pages(struct block_device *bdev,
 		return -EPERM;
 
 	while (nr_sects != 0) {
-		bio = blk_next_bio(bio, __blkdev_sectors_to_bio_pages(nr_sects),
-				   gfp_mask);
+		bio_nr_pages = __blkdev_sectors_to_bio_pages(nr_sects);
+		bio = blk_next_bio(bio, bio_nr_pages, gfp_mask);
 		bio->bi_iter.bi_sector = sector;
 		bio_set_dev(bio, bdev);
 		bio_set_op_attrs(bio, REQ_OP_WRITE, 0);
 
-		while (nr_sects != 0) {
+		while (bio_nr_pages != 0) {
 			sz = min((sector_t) PAGE_SIZE, nr_sects << 9);
 			bi_size = bio_add_page(bio, ZERO_PAGE(0), sz, 0);
 			nr_sects -= bi_size >> 9;
 			sector += bi_size >> 9;
-			if (bi_size < sz)
-				break;
+			bio_nr_pages--;
 		}
 		cond_resched();
 	}
-- 
2.29.2



* [PATCH 3/3] block: set REQ_PREFLUSH to the final bio from __blkdev_issue_zero_pages()
  2020-12-06  5:53 [PATCH 1/3] block: try one write zeroes request before going further Tom Yan
  2020-12-06  5:53 ` [PATCH 2/3] block: make __blkdev_issue_zero_pages() less confusing Tom Yan
@ 2020-12-06  5:53 ` Tom Yan
  2020-12-06 11:31   ` Hannes Reinecke
  2020-12-07 13:35   ` Christoph Hellwig
  2020-12-06 11:25 ` [PATCH 1/3] block: try one write zeroes request before going further Hannes Reinecke
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 23+ messages in thread
From: Tom Yan @ 2020-12-06  5:53 UTC (permalink / raw)
  To: linux-block; +Cc: linux-scsi, Tom Yan

Mimic blkdev_issue_flush(). It seems like the right thing to do, as
the bios are a bunch of REQ_OP_WRITE requests.

Signed-off-by: Tom Yan <tom.ty89@gmail.com>
---
 block/blk-lib.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/block/blk-lib.c b/block/blk-lib.c
index 354dcab760c7..5579fdea893d 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -422,6 +422,8 @@ int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
 	} else if (!(flags & BLKDEV_ZERO_NOFALLBACK)) {
 		ret = __blkdev_issue_zero_pages(bdev, sector, nr_sects,
 						gfp_mask, &bio);
+		if (bio)
+			bio->bi_opf |= REQ_PREFLUSH;
 	} else {
 		/* No zeroing offload support */
 		ret = -EOPNOTSUPP;
-- 
2.29.2



* Re: [PATCH 1/3] block: try one write zeroes request before going further
  2020-12-06  5:53 [PATCH 1/3] block: try one write zeroes request before going further Tom Yan
  2020-12-06  5:53 ` [PATCH 2/3] block: make __blkdev_issue_zero_pages() less confusing Tom Yan
  2020-12-06  5:53 ` [PATCH 3/3] block: set REQ_PREFLUSH to the final bio from __blkdev_issue_zero_pages() Tom Yan
@ 2020-12-06 11:25 ` Hannes Reinecke
  2020-12-06 13:25   ` Tom Yan
  2020-12-07 13:36 ` Christoph Hellwig
  2020-12-08 22:46 ` Ewan D. Milne
  4 siblings, 1 reply; 23+ messages in thread
From: Hannes Reinecke @ 2020-12-06 11:25 UTC (permalink / raw)
  To: Tom Yan, linux-block; +Cc: linux-scsi

On 12/6/20 6:53 AM, Tom Yan wrote:
> At least the SCSI disk driver is "benevolent" when it tries to decide
> whether the device actually supports write zeroes, i.e. unless the
> device explicitly reports otherwise, it assumes at first that it does.
> 
> Therefore, before we pile up bios that would all fail in the end, we
> try the command/request once, as not doing so could trigger quite a
> disaster in at least certain cases. For example, the host controller
> can be messed up entirely when one does `blkdiscard -z` on a UAS drive.
> 
> Signed-off-by: Tom Yan <tom.ty89@gmail.com>
> ---
>   block/blk-lib.c | 14 +++++++++++++-
>   1 file changed, 13 insertions(+), 1 deletion(-)
> 
> diff --git a/block/blk-lib.c b/block/blk-lib.c
> index e90614fd8d6a..c1e9388a8fb8 100644
> --- a/block/blk-lib.c
> +++ b/block/blk-lib.c
> @@ -250,6 +250,7 @@ static int __blkdev_issue_write_zeroes(struct block_device *bdev,
>   	struct bio *bio = *biop;
>   	unsigned int max_write_zeroes_sectors;
>   	struct request_queue *q = bdev_get_queue(bdev);
> +	int i = 0;
>   
>   	if (!q)
>   		return -ENXIO;
> @@ -264,7 +265,17 @@ static int __blkdev_issue_write_zeroes(struct block_device *bdev,
>   		return -EOPNOTSUPP;
>   
>   	while (nr_sects) {
> -		bio = blk_next_bio(bio, 0, gfp_mask);
> +		if (i != 1) {
> +			bio = blk_next_bio(bio, 0, gfp_mask);
> +		} else {
> +			submit_bio_wait(bio);
> +			bio_put(bio);
> +
> +			if (bdev_write_zeroes_sectors(bdev) == 0)
> +				return -EOPNOTSUPP;
> +			else
> +				bio = bio_alloc(gfp_mask, 0);
> +		}
>   		bio->bi_iter.bi_sector = sector;
>   		bio_set_dev(bio, bdev);
>   		bio->bi_opf = REQ_OP_WRITE_ZEROES;
> @@ -280,6 +291,7 @@ static int __blkdev_issue_write_zeroes(struct block_device *bdev,
>   			nr_sects = 0;
>   		}
>   		cond_resched();
> +		i++;
>   	}
>   
>   	*biop = bio;
> 
We do want to keep the chain of bios intact such that end_io processing 
will recurse back to the original end_io callback.
As such we need to call bio_chain on the first bio, submit that 
(possibly with submit_bio_wait()), and then decide whether we can / 
should continue.
With your patch we'll lose the information that indeed other bios might 
be linked to the original one.
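
The chaining argument above can be illustrated with a toy userspace
model (a hedged sketch in plain C: the toy_ types and functions are
invented for illustration and greatly simplify the kernel's
bio_chain()/bio_endio() machinery): completion propagates up the chain,
so the final end_io only runs after every earlier link has completed.
Submitting and dropping a bio outside this mechanism, as the patch does
with bio_put(), severs that propagation.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of chained completion (illustrative only; these are not
 * the kernel's bio_chain()/bio_endio() signatures).  Chaining a child
 * to a parent bumps the parent's outstanding-completion count, so the
 * parent's "end_io" can only fire after every chained child is done. */
struct toy_bio {
	int remaining;		/* the bio itself plus chained children */
	struct toy_bio *parent;
	int ended;		/* 1 once its end_io has run */
};

static void toy_bio_init(struct toy_bio *b)
{
	b->remaining = 1;
	b->parent = NULL;
	b->ended = 0;
}

static void toy_bio_chain(struct toy_bio *child, struct toy_bio *parent)
{
	child->parent = parent;
	parent->remaining++;	/* parent now also waits for the child */
}

static void toy_bio_endio(struct toy_bio *b)
{
	if (--b->remaining > 0)
		return;		/* chained bios still outstanding */
	b->ended = 1;
	if (b->parent)
		toy_bio_endio(b->parent);	/* recurse up the chain */
}
```

Chaining a->b->c and completing a, then b, then c ends them strictly
in that order; clearing a parent pointer anywhere in the middle would
leave the tail's completion unobservable from the head.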

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer


* Re: [PATCH 2/3] block: make __blkdev_issue_zero_pages() less confusing
  2020-12-06  5:53 ` [PATCH 2/3] block: make __blkdev_issue_zero_pages() less confusing Tom Yan
@ 2020-12-06 11:29   ` Hannes Reinecke
  2020-12-06 13:28     ` Tom Yan
  2020-12-07 13:34   ` Christoph Hellwig
  1 sibling, 1 reply; 23+ messages in thread
From: Hannes Reinecke @ 2020-12-06 11:29 UTC (permalink / raw)
  To: Tom Yan, linux-block; +Cc: linux-scsi

On 12/6/20 6:53 AM, Tom Yan wrote:
> Instead of using the same check for the two layers of loops, count
> bio pages in the inner loop.
> 
> Signed-off-by: Tom Yan <tom.ty89@gmail.com>
> ---
>   block/blk-lib.c | 11 +++++------
>   1 file changed, 5 insertions(+), 6 deletions(-)
> 
> diff --git a/block/blk-lib.c b/block/blk-lib.c
> index c1e9388a8fb8..354dcab760c7 100644
> --- a/block/blk-lib.c
> +++ b/block/blk-lib.c
> @@ -318,7 +318,7 @@ static int __blkdev_issue_zero_pages(struct block_device *bdev,
>   	struct request_queue *q = bdev_get_queue(bdev);
>   	struct bio *bio = *biop;
>   	int bi_size = 0;
> -	unsigned int sz;
> +	unsigned int sz, bio_nr_pages;
>   
>   	if (!q)
>   		return -ENXIO;
> @@ -327,19 +327,18 @@ static int __blkdev_issue_zero_pages(struct block_device *bdev,
>   		return -EPERM;
>   
>   	while (nr_sects != 0) {
> -		bio = blk_next_bio(bio, __blkdev_sectors_to_bio_pages(nr_sects),
> -				   gfp_mask);
> +		bio_nr_pages = __blkdev_sectors_to_bio_pages(nr_sects);
> +		bio = blk_next_bio(bio, bio_nr_pages, gfp_mask);
>   		bio->bi_iter.bi_sector = sector;
>   		bio_set_dev(bio, bdev);
>   		bio_set_op_attrs(bio, REQ_OP_WRITE, 0);
>   
> -		while (nr_sects != 0) {
> +		while (bio_nr_pages != 0) {
>   			sz = min((sector_t) PAGE_SIZE, nr_sects << 9);

nr_sects will need to be modified, too, if we iterate over bio_nr_pages 
instead of nr_sects.

>   			bi_size = bio_add_page(bio, ZERO_PAGE(0), sz, 0);
>   			nr_sects -= bi_size >> 9;
>   			sector += bi_size >> 9;
> -			if (bi_size < sz)
> -				break;
> +			bio_nr_pages--;
>   		}
>   		cond_resched();
>   	}
> 

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer


* Re: [PATCH 3/3] block: set REQ_PREFLUSH to the final bio from __blkdev_issue_zero_pages()
  2020-12-06  5:53 ` [PATCH 3/3] block: set REQ_PREFLUSH to the final bio from __blkdev_issue_zero_pages() Tom Yan
@ 2020-12-06 11:31   ` Hannes Reinecke
  2020-12-06 13:32     ` Tom Yan
  2020-12-07 13:35   ` Christoph Hellwig
  1 sibling, 1 reply; 23+ messages in thread
From: Hannes Reinecke @ 2020-12-06 11:31 UTC (permalink / raw)
  To: Tom Yan, linux-block; +Cc: linux-scsi

On 12/6/20 6:53 AM, Tom Yan wrote:
> Mimic blkdev_issue_flush(). It seems like the right thing to do, as
> the bios are a bunch of REQ_OP_WRITE requests.
> 
> Signed-off-by: Tom Yan <tom.ty89@gmail.com>
> ---
>   block/blk-lib.c | 2 ++
>   1 file changed, 2 insertions(+)
> 
> diff --git a/block/blk-lib.c b/block/blk-lib.c
> index 354dcab760c7..5579fdea893d 100644
> --- a/block/blk-lib.c
> +++ b/block/blk-lib.c
> @@ -422,6 +422,8 @@ int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
>   	} else if (!(flags & BLKDEV_ZERO_NOFALLBACK)) {
>   		ret = __blkdev_issue_zero_pages(bdev, sector, nr_sects,
>   						gfp_mask, &bio);
> +		if (bio)
> +			bio->bi_opf |= REQ_PREFLUSH;
>   	} else {
>   		/* No zeroing offload support */
>   		ret = -EOPNOTSUPP;
> 
PREFLUSH is for the 'flush' machinery (cf blk-flush.c). Which is okay 
for blkdev_issue_flush(), but certainly not for zeroout.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer


* Re: [PATCH 1/3] block: try one write zeroes request before going further
  2020-12-06 11:25 ` [PATCH 1/3] block: try one write zeroes request before going further Hannes Reinecke
@ 2020-12-06 13:25   ` Tom Yan
  2020-12-06 13:56     ` Hannes Reinecke
  0 siblings, 1 reply; 23+ messages in thread
From: Tom Yan @ 2020-12-06 13:25 UTC (permalink / raw)
  To: Hannes Reinecke; +Cc: linux-block, linux-scsi

I think you misunderstood it. The goal of this patch is to split the
current situation into two chains (or one unchained bio + a series of
chained bios). The first one is an attempt/trial that makes sure the
subsequent large bio chain can actually be handled (as per the
"command capability" of the device).

P.S. I think I missed the fact that it requires my blk_next_bio()
patch to work properly. (It still seems like a typo bug to me.)

On Sun, 6 Dec 2020 at 19:25, Hannes Reinecke <hare@suse.de> wrote:
>
> On 12/6/20 6:53 AM, Tom Yan wrote:
> > At least the SCSI disk driver is "benevolent" when it tries to decide
> > whether the device actually supports write zeroes, i.e. unless the
> > device explicitly reports otherwise, it assumes at first that it does.
> >
> > Therefore, before we pile up bios that would all fail in the end, we
> > try the command/request once, as not doing so could trigger quite a
> > disaster in at least certain cases. For example, the host controller
> > can be messed up entirely when one does `blkdiscard -z` on a UAS drive.
> >
> > Signed-off-by: Tom Yan <tom.ty89@gmail.com>
> > ---
> >   block/blk-lib.c | 14 +++++++++++++-
> >   1 file changed, 13 insertions(+), 1 deletion(-)
> >
> > diff --git a/block/blk-lib.c b/block/blk-lib.c
> > index e90614fd8d6a..c1e9388a8fb8 100644
> > --- a/block/blk-lib.c
> > +++ b/block/blk-lib.c
> > @@ -250,6 +250,7 @@ static int __blkdev_issue_write_zeroes(struct block_device *bdev,
> >       struct bio *bio = *biop;
> >       unsigned int max_write_zeroes_sectors;
> >       struct request_queue *q = bdev_get_queue(bdev);
> > +     int i = 0;
> >
> >       if (!q)
> >               return -ENXIO;
> > @@ -264,7 +265,17 @@ static int __blkdev_issue_write_zeroes(struct block_device *bdev,
> >               return -EOPNOTSUPP;
> >
> >       while (nr_sects) {
> > -             bio = blk_next_bio(bio, 0, gfp_mask);
> > +             if (i != 1) {
> > +                     bio = blk_next_bio(bio, 0, gfp_mask);
> > +             } else {
> > +                     submit_bio_wait(bio);
> > +                     bio_put(bio);
> > +
> > +                     if (bdev_write_zeroes_sectors(bdev) == 0)
> > +                             return -EOPNOTSUPP;
> > +                     else
> > +                             bio = bio_alloc(gfp_mask, 0);
> > +             }
> >               bio->bi_iter.bi_sector = sector;
> >               bio_set_dev(bio, bdev);
> >               bio->bi_opf = REQ_OP_WRITE_ZEROES;
> > @@ -280,6 +291,7 @@ static int __blkdev_issue_write_zeroes(struct block_device *bdev,
> >                       nr_sects = 0;
> >               }
> >               cond_resched();
> > +             i++;
> >       }
> >
> >       *biop = bio;
> >
> We do want to keep the chain of bios intact such that end_io processing
> will recurse back to the original end_io callback.
> As such we need to call bio_chain on the first bio, submit that
> (possibly with submit_bio_wait()), and then decide whether we can /
> should continue.
> With your patch we'll lose the information that indeed other bios might
> be linked to the original one.
>
> Cheers,
>
> Hannes
> --
> Dr. Hannes Reinecke                Kernel Storage Architect
> hare@suse.de                              +49 911 74053 688
> SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
> HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer


* Re: [PATCH 2/3] block: make __blkdev_issue_zero_pages() less confusing
  2020-12-06 11:29   ` Hannes Reinecke
@ 2020-12-06 13:28     ` Tom Yan
  0 siblings, 0 replies; 23+ messages in thread
From: Tom Yan @ 2020-12-06 13:28 UTC (permalink / raw)
  To: Hannes Reinecke; +Cc: linux-block, linux-scsi

On Sun, 6 Dec 2020 at 19:29, Hannes Reinecke <hare@suse.de> wrote:
>
> On 12/6/20 6:53 AM, Tom Yan wrote:
> > Instead of using the same check for the two layers of loops, count
> > bio pages in the inner loop.
> >
> > Signed-off-by: Tom Yan <tom.ty89@gmail.com>
> > ---
> >   block/blk-lib.c | 11 +++++------
> >   1 file changed, 5 insertions(+), 6 deletions(-)
> >
> > diff --git a/block/blk-lib.c b/block/blk-lib.c
> > index c1e9388a8fb8..354dcab760c7 100644
> > --- a/block/blk-lib.c
> > +++ b/block/blk-lib.c
> > @@ -318,7 +318,7 @@ static int __blkdev_issue_zero_pages(struct block_device *bdev,
> >       struct request_queue *q = bdev_get_queue(bdev);
> >       struct bio *bio = *biop;
> >       int bi_size = 0;
> > -     unsigned int sz;
> > +     unsigned int sz, bio_nr_pages;
> >
> >       if (!q)
> >               return -ENXIO;
> > @@ -327,19 +327,18 @@ static int __blkdev_issue_zero_pages(struct block_device *bdev,
> >               return -EPERM;
> >
> >       while (nr_sects != 0) {
> > -             bio = blk_next_bio(bio, __blkdev_sectors_to_bio_pages(nr_sects),
> > -                                gfp_mask);
> > +             bio_nr_pages = __blkdev_sectors_to_bio_pages(nr_sects);
> > +             bio = blk_next_bio(bio, bio_nr_pages, gfp_mask);
> >               bio->bi_iter.bi_sector = sector;
> >               bio_set_dev(bio, bdev);
> >               bio_set_op_attrs(bio, REQ_OP_WRITE, 0);
> >
> > -             while (nr_sects != 0) {
> > +             while (bio_nr_pages != 0) {
> >                       sz = min((sector_t) PAGE_SIZE, nr_sects << 9);
>
> nr_sects will need to be modified, too, if we iterate over bio_nr_pages
> instead of nr_sects.
>
> >                       bi_size = bio_add_page(bio, ZERO_PAGE(0), sz, 0);
> >                       nr_sects -= bi_size >> 9;
Not sure what modification you are suggesting. We are still deducting
the "added" sectors from it.
> >                       sector += bi_size >> 9;
> > -                     if (bi_size < sz)
> > -                             break;
> > +                     bio_nr_pages--;
> >               }
> >               cond_resched();
> >       }
> >
>
> Cheers,
>
> Hannes
> --
> Dr. Hannes Reinecke                Kernel Storage Architect
> hare@suse.de                              +49 911 74053 688
> SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
> HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer
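
The accounting described above (the "added" sectors still being
deducted from nr_sects inside the inner loop) can be checked with a
small userspace model of the patched loop. This is a hedged sketch:
the toy_ constants and helpers are simplified stand-ins for PAGE_SIZE,
BIO_MAX_PAGES and __blkdev_sectors_to_bio_pages(), and bio_add_page()
is assumed to always add the full requested size.

```c
#include <assert.h>

#define TOY_PAGE_SECTORS 8ULL	/* 4K page / 512B sector; stands in for PAGE_SIZE >> 9 */
#define TOY_BIO_MAX_PAGES 256ULL	/* stands in for BIO_MAX_PAGES */

/* Stand-in for __blkdev_sectors_to_bio_pages(): pages needed to cover
 * nr_sects, rounded up and capped at what one bio can hold. */
static unsigned int toy_bio_pages(unsigned long long nr_sects)
{
	unsigned long long pages =
		(nr_sects + TOY_PAGE_SECTORS - 1) / TOY_PAGE_SECTORS;

	return pages > TOY_BIO_MAX_PAGES ? (unsigned int)TOY_BIO_MAX_PAGES
					 : (unsigned int)pages;
}

/* Model of the patched loop: the inner loop iterates bio_nr_pages
 * times, and the sectors covered by each "added" page are deducted
 * from nr_sects as it goes.  Returns how many bios were "built". */
static unsigned int toy_zero_pages(unsigned long long *nr_sects)
{
	unsigned int bios = 0;

	while (*nr_sects != 0) {
		unsigned int bio_nr_pages = toy_bio_pages(*nr_sects);

		bios++;
		while (bio_nr_pages != 0) {
			/* sz = min(PAGE_SIZE, nr_sects << 9), in sectors */
			unsigned long long sz =
				*nr_sects < TOY_PAGE_SECTORS ?
				*nr_sects : TOY_PAGE_SECTORS;

			*nr_sects -= sz;	/* deducted inside the loop */
			bio_nr_pages--;
		}
	}
	return bios;
}
```

In this model nr_sects is fully drained: each capped bio covers 256
pages' worth of sectors, and the final bio covers the tail.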


* Re: [PATCH 3/3] block: set REQ_PREFLUSH to the final bio from __blkdev_issue_zero_pages()
  2020-12-06 11:31   ` Hannes Reinecke
@ 2020-12-06 13:32     ` Tom Yan
  2020-12-06 14:05       ` Hannes Reinecke
  0 siblings, 1 reply; 23+ messages in thread
From: Tom Yan @ 2020-12-06 13:32 UTC (permalink / raw)
  To: Hannes Reinecke; +Cc: linux-block, linux-scsi

Why? Did you miss that this is in the branch where
__blkdev_issue_zero_pages() is called (i.e. it's not WRITE SAME but
WRITE)? From what I gathered, REQ_PREFLUSH triggers a flush of the
write-back cache (the one on the device; not sure about dirty pages),
so wouldn't it be the right thing to do after we have performed a
series of WRITEs (which are more or less purposed to wipe a drive clean)?

On Sun, 6 Dec 2020 at 19:31, Hannes Reinecke <hare@suse.de> wrote:
>
> On 12/6/20 6:53 AM, Tom Yan wrote:
> > Mimic blkdev_issue_flush(). It seems like the right thing to do, as
> > the bios are a bunch of REQ_OP_WRITE requests.
> >
> > Signed-off-by: Tom Yan <tom.ty89@gmail.com>
> > ---
> >   block/blk-lib.c | 2 ++
> >   1 file changed, 2 insertions(+)
> >
> > diff --git a/block/blk-lib.c b/block/blk-lib.c
> > index 354dcab760c7..5579fdea893d 100644
> > --- a/block/blk-lib.c
> > +++ b/block/blk-lib.c
> > @@ -422,6 +422,8 @@ int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
> >       } else if (!(flags & BLKDEV_ZERO_NOFALLBACK)) {
> >               ret = __blkdev_issue_zero_pages(bdev, sector, nr_sects,
> >                                               gfp_mask, &bio);
> > +             if (bio)
> > +                     bio->bi_opf |= REQ_PREFLUSH;
> >       } else {
> >               /* No zeroing offload support */
> >               ret = -EOPNOTSUPP;
> >
> PREFLUSH is for the 'flush' machinery (cf blk-flush.c). Which is okay
> for blkdev_issue_flush(), but certainly not for zeroout.
>
> Cheers,
>
> Hannes
> --
> Dr. Hannes Reinecke                Kernel Storage Architect
> hare@suse.de                              +49 911 74053 688
> SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
> HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer


* Re: [PATCH 1/3] block: try one write zeroes request before going further
  2020-12-06 13:25   ` Tom Yan
@ 2020-12-06 13:56     ` Hannes Reinecke
  2020-12-06 14:07       ` Tom Yan
  0 siblings, 1 reply; 23+ messages in thread
From: Hannes Reinecke @ 2020-12-06 13:56 UTC (permalink / raw)
  To: Tom Yan; +Cc: linux-block, linux-scsi

On 12/6/20 2:25 PM, Tom Yan wrote:
> I think you misunderstood it. The goal of this patch is to split the
> current situation into two chains (or one unchained bio + a series of
> chained bios). The first one is an attempt/trial that makes sure the
> subsequent large bio chain can actually be handled (as per the
> "command capability" of the device).
> 
Oh, I think I do get what you're trying to do. And, in fact, I don't 
argue with what you're trying to achieve.

What I would like to see, though, is to keep the current bio_chain logic 
intact (irrespective of your previous patch, which should actually be 
part of this series), and just lift the first check out of the loop:

@@ -262,9 +262,14 @@ static int __blkdev_issue_write_zeroes(struct 
block_device *bdev,

         if (max_write_zeroes_sectors == 0)
                 return -EOPNOTSUPP;
-
+       new = bio_alloc(gfp_mask, 0);
+       bio_chain(bio, new);
+       if (submit_bio_wait(bio) == BLK_STS_NOTSUPP) {
+               bio_put(new);
+               return -ENOPNOTSUPP;
+       }
+       bio = new;
         while (nr_sects) {
-               bio = blk_next_bio(bio, 0, gfp_mask);
                 bio->bi_iter.bi_sector = sector;
                 bio_set_dev(bio, bdev);
                 bio->bi_opf = REQ_OP_WRITE_ZEROES;
@@ -279,6 +284,7 @@ static int __blkdev_issue_write_zeroes(struct 
block_device *bdev,
                         bio->bi_iter.bi_size = nr_sects << 9;
                         nr_sects = 0;
                 }
+               bio = blk_next_bio(bio, 0, gfp_mask);
                 cond_resched();
         }

(The error checking from submit_bio_wait() could be improved :-)

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer


* Re: [PATCH 3/3] block: set REQ_PREFLUSH to the final bio from __blkdev_issue_zero_pages()
  2020-12-06 13:32     ` Tom Yan
@ 2020-12-06 14:05       ` Hannes Reinecke
  2020-12-06 14:14         ` Tom Yan
  0 siblings, 1 reply; 23+ messages in thread
From: Hannes Reinecke @ 2020-12-06 14:05 UTC (permalink / raw)
  To: Tom Yan; +Cc: linux-block, linux-scsi

On 12/6/20 2:32 PM, Tom Yan wrote:
> Why? Did you miss that this is in the branch where
> __blkdev_issue_zero_pages() is called (i.e. it's not WRITE SAME but
> WRITE)? From what I gathered, REQ_PREFLUSH triggers a flush of the
> write-back cache (the one on the device; not sure about dirty pages),
> so wouldn't it be the right thing to do after we have performed a
> series of WRITEs (which are more or less purposed to wipe a drive clean)?
> 

But what makes 'zero_pages' special as compared to, say, WRITE SAME?
One could use WRITE SAME with '0' content, arriving at pretty much the
same result as using zeroout without unmapping. And neither of them
worries about cache flushing. Nor should they, IMO.

These are 'native' block layer calls, providing abstract access to
hardware functionality. If an application wants to use them, it is the
task of the application to insert a 'flush' if it deems necessary.
(There _is_ blkdev_issue_flush(), after all).
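
The application-side flush can be sketched in userspace C. This is a
hedged example: the function name is invented, and plain pwrite(2)
stands in for the BLKZEROOUT ioctl that a real wipe tool such as
`blkdiscard -z` issues; the point is only that the flush is an
explicit, separate step, and on a block-device fd fsync(2) ends up in
the same flush machinery as the kernel's blkdev_issue_flush().

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Zero a range with ordinary writes, then flush explicitly as a
 * separate application-level step.  Illustrative name; a real tool
 * would use the BLKZEROOUT ioctl on a block device instead of pwrite. */
static int zero_and_flush(int fd, off_t len)
{
	char zeros[4096];
	off_t done = 0;

	memset(zeros, 0, sizeof(zeros));
	while (done < len) {
		size_t chunk = (size_t)(len - done) < sizeof(zeros) ?
			       (size_t)(len - done) : sizeof(zeros);
		ssize_t w = pwrite(fd, zeros, chunk, done);

		if (w < 0)
			return -1;
		done += w;
	}
	/* On a block-device fd this triggers a device cache flush. */
	return fsync(fd);
}
```

Against a real device this would run on an O_WRONLY block-device fd;
on a regular file, fsync() flushes the page cache instead.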

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer


* Re: [PATCH 1/3] block: try one write zeroes request before going further
  2020-12-06 13:56     ` Hannes Reinecke
@ 2020-12-06 14:07       ` Tom Yan
  2020-12-06 14:28         ` Tom Yan
  0 siblings, 1 reply; 23+ messages in thread
From: Tom Yan @ 2020-12-06 14:07 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: linux-block, linux-scsi, Christoph Hellwig, tom.leiming, axboe

Yes, it does have a "dependency" on the blk_next_bio() patch. I just
somehow missed that.

The problem is, I don't think I'm trying to change the logic of
bio_chain(), or even that of blk_next_bio(). It really just looks like
a careless mistake: the arguments were typed in the wrong order.

Adding those who signed off the original commit (block: remove struct
bio_batch / 9082e87b) to the CC list here too.


On Sun, 6 Dec 2020 at 21:56, Hannes Reinecke <hare@suse.de> wrote:
>
> On 12/6/20 2:25 PM, Tom Yan wrote:
> > I think you misunderstood it. The goal of this patch is to split the
> > current situation into two chains (or one unchained bio + a series of
> > chained bios). The first one is an attempt/trial that makes sure the
> > subsequent large bio chain can actually be handled (as per the
> > "command capability" of the device).
> >
> Oh, I think I do get what you're trying to do. And, in fact, I don't
> argue with what you're trying to achieve.
>
> What I would like to see, though, is to keep the current bio_chain logic
> intact (irrespective of your previous patch, which should actually be
> part of this series), and just lift the first check out of the loop:
>
> @@ -262,9 +262,14 @@ static int __blkdev_issue_write_zeroes(struct
> block_device *bdev,
>
>          if (max_write_zeroes_sectors == 0)
>                  return -EOPNOTSUPP;
> -
> +       new = bio_alloc(gfp_mask, 0);
> +       bio_chain(bio, new);
> +       if (submit_bio_wait(bio) == BLK_STS_NOTSUPP) {
> +               bio_put(new);
> +               return -ENOPNOTSUPP;
> +       }
> +       bio = new;
>          while (nr_sects) {
> -               bio = blk_next_bio(bio, 0, gfp_mask);
>                  bio->bi_iter.bi_sector = sector;
>                  bio_set_dev(bio, bdev);
>                  bio->bi_opf = REQ_OP_WRITE_ZEROES;
> @@ -279,6 +284,7 @@ static int __blkdev_issue_write_zeroes(struct
> block_device *bdev,
>                          bio->bi_iter.bi_size = nr_sects << 9;
>                          nr_sects = 0;
>                  }
> +               bio = blk_next_bio(bio, 0, gfp_mask);
>                  cond_resched();
>          }
>
> (The error checking from submit_bio_wait() could be improved :-)
>
> Cheers,
>
> Hannes
> --
> Dr. Hannes Reinecke                Kernel Storage Architect
> hare@suse.de                              +49 911 74053 688
> SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
> HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer


* Re: [PATCH 3/3] block: set REQ_PREFLUSH to the final bio from __blkdev_issue_zero_pages()
  2020-12-06 14:05       ` Hannes Reinecke
@ 2020-12-06 14:14         ` Tom Yan
  2020-12-06 16:05           ` Hannes Reinecke
  0 siblings, 1 reply; 23+ messages in thread
From: Tom Yan @ 2020-12-06 14:14 UTC (permalink / raw)
  To: Hannes Reinecke; +Cc: linux-block, linux-scsi

On Sun, 6 Dec 2020 at 22:05, Hannes Reinecke <hare@suse.de> wrote:
>
> On 12/6/20 2:32 PM, Tom Yan wrote:
> > Why? Did you miss that this is in the branch where
> > __blkdev_issue_zero_pages() is called (i.e. it's not WRITE SAME but
> > WRITE)? From what I gathered, REQ_PREFLUSH triggers a flush of the
> > write-back cache (the one on the device; not sure about dirty pages),
> > so wouldn't it be the right thing to do after we have performed a
> > series of WRITEs (which are more or less purposed to wipe a drive clean)?
> >
>
> But what makes 'zero_pages' special as compared to, say, WRITE SAME?
> One could use WRITE SAME with '0' content, arriving at pretty much the
> same result as using zeroout without unmapping. And neither of them
> worries about cache flushing. Nor should they, IMO.

Because we are writing actual pages (just that they are zero and
"shared memory" in the system) to the device, instead of triggering a
special command (with a specific parameter)?

>
> These are 'native' block layer calls, providing abstract accesses to
> hardware functionality. If an application wants to use them, it would be
> the task of the application to insert a 'flush' if it deems neccessary.
> (There _is_ blkdev_issue_flush(), after all).

Well, my argument would be that the call has the purpose of "wiping",
so it should try to "atomically" guarantee that the wiping is synced.
It's like a complement to the REQ_SYNC in the final submit_bio_wait().

>
> Cheers,
>
> Hannes
> --
> Dr. Hannes Reinecke                Kernel Storage Architect
> hare@suse.de                              +49 911 74053 688
> SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
> HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer


* Re: [PATCH 1/3] block: try one write zeroes request before going further
  2020-12-06 14:07       ` Tom Yan
@ 2020-12-06 14:28         ` Tom Yan
  0 siblings, 0 replies; 23+ messages in thread
From: Tom Yan @ 2020-12-06 14:28 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: linux-block, linux-scsi, Christoph Hellwig, tom.leiming, axboe

Btw, while this series relies on the blk_next_bio() patch to work, that
was not the reason I sent the latter. It was just that the way it calls
bio_chain() doesn't look right for any of the functions that make use
of it (or, in other words, for its own apparent logic). That's actually
why I didn't put it in the same series.

On Sun, 6 Dec 2020 at 22:07, Tom Yan <tom.ty89@gmail.com> wrote:
>
> Yes, it does have a "dependency" on the blk_next_bio() patch. I just
> somehow missed that.
>
> The problem is, I don't think I'm trying to change the logic of
> bio_chain(), or even that of blk_next_bio(). It really just looks like
> a careless mistake: the arguments were typed in the wrong order.
>
> Adding those who signed off the original commit (block: remove struct
> bio_batch / 9082e87b) to the CC list here too.
>
>
> On Sun, 6 Dec 2020 at 21:56, Hannes Reinecke <hare@suse.de> wrote:
> >
> > On 12/6/20 2:25 PM, Tom Yan wrote:
> > > I think you misunderstood it. The goal of this patch is to split the
> > > current situation into two chains (or one unchained bio + a series of
> > > chained bio). The first one is an attempt/trial which makes sure that
> > > the latter large bio chain can actually be handled (as per the
> > > "command capability" of the device).
> > >
> > Oh, I think I do get what you're trying to do. And, in fact, I don't
> > argue with what you're trying to achieve.
> >
> > What I would like to see, though, is keep the current bio_chain logic
> > intact (irrespective of your previous patch, which should actually be
> > part of this series), and just lift the first check out of the loop:
> >
> > @@ -262,9 +262,14 @@ static int __blkdev_issue_write_zeroes(struct
> > block_device *bdev,
> >
> >          if (max_write_zeroes_sectors == 0)
> >                  return -EOPNOTSUPP;
> > -
> > +       new = bio_alloc(gfp_mask, 0);
> > +       bio_chain(bio, new);
> > +       if (submit_bio_wait(bio) == -EOPNOTSUPP) {
> > +               bio_put(new);
> > +               return -EOPNOTSUPP;
> > +       }
> > +       bio = new;
> >          while (nr_sects) {
> > -               bio = blk_next_bio(bio, 0, gfp_mask);
> >                  bio->bi_iter.bi_sector = sector;
> >                  bio_set_dev(bio, bdev);
> >                  bio->bi_opf = REQ_OP_WRITE_ZEROES;
> > @@ -279,6 +284,7 @@ static int __blkdev_issue_write_zeroes(struct
> > block_device *bdev,
> >                          bio->bi_iter.bi_size = nr_sects << 9;
> >                          nr_sects = 0;
> >                  }
> > +               bio = blk_next_bio(bio, 0, gfp_mask);
> >                  cond_resched();
> >          }
> >
> > (The error checking from submit_bio_wait() could be improved :-)
> >
> > Cheers,
> >
> > Hannes
> > --
> > Dr. Hannes Reinecke                Kernel Storage Architect
> > hare@suse.de                              +49 911 74053 688
> > SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
> > HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer


* Re: [PATCH 3/3] block: set REQ_PREFLUSH to the final bio from __blkdev_issue_zero_pages()
  2020-12-06 14:14         ` Tom Yan
@ 2020-12-06 16:05           ` Hannes Reinecke
  2020-12-08 12:51             ` Tom Yan
  0 siblings, 1 reply; 23+ messages in thread
From: Hannes Reinecke @ 2020-12-06 16:05 UTC (permalink / raw)
  To: Tom Yan; +Cc: linux-block, linux-scsi

On 12/6/20 3:14 PM, Tom Yan wrote:
> On Sun, 6 Dec 2020 at 22:05, Hannes Reinecke <hare@suse.de> wrote:
>>
>> On 12/6/20 2:32 PM, Tom Yan wrote:
>>> Why? Did you miss that it is in the condition where
>>> __blkdev_issue_zero_pages() is called (i.e. it's not WRITE SAME but
>>> WRITE)? From what I gathered, REQ_PREFLUSH triggers a flush of the
>>> write-back cache (the one on the device; not sure about dirty pages).
>>> Wouldn't it be the right thing to do after we performed a series of
>>> WRITEs (which are more or less purposed to get a drive wiped clean)?
>>>
>>
>> But what makes 'zero_pages' special as compared to, say, WRITE_SAME?
>> One could use WRITE SAME with '0' content, arriving at pretty much the
>> same content as using zeroout without unmapping. And neither of them
>> worries about cache flushing.
>> Nor should they, IMO.
> 
> Because we are writing actual pages (just that they are zero and
> "shared memory" in the system) to the device, instead of triggering a
> special command (with a specific parameter)?
> 

But these pages are ephemeral, and never visible to the user.

>>
>> These are 'native' block layer calls, providing abstract accesses to
>> hardware functionality. If an application wants to use them, it would be
>> the task of the application to insert a 'flush' if it deems necessary.
>> (There _is_ blkdev_issue_flush(), after all).
> 
> Well, my argument would be that the call has the purpose of "wiping",
> so it should try to "atomically" guarantee that the wipe is synced.
> It's like a complement to the REQ_SYNC in the final submit_bio_wait().
> 
That's an assumption.

It would be valid if blkdev_issue_zeroout() would only allow to wipe the 
entire disk. As it stands, it doesn't, and so we shouldn't presume what 
users might want to do with it.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer


* Re: [PATCH 2/3] block: make __blkdev_issue_zero_pages() less confusing
  2020-12-06  5:53 ` [PATCH 2/3] block: make __blkdev_issue_zero_pages() less confusing Tom Yan
  2020-12-06 11:29   ` Hannes Reinecke
@ 2020-12-07 13:34   ` Christoph Hellwig
  2020-12-08 12:54     ` Tom Yan
  1 sibling, 1 reply; 23+ messages in thread
From: Christoph Hellwig @ 2020-12-07 13:34 UTC (permalink / raw)
  To: Tom Yan; +Cc: linux-block, linux-scsi

On Sun, Dec 06, 2020 at 01:53:31PM +0800, Tom Yan wrote:
> Instead of using the same check for the two layers of loops, count
> bio pages in the inner loop instead.

I don't really see any major benefit of one version over the other.


* Re: [PATCH 3/3] block: set REQ_PREFLUSH to the final bio from __blkdev_issue_zero_pages()
  2020-12-06  5:53 ` [PATCH 3/3] block: set REQ_PREFLUSH to the final bio from __blkdev_issue_zero_pages() Tom Yan
  2020-12-06 11:31   ` Hannes Reinecke
@ 2020-12-07 13:35   ` Christoph Hellwig
  1 sibling, 0 replies; 23+ messages in thread
From: Christoph Hellwig @ 2020-12-07 13:35 UTC (permalink / raw)
  To: Tom Yan; +Cc: linux-block, linux-scsi

On Sun, Dec 06, 2020 at 01:53:32PM +0800, Tom Yan wrote:
> Mimicking blkdev_issue_flush(). Seems like the right thing to do, as
> they are a bunch of REQ_OP_WRITE bios.

Huh?  Zeroing out is in no way a data integrity operation that needs a
preflush.


* Re: [PATCH 1/3] block: try one write zeroes request before going further
  2020-12-06  5:53 [PATCH 1/3] block: try one write zeroes request before going further Tom Yan
                   ` (2 preceding siblings ...)
  2020-12-06 11:25 ` [PATCH 1/3] block: try one write zeroes request before going further Hannes Reinecke
@ 2020-12-07 13:36 ` Christoph Hellwig
  2020-12-08 12:48   ` Tom Yan
  2020-12-08 22:46 ` Ewan D. Milne
  4 siblings, 1 reply; 23+ messages in thread
From: Christoph Hellwig @ 2020-12-07 13:36 UTC (permalink / raw)
  To: Tom Yan; +Cc: linux-block, linux-scsi

On Sun, Dec 06, 2020 at 01:53:30PM +0800, Tom Yan wrote:
> At least the SCSI disk driver is "benevolent" when it tries to decide
> whether the device actually supports write zeroes, i.e. unless the
> device explicitly reports otherwise, it assumes it does at first.
> 
> Therefore before we pile up bios that would fail at the end, we try
> the command/request once, as not doing so could trigger quite a
> disaster in at least certain cases. For example, the host controller
> can be messed up entirely when one does `blkdiscard -z` a UAS drive.
> 
> Signed-off-by: Tom Yan <tom.ty89@gmail.com>
> ---
>  block/blk-lib.c | 14 +++++++++++++-
>  1 file changed, 13 insertions(+), 1 deletion(-)
> 
> diff --git a/block/blk-lib.c b/block/blk-lib.c
> index e90614fd8d6a..c1e9388a8fb8 100644
> --- a/block/blk-lib.c
> +++ b/block/blk-lib.c
> @@ -250,6 +250,7 @@ static int __blkdev_issue_write_zeroes(struct block_device *bdev,
>  	struct bio *bio = *biop;
>  	unsigned int max_write_zeroes_sectors;
>  	struct request_queue *q = bdev_get_queue(bdev);
> +	int i = 0;
>  
>  	if (!q)
>  		return -ENXIO;
> @@ -264,7 +265,17 @@ static int __blkdev_issue_write_zeroes(struct block_device *bdev,
>  		return -EOPNOTSUPP;
>  
>  	while (nr_sects) {
> -		bio = blk_next_bio(bio, 0, gfp_mask);
> +		if (i != 1) {
> +			bio = blk_next_bio(bio, 0, gfp_mask);
> +		} else {
> +			submit_bio_wait(bio);
> +			bio_put(bio);
> +
> +			if (bdev_write_zeroes_sectors(bdev) == 0)
> +				return -EOPNOTSUPP;
> +			else

This means you now massively slow down, say, nvme operations by adding
a wait.  If at all, we need a "maybe supports write zeroes" flag and
should only do this if the driver hasn't decided yet whether write
zeroes is actually supported.


* Re: [PATCH 1/3] block: try one write zeroes request before going further
  2020-12-07 13:36 ` Christoph Hellwig
@ 2020-12-08 12:48   ` Tom Yan
  2020-12-09 17:51     ` Christoph Hellwig
  0 siblings, 1 reply; 23+ messages in thread
From: Tom Yan @ 2020-12-08 12:48 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-block, linux-scsi

You mean like submit_bio_wait() is a blocking operation?

On Mon, 7 Dec 2020 at 21:36, Christoph Hellwig <hch@infradead.org> wrote:
>
> On Sun, Dec 06, 2020 at 01:53:30PM +0800, Tom Yan wrote:
> > At least the SCSI disk driver is "benevolent" when it tries to decide
> > whether the device actually supports write zeroes, i.e. unless the
> > device explicitly reports otherwise, it assumes it does at first.
> >
> > Therefore before we pile up bios that would fail at the end, we try
> > the command/request once, as not doing so could trigger quite a
> > disaster in at least certain cases. For example, the host controller
> > can be messed up entirely when one does `blkdiscard -z` a UAS drive.
> >
> > Signed-off-by: Tom Yan <tom.ty89@gmail.com>
> > ---
> >  block/blk-lib.c | 14 +++++++++++++-
> >  1 file changed, 13 insertions(+), 1 deletion(-)
> >
> > diff --git a/block/blk-lib.c b/block/blk-lib.c
> > index e90614fd8d6a..c1e9388a8fb8 100644
> > --- a/block/blk-lib.c
> > +++ b/block/blk-lib.c
> > @@ -250,6 +250,7 @@ static int __blkdev_issue_write_zeroes(struct block_device *bdev,
> >       struct bio *bio = *biop;
> >       unsigned int max_write_zeroes_sectors;
> >       struct request_queue *q = bdev_get_queue(bdev);
> > +     int i = 0;
> >
> >       if (!q)
> >               return -ENXIO;
> > @@ -264,7 +265,17 @@ static int __blkdev_issue_write_zeroes(struct block_device *bdev,
> >               return -EOPNOTSUPP;
> >
> >       while (nr_sects) {
> > -             bio = blk_next_bio(bio, 0, gfp_mask);
> > +             if (i != 1) {
> > +                     bio = blk_next_bio(bio, 0, gfp_mask);
> > +             } else {
> > +                     submit_bio_wait(bio);
> > +                     bio_put(bio);
> > +
> > +                     if (bdev_write_zeroes_sectors(bdev) == 0)
> > +                             return -EOPNOTSUPP;
> > +                     else
>
> This means you now massively slow down, say, nvme operations by adding
> a wait.  If at all, we need a "maybe supports write zeroes" flag and
> should only do this if the driver hasn't decided yet whether write
> zeroes is actually supported.


* Re: [PATCH 3/3] block: set REQ_PREFLUSH to the final bio from __blkdev_issue_zero_pages()
  2020-12-06 16:05           ` Hannes Reinecke
@ 2020-12-08 12:51             ` Tom Yan
  0 siblings, 0 replies; 23+ messages in thread
From: Tom Yan @ 2020-12-08 12:51 UTC (permalink / raw)
  To: Hannes Reinecke; +Cc: linux-block, linux-scsi

On Mon, 7 Dec 2020 at 00:05, Hannes Reinecke <hare@suse.de> wrote:
>
> On 12/6/20 3:14 PM, Tom Yan wrote:
> > On Sun, 6 Dec 2020 at 22:05, Hannes Reinecke <hare@suse.de> wrote:
> >>
> >> On 12/6/20 2:32 PM, Tom Yan wrote:
> >>> Why? Did you miss that it is in the condition where
> >>> __blkdev_issue_zero_pages() is called (i.e. it's not WRITE SAME but
> >>> WRITE)? From what I gathered, REQ_PREFLUSH triggers a flush of the
> >>> write-back cache (the one on the device; not sure about dirty pages).
> >>> Wouldn't it be the right thing to do after we performed a series of
> >>> WRITEs (which are more or less purposed to get a drive wiped clean)?
> >>>
> >>
> >> But what makes 'zero_pages' special as compared to, say, WRITE_SAME?
> >> One could use WRITE SAME with '0' content, arriving at pretty much the
> >> same content as using zeroout without unmapping. And neither of them
> >> worries about cache flushing.
> >> Nor should they, IMO.
> >
> > Because we are writing actual pages (just that they are zero and
> > "shared memory" in the system) to the device, instead of triggering a
> > special command (with a specific parameter)?
> >
>
> But these pages are ephemeral, and never visible to the user.

What do you mean by the "user"? What I meant was: since it's no
different from a "normal" write operation, the zero pages will go to
the device's volatile write cache.

>
> >>
> >> These are 'native' block layer calls, providing abstract accesses to
> >> hardware functionality. If an application wants to use them, it would be
> >> the task of the application to insert a 'flush' if it deems necessary.
> >> (There _is_ blkdev_issue_flush(), after all).
> >
> > Well, my argument would be that the call has the purpose of "wiping",
> > so it should try to "atomically" guarantee that the wipe is synced.
> > It's like a complement to the REQ_SYNC in the final submit_bio_wait().
> >
> That's an assumption.
>
> It would be valid if blkdev_issue_zeroout() would only allow to wipe the
> entire disk. As it stands, it doesn't, and so we shouldn't presume what
> users might want to do with it.

Whether it's an entire disk doesn't matter. It still stands when it's
only a certain range of blocks.

>
> Cheers,
>
> Hannes
> --
> Dr. Hannes Reinecke                Kernel Storage Architect
> hare@suse.de                              +49 911 74053 688
> SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
> HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer


* Re: [PATCH 2/3] block: make __blkdev_issue_zero_pages() less confusing
  2020-12-07 13:34   ` Christoph Hellwig
@ 2020-12-08 12:54     ` Tom Yan
  0 siblings, 0 replies; 23+ messages in thread
From: Tom Yan @ 2020-12-08 12:54 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-block, linux-scsi

Sure. It's only for code clarity, so definitely no "major benefit".

On Mon, 7 Dec 2020 at 21:34, Christoph Hellwig <hch@infradead.org> wrote:
>
> On Sun, Dec 06, 2020 at 01:53:31PM +0800, Tom Yan wrote:
> > Instead of using the same check for the two layers of loops, count
> > bio pages in the inner loop instead.
>
> I don't really see any major benefit of one version over the other.


* Re: [PATCH 1/3] block: try one write zeroes request before going further
  2020-12-06  5:53 [PATCH 1/3] block: try one write zeroes request before going further Tom Yan
                   ` (3 preceding siblings ...)
  2020-12-07 13:36 ` Christoph Hellwig
@ 2020-12-08 22:46 ` Ewan D. Milne
  4 siblings, 0 replies; 23+ messages in thread
From: Ewan D. Milne @ 2020-12-08 22:46 UTC (permalink / raw)
  To: Tom Yan, linux-block; +Cc: linux-scsi

On Sun, 2020-12-06 at 13:53 +0800, Tom Yan wrote:
> At least the SCSI disk driver is "benevolent" when it tries to decide
> whether the device actually supports write zeroes, i.e. unless the
> device explicitly reports otherwise, it assumes it does at first.
> 
> Therefore before we pile up bios that would fail at the end, we try
> the command/request once, as not doing so could trigger quite a
> disaster in at least certain cases. For example, the host controller
> can be messed up entirely when one does `blkdiscard -z` a UAS drive.
> 

It's not as simple as that.  There are some SCSI devices that support
WRITE ZEROES, but do not return the MAXIMUM WRITE SAME LENGTH in the
block device limits VPD page.  So, some commands might work, and others
might not.  In particular, a commonly-used hypervisor does this.

The sd driver disables the use of write same if certain errors are
returned (INVALID COMMAND w/ INVALID COMMAND OPCODE or INVALID FIELD IN
CDB), but if you do a blkdiscard -z of an entire drive a whole lot of
bios/requests are already queued by the time you get that.

Higher level code checks to see if write zeroes is supported, and
won't queue the requests once it is turned off, but that doesn't
happen until a command fails.  We also check in command setup, see
my earlier patch which deals with spurious blk_update_request errors
if the disablement of write same gets detected there...  I explicitly
did not try to fix that by "testing" with one bio for the reason
Christoph mentions.

-Ewan




* Re: [PATCH 1/3] block: try one write zeroes request before going further
  2020-12-08 12:48   ` Tom Yan
@ 2020-12-09 17:51     ` Christoph Hellwig
  0 siblings, 0 replies; 23+ messages in thread
From: Christoph Hellwig @ 2020-12-09 17:51 UTC (permalink / raw)
  To: Tom Yan; +Cc: Christoph Hellwig, linux-block, linux-scsi

On Tue, Dec 08, 2020 at 08:48:15PM +0800, Tom Yan wrote:
> You mean like submit_bio_wait() is a blocking operation?

Yes, it blocks to wait for the I/O to complete.


end of thread, other threads:[~2020-12-09 17:52 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-12-06  5:53 [PATCH 1/3] block: try one write zeroes request before going further Tom Yan
2020-12-06  5:53 ` [PATCH 2/3] block: make __blkdev_issue_zero_pages() less confusing Tom Yan
2020-12-06 11:29   ` Hannes Reinecke
2020-12-06 13:28     ` Tom Yan
2020-12-07 13:34   ` Christoph Hellwig
2020-12-08 12:54     ` Tom Yan
2020-12-06  5:53 ` [PATCH 3/3] block: set REQ_PREFLUSH to the final bio from __blkdev_issue_zero_pages() Tom Yan
2020-12-06 11:31   ` Hannes Reinecke
2020-12-06 13:32     ` Tom Yan
2020-12-06 14:05       ` Hannes Reinecke
2020-12-06 14:14         ` Tom Yan
2020-12-06 16:05           ` Hannes Reinecke
2020-12-08 12:51             ` Tom Yan
2020-12-07 13:35   ` Christoph Hellwig
2020-12-06 11:25 ` [PATCH 1/3] block: try one write zeroes request before going further Hannes Reinecke
2020-12-06 13:25   ` Tom Yan
2020-12-06 13:56     ` Hannes Reinecke
2020-12-06 14:07       ` Tom Yan
2020-12-06 14:28         ` Tom Yan
2020-12-07 13:36 ` Christoph Hellwig
2020-12-08 12:48   ` Tom Yan
2020-12-09 17:51     ` Christoph Hellwig
2020-12-08 22:46 ` Ewan D. Milne

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).