All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] block: Disable write plugging for zoned block devices
@ 2019-07-09  9:02 Damien Le Moal
  2019-07-09 11:15 ` Johannes Thumshirn
                   ` (3 more replies)
  0 siblings, 4 replies; 13+ messages in thread
From: Damien Le Moal @ 2019-07-09  9:02 UTC (permalink / raw)
  To: linux-block, Jens Axboe; +Cc: Christoph Hellwig, Matias Bjorling

Simultaneously writing to a sequential zone of a zoned block device
from multiple contexts requires mutual exclusion for BIO issuing to
ensure that writes happen sequentially. However, even for a well
behaved user correctly implementing such synchronization, BIO plugging
may interfere and result in BIOs from the different contextx to be
reordered if plugging is done outside of the mutual exclusion section,
e.g. the plug was started by a function higher in the call chain than
the function issuing BIOs.

      Context A                           Context B

   | blk_start_plug()
   | ...
   | seq_write_zone()
     | mutex_lock(zone)
     | submit_bio(bio-0)
     | submit_bio(bio-1)
     | mutex_unlock(zone)
     | return
   | ------------------------------> | seq_write_zone()
  				       | mutex_lock(zone)
				       | submit_bio(bio-2)
				       | mutex_unlock(zone)
   | <------------------------------ |
   | blk_finish_plug()

In the above example, despite the mutex synchronization resulting in the
correct BIO issuing order 0, 1, 2, context A BIOs 0 and 1 end up being
issued after BIO 2 when the plug is released with blk_finish_plug().

To fix this problem, introduce the internal helper function
blk_mq_plug() to access the current context plug, return the current
plug only if the target device is not a zoned block device or if the
BIO to be plugged not a write operation. Otherwise, ignore the plug and
return NULL, resulting is all writes to zoned block device to never be
plugged.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
---
 block/blk-core.c |  2 +-
 block/blk-mq.c   |  2 +-
 block/blk-mq.h   | 12 ++++++++++++
 3 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 8340f69670d8..3957ea6811c3 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -645,7 +645,7 @@ bool blk_attempt_plug_merge(struct request_queue *q, struct bio *bio,
 	struct request *rq;
 	struct list_head *plug_list;
 
-	plug = current->plug;
+	plug = blk_mq_plug(q, bio);
 	if (!plug)
 		return false;
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index ce0f5f4ede70..90be5bb6fa1b 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1969,7 +1969,7 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
 
 	cookie = request_to_qc_t(data.hctx, rq);
 
-	plug = current->plug;
+	plug = blk_mq_plug(q, bio);
 	if (unlikely(is_flush_fua)) {
 		blk_mq_put_ctx(data.ctx);
 		blk_mq_bio_to_request(rq, bio);
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 633a5a77ee8b..d9b1e94b82a4 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -238,4 +238,16 @@ static inline void blk_mq_clear_mq_map(struct blk_mq_queue_map *qmap)
 		qmap->mq_map[cpu] = 0;
 }
 
+static inline struct blk_plug *blk_mq_plug(struct request_queue *q,
+					   struct bio *bio)
+{
+	struct blk_plug *plug = current->plug;
+
+	if (!blk_queue_is_zoned(q) || !op_is_write(bio_op(bio)))
+		return plug;
+
+	/* Zoned block device write case: do not plug the BIO */
+	return NULL;
+}
+
 #endif
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH] block: Disable write plugging for zoned block devices
  2019-07-09  9:02 [PATCH] block: Disable write plugging for zoned block devices Damien Le Moal
@ 2019-07-09 11:15 ` Johannes Thumshirn
  2019-07-09 13:51 ` Bart Van Assche
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 13+ messages in thread
From: Johannes Thumshirn @ 2019-07-09 11:15 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: linux-block, Jens Axboe, Christoph Hellwig, Matias Bjorling

Not sure if I like the new helper or I'd prefer another 'else' in 
blk_mq_make_request().

But all variants I can come up with are ugly and disgusting, so let's got the
route you proposed.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
-- 
Johannes Thumshirn                            SUSE Labs Filesystems
jthumshirn@suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] block: Disable write plugging for zoned block devices
  2019-07-09  9:02 [PATCH] block: Disable write plugging for zoned block devices Damien Le Moal
  2019-07-09 11:15 ` Johannes Thumshirn
@ 2019-07-09 13:51 ` Bart Van Assche
  2019-07-09 14:59   ` Damien Le Moal
  2019-07-09 14:29 ` Ming Lei
  2019-07-10  2:32 ` Bart Van Assche
  3 siblings, 1 reply; 13+ messages in thread
From: Bart Van Assche @ 2019-07-09 13:51 UTC (permalink / raw)
  To: Damien Le Moal, linux-block, Jens Axboe
  Cc: Christoph Hellwig, Matias Bjorling

On 7/9/19 2:02 AM, Damien Le Moal wrote:
> Simultaneously writing to a sequential zone of a zoned block device
> from multiple contexts requires mutual exclusion for BIO issuing to
> ensure that writes happen sequentially. However, even for a well
> behaved user correctly implementing such synchronization, BIO plugging
> may interfere and result in BIOs from the different contextx to be
> reordered if plugging is done outside of the mutual exclusion section,
> e.g. the plug was started by a function higher in the call chain than
> the function issuing BIOs.
> 
>        Context A                           Context B
> 
>     | blk_start_plug()
>     | ...
>     | seq_write_zone()
>       | mutex_lock(zone)
>       | submit_bio(bio-0)
>       | submit_bio(bio-1)
>       | mutex_unlock(zone)
>       | return
>     | ------------------------------> | seq_write_zone()
>    				       | mutex_lock(zone)
> 				       | submit_bio(bio-2)
> 				       | mutex_unlock(zone)
>     | <------------------------------ |
>     | blk_finish_plug()
> 
> In the above example, despite the mutex synchronization resulting in the
> correct BIO issuing order 0, 1, 2, context A BIOs 0 and 1 end up being
> issued after BIO 2 when the plug is released with blk_finish_plug().
> 
> To fix this problem, introduce the internal helper function
> blk_mq_plug() to access the current context plug, return the current
> plug only if the target device is not a zoned block device or if the
> BIO to be plugged not a write operation. Otherwise, ignore the plug and
> return NULL, resulting is all writes to zoned block device to never be
> plugged.

Are there classes of zoned devices for which the plug list is useful? If 
so, have you considered any other approaches, e.g. one plug list per 
request queue instead of one plug list per task in case of zoned devices?

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] block: Disable write plugging for zoned block devices
  2019-07-09  9:02 [PATCH] block: Disable write plugging for zoned block devices Damien Le Moal
  2019-07-09 11:15 ` Johannes Thumshirn
  2019-07-09 13:51 ` Bart Van Assche
@ 2019-07-09 14:29 ` Ming Lei
  2019-07-09 14:47   ` Damien Le Moal
  2019-07-10  2:55   ` Jens Axboe
  2019-07-10  2:32 ` Bart Van Assche
  3 siblings, 2 replies; 13+ messages in thread
From: Ming Lei @ 2019-07-09 14:29 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: linux-block, Jens Axboe, Christoph Hellwig, Matias Bjorling

On Tue, Jul 09, 2019 at 06:02:19PM +0900, Damien Le Moal wrote:
> Simultaneously writing to a sequential zone of a zoned block device
> from multiple contexts requires mutual exclusion for BIO issuing to
> ensure that writes happen sequentially. However, even for a well
> behaved user correctly implementing such synchronization, BIO plugging
> may interfere and result in BIOs from the different contextx to be
> reordered if plugging is done outside of the mutual exclusion section,
> e.g. the plug was started by a function higher in the call chain than
> the function issuing BIOs.
> 
>       Context A                           Context B
> 
>    | blk_start_plug()
>    | ...
>    | seq_write_zone()
>      | mutex_lock(zone)
>      | submit_bio(bio-0)
>      | submit_bio(bio-1)
>      | mutex_unlock(zone)
>      | return
>    | ------------------------------> | seq_write_zone()
>   				       | mutex_lock(zone)
> 				       | submit_bio(bio-2)
> 				       | mutex_unlock(zone)
>    | <------------------------------ |
>    | blk_finish_plug()
> 
> In the above example, despite the mutex synchronization resulting in the
> correct BIO issuing order 0, 1, 2, context A BIOs 0 and 1 end up being
> issued after BIO 2 when the plug is released with blk_finish_plug().

I am wondering how you guarantee that context B is always run after
context A.

> 
> To fix this problem, introduce the internal helper function
> blk_mq_plug() to access the current context plug, return the current
> plug only if the target device is not a zoned block device or if the
> BIO to be plugged not a write operation. Otherwise, ignore the plug and
> return NULL, resulting is all writes to zoned block device to never be
> plugged.

Another candidate approach is to run the following code before
releasing 'zone' lock:

	if (current->plug)
		blk_finish_plug(context->plug)

Then we can fix zone specific issue in zone code only, and avoid generic
blk-core change for zone issue.

Thanks,
Ming

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] block: Disable write plugging for zoned block devices
  2019-07-09 14:29 ` Ming Lei
@ 2019-07-09 14:47   ` Damien Le Moal
  2019-07-10  3:10     ` Ming Lei
  2019-07-10  2:55   ` Jens Axboe
  1 sibling, 1 reply; 13+ messages in thread
From: Damien Le Moal @ 2019-07-09 14:47 UTC (permalink / raw)
  To: Ming Lei; +Cc: linux-block, Jens Axboe, Christoph Hellwig, Matias Bjorling

Hi Ming,

On 2019/07/09 23:29, Ming Lei wrote:
> On Tue, Jul 09, 2019 at 06:02:19PM +0900, Damien Le Moal wrote:
>> Simultaneously writing to a sequential zone of a zoned block device
>> from multiple contexts requires mutual exclusion for BIO issuing to
>> ensure that writes happen sequentially. However, even for a well
>> behaved user correctly implementing such synchronization, BIO plugging
>> may interfere and result in BIOs from the different contextx to be
>> reordered if plugging is done outside of the mutual exclusion section,
>> e.g. the plug was started by a function higher in the call chain than
>> the function issuing BIOs.
>>
>>       Context A                           Context B
>>
>>    | blk_start_plug()
>>    | ...
>>    | seq_write_zone()
>>      | mutex_lock(zone)
>>      | submit_bio(bio-0)
>>      | submit_bio(bio-1)
>>      | mutex_unlock(zone)
>>      | return
>>    | ------------------------------> | seq_write_zone()
>>   				       | mutex_lock(zone)
>> 				       | submit_bio(bio-2)
>> 				       | mutex_unlock(zone)
>>    | <------------------------------ |
>>    | blk_finish_plug()
>>
>> In the above example, despite the mutex synchronization resulting in the
>> correct BIO issuing order 0, 1, 2, context A BIOs 0 and 1 end up being
>> issued after BIO 2 when the plug is released with blk_finish_plug().
> 
> I am wondering how you guarantee that context B is always run after
> context A.

My example was a little too oversimplified. Think of a file system allocating
blocks sequentially and issuing page I/Os for the allocated blocks sequentially.
The typical sequence is:

mutex_lock(zone)
alloc_block_extent(zone)
for all blocks in the extent
	submit_bio()
mutex_unlock(zone)

This way, it does not matter which context gets the lock first, all write BIOs
for the zone remain sequential. The problem with plugs as explained above is
that if the plug start/finish is not within the zone lock, reordering can happen
for the 2 sequences of BIOs issued by the 2 contexts.

We hit this problem with btrfs writepages (page writeback) where plugging is
done before the above sequence execution, in the caller function of the page
writeback processing, resulting in unaligned write errors.

>> To fix this problem, introduce the internal helper function
>> blk_mq_plug() to access the current context plug, return the current
>> plug only if the target device is not a zoned block device or if the
>> BIO to be plugged not a write operation. Otherwise, ignore the plug and
>> return NULL, resulting is all writes to zoned block device to never be
>> plugged.
> 
> Another candidate approach is to run the following code before
> releasing 'zone' lock:
> 
> 	if (current->plug)
> 		blk_finish_plug(context->plug)
> 
> Then we can fix zone specific issue in zone code only, and avoid generic
> blk-core change for zone issue.

Yes indeed, that would work too. But this patch is precisely to avoid having to
add such code and simplify implementing support for zoned block device in
existing code. Furthermore, plugging for writes to sequential zones has no real
value because mq-deadline will dispatch at most one write per zone. So writes
for a single zone tend to accumulate in the scheduler queue, and that creates
plenty of opportunities for merging small sequential writes (e.g. file system
page BIOs).

If you think this patch is really not appropriate, we can still address the
problem case by case in the support we add for zoned devices. But again, a
generic solution makes things simpler I think.

Best regards.

-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] block: Disable write plugging for zoned block devices
  2019-07-09 13:51 ` Bart Van Assche
@ 2019-07-09 14:59   ` Damien Le Moal
  0 siblings, 0 replies; 13+ messages in thread
From: Damien Le Moal @ 2019-07-09 14:59 UTC (permalink / raw)
  To: Bart Van Assche, linux-block, Jens Axboe
  Cc: Christoph Hellwig, Matias Bjorling

On 2019/07/09 22:51, Bart Van Assche wrote:
> On 7/9/19 2:02 AM, Damien Le Moal wrote:
>> Simultaneously writing to a sequential zone of a zoned block device
>> from multiple contexts requires mutual exclusion for BIO issuing to
>> ensure that writes happen sequentially. However, even for a well
>> behaved user correctly implementing such synchronization, BIO plugging
>> may interfere and result in BIOs from the different contextx to be
>> reordered if plugging is done outside of the mutual exclusion section,
>> e.g. the plug was started by a function higher in the call chain than
>> the function issuing BIOs.
>>
>>        Context A                           Context B
>>
>>     | blk_start_plug()
>>     | ...
>>     | seq_write_zone()
>>       | mutex_lock(zone)
>>       | submit_bio(bio-0)
>>       | submit_bio(bio-1)
>>       | mutex_unlock(zone)
>>       | return
>>     | ------------------------------> | seq_write_zone()
>>    				       | mutex_lock(zone)
>> 				       | submit_bio(bio-2)
>> 				       | mutex_unlock(zone)
>>     | <------------------------------ |
>>     | blk_finish_plug()
>>
>> In the above example, despite the mutex synchronization resulting in the
>> correct BIO issuing order 0, 1, 2, context A BIOs 0 and 1 end up being
>> issued after BIO 2 when the plug is released with blk_finish_plug().
>>
>> To fix this problem, introduce the internal helper function
>> blk_mq_plug() to access the current context plug, return the current
>> plug only if the target device is not a zoned block device or if the
>> BIO to be plugged not a write operation. Otherwise, ignore the plug and
>> return NULL, resulting is all writes to zoned block device to never be
>> plugged.
> 
> Are there classes of zoned devices for which the plug list is useful? If 
> so, have you considered any other approaches, e.g. one plug list per 
> request queue instead of one plug list per task in case of zoned devices?

Plugging for writes to zoned block devices is not really useful at all. The
reason is that for any user of the disk executing requests at a queue depth
larger than 1, to preserve write ordering, mq-deadline must be used. With this
scheduler, zone write locking will prevent dispatching more than one write
request per zone at any time, resulting in the accumulation of sequential writes
for a zone in the scheduler queue. This creates plenty of opportunities for
merging small (i.e. single page) write BIOs with preceding pending requests,
which is exactly the intent of plugging in the first place.

A per request queue plug list would work, but it would require a single lock,
going against blk-mq design principle. Such method would also result in a lot
more changes for no real gain at all (for the reason explained above).
Performance-wise, simply disabling per context plugging for writes only has no
measurable impact and is far simpler I think.

Best regards.

-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] block: Disable write plugging for zoned block devices
  2019-07-09  9:02 [PATCH] block: Disable write plugging for zoned block devices Damien Le Moal
                   ` (2 preceding siblings ...)
  2019-07-09 14:29 ` Ming Lei
@ 2019-07-10  2:32 ` Bart Van Assche
  2019-07-10  2:36   ` Damien Le Moal
  3 siblings, 1 reply; 13+ messages in thread
From: Bart Van Assche @ 2019-07-10  2:32 UTC (permalink / raw)
  To: Damien Le Moal, linux-block, Jens Axboe
  Cc: Christoph Hellwig, Matias Bjorling

On 7/9/19 2:02 AM, Damien Le Moal wrote:
> +static inline struct blk_plug *blk_mq_plug(struct request_queue *q,
> +					   struct bio *bio)
> +{
> +	struct blk_plug *plug = current->plug;
> +
> +	if (!blk_queue_is_zoned(q) || !op_is_write(bio_op(bio)))
> +		return plug;
> +
> +	/* Zoned block device write case: do not plug the BIO */
> +	return NULL;
> +}

Can the 'plug' variable be left out from this function and can 'return 
plug' be changed into 'return current->plug'? Anyway:

Reviewed-by: Bart Van Assche <bvanassche@acm.org>


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] block: Disable write plugging for zoned block devices
  2019-07-10  2:32 ` Bart Van Assche
@ 2019-07-10  2:36   ` Damien Le Moal
  0 siblings, 0 replies; 13+ messages in thread
From: Damien Le Moal @ 2019-07-10  2:36 UTC (permalink / raw)
  To: Bart Van Assche, linux-block, Jens Axboe
  Cc: Christoph Hellwig, Matias Bjorling

On 7/10/19 11:32 AM, Bart Van Assche wrote:
> On 7/9/19 2:02 AM, Damien Le Moal wrote:
>> +static inline struct blk_plug *blk_mq_plug(struct request_queue *q,
>> +					   struct bio *bio)
>> +{
>> +	struct blk_plug *plug = current->plug;
>> +
>> +	if (!blk_queue_is_zoned(q) || !op_is_write(bio_op(bio)))
>> +		return plug;
>> +
>> +	/* Zoned block device write case: do not plug the BIO */
>> +	return NULL;
>> +}
> 
> Can the 'plug' variable be left out from this function and can 'return 
> plug' be changed into 'return current->plug'? Anyway:

Sure, that would be cleaner. Will Send a V2.

> 
> Reviewed-by: Bart Van Assche <bvanassche@acm.org>

Thanks. Can I add this to the V2 or would you prefer to see the revised
patch first ?

Best regards.

-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] block: Disable write plugging for zoned block devices
  2019-07-09 14:29 ` Ming Lei
  2019-07-09 14:47   ` Damien Le Moal
@ 2019-07-10  2:55   ` Jens Axboe
  2019-07-10  2:59     ` Damien Le Moal
  1 sibling, 1 reply; 13+ messages in thread
From: Jens Axboe @ 2019-07-10  2:55 UTC (permalink / raw)
  To: Ming Lei, Damien Le Moal; +Cc: linux-block, Christoph Hellwig, Matias Bjorling

On 7/9/19 8:29 AM, Ming Lei wrote:
> On Tue, Jul 09, 2019 at 06:02:19PM +0900, Damien Le Moal wrote:
>> Simultaneously writing to a sequential zone of a zoned block device
>> from multiple contexts requires mutual exclusion for BIO issuing to
>> ensure that writes happen sequentially. However, even for a well
>> behaved user correctly implementing such synchronization, BIO plugging
>> may interfere and result in BIOs from the different contextx to be
>> reordered if plugging is done outside of the mutual exclusion section,
>> e.g. the plug was started by a function higher in the call chain than
>> the function issuing BIOs.
>>
>>        Context A                           Context B
>>
>>     | blk_start_plug()
>>     | ...
>>     | seq_write_zone()
>>       | mutex_lock(zone)
>>       | submit_bio(bio-0)
>>       | submit_bio(bio-1)
>>       | mutex_unlock(zone)
>>       | return
>>     | ------------------------------> | seq_write_zone()
>>    				       | mutex_lock(zone)
>> 				       | submit_bio(bio-2)
>> 				       | mutex_unlock(zone)
>>     | <------------------------------ |
>>     | blk_finish_plug()
>>
>> In the above example, despite the mutex synchronization resulting in the
>> correct BIO issuing order 0, 1, 2, context A BIOs 0 and 1 end up being
>> issued after BIO 2 when the plug is released with blk_finish_plug().
> 
> I am wondering how you guarantee that context B is always run after
> context A.
> 
>>
>> To fix this problem, introduce the internal helper function
>> blk_mq_plug() to access the current context plug, return the current
>> plug only if the target device is not a zoned block device or if the
>> BIO to be plugged not a write operation. Otherwise, ignore the plug and
>> return NULL, resulting is all writes to zoned block device to never be
>> plugged.
> 
> Another candidate approach is to run the following code before
> releasing 'zone' lock:
> 
> 	if (current->plug)
> 		blk_finish_plug(context->plug)
> 
> Then we can fix zone specific issue in zone code only, and avoid generic
> blk-core change for zone issue.

I prefer that to the existing solution as well.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] block: Disable write plugging for zoned block devices
  2019-07-10  2:55   ` Jens Axboe
@ 2019-07-10  2:59     ` Damien Le Moal
  2019-07-10 13:59       ` Jens Axboe
  0 siblings, 1 reply; 13+ messages in thread
From: Damien Le Moal @ 2019-07-10  2:59 UTC (permalink / raw)
  To: Jens Axboe, Ming Lei; +Cc: linux-block, Christoph Hellwig, Matias Bjorling

On 7/10/19 11:55 AM, Jens Axboe wrote:
> On 7/9/19 8:29 AM, Ming Lei wrote:
>> On Tue, Jul 09, 2019 at 06:02:19PM +0900, Damien Le Moal wrote:
>>> Simultaneously writing to a sequential zone of a zoned block device
>>> from multiple contexts requires mutual exclusion for BIO issuing to
>>> ensure that writes happen sequentially. However, even for a well
>>> behaved user correctly implementing such synchronization, BIO plugging
>>> may interfere and result in BIOs from the different contextx to be
>>> reordered if plugging is done outside of the mutual exclusion section,
>>> e.g. the plug was started by a function higher in the call chain than
>>> the function issuing BIOs.
>>>
>>>        Context A                           Context B
>>>
>>>     | blk_start_plug()
>>>     | ...
>>>     | seq_write_zone()
>>>       | mutex_lock(zone)
>>>       | submit_bio(bio-0)
>>>       | submit_bio(bio-1)
>>>       | mutex_unlock(zone)
>>>       | return
>>>     | ------------------------------> | seq_write_zone()
>>>    				       | mutex_lock(zone)
>>> 				       | submit_bio(bio-2)
>>> 				       | mutex_unlock(zone)
>>>     | <------------------------------ |
>>>     | blk_finish_plug()
>>>
>>> In the above example, despite the mutex synchronization resulting in the
>>> correct BIO issuing order 0, 1, 2, context A BIOs 0 and 1 end up being
>>> issued after BIO 2 when the plug is released with blk_finish_plug().
>>
>> I am wondering how you guarantee that context B is always run after
>> context A.
>>
>>>
>>> To fix this problem, introduce the internal helper function
>>> blk_mq_plug() to access the current context plug, return the current
>>> plug only if the target device is not a zoned block device or if the
>>> BIO to be plugged not a write operation. Otherwise, ignore the plug and
>>> return NULL, resulting is all writes to zoned block device to never be
>>> plugged.
>>
>> Another candidate approach is to run the following code before
>> releasing 'zone' lock:
>>
>> 	if (current->plug)
>> 		blk_finish_plug(context->plug)
>>
>> Then we can fix zone specific issue in zone code only, and avoid generic
>> blk-core change for zone issue.
> 
> I prefer that to the existing solution as well.

My apologies, you lost me: do you mean that you prefer Ming's suggestion
and force FS or dm users to manually unplug in the case of zoned block
devices ?

Thanks.

-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] block: Disable write plugging for zoned block devices
  2019-07-09 14:47   ` Damien Le Moal
@ 2019-07-10  3:10     ` Ming Lei
  2019-07-10  4:14       ` Damien Le Moal
  0 siblings, 1 reply; 13+ messages in thread
From: Ming Lei @ 2019-07-10  3:10 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: linux-block, Jens Axboe, Christoph Hellwig, Matias Bjorling

On Tue, Jul 09, 2019 at 02:47:12PM +0000, Damien Le Moal wrote:
> Hi Ming,
> 
> On 2019/07/09 23:29, Ming Lei wrote:
> > On Tue, Jul 09, 2019 at 06:02:19PM +0900, Damien Le Moal wrote:
> >> Simultaneously writing to a sequential zone of a zoned block device
> >> from multiple contexts requires mutual exclusion for BIO issuing to
> >> ensure that writes happen sequentially. However, even for a well
> >> behaved user correctly implementing such synchronization, BIO plugging
> >> may interfere and result in BIOs from the different contextx to be
> >> reordered if plugging is done outside of the mutual exclusion section,
> >> e.g. the plug was started by a function higher in the call chain than
> >> the function issuing BIOs.
> >>
> >>       Context A                           Context B
> >>
> >>    | blk_start_plug()
> >>    | ...
> >>    | seq_write_zone()
> >>      | mutex_lock(zone)
> >>      | submit_bio(bio-0)
> >>      | submit_bio(bio-1)
> >>      | mutex_unlock(zone)
> >>      | return
> >>    | ------------------------------> | seq_write_zone()
> >>   				       | mutex_lock(zone)
> >> 				       | submit_bio(bio-2)
> >> 				       | mutex_unlock(zone)
> >>    | <------------------------------ |
> >>    | blk_finish_plug()
> >>
> >> In the above example, despite the mutex synchronization resulting in the
> >> correct BIO issuing order 0, 1, 2, context A BIOs 0 and 1 end up being
> >> issued after BIO 2 when the plug is released with blk_finish_plug().
> > 
> > I am wondering how you guarantee that context B is always run after
> > context A.
> 
> My example was a little too oversimplified. Think of a file system allocating
> blocks sequentially and issuing page I/Os for the allocated blocks sequentially.
> The typical sequence is:
> 
> mutex_lock(zone)
> alloc_block_extent(zone)
> for all blocks in the extent
> 	submit_bio()
> mutex_unlock(zone)
> 
> This way, it does not matter which context gets the lock first, all write BIOs
> for the zone remain sequential. The problem with plugs as explained above is

But wrt. the example in the commit log, it does matter which context gets the lock
first, and it implies that context A has to run seq_write_zone() first,
because you mentioned bio-2 has to be issued after bio-0 and bio-1.

If there is 3rd context which is holding the lock, then either context A or
context B can win in getting the lock first. So looks the zone lock itself
isn't enough for maintaining the IO order. But that may not be related
with this patch.

Also seems there is issue with REQ_NOWAIT for zone support, for example,
context A may see out-of-request and return earlier, however context B
may get request and move on.

> that if the plug start/finish is not within the zone lock, reordering can happen
> for the 2 sequences of BIOs issued by the 2 contexts.
> 
> We hit this problem with btrfs writepages (page writeback) where plugging is
> done before the above sequence execution, in the caller function of the page
> writeback processing, resulting in unaligned write errors.
> 
> >> To fix this problem, introduce the internal helper function
> >> blk_mq_plug() to access the current context plug, return the current
> >> plug only if the target device is not a zoned block device or if the
> >> BIO to be plugged not a write operation. Otherwise, ignore the plug and
> >> return NULL, resulting is all writes to zoned block device to never be
> >> plugged.
> > 
> > Another candidate approach is to run the following code before
> > releasing 'zone' lock:
> > 
> > 	if (current->plug)
> > 		blk_finish_plug(context->plug)
> > 
> > Then we can fix zone specific issue in zone code only, and avoid generic
> > blk-core change for zone issue.
> 
> Yes indeed, that would work too. But this patch is precisely to avoid having to
> add such code and simplify implementing support for zoned block device in
> existing code. Furthermore, plugging for writes to sequential zones has no real
> value because mq-deadline will dispatch at most one write per zone. So writes
> for a single zone tend to accumulate in the scheduler queue, and that creates
> plenty of opportunities for merging small sequential writes (e.g. file system
> page BIOs).
> 
> If you think this patch is really not appropriate, we can still address the
> problem case by case in the support we add for zoned devices. But again, a
> generic solution makes things simpler I think.

OK, then I am fine with this simple generic approach.

Thanks,
Ming

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] block: Disable write plugging for zoned block devices
  2019-07-10  3:10     ` Ming Lei
@ 2019-07-10  4:14       ` Damien Le Moal
  0 siblings, 0 replies; 13+ messages in thread
From: Damien Le Moal @ 2019-07-10  4:14 UTC (permalink / raw)
  To: Ming Lei; +Cc: linux-block, Jens Axboe, Christoph Hellwig, Matias Bjorling

On 7/10/19 12:10 PM, Ming Lei wrote:
> On Tue, Jul 09, 2019 at 02:47:12PM +0000, Damien Le Moal wrote:
>> Hi Ming,
>>
>> On 2019/07/09 23:29, Ming Lei wrote:
>>> On Tue, Jul 09, 2019 at 06:02:19PM +0900, Damien Le Moal wrote:
>>>> Simultaneously writing to a sequential zone of a zoned block device
>>>> from multiple contexts requires mutual exclusion for BIO issuing to
>>>> ensure that writes happen sequentially. However, even for a well
>>>> behaved user correctly implementing such synchronization, BIO plugging
>>>> may interfere and result in BIOs from the different contextx to be
>>>> reordered if plugging is done outside of the mutual exclusion section,
>>>> e.g. the plug was started by a function higher in the call chain than
>>>> the function issuing BIOs.
>>>>
>>>>       Context A                           Context B
>>>>
>>>>    | blk_start_plug()
>>>>    | ...
>>>>    | seq_write_zone()
>>>>      | mutex_lock(zone)
>>>>      | submit_bio(bio-0)
>>>>      | submit_bio(bio-1)
>>>>      | mutex_unlock(zone)
>>>>      | return
>>>>    | ------------------------------> | seq_write_zone()
>>>>   				       | mutex_lock(zone)
>>>> 				       | submit_bio(bio-2)
>>>> 				       | mutex_unlock(zone)
>>>>    | <------------------------------ |
>>>>    | blk_finish_plug()
>>>>
>>>> In the above example, despite the mutex synchronization resulting in the
>>>> correct BIO issuing order 0, 1, 2, context A BIOs 0 and 1 end up being
>>>> issued after BIO 2 when the plug is released with blk_finish_plug().
>>>
>>> I am wondering how you guarantee that context B is always run after
>>> context A.
>>
>> My example was a little too oversimplified. Think of a file system allocating
>> blocks sequentially and issuing page I/Os for the allocated blocks sequentially.
>> The typical sequence is:
>>
>> mutex_lock(zone)
>> alloc_block_extent(zone)
>> for all blocks in the extent
>> 	submit_bio()
>> mutex_unlock(zone)
>>
>> This way, it does not matter which context gets the lock first, all write BIOs
>> for the zone remain sequential. The problem with plugs as explained above is
> 
> But wrt. the example in the commit log, it does matter which context gets the lock
> first, and it implies that context A has to run seq_write_zone() first,
> because you mentioned bio-2 has to be issued after bio-0 and bio-1.
> 
> If there is 3rd context which is holding the lock, then either context A or
> context B can win in getting the lock first. So looks the zone lock itself
> isn't enough for maintaining the IO order. But that may not be related
> with this patch.

For a raw block device driver, the zone lock is enough to maintain
sequential write sequence. This is not visible in my example, because it
is too simplistic. My apologies for the confusion.

The reason is that the target sector of any zone write BIO must always
be set to the end sector of the last issued write BIO for the zone. A
more detailed and correct typical sequence for writing to a zone for a
raw block device driver (e.g. a dm target) is:

seq_write_zone() {

	mutex_lock(zone)

	/* bio-0 */
	bio = bio_alloc()
	bio->bi_iter.bi_sector = zone->wp
	zone->wp += bio_sectors(bio)
	submit_bio(bio)

	/* bio-1 */
	bio = bio_alloc()
	bio->bi_iter.bi_sector = zone->wp
	zone->wp += bio_sectors(bio)
	submit_bio(bio)

	...

	mutex_unlock(zone)
}

Doing so, multiple contexts serialized with the zone mutex can keep
writing sequentially, no matter the number of BIOs they issue and no
matter the order in which they grab the zone lock. Note that here, the
zone write pointer is a "soft" write pointer, not the actual device
managed write pointer, because this latter WP is updated only on
completion of the write commands, so visible to the host only on
completion of the write BIOs. The "soft" write pointer is thus always
equal to or in advance of the device hard WP. The soft WP must be
re-synced to the hard WP in case of failed writes.

For a file system, the zone hard WP is used as a starting point for
block allocation. BIO issuing can then simply use the allocated extent
sector directly instead of the zone soft write pointer. The block
allocation code will manage the zone soft WP and do the resync with the
device hard WP in case of write error.

> Also seems there is issue with REQ_NOWAIT for zone support, for example,
> context A may see out-of-request and return earlier, however context B
> may get request and move on.

Yes, but context B will move on from the last successfully written
sector so sequential writes can still go on. It is the responsibility of
the user code to deal with failed writes and how to recover from them.

If REQ_NOWAIT is used for a BIO and causes submit_bio() to fail (
BLK_QC_T_NONE returned) in one context, that context may retry until it
succeeds and increment the soft WP or bail out without incrementing the
zone soft WP. In both cases, other contexts may simply resume trying to
write from the still valid soft WP. Any number of methods exist for
dealing with this. All the responsibility of the user (fs or dm) because
sequential write issuing must in the first place be guaranteed by the
users. In this regard, the generic block layer is fine.

>> that if the plug start/finish is not within the zone lock, reordering can happen
>> for the 2 sequences of BIOs issued by the 2 contexts.
>>
>> We hit this problem with btrfs writepages (page writeback) where plugging is
>> done before the above sequence execution, in the caller function of the page
>> writeback processing, resulting in unaligned write errors.
>>
>>>> To fix this problem, introduce the internal helper function
>>>> blk_mq_plug() to access the current context plug, return the current
>>>> plug only if the target device is not a zoned block device or if the
>>>> BIO to be plugged not a write operation. Otherwise, ignore the plug and
>>>> return NULL, resulting is all writes to zoned block device to never be
>>>> plugged.
>>>
>>> Another candidate approach is to run the following code before
>>> releasing 'zone' lock:
>>>
>>> 	if (current->plug)
>>> 		blk_finish_plug(context->plug)
>>>
>>> Then we can fix zone specific issue in zone code only, and avoid generic
>>> blk-core change for zone issue.
>>
>> Yes indeed, that would work too. But this patch is precisely to avoid having to
>> add such code and simplify implementing support for zoned block device in
>> existing code. Furthermore, plugging for writes to sequential zones has no real
>> value because mq-deadline will dispatch at most one write per zone. So writes
>> for a single zone tend to accumulate in the scheduler queue, and that creates
>> plenty of opportunities for merging small sequential writes (e.g. file system
>> page BIOs).
>>
>> If you think this patch is really not appropriate, we can still address the
>> problem case by case in the support we add for zoned devices. But again, a
>> generic solution makes things simpler I think.
> 
> OK, then I am fine with this simple generic approach.

Thanks.

Best regards.

-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] block: Disable write plugging for zoned block devices
  2019-07-10  2:59     ` Damien Le Moal
@ 2019-07-10 13:59       ` Jens Axboe
  0 siblings, 0 replies; 13+ messages in thread
From: Jens Axboe @ 2019-07-10 13:59 UTC (permalink / raw)
  To: Damien Le Moal, Ming Lei; +Cc: linux-block, Christoph Hellwig, Matias Bjorling

On 7/9/19 8:59 PM, Damien Le Moal wrote:
> On 7/10/19 11:55 AM, Jens Axboe wrote:
>> On 7/9/19 8:29 AM, Ming Lei wrote:
>>> On Tue, Jul 09, 2019 at 06:02:19PM +0900, Damien Le Moal wrote:
>>>> Simultaneously writing to a sequential zone of a zoned block device
>>>> from multiple contexts requires mutual exclusion for BIO issuing to
>>>> ensure that writes happen sequentially. However, even for a well
>>>> behaved user correctly implementing such synchronization, BIO plugging
>>>> may interfere and result in BIOs from the different contextx to be
>>>> reordered if plugging is done outside of the mutual exclusion section,
>>>> e.g. the plug was started by a function higher in the call chain than
>>>> the function issuing BIOs.
>>>>
>>>>         Context A                           Context B
>>>>
>>>>      | blk_start_plug()
>>>>      | ...
>>>>      | seq_write_zone()
>>>>        | mutex_lock(zone)
>>>>        | submit_bio(bio-0)
>>>>        | submit_bio(bio-1)
>>>>        | mutex_unlock(zone)
>>>>        | return
>>>>      | ------------------------------> | seq_write_zone()
>>>>     				       | mutex_lock(zone)
>>>> 				       | submit_bio(bio-2)
>>>> 				       | mutex_unlock(zone)
>>>>      | <------------------------------ |
>>>>      | blk_finish_plug()
>>>>
>>>> In the above example, despite the mutex synchronization resulting in the
>>>> correct BIO issuing order 0, 1, 2, context A BIOs 0 and 1 end up being
>>>> issued after BIO 2 when the plug is released with blk_finish_plug().
>>>
>>> I am wondering how you guarantee that context B is always run after
>>> context A.
>>>
>>>>
>>>> To fix this problem, introduce the internal helper function
>>>> blk_mq_plug() to access the current context plug, return the current
>>>> plug only if the target device is not a zoned block device or if the
>>>> BIO to be plugged not a write operation. Otherwise, ignore the plug and
>>>> return NULL, resulting is all writes to zoned block device to never be
>>>> plugged.
>>>
>>> Another candidate approach is to run the following code before
>>> releasing 'zone' lock:
>>>
>>> 	if (current->plug)
>>> 		blk_finish_plug(context->plug)
>>>
>>> Then we can fix zone specific issue in zone code only, and avoid generic
>>> blk-core change for zone issue.
>>
>> I prefer that to the existing solution as well.
> 
> My apologies, you lost me: do you mean that you prefer Ming's suggestion
> and force FS or dm users to manually unplug in the case of zoned block
> devices ?

I take that back, I thought we could do it manually in the zoned code
while dealing with the locking, but I don't think that is feasible.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2019-07-10 13:59 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-07-09  9:02 [PATCH] block: Disable write plugging for zoned block devices Damien Le Moal
2019-07-09 11:15 ` Johannes Thumshirn
2019-07-09 13:51 ` Bart Van Assche
2019-07-09 14:59   ` Damien Le Moal
2019-07-09 14:29 ` Ming Lei
2019-07-09 14:47   ` Damien Le Moal
2019-07-10  3:10     ` Ming Lei
2019-07-10  4:14       ` Damien Le Moal
2019-07-10  2:55   ` Jens Axboe
2019-07-10  2:59     ` Damien Le Moal
2019-07-10 13:59       ` Jens Axboe
2019-07-10  2:32 ` Bart Van Assche
2019-07-10  2:36   ` Damien Le Moal

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.