* [PATCH v3 RESEND] iomap: set REQ_NOWAIT according to IOCB_NOWAIT in Direct IO
@ 2020-12-04  9:44 Hao Xu
  2020-12-04 11:44 ` Pavel Begunkov
  2020-12-07  2:21 ` Dave Chinner
  0 siblings, 2 replies; 15+ messages in thread
From: Hao Xu @ 2020-12-04  9:44 UTC (permalink / raw)
  To: Alexander Viro, Christoph Hellwig
  Cc: Darrick J. Wong, linux-fsdevel, Jeffle Xu, Konstantin Khlebnikov,
	Jens Axboe, io-uring, Joseph Qi

Currently, IOCB_NOWAIT is ignored in Direct IO: REQ_NOWAIT is only set
when IOCB_HIPRI is set. But REQ_NOWAIT should be set as well when
IOCB_NOWAIT is set.

Suggested-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
---

Hi all,
I tested fio io_uring direct reads of a file on an ext4 filesystem on
an NVMe SSD. I found that IOCB_NOWAIT is ignored in the iomap layer,
which means REQ_NOWAIT is not set in bio->bi_opf. This turns a nowait
IO into a normal IO. Since I'm new to iomap and the block layer, I
sincerely ask for your opinions in case I have misunderstood the code,
which is quite likely. :)
The example I use is io_uring direct randread: the first attempt is
issued with IOCB_NOWAIT but not IOCB_HIPRI, and the IOCB_NOWAIT is
ignored in the block layer, which I think is not the intended behaviour.

Konstantin reported this issue back in May
2020 (https://www.spinics.net/lists/linux-block/msg53275.html), so I add
his signature here, and Jeffle's as well since he gave me some help.
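
For reference, the fio test is roughly equivalent to the following
minimal liburing program (just a sketch; the path, block size and queue
depth are placeholders, fio was what I actually ran):

/* build: gcc -O2 -o dio_read dio_read.c -luring */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <liburing.h>

int main(void)
{
	struct io_uring ring;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	void *buf;
	int fd;

	/* placeholder path: any file on the ext4-on-nvme mount will do */
	fd = open("/mnt/test/file", O_RDONLY | O_DIRECT);
	if (fd < 0 || posix_memalign(&buf, 4096, 4096) ||
	    io_uring_queue_init(8, &ring, 0) < 0)
		return 1;

	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_read(sqe, fd, buf, 4096, 0);	/* async O_DIRECT read */
	io_uring_submit(&ring);

	if (!io_uring_wait_cqe(&ring, &cqe)) {
		printf("res = %d\n", cqe->res);	/* read result or -errno */
		io_uring_cqe_seen(&ring, cqe);
	}

	io_uring_queue_exit(&ring);
	close(fd);
	free(buf);
	return 0;
}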

v1->v2:
* add same logic in __blkdev_direct_IO_simple()
v2->v3:
* add same logic in do_blockdev_direct_IO()

 fs/block_dev.c       | 7 +++++++
 fs/direct-io.c       | 6 ++++--
 fs/iomap/direct-io.c | 3 +++
 3 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 9e84b1928b94..ca6f365c2f14 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -263,6 +263,10 @@ static void blkdev_bio_end_io_simple(struct bio *bio)
 		bio.bi_opf = dio_bio_write_op(iocb);
 		task_io_account_write(ret);
 	}
+
+	if (iocb->ki_flags & IOCB_NOWAIT)
+		bio.bi_opf |= REQ_NOWAIT;
+
 	if (iocb->ki_flags & IOCB_HIPRI)
 		bio_set_polled(&bio, iocb);
 
@@ -417,6 +421,9 @@ static void blkdev_bio_end_io(struct bio *bio)
 			task_io_account_write(bio->bi_iter.bi_size);
 		}
 
+		if (iocb->ki_flags & IOCB_NOWAIT)
+			bio->bi_opf |= REQ_NOWAIT;
+
 		dio->size += bio->bi_iter.bi_size;
 		pos += bio->bi_iter.bi_size;
 
diff --git a/fs/direct-io.c b/fs/direct-io.c
index d53fa92a1ab6..b221ed351c1c 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -1206,11 +1206,13 @@ static inline int drop_refcount(struct dio *dio)
 	if (iov_iter_rw(iter) == WRITE) {
 		dio->op = REQ_OP_WRITE;
 		dio->op_flags = REQ_SYNC | REQ_IDLE;
-		if (iocb->ki_flags & IOCB_NOWAIT)
-			dio->op_flags |= REQ_NOWAIT;
 	} else {
 		dio->op = REQ_OP_READ;
 	}
+
+	if (iocb->ki_flags & IOCB_NOWAIT)
+		dio->op_flags |= REQ_NOWAIT;
+
 	if (iocb->ki_flags & IOCB_HIPRI)
 		dio->op_flags |= REQ_HIPRI;
 
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 933f234d5bec..2e897688ed6d 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -64,6 +64,9 @@ static void iomap_dio_submit_bio(struct iomap_dio *dio, struct iomap *iomap,
 {
 	atomic_inc(&dio->ref);
 
+	if (dio->iocb->ki_flags & IOCB_NOWAIT)
+		bio->bi_opf |= REQ_NOWAIT;
+
 	if (dio->iocb->ki_flags & IOCB_HIPRI)
 		bio_set_polled(bio, dio->iocb);
 
-- 
1.8.3.1



* Re: [PATCH v3 RESEND] iomap: set REQ_NOWAIT according to IOCB_NOWAIT in Direct IO
  2020-12-04  9:44 [PATCH v3 RESEND] iomap: set REQ_NOWAIT according to IOCB_NOWAIT in Direct IO Hao Xu
@ 2020-12-04 11:44 ` Pavel Begunkov
  2020-12-07  2:21 ` Dave Chinner
  1 sibling, 0 replies; 15+ messages in thread
From: Pavel Begunkov @ 2020-12-04 11:44 UTC (permalink / raw)
  To: Hao Xu, Alexander Viro, Christoph Hellwig
  Cc: Darrick J. Wong, linux-fsdevel, Jeffle Xu, Konstantin Khlebnikov,
	Jens Axboe, io-uring, Joseph Qi, linux-block

On 04/12/2020 09:44, Hao Xu wrote:
> Currently, IOCB_NOWAIT is ignored in Direct IO, REQ_NOWAIT is only set
> when IOCB_HIPRI is set. But REQ_NOWAIT should be set as well when
> IOCB_NOWAIT is set.

I believe Jens took my patch fixing that for blkdev_direct_IO*()
(but not iomap) a while ago.

BTW, even though get_maintainer.pl doesn't think so, AFAIK
fs/block_dev.c is managed by linux-block@vger.kernel.org. Please CC it
next time.

> Suggested-by: Jeffle Xu <jefflexu@linux.alibaba.com>
> Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
> Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
> ---
> 
> Hi all,
> I tested fio io_uring direct read for a file on ext4 filesystem on a
> nvme ssd. I found that IOCB_NOWAIT is ignored in iomap layer, which
> means REQ_NOWAIT is not set in bio->bi_opf. This makes nowait IO a
> normal IO. Since I'm new to iomap and block layer, I sincerely ask
> yours opinions in case I misunderstand the code which is very likely
> to happen.:)
> The example I use: io_uring direct randread, the first try is with
> IOCB_NOWAIT but not IOCB_HIPRI, the IOCB_NOWAIT is ignored in block
> layer which I think is not the designed behaviour.
> 
> I found that Konstantin found this issue before in May
> 2020 (https://www.spinics.net/lists/linux-block/msg53275.html), here add
> his signature, add Jeffle's as well since he gave me some help.
> 
> v1->v2:
> * add same logic in __blkdev_direct_IO_simple()
> v2->v3:
> * add same logic in do_blockdev_direct_IO()
> 
>  fs/block_dev.c       | 7 +++++++
>  fs/direct-io.c       | 6 ++++--
>  fs/iomap/direct-io.c | 3 +++
>  3 files changed, 14 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 9e84b1928b94..ca6f365c2f14 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -263,6 +263,10 @@ static void blkdev_bio_end_io_simple(struct bio *bio)
>  		bio.bi_opf = dio_bio_write_op(iocb);
>  		task_io_account_write(ret);
>  	}
> +
> +	if (iocb->ki_flags & IOCB_NOWAIT)
> +		bio.bi_opf |= REQ_NOWAIT;
> +
>  	if (iocb->ki_flags & IOCB_HIPRI)
>  		bio_set_polled(&bio, iocb);
>  
> @@ -417,6 +421,9 @@ static void blkdev_bio_end_io(struct bio *bio)
>  			task_io_account_write(bio->bi_iter.bi_size);
>  		}
>  
> +		if (iocb->ki_flags & IOCB_NOWAIT)
> +			bio->bi_opf |= REQ_NOWAIT;
> +
>  		dio->size += bio->bi_iter.bi_size;
>  		pos += bio->bi_iter.bi_size;
>  
> diff --git a/fs/direct-io.c b/fs/direct-io.c
> index d53fa92a1ab6..b221ed351c1c 100644
> --- a/fs/direct-io.c
> +++ b/fs/direct-io.c
> @@ -1206,11 +1206,13 @@ static inline int drop_refcount(struct dio *dio)
>  	if (iov_iter_rw(iter) == WRITE) {
>  		dio->op = REQ_OP_WRITE;
>  		dio->op_flags = REQ_SYNC | REQ_IDLE;
> -		if (iocb->ki_flags & IOCB_NOWAIT)
> -			dio->op_flags |= REQ_NOWAIT;
>  	} else {
>  		dio->op = REQ_OP_READ;
>  	}
> +
> +	if (iocb->ki_flags & IOCB_NOWAIT)
> +		dio->op_flags |= REQ_NOWAIT;
> +
>  	if (iocb->ki_flags & IOCB_HIPRI)
>  		dio->op_flags |= REQ_HIPRI;
>  
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index 933f234d5bec..2e897688ed6d 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -64,6 +64,9 @@ static void iomap_dio_submit_bio(struct iomap_dio *dio, struct iomap *iomap,
>  {
>  	atomic_inc(&dio->ref);
>  
> +	if (dio->iocb->ki_flags & IOCB_NOWAIT)
> +		bio->bi_opf |= REQ_NOWAIT;
> +
>  	if (dio->iocb->ki_flags & IOCB_HIPRI)
>  		bio_set_polled(bio, dio->iocb);
>  
> 

-- 
Pavel Begunkov


* Re: [PATCH v3 RESEND] iomap: set REQ_NOWAIT according to IOCB_NOWAIT in Direct IO
  2020-12-04  9:44 [PATCH v3 RESEND] iomap: set REQ_NOWAIT according to IOCB_NOWAIT in Direct IO Hao Xu
  2020-12-04 11:44 ` Pavel Begunkov
@ 2020-12-07  2:21 ` Dave Chinner
  2020-12-07 23:40   ` Jens Axboe
  2020-12-08  5:46   ` JeffleXu
  1 sibling, 2 replies; 15+ messages in thread
From: Dave Chinner @ 2020-12-07  2:21 UTC (permalink / raw)
  To: Hao Xu
  Cc: Alexander Viro, Christoph Hellwig, Darrick J. Wong,
	linux-fsdevel, Jeffle Xu, Konstantin Khlebnikov, Jens Axboe,
	io-uring, Joseph Qi

On Fri, Dec 04, 2020 at 05:44:56PM +0800, Hao Xu wrote:
> Currently, IOCB_NOWAIT is ignored in Direct IO, REQ_NOWAIT is only set
> when IOCB_HIPRI is set. But REQ_NOWAIT should be set as well when
> IOCB_NOWAIT is set.
> 
> Suggested-by: Jeffle Xu <jefflexu@linux.alibaba.com>
> Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
> Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
> ---
> 
> Hi all,
> I tested fio io_uring direct read for a file on ext4 filesystem on a
> nvme ssd. I found that IOCB_NOWAIT is ignored in iomap layer, which
> means REQ_NOWAIT is not set in bio->bi_opf.

What iomap is doing is correct behaviour. IOCB_NOWAIT applies to the
filesystem behaviour, not the block device.

REQ_NOWAIT can result in partial IO failures because the error is
only reported to the iomap layer via IO completions. Hence we can
split a DIO into multiple bios and have random bios in that IO fail
with EAGAIN because REQ_NOWAIT is set. This error will
get reported to the submitter via completion, and it will override
any of the partial IOs that actually completed.

Hence, like the recently reported multi-mapping IOCB_NOWAIT bug
reported by Jens and fixed in commit 883a790a8440 ("xfs: don't allow
NOWAIT DIO across extent boundaries") we'll get silent partial
writes occurring because the second submitted bio in an IO can
trigger EAGAIN errors with partial IO completion having already
occurred.

Further, we don't allow partial IO completion for DIO on XFS at all.
DIO must be completely submitted and completed or return an error
without having issued any IO at all.  Hence using REQ_NOWAIT for
DIO bios is incorrect and not desirable.
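
To sketch that failure mode in plain C (illustrative names only, this
is a model of the behaviour described above, not the actual iomap code):

#include <errno.h>
#include <stdio.h>

/* Model of a DIO split into several bios: the first error recorded by
 * any completion becomes the result of the whole IO, overriding the
 * parts that actually made it to disk. */
struct dio_model {
	int ref;	/* outstanding split bios */
	int error;	/* first error seen, 0 if none */
	long done;	/* bytes that completed successfully */
};

static void bio_complete(struct dio_model *dio, int err, long bytes)
{
	if (err && !dio->error)
		dio->error = err;
	else if (!err)
		dio->done += bytes;
	if (--dio->ref == 0) {
		if (dio->error)
			printf("whole DIO fails with %d, but %ld bytes are already on disk\n",
			       dio->error, dio->done);
		else
			printf("DIO completed: %ld bytes\n", dio->done);
	}
}

int main(void)
{
	struct dio_model dio = { .ref = 3 };

	bio_complete(&dio, 0, 4096);		/* bio 1: hit the disk */
	bio_complete(&dio, -EAGAIN, 0);		/* bio 2: rejected, REQ_NOWAIT */
	bio_complete(&dio, 0, 4096);		/* bio 3: hit the disk */
	return 0;
}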

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH v3 RESEND] iomap: set REQ_NOWAIT according to IOCB_NOWAIT in Direct IO
  2020-12-07  2:21 ` Dave Chinner
@ 2020-12-07 23:40   ` Jens Axboe
  2020-12-09 21:15     ` Dave Chinner
  2020-12-08  5:46   ` JeffleXu
  1 sibling, 1 reply; 15+ messages in thread
From: Jens Axboe @ 2020-12-07 23:40 UTC (permalink / raw)
  To: Dave Chinner, Hao Xu
  Cc: Alexander Viro, Christoph Hellwig, Darrick J. Wong,
	linux-fsdevel, Jeffle Xu, Konstantin Khlebnikov, io-uring,
	Joseph Qi

On 12/6/20 7:21 PM, Dave Chinner wrote:
> On Fri, Dec 04, 2020 at 05:44:56PM +0800, Hao Xu wrote:
>> Currently, IOCB_NOWAIT is ignored in Direct IO, REQ_NOWAIT is only set
>> when IOCB_HIPRI is set. But REQ_NOWAIT should be set as well when
>> IOCB_NOWAIT is set.
>>
>> Suggested-by: Jeffle Xu <jefflexu@linux.alibaba.com>
>> Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
>> Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
>> ---
>>
>> Hi all,
>> I tested fio io_uring direct read for a file on ext4 filesystem on a
>> nvme ssd. I found that IOCB_NOWAIT is ignored in iomap layer, which
>> means REQ_NOWAIT is not set in bio->bi_opf.
> 
> What iomap is doing is correct behaviour. IOCB_NOWAIT applies to the
> filesystem behaviour, not the block device.
> 
> REQ_NOWAIT can result in partial IO failures because the error is
> only reported to the iomap layer via IO completions. Hence we can
> split a DIO into multiple bios and have random bios in that IO fail
> with EAGAIN because REQ_NOWAIT is set. This error will
> get reported to the submitter via completion, and it will override
> any of the partial IOs that actually completed.
> 
> Hence, like the recently reported multi-mapping IOCB_NOWAIT bug
> reported by Jens and fixed in commit 883a790a8440 ("xfs: don't allow
> NOWAIT DIO across extent boundaries") we'll get silent partial
> writes occurring because the second submitted bio in an IO can
> trigger EAGAIN errors with partial IO completion having already
> occurred.
> 
> Further, we don't allow partial IO completion for DIO on XFS at all.
> DIO must be completely submitted and completed or return an error
> without having issued any IO at all.  Hence using REQ_NOWAIT for
> DIO bios is incorrect and not desirable.

What you say makes total sense for a user using RWF_NOWAIT, but it
doesn't make a lot of sense for io_uring where we really want
IOCB_NOWAIT to be what it suggests it is - don't wait for other IO to
complete, if avoidable. One of the things that really suck with
aio/libaio is the "yeah it's probably async, but lol, might not be"
aspect of it.

For io_uring, if we do get -EAGAIN, we'll retry without NOWAIT set. So
the concern about fractured/short writes doesn't bubble up to the
application. Hence we really want an IOCB_NOWAIT_REALLY on that side,
instead of the poor man's IOCB_MAYBE_NOWAIT semantics.
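
Roughly, the policy is (a userspace sketch of the idea, not io_uring's
actual code):

#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/* Illustrative only: issue the IO without waiting first; if that fails
 * with -EAGAIN, reissue it from a context that is allowed to block, so
 * the caller never sees the nowait failure. */
static int issue_io(bool nowait)
{
	/* pretend this request would have had to block */
	return nowait ? -EAGAIN : 4096;
}

static int submit_with_retry(void)
{
	int ret = issue_io(true);	/* fast, non-blocking attempt */

	if (ret == -EAGAIN)
		ret = issue_io(false);	/* punt and retry, blocking allowed */
	return ret;
}

int main(void)
{
	printf("result: %d\n", submit_with_retry());
	return 0;
}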

-- 
Jens Axboe



* Re: [PATCH v3 RESEND] iomap: set REQ_NOWAIT according to IOCB_NOWAIT in Direct IO
  2020-12-07  2:21 ` Dave Chinner
  2020-12-07 23:40   ` Jens Axboe
@ 2020-12-08  5:46   ` JeffleXu
  2020-12-09 21:23     ` Dave Chinner
  1 sibling, 1 reply; 15+ messages in thread
From: JeffleXu @ 2020-12-08  5:46 UTC (permalink / raw)
  To: Dave Chinner, Hao Xu
  Cc: Alexander Viro, Christoph Hellwig, Darrick J. Wong,
	linux-fsdevel, Konstantin Khlebnikov, Jens Axboe, io-uring,
	Joseph Qi



On 12/7/20 10:21 AM, Dave Chinner wrote:
> On Fri, Dec 04, 2020 at 05:44:56PM +0800, Hao Xu wrote:
>> Currently, IOCB_NOWAIT is ignored in Direct IO, REQ_NOWAIT is only set
>> when IOCB_HIPRI is set. But REQ_NOWAIT should be set as well when
>> IOCB_NOWAIT is set.
>>
>> Suggested-by: Jeffle Xu <jefflexu@linux.alibaba.com>
>> Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
>> Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
>> ---
>>
>> Hi all,
>> I tested fio io_uring direct read for a file on ext4 filesystem on a
>> nvme ssd. I found that IOCB_NOWAIT is ignored in iomap layer, which
>> means REQ_NOWAIT is not set in bio->bi_opf.
> 
> What iomap is doing is correct behaviour. IOCB_NOWAIT applies to the
> filesystem behaviour, not the block device.
> 
> REQ_NOWAIT can result in partial IO failures because the error is
> only reported to the iomap layer via IO completions. Hence we can
> split a DIO into multiple bios and have random bios in that IO fail
> with EAGAIN because REQ_NOWAIT is set. This error will
> get reported to the submitter via completion, and it will override
> any of the partial IOs that actually completed.
> 
> Hence, like the recently reported multi-mapping IOCB_NOWAIT bug
> reported by Jens and fixed in commit 883a790a8440 ("xfs: don't allow
> NOWAIT DIO across extent boundaries") we'll get silent partial
> writes occurring because the second submitted bio in an IO can
> trigger EAGAIN errors with partial IO completion having already
> occurred.
> 
> Further, we don't allow partial IO completion for DIO on XFS at all.
> DIO must be completely submitted and completed or return an error
> without having issued any IO at all.  Hence using REQ_NOWAIT for
> DIO bios is incorrect and not desirable.

Not familiar with xfs though, just out of curiosity: how do you achieve
'no partial completion'? I mean, you could avoid partial -EAGAIN by not
setting REQ_NOWAIT, but you could still get other partial errors such as
-ENOMEM or something, as long as one DIO can be split into multiple bios.


-- 
Thanks,
Jeffle


* Re: [PATCH v3 RESEND] iomap: set REQ_NOWAIT according to IOCB_NOWAIT in Direct IO
  2020-12-07 23:40   ` Jens Axboe
@ 2020-12-09 21:15     ` Dave Chinner
  2020-12-10  2:33       ` JeffleXu
  0 siblings, 1 reply; 15+ messages in thread
From: Dave Chinner @ 2020-12-09 21:15 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Hao Xu, Alexander Viro, Christoph Hellwig, Darrick J. Wong,
	linux-fsdevel, Jeffle Xu, Konstantin Khlebnikov, io-uring,
	Joseph Qi

On Mon, Dec 07, 2020 at 04:40:38PM -0700, Jens Axboe wrote:
> On 12/6/20 7:21 PM, Dave Chinner wrote:
> > On Fri, Dec 04, 2020 at 05:44:56PM +0800, Hao Xu wrote:
> >> Currently, IOCB_NOWAIT is ignored in Direct IO, REQ_NOWAIT is only set
> >> when IOCB_HIPRI is set. But REQ_NOWAIT should be set as well when
> >> IOCB_NOWAIT is set.
> >>
> >> Suggested-by: Jeffle Xu <jefflexu@linux.alibaba.com>
> >> Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
> >> Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
> >> ---
> >>
> >> Hi all,
> >> I tested fio io_uring direct read for a file on ext4 filesystem on a
> >> nvme ssd. I found that IOCB_NOWAIT is ignored in iomap layer, which
> >> means REQ_NOWAIT is not set in bio->bi_opf.
> > 
> > What iomap is doing is correct behaviour. IOCB_NOWAIT applies to the
> > filesystem behaviour, not the block device.
> > 
> > REQ_NOWAIT can result in partial IO failures because the error is
> > only reported to the iomap layer via IO completions. Hence we can
> > split a DIO into multiple bios and have random bios in that IO fail
> > with EAGAIN because REQ_NOWAIT is set. This error will
> > get reported to the submitter via completion, and it will override
> > any of the partial IOs that actually completed.
> > 
> > Hence, like the recently reported multi-mapping IOCB_NOWAIT bug
> > reported by Jens and fixed in commit 883a790a8440 ("xfs: don't allow
> > NOWAIT DIO across extent boundaries") we'll get silent partial
> > writes occurring because the second submitted bio in an IO can
> > trigger EAGAIN errors with partial IO completion having already
> > occurred.
> > 
> > Further, we don't allow partial IO completion for DIO on XFS at all.
> > DIO must be completely submitted and completed or return an error
> > without having issued any IO at all.  Hence using REQ_NOWAIT for
> > DIO bios is incorrect and not desirable.
> 
> What you say makes total sense for a user using RWF_NOWAIT, but it
> doesn't make a lot of sense for io_uring where we really want
> IOCB_NOWAIT to be what it suggests it is - don't wait for other IO to
> complete, if avoidable. One of the things that really suck with
> aio/libai is the "yeah it's probably async, but lol, might not be"
> aspect of it.

Sure, but we have no way of telling what semantics the IO issuer
actually requires from above. And because IOCB_NOWAIT behaviour is
directly exposed to userspace by RWF_NOWAIT, that's the behaviour we
have to implement.

> For io_uring, if we do get -EAGAIN, we'll retry without NOWAIT set. So
> the concern about fractured/short writes doesn't bubble up to the
> application. Hence we really want an IOCB_NOWAIT_REALLY on that side,
> instead of the poor mans IOCB_MAYBE_NOWAIT semantics.

Yup, perhaps what we really want is a true IOCB_NONBLOCK flag as an
internal kernel implementation. i.e. don't block anywhere in the
stack, and the caller must handle retrying/completing the entire IO
regardless of where the -EAGAIN comes from during the IO, including
from a partial completion that the caller is waiting for...

i.e. rather than hacking around this specific instance of "it blocks
and we don't want it to", define the semantics and behaviour of a
fully non-blocking IO through all layers from the VFS down to the
hardware and let's implement that. Then we can stop playing
whack-a-mole with all the "but it blocks when I do this, doctor!"
issues that we seem to keep having.... :)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH v3 RESEND] iomap: set REQ_NOWAIT according to IOCB_NOWAIT in Direct IO
  2020-12-08  5:46   ` JeffleXu
@ 2020-12-09 21:23     ` Dave Chinner
  2020-12-10  1:55       ` JeffleXu
  0 siblings, 1 reply; 15+ messages in thread
From: Dave Chinner @ 2020-12-09 21:23 UTC (permalink / raw)
  To: JeffleXu
  Cc: Hao Xu, Alexander Viro, Christoph Hellwig, Darrick J. Wong,
	linux-fsdevel, Konstantin Khlebnikov, Jens Axboe, io-uring,
	Joseph Qi

On Tue, Dec 08, 2020 at 01:46:47PM +0800, JeffleXu wrote:
> 
> 
> On 12/7/20 10:21 AM, Dave Chinner wrote:
> > On Fri, Dec 04, 2020 at 05:44:56PM +0800, Hao Xu wrote:
> >> Currently, IOCB_NOWAIT is ignored in Direct IO, REQ_NOWAIT is only set
> >> when IOCB_HIPRI is set. But REQ_NOWAIT should be set as well when
> >> IOCB_NOWAIT is set.
> >>
> >> Suggested-by: Jeffle Xu <jefflexu@linux.alibaba.com>
> >> Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
> >> Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
> >> ---
> >>
> >> Hi all,
> >> I tested fio io_uring direct read for a file on ext4 filesystem on a
> >> nvme ssd. I found that IOCB_NOWAIT is ignored in iomap layer, which
> >> means REQ_NOWAIT is not set in bio->bi_opf.
> > 
> > What iomap is doing is correct behaviour. IOCB_NOWAIT applies to the
> > filesystem behaviour, not the block device.
> > 
> > REQ_NOWAIT can result in partial IO failures because the error is
> > only reported to the iomap layer via IO completions. Hence we can
> > split a DIO into multiple bios and have random bios in that IO fail
> > with EAGAIN because REQ_NOWAIT is set. This error will
> > get reported to the submitter via completion, and it will override
> > any of the partial IOs that actually completed.
> > 
> > Hence, like the recently reported multi-mapping IOCB_NOWAIT bug
> > reported by Jens and fixed in commit 883a790a8440 ("xfs: don't allow
> > NOWAIT DIO across extent boundaries") we'll get silent partial
> > writes occurring because the second submitted bio in an IO can
> > trigger EAGAIN errors with partial IO completion having already
> > occurred.
> > 
> > Further, we don't allow partial IO completion for DIO on XFS at all.
> > DIO must be completely submitted and completed or return an error
> > without having issued any IO at all.  Hence using REQ_NOWAIT for
> > DIO bios is incorrect and not desirable.
> 
> Not familiar with xfs though, just in curiosity, how do you achive 'no
> partial completion'? I mean you could avoid partial -EAGAIN by not
> setting REQ_NOWAIT, but you could get other partial errors such as
> -ENOMEM or something, as long as one DIO could be split to multiple bios.

If any part of a DIO fails, we fail the entire IO. When we split a
DIO into multiple bios and one reports an error, we don't track
where in the IO it actually failed, we just fail the entire IO.

e.g. how do you report correct partial completion to userspace when
a DIO gets split into 3 pieces and the middle one fails? There are
two ranges that actually completed, but we can only report one of
them....
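
To illustrate why the completion interface can't express that (plain C,
illustrative only): the result of a DIO is a single value, either an
error or a byte count from the start of the IO, so a hole in the middle
simply cannot be represented:

#include <stdio.h>

/* Model only: completion is one number, so the best we could ever
 * report is the leading contiguous run of completed pieces. */
static long dio_result(const int ok[], int nr, long piece)
{
	long done = 0;
	int i;

	for (i = 0; i < nr; i++) {
		if (!ok[i])
			break;		/* can't describe anything past the hole */
		done += piece;
	}
	return done;
}

int main(void)
{
	int pieces[3] = { 1, 0, 1 };	/* middle piece failed, last one hit disk */

	/* reports 4096 and silently drops the third piece that completed */
	printf("reported completion: %ld bytes\n", dio_result(pieces, 3, 4096));
	return 0;
}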

And, really, we still need to report that an IO failed to userspace,
because mission critical apps care more about the fact that an IO
failure occurred than silently swallowing the IO error with a
(potentially incorrect) partial IO completion notification.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH v3 RESEND] iomap: set REQ_NOWAIT according to IOCB_NOWAIT in Direct IO
  2020-12-09 21:23     ` Dave Chinner
@ 2020-12-10  1:55       ` JeffleXu
  2020-12-10  5:18         ` Dave Chinner
  0 siblings, 1 reply; 15+ messages in thread
From: JeffleXu @ 2020-12-10  1:55 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Hao Xu, Alexander Viro, Christoph Hellwig, Darrick J. Wong,
	linux-fsdevel, Konstantin Khlebnikov, Jens Axboe, io-uring,
	Joseph Qi

Sorry I'm still a little confused.


On 12/10/20 5:23 AM, Dave Chinner wrote:
> On Tue, Dec 08, 2020 at 01:46:47PM +0800, JeffleXu wrote:
>>
>>
>> On 12/7/20 10:21 AM, Dave Chinner wrote:
>>> On Fri, Dec 04, 2020 at 05:44:56PM +0800, Hao Xu wrote:
>>>> Currently, IOCB_NOWAIT is ignored in Direct IO, REQ_NOWAIT is only set
>>>> when IOCB_HIPRI is set. But REQ_NOWAIT should be set as well when
>>>> IOCB_NOWAIT is set.
>>>>
>>>> Suggested-by: Jeffle Xu <jefflexu@linux.alibaba.com>
>>>> Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
>>>> Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
>>>> ---
>>>>
>>>> Hi all,
>>>> I tested fio io_uring direct read for a file on ext4 filesystem on a
>>>> nvme ssd. I found that IOCB_NOWAIT is ignored in iomap layer, which
>>>> means REQ_NOWAIT is not set in bio->bi_opf.
>>>
>>> What iomap is doing is correct behaviour. IOCB_NOWAIT applies to the
>>> filesystem behaviour, not the block device.
>>>
>>> REQ_NOWAIT can result in partial IO failures because the error is
>>> only reported to the iomap layer via IO completions. Hence we can
>>> split a DIO into multiple bios and have random bios in that IO fail
>>> with EAGAIN because REQ_NOWAIT is set. This error will
>>> get reported to the submitter via completion, and it will override
>>> any of the partial IOs that actually completed.
>>>
>>> Hence, like the recently reported multi-mapping IOCB_NOWAIT bug
>>> reported by Jens and fixed in commit 883a790a8440 ("xfs: don't allow
>>> NOWAIT DIO across extent boundaries") we'll get silent partial
>>> writes occurring because the second submitted bio in an IO can
>>> trigger EAGAIN errors with partial IO completion having already
>>> occurred.
>>>

>>> Further, we don't allow partial IO completion for DIO on XFS at all.
>>> DIO must be completely submitted and completed or return an error
>>> without having issued any IO at all.  Hence using REQ_NOWAIT for
>>> DIO bios is incorrect and not desirable.


With the current block layer implementation, as long as one split bio
fails, the whole DIO fails, even though several split bios may have
succeeded and their content has already been written to disk. This is
obviously what you called "partial IO completion".

I'm just wondering how you achieve "DIO must return an error without
having issued any IO at all". Do you have some way of reverting the
content that has already been written to disk when a partial error
happens?

> 
> If any part of a DIO fails, we fail the entire IO. When we split a
> DIO into multiple bios and one reports an error, we don't know track
> where in the IO it actually failed, we just fail the entire IO.
> 
> e.g. how do you report correct partial completion to userspace when
> a DIO gets split into 3 pieces and the middle one fails? There are
> two ranges that actually completed, but we can only report one of
> them....
> 
> And, really, we still need to report that an IO failed to userspace,
> because mission critical apps care more about the fact that an IO
> failure occurred than silently swallowing the IO error with a
> (potentially incorrect) partial IO completion notification.



-- 
Thanks,
Jeffle


* Re: [PATCH v3 RESEND] iomap: set REQ_NOWAIT according to IOCB_NOWAIT in Direct IO
  2020-12-09 21:15     ` Dave Chinner
@ 2020-12-10  2:33       ` JeffleXu
  0 siblings, 0 replies; 15+ messages in thread
From: JeffleXu @ 2020-12-10  2:33 UTC (permalink / raw)
  To: Dave Chinner, Jens Axboe
  Cc: Hao Xu, Alexander Viro, Christoph Hellwig, Darrick J. Wong,
	linux-fsdevel, Konstantin Khlebnikov, io-uring, Joseph Qi



On 12/10/20 5:15 AM, Dave Chinner wrote:
> On Mon, Dec 07, 2020 at 04:40:38PM -0700, Jens Axboe wrote:
>> On 12/6/20 7:21 PM, Dave Chinner wrote:
>>> On Fri, Dec 04, 2020 at 05:44:56PM +0800, Hao Xu wrote:
>>>> Currently, IOCB_NOWAIT is ignored in Direct IO, REQ_NOWAIT is only set
>>>> when IOCB_HIPRI is set. But REQ_NOWAIT should be set as well when
>>>> IOCB_NOWAIT is set.
>>>>
>>>> Suggested-by: Jeffle Xu <jefflexu@linux.alibaba.com>
>>>> Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
>>>> Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
>>>> ---
>>>>
>>>> Hi all,
>>>> I tested fio io_uring direct read for a file on ext4 filesystem on a
>>>> nvme ssd. I found that IOCB_NOWAIT is ignored in iomap layer, which
>>>> means REQ_NOWAIT is not set in bio->bi_opf.
>>>
>>> What iomap is doing is correct behaviour. IOCB_NOWAIT applies to the
>>> filesystem behaviour, not the block device.
>>>
>>> REQ_NOWAIT can result in partial IO failures because the error is
>>> only reported to the iomap layer via IO completions. Hence we can
>>> split a DIO into multiple bios and have random bios in that IO fail
>>> with EAGAIN because REQ_NOWAIT is set. This error will
>>> get reported to the submitter via completion, and it will override
>>> any of the partial IOs that actually completed.
>>>
>>> Hence, like the recently reported multi-mapping IOCB_NOWAIT bug
>>> reported by Jens and fixed in commit 883a790a8440 ("xfs: don't allow
>>> NOWAIT DIO across extent boundaries") we'll get silent partial
>>> writes occurring because the second submitted bio in an IO can
>>> trigger EAGAIN errors with partial IO completion having already
>>> occurred.
>>>
>>> Further, we don't allow partial IO completion for DIO on XFS at all.
>>> DIO must be completely submitted and completed or return an error
>>> without having issued any IO at all.  Hence using REQ_NOWAIT for
>>> DIO bios is incorrect and not desirable.
>>
>> What you say makes total sense for a user using RWF_NOWAIT, but it
>> doesn't make a lot of sense for io_uring where we really want
>> IOCB_NOWAIT to be what it suggests it is - don't wait for other IO to
>> complete, if avoidable. One of the things that really suck with
>> aio/libai is the "yeah it's probably async, but lol, might not be"
>> aspect of it.
> 
> Sure, but we have no way of telling what semantics the IO issuer
> actually requires from above. And because IOCB_NOWAIT behaviour is
> directly exposed to userspace by RWF_NOWAIT, that's the behaviour we
> have to implement.
> 
>> For io_uring, if we do get -EAGAIN, we'll retry without NOWAIT set. So
>> the concern about fractured/short writes doesn't bubble up to the
>> application. Hence we really want an IOCB_NOWAIT_REALLY on that side,
>> instead of the poor mans IOCB_MAYBE_NOWAIT semantics.
> 
> Yup, perhaps what we really want is a true IOCB_NONBLOCK flag as an
> internal kernel implementation. i.e. don't block anywhere in the
> stack, and the caller must handle retrying/completing the entire IO
> regardless of where the -EAGAIN comes from during the IO, including
> from a partial completion that the caller is waiting for...
> 
> i.e. rather than hacking around this specific instance of "it blocks
> and we don't want it to", define the semantics and behaviour of a
> fully non-blocking IO through all layers from the VFS down to the
> hardware and let's implement that. Then we can stop playing
> whack-a-mole with all the "but it blocks when I do this, doctor!"
> issues that we seem to keep having.... :)
> 

In my opinion, the semantics of NOWAIT are clear, just like Jens said:
"don't wait, if **avoidable**". Also, it should be the responsibility of
the caller that set this flag to resubmit the failed IO once -EAGAIN is
returned. E.g. if the user app sets RWF_NOWAIT, then it's the
responsibility of the user app to resubmit the failed IO once -EAGAIN is
returned. If the io_uring layer sets a new IOCB_NONBLOCK flag, then it's
the responsibility of io_uring to resubmit.

The limitation here is that once a big DIO/bio gets split, the atomicity
of the whole DIO/bio can no longer be guaranteed. This seems to be a
limitation of the direct IO routine and the block layer, and work in the
layers above, e.g. the filesystem, is needed to guarantee atomicity.

-- 
Thanks,
Jeffle


* Re: [PATCH v3 RESEND] iomap: set REQ_NOWAIT according to IOCB_NOWAIT in Direct IO
  2020-12-10  1:55       ` JeffleXu
@ 2020-12-10  5:18         ` Dave Chinner
  2020-12-11  2:50           ` JeffleXu
  0 siblings, 1 reply; 15+ messages in thread
From: Dave Chinner @ 2020-12-10  5:18 UTC (permalink / raw)
  To: JeffleXu
  Cc: Hao Xu, Alexander Viro, Christoph Hellwig, Darrick J. Wong,
	linux-fsdevel, Konstantin Khlebnikov, Jens Axboe, io-uring,
	Joseph Qi

On Thu, Dec 10, 2020 at 09:55:32AM +0800, JeffleXu wrote:
> Sorry I'm still a little confused.
> 
> 
> On 12/10/20 5:23 AM, Dave Chinner wrote:
> > On Tue, Dec 08, 2020 at 01:46:47PM +0800, JeffleXu wrote:
> >>
> >>
> >> On 12/7/20 10:21 AM, Dave Chinner wrote:
> >>> On Fri, Dec 04, 2020 at 05:44:56PM +0800, Hao Xu wrote:
> >>>> Currently, IOCB_NOWAIT is ignored in Direct IO, REQ_NOWAIT is only set
> >>>> when IOCB_HIPRI is set. But REQ_NOWAIT should be set as well when
> >>>> IOCB_NOWAIT is set.
> >>>>
> >>>> Suggested-by: Jeffle Xu <jefflexu@linux.alibaba.com>
> >>>> Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
> >>>> Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
> >>>> ---
> >>>>
> >>>> Hi all,
> >>>> I tested fio io_uring direct read for a file on ext4 filesystem on a
> >>>> nvme ssd. I found that IOCB_NOWAIT is ignored in iomap layer, which
> >>>> means REQ_NOWAIT is not set in bio->bi_opf.
> >>>
> >>> What iomap is doing is correct behaviour. IOCB_NOWAIT applies to the
> >>> filesystem behaviour, not the block device.
> >>>
> >>> REQ_NOWAIT can result in partial IO failures because the error is
> >>> only reported to the iomap layer via IO completions. Hence we can
> >>> split a DIO into multiple bios and have random bios in that IO fail
> >>> with EAGAIN because REQ_NOWAIT is set. This error will
> >>> get reported to the submitter via completion, and it will override
> >>> any of the partial IOs that actually completed.
> >>>
> >>> Hence, like the recently reported multi-mapping IOCB_NOWAIT bug
> >>> reported by Jens and fixed in commit 883a790a8440 ("xfs: don't allow
> >>> NOWAIT DIO across extent boundaries") we'll get silent partial
> >>> writes occurring because the second submitted bio in an IO can
> >>> trigger EAGAIN errors with partial IO completion having already
> >>> occurred.
> >>>
> 
> >>> Further, we don't allow partial IO completion for DIO on XFS at all.
> >>> DIO must be completely submitted and completed or return an error
> >>> without having issued any IO at all.  Hence using REQ_NOWAIT for
> >>> DIO bios is incorrect and not desirable.
> 
> 
> The current block layer implementation causes that, as long as one split
> bio fails, then the whole DIO fails, in which case several split bios
> maybe have succeeded and the content has been written to the disk. This
> is obviously what you called "partial IO completion".
> 
> I'm just concerned on how do you achieve that "DIO must return an error
> without having issued any IO at all". Do you have some method of
> reverting the content has already been written into the disk when a
> partial error happened?

I think you've misunderstood what I was saying. I did not say
"DIO must return an error without having issued any IO at all".
There are two parts to my statement, and you just smashed part of
the first statement into part of the second statement and came up with
something I didn't actually say.

The first statement is:

	1. "DIO must be fully submitted and completed ...."

That is, if we need to break an IO up into multiple parts, the
entire IO must be submitted and completed as a whole. We do not
allow partial submission or completion of the IO at all because we
cannot accurately report which parts of a multi-bio DIO failed
through the completion interface. IOWs, if any of the IOs after the
first one fail submission, then we must complete all the IOs that
have already been submitted before we can report the failure that
occurred during the IO.

The second statement is:

	2. "... or return an error without having issued any IO at
	   all."

IO submission errors are only reported by the DIO layer through IO
completion, in which case #1 is applied. #2 only applies to errors
that occur before IO submission is started, and these errors are
directly reported to the caller. IOCB_NOWAIT is a check done before
we start submission, hence can return -EAGAIN directly to the
caller.

IOWs, if an error is returned to the caller, we have either not
submitted any IO at all, or we have fully submitted and completed
the IO and there was some error reported by the IO completions.
There is no scope for "partial IO" here - it either completed fully
or we got an error.

This is necessary for correct AIO semantics. We aren't waiting for
completions to be able to report submission errors to submitters.
Hence for async IO, the only way for an error in the DIO layer to be
reported to the submitter is if the error occurs before the first IO
is submitted.(*)
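
As a rough model of those two cases (plain C, illustrative only, not
the iomap code):

#include <errno.h>
#include <stdio.h>

/*
 * #2: an error before anything has been issued is returned directly.
 * #1: once the first bio has gone out, everything already issued must
 *     be completed, and the error is only seen via the completion of
 *     the whole IO.
 */
static int dio_submit(int nr_extents, int fail_at)
{
	int i, submitted = 0, err = 0;

	for (i = 0; i < nr_extents; i++) {
		/* pretend mapping/submitting extent 'fail_at' would block */
		int ret = (i == fail_at) ? -EAGAIN : 0;

		if (ret) {
			if (!submitted)
				return ret;	/* #2: nothing issued yet */
			err = ret;		/* #1: finish what was started */
			break;
		}
		submitted++;			/* stands in for submit_bio() */
	}
	/* wait for all 'submitted' bios to complete, then report */
	return err ? err : submitted * 4096;
}

int main(void)
{
	printf("error before any submission: %d\n", dio_submit(3, 0));
	printf("error after partial submission: %d\n", dio_submit(3, 1));
	printf("no error: %d\n", dio_submit(3, -1));
	return 0;
}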

RWF_NOWAIT was explicitly intended to enable applications using
AIO+DIO to avoid long latencies that occur as a result of blocking
on filesystem locks and resources. Blocking in the request queue is
minimal latency compared to waiting for (tens of) thousands of IOs
to complete ((tens of) seconds!) so the filesystem iomap path can run a
transaction to allocate disk space for the DIO.

IOWs, IOCB_NOWAIT was pretty much intended to only be seen at the
filesystem layers to avoid the extremely high latencies that
filesystem layers might cause.  Blocking for a few milliseconds or
even tens of milliseconds in the request queue is not a problem
IOCB_NOWAIT was ever intended to solve.

Really, if io_uring needs DIO to avoid blocking in the request
queues below the filesystem, it should be providing that guidance
directly. IOCB_NOWAIT is -part- of the semantics being asked for,
but it does not provide them all and we can't change them to provide
exactly what io_uring wants because IOCB_NOWAIT == RWF_NOWAIT
semantics.

Cheers,

Dave.

(*) Yes, yes, I know that if you have really fast storage the IO
might complete before submission has finished, but in that case it is
just the final completion that is done by the submitter, and so #1 is
actually being followed. i.e. the IO is fully submitted and completed.

-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH v3 RESEND] iomap: set REQ_NOWAIT according to IOCB_NOWAIT in Direct IO
  2020-12-10  5:18         ` Dave Chinner
@ 2020-12-11  2:50           ` JeffleXu
  2020-12-14  2:56             ` Dave Chinner
  0 siblings, 1 reply; 15+ messages in thread
From: JeffleXu @ 2020-12-11  2:50 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Hao Xu, Alexander Viro, Christoph Hellwig, Darrick J. Wong,
	linux-fsdevel, Konstantin Khlebnikov, Jens Axboe, io-uring,
	Joseph Qi

Excellent explanation! Thanks a lot.

Still some questions below.

On 12/10/20 1:18 PM, Dave Chinner wrote:
> On Thu, Dec 10, 2020 at 09:55:32AM +0800, JeffleXu wrote:
>> Sorry I'm still a little confused.
>>
>>
>> On 12/10/20 5:23 AM, Dave Chinner wrote:
>>> On Tue, Dec 08, 2020 at 01:46:47PM +0800, JeffleXu wrote:
>>>>
>>>>
>>>> On 12/7/20 10:21 AM, Dave Chinner wrote:
>>>>> On Fri, Dec 04, 2020 at 05:44:56PM +0800, Hao Xu wrote:
>>>>>> Currently, IOCB_NOWAIT is ignored in Direct IO, REQ_NOWAIT is only set
>>>>>> when IOCB_HIPRI is set. But REQ_NOWAIT should be set as well when
>>>>>> IOCB_NOWAIT is set.
>>>>>>
>>>>>> Suggested-by: Jeffle Xu <jefflexu@linux.alibaba.com>
>>>>>> Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
>>>>>> Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
>>>>>> ---
>>>>>>
>>>>>> Hi all,
>>>>>> I tested fio io_uring direct read for a file on ext4 filesystem on a
>>>>>> nvme ssd. I found that IOCB_NOWAIT is ignored in iomap layer, which
>>>>>> means REQ_NOWAIT is not set in bio->bi_opf.
>>>>>
>>>>> What iomap is doing is correct behaviour. IOCB_NOWAIT applies to the
>>>>> filesystem behaviour, not the block device.
>>>>>
>>>>> REQ_NOWAIT can result in partial IO failures because the error is
>>>>> only reported to the iomap layer via IO completions. Hence we can
>>>>> split a DIO into multiple bios and have random bios in that IO fail
>>>>> with EAGAIN because REQ_NOWAIT is set. This error will
>>>>> get reported to the submitter via completion, and it will override
>>>>> any of the partial IOs that actually completed.
>>>>>
>>>>> Hence, like the recently reported multi-mapping IOCB_NOWAIT bug
>>>>> reported by Jens and fixed in commit 883a790a8440 ("xfs: don't allow
>>>>> NOWAIT DIO across extent boundaries") we'll get silent partial
>>>>> writes occurring because the second submitted bio in an IO can
>>>>> trigger EAGAIN errors with partial IO completion having already
>>>>> occurred.
>>>>>
>>
>>>>> Further, we don't allow partial IO completion for DIO on XFS at all.
>>>>> DIO must be completely submitted and completed or return an error
>>>>> without having issued any IO at all.  Hence using REQ_NOWAIT for
>>>>> DIO bios is incorrect and not desirable.
>>
>>
>> The current block layer implementation causes that, as long as one split
>> bio fails, then the whole DIO fails, in which case several split bios
>> maybe have succeeded and the content has been written to the disk. This
>> is obviously what you called "partial IO completion".
>>
>> I'm just concerned on how do you achieve that "DIO must return an error
>> without having issued any IO at all". Do you have some method of
>> reverting the content has already been written into the disk when a
>> partial error happened?
> 
> I think you've misunderstood what I was saying. I did not say
> "DIO must return an error without having issued any IO at all".
> There are two parts to my statement, and you just smashed part of
> the first statement into part of the second statement and came up
> something I didn't actually say.
> 
> The first statement is:
> 
> 	1. "DIO must be fully submitted and completed ...."
> 
> That is, if we need to break an IO up into multiple parts, the
> entire IO must be submitted and completed as a whole. We do not
> allow partial submission or completion of the IO at all because we
> cannot accurately report what parts of a multi-bio DIO that failed
> through the completion interface. IOWs, if any of the IOs after the
> first one fail submission, then we must complete all the IOs that
> have already been submitted before we can report the failure that
> occurred during the IO.
> 

1. Actually I'm not quite clear on what you call "partial submission
or completion". Even when REQ_NOWAIT is set for all split bios of one
DIO, all of these split bios are **submitted** to the block layer through
submit_bio(). Even when one split bio after the first fails **inside**
submit_bio() because of REQ_NOWAIT, submit_bio() only returns
BLK_QC_T_NONE, and the DIO layer (such as __blkdev_direct_IO()) will
still call submit_bio() for the remaining split bios. And only when all
split bios complete will the whole kiocb complete.

So if you define "submission" as submitting to the hardware (such as the
nvme device driver), then it is indeed **partial submission** when
REQ_NOWAIT is set. But if the definition of "submission" is actually
"submitting to the block layer by calling submit_bio()", then all split
bios of one DIO are indeed submitted to the block layer, even when one
split bio gets BLK_STS_AGAIN because of REQ_NOWAIT.


2. One DIO can be split into multiple bios in the DIO layer. Similarly,
one big bio can be split into multiple bios in the block layer. When all
split bios have already been submitted to the block layer, since the IO
completion interface can return only one error code, the whole DIO fails
as soon as one split bio fails, while several other split bios may have
already succeeded and the corresponding disk sectors may already have
been overwritten. I'm not sure whether this is what you call "silent
partial writes", and whether this is the core reason for commit
883a790a8440 ("xfs: don't allow NOWAIT DIO across extent boundaries")
that you mentioned in your previous mail.

If this is the case, then it seems that the IO completion interface is
to blame for this issue. Besides REQ_NOWAIT, there may be more
situations that cause "silent partial writes".

As long as one split bio can fail (besides EAGAIN, maybe EOPNOTSUPP,
if possible) while other split bios may still succeed, the error of
one split bio will still override the completion status of the whole
DIO. In that case, are "silent partial writes" still possible? In my
understanding, setting REQ_NOWAIT from IOCB_NOWAIT in the DIO layer
only makes the problem more likely to happen, but it's not the only
source of this problem?



> The second statement is:
> 
> 	2. "... or return an error without having issued any IO at
> 	   all."
> 
> IO submission errors are only reported by the DIO layer through IO
> completion, in which case #1 is applied. #2 only applies to errors
> that occur before IO submission is started, and these errors are
> directly reported to the caller. IOCB_NOWAIT is a check done before
> we start submission, hence can return -EAGAIN directly to the
> caller.
> 
> IOWs, if an error is returned to the caller, we have either not
> submitted any IO at all, or we have fully submitted and completed
> the IO and there was some error reported by the IO completions.
> There is no scope for "partial IO" here - it either completed fully
> or we got an error.
> 
> This is necessary for correct AIO semantics. We aren't waiting for
> completions to be able to report submission errors to submitters.
> Hence for async IO, the only way for an error in the DIO layer to be
> reported to the submitter is if the error occurs before the first IO
> is submitted.(*)
> 
> RWF_NOWAIT was explicitly intended to enable applications using
> AIO+DIO to avoid long latencies that occur as a result of blocking
> on filesystem locks and resources. Blocking in the request queue is
> minimal latency compared to waiting for (tens of) thousands of IOs
> to complete ((tens of) seconds!) so the filesystem iomap path can run a
> transaction to allocate disk spacei for the DIO.
> 
> IOWS, IOCB_NOWAIT was pretty` much intended to only be seen at the
> filesystem layers to avoid the extremely high latencies that
> filesystem layers might cause.  Blocking for a few milliseconds or
> even tens of milliseconds in the request queue is not a problem
> IOCB_NOWAIT was ever intended to solve.

I didn't know the original intention and history of IOCB_NOWAIT.
Learned a lot. Thanks.

> 
> Really, if io_uring needs DIO to avoid blocking in the request
> queues below the filesystem, it should be providing that guidance
> directly. IOCB_NOWAIT is -part- of the semantics being asked for,
> but it does not provide them all and we can't change them to provide
> exactly what io_uring wants because IOCB_NOWAIT == RWF_NOWAIT
> semantics.
> 
> Cheers,
> 
> Dave.
> 
> (*) Yes, yes, I know that if you have a really fast storage the IO
> might complete before submission has finished, but that's just the
> final completion is done by the submitter and so #1 is actually
> being followed in this case. i.e. IO is fully submitted and
> completed.
> 

-- 
Thanks,
Jeffle


* Re: [PATCH v3 RESEND] iomap: set REQ_NOWAIT according to IOCB_NOWAIT in Direct IO
  2020-12-11  2:50           ` JeffleXu
@ 2020-12-14  2:56             ` Dave Chinner
  2020-12-15  9:43               ` JeffleXu
  0 siblings, 1 reply; 15+ messages in thread
From: Dave Chinner @ 2020-12-14  2:56 UTC (permalink / raw)
  To: JeffleXu
  Cc: Hao Xu, Alexander Viro, Christoph Hellwig, Darrick J. Wong,
	linux-fsdevel, Konstantin Khlebnikov, Jens Axboe, io-uring,
	Joseph Qi

On Fri, Dec 11, 2020 at 10:50:44AM +0800, JeffleXu wrote:
> Excellent explanation! Thanks a lot.
> 
> Still some questions below.
> 
> On 12/10/20 1:18 PM, Dave Chinner wrote:
> > On Thu, Dec 10, 2020 at 09:55:32AM +0800, JeffleXu wrote:
> >> Sorry I'm still a little confused.
> >>
> >>
> >> On 12/10/20 5:23 AM, Dave Chinner wrote:
> >>> On Tue, Dec 08, 2020 at 01:46:47PM +0800, JeffleXu wrote:
> >>>>
> >>>>
> >>>> On 12/7/20 10:21 AM, Dave Chinner wrote:
> >>>>> On Fri, Dec 04, 2020 at 05:44:56PM +0800, Hao Xu wrote:
> >>>>>> Currently, IOCB_NOWAIT is ignored in Direct IO, REQ_NOWAIT is only set
> >>>>>> when IOCB_HIPRI is set. But REQ_NOWAIT should be set as well when
> >>>>>> IOCB_NOWAIT is set.
> >>>>>>
> >>>>>> Suggested-by: Jeffle Xu <jefflexu@linux.alibaba.com>
> >>>>>> Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
> >>>>>> Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
> >>>>>> ---
> >>>>>>
> >>>>>> Hi all,
> >>>>>> I tested fio io_uring direct read for a file on ext4 filesystem on a
> >>>>>> nvme ssd. I found that IOCB_NOWAIT is ignored in iomap layer, which
> >>>>>> means REQ_NOWAIT is not set in bio->bi_opf.
> >>>>>
> >>>>> What iomap is doing is correct behaviour. IOCB_NOWAIT applies to the
> >>>>> filesystem behaviour, not the block device.
> >>>>>
> >>>>> REQ_NOWAIT can result in partial IO failures because the error is
> >>>>> only reported to the iomap layer via IO completions. Hence we can
> >>>>> split a DIO into multiple bios and have random bios in that IO fail
> >>>>> with EAGAIN because REQ_NOWAIT is set. This error will
> >>>>> get reported to the submitter via completion, and it will override
> >>>>> any of the partial IOs that actually completed.
> >>>>>
> >>>>> Hence, like the recently reported multi-mapping IOCB_NOWAIT bug
> >>>>> reported by Jens and fixed in commit 883a790a8440 ("xfs: don't allow
> >>>>> NOWAIT DIO across extent boundaries") we'll get silent partial
> >>>>> writes occurring because the second submitted bio in an IO can
> >>>>> trigger EAGAIN errors with partial IO completion having already
> >>>>> occurred.
> >>>>>
> >>
> >>>>> Further, we don't allow partial IO completion for DIO on XFS at all.
> >>>>> DIO must be completely submitted and completed or return an error
> >>>>> without having issued any IO at all.  Hence using REQ_NOWAIT for
> >>>>> DIO bios is incorrect and not desirable.
> >>
> >>
> >> The current block layer implementation causes that, as long as one split
> >> bio fails, then the whole DIO fails, in which case several split bios
> >> maybe have succeeded and the content has been written to the disk. This
> >> is obviously what you called "partial IO completion".
> >>
> >> I'm just concerned on how do you achieve that "DIO must return an error
> >> without having issued any IO at all". Do you have some method of
> >> reverting the content has already been written into the disk when a
> >> partial error happened?
> > 
> > I think you've misunderstood what I was saying. I did not say
> > "DIO must return an error without having issued any IO at all".
> > There are two parts to my statement, and you just smashed part of
> > the first statement into part of the second statement and came up
> > something I didn't actually say.
> > 
> > The first statement is:
> > 
> > 	1. "DIO must be fully submitted and completed ...."
> > 
> > That is, if we need to break an IO up into multiple parts, the
> > entire IO must be submitted and completed as a whole. We do not
> > allow partial submission or completion of the IO at all because we
> > cannot accurately report what parts of a multi-bio DIO that failed
> > through the completion interface. IOWs, if any of the IOs after the
> > first one fail submission, then we must complete all the IOs that
> > have already been submitted before we can report the failure that
> > occurred during the IO.
> > 
> 
> 1. Actually I'm quite not clear on what you called "partial submission
> or completion". Even when REQ_NOWAIT is set for all split bios of one
> DIO, then all these split bios are **submitted** to block layer through
> submit_bio(). Even when one split bio after the first one failed
> **inside** submit_bio() because of REQ_NOWAIT, submit_bio() only returns
> BLK_QC_T_NONE, and the DIO layer (such as __blkdev_direct_IO()) will
> still call submit_bio() for the remaining split bios. And then only when
> all split bios complete, will the whole kiocb complete.

Yes, so what you have there is complete submission and completion.
i.e. what I described as #1.

> So if you define "submission" as submitting to hardware disk (such as
> nvme device driver), then it is indeed **partial submission** when
> REQ_NOWAIT set. But if the definition of "submission" is actually
> "submitting to block layer by calling submit_bio()", then all split bios
> of one DIO are indeed submitted to block layer, even when one split bio
> gets BLK_STS_AGAIN because of REQ_NOWIAT.

The latter is the definition we use in iomap, because at this
layer we are the ones submitting the bios and defining the bounds of
the IO. What happens in the lower layers does not concern us here.

Partial submission can occur at the iomap layer if IOCB_NOWAIT is
set - that was what 883a790a8440 fixed. i.e. we do

	loop for all mapped extents {
		loop for mapped range {
			submit_bio(subrange)
		}
	}

So if we have an extent range like so:

	start					end
	+-----------+---------+------+-----------+
	  ext 1        ext 2    ext 3    ext 4

We actually have to loop above 4 times to map the 4 extents that
span the user IO range. That means we need to submit at least 4
independent bios to run this IO.

So if we submit all the bios in a first mapped range (ext 1), then
go back to the filesystem to get the map for the next range to
submit and it returns -EAGAIN, we break out of the outer loop
without having completely submitted bios for the range spanning ext
2, 3 or 4. That's partial IO submission as the *iomap layer* defines
it.

And the "silent" part of the "partial write" it came from the fact
this propagated -EAGAIN to the user despite the fact that some IO
was actually successfully committed and completed. IOws, we may have
corrupted data on disk by telling the user nothing was done with
-EAGAIN rather than failing the partial IO with -EIO.

This is the problem commit 883a790a8440 fixed.
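
Roughly, what the fix does is reject the IO up front (illustrative C
only, not the actual XFS code): under IOCB_NOWAIT, if the first mapping
does not cover the whole requested range, return -EAGAIN before
anything has been submitted, so the partially-submitted state above can
never be reached:

#include <errno.h>
#include <stdio.h>

/* Model of the 883a790a8440 check: a NOWAIT DIO must fit inside the
 * extent returned for its starting offset. */
struct extent {
	long start;
	long len;
};

static int nowait_dio_check(const struct extent *map, long pos, long count)
{
	if (pos + count > map->start + map->len)
		return -EAGAIN;		/* caller can retry without NOWAIT */
	return 0;
}

int main(void)
{
	struct extent ext1 = { .start = 0, .len = 8192 };

	printf("IO within one extent: %d\n", nowait_dio_check(&ext1, 0, 4096));
	printf("IO crossing extents:  %d\n", nowait_dio_check(&ext1, 4096, 8192));
	return 0;
}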

> 2. One DIO could be split into multiple bios in DIO layer. Similarly one
> big bio could be split into multiple bios in block layer. In the
> situation when all split bios have already been submitted to block
> layer, since the IO completion interface could return only one error
> code, the whole DIO could fail as long as one split bio fails, while
> several other split bios could have already succeeded and the
> corresponding disk sectors have already been overwritten.

Yup. That can happen. That's a data corruption for which prevention
is the responsibility of the layer doing the splitting and
recombining.  The iomap layer knows nothing of this - this situation
needs to result in the entire bio that was split being marked with
an EIO error because we do not know what parts of the bio made it to
disk or not. IOWs, the result of the IO is undefined, and so we
need to return EIO to the user.

This "data is inconsistent until all sub-devices complete their IO
successfully" situation is also known as the "RAID 5 write hole".
Crash while these IOs are in flight, and some might hit the disk and
some might not. When the system comes back up, you've got no idea
which part of that IO was completed or not, so the data in the
entire stripe is corrupt.

> I'm not sure
> if this is what you called "silent partial writes", and if this is the
> core reason for commit 883a790a8440 ("xfs: don't allow NOWAIT DIO across
> extent boundaries") you mentioned in previous mail.

No, it's not the same thing. This "split IO failed" should never
result in a silent partial write - it should always result in EIO.

> If this is the case, then it seems that the IO completion interface
> should be blamed for this issue. Besides REQ_NOWIAT, there may be more
> situations that will cause "silent partial writes".

Yes, there are many potential causes of this behaviour. They are all
considered to be a data corruption, and so in all cases they should
be returning EIO as the completion status to the submitter of the
bio.

> As long as one split bios could fail (besides EAGAIN, maybe EOPNOTSUPP,
> if possible), while other split bios may still succeeds, then the error
> of one split bio will still override the completion status of the whole
> DIO.

If you get EOPNOTSUPP from one device in a bio split across many
devices, then you have a bug in the layer that is splitting the IO.
All the sub-devices need to support the operation, or the top level
device cannot safely support that operation.

And if one device didn't do the operation, then we have inconsistent
data on disk now and so EIO is the result that should be returned to
the user.

> In this case "silent partial writes" is still possible? In my
> understanding, passing REQ_NOWAIT flags from IOCB_NOWAIT in DIO layer
> only makes the problem more likely to happen, but it's not the only
> source of this problem?

Passing REQ_NOWAIT from the iomap dio layer means that we will see
data corruptions in stable storage simply due to high IO load, rather
than the much, much rarer situations of software bugs or actual
hardware errors.  IOWs, REQ_NOWAIT essentially guarantees that data
corruption in stable storage will occur on a perfectly working storage
stack, and there is nothing the application can do to prevent that
occurring.

The worst part of this is that REQ_NOWAIT also requires the
application to detect and correct the corruption caused by the
kernel using REQ_NOWAIT inappropriately. And if the application
crashes before it can correct the data corruption, it's suddenly a
permanent, uncorrectable data corruption.

REQ_NOWAIT is dangerous and can easily cause harm to user data by
aborting parts of submitted IOs. IOWs, it's more like a random IO
error injection mechanism than a useful method of doing safe
non-blocking IO.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v3 RESEND] iomap: set REQ_NOWAIT according to IOCB_NOWAIT in Direct IO
  2020-12-14  2:56             ` Dave Chinner
@ 2020-12-15  9:43               ` JeffleXu
  2021-04-02 14:32                 ` Pavel Begunkov
  0 siblings, 1 reply; 15+ messages in thread
From: JeffleXu @ 2020-12-15  9:43 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Hao Xu, Alexander Viro, Christoph Hellwig, Darrick J. Wong,
	linux-fsdevel, Konstantin Khlebnikov, Jens Axboe, io-uring,
	Joseph Qi

Thanks for your explanation, again.

On 12/14/20 10:56 AM, Dave Chinner wrote:
> On Fri, Dec 11, 2020 at 10:50:44AM +0800, JeffleXu wrote:
>> Excellent explanation! Thanks a lot.
>>
>> Still some questions below.
>>
>> On 12/10/20 1:18 PM, Dave Chinner wrote:
>>> On Thu, Dec 10, 2020 at 09:55:32AM +0800, JeffleXu wrote:
>>>> Sorry I'm still a little confused.
>>>>
>>>>
>>>> On 12/10/20 5:23 AM, Dave Chinner wrote:
>>>>> On Tue, Dec 08, 2020 at 01:46:47PM +0800, JeffleXu wrote:
>>>>>>
>>>>>>
>>>>>> On 12/7/20 10:21 AM, Dave Chinner wrote:
>>>>>>> On Fri, Dec 04, 2020 at 05:44:56PM +0800, Hao Xu wrote:
>>>>>>>> Currently, IOCB_NOWAIT is ignored in Direct IO, REQ_NOWAIT is only set
>>>>>>>> when IOCB_HIPRI is set. But REQ_NOWAIT should be set as well when
>>>>>>>> IOCB_NOWAIT is set.
>>>>>>>>
>>>>>>>> Suggested-by: Jeffle Xu <jefflexu@linux.alibaba.com>
>>>>>>>> Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
>>>>>>>> Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
>>>>>>>> ---
>>>>>>>>
>>>>>>>> Hi all,
>>>>>>>> I tested fio io_uring direct read for a file on ext4 filesystem on a
>>>>>>>> nvme ssd. I found that IOCB_NOWAIT is ignored in iomap layer, which
>>>>>>>> means REQ_NOWAIT is not set in bio->bi_opf.
>>>>>>>
>>>>>>> What iomap is doing is correct behaviour. IOCB_NOWAIT applies to the
>>>>>>> filesystem behaviour, not the block device.
>>>>>>>
>>>>>>> REQ_NOWAIT can result in partial IO failures because the error is
>>>>>>> only reported to the iomap layer via IO completions. Hence we can
>>>>>>> split a DIO into multiple bios and have random bios in that IO fail
>>>>>>> with EAGAIN because REQ_NOWAIT is set. This error will
>>>>>>> get reported to the submitter via completion, and it will override
>>>>>>> any of the partial IOs that actually completed.
>>>>>>>
>>>>>>> Hence, like the recently reported multi-mapping IOCB_NOWAIT bug
>>>>>>> reported by Jens and fixed in commit 883a790a8440 ("xfs: don't allow
>>>>>>> NOWAIT DIO across extent boundaries") we'll get silent partial
>>>>>>> writes occurring because the second submitted bio in an IO can
>>>>>>> trigger EAGAIN errors with partial IO completion having already
>>>>>>> occurred.
>>>>>>>
>>>>
>>>>>>> Further, we don't allow partial IO completion for DIO on XFS at all.
>>>>>>> DIO must be completely submitted and completed or return an error
>>>>>>> without having issued any IO at all.  Hence using REQ_NOWAIT for
>>>>>>> DIO bios is incorrect and not desirable.
>>>>
>>>>
>>>> The current block layer implementation causes that, as long as one split
>>>> bio fails, then the whole DIO fails, in which case several split bios
>>>> maybe have succeeded and the content has been written to the disk. This
>>>> is obviously what you called "partial IO completion".
>>>>
>>>> I'm just concerned on how do you achieve that "DIO must return an error
>>>> without having issued any IO at all". Do you have some method of
>>>> reverting the content has already been written into the disk when a
>>>> partial error happened?
>>>
>>> I think you've misunderstood what I was saying. I did not say
>>> "DIO must return an error without having issued any IO at all".
>>> There are two parts to my statement, and you just smashed part of
>>> the first statement into part of the second statement and came up
>>> something I didn't actually say.
>>>
>>> The first statement is:
>>>
>>> 	1. "DIO must be fully submitted and completed ...."
>>>
>>> That is, if we need to break an IO up into multiple parts, the
>>> entire IO must be submitted and completed as a whole. We do not
>>> allow partial submission or completion of the IO at all because we
>>> cannot accurately report what parts of a multi-bio DIO that failed
>>> through the completion interface. IOWs, if any of the IOs after the
>>> first one fail submission, then we must complete all the IOs that
>>> have already been submitted before we can report the failure that
>>> occurred during the IO.
>>>
>>
>> 1. Actually I'm quite not clear on what you called "partial submission
>> or completion". Even when REQ_NOWAIT is set for all split bios of one
>> DIO, then all these split bios are **submitted** to block layer through
>> submit_bio(). Even when one split bio after the first one failed
>> **inside** submit_bio() because of REQ_NOWAIT, submit_bio() only returns
>> BLK_QC_T_NONE, and the DIO layer (such as __blkdev_direct_IO()) will
>> still call submit_bio() for the remaining split bios. And then only when
>> all split bios complete, will the whole kiocb complete.
> 
> Yes, so what you have there is complete submission and completion.
> i.e. what I described as #1.
> 
>> So if you define "submission" as submitting to hardware disk (such as
>> nvme device driver), then it is indeed **partial submission** when
>> REQ_NOWAIT set. But if the definition of "submission" is actually
>> "submitting to block layer by calling submit_bio()", then all split bios
>> of one DIO are indeed submitted to block layer, even when one split bio
>> gets BLK_STS_AGAIN because of REQ_NOWIAT.
> 
> The latter is the definition we using in iomap, because at this
> layer we are the ones submitting the bios and defining the bounds of
> the IO. What happens in the lower layers does not concern us here.
> 
> Partial submission can occur at the iomap layer if IOCB_NOWAIT is
> set - that was what 883a790a8440 fixed. i.e. we do
> 
> 	loop for all mapped extents {
> 		loop for mapped range {
> 			submit_bio(subrange)
> 		}
> 	}
> 
> So if we have an extent range like so:
> 
> 	start					end
> 	+-----------+---------+------+-----------+
> 	  ext 1        ext 2    ext 3    ext 4
> 
> We actually have to loop above 4 times to map the 4 extents that
> span the user IO range. That means we need to submit at least 4
> independent bios to run this IO.
> 
> So if we submit all the bios in a first mapped range (ext 1), then
> go back to the filesystem to get the map for the next range to
> submit and it returns -EAGAIN, we break out of the outer loop
> without having completely submitted bios for the range spanning ext
> 2, 3 or 4. That's partial IO submission as the *iomap layer* defines
> it.

Now I can understand the context of commit 883a790a8440 and the root
issue here.
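
(A condensed sketch of the submission loop described above - simplified
from the shape of iomap_dio_rw()/iomap_apply() of that era, not the
verbatim code: an -EAGAIN from a later mapping call breaks out after
bios for earlier extents are already in flight, which is the partial
submission that 883a790a8440 forbids for NOWAIT.)

	do {
		/* maps one extent and submits bios for that sub-range */
		ret = iomap_apply(inode, pos, count, flags, ops, dio,
				  iomap_dio_actor);
		if (ret <= 0)
			break;	/* -EAGAIN here: earlier bios already submitted */
		pos += ret;
	} while ((count = iov_iter_count(iter)) > 0);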

> 
> And the "silent" part of the "partial write" it came from the fact
> this propagated -EAGAIN to the user despite the fact that some IO
> was actually successfully committed and completed. IOws, we may have
> corrupted data on disk by telling the user nothing was done with
> -EAGAIN rather than failing the partial IO with -EIO.
>
> This is the problem commit 883a790a8440 fixed.
> 
>> 2. One DIO could be split into multiple bios in DIO layer. Similarly one
>> big bio could be split into multiple bios in block layer. In the
>> situation when all split bios have already been submitted to block
>> layer, since the IO completion interface could return only one error
>> code, the whole DIO could fail as long as one split bio fails, while
>> several other split bios could have already succeeded and the
>> corresponding disk sectors have already been overwritten.
> 
> Yup. That can happen. That's a data corruption for which prevention
> is the responsibility of the layer doing the splitting and
> recombining. 

Unfortunately the block layer doesn't handle this well currently.


> The iomap layer knows nothing of this - this situation
> needs to result in the entire bio that was split being marked with
> an EIO error because we do not know what parts of the bio made it to
> disk or not. IOWs, the result of the IO is undefined, and so we
> need to return EIO to the user.

What's the difference between EIO and EAGAIN here, besides the literal
semantics, once the data corruption has already happened? The upper
layer has to resubmit the IO when EAGAIN is received, and the data
corruption should disappear once the resubmission completes
successfully. But the IO may not be resubmitted when EIO is received,
and the data corruption is just left there.
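
(To make the distinction concrete from the caller's side - a
hypothetical userspace helper, not anything in the kernel tree: EAGAIN
is conventionally "nothing was done, retry without NOWAIT", while EIO
means the on-disk state is unknown and a blind retry cannot be assumed
to repair it.)

	/* needs: #define _GNU_SOURCE, <sys/uio.h>, <errno.h> */
	ssize_t nowait_pwritev(int fd, const struct iovec *iov, int cnt,
			       off_t off)
	{
		ssize_t ret = pwritev2(fd, iov, cnt, off, RWF_NOWAIT);

		if (ret < 0 && errno == EAGAIN)
			/* only safe if EAGAIN really means "nothing written" */
			return pwritev2(fd, iov, cnt, off, 0);
		return ret;	/* EIO etc. are reported to the caller */
	}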



> 
> This "data is inconsistent until all sub-devices complete their IO
> successfully" situation is also known as the "RAID 5 write hole".
> Crash while these IOs are in flight, and some might hit the disk and
> some might not. When the system comes back up, you've got no idea
> which part of that IO was completed or not, so the data in the
> entire stripe is corrupt.
> 
>> I'm not sure
>> if this is what you called "silent partial writes", and if this is the
>> core reason for commit 883a790a8440 ("xfs: don't allow NOWAIT DIO across
>> extent boundaries") you mentioned in previous mail.
> 
> No, it's not the same thing. This "split IO failed" should never
> result in a silent partial write - it should always result in EIO.

I have my doubts about this.

The block layer can split one big bio into multiple bios. Likewise, one
big DIO over a single extent can also be split into multiple bios,
since a single bio can hold at most BIO_MAX_PAGES, i.e. 256, segments
(see the loop in iomap_dio_bio_actor()). If the input iov_iter contains
more than 256 segments, it will be split into multiple bios even though
it may map to a single extent.

And once one of those split bios fails with EAGAIN, the completion
status of the whole IO is EAGAIN, rather than EIO, at least for the
current implementation of the block layer. So shouldn't this also be
called a "silent partial write", since some of the bios may have been
written to the disk successfully while **EAGAIN** is still returned?
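
(Back-of-the-envelope sketch of that split, under the assumption that
BIO_MAX_PAGES is 256 as in this kernel: an iov_iter with more segments
than that needs more than one bio even within a single extent.)

	static unsigned int dio_bios_needed(unsigned int nr_segs)
	{
		/* e.g. 300 segments -> 2 bios, each able to fail on its own */
		return DIV_ROUND_UP(nr_segs, BIO_MAX_PAGES);
	}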


> 
>> If this is the case, then it seems that the IO completion interface
>> should be blamed for this issue. Besides REQ_NOWIAT, there may be more
>> situations that will cause "silent partial writes".
> 
> Yes, there are many potential causes of this behaviour. They are all
> considered to be a data corruption, and so in all cases they should
> be returning EIO as the completion status to the submitter of the
> bio.
>
>> As long as one split bios could fail (besides EAGAIN, maybe EOPNOTSUPP,
>> if possible), while other split bios may still succeeds, then the error
>> of one split bio will still override the completion status of the whole
>> DIO.
> 
> If you get EOPNOTSUPP from on device in bio split across many
> devices, then you have a bug in the layer that is splitting the IO.
> All the sub-devices need to support the operation, or the top level
> device cannot safely support that operation.
> 
> And if one device didn't do the opreation, then we have inconsistent
> data on disk now and so EIO is the result that should be returned to
> the user.

EOPNOTSUPP is just an example here. I admit that NOWAIT may be
problematic here. But what I mean is that, once one split bio fails,
the whole IO fails with that bio's error code, e.g.
EAGAIN/EOPNOTSUPP rather than EIO, while some of the split bios may
have already succeeded. IOWs, the block layer may also trigger a
"silent partial write" even without REQ_NOWAIT, though the possibility
of a "silent partial write" increases a lot when REQ_NOWAIT is applied.


> 
>> In this case "silent partial writes" is still possible? In my
>> understanding, passing REQ_NOWAIT flags from IOCB_NOWAIT in DIO layer
>> only makes the problem more likely to happen, but it's not the only
>> source of this problem?
> 
> Passing REQ_NOWAIT from the iomap dio layer means that we will see
> data corruptions in stable storage simply due to high IO load, rather
> than the much, much rarer situations of software bugs or actual
> hardware errors.  IOWs, REQ_NOWAIT essentially guarantees data
> corruption in stable storage of a perfectly working storage stack
> will occur and there is nothing the application can do to prevent
> that occurring.
> 
> The worst part of this is that REQ_NOWAIT also requires the
> application to detect and correct the corruption caused by the
> kernel using REQ_NOWAIT inappropriately. And if the application
> crashes before it can correct the data corruption, it's suddenly a
> permanent, uncorrectable data corruption.
> 
> REQ_NOWAIT is dangerous and can easily cause harm to user data by
> aborting parts of submitted IOs. IOWs, it's more like a random IO
> error injection mechanism than a useful method of doing safe
> non-blocking IO.
> 
> Cheers,
> 
> Dave.
> 

-- 
Thanks,
Jeffle

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v3 RESEND] iomap: set REQ_NOWAIT according to IOCB_NOWAIT in Direct IO
  2020-12-15  9:43               ` JeffleXu
@ 2021-04-02 14:32                 ` Pavel Begunkov
  2021-04-02 16:26                   ` Jens Axboe
  0 siblings, 1 reply; 15+ messages in thread
From: Pavel Begunkov @ 2021-04-02 14:32 UTC (permalink / raw)
  To: JeffleXu, Dave Chinner
  Cc: Hao Xu, Alexander Viro, Christoph Hellwig, Darrick J. Wong,
	linux-fsdevel, Konstantin Khlebnikov, Jens Axboe, io-uring,
	Joseph Qi

On 15/12/2020 09:43, JeffleXu wrote:
> Thanks for your explanation, again.

Got stale, let's bring it up again.

> 
> On 12/14/20 10:56 AM, Dave Chinner wrote:
>> On Fri, Dec 11, 2020 at 10:50:44AM +0800, JeffleXu wrote:
>>> Excellent explanation! Thanks a lot.
>>>
>>> Still some questions below.
>>>
>>> On 12/10/20 1:18 PM, Dave Chinner wrote:
>>>> On Thu, Dec 10, 2020 at 09:55:32AM +0800, JeffleXu wrote:
>>>>> Sorry I'm still a little confused.
>>>>>
>>>>>
>>>>> On 12/10/20 5:23 AM, Dave Chinner wrote:
>>>>>> On Tue, Dec 08, 2020 at 01:46:47PM +0800, JeffleXu wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 12/7/20 10:21 AM, Dave Chinner wrote:
>>>>>>>> On Fri, Dec 04, 2020 at 05:44:56PM +0800, Hao Xu wrote:
>>>>>>>>> Currently, IOCB_NOWAIT is ignored in Direct IO, REQ_NOWAIT is only set
>>>>>>>>> when IOCB_HIPRI is set. But REQ_NOWAIT should be set as well when
>>>>>>>>> IOCB_NOWAIT is set.
>>>>>>>>>
>>>>>>>>> Suggested-by: Jeffle Xu <jefflexu@linux.alibaba.com>
>>>>>>>>> Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
>>>>>>>>> Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
>>>>>>>>> ---
>>>>>>>>>
>>>>>>>>> Hi all,
>>>>>>>>> I tested fio io_uring direct read for a file on ext4 filesystem on a
>>>>>>>>> nvme ssd. I found that IOCB_NOWAIT is ignored in iomap layer, which
>>>>>>>>> means REQ_NOWAIT is not set in bio->bi_opf.
>>>>>>>>
>>>>>>>> What iomap is doing is correct behaviour. IOCB_NOWAIT applies to the
>>>>>>>> filesystem behaviour, not the block device.
>>>>>>>>
>>>>>>>> REQ_NOWAIT can result in partial IO failures because the error is
>>>>>>>> only reported to the iomap layer via IO completions. Hence we can
>>>>>>>> split a DIO into multiple bios and have random bios in that IO fail
>>>>>>>> with EAGAIN because REQ_NOWAIT is set. This error will
>>>>>>>> get reported to the submitter via completion, and it will override
>>>>>>>> any of the partial IOs that actually completed.
>>>>>>>>
>>>>>>>> Hence, like the recently reported multi-mapping IOCB_NOWAIT bug
>>>>>>>> reported by Jens and fixed in commit 883a790a8440 ("xfs: don't allow
>>>>>>>> NOWAIT DIO across extent boundaries") we'll get silent partial
>>>>>>>> writes occurring because the second submitted bio in an IO can
>>>>>>>> trigger EAGAIN errors with partial IO completion having already
>>>>>>>> occurred.
>>>>>>>>
>>>>>
>>>>>>>> Further, we don't allow partial IO completion for DIO on XFS at all.
>>>>>>>> DIO must be completely submitted and completed or return an error
>>>>>>>> without having issued any IO at all.  Hence using REQ_NOWAIT for
>>>>>>>> DIO bios is incorrect and not desirable.
>>>>>
>>>>>
>>>>> The current block layer implementation causes that, as long as one split
>>>>> bio fails, then the whole DIO fails, in which case several split bios
>>>>> maybe have succeeded and the content has been written to the disk. This
>>>>> is obviously what you called "partial IO completion".
>>>>>
>>>>> I'm just concerned on how do you achieve that "DIO must return an error
>>>>> without having issued any IO at all". Do you have some method of
>>>>> reverting the content has already been written into the disk when a
>>>>> partial error happened?
>>>>
>>>> I think you've misunderstood what I was saying. I did not say
>>>> "DIO must return an error without having issued any IO at all".
>>>> There are two parts to my statement, and you just smashed part of
>>>> the first statement into part of the second statement and came up
>>>> something I didn't actually say.
>>>>
>>>> The first statement is:
>>>>
>>>> 	1. "DIO must be fully submitted and completed ...."
>>>>
>>>> That is, if we need to break an IO up into multiple parts, the
>>>> entire IO must be submitted and completed as a whole. We do not
>>>> allow partial submission or completion of the IO at all because we
>>>> cannot accurately report what parts of a multi-bio DIO that failed
>>>> through the completion interface. IOWs, if any of the IOs after the
>>>> first one fail submission, then we must complete all the IOs that
>>>> have already been submitted before we can report the failure that
>>>> occurred during the IO.
>>>>
>>>
>>> 1. Actually I'm quite not clear on what you called "partial submission
>>> or completion". Even when REQ_NOWAIT is set for all split bios of one
>>> DIO, then all these split bios are **submitted** to block layer through
>>> submit_bio(). Even when one split bio after the first one failed
>>> **inside** submit_bio() because of REQ_NOWAIT, submit_bio() only returns
>>> BLK_QC_T_NONE, and the DIO layer (such as __blkdev_direct_IO()) will
>>> still call submit_bio() for the remaining split bios. And then only when
>>> all split bios complete, will the whole kiocb complete.
>>
>> Yes, so what you have there is complete submission and completion.
>> i.e. what I described as #1.
>>
>>> So if you define "submission" as submitting to hardware disk (such as
>>> nvme device driver), then it is indeed **partial submission** when
>>> REQ_NOWAIT set. But if the definition of "submission" is actually
>>> "submitting to block layer by calling submit_bio()", then all split bios
>>> of one DIO are indeed submitted to block layer, even when one split bio
>>> gets BLK_STS_AGAIN because of REQ_NOWIAT.
>>
>> The latter is the definition we using in iomap, because at this
>> layer we are the ones submitting the bios and defining the bounds of
>> the IO. What happens in the lower layers does not concern us here.
>>
>> Partial submission can occur at the iomap layer if IOCB_NOWAIT is
>> set - that was what 883a790a8440 fixed. i.e. we do
>>
>> 	loop for all mapped extents {
>> 		loop for mapped range {
>> 			submit_bio(subrange)
>> 		}
>> 	}
>>
>> So if we have an extent range like so:
>>
>> 	start					end
>> 	+-----------+---------+------+-----------+
>> 	  ext 1        ext 2    ext 3    ext 4
>>
>> We actually have to loop above 4 times to map the 4 extents that
>> span the user IO range. That means we need to submit at least 4
>> independent bios to run this IO.
>>
>> So if we submit all the bios in a first mapped range (ext 1), then
>> go back to the filesystem to get the map for the next range to
>> submit and it returns -EAGAIN, we break out of the outer loop
>> without having completely submitted bios for the range spanning ext
>> 2, 3 or 4. That's partial IO submission as the *iomap layer* defines
>> it.
> 
> Now I can understand the context of commit 883a790a8440 and the root
> issue here.
> 
>>
>> And the "silent" part of the "partial write" it came from the fact
>> this propagated -EAGAIN to the user despite the fact that some IO
>> was actually successfully committed and completed. IOws, we may have
>> corrupted data on disk by telling the user nothing was done with
>> -EAGAIN rather than failing the partial IO with -EIO.>
>> This is the problem commit 883a790a8440 fixed.
>>
>>> 2. One DIO could be split into multiple bios in DIO layer. Similarly one
>>> big bio could be split into multiple bios in block layer. In the
>>> situation when all split bios have already been submitted to block
>>> layer, since the IO completion interface could return only one error
>>> code, the whole DIO could fail as long as one split bio fails, while
>>> several other split bios could have already succeeded and the
>>> corresponding disk sectors have already been overwritten.
>>
>> Yup. That can happen. That's a data corruption for which prevention
>> is the responsibility of the layer doing the splitting and
>> recombining. 
> 
> Unfortunately the block layer doesn't handle this well currently.
> 
> 
>> The iomap layer knows nothing of this - this situation
>> needs to result in the entire bio that was split being marked with
>> an EIO error because we do not know what parts of the bio made it to
>> disk or not. IOWs, the result of the IO is undefined, and so we
>> need to return EIO to the user.
> 
> What's the difference of EIO and EAGAIN here, besides the literal
> semantics, when the data corruption has already happened? The upper user
> has to resubmit the bio when EAGAIN is received, and the data corruption
> should disappear once the resubmission completes successfully. But the
> bio may not be resubmitted when EIO is received, and the data corruption
> is just left there.
> 
> 
> 
>>
>> This "data is inconsistent until all sub-devices complete their IO
>> successfully" situation is also known as the "RAID 5 write hole".
>> Crash while these IOs are in flight, and some might hit the disk and
>> some might not. When the system comes back up, you've got no idea
>> which part of that IO was completed or not, so the data in the
>> entire stripe is corrupt.
>>
>>> I'm not sure
>>> if this is what you called "silent partial writes", and if this is the
>>> core reason for commit 883a790a8440 ("xfs: don't allow NOWAIT DIO across
>>> extent boundaries") you mentioned in previous mail.
>>
>> No, it's not the same thing. This "split IO failed" should never
>> result in a silent partial write - it should always result in EIO.
> 
> I'm doubted about this.
> 
> The block layer could split one big bio into multiple split bios. Also
> one big DIO from one extent could also be split into multiple split
> bios, since one single bio could maintain at most BIO_MAX_PAGES i.e. 256
> segments. (see the loop in iomap_dio_bio_actor()) If the input iov_iter
> contains more than 256 segments, then it will be split into multiple
> split bios, though this iov_iter may be mapped to one single extent.
> 
> And once one split bio (among all these split bios) failed with EAGAIN,
> the completion status of the whole IO is EAGAIN, rather than EIO, at
> least for the current implementation of block layer. So this should also
> be called "silent partial write", since partial bios may have been
> written to the disk successfully, while still **EAGAIN** returned in
> this situation?
> 
> 
>>
>>> If this is the case, then it seems that the IO completion interface
>>> should be blamed for this issue. Besides REQ_NOWIAT, there may be more
>>> situations that will cause "silent partial writes".
>>
>> Yes, there are many potential causes of this behaviour. They are all
>> considered to be a data corruption, and so in all cases they should
>> be returning EIO as the completion status to the submitter of the
>> bio.
>>>> As long as one split bios could fail (besides EAGAIN, maybe EOPNOTSUPP,
>>> if possible), while other split bios may still succeeds, then the error
>>> of one split bio will still override the completion status of the whole
>>> DIO.
>>
>> If you get EOPNOTSUPP from on device in bio split across many
>> devices, then you have a bug in the layer that is splitting the IO.
>> All the sub-devices need to support the operation, or the top level
>> device cannot safely support that operation.
>>
>> And if one device didn't do the opreation, then we have inconsistent
>> data on disk now and so EIO is the result that should be returned to
>> the user.
> 
> EOPNOTSUPP is just an example here. I admit that NOWAIT may be
> problematic here. But what I mean here is that, once one split bio
> fails, the whole IO fails with the same error code e.g.
> EAGIAN/EOPNOTSUPP rather than EIO, while partial split bios may have
> already succeeded. IOWs, the block layer may also trigger "silent
> partial write" even without REQ_NOWAIT, though the possibility of
> "silent partial write" may increase a lot when REQ_NOWAIT applied.
> 
> 
>>
>>> In this case "silent partial writes" is still possible? In my
>>> understanding, passing REQ_NOWAIT flags from IOCB_NOWAIT in DIO layer
>>> only makes the problem more likely to happen, but it's not the only
>>> source of this problem?
>>
>> Passing REQ_NOWAIT from the iomap dio layer means that we will see
>> data corruptions in stable storage simply due to high IO load, rather
>> than the much, much rarer situations of software bugs or actual
>> hardware errors.  IOWs, REQ_NOWAIT essentially guarantees data
>> corruption in stable storage of a perfectly working storage stack
>> will occur and there is nothing the application can do to prevent
>> that occurring.
>>
>> The worst part of this is that REQ_NOWAIT also requires the
>> application to detect and correct the corruption caused by the
>> kernel using REQ_NOWAIT inappropriately. And if the application
>> crashes before it can correct the data corruption, it's suddenly a
>> permanent, uncorrectable data corruption.
>>
>> REQ_NOWAIT is dangerous and can easily cause harm to user data by
>> aborting parts of submitted IOs. IOWs, it's more like a random IO
>> error injection mechanism than a useful method of doing safe
>> non-blocking IO.
>>
>> Cheers,
>>
>> Dave.
>>
> 

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v3 RESEND] iomap: set REQ_NOWAIT according to IOCB_NOWAIT in Direct IO
  2021-04-02 14:32                 ` Pavel Begunkov
@ 2021-04-02 16:26                   ` Jens Axboe
  0 siblings, 0 replies; 15+ messages in thread
From: Jens Axboe @ 2021-04-02 16:26 UTC (permalink / raw)
  To: Pavel Begunkov, JeffleXu, Dave Chinner
  Cc: Hao Xu, Alexander Viro, Christoph Hellwig, Darrick J. Wong,
	linux-fsdevel, Konstantin Khlebnikov, io-uring, Joseph Qi

On 4/2/21 8:32 AM, Pavel Begunkov wrote:
> On 15/12/2020 09:43, JeffleXu wrote:
>> Thanks for your explanation, again.
> 
> Got stale, let's bring it up again.

How about something like this - check upfront if we're going to be
using multiple bios, and return -EAGAIN if NOWAIT is set and that is
the case. That avoids the partial problem, and still retains (what
I consider) proper NOWAIT behavior for O_DIRECT with IOCB_NOWAIT
set.

It's also worth noting that this condition exists already for
polled IO. If the bio is marked as polled, then we implicitly
set NOWAIT as well, as there's no way to support polled IO with
sleeping request allocations. Hence it's worth considering this
a fix for that case, too.


diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index e2c4991833b8..6f932fe99440 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -66,6 +66,8 @@ static void iomap_dio_submit_bio(struct iomap_dio *dio, struct iomap *iomap,
 
 	if (dio->iocb->ki_flags & IOCB_HIPRI)
 		bio_set_polled(bio, dio->iocb);
+	if (dio->iocb->ki_flags & IOCB_NOWAIT)
+		bio->bi_opf |= REQ_NOWAIT;
 
 	dio->submit.last_queue = bdev_get_queue(iomap->bdev);
 	if (dio->dops && dio->dops->submit_io)
@@ -236,6 +238,7 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
 	unsigned int blkbits = blksize_bits(bdev_logical_block_size(iomap->bdev));
 	unsigned int fs_block_size = i_blocksize(inode), pad;
 	unsigned int align = iov_iter_alignment(dio->submit.iter);
+	bool nowait = dio->iocb->ki_flags & (IOCB_HIPRI | IOCB_NOWAIT);
 	unsigned int bio_opf;
 	struct bio *bio;
 	bool need_zeroout = false;
@@ -296,7 +299,17 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
 	 */
 	bio_opf = iomap_dio_bio_opflags(dio, iomap, use_fua);
 
-	nr_pages = bio_iov_vecs_to_alloc(dio->submit.iter, BIO_MAX_PAGES);
+	nr_pages = bio_iov_vecs_to_alloc(dio->submit.iter, INT_MAX);
+
+	/* Can't handle IOCB_NOWAIT for split bios */
+	if (nr_pages > BIO_MAX_PAGES) {
+		if (nowait) {
+			ret = -EAGAIN;
+			goto out;
+		}
+		nr_pages = BIO_MAX_PAGES;
+	}
+
 	do {
 		size_t n;
 		if (dio->error) {
@@ -326,6 +339,19 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
 			goto zero_tail;
 		}
 
+		/*
+		 * If there are leftover pages, bail if nowait is set to avoid
+		 * multiple bios and potentially having one of them -EAGAIN
+		 * with the other succeeding.
+		 */
+		nr_pages = bio_iov_vecs_to_alloc(dio->submit.iter,
+						 BIO_MAX_PAGES);
+		if (nr_pages && nowait) {
+			ret = -EAGAIN;
+			bio_put(bio);
+			goto out;
+		}
+
 		n = bio->bi_iter.bi_size;
 		if (dio->flags & IOMAP_DIO_WRITE) {
 			task_io_account_write(n);
@@ -337,8 +363,6 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
 		dio->size += n;
 		copied += n;
 
-		nr_pages = bio_iov_vecs_to_alloc(dio->submit.iter,
-						 BIO_MAX_PAGES);
 		iomap_dio_submit_bio(dio, iomap, bio, pos);
 		pos += n;
 	} while (nr_pages);

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2021-04-02 16:26 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-12-04  9:44 [PATCH v3 RESEND] iomap: set REQ_NOWAIT according to IOCB_NOWAIT in Direct IO Hao Xu
2020-12-04 11:44 ` Pavel Begunkov
2020-12-07  2:21 ` Dave Chinner
2020-12-07 23:40   ` Jens Axboe
2020-12-09 21:15     ` Dave Chinner
2020-12-10  2:33       ` JeffleXu
2020-12-08  5:46   ` JeffleXu
2020-12-09 21:23     ` Dave Chinner
2020-12-10  1:55       ` JeffleXu
2020-12-10  5:18         ` Dave Chinner
2020-12-11  2:50           ` JeffleXu
2020-12-14  2:56             ` Dave Chinner
2020-12-15  9:43               ` JeffleXu
2021-04-02 14:32                 ` Pavel Begunkov
2021-04-02 16:26                   ` Jens Axboe

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.