* [PATCH V2] block: ignore RWF_HIPRI hint for sync dio
@ 2022-04-20 14:31 Ming Lei
  2022-04-21  7:02 ` Christoph Hellwig
                   ` (4 more replies)
  0 siblings, 5 replies; 12+ messages in thread
From: Ming Lei @ 2022-04-20 14:31 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Ming Lei, linux-mm, linux-xfs,
	Changhui Zhong

So far a bio is marked as REQ_POLLED if RWF_HIPRI/IOCB_HIPRI is passed
from the userspace sync io interface, and the block layer then tries to
poll until the bio is completed. But the current implementation calls
blk_io_schedule() if bio_poll() returns 0, which easily leads to an io
hang or timeout.

However, it seems no one has reported this kind of issue, even though it
should have been triggered by a normal io poll sanity test or by blktests
block/007, as observed by Changhui. That means it is very likely that no
one uses this path, or that no one cares about it.

Also, now that io_uring exists, io polling for sync dio has become a
legacy interface.

So ignore the RWF_HIPRI hint for sync dio.
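
For reference, the "userspace sync io interface" here means e.g.
preadv2()/pwritev2() with RWF_HIPRI on an O_DIRECT fd. A minimal,
hypothetical caller sketch (device path and sizes are placeholders,
not part of this patch):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void)
{
	struct iovec iov;
	void *buf;
	int fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);

	if (fd < 0 || posix_memalign(&buf, 4096, 4096))
		return 1;

	iov.iov_base = buf;
	iov.iov_len = 4096;

	/* RWF_HIPRI asks the kernel to poll for completion */
	if (preadv2(fd, &iov, 1, 0, RWF_HIPRI) < 0)
		perror("preadv2");

	free(buf);
	close(fd);
	return 0;
}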

CC: linux-mm@kvack.org
Cc: linux-xfs@vger.kernel.org
Reported-by: Changhui Zhong <czhong@redhat.com>
Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
V2:
	- avoid breaking io_uring async polling, as pointed out by Christoph

 block/fops.c         | 22 +---------------------
 fs/iomap/direct-io.c |  7 +++----
 mm/page_io.c         |  4 +---
 3 files changed, 5 insertions(+), 28 deletions(-)

diff --git a/block/fops.c b/block/fops.c
index e3643362c244..b9b83030e0df 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -44,14 +44,6 @@ static unsigned int dio_bio_write_op(struct kiocb *iocb)
 
 #define DIO_INLINE_BIO_VECS 4
 
-static void blkdev_bio_end_io_simple(struct bio *bio)
-{
-	struct task_struct *waiter = bio->bi_private;
-
-	WRITE_ONCE(bio->bi_private, NULL);
-	blk_wake_io_task(waiter);
-}
-
 static ssize_t __blkdev_direct_IO_simple(struct kiocb *iocb,
 		struct iov_iter *iter, unsigned int nr_pages)
 {
@@ -83,8 +75,6 @@ static ssize_t __blkdev_direct_IO_simple(struct kiocb *iocb,
 		bio_init(&bio, bdev, vecs, nr_pages, dio_bio_write_op(iocb));
 	}
 	bio.bi_iter.bi_sector = pos >> SECTOR_SHIFT;
-	bio.bi_private = current;
-	bio.bi_end_io = blkdev_bio_end_io_simple;
 	bio.bi_ioprio = iocb->ki_ioprio;
 
 	ret = bio_iov_iter_get_pages(&bio, iter);
@@ -97,18 +87,8 @@ static ssize_t __blkdev_direct_IO_simple(struct kiocb *iocb,
 
 	if (iocb->ki_flags & IOCB_NOWAIT)
 		bio.bi_opf |= REQ_NOWAIT;
-	if (iocb->ki_flags & IOCB_HIPRI)
-		bio_set_polled(&bio, iocb);
 
-	submit_bio(&bio);
-	for (;;) {
-		set_current_state(TASK_UNINTERRUPTIBLE);
-		if (!READ_ONCE(bio.bi_private))
-			break;
-		if (!(iocb->ki_flags & IOCB_HIPRI) || !bio_poll(&bio, NULL, 0))
-			blk_io_schedule();
-	}
-	__set_current_state(TASK_RUNNING);
+	submit_bio_wait(&bio);
 
 	bio_release_pages(&bio, should_dirty);
 	if (unlikely(bio.bi_status))
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 62da020d02a1..80f9b047aa1b 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -56,7 +56,8 @@ static void iomap_dio_submit_bio(const struct iomap_iter *iter,
 {
 	atomic_inc(&dio->ref);
 
-	if (dio->iocb->ki_flags & IOCB_HIPRI) {
+	/* Sync dio can't be polled reliably */
+	if ((dio->iocb->ki_flags & IOCB_HIPRI) && !is_sync_kiocb(dio->iocb)) {
 		bio_set_polled(bio, dio->iocb);
 		dio->submit.poll_bio = bio;
 	}
@@ -653,9 +654,7 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 			if (!READ_ONCE(dio->submit.waiter))
 				break;
 
-			if (!dio->submit.poll_bio ||
-			    !bio_poll(dio->submit.poll_bio, NULL, 0))
-				blk_io_schedule();
+			blk_io_schedule();
 		}
 		__set_current_state(TASK_RUNNING);
 	}
diff --git a/mm/page_io.c b/mm/page_io.c
index 89fbf3cae30f..3fbdab6a940e 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -360,7 +360,6 @@ int swap_readpage(struct page *page, bool synchronous)
 	 * attempt to access it in the page fault retry time check.
 	 */
 	if (synchronous) {
-		bio->bi_opf |= REQ_POLLED;
 		get_task_struct(current);
 		bio->bi_private = current;
 	}
@@ -372,8 +371,7 @@ int swap_readpage(struct page *page, bool synchronous)
 		if (!READ_ONCE(bio->bi_private))
 			break;
 
-		if (!bio_poll(bio, NULL, 0))
-			blk_io_schedule();
+		blk_io_schedule();
 	}
 	__set_current_state(TASK_RUNNING);
 	bio_put(bio);
-- 
2.31.1



* Re: [PATCH V2] block: ignore RWF_HIPRI hint for sync dio
  2022-04-20 14:31 [PATCH V2] block: ignore RWF_HIPRI hint for sync dio Ming Lei
@ 2022-04-21  7:02 ` Christoph Hellwig
  2022-04-21 11:48 ` Changhui Zhong
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 12+ messages in thread
From: Christoph Hellwig @ 2022-04-21  7:02 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Christoph Hellwig, linux-mm, linux-xfs,
	Changhui Zhong

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH V2] block: ignore RWF_HIPRI hint for sync dio
  2022-04-20 14:31 [PATCH V2] block: ignore RWF_HIPRI hint for sync dio Ming Lei
  2022-04-21  7:02 ` Christoph Hellwig
@ 2022-04-21 11:48 ` Changhui Zhong
  2022-05-02 16:01 ` Ming Lei
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 12+ messages in thread
From: Changhui Zhong @ 2022-04-21 11:48 UTC (permalink / raw)
  To: Ming Lei; +Cc: Jens Axboe, linux-block, Christoph Hellwig, linux-mm, linux-xfs

Test passes with this patch,
thanks Ming and Christoph!

Tested-by: Changhui Zhong <czhong@redhat.com>





* Re: [PATCH V2] block: ignore RWF_HIPRI hint for sync dio
  2022-04-20 14:31 [PATCH V2] block: ignore RWF_HIPRI hint for sync dio Ming Lei
  2022-04-21  7:02 ` Christoph Hellwig
  2022-04-21 11:48 ` Changhui Zhong
@ 2022-05-02 16:01 ` Ming Lei
  2022-05-02 16:07 ` Jens Axboe
  2022-05-23 22:36 ` Keith Busch
  4 siblings, 0 replies; 12+ messages in thread
From: Ming Lei @ 2022-05-02 16:01 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, linux-mm, linux-xfs, Changhui Zhong

Hello Jens,

Can you take this patch if you are fine with it?


Thanks,
Ming



* Re: [PATCH V2] block: ignore RWF_HIPRI hint for sync dio
  2022-04-20 14:31 [PATCH V2] block: ignore RWF_HIPRI hint for sync dio Ming Lei
                   ` (2 preceding siblings ...)
  2022-05-02 16:01 ` Ming Lei
@ 2022-05-02 16:07 ` Jens Axboe
  2022-05-23 22:36 ` Keith Busch
  4 siblings, 0 replies; 12+ messages in thread
From: Jens Axboe @ 2022-05-02 16:07 UTC (permalink / raw)
  To: ming.lei; +Cc: linux-block, Christoph Hellwig, czhong, linux-mm, linux-xfs

On Wed, 20 Apr 2022 22:31:10 +0800, Ming Lei wrote:
> So far a bio is marked as REQ_POLLED if RWF_HIPRI/IOCB_HIPRI is passed
> from the userspace sync io interface, and the block layer then tries to
> poll until the bio is completed. But the current implementation calls
> blk_io_schedule() if bio_poll() returns 0, which easily leads to an io
> hang or timeout.
> 
> However, it seems no one has reported this kind of issue, even though it
> should have been triggered by a normal io poll sanity test or by blktests
> block/007, as observed by Changhui. That means it is very likely that no
> one uses this path, or that no one cares about it.
> 
> [...]

Applied, thanks!

[1/1] block: ignore RWF_HIPRI hint for sync dio
      commit: 9650b453a3d4b1b8ed4ea8bcb9b40109608d1faf

Best regards,
-- 
Jens Axboe




* Re: [PATCH V2] block: ignore RWF_HIPRI hint for sync dio
  2022-04-20 14:31 [PATCH V2] block: ignore RWF_HIPRI hint for sync dio Ming Lei
                   ` (3 preceding siblings ...)
  2022-05-02 16:07 ` Jens Axboe
@ 2022-05-23 22:36 ` Keith Busch
  2022-05-24  0:40   ` Ming Lei
  4 siblings, 1 reply; 12+ messages in thread
From: Keith Busch @ 2022-05-23 22:36 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Christoph Hellwig, linux-mm, linux-xfs,
	Changhui Zhong

On Wed, Apr 20, 2022 at 10:31:10PM +0800, Ming Lei wrote:
> So far a bio is marked as REQ_POLLED if RWF_HIPRI/IOCB_HIPRI is passed
> from the userspace sync io interface, and the block layer then tries to
> poll until the bio is completed. But the current implementation calls
> blk_io_schedule() if bio_poll() returns 0, which easily leads to an io
> hang or timeout.

Wait a second. The task's current state is TASK_RUNNING when bio_poll() returns
zero, so calling blk_io_schedule() isn't supposed to hang.


* Re: [PATCH V2] block: ignore RWF_HIPRI hint for sync dio
  2022-05-23 22:36 ` Keith Busch
@ 2022-05-24  0:40   ` Ming Lei
  2022-05-24  3:02     ` Keith Busch
  0 siblings, 1 reply; 12+ messages in thread
From: Ming Lei @ 2022-05-24  0:40 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jens Axboe, linux-block, Christoph Hellwig, linux-mm, linux-xfs,
	Changhui Zhong

On Mon, May 23, 2022 at 04:36:04PM -0600, Keith Busch wrote:
> On Wed, Apr 20, 2022 at 10:31:10PM +0800, Ming Lei wrote:
> > So far a bio is marked as REQ_POLLED if RWF_HIPRI/IOCB_HIPRI is passed
> > from the userspace sync io interface, and the block layer then tries to
> > poll until the bio is completed. But the current implementation calls
> > blk_io_schedule() if bio_poll() returns 0, which easily leads to an io
> > hang or timeout.
> 
> Wait a second. The task's current state is TASK_RUNNING when bio_poll() returns
> zero, so calling blk_io_schedule() isn't supposed to hang.

void __sched io_schedule(void)
{
        int token;

        token = io_schedule_prepare();
        schedule();
        io_schedule_finish(token);
}

But who can wake up this task after it schedules out? There can't be an
irq handler for a POLLED request.

The hang can be triggered on nvme/qemu reliably:

fio --bs=4k --size=1G --ioengine=pvsync2 --norandommap --hipri=1 --iodepth=64 \
	--slat_percentiles=1 --nowait=0 --filename=/dev/nvme0n1 --direct=1 \
	--runtime=10 --numjobs=1 --rw=rw --name=test --group_reporting

Thanks, 
Ming



* Re: [PATCH V2] block: ignore RWF_HIPRI hint for sync dio
  2022-05-24  0:40   ` Ming Lei
@ 2022-05-24  3:02     ` Keith Busch
  2022-05-24  4:34       ` Keith Busch
  2022-05-24  6:10       ` Ming Lei
  0 siblings, 2 replies; 12+ messages in thread
From: Keith Busch @ 2022-05-24  3:02 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Christoph Hellwig, linux-mm, linux-xfs,
	Changhui Zhong

On Tue, May 24, 2022 at 08:40:46AM +0800, Ming Lei wrote:
> On Mon, May 23, 2022 at 04:36:04PM -0600, Keith Busch wrote:
> > On Wed, Apr 20, 2022 at 10:31:10PM +0800, Ming Lei wrote:
> > > So far a bio is marked as REQ_POLLED if RWF_HIPRI/IOCB_HIPRI is passed
> > > from the userspace sync io interface, and the block layer then tries to
> > > poll until the bio is completed. But the current implementation calls
> > > blk_io_schedule() if bio_poll() returns 0, which easily leads to an io
> > > hang or timeout.
> > 
> > Wait a second. The task's current state is TASK_RUNNING when bio_poll() returns
> > zero, so calling blk_io_schedule() isn't supposed to hang.
> 
> void __sched io_schedule(void)
> {
>         int token;
> 
>         token = io_schedule_prepare();
>         schedule();
>         io_schedule_finish(token);
> }
> 
> But who can wake up this task after it schedules out? There can't be an
> irq handler for a POLLED request.

No one. If everything was working, the task state would be RUNNING, so it is
immediately available to be scheduled back in.
 
> The hang can be triggered on nvme/qemu reliably:

And clearly it's not working, but for a different reason. The polling thread
sees an invalid cookie and fails to set the task back to RUNNING, so yes, it
will sleep with no waker in the current code.
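
For readers following along: bio_poll() bails out early when the cookie
has not been written yet, which is roughly the following check
(paraphrased from the block layer of that era, so treat it as an
approximation rather than a verbatim quote):

	blk_qc_t cookie = READ_ONCE(bio->bi_cookie);

	if (cookie == BLK_QC_T_NONE ||
	    !test_bit(QUEUE_FLAG_POLL, &q->queue_flags))
		return 0;	/* nothing to poll yet, so the caller may sleep */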

We usually expect the cookie to be set inline with submit_bio(), but we're not
guaranteed it won't be punted off to .run_work for a variety of reasons, so
the thread writing the cookie may be racing with the reader.

This was fine before the bio polling support since the cookie was always
returned with submit_bio() before that.

And I would like psync to continue working with polling. As great as io_uring
is, it's just not as efficient @qd1.

Here's a bandaid, though I assume it'll break something...

---
diff --git a/block/blk-mq.c b/block/blk-mq.c
index ed1869a305c4..348136dc7ba9 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1146,8 +1146,6 @@ void blk_mq_start_request(struct request *rq)
 	if (blk_integrity_rq(rq) && req_op(rq) == REQ_OP_WRITE)
 		q->integrity.profile->prepare_fn(rq);
 #endif
-	if (rq->bio && rq->bio->bi_opf & REQ_POLLED)
-	        WRITE_ONCE(rq->bio->bi_cookie, blk_rq_to_qc(rq));
 }
 EXPORT_SYMBOL(blk_mq_start_request);
 
@@ -2464,6 +2462,9 @@ static void blk_mq_bio_to_request(struct request *rq, struct bio *bio,
 	WARN_ON_ONCE(err);
 
 	blk_account_io_start(rq);
+
+	if (rq->bio->bi_opf & REQ_POLLED)
+	        WRITE_ONCE(rq->bio->bi_cookie, blk_rq_to_qc(rq));
 }
 
 static blk_status_t __blk_mq_issue_directly(struct blk_mq_hw_ctx *hctx,
--


* Re: [PATCH V2] block: ignore RWF_HIPRI hint for sync dio
  2022-05-24  3:02     ` Keith Busch
@ 2022-05-24  4:34       ` Keith Busch
  2022-05-24  6:28         ` Ming Lei
  2022-05-24  6:10       ` Ming Lei
  1 sibling, 1 reply; 12+ messages in thread
From: Keith Busch @ 2022-05-24  4:34 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Christoph Hellwig, linux-mm, linux-xfs,
	Changhui Zhong

On Mon, May 23, 2022 at 09:02:39PM -0600, Keith Busch wrote:
> Here's a bandaid, though I assume it'll break something...

On second thought, maybe this is okay! The encoded hctx doesn't change after
this call, which is the only thing polling cares about. The tag portion doesn't
matter.

The only user of the rq tag part of the cookie is hybrid polling, which falls
back to classic polling if the tag isn't valid, so that scenario is already
handled. And hybrid polling breaks down anyway if the queue depth is >1, so
leaving an invalid tag might be a good thing.
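
For context, the cookie encoding being discussed is roughly the following
(a paraphrase of blk_rq_to_qc() from that era; treat the exact constants
as approximate): the upper bits carry the hctx number, which is all
bio_poll() really needs, while the lower bits carry the tag that only
hybrid polling consults.

#define BLK_QC_T_SHIFT		16
#define BLK_QC_T_INTERNAL	(1U << 31)

static inline blk_qc_t blk_rq_to_qc(struct request *rq)
{
	/* upper bits: hctx number; lower bits: driver or internal tag */
	return (rq->mq_hctx->queue_num << BLK_QC_T_SHIFT) |
		(rq->tag != -1 ?
		 rq->tag : (rq->internal_tag | BLK_QC_T_INTERNAL));
}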
 
> ---
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index ed1869a305c4..348136dc7ba9 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1146,8 +1146,6 @@ void blk_mq_start_request(struct request *rq)
>  	if (blk_integrity_rq(rq) && req_op(rq) == REQ_OP_WRITE)
>  		q->integrity.profile->prepare_fn(rq);
>  #endif
> -	if (rq->bio && rq->bio->bi_opf & REQ_POLLED)
> -	        WRITE_ONCE(rq->bio->bi_cookie, blk_rq_to_qc(rq));
>  }
>  EXPORT_SYMBOL(blk_mq_start_request);
>  
> @@ -2464,6 +2462,9 @@ static void blk_mq_bio_to_request(struct request *rq, struct bio *bio,
>  	WARN_ON_ONCE(err);
>  
>  	blk_account_io_start(rq);
> +
> +	if (rq->bio->bi_opf & REQ_POLLED)
> +	        WRITE_ONCE(rq->bio->bi_cookie, blk_rq_to_qc(rq));
>  }
>  
>  static blk_status_t __blk_mq_issue_directly(struct blk_mq_hw_ctx *hctx,
> --


* Re: [PATCH V2] block: ignore RWF_HIPRI hint for sync dio
  2022-05-24  3:02     ` Keith Busch
  2022-05-24  4:34       ` Keith Busch
@ 2022-05-24  6:10       ` Ming Lei
  1 sibling, 0 replies; 12+ messages in thread
From: Ming Lei @ 2022-05-24  6:10 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jens Axboe, linux-block, Christoph Hellwig, linux-mm, linux-xfs,
	Changhui Zhong

On Mon, May 23, 2022 at 09:02:39PM -0600, Keith Busch wrote:
> On Tue, May 24, 2022 at 08:40:46AM +0800, Ming Lei wrote:
> > On Mon, May 23, 2022 at 04:36:04PM -0600, Keith Busch wrote:
> > > On Wed, Apr 20, 2022 at 10:31:10PM +0800, Ming Lei wrote:
> > > > So far a bio is marked as REQ_POLLED if RWF_HIPRI/IOCB_HIPRI is passed
> > > > from the userspace sync io interface, and the block layer then tries to
> > > > poll until the bio is completed. But the current implementation calls
> > > > blk_io_schedule() if bio_poll() returns 0, which easily leads to an io
> > > > hang or timeout.
> > > 
> > > Wait a second. The task's current state is TASK_RUNNING when bio_poll() returns
> > > zero, so calling blk_io_schedule() isn't supposed to hang.
> > 
> > void __sched io_schedule(void)
> > {
> >         int token;
> > 
> >         token = io_schedule_prepare();
> >         schedule();
> >         io_schedule_finish(token);
> > }
> > 
> > But who can wake up this task after it schedules out? There can't be an
> > irq handler for a POLLED request.
> 
> No one. If everything was working, the task state would be RUNNING, so it is
> immediately available to be scheduled back in.
>  
> > The hang can be triggered on nvme/qemu reliably:
> 
> And clearly it's not working, but for a different reason. The polling thread
> sees an invalid cookie and fails to set the task back to RUNNING, so yes, it
> will sleep with no waker in the current code.
> 
> We usually expect the cookie to be set inline with submit_bio(), but we're not
> guaranteed it won't be punted off to .run_work for a variety of reasons, so
> the thread writing the cookie may be racing with the reader.
> 
> This was fine before the bio polling support since the cookie was always
> returned with submit_bio() before that.
> 
> And I would like psync to continue working with polling. As great as io_uring
> is, it's just not as efficient @qd1.
> 
> Here's a bandaid, though I assume it'll break something...
> 
> ---
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index ed1869a305c4..348136dc7ba9 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1146,8 +1146,6 @@ void blk_mq_start_request(struct request *rq)
>  	if (blk_integrity_rq(rq) && req_op(rq) == REQ_OP_WRITE)
>  		q->integrity.profile->prepare_fn(rq);
>  #endif
> -	if (rq->bio && rq->bio->bi_opf & REQ_POLLED)
> -	        WRITE_ONCE(rq->bio->bi_cookie, blk_rq_to_qc(rq));
>  }
>  EXPORT_SYMBOL(blk_mq_start_request);
>  
> @@ -2464,6 +2462,9 @@ static void blk_mq_bio_to_request(struct request *rq, struct bio *bio,
>  	WARN_ON_ONCE(err);
>  
>  	blk_account_io_start(rq);
> +
> +	if (rq->bio->bi_opf & REQ_POLLED)
> +	        WRITE_ONCE(rq->bio->bi_cookie, blk_rq_to_qc(rq));
>  }

Yeah, this way may improve the situation, but it still can't fix everything:
what if the submitted bio is merged into another request in the plug list?


Thanks,
Ming



* Re: [PATCH V2] block: ignore RWF_HIPRI hint for sync dio
  2022-05-24  4:34       ` Keith Busch
@ 2022-05-24  6:28         ` Ming Lei
  2022-05-24 15:20           ` Keith Busch
  0 siblings, 1 reply; 12+ messages in thread
From: Ming Lei @ 2022-05-24  6:28 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jens Axboe, linux-block, Christoph Hellwig, linux-mm, linux-xfs,
	Changhui Zhong

On Mon, May 23, 2022 at 10:34:27PM -0600, Keith Busch wrote:
> On Mon, May 23, 2022 at 09:02:39PM -0600, Keith Busch wrote:
> > Here's a bandaid, though I assume it'll break something...
> 
> On second thought, maybe this is okay! The encoded hctx doesn't change after
> this call, which is the only thing polling cares about. The tag portion doesn't
> matter.

I guess it may still change in case the nvme mpath bio is requeued.


Thanks,
Ming



* Re: [PATCH V2] block: ignore RWF_HIPRI hint for sync dio
  2022-05-24  6:28         ` Ming Lei
@ 2022-05-24 15:20           ` Keith Busch
  0 siblings, 0 replies; 12+ messages in thread
From: Keith Busch @ 2022-05-24 15:20 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Christoph Hellwig, linux-mm, linux-xfs,
	Changhui Zhong

On Tue, May 24, 2022 at 02:28:58PM +0800, Ming Lei wrote:
> On Mon, May 23, 2022 at 10:34:27PM -0600, Keith Busch wrote:
> > On Mon, May 23, 2022 at 09:02:39PM -0600, Keith Busch wrote:
> > > Here's a bandaid, though I assume it'll break something...
> > 
> > On second thought, maybe this is okay! The encoded hctx doesn't change after
> > this call, which is the only thing polling cares about. The tag portion doesn't
> > matter.
> 
> I guess it may still change in case the nvme mpath bio is requeued.

A path failover requeue strips the REQ_POLLED flag out of the bio, so it would
be interrupt driven at that point.
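
(For reference, the failover behavior described above corresponds roughly
to the bio requeue loop in nvme_failover_req(); the following is a
paraphrased sketch of the code at the time, not a verbatim copy:)

	for (bio = req->bio; bio; bio = bio->bi_next) {
		bio->bi_bdev = ns->head->disk->part0;
		if (bio->bi_opf & REQ_POLLED) {
			/* requeued bios complete via interrupt, not polling */
			bio->bi_opf &= ~REQ_POLLED;
			bio->bi_cookie = BLK_QC_T_NONE;
		}
	}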

