* [RFC 0/2] Fixes for fio io_uring polled mode test failures
@ 2020-01-16  2:37 Bijan Mottahedeh
  2020-01-16  2:37 ` [RFC 1/2] io_uring: clear req->result always before issuing a read/write request Bijan Mottahedeh
  2020-01-16  2:37 ` [RFC 2/2] io_uring: acquire ctx->uring_lock before calling io_issue_sqe() Bijan Mottahedeh
  0 siblings, 2 replies; 16+ messages in thread
From: Bijan Mottahedeh @ 2020-01-16  2:37 UTC (permalink / raw)
  To: axboe; +Cc: linux-block

These patches address crash and list-corruption errors seen when running
the fio test below.

[global]
filename=/dev/nvme0n1
rw=randread
bs=4k
direct=1
time_based=1
randrepeat=1
gtod_reduce=1
[fiotest]


fio nvme.fio --readonly --ioengine=io_uring --iodepth 1024 --fixedbufs 
--hipri --numjobs=$1 --runtime=$2
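
For reference, a minimal liburing reproducer for the same polled-read path
might look roughly like the sketch below (illustrative only: it assumes
liburing is installed, needs privileges to open the namespace, and
abbreviates error handling).  Build with "gcc -O2 repro.c -o repro -luring".

/* sketch: one IOPOLL read of the first 4k of /dev/nvme0n1 (device path,
 * queue depth and buffer size are assumptions, not part of this series) */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/uio.h>
#include <liburing.h>

int main(void)
{
	struct io_uring ring;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	struct iovec iov;
	int fd, ret;

	fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* --hipri: completions are polled (IOPOLL) rather than IRQ driven */
	ret = io_uring_queue_init(64, &ring, IORING_SETUP_IOPOLL);
	if (ret < 0) {
		fprintf(stderr, "queue_init: %s\n", strerror(-ret));
		return 1;
	}

	/* --fixedbufs: register one 4k buffer up front */
	if (posix_memalign(&iov.iov_base, 4096, 4096))
		return 1;
	iov.iov_len = 4096;
	ret = io_uring_register_buffers(&ring, &iov, 1);
	if (ret < 0) {
		fprintf(stderr, "register_buffers: %s\n", strerror(-ret));
		return 1;
	}

	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_read_fixed(sqe, fd, iov.iov_base, 4096, 0, 0);

	ret = io_uring_submit(&ring);
	if (ret < 0) {
		fprintf(stderr, "submit: %s\n", strerror(-ret));
		return 1;
	}

	/* with IOPOLL, waiting for the CQE drives the kernel-side polling */
	ret = io_uring_wait_cqe(&ring, &cqe);
	if (ret < 0) {
		fprintf(stderr, "wait_cqe: %s\n", strerror(-ret));
		return 1;
	}
	printf("read returned %d\n", cqe->res);
	io_uring_cqe_seen(&ring, cqe);

	io_uring_queue_exit(&ring);
	close(fd);
	return 0;
}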

Bijan Mottahedeh (2):
  io_uring: clear req->result always before issuing a read/write request
  io_uring: acquire ctx->uring_lock before calling io_issue_sqe()

 fs/io_uring.c | 4 ++++
 1 file changed, 4 insertions(+)

-- 
1.8.3.1



* [RFC 1/2] io_uring: clear req->result always before issuing a read/write request
  2020-01-16  2:37 [RFC 0/2] Fixes for fio io_uring polled mode test failures Bijan Mottahedeh
@ 2020-01-16  2:37 ` Bijan Mottahedeh
  2020-01-16  4:34   ` Jens Axboe
  2020-01-16  2:37 ` [RFC 2/2] io_uring: acquire ctx->uring_lock before calling io_issue_sqe() Bijan Mottahedeh
  1 sibling, 1 reply; 16+ messages in thread
From: Bijan Mottahedeh @ 2020-01-16  2:37 UTC (permalink / raw)
  To: axboe; +Cc: linux-block

req->result is cleared when io_issue_sqe() calls io_read/write_pre()
routines.  Those routines however are not called when the sqe
argument is NULL, which is the case when io_issue_sqe() is called from
io_wq_submit_work().  io_issue_sqe() may then examine a stale result if
a polled request had previously failed with -EAGAIN:

        if (ctx->flags & IORING_SETUP_IOPOLL) {
                if (req->result == -EAGAIN)
                        return -EAGAIN;

                io_iopoll_req_issued(req);
        }

and in turn cause a subsequently completed request to be re-issued in
io_wq_submit_work().

Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
---
 fs/io_uring.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 6ffab9aaf..d015ce8 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2180,6 +2180,7 @@ static int io_read(struct io_kiocb *req, struct io_kiocb **nxt,
 	if (!force_nonblock)
 		req->rw.kiocb.ki_flags &= ~IOCB_NOWAIT;
 
+	req->result = 0;
 	io_size = ret;
 	if (req->flags & REQ_F_LINK)
 		req->result = io_size;
@@ -2267,6 +2268,7 @@ static int io_write(struct io_kiocb *req, struct io_kiocb **nxt,
 	if (!force_nonblock)
 		req->rw.kiocb.ki_flags &= ~IOCB_NOWAIT;
 
+	req->result = 0;
 	io_size = ret;
 	if (req->flags & REQ_F_LINK)
 		req->result = io_size;
-- 
1.8.3.1



* [RFC 2/2] io_uring: acquire ctx->uring_lock before calling io_issue_sqe()
  2020-01-16  2:37 [RFC 0/2] Fixes for fio io_uring polled mode test failures Bijan Mottahedeh
  2020-01-16  2:37 ` [RFC 1/2] io_uring: clear req->result always before issuing a read/write request Bijan Mottahedeh
@ 2020-01-16  2:37 ` Bijan Mottahedeh
  2020-01-16  4:34   ` Jens Axboe
  1 sibling, 1 reply; 16+ messages in thread
From: Bijan Mottahedeh @ 2020-01-16  2:37 UTC (permalink / raw)
  To: axboe; +Cc: linux-block

io_issue_sqe() calls io_iopoll_req_issued() which manipulates poll_list,
so acquire ctx->uring_lock beforehand similar to other instances of
calling io_issue_sqe().

Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
---
 fs/io_uring.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index d015ce8..7b399e2 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -4359,7 +4359,9 @@ static void io_wq_submit_work(struct io_wq_work **workptr)
 		req->has_user = (work->flags & IO_WQ_WORK_HAS_MM) != 0;
 		req->in_async = true;
 		do {
+			mutex_lock(&req->ctx->uring_lock);
 			ret = io_issue_sqe(req, NULL, &nxt, false);
+			mutex_unlock(&req->ctx->uring_lock);
 			/*
 			 * We can get EAGAIN for polled IO even though we're
 			 * forcing a sync submission from here, since we can't
-- 
1.8.3.1



* Re: [RFC 2/2] io_uring: acquire ctx->uring_lock before calling io_issue_sqe()
  2020-01-16  2:37 ` [RFC 2/2] io_uring: acquire ctx->uring_lock before calling io_issue_sqe() Bijan Mottahedeh
@ 2020-01-16  4:34   ` Jens Axboe
  2020-01-16  4:42     ` Jens Axboe
  0 siblings, 1 reply; 16+ messages in thread
From: Jens Axboe @ 2020-01-16  4:34 UTC (permalink / raw)
  To: Bijan Mottahedeh; +Cc: linux-block

On 1/15/20 7:37 PM, Bijan Mottahedeh wrote:
> io_issue_sqe() calls io_iopoll_req_issued() which manipulates poll_list,
> so acquire ctx->uring_lock beforehand similar to other instances of
> calling io_issue_sqe().

Is the below not enough?

diff --git a/fs/io_uring.c b/fs/io_uring.c
index f9709a3a673c..900e86189ce7 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -4272,10 +4272,18 @@ static int io_issue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 		return ret;
 
 	if (ctx->flags & IORING_SETUP_IOPOLL) {
+		const bool in_async = req->in_async;
+
 		if (req->result == -EAGAIN)
 			return -EAGAIN;
 
+		if (in_async)
+			mutex_lock(&ctx->uring_lock);
+
 		io_iopoll_req_issued(req);
+
+		if (in_async)
+			mutex_unlock(&ctx->uring_lock);
 	}
 
 	return 0;

-- 
Jens Axboe



* Re: [RFC 1/2] io_uring: clear req->result always before issuing a read/write request
  2020-01-16  2:37 ` [RFC 1/2] io_uring: clear req->result always before issuing a read/write request Bijan Mottahedeh
@ 2020-01-16  4:34   ` Jens Axboe
  0 siblings, 0 replies; 16+ messages in thread
From: Jens Axboe @ 2020-01-16  4:34 UTC (permalink / raw)
  To: Bijan Mottahedeh; +Cc: linux-block

On 1/15/20 7:37 PM, Bijan Mottahedeh wrote:
> req->result is cleared when io_issue_sqe() calls io_read/write_pre()
> routines.  Those routines however are not called when the sqe
> argument is NULL, which is the case when io_issue_sqe() is called from
> io_wq_submit_work().  io_issue_sqe() may then examine a stale result if
> a polled request had previously failed with -EAGAIN:
> 
>         if (ctx->flags & IORING_SETUP_IOPOLL) {
>                 if (req->result == -EAGAIN)
>                         return -EAGAIN;
> 
>                 io_iopoll_req_issued(req);
>         }
> 
> and in turn cause a subsequently completed request to be re-issued in
> io_wq_submit_work().

Looks good, thanks.

-- 
Jens Axboe



* Re: [RFC 2/2] io_uring: acquire ctx->uring_lock before calling io_issue_sqe()
  2020-01-16  4:34   ` Jens Axboe
@ 2020-01-16  4:42     ` Jens Axboe
  2020-01-16 16:22       ` Jens Axboe
  0 siblings, 1 reply; 16+ messages in thread
From: Jens Axboe @ 2020-01-16  4:42 UTC (permalink / raw)
  To: Bijan Mottahedeh; +Cc: linux-block

On 1/15/20 9:34 PM, Jens Axboe wrote:
> On 1/15/20 7:37 PM, Bijan Mottahedeh wrote:
>> io_issue_sqe() calls io_iopoll_req_issued() which manipulates poll_list,
>> so acquire ctx->uring_lock beforehand similar to other instances of
>> calling io_issue_sqe().
> 
> Is the below not enough?

This should be better, we have two that set ->in_async, and only one
doesn't hold the mutex.

If this works for you, can you resend patch 2 with that? Also add a:

Fixes: 8a4955ff1cca ("io_uring: sqthread should grab ctx->uring_lock for submissions")

to it as well. Thanks!

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 3130ed16456e..52e5764540e4 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -3286,10 +3286,19 @@ static int io_issue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 		return ret;
 
 	if (ctx->flags & IORING_SETUP_IOPOLL) {
+		const bool in_async = io_wq_current_is_worker();
+
 		if (req->result == -EAGAIN)
 			return -EAGAIN;
 
+		/* workqueue context doesn't hold uring_lock, grab it now */
+		if (in_async)
+			mutex_lock(&ctx->uring_lock);
+
 		io_iopoll_req_issued(req);
+
+		if (in_async)
+			mutex_unlock(&ctx->uring_lock);
 	}
 
 	return 0;

-- 
Jens Axboe



* Re: [RFC 2/2] io_uring: acquire ctx->uring_lock before calling io_issue_sqe()
  2020-01-16  4:42     ` Jens Axboe
@ 2020-01-16 16:22       ` Jens Axboe
  2020-01-16 19:08         ` Bijan Mottahedeh
  0 siblings, 1 reply; 16+ messages in thread
From: Jens Axboe @ 2020-01-16 16:22 UTC (permalink / raw)
  To: Bijan Mottahedeh; +Cc: linux-block

On 1/15/20 9:42 PM, Jens Axboe wrote:
> On 1/15/20 9:34 PM, Jens Axboe wrote:
>> On 1/15/20 7:37 PM, Bijan Mottahedeh wrote:
>>> io_issue_sqe() calls io_iopoll_req_issued() which manipulates poll_list,
>>> so acquire ctx->uring_lock beforehand similar to other instances of
>>> calling io_issue_sqe().
>>
>> Is the below not enough?
> 
> This should be better, we have two that set ->in_async, and only one
> doesn't hold the mutex.
> 
> If this works for you, can you resend patch 2 with that? Also add a:
> 
> Fixes: 8a4955ff1cca ("io_uring: sqthread should grab ctx->uring_lock for submissions")
> 
> to it as well. Thanks!

I tested and queued this up:

https://git.kernel.dk/cgit/linux-block/commit/?h=io_uring-5.5&id=11ba820bf163e224bf5dd44e545a66a44a5b1d7a

Please let me know if this works, it sits on top of the ->result patch you
sent in.

-- 
Jens Axboe



* Re: [RFC 2/2] io_uring: acquire ctx->uring_lock before calling io_issue_sqe()
  2020-01-16 16:22       ` Jens Axboe
@ 2020-01-16 19:08         ` Bijan Mottahedeh
  2020-01-16 20:02           ` Jens Axboe
  0 siblings, 1 reply; 16+ messages in thread
From: Bijan Mottahedeh @ 2020-01-16 19:08 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block

On 1/16/2020 8:22 AM, Jens Axboe wrote:
> On 1/15/20 9:42 PM, Jens Axboe wrote:
>> On 1/15/20 9:34 PM, Jens Axboe wrote:
>>> On 1/15/20 7:37 PM, Bijan Mottahedeh wrote:
>>>> io_issue_sqe() calls io_iopoll_req_issued() which manipulates poll_list,
>>>> so acquire ctx->uring_lock beforehand similar to other instances of
>>>> calling io_issue_sqe().
>>> Is the below not enough?
>> This should be better, we have two that set ->in_async, and only one
>> doesn't hold the mutex.
>>
>> If this works for you, can you resend patch 2 with that? Also add a:
>>
>> Fixes: 8a4955ff1cca ("io_uring: sqthread should grab ctx->uring_lock for submissions")
>>
>> to it as well. Thanks!
> I tested and queued this up:
>
> https://git.kernel.dk/cgit/linux-block/commit/?h=io_uring-5.5&id=11ba820bf163e224bf5dd44e545a66a44a5b1d7a
>
> Please let me know if this works, it sits on top of the ->result patch you
> sent in.
>
That works, thanks.

However, I'm still seeing a use-after-free error in the request
completion path in nvme_unmap_data().  It happens only when testing with
large block sizes in fio, typically > 128k; e.g. bs=256k always hits it.

This is the error:

DMA-API: nvme 0000:00:04.0: device driver tries to free DMA memory it 
has not allocated [device address=0x6b6b6b6b6b6b6b6b] [size=1802201963 
bytes]

and this warning occasionally:

WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IDLE);

It seems like a request might be issued multiple times but I can't see 
anything in io_uring code that would account for it.

--bijan





* Re: [RFC 2/2] io_uring: acquire ctx->uring_lock before calling io_issue_sqe()
  2020-01-16 19:08         ` Bijan Mottahedeh
@ 2020-01-16 20:02           ` Jens Axboe
  2020-01-16 21:04             ` Bijan Mottahedeh
  0 siblings, 1 reply; 16+ messages in thread
From: Jens Axboe @ 2020-01-16 20:02 UTC (permalink / raw)
  To: Bijan Mottahedeh; +Cc: linux-block, Christoph Hellwig, Keith Busch

On 1/16/20 12:08 PM, Bijan Mottahedeh wrote:
> On 1/16/2020 8:22 AM, Jens Axboe wrote:
>> On 1/15/20 9:42 PM, Jens Axboe wrote:
>>> On 1/15/20 9:34 PM, Jens Axboe wrote:
>>>> On 1/15/20 7:37 PM, Bijan Mottahedeh wrote:
>>>>> io_issue_sqe() calls io_iopoll_req_issued() which manipulates poll_list,
>>>>> so acquire ctx->uring_lock beforehand similar to other instances of
>>>>> calling io_issue_sqe().
>>>> Is the below not enough?
>>> This should be better, we have two that set ->in_async, and only one
>>> doesn't hold the mutex.
>>>
>>> If this works for you, can you resend patch 2 with that? Also add a:
>>>
>>> Fixes: 8a4955ff1cca ("io_uring: sqthread should grab ctx->uring_lock for submissions")
>>>
>>> to it as well. Thanks!
>> I tested and queued this up:
>>
>> https://git.kernel.dk/cgit/linux-block/commit/?h=io_uring-5.5&id=11ba820bf163e224bf5dd44e545a66a44a5b1d7a
>>
>> Please let me know if this works, it sits on top of the ->result patch you
>> sent in.
>>
> That works, thanks.
> 
> I'm however still seeing a use-after-free error in the request 
> completion path in nvme_unmap_data().  It happens only when testing with 
> large block sizes in fio, typically > 128k, e.g. bs=256k will always hit it.
> 
> This is the error:
> 
> DMA-API: nvme 0000:00:04.0: device driver tries to free DMA memory it 
> has not allocated [device address=0x6b6b6b6b6b6b6b6b] [size=1802201963 
> bytes]
> 
> and this warning occasionally:
> 
> WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IDLE);
> 
> It seems like a request might be issued multiple times but I can't see 
> anything in io_uring code that would account for it.

Both of them indicate reuse, and I agree I don't think it's io_uring. It
really feels like an issue with nvme when a poll queue is shared, but I
haven't been able to pin point what it is yet.

The 128K is interesting, that would seem to indicate that it's related to
splitting of the IO (which would create > 1 IO per submitted IO).

-- 
Jens Axboe



* Re: [RFC 2/2] io_uring: acquire ctx->uring_lock before calling io_issue_sqe()
  2020-01-16 20:02           ` Jens Axboe
@ 2020-01-16 21:04             ` Bijan Mottahedeh
  2020-01-16 21:26               ` Jens Axboe
  0 siblings, 1 reply; 16+ messages in thread
From: Bijan Mottahedeh @ 2020-01-16 21:04 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, Christoph Hellwig, Keith Busch

On 1/16/2020 12:02 PM, Jens Axboe wrote:
> On 1/16/20 12:08 PM, Bijan Mottahedeh wrote:
>> On 1/16/2020 8:22 AM, Jens Axboe wrote:
>>> On 1/15/20 9:42 PM, Jens Axboe wrote:
>>>> On 1/15/20 9:34 PM, Jens Axboe wrote:
>>>>> On 1/15/20 7:37 PM, Bijan Mottahedeh wrote:
>>>>>> io_issue_sqe() calls io_iopoll_req_issued() which manipulates poll_list,
>>>>>> so acquire ctx->uring_lock beforehand similar to other instances of
>>>>>> calling io_issue_sqe().
>>>>> Is the below not enough?
>>>> This should be better, we have two that set ->in_async, and only one
>>>> doesn't hold the mutex.
>>>>
>>>> If this works for you, can you resend patch 2 with that? Also add a:
>>>>
>>>> Fixes: 8a4955ff1cca ("io_uring: sqthread should grab ctx->uring_lock for submissions")
>>>>
>>>> to it as well. Thanks!
>>> I tested and queued this up:
>>>
>>> https://git.kernel.dk/cgit/linux-block/commit/?h=io_uring-5.5&id=11ba820bf163e224bf5dd44e545a66a44a5b1d7a
>>>
>>> Please let me know if this works, it sits on top of the ->result patch you
>>> sent in.
>>>
>> That works, thanks.
>>
>> I'm however still seeing a use-after-free error in the request
>> completion path in nvme_unmap_data().  It happens only when testing with
>> large block sizes in fio, typically > 128k, e.g. bs=256k will always hit it.
>>
>> This is the error:
>>
>> DMA-API: nvme 0000:00:04.0: device driver tries to free DMA memory it
>> has not allocated [device address=0x6b6b6b6b6b6b6b6b] [size=1802201963
>> bytes]
>>
>> and this warning occasionally:
>>
>> WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IDLE);
>>
>> It seems like a request might be issued multiple times but I can't see
>> anything in io_uring code that would account for it.
> Both of them indicate reuse, and I agree I don't think it's io_uring. It
> really feels like an issue with nvme when a poll queue is shared, but I
> haven't been able to pin point what it is yet.
>
> The 128K is interesting, that would seem to indicate that it's related to
> splitting of the IO (which would create > 1 IO per submitted IO).
>
Where does the split take place?  I had suspected that it might be 
related to the submit_bio() loop in __blkdev_direct_IO() but I don't 
think I saw multiple submit_bio() calls or maybe I missed something.

--bijan


* Re: [RFC 2/2] io_uring: acquire ctx->uring_lock before calling io_issue_sqe()
  2020-01-16 21:04             ` Bijan Mottahedeh
@ 2020-01-16 21:26               ` Jens Axboe
  2020-01-28 20:34                 ` Bijan Mottahedeh
  0 siblings, 1 reply; 16+ messages in thread
From: Jens Axboe @ 2020-01-16 21:26 UTC (permalink / raw)
  To: Bijan Mottahedeh; +Cc: linux-block, Christoph Hellwig, Keith Busch

On 1/16/20 2:04 PM, Bijan Mottahedeh wrote:
> On 1/16/2020 12:02 PM, Jens Axboe wrote:
>> On 1/16/20 12:08 PM, Bijan Mottahedeh wrote:
>>> On 1/16/2020 8:22 AM, Jens Axboe wrote:
>>>> On 1/15/20 9:42 PM, Jens Axboe wrote:
>>>>> On 1/15/20 9:34 PM, Jens Axboe wrote:
>>>>>> On 1/15/20 7:37 PM, Bijan Mottahedeh wrote:
>>>>>>> io_issue_sqe() calls io_iopoll_req_issued() which manipulates poll_list,
>>>>>>> so acquire ctx->uring_lock beforehand similar to other instances of
>>>>>>> calling io_issue_sqe().
>>>>>> Is the below not enough?
>>>>> This should be better, we have two that set ->in_async, and only one
>>>>> doesn't hold the mutex.
>>>>>
>>>>> If this works for you, can you resend patch 2 with that? Also add a:
>>>>>
>>>>> Fixes: 8a4955ff1cca ("io_uring: sqthread should grab ctx->uring_lock for submissions")
>>>>>
>>>>> to it as well. Thanks!
>>>> I tested and queued this up:
>>>>
>>>> https://git.kernel.dk/cgit/linux-block/commit/?h=io_uring-5.5&id=11ba820bf163e224bf5dd44e545a66a44a5b1d7a
>>>>
>>>> Please let me know if this works, it sits on top of the ->result patch you
>>>> sent in.
>>>>
>>> That works, thanks.
>>>
>>> I'm however still seeing a use-after-free error in the request
>>> completion path in nvme_unmap_data().  It happens only when testing with
>>> large block sizes in fio, typically > 128k, e.g. bs=256k will always hit it.
>>>
>>> This is the error:
>>>
>>> DMA-API: nvme 0000:00:04.0: device driver tries to free DMA memory it
>>> has not allocated [device address=0x6b6b6b6b6b6b6b6b] [size=1802201963
>>> bytes]
>>>
>>> and this warning occasionally:
>>>
>>> WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IDLE);
>>>
>>> It seems like a request might be issued multiple times but I can't see
>>> anything in io_uring code that would account for it.
>> Both of them indicate reuse, and I agree I don't think it's io_uring. It
>> really feels like an issue with nvme when a poll queue is shared, but I
>> haven't been able to pin point what it is yet.
>>
>> The 128K is interesting, that would seem to indicate that it's related to
>> splitting of the IO (which would create > 1 IO per submitted IO).
>>
> Where does the split take place?  I had suspected that it might be 
> related to the submit_bio() loop in __blkdev_direct_IO() but I don't 
> think I saw multiple submit_bio() calls or maybe I missed something.

See the path from blk_mq_make_request() -> __blk_queue_split() ->
blk_bio_segment_split(). The bio is built and submitted, then split if
it violates any size constraints. The splits are submitted through
generic_make_request(), so that might be why you didn't see multiple
submit_bio() calls.
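
One quick way to sanity-check whether those size constraints even come
into play for a 256k request is to dump the queue limits from sysfs; a
small userspace sketch (device name and sysfs layout are assumptions):

/* sketch: print the limits that drive splitting for nvme0n1 */
#include <stdio.h>

static long read_limit(const char *path)
{
	FILE *f = fopen(path, "r");
	long v = -1;

	if (f) {
		if (fscanf(f, "%ld", &v) != 1)
			v = -1;
		fclose(f);
	}
	return v;
}

int main(void)
{
	const char *base = "/sys/block/nvme0n1/queue";
	const char *attrs[] = { "max_sectors_kb", "max_segments", "max_segment_size" };
	char path[256];

	for (int i = 0; i < 3; i++) {
		snprintf(path, sizeof(path), "%s/%s", base, attrs[i]);
		printf("%-16s = %ld\n", attrs[i], read_limit(path));
	}
	/* a 256k I/O gets split if it exceeds max_sectors_kb or needs more
	 * segments than max_segments (each at most max_segment_size bytes) */
	return 0;
}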

-- 
Jens Axboe



* Re: [RFC 2/2] io_uring: acquire ctx->uring_lock before calling io_issue_sqe()
  2020-01-16 21:26               ` Jens Axboe
@ 2020-01-28 20:34                 ` Bijan Mottahedeh
  2020-01-28 23:37                   ` Jens Axboe
  0 siblings, 1 reply; 16+ messages in thread
From: Bijan Mottahedeh @ 2020-01-28 20:34 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, Christoph Hellwig, Keith Busch

On 1/16/2020 1:26 PM, Jens Axboe wrote:
> On 1/16/20 2:04 PM, Bijan Mottahedeh wrote:
>> On 1/16/2020 12:02 PM, Jens Axboe wrote:
>>> On 1/16/20 12:08 PM, Bijan Mottahedeh wrote:
>>>> On 1/16/2020 8:22 AM, Jens Axboe wrote:
>>>>> On 1/15/20 9:42 PM, Jens Axboe wrote:
>>>>>> On 1/15/20 9:34 PM, Jens Axboe wrote:
>>>>>>> On 1/15/20 7:37 PM, Bijan Mottahedeh wrote:
>>>>>>>> io_issue_sqe() calls io_iopoll_req_issued() which manipulates poll_list,
>>>>>>>> so acquire ctx->uring_lock beforehand similar to other instances of
>>>>>>>> calling io_issue_sqe().
>>>>>>> Is the below not enough?
>>>>>> This should be better, we have two that set ->in_async, and only one
>>>>>> doesn't hold the mutex.
>>>>>>
>>>>>> If this works for you, can you resend patch 2 with that? Also add a:
>>>>>>
>>>>>> Fixes: 8a4955ff1cca ("io_uring: sqthread should grab ctx->uring_lock for submissions")
>>>>>>
>>>>>> to it as well. Thanks!
>>>>> I tested and queued this up:
>>>>>
>>>>> https://git.kernel.dk/cgit/linux-block/commit/?h=io_uring-5.5&id=11ba820bf163e224bf5dd44e545a66a44a5b1d7a
>>>>>
>>>>> Please let me know if this works, it sits on top of the ->result patch you
>>>>> sent in.
>>>>>
>>>> That works, thanks.
>>>>
>>>> I'm however still seeing a use-after-free error in the request
>>>> completion path in nvme_unmap_data().  It happens only when testing with
>>>> large block sizes in fio, typically > 128k, e.g. bs=256k will always hit it.
>>>>
>>>> This is the error:
>>>>
>>>> DMA-API: nvme 0000:00:04.0: device driver tries to free DMA memory it
>>>> has not allocated [device address=0x6b6b6b6b6b6b6b6b] [size=1802201963
>>>> bytes]
>>>>
>>>> and this warning occasionally:
>>>>
>>>> WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IDLE);
>>>>
>>>> It seems like a request might be issued multiple times but I can't see
>>>> anything in io_uring code that would account for it.
>>> Both of them indicate reuse, and I agree I don't think it's io_uring. It
>>> really feels like an issue with nvme when a poll queue is shared, but I
>>> haven't been able to pin point what it is yet.
>>>
>>> The 128K is interesting, that would seem to indicate that it's related to
>>> splitting of the IO (which would create > 1 IO per submitted IO).
>>>
>> Where does the split take place?  I had suspected that it might be
>> related to the submit_bio() loop in __blkdev_direct_IO() but I don't
>> think I saw multiple submit_bio() calls or maybe I missed something.
> See the path from blk_mq_make_request() -> __blk_queue_split() ->
> blk_bio_segment_split(). The bio is built and submitted, then split if
> it violates any size constraints. The splits are submitted through
> generic_make_request(), so that might be why you didn't see multiple
> submit_bio() calls.
>

I think the problem is in __blkdev_direct_IO() and not related to 
request size:

                         qc = submit_bio(bio);

                         if (polled)
                                 WRITE_ONCE(iocb->ki_cookie, qc);


When dio->is_sync is not set, the first call to submit_bio() won't have
acquired a bio ref through bio_get(), and so the bio/dio could already be
freed by the time ki_cookie is set.

With the specific io_uring test, this happens because 
blk_mq_make_request()->blk_mq_get_request() fails and so terminates the 
request.

As for the fix for the polled I/O (!is_sync) case, I'm wondering if
dio->multi_bio is really necessary in __blkdev_direct_IO().  Can we call
bio_get() unconditionally after the call to bio_alloc_bioset(), set
dio->ref = 1, and increment it for each additional submit_bio() call?
Would it make sense to do away with multi_bio?
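
Very roughly, the idea is the userspace stand-in below (not the actual
fs/block_dev.c code; the struct layout and helper names are invented for
illustration): the submitter always keeps its own reference across submit,
so even an immediate completion cannot free the dio before the cookie is
published.

/* build: gcc -O2 dio_ref.c -o dio_ref */
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

struct dio {
	atomic_int ref;		/* one ref per in-flight bio + one for the submitter */
	int result;
};

static void dio_put(struct dio *dio)
{
	if (atomic_fetch_sub(&dio->ref, 1) == 1) {
		/* last reference: complete and free, as blkdev_bio_end_io() would */
		printf("dio complete, result=%d\n", dio->result);
		free(dio);
	}
}

/* stands in for bio completion, which may run immediately after submit */
static void fake_bio_endio(struct dio *dio)
{
	dio->result = 4096;
	dio_put(dio);
}

/* stands in for submit_bio(); pretend the I/O completes instantly */
static int fake_submit(struct dio *dio)
{
	fake_bio_endio(dio);
	return 42;		/* the "qc" cookie */
}

int main(void)
{
	struct dio *dio = malloc(sizeof(*dio));
	int qc;

	/* 1 for the bio about to be submitted + 1 for the submitter itself;
	 * each extra bio in a multi-bio dio would add one more reference */
	atomic_init(&dio->ref, 2);

	qc = fake_submit(dio);

	/* safe even though the "I/O" already completed: we still hold a ref,
	 * so this is where WRITE_ONCE(iocb->ki_cookie, qc) would also be safe */
	printf("publishing cookie %d\n", qc);

	dio_put(dio);		/* drop the submitter's reference */
	return 0;
}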

Also, I'm not clear on how the is_sync + multi_bio case is supposed to work.
__blkdev_direct_IO() polls for *a* completion in the request's hctx and
not *the* request completion itself, so what does that tell us for
multi_bio + is_sync?  Is the polling supposed to guarantee that all
constituent bios of a multi_bio request have completed before returning?


--bijan


PS: I couldn't see 256k requests being split via __blk_queue_split();
still not sure how that works.



* Re: [RFC 2/2] io_uring: acquire ctx->uring_lock before calling io_issue_sqe()
  2020-01-28 20:34                 ` Bijan Mottahedeh
@ 2020-01-28 23:37                   ` Jens Axboe
  2020-01-28 23:49                     ` Bijan Mottahedeh
  0 siblings, 1 reply; 16+ messages in thread
From: Jens Axboe @ 2020-01-28 23:37 UTC (permalink / raw)
  To: Bijan Mottahedeh; +Cc: linux-block, Christoph Hellwig, Keith Busch

On 1/28/20 1:34 PM, Bijan Mottahedeh wrote:
> On 1/16/2020 1:26 PM, Jens Axboe wrote:
>> On 1/16/20 2:04 PM, Bijan Mottahedeh wrote:
>>> On 1/16/2020 12:02 PM, Jens Axboe wrote:
>>>> On 1/16/20 12:08 PM, Bijan Mottahedeh wrote:
>>>>> On 1/16/2020 8:22 AM, Jens Axboe wrote:
>>>>>> On 1/15/20 9:42 PM, Jens Axboe wrote:
>>>>>>> On 1/15/20 9:34 PM, Jens Axboe wrote:
>>>>>>>> On 1/15/20 7:37 PM, Bijan Mottahedeh wrote:
>>>>>>>>> io_issue_sqe() calls io_iopoll_req_issued() which manipulates poll_list,
>>>>>>>>> so acquire ctx->uring_lock beforehand similar to other instances of
>>>>>>>>> calling io_issue_sqe().
>>>>>>>> Is the below not enough?
>>>>>>> This should be better, we have two that set ->in_async, and only one
>>>>>>> doesn't hold the mutex.
>>>>>>>
>>>>>>> If this works for you, can you resend patch 2 with that? Also add a:
>>>>>>>
>>>>>>> Fixes: 8a4955ff1cca ("io_uring: sqthread should grab ctx->uring_lock for submissions")
>>>>>>>
>>>>>>> to it as well. Thanks!
>>>>>> I tested and queued this up:
>>>>>>
>>>>>> https://git.kernel.dk/cgit/linux-block/commit/?h=io_uring-5.5&id=11ba820bf163e224bf5dd44e545a66a44a5b1d7a
>>>>>>
>>>>>> Please let me know if this works, it sits on top of the ->result patch you
>>>>>> sent in.
>>>>>>
>>>>> That works, thanks.
>>>>>
>>>>> I'm however still seeing a use-after-free error in the request
>>>>> completion path in nvme_unmap_data().  It happens only when testing with
>>>>> large block sizes in fio, typically > 128k, e.g. bs=256k will always hit it.
>>>>>
>>>>> This is the error:
>>>>>
>>>>> DMA-API: nvme 0000:00:04.0: device driver tries to free DMA memory it
>>>>> has not allocated [device address=0x6b6b6b6b6b6b6b6b] [size=1802201963
>>>>> bytes]
>>>>>
>>>>> and this warning occasionally:
>>>>>
>>>>> WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IDLE);
>>>>>
>>>>> It seems like a request might be issued multiple times but I can't see
>>>>> anything in io_uring code that would account for it.
>>>> Both of them indicate reuse, and I agree I don't think it's io_uring. It
>>>> really feels like an issue with nvme when a poll queue is shared, but I
>>>> haven't been able to pin point what it is yet.
>>>>
>>>> The 128K is interesting, that would seem to indicate that it's related to
>>>> splitting of the IO (which would create > 1 IO per submitted IO).
>>>>
>>> Where does the split take place?  I had suspected that it might be
>>> related to the submit_bio() loop in __blkdev_direct_IO() but I don't
>>> think I saw multiple submit_bio() calls or maybe I missed something.
>> See the path from blk_mq_make_request() -> __blk_queue_split() ->
>> blk_bio_segment_split(). The bio is built and submitted, then split if
>> it violates any size constraints. The splits are submitted through
>> generic_make_request(), so that might be why you didn't see multiple
>> submit_bio() calls.
>>
> 
> I think the problem is in __blkdev_direct_IO() and not related to 
> request size:
> 
>                          qc = submit_bio(bio);
> 
>                          if (polled)
>                                  WRITE_ONCE(iocb->ki_cookie, qc);
> 
> 
> The first call to submit_bio() when dio->is_sync is not set won't have 
> acquired a bio ref through bio_get() and so the bio/dio could be freed 
> when ki_cookie is set.
> 
> With the specific io_uring test, this happens because 
> blk_mq_make_request()->blk_mq_get_request() fails and so terminates the 
> request.
> 
> As for the fix for polled io (!is_sync) case, I'm wondering if 
> dio->multi_bio is really necessary in __blkdev_direct_IO(). Can we call 
> bio_get() unconditionally after the call to bio_alloc_bioset(), set 
> dio->ref = 1, and increment it for additional submit bio calls?  Would 
> it make sense to do away with multi_bio?

It's not ideal, but I'm not sure I see a better way to fix it. You see the
case on failure, which we could check for (don't write the cookie if it's
invalid). But this won't fix the case where the IO completes fast, or
even immediately.

Hence I think you're right, there's really no way around doing the bio
ref counting, even for the sync case. Care to cook up a patch we can
take a look at? I can run some high performance sync testing too, so we
can see how badly it might hurt.

> Also, I'm not clear on how is_sync + mult_bio case is supposed to work.  
> __blkdev_direct_IO() polls for *a* completion in the request's hctx and 
> not *the* request completion itself, so what does that tell us for 
> multi_bio + is_sync? Is the polling supposed to guarantee that all 
> constituent bios for a mult_bio request have completed before return?

The polling really just ignores that, it doesn't take multi requests
into account. We just poll for the first part of it.

-- 
Jens Axboe



* Re: [RFC 2/2] io_uring: acquire ctx->uring_lock before calling io_issue_sqe()
  2020-01-28 23:37                   ` Jens Axboe
@ 2020-01-28 23:49                     ` Bijan Mottahedeh
  2020-01-28 23:52                       ` Jens Axboe
  0 siblings, 1 reply; 16+ messages in thread
From: Bijan Mottahedeh @ 2020-01-28 23:49 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, Christoph Hellwig, Keith Busch

On 1/28/2020 3:37 PM, Jens Axboe wrote:
> On 1/28/20 1:34 PM, Bijan Mottahedeh wrote:
>> On 1/16/2020 1:26 PM, Jens Axboe wrote:
>>> On 1/16/20 2:04 PM, Bijan Mottahedeh wrote:
>>>> On 1/16/2020 12:02 PM, Jens Axboe wrote:
>>>>> On 1/16/20 12:08 PM, Bijan Mottahedeh wrote:
>>>>>> On 1/16/2020 8:22 AM, Jens Axboe wrote:
>>>>>>> On 1/15/20 9:42 PM, Jens Axboe wrote:
>>>>>>>> On 1/15/20 9:34 PM, Jens Axboe wrote:
>>>>>>>>> On 1/15/20 7:37 PM, Bijan Mottahedeh wrote:
>>>>>>>>>> io_issue_sqe() calls io_iopoll_req_issued() which manipulates poll_list,
>>>>>>>>>> so acquire ctx->uring_lock beforehand similar to other instances of
>>>>>>>>>> calling io_issue_sqe().
>>>>>>>>> Is the below not enough?
>>>>>>>> This should be better, we have two that set ->in_async, and only one
>>>>>>>> doesn't hold the mutex.
>>>>>>>>
>>>>>>>> If this works for you, can you resend patch 2 with that? Also add a:
>>>>>>>>
>>>>>>>> Fixes: 8a4955ff1cca ("io_uring: sqthread should grab ctx->uring_lock for submissions")
>>>>>>>>
>>>>>>>> to it as well. Thanks!
>>>>>>> I tested and queued this up:
>>>>>>>
>>>>>>> https://git.kernel.dk/cgit/linux-block/commit/?h=io_uring-5.5&id=11ba820bf163e224bf5dd44e545a66a44a5b1d7a
>>>>>>>
>>>>>>> Please let me know if this works, it sits on top of the ->result patch you
>>>>>>> sent in.
>>>>>>>
>>>>>> That works, thanks.
>>>>>>
>>>>>> I'm however still seeing a use-after-free error in the request
>>>>>> completion path in nvme_unmap_data().  It happens only when testing with
>>>>>> large block sizes in fio, typically > 128k, e.g. bs=256k will always hit it.
>>>>>>
>>>>>> This is the error:
>>>>>>
>>>>>> DMA-API: nvme 0000:00:04.0: device driver tries to free DMA memory it
>>>>>> has not allocated [device address=0x6b6b6b6b6b6b6b6b] [size=1802201963
>>>>>> bytes]
>>>>>>
>>>>>> and this warning occasionally:
>>>>>>
>>>>>> WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IDLE);
>>>>>>
>>>>>> It seems like a request might be issued multiple times but I can't see
>>>>>> anything in io_uring code that would account for it.
>>>>> Both of them indicate reuse, and I agree I don't think it's io_uring. It
>>>>> really feels like an issue with nvme when a poll queue is shared, but I
>>>>> haven't been able to pin point what it is yet.
>>>>>
>>>>> The 128K is interesting, that would seem to indicate that it's related to
>>>>> splitting of the IO (which would create > 1 IO per submitted IO).
>>>>>
>>>> Where does the split take place?  I had suspected that it might be
>>>> related to the submit_bio() loop in __blkdev_direct_IO() but I don't
>>>> think I saw multiple submit_bio() calls or maybe I missed something.
>>> See the path from blk_mq_make_request() -> __blk_queue_split() ->
>>> blk_bio_segment_split(). The bio is built and submitted, then split if
>>> it violates any size constraints. The splits are submitted through
>>> generic_make_request(), so that might be why you didn't see multiple
>>> submit_bio() calls.
>>>
>> I think the problem is in __blkdev_direct_IO() and not related to
>> request size:
>>
>>                           qc = submit_bio(bio);
>>
>>                           if (polled)
>>                                   WRITE_ONCE(iocb->ki_cookie, qc);
>>
>>
>> The first call to submit_bio() when dio->is_sync is not set won't have
>> acquired a bio ref through bio_get() and so the bio/dio could be freed
>> when ki_cookie is set.
>>
>> With the specific io_uring test, this happens because
>> blk_mq_make_request()->blk_mq_get_request() fails and so terminates the
>> request.
>>
>> As for the fix for polled io (!is_sync) case, I'm wondering if
>> dio->multi_bio is really necessary in __blkdev_direct_IO(). Can we call
>> bio_get() unconditionally after the call to bio_alloc_bioset(), set
>> dio->ref = 1, and increment it for additional submit bio calls?  Would
>> it make sense to do away with multi_bio?
> It's not ideal, but not sure I see a better way to fix it. You see the
> case on failure, which we could check for (don't write cookie if it's
> invalid). But this won't fix the case where the IO complete fast, or
> even immediately.
>
> Hence I think you're right, there's really no way around doing the bio
> ref counting, even for the sync case. Care to cook up a patch we can
> take a look at? I can run some high performance sync testing too, so we
> can see how badly it might hurt.

Sure, I'll take a stab at it.

>
>> Also, I'm not clear on how is_sync + mult_bio case is supposed to work.
>> __blkdev_direct_IO() polls for *a* completion in the request's hctx and
>> not *the* request completion itself, so what does that tell us for
>> multi_bio + is_sync? Is the polling supposed to guarantee that all
>> constituent bios for a mult_bio request have completed before return?
> The polling really just ignores that, it doesn't take multi requests
> into account. We just poll for the first part of it.
>

Even for a single request though, the poll doesn't guarantee that the 
request just issued completes; it just says that some request from the 
same hctx completes, right?

--bijan


* Re: [RFC 2/2] io_uring: acquire ctx->uring_lock before calling io_issue_sqe()
  2020-01-28 23:49                     ` Bijan Mottahedeh
@ 2020-01-28 23:52                       ` Jens Axboe
  2020-01-31  3:36                         ` Bijan Mottahedeh
  0 siblings, 1 reply; 16+ messages in thread
From: Jens Axboe @ 2020-01-28 23:52 UTC (permalink / raw)
  To: Bijan Mottahedeh; +Cc: linux-block, Christoph Hellwig, Keith Busch

On 1/28/20 4:49 PM, Bijan Mottahedeh wrote:
> On 1/28/2020 3:37 PM, Jens Axboe wrote:
>> On 1/28/20 1:34 PM, Bijan Mottahedeh wrote:
>>> On 1/16/2020 1:26 PM, Jens Axboe wrote:
>>>> On 1/16/20 2:04 PM, Bijan Mottahedeh wrote:
>>>>> On 1/16/2020 12:02 PM, Jens Axboe wrote:
>>>>>> On 1/16/20 12:08 PM, Bijan Mottahedeh wrote:
>>>>>>> On 1/16/2020 8:22 AM, Jens Axboe wrote:
>>>>>>>> On 1/15/20 9:42 PM, Jens Axboe wrote:
>>>>>>>>> On 1/15/20 9:34 PM, Jens Axboe wrote:
>>>>>>>>>> On 1/15/20 7:37 PM, Bijan Mottahedeh wrote:
>>>>>>>>>>> io_issue_sqe() calls io_iopoll_req_issued() which manipulates poll_list,
>>>>>>>>>>> so acquire ctx->uring_lock beforehand similar to other instances of
>>>>>>>>>>> calling io_issue_sqe().
>>>>>>>>>> Is the below not enough?
>>>>>>>>> This should be better, we have two that set ->in_async, and only one
>>>>>>>>> doesn't hold the mutex.
>>>>>>>>>
>>>>>>>>> If this works for you, can you resend patch 2 with that? Also add a:
>>>>>>>>>
>>>>>>>>> Fixes: 8a4955ff1cca ("io_uring: sqthread should grab ctx->uring_lock for submissions")
>>>>>>>>>
>>>>>>>>> to it as well. Thanks!
>>>>>>>> I tested and queued this up:
>>>>>>>>
>>>>>>>> https://git.kernel.dk/cgit/linux-block/commit/?h=io_uring-5.5&id=11ba820bf163e224bf5dd44e545a66a44a5b1d7a
>>>>>>>>
>>>>>>>> Please let me know if this works, it sits on top of the ->result patch you
>>>>>>>> sent in.
>>>>>>>>
>>>>>>> That works, thanks.
>>>>>>>
>>>>>>> I'm however still seeing a use-after-free error in the request
>>>>>>> completion path in nvme_unmap_data().  It happens only when testing with
>>>>>>> large block sizes in fio, typically > 128k, e.g. bs=256k will always hit it.
>>>>>>>
>>>>>>> This is the error:
>>>>>>>
>>>>>>> DMA-API: nvme 0000:00:04.0: device driver tries to free DMA memory it
>>>>>>> has not allocated [device address=0x6b6b6b6b6b6b6b6b] [size=1802201963
>>>>>>> bytes]
>>>>>>>
>>>>>>> and this warning occasionally:
>>>>>>>
>>>>>>> WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IDLE);
>>>>>>>
>>>>>>> It seems like a request might be issued multiple times but I can't see
>>>>>>> anything in io_uring code that would account for it.
>>>>>> Both of them indicate reuse, and I agree I don't think it's io_uring. It
>>>>>> really feels like an issue with nvme when a poll queue is shared, but I
>>>>>> haven't been able to pin point what it is yet.
>>>>>>
>>>>>> The 128K is interesting, that would seem to indicate that it's related to
>>>>>> splitting of the IO (which would create > 1 IO per submitted IO).
>>>>>>
>>>>> Where does the split take place?  I had suspected that it might be
>>>>> related to the submit_bio() loop in __blkdev_direct_IO() but I don't
>>>>> think I saw multiple submit_bio() calls or maybe I missed something.
>>>> See the path from blk_mq_make_request() -> __blk_queue_split() ->
>>>> blk_bio_segment_split(). The bio is built and submitted, then split if
>>>> it violates any size constraints. The splits are submitted through
>>>> generic_make_request(), so that might be why you didn't see multiple
>>>> submit_bio() calls.
>>>>
>>> I think the problem is in __blkdev_direct_IO() and not related to
>>> request size:
>>>
>>>                           qc = submit_bio(bio);
>>>
>>>                           if (polled)
>>>                                   WRITE_ONCE(iocb->ki_cookie, qc);
>>>
>>>
>>> The first call to submit_bio() when dio->is_sync is not set won't have
>>> acquired a bio ref through bio_get() and so the bio/dio could be freed
>>> when ki_cookie is set.
>>>
>>> With the specific io_uring test, this happens because
>>> blk_mq_make_request()->blk_mq_get_request() fails and so terminates the
>>> request.
>>>
>>> As for the fix for polled io (!is_sync) case, I'm wondering if
>>> dio->multi_bio is really necessary in __blkdev_direct_IO(). Can we call
>>> bio_get() unconditionally after the call to bio_alloc_bioset(), set
>>> dio->ref = 1, and increment it for additional submit bio calls?  Would
>>> it make sense to do away with multi_bio?
>> It's not ideal, but not sure I see a better way to fix it. You see the
>> case on failure, which we could check for (don't write cookie if it's
>> invalid). But this won't fix the case where the IO complete fast, or
>> even immediately.
>>
>> Hence I think you're right, there's really no way around doing the bio
>> ref counting, even for the sync case. Care to cook up a patch we can
>> take a look at? I can run some high performance sync testing too, so we
>> can see how badly it might hurt.
> 
> Sure, I'll take a stab at it.

Thanks!

>>> Also, I'm not clear on how is_sync + mult_bio case is supposed to work.
>>> __blkdev_direct_IO() polls for *a* completion in the request's hctx and
>>> not *the* request completion itself, so what does that tell us for
>>> multi_bio + is_sync? Is the polling supposed to guarantee that all
>>> constituent bios for a mult_bio request have completed before return?
>> The polling really just ignores that, it doesn't take multi requests
>> into account. We just poll for the first part of it.
>>
> 
> Even for a single request though, the poll doesn't guarantee that the 
> request just issued completes; it just says that some request from the 
> same hctx completes, right?

Correct

-- 
Jens Axboe



* Re: [RFC 2/2] io_uring: acquire ctx->uring_lock before calling io_issue_sqe()
  2020-01-28 23:52                       ` Jens Axboe
@ 2020-01-31  3:36                         ` Bijan Mottahedeh
  0 siblings, 0 replies; 16+ messages in thread
From: Bijan Mottahedeh @ 2020-01-31  3:36 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, Christoph Hellwig, Keith Busch


>>>>>>>> I'm however still seeing a use-after-free error in the request
>>>>>>>> completion path in nvme_unmap_data().  It happens only when testing with
>>>>>>>> large block sizes in fio, typically > 128k, e.g. bs=256k will always hit it.
>>>>>>>>
>>>>>>>> This is the error:
>>>>>>>>
>>>>>>>> DMA-API: nvme 0000:00:04.0: device driver tries to free DMA memory it
>>>>>>>> has not allocated [device address=0x6b6b6b6b6b6b6b6b] [size=1802201963
>>>>>>>> bytes]
>>>>>>>>
>>>>>>>> and this warning occasionally:
>>>>>>>>
>>>>>>>> WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IDLE);
>>>>>>>>
>>>>>>>> It seems like a request might be issued multiple times but I can't see
>>>>>>>> anything in io_uring code that would account for it.
>>>>>>> Both of them indicate reuse, and I agree I don't think it's io_uring. It
>>>>>>> really feels like an issue with nvme when a poll queue is shared, but I
>>>>>>> haven't been able to pin point what it is yet.
>>>>>>>
>>>>>>> The 128K is interesting, that would seem to indicate that it's related to
>>>>>>> splitting of the IO (which would create > 1 IO per submitted IO).
>>>>>>>
>>>>>> Where does the split take place?  I had suspected that it might be
>>>>>> related to the submit_bio() loop in __blkdev_direct_IO() but I don't
>>>>>> think I saw multiple submit_bio() calls or maybe I missed something.
>>>>> See the path from blk_mq_make_request() -> __blk_queue_split() ->
>>>>> blk_bio_segment_split(). The bio is built and submitted, then split if
>>>>> it violates any size constraints. The splits are submitted through
>>>>> generic_make_request(), so that might be why you didn't see multiple
>>>>> submit_bio() calls.
>>>>>
>>>> I think the problem is in __blkdev_direct_IO() and not related to
>>>> request size:
>>>>
>>>>                            qc = submit_bio(bio);
>>>>
>>>>                            if (polled)
>>>>                                    WRITE_ONCE(iocb->ki_cookie, qc);
>>>>
>>>>
>>>> The first call to submit_bio() when dio->is_sync is not set won't have
>>>> acquired a bio ref through bio_get() and so the bio/dio could be freed
>>>> when ki_cookie is set.
>>>>
>>>> With the specific io_uring test, this happens because
>>>> blk_mq_make_request()->blk_mq_get_request() fails and so terminates the
>>>> request.
>>>>
>>>> As for the fix for polled io (!is_sync) case, I'm wondering if
>>>> dio->multi_bio is really necessary in __blkdev_direct_IO(). Can we call
>>>> bio_get() unconditionally after the call to bio_alloc_bioset(), set
>>>> dio->ref = 1, and increment it for additional submit bio calls?  Would
>>>> it make sense to do away with multi_bio?
>>> It's not ideal, but not sure I see a better way to fix it. You see the
>>> case on failure, which we could check for (don't write cookie if it's
>>> invalid). But this won't fix the case where the IO complete fast, or
>>> even immediately.
>>>
>>> Hence I think you're right, there's really no way around doing the bio
>>> ref counting, even for the sync case. Care to cook up a patch we can
>>> take a look at? I can run some high performance sync testing too, so we
>>> can see how badly it might hurt.
>> Sure, I'll take a stab at it.
> Thanks!

I sent it out.  When I tested with next-20200114, the fio test ran OK
for sync/async with 4k.  The sync test ran OK with 256k as well, but I
still hit the original use-after-free bug with 256k.

With next-20200130, however, I'm hitting the use-after-free bug even with
4k, so it is not a size-related issue.

I wasn't sure how to force a multi-bio case so that hasn't been tested.

Also, a question about the code below in io_complete_rw_iopoll():

         if (res != req->result)
                 req_set_fail_links(req);


req->result could be set to the size of the completed I/O request; is the
check OK in that case?

>>>> Also, I'm not clear on how is_sync + mult_bio case is supposed to work.
>>>> __blkdev_direct_IO() polls for *a* completion in the request's hctx and
>>>> not *the* request completion itself, so what does that tell us for
>>>> multi_bio + is_sync? Is the polling supposed to guarantee that all
>>>> constituent bios for a mult_bio request have completed before return?
>>> The polling really just ignores that, it doesn't take multi requests
>>> into account. We just poll for the first part of it.

In a multi-bio case, I think it would poll for the last part of it; I
haven't changed that.  I did add a check for a valid cookie, since I
think it would otherwise loop forever.

