linux-block.vger.kernel.org archive mirror
* Block device direct read EIO handling broken?
@ 2019-08-05 18:15 Darrick J. Wong
  2019-08-05 18:31 ` Jens Axboe
  0 siblings, 1 reply; 18+ messages in thread
From: Darrick J. Wong @ 2019-08-05 18:15 UTC (permalink / raw)
  To: Damien Le Moal; +Cc: linux-fsdevel, axboe, linux-block, xfs

Hi Damien,

I noticed a regression in xfs/747 (an unreleased xfstest for the
xfs_scrub media scanning feature) on 5.3-rc3.  I'll condense that down
to a simpler reproducer:

# dmsetup table
error-test: 0 209 linear 8:48 0
error-test: 209 1 error 
error-test: 210 6446894 linear 8:48 210

Basically we have a ~3G /dev/sdd, and we set up device mapper to fail I/O
for sector 209 and to pass the I/O through to the SCSI device everywhere else.

On 5.3-rc3, performing a directio pread of this range with a < 1M buffer
(in other words, a request spanning fewer than BIO_MAX_PAGES pages) yields
EIO like you'd expect:

# strace -e pread64 xfs_io -d -c 'pread -b 1024k 0k 1120k' /dev/mapper/error-test
pread64(3, 0x7f880e1c7000, 1048576, 0)  = -1 EIO (Input/output error)
pread: Input/output error
+++ exited with 0 +++

But doing it with a larger buffer succeeds(!):

# strace -e pread64 xfs_io -d -c 'pread -b 2048k 0k 1120k' /dev/mapper/error-test
pread64(3, "XFSB\0\0\20\0\0\0\0\0\0\fL\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1146880, 0) = 1146880
read 1146880/1146880 bytes at offset 0
1 MiB, 1 ops; 0.0009 sec (1.124 GiB/sec and 1052.6316 ops/sec)
+++ exited with 0 +++

(Note that the part of the buffer corresponding to the dm-error area is
uninitialized)

On 5.3-rc2, both commands would fail with EIO like you'd expect.  The
only change between rc2 and rc3 is commit 0eb6ddfb865c ("block: Fix
__blkdev_direct_IO() for bio fragments").

AFAICT we end up in __blkdev_direct_IO with a 1120K buffer, which gets
split into two bios: one for the first BIO_MAX_PAGES worth of data (1MB)
and a second one for the 96k after that.

I think the problem is that every time we submit a bio, we increase ret
by the size of that bio, but at the time we do that we have no idea if
the bio is going to succeed or not.  At the end of the function we do:

	if (!ret)
		ret = blk_status_to_errno(dio->bio.bi_status);

Which means that we only pick up the IO error if we haven't already set
ret.  I suppose that was useful for being able to return a short read,
but now that we always increment ret by the size of the bio, we act like
the whole buffer was read.  I tried a -rc2 kernel and found that 40% of
the time I'd get an EIO and the rest of the time I got a short read.

Not sure where to go from here, but something's not right...

--D

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Block device direct read EIO handling broken?
  2019-08-05 18:15 Block device direct read EIO handling broken? Darrick J. Wong
@ 2019-08-05 18:31 ` Jens Axboe
  2019-08-05 20:31   ` Jens Axboe
  0 siblings, 1 reply; 18+ messages in thread
From: Jens Axboe @ 2019-08-05 18:31 UTC (permalink / raw)
  To: Darrick J. Wong, Damien Le Moal; +Cc: linux-fsdevel, linux-block, xfs

On 8/5/19 11:15 AM, Darrick J. Wong wrote:
> Hi Damien,
> 
> I noticed a regression in xfs/747 (an unreleased xfstest for the
> xfs_scrub media scanning feature) on 5.3-rc3.  I'll condense that down
> to a simpler reproducer:
> 
> # dmsetup table
> error-test: 0 209 linear 8:48 0
> error-test: 209 1 error
> error-test: 210 6446894 linear 8:48 210
> 
> Basically we have a ~3G /dev/sdd and we set up device mapper to fail IO
> for sector 209 and to pass the io to the scsi device everywhere else.
> 
> On 5.3-rc3, performing a directio pread of this range with a < 1M buffer
> (in other words, a request for fewer than MAX_BIO_PAGES bytes) yields
> EIO like you'd expect:
> 
> # strace -e pread64 xfs_io -d -c 'pread -b 1024k 0k 1120k' /dev/mapper/error-test
> pread64(3, 0x7f880e1c7000, 1048576, 0)  = -1 EIO (Input/output error)
> pread: Input/output error
> +++ exited with 0 +++
> 
> But doing it with a larger buffer succeeds(!):
> 
> # strace -e pread64 xfs_io -d -c 'pread -b 2048k 0k 1120k' /dev/mapper/error-test
> pread64(3, "XFSB\0\0\20\0\0\0\0\0\0\fL\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1146880, 0) = 1146880
> read 1146880/1146880 bytes at offset 0
> 1 MiB, 1 ops; 0.0009 sec (1.124 GiB/sec and 1052.6316 ops/sec)
> +++ exited with 0 +++
> 
> (Note that the part of the buffer corresponding to the dm-error area is
> uninitialized)
> 
> On 5.3-rc2, both commands would fail with EIO like you'd expect.  The
> only change between rc2 and rc3 is commit 0eb6ddfb865c ("block: Fix
> __blkdev_direct_IO() for bio fragments").
> 
> AFAICT we end up in __blkdev_direct_IO with a 1120K buffer, which gets
> split into two bios: one for the first BIO_MAX_PAGES worth of data (1MB)
> and a second one for the 96k after that.
> 
> I think the problem is that every time we submit a bio, we increase ret
> by the size of that bio, but at the time we do that we have no idea if
> the bio is going to succeed or not.  At the end of the function we do:
> 
> 	if (!ret)
> 		ret = blk_status_to_errno(dio->bio.bi_status);
> 
> Which means that we only pick up the IO error if we haven't already set
> ret.  I suppose that was useful for being able to return a short read,
> but now that we always increment ret by the size of the bio, we act like
> the whole buffer was read.  I tried a -rc2 kernel and found that 40% of
> the time I'd get an EIO and the rest of the time I got a short read.
> 
> Not sure where to go from here, but something's not right...

I'll take a look.

-- 
Jens Axboe



* Re: Block device direct read EIO handling broken?
  2019-08-05 18:31 ` Jens Axboe
@ 2019-08-05 20:31   ` Jens Axboe
  2019-08-05 20:54     ` Jens Axboe
  2019-08-05 21:08     ` Damien Le Moal
  0 siblings, 2 replies; 18+ messages in thread
From: Jens Axboe @ 2019-08-05 20:31 UTC (permalink / raw)
  To: Darrick J. Wong, Damien Le Moal; +Cc: linux-fsdevel, linux-block, xfs

On 8/5/19 11:31 AM, Jens Axboe wrote:
> On 8/5/19 11:15 AM, Darrick J. Wong wrote:
>> Hi Damien,
>>
>> I noticed a regression in xfs/747 (an unreleased xfstest for the
>> xfs_scrub media scanning feature) on 5.3-rc3.  I'll condense that down
>> to a simpler reproducer:
>>
>> # dmsetup table
>> error-test: 0 209 linear 8:48 0
>> error-test: 209 1 error
>> error-test: 210 6446894 linear 8:48 210
>>
>> Basically we have a ~3G /dev/sdd and we set up device mapper to fail IO
>> for sector 209 and to pass the io to the scsi device everywhere else.
>>
>> On 5.3-rc3, performing a directio pread of this range with a < 1M buffer
>> (in other words, a request for fewer than MAX_BIO_PAGES bytes) yields
>> EIO like you'd expect:
>>
>> # strace -e pread64 xfs_io -d -c 'pread -b 1024k 0k 1120k' /dev/mapper/error-test
>> pread64(3, 0x7f880e1c7000, 1048576, 0)  = -1 EIO (Input/output error)
>> pread: Input/output error
>> +++ exited with 0 +++
>>
>> But doing it with a larger buffer succeeds(!):
>>
>> # strace -e pread64 xfs_io -d -c 'pread -b 2048k 0k 1120k' /dev/mapper/error-test
>> pread64(3, "XFSB\0\0\20\0\0\0\0\0\0\fL\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1146880, 0) = 1146880
>> read 1146880/1146880 bytes at offset 0
>> 1 MiB, 1 ops; 0.0009 sec (1.124 GiB/sec and 1052.6316 ops/sec)
>> +++ exited with 0 +++
>>
>> (Note that the part of the buffer corresponding to the dm-error area is
>> uninitialized)
>>
>> On 5.3-rc2, both commands would fail with EIO like you'd expect.  The
>> only change between rc2 and rc3 is commit 0eb6ddfb865c ("block: Fix
>> __blkdev_direct_IO() for bio fragments").
>>
>> AFAICT we end up in __blkdev_direct_IO with a 1120K buffer, which gets
>> split into two bios: one for the first BIO_MAX_PAGES worth of data (1MB)
>> and a second one for the 96k after that.
>>
>> I think the problem is that every time we submit a bio, we increase ret
>> by the size of that bio, but at the time we do that we have no idea if
>> the bio is going to succeed or not.  At the end of the function we do:
>>
>> 	if (!ret)
>> 		ret = blk_status_to_errno(dio->bio.bi_status);
>>
>> Which means that we only pick up the IO error if we haven't already set
>> ret.  I suppose that was useful for being able to return a short read,
>> but now that we always increment ret by the size of the bio, we act like
>> the whole buffer was read.  I tried a -rc2 kernel and found that 40% of
>> the time I'd get an EIO and the rest of the time I got a short read.
>>
>> Not sure where to go from here, but something's not right...
> 
> I'll take a look.

How about this? The old code did:

	if (!ret)
		ret = blk_status_to_errno(dio->bio.bi_status);
	if (likely(!ret))
		ret = dio->size;

where 'ret' was just tracking the error. With 'ret' now being the
positive IO size, we should overwrite it if ret is >= 0, not just if
it's zero.

Also kill a use-after-free.

diff --git a/fs/block_dev.c b/fs/block_dev.c
index a6f7c892cb4a..67c8e87c9481 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -386,6 +386,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
 
 	ret = 0;
 	for (;;) {
+		ssize_t this_size;
 		int err;
 
 		bio_set_dev(bio, bdev);
@@ -433,13 +434,14 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
 				polled = true;
 			}
 
+			this_size = bio->bi_iter.bi_size;
 			qc = submit_bio(bio);
 			if (qc == BLK_QC_T_EAGAIN) {
 				if (!ret)
 					ret = -EAGAIN;
 				goto error;
 			}
-			ret = dio->size;
+			ret += this_size;
 
 			if (polled)
 				WRITE_ONCE(iocb->ki_cookie, qc);
@@ -460,13 +462,14 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
 			atomic_inc(&dio->ref);
 		}
 
+		this_size = bio->bi_iter.bi_size;
 		qc = submit_bio(bio);
 		if (qc == BLK_QC_T_EAGAIN) {
 			if (!ret)
 				ret = -EAGAIN;
 			goto error;
 		}
-		ret = dio->size;
+		ret += this_size;
 
 		bio = bio_alloc(gfp, nr_pages);
 		if (!bio) {
@@ -494,7 +497,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
 	__set_current_state(TASK_RUNNING);
 
 out:
-	if (!ret)
+	if (ret >= 0)
 		ret = blk_status_to_errno(dio->bio.bi_status);
 
 	bio_put(&dio->bio);

-- 
Jens Axboe



* Re: Block device direct read EIO handling broken?
  2019-08-05 20:31   ` Jens Axboe
@ 2019-08-05 20:54     ` Jens Axboe
  2019-08-05 21:08     ` Damien Le Moal
  1 sibling, 0 replies; 18+ messages in thread
From: Jens Axboe @ 2019-08-05 20:54 UTC (permalink / raw)
  To: Darrick J. Wong, Damien Le Moal; +Cc: linux-fsdevel, linux-block, xfs

On 8/5/19 1:31 PM, Jens Axboe wrote:
> On 8/5/19 11:31 AM, Jens Axboe wrote:
>> On 8/5/19 11:15 AM, Darrick J. Wong wrote:
>>> Hi Damien,
>>>
>>> I noticed a regression in xfs/747 (an unreleased xfstest for the
>>> xfs_scrub media scanning feature) on 5.3-rc3.  I'll condense that down
>>> to a simpler reproducer:
>>>
>>> # dmsetup table
>>> error-test: 0 209 linear 8:48 0
>>> error-test: 209 1 error
>>> error-test: 210 6446894 linear 8:48 210
>>>
>>> Basically we have a ~3G /dev/sdd and we set up device mapper to fail IO
>>> for sector 209 and to pass the io to the scsi device everywhere else.
>>>
>>> On 5.3-rc3, performing a directio pread of this range with a < 1M buffer
>>> (in other words, a request for fewer than MAX_BIO_PAGES bytes) yields
>>> EIO like you'd expect:
>>>
>>> # strace -e pread64 xfs_io -d -c 'pread -b 1024k 0k 1120k' /dev/mapper/error-test
>>> pread64(3, 0x7f880e1c7000, 1048576, 0)  = -1 EIO (Input/output error)
>>> pread: Input/output error
>>> +++ exited with 0 +++
>>>
>>> But doing it with a larger buffer succeeds(!):
>>>
>>> # strace -e pread64 xfs_io -d -c 'pread -b 2048k 0k 1120k' /dev/mapper/error-test
>>> pread64(3, "XFSB\0\0\20\0\0\0\0\0\0\fL\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1146880, 0) = 1146880
>>> read 1146880/1146880 bytes at offset 0
>>> 1 MiB, 1 ops; 0.0009 sec (1.124 GiB/sec and 1052.6316 ops/sec)
>>> +++ exited with 0 +++
>>>
>>> (Note that the part of the buffer corresponding to the dm-error area is
>>> uninitialized)
>>>
>>> On 5.3-rc2, both commands would fail with EIO like you'd expect.  The
>>> only change between rc2 and rc3 is commit 0eb6ddfb865c ("block: Fix
>>> __blkdev_direct_IO() for bio fragments").
>>>
>>> AFAICT we end up in __blkdev_direct_IO with a 1120K buffer, which gets
>>> split into two bios: one for the first BIO_MAX_PAGES worth of data (1MB)
>>> and a second one for the 96k after that.
>>>
>>> I think the problem is that every time we submit a bio, we increase ret
>>> by the size of that bio, but at the time we do that we have no idea if
>>> the bio is going to succeed or not.  At the end of the function we do:
>>>
>>> 	if (!ret)
>>> 		ret = blk_status_to_errno(dio->bio.bi_status);
>>>
>>> Which means that we only pick up the IO error if we haven't already set
>>> ret.  I suppose that was useful for being able to return a short read,
>>> but now that we always increment ret by the size of the bio, we act like
>>> the whole buffer was read.  I tried a -rc2 kernel and found that 40% of
>>> the time I'd get an EIO and the rest of the time I got a short read.
>>>
>>> Not sure where to go from here, but something's not right...
>>
>> I'll take a look.
> 
> How about this? The old code did:
> 
> 	if (!ret)
> 		ret = blk_status_to_errno(dio->bio.bi_status);
> 	if (likely(!ret))
> 		ret = dio->size;
> 
> where 'ret' was just tracking the error. With 'ret' now being the
> positive IO size, we should overwrite it if ret is >= 0, not just if
> it's zero.
> 
> Also kill a use-after-free.

This should be better; we don't want to override 'ret' if bio->bi_status
doesn't indicate an error.


diff --git a/fs/block_dev.c b/fs/block_dev.c
index a6f7c892cb4a..1ac89f4fcbcc 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -386,6 +386,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
 
 	ret = 0;
 	for (;;) {
+		ssize_t this_size;
 		int err;
 
 		bio_set_dev(bio, bdev);
@@ -433,13 +434,14 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
 				polled = true;
 			}
 
+			this_size = bio->bi_iter.bi_size;
 			qc = submit_bio(bio);
 			if (qc == BLK_QC_T_EAGAIN) {
 				if (!ret)
 					ret = -EAGAIN;
 				goto error;
 			}
-			ret = dio->size;
+			ret += this_size;
 
 			if (polled)
 				WRITE_ONCE(iocb->ki_cookie, qc);
@@ -460,13 +462,14 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
 			atomic_inc(&dio->ref);
 		}
 
+		this_size = bio->bi_iter.bi_size;
 		qc = submit_bio(bio);
 		if (qc == BLK_QC_T_EAGAIN) {
 			if (!ret)
 				ret = -EAGAIN;
 			goto error;
 		}
-		ret = dio->size;
+		ret += this_size;
 
 		bio = bio_alloc(gfp, nr_pages);
 		if (!bio) {
@@ -494,7 +497,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
 	__set_current_state(TASK_RUNNING);
 
 out:
-	if (!ret)
+	if (ret >= 0 && dio->bio.bi_status)
 		ret = blk_status_to_errno(dio->bio.bi_status);
 
 	bio_put(&dio->bio);

-- 
Jens Axboe



* Re: Block device direct read EIO handling broken?
  2019-08-05 20:31   ` Jens Axboe
  2019-08-05 20:54     ` Jens Axboe
@ 2019-08-05 21:08     ` Damien Le Moal
  2019-08-05 21:25       ` Jens Axboe
  1 sibling, 1 reply; 18+ messages in thread
From: Damien Le Moal @ 2019-08-05 21:08 UTC (permalink / raw)
  To: Jens Axboe, Darrick J. Wong; +Cc: linux-fsdevel, linux-block, xfs

On 2019/08/06 5:31, Jens Axboe wrote:
> On 8/5/19 11:31 AM, Jens Axboe wrote:
>> On 8/5/19 11:15 AM, Darrick J. Wong wrote:
>>> Hi Damien,
>>>
>>> I noticed a regression in xfs/747 (an unreleased xfstest for the
>>> xfs_scrub media scanning feature) on 5.3-rc3.  I'll condense that down
>>> to a simpler reproducer:
>>>
>>> # dmsetup table
>>> error-test: 0 209 linear 8:48 0
>>> error-test: 209 1 error
>>> error-test: 210 6446894 linear 8:48 210
>>>
>>> Basically we have a ~3G /dev/sdd and we set up device mapper to fail IO
>>> for sector 209 and to pass the io to the scsi device everywhere else.
>>>
>>> On 5.3-rc3, performing a directio pread of this range with a < 1M buffer
>>> (in other words, a request for fewer than MAX_BIO_PAGES bytes) yields
>>> EIO like you'd expect:
>>>
>>> # strace -e pread64 xfs_io -d -c 'pread -b 1024k 0k 1120k' /dev/mapper/error-test
>>> pread64(3, 0x7f880e1c7000, 1048576, 0)  = -1 EIO (Input/output error)
>>> pread: Input/output error
>>> +++ exited with 0 +++
>>>
>>> But doing it with a larger buffer succeeds(!):
>>>
>>> # strace -e pread64 xfs_io -d -c 'pread -b 2048k 0k 1120k' /dev/mapper/error-test
>>> pread64(3, "XFSB\0\0\20\0\0\0\0\0\0\fL\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1146880, 0) = 1146880
>>> read 1146880/1146880 bytes at offset 0
>>> 1 MiB, 1 ops; 0.0009 sec (1.124 GiB/sec and 1052.6316 ops/sec)
>>> +++ exited with 0 +++
>>>
>>> (Note that the part of the buffer corresponding to the dm-error area is
>>> uninitialized)
>>>
>>> On 5.3-rc2, both commands would fail with EIO like you'd expect.  The
>>> only change between rc2 and rc3 is commit 0eb6ddfb865c ("block: Fix
>>> __blkdev_direct_IO() for bio fragments").
>>>
>>> AFAICT we end up in __blkdev_direct_IO with a 1120K buffer, which gets
>>> split into two bios: one for the first BIO_MAX_PAGES worth of data (1MB)
>>> and a second one for the 96k after that.
>>>
>>> I think the problem is that every time we submit a bio, we increase ret
>>> by the size of that bio, but at the time we do that we have no idea if
>>> the bio is going to succeed or not.  At the end of the function we do:
>>>
>>> 	if (!ret)
>>> 		ret = blk_status_to_errno(dio->bio.bi_status);
>>>
>>> Which means that we only pick up the IO error if we haven't already set
>>> ret.  I suppose that was useful for being able to return a short read,
>>> but now that we always increment ret by the size of the bio, we act like
>>> the whole buffer was read.  I tried a -rc2 kernel and found that 40% of
>>> the time I'd get an EIO and the rest of the time I got a short read.
>>>
>>> Not sure where to go from here, but something's not right...
>>
>> I'll take a look.
> 
> How about this? The old code did:
> 
> 	if (!ret)
> 		ret = blk_status_to_errno(dio->bio.bi_status);
> 	if (likely(!ret))
> 		ret = dio->size;
> 
> where 'ret' was just tracking the error. With 'ret' now being the
> positive IO size, we should overwrite it if ret is >= 0, not just if
> it's zero.
> 
> Also kill a use-after-free.
> 
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index a6f7c892cb4a..67c8e87c9481 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -386,6 +386,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
>  
>  	ret = 0;
>  	for (;;) {
> +		ssize_t this_size;
>  		int err;
>  
>  		bio_set_dev(bio, bdev);
> @@ -433,13 +434,14 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
>  				polled = true;
>  			}
>  
> +			this_size = bio->bi_iter.bi_size;
>  			qc = submit_bio(bio);
>  			if (qc == BLK_QC_T_EAGAIN) {
>  				if (!ret)
>  					ret = -EAGAIN;
>  				goto error;
>  			}
> -			ret = dio->size;
> +			ret += this_size;
>  
>  			if (polled)
>  				WRITE_ONCE(iocb->ki_cookie, qc);
> @@ -460,13 +462,14 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
>  			atomic_inc(&dio->ref);
>  		}
>  
> +		this_size = bio->bi_iter.bi_size;
>  		qc = submit_bio(bio);
>  		if (qc == BLK_QC_T_EAGAIN) {
>  			if (!ret)
>  				ret = -EAGAIN;
>  			goto error;
>  		}
> -		ret = dio->size;
> +		ret += this_size;
>  
>  		bio = bio_alloc(gfp, nr_pages);
>  		if (!bio) {
> @@ -494,7 +497,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
>  	__set_current_state(TASK_RUNNING);
>  
>  out:
> -	if (!ret)
> +	if (ret >= 0)
>  		ret = blk_status_to_errno(dio->bio.bi_status);
>  
>  	bio_put(&dio->bio);
> 

Jens,

I would set "this_size" when dio->size is being incremented though, to avoid
repeating it.

		if (nowait)
			bio->bi_opf |= (REQ_NOWAIT | REQ_NOWAIT_INLINE);

+		this_size = bio->bi_iter.bi_size;
-		dio->size += bio->bi_iter.bi_size;
+		dio->size += this_size;
		pos += bio->bi_iter.bi_size;

In any case, looking again at this code, it looks like there is a problem with
dio->size being incremented early, even for fragments that get BLK_QC_T_EAGAIN,
because dio->size is being used in blkdev_bio_end_io(). So an incorrect size can
be reported to user space in that case on completion (e.g. large asynchronous
no-wait dio that cannot be issued in one go).

So maybe something like this? (completely untested)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 75cc7f424b3a..77714e03c21e 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -349,7 +349,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
        loff_t pos = iocb->ki_pos;
        blk_qc_t qc = BLK_QC_T_NONE;
        gfp_t gfp;
-       ssize_t ret;
+       ssize_t ret = 0;

        if ((pos | iov_iter_alignment(iter)) &
            (bdev_logical_block_size(bdev) - 1))
@@ -386,6 +386,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)

        ret = 0;
        for (;;) {
+               size_t this_size;
                int err;

                bio_set_dev(bio, bdev);
@@ -421,7 +422,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
                if (nowait)
                        bio->bi_opf |= (REQ_NOWAIT | REQ_NOWAIT_INLINE);

-               dio->size += bio->bi_iter.bi_size;
+               this_size = bio->bi_iter.bi_size;
                pos += bio->bi_iter.bi_size;

                nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES);
@@ -435,11 +436,11 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)

                        qc = submit_bio(bio);
                        if (qc == BLK_QC_T_EAGAIN) {
-                               if (!ret)
+                               if (!dio->size)
                                        ret = -EAGAIN;
                                goto error;
                        }
-                       ret = dio->size;
+                       dio->size += this_size;

                        if (polled)
                                WRITE_ONCE(iocb->ki_cookie, qc);
@@ -462,15 +463,15 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)

                qc = submit_bio(bio);
                if (qc == BLK_QC_T_EAGAIN) {
-                       if (!ret)
+                       if (!dio->size)
                                ret = -EAGAIN;
                        goto error;
                }
-               ret = dio->size;
+               dio->size += this_size;

                bio = bio_alloc(gfp, nr_pages);
                if (!bio) {
-                       if (!ret)
+                       if (!dio->size)
                                ret = -EAGAIN;
                        goto error;
                }
@@ -496,10 +497,15 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
 out:
        if (!ret)
                ret = blk_status_to_errno(dio->bio.bi_status);
+       if (likely(!ret))
+               ret = dio->size;

        bio_put(&dio->bio);
        return ret;


-- 
Damien Le Moal
Western Digital Research


* Re: Block device direct read EIO handling broken?
  2019-08-05 21:08     ` Damien Le Moal
@ 2019-08-05 21:25       ` Jens Axboe
  2019-08-05 21:27         ` Damien Le Moal
  0 siblings, 1 reply; 18+ messages in thread
From: Jens Axboe @ 2019-08-05 21:25 UTC (permalink / raw)
  To: Damien Le Moal, Darrick J. Wong; +Cc: linux-fsdevel, linux-block, xfs

On 8/5/19 2:08 PM, Damien Le Moal wrote:
> On 2019/08/06 5:31, Jens Axboe wrote:
>> On 8/5/19 11:31 AM, Jens Axboe wrote:
>>> On 8/5/19 11:15 AM, Darrick J. Wong wrote:
>>>> Hi Damien,
>>>>
>>>> I noticed a regression in xfs/747 (an unreleased xfstest for the
>>>> xfs_scrub media scanning feature) on 5.3-rc3.  I'll condense that down
>>>> to a simpler reproducer:
>>>>
>>>> # dmsetup table
>>>> error-test: 0 209 linear 8:48 0
>>>> error-test: 209 1 error
>>>> error-test: 210 6446894 linear 8:48 210
>>>>
>>>> Basically we have a ~3G /dev/sdd and we set up device mapper to fail IO
>>>> for sector 209 and to pass the io to the scsi device everywhere else.
>>>>
>>>> On 5.3-rc3, performing a directio pread of this range with a < 1M buffer
>>>> (in other words, a request for fewer than MAX_BIO_PAGES bytes) yields
>>>> EIO like you'd expect:
>>>>
>>>> # strace -e pread64 xfs_io -d -c 'pread -b 1024k 0k 1120k' /dev/mapper/error-test
>>>> pread64(3, 0x7f880e1c7000, 1048576, 0)  = -1 EIO (Input/output error)
>>>> pread: Input/output error
>>>> +++ exited with 0 +++
>>>>
>>>> But doing it with a larger buffer succeeds(!):
>>>>
>>>> # strace -e pread64 xfs_io -d -c 'pread -b 2048k 0k 1120k' /dev/mapper/error-test
>>>> pread64(3, "XFSB\0\0\20\0\0\0\0\0\0\fL\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1146880, 0) = 1146880
>>>> read 1146880/1146880 bytes at offset 0
>>>> 1 MiB, 1 ops; 0.0009 sec (1.124 GiB/sec and 1052.6316 ops/sec)
>>>> +++ exited with 0 +++
>>>>
>>>> (Note that the part of the buffer corresponding to the dm-error area is
>>>> uninitialized)
>>>>
>>>> On 5.3-rc2, both commands would fail with EIO like you'd expect.  The
>>>> only change between rc2 and rc3 is commit 0eb6ddfb865c ("block: Fix
>>>> __blkdev_direct_IO() for bio fragments").
>>>>
>>>> AFAICT we end up in __blkdev_direct_IO with a 1120K buffer, which gets
>>>> split into two bios: one for the first BIO_MAX_PAGES worth of data (1MB)
>>>> and a second one for the 96k after that.
>>>>
>>>> I think the problem is that every time we submit a bio, we increase ret
>>>> by the size of that bio, but at the time we do that we have no idea if
>>>> the bio is going to succeed or not.  At the end of the function we do:
>>>>
>>>> 	if (!ret)
>>>> 		ret = blk_status_to_errno(dio->bio.bi_status);
>>>>
>>>> Which means that we only pick up the IO error if we haven't already set
>>>> ret.  I suppose that was useful for being able to return a short read,
>>>> but now that we always increment ret by the size of the bio, we act like
>>>> the whole buffer was read.  I tried a -rc2 kernel and found that 40% of
>>>> the time I'd get an EIO and the rest of the time I got a short read.
>>>>
>>>> Not sure where to go from here, but something's not right...
>>>
>>> I'll take a look.
>>
>> How about this? The old code did:
>>
>> 	if (!ret)
>> 		ret = blk_status_to_errno(dio->bio.bi_status);
>> 	if (likely(!ret))
>> 		ret = dio->size;
>>
>> where 'ret' was just tracking the error. With 'ret' now being the
>> positive IO size, we should overwrite it if ret is >= 0, not just if
>> it's zero.
>>
>> Also kill a use-after-free.
>>
>> diff --git a/fs/block_dev.c b/fs/block_dev.c
>> index a6f7c892cb4a..67c8e87c9481 100644
>> --- a/fs/block_dev.c
>> +++ b/fs/block_dev.c
>> @@ -386,6 +386,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
>>   
>>   	ret = 0;
>>   	for (;;) {
>> +		ssize_t this_size;
>>   		int err;
>>   
>>   		bio_set_dev(bio, bdev);
>> @@ -433,13 +434,14 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
>>   				polled = true;
>>   			}
>>   
>> +			this_size = bio->bi_iter.bi_size;
>>   			qc = submit_bio(bio);
>>   			if (qc == BLK_QC_T_EAGAIN) {
>>   				if (!ret)
>>   					ret = -EAGAIN;
>>   				goto error;
>>   			}
>> -			ret = dio->size;
>> +			ret += this_size;
>>   
>>   			if (polled)
>>   				WRITE_ONCE(iocb->ki_cookie, qc);
>> @@ -460,13 +462,14 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
>>   			atomic_inc(&dio->ref);
>>   		}
>>   
>> +		this_size = bio->bi_iter.bi_size;
>>   		qc = submit_bio(bio);
>>   		if (qc == BLK_QC_T_EAGAIN) {
>>   			if (!ret)
>>   				ret = -EAGAIN;
>>   			goto error;
>>   		}
>> -		ret = dio->size;
>> +		ret += this_size;
>>   
>>   		bio = bio_alloc(gfp, nr_pages);
>>   		if (!bio) {
>> @@ -494,7 +497,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
>>   	__set_current_state(TASK_RUNNING);
>>   
>>   out:
>> -	if (!ret)
>> +	if (ret >= 0)
>>   		ret = blk_status_to_errno(dio->bio.bi_status);
>>   
>>   	bio_put(&dio->bio);
>>
> 
> Jens,
> 
> I would set "this_size" when dio->size is being incremented though, to avoid
> repeating it.
> 
> 		if (nowait)
> 			bio->bi_opf |= (REQ_NOWAIT | REQ_NOWAIT_INLINE);
> 
> +		this_size = bio->bi_iter.bi_size;
> -		dio->size += bio->bi_iter.bi_size;
> +		dio->size += this_size;
> 		pos += bio->bi_iter.bi_size;
> 
> In any case, looking again at this code, it looks like there is a
> problem with dio->size being incremented early, even for fragments
> that get BLK_QC_T_EAGAIN, because dio->size is being used in
> blkdev_bio_end_io(). So an incorrect size can be reported to user
> space in that case on completion (e.g. large asynchronous no-wait dio
> that cannot be issued in one go).
> 
> So maybe something like this ? (completely untested)

I think that looks pretty good, I like not double accounting with
this_size and dio->size, and we retain the old style ordering for the
ret value.

-- 
Jens Axboe



* Re: Block device direct read EIO handling broken?
  2019-08-05 21:25       ` Jens Axboe
@ 2019-08-05 21:27         ` Damien Le Moal
  2019-08-05 21:28           ` Jens Axboe
  0 siblings, 1 reply; 18+ messages in thread
From: Damien Le Moal @ 2019-08-05 21:27 UTC (permalink / raw)
  To: Jens Axboe, Darrick J. Wong; +Cc: linux-fsdevel, linux-block, xfs

On 2019/08/06 6:26, Jens Axboe wrote:
>> In any case, looking again at this code, it looks like there is a
>> problem with dio->size being incremented early, even for fragments
>> that get BLK_QC_T_EAGAIN, because dio->size is being used in
>> blkdev_bio_end_io(). So an incorrect size can be reported to user
>> space in that case on completion (e.g. large asynchronous no-wait dio
>> that cannot be issued in one go).
>>
>> So maybe something like this ? (completely untested)
> 
> I think that looks pretty good, I like not double accounting with
> this_size and dio->size, and we retain the old style ordering for the
> ret value.

Do you want a proper patch with real testing backup? I can send that later today.

-- 
Damien Le Moal
Western Digital Research


* Re: Block device direct read EIO handling broken?
  2019-08-05 21:27         ` Damien Le Moal
@ 2019-08-05 21:28           ` Jens Axboe
  2019-08-05 21:59             ` Damien Le Moal
  0 siblings, 1 reply; 18+ messages in thread
From: Jens Axboe @ 2019-08-05 21:28 UTC (permalink / raw)
  To: Damien Le Moal, Darrick J. Wong; +Cc: linux-fsdevel, linux-block, xfs

On 8/5/19 2:27 PM, Damien Le Moal wrote:
> On 2019/08/06 6:26, Jens Axboe wrote:
>>> In any case, looking again at this code, it looks like there is a
>>> problem with dio->size being incremented early, even for fragments
>>> that get BLK_QC_T_EAGAIN, because dio->size is being used in
>>> blkdev_bio_end_io(). So an incorrect size can be reported to user
>>> space in that case on completion (e.g. large asynchronous no-wait dio
>>> that cannot be issued in one go).
>>>
>>> So maybe something like this ? (completely untested)
>>
>> I think that looks pretty good, I like not double accounting with
>> this_size and dio->size, and we retain the old style ordering for the
>> ret value.
> 
> Do you want a proper patch with real testing backup ? I can send that
> later today.

Yeah that'd be great, I like your approach better.

-- 
Jens Axboe



* Re: Block device direct read EIO handling broken?
  2019-08-05 21:28           ` Jens Axboe
@ 2019-08-05 21:59             ` Damien Le Moal
  2019-08-05 22:05               ` Damien Le Moal
  0 siblings, 1 reply; 18+ messages in thread
From: Damien Le Moal @ 2019-08-05 21:59 UTC (permalink / raw)
  To: Jens Axboe, Darrick J. Wong; +Cc: linux-fsdevel, linux-block, xfs

On 2019/08/06 6:28, Jens Axboe wrote:
> On 8/5/19 2:27 PM, Damien Le Moal wrote:
>> On 2019/08/06 6:26, Jens Axboe wrote:
>>>> In any case, looking again at this code, it looks like there is a
>>>> problem with dio->size being incremented early, even for fragments
>>>> that get BLK_QC_T_EAGAIN, because dio->size is being used in
>>>> blkdev_bio_end_io(). So an incorrect size can be reported to user
>>>> space in that case on completion (e.g. large asynchronous no-wait dio
>>>> that cannot be issued in one go).
>>>>
>>>> So maybe something like this ? (completely untested)
>>>
>>> I think that looks pretty good, I like not double accounting with
>>> this_size and dio->size, and we retain the old style ordering for the
>>> ret value.
>>
>> Do you want a proper patch with real testing backup ? I can send that
>> later today.
> 
> Yeah that'd be great, I like your approach better.
> 

Looking again, I think this is not it yet: dio->size is referenced after
submit_bio(), so blkdev_bio_end_io() may see the old value if the bio completes
before the dio->size increment. So the use-after-free is still there. And since
blkdev_bio_end_io() processes completion to user space only when dio->ref
drops to 0, adding an atomic_inc/dec(&dio->ref) around the loop would not help and
does not cover the single-BIO case. Any idea how to address this one?

-- 
Damien Le Moal
Western Digital Research


* Re: Block device direct read EIO handling broken?
  2019-08-05 21:59             ` Damien Le Moal
@ 2019-08-05 22:05               ` Damien Le Moal
  2019-08-06  0:05                 ` Damien Le Moal
  0 siblings, 1 reply; 18+ messages in thread
From: Damien Le Moal @ 2019-08-05 22:05 UTC (permalink / raw)
  To: Jens Axboe, Darrick J. Wong; +Cc: linux-fsdevel, linux-block, xfs

On 2019/08/06 6:59, Damien Le Moal wrote:
> On 2019/08/06 6:28, Jens Axboe wrote:
>> On 8/5/19 2:27 PM, Damien Le Moal wrote:
>>> On 2019/08/06 6:26, Jens Axboe wrote:
>>>>> In any case, looking again at this code, it looks like there is a
>>>>> problem with dio->size being incremented early, even for fragments
>>>>> that get BLK_QC_T_EAGAIN, because dio->size is being used in
>>>>> blkdev_bio_end_io(). So an incorrect size can be reported to user
>>>>> space in that case on completion (e.g. large asynchronous no-wait dio
>>>>> that cannot be issued in one go).
>>>>>
>>>>> So maybe something like this ? (completely untested)
>>>>
>>>> I think that looks pretty good, I like not double accounting with
>>>> this_size and dio->size, and we retain the old style ordering for the
>>>> ret value.
>>>
>>> Do you want a proper patch with real testing backup ? I can send that
>>> later today.
>>
>> Yeah that'd be great, I like your approach better.
>>
> 
> Looking again, I think this is not it yet: dio->size is being referenced after
> submit_bio(), so blkdev_bio_end_io() may see the old value if the bio completes
> before dio->size increment. So the use-after-free is still there. And since
> blkdev_bio_end_io() processes completion to user space only when dio->ref
> becomes 0, adding an atomic_inc/dec(&dio->ref) over the loop would not help and
> does not cover the single BIO case. Any idea how to address this one ?
> 

Maybe adding a bio_get/put() around the 2 places that do submit_bio() would work,
for all cases (single/multi BIO, sync & async). E.g.:

+                       bio_get(bio);
                        qc = submit_bio(bio);
                        if (qc == BLK_QC_T_EAGAIN) {
                                if (!dio->size)
                                        ret = -EAGAIN;
+                               bio_put(bio);
                                goto error;
                        }
                        dio->size += bio_size;
+                       bio_put(bio);

Thoughts?

-- 
Damien Le Moal
Western Digital Research


* Re: Block device direct read EIO handling broken?
  2019-08-05 22:05               ` Damien Le Moal
@ 2019-08-06  0:05                 ` Damien Le Moal
  2019-08-06  0:23                   ` Dave Chinner
  2019-08-06  4:09                   ` Jens Axboe
  0 siblings, 2 replies; 18+ messages in thread
From: Damien Le Moal @ 2019-08-06  0:05 UTC (permalink / raw)
  To: Jens Axboe, Darrick J. Wong; +Cc: linux-fsdevel, linux-block, xfs

On 2019/08/06 7:05, Damien Le Moal wrote:
> On 2019/08/06 6:59, Damien Le Moal wrote:
>> On 2019/08/06 6:28, Jens Axboe wrote:
>>> On 8/5/19 2:27 PM, Damien Le Moal wrote:
>>>> On 2019/08/06 6:26, Jens Axboe wrote:
>>>>>> In any case, looking again at this code, it looks like there is a
>>>>>> problem with dio->size being incremented early, even for fragments
>>>>>> that get BLK_QC_T_EAGAIN, because dio->size is being used in
>>>>>> blkdev_bio_end_io(). So an incorrect size can be reported to user
>>>>>> space in that case on completion (e.g. large asynchronous no-wait dio
>>>>>> that cannot be issued in one go).
>>>>>>
>>>>>> So maybe something like this ? (completely untested)
>>>>>
>>>>> I think that looks pretty good, I like not double accounting with
>>>>> this_size and dio->size, and we retain the old style ordering for the
>>>>> ret value.
>>>>
>>>> Do you want a proper patch with real testing backup ? I can send that
>>>> later today.
>>>
>>> Yeah that'd be great, I like your approach better.
>>>
>>
>> Looking again, I think this is not it yet: dio->size is being referenced after
>> submit_bio(), so blkdev_bio_end_io() may see the old value if the bio completes
>> before dio->size increment. So the use-after-free is still there. And since
>> blkdev_bio_end_io() processes completion to user space only when dio->ref
>> becomes 0, adding an atomic_inc/dec(&dio->ref) over the loop would not help and
>> does not cover the single BIO case. Any idea how to address this one ?
>>
> 
> May be add a bio_get/put() over the 2 places that do submit_bio() would work,
> for all cases (single/multi BIO, sync & async). E.g.:
> 
> +                       bio_get(bio);
>                         qc = submit_bio(bio);
>                         if (qc == BLK_QC_T_EAGAIN) {
>                                 if (!dio->size)
>                                         ret = -EAGAIN;
> +                               bio_put(bio);
>                                 goto error;
>                         }
>                         dio->size += bio_size;
> +                       bio_put(bio);
> 
> Thoughts ?
> 

That does not work since the reference to dio->size in blkdev_bio_end_io()
depends on atomic_dec_and_test(&dio->ref), which counts the BIO fragments for the
dio (+1 for the async multi-bio case). So completion of the last bio can still
reference the old value of dio->size.

Adding a bio_get/put() on dio->bio ensures that the dio stays around, but does not
prevent the use of the wrong dio->size. Adding an additional
atomic_inc/dec(&dio->ref) would prevent that, but then we would need to handle dio
completion at the end of __blkdev_direct_IO() if all BIO fragments have already
completed at that point. That is more plumbing: relying completely on dio->ref
for all cases and removing the dio->multi_bio management.

Something like this:

diff --git a/fs/block_dev.c b/fs/block_dev.c
index a6f7c892cb4a..51d36baa367c 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -279,7 +279,6 @@ struct blkdev_dio {
        };
        size_t                  size;
        atomic_t                ref;
-       bool                    multi_bio : 1;
        bool                    should_dirty : 1;
        bool                    is_sync : 1;
        struct bio              bio;
@@ -295,15 +294,14 @@ static int blkdev_iopoll(struct kiocb *kiocb, bool wait)
        return blk_poll(q, READ_ONCE(kiocb->ki_cookie), wait);
 }

-static void blkdev_bio_end_io(struct bio *bio)
+static inline void blkdev_get_dio(struct blkdev_dio *dio)
 {
-       struct blkdev_dio *dio = bio->bi_private;
-       bool should_dirty = dio->should_dirty;
-
-       if (bio->bi_status && !dio->bio.bi_status)
-               dio->bio.bi_status = bio->bi_status;
+       atomic_inc(&dio->ref);
+}

-       if (!dio->multi_bio || atomic_dec_and_test(&dio->ref)) {
+static void blkdev_put_dio(struct blkdev_dio *dio)
+{
+       if (atomic_dec_and_test(&dio->ref)) {
                if (!dio->is_sync) {
                        struct kiocb *iocb = dio->iocb;
                        ssize_t ret;
@@ -316,15 +314,25 @@ static void blkdev_bio_end_io(struct bio *bio)
                        }

                        dio->iocb->ki_complete(iocb, ret, 0);
-                       if (dio->multi_bio)
-                               bio_put(&dio->bio);
                } else {
                        struct task_struct *waiter = dio->waiter;

                        WRITE_ONCE(dio->waiter, NULL);
                        blk_wake_io_task(waiter);
                }
+               bio_put(&dio->bio);
        }
+}
+
+static void blkdev_bio_end_io(struct bio *bio)
+{
+       struct blkdev_dio *dio = bio->bi_private;
+       bool should_dirty = dio->should_dirty;
+
+       if (bio->bi_status && !dio->bio.bi_status)
+               dio->bio.bi_status = bio->bi_status;
+
+       blkdev_put_dio(dio);

        if (should_dirty) {
                bio_check_pages_dirty(bio);
@@ -349,7 +357,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter
*iter, int nr_pages)
        loff_t pos = iocb->ki_pos;
        blk_qc_t qc = BLK_QC_T_NONE;
        gfp_t gfp;
-       ssize_t ret;
+       ssize_t ret = 0;

        if ((pos | iov_iter_alignment(iter)) &
            (bdev_logical_block_size(bdev) - 1))
@@ -366,17 +374,24 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter
*iter, int nr_pages)

        dio = container_of(bio, struct blkdev_dio, bio);
        dio->is_sync = is_sync = is_sync_kiocb(iocb);
-       if (dio->is_sync) {
+       if (dio->is_sync)
                dio->waiter = current;
-               bio_get(bio);
-       } else {
+       else
                dio->iocb = iocb;
-       }

        dio->size = 0;
-       dio->multi_bio = false;
        dio->should_dirty = is_read && iter_is_iovec(iter);

+       /*
+        * Get an extra reference on the first bio to ensure that the dio
+        * structure which is embedded into the first bio stays around for AIOs
+        * and while we are still doing dio->size accounting in the loop below.
+        * For dio->is_sync case, the extra reference is released on exit of
+        * this function.
+        */
+       bio_get(bio);
+       blkdev_get_dio(dio);
+
        /*
         * Don't plug for HIPRI/polled IO, as those should go straight
         * to issue
@@ -386,6 +401,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter
*iter, int nr_pages)

        ret = 0;
        for (;;) {
+               size_t bio_size;
                int err;

                bio_set_dev(bio, bdev);
@@ -421,7 +437,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter
*iter, int nr_pages)
                if (nowait)
                        bio->bi_opf |= (REQ_NOWAIT | REQ_NOWAIT_INLINE);

-               dio->size += bio->bi_iter.bi_size;
+               bio_size = bio->bi_iter.bi_size;
                pos += bio->bi_iter.bi_size;

                nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES);

@@ -435,42 +451,30 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter
*iter, int nr_pages)

                        qc = submit_bio(bio);
                        if (qc == BLK_QC_T_EAGAIN) {
-                               if (!ret)
+                               if (!dio->size)
                                        ret = -EAGAIN;
                                goto error;
                        }
-                       ret = dio->size;
+                       dio->size += bio_size;

                        if (polled)
                                WRITE_ONCE(iocb->ki_cookie, qc);
                        break;
                }

-               if (!dio->multi_bio) {
-                       /*
-                        * AIO needs an extra reference to ensure the dio
-                        * structure which is embedded into the first bio
-                        * stays around.
-                        */
-                       if (!is_sync)
-                               bio_get(bio);
-                       dio->multi_bio = true;
-                       atomic_set(&dio->ref, 2);
-               } else {
-                       atomic_inc(&dio->ref);
-               }
+               blkdev_get_dio(dio);

                qc = submit_bio(bio);
                if (qc == BLK_QC_T_EAGAIN) {
-                       if (!ret)
+                       if (!dio->size)
                                ret = -EAGAIN;
                        goto error;
                }
-               ret = dio->size;
+               dio->size += bio_size;

                bio = bio_alloc(gfp, nr_pages);
                if (!bio) {
-                       if (!ret)
+                       if (!dio->size)
                                ret = -EAGAIN;
                        goto error;
                }
@@ -496,8 +500,10 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter
*iter, int nr_pages)
 out:
        if (!ret)
                ret = blk_status_to_errno(dio->bio.bi_status);
+       if (likely(!ret))
+               ret = dio->size;

-       bio_put(&dio->bio);
+       blkdev_put_dio(dio);
        return ret;
 error:
        if (!is_poll)

I think that works for all cases. Starting tests. Let me know if you think this
is not appropriate.


-- 
Damien Le Moal
Western Digital Research


* Re: Block device direct read EIO handling broken?
  2019-08-06  0:05                 ` Damien Le Moal
@ 2019-08-06  0:23                   ` Dave Chinner
  2019-08-06 11:32                     ` Damien Le Moal
  2019-08-06  4:09                   ` Jens Axboe
  1 sibling, 1 reply; 18+ messages in thread
From: Dave Chinner @ 2019-08-06  0:23 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Jens Axboe, Darrick J. Wong, linux-fsdevel, linux-block, xfs

On Tue, Aug 06, 2019 at 12:05:51AM +0000, Damien Le Moal wrote:
> On 2019/08/06 7:05, Damien Le Moal wrote:
> > On 2019/08/06 6:59, Damien Le Moal wrote:
> >> On 2019/08/06 6:28, Jens Axboe wrote:
> >>> On 8/5/19 2:27 PM, Damien Le Moal wrote:
> >>>> On 2019/08/06 6:26, Jens Axboe wrote:
> >>>>>> In any case, looking again at this code, it looks like there is a
> >>>>>> problem with dio->size being incremented early, even for fragments
> >>>>>> that get BLK_QC_T_EAGAIN, because dio->size is being used in
> >>>>>> blkdev_bio_end_io(). So an incorrect size can be reported to user
> >>>>>> space in that case on completion (e.g. large asynchronous no-wait dio
> >>>>>> that cannot be issued in one go).
> >>>>>>
> >>>>>> So maybe something like this ? (completely untested)
> >>>>>
> >>>>> I think that looks pretty good, I like not double accounting with
> >>>>> this_size and dio->size, and we retain the old style ordering for the
> >>>>> ret value.
> >>>>
> >>>> Do you want a proper patch with real testing backup ? I can send that
> >>>> later today.
> >>>
> >>> Yeah that'd be great, I like your approach better.
> >>>
> >>
> >> Looking again, I think this is not it yet: dio->size is being referenced after
> >> submit_bio(), so blkdev_bio_end_io() may see the old value if the bio completes
> >> before dio->size increment. So the use-after-free is still there. And since
> >> blkdev_bio_end_io() processes completion to user space only when dio->ref
> >> becomes 0, adding an atomic_inc/dec(&dio->ref) over the loop would not help and
> >> does not cover the single BIO case. Any idea how to address this one ?
> >>
> > 
> > May be add a bio_get/put() over the 2 places that do submit_bio() would work,
> > for all cases (single/multi BIO, sync & async). E.g.:
> > 
> > +                       bio_get(bio);
> >                         qc = submit_bio(bio);
> >                         if (qc == BLK_QC_T_EAGAIN) {
> >                                 if (!dio->size)
> >                                         ret = -EAGAIN;
> > +                               bio_put(bio);
> >                                 goto error;
> >                         }
> >                         dio->size += bio_size;
> > +                       bio_put(bio);
> > 
> > Thoughts ?
> > 
> 
> That does not work since the reference to dio->size in blkdev_bio_end_io()
> depends on atomic_dec_and_test(&dio->ref) which counts the BIO fragments for the
> dio (+1 for async multi-bio case). So completion of the last bio can still
> reference the old value of dio->size.

Didn't we fix this same use-after-free in iomap_dio_rw() in commit
4ea899ead278 ("iomap: fix a use after free in iomap_dio_rw")?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: Block device direct read EIO handling broken?
  2019-08-06  0:05                 ` Damien Le Moal
  2019-08-06  0:23                   ` Dave Chinner
@ 2019-08-06  4:09                   ` Jens Axboe
  2019-08-06  7:05                     ` Damien Le Moal
  2019-08-06 13:23                     ` Damien Le Moal
  1 sibling, 2 replies; 18+ messages in thread
From: Jens Axboe @ 2019-08-06  4:09 UTC (permalink / raw)
  To: Damien Le Moal, Darrick J. Wong; +Cc: linux-fsdevel, linux-block, xfs

On 8/5/19 5:05 PM, Damien Le Moal wrote:
> On 2019/08/06 7:05, Damien Le Moal wrote:
>> On 2019/08/06 6:59, Damien Le Moal wrote:
>>> On 2019/08/06 6:28, Jens Axboe wrote:
>>>> On 8/5/19 2:27 PM, Damien Le Moal wrote:
>>>>> On 2019/08/06 6:26, Jens Axboe wrote:
>>>>>>> In any case, looking again at this code, it looks like there is a
>>>>>>> problem with dio->size being incremented early, even for fragments
>>>>>>> that get BLK_QC_T_EAGAIN, because dio->size is being used in
>>>>>>> blkdev_bio_end_io(). So an incorrect size can be reported to user
>>>>>>> space in that case on completion (e.g. large asynchronous no-wait dio
>>>>>>> that cannot be issued in one go).
>>>>>>>
>>>>>>> So maybe something like this ? (completely untested)
>>>>>>
>>>>>> I think that looks pretty good, I like not double accounting with
>>>>>> this_size and dio->size, and we retain the old style ordering for the
>>>>>> ret value.
>>>>>
>>>>> Do you want a proper patch with real testing backup ? I can send that
>>>>> later today.
>>>>
>>>> Yeah that'd be great, I like your approach better.
>>>>
>>>
>>> Looking again, I think this is not it yet: dio->size is being referenced after
>>> submit_bio(), so blkdev_bio_end_io() may see the old value if the bio completes
>>> before dio->size increment. So the use-after-free is still there. And since
>>> blkdev_bio_end_io() processes completion to user space only when dio->ref
>>> becomes 0, adding an atomic_inc/dec(&dio->ref) over the loop would not help and
>>> does not cover the single BIO case. Any idea how to address this one ?
>>>
>>
>> May be add a bio_get/put() over the 2 places that do submit_bio() would work,
>> for all cases (single/multi BIO, sync & async). E.g.:
>>
>> +                       bio_get(bio);
>>                          qc = submit_bio(bio);
>>                          if (qc == BLK_QC_T_EAGAIN) {
>>                                  if (!dio->size)
>>                                          ret = -EAGAIN;
>> +                               bio_put(bio);
>>                                  goto error;
>>                          }
>>                          dio->size += bio_size;
>> +                       bio_put(bio);
>>
>> Thoughts ?
>>
> 
> That does not work since the reference to dio->size in
> blkdev_bio_end_io() depends on atomic_dec_and_test(&dio->ref) which
> counts the BIO fragments for the dio (+1 for async multi-bio case). So
> completion of the last bio can still reference the old value of
> dio->size.
> 
> Adding a bio_get/put() on dio->bio ensures that dio stays around, but
> does not prevent the use of the wrong dio->size. Adding an additional
> atomic_inc/dec(&dio->ref) would prevent that, but we would need to
> handle dio completion at the end of __blkdev_direct_IO() if all BIO
> fragments already completed at that point. That is a lot more plumbing
> needed, relying completely on dio->ref for all cases, thus removing
> the dio->multi_bio management.
> 
> Something like this:

Don't like this, as it adds unnecessary atomics for the sync case.
What's wrong with just adjusting dio->size if we get BLK_QC_T_EAGAIN?
It's safe to do so, since we're doing the final put later. We just can't
do it for the normal case of submit_bio() succeeding. Kill the new 'ret'
usage and return to what we had as well, it's more readable too imho.

Totally untested...

diff --git a/fs/block_dev.c b/fs/block_dev.c
index a6f7c892cb4a..131e2e0582a6 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -349,7 +349,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
 	loff_t pos = iocb->ki_pos;
 	blk_qc_t qc = BLK_QC_T_NONE;
 	gfp_t gfp;
-	ssize_t ret;
+	int ret;
 
 	if ((pos | iov_iter_alignment(iter)) &
 	    (bdev_logical_block_size(bdev) - 1))
@@ -386,8 +386,6 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
 
 	ret = 0;
 	for (;;) {
-		int err;
-
 		bio_set_dev(bio, bdev);
 		bio->bi_iter.bi_sector = pos >> 9;
 		bio->bi_write_hint = iocb->ki_hint;
@@ -395,10 +393,8 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
 		bio->bi_end_io = blkdev_bio_end_io;
 		bio->bi_ioprio = iocb->ki_ioprio;
 
-		err = bio_iov_iter_get_pages(bio, iter);
-		if (unlikely(err)) {
-			if (!ret)
-				ret = err;
+		ret = bio_iov_iter_get_pages(bio, iter);
+		if (unlikely(ret)) {
 			bio->bi_status = BLK_STS_IOERR;
 			bio_endio(bio);
 			break;
@@ -421,7 +417,6 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
 		if (nowait)
 			bio->bi_opf |= (REQ_NOWAIT | REQ_NOWAIT_INLINE);
 
-		dio->size += bio->bi_iter.bi_size;
 		pos += bio->bi_iter.bi_size;
 
 		nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES);
@@ -433,13 +428,13 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
 				polled = true;
 			}
 
+			dio->size += bio->bi_iter.bi_size;
 			qc = submit_bio(bio);
 			if (qc == BLK_QC_T_EAGAIN) {
-				if (!ret)
-					ret = -EAGAIN;
+				dio->size -= bio->bi_iter.bi_size;
+				ret = -EAGAIN;
 				goto error;
 			}
-			ret = dio->size;
 
 			if (polled)
 				WRITE_ONCE(iocb->ki_cookie, qc);
@@ -460,18 +455,17 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
 			atomic_inc(&dio->ref);
 		}
 
+		dio->size += bio->bi_iter.bi_size;
 		qc = submit_bio(bio);
 		if (qc == BLK_QC_T_EAGAIN) {
-			if (!ret)
-				ret = -EAGAIN;
+			dio->size -= bio->bi_iter.bi_size;
+			ret = -EAGAIN;
 			goto error;
 		}
-		ret = dio->size;
 
 		bio = bio_alloc(gfp, nr_pages);
 		if (!bio) {
-			if (!ret)
-				ret = -EAGAIN;
+			ret = -EAGAIN;
 			goto error;
 		}
 	}
@@ -496,6 +490,8 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
 out:
 	if (!ret)
 		ret = blk_status_to_errno(dio->bio.bi_status);
+	if (likely(!ret))
+		ret = dio->size;
 
 	bio_put(&dio->bio);
 	return ret;

-- 
Jens Axboe



* Re: Block device direct read EIO handling broken?
  2019-08-06  4:09                   ` Jens Axboe
@ 2019-08-06  7:05                     ` Damien Le Moal
  2019-08-06 13:34                       ` Jens Axboe
  2019-08-06 13:23                     ` Damien Le Moal
  1 sibling, 1 reply; 18+ messages in thread
From: Damien Le Moal @ 2019-08-06  7:05 UTC (permalink / raw)
  To: Jens Axboe, Darrick J. Wong; +Cc: linux-fsdevel, linux-block, xfs

On 2019/08/06 13:09, Jens Axboe wrote:
> On 8/5/19 5:05 PM, Damien Le Moal wrote:
>> On 2019/08/06 7:05, Damien Le Moal wrote:
>>> On 2019/08/06 6:59, Damien Le Moal wrote:
>>>> On 2019/08/06 6:28, Jens Axboe wrote:
>>>>> On 8/5/19 2:27 PM, Damien Le Moal wrote:
>>>>>> On 2019/08/06 6:26, Jens Axboe wrote:
>>>>>>>> In any case, looking again at this code, it looks like there is a
>>>>>>>> problem with dio->size being incremented early, even for fragments
>>>>>>>> that get BLK_QC_T_EAGAIN, because dio->size is being used in
>>>>>>>> blkdev_bio_end_io(). So an incorrect size can be reported to user
>>>>>>>> space in that case on completion (e.g. large asynchronous no-wait dio
>>>>>>>> that cannot be issued in one go).
>>>>>>>>
>>>>>>>> So maybe something like this ? (completely untested)
>>>>>>>
>>>>>>> I think that looks pretty good, I like not double accounting with
>>>>>>> this_size and dio->size, and we retain the old style ordering for the
>>>>>>> ret value.
>>>>>>
>>>>>> Do you want a proper patch with real testing backup ? I can send that
>>>>>> later today.
>>>>>
>>>>> Yeah that'd be great, I like your approach better.
>>>>>
>>>>
>>>> Looking again, I think this is not it yet: dio->size is being referenced after
>>>> submit_bio(), so blkdev_bio_end_io() may see the old value if the bio completes
>>>> before dio->size increment. So the use-after-free is still there. And since
>>>> blkdev_bio_end_io() processes completion to user space only when dio->ref
>>>> becomes 0, adding an atomic_inc/dec(&dio->ref) over the loop would not help and
>>>> does not cover the single BIO case. Any idea how to address this one ?
>>>>
>>>
>>> May be add a bio_get/put() over the 2 places that do submit_bio() would work,
>>> for all cases (single/multi BIO, sync & async). E.g.:
>>>
>>> +                       bio_get(bio);
>>>                          qc = submit_bio(bio);
>>>                          if (qc == BLK_QC_T_EAGAIN) {
>>>                                  if (!dio->size)
>>>                                          ret = -EAGAIN;
>>> +                               bio_put(bio);
>>>                                  goto error;
>>>                          }
>>>                          dio->size += bio_size;
>>> +                       bio_put(bio);
>>>
>>> Thoughts ?
>>>
>>
>> That does not work since the reference to dio->size in
>> blkdev_bio_end_io() depends on atomic_dec_and_test(&dio->ref) which
>> counts the BIO fragments for the dio (+1 for async multi-bio case). So
>> completion of the last bio can still reference the old value of
>> dio->size.
>>
>> Adding a bio_get/put() on dio->bio ensures that dio stays around, but
>> does not prevent the use of the wrong dio->size. Adding an additional
>> atomic_inc/dec(&dio->ref) would prevent that, but we would need to
>> handle dio completion at the end of __blkdev_direct_IO() if all BIO
>> fragments already completed at that point. That is a lot more plumbing
>> needed, relying completely on dio->ref for all cases, thus removing
>> the dio->multi_bio management.
>>
>> Something like this:
> 
> Don't like this, as it adds unnecessary atomics for the sync case.
> What's wrong with just adjusting dio->size if we get BLK_QC_T_EAGAIN?
> It's safe to do so, since we're doing the final put later. We just can't
> do it for the normal case of submit_bio() succeeding. Kill the new 'ret'
> usage and return to what we had as well, it's more readable too imho.
> 
> Totally untested...
> 
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index a6f7c892cb4a..131e2e0582a6 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -349,7 +349,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
>  	loff_t pos = iocb->ki_pos;
>  	blk_qc_t qc = BLK_QC_T_NONE;
>  	gfp_t gfp;
> -	ssize_t ret;
> +	int ret;
>  
>  	if ((pos | iov_iter_alignment(iter)) &
>  	    (bdev_logical_block_size(bdev) - 1))
> @@ -386,8 +386,6 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
>  
>  	ret = 0;
>  	for (;;) {
> -		int err;
> -
>  		bio_set_dev(bio, bdev);
>  		bio->bi_iter.bi_sector = pos >> 9;
>  		bio->bi_write_hint = iocb->ki_hint;
> @@ -395,10 +393,8 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
>  		bio->bi_end_io = blkdev_bio_end_io;
>  		bio->bi_ioprio = iocb->ki_ioprio;
>  
> -		err = bio_iov_iter_get_pages(bio, iter);
> -		if (unlikely(err)) {
> -			if (!ret)
> -				ret = err;
> +		ret = bio_iov_iter_get_pages(bio, iter);
> +		if (unlikely(ret)) {
>  			bio->bi_status = BLK_STS_IOERR;
>  			bio_endio(bio);
>  			break;
> @@ -421,7 +417,6 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
>  		if (nowait)
>  			bio->bi_opf |= (REQ_NOWAIT | REQ_NOWAIT_INLINE);
>  
> -		dio->size += bio->bi_iter.bi_size;
>  		pos += bio->bi_iter.bi_size;
>  
>  		nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES);
> @@ -433,13 +428,13 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
>  				polled = true;
>  			}
>  
> +			dio->size += bio->bi_iter.bi_size;
>  			qc = submit_bio(bio);
>  			if (qc == BLK_QC_T_EAGAIN) {
> -				if (!ret)
> -					ret = -EAGAIN;
> +				dio->size -= bio->bi_iter.bi_size;

There is a reference to the bio after submit_bio() may have freed it here
(bio->bi_iter.bi_size). Easy to fix though. Also, with this, the bio_endio()
call within submit_bio() for the EAGAIN failure will see a dio->size that is too
large, since it includes this failed bio. So this does not work.

One thing I am not 100% sure about is the nowait case with a fragmented dio: if
processing stops on a BLK_QC_T_EAGAIN, the code in blkdev_bio_end_io() will
complete the iocb with -EAGAIN, while this code (the submission path) will return
the amount of submitted bytes for the dio (a short read or write). So for the
nowait && is_sync case, this means that preadv2/pwritev2 would always get -EAGAIN
instead of the partial read/write achieved. For the nowait && !is_sync case,
similarly, aio_return() would always return -EAGAIN too, even if a partial
read/write was done. Or am I missing something? Checking the man pages again,
this does not look like the described behavior.

If this analysis is correct, I think the proper fix cannot be done only in
__blkdev_direct_IO(); blkdev_bio_end_io() needs to change too. Basically, I
think the "dio->size -= bio->bi_iter.bi_size" needs to go into
blkdev_bio_end_io(), with a test added to see whether dio->size is 0 (then
complete with -EAGAIN) or not (then report the partial completion).
Starting to test something now with these changes in blkdev_bio_end_io().


-- 
Damien Le Moal
Western Digital Research


* Re: Block device direct read EIO handling broken?
  2019-08-06  0:23                   ` Dave Chinner
@ 2019-08-06 11:32                     ` Damien Le Moal
  0 siblings, 0 replies; 18+ messages in thread
From: Damien Le Moal @ 2019-08-06 11:32 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Jens Axboe, Darrick J. Wong, linux-fsdevel, linux-block, xfs

On 2019/08/06 9:25, Dave Chinner wrote:
> On Tue, Aug 06, 2019 at 12:05:51AM +0000, Damien Le Moal wrote:
>> On 2019/08/06 7:05, Damien Le Moal wrote:
>>> On 2019/08/06 6:59, Damien Le Moal wrote:
>>>> On 2019/08/06 6:28, Jens Axboe wrote:
>>>>> On 8/5/19 2:27 PM, Damien Le Moal wrote:
>>>>>> On 2019/08/06 6:26, Jens Axboe wrote:
>>>>>>>> In any case, looking again at this code, it looks like there is a
>>>>>>>> problem with dio->size being incremented early, even for fragments
>>>>>>>> that get BLK_QC_T_EAGAIN, because dio->size is being used in
>>>>>>>> blkdev_bio_end_io(). So an incorrect size can be reported to user
>>>>>>>> space in that case on completion (e.g. large asynchronous no-wait dio
>>>>>>>> that cannot be issued in one go).
>>>>>>>>
>>>>>>>> So maybe something like this ? (completely untested)
>>>>>>>
>>>>>>> I think that looks pretty good, I like not double accounting with
>>>>>>> this_size and dio->size, and we retain the old style ordering for the
>>>>>>> ret value.
>>>>>>
>>>>>> Do you want a proper patch with real testing backup ? I can send that
>>>>>> later today.
>>>>>
>>>>> Yeah that'd be great, I like your approach better.
>>>>>
>>>>
>>>> Looking again, I think this is not it yet: dio->size is being referenced after
>>>> submit_bio(), so blkdev_bio_end_io() may see the old value if the bio completes
>>>> before dio->size increment. So the use-after-free is still there. And since
>>>> blkdev_bio_end_io() processes completion to user space only when dio->ref
>>>> becomes 0, adding an atomic_inc/dec(&dio->ref) over the loop would not help and
>>>> does not cover the single BIO case. Any idea how to address this one ?
>>>>
>>>
>>> Maybe adding a bio_get/put() around the 2 places that do submit_bio() would work,
>>> for all cases (single/multi BIO, sync & async). E.g.:
>>>
>>> +                       bio_get(bio);
>>>                         qc = submit_bio(bio);
>>>                         if (qc == BLK_QC_T_EAGAIN) {
>>>                                 if (!dio->size)
>>>                                         ret = -EAGAIN;
>>> +                               bio_put(bio);
>>>                                 goto error;
>>>                         }
>>>                         dio->size += bio_size;
>>> +                       bio_put(bio);
>>>
>>> Thoughts ?
>>>
>>
>> That does not work since the reference to dio->size in blkdev_bio_end_io()
>> depends on atomic_dec_and_test(&dio->ref) which counts the BIO fragments for the
>> dio (+1 for async multi-bio case). So completion of the last bio can still
>> reference the old value of dio->size.
> 
> Didn't we fix this same use-after-free in iomap_dio_rw() in commit
> 4ea899ead278 ("iomap: fix a use after free in iomap_dio_rw")?

It is not quite the same situation here with the raw block device dio code, but
I will look into that commit for inspiration. Thank you for the pointer.

> 
> Cheers,
> 
> Dave.
> 


-- 
Damien Le Moal
Western Digital Research


* Re: Block device direct read EIO handling broken?
  2019-08-06  4:09                   ` Jens Axboe
  2019-08-06  7:05                     ` Damien Le Moal
@ 2019-08-06 13:23                     ` Damien Le Moal
  1 sibling, 0 replies; 18+ messages in thread
From: Damien Le Moal @ 2019-08-06 13:23 UTC (permalink / raw)
  To: Jens Axboe, Darrick J. Wong; +Cc: linux-fsdevel, linux-block, xfs

[-- Attachment #1: Type: text/plain, Size: 8696 bytes --]

On 2019/08/06 13:09, Jens Axboe wrote:
> On 8/5/19 5:05 PM, Damien Le Moal wrote:
>> On 2019/08/06 7:05, Damien Le Moal wrote:
>>> On 2019/08/06 6:59, Damien Le Moal wrote:
>>>> On 2019/08/06 6:28, Jens Axboe wrote:
>>>>> On 8/5/19 2:27 PM, Damien Le Moal wrote:
>>>>>> On 2019/08/06 6:26, Jens Axboe wrote:
>>>>>>>> In any case, looking again at this code, it looks like there is a
>>>>>>>> problem with dio->size being incremented early, even for fragments
>>>>>>>> that get BLK_QC_T_EAGAIN, because dio->size is being used in
>>>>>>>> blkdev_bio_end_io(). So an incorrect size can be reported to user
>>>>>>>> space in that case on completion (e.g. large asynchronous no-wait dio
>>>>>>>> that cannot be issued in one go).
>>>>>>>>
>>>>>>>> So maybe something like this ? (completely untested)
>>>>>>>
>>>>>>> I think that looks pretty good, I like not double accounting with
>>>>>>> this_size and dio->size, and we retain the old style ordering for the
>>>>>>> ret value.
>>>>>>
>>>>>> Do you want a proper patch with real testing backup ? I can send that
>>>>>> later today.
>>>>>
>>>>> Yeah that'd be great, I like your approach better.
>>>>>
>>>>
>>>> Looking again, I think this is not it yet: dio->size is being referenced after
>>>> submit_bio(), so blkdev_bio_end_io() may see the old value if the bio completes
>>>> before dio->size increment. So the use-after-free is still there. And since
>>>> blkdev_bio_end_io() processes completion to user space only when dio->ref
>>>> becomes 0, adding an atomic_inc/dec(&dio->ref) over the loop would not help and
>>>> does not cover the single BIO case. Any idea how to address this one ?
>>>>
>>>
>>> Maybe adding a bio_get/put() around the 2 places that do submit_bio() would work,
>>> for all cases (single/multi BIO, sync & async). E.g.:
>>>
>>> +                       bio_get(bio);
>>>                          qc = submit_bio(bio);
>>>                          if (qc == BLK_QC_T_EAGAIN) {
>>>                                  if (!dio->size)
>>>                                          ret = -EAGAIN;
>>> +                               bio_put(bio);
>>>                                  goto error;
>>>                          }
>>>                          dio->size += bio_size;
>>> +                       bio_put(bio);
>>>
>>> Thoughts ?
>>>
>>
>> That does not work since the reference to dio->size in
>> blkdev_bio_end_io() depends on atomic_dec_and_test(&dio->ref) which
>> counts the BIO fragments for the dio (+1 for async multi-bio case). So
>> completion of the last bio can still reference the old value of
>> dio->size.
>>
>> Adding a bio_get/put() on dio->bio ensures that dio stays around, but
>> does not prevent the use of the wrong dio->size. Adding an additional
>> atomic_inc/dec(&dio->ref) would prevent that, but we would need to
>> handle dio completion at the end of __blkdev_direct_IO() if all BIO
>> fragments already completed at that point. That is a lot more plumbing
>> needed, relying completely on dio->ref for all cases, thus removing
>> the dio->multi_bio management.
>>
>> Something like this:
> 
> Don't like this, as it adds unnecessary atomics for the sync case.
> What's wrong with just adjusting dio->size if we get BLK_QC_T_EAGAIN?
> It's safe to do so, since we're doing the final put later. We just can't
> do it for the normal case of submit_bio() succeeding. Kill the new 'ret'
> usage and return to what we had as well, it's more readable too imho.

Here is what I have so far:

diff --git a/fs/block_dev.c b/fs/block_dev.c
index a6f7c892cb4a..6dd945fdf962 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -300,7 +300,9 @@ static void blkdev_bio_end_io(struct bio *bio)
        struct blkdev_dio *dio = bio->bi_private;
        bool should_dirty = dio->should_dirty;

-       if (bio->bi_status && !dio->bio.bi_status)
+       if (bio->bi_status &&
+           bio->bi_status != BLK_STS_AGAIN &&
+           !dio->bio.bi_status)
                dio->bio.bi_status = bio->bi_status;

        if (!dio->multi_bio || atomic_dec_and_test(&dio->ref)) {
@@ -349,7 +351,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
        loff_t pos = iocb->ki_pos;
        blk_qc_t qc = BLK_QC_T_NONE;
        gfp_t gfp;
-       ssize_t ret;
+       ssize_t ret = 0;

        if ((pos | iov_iter_alignment(iter)) &
            (bdev_logical_block_size(bdev) - 1))
@@ -386,7 +388,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)

        ret = 0;
        for (;;) {
-               int err;
+               unsigned int bio_size;

                bio_set_dev(bio, bdev);
                bio->bi_iter.bi_sector = pos >> 9;
@@ -395,10 +397,8 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
                bio->bi_end_io = blkdev_bio_end_io;
                bio->bi_ioprio = iocb->ki_ioprio;

-               err = bio_iov_iter_get_pages(bio, iter);
-               if (unlikely(err)) {
-                       if (!ret)
-                               ret = err;
+               ret = bio_iov_iter_get_pages(bio, iter);
+               if (unlikely(ret)) {
                        bio->bi_status = BLK_STS_IOERR;
                        bio_endio(bio);
                        break;
@@ -421,7 +421,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
                if (nowait)
                        bio->bi_opf |= (REQ_NOWAIT | REQ_NOWAIT_INLINE);

-               dio->size += bio->bi_iter.bi_size;
+               bio_size = bio->bi_iter.bi_size;
                pos += bio->bi_iter.bi_size;

                nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES);
@@ -435,11 +435,11 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)

                        qc = submit_bio(bio);
                        if (qc == BLK_QC_T_EAGAIN) {
-                               if (!ret)
+                               if (!dio->size)
                                        ret = -EAGAIN;
                                goto error;
                        }
-                       ret = dio->size;
+                       dio->size += bio_size;

                        if (polled)
                                WRITE_ONCE(iocb->ki_cookie, qc);
@@ -462,15 +462,15 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)

                qc = submit_bio(bio);
                if (qc == BLK_QC_T_EAGAIN) {
-                       if (!ret)
+                       if (!dio->size)
                                ret = -EAGAIN;
                        goto error;
                }
-               ret = dio->size;
+               dio->size += bio_size;

                bio = bio_alloc(gfp, nr_pages);
                if (!bio) {
-                       if (!ret)
+                       if (!dio->size)
                                ret = -EAGAIN;
                        goto error;
                }
@@ -496,6 +496,8 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
 out:
        if (!ret)
                ret = blk_status_to_errno(dio->bio.bi_status);
+       if (likely(!ret))
+               ret = dio->size;

        bio_put(&dio->bio);
        return ret;

This fixes the sync case return value with nowait, but the async case is still
wrong: there is a potential use-after-free of dio, and the dio->size seen by
blkdev_bio_end_io() may not be the up-to-date incremented value after submit_bio().

And I am stuck. I do not see how to fix this without the extra dio reference and
handling of completion of dio->ref decrement also in __blkdev_direct_IO(). This
is the previous proposal I sent that you did not like. This extra ref solution
is similar to the iomap direct IO code. Any other idea?

Also, I am seeing some very weird behavior of submit_bio() with the nowait case.
For very large dio requests, I see submit_bio returning BLK_QC_T_EAGAIN but the
bio actually going to the drive, which of course does not make sense.
Another thing I noticed is that REQ_NOWAIT_INLINE is handled for the inline
error return only by blk_mq_make_request(); generic_make_request() lacks
handling of REQ_NOWAIT through bio_wouldblock_error().

I will keep digging into all this, but being on vacation this week, I may be
very slow.

FYI, I am attaching to this email the small application I used for testing. I
plan to use it for blktests test cases.

Best regards.

-- 
Damien Le Moal
Western Digital Research

[-- Attachment #2: dio.c --]
[-- Type: text/plain, Size: 4682 bytes --]

// SPDX-License-Identifier: GPL-2.0-only
/*
 * Copyright (C) 2019 Western Digital Corporation or its affiliates.
 * Author: Damien Le Moal <damien.lemoal@wdc.com>
 */
#define _GNU_SOURCE

#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <unistd.h>
#include <errno.h>
#include <string.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/uio.h>
#include <sys/syscall.h>
#include <fcntl.h>
#include <linux/aio_abi.h>
#include <linux/fs.h>

static void dio_usage(char *cmd)
{
	printf("Usage: %s [options] <dev path> <offset (B)> <size (B)>\n",
	       cmd);
	printf("Options:\n"
	       "    -rd: do read (default)\n"
	       "    -wr: do write\n"
	       "    -nowait: Use RWF_NOWAIT (do not wait for resources).\n"
	       "    -async: Do asynchronous direct IO\n");
}

ssize_t preadv2(int fd, const struct iovec *iov, int iovcnt,
		off_t offset, int flags)
{
	return syscall(SYS_preadv2, fd, iov, iovcnt, offset, 0, flags);
}

ssize_t pwritev2(int fd, const struct iovec *iov, int iovcnt,
		 off_t offset, int flags)
{
	return syscall(SYS_pwritev2, fd, iov, iovcnt, offset, 0, flags);
}

static void dio_sync(int fd, void *buf, size_t size, loff_t offset, int flags,
		     bool do_read)
{
	ssize_t ret;
	struct iovec iov = {
		.iov_base = buf,
		.iov_len = size,
	};

	if (do_read)
		ret = preadv2(fd, &iov, 1, offset, flags);
	else
		ret = pwritev2(fd, &iov, 1, offset, flags);
	if (ret < 0)
		ret = -errno;

	printf("%zu %zd\n", size, ret);
}

static inline int io_setup(unsigned int nr, aio_context_t *ctxp)
{
	return syscall(__NR_io_setup, nr, ctxp);
}

static inline int io_destroy(aio_context_t ctx)
{
	return syscall(__NR_io_destroy, ctx);
}

static inline int io_submit(aio_context_t ctx, long nr, struct iocb **iocbpp)
{
	return syscall(__NR_io_submit, ctx, nr, iocbpp);
}

static inline int io_getevents(aio_context_t ctx, long min_nr, long max_nr,
			       struct io_event *events,
			       struct timespec *timeout)
{
	return syscall(__NR_io_getevents, ctx, min_nr, max_nr, events, timeout);
}

static void dio_async(int fd, void *buf, size_t size, loff_t offset, int flags,
		      bool do_read)
{
	aio_context_t aioctx;
	struct iocb aiocb, *aiocbs;
	struct io_event aioevent;
	ssize_t ret;

	memset(&aioctx, 0, sizeof(aioctx));
	memset(&aiocb, 0, sizeof(aiocb));
	memset(&aioevent, 0, sizeof(aioevent));

	ret = io_setup(1, &aioctx);
	if (ret < 0) {
		fprintf(stderr,
			"io_setup failed %d (%s)\n", errno, strerror(errno));
		return;
	}

	aiocb.aio_fildes = fd;
	aiocb.aio_buf = (unsigned long)buf;
	aiocb.aio_nbytes = size;
	if (do_read)
		aiocb.aio_lio_opcode = IOCB_CMD_PREAD;
	else
		aiocb.aio_lio_opcode = IOCB_CMD_PWRITE;
	aiocb.aio_rw_flags = flags;
	aiocbs = &aiocb;

	ret = io_submit(aioctx, 1, &aiocbs);
	if (ret < 0) {
		fprintf(stderr, "io_submit failed %d (%s)\n",
			errno, strerror(errno));
		goto out;
	}

	ret = io_getevents(aioctx, 1, 1, &aioevent, NULL);
	if (ret != 1) {
		fprintf(stderr, "io_getevents failed %d (%s)\n",
			errno, strerror(errno));
		goto out;
	}

	printf("%zu %lld\n", size, aioevent.res);

out:
	io_destroy(aioctx);
}

int main(int argc, char **argv)
{
	int ret, fd, i, flags = 0;
	char *dev_path;
	loff_t offset;
	size_t size;
	void *buf = NULL;
	bool async = false;
	bool do_read = true;
	int open_flags = O_DIRECT;

	for (i = 1; i < argc; i++) {
		if (strcmp(argv[i], "-nowait") == 0) {
			flags = RWF_NOWAIT;
		} else if (strcmp(argv[i], "-async") == 0) {
			async = true;
		} else if (strcmp(argv[i], "-rd") == 0) {
			do_read = true;
		} else if (strcmp(argv[i], "-wr") == 0) {
			do_read = false;
		} else if (argv[i][0] == '-') {
			fprintf(stderr, "Invalid option %s\n", argv[i]);
			return 1;
		} else {
			break;
		}
	}

	if (argc - i != 3) {
		dio_usage(argv[0]);
		return 1;
	}

	dev_path = argv[i];
	offset = atoll(argv[i + 1]);
	if (offset < 0) {
		fprintf(stderr, "Invalid offset %s\n", argv[i + 1]);
		return 1;
	}

	size = atoll(argv[i + 2]);

	if (do_read)
		open_flags |= O_RDONLY;
	else
		open_flags |= O_WRONLY;
	fd = open(dev_path, open_flags, 0);
	if (fd < 0) {
		fprintf(stderr, "Open %s failed %d (%s)\n",
			dev_path, errno, strerror(errno));
		return 1;
	}

	ret = posix_memalign((void **) &buf, sysconf(_SC_PAGESIZE), size);
	if (ret != 0) {
		fprintf(stderr, "Allocate buffer failed %d (%s)\n",
			-ret, strerror(-ret));
		ret = 1;
		goto out;
	}

	if (!async)
		dio_sync(fd, buf, size, offset, flags, do_read);
	else
		dio_async(fd, buf, size, offset, flags, do_read);

out:
	close(fd);
	free(buf);

	return ret;
}
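For reference, the attached tool can be built and invoked like this (the device path is a placeholder; the option names come from the argument parsing in main() above):

```shell
cc -O2 -o dio dio.c
# Synchronous no-wait direct read of 1146880 bytes at offset 0:
./dio -rd -nowait /dev/sdX 0 1146880
# Same request through the AIO submission path:
./dio -rd -nowait -async /dev/sdX 0 1146880
```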


* Re: Block device direct read EIO handling broken?
  2019-08-06  7:05                     ` Damien Le Moal
@ 2019-08-06 13:34                       ` Jens Axboe
  2019-08-07  9:42                         ` Damien Le Moal
  0 siblings, 1 reply; 18+ messages in thread
From: Jens Axboe @ 2019-08-06 13:34 UTC (permalink / raw)
  To: Damien Le Moal, Darrick J. Wong; +Cc: linux-fsdevel, linux-block, xfs

On 8/6/19 12:05 AM, Damien Le Moal wrote:
> On 2019/08/06 13:09, Jens Axboe wrote:
>> On 8/5/19 5:05 PM, Damien Le Moal wrote:
>>> On 2019/08/06 7:05, Damien Le Moal wrote:
>>>> On 2019/08/06 6:59, Damien Le Moal wrote:
>>>>> On 2019/08/06 6:28, Jens Axboe wrote:
>>>>>> On 8/5/19 2:27 PM, Damien Le Moal wrote:
>>>>>>> On 2019/08/06 6:26, Jens Axboe wrote:
>>>>>>>>> In any case, looking again at this code, it looks like there is a
>>>>>>>>> problem with dio->size being incremented early, even for fragments
>>>>>>>>> that get BLK_QC_T_EAGAIN, because dio->size is being used in
>>>>>>>>> blkdev_bio_end_io(). So an incorrect size can be reported to user
>>>>>>>>> space in that case on completion (e.g. large asynchronous no-wait dio
>>>>>>>>> that cannot be issued in one go).
>>>>>>>>>
>>>>>>>>> So maybe something like this ? (completely untested)
>>>>>>>>
>>>>>>>> I think that looks pretty good, I like not double accounting with
>>>>>>>> this_size and dio->size, and we retain the old style ordering for the
>>>>>>>> ret value.
>>>>>>>
>>>>>>> Do you want a proper patch with real testing backup ? I can send that
>>>>>>> later today.
>>>>>>
>>>>>> Yeah that'd be great, I like your approach better.
>>>>>>
>>>>>
>>>>> Looking again, I think this is not it yet: dio->size is being referenced after
>>>>> submit_bio(), so blkdev_bio_end_io() may see the old value if the bio completes
>>>>> before dio->size increment. So the use-after-free is still there. And since
>>>>> blkdev_bio_end_io() processes completion to user space only when dio->ref
>>>>> becomes 0, adding an atomic_inc/dec(&dio->ref) over the loop would not help and
>>>>> does not cover the single BIO case. Any idea how to address this one ?
>>>>>
>>>>
>>>> Maybe adding a bio_get/put() around the 2 places that do submit_bio() would work,
>>>> for all cases (single/multi BIO, sync & async). E.g.:
>>>>
>>>> +                       bio_get(bio);
>>>>                           qc = submit_bio(bio);
>>>>                           if (qc == BLK_QC_T_EAGAIN) {
>>>>                                   if (!dio->size)
>>>>                                           ret = -EAGAIN;
>>>> +                               bio_put(bio);
>>>>                                   goto error;
>>>>                           }
>>>>                           dio->size += bio_size;
>>>> +                       bio_put(bio);
>>>>
>>>> Thoughts ?
>>>>
>>>
>>> That does not work since the reference to dio->size in
>>> blkdev_bio_end_io() depends on atomic_dec_and_test(&dio->ref) which
>>> counts the BIO fragments for the dio (+1 for async multi-bio case). So
>>> completion of the last bio can still reference the old value of
>>> dio->size.
>>>
>>> Adding a bio_get/put() on dio->bio ensures that dio stays around, but
>>> does not prevent the use of the wrong dio->size. Adding an additional
>>> atomic_inc/dec(&dio->ref) would prevent that, but we would need to
>>> handle dio completion at the end of __blkdev_direct_IO() if all BIO
>>> fragments already completed at that point. That is a lot more plumbing
>>> needed, relying completely on dio->ref for all cases, thus removing
>>> the dio->multi_bio management.
>>>
>>> Something like this:
>>
>> Don't like this, as it adds unnecessary atomics for the sync case.
>> What's wrong with just adjusting dio->size if we get BLK_QC_T_EAGAIN?
>> It's safe to do so, since we're doing the final put later. We just can't
>> do it for the normal case of submit_bio() succeeding. Kill the new 'ret'
>> usage and return to what we had as well, it's more readable too imho.
>>
>> Totally untested...
>>
>> diff --git a/fs/block_dev.c b/fs/block_dev.c
>> index a6f7c892cb4a..131e2e0582a6 100644
>> --- a/fs/block_dev.c
>> +++ b/fs/block_dev.c
>> @@ -349,7 +349,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
>>   	loff_t pos = iocb->ki_pos;
>>   	blk_qc_t qc = BLK_QC_T_NONE;
>>   	gfp_t gfp;
>> -	ssize_t ret;
>> +	int ret;
>>   
>>   	if ((pos | iov_iter_alignment(iter)) &
>>   	    (bdev_logical_block_size(bdev) - 1))
>> @@ -386,8 +386,6 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
>>   
>>   	ret = 0;
>>   	for (;;) {
>> -		int err;
>> -
>>   		bio_set_dev(bio, bdev);
>>   		bio->bi_iter.bi_sector = pos >> 9;
>>   		bio->bi_write_hint = iocb->ki_hint;
>> @@ -395,10 +393,8 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
>>   		bio->bi_end_io = blkdev_bio_end_io;
>>   		bio->bi_ioprio = iocb->ki_ioprio;
>>   
>> -		err = bio_iov_iter_get_pages(bio, iter);
>> -		if (unlikely(err)) {
>> -			if (!ret)
>> -				ret = err;
>> +		ret = bio_iov_iter_get_pages(bio, iter);
>> +		if (unlikely(ret)) {
>>   			bio->bi_status = BLK_STS_IOERR;
>>   			bio_endio(bio);
>>   			break;
>> @@ -421,7 +417,6 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
>>   		if (nowait)
>>   			bio->bi_opf |= (REQ_NOWAIT | REQ_NOWAIT_INLINE);
>>   
>> -		dio->size += bio->bi_iter.bi_size;
>>   		pos += bio->bi_iter.bi_size;
>>   
>>   		nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES);
>> @@ -433,13 +428,13 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
>>   				polled = true;
>>   			}
>>   
>> +			dio->size += bio->bi_iter.bi_size;
>>   			qc = submit_bio(bio);
>>   			if (qc == BLK_QC_T_EAGAIN) {
>> -				if (!ret)
>> -					ret = -EAGAIN;
>> +				dio->size -= bio->bi_iter.bi_size;
> 
> ref after free of bio here. Easy to fix though. Also, with this, the
> bio_endio() call within submit_bio() for the EAGAIN failure will see a
> dio->size too large, including this failed bio. So this does not work.

There's no ref after free here - if BLK_QC_T_EAGAIN is being returned,
the bio has not been freed. There's no calling bio_endio() for that
case.

For dio->size, it doesn't matter. If we get the error here, bio_endio()
was never called. And if the submission is successful, we use dio->size
for the success case.

-- 
Jens Axboe



* Re: Block device direct read EIO handling broken?
  2019-08-06 13:34                       ` Jens Axboe
@ 2019-08-07  9:42                         ` Damien Le Moal
  0 siblings, 0 replies; 18+ messages in thread
From: Damien Le Moal @ 2019-08-07  9:42 UTC (permalink / raw)
  To: Jens Axboe, Darrick J. Wong; +Cc: linux-fsdevel, linux-block, xfs

On 2019/08/06 22:34, Jens Axboe wrote:
> On 8/6/19 12:05 AM, Damien Le Moal wrote:
>> On 2019/08/06 13:09, Jens Axboe wrote:
>>> On 8/5/19 5:05 PM, Damien Le Moal wrote:
>>>> On 2019/08/06 7:05, Damien Le Moal wrote:
>>>>> On 2019/08/06 6:59, Damien Le Moal wrote:
>>>>>> On 2019/08/06 6:28, Jens Axboe wrote:
>>>>>>> On 8/5/19 2:27 PM, Damien Le Moal wrote:
>>>>>>>> On 2019/08/06 6:26, Jens Axboe wrote:
>>>>>>>>>> In any case, looking again at this code, it looks like there is a
>>>>>>>>>> problem with dio->size being incremented early, even for fragments
>>>>>>>>>> that get BLK_QC_T_EAGAIN, because dio->size is being used in
>>>>>>>>>> blkdev_bio_end_io(). So an incorrect size can be reported to user
>>>>>>>>>> space in that case on completion (e.g. large asynchronous no-wait dio
>>>>>>>>>> that cannot be issued in one go).
>>>>>>>>>>
>>>>>>>>>> So maybe something like this ? (completely untested)
>>>>>>>>>
>>>>>>>>> I think that looks pretty good, I like not double accounting with
>>>>>>>>> this_size and dio->size, and we retain the old style ordering for the
>>>>>>>>> ret value.
>>>>>>>>
>>>>>>>> Do you want a proper patch with real testing backup ? I can send that
>>>>>>>> later today.
>>>>>>>
>>>>>>> Yeah that'd be great, I like your approach better.
>>>>>>>
>>>>>>
>>>>>> Looking again, I think this is not it yet: dio->size is being referenced after
>>>>>> submit_bio(), so blkdev_bio_end_io() may see the old value if the bio completes
>>>>>> before dio->size increment. So the use-after-free is still there. And since
>>>>>> blkdev_bio_end_io() processes completion to user space only when dio->ref
>>>>>> becomes 0, adding an atomic_inc/dec(&dio->ref) over the loop would not help and
>>>>>> does not cover the single BIO case. Any idea how to address this one ?
>>>>>>
>>>>>
>>>>> Maybe adding a bio_get/put() around the 2 places that do submit_bio() would work,
>>>>> for all cases (single/multi BIO, sync & async). E.g.:
>>>>>
>>>>> +                       bio_get(bio);
>>>>>                           qc = submit_bio(bio);
>>>>>                           if (qc == BLK_QC_T_EAGAIN) {
>>>>>                                   if (!dio->size)
>>>>>                                           ret = -EAGAIN;
>>>>> +                               bio_put(bio);
>>>>>                                   goto error;
>>>>>                           }
>>>>>                           dio->size += bio_size;
>>>>> +                       bio_put(bio);
>>>>>
>>>>> Thoughts ?
>>>>>
>>>>
>>>> That does not work since the reference to dio->size in
>>>> blkdev_bio_end_io() depends on atomic_dec_and_test(&dio->ref) which
>>>> counts the BIO fragments for the dio (+1 for async multi-bio case). So
>>>> completion of the last bio can still reference the old value of
>>>> dio->size.
>>>>
>>>> Adding a bio_get/put() on dio->bio ensures that dio stays around, but
>>>> does not prevent the use of the wrong dio->size. Adding an additional
>>>> atomic_inc/dec(&dio->ref) would prevent that, but we would need to
>>>> handle dio completion at the end of __blkdev_direct_IO() if all BIO
>>>> fragments already completed at that point. That is a lot more plumbing
>>>> needed, relying completely on dio->ref for all cases, thus removing
>>>> the dio->multi_bio management.
>>>>
>>>> Something like this:
>>>
>>> Don't like this, as it adds unnecessary atomics for the sync case.
>>> What's wrong with just adjusting dio->size if we get BLK_QC_T_EAGAIN?
>>> It's safe to do so, since we're doing the final put later. We just can't
>>> do it for the normal case of submit_bio() succeeding. Kill the new 'ret'
>>> usage and return to what we had as well, it's more readable too imho.
>>>
>>> Totally untested...
>>>
>>> diff --git a/fs/block_dev.c b/fs/block_dev.c
>>> index a6f7c892cb4a..131e2e0582a6 100644
>>> --- a/fs/block_dev.c
>>> +++ b/fs/block_dev.c
>>> @@ -349,7 +349,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
>>>   	loff_t pos = iocb->ki_pos;
>>>   	blk_qc_t qc = BLK_QC_T_NONE;
>>>   	gfp_t gfp;
>>> -	ssize_t ret;
>>> +	int ret;
>>>   
>>>   	if ((pos | iov_iter_alignment(iter)) &
>>>   	    (bdev_logical_block_size(bdev) - 1))
>>> @@ -386,8 +386,6 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
>>>   
>>>   	ret = 0;
>>>   	for (;;) {
>>> -		int err;
>>> -
>>>   		bio_set_dev(bio, bdev);
>>>   		bio->bi_iter.bi_sector = pos >> 9;
>>>   		bio->bi_write_hint = iocb->ki_hint;
>>> @@ -395,10 +393,8 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
>>>   		bio->bi_end_io = blkdev_bio_end_io;
>>>   		bio->bi_ioprio = iocb->ki_ioprio;
>>>   
>>> -		err = bio_iov_iter_get_pages(bio, iter);
>>> -		if (unlikely(err)) {
>>> -			if (!ret)
>>> -				ret = err;
>>> +		ret = bio_iov_iter_get_pages(bio, iter);
>>> +		if (unlikely(ret)) {
>>>   			bio->bi_status = BLK_STS_IOERR;
>>>   			bio_endio(bio);
>>>   			break;
>>> @@ -421,7 +417,6 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
>>>   		if (nowait)
>>>   			bio->bi_opf |= (REQ_NOWAIT | REQ_NOWAIT_INLINE);
>>>   
>>> -		dio->size += bio->bi_iter.bi_size;
>>>   		pos += bio->bi_iter.bi_size;
>>>   
>>>   		nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES);
>>> @@ -433,13 +428,13 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
>>>   				polled = true;
>>>   			}
>>>   
>>> +			dio->size += bio->bi_iter.bi_size;
>>>   			qc = submit_bio(bio);
>>>   			if (qc == BLK_QC_T_EAGAIN) {
>>> -				if (!ret)
>>> -					ret = -EAGAIN;
>>> +				dio->size -= bio->bi_iter.bi_size;
>>
>> ref after free of bio here. Easy to fix though. Also, with this, the
>> bio_endio() call within submit_bio() for the EAGAIN failure will see a
>> dio->size too large, including this failed bio. So this does not work.
> 
> There's no ref after free here - if BLK_QC_T_EAGAIN is being returned,
> the bio has not been freed. There's no calling bio_endio() for that
> case.
> 
> For dio->size, it doesn't matter. If we get the error here, bio_endio()
> was never called. And if the submission is successful, we use dio->size
> for the success case.
> 

OK. Got it (I checked REQ_NOWAIT_INLINE, I should have done this earlier!).

But, unless I am missing something again, there are two other problems:
1) I do not see a bio_put() for the bio submission failure with BLK_QC_T_EAGAIN.
2) If the submission failure is for a bio fragment of a multi-bio dio, dio->ref
for this failed bio will never be decremented, so the dio will never complete.

I am still trying to find a reliable way to generate BLK_QC_T_EAGAIN submit_bio()
failures for testing. Any hint on how to do this? I tried setting the queue depth
to 1 and nr_requests to the minimum of 4 while generating a lot of load on the
test disk, but never got BLK_QC_T_EAGAIN.
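For the record, the sysfs settings I used when trying to exhaust the request pool look like this (a configuration sketch; sdX is a placeholder for the test disk):

```shell
# Reduce the device queue depth and the scheduler request pool to their
# minimums, then generate heavy I/O load on the disk while submitting
# RWF_NOWAIT I/Os with the attached dio tool.
echo 1 > /sys/block/sdX/device/queue_depth
echo 4 > /sys/block/sdX/queue/nr_requests
```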

Best regards.

-- 
Damien Le Moal
Western Digital Research


end of thread, other threads:[~2019-08-07  9:43 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-08-05 18:15 Block device direct read EIO handling broken? Darrick J. Wong
2019-08-05 18:31 ` Jens Axboe
2019-08-05 20:31   ` Jens Axboe
2019-08-05 20:54     ` Jens Axboe
2019-08-05 21:08     ` Damien Le Moal
2019-08-05 21:25       ` Jens Axboe
2019-08-05 21:27         ` Damien Le Moal
2019-08-05 21:28           ` Jens Axboe
2019-08-05 21:59             ` Damien Le Moal
2019-08-05 22:05               ` Damien Le Moal
2019-08-06  0:05                 ` Damien Le Moal
2019-08-06  0:23                   ` Dave Chinner
2019-08-06 11:32                     ` Damien Le Moal
2019-08-06  4:09                   ` Jens Axboe
2019-08-06  7:05                     ` Damien Le Moal
2019-08-06 13:34                       ` Jens Axboe
2019-08-07  9:42                         ` Damien Le Moal
2019-08-06 13:23                     ` Damien Le Moal

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).