* Silent data corruption in blkdev_direct_IO()
@ 2018-07-12 14:36 Hannes Reinecke
  2018-07-12 15:08 ` Jens Axboe
  2018-07-12 23:29 ` Ming Lei
  0 siblings, 2 replies; 54+ messages in thread
From: Hannes Reinecke @ 2018-07-12 14:36 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Christoph Hellwig, linux-block, Martin Wilck

Hi Jens, Christoph,

we're currently hunting down a silent data corruption occurring due to
commit 72ecad22d9f1 ("block: support a full bio worth of IO for
simplified bdev direct-io").

While the whole thing is still hazy on the details, the one thing we've
found is that reverting that patch fixes the data corruption.

And looking closer, I've found this:

static ssize_t
blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
{
	int nr_pages;

	nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES + 1);
	if (!nr_pages)
		return 0;
	if (is_sync_kiocb(iocb) && nr_pages <= BIO_MAX_PAGES)
		return __blkdev_direct_IO_simple(iocb, iter, nr_pages);

	return __blkdev_direct_IO(iocb, iter, min(nr_pages, BIO_MAX_PAGES));
}

When checking the call path
__blkdev_direct_IO()->bio_alloc_bioset()->bvec_alloc()
I found that bvec_alloc() will fail if nr_pages > BIO_MAX_PAGES.

So why is there the check for 'nr_pages <= BIO_MAX_PAGES' ?
It's not that we can handle it in __blkdev_direct_IO() ...

Thanks for any clarification.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


* Re: Silent data corruption in blkdev_direct_IO()
  2018-07-12 14:36 Silent data corruption in blkdev_direct_IO() Hannes Reinecke
@ 2018-07-12 15:08 ` Jens Axboe
  2018-07-12 16:11   ` Martin Wilck
  2018-07-12 16:14   ` Hannes Reinecke
  2018-07-12 23:29 ` Ming Lei
  1 sibling, 2 replies; 54+ messages in thread
From: Jens Axboe @ 2018-07-12 15:08 UTC (permalink / raw)
  To: Hannes Reinecke; +Cc: Christoph Hellwig, linux-block, Martin Wilck

On 7/12/18 8:36 AM, Hannes Reinecke wrote:
> Hi Jens, Christoph,
> 
> we're currently hunting down a silent data corruption occurring due to
> commit 72ecad22d9f1 ("block: support a full bio worth of IO for
> simplified bdev direct-io").
> 
> While the whole thing is still hazy on the details, the one thing we've
> found is that reverting that patch fixes the data corruption.
> 
> And looking closer, I've found this:
> 
> static ssize_t
> blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
> {
> 	int nr_pages;
> 
> 	nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES + 1);
> 	if (!nr_pages)
> 		return 0;
> 	if (is_sync_kiocb(iocb) && nr_pages <= BIO_MAX_PAGES)
> 		return __blkdev_direct_IO_simple(iocb, iter, nr_pages);
> 
> 	return __blkdev_direct_IO(iocb, iter, min(nr_pages, BIO_MAX_PAGES));
> }
> 
> When checking the call path
> __blkdev_direct_IO()->bio_alloc_bioset()->bvec_alloc()
> I found that bvec_alloc() will fail if nr_pages > BIO_MAX_PAGES.
> 
> So why is there the check for 'nr_pages <= BIO_MAX_PAGES' ?
> It's not that we can handle it in __blkdev_direct_IO() ...

The logic could be cleaned up like below, the sync part is really all
we care about. What is the test case for this? async or sync?

I also don't remember why it's BIO_MAX_PAGES + 1...

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 0dd87aaeb39a..14ef3d71b55f 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -424,13 +424,13 @@ blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 {
 	int nr_pages;
 
-	nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES + 1);
+	nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES);
 	if (!nr_pages)
 		return 0;
-	if (is_sync_kiocb(iocb) && nr_pages <= BIO_MAX_PAGES)
+	if (is_sync_kiocb(iocb))
 		return __blkdev_direct_IO_simple(iocb, iter, nr_pages);
 
-	return __blkdev_direct_IO(iocb, iter, min(nr_pages, BIO_MAX_PAGES));
+	return __blkdev_direct_IO(iocb, iter, nr_pages);
 }
 
 static __init int blkdev_init(void)

-- 
Jens Axboe


* Re: Silent data corruption in blkdev_direct_IO()
  2018-07-12 15:08 ` Jens Axboe
@ 2018-07-12 16:11   ` Martin Wilck
  2018-07-12 16:14   ` Hannes Reinecke
  1 sibling, 0 replies; 54+ messages in thread
From: Martin Wilck @ 2018-07-12 16:11 UTC (permalink / raw)
  To: Jens Axboe, Hannes Reinecke; +Cc: Christoph Hellwig, linux-block

On Thu, 2018-07-12 at 09:08 -0600, Jens Axboe wrote:
> On 7/12/18 8:36 AM, Hannes Reinecke wrote:
> > Hi Jens, Christoph,
> > 
> > we're currently hunting down a silent data corruption occurring due
> > to
> > commit 72ecad22d9f1 ("block: support a full bio worth of IO for
> > simplified bdev direct-io").
> > 
> > While the whole thing is still hazy on the details, the one thing
> > we've
> > found is that reverting that patch fixes the data corruption.
> > 
> > And looking closer, I've found this:
> > 
> > static ssize_t
> > blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
> > {
> > 	int nr_pages;
> > 
> > 	nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES + 1);
> > 	if (!nr_pages)
> > 		return 0;
> > 	if (is_sync_kiocb(iocb) && nr_pages <= BIO_MAX_PAGES)
> > 		return __blkdev_direct_IO_simple(iocb, iter, nr_pages);
> > 
> > 	return __blkdev_direct_IO(iocb, iter, min(nr_pages,
> > BIO_MAX_PAGES));
> > }
> > 
> > When checking the call path
> > __blkdev_direct_IO()->bio_alloc_bioset()->bvec_alloc()
> > I found that bvec_alloc() will fail if nr_pages > BIO_MAX_PAGES.
> > 
> > So why is there the check for 'nr_pages <= BIO_MAX_PAGES' ?
> > It's not that we can handle it in __blkdev_direct_IO() ...
> 
> The logic could be cleaned up like below, the sync part is really all
> we care about. What is the test case for this? async or sync?

It's sync, and the corruption is in the __blkdev_direct_IO_simple()
path. It starts to occur with your "block: support a full bio worth of
IO for simplified bdev direct-io" patch, which causes the "simple" path
to be taken for larger IO sizes.

Martin



> 
> I also don't remember why it's BIO_MAX_PAGES + 1...
> 
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 0dd87aaeb39a..14ef3d71b55f 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -424,13 +424,13 @@ blkdev_direct_IO(struct kiocb *iocb, struct
> iov_iter *iter)
>  {
>  	int nr_pages;
>  
> -	nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES + 1);
> +	nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES);
>  	if (!nr_pages)
>  		return 0;
> -	if (is_sync_kiocb(iocb) && nr_pages <= BIO_MAX_PAGES)
> +	if (is_sync_kiocb(iocb))
>  		return __blkdev_direct_IO_simple(iocb, iter,
> nr_pages);
>  
> -	return __blkdev_direct_IO(iocb, iter, min(nr_pages,
> BIO_MAX_PAGES));
> +	return __blkdev_direct_IO(iocb, iter, nr_pages);
>  }
>  
>  static __init int blkdev_init(void)
> 

-- 
Dr. Martin Wilck <mwilck@suse.com>, Tel. +49 (0)911 74053 2107
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)


* Re: Silent data corruption in blkdev_direct_IO()
  2018-07-12 15:08 ` Jens Axboe
  2018-07-12 16:11   ` Martin Wilck
@ 2018-07-12 16:14   ` Hannes Reinecke
  2018-07-12 16:20     ` Jens Axboe
  1 sibling, 1 reply; 54+ messages in thread
From: Hannes Reinecke @ 2018-07-12 16:14 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Christoph Hellwig, linux-block, Martin Wilck

On 07/12/2018 05:08 PM, Jens Axboe wrote:
> On 7/12/18 8:36 AM, Hannes Reinecke wrote:
>> Hi Jens, Christoph,
>>
>> we're currently hunting down a silent data corruption occurring due to
>> commit 72ecad22d9f1 ("block: support a full bio worth of IO for
>> simplified bdev direct-io").
>>
>> While the whole thing is still hazy on the details, the one thing we've
>> found is that reverting that patch fixes the data corruption.
>>
>> And looking closer, I've found this:
>>
>> static ssize_t
>> blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
>> {
>> 	int nr_pages;
>>
>> 	nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES + 1);
>> 	if (!nr_pages)
>> 		return 0;
>> 	if (is_sync_kiocb(iocb) && nr_pages <= BIO_MAX_PAGES)
>> 		return __blkdev_direct_IO_simple(iocb, iter, nr_pages);
>>
>> 	return __blkdev_direct_IO(iocb, iter, min(nr_pages, BIO_MAX_PAGES));
>> }
>>
>> When checking the call path
>> __blkdev_direct_IO()->bio_alloc_bioset()->bvec_alloc()
>> I found that bvec_alloc() will fail if nr_pages > BIO_MAX_PAGES.
>>
>> So why is there the check for 'nr_pages <= BIO_MAX_PAGES' ?
>> It's not that we can handle it in __blkdev_direct_IO() ...
> 
> The logic could be cleaned up like below, the sync part is really all
> we care about. What is the test case for this? async or sync?
> 
> I also don't remember why it's BIO_MAX_PAGES + 1...
> 
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 0dd87aaeb39a..14ef3d71b55f 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -424,13 +424,13 @@ blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
>   {
>   	int nr_pages;
>   
> -	nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES + 1);
> +	nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES);
>   	if (!nr_pages)
>   		return 0;
> -	if (is_sync_kiocb(iocb) && nr_pages <= BIO_MAX_PAGES)
> +	if (is_sync_kiocb(iocb))
>   		return __blkdev_direct_IO_simple(iocb, iter, nr_pages);
>   
> -	return __blkdev_direct_IO(iocb, iter, min(nr_pages, BIO_MAX_PAGES));
> +	return __blkdev_direct_IO(iocb, iter, nr_pages);
>   }
>   
>   static __init int blkdev_init(void)
> 
Hmm. We'll give it a go, but somehow I feel this won't solve our problem.
Another question (which probably shows my complete ignorance):
What happens if the iter holds >= BIO_MAX_PAGES?
'iov_iter_npages' will only ever return BIO_MAX_PAGES, independent of
whether the iter contains more pages.
What happens to those left-over pages?
Will they ever be processed?

Cheers,

Hannes


* Re: Silent data corruption in blkdev_direct_IO()
  2018-07-12 16:14   ` Hannes Reinecke
@ 2018-07-12 16:20     ` Jens Axboe
  2018-07-12 16:42       ` Jens Axboe
  0 siblings, 1 reply; 54+ messages in thread
From: Jens Axboe @ 2018-07-12 16:20 UTC (permalink / raw)
  To: Hannes Reinecke; +Cc: Christoph Hellwig, linux-block, Martin Wilck

On 7/12/18 10:14 AM, Hannes Reinecke wrote:
> On 07/12/2018 05:08 PM, Jens Axboe wrote:
>> On 7/12/18 8:36 AM, Hannes Reinecke wrote:
>>> Hi Jens, Christoph,
>>>
>>> we're currently hunting down a silent data corruption occurring due to
>>> commit 72ecad22d9f1 ("block: support a full bio worth of IO for
>>> simplified bdev direct-io").
>>>
>>> While the whole thing is still hazy on the details, the one thing we've
>>> found is that reverting that patch fixes the data corruption.
>>>
>>> And looking closer, I've found this:
>>>
>>> static ssize_t
>>> blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
>>> {
>>> 	int nr_pages;
>>>
>>> 	nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES + 1);
>>> 	if (!nr_pages)
>>> 		return 0;
>>> 	if (is_sync_kiocb(iocb) && nr_pages <= BIO_MAX_PAGES)
>>> 		return __blkdev_direct_IO_simple(iocb, iter, nr_pages);
>>>
>>> 	return __blkdev_direct_IO(iocb, iter, min(nr_pages, BIO_MAX_PAGES));
>>> }
>>>
>>> When checking the call path
>>> __blkdev_direct_IO()->bio_alloc_bioset()->bvec_alloc()
>>> I found that bvec_alloc() will fail if nr_pages > BIO_MAX_PAGES.
>>>
>>> So why is there the check for 'nr_pages <= BIO_MAX_PAGES' ?
>>> It's not that we can handle it in __blkdev_direct_IO() ...
>>
>> The logic could be cleaned up like below, the sync part is really all
>> we care about. What is the test case for this? async or sync?
>>
>> I also don't remember why it's BIO_MAX_PAGES + 1...
>>
>> diff --git a/fs/block_dev.c b/fs/block_dev.c
>> index 0dd87aaeb39a..14ef3d71b55f 100644
>> --- a/fs/block_dev.c
>> +++ b/fs/block_dev.c
>> @@ -424,13 +424,13 @@ blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
>>   {
>>   	int nr_pages;
>>   
>> -	nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES + 1);
>> +	nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES);
>>   	if (!nr_pages)
>>   		return 0;
>> -	if (is_sync_kiocb(iocb) && nr_pages <= BIO_MAX_PAGES)
>> +	if (is_sync_kiocb(iocb))
>>   		return __blkdev_direct_IO_simple(iocb, iter, nr_pages);
>>   
>> -	return __blkdev_direct_IO(iocb, iter, min(nr_pages, BIO_MAX_PAGES));
>> +	return __blkdev_direct_IO(iocb, iter, nr_pages);
>>   }
>>   
>>   static __init int blkdev_init(void)
>>
> Hmm. We'll give it a go, but somehow I feel this won't solve our problem.

It probably won't, the only joker here is the BIO_MAX_PAGES + 1. But it
does simplify that part...

> Another question (which probably shows my complete ignorance):
> What happens if the iter holds >= BIO_MAX_PAGES?
> 'iov_iter_npages' will only ever return BIO_MAX_PAGES, independent of
> whether the iter contains more pages.
> What happens to those left-over pages?
> Will they ever be processed?

Short read or write, we rely on being called again for the remainder.

-- 
Jens Axboe


* Re: Silent data corruption in blkdev_direct_IO()
  2018-07-12 16:20     ` Jens Axboe
@ 2018-07-12 16:42       ` Jens Axboe
  2018-07-13  6:47         ` Martin Wilck
  2018-07-13 16:56         ` Martin Wilck
  0 siblings, 2 replies; 54+ messages in thread
From: Jens Axboe @ 2018-07-12 16:42 UTC (permalink / raw)
  To: Hannes Reinecke; +Cc: Christoph Hellwig, linux-block, Martin Wilck

On 7/12/18 10:20 AM, Jens Axboe wrote:
> On 7/12/18 10:14 AM, Hannes Reinecke wrote:
>> On 07/12/2018 05:08 PM, Jens Axboe wrote:
>>> On 7/12/18 8:36 AM, Hannes Reinecke wrote:
>>>> Hi Jens, Christoph,
>>>>
>>>> we're currently hunting down a silent data corruption occurring due to
>>>> commit 72ecad22d9f1 ("block: support a full bio worth of IO for
>>>> simplified bdev direct-io").
>>>>
>>>> While the whole thing is still hazy on the details, the one thing we've
>>>> found is that reverting that patch fixes the data corruption.
>>>>
>>>> And looking closer, I've found this:
>>>>
>>>> static ssize_t
>>>> blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
>>>> {
>>>> 	int nr_pages;
>>>>
>>>> 	nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES + 1);
>>>> 	if (!nr_pages)
>>>> 		return 0;
>>>> 	if (is_sync_kiocb(iocb) && nr_pages <= BIO_MAX_PAGES)
>>>> 		return __blkdev_direct_IO_simple(iocb, iter, nr_pages);
>>>>
>>>> 	return __blkdev_direct_IO(iocb, iter, min(nr_pages, BIO_MAX_PAGES));
>>>> }
>>>>
>>>> When checking the call path
>>>> __blkdev_direct_IO()->bio_alloc_bioset()->bvec_alloc()
>>>> I found that bvec_alloc() will fail if nr_pages > BIO_MAX_PAGES.
>>>>
>>>> So why is there the check for 'nr_pages <= BIO_MAX_PAGES' ?
>>>> It's not that we can handle it in __blkdev_direct_IO() ...
>>>
>>> The logic could be cleaned up like below, the sync part is really all
>>> we care about. What is the test case for this? async or sync?
>>>
>>> I also don't remember why it's BIO_MAX_PAGES + 1...
>>>
>>> diff --git a/fs/block_dev.c b/fs/block_dev.c
>>> index 0dd87aaeb39a..14ef3d71b55f 100644
>>> --- a/fs/block_dev.c
>>> +++ b/fs/block_dev.c
>>> @@ -424,13 +424,13 @@ blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
>>>   {
>>>   	int nr_pages;
>>>   
>>> -	nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES + 1);
>>> +	nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES);
>>>   	if (!nr_pages)
>>>   		return 0;
>>> -	if (is_sync_kiocb(iocb) && nr_pages <= BIO_MAX_PAGES)
>>> +	if (is_sync_kiocb(iocb))
>>>   		return __blkdev_direct_IO_simple(iocb, iter, nr_pages);
>>>   
>>> -	return __blkdev_direct_IO(iocb, iter, min(nr_pages, BIO_MAX_PAGES));
>>> +	return __blkdev_direct_IO(iocb, iter, nr_pages);
>>>   }
>>>   
>>>   static __init int blkdev_init(void)
>>>
>> Hmm. We'll give it a go, but somehow I feel this won't solve our problem.
> 
> It probably won't, the only joker here is the BIO_MAX_PAGES + 1. But it
> does simplify that part...

OK, now I remember. The +1 is just to check if there are actually more
pages. __blkdev_direct_IO_simple() only does one bio, so it has to fit
within that one bio. __blkdev_direct_IO() will loop just fine and
will finish any size, BIO_MAX_PAGES at a time.
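
In other words, annotating the dispatch that's already quoted above (this
is just a restatement of the existing code with comments, not a change):

	nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES + 1);
	if (!nr_pages)
		return 0;
	/*
	 * Capping at BIO_MAX_PAGES + 1 lets us tell "fits in one bio"
	 * (nr_pages <= BIO_MAX_PAGES) apart from "needs more than one
	 * bio" (nr_pages == BIO_MAX_PAGES + 1).  Only the sync,
	 * single-bio case may take the simple path:
	 */
	if (is_sync_kiocb(iocb) && nr_pages <= BIO_MAX_PAGES)
		return __blkdev_direct_IO_simple(iocb, iter, nr_pages);
	/* everything else loops, at most BIO_MAX_PAGES per bio */
	return __blkdev_direct_IO(iocb, iter, min(nr_pages, BIO_MAX_PAGES));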

Hence the patch I sent is wrong, the code actually looks fine. Which
means we're back to trying to figure out what is going on here. It'd
be great with a test case...

-- 
Jens Axboe


* Re: Silent data corruption in blkdev_direct_IO()
  2018-07-12 14:36 Silent data corruption in blkdev_direct_IO() Hannes Reinecke
  2018-07-12 15:08 ` Jens Axboe
@ 2018-07-12 23:29 ` Ming Lei
  2018-07-13 18:54   ` Jens Axboe
  1 sibling, 1 reply; 54+ messages in thread
From: Ming Lei @ 2018-07-12 23:29 UTC (permalink / raw)
  To: Hannes Reinecke; +Cc: Jens Axboe, Christoph Hellwig, linux-block, Martin Wilck

On Thu, Jul 12, 2018 at 10:36 PM, Hannes Reinecke <hare@suse.de> wrote:
> Hi Jens, Christoph,
>
> we're currently hunting down a silent data corruption occurring due to
> commit 72ecad22d9f1 ("block: support a full bio worth of IO for
> simplified bdev direct-io").
>
> While the whole thing is still hazy on the details, the one thing we've
> found is that reverting that patch fixes the data corruption.
>
> And looking closer, I've found this:
>
> static ssize_t
> blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
> {
>         int nr_pages;
>
>         nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES + 1);
>         if (!nr_pages)
>                 return 0;
>         if (is_sync_kiocb(iocb) && nr_pages <= BIO_MAX_PAGES)
>                 return __blkdev_direct_IO_simple(iocb, iter, nr_pages);
>
>         return __blkdev_direct_IO(iocb, iter, min(nr_pages, BIO_MAX_PAGES));
> }
>
> When checking the call path
> __blkdev_direct_IO()->bio_alloc_bioset()->bvec_alloc()
> I found that bvec_alloc() will fail if nr_pages > BIO_MAX_PAGES.
>
> So why is there the check for 'nr_pages <= BIO_MAX_PAGES' ?
> It's not that we can handle it in __blkdev_direct_IO() ...
>
> Thanks for any clarification.

Maybe you can try the following patch from Christoph to see if it makes a
difference:

https://marc.info/?l=linux-kernel&m=153013977816825&w=2


thanks,
Ming Lei


* Re: Silent data corruption in blkdev_direct_IO()
  2018-07-12 16:42       ` Jens Axboe
@ 2018-07-13  6:47         ` Martin Wilck
  2018-07-13 16:56         ` Martin Wilck
  1 sibling, 0 replies; 54+ messages in thread
From: Martin Wilck @ 2018-07-13  6:47 UTC (permalink / raw)
  To: Jens Axboe, Hannes Reinecke; +Cc: Christoph Hellwig, linux-block

On Thu, 2018-07-12 at 10:42 -0600, Jens Axboe wrote:
> On 7/12/18 10:20 AM, Jens Axboe wrote:
> > On 7/12/18 10:14 AM, Hannes Reinecke wrote:
> > > On 07/12/2018 05:08 PM, Jens Axboe wrote:
> > > > On 7/12/18 8:36 AM, Hannes Reinecke wrote:
> > > > > Hi Jens, Christoph,
> > > > > 
> > > > > we're currently hunting down a silent data corruption
> > > > > occurring due to
> > > > > commit 72ecad22d9f1 ("block: support a full bio worth of IO
> > > > > for
> > > > > simplified bdev direct-io").
> > > > > 
> > > > > While the whole thing is still hazy on the details, the one
> > > > > thing we've
> > > > > found is that reverting that patch fixes the data corruption.
> > > > > 
> > > > > And looking closer, I've found this:
> > > > > 
> > > > > static ssize_t
> > > > > blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
> > > > > {
> > > > > 	int nr_pages;
> > > > > 
> > > > > 	nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES + 1);
> > > > > 	if (!nr_pages)
> > > > > 		return 0;
> > > > > 	if (is_sync_kiocb(iocb) && nr_pages <= BIO_MAX_PAGES)
> > > > > 		return __blkdev_direct_IO_simple(iocb, iter,
> > > > > nr_pages);
> > > > > 
> > > > > 	return __blkdev_direct_IO(iocb, iter, min(nr_pages,
> > > > > BIO_MAX_PAGES));
> > > > > }
> > > > > 
> > > > > When checking the call path
> > > > > __blkdev_direct_IO()->bio_alloc_bioset()->bvec_alloc()
> > > > > I found that bvec_alloc() will fail if nr_pages >
> > > > > BIO_MAX_PAGES.
> > > > > 
> > > > > So why is there the check for 'nr_pages <= BIO_MAX_PAGES' ?
> > > > > It's not that we can handle it in __blkdev_direct_IO() ...
> > > > 
> > > > The logic could be cleaned up like below, the sync part is
> > > > really all
> > > > we care about. What is the test case for this? async or sync?
> > > > 
> > > > I also don't remember why it's BIO_MAX_PAGES + 1...
> > > > 
> > > > diff --git a/fs/block_dev.c b/fs/block_dev.c
> > > > index 0dd87aaeb39a..14ef3d71b55f 100644
> > > > --- a/fs/block_dev.c
> > > > +++ b/fs/block_dev.c
> > > > @@ -424,13 +424,13 @@ blkdev_direct_IO(struct kiocb *iocb,
> > > > struct iov_iter *iter)
> > > >   {
> > > >   	int nr_pages;
> > > >   
> > > > -	nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES + 1);
> > > > +	nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES);
> > > >   	if (!nr_pages)
> > > >   		return 0;
> > > > -	if (is_sync_kiocb(iocb) && nr_pages <= BIO_MAX_PAGES)
> > > > +	if (is_sync_kiocb(iocb))
> > > >   		return __blkdev_direct_IO_simple(iocb, iter,
> > > > nr_pages);
> > > >   
> > > > -	return __blkdev_direct_IO(iocb, iter, min(nr_pages,
> > > > BIO_MAX_PAGES));
> > > > +	return __blkdev_direct_IO(iocb, iter, nr_pages);
> > > >   }
> > > >   
> > > >   static __init int blkdev_init(void)
> > > > 
> > > 
> > > Hmm. We'll give it a go, but somehow I feel this won't solve our
> > > problem.
> > 
> > It probably won't, the only joker here is the BIO_MAX_PAGES + 1.
> > But it
> > does simplify that part...
> 
> OK, now I remember. The +1 is just to check if there are actually
> more
> pages. __blkdev_direct_IO_simple() only does one bio, so it has to
> fit
> within that one bio. __blkdev_direct_IO() will loop just fine and
> will finish any size, BIO_MAX_PAGES at a time.

Right. Hannes, I think we (or at least I) have been misled by looking
at outdated code. The key point I missed is that
__blkdev_direct_IO() is called with min(nr_pages, BIO_MAX_PAGES), and
advances beyond that later in the loop.

> Hence the patch I sent is wrong, the code actually looks fine. Which
> means we're back to trying to figure out what is going on here. It'd
> be great with a test case...

Unfortunately we have no reproducer just yet. Only the customer can
reproduce it. The scenario is a data base running on a KVM virtual
machine on top of a virtio-scsi volume backed by a multipath map, with
cache='none' in qemu.

My latest thinking is that if, for a reason I don't yet understand,
blkdev_direct_IO() resulted in a short write,
__generic_file_write_iter() would fall back to buffered writing, which
might explain the data corruption we observe.
That's just speculation at the current stage.
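
For reference, the fallback I have in mind looks roughly like this
(paraphrased from __generic_file_write_iter() in mm/filemap.c, not the
verbatim source):

	if (iocb->ki_flags & IOCB_DIRECT) {
		written = generic_file_direct_write(iocb, from);
		/*
		 * If the direct write stopped short and the iov_iter
		 * still holds data, the rest goes through the page
		 * cache via generic_perform_write().
		 */
		if (written < 0 || !iov_iter_count(from) || IS_DAX(inode))
			goto out;
		status = generic_perform_write(file, from, iocb->ki_pos);
		/* ... merge counts, write back and invalidate the range ... */
	}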

Regards
Martin

-- 
Dr. Martin Wilck <mwilck@suse.com>, Tel. +49 (0)911 74053 2107
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)


* Re: Silent data corruption in blkdev_direct_IO()
  2018-07-12 16:42       ` Jens Axboe
  2018-07-13  6:47         ` Martin Wilck
@ 2018-07-13 16:56         ` Martin Wilck
  2018-07-13 18:00           ` Jens Axboe
  1 sibling, 1 reply; 54+ messages in thread
From: Martin Wilck @ 2018-07-13 16:56 UTC (permalink / raw)
  To: Jens Axboe, Hannes Reinecke; +Cc: Christoph Hellwig, linux-block

On Thu, 2018-07-12 at 10:42 -0600, Jens Axboe wrote:
> 
> Hence the patch I sent is wrong, the code actually looks fine. Which
> means we're back to trying to figure out what is going on here. It'd
> be great with a test case...

We don't have an easy test case yet. But the customer has confirmed
that the problem occurs with upstream 4.17.5, too. We also confirmed
again that the problem occurs when the kernel uses the kmalloc() code
path in __blkdev_direct_IO_simple().
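
For context, the vector allocation in that path is roughly the following
(paraphrased from __blkdev_direct_IO_simple() in fs/block_dev.c, not the
verbatim source):

	struct bio_vec inline_vecs[DIO_INLINE_BIO_VECS], *vecs;

	if (nr_pages <= DIO_INLINE_BIO_VECS)
		vecs = inline_vecs;
	else {
		/* this is the kmalloc() path the corruption shows up with */
		vecs = kmalloc(nr_pages * sizeof(*vecs), GFP_KERNEL);
		if (!vecs)
			return -ENOMEM;
	}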

My personal suggestion would be to ditch __blkdev_direct_IO_simple()
altogether. After all, it's not _that_ much simpler than
__blkdev_direct_IO(), and it seems to be broken in a subtle way.

However, so far I've only identified a minor problem, see below - it
doesn't explain the data corruption we're seeing.

Martin

-- 
Dr. Martin Wilck <mwilck@suse.com>, Tel. +49 (0)911 74053 2107
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)


From d0c74ef1fc73c03983950c496f7490da8aa56671 Mon Sep 17 00:00:00 2001
From: Martin Wilck <mwilck@suse.com>
Date: Fri, 13 Jul 2018 18:38:44 +0200
Subject: [PATCH] fs: fix error exit in __blkdev_direct_IO_simple

Cleanup code was missing in the error return path.
---
 fs/block_dev.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 7ec920e..b82b516 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -218,8 +218,12 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter,
 	bio.bi_end_io = blkdev_bio_end_io_simple;
 
 	ret = bio_iov_iter_get_pages(&bio, iter);
-	if (unlikely(ret))
+	if (unlikely(ret)) {
+		if (vecs != inline_vecs)
+			kfree(vecs);
+		bio_uninit(&bio);
 		return ret;
+	}
 	ret = bio.bi_iter.bi_size;
 
 	if (iov_iter_rw(iter) == READ) {
-- 
2.17.1


* Re: Silent data corruption in blkdev_direct_IO()
  2018-07-13 16:56         ` Martin Wilck
@ 2018-07-13 18:00           ` Jens Axboe
  2018-07-13 18:50             ` Jens Axboe
  2018-07-13 20:48             ` Martin Wilck
  0 siblings, 2 replies; 54+ messages in thread
From: Jens Axboe @ 2018-07-13 18:00 UTC (permalink / raw)
  To: Martin Wilck, Hannes Reinecke; +Cc: Christoph Hellwig, linux-block

On 7/13/18 10:56 AM, Martin Wilck wrote:
> On Thu, 2018-07-12 at 10:42 -0600, Jens Axboe wrote:
>>
>> Hence the patch I sent is wrong, the code actually looks fine. Which
>> means we're back to trying to figure out what is going on here. It'd
>> be great with a test case...
> 
> We don't have an easy test case yet. But the customer has confirmed
> that the problem occurs with upstream 4.17.5, too. We also confirmed
> again that the problem occurs when the kernel uses the kmalloc() code
> path in __blkdev_direct_IO_simple().
> 
> My personal suggestion would be to ditch __blkdev_direct_IO_simple()
> altogether. After all, it's not _that_ much simpler than
> __blkdev_direct_IO(), and it seems to be broken in a subtle way.

That's not a great suggestion at all, we need to find out why we're
hitting the issue. For all you know, the bug could be elsewhere and
we're just going to be hitting it differently some other way. The
head-in-the-sand approach is rarely a win long term.

It's saving an allocation per IO, that's definitely measurable on
the faster storage. For reads, it's also not causing a context
switch for dirtying pages. I'm not a huge fan of multiple cases
in general, but this one is definitely warranted in an era where
1 usec is a lot of extra time for an IO.

> However, so far I've only identified a minor problem, see below - it
> doesn't explain the data corruption we're seeing.

What would help is trying to boil down a test case. So far it's a lot
of hand waving, and nothing that can really help narrow down what is
going on here.

-- 
Jens Axboe


* Re: Silent data corruption in blkdev_direct_IO()
  2018-07-13 18:00           ` Jens Axboe
@ 2018-07-13 18:50             ` Jens Axboe
  2018-07-13 22:21               ` Martin Wilck
  2018-07-13 20:48             ` Martin Wilck
  1 sibling, 1 reply; 54+ messages in thread
From: Jens Axboe @ 2018-07-13 18:50 UTC (permalink / raw)
  To: Martin Wilck, Hannes Reinecke; +Cc: Christoph Hellwig, linux-block

On 7/13/18 12:00 PM, Jens Axboe wrote:
> On 7/13/18 10:56 AM, Martin Wilck wrote:
>> On Thu, 2018-07-12 at 10:42 -0600, Jens Axboe wrote:
>>>
>>> Hence the patch I sent is wrong, the code actually looks fine. Which
>>> means we're back to trying to figure out what is going on here. It'd
>>> be great with a test case...
>>
>> We don't have an easy test case yet. But the customer has confirmed
>> that the problem occurs with upstream 4.17.5, too. We also confirmed
>> again that the problem occurs when the kernel uses the kmalloc() code
>> path in __blkdev_direct_IO_simple().
>>
>> My personal suggestion would be to ditch __blkdev_direct_IO_simple()
> > altogether. After all, it's not _that_ much simpler than
>> __blkdev_direct_IO(), and it seems to be broken in a subtle way.
> 
> That's not a great suggestion at all, we need to find out why we're
> hitting the issue. For all you know, the bug could be elsewhere and
> we're just going to be hitting it differently some other way. The
> head-in-the-sand approach is rarely a win long term.
> 
> It's saving an allocation per IO, that's definitely measurable on
> the faster storage. For reads, it's also not causing a context
> switch for dirtying pages. I'm not a huge fan of multiple cases
> in general, but this one is definitely warranted in an era where
> 1 usec is a lot of extra time for an IO.
> 
>> However, so far I've only identified a minor problem, see below - it
>> doesn't explain the data corruption we're seeing.
> 
> What would help is trying to boil down a test case. So far it's a lot
> of hand waving, and nothing that can really help narrow down what is
> going on here.

When someone reports to you that some kind of data corruption occurs,
you need to find out as many details as you can. Ideally they can give you
a test case, but if they can't, you ask as many questions as possible to
help YOU make a test case.

- What is the nature of the corruption? Is it reads or writes? Is it
  zeroes, random data, or data from somewhere else? How big is the
  corrupted area?

- What does the workload look like that reproduces it? If they don't
  really know and they can't give you a reproducer, you help them with
  tracing to give you the information you need.

Those are a great start. Right now we have zero information, other than
reverting that patch fixes it. This means that we're likely dealing with
IO that is larger than 16k, since we're going to be hitting the same
path for <= 16k ios with the patch reverted. We also know it's less than
1MB. But that's it. I don't even know if it's writes, since your (and
Hannes's) report has no details at all.
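
(The 16k/1MB bounds are just arithmetic, assuming 4 KiB pages:
DIO_INLINE_BIO_VECS pages = 4 * 4 KiB = 16 KiB was the simple-path limit
before the patch, and BIO_MAX_PAGES pages = 256 * 4 KiB = 1 MiB is the
single-bio limit with it.)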

Go look at what you ask customers to provide for bug reports, then apply
some of that to this case...

-- 
Jens Axboe


* Re: Silent data corruption in blkdev_direct_IO()
  2018-07-12 23:29 ` Ming Lei
@ 2018-07-13 18:54   ` Jens Axboe
  2018-07-13 22:29     ` Martin Wilck
  0 siblings, 1 reply; 54+ messages in thread
From: Jens Axboe @ 2018-07-13 18:54 UTC (permalink / raw)
  To: Ming Lei, Hannes Reinecke; +Cc: Christoph Hellwig, linux-block, Martin Wilck

On 7/12/18 5:29 PM, Ming Lei wrote:
> On Thu, Jul 12, 2018 at 10:36 PM, Hannes Reinecke <hare@suse.de> wrote:
>> Hi Jens, Christoph,
>>
>> we're currently hunting down a silent data corruption occurring due to
>> commit 72ecad22d9f1 ("block: support a full bio worth of IO for
>> simplified bdev direct-io").
>>
>> While the whole thing is still hazy on the details, the one thing we've
>> found is that reverting that patch fixes the data corruption.
>>
>> And looking closer, I've found this:
>>
>> static ssize_t
>> blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
>> {
>>         int nr_pages;
>>
>>         nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES + 1);
>>         if (!nr_pages)
>>                 return 0;
>>         if (is_sync_kiocb(iocb) && nr_pages <= BIO_MAX_PAGES)
>>                 return __blkdev_direct_IO_simple(iocb, iter, nr_pages);
>>
>>         return __blkdev_direct_IO(iocb, iter, min(nr_pages, BIO_MAX_PAGES));
>> }
>>
>> When checking the call path
>> __blkdev_direct_IO()->bio_alloc_bioset()->bvec_alloc()
>> I found that bvec_alloc() will fail if nr_pages > BIO_MAX_PAGES.
>>
>> So why is there the check for 'nr_pages <= BIO_MAX_PAGES' ?
>> It's not that we can handle it in __blkdev_direct_IO() ...
>>
>> Thanks for any clarification.
> 
> Maybe you can try the following patch from Christoph to see if it makes a
> difference:
> 
> https://marc.info/?l=linux-kernel&m=153013977816825&w=2

That's not a bad idea.

-- 
Jens Axboe


* Re: Silent data corruption in blkdev_direct_IO()
  2018-07-13 18:00           ` Jens Axboe
  2018-07-13 18:50             ` Jens Axboe
@ 2018-07-13 20:48             ` Martin Wilck
  2018-07-13 20:52               ` Jens Axboe
  1 sibling, 1 reply; 54+ messages in thread
From: Martin Wilck @ 2018-07-13 20:48 UTC (permalink / raw)
  To: Jens Axboe, Hannes Reinecke; +Cc: Christoph Hellwig, linux-block

On Fri, 2018-07-13 at 12:00 -0600, Jens Axboe wrote:
> On 7/13/18 10:56 AM, Martin Wilck wrote:
> > On Thu, 2018-07-12 at 10:42 -0600, Jens Axboe wrote:
> > > 
> > > Hence the patch I sent is wrong, the code actually looks fine.
> > > Which
> > > means we're back to trying to figure out what is going on here.
> > > It'd
> > > be great with a test case...
> > 
> > We don't have an easy test case yet. But the customer has confirmed
> > that the problem occurs with upstream 4.17.5, too. We also
> > confirmed
> > again that the problem occurs when the kernel uses the kmalloc()
> > code
> > path in __blkdev_direct_IO_simple().
> > 
> > My personal suggestion would be to ditch
> > __blkdev_direct_IO_simple()
> > altogether. After all, it's not _that_ much simpler than
> > __blkdev_direct_IO(), and it seems to be broken in a subtle way.
> 
> That's not a great suggestion at all, we need to find out why we're
> hitting the issue. 

We're trying.

> For all you know, the bug could be elsewhere and
> we're just going to be hitting it differently some other way. The
> head-in-the-sand approach is rarely a win long term.
> 
> It's saving an allocation per IO, that's definitely measurable on
> the faster storage. 

I can see that for the inline path, but is there still an advantage if
we need to kmalloc() the biovec?

> For reads, it's also not causing a context
> switch for dirtying pages. I'm not a huge fan of multiple cases
> in general, but this one is definitely warranted in an era where
> 1 usec is a lot of extra time for an IO.

Ok, thanks for pointing that out.

> 
> > However, so far I've only identified a minor problem, see below -
> > it
> > doesn't explain the data corruption we're seeing.
> 
> What would help is trying to boil down a test case. So far it's a lot
> of hand waving, and nothing that can really help narrow down what is
> going on here.

It's not that we didn't try. We've run fio with verification on block
devices with varying io sizes, block sizes, and alignments, but so far
we haven't hit the issue. We've also tried to reproduce it by
approximating the customer's VM setup, with no success up to now.

However, we're now much closer than we used to be, so I'm confident
that we'll be able to present more concrete facts soon.

Martin

-- 
Dr. Martin Wilck <mwilck@suse.com>, Tel. +49 (0)911 74053 2107
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)


* Re: Silent data corruption in blkdev_direct_IO()
  2018-07-13 20:48             ` Martin Wilck
@ 2018-07-13 20:52               ` Jens Axboe
  2018-07-16 19:05                 ` Martin Wilck
  0 siblings, 1 reply; 54+ messages in thread
From: Jens Axboe @ 2018-07-13 20:52 UTC (permalink / raw)
  To: Martin Wilck, Hannes Reinecke; +Cc: Christoph Hellwig, linux-block

On 7/13/18 2:48 PM, Martin Wilck wrote:
>> For all you know, the bug could be elsewhere and
>> we're just going to be hitting it differently some other way. The
>> head-in-the-sand approach is rarely a win long term.
>>
>> It's saving an allocation per IO, that's definitely measurable on
>> the faster storage. 
> 
> I can see that for the inline path, but is there still an advantage if
> we need to kmalloc() the biovec?

It's still one allocation instead of two. In the era of competing with
spdk on single digit latency devices, the answer is yes.

>>> However, so far I've only identified a minor problem, see below -
>>> it
>>> doesn't explain the data corruption we're seeing.
>>
>> What would help is trying to boil down a test case. So far it's a lot
>> of hand waving, and nothing that can really help narrow down what is
>> going on here.
> 
> It's not that we didn't try. We've run fio with verification on block
> devices with varying io sizes, block sizes, and alignments, but so far
> we haven't hit the issue. We've also tried to reproduce it by
> approximating the customer's VM setup, with no success up to now.

I ran some testing yesterday as well, but didn't trigger anything.
Didn't expect to either, as all the basic functionality was verified
when the patch was done. It's not really a path with a lot of corner
cases, so it's really weird that we're seeing anything at all. Which is
why I'm suspecting it's something else entirely, but it's really hard to
guesstimate on that with no clues at all.

> However, we're now much closer than we used to be, so I'm confident
> that we'll be able to present more concrete facts soon.

OK, sounds good.

-- 
Jens Axboe


* Re: Silent data corruption in blkdev_direct_IO()
  2018-07-13 18:50             ` Jens Axboe
@ 2018-07-13 22:21               ` Martin Wilck
  0 siblings, 0 replies; 54+ messages in thread
From: Martin Wilck @ 2018-07-13 22:21 UTC (permalink / raw)
  To: Jens Axboe, Hannes Reinecke; +Cc: Christoph Hellwig, linux-block

On Fri, 2018-07-13 at 12:50 -0600, Jens Axboe wrote:
> On 7/13/18 12:00 PM, Jens Axboe wrote:
> > On 7/13/18 10:56 AM, Martin Wilck wrote:
> > > On Thu, 2018-07-12 at 10:42 -0600, Jens Axboe wrote:
> > > > 
> > > > Hence the patch I sent is wrong, the code actually looks fine.
> > > > Which
> > > > means we're back to trying to figure out what is going on here.
> > > > It'd
> > > > be great with a test case...
> > > 
> > > We don't have an easy test case yet. But the customer has
> > > confirmed
> > > that the problem occurs with upstream 4.17.5, too. We also
> > > confirmed
> > > again that the problem occurs when the kernel uses the kmalloc()
> > > code
> > > path in __blkdev_direct_IO_simple().
> > > 
> > > My personal suggestion would be to ditch
> > > __blkdev_direct_IO_simple()
> > > altogether. After all, it's not _that_ much simpler than
> > > __blkdev_direct_IO(), and it seems to be broken in a subtle way.
> > 
> > That's not a great suggestion at all, we need to find out why we're
> > hitting the issue. For all you know, the bug could be elsewhere and
> > we're just going to be hitting it differently some other way. The
> > head-in-the-sand approach is rarely a win long term.
> > 
> > It's saving an allocation per IO, that's definitely measurable on
> > the faster storage. For reads, it's also not causing a context
> > switch for dirtying pages. I'm not a huge fan of multiple cases
> > in general, but this one is definitely warranted in an era where
> > 1 usec is a lot of extra time for an IO.
> > 
> > > However, so far I've only identified a minor problem, see below -
> > > it
> > > doesn't explain the data corruption we're seeing.
> > 
> > What would help is trying to boil down a test case. So far it's a
> > lot
> > of hand waving, and nothing that can really help narrow down what
> > is
> > going on here.
> 
> When someone reports to you that some kind of data corruption occurs,
> you need to find out as many details as you can. Ideally they can give
> you
> a test case, but if they can't, you ask as many questions as possible
> to
> help YOU make a test case.
> 
> - What is the nature of the corruption? Is it reads or writes? Is it
>   zeroes, random data, or data from somewhere else? How big is the
>   corrupted area?
> 
> - What does the workload look like that reproduces it? If they don't
>   really know and they can't give you a reproducer, you help them
> with
>   tracing to give you the information you need.
> 
> Those are a great start. Right now we have zero information, other
> than
> reverting that patch fixes it. This means that we're likely dealing
> with
> IO that is larger than 16k, since we're going to be hitting the same
> path for <= 16k ios with the patch reverted. We also know it's less
> than
> 1MB.

The nature of the corruption is an interesting question. The weird
thing is that it's not block-aligned - "bad" data starts at some
apparently random offset into the log file. That'd actually made me
think that we're not looking at a block IO issue but maybe some
corruption that occured in memory before write-out. But then we
isolated this specific patch by bisection... I had my doubts, but it
seems to be sound, at least the test results indicate so.

The user application is a database which writes logs in blocks that are
multiples of 8k. The IO goes through various layers in the VM (ext3 on
LVM on MD on virtio), then through qemu via virtio-scsi, and onto a
host multipath device which is backed by FC storage. The VM is running
an old SLES 11 distro, the problems were caused by a host kernel
update.

My attempts to reproduce a corruption with lab equipment have failed up
to now. I've been trying to replay the customer's workload with
btreplay. I also tried to replay it with fio with verification to try
to find corruptions, but I guess I didn't get the setup quite right. 

>  But that's it. I don't even know if it's writes, since your (and
> Hannes's) report has no details at all.
> Go look at what you ask customers to provide for bug reports, then
> apply
> some of that to this case...

Sorry. It's writes. The reason we posted this here was a) the concrete
questions Hannes asked about the code, b) the patch we'd isolated - we
thought you or Christoph might just have a bright idea about it, or see
what we didn't, and c) that we thought the problem might affect
upstream, too.

We don't expect you to do the dirty work for us. We'll come back to the
list when we know more, or have specific questions.

Martin

-- 
Dr. Martin Wilck <mwilck@suse.com>, Tel. +49 (0)911 74053 2107
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)


* Re: Silent data corruption in blkdev_direct_IO()
  2018-07-13 18:54   ` Jens Axboe
@ 2018-07-13 22:29     ` Martin Wilck
  2018-07-16 11:45       ` Ming Lei
  0 siblings, 1 reply; 54+ messages in thread
From: Martin Wilck @ 2018-07-13 22:29 UTC (permalink / raw)
  To: Jens Axboe, Ming Lei, Hannes Reinecke; +Cc: Christoph Hellwig, linux-block

Hi Ming & Jens,

On Fri, 2018-07-13 at 12:54 -0600, Jens Axboe wrote:
> On 7/12/18 5:29 PM, Ming Lei wrote:
> > 
> > Maybe you can try the following patch from Christoph to see if it
> > makes a
> > difference:
> > 
> > https://marc.info/?l=linux-kernel&m=153013977816825&w=2
> 
> That's not a bad idea.

Are you saying that the previous "nasty hack"  in
bio_iov_iter_get_pages() was broken, and the new one is not?
I've scratched my head over that (old) code lately, but I couldn't spot
an actual error in it.

Unfortunately we can only do one test at a time. We're currently trying
to find out more about the IO sizes at which the problem occurs. As you
noted, currently we know no more than that it happens somewhere between
16k and 1M.

Martin

-- 
Dr. Martin Wilck <mwilck@suse.com>, Tel. +49 (0)911 74053 2107
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)


* Re: Silent data corruption in blkdev_direct_IO()
  2018-07-13 22:29     ` Martin Wilck
@ 2018-07-16 11:45       ` Ming Lei
  2018-07-18  0:07         ` Martin Wilck
  0 siblings, 1 reply; 54+ messages in thread
From: Ming Lei @ 2018-07-16 11:45 UTC (permalink / raw)
  To: Martin Wilck
  Cc: Jens Axboe, Hannes Reinecke, Christoph Hellwig, linux-block, Ming Lei

On Sat, Jul 14, 2018 at 6:29 AM, Martin Wilck <mwilck@suse.com> wrote:
> Hi Ming & Jens,
>
> On Fri, 2018-07-13 at 12:54 -0600, Jens Axboe wrote:
>> On 7/12/18 5:29 PM, Ming Lei wrote:
>> >
>> > Maybe you can try the following patch from Christoph to see if it
>> > makes a
>> > difference:
>> >
>> > https://marc.info/?l=linux-kernel&m=153013977816825&w=2
>>
>> That's not a bad idea.
>
> Are you saying that the previous "nasty hack"  in
> bio_iov_iter_get_pages() was broken, and the new one is not?
> I've scratched my head over that (old) code lately, but I couldn't spot
> an actual error in it.

Yeah, I think the new patch in the above link is better, and its
correctness looks easy to prove.

   https://marc.info/?t=152812081400001&r=1&w=2

So please test the patch if possible.

Thanks,
Ming Lei


* Re: Silent data corruption in blkdev_direct_IO()
  2018-07-13 20:52               ` Jens Axboe
@ 2018-07-16 19:05                 ` Martin Wilck
  0 siblings, 0 replies; 54+ messages in thread
From: Martin Wilck @ 2018-07-16 19:05 UTC (permalink / raw)
  To: Jens Axboe, Hannes Reinecke, Jan Kara; +Cc: Christoph Hellwig, linux-block

On Fri, 2018-07-13 at 14:52 -0600, Jens Axboe wrote:
> On 7/13/18 2:48 PM, Martin Wilck wrote:
> > > 
> > > > However, so far I've only identified a minor problem, see below
> > > > -
> > > > it
> > > > doesn't explain the data corruption we're seeing.
> > > 
> > > What would help is trying to boil down a test case. So far it's a
> > > lot
> > > of hand waving, and nothing that can really help narrow down what
> > > is
> > > going on here.
> > 
> > It's not that we didn't try. We've run fio with verification on
> > block
> > devices with varying io sizes, block sizes, and alignments, but so
> > far
> > we haven't hit the issue. We've also tried to reproduce it by
> > approximating the customer's VM setup, with no success up to now.
> 
> I ran some testing yesterday as well, but didn't trigger anything.
> Didn't expect to either, as all the basic functionality was verified
> when the patch was done. It's not really a path with a lot of corner
> cases, so it's really weird that we're seeing anything at all. Which
> is
> why I'm suspecting it's something else entirely, but it's really hard
> to
> guesstimate on that with no clues at all.
> 
> > However, we're now much closer than we used to be, so I'm confident
> > that we'll be able to present more concrete facts soon.
> 
> OK, sounds good.

Jan Kara has provided a very convincing analysis and a patch which we
are going to have the customer test.

By calling bio_iov_iter_get_pages() only once,
__blkdev_direct_IO_simple() may not transfer all requested bytes,
because bio_iov_iter_get_pages() doesn't necessarily exhaust all data
in the iov_iter. Thus a short write may occur, and
__generic_file_write_iter() falls back to buffered IO. We've actually
observed these "short direct writes" in the error case with an
instrumented kernel (in a trace I got, ~13000/800000 direct write ops
on a block device transferred less data than requested).
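
Schematically, the difference between the two paths is (paraphrased, not
the verbatim kernel code):

	/* __blkdev_direct_IO_simple(): one bio, pages grabbed exactly once */
	ret = bio_iov_iter_get_pages(&bio, iter);
	/* submit that single bio; anything left in the iter is not written,
	 * so the returned byte count can be short */

	/* __blkdev_direct_IO(): keeps going until the iter is drained */
	for (;;) {
		ret = bio_iov_iter_get_pages(bio, iter);
		/* submit this bio */
		if (!iov_iter_count(iter))
			break;
		/* allocate/chain the next bio and continue */
	}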

We suspect that this concurrency of direct and buffered writes may
cause the corruption the customer observes.

Does that make sense to you?

Regards,
Martin

-- 
Dr. Martin Wilck <mwilck@suse.com>, Tel. +49 (0)911 74053 2107
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)


* Re: Silent data corruption in blkdev_direct_IO()
  2018-07-16 11:45       ` Ming Lei
@ 2018-07-18  0:07         ` Martin Wilck
  2018-07-18  2:48           ` Ming Lei
  0 siblings, 1 reply; 54+ messages in thread
From: Martin Wilck @ 2018-07-18  0:07 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, Hannes Reinecke, Christoph Hellwig, linux-block,
	Ming Lei, jack, kent.overstreet

On Mon, 2018-07-16 at 19:45 +0800, Ming Lei wrote:
> On Sat, Jul 14, 2018 at 6:29 AM, Martin Wilck <mwilck@suse.com>
> wrote:
> > Hi Ming & Jens,
> > 
> > On Fri, 2018-07-13 at 12:54 -0600, Jens Axboe wrote:
> > > On 7/12/18 5:29 PM, Ming Lei wrote:
> > > > 
> > > > Maybe you can try the following patch from Christoph to see if
> > > > it
> > > > makes a
> > > > difference:
> > > > 
> > > > https://marc.info/?l=linux-kernel&m=153013977816825&w=2
> > > 
> > > That's not a bad idea.
> > 
> > Are you saying that the previous "nasty hack"  in
> > bio_iov_iter_get_pages() was broken, and the new one is not?
> > I've scratched my head over that (old) code lately, but I couldn't
> > spot
> > an actual error in it.
> 
> Yeah, I think the new patch in the above link is better, and its
> correctness looks easy to prove.
> 
>    https://marc.info/?t=152812081400001&r=1&w=2
> 
> So please test the patch if possible.

I haven't tested yet, but upon further inspection, I can tell that the
current code (without Christoph's patch) is actually broken if the
function is called with bio->bi_vcnt > 0. The following patch explains
the problem. AFAICS, Christoph's new code is correct.

From b75adc856119346e02126cf8975755300f2d9b7f Mon Sep 17 00:00:00 2001
From: Martin Wilck <mwilck@suse.com>
Date: Wed, 18 Jul 2018 01:56:37 +0200
Subject: [PATCH] block: bio_iov_iter_get_pages: fix size of last iovec

If the last page of the bio is not "full", the length of the last
vector bin needs to be corrected. This bin has the index
(bio->bi_vcnt - 1), but in bio->bi_io_vec, not in the "bv" helper array which
is shifted by the value of bio->bi_vcnt at function invocation.

Signed-off-by: Martin Wilck <mwilck@suse.com>
---
 block/bio.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/bio.c b/block/bio.c
index 53e0f0a..c22e76f 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -913,7 +913,7 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	bv[0].bv_offset += offset;
 	bv[0].bv_len -= offset;
 	if (diff)
-		bv[bio->bi_vcnt - 1].bv_len -= diff;
+		bio->bi_io_vec[bio->bi_vcnt - 1].bv_len -= diff;
 
 	iov_iter_advance(iter, size);
 	return 0;
-- 
2.17.1
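
To illustrate the index shift with made-up numbers (a standalone demo,
not kernel code; the concrete values are hypothetical):

#include <stdio.h>

struct vec { int bv_len; };

int main(void)
{
	struct vec bi_io_vec[16] = { { 0 } };
	unsigned short bi_vcnt = 2;	/* bio already holds two vectors */
	int nr_pages = 3;		/* pages just pinned in this call */
	struct vec *bv = bi_io_vec + bi_vcnt;	/* helper array, shifted */

	bi_vcnt += nr_pages;		/* as the kernel function does */

	/* last newly filled slot is bi_io_vec[4], i.e. bv[nr_pages - 1] */
	printf("correct slot: bi_io_vec[%td]\n", &bv[nr_pages - 1] - bi_io_vec);
	/* the old code trimmed bv[bi_vcnt - 1], which is bi_io_vec[6] */
	printf("buggy slot:   bi_io_vec[%td]\n", &bv[bi_vcnt - 1] - bi_io_vec);
	return 0;
}

With the fix, bio->bi_io_vec[bio->bi_vcnt - 1] is bi_io_vec[4] here, which
is the slot that actually needs its length trimmed.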


* Re: Silent data corruption in blkdev_direct_IO()
  2018-07-18  0:07         ` Martin Wilck
@ 2018-07-18  2:48           ` Ming Lei
  2018-07-18  7:32             ` Martin Wilck
  0 siblings, 1 reply; 54+ messages in thread
From: Ming Lei @ 2018-07-18  2:48 UTC (permalink / raw)
  To: Martin Wilck
  Cc: Ming Lei, Jens Axboe, Hannes Reinecke, Christoph Hellwig,
	linux-block, jack, kent.overstreet

On Wed, Jul 18, 2018 at 02:07:28AM +0200, Martin Wilck wrote:
> On Mon, 2018-07-16 at 19:45 +0800, Ming Lei wrote:
> > On Sat, Jul 14, 2018 at 6:29 AM, Martin Wilck <mwilck@suse.com>
> > wrote:
> > > Hi Ming & Jens,
> > > 
> > > On Fri, 2018-07-13 at 12:54 -0600, Jens Axboe wrote:
> > > > On 7/12/18 5:29 PM, Ming Lei wrote:
> > > > > 
> > > > > Maybe you can try the following patch from Christoph to see if
> > > > > it
> > > > > makes a
> > > > > difference:
> > > > > 
> > > > > https://marc.info/?l=linux-kernel&m=153013977816825&w=2
> > > > 
> > > > That's not a bad idea.
> > > 
> > > Are you saying that the previous "nasty hack"  in
> > > bio_iov_iter_get_pages() was broken, and the new one is not?
> > > I've scratched my head over that (old) code lately, but I couldn't
> > > spot
> > > an actual error in it.
> > 
> > Yeah, I think the new patch in the above link is better, and its
> > correctness looks easy to prove.
> > 
> >    https://marc.info/?t=152812081400001&r=1&w=2
> > 
> > So please test the patch if possible.
> 
> I haven't tested yet, but upon further inspection, I can tell that the
> current code (without Christoph's patch) is actually broken if the
> function is called with bio->bi_vcnt > 0. The following patch explains
> the problem. AFAICS, Christoph's new code is correct.
> 
> From b75adc856119346e02126cf8975755300f2d9b7f Mon Sep 17 00:00:00 2001
> From: Martin Wilck <mwilck@suse.com>
> Date: Wed, 18 Jul 2018 01:56:37 +0200
> Subject: [PATCH] block: bio_iov_iter_get_pages: fix size of last iovec
> 
> If the last page of the bio is not "full", the length of the last
> vector bin needs to be corrected. This bin has the index
> (bio->bi_vcnt - 1), but in bio->bi_io_vec, not in the "bv" helper array which
> is shifted by the value of bio->bi_vcnt at function invocation.
> 
> Signed-off-by: Martin Wilck <mwilck@suse.com>
> ---
>  block/bio.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/block/bio.c b/block/bio.c
> index 53e0f0a..c22e76f 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -913,7 +913,7 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
>  	bv[0].bv_offset += offset;
>  	bv[0].bv_len -= offset;
>  	if (diff)
> -		bv[bio->bi_vcnt - 1].bv_len -= diff;
> +		bio->bi_io_vec[bio->bi_vcnt - 1].bv_len -= diff;
>  
>  	iov_iter_advance(iter, size);
>  	return 0;

Right, that is the issue. We need this fix for -stable, but maybe the
following fix is more readable:

diff --git a/block/bio.c b/block/bio.c
index f3536bfc8298..6e37b803755b 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -914,16 +914,16 @@ EXPORT_SYMBOL(bio_add_page);
  */
 int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 {
-	unsigned short nr_pages = bio->bi_max_vecs - bio->bi_vcnt;
+	unsigned short idx, nr_pages = bio->bi_max_vecs - bio->bi_vcnt;
 	struct bio_vec *bv = bio->bi_io_vec + bio->bi_vcnt;
 	struct page **pages = (struct page **)bv;
-	size_t offset, diff;
+	size_t offset;
 	ssize_t size;
 
 	size = iov_iter_get_pages(iter, pages, LONG_MAX, nr_pages, &offset);
 	if (unlikely(size <= 0))
 		return size ? size : -EFAULT;
-	nr_pages = (size + offset + PAGE_SIZE - 1) / PAGE_SIZE;
+	idx = nr_pages = (size + offset + PAGE_SIZE - 1) / PAGE_SIZE;
 
 	/*
 	 * Deep magic below:  We need to walk the pinned pages backwards
@@ -936,17 +936,15 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	bio->bi_iter.bi_size += size;
 	bio->bi_vcnt += nr_pages;
 
-	diff = (nr_pages * PAGE_SIZE - offset) - size;
-	while (nr_pages--) {
-		bv[nr_pages].bv_page = pages[nr_pages];
-		bv[nr_pages].bv_len = PAGE_SIZE;
-		bv[nr_pages].bv_offset = 0;
+	while (idx--) {
+		bv[idx].bv_page = pages[idx];
+		bv[idx].bv_len = PAGE_SIZE;
+		bv[idx].bv_offset = 0;
 	}
 
 	bv[0].bv_offset += offset;
 	bv[0].bv_len -= offset;
-	if (diff)
-		bv[bio->bi_vcnt - 1].bv_len -= diff;
+	bv[nr_pages - 1].bv_len -= (nr_pages * PAGE_SIZE - offset) - size;
 
 	iov_iter_advance(iter, size);
 	return 0;

And for mainline, I suggest getting Christoph's new code in, since its
correctness is easy to prove and it seems simpler.

Thanks,
Ming


* Re: Silent data corruption in blkdev_direct_IO()
  2018-07-18  2:48           ` Ming Lei
@ 2018-07-18  7:32             ` Martin Wilck
  2018-07-18  7:54               ` Ming Lei
  0 siblings, 1 reply; 54+ messages in thread
From: Martin Wilck @ 2018-07-18  7:32 UTC (permalink / raw)
  To: Ming Lei
  Cc: Ming Lei, Jens Axboe, Hannes Reinecke, Christoph Hellwig,
	linux-block, jack, kent.overstreet

On Wed, 2018-07-18 at 10:48 +0800, Ming Lei wrote:
> On Wed, Jul 18, 2018 at 02:07:28AM +0200, Martin Wilck wrote:
> > 
> > From b75adc856119346e02126cf8975755300f2d9b7f Mon Sep 17 00:00:00
> > 2001
> > From: Martin Wilck <mwilck@suse.com>
> > Date: Wed, 18 Jul 2018 01:56:37 +0200
> > Subject: [PATCH] block: bio_iov_iter_get_pages: fix size of last
> > iovec
> > 
> > If the last page of the bio is not "full", the length of the last
> > vector bin needs to be corrected. This bin has the index
> > (bio->bi_vcnt - 1), but in bio->bi_io_vec, not in the "bv" helper
> > array which
> > is shifted by the value of bio->bi_vcnt at function invocation.
> > 
> > Signed-off-by: Martin Wilck <mwilck@suse.com>
> > ---
> >  block/bio.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/block/bio.c b/block/bio.c
> > index 53e0f0a..c22e76f 100644
> > --- a/block/bio.c
> > +++ b/block/bio.c
> > @@ -913,7 +913,7 @@ int bio_iov_iter_get_pages(struct bio *bio,
> > struct iov_iter *iter)
> >  	bv[0].bv_offset += offset;
> >  	bv[0].bv_len -= offset;
> >  	if (diff)
> > -		bv[bio->bi_vcnt - 1].bv_len -= diff;
> > +		bio->bi_io_vec[bio->bi_vcnt - 1].bv_len -= diff;
> >  
> >  	iov_iter_advance(iter, size);
> >  	return 0;
> 
> Right, that is the issue, we need this fix for -stable, but maybe the
> following fix is more readable:
> 
> diff --git a/block/bio.c b/block/bio.c
> index f3536bfc8298..6e37b803755b 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -914,16 +914,16 @@ EXPORT_SYMBOL(bio_add_page);
>   */
>  int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
>  {
> -	unsigned short nr_pages = bio->bi_max_vecs - bio->bi_vcnt;
> +	unsigned short idx, nr_pages = bio->bi_max_vecs - bio-
> >bi_vcnt;
>  	struct bio_vec *bv = bio->bi_io_vec + bio->bi_vcnt;
>  	struct page **pages = (struct page **)bv;
> -	size_t offset, diff;
> +	size_t offset;
>  	ssize_t size;
>  
>  	size = iov_iter_get_pages(iter, pages, LONG_MAX, nr_pages,
> &offset);
>  	if (unlikely(size <= 0))
>  		return size ? size : -EFAULT;
> -	nr_pages = (size + offset + PAGE_SIZE - 1) / PAGE_SIZE;
> +	idx = nr_pages = (size + offset + PAGE_SIZE - 1) /
> PAGE_SIZE;
>  
>  	/*
>  	 * Deep magic below:  We need to walk the pinned pages
> backwards
> @@ -936,17 +936,15 @@ int bio_iov_iter_get_pages(struct bio *bio,
> struct iov_iter *iter)
>  	bio->bi_iter.bi_size += size;
>  	bio->bi_vcnt += nr_pages;
>  
> -	diff = (nr_pages * PAGE_SIZE - offset) - size;
> -	while (nr_pages--) {
> -		bv[nr_pages].bv_page = pages[nr_pages];
> -		bv[nr_pages].bv_len = PAGE_SIZE;
> -		bv[nr_pages].bv_offset = 0;
> +	while (idx--) {
> +		bv[idx].bv_page = pages[idx];
> +		bv[idx].bv_len = PAGE_SIZE;
> +		bv[idx].bv_offset = 0;
>  	}
>  
>  	bv[0].bv_offset += offset;
>  	bv[0].bv_len -= offset;
> -	if (diff)
> -		bv[bio->bi_vcnt - 1].bv_len -= diff;
> +	bv[nr_pages - 1].bv_len -= (nr_pages * PAGE_SIZE - offset) -
> size;
>  
>  	iov_iter_advance(iter, size);
>  	return 0;
> 
> And for mainline, I suggest merging Christoph's new code, since it is
> easy to prove correct and seems simpler.

Fine with me. Will you take care of a submission, or should I?
Btw, this is not the full fix for our data corruption issue yet.
Another patch is needed which still needs testing.

Martin

-- 
Dr. Martin Wilck <mwilck@suse.com>, Tel. +49 (0)911 74053 2107
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: Silent data corruption in blkdev_direct_IO()
  2018-07-18  7:32             ` Martin Wilck
@ 2018-07-18  7:54               ` Ming Lei
  2018-07-18  9:20                 ` Johannes Thumshirn
  2018-07-19  9:39                 ` [PATCH 0/2] Fix silent " Martin Wilck
  0 siblings, 2 replies; 54+ messages in thread
From: Ming Lei @ 2018-07-18  7:54 UTC (permalink / raw)
  To: Martin Wilck
  Cc: Ming Lei, Jens Axboe, Hannes Reinecke, Christoph Hellwig,
	linux-block, jack, kent.overstreet

On Wed, Jul 18, 2018 at 09:32:12AM +0200, Martin Wilck wrote:
> On Wed, 2018-07-18 at 10:48 +0800, Ming Lei wrote:
> > On Wed, Jul 18, 2018 at 02:07:28AM +0200, Martin Wilck wrote:
> > > 
> > > From b75adc856119346e02126cf8975755300f2d9b7f Mon Sep 17 00:00:00
> > > 2001
> > > From: Martin Wilck <mwilck@suse.com>
> > > Date: Wed, 18 Jul 2018 01:56:37 +0200
> > > Subject: [PATCH] block: bio_iov_iter_get_pages: fix size of last
> > > iovec
> > > 
> > > If the last page of the bio is not "full", the length of the last
> > > vector bin needs to be corrected. This bin has the index
> > > (bio->bi_vcnt - 1), but in bio->bi_io_vec, not in the "bv" helper
> > > array which
> > > is shifted by the value of bio->bi_vcnt at function invocation.
> > > 
> > > Signed-off-by: Martin Wilck <mwilck@suse.com>
> > > ---
> > >  block/bio.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > 
> > > diff --git a/block/bio.c b/block/bio.c
> > > index 53e0f0a..c22e76f 100644
> > > --- a/block/bio.c
> > > +++ b/block/bio.c
> > > @@ -913,7 +913,7 @@ int bio_iov_iter_get_pages(struct bio *bio,
> > > struct iov_iter *iter)
> > >  	bv[0].bv_offset += offset;
> > >  	bv[0].bv_len -= offset;
> > >  	if (diff)
> > > -		bv[bio->bi_vcnt - 1].bv_len -= diff;
> > > +		bio->bi_io_vec[bio->bi_vcnt - 1].bv_len -= diff;
> > >  
> > >  	iov_iter_advance(iter, size);
> > >  	return 0;
> > 
> > Right, that is the issue. We need this fix for -stable, but maybe the
> > following fix is more readable:
> > 
> > diff --git a/block/bio.c b/block/bio.c
> > index f3536bfc8298..6e37b803755b 100644
> > --- a/block/bio.c
> > +++ b/block/bio.c
> > @@ -914,16 +914,16 @@ EXPORT_SYMBOL(bio_add_page);
> >   */
> >  int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
> >  {
> > -	unsigned short nr_pages = bio->bi_max_vecs - bio->bi_vcnt;
> > +	unsigned short idx, nr_pages = bio->bi_max_vecs - bio-
> > >bi_vcnt;
> >  	struct bio_vec *bv = bio->bi_io_vec + bio->bi_vcnt;
> >  	struct page **pages = (struct page **)bv;
> > -	size_t offset, diff;
> > +	size_t offset;
> >  	ssize_t size;
> >  
> >  	size = iov_iter_get_pages(iter, pages, LONG_MAX, nr_pages,
> > &offset);
> >  	if (unlikely(size <= 0))
> >  		return size ? size : -EFAULT;
> > -	nr_pages = (size + offset + PAGE_SIZE - 1) / PAGE_SIZE;
> > +	idx = nr_pages = (size + offset + PAGE_SIZE - 1) /
> > PAGE_SIZE;
> >  
> >  	/*
> >  	 * Deep magic below:  We need to walk the pinned pages
> > backwards
> > @@ -936,17 +936,15 @@ int bio_iov_iter_get_pages(struct bio *bio,
> > struct iov_iter *iter)
> >  	bio->bi_iter.bi_size += size;
> >  	bio->bi_vcnt += nr_pages;
> >  
> > -	diff = (nr_pages * PAGE_SIZE - offset) - size;
> > -	while (nr_pages--) {
> > -		bv[nr_pages].bv_page = pages[nr_pages];
> > -		bv[nr_pages].bv_len = PAGE_SIZE;
> > -		bv[nr_pages].bv_offset = 0;
> > +	while (idx--) {
> > +		bv[idx].bv_page = pages[idx];
> > +		bv[idx].bv_len = PAGE_SIZE;
> > +		bv[idx].bv_offset = 0;
> >  	}
> >  
> >  	bv[0].bv_offset += offset;
> >  	bv[0].bv_len -= offset;
> > -	if (diff)
> > -		bv[bio->bi_vcnt - 1].bv_len -= diff;
> > +	bv[nr_pages - 1].bv_len -= (nr_pages * PAGE_SIZE - offset) -
> > size;
> >  
> >  	iov_iter_advance(iter, size);
> >  	return 0;
> > 
> > And for mainline, I suggest merging Christoph's new code, since it is
> > easy to prove correct and seems simpler.
> 
> Fine with me. Will you take care of a submission, or should I?
> Btw, this is not the full fix for our data corruption issue yet.
> Another patch is needed which still needs testing.

Please go ahead and take care of it since you have the test cases.

thanks
Ming

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: Silent data corruption in blkdev_direct_IO()
  2018-07-18  7:54               ` Ming Lei
@ 2018-07-18  9:20                 ` Johannes Thumshirn
  2018-07-18 11:40                   ` Jan Kara
  2018-07-19  9:39                 ` [PATCH 0/2] Fix silent " Martin Wilck
  1 sibling, 1 reply; 54+ messages in thread
From: Johannes Thumshirn @ 2018-07-18  9:20 UTC (permalink / raw)
  To: Ming Lei
  Cc: Martin Wilck, Ming Lei, Jens Axboe, Hannes Reinecke,
	Christoph Hellwig, linux-block, jack, kent.overstreet

On Wed, Jul 18, 2018 at 03:54:46PM +0800, Ming Lei wrote:
> Please go ahead and take care of it since you have the test cases.

Speaking of which, do we already know how it is triggered and can we
cook up a blktests testcase for it? This would be more than helpful
for all parties.

Thanks,
	Johannes
-- 
Johannes Thumshirn                                          Storage
jthumshirn@suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: Silent data corruption in blkdev_direct_IO()
  2018-07-18  9:20                 ` Johannes Thumshirn
@ 2018-07-18 11:40                   ` Jan Kara
  2018-07-18 11:57                     ` Jan Kara
  0 siblings, 1 reply; 54+ messages in thread
From: Jan Kara @ 2018-07-18 11:40 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: Ming Lei, Martin Wilck, Ming Lei, Jens Axboe, Hannes Reinecke,
	Christoph Hellwig, linux-block, jack, kent.overstreet

On Wed 18-07-18 11:20:15, Johannes Thumshirn wrote:
> On Wed, Jul 18, 2018 at 03:54:46PM +0800, Ming Lei wrote:
> > Please go ahead and take care of it since you have the test cases.
> 
> Speaking of which, do we already know how it is triggered and can we
> cook up a blktests testcase for it? This would be more than helpful
> for all parties.

Using multiple iovecs with writev / readv trivially triggers the case of IO
that is done partly as direct and partly as buffered. Neither Martin nor I
was able to trigger the data corruption the customer is seeing with KVM,
though (since the generic code tries to maintain data integrity even if the
IO is mixed). It should be possible to trigger the corruption by having two
processes write to the same PAGE_SIZE region of a block device, just at
different offsets. If the first process happens to use direct IO while
the second ends up doing a read-modify-write cycle through the page cache,
the first write could end up being lost. I'll check whether something like
this can reproduce the corruption...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: Silent data corruption in blkdev_direct_IO()
  2018-07-18 11:40                   ` Jan Kara
@ 2018-07-18 11:57                     ` Jan Kara
  0 siblings, 0 replies; 54+ messages in thread
From: Jan Kara @ 2018-07-18 11:57 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: Ming Lei, Martin Wilck, Ming Lei, Jens Axboe, Hannes Reinecke,
	Christoph Hellwig, linux-block, jack, kent.overstreet

[-- Attachment #1: Type: text/plain, Size: 1435 bytes --]

On Wed 18-07-18 13:40:07, Jan Kara wrote:
> On Wed 18-07-18 11:20:15, Johannes Thumshirn wrote:
> > On Wed, Jul 18, 2018 at 03:54:46PM +0800, Ming Lei wrote:
> > > Please go ahead and take care of it since you have the test cases.
> > 
> > Speaking of which, do we already know how it is triggered and can we
> > cook up a blktests testcase for it? This would be more than helpful
> > for all parties.
> 
> Using multiple iovecs with writev / readv trivially triggers the case of IO
> that is done partly as direct and partly as buffered. Neither Martin nor I
> was able to trigger the data corruption the customer is seeing with KVM,
> though (since the generic code tries to maintain data integrity even if the
> IO is mixed). It should be possible to trigger the corruption by having two
> processes write to the same PAGE_SIZE region of a block device, just at
> different offsets. If the first process happens to use direct IO while
> the second ends up doing a read-modify-write cycle through the page cache,
> the first write could end up being lost. I'll check whether something like
> this can reproduce the corruption...

OK, when I run the attached test program like:

blkdev-dio-test /dev/loop0 0 &
blkdev-dio-test /dev/loop0 2048 &

One of them reports a lost write almost immediately. On a kernel with my fix,
the test program runs for quite a while without problems.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

[-- Attachment #2: blkdev-dio-test.c --]
[-- Type: text/x-c, Size: 1336 bytes --]

#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>
#include <string.h>
#include <stdlib.h>
#include <sys/uio.h>

#define PAGE_SIZE 4096
#define SECT_SIZE 512
#define BUF_OFF (2*SECT_SIZE)
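
/*
 * Repeatedly writes two 512-byte sectors at the given device offset with
 * a 2-iovec O_DIRECT pwritev() (both iovecs sit in the same user page,
 * 1024 bytes apart), then reads the page back and checks that both
 * sectors carry the current sequence number.  Run two instances against
 * the same device at offsets 0 and 2048 to race direct IO against a
 * buffered read-modify-write of the same device page.
 */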

int main(int argc, char **argv)
{
	int fd = open(argv[1], O_RDWR | O_DIRECT);
	int ret;
	char *buf;
	loff_t off;
	struct iovec iov[2];
	unsigned int seq;

	if (fd < 0) {
		perror("open");
		return 1;
	}

	off = strtol(argv[2], NULL, 10);

	buf = aligned_alloc(PAGE_SIZE, PAGE_SIZE);

	iov[0].iov_base = buf;
	iov[0].iov_len = SECT_SIZE;
	iov[1].iov_base = buf + BUF_OFF;
	iov[1].iov_len = SECT_SIZE;

	seq = 0;
	memset(buf, 0, PAGE_SIZE);
	while (1) {
		*(unsigned int *)buf = seq;
		*(unsigned int *)(buf + BUF_OFF) = seq;
		ret = pwritev(fd, iov, 2, off);
		if (ret < 0) {
			perror("pwritev");
			return 1;
		}
		if (ret != 2*SECT_SIZE) {
			fprintf(stderr, "Short pwritev: %d\n", ret);
			return 1;
		}
		ret = pread(fd, buf, PAGE_SIZE, off);
		if (ret < 0) {
			perror("pread");
			return 1;
		}
		if (ret != PAGE_SIZE) {
			fprintf(stderr, "Short read: %d\n", ret);
			return 1;
		}
		if (*(unsigned int *)buf != seq ||
		    *(unsigned int *)(buf + SECT_SIZE) != seq) {
			printf("Lost write %u: %u %u\n", seq, *(unsigned int *)buf, *(unsigned int *)(buf + SECT_SIZE));
			return 1;
		}
		seq++;
	}

	return 0;
}

^ permalink raw reply	[flat|nested] 54+ messages in thread

* [PATCH 0/2] Fix silent data corruption in blkdev_direct_IO()
  2018-07-18  7:54               ` Ming Lei
  2018-07-18  9:20                 ` Johannes Thumshirn
@ 2018-07-19  9:39                 ` Martin Wilck
  2018-07-19  9:39                   ` [PATCH 1/2] block: bio_iov_iter_get_pages: fix size of last iovec Martin Wilck
                                     ` (3 more replies)
  1 sibling, 4 replies; 54+ messages in thread
From: Martin Wilck @ 2018-07-19  9:39 UTC (permalink / raw)
  To: Jens Axboe, Ming Lei, Jan Kara
  Cc: Hannes Reinecke, Johannes Thumshirn, Kent Overstreet,
	Christoph Hellwig, linux-block, Martin Wilck

Hello Jens, Ming, Jan, and all others,

the following patches have been verified by a customer to fix a silent data
corruption which he has been seeing since "72ecad2 block: support a full bio
worth of IO for simplified bdev direct-io".

The patches are based on our observation that the corruption is only
observed if the __blkdev_direct_IO_simple() code path is executed,
and if that happens, "short writes" are observed in this code path,
which causes a fallback to buffered IO, while the application continues
submitting direct IO requests.

For the first patch, an alternative solution by Christoph Hellwig
exists:

   https://marc.info/?l=linux-kernel&m=153013977816825&w=2

While I believe that Christoph's patch is correct, the one presented
here is smaller. Ming has suggested to use Christoph's for mainline and
mine for -stable.

Wrt the second patch, we've had an internal discussion at SUSE about how to
handle (unlikely) error conditions from bio_iov_iter_get_pages(). The patch
presented here tries to submit as much IO as possible via the direct path even
in the error case, while Jan Kara suggested aborting, submitting no IO, and
falling back to buffered IO in that case.

Looking forward to your opinions and suggestions.

Regards
Martin

Martin Wilck (2):
  block: bio_iov_iter_get_pages: fix size of last iovec
  blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio

 block/bio.c    | 18 ++++++++----------
 fs/block_dev.c |  8 +++++++-
 2 files changed, 15 insertions(+), 11 deletions(-)

-- 
2.17.1

^ permalink raw reply	[flat|nested] 54+ messages in thread

* [PATCH 1/2] block: bio_iov_iter_get_pages: fix size of last iovec
  2018-07-19  9:39                 ` [PATCH 0/2] Fix silent " Martin Wilck
@ 2018-07-19  9:39                   ` Martin Wilck
  2018-07-19 10:05                     ` Hannes Reinecke
                                       ` (3 more replies)
  2018-07-19  9:39                   ` [PATCH 2/2] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio Martin Wilck
                                     ` (2 subsequent siblings)
  3 siblings, 4 replies; 54+ messages in thread
From: Martin Wilck @ 2018-07-19  9:39 UTC (permalink / raw)
  To: Jens Axboe, Ming Lei, Jan Kara
  Cc: Hannes Reinecke, Johannes Thumshirn, Kent Overstreet,
	Christoph Hellwig, linux-block, Martin Wilck

If the last page of the bio is not "full", the length of the last
vector slot needs to be corrected. This slot has the index
(bio->bi_vcnt - 1), but only in bio->bi_io_vec. In the "bv" helper
array, which is shifted by the value of bio->bi_vcnt at function
invocation, the correct index is (nr_pages - 1).
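
To make the index arithmetic concrete, here is a user-space toy with made-up
numbers (a bio that already holds two vecs and then gains three pages, which
is what happens once the function is called more than once on the same bio):

#include <assert.h>

int main(void)
{
	unsigned short bi_vcnt = 2;	/* vecs already in the bio */
	unsigned short nr_pages = 3;	/* pages pinned by this call */
	unsigned short bv_base = bi_vcnt; /* bv == bio->bi_io_vec + bv_base */

	bi_vcnt += nr_pages;		/* bio->bi_vcnt += nr_pages */

	/* last newly added page, as an index into bio->bi_io_vec: */
	assert(bv_base + (nr_pages - 1) == 4);
	/* what the old bv[bio->bi_vcnt - 1] resolves to instead: */
	assert(bv_base + (bi_vcnt - 1) == 6);
	return 0;
}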

V2: improved readability following suggestions from Ming Lei.

Fixes: 2cefe4dbaadf ("block: add bio_iov_iter_get_pages()")
Signed-off-by: Martin Wilck <mwilck@suse.com>
---
 block/bio.c | 18 ++++++++----------
 1 file changed, 8 insertions(+), 10 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 67eff5e..0964328 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -912,16 +912,16 @@ EXPORT_SYMBOL(bio_add_page);
  */
 int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 {
-	unsigned short nr_pages = bio->bi_max_vecs - bio->bi_vcnt;
+	unsigned short idx, nr_pages = bio->bi_max_vecs - bio->bi_vcnt;
 	struct bio_vec *bv = bio->bi_io_vec + bio->bi_vcnt;
 	struct page **pages = (struct page **)bv;
-	size_t offset, diff;
+	size_t offset;
 	ssize_t size;
 
 	size = iov_iter_get_pages(iter, pages, LONG_MAX, nr_pages, &offset);
 	if (unlikely(size <= 0))
 		return size ? size : -EFAULT;
-	nr_pages = (size + offset + PAGE_SIZE - 1) / PAGE_SIZE;
+	idx = nr_pages = (size + offset + PAGE_SIZE - 1) / PAGE_SIZE;
 
 	/*
 	 * Deep magic below:  We need to walk the pinned pages backwards
@@ -934,17 +934,15 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	bio->bi_iter.bi_size += size;
 	bio->bi_vcnt += nr_pages;
 
-	diff = (nr_pages * PAGE_SIZE - offset) - size;
-	while (nr_pages--) {
-		bv[nr_pages].bv_page = pages[nr_pages];
-		bv[nr_pages].bv_len = PAGE_SIZE;
-		bv[nr_pages].bv_offset = 0;
+	while (idx--) {
+		bv[idx].bv_page = pages[idx];
+		bv[idx].bv_len = PAGE_SIZE;
+		bv[idx].bv_offset = 0;
 	}
 
 	bv[0].bv_offset += offset;
 	bv[0].bv_len -= offset;
-	if (diff)
-		bv[bio->bi_vcnt - 1].bv_len -= diff;
+	bv[nr_pages - 1].bv_len -= nr_pages * PAGE_SIZE - offset - size;
 
 	iov_iter_advance(iter, size);
 	return 0;
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH 2/2] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio
  2018-07-19  9:39                 ` [PATCH 0/2] Fix silent " Martin Wilck
  2018-07-19  9:39                   ` [PATCH 1/2] block: bio_iov_iter_get_pages: fix size of last iovec Martin Wilck
@ 2018-07-19  9:39                   ` Martin Wilck
  2018-07-19 10:06                     ` Hannes Reinecke
                                       ` (3 more replies)
  2018-07-19 10:08                   ` [PATCH 0/2] Fix silent data corruption in blkdev_direct_IO() Hannes Reinecke
  2018-07-19 14:50                   ` Christoph Hellwig
  3 siblings, 4 replies; 54+ messages in thread
From: Martin Wilck @ 2018-07-19  9:39 UTC (permalink / raw)
  To: Jens Axboe, Ming Lei, Jan Kara
  Cc: Hannes Reinecke, Johannes Thumshirn, Kent Overstreet,
	Christoph Hellwig, linux-block, Martin Wilck

bio_iov_iter_get_pages() returns only pages for a single non-empty
segment of the input iov_iter's iovec. This may be much less than the number
of pages __blkdev_direct_IO_simple() is supposed to process. Call
bio_iov_iter_get_pages() repeatedly until either the requested number
of bytes is reached, or bio.bi_io_vec is exhausted. If this is not done,
short writes or reads may occur for direct synchronous IOs with multiple
iovec slots (such as generated by writev()). In that case,
__generic_file_write_iter() falls back to buffered writes, which
has been observed to cause data corruption in certain workloads.

Note: if segments aren't page-aligned in the input iovec, this patch may
result in multiple adjacent slots of the bi_io_vec array to reference the same
page (the byte ranges are guaranteed to be disjunct if the preceding patch is
applied). We haven't seen problems with that in our and the customer's
tests. It'd be possible to detect this situation and merge bi_io_vec slots
that refer to the same page, but I prefer to keep it simple for now.
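
As a user-space illustration (hypothetical buffer, mirroring the reproducer's
iovec of two 512-byte segments at offsets 0 and 1024 of the same page): a
single pass covers only the first segment, and after looping both slots end
up referring to the same page with disjoint byte ranges:

#include <stdio.h>
#include <sys/uio.h>

int main(void)
{
	char page[4096];
	struct iovec iov[2] = {
		{ .iov_base = page,        .iov_len = 512 },
		{ .iov_base = page + 1024, .iov_len = 512 },
	};
	size_t total = iov[0].iov_len + iov[1].iov_len;
	int i;

	/* first bio_iov_iter_get_pages() pass: first non-empty segment only */
	printf("one pass covers %zu of %zu bytes\n", iov[0].iov_len, total);
	/* after looping: one bvec slot per segment, same page for both */
	for (i = 0; i < 2; i++)
		printf("slot %d: offset %4td, len %zu\n", i,
		       (char *)iov[i].iov_base - page, iov[i].iov_len);
	return 0;
}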

Fixes: 72ecad22d9f1 ("block: support a full bio worth of IO for simplified bdev direct-io")
Signed-off-by: Martin Wilck <mwilck@suse.com>
---
 fs/block_dev.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 0dd87aa..41643c4 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -221,7 +221,12 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter,
 
 	ret = bio_iov_iter_get_pages(&bio, iter);
 	if (unlikely(ret))
-		return ret;
+		goto out;
+
+	while (ret == 0 &&
+	       bio.bi_vcnt < bio.bi_max_vecs && iov_iter_count(iter) > 0)
+		ret = bio_iov_iter_get_pages(&bio, iter);
+
 	ret = bio.bi_iter.bi_size;
 
 	if (iov_iter_rw(iter) == READ) {
@@ -250,6 +255,7 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter,
 		put_page(bvec->bv_page);
 	}
 
+out:
 	if (vecs != inline_vecs)
 		kfree(vecs);
 
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* Re: [PATCH 1/2] block: bio_iov_iter_get_pages: fix size of last iovec
  2018-07-19  9:39                   ` [PATCH 1/2] block: bio_iov_iter_get_pages: fix size of last iovec Martin Wilck
@ 2018-07-19 10:05                     ` Hannes Reinecke
  2018-07-19 10:09                     ` Ming Lei
                                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 54+ messages in thread
From: Hannes Reinecke @ 2018-07-19 10:05 UTC (permalink / raw)
  To: Martin Wilck, Jens Axboe, Ming Lei, Jan Kara
  Cc: Johannes Thumshirn, Kent Overstreet, Christoph Hellwig, linux-block

On 07/19/2018 11:39 AM, Martin Wilck wrote:
> If the last page of the bio is not "full", the length of the last
> vector slot needs to be corrected. This slot has the index
> (bio->bi_vcnt - 1), but only in bio->bi_io_vec. In the "bv" helper
> array, which is shifted by the value of bio->bi_vcnt at function
> invocation, the correct index is (nr_pages - 1).
> 
> V2: improved readability following suggestions from Ming Lei.
> 
> Fixes: 2cefe4dbaadf ("block: add bio_iov_iter_get_pages()")
> Signed-off-by: Martin Wilck <mwilck@suse.com>
> ---
>  block/bio.c | 18 ++++++++----------
>  1 file changed, 8 insertions(+), 10 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.com>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 2/2] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio
  2018-07-19  9:39                   ` [PATCH 2/2] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio Martin Wilck
@ 2018-07-19 10:06                     ` Hannes Reinecke
  2018-07-19 10:21                     ` Ming Lei
                                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 54+ messages in thread
From: Hannes Reinecke @ 2018-07-19 10:06 UTC (permalink / raw)
  To: Martin Wilck, Jens Axboe, Ming Lei, Jan Kara
  Cc: Johannes Thumshirn, Kent Overstreet, Christoph Hellwig, linux-block

On 07/19/2018 11:39 AM, Martin Wilck wrote:
> bio_iov_iter_get_pages() returns only pages for a single non-empty
> segment of the input iov_iter's iovec. This may be much less than the number
> of pages __blkdev_direct_IO_simple() is supposed to process. Call
> bio_iov_iter_get_pages() repeatedly until either the requested number
> of bytes is reached, or bio.bi_io_vec is exhausted. If this is not done,
> short writes or reads may occur for direct synchronous IOs with multiple
> iovec slots (such as generated by writev()). In that case,
> __generic_file_write_iter() falls back to buffered writes, which
> has been observed to cause data corruption in certain workloads.
> 
> Note: if segments aren't page-aligned in the input iovec, this patch may
> result in multiple adjacent slots of the bi_io_vec array to reference the same
> page (the byte ranges are guaranteed to be disjunct if the preceding patch is
> applied). We haven't seen problems with that in our and the customer's
> tests. It'd be possible to detect this situation and merge bi_io_vec slots
> that refer to the same page, but I prefer to keep it simple for now.
> 
> Fixes: 72ecad22d9f1 ("block: support a full bio worth of IO for simplified bdev direct-io")
> Signed-off-by: Martin Wilck <mwilck@suse.com>
> ---
>  fs/block_dev.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.com>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 0/2] Fix silent data corruption in blkdev_direct_IO()
  2018-07-19  9:39                 ` [PATCH 0/2] Fix silent " Martin Wilck
  2018-07-19  9:39                   ` [PATCH 1/2] block: bio_iov_iter_get_pages: fix size of last iovec Martin Wilck
  2018-07-19  9:39                   ` [PATCH 2/2] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio Martin Wilck
@ 2018-07-19 10:08                   ` Hannes Reinecke
  2018-07-19 14:50                   ` Christoph Hellwig
  3 siblings, 0 replies; 54+ messages in thread
From: Hannes Reinecke @ 2018-07-19 10:08 UTC (permalink / raw)
  To: Martin Wilck, Jens Axboe, Ming Lei, Jan Kara
  Cc: Johannes Thumshirn, Kent Overstreet, Christoph Hellwig, linux-block

On 07/19/2018 11:39 AM, Martin Wilck wrote:
> Hello Jens, Ming, Jan, and all others,
> 
> the following patches have been verified by a customer to fix a silent data
> corruption which he has been seeing since "72ecad2 block: support a full bio
> worth of IO for simplified bdev direct-io".
> 
> The patches are based on our observation that the corruption is only
> observed if the __blkdev_direct_IO_simple() code path is executed,
> and if that happens, "short writes" are observed in this code path,
> which causes a fallback to buffered IO, while the application continues
> submitting direct IO requests.
> 
> For the first patch, an alternative solution by Christoph Hellwig
> exists:
> 
>    https://marc.info/?l=linux-kernel&m=153013977816825&w=2
> 
> While I believe that Christoph's patch is correct, the one presented
> here is smaller. Ming has suggested to use Christoph's for mainline and
> mine for -stable.
> 
> Wrt the second patch, we've had an internal discussion at SUSE about how to
> handle (unlikely) error conditions from bio_iov_iter_get_pages(). The patch
> presented here tries to submit as much IO as possible via the direct path even
> in the error case, while Jan Kara suggested aborting, submitting no IO, and
> falling back to buffered IO in that case.
> 
> Looking forward to your opinions and suggestions.
> 
Can you please add the test program from Jan Kara to the blktests suite?
This issue is something we really should cover there too, seeing the
amount of time we've spent debugging it...

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 1/2] block: bio_iov_iter_get_pages: fix size of last iovec
  2018-07-19  9:39                   ` [PATCH 1/2] block: bio_iov_iter_get_pages: fix size of last iovec Martin Wilck
  2018-07-19 10:05                     ` Hannes Reinecke
@ 2018-07-19 10:09                     ` Ming Lei
  2018-07-19 10:20                     ` Jan Kara
  2018-07-19 14:52                     ` Christoph Hellwig
  3 siblings, 0 replies; 54+ messages in thread
From: Ming Lei @ 2018-07-19 10:09 UTC (permalink / raw)
  To: Martin Wilck
  Cc: Jens Axboe, Ming Lei, Jan Kara, Hannes Reinecke,
	Johannes Thumshirn, Kent Overstreet, Christoph Hellwig,
	linux-block

On Thu, Jul 19, 2018 at 5:39 PM, Martin Wilck <mwilck@suse.com> wrote:
> If the last page of the bio is not "full", the length of the last
> vector slot needs to be corrected. This slot has the index
> (bio->bi_vcnt - 1), but only in bio->bi_io_vec. In the "bv" helper
> array, which is shifted by the value of bio->bi_vcnt at function
> invocation, the correct index is (nr_pages - 1).
>
> V2: improved readability following suggestions from Ming Lei.
>
> Fixes: 2cefe4dbaadf ("block: add bio_iov_iter_get_pages()")
> Signed-off-by: Martin Wilck <mwilck@suse.com>
> ---
>  block/bio.c | 18 ++++++++----------
>  1 file changed, 8 insertions(+), 10 deletions(-)
>
> diff --git a/block/bio.c b/block/bio.c
> index 67eff5e..0964328 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -912,16 +912,16 @@ EXPORT_SYMBOL(bio_add_page);
>   */
>  int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
>  {
> -       unsigned short nr_pages = bio->bi_max_vecs - bio->bi_vcnt;
> +       unsigned short idx, nr_pages = bio->bi_max_vecs - bio->bi_vcnt;
>         struct bio_vec *bv = bio->bi_io_vec + bio->bi_vcnt;
>         struct page **pages = (struct page **)bv;
> -       size_t offset, diff;
> +       size_t offset;
>         ssize_t size;
>
>         size = iov_iter_get_pages(iter, pages, LONG_MAX, nr_pages, &offset);
>         if (unlikely(size <= 0))
>                 return size ? size : -EFAULT;
> -       nr_pages = (size + offset + PAGE_SIZE - 1) / PAGE_SIZE;
> +       idx = nr_pages = (size + offset + PAGE_SIZE - 1) / PAGE_SIZE;
>
>         /*
>          * Deep magic below:  We need to walk the pinned pages backwards
> @@ -934,17 +934,15 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
>         bio->bi_iter.bi_size += size;
>         bio->bi_vcnt += nr_pages;
>
> -       diff = (nr_pages * PAGE_SIZE - offset) - size;
> -       while (nr_pages--) {
> -               bv[nr_pages].bv_page = pages[nr_pages];
> -               bv[nr_pages].bv_len = PAGE_SIZE;
> -               bv[nr_pages].bv_offset = 0;
> +       while (idx--) {
> +               bv[idx].bv_page = pages[idx];
> +               bv[idx].bv_len = PAGE_SIZE;
> +               bv[idx].bv_offset = 0;
>         }
>
>         bv[0].bv_offset += offset;
>         bv[0].bv_len -= offset;
> -       if (diff)
> -               bv[bio->bi_vcnt - 1].bv_len -= diff;
> +       bv[nr_pages - 1].bv_len -= nr_pages * PAGE_SIZE - offset - size;
>
>         iov_iter_advance(iter, size);
>         return 0;
> --
> 2.17.1
>

Reviewed-by: Ming Lei <ming.lei@redhat.com>

BTW, we can take this one for mainline too, then rebase Christoph's new
patch against it.

Thanks,
Ming Lei

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 1/2] block: bio_iov_iter_get_pages: fix size of last iovec
  2018-07-19  9:39                   ` [PATCH 1/2] block: bio_iov_iter_get_pages: fix size of last iovec Martin Wilck
  2018-07-19 10:05                     ` Hannes Reinecke
  2018-07-19 10:09                     ` Ming Lei
@ 2018-07-19 10:20                     ` Jan Kara
  2018-07-19 14:52                     ` Christoph Hellwig
  3 siblings, 0 replies; 54+ messages in thread
From: Jan Kara @ 2018-07-19 10:20 UTC (permalink / raw)
  To: Martin Wilck
  Cc: Jens Axboe, Ming Lei, Jan Kara, Hannes Reinecke,
	Johannes Thumshirn, Kent Overstreet, Christoph Hellwig,
	linux-block

On Thu 19-07-18 11:39:17, Martin Wilck wrote:
> If the last page of the bio is not "full", the length of the last
> vector slot needs to be corrected. This slot has the index
> (bio->bi_vcnt - 1), but only in bio->bi_io_vec. In the "bv" helper
> array, which is shifted by the value of bio->bi_vcnt at function
> invocation, the correct index is (nr_pages - 1).
> 
> V2: improved readability following suggestions from Ming Lei.
> 
> Fixes: 2cefe4dbaadf ("block: add bio_iov_iter_get_pages()")
> Signed-off-by: Martin Wilck <mwilck@suse.com>

Looks good to me. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

BTW, explicit CC: stable@vger.kernel.org would be good. But Jens can add it
I guess.

								Honza

> ---
>  block/bio.c | 18 ++++++++----------
>  1 file changed, 8 insertions(+), 10 deletions(-)
> 
> diff --git a/block/bio.c b/block/bio.c
> index 67eff5e..0964328 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -912,16 +912,16 @@ EXPORT_SYMBOL(bio_add_page);
>   */
>  int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
>  {
> -	unsigned short nr_pages = bio->bi_max_vecs - bio->bi_vcnt;
> +	unsigned short idx, nr_pages = bio->bi_max_vecs - bio->bi_vcnt;
>  	struct bio_vec *bv = bio->bi_io_vec + bio->bi_vcnt;
>  	struct page **pages = (struct page **)bv;
> -	size_t offset, diff;
> +	size_t offset;
>  	ssize_t size;
>  
>  	size = iov_iter_get_pages(iter, pages, LONG_MAX, nr_pages, &offset);
>  	if (unlikely(size <= 0))
>  		return size ? size : -EFAULT;
> -	nr_pages = (size + offset + PAGE_SIZE - 1) / PAGE_SIZE;
> +	idx = nr_pages = (size + offset + PAGE_SIZE - 1) / PAGE_SIZE;
>  
>  	/*
>  	 * Deep magic below:  We need to walk the pinned pages backwards
> @@ -934,17 +934,15 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
>  	bio->bi_iter.bi_size += size;
>  	bio->bi_vcnt += nr_pages;
>  
> -	diff = (nr_pages * PAGE_SIZE - offset) - size;
> -	while (nr_pages--) {
> -		bv[nr_pages].bv_page = pages[nr_pages];
> -		bv[nr_pages].bv_len = PAGE_SIZE;
> -		bv[nr_pages].bv_offset = 0;
> +	while (idx--) {
> +		bv[idx].bv_page = pages[idx];
> +		bv[idx].bv_len = PAGE_SIZE;
> +		bv[idx].bv_offset = 0;
>  	}
>  
>  	bv[0].bv_offset += offset;
>  	bv[0].bv_len -= offset;
> -	if (diff)
> -		bv[bio->bi_vcnt - 1].bv_len -= diff;
> +	bv[nr_pages - 1].bv_len -= nr_pages * PAGE_SIZE - offset - size;
>  
>  	iov_iter_advance(iter, size);
>  	return 0;
> -- 
> 2.17.1
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 2/2] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio
  2018-07-19  9:39                   ` [PATCH 2/2] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio Martin Wilck
  2018-07-19 10:06                     ` Hannes Reinecke
@ 2018-07-19 10:21                     ` Ming Lei
  2018-07-19 10:37                       ` Jan Kara
  2018-07-19 10:45                     ` Jan Kara
  2018-07-19 11:04                     ` Ming Lei
  3 siblings, 1 reply; 54+ messages in thread
From: Ming Lei @ 2018-07-19 10:21 UTC (permalink / raw)
  To: Martin Wilck
  Cc: Jens Axboe, Jan Kara, Hannes Reinecke, Johannes Thumshirn,
	Kent Overstreet, Christoph Hellwig, linux-block

On Thu, Jul 19, 2018 at 11:39:18AM +0200, Martin Wilck wrote:
> bio_iov_iter_get_pages() returns only pages for a single non-empty
> segment of the input iov_iter's iovec. This may be much less than the number
> of pages __blkdev_direct_IO_simple() is supposed to process. Call

In bio_iov_iter_get_pages(), iov_iter_get_pages() is supposed to retrieve
as many pages as possible, since both 'maxsize' and 'maxpages' are provided
to cover everything.

So the question is why iov_iter_get_pages() doesn't work as expected?

Thanks,
Ming

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 2/2] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio
  2018-07-19 10:21                     ` Ming Lei
@ 2018-07-19 10:37                       ` Jan Kara
  2018-07-19 10:46                         ` Ming Lei
  2018-07-19 11:08                         ` Al Viro
  0 siblings, 2 replies; 54+ messages in thread
From: Jan Kara @ 2018-07-19 10:37 UTC (permalink / raw)
  To: Ming Lei
  Cc: Martin Wilck, Jens Axboe, Jan Kara, Hannes Reinecke,
	Johannes Thumshirn, Kent Overstreet, Christoph Hellwig,
	linux-block, Al Viro

On Thu 19-07-18 18:21:23, Ming Lei wrote:
> On Thu, Jul 19, 2018 at 11:39:18AM +0200, Martin Wilck wrote:
> > bio_iov_iter_get_pages() returns only pages for a single non-empty
> > segment of the input iov_iter's iovec. This may be much less than the number
> > of pages __blkdev_direct_IO_simple() is supposed to process. Call
> 
> In bio_iov_iter_get_pages(), iov_iter_get_pages() is supposed to retrieve
> as many pages as possible, since both 'maxsize' and 'maxpages' are provided
> to cover everything.
> 
> So the question is why iov_iter_get_pages() doesn't work as expected?

Well, there has never been a promise that it will grab *all* pages in the
iter AFAIK. Practically, I think that it was just too hairy to implement in
the macro magic that iter processing is... Al might know more (added to
CC).

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 2/2] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio
  2018-07-19  9:39                   ` [PATCH 2/2] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio Martin Wilck
  2018-07-19 10:06                     ` Hannes Reinecke
  2018-07-19 10:21                     ` Ming Lei
@ 2018-07-19 10:45                     ` Jan Kara
  2018-07-19 12:23                       ` Martin Wilck
  2018-07-19 11:04                     ` Ming Lei
  3 siblings, 1 reply; 54+ messages in thread
From: Jan Kara @ 2018-07-19 10:45 UTC (permalink / raw)
  To: Martin Wilck
  Cc: Jens Axboe, Ming Lei, Jan Kara, Hannes Reinecke,
	Johannes Thumshirn, Kent Overstreet, Christoph Hellwig,
	linux-block

On Thu 19-07-18 11:39:18, Martin Wilck wrote:
> bio_iov_iter_get_pages() returns only pages for a single non-empty
> segment of the input iov_iter's iovec. This may be much less than the number
> of pages __blkdev_direct_IO_simple() is supposed to process. Call
> bio_iov_iter_get_pages() repeatedly until either the requested number
> of bytes is reached, or bio.bi_io_vec is exhausted. If this is not done,
> short writes or reads may occur for direct synchronous IOs with multiple
> iovec slots (such as generated by writev()). In that case,
> __generic_file_write_iter() falls back to buffered writes, which
> has been observed to cause data corruption in certain workloads.
> 
> Note: if segments aren't page-aligned in the input iovec, this patch may
> result in multiple adjacent slots of the bi_io_vec array to reference the same
> page (the byte ranges are guaranteed to be disjunct if the preceding patch is
> applied). We haven't seen problems with that in our and the customer's
> tests. It'd be possible to detect this situation and merge bi_io_vec slots
> that refer to the same page, but I prefer to keep it simple for now.
> 
> Fixes: 72ecad22d9f1 ("block: support a full bio worth of IO for simplified bdev direct-io")
> Signed-off-by: Martin Wilck <mwilck@suse.com>
> ---
>  fs/block_dev.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 0dd87aa..41643c4 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -221,7 +221,12 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter,
>  
>  	ret = bio_iov_iter_get_pages(&bio, iter);
>  	if (unlikely(ret))
> -		return ret;
> +		goto out;
> +
> +	while (ret == 0 &&
> +	       bio.bi_vcnt < bio.bi_max_vecs && iov_iter_count(iter) > 0)
> +		ret = bio_iov_iter_get_pages(&bio, iter);
> +

I have two suggestions here (posting them now in public):

Condition bio.bi_vcnt < bio.bi_max_vecs should always be true - we made
sure we have enough vecs for pages in iter. So I'd WARN if this isn't true.

Secondly, I don't think it is good to discard error from
bio_iov_iter_get_pages() here and just submit partial IO. It will again
lead to part of IO being done as direct and part attempted to be done as
buffered. Also the "slow" direct IO path in __blkdev_direct_IO() behaves
differently - it aborts and returns error if bio_iov_iter_get_pages() ever
returned error. IMO we should do the same here.
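
Something like the following untested sketch is what I have in mind
(releasing the pages already pinned on the error path is omitted here):

	ret = bio_iov_iter_get_pages(&bio, iter);
	if (unlikely(ret))
		goto out;

	while (iov_iter_count(iter) &&
	       !WARN_ON_ONCE(bio.bi_vcnt >= bio.bi_max_vecs)) {
		ret = bio_iov_iter_get_pages(&bio, iter);
		if (unlikely(ret)) {
			/* pages pinned so far would need to be put here */
			goto out;
		}
	}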

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 2/2] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio
  2018-07-19 10:37                       ` Jan Kara
@ 2018-07-19 10:46                         ` Ming Lei
  2018-07-19 11:08                         ` Al Viro
  1 sibling, 0 replies; 54+ messages in thread
From: Ming Lei @ 2018-07-19 10:46 UTC (permalink / raw)
  To: Jan Kara
  Cc: Martin Wilck, Jens Axboe, Jan Kara, Hannes Reinecke,
	Johannes Thumshirn, Kent Overstreet, Christoph Hellwig,
	linux-block, Al Viro

On Thu, Jul 19, 2018 at 12:37:13PM +0200, Jan Kara wrote:
> On Thu 19-07-18 18:21:23, Ming Lei wrote:
> > On Thu, Jul 19, 2018 at 11:39:18AM +0200, Martin Wilck wrote:
> > > bio_iov_iter_get_pages() returns only pages for a single non-empty
> > > segment of the input iov_iter's iovec. This may be much less than the number
> > > of pages __blkdev_direct_IO_simple() is supposed to process. Call
> > 
> > In bio_iov_iter_get_pages(), iov_iter_get_pages() is supposed to retrieve
> > as many pages as possible, since both 'maxsize' and 'maxpages' are provided
> > to cover everything.
> > 
> > So the question is why iov_iter_get_pages() doesn't work as expected?
> 
> Well, there has never been a promise that it will grab *all* pages in the
> iter AFAIK. Practically, I think that it was just too hairy to implement in
> the macro magic that iter processing is... Al might know more (added to
> CC).

OK, I got it: given that get_user_pages_fast() may fail, it is reasonable
to see fewer pages retrieved.

Thanks,
Ming

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 2/2] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio
  2018-07-19  9:39                   ` [PATCH 2/2] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio Martin Wilck
                                       ` (2 preceding siblings ...)
  2018-07-19 10:45                     ` Jan Kara
@ 2018-07-19 11:04                     ` Ming Lei
  2018-07-19 11:56                       ` Jan Kara
  3 siblings, 1 reply; 54+ messages in thread
From: Ming Lei @ 2018-07-19 11:04 UTC (permalink / raw)
  To: Martin Wilck
  Cc: Jens Axboe, Jan Kara, Hannes Reinecke, Johannes Thumshirn,
	Kent Overstreet, Christoph Hellwig, linux-block

On Thu, Jul 19, 2018 at 11:39:18AM +0200, Martin Wilck wrote:
> bio_iov_iter_get_pages() returns only pages for a single non-empty
> segment of the input iov_iter's iovec. This may be much less than the number
> of pages __blkdev_direct_IO_simple() is supposed to process. Call
> bio_iov_iter_get_pages() repeatedly until either the requested number
> of bytes is reached, or bio.bi_io_vec is exhausted. If this is not done,
> short writes or reads may occur for direct synchronous IOs with multiple
> iovec slots (such as generated by writev()). In that case,
> __generic_file_write_iter() falls back to buffered writes, which
> has been observed to cause data corruption in certain workloads.
> 
> Note: if segments aren't page-aligned in the input iovec, this patch may
> result in multiple adjacent slots of the bi_io_vec array to reference the same
> page (the byte ranges are guaranteed to be disjunct if the preceding patch is
> applied). We haven't seen problems with that in our and the customer's
> tests. It'd be possible to detect this situation and merge bi_io_vec slots
> that refer to the same page, but I prefer to keep it simple for now.
> 
> Fixes: 72ecad22d9f1 ("block: support a full bio worth of IO for simplified bdev direct-io")
> Signed-off-by: Martin Wilck <mwilck@suse.com>
> ---
>  fs/block_dev.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 0dd87aa..41643c4 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -221,7 +221,12 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter,
>  
>  	ret = bio_iov_iter_get_pages(&bio, iter);
>  	if (unlikely(ret))
> -		return ret;
> +		goto out;
> +
> +	while (ret == 0 &&
> +	       bio.bi_vcnt < bio.bi_max_vecs && iov_iter_count(iter) > 0)
> +		ret = bio_iov_iter_get_pages(&bio, iter);
> +
>  	ret = bio.bi_iter.bi_size;
>  
>  	if (iov_iter_rw(iter) == READ) {
> @@ -250,6 +255,7 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter,
>  		put_page(bvec->bv_page);
>  	}
>  
> +out:
>  	if (vecs != inline_vecs)
>  		kfree(vecs);
>

You might put the 'vecs' leak fix into another patch, and reuse the
current code block for that.

It looks like all users of bio_iov_iter_get_pages() need this kind of fix, so
what do you think about the following approach?

diff --git a/block/bio.c b/block/bio.c
index f3536bfc8298..23dd4c163dfc 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -904,15 +904,7 @@ int bio_add_page(struct bio *bio, struct page *page,
 }
 EXPORT_SYMBOL(bio_add_page);
 
-/**
- * bio_iov_iter_get_pages - pin user or kernel pages and add them to a bio
- * @bio: bio to add pages to
- * @iter: iov iterator describing the region to be mapped
- *
- * Pins as many pages from *iter and appends them to @bio's bvec array. The
- * pages will have to be released using put_page() when done.
- */
-int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
+static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 {
 	unsigned short nr_pages = bio->bi_max_vecs - bio->bi_vcnt;
 	struct bio_vec *bv = bio->bi_io_vec + bio->bi_vcnt;
@@ -951,6 +943,28 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	iov_iter_advance(iter, size);
 	return 0;
 }
+
+/**
+ * bio_iov_iter_get_pages - pin user or kernel pages and add them to a bio
+ * @bio: bio to add pages to
+ * @iter: iov iterator describing the region to be mapped
+ *
+ * Pins as many pages from *iter and appends them to @bio's bvec array. The
+ * pages will have to be released using put_page() when done.
+ */
+int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
+{
+	int ret;
+	unsigned int size;
+
+	do {
+		size = bio->bi_iter.bi_size;
+		ret = __bio_iov_iter_get_pages(bio, iter);
+	} while (!bio_full(bio) && iov_iter_count(iter) > 0 &&
+			bio->bi_iter.bi_size > size);
+
+	return bio->bi_iter.bi_size > 0 ? 0 : ret;
+}
 EXPORT_SYMBOL_GPL(bio_iov_iter_get_pages);
 
 static void submit_bio_wait_endio(struct bio *bio)

Thanks, 
Ming

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* Re: [PATCH 2/2] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio
  2018-07-19 10:37                       ` Jan Kara
  2018-07-19 10:46                         ` Ming Lei
@ 2018-07-19 11:08                         ` Al Viro
  2018-07-19 14:53                           ` Christoph Hellwig
  1 sibling, 1 reply; 54+ messages in thread
From: Al Viro @ 2018-07-19 11:08 UTC (permalink / raw)
  To: Jan Kara
  Cc: Ming Lei, Martin Wilck, Jens Axboe, Jan Kara, Hannes Reinecke,
	Johannes Thumshirn, Kent Overstreet, Christoph Hellwig,
	linux-block

On Thu, Jul 19, 2018 at 12:37:13PM +0200, Jan Kara wrote:
> On Thu 19-07-18 18:21:23, Ming Lei wrote:
> > On Thu, Jul 19, 2018 at 11:39:18AM +0200, Martin Wilck wrote:
> > > bio_iov_iter_get_pages() returns only pages for a single non-empty
> > > segment of the input iov_iter's iovec. This may be much less than the number
> > > of pages __blkdev_direct_IO_simple() is supposed to process. Call
> > 
> > In bio_iov_iter_get_pages(), iov_iter_get_pages() is supposed to retrieve
> > as many pages as possible, since both 'maxsize' and 'maxpages' are provided
> > to cover everything.

Huh?  Why does having a way to say "I (the caller) don't want more than <amount>"
imply anything of that sort?  It has never been promised, explicitly or not...

> > So the question is why iov_iter_get_pages() doesn't work as expected?
>
> Well, there has never been a promise that it will grab *all* pages in the
> iter AFAIK. Practically, I think that it was just too hairy to implement in
> the macro magic that iter processing is... Al might know more (added to
> CC).

Not really - it's more that VM has every right to refuse letting you pin
an arbitrary amount of pages anyway.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 2/2] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio
  2018-07-19 11:04                     ` Ming Lei
@ 2018-07-19 11:56                       ` Jan Kara
  2018-07-19 12:20                         ` Ming Lei
  2018-07-19 12:25                         ` Martin Wilck
  0 siblings, 2 replies; 54+ messages in thread
From: Jan Kara @ 2018-07-19 11:56 UTC (permalink / raw)
  To: Ming Lei
  Cc: Martin Wilck, Jens Axboe, Jan Kara, Hannes Reinecke,
	Johannes Thumshirn, Kent Overstreet, Christoph Hellwig,
	linux-block

On Thu 19-07-18 19:04:46, Ming Lei wrote:
> On Thu, Jul 19, 2018 at 11:39:18AM +0200, Martin Wilck wrote:
> > bio_iov_iter_get_pages() returns only pages for a single non-empty
> > segment of the input iov_iter's iovec. This may be much less than the number
> > of pages __blkdev_direct_IO_simple() is supposed to process. Call
> > bio_iov_iter_get_pages() repeatedly until either the requested number
> > of bytes is reached, or bio.bi_io_vec is exhausted. If this is not done,
> > short writes or reads may occur for direct synchronous IOs with multiple
> > iovec slots (such as generated by writev()). In that case,
> > __generic_file_write_iter() falls back to buffered writes, which
> > has been observed to cause data corruption in certain workloads.
> > 
> > Note: if segments aren't page-aligned in the input iovec, this patch may
> > result in multiple adjacent slots of the bi_io_vec array to reference the same
> > page (the byte ranges are guaranteed to be disjunct if the preceding patch is
> > applied). We haven't seen problems with that in our and the customer's
> > tests. It'd be possible to detect this situation and merge bi_io_vec slots
> > that refer to the same page, but I prefer to keep it simple for now.
> > 
> > Fixes: 72ecad22d9f1 ("block: support a full bio worth of IO for simplified bdev direct-io")
> > Signed-off-by: Martin Wilck <mwilck@suse.com>
> > ---
> >  fs/block_dev.c | 8 +++++++-
> >  1 file changed, 7 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/block_dev.c b/fs/block_dev.c
> > index 0dd87aa..41643c4 100644
> > --- a/fs/block_dev.c
> > +++ b/fs/block_dev.c
> > @@ -221,7 +221,12 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter,
> >  
> >  	ret = bio_iov_iter_get_pages(&bio, iter);
> >  	if (unlikely(ret))
> > -		return ret;
> > +		goto out;
> > +
> > +	while (ret == 0 &&
> > +	       bio.bi_vcnt < bio.bi_max_vecs && iov_iter_count(iter) > 0)
> > +		ret = bio_iov_iter_get_pages(&bio, iter);
> > +
> >  	ret = bio.bi_iter.bi_size;
> >  
> >  	if (iov_iter_rw(iter) == READ) {
> > @@ -250,6 +255,7 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter,
> >  		put_page(bvec->bv_page);
> >  	}
> >  
> > +out:
> >  	if (vecs != inline_vecs)
> >  		kfree(vecs);
> >
> 
> You might put the 'vecs' leak fix into another patch, and reuse the
> current code block for that.
>
> It looks like all users of bio_iov_iter_get_pages() need this kind of fix, so
> what do you think about the following approach?

No. AFAICT all the other users of bio_iov_iter_get_pages() are perfectly
fine with it returning fewer pages, and they loop appropriately.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 2/2] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio
  2018-07-19 11:56                       ` Jan Kara
@ 2018-07-19 12:20                         ` Ming Lei
  2018-07-19 15:21                           ` Jan Kara
  2018-07-19 12:25                         ` Martin Wilck
  1 sibling, 1 reply; 54+ messages in thread
From: Ming Lei @ 2018-07-19 12:20 UTC (permalink / raw)
  To: Jan Kara
  Cc: Martin Wilck, Jens Axboe, Jan Kara, Hannes Reinecke,
	Johannes Thumshirn, Kent Overstreet, Christoph Hellwig,
	linux-block

On Thu, Jul 19, 2018 at 01:56:16PM +0200, Jan Kara wrote:
> On Thu 19-07-18 19:04:46, Ming Lei wrote:
> > On Thu, Jul 19, 2018 at 11:39:18AM +0200, Martin Wilck wrote:
> > > bio_iov_iter_get_pages() returns only pages for a single non-empty
> > > segment of the input iov_iter's iovec. This may be much less than the number
> > > of pages __blkdev_direct_IO_simple() is supposed to process. Call
> > > bio_iov_iter_get_pages() repeatedly until either the requested number
> > > of bytes is reached, or bio.bi_io_vec is exhausted. If this is not done,
> > > short writes or reads may occur for direct synchronous IOs with multiple
> > > iovec slots (such as generated by writev()). In that case,
> > > __generic_file_write_iter() falls back to buffered writes, which
> > > has been observed to cause data corruption in certain workloads.
> > > 
> > > Note: if segments aren't page-aligned in the input iovec, this patch may
> > > result in multiple adjacent slots of the bi_io_vec array to reference the same
> > > page (the byte ranges are guaranteed to be disjunct if the preceding patch is
> > > applied). We haven't seen problems with that in our and the customer's
> > > tests. It'd be possible to detect this situation and merge bi_io_vec slots
> > > that refer to the same page, but I prefer to keep it simple for now.
> > > 
> > > Fixes: 72ecad22d9f1 ("block: support a full bio worth of IO for simplified bdev direct-io")
> > > Signed-off-by: Martin Wilck <mwilck@suse.com>
> > > ---
> > >  fs/block_dev.c | 8 +++++++-
> > >  1 file changed, 7 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/fs/block_dev.c b/fs/block_dev.c
> > > index 0dd87aa..41643c4 100644
> > > --- a/fs/block_dev.c
> > > +++ b/fs/block_dev.c
> > > @@ -221,7 +221,12 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter,
> > >  
> > >  	ret = bio_iov_iter_get_pages(&bio, iter);
> > >  	if (unlikely(ret))
> > > -		return ret;
> > > +		goto out;
> > > +
> > > +	while (ret == 0 &&
> > > +	       bio.bi_vcnt < bio.bi_max_vecs && iov_iter_count(iter) > 0)
> > > +		ret = bio_iov_iter_get_pages(&bio, iter);
> > > +
> > >  	ret = bio.bi_iter.bi_size;
> > >  
> > >  	if (iov_iter_rw(iter) == READ) {
> > > @@ -250,6 +255,7 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter,
> > >  		put_page(bvec->bv_page);
> > >  	}
> > >  
> > > +out:
> > >  	if (vecs != inline_vecs)
> > >  		kfree(vecs);
> > >
> > 
> > You might put the 'vecs' leak fix into another patch, and reuse the
> > current code block for that.
> >
> > It looks like all users of bio_iov_iter_get_pages() need this kind of fix, so
> > what do you think about the following approach?
> 
> No. AFAICT all the other users of bio_iov_iter_get_pages() are perfectly
> fine with it returning fewer pages, and they loop appropriately.

OK, but this way one bio may still end up holding more data; in particular,
the comment of bio_iov_iter_get_pages() says that it 'Pins as many pages from
*iter', so it looks like that is the correct way to do it.

thanks,
Ming

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 2/2] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio
  2018-07-19 10:45                     ` Jan Kara
@ 2018-07-19 12:23                       ` Martin Wilck
  2018-07-19 15:15                         ` Jan Kara
  0 siblings, 1 reply; 54+ messages in thread
From: Martin Wilck @ 2018-07-19 12:23 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jens Axboe, Ming Lei, Jan Kara, Hannes Reinecke,
	Johannes Thumshirn, Kent Overstreet, Christoph Hellwig,
	linux-block

On Thu, 2018-07-19 at 12:45 +0200, Jan Kara wrote:
> On Thu 19-07-18 11:39:18, Martin Wilck wrote:
> > bio_iov_iter_get_pages() returns only pages for a single non-empty
> > segment of the input iov_iter's iovec. This may be much less than
> > the number
> > of pages __blkdev_direct_IO_simple() is supposed to process. Call
> > bio_iov_iter_get_pages() repeatedly until either the requested
> > number
> > of bytes is reached, or bio.bi_io_vec is exhausted. If this is not
> > done,
> > short writes or reads may occur for direct synchronous IOs with
> > multiple
> > iovec slots (such as generated by writev()). In that case,
> > __generic_file_write_iter() falls back to buffered writes, which
> > has been observed to cause data corruption in certain workloads.
> > 
> > Note: if segments aren't page-aligned in the input iovec, this
> > patch may
> > result in multiple adjacent slots of the bi_io_vec array to
> > reference the same
> > page (the byte ranges are guaranteed to be disjunct if the
> > preceding patch is
> > applied). We haven't seen problems with that in our and the
> > customer's
> > tests. It'd be possible to detect this situation and merge
> > bi_io_vec slots
> > that refer to the same page, but I prefer to keep it simple for
> > now.
> > 
> > Fixes: 72ecad22d9f1 ("block: support a full bio worth of IO for
> > simplified bdev direct-io")
> > Signed-off-by: Martin Wilck <mwilck@suse.com>
> > ---
> >  fs/block_dev.c | 8 +++++++-
> >  1 file changed, 7 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/block_dev.c b/fs/block_dev.c
> > index 0dd87aa..41643c4 100644
> > --- a/fs/block_dev.c
> > +++ b/fs/block_dev.c
> > @@ -221,7 +221,12 @@ __blkdev_direct_IO_simple(struct kiocb *iocb,
> > struct iov_iter *iter,
> >  
> >  	ret = bio_iov_iter_get_pages(&bio, iter);
> >  	if (unlikely(ret))
> > -		return ret;
> > +		goto out;
> > +
> > +	while (ret == 0 &&
> > +	       bio.bi_vcnt < bio.bi_max_vecs &&
> > iov_iter_count(iter) > 0)
> > +		ret = bio_iov_iter_get_pages(&bio, iter);
> > +
> 
> I have two suggestions here (posting them now in public):
> 
> Condition bio.bi_vcnt < bio.bi_max_vecs should always be true - we
> made
> sure we have enough vecs for pages in iter. So I'd WARN if this isn't
> true.

Yeah. I wanted to add that to the patch. Slipped through, somehow.
Sorry about that.
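
Something along these lines, I guess -- just a sketch of the reworked loop
with the WARN added, not the actual respin:

	ret = bio_iov_iter_get_pages(&bio, iter);
	if (unlikely(ret))
		goto out;

	while (ret == 0 && iov_iter_count(iter) > 0) {
		/*
		 * We sized the bvec array for the whole iter, so this
		 * should never fire.
		 */
		if (WARN_ON_ONCE(bio.bi_vcnt >= bio.bi_max_vecs))
			break;
		ret = bio_iov_iter_get_pages(&bio, iter);
	}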

> Secondly, I don't think it is good to discard error from
> bio_iov_iter_get_pages() here and just submit partial IO. It will
> again
> lead to part of IO being done as direct and part attempted to be done
> as
> buffered. Also the "slow" direct IO path in __blkdev_direct_IO()
> behaves
> differently - it aborts and returns error if bio_iov_iter_get_pages()
> ever
> returned error. IMO we should do the same here.

Well, it aborts the loop, but then (in the sync case) it still waits
for the already submitted IOs to finish. Here, too, I'd find it more
logical to return the number of successfully transmitted bytes rather
than an error code. In the async case, the submitted bios are left in
place, and will probably sooner or later finish, changing iocb->ki_pos.

I'm actually not quite certain whether that's correct. In the sync case,
it causes the already-performed IO to be done again, buffered. In the
async case, it may even cause two IOs for the same range to be in
flight at the same time ... ?

Martin

-- 
Dr. Martin Wilck <mwilck@suse.com>, Tel. +49 (0)911 74053 2107
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 2/2] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio
  2018-07-19 11:56                       ` Jan Kara
  2018-07-19 12:20                         ` Ming Lei
@ 2018-07-19 12:25                         ` Martin Wilck
  1 sibling, 0 replies; 54+ messages in thread
From: Martin Wilck @ 2018-07-19 12:25 UTC (permalink / raw)
  To: Jan Kara, Ming Lei
  Cc: Jens Axboe, Jan Kara, Hannes Reinecke, Johannes Thumshirn,
	Kent Overstreet, Christoph Hellwig, linux-block

On Thu, 2018-07-19 at 13:56 +0200, Jan Kara wrote:
> On Thu 19-07-18 19:04:46, Ming Lei wrote:
> > On Thu, Jul 19, 2018 at 11:39:18AM +0200, Martin Wilck wrote:
> > > bio_iov_iter_get_pages() returns only pages for a single non-
> > > empty
> > > segment of the input iov_iter's iovec. This may be much less than
> > > the number
> > > of pages __blkdev_direct_IO_simple() is supposed to process. Call
> > > bio_iov_iter_get_pages() repeatedly until either the requested
> > > number
> > > of bytes is reached, or bio.bi_io_vec is exhausted. If this is
> > > not done,
> > > short writes or reads may occur for direct synchronous IOs with
> > > multiple
> > > iovec slots (such as generated by writev()). In that case,
> > > __generic_file_write_iter() falls back to buffered writes, which
> > > has been observed to cause data corruption in certain workloads.
> > > 
> > > Note: if segments aren't page-aligned in the input iovec, this
> > > patch may
> > > result in multiple adjacent slots of the bi_io_vec array to
> > > reference the same
> > > page (the byte ranges are guaranteed to be disjunct if the
> > > preceding patch is
> > > applied). We haven't seen problems with that in our and the
> > > customer's
> > > tests. It'd be possible to detect this situation and merge
> > > bi_io_vec slots
> > > that refer to the same page, but I prefer to keep it simple for
> > > now.
> > > 
> > > Fixes: 72ecad22d9f1 ("block: support a full bio worth of IO for
> > > simplified bdev direct-io")
> > > Signed-off-by: Martin Wilck <mwilck@suse.com>
> > > ---
> > >  fs/block_dev.c | 8 +++++++-
> > >  1 file changed, 7 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/fs/block_dev.c b/fs/block_dev.c
> > > index 0dd87aa..41643c4 100644
> > > --- a/fs/block_dev.c
> > > +++ b/fs/block_dev.c
> > > @@ -221,7 +221,12 @@ __blkdev_direct_IO_simple(struct kiocb
> > > *iocb, struct iov_iter *iter,
> > >  
> > >  	ret = bio_iov_iter_get_pages(&bio, iter);
> > >  	if (unlikely(ret))
> > > -		return ret;
> > > +		goto out;
> > > +
> > > +	while (ret == 0 &&
> > > +	       bio.bi_vcnt < bio.bi_max_vecs &&
> > > iov_iter_count(iter) > 0)
> > > +		ret = bio_iov_iter_get_pages(&bio, iter);
> > > +
> > >  	ret = bio.bi_iter.bi_size;
> > >  
> > >  	if (iov_iter_rw(iter) == READ) {
> > > @@ -250,6 +255,7 @@ __blkdev_direct_IO_simple(struct kiocb *iocb,
> > > struct iov_iter *iter,
> > >  		put_page(bvec->bv_page);
> > >  	}
> > >  
> > > +out:
> > >  	if (vecs != inline_vecs)
> > >  		kfree(vecs);
> > > 
> > 
> > You might put the 'vecs' leak fix into another patch, and reuse the
> > current code block for that.
> > 
> > Looks all users of bio_iov_iter_get_pages() need this kind of fix,
> > so
> > what do you think about the following way?
> 
> No. AFAICT all the other users of bio_iov_iter_get_pages() are
> perfectly
> fine with it returning less pages and they loop appropriately.

We might consider filling the bios more completely in __blkdev_direct_IO(),
too; it might be a performance gain.

Regards
Martin

-- 
Dr. Martin Wilck <mwilck@suse.com>, Tel. +49 (0)911 74053 2107
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 0/2] Fix silent data corruption in blkdev_direct_IO()
  2018-07-19  9:39                 ` [PATCH 0/2] Fix silent " Martin Wilck
                                     ` (2 preceding siblings ...)
  2018-07-19 10:08                   ` [PATCH 0/2] Fix silent data corruption in blkdev_direct_IO() Hannes Reinecke
@ 2018-07-19 14:50                   ` Christoph Hellwig
  3 siblings, 0 replies; 54+ messages in thread
From: Christoph Hellwig @ 2018-07-19 14:50 UTC (permalink / raw)
  To: Martin Wilck
  Cc: Jens Axboe, Ming Lei, Jan Kara, Hannes Reinecke,
	Johannes Thumshirn, Kent Overstreet, Christoph Hellwig,
	linux-block

Martin,

please NEVER send a patch series as a reply to a previous thread.
That makes it complete hell to find in the inbox.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 1/2] block: bio_iov_iter_get_pages: fix size of last iovec
  2018-07-19  9:39                   ` [PATCH 1/2] block: bio_iov_iter_get_pages: fix size of last iovec Martin Wilck
                                       ` (2 preceding siblings ...)
  2018-07-19 10:20                     ` Jan Kara
@ 2018-07-19 14:52                     ` Christoph Hellwig
  3 siblings, 0 replies; 54+ messages in thread
From: Christoph Hellwig @ 2018-07-19 14:52 UTC (permalink / raw)
  To: Martin Wilck
  Cc: Jens Axboe, Ming Lei, Jan Kara, Hannes Reinecke,
	Johannes Thumshirn, Kent Overstreet, Christoph Hellwig,
	linux-block

On Thu, Jul 19, 2018 at 11:39:17AM +0200, Martin Wilck wrote:
> If the last page of the bio is not "full", the length of the last
> vector slot needs to be corrected. This slot has the index
> (bio->bi_vcnt - 1), but only in bio->bi_io_vec. In the "bv" helper
> array, which is shifted by the value of bio->bi_vcnt at function
> invocation, the correct index is (nr_pages - 1).
> 
> V2: improved readability following suggestions from Ming Lei.
> 
> Fixes: 2cefe4dbaadf ("block: add bio_iov_iter_get_pages()")
> Signed-off-by: Martin Wilck <mwilck@suse.com>
> ---
>  block/bio.c | 18 ++++++++----------
>  1 file changed, 8 insertions(+), 10 deletions(-)
> 
> diff --git a/block/bio.c b/block/bio.c
> index 67eff5e..0964328 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -912,16 +912,16 @@ EXPORT_SYMBOL(bio_add_page);
>   */
>  int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
>  {
> -	unsigned short nr_pages = bio->bi_max_vecs - bio->bi_vcnt;
> +	unsigned short idx, nr_pages = bio->bi_max_vecs - bio->bi_vcnt;

Nitpick: I prefer to keep the variable(s) that are initialized first on
any given line.

Otherwise this looks fine:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 2/2] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio
  2018-07-19 11:08                         ` Al Viro
@ 2018-07-19 14:53                           ` Christoph Hellwig
  2018-07-19 15:06                             ` Jan Kara
  2018-07-19 19:34                             ` Martin Wilck
  0 siblings, 2 replies; 54+ messages in thread
From: Christoph Hellwig @ 2018-07-19 14:53 UTC (permalink / raw)
  To: Al Viro
  Cc: Jan Kara, Ming Lei, Martin Wilck, Jens Axboe, Jan Kara,
	Hannes Reinecke, Johannes Thumshirn, Kent Overstreet,
	Christoph Hellwig, linux-block

On Thu, Jul 19, 2018 at 12:08:41PM +0100, Al Viro wrote:
> > Well, there has never been a promise that it will grab *all* pages in the
> > iter AFAIK. Practically, I think that it was just too hairy to implement in
> > the macro magic that iter processing is... Al might know more (added to
> > CC).
> 
> Not really - it's more that VM has every right to refuse letting you pin
> an arbitrary amount of pages anyway.

In which case the code after this patch isn't going to help either, because
it still tries to pin it all, just in multiple calls to get_user_pages().

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 2/2] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio
  2018-07-19 14:53                           ` Christoph Hellwig
@ 2018-07-19 15:06                             ` Jan Kara
  2018-07-19 15:11                               ` Christoph Hellwig
  2018-07-19 19:34                             ` Martin Wilck
  1 sibling, 1 reply; 54+ messages in thread
From: Jan Kara @ 2018-07-19 15:06 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Al Viro, Jan Kara, Ming Lei, Martin Wilck, Jens Axboe, Jan Kara,
	Hannes Reinecke, Johannes Thumshirn, Kent Overstreet,
	linux-block

On Thu 19-07-18 16:53:51, Christoph Hellwig wrote:
> On Thu, Jul 19, 2018 at 12:08:41PM +0100, Al Viro wrote:
> > > Well, there has never been a promise that it will grab *all* pages in the
> > > iter AFAIK. Practically, I think that it was just too hairy to implement in
> > > the macro magic that iter processing is... Al might know more (added to
> > > CC).
> > 
> > Not really - it's more that VM has every right to refuse letting you pin
> > an arbitrary amount of pages anyway.
> 
> In which case the code after this patch isn't going to help either, because
> it still tries to pin it all, just in multiple calls to get_user_pages().

Yeah. Actually, a previous version of the fix (not posted publicly) submitted
a partial bio and then reused the bio to submit more. This is also the way
__blkdev_direct_IO operates. Martin optimized this to fill the bio
completely (as we know we have enough bvecs) before submitting, which has
a chance to perform better. I'm fine with either approach; we just have to
decide which way to go.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 2/2] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio
  2018-07-19 15:06                             ` Jan Kara
@ 2018-07-19 15:11                               ` Christoph Hellwig
  2018-07-19 19:21                                 ` Martin Wilck
  0 siblings, 1 reply; 54+ messages in thread
From: Christoph Hellwig @ 2018-07-19 15:11 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christoph Hellwig, Al Viro, Ming Lei, Martin Wilck, Jens Axboe,
	Jan Kara, Hannes Reinecke, Johannes Thumshirn, Kent Overstreet,
	linux-block

On Thu, Jul 19, 2018 at 05:06:42PM +0200, Jan Kara wrote:
> Yeah. Actually previous version of the fix (not posted publicly) submitted
> partial bio and then reused the bio to submit more. This is also the way
> __blkdev_direct_IO operates. Martin optimized this to fill the bio
> completely (as we know we have enough bvecs) before submitting which has
> chances to perform better. I'm fine with either approach, just we have to
> decide which way to go.

I think this first version is going to be less fragile, so we should
aim for that.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 2/2] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio
  2018-07-19 12:23                       ` Martin Wilck
@ 2018-07-19 15:15                         ` Jan Kara
  2018-07-19 20:01                           ` Martin Wilck
  0 siblings, 1 reply; 54+ messages in thread
From: Jan Kara @ 2018-07-19 15:15 UTC (permalink / raw)
  To: Martin Wilck
  Cc: Jan Kara, Jens Axboe, Ming Lei, Jan Kara, Hannes Reinecke,
	Johannes Thumshirn, Kent Overstreet, Christoph Hellwig,
	linux-block

On Thu 19-07-18 14:23:53, Martin Wilck wrote:
> On Thu, 2018-07-19 at 12:45 +0200, Jan Kara wrote:
> > Secondly, I don't think it is good to discard error from
> > bio_iov_iter_get_pages() here and just submit partial IO. It will
> > again
> > lead to part of IO being done as direct and part attempted to be done
> > as
> > buffered. Also the "slow" direct IO path in __blkdev_direct_IO()
> > behaves
> > differently - it aborts and returns error if bio_iov_iter_get_pages()
> > ever
> > returned error. IMO we should do the same here.
> 
> Well, it aborts the loop, but then (in the sync case) it still waits
> for the already submitted IOs to finish. Here, too, I'd find it more
> logical to return the number of successfully transmitted bytes rather
> than an error code. In the async case, the submitted bios are left in
> place, and will probably sooner or later finish, changing iocb->ki_pos.

Well, both these behaviors make sense; it's just that traditionally (as
defined by our implementation) DIO returns an error even if part of the IO
has actually been submitted successfully. Making a userspace-visible change
like you suggest thus has to be analyzed very carefully, and frankly I don't
think it's worth the bother.

> I'm actually not quite certain if that's correct. In the sync case, it
> causes the already-performed IO to be done again, buffered. In the
> async case, it it may even cause two IOs for the same range to be in
> flight at the same time ... ?

It doesn't cause the IO to be done again. Look at __generic_file_write_iter():
if generic_file_direct_write() returned an error, we immediately return that
error as well, without retrying buffered IO. We only retry buffered IO for a
partial (or zero) return value.
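
For reference, the relevant part of __generic_file_write_iter() looks roughly
like this (simplified and paraphrased from mm/filemap.c; the DAX special case
and the surrounding code are omitted):

	written = generic_file_direct_write(iocb, from);
	/*
	 * written < 0: hard error, returned as-is, no buffered fallback.
	 * written >= 0 with bytes left in 'from': the iov_iter has already
	 * been advanced, so only the remainder is retried buffered.
	 */
	if (written < 0 || !iov_iter_count(from))
		goto out;

	status = generic_perform_write(file, from, iocb->ki_pos);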

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 2/2] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio
  2018-07-19 12:20                         ` Ming Lei
@ 2018-07-19 15:21                           ` Jan Kara
  2018-07-19 19:06                             ` Martin Wilck
  0 siblings, 1 reply; 54+ messages in thread
From: Jan Kara @ 2018-07-19 15:21 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jan Kara, Martin Wilck, Jens Axboe, Jan Kara, Hannes Reinecke,
	Johannes Thumshirn, Kent Overstreet, Christoph Hellwig,
	linux-block

On Thu 19-07-18 20:20:51, Ming Lei wrote:
> On Thu, Jul 19, 2018 at 01:56:16PM +0200, Jan Kara wrote:
> > On Thu 19-07-18 19:04:46, Ming Lei wrote:
> > > On Thu, Jul 19, 2018 at 11:39:18AM +0200, Martin Wilck wrote:
> > > > bio_iov_iter_get_pages() returns only pages for a single non-empty
> > > > segment of the input iov_iter's iovec. This may be much less than the number
> > > > of pages __blkdev_direct_IO_simple() is supposed to process. Call
> > > > bio_iov_iter_get_pages() repeatedly until either the requested number
> > > > of bytes is reached, or bio.bi_io_vec is exhausted. If this is not done,
> > > > short writes or reads may occur for direct synchronous IOs with multiple
> > > > iovec slots (such as generated by writev()). In that case,
> > > > __generic_file_write_iter() falls back to buffered writes, which
> > > > has been observed to cause data corruption in certain workloads.
> > > > 
> > > > Note: if segments aren't page-aligned in the input iovec, this patch may
> > > > result in multiple adjacent slots of the bi_io_vec array to reference the same
> > > > page (the byte ranges are guaranteed to be disjunct if the preceding patch is
> > > > applied). We haven't seen problems with that in our and the customer's
> > > > tests. It'd be possible to detect this situation and merge bi_io_vec slots
> > > > that refer to the same page, but I prefer to keep it simple for now.
> > > > 
> > > > Fixes: 72ecad22d9f1 ("block: support a full bio worth of IO for simplified bdev direct-io")
> > > > Signed-off-by: Martin Wilck <mwilck@suse.com>
> > > > ---
> > > >  fs/block_dev.c | 8 +++++++-
> > > >  1 file changed, 7 insertions(+), 1 deletion(-)
> > > > 
> > > > diff --git a/fs/block_dev.c b/fs/block_dev.c
> > > > index 0dd87aa..41643c4 100644
> > > > --- a/fs/block_dev.c
> > > > +++ b/fs/block_dev.c
> > > > @@ -221,7 +221,12 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter,
> > > >  
> > > >  	ret = bio_iov_iter_get_pages(&bio, iter);
> > > >  	if (unlikely(ret))
> > > > -		return ret;
> > > > +		goto out;
> > > > +
> > > > +	while (ret == 0 &&
> > > > +	       bio.bi_vcnt < bio.bi_max_vecs && iov_iter_count(iter) > 0)
> > > > +		ret = bio_iov_iter_get_pages(&bio, iter);
> > > > +
> > > >  	ret = bio.bi_iter.bi_size;
> > > >  
> > > >  	if (iov_iter_rw(iter) == READ) {
> > > > @@ -250,6 +255,7 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter,
> > > >  		put_page(bvec->bv_page);
> > > >  	}
> > > >  
> > > > +out:
> > > >  	if (vecs != inline_vecs)
> > > >  		kfree(vecs);
> > > >
> > > 
> > > You might put the 'vecs' leak fix into another patch, and reuse the
> > > current code block for that.
> > > 
> > > Looks all users of bio_iov_iter_get_pages() need this kind of fix, so
> > > what do you think about the following way?
> > 
> > No. AFAICT all the other users of bio_iov_iter_get_pages() are perfectly
> > fine with it returning less pages and they loop appropriately.
> 
> OK, but this way still may make one bio to hold more data, especially
> the comment of bio_iov_iter_get_pages() says that 'Pins as many pages from
> *iter', so looks it is the correct way to do.

Well, but as Al pointed out, MM may decide that the user has already pinned
too many pages and refuse to pin more. So pinning a full iter's worth of
pages may simply be impossible. Currently there are no checks like this in
MM, but eventually I'd like to account pinned pages against the mlock ulimit
(or a limit of a similar kind), and then what Al speaks about would become
very real. So I'd prefer not to develop new places that must be able to pin
an arbitrary number of pages.
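
That's also why the slow path doesn't insist on pinning everything up front.
Very roughly -- with error handling, op setup, bio chaining and polling all
omitted -- it does something like:

	for (;;) {
		ret = bio_iov_iter_get_pages(bio, iter);
		if (ret)
			break;			/* couldn't pin, abort */
		if (!iov_iter_count(iter)) {
			submit_bio(bio);	/* last chunk */
			break;
		}
		submit_bio(bio);		/* submit what we have so far */
		bio = bio_alloc(GFP_KERNEL, iov_iter_npages(iter, BIO_MAX_PAGES));
	}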

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 2/2] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio
  2018-07-19 15:21                           ` Jan Kara
@ 2018-07-19 19:06                             ` Martin Wilck
  0 siblings, 0 replies; 54+ messages in thread
From: Martin Wilck @ 2018-07-19 19:06 UTC (permalink / raw)
  To: Jan Kara, Ming Lei
  Cc: Jens Axboe, Jan Kara, Hannes Reinecke, Johannes Thumshirn,
	Kent Overstreet, Christoph Hellwig, linux-block

On Thu, 2018-07-19 at 17:21 +0200, Jan Kara wrote:
> On Thu 19-07-18 20:20:51, Ming Lei wrote:
> > On Thu, Jul 19, 2018 at 01:56:16PM +0200, Jan Kara wrote:
> > > On Thu 19-07-18 19:04:46, Ming Lei wrote:
> > > > On Thu, Jul 19, 2018 at 11:39:18AM +0200, Martin Wilck wrote:
> > > > > bio_iov_iter_get_pages() returns only pages for a single non-
> > > > > empty
> > > > > segment of the input iov_iter's iovec. This may be much less
> > > > > than the number
> > > > > of pages __blkdev_direct_IO_simple() is supposed to process.
> > > > > Call
> > > > > bio_iov_iter_get_pages() repeatedly until either the
> > > > > requested number
> > > > > of bytes is reached, or bio.bi_io_vec is exhausted. If this
> > > > > is not done,
> > > > > short writes or reads may occur for direct synchronous IOs
> > > > > with multiple
> > > > > iovec slots (such as generated by writev()). In that case,
> > > > > __generic_file_write_iter() falls back to buffered writes,
> > > > > which
> > > > > has been observed to cause data corruption in certain
> > > > > workloads.
> > > > > 
> > > > > Note: if segments aren't page-aligned in the input iovec,
> > > > > this patch may
> > > > > result in multiple adjacent slots of the bi_io_vec array to
> > > > > reference the same
> > > > > page (the byte ranges are guaranteed to be disjunct if the
> > > > > preceding patch is
> > > > > applied). We haven't seen problems with that in our and the
> > > > > customer's
> > > > > tests. It'd be possible to detect this situation and merge
> > > > > bi_io_vec slots
> > > > > that refer to the same page, but I prefer to keep it simple
> > > > > for now.
> > > > > 
> > > > > Fixes: 72ecad22d9f1 ("block: support a full bio worth of IO
> > > > > for simplified bdev direct-io")
> > > > > Signed-off-by: Martin Wilck <mwilck@suse.com>
> > > > > ---
> > > > >  fs/block_dev.c | 8 +++++++-
> > > > >  1 file changed, 7 insertions(+), 1 deletion(-)
> > > > > 
> > > > > diff --git a/fs/block_dev.c b/fs/block_dev.c
> > > > > index 0dd87aa..41643c4 100644
> > > > > --- a/fs/block_dev.c
> > > > > +++ b/fs/block_dev.c
> > > > > @@ -221,7 +221,12 @@ __blkdev_direct_IO_simple(struct kiocb
> > > > > *iocb, struct iov_iter *iter,
> > > > >  
> > > > >  	ret = bio_iov_iter_get_pages(&bio, iter);
> > > > >  	if (unlikely(ret))
> > > > > -		return ret;
> > > > > +		goto out;
> > > > > +
> > > > > +	while (ret == 0 &&
> > > > > +	       bio.bi_vcnt < bio.bi_max_vecs &&
> > > > > iov_iter_count(iter) > 0)
> > > > > +		ret = bio_iov_iter_get_pages(&bio, iter);
> > > > > +
> > > > >  	ret = bio.bi_iter.bi_size;
> > > > >  
> > > > >  	if (iov_iter_rw(iter) == READ) {
> > > > > @@ -250,6 +255,7 @@ __blkdev_direct_IO_simple(struct kiocb
> > > > > *iocb, struct iov_iter *iter,
> > > > >  		put_page(bvec->bv_page);
> > > > >  	}
> > > > >  
> > > > > +out:
> > > > >  	if (vecs != inline_vecs)
> > > > >  		kfree(vecs);
> > > > > 
> > > > 
> > > > You might put the 'vecs' leak fix into another patch, and reuse
> > > > the
> > > > current code block for that.
> > > > 
> > > > Looks all users of bio_iov_iter_get_pages() need this kind of
> > > > fix, so
> > > > what do you think about the following way?
> > > 
> > > No. AFAICT all the other users of bio_iov_iter_get_pages() are
> > > perfectly
> > > fine with it returning less pages and they loop appropriately.
> > 
> > OK, but this way still may make one bio to hold more data,
> > especially
> > the comment of bio_iov_iter_get_pages() says that 'Pins as many
> > pages from
> > *iter', so looks it is the correct way to do.
> 
> Well, but as Al pointed out MM may decide that user has already
> pinned too
> many pages and refuse to pin more. So pinning full iter worth of
> pages may
> just be impossible. Currently, there are no checks like this in MM
> but
> eventually, I'd like to account pinned pages in mlock ulimit (or a
> limit of
> similar kind) and then what Al speaks about would become very real.
> So I'd
> prefer to not develop new locations that must be able to pin
> arbitrary
> amount of pages.

Doesn't this unfairly penalize scatter-gather IO? Consider two 1 MiB
IO requests, one with a single 1 MiB slot, and one with 256 single-page
slots. Calling bio_iov_iter_get_pages() once will result in a 1 MiB bio
in the first case, and a 4k bio in the second. You need to submit 256
individual bios to finish the IO.

By using the approach that my patch takes (repeatedly calling
bio_iov_iter_get_pages() before submitting the bio), you'd get a 1 MiB bio
in both cases. So Ming's idea to generalize this may not be so bad
after all. I don't see that it would increase the pressure on MM. If we
are concerned about MM, we should limit the total number of pages that
can be pinned, rather than the number of iovec segments that are
processed.
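
To make the example concrete, here is a hypothetical userspace snippet (it
assumes the usual includes, a page-aligned 1 MiB buffer 'buf', and an fd
opened with O_DIRECT). Both calls transfer the same data, but only the
second one produces a multi-segment iov_iter in the kernel:

	/* case 1: a single 1 MiB slot */
	struct iovec one = { .iov_base = buf, .iov_len = 1 << 20 };
	writev(fd, &one, 1);

	/* case 2: 256 single-page slots */
	struct iovec iov[256];
	for (int i = 0; i < 256; i++) {
		iov[i].iov_base = buf + i * 4096;
		iov[i].iov_len = 4096;
	}
	writev(fd, iov, 256);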

Martin

-- 
Dr. Martin Wilck <mwilck@suse.com>, Tel. +49 (0)911 74053 2107
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 2/2] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio
  2018-07-19 15:11                               ` Christoph Hellwig
@ 2018-07-19 19:21                                 ` Martin Wilck
  0 siblings, 0 replies; 54+ messages in thread
From: Martin Wilck @ 2018-07-19 19:21 UTC (permalink / raw)
  To: Christoph Hellwig, Jan Kara
  Cc: Al Viro, Ming Lei, Jens Axboe, Jan Kara, Hannes Reinecke,
	Johannes Thumshirn, Kent Overstreet, linux-block

On Thu, 2018-07-19 at 17:11 +0200, Christoph Hellwig wrote:
> On Thu, Jul 19, 2018 at 05:06:42PM +0200, Jan Kara wrote:
> > Yeah. Actually previous version of the fix (not posted publicly)
> > submitted
> > partial bio and then reused the bio to submit more. This is also
> > the way
> > __blkdev_direct_IO operates. Martin optimized this to fill the
> > bio
> > completely (as we know we have enough bvecs) before submitting
> > which has
> > chances to perform better. I'm fine with either approach, just we
> > have to
> > decide which way to go.
> 
> I think this first version is going to be less fragile, so I we
> should
> aim for that.

Wait a second. We changed the approach for a reason. By submitting
partial bios synchronously and sequentially, we lose a lot of
performance, so much that this "fast path" quickly falls behind the
regular __blkdev_direct_IO() path as the number of input IO segments
increases. The patch I submitted avoids that. The whole point of this
fast path is to submit a single bio and return as quickly as possible.

It's not an error condition if bio_iov_iter_get_pages() returns less
data than it could. The function just takes one iteration step at a
time; iterating over it until the desired number of pages is obtained,
and collecting the resulting pages in a single bio, isn't "fragile",
it's just the natural thing to do.

Martin

-- 
Dr. Martin Wilck <mwilck@suse.com>, Tel. +49 (0)911 74053 2107
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 2/2] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio
  2018-07-19 14:53                           ` Christoph Hellwig
  2018-07-19 15:06                             ` Jan Kara
@ 2018-07-19 19:34                             ` Martin Wilck
  1 sibling, 0 replies; 54+ messages in thread
From: Martin Wilck @ 2018-07-19 19:34 UTC (permalink / raw)
  To: Christoph Hellwig, Al Viro
  Cc: Jan Kara, Ming Lei, Jens Axboe, Jan Kara, Hannes Reinecke,
	Johannes Thumshirn, Kent Overstreet, linux-block

On Thu, 2018-07-19 at 16:53 +0200, Christoph Hellwig wrote:
> On Thu, Jul 19, 2018 at 12:08:41PM +0100, Al Viro wrote:
> > > Well, there has never been a promise that it will grab *all*
> > > pages in the
> > > iter AFAIK. Practically, I think that it was just too hairy to
> > > implement in
> > > the macro magic that iter processing is... Al might know more
> > > (added to
> > > CC).
> > 
> > Not really - it's more that VM has every right to refuse letting
> > you pin
> > an arbitrary amount of pages anyway.
> 
> In which case the code after this patch isn't going to help either,
> because
> it still tries to pin it all, just in multiple calls to
> get_user_pages().

That was not the point of the patch. It's not about a situation in
which MM refuses to pin more pages.

The patch is about the "iterator::next()" nature of
bio_iov_iter_get_pages().

If it can't pin the pages, bio_iov_iter_get_pages() returns an error
code (elsewhere in this thread we discussed how to handle that correctly).
Otherwise, it always adds _some_ data to the bio, but the amount added
depends on the segment structure of the input iov_iter. If the input
iovec has just a single segment, it fills the bio in a single call.
With multiple segments, it just returns the page(s) of the first
segment. The point of my patch is to remove that difference between
single-segment and multi-segment IOs.

Regards
Martin

-- 
Dr. Martin Wilck <mwilck@suse.com>, Tel. +49 (0)911 74053 2107
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 2/2] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio
  2018-07-19 15:15                         ` Jan Kara
@ 2018-07-19 20:01                           ` Martin Wilck
  0 siblings, 0 replies; 54+ messages in thread
From: Martin Wilck @ 2018-07-19 20:01 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jens Axboe, Ming Lei, Jan Kara, Hannes Reinecke,
	Johannes Thumshirn, Kent Overstreet, Christoph Hellwig,
	linux-block

On Thu, 2018-07-19 at 17:15 +0200, Jan Kara wrote:
> On Thu 19-07-18 14:23:53, Martin Wilck wrote:
> > On Thu, 2018-07-19 at 12:45 +0200, Jan Kara wrote:
> > > Secondly, I don't think it is good to discard error from
> > > bio_iov_iter_get_pages() here and just submit partial IO. It will
> > > again
> > > lead to part of IO being done as direct and part attempted to be
> > > done
> > > as
> > > buffered. Also the "slow" direct IO path in __blkdev_direct_IO()
> > > behaves
> > > differently - it aborts and returns error if
> > > bio_iov_iter_get_pages()
> > > ever
> > > returned error. IMO we should do the same here.
> > 
> > Well, it aborts the loop, but then (in the sync case) it still
> > waits
> > for the already submitted IOs to finish. Here, too, I'd find it
> > more
> > logical to return the number of successfully transmitted bytes
> > rather
> > than an error code. In the async case, the submitted bios are left
> > in
> > place, and will probably sooner or later finish, changing iocb-
> > >ki_pos.
> 
> Well, both these behaviors make sense, just traditionally (defined by
> our
> implementation) DIO returns error even if part of IO has actually
> been
> successfully submitted. Making a userspace visible change like you
> suggest
> thus has to be very carefully analyzed and frankly I don't think it's
> worth
> the bother.

OK, maybe not.
I'll rework the treatment of the error case.

Martin

-- 
Dr. Martin Wilck <mwilck@suse.com>, Tel. +49 (0)911 74053 2107
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 54+ messages in thread

end of thread, other threads:[~2018-07-19 20:01 UTC | newest]

Thread overview: 54+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-07-12 14:36 Silent data corruption in blkdev_direct_IO() Hannes Reinecke
2018-07-12 15:08 ` Jens Axboe
2018-07-12 16:11   ` Martin Wilck
2018-07-12 16:14   ` Hannes Reinecke
2018-07-12 16:20     ` Jens Axboe
2018-07-12 16:42       ` Jens Axboe
2018-07-13  6:47         ` Martin Wilck
2018-07-13 16:56         ` Martin Wilck
2018-07-13 18:00           ` Jens Axboe
2018-07-13 18:50             ` Jens Axboe
2018-07-13 22:21               ` Martin Wilck
2018-07-13 20:48             ` Martin Wilck
2018-07-13 20:52               ` Jens Axboe
2018-07-16 19:05                 ` Martin Wilck
2018-07-12 23:29 ` Ming Lei
2018-07-13 18:54   ` Jens Axboe
2018-07-13 22:29     ` Martin Wilck
2018-07-16 11:45       ` Ming Lei
2018-07-18  0:07         ` Martin Wilck
2018-07-18  2:48           ` Ming Lei
2018-07-18  7:32             ` Martin Wilck
2018-07-18  7:54               ` Ming Lei
2018-07-18  9:20                 ` Johannes Thumshirn
2018-07-18 11:40                   ` Jan Kara
2018-07-18 11:57                     ` Jan Kara
2018-07-19  9:39                 ` [PATCH 0/2] Fix silent " Martin Wilck
2018-07-19  9:39                   ` [PATCH 1/2] block: bio_iov_iter_get_pages: fix size of last iovec Martin Wilck
2018-07-19 10:05                     ` Hannes Reinecke
2018-07-19 10:09                     ` Ming Lei
2018-07-19 10:20                     ` Jan Kara
2018-07-19 14:52                     ` Christoph Hellwig
2018-07-19  9:39                   ` [PATCH 2/2] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio Martin Wilck
2018-07-19 10:06                     ` Hannes Reinecke
2018-07-19 10:21                     ` Ming Lei
2018-07-19 10:37                       ` Jan Kara
2018-07-19 10:46                         ` Ming Lei
2018-07-19 11:08                         ` Al Viro
2018-07-19 14:53                           ` Christoph Hellwig
2018-07-19 15:06                             ` Jan Kara
2018-07-19 15:11                               ` Christoph Hellwig
2018-07-19 19:21                                 ` Martin Wilck
2018-07-19 19:34                             ` Martin Wilck
2018-07-19 10:45                     ` Jan Kara
2018-07-19 12:23                       ` Martin Wilck
2018-07-19 15:15                         ` Jan Kara
2018-07-19 20:01                           ` Martin Wilck
2018-07-19 11:04                     ` Ming Lei
2018-07-19 11:56                       ` Jan Kara
2018-07-19 12:20                         ` Ming Lei
2018-07-19 15:21                           ` Jan Kara
2018-07-19 19:06                             ` Martin Wilck
2018-07-19 12:25                         ` Martin Wilck
2018-07-19 10:08                   ` [PATCH 0/2] Fix silent data corruption in blkdev_direct_IO() Hannes Reinecke
2018-07-19 14:50                   ` Christoph Hellwig
