All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v3 0/3] Fix silent data corruption in blkdev_direct_IO()
@ 2018-07-19 21:01 Martin Wilck
  2018-07-19 21:01 ` [PATCH v3 1/3] block: bio_iov_iter_get_pages: fix size of last iovec Martin Wilck
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Martin Wilck @ 2018-07-19 21:01 UTC (permalink / raw)
  To: Jens Axboe, Ming Lei, Jan Kara
  Cc: Hannes Reinecke, Johannes Thumshirn, Christoph Hellwig, Al Viro,
	Kent Overstreet, linux-block, Martin Wilck

Hello Jens, Ming, Jan, and all others,

the following patches have been verified by a customer to fix a silent data
corruption which he has been seeing since "72ecad2 block: support a full bio
worth of IO for simplified bdev direct-io".

The patches are based on our observation that the corruption is only
observed if the __blkdev_direct_IO_simple() code path is executed,
and if that happens, "short writes" are observed in this code path,
which causes a fallback to buffered IO, while the application continues
submitting direct IO requests.

I called this v3 because the first patch was at v2 already in the previous
submission.

Changes wrt v1:
 - 1/3: minor formatting change (Christoph)
 - 2/3: split off the leak fix (Ming)
 - 3/3: give up if bio_iov_iter_get_pages() returns an error (Jan)
 - 3/3: warn if space in bio exhausted (Jan)
 - 3/3: add comments

Martin Wilck (3):
  block: bio_iov_iter_get_pages: fix size of last iovec
  blkdev: __blkdev_direct_IO_simple: fix leak in error case
  blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio

 block/bio.c    | 18 ++++++++----------
 fs/block_dev.c | 27 +++++++++++++++++++++++----
 2 files changed, 31 insertions(+), 14 deletions(-)

-- 
2.17.1

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH v3 1/3] block: bio_iov_iter_get_pages: fix size of last iovec
  2018-07-19 21:01 [PATCH v3 0/3] Fix silent data corruption in blkdev_direct_IO() Martin Wilck
@ 2018-07-19 21:01 ` Martin Wilck
  2018-07-19 21:01 ` [PATCH v3 2/3] blkdev: __blkdev_direct_IO_simple: fix leak in error case Martin Wilck
  2018-07-19 21:01 ` [PATCH v3 3/3] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio Martin Wilck
  2 siblings, 0 replies; 7+ messages in thread
From: Martin Wilck @ 2018-07-19 21:01 UTC (permalink / raw)
  To: Jens Axboe, Ming Lei, Jan Kara
  Cc: Hannes Reinecke, Johannes Thumshirn, Christoph Hellwig, Al Viro,
	Kent Overstreet, linux-block, Martin Wilck

If the last page of the bio is not "full", the length of the last
vector slot needs to be corrected. This slot has the index
(bio->bi_vcnt - 1), but only in bio->bi_io_vec. In the "bv" helper
array, which is shifted by the value of bio->bi_vcnt at function
invocation, the correct index is (nr_pages - 1).

v2: improved readability following suggestions from Ming Lei.
v3: followed a formatting suggestion from Christoph Hellwig.

Fixes: 2cefe4dbaadf ("block: add bio_iov_iter_get_pages()")
Signed-off-by: Martin Wilck <mwilck@suse.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 block/bio.c | 18 ++++++++----------
 1 file changed, 8 insertions(+), 10 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 67eff5e..489a430 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -912,16 +912,16 @@ EXPORT_SYMBOL(bio_add_page);
  */
 int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 {
-	unsigned short nr_pages = bio->bi_max_vecs - bio->bi_vcnt;
+	unsigned short nr_pages = bio->bi_max_vecs - bio->bi_vcnt, idx;
 	struct bio_vec *bv = bio->bi_io_vec + bio->bi_vcnt;
 	struct page **pages = (struct page **)bv;
-	size_t offset, diff;
+	size_t offset;
 	ssize_t size;
 
 	size = iov_iter_get_pages(iter, pages, LONG_MAX, nr_pages, &offset);
 	if (unlikely(size <= 0))
 		return size ? size : -EFAULT;
-	nr_pages = (size + offset + PAGE_SIZE - 1) / PAGE_SIZE;
+	idx = nr_pages = (size + offset + PAGE_SIZE - 1) / PAGE_SIZE;
 
 	/*
 	 * Deep magic below:  We need to walk the pinned pages backwards
@@ -934,17 +934,15 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	bio->bi_iter.bi_size += size;
 	bio->bi_vcnt += nr_pages;
 
-	diff = (nr_pages * PAGE_SIZE - offset) - size;
-	while (nr_pages--) {
-		bv[nr_pages].bv_page = pages[nr_pages];
-		bv[nr_pages].bv_len = PAGE_SIZE;
-		bv[nr_pages].bv_offset = 0;
+	while (idx--) {
+		bv[idx].bv_page = pages[idx];
+		bv[idx].bv_len = PAGE_SIZE;
+		bv[idx].bv_offset = 0;
 	}
 
 	bv[0].bv_offset += offset;
 	bv[0].bv_len -= offset;
-	if (diff)
-		bv[bio->bi_vcnt - 1].bv_len -= diff;
+	bv[nr_pages - 1].bv_len -= nr_pages * PAGE_SIZE - offset - size;
 
 	iov_iter_advance(iter, size);
 	return 0;
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [PATCH v3 2/3] blkdev: __blkdev_direct_IO_simple: fix leak in error case
  2018-07-19 21:01 [PATCH v3 0/3] Fix silent data corruption in blkdev_direct_IO() Martin Wilck
  2018-07-19 21:01 ` [PATCH v3 1/3] block: bio_iov_iter_get_pages: fix size of last iovec Martin Wilck
@ 2018-07-19 21:01 ` Martin Wilck
  2018-07-20  2:22   ` Ming Lei
  2018-07-19 21:01 ` [PATCH v3 3/3] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio Martin Wilck
  2 siblings, 1 reply; 7+ messages in thread
From: Martin Wilck @ 2018-07-19 21:01 UTC (permalink / raw)
  To: Jens Axboe, Ming Lei, Jan Kara
  Cc: Hannes Reinecke, Johannes Thumshirn, Christoph Hellwig, Al Viro,
	Kent Overstreet, linux-block, Martin Wilck

Fixes: 72ecad22d9f1 ("block: support a full bio worth of IO for simplified bdev direct-io")
Signed-off-by: Martin Wilck <mwilck@suse.com>
---
 fs/block_dev.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 0dd87aa..aba2541 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -221,7 +221,7 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter,
 
 	ret = bio_iov_iter_get_pages(&bio, iter);
 	if (unlikely(ret))
-		return ret;
+		goto out;
 	ret = bio.bi_iter.bi_size;
 
 	if (iov_iter_rw(iter) == READ) {
@@ -250,12 +250,13 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter,
 		put_page(bvec->bv_page);
 	}
 
-	if (vecs != inline_vecs)
-		kfree(vecs);
-
 	if (unlikely(bio.bi_status))
 		ret = blk_status_to_errno(bio.bi_status);
 
+out:
+	if (vecs != inline_vecs)
+		kfree(vecs);
+
 	bio_uninit(&bio);
 
 	return ret;
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [PATCH v3 3/3] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio
  2018-07-19 21:01 [PATCH v3 0/3] Fix silent data corruption in blkdev_direct_IO() Martin Wilck
  2018-07-19 21:01 ` [PATCH v3 1/3] block: bio_iov_iter_get_pages: fix size of last iovec Martin Wilck
  2018-07-19 21:01 ` [PATCH v3 2/3] blkdev: __blkdev_direct_IO_simple: fix leak in error case Martin Wilck
@ 2018-07-19 21:01 ` Martin Wilck
  2018-07-20  2:41   ` Ming Lei
  2 siblings, 1 reply; 7+ messages in thread
From: Martin Wilck @ 2018-07-19 21:01 UTC (permalink / raw)
  To: Jens Axboe, Ming Lei, Jan Kara
  Cc: Hannes Reinecke, Johannes Thumshirn, Christoph Hellwig, Al Viro,
	Kent Overstreet, linux-block, Martin Wilck

When bio_iov_iter_get_pages() is called from  __blkdev_direct_IO_simple(),
we already know that the content of the input iov_iter fits into a single
bio, so we expect iov_iter_count(iter) to drop to 0. But in a single invocation,
bio_iov_iter_get_pages() may add less bytes then we expect. For iov_iters with
multiple segments (generated e.g. by writev()), it behaves like an iterator's
next() function, taking only one step (segment) at a time. Furthermore, MM may
fail or refuse to pin all requested pages. The latter may signify an error condition
(in which case eventually an error code will be returned), the former does not. 

Call bio_iov_iter_get_pages() repeatedly to avoid short reads or writes.
Otherwise, __generic_file_write_iter() falls back to buffered writes, which
has been observed to cause data corruption in certain workloads.

Fixes: 72ecad22d9f1 ("block: support a full bio worth of IO for simplified bdev direct-io")
Signed-off-by: Martin Wilck <mwilck@suse.com>
---
 fs/block_dev.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index aba2541..561c34e 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -222,6 +222,24 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter,
 	ret = bio_iov_iter_get_pages(&bio, iter);
 	if (unlikely(ret))
 		goto out;
+
+	/*
+	 * bio_iov_iter_get_pages() may add less bytes than we expect:
+	 * - for multi-segment iov_iters, as it only adds one segment at a time
+	 * - if MM refuses or fails to pin all requested pages. In this case,
+	 *   an error is returned eventually if no progress can be made.
+	 */
+	while (iov_iter_count(iter) > 0 && bio.bi_vcnt < bio.bi_max_vecs) {
+		ret = bio_iov_iter_get_pages(&bio, iter);
+		if (unlikely(ret))
+			goto out;
+	}
+	/*
+	 * Our bi_io_vec should be big enough to hold all data from the
+	 * iov_iter, as this has been checked before calling this function.
+	 */
+	WARN_ON_ONCE(iov_iter_count(iter) > 0);
+
 	ret = bio.bi_iter.bi_size;
 
 	if (iov_iter_rw(iter) == READ) {
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH v3 2/3] blkdev: __blkdev_direct_IO_simple: fix leak in error case
  2018-07-19 21:01 ` [PATCH v3 2/3] blkdev: __blkdev_direct_IO_simple: fix leak in error case Martin Wilck
@ 2018-07-20  2:22   ` Ming Lei
  0 siblings, 0 replies; 7+ messages in thread
From: Ming Lei @ 2018-07-20  2:22 UTC (permalink / raw)
  To: Martin Wilck
  Cc: Jens Axboe, Jan Kara, Hannes Reinecke, Johannes Thumshirn,
	Christoph Hellwig, Al Viro, Kent Overstreet, linux-block

On Thu, Jul 19, 2018 at 11:01:57PM +0200, Martin Wilck wrote:
> Fixes: 72ecad22d9f1 ("block: support a full bio worth of IO for simplified bdev direct-io")
> Signed-off-by: Martin Wilck <mwilck@suse.com>
> ---
>  fs/block_dev.c | 9 +++++----
>  1 file changed, 5 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 0dd87aa..aba2541 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -221,7 +221,7 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter,
>  
>  	ret = bio_iov_iter_get_pages(&bio, iter);
>  	if (unlikely(ret))
> -		return ret;
> +		goto out;
>  	ret = bio.bi_iter.bi_size;
>  
>  	if (iov_iter_rw(iter) == READ) {
> @@ -250,12 +250,13 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter,
>  		put_page(bvec->bv_page);
>  	}
>  
> -	if (vecs != inline_vecs)
> -		kfree(vecs);
> -
>  	if (unlikely(bio.bi_status))
>  		ret = blk_status_to_errno(bio.bi_status);
>  
> +out:
> +	if (vecs != inline_vecs)
> +		kfree(vecs);
> +
>  	bio_uninit(&bio);
>  
>  	return ret;
> -- 
> 2.17.1
> 

Reviewed-by: Ming Lei <ming.lei@redhat.com>

Thanks,
Ming

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v3 3/3] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio
  2018-07-19 21:01 ` [PATCH v3 3/3] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio Martin Wilck
@ 2018-07-20  2:41   ` Ming Lei
  2018-07-20  7:18     ` Martin Wilck
  0 siblings, 1 reply; 7+ messages in thread
From: Ming Lei @ 2018-07-20  2:41 UTC (permalink / raw)
  To: Martin Wilck
  Cc: Jens Axboe, Jan Kara, Hannes Reinecke, Johannes Thumshirn,
	Christoph Hellwig, Al Viro, Kent Overstreet, linux-block

On Thu, Jul 19, 2018 at 11:01:58PM +0200, Martin Wilck wrote:
> When bio_iov_iter_get_pages() is called from  __blkdev_direct_IO_simple(),
> we already know that the content of the input iov_iter fits into a single
> bio, so we expect iov_iter_count(iter) to drop to 0. But in a single invocation,
> bio_iov_iter_get_pages() may add less bytes then we expect. For iov_iters with
> multiple segments (generated e.g. by writev()), it behaves like an iterator's
> next() function, taking only one step (segment) at a time. Furthermore, MM may
> fail or refuse to pin all requested pages. The latter may signify an error condition
> (in which case eventually an error code will be returned), the former does not. 
> 
> Call bio_iov_iter_get_pages() repeatedly to avoid short reads or writes.
> Otherwise, __generic_file_write_iter() falls back to buffered writes, which
> has been observed to cause data corruption in certain workloads.
> 
> Fixes: 72ecad22d9f1 ("block: support a full bio worth of IO for simplified bdev direct-io")
> Signed-off-by: Martin Wilck <mwilck@suse.com>
> ---
>  fs/block_dev.c | 18 ++++++++++++++++++
>  1 file changed, 18 insertions(+)
> 
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index aba2541..561c34e 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -222,6 +222,24 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter,
>  	ret = bio_iov_iter_get_pages(&bio, iter);
>  	if (unlikely(ret))
>  		goto out;
> +
> +	/*
> +	 * bio_iov_iter_get_pages() may add less bytes than we expect:
> +	 * - for multi-segment iov_iters, as it only adds one segment at a time
> +	 * - if MM refuses or fails to pin all requested pages. In this case,
> +	 *   an error is returned eventually if no progress can be made.
> +	 */
> +	while (iov_iter_count(iter) > 0 && bio.bi_vcnt < bio.bi_max_vecs) {

Please use the helper bio_full().

> +		ret = bio_iov_iter_get_pages(&bio, iter);
> +		if (unlikely(ret))
> +			goto out;
> +	}

When ret returns, pages pinned already need to be put.

Also I think the way suggested in the following link may be more generic:

	https://marc.info/?l=linux-block&m=153199830130209&w=2

in which it is retried until no progress is made, and there should have
performance benefit for other users too, not only fix this partial dio
issue.

Thanks,
Ming

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v3 3/3] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio
  2018-07-20  2:41   ` Ming Lei
@ 2018-07-20  7:18     ` Martin Wilck
  0 siblings, 0 replies; 7+ messages in thread
From: Martin Wilck @ 2018-07-20  7:18 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, Jan Kara, Hannes Reinecke, Johannes Thumshirn,
	Christoph Hellwig, Al Viro, Kent Overstreet, linux-block

On Fri, 2018-07-20 at 10:41 +0800, Ming Lei wrote:
> On Thu, Jul 19, 2018 at 11:01:58PM +0200, Martin Wilck wrote:
> > When bio_iov_iter_get_pages() is called
> > from  __blkdev_direct_IO_simple(),
> > we already know that the content of the input iov_iter fits into a
> > single
> > bio, so we expect iov_iter_count(iter) to drop to 0. But in a
> > single invocation,
> > bio_iov_iter_get_pages() may add less bytes then we expect. For
> > iov_iters with
> > multiple segments (generated e.g. by writev()), it behaves like an
> > iterator's
> > next() function, taking only one step (segment) at a time.
> > Furthermore, MM may
> > fail or refuse to pin all requested pages. The latter may signify
> > an error condition
> > (in which case eventually an error code will be returned), the
> > former does not. 
> > 
> > Call bio_iov_iter_get_pages() repeatedly to avoid short reads or
> > writes.
> > Otherwise, __generic_file_write_iter() falls back to buffered
> > writes, which
> > has been observed to cause data corruption in certain workloads.
> > 
> > Fixes: 72ecad22d9f1 ("block: support a full bio worth of IO for
> > simplified bdev direct-io")
> > Signed-off-by: Martin Wilck <mwilck@suse.com>
> > ---
> >  fs/block_dev.c | 18 ++++++++++++++++++
> >  1 file changed, 18 insertions(+)
> > 
> > diff --git a/fs/block_dev.c b/fs/block_dev.c
> > index aba2541..561c34e 100644
> > --- a/fs/block_dev.c
> > +++ b/fs/block_dev.c
> > @@ -222,6 +222,24 @@ __blkdev_direct_IO_simple(struct kiocb *iocb,
> > struct iov_iter *iter,
> >  	ret = bio_iov_iter_get_pages(&bio, iter);
> >  	if (unlikely(ret))
> >  		goto out;
> > +
> > +	/*
> > +	 * bio_iov_iter_get_pages() may add less bytes than we
> > expect:
> > +	 * - for multi-segment iov_iters, as it only adds one
> > segment at a time
> > +	 * - if MM refuses or fails to pin all requested pages. In
> > this case,
> > +	 *   an error is returned eventually if no progress can be
> > made.
> > +	 */
> > +	while (iov_iter_count(iter) > 0 && bio.bi_vcnt <
> > bio.bi_max_vecs) {
> 
> Please use the helper bio_full().
> 
> > +		ret = bio_iov_iter_get_pages(&bio, iter);
> > +		if (unlikely(ret))
> > +			goto out;
> > +	}
> 
> When ret returns, pages pinned already need to be put.

Oops, sorry for missing that.

> Also I think the way suggested in the following link may be more
> generic:
> 
> 	https://marc.info/?l=linux-block&m=153199830130209&w=2
> 
> in which it is retried until no progress is made, and there should
> have
> performance benefit for other users too, not only fix this partial
> dio
> issue.

As there was no unequivocal support for that proposal, I propose to add
a new helper without renaming the current one. Callers could then be
migrated one-by-one. I will do so in v4.

Thanks,
Martin

-- 
Dr. Martin Wilck <mwilck@suse.com>, Tel. +49 (0)911 74053 2107
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2018-07-20  7:18 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-07-19 21:01 [PATCH v3 0/3] Fix silent data corruption in blkdev_direct_IO() Martin Wilck
2018-07-19 21:01 ` [PATCH v3 1/3] block: bio_iov_iter_get_pages: fix size of last iovec Martin Wilck
2018-07-19 21:01 ` [PATCH v3 2/3] blkdev: __blkdev_direct_IO_simple: fix leak in error case Martin Wilck
2018-07-20  2:22   ` Ming Lei
2018-07-19 21:01 ` [PATCH v3 3/3] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio Martin Wilck
2018-07-20  2:41   ` Ming Lei
2018-07-20  7:18     ` Martin Wilck

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.