* [PATCH v5 0/3] Fix silent data corruption in blkdev_direct_IO()
@ 2018-07-25 21:15 Martin Wilck
  2018-07-25 21:15 ` [PATCH v5 1/3] block: bio_iov_iter_get_pages: fix size of last iovec Martin Wilck
                   ` (3 more replies)
  0 siblings, 4 replies; 13+ messages in thread
From: Martin Wilck @ 2018-07-25 21:15 UTC (permalink / raw)
  To: Jens Axboe, Jan Kara, Christoph Hellwig, Ming Lei
  Cc: Hannes Reinecke, Johannes Thumshirn, Kent Overstreet,
	linux-block, Martin Wilck

Hello Jens, Ming, Jan, and all others,

the following patches have been verified by a customer to fix silent data
corruption that they had been seeing since commit 72ecad22d9f1 ("block:
support a full bio worth of IO for simplified bdev direct-io").

The patches are based on our observation that the corruption occurs only
when the __blkdev_direct_IO_simple() code path is taken, and that "short
writes" then occur in this path, causing a fallback to buffered IO while
the application continues submitting direct IO requests.

Following Ming's suggestion, I've changed the patch set such that
bio_iov_iter_get_pages() now always returns as many pages as possible.
This simplifies the patch set a lot. Except for
__blkdev_direct_IO_simple(), all callers of bio_iov_iter_get_pages()
call it in a loop, and expect to get just some pages. Therefore I
have made bio_iov_iter_get_pages() return success if it can pin some
pages, even if MM returns an error on the way. Error is returned only
if no pages at all could be pinned. This also avoids the need for
cleanup code in the helper - callers will submit the bio with the
allocated pages, and clean up later as appropriate.

Regards,
Martin

Changes wrt v4:
 - 3/3: replaced bio_iov_iter_get_pages() with the new helper
   (Ming, Christoph)
 - 4/4 dropped: this way, no changes to fs/block_dev.c are necessary any
   more except for the leak fix.

Changes wrt v3:
 - split previous 3/3 into two patches (3/4, 4/4).
 - 3/4: add a new helper to retrieve as many pages as possible (Ming)
 - 3/4: put pages in case of error (Ming)

Changes wrt v1:
 - 1/3: minor formatting change (Christoph)
 - 2/3: split off the leak fix (Ming)
 - 3/3: give up if bio_iov_iter_get_pages() returns an error (Jan)
 - 3/3: warn if space in bio exhausted (Jan)
 - 3/3: add comments

Martin Wilck (3):
  block: bio_iov_iter_get_pages: fix size of last iovec
  blkdev: __blkdev_direct_IO_simple: fix leak in error case
  block: bio_iov_iter_get_pages: pin more pages for multi-segment IOs

 block/bio.c    | 53 +++++++++++++++++++++++++++++++++++++-------------
 fs/block_dev.c |  9 +++++----
 2 files changed, 45 insertions(+), 17 deletions(-)

-- 
2.17.1

* [PATCH v5 1/3] block: bio_iov_iter_get_pages: fix size of last iovec
  2018-07-25 21:15 [PATCH v5 0/3] Fix silent data corruption in blkdev_direct_IO() Martin Wilck
@ 2018-07-25 21:15 ` Martin Wilck
  2018-07-25 21:15 ` [PATCH v5 2/3] blkdev: __blkdev_direct_IO_simple: fix leak in error case Martin Wilck
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 13+ messages in thread
From: Martin Wilck @ 2018-07-25 21:15 UTC (permalink / raw)
  To: Jens Axboe, Jan Kara, Christoph Hellwig, Ming Lei
  Cc: Hannes Reinecke, Johannes Thumshirn, Kent Overstreet,
	linux-block, Martin Wilck

If the last page of the bio is not "full", the length of the last
vector slot needs to be corrected. This slot has the index
(bio->bi_vcnt - 1), but only in bio->bi_io_vec. In the "bv" helper
array, which is shifted by the value of bio->bi_vcnt at function
invocation, the correct index is (nr_pages - 1).

v2: improved readability following suggestions from Ming Lei.
v3: followed a formatting suggestion from Christoph Hellwig.

Fixes: 2cefe4dbaadf ("block: add bio_iov_iter_get_pages()")
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Martin Wilck <mwilck@suse.com>
---
 block/bio.c | 18 ++++++++----------
 1 file changed, 8 insertions(+), 10 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 67eff5e..489a430 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -912,16 +912,16 @@ EXPORT_SYMBOL(bio_add_page);
  */
 int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 {
-	unsigned short nr_pages = bio->bi_max_vecs - bio->bi_vcnt;
+	unsigned short nr_pages = bio->bi_max_vecs - bio->bi_vcnt, idx;
 	struct bio_vec *bv = bio->bi_io_vec + bio->bi_vcnt;
 	struct page **pages = (struct page **)bv;
-	size_t offset, diff;
+	size_t offset;
 	ssize_t size;
 
 	size = iov_iter_get_pages(iter, pages, LONG_MAX, nr_pages, &offset);
 	if (unlikely(size <= 0))
 		return size ? size : -EFAULT;
-	nr_pages = (size + offset + PAGE_SIZE - 1) / PAGE_SIZE;
+	idx = nr_pages = (size + offset + PAGE_SIZE - 1) / PAGE_SIZE;
 
 	/*
 	 * Deep magic below:  We need to walk the pinned pages backwards
@@ -934,17 +934,15 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	bio->bi_iter.bi_size += size;
 	bio->bi_vcnt += nr_pages;
 
-	diff = (nr_pages * PAGE_SIZE - offset) - size;
-	while (nr_pages--) {
-		bv[nr_pages].bv_page = pages[nr_pages];
-		bv[nr_pages].bv_len = PAGE_SIZE;
-		bv[nr_pages].bv_offset = 0;
+	while (idx--) {
+		bv[idx].bv_page = pages[idx];
+		bv[idx].bv_len = PAGE_SIZE;
+		bv[idx].bv_offset = 0;
 	}
 
 	bv[0].bv_offset += offset;
 	bv[0].bv_len -= offset;
-	if (diff)
-		bv[bio->bi_vcnt - 1].bv_len -= diff;
+	bv[nr_pages - 1].bv_len -= nr_pages * PAGE_SIZE - offset - size;
 
 	iov_iter_advance(iter, size);
 	return 0;
-- 
2.17.1

* [PATCH v5 2/3] blkdev: __blkdev_direct_IO_simple: fix leak in error case
  2018-07-25 21:15 [PATCH v5 0/3] Fix silent data corruption in blkdev_direct_IO() Martin Wilck
  2018-07-25 21:15 ` [PATCH v5 1/3] block: bio_iov_iter_get_pages: fix size of last iovec Martin Wilck
@ 2018-07-25 21:15 ` Martin Wilck
  2018-07-26  6:38   ` Hannes Reinecke
  2018-07-26  9:20   ` Christoph Hellwig
  2018-07-25 21:15 ` [PATCH v5 3/3] block: bio_iov_iter_get_pages: pin more pages for multi-segment IOs Martin Wilck
  2018-07-26 17:53 ` [PATCH v5 0/3] Fix silent data corruption in blkdev_direct_IO() Jens Axboe
  3 siblings, 2 replies; 13+ messages in thread
From: Martin Wilck @ 2018-07-25 21:15 UTC (permalink / raw)
  To: Jens Axboe, Jan Kara, Christoph Hellwig, Ming Lei
  Cc: Hannes Reinecke, Johannes Thumshirn, Kent Overstreet,
	linux-block, Martin Wilck

Fixes: 72ecad22d9f1 ("block: support a full bio worth of IO for
 simplified bdev direct-io")
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Martin Wilck <mwilck@suse.com>
---
 fs/block_dev.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 0dd87aa..aba2541 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -221,7 +221,7 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter,
 
 	ret = bio_iov_iter_get_pages(&bio, iter);
 	if (unlikely(ret))
-		return ret;
+		goto out;
 	ret = bio.bi_iter.bi_size;
 
 	if (iov_iter_rw(iter) == READ) {
@@ -250,12 +250,13 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter,
 		put_page(bvec->bv_page);
 	}
 
-	if (vecs != inline_vecs)
-		kfree(vecs);
-
 	if (unlikely(bio.bi_status))
 		ret = blk_status_to_errno(bio.bi_status);
 
+out:
+	if (vecs != inline_vecs)
+		kfree(vecs);
+
 	bio_uninit(&bio);
 
 	return ret;
-- 
2.17.1

* [PATCH v5 3/3] block: bio_iov_iter_get_pages: pin more pages for multi-segment IOs
  2018-07-25 21:15 [PATCH v5 0/3] Fix silent data corruption in blkdev_direct_IO() Martin Wilck
  2018-07-25 21:15 ` [PATCH v5 1/3] block: bio_iov_iter_get_pages: fix size of last iovec Martin Wilck
  2018-07-25 21:15 ` [PATCH v5 2/3] blkdev: __blkdev_direct_IO_simple: fix leak in error case Martin Wilck
@ 2018-07-25 21:15 ` Martin Wilck
  2018-07-26  9:21   ` Christoph Hellwig
  2018-07-30 12:37   ` Ming Lei
  2018-07-26 17:53 ` [PATCH v5 0/3] Fix silent data corruption in blkdev_direct_IO() Jens Axboe
  3 siblings, 2 replies; 13+ messages in thread
From: Martin Wilck @ 2018-07-25 21:15 UTC (permalink / raw)
  To: Jens Axboe, Jan Kara, Christoph Hellwig, Ming Lei
  Cc: Hannes Reinecke, Johannes Thumshirn, Kent Overstreet,
	linux-block, Martin Wilck

bio_iov_iter_get_pages() currently only adds pages for the
next non-zero segment from the iov_iter to the bio. That's
suboptimal for callers, which typically try to pin as many
pages as fit into the bio. This patch converts the current
bio_iov_iter_get_pages() into a static helper, and introduces
a new helper that allocates as many pages as

 1) fit into the bio,
 2) are present in the iov_iter,
 3) and can be pinned by MM.

An error is returned only if no pages could be pinned. Because of 3), a
zero return value doesn't necessarily mean that all pages have been
pinned. Callers that have to pin every page in the iov_iter must still
call this function in a loop (as all current callers already do).

This change matters most for __blkdev_direct_IO_simple(), which calls
bio_iov_iter_get_pages() only once. If it obtains fewer pages than
requested, it returns a "short write" or "short read", and
__generic_file_write_iter() falls back to buffered writes, which may
lead to data corruption.

Fixes: 72ecad22d9f1 ("block: support a full bio worth of IO for
 simplified bdev direct-io")
Signed-off-by: Martin Wilck <mwilck@suse.com>
---
 block/bio.c | 35 ++++++++++++++++++++++++++++++++---
 1 file changed, 32 insertions(+), 3 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 489a430..925033d 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -903,14 +903,16 @@ int bio_add_page(struct bio *bio, struct page *page,
 EXPORT_SYMBOL(bio_add_page);
 
 /**
- * bio_iov_iter_get_pages - pin user or kernel pages and add them to a bio
+ * __bio_iov_iter_get_pages - pin user or kernel pages and add them to a bio
  * @bio: bio to add pages to
  * @iter: iov iterator describing the region to be mapped
  *
- * Pins as many pages from *iter and appends them to @bio's bvec array. The
+ * Pins pages from *iter and appends them to @bio's bvec array. The
  * pages will have to be released using put_page() when done.
+ * For multi-segment *iter, this function only adds pages from
+ * the next non-empty segment of the iov iterator.
  */
-int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
+static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 {
 	unsigned short nr_pages = bio->bi_max_vecs - bio->bi_vcnt, idx;
 	struct bio_vec *bv = bio->bi_io_vec + bio->bi_vcnt;
@@ -947,6 +949,33 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	iov_iter_advance(iter, size);
 	return 0;
 }
+
+/**
+ * bio_iov_iter_get_pages - pin user or kernel pages and add them to a bio
+ * @bio: bio to add pages to
+ * @iter: iov iterator describing the region to be mapped
+ *
+ * Pins pages from *iter and appends them to @bio's bvec array. The
+ * pages will have to be released using put_page() when done.
+ * The function tries, but does not guarantee, to pin as many pages as
+ * fit into the bio, or are requested in *iter, whatever is smaller.
+ * If MM encounters an error pinning the requested pages, it stops.
+ * Error is returned only if 0 pages could be pinned.
+ */
+int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
+{
+	unsigned short orig_vcnt = bio->bi_vcnt;
+
+	do {
+		int ret = __bio_iov_iter_get_pages(bio, iter);
+
+		if (unlikely(ret))
+			return bio->bi_vcnt > orig_vcnt ? 0 : ret;
+
+	} while (iov_iter_count(iter) && !bio_full(bio));
+
+	return 0;
+}
 EXPORT_SYMBOL_GPL(bio_iov_iter_get_pages);
 
 static void submit_bio_wait_endio(struct bio *bio)
-- 
2.17.1

* Re: [PATCH v5 2/3] blkdev: __blkdev_direct_IO_simple: fix leak in error case
  2018-07-25 21:15 ` [PATCH v5 2/3] blkdev: __blkdev_direct_IO_simple: fix leak in error case Martin Wilck
@ 2018-07-26  6:38   ` Hannes Reinecke
  2018-07-26  9:20   ` Christoph Hellwig
  1 sibling, 0 replies; 13+ messages in thread
From: Hannes Reinecke @ 2018-07-26  6:38 UTC (permalink / raw)
  To: Martin Wilck, Jens Axboe, Jan Kara, Christoph Hellwig, Ming Lei
  Cc: Johannes Thumshirn, Kent Overstreet, linux-block

On 07/25/2018 11:15 PM, Martin Wilck wrote:
> Fixes: 72ecad22d9f1 ("block: support a full bio worth of IO for
>  simplified bdev direct-io")
> Reviewed-by: Ming Lei <ming.lei@redhat.com>
> Signed-off-by: Martin Wilck <mwilck@suse.com>
> ---
>  fs/block_dev.c | 9 +++++----
>  1 file changed, 5 insertions(+), 4 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.com>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

* Re: [PATCH v5 2/3] blkdev: __blkdev_direct_IO_simple: fix leak in error case
  2018-07-25 21:15 ` [PATCH v5 2/3] blkdev: __blkdev_direct_IO_simple: fix leak in error case Martin Wilck
  2018-07-26  6:38   ` Hannes Reinecke
@ 2018-07-26  9:20   ` Christoph Hellwig
  1 sibling, 0 replies; 13+ messages in thread
From: Christoph Hellwig @ 2018-07-26  9:20 UTC (permalink / raw)
  To: Martin Wilck
  Cc: Jens Axboe, Jan Kara, Christoph Hellwig, Ming Lei,
	Hannes Reinecke, Johannes Thumshirn, Kent Overstreet,
	linux-block

On Wed, Jul 25, 2018 at 11:15:08PM +0200, Martin Wilck wrote:
> Fixes: 72ecad22d9f1 ("block: support a full bio worth of IO for
>  simplified bdev direct-io")
> Reviewed-by: Ming Lei <ming.lei@redhat.com>
> Signed-off-by: Martin Wilck <mwilck@suse.com>

I'm pretty sure I already ACKed this last time, but here we go
again:

Reviewed-by: Christoph Hellwig <hch@lst.de>

* Re: [PATCH v5 3/3] block: bio_iov_iter_get_pages: pin more pages for multi-segment IOs
  2018-07-25 21:15 ` [PATCH v5 3/3] block: bio_iov_iter_get_pages: pin more pages for multi-segment IOs Martin Wilck
@ 2018-07-26  9:21   ` Christoph Hellwig
  2018-07-30 12:37   ` Ming Lei
  1 sibling, 0 replies; 13+ messages in thread
From: Christoph Hellwig @ 2018-07-26  9:21 UTC (permalink / raw)
  To: Martin Wilck
  Cc: Jens Axboe, Jan Kara, Christoph Hellwig, Ming Lei,
	Hannes Reinecke, Johannes Thumshirn, Kent Overstreet,
	linux-block

Both the changelog and the comments should use up all horizontal space
(73 and 80 chars respectively), but that can be fixed up when applied.

Otherwise looks fine:

Reviewed-by: Christoph Hellwig <hch@lst.de>

* Re: [PATCH v5 0/3] Fix silent data corruption in blkdev_direct_IO()
  2018-07-25 21:15 [PATCH v5 0/3] Fix silent data corruption in blkdev_direct_IO() Martin Wilck
                   ` (2 preceding siblings ...)
  2018-07-25 21:15 ` [PATCH v5 3/3] block: bio_iov_iter_get_pages: pin more pages for multi-segment IOs Martin Wilck
@ 2018-07-26 17:53 ` Jens Axboe
  3 siblings, 0 replies; 13+ messages in thread
From: Jens Axboe @ 2018-07-26 17:53 UTC (permalink / raw)
  To: Martin Wilck, Jan Kara, Christoph Hellwig, Ming Lei
  Cc: Hannes Reinecke, Johannes Thumshirn, Kent Overstreet, linux-block

On 7/25/18 2:15 PM, Martin Wilck wrote:
> Hello Jens, Ming, Jan, and all others,
> 
> the following patches have been verified by a customer to fix a silent data
> corruption which he has been seeing since "72ecad2 block: support a full bio
> worth of IO for simplified bdev direct-io".
> 
> The patches are based on our observation that the corruption is only
> observed if the __blkdev_direct_IO_simple() code path is executed,
> and if that happens, "short writes" are observed in this code path,
> which causes a fallback to buffered IO, while the application continues
> submitting direct IO requests.
> 
> Following Ming's suggestion, I've changed the patch set such that
> bio_iov_iter_get_pages() now always returns as many pages as possible.
> This simplifies the patch set a lot. Except for
> __blkdev_direct_IO_simple(), all callers of bio_iov_iter_get_pages()
> call it in a loop, and expect to get just some pages. Therefore I
> have made bio_iov_iter_get_pages() return success if it can pin some
> pages, even if MM returns an error on the way. Error is returned only
> if no pages at all could be pinned. This also avoids the need for
> cleanup code in the helper - callers will submit the bio with the
> allocated pages, and clean up later as appropriate.

Thanks everyone involved in this, I've queued it up for 4.18.

-- 
Jens Axboe

* Re: [PATCH v5 3/3] block: bio_iov_iter_get_pages: pin more pages for multi-segment IOs
  2018-07-25 21:15 ` [PATCH v5 3/3] block: bio_iov_iter_get_pages: pin more pages for multi-segment IOs Martin Wilck
  2018-07-26  9:21   ` Christoph Hellwig
@ 2018-07-30 12:37   ` Ming Lei
  2018-08-22  8:02     ` Martin Wilck
  1 sibling, 1 reply; 13+ messages in thread
From: Ming Lei @ 2018-07-30 12:37 UTC (permalink / raw)
  To: Martin Wilck
  Cc: Jens Axboe, Jan Kara, Christoph Hellwig, Hannes Reinecke,
	Johannes Thumshirn, Kent Overstreet, linux-block

On Wed, Jul 25, 2018 at 11:15:09PM +0200, Martin Wilck wrote:
> bio_iov_iter_get_pages() currently only adds pages for the
> next non-zero segment from the iov_iter to the bio. That's
> suboptimal for callers, which typically try to pin as many
> pages as fit into the bio. This patch converts the current
> bio_iov_iter_get_pages() into a static helper, and introduces
> a new helper that allocates as many pages as
> 
>  1) fit into the bio,
>  2) are present in the iov_iter,
>  3) and can be pinned by MM.
> 
> Error is returned only if zero pages could be pinned. Because of
> 3), a zero return value doesn't necessarily mean all pages have been
> pinned. Callers that have to pin every page in the iov_iter must still
> call this function in a loop (this is currently the case).
> 
> This change matters most for __blkdev_direct_IO_simple(), which calls
> bio_iov_iter_get_pages() only once. If it obtains less pages than requested,
> it returns a "short write" or "short read", and __generic_file_write_iter()
> falls back to buffered writes, which may lead to data corruption.
> 
> Fixes: 72ecad22d9f1 ("block: support a full bio worth of IO for
>  simplified bdev direct-io")
> Signed-off-by: Martin Wilck <mwilck@suse.com>
> ---
>  block/bio.c | 35 ++++++++++++++++++++++++++++++++---
>  1 file changed, 32 insertions(+), 3 deletions(-)
> 
> diff --git a/block/bio.c b/block/bio.c
> index 489a430..925033d 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -903,14 +903,16 @@ int bio_add_page(struct bio *bio, struct page *page,
>  EXPORT_SYMBOL(bio_add_page);
>  
>  /**
> - * bio_iov_iter_get_pages - pin user or kernel pages and add them to a bio
> + * __bio_iov_iter_get_pages - pin user or kernel pages and add them to a bio
>   * @bio: bio to add pages to
>   * @iter: iov iterator describing the region to be mapped
>   *
> - * Pins as many pages from *iter and appends them to @bio's bvec array. The
> + * Pins pages from *iter and appends them to @bio's bvec array. The
>   * pages will have to be released using put_page() when done.
> + * For multi-segment *iter, this function only adds pages from the
> + * the next non-empty segment of the iov iterator.
>   */
> -int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
> +static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
>  {
>  	unsigned short nr_pages = bio->bi_max_vecs - bio->bi_vcnt, idx;
>  	struct bio_vec *bv = bio->bi_io_vec + bio->bi_vcnt;
> @@ -947,6 +949,33 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
>  	iov_iter_advance(iter, size);
>  	return 0;
>  }
> +
> +/**
> + * bio_iov_iter_get_pages - pin user or kernel pages and add them to a bio
> + * @bio: bio to add pages to
> + * @iter: iov iterator describing the region to be mapped
> + *
> + * Pins pages from *iter and appends them to @bio's bvec array. The
> + * pages will have to be released using put_page() when done.
> + * The function tries, but does not guarantee, to pin as many pages as
> + * fit into the bio, or are requested in *iter, whatever is smaller.
> + * If MM encounters an error pinning the requested pages, it stops.
> + * Error is returned only if 0 pages could be pinned.
> + */
> +int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
> +{
> +	unsigned short orig_vcnt = bio->bi_vcnt;
> +
> +	do {
> +		int ret = __bio_iov_iter_get_pages(bio, iter);
> +
> +		if (unlikely(ret))
> +			return bio->bi_vcnt > orig_vcnt ? 0 : ret;
> +
> +	} while (iov_iter_count(iter) && !bio_full(bio));

When 'ret' isn't zero and some partial progress has been made, it seems
fewer pages might be obtained than requested. Is that something we need to
worry about?

Thanks,
Ming

* Re: [PATCH v5 3/3] block: bio_iov_iter_get_pages: pin more pages for multi-segment IOs
  2018-07-30 12:37   ` Ming Lei
@ 2018-08-22  8:02     ` Martin Wilck
  2018-08-22 10:33       ` Jan Kara
  0 siblings, 1 reply; 13+ messages in thread
From: Martin Wilck @ 2018-08-22  8:02 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, Jan Kara, Christoph Hellwig, Hannes Reinecke,
	Johannes Thumshirn, Kent Overstreet, linux-block

On Mon, 2018-07-30 at 20:37 +0800, Ming Lei wrote:
> On Wed, Jul 25, 2018 at 11:15:09PM +0200, Martin Wilck wrote:
> > 
> > +/**
> > + * bio_iov_iter_get_pages - pin user or kernel pages and add them
> > to a bio
> > + * @bio: bio to add pages to
> > + * @iter: iov iterator describing the region to be mapped
> > + *
> > + * Pins pages from *iter and appends them to @bio's bvec array.
> > The
> > + * pages will have to be released using put_page() when done.
> > + * The function tries, but does not guarantee, to pin as many
> > pages as
> > + * fit into the bio, or are requested in *iter, whatever is
> > smaller.
> > + * If MM encounters an error pinning the requested pages, it
> > stops.
> > + * Error is returned only if 0 pages could be pinned.
> > + */
> > +int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
> > +{
> > +	unsigned short orig_vcnt = bio->bi_vcnt;
> > +
> > +	do {
> > +		int ret = __bio_iov_iter_get_pages(bio, iter);
> > +
> > +		if (unlikely(ret))
> > +			return bio->bi_vcnt > orig_vcnt ? 0 : ret;
> > +
> > +	} while (iov_iter_count(iter) && !bio_full(bio));
> 
> When 'ret' isn't zero, and some partial progress has been made, seems
> less pages
> might be obtained than requested too. Is that something we need to
> worry about?

This would be the case when VM isn't willing or able to fulfill the
page-pinning request. Previously, we came to the conclusion that VM has
the right to do so. This is the reason why callers have to check the
number of pages allocated, and either loop over
bio_iov_iter_get_pages(), or fall back to buffered I/O, until all pages
have been obtained. All callers except the blockdev fast path do the
former. 

We could add looping in __blkdev_direct_IO_simple() on top of the
current patch set, to avoid fallback to buffered IO in this corner
case. Should we? If yes, only for WRITEs, or for READs as well?

I haven't encountered this situation in my tests, and I'm unsure how to
provoke it - run a direct IO test under high memory pressure?

Regards,
Martin

-- 
Dr. Martin Wilck <mwilck@suse.com>, Tel. +49 (0)911 74053 2107
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)

* Re: [PATCH v5 3/3] block: bio_iov_iter_get_pages: pin more pages for multi-segment IOs
  2018-08-22  8:02     ` Martin Wilck
@ 2018-08-22 10:33       ` Jan Kara
  2018-08-22 10:50         ` Ming Lei
  0 siblings, 1 reply; 13+ messages in thread
From: Jan Kara @ 2018-08-22 10:33 UTC (permalink / raw)
  To: Martin Wilck
  Cc: Ming Lei, Jens Axboe, Jan Kara, Christoph Hellwig,
	Hannes Reinecke, Johannes Thumshirn, Kent Overstreet,
	linux-block

On Wed 22-08-18 10:02:49, Martin Wilck wrote:
> On Mon, 2018-07-30 at 20:37 +0800, Ming Lei wrote:
> > On Wed, Jul 25, 2018 at 11:15:09PM +0200, Martin Wilck wrote:
> > > 
> > > +/**
> > > + * bio_iov_iter_get_pages - pin user or kernel pages and add them
> > > to a bio
> > > + * @bio: bio to add pages to
> > > + * @iter: iov iterator describing the region to be mapped
> > > + *
> > > + * Pins pages from *iter and appends them to @bio's bvec array.
> > > The
> > > + * pages will have to be released using put_page() when done.
> > > + * The function tries, but does not guarantee, to pin as many
> > > pages as
> > > + * fit into the bio, or are requested in *iter, whatever is
> > > smaller.
> > > + * If MM encounters an error pinning the requested pages, it
> > > stops.
> > > + * Error is returned only if 0 pages could be pinned.
> > > + */
> > > +int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
> > > +{
> > > +	unsigned short orig_vcnt = bio->bi_vcnt;
> > > +
> > > +	do {
> > > +		int ret = __bio_iov_iter_get_pages(bio, iter);
> > > +
> > > +		if (unlikely(ret))
> > > +			return bio->bi_vcnt > orig_vcnt ? 0 : ret;
> > > +
> > > +	} while (iov_iter_count(iter) && !bio_full(bio));
> > 
> > When 'ret' isn't zero, and some partial progress has been made, seems
> > less pages
> > might be obtained than requested too. Is that something we need to
> > worry about?
> 
> This would be the case when VM isn't willing or able to fulfill the
> page-pinning request. Previously, we came to the conclusion that VM has
> the right to do so. This is the reason why callers have to check the
> number of pages allocated, and either loop over
> bio_iov_iter_get_pages(), or fall back to buffered I/O, until all pages
> have been obtained. All callers except the blockdev fast path do the
> former. 
> 
> We could add looping in __blkdev_direct_IO_simple() on top of the
> current patch set, to avoid fallback to buffered IO in this corner
> case. Should we? If yes, only for WRITEs, or for READs as well?
> 
> I haven't encountered this situation in my tests, and I'm unsure how to
> provoke it - run a direct IO test under high memory pressure?

Currently, iov_iter_get_pages() is always guaranteed to get at least one
page as that is current guarantee of get_user_pages() (unless we hit
EFAULT obviously). So bio_iov_iter_get_pages() as is now is guaranteed to
exhaust 'iter' or fill 'bio'. But in the future, the guarantee that
get_user_pages() will always pin at least one page may go away. But we'd
have to audit all users at that time anyway.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

* Re: [PATCH v5 3/3] block: bio_iov_iter_get_pages: pin more pages for multi-segment IOs
  2018-08-22 10:33       ` Jan Kara
@ 2018-08-22 10:50         ` Ming Lei
  2018-08-22 12:47           ` Jan Kara
  0 siblings, 1 reply; 13+ messages in thread
From: Ming Lei @ 2018-08-22 10:50 UTC (permalink / raw)
  To: Jan Kara
  Cc: Martin Wilck, Jens Axboe, Jan Kara, Christoph Hellwig,
	Hannes Reinecke, Johannes Thumshirn, Kent Overstreet,
	linux-block

On Wed, Aug 22, 2018 at 12:33:05PM +0200, Jan Kara wrote:
> On Wed 22-08-18 10:02:49, Martin Wilck wrote:
> > On Mon, 2018-07-30 at 20:37 +0800, Ming Lei wrote:
> > > On Wed, Jul 25, 2018 at 11:15:09PM +0200, Martin Wilck wrote:
> > > > 
> > > > +/**
> > > > + * bio_iov_iter_get_pages - pin user or kernel pages and add them
> > > > to a bio
> > > > + * @bio: bio to add pages to
> > > > + * @iter: iov iterator describing the region to be mapped
> > > > + *
> > > > + * Pins pages from *iter and appends them to @bio's bvec array.
> > > > The
> > > > + * pages will have to be released using put_page() when done.
> > > > + * The function tries, but does not guarantee, to pin as many
> > > > pages as
> > > > + * fit into the bio, or are requested in *iter, whatever is
> > > > smaller.
> > > > + * If MM encounters an error pinning the requested pages, it
> > > > stops.
> > > > + * Error is returned only if 0 pages could be pinned.
> > > > + */
> > > > +int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
> > > > +{
> > > > +	unsigned short orig_vcnt = bio->bi_vcnt;
> > > > +
> > > > +	do {
> > > > +		int ret = __bio_iov_iter_get_pages(bio, iter);
> > > > +
> > > > +		if (unlikely(ret))
> > > > +			return bio->bi_vcnt > orig_vcnt ? 0 : ret;
> > > > +
> > > > +	} while (iov_iter_count(iter) && !bio_full(bio));
> > > 
> > > When 'ret' isn't zero, and some partial progress has been made, seems
> > > less pages
> > > might be obtained than requested too. Is that something we need to
> > > worry about?
> > 
> > This would be the case when VM isn't willing or able to fulfill the
> > page-pinning request. Previously, we came to the conclusion that VM has
> > the right to do so. This is the reason why callers have to check the
> > number of pages allocated, and either loop over
> > bio_iov_iter_get_pages(), or fall back to buffered I/O, until all pages
> > have been obtained. All callers except the blockdev fast path do the
> > former. 
> > 
> > We could add looping in __blkdev_direct_IO_simple() on top of the
> > current patch set, to avoid fallback to buffered IO in this corner
> > case. Should we? If yes, only for WRITEs, or for READs as well?
> > 
> > I haven't encountered this situation in my tests, and I'm unsure how to
> > provoke it - run a direct IO test under high memory pressure?
> 
> Currently, iov_iter_get_pages() is always guaranteed to get at least one
> page as that is current guarantee of get_user_pages() (unless we hit
> EFAULT obviously). So bio_iov_iter_get_pages() as is now is guaranteed to

Is it possible for this EFAULT to happen on the user-space VM?


Thanks,
Ming

* Re: [PATCH v5 3/3] block: bio_iov_iter_get_pages: pin more pages for multi-segment IOs
  2018-08-22 10:50         ` Ming Lei
@ 2018-08-22 12:47           ` Jan Kara
  0 siblings, 0 replies; 13+ messages in thread
From: Jan Kara @ 2018-08-22 12:47 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jan Kara, Martin Wilck, Jens Axboe, Jan Kara, Christoph Hellwig,
	Hannes Reinecke, Johannes Thumshirn, Kent Overstreet,
	linux-block

On Wed 22-08-18 18:50:53, Ming Lei wrote:
> On Wed, Aug 22, 2018 at 12:33:05PM +0200, Jan Kara wrote:
> > On Wed 22-08-18 10:02:49, Martin Wilck wrote:
> > > On Mon, 2018-07-30 at 20:37 +0800, Ming Lei wrote:
> > > > On Wed, Jul 25, 2018 at 11:15:09PM +0200, Martin Wilck wrote:
> > > > > 
> > > > > +/**
> > > > > + * bio_iov_iter_get_pages - pin user or kernel pages and add them
> > > > > to a bio
> > > > > + * @bio: bio to add pages to
> > > > > + * @iter: iov iterator describing the region to be mapped
> > > > > + *
> > > > > + * Pins pages from *iter and appends them to @bio's bvec array.
> > > > > The
> > > > > + * pages will have to be released using put_page() when done.
> > > > > + * The function tries, but does not guarantee, to pin as many
> > > > > pages as
> > > > > + * fit into the bio, or are requested in *iter, whatever is
> > > > > smaller.
> > > > > + * If MM encounters an error pinning the requested pages, it
> > > > > stops.
> > > > > + * Error is returned only if 0 pages could be pinned.
> > > > > + */
> > > > > +int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
> > > > > +{
> > > > > +	unsigned short orig_vcnt = bio->bi_vcnt;
> > > > > +
> > > > > +	do {
> > > > > +		int ret = __bio_iov_iter_get_pages(bio, iter);
> > > > > +
> > > > > +		if (unlikely(ret))
> > > > > +			return bio->bi_vcnt > orig_vcnt ? 0 : ret;
> > > > > +
> > > > > +	} while (iov_iter_count(iter) && !bio_full(bio));
> > > > 
> > > > When 'ret' isn't zero, and some partial progress has been made, seems
> > > > less pages
> > > > might be obtained than requested too. Is that something we need to
> > > > worry about?
> > > 
> > > This would be the case when VM isn't willing or able to fulfill the
> > > page-pinning request. Previously, we came to the conclusion that VM has
> > > the right to do so. This is the reason why callers have to check the
> > > number of pages allocated, and either loop over
> > > bio_iov_iter_get_pages(), or fall back to buffered I/O, until all pages
> > > have been obtained. All callers except the blockdev fast path do the
> > > former. 
> > > 
> > > We could add looping in __blkdev_direct_IO_simple() on top of the
> > > current patch set, to avoid fallback to buffered IO in this corner
> > > case. Should we? If yes, only for WRITEs, or for READs as well?
> > > 
> > > I haven't encountered this situation in my tests, and I'm unsure how to
> > > provoke it - run a direct IO test under high memory pressure?
> > 
> > Currently, iov_iter_get_pages() is always guaranteed to get at least one
> > page as that is current guarantee of get_user_pages() (unless we hit
> > EFAULT obviously). So bio_iov_iter_get_pages() as is now is guaranteed to
> 
> Is it possible for this EFAULT to happen on the user-space VM?

Certainly if the user passes bogus address...

									Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
