From: Martin Wilck <mwilck@suse.com>
To: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>, Ming Lei <ming.lei@redhat.com>,
	Jan Kara <jack@suse.com>, Hannes Reinecke <hare@suse.de>,
	Johannes Thumshirn <jthumshirn@suse.de>,
	Kent Overstreet <kent.overstreet@gmail.com>,
	Christoph Hellwig <hch@lst.de>,
	linux-block@vger.kernel.org
Subject: Re: [PATCH 2/2] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio
Date: Thu, 19 Jul 2018 14:23:53 +0200
Message-ID: <f0945524222b06bee92e969b9d36f00d5b1a0800.camel@suse.com>
In-Reply-To: <20180719104551.jqndys6uxgglsbfh@quack2.suse.cz>

On Thu, 2018-07-19 at 12:45 +0200, Jan Kara wrote:
> On Thu 19-07-18 11:39:18, Martin Wilck wrote:
> > bio_iov_iter_get_pages() returns only pages for a single non-empty
> > segment of the input iov_iter's iovec. This may be much less than the
> > number of pages __blkdev_direct_IO_simple() is supposed to process. Call
> > bio_iov_iter_get_pages() repeatedly until either the requested number
> > of bytes is reached, or bio.bi_io_vec is exhausted. If this is not done,
> > short writes or reads may occur for direct synchronous IOs with multiple
> > iovec slots (such as generated by writev()). In that case,
> > __generic_file_write_iter() falls back to buffered writes, which
> > has been observed to cause data corruption in certain workloads.
> > 
> > Note: if segments aren't page-aligned in the input iovec, this patch may
> > result in multiple adjacent slots of the bi_io_vec array to reference the
> > same page (the byte ranges are guaranteed to be disjunct if the preceding
> > patch is applied). We haven't seen problems with that in our and the
> > customer's tests. It'd be possible to detect this situation and merge
> > bi_io_vec slots that refer to the same page, but I prefer to keep it
> > simple for now.
> > 
> > Fixes: 72ecad22d9f1 ("block: support a full bio worth of IO for simplified bdev direct-io")
> > Signed-off-by: Martin Wilck <mwilck@suse.com>
> > ---
> >  fs/block_dev.c | 8 +++++++-
> >  1 file changed, 7 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/block_dev.c b/fs/block_dev.c
> > index 0dd87aa..41643c4 100644
> > --- a/fs/block_dev.c
> > +++ b/fs/block_dev.c
> > @@ -221,7 +221,12 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter,
> >  
> >  	ret = bio_iov_iter_get_pages(&bio, iter);
> >  	if (unlikely(ret))
> > -		return ret;
> > +		goto out;
> > +
> > +	while (ret == 0 &&
> > +	       bio.bi_vcnt < bio.bi_max_vecs && iov_iter_count(iter) > 0)
> > +		ret = bio_iov_iter_get_pages(&bio, iter);
> > +
> 
> I have two suggestions here (posting them now in public):
> 
> Condition bio.bi_vcnt < bio.bi_max_vecs should always be true - we made
> sure we have enough vecs for pages in iter. So I'd WARN if this isn't
> true.

Yeah. I wanted to add that to the patch. Slipped through, somehow.
Sorry about that.
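
To make the intended shape concrete, here is a rough user-space model of the
fill loop with the check turned into a warning. Everything named model_* is a
hypothetical stand-in invented for this sketch (model_get_pages() plays the
role of bio_iov_iter_get_pages(), mapping one non-empty segment per call); it
assumes page-aligned segments and is not kernel code.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_PAGE 4096u

/* Hypothetical user-space stand-ins; these are not kernel types. */
struct model_seg { size_t len; };        /* one iovec segment */
struct model_iter {
	const struct model_seg *segs;
	size_t nr_segs;
	size_t cur;                          /* next segment to consume */
};
struct model_bio {
	unsigned vcnt;                       /* models bio.bi_vcnt */
	unsigned max_vecs;                   /* models bio.bi_max_vecs */
	size_t size;                         /* bytes mapped so far */
};

/* Bytes still left in the iterator (models iov_iter_count()). */
static size_t model_iter_count(const struct model_iter *it)
{
	size_t n = 0;
	for (size_t i = it->cur; i < it->nr_segs; i++)
		n += it->segs[i].len;
	return n;
}

/*
 * Maps only the next non-empty segment, one slot per page.
 * Returns 0 on success, -1 if nothing is left or the slots do not suffice.
 * Assumption: segments start page-aligned, so pages == slots.
 */
static int model_get_pages(struct model_bio *bio, struct model_iter *it)
{
	assert(bio->vcnt <= bio->max_vecs);
	while (it->cur < it->nr_segs && it->segs[it->cur].len == 0)
		it->cur++;
	if (it->cur == it->nr_segs)
		return -1;
	size_t len = it->segs[it->cur].len;
	unsigned pages = (unsigned)((len + MODEL_PAGE - 1) / MODEL_PAGE);
	if (pages > bio->max_vecs - bio->vcnt)
		return -1;
	bio->vcnt += pages;
	bio->size += len;
	it->cur++;
	return 0;
}

/*
 * Keep calling model_get_pages() until the iterator is drained.
 * Running out of slots while bytes remain is treated as a "should not
 * happen" condition: *warned is set where kernel code would WARN.
 */
static int model_fill_bio(struct model_bio *bio, struct model_iter *it,
			  int *warned)
{
	int ret = model_get_pages(bio, it);

	*warned = 0;
	while (ret == 0 && model_iter_count(it) > 0) {
		if (bio->vcnt >= bio->max_vecs) {
			*warned = 1;
			break;
		}
		ret = model_get_pages(bio, it);
	}
	return ret;
}
```

With three segments of 4096, 8192 and 512 bytes and four slots, the model
drains the iterator; with only three slots it stops early and sets the
warning flag instead of silently submitting a short bio.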

> Secondly, I don't think it is good to discard error from
> bio_iov_iter_get_pages() here and just submit partial IO. It will again
> lead to part of IO being done as direct and part attempted to be done as
> buffered. Also the "slow" direct IO path in __blkdev_direct_IO() behaves
> differently - it aborts and returns error if bio_iov_iter_get_pages() ever
> returned error. IMO we should do the same here.

Well, it aborts the loop, but then (in the sync case) it still waits
for the already submitted IOs to finish. Here, too, I'd find it more
logical to return the number of successfully transmitted bytes rather
than an error code. In the async case, the submitted bios are left in
place, and will probably sooner or later finish, changing iocb->ki_pos.

I'm actually not quite certain if that's correct. In the sync case, it
causes the already-performed IO to be done again, buffered. In the
async case, it may even cause two IOs for the same range to be in
flight at the same time ... ?
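
To illustrate what I mean by the two return policies, here is a small
user-space sketch. The names (model_policy, model_result(),
model_redone_bytes()) are made up for this mail, and the "caller resumes at
the returned offset" behaviour is an assumption of the model, not something
I have verified against the actual callers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical policies for a request that fails part-way through. */
enum model_policy {
	MODEL_RETURN_BYTES,   /* report what was transferred so far */
	MODEL_RETURN_ERROR    /* report only the error code */
};

/*
 * done: bytes already transferred by direct IO
 * err:  0, or a negative errno-style value from a later step
 * Returns an ssize_t-like result under the given policy.
 */
static long model_result(size_t done, int err, enum model_policy p)
{
	if (err == 0)
		return (long)done;
	if (p == MODEL_RETURN_BYTES && done > 0)
		return (long)done;
	return err;
}

/*
 * Assumption of this model: a caller that retries resumes at the offset
 * implied by the result (0 if the result is an error). The return value
 * is the number of already-transferred bytes it would transfer again.
 */
static size_t model_redone_bytes(size_t done, long result)
{
	size_t resume = result > 0 ? (size_t)result : 0;

	return done - resume;
}
```

In this model, 8192 bytes done followed by an error yields no repeated IO
when the byte count is returned, and 8192 repeated bytes when only the error
is returned.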

Martin

-- 
Dr. Martin Wilck <mwilck@suse.com>, Tel. +49 (0)911 74053 2107
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
