* Silent data corruption in blkdev_direct_IO() @ 2018-07-12 14:36 Hannes Reinecke 2018-07-12 15:08 ` Jens Axboe 2018-07-12 23:29 ` Ming Lei 0 siblings, 2 replies; 54+ messages in thread From: Hannes Reinecke @ 2018-07-12 14:36 UTC (permalink / raw) To: Jens Axboe; +Cc: Christoph Hellwig, linux-block, Martin Wilck Hi Jens, Christoph, we're currently hunting down a silent data corruption occurring due to commit 72ecad22d9f1 ("block: support a full bio worth of IO for simplified bdev direct-io"). While the whole thing is still hazy on the details, the one thing we've found is that reverting that patch fixes the data corruption. And looking closer, I've found this: static ssize_t blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter) { int nr_pages; nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES + 1); if (!nr_pages) return 0; if (is_sync_kiocb(iocb) && nr_pages <= BIO_MAX_PAGES) return __blkdev_direct_IO_simple(iocb, iter, nr_pages); return __blkdev_direct_IO(iocb, iter, min(nr_pages, BIO_MAX_PAGES)); } When checking the call path __blkdev_direct_IO()->bio_alloc_bioset()->bvec_alloc() I found that bvec_alloc() will fail if nr_pages > BIO_MAX_PAGES. So why is there the check for 'nr_pages <= BIO_MAX_PAGES' ? It's not that we can handle it in __blkdev_direct_IO() ... Thanks for any clarification. Cheers, Hannes -- Dr. Hannes Reinecke Teamlead Storage & Networking hare@suse.de +49 911 74053 688 SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton HRB 21284 (AG Nürnberg) ^ permalink raw reply [flat|nested] 54+ messages in thread
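For context, the failure Hannes points at can be sketched in user space. The pool sizes below are the biovec slab sizes from block/bio.c of that era; bvec_slab_for() is a made-up stand-in for bvec_alloc()'s slab selection, not the kernel code:

```c
#define BIO_MAX_PAGES 256

/* Biovec pool sizes as in block/bio.c: there is no pool larger than
 * BIO_MAX_PAGES, so a request for more vectors cannot be satisfied. */
static const int bvec_slab_sizes[] = { 1, 4, 16, 64, 128, BIO_MAX_PAGES };

/* Returns the slab size that would serve nr vectors, or -1 when no
 * pool is big enough -- mirroring how bvec_alloc() fails for
 * nr > BIO_MAX_PAGES. */
int bvec_slab_for(int nr)
{
    unsigned int i;

    for (i = 0; i < sizeof(bvec_slab_sizes) / sizeof(bvec_slab_sizes[0]); i++)
        if (nr <= bvec_slab_sizes[i])
            return bvec_slab_sizes[i];
    return -1;
}
```

So a single bio can never carry more than BIO_MAX_PAGES vectors, which is why the question about the nr_pages check matters.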
* Re: Silent data corruption in blkdev_direct_IO() 2018-07-12 14:36 Silent data corruption in blkdev_direct_IO() Hannes Reinecke @ 2018-07-12 15:08 ` Jens Axboe 2018-07-12 16:11 ` Martin Wilck 2018-07-12 16:14 ` Hannes Reinecke 2018-07-12 23:29 ` Ming Lei 1 sibling, 2 replies; 54+ messages in thread From: Jens Axboe @ 2018-07-12 15:08 UTC (permalink / raw) To: Hannes Reinecke; +Cc: Christoph Hellwig, linux-block, Martin Wilck On 7/12/18 8:36 AM, Hannes Reinecke wrote: > Hi Jens, Christoph, > > we're currently hunting down a silent data corruption occurring due to > commit 72ecad22d9f1 ("block: support a full bio worth of IO for > simplified bdev direct-io"). > > While the whole thing is still hazy on the details, the one thing we've > found is that reverting that patch fixes the data corruption. > > And looking closer, I've found this: > > static ssize_t > blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter) > { > int nr_pages; > > nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES + 1); > if (!nr_pages) > return 0; > if (is_sync_kiocb(iocb) && nr_pages <= BIO_MAX_PAGES) > return __blkdev_direct_IO_simple(iocb, iter, nr_pages); > > return __blkdev_direct_IO(iocb, iter, min(nr_pages, BIO_MAX_PAGES)); > } > > When checking the call path > __blkdev_direct_IO()->bio_alloc_bioset()->bvec_alloc() > I found that bvec_alloc() will fail if nr_pages > BIO_MAX_PAGES. > > So why is there the check for 'nr_pages <= BIO_MAX_PAGES' ? > It's not that we can handle it in __blkdev_direct_IO() ... The logic could be cleaned up like below, the sync part is really all we care about. What is the test case for this? async or sync? I also don't remember why it's BIO_MAX_PAGES + 1... 
diff --git a/fs/block_dev.c b/fs/block_dev.c index 0dd87aaeb39a..14ef3d71b55f 100644 --- a/fs/block_dev.c +++ b/fs/block_dev.c @@ -424,13 +424,13 @@ blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter) { int nr_pages; - nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES + 1); + nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES); if (!nr_pages) return 0; - if (is_sync_kiocb(iocb) && nr_pages <= BIO_MAX_PAGES) + if (is_sync_kiocb(iocb)) return __blkdev_direct_IO_simple(iocb, iter, nr_pages); - return __blkdev_direct_IO(iocb, iter, min(nr_pages, BIO_MAX_PAGES)); + return __blkdev_direct_IO(iocb, iter, nr_pages); } static __init int blkdev_init(void) -- Jens Axboe ^ permalink raw reply related [flat|nested] 54+ messages in thread
* Re: Silent data corruption in blkdev_direct_IO() 2018-07-12 15:08 ` Jens Axboe @ 2018-07-12 16:11 ` Martin Wilck 2018-07-12 16:14 ` Hannes Reinecke 1 sibling, 0 replies; 54+ messages in thread From: Martin Wilck @ 2018-07-12 16:11 UTC (permalink / raw) To: Jens Axboe, Hannes Reinecke; +Cc: Christoph Hellwig, linux-block On Thu, 2018-07-12 at 09:08 -0600, Jens Axboe wrote: > On 7/12/18 8:36 AM, Hannes Reinecke wrote: > > Hi Jens, Christoph, > > > > we're currently hunting down a silent data corruption occurring due > > to > > commit 72ecad22d9f1 ("block: support a full bio worth of IO for > > simplified bdev direct-io"). > > > > While the whole thing is still hazy on the details, the one thing > > we've > > found is that reverting that patch fixes the data corruption. > > > > And looking closer, I've found this: > > > > static ssize_t > > blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter) > > { > > int nr_pages; > > > > nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES + 1); > > if (!nr_pages) > > return 0; > > if (is_sync_kiocb(iocb) && nr_pages <= BIO_MAX_PAGES) > > return __blkdev_direct_IO_simple(iocb, iter, nr_pages); > > > > return __blkdev_direct_IO(iocb, iter, min(nr_pages, > > BIO_MAX_PAGES)); > > } > > > > When checking the call path > > __blkdev_direct_IO()->bio_alloc_bioset()->bvec_alloc() > > I found that bvec_alloc() will fail if nr_pages > BIO_MAX_PAGES. > > > > So why is there the check for 'nr_pages <= BIO_MAX_PAGES' ? > > It's not that we can handle it in __blkdev_direct_IO() ... > > The logic could be cleaned up like below, the sync part is really all > we care about. What is the test case for this? async or sync? It's sync, and the corruption is in the __blkdev_direct_IO_simple() path. It starts to occur with your "block: support a full bio worth of IO for simplified bdev direct-io" patch, which causes the "simple" path to be taken for larger IO sizes. Martin > > I also don't remember why it's BIO_MAX_PAGES + 1... 
> > diff --git a/fs/block_dev.c b/fs/block_dev.c > index 0dd87aaeb39a..14ef3d71b55f 100644 > --- a/fs/block_dev.c > +++ b/fs/block_dev.c > @@ -424,13 +424,13 @@ blkdev_direct_IO(struct kiocb *iocb, struct > iov_iter *iter) > { > int nr_pages; > > - nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES + 1); > + nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES); > if (!nr_pages) > return 0; > - if (is_sync_kiocb(iocb) && nr_pages <= BIO_MAX_PAGES) > + if (is_sync_kiocb(iocb)) > return __blkdev_direct_IO_simple(iocb, iter, > nr_pages); > > - return __blkdev_direct_IO(iocb, iter, min(nr_pages, > BIO_MAX_PAGES)); > + return __blkdev_direct_IO(iocb, iter, nr_pages); > } > > static __init int blkdev_init(void) > -- Dr. Martin Wilck <mwilck@suse.com>, Tel. +49 (0)911 74053 2107 SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton HRB 21284 (AG Nürnberg) ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: Silent data corruption in blkdev_direct_IO() 2018-07-12 15:08 ` Jens Axboe 2018-07-12 16:11 ` Martin Wilck @ 2018-07-12 16:14 ` Hannes Reinecke 2018-07-12 16:20 ` Jens Axboe 1 sibling, 1 reply; 54+ messages in thread From: Hannes Reinecke @ 2018-07-12 16:14 UTC (permalink / raw) To: Jens Axboe; +Cc: Christoph Hellwig, linux-block, Martin Wilck On 07/12/2018 05:08 PM, Jens Axboe wrote: > On 7/12/18 8:36 AM, Hannes Reinecke wrote: >> Hi Jens, Christoph, >> >> we're currently hunting down a silent data corruption occurring due to >> commit 72ecad22d9f1 ("block: support a full bio worth of IO for >> simplified bdev direct-io"). >> >> While the whole thing is still hazy on the details, the one thing we've >> found is that reverting that patch fixes the data corruption. >> >> And looking closer, I've found this: >> >> static ssize_t >> blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter) >> { >> int nr_pages; >> >> nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES + 1); >> if (!nr_pages) >> return 0; >> if (is_sync_kiocb(iocb) && nr_pages <= BIO_MAX_PAGES) >> return __blkdev_direct_IO_simple(iocb, iter, nr_pages); >> >> return __blkdev_direct_IO(iocb, iter, min(nr_pages, BIO_MAX_PAGES)); >> } >> >> When checking the call path >> __blkdev_direct_IO()->bio_alloc_bioset()->bvec_alloc() >> I found that bvec_alloc() will fail if nr_pages > BIO_MAX_PAGES. >> >> So why is there the check for 'nr_pages <= BIO_MAX_PAGES' ? >> It's not that we can handle it in __blkdev_direct_IO() ... > > The logic could be cleaned up like below, the sync part is really all > we care about. What is the test case for this? async or sync? > > I also don't remember why it's BIO_MAX_PAGES + 1... 
> > diff --git a/fs/block_dev.c b/fs/block_dev.c > index 0dd87aaeb39a..14ef3d71b55f 100644 > --- a/fs/block_dev.c > +++ b/fs/block_dev.c > @@ -424,13 +424,13 @@ blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter) > { > int nr_pages; > > - nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES + 1); > + nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES); > if (!nr_pages) > return 0; > - if (is_sync_kiocb(iocb) && nr_pages <= BIO_MAX_PAGES) > + if (is_sync_kiocb(iocb)) > return __blkdev_direct_IO_simple(iocb, iter, nr_pages); > > - return __blkdev_direct_IO(iocb, iter, min(nr_pages, BIO_MAX_PAGES)); > + return __blkdev_direct_IO(iocb, iter, nr_pages); > } > > static __init int blkdev_init(void) > Hmm. We'll give it a go, but somehow I feel this won't solve our problem. Another question (which probably shows my complete ignorance): What happens if the iter holds >= BIO_MAX_PAGES? 'iov_iter_npages' will only ever return BIO_MAX_PAGES, independently of whether the iter contains more pages. What happens to those left-over pages? Will they ever be processed? Cheers, Hannes ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: Silent data corruption in blkdev_direct_IO() 2018-07-12 16:14 ` Hannes Reinecke @ 2018-07-12 16:20 ` Jens Axboe 2018-07-12 16:42 ` Jens Axboe 0 siblings, 1 reply; 54+ messages in thread From: Jens Axboe @ 2018-07-12 16:20 UTC (permalink / raw) To: Hannes Reinecke; +Cc: Christoph Hellwig, linux-block, Martin Wilck On 7/12/18 10:14 AM, Hannes Reinecke wrote: > On 07/12/2018 05:08 PM, Jens Axboe wrote: >> On 7/12/18 8:36 AM, Hannes Reinecke wrote: >>> Hi Jens, Christoph, >>> >>> we're currently hunting down a silent data corruption occurring due to >>> commit 72ecad22d9f1 ("block: support a full bio worth of IO for >>> simplified bdev direct-io"). >>> >>> While the whole thing is still hazy on the details, the one thing we've >>> found is that reverting that patch fixes the data corruption. >>> >>> And looking closer, I've found this: >>> >>> static ssize_t >>> blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter) >>> { >>> int nr_pages; >>> >>> nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES + 1); >>> if (!nr_pages) >>> return 0; >>> if (is_sync_kiocb(iocb) && nr_pages <= BIO_MAX_PAGES) >>> return __blkdev_direct_IO_simple(iocb, iter, nr_pages); >>> >>> return __blkdev_direct_IO(iocb, iter, min(nr_pages, BIO_MAX_PAGES)); >>> } >>> >>> When checking the call path >>> __blkdev_direct_IO()->bio_alloc_bioset()->bvec_alloc() >>> I found that bvec_alloc() will fail if nr_pages > BIO_MAX_PAGES. >>> >>> So why is there the check for 'nr_pages <= BIO_MAX_PAGES' ? >>> It's not that we can handle it in __blkdev_direct_IO() ... >> >> The logic could be cleaned up like below, the sync part is really all >> we care about. What is the test case for this? async or sync? >> >> I also don't remember why it's BIO_MAX_PAGES + 1... 
>> >> diff --git a/fs/block_dev.c b/fs/block_dev.c >> index 0dd87aaeb39a..14ef3d71b55f 100644 >> --- a/fs/block_dev.c >> +++ b/fs/block_dev.c >> @@ -424,13 +424,13 @@ blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter) >> { >> int nr_pages; >> >> - nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES + 1); >> + nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES); >> if (!nr_pages) >> return 0; >> - if (is_sync_kiocb(iocb) && nr_pages <= BIO_MAX_PAGES) >> + if (is_sync_kiocb(iocb)) >> return __blkdev_direct_IO_simple(iocb, iter, nr_pages); >> >> - return __blkdev_direct_IO(iocb, iter, min(nr_pages, BIO_MAX_PAGES)); >> + return __blkdev_direct_IO(iocb, iter, nr_pages); >> } >> >> static __init int blkdev_init(void) >> > Hmm. We'll give it a go, but somehow I feel this won't solve our problem. It probably won't, the only joker here is the BIO_MAX_PAGES + 1. But it does simplify that part... > Another question (which probably shows my complete ignorance): > What happens if the iter holds >= BIO_MAX_PAGES? > 'iov_iter_npages' will only ever return BIO_MAX_PAGES, independent on > whether the iter contains more pages. > What happens to those left-over pages? > Will they ever be processed? Short read or write, we rely on being called again for the remainder. -- Jens Axboe ^ permalink raw reply [flat|nested] 54+ messages in thread
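The retry contract Jens describes can be sketched as follows; submit_chunk() and submit_all() are hypothetical stand-ins for the single-bio submission and its caller, not kernel functions:

```c
#include <sys/types.h>
#include <stddef.h>

/* Hypothetical single-submission step: handles at most CHUNK bytes per
 * call, the way one bio handles at most BIO_MAX_PAGES pages. */
#define CHUNK 4096

ssize_t submit_chunk(size_t remaining)
{
    return remaining < CHUNK ? (ssize_t)remaining : CHUNK;
}

/* The upper layer treats a short result as "call me again": it loops
 * until the whole request has been consumed. */
ssize_t submit_all(size_t len)
{
    size_t done = 0;

    while (done < len) {
        ssize_t ret = submit_chunk(len - done);
        if (ret <= 0)
            break;
        done += (size_t)ret;
    }
    return (ssize_t)done;
}
```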
* Re: Silent data corruption in blkdev_direct_IO() 2018-07-12 16:20 ` Jens Axboe @ 2018-07-12 16:42 ` Jens Axboe 2018-07-13 6:47 ` Martin Wilck 2018-07-13 16:56 ` Martin Wilck 0 siblings, 2 replies; 54+ messages in thread From: Jens Axboe @ 2018-07-12 16:42 UTC (permalink / raw) To: Hannes Reinecke; +Cc: Christoph Hellwig, linux-block, Martin Wilck On 7/12/18 10:20 AM, Jens Axboe wrote: > On 7/12/18 10:14 AM, Hannes Reinecke wrote: >> On 07/12/2018 05:08 PM, Jens Axboe wrote: >>> On 7/12/18 8:36 AM, Hannes Reinecke wrote: >>>> Hi Jens, Christoph, >>>> >>>> we're currently hunting down a silent data corruption occurring due to >>>> commit 72ecad22d9f1 ("block: support a full bio worth of IO for >>>> simplified bdev direct-io"). >>>> >>>> While the whole thing is still hazy on the details, the one thing we've >>>> found is that reverting that patch fixes the data corruption. >>>> >>>> And looking closer, I've found this: >>>> >>>> static ssize_t >>>> blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter) >>>> { >>>> int nr_pages; >>>> >>>> nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES + 1); >>>> if (!nr_pages) >>>> return 0; >>>> if (is_sync_kiocb(iocb) && nr_pages <= BIO_MAX_PAGES) >>>> return __blkdev_direct_IO_simple(iocb, iter, nr_pages); >>>> >>>> return __blkdev_direct_IO(iocb, iter, min(nr_pages, BIO_MAX_PAGES)); >>>> } >>>> >>>> When checking the call path >>>> __blkdev_direct_IO()->bio_alloc_bioset()->bvec_alloc() >>>> I found that bvec_alloc() will fail if nr_pages > BIO_MAX_PAGES. >>>> >>>> So why is there the check for 'nr_pages <= BIO_MAX_PAGES' ? >>>> It's not that we can handle it in __blkdev_direct_IO() ... >>> >>> The logic could be cleaned up like below, the sync part is really all >>> we care about. What is the test case for this? async or sync? >>> >>> I also don't remember why it's BIO_MAX_PAGES + 1... 
>>> >>> diff --git a/fs/block_dev.c b/fs/block_dev.c >>> index 0dd87aaeb39a..14ef3d71b55f 100644 >>> --- a/fs/block_dev.c >>> +++ b/fs/block_dev.c >>> @@ -424,13 +424,13 @@ blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter) >>> { >>> int nr_pages; >>> >>> - nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES + 1); >>> + nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES); >>> if (!nr_pages) >>> return 0; >>> - if (is_sync_kiocb(iocb) && nr_pages <= BIO_MAX_PAGES) >>> + if (is_sync_kiocb(iocb)) >>> return __blkdev_direct_IO_simple(iocb, iter, nr_pages); >>> >>> - return __blkdev_direct_IO(iocb, iter, min(nr_pages, BIO_MAX_PAGES)); >>> + return __blkdev_direct_IO(iocb, iter, nr_pages); >>> } >>> >>> static __init int blkdev_init(void) >>> >> Hmm. We'll give it a go, but somehow I feel this won't solve our problem. > > It probably won't, the only joker here is the BIO_MAX_PAGES + 1. But it > does simplify that part... OK, now I remember. The +1 is just to check if there are actually more pages. __blkdev_direct_IO_simple() only does one bio, so it has to fit within that one bio. __blkdev_direct_IO() will loop just fine and will finish any size, BIO_MAX_PAGES at the time. Hence the patch I sent is wrong, the code actually looks fine. Which means we're back to trying to figure out what is going on here. It'd be great with a test case... -- Jens Axboe ^ permalink raw reply [flat|nested] 54+ messages in thread
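The role of the extra page can be illustrated like this; cap_npages() is a stand-in for the capping that iov_iter_npages() performs, an illustration rather than kernel code:

```c
#define BIO_MAX_PAGES 256

/* Stand-in for iov_iter_npages(iter, maxpages), which never reports
 * more than maxpages. */
int cap_npages(int actual_pages, int maxpages)
{
    return actual_pages < maxpages ? actual_pages : maxpages;
}

/* Probing with BIO_MAX_PAGES + 1 distinguishes "fits in one bio" from
 * "needs more than one bio". Probing with BIO_MAX_PAGES could not,
 * since both cases would report exactly BIO_MAX_PAGES. */
int fits_in_one_bio(int actual_pages)
{
    return cap_npages(actual_pages, BIO_MAX_PAGES + 1) <= BIO_MAX_PAGES;
}
```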
* Re: Silent data corruption in blkdev_direct_IO() 2018-07-12 16:42 ` Jens Axboe @ 2018-07-13 6:47 ` Martin Wilck 2018-07-13 16:56 ` Martin Wilck 1 sibling, 0 replies; 54+ messages in thread From: Martin Wilck @ 2018-07-13 6:47 UTC (permalink / raw) To: Jens Axboe, Hannes Reinecke; +Cc: Christoph Hellwig, linux-block On Thu, 2018-07-12 at 10:42 -0600, Jens Axboe wrote: > On 7/12/18 10:20 AM, Jens Axboe wrote: > > On 7/12/18 10:14 AM, Hannes Reinecke wrote: > > > On 07/12/2018 05:08 PM, Jens Axboe wrote: > > > > On 7/12/18 8:36 AM, Hannes Reinecke wrote: > > > > > Hi Jens, Christoph, > > > > > > > > > > we're currently hunting down a silent data corruption > > > > > occurring due to > > > > > commit 72ecad22d9f1 ("block: support a full bio worth of IO > > > > > for > > > > > simplified bdev direct-io"). > > > > > > > > > > While the whole thing is still hazy on the details, the one > > > > > thing we've > > > > > found is that reverting that patch fixes the data corruption. > > > > > > > > > > And looking closer, I've found this: > > > > > > > > > > static ssize_t > > > > > blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter) > > > > > { > > > > > int nr_pages; > > > > > > > > > > nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES + 1); > > > > > if (!nr_pages) > > > > > return 0; > > > > > if (is_sync_kiocb(iocb) && nr_pages <= BIO_MAX_PAGES) > > > > > return __blkdev_direct_IO_simple(iocb, iter, > > > > > nr_pages); > > > > > > > > > > return __blkdev_direct_IO(iocb, iter, min(nr_pages, > > > > > BIO_MAX_PAGES)); > > > > > } > > > > > > > > > > When checking the call path > > > > > __blkdev_direct_IO()->bio_alloc_bioset()->bvec_alloc() > > > > > I found that bvec_alloc() will fail if nr_pages > > > > > > BIO_MAX_PAGES. > > > > > > > > > > So why is there the check for 'nr_pages <= BIO_MAX_PAGES' ? > > > > > It's not that we can handle it in __blkdev_direct_IO() ... 
> > > > > > > > The logic could be cleaned up like below, the sync part is > > > > really all > > > > we care about. What is the test case for this? async or sync? > > > > > > > > I also don't remember why it's BIO_MAX_PAGES + 1... > > > > > > > > diff --git a/fs/block_dev.c b/fs/block_dev.c > > > > index 0dd87aaeb39a..14ef3d71b55f 100644 > > > > --- a/fs/block_dev.c > > > > +++ b/fs/block_dev.c > > > > @@ -424,13 +424,13 @@ blkdev_direct_IO(struct kiocb *iocb, > > > > struct iov_iter *iter) > > > > { > > > > int nr_pages; > > > > > > > > - nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES + 1); > > > > + nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES); > > > > if (!nr_pages) > > > > return 0; > > > > - if (is_sync_kiocb(iocb) && nr_pages <= BIO_MAX_PAGES) > > > > + if (is_sync_kiocb(iocb)) > > > > return __blkdev_direct_IO_simple(iocb, iter, > > > > nr_pages); > > > > > > > > - return __blkdev_direct_IO(iocb, iter, min(nr_pages, > > > > BIO_MAX_PAGES)); > > > > + return __blkdev_direct_IO(iocb, iter, nr_pages); > > > > } > > > > > > > > static __init int blkdev_init(void) > > > > > > > > > > Hmm. We'll give it a go, but somehow I feel this won't solve our > > > problem. > > > > It probably won't, the only joker here is the BIO_MAX_PAGES + 1. > > But it > > does simplify that part... > > OK, now I remember. The +1 is just to check if there are actually > more > pages. __blkdev_direct_IO_simple() only does one bio, so it has to > fit > within that one bio. __blkdev_direct_IO() will loop just fine and > will finish any size, BIO_MAX_PAGES at the time. Right. Hannes, I think we (at least myself) have been irritated by looking at outdated code. The key point which I missed is that __blkdev_direct_IO() is called with min(nr_pages, BIO_MAX_PAGES), and advances beyond that later in the loop. > Hence the patch I sent is wrong, the code actually looks fine. Which > means we're back to trying to figure out what is going on here. It'd > be great with a test case... 
Unfortunately we have no reproducer just yet. Only the customer can reproduce it. The scenario is a data base running on a KVM virtual machine on top of a virtio-scsi volume backed by a multipath map, with cache='none' in qemu. My late thinking is that if, for whatever reason I don't yet understand, blkdev_direct_IO() resulted in a short write, __generic_file_write_iter() would fall back to buffered writing, which might be a possible explanation for the data corruption we observe. That's just speculation at the current stage. Regards Martin -- Dr. Martin Wilck <mwilck@suse.com>, Tel. +49 (0)911 74053 2107 SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton HRB 21284 (AG Nürnberg) ^ permalink raw reply [flat|nested] 54+ messages in thread
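Martin's speculation above can be modeled in a few lines; buffered_fallback_bytes() is a made-up helper sketching the short-write handling in __generic_file_write_iter(), not actual kernel code:

```c
#include <sys/types.h>

/* If the direct path returns fewer bytes than requested, the generic
 * write path retries the remainder through the page cache. For a
 * device opened with O_DIRECT this silently mixes direct and buffered
 * writes. This helper returns how many bytes would take the fallback. */
ssize_t buffered_fallback_bytes(ssize_t dio_ret, ssize_t requested)
{
    if (dio_ret < 0 || dio_ret >= requested)
        return 0;               /* error, or fully written directly */
    return requested - dio_ret; /* remainder goes via buffered IO */
}
```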
* Re: Silent data corruption in blkdev_direct_IO() 2018-07-12 16:42 ` Jens Axboe 2018-07-13 6:47 ` Martin Wilck @ 2018-07-13 16:56 ` Martin Wilck 2018-07-13 18:00 ` Jens Axboe 1 sibling, 1 reply; 54+ messages in thread From: Martin Wilck @ 2018-07-13 16:56 UTC (permalink / raw) To: Jens Axboe, Hannes Reinecke; +Cc: Christoph Hellwig, linux-block On Thu, 2018-07-12 at 10:42 -0600, Jens Axboe wrote: > > Hence the patch I sent is wrong, the code actually looks fine. Which > means we're back to trying to figure out what is going on here. It'd > be great with a test case... We don't have an easy test case yet. But the customer has confirmed that the problem occurs with upstream 4.17.5, too. We also confirmed again that the problem occurs when the kernel uses the kmalloc() code path in __blkdev_direct_IO_simple(). My personal suggestion would be to ditch __blkdev_direct_IO_simple() altogether. After all, it's not _that_ much simpler than __blkdev_direct_IO(), and it seems to be broken in a subtle way. However, so far I've only identified a minor problem, see below - it doesn't explain the data corruption we're seeing. Martin -- Dr. Martin Wilck <mwilck@suse.com>, Tel. +49 (0)911 74053 2107 SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton HRB 21284 (AG Nürnberg) From d0c74ef1fc73c03983950c496f7490da8aa56671 Mon Sep 17 00:00:00 2001 From: Martin Wilck <mwilck@suse.com> Date: Fri, 13 Jul 2018 18:38:44 +0200 Subject: [PATCH] fs: fix error exit in __blkdev_direct_IO_simple Cleanup code was missing in the error return path. 
--- fs/block_dev.c | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/fs/block_dev.c b/fs/block_dev.c index 7ec920e..b82b516 100644 --- a/fs/block_dev.c +++ b/fs/block_dev.c @@ -218,8 +218,12 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter, bio.bi_end_io = blkdev_bio_end_io_simple; ret = bio_iov_iter_get_pages(&bio, iter); - if (unlikely(ret)) + if (unlikely(ret)) { + if (vecs != inline_vecs) + kfree(vecs); + bio_uninit(&bio); return ret; + } ret = bio.bi_iter.bi_size; if (iov_iter_rw(iter) == READ) { -- 2.17.1 ^ permalink raw reply related [flat|nested] 54+ messages in thread
* Re: Silent data corruption in blkdev_direct_IO() 2018-07-13 16:56 ` Martin Wilck @ 2018-07-13 18:00 ` Jens Axboe 2018-07-13 18:50 ` Jens Axboe 2018-07-13 20:48 ` Martin Wilck 0 siblings, 2 replies; 54+ messages in thread From: Jens Axboe @ 2018-07-13 18:00 UTC (permalink / raw) To: Martin Wilck, Hannes Reinecke; +Cc: Christoph Hellwig, linux-block On 7/13/18 10:56 AM, Martin Wilck wrote: > On Thu, 2018-07-12 at 10:42 -0600, Jens Axboe wrote: >> >> Hence the patch I sent is wrong, the code actually looks fine. Which >> means we're back to trying to figure out what is going on here. It'd >> be great with a test case... > > We don't have an easy test case yet. But the customer has confirmed > that the problem occurs with upstream 4.17.5, too. We also confirmed > again that the problem occurs when the kernel uses the kmalloc() code > path in __blkdev_direct_IO_simple(). > > My personal suggestion would be to ditch __blkdev_direct_IO_simple() > altogether. After all, it's not _that_ much simpler thatn > __blkdev_direct_IO(), and it seems to be broken in a subtle way. That's not a great suggestion at all, we need to find out why we're hitting the issue. For all you know, the bug could be elsewhere and we're just going to be hitting it differently some other way. The head-in-the-sand approach is rarely a win long term. It's saving an allocation per IO, that's definitely measurable on the faster storage. For reads, it's also not causing a context switch for dirtying pages. I'm not a huge fan of multiple cases in general, but this one is definitely warranted in an era where 1 usec is a lot of extra time for an IO. > However, so far I've only identified a minor problem, see below - it > doesn't explain the data corruption we're seeing. What would help is trying to boil down a test case. So far it's a lot of hand waving, and nothing that can really help narrow down what is going on here. -- Jens Axboe ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: Silent data corruption in blkdev_direct_IO() 2018-07-13 18:00 ` Jens Axboe @ 2018-07-13 18:50 ` Jens Axboe 2018-07-13 22:21 ` Martin Wilck 2018-07-13 20:48 ` Martin Wilck 1 sibling, 1 reply; 54+ messages in thread From: Jens Axboe @ 2018-07-13 18:50 UTC (permalink / raw) To: Martin Wilck, Hannes Reinecke; +Cc: Christoph Hellwig, linux-block On 7/13/18 12:00 PM, Jens Axboe wrote: > On 7/13/18 10:56 AM, Martin Wilck wrote: >> On Thu, 2018-07-12 at 10:42 -0600, Jens Axboe wrote: >>> >>> Hence the patch I sent is wrong, the code actually looks fine. Which >>> means we're back to trying to figure out what is going on here. It'd >>> be great with a test case... >> >> We don't have an easy test case yet. But the customer has confirmed >> that the problem occurs with upstream 4.17.5, too. We also confirmed >> again that the problem occurs when the kernel uses the kmalloc() code >> path in __blkdev_direct_IO_simple(). >> >> My personal suggestion would be to ditch __blkdev_direct_IO_simple() >> altogether. After all, it's not _that_ much simpler thatn >> __blkdev_direct_IO(), and it seems to be broken in a subtle way. > > That's not a great suggestion at all, we need to find out why we're > hitting the issue. For all you know, the bug could be elsewhere and > we're just going to be hitting it differently some other way. The > head-in-the-sand approach is rarely a win long term. > > It's saving an allocation per IO, that's definitely measurable on > the faster storage. For reads, it's also not causing a context > switch for dirtying pages. I'm not a huge fan of multiple cases > in general, but this one is definitely warranted in an era where > 1 usec is a lot of extra time for an IO. > >> However, so far I've only identified a minor problem, see below - it >> doesn't explain the data corruption we're seeing. > > What would help is trying to boil down a test case. So far it's a lot > of hand waving, and nothing that can really help narrow down what is > going on here. 
When someone reports to you that some kind of data corruption occurs, you need to find out as many details you can. Ideally they can give you a test case, but if they can't, you ask as many questions as possible to help YOU make a test case. - What is the nature of the corruption? Is it reads or writes? Is it zeroes, random data, or data from somewhere else? How big is the corrupted area? - What does the workload look like that reproduces it? If they don't really know and they can't give you a reproducer, you help them with tracing to give you the information you need. Those are a great start. Right now we have zero information, other than reverting that patch fixes it. This means that we're likely dealing with IO that is larger than 16k, since we're going to be hitting the same path for <= 16k ios with the patch reverted. We also know it's less than 1MB. But that's it. I don't even know if it's writes, since your (and Hannes's) report has no details at all. Go look at what you ask customers to provide for bug reports, then apply some of that to this case... -- Jens Axboe ^ permalink raw reply [flat|nested] 54+ messages in thread
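The 16k and 1MB bounds Jens derives follow from the constants in fs/block_dev.c and include/linux/bio.h at the time, assuming 4 KiB pages:

```c
#define PAGE_SIZE_ASSUMED 4096   /* assuming 4 KiB pages */
#define DIO_INLINE_BIO_VECS 4    /* inline biovecs in fs/block_dev.c */
#define BIO_MAX_PAGES 256        /* from include/linux/bio.h */

/* Largest IO the simple path served before commit 72ecad22d9f1 (and
 * still serves without kmalloc() after it): 16 KiB. */
enum { INLINE_LIMIT = DIO_INLINE_BIO_VECS * PAGE_SIZE_ASSUMED };

/* Largest IO that fits in a single bio -- the new upper bound for the
 * simple path after that commit: 1 MiB. */
enum { ONE_BIO_LIMIT = BIO_MAX_PAGES * PAGE_SIZE_ASSUMED };
```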
* Re: Silent data corruption in blkdev_direct_IO() 2018-07-13 18:50 ` Jens Axboe @ 2018-07-13 22:21 ` Martin Wilck 0 siblings, 0 replies; 54+ messages in thread From: Martin Wilck @ 2018-07-13 22:21 UTC (permalink / raw) To: Jens Axboe, Hannes Reinecke; +Cc: Christoph Hellwig, linux-block On Fri, 2018-07-13 at 12:50 -0600, Jens Axboe wrote: > On 7/13/18 12:00 PM, Jens Axboe wrote: > > On 7/13/18 10:56 AM, Martin Wilck wrote: > > > On Thu, 2018-07-12 at 10:42 -0600, Jens Axboe wrote: > > > > > > > > Hence the patch I sent is wrong, the code actually looks fine. > > > > Which > > > > means we're back to trying to figure out what is going on here. > > > > It'd > > > > be great with a test case... > > > > > > We don't have an easy test case yet. But the customer has > > > confirmed > > > that the problem occurs with upstream 4.17.5, too. We also > > > confirmed > > > again that the problem occurs when the kernel uses the kmalloc() > > > code > > > path in __blkdev_direct_IO_simple(). > > > > > > My personal suggestion would be to ditch > > > __blkdev_direct_IO_simple() > > > altogether. After all, it's not _that_ much simpler thatn > > > __blkdev_direct_IO(), and it seems to be broken in a subtle way. > > > > That's not a great suggestion at all, we need to find out why we're > > hitting the issue. For all you know, the bug could be elsewhere and > > we're just going to be hitting it differently some other way. The > > head-in-the-sand approach is rarely a win long term. > > > > It's saving an allocation per IO, that's definitely measurable on > > the faster storage. For reads, it's also not causing a context > > switch for dirtying pages. I'm not a huge fan of multiple cases > > in general, but this one is definitely warranted in an era where > > 1 usec is a lot of extra time for an IO. > > > > > However, so far I've only identified a minor problem, see below - > > > it > > > doesn't explain the data corruption we're seeing. 
> > > > What would help is trying to boil down a test case. So far it's a > > lot > > of hand waving, and nothing that can really help narrow down what > > is > > going on here. > > When someone reports to you that some kind of data corruption occurs, > you need to find out as many details you can. Ideally they can give > you > a test case, but if they can't, you ask as many questions as possible > to > help YOU make a test case. > > - What is the nature of the corruption? Is it reads or writes? Is it > zeroes, random data, or data from somewhere else? How big is the > corrupted area? > > - What does the workload look like that reproduces it? If they don't > really know and they can't give you a reproducer, you help them > with > tracing to give you the information you need. > > Those are a great start. Right now we have zero information, other > than > reverting that patch fixes it. This means that we're likely dealing > with > IO that is larger than 16k, since we're going to be hitting the same > path for <= 16k ios with the patch reverted. We also know it's less > than > 1MB. The nature of the corruption is an interesting question. The weird thing is that it's not block-aligned - "bad" data starts at some apparently random offset into the log file. That actually made me think that we're not looking at a block IO issue but maybe some corruption that occurred in memory before write-out. But then we isolated this specific patch by bisection... I had my doubts, but it seems to be sound, at least the test results indicate so. The user application is a database which writes logs in blocks that are multiples of 8k. The IO goes through various layers in the VM (ext3 on LVM on MD on virtio), then through qemu via virtio-scsi, and onto a host multipath device which is backed by FC storage. The VM is running an old SLES 11 distro; the problems were caused by a host kernel update. My attempts to reproduce a corruption with lab equipment have failed up to now. 
I've been trying to replay the customer's workload with btreplay. I also tried to replay it with fio using verification to try to find corruptions, but I guess I didn't get the setup quite right. > But that's it. I don't even know if it's writes, since your (and > Hannes's) report has no details at all. > Go look at what you ask customers to provide for bug reports, then > apply > some of that to this case... Sorry. It's writes. The reason we posted this here was a) the concrete questions Hannes asked about the code, b) the patch we'd isolated - we thought you or Christoph might just have a bright idea about it, or see what we didn't, and c) that we thought the problem might affect upstream, too. We don't expect you to do the dirty work for us. We'll come back to the list when we know more, or have specific questions. Martin -- Dr. Martin Wilck <mwilck@suse.com>, Tel. +49 (0)911 74053 2107 SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton HRB 21284 (AG Nürnberg) ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: Silent data corruption in blkdev_direct_IO() 2018-07-13 18:00 ` Jens Axboe 2018-07-13 18:50 ` Jens Axboe @ 2018-07-13 20:48 ` Martin Wilck 2018-07-13 20:52 ` Jens Axboe 1 sibling, 1 reply; 54+ messages in thread From: Martin Wilck @ 2018-07-13 20:48 UTC (permalink / raw) To: Jens Axboe, Hannes Reinecke; +Cc: Christoph Hellwig, linux-block On Fri, 2018-07-13 at 12:00 -0600, Jens Axboe wrote: > On 7/13/18 10:56 AM, Martin Wilck wrote: > > On Thu, 2018-07-12 at 10:42 -0600, Jens Axboe wrote: > > > > > > Hence the patch I sent is wrong, the code actually looks fine. > > > Which > > > means we're back to trying to figure out what is going on here. > > > It'd > > > be great with a test case... > > > > We don't have an easy test case yet. But the customer has confirmed > > that the problem occurs with upstream 4.17.5, too. We also > > confirmed > > again that the problem occurs when the kernel uses the kmalloc() > > code > > path in __blkdev_direct_IO_simple(). > > > > My personal suggestion would be to ditch > > __blkdev_direct_IO_simple() > > altogether. After all, it's not _that_ much simpler than > > __blkdev_direct_IO(), and it seems to be broken in a subtle way. > > That's not a great suggestion at all, we need to find out why we're > hitting the issue. We're trying. > For all you know, the bug could be elsewhere and > we're just going to be hitting it differently some other way. The > head-in-the-sand approach is rarely a win long term. > > It's saving an allocation per IO, that's definitely measurable on > the faster storage. I can see that for the inline path, but is there still an advantage if we need to kmalloc() the biovec? > For reads, it's also not causing a context > switch for dirtying pages. I'm not a huge fan of multiple cases > in general, but this one is definitely warranted in an era where > 1 usec is a lot of extra time for an IO. Ok, thanks for pointing that out. 
> > > However, so far I've only identified a minor problem, see below - > > it > > doesn't explain the data corruption we're seeing. > > What would help is trying to boil down a test case. So far it's a lot > of hand waving, and nothing that can really help narrow down what is > going on here. It's not that we didn't try. We've run fio with verification on block devices with varying io sizes, block sizes, and alignments, but so far we haven't hit the issue. We've also tried to reproduce it by approximating the customer's VM setup, with no success up to now. However, we're now much closer than we used to be, so I'm confident that we'll be able to present more concrete facts soon. Martin -- Dr. Martin Wilck <mwilck@suse.com>, Tel. +49 (0)911 74053 2107 SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton HRB 21284 (AG Nürnberg) ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: Silent data corruption in blkdev_direct_IO() 2018-07-13 20:48 ` Martin Wilck @ 2018-07-13 20:52 ` Jens Axboe 2018-07-16 19:05 ` Martin Wilck 0 siblings, 1 reply; 54+ messages in thread From: Jens Axboe @ 2018-07-13 20:52 UTC (permalink / raw) To: Martin Wilck, Hannes Reinecke; +Cc: Christoph Hellwig, linux-block On 7/13/18 2:48 PM, Martin Wilck wrote: >> For all you know, the bug could be elsewhere and >> we're just going to be hitting it differently some other way. The >> head-in-the-sand approach is rarely a win long term. >> >> It's saving an allocation per IO, that's definitely measurable on >> the faster storage. > > I can see that for the inline path, but is there still an advantage if > we need to kmalloc() the biovec? It's still one allocation instead of two. In the era of competing with spdk on single digit latency devices, the answer is yes. >>> However, so far I've only identified a minor problem, see below - >>> it >>> doesn't explain the data corruption we're seeing. >> >> What would help is trying to boil down a test case. So far it's a lot >> of hand waving, and nothing that can really help narrow down what is >> going on here. > > It's not that we didn't try. We've run fio with verification on block > devices with varying io sizes, block sizes, and alignments, but so far > we haven't hit the issue. We've also tried to reproduce it by > approximating the customer's VM setup, with no success up to now. I ran some testing yesterday as well, but didn't trigger anything. Didn't expect to either, as all the basic functionality was verified when the patch was done. It's not really a path with a lot of corner cases, so it's really weird that we're seeing anything at all. Which is why I'm suspecting it's something else entirely, but it's really hard to guesstimate on that with no clues at all. > However, we're now much closer than we used to be, so I'm confident > that we'll be able to present more concrete facts soon. OK, sounds good. 
-- Jens Axboe ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: Silent data corruption in blkdev_direct_IO() 2018-07-13 20:52 ` Jens Axboe @ 2018-07-16 19:05 ` Martin Wilck 0 siblings, 0 replies; 54+ messages in thread From: Martin Wilck @ 2018-07-16 19:05 UTC (permalink / raw) To: Jens Axboe, Hannes Reinecke, Jan Kara; +Cc: Christoph Hellwig, linux-block On Fri, 2018-07-13 at 14:52 -0600, Jens Axboe wrote: > On 7/13/18 2:48 PM, Martin Wilck wrote: > > > > > > > However, so far I've only identified a minor problem, see below > > > > - > > > > it > > > > doesn't explain the data corruption we're seeing. > > > > > > What would help is trying to boil down a test case. So far it's a > > > lot > > > of hand waving, and nothing that can really help narrow down what > > > is > > > going on here. > > > > It's not that we didn't try. We've run fio with verification on > > block > > devices with varying io sizes, block sizes, and alignments, but so > > far > > we haven't hit the issue. We've also tried to reproduce it by > > approximating the customer's VM setup, with no success up to now. > > I ran some testing yesterday as well, but didn't trigger anything. > Didn't expect to either, as all the basic functionality was verified > when the patch was done. It's not really a path with a lot of corner > cases, so it's really weird that we're seeing anything at all. Which > is > why I'm suspecting it's something else entirely, but it's really hard > to > guesstimate on that with no clues at all. > > > However, we're now much closer than we used to be, so I'm confident > > that we'll be able to present more concrete facts soon. > > OK, sounds good. Jan Kara has provided a very convincing analysis and a patch which we are going to have the customer test. By calling bio_iov_iter_get_pages() only once, __blkdev_direct_IO_simple() may not transfer all requested bytes, because bio_iov_iter_get_pages() doesn't necessarily exhaust all data in the iov_iter. 
Thus a short write may occur, and __generic_file_write_iter() falls back to buffered IO. We've actually observed these "short direct writes" in the error case with an instrumented kernel (in a trace I got, ~13000/800000 direct write ops on a block device transferred less data than requested). We suspect that this concurrency of direct and buffered writes may cause the corruption the customer observes. Does that make sense to you? Regards, Martin -- Dr. Martin Wilck <mwilck@suse.com>, Tel. +49 (0)911 74053 2107 SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton HRB 21284 (AG Nürnberg) ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: Silent data corruption in blkdev_direct_IO() 2018-07-12 14:36 Silent data corruption in blkdev_direct_IO() Hannes Reinecke 2018-07-12 15:08 ` Jens Axboe @ 2018-07-12 23:29 ` Ming Lei 2018-07-13 18:54 ` Jens Axboe 1 sibling, 1 reply; 54+ messages in thread From: Ming Lei @ 2018-07-12 23:29 UTC (permalink / raw) To: Hannes Reinecke; +Cc: Jens Axboe, Christoph Hellwig, linux-block, Martin Wilck On Thu, Jul 12, 2018 at 10:36 PM, Hannes Reinecke <hare@suse.de> wrote: > Hi Jens, Christoph, > > we're currently hunting down a silent data corruption occurring due to > commit 72ecad22d9f1 ("block: support a full bio worth of IO for > simplified bdev direct-io"). > > While the whole thing is still hazy on the details, the one thing we've > found is that reverting that patch fixes the data corruption. > > And looking closer, I've found this: > > static ssize_t > blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter) > { > int nr_pages; > > nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES + 1); > if (!nr_pages) > return 0; > if (is_sync_kiocb(iocb) && nr_pages <= BIO_MAX_PAGES) > return __blkdev_direct_IO_simple(iocb, iter, nr_pages); > > return __blkdev_direct_IO(iocb, iter, min(nr_pages, BIO_MAX_PAGES)); > } > > When checking the call path > __blkdev_direct_IO()->bio_alloc_bioset()->bvec_alloc() > I found that bvec_alloc() will fail if nr_pages > BIO_MAX_PAGES. > > So why is there the check for 'nr_pages <= BIO_MAX_PAGES' ? > It's not that we can handle it in __blkdev_direct_IO() ... > > Thanks for any clarification. Maybe you can try the following patch from Christoph to see if it makes a difference: https://marc.info/?l=linux-kernel&m=153013977816825&w=2 thanks, Ming Lei ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: Silent data corruption in blkdev_direct_IO() 2018-07-12 23:29 ` Ming Lei @ 2018-07-13 18:54 ` Jens Axboe 2018-07-13 22:29 ` Martin Wilck 0 siblings, 1 reply; 54+ messages in thread From: Jens Axboe @ 2018-07-13 18:54 UTC (permalink / raw) To: Ming Lei, Hannes Reinecke; +Cc: Christoph Hellwig, linux-block, Martin Wilck On 7/12/18 5:29 PM, Ming Lei wrote: > On Thu, Jul 12, 2018 at 10:36 PM, Hannes Reinecke <hare@suse.de> wrote: >> Hi Jens, Christoph, >> >> we're currently hunting down a silent data corruption occurring due to >> commit 72ecad22d9f1 ("block: support a full bio worth of IO for >> simplified bdev direct-io"). >> >> While the whole thing is still hazy on the details, the one thing we've >> found is that reverting that patch fixes the data corruption. >> >> And looking closer, I've found this: >> >> static ssize_t >> blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter) >> { >> int nr_pages; >> >> nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES + 1); >> if (!nr_pages) >> return 0; >> if (is_sync_kiocb(iocb) && nr_pages <= BIO_MAX_PAGES) >> return __blkdev_direct_IO_simple(iocb, iter, nr_pages); >> >> return __blkdev_direct_IO(iocb, iter, min(nr_pages, BIO_MAX_PAGES)); >> } >> >> When checking the call path >> __blkdev_direct_IO()->bio_alloc_bioset()->bvec_alloc() >> I found that bvec_alloc() will fail if nr_pages > BIO_MAX_PAGES. >> >> So why is there the check for 'nr_pages <= BIO_MAX_PAGES' ? >> It's not that we can handle it in __blkdev_direct_IO() ... >> >> Thanks for any clarification. > > Maybe you can try the following patch from Christoph to see if it makes a > difference: > > https://marc.info/?l=linux-kernel&m=153013977816825&w=2 That's not a bad idea. -- Jens Axboe ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: Silent data corruption in blkdev_direct_IO() 2018-07-13 18:54 ` Jens Axboe @ 2018-07-13 22:29 ` Martin Wilck 2018-07-16 11:45 ` Ming Lei 0 siblings, 1 reply; 54+ messages in thread From: Martin Wilck @ 2018-07-13 22:29 UTC (permalink / raw) To: Jens Axboe, Ming Lei, Hannes Reinecke; +Cc: Christoph Hellwig, linux-block Hi Ming & Jens, On Fri, 2018-07-13 at 12:54 -0600, Jens Axboe wrote: > On 7/12/18 5:29 PM, Ming Lei wrote: > > > > Maybe you can try the following patch from Christoph to see if it > > makes a > > difference: > > > > https://marc.info/?l=linux-kernel&m=153013977816825&w=2 > > That's not a bad idea. Are you saying that the previous "nasty hack" in bio_iov_iter_get_pages() was broken, and the new one is not? I've scratched my head over that (old) code lately, but I couldn't spot an actual error in it. Unfortunately we can only do one test at a time. We're currently trying to find out more about the IO sizes at which the problem occurs. As you noted, currently we know no more than that it happens somewhere between 16k and 1M. Martin -- Dr. Martin Wilck <mwilck@suse.com>, Tel. +49 (0)911 74053 2107 SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton HRB 21284 (AG Nürnberg) ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: Silent data corruption in blkdev_direct_IO() 2018-07-13 22:29 ` Martin Wilck @ 2018-07-16 11:45 ` Ming Lei 2018-07-18 0:07 ` Martin Wilck 0 siblings, 1 reply; 54+ messages in thread From: Ming Lei @ 2018-07-16 11:45 UTC (permalink / raw) To: Martin Wilck Cc: Jens Axboe, Hannes Reinecke, Christoph Hellwig, linux-block, Ming Lei On Sat, Jul 14, 2018 at 6:29 AM, Martin Wilck <mwilck@suse.com> wrote: > Hi Ming & Jens, > > On Fri, 2018-07-13 at 12:54 -0600, Jens Axboe wrote: >> On 7/12/18 5:29 PM, Ming Lei wrote: >> > >> > Maybe you can try the following patch from Christoph to see if it >> > makes a >> > difference: >> > >> > https://marc.info/?l=linux-kernel&m=153013977816825&w=2 >> >> That's not a bad idea. > > Are you saying that the previous "nasty hack" in > bio_iov_iter_get_pages() was broken, and the new one is not? > I've scratched my head over that (old) code lately, but I couldn't spot > an actual error in it. Yeah, I think the new patch in the link above is better, and its correctness looks easy to prove. https://marc.info/?t=152812081400001&r=1&w=2 So please test the patch if possible. Thanks, Ming Lei ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: Silent data corruption in blkdev_direct_IO() 2018-07-16 11:45 ` Ming Lei @ 2018-07-18 0:07 ` Martin Wilck 2018-07-18 2:48 ` Ming Lei 0 siblings, 1 reply; 54+ messages in thread From: Martin Wilck @ 2018-07-18 0:07 UTC (permalink / raw) To: Ming Lei Cc: Jens Axboe, Hannes Reinecke, Christoph Hellwig, linux-block, Ming Lei, jack, kent.overstreet On Mon, 2018-07-16 at 19:45 +0800, Ming Lei wrote: > On Sat, Jul 14, 2018 at 6:29 AM, Martin Wilck <mwilck@suse.com> > wrote: > > Hi Ming & Jens, > > > > On Fri, 2018-07-13 at 12:54 -0600, Jens Axboe wrote: > > > On 7/12/18 5:29 PM, Ming Lei wrote: > > > > > > > > Maybe you can try the following patch from Christoph to see if > > > > it > > > > makes a > > > > difference: > > > > > > > > https://marc.info/?l=linux-kernel&m=153013977816825&w=2 > > > > > > That's not a bad idea. > > > > Are you saying that the previous "nasty hack" in > > bio_iov_iter_get_pages() was broken, and the new one is not? > > I've scratched my head over that (old) code lately, but I couldn't > > spot > > an actual error in it. > > Yeah, I think the new patch in above link is better, and looks its > correctness > is easy to prove. > > https://marc.info/?t=152812081400001&r=1&w=2 > > So please test the patch if possible. I haven't tested yet, but upon further inspection, I can tell that the current code (without Christoph's patch) is actually broken if the function is called with bio->bi_vcnt > 0. The following patch explains the problem. AFAICS, Christoph's new code is correct. >From b75adc856119346e02126cf8975755300f2d9b7f Mon Sep 17 00:00:00 2001 From: Martin Wilck <mwilck@suse.com> Date: Wed, 18 Jul 2018 01:56:37 +0200 Subject: [PATCH] block: bio_iov_iter_get_pages: fix size of last iovec If the last page of the bio is not "full", the length of the last vector bin needs to be corrected. 
This bin has the index (bio->bi_vcnt - 1), but in bio->bi_io_vec, not in the "bv" helper array which is shifted by the value of bio->bi_vcnt at function invocation.

Signed-off-by: Martin Wilck <mwilck@suse.com>
---
 block/bio.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/bio.c b/block/bio.c
index 53e0f0a..c22e76f 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -913,7 +913,7 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	bv[0].bv_offset += offset;
 	bv[0].bv_len -= offset;
 	if (diff)
-		bv[bio->bi_vcnt - 1].bv_len -= diff;
+		bio->bi_io_vec[bio->bi_vcnt - 1].bv_len -= diff;
 
 	iov_iter_advance(iter, size);
 	return 0;
-- 
2.17.1

^ permalink raw reply related [flat|nested] 54+ messages in thread
* Re: Silent data corruption in blkdev_direct_IO() 2018-07-18 0:07 ` Martin Wilck @ 2018-07-18 2:48 ` Ming Lei 2018-07-18 7:32 ` Martin Wilck 0 siblings, 1 reply; 54+ messages in thread From: Ming Lei @ 2018-07-18 2:48 UTC (permalink / raw) To: Martin Wilck Cc: Ming Lei, Jens Axboe, Hannes Reinecke, Christoph Hellwig, linux-block, jack, kent.overstreet On Wed, Jul 18, 2018 at 02:07:28AM +0200, Martin Wilck wrote: > On Mon, 2018-07-16 at 19:45 +0800, Ming Lei wrote: > > On Sat, Jul 14, 2018 at 6:29 AM, Martin Wilck <mwilck@suse.com> > > wrote: > > > Hi Ming & Jens, > > > > > > On Fri, 2018-07-13 at 12:54 -0600, Jens Axboe wrote: > > > > On 7/12/18 5:29 PM, Ming Lei wrote: > > > > > > > > > > Maybe you can try the following patch from Christoph to see if > > > > > it > > > > > makes a > > > > > difference: > > > > > > > > > > https://marc.info/?l=linux-kernel&m=153013977816825&w=2 > > > > > > > > That's not a bad idea. > > > > > > Are you saying that the previous "nasty hack" in > > > bio_iov_iter_get_pages() was broken, and the new one is not? > > > I've scratched my head over that (old) code lately, but I couldn't > > > spot > > > an actual error in it. > > > > Yeah, I think the new patch in above link is better, and looks its > > correctness > > is easy to prove. > > > > https://marc.info/?t=152812081400001&r=1&w=2 > > > > So please test the patch if possible. > > I haven't tested yet, but upon further inspection, I can tell that the > current code (without Christoph's patch) is actually broken if the > function is called with bio->bi_vcnt > 0. The following patch explains > the problem. AFAICS, Christoph's new code is correct. > > From b75adc856119346e02126cf8975755300f2d9b7f Mon Sep 17 00:00:00 2001 > From: Martin Wilck <mwilck@suse.com> > Date: Wed, 18 Jul 2018 01:56:37 +0200 > Subject: [PATCH] block: bio_iov_iter_get_pages: fix size of last iovec > > If the last page of the bio is not "full", the length of the last > vector bin needs to be corrected. 
> This bin has the index (bio->bi_vcnt - 1), but in bio->bi_io_vec, not in
> the "bv" helper array which is shifted by the value of bio->bi_vcnt at
> function invocation.
>
> Signed-off-by: Martin Wilck <mwilck@suse.com>
> ---
>  block/bio.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/block/bio.c b/block/bio.c
> index 53e0f0a..c22e76f 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -913,7 +913,7 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
>  	bv[0].bv_offset += offset;
>  	bv[0].bv_len -= offset;
>  	if (diff)
> -		bv[bio->bi_vcnt - 1].bv_len -= diff;
> +		bio->bi_io_vec[bio->bi_vcnt - 1].bv_len -= diff;
>
>  	iov_iter_advance(iter, size);
>  	return 0;

Right, that is the issue, we need this fix for -stable, but maybe the following fix is more readable:

diff --git a/block/bio.c b/block/bio.c
index f3536bfc8298..6e37b803755b 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -914,16 +914,16 @@ EXPORT_SYMBOL(bio_add_page);
  */
 int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 {
-	unsigned short nr_pages = bio->bi_max_vecs - bio->bi_vcnt;
+	unsigned short idx, nr_pages = bio->bi_max_vecs - bio->bi_vcnt;
 	struct bio_vec *bv = bio->bi_io_vec + bio->bi_vcnt;
 	struct page **pages = (struct page **)bv;
-	size_t offset, diff;
+	size_t offset;
 	ssize_t size;
 
 	size = iov_iter_get_pages(iter, pages, LONG_MAX, nr_pages, &offset);
 	if (unlikely(size <= 0))
 		return size ? size : -EFAULT;
-	nr_pages = (size + offset + PAGE_SIZE - 1) / PAGE_SIZE;
+	idx = nr_pages = (size + offset + PAGE_SIZE - 1) / PAGE_SIZE;
 
 	/*
 	 * Deep magic below: We need to walk the pinned pages backwards
@@ -936,17 +936,15 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	bio->bi_iter.bi_size += size;
 	bio->bi_vcnt += nr_pages;
 
-	diff = (nr_pages * PAGE_SIZE - offset) - size;
-	while (nr_pages--) {
-		bv[nr_pages].bv_page = pages[nr_pages];
-		bv[nr_pages].bv_len = PAGE_SIZE;
-		bv[nr_pages].bv_offset = 0;
+	while (idx--) {
+		bv[idx].bv_page = pages[idx];
+		bv[idx].bv_len = PAGE_SIZE;
+		bv[idx].bv_offset = 0;
 	}
 
 	bv[0].bv_offset += offset;
 	bv[0].bv_len -= offset;
-	if (diff)
-		bv[bio->bi_vcnt - 1].bv_len -= diff;
+	bv[nr_pages - 1].bv_len -= (nr_pages * PAGE_SIZE - offset) - size;
 
 	iov_iter_advance(iter, size);
 	return 0;

And for mainline, I suggest merging Christoph's new code, whose correctness is easy to prove, and which seems simpler.

Thanks,
Ming

^ permalink raw reply related [flat|nested] 54+ messages in thread
* Re: Silent data corruption in blkdev_direct_IO() 2018-07-18 2:48 ` Ming Lei @ 2018-07-18 7:32 ` Martin Wilck 2018-07-18 7:54 ` Ming Lei 0 siblings, 1 reply; 54+ messages in thread From: Martin Wilck @ 2018-07-18 7:32 UTC (permalink / raw) To: Ming Lei Cc: Ming Lei, Jens Axboe, Hannes Reinecke, Christoph Hellwig, linux-block, jack, kent.overstreet On Wed, 2018-07-18 at 10:48 +0800, Ming Lei wrote: > On Wed, Jul 18, 2018 at 02:07:28AM +0200, Martin Wilck wrote: > > > > From b75adc856119346e02126cf8975755300f2d9b7f Mon Sep 17 00:00:00 > > 2001 > > From: Martin Wilck <mwilck@suse.com> > > Date: Wed, 18 Jul 2018 01:56:37 +0200 > > Subject: [PATCH] block: bio_iov_iter_get_pages: fix size of last > > iovec > > > > If the last page of the bio is not "full", the length of the last > > vector bin needs to be corrected. This bin has the index > > (bio->bi_vcnt - 1), but in bio->bi_io_vec, not in the "bv" helper > > array which > > is shifted by the value of bio->bi_vcnt at function invocation. > > > > Signed-off-by: Martin Wilck <mwilck@suse.com> > > --- > > block/bio.c | 2 +- > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > diff --git a/block/bio.c b/block/bio.c > > index 53e0f0a..c22e76f 100644 > > --- a/block/bio.c > > +++ b/block/bio.c > > @@ -913,7 +913,7 @@ int bio_iov_iter_get_pages(struct bio *bio, > > struct iov_iter *iter) > > bv[0].bv_offset += offset; > > bv[0].bv_len -= offset; > > if (diff) > > - bv[bio->bi_vcnt - 1].bv_len -= diff; > > + bio->bi_io_vec[bio->bi_vcnt - 1].bv_len -= diff; > > > > iov_iter_advance(iter, size); > > return 0; > > Right, that is the issue, we need this fix for -stable, but maybe the > following fix is more readable: > > diff --git a/block/bio.c b/block/bio.c > index f3536bfc8298..6e37b803755b 100644 > --- a/block/bio.c > +++ b/block/bio.c > @@ -914,16 +914,16 @@ EXPORT_SYMBOL(bio_add_page); > */ > int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter) > { > - unsigned short nr_pages = bio->bi_max_vecs - 
bio->bi_vcnt; > + unsigned short idx, nr_pages = bio->bi_max_vecs - bio- > >bi_vcnt; > struct bio_vec *bv = bio->bi_io_vec + bio->bi_vcnt; > struct page **pages = (struct page **)bv; > - size_t offset, diff; > + size_t offset; > ssize_t size; > > size = iov_iter_get_pages(iter, pages, LONG_MAX, nr_pages, > &offset); > if (unlikely(size <= 0)) > return size ? size : -EFAULT; > - nr_pages = (size + offset + PAGE_SIZE - 1) / PAGE_SIZE; > + idx = nr_pages = (size + offset + PAGE_SIZE - 1) / > PAGE_SIZE; > > /* > * Deep magic below: We need to walk the pinned pages > backwards > @@ -936,17 +936,15 @@ int bio_iov_iter_get_pages(struct bio *bio, > struct iov_iter *iter) > bio->bi_iter.bi_size += size; > bio->bi_vcnt += nr_pages; > > - diff = (nr_pages * PAGE_SIZE - offset) - size; > - while (nr_pages--) { > - bv[nr_pages].bv_page = pages[nr_pages]; > - bv[nr_pages].bv_len = PAGE_SIZE; > - bv[nr_pages].bv_offset = 0; > + while (idx--) { > + bv[idx].bv_page = pages[idx]; > + bv[idx].bv_len = PAGE_SIZE; > + bv[idx].bv_offset = 0; > } > > bv[0].bv_offset += offset; > bv[0].bv_len -= offset; > - if (diff) > - bv[bio->bi_vcnt - 1].bv_len -= diff; > + bv[nr_pages - 1].bv_len -= (nr_pages * PAGE_SIZE - offset) - > size; > > iov_iter_advance(iter, size); > return 0; > > And for mainline, I suggest to make Christoph's new code in, that is > easy to prove its correctness, and seems simpler. Fine with me. Will you take care of a submission, or should I? Btw, this is not the full fix for our data corruption issue yet. Another patch is needed which still needs testing. Martin -- Dr. Martin Wilck <mwilck@suse.com>, Tel. +49 (0)911 74053 2107 SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton HRB 21284 (AG Nürnberg) ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: Silent data corruption in blkdev_direct_IO() 2018-07-18 7:32 ` Martin Wilck @ 2018-07-18 7:54 ` Ming Lei 2018-07-18 9:20 ` Johannes Thumshirn 2018-07-19 9:39 ` [PATCH 0/2] Fix silent " Martin Wilck 0 siblings, 2 replies; 54+ messages in thread From: Ming Lei @ 2018-07-18 7:54 UTC (permalink / raw) To: Martin Wilck Cc: Ming Lei, Jens Axboe, Hannes Reinecke, Christoph Hellwig, linux-block, jack, kent.overstreet On Wed, Jul 18, 2018 at 09:32:12AM +0200, Martin Wilck wrote: > On Wed, 2018-07-18 at 10:48 +0800, Ming Lei wrote: > > On Wed, Jul 18, 2018 at 02:07:28AM +0200, Martin Wilck wrote: > > > > > > From b75adc856119346e02126cf8975755300f2d9b7f Mon Sep 17 00:00:00 > > > 2001 > > > From: Martin Wilck <mwilck@suse.com> > > > Date: Wed, 18 Jul 2018 01:56:37 +0200 > > > Subject: [PATCH] block: bio_iov_iter_get_pages: fix size of last > > > iovec > > > > > > If the last page of the bio is not "full", the length of the last > > > vector bin needs to be corrected. This bin has the index > > > (bio->bi_vcnt - 1), but in bio->bi_io_vec, not in the "bv" helper > > > array which > > > is shifted by the value of bio->bi_vcnt at function invocation. 
> > > > > > Signed-off-by: Martin Wilck <mwilck@suse.com> > > > --- > > > block/bio.c | 2 +- > > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > > > diff --git a/block/bio.c b/block/bio.c > > > index 53e0f0a..c22e76f 100644 > > > --- a/block/bio.c > > > +++ b/block/bio.c > > > @@ -913,7 +913,7 @@ int bio_iov_iter_get_pages(struct bio *bio, > > > struct iov_iter *iter) > > > bv[0].bv_offset += offset; > > > bv[0].bv_len -= offset; > > > if (diff) > > > - bv[bio->bi_vcnt - 1].bv_len -= diff; > > > + bio->bi_io_vec[bio->bi_vcnt - 1].bv_len -= diff; > > > > > > iov_iter_advance(iter, size); > > > return 0; > > > > Right, that is the issue, we need this fix for -stable, but maybe the > > following fix is more readable: > > > > diff --git a/block/bio.c b/block/bio.c > > index f3536bfc8298..6e37b803755b 100644 > > --- a/block/bio.c > > +++ b/block/bio.c > > @@ -914,16 +914,16 @@ EXPORT_SYMBOL(bio_add_page); > > */ > > int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter) > > { > > - unsigned short nr_pages = bio->bi_max_vecs - bio->bi_vcnt; > > + unsigned short idx, nr_pages = bio->bi_max_vecs - bio- > > >bi_vcnt; > > struct bio_vec *bv = bio->bi_io_vec + bio->bi_vcnt; > > struct page **pages = (struct page **)bv; > > - size_t offset, diff; > > + size_t offset; > > ssize_t size; > > > > size = iov_iter_get_pages(iter, pages, LONG_MAX, nr_pages, > > &offset); > > if (unlikely(size <= 0)) > > return size ? 
size : -EFAULT; > > - nr_pages = (size + offset + PAGE_SIZE - 1) / PAGE_SIZE; > > + idx = nr_pages = (size + offset + PAGE_SIZE - 1) / > > PAGE_SIZE; > > > > /* > > * Deep magic below: We need to walk the pinned pages > > backwards > > @@ -936,17 +936,15 @@ int bio_iov_iter_get_pages(struct bio *bio, > > struct iov_iter *iter) > > bio->bi_iter.bi_size += size; > > bio->bi_vcnt += nr_pages; > > > > - diff = (nr_pages * PAGE_SIZE - offset) - size; > > - while (nr_pages--) { > > - bv[nr_pages].bv_page = pages[nr_pages]; > > - bv[nr_pages].bv_len = PAGE_SIZE; > > - bv[nr_pages].bv_offset = 0; > > + while (idx--) { > > + bv[idx].bv_page = pages[idx]; > > + bv[idx].bv_len = PAGE_SIZE; > > + bv[idx].bv_offset = 0; > > } > > > > bv[0].bv_offset += offset; > > bv[0].bv_len -= offset; > > - if (diff) > > - bv[bio->bi_vcnt - 1].bv_len -= diff; > > + bv[nr_pages - 1].bv_len -= (nr_pages * PAGE_SIZE - offset) - > > size; > > > > iov_iter_advance(iter, size); > > return 0; > > > > And for mainline, I suggest to make Christoph's new code in, that is > > easy to prove its correctness, and seems simpler. > > Fine with me. Will you take care of a submission, or should I? > Btw, this is not the full fix for our data corruption issue yet. > Another patch is needed which still needs testing. Please go ahead and take care of it since you have the test cases. thanks Ming ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: Silent data corruption in blkdev_direct_IO() 2018-07-18 7:54 ` Ming Lei @ 2018-07-18 9:20 ` Johannes Thumshirn 2018-07-18 11:40 ` Jan Kara 0 siblings, 1 reply; 54+ messages in thread From: Johannes Thumshirn @ 2018-07-18 9:20 UTC (permalink / raw) To: Ming Lei Cc: Martin Wilck, Ming Lei, Jens Axboe, Hannes Reinecke, Christoph Hellwig, linux-block, jack, kent.overstreet On Wed, Jul 18, 2018 at 03:54:46PM +0800, Ming Lei wrote: > Please go ahead and take care of it since you have the test cases. Speaking of which, do we already know how it is triggered and can we cook up a blktests testcase for it? This would be more than helpful for all parties. Thanks, Johannes -- Johannes Thumshirn Storage jthumshirn@suse.de +49 911 74053 689 SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: Felix Imendörffer, Jane Smithard, Graham Norton HRB 21284 (AG Nürnberg) Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850 ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: Silent data corruption in blkdev_direct_IO() 2018-07-18 9:20 ` Johannes Thumshirn @ 2018-07-18 11:40 ` Jan Kara 2018-07-18 11:57 ` Jan Kara 0 siblings, 1 reply; 54+ messages in thread From: Jan Kara @ 2018-07-18 11:40 UTC (permalink / raw) To: Johannes Thumshirn Cc: Ming Lei, Martin Wilck, Ming Lei, Jens Axboe, Hannes Reinecke, Christoph Hellwig, linux-block, jack, kent.overstreet On Wed 18-07-18 11:20:15, Johannes Thumshirn wrote: > On Wed, Jul 18, 2018 at 03:54:46PM +0800, Ming Lei wrote: > > Please go ahead and take care of it since you have the test cases. > > Speaking of which, do we already know how it is triggered and can we > cook up a blktests testcase for it? This would be more than helpful > for all parties. Using multiple iovecs with writev / readv trivially triggers the case of IO that is done partly as direct and partly as buffered. Neither Martin nor I was able to trigger the data corruption the customer is seeing with KVM though (since the generic code tries to maintain data integrity even if the IO is mixed). It should be possible to trigger the corruption by having two processes writing to the same PAGE_SIZE region of a block device, just at different offsets. And if the first process happens to use direct IO while the second ends up doing a read-modify-write cycle through the page cache, the first write could end up being lost. I'll try whether something like this is able to see the corruption... Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: Silent data corruption in blkdev_direct_IO() 2018-07-18 11:40 ` Jan Kara @ 2018-07-18 11:57 ` Jan Kara 0 siblings, 0 replies; 54+ messages in thread From: Jan Kara @ 2018-07-18 11:57 UTC (permalink / raw) To: Johannes Thumshirn Cc: Ming Lei, Martin Wilck, Ming Lei, Jens Axboe, Hannes Reinecke, Christoph Hellwig, linux-block, jack, kent.overstreet [-- Attachment #1: Type: text/plain, Size: 1435 bytes --] On Wed 18-07-18 13:40:07, Jan Kara wrote: > On Wed 18-07-18 11:20:15, Johannes Thumshirn wrote: > > On Wed, Jul 18, 2018 at 03:54:46PM +0800, Ming Lei wrote: > > > Please go ahead and take care of it since you have the test cases. > > > > Speaking of which, do we already know how it is triggered and can we > > cook up a blktests testcase for it? This would be more than helpful > > for all parties. > > Using multiple iovecs with writev / readv trivially triggers the case of IO > that is done partly as direct and partly as buffered. Neither me nor Martin > were able to trigger the data corruption the customer is seeing with KVM > though (since the generic code tries to maintain data integrity even if the > IO is mixed). It should be possible to trigger the corruption by having two > processes doing write to the same PAGE_SIZE region of a block device, just at > different offsets. And if the first process happens to use direct IO while > the second ends up doing read-modify-write cycle through page cache, the > first write could end up being lost. I'll try whether something like this > is able to see the corruption... OK, when I run the attached test program like:

blkdev-dio-test /dev/loop0 0 &
blkdev-dio-test /dev/loop0 2048 &

one of them reports a lost write almost immediately. On a kernel with my fix, the test program runs for quite a while without problems.
Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR

[-- Attachment #2: blkdev-dio-test.c --]
[-- Type: text/x-c, Size: 1336 bytes --]

#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>
#include <string.h>
#include <stdlib.h>
#include <sys/uio.h>

#define PAGE_SIZE 4096
#define SECT_SIZE 512
#define BUF_OFF (2*SECT_SIZE)

int main(int argc, char **argv)
{
	int fd = open(argv[1], O_RDWR | O_DIRECT);
	int ret;
	char *buf;
	loff_t off;
	struct iovec iov[2];
	unsigned int seq;

	if (fd < 0) {
		perror("open");
		return 1;
	}
	off = strtol(argv[2], NULL, 10);
	buf = aligned_alloc(PAGE_SIZE, PAGE_SIZE);
	iov[0].iov_base = buf;
	iov[0].iov_len = SECT_SIZE;
	iov[1].iov_base = buf + BUF_OFF;
	iov[1].iov_len = SECT_SIZE;
	seq = 0;
	memset(buf, 0, PAGE_SIZE);
	while (1) {
		*(unsigned int *)buf = seq;
		*(unsigned int *)(buf + BUF_OFF) = seq;
		ret = pwritev(fd, iov, 2, off);
		if (ret < 0) {
			perror("pwritev");
			return 1;
		}
		if (ret != 2*SECT_SIZE) {
			fprintf(stderr, "Short pwritev: %d\n", ret);
			return 1;
		}
		ret = pread(fd, buf, PAGE_SIZE, off);
		if (ret < 0) {
			perror("pread");
			return 1;
		}
		if (ret != PAGE_SIZE) {
			fprintf(stderr, "Short read: %d\n", ret);
			return 1;
		}
		if (*(unsigned int *)buf != seq ||
		    *(unsigned int *)(buf + SECT_SIZE) != seq) {
			printf("Lost write %u: %u %u\n", seq,
			       *(unsigned int *)buf,
			       *(unsigned int *)(buf + SECT_SIZE));
			return 1;
		}
		seq++;
	}
	return 0;
}

^ permalink raw reply [flat|nested] 54+ messages in thread
* [PATCH 0/2] Fix silent data corruption in blkdev_direct_IO() 2018-07-18 7:54 ` Ming Lei 2018-07-18 9:20 ` Johannes Thumshirn @ 2018-07-19 9:39 ` Martin Wilck 2018-07-19 9:39 ` [PATCH 1/2] block: bio_iov_iter_get_pages: fix size of last iovec Martin Wilck ` (3 more replies) 1 sibling, 4 replies; 54+ messages in thread From: Martin Wilck @ 2018-07-19 9:39 UTC (permalink / raw) To: Jens Axboe, Ming Lei, Jan Kara Cc: Hannes Reinecke, Johannes Thumshirn, Kent Overstreet, Christoph Hellwig, linux-block, Martin Wilck Hello Jens, Ming, Jan, and all others, the following patches have been verified by a customer to fix a silent data corruption which he has been seeing since "72ecad2 block: support a full bio worth of IO for simplified bdev direct-io". The patches are based on our observation that the corruption is only observed if the __blkdev_direct_IO_simple() code path is executed, and if that happens, "short writes" are observed in this code path, which causes a fallback to buffered IO, while the application continues submitting direct IO requests. For the first patch, an alternative solution by Christoph Hellwig exists: https://marc.info/?l=linux-kernel&m=153013977816825&w=2 While I believe that Christoph's patch is correct, the one presented here is smaller. Ming has suggested to use Christoph's for mainline and mine for -stable. Wrt the second patch, we've had an internal discussion at SUSE how to handle (unlikely) error conditions from bio_iov_iter_get_pages(). The patch presented here tries to submit as much IO as possible via the direct path even in the error case, while Jan Kara suggested to abort, not submit any IO, and fall back to buffered IO in that case. Looking forward to your opinions and suggestions. 
Regards Martin Martin Wilck (2): block: bio_iov_iter_get_pages: fix size of last iovec blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio block/bio.c | 18 ++++++++---------- fs/block_dev.c | 8 +++++++- 2 files changed, 15 insertions(+), 11 deletions(-) -- 2.17.1 ^ permalink raw reply [flat|nested] 54+ messages in thread
* [PATCH 1/2] block: bio_iov_iter_get_pages: fix size of last iovec 2018-07-19 9:39 ` [PATCH 0/2] Fix silent " Martin Wilck @ 2018-07-19 9:39 ` Martin Wilck 2018-07-19 10:05 ` Hannes Reinecke ` (3 more replies) 2018-07-19 9:39 ` [PATCH 2/2] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio Martin Wilck ` (2 subsequent siblings) 3 siblings, 4 replies; 54+ messages in thread From: Martin Wilck @ 2018-07-19 9:39 UTC (permalink / raw) To: Jens Axboe, Ming Lei, Jan Kara Cc: Hannes Reinecke, Johannes Thumshirn, Kent Overstreet, Christoph Hellwig, linux-block, Martin Wilck

If the last page of the bio is not "full", the length of the last vector slot needs to be corrected. This slot has the index (bio->bi_vcnt - 1), but only in bio->bi_io_vec. In the "bv" helper array, which is shifted by the value of bio->bi_vcnt at function invocation, the correct index is (nr_pages - 1).

V2: improved readability following suggestions from Ming Lei.

Fixes: 2cefe4dbaadf ("block: add bio_iov_iter_get_pages()")
Signed-off-by: Martin Wilck <mwilck@suse.com>
---
 block/bio.c | 18 ++++++++----------
 1 file changed, 8 insertions(+), 10 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 67eff5e..0964328 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -912,16 +912,16 @@ EXPORT_SYMBOL(bio_add_page);
  */
 int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 {
-	unsigned short nr_pages = bio->bi_max_vecs - bio->bi_vcnt;
+	unsigned short idx, nr_pages = bio->bi_max_vecs - bio->bi_vcnt;
 	struct bio_vec *bv = bio->bi_io_vec + bio->bi_vcnt;
 	struct page **pages = (struct page **)bv;
-	size_t offset, diff;
+	size_t offset;
 	ssize_t size;
 
 	size = iov_iter_get_pages(iter, pages, LONG_MAX, nr_pages, &offset);
 	if (unlikely(size <= 0))
 		return size ? size : -EFAULT;
-	nr_pages = (size + offset + PAGE_SIZE - 1) / PAGE_SIZE;
+	idx = nr_pages = (size + offset + PAGE_SIZE - 1) / PAGE_SIZE;
 
 	/*
 	 * Deep magic below: We need to walk the pinned pages backwards
@@ -934,17 +934,15 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	bio->bi_iter.bi_size += size;
 	bio->bi_vcnt += nr_pages;
 
-	diff = (nr_pages * PAGE_SIZE - offset) - size;
-	while (nr_pages--) {
-		bv[nr_pages].bv_page = pages[nr_pages];
-		bv[nr_pages].bv_len = PAGE_SIZE;
-		bv[nr_pages].bv_offset = 0;
+	while (idx--) {
+		bv[idx].bv_page = pages[idx];
+		bv[idx].bv_len = PAGE_SIZE;
+		bv[idx].bv_offset = 0;
 	}
 
 	bv[0].bv_offset += offset;
 	bv[0].bv_len -= offset;
-	if (diff)
-		bv[bio->bi_vcnt - 1].bv_len -= diff;
+	bv[nr_pages - 1].bv_len -= nr_pages * PAGE_SIZE - offset - size;
 
 	iov_iter_advance(iter, size);
 	return 0;
-- 
2.17.1

^ permalink raw reply related [flat|nested] 54+ messages in thread
* Re: [PATCH 1/2] block: bio_iov_iter_get_pages: fix size of last iovec 2018-07-19 9:39 ` [PATCH 1/2] block: bio_iov_iter_get_pages: fix size of last iovec Martin Wilck @ 2018-07-19 10:05 ` Hannes Reinecke 2018-07-19 10:09 ` Ming Lei ` (2 subsequent siblings) 3 siblings, 0 replies; 54+ messages in thread From: Hannes Reinecke @ 2018-07-19 10:05 UTC (permalink / raw) To: Martin Wilck, Jens Axboe, Ming Lei, Jan Kara Cc: Johannes Thumshirn, Kent Overstreet, Christoph Hellwig, linux-block On 07/19/2018 11:39 AM, Martin Wilck wrote: > If the last page of the bio is not "full", the length of the last > vector slot needs to be corrected. This slot has the index > (bio->bi_vcnt - 1), but only in bio->bi_io_vec. In the "bv" helper > array, which is shifted by the value of bio->bi_vcnt at function > invocation, the correct index is (nr_pages - 1). > > V2: improved readability following suggestions from Ming Lei. > > Fixes: 2cefe4dbaadf ("block: add bio_iov_iter_get_pages()") > Signed-off-by: Martin Wilck <mwilck@suse.com> > --- > block/bio.c | 18 ++++++++---------- > 1 file changed, 8 insertions(+), 10 deletions(-) > Reviewed-by: Hannes Reinecke <hare@suse.com> Cheers, Hannes -- Dr. Hannes Reinecke Teamlead Storage & Networking hare@suse.de +49 911 74053 688 SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton HRB 21284 (AG Nürnberg) ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH 1/2] block: bio_iov_iter_get_pages: fix size of last iovec 2018-07-19 9:39 ` [PATCH 1/2] block: bio_iov_iter_get_pages: fix size of last iovec Martin Wilck 2018-07-19 10:05 ` Hannes Reinecke @ 2018-07-19 10:09 ` Ming Lei 2018-07-19 10:20 ` Jan Kara 2018-07-19 14:52 ` Christoph Hellwig 3 siblings, 0 replies; 54+ messages in thread From: Ming Lei @ 2018-07-19 10:09 UTC (permalink / raw) To: Martin Wilck Cc: Jens Axboe, Ming Lei, Jan Kara, Hannes Reinecke, Johannes Thumshirn, Kent Overstreet, Christoph Hellwig, linux-block On Thu, Jul 19, 2018 at 5:39 PM, Martin Wilck <mwilck@suse.com> wrote: > If the last page of the bio is not "full", the length of the last > vector slot needs to be corrected. This slot has the index > (bio->bi_vcnt - 1), but only in bio->bi_io_vec. In the "bv" helper > array, which is shifted by the value of bio->bi_vcnt at function > invocation, the correct index is (nr_pages - 1). > > V2: improved readability following suggestions from Ming Lei. > > Fixes: 2cefe4dbaadf ("block: add bio_iov_iter_get_pages()") > Signed-off-by: Martin Wilck <mwilck@suse.com> > --- > block/bio.c | 18 ++++++++---------- > 1 file changed, 8 insertions(+), 10 deletions(-) > > diff --git a/block/bio.c b/block/bio.c > index 67eff5e..0964328 100644 > --- a/block/bio.c > +++ b/block/bio.c > @@ -912,16 +912,16 @@ EXPORT_SYMBOL(bio_add_page); > */ > int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter) > { > - unsigned short nr_pages = bio->bi_max_vecs - bio->bi_vcnt; > + unsigned short idx, nr_pages = bio->bi_max_vecs - bio->bi_vcnt; > struct bio_vec *bv = bio->bi_io_vec + bio->bi_vcnt; > struct page **pages = (struct page **)bv; > - size_t offset, diff; > + size_t offset; > ssize_t size; > > size = iov_iter_get_pages(iter, pages, LONG_MAX, nr_pages, &offset); > if (unlikely(size <= 0)) > return size ? 
size : -EFAULT; > - nr_pages = (size + offset + PAGE_SIZE - 1) / PAGE_SIZE; > + idx = nr_pages = (size + offset + PAGE_SIZE - 1) / PAGE_SIZE; > > /* > * Deep magic below: We need to walk the pinned pages backwards > @@ -934,17 +934,15 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter) > bio->bi_iter.bi_size += size; > bio->bi_vcnt += nr_pages; > > - diff = (nr_pages * PAGE_SIZE - offset) - size; > - while (nr_pages--) { > - bv[nr_pages].bv_page = pages[nr_pages]; > - bv[nr_pages].bv_len = PAGE_SIZE; > - bv[nr_pages].bv_offset = 0; > + while (idx--) { > + bv[idx].bv_page = pages[idx]; > + bv[idx].bv_len = PAGE_SIZE; > + bv[idx].bv_offset = 0; > } > > bv[0].bv_offset += offset; > bv[0].bv_len -= offset; > - if (diff) > - bv[bio->bi_vcnt - 1].bv_len -= diff; > + bv[nr_pages - 1].bv_len -= nr_pages * PAGE_SIZE - offset - size; > > iov_iter_advance(iter, size); > return 0; > -- > 2.17.1 > Reviewed-by: Ming Lei <ming.lei@redhat.com> BTW, we can make it for mainline too, then rebase Christoph's new patch against this one. Thanks, Ming Lei ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH 1/2] block: bio_iov_iter_get_pages: fix size of last iovec 2018-07-19 9:39 ` [PATCH 1/2] block: bio_iov_iter_get_pages: fix size of last iovec Martin Wilck 2018-07-19 10:05 ` Hannes Reinecke 2018-07-19 10:09 ` Ming Lei @ 2018-07-19 10:20 ` Jan Kara 2018-07-19 14:52 ` Christoph Hellwig 3 siblings, 0 replies; 54+ messages in thread From: Jan Kara @ 2018-07-19 10:20 UTC (permalink / raw) To: Martin Wilck Cc: Jens Axboe, Ming Lei, Jan Kara, Hannes Reinecke, Johannes Thumshirn, Kent Overstreet, Christoph Hellwig, linux-block On Thu 19-07-18 11:39:17, Martin Wilck wrote: > If the last page of the bio is not "full", the length of the last > vector slot needs to be corrected. This slot has the index > (bio->bi_vcnt - 1), but only in bio->bi_io_vec. In the "bv" helper > array, which is shifted by the value of bio->bi_vcnt at function > invocation, the correct index is (nr_pages - 1). > > V2: improved readability following suggestions from Ming Lei. > > Fixes: 2cefe4dbaadf ("block: add bio_iov_iter_get_pages()") > Signed-off-by: Martin Wilck <mwilck@suse.com> Looks good to me. You can add: Reviewed-by: Jan Kara <jack@suse.cz> BTW, explicit CC: stable@vger.kernel.org would be good. But Jens can add it I guess. Honza > --- > block/bio.c | 18 ++++++++---------- > 1 file changed, 8 insertions(+), 10 deletions(-) > > diff --git a/block/bio.c b/block/bio.c > index 67eff5e..0964328 100644 > --- a/block/bio.c > +++ b/block/bio.c > @@ -912,16 +912,16 @@ EXPORT_SYMBOL(bio_add_page); > */ > int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter) > { > - unsigned short nr_pages = bio->bi_max_vecs - bio->bi_vcnt; > + unsigned short idx, nr_pages = bio->bi_max_vecs - bio->bi_vcnt; > struct bio_vec *bv = bio->bi_io_vec + bio->bi_vcnt; > struct page **pages = (struct page **)bv; > - size_t offset, diff; > + size_t offset; > ssize_t size; > > size = iov_iter_get_pages(iter, pages, LONG_MAX, nr_pages, &offset); > if (unlikely(size <= 0)) > return size ? 
size : -EFAULT; > - nr_pages = (size + offset + PAGE_SIZE - 1) / PAGE_SIZE; > + idx = nr_pages = (size + offset + PAGE_SIZE - 1) / PAGE_SIZE; > > /* > * Deep magic below: We need to walk the pinned pages backwards > @@ -934,17 +934,15 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter) > bio->bi_iter.bi_size += size; > bio->bi_vcnt += nr_pages; > > - diff = (nr_pages * PAGE_SIZE - offset) - size; > - while (nr_pages--) { > - bv[nr_pages].bv_page = pages[nr_pages]; > - bv[nr_pages].bv_len = PAGE_SIZE; > - bv[nr_pages].bv_offset = 0; > + while (idx--) { > + bv[idx].bv_page = pages[idx]; > + bv[idx].bv_len = PAGE_SIZE; > + bv[idx].bv_offset = 0; > } > > bv[0].bv_offset += offset; > bv[0].bv_len -= offset; > - if (diff) > - bv[bio->bi_vcnt - 1].bv_len -= diff; > + bv[nr_pages - 1].bv_len -= nr_pages * PAGE_SIZE - offset - size; > > iov_iter_advance(iter, size); > return 0; > -- > 2.17.1 > -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH 1/2] block: bio_iov_iter_get_pages: fix size of last iovec 2018-07-19 9:39 ` [PATCH 1/2] block: bio_iov_iter_get_pages: fix size of last iovec Martin Wilck ` (2 preceding siblings ...) 2018-07-19 10:20 ` Jan Kara @ 2018-07-19 14:52 ` Christoph Hellwig 3 siblings, 0 replies; 54+ messages in thread From: Christoph Hellwig @ 2018-07-19 14:52 UTC (permalink / raw) To: Martin Wilck Cc: Jens Axboe, Ming Lei, Jan Kara, Hannes Reinecke, Johannes Thumshirn, Kent Overstreet, Christoph Hellwig, linux-block On Thu, Jul 19, 2018 at 11:39:17AM +0200, Martin Wilck wrote: > If the last page of the bio is not "full", the length of the last > vector slot needs to be corrected. This slot has the index > (bio->bi_vcnt - 1), but only in bio->bi_io_vec. In the "bv" helper > array, which is shifted by the value of bio->bi_vcnt at function > invocation, the correct index is (nr_pages - 1). > > V2: improved readability following suggestions from Ming Lei. > > Fixes: 2cefe4dbaadf ("block: add bio_iov_iter_get_pages()") > Signed-off-by: Martin Wilck <mwilck@suse.com> > --- > block/bio.c | 18 ++++++++---------- > 1 file changed, 8 insertions(+), 10 deletions(-) > > diff --git a/block/bio.c b/block/bio.c > index 67eff5e..0964328 100644 > --- a/block/bio.c > +++ b/block/bio.c > @@ -912,16 +912,16 @@ EXPORT_SYMBOL(bio_add_page); > */ > int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter) > { > - unsigned short nr_pages = bio->bi_max_vecs - bio->bi_vcnt; > + unsigned short idx, nr_pages = bio->bi_max_vecs - bio->bi_vcnt; Nitpick: I prefer to keep the variable(s) that are initialized first on any given line. Otherwise this looks fine: Reviewed-by: Christoph Hellwig <hch@lst.de> ^ permalink raw reply [flat|nested] 54+ messages in thread
* [PATCH 2/2] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio 2018-07-19 9:39 ` [PATCH 0/2] Fix silent " Martin Wilck 2018-07-19 9:39 ` [PATCH 1/2] block: bio_iov_iter_get_pages: fix size of last iovec Martin Wilck @ 2018-07-19 9:39 ` Martin Wilck 2018-07-19 10:06 ` Hannes Reinecke ` (3 more replies) 2018-07-19 10:08 ` [PATCH 0/2] Fix silent data corruption in blkdev_direct_IO() Hannes Reinecke 2018-07-19 14:50 ` Christoph Hellwig 3 siblings, 4 replies; 54+ messages in thread From: Martin Wilck @ 2018-07-19 9:39 UTC (permalink / raw) To: Jens Axboe, Ming Lei, Jan Kara Cc: Hannes Reinecke, Johannes Thumshirn, Kent Overstreet, Christoph Hellwig, linux-block, Martin Wilck bio_iov_iter_get_pages() returns only pages for a single non-empty segment of the input iov_iter's iovec. This may be much less than the number of pages __blkdev_direct_IO_simple() is supposed to process. Call bio_iov_iter_get_pages() repeatedly until either the requested number of bytes is reached, or bio.bi_io_vec is exhausted. If this is not done, short writes or reads may occur for direct synchronous IOs with multiple iovec slots (such as generated by writev()). In that case, __generic_file_write_iter() falls back to buffered writes, which has been observed to cause data corruption in certain workloads. Note: if segments aren't page-aligned in the input iovec, this patch may result in multiple adjacent slots of the bi_io_vec array referencing the same page (the byte ranges are guaranteed to be disjoint if the preceding patch is applied). We haven't seen problems with that in our and the customer's tests. It'd be possible to detect this situation and merge bi_io_vec slots that refer to the same page, but I prefer to keep it simple for now. 
Fixes: 72ecad22d9f1 ("block: support a full bio worth of IO for simplified bdev direct-io")
Signed-off-by: Martin Wilck <mwilck@suse.com>
---
 fs/block_dev.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 0dd87aa..41643c4 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -221,7 +221,12 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter,
 
 	ret = bio_iov_iter_get_pages(&bio, iter);
 	if (unlikely(ret))
-		return ret;
+		goto out;
+
+	while (ret == 0 &&
+	       bio.bi_vcnt < bio.bi_max_vecs && iov_iter_count(iter) > 0)
+		ret = bio_iov_iter_get_pages(&bio, iter);
+
 	ret = bio.bi_iter.bi_size;
 
 	if (iov_iter_rw(iter) == READ) {
@@ -250,6 +255,7 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter,
 		put_page(bvec->bv_page);
 	}
 
+out:
 	if (vecs != inline_vecs)
 		kfree(vecs);
 
-- 
2.17.1

^ permalink raw reply related [flat|nested] 54+ messages in thread
* Re: [PATCH 2/2] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio 2018-07-19 9:39 ` [PATCH 2/2] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio Martin Wilck @ 2018-07-19 10:06 ` Hannes Reinecke 2018-07-19 10:21 ` Ming Lei ` (2 subsequent siblings) 3 siblings, 0 replies; 54+ messages in thread From: Hannes Reinecke @ 2018-07-19 10:06 UTC (permalink / raw) To: Martin Wilck, Jens Axboe, Ming Lei, Jan Kara Cc: Johannes Thumshirn, Kent Overstreet, Christoph Hellwig, linux-block On 07/19/2018 11:39 AM, Martin Wilck wrote: > bio_iov_iter_get_pages() returns only pages for a single non-empty > segment of the input iov_iter's iovec. This may be much less than the number > of pages __blkdev_direct_IO_simple() is supposed to process. Call > bio_iov_iter_get_pages() repeatedly until either the requested number > of bytes is reached, or bio.bi_io_vec is exhausted. If this is not done, > short writes or reads may occur for direct synchronous IOs with multiple > iovec slots (such as generated by writev()). In that case, > __generic_file_write_iter() falls back to buffered writes, which > has been observed to cause data corruption in certain workloads. > > Note: if segments aren't page-aligned in the input iovec, this patch may > result in multiple adjacent slots of the bi_io_vec array to reference the same > page (the byte ranges are guaranteed to be disjunct if the preceding patch is > applied). We haven't seen problems with that in our and the customer's > tests. It'd be possible to detect this situation and merge bi_io_vec slots > that refer to the same page, but I prefer to keep it simple for now. > > Fixes: 72ecad22d9f1 ("block: support a full bio worth of IO for simplified bdev direct-io") > Signed-off-by: Martin Wilck <mwilck@suse.com> > --- > fs/block_dev.c | 8 +++++++- > 1 file changed, 7 insertions(+), 1 deletion(-) > Reviewed-by: Hannes Reinecke <hare@suse.com> Cheers, Hannes -- Dr. 
Hannes Reinecke Teamlead Storage & Networking hare@suse.de +49 911 74053 688 SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton HRB 21284 (AG Nürnberg) ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH 2/2] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio 2018-07-19 9:39 ` [PATCH 2/2] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio Martin Wilck 2018-07-19 10:06 ` Hannes Reinecke @ 2018-07-19 10:21 ` Ming Lei 2018-07-19 10:37 ` Jan Kara 2018-07-19 10:45 ` Jan Kara 2018-07-19 11:04 ` Ming Lei 3 siblings, 1 reply; 54+ messages in thread From: Ming Lei @ 2018-07-19 10:21 UTC (permalink / raw) To: Martin Wilck Cc: Jens Axboe, Jan Kara, Hannes Reinecke, Johannes Thumshirn, Kent Overstreet, Christoph Hellwig, linux-block On Thu, Jul 19, 2018 at 11:39:18AM +0200, Martin Wilck wrote: > bio_iov_iter_get_pages() returns only pages for a single non-empty > segment of the input iov_iter's iovec. This may be much less than the number > of pages __blkdev_direct_IO_simple() is supposed to process. Call In bio_iov_iter_get_pages(), iov_iter_get_pages() is supposed to retrieve as many pages as possible, since both 'maxsize' and 'maxpages' are provided to cover them all. So the question is why iov_iter_get_pages() doesn't work as expected? Thanks, Ming ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH 2/2] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio 2018-07-19 10:21 ` Ming Lei @ 2018-07-19 10:37 ` Jan Kara 2018-07-19 10:46 ` Ming Lei 2018-07-19 11:08 ` Al Viro 0 siblings, 2 replies; 54+ messages in thread From: Jan Kara @ 2018-07-19 10:37 UTC (permalink / raw) To: Ming Lei Cc: Martin Wilck, Jens Axboe, Jan Kara, Hannes Reinecke, Johannes Thumshirn, Kent Overstreet, Christoph Hellwig, linux-block, Al Viro On Thu 19-07-18 18:21:23, Ming Lei wrote: > On Thu, Jul 19, 2018 at 11:39:18AM +0200, Martin Wilck wrote: > > bio_iov_iter_get_pages() returns only pages for a single non-empty > > segment of the input iov_iter's iovec. This may be much less than the number > > of pages __blkdev_direct_IO_simple() is supposed to process. Call > > In bio_iov_iter_get_pages(), iov_iter_get_pages() supposes to retrieve > as many as possible pages since both 'maxsize' and 'maxpages' are provided > to cover all. > > So the question is why iov_iter_get_pages() doesn't work as expected? Well, there has never been a promise that it will grab *all* pages in the iter AFAIK. Practically, I think that it was just too hairy to implement in the macro magic that iter processing is... Al might know more (added to CC). Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH 2/2] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio 2018-07-19 10:37 ` Jan Kara @ 2018-07-19 10:46 ` Ming Lei 2018-07-19 11:08 ` Al Viro 1 sibling, 0 replies; 54+ messages in thread From: Ming Lei @ 2018-07-19 10:46 UTC (permalink / raw) To: Jan Kara Cc: Martin Wilck, Jens Axboe, Jan Kara, Hannes Reinecke, Johannes Thumshirn, Kent Overstreet, Christoph Hellwig, linux-block, Al Viro On Thu, Jul 19, 2018 at 12:37:13PM +0200, Jan Kara wrote: > On Thu 19-07-18 18:21:23, Ming Lei wrote: > > On Thu, Jul 19, 2018 at 11:39:18AM +0200, Martin Wilck wrote: > > > bio_iov_iter_get_pages() returns only pages for a single non-empty > > > segment of the input iov_iter's iovec. This may be much less than the number > > > of pages __blkdev_direct_IO_simple() is supposed to process. Call > > > > In bio_iov_iter_get_pages(), iov_iter_get_pages() supposes to retrieve > > as many as possible pages since both 'maxsize' and 'maxpages' are provided > > to cover all. > > > > So the question is why iov_iter_get_pages() doesn't work as expected? > > Well, there has never been a promise that it will grab *all* pages in the > iter AFAIK. Practically, I think that it was just too hairy to implement in > the macro magic that iter processing is... Al might know more (added to > CC). OK, I got it, given get_user_pages_fast() may fail and it is reasonable to see less pages retrieved. Thanks, Ming ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH 2/2] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio 2018-07-19 10:37 ` Jan Kara 2018-07-19 10:46 ` Ming Lei @ 2018-07-19 11:08 ` Al Viro 2018-07-19 14:53 ` Christoph Hellwig 1 sibling, 1 reply; 54+ messages in thread From: Al Viro @ 2018-07-19 11:08 UTC (permalink / raw) To: Jan Kara Cc: Ming Lei, Martin Wilck, Jens Axboe, Jan Kara, Hannes Reinecke, Johannes Thumshirn, Kent Overstreet, Christoph Hellwig, linux-block On Thu, Jul 19, 2018 at 12:37:13PM +0200, Jan Kara wrote: > On Thu 19-07-18 18:21:23, Ming Lei wrote: > > On Thu, Jul 19, 2018 at 11:39:18AM +0200, Martin Wilck wrote: > > > bio_iov_iter_get_pages() returns only pages for a single non-empty > > > segment of the input iov_iter's iovec. This may be much less than the number > > > of pages __blkdev_direct_IO_simple() is supposed to process. Call > > > > In bio_iov_iter_get_pages(), iov_iter_get_pages() supposes to retrieve > > as many as possible pages since both 'maxsize' and 'maxpages' are provided > > to cover all. Huh? Why does having a way to say "I (the caller) don't want more than <amount>" implies anything of that sort? It never had been promised, explicitly or not... > > So the question is why iov_iter_get_pages() doesn't work as expected? > > Well, there has never been a promise that it will grab *all* pages in the > iter AFAIK. Practically, I think that it was just too hairy to implement in > the macro magic that iter processing is... Al might know more (added to > CC). Not really - it's more that VM has every right to refuse letting you pin an arbitrary amount of pages anyway. ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH 2/2] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio 2018-07-19 11:08 ` Al Viro @ 2018-07-19 14:53 ` Christoph Hellwig 2018-07-19 15:06 ` Jan Kara 2018-07-19 19:34 ` Martin Wilck 0 siblings, 2 replies; 54+ messages in thread From: Christoph Hellwig @ 2018-07-19 14:53 UTC (permalink / raw) To: Al Viro Cc: Jan Kara, Ming Lei, Martin Wilck, Jens Axboe, Jan Kara, Hannes Reinecke, Johannes Thumshirn, Kent Overstreet, Christoph Hellwig, linux-block On Thu, Jul 19, 2018 at 12:08:41PM +0100, Al Viro wrote: > > Well, there has never been a promise that it will grab *all* pages in the > > iter AFAIK. Practically, I think that it was just too hairy to implement in > > the macro magic that iter processing is... Al might know more (added to > > CC). > > Not really - it's more that VM has every right to refuse letting you pin > an arbitrary amount of pages anyway. In which case the code after this patch isn't going to help either, because it still tries to pin it all, just in multiple calls to get_user_pages(). ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH 2/2] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio 2018-07-19 14:53 ` Christoph Hellwig @ 2018-07-19 15:06 ` Jan Kara 2018-07-19 15:11 ` Christoph Hellwig 2018-07-19 19:34 ` Martin Wilck 1 sibling, 1 reply; 54+ messages in thread From: Jan Kara @ 2018-07-19 15:06 UTC (permalink / raw) To: Christoph Hellwig Cc: Al Viro, Jan Kara, Ming Lei, Martin Wilck, Jens Axboe, Jan Kara, Hannes Reinecke, Johannes Thumshirn, Kent Overstreet, linux-block On Thu 19-07-18 16:53:51, Christoph Hellwig wrote: > On Thu, Jul 19, 2018 at 12:08:41PM +0100, Al Viro wrote: > > > Well, there has never been a promise that it will grab *all* pages in the > > > iter AFAIK. Practically, I think that it was just too hairy to implement in > > > the macro magic that iter processing is... Al might know more (added to > > > CC). > > > > Not really - it's more that VM has every right to refuse letting you pin > > an arbitrary amount of pages anyway. > > In which case the code after this patch isn't going to help either, because > it still tries to pin it all, just in multiple calls to get_user_pages(). Yeah. Actually previous version of the fix (not posted publicly) submitted partial bio and then reused the bio to submit more. This is also the way __blkdev_direct_IO operates. Martin optimized this to fill the bio completely (as we know we have enough bvecs) before submitting which has chances to perform better. I'm fine with either approach, just we have to decide which way to go. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH 2/2] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio 2018-07-19 15:06 ` Jan Kara @ 2018-07-19 15:11 ` Christoph Hellwig 2018-07-19 19:21 ` Martin Wilck 0 siblings, 1 reply; 54+ messages in thread From: Christoph Hellwig @ 2018-07-19 15:11 UTC (permalink / raw) To: Jan Kara Cc: Christoph Hellwig, Al Viro, Ming Lei, Martin Wilck, Jens Axboe, Jan Kara, Hannes Reinecke, Johannes Thumshirn, Kent Overstreet, linux-block On Thu, Jul 19, 2018 at 05:06:42PM +0200, Jan Kara wrote: > Yeah. Actually previous version of the fix (not posted publicly) submitted > partial bio and then reused the bio to submit more. This is also the way > __blkdev_direct_IO operates. Martin optimized this to fill the bio > completely (as we know we have enough bvecs) before submitting which has > chances to perform better. I'm fine with either approach, just we have to > decide which way to go. I think this first version is going to be less fragile, so we should aim for that. ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH 2/2] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio 2018-07-19 15:11 ` Christoph Hellwig @ 2018-07-19 19:21 ` Martin Wilck 0 siblings, 0 replies; 54+ messages in thread From: Martin Wilck @ 2018-07-19 19:21 UTC (permalink / raw) To: Christoph Hellwig, Jan Kara Cc: Al Viro, Ming Lei, Jens Axboe, Jan Kara, Hannes Reinecke, Johannes Thumshirn, Kent Overstreet, linux-block On Thu, 2018-07-19 at 17:11 +0200, Christoph Hellwig wrote: > On Thu, Jul 19, 2018 at 05:06:42PM +0200, Jan Kara wrote: > > Yeah. Actually previous version of the fix (not posted publicly) > > submitted > > partial bio and then reused the bio to submit more. This is also > > the way > > __blkdev_direct_IO operates. Martin optimized this to fill the > > bio > > completely (as we know we have enough bvecs) before submitting > > which has > > chances to perform better. I'm fine with either approach, just we > > have to > > decide which way to go. > > I think this first version is going to be less fragile, so I we > should > aim for that. Wait a second. We changed the approach for a reason. By submitting partial bios synchronously and sequentially, we lose a lot of performance, so much that this "fast path" quickly falls behind the regular __blkdev_direct_IO() path as the number of input IO segments increases. The patch I submitted avoids that. The whole point of this fast path is to submit a single bio, and return as quickly as possible. It's not an error condition if bio_iov_iter_get_pages() returns less data than possible. The function just takes one iteration step at a time, and thus iterating over it until the desired number of pages is obtained, and collecting the resulting pages in a single bio, isn't "fragile", it's just the natural thing to do. Martin -- Dr. Martin Wilck <mwilck@suse.com>, Tel. +49 (0)911 74053 2107 SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton HRB 21284 (AG Nürnberg) ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH 2/2] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio 2018-07-19 14:53 ` Christoph Hellwig 2018-07-19 15:06 ` Jan Kara @ 2018-07-19 19:34 ` Martin Wilck 1 sibling, 0 replies; 54+ messages in thread From: Martin Wilck @ 2018-07-19 19:34 UTC (permalink / raw) To: Christoph Hellwig, Al Viro Cc: Jan Kara, Ming Lei, Jens Axboe, Jan Kara, Hannes Reinecke, Johannes Thumshirn, Kent Overstreet, linux-block On Thu, 2018-07-19 at 16:53 +0200, Christoph Hellwig wrote: > On Thu, Jul 19, 2018 at 12:08:41PM +0100, Al Viro wrote: > > > Well, there has never been a promise that it will grab *all* > > > pages in the > > > iter AFAIK. Practically, I think that it was just too hairy to > > > implement in > > > the macro magic that iter processing is... Al might know more > > > (added to > > > CC). > > > > Not really - it's more that VM has every right to refuse letting > > you pin > > an arbitrary amount of pages anyway. > > In which case the code after this patch isn't going to help either, > because > it still tries to pin it all, just in multiple calls to > get_user_pages(). That was not the point of the patch. It's not about a situation in which MM refuses to pin more pages. The patch is about the "iterator::next()" nature of bio_iov_iter_get_pages(). If it can't pin the pages, bio_iov_iter_get_pages() returns an error code (elsewhere in this thread we discussed how to treat that right). Otherwise, it always adds _some_ data to the bio, but the amount added depends on the segment structure of the input iov_iter. If the input iovec has just a single segment, it fills the bio in a single call. With multiple segments, it just returns the page(s) of the first segment. The point of my patch is to make no difference between single-segment and multi-segment IOs. Regards Martin -- Dr. Martin Wilck <mwilck@suse.com>, Tel. 
+49 (0)911 74053 2107 SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton HRB 21284 (AG Nürnberg) ^ permalink raw reply [flat|nested] 54+ messages in thread
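The multi-segment IOs Martin refers to typically come from writev(); a minimal user-space illustration of how one syscall hands the kernel a multi-slot iovec (written against a plain temporary file rather than an O_DIRECT block device, so it only shows the shape of the request, not the bug itself):

```c
#include <fcntl.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

/* One writev() call with three iovec slots: inside the kernel this
 * becomes a single iov_iter with three segments, which is exactly the
 * kind of input that made __blkdev_direct_IO_simple() return short. */
static ssize_t writev_three_segments(const char *path)
{
	char a[] = "seg1-", b[] = "seg2-", c[] = "seg3";
	struct iovec iov[3] = {
		{ a, strlen(a) },
		{ b, strlen(b) },
		{ c, strlen(c) },
	};
	int fd = open(path, O_CREAT | O_TRUNC | O_WRONLY, 0600);
	ssize_t n;

	if (fd < 0)
		return -1;
	n = writev(fd, iov, 3);	/* POSIX allows a short count; check it */
	close(fd);
	return n;
}
```

For a regular file this returns the full 14 bytes; the thread is about the O_DIRECT block-device path, where a short return from the fast path triggered the buffered fallback.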
* Re: [PATCH 2/2] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio 2018-07-19 9:39 ` [PATCH 2/2] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio Martin Wilck 2018-07-19 10:06 ` Hannes Reinecke 2018-07-19 10:21 ` Ming Lei @ 2018-07-19 10:45 ` Jan Kara 2018-07-19 12:23 ` Martin Wilck 2018-07-19 11:04 ` Ming Lei 3 siblings, 1 reply; 54+ messages in thread From: Jan Kara @ 2018-07-19 10:45 UTC (permalink / raw) To: Martin Wilck Cc: Jens Axboe, Ming Lei, Jan Kara, Hannes Reinecke, Johannes Thumshirn, Kent Overstreet, Christoph Hellwig, linux-block On Thu 19-07-18 11:39:18, Martin Wilck wrote: > bio_iov_iter_get_pages() returns only pages for a single non-empty > segment of the input iov_iter's iovec. This may be much less than the number > of pages __blkdev_direct_IO_simple() is supposed to process. Call > bio_iov_iter_get_pages() repeatedly until either the requested number > of bytes is reached, or bio.bi_io_vec is exhausted. If this is not done, > short writes or reads may occur for direct synchronous IOs with multiple > iovec slots (such as generated by writev()). In that case, > __generic_file_write_iter() falls back to buffered writes, which > has been observed to cause data corruption in certain workloads. > > Note: if segments aren't page-aligned in the input iovec, this patch may > result in multiple adjacent slots of the bi_io_vec array to reference the same > page (the byte ranges are guaranteed to be disjunct if the preceding patch is > applied). We haven't seen problems with that in our and the customer's > tests. It'd be possible to detect this situation and merge bi_io_vec slots > that refer to the same page, but I prefer to keep it simple for now. 
> > Fixes: 72ecad22d9f1 ("block: support a full bio worth of IO for simplified bdev direct-io") > Signed-off-by: Martin Wilck <mwilck@suse.com> > --- > fs/block_dev.c | 8 +++++++- > 1 file changed, 7 insertions(+), 1 deletion(-) > > diff --git a/fs/block_dev.c b/fs/block_dev.c > index 0dd87aa..41643c4 100644 > --- a/fs/block_dev.c > +++ b/fs/block_dev.c > @@ -221,7 +221,12 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter, > > ret = bio_iov_iter_get_pages(&bio, iter); > if (unlikely(ret)) > - return ret; > + goto out; > + > + while (ret == 0 && > + bio.bi_vcnt < bio.bi_max_vecs && iov_iter_count(iter) > 0) > + ret = bio_iov_iter_get_pages(&bio, iter); > + I have two suggestions here (posting them now in public): Condition bio.bi_vcnt < bio.bi_max_vecs should always be true - we made sure we have enough vecs for pages in iter. So I'd WARN if this isn't true. Secondly, I don't think it is good to discard error from bio_iov_iter_get_pages() here and just submit partial IO. It will again lead to part of IO being done as direct and part attempted to be done as buffered. Also the "slow" direct IO path in __blkdev_direct_IO() behaves differently - it aborts and returns error if bio_iov_iter_get_pages() ever returned error. IMO we should do the same here. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 54+ messages in thread
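Jan's two suggestions (treat "the bio still has room" as an invariant and warn if it is violated, and abort rather than submit partial IO on error) can be modeled in isolation. `fill_or_fail`, `step_ok`, and `step_fail` below are hypothetical user-space stand-ins, not the kernel functions:

```c
struct mock_bio { int vcnt; int max_vecs; int size; };

/* step stands in for bio_iov_iter_get_pages(): on success it appends
 * pages and grows the bio, on failure it returns a negative errno. */
typedef int (*step_fn)(struct mock_bio *bio, int *remaining);

static int warned;	/* records a WARN_ON()-style complaint */

static int fill_or_fail(struct mock_bio *bio, int *remaining, step_fn step)
{
	int ret = 0;

	while (ret == 0 && *remaining > 0) {
		if (bio->vcnt >= bio->max_vecs) {
			warned = 1;	/* should be impossible: the vec */
			break;		/* array was sized to fit the iter */
		}
		ret = step(bio, remaining);
	}
	/* Jan's second point: an error fails the whole request instead
	 * of silently submitting the partial bio. */
	return ret ? ret : bio->size;
}

/* A well-behaved step: pins one single-page segment per call. */
static int step_ok(struct mock_bio *bio, int *remaining)
{
	bio->vcnt++;
	bio->size += 4096;
	(*remaining)--;
	return 0;
}

/* A failing step, e.g. the MM refusing to pin more pages. */
static int step_fail(struct mock_bio *bio, int *remaining)
{
	(void)bio;
	(void)remaining;
	return -14;	/* -EFAULT */
}
```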
* Re: [PATCH 2/2] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio 2018-07-19 10:45 ` Jan Kara @ 2018-07-19 12:23 ` Martin Wilck 2018-07-19 15:15 ` Jan Kara 0 siblings, 1 reply; 54+ messages in thread From: Martin Wilck @ 2018-07-19 12:23 UTC (permalink / raw) To: Jan Kara Cc: Jens Axboe, Ming Lei, Jan Kara, Hannes Reinecke, Johannes Thumshirn, Kent Overstreet, Christoph Hellwig, linux-block On Thu, 2018-07-19 at 12:45 +0200, Jan Kara wrote: > On Thu 19-07-18 11:39:18, Martin Wilck wrote: > > bio_iov_iter_get_pages() returns only pages for a single non-empty > > segment of the input iov_iter's iovec. This may be much less than > > the number > > of pages __blkdev_direct_IO_simple() is supposed to process. Call > > bio_iov_iter_get_pages() repeatedly until either the requested > > number > > of bytes is reached, or bio.bi_io_vec is exhausted. If this is not > > done, > > short writes or reads may occur for direct synchronous IOs with > > multiple > > iovec slots (such as generated by writev()). In that case, > > __generic_file_write_iter() falls back to buffered writes, which > > has been observed to cause data corruption in certain workloads. > > > > Note: if segments aren't page-aligned in the input iovec, this > > patch may > > result in multiple adjacent slots of the bi_io_vec array to > > reference the same > > page (the byte ranges are guaranteed to be disjunct if the > > preceding patch is > > applied). We haven't seen problems with that in our and the > > customer's > > tests. It'd be possible to detect this situation and merge > > bi_io_vec slots > > that refer to the same page, but I prefer to keep it simple for > > now. 
> > > > Fixes: 72ecad22d9f1 ("block: support a full bio worth of IO for > > simplified bdev direct-io") > > Signed-off-by: Martin Wilck <mwilck@suse.com> > > --- > > fs/block_dev.c | 8 +++++++- > > 1 file changed, 7 insertions(+), 1 deletion(-) > > > > diff --git a/fs/block_dev.c b/fs/block_dev.c > > index 0dd87aa..41643c4 100644 > > --- a/fs/block_dev.c > > +++ b/fs/block_dev.c > > @@ -221,7 +221,12 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, > > struct iov_iter *iter, > > > > ret = bio_iov_iter_get_pages(&bio, iter); > > if (unlikely(ret)) > > - return ret; > > + goto out; > > + > > + while (ret == 0 && > > + bio.bi_vcnt < bio.bi_max_vecs && > > iov_iter_count(iter) > 0) > > + ret = bio_iov_iter_get_pages(&bio, iter); > > + > > I have two suggestions here (posting them now in public): > > Condition bio.bi_vcnt < bio.bi_max_vecs should always be true - we > made > sure we have enough vecs for pages in iter. So I'd WARN if this isn't > true. Yeah. I wanted to add that to the patch. Slipped through, somehow. Sorry about that. > Secondly, I don't think it is good to discard error from > bio_iov_iter_get_pages() here and just submit partial IO. It will > again > lead to part of IO being done as direct and part attempted to be done > as > buffered. Also the "slow" direct IO path in __blkdev_direct_IO() > behaves > differently - it aborts and returns error if bio_iov_iter_get_pages() > ever > returned error. IMO we should do the same here. Well, it aborts the loop, but then (in the sync case) it still waits for the already submitted IOs to finish. Here, too, I'd find it more logical to return the number of successfully transmitted bytes rather than an error code. In the async case, the submitted bios are left in place, and will probably sooner or later finish, changing iocb->ki_pos. I'm actually not quite certain if that's correct. In the sync case, it causes the already-performed IO to be done again, buffered. 
In the async case, it may even cause two IOs for the same range to be in flight at the same time ... ? Martin -- Dr. Martin Wilck <mwilck@suse.com>, Tel. +49 (0)911 74053 2107 SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton HRB 21284 (AG Nürnberg) ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH 2/2] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio 2018-07-19 12:23 ` Martin Wilck @ 2018-07-19 15:15 ` Jan Kara 2018-07-19 20:01 ` Martin Wilck 0 siblings, 1 reply; 54+ messages in thread From: Jan Kara @ 2018-07-19 15:15 UTC (permalink / raw) To: Martin Wilck Cc: Jan Kara, Jens Axboe, Ming Lei, Jan Kara, Hannes Reinecke, Johannes Thumshirn, Kent Overstreet, Christoph Hellwig, linux-block On Thu 19-07-18 14:23:53, Martin Wilck wrote: > On Thu, 2018-07-19 at 12:45 +0200, Jan Kara wrote: > > Secondly, I don't think it is good to discard error from > > bio_iov_iter_get_pages() here and just submit partial IO. It will > > again > > lead to part of IO being done as direct and part attempted to be done > > as > > buffered. Also the "slow" direct IO path in __blkdev_direct_IO() > > behaves > > differently - it aborts and returns error if bio_iov_iter_get_pages() > > ever > > returned error. IMO we should do the same here. > > Well, it aborts the loop, but then (in the sync case) it still waits > for the already submitted IOs to finish. Here, too, I'd find it more > logical to return the number of successfully transmitted bytes rather > than an error code. In the async case, the submitted bios are left in > place, and will probably sooner or later finish, changing iocb->ki_pos. Well, both these behaviors make sense, just traditionally (defined by our implementation) DIO returns error even if part of IO has actually been successfully submitted. Making a userspace visible change like you suggest thus has to be very carefully analyzed and frankly I don't think it's worth the bother. > I'm actually not quite certain if that's correct. In the sync case, it > causes the already-performed IO to be done again, buffered. In the > async case, it it may even cause two IOs for the same range to be in > flight at the same time ... ? It doesn't cause IO to be done again. Look at __generic_file_write_iter(). 
If generic_file_direct_write() returned error, we immediately return error as well without retrying buffered IO. We only retry buffered IO for partial (or 0) return value. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 54+ messages in thread
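The dispatch Jan describes can be mimicked with a toy version of __generic_file_write_iter() (the names below are hypothetical sketches; the real function lives in mm/filemap.c and does much more):

```c
typedef long (*write_fn)(long count);

static long buffered_bytes;	/* bytes that took the buffered route */

static long mock_buffered_write(long count)
{
	buffered_bytes += count;
	return count;
}

/* An error from the direct path is returned as-is; only a partial (or
 * zero) direct write retries the *remainder* via buffered IO, so the
 * already-written direct part is never redone. */
static long mock_write_iter(long count, write_fn direct_write)
{
	long written = direct_write(count);
	long more;

	if (written < 0)
		return written;		/* no buffered fallback on error */
	if (written == count)
		return written;		/* fully direct */
	more = mock_buffered_write(count - written);
	return written + more;
}

/* Example direct-path outcomes. */
static long direct_full(long count) { return count; }
static long direct_half(long count) { return count / 2; }
static long direct_err(long count) { (void)count; return -5; /* -EIO */ }
```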
* Re: [PATCH 2/2] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio 2018-07-19 15:15 ` Jan Kara @ 2018-07-19 20:01 ` Martin Wilck 0 siblings, 0 replies; 54+ messages in thread From: Martin Wilck @ 2018-07-19 20:01 UTC (permalink / raw) To: Jan Kara Cc: Jens Axboe, Ming Lei, Jan Kara, Hannes Reinecke, Johannes Thumshirn, Kent Overstreet, Christoph Hellwig, linux-block On Thu, 2018-07-19 at 17:15 +0200, Jan Kara wrote: > On Thu 19-07-18 14:23:53, Martin Wilck wrote: > > On Thu, 2018-07-19 at 12:45 +0200, Jan Kara wrote: > > > Secondly, I don't think it is good to discard error from > > > bio_iov_iter_get_pages() here and just submit partial IO. It will > > > again > > > lead to part of IO being done as direct and part attempted to be > > > done > > > as > > > buffered. Also the "slow" direct IO path in __blkdev_direct_IO() > > > behaves > > > differently - it aborts and returns error if > > > bio_iov_iter_get_pages() > > > ever > > > returned error. IMO we should do the same here. > > > > Well, it aborts the loop, but then (in the sync case) it still > > waits > > for the already submitted IOs to finish. Here, too, I'd find it > > more > > logical to return the number of successfully transmitted bytes > > rather > > than an error code. In the async case, the submitted bios are left > > in > > place, and will probably sooner or later finish, changing iocb- > > >ki_pos. > > Well, both these behaviors make sense, just traditionally (defined by > our > implementation) DIO returns error even if part of IO has actually > been > successfully submitted. Making a userspace visible change like you > suggest > thus has to be very carefully analyzed and frankly I don't think it's > worth > the bother. OK, maybe not. I'll rework the treatment of the error case. Martin -- Dr. Martin Wilck <mwilck@suse.com>, Tel. 
+49 (0)911 74053 2107 SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton HRB 21284 (AG Nürnberg) ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH 2/2] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio 2018-07-19 9:39 ` [PATCH 2/2] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio Martin Wilck ` (2 preceding siblings ...) 2018-07-19 10:45 ` Jan Kara @ 2018-07-19 11:04 ` Ming Lei 2018-07-19 11:56 ` Jan Kara 3 siblings, 1 reply; 54+ messages in thread From: Ming Lei @ 2018-07-19 11:04 UTC (permalink / raw) To: Martin Wilck Cc: Jens Axboe, Jan Kara, Hannes Reinecke, Johannes Thumshirn, Kent Overstreet, Christoph Hellwig, linux-block On Thu, Jul 19, 2018 at 11:39:18AM +0200, Martin Wilck wrote: > bio_iov_iter_get_pages() returns only pages for a single non-empty > segment of the input iov_iter's iovec. This may be much less than the number > of pages __blkdev_direct_IO_simple() is supposed to process. Call > bio_iov_iter_get_pages() repeatedly until either the requested number > of bytes is reached, or bio.bi_io_vec is exhausted. If this is not done, > short writes or reads may occur for direct synchronous IOs with multiple > iovec slots (such as generated by writev()). In that case, > __generic_file_write_iter() falls back to buffered writes, which > has been observed to cause data corruption in certain workloads. > > Note: if segments aren't page-aligned in the input iovec, this patch may > result in multiple adjacent slots of the bi_io_vec array to reference the same > page (the byte ranges are guaranteed to be disjunct if the preceding patch is > applied). We haven't seen problems with that in our and the customer's > tests. It'd be possible to detect this situation and merge bi_io_vec slots > that refer to the same page, but I prefer to keep it simple for now. 
> > Fixes: 72ecad22d9f1 ("block: support a full bio worth of IO for simplified bdev direct-io") > Signed-off-by: Martin Wilck <mwilck@suse.com> > --- > fs/block_dev.c | 8 +++++++- > 1 file changed, 7 insertions(+), 1 deletion(-) > > diff --git a/fs/block_dev.c b/fs/block_dev.c > index 0dd87aa..41643c4 100644 > --- a/fs/block_dev.c > +++ b/fs/block_dev.c > @@ -221,7 +221,12 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter, > > ret = bio_iov_iter_get_pages(&bio, iter); > if (unlikely(ret)) > - return ret; > + goto out; > + > + while (ret == 0 && > + bio.bi_vcnt < bio.bi_max_vecs && iov_iter_count(iter) > 0) > + ret = bio_iov_iter_get_pages(&bio, iter); > + > ret = bio.bi_iter.bi_size; > > if (iov_iter_rw(iter) == READ) { > @@ -250,6 +255,7 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter, > put_page(bvec->bv_page); > } > > +out: > if (vecs != inline_vecs) > kfree(vecs); > You might put the 'vecs' leak fix into another patch, and reuse the current code block for that. Looks like all users of bio_iov_iter_get_pages() need this kind of fix, so what do you think about the following way? diff --git a/block/bio.c b/block/bio.c index f3536bfc8298..23dd4c163dfc 100644 --- a/block/bio.c +++ b/block/bio.c @@ -904,15 +904,7 @@ int bio_add_page(struct bio *bio, struct page *page, } EXPORT_SYMBOL(bio_add_page); -/** - * bio_iov_iter_get_pages - pin user or kernel pages and add them to a bio - * @bio: bio to add pages to - * @iter: iov iterator describing the region to be mapped - * - * Pins as many pages from *iter and appends them to @bio's bvec array. The - * pages will have to be released using put_page() when done. 
- */ -int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter) +static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter) { unsigned short nr_pages = bio->bi_max_vecs - bio->bi_vcnt; struct bio_vec *bv = bio->bi_io_vec + bio->bi_vcnt; @@ -951,6 +943,28 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter) iov_iter_advance(iter, size); return 0; } + +/** + * bio_iov_iter_get_pages - pin user or kernel pages and add them to a bio + * @bio: bio to add pages to + * @iter: iov iterator describing the region to be mapped + * + * Pins as many pages from *iter and appends them to @bio's bvec array. The + * pages will have to be released using put_page() when done. + */ +int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter) +{ + int ret; + unsigned int size; + + do { + size = bio->bi_iter.bi_size; + ret = __bio_iov_iter_get_pages(bio, iter); + } while (!bio_full(bio) && iov_iter_count(iter) > 0 && + bio->bi_iter.bi_size > size); + + return bio->bi_iter.bi_size > 0 ? 0 : ret; +} EXPORT_SYMBOL_GPL(bio_iov_iter_get_pages); static void submit_bio_wait_endio(struct bio *bio) Thanks, Ming ^ permalink raw reply related [flat|nested] 54+ messages in thread
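The return-value convention in Ming's proposed wrapper — report success if the bio collected anything, and surface the step's error only when the bio stayed empty — can be checked with a small model (mock names and values, not the kernel API):

```c
struct mbio { int size; int capacity; };

/* One iteration step: a positive seg adds that many bytes; a
 * non-positive seg models the step failing with that errno. */
static int mock_step(struct mbio *bio, int seg)
{
	if (seg <= 0)
		return seg ? seg : -22;	/* -EINVAL for an empty segment */
	bio->size += seg;
	return 0;
}

/* Mirrors the do/while in Ming's bio_iov_iter_get_pages() proposal:
 * loop while the bio keeps growing, then mask a late error if some
 * data was already collected. */
static int mock_get_pages_loop(struct mbio *bio, const int *segs, int nsegs)
{
	int ret = 0, i = 0;

	do {
		int size = bio->size;

		ret = mock_step(bio, segs[i++]);
		if (bio->size == size)	/* no progress: error or empty */
			break;
	} while (bio->size < bio->capacity && i < nsegs);

	return bio->size > 0 ? 0 : ret;
}
```

An early failure propagates the error; a failure after partial progress is reported as success, and the caller simply sees a shorter bio.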
* Re: [PATCH 2/2] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio 2018-07-19 11:04 ` Ming Lei @ 2018-07-19 11:56 ` Jan Kara 2018-07-19 12:20 ` Ming Lei 2018-07-19 12:25 ` Martin Wilck 0 siblings, 2 replies; 54+ messages in thread From: Jan Kara @ 2018-07-19 11:56 UTC (permalink / raw) To: Ming Lei Cc: Martin Wilck, Jens Axboe, Jan Kara, Hannes Reinecke, Johannes Thumshirn, Kent Overstreet, Christoph Hellwig, linux-block On Thu 19-07-18 19:04:46, Ming Lei wrote: > On Thu, Jul 19, 2018 at 11:39:18AM +0200, Martin Wilck wrote: > > bio_iov_iter_get_pages() returns only pages for a single non-empty > > segment of the input iov_iter's iovec. This may be much less than the number > > of pages __blkdev_direct_IO_simple() is supposed to process. Call > > bio_iov_iter_get_pages() repeatedly until either the requested number > > of bytes is reached, or bio.bi_io_vec is exhausted. If this is not done, > > short writes or reads may occur for direct synchronous IOs with multiple > > iovec slots (such as generated by writev()). In that case, > > __generic_file_write_iter() falls back to buffered writes, which > > has been observed to cause data corruption in certain workloads. > > > > Note: if segments aren't page-aligned in the input iovec, this patch may > > result in multiple adjacent slots of the bi_io_vec array to reference the same > > page (the byte ranges are guaranteed to be disjunct if the preceding patch is > > applied). We haven't seen problems with that in our and the customer's > > tests. It'd be possible to detect this situation and merge bi_io_vec slots > > that refer to the same page, but I prefer to keep it simple for now. 
> > > > Fixes: 72ecad22d9f1 ("block: support a full bio worth of IO for simplified bdev direct-io") > > Signed-off-by: Martin Wilck <mwilck@suse.com> > > --- > > fs/block_dev.c | 8 +++++++- > > 1 file changed, 7 insertions(+), 1 deletion(-) > > > > diff --git a/fs/block_dev.c b/fs/block_dev.c > > index 0dd87aa..41643c4 100644 > > --- a/fs/block_dev.c > > +++ b/fs/block_dev.c > > @@ -221,7 +221,12 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter, > > > > ret = bio_iov_iter_get_pages(&bio, iter); > > if (unlikely(ret)) > > - return ret; > > + goto out; > > + > > + while (ret == 0 && > > + bio.bi_vcnt < bio.bi_max_vecs && iov_iter_count(iter) > 0) > > + ret = bio_iov_iter_get_pages(&bio, iter); > > + > > ret = bio.bi_iter.bi_size; > > > > if (iov_iter_rw(iter) == READ) { > > @@ -250,6 +255,7 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter, > > put_page(bvec->bv_page); > > } > > > > +out: > > if (vecs != inline_vecs) > > kfree(vecs); > > > > You might put the 'vecs' leak fix into another patch, and resue the > current code block for that. > > Looks all users of bio_iov_iter_get_pages() need this kind of fix, so > what do you think about the following way? No. AFAICT all the other users of bio_iov_iter_get_pages() are perfectly fine with it returning less pages and they loop appropriately. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH 2/2] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio 2018-07-19 11:56 ` Jan Kara @ 2018-07-19 12:20 ` Ming Lei 2018-07-19 15:21 ` Jan Kara 2018-07-19 12:25 ` Martin Wilck 1 sibling, 1 reply; 54+ messages in thread From: Ming Lei @ 2018-07-19 12:20 UTC (permalink / raw) To: Jan Kara Cc: Martin Wilck, Jens Axboe, Jan Kara, Hannes Reinecke, Johannes Thumshirn, Kent Overstreet, Christoph Hellwig, linux-block On Thu, Jul 19, 2018 at 01:56:16PM +0200, Jan Kara wrote: > On Thu 19-07-18 19:04:46, Ming Lei wrote: > > On Thu, Jul 19, 2018 at 11:39:18AM +0200, Martin Wilck wrote: > > > bio_iov_iter_get_pages() returns only pages for a single non-empty > > > segment of the input iov_iter's iovec. This may be much less than the number > > > of pages __blkdev_direct_IO_simple() is supposed to process. Call > > > bio_iov_iter_get_pages() repeatedly until either the requested number > > > of bytes is reached, or bio.bi_io_vec is exhausted. If this is not done, > > > short writes or reads may occur for direct synchronous IOs with multiple > > > iovec slots (such as generated by writev()). In that case, > > > __generic_file_write_iter() falls back to buffered writes, which > > > has been observed to cause data corruption in certain workloads. > > > > > > Note: if segments aren't page-aligned in the input iovec, this patch may > > > result in multiple adjacent slots of the bi_io_vec array to reference the same > > > page (the byte ranges are guaranteed to be disjunct if the preceding patch is > > > applied). We haven't seen problems with that in our and the customer's > > > tests. It'd be possible to detect this situation and merge bi_io_vec slots > > > that refer to the same page, but I prefer to keep it simple for now. 
> > > > > > Fixes: 72ecad22d9f1 ("block: support a full bio worth of IO for simplified bdev direct-io") > > > Signed-off-by: Martin Wilck <mwilck@suse.com> > > > --- > > > fs/block_dev.c | 8 +++++++- > > > 1 file changed, 7 insertions(+), 1 deletion(-) > > > > > > diff --git a/fs/block_dev.c b/fs/block_dev.c > > > index 0dd87aa..41643c4 100644 > > > --- a/fs/block_dev.c > > > +++ b/fs/block_dev.c > > > @@ -221,7 +221,12 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter, > > > > > > ret = bio_iov_iter_get_pages(&bio, iter); > > > if (unlikely(ret)) > > > - return ret; > > > + goto out; > > > + > > > + while (ret == 0 && > > > + bio.bi_vcnt < bio.bi_max_vecs && iov_iter_count(iter) > 0) > > > + ret = bio_iov_iter_get_pages(&bio, iter); > > > + > > > ret = bio.bi_iter.bi_size; > > > > > > if (iov_iter_rw(iter) == READ) { > > > @@ -250,6 +255,7 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter, > > > put_page(bvec->bv_page); > > > } > > > > > > +out: > > > if (vecs != inline_vecs) > > > kfree(vecs); > > > > > > > You might put the 'vecs' leak fix into another patch, and resue the > > current code block for that. > > > > Looks all users of bio_iov_iter_get_pages() need this kind of fix, so > > what do you think about the following way? > > No. AFAICT all the other users of bio_iov_iter_get_pages() are perfectly > fine with it returning less pages and they loop appropriately. OK, but this way still may make one bio to hold more data, especially the comment of bio_iov_iter_get_pages() says that 'Pins as many pages from *iter', so looks it is the correct way to do. thanks, Ming ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH 2/2] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio 2018-07-19 12:20 ` Ming Lei @ 2018-07-19 15:21 ` Jan Kara 2018-07-19 19:06 ` Martin Wilck 0 siblings, 1 reply; 54+ messages in thread From: Jan Kara @ 2018-07-19 15:21 UTC (permalink / raw) To: Ming Lei Cc: Jan Kara, Martin Wilck, Jens Axboe, Jan Kara, Hannes Reinecke, Johannes Thumshirn, Kent Overstreet, Christoph Hellwig, linux-block On Thu 19-07-18 20:20:51, Ming Lei wrote: > On Thu, Jul 19, 2018 at 01:56:16PM +0200, Jan Kara wrote: > > On Thu 19-07-18 19:04:46, Ming Lei wrote: > > > On Thu, Jul 19, 2018 at 11:39:18AM +0200, Martin Wilck wrote: > > > > bio_iov_iter_get_pages() returns only pages for a single non-empty > > > > segment of the input iov_iter's iovec. This may be much less than the number > > > > of pages __blkdev_direct_IO_simple() is supposed to process. Call > > > > bio_iov_iter_get_pages() repeatedly until either the requested number > > > > of bytes is reached, or bio.bi_io_vec is exhausted. If this is not done, > > > > short writes or reads may occur for direct synchronous IOs with multiple > > > > iovec slots (such as generated by writev()). In that case, > > > > __generic_file_write_iter() falls back to buffered writes, which > > > > has been observed to cause data corruption in certain workloads. > > > > > > > > Note: if segments aren't page-aligned in the input iovec, this patch may > > > > result in multiple adjacent slots of the bi_io_vec array to reference the same > > > > page (the byte ranges are guaranteed to be disjunct if the preceding patch is > > > > applied). We haven't seen problems with that in our and the customer's > > > > tests. It'd be possible to detect this situation and merge bi_io_vec slots > > > > that refer to the same page, but I prefer to keep it simple for now. 
> > > > > > > > Fixes: 72ecad22d9f1 ("block: support a full bio worth of IO for simplified bdev direct-io") > > > > Signed-off-by: Martin Wilck <mwilck@suse.com> > > > > --- > > > > fs/block_dev.c | 8 +++++++- > > > > 1 file changed, 7 insertions(+), 1 deletion(-) > > > > > > > > diff --git a/fs/block_dev.c b/fs/block_dev.c > > > > index 0dd87aa..41643c4 100644 > > > > --- a/fs/block_dev.c > > > > +++ b/fs/block_dev.c > > > > @@ -221,7 +221,12 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter, > > > > > > > > ret = bio_iov_iter_get_pages(&bio, iter); > > > > if (unlikely(ret)) > > > > - return ret; > > > > + goto out; > > > > + > > > > + while (ret == 0 && > > > > + bio.bi_vcnt < bio.bi_max_vecs && iov_iter_count(iter) > 0) > > > > + ret = bio_iov_iter_get_pages(&bio, iter); > > > > + > > > > ret = bio.bi_iter.bi_size; > > > > > > > > if (iov_iter_rw(iter) == READ) { > > > > @@ -250,6 +255,7 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter, > > > > put_page(bvec->bv_page); > > > > } > > > > > > > > +out: > > > > if (vecs != inline_vecs) > > > > kfree(vecs); > > > > > > > > > > You might put the 'vecs' leak fix into another patch, and resue the > > > current code block for that. > > > > > > Looks all users of bio_iov_iter_get_pages() need this kind of fix, so > > > what do you think about the following way? > > > > No. AFAICT all the other users of bio_iov_iter_get_pages() are perfectly > > fine with it returning less pages and they loop appropriately. > > OK, but this way still may make one bio to hold more data, especially > the comment of bio_iov_iter_get_pages() says that 'Pins as many pages from > *iter', so looks it is the correct way to do. Well, but as Al pointed out MM may decide that user has already pinned too many pages and refuse to pin more. So pinning full iter worth of pages may just be impossible. 
Currently, there are no checks like this in MM but eventually, I'd like to account pinned pages in the mlock ulimit (or a limit of similar kind) and then what Al speaks about would become very real. So I'd prefer to not develop new locations that must be able to pin an arbitrary amount of pages. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH 2/2] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio
From: Martin Wilck @ 2018-07-19 19:06 UTC
To: Jan Kara, Ming Lei
Cc: Jens Axboe, Jan Kara, Hannes Reinecke, Johannes Thumshirn,
    Kent Overstreet, Christoph Hellwig, linux-block

On Thu, 2018-07-19 at 17:21 +0200, Jan Kara wrote:
> On Thu 19-07-18 20:20:51, Ming Lei wrote:
> > On Thu, Jul 19, 2018 at 01:56:16PM +0200, Jan Kara wrote:
> > > On Thu 19-07-18 19:04:46, Ming Lei wrote:
> > > > On Thu, Jul 19, 2018 at 11:39:18AM +0200, Martin Wilck wrote:
> > > > > bio_iov_iter_get_pages() returns only pages for a single
> > > > > non-empty segment of the input iov_iter's iovec. This may be
> > > > > much less than the number of pages __blkdev_direct_IO_simple()
> > > > > is supposed to process. Call bio_iov_iter_get_pages()
> > > > > repeatedly until either the requested number of bytes is
> > > > > reached, or bio.bi_io_vec is exhausted. If this is not done,
> > > > > short writes or reads may occur for direct synchronous IOs
> > > > > with multiple iovec slots (such as generated by writev()). In
> > > > > that case, __generic_file_write_iter() falls back to buffered
> > > > > writes, which has been observed to cause data corruption in
> > > > > certain workloads.
> > > > >
> > > > > Note: if segments aren't page-aligned in the input iovec, this
> > > > > patch may result in multiple adjacent slots of the bi_io_vec
> > > > > array referencing the same page (the byte ranges are
> > > > > guaranteed to be disjunct if the preceding patch is applied).
> > > > > We haven't seen problems with that in our and the customer's
> > > > > tests. It'd be possible to detect this situation and merge
> > > > > bi_io_vec slots that refer to the same page, but I prefer to
> > > > > keep it simple for now.
> > > > >
> > > > > Fixes: 72ecad22d9f1 ("block: support a full bio worth of IO
> > > > > for simplified bdev direct-io")
> > > > > Signed-off-by: Martin Wilck <mwilck@suse.com>
> > > > > ---
> > > > >  fs/block_dev.c | 8 +++++++-
> > > > >  1 file changed, 7 insertions(+), 1 deletion(-)
> > > > >
> > > > > diff --git a/fs/block_dev.c b/fs/block_dev.c
> > > > > index 0dd87aa..41643c4 100644
> > > > > --- a/fs/block_dev.c
> > > > > +++ b/fs/block_dev.c
> > > > > @@ -221,7 +221,12 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter,
> > > > >
> > > > >  	ret = bio_iov_iter_get_pages(&bio, iter);
> > > > >  	if (unlikely(ret))
> > > > > -		return ret;
> > > > > +		goto out;
> > > > > +
> > > > > +	while (ret == 0 &&
> > > > > +	       bio.bi_vcnt < bio.bi_max_vecs && iov_iter_count(iter) > 0)
> > > > > +		ret = bio_iov_iter_get_pages(&bio, iter);
> > > > > +
> > > > >  	ret = bio.bi_iter.bi_size;
> > > > >
> > > > >  	if (iov_iter_rw(iter) == READ) {
> > > > > @@ -250,6 +255,7 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter,
> > > > >  		put_page(bvec->bv_page);
> > > > >  	}
> > > > >
> > > > > +out:
> > > > >  	if (vecs != inline_vecs)
> > > > >  		kfree(vecs);
> > > > >
> > > >
> > > > You might put the 'vecs' leak fix into another patch, and reuse
> > > > the current code block for that.
> > > >
> > > > Looks like all users of bio_iov_iter_get_pages() need this kind
> > > > of fix, so what do you think about the following way?
> > >
> > > No. AFAICT all the other users of bio_iov_iter_get_pages() are
> > > perfectly fine with it returning less pages and they loop
> > > appropriately.
> >
> > OK, but this way still may make one bio hold more data, and the
> > comment of bio_iov_iter_get_pages() says that it 'Pins as many
> > pages from *iter', so it looks like the correct way to do it.
>
> Well, but as Al pointed out, MM may decide that the user has already
> pinned too many pages and refuse to pin more. So pinning a full iter
> worth of pages may just be impossible. Currently, there are no checks
> like this in MM, but eventually I'd like to account pinned pages in
> the mlock ulimit (or a limit of a similar kind), and then what Al
> speaks about would become very real. So I'd prefer not to develop new
> locations that must be able to pin an arbitrary amount of pages.

Doesn't this wrongfully penalize scatter-gather IO? Consider two 1 MiB
IO requests, one with a single 1 MiB slot, and one with 256 single-page
slots. Calling bio_iov_iter_get_pages() once will result in a 1 MiB bio
in the first case, and a 4k bio in the second. You need to submit 256
individual bios to finish the IO. By using the approach which my patch
takes (repeatedly calling bio_iov_iter_get_pages() before submitting
the bio), you'd get a 1 MiB bio in both cases.

So Ming's idea to generalize this may not be so bad, after all. I don't
see that it would increase the pressure on MM. If we are concerned
about MM, we should limit the total number of pages that can be pinned,
rather than the number of iovec segments that are processed.

Martin

-- 
Dr. Martin Wilck <mwilck@suse.com>, Tel. +49 (0)911 74053 2107
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
* Re: [PATCH 2/2] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio
From: Martin Wilck @ 2018-07-19 12:25 UTC
To: Jan Kara, Ming Lei
Cc: Jens Axboe, Jan Kara, Hannes Reinecke, Johannes Thumshirn,
    Kent Overstreet, Christoph Hellwig, linux-block

On Thu, 2018-07-19 at 13:56 +0200, Jan Kara wrote:
> On Thu 19-07-18 19:04:46, Ming Lei wrote:
> > On Thu, Jul 19, 2018 at 11:39:18AM +0200, Martin Wilck wrote:
> > > bio_iov_iter_get_pages() returns only pages for a single non-empty
> > > segment of the input iov_iter's iovec. This may be much less than
> > > the number of pages __blkdev_direct_IO_simple() is supposed to
> > > process. Call bio_iov_iter_get_pages() repeatedly until either the
> > > requested number of bytes is reached, or bio.bi_io_vec is
> > > exhausted. If this is not done, short writes or reads may occur
> > > for direct synchronous IOs with multiple iovec slots (such as
> > > generated by writev()). In that case,
> > > __generic_file_write_iter() falls back to buffered writes, which
> > > has been observed to cause data corruption in certain workloads.
> > >
> > > Note: if segments aren't page-aligned in the input iovec, this
> > > patch may result in multiple adjacent slots of the bi_io_vec array
> > > referencing the same page (the byte ranges are guaranteed to be
> > > disjunct if the preceding patch is applied). We haven't seen
> > > problems with that in our and the customer's tests. It'd be
> > > possible to detect this situation and merge bi_io_vec slots that
> > > refer to the same page, but I prefer to keep it simple for now.
> > >
> > > Fixes: 72ecad22d9f1 ("block: support a full bio worth of IO for
> > > simplified bdev direct-io")
> > > Signed-off-by: Martin Wilck <mwilck@suse.com>
> > > ---
> > >  fs/block_dev.c | 8 +++++++-
> > >  1 file changed, 7 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/fs/block_dev.c b/fs/block_dev.c
> > > index 0dd87aa..41643c4 100644
> > > --- a/fs/block_dev.c
> > > +++ b/fs/block_dev.c
> > > @@ -221,7 +221,12 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter,
> > >
> > >  	ret = bio_iov_iter_get_pages(&bio, iter);
> > >  	if (unlikely(ret))
> > > -		return ret;
> > > +		goto out;
> > > +
> > > +	while (ret == 0 &&
> > > +	       bio.bi_vcnt < bio.bi_max_vecs && iov_iter_count(iter) > 0)
> > > +		ret = bio_iov_iter_get_pages(&bio, iter);
> > > +
> > >  	ret = bio.bi_iter.bi_size;
> > >
> > >  	if (iov_iter_rw(iter) == READ) {
> > > @@ -250,6 +255,7 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter,
> > >  		put_page(bvec->bv_page);
> > >  	}
> > >
> > > +out:
> > >  	if (vecs != inline_vecs)
> > >  		kfree(vecs);
> > >
> >
> > You might put the 'vecs' leak fix into another patch, and reuse the
> > current code block for that.
> >
> > Looks like all users of bio_iov_iter_get_pages() need this kind of
> > fix, so what do you think about the following way?
>
> No. AFAICT all the other users of bio_iov_iter_get_pages() are
> perfectly fine with it returning less pages and they loop
> appropriately.

We might consider filling the bios better in __blkdev_direct_IO(),
too; it might be a performance gain.

Regards
Martin

-- 
Dr. Martin Wilck <mwilck@suse.com>, Tel. +49 (0)911 74053 2107
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
* Re: [PATCH 0/2] Fix silent data corruption in blkdev_direct_IO()
From: Hannes Reinecke @ 2018-07-19 10:08 UTC
To: Martin Wilck, Jens Axboe, Ming Lei, Jan Kara
Cc: Johannes Thumshirn, Kent Overstreet, Christoph Hellwig, linux-block

On 07/19/2018 11:39 AM, Martin Wilck wrote:
> Hello Jens, Ming, Jan, and all others,
>
> the following patches have been verified by a customer to fix a silent
> data corruption which he has been seeing since "72ecad2 block: support
> a full bio worth of IO for simplified bdev direct-io".
>
> The patches are based on our observation that the corruption is only
> observed if the __blkdev_direct_IO_simple() code path is executed,
> and if that happens, "short writes" are observed in this code path,
> which causes a fallback to buffered IO, while the application
> continues submitting direct IO requests.
>
> For the first patch, an alternative solution by Christoph Hellwig
> exists:
>
> https://marc.info/?l=linux-kernel&m=153013977816825&w=2
>
> While I believe that Christoph's patch is correct, the one presented
> here is smaller. Ming has suggested to use Christoph's for mainline
> and mine for -stable.
>
> Wrt the second patch, we've had an internal discussion at SUSE how to
> handle (unlikely) error conditions from bio_iov_iter_get_pages(). The
> patch presented here tries to submit as much IO as possible via the
> direct path even in the error case, while Jan Kara suggested to
> abort, not submit any IO, and fall back to buffered IO in that case.
>
> Looking forward to your opinions and suggestions.

Can you please add the test program from Jan Kara to the blktests
suite? This issue is something we really should cover there, too,
seeing the amount of time we've spent debugging it...

Cheers,
Hannes

-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
* Re: [PATCH 0/2] Fix silent data corruption in blkdev_direct_IO()
From: Christoph Hellwig @ 2018-07-19 14:50 UTC
To: Martin Wilck
Cc: Jens Axboe, Ming Lei, Jan Kara, Hannes Reinecke, Johannes Thumshirn,
    Kent Overstreet, Christoph Hellwig, linux-block

Martin,

please NEVER send a patch series as a reply to a previous thread.
That makes it complete hell to find in the inbox.
end of thread, other threads: [~2018-07-19 20:01 UTC | newest]

Thread overview: 54+ messages

2018-07-12 14:36 Silent data corruption in blkdev_direct_IO() Hannes Reinecke
2018-07-12 15:08 ` Jens Axboe
2018-07-12 16:11 ` Martin Wilck
2018-07-12 16:14 ` Hannes Reinecke
2018-07-12 16:20 ` Jens Axboe
2018-07-12 16:42 ` Jens Axboe
2018-07-13  6:47 ` Martin Wilck
2018-07-13 16:56 ` Martin Wilck
2018-07-13 18:00 ` Jens Axboe
2018-07-13 18:50 ` Jens Axboe
2018-07-13 22:21 ` Martin Wilck
2018-07-13 20:48 ` Martin Wilck
2018-07-13 20:52 ` Jens Axboe
2018-07-16 19:05 ` Martin Wilck
2018-07-12 23:29 ` Ming Lei
2018-07-13 18:54 ` Jens Axboe
2018-07-13 22:29 ` Martin Wilck
2018-07-16 11:45 ` Ming Lei
2018-07-18  0:07 ` Martin Wilck
2018-07-18  2:48 ` Ming Lei
2018-07-18  7:32 ` Martin Wilck
2018-07-18  7:54 ` Ming Lei
2018-07-18  9:20 ` Johannes Thumshirn
2018-07-18 11:40 ` Jan Kara
2018-07-18 11:57 ` Jan Kara
2018-07-19  9:39 ` [PATCH 0/2] Fix silent data corruption in blkdev_direct_IO() Martin Wilck
2018-07-19  9:39 ` [PATCH 1/2] block: bio_iov_iter_get_pages: fix size of last iovec Martin Wilck
2018-07-19 10:05 ` Hannes Reinecke
2018-07-19 10:09 ` Ming Lei
2018-07-19 10:20 ` Jan Kara
2018-07-19 14:52 ` Christoph Hellwig
2018-07-19  9:39 ` [PATCH 2/2] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio Martin Wilck
2018-07-19 10:06 ` Hannes Reinecke
2018-07-19 10:21 ` Ming Lei
2018-07-19 10:37 ` Jan Kara
2018-07-19 10:46 ` Ming Lei
2018-07-19 11:08 ` Al Viro
2018-07-19 14:53 ` Christoph Hellwig
2018-07-19 15:06 ` Jan Kara
2018-07-19 15:11 ` Christoph Hellwig
2018-07-19 19:21 ` Martin Wilck
2018-07-19 19:34 ` Martin Wilck
2018-07-19 10:45 ` Jan Kara
2018-07-19 12:23 ` Martin Wilck
2018-07-19 15:15 ` Jan Kara
2018-07-19 20:01 ` Martin Wilck
2018-07-19 11:04 ` Ming Lei
2018-07-19 11:56 ` Jan Kara
2018-07-19 12:20 ` Ming Lei
2018-07-19 15:21 ` Jan Kara
2018-07-19 19:06 ` Martin Wilck
2018-07-19 12:25 ` Martin Wilck
2018-07-19 10:08 ` [PATCH 0/2] Fix silent data corruption in blkdev_direct_IO() Hannes Reinecke
2018-07-19 14:50 ` Christoph Hellwig