* [PATCH] fs: Fix page cache inconsistency when mixing buffered and AIO DIO @ 2017-07-13 15:17 Lukas Czerner 2017-07-14 10:41 ` kbuild test robot 2017-07-14 15:40 ` [PATCH v2] " Lukas Czerner 0 siblings, 2 replies; 40+ messages in thread From: Lukas Czerner @ 2017-07-13 15:17 UTC (permalink / raw) To: linux-fsdevel; +Cc: viro, jack, Lukas Czerner, Jeff Moyer Currently when mixing buffered reads and asynchronous direct writes it is possible to end up with the situation where we have stale data in the page cache while the new data is already written to disk. This is permanent until the affected pages are flushed away. Despite the fact that mixing buffered and direct IO is ill-advised, it does pose a threat to data integrity, is unexpected, and should be fixed. Fix this by deferring completion of asynchronous direct writes to a process context in the case that there are mapped pages to be found in the inode. Later, before the completion in dio_complete(), invalidate the pages in question. This ensures that after the completion the pages in the written area are either unmapped or populated with up-to-date data. Also do the same for the iomap case, which uses iomap_dio_complete() instead. This has a side effect of deferring the completion to a process context for every AIO DIO that happens on an inode that has pages mapped. However, since the consensus is that this is an ill-advised practice, the performance implication should not be a problem. This was based on a proposal from Jeff Moyer, thanks!
Signed-off-by: Lukas Czerner <lczerner@redhat.com> Cc: Jeff Moyer <jmoyer@redhat.com> --- fs/direct-io.c | 31 ++++++++++++++++++++++++++----- fs/iomap.c | 7 +++++++ 2 files changed, 33 insertions(+), 5 deletions(-) diff --git a/fs/direct-io.c b/fs/direct-io.c index 08cf278..2db9ada 100644 --- a/fs/direct-io.c +++ b/fs/direct-io.c @@ -258,6 +258,11 @@ static ssize_t dio_complete(struct dio *dio, ssize_t ret, bool is_async) if (ret == 0) ret = transferred; + if ((dio->op == REQ_OP_WRITE && dio->inode->i_mapping->nrpages)) + invalidate_inode_pages2_range(dio->inode->i_mapping, + offset >> PAGE_SHIFT, + (offset + ret - 1) >> PAGE_SHIFT); + if (dio->end_io) { int err; @@ -304,6 +309,7 @@ static void dio_bio_end_aio(struct bio *bio) struct dio *dio = bio->bi_private; unsigned long remaining; unsigned long flags; + bool defer_completion = false; /* cleanup the bio */ dio_bio_complete(dio, bio); @@ -315,7 +321,19 @@ static void dio_bio_end_aio(struct bio *bio) spin_unlock_irqrestore(&dio->bio_lock, flags); if (remaining == 0) { - if (dio->result && dio->defer_completion) { + /* + * Defer completion when defer_completion is set or + * when the inode has pages mapped and this is AIO write. + * We need to invalidate those pages because there is a + * chance they contain stale data in the case buffered IO + * went in between AIO submission and completion into the + * same region. + */ + if (dio->result) + defer_completion = dio->defer_completion || + (dio->op == REQ_OP_WRITE && + dio->inode->i_mapping->nrpages); + if (defer_completion) { INIT_WORK(&dio->complete_work, dio_aio_complete_work); queue_work(dio->inode->i_sb->s_dio_done_wq, &dio->complete_work); @@ -1210,10 +1228,13 @@ do_blockdev_direct_IO(struct kiocb *iocb, struct inode *inode, * For AIO O_(D)SYNC writes we need to defer completions to a workqueue * so that we can call ->fsync. 
*/ - if (dio->is_async && iov_iter_rw(iter) == WRITE && - ((iocb->ki_filp->f_flags & O_DSYNC) || - IS_SYNC(iocb->ki_filp->f_mapping->host))) { - retval = dio_set_defer_completion(dio); + if (dio->is_async && iov_iter_rw(iter) == WRITE) { + retval = 0; + if ((iocb->ki_filp->f_flags & O_DSYNC) || + IS_SYNC(iocb->ki_filp->f_mapping->host)) + retval = dio_set_defer_completion(dio); + else if (!dio->inode->i_sb->s_dio_done_wq) + retval = sb_init_dio_done_wq(dio->inode->i_sb); if (retval) { /* * We grab i_mutex only for reads so we don't have diff --git a/fs/iomap.c b/fs/iomap.c index 1732228..a1ad4ca 100644 --- a/fs/iomap.c +++ b/fs/iomap.c @@ -713,8 +713,15 @@ struct iomap_dio { static ssize_t iomap_dio_complete(struct iomap_dio *dio) { struct kiocb *iocb = dio->iocb; + loff_t offset = iocb->ki_pos; + struct inode *inode = file_inode(iocb->ki_filp); ssize_t ret; + if ((dio->flags & IOMAP_DIO_WRITE) && inode->i_mapping->nrpages) + invalidate_inode_pages2_range(inode->i_mapping, + offset >> PAGE_SHIFT, + (offset + dio->size + ret - 1) >> PAGE_SHIFT); + if (dio->end_io) { ret = dio->end_io(iocb, dio->error ? dio->error : dio->size, -- 2.7.5 ^ permalink raw reply related [flat|nested] 40+ messages in thread
* Re: [PATCH] fs: Fix page cache inconsistency when mixing buffered and AIO DIO 2017-07-13 15:17 [PATCH] fs: Fix page cache inconsistency when mixing buffered and AIO DIO Lukas Czerner @ 2017-07-14 10:41 ` kbuild test robot 2017-07-14 13:40 ` Lukas Czerner 2017-07-14 15:40 ` [PATCH v2] " Lukas Czerner 1 sibling, 1 reply; 40+ messages in thread From: kbuild test robot @ 2017-07-14 10:41 UTC (permalink / raw) To: Lukas Czerner Cc: kbuild-all, linux-fsdevel, viro, jack, Lukas Czerner, Jeff Moyer [-- Attachment #1: Type: text/plain, Size: 2212 bytes --] Hi Lukas, [auto build test WARNING on linus/master] [also build test WARNING on v4.12 next-20170713] [if your patch is applied to the wrong git tree, please drop us a note to help improve the system] url: https://github.com/0day-ci/linux/commits/Lukas-Czerner/fs-Fix-page-cache-inconsistency-when-mixing-buffered-and-AIO-DIO/20170714-181130 config: x86_64-randconfig-x010-201728 (attached as .config) compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901 reproduce: # save the attached .config to linux build tree make ARCH=x86_64 Note: it may well be a FALSE warning. FWIW you are at least aware of it now. http://gcc.gnu.org/wiki/Better_Uninitialized_Warnings All warnings (new ones prefixed by >>): fs/iomap.c: In function 'iomap_dio_complete': >> fs/iomap.c:629:25: warning: 'ret' may be used uninitialized in this function [-Wmaybe-uninitialized] (offset + dio->size + ret - 1) >> PAGE_SHIFT); ~~~~~~~~~~~~~~~~~~~^~~~~ vim +/ret +629 fs/iomap.c 618 619 static ssize_t iomap_dio_complete(struct iomap_dio *dio) 620 { 621 struct kiocb *iocb = dio->iocb; 622 loff_t offset = iocb->ki_pos; 623 struct inode *inode = file_inode(iocb->ki_filp); 624 ssize_t ret; 625 626 if ((dio->flags & IOMAP_DIO_WRITE) && inode->i_mapping->nrpages) 627 invalidate_inode_pages2_range(inode->i_mapping, 628 offset >> PAGE_SHIFT, > 629 (offset + dio->size + ret - 1) >> PAGE_SHIFT); 630 631 if (dio->end_io) { 632 ret = dio->end_io(iocb, 633 dio->error ? 
dio->error : dio->size, 634 dio->flags); 635 } else { 636 ret = dio->error; 637 } 638 639 if (likely(!ret)) { 640 ret = dio->size; 641 /* check for short read */ 642 if (iocb->ki_pos + ret > dio->i_size && 643 !(dio->flags & IOMAP_DIO_WRITE)) 644 ret = dio->i_size - iocb->ki_pos; 645 iocb->ki_pos += ret; 646 } 647 648 inode_dio_end(file_inode(iocb->ki_filp)); 649 kfree(dio); 650 651 return ret; 652 } 653 --- 0-DAY kernel test infrastructure Open Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation [-- Attachment #2: .config.gz --] [-- Type: application/gzip, Size: 24696 bytes --] ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH] fs: Fix page cache inconsistency when mixing buffered and AIO DIO 2017-07-14 10:41 ` kbuild test robot @ 2017-07-14 13:40 ` Lukas Czerner 0 siblings, 0 replies; 40+ messages in thread From: Lukas Czerner @ 2017-07-14 13:40 UTC (permalink / raw) To: kbuild test robot; +Cc: kbuild-all, linux-fsdevel, viro, jack, Jeff Moyer On Fri, Jul 14, 2017 at 06:41:52PM +0800, kbuild test robot wrote: > Hi Lukas, > > [auto build test WARNING on linus/master] > [also build test WARNING on v4.12 next-20170713] > [if your patch is applied to the wrong git tree, please drop us a note to help improve the system] > > url: https://github.com/0day-ci/linux/commits/Lukas-Czerner/fs-Fix-page-cache-inconsistency-when-mixing-buffered-and-AIO-DIO/20170714-181130 > config: x86_64-randconfig-x010-201728 (attached as .config) > compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901 > reproduce: > # save the attached .config to linux build tree > make ARCH=x86_64 > > Note: it may well be a FALSE warning. FWIW you are at least aware of it now. > http://gcc.gnu.org/wiki/Better_Uninitialized_Warnings > > All warnings (new ones prefixed by >>): > > fs/iomap.c: In function 'iomap_dio_complete': > >> fs/iomap.c:629:25: warning: 'ret' may be used uninitialized in this function [-Wmaybe-uninitialized] > (offset + dio->size + ret - 1) >> PAGE_SHIFT); > ~~~~~~~~~~~~~~~~~~~^~~~~ Oops, right, this is obviously wrong. Thanks! -Lukas > > vim +/ret +629 fs/iomap.c > > 618 > 619 static ssize_t iomap_dio_complete(struct iomap_dio *dio) > 620 { > 621 struct kiocb *iocb = dio->iocb; > 622 loff_t offset = iocb->ki_pos; > 623 struct inode *inode = file_inode(iocb->ki_filp); > 624 ssize_t ret; > 625 > 626 if ((dio->flags & IOMAP_DIO_WRITE) && inode->i_mapping->nrpages) > 627 invalidate_inode_pages2_range(inode->i_mapping, > 628 offset >> PAGE_SHIFT, > > 629 (offset + dio->size + ret - 1) >> PAGE_SHIFT); > 630 > 631 if (dio->end_io) { > 632 ret = dio->end_io(iocb, > 633 dio->error ? 
dio->error : dio->size, > 634 dio->flags); > 635 } else { > 636 ret = dio->error; > 637 } > 638 > 639 if (likely(!ret)) { > 640 ret = dio->size; > 641 /* check for short read */ > 642 if (iocb->ki_pos + ret > dio->i_size && > 643 !(dio->flags & IOMAP_DIO_WRITE)) > 644 ret = dio->i_size - iocb->ki_pos; > 645 iocb->ki_pos += ret; > 646 } > 647 > 648 inode_dio_end(file_inode(iocb->ki_filp)); > 649 kfree(dio); > 650 > 651 return ret; > 652 } > 653 > > --- > 0-DAY kernel test infrastructure Open Source Technology Center > https://lists.01.org/pipermail/kbuild-all Intel Corporation ^ permalink raw reply [flat|nested] 40+ messages in thread
* [PATCH v2] fs: Fix page cache inconsistency when mixing buffered and AIO DIO 2017-07-13 15:17 [PATCH] fs: Fix page cache inconsistency when mixing buffered and AIO DIO Lukas Czerner 2017-07-14 10:41 ` kbuild test robot @ 2017-07-14 15:40 ` Lukas Czerner 2017-07-17 15:12 ` Jan Kara 2017-07-18 12:19 ` [PATCH v3] " Lukas Czerner 1 sibling, 2 replies; 40+ messages in thread From: Lukas Czerner @ 2017-07-14 15:40 UTC (permalink / raw) To: linux-fsdevel; +Cc: viro, jack, Lukas Czerner, Jeff Moyer Currently when mixing buffered reads and asynchronous direct writes it is possible to end up with the situation where we have stale data in the page cache while the new data is already written to disk. This is permanent until the affected pages are flushed away. Despite the fact that mixing buffered and direct IO is ill-advised, it does pose a threat to data integrity, is unexpected, and should be fixed. Fix this by deferring completion of asynchronous direct writes to a process context in the case that there are mapped pages to be found in the inode. Later, before the completion in dio_complete(), invalidate the pages in question. This ensures that after the completion the pages in the written area are either unmapped or populated with up-to-date data. Also do the same for the iomap case, which uses iomap_dio_complete() instead. This has a side effect of deferring the completion to a process context for every AIO DIO that happens on an inode that has pages mapped. However, since the consensus is that this is an ill-advised practice, the performance implication should not be a problem. This was based on a proposal from Jeff Moyer, thanks!
Signed-off-by: Lukas Czerner <lczerner@redhat.com> Cc: Jeff Moyer <jmoyer@redhat.com> --- v2: Remove leftover ret variable from invalidate call in iomap_dio_complete fs/direct-io.c | 31 ++++++++++++++++++++++++++----- fs/iomap.c | 7 +++++++ 2 files changed, 33 insertions(+), 5 deletions(-) diff --git a/fs/direct-io.c b/fs/direct-io.c index 08cf278..2db9ada 100644 --- a/fs/direct-io.c +++ b/fs/direct-io.c @@ -258,6 +258,11 @@ static ssize_t dio_complete(struct dio *dio, ssize_t ret, bool is_async) if (ret == 0) ret = transferred; + if ((dio->op == REQ_OP_WRITE && dio->inode->i_mapping->nrpages)) + invalidate_inode_pages2_range(dio->inode->i_mapping, + offset >> PAGE_SHIFT, + (offset + ret - 1) >> PAGE_SHIFT); + if (dio->end_io) { int err; @@ -304,6 +309,7 @@ static void dio_bio_end_aio(struct bio *bio) struct dio *dio = bio->bi_private; unsigned long remaining; unsigned long flags; + bool defer_completion = false; /* cleanup the bio */ dio_bio_complete(dio, bio); @@ -315,7 +321,19 @@ static void dio_bio_end_aio(struct bio *bio) spin_unlock_irqrestore(&dio->bio_lock, flags); if (remaining == 0) { - if (dio->result && dio->defer_completion) { + /* + * Defer completion when defer_completion is set or + * when the inode has pages mapped and this is AIO write. + * We need to invalidate those pages because there is a + * chance they contain stale data in the case buffered IO + * went in between AIO submission and completion into the + * same region. + */ + if (dio->result) + defer_completion = dio->defer_completion || + (dio->op == REQ_OP_WRITE && + dio->inode->i_mapping->nrpages); + if (defer_completion) { INIT_WORK(&dio->complete_work, dio_aio_complete_work); queue_work(dio->inode->i_sb->s_dio_done_wq, &dio->complete_work); @@ -1210,10 +1228,13 @@ do_blockdev_direct_IO(struct kiocb *iocb, struct inode *inode, * For AIO O_(D)SYNC writes we need to defer completions to a workqueue * so that we can call ->fsync. 
*/ - if (dio->is_async && iov_iter_rw(iter) == WRITE && - ((iocb->ki_filp->f_flags & O_DSYNC) || - IS_SYNC(iocb->ki_filp->f_mapping->host))) { - retval = dio_set_defer_completion(dio); + if (dio->is_async && iov_iter_rw(iter) == WRITE) { + retval = 0; + if ((iocb->ki_filp->f_flags & O_DSYNC) || + IS_SYNC(iocb->ki_filp->f_mapping->host)) + retval = dio_set_defer_completion(dio); + else if (!dio->inode->i_sb->s_dio_done_wq) + retval = sb_init_dio_done_wq(dio->inode->i_sb); if (retval) { /* * We grab i_mutex only for reads so we don't have diff --git a/fs/iomap.c b/fs/iomap.c index 1732228..3baeed2 100644 --- a/fs/iomap.c +++ b/fs/iomap.c @@ -713,8 +713,15 @@ struct iomap_dio { static ssize_t iomap_dio_complete(struct iomap_dio *dio) { struct kiocb *iocb = dio->iocb; + loff_t offset = iocb->ki_pos; + struct inode *inode = file_inode(iocb->ki_filp); ssize_t ret; + if ((dio->flags & IOMAP_DIO_WRITE) && inode->i_mapping->nrpages) + invalidate_inode_pages2_range(inode->i_mapping, + offset >> PAGE_SHIFT, + (offset + dio->size - 1) >> PAGE_SHIFT); + if (dio->end_io) { ret = dio->end_io(iocb, dio->error ? dio->error : dio->size, -- 2.7.5 ^ permalink raw reply related [flat|nested] 40+ messages in thread
* Re: [PATCH v2] fs: Fix page cache inconsistency when mixing buffered and AIO DIO 2017-07-14 15:40 ` [PATCH v2] " Lukas Czerner @ 2017-07-17 15:12 ` Jan Kara 2017-07-17 15:28 ` Lukas Czerner 2017-07-18 12:19 ` [PATCH v3] " Lukas Czerner 1 sibling, 1 reply; 40+ messages in thread From: Jan Kara @ 2017-07-17 15:12 UTC (permalink / raw) To: Lukas Czerner; +Cc: linux-fsdevel, viro, jack, Jeff Moyer On Fri 14-07-17 17:40:23, Lukas Czerner wrote: > Currently when mixing buffered reads and asynchronous direct writes it > is possible to end up with the situation where we have stale data in the > page cache while the new data is already written to disk. This is > permanent until the affected pages are flushed away. Despite the fact > that mixing buffered and direct IO is ill-advised it does pose a thread > for a data integrity, is unexpected and should be fixed. > > Fix this by deferring completion of asynchronous direct writes to a > process context in the case that there are mapped pages to be found in > the inode. Later before the completion in dio_complete() invalidate > the pages in question. This ensures that after the completion the pages > in the written area are either unmapped, or populated with up-to-date > data. Also do the same for the iomap case which uses > iomap_dio_complete() instead. > > This has a side effect of deferring the completion to a process context > for every AIO DIO that happens on inode that has pages mapped. However > since the consensus is that this is ill-advised practice the performance > implication should not be a problem. > > This was based on proposal from Jeff Moyer, thanks! > > Signed-off-by: Lukas Czerner <lczerner@redhat.com> > Cc: Jeff Moyer <jmoyer@redhat.com> OK, this looks like it could work. Some comments below. 
> fs/direct-io.c | 31 ++++++++++++++++++++++++++----- > fs/iomap.c | 7 +++++++ > 2 files changed, 33 insertions(+), 5 deletions(-) > > diff --git a/fs/direct-io.c b/fs/direct-io.c > index 08cf278..2db9ada 100644 > --- a/fs/direct-io.c > +++ b/fs/direct-io.c > @@ -258,6 +258,11 @@ static ssize_t dio_complete(struct dio *dio, ssize_t ret, bool is_async) > if (ret == 0) > ret = transferred; > > + if ((dio->op == REQ_OP_WRITE && dio->inode->i_mapping->nrpages)) Superfluous braces here... Also you should not call invalidate_inode_pages2_range() in case of error I suppose. > + invalidate_inode_pages2_range(dio->inode->i_mapping, > + offset >> PAGE_SHIFT, > + (offset + ret - 1) >> PAGE_SHIFT); > + > if (dio->end_io) { > int err; > > @@ -304,6 +309,7 @@ static void dio_bio_end_aio(struct bio *bio) > struct dio *dio = bio->bi_private; > unsigned long remaining; > unsigned long flags; > + bool defer_completion = false; > > /* cleanup the bio */ > dio_bio_complete(dio, bio); > @@ -315,7 +321,19 @@ static void dio_bio_end_aio(struct bio *bio) > spin_unlock_irqrestore(&dio->bio_lock, flags); > > if (remaining == 0) { > - if (dio->result && dio->defer_completion) { > + /* > + * Defer completion when defer_completion is set or > + * when the inode has pages mapped and this is AIO write. > + * We need to invalidate those pages because there is a > + * chance they contain stale data in the case buffered IO > + * went in between AIO submission and completion into the > + * same region. > + */ > + if (dio->result) > + defer_completion = dio->defer_completion || > + (dio->op == REQ_OP_WRITE && > + dio->inode->i_mapping->nrpages); > + if (defer_completion) { > INIT_WORK(&dio->complete_work, dio_aio_complete_work); > queue_work(dio->inode->i_sb->s_dio_done_wq, > &dio->complete_work); > @@ -1210,10 +1228,13 @@ do_blockdev_direct_IO(struct kiocb *iocb, struct inode *inode, > * For AIO O_(D)SYNC writes we need to defer completions to a workqueue > * so that we can call ->fsync. 
> */ > - if (dio->is_async && iov_iter_rw(iter) == WRITE && > - ((iocb->ki_filp->f_flags & O_DSYNC) || > - IS_SYNC(iocb->ki_filp->f_mapping->host))) { > - retval = dio_set_defer_completion(dio); > + if (dio->is_async && iov_iter_rw(iter) == WRITE) { > + retval = 0; > + if ((iocb->ki_filp->f_flags & O_DSYNC) || > + IS_SYNC(iocb->ki_filp->f_mapping->host)) > + retval = dio_set_defer_completion(dio); > + else if (!dio->inode->i_sb->s_dio_done_wq) > + retval = sb_init_dio_done_wq(dio->inode->i_sb); Please add a comment explaining why sb_init_dio_done_wq() is needed here. > if (retval) { > /* > * We grab i_mutex only for reads so we don't have > diff --git a/fs/iomap.c b/fs/iomap.c > index 1732228..3baeed2 100644 > --- a/fs/iomap.c > +++ b/fs/iomap.c > @@ -713,8 +713,15 @@ struct iomap_dio { > static ssize_t iomap_dio_complete(struct iomap_dio *dio) > { > struct kiocb *iocb = dio->iocb; > + loff_t offset = iocb->ki_pos; > + struct inode *inode = file_inode(iocb->ki_filp); > ssize_t ret; > > + if ((dio->flags & IOMAP_DIO_WRITE) && inode->i_mapping->nrpages) Again I don't think you want to invalidate pages in case DIO failed with an error... > + invalidate_inode_pages2_range(inode->i_mapping, > + offset >> PAGE_SHIFT, > + (offset + dio->size - 1) >> PAGE_SHIFT); > + > if (dio->end_io) { > ret = dio->end_io(iocb, > dio->error ? dio->error : dio->size, Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v2] fs: Fix page cache inconsistency when mixing buffered and AIO DIO 2017-07-17 15:12 ` Jan Kara @ 2017-07-17 15:28 ` Lukas Czerner 2017-07-17 15:39 ` Jeff Moyer 0 siblings, 1 reply; 40+ messages in thread From: Lukas Czerner @ 2017-07-17 15:28 UTC (permalink / raw) To: Jan Kara; +Cc: linux-fsdevel, viro, Jeff Moyer On Mon, Jul 17, 2017 at 05:12:28PM +0200, Jan Kara wrote: > On Fri 14-07-17 17:40:23, Lukas Czerner wrote: > > Currently when mixing buffered reads and asynchronous direct writes it > > is possible to end up with the situation where we have stale data in the > > page cache while the new data is already written to disk. This is > > permanent until the affected pages are flushed away. Despite the fact > > that mixing buffered and direct IO is ill-advised it does pose a thread > > for a data integrity, is unexpected and should be fixed. > > > > Fix this by deferring completion of asynchronous direct writes to a > > process context in the case that there are mapped pages to be found in > > the inode. Later before the completion in dio_complete() invalidate > > the pages in question. This ensures that after the completion the pages > > in the written area are either unmapped, or populated with up-to-date > > data. Also do the same for the iomap case which uses > > iomap_dio_complete() instead. > > > > This has a side effect of deferring the completion to a process context > > for every AIO DIO that happens on inode that has pages mapped. However > > since the consensus is that this is ill-advised practice the performance > > implication should not be a problem. > > > > This was based on proposal from Jeff Moyer, thanks! > > > > Signed-off-by: Lukas Czerner <lczerner@redhat.com> > > Cc: Jeff Moyer <jmoyer@redhat.com> > > OK, this looks like it could work. Some comments below. 
> > fs/direct-io.c | 31 ++++++++++++++++++++++++++----- > > fs/iomap.c | 7 +++++++ > > 2 files changed, 33 insertions(+), 5 deletions(-) > > > > diff --git a/fs/direct-io.c b/fs/direct-io.c > > index 08cf278..2db9ada 100644 > > --- a/fs/direct-io.c > > +++ b/fs/direct-io.c > > @@ -258,6 +258,11 @@ static ssize_t dio_complete(struct dio *dio, ssize_t ret, bool is_async) > > if (ret == 0) > > ret = transferred; > > > > + if ((dio->op == REQ_OP_WRITE && dio->inode->i_mapping->nrpages)) > > Superfluous braces here... Also you should not call > invalidate_inode_pages2_range() in case of error I suppose. Sure, I'll fix the braces. About the error case, is it not possible that some data has already been written to the disk despite the error? Thanks! -Lukas > > > + invalidate_inode_pages2_range(dio->inode->i_mapping, > > + offset >> PAGE_SHIFT, > > + (offset + ret - 1) >> PAGE_SHIFT); > > + > > if (dio->end_io) { > > int err; > > > > @@ -304,6 +309,7 @@ static void dio_bio_end_aio(struct bio *bio) > > struct dio *dio = bio->bi_private; > > unsigned long remaining; > > unsigned long flags; > > + bool defer_completion = false; > > > > /* cleanup the bio */ > > dio_bio_complete(dio, bio); > > @@ -315,7 +321,19 @@ static void dio_bio_end_aio(struct bio *bio) > > spin_unlock_irqrestore(&dio->bio_lock, flags); > > > > if (remaining == 0) { > > - if (dio->result && dio->defer_completion) { > > + /* > > + * Defer completion when defer_completion is set or > > + * when the inode has pages mapped and this is AIO write. > > + * We need to invalidate those pages because there is a > > + * chance they contain stale data in the case buffered IO > > + * went in between AIO submission and completion into the > > + * same region.
> > + */ > > + if (dio->result) > > + defer_completion = dio->defer_completion || > > + (dio->op == REQ_OP_WRITE && > > + dio->inode->i_mapping->nrpages); > > + if (defer_completion) { > > INIT_WORK(&dio->complete_work, dio_aio_complete_work); > > queue_work(dio->inode->i_sb->s_dio_done_wq, > > &dio->complete_work); > > @@ -1210,10 +1228,13 @@ do_blockdev_direct_IO(struct kiocb *iocb, struct inode *inode, > > * For AIO O_(D)SYNC writes we need to defer completions to a workqueue > > * so that we can call ->fsync. > > */ > > - if (dio->is_async && iov_iter_rw(iter) == WRITE && > > - ((iocb->ki_filp->f_flags & O_DSYNC) || > > - IS_SYNC(iocb->ki_filp->f_mapping->host))) { > > - retval = dio_set_defer_completion(dio); > > + if (dio->is_async && iov_iter_rw(iter) == WRITE) { > > + retval = 0; > > + if ((iocb->ki_filp->f_flags & O_DSYNC) || > > + IS_SYNC(iocb->ki_filp->f_mapping->host)) > > + retval = dio_set_defer_completion(dio); > > + else if (!dio->inode->i_sb->s_dio_done_wq) > > + retval = sb_init_dio_done_wq(dio->inode->i_sb); > > Please add a comment explaining why sb_init_dio_done_wq() is needed here. ok, thanks. > > > if (retval) { > > /* > > * We grab i_mutex only for reads so we don't have > > diff --git a/fs/iomap.c b/fs/iomap.c > > index 1732228..3baeed2 100644 > > --- a/fs/iomap.c > > +++ b/fs/iomap.c > > @@ -713,8 +713,15 @@ struct iomap_dio { > > static ssize_t iomap_dio_complete(struct iomap_dio *dio) > > { > > struct kiocb *iocb = dio->iocb; > > + loff_t offset = iocb->ki_pos; > > + struct inode *inode = file_inode(iocb->ki_filp); > > ssize_t ret; > > > > + if ((dio->flags & IOMAP_DIO_WRITE) && inode->i_mapping->nrpages) > > Again I don't think you want to invalidate pages in case DIO failed with an > error... > > > + invalidate_inode_pages2_range(inode->i_mapping, > > + offset >> PAGE_SHIFT, > > + (offset + dio->size - 1) >> PAGE_SHIFT); > > + > > if (dio->end_io) { > > ret = dio->end_io(iocb, > > dio->error ? 
dio->error : dio->size, > > Honza > > -- > Jan Kara <jack@suse.com> > SUSE Labs, CR ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v2] fs: Fix page cache inconsistency when mixing buffered and AIO DIO 2017-07-17 15:28 ` Lukas Czerner @ 2017-07-17 15:39 ` Jeff Moyer 2017-07-17 16:17 ` Jan Kara 2017-07-18 7:39 ` Lukas Czerner 0 siblings, 2 replies; 40+ messages in thread From: Jeff Moyer @ 2017-07-17 15:39 UTC (permalink / raw) To: Lukas Czerner, Jan Kara; +Cc: linux-fsdevel, viro Lukas Czerner <lczerner@redhat.com> writes: > About the error case, is it not possible that some data has already been > writtent to the disk despite the error ? Yes, it's possible. However, that data is in an inconsistent state, so it shouldn't be read, anyway. Now, in the non-async path, we do the invalidation unconditionally, so I could go either way on this. I don't think it's going to matter for performance or data integrity. Cheers, Jeff ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v2] fs: Fix page cache inconsistency when mixing buffered and AIO DIO 2017-07-17 15:39 ` Jeff Moyer @ 2017-07-17 16:17 ` Jan Kara 2017-07-17 19:52 ` Jeff Moyer 2017-07-18 7:39 ` Lukas Czerner 1 sibling, 1 reply; 40+ messages in thread From: Jan Kara @ 2017-07-17 16:17 UTC (permalink / raw) To: Jeff Moyer; +Cc: Lukas Czerner, Jan Kara, linux-fsdevel, viro On Mon 17-07-17 11:39:09, Jeff Moyer wrote: > Lukas Czerner <lczerner@redhat.com> writes: > > > About the error case, is it not possible that some data has already been > > writtent to the disk despite the error ? > > Yes, it's possible. However, that data is in an inconsistent state, so > it shouldn't be read, anyway. > > Now, in the non-async path, we do the invalidation unconditionally, so I > could go either way on this. I don't think it's going to matter for > performance or data integrity. Well, at least 'ret' would be negative in the error case so arguments passed to invalidate_inode_pages2_range() would be bogus if I'm reading the code right... Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v2] fs: Fix page cache inconsistency when mixing buffered and AIO DIO 2017-07-17 16:17 ` Jan Kara @ 2017-07-17 19:52 ` Jeff Moyer 0 siblings, 0 replies; 40+ messages in thread From: Jeff Moyer @ 2017-07-17 19:52 UTC (permalink / raw) To: Jan Kara; +Cc: Lukas Czerner, linux-fsdevel, viro Jan Kara <jack@suse.cz> writes: > On Mon 17-07-17 11:39:09, Jeff Moyer wrote: >> Lukas Czerner <lczerner@redhat.com> writes: >> >> > About the error case, is it not possible that some data has already been >> > writtent to the disk despite the error ? >> >> Yes, it's possible. However, that data is in an inconsistent state, so >> it shouldn't be read, anyway. >> >> Now, in the non-async path, we do the invalidation unconditionally, so I >> could go either way on this. I don't think it's going to matter for >> performance or data integrity. > > Well, at least 'ret' would be negative in the error case so arguments > passed to invalidate_inode_pages2_range() would be bogus if I'm reading the > code right... Ah, yes. Sorry, I was commenting on the more general point. You are correct, in dio_complete, ret could be set to dio->page_errors or dio->io_error. So yes, that needs to be checked. -Jeff ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v2] fs: Fix page cache inconsistency when mixing buffered and AIO DIO 2017-07-17 15:39 ` Jeff Moyer 2017-07-17 16:17 ` Jan Kara @ 2017-07-18 7:39 ` Lukas Czerner 2017-07-18 9:06 ` Jan Kara 1 sibling, 1 reply; 40+ messages in thread From: Lukas Czerner @ 2017-07-18 7:39 UTC (permalink / raw) To: Jeff Moyer; +Cc: Jan Kara, linux-fsdevel, viro On Mon, Jul 17, 2017 at 11:39:09AM -0400, Jeff Moyer wrote: > Lukas Czerner <lczerner@redhat.com> writes: > > > About the error case, is it not possible that some data has already been > > writtent to the disk despite the error ? > > Yes, it's possible. However, that data is in an inconsistent state, so > it shouldn't be read, anyway. I think it can be read if we wrote into already allocated space. > > Now, in the non-async path, we do the invalidation unconditionally, so I > could go either way on this. I don't think it's going to matter for > performance or data integrity. That's part of the reason why I did it unconditionally as well; however, Jan is right that ret would be negative. The way to fix it would differ depending on whether I am right about reading partially written data from AIO that failed. We still want to invalidate in that case. -Lukas > > Cheers, > Jeff ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v2] fs: Fix page cache inconsistency when mixing buffered and AIO DIO 2017-07-18 7:39 ` Lukas Czerner @ 2017-07-18 9:06 ` Jan Kara 2017-07-18 9:32 ` Lukas Czerner 0 siblings, 1 reply; 40+ messages in thread From: Jan Kara @ 2017-07-18 9:06 UTC (permalink / raw) To: Lukas Czerner; +Cc: Jeff Moyer, Jan Kara, linux-fsdevel, viro On Tue 18-07-17 09:39:35, Lukas Czerner wrote: > On Mon, Jul 17, 2017 at 11:39:09AM -0400, Jeff Moyer wrote: > > Lukas Czerner <lczerner@redhat.com> writes: > > > > > About the error case, is it not possible that some data has already been > > > writtent to the disk despite the error ? > > > > Yes, it's possible. However, that data is in an inconsistent state, so > > it shouldn't be read, anyway. > > I think it can be read if we wrote into already allocated space. > > > > > Now, in the non-async path, we do the invalidation unconditionally, so I > > could go either way on this. I don't think it's going to matter for > > performance or data integrity. > > That's part of the reason why I did it unconditionaly as well, however > Jan is right that ret would be negative. The way to fix it would differ > depending on whether I am right about reading partially written data > from AIO that failed. We still want to invalidate in that case. Frankly, I don't think it really matters so I'd go for not invalidating anything on error just out of philosophy: "There's something weird going on, bail out as quickly as you can." Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v2] fs: Fix page cache inconsistency when mixing buffered and AIO DIO 2017-07-18 9:06 ` Jan Kara @ 2017-07-18 9:32 ` Lukas Czerner 0 siblings, 0 replies; 40+ messages in thread From: Lukas Czerner @ 2017-07-18 9:32 UTC (permalink / raw) To: Jan Kara; +Cc: Jeff Moyer, linux-fsdevel, viro On Tue, Jul 18, 2017 at 11:06:26AM +0200, Jan Kara wrote: > On Tue 18-07-17 09:39:35, Lukas Czerner wrote: > > On Mon, Jul 17, 2017 at 11:39:09AM -0400, Jeff Moyer wrote: > > > Lukas Czerner <lczerner@redhat.com> writes: > > > > > > > About the error case, is it not possible that some data has already been > > > > writtent to the disk despite the error ? > > > > > > Yes, it's possible. However, that data is in an inconsistent state, so > > > it shouldn't be read, anyway. > > > > I think it can be read if we wrote into already allocated space. > > > > > > > > Now, in the non-async path, we do the invalidation unconditionally, so I > > > could go either way on this. I don't think it's going to matter for > > > performance or data integrity. > > > > That's part of the reason why I did it unconditionaly as well, however > > Jan is right that ret would be negative. The way to fix it would differ > > depending on whether I am right about reading partially written data > > from AIO that failed. We still want to invalidate in that case. > > Frankly, I don't think it really matters so I'd go for not invalidating > anything on error just out of philosophy: "There's something weird going > on, bail out as quickly as you can." > > Honza Fair enough, thanks! -Lukas > -- > Jan Kara <jack@suse.com> > SUSE Labs, CR ^ permalink raw reply [flat|nested] 40+ messages in thread
* [PATCH v3] fs: Fix page cache inconsistency when mixing buffered and AIO DIO 2017-07-14 15:40 ` [PATCH v2] " Lukas Czerner 2017-07-17 15:12 ` Jan Kara @ 2017-07-18 12:19 ` Lukas Czerner 2017-07-18 13:44 ` Christoph Hellwig ` (2 more replies) 1 sibling, 3 replies; 40+ messages in thread From: Lukas Czerner @ 2017-07-18 12:19 UTC (permalink / raw) To: linux-fsdevel; +Cc: viro, jack, Lukas Czerner, Jeff Moyer Currently when mixing buffered reads and asynchronous direct writes it is possible to end up with the situation where we have stale data in the page cache while the new data is already written to disk. This is permanent until the affected pages are flushed away. Despite the fact that mixing buffered and direct IO is ill-advised, it does pose a threat to data integrity, is unexpected and should be fixed. Fix this by deferring completion of asynchronous direct writes to a process context in the case that there are mapped pages to be found in the inode. Later, before the completion in dio_complete(), invalidate the pages in question. This ensures that after the completion the pages in the written area are either unmapped, or populated with up-to-date data. Also do the same for the iomap case which uses iomap_dio_complete() instead. This has a side effect of deferring the completion to a process context for every AIO DIO that happens on an inode that has pages mapped. However since the consensus is that this is an ill-advised practice the performance implication should not be a problem. This was based on a proposal from Jeff Moyer, thanks! Signed-off-by: Lukas Czerner <lczerner@redhat.com> Cc: Jeff Moyer <jmoyer@redhat.com> --- v2: Remove leftover ret variable from invalidate call in iomap_dio_complete v3: Do not invalidate in case of error. 
Add some comments fs/direct-io.c | 37 ++++++++++++++++++++++++++++++++----- fs/iomap.c | 8 ++++++++ 2 files changed, 40 insertions(+), 5 deletions(-) diff --git a/fs/direct-io.c b/fs/direct-io.c index 08cf278..efd3246 100644 --- a/fs/direct-io.c +++ b/fs/direct-io.c @@ -258,6 +258,12 @@ static ssize_t dio_complete(struct dio *dio, ssize_t ret, bool is_async) if (ret == 0) ret = transferred; + if ((ret > 0) && + (dio->op == REQ_OP_WRITE && dio->inode->i_mapping->nrpages)) + invalidate_inode_pages2_range(dio->inode->i_mapping, + offset >> PAGE_SHIFT, + (offset + ret - 1) >> PAGE_SHIFT); + if (dio->end_io) { int err; @@ -304,6 +310,7 @@ static void dio_bio_end_aio(struct bio *bio) struct dio *dio = bio->bi_private; unsigned long remaining; unsigned long flags; + bool defer_completion = false; /* cleanup the bio */ dio_bio_complete(dio, bio); @@ -315,7 +322,19 @@ static void dio_bio_end_aio(struct bio *bio) spin_unlock_irqrestore(&dio->bio_lock, flags); if (remaining == 0) { - if (dio->result && dio->defer_completion) { + /* + * Defer completion when defer_completion is set or + * when the inode has pages mapped and this is AIO write. + * We need to invalidate those pages because there is a + * chance they contain stale data in the case buffered IO + * went in between AIO submission and completion into the + * same region. + */ + if (dio->result) + defer_completion = dio->defer_completion || + (dio->op == REQ_OP_WRITE && + dio->inode->i_mapping->nrpages); + if (defer_completion) { INIT_WORK(&dio->complete_work, dio_aio_complete_work); queue_work(dio->inode->i_sb->s_dio_done_wq, &dio->complete_work); @@ -1210,10 +1229,18 @@ do_blockdev_direct_IO(struct kiocb *iocb, struct inode *inode, * For AIO O_(D)SYNC writes we need to defer completions to a workqueue * so that we can call ->fsync. 
*/ - if (dio->is_async && iov_iter_rw(iter) == WRITE && - ((iocb->ki_filp->f_flags & O_DSYNC) || - IS_SYNC(iocb->ki_filp->f_mapping->host))) { - retval = dio_set_defer_completion(dio); + if (dio->is_async && iov_iter_rw(iter) == WRITE) { + retval = 0; + if ((iocb->ki_filp->f_flags & O_DSYNC) || + IS_SYNC(iocb->ki_filp->f_mapping->host)) + retval = dio_set_defer_completion(dio); + else if (!dio->inode->i_sb->s_dio_done_wq) + /* + * In case of AIO write racing with buffered read we + * need to defer completion. We can't decide this now, + * however the workqueue needs to be initialized here. + */ + retval = sb_init_dio_done_wq(dio->inode->i_sb); if (retval) { /* * We grab i_mutex only for reads so we don't have diff --git a/fs/iomap.c b/fs/iomap.c index 1732228..2f8dbf9 100644 --- a/fs/iomap.c +++ b/fs/iomap.c @@ -713,8 +713,16 @@ struct iomap_dio { static ssize_t iomap_dio_complete(struct iomap_dio *dio) { struct kiocb *iocb = dio->iocb; + loff_t offset = iocb->ki_pos; + struct inode *inode = file_inode(iocb->ki_filp); ssize_t ret; + if ((!dio->error) && + (dio->flags & IOMAP_DIO_WRITE) && inode->i_mapping->nrpages) + invalidate_inode_pages2_range(inode->i_mapping, + offset >> PAGE_SHIFT, + (offset + dio->size - 1) >> PAGE_SHIFT); + if (dio->end_io) { ret = dio->end_io(iocb, dio->error ? dio->error : dio->size, -- 2.7.5 ^ permalink raw reply related [flat|nested] 40+ messages in thread
* Re: [PATCH v3] fs: Fix page cache inconsistency when mixing buffered and AIO DIO 2017-07-18 12:19 ` [PATCH v3] " Lukas Czerner @ 2017-07-18 13:44 ` Christoph Hellwig 2017-07-18 14:17 ` Jan Kara 2017-07-19 8:42 ` Lukas Czerner 2017-07-19 8:48 ` [PATCH v4] " Lukas Czerner 2017-07-19 11:28 ` [PATCH v5] " Lukas Czerner 2 siblings, 2 replies; 40+ messages in thread From: Christoph Hellwig @ 2017-07-18 13:44 UTC (permalink / raw) To: Lukas Czerner; +Cc: linux-fsdevel, viro, jack, Jeff Moyer > + if ((ret > 0) && No need for the braces here. > + if (dio->is_async && iov_iter_rw(iter) == WRITE) { > + retval = 0; > + if ((iocb->ki_filp->f_flags & O_DSYNC) || > + IS_SYNC(iocb->ki_filp->f_mapping->host)) > + retval = dio_set_defer_completion(dio); > + else if (!dio->inode->i_sb->s_dio_done_wq) > + /* > + * In case of AIO write racing with buffered read we > + * need to defer completion. We can't decide this now, > + * however the workqueue needs to be initialized here. > + */ > + retval = sb_init_dio_done_wq(dio->inode->i_sb); So now we initialize the workqueue on the first aio write. Maybe we should just always initialize it? Especially given that the cost of a workqueue is rather cheap. I also don't really understand why we even need the workqueue per-superblock instead of global. > index 1732228..2f8dbf9 100644 > --- a/fs/iomap.c > +++ b/fs/iomap.c > @@ -713,8 +713,16 @@ struct iomap_dio { > static ssize_t iomap_dio_complete(struct iomap_dio *dio) > { > struct kiocb *iocb = dio->iocb; > + loff_t offset = iocb->ki_pos; If you introduce this variable please also use it later in the function instead of iocb->ki_pos. OR remove the variable, which would be fine with me as well. > + struct inode *inode = file_inode(iocb->ki_filp); > ssize_t ret; > > + if ((!dio->error) && no need for the inner braces. ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v3] fs: Fix page cache inconsistency when mixing buffered and AIO DIO 2017-07-18 13:44 ` Christoph Hellwig @ 2017-07-18 14:17 ` Jan Kara 2017-07-19 8:42 ` Lukas Czerner 1 sibling, 0 replies; 40+ messages in thread From: Jan Kara @ 2017-07-18 14:17 UTC (permalink / raw) To: Christoph Hellwig; +Cc: Lukas Czerner, linux-fsdevel, viro, jack, Jeff Moyer On Tue 18-07-17 06:44:20, Christoph Hellwig wrote: > > + if (dio->is_async && iov_iter_rw(iter) == WRITE) { > > + retval = 0; > > + if ((iocb->ki_filp->f_flags & O_DSYNC) || > > + IS_SYNC(iocb->ki_filp->f_mapping->host)) > > + retval = dio_set_defer_completion(dio); > > + else if (!dio->inode->i_sb->s_dio_done_wq) > > + /* > > + * In case of AIO write racing with buffered read we > > + * need to defer completion. We can't decide this now, > > + * however the workqueue needs to be initialized here. > > + */ > > + retval = sb_init_dio_done_wq(dio->inode->i_sb); > > So now we initialize the workqueue on the first aio write. Maybe we > should just always initialize it? Especially given that the cost of > a workqueue is rather cheap. I also don't really understand why > we even need the workqueue per-superblock instead of global. So the workqueue is WQ_MEM_RECLAIM which means there will always be a "rescue" worker running. Not that it would make the workqueue too expensive but it is not zero cost either. So saving the cost for filesystems that don't support AIO DIO makes sense to me. Regarding creating a global workqueue - it would create IO completion dependencies between filesystems which could have unwanted side effects and possibly create deadlocks. The default parallelism of workqueues would mostly hide this but I'm not sure there won't be some corner case e.g. when memory is tight... Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v3] fs: Fix page cache inconsistency when mixing buffered and AIO DIO 2017-07-18 13:44 ` Christoph Hellwig 2017-07-18 14:17 ` Jan Kara @ 2017-07-19 8:42 ` Lukas Czerner 1 sibling, 0 replies; 40+ messages in thread From: Lukas Czerner @ 2017-07-19 8:42 UTC (permalink / raw) To: Christoph Hellwig; +Cc: linux-fsdevel, viro, jack, Jeff Moyer On Tue, Jul 18, 2017 at 06:44:20AM -0700, Christoph Hellwig wrote: > ->inode->i_sb); > > So now we initialize the workqueue on the first aio write. Maybe we > should just always initialize it? Especially given that the cost of > a workqueue is rather cheap. I also don't really understand why > we even need the workqueue per-superblock instead of global. As Jan mentioned, and I agree, it's worth initializing it only when actually needed. > > > index 1732228..2f8dbf9 100644 > > --- a/fs/iomap.c > > +++ b/fs/iomap.c > > @@ -713,8 +713,16 @@ struct iomap_dio { > > static ssize_t iomap_dio_complete(struct iomap_dio *dio) > > { > > struct kiocb *iocb = dio->iocb; > > + loff_t offset = iocb->ki_pos; > > If you introduce this variable please also use it later in the function > instead of iocb->ki_pos. OR remove the variable, which would be fine > with me as well. Right, I did not use it later in the function because it would be confusing (we're changing iocb->ki_pos). So I'll just remove the variable. > > > + struct inode *inode = file_inode(iocb->ki_filp); > > ssize_t ret; > > > > + if ((!dio->error) && > > no need for the inner braces. ok Thanks! -Lukas ^ permalink raw reply [flat|nested] 40+ messages in thread
* [PATCH v4] fs: Fix page cache inconsistency when mixing buffered and AIO DIO 2017-07-18 12:19 ` [PATCH v3] " Lukas Czerner 2017-07-18 13:44 ` Christoph Hellwig @ 2017-07-19 8:48 ` Lukas Czerner 2017-07-19 9:26 ` Jan Kara 2017-07-19 11:28 ` [PATCH v5] " Lukas Czerner 2 siblings, 1 reply; 40+ messages in thread From: Lukas Czerner @ 2017-07-19 8:48 UTC (permalink / raw) To: linux-fsdevel; +Cc: viro, jack, hch, Lukas Czerner, Jeff Moyer Currently when mixing buffered reads and asynchronous direct writes it is possible to end up with the situation where we have stale data in the page cache while the new data is already written to disk. This is permanent until the affected pages are flushed away. Despite the fact that mixing buffered and direct IO is ill-advised, it does pose a threat to data integrity, is unexpected and should be fixed. Fix this by deferring completion of asynchronous direct writes to a process context in the case that there are mapped pages to be found in the inode. Later, before the completion in dio_complete(), invalidate the pages in question. This ensures that after the completion the pages in the written area are either unmapped, or populated with up-to-date data. Also do the same for the iomap case which uses iomap_dio_complete() instead. This has a side effect of deferring the completion to a process context for every AIO DIO that happens on an inode that has pages mapped. However since the consensus is that this is an ill-advised practice the performance implication should not be a problem. This was based on a proposal from Jeff Moyer, thanks! Signed-off-by: Lukas Czerner <lczerner@redhat.com> Cc: Jeff Moyer <jmoyer@redhat.com> --- v2: Remove leftover ret variable from invalidate call in iomap_dio_complete v3: Do not invalidate in case of error. 
Add some comments v4: Remove unnecessary variable, remove unnecessary inner braces fs/direct-io.c | 37 ++++++++++++++++++++++++++++++++----- fs/iomap.c | 7 +++++++ 2 files changed, 39 insertions(+), 5 deletions(-) diff --git a/fs/direct-io.c b/fs/direct-io.c index 08cf278..efd3246 100644 --- a/fs/direct-io.c +++ b/fs/direct-io.c @@ -258,6 +258,12 @@ static ssize_t dio_complete(struct dio *dio, ssize_t ret, bool is_async) if (ret == 0) ret = transferred; + if ((ret > 0) && + (dio->op == REQ_OP_WRITE && dio->inode->i_mapping->nrpages)) + invalidate_inode_pages2_range(dio->inode->i_mapping, + offset >> PAGE_SHIFT, + (offset + ret - 1) >> PAGE_SHIFT); + if (dio->end_io) { int err; @@ -304,6 +310,7 @@ static void dio_bio_end_aio(struct bio *bio) struct dio *dio = bio->bi_private; unsigned long remaining; unsigned long flags; + bool defer_completion = false; /* cleanup the bio */ dio_bio_complete(dio, bio); @@ -315,7 +322,19 @@ static void dio_bio_end_aio(struct bio *bio) spin_unlock_irqrestore(&dio->bio_lock, flags); if (remaining == 0) { - if (dio->result && dio->defer_completion) { + /* + * Defer completion when defer_completion is set or + * when the inode has pages mapped and this is AIO write. + * We need to invalidate those pages because there is a + * chance they contain stale data in the case buffered IO + * went in between AIO submission and completion into the + * same region. + */ + if (dio->result) + defer_completion = dio->defer_completion || + (dio->op == REQ_OP_WRITE && + dio->inode->i_mapping->nrpages); + if (defer_completion) { INIT_WORK(&dio->complete_work, dio_aio_complete_work); queue_work(dio->inode->i_sb->s_dio_done_wq, &dio->complete_work); @@ -1210,10 +1229,18 @@ do_blockdev_direct_IO(struct kiocb *iocb, struct inode *inode, * For AIO O_(D)SYNC writes we need to defer completions to a workqueue * so that we can call ->fsync. 
*/ - if (dio->is_async && iov_iter_rw(iter) == WRITE && - ((iocb->ki_filp->f_flags & O_DSYNC) || - IS_SYNC(iocb->ki_filp->f_mapping->host))) { - retval = dio_set_defer_completion(dio); + if (dio->is_async && iov_iter_rw(iter) == WRITE) { + retval = 0; + if ((iocb->ki_filp->f_flags & O_DSYNC) || + IS_SYNC(iocb->ki_filp->f_mapping->host)) + retval = dio_set_defer_completion(dio); + else if (!dio->inode->i_sb->s_dio_done_wq) + /* + * In case of AIO write racing with buffered read we + * need to defer completion. We can't decide this now, + * however the workqueue needs to be initialized here. + */ + retval = sb_init_dio_done_wq(dio->inode->i_sb); if (retval) { /* * We grab i_mutex only for reads so we don't have diff --git a/fs/iomap.c b/fs/iomap.c index 1732228..144512e 100644 --- a/fs/iomap.c +++ b/fs/iomap.c @@ -713,8 +713,15 @@ struct iomap_dio { static ssize_t iomap_dio_complete(struct iomap_dio *dio) { struct kiocb *iocb = dio->iocb; + struct inode *inode = file_inode(iocb->ki_filp); ssize_t ret; + if (!dio->error && + (dio->flags & IOMAP_DIO_WRITE) && inode->i_mapping->nrpages) + invalidate_inode_pages2_range(inode->i_mapping, + iocb->ki_pos >> PAGE_SHIFT, + (iocb->ki_pos + dio->size - 1) >> PAGE_SHIFT); + if (dio->end_io) { ret = dio->end_io(iocb, dio->error ? dio->error : dio->size, -- 2.7.5 ^ permalink raw reply related [flat|nested] 40+ messages in thread
* Re: [PATCH v4] fs: Fix page cache inconsistency when mixing buffered and AIO DIO 2017-07-19 8:48 ` [PATCH v4] " Lukas Czerner @ 2017-07-19 9:26 ` Jan Kara 2017-07-19 11:01 ` Lukas Czerner 0 siblings, 1 reply; 40+ messages in thread From: Jan Kara @ 2017-07-19 9:26 UTC (permalink / raw) To: Lukas Czerner; +Cc: linux-fsdevel, viro, jack, hch, Jeff Moyer On Wed 19-07-17 10:48:44, Lukas Czerner wrote: > Currently when mixing buffered reads and asynchronous direct writes it > is possible to end up with the situation where we have stale data in the > page cache while the new data is already written to disk. This is > permanent until the affected pages are flushed away. Despite the fact > that mixing buffered and direct IO is ill-advised, it does pose a threat > to data integrity, is unexpected and should be fixed. > > Fix this by deferring completion of asynchronous direct writes to a > process context in the case that there are mapped pages to be found in > the inode. Later, before the completion in dio_complete(), invalidate > the pages in question. This ensures that after the completion the pages > in the written area are either unmapped, or populated with up-to-date > data. Also do the same for the iomap case which uses > iomap_dio_complete() instead. > > This has a side effect of deferring the completion to a process context > for every AIO DIO that happens on an inode that has pages mapped. However > since the consensus is that this is an ill-advised practice the performance > implication should not be a problem. > > This was based on a proposal from Jeff Moyer, thanks! > > Signed-off-by: Lukas Czerner <lczerner@redhat.com> > Cc: Jeff Moyer <jmoyer@redhat.com> > --- > v2: Remove leftover ret variable from invalidate call in iomap_dio_complete > v3: Do not invalidate in case of error. Add some comments > v4: Remove unnecessary variable, remove unnecessary inner braces Looks good to me now, just two style nits below. 
You can add: Reviewed-by: Jan Kara <jack@suse.cz> > diff --git a/fs/direct-io.c b/fs/direct-io.c > index 08cf278..efd3246 100644 > --- a/fs/direct-io.c > +++ b/fs/direct-io.c > @@ -258,6 +258,12 @@ static ssize_t dio_complete(struct dio *dio, ssize_t ret, bool is_async) > if (ret == 0) > ret = transferred; > > + if ((ret > 0) && > + (dio->op == REQ_OP_WRITE && dio->inode->i_mapping->nrpages)) Heh, you seem to love braces. The general rule is that braces should be around bit-ops (as there people find the priority unclear and also it is too easy to forget to add those braces when negating the condition) but not around comparison or such. I.e. the above would be: if (ret > 0 && dio->op == REQ_OP_WRITE && dio->inode->i_mapping->nrpages) ... > + if (dio->is_async && iov_iter_rw(iter) == WRITE) { > + retval = 0; > + if ((iocb->ki_filp->f_flags & O_DSYNC) || > + IS_SYNC(iocb->ki_filp->f_mapping->host)) > + retval = dio_set_defer_completion(dio); > + else if (!dio->inode->i_sb->s_dio_done_wq) > + /* > + * In case of AIO write racing with buffered read we > + * need to defer completion. We can't decide this now, > + * however the workqueue needs to be initialized here. > + */ > + retval = sb_init_dio_done_wq(dio->inode->i_sb); Curly braces here please. When command block is multi-line we enforce those despite it is only a single statement and thus they are not necessary strictly speaking. Thanks! Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v4] fs: Fix page cache inconsistency when mixing buffered and AIO DIO 2017-07-19 9:26 ` Jan Kara @ 2017-07-19 11:01 ` Lukas Czerner 0 siblings, 0 replies; 40+ messages in thread From: Lukas Czerner @ 2017-07-19 11:01 UTC (permalink / raw) To: Jan Kara; +Cc: linux-fsdevel, viro, hch, Jeff Moyer On Wed, Jul 19, 2017 at 11:26:37AM +0200, Jan Kara wrote: > in question. This ensures that after the completion the pages > > in the written area are either unmapped, or populated with up-to-date > > data. Also do the same for the iomap case which uses > > iomap_dio_complete() instead. > > > > This has a side effect of deferring the completion to a process context > > for every AIO DIO that happens on inode that has pages mapped. However > > since the consensus is that this is ill-advised practice the performance > > implication should not be a problem. > > > > This was based on proposal from Jeff Moyer, thanks! > > > > Signed-off-by: Lukas Czerner <lczerner@redhat.com> > > Cc: Jeff Moyer <jmoyer@redhat.com> > > --- > > v2: Remove leftover ret variable from invalidate call in iomap_dio_complete > > v3: Do not invalidate in case of error. Add some coments > > v4: Remove unnecessary variable, remove unnecessary inner braces > > Looks good to me now, just two style nits below. You can add: > > Reviewed-by: Jan Kara <jack@suse.cz> > > > diff --git a/fs/direct-io.c b/fs/direct-io.c > > index 08cf278..efd3246 100644 > > --- a/fs/direct-io.c > > +++ b/fs/direct-io.c > > @@ -258,6 +258,12 @@ static ssize_t dio_complete(struct dio *dio, ssize_t ret, bool is_async) > > if (ret == 0) > > ret = transferred; > > > > + if ((ret > 0) && > > + (dio->op == REQ_OP_WRITE && dio->inode->i_mapping->nrpages)) > > Heh, you seem to love braces. The general rule is that braces should be > around bit-ops (as there people find the priority unclear and also it is > too easy to forget to add those braces when negating the condition) but not > around comparison or such. I.e. 
the above would be: > > if (ret > 0 && dio->op == REQ_OP_WRITE && > dio->inode->i_mapping->nrpages) :D sure, I'll resend. > > ... > > > + if (dio->is_async && iov_iter_rw(iter) == WRITE) { > > + retval = 0; > > + if ((iocb->ki_filp->f_flags & O_DSYNC) || > > + IS_SYNC(iocb->ki_filp->f_mapping->host)) > > + retval = dio_set_defer_completion(dio); > > + else if (!dio->inode->i_sb->s_dio_done_wq) > > + /* > > + * In case of AIO write racing with buffered read we > > + * need to defer completion. We can't decide this now, > > + * however the workqueue needs to be initialized here. > > + */ > > + retval = sb_init_dio_done_wq(dio->inode->i_sb); > > Curly braces here please. When command block is multi-line we enforce those > despite it is only a single statement and thus they are not necessary > strictly speaking. Thanks! ok. Thanks! -Lukas > > Honza > -- > Jan Kara <jack@suse.com> > SUSE Labs, CR ^ permalink raw reply [flat|nested] 40+ messages in thread
* [PATCH v5] fs: Fix page cache inconsistency when mixing buffered and AIO DIO 2017-07-18 12:19 ` [PATCH v3] " Lukas Czerner 2017-07-18 13:44 ` Christoph Hellwig 2017-07-19 8:48 ` [PATCH v4] " Lukas Czerner @ 2017-07-19 11:28 ` Lukas Czerner 2017-07-19 11:37 ` Jan Kara ` (3 more replies) 2 siblings, 4 replies; 40+ messages in thread From: Lukas Czerner @ 2017-07-19 11:28 UTC (permalink / raw) To: linux-fsdevel; +Cc: viro, jack, hch, Lukas Czerner, Jeff Moyer Currently when mixing buffered reads and asynchronous direct writes it is possible to end up with the situation where we have stale data in the page cache while the new data is already written to disk. This is permanent until the affected pages are flushed away. Despite the fact that mixing buffered and direct IO is ill-advised, it does pose a threat to data integrity, is unexpected and should be fixed. Fix this by deferring completion of asynchronous direct writes to a process context in the case that there are mapped pages to be found in the inode. Later, before the completion in dio_complete(), invalidate the pages in question. This ensures that after the completion the pages in the written area are either unmapped, or populated with up-to-date data. Also do the same for the iomap case which uses iomap_dio_complete() instead. This has a side effect of deferring the completion to a process context for every AIO DIO that happens on an inode that has pages mapped. However since the consensus is that this is an ill-advised practice the performance implication should not be a problem. This was based on a proposal from Jeff Moyer, thanks! Signed-off-by: Lukas Czerner <lczerner@redhat.com> Cc: Jeff Moyer <jmoyer@redhat.com> --- v2: Remove leftover ret variable from invalidate call in iomap_dio_complete v3: Do not invalidate in case of error. 
Add some comments v4: Remove unnecessary variable, remove unnecessary inner braces v5: Style changes fs/direct-io.c | 39 ++++++++++++++++++++++++++++++++++----- fs/iomap.c | 7 +++++++ 2 files changed, 41 insertions(+), 5 deletions(-) diff --git a/fs/direct-io.c b/fs/direct-io.c index 08cf278..0d1befd 100644 --- a/fs/direct-io.c +++ b/fs/direct-io.c @@ -258,6 +258,13 @@ static ssize_t dio_complete(struct dio *dio, ssize_t ret, bool is_async) if (ret == 0) ret = transferred; + if (ret > 0 && dio->op == REQ_OP_WRITE && + dio->inode->i_mapping->nrpages) { + invalidate_inode_pages2_range(dio->inode->i_mapping, + offset >> PAGE_SHIFT, + (offset + ret - 1) >> PAGE_SHIFT); + } + if (dio->end_io) { int err; @@ -304,6 +311,7 @@ static void dio_bio_end_aio(struct bio *bio) struct dio *dio = bio->bi_private; unsigned long remaining; unsigned long flags; + bool defer_completion = false; /* cleanup the bio */ dio_bio_complete(dio, bio); @@ -315,7 +323,19 @@ static void dio_bio_end_aio(struct bio *bio) spin_unlock_irqrestore(&dio->bio_lock, flags); if (remaining == 0) { - if (dio->result && dio->defer_completion) { + /* + * Defer completion when defer_completion is set or + * when the inode has pages mapped and this is AIO write. + * We need to invalidate those pages because there is a + * chance they contain stale data in the case buffered IO + * went in between AIO submission and completion into the + * same region. + */ + if (dio->result) + defer_completion = dio->defer_completion || + (dio->op == REQ_OP_WRITE && + dio->inode->i_mapping->nrpages); + if (defer_completion) { INIT_WORK(&dio->complete_work, dio_aio_complete_work); queue_work(dio->inode->i_sb->s_dio_done_wq, &dio->complete_work); @@ -1210,10 +1230,19 @@ do_blockdev_direct_IO(struct kiocb *iocb, struct inode *inode, * For AIO O_(D)SYNC writes we need to defer completions to a workqueue * so that we can call ->fsync. 
*/ - if (dio->is_async && iov_iter_rw(iter) == WRITE && - ((iocb->ki_filp->f_flags & O_DSYNC) || - IS_SYNC(iocb->ki_filp->f_mapping->host))) { - retval = dio_set_defer_completion(dio); + if (dio->is_async && iov_iter_rw(iter) == WRITE) { + retval = 0; + if ((iocb->ki_filp->f_flags & O_DSYNC) || + IS_SYNC(iocb->ki_filp->f_mapping->host)) + retval = dio_set_defer_completion(dio); + else if (!dio->inode->i_sb->s_dio_done_wq) { + /* + * In case of AIO write racing with buffered read we + * need to defer completion. We can't decide this now, + * however the workqueue needs to be initialized here. + */ + retval = sb_init_dio_done_wq(dio->inode->i_sb); + } if (retval) { /* * We grab i_mutex only for reads so we don't have diff --git a/fs/iomap.c b/fs/iomap.c index 1732228..144512e 100644 --- a/fs/iomap.c +++ b/fs/iomap.c @@ -713,8 +713,15 @@ struct iomap_dio { static ssize_t iomap_dio_complete(struct iomap_dio *dio) { struct kiocb *iocb = dio->iocb; + struct inode *inode = file_inode(iocb->ki_filp); ssize_t ret; + if (!dio->error && + (dio->flags & IOMAP_DIO_WRITE) && inode->i_mapping->nrpages) + invalidate_inode_pages2_range(inode->i_mapping, + iocb->ki_pos >> PAGE_SHIFT, + (iocb->ki_pos + dio->size - 1) >> PAGE_SHIFT); + if (dio->end_io) { ret = dio->end_io(iocb, dio->error ? dio->error : dio->size, -- 2.7.5 ^ permalink raw reply related [flat|nested] 40+ messages in thread
* Re: [PATCH v5] fs: Fix page cache inconsistency when mixing buffered and AIO DIO 2017-07-19 11:28 ` [PATCH v5] " Lukas Czerner @ 2017-07-19 11:37 ` Jan Kara 2017-07-19 12:17 ` Jeff Moyer ` (2 subsequent siblings) 3 siblings, 0 replies; 40+ messages in thread From: Jan Kara @ 2017-07-19 11:37 UTC (permalink / raw) To: Lukas Czerner; +Cc: linux-fsdevel, viro, jack, hch, Jeff Moyer On Wed 19-07-17 13:28:12, Lukas Czerner wrote: > Currently when mixing buffered reads and asynchronous direct writes it > is possible to end up with the situation where we have stale data in the > page cache while the new data is already written to disk. This is > permanent until the affected pages are flushed away. Despite the fact > that mixing buffered and direct IO is ill-advised, it does pose a threat > to data integrity, is unexpected and should be fixed. > > Fix this by deferring completion of asynchronous direct writes to a > process context in the case that there are mapped pages to be found in > the inode. Later, before the completion in dio_complete(), invalidate > the pages in question. This ensures that after the completion the pages > in the written area are either unmapped, or populated with up-to-date > data. Also do the same for the iomap case which uses > iomap_dio_complete() instead. > > This has a side effect of deferring the completion to a process context > for every AIO DIO that happens on an inode that has pages mapped. However > since the consensus is that this is an ill-advised practice the performance > implication should not be a problem. > > This was based on a proposal from Jeff Moyer, thanks! > > Signed-off-by: Lukas Czerner <lczerner@redhat.com> > Cc: Jeff Moyer <jmoyer@redhat.com> You forgot to add my Reviewed-by tag. So feel free to add it now: Reviewed-by: Jan Kara <jack@suse.cz> Honza > --- > v2: Remove leftover ret variable from invalidate call in iomap_dio_complete > v3: Do not invalidate in case of error. 
Add some comments > v4: Remove unnecessary variable, remove unnecessary inner braces > v5: Style changes > > fs/direct-io.c | 39 ++++++++++++++++++++++++++++++++++----- > fs/iomap.c | 7 +++++++ > 2 files changed, 41 insertions(+), 5 deletions(-) > > diff --git a/fs/direct-io.c b/fs/direct-io.c > index 08cf278..0d1befd 100644 > --- a/fs/direct-io.c > +++ b/fs/direct-io.c > @@ -258,6 +258,13 @@ static ssize_t dio_complete(struct dio *dio, ssize_t ret, bool is_async) > if (ret == 0) > ret = transferred; > > + if (ret > 0 && dio->op == REQ_OP_WRITE && > + dio->inode->i_mapping->nrpages) { > + invalidate_inode_pages2_range(dio->inode->i_mapping, > + offset >> PAGE_SHIFT, > + (offset + ret - 1) >> PAGE_SHIFT); > + } > + > if (dio->end_io) { > int err; > > @@ -304,6 +311,7 @@ static void dio_bio_end_aio(struct bio *bio) > struct dio *dio = bio->bi_private; > unsigned long remaining; > unsigned long flags; > + bool defer_completion = false; > > /* cleanup the bio */ > dio_bio_complete(dio, bio); > @@ -315,7 +323,19 @@ static void dio_bio_end_aio(struct bio *bio) > spin_unlock_irqrestore(&dio->bio_lock, flags); > > if (remaining == 0) { > - if (dio->result && dio->defer_completion) { > + /* > + * Defer completion when defer_completion is set or > + * when the inode has pages mapped and this is AIO write. > + * We need to invalidate those pages because there is a > + * chance they contain stale data in the case buffered IO > + * went in between AIO submission and completion into the > + * same region. 
> + */ > + if (dio->result) > + defer_completion = dio->defer_completion || > + (dio->op == REQ_OP_WRITE && > + dio->inode->i_mapping->nrpages); > + if (defer_completion) { > INIT_WORK(&dio->complete_work, dio_aio_complete_work); > queue_work(dio->inode->i_sb->s_dio_done_wq, > &dio->complete_work); > @@ -1210,10 +1230,19 @@ do_blockdev_direct_IO(struct kiocb *iocb, struct inode *inode, > * For AIO O_(D)SYNC writes we need to defer completions to a workqueue > * so that we can call ->fsync. > */ > - if (dio->is_async && iov_iter_rw(iter) == WRITE && > - ((iocb->ki_filp->f_flags & O_DSYNC) || > - IS_SYNC(iocb->ki_filp->f_mapping->host))) { > - retval = dio_set_defer_completion(dio); > + if (dio->is_async && iov_iter_rw(iter) == WRITE) { > + retval = 0; > + if ((iocb->ki_filp->f_flags & O_DSYNC) || > + IS_SYNC(iocb->ki_filp->f_mapping->host)) > + retval = dio_set_defer_completion(dio); > + else if (!dio->inode->i_sb->s_dio_done_wq) { > + /* > + * In case of AIO write racing with buffered read we > + * need to defer completion. We can't decide this now, > + * however the workqueue needs to be initialized here. > + */ > + retval = sb_init_dio_done_wq(dio->inode->i_sb); > + } > if (retval) { > /* > * We grab i_mutex only for reads so we don't have > diff --git a/fs/iomap.c b/fs/iomap.c > index 1732228..144512e 100644 > --- a/fs/iomap.c > +++ b/fs/iomap.c > @@ -713,8 +713,15 @@ struct iomap_dio { > static ssize_t iomap_dio_complete(struct iomap_dio *dio) > { > struct kiocb *iocb = dio->iocb; > + struct inode *inode = file_inode(iocb->ki_filp); > ssize_t ret; > > + if (!dio->error && > + (dio->flags & IOMAP_DIO_WRITE) && inode->i_mapping->nrpages) > + invalidate_inode_pages2_range(inode->i_mapping, > + iocb->ki_pos >> PAGE_SHIFT, > + (iocb->ki_pos + dio->size - 1) >> PAGE_SHIFT); > + > if (dio->end_io) { > ret = dio->end_io(iocb, > dio->error ? 
dio->error : dio->size, > -- > 2.7.5 > -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 40+ messages in thread
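Both invalidation sites in the patch above (dio_complete() and iomap_dio_complete()) convert the written byte range into an inclusive range of page-cache indices before calling invalidate_inode_pages2_range(). A minimal user-space sketch of that arithmetic — assuming a 4 KiB page, i.e. PAGE_SHIFT of 12; the kernel's actual value is architecture- and config-dependent:

```c
#include <assert.h>

/* Assumed for illustration only: 4 KiB pages. */
#define PAGE_SHIFT 12

struct page_range {
	long long first;	/* index of the first page to invalidate */
	long long last;		/* index of the last page, inclusive */
};

/*
 * Mirrors offset >> PAGE_SHIFT and (offset + len - 1) >> PAGE_SHIFT as
 * used with invalidate_inode_pages2_range() in the patch. Subtracting 1
 * from the end offset keeps a write that ends exactly on a page
 * boundary from needlessly touching the following page.
 */
struct page_range write_pages(long long offset, long long len)
{
	struct page_range r;
	r.first = offset >> PAGE_SHIFT;
	r.last = (offset + len - 1) >> PAGE_SHIFT;
	return r;
}
```

For example, a 100-byte write at offset 4096 maps to pages 1..1, while an 8192-byte write at offset 0 maps to pages 0..1.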
* Re: [PATCH v5] fs: Fix page cache inconsistency when mixing buffered and AIO DIO 2017-07-19 11:28 ` [PATCH v5] " Lukas Czerner 2017-07-19 11:37 ` Jan Kara @ 2017-07-19 12:17 ` Jeff Moyer 2017-08-03 18:10 ` Jeff Moyer 2017-08-10 12:59 ` [PATCH v6] " Lukas Czerner 3 siblings, 0 replies; 40+ messages in thread From: Jeff Moyer @ 2017-07-19 12:17 UTC (permalink / raw) To: Lukas Czerner; +Cc: linux-fsdevel, viro, jack, hch Lukas Czerner <lczerner@redhat.com> writes: > Currently when mixing buffered reads and asynchronous direct writes it > is possible to end up with the situation where we have stale data in the > page cache while the new data is already written to disk. This is > permanent until the affected pages are flushed away. Despite the fact > that mixing buffered and direct IO is ill-advised it does pose a thread > for a data integrity, is unexpected and should be fixed. > > Fix this by deferring completion of asynchronous direct writes to a > process context in the case that there are mapped pages to be found in > the inode. Later before the completion in dio_complete() invalidate > the pages in question. This ensures that after the completion the pages > in the written area are either unmapped, or populated with up-to-date > data. Also do the same for the iomap case which uses > iomap_dio_complete() instead. > > This has a side effect of deferring the completion to a process context > for every AIO DIO that happens on inode that has pages mapped. However > since the consensus is that this is ill-advised practice the performance > implication should not be a problem. > > This was based on proposal from Jeff Moyer, thanks! > > Signed-off-by: Lukas Czerner <lczerner@redhat.com> > Cc: Jeff Moyer <jmoyer@redhat.com> Looks good, Lukas! Thanks! Reviewed-by: Jeff Moyer <jmoyer@redhat.com> > --- > v2: Remove leftover ret variable from invalidate call in iomap_dio_complete > v3: Do not invalidate in case of error. 
Add some coments > v4: Remove unnecessary variable, remove unnecessary inner braces > v5: Style changes > > fs/direct-io.c | 39 ++++++++++++++++++++++++++++++++++----- > fs/iomap.c | 7 +++++++ > 2 files changed, 41 insertions(+), 5 deletions(-) > > diff --git a/fs/direct-io.c b/fs/direct-io.c > index 08cf278..0d1befd 100644 > --- a/fs/direct-io.c > +++ b/fs/direct-io.c > @@ -258,6 +258,13 @@ static ssize_t dio_complete(struct dio *dio, ssize_t ret, bool is_async) > if (ret == 0) > ret = transferred; > > + if (ret > 0 && dio->op == REQ_OP_WRITE && > + dio->inode->i_mapping->nrpages) { > + invalidate_inode_pages2_range(dio->inode->i_mapping, > + offset >> PAGE_SHIFT, > + (offset + ret - 1) >> PAGE_SHIFT); > + } > + > if (dio->end_io) { > int err; > > @@ -304,6 +311,7 @@ static void dio_bio_end_aio(struct bio *bio) > struct dio *dio = bio->bi_private; > unsigned long remaining; > unsigned long flags; > + bool defer_completion = false; > > /* cleanup the bio */ > dio_bio_complete(dio, bio); > @@ -315,7 +323,19 @@ static void dio_bio_end_aio(struct bio *bio) > spin_unlock_irqrestore(&dio->bio_lock, flags); > > if (remaining == 0) { > - if (dio->result && dio->defer_completion) { > + /* > + * Defer completion when defer_completion is set or > + * when the inode has pages mapped and this is AIO write. > + * We need to invalidate those pages because there is a > + * chance they contain stale data in the case buffered IO > + * went in between AIO submission and completion into the > + * same region. 
> + */ > + if (dio->result) > + defer_completion = dio->defer_completion || > + (dio->op == REQ_OP_WRITE && > + dio->inode->i_mapping->nrpages); > + if (defer_completion) { > INIT_WORK(&dio->complete_work, dio_aio_complete_work); > queue_work(dio->inode->i_sb->s_dio_done_wq, > &dio->complete_work); > @@ -1210,10 +1230,19 @@ do_blockdev_direct_IO(struct kiocb *iocb, struct inode *inode, > * For AIO O_(D)SYNC writes we need to defer completions to a workqueue > * so that we can call ->fsync. > */ > - if (dio->is_async && iov_iter_rw(iter) == WRITE && > - ((iocb->ki_filp->f_flags & O_DSYNC) || > - IS_SYNC(iocb->ki_filp->f_mapping->host))) { > - retval = dio_set_defer_completion(dio); > + if (dio->is_async && iov_iter_rw(iter) == WRITE) { > + retval = 0; > + if ((iocb->ki_filp->f_flags & O_DSYNC) || > + IS_SYNC(iocb->ki_filp->f_mapping->host)) > + retval = dio_set_defer_completion(dio); > + else if (!dio->inode->i_sb->s_dio_done_wq) { > + /* > + * In case of AIO write racing with buffered read we > + * need to defer completion. We can't decide this now, > + * however the workqueue needs to be initialized here. > + */ > + retval = sb_init_dio_done_wq(dio->inode->i_sb); > + } > if (retval) { > /* > * We grab i_mutex only for reads so we don't have > diff --git a/fs/iomap.c b/fs/iomap.c > index 1732228..144512e 100644 > --- a/fs/iomap.c > +++ b/fs/iomap.c > @@ -713,8 +713,15 @@ struct iomap_dio { > static ssize_t iomap_dio_complete(struct iomap_dio *dio) > { > struct kiocb *iocb = dio->iocb; > + struct inode *inode = file_inode(iocb->ki_filp); > ssize_t ret; > > + if (!dio->error && > + (dio->flags & IOMAP_DIO_WRITE) && inode->i_mapping->nrpages) > + invalidate_inode_pages2_range(inode->i_mapping, > + iocb->ki_pos >> PAGE_SHIFT, > + (iocb->ki_pos + dio->size - 1) >> PAGE_SHIFT); > + > if (dio->end_io) { > ret = dio->end_io(iocb, > dio->error ? dio->error : dio->size, ^ permalink raw reply [flat|nested] 40+ messages in thread
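The new logic in dio_bio_end_aio() quoted above boils down to a small predicate: defer completion to the workqueue when deferral was already requested, or when an AIO write completes against an inode that still has pages in its page cache. A stand-alone model of that decision — the names and types here are simplified stand-ins for the kernel's struct dio fields, not the real API:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-ins for the kernel's REQ_OP_* request op values. */
enum { OP_READ = 0, OP_WRITE = 1 };

/*
 * Parameters correspond to dio->result, dio->defer_completion, dio->op
 * and dio->inode->i_mapping->nrpages in the patch.
 */
bool should_defer(long long result, bool defer_requested, int op,
		  unsigned long nrpages)
{
	/* Nothing transferred: complete inline, exactly as before the
	 * patch (the dio->result check guards both conditions). */
	if (result == 0)
		return false;
	/*
	 * Defer when a caller already asked for it (e.g. O_DSYNC), or
	 * when an AIO write hits an inode with cached pages that may
	 * now hold stale data from a racing buffered read.
	 */
	return defer_requested || (op == OP_WRITE && nrpages != 0);
}
```

Note that a write to an inode with an empty page cache (nrpages == 0) still completes inline, so the common pure-direct-IO case pays no deferral cost.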
* Re: [PATCH v5] fs: Fix page cache inconsistency when mixing buffered and AIO DIO 2017-07-19 11:28 ` [PATCH v5] " Lukas Czerner 2017-07-19 11:37 ` Jan Kara 2017-07-19 12:17 ` Jeff Moyer @ 2017-08-03 18:10 ` Jeff Moyer 2017-08-04 10:09 ` Dave Chinner 2017-08-10 12:59 ` [PATCH v6] " Lukas Czerner 3 siblings, 1 reply; 40+ messages in thread From: Jeff Moyer @ 2017-08-03 18:10 UTC (permalink / raw) To: viro, Lukas Czerner; +Cc: linux-fsdevel, jack, hch Al, would you mind taking this in through your tree? It's been reviewed by myself and Jan in this mail thread. Thanks! Jeff Lukas Czerner <lczerner@redhat.com> writes: > Currently when mixing buffered reads and asynchronous direct writes it > is possible to end up with the situation where we have stale data in the > page cache while the new data is already written to disk. This is > permanent until the affected pages are flushed away. Despite the fact > that mixing buffered and direct IO is ill-advised it does pose a thread > for a data integrity, is unexpected and should be fixed. > > Fix this by deferring completion of asynchronous direct writes to a > process context in the case that there are mapped pages to be found in > the inode. Later before the completion in dio_complete() invalidate > the pages in question. This ensures that after the completion the pages > in the written area are either unmapped, or populated with up-to-date > data. Also do the same for the iomap case which uses > iomap_dio_complete() instead. > > This has a side effect of deferring the completion to a process context > for every AIO DIO that happens on inode that has pages mapped. However > since the consensus is that this is ill-advised practice the performance > implication should not be a problem. > > This was based on proposal from Jeff Moyer, thanks! 
> > Signed-off-by: Lukas Czerner <lczerner@redhat.com> > Cc: Jeff Moyer <jmoyer@redhat.com> > --- > v2: Remove leftover ret variable from invalidate call in iomap_dio_complete > v3: Do not invalidate in case of error. Add some coments > v4: Remove unnecessary variable, remove unnecessary inner braces > v5: Style changes > > fs/direct-io.c | 39 ++++++++++++++++++++++++++++++++++----- > fs/iomap.c | 7 +++++++ > 2 files changed, 41 insertions(+), 5 deletions(-) > > diff --git a/fs/direct-io.c b/fs/direct-io.c > index 08cf278..0d1befd 100644 > --- a/fs/direct-io.c > +++ b/fs/direct-io.c > @@ -258,6 +258,13 @@ static ssize_t dio_complete(struct dio *dio, ssize_t ret, bool is_async) > if (ret == 0) > ret = transferred; > > + if (ret > 0 && dio->op == REQ_OP_WRITE && > + dio->inode->i_mapping->nrpages) { > + invalidate_inode_pages2_range(dio->inode->i_mapping, > + offset >> PAGE_SHIFT, > + (offset + ret - 1) >> PAGE_SHIFT); > + } > + > if (dio->end_io) { > int err; > > @@ -304,6 +311,7 @@ static void dio_bio_end_aio(struct bio *bio) > struct dio *dio = bio->bi_private; > unsigned long remaining; > unsigned long flags; > + bool defer_completion = false; > > /* cleanup the bio */ > dio_bio_complete(dio, bio); > @@ -315,7 +323,19 @@ static void dio_bio_end_aio(struct bio *bio) > spin_unlock_irqrestore(&dio->bio_lock, flags); > > if (remaining == 0) { > - if (dio->result && dio->defer_completion) { > + /* > + * Defer completion when defer_completion is set or > + * when the inode has pages mapped and this is AIO write. > + * We need to invalidate those pages because there is a > + * chance they contain stale data in the case buffered IO > + * went in between AIO submission and completion into the > + * same region. 
> + */ > + if (dio->result) > + defer_completion = dio->defer_completion || > + (dio->op == REQ_OP_WRITE && > + dio->inode->i_mapping->nrpages); > + if (defer_completion) { > INIT_WORK(&dio->complete_work, dio_aio_complete_work); > queue_work(dio->inode->i_sb->s_dio_done_wq, > &dio->complete_work); > @@ -1210,10 +1230,19 @@ do_blockdev_direct_IO(struct kiocb *iocb, struct inode *inode, > * For AIO O_(D)SYNC writes we need to defer completions to a workqueue > * so that we can call ->fsync. > */ > - if (dio->is_async && iov_iter_rw(iter) == WRITE && > - ((iocb->ki_filp->f_flags & O_DSYNC) || > - IS_SYNC(iocb->ki_filp->f_mapping->host))) { > - retval = dio_set_defer_completion(dio); > + if (dio->is_async && iov_iter_rw(iter) == WRITE) { > + retval = 0; > + if ((iocb->ki_filp->f_flags & O_DSYNC) || > + IS_SYNC(iocb->ki_filp->f_mapping->host)) > + retval = dio_set_defer_completion(dio); > + else if (!dio->inode->i_sb->s_dio_done_wq) { > + /* > + * In case of AIO write racing with buffered read we > + * need to defer completion. We can't decide this now, > + * however the workqueue needs to be initialized here. > + */ > + retval = sb_init_dio_done_wq(dio->inode->i_sb); > + } > if (retval) { > /* > * We grab i_mutex only for reads so we don't have > diff --git a/fs/iomap.c b/fs/iomap.c > index 1732228..144512e 100644 > --- a/fs/iomap.c > +++ b/fs/iomap.c > @@ -713,8 +713,15 @@ struct iomap_dio { > static ssize_t iomap_dio_complete(struct iomap_dio *dio) > { > struct kiocb *iocb = dio->iocb; > + struct inode *inode = file_inode(iocb->ki_filp); > ssize_t ret; > > + if (!dio->error && > + (dio->flags & IOMAP_DIO_WRITE) && inode->i_mapping->nrpages) > + invalidate_inode_pages2_range(inode->i_mapping, > + iocb->ki_pos >> PAGE_SHIFT, > + (iocb->ki_pos + dio->size - 1) >> PAGE_SHIFT); > + > if (dio->end_io) { > ret = dio->end_io(iocb, > dio->error ? dio->error : dio->size, ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v5] fs: Fix page cache inconsistency when mixing buffered and AIO DIO 2017-08-03 18:10 ` Jeff Moyer @ 2017-08-04 10:09 ` Dave Chinner 2017-08-07 15:52 ` Jeff Moyer 0 siblings, 1 reply; 40+ messages in thread From: Dave Chinner @ 2017-08-04 10:09 UTC (permalink / raw) To: Jeff Moyer; +Cc: viro, Lukas Czerner, linux-fsdevel, jack, hch On Thu, Aug 03, 2017 at 02:10:47PM -0400, Jeff Moyer wrote: > Al, would you mind taking this in through your tree? It's been reviewed > by myself and Jan in this mail thread. Still needs more fixing, I think? Sorry, this is the first time I've seen this patch.... > > diff --git a/fs/iomap.c b/fs/iomap.c > > index 1732228..144512e 100644 > > --- a/fs/iomap.c > > +++ b/fs/iomap.c > > @@ -713,8 +713,15 @@ struct iomap_dio { > > static ssize_t iomap_dio_complete(struct iomap_dio *dio) > > { > > struct kiocb *iocb = dio->iocb; > > + struct inode *inode = file_inode(iocb->ki_filp); > > ssize_t ret; > > > > + if (!dio->error && > > + (dio->flags & IOMAP_DIO_WRITE) && inode->i_mapping->nrpages) > > + invalidate_inode_pages2_range(inode->i_mapping, > > + iocb->ki_pos >> PAGE_SHIFT, > > + (iocb->ki_pos + dio->size - 1) >> PAGE_SHIFT); > > + > > if (dio->end_io) { > > ret = dio->end_io(iocb, > > dio->error ? dio->error : dio->size, > This invalidation is already run in iomap_dio_rw() for the sync IO case directly after the call to iomap_dio_complete(). It also has a comment to explain exactly why the invalidation is needed, and it issues a warning to dmesg if the invalidation fails to indicate the reason why the user is reporting data corruption to us. i.e.: ..... ret = iomap_dio_complete(dio); /* * Try again to invalidate clean pages which might have been cached by * non-direct readahead, or faulted in by get_user_pages() if the source * of the write was an mmap'ed region of the file we're writing. Either * one is a pretty crazy thing to do, so we don't support it 100%. 
If * this invalidation fails, tough, the write still worked... */ if (iov_iter_rw(iter) == WRITE) { int err = invalidate_inode_pages2_range(mapping, start >> PAGE_SHIFT, end >> PAGE_SHIFT); WARN_ON_ONCE(err); } return ret; If we're going to replace this with an invalidation in iomap_dio_complete() so it also handles the AIO path, then the comment and warning on invalidation failure also need to be moved to iomap_dio_complete() and the duplicate code removed from iomap_dio_rw()... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v5] fs: Fix page cache inconsistency when mixing buffered and AIO DIO 2017-08-04 10:09 ` Dave Chinner @ 2017-08-07 15:52 ` Jeff Moyer 2017-08-08 8:41 ` Lukas Czerner 0 siblings, 1 reply; 40+ messages in thread From: Jeff Moyer @ 2017-08-07 15:52 UTC (permalink / raw) To: Dave Chinner; +Cc: viro, Lukas Czerner, linux-fsdevel, jack, hch Dave Chinner <david@fromorbit.com> writes: > On Thu, Aug 03, 2017 at 02:10:47PM -0400, Jeff Moyer wrote: >> Al, would you mind taking this in through your tree? It's been reviewed >> by myself and Jan in this mail thread. > > Still needs more fixing, I think? > > Sorry, this is the first time I've seen this patch.... > >> > diff --git a/fs/iomap.c b/fs/iomap.c >> > index 1732228..144512e 100644 >> > --- a/fs/iomap.c >> > +++ b/fs/iomap.c >> > @@ -713,8 +713,15 @@ struct iomap_dio { >> > static ssize_t iomap_dio_complete(struct iomap_dio *dio) >> > { >> > struct kiocb *iocb = dio->iocb; >> > + struct inode *inode = file_inode(iocb->ki_filp); >> > ssize_t ret; >> > >> > + if (!dio->error && >> > + (dio->flags & IOMAP_DIO_WRITE) && inode->i_mapping->nrpages) >> > + invalidate_inode_pages2_range(inode->i_mapping, >> > + iocb->ki_pos >> PAGE_SHIFT, >> > + (iocb->ki_pos + dio->size - 1) >> PAGE_SHIFT); >> > + >> > if (dio->end_io) { >> > ret = dio->end_io(iocb, >> > dio->error ? dio->error : dio->size, >> > > This invalidation is already run in iomap_dio_rw() for the sync IO > case directly after the call to iomap_dio_complete(). It also has a > comment to explain exactly why the the invalidation is needed, and > it issues a warning to dmesg if the invalidation fails to indicate > the reason why the user is reporting data corruption to us. i.e.: > > ..... > ret = iomap_dio_complete(dio); > > /* > * Try again to invalidate clean pages which might have been cached by > * non-direct readahead, or faulted in by get_user_pages() if the source > * of the write was an mmap'ed region of the file we're writing. 
Either > * one is a pretty crazy thing to do, so we don't support it 100%. If > * this invalidation fails, tough, the write still worked... > */ > if (iov_iter_rw(iter) == WRITE) { > int err = invalidate_inode_pages2_range(mapping, > start >> PAGE_SHIFT, end >> PAGE_SHIFT); > WARN_ON_ONCE(err); > } > > return ret; > > If we're going to replace this with an invalidation in > iomap_dio_complete() so it also handles the AIO path, then the > comment and warning on invalidation failure also need to be moved to > iomap_dio_complete() and the duplicate code removed from > iomap_dio_rw()... Yep, good catch. Lukas, care to respin? -Jeff ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v5] fs: Fix page cache inconsistency when mixing buffered and AIO DIO 2017-08-07 15:52 ` Jeff Moyer @ 2017-08-08 8:41 ` Lukas Czerner 0 siblings, 0 replies; 40+ messages in thread From: Lukas Czerner @ 2017-08-08 8:41 UTC (permalink / raw) To: Jeff Moyer; +Cc: Dave Chinner, viro, linux-fsdevel, jack, hch On Mon, Aug 07, 2017 at 11:52:45AM -0400, Jeff Moyer wrote: > Dave Chinner <david@fromorbit.com> writes: > > > On Thu, Aug 03, 2017 at 02:10:47PM -0400, Jeff Moyer wrote: > >> Al, would you mind taking this in through your tree? It's been reviewed > >> by myself and Jan in this mail thread. > > > > Still needs more fixing, I think? > > > > Sorry, this is the first time I've seen this patch.... > > > >> > diff --git a/fs/iomap.c b/fs/iomap.c > >> > index 1732228..144512e 100644 > >> > --- a/fs/iomap.c > >> > +++ b/fs/iomap.c > >> > @@ -713,8 +713,15 @@ struct iomap_dio { > >> > static ssize_t iomap_dio_complete(struct iomap_dio *dio) > >> > { > >> > struct kiocb *iocb = dio->iocb; > >> > + struct inode *inode = file_inode(iocb->ki_filp); > >> > ssize_t ret; > >> > > >> > + if (!dio->error && > >> > + (dio->flags & IOMAP_DIO_WRITE) && inode->i_mapping->nrpages) > >> > + invalidate_inode_pages2_range(inode->i_mapping, > >> > + iocb->ki_pos >> PAGE_SHIFT, > >> > + (iocb->ki_pos + dio->size - 1) >> PAGE_SHIFT); > >> > + > >> > if (dio->end_io) { > >> > ret = dio->end_io(iocb, > >> > dio->error ? dio->error : dio->size, > >> > > > > This invalidation is already run in iomap_dio_rw() for the sync IO > > case directly after the call to iomap_dio_complete(). It also has a > > comment to explain exactly why the the invalidation is needed, and > > it issues a warning to dmesg if the invalidation fails to indicate > > the reason why the user is reporting data corruption to us. i.e.: > > > > ..... 
> > ret = iomap_dio_complete(dio); > > > > /* > > * Try again to invalidate clean pages which might have been cached by > > * non-direct readahead, or faulted in by get_user_pages() if the source > > * of the write was an mmap'ed region of the file we're writing. Either > > * one is a pretty crazy thing to do, so we don't support it 100%. If > > * this invalidation fails, tough, the write still worked... > > */ > > if (iov_iter_rw(iter) == WRITE) { > > int err = invalidate_inode_pages2_range(mapping, > > start >> PAGE_SHIFT, end >> PAGE_SHIFT); > > WARN_ON_ONCE(err); > > } > > > > return ret; > > > > If we're going to replace this with an invalidation in > > iomap_dio_complete() so it also handles the AIO path, then the > > comment and warning on invalidation failure also need to be moved to > > iomap_dio_complete() and the duplicate code removed from > > iomap_dio_rw()... > > Yep, good catch. Lukas, care to respin? Of course, I'll respin. -Lukas > > -Jeff ^ permalink raw reply [flat|nested] 40+ messages in thread
* [PATCH v6] fs: Fix page cache inconsistency when mixing buffered and AIO DIO 2017-07-19 11:28 ` [PATCH v5] " Lukas Czerner ` (2 preceding siblings ...) 2017-08-03 18:10 ` Jeff Moyer @ 2017-08-10 12:59 ` Lukas Czerner 2017-08-10 13:56 ` Jan Kara 2017-08-15 13:28 ` [PATCH v7] " Lukas Czerner 3 siblings, 2 replies; 40+ messages in thread From: Lukas Czerner @ 2017-08-10 12:59 UTC (permalink / raw) To: linux-fsdevel; +Cc: viro, jack, jmoyer, david, Lukas Czerner Currently, when mixing buffered reads and asynchronous direct writes, it is possible to end up in a situation where we have stale data in the page cache while the new data is already written to disk. This persists until the affected pages are flushed away. Despite the fact that mixing buffered and direct IO is ill-advised, it does pose a threat to data integrity, is unexpected, and should be fixed. Fix this by deferring completion of asynchronous direct writes to a process context in the case that there are mapped pages to be found in the inode. Later, before the completion in dio_complete(), invalidate the pages in question. This ensures that after the completion the pages in the written area are either unmapped, or populated with up-to-date data. Also do the same for the iomap case, which uses iomap_dio_complete() instead. This has a side effect of deferring the completion to a process context for every AIO DIO that happens on an inode that has pages mapped. However, since the consensus is that this is an ill-advised practice, the performance implication should not be a problem. This was based on a proposal from Jeff Moyer, thanks! Signed-off-by: Lukas Czerner <lczerner@redhat.com> Cc: Jeff Moyer <jmoyer@redhat.com> --- v2: Remove leftover ret variable from invalidate call in iomap_dio_complete v3: Do not invalidate in case of error. 
Add some comments v4: Remove unnecessary variable, remove unnecessary inner braces v5: Style changes v6: Remove redundant invalidatepage, add warning and comment fs/direct-io.c | 49 +++++++++++++++++++++++++++++++++++++++++++------ fs/iomap.c | 29 ++++++++++++++++------------- 2 files changed, 59 insertions(+), 19 deletions(-) diff --git a/fs/direct-io.c b/fs/direct-io.c index 08cf278..ffb9e19 100644 --- a/fs/direct-io.c +++ b/fs/direct-io.c @@ -229,6 +229,7 @@ static ssize_t dio_complete(struct dio *dio, ssize_t ret, bool is_async) { loff_t offset = dio->iocb->ki_pos; ssize_t transferred = 0; + int err; /* * AIO submission can race with bio completion to get here while @@ -258,8 +259,22 @@ static ssize_t dio_complete(struct dio *dio, ssize_t ret, bool is_async) if (ret == 0) ret = transferred; + /* + * Try again to invalidate clean pages which might have been cached by + * non-direct readahead, or faulted in by get_user_pages() if the source + * of the write was an mmap'ed region of the file we're writing. Either + * one is a pretty crazy thing to do, so we don't support it 100%. If + * this invalidation fails, tough, the write still worked... + */ + if (ret > 0 && dio->op == REQ_OP_WRITE && + dio->inode->i_mapping->nrpages) { + err = invalidate_inode_pages2_range(dio->inode->i_mapping, + offset >> PAGE_SHIFT, + (offset + ret - 1) >> PAGE_SHIFT); + WARN_ON_ONCE(err); + } + if (dio->end_io) { - int err; // XXX: ki_pos?? 
err = dio->end_io(dio->iocb, offset, ret, dio->private); @@ -304,6 +319,7 @@ static void dio_bio_end_aio(struct bio *bio) struct dio *dio = bio->bi_private; unsigned long remaining; unsigned long flags; + bool defer_completion = false; /* cleanup the bio */ dio_bio_complete(dio, bio); @@ -315,7 +331,19 @@ static void dio_bio_end_aio(struct bio *bio) spin_unlock_irqrestore(&dio->bio_lock, flags); if (remaining == 0) { - if (dio->result && dio->defer_completion) { + /* + * Defer completion when defer_completion is set or + * when the inode has pages mapped and this is AIO write. + * We need to invalidate those pages because there is a + * chance they contain stale data in the case buffered IO + * went in between AIO submission and completion into the + * same region. + */ + if (dio->result) + defer_completion = dio->defer_completion || + (dio->op == REQ_OP_WRITE && + dio->inode->i_mapping->nrpages); + if (defer_completion) { INIT_WORK(&dio->complete_work, dio_aio_complete_work); queue_work(dio->inode->i_sb->s_dio_done_wq, &dio->complete_work); @@ -1210,10 +1238,19 @@ do_blockdev_direct_IO(struct kiocb *iocb, struct inode *inode, * For AIO O_(D)SYNC writes we need to defer completions to a workqueue * so that we can call ->fsync. */ - if (dio->is_async && iov_iter_rw(iter) == WRITE && - ((iocb->ki_filp->f_flags & O_DSYNC) || - IS_SYNC(iocb->ki_filp->f_mapping->host))) { - retval = dio_set_defer_completion(dio); + if (dio->is_async && iov_iter_rw(iter) == WRITE) { + retval = 0; + if ((iocb->ki_filp->f_flags & O_DSYNC) || + IS_SYNC(iocb->ki_filp->f_mapping->host)) + retval = dio_set_defer_completion(dio); + else if (!dio->inode->i_sb->s_dio_done_wq) { + /* + * In case of AIO write racing with buffered read we + * need to defer completion. We can't decide this now, + * however the workqueue needs to be initialized here. 
+ */ + retval = sb_init_dio_done_wq(dio->inode->i_sb); + } if (retval) { /* * We grab i_mutex only for reads so we don't have diff --git a/fs/iomap.c b/fs/iomap.c index 0392661..c3e299a 100644 --- a/fs/iomap.c +++ b/fs/iomap.c @@ -713,8 +713,24 @@ struct iomap_dio { static ssize_t iomap_dio_complete(struct iomap_dio *dio) { struct kiocb *iocb = dio->iocb; + struct inode *inode = file_inode(iocb->ki_filp); ssize_t ret; + /* + * Try again to invalidate clean pages which might have been cached by + * non-direct readahead, or faulted in by get_user_pages() if the source + * of the write was an mmap'ed region of the file we're writing. Either + * one is a pretty crazy thing to do, so we don't support it 100%. If + * this invalidation fails, tough, the write still worked... + */ + if (!dio->error && + (dio->flags & IOMAP_DIO_WRITE) && inode->i_mapping->nrpages) { + ret = invalidate_inode_pages2_range(inode->i_mapping, + iocb->ki_pos >> PAGE_SHIFT, + (iocb->ki_pos + dio->size - 1) >> PAGE_SHIFT); + WARN_ON_ONCE(ret); + } + if (dio->end_io) { ret = dio->end_io(iocb, dio->error ? dio->error : dio->size, @@ -1042,19 +1058,6 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter, ret = iomap_dio_complete(dio); - /* - * Try again to invalidate clean pages which might have been cached by - * non-direct readahead, or faulted in by get_user_pages() if the source - * of the write was an mmap'ed region of the file we're writing. Either - * one is a pretty crazy thing to do, so we don't support it 100%. If - * this invalidation fails, tough, the write still worked... - */ - if (iov_iter_rw(iter) == WRITE) { - int err = invalidate_inode_pages2_range(mapping, - start >> PAGE_SHIFT, end >> PAGE_SHIFT); - WARN_ON_ONCE(err); - } - return ret; out_free_dio: -- 2.7.5 ^ permalink raw reply related [flat|nested] 40+ messages in thread
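The submission-side change in do_blockdev_direct_IO() in the v6 patch above amounts to a three-way decision: synchronous or non-write I/O needs no deferral machinery, O_(D)SYNC AIO writes must defer immediately so ->fsync can be called, and all other AIO writes merely ensure the per-superblock workqueue exists, since the actual deferral is decided later in dio_bio_end_aio(). A rough model with a hypothetical simplified signature (the kernel consults iocb->ki_filp->f_flags & O_DSYNC and IS_SYNC() on the inode, and only calls sb_init_dio_done_wq() when s_dio_done_wq is still NULL):

```c
#include <assert.h>
#include <stdbool.h>

enum submit_action {
	COMPLETE_INLINE,	/* no deferral machinery needed at submit time */
	DEFER_NOW,		/* O_(D)SYNC write: defer so ->fsync can run */
	INIT_WQ_ONLY,		/* defer decision postponed to bio completion */
};

enum submit_action dio_submit_action(bool is_async_write, bool o_dsync,
				     bool inode_is_sync)
{
	if (!is_async_write)
		return COMPLETE_INLINE;
	if (o_dsync || inode_is_sync)
		return DEFER_NOW;
	/*
	 * An AIO write may still race with buffered reads; whether to
	 * defer is only known at completion time (the nrpages check),
	 * but the workqueue must be created now, in process context,
	 * because completion may run in interrupt context.
	 */
	return INIT_WQ_ONLY;
}
```

The design choice here is to pay the (one-time, per superblock) workqueue setup cost at submission, where sleeping is allowed, rather than risk needing it in the completion path where it cannot be allocated.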
* Re: [PATCH v6] fs: Fix page cache inconsistency when mixing buffered and AIO DIO 2017-08-10 12:59 ` [PATCH v6] " Lukas Czerner @ 2017-08-10 13:56 ` Jan Kara 2017-08-10 14:22 ` Jeff Moyer 2017-08-15 13:28 ` [PATCH v7] " Lukas Czerner 1 sibling, 1 reply; 40+ messages in thread From: Jan Kara @ 2017-08-10 13:56 UTC (permalink / raw) To: Lukas Czerner; +Cc: linux-fsdevel, viro, jack, jmoyer, david On Thu 10-08-17 14:59:57, Lukas Czerner wrote: > Currently when mixing buffered reads and asynchronous direct writes it > is possible to end up with the situation where we have stale data in the > page cache while the new data is already written to disk. This is > permanent until the affected pages are flushed away. Despite the fact > that mixing buffered and direct IO is ill-advised it does pose a thread > for a data integrity, is unexpected and should be fixed. > > Fix this by deferring completion of asynchronous direct writes to a > process context in the case that there are mapped pages to be found in > the inode. Later before the completion in dio_complete() invalidate > the pages in question. This ensures that after the completion the pages > in the written area are either unmapped, or populated with up-to-date > data. Also do the same for the iomap case which uses > iomap_dio_complete() instead. > > This has a side effect of deferring the completion to a process context > for every AIO DIO that happens on inode that has pages mapped. However > since the consensus is that this is ill-advised practice the performance > implication should not be a problem. > > This was based on proposal from Jeff Moyer, thanks! It seems the invalidation can also be removed from generic_file_direct_write(), can't it? It is duplicated there the same way as it was in the iomap code... 
Honza > > Signed-off-by: Lukas Czerner <lczerner@redhat.com> > Cc: Jeff Moyer <jmoyer@redhat.com> > --- > v2: Remove leftover ret variable from invalidate call in iomap_dio_complete > v3: Do not invalidate in case of error. Add some coments > v4: Remove unnecessary variable, remove unnecessary inner braces > v5: Style changes > v6: Remove redundant invalidatepage, add warning and comment > > fs/direct-io.c | 49 +++++++++++++++++++++++++++++++++++++++++++------ > fs/iomap.c | 29 ++++++++++++++++------------- > 2 files changed, 59 insertions(+), 19 deletions(-) > > diff --git a/fs/direct-io.c b/fs/direct-io.c > index 08cf278..ffb9e19 100644 > --- a/fs/direct-io.c > +++ b/fs/direct-io.c > @@ -229,6 +229,7 @@ static ssize_t dio_complete(struct dio *dio, ssize_t ret, bool is_async) > { > loff_t offset = dio->iocb->ki_pos; > ssize_t transferred = 0; > + int err; > > /* > * AIO submission can race with bio completion to get here while > @@ -258,8 +259,22 @@ static ssize_t dio_complete(struct dio *dio, ssize_t ret, bool is_async) > if (ret == 0) > ret = transferred; > > + /* > + * Try again to invalidate clean pages which might have been cached by > + * non-direct readahead, or faulted in by get_user_pages() if the source > + * of the write was an mmap'ed region of the file we're writing. Either > + * one is a pretty crazy thing to do, so we don't support it 100%. If > + * this invalidation fails, tough, the write still worked... > + */ > + if (ret > 0 && dio->op == REQ_OP_WRITE && > + dio->inode->i_mapping->nrpages) { > + err = invalidate_inode_pages2_range(dio->inode->i_mapping, > + offset >> PAGE_SHIFT, > + (offset + ret - 1) >> PAGE_SHIFT); > + WARN_ON_ONCE(err); > + } > + > if (dio->end_io) { > - int err; > > // XXX: ki_pos?? 
> err = dio->end_io(dio->iocb, offset, ret, dio->private); > @@ -304,6 +319,7 @@ static void dio_bio_end_aio(struct bio *bio) > struct dio *dio = bio->bi_private; > unsigned long remaining; > unsigned long flags; > + bool defer_completion = false; > > /* cleanup the bio */ > dio_bio_complete(dio, bio); > @@ -315,7 +331,19 @@ static void dio_bio_end_aio(struct bio *bio) > spin_unlock_irqrestore(&dio->bio_lock, flags); > > if (remaining == 0) { > - if (dio->result && dio->defer_completion) { > + /* > + * Defer completion when defer_completion is set or > + * when the inode has pages mapped and this is AIO write. > + * We need to invalidate those pages because there is a > + * chance they contain stale data in the case buffered IO > + * went in between AIO submission and completion into the > + * same region. > + */ > + if (dio->result) > + defer_completion = dio->defer_completion || > + (dio->op == REQ_OP_WRITE && > + dio->inode->i_mapping->nrpages); > + if (defer_completion) { > INIT_WORK(&dio->complete_work, dio_aio_complete_work); > queue_work(dio->inode->i_sb->s_dio_done_wq, > &dio->complete_work); > @@ -1210,10 +1238,19 @@ do_blockdev_direct_IO(struct kiocb *iocb, struct inode *inode, > * For AIO O_(D)SYNC writes we need to defer completions to a workqueue > * so that we can call ->fsync. > */ > - if (dio->is_async && iov_iter_rw(iter) == WRITE && > - ((iocb->ki_filp->f_flags & O_DSYNC) || > - IS_SYNC(iocb->ki_filp->f_mapping->host))) { > - retval = dio_set_defer_completion(dio); > + if (dio->is_async && iov_iter_rw(iter) == WRITE) { > + retval = 0; > + if ((iocb->ki_filp->f_flags & O_DSYNC) || > + IS_SYNC(iocb->ki_filp->f_mapping->host)) > + retval = dio_set_defer_completion(dio); > + else if (!dio->inode->i_sb->s_dio_done_wq) { > + /* > + * In case of AIO write racing with buffered read we > + * need to defer completion. We can't decide this now, > + * however the workqueue needs to be initialized here. 
> + */ > + retval = sb_init_dio_done_wq(dio->inode->i_sb); > + } > if (retval) { > /* > * We grab i_mutex only for reads so we don't have > diff --git a/fs/iomap.c b/fs/iomap.c > index 0392661..c3e299a 100644 > --- a/fs/iomap.c > +++ b/fs/iomap.c > @@ -713,8 +713,24 @@ struct iomap_dio { > static ssize_t iomap_dio_complete(struct iomap_dio *dio) > { > struct kiocb *iocb = dio->iocb; > + struct inode *inode = file_inode(iocb->ki_filp); > ssize_t ret; > > + /* > + * Try again to invalidate clean pages which might have been cached by > + * non-direct readahead, or faulted in by get_user_pages() if the source > + * of the write was an mmap'ed region of the file we're writing. Either > + * one is a pretty crazy thing to do, so we don't support it 100%. If > + * this invalidation fails, tough, the write still worked... > + */ > + if (!dio->error && > + (dio->flags & IOMAP_DIO_WRITE) && inode->i_mapping->nrpages) { > + ret = invalidate_inode_pages2_range(inode->i_mapping, > + iocb->ki_pos >> PAGE_SHIFT, > + (iocb->ki_pos + dio->size - 1) >> PAGE_SHIFT); > + WARN_ON_ONCE(ret); > + } > + > if (dio->end_io) { > ret = dio->end_io(iocb, > dio->error ? dio->error : dio->size, > @@ -1042,19 +1058,6 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter, > > ret = iomap_dio_complete(dio); > > - /* > - * Try again to invalidate clean pages which might have been cached by > - * non-direct readahead, or faulted in by get_user_pages() if the source > - * of the write was an mmap'ed region of the file we're writing. Either > - * one is a pretty crazy thing to do, so we don't support it 100%. If > - * this invalidation fails, tough, the write still worked... > - */ > - if (iov_iter_rw(iter) == WRITE) { > - int err = invalidate_inode_pages2_range(mapping, > - start >> PAGE_SHIFT, end >> PAGE_SHIFT); > - WARN_ON_ONCE(err); > - } > - > return ret; > > out_free_dio: > -- > 2.7.5 > -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v6] fs: Fix page cache inconsistency when mixing buffered and AIO DIO 2017-08-10 13:56 ` Jan Kara @ 2017-08-10 14:22 ` Jeff Moyer 2017-08-11 9:03 ` Lukas Czerner 0 siblings, 1 reply; 40+ messages in thread From: Jeff Moyer @ 2017-08-10 14:22 UTC (permalink / raw) To: Jan Kara; +Cc: Lukas Czerner, linux-fsdevel, viro, david Jan Kara <jack@suse.cz> writes: > On Thu 10-08-17 14:59:57, Lukas Czerner wrote: >> Currently when mixing buffered reads and asynchronous direct writes it >> is possible to end up with the situation where we have stale data in the >> page cache while the new data is already written to disk. This is >> permanent until the affected pages are flushed away. Despite the fact >> that mixing buffered and direct IO is ill-advised it does pose a thread >> for a data integrity, is unexpected and should be fixed. >> >> Fix this by deferring completion of asynchronous direct writes to a >> process context in the case that there are mapped pages to be found in >> the inode. Later before the completion in dio_complete() invalidate >> the pages in question. This ensures that after the completion the pages >> in the written area are either unmapped, or populated with up-to-date >> data. Also do the same for the iomap case which uses >> iomap_dio_complete() instead. >> >> This has a side effect of deferring the completion to a process context >> for every AIO DIO that happens on inode that has pages mapped. However >> since the consensus is that this is ill-advised practice the performance >> implication should not be a problem. >> >> This was based on proposal from Jeff Moyer, thanks! > > It seems the invalidation can be also removed from > generic_file_direct_write(), can't it? It is duplicit there the same way as > it was in the iomap code... Yep, sure looks that way. -Jeff ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v6] fs: Fix page cache inconsistency when mixing buffered and AIO DIO 2017-08-10 14:22 ` Jeff Moyer @ 2017-08-11 9:03 ` Lukas Czerner 2017-08-14 9:43 ` Jan Kara 0 siblings, 1 reply; 40+ messages in thread From: Lukas Czerner @ 2017-08-11 9:03 UTC (permalink / raw) To: Jeff Moyer; +Cc: Jan Kara, linux-fsdevel, viro, david On Thu, Aug 10, 2017 at 10:22:47AM -0400, Jeff Moyer wrote: > Jan Kara <jack@suse.cz> writes: > > > On Thu 10-08-17 14:59:57, Lukas Czerner wrote: > >> Currently when mixing buffered reads and asynchronous direct writes it > >> is possible to end up with the situation where we have stale data in the > >> page cache while the new data is already written to disk. This is > >> permanent until the affected pages are flushed away. Despite the fact > >> that mixing buffered and direct IO is ill-advised it does pose a thread > >> for a data integrity, is unexpected and should be fixed. > >> > >> Fix this by deferring completion of asynchronous direct writes to a > >> process context in the case that there are mapped pages to be found in > >> the inode. Later before the completion in dio_complete() invalidate > >> the pages in question. This ensures that after the completion the pages > >> in the written area are either unmapped, or populated with up-to-date > >> data. Also do the same for the iomap case which uses > >> iomap_dio_complete() instead. > >> > >> This has a side effect of deferring the completion to a process context > >> for every AIO DIO that happens on inode that has pages mapped. However > >> since the consensus is that this is ill-advised practice the performance > >> implication should not be a problem. > >> > >> This was based on proposal from Jeff Moyer, thanks! > > > > It seems the invalidation can be also removed from > > generic_file_direct_write(), can't it? It is duplicit there the same way as > > it was in the iomap code... > > Yep, sure looks that way. Hrm, ok. 
Technically speaking, generic_file_direct_write() does not have to eventually end up with dio_complete() being called. This will change the behaviour for those that implement dio differently. Looking at the users now, the vast majority will end up with dio_complete(), so maybe this is not a problem. This is in contrast with iomap_dio_rw(), which will end up calling iomap_dio_complete(), so the situation is different there. Maybe adding a mapping->nrpages check would be better than outright removing it? -Lukas > > -Jeff ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v6] fs: Fix page cache inconsistency when mixing buffered and AIO DIO 2017-08-11 9:03 ` Lukas Czerner @ 2017-08-14 9:43 ` Jan Kara 2017-08-15 12:47 ` Lukas Czerner 0 siblings, 1 reply; 40+ messages in thread From: Jan Kara @ 2017-08-14 9:43 UTC (permalink / raw) To: Lukas Czerner; +Cc: Jeff Moyer, Jan Kara, linux-fsdevel, viro, david Hello, On Fri 11-08-17 11:03:01, Lukas Czerner wrote: > On Thu, Aug 10, 2017 at 10:22:47AM -0400, Jeff Moyer wrote: > > Jan Kara <jack@suse.cz> writes: > > > > > On Thu 10-08-17 14:59:57, Lukas Czerner wrote: > > >> Currently when mixing buffered reads and asynchronous direct writes it > > >> is possible to end up with the situation where we have stale data in the > > >> page cache while the new data is already written to disk. This is > > >> permanent until the affected pages are flushed away. Despite the fact > > >> that mixing buffered and direct IO is ill-advised it does pose a thread > > >> for a data integrity, is unexpected and should be fixed. > > >> > > >> Fix this by deferring completion of asynchronous direct writes to a > > >> process context in the case that there are mapped pages to be found in > > >> the inode. Later before the completion in dio_complete() invalidate > > >> the pages in question. This ensures that after the completion the pages > > >> in the written area are either unmapped, or populated with up-to-date > > >> data. Also do the same for the iomap case which uses > > >> iomap_dio_complete() instead. > > >> > > >> This has a side effect of deferring the completion to a process context > > >> for every AIO DIO that happens on inode that has pages mapped. However > > >> since the consensus is that this is ill-advised practice the performance > > >> implication should not be a problem. > > >> > > >> This was based on proposal from Jeff Moyer, thanks! > > > > > > It seems the invalidation can be also removed from > > > generic_file_direct_write(), can't it? 
It is duplicit there the same way as > > > it was in the iomap code... > > > > Yep, sure looks that way. > > Hrm, ok. Technically speaking generic_file_direct_write does not have to > eventually end up with dio_complete() being called. This will change the > behaviour for those that implement dio differently. Looking at the users > now, vast majority will end up with complete_dio() so maybe this is not > a problem. OK, so this seems to be the problem with 9p, fuse, nfs, lustre. > This is in contrast with iomap_dio_rw() which will end up calling > iomap_dio_complete() so the situation is different there. > > Maybe adding mapping->nrpages check would be better than outright > removing it ? OK, I agree we cannot just remove the invalidation. But shouldn't we rather fix the above-mentioned filesystems? Otherwise they will keep having the issues you are trying to fix. But for now I could live with keeping the invalidation behind an nrpages check and adding a comment explaining why we kept it there... Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v6] fs: Fix page cache inconsistency when mixing buffered and AIO DIO 2017-08-14 9:43 ` Jan Kara @ 2017-08-15 12:47 ` Lukas Czerner 0 siblings, 0 replies; 40+ messages in thread From: Lukas Czerner @ 2017-08-15 12:47 UTC (permalink / raw) To: Jan Kara; +Cc: Jeff Moyer, linux-fsdevel, viro, david On Mon, Aug 14, 2017 at 11:43:31AM +0200, Jan Kara wrote: > Hello, > > On Fri 11-08-17 11:03:01, Lukas Czerner wrote: > > On Thu, Aug 10, 2017 at 10:22:47AM -0400, Jeff Moyer wrote: > > > Jan Kara <jack@suse.cz> writes: > > > > > > > On Thu 10-08-17 14:59:57, Lukas Czerner wrote: > > > >> Currently when mixing buffered reads and asynchronous direct writes it > > > >> is possible to end up with the situation where we have stale data in the > > > >> page cache while the new data is already written to disk. This is > > > >> permanent until the affected pages are flushed away. Despite the fact > > > >> that mixing buffered and direct IO is ill-advised it does pose a thread > > > >> for a data integrity, is unexpected and should be fixed. > > > >> > > > >> Fix this by deferring completion of asynchronous direct writes to a > > > >> process context in the case that there are mapped pages to be found in > > > >> the inode. Later before the completion in dio_complete() invalidate > > > >> the pages in question. This ensures that after the completion the pages > > > >> in the written area are either unmapped, or populated with up-to-date > > > >> data. Also do the same for the iomap case which uses > > > >> iomap_dio_complete() instead. > > > >> > > > >> This has a side effect of deferring the completion to a process context > > > >> for every AIO DIO that happens on inode that has pages mapped. However > > > >> since the consensus is that this is ill-advised practice the performance > > > >> implication should not be a problem. > > > >> > > > >> This was based on proposal from Jeff Moyer, thanks! 
> > > > It seems the invalidation can be also removed from > > > > generic_file_direct_write(), can't it? It is duplicit there the same way as > > > > it was in the iomap code... > > > > > > Yep, sure looks that way. > > > > Hrm, ok. Technically speaking generic_file_direct_write does not have to > > eventually end up with dio_complete() being called. This will change the > > behaviour for those that implement dio differently. Looking at the users > > now, vast majority will end up with complete_dio() so maybe this is not > > a problem. > > OK, so this seems to be the problem with 9p, fuse, nfs, lustre. > > > This is in contrast with iomap_dio_rw() which will end up calling > > iomap_dio_complete() so the situation is different there. > > > > Maybe adding mapping->nrpages check would be better than outright > > removing it ? > > OK, I agree we cannot just remove the invalidation. But shouldn't we rather > fix the above mentioned filesystems? Otherwise they will keep having issues > you are trying to fix? But for now I could live with keeping the > invalidation behind nrpages check and adding a comment why we kept it > there... Right, I'd rather have closure on this never-ending patch. The rest of the fs can be fixed later. -Lukas > > Honza > -- > Jan Kara <jack@suse.com> > SUSE Labs, CR ^ permalink raw reply [flat|nested] 40+ messages in thread
* [PATCH v7] fs: Fix page cache inconsistency when mixing buffered and AIO DIO 2017-08-10 12:59 ` [PATCH v6] " Lukas Czerner 2017-08-10 13:56 ` Jan Kara @ 2017-08-15 13:28 ` Lukas Czerner 2017-08-16 13:15 ` Jan Kara ` (4 more replies) 1 sibling, 5 replies; 40+ messages in thread From: Lukas Czerner @ 2017-08-15 13:28 UTC (permalink / raw) To: linux-fsdevel; +Cc: viro, jack, jmoyer, david, Lukas Czerner Currently, when mixing buffered reads and asynchronous direct writes it is possible to end up with the situation where we have stale data in the page cache while the new data is already written to disk. This is permanent until the affected pages are flushed away. Despite the fact that mixing buffered and direct IO is ill-advised, it does pose a threat to data integrity, is unexpected and should be fixed. Fix this by deferring completion of asynchronous direct writes to a process context in the case that there are mapped pages to be found in the inode. Later, before the completion in dio_complete(), invalidate the pages in question. This ensures that after the completion the pages in the written area are either unmapped, or populated with up-to-date data. Also do the same for the iomap case which uses iomap_dio_complete() instead. This has a side effect of deferring the completion to a process context for every AIO DIO that happens on an inode that has pages mapped. However, since the consensus is that this is an ill-advised practice, the performance implication should not be a problem. This was based on a proposal from Jeff Moyer, thanks! Signed-off-by: Lukas Czerner <lczerner@redhat.com> Cc: Jeff Moyer <jmoyer@redhat.com> --- v2: Remove leftover ret variable from invalidate call in iomap_dio_complete v3: Do not invalidate in case of error.
Add some comments v4: Remove unnecessary variable, remove unnecessary inner braces v5: Style changes v6: Remove redundant invalidatepage, add warning and comment v7: Run invalidation conditionally from generic_file_direct_write() fs/direct-io.c | 49 +++++++++++++++++++++++++++++++++++++++++++------ fs/iomap.c | 29 ++++++++++++++++------------- mm/filemap.c | 10 ++++++++-- 3 files changed, 67 insertions(+), 21 deletions(-) diff --git a/fs/direct-io.c b/fs/direct-io.c index 08cf278..ffb9e19 100644 --- a/fs/direct-io.c +++ b/fs/direct-io.c @@ -229,6 +229,7 @@ static ssize_t dio_complete(struct dio *dio, ssize_t ret, bool is_async) { loff_t offset = dio->iocb->ki_pos; ssize_t transferred = 0; + int err; /* * AIO submission can race with bio completion to get here while @@ -258,8 +259,22 @@ static ssize_t dio_complete(struct dio *dio, ssize_t ret, bool is_async) if (ret == 0) ret = transferred; + /* + * Try again to invalidate clean pages which might have been cached by + * non-direct readahead, or faulted in by get_user_pages() if the source + * of the write was an mmap'ed region of the file we're writing. Either + * one is a pretty crazy thing to do, so we don't support it 100%. If + * this invalidation fails, tough, the write still worked... + */ + if (ret > 0 && dio->op == REQ_OP_WRITE && + dio->inode->i_mapping->nrpages) { + err = invalidate_inode_pages2_range(dio->inode->i_mapping, + offset >> PAGE_SHIFT, + (offset + ret - 1) >> PAGE_SHIFT); + WARN_ON_ONCE(err); + } + if (dio->end_io) { - int err; // XXX: ki_pos??
err = dio->end_io(dio->iocb, offset, ret, dio->private); @@ -304,6 +319,7 @@ static void dio_bio_end_aio(struct bio *bio) struct dio *dio = bio->bi_private; unsigned long remaining; unsigned long flags; + bool defer_completion = false; /* cleanup the bio */ dio_bio_complete(dio, bio); @@ -315,7 +331,19 @@ static void dio_bio_end_aio(struct bio *bio) spin_unlock_irqrestore(&dio->bio_lock, flags); if (remaining == 0) { - if (dio->result && dio->defer_completion) { + /* + * Defer completion when defer_completion is set or + * when the inode has pages mapped and this is AIO write. + * We need to invalidate those pages because there is a + * chance they contain stale data in the case buffered IO + * went in between AIO submission and completion into the + * same region. + */ + if (dio->result) + defer_completion = dio->defer_completion || + (dio->op == REQ_OP_WRITE && + dio->inode->i_mapping->nrpages); + if (defer_completion) { INIT_WORK(&dio->complete_work, dio_aio_complete_work); queue_work(dio->inode->i_sb->s_dio_done_wq, &dio->complete_work); @@ -1210,10 +1238,19 @@ do_blockdev_direct_IO(struct kiocb *iocb, struct inode *inode, * For AIO O_(D)SYNC writes we need to defer completions to a workqueue * so that we can call ->fsync. */ - if (dio->is_async && iov_iter_rw(iter) == WRITE && - ((iocb->ki_filp->f_flags & O_DSYNC) || - IS_SYNC(iocb->ki_filp->f_mapping->host))) { - retval = dio_set_defer_completion(dio); + if (dio->is_async && iov_iter_rw(iter) == WRITE) { + retval = 0; + if ((iocb->ki_filp->f_flags & O_DSYNC) || + IS_SYNC(iocb->ki_filp->f_mapping->host)) + retval = dio_set_defer_completion(dio); + else if (!dio->inode->i_sb->s_dio_done_wq) { + /* + * In case of AIO write racing with buffered read we + * need to defer completion. We can't decide this now, + * however the workqueue needs to be initialized here. 
+ */ + retval = sb_init_dio_done_wq(dio->inode->i_sb); + } if (retval) { /* * We grab i_mutex only for reads so we don't have diff --git a/fs/iomap.c b/fs/iomap.c index 0392661..c3e299a 100644 --- a/fs/iomap.c +++ b/fs/iomap.c @@ -713,8 +713,24 @@ struct iomap_dio { static ssize_t iomap_dio_complete(struct iomap_dio *dio) { struct kiocb *iocb = dio->iocb; + struct inode *inode = file_inode(iocb->ki_filp); ssize_t ret; + /* + * Try again to invalidate clean pages which might have been cached by + * non-direct readahead, or faulted in by get_user_pages() if the source + * of the write was an mmap'ed region of the file we're writing. Either + * one is a pretty crazy thing to do, so we don't support it 100%. If + * this invalidation fails, tough, the write still worked... + */ + if (!dio->error && + (dio->flags & IOMAP_DIO_WRITE) && inode->i_mapping->nrpages) { + ret = invalidate_inode_pages2_range(inode->i_mapping, + iocb->ki_pos >> PAGE_SHIFT, + (iocb->ki_pos + dio->size - 1) >> PAGE_SHIFT); + WARN_ON_ONCE(ret); + } + if (dio->end_io) { ret = dio->end_io(iocb, dio->error ? dio->error : dio->size, @@ -1042,19 +1058,6 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter, ret = iomap_dio_complete(dio); - /* - * Try again to invalidate clean pages which might have been cached by - * non-direct readahead, or faulted in by get_user_pages() if the source - * of the write was an mmap'ed region of the file we're writing. Either - * one is a pretty crazy thing to do, so we don't support it 100%. If - * this invalidation fails, tough, the write still worked... - */ - if (iov_iter_rw(iter) == WRITE) { - int err = invalidate_inode_pages2_range(mapping, - start >> PAGE_SHIFT, end >> PAGE_SHIFT); - WARN_ON_ONCE(err); - } - return ret; out_free_dio: diff --git a/mm/filemap.c b/mm/filemap.c index a497024..9440e02 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -2885,9 +2885,15 @@ generic_file_direct_write(struct kiocb *iocb, struct iov_iter *from) * we're writing. 
Either one is a pretty crazy thing to do, * so we don't support it 100%. If this invalidation * fails, tough, the write still worked... + * + * Most of the time we do not need this since dio_complete() will do + * the invalidation for us. However there are some file systems that + * do not end up with dio_complete() being called, so let's not break + * them by removing it completely */ - invalidate_inode_pages2_range(mapping, - pos >> PAGE_SHIFT, end); + if (mapping->nrpages) + invalidate_inode_pages2_range(mapping, + pos >> PAGE_SHIFT, end); if (written > 0) { pos += written; -- 2.7.5 ^ permalink raw reply related [flat|nested] 40+ messages in thread
* Re: [PATCH v7] fs: Fix page cache inconsistency when mixing buffered and AIO DIO 2017-08-15 13:28 ` [PATCH v7] " Lukas Czerner @ 2017-08-16 13:15 ` Jan Kara 2017-08-16 16:01 ` Darrick J. Wong ` (3 subsequent siblings) 4 siblings, 0 replies; 40+ messages in thread From: Jan Kara @ 2017-08-16 13:15 UTC (permalink / raw) To: Lukas Czerner; +Cc: linux-fsdevel, viro, jack, jmoyer, david On Tue 15-08-17 15:28:54, Lukas Czerner wrote: > Currently when mixing buffered reads and asynchronous direct writes it > is possible to end up with the situation where we have stale data in the > page cache while the new data is already written to disk. This is > permanent until the affected pages are flushed away. Despite the fact > that mixing buffered and direct IO is ill-advised it does pose a thread > for a data integrity, is unexpected and should be fixed. > > Fix this by deferring completion of asynchronous direct writes to a > process context in the case that there are mapped pages to be found in > the inode. Later before the completion in dio_complete() invalidate > the pages in question. This ensures that after the completion the pages > in the written area are either unmapped, or populated with up-to-date > data. Also do the same for the iomap case which uses > iomap_dio_complete() instead. > > This has a side effect of deferring the completion to a process context > for every AIO DIO that happens on inode that has pages mapped. However > since the consensus is that this is ill-advised practice the performance > implication should not be a problem. > > This was based on proposal from Jeff Moyer, thanks! > > Signed-off-by: Lukas Czerner <lczerner@redhat.com> > Cc: Jeff Moyer <jmoyer@redhat.com> Looks good to me. You can add: Reviewed-by: Jan Kara <jack@suse.cz> Honza > --- > v2: Remove leftover ret variable from invalidate call in iomap_dio_complete > v3: Do not invalidate in case of error. 
Add some coments > v4: Remove unnecessary variable, remove unnecessary inner braces > v5: Style changes > v6: Remove redundant invalidatepage, add warning and comment > v7: Run invalidateion conditionally from generic_file_direct_write() > > fs/direct-io.c | 49 +++++++++++++++++++++++++++++++++++++++++++------ > fs/iomap.c | 29 ++++++++++++++++------------- > mm/filemap.c | 10 ++++++++-- > 3 files changed, 67 insertions(+), 21 deletions(-) > > diff --git a/fs/direct-io.c b/fs/direct-io.c > index 08cf278..ffb9e19 100644 > --- a/fs/direct-io.c > +++ b/fs/direct-io.c > @@ -229,6 +229,7 @@ static ssize_t dio_complete(struct dio *dio, ssize_t ret, bool is_async) > { > loff_t offset = dio->iocb->ki_pos; > ssize_t transferred = 0; > + int err; > > /* > * AIO submission can race with bio completion to get here while > @@ -258,8 +259,22 @@ static ssize_t dio_complete(struct dio *dio, ssize_t ret, bool is_async) > if (ret == 0) > ret = transferred; > > + /* > + * Try again to invalidate clean pages which might have been cached by > + * non-direct readahead, or faulted in by get_user_pages() if the source > + * of the write was an mmap'ed region of the file we're writing. Either > + * one is a pretty crazy thing to do, so we don't support it 100%. If > + * this invalidation fails, tough, the write still worked... > + */ > + if (ret > 0 && dio->op == REQ_OP_WRITE && > + dio->inode->i_mapping->nrpages) { > + err = invalidate_inode_pages2_range(dio->inode->i_mapping, > + offset >> PAGE_SHIFT, > + (offset + ret - 1) >> PAGE_SHIFT); > + WARN_ON_ONCE(err); > + } > + > if (dio->end_io) { > - int err; > > // XXX: ki_pos?? 
> err = dio->end_io(dio->iocb, offset, ret, dio->private); > @@ -304,6 +319,7 @@ static void dio_bio_end_aio(struct bio *bio) > struct dio *dio = bio->bi_private; > unsigned long remaining; > unsigned long flags; > + bool defer_completion = false; > > /* cleanup the bio */ > dio_bio_complete(dio, bio); > @@ -315,7 +331,19 @@ static void dio_bio_end_aio(struct bio *bio) > spin_unlock_irqrestore(&dio->bio_lock, flags); > > if (remaining == 0) { > - if (dio->result && dio->defer_completion) { > + /* > + * Defer completion when defer_completion is set or > + * when the inode has pages mapped and this is AIO write. > + * We need to invalidate those pages because there is a > + * chance they contain stale data in the case buffered IO > + * went in between AIO submission and completion into the > + * same region. > + */ > + if (dio->result) > + defer_completion = dio->defer_completion || > + (dio->op == REQ_OP_WRITE && > + dio->inode->i_mapping->nrpages); > + if (defer_completion) { > INIT_WORK(&dio->complete_work, dio_aio_complete_work); > queue_work(dio->inode->i_sb->s_dio_done_wq, > &dio->complete_work); > @@ -1210,10 +1238,19 @@ do_blockdev_direct_IO(struct kiocb *iocb, struct inode *inode, > * For AIO O_(D)SYNC writes we need to defer completions to a workqueue > * so that we can call ->fsync. > */ > - if (dio->is_async && iov_iter_rw(iter) == WRITE && > - ((iocb->ki_filp->f_flags & O_DSYNC) || > - IS_SYNC(iocb->ki_filp->f_mapping->host))) { > - retval = dio_set_defer_completion(dio); > + if (dio->is_async && iov_iter_rw(iter) == WRITE) { > + retval = 0; > + if ((iocb->ki_filp->f_flags & O_DSYNC) || > + IS_SYNC(iocb->ki_filp->f_mapping->host)) > + retval = dio_set_defer_completion(dio); > + else if (!dio->inode->i_sb->s_dio_done_wq) { > + /* > + * In case of AIO write racing with buffered read we > + * need to defer completion. We can't decide this now, > + * however the workqueue needs to be initialized here. 
> + */ > + retval = sb_init_dio_done_wq(dio->inode->i_sb); > + } > if (retval) { > /* > * We grab i_mutex only for reads so we don't have > diff --git a/fs/iomap.c b/fs/iomap.c > index 0392661..c3e299a 100644 > --- a/fs/iomap.c > +++ b/fs/iomap.c > @@ -713,8 +713,24 @@ struct iomap_dio { > static ssize_t iomap_dio_complete(struct iomap_dio *dio) > { > struct kiocb *iocb = dio->iocb; > + struct inode *inode = file_inode(iocb->ki_filp); > ssize_t ret; > > + /* > + * Try again to invalidate clean pages which might have been cached by > + * non-direct readahead, or faulted in by get_user_pages() if the source > + * of the write was an mmap'ed region of the file we're writing. Either > + * one is a pretty crazy thing to do, so we don't support it 100%. If > + * this invalidation fails, tough, the write still worked... > + */ > + if (!dio->error && > + (dio->flags & IOMAP_DIO_WRITE) && inode->i_mapping->nrpages) { > + ret = invalidate_inode_pages2_range(inode->i_mapping, > + iocb->ki_pos >> PAGE_SHIFT, > + (iocb->ki_pos + dio->size - 1) >> PAGE_SHIFT); > + WARN_ON_ONCE(ret); > + } > + > if (dio->end_io) { > ret = dio->end_io(iocb, > dio->error ? dio->error : dio->size, > @@ -1042,19 +1058,6 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter, > > ret = iomap_dio_complete(dio); > > - /* > - * Try again to invalidate clean pages which might have been cached by > - * non-direct readahead, or faulted in by get_user_pages() if the source > - * of the write was an mmap'ed region of the file we're writing. Either > - * one is a pretty crazy thing to do, so we don't support it 100%. If > - * this invalidation fails, tough, the write still worked... 
> - */ > - if (iov_iter_rw(iter) == WRITE) { > - int err = invalidate_inode_pages2_range(mapping, > - start >> PAGE_SHIFT, end >> PAGE_SHIFT); > - WARN_ON_ONCE(err); > - } > - > return ret; > > out_free_dio: > diff --git a/mm/filemap.c b/mm/filemap.c > index a497024..9440e02 100644 > --- a/mm/filemap.c > +++ b/mm/filemap.c > @@ -2885,9 +2885,15 @@ generic_file_direct_write(struct kiocb *iocb, struct iov_iter *from) > * we're writing. Either one is a pretty crazy thing to do, > * so we don't support it 100%. If this invalidation > * fails, tough, the write still worked... > + * > + * Most of the time we do not need this since dio_complete() will do > + * the invalidation for us. However there are some file systems that > + * do not end up with dio_complete() being called, so let's not break > + * them by removing it completely > */ > - invalidate_inode_pages2_range(mapping, > - pos >> PAGE_SHIFT, end); > + if (mapping->nrpages) > + invalidate_inode_pages2_range(mapping, > + pos >> PAGE_SHIFT, end); > > if (written > 0) { > pos += written; > -- > 2.7.5 > -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v7] fs: Fix page cache inconsistency when mixing buffered and AIO DIO 2017-08-15 13:28 ` [PATCH v7] " Lukas Czerner 2017-08-16 13:15 ` Jan Kara @ 2017-08-16 16:01 ` Darrick J. Wong 2017-09-21 13:44 ` Jeff Moyer ` (2 subsequent siblings) 4 siblings, 0 replies; 40+ messages in thread From: Darrick J. Wong @ 2017-08-16 16:01 UTC (permalink / raw) To: Lukas Czerner; +Cc: linux-fsdevel, viro, jack, jmoyer, david On Tue, Aug 15, 2017 at 03:28:54PM +0200, Lukas Czerner wrote: > Currently when mixing buffered reads and asynchronous direct writes it > is possible to end up with the situation where we have stale data in the > page cache while the new data is already written to disk. This is > permanent until the affected pages are flushed away. Despite the fact > that mixing buffered and direct IO is ill-advised it does pose a thread > for a data integrity, is unexpected and should be fixed. > > Fix this by deferring completion of asynchronous direct writes to a > process context in the case that there are mapped pages to be found in > the inode. Later before the completion in dio_complete() invalidate > the pages in question. This ensures that after the completion the pages > in the written area are either unmapped, or populated with up-to-date > data. Also do the same for the iomap case which uses > iomap_dio_complete() instead. > > This has a side effect of deferring the completion to a process context > for every AIO DIO that happens on inode that has pages mapped. However > since the consensus is that this is ill-advised practice the performance > implication should not be a problem. > > This was based on proposal from Jeff Moyer, thanks! > > Signed-off-by: Lukas Czerner <lczerner@redhat.com> > Cc: Jeff Moyer <jmoyer@redhat.com> Looks ok to me, Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> --D > --- > v2: Remove leftover ret variable from invalidate call in iomap_dio_complete > v3: Do not invalidate in case of error. 
Add some coments > v4: Remove unnecessary variable, remove unnecessary inner braces > v5: Style changes > v6: Remove redundant invalidatepage, add warning and comment > v7: Run invalidateion conditionally from generic_file_direct_write() > > fs/direct-io.c | 49 +++++++++++++++++++++++++++++++++++++++++++------ > fs/iomap.c | 29 ++++++++++++++++------------- > mm/filemap.c | 10 ++++++++-- > 3 files changed, 67 insertions(+), 21 deletions(-) > > diff --git a/fs/direct-io.c b/fs/direct-io.c > index 08cf278..ffb9e19 100644 > --- a/fs/direct-io.c > +++ b/fs/direct-io.c > @@ -229,6 +229,7 @@ static ssize_t dio_complete(struct dio *dio, ssize_t ret, bool is_async) > { > loff_t offset = dio->iocb->ki_pos; > ssize_t transferred = 0; > + int err; > > /* > * AIO submission can race with bio completion to get here while > @@ -258,8 +259,22 @@ static ssize_t dio_complete(struct dio *dio, ssize_t ret, bool is_async) > if (ret == 0) > ret = transferred; > > + /* > + * Try again to invalidate clean pages which might have been cached by > + * non-direct readahead, or faulted in by get_user_pages() if the source > + * of the write was an mmap'ed region of the file we're writing. Either > + * one is a pretty crazy thing to do, so we don't support it 100%. If > + * this invalidation fails, tough, the write still worked... > + */ > + if (ret > 0 && dio->op == REQ_OP_WRITE && > + dio->inode->i_mapping->nrpages) { > + err = invalidate_inode_pages2_range(dio->inode->i_mapping, > + offset >> PAGE_SHIFT, > + (offset + ret - 1) >> PAGE_SHIFT); > + WARN_ON_ONCE(err); > + } > + > if (dio->end_io) { > - int err; > > // XXX: ki_pos?? 
> err = dio->end_io(dio->iocb, offset, ret, dio->private); > @@ -304,6 +319,7 @@ static void dio_bio_end_aio(struct bio *bio) > struct dio *dio = bio->bi_private; > unsigned long remaining; > unsigned long flags; > + bool defer_completion = false; > > /* cleanup the bio */ > dio_bio_complete(dio, bio); > @@ -315,7 +331,19 @@ static void dio_bio_end_aio(struct bio *bio) > spin_unlock_irqrestore(&dio->bio_lock, flags); > > if (remaining == 0) { > - if (dio->result && dio->defer_completion) { > + /* > + * Defer completion when defer_completion is set or > + * when the inode has pages mapped and this is AIO write. > + * We need to invalidate those pages because there is a > + * chance they contain stale data in the case buffered IO > + * went in between AIO submission and completion into the > + * same region. > + */ > + if (dio->result) > + defer_completion = dio->defer_completion || > + (dio->op == REQ_OP_WRITE && > + dio->inode->i_mapping->nrpages); > + if (defer_completion) { > INIT_WORK(&dio->complete_work, dio_aio_complete_work); > queue_work(dio->inode->i_sb->s_dio_done_wq, > &dio->complete_work); > @@ -1210,10 +1238,19 @@ do_blockdev_direct_IO(struct kiocb *iocb, struct inode *inode, > * For AIO O_(D)SYNC writes we need to defer completions to a workqueue > * so that we can call ->fsync. > */ > - if (dio->is_async && iov_iter_rw(iter) == WRITE && > - ((iocb->ki_filp->f_flags & O_DSYNC) || > - IS_SYNC(iocb->ki_filp->f_mapping->host))) { > - retval = dio_set_defer_completion(dio); > + if (dio->is_async && iov_iter_rw(iter) == WRITE) { > + retval = 0; > + if ((iocb->ki_filp->f_flags & O_DSYNC) || > + IS_SYNC(iocb->ki_filp->f_mapping->host)) > + retval = dio_set_defer_completion(dio); > + else if (!dio->inode->i_sb->s_dio_done_wq) { > + /* > + * In case of AIO write racing with buffered read we > + * need to defer completion. We can't decide this now, > + * however the workqueue needs to be initialized here. 
> + */ > + retval = sb_init_dio_done_wq(dio->inode->i_sb); > + } > if (retval) { > /* > * We grab i_mutex only for reads so we don't have > diff --git a/fs/iomap.c b/fs/iomap.c > index 0392661..c3e299a 100644 > --- a/fs/iomap.c > +++ b/fs/iomap.c > @@ -713,8 +713,24 @@ struct iomap_dio { > static ssize_t iomap_dio_complete(struct iomap_dio *dio) > { > struct kiocb *iocb = dio->iocb; > + struct inode *inode = file_inode(iocb->ki_filp); > ssize_t ret; > > + /* > + * Try again to invalidate clean pages which might have been cached by > + * non-direct readahead, or faulted in by get_user_pages() if the source > + * of the write was an mmap'ed region of the file we're writing. Either > + * one is a pretty crazy thing to do, so we don't support it 100%. If > + * this invalidation fails, tough, the write still worked... > + */ > + if (!dio->error && > + (dio->flags & IOMAP_DIO_WRITE) && inode->i_mapping->nrpages) { > + ret = invalidate_inode_pages2_range(inode->i_mapping, > + iocb->ki_pos >> PAGE_SHIFT, > + (iocb->ki_pos + dio->size - 1) >> PAGE_SHIFT); > + WARN_ON_ONCE(ret); > + } > + > if (dio->end_io) { > ret = dio->end_io(iocb, > dio->error ? dio->error : dio->size, > @@ -1042,19 +1058,6 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter, > > ret = iomap_dio_complete(dio); > > - /* > - * Try again to invalidate clean pages which might have been cached by > - * non-direct readahead, or faulted in by get_user_pages() if the source > - * of the write was an mmap'ed region of the file we're writing. Either > - * one is a pretty crazy thing to do, so we don't support it 100%. If > - * this invalidation fails, tough, the write still worked... 
> - */ > - if (iov_iter_rw(iter) == WRITE) { > - int err = invalidate_inode_pages2_range(mapping, > - start >> PAGE_SHIFT, end >> PAGE_SHIFT); > - WARN_ON_ONCE(err); > - } > - > return ret; > > out_free_dio: > diff --git a/mm/filemap.c b/mm/filemap.c > index a497024..9440e02 100644 > --- a/mm/filemap.c > +++ b/mm/filemap.c > @@ -2885,9 +2885,15 @@ generic_file_direct_write(struct kiocb *iocb, struct iov_iter *from) > * we're writing. Either one is a pretty crazy thing to do, > * so we don't support it 100%. If this invalidation > * fails, tough, the write still worked... > + * > + * Most of the time we do not need this since dio_complete() will do > + * the invalidation for us. However there are some file systems that > + * do not end up with dio_complete() being called, so let's not break > + * them by removing it completely > */ > - invalidate_inode_pages2_range(mapping, > - pos >> PAGE_SHIFT, end); > + if (mapping->nrpages) > + invalidate_inode_pages2_range(mapping, > + pos >> PAGE_SHIFT, end); > > if (written > 0) { > pos += written; > -- > 2.7.5 > ^ permalink raw reply [flat|nested] 40+ messages in thread
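The index arithmetic in the dio_complete() hunk above can be checked with a small standalone userspace model (a sketch; `MODEL_PAGE_SHIFT`, the struct, and the function name are illustrative, not kernel API). The point it demonstrates: invalidate_inode_pages2_range() takes an inclusive range of page indices, so the last index must come from the last byte actually written, `offset + ret - 1`, not from `offset + ret`.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative stand-in for the kernel's PAGE_SHIFT (4 KiB pages). */
#define MODEL_PAGE_SHIFT 12

struct page_range { uint64_t first, last; };

/* Mirror of the computation in dio_complete(): the range handed to
 * invalidate_inode_pages2_range() is inclusive, so the last page index
 * is derived from the final written byte, offset + ret - 1. */
static struct page_range dio_invalidate_range(uint64_t offset, uint64_t ret)
{
	struct page_range r;
	r.first = offset >> MODEL_PAGE_SHIFT;
	r.last  = (offset + ret - 1) >> MODEL_PAGE_SHIFT;
	return r;
}
```

Note how a write that ends exactly on a page boundary (e.g. 4096 bytes at offset 0) invalidates only one page, while one extra byte pulls in the next page.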
* Re: [PATCH v7] fs: Fix page cache inconsistency when mixing buffered and AIO DIO 2017-08-15 13:28 ` [PATCH v7] " Lukas Czerner 2017-08-16 13:15 ` Jan Kara 2017-08-16 16:01 ` Darrick J. Wong @ 2017-09-21 13:44 ` Jeff Moyer 2017-09-21 13:44 ` Lukas Czerner 2017-10-10 14:34 ` David Sterba 4 siblings, 0 replies; 40+ messages in thread From: Jeff Moyer @ 2017-09-21 13:44 UTC (permalink / raw) To: Lukas Czerner; +Cc: linux-fsdevel, viro, jack, david Lukas Czerner <lczerner@redhat.com> writes: > Currently when mixing buffered reads and asynchronous direct writes it > is possible to end up with the situation where we have stale data in the > page cache while the new data is already written to disk. This is > permanent until the affected pages are flushed away. Despite the fact > that mixing buffered and direct IO is ill-advised, it does pose a threat > to data integrity, is unexpected, and should be fixed. > > Fix this by deferring completion of asynchronous direct writes to a > process context in the case that there are mapped pages to be found in > the inode. Later, before the completion in dio_complete(), invalidate > the pages in question. This ensures that after the completion the pages > in the written area are either unmapped, or populated with up-to-date > data. Also do the same for the iomap case, which uses > iomap_dio_complete() instead. > > This has a side effect of deferring the completion to a process context > for every AIO DIO that happens on an inode that has pages mapped. However, > since the consensus is that this is an ill-advised practice, the performance > implication should not be a problem. > > This was based on a proposal from Jeff Moyer, thanks! > > Signed-off-by: Lukas Czerner <lczerner@redhat.com> > Cc: Jeff Moyer <jmoyer@redhat.com> Is this still in limbo? Reviewed-by: Jeff Moyer <jmoyer@redhat.com> > --- > v2: Remove leftover ret variable from invalidate call in iomap_dio_complete > v3: Do not invalidate in case of error. 
Add some comments > v4: Remove unnecessary variable, remove unnecessary inner braces > v5: Style changes > v6: Remove redundant invalidatepage, add warning and comment > v7: Run invalidation conditionally from generic_file_direct_write() > > fs/direct-io.c | 49 +++++++++++++++++++++++++++++++++++++++++++------ > fs/iomap.c | 29 ++++++++++++++++------------- > mm/filemap.c | 10 ++++++++-- > 3 files changed, 67 insertions(+), 21 deletions(-) > > diff --git a/fs/direct-io.c b/fs/direct-io.c > index 08cf278..ffb9e19 100644 > --- a/fs/direct-io.c > +++ b/fs/direct-io.c > @@ -229,6 +229,7 @@ static ssize_t dio_complete(struct dio *dio, ssize_t ret, bool is_async) > { > loff_t offset = dio->iocb->ki_pos; > ssize_t transferred = 0; > + int err; > > /* > * AIO submission can race with bio completion to get here while > @@ -258,8 +259,22 @@ static ssize_t dio_complete(struct dio *dio, ssize_t ret, bool is_async) > if (ret == 0) > ret = transferred; > > + /* > + * Try again to invalidate clean pages which might have been cached by > + * non-direct readahead, or faulted in by get_user_pages() if the source > + * of the write was an mmap'ed region of the file we're writing. Either > + * one is a pretty crazy thing to do, so we don't support it 100%. If > + * this invalidation fails, tough, the write still worked... > + */ > + if (ret > 0 && dio->op == REQ_OP_WRITE && > + dio->inode->i_mapping->nrpages) { > + err = invalidate_inode_pages2_range(dio->inode->i_mapping, > + offset >> PAGE_SHIFT, > + (offset + ret - 1) >> PAGE_SHIFT); > + WARN_ON_ONCE(err); > + } > + > if (dio->end_io) { > - int err; > > // XXX: ki_pos?? 
> err = dio->end_io(dio->iocb, offset, ret, dio->private); > @@ -304,6 +319,7 @@ static void dio_bio_end_aio(struct bio *bio) > struct dio *dio = bio->bi_private; > unsigned long remaining; > unsigned long flags; > + bool defer_completion = false; > > /* cleanup the bio */ > dio_bio_complete(dio, bio); > @@ -315,7 +331,19 @@ static void dio_bio_end_aio(struct bio *bio) > spin_unlock_irqrestore(&dio->bio_lock, flags); > > if (remaining == 0) { > - if (dio->result && dio->defer_completion) { > + /* > + * Defer completion when defer_completion is set or > + * when the inode has pages mapped and this is AIO write. > + * We need to invalidate those pages because there is a > + * chance they contain stale data in the case buffered IO > + * went in between AIO submission and completion into the > + * same region. > + */ > + if (dio->result) > + defer_completion = dio->defer_completion || > + (dio->op == REQ_OP_WRITE && > + dio->inode->i_mapping->nrpages); > + if (defer_completion) { > INIT_WORK(&dio->complete_work, dio_aio_complete_work); > queue_work(dio->inode->i_sb->s_dio_done_wq, > &dio->complete_work); > @@ -1210,10 +1238,19 @@ do_blockdev_direct_IO(struct kiocb *iocb, struct inode *inode, > * For AIO O_(D)SYNC writes we need to defer completions to a workqueue > * so that we can call ->fsync. > */ > - if (dio->is_async && iov_iter_rw(iter) == WRITE && > - ((iocb->ki_filp->f_flags & O_DSYNC) || > - IS_SYNC(iocb->ki_filp->f_mapping->host))) { > - retval = dio_set_defer_completion(dio); > + if (dio->is_async && iov_iter_rw(iter) == WRITE) { > + retval = 0; > + if ((iocb->ki_filp->f_flags & O_DSYNC) || > + IS_SYNC(iocb->ki_filp->f_mapping->host)) > + retval = dio_set_defer_completion(dio); > + else if (!dio->inode->i_sb->s_dio_done_wq) { > + /* > + * In case of AIO write racing with buffered read we > + * need to defer completion. We can't decide this now, > + * however the workqueue needs to be initialized here. 
> + */ > + retval = sb_init_dio_done_wq(dio->inode->i_sb); > + } > if (retval) { > /* > * We grab i_mutex only for reads so we don't have > diff --git a/fs/iomap.c b/fs/iomap.c > index 0392661..c3e299a 100644 > --- a/fs/iomap.c > +++ b/fs/iomap.c > @@ -713,8 +713,24 @@ struct iomap_dio { > static ssize_t iomap_dio_complete(struct iomap_dio *dio) > { > struct kiocb *iocb = dio->iocb; > + struct inode *inode = file_inode(iocb->ki_filp); > ssize_t ret; > > + /* > + * Try again to invalidate clean pages which might have been cached by > + * non-direct readahead, or faulted in by get_user_pages() if the source > + * of the write was an mmap'ed region of the file we're writing. Either > + * one is a pretty crazy thing to do, so we don't support it 100%. If > + * this invalidation fails, tough, the write still worked... > + */ > + if (!dio->error && > + (dio->flags & IOMAP_DIO_WRITE) && inode->i_mapping->nrpages) { > + ret = invalidate_inode_pages2_range(inode->i_mapping, > + iocb->ki_pos >> PAGE_SHIFT, > + (iocb->ki_pos + dio->size - 1) >> PAGE_SHIFT); > + WARN_ON_ONCE(ret); > + } > + > if (dio->end_io) { > ret = dio->end_io(iocb, > dio->error ? dio->error : dio->size, > @@ -1042,19 +1058,6 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter, > > ret = iomap_dio_complete(dio); > > - /* > - * Try again to invalidate clean pages which might have been cached by > - * non-direct readahead, or faulted in by get_user_pages() if the source > - * of the write was an mmap'ed region of the file we're writing. Either > - * one is a pretty crazy thing to do, so we don't support it 100%. If > - * this invalidation fails, tough, the write still worked... 
> - */ > - if (iov_iter_rw(iter) == WRITE) { > - int err = invalidate_inode_pages2_range(mapping, > - start >> PAGE_SHIFT, end >> PAGE_SHIFT); > - WARN_ON_ONCE(err); > - } > - > return ret; > > out_free_dio: > diff --git a/mm/filemap.c b/mm/filemap.c > index a497024..9440e02 100644 > --- a/mm/filemap.c > +++ b/mm/filemap.c > @@ -2885,9 +2885,15 @@ generic_file_direct_write(struct kiocb *iocb, struct iov_iter *from) > * we're writing. Either one is a pretty crazy thing to do, > * so we don't support it 100%. If this invalidation > * fails, tough, the write still worked... > + * > + * Most of the time we do not need this since dio_complete() will do > + * the invalidation for us. However there are some file systems that > + * do not end up with dio_complete() being called, so let's not break > + * them by removing it completely > */ > - invalidate_inode_pages2_range(mapping, > - pos >> PAGE_SHIFT, end); > + if (mapping->nrpages) > + invalidate_inode_pages2_range(mapping, > + pos >> PAGE_SHIFT, end); > > if (written > 0) { > pos += written; ^ permalink raw reply [flat|nested] 40+ messages in thread
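The deferral condition added to dio_bio_end_aio() in the patch quoted above can be sketched as a pure function (a userspace model with made-up names mirroring the struct dio fields, not kernel code):

```c
#include <assert.h>
#include <stdbool.h>

enum model_op { MODEL_READ, MODEL_WRITE };

/* Model of the decision in dio_bio_end_aio(): when nothing was
 * transferred (result == 0) the iocb completes inline; otherwise
 * completion moves to the workqueue either because the filesystem asked
 * for it up front (defer_requested, e.g. for O_DSYNC) or because this is
 * a write into an inode whose mapping still holds pages, since those
 * pages may now contain stale data. */
static bool should_defer_completion(long result, bool defer_requested,
				    enum model_op op, unsigned long nrpages)
{
	if (result == 0)
		return false;
	return defer_requested || (op == MODEL_WRITE && nrpages != 0);
}
```

This makes the side effect the commit message mentions easy to see: every successful AIO write against an inode with cached pages is now deferred, even without O_DSYNC.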
* Re: [PATCH v7] fs: Fix page cache inconsistency when mixing buffered and AIO DIO 2017-08-15 13:28 ` [PATCH v7] " Lukas Czerner ` (2 preceding siblings ...) 2017-09-21 13:44 ` Jeff Moyer @ 2017-09-21 13:44 ` Lukas Czerner 2017-09-21 14:14 ` Jens Axboe 2017-10-10 14:34 ` David Sterba 4 siblings, 1 reply; 40+ messages in thread From: Lukas Czerner @ 2017-09-21 13:44 UTC (permalink / raw) To: linux-fsdevel; +Cc: viro, jack, jmoyer, david, Jens Axboe Al, Jens, can either of you please take this through your tree? Thanks! -Lukas On Tue, Aug 15, 2017 at 03:28:54PM +0200, Lukas Czerner wrote: > Currently when mixing buffered reads and asynchronous direct writes it > is possible to end up with the situation where we have stale data in the > page cache while the new data is already written to disk. This is > permanent until the affected pages are flushed away. Despite the fact > that mixing buffered and direct IO is ill-advised, it does pose a threat > to data integrity, is unexpected, and should be fixed. > > Fix this by deferring completion of asynchronous direct writes to a > process context in the case that there are mapped pages to be found in > the inode. Later, before the completion in dio_complete(), invalidate > the pages in question. This ensures that after the completion the pages > in the written area are either unmapped, or populated with up-to-date > data. Also do the same for the iomap case, which uses > iomap_dio_complete() instead. > > This has a side effect of deferring the completion to a process context > for every AIO DIO that happens on an inode that has pages mapped. However, > since the consensus is that this is an ill-advised practice, the performance > implication should not be a problem. > > This was based on a proposal from Jeff Moyer, thanks! > > Signed-off-by: Lukas Czerner <lczerner@redhat.com> > Cc: Jeff Moyer <jmoyer@redhat.com> > --- > v2: Remove leftover ret variable from invalidate call in iomap_dio_complete > v3: Do not invalidate in case of error. 
Add some comments > v4: Remove unnecessary variable, remove unnecessary inner braces > v5: Style changes > v6: Remove redundant invalidatepage, add warning and comment > v7: Run invalidation conditionally from generic_file_direct_write() > > fs/direct-io.c | 49 +++++++++++++++++++++++++++++++++++++++++++------ > fs/iomap.c | 29 ++++++++++++++++------------- > mm/filemap.c | 10 ++++++++-- > 3 files changed, 67 insertions(+), 21 deletions(-) > > diff --git a/fs/direct-io.c b/fs/direct-io.c > index 08cf278..ffb9e19 100644 > --- a/fs/direct-io.c > +++ b/fs/direct-io.c > @@ -229,6 +229,7 @@ static ssize_t dio_complete(struct dio *dio, ssize_t ret, bool is_async) > { > loff_t offset = dio->iocb->ki_pos; > ssize_t transferred = 0; > + int err; > > /* > * AIO submission can race with bio completion to get here while > @@ -258,8 +259,22 @@ static ssize_t dio_complete(struct dio *dio, ssize_t ret, bool is_async) > if (ret == 0) > ret = transferred; > > + /* > + * Try again to invalidate clean pages which might have been cached by > + * non-direct readahead, or faulted in by get_user_pages() if the source > + * of the write was an mmap'ed region of the file we're writing. Either > + * one is a pretty crazy thing to do, so we don't support it 100%. If > + * this invalidation fails, tough, the write still worked... > + */ > + if (ret > 0 && dio->op == REQ_OP_WRITE && > + dio->inode->i_mapping->nrpages) { > + err = invalidate_inode_pages2_range(dio->inode->i_mapping, > + offset >> PAGE_SHIFT, > + (offset + ret - 1) >> PAGE_SHIFT); > + WARN_ON_ONCE(err); > + } > + > if (dio->end_io) { > - int err; > > // XXX: ki_pos?? 
> err = dio->end_io(dio->iocb, offset, ret, dio->private); > @@ -304,6 +319,7 @@ static void dio_bio_end_aio(struct bio *bio) > struct dio *dio = bio->bi_private; > unsigned long remaining; > unsigned long flags; > + bool defer_completion = false; > > /* cleanup the bio */ > dio_bio_complete(dio, bio); > @@ -315,7 +331,19 @@ static void dio_bio_end_aio(struct bio *bio) > spin_unlock_irqrestore(&dio->bio_lock, flags); > > if (remaining == 0) { > - if (dio->result && dio->defer_completion) { > + /* > + * Defer completion when defer_completion is set or > + * when the inode has pages mapped and this is AIO write. > + * We need to invalidate those pages because there is a > + * chance they contain stale data in the case buffered IO > + * went in between AIO submission and completion into the > + * same region. > + */ > + if (dio->result) > + defer_completion = dio->defer_completion || > + (dio->op == REQ_OP_WRITE && > + dio->inode->i_mapping->nrpages); > + if (defer_completion) { > INIT_WORK(&dio->complete_work, dio_aio_complete_work); > queue_work(dio->inode->i_sb->s_dio_done_wq, > &dio->complete_work); > @@ -1210,10 +1238,19 @@ do_blockdev_direct_IO(struct kiocb *iocb, struct inode *inode, > * For AIO O_(D)SYNC writes we need to defer completions to a workqueue > * so that we can call ->fsync. > */ > - if (dio->is_async && iov_iter_rw(iter) == WRITE && > - ((iocb->ki_filp->f_flags & O_DSYNC) || > - IS_SYNC(iocb->ki_filp->f_mapping->host))) { > - retval = dio_set_defer_completion(dio); > + if (dio->is_async && iov_iter_rw(iter) == WRITE) { > + retval = 0; > + if ((iocb->ki_filp->f_flags & O_DSYNC) || > + IS_SYNC(iocb->ki_filp->f_mapping->host)) > + retval = dio_set_defer_completion(dio); > + else if (!dio->inode->i_sb->s_dio_done_wq) { > + /* > + * In case of AIO write racing with buffered read we > + * need to defer completion. We can't decide this now, > + * however the workqueue needs to be initialized here. 
> + */ > + retval = sb_init_dio_done_wq(dio->inode->i_sb); > + } > if (retval) { > /* > * We grab i_mutex only for reads so we don't have > diff --git a/fs/iomap.c b/fs/iomap.c > index 0392661..c3e299a 100644 > --- a/fs/iomap.c > +++ b/fs/iomap.c > @@ -713,8 +713,24 @@ struct iomap_dio { > static ssize_t iomap_dio_complete(struct iomap_dio *dio) > { > struct kiocb *iocb = dio->iocb; > + struct inode *inode = file_inode(iocb->ki_filp); > ssize_t ret; > > + /* > + * Try again to invalidate clean pages which might have been cached by > + * non-direct readahead, or faulted in by get_user_pages() if the source > + * of the write was an mmap'ed region of the file we're writing. Either > + * one is a pretty crazy thing to do, so we don't support it 100%. If > + * this invalidation fails, tough, the write still worked... > + */ > + if (!dio->error && > + (dio->flags & IOMAP_DIO_WRITE) && inode->i_mapping->nrpages) { > + ret = invalidate_inode_pages2_range(inode->i_mapping, > + iocb->ki_pos >> PAGE_SHIFT, > + (iocb->ki_pos + dio->size - 1) >> PAGE_SHIFT); > + WARN_ON_ONCE(ret); > + } > + > if (dio->end_io) { > ret = dio->end_io(iocb, > dio->error ? dio->error : dio->size, > @@ -1042,19 +1058,6 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter, > > ret = iomap_dio_complete(dio); > > - /* > - * Try again to invalidate clean pages which might have been cached by > - * non-direct readahead, or faulted in by get_user_pages() if the source > - * of the write was an mmap'ed region of the file we're writing. Either > - * one is a pretty crazy thing to do, so we don't support it 100%. If > - * this invalidation fails, tough, the write still worked... 
> - */ > - if (iov_iter_rw(iter) == WRITE) { > - int err = invalidate_inode_pages2_range(mapping, > - start >> PAGE_SHIFT, end >> PAGE_SHIFT); > - WARN_ON_ONCE(err); > - } > - > return ret; > > out_free_dio: > diff --git a/mm/filemap.c b/mm/filemap.c > index a497024..9440e02 100644 > --- a/mm/filemap.c > +++ b/mm/filemap.c > @@ -2885,9 +2885,15 @@ generic_file_direct_write(struct kiocb *iocb, struct iov_iter *from) > * we're writing. Either one is a pretty crazy thing to do, > * so we don't support it 100%. If this invalidation > * fails, tough, the write still worked... > + * > + * Most of the time we do not need this since dio_complete() will do > + * the invalidation for us. However there are some file systems that > + * do not end up with dio_complete() being called, so let's not break > + * them by removing it completely > */ > - invalidate_inode_pages2_range(mapping, > - pos >> PAGE_SHIFT, end); > + if (mapping->nrpages) > + invalidate_inode_pages2_range(mapping, > + pos >> PAGE_SHIFT, end); > > if (written > 0) { > pos += written; > -- > 2.7.5 > ^ permalink raw reply [flat|nested] 40+ messages in thread
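The reworked submission-time logic in do_blockdev_direct_IO() can likewise be sketched as a decision function (an illustrative model; the enum and names are invented for clarity). The key subtlety it captures: whether invalidation will be needed is only known at completion time, so a plain AIO write does not commit to deferral up front — it merely makes sure s_dio_done_wq exists in case completion has to be deferred later.

```c
#include <assert.h>
#include <stdbool.h>

enum setup_action {
	SETUP_NOTHING,          /* no special setup needed now */
	SETUP_DEFER_COMPLETION, /* O_DSYNC / sync inode: always defer */
	SETUP_INIT_WQ,          /* plain AIO write: just ensure the
	                         * workqueue exists for a possible
	                         * deferred completion later */
};

/* Model of the post-patch logic: only async writes need any setup;
 * sync-semantics writes must defer (to call ->fsync from process
 * context), and everything else only initializes the workqueue if it
 * is not already there. */
static enum setup_action aio_write_setup(bool is_async_write, bool o_dsync,
					 bool sync_inode, bool wq_ready)
{
	if (!is_async_write)
		return SETUP_NOTHING;
	if (o_dsync || sync_inode)
		return SETUP_DEFER_COMPLETION;
	if (!wq_ready)
		return SETUP_INIT_WQ;
	return SETUP_NOTHING;
}
```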
* Re: [PATCH v7] fs: Fix page cache inconsistency when mixing buffered and AIO DIO 2017-09-21 13:44 ` Lukas Czerner @ 2017-09-21 14:14 ` Jens Axboe 0 siblings, 0 replies; 40+ messages in thread From: Jens Axboe @ 2017-09-21 14:14 UTC (permalink / raw) To: Lukas Czerner, linux-fsdevel; +Cc: viro, jack, jmoyer, david On 09/21/2017 07:44 AM, Lukas Czerner wrote: > Al, Jens, > > can any of you please take this throught your tree ? I can take it, it's well reviewed at this point (to say the least). -- Jens Axboe ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v7] fs: Fix page cache inconsistency when mixing buffered and AIO DIO 2017-08-15 13:28 ` [PATCH v7] " Lukas Czerner ` (3 preceding siblings ...) 2017-09-21 13:44 ` Lukas Czerner @ 2017-10-10 14:34 ` David Sterba 2017-10-11 9:21 ` Lukas Czerner 4 siblings, 1 reply; 40+ messages in thread From: David Sterba @ 2017-10-10 14:34 UTC (permalink / raw) To: Lukas Czerner; +Cc: linux-fsdevel, viro, jack, jmoyer, david, bo.li.liu, clm On Tue, Aug 15, 2017 at 03:28:54PM +0200, Lukas Czerner wrote: > + /* > + * Try again to invalidate clean pages which might have been cached by > + * non-direct readahead, or faulted in by get_user_pages() if the source > + * of the write was an mmap'ed region of the file we're writing. Either > + * one is a pretty crazy thing to do, so we don't support it 100%. If > + * this invalidation fails, tough, the write still worked... > + */ > + if (ret > 0 && dio->op == REQ_OP_WRITE && > + dio->inode->i_mapping->nrpages) { > + err = invalidate_inode_pages2_range(dio->inode->i_mapping, > + offset >> PAGE_SHIFT, > + (offset + ret - 1) >> PAGE_SHIFT); > + WARN_ON_ONCE(err); fstests/btrfs/062 reports this: [ 6235.547298] ------------[ cut here ]------------ [ 6235.552098] WARNING: CPU: 7 PID: 24321 at fs/direct-io.c:274 dio_complete+0x16f/0x1f0 [ 6235.560858] Modules linked in: dm_flakey loop rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace sunrpc fscache af_packet br_netfilter bridge stp llc iscsi_ibft iscsi_boot_sysfs btrfs xor zstd_decompress zstd_compress i2c_algo_bit drm_kms_helper xxhash zlib_deflate raid6_pq syscopyarea sysfillrect sysimgblt fb_sys_fops ttm tg3 drm dm_mod dax ptp kvm_amd pps_core kvm libphy tpm_infineon mptctl shpchp k10temp tpm_tis tpm_tis_core button i2c_piix4 tpm pcspkr irqbypass acpi_cpufreq ext4 mbcache jbd2 sr_mod cdrom ohci_pci ehci_pci ohci_hcd mptsas ehci_hcd scsi_transport_sas ata_generic mptscsih serio_raw mptbase usbcore sata_svw pata_serverworks sg scsi_dh_rdac scsi_dh_emc scsi_dh_alua [ 
6235.560942] CPU: 7 PID: 24321 Comm: kworker/7:1 Not tainted 4.14.0-rc4-1.ge195904-vanilla+ #71 [ 6235.560944] Hardware name: empty empty/S3993, BIOS PAQEX0-3 02/24/2008 [ 6235.560950] Workqueue: dio/sdb6 dio_aio_complete_work [ 6235.560953] task: ffff894fe0bd8300 task.stack: ffffb45742f7c000 [ 6235.560957] RIP: 0010:dio_complete+0x16f/0x1f0 [ 6235.560959] RSP: 0018:ffffb45742f7fde8 EFLAGS: 00010286 [ 6235.560968] RAX: 00000000fffffff0 RBX: ffff894fd1e3a680 RCX: ffff894fe0bd8300 [ 6235.560970] RDX: 0000000000000000 RSI: 00000000000002b4 RDI: ffffffffaba438e9 [ 6235.560971] RBP: ffffb45742f7fe10 R08: 0000000000000000 R09: 0000000000000025 [ 6235.560973] R10: 0000000000000000 R11: 0000000000000000 R12: 000000000001e000 [ 6235.560974] R13: 000000000001e000 R14: 0000000000007000 R15: 0000000000000001 [ 6235.560977] FS: 0000000000000000(0000) GS:ffff894fefdc0000(0000) knlGS:0000000000000000 [ 6235.560978] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 6235.560980] CR2: 00007fe1e1dfb610 CR3: 0000000213ee7000 CR4: 00000000000006e0 [ 6235.560982] Call Trace: [ 6235.561075] dio_aio_complete_work+0x1c/0x20 [ 6235.561082] process_one_work+0x1d8/0x620 [ 6235.561085] ? process_one_work+0x14b/0x620 [ 6235.561092] worker_thread+0x4d/0x3c0 [ 6235.561097] ? trace_hardirqs_on+0xd/0x10 [ 6235.561105] kthread+0x152/0x190 [ 6235.561107] ? process_one_work+0x620/0x620 [ 6235.561111] ? kthread_create_on_node+0x40/0x40 [ 6235.561116] ? do_syscall_64+0x69/0x180 [ 6235.561122] ret_from_fork+0x2a/0x40 [ 6235.561131] Code: 48 83 bf 00 01 00 00 00 0f 84 37 ff ff ff 4b 8d 54 34 ff 4c 89 f6 48 c1 fe 0c 48 c1 fa 0c e8 49 82 f2 ff 85 c0 0f 84 1a ff ff ff <0f> ff e9 13 ff ff ff 48 81 c7 e0 00 00 00 be 09 00 00 00 e8 79 [ 6235.561179] ---[ end trace ba80cd81f19cb389 ]--- I've added Chris and Bo to CC if they have more to say about the specifics of dio and buffered writes as implemented in btrfs. ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v7] fs: Fix page cache inconsistency when mixing buffered and AIO DIO 2017-10-10 14:34 ` David Sterba @ 2017-10-11 9:21 ` Lukas Czerner 0 siblings, 0 replies; 40+ messages in thread From: Lukas Czerner @ 2017-10-11 9:21 UTC (permalink / raw) To: David Sterba; +Cc: linux-fsdevel, viro, jack, jmoyer, david, bo.li.liu, clm On Tue, Oct 10, 2017 at 04:34:45PM +0200, David Sterba wrote: > On Tue, Aug 15, 2017 at 03:28:54PM +0200, Lukas Czerner wrote: > > + /* > > + * Try again to invalidate clean pages which might have been cached by > > + * non-direct readahead, or faulted in by get_user_pages() if the source > > + * of the write was an mmap'ed region of the file we're writing. Either > > + * one is a pretty crazy thing to do, so we don't support it 100%. If > > + * this invalidation fails, tough, the write still worked... > > + */ > > + if (ret > 0 && dio->op == REQ_OP_WRITE && > > + dio->inode->i_mapping->nrpages) { > > + err = invalidate_inode_pages2_range(dio->inode->i_mapping, > > + offset >> PAGE_SHIFT, > > + (offset + ret - 1) >> PAGE_SHIFT); > > + WARN_ON_ONCE(err); > > fstests/btrfs/062 reports this: > > [ 6235.547298] ------------[ cut here ]------------ > [ 6235.552098] WARNING: CPU: 7 PID: 24321 at fs/direct-io.c:274 dio_complete+0x16f/0x1f0 > [ 6235.560858] Modules linked in: dm_flakey loop rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace sunrpc fscache af_packet br_netfilter bridge stp llc iscsi_ibft iscsi_boot_sysfs btrfs xor zstd_decompress zstd_compress i2c_algo_bit drm_kms_helper xxhash zlib_deflate raid6_pq syscopyarea sysfillrect sysimgblt fb_sys_fops ttm tg3 drm dm_mod dax ptp kvm_amd pps_core kvm libphy tpm_infineon mptctl shpchp k10temp tpm_tis tpm_tis_core button i2c_piix4 tpm pcspkr irqbypass acpi_cpufreq ext4 mbcache jbd2 sr_mod cdrom ohci_pci ehci_pci ohci_hcd mptsas ehci_hcd scsi_transport_sas ata_generic mptscsih serio_raw mptbase usbcore sata_svw pata_serverworks sg scsi_dh_rdac scsi_dh_emc scsi_dh_alua > [ 
6235.560942] CPU: 7 PID: 24321 Comm: kworker/7:1 Not tainted 4.14.0-rc4-1.ge195904-vanilla+ #71 > [ 6235.560944] Hardware name: empty empty/S3993, BIOS PAQEX0-3 02/24/2008 > [ 6235.560950] Workqueue: dio/sdb6 dio_aio_complete_work > [ 6235.560953] task: ffff894fe0bd8300 task.stack: ffffb45742f7c000 > [ 6235.560957] RIP: 0010:dio_complete+0x16f/0x1f0 > [ 6235.560959] RSP: 0018:ffffb45742f7fde8 EFLAGS: 00010286 > [ 6235.560968] RAX: 00000000fffffff0 RBX: ffff894fd1e3a680 RCX: ffff894fe0bd8300 > [ 6235.560970] RDX: 0000000000000000 RSI: 00000000000002b4 RDI: ffffffffaba438e9 > [ 6235.560971] RBP: ffffb45742f7fe10 R08: 0000000000000000 R09: 0000000000000025 > [ 6235.560973] R10: 0000000000000000 R11: 0000000000000000 R12: 000000000001e000 > [ 6235.560974] R13: 000000000001e000 R14: 0000000000007000 R15: 0000000000000001 > [ 6235.560977] FS: 0000000000000000(0000) GS:ffff894fefdc0000(0000) knlGS:0000000000000000 > [ 6235.560978] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [ 6235.560980] CR2: 00007fe1e1dfb610 CR3: 0000000213ee7000 CR4: 00000000000006e0 > [ 6235.560982] Call Trace: > [ 6235.561075] dio_aio_complete_work+0x1c/0x20 > [ 6235.561082] process_one_work+0x1d8/0x620 > [ 6235.561085] ? process_one_work+0x14b/0x620 > [ 6235.561092] worker_thread+0x4d/0x3c0 > [ 6235.561097] ? trace_hardirqs_on+0xd/0x10 > [ 6235.561105] kthread+0x152/0x190 > [ 6235.561107] ? process_one_work+0x620/0x620 > [ 6235.561111] ? kthread_create_on_node+0x40/0x40 > [ 6235.561116] ? do_syscall_64+0x69/0x180 > [ 6235.561122] ret_from_fork+0x2a/0x40 > [ 6235.561131] Code: 48 83 bf 00 01 00 00 00 0f 84 37 ff ff ff 4b 8d 54 34 ff 4c 89 f6 48 c1 fe 0c 48 c1 fa 0c e8 49 82 f2 ff 85 c0 0f 84 1a ff ff ff <0f> ff e9 13 ff ff ff 48 81 c7 e0 00 00 00 be 09 00 00 00 e8 79 > [ 6235.561179] ---[ end trace ba80cd81f19cb389 ]--- > > I've added Chris and Bo to CC if they have more to say about the specifics of > dio and buffered writes as implemented in btrfs. 
There are several places where we warn on invalidation failing. The question is why invalidation failed on btrfs in this test. -Lukas ^ permalink raw reply [flat|nested] 40+ messages in thread
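One way to think about why the WARN can fire, sketched as a heavily simplified model (the real invalidate_inode_pages2_range() also launders dirty pages, unmaps mapped ones, and has more failure modes; the bitmask encoding and names here are purely illustrative): a page that is still busy — e.g. dirty or under writeback, as concurrent buffered IO like btrfs/062 generates — cannot simply be dropped, so the call reports an error even though every droppable page was dropped.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_EBUSY 16

/* Pages encoded as bitmasks for compactness: bit i set in dirty_mask or
 * writeback_mask marks page i as dirty or under writeback.  Clean pages
 * are dropped; any busy page makes the whole call return an error,
 * which is what trips the WARN_ON_ONCE() in dio_complete(). */
static int model_invalidate_range(unsigned dirty_mask,
				  unsigned writeback_mask, int npages)
{
	int err = 0;
	for (int i = 0; i < npages; i++)
		if (((dirty_mask | writeback_mask) >> i) & 1)
			err = -MODEL_EBUSY; /* remember failure, keep going */
	return err;
}
```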
end of thread, other threads:[~2017-10-11 9:21 UTC | newest] Thread overview: 40+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2017-07-13 15:17 [PATCH] fs: Fix page cache inconsistency when mixing buffered and AIO DIO Lukas Czerner 2017-07-14 10:41 ` kbuild test robot 2017-07-14 13:40 ` Lukas Czerner 2017-07-14 15:40 ` [PATCH v2] " Lukas Czerner 2017-07-17 15:12 ` Jan Kara 2017-07-17 15:28 ` Lukas Czerner 2017-07-17 15:39 ` Jeff Moyer 2017-07-17 16:17 ` Jan Kara 2017-07-17 19:52 ` Jeff Moyer 2017-07-18 7:39 ` Lukas Czerner 2017-07-18 9:06 ` Jan Kara 2017-07-18 9:32 ` Lukas Czerner 2017-07-18 12:19 ` [PATCH v3] " Lukas Czerner 2017-07-18 13:44 ` Christoph Hellwig 2017-07-18 14:17 ` Jan Kara 2017-07-19 8:42 ` Lukas Czerner 2017-07-19 8:48 ` [PATCH v4] " Lukas Czerner 2017-07-19 9:26 ` Jan Kara 2017-07-19 11:01 ` Lukas Czerner 2017-07-19 11:28 ` [PATCH v5] " Lukas Czerner 2017-07-19 11:37 ` Jan Kara 2017-07-19 12:17 ` Jeff Moyer 2017-08-03 18:10 ` Jeff Moyer 2017-08-04 10:09 ` Dave Chinner 2017-08-07 15:52 ` Jeff Moyer 2017-08-08 8:41 ` Lukas Czerner 2017-08-10 12:59 ` [PATCH v6] " Lukas Czerner 2017-08-10 13:56 ` Jan Kara 2017-08-10 14:22 ` Jeff Moyer 2017-08-11 9:03 ` Lukas Czerner 2017-08-14 9:43 ` Jan Kara 2017-08-15 12:47 ` Lukas Czerner 2017-08-15 13:28 ` [PATCH v7] " Lukas Czerner 2017-08-16 13:15 ` Jan Kara 2017-08-16 16:01 ` Darrick J. Wong 2017-09-21 13:44 ` Jeff Moyer 2017-09-21 13:44 ` Lukas Czerner 2017-09-21 14:14 ` Jens Axboe 2017-10-10 14:34 ` David Sterba 2017-10-11 9:21 ` Lukas Czerner