* [PATCH] fs: Fix page cache inconsistency when mixing buffered and AIO DIO
@ 2017-07-13 15:17 Lukas Czerner
  2017-07-14 10:41 ` kbuild test robot
  2017-07-14 15:40 ` [PATCH v2] " Lukas Czerner
  0 siblings, 2 replies; 40+ messages in thread
From: Lukas Czerner @ 2017-07-13 15:17 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: viro, jack, Lukas Czerner, Jeff Moyer

Currently when mixing buffered reads and asynchronous direct writes it
is possible to end up with the situation where we have stale data in the
page cache while the new data is already written to disk. This is
permanent until the affected pages are flushed away. Despite the fact
that mixing buffered and direct IO is ill-advised, it does pose a threat
to data integrity, is unexpected and should be fixed.
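
To illustrate, the race can be approximated from userspace roughly as
follows (an illustrative, timing-dependent sketch only; it uses libaio,
the file name and sizes are arbitrary, error handling is omitted, and a
reliable reproducer would issue the buffered read from a second thread
while the AIO write is in flight):

#define _GNU_SOURCE
#include <fcntl.h>
#include <libaio.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define SZ 4096

int main(void)
{
	io_context_t ctx = 0;
	struct iocb cb;
	struct iocb *cbs[1] = { &cb };
	struct io_event ev;
	char old[SZ], cached[SZ];
	void *new;
	int bfd, dfd;

	/* Write old data through the page cache and push it to disk. */
	bfd = open("testfile", O_RDWR | O_CREAT, 0644);
	memset(old, 'O', SZ);
	pwrite(bfd, old, SZ, 0);
	fsync(bfd);

	/* Submit an asynchronous O_DIRECT write of new data. */
	dfd = open("testfile", O_RDWR | O_DIRECT);
	posix_memalign(&new, SZ, SZ);
	memset(new, 'N', SZ);
	io_setup(1, &ctx);
	io_prep_pwrite(&cb, dfd, new, SZ, 0);
	io_submit(ctx, 1, cbs);

	/* A buffered read while the AIO write is in flight can
	 * repopulate the page cache with the old data... */
	pread(bfd, cached, SZ, 0);

	io_getevents(ctx, 1, 1, &ev, NULL);

	/* ...so after completion this can still return 'O' from the
	 * page cache even though 'N' is already on disk. */
	pread(bfd, cached, SZ, 0);
	printf("first byte after AIO DIO write: %c\n", cached[0]);
	return 0;
}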

Fix this by deferring completion of asynchronous direct writes to a
process context in the case that there are mapped pages to be found in
the inode. Later, before the completion in dio_complete(), invalidate
the pages in question. This ensures that after the completion the pages
in the written area are either unmapped, or populated with up-to-date
data. Also do the same for the iomap case, which uses
iomap_dio_complete() instead.

This has a side effect of deferring the completion to a process context
for every AIO DIO that happens on an inode that has pages mapped.
However, since the consensus is that this is ill-advised practice, the
performance implication should not be a problem.

This was based on a proposal from Jeff Moyer, thanks!

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
---
 fs/direct-io.c | 31 ++++++++++++++++++++++++++-----
 fs/iomap.c     |  7 +++++++
 2 files changed, 33 insertions(+), 5 deletions(-)

diff --git a/fs/direct-io.c b/fs/direct-io.c
index 08cf278..2db9ada 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -258,6 +258,11 @@ static ssize_t dio_complete(struct dio *dio, ssize_t ret, bool is_async)
 	if (ret == 0)
 		ret = transferred;
 
+	if ((dio->op == REQ_OP_WRITE && dio->inode->i_mapping->nrpages))
+		invalidate_inode_pages2_range(dio->inode->i_mapping,
+					offset >> PAGE_SHIFT,
+					(offset + ret - 1) >> PAGE_SHIFT);
+
 	if (dio->end_io) {
 		int err;
 
@@ -304,6 +309,7 @@ static void dio_bio_end_aio(struct bio *bio)
 	struct dio *dio = bio->bi_private;
 	unsigned long remaining;
 	unsigned long flags;
+	bool defer_completion = false;
 
 	/* cleanup the bio */
 	dio_bio_complete(dio, bio);
@@ -315,7 +321,19 @@ static void dio_bio_end_aio(struct bio *bio)
 	spin_unlock_irqrestore(&dio->bio_lock, flags);
 
 	if (remaining == 0) {
-		if (dio->result && dio->defer_completion) {
+		/*
+		 * Defer completion when defer_completion is set or
+		 * when the inode has pages mapped and this is AIO write.
+		 * We need to invalidate those pages because there is a
+		 * chance they contain stale data in the case buffered IO
+		 * went in between AIO submission and completion into the
+		 * same region.
+		 */
+		if (dio->result)
+			defer_completion = dio->defer_completion ||
+					   (dio->op == REQ_OP_WRITE &&
+					    dio->inode->i_mapping->nrpages);
+		if (defer_completion) {
 			INIT_WORK(&dio->complete_work, dio_aio_complete_work);
 			queue_work(dio->inode->i_sb->s_dio_done_wq,
 				   &dio->complete_work);
@@ -1210,10 +1228,13 @@ do_blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
 	 * For AIO O_(D)SYNC writes we need to defer completions to a workqueue
 	 * so that we can call ->fsync.
 	 */
-	if (dio->is_async && iov_iter_rw(iter) == WRITE &&
-	    ((iocb->ki_filp->f_flags & O_DSYNC) ||
-	     IS_SYNC(iocb->ki_filp->f_mapping->host))) {
-		retval = dio_set_defer_completion(dio);
+	if (dio->is_async && iov_iter_rw(iter) == WRITE) {
+		retval = 0;
+		if ((iocb->ki_filp->f_flags & O_DSYNC) ||
+		    IS_SYNC(iocb->ki_filp->f_mapping->host))
+			retval = dio_set_defer_completion(dio);
+		else if (!dio->inode->i_sb->s_dio_done_wq)
+			retval = sb_init_dio_done_wq(dio->inode->i_sb);
 		if (retval) {
 			/*
 			 * We grab i_mutex only for reads so we don't have
diff --git a/fs/iomap.c b/fs/iomap.c
index 1732228..a1ad4ca 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -713,8 +713,15 @@ struct iomap_dio {
 static ssize_t iomap_dio_complete(struct iomap_dio *dio)
 {
 	struct kiocb *iocb = dio->iocb;
+	loff_t offset = iocb->ki_pos;
+	struct inode *inode = file_inode(iocb->ki_filp);
 	ssize_t ret;
 
+	if ((dio->flags & IOMAP_DIO_WRITE) && inode->i_mapping->nrpages)
+		invalidate_inode_pages2_range(inode->i_mapping,
+				offset >> PAGE_SHIFT,
+				(offset + dio->size + ret - 1) >> PAGE_SHIFT);
+
 	if (dio->end_io) {
 		ret = dio->end_io(iocb,
 				dio->error ? dio->error : dio->size,
-- 
2.7.5

* Re: [PATCH] fs: Fix page cache inconsistency when mixing buffered and AIO DIO
  2017-07-13 15:17 [PATCH] fs: Fix page cache inconsistency when mixing buffered and AIO DIO Lukas Czerner
@ 2017-07-14 10:41 ` kbuild test robot
  2017-07-14 13:40   ` Lukas Czerner
  2017-07-14 15:40 ` [PATCH v2] " Lukas Czerner
  1 sibling, 1 reply; 40+ messages in thread
From: kbuild test robot @ 2017-07-14 10:41 UTC (permalink / raw)
  To: Lukas Czerner
  Cc: kbuild-all, linux-fsdevel, viro, jack, Lukas Czerner, Jeff Moyer

Hi Lukas,

[auto build test WARNING on linus/master]
[also build test WARNING on v4.12 next-20170713]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Lukas-Czerner/fs-Fix-page-cache-inconsistency-when-mixing-buffered-and-AIO-DIO/20170714-181130
config: x86_64-randconfig-x010-201728 (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64 

Note: it may well be a FALSE warning. FWIW you are at least aware of it now.
http://gcc.gnu.org/wiki/Better_Uninitialized_Warnings

All warnings (new ones prefixed by >>):

   fs/iomap.c: In function 'iomap_dio_complete':
>> fs/iomap.c:629:25: warning: 'ret' may be used uninitialized in this function [-Wmaybe-uninitialized]
        (offset + dio->size + ret - 1) >> PAGE_SHIFT);
         ~~~~~~~~~~~~~~~~~~~^~~~~

vim +/ret +629 fs/iomap.c

   618	
   619	static ssize_t iomap_dio_complete(struct iomap_dio *dio)
   620	{
   621		struct kiocb *iocb = dio->iocb;
   622		loff_t offset = iocb->ki_pos;
   623		struct inode *inode = file_inode(iocb->ki_filp);
   624		ssize_t ret;
   625	
   626		if ((dio->flags & IOMAP_DIO_WRITE) && inode->i_mapping->nrpages)
   627			invalidate_inode_pages2_range(inode->i_mapping,
   628					offset >> PAGE_SHIFT,
 > 629					(offset + dio->size + ret - 1) >> PAGE_SHIFT);
   630	
   631		if (dio->end_io) {
   632			ret = dio->end_io(iocb,
   633					dio->error ? dio->error : dio->size,
   634					dio->flags);
   635		} else {
   636			ret = dio->error;
   637		}
   638	
   639		if (likely(!ret)) {
   640			ret = dio->size;
   641			/* check for short read */
   642			if (iocb->ki_pos + ret > dio->i_size &&
   643			    !(dio->flags & IOMAP_DIO_WRITE))
   644				ret = dio->i_size - iocb->ki_pos;
   645			iocb->ki_pos += ret;
   646		}
   647	
   648		inode_dio_end(file_inode(iocb->ki_filp));
   649		kfree(dio);
   650	
   651		return ret;
   652	}
   653	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

* Re: [PATCH] fs: Fix page cache inconsistency when mixing buffered and AIO DIO
  2017-07-14 10:41 ` kbuild test robot
@ 2017-07-14 13:40   ` Lukas Czerner
  0 siblings, 0 replies; 40+ messages in thread
From: Lukas Czerner @ 2017-07-14 13:40 UTC (permalink / raw)
  To: kbuild test robot; +Cc: kbuild-all, linux-fsdevel, viro, jack, Jeff Moyer

On Fri, Jul 14, 2017 at 06:41:52PM +0800, kbuild test robot wrote:
> Hi Lukas,
> 
> [auto build test WARNING on linus/master]
> [also build test WARNING on v4.12 next-20170713]
> [if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
> 
> url:    https://github.com/0day-ci/linux/commits/Lukas-Czerner/fs-Fix-page-cache-inconsistency-when-mixing-buffered-and-AIO-DIO/20170714-181130
> config: x86_64-randconfig-x010-201728 (attached as .config)
> compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
> reproduce:
>         # save the attached .config to linux build tree
>         make ARCH=x86_64 
> 
> Note: it may well be a FALSE warning. FWIW you are at least aware of it now.
> http://gcc.gnu.org/wiki/Better_Uninitialized_Warnings
> 
> All warnings (new ones prefixed by >>):
> 
>    fs/iomap.c: In function 'iomap_dio_complete':
> >> fs/iomap.c:629:25: warning: 'ret' may be used uninitialized in this function [-Wmaybe-uninitialized]
>         (offset + dio->size + ret - 1) >> PAGE_SHIFT);
>          ~~~~~~~~~~~~~~~~~~~^~~~~

Oops, right, this is obviously wrong.

Thanks!
-Lukas

> 
> vim +/ret +629 fs/iomap.c
> 
>    618	
>    619	static ssize_t iomap_dio_complete(struct iomap_dio *dio)
>    620	{
>    621		struct kiocb *iocb = dio->iocb;
>    622		loff_t offset = iocb->ki_pos;
>    623		struct inode *inode = file_inode(iocb->ki_filp);
>    624		ssize_t ret;
>    625	
>    626		if ((dio->flags & IOMAP_DIO_WRITE) && inode->i_mapping->nrpages)
>    627			invalidate_inode_pages2_range(inode->i_mapping,
>    628					offset >> PAGE_SHIFT,
>  > 629					(offset + dio->size + ret - 1) >> PAGE_SHIFT);
>    630	
>    631		if (dio->end_io) {
>    632			ret = dio->end_io(iocb,
>    633					dio->error ? dio->error : dio->size,
>    634					dio->flags);
>    635		} else {
>    636			ret = dio->error;
>    637		}
>    638	
>    639		if (likely(!ret)) {
>    640			ret = dio->size;
>    641			/* check for short read */
>    642			if (iocb->ki_pos + ret > dio->i_size &&
>    643			    !(dio->flags & IOMAP_DIO_WRITE))
>    644				ret = dio->i_size - iocb->ki_pos;
>    645			iocb->ki_pos += ret;
>    646		}
>    647	
>    648		inode_dio_end(file_inode(iocb->ki_filp));
>    649		kfree(dio);
>    650	
>    651		return ret;
>    652	}
>    653	
> 
> ---
> 0-DAY kernel test infrastructure                Open Source Technology Center
> https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

* [PATCH v2] fs: Fix page cache inconsistency when mixing buffered and AIO DIO
  2017-07-13 15:17 [PATCH] fs: Fix page cache inconsistency when mixing buffered and AIO DIO Lukas Czerner
  2017-07-14 10:41 ` kbuild test robot
@ 2017-07-14 15:40 ` Lukas Czerner
  2017-07-17 15:12   ` Jan Kara
  2017-07-18 12:19   ` [PATCH v3] " Lukas Czerner
  1 sibling, 2 replies; 40+ messages in thread
From: Lukas Czerner @ 2017-07-14 15:40 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: viro, jack, Lukas Czerner, Jeff Moyer

Currently when mixing buffered reads and asynchronous direct writes it
is possible to end up with the situation where we have stale data in the
page cache while the new data is already written to disk. This is
permanent until the affected pages are flushed away. Despite the fact
that mixing buffered and direct IO is ill-advised, it does pose a threat
to data integrity, is unexpected and should be fixed.

Fix this by deferring completion of asynchronous direct writes to a
process context in the case that there are mapped pages to be found in
the inode. Later, before the completion in dio_complete(), invalidate
the pages in question. This ensures that after the completion the pages
in the written area are either unmapped, or populated with up-to-date
data. Also do the same for the iomap case, which uses
iomap_dio_complete() instead.

This has a side effect of deferring the completion to a process context
for every AIO DIO that happens on an inode that has pages mapped.
However, since the consensus is that this is ill-advised practice, the
performance implication should not be a problem.

This was based on a proposal from Jeff Moyer, thanks!

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
---
v2: Remove leftover ret variable from invalidate call in iomap_dio_complete

 fs/direct-io.c | 31 ++++++++++++++++++++++++++-----
 fs/iomap.c     |  7 +++++++
 2 files changed, 33 insertions(+), 5 deletions(-)

diff --git a/fs/direct-io.c b/fs/direct-io.c
index 08cf278..2db9ada 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -258,6 +258,11 @@ static ssize_t dio_complete(struct dio *dio, ssize_t ret, bool is_async)
 	if (ret == 0)
 		ret = transferred;
 
+	if ((dio->op == REQ_OP_WRITE && dio->inode->i_mapping->nrpages))
+		invalidate_inode_pages2_range(dio->inode->i_mapping,
+					offset >> PAGE_SHIFT,
+					(offset + ret - 1) >> PAGE_SHIFT);
+
 	if (dio->end_io) {
 		int err;
 
@@ -304,6 +309,7 @@ static void dio_bio_end_aio(struct bio *bio)
 	struct dio *dio = bio->bi_private;
 	unsigned long remaining;
 	unsigned long flags;
+	bool defer_completion = false;
 
 	/* cleanup the bio */
 	dio_bio_complete(dio, bio);
@@ -315,7 +321,19 @@ static void dio_bio_end_aio(struct bio *bio)
 	spin_unlock_irqrestore(&dio->bio_lock, flags);
 
 	if (remaining == 0) {
-		if (dio->result && dio->defer_completion) {
+		/*
+		 * Defer completion when defer_completion is set or
+		 * when the inode has pages mapped and this is AIO write.
+		 * We need to invalidate those pages because there is a
+		 * chance they contain stale data in the case buffered IO
+		 * went in between AIO submission and completion into the
+		 * same region.
+		 */
+		if (dio->result)
+			defer_completion = dio->defer_completion ||
+					   (dio->op == REQ_OP_WRITE &&
+					    dio->inode->i_mapping->nrpages);
+		if (defer_completion) {
 			INIT_WORK(&dio->complete_work, dio_aio_complete_work);
 			queue_work(dio->inode->i_sb->s_dio_done_wq,
 				   &dio->complete_work);
@@ -1210,10 +1228,13 @@ do_blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
 	 * For AIO O_(D)SYNC writes we need to defer completions to a workqueue
 	 * so that we can call ->fsync.
 	 */
-	if (dio->is_async && iov_iter_rw(iter) == WRITE &&
-	    ((iocb->ki_filp->f_flags & O_DSYNC) ||
-	     IS_SYNC(iocb->ki_filp->f_mapping->host))) {
-		retval = dio_set_defer_completion(dio);
+	if (dio->is_async && iov_iter_rw(iter) == WRITE) {
+		retval = 0;
+		if ((iocb->ki_filp->f_flags & O_DSYNC) ||
+		    IS_SYNC(iocb->ki_filp->f_mapping->host))
+			retval = dio_set_defer_completion(dio);
+		else if (!dio->inode->i_sb->s_dio_done_wq)
+			retval = sb_init_dio_done_wq(dio->inode->i_sb);
 		if (retval) {
 			/*
 			 * We grab i_mutex only for reads so we don't have
diff --git a/fs/iomap.c b/fs/iomap.c
index 1732228..3baeed2 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -713,8 +713,15 @@ struct iomap_dio {
 static ssize_t iomap_dio_complete(struct iomap_dio *dio)
 {
 	struct kiocb *iocb = dio->iocb;
+	loff_t offset = iocb->ki_pos;
+	struct inode *inode = file_inode(iocb->ki_filp);
 	ssize_t ret;
 
+	if ((dio->flags & IOMAP_DIO_WRITE) && inode->i_mapping->nrpages)
+		invalidate_inode_pages2_range(inode->i_mapping,
+				offset >> PAGE_SHIFT,
+				(offset + dio->size - 1) >> PAGE_SHIFT);
+
 	if (dio->end_io) {
 		ret = dio->end_io(iocb,
 				dio->error ? dio->error : dio->size,
-- 
2.7.5

* Re: [PATCH v2] fs: Fix page cache inconsistency when mixing buffered and AIO DIO
  2017-07-14 15:40 ` [PATCH v2] " Lukas Czerner
@ 2017-07-17 15:12   ` Jan Kara
  2017-07-17 15:28     ` Lukas Czerner
  2017-07-18 12:19   ` [PATCH v3] " Lukas Czerner
  1 sibling, 1 reply; 40+ messages in thread
From: Jan Kara @ 2017-07-17 15:12 UTC (permalink / raw)
  To: Lukas Czerner; +Cc: linux-fsdevel, viro, jack, Jeff Moyer

On Fri 14-07-17 17:40:23, Lukas Czerner wrote:
> Currently when mixing buffered reads and asynchronous direct writes it
> is possible to end up with the situation where we have stale data in the
> page cache while the new data is already written to disk. This is
> permanent until the affected pages are flushed away. Despite the fact
> that mixing buffered and direct IO is ill-advised, it does pose a threat
> to data integrity, is unexpected and should be fixed.
> 
> Fix this by deferring completion of asynchronous direct writes to a
> process context in the case that there are mapped pages to be found in
> the inode. Later, before the completion in dio_complete(), invalidate
> the pages in question. This ensures that after the completion the pages
> in the written area are either unmapped, or populated with up-to-date
> data. Also do the same for the iomap case, which uses
> iomap_dio_complete() instead.
> 
> This has a side effect of deferring the completion to a process context
> for every AIO DIO that happens on an inode that has pages mapped.
> However, since the consensus is that this is ill-advised practice, the
> performance implication should not be a problem.
> 
> This was based on a proposal from Jeff Moyer, thanks!
> 
> Signed-off-by: Lukas Czerner <lczerner@redhat.com>
> Cc: Jeff Moyer <jmoyer@redhat.com>

OK, this looks like it could work. Some comments below.

>  fs/direct-io.c | 31 ++++++++++++++++++++++++++-----
>  fs/iomap.c     |  7 +++++++
>  2 files changed, 33 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/direct-io.c b/fs/direct-io.c
> index 08cf278..2db9ada 100644
> --- a/fs/direct-io.c
> +++ b/fs/direct-io.c
> @@ -258,6 +258,11 @@ static ssize_t dio_complete(struct dio *dio, ssize_t ret, bool is_async)
>  	if (ret == 0)
>  		ret = transferred;
>  
> +	if ((dio->op == REQ_OP_WRITE && dio->inode->i_mapping->nrpages))

Superfluous braces here... Also you should not call
invalidate_inode_pages2_range() in case of error I suppose.

> +		invalidate_inode_pages2_range(dio->inode->i_mapping,
> +					offset >> PAGE_SHIFT,
> +					(offset + ret - 1) >> PAGE_SHIFT);
> +
>  	if (dio->end_io) {
>  		int err;
>  
> @@ -304,6 +309,7 @@ static void dio_bio_end_aio(struct bio *bio)
>  	struct dio *dio = bio->bi_private;
>  	unsigned long remaining;
>  	unsigned long flags;
> +	bool defer_completion = false;
>  
>  	/* cleanup the bio */
>  	dio_bio_complete(dio, bio);
> @@ -315,7 +321,19 @@ static void dio_bio_end_aio(struct bio *bio)
>  	spin_unlock_irqrestore(&dio->bio_lock, flags);
>  
>  	if (remaining == 0) {
> -		if (dio->result && dio->defer_completion) {
> +		/*
> +		 * Defer completion when defer_completion is set or
> +		 * when the inode has pages mapped and this is AIO write.
> +		 * We need to invalidate those pages because there is a
> +		 * chance they contain stale data in the case buffered IO
> +		 * went in between AIO submission and completion into the
> +		 * same region.
> +		 */
> +		if (dio->result)
> +			defer_completion = dio->defer_completion ||
> +					   (dio->op == REQ_OP_WRITE &&
> +					    dio->inode->i_mapping->nrpages);
> +		if (defer_completion) {
>  			INIT_WORK(&dio->complete_work, dio_aio_complete_work);
>  			queue_work(dio->inode->i_sb->s_dio_done_wq,
>  				   &dio->complete_work);
> @@ -1210,10 +1228,13 @@ do_blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
>  	 * For AIO O_(D)SYNC writes we need to defer completions to a workqueue
>  	 * so that we can call ->fsync.
>  	 */
> -	if (dio->is_async && iov_iter_rw(iter) == WRITE &&
> -	    ((iocb->ki_filp->f_flags & O_DSYNC) ||
> -	     IS_SYNC(iocb->ki_filp->f_mapping->host))) {
> -		retval = dio_set_defer_completion(dio);
> +	if (dio->is_async && iov_iter_rw(iter) == WRITE) {
> +		retval = 0;
> +		if ((iocb->ki_filp->f_flags & O_DSYNC) ||
> +		    IS_SYNC(iocb->ki_filp->f_mapping->host))
> +			retval = dio_set_defer_completion(dio);
> +		else if (!dio->inode->i_sb->s_dio_done_wq)
> +			retval = sb_init_dio_done_wq(dio->inode->i_sb);

Please add a comment explaining why sb_init_dio_done_wq() is needed here.

>  		if (retval) {
>  			/*
>  			 * We grab i_mutex only for reads so we don't have
> diff --git a/fs/iomap.c b/fs/iomap.c
> index 1732228..3baeed2 100644
> --- a/fs/iomap.c
> +++ b/fs/iomap.c
> @@ -713,8 +713,15 @@ struct iomap_dio {
>  static ssize_t iomap_dio_complete(struct iomap_dio *dio)
>  {
>  	struct kiocb *iocb = dio->iocb;
> +	loff_t offset = iocb->ki_pos;
> +	struct inode *inode = file_inode(iocb->ki_filp);
>  	ssize_t ret;
>  
> +	if ((dio->flags & IOMAP_DIO_WRITE) && inode->i_mapping->nrpages)

Again I don't think you want to invalidate pages in case DIO failed with an
error...

> +		invalidate_inode_pages2_range(inode->i_mapping,
> +				offset >> PAGE_SHIFT,
> +				(offset + dio->size - 1) >> PAGE_SHIFT);
> +
>  	if (dio->end_io) {
>  		ret = dio->end_io(iocb,
>  				dio->error ? dio->error : dio->size,

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

* Re: [PATCH v2] fs: Fix page cache inconsistency when mixing buffered and AIO DIO
  2017-07-17 15:12   ` Jan Kara
@ 2017-07-17 15:28     ` Lukas Czerner
  2017-07-17 15:39       ` Jeff Moyer
  0 siblings, 1 reply; 40+ messages in thread
From: Lukas Czerner @ 2017-07-17 15:28 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, viro, Jeff Moyer

On Mon, Jul 17, 2017 at 05:12:28PM +0200, Jan Kara wrote:
> On Fri 14-07-17 17:40:23, Lukas Czerner wrote:
> > Currently when mixing buffered reads and asynchronous direct writes it
> > is possible to end up with the situation where we have stale data in the
> > page cache while the new data is already written to disk. This is
> > permanent until the affected pages are flushed away. Despite the fact
> > that mixing buffered and direct IO is ill-advised, it does pose a threat
> > to data integrity, is unexpected and should be fixed.
> > 
> > Fix this by deferring completion of asynchronous direct writes to a
> > process context in the case that there are mapped pages to be found in
> > the inode. Later, before the completion in dio_complete(), invalidate
> > the pages in question. This ensures that after the completion the pages
> > in the written area are either unmapped, or populated with up-to-date
> > data. Also do the same for the iomap case, which uses
> > iomap_dio_complete() instead.
> > 
> > This has a side effect of deferring the completion to a process context
> > for every AIO DIO that happens on an inode that has pages mapped.
> > However, since the consensus is that this is ill-advised practice, the
> > performance implication should not be a problem.
> > 
> > This was based on a proposal from Jeff Moyer, thanks!
> > 
> > Signed-off-by: Lukas Czerner <lczerner@redhat.com>
> > Cc: Jeff Moyer <jmoyer@redhat.com>
> 
> OK, this looks like it could work. Some comments below.
> 
> >  fs/direct-io.c | 31 ++++++++++++++++++++++++++-----
> >  fs/iomap.c     |  7 +++++++
> >  2 files changed, 33 insertions(+), 5 deletions(-)
> > 
> > diff --git a/fs/direct-io.c b/fs/direct-io.c
> > index 08cf278..2db9ada 100644
> > --- a/fs/direct-io.c
> > +++ b/fs/direct-io.c
> > @@ -258,6 +258,11 @@ static ssize_t dio_complete(struct dio *dio, ssize_t ret, bool is_async)
> >  	if (ret == 0)
> >  		ret = transferred;
> >  
> > +	if ((dio->op == REQ_OP_WRITE && dio->inode->i_mapping->nrpages))
> 
> Superfluous braces here... Also you should not call
> invalidate_inode_pages2_range() in case of error I suppose.

Sure, I'll fix the braces.

About the error case, is it not possible that some data has already been
written to the disk despite the error?

Thanks!
-Lukas

> 
> > +		invalidate_inode_pages2_range(dio->inode->i_mapping,
> > +					offset >> PAGE_SHIFT,
> > +					(offset + ret - 1) >> PAGE_SHIFT);
> > +
> >  	if (dio->end_io) {
> >  		int err;
> >  
> > @@ -304,6 +309,7 @@ static void dio_bio_end_aio(struct bio *bio)
> >  	struct dio *dio = bio->bi_private;
> >  	unsigned long remaining;
> >  	unsigned long flags;
> > +	bool defer_completion = false;
> >  
> >  	/* cleanup the bio */
> >  	dio_bio_complete(dio, bio);
> > @@ -315,7 +321,19 @@ static void dio_bio_end_aio(struct bio *bio)
> >  	spin_unlock_irqrestore(&dio->bio_lock, flags);
> >  
> >  	if (remaining == 0) {
> > -		if (dio->result && dio->defer_completion) {
> > +		/*
> > +		 * Defer completion when defer_completion is set or
> > +		 * when the inode has pages mapped and this is AIO write.
> > +		 * We need to invalidate those pages because there is a
> > +		 * chance they contain stale data in the case buffered IO
> > +		 * went in between AIO submission and completion into the
> > +		 * same region.
> > +		 */
> > +		if (dio->result)
> > +			defer_completion = dio->defer_completion ||
> > +					   (dio->op == REQ_OP_WRITE &&
> > +					    dio->inode->i_mapping->nrpages);
> > +		if (defer_completion) {
> >  			INIT_WORK(&dio->complete_work, dio_aio_complete_work);
> >  			queue_work(dio->inode->i_sb->s_dio_done_wq,
> >  				   &dio->complete_work);
> > @@ -1210,10 +1228,13 @@ do_blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
> >  	 * For AIO O_(D)SYNC writes we need to defer completions to a workqueue
> >  	 * so that we can call ->fsync.
> >  	 */
> > -	if (dio->is_async && iov_iter_rw(iter) == WRITE &&
> > -	    ((iocb->ki_filp->f_flags & O_DSYNC) ||
> > -	     IS_SYNC(iocb->ki_filp->f_mapping->host))) {
> > -		retval = dio_set_defer_completion(dio);
> > +	if (dio->is_async && iov_iter_rw(iter) == WRITE) {
> > +		retval = 0;
> > +		if ((iocb->ki_filp->f_flags & O_DSYNC) ||
> > +		    IS_SYNC(iocb->ki_filp->f_mapping->host))
> > +			retval = dio_set_defer_completion(dio);
> > +		else if (!dio->inode->i_sb->s_dio_done_wq)
> > +			retval = sb_init_dio_done_wq(dio->inode->i_sb);
> 
> Please add a comment explaining why sb_init_dio_done_wq() is needed here.

ok, thanks.

> 
> >  		if (retval) {
> >  			/*
> >  			 * We grab i_mutex only for reads so we don't have
> > diff --git a/fs/iomap.c b/fs/iomap.c
> > index 1732228..3baeed2 100644
> > --- a/fs/iomap.c
> > +++ b/fs/iomap.c
> > @@ -713,8 +713,15 @@ struct iomap_dio {
> >  static ssize_t iomap_dio_complete(struct iomap_dio *dio)
> >  {
> >  	struct kiocb *iocb = dio->iocb;
> > +	loff_t offset = iocb->ki_pos;
> > +	struct inode *inode = file_inode(iocb->ki_filp);
> >  	ssize_t ret;
> >  
> > +	if ((dio->flags & IOMAP_DIO_WRITE) && inode->i_mapping->nrpages)
> 
> Again I don't think you want to invalidate pages in case DIO failed with an
> error...
> 
> > +		invalidate_inode_pages2_range(inode->i_mapping,
> > +				offset >> PAGE_SHIFT,
> > +				(offset + dio->size - 1) >> PAGE_SHIFT);
> > +
> >  	if (dio->end_io) {
> >  		ret = dio->end_io(iocb,
> >  				dio->error ? dio->error : dio->size,
> 
> 								Honza
> 
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR

* Re: [PATCH v2] fs: Fix page cache inconsistency when mixing buffered and AIO DIO
  2017-07-17 15:28     ` Lukas Czerner
@ 2017-07-17 15:39       ` Jeff Moyer
  2017-07-17 16:17         ` Jan Kara
  2017-07-18  7:39         ` Lukas Czerner
  0 siblings, 2 replies; 40+ messages in thread
From: Jeff Moyer @ 2017-07-17 15:39 UTC (permalink / raw)
  To: Lukas Czerner, Jan Kara; +Cc: linux-fsdevel, viro

Lukas Czerner <lczerner@redhat.com> writes:

> About the error case, is it not possible that some data has already been
> written to the disk despite the error?

Yes, it's possible.  However, that data is in an inconsistent state, so
it shouldn't be read, anyway.

Now, in the non-async path, we do the invalidation unconditionally, so I
could go either way on this.  I don't think it's going to matter for
performance or data integrity.
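
For reference, the invalidation in the non-async path sits in
generic_file_direct_write(); a paraphrased sketch of the relevant lines
of mm/filemap.c from this era (not a verbatim copy, variable names may
differ):

	pos = iocb->ki_pos;
	end = (pos + write_len - 1) >> PAGE_SHIFT;

	written = mapping->a_ops->direct_IO(iocb, from);

	/*
	 * Try again to invalidate clean pages which might have been
	 * cached by non-direct readahead, or faulted in by
	 * get_user_pages() if the source of the write was an mmap'ed
	 * region of the file.  Note this runs regardless of whether
	 * the direct_IO call itself succeeded.
	 */
	invalidate_inode_pages2_range(mapping,
				pos >> PAGE_SHIFT, end);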

Cheers,
Jeff

* Re: [PATCH v2] fs: Fix page cache inconsistency when mixing buffered and AIO DIO
  2017-07-17 15:39       ` Jeff Moyer
@ 2017-07-17 16:17         ` Jan Kara
  2017-07-17 19:52           ` Jeff Moyer
  2017-07-18  7:39         ` Lukas Czerner
  1 sibling, 1 reply; 40+ messages in thread
From: Jan Kara @ 2017-07-17 16:17 UTC (permalink / raw)
  To: Jeff Moyer; +Cc: Lukas Czerner, Jan Kara, linux-fsdevel, viro

On Mon 17-07-17 11:39:09, Jeff Moyer wrote:
> Lukas Czerner <lczerner@redhat.com> writes:
> 
> > About the error case, is it not possible that some data has already been
> > written to the disk despite the error?
> 
> Yes, it's possible.  However, that data is in an inconsistent state, so
> it shouldn't be read, anyway.
> 
> Now, in the non-async path, we do the invalidation unconditionally, so I
> could go either way on this.  I don't think it's going to matter for
> performance or data integrity.

Well, at least 'ret' would be negative in the error case so arguments
passed to invalidate_inode_pages2_range() would be bogus if I'm reading the
code right...
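
To make that concrete: assuming 4K pages (PAGE_SHIFT = 12), offset = 8192
and a failed IO with ret = -EIO = -5, the call becomes

	invalidate_inode_pages2_range(mapping, 2, 1);

since offset >> 12 = 2 but (8192 - 5 - 1) >> 12 = 8186 >> 12 = 1, i.e.
the end of the range lands before its start.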

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

* Re: [PATCH v2] fs: Fix page cache inconsistency when mixing buffered and AIO DIO
  2017-07-17 16:17         ` Jan Kara
@ 2017-07-17 19:52           ` Jeff Moyer
  0 siblings, 0 replies; 40+ messages in thread
From: Jeff Moyer @ 2017-07-17 19:52 UTC (permalink / raw)
  To: Jan Kara; +Cc: Lukas Czerner, linux-fsdevel, viro

Jan Kara <jack@suse.cz> writes:

> On Mon 17-07-17 11:39:09, Jeff Moyer wrote:
>> Lukas Czerner <lczerner@redhat.com> writes:
>> 
>> > About the error case, is it not possible that some data has already been
>> > written to the disk despite the error?
>> 
>> Yes, it's possible.  However, that data is in an inconsistent state, so
>> it shouldn't be read, anyway.
>> 
>> Now, in the non-async path, we do the invalidation unconditionally, so I
>> could go either way on this.  I don't think it's going to matter for
>> performance or data integrity.
>
> Well, at least 'ret' would be negative in the error case so arguments
> passed to invalidate_inode_pages2_range() would be bogus if I'm reading the
> code right...

Ah, yes.  Sorry, I was commenting on the more general point.  You are
correct, in dio_complete, ret could be set to dio->page_errors or
dio->io_error.  So yes, that needs to be checked.

-Jeff

* Re: [PATCH v2] fs: Fix page cache inconsistency when mixing buffered and AIO DIO
  2017-07-17 15:39       ` Jeff Moyer
  2017-07-17 16:17         ` Jan Kara
@ 2017-07-18  7:39         ` Lukas Czerner
  2017-07-18  9:06           ` Jan Kara
  1 sibling, 1 reply; 40+ messages in thread
From: Lukas Czerner @ 2017-07-18  7:39 UTC (permalink / raw)
  To: Jeff Moyer; +Cc: Jan Kara, linux-fsdevel, viro

On Mon, Jul 17, 2017 at 11:39:09AM -0400, Jeff Moyer wrote:
> Lukas Czerner <lczerner@redhat.com> writes:
> 
> > About the error case, is it not possible that some data has already been
> > written to the disk despite the error?
> 
> Yes, it's possible.  However, that data is in an inconsistent state, so
> it shouldn't be read, anyway.

I think it can be read if we wrote into already allocated space.

> 
> Now, in the non-async path, we do the invalidation unconditionally, so I
> could go either way on this.  I don't think it's going to matter for
> performance or data integrity.

That's part of the reason why I did it unconditionally as well; however,
Jan is right that ret would be negative. The way to fix it would differ
depending on whether I am right about reading partially written data
from a failed AIO. We still want to invalidate in that case.

-Lukas

> 
> Cheers,
> Jeff

* Re: [PATCH v2] fs: Fix page cache inconsistency when mixing buffered and AIO DIO
  2017-07-18  7:39         ` Lukas Czerner
@ 2017-07-18  9:06           ` Jan Kara
  2017-07-18  9:32             ` Lukas Czerner
  0 siblings, 1 reply; 40+ messages in thread
From: Jan Kara @ 2017-07-18  9:06 UTC (permalink / raw)
  To: Lukas Czerner; +Cc: Jeff Moyer, Jan Kara, linux-fsdevel, viro

On Tue 18-07-17 09:39:35, Lukas Czerner wrote:
> On Mon, Jul 17, 2017 at 11:39:09AM -0400, Jeff Moyer wrote:
> > Lukas Czerner <lczerner@redhat.com> writes:
> > 
> > > About the error case, is it not possible that some data has already been
> > > written to the disk despite the error?
> > 
> > Yes, it's possible.  However, that data is in an inconsistent state, so
> > it shouldn't be read, anyway.
> 
> I think it can be read if we wrote into already allocated space.
> 
> > 
> > Now, in the non-async path, we do the invalidation unconditionally, so I
> > could go either way on this.  I don't think it's going to matter for
> > performance or data integrity.
> 
> > That's part of the reason why I did it unconditionally as well;
> > however, Jan is right that ret would be negative. The way to fix it
> > would differ depending on whether I am right about reading partially
> > written data from a failed AIO. We still want to invalidate in that
> > case.

Frankly, I don't think it really matters so I'd go for not invalidating
anything on error just out of philosophy: "There's something weird going
on, bail out as quickly as you can."

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

* Re: [PATCH v2] fs: Fix page cache inconsistency when mixing buffered and AIO DIO
  2017-07-18  9:06           ` Jan Kara
@ 2017-07-18  9:32             ` Lukas Czerner
  0 siblings, 0 replies; 40+ messages in thread
From: Lukas Czerner @ 2017-07-18  9:32 UTC (permalink / raw)
  To: Jan Kara; +Cc: Jeff Moyer, linux-fsdevel, viro

On Tue, Jul 18, 2017 at 11:06:26AM +0200, Jan Kara wrote:
> On Tue 18-07-17 09:39:35, Lukas Czerner wrote:
> > On Mon, Jul 17, 2017 at 11:39:09AM -0400, Jeff Moyer wrote:
> > > Lukas Czerner <lczerner@redhat.com> writes:
> > > 
> > > > About the error case, is it not possible that some data has already been
> > > > written to the disk despite the error?
> > > 
> > > Yes, it's possible.  However, that data is in an inconsistent state, so
> > > it shouldn't be read, anyway.
> > 
> > I think it can be read if we wrote into already allocated space.
> > 
> > > 
> > > Now, in the non-async path, we do the invalidation unconditionally, so I
> > > could go either way on this.  I don't think it's going to matter for
> > > performance or data integrity.
> > 
> > That's part of the reason why I did it unconditionally as well;
> > however, Jan is right that ret would be negative. The way to fix it
> > would differ depending on whether I am right about reading partially
> > written data from a failed AIO. We still want to invalidate in that
> > case.
> 
> Frankly, I don't think it really matters so I'd go for not invalidating
> anything on error just out of philosophy: "There's something weird going
> on, bail out as quickly as you can."
> 
> 								Honza

Fair enough, thanks!

-Lukas

> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR

* [PATCH v3] fs: Fix page cache inconsistency when mixing buffered and AIO DIO
  2017-07-14 15:40 ` [PATCH v2] " Lukas Czerner
  2017-07-17 15:12   ` Jan Kara
@ 2017-07-18 12:19   ` Lukas Czerner
  2017-07-18 13:44     ` Christoph Hellwig
                       ` (2 more replies)
  1 sibling, 3 replies; 40+ messages in thread
From: Lukas Czerner @ 2017-07-18 12:19 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: viro, jack, Lukas Czerner, Jeff Moyer

Currently when mixing buffered reads and asynchronous direct writes it
is possible to end up with the situation where we have stale data in the
page cache while the new data is already written to disk. This is
permanent until the affected pages are flushed away. Despite the fact
that mixing buffered and direct IO is ill-advised, it does pose a threat
to data integrity, is unexpected and should be fixed.

Fix this by deferring completion of asynchronous direct writes to a
process context in the case that there are mapped pages to be found in
the inode. Later, before the completion in dio_complete(), invalidate
the pages in question. This ensures that after the completion the pages
in the written area are either unmapped, or populated with up-to-date
data. Also do the same for the iomap case, which uses
iomap_dio_complete() instead.

This has a side effect of deferring the completion to a process context
for every AIO DIO that happens on an inode that has pages mapped.
However, since the consensus is that this is ill-advised practice, the
performance implication should not be a problem.

This was based on a proposal from Jeff Moyer, thanks!

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
---
v2: Remove leftover ret variable from invalidate call in iomap_dio_complete
v3: Do not invalidate in case of error. Add some comments

 fs/direct-io.c | 37 ++++++++++++++++++++++++++++++++-----
 fs/iomap.c     |  8 ++++++++
 2 files changed, 40 insertions(+), 5 deletions(-)

diff --git a/fs/direct-io.c b/fs/direct-io.c
index 08cf278..efd3246 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -258,6 +258,12 @@ static ssize_t dio_complete(struct dio *dio, ssize_t ret, bool is_async)
 	if (ret == 0)
 		ret = transferred;
 
+	if ((ret > 0) &&
+	    (dio->op == REQ_OP_WRITE && dio->inode->i_mapping->nrpages))
+		invalidate_inode_pages2_range(dio->inode->i_mapping,
+					offset >> PAGE_SHIFT,
+					(offset + ret - 1) >> PAGE_SHIFT);
+
 	if (dio->end_io) {
 		int err;
 
@@ -304,6 +310,7 @@ static void dio_bio_end_aio(struct bio *bio)
 	struct dio *dio = bio->bi_private;
 	unsigned long remaining;
 	unsigned long flags;
+	bool defer_completion = false;
 
 	/* cleanup the bio */
 	dio_bio_complete(dio, bio);
@@ -315,7 +322,19 @@ static void dio_bio_end_aio(struct bio *bio)
 	spin_unlock_irqrestore(&dio->bio_lock, flags);
 
 	if (remaining == 0) {
-		if (dio->result && dio->defer_completion) {
+		/*
+		 * Defer completion when defer_completion is set or
+		 * when the inode has pages mapped and this is AIO write.
+		 * We need to invalidate those pages because there is a
+		 * chance they contain stale data in the case buffered IO
+		 * went in between AIO submission and completion into the
+		 * same region.
+		 */
+		if (dio->result)
+			defer_completion = dio->defer_completion ||
+					   (dio->op == REQ_OP_WRITE &&
+					    dio->inode->i_mapping->nrpages);
+		if (defer_completion) {
 			INIT_WORK(&dio->complete_work, dio_aio_complete_work);
 			queue_work(dio->inode->i_sb->s_dio_done_wq,
 				   &dio->complete_work);
@@ -1210,10 +1229,18 @@ do_blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
 	 * For AIO O_(D)SYNC writes we need to defer completions to a workqueue
 	 * so that we can call ->fsync.
 	 */
-	if (dio->is_async && iov_iter_rw(iter) == WRITE &&
-	    ((iocb->ki_filp->f_flags & O_DSYNC) ||
-	     IS_SYNC(iocb->ki_filp->f_mapping->host))) {
-		retval = dio_set_defer_completion(dio);
+	if (dio->is_async && iov_iter_rw(iter) == WRITE) {
+		retval = 0;
+		if ((iocb->ki_filp->f_flags & O_DSYNC) ||
+		    IS_SYNC(iocb->ki_filp->f_mapping->host))
+			retval = dio_set_defer_completion(dio);
+		else if (!dio->inode->i_sb->s_dio_done_wq)
+			/*
+			 * In case of AIO write racing with buffered read we
+			 * need to defer completion. We can't decide this now,
+			 * however the workqueue needs to be initialized here.
+			 */
+			retval = sb_init_dio_done_wq(dio->inode->i_sb);
 		if (retval) {
 			/*
 			 * We grab i_mutex only for reads so we don't have
diff --git a/fs/iomap.c b/fs/iomap.c
index 1732228..2f8dbf9 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -713,8 +713,16 @@ struct iomap_dio {
 static ssize_t iomap_dio_complete(struct iomap_dio *dio)
 {
 	struct kiocb *iocb = dio->iocb;
+	loff_t offset = iocb->ki_pos;
+	struct inode *inode = file_inode(iocb->ki_filp);
 	ssize_t ret;
 
+	if ((!dio->error) &&
+	    (dio->flags & IOMAP_DIO_WRITE) && inode->i_mapping->nrpages)
+		invalidate_inode_pages2_range(inode->i_mapping,
+				offset >> PAGE_SHIFT,
+				(offset + dio->size - 1) >> PAGE_SHIFT);
+
 	if (dio->end_io) {
 		ret = dio->end_io(iocb,
 				dio->error ? dio->error : dio->size,
-- 
2.7.5

* Re: [PATCH v3] fs: Fix page cache inconsistency when mixing buffered and AIO DIO
  2017-07-18 12:19   ` [PATCH v3] " Lukas Czerner
@ 2017-07-18 13:44     ` Christoph Hellwig
  2017-07-18 14:17       ` Jan Kara
  2017-07-19  8:42       ` Lukas Czerner
  2017-07-19  8:48     ` [PATCH v4] " Lukas Czerner
  2017-07-19 11:28     ` [PATCH v5] " Lukas Czerner
  2 siblings, 2 replies; 40+ messages in thread
From: Christoph Hellwig @ 2017-07-18 13:44 UTC (permalink / raw)
  To: Lukas Czerner; +Cc: linux-fsdevel, viro, jack, Jeff Moyer

> +	if ((ret > 0) &&

No need for the braces here.

> +	if (dio->is_async && iov_iter_rw(iter) == WRITE) {
> +		retval = 0;
> +		if ((iocb->ki_filp->f_flags & O_DSYNC) ||
> +		    IS_SYNC(iocb->ki_filp->f_mapping->host))
> +			retval = dio_set_defer_completion(dio);
> +		else if (!dio->inode->i_sb->s_dio_done_wq)
> +			/*
> +			 * In case of AIO write racing with buffered read we
> +			 * need to defer completion. We can't decide this now,
> +			 * however the workqueue needs to be initialized here.
> +			 */
> +			retval = sb_init_dio_done_wq(dio->inode->i_sb);

So now we initialize the workqueue on the first AIO write.  Maybe we
should just always initialize it?  Especially given that a workqueue
is rather cheap.  I also don't really understand why we even need the
workqueue per-superblock instead of a global one.

> index 1732228..2f8dbf9 100644
> --- a/fs/iomap.c
> +++ b/fs/iomap.c
> @@ -713,8 +713,16 @@ struct iomap_dio {
>  static ssize_t iomap_dio_complete(struct iomap_dio *dio)
>  {
>  	struct kiocb *iocb = dio->iocb;
> +	loff_t offset = iocb->ki_pos;

If you introduce this variable please also use it later in the function
instead of iocb->ki_pos.  OR remove the variable, which would be fine
with me as well.

> +	struct inode *inode = file_inode(iocb->ki_filp);
>  	ssize_t ret;
>  
> +	if ((!dio->error) &&

no need for the inner braces.

* Re: [PATCH v3] fs: Fix page cache inconsistency when mixing buffered and AIO DIO
  2017-07-18 13:44     ` Christoph Hellwig
@ 2017-07-18 14:17       ` Jan Kara
  2017-07-19  8:42       ` Lukas Czerner
  1 sibling, 0 replies; 40+ messages in thread
From: Jan Kara @ 2017-07-18 14:17 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Lukas Czerner, linux-fsdevel, viro, jack, Jeff Moyer

On Tue 18-07-17 06:44:20, Christoph Hellwig wrote:
> > +	if (dio->is_async && iov_iter_rw(iter) == WRITE) {
> > +		retval = 0;
> > +		if ((iocb->ki_filp->f_flags & O_DSYNC) ||
> > +		    IS_SYNC(iocb->ki_filp->f_mapping->host))
> > +			retval = dio_set_defer_completion(dio);
> > +		else if (!dio->inode->i_sb->s_dio_done_wq)
> > +			/*
> > +			 * In case of AIO write racing with buffered read we
> > +			 * need to defer completion. We can't decide this now,
> > +			 * however the workqueue needs to be initialized here.
> > +			 */
> > +			retval = sb_init_dio_done_wq(dio->inode->i_sb);
> 
> So now we initialize the workqueue on the first AIO write.  Maybe we
> should just always initialize it?  Especially given that a workqueue
> is rather cheap.  I also don't really understand why we even need the
> workqueue per-superblock instead of a global one.

So the workqueue is WQ_MEM_RECLAIM, which means there will always be a
"rescue" worker running. Not that it would make the workqueue too
expensive, but it is not zero cost either. So saving the cost for
filesystems that don't support AIO DIO makes sense to me.

Regarding creating a global workqueue - it would create IO completion
dependencies between filesystems, which could have unwanted side effects
and possibly create deadlocks. The default parallelism of workqueues
would mostly hide this, but I'm not sure there won't be some corner
case, e.g. when memory is tight...
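
For reference, the per-superblock helper under discussion looked roughly
like this at the time (a from-memory sketch of sb_init_dio_done_wq() in
fs/direct-io.c, not guaranteed verbatim):

int sb_init_dio_done_wq(struct super_block *sb)
{
	struct workqueue_struct *old;
	struct workqueue_struct *wq = alloc_workqueue("dio/%s",
						      WQ_MEM_RECLAIM, 0,
						      sb->s_id);
	if (!wq)
		return -ENOMEM;
	/*
	 * Publish atomically; several DIOs can race to create the
	 * workqueue, and only one may survive.
	 */
	old = cmpxchg(&sb->s_dio_done_wq, NULL, wq);
	if (old)
		destroy_workqueue(wq);	/* lost the race */
	return 0;
}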

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

* Re: [PATCH v3] fs: Fix page cache inconsistency when mixing buffered and AIO DIO
  2017-07-18 13:44     ` Christoph Hellwig
  2017-07-18 14:17       ` Jan Kara
@ 2017-07-19  8:42       ` Lukas Czerner
  1 sibling, 0 replies; 40+ messages in thread
From: Lukas Czerner @ 2017-07-19  8:42 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-fsdevel, viro, jack, Jeff Moyer

On Tue, Jul 18, 2017 at 06:44:20AM -0700, Christoph Hellwig wrote:
> ->inode->i_sb);
> 
> So now we initialize the workqueue on the first AIO write.  Maybe we
> should just always initialize it?  Especially given that a workqueue
> is rather cheap.  I also don't really understand why we even need the
> workqueue per-superblock instead of a global one.

As Jan mentioned, and I agree, it's worth initializing it only when
actually needed.

> 
> > index 1732228..2f8dbf9 100644
> > --- a/fs/iomap.c
> > +++ b/fs/iomap.c
> > @@ -713,8 +713,16 @@ struct iomap_dio {
> >  static ssize_t iomap_dio_complete(struct iomap_dio *dio)
> >  {
> >  	struct kiocb *iocb = dio->iocb;
> > +	loff_t offset = iocb->ki_pos;
> 
> If you introduce this variable please also use it later in the function
> instead of iocb->ki_pos.  OR remove the variable, which would be fine
> with me as well.

Right, I did not use it later in the function because it would be
confusing (we're changing iocb->ki_pos). So I'll just remove the
variable.

> 
> > +	struct inode *inode = file_inode(iocb->ki_filp);
> >  	ssize_t ret;
> >  
> > +	if ((!dio->error) &&
> 
> no need for the inner braces.

ok


Thanks!
-Lukas

* [PATCH v4] fs: Fix page cache inconsistency when mixing buffered and AIO DIO
  2017-07-18 12:19   ` [PATCH v3] " Lukas Czerner
  2017-07-18 13:44     ` Christoph Hellwig
@ 2017-07-19  8:48     ` Lukas Czerner
  2017-07-19  9:26       ` Jan Kara
  2017-07-19 11:28     ` [PATCH v5] " Lukas Czerner
  2 siblings, 1 reply; 40+ messages in thread
From: Lukas Czerner @ 2017-07-19  8:48 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: viro, jack, hch, Lukas Czerner, Jeff Moyer

Currently when mixing buffered reads and asynchronous direct writes it
is possible to end up with the situation where we have stale data in the
page cache while the new data is already written to disk. This is
permanent until the affected pages are flushed away. Despite the fact
that mixing buffered and direct IO is ill-advised, it does pose a threat
to data integrity, is unexpected and should be fixed.

Fix this by deferring completion of asynchronous direct writes to a
process context in the case that there are mapped pages to be found in
the inode. Later, before the completion in dio_complete(), invalidate
the pages in question. This ensures that after the completion the pages
in the written area are either unmapped, or populated with up-to-date
data. Also do the same for the iomap case, which uses
iomap_dio_complete() instead.

This has a side effect of deferring the completion to a process context
for every AIO DIO that happens on an inode that has pages mapped.
However, since the consensus is that this is ill-advised practice, the
performance implication should not be a problem.

This was based on a proposal from Jeff Moyer, thanks!

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
---
v2: Remove leftover ret variable from invalidate call in iomap_dio_complete
v3: Do not invalidate in case of error. Add some comments
v4: Remove unnecessary variable, remove unnecessary inner braces

 fs/direct-io.c | 37 ++++++++++++++++++++++++++++++++-----
 fs/iomap.c     |  7 +++++++
 2 files changed, 39 insertions(+), 5 deletions(-)

diff --git a/fs/direct-io.c b/fs/direct-io.c
index 08cf278..efd3246 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -258,6 +258,12 @@ static ssize_t dio_complete(struct dio *dio, ssize_t ret, bool is_async)
 	if (ret == 0)
 		ret = transferred;
 
+	if ((ret > 0) &&
+	    (dio->op == REQ_OP_WRITE && dio->inode->i_mapping->nrpages))
+		invalidate_inode_pages2_range(dio->inode->i_mapping,
+					offset >> PAGE_SHIFT,
+					(offset + ret - 1) >> PAGE_SHIFT);
+
 	if (dio->end_io) {
 		int err;
 
@@ -304,6 +310,7 @@ static void dio_bio_end_aio(struct bio *bio)
 	struct dio *dio = bio->bi_private;
 	unsigned long remaining;
 	unsigned long flags;
+	bool defer_completion = false;
 
 	/* cleanup the bio */
 	dio_bio_complete(dio, bio);
@@ -315,7 +322,19 @@ static void dio_bio_end_aio(struct bio *bio)
 	spin_unlock_irqrestore(&dio->bio_lock, flags);
 
 	if (remaining == 0) {
-		if (dio->result && dio->defer_completion) {
+		/*
+		 * Defer completion when defer_completion is set or
+		 * when the inode has pages mapped and this is AIO write.
+		 * We need to invalidate those pages because there is a
+		 * chance they contain stale data in the case buffered IO
+		 * went in between AIO submission and completion into the
+		 * same region.
+		 */
+		if (dio->result)
+			defer_completion = dio->defer_completion ||
+					   (dio->op == REQ_OP_WRITE &&
+					    dio->inode->i_mapping->nrpages);
+		if (defer_completion) {
 			INIT_WORK(&dio->complete_work, dio_aio_complete_work);
 			queue_work(dio->inode->i_sb->s_dio_done_wq,
 				   &dio->complete_work);
@@ -1210,10 +1229,18 @@ do_blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
 	 * For AIO O_(D)SYNC writes we need to defer completions to a workqueue
 	 * so that we can call ->fsync.
 	 */
-	if (dio->is_async && iov_iter_rw(iter) == WRITE &&
-	    ((iocb->ki_filp->f_flags & O_DSYNC) ||
-	     IS_SYNC(iocb->ki_filp->f_mapping->host))) {
-		retval = dio_set_defer_completion(dio);
+	if (dio->is_async && iov_iter_rw(iter) == WRITE) {
+		retval = 0;
+		if ((iocb->ki_filp->f_flags & O_DSYNC) ||
+		    IS_SYNC(iocb->ki_filp->f_mapping->host))
+			retval = dio_set_defer_completion(dio);
+		else if (!dio->inode->i_sb->s_dio_done_wq)
+			/*
+			 * In case of AIO write racing with buffered read we
+			 * need to defer completion. We can't decide this now,
+			 * however the workqueue needs to be initialized here.
+			 */
+			retval = sb_init_dio_done_wq(dio->inode->i_sb);
 		if (retval) {
 			/*
 			 * We grab i_mutex only for reads so we don't have
diff --git a/fs/iomap.c b/fs/iomap.c
index 1732228..144512e 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -713,8 +713,15 @@ struct iomap_dio {
 static ssize_t iomap_dio_complete(struct iomap_dio *dio)
 {
 	struct kiocb *iocb = dio->iocb;
+	struct inode *inode = file_inode(iocb->ki_filp);
 	ssize_t ret;
 
+	if (!dio->error &&
+	    (dio->flags & IOMAP_DIO_WRITE) && inode->i_mapping->nrpages)
+		invalidate_inode_pages2_range(inode->i_mapping,
+				iocb->ki_pos >> PAGE_SHIFT,
+				(iocb->ki_pos + dio->size - 1) >> PAGE_SHIFT);
+
 	if (dio->end_io) {
 		ret = dio->end_io(iocb,
 				dio->error ? dio->error : dio->size,
-- 
2.7.5

* Re: [PATCH v4] fs: Fix page cache inconsistency when mixing buffered and AIO DIO
  2017-07-19  8:48     ` [PATCH v4] " Lukas Czerner
@ 2017-07-19  9:26       ` Jan Kara
  2017-07-19 11:01         ` Lukas Czerner
  0 siblings, 1 reply; 40+ messages in thread
From: Jan Kara @ 2017-07-19  9:26 UTC (permalink / raw)
  To: Lukas Czerner; +Cc: linux-fsdevel, viro, jack, hch, Jeff Moyer

On Wed 19-07-17 10:48:44, Lukas Czerner wrote:
> Currently when mixing buffered reads and asynchronous direct writes it
> is possible to end up with the situation where we have stale data in the
> page cache while the new data is already written to disk. This is
> permanent until the affected pages are flushed away. Despite the fact
> that mixing buffered and direct IO is ill-advised, it does pose a threat
> to data integrity, is unexpected and should be fixed.
> 
> Fix this by deferring completion of asynchronous direct writes to a
> process context in the case that there are mapped pages to be found in
> the inode. Later, before the completion in dio_complete(), invalidate
> the pages in question. This ensures that after the completion the pages
> in the written area are either unmapped, or populated with up-to-date
> data. Also do the same for the iomap case, which uses
> iomap_dio_complete() instead.
> 
> This has a side effect of deferring the completion to a process context
> for every AIO DIO that happens on an inode that has pages mapped.
> However, since the consensus is that this is ill-advised practice, the
> performance implication should not be a problem.
> 
> This was based on a proposal from Jeff Moyer, thanks!
> 
> Signed-off-by: Lukas Czerner <lczerner@redhat.com>
> Cc: Jeff Moyer <jmoyer@redhat.com>
> ---
> v2: Remove leftover ret variable from invalidate call in iomap_dio_complete
> v3: Do not invalidate in case of error. Add some comments
> v4: Remove unnecessary variable, remove unnecessary inner braces

Looks good to me now, just two style nits below. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

> diff --git a/fs/direct-io.c b/fs/direct-io.c
> index 08cf278..efd3246 100644
> --- a/fs/direct-io.c
> +++ b/fs/direct-io.c
> @@ -258,6 +258,12 @@ static ssize_t dio_complete(struct dio *dio, ssize_t ret, bool is_async)
>  	if (ret == 0)
>  		ret = transferred;
>  
> +	if ((ret > 0) &&
> +	    (dio->op == REQ_OP_WRITE && dio->inode->i_mapping->nrpages))

Heh, you seem to love braces. The general rule is that braces should go
around bit-ops (since people find their precedence unclear and it is too
easy to forget to add the braces when negating the condition) but not
around comparisons and the like. I.e. the above would be:

	if (ret > 0 && dio->op == REQ_OP_WRITE &&
	    dio->inode->i_mapping->nrpages)

...

> +	if (dio->is_async && iov_iter_rw(iter) == WRITE) {
> +		retval = 0;
> +		if ((iocb->ki_filp->f_flags & O_DSYNC) ||
> +		    IS_SYNC(iocb->ki_filp->f_mapping->host))
> +			retval = dio_set_defer_completion(dio);
> +		else if (!dio->inode->i_sb->s_dio_done_wq)
> +			/*
> +			 * In case of AIO write racing with buffered read we
> +			 * need to defer completion. We can't decide this now,
> +			 * however the workqueue needs to be initialized here.
> +			 */
> +			retval = sb_init_dio_done_wq(dio->inode->i_sb);

Curly braces here please. When the command block is multi-line we enforce
those even though it is only a single statement and they are not strictly
necessary. Thanks!

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

* Re: [PATCH v4] fs: Fix page cache inconsistency when mixing buffered and AIO DIO
  2017-07-19  9:26       ` Jan Kara
@ 2017-07-19 11:01         ` Lukas Czerner
  0 siblings, 0 replies; 40+ messages in thread
From: Lukas Czerner @ 2017-07-19 11:01 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, viro, hch, Jeff Moyer

On Wed, Jul 19, 2017 at 11:26:37AM +0200, Jan Kara wrote:
> > in question. This ensures that after the completion the pages
> > in the written area are either unmapped, or populated with up-to-date
> > data. Also do the same for the iomap case, which uses
> > iomap_dio_complete() instead.
> > 
> > This has a side effect of deferring the completion to a process context
> > for every AIO DIO that happens on an inode that has pages mapped.
> > However, since the consensus is that this is ill-advised practice, the
> > performance implication should not be a problem.
> > 
> > This was based on a proposal from Jeff Moyer, thanks!
> > 
> > Signed-off-by: Lukas Czerner <lczerner@redhat.com>
> > Cc: Jeff Moyer <jmoyer@redhat.com>
> > ---
> > v2: Remove leftover ret variable from invalidate call in iomap_dio_complete
> > v3: Do not invalidate in case of error. Add some comments
> > v4: Remove unnecessary variable, remove unnecessary inner braces
> 
> Looks good to me now, just two style nits below. You can add:
> 
> Reviewed-by: Jan Kara <jack@suse.cz>
> 
> > diff --git a/fs/direct-io.c b/fs/direct-io.c
> > index 08cf278..efd3246 100644
> > --- a/fs/direct-io.c
> > +++ b/fs/direct-io.c
> > @@ -258,6 +258,12 @@ static ssize_t dio_complete(struct dio *dio, ssize_t ret, bool is_async)
> >  	if (ret == 0)
> >  		ret = transferred;
> >  
> > +	if ((ret > 0) &&
> > +	    (dio->op == REQ_OP_WRITE && dio->inode->i_mapping->nrpages))
> 
> Heh, you seem to love braces. The general rule is that braces should be
> around bit-ops (as there people find the priority unclear and also it is
> too easy to forget to add those braces when negating the condition) but not
> around comparison or such. I.e. the above would be:
> 
> 	if (ret > 0 && dio->op == REQ_OP_WRITE &&
> 	    dio->inode->i_mapping->nrpages)

:D
sure, I'll resend.

> 
> ...
> 
> > +	if (dio->is_async && iov_iter_rw(iter) == WRITE) {
> > +		retval = 0;
> > +		if ((iocb->ki_filp->f_flags & O_DSYNC) ||
> > +		    IS_SYNC(iocb->ki_filp->f_mapping->host))
> > +			retval = dio_set_defer_completion(dio);
> > +		else if (!dio->inode->i_sb->s_dio_done_wq)
> > +			/*
> > +			 * In case of AIO write racing with buffered read we
> > +			 * need to defer completion. We can't decide this now,
> > +			 * however the workqueue needs to be initialized here.
> > +			 */
> > +			retval = sb_init_dio_done_wq(dio->inode->i_sb);
> 
> Curly braces here, please. When the command block is multi-line we enforce
> them even though it is only a single statement and thus, strictly speaking,
> they are not necessary. Thanks!

ok.

Thanks!
-Lukas

> 
> 								Honza
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH v5] fs: Fix page cache inconsistency when mixing buffered and AIO DIO
  2017-07-18 12:19   ` [PATCH v3] " Lukas Czerner
  2017-07-18 13:44     ` Christoph Hellwig
  2017-07-19  8:48     ` [PATCH v4] " Lukas Czerner
@ 2017-07-19 11:28     ` Lukas Czerner
  2017-07-19 11:37       ` Jan Kara
                         ` (3 more replies)
  2 siblings, 4 replies; 40+ messages in thread
From: Lukas Czerner @ 2017-07-19 11:28 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: viro, jack, hch, Lukas Czerner, Jeff Moyer

Currently when mixing buffered reads and asynchronous direct writes it
is possible to end up with the situation where we have stale data in the
page cache while the new data is already written to disk. This is
permanent until the affected pages are flushed away. Despite the fact
that mixing buffered and direct IO is ill-advised, it does pose a threat
to data integrity, is unexpected and should be fixed.

Fix this by deferring completion of asynchronous direct writes to a
process context in the case that there are mapped pages to be found in
the inode. Later, before the completion in dio_complete(), invalidate
the pages in question. This ensures that after the completion the pages
in the written area are either unmapped, or populated with up-to-date
data. Also do the same for the iomap case which uses
iomap_dio_complete() instead.

This has a side effect of deferring the completion to a process context
for every AIO DIO that happens on an inode that has pages mapped. However,
since the consensus is that this is an ill-advised practice, the performance
implication should not be a problem.

This was based on a proposal from Jeff Moyer, thanks!

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
---
v2: Remove leftover ret variable from invalidate call in iomap_dio_complete
v3: Do not invalidate in case of error. Add some comments
v4: Remove unnecessary variable, remove unnecessary inner braces
v5: Style changes

 fs/direct-io.c | 39 ++++++++++++++++++++++++++++++++++-----
 fs/iomap.c     |  7 +++++++
 2 files changed, 41 insertions(+), 5 deletions(-)

diff --git a/fs/direct-io.c b/fs/direct-io.c
index 08cf278..0d1befd 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -258,6 +258,13 @@ static ssize_t dio_complete(struct dio *dio, ssize_t ret, bool is_async)
 	if (ret == 0)
 		ret = transferred;
 
+	if (ret > 0 && dio->op == REQ_OP_WRITE &&
+	    dio->inode->i_mapping->nrpages) {
+		invalidate_inode_pages2_range(dio->inode->i_mapping,
+					offset >> PAGE_SHIFT,
+					(offset + ret - 1) >> PAGE_SHIFT);
+	}
+
 	if (dio->end_io) {
 		int err;
 
@@ -304,6 +311,7 @@ static void dio_bio_end_aio(struct bio *bio)
 	struct dio *dio = bio->bi_private;
 	unsigned long remaining;
 	unsigned long flags;
+	bool defer_completion = false;
 
 	/* cleanup the bio */
 	dio_bio_complete(dio, bio);
@@ -315,7 +323,19 @@ static void dio_bio_end_aio(struct bio *bio)
 	spin_unlock_irqrestore(&dio->bio_lock, flags);
 
 	if (remaining == 0) {
-		if (dio->result && dio->defer_completion) {
+		/*
+		 * Defer completion when defer_completion is set or
+		 * when the inode has pages mapped and this is AIO write.
+		 * We need to invalidate those pages because there is a
+		 * chance they contain stale data in the case buffered IO
+		 * went in between AIO submission and completion into the
+		 * same region.
+		 */
+		if (dio->result)
+			defer_completion = dio->defer_completion ||
+					   (dio->op == REQ_OP_WRITE &&
+					    dio->inode->i_mapping->nrpages);
+		if (defer_completion) {
 			INIT_WORK(&dio->complete_work, dio_aio_complete_work);
 			queue_work(dio->inode->i_sb->s_dio_done_wq,
 				   &dio->complete_work);
@@ -1210,10 +1230,19 @@ do_blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
 	 * For AIO O_(D)SYNC writes we need to defer completions to a workqueue
 	 * so that we can call ->fsync.
 	 */
-	if (dio->is_async && iov_iter_rw(iter) == WRITE &&
-	    ((iocb->ki_filp->f_flags & O_DSYNC) ||
-	     IS_SYNC(iocb->ki_filp->f_mapping->host))) {
-		retval = dio_set_defer_completion(dio);
+	if (dio->is_async && iov_iter_rw(iter) == WRITE) {
+		retval = 0;
+		if ((iocb->ki_filp->f_flags & O_DSYNC) ||
+		    IS_SYNC(iocb->ki_filp->f_mapping->host))
+			retval = dio_set_defer_completion(dio);
+		else if (!dio->inode->i_sb->s_dio_done_wq) {
+			/*
+			 * In case of AIO write racing with buffered read we
+			 * need to defer completion. We can't decide this now,
+			 * however the workqueue needs to be initialized here.
+			 */
+			retval = sb_init_dio_done_wq(dio->inode->i_sb);
+		}
 		if (retval) {
 			/*
 			 * We grab i_mutex only for reads so we don't have
diff --git a/fs/iomap.c b/fs/iomap.c
index 1732228..144512e 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -713,8 +713,15 @@ struct iomap_dio {
 static ssize_t iomap_dio_complete(struct iomap_dio *dio)
 {
 	struct kiocb *iocb = dio->iocb;
+	struct inode *inode = file_inode(iocb->ki_filp);
 	ssize_t ret;
 
+	if (!dio->error &&
+	    (dio->flags & IOMAP_DIO_WRITE) && inode->i_mapping->nrpages)
+		invalidate_inode_pages2_range(inode->i_mapping,
+				iocb->ki_pos >> PAGE_SHIFT,
+				(iocb->ki_pos + dio->size - 1) >> PAGE_SHIFT);
+
 	if (dio->end_io) {
 		ret = dio->end_io(iocb,
 				dio->error ? dio->error : dio->size,
-- 
2.7.5
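
For illustration, here is a minimal userspace sketch of the racy pattern
the patch defends against: an AIO O_DIRECT write racing with a buffered
read of the same range. The file name is hypothetical, "testfile" is
assumed to already exist with old contents, error handling is omitted,
and it needs libaio (build with -laio). Whether the stale page actually
survives is timing-dependent, of course.

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <libaio.h>
	#include <stdlib.h>
	#include <string.h>
	#include <unistd.h>

	int main(void)
	{
		io_context_t ctx = 0;
		struct iocb cb, *cbs[1] = { &cb };
		struct io_event ev;
		char cache[4096];
		void *buf;
		int dfd, bfd;

		/* O_DIRECT writer and a buffered reader on the same file */
		dfd = open("testfile", O_RDWR | O_DIRECT);
		bfd = open("testfile", O_RDONLY);
		posix_memalign(&buf, 4096, 4096);
		memset(buf, 'a', 4096);

		io_setup(1, &ctx);
		io_prep_pwrite(&cb, dfd, buf, 4096, 0);
		io_submit(ctx, 1, cbs);	/* AIO DIO write in flight... */

		/* ...buffered read caches the old on-disk data meanwhile */
		pread(bfd, cache, 4096, 0);

		io_getevents(ctx, 1, 1, &ev, NULL);

		/* without the fix, this can still see the stale page */
		pread(bfd, cache, 4096, 0);
		return 0;
	}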

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH v5] fs: Fix page cache inconsistency when mixing buffered and AIO DIO
  2017-07-19 11:28     ` [PATCH v5] " Lukas Czerner
@ 2017-07-19 11:37       ` Jan Kara
  2017-07-19 12:17       ` Jeff Moyer
                         ` (2 subsequent siblings)
  3 siblings, 0 replies; 40+ messages in thread
From: Jan Kara @ 2017-07-19 11:37 UTC (permalink / raw)
  To: Lukas Czerner; +Cc: linux-fsdevel, viro, jack, hch, Jeff Moyer

On Wed 19-07-17 13:28:12, Lukas Czerner wrote:
> Currently when mixing buffered reads and asynchronous direct writes it
> is possible to end up with the situation where we have stale data in the
> page cache while the new data is already written to disk. This is
> permanent until the affected pages are flushed away. Despite the fact
> that mixing buffered and direct IO is ill-advised, it does pose a threat
> to data integrity, is unexpected and should be fixed.
> 
> Fix this by deferring completion of asynchronous direct writes to a
> process context in the case that there are mapped pages to be found in
> the inode. Later, before the completion in dio_complete(), invalidate
> the pages in question. This ensures that after the completion the pages
> in the written area are either unmapped, or populated with up-to-date
> data. Also do the same for the iomap case which uses
> iomap_dio_complete() instead.
> 
> This has a side effect of deferring the completion to a process context
> for every AIO DIO that happens on an inode that has pages mapped. However,
> since the consensus is that this is an ill-advised practice, the performance
> implication should not be a problem.
> 
> This was based on a proposal from Jeff Moyer, thanks!
> 
> Signed-off-by: Lukas Czerner <lczerner@redhat.com>
> Cc: Jeff Moyer <jmoyer@redhat.com>

You forgot to add my Reviewed-by tag. So feel free to add it now:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
> v2: Remove leftover ret variable from invalidate call in iomap_dio_complete
> v3: Do not invalidate in case of error. Add some comments
> v4: Remove unnecessary variable, remove unnecessary inner braces
> v5: Style changes
> 
>  fs/direct-io.c | 39 ++++++++++++++++++++++++++++++++++-----
>  fs/iomap.c     |  7 +++++++
>  2 files changed, 41 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/direct-io.c b/fs/direct-io.c
> index 08cf278..0d1befd 100644
> --- a/fs/direct-io.c
> +++ b/fs/direct-io.c
> @@ -258,6 +258,13 @@ static ssize_t dio_complete(struct dio *dio, ssize_t ret, bool is_async)
>  	if (ret == 0)
>  		ret = transferred;
>  
> +	if (ret > 0 && dio->op == REQ_OP_WRITE &&
> +	    dio->inode->i_mapping->nrpages) {
> +		invalidate_inode_pages2_range(dio->inode->i_mapping,
> +					offset >> PAGE_SHIFT,
> +					(offset + ret - 1) >> PAGE_SHIFT);
> +	}
> +
>  	if (dio->end_io) {
>  		int err;
>  
> @@ -304,6 +311,7 @@ static void dio_bio_end_aio(struct bio *bio)
>  	struct dio *dio = bio->bi_private;
>  	unsigned long remaining;
>  	unsigned long flags;
> +	bool defer_completion = false;
>  
>  	/* cleanup the bio */
>  	dio_bio_complete(dio, bio);
> @@ -315,7 +323,19 @@ static void dio_bio_end_aio(struct bio *bio)
>  	spin_unlock_irqrestore(&dio->bio_lock, flags);
>  
>  	if (remaining == 0) {
> -		if (dio->result && dio->defer_completion) {
> +		/*
> +		 * Defer completion when defer_completion is set or
> +		 * when the inode has pages mapped and this is AIO write.
> +		 * We need to invalidate those pages because there is a
> +		 * chance they contain stale data in the case buffered IO
> +		 * went in between AIO submission and completion into the
> +		 * same region.
> +		 */
> +		if (dio->result)
> +			defer_completion = dio->defer_completion ||
> +					   (dio->op == REQ_OP_WRITE &&
> +					    dio->inode->i_mapping->nrpages);
> +		if (defer_completion) {
>  			INIT_WORK(&dio->complete_work, dio_aio_complete_work);
>  			queue_work(dio->inode->i_sb->s_dio_done_wq,
>  				   &dio->complete_work);
> @@ -1210,10 +1230,19 @@ do_blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
>  	 * For AIO O_(D)SYNC writes we need to defer completions to a workqueue
>  	 * so that we can call ->fsync.
>  	 */
> -	if (dio->is_async && iov_iter_rw(iter) == WRITE &&
> -	    ((iocb->ki_filp->f_flags & O_DSYNC) ||
> -	     IS_SYNC(iocb->ki_filp->f_mapping->host))) {
> -		retval = dio_set_defer_completion(dio);
> +	if (dio->is_async && iov_iter_rw(iter) == WRITE) {
> +		retval = 0;
> +		if ((iocb->ki_filp->f_flags & O_DSYNC) ||
> +		    IS_SYNC(iocb->ki_filp->f_mapping->host))
> +			retval = dio_set_defer_completion(dio);
> +		else if (!dio->inode->i_sb->s_dio_done_wq) {
> +			/*
> +			 * In case of AIO write racing with buffered read we
> +			 * need to defer completion. We can't decide this now,
> +			 * however the workqueue needs to be initialized here.
> +			 */
> +			retval = sb_init_dio_done_wq(dio->inode->i_sb);
> +		}
>  		if (retval) {
>  			/*
>  			 * We grab i_mutex only for reads so we don't have
> diff --git a/fs/iomap.c b/fs/iomap.c
> index 1732228..144512e 100644
> --- a/fs/iomap.c
> +++ b/fs/iomap.c
> @@ -713,8 +713,15 @@ struct iomap_dio {
>  static ssize_t iomap_dio_complete(struct iomap_dio *dio)
>  {
>  	struct kiocb *iocb = dio->iocb;
> +	struct inode *inode = file_inode(iocb->ki_filp);
>  	ssize_t ret;
>  
> +	if (!dio->error &&
> +	    (dio->flags & IOMAP_DIO_WRITE) && inode->i_mapping->nrpages)
> +		invalidate_inode_pages2_range(inode->i_mapping,
> +				iocb->ki_pos >> PAGE_SHIFT,
> +				(iocb->ki_pos + dio->size - 1) >> PAGE_SHIFT);
> +
>  	if (dio->end_io) {
>  		ret = dio->end_io(iocb,
>  				dio->error ? dio->error : dio->size,
> -- 
> 2.7.5
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v5] fs: Fix page cache inconsistency when mixing buffered and AIO DIO
  2017-07-19 11:28     ` [PATCH v5] " Lukas Czerner
  2017-07-19 11:37       ` Jan Kara
@ 2017-07-19 12:17       ` Jeff Moyer
  2017-08-03 18:10       ` Jeff Moyer
  2017-08-10 12:59       ` [PATCH v6] " Lukas Czerner
  3 siblings, 0 replies; 40+ messages in thread
From: Jeff Moyer @ 2017-07-19 12:17 UTC (permalink / raw)
  To: Lukas Czerner; +Cc: linux-fsdevel, viro, jack, hch

Lukas Czerner <lczerner@redhat.com> writes:

> Currently when mixing buffered reads and asynchronous direct writes it
> is possible to end up with the situation where we have stale data in the
> page cache while the new data is already written to disk. This is
> permanent until the affected pages are flushed away. Despite the fact
> that mixing buffered and direct IO is ill-advised, it does pose a threat
> to data integrity, is unexpected and should be fixed.
>
> Fix this by deferring completion of asynchronous direct writes to a
> process context in the case that there are mapped pages to be found in
> the inode. Later, before the completion in dio_complete(), invalidate
> the pages in question. This ensures that after the completion the pages
> in the written area are either unmapped, or populated with up-to-date
> data. Also do the same for the iomap case which uses
> iomap_dio_complete() instead.
>
> This has a side effect of deferring the completion to a process context
> for every AIO DIO that happens on an inode that has pages mapped. However,
> since the consensus is that this is an ill-advised practice, the performance
> implication should not be a problem.
>
> This was based on a proposal from Jeff Moyer, thanks!
>
> Signed-off-by: Lukas Czerner <lczerner@redhat.com>
> Cc: Jeff Moyer <jmoyer@redhat.com>

Looks good, Lukas!  Thanks!

Reviewed-by: Jeff Moyer <jmoyer@redhat.com>


> ---
> v2: Remove leftover ret variable from invalidate call in iomap_dio_complete
> v3: Do not invalidate in case of error. Add some comments
> v4: Remove unnecessary variable, remove unnecessary inner braces
> v5: Style changes
>
>  fs/direct-io.c | 39 ++++++++++++++++++++++++++++++++++-----
>  fs/iomap.c     |  7 +++++++
>  2 files changed, 41 insertions(+), 5 deletions(-)
>
> diff --git a/fs/direct-io.c b/fs/direct-io.c
> index 08cf278..0d1befd 100644
> --- a/fs/direct-io.c
> +++ b/fs/direct-io.c
> @@ -258,6 +258,13 @@ static ssize_t dio_complete(struct dio *dio, ssize_t ret, bool is_async)
>  	if (ret == 0)
>  		ret = transferred;
>  
> +	if (ret > 0 && dio->op == REQ_OP_WRITE &&
> +	    dio->inode->i_mapping->nrpages) {
> +		invalidate_inode_pages2_range(dio->inode->i_mapping,
> +					offset >> PAGE_SHIFT,
> +					(offset + ret - 1) >> PAGE_SHIFT);
> +	}
> +
>  	if (dio->end_io) {
>  		int err;
>  
> @@ -304,6 +311,7 @@ static void dio_bio_end_aio(struct bio *bio)
>  	struct dio *dio = bio->bi_private;
>  	unsigned long remaining;
>  	unsigned long flags;
> +	bool defer_completion = false;
>  
>  	/* cleanup the bio */
>  	dio_bio_complete(dio, bio);
> @@ -315,7 +323,19 @@ static void dio_bio_end_aio(struct bio *bio)
>  	spin_unlock_irqrestore(&dio->bio_lock, flags);
>  
>  	if (remaining == 0) {
> -		if (dio->result && dio->defer_completion) {
> +		/*
> +		 * Defer completion when defer_completion is set or
> +		 * when the inode has pages mapped and this is AIO write.
> +		 * We need to invalidate those pages because there is a
> +		 * chance they contain stale data in the case buffered IO
> +		 * went in between AIO submission and completion into the
> +		 * same region.
> +		 */
> +		if (dio->result)
> +			defer_completion = dio->defer_completion ||
> +					   (dio->op == REQ_OP_WRITE &&
> +					    dio->inode->i_mapping->nrpages);
> +		if (defer_completion) {
>  			INIT_WORK(&dio->complete_work, dio_aio_complete_work);
>  			queue_work(dio->inode->i_sb->s_dio_done_wq,
>  				   &dio->complete_work);
> @@ -1210,10 +1230,19 @@ do_blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
>  	 * For AIO O_(D)SYNC writes we need to defer completions to a workqueue
>  	 * so that we can call ->fsync.
>  	 */
> -	if (dio->is_async && iov_iter_rw(iter) == WRITE &&
> -	    ((iocb->ki_filp->f_flags & O_DSYNC) ||
> -	     IS_SYNC(iocb->ki_filp->f_mapping->host))) {
> -		retval = dio_set_defer_completion(dio);
> +	if (dio->is_async && iov_iter_rw(iter) == WRITE) {
> +		retval = 0;
> +		if ((iocb->ki_filp->f_flags & O_DSYNC) ||
> +		    IS_SYNC(iocb->ki_filp->f_mapping->host))
> +			retval = dio_set_defer_completion(dio);
> +		else if (!dio->inode->i_sb->s_dio_done_wq) {
> +			/*
> +			 * In case of AIO write racing with buffered read we
> +			 * need to defer completion. We can't decide this now,
> +			 * however the workqueue needs to be initialized here.
> +			 */
> +			retval = sb_init_dio_done_wq(dio->inode->i_sb);
> +		}
>  		if (retval) {
>  			/*
>  			 * We grab i_mutex only for reads so we don't have
> diff --git a/fs/iomap.c b/fs/iomap.c
> index 1732228..144512e 100644
> --- a/fs/iomap.c
> +++ b/fs/iomap.c
> @@ -713,8 +713,15 @@ struct iomap_dio {
>  static ssize_t iomap_dio_complete(struct iomap_dio *dio)
>  {
>  	struct kiocb *iocb = dio->iocb;
> +	struct inode *inode = file_inode(iocb->ki_filp);
>  	ssize_t ret;
>  
> +	if (!dio->error &&
> +	    (dio->flags & IOMAP_DIO_WRITE) && inode->i_mapping->nrpages)
> +		invalidate_inode_pages2_range(inode->i_mapping,
> +				iocb->ki_pos >> PAGE_SHIFT,
> +				(iocb->ki_pos + dio->size - 1) >> PAGE_SHIFT);
> +
>  	if (dio->end_io) {
>  		ret = dio->end_io(iocb,
>  				dio->error ? dio->error : dio->size,

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v5] fs: Fix page cache inconsistency when mixing buffered and AIO DIO
  2017-07-19 11:28     ` [PATCH v5] " Lukas Czerner
  2017-07-19 11:37       ` Jan Kara
  2017-07-19 12:17       ` Jeff Moyer
@ 2017-08-03 18:10       ` Jeff Moyer
  2017-08-04 10:09         ` Dave Chinner
  2017-08-10 12:59       ` [PATCH v6] " Lukas Czerner
  3 siblings, 1 reply; 40+ messages in thread
From: Jeff Moyer @ 2017-08-03 18:10 UTC (permalink / raw)
  To: viro, Lukas Czerner; +Cc: linux-fsdevel, jack, hch

Al, would you mind taking this in through your tree?  It's been reviewed
by myself and Jan in this mail thread.

Thanks!
Jeff

Lukas Czerner <lczerner@redhat.com> writes:

> Currently when mixing buffered reads and asynchronous direct writes it
> is possible to end up with the situation where we have stale data in the
> page cache while the new data is already written to disk. This is
> permanent until the affected pages are flushed away. Despite the fact
> that mixing buffered and direct IO is ill-advised, it does pose a threat
> to data integrity, is unexpected and should be fixed.
>
> Fix this by deferring completion of asynchronous direct writes to a
> process context in the case that there are mapped pages to be found in
> the inode. Later, before the completion in dio_complete(), invalidate
> the pages in question. This ensures that after the completion the pages
> in the written area are either unmapped, or populated with up-to-date
> data. Also do the same for the iomap case which uses
> iomap_dio_complete() instead.
>
> This has a side effect of deferring the completion to a process context
> for every AIO DIO that happens on an inode that has pages mapped. However,
> since the consensus is that this is an ill-advised practice, the performance
> implication should not be a problem.
>
> This was based on a proposal from Jeff Moyer, thanks!
>
> Signed-off-by: Lukas Czerner <lczerner@redhat.com>
> Cc: Jeff Moyer <jmoyer@redhat.com>
> ---
> v2: Remove leftover ret variable from invalidate call in iomap_dio_complete
> v3: Do not invalidate in case of error. Add some comments
> v4: Remove unnecessary variable, remove unnecessary inner braces
> v5: Style changes
>
>  fs/direct-io.c | 39 ++++++++++++++++++++++++++++++++++-----
>  fs/iomap.c     |  7 +++++++
>  2 files changed, 41 insertions(+), 5 deletions(-)
>
> diff --git a/fs/direct-io.c b/fs/direct-io.c
> index 08cf278..0d1befd 100644
> --- a/fs/direct-io.c
> +++ b/fs/direct-io.c
> @@ -258,6 +258,13 @@ static ssize_t dio_complete(struct dio *dio, ssize_t ret, bool is_async)
>  	if (ret == 0)
>  		ret = transferred;
>  
> +	if (ret > 0 && dio->op == REQ_OP_WRITE &&
> +	    dio->inode->i_mapping->nrpages) {
> +		invalidate_inode_pages2_range(dio->inode->i_mapping,
> +					offset >> PAGE_SHIFT,
> +					(offset + ret - 1) >> PAGE_SHIFT);
> +	}
> +
>  	if (dio->end_io) {
>  		int err;
>  
> @@ -304,6 +311,7 @@ static void dio_bio_end_aio(struct bio *bio)
>  	struct dio *dio = bio->bi_private;
>  	unsigned long remaining;
>  	unsigned long flags;
> +	bool defer_completion = false;
>  
>  	/* cleanup the bio */
>  	dio_bio_complete(dio, bio);
> @@ -315,7 +323,19 @@ static void dio_bio_end_aio(struct bio *bio)
>  	spin_unlock_irqrestore(&dio->bio_lock, flags);
>  
>  	if (remaining == 0) {
> -		if (dio->result && dio->defer_completion) {
> +		/*
> +		 * Defer completion when defer_completion is set or
> +		 * when the inode has pages mapped and this is AIO write.
> +		 * We need to invalidate those pages because there is a
> +		 * chance they contain stale data in the case buffered IO
> +		 * went in between AIO submission and completion into the
> +		 * same region.
> +		 */
> +		if (dio->result)
> +			defer_completion = dio->defer_completion ||
> +					   (dio->op == REQ_OP_WRITE &&
> +					    dio->inode->i_mapping->nrpages);
> +		if (defer_completion) {
>  			INIT_WORK(&dio->complete_work, dio_aio_complete_work);
>  			queue_work(dio->inode->i_sb->s_dio_done_wq,
>  				   &dio->complete_work);
> @@ -1210,10 +1230,19 @@ do_blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
>  	 * For AIO O_(D)SYNC writes we need to defer completions to a workqueue
>  	 * so that we can call ->fsync.
>  	 */
> -	if (dio->is_async && iov_iter_rw(iter) == WRITE &&
> -	    ((iocb->ki_filp->f_flags & O_DSYNC) ||
> -	     IS_SYNC(iocb->ki_filp->f_mapping->host))) {
> -		retval = dio_set_defer_completion(dio);
> +	if (dio->is_async && iov_iter_rw(iter) == WRITE) {
> +		retval = 0;
> +		if ((iocb->ki_filp->f_flags & O_DSYNC) ||
> +		    IS_SYNC(iocb->ki_filp->f_mapping->host))
> +			retval = dio_set_defer_completion(dio);
> +		else if (!dio->inode->i_sb->s_dio_done_wq) {
> +			/*
> +			 * In case of AIO write racing with buffered read we
> +			 * need to defer completion. We can't decide this now,
> +			 * however the workqueue needs to be initialized here.
> +			 */
> +			retval = sb_init_dio_done_wq(dio->inode->i_sb);
> +		}
>  		if (retval) {
>  			/*
>  			 * We grab i_mutex only for reads so we don't have
> diff --git a/fs/iomap.c b/fs/iomap.c
> index 1732228..144512e 100644
> --- a/fs/iomap.c
> +++ b/fs/iomap.c
> @@ -713,8 +713,15 @@ struct iomap_dio {
>  static ssize_t iomap_dio_complete(struct iomap_dio *dio)
>  {
>  	struct kiocb *iocb = dio->iocb;
> +	struct inode *inode = file_inode(iocb->ki_filp);
>  	ssize_t ret;
>  
> +	if (!dio->error &&
> +	    (dio->flags & IOMAP_DIO_WRITE) && inode->i_mapping->nrpages)
> +		invalidate_inode_pages2_range(inode->i_mapping,
> +				iocb->ki_pos >> PAGE_SHIFT,
> +				(iocb->ki_pos + dio->size - 1) >> PAGE_SHIFT);
> +
>  	if (dio->end_io) {
>  		ret = dio->end_io(iocb,
>  				dio->error ? dio->error : dio->size,

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v5] fs: Fix page cache inconsistency when mixing buffered and AIO DIO
  2017-08-03 18:10       ` Jeff Moyer
@ 2017-08-04 10:09         ` Dave Chinner
  2017-08-07 15:52           ` Jeff Moyer
  0 siblings, 1 reply; 40+ messages in thread
From: Dave Chinner @ 2017-08-04 10:09 UTC (permalink / raw)
  To: Jeff Moyer; +Cc: viro, Lukas Czerner, linux-fsdevel, jack, hch

On Thu, Aug 03, 2017 at 02:10:47PM -0400, Jeff Moyer wrote:
> Al, would you mind taking this in through your tree?  It's been reviewed
> by myself and Jan in this mail thread.

Still needs more fixing, I think?

Sorry, this is the first time I've seen this patch....

> > diff --git a/fs/iomap.c b/fs/iomap.c
> > index 1732228..144512e 100644
> > --- a/fs/iomap.c
> > +++ b/fs/iomap.c
> > @@ -713,8 +713,15 @@ struct iomap_dio {
> >  static ssize_t iomap_dio_complete(struct iomap_dio *dio)
> >  {
> >  	struct kiocb *iocb = dio->iocb;
> > +	struct inode *inode = file_inode(iocb->ki_filp);
> >  	ssize_t ret;
> >  
> > +	if (!dio->error &&
> > +	    (dio->flags & IOMAP_DIO_WRITE) && inode->i_mapping->nrpages)
> > +		invalidate_inode_pages2_range(inode->i_mapping,
> > +				iocb->ki_pos >> PAGE_SHIFT,
> > +				(iocb->ki_pos + dio->size - 1) >> PAGE_SHIFT);
> > +
> >  	if (dio->end_io) {
> >  		ret = dio->end_io(iocb,
> >  				dio->error ? dio->error : dio->size,
> 

This invalidation is already run in iomap_dio_rw() for the sync IO
case directly after the call to iomap_dio_complete().  It also has a
comment to explain exactly why the invalidation is needed, and
it issues a warning to dmesg if the invalidation fails, to indicate
the reason why the user is reporting data corruption to us. i.e.:

.....
        ret = iomap_dio_complete(dio);

        /*
         * Try again to invalidate clean pages which might have been cached by
         * non-direct readahead, or faulted in by get_user_pages() if the source
         * of the write was an mmap'ed region of the file we're writing.  Either
         * one is a pretty crazy thing to do, so we don't support it 100%.  If
         * this invalidation fails, tough, the write still worked...
         */
        if (iov_iter_rw(iter) == WRITE) {
                int err = invalidate_inode_pages2_range(mapping,
                                start >> PAGE_SHIFT, end >> PAGE_SHIFT);
                WARN_ON_ONCE(err);
        }

        return ret;

If we're going to replace this with an invalidation in
iomap_dio_complete() so it also handles the AIO path, then the
comment and warning on invalidation failure also need to be moved to
iomap_dio_complete() and the duplicate code removed from
iomap_dio_rw()...
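
I.e. the comment and the WARN_ON_ONCE() would presumably move into the
nrpages-guarded invalidation in iomap_dio_complete(), roughly:

	if (!dio->error &&
	    (dio->flags & IOMAP_DIO_WRITE) && inode->i_mapping->nrpages) {
		/* comment from iomap_dio_rw() would move here */
		ret = invalidate_inode_pages2_range(inode->i_mapping,
				iocb->ki_pos >> PAGE_SHIFT,
				(iocb->ki_pos + dio->size - 1) >> PAGE_SHIFT);
		WARN_ON_ONCE(ret);
	}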

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v5] fs: Fix page cache inconsistency when mixing buffered and AIO DIO
  2017-08-04 10:09         ` Dave Chinner
@ 2017-08-07 15:52           ` Jeff Moyer
  2017-08-08  8:41             ` Lukas Czerner
  0 siblings, 1 reply; 40+ messages in thread
From: Jeff Moyer @ 2017-08-07 15:52 UTC (permalink / raw)
  To: Dave Chinner; +Cc: viro, Lukas Czerner, linux-fsdevel, jack, hch

Dave Chinner <david@fromorbit.com> writes:

> On Thu, Aug 03, 2017 at 02:10:47PM -0400, Jeff Moyer wrote:
>> Al, would you mind taking this in through your tree?  It's been reviewed
>> by myself and Jan in this mail thread.
>
> Still needs more fixing, I think?
>
> Sorry, this is the first time I've seen this patch....
>
>> > diff --git a/fs/iomap.c b/fs/iomap.c
>> > index 1732228..144512e 100644
>> > --- a/fs/iomap.c
>> > +++ b/fs/iomap.c
>> > @@ -713,8 +713,15 @@ struct iomap_dio {
>> >  static ssize_t iomap_dio_complete(struct iomap_dio *dio)
>> >  {
>> >  	struct kiocb *iocb = dio->iocb;
>> > +	struct inode *inode = file_inode(iocb->ki_filp);
>> >  	ssize_t ret;
>> >  
>> > +	if (!dio->error &&
>> > +	    (dio->flags & IOMAP_DIO_WRITE) && inode->i_mapping->nrpages)
>> > +		invalidate_inode_pages2_range(inode->i_mapping,
>> > +				iocb->ki_pos >> PAGE_SHIFT,
>> > +				(iocb->ki_pos + dio->size - 1) >> PAGE_SHIFT);
>> > +
>> >  	if (dio->end_io) {
>> >  		ret = dio->end_io(iocb,
>> >  				dio->error ? dio->error : dio->size,
>> 
>
> This invalidation is already run in iomap_dio_rw() for the sync IO
> case directly after the call to iomap_dio_complete().  It also has a
> comment to explain exactly why the invalidation is needed, and
> it issues a warning to dmesg if the invalidation fails, to indicate
> the reason why the user is reporting data corruption to us. i.e.:
>
> .....
>         ret = iomap_dio_complete(dio);
>
>         /*
>          * Try again to invalidate clean pages which might have been cached by
>          * non-direct readahead, or faulted in by get_user_pages() if the source
>          * of the write was an mmap'ed region of the file we're writing.  Either
>          * one is a pretty crazy thing to do, so we don't support it 100%.  If
>          * this invalidation fails, tough, the write still worked...
>          */
>         if (iov_iter_rw(iter) == WRITE) {
>                 int err = invalidate_inode_pages2_range(mapping,
>                                 start >> PAGE_SHIFT, end >> PAGE_SHIFT);
>                 WARN_ON_ONCE(err);
>         }
>
>         return ret;
>
> If we're going to replace this with an invalidation in
> iomap_dio_complete() so it also handles the AIO path, then the
> comment and warning on invalidation failure also need to be moved to
> iomap_dio_complete() and the duplicate code removed from
> iomap_dio_rw()...

Yep, good catch.  Lukas, care to respin?

-Jeff

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v5] fs: Fix page cache inconsistency when mixing buffered and AIO DIO
  2017-08-07 15:52           ` Jeff Moyer
@ 2017-08-08  8:41             ` Lukas Czerner
  0 siblings, 0 replies; 40+ messages in thread
From: Lukas Czerner @ 2017-08-08  8:41 UTC (permalink / raw)
  To: Jeff Moyer; +Cc: Dave Chinner, viro, linux-fsdevel, jack, hch

On Mon, Aug 07, 2017 at 11:52:45AM -0400, Jeff Moyer wrote:
> Dave Chinner <david@fromorbit.com> writes:
> 
> > On Thu, Aug 03, 2017 at 02:10:47PM -0400, Jeff Moyer wrote:
> >> Al, would you mind taking this in through your tree?  It's been reviewed
> >> by myself and Jan in this mail thread.
> >
> > Still needs more fixing, I think?
> >
> > Sorry, this is the first time I've seen this patch....
> >
> >> > diff --git a/fs/iomap.c b/fs/iomap.c
> >> > index 1732228..144512e 100644
> >> > --- a/fs/iomap.c
> >> > +++ b/fs/iomap.c
> >> > @@ -713,8 +713,15 @@ struct iomap_dio {
> >> >  static ssize_t iomap_dio_complete(struct iomap_dio *dio)
> >> >  {
> >> >  	struct kiocb *iocb = dio->iocb;
> >> > +	struct inode *inode = file_inode(iocb->ki_filp);
> >> >  	ssize_t ret;
> >> >  
> >> > +	if (!dio->error &&
> >> > +	    (dio->flags & IOMAP_DIO_WRITE) && inode->i_mapping->nrpages)
> >> > +		invalidate_inode_pages2_range(inode->i_mapping,
> >> > +				iocb->ki_pos >> PAGE_SHIFT,
> >> > +				(iocb->ki_pos + dio->size - 1) >> PAGE_SHIFT);
> >> > +
> >> >  	if (dio->end_io) {
> >> >  		ret = dio->end_io(iocb,
> >> >  				dio->error ? dio->error : dio->size,
> >> 
> >
> > This invalidation is already run in iomap_dio_rw() for the sync IO
> > case directly after the call to iomap_dio_complete().  It also has a
> > comment to explain exactly why the invalidation is needed, and
> > it issues a warning to dmesg if the invalidation fails, to indicate
> > the reason why the user is reporting data corruption to us. i.e.:
> >
> > .....
> >         ret = iomap_dio_complete(dio);
> >
> >         /*
> >          * Try again to invalidate clean pages which might have been cached by
> >          * non-direct readahead, or faulted in by get_user_pages() if the source
> >          * of the write was an mmap'ed region of the file we're writing.  Either
> >          * one is a pretty crazy thing to do, so we don't support it 100%.  If
> >          * this invalidation fails, tough, the write still worked...
> >          */
> >         if (iov_iter_rw(iter) == WRITE) {
> >                 int err = invalidate_inode_pages2_range(mapping,
> >                                 start >> PAGE_SHIFT, end >> PAGE_SHIFT);
> >                 WARN_ON_ONCE(err);
> >         }
> >
> >         return ret;
> >
> > If we're going to replace this with an invalidation in
> > iomap_dio_complete() so it also handles the AIO path, then the
> > comment and warning on invalidation failure also need to be moved to
> > iomap_dio_complete() and the duplicate code removed from
> > iomap_dio_rw()...
> 
> Yep, good catch.  Lukas, care to respin?

Of course, I'll respin.

-Lukas

> 
> -Jeff

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH v6] fs: Fix page cache inconsistency when mixing buffered and AIO DIO
  2017-07-19 11:28     ` [PATCH v5] " Lukas Czerner
                         ` (2 preceding siblings ...)
  2017-08-03 18:10       ` Jeff Moyer
@ 2017-08-10 12:59       ` Lukas Czerner
  2017-08-10 13:56         ` Jan Kara
  2017-08-15 13:28         ` [PATCH v7] " Lukas Czerner
  3 siblings, 2 replies; 40+ messages in thread
From: Lukas Czerner @ 2017-08-10 12:59 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: viro, jack, jmoyer, david, Lukas Czerner

Currently when mixing buffered reads and asynchronous direct writes it
is possible to end up with the situation where we have stale data in the
page cache while the new data is already written to disk. This is
permanent until the affected pages are flushed away. Despite the fact
that mixing buffered and direct IO is ill-advised, it does pose a threat
to data integrity, is unexpected and should be fixed.

Fix this by deferring completion of asynchronous direct writes to a
process context in the case that there are mapped pages to be found in
the inode. Later, before the completion in dio_complete(), invalidate
the pages in question. This ensures that after the completion the pages
in the written area are either unmapped, or populated with up-to-date
data. Also do the same for the iomap case which uses
iomap_dio_complete() instead.

This has a side effect of deferring the completion to a process context
for every AIO DIO that happens on an inode that has pages mapped. However,
since the consensus is that this is an ill-advised practice, the performance
implication should not be a problem.

This was based on a proposal from Jeff Moyer, thanks!

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
---
v2: Remove leftover ret variable from invalidate call in iomap_dio_complete
v3: Do not invalidate in case of error. Add some comments
v4: Remove unnecessary variable, remove unnecessary inner braces
v5: Style changes
v6: Remove redundant invalidation, add warning and comment

 fs/direct-io.c | 49 +++++++++++++++++++++++++++++++++++++++++++------
 fs/iomap.c     | 29 ++++++++++++++++-------------
 2 files changed, 59 insertions(+), 19 deletions(-)

diff --git a/fs/direct-io.c b/fs/direct-io.c
index 08cf278..ffb9e19 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -229,6 +229,7 @@ static ssize_t dio_complete(struct dio *dio, ssize_t ret, bool is_async)
 {
 	loff_t offset = dio->iocb->ki_pos;
 	ssize_t transferred = 0;
+	int err;
 
 	/*
 	 * AIO submission can race with bio completion to get here while
@@ -258,8 +259,22 @@ static ssize_t dio_complete(struct dio *dio, ssize_t ret, bool is_async)
 	if (ret == 0)
 		ret = transferred;
 
+	/*
+	 * Try again to invalidate clean pages which might have been cached by
+	 * non-direct readahead, or faulted in by get_user_pages() if the source
+	 * of the write was an mmap'ed region of the file we're writing.  Either
+	 * one is a pretty crazy thing to do, so we don't support it 100%.  If
+	 * this invalidation fails, tough, the write still worked...
+	 */
+	if (ret > 0 && dio->op == REQ_OP_WRITE &&
+	    dio->inode->i_mapping->nrpages) {
+		err = invalidate_inode_pages2_range(dio->inode->i_mapping,
+					offset >> PAGE_SHIFT,
+					(offset + ret - 1) >> PAGE_SHIFT);
+		WARN_ON_ONCE(err);
+	}
+
 	if (dio->end_io) {
-		int err;
 
 		// XXX: ki_pos??
 		err = dio->end_io(dio->iocb, offset, ret, dio->private);
@@ -304,6 +319,7 @@ static void dio_bio_end_aio(struct bio *bio)
 	struct dio *dio = bio->bi_private;
 	unsigned long remaining;
 	unsigned long flags;
+	bool defer_completion = false;
 
 	/* cleanup the bio */
 	dio_bio_complete(dio, bio);
@@ -315,7 +331,19 @@ static void dio_bio_end_aio(struct bio *bio)
 	spin_unlock_irqrestore(&dio->bio_lock, flags);
 
 	if (remaining == 0) {
-		if (dio->result && dio->defer_completion) {
+		/*
+		 * Defer completion when defer_completion is set or
+		 * when the inode has pages mapped and this is AIO write.
+		 * We need to invalidate those pages because there is a
+		 * chance they contain stale data in the case buffered IO
+		 * went in between AIO submission and completion into the
+		 * same region.
+		 */
+		if (dio->result)
+			defer_completion = dio->defer_completion ||
+					   (dio->op == REQ_OP_WRITE &&
+					    dio->inode->i_mapping->nrpages);
+		if (defer_completion) {
 			INIT_WORK(&dio->complete_work, dio_aio_complete_work);
 			queue_work(dio->inode->i_sb->s_dio_done_wq,
 				   &dio->complete_work);
@@ -1210,10 +1238,19 @@ do_blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
 	 * For AIO O_(D)SYNC writes we need to defer completions to a workqueue
 	 * so that we can call ->fsync.
 	 */
-	if (dio->is_async && iov_iter_rw(iter) == WRITE &&
-	    ((iocb->ki_filp->f_flags & O_DSYNC) ||
-	     IS_SYNC(iocb->ki_filp->f_mapping->host))) {
-		retval = dio_set_defer_completion(dio);
+	if (dio->is_async && iov_iter_rw(iter) == WRITE) {
+		retval = 0;
+		if ((iocb->ki_filp->f_flags & O_DSYNC) ||
+		    IS_SYNC(iocb->ki_filp->f_mapping->host))
+			retval = dio_set_defer_completion(dio);
+		else if (!dio->inode->i_sb->s_dio_done_wq) {
+			/*
+			 * In case of AIO write racing with buffered read we
+			 * need to defer completion. We can't decide this now,
+			 * however the workqueue needs to be initialized here.
+			 */
+			retval = sb_init_dio_done_wq(dio->inode->i_sb);
+		}
 		if (retval) {
 			/*
 			 * We grab i_mutex only for reads so we don't have
diff --git a/fs/iomap.c b/fs/iomap.c
index 0392661..c3e299a 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -713,8 +713,24 @@ struct iomap_dio {
 static ssize_t iomap_dio_complete(struct iomap_dio *dio)
 {
 	struct kiocb *iocb = dio->iocb;
+	struct inode *inode = file_inode(iocb->ki_filp);
 	ssize_t ret;
 
+	/*
+	 * Try again to invalidate clean pages which might have been cached by
+	 * non-direct readahead, or faulted in by get_user_pages() if the source
+	 * of the write was an mmap'ed region of the file we're writing.  Either
+	 * one is a pretty crazy thing to do, so we don't support it 100%.  If
+	 * this invalidation fails, tough, the write still worked...
+	 */
+	if (!dio->error &&
+	    (dio->flags & IOMAP_DIO_WRITE) && inode->i_mapping->nrpages) {
+		ret = invalidate_inode_pages2_range(inode->i_mapping,
+				iocb->ki_pos >> PAGE_SHIFT,
+				(iocb->ki_pos + dio->size - 1) >> PAGE_SHIFT);
+		WARN_ON_ONCE(ret);
+	}
+
 	if (dio->end_io) {
 		ret = dio->end_io(iocb,
 				dio->error ? dio->error : dio->size,
@@ -1042,19 +1058,6 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 
 	ret = iomap_dio_complete(dio);
 
-	/*
-	 * Try again to invalidate clean pages which might have been cached by
-	 * non-direct readahead, or faulted in by get_user_pages() if the source
-	 * of the write was an mmap'ed region of the file we're writing.  Either
-	 * one is a pretty crazy thing to do, so we don't support it 100%.  If
-	 * this invalidation fails, tough, the write still worked...
-	 */
-	if (iov_iter_rw(iter) == WRITE) {
-		int err = invalidate_inode_pages2_range(mapping,
-				start >> PAGE_SHIFT, end >> PAGE_SHIFT);
-		WARN_ON_ONCE(err);
-	}
-
 	return ret;
 
 out_free_dio:
-- 
2.7.5

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH v6] fs: Fix page cache inconsistency when mixing buffered and AIO DIO
  2017-08-10 12:59       ` [PATCH v6] " Lukas Czerner
@ 2017-08-10 13:56         ` Jan Kara
  2017-08-10 14:22           ` Jeff Moyer
  2017-08-15 13:28         ` [PATCH v7] " Lukas Czerner
  1 sibling, 1 reply; 40+ messages in thread
From: Jan Kara @ 2017-08-10 13:56 UTC (permalink / raw)
  To: Lukas Czerner; +Cc: linux-fsdevel, viro, jack, jmoyer, david

On Thu 10-08-17 14:59:57, Lukas Czerner wrote:
> Currently when mixing buffered reads and asynchronous direct writes it
> is possible to end up with the situation where we have stale data in the
> page cache while the new data is already written to disk. This is
> permanent until the affected pages are flushed away. Despite the fact
> that mixing buffered and direct IO is ill-advised, it does pose a threat
> to data integrity, is unexpected and should be fixed.
> 
> Fix this by deferring completion of asynchronous direct writes to a
> process context in the case that there are mapped pages to be found in
> the inode. Later, before the completion in dio_complete(), invalidate
> the pages in question. This ensures that after the completion the pages
> in the written area are either unmapped, or populated with up-to-date
> data. Also do the same for the iomap case which uses
> iomap_dio_complete() instead.
> 
> This has a side effect of deferring the completion to a process context
> for every AIO DIO that happens on an inode that has pages mapped. However,
> since the consensus is that this is an ill-advised practice, the performance
> implication should not be a problem.
> 
> This was based on a proposal from Jeff Moyer, thanks!

It seems the invalidation can also be removed from
generic_file_direct_write(), can't it? It is duplicated there the same way as
it was in the iomap code...

								Honza

> 
> Signed-off-by: Lukas Czerner <lczerner@redhat.com>
> Cc: Jeff Moyer <jmoyer@redhat.com>
> ---
> v2: Remove leftover ret variable from invalidate call in iomap_dio_complete
> v3: Do not invalidate in case of error. Add some comments
> v4: Remove unnecessary variable, remove unnecessary inner braces
> v5: Style changes
> v6: Remove redundant invalidation, add warning and comment
> 
>  fs/direct-io.c | 49 +++++++++++++++++++++++++++++++++++++++++++------
>  fs/iomap.c     | 29 ++++++++++++++++-------------
>  2 files changed, 59 insertions(+), 19 deletions(-)
> 
> diff --git a/fs/direct-io.c b/fs/direct-io.c
> index 08cf278..ffb9e19 100644
> --- a/fs/direct-io.c
> +++ b/fs/direct-io.c
> @@ -229,6 +229,7 @@ static ssize_t dio_complete(struct dio *dio, ssize_t ret, bool is_async)
>  {
>  	loff_t offset = dio->iocb->ki_pos;
>  	ssize_t transferred = 0;
> +	int err;
>  
>  	/*
>  	 * AIO submission can race with bio completion to get here while
> @@ -258,8 +259,22 @@ static ssize_t dio_complete(struct dio *dio, ssize_t ret, bool is_async)
>  	if (ret == 0)
>  		ret = transferred;
>  
> +	/*
> +	 * Try again to invalidate clean pages which might have been cached by
> +	 * non-direct readahead, or faulted in by get_user_pages() if the source
> +	 * of the write was an mmap'ed region of the file we're writing.  Either
> +	 * one is a pretty crazy thing to do, so we don't support it 100%.  If
> +	 * this invalidation fails, tough, the write still worked...
> +	 */
> +	if (ret > 0 && dio->op == REQ_OP_WRITE &&
> +	    dio->inode->i_mapping->nrpages) {
> +		err = invalidate_inode_pages2_range(dio->inode->i_mapping,
> +					offset >> PAGE_SHIFT,
> +					(offset + ret - 1) >> PAGE_SHIFT);
> +		WARN_ON_ONCE(err);
> +	}
> +
>  	if (dio->end_io) {
> -		int err;
>  
>  		// XXX: ki_pos??
>  		err = dio->end_io(dio->iocb, offset, ret, dio->private);
> @@ -304,6 +319,7 @@ static void dio_bio_end_aio(struct bio *bio)
>  	struct dio *dio = bio->bi_private;
>  	unsigned long remaining;
>  	unsigned long flags;
> +	bool defer_completion = false;
>  
>  	/* cleanup the bio */
>  	dio_bio_complete(dio, bio);
> @@ -315,7 +331,19 @@ static void dio_bio_end_aio(struct bio *bio)
>  	spin_unlock_irqrestore(&dio->bio_lock, flags);
>  
>  	if (remaining == 0) {
> -		if (dio->result && dio->defer_completion) {
> +		/*
> +		 * Defer completion when defer_completion is set or
> +		 * when the inode has pages mapped and this is AIO write.
> +		 * We need to invalidate those pages because there is a
> +		 * chance they contain stale data in the case buffered IO
> +		 * went in between AIO submission and completion into the
> +		 * same region.
> +		 */
> +		if (dio->result)
> +			defer_completion = dio->defer_completion ||
> +					   (dio->op == REQ_OP_WRITE &&
> +					    dio->inode->i_mapping->nrpages);
> +		if (defer_completion) {
>  			INIT_WORK(&dio->complete_work, dio_aio_complete_work);
>  			queue_work(dio->inode->i_sb->s_dio_done_wq,
>  				   &dio->complete_work);
> @@ -1210,10 +1238,19 @@ do_blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
>  	 * For AIO O_(D)SYNC writes we need to defer completions to a workqueue
>  	 * so that we can call ->fsync.
>  	 */
> -	if (dio->is_async && iov_iter_rw(iter) == WRITE &&
> -	    ((iocb->ki_filp->f_flags & O_DSYNC) ||
> -	     IS_SYNC(iocb->ki_filp->f_mapping->host))) {
> -		retval = dio_set_defer_completion(dio);
> +	if (dio->is_async && iov_iter_rw(iter) == WRITE) {
> +		retval = 0;
> +		if ((iocb->ki_filp->f_flags & O_DSYNC) ||
> +		    IS_SYNC(iocb->ki_filp->f_mapping->host))
> +			retval = dio_set_defer_completion(dio);
> +		else if (!dio->inode->i_sb->s_dio_done_wq) {
> +			/*
> +			 * In case of AIO write racing with buffered read we
> +			 * need to defer completion. We can't decide this now,
> +			 * however the workqueue needs to be initialized here.
> +			 */
> +			retval = sb_init_dio_done_wq(dio->inode->i_sb);
> +		}
>  		if (retval) {
>  			/*
>  			 * We grab i_mutex only for reads so we don't have
> diff --git a/fs/iomap.c b/fs/iomap.c
> index 0392661..c3e299a 100644
> --- a/fs/iomap.c
> +++ b/fs/iomap.c
> @@ -713,8 +713,24 @@ struct iomap_dio {
>  static ssize_t iomap_dio_complete(struct iomap_dio *dio)
>  {
>  	struct kiocb *iocb = dio->iocb;
> +	struct inode *inode = file_inode(iocb->ki_filp);
>  	ssize_t ret;
>  
> +	/*
> +	 * Try again to invalidate clean pages which might have been cached by
> +	 * non-direct readahead, or faulted in by get_user_pages() if the source
> +	 * of the write was an mmap'ed region of the file we're writing.  Either
> +	 * one is a pretty crazy thing to do, so we don't support it 100%.  If
> +	 * this invalidation fails, tough, the write still worked...
> +	 */
> +	if (!dio->error &&
> +	    (dio->flags & IOMAP_DIO_WRITE) && inode->i_mapping->nrpages) {
> +		ret = invalidate_inode_pages2_range(inode->i_mapping,
> +				iocb->ki_pos >> PAGE_SHIFT,
> +				(iocb->ki_pos + dio->size - 1) >> PAGE_SHIFT);
> +		WARN_ON_ONCE(ret);
> +	}
> +
>  	if (dio->end_io) {
>  		ret = dio->end_io(iocb,
>  				dio->error ? dio->error : dio->size,
> @@ -1042,19 +1058,6 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>  
>  	ret = iomap_dio_complete(dio);
>  
> -	/*
> -	 * Try again to invalidate clean pages which might have been cached by
> -	 * non-direct readahead, or faulted in by get_user_pages() if the source
> -	 * of the write was an mmap'ed region of the file we're writing.  Either
> -	 * one is a pretty crazy thing to do, so we don't support it 100%.  If
> -	 * this invalidation fails, tough, the write still worked...
> -	 */
> -	if (iov_iter_rw(iter) == WRITE) {
> -		int err = invalidate_inode_pages2_range(mapping,
> -				start >> PAGE_SHIFT, end >> PAGE_SHIFT);
> -		WARN_ON_ONCE(err);
> -	}
> -
>  	return ret;
>  
>  out_free_dio:
> -- 
> 2.7.5
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v6] fs: Fix page cache inconsistency when mixing buffered and AIO DIO
  2017-08-10 13:56         ` Jan Kara
@ 2017-08-10 14:22           ` Jeff Moyer
  2017-08-11  9:03             ` Lukas Czerner
  0 siblings, 1 reply; 40+ messages in thread
From: Jeff Moyer @ 2017-08-10 14:22 UTC (permalink / raw)
  To: Jan Kara; +Cc: Lukas Czerner, linux-fsdevel, viro, david

Jan Kara <jack@suse.cz> writes:

> On Thu 10-08-17 14:59:57, Lukas Czerner wrote:
>> Currently when mixing buffered reads and asynchronous direct writes it
>> is possible to end up with the situation where we have stale data in the
>> page cache while the new data is already written to disk. This is
>> permanent until the affected pages are flushed away. Despite the fact
>> that mixing buffered and direct IO is ill-advised, it does pose a threat
>> to data integrity, is unexpected and should be fixed.
>> 
>> Fix this by deferring completion of asynchronous direct writes to a
>> process context in the case that there are mapped pages to be found in
>> the inode. Later, before the completion in dio_complete(), invalidate
>> the pages in question. This ensures that after the completion the pages
>> in the written area are either unmapped, or populated with up-to-date
>> data. Also do the same for the iomap case which uses
>> iomap_dio_complete() instead.
>> 
>> This has a side effect of deferring the completion to a process context
>> for every AIO DIO that happens on an inode that has pages mapped. However,
>> since the consensus is that this is an ill-advised practice, the performance
>> implication should not be a problem.
>> 
>> This was based on a proposal from Jeff Moyer, thanks!
>
> It seems the invalidation can also be removed from
> generic_file_direct_write(), can't it? It is duplicated there the same way as
> it was in the iomap code...

Yep, sure looks that way.

-Jeff

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v6] fs: Fix page cache inconsistency when mixing buffered and AIO DIO
  2017-08-10 14:22           ` Jeff Moyer
@ 2017-08-11  9:03             ` Lukas Czerner
  2017-08-14  9:43               ` Jan Kara
  0 siblings, 1 reply; 40+ messages in thread
From: Lukas Czerner @ 2017-08-11  9:03 UTC (permalink / raw)
  To: Jeff Moyer; +Cc: Jan Kara, linux-fsdevel, viro, david

On Thu, Aug 10, 2017 at 10:22:47AM -0400, Jeff Moyer wrote:
> Jan Kara <jack@suse.cz> writes:
> 
> > On Thu 10-08-17 14:59:57, Lukas Czerner wrote:
> >> Currently when mixing buffered reads and asynchronous direct writes it
> >> is possible to end up with the situation where we have stale data in the
> >> page cache while the new data is already written to disk. This is
> >> permanent until the affected pages are flushed away. Despite the fact
> >> that mixing buffered and direct IO is ill-advised, it does pose a threat
> >> to data integrity, is unexpected and should be fixed.
> >> 
> >> Fix this by deferring completion of asynchronous direct writes to a
> >> process context in the case that there are mapped pages to be found in
> >> the inode. Later, before the completion in dio_complete(), invalidate
> >> the pages in question. This ensures that after the completion the pages
> >> in the written area are either unmapped, or populated with up-to-date
> >> data. Also do the same for the iomap case which uses
> >> iomap_dio_complete() instead.
> >> 
> >> This has a side effect of deferring the completion to a process context
> >> for every AIO DIO that happens on an inode that has pages mapped. However,
> >> since the consensus is that this is an ill-advised practice, the performance
> >> implication should not be a problem.
> >> 
> >> This was based on a proposal from Jeff Moyer, thanks!
> >
> > It seems the invalidation can also be removed from
> > generic_file_direct_write(), can't it? It is duplicated there the same way as
> > it was in the iomap code...
> 
> Yep, sure looks that way.

Hrm, ok. Technically speaking, generic_file_direct_write() does not have to
eventually end up with dio_complete() being called. This will change the
behaviour for those that implement dio differently. Looking at the users
now, the vast majority will end up in dio_complete(), so maybe this is not
a problem.

This is in contrast with iomap_dio_rw(), which will end up calling
iomap_dio_complete(), so the situation is different there.

Maybe adding a mapping->nrpages check would be better than outright
removing it?
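
Something like this in generic_file_direct_write(), perhaps (a sketch, not
tested):

	/*
	 * Filesystems that implement dio differently may never reach
	 * dio_complete(), so keep the invalidation for them, but only
	 * when there actually are cached pages to shoot down.
	 */
	if (mapping->nrpages) {
		invalidate_inode_pages2_range(mapping,
					pos >> PAGE_SHIFT, end);
	}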

-Lukas

> 
> -Jeff

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v6] fs: Fix page cache inconsistency when mixing buffered and AIO DIO
  2017-08-11  9:03             ` Lukas Czerner
@ 2017-08-14  9:43               ` Jan Kara
  2017-08-15 12:47                 ` Lukas Czerner
  0 siblings, 1 reply; 40+ messages in thread
From: Jan Kara @ 2017-08-14  9:43 UTC (permalink / raw)
  To: Lukas Czerner; +Cc: Jeff Moyer, Jan Kara, linux-fsdevel, viro, david

Hello,

On Fri 11-08-17 11:03:01, Lukas Czerner wrote:
> On Thu, Aug 10, 2017 at 10:22:47AM -0400, Jeff Moyer wrote:
> > Jan Kara <jack@suse.cz> writes:
> > 
> > > On Thu 10-08-17 14:59:57, Lukas Czerner wrote:
> > >> Currently when mixing buffered reads and asynchronous direct writes it
> > >> is possible to end up with the situation where we have stale data in the
> > >> page cache while the new data is already written to disk. This is
> > >> permanent until the affected pages are flushed away. Despite the fact
> > >> that mixing buffered and direct IO is ill-advised, it does pose a threat
> > >> to data integrity, is unexpected and should be fixed.
> > >> 
> > >> Fix this by deferring completion of asynchronous direct writes to a
> > >> process context in the case that there are mapped pages to be found in
> > >> the inode. Later, before the completion in dio_complete(), invalidate
> > >> the pages in question. This ensures that after the completion the pages
> > >> in the written area are either unmapped, or populated with up-to-date
> > >> data. Also do the same for the iomap case which uses
> > >> iomap_dio_complete() instead.
> > >> 
> > >> This has a side effect of deferring the completion to a process context
> > >> for every AIO DIO that happens on an inode that has pages mapped. However,
> > >> since the consensus is that this is an ill-advised practice, the performance
> > >> implication should not be a problem.
> > >> 
> > >> This was based on a proposal from Jeff Moyer, thanks!
> > >
> > > It seems the invalidation can also be removed from
> > > generic_file_direct_write(), can't it? It is duplicated there the same way as
> > > it was in the iomap code...
> > 
> > Yep, sure looks that way.
> 
> Hrm, ok. Technically speaking, generic_file_direct_write() does not have to
> eventually end up with dio_complete() being called. This will change the
> behaviour for those that implement dio differently. Looking at the users
> now, the vast majority will end up in dio_complete(), so maybe this is not
> a problem.

OK, so this seems to be a problem for 9p, fuse, nfs and lustre.

> This is in contrast with iomap_dio_rw(), which will end up calling
> iomap_dio_complete(), so the situation is different there.
> 
> Maybe adding a mapping->nrpages check would be better than outright
> removing it?

OK, I agree we cannot just remove the invalidation. But shouldn't we rather
fix the above-mentioned filesystems? Otherwise they will keep having the
issues you are trying to fix. But for now I could live with keeping the
invalidation behind the nrpages check and adding a comment explaining why
we kept it there...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v6] fs: Fix page cache inconsistency when mixing buffered and AIO DIO
  2017-08-14  9:43               ` Jan Kara
@ 2017-08-15 12:47                 ` Lukas Czerner
  0 siblings, 0 replies; 40+ messages in thread
From: Lukas Czerner @ 2017-08-15 12:47 UTC (permalink / raw)
  To: Jan Kara; +Cc: Jeff Moyer, linux-fsdevel, viro, david

On Mon, Aug 14, 2017 at 11:43:31AM +0200, Jan Kara wrote:
> Hello,
> 
> On Fri 11-08-17 11:03:01, Lukas Czerner wrote:
> > On Thu, Aug 10, 2017 at 10:22:47AM -0400, Jeff Moyer wrote:
> > > Jan Kara <jack@suse.cz> writes:
> > > 
> > > > On Thu 10-08-17 14:59:57, Lukas Czerner wrote:
> > > >> Currently when mixing buffered reads and asynchronous direct writes it
> > > >> is possible to end up with the situation where we have stale data in the
> > > >> page cache while the new data is already written to disk. This is
> > > >> permanent until the affected pages are flushed away. Despite the fact
> > > >> that mixing buffered and direct IO is ill-advised, it does pose a threat
> > > >> to data integrity, is unexpected and should be fixed.
> > > >> 
> > > >> Fix this by deferring completion of asynchronous direct writes to a
> > > >> process context in the case that there are mapped pages to be found in
> > > >> the inode. Later, before the completion in dio_complete(), invalidate
> > > >> the pages in question. This ensures that after the completion the pages
> > > >> in the written area are either unmapped, or populated with up-to-date
> > > >> data. Also do the same for the iomap case which uses
> > > >> iomap_dio_complete() instead.
> > > >> 
> > > >> This has a side effect of deferring the completion to a process context
> > > >> for every AIO DIO that happens on an inode that has pages mapped. However,
> > > >> since the consensus is that this is an ill-advised practice, the performance
> > > >> implication should not be a problem.
> > > >> 
> > > >> This was based on a proposal from Jeff Moyer, thanks!
> > > >
> > > > It seems the invalidation can also be removed from
> > > > generic_file_direct_write(), can't it? It is duplicated there the same way as
> > > > it was in the iomap code...
> > > 
> > > Yep, sure looks that way.
> > 
> > Hrm, ok. Technically speaking, generic_file_direct_write() does not have to
> > eventually end up with dio_complete() being called. This will change the
> > behaviour for those that implement dio differently. Looking at the users
> > now, the vast majority will end up in dio_complete(), so maybe this is not
> > a problem.
> 
> OK, so this seems to be a problem for 9p, fuse, nfs and lustre.
> 
> > This is in contrast with iomap_dio_rw(), which will end up calling
> > iomap_dio_complete(), so the situation is different there.
> > 
> > Maybe adding a mapping->nrpages check would be better than outright
> > removing it?
> 
> OK, I agree we cannot just remove the invalidation. But shouldn't we rather
> fix the above-mentioned filesystems? Otherwise they will keep having the
> issues you are trying to fix. But for now I could live with keeping the
> invalidation behind the nrpages check and adding a comment on why we kept it
> there...

Right, I'd rather have closure on this never-ending patch. The rest of
the filesystems can be fixed later.

-Lukas

> 
> 								Honza
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH v7] fs: Fix page cache inconsistency when mixing buffered and AIO DIO
  2017-08-10 12:59       ` [PATCH v6] " Lukas Czerner
  2017-08-10 13:56         ` Jan Kara
@ 2017-08-15 13:28         ` Lukas Czerner
  2017-08-16 13:15           ` Jan Kara
                             ` (4 more replies)
  1 sibling, 5 replies; 40+ messages in thread
From: Lukas Czerner @ 2017-08-15 13:28 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: viro, jack, jmoyer, david, Lukas Czerner

Currently, when mixing buffered reads and asynchronous direct writes it
is possible to end up in a situation where we have stale data in the
page cache while the new data is already written to disk. This is
permanent until the affected pages are flushed away. Despite the fact
that mixing buffered and direct IO is ill-advised, it does pose a threat
to data integrity, is unexpected and should be fixed.

Fix this by deferring completion of asynchronous direct writes to a
process context in the case that there are mapped pages to be found in
the inode. Later, before the completion in dio_complete(), invalidate
the pages in question. This ensures that after the completion the pages
in the written area are either unmapped, or populated with up-to-date
data. Also do the same for the iomap case, which uses
iomap_dio_complete() instead.

This has a side effect of deferring the completion to a process context
for every AIO DIO that happens on an inode that has pages mapped. However,
since the consensus is that this is an ill-advised practice, the
performance implication should not be a problem.

This was based on a proposal from Jeff Moyer, thanks!

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
---
v2: Remove leftover ret variable from invalidate call in iomap_dio_complete
v3: Do not invalidate in case of error. Add some comments
v4: Remove unnecessary variable, remove unnecessary inner braces
v5: Style changes
v6: Remove redundant invalidatepage, add warning and comment
v7: Run invalidation conditionally from generic_file_direct_write()

 fs/direct-io.c | 49 +++++++++++++++++++++++++++++++++++++++++++------
 fs/iomap.c     | 29 ++++++++++++++++-------------
 mm/filemap.c   | 10 ++++++++--
 3 files changed, 67 insertions(+), 21 deletions(-)

diff --git a/fs/direct-io.c b/fs/direct-io.c
index 08cf278..ffb9e19 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -229,6 +229,7 @@ static ssize_t dio_complete(struct dio *dio, ssize_t ret, bool is_async)
 {
 	loff_t offset = dio->iocb->ki_pos;
 	ssize_t transferred = 0;
+	int err;
 
 	/*
 	 * AIO submission can race with bio completion to get here while
@@ -258,8 +259,22 @@ static ssize_t dio_complete(struct dio *dio, ssize_t ret, bool is_async)
 	if (ret == 0)
 		ret = transferred;
 
+	/*
+	 * Try again to invalidate clean pages which might have been cached by
+	 * non-direct readahead, or faulted in by get_user_pages() if the source
+	 * of the write was an mmap'ed region of the file we're writing.  Either
+	 * one is a pretty crazy thing to do, so we don't support it 100%.  If
+	 * this invalidation fails, tough, the write still worked...
+	 */
+	if (ret > 0 && dio->op == REQ_OP_WRITE &&
+	    dio->inode->i_mapping->nrpages) {
+		err = invalidate_inode_pages2_range(dio->inode->i_mapping,
+					offset >> PAGE_SHIFT,
+					(offset + ret - 1) >> PAGE_SHIFT);
+		WARN_ON_ONCE(err);
+	}
+
 	if (dio->end_io) {
-		int err;
 
 		// XXX: ki_pos??
 		err = dio->end_io(dio->iocb, offset, ret, dio->private);
@@ -304,6 +319,7 @@ static void dio_bio_end_aio(struct bio *bio)
 	struct dio *dio = bio->bi_private;
 	unsigned long remaining;
 	unsigned long flags;
+	bool defer_completion = false;
 
 	/* cleanup the bio */
 	dio_bio_complete(dio, bio);
@@ -315,7 +331,19 @@ static void dio_bio_end_aio(struct bio *bio)
 	spin_unlock_irqrestore(&dio->bio_lock, flags);
 
 	if (remaining == 0) {
-		if (dio->result && dio->defer_completion) {
+		/*
+		 * Defer completion when defer_completion is set or
+		 * when the inode has pages mapped and this is AIO write.
+		 * We need to invalidate those pages because there is a
+		 * chance they contain stale data in the case buffered IO
+		 * went in between AIO submission and completion into the
+		 * same region.
+		 */
+		if (dio->result)
+			defer_completion = dio->defer_completion ||
+					   (dio->op == REQ_OP_WRITE &&
+					    dio->inode->i_mapping->nrpages);
+		if (defer_completion) {
 			INIT_WORK(&dio->complete_work, dio_aio_complete_work);
 			queue_work(dio->inode->i_sb->s_dio_done_wq,
 				   &dio->complete_work);
@@ -1210,10 +1238,19 @@ do_blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
 	 * For AIO O_(D)SYNC writes we need to defer completions to a workqueue
 	 * so that we can call ->fsync.
 	 */
-	if (dio->is_async && iov_iter_rw(iter) == WRITE &&
-	    ((iocb->ki_filp->f_flags & O_DSYNC) ||
-	     IS_SYNC(iocb->ki_filp->f_mapping->host))) {
-		retval = dio_set_defer_completion(dio);
+	if (dio->is_async && iov_iter_rw(iter) == WRITE) {
+		retval = 0;
+		if ((iocb->ki_filp->f_flags & O_DSYNC) ||
+		    IS_SYNC(iocb->ki_filp->f_mapping->host))
+			retval = dio_set_defer_completion(dio);
+		else if (!dio->inode->i_sb->s_dio_done_wq) {
+			/*
+			 * In case of AIO write racing with buffered read we
+			 * need to defer completion. We can't decide this now,
+			 * however the workqueue needs to be initialized here.
+			 */
+			retval = sb_init_dio_done_wq(dio->inode->i_sb);
+		}
 		if (retval) {
 			/*
 			 * We grab i_mutex only for reads so we don't have
diff --git a/fs/iomap.c b/fs/iomap.c
index 0392661..c3e299a 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -713,8 +713,24 @@ struct iomap_dio {
 static ssize_t iomap_dio_complete(struct iomap_dio *dio)
 {
 	struct kiocb *iocb = dio->iocb;
+	struct inode *inode = file_inode(iocb->ki_filp);
 	ssize_t ret;
 
+	/*
+	 * Try again to invalidate clean pages which might have been cached by
+	 * non-direct readahead, or faulted in by get_user_pages() if the source
+	 * of the write was an mmap'ed region of the file we're writing.  Either
+	 * one is a pretty crazy thing to do, so we don't support it 100%.  If
+	 * this invalidation fails, tough, the write still worked...
+	 */
+	if (!dio->error &&
+	    (dio->flags & IOMAP_DIO_WRITE) && inode->i_mapping->nrpages) {
+		ret = invalidate_inode_pages2_range(inode->i_mapping,
+				iocb->ki_pos >> PAGE_SHIFT,
+				(iocb->ki_pos + dio->size - 1) >> PAGE_SHIFT);
+		WARN_ON_ONCE(ret);
+	}
+
 	if (dio->end_io) {
 		ret = dio->end_io(iocb,
 				dio->error ? dio->error : dio->size,
@@ -1042,19 +1058,6 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 
 	ret = iomap_dio_complete(dio);
 
-	/*
-	 * Try again to invalidate clean pages which might have been cached by
-	 * non-direct readahead, or faulted in by get_user_pages() if the source
-	 * of the write was an mmap'ed region of the file we're writing.  Either
-	 * one is a pretty crazy thing to do, so we don't support it 100%.  If
-	 * this invalidation fails, tough, the write still worked...
-	 */
-	if (iov_iter_rw(iter) == WRITE) {
-		int err = invalidate_inode_pages2_range(mapping,
-				start >> PAGE_SHIFT, end >> PAGE_SHIFT);
-		WARN_ON_ONCE(err);
-	}
-
 	return ret;
 
 out_free_dio:
diff --git a/mm/filemap.c b/mm/filemap.c
index a497024..9440e02 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2885,9 +2885,15 @@ generic_file_direct_write(struct kiocb *iocb, struct iov_iter *from)
 	 * we're writing.  Either one is a pretty crazy thing to do,
 	 * so we don't support it 100%.  If this invalidation
 	 * fails, tough, the write still worked...
+	 *
+	 * Most of the time we do not need this since dio_complete() will do
+	 * the invalidation for us. However there are some file systems that
+	 * do not end up with dio_complete() being called, so let's not break
+	 * them by removing it completely
 	 */
-	invalidate_inode_pages2_range(mapping,
-				pos >> PAGE_SHIFT, end);
+	if (mapping->nrpages)
+		invalidate_inode_pages2_range(mapping,
+					pos >> PAGE_SHIFT, end);
 
 	if (written > 0) {
 		pos += written;
-- 
2.7.5

^ permalink raw reply related	[flat|nested] 40+ messages in thread
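
A small aside on the second case the new comment in dio_complete() mentions,
namely the write being sourced from an mmap'ed region of the very file being
written. A minimal sketch of what that looks like from userspace (the file
name is made up; the file is assumed to exist and be at least two pages
long):

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	int fd = open("testfile", O_RDWR | O_DIRECT);
	/* Map page 0 of the file we are about to write to. */
	char *src = mmap(NULL, 4096, PROT_READ, MAP_SHARED, fd, 0);

	/*
	 * O_DIRECT write sourced from the mapping: get_user_pages() faults
	 * page 0 into the page cache in order to pin it, so even a "direct"
	 * write can leave pages behind for the completion-time invalidation
	 * to clean up.
	 */
	pwrite(fd, src, 4096, 4096);

	munmap(src, 4096);
	close(fd);
	return 0;
}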

* Re: [PATCH v7] fs: Fix page cache inconsistency when mixing buffered and AIO DIO
  2017-08-15 13:28         ` [PATCH v7] " Lukas Czerner
@ 2017-08-16 13:15           ` Jan Kara
  2017-08-16 16:01           ` Darrick J. Wong
                             ` (3 subsequent siblings)
  4 siblings, 0 replies; 40+ messages in thread
From: Jan Kara @ 2017-08-16 13:15 UTC (permalink / raw)
  To: Lukas Czerner; +Cc: linux-fsdevel, viro, jack, jmoyer, david

On Tue 15-08-17 15:28:54, Lukas Czerner wrote:
> Currently, when mixing buffered reads and asynchronous direct writes it
> is possible to end up in a situation where we have stale data in the
> page cache while the new data is already written to disk. This is
> permanent until the affected pages are flushed away. Despite the fact
> that mixing buffered and direct IO is ill-advised, it does pose a threat
> to data integrity, is unexpected and should be fixed.
> 
> Fix this by deferring completion of asynchronous direct writes to a
> process context in the case that there are mapped pages to be found in
> the inode. Later, before the completion in dio_complete(), invalidate
> the pages in question. This ensures that after the completion the pages
> in the written area are either unmapped, or populated with up-to-date
> data. Also do the same for the iomap case, which uses
> iomap_dio_complete() instead.
> 
> This has a side effect of deferring the completion to a process context
> for every AIO DIO that happens on an inode that has pages mapped. However,
> since the consensus is that this is an ill-advised practice, the
> performance implication should not be a problem.
> 
> This was based on a proposal from Jeff Moyer, thanks!
> 
> Signed-off-by: Lukas Czerner <lczerner@redhat.com>
> Cc: Jeff Moyer <jmoyer@redhat.com>

Looks good to me. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
> v2: Remove leftover ret variable from invalidate call in iomap_dio_complete
> v3: Do not invalidate in case of error. Add some comments
> v4: Remove unnecessary variable, remove unnecessary inner braces
> v5: Style changes
> v6: Remove redundant invalidatepage, add warning and comment
> v7: Run invalidation conditionally from generic_file_direct_write()
> 
>  fs/direct-io.c | 49 +++++++++++++++++++++++++++++++++++++++++++------
>  fs/iomap.c     | 29 ++++++++++++++++-------------
>  mm/filemap.c   | 10 ++++++++--
>  3 files changed, 67 insertions(+), 21 deletions(-)
> 
> diff --git a/fs/direct-io.c b/fs/direct-io.c
> index 08cf278..ffb9e19 100644
> --- a/fs/direct-io.c
> +++ b/fs/direct-io.c
> @@ -229,6 +229,7 @@ static ssize_t dio_complete(struct dio *dio, ssize_t ret, bool is_async)
>  {
>  	loff_t offset = dio->iocb->ki_pos;
>  	ssize_t transferred = 0;
> +	int err;
>  
>  	/*
>  	 * AIO submission can race with bio completion to get here while
> @@ -258,8 +259,22 @@ static ssize_t dio_complete(struct dio *dio, ssize_t ret, bool is_async)
>  	if (ret == 0)
>  		ret = transferred;
>  
> +	/*
> +	 * Try again to invalidate clean pages which might have been cached by
> +	 * non-direct readahead, or faulted in by get_user_pages() if the source
> +	 * of the write was an mmap'ed region of the file we're writing.  Either
> +	 * one is a pretty crazy thing to do, so we don't support it 100%.  If
> +	 * this invalidation fails, tough, the write still worked...
> +	 */
> +	if (ret > 0 && dio->op == REQ_OP_WRITE &&
> +	    dio->inode->i_mapping->nrpages) {
> +		err = invalidate_inode_pages2_range(dio->inode->i_mapping,
> +					offset >> PAGE_SHIFT,
> +					(offset + ret - 1) >> PAGE_SHIFT);
> +		WARN_ON_ONCE(err);
> +	}
> +
>  	if (dio->end_io) {
> -		int err;
>  
>  		// XXX: ki_pos??
>  		err = dio->end_io(dio->iocb, offset, ret, dio->private);
> @@ -304,6 +319,7 @@ static void dio_bio_end_aio(struct bio *bio)
>  	struct dio *dio = bio->bi_private;
>  	unsigned long remaining;
>  	unsigned long flags;
> +	bool defer_completion = false;
>  
>  	/* cleanup the bio */
>  	dio_bio_complete(dio, bio);
> @@ -315,7 +331,19 @@ static void dio_bio_end_aio(struct bio *bio)
>  	spin_unlock_irqrestore(&dio->bio_lock, flags);
>  
>  	if (remaining == 0) {
> -		if (dio->result && dio->defer_completion) {
> +		/*
> +		 * Defer completion when defer_completion is set or
> +		 * when the inode has pages mapped and this is AIO write.
> +		 * We need to invalidate those pages because there is a
> +		 * chance they contain stale data in the case buffered IO
> +		 * went in between AIO submission and completion into the
> +		 * same region.
> +		 */
> +		if (dio->result)
> +			defer_completion = dio->defer_completion ||
> +					   (dio->op == REQ_OP_WRITE &&
> +					    dio->inode->i_mapping->nrpages);
> +		if (defer_completion) {
>  			INIT_WORK(&dio->complete_work, dio_aio_complete_work);
>  			queue_work(dio->inode->i_sb->s_dio_done_wq,
>  				   &dio->complete_work);
> @@ -1210,10 +1238,19 @@ do_blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
>  	 * For AIO O_(D)SYNC writes we need to defer completions to a workqueue
>  	 * so that we can call ->fsync.
>  	 */
> -	if (dio->is_async && iov_iter_rw(iter) == WRITE &&
> -	    ((iocb->ki_filp->f_flags & O_DSYNC) ||
> -	     IS_SYNC(iocb->ki_filp->f_mapping->host))) {
> -		retval = dio_set_defer_completion(dio);
> +	if (dio->is_async && iov_iter_rw(iter) == WRITE) {
> +		retval = 0;
> +		if ((iocb->ki_filp->f_flags & O_DSYNC) ||
> +		    IS_SYNC(iocb->ki_filp->f_mapping->host))
> +			retval = dio_set_defer_completion(dio);
> +		else if (!dio->inode->i_sb->s_dio_done_wq) {
> +			/*
> +			 * In case of AIO write racing with buffered read we
> +			 * need to defer completion. We can't decide this now,
> +			 * however the workqueue needs to be initialized here.
> +			 */
> +			retval = sb_init_dio_done_wq(dio->inode->i_sb);
> +		}
>  		if (retval) {
>  			/*
>  			 * We grab i_mutex only for reads so we don't have
> diff --git a/fs/iomap.c b/fs/iomap.c
> index 0392661..c3e299a 100644
> --- a/fs/iomap.c
> +++ b/fs/iomap.c
> @@ -713,8 +713,24 @@ struct iomap_dio {
>  static ssize_t iomap_dio_complete(struct iomap_dio *dio)
>  {
>  	struct kiocb *iocb = dio->iocb;
> +	struct inode *inode = file_inode(iocb->ki_filp);
>  	ssize_t ret;
>  
> +	/*
> +	 * Try again to invalidate clean pages which might have been cached by
> +	 * non-direct readahead, or faulted in by get_user_pages() if the source
> +	 * of the write was an mmap'ed region of the file we're writing.  Either
> +	 * one is a pretty crazy thing to do, so we don't support it 100%.  If
> +	 * this invalidation fails, tough, the write still worked...
> +	 */
> +	if (!dio->error &&
> +	    (dio->flags & IOMAP_DIO_WRITE) && inode->i_mapping->nrpages) {
> +		ret = invalidate_inode_pages2_range(inode->i_mapping,
> +				iocb->ki_pos >> PAGE_SHIFT,
> +				(iocb->ki_pos + dio->size - 1) >> PAGE_SHIFT);
> +		WARN_ON_ONCE(ret);
> +	}
> +
>  	if (dio->end_io) {
>  		ret = dio->end_io(iocb,
>  				dio->error ? dio->error : dio->size,
> @@ -1042,19 +1058,6 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>  
>  	ret = iomap_dio_complete(dio);
>  
> -	/*
> -	 * Try again to invalidate clean pages which might have been cached by
> -	 * non-direct readahead, or faulted in by get_user_pages() if the source
> -	 * of the write was an mmap'ed region of the file we're writing.  Either
> -	 * one is a pretty crazy thing to do, so we don't support it 100%.  If
> -	 * this invalidation fails, tough, the write still worked...
> -	 */
> -	if (iov_iter_rw(iter) == WRITE) {
> -		int err = invalidate_inode_pages2_range(mapping,
> -				start >> PAGE_SHIFT, end >> PAGE_SHIFT);
> -		WARN_ON_ONCE(err);
> -	}
> -
>  	return ret;
>  
>  out_free_dio:
> diff --git a/mm/filemap.c b/mm/filemap.c
> index a497024..9440e02 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2885,9 +2885,15 @@ generic_file_direct_write(struct kiocb *iocb, struct iov_iter *from)
>  	 * we're writing.  Either one is a pretty crazy thing to do,
>  	 * so we don't support it 100%.  If this invalidation
>  	 * fails, tough, the write still worked...
> +	 *
> +	 * Most of the time we do not need this since dio_complete() will do
> +	 * the invalidation for us. However there are some file systems that
> +	 * do not end up with dio_complete() being called, so let's not break
> +	 * them by removing it completely
>  	 */
> -	invalidate_inode_pages2_range(mapping,
> -				pos >> PAGE_SHIFT, end);
> +	if (mapping->nrpages)
> +		invalidate_inode_pages2_range(mapping,
> +					pos >> PAGE_SHIFT, end);
>  
>  	if (written > 0) {
>  		pos += written;
> -- 
> 2.7.5
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v7] fs: Fix page cache inconsistency when mixing buffered and AIO DIO
  2017-08-15 13:28         ` [PATCH v7] " Lukas Czerner
  2017-08-16 13:15           ` Jan Kara
@ 2017-08-16 16:01           ` Darrick J. Wong
  2017-09-21 13:44           ` Jeff Moyer
                             ` (2 subsequent siblings)
  4 siblings, 0 replies; 40+ messages in thread
From: Darrick J. Wong @ 2017-08-16 16:01 UTC (permalink / raw)
  To: Lukas Czerner; +Cc: linux-fsdevel, viro, jack, jmoyer, david

On Tue, Aug 15, 2017 at 03:28:54PM +0200, Lukas Czerner wrote:
> Currently, when mixing buffered reads and asynchronous direct writes it
> is possible to end up in a situation where we have stale data in the
> page cache while the new data is already written to disk. This is
> permanent until the affected pages are flushed away. Despite the fact
> that mixing buffered and direct IO is ill-advised, it does pose a threat
> to data integrity, is unexpected and should be fixed.
> 
> Fix this by deferring completion of asynchronous direct writes to a
> process context in the case that there are mapped pages to be found in
> the inode. Later, before the completion in dio_complete(), invalidate
> the pages in question. This ensures that after the completion the pages
> in the written area are either unmapped, or populated with up-to-date
> data. Also do the same for the iomap case, which uses
> iomap_dio_complete() instead.
> 
> This has a side effect of deferring the completion to a process context
> for every AIO DIO that happens on an inode that has pages mapped. However,
> since the consensus is that this is an ill-advised practice, the
> performance implication should not be a problem.
> 
> This was based on a proposal from Jeff Moyer, thanks!
> 
> Signed-off-by: Lukas Czerner <lczerner@redhat.com>
> Cc: Jeff Moyer <jmoyer@redhat.com>

Looks ok to me,
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> ---
> v2: Remove leftover ret variable from invalidate call in iomap_dio_complete
> v3: Do not invalidate in case of error. Add some comments
> v4: Remove unnecessary variable, remove unnecessary inner braces
> v5: Style changes
> v6: Remove redundant invalidatepage, add warning and comment
> v7: Run invalidation conditionally from generic_file_direct_write()
> 
>  fs/direct-io.c | 49 +++++++++++++++++++++++++++++++++++++++++++------
>  fs/iomap.c     | 29 ++++++++++++++++-------------
>  mm/filemap.c   | 10 ++++++++--
>  3 files changed, 67 insertions(+), 21 deletions(-)
> 
> diff --git a/fs/direct-io.c b/fs/direct-io.c
> index 08cf278..ffb9e19 100644
> --- a/fs/direct-io.c
> +++ b/fs/direct-io.c
> @@ -229,6 +229,7 @@ static ssize_t dio_complete(struct dio *dio, ssize_t ret, bool is_async)
>  {
>  	loff_t offset = dio->iocb->ki_pos;
>  	ssize_t transferred = 0;
> +	int err;
>  
>  	/*
>  	 * AIO submission can race with bio completion to get here while
> @@ -258,8 +259,22 @@ static ssize_t dio_complete(struct dio *dio, ssize_t ret, bool is_async)
>  	if (ret == 0)
>  		ret = transferred;
>  
> +	/*
> +	 * Try again to invalidate clean pages which might have been cached by
> +	 * non-direct readahead, or faulted in by get_user_pages() if the source
> +	 * of the write was an mmap'ed region of the file we're writing.  Either
> +	 * one is a pretty crazy thing to do, so we don't support it 100%.  If
> +	 * this invalidation fails, tough, the write still worked...
> +	 */
> +	if (ret > 0 && dio->op == REQ_OP_WRITE &&
> +	    dio->inode->i_mapping->nrpages) {
> +		err = invalidate_inode_pages2_range(dio->inode->i_mapping,
> +					offset >> PAGE_SHIFT,
> +					(offset + ret - 1) >> PAGE_SHIFT);
> +		WARN_ON_ONCE(err);
> +	}
> +
>  	if (dio->end_io) {
> -		int err;
>  
>  		// XXX: ki_pos??
>  		err = dio->end_io(dio->iocb, offset, ret, dio->private);
> @@ -304,6 +319,7 @@ static void dio_bio_end_aio(struct bio *bio)
>  	struct dio *dio = bio->bi_private;
>  	unsigned long remaining;
>  	unsigned long flags;
> +	bool defer_completion = false;
>  
>  	/* cleanup the bio */
>  	dio_bio_complete(dio, bio);
> @@ -315,7 +331,19 @@ static void dio_bio_end_aio(struct bio *bio)
>  	spin_unlock_irqrestore(&dio->bio_lock, flags);
>  
>  	if (remaining == 0) {
> -		if (dio->result && dio->defer_completion) {
> +		/*
> +		 * Defer completion when defer_completion is set or
> +		 * when the inode has pages mapped and this is AIO write.
> +		 * We need to invalidate those pages because there is a
> +		 * chance they contain stale data in the case buffered IO
> +		 * went in between AIO submission and completion into the
> +		 * same region.
> +		 */
> +		if (dio->result)
> +			defer_completion = dio->defer_completion ||
> +					   (dio->op == REQ_OP_WRITE &&
> +					    dio->inode->i_mapping->nrpages);
> +		if (defer_completion) {
>  			INIT_WORK(&dio->complete_work, dio_aio_complete_work);
>  			queue_work(dio->inode->i_sb->s_dio_done_wq,
>  				   &dio->complete_work);
> @@ -1210,10 +1238,19 @@ do_blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
>  	 * For AIO O_(D)SYNC writes we need to defer completions to a workqueue
>  	 * so that we can call ->fsync.
>  	 */
> -	if (dio->is_async && iov_iter_rw(iter) == WRITE &&
> -	    ((iocb->ki_filp->f_flags & O_DSYNC) ||
> -	     IS_SYNC(iocb->ki_filp->f_mapping->host))) {
> -		retval = dio_set_defer_completion(dio);
> +	if (dio->is_async && iov_iter_rw(iter) == WRITE) {
> +		retval = 0;
> +		if ((iocb->ki_filp->f_flags & O_DSYNC) ||
> +		    IS_SYNC(iocb->ki_filp->f_mapping->host))
> +			retval = dio_set_defer_completion(dio);
> +		else if (!dio->inode->i_sb->s_dio_done_wq) {
> +			/*
> +			 * In case of AIO write racing with buffered read we
> +			 * need to defer completion. We can't decide this now,
> +			 * however the workqueue needs to be initialized here.
> +			 */
> +			retval = sb_init_dio_done_wq(dio->inode->i_sb);
> +		}
>  		if (retval) {
>  			/*
>  			 * We grab i_mutex only for reads so we don't have
> diff --git a/fs/iomap.c b/fs/iomap.c
> index 0392661..c3e299a 100644
> --- a/fs/iomap.c
> +++ b/fs/iomap.c
> @@ -713,8 +713,24 @@ struct iomap_dio {
>  static ssize_t iomap_dio_complete(struct iomap_dio *dio)
>  {
>  	struct kiocb *iocb = dio->iocb;
> +	struct inode *inode = file_inode(iocb->ki_filp);
>  	ssize_t ret;
>  
> +	/*
> +	 * Try again to invalidate clean pages which might have been cached by
> +	 * non-direct readahead, or faulted in by get_user_pages() if the source
> +	 * of the write was an mmap'ed region of the file we're writing.  Either
> +	 * one is a pretty crazy thing to do, so we don't support it 100%.  If
> +	 * this invalidation fails, tough, the write still worked...
> +	 */
> +	if (!dio->error &&
> +	    (dio->flags & IOMAP_DIO_WRITE) && inode->i_mapping->nrpages) {
> +		ret = invalidate_inode_pages2_range(inode->i_mapping,
> +				iocb->ki_pos >> PAGE_SHIFT,
> +				(iocb->ki_pos + dio->size - 1) >> PAGE_SHIFT);
> +		WARN_ON_ONCE(ret);
> +	}
> +
>  	if (dio->end_io) {
>  		ret = dio->end_io(iocb,
>  				dio->error ? dio->error : dio->size,
> @@ -1042,19 +1058,6 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>  
>  	ret = iomap_dio_complete(dio);
>  
> -	/*
> -	 * Try again to invalidate clean pages which might have been cached by
> -	 * non-direct readahead, or faulted in by get_user_pages() if the source
> -	 * of the write was an mmap'ed region of the file we're writing.  Either
> -	 * one is a pretty crazy thing to do, so we don't support it 100%.  If
> -	 * this invalidation fails, tough, the write still worked...
> -	 */
> -	if (iov_iter_rw(iter) == WRITE) {
> -		int err = invalidate_inode_pages2_range(mapping,
> -				start >> PAGE_SHIFT, end >> PAGE_SHIFT);
> -		WARN_ON_ONCE(err);
> -	}
> -
>  	return ret;
>  
>  out_free_dio:
> diff --git a/mm/filemap.c b/mm/filemap.c
> index a497024..9440e02 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2885,9 +2885,15 @@ generic_file_direct_write(struct kiocb *iocb, struct iov_iter *from)
>  	 * we're writing.  Either one is a pretty crazy thing to do,
>  	 * so we don't support it 100%.  If this invalidation
>  	 * fails, tough, the write still worked...
> +	 *
> +	 * Most of the time we do not need this since dio_complete() will do
> +	 * the invalidation for us. However there are some file systems that
> +	 * do not end up with dio_complete() being called, so let's not break
> +	 * them by removing it completely
>  	 */
> -	invalidate_inode_pages2_range(mapping,
> -				pos >> PAGE_SHIFT, end);
> +	if (mapping->nrpages)
> +		invalidate_inode_pages2_range(mapping,
> +					pos >> PAGE_SHIFT, end);
>  
>  	if (written > 0) {
>  		pos += written;
> -- 
> 2.7.5
> 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v7] fs: Fix page cache inconsistency when mixing buffered and AIO DIO
  2017-08-15 13:28         ` [PATCH v7] " Lukas Czerner
  2017-08-16 13:15           ` Jan Kara
  2017-08-16 16:01           ` Darrick J. Wong
@ 2017-09-21 13:44           ` Jeff Moyer
  2017-09-21 13:44           ` Lukas Czerner
  2017-10-10 14:34           ` David Sterba
  4 siblings, 0 replies; 40+ messages in thread
From: Jeff Moyer @ 2017-09-21 13:44 UTC (permalink / raw)
  To: Lukas Czerner; +Cc: linux-fsdevel, viro, jack, david

Lukas Czerner <lczerner@redhat.com> writes:

> Currently, when mixing buffered reads and asynchronous direct writes it
> is possible to end up in a situation where we have stale data in the
> page cache while the new data is already written to disk. This is
> permanent until the affected pages are flushed away. Despite the fact
> that mixing buffered and direct IO is ill-advised, it does pose a threat
> to data integrity, is unexpected and should be fixed.
>
> Fix this by deferring completion of asynchronous direct writes to a
> process context in the case that there are mapped pages to be found in
> the inode. Later, before the completion in dio_complete(), invalidate
> the pages in question. This ensures that after the completion the pages
> in the written area are either unmapped, or populated with up-to-date
> data. Also do the same for the iomap case, which uses
> iomap_dio_complete() instead.
>
> This has a side effect of deferring the completion to a process context
> for every AIO DIO that happens on an inode that has pages mapped. However,
> since the consensus is that this is an ill-advised practice, the
> performance implication should not be a problem.
>
> This was based on a proposal from Jeff Moyer, thanks!
>
> Signed-off-by: Lukas Czerner <lczerner@redhat.com>
> Cc: Jeff Moyer <jmoyer@redhat.com>

Is this still in limbo?

Reviewed-by: Jeff Moyer <jmoyer@redhat.com>

> ---
> v2: Remove leftover ret variable from invalidate call in iomap_dio_complete
> v3: Do not invalidate in case of error. Add some comments
> v4: Remove unnecessary variable, remove unnecessary inner braces
> v5: Style changes
> v6: Remove redundant invalidatepage, add warning and comment
> v7: Run invalidation conditionally from generic_file_direct_write()
>
>  fs/direct-io.c | 49 +++++++++++++++++++++++++++++++++++++++++++------
>  fs/iomap.c     | 29 ++++++++++++++++-------------
>  mm/filemap.c   | 10 ++++++++--
>  3 files changed, 67 insertions(+), 21 deletions(-)
>
> diff --git a/fs/direct-io.c b/fs/direct-io.c
> index 08cf278..ffb9e19 100644
> --- a/fs/direct-io.c
> +++ b/fs/direct-io.c
> @@ -229,6 +229,7 @@ static ssize_t dio_complete(struct dio *dio, ssize_t ret, bool is_async)
>  {
>  	loff_t offset = dio->iocb->ki_pos;
>  	ssize_t transferred = 0;
> +	int err;
>  
>  	/*
>  	 * AIO submission can race with bio completion to get here while
> @@ -258,8 +259,22 @@ static ssize_t dio_complete(struct dio *dio, ssize_t ret, bool is_async)
>  	if (ret == 0)
>  		ret = transferred;
>  
> +	/*
> +	 * Try again to invalidate clean pages which might have been cached by
> +	 * non-direct readahead, or faulted in by get_user_pages() if the source
> +	 * of the write was an mmap'ed region of the file we're writing.  Either
> +	 * one is a pretty crazy thing to do, so we don't support it 100%.  If
> +	 * this invalidation fails, tough, the write still worked...
> +	 */
> +	if (ret > 0 && dio->op == REQ_OP_WRITE &&
> +	    dio->inode->i_mapping->nrpages) {
> +		err = invalidate_inode_pages2_range(dio->inode->i_mapping,
> +					offset >> PAGE_SHIFT,
> +					(offset + ret - 1) >> PAGE_SHIFT);
> +		WARN_ON_ONCE(err);
> +	}
> +
>  	if (dio->end_io) {
> -		int err;
>  
>  		// XXX: ki_pos??
>  		err = dio->end_io(dio->iocb, offset, ret, dio->private);
> @@ -304,6 +319,7 @@ static void dio_bio_end_aio(struct bio *bio)
>  	struct dio *dio = bio->bi_private;
>  	unsigned long remaining;
>  	unsigned long flags;
> +	bool defer_completion = false;
>  
>  	/* cleanup the bio */
>  	dio_bio_complete(dio, bio);
> @@ -315,7 +331,19 @@ static void dio_bio_end_aio(struct bio *bio)
>  	spin_unlock_irqrestore(&dio->bio_lock, flags);
>  
>  	if (remaining == 0) {
> -		if (dio->result && dio->defer_completion) {
> +		/*
> +		 * Defer completion when defer_completion is set or
> +		 * when the inode has pages mapped and this is AIO write.
> +		 * We need to invalidate those pages because there is a
> +		 * chance they contain stale data in the case buffered IO
> +		 * went in between AIO submission and completion into the
> +		 * same region.
> +		 */
> +		if (dio->result)
> +			defer_completion = dio->defer_completion ||
> +					   (dio->op == REQ_OP_WRITE &&
> +					    dio->inode->i_mapping->nrpages);
> +		if (defer_completion) {
>  			INIT_WORK(&dio->complete_work, dio_aio_complete_work);
>  			queue_work(dio->inode->i_sb->s_dio_done_wq,
>  				   &dio->complete_work);
> @@ -1210,10 +1238,19 @@ do_blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
>  	 * For AIO O_(D)SYNC writes we need to defer completions to a workqueue
>  	 * so that we can call ->fsync.
>  	 */
> -	if (dio->is_async && iov_iter_rw(iter) == WRITE &&
> -	    ((iocb->ki_filp->f_flags & O_DSYNC) ||
> -	     IS_SYNC(iocb->ki_filp->f_mapping->host))) {
> -		retval = dio_set_defer_completion(dio);
> +	if (dio->is_async && iov_iter_rw(iter) == WRITE) {
> +		retval = 0;
> +		if ((iocb->ki_filp->f_flags & O_DSYNC) ||
> +		    IS_SYNC(iocb->ki_filp->f_mapping->host))
> +			retval = dio_set_defer_completion(dio);
> +		else if (!dio->inode->i_sb->s_dio_done_wq) {
> +			/*
> +			 * In case of AIO write racing with buffered read we
> +			 * need to defer completion. We can't decide this now,
> +			 * however the workqueue needs to be initialized here.
> +			 */
> +			retval = sb_init_dio_done_wq(dio->inode->i_sb);
> +		}
>  		if (retval) {
>  			/*
>  			 * We grab i_mutex only for reads so we don't have
> diff --git a/fs/iomap.c b/fs/iomap.c
> index 0392661..c3e299a 100644
> --- a/fs/iomap.c
> +++ b/fs/iomap.c
> @@ -713,8 +713,24 @@ struct iomap_dio {
>  static ssize_t iomap_dio_complete(struct iomap_dio *dio)
>  {
>  	struct kiocb *iocb = dio->iocb;
> +	struct inode *inode = file_inode(iocb->ki_filp);
>  	ssize_t ret;
>  
> +	/*
> +	 * Try again to invalidate clean pages which might have been cached by
> +	 * non-direct readahead, or faulted in by get_user_pages() if the source
> +	 * of the write was an mmap'ed region of the file we're writing.  Either
> +	 * one is a pretty crazy thing to do, so we don't support it 100%.  If
> +	 * this invalidation fails, tough, the write still worked...
> +	 */
> +	if (!dio->error &&
> +	    (dio->flags & IOMAP_DIO_WRITE) && inode->i_mapping->nrpages) {
> +		ret = invalidate_inode_pages2_range(inode->i_mapping,
> +				iocb->ki_pos >> PAGE_SHIFT,
> +				(iocb->ki_pos + dio->size - 1) >> PAGE_SHIFT);
> +		WARN_ON_ONCE(ret);
> +	}
> +
>  	if (dio->end_io) {
>  		ret = dio->end_io(iocb,
>  				dio->error ? dio->error : dio->size,
> @@ -1042,19 +1058,6 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>  
>  	ret = iomap_dio_complete(dio);
>  
> -	/*
> -	 * Try again to invalidate clean pages which might have been cached by
> -	 * non-direct readahead, or faulted in by get_user_pages() if the source
> -	 * of the write was an mmap'ed region of the file we're writing.  Either
> -	 * one is a pretty crazy thing to do, so we don't support it 100%.  If
> -	 * this invalidation fails, tough, the write still worked...
> -	 */
> -	if (iov_iter_rw(iter) == WRITE) {
> -		int err = invalidate_inode_pages2_range(mapping,
> -				start >> PAGE_SHIFT, end >> PAGE_SHIFT);
> -		WARN_ON_ONCE(err);
> -	}
> -
>  	return ret;
>  
>  out_free_dio:
> diff --git a/mm/filemap.c b/mm/filemap.c
> index a497024..9440e02 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2885,9 +2885,15 @@ generic_file_direct_write(struct kiocb *iocb, struct iov_iter *from)
>  	 * we're writing.  Either one is a pretty crazy thing to do,
>  	 * so we don't support it 100%.  If this invalidation
>  	 * fails, tough, the write still worked...
> +	 *
> +	 * Most of the time we do not need this since dio_complete() will do
> +	 * the invalidation for us. However there are some file systems that
> +	 * do not end up with dio_complete() being called, so let's not break
> +	 * them by removing it completely
>  	 */
> -	invalidate_inode_pages2_range(mapping,
> -				pos >> PAGE_SHIFT, end);
> +	if (mapping->nrpages)
> +		invalidate_inode_pages2_range(mapping,
> +					pos >> PAGE_SHIFT, end);
>  
>  	if (written > 0) {
>  		pos += written;

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v7] fs: Fix page cache inconsistency when mixing buffered and AIO DIO
  2017-08-15 13:28         ` [PATCH v7] " Lukas Czerner
                             ` (2 preceding siblings ...)
  2017-09-21 13:44           ` Jeff Moyer
@ 2017-09-21 13:44           ` Lukas Czerner
  2017-09-21 14:14             ` Jens Axboe
  2017-10-10 14:34           ` David Sterba
  4 siblings, 1 reply; 40+ messages in thread
From: Lukas Czerner @ 2017-09-21 13:44 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: viro, jack, jmoyer, david, Jens Axboe

Al, Jens,

can either of you please take this through your tree?

Thanks!
-Lukas

On Tue, Aug 15, 2017 at 03:28:54PM +0200, Lukas Czerner wrote:
> Currently, when mixing buffered reads and asynchronous direct writes it
> is possible to end up in a situation where we have stale data in the
> page cache while the new data is already written to disk. This is
> permanent until the affected pages are flushed away. Despite the fact
> that mixing buffered and direct IO is ill-advised, it does pose a threat
> to data integrity, is unexpected and should be fixed.
> 
> Fix this by deferring completion of asynchronous direct writes to a
> process context in the case that there are mapped pages to be found in
> the inode. Later, before the completion in dio_complete(), invalidate
> the pages in question. This ensures that after the completion the pages
> in the written area are either unmapped, or populated with up-to-date
> data. Also do the same for the iomap case, which uses
> iomap_dio_complete() instead.
> 
> This has a side effect of deferring the completion to a process context
> for every AIO DIO that happens on an inode that has pages mapped. However,
> since the consensus is that this is an ill-advised practice, the
> performance implication should not be a problem.
> 
> This was based on a proposal from Jeff Moyer, thanks!
> 
> Signed-off-by: Lukas Czerner <lczerner@redhat.com>
> Cc: Jeff Moyer <jmoyer@redhat.com>
> ---
> v2: Remove leftover ret variable from invalidate call in iomap_dio_complete
> v3: Do not invalidate in case of error. Add some comments
> v4: Remove unnecessary variable, remove unnecessary inner braces
> v5: Style changes
> v6: Remove redundant invalidatepage, add warning and comment
> v7: Run invalidation conditionally from generic_file_direct_write()
> 
>  fs/direct-io.c | 49 +++++++++++++++++++++++++++++++++++++++++++------
>  fs/iomap.c     | 29 ++++++++++++++++-------------
>  mm/filemap.c   | 10 ++++++++--
>  3 files changed, 67 insertions(+), 21 deletions(-)
> 
> diff --git a/fs/direct-io.c b/fs/direct-io.c
> index 08cf278..ffb9e19 100644
> --- a/fs/direct-io.c
> +++ b/fs/direct-io.c
> @@ -229,6 +229,7 @@ static ssize_t dio_complete(struct dio *dio, ssize_t ret, bool is_async)
>  {
>  	loff_t offset = dio->iocb->ki_pos;
>  	ssize_t transferred = 0;
> +	int err;
>  
>  	/*
>  	 * AIO submission can race with bio completion to get here while
> @@ -258,8 +259,22 @@ static ssize_t dio_complete(struct dio *dio, ssize_t ret, bool is_async)
>  	if (ret == 0)
>  		ret = transferred;
>  
> +	/*
> +	 * Try again to invalidate clean pages which might have been cached by
> +	 * non-direct readahead, or faulted in by get_user_pages() if the source
> +	 * of the write was an mmap'ed region of the file we're writing.  Either
> +	 * one is a pretty crazy thing to do, so we don't support it 100%.  If
> +	 * this invalidation fails, tough, the write still worked...
> +	 */
> +	if (ret > 0 && dio->op == REQ_OP_WRITE &&
> +	    dio->inode->i_mapping->nrpages) {
> +		err = invalidate_inode_pages2_range(dio->inode->i_mapping,
> +					offset >> PAGE_SHIFT,
> +					(offset + ret - 1) >> PAGE_SHIFT);
> +		WARN_ON_ONCE(err);
> +	}
> +
>  	if (dio->end_io) {
> -		int err;
>  
>  		// XXX: ki_pos??
>  		err = dio->end_io(dio->iocb, offset, ret, dio->private);
> @@ -304,6 +319,7 @@ static void dio_bio_end_aio(struct bio *bio)
>  	struct dio *dio = bio->bi_private;
>  	unsigned long remaining;
>  	unsigned long flags;
> +	bool defer_completion = false;
>  
>  	/* cleanup the bio */
>  	dio_bio_complete(dio, bio);
> @@ -315,7 +331,19 @@ static void dio_bio_end_aio(struct bio *bio)
>  	spin_unlock_irqrestore(&dio->bio_lock, flags);
>  
>  	if (remaining == 0) {
> -		if (dio->result && dio->defer_completion) {
> +		/*
> +		 * Defer completion when defer_completion is set or
> +		 * when the inode has pages mapped and this is AIO write.
> +		 * We need to invalidate those pages because there is a
> +		 * chance they contain stale data in the case buffered IO
> +		 * went in between AIO submission and completion into the
> +		 * same region.
> +		 */
> +		if (dio->result)
> +			defer_completion = dio->defer_completion ||
> +					   (dio->op == REQ_OP_WRITE &&
> +					    dio->inode->i_mapping->nrpages);
> +		if (defer_completion) {
>  			INIT_WORK(&dio->complete_work, dio_aio_complete_work);
>  			queue_work(dio->inode->i_sb->s_dio_done_wq,
>  				   &dio->complete_work);
> @@ -1210,10 +1238,19 @@ do_blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
>  	 * For AIO O_(D)SYNC writes we need to defer completions to a workqueue
>  	 * so that we can call ->fsync.
>  	 */
> -	if (dio->is_async && iov_iter_rw(iter) == WRITE &&
> -	    ((iocb->ki_filp->f_flags & O_DSYNC) ||
> -	     IS_SYNC(iocb->ki_filp->f_mapping->host))) {
> -		retval = dio_set_defer_completion(dio);
> +	if (dio->is_async && iov_iter_rw(iter) == WRITE) {
> +		retval = 0;
> +		if ((iocb->ki_filp->f_flags & O_DSYNC) ||
> +		    IS_SYNC(iocb->ki_filp->f_mapping->host))
> +			retval = dio_set_defer_completion(dio);
> +		else if (!dio->inode->i_sb->s_dio_done_wq) {
> +			/*
> +			 * In case of AIO write racing with buffered read we
> +			 * need to defer completion. We can't decide this now,
> +			 * however the workqueue needs to be initialized here.
> +			 */
> +			retval = sb_init_dio_done_wq(dio->inode->i_sb);
> +		}
>  		if (retval) {
>  			/*
>  			 * We grab i_mutex only for reads so we don't have
> diff --git a/fs/iomap.c b/fs/iomap.c
> index 0392661..c3e299a 100644
> --- a/fs/iomap.c
> +++ b/fs/iomap.c
> @@ -713,8 +713,24 @@ struct iomap_dio {
>  static ssize_t iomap_dio_complete(struct iomap_dio *dio)
>  {
>  	struct kiocb *iocb = dio->iocb;
> +	struct inode *inode = file_inode(iocb->ki_filp);
>  	ssize_t ret;
>  
> +	/*
> +	 * Try again to invalidate clean pages which might have been cached by
> +	 * non-direct readahead, or faulted in by get_user_pages() if the source
> +	 * of the write was an mmap'ed region of the file we're writing.  Either
> +	 * one is a pretty crazy thing to do, so we don't support it 100%.  If
> +	 * this invalidation fails, tough, the write still worked...
> +	 */
> +	if (!dio->error &&
> +	    (dio->flags & IOMAP_DIO_WRITE) && inode->i_mapping->nrpages) {
> +		ret = invalidate_inode_pages2_range(inode->i_mapping,
> +				iocb->ki_pos >> PAGE_SHIFT,
> +				(iocb->ki_pos + dio->size - 1) >> PAGE_SHIFT);
> +		WARN_ON_ONCE(ret);
> +	}
> +
>  	if (dio->end_io) {
>  		ret = dio->end_io(iocb,
>  				dio->error ? dio->error : dio->size,
> @@ -1042,19 +1058,6 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>  
>  	ret = iomap_dio_complete(dio);
>  
> -	/*
> -	 * Try again to invalidate clean pages which might have been cached by
> -	 * non-direct readahead, or faulted in by get_user_pages() if the source
> -	 * of the write was an mmap'ed region of the file we're writing.  Either
> -	 * one is a pretty crazy thing to do, so we don't support it 100%.  If
> -	 * this invalidation fails, tough, the write still worked...
> -	 */
> -	if (iov_iter_rw(iter) == WRITE) {
> -		int err = invalidate_inode_pages2_range(mapping,
> -				start >> PAGE_SHIFT, end >> PAGE_SHIFT);
> -		WARN_ON_ONCE(err);
> -	}
> -
>  	return ret;
>  
>  out_free_dio:
> diff --git a/mm/filemap.c b/mm/filemap.c
> index a497024..9440e02 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2885,9 +2885,15 @@ generic_file_direct_write(struct kiocb *iocb, struct iov_iter *from)
>  	 * we're writing.  Either one is a pretty crazy thing to do,
>  	 * so we don't support it 100%.  If this invalidation
>  	 * fails, tough, the write still worked...
> +	 *
> +	 * Most of the time we do not need this since dio_complete() will do
> +	 * the invalidation for us. However there are some file systems that
> +	 * do not end up with dio_complete() being called, so let's not break
> +	 * them by removing it completely
>  	 */
> -	invalidate_inode_pages2_range(mapping,
> -				pos >> PAGE_SHIFT, end);
> +	if (mapping->nrpages)
> +		invalidate_inode_pages2_range(mapping,
> +					pos >> PAGE_SHIFT, end);
>  
>  	if (written > 0) {
>  		pos += written;
> -- 
> 2.7.5
> 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v7] fs: Fix page cache inconsistency when mixing buffered and AIO DIO
  2017-09-21 13:44           ` Lukas Czerner
@ 2017-09-21 14:14             ` Jens Axboe
  0 siblings, 0 replies; 40+ messages in thread
From: Jens Axboe @ 2017-09-21 14:14 UTC (permalink / raw)
  To: Lukas Czerner, linux-fsdevel; +Cc: viro, jack, jmoyer, david

On 09/21/2017 07:44 AM, Lukas Czerner wrote:
> Al, Jens,
> 
> can any of you please take this throught your tree ?

I can take it, it's well reviewed at this point (to say the least).

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v7] fs: Fix page cache inconsistency when mixing buffered and AIO DIO
  2017-08-15 13:28         ` [PATCH v7] " Lukas Czerner
                             ` (3 preceding siblings ...)
  2017-09-21 13:44           ` Lukas Czerner
@ 2017-10-10 14:34           ` David Sterba
  2017-10-11  9:21             ` Lukas Czerner
  4 siblings, 1 reply; 40+ messages in thread
From: David Sterba @ 2017-10-10 14:34 UTC (permalink / raw)
  To: Lukas Czerner; +Cc: linux-fsdevel, viro, jack, jmoyer, david, bo.li.liu, clm

On Tue, Aug 15, 2017 at 03:28:54PM +0200, Lukas Czerner wrote:
> +	/*
> +	 * Try again to invalidate clean pages which might have been cached by
> +	 * non-direct readahead, or faulted in by get_user_pages() if the source
> +	 * of the write was an mmap'ed region of the file we're writing.  Either
> +	 * one is a pretty crazy thing to do, so we don't support it 100%.  If
> +	 * this invalidation fails, tough, the write still worked...
> +	 */
> +	if (ret > 0 && dio->op == REQ_OP_WRITE &&
> +	    dio->inode->i_mapping->nrpages) {
> +		err = invalidate_inode_pages2_range(dio->inode->i_mapping,
> +					offset >> PAGE_SHIFT,
> +					(offset + ret - 1) >> PAGE_SHIFT);
> +		WARN_ON_ONCE(err);

fstests/btrfs/062 reports this:

[ 6235.547298] ------------[ cut here ]------------
[ 6235.552098] WARNING: CPU: 7 PID: 24321 at fs/direct-io.c:274 dio_complete+0x16f/0x1f0
[ 6235.560858] Modules linked in: dm_flakey loop rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace sunrpc fscache af_packet br_netfilter bridge stp llc iscsi_ibft iscsi_boot_sysfs btrfs xor zstd_decompress zstd_compress i2c_algo_bit drm_kms_helper xxhash zlib_deflate raid6_pq syscopyarea sysfillrect sysimgblt fb_sys_fops ttm tg3 drm dm_mod dax ptp kvm_amd pps_core kvm libphy tpm_infineon mptctl shpchp k10temp tpm_tis tpm_tis_core button i2c_piix4 tpm pcspkr irqbypass acpi_cpufreq ext4 mbcache jbd2 sr_mod cdrom ohci_pci ehci_pci ohci_hcd mptsas ehci_hcd scsi_transport_sas ata_generic mptscsih serio_raw mptbase usbcore sata_svw pata_serverworks sg scsi_dh_rdac scsi_dh_emc scsi_dh_alua
[ 6235.560942] CPU: 7 PID: 24321 Comm: kworker/7:1 Not tainted 4.14.0-rc4-1.ge195904-vanilla+ #71
[ 6235.560944] Hardware name: empty empty/S3993, BIOS PAQEX0-3 02/24/2008
[ 6235.560950] Workqueue: dio/sdb6 dio_aio_complete_work
[ 6235.560953] task: ffff894fe0bd8300 task.stack: ffffb45742f7c000
[ 6235.560957] RIP: 0010:dio_complete+0x16f/0x1f0
[ 6235.560959] RSP: 0018:ffffb45742f7fde8 EFLAGS: 00010286
[ 6235.560968] RAX: 00000000fffffff0 RBX: ffff894fd1e3a680 RCX: ffff894fe0bd8300
[ 6235.560970] RDX: 0000000000000000 RSI: 00000000000002b4 RDI: ffffffffaba438e9
[ 6235.560971] RBP: ffffb45742f7fe10 R08: 0000000000000000 R09: 0000000000000025
[ 6235.560973] R10: 0000000000000000 R11: 0000000000000000 R12: 000000000001e000
[ 6235.560974] R13: 000000000001e000 R14: 0000000000007000 R15: 0000000000000001
[ 6235.560977] FS:  0000000000000000(0000) GS:ffff894fefdc0000(0000) knlGS:0000000000000000
[ 6235.560978] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 6235.560980] CR2: 00007fe1e1dfb610 CR3: 0000000213ee7000 CR4: 00000000000006e0
[ 6235.560982] Call Trace:
[ 6235.561075]  dio_aio_complete_work+0x1c/0x20
[ 6235.561082]  process_one_work+0x1d8/0x620
[ 6235.561085]  ? process_one_work+0x14b/0x620
[ 6235.561092]  worker_thread+0x4d/0x3c0
[ 6235.561097]  ? trace_hardirqs_on+0xd/0x10
[ 6235.561105]  kthread+0x152/0x190
[ 6235.561107]  ? process_one_work+0x620/0x620
[ 6235.561111]  ? kthread_create_on_node+0x40/0x40
[ 6235.561116]  ? do_syscall_64+0x69/0x180
[ 6235.561122]  ret_from_fork+0x2a/0x40
[ 6235.561131] Code: 48 83 bf 00 01 00 00 00 0f 84 37 ff ff ff 4b 8d 54 34 ff 4c 89 f6 48 c1 fe 0c 48 c1 fa 0c e8 49 82 f2 ff 85 c0 0f 84 1a ff ff ff <0f> ff e9 13 ff ff ff 48 81 c7 e0 00 00 00 be 09 00 00 00 e8 79
[ 6235.561179] ---[ end trace ba80cd81f19cb389 ]---

I've added Chris and Bo to CC if they have more to say about the specifics of
dio and buffered writes as implemented in btrfs.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v7] fs: Fix page cache inconsistency when mixing buffered and AIO DIO
  2017-10-10 14:34           ` David Sterba
@ 2017-10-11  9:21             ` Lukas Czerner
  0 siblings, 0 replies; 40+ messages in thread
From: Lukas Czerner @ 2017-10-11  9:21 UTC (permalink / raw)
  To: David Sterba; +Cc: linux-fsdevel, viro, jack, jmoyer, david, bo.li.liu, clm

On Tue, Oct 10, 2017 at 04:34:45PM +0200, David Sterba wrote:
> On Tue, Aug 15, 2017 at 03:28:54PM +0200, Lukas Czerner wrote:
> > +	/*
> > +	 * Try again to invalidate clean pages which might have been cached by
> > +	 * non-direct readahead, or faulted in by get_user_pages() if the source
> > +	 * of the write was an mmap'ed region of the file we're writing.  Either
> > +	 * one is a pretty crazy thing to do, so we don't support it 100%.  If
> > +	 * this invalidation fails, tough, the write still worked...
> > +	 */
> > +	if (ret > 0 && dio->op == REQ_OP_WRITE &&
> > +	    dio->inode->i_mapping->nrpages) {
> > +		err = invalidate_inode_pages2_range(dio->inode->i_mapping,
> > +					offset >> PAGE_SHIFT,
> > +					(offset + ret - 1) >> PAGE_SHIFT);
> > +		WARN_ON_ONCE(err);
> 
> fstests/btrfs/062 reports this:
> 
> [ 6235.547298] ------------[ cut here ]------------
> [ 6235.552098] WARNING: CPU: 7 PID: 24321 at fs/direct-io.c:274 dio_complete+0x16f/0x1f0
> [ 6235.560858] Modules linked in: dm_flakey loop rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace sunrpc fscache af_packet br_netfilter bridge stp llc iscsi_ibft iscsi_boot_sysfs btrfs xor zstd_decompress zstd_compress i2c_algo_bit drm_kms_helper xxhash zlib_deflate raid6_pq syscopyarea sysfillrect sysimgblt fb_sys_fops ttm tg3 drm dm_mod dax ptp kvm_amd pps_core kvm libphy tpm_infineon mptctl shpchp k10temp tpm_tis tpm_tis_core button i2c_piix4 tpm pcspkr irqbypass acpi_cpufreq ext4 mbcache jbd2 sr_mod cdrom ohci_pci ehci_pci ohci_hcd mptsas ehci_hcd scsi_transport_sas ata_generic mptscsih serio_raw mptbase usbcore sata_svw pata_serverworks sg scsi_dh_rdac scsi_dh_emc scsi_dh_alua
> [ 6235.560942] CPU: 7 PID: 24321 Comm: kworker/7:1 Not tainted 4.14.0-rc4-1.ge195904-vanilla+ #71
> [ 6235.560944] Hardware name: empty empty/S3993, BIOS PAQEX0-3 02/24/2008
> [ 6235.560950] Workqueue: dio/sdb6 dio_aio_complete_work
> [ 6235.560953] task: ffff894fe0bd8300 task.stack: ffffb45742f7c000
> [ 6235.560957] RIP: 0010:dio_complete+0x16f/0x1f0
> [ 6235.560959] RSP: 0018:ffffb45742f7fde8 EFLAGS: 00010286
> [ 6235.560968] RAX: 00000000fffffff0 RBX: ffff894fd1e3a680 RCX: ffff894fe0bd8300
> [ 6235.560970] RDX: 0000000000000000 RSI: 00000000000002b4 RDI: ffffffffaba438e9
> [ 6235.560971] RBP: ffffb45742f7fe10 R08: 0000000000000000 R09: 0000000000000025
> [ 6235.560973] R10: 0000000000000000 R11: 0000000000000000 R12: 000000000001e000
> [ 6235.560974] R13: 000000000001e000 R14: 0000000000007000 R15: 0000000000000001
> [ 6235.560977] FS:  0000000000000000(0000) GS:ffff894fefdc0000(0000) knlGS:0000000000000000
> [ 6235.560978] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 6235.560980] CR2: 00007fe1e1dfb610 CR3: 0000000213ee7000 CR4: 00000000000006e0
> [ 6235.560982] Call Trace:
> [ 6235.561075]  dio_aio_complete_work+0x1c/0x20
> [ 6235.561082]  process_one_work+0x1d8/0x620
> [ 6235.561085]  ? process_one_work+0x14b/0x620
> [ 6235.561092]  worker_thread+0x4d/0x3c0
> [ 6235.561097]  ? trace_hardirqs_on+0xd/0x10
> [ 6235.561105]  kthread+0x152/0x190
> [ 6235.561107]  ? process_one_work+0x620/0x620
> [ 6235.561111]  ? kthread_create_on_node+0x40/0x40
> [ 6235.561116]  ? do_syscall_64+0x69/0x180
> [ 6235.561122]  ret_from_fork+0x2a/0x40
> [ 6235.561131] Code: 48 83 bf 00 01 00 00 00 0f 84 37 ff ff ff 4b 8d 54 34 ff 4c 89 f6 48 c1 fe 0c 48 c1 fa 0c e8 49 82 f2 ff 85 c0 0f 84 1a ff ff ff <0f> ff e9 13 ff ff ff 48 81 c7 e0 00 00 00 be 09 00 00 00 e8 79
> [ 6235.561179] ---[ end trace ba80cd81f19cb389 ]---
> 
> I've added Chris and Bo to CC if they have more to say about the specifics of
> dio and buffered writes as implemented in btrfs.

There are several places where we warn when invalidation fails. The
question is why the invalidation failed on btrfs in this test.

-Lukas

^ permalink raw reply	[flat|nested] 40+ messages in thread
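
For reports like the one above, one way to observe from userspace whether
the completion-time invalidation actually emptied the written range is to
check page cache residency with mincore(2). A rough sketch, assuming a
made-up path and 4 KiB pages:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	unsigned char vec = 0;
	int fd = open("testfile", O_RDONLY);
	void *map = mmap(NULL, 4096, PROT_READ, MAP_SHARED, fd, 0);

	/* Bit 0 of vec is set iff page 0 of the file is in the page cache. */
	mincore(map, 4096, &vec);
	printf("page 0 resident in page cache: %d\n", vec & 1);

	munmap(map, 4096);
	close(fd);
	return 0;
}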

Thread overview: 40+ messages
2017-07-13 15:17 [PATCH] fs: Fix page cache inconsistency when mixing buffered and AIO DIO Lukas Czerner
2017-07-14 10:41 ` kbuild test robot
2017-07-14 13:40   ` Lukas Czerner
2017-07-14 15:40 ` [PATCH v2] " Lukas Czerner
2017-07-17 15:12   ` Jan Kara
2017-07-17 15:28     ` Lukas Czerner
2017-07-17 15:39       ` Jeff Moyer
2017-07-17 16:17         ` Jan Kara
2017-07-17 19:52           ` Jeff Moyer
2017-07-18  7:39         ` Lukas Czerner
2017-07-18  9:06           ` Jan Kara
2017-07-18  9:32             ` Lukas Czerner
2017-07-18 12:19   ` [PATCH v3] " Lukas Czerner
2017-07-18 13:44     ` Christoph Hellwig
2017-07-18 14:17       ` Jan Kara
2017-07-19  8:42       ` Lukas Czerner
2017-07-19  8:48     ` [PATCH v4] " Lukas Czerner
2017-07-19  9:26       ` Jan Kara
2017-07-19 11:01         ` Lukas Czerner
2017-07-19 11:28     ` [PATCH v5] " Lukas Czerner
2017-07-19 11:37       ` Jan Kara
2017-07-19 12:17       ` Jeff Moyer
2017-08-03 18:10       ` Jeff Moyer
2017-08-04 10:09         ` Dave Chinner
2017-08-07 15:52           ` Jeff Moyer
2017-08-08  8:41             ` Lukas Czerner
2017-08-10 12:59       ` [PATCH v6] " Lukas Czerner
2017-08-10 13:56         ` Jan Kara
2017-08-10 14:22           ` Jeff Moyer
2017-08-11  9:03             ` Lukas Czerner
2017-08-14  9:43               ` Jan Kara
2017-08-15 12:47                 ` Lukas Czerner
2017-08-15 13:28         ` [PATCH v7] " Lukas Czerner
2017-08-16 13:15           ` Jan Kara
2017-08-16 16:01           ` Darrick J. Wong
2017-09-21 13:44           ` Jeff Moyer
2017-09-21 13:44           ` Lukas Czerner
2017-09-21 14:14             ` Jens Axboe
2017-10-10 14:34           ` David Sterba
2017-10-11  9:21             ` Lukas Czerner
