[RFC 0/2] ext4: Fix stale data read exposure problem with DIO read/page

All of lore.kernel.org
 help / color / mirror / Atom feed

* [RFC 0/2] ext4: Fix stale data read exposure problem with DIO read/page_mkwrite
@ 2020-01-13 11:04 Ritesh Harjani
  2020-01-13 11:04 ` [RFC 1/2] iomap: direct-io: Move inode_dio_begin before filemap_write_and_wait_range Ritesh Harjani
                   ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Ritesh Harjani @ 2020-01-13 11:04 UTC (permalink / raw)
  To: linux-ext4; +Cc: tytso, jack, Ritesh Harjani

Hello All, 

Sorry for the delay on this patchset. I guess it's because there were
lot of other context switches while working at it.

Please note that this is a RFC patch and also a WIP (due to a open problem
listed below).
There is also another thread going on where making dioread_nolock as default
mount opt [1] is being discussed. That approach should also solve the given
race at hand. But since nothing is finalized yet, so I wanted to get this patch
out for early review/discussion.

About patch
===========

Currently there is a small race window as pointed out by Jan [2] where, when
ext4 tries to allocate a written block for mapped files and if DIO read is in
progress, then this may result into stale data read exposure problem.

This patch tries to fix the mentioned issue by:
1. For non-delalloc path, page_mkwrite will use unwritten blocks by
   default for extent based files.

2. For delalloc path, we check if DIO is in progress during writeback.
   If yes, then we use unwritten blocks method to avoid this race.

Patch-1: This moves the inode_dio_begin() call before calling for
filemap_write_and_wait_range.

Patch-2: This implementes the points (1) & (2) mentioned above.

Testing:
========
xfstests "-g auto" ran fine except one warn_on issue.

Below tests are giving kernel WARN_ON from "ext4_journalled_invalidatepage()",
with 1024 blocksize, 4K pagesize & with "nodelalloc,data=journal" mount opt.
- generic/013, generic/269, generic/270

In case if someone has any pointers around this, I could dig more deeper into
this. 

References
==========
[1] https://www.spinics.net/lists/linux-ext4/msg69224.html
[2] https://lore.kernel.org/linux-ext4/20190926134726.GA28555@quack2.suse.cz/ 

Ritesh Harjani (2):
  iomap: direct-io: Move inode_dio_begin before
    filemap_write_and_wait_range
  ext4: Fix stale data read issue with DIO read & ext4_page_mkwrite path

 fs/ext4/inode.c      | 45 +++++++++++++++++++++++++++++++-------------
 fs/iomap/direct-io.c | 17 +++++++++++++----
 2 files changed, 45 insertions(+), 17 deletions(-)

-- 
2.21.0

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [RFC 1/2] iomap: direct-io: Move inode_dio_begin before filemap_write_and_wait_range
  2020-01-13 11:04 [RFC 0/2] ext4: Fix stale data read exposure problem with DIO read/page_mkwrite Ritesh Harjani
@ 2020-01-13 11:04 ` Ritesh Harjani
  2020-01-13 21:51   ` Darrick J. Wong
                     ` (2 more replies)
  2020-01-13 11:04 ` [RFC 2/2] ext4: Fix stale data read issue with DIO read & ext4_page_mkwrite path Ritesh Harjani
  2020-01-14 16:39 ` [RFC 0/2] ext4: Fix stale data read exposure problem with DIO read/page_mkwrite Christoph Hellwig
  2 siblings, 3 replies; 17+ messages in thread
From: Ritesh Harjani @ 2020-01-13 11:04 UTC (permalink / raw)
  To: linux-ext4; +Cc: tytso, jack, Ritesh Harjani

Some filesystems (e.g. ext4) need to know in it's writeback path, that
whether DIO is in progress or not. This info may be needed to avoid the
stale data exposure race with DIO reads.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
---
 fs/iomap/direct-io.c | 17 +++++++++++++----
 1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 23837926c0c5..d1c159bd3854 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -468,9 +468,18 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 		flags |= IOMAP_NOWAIT;
 	}
 
+	/*
+	 * Call inode_dio_begin() before we write out and wait for writeback to
+	 * complete. This may be needed by some filesystems to prevent race
+	 * like stale data exposure by DIO reads.
+	 */
+	inode_dio_begin(inode);
+	/* So that i_dio_count is incremented before below operation */
+	smp_mb__after_atomic();
+
 	ret = filemap_write_and_wait_range(mapping, pos, end);
 	if (ret)
-		goto out_free_dio;
+		goto out_end_dio;
 
 	/*
 	 * Try to invalidate cache pages for the range we're direct
@@ -488,11 +497,9 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 	    !inode->i_sb->s_dio_done_wq) {
 		ret = sb_init_dio_done_wq(inode->i_sb);
 		if (ret < 0)
-			goto out_free_dio;
+			goto out_end_dio;
 	}
 
-	inode_dio_begin(inode);
-
 	blk_start_plug(&plug);
 	do {
 		ret = iomap_apply(inode, pos, count, flags, ops, dio,
@@ -568,6 +575,8 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 
 	return iomap_dio_complete(dio);
 
+out_end_dio:
+	inode_dio_end(inode);
 out_free_dio:
 	kfree(dio);
 	return ret;
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [RFC 2/2] ext4: Fix stale data read issue with DIO read & ext4_page_mkwrite path
  2020-01-13 11:04 [RFC 0/2] ext4: Fix stale data read exposure problem with DIO read/page_mkwrite Ritesh Harjani
  2020-01-13 11:04 ` [RFC 1/2] iomap: direct-io: Move inode_dio_begin before filemap_write_and_wait_range Ritesh Harjani
@ 2020-01-13 11:04 ` Ritesh Harjani
  2020-01-14  9:47   ` Jan Kara
  2020-01-14 16:39 ` [RFC 0/2] ext4: Fix stale data read exposure problem with DIO read/page_mkwrite Christoph Hellwig
  2 siblings, 1 reply; 17+ messages in thread
From: Ritesh Harjani @ 2020-01-13 11:04 UTC (permalink / raw)
  To: linux-ext4; +Cc: tytso, jack, Ritesh Harjani

Currently there is a small race window where ext4 tries to allocate
a written block for mapped files and if DIO read is in progress, then
this may result into stale data read exposure problem.

This patch fixes the mentioned issue by:
1. For non-delalloc path, page_mkwrite will use unwritten blocks by
   default for extent based files.

2. For delalloc path, we check if DIO is in progress during writeback.
   If yes, then we use unwritten blocks method to avoid this race.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
---
 fs/ext4/inode.c | 45 ++++++++++++++++++++++++++++++++-------------
 1 file changed, 32 insertions(+), 13 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index d035acab5b2a..07f66782335b 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1529,6 +1529,7 @@ struct mpage_da_data {
 	struct ext4_map_blocks map;
 	struct ext4_io_submit io_submit;	/* IO submission data */
 	unsigned int do_map:1;
+	bool dio_in_progress:1;
 };
 
 static void mpage_release_unused_pages(struct mpage_da_data *mpd,
@@ -2359,7 +2360,7 @@ static int mpage_map_one_extent(handle_t *handle, struct mpage_da_data *mpd)
 			   EXT4_GET_BLOCKS_METADATA_NOFAIL |
 			   EXT4_GET_BLOCKS_IO_SUBMIT;
 	dioread_nolock = ext4_should_dioread_nolock(inode);
-	if (dioread_nolock)
+	if (dioread_nolock || mpd->dio_in_progress)
 		get_blocks_flags |= EXT4_GET_BLOCKS_IO_CREATE_EXT;
 	if (map->m_flags & (1 << BH_Delay))
 		get_blocks_flags |= EXT4_GET_BLOCKS_DELALLOC_RESERVE;
@@ -2367,7 +2368,8 @@ static int mpage_map_one_extent(handle_t *handle, struct mpage_da_data *mpd)
 	err = ext4_map_blocks(handle, inode, map, get_blocks_flags);
 	if (err < 0)
 		return err;
-	if (dioread_nolock && (map->m_flags & EXT4_MAP_UNWRITTEN)) {
+	if ((dioread_nolock || mpd->dio_in_progress) &&
+	    (map->m_flags & EXT4_MAP_UNWRITTEN)) {
 		if (!mpd->io_submit.io_end->handle &&
 		    ext4_handle_valid(handle)) {
 			mpd->io_submit.io_end->handle = handle->h_rsv_handle;
@@ -2626,6 +2628,7 @@ static int ext4_writepages(struct address_space *mapping,
 	bool done;
 	struct blk_plug plug;
 	bool give_up_on_write = false;
+	bool dio_in_progress = false;
 
 	if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb))))
 		return -EIO;
@@ -2680,15 +2683,6 @@ static int ext4_writepages(struct address_space *mapping,
 		ext4_journal_stop(handle);
 	}
 
-	if (ext4_should_dioread_nolock(inode)) {
-		/*
-		 * We may need to convert up to one extent per block in
-		 * the page and we may dirty the inode.
-		 */
-		rsv_blocks = 1 + ext4_chunk_trans_blocks(inode,
-						PAGE_SIZE >> inode->i_blkbits);
-	}
-
 	if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX)
 		range_whole = 1;
 
@@ -2712,6 +2706,26 @@ static int ext4_writepages(struct address_space *mapping,
 	done = false;
 	blk_start_plug(&plug);
 
+	/*
+	 * If DIO is in progress, then we use unwritten blocks for allocation.
+	 * This is to avoid a small window of race (stale read) with
+	 * ext4_page_mkwrite path in delalloc case & with DIO read in parallel.
+	 *
+	 * Let's check for i_dio_count after we have tagged pages for writeback.
+	 */
+	smp_mb__before_atomic();
+	dio_in_progress = !!atomic_read(&inode->i_dio_count);
+	mpd.dio_in_progress = dio_in_progress;
+
+	if (ext4_should_dioread_nolock(inode) || dio_in_progress) {
+		/*
+		 * We may need to convert up to one extent per block in
+		 * the page and we may dirty the inode.
+		 */
+		rsv_blocks = 1 + ext4_chunk_trans_blocks(inode,
+						PAGE_SIZE >> inode->i_blkbits);
+	}
+
 	/*
 	 * First writeback pages that don't need mapping - we can avoid
 	 * starting a transaction unnecessarily and also avoid being blocked
@@ -5965,8 +5979,13 @@ vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf)
 		}
 	}
 	unlock_page(page);
-	/* OK, we need to fill the hole... */
-	if (ext4_should_dioread_nolock(inode))
+	/*
+	 * OK, we need to fill the hole...
+	 * By default use unwritten block allocation here to avoid a small
+	 * window of race (stale data read) with DIO read path.
+	 */
+	if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS) &&
+	    !ext4_should_journal_data(inode))
 		get_block = ext4_get_block_unwritten;
 	else
 		get_block = ext4_get_block;
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [RFC 1/2] iomap: direct-io: Move inode_dio_begin before filemap_write_and_wait_range
  2020-01-13 11:04 ` [RFC 1/2] iomap: direct-io: Move inode_dio_begin before filemap_write_and_wait_range Ritesh Harjani
@ 2020-01-13 21:51   ` Darrick J. Wong
  2020-01-14  9:05     ` Jan Kara
  2020-01-14  9:12   ` Jan Kara
  2020-01-14 16:37   ` Christoph Hellwig
  2 siblings, 1 reply; 17+ messages in thread
From: Darrick J. Wong @ 2020-01-13 21:51 UTC (permalink / raw)
  To: Ritesh Harjani; +Cc: linux-ext4, tytso, jack

On Mon, Jan 13, 2020 at 04:34:21PM +0530, Ritesh Harjani wrote:
> Some filesystems (e.g. ext4) need to know in it's writeback path, that
> whether DIO is in progress or not. This info may be needed to avoid the
> stale data exposure race with DIO reads.

Does XFS have this problem too?

Admittedly dio read during mmap write is probably not well supported. ;)

> Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
> ---
>  fs/iomap/direct-io.c | 17 +++++++++++++----

Might want to cc fsdevel and the iomap maintainers...

--D

>  1 file changed, 13 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index 23837926c0c5..d1c159bd3854 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -468,9 +468,18 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>  		flags |= IOMAP_NOWAIT;
>  	}
>  
> +	/*
> +	 * Call inode_dio_begin() before we write out and wait for writeback to
> +	 * complete. This may be needed by some filesystems to prevent race
> +	 * like stale data exposure by DIO reads.
> +	 */
> +	inode_dio_begin(inode);
> +	/* So that i_dio_count is incremented before below operation */
> +	smp_mb__after_atomic();
> +
>  	ret = filemap_write_and_wait_range(mapping, pos, end);
>  	if (ret)
> -		goto out_free_dio;
> +		goto out_end_dio;
>  
>  	/*
>  	 * Try to invalidate cache pages for the range we're direct
> @@ -488,11 +497,9 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>  	    !inode->i_sb->s_dio_done_wq) {
>  		ret = sb_init_dio_done_wq(inode->i_sb);
>  		if (ret < 0)
> -			goto out_free_dio;
> +			goto out_end_dio;
>  	}
>  
> -	inode_dio_begin(inode);
> -
>  	blk_start_plug(&plug);
>  	do {
>  		ret = iomap_apply(inode, pos, count, flags, ops, dio,
> @@ -568,6 +575,8 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>  
>  	return iomap_dio_complete(dio);
>  
> +out_end_dio:
> +	inode_dio_end(inode);
>  out_free_dio:
>  	kfree(dio);
>  	return ret;
> -- 
> 2.21.0
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC 1/2] iomap: direct-io: Move inode_dio_begin before filemap_write_and_wait_range
  2020-01-13 21:51   ` Darrick J. Wong
@ 2020-01-14  9:05     ` Jan Kara
  2020-01-14 16:38       ` Christoph Hellwig
  0 siblings, 1 reply; 17+ messages in thread
From: Jan Kara @ 2020-01-14  9:05 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Ritesh Harjani, linux-ext4, tytso, jack

On Mon 13-01-20 13:51:59, Darrick J. Wong wrote:
> On Mon, Jan 13, 2020 at 04:34:21PM +0530, Ritesh Harjani wrote:
> > Some filesystems (e.g. ext4) need to know in it's writeback path, that
> > whether DIO is in progress or not. This info may be needed to avoid the
> > stale data exposure race with DIO reads.
> 
> Does XFS have this problem too?
> 
> Admittedly dio read during mmap write is probably not well supported. ;)

Well, XFS always performs buffered writeback using unwritten extents so at
least the immediate problem of stale data exposure ext4 has does not happen
there AFAICT. 

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC 1/2] iomap: direct-io: Move inode_dio_begin before filemap_write_and_wait_range
  2020-01-13 11:04 ` [RFC 1/2] iomap: direct-io: Move inode_dio_begin before filemap_write_and_wait_range Ritesh Harjani
  2020-01-13 21:51   ` Darrick J. Wong
@ 2020-01-14  9:12   ` Jan Kara
  2020-01-14 16:37   ` Christoph Hellwig
  2 siblings, 0 replies; 17+ messages in thread
From: Jan Kara @ 2020-01-14  9:12 UTC (permalink / raw)
  To: Ritesh Harjani; +Cc: linux-ext4, tytso, jack

On Mon 13-01-20 16:34:21, Ritesh Harjani wrote:
> Some filesystems (e.g. ext4) need to know in it's writeback path, that
> whether DIO is in progress or not. This info may be needed to avoid the
> stale data exposure race with DIO reads.
> 
> Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
> ---
>  fs/iomap/direct-io.c | 17 +++++++++++++----
>  1 file changed, 13 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index 23837926c0c5..d1c159bd3854 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -468,9 +468,18 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>  		flags |= IOMAP_NOWAIT;
>  	}
>  
> +	/*
> +	 * Call inode_dio_begin() before we write out and wait for writeback to
> +	 * complete. This may be needed by some filesystems to prevent race
> +	 * like stale data exposure by DIO reads.
> +	 */
> +	inode_dio_begin(inode);
> +	/* So that i_dio_count is incremented before below operation */
> +	smp_mb__after_atomic();

I wonder if the barrier shouldn't go into inode_dio_begin() - as a sepatare
patch. Because people just treat this as a lock-kind-of-thingy. E.g. btrfs
or ceph use inode_dio_begin() in places which I'd consider prone to CPU
reordering issues without this barrier...

Otherwise the patch looks good to me.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC 2/2] ext4: Fix stale data read issue with DIO read & ext4_page_mkwrite path
  2020-01-13 11:04 ` [RFC 2/2] ext4: Fix stale data read issue with DIO read & ext4_page_mkwrite path Ritesh Harjani
@ 2020-01-14  9:47   ` Jan Kara
  2020-01-14 22:25     ` Ritesh Harjani
  0 siblings, 1 reply; 17+ messages in thread
From: Jan Kara @ 2020-01-14  9:47 UTC (permalink / raw)
  To: Ritesh Harjani; +Cc: linux-ext4, tytso, jack

On Mon 13-01-20 16:34:22, Ritesh Harjani wrote:
> Currently there is a small race window where ext4 tries to allocate
> a written block for mapped files and if DIO read is in progress, then
> this may result into stale data read exposure problem.
> 
> This patch fixes the mentioned issue by:
> 1. For non-delalloc path, page_mkwrite will use unwritten blocks by
>    default for extent based files.
> 
> 2. For delalloc path, we check if DIO is in progress during writeback.
>    If yes, then we use unwritten blocks method to avoid this race.
> 
> Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
> ---
>  fs/ext4/inode.c | 45 ++++++++++++++++++++++++++++++++-------------
>  1 file changed, 32 insertions(+), 13 deletions(-)
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index d035acab5b2a..07f66782335b 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -1529,6 +1529,7 @@ struct mpage_da_data {
>  	struct ext4_map_blocks map;
>  	struct ext4_io_submit io_submit;	/* IO submission data */
>  	unsigned int do_map:1;
> +	bool dio_in_progress:1;
>  };
>  
>  static void mpage_release_unused_pages(struct mpage_da_data *mpd,
> @@ -2359,7 +2360,7 @@ static int mpage_map_one_extent(handle_t *handle, struct mpage_da_data *mpd)
>  			   EXT4_GET_BLOCKS_METADATA_NOFAIL |
>  			   EXT4_GET_BLOCKS_IO_SUBMIT;
>  	dioread_nolock = ext4_should_dioread_nolock(inode);
> -	if (dioread_nolock)
> +	if (dioread_nolock || mpd->dio_in_progress)
>  		get_blocks_flags |= EXT4_GET_BLOCKS_IO_CREATE_EXT;
>  	if (map->m_flags & (1 << BH_Delay))
>  		get_blocks_flags |= EXT4_GET_BLOCKS_DELALLOC_RESERVE;
> @@ -2367,7 +2368,8 @@ static int mpage_map_one_extent(handle_t *handle, struct mpage_da_data *mpd)
>  	err = ext4_map_blocks(handle, inode, map, get_blocks_flags);
>  	if (err < 0)
>  		return err;
> -	if (dioread_nolock && (map->m_flags & EXT4_MAP_UNWRITTEN)) {
> +	if ((dioread_nolock || mpd->dio_in_progress) &&
> +	    (map->m_flags & EXT4_MAP_UNWRITTEN)) {
>  		if (!mpd->io_submit.io_end->handle &&
>  		    ext4_handle_valid(handle)) {
>  			mpd->io_submit.io_end->handle = handle->h_rsv_handle;
> @@ -2626,6 +2628,7 @@ static int ext4_writepages(struct address_space *mapping,
>  	bool done;
>  	struct blk_plug plug;
>  	bool give_up_on_write = false;
> +	bool dio_in_progress = false;
>  
>  	if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb))))
>  		return -EIO;
> @@ -2680,15 +2683,6 @@ static int ext4_writepages(struct address_space *mapping,
>  		ext4_journal_stop(handle);
>  	}
>  
> -	if (ext4_should_dioread_nolock(inode)) {
> -		/*
> -		 * We may need to convert up to one extent per block in
> -		 * the page and we may dirty the inode.
> -		 */
> -		rsv_blocks = 1 + ext4_chunk_trans_blocks(inode,
> -						PAGE_SIZE >> inode->i_blkbits);
> -	}
> -
>  	if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX)
>  		range_whole = 1;
>  
> @@ -2712,6 +2706,26 @@ static int ext4_writepages(struct address_space *mapping,
>  	done = false;
>  	blk_start_plug(&plug);
>  
> +	/*
> +	 * If DIO is in progress, then we use unwritten blocks for allocation.
> +	 * This is to avoid a small window of race (stale read) with
> +	 * ext4_page_mkwrite path in delalloc case & with DIO read in parallel.
> +	 *
> +	 * Let's check for i_dio_count after we have tagged pages for writeback.
> +	 */
> +	smp_mb__before_atomic();
> +	dio_in_progress = !!atomic_read(&inode->i_dio_count);
> +	mpd.dio_in_progress = dio_in_progress;

Two problems here:

1) smp_mb__before_atomic() does not work with atomic_read(). This kind of
barrier works only with read-modify-write kinds of atomic operations like
atomic_inc(). See Documentation/atomic_t.txt for more details.

2) Even if the barrier worked, this is still too early for the check.
Consider the following race:

Task 1 - flusher		Task 2 - dio read	Task 3 - fault
ext4_writepages()
  atomic_read(&inode->i_dio_count) -> 0
  ...
				iomap_dio_rw()
				  inode_dio_begin()
				  filemap_write_and_wait_range()
				  ...
							ext4_page_mkwrite()
							  fills hole at index I
  ...
  mpage_prepare_extent_to_map()
    finds dirty page at index I - tagging
    not in effect because this is WB_SYNC_NONE
    writeback so we look for PAGECACHE_TAG_DIRTY
    mpage_map_and_submit_extent()
      - allocates block for page I
				  ext4_iomap_begin()
				    finds block under offset I
				  submit_bio()
				    - reads stale data

And what I wanted to use to stop this race is page lock / page writeback
bit on page 'I' because filemap_write_and_wait_range() called from
iomap_dio_rw() ends up waiting for both if the page is seen as dirty. For
this to work, you need to check inode->i_dio_count after you hold the page
locks for written range - i.e., after mpage_prepare_extent_to_map(). And
that means you always have to have rsv_blocks set when starting a
transaction because you don't know in advance whether you'll need them or
not.

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC 1/2] iomap: direct-io: Move inode_dio_begin before filemap_write_and_wait_range
  2020-01-13 11:04 ` [RFC 1/2] iomap: direct-io: Move inode_dio_begin before filemap_write_and_wait_range Ritesh Harjani
  2020-01-13 21:51   ` Darrick J. Wong
  2020-01-14  9:12   ` Jan Kara
@ 2020-01-14 16:37   ` Christoph Hellwig
  2020-01-14 17:19     ` Jan Kara
  2 siblings, 1 reply; 17+ messages in thread
From: Christoph Hellwig @ 2020-01-14 16:37 UTC (permalink / raw)
  To: Ritesh Harjani; +Cc: linux-ext4, tytso, jack

Please add at very least the fsdevel and xfs lists to iomap patches.

Using i_dio_count for any kind of detection is bogus.  If you want to
pass flags to the writeback code please do so explicitly through
struct writeback_control.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC 1/2] iomap: direct-io: Move inode_dio_begin before filemap_write_and_wait_range
  2020-01-14  9:05     ` Jan Kara
@ 2020-01-14 16:38       ` Christoph Hellwig
  2020-01-15  9:19         ` Jan Kara
  0 siblings, 1 reply; 17+ messages in thread
From: Christoph Hellwig @ 2020-01-14 16:38 UTC (permalink / raw)
  To: Jan Kara; +Cc: Darrick J. Wong, Ritesh Harjani, linux-ext4, tytso

On Tue, Jan 14, 2020 at 10:05:07AM +0100, Jan Kara wrote:
> 
> Well, XFS always performs buffered writeback using unwritten extents so at
> least the immediate problem of stale data exposure ext4 has does not happen
> there AFAICT. 

Currently XFS never uses unwritten extents when converting delalloc
extents.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC 0/2] ext4: Fix stale data read exposure problem with DIO read/page_mkwrite
  2020-01-13 11:04 [RFC 0/2] ext4: Fix stale data read exposure problem with DIO read/page_mkwrite Ritesh Harjani
  2020-01-13 11:04 ` [RFC 1/2] iomap: direct-io: Move inode_dio_begin before filemap_write_and_wait_range Ritesh Harjani
  2020-01-13 11:04 ` [RFC 2/2] ext4: Fix stale data read issue with DIO read & ext4_page_mkwrite path Ritesh Harjani
@ 2020-01-14 16:39 ` Christoph Hellwig
  2020-01-14 22:33   ` Ritesh Harjani
  2 siblings, 1 reply; 17+ messages in thread
From: Christoph Hellwig @ 2020-01-14 16:39 UTC (permalink / raw)
  To: Ritesh Harjani; +Cc: linux-ext4, tytso, jack

> Currently there is a small race window as pointed out by Jan [2] where, when
> ext4 tries to allocate a written block for mapped files and if DIO read is in
> progress, then this may result into stale data read exposure problem.

Do we have a test case for the problem?

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC 1/2] iomap: direct-io: Move inode_dio_begin before filemap_write_and_wait_range
  2020-01-14 16:37   ` Christoph Hellwig
@ 2020-01-14 17:19     ` Jan Kara
  2020-01-14 18:27       ` Christoph Hellwig
  0 siblings, 1 reply; 17+ messages in thread
From: Jan Kara @ 2020-01-14 17:19 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Ritesh Harjani, linux-ext4, tytso, jack

On Tue 14-01-20 08:37:02, Christoph Hellwig wrote:
> Using i_dio_count for any kind of detection is bogus.  If you want to
> pass flags to the writeback code please do so explicitly through
> struct writeback_control.

We want to detect in the writeback path whether there's direct IO (read)
currently running for the inode. Not for the writeback issued from
iomap_dio_rw() but for any arbitrary writeback that iomap_dio_rw() can be
racing with - so struct writeback_control won't help. Now if you want to
see the ugly details why this hack is needed, see my other email to Ritesh
in this thread with details of the race.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC 1/2] iomap: direct-io: Move inode_dio_begin before filemap_write_and_wait_range
  2020-01-14 17:19     ` Jan Kara
@ 2020-01-14 18:27       ` Christoph Hellwig
  2020-01-15  9:08         ` Jan Kara
  0 siblings, 1 reply; 17+ messages in thread
From: Christoph Hellwig @ 2020-01-14 18:27 UTC (permalink / raw)
  To: Jan Kara; +Cc: Christoph Hellwig, Ritesh Harjani, linux-ext4, tytso

On Tue, Jan 14, 2020 at 06:19:34PM +0100, Jan Kara wrote:
> We want to detect in the writeback path whether there's direct IO (read)
> currently running for the inode. Not for the writeback issued from
> iomap_dio_rw() but for any arbitrary writeback that iomap_dio_rw() can be
> racing with - so struct writeback_control won't help. Now if you want to
> see the ugly details why this hack is needed, see my other email to Ritesh
> in this thread with details of the race.

How do we get other writeback after iomap_dio_rw wrote everything out?
Either way I'm trying to kill i_dio_count as it has all kinds of
problems, see the patch sent out earlier today.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC 2/2] ext4: Fix stale data read issue with DIO read & ext4_page_mkwrite path
  2020-01-14  9:47   ` Jan Kara
@ 2020-01-14 22:25     ` Ritesh Harjani
  0 siblings, 0 replies; 17+ messages in thread
From: Ritesh Harjani @ 2020-01-14 22:25 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4, tytso

Hello Jan,

On 1/14/20 3:17 PM, Jan Kara wrote:
> On Mon 13-01-20 16:34:22, Ritesh Harjani wrote:
>> Currently there is a small race window where ext4 tries to allocate
>> a written block for mapped files and if DIO read is in progress, then
>> this may result into stale data read exposure problem.
>>
>> This patch fixes the mentioned issue by:
>> 1. For non-delalloc path, page_mkwrite will use unwritten blocks by
>>     default for extent based files.
>>
>> 2. For delalloc path, we check if DIO is in progress during writeback.
>>     If yes, then we use unwritten blocks method to avoid this race.
>>
>> Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
>> ---
>>   fs/ext4/inode.c | 45 ++++++++++++++++++++++++++++++++-------------
>>   1 file changed, 32 insertions(+), 13 deletions(-)
>>
>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>> index d035acab5b2a..07f66782335b 100644
>> --- a/fs/ext4/inode.c
>> +++ b/fs/ext4/inode.c
>> @@ -1529,6 +1529,7 @@ struct mpage_da_data {
>>   	struct ext4_map_blocks map;
>>   	struct ext4_io_submit io_submit;	/* IO submission data */
>>   	unsigned int do_map:1;
>> +	bool dio_in_progress:1;
>>   };
>>   
>>   static void mpage_release_unused_pages(struct mpage_da_data *mpd,
>> @@ -2359,7 +2360,7 @@ static int mpage_map_one_extent(handle_t *handle, struct mpage_da_data *mpd)
>>   			   EXT4_GET_BLOCKS_METADATA_NOFAIL |
>>   			   EXT4_GET_BLOCKS_IO_SUBMIT;
>>   	dioread_nolock = ext4_should_dioread_nolock(inode);
>> -	if (dioread_nolock)
>> +	if (dioread_nolock || mpd->dio_in_progress)
>>   		get_blocks_flags |= EXT4_GET_BLOCKS_IO_CREATE_EXT;
>>   	if (map->m_flags & (1 << BH_Delay))
>>   		get_blocks_flags |= EXT4_GET_BLOCKS_DELALLOC_RESERVE;
>> @@ -2367,7 +2368,8 @@ static int mpage_map_one_extent(handle_t *handle, struct mpage_da_data *mpd)
>>   	err = ext4_map_blocks(handle, inode, map, get_blocks_flags);
>>   	if (err < 0)
>>   		return err;
>> -	if (dioread_nolock && (map->m_flags & EXT4_MAP_UNWRITTEN)) {
>> +	if ((dioread_nolock || mpd->dio_in_progress) &&
>> +	    (map->m_flags & EXT4_MAP_UNWRITTEN)) {
>>   		if (!mpd->io_submit.io_end->handle &&
>>   		    ext4_handle_valid(handle)) {
>>   			mpd->io_submit.io_end->handle = handle->h_rsv_handle;
>> @@ -2626,6 +2628,7 @@ static int ext4_writepages(struct address_space *mapping,
>>   	bool done;
>>   	struct blk_plug plug;
>>   	bool give_up_on_write = false;
>> +	bool dio_in_progress = false;
>>   
>>   	if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb))))
>>   		return -EIO;
>> @@ -2680,15 +2683,6 @@ static int ext4_writepages(struct address_space *mapping,
>>   		ext4_journal_stop(handle);
>>   	}
>>   
>> -	if (ext4_should_dioread_nolock(inode)) {
>> -		/*
>> -		 * We may need to convert up to one extent per block in
>> -		 * the page and we may dirty the inode.
>> -		 */
>> -		rsv_blocks = 1 + ext4_chunk_trans_blocks(inode,
>> -						PAGE_SIZE >> inode->i_blkbits);
>> -	}
>> -
>>   	if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX)
>>   		range_whole = 1;
>>   
>> @@ -2712,6 +2706,26 @@ static int ext4_writepages(struct address_space *mapping,
>>   	done = false;
>>   	blk_start_plug(&plug);
>>   
>> +	/*
>> +	 * If DIO is in progress, then we use unwritten blocks for allocation.
>> +	 * This is to avoid a small window of race (stale read) with
>> +	 * ext4_page_mkwrite path in delalloc case & with DIO read in parallel.
>> +	 *
>> +	 * Let's check for i_dio_count after we have tagged pages for writeback.
>> +	 */
>> +	smp_mb__before_atomic();
>> +	dio_in_progress = !!atomic_read(&inode->i_dio_count);
>> +	mpd.dio_in_progress = dio_in_progress;
> 
> Two problems here:
> 
> 1) smp_mb__before_atomic() does not work with atomic_read(). This kind of
> barrier works only with read-modify-write kinds of atomic operations like
> atomic_inc(). See Documentation/atomic_t.txt for more details.

Yes, I was not 100% sure on that part. But thanks for confirmation.


> 
> 2) Even if the barrier worked, this is still too early for the check.
> Consider the following race:
> 
> Task 1 - flusher		Task 2 - dio read	Task 3 - fault
> ext4_writepages()
>    atomic_read(&inode->i_dio_count) -> 0
>    ...
> 				iomap_dio_rw()
> 				  inode_dio_begin()
> 				  filemap_write_and_wait_range()
> 				  ...
> 							ext4_page_mkwrite()
> 							  fills hole at index I
>    ...
>    mpage_prepare_extent_to_map()
>      finds dirty page at index I - tagging
>      not in effect because this is WB_SYNC_NONE
>      writeback so we look for PAGECACHE_TAG_DIRTY
>      mpage_map_and_submit_extent()
>        - allocates block for page I
> 				  ext4_iomap_begin()
> 				    finds block under offset I
> 				  submit_bio()
> 				    - reads stale data

My bad. So a fault at hole may set the page dirty bit, while
ext4_writepages may be in progress. And if we check for i_dio_count very
early, then we still end up exposing stale data.

> 
> And what I wanted to use to stop this race is page lock / page writeback
> bit on page 'I' because filemap_write_and_wait_range() called from
> iomap_dio_rw() ends up waiting for both if the page is seen as dirty. For

Yes, agreed here. I guess earlier I was thinking of simplifying it
by checking for DIO early on so that we could use the same type of
extent type mapping (unwritten/written) for ext4_writepages.


> this to work, you need to check inode->i_dio_count after you hold the page
> locks for written range - i.e., after mpage_prepare_extent_to_map(). And

Yes, I am thinking we should add that check in 
mpage_map_and_submit_extent() before the do while{} loop.
I guess we should keep extent type mapping common for a given io_end
type, since we have to do unwritten to written handling at the end
of IO transfer.
This can be done if we check for DIO in progress early inside
mpage_map_and_submit_extent(), because by then, we already have the page
lock in place.

But let me also check more on this.

> that means you always have to have rsv_blocks set when starting a
> transaction because you don't know in advance whether you'll need them or
> not.

Yes. Thanks for pointing.


-ritesh


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC 0/2] ext4: Fix stale data read exposure problem with DIO read/page_mkwrite
  2020-01-14 16:39 ` [RFC 0/2] ext4: Fix stale data read exposure problem with DIO read/page_mkwrite Christoph Hellwig
@ 2020-01-14 22:33   ` Ritesh Harjani
  0 siblings, 0 replies; 17+ messages in thread
From: Ritesh Harjani @ 2020-01-14 22:33 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-ext4, tytso, jack, darrick.wong



On 1/14/20 10:09 PM, Christoph Hellwig wrote:
>> Currently there is a small race window as pointed out by Jan [2] where, when
>> ext4 tries to allocate a written block for mapped files and if DIO read is in
>> progress, then this may result into stale data read exposure problem.
> 
> Do we have a test case for the problem?

I am not very sure if we have it in xfstests, (since I guess, DIO read 
during mmap writes is not well supported anyways).
But sure I was anyway thinking of writing one for my unit testing. Till
now I was mainly following that theoretically it is possible, although
it may be hard to catch it practically.


> Please add at very least the fsdevel and xfs lists to iomap patches.

Yes, sorry about that. Will cc' in next time.


-ritesh


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC 1/2] iomap: direct-io: Move inode_dio_begin before filemap_write_and_wait_range
  2020-01-14 18:27       ` Christoph Hellwig
@ 2020-01-15  9:08         ` Jan Kara
  0 siblings, 0 replies; 17+ messages in thread
From: Jan Kara @ 2020-01-15  9:08 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jan Kara, Ritesh Harjani, linux-ext4, tytso

On Tue 14-01-20 10:27:36, Christoph Hellwig wrote:
> On Tue, Jan 14, 2020 at 06:19:34PM +0100, Jan Kara wrote:
> > We want to detect in the writeback path whether there's direct IO (read)
> > currently running for the inode. Not for the writeback issued from
> > iomap_dio_rw() but for any arbitrary writeback that iomap_dio_rw() can be
> > racing with - so struct writeback_control won't help. Now if you want to
> > see the ugly details why this hack is needed, see my other email to Ritesh
> > in this thread with details of the race.
> 
> How do we get other writeback after iomap_dio_rw wrote everything out?

You create dirty page using mmap in the range read by iomap_dio_rw() and then
background writeback happens at unfortunate time... Email [1] has the exact
traces.

> Either way I'm trying to kill i_dio_count as it has all kinds of
> problems, see the patch sent out earlier today.

OK, I'll see that patch.

								Honza

[1] https://lore.kernel.org/linux-ext4/20200114094741.GC6466@quack2.suse.cz
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC 1/2] iomap: direct-io: Move inode_dio_begin before filemap_write_and_wait_range
  2020-01-14 16:38       ` Christoph Hellwig
@ 2020-01-15  9:19         ` Jan Kara
  2020-01-15 14:56           ` Christoph Hellwig
  0 siblings, 1 reply; 17+ messages in thread
From: Jan Kara @ 2020-01-15  9:19 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, Darrick J. Wong, Ritesh Harjani, linux-ext4, tytso

On Tue 14-01-20 08:38:18, Christoph Hellwig wrote:
> On Tue, Jan 14, 2020 at 10:05:07AM +0100, Jan Kara wrote:
> > 
> > Well, XFS always performs buffered writeback using unwritten extents so at
> > least the immediate problem of stale data exposure ext4 has does not happen
> > there AFAICT. 
> 
> Currently XFS never uses unwritten extents when converting delalloc
> extents.

I see, it is a long time since I last looked at that part of XFS code. So
then I think XFS might be prone to the same kind of race and data exposure
as I outlined in [1]...

								Honza

[1] https://lore.kernel.org/linux-ext4/20200114094741.GC6466@quack2.suse.cz

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC 1/2] iomap: direct-io: Move inode_dio_begin before filemap_write_and_wait_range
  2020-01-15  9:19         ` Jan Kara
@ 2020-01-15 14:56           ` Christoph Hellwig
  0 siblings, 0 replies; 17+ messages in thread
From: Christoph Hellwig @ 2020-01-15 14:56 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christoph Hellwig, Darrick J. Wong, Ritesh Harjani, linux-ext4, tytso

On Wed, Jan 15, 2020 at 10:19:25AM +0100, Jan Kara wrote:
> On Tue 14-01-20 08:38:18, Christoph Hellwig wrote:
> > On Tue, Jan 14, 2020 at 10:05:07AM +0100, Jan Kara wrote:
> > > 
> > > Well, XFS always performs buffered writeback using unwritten extents so at
> > > least the immediate problem of stale data exposure ext4 has does not happen
> > > there AFAICT. 
> > 
> > Currently XFS never uses unwritten extents when converting delalloc
> > extents.
> 
> I see, it is a long time since I last looked at that part of XFS code. So
> then I think XFS might be prone to the same kind of race and data exposure
> as I outlined in [1]...

I think not using unwritten extents for filling holes inside i_size will
always lead to the potential for stale data exposure in one form or
another.  Because of that Darrick has started looking into always using
unwritten extents for buffered writes inside i_size.

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2020-01-15 14:56 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-01-13 11:04 [RFC 0/2] ext4: Fix stale data read exposure problem with DIO read/page_mkwrite Ritesh Harjani
2020-01-13 11:04 ` [RFC 1/2] iomap: direct-io: Move inode_dio_begin before filemap_write_and_wait_range Ritesh Harjani
2020-01-13 21:51   ` Darrick J. Wong
2020-01-14  9:05     ` Jan Kara
2020-01-14 16:38       ` Christoph Hellwig
2020-01-15  9:19         ` Jan Kara
2020-01-15 14:56           ` Christoph Hellwig
2020-01-14  9:12   ` Jan Kara
2020-01-14 16:37   ` Christoph Hellwig
2020-01-14 17:19     ` Jan Kara
2020-01-14 18:27       ` Christoph Hellwig
2020-01-15  9:08         ` Jan Kara
2020-01-13 11:04 ` [RFC 2/2] ext4: Fix stale data read issue with DIO read & ext4_page_mkwrite path Ritesh Harjani
2020-01-14  9:47   ` Jan Kara
2020-01-14 22:25     ` Ritesh Harjani
2020-01-14 16:39 ` [RFC 0/2] ext4: Fix stale data read exposure problem with DIO read/page_mkwrite Christoph Hellwig
2020-01-14 22:33   ` Ritesh Harjani

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.