linux-fsdevel.vger.kernel.org archive mirror
* [PATCH 0/2] fs: Hole punch vs page cache filling races
@ 2019-06-03 13:21 Jan Kara
  2019-06-03 13:21 ` [PATCH 1/2] mm: Add readahead file operation Jan Kara
  2019-06-03 13:21 ` [PATCH 2/2] ext4: Fix stale data exposure when read races with hole punch Jan Kara
  0 siblings, 2 replies; 9+ messages in thread
From: Jan Kara @ 2019-06-03 13:21 UTC (permalink / raw)
  To: linux-ext4; +Cc: Ted Tso, linux-mm, linux-fsdevel, Amir Goldstein, Jan Kara

Hello,

Amir has reported [1] that ext4 has a potential issue when reads race with
hole punching, possibly exposing stale data from freed blocks or even
corrupting the filesystem when stale mapping data gets used for writeout. The
problem is that during hole punching, new page cache pages can get
instantiated in the punched range after truncate_inode_pages() has run but
before the filesystem removes blocks from the file. In principle, any
filesystem implementing hole punching thus needs a mechanism to block
instantiation of page cache pages during hole punching to avoid this race.
This is further complicated by the fact that there are multiple places that
can instantiate pages in the page cache: a regular read(2) or a page fault
can do it, but fadvise(2) or madvise(2) can also result in reading in page
cache pages through force_page_cache_readahead().
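
To illustrate, the problematic interleaving looks roughly like this (a
simplified sketch of the ordering, not actual code):

  hole punch                            read(2)
  ----------                            -------
  inode_lock(inode)
  down_write(&EXT4_I(inode)->i_mmap_sem)
  truncate_inode_pages()
                                        generic_file_read_iter()
                                          instantiates a page in the punched
                                          range, mapping not-yet-freed blocks
  <remove blocks from the file>
                                        the cached page now points to freed
                                        (possibly reallocated) blocks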

This patch set fixes the problem for ext4 by protecting all page cache filling
operations with EXT4_I(inode)->i_mmap_sem. To be able to do that for
readahead, we introduce a new ->readahead file operation and a corresponding
vfs_readahead() helper. Note that e.g. ->readpages() cannot be used for taking
the appropriate lock - we also need to protect the ordinary read path using
->readpage(), and there is no way to distinguish ->readpages() called through
->read_iter() from ->readpages() called e.g. through fadvise(2).
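
For clarity, the page cache filling paths that end up taking the lock after
this series are (simplified):

  read(2)       -> ->read_iter()   -> fs takes i_mmap_sem (patch 2)
  page fault    -> ->fault()       -> fs takes i_mmap_sem (already the case)
  fadvise(2) / madvise(2) WILLNEED
                -> vfs_readahead() -> ->readahead() -> fs takes i_mmap_sem
                                      and calls generic_readahead() (1+2)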

Other filesystems (e.g. XFS, F2FS, GFS2, OCFS2, ...) need a similar fix. I can
write some (e.g. for XFS) once we settle that the new ->readahead operation is
indeed the way to fix this.
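
For illustration, the XFS hook might look like the sketch below (untested;
xfs_file_readahead is a hypothetical name, and I'm assuming taking
XFS_MMAPLOCK_SHARED is enough to exclude hole punch, which takes that lock
exclusively):

  static int
  xfs_file_readahead(
  	struct file	*filp,
  	loff_t		start,
  	loff_t		end)
  {
  	struct xfs_inode	*ip = XFS_I(file_inode(filp));
  	int			ret;

  	/* block racing hole punch, which holds XFS_MMAPLOCK_EXCL */
  	xfs_ilock(ip, XFS_MMAPLOCK_SHARED);
  	ret = generic_readahead(filp, start, end);
  	xfs_iunlock(ip, XFS_MMAPLOCK_SHARED);
  	return ret;
  }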

								Honza

[1] https://lore.kernel.org/linux-fsdevel/CAOQ4uxjQNmxqmtA_VbYW0Su9rKRk2zobJmahcyeaEVOFKVQ5dw@mail.gmail.com/


* [PATCH 1/2] mm: Add readahead file operation
  2019-06-03 13:21 [PATCH 0/2] fs: Hole punch vs page cache filling races Jan Kara
@ 2019-06-03 13:21 ` Jan Kara
  2019-06-03 16:16   ` Amir Goldstein
  2019-06-03 13:21 ` [PATCH 2/2] ext4: Fix stale data exposure when read races with hole punch Jan Kara
  1 sibling, 1 reply; 9+ messages in thread
From: Jan Kara @ 2019-06-03 13:21 UTC (permalink / raw)
  To: linux-ext4
  Cc: Ted Tso, linux-mm, linux-fsdevel, Amir Goldstein, Jan Kara, stable

Some filesystems need to acquire locks before pages are read into the page
cache to protect from races with hole punching. The lock generally cannot
be acquired within ->readpage as it ranks above the page lock, so we are
left with acquiring the lock within the filesystem's ->read_iter
implementation for normal reads and its ->fault implementation during page
faults. That however does not cover all the paths through which pages can
be instantiated in the page cache - namely explicitly requested readahead.
Add a new ->readahead file operation which filesystems can use for this.

CC: stable@vger.kernel.org # Needed by following ext4 fix
Signed-off-by: Jan Kara <jack@suse.cz>
---
 include/linux/fs.h |  5 +++++
 include/linux/mm.h |  3 ---
 mm/fadvise.c       | 12 +-----------
 mm/madvise.c       |  3 ++-
 mm/readahead.c     | 26 ++++++++++++++++++++++++--
 5 files changed, 32 insertions(+), 17 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index f7fdfe93e25d..9968abcd06ea 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1828,6 +1828,7 @@ struct file_operations {
 				   struct file *file_out, loff_t pos_out,
 				   loff_t len, unsigned int remap_flags);
 	int (*fadvise)(struct file *, loff_t, loff_t, int);
+	int (*readahead)(struct file *, loff_t, loff_t);
 } __randomize_layout;
 
 struct inode_operations {
@@ -3537,6 +3538,10 @@ extern void inode_nohighmem(struct inode *inode);
 extern int vfs_fadvise(struct file *file, loff_t offset, loff_t len,
 		       int advice);
 
+/* mm/readahead.c */
+extern int generic_readahead(struct file *filp, loff_t start, loff_t end);
+extern int vfs_readahead(struct file *filp, loff_t start, loff_t end);
+
 #if defined(CONFIG_IO_URING)
 extern struct sock *io_uring_get_socket(struct file *file);
 #else
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0e8834ac32b7..8f6597295920 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2461,9 +2461,6 @@ void task_dirty_inc(struct task_struct *tsk);
 /* readahead.c */
 #define VM_READAHEAD_PAGES	(SZ_128K / PAGE_SIZE)
 
-int force_page_cache_readahead(struct address_space *mapping, struct file *filp,
-			pgoff_t offset, unsigned long nr_to_read);
-
 void page_cache_sync_readahead(struct address_space *mapping,
 			       struct file_ra_state *ra,
 			       struct file *filp,
diff --git a/mm/fadvise.c b/mm/fadvise.c
index 467bcd032037..e5aab207550e 100644
--- a/mm/fadvise.c
+++ b/mm/fadvise.c
@@ -36,7 +36,6 @@ static int generic_fadvise(struct file *file, loff_t offset, loff_t len,
 	loff_t endbyte;			/* inclusive */
 	pgoff_t start_index;
 	pgoff_t end_index;
-	unsigned long nrpages;
 
 	inode = file_inode(file);
 	if (S_ISFIFO(inode->i_mode))
@@ -94,20 +93,11 @@ static int generic_fadvise(struct file *file, loff_t offset, loff_t len,
 		spin_unlock(&file->f_lock);
 		break;
 	case POSIX_FADV_WILLNEED:
-		/* First and last PARTIAL page! */
-		start_index = offset >> PAGE_SHIFT;
-		end_index = endbyte >> PAGE_SHIFT;
-
-		/* Careful about overflow on the "+1" */
-		nrpages = end_index - start_index + 1;
-		if (!nrpages)
-			nrpages = ~0UL;
-
 		/*
 		 * Ignore return value because fadvise() shall return
 		 * success even if filesystem can't retrieve a hint,
 		 */
-		force_page_cache_readahead(mapping, file, start_index, nrpages);
+		vfs_readahead(file, offset, endbyte);
 		break;
 	case POSIX_FADV_NOREUSE:
 		break;
diff --git a/mm/madvise.c b/mm/madvise.c
index 628022e674a7..9111b75e88cf 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -303,7 +303,8 @@ static long madvise_willneed(struct vm_area_struct *vma,
 		end = vma->vm_end;
 	end = ((end - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
 
-	force_page_cache_readahead(file->f_mapping, file, start, end - start);
+	vfs_readahead(file, (loff_t)start << PAGE_SHIFT,
+		      (loff_t)end << PAGE_SHIFT);
 	return 0;
 }
 
diff --git a/mm/readahead.c b/mm/readahead.c
index 2fe72cd29b47..e66ae8c764ad 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -219,8 +219,9 @@ unsigned int __do_page_cache_readahead(struct address_space *mapping,
  * Chunk the readahead into 2 megabyte units, so that we don't pin too much
  * memory at once.
  */
-int force_page_cache_readahead(struct address_space *mapping, struct file *filp,
-			       pgoff_t offset, unsigned long nr_to_read)
+static int force_page_cache_readahead(struct address_space *mapping,
+				      struct file *filp, pgoff_t offset,
+				      unsigned long nr_to_read)
 {
 	struct backing_dev_info *bdi = inode_to_bdi(mapping->host);
 	struct file_ra_state *ra = &filp->f_ra;
@@ -248,6 +249,20 @@ int force_page_cache_readahead(struct address_space *mapping, struct file *filp,
 	return 0;
 }
 
+int generic_readahead(struct file *filp, loff_t start, loff_t end)
+{
+	pgoff_t first, last;
+	unsigned long count;
+
+	first = start >> PAGE_SHIFT;
+	last = end >> PAGE_SHIFT;
+	count = last - first + 1;
+	if (!count)
+		count = ~0UL;
+	return force_page_cache_readahead(filp->f_mapping, filp, first, count);
+}
+EXPORT_SYMBOL_GPL(generic_readahead);
+
 /*
  * Set the initial window size, round to next power of 2 and square
  * for small size, x 4 for medium, and x 2 for large
@@ -575,6 +590,13 @@ page_cache_async_readahead(struct address_space *mapping,
 }
 EXPORT_SYMBOL_GPL(page_cache_async_readahead);
 
+int vfs_readahead(struct file *filp, loff_t start, loff_t end)
+{
+	if (filp->f_op->readahead)
+		return filp->f_op->readahead(filp, start, end);
+	return generic_readahead(filp, start, end);
+}
+
 ssize_t ksys_readahead(int fd, loff_t offset, size_t count)
 {
 	ssize_t ret;
-- 
2.16.4



* [PATCH 2/2] ext4: Fix stale data exposure when read races with hole punch
  2019-06-03 13:21 [PATCH 0/2] fs: Hole punch vs page cache filling races Jan Kara
  2019-06-03 13:21 ` [PATCH 1/2] mm: Add readahead file operation Jan Kara
@ 2019-06-03 13:21 ` Jan Kara
  2019-06-03 16:33   ` Amir Goldstein
  2019-06-05  1:25   ` Dave Chinner
  1 sibling, 2 replies; 9+ messages in thread
From: Jan Kara @ 2019-06-03 13:21 UTC (permalink / raw)
  To: linux-ext4
  Cc: Ted Tso, linux-mm, linux-fsdevel, Amir Goldstein, Jan Kara, stable

Hole punching currently evicts pages from the page cache and then goes on
to remove blocks from the inode. This happens with both i_mmap_sem and
i_rwsem held exclusively, which provides appropriate serialization with
racing page faults. However, there is currently nothing that prevents an
ordinary read(2) from racing with the hole punch and instantiating a page
cache page after hole punching has evicted the page cache but before it
has removed blocks from the inode. This page cache page will be mapping a
soon-to-be-freed block and that can lead to returning stale data to
userspace or even filesystem corruption.

Fix the problem by protecting reads as well as readahead requests with
i_mmap_sem.

CC: stable@vger.kernel.org
Reported-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/file.c | 35 +++++++++++++++++++++++++++++++----
 1 file changed, 31 insertions(+), 4 deletions(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 2c5baa5e8291..a21fa9f8fb5d 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -34,6 +34,17 @@
 #include "xattr.h"
 #include "acl.h"
 
+static ssize_t ext4_file_buffered_read(struct kiocb *iocb, struct iov_iter *to)
+{
+	ssize_t ret;
+	struct inode *inode = file_inode(iocb->ki_filp);
+
+	down_read(&EXT4_I(inode)->i_mmap_sem);
+	ret = generic_file_read_iter(iocb, to);
+	up_read(&EXT4_I(inode)->i_mmap_sem);
+	return ret;
+}
+
 #ifdef CONFIG_FS_DAX
 static ssize_t ext4_dax_read_iter(struct kiocb *iocb, struct iov_iter *to)
 {
@@ -52,7 +63,7 @@ static ssize_t ext4_dax_read_iter(struct kiocb *iocb, struct iov_iter *to)
 	if (!IS_DAX(inode)) {
 		inode_unlock_shared(inode);
 		/* Fallback to buffered IO in case we cannot support DAX */
-		return generic_file_read_iter(iocb, to);
+		return ext4_file_buffered_read(iocb, to);
 	}
 	ret = dax_iomap_rw(iocb, to, &ext4_iomap_ops);
 	inode_unlock_shared(inode);
@@ -64,17 +75,32 @@ static ssize_t ext4_dax_read_iter(struct kiocb *iocb, struct iov_iter *to)
 
 static ssize_t ext4_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 {
-	if (unlikely(ext4_forced_shutdown(EXT4_SB(file_inode(iocb->ki_filp)->i_sb))))
+	struct inode *inode = file_inode(iocb->ki_filp);
+
+	if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb))))
 		return -EIO;
 
 	if (!iov_iter_count(to))
 		return 0; /* skip atime */
 
 #ifdef CONFIG_FS_DAX
-	if (IS_DAX(file_inode(iocb->ki_filp)))
+	if (IS_DAX(inode))
 		return ext4_dax_read_iter(iocb, to);
 #endif
-	return generic_file_read_iter(iocb, to);
+	if (iocb->ki_flags & IOCB_DIRECT)
+		return generic_file_read_iter(iocb, to);
+	return ext4_file_buffered_read(iocb, to);
+}
+
+static int ext4_readahead(struct file *filp, loff_t start, loff_t end)
+{
+	struct inode *inode = file_inode(filp);
+	int ret;
+
+	down_read(&EXT4_I(inode)->i_mmap_sem);
+	ret = generic_readahead(filp, start, end);
+	up_read(&EXT4_I(inode)->i_mmap_sem);
+	return ret;
 }
 
 /*
@@ -518,6 +544,7 @@ const struct file_operations ext4_file_operations = {
 	.splice_read	= generic_file_splice_read,
 	.splice_write	= iter_file_splice_write,
 	.fallocate	= ext4_fallocate,
+	.readahead	= ext4_readahead,
 };
 
 const struct inode_operations ext4_file_inode_operations = {
-- 
2.16.4



* Re: [PATCH 1/2] mm: Add readahead file operation
  2019-06-03 13:21 ` [PATCH 1/2] mm: Add readahead file operation Jan Kara
@ 2019-06-03 16:16   ` Amir Goldstein
  2019-06-04  8:00     ` Jan Kara
  0 siblings, 1 reply; 9+ messages in thread
From: Amir Goldstein @ 2019-06-03 16:16 UTC (permalink / raw)
  To: Jan Kara; +Cc: Ext4, Ted Tso, Linux MM, linux-fsdevel, stable, Miklos Szeredi

On Mon, Jun 3, 2019 at 4:22 PM Jan Kara <jack@suse.cz> wrote:
>
> Some filesystems need to acquire locks before pages are read into the page
> cache to protect from races with hole punching. The lock generally cannot
> be acquired within ->readpage as it ranks above the page lock, so we are
> left with acquiring the lock within the filesystem's ->read_iter
> implementation for normal reads and its ->fault implementation during page
> faults. That however does not cover all the paths through which pages can
> be instantiated in the page cache - namely explicitly requested readahead.
> Add a new ->readahead file operation which filesystems can use for this.
>
> CC: stable@vger.kernel.org # Needed by following ext4 fix
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  include/linux/fs.h |  5 +++++
>  include/linux/mm.h |  3 ---
>  mm/fadvise.c       | 12 +-----------
>  mm/madvise.c       |  3 ++-
>  mm/readahead.c     | 26 ++++++++++++++++++++++++--
>  5 files changed, 32 insertions(+), 17 deletions(-)
>
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index f7fdfe93e25d..9968abcd06ea 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1828,6 +1828,7 @@ struct file_operations {
>                                    struct file *file_out, loff_t pos_out,
>                                    loff_t len, unsigned int remap_flags);
>         int (*fadvise)(struct file *, loff_t, loff_t, int);
> +       int (*readahead)(struct file *, loff_t, loff_t);

The new method is redundant, because it is a subset of fadvise.
When overlayfs needed to implement both methods, Miklos
suggested that we unite them into one, hence:
3d8f7615319b vfs: implement readahead(2) using POSIX_FADV_WILLNEED

So you can accomplish the ext4 fix without the new method.
All you need on top is to implement madvise_willneed() via vfs_fadvise(),
e.g. along the lines sketched below.
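
I.e., something like this in madvise_willneed() (an untested sketch; start
and end are in page units there, as in the current code):

	loff_t offset = (loff_t)start << PAGE_SHIFT;
	loff_t len = (loff_t)(end - start) << PAGE_SHIFT;

	/* routes through ->fadvise, so the fs can take its lock */
	vfs_fadvise(file, offset, len, POSIX_FADV_WILLNEED);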

Thanks,
Amir.


* Re: [PATCH 2/2] ext4: Fix stale data exposure when read races with hole punch
  2019-06-03 13:21 ` [PATCH 2/2] ext4: Fix stale data exposure when read races with hole punch Jan Kara
@ 2019-06-03 16:33   ` Amir Goldstein
  2019-06-04  7:57     ` Jan Kara
  2019-06-05  1:25   ` Dave Chinner
  1 sibling, 1 reply; 9+ messages in thread
From: Amir Goldstein @ 2019-06-03 16:33 UTC (permalink / raw)
  To: Jan Kara; +Cc: Ext4, Ted Tso, Linux MM, linux-fsdevel, stable

On Mon, Jun 3, 2019 at 4:22 PM Jan Kara <jack@suse.cz> wrote:
>
> Hole punching currently evicts pages from the page cache and then goes on
> to remove blocks from the inode. This happens with both i_mmap_sem and
> i_rwsem held exclusively, which provides appropriate serialization with
> racing page faults. However, there is currently nothing that prevents an
> ordinary read(2) from racing with the hole punch and instantiating a page
> cache page after hole punching has evicted the page cache but before it
> has removed blocks from the inode. This page cache page will be mapping a
> soon-to-be-freed block and that can lead to returning stale data to
> userspace or even filesystem corruption.
>
> Fix the problem by protecting reads as well as readahead requests with
> i_mmap_sem.
>

So ->write_iter() does not take i_mmap_sem, right?
And therefore a mixed randrw workload is not expected to regress heavily
because of this change?

Did you test the performance diff?
Here [1] I posted results of a fio test that did 5x worse on xfs vs.
ext4, but I've seen much worse cases.

Thanks,
Amir.

[1] https://lore.kernel.org/linux-fsdevel/CAOQ4uxhu=Qtme9RJ7uZXYXt0UE+=xD+OC4gQ9EYkDC1ap8Hizg@mail.gmail.com/


* Re: [PATCH 2/2] ext4: Fix stale data exposure when read races with hole punch
  2019-06-03 16:33   ` Amir Goldstein
@ 2019-06-04  7:57     ` Jan Kara
  0 siblings, 0 replies; 9+ messages in thread
From: Jan Kara @ 2019-06-04  7:57 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Jan Kara, Ext4, Ted Tso, Linux MM, linux-fsdevel, stable

On Mon 03-06-19 19:33:50, Amir Goldstein wrote:
> On Mon, Jun 3, 2019 at 4:22 PM Jan Kara <jack@suse.cz> wrote:
> >
> > Hole punching currently evicts pages from the page cache and then goes on
> > to remove blocks from the inode. This happens with both i_mmap_sem and
> > i_rwsem held exclusively, which provides appropriate serialization with
> > racing page faults. However, there is currently nothing that prevents an
> > ordinary read(2) from racing with the hole punch and instantiating a page
> > cache page after hole punching has evicted the page cache but before it
> > has removed blocks from the inode. This page cache page will be mapping a
> > soon-to-be-freed block and that can lead to returning stale data to
> > userspace or even filesystem corruption.
> >
> > Fix the problem by protecting reads as well as readahead requests with
> > i_mmap_sem.
> >
> 
> So ->write_iter() does not take i_mmap_sem, right?
> And therefore a mixed randrw workload is not expected to regress heavily
> because of this change?

Yes. i_mmap_sem is taken in exclusive mode only for truncate, punch hole,
and similar operations removing blocks from a file. So reads will now be more
serialized with such operations, but not with writes. There may still be some
visible regression due to the fact that although readers won't block one
another or writers, they'll still contend on updating the cacheline holding
i_mmap_sem, and that's going to be visible for cache-hot readers running on
multiple NUMA nodes.

> Did you test the performance diff?

No, not really. But I'll queue up some tests to see the difference.

> Here [1] I posted results of a fio test that did 5x worse on xfs vs.
> ext4, but I've seen much worse cases.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH 1/2] mm: Add readahead file operation
  2019-06-03 16:16   ` Amir Goldstein
@ 2019-06-04  8:00     ` Jan Kara
  0 siblings, 0 replies; 9+ messages in thread
From: Jan Kara @ 2019-06-04  8:00 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Jan Kara, Ext4, Ted Tso, Linux MM, linux-fsdevel, stable, Miklos Szeredi

On Mon 03-06-19 19:16:59, Amir Goldstein wrote:
> On Mon, Jun 3, 2019 at 4:22 PM Jan Kara <jack@suse.cz> wrote:
> >
> > Some filesystems need to acquire locks before pages are read into the page
> > cache to protect from races with hole punching. The lock generally cannot
> > be acquired within ->readpage as it ranks above the page lock, so we are
> > left with acquiring the lock within the filesystem's ->read_iter
> > implementation for normal reads and its ->fault implementation during page
> > faults. That however does not cover all the paths through which pages can
> > be instantiated in the page cache - namely explicitly requested readahead.
> > Add a new ->readahead file operation which filesystems can use for this.
> >
> > CC: stable@vger.kernel.org # Needed by following ext4 fix
> > Signed-off-by: Jan Kara <jack@suse.cz>
> > ---
> >  include/linux/fs.h |  5 +++++
> >  include/linux/mm.h |  3 ---
> >  mm/fadvise.c       | 12 +-----------
> >  mm/madvise.c       |  3 ++-
> >  mm/readahead.c     | 26 ++++++++++++++++++++++++--
> >  5 files changed, 32 insertions(+), 17 deletions(-)
> >
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index f7fdfe93e25d..9968abcd06ea 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -1828,6 +1828,7 @@ struct file_operations {
> >                                    struct file *file_out, loff_t pos_out,
> >                                    loff_t len, unsigned int remap_flags);
> >         int (*fadvise)(struct file *, loff_t, loff_t, int);
> > +       int (*readahead)(struct file *, loff_t, loff_t);
> 
> The new method is redundant, because it is a subset of fadvise.
> When overlayfs needed to implement both methods, Miklos
> suggested that we unite them into one, hence:
> 3d8f7615319b vfs: implement readahead(2) using POSIX_FADV_WILLNEED

Yes, I've noticed this.

> So you can accomplish the ext4 fix without the new method.
> All you need on top is to implement madvise_willneed() via vfs_fadvise().

Ah, that's an interesting idea. I'll try that out. It will require some
dancing in madvise() to drop mmap_sem, but we already do that for
madvise_free() so I can just duplicate that pattern, roughly as sketched
below.
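
Roughly (an untested sketch, modeled on the existing drop-and-reacquire
pattern):

	/* vfs_fadvise() may block, so drop mmap_sem around the call */
	get_file(file);
	up_read(&current->mm->mmap_sem);
	vfs_fadvise(file, offset, len, POSIX_FADV_WILLNEED);
	fput(file);
	down_read(&current->mm->mmap_sem);
	/* the vma may be gone by now, the caller must revalidate */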

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH 2/2] ext4: Fix stale data exposure when read races with hole punch
  2019-06-03 13:21 ` [PATCH 2/2] ext4: Fix stale data exposure when read races with hole punch Jan Kara
  2019-06-03 16:33   ` Amir Goldstein
@ 2019-06-05  1:25   ` Dave Chinner
  2019-06-05  9:27     ` Jan Kara
  1 sibling, 1 reply; 9+ messages in thread
From: Dave Chinner @ 2019-06-05  1:25 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-ext4, Ted Tso, linux-mm, linux-fsdevel, Amir Goldstein, stable

On Mon, Jun 03, 2019 at 03:21:55PM +0200, Jan Kara wrote:
> Hole punching currently evicts pages from the page cache and then goes on
> to remove blocks from the inode. This happens with both i_mmap_sem and
> i_rwsem held exclusively, which provides appropriate serialization with
> racing page faults. However, there is currently nothing that prevents an
> ordinary read(2) from racing with the hole punch and instantiating a page
> cache page after hole punching has evicted the page cache but before it
> has removed blocks from the inode. This page cache page will be mapping a
> soon-to-be-freed block and that can lead to returning stale data to
> userspace or even filesystem corruption.
> 
> Fix the problem by protecting reads as well as readahead requests with
> i_mmap_sem.
> 
> CC: stable@vger.kernel.org
> Reported-by: Amir Goldstein <amir73il@gmail.com>
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  fs/ext4/file.c | 35 +++++++++++++++++++++++++++++++----
>  1 file changed, 31 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index 2c5baa5e8291..a21fa9f8fb5d 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -34,6 +34,17 @@
>  #include "xattr.h"
>  #include "acl.h"
>  
> +static ssize_t ext4_file_buffered_read(struct kiocb *iocb, struct iov_iter *to)
> +{
> +	ssize_t ret;
> +	struct inode *inode = file_inode(iocb->ki_filp);
> +
> +	down_read(&EXT4_I(inode)->i_mmap_sem);
> +	ret = generic_file_read_iter(iocb, to);
> +	up_read(&EXT4_I(inode)->i_mmap_sem);
> +	return ret;

Isn't i_mmap_sem taken in the page fault path? What makes it safe
to take it here both outside and inside mmap_sem at the same time?
I mean, the whole reason for i_mmap_sem existing is that the inode
i_rwsem can't be taken both outside and inside mmap_sem at the
same time, so what makes i_mmap_sem different?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 2/2] ext4: Fix stale data exposure when read races with hole punch
  2019-06-05  1:25   ` Dave Chinner
@ 2019-06-05  9:27     ` Jan Kara
  0 siblings, 0 replies; 9+ messages in thread
From: Jan Kara @ 2019-06-05  9:27 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, linux-ext4, Ted Tso, linux-mm, linux-fsdevel,
	Amir Goldstein, stable

On Wed 05-06-19 11:25:51, Dave Chinner wrote:
> On Mon, Jun 03, 2019 at 03:21:55PM +0200, Jan Kara wrote:
> > > Hole punching currently evicts pages from the page cache and then goes on
> > > to remove blocks from the inode. This happens with both i_mmap_sem and
> > > i_rwsem held exclusively, which provides appropriate serialization with
> > > racing page faults. However, there is currently nothing that prevents an
> > > ordinary read(2) from racing with the hole punch and instantiating a page
> > > cache page after hole punching has evicted the page cache but before it
> > > has removed blocks from the inode. This page cache page will be mapping a
> > > soon-to-be-freed block and that can lead to returning stale data to
> > > userspace or even filesystem corruption.
> > > 
> > > Fix the problem by protecting reads as well as readahead requests with
> > > i_mmap_sem.
> > 
> > CC: stable@vger.kernel.org
> > Reported-by: Amir Goldstein <amir73il@gmail.com>
> > Signed-off-by: Jan Kara <jack@suse.cz>
> > ---
> >  fs/ext4/file.c | 35 +++++++++++++++++++++++++++++++----
> >  1 file changed, 31 insertions(+), 4 deletions(-)
> > 
> > diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> > index 2c5baa5e8291..a21fa9f8fb5d 100644
> > --- a/fs/ext4/file.c
> > +++ b/fs/ext4/file.c
> > @@ -34,6 +34,17 @@
> >  #include "xattr.h"
> >  #include "acl.h"
> >  
> > +static ssize_t ext4_file_buffered_read(struct kiocb *iocb, struct iov_iter *to)
> > +{
> > +	ssize_t ret;
> > +	struct inode *inode = file_inode(iocb->ki_filp);
> > +
> > +	down_read(&EXT4_I(inode)->i_mmap_sem);
> > +	ret = generic_file_read_iter(iocb, to);
> > +	up_read(&EXT4_I(inode)->i_mmap_sem);
> > +	return ret;
> 
> Isn't i_mmap_sem taken in the page fault path? What makes it safe
> to take it here both outside and inside mmap_sem at the same time?
> I mean, the whole reason for i_mmap_sem existing is that the inode
> i_rwsem can't be taken both outside and inside mmap_sem at the
> same time, so what makes i_mmap_sem different?

Drat, you're right that the read path may take a page fault, which will cause
a lock inversion with mmap_sem. My xfstests run apparently just didn't
trigger this, as I didn't get any lockdep splat. Thanks for catching this!
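
Spelled out, the inversion is:

  read(2):                            page fault:
    down_read(i_mmap_sem)               down_read(mmap_sem)
    generic_file_read_iter()            ext4_filemap_fault()
      copy_to_user() faults               down_read(i_mmap_sem)
        down_read(mmap_sem)

With a writer queued on either rwsem between the two readers, this ABBA
ordering deadlocks.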

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

