Linux-ext4 Archive on lore.kernel.org
 help / color / Atom feed
* [PATCH 0/3 RFC] fs: Hole punch vs page cache filling races
@ 2021-01-20 16:06 Jan Kara
  2021-01-20 16:06 ` [PATCH 1/3] mm: Do not pass iter into generic_file_buffered_read_get_pages() Jan Kara
                   ` (4 more replies)
  0 siblings, 5 replies; 17+ messages in thread
From: Jan Kara @ 2021-01-20 16:06 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Matthew Wilcox, linux-ext4, Jan Kara

Hello,

Amir has reported [1] a that ext4 has a potential issues when reads can race
with hole punching possibly exposing stale data from freed blocks or even
corrupting filesystem when stale mapping data gets used for writeout. The
problem is that during hole punching, new page cache pages can get instantiated
in a punched range after truncate_inode_pages() has run but before the
filesystem removes blocks from the file.  In principle any filesystem
implementing hole punching thus needs to implement a mechanism to block
instantiating page cache pages during hole punching to avoid this race. This is
further complicated by the fact that there are multiple places that can
instantiate pages in page cache.  We can have regular read(2) or page fault
doing this but fadvise(2) or madvise(2) can also result in reading in page
cache pages through force_page_cache_readahead().

There are couple of ways how to fix this. First way (currently implemented by
XFS) is to protect read(2) and *advise(2) calls with i_rwsem so that they are
serialized with hole punching. This is easy to do but as a result all reads
would then be serialized with writes and thus mixed read-write workloads suffer
heavily on ext4. Thus for ext4 I want to use EXT4_I(inode)->i_mmap_sem for
serialization of reads and hole punching. The same serialization that is
already currently used in ext4 to close this race for page faults. This is
conceptually simple but lock ordering is troublesome - since
EXT4_I(inode)->i_mmap_sem is used in page fault path, it ranks below mmap_sem.
Thus we cannot simply grab EXT4_I(inode)->i_mmap_sem in ext4_file_read_iter()
as generic_file_buffered_read() copies data to userspace which may require
grabbing mmap_sem. Also grabbing EXT4_I(inode)->i_mmap_sem in ext4_readpages()
/ ext4_readpage() is problematic because at that point we already have locked
pages instantiated in the page cache. So EXT4_I(inode)->i_mmap_sem would
effectively rank below page lock which is too low in the locking hierarchy.  So
for ext4 (and other filesystems with similar locking constraints - F2FS, GFS2,
OCFS2, ...) we'd need another hook in the read path that can wrap around
insertion of pages into page cache but does not contain copying of data into
userspace.

This patch set implements one possibility of such hook - we essentially
abstract generic_file_buffered_read_get_pages() into a hook. I'm not completely
sold on the naming or the API, or even whether this is the best place for the
hook. But I wanted to send something out for further discussion. For example
another workable option for ext4 would be to have an aops hook for adding a
page into page cache (essentially abstract add_to_page_cache_lru()). There will
be slight downside that it would mean per-page acquisition of the lock instead
of a per-batch-of-pages, also if we ever transition to range locking the
mapping, per-batch locking would be more efficient.

What do people think about this?

								Honza

[1] https://lore.kernel.org/linux-fsdevel/CAOQ4uxjQNmxqmtA_VbYW0Su9rKRk2zobJmahcyeaEVOFKVQ5dw@mail.gmail.com/

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH 1/3] mm: Do not pass iter into generic_file_buffered_read_get_pages()
  2021-01-20 16:06 [PATCH 0/3 RFC] fs: Hole punch vs page cache filling races Jan Kara
@ 2021-01-20 16:06 ` Jan Kara
  2021-01-20 16:18   ` Christoph Hellwig
  2021-01-20 16:06 ` [PATCH 2/3] mm: Provide address_space operation for filling pages for read Jan Kara
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 17+ messages in thread
From: Jan Kara @ 2021-01-20 16:06 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Matthew Wilcox, linux-ext4, Jan Kara

generic_file_buffered_read_get_pages() needs to only know the length we
want to read and whether partially uptodate pages are accepted. Also
principially this function just loads pages into page cache so it does
not need to know about user buffers to copy data to. Pass the length and
whether partially uptodate pages are acceptable into
generic_file_buffered_read_get_pages() instead of iter.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 mm/filemap.c | 24 ++++++++++++------------
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 5c9d564317a5..7029bada8e90 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2235,7 +2235,7 @@ generic_file_buffered_read_readpage(struct kiocb *iocb,
 static struct page *
 generic_file_buffered_read_pagenotuptodate(struct kiocb *iocb,
 					   struct file *filp,
-					   struct iov_iter *iter,
+					   bool partial_page,
 					   struct page *page,
 					   loff_t pos, loff_t count)
 {
@@ -2264,8 +2264,8 @@ generic_file_buffered_read_pagenotuptodate(struct kiocb *iocb,
 	if (inode->i_blkbits == PAGE_SHIFT ||
 			!mapping->a_ops->is_partially_uptodate)
 		goto page_not_up_to_date;
-	/* pipes can't handle partially uptodate pages */
-	if (unlikely(iov_iter_is_pipe(iter)))
+	/* Some reads (e.g. pipes) don't accept partially uptodate pages */
+	if (unlikely(!partial_page))
 		goto page_not_up_to_date;
 	if (!trylock_page(page))
 		goto page_not_up_to_date;
@@ -2304,8 +2304,7 @@ generic_file_buffered_read_pagenotuptodate(struct kiocb *iocb,
 }
 
 static struct page *
-generic_file_buffered_read_no_cached_page(struct kiocb *iocb,
-					  struct iov_iter *iter)
+generic_file_buffered_read_no_cached_page(struct kiocb *iocb)
 {
 	struct file *filp = iocb->ki_filp;
 	struct address_space *mapping = filp->f_mapping;
@@ -2335,7 +2334,7 @@ generic_file_buffered_read_no_cached_page(struct kiocb *iocb,
 }
 
 static int generic_file_buffered_read_get_pages(struct kiocb *iocb,
-						struct iov_iter *iter,
+						size_t len, bool partial_page,
 						struct page **pages,
 						unsigned int nr)
 {
@@ -2343,7 +2342,7 @@ static int generic_file_buffered_read_get_pages(struct kiocb *iocb,
 	struct address_space *mapping = filp->f_mapping;
 	struct file_ra_state *ra = &filp->f_ra;
 	pgoff_t index = iocb->ki_pos >> PAGE_SHIFT;
-	pgoff_t last_index = (iocb->ki_pos + iter->count + PAGE_SIZE-1) >> PAGE_SHIFT;
+	pgoff_t last_index = (iocb->ki_pos + len + PAGE_SIZE - 1) >> PAGE_SHIFT;
 	int i, j, nr_got, err = 0;
 
 	nr = min_t(unsigned long, last_index - index, nr);
@@ -2364,7 +2363,7 @@ static int generic_file_buffered_read_get_pages(struct kiocb *iocb,
 	if (nr_got)
 		goto got_pages;
 
-	pages[0] = generic_file_buffered_read_no_cached_page(iocb, iter);
+	pages[0] = generic_file_buffered_read_no_cached_page(iocb);
 	err = PTR_ERR_OR_ZERO(pages[0]);
 	if (!IS_ERR_OR_NULL(pages[0]))
 		nr_got = 1;
@@ -2374,7 +2373,7 @@ static int generic_file_buffered_read_get_pages(struct kiocb *iocb,
 		pgoff_t pg_index = index + i;
 		loff_t pg_pos = max(iocb->ki_pos,
 				    (loff_t) pg_index << PAGE_SHIFT);
-		loff_t pg_count = iocb->ki_pos + iter->count - pg_pos;
+		loff_t pg_count = iocb->ki_pos + len - pg_pos;
 
 		if (PageReadahead(page)) {
 			if (iocb->ki_flags & IOCB_NOIO) {
@@ -2399,7 +2398,8 @@ static int generic_file_buffered_read_get_pages(struct kiocb *iocb,
 			}
 
 			page = generic_file_buffered_read_pagenotuptodate(iocb,
-					filp, iter, page, pg_pos, pg_count);
+					filp, partial_page, page, pg_pos,
+					pg_count);
 			if (IS_ERR_OR_NULL(page)) {
 				for (j = i + 1; j < nr_got; j++)
 					put_page(pages[j]);
@@ -2478,8 +2478,8 @@ ssize_t generic_file_buffered_read(struct kiocb *iocb,
 			iocb->ki_flags |= IOCB_NOWAIT;
 
 		i = 0;
-		pg_nr = generic_file_buffered_read_get_pages(iocb, iter,
-							     pages, nr_pages);
+		pg_nr = generic_file_buffered_read_get_pages(iocb, iter->count,
+				!iov_iter_is_pipe(iter), pages, nr_pages);
 		if (pg_nr < 0) {
 			error = pg_nr;
 			break;
-- 
2.26.2


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH 2/3] mm: Provide address_space operation for filling pages for read
  2021-01-20 16:06 [PATCH 0/3 RFC] fs: Hole punch vs page cache filling races Jan Kara
  2021-01-20 16:06 ` [PATCH 1/3] mm: Do not pass iter into generic_file_buffered_read_get_pages() Jan Kara
@ 2021-01-20 16:06 ` Jan Kara
  2021-01-20 16:20   ` Christoph Hellwig
  2021-01-20 16:06 ` [PATCH 3/3] ext4: Fix stale data exposure when read races with hole punch Jan Kara
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 17+ messages in thread
From: Jan Kara @ 2021-01-20 16:06 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Matthew Wilcox, linux-ext4, Jan Kara

Provide an address_space operation for filling pages needed for read
into page cache. Filesystems can use this operation to seriealize
page cache filling with e.g. hole punching properly.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 include/linux/fs.h |  5 +++++
 mm/filemap.c       | 32 ++++++++++++++++++++++++++------
 2 files changed, 31 insertions(+), 6 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index fd47deea7c17..1d3f963d0d99 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -381,6 +381,9 @@ struct address_space_operations {
 	int (*readpages)(struct file *filp, struct address_space *mapping,
 			struct list_head *pages, unsigned nr_pages);
 	void (*readahead)(struct readahead_control *);
+	/* Fill in uptodate pages for kiocb into page cache and pagep array */
+	int (*fill_pages)(struct kiocb *, size_t len, bool partial_page,
+			  struct page **pagep, unsigned int nr_pages);
 
 	int (*write_begin)(struct file *, struct address_space *mapping,
 				loff_t pos, unsigned len, unsigned flags,
@@ -2962,6 +2965,8 @@ extern ssize_t generic_write_checks(struct kiocb *, struct iov_iter *);
 extern int generic_write_check_limits(struct file *file, loff_t pos,
 		loff_t *count);
 extern int generic_file_rw_checks(struct file *file_in, struct file *file_out);
+extern int generic_file_buffered_read_get_pages(struct kiocb *iocb, size_t len,
+		bool partial_page, struct page **pages, unsigned int nr);
 extern ssize_t generic_file_buffered_read(struct kiocb *iocb,
 		struct iov_iter *to, ssize_t already_read);
 extern ssize_t generic_file_read_iter(struct kiocb *, struct iov_iter *);
diff --git a/mm/filemap.c b/mm/filemap.c
index 7029bada8e90..5b594dd245e0 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2333,10 +2333,24 @@ generic_file_buffered_read_no_cached_page(struct kiocb *iocb)
 	return generic_file_buffered_read_readpage(iocb, filp, mapping, page);
 }
 
-static int generic_file_buffered_read_get_pages(struct kiocb *iocb,
-						size_t len, bool partial_page,
-						struct page **pages,
-						unsigned int nr)
+/**
+ * generic_file_buffered_read_get_pages - generic routine to fill in page cache
+ *	pages for a read
+ * @iocb:	the iocb to read
+ * @len:	number of bytes to read
+ * @partial_page:	are partially uptodate pages accepted by read?
+ * @pages:	array where to fill in found pages
+ * @nr:		number of pages in the @pages array
+ *
+ * Fill pages into page cache and @pages array needed for a read of length @len
+ * described by @iocb.
+ *
+ * Return:
+ * Number of pages filled in the array
+ */
+int generic_file_buffered_read_get_pages(struct kiocb *iocb, size_t len,
+					 bool partial_page,
+					 struct page **pages, unsigned int nr)
 {
 	struct file *filp = iocb->ki_filp;
 	struct address_space *mapping = filp->f_mapping;
@@ -2419,6 +2433,7 @@ static int generic_file_buffered_read_get_pages(struct kiocb *iocb,
 	 */
 	goto find_page;
 }
+EXPORT_SYMBOL(generic_file_buffered_read_get_pages);
 
 /**
  * generic_file_buffered_read - generic file read routine
@@ -2447,6 +2462,8 @@ ssize_t generic_file_buffered_read(struct kiocb *iocb,
 	unsigned int nr_pages = min_t(unsigned int, 512,
 			((iocb->ki_pos + iter->count + PAGE_SIZE - 1) >> PAGE_SHIFT) -
 			(iocb->ki_pos >> PAGE_SHIFT));
+	int (*fill_pages)(struct kiocb *, size_t, bool, struct page **,
+			  unsigned int) = mapping->a_ops->fill_pages;
 	int i, pg_nr, error = 0;
 	bool writably_mapped;
 	loff_t isize, end_offset;
@@ -2456,6 +2473,9 @@ ssize_t generic_file_buffered_read(struct kiocb *iocb,
 	if (unlikely(!iov_iter_count(iter)))
 		return 0;
 
+	if (!fill_pages)
+		fill_pages = generic_file_buffered_read_get_pages;
+
 	iov_iter_truncate(iter, inode->i_sb->s_maxbytes);
 
 	if (nr_pages > ARRAY_SIZE(pages_onstack))
@@ -2478,8 +2498,8 @@ ssize_t generic_file_buffered_read(struct kiocb *iocb,
 			iocb->ki_flags |= IOCB_NOWAIT;
 
 		i = 0;
-		pg_nr = generic_file_buffered_read_get_pages(iocb, iter->count,
-				!iov_iter_is_pipe(iter), pages, nr_pages);
+		pg_nr = fill_pages(iocb, iter->count, !iov_iter_is_pipe(iter),
+				   pages, nr_pages);
 		if (pg_nr < 0) {
 			error = pg_nr;
 			break;
-- 
2.26.2


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH 3/3] ext4: Fix stale data exposure when read races with hole punch
  2021-01-20 16:06 [PATCH 0/3 RFC] fs: Hole punch vs page cache filling races Jan Kara
  2021-01-20 16:06 ` [PATCH 1/3] mm: Do not pass iter into generic_file_buffered_read_get_pages() Jan Kara
  2021-01-20 16:06 ` [PATCH 2/3] mm: Provide address_space operation for filling pages for read Jan Kara
@ 2021-01-20 16:06 ` Jan Kara
  2021-01-21 19:27 ` [PATCH 0/3 RFC] fs: Hole punch vs page cache filling races Matthew Wilcox
  2021-04-02 19:34 ` Theodore Ts'o
  4 siblings, 0 replies; 17+ messages in thread
From: Jan Kara @ 2021-01-20 16:06 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Matthew Wilcox, linux-ext4, Jan Kara, stable, Amir Goldstein

Hole puching currently evicts pages from page cache and then goes on to
remove blocks from the inode. This happens under both i_mmap_sem and
i_rwsem held exclusively which provides appropriate serialization with
racing page faults. However there is currently nothing that prevents
ordinary read(2) from racing with the hole punch and instantiating page
cache page after hole punching has evicted page cache but before it has
removed blocks from the inode. This page cache page will be mapping soon
to be freed block and that can lead to returning stale data to userspace
or even filesystem corruption.

Fix the problem by protecting reads as well as readahead requests with
i_mmap_sem.

CC: stable@vger.kernel.org
Reported-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/file.c  | 16 ++++++++++++++++
 fs/ext4/inode.c | 21 +++++++++++++++++++++
 2 files changed, 37 insertions(+)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 349b27f0dda0..d66f7c08b123 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -30,6 +30,7 @@
 #include <linux/uio.h>
 #include <linux/mman.h>
 #include <linux/backing-dev.h>
+#include <linux/fadvise.h>
 #include "ext4.h"
 #include "ext4_jbd2.h"
 #include "xattr.h"
@@ -131,6 +132,20 @@ static ssize_t ext4_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 	return generic_file_read_iter(iocb, to);
 }
 
+static int ext4_fadvise(struct file *filp, loff_t start, loff_t end, int advice)
+{
+	struct inode *inode = file_inode(filp);
+	int ret;
+
+	/* Readahead needs protection from hole punching and similar ops */
+	if (advice == POSIX_FADV_WILLNEED)
+		down_read(&EXT4_I(inode)->i_mmap_sem);
+	ret = generic_fadvise(filp, start, end, advice);
+	if (advice == POSIX_FADV_WILLNEED)
+		up_read(&EXT4_I(inode)->i_mmap_sem);
+	return ret;
+}
+
 /*
  * Called when an inode is released. Note that this is different
  * from ext4_file_open: open gets called at every open, but release
@@ -911,6 +926,7 @@ const struct file_operations ext4_file_operations = {
 	.splice_read	= generic_file_splice_read,
 	.splice_write	= iter_file_splice_write,
 	.fallocate	= ext4_fallocate,
+	.fadvise	= ext4_fadvise,
 };
 
 const struct inode_operations ext4_file_inode_operations = {
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index c173c8405856..a01569f2fa49 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3646,9 +3646,28 @@ static int ext4_iomap_swap_activate(struct swap_info_struct *sis,
 				       &ext4_iomap_report_ops);
 }
 
+static int ext4_fill_pages(struct kiocb *iocb, size_t len, bool partial_page,
+			   struct page **pagep, unsigned int nr_pages)
+{
+	struct ext4_inode_info *ei = EXT4_I(iocb->ki_filp->f_mapping->host);
+	int ret;
+
+	/*
+	 * Protect adding of pages into page cache against hole punching and
+	 * other cache manipulation functions so that we don't expose
+	 * potentially stale user data.
+	 */
+	down_read(&ei->i_mmap_sem);
+	ret = generic_file_buffered_read_get_pages(iocb, len, partial_page,
+						   pagep, nr_pages);
+	up_read(&ei->i_mmap_sem);
+	return ret;
+}
+
 static const struct address_space_operations ext4_aops = {
 	.readpage		= ext4_readpage,
 	.readahead		= ext4_readahead,
+	.fill_pages		= ext4_fill_pages,
 	.writepage		= ext4_writepage,
 	.writepages		= ext4_writepages,
 	.write_begin		= ext4_write_begin,
@@ -3667,6 +3686,7 @@ static const struct address_space_operations ext4_aops = {
 static const struct address_space_operations ext4_journalled_aops = {
 	.readpage		= ext4_readpage,
 	.readahead		= ext4_readahead,
+	.fill_pages		= ext4_fill_pages,
 	.writepage		= ext4_writepage,
 	.writepages		= ext4_writepages,
 	.write_begin		= ext4_write_begin,
@@ -3684,6 +3704,7 @@ static const struct address_space_operations ext4_journalled_aops = {
 static const struct address_space_operations ext4_da_aops = {
 	.readpage		= ext4_readpage,
 	.readahead		= ext4_readahead,
+	.fill_pages		= ext4_fill_pages,
 	.writepage		= ext4_writepage,
 	.writepages		= ext4_writepages,
 	.write_begin		= ext4_da_write_begin,
-- 
2.26.2


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/3] mm: Do not pass iter into generic_file_buffered_read_get_pages()
  2021-01-20 16:06 ` [PATCH 1/3] mm: Do not pass iter into generic_file_buffered_read_get_pages() Jan Kara
@ 2021-01-20 16:18   ` Christoph Hellwig
  0 siblings, 0 replies; 17+ messages in thread
From: Christoph Hellwig @ 2021-01-20 16:18 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, Matthew Wilcox, linux-ext4

This whole code has a massive refactoring pending that barely didn't
make it in the last merge window.  willy, do you plan to resend it
ASAP?

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/3] mm: Provide address_space operation for filling pages for read
  2021-01-20 16:06 ` [PATCH 2/3] mm: Provide address_space operation for filling pages for read Jan Kara
@ 2021-01-20 16:20   ` Christoph Hellwig
  2021-01-20 17:27     ` Jan Kara
  2021-04-02 21:17     ` Kent Overstreet
  0 siblings, 2 replies; 17+ messages in thread
From: Christoph Hellwig @ 2021-01-20 16:20 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, Matthew Wilcox, linux-ext4

On Wed, Jan 20, 2021 at 05:06:10PM +0100, Jan Kara wrote:
> Provide an address_space operation for filling pages needed for read
> into page cache. Filesystems can use this operation to seriealize
> page cache filling with e.g. hole punching properly.

Besides the impending rewrite of the area - having another indirection
here is just horrible for performance.  If we want locking in this area
it should be in core code and common for multiple file systems.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/3] mm: Provide address_space operation for filling pages for read
  2021-01-20 16:20   ` Christoph Hellwig
@ 2021-01-20 17:27     ` Jan Kara
  2021-01-20 17:28       ` Christoph Hellwig
  2021-04-02 21:17     ` Kent Overstreet
  1 sibling, 1 reply; 17+ messages in thread
From: Jan Kara @ 2021-01-20 17:27 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jan Kara, linux-fsdevel, Matthew Wilcox, linux-ext4

On Wed 20-01-21 16:20:01, Christoph Hellwig wrote:
> On Wed, Jan 20, 2021 at 05:06:10PM +0100, Jan Kara wrote:
> > Provide an address_space operation for filling pages needed for read
> > into page cache. Filesystems can use this operation to seriealize
> > page cache filling with e.g. hole punching properly.
> 
> Besides the impending rewrite of the area - having another indirection
> here is just horrible for performance.  If we want locking in this area
> it should be in core code and common for multiple file systems.

This would mean pulling i_mmap_sem out from ext4/XFS/F2FS private inode
into the VFS inode. Which is fine by me but it would grow struct inode for
proc / tmpfs / btrfs (although for btrfs I'm not convinced it isn't
actually prone to the race and doesn't need similar protection as xfs /
ext4) so some people may object.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/3] mm: Provide address_space operation for filling pages for read
  2021-01-20 17:27     ` Jan Kara
@ 2021-01-20 17:28       ` Christoph Hellwig
  2021-01-20 17:56         ` Matthew Wilcox
  0 siblings, 1 reply; 17+ messages in thread
From: Christoph Hellwig @ 2021-01-20 17:28 UTC (permalink / raw)
  To: Jan Kara; +Cc: Christoph Hellwig, linux-fsdevel, Matthew Wilcox, linux-ext4

On Wed, Jan 20, 2021 at 06:27:05PM +0100, Jan Kara wrote:
> This would mean pulling i_mmap_sem out from ext4/XFS/F2FS private inode
> into the VFS inode. Which is fine by me but it would grow struct inode for
> proc / tmpfs / btrfs (although for btrfs I'm not convinced it isn't
> actually prone to the race and doesn't need similar protection as xfs /
> ext4) so some people may object.

The btrfs folks are already looking into lifting it to common code.

Also I have a patch pending to remove a list_head from the inode to
counter the size increase :)

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/3] mm: Provide address_space operation for filling pages for read
  2021-01-20 17:28       ` Christoph Hellwig
@ 2021-01-20 17:56         ` Matthew Wilcox
  0 siblings, 0 replies; 17+ messages in thread
From: Matthew Wilcox @ 2021-01-20 17:56 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jan Kara, linux-fsdevel, linux-ext4

On Wed, Jan 20, 2021 at 05:28:36PM +0000, Christoph Hellwig wrote:
> On Wed, Jan 20, 2021 at 06:27:05PM +0100, Jan Kara wrote:
> > This would mean pulling i_mmap_sem out from ext4/XFS/F2FS private inode
> > into the VFS inode. Which is fine by me but it would grow struct inode for
> > proc / tmpfs / btrfs (although for btrfs I'm not convinced it isn't
> > actually prone to the race and doesn't need similar protection as xfs /
> > ext4) so some people may object.
> 
> The btrfs folks are already looking into lifting it to common code.
> 
> Also I have a patch pending to remove a list_head from the inode to
> counter the size increase :)

We can get rid of nrexceptional as well:

https://lore.kernel.org/linux-fsdevel/20201026151849.24232-1-willy@infradead.org/

We can also reduce the size of the rwsem by one word by replacing the list_head with a single pointer.

https://git.infradead.org/users/willy/pagecache.git/shortlog/refs/heads/wlist

(haven't touched it in almost four years, seemed to have a bug last time
i looked at it).

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 0/3 RFC] fs: Hole punch vs page cache filling races
  2021-01-20 16:06 [PATCH 0/3 RFC] fs: Hole punch vs page cache filling races Jan Kara
                   ` (2 preceding siblings ...)
  2021-01-20 16:06 ` [PATCH 3/3] ext4: Fix stale data exposure when read races with hole punch Jan Kara
@ 2021-01-21 19:27 ` Matthew Wilcox
  2021-01-22 14:32   ` Jan Kara
  2021-04-02 19:34 ` Theodore Ts'o
  4 siblings, 1 reply; 17+ messages in thread
From: Matthew Wilcox @ 2021-01-21 19:27 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, linux-ext4

On Wed, Jan 20, 2021 at 05:06:08PM +0100, Jan Kara wrote:
> Hello,
> 
> Amir has reported [1] a that ext4 has a potential issues when reads can race
> with hole punching possibly exposing stale data from freed blocks or even
> corrupting filesystem when stale mapping data gets used for writeout. The
> problem is that during hole punching, new page cache pages can get instantiated
> in a punched range after truncate_inode_pages() has run but before the
> filesystem removes blocks from the file.  In principle any filesystem
> implementing hole punching thus needs to implement a mechanism to block
> instantiating page cache pages during hole punching to avoid this race. This is
> further complicated by the fact that there are multiple places that can
> instantiate pages in page cache.  We can have regular read(2) or page fault
> doing this but fadvise(2) or madvise(2) can also result in reading in page
> cache pages through force_page_cache_readahead().

Doesn't this indicate that we're doing truncates in the wrong order?
ie first we should deallocate the blocks, then we should free the page
cache that was caching the contents of those blocks.  We'd need to
make sure those pages in the page cache don't get written back to disc
(either by taking pages in the page cache off the lru list or having
the filesystem handle writeback of pages to a freed extent as a no-op).

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 0/3 RFC] fs: Hole punch vs page cache filling races
  2021-01-21 19:27 ` [PATCH 0/3 RFC] fs: Hole punch vs page cache filling races Matthew Wilcox
@ 2021-01-22 14:32   ` Jan Kara
  0 siblings, 0 replies; 17+ messages in thread
From: Jan Kara @ 2021-01-22 14:32 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Jan Kara, linux-fsdevel, linux-ext4

On Thu 21-01-21 19:27:55, Matthew Wilcox wrote:
> On Wed, Jan 20, 2021 at 05:06:08PM +0100, Jan Kara wrote:
> > Hello,
> > 
> > Amir has reported [1] a that ext4 has a potential issues when reads can race
> > with hole punching possibly exposing stale data from freed blocks or even
> > corrupting filesystem when stale mapping data gets used for writeout. The
> > problem is that during hole punching, new page cache pages can get instantiated
> > in a punched range after truncate_inode_pages() has run but before the
> > filesystem removes blocks from the file.  In principle any filesystem
> > implementing hole punching thus needs to implement a mechanism to block
> > instantiating page cache pages during hole punching to avoid this race. This is
> > further complicated by the fact that there are multiple places that can
> > instantiate pages in page cache.  We can have regular read(2) or page fault
> > doing this but fadvise(2) or madvise(2) can also result in reading in page
> > cache pages through force_page_cache_readahead().
> 
> Doesn't this indicate that we're doing truncates in the wrong order?
> ie first we should deallocate the blocks, then we should free the page
> cache that was caching the contents of those blocks.  We'd need to
> make sure those pages in the page cache don't get written back to disc
> (either by taking pages in the page cache off the lru list or having
> the filesystem handle writeback of pages to a freed extent as a no-op).

Well, it depends on how much you wish to complicate the matters :).
Filesystems have metadata information attached to pages (e.g. buffer
heads), once you are removing blocks from a file, this information is
becoming stale. So it makes perfect sense to first evict page cache to
remove this metadata caching information and then remove blocks from
on-disk structures.

You can obviously try to do it the other way around - i.e., first remove
blocks from on-disk structures and then remove the cached information from
the page cache. But then you have to make sure stale cached information
isn't used until everything is in sync. So whichever way you slice it, you
have information in two places, you need to keep it in sync and you need
some synchronization between different updaters of this information
in both places so that they cannot get those two places out of sync...

TLDR: I don't see much benefit in switching the ordering.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 0/3 RFC] fs: Hole punch vs page cache filling races
  2021-01-20 16:06 [PATCH 0/3 RFC] fs: Hole punch vs page cache filling races Jan Kara
                   ` (3 preceding siblings ...)
  2021-01-21 19:27 ` [PATCH 0/3 RFC] fs: Hole punch vs page cache filling races Matthew Wilcox
@ 2021-04-02 19:34 ` Theodore Ts'o
  2021-04-06 12:17   ` Jan Kara
  4 siblings, 1 reply; 17+ messages in thread
From: Theodore Ts'o @ 2021-04-02 19:34 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, Matthew Wilcox, linux-ext4

On Wed, Jan 20, 2021 at 05:06:08PM +0100, Jan Kara wrote:
> 
> Amir has reported [1] a that ext4 has a potential issues when reads can race
> with hole punching possibly exposing stale data from freed blocks or even
> corrupting filesystem when stale mapping data gets used for writeout. The
> problem is that during hole punching, new page cache pages can get instantiated
> in a punched range after truncate_inode_pages() has run but before the
> filesystem removes blocks from the file.  In principle any filesystem
> implementing hole punching thus needs to implement a mechanism to block
> instantiating page cache pages during hole punching to avoid this race. This is
> further complicated by the fact that there are multiple places that can
> instantiate pages in page cache.  We can have regular read(2) or page fault
> doing this but fadvise(2) or madvise(2) can also result in reading in page
> cache pages through force_page_cache_readahead().

What's the current status of this patch set?  I'm going through
pending patches and it looks like folks don't like Jan's proposed
solution.  What are next steps?

Thanks,

					- Ted

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/3] mm: Provide address_space operation for filling pages for read
  2021-01-20 16:20   ` Christoph Hellwig
  2021-01-20 17:27     ` Jan Kara
@ 2021-04-02 21:17     ` Kent Overstreet
  2021-04-06 12:21       ` Jan Kara
  1 sibling, 1 reply; 17+ messages in thread
From: Kent Overstreet @ 2021-04-02 21:17 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jan Kara, linux-fsdevel, Matthew Wilcox, linux-ext4

On Wed, Jan 20, 2021 at 04:20:01PM +0000, Christoph Hellwig wrote:
> On Wed, Jan 20, 2021 at 05:06:10PM +0100, Jan Kara wrote:
> > Provide an address_space operation for filling pages needed for read
> > into page cache. Filesystems can use this operation to seriealize
> > page cache filling with e.g. hole punching properly.
> 
> Besides the impending rewrite of the area - having another indirection
> here is just horrible for performance.  If we want locking in this area
> it should be in core code and common for multiple file systems.

Agreed.

But, instead of using a rwsemaphore, why not just make it a lock with two shared
states that are exclusive with each other? One state for things that add pages
to the page cache, the other state for things that want to prevent that. That
way, DIO can use it too...

(this is what bcachefs does)

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 0/3 RFC] fs: Hole punch vs page cache filling races
  2021-04-02 19:34 ` Theodore Ts'o
@ 2021-04-06 12:17   ` Jan Kara
  2021-04-06 16:45     ` Theodore Ts'o
  0 siblings, 1 reply; 17+ messages in thread
From: Jan Kara @ 2021-04-06 12:17 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Jan Kara, linux-fsdevel, Matthew Wilcox, linux-ext4

On Fri 02-04-21 15:34:13, Theodore Ts'o wrote:
> On Wed, Jan 20, 2021 at 05:06:08PM +0100, Jan Kara wrote:
> > 
> > Amir has reported [1] a that ext4 has a potential issues when reads can race
> > with hole punching possibly exposing stale data from freed blocks or even
> > corrupting filesystem when stale mapping data gets used for writeout. The
> > problem is that during hole punching, new page cache pages can get instantiated
> > in a punched range after truncate_inode_pages() has run but before the
> > filesystem removes blocks from the file.  In principle any filesystem
> > implementing hole punching thus needs to implement a mechanism to block
> > instantiating page cache pages during hole punching to avoid this race. This is
> > further complicated by the fact that there are multiple places that can
> > instantiate pages in page cache.  We can have regular read(2) or page fault
> > doing this but fadvise(2) or madvise(2) can also result in reading in page
> > cache pages through force_page_cache_readahead().
> 
> What's the current status of this patch set?  I'm going through
> pending patches and it looks like folks don't like Jan's proposed
> solution.  What are next steps?

Note that I did post v2 here:

https://lore.kernel.org/linux-fsdevel/20210208163918.7871-1-jack@suse.cz/

It didn't get much comments though. I guess I'll rebase the series, include
the explanations I've added in my reply to Dave and resend.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/3] mm: Provide address_space operation for filling pages for read
  2021-04-02 21:17     ` Kent Overstreet
@ 2021-04-06 12:21       ` Jan Kara
  0 siblings, 0 replies; 17+ messages in thread
From: Jan Kara @ 2021-04-06 12:21 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Christoph Hellwig, Jan Kara, linux-fsdevel, Matthew Wilcox, linux-ext4

On Fri 02-04-21 17:17:10, Kent Overstreet wrote:
> On Wed, Jan 20, 2021 at 04:20:01PM +0000, Christoph Hellwig wrote:
> > On Wed, Jan 20, 2021 at 05:06:10PM +0100, Jan Kara wrote:
> > > Provide an address_space operation for filling pages needed for read
> > > into page cache. Filesystems can use this operation to seriealize
> > > page cache filling with e.g. hole punching properly.
> > 
> > Besides the impending rewrite of the area - having another indirection
> > here is just horrible for performance.  If we want locking in this area
> > it should be in core code and common for multiple file systems.
> 
> Agreed.

Please see v2 [1] where the indirection is avoided.

> But, instead of using a rwsemaphore, why not just make it a lock with two shared
> states that are exclusive with each other? One state for things that add pages
> to the page cache, the other state for things that want to prevent that. That
> way, DIO can use it too...

Well, the filesystems I convert use rwsem currently so for the conversion,
keeping rwsem is the simplest. If we then decide for a more fancy locking
primitive (and I agree what you describe should be possible), then we can
do that but IMO that's the next step (because it requires auditing every
filesystem that the new primitive is indeed safe for them).

								Honza

[1] https://lore.kernel.org/linux-fsdevel/20210212160108.GW19070@quack2.suse.cz/
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 0/3 RFC] fs: Hole punch vs page cache filling races
  2021-04-06 12:17   ` Jan Kara
@ 2021-04-06 16:45     ` Theodore Ts'o
  2021-04-06 16:50       ` Theodore Ts'o
  0 siblings, 1 reply; 17+ messages in thread
From: Theodore Ts'o @ 2021-04-06 16:45 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, Matthew Wilcox, linux-ext4

On Tue, Apr 06, 2021 at 02:17:02PM +0200, Jan Kara wrote:
> 
> Note that I did post v2 here:
> 
> https://lore.kernel.org/linux-fsdevel/20210208163918.7871-1-jack@suse.cz/
> 
> It didn't get much comments though. I guess I'll rebase the series, include
> the explanations I've added in my reply to Dave and resend.

Hmm, I wonder if it somehow got hit by the vger.kernel.org
instabilities a while back?  As near as I can tell your v2 patches
never were received by:

	http://patchwork.ozlabs.org/project/linux-ext4/

(There are no ext4 patches on patchwork.ozlabs.org on February 8th,
although I do see patches hitting patchwork on February 7th and 9th.)

Resending sounds like a plan.  :-)

	      	  	  	  	       - Ted

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 0/3 RFC] fs: Hole punch vs page cache filling races
  2021-04-06 16:45     ` Theodore Ts'o
@ 2021-04-06 16:50       ` Theodore Ts'o
  0 siblings, 0 replies; 17+ messages in thread
From: Theodore Ts'o @ 2021-04-06 16:50 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, Matthew Wilcox, linux-ext4

On Tue, Apr 06, 2021 at 12:45:28PM -0400, Theodore Ts'o wrote:
> 
> Hmm, I wonder if it somehow got hit by the vger.kernel.org
> instabilities a while back?  As near as I can tell your v2 patches
> never were received by:
> 
> 	http://patchwork.ozlabs.org/project/linux-ext4/
> 
> (There are no ext4 patches on patchwork.ozlabs.org on February 8th,
> although I do see patches hitting patchwork on February 7th and 9th.)

I just noticed that the V2 patches were only sent to linux-fsdevel and
not cc'ed to linux-ext4.  So that explains why I didn't see it on
patchwork.  :-)

						- Ted

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, back to index

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-01-20 16:06 [PATCH 0/3 RFC] fs: Hole punch vs page cache filling races Jan Kara
2021-01-20 16:06 ` [PATCH 1/3] mm: Do not pass iter into generic_file_buffered_read_get_pages() Jan Kara
2021-01-20 16:18   ` Christoph Hellwig
2021-01-20 16:06 ` [PATCH 2/3] mm: Provide address_space operation for filling pages for read Jan Kara
2021-01-20 16:20   ` Christoph Hellwig
2021-01-20 17:27     ` Jan Kara
2021-01-20 17:28       ` Christoph Hellwig
2021-01-20 17:56         ` Matthew Wilcox
2021-04-02 21:17     ` Kent Overstreet
2021-04-06 12:21       ` Jan Kara
2021-01-20 16:06 ` [PATCH 3/3] ext4: Fix stale data exposure when read races with hole punch Jan Kara
2021-01-21 19:27 ` [PATCH 0/3 RFC] fs: Hole punch vs page cache filling races Matthew Wilcox
2021-01-22 14:32   ` Jan Kara
2021-04-02 19:34 ` Theodore Ts'o
2021-04-06 12:17   ` Jan Kara
2021-04-06 16:45     ` Theodore Ts'o
2021-04-06 16:50       ` Theodore Ts'o

Linux-ext4 Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-ext4/0 linux-ext4/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-ext4 linux-ext4/ https://lore.kernel.org/linux-ext4 \
		linux-ext4@vger.kernel.org
	public-inbox-index linux-ext4

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-ext4


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git