linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/3] xfs: Fix races between readahead and hole punching
@ 2019-07-11 14:00 Jan Kara
  2019-07-11 14:00 ` [PATCH 1/3] mm: Handle MADV_WILLNEED through vfs_fadvise() Jan Kara
                   ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Jan Kara @ 2019-07-11 14:00 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-mm, linux-xfs, Amir Goldstein, Boaz Harrosh, Jan Kara

Hello,

this is a patch series that addresses a possible race between readahead and
hole punching Amir has discovered [1]. The first patch makes madvise(2) to
handle readahead requests through fadvise infrastructure, the third patch
then adds necessary locking to XFS to protect against the race. Note that
other filesystems need similar protections but e.g. in case of ext4 it isn't
so simple without seriously regressing mixed rw workload performance so
I'm pushing just xfs fix at this moment which is simple.

								Honza

[1] https://lore.kernel.org/linux-fsdevel/CAOQ4uxjQNmxqmtA_VbYW0Su9rKRk2zobJmahcyeaEVOFKVQ5dw@mail.gmail.com/


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 1/3] mm: Handle MADV_WILLNEED through vfs_fadvise()
  2019-07-11 14:00 [PATCH 0/3] xfs: Fix races between readahead and hole punching Jan Kara
@ 2019-07-11 14:00 ` Jan Kara
  2019-07-12 17:50   ` Darrick J. Wong
                     ` (2 more replies)
  2019-07-11 14:00 ` [PATCH 2/3] fs: Export generic_fadvise() Jan Kara
  2019-07-11 14:00 ` [PATCH 3/3] xfs: Fix stale data exposure when readahead races with hole punch Jan Kara
  2 siblings, 3 replies; 13+ messages in thread
From: Jan Kara @ 2019-07-11 14:00 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-mm, linux-xfs, Amir Goldstein, Boaz Harrosh, Jan Kara, stable

Currently handling of MADV_WILLNEED hint calls directly into readahead
code. Handle it by calling vfs_fadvise() instead so that filesystem can
use its ->fadvise() callback to acquire necessary locks or otherwise
prepare for the request.

Suggested-by: Amir Goldstein <amir73il@gmail.com>
CC: stable@vger.kernel.org # Needed by "xfs: Fix stale data exposure
					when readahead races with hole punch"
Signed-off-by: Jan Kara <jack@suse.cz>
---
 mm/madvise.c | 22 ++++++++++++++++------
 1 file changed, 16 insertions(+), 6 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index 628022e674a7..ae56d0ef337d 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -14,6 +14,7 @@
 #include <linux/userfaultfd_k.h>
 #include <linux/hugetlb.h>
 #include <linux/falloc.h>
+#include <linux/fadvise.h>
 #include <linux/sched.h>
 #include <linux/ksm.h>
 #include <linux/fs.h>
@@ -275,6 +276,7 @@ static long madvise_willneed(struct vm_area_struct *vma,
 			     unsigned long start, unsigned long end)
 {
 	struct file *file = vma->vm_file;
+	loff_t offset;
 
 	*prev = vma;
 #ifdef CONFIG_SWAP
@@ -298,12 +300,20 @@ static long madvise_willneed(struct vm_area_struct *vma,
 		return 0;
 	}
 
-	start = ((start - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
-	if (end > vma->vm_end)
-		end = vma->vm_end;
-	end = ((end - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
-
-	force_page_cache_readahead(file->f_mapping, file, start, end - start);
+	/*
+	 * Filesystem's fadvise may need to take various locks.  We need to
+	 * explicitly grab a reference because the vma (and hence the
+	 * vma's reference to the file) can go away as soon as we drop
+	 * mmap_sem.
+	 */
+	*prev = NULL;	/* tell sys_madvise we drop mmap_sem */
+	get_file(file);
+	up_read(&current->mm->mmap_sem);
+	offset = (loff_t)(start - vma->vm_start)
+			+ ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
+	vfs_fadvise(file, offset, end - start, POSIX_FADV_WILLNEED);
+	fput(file);
+	down_read(&current->mm->mmap_sem);
 	return 0;
 }
 
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 2/3] fs: Export generic_fadvise()
  2019-07-11 14:00 [PATCH 0/3] xfs: Fix races between readahead and hole punching Jan Kara
  2019-07-11 14:00 ` [PATCH 1/3] mm: Handle MADV_WILLNEED through vfs_fadvise() Jan Kara
@ 2019-07-11 14:00 ` Jan Kara
  2019-07-12 17:50   ` Darrick J. Wong
  2019-07-12 23:55   ` Sasha Levin
  2019-07-11 14:00 ` [PATCH 3/3] xfs: Fix stale data exposure when readahead races with hole punch Jan Kara
  2 siblings, 2 replies; 13+ messages in thread
From: Jan Kara @ 2019-07-11 14:00 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-mm, linux-xfs, Amir Goldstein, Boaz Harrosh, Jan Kara, stable

Filesystems will need to call this function from their fadvise handlers.

CC: stable@vger.kernel.org # Needed by "xfs: Fix stale data exposure when
					readahead races with hole punch"
Signed-off-by: Jan Kara <jack@suse.cz>
---
 include/linux/fs.h | 2 ++
 mm/fadvise.c       | 4 ++--
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index f7fdfe93e25d..2666862ff00d 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3536,6 +3536,8 @@ extern void inode_nohighmem(struct inode *inode);
 /* mm/fadvise.c */
 extern int vfs_fadvise(struct file *file, loff_t offset, loff_t len,
 		       int advice);
+extern int generic_fadvise(struct file *file, loff_t offset, loff_t len,
+			   int advice);
 
 #if defined(CONFIG_IO_URING)
 extern struct sock *io_uring_get_socket(struct file *file);
diff --git a/mm/fadvise.c b/mm/fadvise.c
index 467bcd032037..4f17c83db575 100644
--- a/mm/fadvise.c
+++ b/mm/fadvise.c
@@ -27,8 +27,7 @@
  * deactivate the pages and clear PG_Referenced.
  */
 
-static int generic_fadvise(struct file *file, loff_t offset, loff_t len,
-			   int advice)
+int generic_fadvise(struct file *file, loff_t offset, loff_t len, int advice)
 {
 	struct inode *inode;
 	struct address_space *mapping;
@@ -178,6 +177,7 @@ static int generic_fadvise(struct file *file, loff_t offset, loff_t len,
 	}
 	return 0;
 }
+EXPORT_SYMBOL(generic_fadvise);
 
 int vfs_fadvise(struct file *file, loff_t offset, loff_t len, int advice)
 {
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 3/3] xfs: Fix stale data exposure when readahead races with hole punch
  2019-07-11 14:00 [PATCH 0/3] xfs: Fix races between readahead and hole punching Jan Kara
  2019-07-11 14:00 ` [PATCH 1/3] mm: Handle MADV_WILLNEED through vfs_fadvise() Jan Kara
  2019-07-11 14:00 ` [PATCH 2/3] fs: Export generic_fadvise() Jan Kara
@ 2019-07-11 14:00 ` Jan Kara
  2019-07-11 15:28   ` Amir Goldstein
  2 siblings, 1 reply; 13+ messages in thread
From: Jan Kara @ 2019-07-11 14:00 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-mm, linux-xfs, Amir Goldstein, Boaz Harrosh, Jan Kara, stable

Hole puching currently evicts pages from page cache and then goes on to
remove blocks from the inode. This happens under both XFS_IOLOCK_EXCL
and XFS_MMAPLOCK_EXCL which provides appropriate serialization with
racing reads or page faults. However there is currently nothing that
prevents readahead triggered by fadvise() or madvise() from racing with
the hole punch and instantiating page cache page after hole punching has
evicted page cache in xfs_flush_unmap_range() but before it has removed
blocks from the inode. This page cache page will be mapping soon to be
freed block and that can lead to returning stale data to userspace or
even filesystem corruption.

Fix the problem by protecting handling of readahead requests by
XFS_IOLOCK_SHARED similarly as we protect reads.

CC: stable@vger.kernel.org
Link: https://lore.kernel.org/linux-fsdevel/CAOQ4uxjQNmxqmtA_VbYW0Su9rKRk2zobJmahcyeaEVOFKVQ5dw@mail.gmail.com/
Reported-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/xfs/xfs_file.c | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 76748255f843..88fe3dbb3ba2 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -33,6 +33,7 @@
 #include <linux/pagevec.h>
 #include <linux/backing-dev.h>
 #include <linux/mman.h>
+#include <linux/fadvise.h>
 
 static const struct vm_operations_struct xfs_file_vm_ops;
 
@@ -939,6 +940,24 @@ xfs_file_fallocate(
 	return error;
 }
 
+STATIC int
+xfs_file_fadvise(
+	struct file *file,
+	loff_t start,
+	loff_t end,
+	int advice)
+{
+	struct xfs_inode *ip = XFS_I(file_inode(file));
+	int ret;
+
+	/* Readahead needs protection from hole punching and similar ops */
+	if (advice == POSIX_FADV_WILLNEED)
+		xfs_ilock(ip, XFS_IOLOCK_SHARED);
+	ret = generic_fadvise(file, start, end, advice);
+	if (advice == POSIX_FADV_WILLNEED)
+		xfs_iunlock(ip, XFS_IOLOCK_SHARED);
+	return ret;
+}
 
 STATIC loff_t
 xfs_file_remap_range(
@@ -1235,6 +1254,7 @@ const struct file_operations xfs_file_operations = {
 	.fsync		= xfs_file_fsync,
 	.get_unmapped_area = thp_get_unmapped_area,
 	.fallocate	= xfs_file_fallocate,
+	.fadvise	= xfs_file_fadvise,
 	.remap_file_range = xfs_file_remap_range,
 };
 
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH 3/3] xfs: Fix stale data exposure when readahead races with hole punch
  2019-07-11 14:00 ` [PATCH 3/3] xfs: Fix stale data exposure when readahead races with hole punch Jan Kara
@ 2019-07-11 15:28   ` Amir Goldstein
  2019-07-11 15:49     ` Darrick J. Wong
  0 siblings, 1 reply; 13+ messages in thread
From: Amir Goldstein @ 2019-07-11 15:28 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, Linux MM, linux-xfs, Boaz Harrosh, stable

On Thu, Jul 11, 2019 at 5:00 PM Jan Kara <jack@suse.cz> wrote:
>
> Hole puching currently evicts pages from page cache and then goes on to
> remove blocks from the inode. This happens under both XFS_IOLOCK_EXCL
> and XFS_MMAPLOCK_EXCL which provides appropriate serialization with
> racing reads or page faults. However there is currently nothing that
> prevents readahead triggered by fadvise() or madvise() from racing with
> the hole punch and instantiating page cache page after hole punching has
> evicted page cache in xfs_flush_unmap_range() but before it has removed
> blocks from the inode. This page cache page will be mapping soon to be
> freed block and that can lead to returning stale data to userspace or
> even filesystem corruption.
>
> Fix the problem by protecting handling of readahead requests by
> XFS_IOLOCK_SHARED similarly as we protect reads.
>
> CC: stable@vger.kernel.org
> Link: https://lore.kernel.org/linux-fsdevel/CAOQ4uxjQNmxqmtA_VbYW0Su9rKRk2zobJmahcyeaEVOFKVQ5dw@mail.gmail.com/
> Reported-by: Amir Goldstein <amir73il@gmail.com>
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---

Looks sane. (I'll let xfs developers offer reviewed-by tags)

>  fs/xfs/xfs_file.c | 20 ++++++++++++++++++++
>  1 file changed, 20 insertions(+)
>
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 76748255f843..88fe3dbb3ba2 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -33,6 +33,7 @@
>  #include <linux/pagevec.h>
>  #include <linux/backing-dev.h>
>  #include <linux/mman.h>
> +#include <linux/fadvise.h>
>
>  static const struct vm_operations_struct xfs_file_vm_ops;
>
> @@ -939,6 +940,24 @@ xfs_file_fallocate(
>         return error;
>  }
>
> +STATIC int
> +xfs_file_fadvise(
> +       struct file *file,
> +       loff_t start,
> +       loff_t end,
> +       int advice)
> +{
> +       struct xfs_inode *ip = XFS_I(file_inode(file));
> +       int ret;
> +
> +       /* Readahead needs protection from hole punching and similar ops */
> +       if (advice == POSIX_FADV_WILLNEED)
> +               xfs_ilock(ip, XFS_IOLOCK_SHARED);
> +       ret = generic_fadvise(file, start, end, advice);
> +       if (advice == POSIX_FADV_WILLNEED)
> +               xfs_iunlock(ip, XFS_IOLOCK_SHARED);
> +       return ret;
> +}
>
>  STATIC loff_t
>  xfs_file_remap_range(
> @@ -1235,6 +1254,7 @@ const struct file_operations xfs_file_operations = {
>         .fsync          = xfs_file_fsync,
>         .get_unmapped_area = thp_get_unmapped_area,
>         .fallocate      = xfs_file_fallocate,
> +       .fadvise        = xfs_file_fadvise,
>         .remap_file_range = xfs_file_remap_range,
>  };
>
> --
> 2.16.4
>


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 3/3] xfs: Fix stale data exposure when readahead races with hole punch
  2019-07-11 15:28   ` Amir Goldstein
@ 2019-07-11 15:49     ` Darrick J. Wong
  2019-07-12 12:00       ` Jan Kara
  0 siblings, 1 reply; 13+ messages in thread
From: Darrick J. Wong @ 2019-07-11 15:49 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Jan Kara, linux-fsdevel, Linux MM, linux-xfs, Boaz Harrosh, stable

On Thu, Jul 11, 2019 at 06:28:54PM +0300, Amir Goldstein wrote:
> On Thu, Jul 11, 2019 at 5:00 PM Jan Kara <jack@suse.cz> wrote:
> >
> > Hole puching currently evicts pages from page cache and then goes on to
> > remove blocks from the inode. This happens under both XFS_IOLOCK_EXCL
> > and XFS_MMAPLOCK_EXCL which provides appropriate serialization with
> > racing reads or page faults. However there is currently nothing that
> > prevents readahead triggered by fadvise() or madvise() from racing with
> > the hole punch and instantiating page cache page after hole punching has
> > evicted page cache in xfs_flush_unmap_range() but before it has removed
> > blocks from the inode. This page cache page will be mapping soon to be
> > freed block and that can lead to returning stale data to userspace or
> > even filesystem corruption.
> >
> > Fix the problem by protecting handling of readahead requests by
> > XFS_IOLOCK_SHARED similarly as we protect reads.
> >
> > CC: stable@vger.kernel.org
> > Link: https://lore.kernel.org/linux-fsdevel/CAOQ4uxjQNmxqmtA_VbYW0Su9rKRk2zobJmahcyeaEVOFKVQ5dw@mail.gmail.com/
> > Reported-by: Amir Goldstein <amir73il@gmail.com>
> > Signed-off-by: Jan Kara <jack@suse.cz>
> > ---
> 
> Looks sane. (I'll let xfs developers offer reviewed-by tags)
> 
> >  fs/xfs/xfs_file.c | 20 ++++++++++++++++++++
> >  1 file changed, 20 insertions(+)
> >
> > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> > index 76748255f843..88fe3dbb3ba2 100644
> > --- a/fs/xfs/xfs_file.c
> > +++ b/fs/xfs/xfs_file.c
> > @@ -33,6 +33,7 @@
> >  #include <linux/pagevec.h>
> >  #include <linux/backing-dev.h>
> >  #include <linux/mman.h>
> > +#include <linux/fadvise.h>
> >
> >  static const struct vm_operations_struct xfs_file_vm_ops;
> >
> > @@ -939,6 +940,24 @@ xfs_file_fallocate(
> >         return error;
> >  }
> >
> > +STATIC int
> > +xfs_file_fadvise(
> > +       struct file *file,
> > +       loff_t start,
> > +       loff_t end,
> > +       int advice)

Indentation needs fixing here.

> > +{
> > +       struct xfs_inode *ip = XFS_I(file_inode(file));
> > +       int ret;
> > +
> > +       /* Readahead needs protection from hole punching and similar ops */
> > +       if (advice == POSIX_FADV_WILLNEED)
> > +               xfs_ilock(ip, XFS_IOLOCK_SHARED);

It's good to fix this race, but at the same time I wonder what's the
impact to processes writing to one part of a file waiting on IOLOCK_EXCL
while readahead holds IOLOCK_SHARED?

(bluh bluh range locks ftw bluh bluh)

Do we need a lock for DONTNEED?  I think the answer is that you have to
lock the page to drop it and that will protect us from <myriad punch and
truncate spaghetti> ... ?

> > +       ret = generic_fadvise(file, start, end, advice);
> > +       if (advice == POSIX_FADV_WILLNEED)
> > +               xfs_iunlock(ip, XFS_IOLOCK_SHARED);

Maybe it'd be better to do:

	int	lockflags = 0;

	if (advice == POSIX_FADV_WILLNEED) {
		lockflags = XFS_IOLOCK_SHARED;
		xfs_ilock(ip, lockflags);
	}

	ret = generic_fadvise(file, start, end, advice);

	if (lockflags)
		xfs_iunlock(ip, lockflags);

Just in case we some day want more or different types of inode locks?

--D

> > +       return ret;
> > +}
> >
> >  STATIC loff_t
> >  xfs_file_remap_range(
> > @@ -1235,6 +1254,7 @@ const struct file_operations xfs_file_operations = {
> >         .fsync          = xfs_file_fsync,
> >         .get_unmapped_area = thp_get_unmapped_area,
> >         .fallocate      = xfs_file_fallocate,
> > +       .fadvise        = xfs_file_fadvise,
> >         .remap_file_range = xfs_file_remap_range,
> >  };
> >
> > --
> > 2.16.4
> >


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 3/3] xfs: Fix stale data exposure when readahead races with hole punch
  2019-07-11 15:49     ` Darrick J. Wong
@ 2019-07-12 12:00       ` Jan Kara
  2019-07-12 17:56         ` Darrick J. Wong
  0 siblings, 1 reply; 13+ messages in thread
From: Jan Kara @ 2019-07-12 12:00 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Amir Goldstein, Jan Kara, linux-fsdevel, Linux MM, linux-xfs,
	Boaz Harrosh, stable

On Thu 11-07-19 08:49:17, Darrick J. Wong wrote:
> On Thu, Jul 11, 2019 at 06:28:54PM +0300, Amir Goldstein wrote:
> > > +{
> > > +       struct xfs_inode *ip = XFS_I(file_inode(file));
> > > +       int ret;
> > > +
> > > +       /* Readahead needs protection from hole punching and similar ops */
> > > +       if (advice == POSIX_FADV_WILLNEED)
> > > +               xfs_ilock(ip, XFS_IOLOCK_SHARED);
> 
> It's good to fix this race, but at the same time I wonder what's the
> impact to processes writing to one part of a file waiting on IOLOCK_EXCL
> while readahead holds IOLOCK_SHARED?
> 
> (bluh bluh range locks ftw bluh bluh)

Yeah, with range locks this would have less impact. Also note that we hold
the lock only during page setup and IO submission. IO itself will already
happen without IOLOCK, only under page lock. But that's enough to stop the
race.

> Do we need a lock for DONTNEED?  I think the answer is that you have to
> lock the page to drop it and that will protect us from <myriad punch and
> truncate spaghetti> ... ?

Yeah, DONTNEED is just page writeback + invalidate. So page lock is enough
to protect from anything bad. Essentially we need IOLOCK only to protect
the places that creates new pages in page cache.

> > > +       ret = generic_fadvise(file, start, end, advice);
> > > +       if (advice == POSIX_FADV_WILLNEED)
> > > +               xfs_iunlock(ip, XFS_IOLOCK_SHARED);
> 
> Maybe it'd be better to do:
> 
> 	int	lockflags = 0;
> 
> 	if (advice == POSIX_FADV_WILLNEED) {
> 		lockflags = XFS_IOLOCK_SHARED;
> 		xfs_ilock(ip, lockflags);
> 	}
> 
> 	ret = generic_fadvise(file, start, end, advice);
> 
> 	if (lockflags)
> 		xfs_iunlock(ip, lockflags);
> 
> Just in case we some day want more or different types of inode locks?

OK, will do. Just I'll get to testing this only after I return from
vacation.

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/3] mm: Handle MADV_WILLNEED through vfs_fadvise()
  2019-07-11 14:00 ` [PATCH 1/3] mm: Handle MADV_WILLNEED through vfs_fadvise() Jan Kara
@ 2019-07-12 17:50   ` Darrick J. Wong
  2019-07-12 23:55   ` Sasha Levin
  2019-07-23  3:08   ` Boaz Harrosh
  2 siblings, 0 replies; 13+ messages in thread
From: Darrick J. Wong @ 2019-07-12 17:50 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, linux-mm, linux-xfs, Amir Goldstein, Boaz Harrosh, stable

On Thu, Jul 11, 2019 at 04:00:10PM +0200, Jan Kara wrote:
> Currently handling of MADV_WILLNEED hint calls directly into readahead
> code. Handle it by calling vfs_fadvise() instead so that filesystem can
> use its ->fadvise() callback to acquire necessary locks or otherwise
> prepare for the request.
> 
> Suggested-by: Amir Goldstein <amir73il@gmail.com>
> CC: stable@vger.kernel.org # Needed by "xfs: Fix stale data exposure
> 					when readahead races with hole punch"
> Signed-off-by: Jan Kara <jack@suse.cz>

Looks reasonable to me, though is this race between readahead and
truncate severe enough to try to push it as a fix for 5.3 or are you
targetting 5.4?

--D

> ---
>  mm/madvise.c | 22 ++++++++++++++++------
>  1 file changed, 16 insertions(+), 6 deletions(-)
> 
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 628022e674a7..ae56d0ef337d 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -14,6 +14,7 @@
>  #include <linux/userfaultfd_k.h>
>  #include <linux/hugetlb.h>
>  #include <linux/falloc.h>
> +#include <linux/fadvise.h>
>  #include <linux/sched.h>
>  #include <linux/ksm.h>
>  #include <linux/fs.h>
> @@ -275,6 +276,7 @@ static long madvise_willneed(struct vm_area_struct *vma,
>  			     unsigned long start, unsigned long end)
>  {
>  	struct file *file = vma->vm_file;
> +	loff_t offset;
>  
>  	*prev = vma;
>  #ifdef CONFIG_SWAP
> @@ -298,12 +300,20 @@ static long madvise_willneed(struct vm_area_struct *vma,
>  		return 0;
>  	}
>  
> -	start = ((start - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
> -	if (end > vma->vm_end)
> -		end = vma->vm_end;
> -	end = ((end - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
> -
> -	force_page_cache_readahead(file->f_mapping, file, start, end - start);
> +	/*
> +	 * Filesystem's fadvise may need to take various locks.  We need to
> +	 * explicitly grab a reference because the vma (and hence the
> +	 * vma's reference to the file) can go away as soon as we drop
> +	 * mmap_sem.
> +	 */
> +	*prev = NULL;	/* tell sys_madvise we drop mmap_sem */
> +	get_file(file);
> +	up_read(&current->mm->mmap_sem);
> +	offset = (loff_t)(start - vma->vm_start)
> +			+ ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
> +	vfs_fadvise(file, offset, end - start, POSIX_FADV_WILLNEED);
> +	fput(file);
> +	down_read(&current->mm->mmap_sem);
>  	return 0;
>  }
>  
> -- 
> 2.16.4
> 


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/3] fs: Export generic_fadvise()
  2019-07-11 14:00 ` [PATCH 2/3] fs: Export generic_fadvise() Jan Kara
@ 2019-07-12 17:50   ` Darrick J. Wong
  2019-07-12 23:55   ` Sasha Levin
  1 sibling, 0 replies; 13+ messages in thread
From: Darrick J. Wong @ 2019-07-12 17:50 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, linux-mm, linux-xfs, Amir Goldstein, Boaz Harrosh, stable

On Thu, Jul 11, 2019 at 04:00:11PM +0200, Jan Kara wrote:
> Filesystems will need to call this function from their fadvise handlers.
> 
> CC: stable@vger.kernel.org # Needed by "xfs: Fix stale data exposure when
> 					readahead races with hole punch"
> Signed-off-by: Jan Kara <jack@suse.cz>

Looks ok,
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> ---
>  include/linux/fs.h | 2 ++
>  mm/fadvise.c       | 4 ++--
>  2 files changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index f7fdfe93e25d..2666862ff00d 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -3536,6 +3536,8 @@ extern void inode_nohighmem(struct inode *inode);
>  /* mm/fadvise.c */
>  extern int vfs_fadvise(struct file *file, loff_t offset, loff_t len,
>  		       int advice);
> +extern int generic_fadvise(struct file *file, loff_t offset, loff_t len,
> +			   int advice);
>  
>  #if defined(CONFIG_IO_URING)
>  extern struct sock *io_uring_get_socket(struct file *file);
> diff --git a/mm/fadvise.c b/mm/fadvise.c
> index 467bcd032037..4f17c83db575 100644
> --- a/mm/fadvise.c
> +++ b/mm/fadvise.c
> @@ -27,8 +27,7 @@
>   * deactivate the pages and clear PG_Referenced.
>   */
>  
> -static int generic_fadvise(struct file *file, loff_t offset, loff_t len,
> -			   int advice)
> +int generic_fadvise(struct file *file, loff_t offset, loff_t len, int advice)
>  {
>  	struct inode *inode;
>  	struct address_space *mapping;
> @@ -178,6 +177,7 @@ static int generic_fadvise(struct file *file, loff_t offset, loff_t len,
>  	}
>  	return 0;
>  }
> +EXPORT_SYMBOL(generic_fadvise);
>  
>  int vfs_fadvise(struct file *file, loff_t offset, loff_t len, int advice)
>  {
> -- 
> 2.16.4
> 


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 3/3] xfs: Fix stale data exposure when readahead races with hole punch
  2019-07-12 12:00       ` Jan Kara
@ 2019-07-12 17:56         ` Darrick J. Wong
  0 siblings, 0 replies; 13+ messages in thread
From: Darrick J. Wong @ 2019-07-12 17:56 UTC (permalink / raw)
  To: Jan Kara
  Cc: Amir Goldstein, linux-fsdevel, Linux MM, linux-xfs, Boaz Harrosh, stable

On Fri, Jul 12, 2019 at 02:00:04PM +0200, Jan Kara wrote:
> On Thu 11-07-19 08:49:17, Darrick J. Wong wrote:
> > On Thu, Jul 11, 2019 at 06:28:54PM +0300, Amir Goldstein wrote:
> > > > +{
> > > > +       struct xfs_inode *ip = XFS_I(file_inode(file));
> > > > +       int ret;
> > > > +
> > > > +       /* Readahead needs protection from hole punching and similar ops */
> > > > +       if (advice == POSIX_FADV_WILLNEED)
> > > > +               xfs_ilock(ip, XFS_IOLOCK_SHARED);
> > 
> > It's good to fix this race, but at the same time I wonder what's the
> > impact to processes writing to one part of a file waiting on IOLOCK_EXCL
> > while readahead holds IOLOCK_SHARED?
> > 
> > (bluh bluh range locks ftw bluh bluh)
> 
> Yeah, with range locks this would have less impact. Also note that we hold
> the lock only during page setup and IO submission. IO itself will already
> happen without IOLOCK, only under page lock. But that's enough to stop the
> race.

> > Do we need a lock for DONTNEED?  I think the answer is that you have to
> > lock the page to drop it and that will protect us from <myriad punch and
> > truncate spaghetti> ... ?
> 
> Yeah, DONTNEED is just page writeback + invalidate. So page lock is enough
> to protect from anything bad. Essentially we need IOLOCK only to protect
> the places that creates new pages in page cache.
> 
> > > > +       ret = generic_fadvise(file, start, end, advice);
> > > > +       if (advice == POSIX_FADV_WILLNEED)
> > > > +               xfs_iunlock(ip, XFS_IOLOCK_SHARED);
> > 
> > Maybe it'd be better to do:
> > 
> > 	int	lockflags = 0;
> > 
> > 	if (advice == POSIX_FADV_WILLNEED) {
> > 		lockflags = XFS_IOLOCK_SHARED;
> > 		xfs_ilock(ip, lockflags);
> > 	}
> > 
> > 	ret = generic_fadvise(file, start, end, advice);
> > 
> > 	if (lockflags)
> > 		xfs_iunlock(ip, lockflags);
> > 
> > Just in case we some day want more or different types of inode locks?
> 
> OK, will do. Just I'll get to testing this only after I return from
> vacation.

<nod>

--D
> 
> 								Honza
> 
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/3] fs: Export generic_fadvise()
  2019-07-11 14:00 ` [PATCH 2/3] fs: Export generic_fadvise() Jan Kara
  2019-07-12 17:50   ` Darrick J. Wong
@ 2019-07-12 23:55   ` Sasha Levin
  1 sibling, 0 replies; 13+ messages in thread
From: Sasha Levin @ 2019-07-12 23:55 UTC (permalink / raw)
  To: Sasha Levin, Jan Kara, linux-fsdevel; +Cc: linux-mm, , stable

Hi,

[This is an automated email]

This commit has been processed because it contains a -stable tag.
The stable tag indicates that it's relevant for the following trees: all

The bot has tested the following trees: v5.2, v5.1.17, v4.19.58, v4.14.133, v4.9.185, v4.4.185.

v5.2: Build OK!
v5.1.17: Build OK!
v4.19.58: Build OK!
v4.14.133: Failed to apply! Possible dependencies:
    17ef445f9bef ("Documentation/filesystems: update documentation of file_operations")
    312db1aa1dc7 ("fs: add ksys_mount() helper; remove in-kernel calls to sys_mount()")
    36028d5dd711 ("fs: add ksys_p{read,write}64() helpers; remove in-kernel calls to syscalls")
    3a18ef5c1b39 ("fs: add ksys_umount() helper; remove in-kernel call to sys_umount()")
    3ce4a7bf6626 ("fs: add ksys_read() helper; remove in-kernel calls to sys_read()")
    45cd0faae371 ("vfs: add the fadvise() file operation")
    6e8b704df584 ("fs: update documentation to mention __poll_t and match the code")
    70f68ee81e2e ("fs: add ksys_sync() helper; remove in-kernel calls to sys_sync()")
    806cbae1228c ("fs: add ksys_sync_file_range helper(); remove in-kernel calls to syscall")
    819671ff849b ("syscalls: define and explain goal to not call syscalls in the kernel")
    9b32105ec6b1 ("kernel: add ksys_unshare() helper; remove in-kernel calls to sys_unshare()")
    9d5b7c956b09 ("mm: add ksys_fadvise64_64() helper; remove in-kernel call to sys_fadvise64_64()")
    a16fe33ab557 ("fs: add ksys_chroot() helper; remove-in kernel calls to sys_chroot()")
    c7248321a3d4 ("fs: add ksys_dup{,3}() helper; remove in-kernel calls to sys_dup{,3}()")
    e2aaa9f42336 ("kernel: add ksys_setsid() helper; remove in-kernel call to sys_setsid()")
    e7a3e8b2edf5 ("fs: add ksys_write() helper; remove in-kernel calls to sys_write()")
    edf292c76b88 ("fs: add ksys_fallocate() wrapper; remove in-kernel calls to sys_fallocate()")

v4.9.185: Failed to apply! Possible dependencies:
    17ef445f9bef ("Documentation/filesystems: update documentation of file_operations")
    3859a271a003 ("randstruct: Mark various structs for randomization")
    45cd0faae371 ("vfs: add the fadvise() file operation")
    5613fda9a503 ("sched/cputime: Convert task/group cputime to nsecs")
    60f3e00d25b4 ("sysv,ipc: cacheline align kern_ipc_perm")
    6e8b704df584 ("fs: update documentation to mention __poll_t and match the code")
    8c8b73c4811f ("sched/cputime, powerpc: Prepare accounting structure for cputime flush on tick")
    a19ff1a2cc92 ("sched/cputime, powerpc/vtime: Accumulate cputime and account only on tick/task switch")
    b18b6a9cef7f ("timers: Omit POSIX timer stuff from task_struct when disabled")
    baa73d9e478f ("posix-timers: Make them configurable")
    c3edc4010e9d ("sched/headers: Move task_struct::signal and task_struct::sighand types and accessors into <linux/sched/signal.h>")
    d69dece5f5b6 ("LSM: Add /sys/kernel/security/lsm")
    f828c3d0aeba ("sched/cputime, powerpc: Migrate stolen_time field to the accounting structure")

v4.4.185: Failed to apply! Possible dependencies:
    04b38d601239 ("vfs: pull btrfs clone API to vfs layer")
    17ef445f9bef ("Documentation/filesystems: update documentation of file_operations")
    29732938a628 ("vfs: add copy_file_range syscall and vfs helper")
    3859a271a003 ("randstruct: Mark various structs for randomization")
    3db11b2eecc0 ("btrfs: add .copy_file_range file operation")
    45cd0faae371 ("vfs: add the fadvise() file operation")
    54dbc1517237 ("vfs: hoist the btrfs deduplication ioctl to the vfs")
    6e8b704df584 ("fs: update documentation to mention __poll_t and match the code")
    75ba1d07fd6a ("seq/proc: modify seq_put_decimal_[u]ll to take a const char *, not char")
    81243eacfa40 ("cred: simpler, 1D supplementary groups")
    d79bdd52d8be ("vfs: wire up compat ioctl for CLONE/CLONE_RANGE")
    f7a5f132b447 ("proc: faster /proc/*/status")


NOTE: The patch will not be queued to stable trees until it is upstream.

How should we proceed with this patch?

--
Thanks,
Sasha


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/3] mm: Handle MADV_WILLNEED through vfs_fadvise()
  2019-07-11 14:00 ` [PATCH 1/3] mm: Handle MADV_WILLNEED through vfs_fadvise() Jan Kara
  2019-07-12 17:50   ` Darrick J. Wong
@ 2019-07-12 23:55   ` Sasha Levin
  2019-07-23  3:08   ` Boaz Harrosh
  2 siblings, 0 replies; 13+ messages in thread
From: Sasha Levin @ 2019-07-12 23:55 UTC (permalink / raw)
  To: Sasha Levin, Jan Kara, linux-fsdevel; +Cc: linux-mm, , stable

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain, Size: 1025 bytes --]

Hi,

[This is an automated email]

This commit has been processed because it contains a -stable tag.
The stable tag indicates that it's relevant for the following trees: all

The bot has tested the following trees: v5.2, v5.1.17, v4.19.58, v4.14.133, v4.9.185, v4.4.185.

v5.2: Build OK!
v5.1.17: Build OK!
v4.19.58: Build OK!
v4.14.133: Build failed! Errors:
    mm/madvise.c:314:2: error: implicit declaration of function ‘vfs_fadvise’; did you mean ‘sys_madvise’? [-Werror=implicit-function-declaration]

v4.9.185: Build failed! Errors:
    mm/madvise.c:266:2: error: implicit declaration of function ‘vfs_fadvise’; did you mean ‘sys_madvise’? [-Werror=implicit-function-declaration]

v4.4.185: Build failed! Errors:
    mm/madvise.c:261:2: error: implicit declaration of function ‘vfs_fadvise’; did you mean ‘sys_madvise’? [-Werror=implicit-function-declaration]


NOTE: The patch will not be queued to stable trees until it is upstream.

How should we proceed with this patch?

--
Thanks,
Sasha


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/3] mm: Handle MADV_WILLNEED through vfs_fadvise()
  2019-07-11 14:00 ` [PATCH 1/3] mm: Handle MADV_WILLNEED through vfs_fadvise() Jan Kara
  2019-07-12 17:50   ` Darrick J. Wong
  2019-07-12 23:55   ` Sasha Levin
@ 2019-07-23  3:08   ` Boaz Harrosh
  2 siblings, 0 replies; 13+ messages in thread
From: Boaz Harrosh @ 2019-07-23  3:08 UTC (permalink / raw)
  To: Jan Kara, linux-fsdevel
  Cc: linux-mm, linux-xfs, Amir Goldstein, Boaz Harrosh, stable

On 11/07/2019 17:00, Jan Kara wrote:
> Currently handling of MADV_WILLNEED hint calls directly into readahead
> code. Handle it by calling vfs_fadvise() instead so that filesystem can
> use its ->fadvise() callback to acquire necessary locks or otherwise
> prepare for the request.
> 
> Suggested-by: Amir Goldstein <amir73il@gmail.com>
> CC: stable@vger.kernel.org # Needed by "xfs: Fix stale data exposure
> 					when readahead races with hole punch"
> Signed-off-by: Jan Kara <jack@suse.cz>

I had a similar patch for my needs. But did not drop the mmap_sem when calling into
the FS. This one is much better.

Reviewed-by: Boaz Harrosh <boazh@netapp.com>

I tested this patch, Works perfect for my needs.

Thank you for this patch
Boaz

> ---
>  mm/madvise.c | 22 ++++++++++++++++------
>  1 file changed, 16 insertions(+), 6 deletions(-)
> 
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 628022e674a7..ae56d0ef337d 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -14,6 +14,7 @@
>  #include <linux/userfaultfd_k.h>
>  #include <linux/hugetlb.h>
>  #include <linux/falloc.h>
> +#include <linux/fadvise.h>
>  #include <linux/sched.h>
>  #include <linux/ksm.h>
>  #include <linux/fs.h>
> @@ -275,6 +276,7 @@ static long madvise_willneed(struct vm_area_struct *vma,
>  			     unsigned long start, unsigned long end)
>  {
>  	struct file *file = vma->vm_file;
> +	loff_t offset;
>  
>  	*prev = vma;
>  #ifdef CONFIG_SWAP
> @@ -298,12 +300,20 @@ static long madvise_willneed(struct vm_area_struct *vma,
>  		return 0;
>  	}
>  
> -	start = ((start - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
> -	if (end > vma->vm_end)
> -		end = vma->vm_end;
> -	end = ((end - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
> -
> -	force_page_cache_readahead(file->f_mapping, file, start, end - start);
> +	/*
> +	 * Filesystem's fadvise may need to take various locks.  We need to
> +	 * explicitly grab a reference because the vma (and hence the
> +	 * vma's reference to the file) can go away as soon as we drop
> +	 * mmap_sem.
> +	 */
> +	*prev = NULL;	/* tell sys_madvise we drop mmap_sem */
> +	get_file(file);
> +	up_read(&current->mm->mmap_sem);
> +	offset = (loff_t)(start - vma->vm_start)
> +			+ ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
> +	vfs_fadvise(file, offset, end - start, POSIX_FADV_WILLNEED);
> +	fput(file);
> +	down_read(&current->mm->mmap_sem);
>  	return 0;
>  }
>  
> 


^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2019-07-23  3:08 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-07-11 14:00 [PATCH 0/3] xfs: Fix races between readahead and hole punching Jan Kara
2019-07-11 14:00 ` [PATCH 1/3] mm: Handle MADV_WILLNEED through vfs_fadvise() Jan Kara
2019-07-12 17:50   ` Darrick J. Wong
2019-07-12 23:55   ` Sasha Levin
2019-07-23  3:08   ` Boaz Harrosh
2019-07-11 14:00 ` [PATCH 2/3] fs: Export generic_fadvise() Jan Kara
2019-07-12 17:50   ` Darrick J. Wong
2019-07-12 23:55   ` Sasha Levin
2019-07-11 14:00 ` [PATCH 3/3] xfs: Fix stale data exposure when readahead races with hole punch Jan Kara
2019-07-11 15:28   ` Amir Goldstein
2019-07-11 15:49     ` Darrick J. Wong
2019-07-12 12:00       ` Jan Kara
2019-07-12 17:56         ` Darrick J. Wong

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).