From: Dave Chinner <david@fromorbit.com>
To: Vivek Goyal <vgoyal@redhat.com>
Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	virtio-fs@redhat.com, miklos@szeredi.hu, stefanha@redhat.com,
	dgilbert@redhat.com
Subject: Re: [PATCH v2 15/20] fuse, dax: Take ->i_mmap_sem lock during dax page fault
Date: Wed, 12 Aug 2020 11:23:45 +1000	[thread overview]
Message-ID: <20200812012345.GG2079@dread.disaster.area> (raw)
In-Reply-To: <20200811175530.GB497326@redhat.com>

On Tue, Aug 11, 2020 at 01:55:30PM -0400, Vivek Goyal wrote:
> On Tue, Aug 11, 2020 at 08:22:38AM +1000, Dave Chinner wrote:
> > On Fri, Aug 07, 2020 at 03:55:21PM -0400, Vivek Goyal wrote:
> > > We need some kind of locking mechanism here. Normal file systems like
> > > ext4 and xfs seem to take their own semaphore to protect against
> > > truncate while a fault is going on.
> > > 
> > > We have an additional requirement to protect against fuse dax memory range
> > > reclaim. When a range has been selected for reclaim, we need to make sure
> > > no other read/write/fault can try to access that memory range while
> > > reclaim is in progress. Once reclaim is complete, the lock will be released
> > > and read/write/fault will trigger allocation of a fresh dax range.
> > > 
> > > Taking inode_lock() is not an option in fault path as lockdep complains
> > > about circular dependencies. So define a new fuse_inode->i_mmap_sem.
> > 
> > That's precisely why filesystems like ext4 and XFS define their own
> > rwsem.
> > 
> > Note that this isn't a DAX requirement - the page fault
> > serialisation is actually a requirement of hole punching...
> 
> Hi Dave,
> 
> I noticed that the fuse code currently does not seem to have an rwsem which
> can provide mutual exclusion between the truncation/hole_punch path
> and the page fault path. I am wondering whether that means there are issues
> with the existing code, or whether something else makes it unnecessary to
> provide this mutual exclusion.

I don't know enough about the fuse implementation to say. What I'm
saying is that nothing in the core mm/ or VFS serialises page cache
access to the data against direct filesystem manipulations of the
underlying filesystem structures.

i.e. nothing in the VFS or page fault IO path prevents this race
condition:

P0				P1
fallocate
page cache invalidation
				page fault
				read data
punch out data extents
				<data exposed to userspace is stale>
				<data exposed to userspace has no
				backing store allocated>


That's where the ext4 and XFS internal rwsem come into play:

fallocate
down_write(mmaplock)
page cache invalidation
				page fault
				down_read(mmaplock)
				<blocks>
punch out data
up_write(mmaplock)
				<unblocks>
				<sees hole>
				<allocates zeroed pages in page cache>

And there's no stale data exposure to userspace.
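
To make that pattern concrete, here is a minimal sketch written against
the fi->i_mmap_sem this patch series introduces. It is illustrative
only: the example_* helpers are placeholders, not the actual fuse code,
and only the locking shape matters.

/*
 * Sketch of the MMAPLOCK pattern, assuming the fuse_inode->i_mmap_sem
 * added by this series (context: fs/fuse/file.c with fuse_i.h included).
 */

/* fallocate(FALLOC_FL_PUNCH_HOLE) side */
static long example_punch_hole(struct inode *inode, loff_t offset, loff_t len)
{
	struct fuse_inode *fi = get_fuse_inode(inode);

	down_write(&fi->i_mmap_sem);	/* lock out all new page faults */
	truncate_pagecache_range(inode, offset, offset + len - 1);
	/* ... punch out the backing extents while faults are excluded ... */
	up_write(&fi->i_mmap_sem);	/* faults may now repopulate the range */
	return 0;
}

/* page fault side */
static vm_fault_t example_fault(struct vm_fault *vmf)
{
	struct fuse_inode *fi = get_fuse_inode(file_inode(vmf->vma->vm_file));
	vm_fault_t ret;

	down_read(&fi->i_mmap_sem);	/* blocks while a punch is in flight */
	ret = example_do_fault(vmf);	/* sees old data or a hole, never a torn mix */
	up_read(&fi->i_mmap_sem);
	return ret;
}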

It's the same reason that we use the i_rwsem to prevent concurrent
IO while a truncate or hole punch is in progress. The IO could map
the extent, then block in the IO path, while the filesystem
re-allocates and writes new data or metadata to those blocks. That's
another potential non-owner data exposure problem.

And if you don't drain AIO+DIO before truncate/hole punch, the
i_rwsem does not protect you against concurrent IO as that gets
dropped after the AIO is submitted and returns EIOCBQUEUED to the
AIO layer. Hence there's IO in flight that isn't tracked by the
i_rwsem or the MMAPLOCK, and if you punch out the blocks and
reallocate them while the IO is in flight....
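
In code terms, the drain step is what inode_dio_wait() provides. A
hedged sketch - the function name here is a placeholder, only the VFS
call is real:

/*
 * Sketch of the AIO+DIO drain. inode_dio_wait() sleeps until all direct
 * IO in flight against the inode (accounted via inode_dio_begin()/
 * inode_dio_end() in the DIO submission/completion paths) has finished,
 * so blocks about to be punched out cannot still be the target of an
 * outstanding AIO request that has already dropped the i_rwsem.
 */
static void example_drain_io_before_punch(struct inode *inode)
{
	/* caller already holds i_rwsem exclusively, so no new DIO can start */
	inode_dio_wait(inode);
}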

> > > @@ -3849,9 +3856,11 @@ static long fuse_file_fallocate(struct file *file, int mode, loff_t offset,
> > >  			file_update_time(file);
> > >  	}
> > >  
> > > -	if (mode & FALLOC_FL_PUNCH_HOLE)
> > > +	if (mode & FALLOC_FL_PUNCH_HOLE) {
> > > +		down_write(&fi->i_mmap_sem);
> > >  		truncate_pagecache_range(inode, offset, offset + length - 1);
> > > -
> > > +		up_write(&fi->i_mmap_sem);
> > > +	}
> > >  	fuse_invalidate_attr(inode);
> > 
> > 
> > I'm not sure this is sufficient. You have to lock page faults out
> > for the entire time the hole punch is being performed, not just while
> > the mapping is being invalidated.
> > 
> > That is, once you've taken the inode lock and written back the dirty
> > data over the range being punched, you can then take a page fault
> > and dirty the page again. Then after you punch the hole out,
> > you have a dirty page with non-zero data in it, and that can get
> > written out before the page cache is truncated.
> 
> Just for my better understanding of the issue, I am wondering what
> problem it will lead to.
> If one process is doing punch_hole and another is writing in the
> range being punched, the end result could be anything. Either we will
> read zeroes from the punched_hole pages or we will read the data
> written by the process writing to the mmaped page, depending on what
> order they got executed in.
>
> If that's the case, then holding fi->i_mmap_sem for the whole
> duration might not matter. What am I missing?

That it is safe to invalidate the page cache after the hole has been
punched.

There is nothing stopping, say, memory reclaim from reclaiming pages
over the range while the hole is being punched, then having the
application refault them while the backing store is being freed.
While the page fault read IO is in progress, there's nothing
stopping the filesystem from freeing those blocks, nor reallocating
them and writing something else to them (e.g. metadata). So the
application could end up reading someone else's data.

Even worse: the page fault is a write fault, it lands in a hole, has
space allocated, the page cache is zeroed, page marked dirty, and
then the hole punch calls truncate_pagecache_range() which tosses
away the zeroed page and the data the userspace application wrote
to the page.

The application then refaults the page, reading stale data off
disk instead of seeing what it had already written to the page....

And unlike truncated pages, the mm/ code cannot reliably detect
invalidation races on lockless lookup of pages that are within EOF.
The lookups rely on truncate changing the file size before page
invalidation to detect races, as page->index then points beyond EOF.
Hole punching does not change inode size, so the page cache lookups
cannot tell the difference between a new page that just needs IO to
initialise the data and a page that has just been invalidated....
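
A rough illustration of why that EOF check catches a racing truncate
but not a racing hole punch - this is a simplified stand-in, not the
actual mm/ lookup code:

/*
 * Simplified stand-in for the post-lookup sanity check: truncate shrinks
 * i_size before invalidating pages, so a stale page's index lands beyond
 * EOF and the race is detected. Hole punch leaves i_size alone, so an
 * invalidated page inside the punched range is indistinguishable from a
 * new page that simply needs IO to initialise it.
 */
static bool example_lookup_raced_with_truncate(struct page *page,
					       struct inode *inode)
{
	pgoff_t eof_index = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);

	return page->index >= eof_index;	/* never true for hole punch */
}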

IOWs, there are many ways things can go wrong with hole punch, and
the only way to avoid them all is to lock out and invalidate the
page cache before starting the fallocate operation. i.e.:

	1. lock up the entire IO path (vfs and page fault)
	2. drain the AIO+DIO path
	3. write back dirty pages
	4. invalidate the page cache

Because this is the only way we can guarantee that nothing can access
the filesystem's backing store for the range we are about to
directly manipulate the data in while we perform an "offloaded" data
transformation on that range...
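
Sketched in code, with placeholder lock and helper names (an i_rwsem
plus an XFS MMAPLOCK-style rwsem; this is not any particular
filesystem's actual implementation), that ordering looks roughly like:

/*
 * Rough sketch of the ordering above. example_inode, EXAMPLE_I() and
 * example_punch_out_extents() are placeholders; the VFS calls are the
 * real ones each step maps to.
 */
static long example_punch_hole_serialised(struct inode *inode, loff_t offset,
					  loff_t len)
{
	struct example_inode *ei = EXAMPLE_I(inode);	/* hypothetical per-inode data */
	long ret;

	/* 1. lock up the entire IO path (vfs and page fault) */
	inode_lock(inode);			/* excludes read/write/truncate */
	down_write(&ei->mmap_sem);		/* excludes page faults */

	/* 2. drain the AIO+DIO path */
	inode_dio_wait(inode);

	/* 3. write back dirty pages over the range */
	ret = filemap_write_and_wait_range(inode->i_mapping, offset,
					   offset + len - 1);
	if (ret)
		goto out;

	/* 4. invalidate the page cache over the range */
	truncate_pagecache_range(inode, offset, offset + len - 1);

	/* only now is it safe to manipulate the backing extents */
	ret = example_punch_out_extents(inode, offset, len);
out:
	up_write(&ei->mmap_sem);
	inode_unlock(inode);
	return ret;
}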

This isn't just hole punch - the same problems exist with
FALLOC_FL_ZERO_RANGE and FALLOC_FL_{INSERT,COLLAPSE}_RANGE because
they change data with extent manipulations and/or hardware offloads
that provide no guarantees of specific data state or integrity until
they complete....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
