All of lore.kernel.org
 help / color / mirror / Atom feed
From: Vivek Goyal <vgoyal@redhat.com>
To: Dave Chinner <david@fromorbit.com>
Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	virtio-fs@redhat.com, miklos@szeredi.hu, stefanha@redhat.com,
	dgilbert@redhat.com
Subject: Re: [PATCH v2 15/20] fuse, dax: Take ->i_mmap_sem lock during dax page fault
Date: Tue, 11 Aug 2020 13:55:30 -0400	[thread overview]
Message-ID: <20200811175530.GB497326@redhat.com> (raw)
In-Reply-To: <20200810222238.GD2079@dread.disaster.area>

On Tue, Aug 11, 2020 at 08:22:38AM +1000, Dave Chinner wrote:
> On Fri, Aug 07, 2020 at 03:55:21PM -0400, Vivek Goyal wrote:
> > We need some kind of locking mechanism here. Normal file systems like
> > ext4 and xfs seems to take their own semaphore to protect agains
> > truncate while fault is going on.
> > 
> > We have additional requirement to protect against fuse dax memory range
> > reclaim. When a range has been selected for reclaim, we need to make sure
> > no other read/write/fault can try to access that memory range while
> > reclaim is in progress. Once reclaim is complete, lock will be released
> > and read/write/fault will trigger allocation of fresh dax range.
> > 
> > Taking inode_lock() is not an option in fault path as lockdep complains
> > about circular dependencies. So define a new fuse_inode->i_mmap_sem.
> 
> That's precisely why filesystems like ext4 and XFS define their own
> rwsem.
> 
> Note that this isn't a DAX requirement - the page fault
> serialisation is actually a requirement of hole punching...

Hi Dave,

I noticed that fuse code currently does not seem to have a rwsem which
can provide mutual exclusion between truncation/hole_punch path
and page fault path. I am wondering does that mean there are issues
with existing code or something else makes it unnecessary to provide
this mutual exlusion.

> 
> > Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> > ---
> >  fs/fuse/dir.c    |  2 ++
> >  fs/fuse/file.c   | 15 ++++++++++++---
> >  fs/fuse/fuse_i.h |  7 +++++++
> >  fs/fuse/inode.c  |  1 +
> >  4 files changed, 22 insertions(+), 3 deletions(-)
> > 
> > diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
> > index 26f028bc760b..f40766c0693b 100644
> > --- a/fs/fuse/dir.c
> > +++ b/fs/fuse/dir.c
> > @@ -1609,8 +1609,10 @@ int fuse_do_setattr(struct dentry *dentry, struct iattr *attr,
> >  	 */
> >  	if ((is_truncate || !is_wb) &&
> >  	    S_ISREG(inode->i_mode) && oldsize != outarg.attr.size) {
> > +		down_write(&fi->i_mmap_sem);
> >  		truncate_pagecache(inode, outarg.attr.size);
> >  		invalidate_inode_pages2(inode->i_mapping);
> > +		up_write(&fi->i_mmap_sem);
> >  	}
> >  
> >  	clear_bit(FUSE_I_SIZE_UNSTABLE, &fi->state);
> > diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> > index be7d90eb5b41..00ad27216cc3 100644
> > --- a/fs/fuse/file.c
> > +++ b/fs/fuse/file.c
> > @@ -2878,11 +2878,18 @@ static vm_fault_t __fuse_dax_fault(struct vm_fault *vmf,
> >  
> >  	if (write)
> >  		sb_start_pagefault(sb);
> > -
> > +	/*
> > +	 * We need to serialize against not only truncate but also against
> > +	 * fuse dax memory range reclaim. While a range is being reclaimed,
> > +	 * we do not want any read/write/mmap to make progress and try
> > +	 * to populate page cache or access memory we are trying to free.
> > +	 */
> > +	down_read(&get_fuse_inode(inode)->i_mmap_sem);
> >  	ret = dax_iomap_fault(vmf, pe_size, &pfn, NULL, &fuse_iomap_ops);
> >  
> >  	if (ret & VM_FAULT_NEEDDSYNC)
> >  		ret = dax_finish_sync_fault(vmf, pe_size, pfn);
> > +	up_read(&get_fuse_inode(inode)->i_mmap_sem);
> >  
> >  	if (write)
> >  		sb_end_pagefault(sb);
> > @@ -3849,9 +3856,11 @@ static long fuse_file_fallocate(struct file *file, int mode, loff_t offset,
> >  			file_update_time(file);
> >  	}
> >  
> > -	if (mode & FALLOC_FL_PUNCH_HOLE)
> > +	if (mode & FALLOC_FL_PUNCH_HOLE) {
> > +		down_write(&fi->i_mmap_sem);
> >  		truncate_pagecache_range(inode, offset, offset + length - 1);
> > -
> > +		up_write(&fi->i_mmap_sem);
> > +	}
> >  	fuse_invalidate_attr(inode);
> 
> 
> I'm not sure this is sufficient. You have to lock page faults out
> for the entire time the hole punch is being performed, not just while
> the mapping is being invalidated.
> 
> That is, once you've taken the inode lock and written back the dirty
> data over the range being punched, you can then take a page fault
> and dirty the page again. Then after you punch the hole out,
> you have a dirty page with non-zero data in it, and that can get
> written out before the page cache is truncated.

Just for my better udnerstanding of the issue, I am wondering what
problem will it lead to. If one process is doing punch_hole and
other is writing in the range being punched, end result could be
anything. Either we will read zeroes from punched_hole pages or
we will read the data written by process writing to mmaped page, depending
on in what order it got executed. 

If that's the case, then holding fi->i_mmap_sem for the whole duration
might not matter. What am I missing?

Thanks
Vivek

> 
> IOWs, to do a hole punch safely, you have to both lock the inode
> and lock out page faults *before* you write back dirty data. Then
> you can invalidate the page cache so you know there is no cached
> data over the range about to be punched. Once the punch is done,
> then you can drop all locks....
> 
> The same goes for any other operation that manipulates extents
> directly (other fallocate ops, truncate, etc).
> 
> /me also wonders if there can be racing AIO+DIO in progress over the
> range that is being punched and whether fuse needs to call
> inode_dio_wait() before punching holes, running truncates, etc...
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 


WARNING: multiple messages have this Message-ID (diff)
From: Vivek Goyal <vgoyal@redhat.com>
To: Dave Chinner <david@fromorbit.com>
Cc: miklos@szeredi.hu, linux-kernel@vger.kernel.org,
	virtio-fs@redhat.com, linux-fsdevel@vger.kernel.org
Subject: Re: [Virtio-fs] [PATCH v2 15/20] fuse, dax: Take ->i_mmap_sem lock during dax page fault
Date: Tue, 11 Aug 2020 13:55:30 -0400	[thread overview]
Message-ID: <20200811175530.GB497326@redhat.com> (raw)
In-Reply-To: <20200810222238.GD2079@dread.disaster.area>

On Tue, Aug 11, 2020 at 08:22:38AM +1000, Dave Chinner wrote:
> On Fri, Aug 07, 2020 at 03:55:21PM -0400, Vivek Goyal wrote:
> > We need some kind of locking mechanism here. Normal file systems like
> > ext4 and xfs seems to take their own semaphore to protect agains
> > truncate while fault is going on.
> > 
> > We have additional requirement to protect against fuse dax memory range
> > reclaim. When a range has been selected for reclaim, we need to make sure
> > no other read/write/fault can try to access that memory range while
> > reclaim is in progress. Once reclaim is complete, lock will be released
> > and read/write/fault will trigger allocation of fresh dax range.
> > 
> > Taking inode_lock() is not an option in fault path as lockdep complains
> > about circular dependencies. So define a new fuse_inode->i_mmap_sem.
> 
> That's precisely why filesystems like ext4 and XFS define their own
> rwsem.
> 
> Note that this isn't a DAX requirement - the page fault
> serialisation is actually a requirement of hole punching...

Hi Dave,

I noticed that fuse code currently does not seem to have a rwsem which
can provide mutual exclusion between truncation/hole_punch path
and page fault path. I am wondering does that mean there are issues
with existing code or something else makes it unnecessary to provide
this mutual exlusion.

> 
> > Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> > ---
> >  fs/fuse/dir.c    |  2 ++
> >  fs/fuse/file.c   | 15 ++++++++++++---
> >  fs/fuse/fuse_i.h |  7 +++++++
> >  fs/fuse/inode.c  |  1 +
> >  4 files changed, 22 insertions(+), 3 deletions(-)
> > 
> > diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
> > index 26f028bc760b..f40766c0693b 100644
> > --- a/fs/fuse/dir.c
> > +++ b/fs/fuse/dir.c
> > @@ -1609,8 +1609,10 @@ int fuse_do_setattr(struct dentry *dentry, struct iattr *attr,
> >  	 */
> >  	if ((is_truncate || !is_wb) &&
> >  	    S_ISREG(inode->i_mode) && oldsize != outarg.attr.size) {
> > +		down_write(&fi->i_mmap_sem);
> >  		truncate_pagecache(inode, outarg.attr.size);
> >  		invalidate_inode_pages2(inode->i_mapping);
> > +		up_write(&fi->i_mmap_sem);
> >  	}
> >  
> >  	clear_bit(FUSE_I_SIZE_UNSTABLE, &fi->state);
> > diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> > index be7d90eb5b41..00ad27216cc3 100644
> > --- a/fs/fuse/file.c
> > +++ b/fs/fuse/file.c
> > @@ -2878,11 +2878,18 @@ static vm_fault_t __fuse_dax_fault(struct vm_fault *vmf,
> >  
> >  	if (write)
> >  		sb_start_pagefault(sb);
> > -
> > +	/*
> > +	 * We need to serialize against not only truncate but also against
> > +	 * fuse dax memory range reclaim. While a range is being reclaimed,
> > +	 * we do not want any read/write/mmap to make progress and try
> > +	 * to populate page cache or access memory we are trying to free.
> > +	 */
> > +	down_read(&get_fuse_inode(inode)->i_mmap_sem);
> >  	ret = dax_iomap_fault(vmf, pe_size, &pfn, NULL, &fuse_iomap_ops);
> >  
> >  	if (ret & VM_FAULT_NEEDDSYNC)
> >  		ret = dax_finish_sync_fault(vmf, pe_size, pfn);
> > +	up_read(&get_fuse_inode(inode)->i_mmap_sem);
> >  
> >  	if (write)
> >  		sb_end_pagefault(sb);
> > @@ -3849,9 +3856,11 @@ static long fuse_file_fallocate(struct file *file, int mode, loff_t offset,
> >  			file_update_time(file);
> >  	}
> >  
> > -	if (mode & FALLOC_FL_PUNCH_HOLE)
> > +	if (mode & FALLOC_FL_PUNCH_HOLE) {
> > +		down_write(&fi->i_mmap_sem);
> >  		truncate_pagecache_range(inode, offset, offset + length - 1);
> > -
> > +		up_write(&fi->i_mmap_sem);
> > +	}
> >  	fuse_invalidate_attr(inode);
> 
> 
> I'm not sure this is sufficient. You have to lock page faults out
> for the entire time the hole punch is being performed, not just while
> the mapping is being invalidated.
> 
> That is, once you've taken the inode lock and written back the dirty
> data over the range being punched, you can then take a page fault
> and dirty the page again. Then after you punch the hole out,
> you have a dirty page with non-zero data in it, and that can get
> written out before the page cache is truncated.

Just for my better udnerstanding of the issue, I am wondering what
problem will it lead to. If one process is doing punch_hole and
other is writing in the range being punched, end result could be
anything. Either we will read zeroes from punched_hole pages or
we will read the data written by process writing to mmaped page, depending
on in what order it got executed. 

If that's the case, then holding fi->i_mmap_sem for the whole duration
might not matter. What am I missing?

Thanks
Vivek

> 
> IOWs, to do a hole punch safely, you have to both lock the inode
> and lock out page faults *before* you write back dirty data. Then
> you can invalidate the page cache so you know there is no cached
> data over the range about to be punched. Once the punch is done,
> then you can drop all locks....
> 
> The same goes for any other operation that manipulates extents
> directly (other fallocate ops, truncate, etc).
> 
> /me also wonders if there can be racing AIO+DIO in progress over the
> range that is being punched and whether fuse needs to call
> inode_dio_wait() before punching holes, running truncates, etc...
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 


  reply	other threads:[~2020-08-11 17:55 UTC|newest]

Thread overview: 97+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-08-07 19:55 [PATCH v2 00/20] virtiofs: Add DAX support Vivek Goyal
2020-08-07 19:55 ` [Virtio-fs] " Vivek Goyal
2020-08-07 19:55 ` [PATCH v2 01/20] dax: Modify bdev_dax_pgoff() to handle NULL bdev Vivek Goyal
2020-08-07 19:55   ` [Virtio-fs] " Vivek Goyal
2020-08-07 19:55   ` Vivek Goyal
2020-08-17 16:57   ` Jan Kara
2020-08-17 16:57     ` [Virtio-fs] " Jan Kara
2020-08-17 16:57     ` Jan Kara
2020-08-07 19:55 ` [PATCH v2 02/20] dax: Create a range version of dax_layout_busy_page() Vivek Goyal
2020-08-07 19:55   ` [Virtio-fs] " Vivek Goyal
2020-08-07 19:55   ` Vivek Goyal
2020-08-17 16:53   ` Jan Kara
2020-08-17 16:53     ` [Virtio-fs] " Jan Kara
2020-08-17 16:53     ` Jan Kara
2020-08-17 17:22     ` Vivek Goyal
2020-08-17 17:22       ` [Virtio-fs] " Vivek Goyal
2020-08-17 17:22       ` Vivek Goyal
2020-08-07 19:55 ` [PATCH v2 03/20] virtio: Add get_shm_region method Vivek Goyal
2020-08-07 19:55   ` [Virtio-fs] " Vivek Goyal
2020-08-10 13:47   ` Michael S. Tsirkin
2020-08-10 13:47     ` [Virtio-fs] " Michael S. Tsirkin
2020-08-10 13:54     ` Vivek Goyal
2020-08-10 13:54       ` [Virtio-fs] " Vivek Goyal
2020-08-10 14:02   ` Michael S. Tsirkin
2020-08-10 14:02     ` [Virtio-fs] " Michael S. Tsirkin
2020-08-10 14:06   ` Michael S. Tsirkin
2020-08-10 14:06     ` [Virtio-fs] " Michael S. Tsirkin
2020-08-07 19:55 ` [PATCH v2 04/20] virtio: Implement get_shm_region for PCI transport Vivek Goyal
2020-08-07 19:55   ` [Virtio-fs] " Vivek Goyal
2020-08-10 14:05   ` Michael S. Tsirkin
2020-08-10 14:05     ` [Virtio-fs] " Michael S. Tsirkin
2020-08-10 14:50     ` Vivek Goyal
2020-08-10 14:50       ` [Virtio-fs] " Vivek Goyal
2020-08-14  2:51       ` Gurchetan Singh
2020-08-17 20:29         ` Vivek Goyal
2020-08-17 20:29           ` [Virtio-fs] " Vivek Goyal
2020-08-07 19:55 ` [PATCH v2 05/20] virtio: Implement get_shm_region for MMIO transport Vivek Goyal
2020-08-07 19:55   ` [Virtio-fs] " Vivek Goyal
2020-08-10 14:03   ` Michael S. Tsirkin
2020-08-10 14:03     ` [Virtio-fs] " Michael S. Tsirkin
2020-08-07 19:55 ` [PATCH v2 06/20] virtiofs: Provide a helper function for virtqueue initialization Vivek Goyal
2020-08-07 19:55   ` [Virtio-fs] " Vivek Goyal
2020-08-07 19:55 ` [PATCH v2 07/20] fuse: Get rid of no_mount_options Vivek Goyal
2020-08-07 19:55   ` [Virtio-fs] " Vivek Goyal
2020-08-07 19:55 ` [PATCH v2 08/20] fuse,virtiofs: Add a mount option to enable dax Vivek Goyal
2020-08-07 19:55   ` [Virtio-fs] [PATCH v2 08/20] fuse, virtiofs: " Vivek Goyal
2020-08-07 19:55 ` [PATCH v2 09/20] virtio_fs, dax: Set up virtio_fs dax_device Vivek Goyal
2020-08-07 19:55   ` [Virtio-fs] " Vivek Goyal
2020-08-07 19:55 ` [PATCH v2 10/20] fuse,virtiofs: Keep a list of free dax memory ranges Vivek Goyal
2020-08-07 19:55   ` [Virtio-fs] [PATCH v2 10/20] fuse, virtiofs: " Vivek Goyal
2020-08-07 19:55 ` [PATCH v2 11/20] fuse: implement FUSE_INIT map_alignment field Vivek Goyal
2020-08-07 19:55   ` [Virtio-fs] " Vivek Goyal
2020-08-07 19:55 ` [PATCH v2 12/20] fuse: Introduce setupmapping/removemapping commands Vivek Goyal
2020-08-07 19:55   ` [Virtio-fs] " Vivek Goyal
2020-08-07 19:55 ` [PATCH v2 13/20] fuse, dax: Implement dax read/write operations Vivek Goyal
2020-08-07 19:55   ` [Virtio-fs] " Vivek Goyal
2020-08-10 22:06   ` Dave Chinner
2020-08-10 22:06     ` [Virtio-fs] " Dave Chinner
2020-08-11 17:44     ` Vivek Goyal
2020-08-11 17:44       ` [Virtio-fs] " Vivek Goyal
2020-08-07 19:55 ` [PATCH v2 14/20] fuse,dax: add DAX mmap support Vivek Goyal
2020-08-07 19:55   ` [Virtio-fs] " Vivek Goyal
2020-08-07 19:55 ` [PATCH v2 15/20] fuse, dax: Take ->i_mmap_sem lock during dax page fault Vivek Goyal
2020-08-07 19:55   ` [Virtio-fs] " Vivek Goyal
2020-08-10 22:22   ` Dave Chinner
2020-08-10 22:22     ` [Virtio-fs] " Dave Chinner
2020-08-11 17:55     ` Vivek Goyal [this message]
2020-08-11 17:55       ` Vivek Goyal
2020-08-12  1:23       ` Dave Chinner
2020-08-12  1:23         ` [Virtio-fs] " Dave Chinner
2020-08-12 21:10         ` Vivek Goyal
2020-08-12 21:10           ` [Virtio-fs] " Vivek Goyal
2020-08-13  5:12           ` Dave Chinner
2020-08-13  5:12             ` [Virtio-fs] " Dave Chinner
2020-08-07 19:55 ` [PATCH v2 16/20] fuse,virtiofs: Define dax address space operations Vivek Goyal
2020-08-07 19:55   ` [Virtio-fs] [PATCH v2 16/20] fuse, virtiofs: " Vivek Goyal
2020-08-07 19:55 ` [PATCH v2 17/20] fuse,virtiofs: Maintain a list of busy elements Vivek Goyal
2020-08-07 19:55   ` [Virtio-fs] [PATCH v2 17/20] fuse, virtiofs: " Vivek Goyal
2020-08-07 19:55 ` [PATCH v2 18/20] fuse: Release file in process context Vivek Goyal
2020-08-07 19:55   ` [Virtio-fs] " Vivek Goyal
2020-08-10  8:29   ` Miklos Szeredi
2020-08-10  8:29     ` [Virtio-fs] " Miklos Szeredi
2020-08-10 15:48     ` Vivek Goyal
2020-08-10 15:48       ` [Virtio-fs] " Vivek Goyal
2020-08-10 19:37     ` Vivek Goyal
2020-08-10 19:37       ` [Virtio-fs] " Vivek Goyal
2020-08-10 19:39       ` Miklos Szeredi
2020-08-10 19:39         ` [Virtio-fs] " Miklos Szeredi
2020-08-07 19:55 ` [PATCH v2 19/20] fuse: Take inode lock for dax inode truncation Vivek Goyal
2020-08-07 19:55   ` [Virtio-fs] " Vivek Goyal
2020-08-07 19:55 ` [PATCH v2 20/20] fuse,virtiofs: Add logic to free up a memory range Vivek Goyal
2020-08-07 19:55   ` [Virtio-fs] [PATCH v2 20/20] fuse, virtiofs: " Vivek Goyal
2020-08-10  7:29 ` [PATCH v2 00/20] virtiofs: Add DAX support Miklos Szeredi
2020-08-10  7:29   ` [Virtio-fs] " Miklos Szeredi
2020-08-10 13:08   ` Vivek Goyal
2020-08-10 13:08     ` [Virtio-fs] " Vivek Goyal
2020-08-10 13:08     ` Vivek Goyal

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200811175530.GB497326@redhat.com \
    --to=vgoyal@redhat.com \
    --cc=david@fromorbit.com \
    --cc=dgilbert@redhat.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=miklos@szeredi.hu \
    --cc=stefanha@redhat.com \
    --cc=virtio-fs@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.