linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Dave Chinner <david@fromorbit.com>
To: Vivek Goyal <vgoyal@redhat.com>
Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	virtio-fs@redhat.com, miklos@szeredi.hu, stefanha@redhat.com,
	dgilbert@redhat.com
Subject: Re: [PATCH v2 15/20] fuse, dax: Take ->i_mmap_sem lock during dax page fault
Date: Tue, 11 Aug 2020 08:22:38 +1000	[thread overview]
Message-ID: <20200810222238.GD2079@dread.disaster.area> (raw)
In-Reply-To: <20200807195526.426056-16-vgoyal@redhat.com>

On Fri, Aug 07, 2020 at 03:55:21PM -0400, Vivek Goyal wrote:
> We need some kind of locking mechanism here. Normal file systems like
> ext4 and xfs seems to take their own semaphore to protect agains
> truncate while fault is going on.
> 
> We have additional requirement to protect against fuse dax memory range
> reclaim. When a range has been selected for reclaim, we need to make sure
> no other read/write/fault can try to access that memory range while
> reclaim is in progress. Once reclaim is complete, lock will be released
> and read/write/fault will trigger allocation of fresh dax range.
> 
> Taking inode_lock() is not an option in fault path as lockdep complains
> about circular dependencies. So define a new fuse_inode->i_mmap_sem.

That's precisely why filesystems like ext4 and XFS define their own
rwsem.

Note that this isn't a DAX requirement - the page fault
serialisation is actually a requirement of hole punching...

> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> ---
>  fs/fuse/dir.c    |  2 ++
>  fs/fuse/file.c   | 15 ++++++++++++---
>  fs/fuse/fuse_i.h |  7 +++++++
>  fs/fuse/inode.c  |  1 +
>  4 files changed, 22 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
> index 26f028bc760b..f40766c0693b 100644
> --- a/fs/fuse/dir.c
> +++ b/fs/fuse/dir.c
> @@ -1609,8 +1609,10 @@ int fuse_do_setattr(struct dentry *dentry, struct iattr *attr,
>  	 */
>  	if ((is_truncate || !is_wb) &&
>  	    S_ISREG(inode->i_mode) && oldsize != outarg.attr.size) {
> +		down_write(&fi->i_mmap_sem);
>  		truncate_pagecache(inode, outarg.attr.size);
>  		invalidate_inode_pages2(inode->i_mapping);
> +		up_write(&fi->i_mmap_sem);
>  	}
>  
>  	clear_bit(FUSE_I_SIZE_UNSTABLE, &fi->state);
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index be7d90eb5b41..00ad27216cc3 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -2878,11 +2878,18 @@ static vm_fault_t __fuse_dax_fault(struct vm_fault *vmf,
>  
>  	if (write)
>  		sb_start_pagefault(sb);
> -
> +	/*
> +	 * We need to serialize against not only truncate but also against
> +	 * fuse dax memory range reclaim. While a range is being reclaimed,
> +	 * we do not want any read/write/mmap to make progress and try
> +	 * to populate page cache or access memory we are trying to free.
> +	 */
> +	down_read(&get_fuse_inode(inode)->i_mmap_sem);
>  	ret = dax_iomap_fault(vmf, pe_size, &pfn, NULL, &fuse_iomap_ops);
>  
>  	if (ret & VM_FAULT_NEEDDSYNC)
>  		ret = dax_finish_sync_fault(vmf, pe_size, pfn);
> +	up_read(&get_fuse_inode(inode)->i_mmap_sem);
>  
>  	if (write)
>  		sb_end_pagefault(sb);
> @@ -3849,9 +3856,11 @@ static long fuse_file_fallocate(struct file *file, int mode, loff_t offset,
>  			file_update_time(file);
>  	}
>  
> -	if (mode & FALLOC_FL_PUNCH_HOLE)
> +	if (mode & FALLOC_FL_PUNCH_HOLE) {
> +		down_write(&fi->i_mmap_sem);
>  		truncate_pagecache_range(inode, offset, offset + length - 1);
> -
> +		up_write(&fi->i_mmap_sem);
> +	}
>  	fuse_invalidate_attr(inode);


I'm not sure this is sufficient. You have to lock page faults out
for the entire time the hole punch is being performed, not just while
the mapping is being invalidated.

That is, once you've taken the inode lock and written back the dirty
data over the range being punched, you can then take a page fault
and dirty the page again. Then after you punch the hole out,
you have a dirty page with non-zero data in it, and that can get
written out before the page cache is truncated.

IOWs, to do a hole punch safely, you have to both lock the inode
and lock out page faults *before* you write back dirty data. Then
you can invalidate the page cache so you know there is no cached
data over the range about to be punched. Once the punch is done,
then you can drop all locks....

The same goes for any other operation that manipulates extents
directly (other fallocate ops, truncate, etc).

/me also wonders if there can be racing AIO+DIO in progress over the
range that is being punched and whether fuse needs to call
inode_dio_wait() before punching holes, running truncates, etc...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

  reply	other threads:[~2020-08-10 22:22 UTC|newest]

Thread overview: 45+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-08-07 19:55 [PATCH v2 00/20] virtiofs: Add DAX support Vivek Goyal
2020-08-07 19:55 ` [PATCH v2 01/20] dax: Modify bdev_dax_pgoff() to handle NULL bdev Vivek Goyal
2020-08-17 16:57   ` Jan Kara
2020-08-07 19:55 ` [PATCH v2 02/20] dax: Create a range version of dax_layout_busy_page() Vivek Goyal
2020-08-17 16:53   ` Jan Kara
2020-08-17 17:22     ` Vivek Goyal
2020-08-07 19:55 ` [PATCH v2 03/20] virtio: Add get_shm_region method Vivek Goyal
2020-08-10 13:47   ` Michael S. Tsirkin
2020-08-10 13:54     ` Vivek Goyal
2020-08-10 14:02   ` Michael S. Tsirkin
2020-08-10 14:06   ` Michael S. Tsirkin
2020-08-07 19:55 ` [PATCH v2 04/20] virtio: Implement get_shm_region for PCI transport Vivek Goyal
2020-08-10 14:05   ` Michael S. Tsirkin
2020-08-10 14:50     ` Vivek Goyal
     [not found]       ` <CAAfnVBk+Hmcm2ftd3wOK-P2NyYQ7z4Wrf1JKhLJaNkCZBLoo6g@mail.gmail.com>
2020-08-17 20:29         ` Vivek Goyal
2020-08-07 19:55 ` [PATCH v2 05/20] virtio: Implement get_shm_region for MMIO transport Vivek Goyal
2020-08-10 14:03   ` Michael S. Tsirkin
2020-08-07 19:55 ` [PATCH v2 06/20] virtiofs: Provide a helper function for virtqueue initialization Vivek Goyal
2020-08-07 19:55 ` [PATCH v2 07/20] fuse: Get rid of no_mount_options Vivek Goyal
2020-08-07 19:55 ` [PATCH v2 08/20] fuse,virtiofs: Add a mount option to enable dax Vivek Goyal
2020-08-07 19:55 ` [PATCH v2 09/20] virtio_fs, dax: Set up virtio_fs dax_device Vivek Goyal
2020-08-07 19:55 ` [PATCH v2 10/20] fuse,virtiofs: Keep a list of free dax memory ranges Vivek Goyal
2020-08-07 19:55 ` [PATCH v2 11/20] fuse: implement FUSE_INIT map_alignment field Vivek Goyal
2020-08-07 19:55 ` [PATCH v2 12/20] fuse: Introduce setupmapping/removemapping commands Vivek Goyal
2020-08-07 19:55 ` [PATCH v2 13/20] fuse, dax: Implement dax read/write operations Vivek Goyal
2020-08-10 22:06   ` Dave Chinner
2020-08-11 17:44     ` Vivek Goyal
2020-08-07 19:55 ` [PATCH v2 14/20] fuse,dax: add DAX mmap support Vivek Goyal
2020-08-07 19:55 ` [PATCH v2 15/20] fuse, dax: Take ->i_mmap_sem lock during dax page fault Vivek Goyal
2020-08-10 22:22   ` Dave Chinner [this message]
2020-08-11 17:55     ` Vivek Goyal
2020-08-12  1:23       ` Dave Chinner
2020-08-12 21:10         ` Vivek Goyal
2020-08-13  5:12           ` Dave Chinner
2020-08-07 19:55 ` [PATCH v2 16/20] fuse,virtiofs: Define dax address space operations Vivek Goyal
2020-08-07 19:55 ` [PATCH v2 17/20] fuse,virtiofs: Maintain a list of busy elements Vivek Goyal
2020-08-07 19:55 ` [PATCH v2 18/20] fuse: Release file in process context Vivek Goyal
2020-08-10  8:29   ` Miklos Szeredi
2020-08-10 15:48     ` Vivek Goyal
2020-08-10 19:37     ` Vivek Goyal
2020-08-10 19:39       ` Miklos Szeredi
2020-08-07 19:55 ` [PATCH v2 19/20] fuse: Take inode lock for dax inode truncation Vivek Goyal
2020-08-07 19:55 ` [PATCH v2 20/20] fuse,virtiofs: Add logic to free up a memory range Vivek Goyal
2020-08-10  7:29 ` [PATCH v2 00/20] virtiofs: Add DAX support Miklos Szeredi
2020-08-10 13:08   ` Vivek Goyal

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200810222238.GD2079@dread.disaster.area \
    --to=david@fromorbit.com \
    --cc=dgilbert@redhat.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=miklos@szeredi.hu \
    --cc=stefanha@redhat.com \
    --cc=vgoyal@redhat.com \
    --cc=virtio-fs@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).