All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Wilcox, Matthew R" <matthew.r.wilcox@intel.com>
To: Jan Kara <jack@suse.cz>
Cc: "linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
	NeilBrown <neilb@suse.com>,
	"linux-nvdimm@lists.01.org" <linux-nvdimm@lists.01.org>
Subject: RE: [PATCH 05/12] dax: Remove synchronization using i_mmap_lock
Date: Mon, 14 Mar 2016 14:51:26 +0000	[thread overview]
Message-ID: <100D68C7BA14664A8938383216E40DE0422086AA@FMSMSX114.amr.corp.intel.com> (raw)
In-Reply-To: <20160314100128.GB6801@quack.suse.cz>

I think the ultimate goal here has to be to have the truncate code lock the DAX entry in the radix tree and delete it.  Then we can have do_cow_fault() unlock the radix tree entry instead of the i_mmap_lock.  So we'll need another element in struct vm_fault where we can pass back a pointer into the radix tree instead of a pointer to struct page (or add another bit to VM_FAULT_ that indicates that 'page' is not actually a page, but a pointer to an exceptional entry ... or have the MM code understand the exceptional bit ... there's a few ways we can go here).

-----Original Message-----
From: Jan Kara [mailto:jack@suse.cz] 
Sent: Monday, March 14, 2016 3:01 AM
To: Wilcox, Matthew R
Cc: Jan Kara; linux-fsdevel@vger.kernel.org; Ross Zwisler; Williams, Dan J; linux-nvdimm@lists.01.org; NeilBrown
Subject: Re: [PATCH 05/12] dax: Remove synchronization using i_mmap_lock

On Thu 10-03-16 20:10:09, Wilcox, Matthew R wrote:
> Here's the race:
> 
> CPU 0				CPU 1
> do_cow_fault()
> __do_fault()
> takes sem
> dax_fault()
> releases sem
> 				truncate()
> 				unmap_mapping_range()
> 				i_mmap_lock_write()
> 				unmap_mapping_range_tree()
> 				i_mmap_unlock_write()
> do_set_pte()
> 
> Holding i_mmap_lock_read() from inside __do_fault() prevents the truncate
> from proceeding until the page is inseted with do_set_pte().

Ah, right. Thanks for reminding me. I was hoping to get rid of this
i_mmap_lock abuse in DAX code but obviously it needs more work :).

								Honza

> -----Original Message-----
> From: Jan Kara [mailto:jack@suse.cz] 
> Sent: Thursday, March 10, 2016 12:05 PM
> To: Wilcox, Matthew R
> Cc: Jan Kara; linux-fsdevel@vger.kernel.org; Ross Zwisler; Williams, Dan J; linux-nvdimm@lists.01.org; NeilBrown
> Subject: Re: [PATCH 05/12] dax: Remove synchronization using i_mmap_lock
> 
> On Thu 10-03-16 19:55:21, Wilcox, Matthew R wrote:
> > This locking's still necessary.  i_mmap_sem has already been released by
> > the time we're back in do_cow_fault(), so it doesn't protect that page,
> > and truncate can have whizzed past and thinks there's nothing to unmap.
> > So a task can have a MAP_PRIVATE page still in its address space after
> > it's supposed to have been unmapped.
> 
> I don't think this is possible. Filesystem holds its inode->i_mmap_sem for
> reading when handling the fault. That synchronizes against truncate...
> 
> 								Honza
> 
> > -----Original Message-----
> > From: Jan Kara [mailto:jack@suse.cz] 
> > Sent: Thursday, March 10, 2016 11:19 AM
> > To: linux-fsdevel@vger.kernel.org
> > Cc: Wilcox, Matthew R; Ross Zwisler; Williams, Dan J; linux-nvdimm@lists.01.org; NeilBrown; Jan Kara
> > Subject: [PATCH 05/12] dax: Remove synchronization using i_mmap_lock
> > 
> > At one point DAX used i_mmap_lock so synchronize page faults with page
> > table invalidation during truncate. However these days DAX uses
> > filesystem specific RW semaphores to protect against these races
> > (i_mmap_sem in ext2 & ext4 cases, XFS_MMAPLOCK in xfs case). So remove
> > the unnecessary locking.
> > 
> > Signed-off-by: Jan Kara <jack@suse.cz>
> > ---
> >  fs/dax.c    | 19 -------------------
> >  mm/memory.c | 14 --------------
> >  2 files changed, 33 deletions(-)
> > 
> > diff --git a/fs/dax.c b/fs/dax.c
> > index 9c4d697fb6fc..e409e8fc13b7 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -563,8 +563,6 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
> >  	pgoff_t size;
> >  	int error;
> >  
> > -	i_mmap_lock_read(mapping);
> > -
> >  	/*
> >  	 * Check truncate didn't happen while we were allocating a block.
> >  	 * If it did, this block may or may not be still allocated to the
> > @@ -597,8 +595,6 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
> >  	error = vm_insert_mixed(vma, vaddr, dax.pfn);
> >  
> >   out:
> > -	i_mmap_unlock_read(mapping);
> > -
> >  	return error;
> >  }
> >  
> > @@ -695,17 +691,6 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
> >  		if (error)
> >  			goto unlock_page;
> >  		vmf->page = page;
> > -		if (!page) {
> > -			i_mmap_lock_read(mapping);
> > -			/* Check we didn't race with truncate */
> > -			size = (i_size_read(inode) + PAGE_SIZE - 1) >>
> > -								PAGE_SHIFT;
> > -			if (vmf->pgoff >= size) {
> > -				i_mmap_unlock_read(mapping);
> > -				error = -EIO;
> > -				goto out;
> > -			}
> > -		}
> >  		return VM_FAULT_LOCKED;
> >  	}
> >  
> > @@ -895,8 +880,6 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
> >  		truncate_pagecache_range(inode, lstart, lend);
> >  	}
> >  
> > -	i_mmap_lock_read(mapping);
> > -
> >  	/*
> >  	 * If a truncate happened while we were allocating blocks, we may
> >  	 * leave blocks allocated to the file that are beyond EOF.  We can't
> > @@ -1013,8 +996,6 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
> >  	}
> >  
> >   out:
> > -	i_mmap_unlock_read(mapping);
> > -
> >  	if (buffer_unwritten(&bh))
> >  		complete_unwritten(&bh, !(result & VM_FAULT_ERROR));
> >  
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 8132787ae4d5..13f76eb08f33 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -2430,8 +2430,6 @@ void unmap_mapping_range(struct address_space *mapping,
> >  	if (details.last_index < details.first_index)
> >  		details.last_index = ULONG_MAX;
> >  
> > -
> > -	/* DAX uses i_mmap_lock to serialise file truncate vs page fault */
> >  	i_mmap_lock_write(mapping);
> >  	if (unlikely(!RB_EMPTY_ROOT(&mapping->i_mmap)))
> >  		unmap_mapping_range_tree(&mapping->i_mmap, &details);
> > @@ -3019,12 +3017,6 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> >  		if (fault_page) {
> >  			unlock_page(fault_page);
> >  			page_cache_release(fault_page);
> > -		} else {
> > -			/*
> > -			 * The fault handler has no page to lock, so it holds
> > -			 * i_mmap_lock for read to protect against truncate.
> > -			 */
> > -			i_mmap_unlock_read(vma->vm_file->f_mapping);
> >  		}
> >  		goto uncharge_out;
> >  	}
> > @@ -3035,12 +3027,6 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> >  	if (fault_page) {
> >  		unlock_page(fault_page);
> >  		page_cache_release(fault_page);
> > -	} else {
> > -		/*
> > -		 * The fault handler has no page to lock, so it holds
> > -		 * i_mmap_lock for read to protect against truncate.
> > -		 */
> > -		i_mmap_unlock_read(vma->vm_file->f_mapping);
> >  	}
> >  	return ret;
> >  uncharge_out:
> > -- 
> > 2.6.2
> > 
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

WARNING: multiple messages have this Message-ID (diff)
From: "Wilcox, Matthew R" <matthew.r.wilcox@intel.com>
To: Jan Kara <jack@suse.cz>
Cc: "linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
	"Ross Zwisler" <ross.zwisler@linux.intel.com>,
	"Williams, Dan J" <dan.j.williams@intel.com>,
	"linux-nvdimm@lists.01.org" <linux-nvdimm@lists.01.org>,
	NeilBrown <neilb@suse.com>
Subject: RE: [PATCH 05/12] dax: Remove synchronization using i_mmap_lock
Date: Mon, 14 Mar 2016 14:51:26 +0000	[thread overview]
Message-ID: <100D68C7BA14664A8938383216E40DE0422086AA@FMSMSX114.amr.corp.intel.com> (raw)
In-Reply-To: <20160314100128.GB6801@quack.suse.cz>

I think the ultimate goal here has to be to have the truncate code lock the DAX entry in the radix tree and delete it.  Then we can have do_cow_fault() unlock the radix tree entry instead of the i_mmap_lock.  So we'll need another element in struct vm_fault where we can pass back a pointer into the radix tree instead of a pointer to struct page (or add another bit to VM_FAULT_ that indicates that 'page' is not actually a page, but a pointer to an exceptional entry ... or have the MM code understand the exceptional bit ... there's a few ways we can go here).

-----Original Message-----
From: Jan Kara [mailto:jack@suse.cz] 
Sent: Monday, March 14, 2016 3:01 AM
To: Wilcox, Matthew R
Cc: Jan Kara; linux-fsdevel@vger.kernel.org; Ross Zwisler; Williams, Dan J; linux-nvdimm@lists.01.org; NeilBrown
Subject: Re: [PATCH 05/12] dax: Remove synchronization using i_mmap_lock

On Thu 10-03-16 20:10:09, Wilcox, Matthew R wrote:
> Here's the race:
> 
> CPU 0				CPU 1
> do_cow_fault()
> __do_fault()
> takes sem
> dax_fault()
> releases sem
> 				truncate()
> 				unmap_mapping_range()
> 				i_mmap_lock_write()
> 				unmap_mapping_range_tree()
> 				i_mmap_unlock_write()
> do_set_pte()
> 
> Holding i_mmap_lock_read() from inside __do_fault() prevents the truncate
> from proceeding until the page is inseted with do_set_pte().

Ah, right. Thanks for reminding me. I was hoping to get rid of this
i_mmap_lock abuse in DAX code but obviously it needs more work :).

								Honza

> -----Original Message-----
> From: Jan Kara [mailto:jack@suse.cz] 
> Sent: Thursday, March 10, 2016 12:05 PM
> To: Wilcox, Matthew R
> Cc: Jan Kara; linux-fsdevel@vger.kernel.org; Ross Zwisler; Williams, Dan J; linux-nvdimm@lists.01.org; NeilBrown
> Subject: Re: [PATCH 05/12] dax: Remove synchronization using i_mmap_lock
> 
> On Thu 10-03-16 19:55:21, Wilcox, Matthew R wrote:
> > This locking's still necessary.  i_mmap_sem has already been released by
> > the time we're back in do_cow_fault(), so it doesn't protect that page,
> > and truncate can have whizzed past and thinks there's nothing to unmap.
> > So a task can have a MAP_PRIVATE page still in its address space after
> > it's supposed to have been unmapped.
> 
> I don't think this is possible. Filesystem holds its inode->i_mmap_sem for
> reading when handling the fault. That synchronizes against truncate...
> 
> 								Honza
> 
> > -----Original Message-----
> > From: Jan Kara [mailto:jack@suse.cz] 
> > Sent: Thursday, March 10, 2016 11:19 AM
> > To: linux-fsdevel@vger.kernel.org
> > Cc: Wilcox, Matthew R; Ross Zwisler; Williams, Dan J; linux-nvdimm@lists.01.org; NeilBrown; Jan Kara
> > Subject: [PATCH 05/12] dax: Remove synchronization using i_mmap_lock
> > 
> > At one point DAX used i_mmap_lock so synchronize page faults with page
> > table invalidation during truncate. However these days DAX uses
> > filesystem specific RW semaphores to protect against these races
> > (i_mmap_sem in ext2 & ext4 cases, XFS_MMAPLOCK in xfs case). So remove
> > the unnecessary locking.
> > 
> > Signed-off-by: Jan Kara <jack@suse.cz>
> > ---
> >  fs/dax.c    | 19 -------------------
> >  mm/memory.c | 14 --------------
> >  2 files changed, 33 deletions(-)
> > 
> > diff --git a/fs/dax.c b/fs/dax.c
> > index 9c4d697fb6fc..e409e8fc13b7 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -563,8 +563,6 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
> >  	pgoff_t size;
> >  	int error;
> >  
> > -	i_mmap_lock_read(mapping);
> > -
> >  	/*
> >  	 * Check truncate didn't happen while we were allocating a block.
> >  	 * If it did, this block may or may not be still allocated to the
> > @@ -597,8 +595,6 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
> >  	error = vm_insert_mixed(vma, vaddr, dax.pfn);
> >  
> >   out:
> > -	i_mmap_unlock_read(mapping);
> > -
> >  	return error;
> >  }
> >  
> > @@ -695,17 +691,6 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
> >  		if (error)
> >  			goto unlock_page;
> >  		vmf->page = page;
> > -		if (!page) {
> > -			i_mmap_lock_read(mapping);
> > -			/* Check we didn't race with truncate */
> > -			size = (i_size_read(inode) + PAGE_SIZE - 1) >>
> > -								PAGE_SHIFT;
> > -			if (vmf->pgoff >= size) {
> > -				i_mmap_unlock_read(mapping);
> > -				error = -EIO;
> > -				goto out;
> > -			}
> > -		}
> >  		return VM_FAULT_LOCKED;
> >  	}
> >  
> > @@ -895,8 +880,6 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
> >  		truncate_pagecache_range(inode, lstart, lend);
> >  	}
> >  
> > -	i_mmap_lock_read(mapping);
> > -
> >  	/*
> >  	 * If a truncate happened while we were allocating blocks, we may
> >  	 * leave blocks allocated to the file that are beyond EOF.  We can't
> > @@ -1013,8 +996,6 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
> >  	}
> >  
> >   out:
> > -	i_mmap_unlock_read(mapping);
> > -
> >  	if (buffer_unwritten(&bh))
> >  		complete_unwritten(&bh, !(result & VM_FAULT_ERROR));
> >  
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 8132787ae4d5..13f76eb08f33 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -2430,8 +2430,6 @@ void unmap_mapping_range(struct address_space *mapping,
> >  	if (details.last_index < details.first_index)
> >  		details.last_index = ULONG_MAX;
> >  
> > -
> > -	/* DAX uses i_mmap_lock to serialise file truncate vs page fault */
> >  	i_mmap_lock_write(mapping);
> >  	if (unlikely(!RB_EMPTY_ROOT(&mapping->i_mmap)))
> >  		unmap_mapping_range_tree(&mapping->i_mmap, &details);
> > @@ -3019,12 +3017,6 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> >  		if (fault_page) {
> >  			unlock_page(fault_page);
> >  			page_cache_release(fault_page);
> > -		} else {
> > -			/*
> > -			 * The fault handler has no page to lock, so it holds
> > -			 * i_mmap_lock for read to protect against truncate.
> > -			 */
> > -			i_mmap_unlock_read(vma->vm_file->f_mapping);
> >  		}
> >  		goto uncharge_out;
> >  	}
> > @@ -3035,12 +3027,6 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> >  	if (fault_page) {
> >  		unlock_page(fault_page);
> >  		page_cache_release(fault_page);
> > -	} else {
> > -		/*
> > -		 * The fault handler has no page to lock, so it holds
> > -		 * i_mmap_lock for read to protect against truncate.
> > -		 */
> > -		i_mmap_unlock_read(vma->vm_file->f_mapping);
> >  	}
> >  	return ret;
> >  uncharge_out:
> > -- 
> > 2.6.2
> > 
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

  reply	other threads:[~2016-03-14 14:52 UTC|newest]

Thread overview: 64+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-03-10 19:18 [RFC] [PATCH 0/12] DAX page fault locking Jan Kara
2016-03-10 19:18 ` Jan Kara
2016-03-10 19:18 ` [PATCH 01/12] DAX: move RADIX_DAX_ definitions to dax.c Jan Kara
2016-03-10 19:18   ` Jan Kara
2016-03-11 22:54   ` Ross Zwisler
2016-03-11 22:54     ` Ross Zwisler
2016-03-10 19:18 ` [PATCH 02/12] radix-tree: make 'indirect' bit available to exception entries Jan Kara
2016-03-10 19:18   ` Jan Kara
2016-03-10 19:18 ` [PATCH 03/12] mm: Remove VM_FAULT_MINOR Jan Kara
2016-03-10 19:18   ` Jan Kara
2016-03-10 19:38   ` Wilcox, Matthew R
2016-03-10 19:38     ` Wilcox, Matthew R
2016-03-10 19:48     ` Jan Kara
2016-03-10 19:48       ` Jan Kara
2016-03-10 19:18 ` [PATCH 04/12] ocfs2: Fix return value from ocfs2_page_mkwrite() Jan Kara
2016-03-10 19:18   ` Jan Kara
2016-03-10 19:18 ` [PATCH 05/12] dax: Remove synchronization using i_mmap_lock Jan Kara
2016-03-10 19:55   ` Wilcox, Matthew R
2016-03-10 19:55     ` Wilcox, Matthew R
2016-03-10 20:05     ` Jan Kara
2016-03-10 20:05       ` Jan Kara
2016-03-10 20:10       ` Wilcox, Matthew R
2016-03-10 20:10         ` Wilcox, Matthew R
2016-03-14 10:01         ` Jan Kara
2016-03-14 10:01           ` Jan Kara
2016-03-14 14:51           ` Wilcox, Matthew R [this message]
2016-03-14 14:51             ` Wilcox, Matthew R
2016-03-15  9:50             ` Jan Kara
2016-03-15  9:50               ` Jan Kara
2016-03-10 19:18 ` [PATCH 06/12] dax: Remove complete_unwritten argument Jan Kara
2016-03-10 19:18   ` Jan Kara
2016-03-10 19:18 ` [PATCH 07/12] dax: Fix data corruption for written and mmapped files Jan Kara
2016-03-10 19:18   ` Jan Kara
2016-03-10 19:18 ` [PATCH 08/12] dax: Fix bogus fault return value on cow faults Jan Kara
2016-03-10 19:18   ` Jan Kara
2016-03-10 19:18 ` [PATCH 09/12] dax: Allow DAX code to replace exceptional entries Jan Kara
2016-03-10 19:18   ` Jan Kara
2016-03-10 19:18 ` [PATCH 10/12] dax: Remove redundant inode size checks Jan Kara
2016-03-10 19:18   ` Jan Kara
2016-03-10 19:18 ` [PATCH 11/12] dax: Disable huge page handling Jan Kara
2016-03-10 19:34   ` Dan Williams
2016-03-10 19:34     ` Dan Williams
2016-03-10 19:52     ` Jan Kara
2016-03-10 19:52       ` Jan Kara
2016-03-10 19:18 ` [PATCH 12/12] dax: New fault locking Jan Kara
2016-03-10 19:18   ` Jan Kara
2016-03-10 23:54   ` NeilBrown
2016-03-10 23:54     ` NeilBrown
2016-03-15 21:34     ` NeilBrown
2016-03-15 21:34       ` NeilBrown
2016-03-18 14:16       ` Jan Kara
2016-03-18 14:16         ` Jan Kara
2016-03-18 15:39         ` Jan Kara
2016-03-18 15:39           ` Jan Kara
2016-03-22 21:10           ` NeilBrown
2016-03-22 21:10             ` NeilBrown
2016-03-23 11:00             ` Jan Kara
2016-03-23 11:00               ` Jan Kara
2016-03-31  4:20               ` NeilBrown
2016-03-31  4:20                 ` NeilBrown
2016-03-31  8:54                 ` Jan Kara
2016-03-31  8:54                   ` Jan Kara
2016-04-01  0:34                   ` NeilBrown
2016-04-01  0:34                     ` NeilBrown

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=100D68C7BA14664A8938383216E40DE0422086AA@FMSMSX114.amr.corp.intel.com \
    --to=matthew.r.wilcox@intel.com \
    --cc=jack@suse.cz \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-nvdimm@lists.01.org \
    --cc=neilb@suse.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.