From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from mga04.intel.com (mga04.intel.com [192.55.52.120])
	by ml01.01.org (Postfix) with ESMTP id 2C4391A1E81;
	Mon, 14 Mar 2016 07:52:00 -0700 (PDT)
From: "Wilcox, Matthew R"
Subject: RE: [PATCH 05/12] dax: Remove synchronization using i_mmap_lock
Date: Mon, 14 Mar 2016 14:51:26 +0000
Message-ID: <100D68C7BA14664A8938383216E40DE0422086AA@FMSMSX114.amr.corp.intel.com>
References: <1457637535-21633-1-git-send-email-jack@suse.cz>
	<1457637535-21633-6-git-send-email-jack@suse.cz>
	<100D68C7BA14664A8938383216E40DE0422079E9@FMSMSX114.amr.corp.intel.com>
	<20160310200501.GA23203@quack.suse.cz>
	<100D68C7BA14664A8938383216E40DE042207AA9@FMSMSX114.amr.corp.intel.com>
	<20160314100128.GB6801@quack.suse.cz>
In-Reply-To: <20160314100128.GB6801@quack.suse.cz>
Content-Language: en-US
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Errors-To: linux-nvdimm-bounces@lists.01.org
Sender: "Linux-nvdimm"
To: Jan Kara
Cc: "linux-fsdevel@vger.kernel.org", NeilBrown, "linux-nvdimm@lists.01.org"

I think the ultimate goal here has to be to have the truncate code lock
the DAX entry in the radix tree and delete it. Then we can have
do_cow_fault() unlock the radix tree entry instead of the i_mmap_lock.
So we'll need another element in struct vm_fault where we can pass back
a pointer into the radix tree instead of a pointer to struct page (or
add another bit to VM_FAULT_ that indicates that 'page' is not actually
a page, but a pointer to an exceptional entry ... or have the MM code
understand the exceptional bit ... there's a few ways we can go here).
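To make the entry-locking idea above concrete, here is a minimal userspace
sketch in C11. It is not kernel code and not the eventual implementation:
the slot array stands in for the radix tree, and every name in it
(toy_mapping, toy_fault, locked_entry, the TOY_ENTRY_* bits) is
hypothetical. It only illustrates the shape of the proposal: the fault
path locks the entry and passes a pointer to it back, the caller unlocks
it after installing the PTE, and truncate must take the same entry lock
before deleting the entry.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_SLOTS         8
#define TOY_ENTRY_LOCKED  ((uintptr_t)0x1) /* hypothetical per-entry lock bit */
#define TOY_ENTRY_EXCEPT  ((uintptr_t)0x2) /* hypothetical "not a struct page" tag */

/* Stands in for the mapping's radix tree: one slot per page offset. */
struct toy_mapping {
	_Atomic uintptr_t slots[TOY_SLOTS];
};

/* Stands in for struct vm_fault; locked_entry is the hypothetical new field. */
struct toy_fault {
	size_t pgoff;
	_Atomic uintptr_t *locked_entry;
};

static void toy_mapping_init(struct toy_mapping *m)
{
	for (size_t i = 0; i < TOY_SLOTS; i++)
		atomic_init(&m->slots[i], 0);
}

/* Store a tagged, unlocked entry for pgoff. */
static void toy_insert(struct toy_mapping *m, size_t pgoff, uintptr_t pfn)
{
	assert(pgoff < TOY_SLOTS);
	atomic_store(&m->slots[pgoff], (pfn << 2) | TOY_ENTRY_EXCEPT);
}

/* Set the lock bit if the entry exists and is currently unlocked. */
static bool toy_entry_trylock(_Atomic uintptr_t *slot)
{
	uintptr_t old = atomic_load(slot);

	while (old != 0 && !(old & TOY_ENTRY_LOCKED)) {
		/* On failure, old is reloaded and the loop condition is rechecked. */
		if (atomic_compare_exchange_weak(slot, &old, old | TOY_ENTRY_LOCKED))
			return true;
	}
	return false;
}

/* Fault side: lock the entry and hand a pointer to it back to the caller. */
static bool toy_fault_lock(struct toy_mapping *m, struct toy_fault *vmf)
{
	assert(vmf->pgoff < TOY_SLOTS);
	if (!toy_entry_trylock(&m->slots[vmf->pgoff]))
		return false;
	vmf->locked_entry = &m->slots[vmf->pgoff];
	return true;
}

/* Caller side (the do_cow_fault() role): unlock once the PTE is installed. */
static void toy_fault_finish(struct toy_fault *vmf)
{
	atomic_fetch_and(vmf->locked_entry, ~TOY_ENTRY_LOCKED);
	vmf->locked_entry = NULL;
}

/*
 * Truncate side: lock the entry, then delete it. Returns false if the entry
 * is locked by a fault (a real implementation would sleep) or already gone.
 */
static bool toy_truncate_one(struct toy_mapping *m, size_t pgoff)
{
	assert(pgoff < TOY_SLOTS);
	if (!toy_entry_trylock(&m->slots[pgoff]))
		return false;
	atomic_store(&m->slots[pgoff], 0); /* deleting also drops the lock bit */
	return true;
}
```

In this model a truncate that arrives between toy_fault_lock() and
toy_fault_finish() cannot delete the entry, which is the window the
i_mmap_lock currently covers.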

-----Original Message-----
From: Jan Kara [mailto:jack@suse.cz]
Sent: Monday, March 14, 2016 3:01 AM
To: Wilcox, Matthew R
Cc: Jan Kara; linux-fsdevel@vger.kernel.org; Ross Zwisler; Williams, Dan J; linux-nvdimm@lists.01.org; NeilBrown
Subject: Re: [PATCH 05/12] dax: Remove synchronization using i_mmap_lock

On Thu 10-03-16 20:10:09, Wilcox, Matthew R wrote:
> Here's the race:
>
> CPU 0                           CPU 1
> do_cow_fault()
>  __do_fault()
>   takes sem
>    dax_fault()
>   releases sem
>                                 truncate()
>                                  unmap_mapping_range()
>                                   i_mmap_lock_write()
>                                   unmap_mapping_range_tree()
>                                   i_mmap_unlock_write()
>  do_set_pte()
>
> Holding i_mmap_lock_read() from inside __do_fault() prevents the truncate
> from proceeding until the page is inserted with do_set_pte().

Ah, right. Thanks for reminding me. I was hoping to get rid of this
i_mmap_lock abuse in DAX code but obviously it needs more work :).

								Honza

> -----Original Message-----
> From: Jan Kara [mailto:jack@suse.cz]
> Sent: Thursday, March 10, 2016 12:05 PM
> To: Wilcox, Matthew R
> Cc: Jan Kara; linux-fsdevel@vger.kernel.org; Ross Zwisler; Williams, Dan J; linux-nvdimm@lists.01.org; NeilBrown
> Subject: Re: [PATCH 05/12] dax: Remove synchronization using i_mmap_lock
>
> On Thu 10-03-16 19:55:21, Wilcox, Matthew R wrote:
> > This locking's still necessary. i_mmap_sem has already been released by
> > the time we're back in do_cow_fault(), so it doesn't protect that page,
> > and truncate can have whizzed past and thinks there's nothing to unmap.
> > So a task can have a MAP_PRIVATE page still in its address space after
> > it's supposed to have been unmapped.
>
> I don't think this is possible. Filesystem holds its inode->i_mmap_sem for
> reading when handling the fault. That synchronizes against truncate...
>
> 								Honza
>
> > -----Original Message-----
> > From: Jan Kara [mailto:jack@suse.cz]
> > Sent: Thursday, March 10, 2016 11:19 AM
> > To: linux-fsdevel@vger.kernel.org
> > Cc: Wilcox, Matthew R; Ross Zwisler; Williams, Dan J; linux-nvdimm@lists.01.org; NeilBrown; Jan Kara
> > Subject: [PATCH 05/12] dax: Remove synchronization using i_mmap_lock
> >
> > At one point DAX used i_mmap_lock to synchronize page faults with page
> > table invalidation during truncate. However these days DAX uses
> > filesystem specific RW semaphores to protect against these races
> > (i_mmap_sem in ext2 & ext4 cases, XFS_MMAPLOCK in xfs case). So remove
> > the unnecessary locking.
> >
> > Signed-off-by: Jan Kara
> > ---
> >  fs/dax.c    | 19 -------------------
> >  mm/memory.c | 14 --------------
> >  2 files changed, 33 deletions(-)
> >
> > diff --git a/fs/dax.c b/fs/dax.c
> > index 9c4d697fb6fc..e409e8fc13b7 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -563,8 +563,6 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
> >  	pgoff_t size;
> >  	int error;
> >
> > -	i_mmap_lock_read(mapping);
> > -
> >  	/*
> >  	 * Check truncate didn't happen while we were allocating a block.
> >  	 * If it did, this block may or may not be still allocated to the
> > @@ -597,8 +595,6 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
> >  	error = vm_insert_mixed(vma, vaddr, dax.pfn);
> >
> >   out:
> > -	i_mmap_unlock_read(mapping);
> > -
> >  	return error;
> >  }
> >
> > @@ -695,17 +691,6 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
> >  		if (error)
> >  			goto unlock_page;
> >  		vmf->page = page;
> > -		if (!page) {
> > -			i_mmap_lock_read(mapping);
> > -			/* Check we didn't race with truncate */
> > -			size = (i_size_read(inode) + PAGE_SIZE - 1) >>
> > -								PAGE_SHIFT;
> > -			if (vmf->pgoff >= size) {
> > -				i_mmap_unlock_read(mapping);
> > -				error = -EIO;
> > -				goto out;
> > -			}
> > -		}
> >  		return VM_FAULT_LOCKED;
> >  	}
> >
> > @@ -895,8 +880,6 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
> >  		truncate_pagecache_range(inode, lstart, lend);
> >  	}
> >
> > -	i_mmap_lock_read(mapping);
> > -
> >  	/*
> >  	 * If a truncate happened while we were allocating blocks, we may
> >  	 * leave blocks allocated to the file that are beyond EOF. We can't
> > @@ -1013,8 +996,6 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
> >  	}
> >
> >   out:
> > -	i_mmap_unlock_read(mapping);
> > -
> >  	if (buffer_unwritten(&bh))
> >  		complete_unwritten(&bh, !(result & VM_FAULT_ERROR));
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 8132787ae4d5..13f76eb08f33 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -2430,8 +2430,6 @@ void unmap_mapping_range(struct address_space *mapping,
> >  	if (details.last_index < details.first_index)
> >  		details.last_index = ULONG_MAX;
> >
> > -
> > -	/* DAX uses i_mmap_lock to serialise file truncate vs page fault */
> >  	i_mmap_lock_write(mapping);
> >  	if (unlikely(!RB_EMPTY_ROOT(&mapping->i_mmap)))
> >  		unmap_mapping_range_tree(&mapping->i_mmap, &details);
> > @@ -3019,12 +3017,6 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> >  		if (fault_page) {
> >  			unlock_page(fault_page);
> >  			page_cache_release(fault_page);
> > -		} else {
> > -			/*
> > -			 * The fault handler has no page to lock, so it holds
> > -			 * i_mmap_lock for read to protect against truncate.
> > -			 */
> > -			i_mmap_unlock_read(vma->vm_file->f_mapping);
> >  		}
> >  		goto uncharge_out;
> >  	}
> > @@ -3035,12 +3027,6 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> >  	if (fault_page) {
> >  		unlock_page(fault_page);
> >  		page_cache_release(fault_page);
> > -	} else {
> > -		/*
> > -		 * The fault handler has no page to lock, so it holds
> > -		 * i_mmap_lock for read to protect against truncate.
> > -		 */
> > -		i_mmap_unlock_read(vma->vm_file->f_mapping);
> >  	}
> >  	return ret;
> >  uncharge_out:
> > --
> > 2.6.2
> >
> --
> Jan Kara
> SUSE Labs, CR
--
Jan Kara
SUSE Labs, CR
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm
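
For readers following the thread, the interleaving from the race diagram
can be replayed in a small single-threaded C model. This is only a sketch
under stated assumptions: toy_state, toy_unmap_mapping_range() and
toy_replay() are hypothetical names, a reader counter stands in for
i_mmap_lock, and a blocked truncate is modelled as "retry after the reader
drops the lock". It is meant to show why holding the lock for read until
do_set_pte() closes the window, nothing more.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model state; none of these fields exist in the kernel. */
struct toy_state {
	int  i_mmap_readers; /* models i_mmap_lock held for read */
	bool pte_present;    /* models the private PTE set by do_set_pte() */
	bool truncated;      /* truncate has finished unmapping */
};

/*
 * Truncate side: needs the lock for write, so it cannot proceed while a
 * reader holds it. Returns false where the real code would block.
 */
static bool toy_unmap_mapping_range(struct toy_state *s)
{
	assert(s->i_mmap_readers >= 0);
	if (s->i_mmap_readers > 0)
		return false;       /* would sleep in i_mmap_lock_write() */
	s->pte_present = false;     /* unmap_mapping_range_tree() */
	s->truncated = true;
	return true;
}

/*
 * Replays the interleaving from the diagram. Returns true if a PTE is
 * still present after truncate has completed, i.e. the bug.
 */
static bool toy_replay(bool hold_lock_until_set_pte)
{
	struct toy_state s = { 0, false, false };
	bool truncate_done;

	/* CPU 0: __do_fault() -> dax_fault(); the fs semaphore is taken and
	 * released inside, so it no longer protects anything afterwards. */
	if (hold_lock_until_set_pte)
		s.i_mmap_readers++;

	/* CPU 1: truncate() runs before CPU 0 has installed the PTE. */
	truncate_done = toy_unmap_mapping_range(&s);

	/* CPU 0: do_set_pte(), then drop the read lock if it was held. */
	s.pte_present = true;
	if (hold_lock_until_set_pte)
		s.i_mmap_readers--;

	/* CPU 1: a truncate that was blocked proceeds now. */
	if (!truncate_done)
		toy_unmap_mapping_range(&s);

	return s.truncated && s.pte_present;
}
```

With the lock dropped early, toy_replay(false) reports a PTE that survives
truncate; with the lock held across do_set_pte(), toy_replay(true) reports
none, because the unmap runs after the PTE exists and removes it.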