From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Date: Mon, 21 Dec 2015 18:32:02 +0100 From: Jan Kara Subject: Re: [PATCH v5 5/7] ext2: call dax_pfn_mkwrite() for DAX fsync/msync Message-ID: <20151221173202.GB7030@quack.suse.cz> References: <1450502540-8744-1-git-send-email-ross.zwisler@linux.intel.com> <1450502540-8744-6-git-send-email-ross.zwisler@linux.intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1450502540-8744-6-git-send-email-ross.zwisler@linux.intel.com> Sender: owner-linux-mm@kvack.org To: Ross Zwisler Cc: linux-kernel@vger.kernel.org, "H. Peter Anvin" , "J. Bruce Fields" , Theodore Ts'o , Alexander Viro , Andreas Dilger , Dave Chinner , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@lists.01.org, x86@kernel.org, xfs@oss.sgi.com, Andrew Morton , Dan Williams , Matthew Wilcox , Dave Hansen List-ID: On Fri 18-12-15 22:22:18, Ross Zwisler wrote: > To properly support the new DAX fsync/msync infrastructure filesystems > need to call dax_pfn_mkwrite() so that DAX can track when user pages are > dirtied. The patch looks good to me. You can add: Reviewed-by: Jan Kara Honza > > Signed-off-by: Ross Zwisler > --- > fs/ext2/file.c | 4 +++- > 1 file changed, 3 insertions(+), 1 deletion(-) > > diff --git a/fs/ext2/file.c b/fs/ext2/file.c > index 11a42c5..2c88d68 100644 > --- a/fs/ext2/file.c > +++ b/fs/ext2/file.c > @@ -102,8 +102,8 @@ static int ext2_dax_pfn_mkwrite(struct vm_area_struct *vma, > { > struct inode *inode = file_inode(vma->vm_file); > struct ext2_inode_info *ei = EXT2_I(inode); > - int ret = VM_FAULT_NOPAGE; > loff_t size; > + int ret; > > sb_start_pagefault(inode->i_sb); > file_update_time(vma->vm_file); > @@ -113,6 +113,8 @@ static int ext2_dax_pfn_mkwrite(struct vm_area_struct *vma, > size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT; > if (vmf->pgoff >= size) > ret = VM_FAULT_SIGBUS; > + else > + ret = dax_pfn_mkwrite(vma, vmf); > > up_read(&ei->dax_sem); > sb_end_pagefault(inode->i_sb); > -- > 2.5.0 > > -- Jan Kara SUSE Labs, CR -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: From: Ross Zwisler Subject: [PATCH v5 5/7] ext2: call dax_pfn_mkwrite() for DAX fsync/msync Date: Fri, 18 Dec 2015 22:22:18 -0700 Message-Id: <1450502540-8744-6-git-send-email-ross.zwisler@linux.intel.com> In-Reply-To: <1450502540-8744-1-git-send-email-ross.zwisler@linux.intel.com> References: <1450502540-8744-1-git-send-email-ross.zwisler@linux.intel.com> Sender: owner-linux-mm@kvack.org To: linux-kernel@vger.kernel.org Cc: Ross Zwisler , "H. Peter Anvin" , "J. Bruce Fields" , Theodore Ts'o , Alexander Viro , Andreas Dilger , Dave Chinner , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@lists.01.org, x86@kernel.org, xfs@oss.sgi.com, Andrew Morton , Dan Williams , Matthew Wilcox , Dave Hansen List-ID: To properly support the new DAX fsync/msync infrastructure filesystems need to call dax_pfn_mkwrite() so that DAX can track when user pages are dirtied. 
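For context, a minimal sketch of how this handler ends up on the fault path: the filesystem installs it in its DAX vm_operations so the VM calls it on the first write to a read-only mapping. This is illustrative only; every name here other than ext2_dax_pfn_mkwrite() is assumed from the surrounding fs/ext2/file.c and is not part of the patch below.

	static const struct vm_operations_struct ext2_dax_vm_ops = {
		.fault		= ext2_dax_fault,
		.pmd_fault	= ext2_dax_pmd_fault,
		.page_mkwrite	= ext2_dax_mkwrite,
		.pfn_mkwrite	= ext2_dax_pfn_mkwrite,	/* the handler patched below */
	};

	static int ext2_file_mmap(struct file *file, struct vm_area_struct *vma)
	{
		if (!IS_DAX(file_inode(file)))
			return generic_file_mmap(file, vma);

		file_accessed(file);
		vma->vm_ops = &ext2_dax_vm_ops;
		vma->vm_flags |= VM_MIXEDMAP;
		return 0;
	}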
Signed-off-by: Ross Zwisler --- fs/ext2/file.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/fs/ext2/file.c b/fs/ext2/file.c index 11a42c5..2c88d68 100644 --- a/fs/ext2/file.c +++ b/fs/ext2/file.c @@ -102,8 +102,8 @@ static int ext2_dax_pfn_mkwrite(struct vm_area_struct *vma, { struct inode *inode = file_inode(vma->vm_file); struct ext2_inode_info *ei = EXT2_I(inode); - int ret = VM_FAULT_NOPAGE; loff_t size; + int ret; sb_start_pagefault(inode->i_sb); file_update_time(vma->vm_file); @@ -113,6 +113,8 @@ static int ext2_dax_pfn_mkwrite(struct vm_area_struct *vma, size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT; if (vmf->pgoff >= size) ret = VM_FAULT_SIGBUS; + else + ret = dax_pfn_mkwrite(vma, vmf); up_read(&ei->dax_sem); sb_end_pagefault(inode->i_sb); -- 2.5.0 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: From: Ross Zwisler Subject: [PATCH v5 1/7] pmem: add wb_cache_pmem() to the PMEM API Date: Fri, 18 Dec 2015 22:22:14 -0700 Message-Id: <1450502540-8744-2-git-send-email-ross.zwisler@linux.intel.com> In-Reply-To: <1450502540-8744-1-git-send-email-ross.zwisler@linux.intel.com> References: <1450502540-8744-1-git-send-email-ross.zwisler@linux.intel.com> Sender: owner-linux-mm@kvack.org To: linux-kernel@vger.kernel.org Cc: Ross Zwisler , "H. Peter Anvin" , "J. Bruce Fields" , Theodore Ts'o , Alexander Viro , Andreas Dilger , Dave Chinner , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@lists.01.org, x86@kernel.org, xfs@oss.sgi.com, Andrew Morton , Dan Williams , Matthew Wilcox , Dave Hansen List-ID: The function __arch_wb_cache_pmem() was already an internal implementation detail of the x86 PMEM API, but this functionality needs to be exported as part of the general PMEM API to handle the fsync/msync case for DAX mmaps. One thing worth noting is that we really do want this to be part of the PMEM API as opposed to a stand-alone function like clflush_cache_range() because of ordering restrictions. By having wb_cache_pmem() as part of the PMEM API we can leave it unordered, call it multiple times to write back large amounts of memory, and then order the multiple calls with a single wmb_pmem(). Signed-off-by: Ross Zwisler --- arch/x86/include/asm/pmem.h | 11 ++++++----- include/linux/pmem.h | 22 +++++++++++++++++++++- 2 files changed, 27 insertions(+), 6 deletions(-) diff --git a/arch/x86/include/asm/pmem.h b/arch/x86/include/asm/pmem.h index d8ce3ec..6c7ade0 100644 --- a/arch/x86/include/asm/pmem.h +++ b/arch/x86/include/asm/pmem.h @@ -67,18 +67,19 @@ static inline void arch_wmb_pmem(void) } /** - * __arch_wb_cache_pmem - write back a cache range with CLWB + * arch_wb_cache_pmem - write back a cache range with CLWB * @vaddr: virtual start address * @size: number of bytes to write back * * Write back a cache range using the CLWB (cache line write back) * instruction. This function requires explicit ordering with an - * arch_wmb_pmem() call. This API is internal to the x86 PMEM implementation. + * arch_wmb_pmem() call. 
*/ -static inline void __arch_wb_cache_pmem(void *vaddr, size_t size) +static inline void arch_wb_cache_pmem(void __pmem *addr, size_t size) { u16 x86_clflush_size = boot_cpu_data.x86_clflush_size; unsigned long clflush_mask = x86_clflush_size - 1; + void *vaddr = (void __force *)addr; void *vend = vaddr + size; void *p; @@ -115,7 +116,7 @@ static inline size_t arch_copy_from_iter_pmem(void __pmem *addr, size_t bytes, len = copy_from_iter_nocache(vaddr, bytes, i); if (__iter_needs_pmem_wb(i)) - __arch_wb_cache_pmem(vaddr, bytes); + arch_wb_cache_pmem(addr, bytes); return len; } @@ -138,7 +139,7 @@ static inline void arch_clear_pmem(void __pmem *addr, size_t size) else memset(vaddr, 0, size); - __arch_wb_cache_pmem(vaddr, size); + arch_wb_cache_pmem(addr, size); } static inline bool __arch_has_wmb_pmem(void) diff --git a/include/linux/pmem.h b/include/linux/pmem.h index acfea8c..7c3d11a 100644 --- a/include/linux/pmem.h +++ b/include/linux/pmem.h @@ -53,12 +53,18 @@ static inline void arch_clear_pmem(void __pmem *addr, size_t size) { BUG(); } + +static inline void arch_wb_cache_pmem(void __pmem *addr, size_t size) +{ + BUG(); +} #endif /* * Architectures that define ARCH_HAS_PMEM_API must provide * implementations for arch_memcpy_to_pmem(), arch_wmb_pmem(), - * arch_copy_from_iter_pmem(), arch_clear_pmem() and arch_has_wmb_pmem(). + * arch_copy_from_iter_pmem(), arch_clear_pmem(), arch_wb_cache_pmem() + * and arch_has_wmb_pmem(). */ static inline void memcpy_from_pmem(void *dst, void __pmem const *src, size_t size) { @@ -178,4 +184,18 @@ static inline void clear_pmem(void __pmem *addr, size_t size) else default_clear_pmem(addr, size); } + +/** + * wb_cache_pmem - write back processor cache for PMEM memory range + * @addr: virtual start address + * @size: number of bytes to write back + * + * Write back the processor cache range starting at 'addr' for 'size' bytes. + * This function requires explicit ordering with a wmb_pmem() call. + */ +static inline void wb_cache_pmem(void __pmem *addr, size_t size) +{ + if (arch_has_pmem_api()) + arch_wb_cache_pmem(addr, size); +} #endif /* __PMEM_H__ */ -- 2.5.0 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: MIME-Version: 1.0 In-Reply-To: <20151221170545.GA13494@linux.intel.com> References: <1450502540-8744-1-git-send-email-ross.zwisler@linux.intel.com> <1450502540-8744-5-git-send-email-ross.zwisler@linux.intel.com> <20151221170545.GA13494@linux.intel.com> Date: Mon, 21 Dec 2015 11:27:35 -0800 Message-ID: Subject: Re: [PATCH v5 4/7] dax: add support for fsync/sync From: Dan Williams Content-Type: text/plain; charset=UTF-8 Sender: owner-linux-mm@kvack.org To: Ross Zwisler , Dan Williams , "linux-kernel@vger.kernel.org" , "H. Peter Anvin" , "J. Bruce Fields" , Theodore Ts'o , Alexander Viro , Andreas Dilger , Dave Chinner , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4 , linux-fsdevel , Linux MM , "linux-nvdimm@lists.01.org" , X86 ML , XFS Developers , Andrew Morton , Matthew Wilcox , Dave Hansen List-ID: On Mon, Dec 21, 2015 at 9:05 AM, Ross Zwisler wrote: > On Sat, Dec 19, 2015 at 10:37:46AM -0800, Dan Williams wrote: >> On Fri, Dec 18, 2015 at 9:22 PM, Ross Zwisler >> wrote: >> > To properly handle fsync/msync in an efficient way DAX needs to track dirty >> > pages so it is able to flush them durably to media on demand. 
>> > >> > The tracking of dirty pages is done via the radix tree in struct >> > address_space. This radix tree is already used by the page writeback >> > infrastructure for tracking dirty pages associated with an open file, and >> > it already has support for exceptional (non struct page*) entries. We >> > build upon these features to add exceptional entries to the radix tree for >> > DAX dirty PMD or PTE pages at fault time. >> > >> > Signed-off-by: Ross Zwisler >> [..] >> > +static void dax_writeback_one(struct address_space *mapping, pgoff_t index, >> > + void *entry) >> > +{ >> > + struct radix_tree_root *page_tree = &mapping->page_tree; >> > + int type = RADIX_DAX_TYPE(entry); >> > + struct radix_tree_node *node; >> > + void **slot; >> > + >> > + if (type != RADIX_DAX_PTE && type != RADIX_DAX_PMD) { >> > + WARN_ON_ONCE(1); >> > + return; >> > + } >> > + >> > + spin_lock_irq(&mapping->tree_lock); >> > + /* >> > + * Regular page slots are stabilized by the page lock even >> > + * without the tree itself locked. These unlocked entries >> > + * need verification under the tree lock. >> > + */ >> > + if (!__radix_tree_lookup(page_tree, index, &node, &slot)) >> > + goto unlock; >> > + if (*slot != entry) >> > + goto unlock; >> > + >> > + /* another fsync thread may have already written back this entry */ >> > + if (!radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE)) >> > + goto unlock; >> > + >> > + radix_tree_tag_clear(page_tree, index, PAGECACHE_TAG_TOWRITE); >> > + >> > + if (type == RADIX_DAX_PMD) >> > + wb_cache_pmem(RADIX_DAX_ADDR(entry), PMD_SIZE); >> > + else >> > + wb_cache_pmem(RADIX_DAX_ADDR(entry), PAGE_SIZE); >> >> Hi Ross, I should have realized this sooner, but what guarantees that >> the address returned by RADIX_DAX_ADDR(entry) is still valid at this >> point? I think we need to store the sector in the radix tree and then >> perform a new dax_map_atomic() operation to either lookup a valid >> address or fail the sync request. Otherwise, if the device is gone >> we'll crash, or write into some other random vmalloc address space. > > Ah, good point, thank you. v4 of this series is based on a version of > DAX where we aren't properly dealing with PMEM device removal. I've got an > updated version that merges with your dax_map_atomic() changes, and I'll add > this change into v5 which I will send out today. Thank you for the > suggestion. To make the merge simpler you could skip the rebase for now and just call blk_queue_enter() / blk_queue_exit() around the calls to wb_cache_pmem. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: MIME-Version: 1.0 In-Reply-To: <20151221170545.GA13494@linux.intel.com> References: <1450502540-8744-1-git-send-email-ross.zwisler@linux.intel.com> <1450502540-8744-5-git-send-email-ross.zwisler@linux.intel.com> <20151221170545.GA13494@linux.intel.com> Date: Mon, 21 Dec 2015 09:49:01 -0800 Message-ID: Subject: Re: [PATCH v5 4/7] dax: add support for fsync/sync From: Dan Williams Content-Type: text/plain; charset=UTF-8 Sender: owner-linux-mm@kvack.org To: Ross Zwisler , Dan Williams , "linux-kernel@vger.kernel.org" , "H. Peter Anvin" , "J. 
Bruce Fields" , Theodore Ts'o , Alexander Viro , Andreas Dilger , Dave Chinner , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4 , linux-fsdevel , Linux MM , "linux-nvdimm@lists.01.org" , X86 ML , XFS Developers , Andrew Morton , Matthew Wilcox , Dave Hansen List-ID: On Mon, Dec 21, 2015 at 9:05 AM, Ross Zwisler wrote: > On Sat, Dec 19, 2015 at 10:37:46AM -0800, Dan Williams wrote: >> On Fri, Dec 18, 2015 at 9:22 PM, Ross Zwisler >> wrote: [..] >> Hi Ross, I should have realized this sooner, but what guarantees that >> the address returned by RADIX_DAX_ADDR(entry) is still valid at this >> point? I think we need to store the sector in the radix tree and then >> perform a new dax_map_atomic() operation to either lookup a valid >> address or fail the sync request. Otherwise, if the device is gone >> we'll crash, or write into some other random vmalloc address space. > > Ah, good point, thank you. v4 of this series is based on a version of > DAX where we aren't properly dealing with PMEM device removal. I've got an > updated version that merges with your dax_map_atomic() changes, and I'll add > this change into v5 which I will send out today. Thank you for the > suggestion. > > One clarification, with the code as it is in v4 we are only doing > clflush/clflushopt/clwb instructions on the kaddr we've stored in the radix > tree, so I don't think that there is actually a risk of us doing a "write into > some other random vmalloc address space"? I think at worse we will end up > clflushing an address that either isn't mapped or has been remapped by someone > else. Or are you worried that the clflush would trigger a cache writeback to > a memory address where writes have side effects, thus triggering the side > effect? > > I definitely think it needs to be fixed, I'm just trying to make sure I > understood your comment. True, this would be flushing an address that was dirtied while valid. Should be ok in practice for now since dax is effectively limited to x86, but we should not be leaning on x86 details in an architecture generic implementation like this. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Date: Mon, 21 Dec 2015 10:05:45 -0700 From: Ross Zwisler Subject: Re: [PATCH v5 4/7] dax: add support for fsync/sync Message-ID: <20151221170545.GA13494@linux.intel.com> References: <1450502540-8744-1-git-send-email-ross.zwisler@linux.intel.com> <1450502540-8744-5-git-send-email-ross.zwisler@linux.intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org To: Dan Williams Cc: Ross Zwisler , "linux-kernel@vger.kernel.org" , "H. Peter Anvin" , "J. Bruce Fields" , Theodore Ts'o , Alexander Viro , Andreas Dilger , Dave Chinner , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4 , linux-fsdevel , Linux MM , "linux-nvdimm@lists.01.org" , X86 ML , XFS Developers , Andrew Morton , Matthew Wilcox , Dave Hansen List-ID: On Sat, Dec 19, 2015 at 10:37:46AM -0800, Dan Williams wrote: > On Fri, Dec 18, 2015 at 9:22 PM, Ross Zwisler > wrote: > > To properly handle fsync/msync in an efficient way DAX needs to track dirty > > pages so it is able to flush them durably to media on demand. 
> > > > The tracking of dirty pages is done via the radix tree in struct > > address_space. This radix tree is already used by the page writeback > > infrastructure for tracking dirty pages associated with an open file, and > > it already has support for exceptional (non struct page*) entries. We > > build upon these features to add exceptional entries to the radix tree for > > DAX dirty PMD or PTE pages at fault time. > > > > Signed-off-by: Ross Zwisler > [..] > > +static void dax_writeback_one(struct address_space *mapping, pgoff_t index, > > + void *entry) > > +{ > > + struct radix_tree_root *page_tree = &mapping->page_tree; > > + int type = RADIX_DAX_TYPE(entry); > > + struct radix_tree_node *node; > > + void **slot; > > + > > + if (type != RADIX_DAX_PTE && type != RADIX_DAX_PMD) { > > + WARN_ON_ONCE(1); > > + return; > > + } > > + > > + spin_lock_irq(&mapping->tree_lock); > > + /* > > + * Regular page slots are stabilized by the page lock even > > + * without the tree itself locked. These unlocked entries > > + * need verification under the tree lock. > > + */ > > + if (!__radix_tree_lookup(page_tree, index, &node, &slot)) > > + goto unlock; > > + if (*slot != entry) > > + goto unlock; > > + > > + /* another fsync thread may have already written back this entry */ > > + if (!radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE)) > > + goto unlock; > > + > > + radix_tree_tag_clear(page_tree, index, PAGECACHE_TAG_TOWRITE); > > + > > + if (type == RADIX_DAX_PMD) > > + wb_cache_pmem(RADIX_DAX_ADDR(entry), PMD_SIZE); > > + else > > + wb_cache_pmem(RADIX_DAX_ADDR(entry), PAGE_SIZE); > > Hi Ross, I should have realized this sooner, but what guarantees that > the address returned by RADIX_DAX_ADDR(entry) is still valid at this > point? I think we need to store the sector in the radix tree and then > perform a new dax_map_atomic() operation to either lookup a valid > address or fail the sync request. Otherwise, if the device is gone > we'll crash, or write into some other random vmalloc address space. Ah, good point, thank you. v4 of this series is based on a version of DAX where we aren't properly dealing with PMEM device removal. I've got an updated version that merges with your dax_map_atomic() changes, and I'll add this change into v5 which I will send out today. Thank you for the suggestion. One clarification, with the code as it is in v4 we are only doing clflush/clflushopt/clwb instructions on the kaddr we've stored in the radix tree, so I don't think that there is actually a risk of us doing a "write into some other random vmalloc address space"? I think at worse we will end up clflushing an address that either isn't mapped or has been remapped by someone else. Or are you worried that the clflush would trigger a cache writeback to a memory address where writes have side effects, thus triggering the side effect? I definitely think it needs to be fixed, I'm just trying to make sure I understood your comment. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: MIME-Version: 1.0 In-Reply-To: <1450502540-8744-5-git-send-email-ross.zwisler@linux.intel.com> References: <1450502540-8744-1-git-send-email-ross.zwisler@linux.intel.com> <1450502540-8744-5-git-send-email-ross.zwisler@linux.intel.com> Date: Sat, 19 Dec 2015 10:37:46 -0800 Message-ID: Subject: Re: [PATCH v5 4/7] dax: add support for fsync/sync From: Dan Williams Content-Type: text/plain; charset=UTF-8 Sender: owner-linux-mm@kvack.org To: Ross Zwisler Cc: "linux-kernel@vger.kernel.org" , "H. Peter Anvin" , "J. Bruce Fields" , Theodore Ts'o , Alexander Viro , Andreas Dilger , Dave Chinner , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4 , linux-fsdevel , Linux MM , "linux-nvdimm@lists.01.org" , X86 ML , XFS Developers , Andrew Morton , Matthew Wilcox , Dave Hansen List-ID: On Fri, Dec 18, 2015 at 9:22 PM, Ross Zwisler wrote: > To properly handle fsync/msync in an efficient way DAX needs to track dirty > pages so it is able to flush them durably to media on demand. > > The tracking of dirty pages is done via the radix tree in struct > address_space. This radix tree is already used by the page writeback > infrastructure for tracking dirty pages associated with an open file, and > it already has support for exceptional (non struct page*) entries. We > build upon these features to add exceptional entries to the radix tree for > DAX dirty PMD or PTE pages at fault time. > > Signed-off-by: Ross Zwisler [..] > +static void dax_writeback_one(struct address_space *mapping, pgoff_t index, > + void *entry) > +{ > + struct radix_tree_root *page_tree = &mapping->page_tree; > + int type = RADIX_DAX_TYPE(entry); > + struct radix_tree_node *node; > + void **slot; > + > + if (type != RADIX_DAX_PTE && type != RADIX_DAX_PMD) { > + WARN_ON_ONCE(1); > + return; > + } > + > + spin_lock_irq(&mapping->tree_lock); > + /* > + * Regular page slots are stabilized by the page lock even > + * without the tree itself locked. These unlocked entries > + * need verification under the tree lock. > + */ > + if (!__radix_tree_lookup(page_tree, index, &node, &slot)) > + goto unlock; > + if (*slot != entry) > + goto unlock; > + > + /* another fsync thread may have already written back this entry */ > + if (!radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE)) > + goto unlock; > + > + radix_tree_tag_clear(page_tree, index, PAGECACHE_TAG_TOWRITE); > + > + if (type == RADIX_DAX_PMD) > + wb_cache_pmem(RADIX_DAX_ADDR(entry), PMD_SIZE); > + else > + wb_cache_pmem(RADIX_DAX_ADDR(entry), PAGE_SIZE); Hi Ross, I should have realized this sooner, but what guarantees that the address returned by RADIX_DAX_ADDR(entry) is still valid at this point? I think we need to store the sector in the radix tree and then perform a new dax_map_atomic() operation to either lookup a valid address or fail the sync request. Otherwise, if the device is gone we'll crash, or write into some other random vmalloc address space. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: From: Ross Zwisler Subject: [PATCH v5 4/7] dax: add support for fsync/sync Date: Fri, 18 Dec 2015 22:22:17 -0700 Message-Id: <1450502540-8744-5-git-send-email-ross.zwisler@linux.intel.com> In-Reply-To: <1450502540-8744-1-git-send-email-ross.zwisler@linux.intel.com> References: <1450502540-8744-1-git-send-email-ross.zwisler@linux.intel.com> Sender: owner-linux-mm@kvack.org To: linux-kernel@vger.kernel.org Cc: Ross Zwisler , "H. Peter Anvin" , "J. Bruce Fields" , Theodore Ts'o , Alexander Viro , Andreas Dilger , Dave Chinner , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@lists.01.org, x86@kernel.org, xfs@oss.sgi.com, Andrew Morton , Dan Williams , Matthew Wilcox , Dave Hansen List-ID: To properly handle fsync/msync in an efficient way DAX needs to track dirty pages so it is able to flush them durably to media on demand. The tracking of dirty pages is done via the radix tree in struct address_space. This radix tree is already used by the page writeback infrastructure for tracking dirty pages associated with an open file, and it already has support for exceptional (non struct page*) entries. We build upon these features to add exceptional entries to the radix tree for DAX dirty PMD or PTE pages at fault time. Signed-off-by: Ross Zwisler --- fs/dax.c | 159 ++++++++++++++++++++++++++++++++++++++++++++++++++-- include/linux/dax.h | 2 + mm/filemap.c | 3 + 3 files changed, 158 insertions(+), 6 deletions(-) diff --git a/fs/dax.c b/fs/dax.c index 43671b6..19347cf 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -24,6 +24,7 @@ #include #include #include +#include #include #include #include @@ -289,6 +290,143 @@ static int copy_user_bh(struct page *to, struct buffer_head *bh, return 0; } +static int dax_radix_entry(struct address_space *mapping, pgoff_t index, + void __pmem *addr, bool pmd_entry, bool dirty) +{ + struct radix_tree_root *page_tree = &mapping->page_tree; + int error = 0; + void *entry; + + __mark_inode_dirty(mapping->host, I_DIRTY_PAGES); + + spin_lock_irq(&mapping->tree_lock); + entry = radix_tree_lookup(page_tree, index); + + if (entry) { + if (!pmd_entry || RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD) + goto dirty; + radix_tree_delete(&mapping->page_tree, index); + mapping->nrdax--; + } + + if (!addr) { + /* + * This can happen during correct operation if our pfn_mkwrite + * fault raced against a hole punch operation. If this + * happens the pte that was hole punched will have been + * unmapped and the radix tree entry will have been removed by + * the time we are called, but the call will still happen. We + * will return all the way up to wp_pfn_shared(), where the + * pte_same() check will fail, eventually causing page fault + * to be retried by the CPU. 
+ */ + goto unlock; + } else if (RADIX_DAX_TYPE(addr)) { + WARN_ONCE(1, "%s: invalid address %p\n", __func__, addr); + goto unlock; + } + + error = radix_tree_insert(page_tree, index, + RADIX_DAX_ENTRY(addr, pmd_entry)); + if (error) + goto unlock; + + mapping->nrdax++; + dirty: + if (dirty) + radix_tree_tag_set(page_tree, index, PAGECACHE_TAG_DIRTY); + unlock: + spin_unlock_irq(&mapping->tree_lock); + return error; +} + +static void dax_writeback_one(struct address_space *mapping, pgoff_t index, + void *entry) +{ + struct radix_tree_root *page_tree = &mapping->page_tree; + int type = RADIX_DAX_TYPE(entry); + struct radix_tree_node *node; + void **slot; + + if (type != RADIX_DAX_PTE && type != RADIX_DAX_PMD) { + WARN_ON_ONCE(1); + return; + } + + spin_lock_irq(&mapping->tree_lock); + /* + * Regular page slots are stabilized by the page lock even + * without the tree itself locked. These unlocked entries + * need verification under the tree lock. + */ + if (!__radix_tree_lookup(page_tree, index, &node, &slot)) + goto unlock; + if (*slot != entry) + goto unlock; + + /* another fsync thread may have already written back this entry */ + if (!radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE)) + goto unlock; + + radix_tree_tag_clear(page_tree, index, PAGECACHE_TAG_TOWRITE); + + if (type == RADIX_DAX_PMD) + wb_cache_pmem(RADIX_DAX_ADDR(entry), PMD_SIZE); + else + wb_cache_pmem(RADIX_DAX_ADDR(entry), PAGE_SIZE); + unlock: + spin_unlock_irq(&mapping->tree_lock); +} + +/* + * Flush the mapping to the persistent domain within the byte range of [start, + * end]. This is required by data integrity operations to ensure file data is + * on persistent storage prior to completion of the operation. + */ +void dax_writeback_mapping_range(struct address_space *mapping, loff_t start, + loff_t end) +{ + struct inode *inode = mapping->host; + pgoff_t indices[PAGEVEC_SIZE]; + pgoff_t start_page, end_page; + struct pagevec pvec; + void *entry; + int i; + + if (inode->i_blkbits != PAGE_SHIFT) { + WARN_ON_ONCE(1); + return; + } + + rcu_read_lock(); + entry = radix_tree_lookup(&mapping->page_tree, start & PMD_MASK); + rcu_read_unlock(); + + /* see if the start of our range is covered by a PMD entry */ + if (entry && RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD) + start &= PMD_MASK; + + start_page = start >> PAGE_CACHE_SHIFT; + end_page = end >> PAGE_CACHE_SHIFT; + + tag_pages_for_writeback(mapping, start_page, end_page); + + pagevec_init(&pvec, 0); + while (1) { + pvec.nr = find_get_entries_tag(mapping, start_page, + PAGECACHE_TAG_TOWRITE, PAGEVEC_SIZE, + pvec.pages, indices); + + if (pvec.nr == 0) + break; + + for (i = 0; i < pvec.nr; i++) + dax_writeback_one(mapping, indices[i], pvec.pages[i]); + } + wmb_pmem(); +} +EXPORT_SYMBOL_GPL(dax_writeback_mapping_range); + static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh, struct vm_area_struct *vma, struct vm_fault *vmf) { @@ -329,7 +467,11 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh, } error = vm_insert_mixed(vma, vaddr, pfn); + if (error) + goto out; + error = dax_radix_entry(mapping, vmf->pgoff, addr, false, + vmf->flags & FAULT_FLAG_WRITE); out: i_mmap_unlock_read(mapping); @@ -452,6 +594,7 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf, delete_from_page_cache(page); unlock_page(page); page_cache_release(page); + page = NULL; } /* @@ -539,7 +682,7 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address, pgoff_t size, pgoff; sector_t block, sector; unsigned long pfn; - 
int result = 0; + int error, result = 0; /* dax pmd mappings are broken wrt gup and fork */ if (!IS_ENABLED(CONFIG_FS_DAX_PMD)) @@ -651,6 +794,13 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address, } result |= vmf_insert_pfn_pmd(vma, address, pmd, pfn, write); + + if (write) { + error = dax_radix_entry(mapping, pgoff, kaddr, true, + true); + if (error) + goto fallback; + } } out: @@ -702,15 +852,12 @@ EXPORT_SYMBOL_GPL(dax_pmd_fault); * dax_pfn_mkwrite - handle first write to DAX page * @vma: The virtual memory area where the fault occurred * @vmf: The description of the fault - * */ int dax_pfn_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) { - struct super_block *sb = file_inode(vma->vm_file)->i_sb; + struct file *file = vma->vm_file; - sb_start_pagefault(sb); - file_update_time(vma->vm_file); - sb_end_pagefault(sb); + dax_radix_entry(file->f_mapping, vmf->pgoff, NULL, false, true); return VM_FAULT_NOPAGE; } EXPORT_SYMBOL_GPL(dax_pfn_mkwrite); diff --git a/include/linux/dax.h b/include/linux/dax.h index e9d57f68..11eb183 100644 --- a/include/linux/dax.h +++ b/include/linux/dax.h @@ -41,4 +41,6 @@ static inline bool dax_mapping(struct address_space *mapping) { return mapping->host && IS_DAX(mapping->host); } +void dax_writeback_mapping_range(struct address_space *mapping, loff_t start, + loff_t end); #endif diff --git a/mm/filemap.c b/mm/filemap.c index 99dfbc9..9577783 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -482,6 +482,9 @@ int filemap_write_and_wait_range(struct address_space *mapping, { int err = 0; + if (dax_mapping(mapping) && mapping->nrdax) + dax_writeback_mapping_range(mapping, lstart, lend); + if (mapping->nrpages) { err = __filemap_fdatawrite_range(mapping, lstart, lend, WB_SYNC_ALL); -- 2.5.0 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: From: Ross Zwisler Subject: [PATCH v5 3/7] mm: add find_get_entries_tag() Date: Fri, 18 Dec 2015 22:22:16 -0700 Message-Id: <1450502540-8744-4-git-send-email-ross.zwisler@linux.intel.com> In-Reply-To: <1450502540-8744-1-git-send-email-ross.zwisler@linux.intel.com> References: <1450502540-8744-1-git-send-email-ross.zwisler@linux.intel.com> Sender: owner-linux-mm@kvack.org To: linux-kernel@vger.kernel.org Cc: Ross Zwisler , "H. Peter Anvin" , "J. Bruce Fields" , Theodore Ts'o , Alexander Viro , Andreas Dilger , Dave Chinner , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@lists.01.org, x86@kernel.org, xfs@oss.sgi.com, Andrew Morton , Dan Williams , Matthew Wilcox , Dave Hansen List-ID: Add find_get_entries_tag() to the family of functions that include find_get_entries(), find_get_pages() and find_get_pages_tag(). This is needed for DAX dirty page handling because we need a list of both page offsets and radix tree entries ('indices' and 'entries' in this function) that are marked with the PAGECACHE_TAG_TOWRITE tag. 
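A minimal caller sketch of the intended usage (illustrative only, not taken from this patch set; walk_towrite_entries() is a made-up name): batch up tagged entries with find_get_entries_tag() and keep walking until the lookup comes back empty.

	static void walk_towrite_entries(struct address_space *mapping)
	{
		pgoff_t indices[PAGEVEC_SIZE];
		struct pagevec pvec;
		pgoff_t index = 0;
		unsigned i;

		pagevec_init(&pvec, 0);
		for (;;) {
			pvec.nr = find_get_entries_tag(mapping, index,
					PAGECACHE_TAG_TOWRITE, PAGEVEC_SIZE,
					pvec.pages, indices);
			if (pvec.nr == 0)
				break;

			for (i = 0; i < pvec.nr; i++) {
				/* indices[i] is the page offset; pvec.pages[i] is
				 * either a page pointer or an exceptional entry */
			}

			/* continue the search after the last entry returned */
			index = indices[pvec.nr - 1] + 1;
		}
	}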
Signed-off-by: Ross Zwisler Reviewed-by: Jan Kara --- include/linux/pagemap.h | 3 +++ mm/filemap.c | 68 +++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 71 insertions(+) diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h index 26eabf5..4db0425 100644 --- a/include/linux/pagemap.h +++ b/include/linux/pagemap.h @@ -361,6 +361,9 @@ unsigned find_get_pages_contig(struct address_space *mapping, pgoff_t start, unsigned int nr_pages, struct page **pages); unsigned find_get_pages_tag(struct address_space *mapping, pgoff_t *index, int tag, unsigned int nr_pages, struct page **pages); +unsigned find_get_entries_tag(struct address_space *mapping, pgoff_t start, + int tag, unsigned int nr_entries, + struct page **entries, pgoff_t *indices); struct page *grab_cache_page_write_begin(struct address_space *mapping, pgoff_t index, unsigned flags); diff --git a/mm/filemap.c b/mm/filemap.c index 167a4d9..99dfbc9 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -1498,6 +1498,74 @@ repeat: } EXPORT_SYMBOL(find_get_pages_tag); +/** + * find_get_entries_tag - find and return entries that match @tag + * @mapping: the address_space to search + * @start: the starting page cache index + * @tag: the tag index + * @nr_entries: the maximum number of entries + * @entries: where the resulting entries are placed + * @indices: the cache indices corresponding to the entries in @entries + * + * Like find_get_entries, except we only return entries which are tagged with + * @tag. + */ +unsigned find_get_entries_tag(struct address_space *mapping, pgoff_t start, + int tag, unsigned int nr_entries, + struct page **entries, pgoff_t *indices) +{ + void **slot; + unsigned int ret = 0; + struct radix_tree_iter iter; + + if (!nr_entries) + return 0; + + rcu_read_lock(); +restart: + radix_tree_for_each_tagged(slot, &mapping->page_tree, + &iter, start, tag) { + struct page *page; +repeat: + page = radix_tree_deref_slot(slot); + if (unlikely(!page)) + continue; + if (radix_tree_exception(page)) { + if (radix_tree_deref_retry(page)) { + /* + * Transient condition which can only trigger + * when entry at index 0 moves out of or back + * to root: none yet gotten, safe to restart. + */ + goto restart; + } + + /* + * A shadow entry of a recently evicted page, a swap + * entry from shmem/tmpfs or a DAX entry. Return it + * without attempting to raise page count. + */ + goto export; + } + if (!page_cache_get_speculative(page)) + goto repeat; + + /* Has the page moved? */ + if (unlikely(page != *slot)) { + page_cache_release(page); + goto repeat; + } +export: + indices[ret] = iter.index; + entries[ret] = page; + if (++ret == nr_entries) + break; + } + rcu_read_unlock(); + return ret; +} +EXPORT_SYMBOL(find_get_entries_tag); + /* * CD/DVDs are error prone. When a medium error occurs, the driver may fail * a _large_ part of the i/o request. Imagine the worst scenario: -- 2.5.0 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Date: Mon, 21 Dec 2015 18:32:24 +0100 From: Jan Kara Subject: Re: [PATCH v5 6/7] ext4: call dax_pfn_mkwrite() for DAX fsync/msync Message-ID: <20151221173223.GC7030@quack.suse.cz> References: <1450502540-8744-1-git-send-email-ross.zwisler@linux.intel.com> <1450502540-8744-7-git-send-email-ross.zwisler@linux.intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1450502540-8744-7-git-send-email-ross.zwisler@linux.intel.com> Sender: owner-linux-mm@kvack.org To: Ross Zwisler Cc: linux-kernel@vger.kernel.org, "H. Peter Anvin" , "J. Bruce Fields" , Theodore Ts'o , Alexander Viro , Andreas Dilger , Dave Chinner , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@lists.01.org, x86@kernel.org, xfs@oss.sgi.com, Andrew Morton , Dan Williams , Matthew Wilcox , Dave Hansen List-ID: On Fri 18-12-15 22:22:19, Ross Zwisler wrote: > To properly support the new DAX fsync/msync infrastructure filesystems > need to call dax_pfn_mkwrite() so that DAX can track when user pages are > dirtied. The patch looks good to me. You can add: Reviewed-by: Jan Kara Honza > > Signed-off-by: Ross Zwisler > --- > fs/ext4/file.c | 4 +++- > 1 file changed, 3 insertions(+), 1 deletion(-) > > diff --git a/fs/ext4/file.c b/fs/ext4/file.c > index 749b222..8c8965c 100644 > --- a/fs/ext4/file.c > +++ b/fs/ext4/file.c > @@ -291,8 +291,8 @@ static int ext4_dax_pfn_mkwrite(struct vm_area_struct *vma, > { > struct inode *inode = file_inode(vma->vm_file); > struct super_block *sb = inode->i_sb; > - int ret = VM_FAULT_NOPAGE; > loff_t size; > + int ret; > > sb_start_pagefault(sb); > file_update_time(vma->vm_file); > @@ -300,6 +300,8 @@ static int ext4_dax_pfn_mkwrite(struct vm_area_struct *vma, > size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT; > if (vmf->pgoff >= size) > ret = VM_FAULT_SIGBUS; > + else > + ret = dax_pfn_mkwrite(vma, vmf); > up_read(&EXT4_I(inode)->i_mmap_sem); > sb_end_pagefault(sb); > > -- > 2.5.0 > > -- Jan Kara SUSE Labs, CR -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: From: Ross Zwisler Subject: [PATCH v5 6/7] ext4: call dax_pfn_mkwrite() for DAX fsync/msync Date: Fri, 18 Dec 2015 22:22:19 -0700 Message-Id: <1450502540-8744-7-git-send-email-ross.zwisler@linux.intel.com> In-Reply-To: <1450502540-8744-1-git-send-email-ross.zwisler@linux.intel.com> References: <1450502540-8744-1-git-send-email-ross.zwisler@linux.intel.com> Sender: owner-linux-mm@kvack.org To: linux-kernel@vger.kernel.org Cc: Ross Zwisler , "H. Peter Anvin" , "J. Bruce Fields" , Theodore Ts'o , Alexander Viro , Andreas Dilger , Dave Chinner , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@lists.01.org, x86@kernel.org, xfs@oss.sgi.com, Andrew Morton , Dan Williams , Matthew Wilcox , Dave Hansen List-ID: To properly support the new DAX fsync/msync infrastructure filesystems need to call dax_pfn_mkwrite() so that DAX can track when user pages are dirtied. 
Signed-off-by: Ross Zwisler --- fs/ext4/file.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/fs/ext4/file.c b/fs/ext4/file.c index 749b222..8c8965c 100644 --- a/fs/ext4/file.c +++ b/fs/ext4/file.c @@ -291,8 +291,8 @@ static int ext4_dax_pfn_mkwrite(struct vm_area_struct *vma, { struct inode *inode = file_inode(vma->vm_file); struct super_block *sb = inode->i_sb; - int ret = VM_FAULT_NOPAGE; loff_t size; + int ret; sb_start_pagefault(sb); file_update_time(vma->vm_file); @@ -300,6 +300,8 @@ static int ext4_dax_pfn_mkwrite(struct vm_area_struct *vma, size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT; if (vmf->pgoff >= size) ret = VM_FAULT_SIGBUS; + else + ret = dax_pfn_mkwrite(vma, vmf); up_read(&EXT4_I(inode)->i_mmap_sem); sb_end_pagefault(sb); -- 2.5.0 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Date: Mon, 21 Dec 2015 10:45:34 -0700 From: Ross Zwisler Subject: Re: [PATCH v5 2/7] dax: support dirty DAX entries in radix tree Message-ID: <20151221174534.GA4978@linux.intel.com> References: <1450502540-8744-1-git-send-email-ross.zwisler@linux.intel.com> <1450502540-8744-3-git-send-email-ross.zwisler@linux.intel.com> <20151221171512.GA7030@quack.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20151221171512.GA7030@quack.suse.cz> Sender: owner-linux-mm@kvack.org To: Jan Kara Cc: Ross Zwisler , linux-kernel@vger.kernel.org, "H. Peter Anvin" , "J. Bruce Fields" , Theodore Ts'o , Alexander Viro , Andreas Dilger , Dave Chinner , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@lists.01.org, x86@kernel.org, xfs@oss.sgi.com, Andrew Morton , Dan Williams , Matthew Wilcox , Dave Hansen List-ID: On Mon, Dec 21, 2015 at 06:15:12PM +0100, Jan Kara wrote: > On Fri 18-12-15 22:22:15, Ross Zwisler wrote: > > Add support for tracking dirty DAX entries in the struct address_space > > radix tree. This tree is already used for dirty page writeback, and it > > already supports the use of exceptional (non struct page*) entries. > > > > In order to properly track dirty DAX pages we will insert new exceptional > > entries into the radix tree that represent dirty DAX PTE or PMD pages. > > These exceptional entries will also contain the writeback addresses for the > > PTE or PMD faults that we can use at fsync/msync time. > > > > There are currently two types of exceptional entries (shmem and shadow) > > that can be placed into the radix tree, and this adds a third. We rely on > > the fact that only one type of exceptional entry can be found in a given > > radix tree based on its usage. This happens for free with DAX vs shmem but > > we explicitly prevent shadow entries from being added to radix trees for > > DAX mappings. > > > > The only shadow entries that would be generated for DAX radix trees would > > be to track zero page mappings that were created for holes. These pages > > would receive minimal benefit from having shadow entries, and the choice > > to have only one type of exceptional entry in a given radix tree makes the > > logic simpler both in clear_exceptional_entry() and in the rest of DAX. > > > > Signed-off-by: Ross Zwisler > > The patch looks good to me. 
> Just one comment: When we have this exclusion
> between different types of exceptional entries, there is no real need to
> have separate counters of 'shadow' and 'dax' entries, is there? We can have
> one 'nrexceptional' counter and don't have to grow struct inode
> unnecessarily which would be really welcome since DAX isn't a mainstream
> feature. Could you please change the code? Thanks!

Sure, this sounds good. Thanks!

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org.
For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Date: Mon, 21 Dec 2015 18:15:12 +0100
From: Jan Kara
Subject: Re: [PATCH v5 2/7] dax: support dirty DAX entries in radix tree
Message-ID: <20151221171512.GA7030@quack.suse.cz>
References: <1450502540-8744-1-git-send-email-ross.zwisler@linux.intel.com> <1450502540-8744-3-git-send-email-ross.zwisler@linux.intel.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <1450502540-8744-3-git-send-email-ross.zwisler@linux.intel.com>
Sender: owner-linux-mm@kvack.org
To: Ross Zwisler
Cc: linux-kernel@vger.kernel.org, "H. Peter Anvin" , "J. Bruce Fields" , Theodore Ts'o , Alexander Viro , Andreas Dilger , Dave Chinner , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@lists.01.org, x86@kernel.org, xfs@oss.sgi.com, Andrew Morton , Dan Williams , Matthew Wilcox , Dave Hansen
List-ID:

On Fri 18-12-15 22:22:15, Ross Zwisler wrote:
> Add support for tracking dirty DAX entries in the struct address_space
> radix tree. This tree is already used for dirty page writeback, and it
> already supports the use of exceptional (non struct page*) entries.
>
> In order to properly track dirty DAX pages we will insert new exceptional
> entries into the radix tree that represent dirty DAX PTE or PMD pages.
> These exceptional entries will also contain the writeback addresses for the
> PTE or PMD faults that we can use at fsync/msync time.
>
> There are currently two types of exceptional entries (shmem and shadow)
> that can be placed into the radix tree, and this adds a third. We rely on
> the fact that only one type of exceptional entry can be found in a given
> radix tree based on its usage. This happens for free with DAX vs shmem but
> we explicitly prevent shadow entries from being added to radix trees for
> DAX mappings.
>
> The only shadow entries that would be generated for DAX radix trees would
> be to track zero page mappings that were created for holes. These pages
> would receive minimal benefit from having shadow entries, and the choice
> to have only one type of exceptional entry in a given radix tree makes the
> logic simpler both in clear_exceptional_entry() and in the rest of DAX.
>
> Signed-off-by: Ross Zwisler

The patch looks good to me. Just one comment: When we have this exclusion
between different types of exceptional entries, there is no real need to
have separate counters of 'shadow' and 'dax' entries, is there? We can have
one 'nrexceptional' counter and don't have to grow struct inode
unnecessarily which would be really welcome since DAX isn't a mainstream
feature. Could you please change the code? Thanks!
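(For illustration only: roughly what the consolidation suggested above could look like against the address_space fields touched by this patch. The field name 'nrexceptional' is just the name proposed in the comment; no such patch exists in this thread.)

 	/* Protected by tree_lock together with the radix tree */
 	unsigned long		nrpages;	/* number of total pages */
-	unsigned long		nrshadows;	/* number of shadow entries */
-	unsigned long		nrdax;		/* number of DAX entries */
+	unsigned long		nrexceptional;	/* shadow or DAX entries */
 	pgoff_t			writeback_index;/* writeback starts here */

with clear_exceptional_entry(), kill_bdev() and the truncate/reclaim checks then testing and decrementing the single counter instead of two.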
Honza > --- > fs/block_dev.c | 3 ++- > fs/inode.c | 1 + > include/linux/dax.h | 5 ++++ > include/linux/fs.h | 1 + > include/linux/radix-tree.h | 9 +++++++ > mm/filemap.c | 13 +++++++--- > mm/truncate.c | 64 +++++++++++++++++++++++++++------------------- > mm/vmscan.c | 9 ++++++- > 8 files changed, 73 insertions(+), 32 deletions(-) > > diff --git a/fs/block_dev.c b/fs/block_dev.c > index c25639e..226dacc 100644 > --- a/fs/block_dev.c > +++ b/fs/block_dev.c > @@ -75,7 +75,8 @@ void kill_bdev(struct block_device *bdev) > { > struct address_space *mapping = bdev->bd_inode->i_mapping; > > - if (mapping->nrpages == 0 && mapping->nrshadows == 0) > + if (mapping->nrpages == 0 && mapping->nrshadows == 0 && > + mapping->nrdax == 0) > return; > > invalidate_bh_lrus(); > diff --git a/fs/inode.c b/fs/inode.c > index 1be5f90..79d828f 100644 > --- a/fs/inode.c > +++ b/fs/inode.c > @@ -496,6 +496,7 @@ void clear_inode(struct inode *inode) > spin_lock_irq(&inode->i_data.tree_lock); > BUG_ON(inode->i_data.nrpages); > BUG_ON(inode->i_data.nrshadows); > + BUG_ON(inode->i_data.nrdax); > spin_unlock_irq(&inode->i_data.tree_lock); > BUG_ON(!list_empty(&inode->i_data.private_list)); > BUG_ON(!(inode->i_state & I_FREEING)); > diff --git a/include/linux/dax.h b/include/linux/dax.h > index b415e52..e9d57f68 100644 > --- a/include/linux/dax.h > +++ b/include/linux/dax.h > @@ -36,4 +36,9 @@ static inline bool vma_is_dax(struct vm_area_struct *vma) > { > return vma->vm_file && IS_DAX(vma->vm_file->f_mapping->host); > } > + > +static inline bool dax_mapping(struct address_space *mapping) > +{ > + return mapping->host && IS_DAX(mapping->host); > +} > #endif > diff --git a/include/linux/fs.h b/include/linux/fs.h > index 3aa5142..b9ac534 100644 > --- a/include/linux/fs.h > +++ b/include/linux/fs.h > @@ -433,6 +433,7 @@ struct address_space { > /* Protected by tree_lock together with the radix tree */ > unsigned long nrpages; /* number of total pages */ > unsigned long nrshadows; /* number of shadow entries */ > + unsigned long nrdax; /* number of DAX entries */ > pgoff_t writeback_index;/* writeback starts here */ > const struct address_space_operations *a_ops; /* methods */ > unsigned long flags; /* error bits/gfp mask */ > diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h > index 33170db..f793c99 100644 > --- a/include/linux/radix-tree.h > +++ b/include/linux/radix-tree.h > @@ -51,6 +51,15 @@ > #define RADIX_TREE_EXCEPTIONAL_ENTRY 2 > #define RADIX_TREE_EXCEPTIONAL_SHIFT 2 > > +#define RADIX_DAX_MASK 0xf > +#define RADIX_DAX_PTE (0x4 | RADIX_TREE_EXCEPTIONAL_ENTRY) > +#define RADIX_DAX_PMD (0x8 | RADIX_TREE_EXCEPTIONAL_ENTRY) > +#define RADIX_DAX_TYPE(entry) ((__force unsigned long)entry & RADIX_DAX_MASK) > +#define RADIX_DAX_ADDR(entry) ((void __pmem *)((unsigned long)entry & \ > + ~RADIX_DAX_MASK)) > +#define RADIX_DAX_ENTRY(addr, pmd) ((void *)((__force unsigned long)addr | \ > + (pmd ? 
RADIX_DAX_PMD : RADIX_DAX_PTE))) > + > static inline int radix_tree_is_indirect_ptr(void *ptr) > { > return (int)((unsigned long)ptr & RADIX_TREE_INDIRECT_PTR); > diff --git a/mm/filemap.c b/mm/filemap.c > index 1bb0076..167a4d9 100644 > --- a/mm/filemap.c > +++ b/mm/filemap.c > @@ -11,6 +11,7 @@ > */ > #include > #include > +#include > #include > #include > #include > @@ -579,6 +580,12 @@ static int page_cache_tree_insert(struct address_space *mapping, > p = radix_tree_deref_slot_protected(slot, &mapping->tree_lock); > if (!radix_tree_exceptional_entry(p)) > return -EEXIST; > + > + if (dax_mapping(mapping)) { > + WARN_ON(1); > + return -EINVAL; > + } > + > if (shadowp) > *shadowp = p; > mapping->nrshadows--; > @@ -1242,9 +1249,9 @@ repeat: > if (radix_tree_deref_retry(page)) > goto restart; > /* > - * A shadow entry of a recently evicted page, > - * or a swap entry from shmem/tmpfs. Return > - * it without attempting to raise page count. > + * A shadow entry of a recently evicted page, a swap > + * entry from shmem/tmpfs or a DAX entry. Return it > + * without attempting to raise page count. > */ > goto export; > } > diff --git a/mm/truncate.c b/mm/truncate.c > index 76e35ad..1dc9f29 100644 > --- a/mm/truncate.c > +++ b/mm/truncate.c > @@ -9,6 +9,7 @@ > > #include > #include > +#include > #include > #include > #include > @@ -34,31 +35,39 @@ static void clear_exceptional_entry(struct address_space *mapping, > return; > > spin_lock_irq(&mapping->tree_lock); > - /* > - * Regular page slots are stabilized by the page lock even > - * without the tree itself locked. These unlocked entries > - * need verification under the tree lock. > - */ > - if (!__radix_tree_lookup(&mapping->page_tree, index, &node, &slot)) > - goto unlock; > - if (*slot != entry) > - goto unlock; > - radix_tree_replace_slot(slot, NULL); > - mapping->nrshadows--; > - if (!node) > - goto unlock; > - workingset_node_shadows_dec(node); > - /* > - * Don't track node without shadow entries. > - * > - * Avoid acquiring the list_lru lock if already untracked. > - * The list_empty() test is safe as node->private_list is > - * protected by mapping->tree_lock. > - */ > - if (!workingset_node_shadows(node) && > - !list_empty(&node->private_list)) > - list_lru_del(&workingset_shadow_nodes, &node->private_list); > - __radix_tree_delete_node(&mapping->page_tree, node); > + > + if (dax_mapping(mapping)) { > + if (radix_tree_delete_item(&mapping->page_tree, index, entry)) > + mapping->nrdax--; > + } else { > + /* > + * Regular page slots are stabilized by the page lock even > + * without the tree itself locked. These unlocked entries > + * need verification under the tree lock. > + */ > + if (!__radix_tree_lookup(&mapping->page_tree, index, &node, > + &slot)) > + goto unlock; > + if (*slot != entry) > + goto unlock; > + radix_tree_replace_slot(slot, NULL); > + mapping->nrshadows--; > + if (!node) > + goto unlock; > + workingset_node_shadows_dec(node); > + /* > + * Don't track node without shadow entries. > + * > + * Avoid acquiring the list_lru lock if already untracked. > + * The list_empty() test is safe as node->private_list is > + * protected by mapping->tree_lock. 
> + */ > + if (!workingset_node_shadows(node) && > + !list_empty(&node->private_list)) > + list_lru_del(&workingset_shadow_nodes, > + &node->private_list); > + __radix_tree_delete_node(&mapping->page_tree, node); > + } > unlock: > spin_unlock_irq(&mapping->tree_lock); > } > @@ -228,7 +237,8 @@ void truncate_inode_pages_range(struct address_space *mapping, > int i; > > cleancache_invalidate_inode(mapping); > - if (mapping->nrpages == 0 && mapping->nrshadows == 0) > + if (mapping->nrpages == 0 && mapping->nrshadows == 0 && > + mapping->nrdax == 0) > return; > > /* Offsets within partial pages */ > @@ -423,7 +433,7 @@ void truncate_inode_pages_final(struct address_space *mapping) > smp_rmb(); > nrshadows = mapping->nrshadows; > > - if (nrpages || nrshadows) { > + if (nrpages || nrshadows || mapping->nrdax) { > /* > * As truncation uses a lockless tree lookup, cycle > * the tree lock to make sure any ongoing tree > diff --git a/mm/vmscan.c b/mm/vmscan.c > index 2aec424..8071956 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -46,6 +46,7 @@ > #include > #include > #include > +#include > > #include > #include > @@ -671,9 +672,15 @@ static int __remove_mapping(struct address_space *mapping, struct page *page, > * inode reclaim needs to empty out the radix tree or > * the nodes are lost. Don't plant shadows behind its > * back. > + * > + * We also don't store shadows for DAX mappings because the > + * only page cache pages found in these are zero pages > + * covering holes, and because we don't want to mix DAX > + * exceptional entries and shadow exceptional entries in the > + * same page_tree. > */ > if (reclaimed && page_is_file_cache(page) && > - !mapping_exiting(mapping)) > + !mapping_exiting(mapping) && !dax_mapping(mapping)) > shadow = workingset_eviction(mapping, page); > __delete_from_page_cache(page, shadow, memcg); > spin_unlock_irqrestore(&mapping->tree_lock, flags); > -- > 2.5.0 > > -- Jan Kara SUSE Labs, CR -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: From: Ross Zwisler Subject: [PATCH v5 2/7] dax: support dirty DAX entries in radix tree Date: Fri, 18 Dec 2015 22:22:15 -0700 Message-Id: <1450502540-8744-3-git-send-email-ross.zwisler@linux.intel.com> In-Reply-To: <1450502540-8744-1-git-send-email-ross.zwisler@linux.intel.com> References: <1450502540-8744-1-git-send-email-ross.zwisler@linux.intel.com> Sender: owner-linux-mm@kvack.org To: linux-kernel@vger.kernel.org Cc: Ross Zwisler , "H. Peter Anvin" , "J. Bruce Fields" , Theodore Ts'o , Alexander Viro , Andreas Dilger , Dave Chinner , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@lists.01.org, x86@kernel.org, xfs@oss.sgi.com, Andrew Morton , Dan Williams , Matthew Wilcox , Dave Hansen List-ID: Add support for tracking dirty DAX entries in the struct address_space radix tree. This tree is already used for dirty page writeback, and it already supports the use of exceptional (non struct page*) entries. In order to properly track dirty DAX pages we will insert new exceptional entries into the radix tree that represent dirty DAX PTE or PMD pages. These exceptional entries will also contain the writeback addresses for the PTE or PMD faults that we can use at fsync/msync time. 
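(Illustration only, not part of the patch: how the RADIX_DAX_* helpers defined further down encode one of these entries. It relies on the stored pmem address being page aligned, so the low bits are free to carry the entry type.)

	static void radix_dax_entry_example(void __pmem *kaddr)
	{
		void *entry = RADIX_DAX_ENTRY(kaddr, false);	/* false => PTE-sized */

		BUG_ON(!radix_tree_exceptional_entry(entry));	/* exceptional bit is set */
		BUG_ON(RADIX_DAX_TYPE(entry) != RADIX_DAX_PTE);	/* type lives in low bits */
		BUG_ON(RADIX_DAX_ADDR(entry) != kaddr);		/* address round-trips */
	}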
There are currently two types of exceptional entries (shmem and shadow) that can be placed into the radix tree, and this adds a third. We rely on the fact that only one type of exceptional entry can be found in a given radix tree based on its usage. This happens for free with DAX vs shmem but we explicitly prevent shadow entries from being added to radix trees for DAX mappings. The only shadow entries that would be generated for DAX radix trees would be to track zero page mappings that were created for holes. These pages would receive minimal benefit from having shadow entries, and the choice to have only one type of exceptional entry in a given radix tree makes the logic simpler both in clear_exceptional_entry() and in the rest of DAX. Signed-off-by: Ross Zwisler --- fs/block_dev.c | 3 ++- fs/inode.c | 1 + include/linux/dax.h | 5 ++++ include/linux/fs.h | 1 + include/linux/radix-tree.h | 9 +++++++ mm/filemap.c | 13 +++++++--- mm/truncate.c | 64 +++++++++++++++++++++++++++------------------- mm/vmscan.c | 9 ++++++- 8 files changed, 73 insertions(+), 32 deletions(-) diff --git a/fs/block_dev.c b/fs/block_dev.c index c25639e..226dacc 100644 --- a/fs/block_dev.c +++ b/fs/block_dev.c @@ -75,7 +75,8 @@ void kill_bdev(struct block_device *bdev) { struct address_space *mapping = bdev->bd_inode->i_mapping; - if (mapping->nrpages == 0 && mapping->nrshadows == 0) + if (mapping->nrpages == 0 && mapping->nrshadows == 0 && + mapping->nrdax == 0) return; invalidate_bh_lrus(); diff --git a/fs/inode.c b/fs/inode.c index 1be5f90..79d828f 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -496,6 +496,7 @@ void clear_inode(struct inode *inode) spin_lock_irq(&inode->i_data.tree_lock); BUG_ON(inode->i_data.nrpages); BUG_ON(inode->i_data.nrshadows); + BUG_ON(inode->i_data.nrdax); spin_unlock_irq(&inode->i_data.tree_lock); BUG_ON(!list_empty(&inode->i_data.private_list)); BUG_ON(!(inode->i_state & I_FREEING)); diff --git a/include/linux/dax.h b/include/linux/dax.h index b415e52..e9d57f68 100644 --- a/include/linux/dax.h +++ b/include/linux/dax.h @@ -36,4 +36,9 @@ static inline bool vma_is_dax(struct vm_area_struct *vma) { return vma->vm_file && IS_DAX(vma->vm_file->f_mapping->host); } + +static inline bool dax_mapping(struct address_space *mapping) +{ + return mapping->host && IS_DAX(mapping->host); +} #endif diff --git a/include/linux/fs.h b/include/linux/fs.h index 3aa5142..b9ac534 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -433,6 +433,7 @@ struct address_space { /* Protected by tree_lock together with the radix tree */ unsigned long nrpages; /* number of total pages */ unsigned long nrshadows; /* number of shadow entries */ + unsigned long nrdax; /* number of DAX entries */ pgoff_t writeback_index;/* writeback starts here */ const struct address_space_operations *a_ops; /* methods */ unsigned long flags; /* error bits/gfp mask */ diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h index 33170db..f793c99 100644 --- a/include/linux/radix-tree.h +++ b/include/linux/radix-tree.h @@ -51,6 +51,15 @@ #define RADIX_TREE_EXCEPTIONAL_ENTRY 2 #define RADIX_TREE_EXCEPTIONAL_SHIFT 2 +#define RADIX_DAX_MASK 0xf +#define RADIX_DAX_PTE (0x4 | RADIX_TREE_EXCEPTIONAL_ENTRY) +#define RADIX_DAX_PMD (0x8 | RADIX_TREE_EXCEPTIONAL_ENTRY) +#define RADIX_DAX_TYPE(entry) ((__force unsigned long)entry & RADIX_DAX_MASK) +#define RADIX_DAX_ADDR(entry) ((void __pmem *)((unsigned long)entry & \ + ~RADIX_DAX_MASK)) +#define RADIX_DAX_ENTRY(addr, pmd) ((void *)((__force unsigned long)addr | \ + (pmd ? 
RADIX_DAX_PMD : RADIX_DAX_PTE))) + static inline int radix_tree_is_indirect_ptr(void *ptr) { return (int)((unsigned long)ptr & RADIX_TREE_INDIRECT_PTR); diff --git a/mm/filemap.c b/mm/filemap.c index 1bb0076..167a4d9 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -11,6 +11,7 @@ */ #include #include +#include #include #include #include @@ -579,6 +580,12 @@ static int page_cache_tree_insert(struct address_space *mapping, p = radix_tree_deref_slot_protected(slot, &mapping->tree_lock); if (!radix_tree_exceptional_entry(p)) return -EEXIST; + + if (dax_mapping(mapping)) { + WARN_ON(1); + return -EINVAL; + } + if (shadowp) *shadowp = p; mapping->nrshadows--; @@ -1242,9 +1249,9 @@ repeat: if (radix_tree_deref_retry(page)) goto restart; /* - * A shadow entry of a recently evicted page, - * or a swap entry from shmem/tmpfs. Return - * it without attempting to raise page count. + * A shadow entry of a recently evicted page, a swap + * entry from shmem/tmpfs or a DAX entry. Return it + * without attempting to raise page count. */ goto export; } diff --git a/mm/truncate.c b/mm/truncate.c index 76e35ad..1dc9f29 100644 --- a/mm/truncate.c +++ b/mm/truncate.c @@ -9,6 +9,7 @@ #include #include +#include #include #include #include @@ -34,31 +35,39 @@ static void clear_exceptional_entry(struct address_space *mapping, return; spin_lock_irq(&mapping->tree_lock); - /* - * Regular page slots are stabilized by the page lock even - * without the tree itself locked. These unlocked entries - * need verification under the tree lock. - */ - if (!__radix_tree_lookup(&mapping->page_tree, index, &node, &slot)) - goto unlock; - if (*slot != entry) - goto unlock; - radix_tree_replace_slot(slot, NULL); - mapping->nrshadows--; - if (!node) - goto unlock; - workingset_node_shadows_dec(node); - /* - * Don't track node without shadow entries. - * - * Avoid acquiring the list_lru lock if already untracked. - * The list_empty() test is safe as node->private_list is - * protected by mapping->tree_lock. - */ - if (!workingset_node_shadows(node) && - !list_empty(&node->private_list)) - list_lru_del(&workingset_shadow_nodes, &node->private_list); - __radix_tree_delete_node(&mapping->page_tree, node); + + if (dax_mapping(mapping)) { + if (radix_tree_delete_item(&mapping->page_tree, index, entry)) + mapping->nrdax--; + } else { + /* + * Regular page slots are stabilized by the page lock even + * without the tree itself locked. These unlocked entries + * need verification under the tree lock. + */ + if (!__radix_tree_lookup(&mapping->page_tree, index, &node, + &slot)) + goto unlock; + if (*slot != entry) + goto unlock; + radix_tree_replace_slot(slot, NULL); + mapping->nrshadows--; + if (!node) + goto unlock; + workingset_node_shadows_dec(node); + /* + * Don't track node without shadow entries. + * + * Avoid acquiring the list_lru lock if already untracked. + * The list_empty() test is safe as node->private_list is + * protected by mapping->tree_lock. 
+ */ + if (!workingset_node_shadows(node) && + !list_empty(&node->private_list)) + list_lru_del(&workingset_shadow_nodes, + &node->private_list); + __radix_tree_delete_node(&mapping->page_tree, node); + } unlock: spin_unlock_irq(&mapping->tree_lock); } @@ -228,7 +237,8 @@ void truncate_inode_pages_range(struct address_space *mapping, int i; cleancache_invalidate_inode(mapping); - if (mapping->nrpages == 0 && mapping->nrshadows == 0) + if (mapping->nrpages == 0 && mapping->nrshadows == 0 && + mapping->nrdax == 0) return; /* Offsets within partial pages */ @@ -423,7 +433,7 @@ void truncate_inode_pages_final(struct address_space *mapping) smp_rmb(); nrshadows = mapping->nrshadows; - if (nrpages || nrshadows) { + if (nrpages || nrshadows || mapping->nrdax) { /* * As truncation uses a lockless tree lookup, cycle * the tree lock to make sure any ongoing tree diff --git a/mm/vmscan.c b/mm/vmscan.c index 2aec424..8071956 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -46,6 +46,7 @@ #include #include #include +#include #include #include @@ -671,9 +672,15 @@ static int __remove_mapping(struct address_space *mapping, struct page *page, * inode reclaim needs to empty out the radix tree or * the nodes are lost. Don't plant shadows behind its * back. + * + * We also don't store shadows for DAX mappings because the + * only page cache pages found in these are zero pages + * covering holes, and because we don't want to mix DAX + * exceptional entries and shadow exceptional entries in the + * same page_tree. */ if (reclaimed && page_is_file_cache(page) && - !mapping_exiting(mapping)) + !mapping_exiting(mapping) && !dax_mapping(mapping)) shadow = workingset_eviction(mapping, page); __delete_from_page_cache(page, shadow, memcg); spin_unlock_irqrestore(&mapping->tree_lock, flags); -- 2.5.0 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: From: Ross Zwisler Subject: [PATCH v5 0/7] DAX fsync/msync support Date: Fri, 18 Dec 2015 22:22:13 -0700 Message-Id: <1450502540-8744-1-git-send-email-ross.zwisler@linux.intel.com> Sender: owner-linux-mm@kvack.org To: linux-kernel@vger.kernel.org Cc: Ross Zwisler , "H. Peter Anvin" , "J. Bruce Fields" , Theodore Ts'o , Alexander Viro , Andreas Dilger , Dave Chinner , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@lists.01.org, x86@kernel.org, xfs@oss.sgi.com, Andrew Morton , Dan Williams , Matthew Wilcox , Dave Hansen List-ID: Changes from v4: - Explicity prevent shadow entries from being added to radix trees for DAX mappings in patch 2. The only shadow entries that would be generated for DAX radix trees would be to track zero page mappings that were created for holes. These pages would receive minimal benefit from having shadow entries, and the choice to have only one type of exceptional entry in a given radix tree makes the logic simpler both in clear_exceptional_entry() and in the rest of DAX. (Jan) - Added Reviewed-by from Jan to patch 3. This series is built upon ext4/master. 
A working tree with this series applied can be found here: https://git.kernel.org/cgit/linux/kernel/git/zwisler/linux.git/log/?h=fsync_v5 Ross Zwisler (7): pmem: add wb_cache_pmem() to the PMEM API dax: support dirty DAX entries in radix tree mm: add find_get_entries_tag() dax: add support for fsync/sync ext2: call dax_pfn_mkwrite() for DAX fsync/msync ext4: call dax_pfn_mkwrite() for DAX fsync/msync xfs: call dax_pfn_mkwrite() for DAX fsync/msync arch/x86/include/asm/pmem.h | 11 +-- fs/block_dev.c | 3 +- fs/dax.c | 159 ++++++++++++++++++++++++++++++++++++++++++-- fs/ext2/file.c | 4 +- fs/ext4/file.c | 4 +- fs/inode.c | 1 + fs/xfs/xfs_file.c | 7 +- include/linux/dax.h | 7 ++ include/linux/fs.h | 1 + include/linux/pagemap.h | 3 + include/linux/pmem.h | 22 +++++- include/linux/radix-tree.h | 9 +++ mm/filemap.c | 84 ++++++++++++++++++++++- mm/truncate.c | 64 ++++++++++-------- mm/vmscan.c | 9 ++- 15 files changed, 339 insertions(+), 49 deletions(-) -- 2.5.0 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: From: Ross Zwisler Subject: [PATCH v5 7/7] xfs: call dax_pfn_mkwrite() for DAX fsync/msync Date: Fri, 18 Dec 2015 22:22:20 -0700 Message-Id: <1450502540-8744-8-git-send-email-ross.zwisler@linux.intel.com> In-Reply-To: <1450502540-8744-1-git-send-email-ross.zwisler@linux.intel.com> References: <1450502540-8744-1-git-send-email-ross.zwisler@linux.intel.com> Sender: owner-linux-mm@kvack.org To: linux-kernel@vger.kernel.org Cc: Ross Zwisler , "H. Peter Anvin" , "J. Bruce Fields" , Theodore Ts'o , Alexander Viro , Andreas Dilger , Dave Chinner , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@lists.01.org, x86@kernel.org, xfs@oss.sgi.com, Andrew Morton , Dan Williams , Matthew Wilcox , Dave Hansen List-ID: To properly support the new DAX fsync/msync infrastructure filesystems need to call dax_pfn_mkwrite() so that DAX can track when user pages are dirtied. Signed-off-by: Ross Zwisler --- fs/xfs/xfs_file.c | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c index f5392ab..40ffbb1 100644 --- a/fs/xfs/xfs_file.c +++ b/fs/xfs/xfs_file.c @@ -1603,9 +1603,8 @@ xfs_filemap_pmd_fault( /* * pfn_mkwrite was originally inteneded to ensure we capture time stamp * updates on write faults. In reality, it's need to serialise against - * truncate similar to page_mkwrite. Hence we open-code dax_pfn_mkwrite() - * here and cycle the XFS_MMAPLOCK_SHARED to ensure we serialise the fault - * barrier in place. + * truncate similar to page_mkwrite. Hence we cycle the XFS_MMAPLOCK_SHARED + * to ensure we serialise the fault barrier in place. */ static int xfs_filemap_pfn_mkwrite( @@ -1628,6 +1627,8 @@ xfs_filemap_pfn_mkwrite( size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT; if (vmf->pgoff >= size) ret = VM_FAULT_SIGBUS; + else if (IS_DAX(inode)) + ret = dax_pfn_mkwrite(vma, vmf); xfs_iunlock(ip, XFS_MMAPLOCK_SHARED); sb_end_pagefault(inode->i_sb); return ret; -- 2.5.0 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753088AbbLSFWb (ORCPT ); Sat, 19 Dec 2015 00:22:31 -0500 Received: from mga04.intel.com ([192.55.52.120]:62179 "EHLO mga04.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750722AbbLSFW2 (ORCPT ); Sat, 19 Dec 2015 00:22:28 -0500 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.20,448,1444719600"; d="scan'208";a="620559522" From: Ross Zwisler To: linux-kernel@vger.kernel.org Cc: Ross Zwisler , "H. Peter Anvin" , "J. Bruce Fields" , "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Dave Chinner , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@ml01.01.org, x86@kernel.org, xfs@oss.sgi.com, Andrew Morton , Dan Williams , Matthew Wilcox , Dave Hansen Subject: [PATCH v5 0/7] DAX fsync/msync support Date: Fri, 18 Dec 2015 22:22:13 -0700 Message-Id: <1450502540-8744-1-git-send-email-ross.zwisler@linux.intel.com> X-Mailer: git-send-email 2.5.0 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Changes from v4: - Explicity prevent shadow entries from being added to radix trees for DAX mappings in patch 2. The only shadow entries that would be generated for DAX radix trees would be to track zero page mappings that were created for holes. These pages would receive minimal benefit from having shadow entries, and the choice to have only one type of exceptional entry in a given radix tree makes the logic simpler both in clear_exceptional_entry() and in the rest of DAX. (Jan) - Added Reviewed-by from Jan to patch 3. This series is built upon ext4/master. A working tree with this series applied can be found here: https://git.kernel.org/cgit/linux/kernel/git/zwisler/linux.git/log/?h=fsync_v5 Ross Zwisler (7): pmem: add wb_cache_pmem() to the PMEM API dax: support dirty DAX entries in radix tree mm: add find_get_entries_tag() dax: add support for fsync/sync ext2: call dax_pfn_mkwrite() for DAX fsync/msync ext4: call dax_pfn_mkwrite() for DAX fsync/msync xfs: call dax_pfn_mkwrite() for DAX fsync/msync arch/x86/include/asm/pmem.h | 11 +-- fs/block_dev.c | 3 +- fs/dax.c | 159 ++++++++++++++++++++++++++++++++++++++++++-- fs/ext2/file.c | 4 +- fs/ext4/file.c | 4 +- fs/inode.c | 1 + fs/xfs/xfs_file.c | 7 +- include/linux/dax.h | 7 ++ include/linux/fs.h | 1 + include/linux/pagemap.h | 3 + include/linux/pmem.h | 22 +++++- include/linux/radix-tree.h | 9 +++ mm/filemap.c | 84 ++++++++++++++++++++++- mm/truncate.c | 64 ++++++++++-------- mm/vmscan.c | 9 ++- 15 files changed, 339 insertions(+), 49 deletions(-) -- 2.5.0 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753926AbbLSFXj (ORCPT ); Sat, 19 Dec 2015 00:23:39 -0500 Received: from mga04.intel.com ([192.55.52.120]:62179 "EHLO mga04.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751140AbbLSFW3 (ORCPT ); Sat, 19 Dec 2015 00:22:29 -0500 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.20,448,1444719600"; d="scan'208";a="620559526" From: Ross Zwisler To: linux-kernel@vger.kernel.org Cc: Ross Zwisler , "H. Peter Anvin" , "J. 
Bruce Fields" , "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Dave Chinner , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@ml01.01.org, x86@kernel.org, xfs@oss.sgi.com, Andrew Morton , Dan Williams , Matthew Wilcox , Dave Hansen Subject: [PATCH v5 1/7] pmem: add wb_cache_pmem() to the PMEM API Date: Fri, 18 Dec 2015 22:22:14 -0700 Message-Id: <1450502540-8744-2-git-send-email-ross.zwisler@linux.intel.com> X-Mailer: git-send-email 2.5.0 In-Reply-To: <1450502540-8744-1-git-send-email-ross.zwisler@linux.intel.com> References: <1450502540-8744-1-git-send-email-ross.zwisler@linux.intel.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org The function __arch_wb_cache_pmem() was already an internal implementation detail of the x86 PMEM API, but this functionality needs to be exported as part of the general PMEM API to handle the fsync/msync case for DAX mmaps. One thing worth noting is that we really do want this to be part of the PMEM API as opposed to a stand-alone function like clflush_cache_range() because of ordering restrictions. By having wb_cache_pmem() as part of the PMEM API we can leave it unordered, call it multiple times to write back large amounts of memory, and then order the multiple calls with a single wmb_pmem(). Signed-off-by: Ross Zwisler --- arch/x86/include/asm/pmem.h | 11 ++++++----- include/linux/pmem.h | 22 +++++++++++++++++++++- 2 files changed, 27 insertions(+), 6 deletions(-) diff --git a/arch/x86/include/asm/pmem.h b/arch/x86/include/asm/pmem.h index d8ce3ec..6c7ade0 100644 --- a/arch/x86/include/asm/pmem.h +++ b/arch/x86/include/asm/pmem.h @@ -67,18 +67,19 @@ static inline void arch_wmb_pmem(void) } /** - * __arch_wb_cache_pmem - write back a cache range with CLWB + * arch_wb_cache_pmem - write back a cache range with CLWB * @vaddr: virtual start address * @size: number of bytes to write back * * Write back a cache range using the CLWB (cache line write back) * instruction. This function requires explicit ordering with an - * arch_wmb_pmem() call. This API is internal to the x86 PMEM implementation. + * arch_wmb_pmem() call. 
*/ -static inline void __arch_wb_cache_pmem(void *vaddr, size_t size) +static inline void arch_wb_cache_pmem(void __pmem *addr, size_t size) { u16 x86_clflush_size = boot_cpu_data.x86_clflush_size; unsigned long clflush_mask = x86_clflush_size - 1; + void *vaddr = (void __force *)addr; void *vend = vaddr + size; void *p; @@ -115,7 +116,7 @@ static inline size_t arch_copy_from_iter_pmem(void __pmem *addr, size_t bytes, len = copy_from_iter_nocache(vaddr, bytes, i); if (__iter_needs_pmem_wb(i)) - __arch_wb_cache_pmem(vaddr, bytes); + arch_wb_cache_pmem(addr, bytes); return len; } @@ -138,7 +139,7 @@ static inline void arch_clear_pmem(void __pmem *addr, size_t size) else memset(vaddr, 0, size); - __arch_wb_cache_pmem(vaddr, size); + arch_wb_cache_pmem(addr, size); } static inline bool __arch_has_wmb_pmem(void) diff --git a/include/linux/pmem.h b/include/linux/pmem.h index acfea8c..7c3d11a 100644 --- a/include/linux/pmem.h +++ b/include/linux/pmem.h @@ -53,12 +53,18 @@ static inline void arch_clear_pmem(void __pmem *addr, size_t size) { BUG(); } + +static inline void arch_wb_cache_pmem(void __pmem *addr, size_t size) +{ + BUG(); +} #endif /* * Architectures that define ARCH_HAS_PMEM_API must provide * implementations for arch_memcpy_to_pmem(), arch_wmb_pmem(), - * arch_copy_from_iter_pmem(), arch_clear_pmem() and arch_has_wmb_pmem(). + * arch_copy_from_iter_pmem(), arch_clear_pmem(), arch_wb_cache_pmem() + * and arch_has_wmb_pmem(). */ static inline void memcpy_from_pmem(void *dst, void __pmem const *src, size_t size) { @@ -178,4 +184,18 @@ static inline void clear_pmem(void __pmem *addr, size_t size) else default_clear_pmem(addr, size); } + +/** + * wb_cache_pmem - write back processor cache for PMEM memory range + * @addr: virtual start address + * @size: number of bytes to write back + * + * Write back the processor cache range starting at 'addr' for 'size' bytes. + * This function requires explicit ordering with a wmb_pmem() call. + */ +static inline void wb_cache_pmem(void __pmem *addr, size_t size) +{ + if (arch_has_pmem_api()) + arch_wb_cache_pmem(addr, size); +} #endif /* __PMEM_H__ */ -- 2.5.0 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753402AbbLSFXX (ORCPT ); Sat, 19 Dec 2015 00:23:23 -0500 Received: from mga04.intel.com ([192.55.52.120]:9406 "EHLO mga04.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753134AbbLSFWb (ORCPT ); Sat, 19 Dec 2015 00:22:31 -0500 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.20,448,1444719600"; d="scan'208";a="620559537" From: Ross Zwisler To: linux-kernel@vger.kernel.org Cc: Ross Zwisler , "H. Peter Anvin" , "J. 
Bruce Fields" , "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Dave Chinner , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@ml01.01.org, x86@kernel.org, xfs@oss.sgi.com, Andrew Morton , Dan Williams , Matthew Wilcox , Dave Hansen Subject: [PATCH v5 2/7] dax: support dirty DAX entries in radix tree Date: Fri, 18 Dec 2015 22:22:15 -0700 Message-Id: <1450502540-8744-3-git-send-email-ross.zwisler@linux.intel.com> X-Mailer: git-send-email 2.5.0 In-Reply-To: <1450502540-8744-1-git-send-email-ross.zwisler@linux.intel.com> References: <1450502540-8744-1-git-send-email-ross.zwisler@linux.intel.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Add support for tracking dirty DAX entries in the struct address_space radix tree. This tree is already used for dirty page writeback, and it already supports the use of exceptional (non struct page*) entries. In order to properly track dirty DAX pages we will insert new exceptional entries into the radix tree that represent dirty DAX PTE or PMD pages. These exceptional entries will also contain the writeback addresses for the PTE or PMD faults that we can use at fsync/msync time. There are currently two types of exceptional entries (shmem and shadow) that can be placed into the radix tree, and this adds a third. We rely on the fact that only one type of exceptional entry can be found in a given radix tree based on its usage. This happens for free with DAX vs shmem but we explicitly prevent shadow entries from being added to radix trees for DAX mappings. The only shadow entries that would be generated for DAX radix trees would be to track zero page mappings that were created for holes. These pages would receive minimal benefit from having shadow entries, and the choice to have only one type of exceptional entry in a given radix tree makes the logic simpler both in clear_exceptional_entry() and in the rest of DAX. 
Signed-off-by: Ross Zwisler --- fs/block_dev.c | 3 ++- fs/inode.c | 1 + include/linux/dax.h | 5 ++++ include/linux/fs.h | 1 + include/linux/radix-tree.h | 9 +++++++ mm/filemap.c | 13 +++++++--- mm/truncate.c | 64 +++++++++++++++++++++++++++------------------- mm/vmscan.c | 9 ++++++- 8 files changed, 73 insertions(+), 32 deletions(-) diff --git a/fs/block_dev.c b/fs/block_dev.c index c25639e..226dacc 100644 --- a/fs/block_dev.c +++ b/fs/block_dev.c @@ -75,7 +75,8 @@ void kill_bdev(struct block_device *bdev) { struct address_space *mapping = bdev->bd_inode->i_mapping; - if (mapping->nrpages == 0 && mapping->nrshadows == 0) + if (mapping->nrpages == 0 && mapping->nrshadows == 0 && + mapping->nrdax == 0) return; invalidate_bh_lrus(); diff --git a/fs/inode.c b/fs/inode.c index 1be5f90..79d828f 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -496,6 +496,7 @@ void clear_inode(struct inode *inode) spin_lock_irq(&inode->i_data.tree_lock); BUG_ON(inode->i_data.nrpages); BUG_ON(inode->i_data.nrshadows); + BUG_ON(inode->i_data.nrdax); spin_unlock_irq(&inode->i_data.tree_lock); BUG_ON(!list_empty(&inode->i_data.private_list)); BUG_ON(!(inode->i_state & I_FREEING)); diff --git a/include/linux/dax.h b/include/linux/dax.h index b415e52..e9d57f68 100644 --- a/include/linux/dax.h +++ b/include/linux/dax.h @@ -36,4 +36,9 @@ static inline bool vma_is_dax(struct vm_area_struct *vma) { return vma->vm_file && IS_DAX(vma->vm_file->f_mapping->host); } + +static inline bool dax_mapping(struct address_space *mapping) +{ + return mapping->host && IS_DAX(mapping->host); +} #endif diff --git a/include/linux/fs.h b/include/linux/fs.h index 3aa5142..b9ac534 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -433,6 +433,7 @@ struct address_space { /* Protected by tree_lock together with the radix tree */ unsigned long nrpages; /* number of total pages */ unsigned long nrshadows; /* number of shadow entries */ + unsigned long nrdax; /* number of DAX entries */ pgoff_t writeback_index;/* writeback starts here */ const struct address_space_operations *a_ops; /* methods */ unsigned long flags; /* error bits/gfp mask */ diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h index 33170db..f793c99 100644 --- a/include/linux/radix-tree.h +++ b/include/linux/radix-tree.h @@ -51,6 +51,15 @@ #define RADIX_TREE_EXCEPTIONAL_ENTRY 2 #define RADIX_TREE_EXCEPTIONAL_SHIFT 2 +#define RADIX_DAX_MASK 0xf +#define RADIX_DAX_PTE (0x4 | RADIX_TREE_EXCEPTIONAL_ENTRY) +#define RADIX_DAX_PMD (0x8 | RADIX_TREE_EXCEPTIONAL_ENTRY) +#define RADIX_DAX_TYPE(entry) ((__force unsigned long)entry & RADIX_DAX_MASK) +#define RADIX_DAX_ADDR(entry) ((void __pmem *)((unsigned long)entry & \ + ~RADIX_DAX_MASK)) +#define RADIX_DAX_ENTRY(addr, pmd) ((void *)((__force unsigned long)addr | \ + (pmd ? 
RADIX_DAX_PMD : RADIX_DAX_PTE))) + static inline int radix_tree_is_indirect_ptr(void *ptr) { return (int)((unsigned long)ptr & RADIX_TREE_INDIRECT_PTR); diff --git a/mm/filemap.c b/mm/filemap.c index 1bb0076..167a4d9 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -11,6 +11,7 @@ */ #include #include +#include #include #include #include @@ -579,6 +580,12 @@ static int page_cache_tree_insert(struct address_space *mapping, p = radix_tree_deref_slot_protected(slot, &mapping->tree_lock); if (!radix_tree_exceptional_entry(p)) return -EEXIST; + + if (dax_mapping(mapping)) { + WARN_ON(1); + return -EINVAL; + } + if (shadowp) *shadowp = p; mapping->nrshadows--; @@ -1242,9 +1249,9 @@ repeat: if (radix_tree_deref_retry(page)) goto restart; /* - * A shadow entry of a recently evicted page, - * or a swap entry from shmem/tmpfs. Return - * it without attempting to raise page count. + * A shadow entry of a recently evicted page, a swap + * entry from shmem/tmpfs or a DAX entry. Return it + * without attempting to raise page count. */ goto export; } diff --git a/mm/truncate.c b/mm/truncate.c index 76e35ad..1dc9f29 100644 --- a/mm/truncate.c +++ b/mm/truncate.c @@ -9,6 +9,7 @@ #include #include +#include #include #include #include @@ -34,31 +35,39 @@ static void clear_exceptional_entry(struct address_space *mapping, return; spin_lock_irq(&mapping->tree_lock); - /* - * Regular page slots are stabilized by the page lock even - * without the tree itself locked. These unlocked entries - * need verification under the tree lock. - */ - if (!__radix_tree_lookup(&mapping->page_tree, index, &node, &slot)) - goto unlock; - if (*slot != entry) - goto unlock; - radix_tree_replace_slot(slot, NULL); - mapping->nrshadows--; - if (!node) - goto unlock; - workingset_node_shadows_dec(node); - /* - * Don't track node without shadow entries. - * - * Avoid acquiring the list_lru lock if already untracked. - * The list_empty() test is safe as node->private_list is - * protected by mapping->tree_lock. - */ - if (!workingset_node_shadows(node) && - !list_empty(&node->private_list)) - list_lru_del(&workingset_shadow_nodes, &node->private_list); - __radix_tree_delete_node(&mapping->page_tree, node); + + if (dax_mapping(mapping)) { + if (radix_tree_delete_item(&mapping->page_tree, index, entry)) + mapping->nrdax--; + } else { + /* + * Regular page slots are stabilized by the page lock even + * without the tree itself locked. These unlocked entries + * need verification under the tree lock. + */ + if (!__radix_tree_lookup(&mapping->page_tree, index, &node, + &slot)) + goto unlock; + if (*slot != entry) + goto unlock; + radix_tree_replace_slot(slot, NULL); + mapping->nrshadows--; + if (!node) + goto unlock; + workingset_node_shadows_dec(node); + /* + * Don't track node without shadow entries. + * + * Avoid acquiring the list_lru lock if already untracked. + * The list_empty() test is safe as node->private_list is + * protected by mapping->tree_lock. 
+ */ + if (!workingset_node_shadows(node) && + !list_empty(&node->private_list)) + list_lru_del(&workingset_shadow_nodes, + &node->private_list); + __radix_tree_delete_node(&mapping->page_tree, node); + } unlock: spin_unlock_irq(&mapping->tree_lock); } @@ -228,7 +237,8 @@ void truncate_inode_pages_range(struct address_space *mapping, int i; cleancache_invalidate_inode(mapping); - if (mapping->nrpages == 0 && mapping->nrshadows == 0) + if (mapping->nrpages == 0 && mapping->nrshadows == 0 && + mapping->nrdax == 0) return; /* Offsets within partial pages */ @@ -423,7 +433,7 @@ void truncate_inode_pages_final(struct address_space *mapping) smp_rmb(); nrshadows = mapping->nrshadows; - if (nrpages || nrshadows) { + if (nrpages || nrshadows || mapping->nrdax) { /* * As truncation uses a lockless tree lookup, cycle * the tree lock to make sure any ongoing tree diff --git a/mm/vmscan.c b/mm/vmscan.c index 2aec424..8071956 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -46,6 +46,7 @@ #include #include #include +#include #include #include @@ -671,9 +672,15 @@ static int __remove_mapping(struct address_space *mapping, struct page *page, * inode reclaim needs to empty out the radix tree or * the nodes are lost. Don't plant shadows behind its * back. + * + * We also don't store shadows for DAX mappings because the + * only page cache pages found in these are zero pages + * covering holes, and because we don't want to mix DAX + * exceptional entries and shadow exceptional entries in the + * same page_tree. */ if (reclaimed && page_is_file_cache(page) && - !mapping_exiting(mapping)) + !mapping_exiting(mapping) && !dax_mapping(mapping)) shadow = workingset_eviction(mapping, page); __delete_from_page_cache(page, shadow, memcg); spin_unlock_irqrestore(&mapping->tree_lock, flags); -- 2.5.0 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753505AbbLSFX1 (ORCPT ); Sat, 19 Dec 2015 00:23:27 -0500 Received: from mga04.intel.com ([192.55.52.120]:24029 "EHLO mga04.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753181AbbLSFWd (ORCPT ); Sat, 19 Dec 2015 00:22:33 -0500 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.20,448,1444719600"; d="scan'208";a="620559547" From: Ross Zwisler To: linux-kernel@vger.kernel.org Cc: Ross Zwisler , "H. Peter Anvin" , "J. Bruce Fields" , "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Dave Chinner , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@ml01.01.org, x86@kernel.org, xfs@oss.sgi.com, Andrew Morton , Dan Williams , Matthew Wilcox , Dave Hansen Subject: [PATCH v5 3/7] mm: add find_get_entries_tag() Date: Fri, 18 Dec 2015 22:22:16 -0700 Message-Id: <1450502540-8744-4-git-send-email-ross.zwisler@linux.intel.com> X-Mailer: git-send-email 2.5.0 In-Reply-To: <1450502540-8744-1-git-send-email-ross.zwisler@linux.intel.com> References: <1450502540-8744-1-git-send-email-ross.zwisler@linux.intel.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Add find_get_entries_tag() to the family of functions that include find_get_entries(), find_get_pages() and find_get_pages_tag(). This is needed for DAX dirty page handling because we need a list of both page offsets and radix tree entries ('indices' and 'entries' in this function) that are marked with the PAGECACHE_TAG_TOWRITE tag. 
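A sketch of the intended calling pattern (illustrative only; the real consumer is dax_writeback_mapping_range() in the next patch, and process_entry() below is a placeholder rather than a kernel function):

  #include <linux/pagemap.h>
  #include <linux/pagevec.h>

  static void process_entry(struct address_space *mapping, pgoff_t index,
                            struct page *entry)
  {
          /*
           * Placeholder: e.g. write back one DAX entry.  Exceptional
           * entries are returned without a page reference, so only real
           * pages, if any were found, would need page_cache_release().
           */
  }

  static void walk_towrite_entries(struct address_space *mapping, pgoff_t start)
  {
          pgoff_t indices[PAGEVEC_SIZE];
          struct pagevec pvec;
          unsigned i;

          pagevec_init(&pvec, 0);
          for (;;) {
                  pvec.nr = find_get_entries_tag(mapping, start,
                                  PAGECACHE_TAG_TOWRITE, PAGEVEC_SIZE,
                                  pvec.pages, indices);
                  if (pvec.nr == 0)
                          break;

                  for (i = 0; i < pvec.nr; i++)
                          process_entry(mapping, indices[i], pvec.pages[i]);

                  /* advance past the last index returned so the loop
                   * terminates even if the tag is never cleared */
                  start = indices[pvec.nr - 1] + 1;
          }
  }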
Signed-off-by: Ross Zwisler Reviewed-by: Jan Kara --- include/linux/pagemap.h | 3 +++ mm/filemap.c | 68 +++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 71 insertions(+) diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h index 26eabf5..4db0425 100644 --- a/include/linux/pagemap.h +++ b/include/linux/pagemap.h @@ -361,6 +361,9 @@ unsigned find_get_pages_contig(struct address_space *mapping, pgoff_t start, unsigned int nr_pages, struct page **pages); unsigned find_get_pages_tag(struct address_space *mapping, pgoff_t *index, int tag, unsigned int nr_pages, struct page **pages); +unsigned find_get_entries_tag(struct address_space *mapping, pgoff_t start, + int tag, unsigned int nr_entries, + struct page **entries, pgoff_t *indices); struct page *grab_cache_page_write_begin(struct address_space *mapping, pgoff_t index, unsigned flags); diff --git a/mm/filemap.c b/mm/filemap.c index 167a4d9..99dfbc9 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -1498,6 +1498,74 @@ repeat: } EXPORT_SYMBOL(find_get_pages_tag); +/** + * find_get_entries_tag - find and return entries that match @tag + * @mapping: the address_space to search + * @start: the starting page cache index + * @tag: the tag index + * @nr_entries: the maximum number of entries + * @entries: where the resulting entries are placed + * @indices: the cache indices corresponding to the entries in @entries + * + * Like find_get_entries, except we only return entries which are tagged with + * @tag. + */ +unsigned find_get_entries_tag(struct address_space *mapping, pgoff_t start, + int tag, unsigned int nr_entries, + struct page **entries, pgoff_t *indices) +{ + void **slot; + unsigned int ret = 0; + struct radix_tree_iter iter; + + if (!nr_entries) + return 0; + + rcu_read_lock(); +restart: + radix_tree_for_each_tagged(slot, &mapping->page_tree, + &iter, start, tag) { + struct page *page; +repeat: + page = radix_tree_deref_slot(slot); + if (unlikely(!page)) + continue; + if (radix_tree_exception(page)) { + if (radix_tree_deref_retry(page)) { + /* + * Transient condition which can only trigger + * when entry at index 0 moves out of or back + * to root: none yet gotten, safe to restart. + */ + goto restart; + } + + /* + * A shadow entry of a recently evicted page, a swap + * entry from shmem/tmpfs or a DAX entry. Return it + * without attempting to raise page count. + */ + goto export; + } + if (!page_cache_get_speculative(page)) + goto repeat; + + /* Has the page moved? */ + if (unlikely(page != *slot)) { + page_cache_release(page); + goto repeat; + } +export: + indices[ret] = iter.index; + entries[ret] = page; + if (++ret == nr_entries) + break; + } + rcu_read_unlock(); + return ret; +} +EXPORT_SYMBOL(find_get_entries_tag); + /* * CD/DVDs are error prone. When a medium error occurs, the driver may fail * a _large_ part of the i/o request. Imagine the worst scenario: -- 2.5.0 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753589AbbLSFXa (ORCPT ); Sat, 19 Dec 2015 00:23:30 -0500 Received: from mga04.intel.com ([192.55.52.120]:24029 "EHLO mga04.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753188AbbLSFWe (ORCPT ); Sat, 19 Dec 2015 00:22:34 -0500 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.20,448,1444719600"; d="scan'208";a="620559555" From: Ross Zwisler To: linux-kernel@vger.kernel.org Cc: Ross Zwisler , "H. Peter Anvin" , "J. 
Bruce Fields" , "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Dave Chinner , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@ml01.01.org, x86@kernel.org, xfs@oss.sgi.com, Andrew Morton , Dan Williams , Matthew Wilcox , Dave Hansen Subject: [PATCH v5 4/7] dax: add support for fsync/sync Date: Fri, 18 Dec 2015 22:22:17 -0700 Message-Id: <1450502540-8744-5-git-send-email-ross.zwisler@linux.intel.com> X-Mailer: git-send-email 2.5.0 In-Reply-To: <1450502540-8744-1-git-send-email-ross.zwisler@linux.intel.com> References: <1450502540-8744-1-git-send-email-ross.zwisler@linux.intel.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org To properly handle fsync/msync in an efficient way DAX needs to track dirty pages so it is able to flush them durably to media on demand. The tracking of dirty pages is done via the radix tree in struct address_space. This radix tree is already used by the page writeback infrastructure for tracking dirty pages associated with an open file, and it already has support for exceptional (non struct page*) entries. We build upon these features to add exceptional entries to the radix tree for DAX dirty PMD or PTE pages at fault time. Signed-off-by: Ross Zwisler --- fs/dax.c | 159 ++++++++++++++++++++++++++++++++++++++++++++++++++-- include/linux/dax.h | 2 + mm/filemap.c | 3 + 3 files changed, 158 insertions(+), 6 deletions(-) diff --git a/fs/dax.c b/fs/dax.c index 43671b6..19347cf 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -24,6 +24,7 @@ #include #include #include +#include #include #include #include @@ -289,6 +290,143 @@ static int copy_user_bh(struct page *to, struct buffer_head *bh, return 0; } +static int dax_radix_entry(struct address_space *mapping, pgoff_t index, + void __pmem *addr, bool pmd_entry, bool dirty) +{ + struct radix_tree_root *page_tree = &mapping->page_tree; + int error = 0; + void *entry; + + __mark_inode_dirty(mapping->host, I_DIRTY_PAGES); + + spin_lock_irq(&mapping->tree_lock); + entry = radix_tree_lookup(page_tree, index); + + if (entry) { + if (!pmd_entry || RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD) + goto dirty; + radix_tree_delete(&mapping->page_tree, index); + mapping->nrdax--; + } + + if (!addr) { + /* + * This can happen during correct operation if our pfn_mkwrite + * fault raced against a hole punch operation. If this + * happens the pte that was hole punched will have been + * unmapped and the radix tree entry will have been removed by + * the time we are called, but the call will still happen. We + * will return all the way up to wp_pfn_shared(), where the + * pte_same() check will fail, eventually causing page fault + * to be retried by the CPU. 
+ */ + goto unlock; + } else if (RADIX_DAX_TYPE(addr)) { + WARN_ONCE(1, "%s: invalid address %p\n", __func__, addr); + goto unlock; + } + + error = radix_tree_insert(page_tree, index, + RADIX_DAX_ENTRY(addr, pmd_entry)); + if (error) + goto unlock; + + mapping->nrdax++; + dirty: + if (dirty) + radix_tree_tag_set(page_tree, index, PAGECACHE_TAG_DIRTY); + unlock: + spin_unlock_irq(&mapping->tree_lock); + return error; +} + +static void dax_writeback_one(struct address_space *mapping, pgoff_t index, + void *entry) +{ + struct radix_tree_root *page_tree = &mapping->page_tree; + int type = RADIX_DAX_TYPE(entry); + struct radix_tree_node *node; + void **slot; + + if (type != RADIX_DAX_PTE && type != RADIX_DAX_PMD) { + WARN_ON_ONCE(1); + return; + } + + spin_lock_irq(&mapping->tree_lock); + /* + * Regular page slots are stabilized by the page lock even + * without the tree itself locked. These unlocked entries + * need verification under the tree lock. + */ + if (!__radix_tree_lookup(page_tree, index, &node, &slot)) + goto unlock; + if (*slot != entry) + goto unlock; + + /* another fsync thread may have already written back this entry */ + if (!radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE)) + goto unlock; + + radix_tree_tag_clear(page_tree, index, PAGECACHE_TAG_TOWRITE); + + if (type == RADIX_DAX_PMD) + wb_cache_pmem(RADIX_DAX_ADDR(entry), PMD_SIZE); + else + wb_cache_pmem(RADIX_DAX_ADDR(entry), PAGE_SIZE); + unlock: + spin_unlock_irq(&mapping->tree_lock); +} + +/* + * Flush the mapping to the persistent domain within the byte range of [start, + * end]. This is required by data integrity operations to ensure file data is + * on persistent storage prior to completion of the operation. + */ +void dax_writeback_mapping_range(struct address_space *mapping, loff_t start, + loff_t end) +{ + struct inode *inode = mapping->host; + pgoff_t indices[PAGEVEC_SIZE]; + pgoff_t start_page, end_page; + struct pagevec pvec; + void *entry; + int i; + + if (inode->i_blkbits != PAGE_SHIFT) { + WARN_ON_ONCE(1); + return; + } + + rcu_read_lock(); + entry = radix_tree_lookup(&mapping->page_tree, start & PMD_MASK); + rcu_read_unlock(); + + /* see if the start of our range is covered by a PMD entry */ + if (entry && RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD) + start &= PMD_MASK; + + start_page = start >> PAGE_CACHE_SHIFT; + end_page = end >> PAGE_CACHE_SHIFT; + + tag_pages_for_writeback(mapping, start_page, end_page); + + pagevec_init(&pvec, 0); + while (1) { + pvec.nr = find_get_entries_tag(mapping, start_page, + PAGECACHE_TAG_TOWRITE, PAGEVEC_SIZE, + pvec.pages, indices); + + if (pvec.nr == 0) + break; + + for (i = 0; i < pvec.nr; i++) + dax_writeback_one(mapping, indices[i], pvec.pages[i]); + } + wmb_pmem(); +} +EXPORT_SYMBOL_GPL(dax_writeback_mapping_range); + static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh, struct vm_area_struct *vma, struct vm_fault *vmf) { @@ -329,7 +467,11 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh, } error = vm_insert_mixed(vma, vaddr, pfn); + if (error) + goto out; + error = dax_radix_entry(mapping, vmf->pgoff, addr, false, + vmf->flags & FAULT_FLAG_WRITE); out: i_mmap_unlock_read(mapping); @@ -452,6 +594,7 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf, delete_from_page_cache(page); unlock_page(page); page_cache_release(page); + page = NULL; } /* @@ -539,7 +682,7 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address, pgoff_t size, pgoff; sector_t block, sector; unsigned long pfn; - 
int result = 0; + int error, result = 0; /* dax pmd mappings are broken wrt gup and fork */ if (!IS_ENABLED(CONFIG_FS_DAX_PMD)) @@ -651,6 +794,13 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address, } result |= vmf_insert_pfn_pmd(vma, address, pmd, pfn, write); + + if (write) { + error = dax_radix_entry(mapping, pgoff, kaddr, true, + true); + if (error) + goto fallback; + } } out: @@ -702,15 +852,12 @@ EXPORT_SYMBOL_GPL(dax_pmd_fault); * dax_pfn_mkwrite - handle first write to DAX page * @vma: The virtual memory area where the fault occurred * @vmf: The description of the fault - * */ int dax_pfn_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) { - struct super_block *sb = file_inode(vma->vm_file)->i_sb; + struct file *file = vma->vm_file; - sb_start_pagefault(sb); - file_update_time(vma->vm_file); - sb_end_pagefault(sb); + dax_radix_entry(file->f_mapping, vmf->pgoff, NULL, false, true); return VM_FAULT_NOPAGE; } EXPORT_SYMBOL_GPL(dax_pfn_mkwrite); diff --git a/include/linux/dax.h b/include/linux/dax.h index e9d57f68..11eb183 100644 --- a/include/linux/dax.h +++ b/include/linux/dax.h @@ -41,4 +41,6 @@ static inline bool dax_mapping(struct address_space *mapping) { return mapping->host && IS_DAX(mapping->host); } +void dax_writeback_mapping_range(struct address_space *mapping, loff_t start, + loff_t end); #endif diff --git a/mm/filemap.c b/mm/filemap.c index 99dfbc9..9577783 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -482,6 +482,9 @@ int filemap_write_and_wait_range(struct address_space *mapping, { int err = 0; + if (dax_mapping(mapping) && mapping->nrdax) + dax_writeback_mapping_range(mapping, lstart, lend); + if (mapping->nrpages) { err = __filemap_fdatawrite_range(mapping, lstart, lend, WB_SYNC_ALL); -- 2.5.0 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753673AbbLSFXc (ORCPT ); Sat, 19 Dec 2015 00:23:32 -0500 Received: from mga04.intel.com ([192.55.52.120]:24029 "EHLO mga04.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753222AbbLSFWg (ORCPT ); Sat, 19 Dec 2015 00:22:36 -0500 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.20,448,1444719600"; d="scan'208";a="620559561" From: Ross Zwisler To: linux-kernel@vger.kernel.org Cc: Ross Zwisler , "H. Peter Anvin" , "J. Bruce Fields" , "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Dave Chinner , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@ml01.01.org, x86@kernel.org, xfs@oss.sgi.com, Andrew Morton , Dan Williams , Matthew Wilcox , Dave Hansen Subject: [PATCH v5 5/7] ext2: call dax_pfn_mkwrite() for DAX fsync/msync Date: Fri, 18 Dec 2015 22:22:18 -0700 Message-Id: <1450502540-8744-6-git-send-email-ross.zwisler@linux.intel.com> X-Mailer: git-send-email 2.5.0 In-Reply-To: <1450502540-8744-1-git-send-email-ross.zwisler@linux.intel.com> References: <1450502540-8744-1-git-send-email-ross.zwisler@linux.intel.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org To properly support the new DAX fsync/msync infrastructure filesystems need to call dax_pfn_mkwrite() so that DAX can track when user pages are dirtied. 
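For reference, after the previous patch dax_pfn_mkwrite() reduces to marking the already-inserted radix tree entry dirty (quoted from fs/dax.c as modified by patch 4, with one added comment; not new code):

  int dax_pfn_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
  {
          struct file *file = vma->vm_file;

          /* NULL addr: the entry was inserted at fault time, so this call
           * only sets PAGECACHE_TAG_DIRTY on it */
          dax_radix_entry(file->f_mapping, vmf->pgoff, NULL, false, true);
          return VM_FAULT_NOPAGE;
  }

A filesystem that skips this call would leave mmap writes untracked in the radix tree, and a later fsync/msync would not flush them.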
Signed-off-by: Ross Zwisler --- fs/ext2/file.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/fs/ext2/file.c b/fs/ext2/file.c index 11a42c5..2c88d68 100644 --- a/fs/ext2/file.c +++ b/fs/ext2/file.c @@ -102,8 +102,8 @@ static int ext2_dax_pfn_mkwrite(struct vm_area_struct *vma, { struct inode *inode = file_inode(vma->vm_file); struct ext2_inode_info *ei = EXT2_I(inode); - int ret = VM_FAULT_NOPAGE; loff_t size; + int ret; sb_start_pagefault(inode->i_sb); file_update_time(vma->vm_file); @@ -113,6 +113,8 @@ static int ext2_dax_pfn_mkwrite(struct vm_area_struct *vma, size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT; if (vmf->pgoff >= size) ret = VM_FAULT_SIGBUS; + else + ret = dax_pfn_mkwrite(vma, vmf); up_read(&ei->dax_sem); sb_end_pagefault(inode->i_sb); -- 2.5.0 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753826AbbLSFXh (ORCPT ); Sat, 19 Dec 2015 00:23:37 -0500 Received: from mga04.intel.com ([192.55.52.120]:24029 "EHLO mga04.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753249AbbLSFWh (ORCPT ); Sat, 19 Dec 2015 00:22:37 -0500 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.20,448,1444719600"; d="scan'208";a="620559568" From: Ross Zwisler To: linux-kernel@vger.kernel.org Cc: Ross Zwisler , "H. Peter Anvin" , "J. Bruce Fields" , "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Dave Chinner , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@ml01.01.org, x86@kernel.org, xfs@oss.sgi.com, Andrew Morton , Dan Williams , Matthew Wilcox , Dave Hansen Subject: [PATCH v5 6/7] ext4: call dax_pfn_mkwrite() for DAX fsync/msync Date: Fri, 18 Dec 2015 22:22:19 -0700 Message-Id: <1450502540-8744-7-git-send-email-ross.zwisler@linux.intel.com> X-Mailer: git-send-email 2.5.0 In-Reply-To: <1450502540-8744-1-git-send-email-ross.zwisler@linux.intel.com> References: <1450502540-8744-1-git-send-email-ross.zwisler@linux.intel.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org To properly support the new DAX fsync/msync infrastructure filesystems need to call dax_pfn_mkwrite() so that DAX can track when user pages are dirtied. 
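The shape of the change is the same in ext2, ext4 and xfs. Abstracting away the per-filesystem mmap lock, each ->pfn_mkwrite handler ends up looking roughly like the sketch below (see the diffs for the real handlers):

  sb_start_pagefault(inode->i_sb);
  file_update_time(vma->vm_file);
  /* take the filesystem's mmap/dax lock shared */

  size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
  if (vmf->pgoff >= size)
          ret = VM_FAULT_SIGBUS;
  else
          ret = dax_pfn_mkwrite(vma, vmf);        /* the new call */

  /* drop the filesystem's mmap/dax lock */
  sb_end_pagefault(inode->i_sb);
  return ret;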
Signed-off-by: Ross Zwisler --- fs/ext4/file.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/fs/ext4/file.c b/fs/ext4/file.c index 749b222..8c8965c 100644 --- a/fs/ext4/file.c +++ b/fs/ext4/file.c @@ -291,8 +291,8 @@ static int ext4_dax_pfn_mkwrite(struct vm_area_struct *vma, { struct inode *inode = file_inode(vma->vm_file); struct super_block *sb = inode->i_sb; - int ret = VM_FAULT_NOPAGE; loff_t size; + int ret; sb_start_pagefault(sb); file_update_time(vma->vm_file); @@ -300,6 +300,8 @@ static int ext4_dax_pfn_mkwrite(struct vm_area_struct *vma, size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT; if (vmf->pgoff >= size) ret = VM_FAULT_SIGBUS; + else + ret = dax_pfn_mkwrite(vma, vmf); up_read(&EXT4_I(inode)->i_mmap_sem); sb_end_pagefault(sb); -- 2.5.0 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753735AbbLSFXe (ORCPT ); Sat, 19 Dec 2015 00:23:34 -0500 Received: from mga04.intel.com ([192.55.52.120]:24029 "EHLO mga04.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753254AbbLSFWj (ORCPT ); Sat, 19 Dec 2015 00:22:39 -0500 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.20,448,1444719600"; d="scan'208";a="620559574" From: Ross Zwisler To: linux-kernel@vger.kernel.org Cc: Ross Zwisler , "H. Peter Anvin" , "J. Bruce Fields" , "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Dave Chinner , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@ml01.01.org, x86@kernel.org, xfs@oss.sgi.com, Andrew Morton , Dan Williams , Matthew Wilcox , Dave Hansen Subject: [PATCH v5 7/7] xfs: call dax_pfn_mkwrite() for DAX fsync/msync Date: Fri, 18 Dec 2015 22:22:20 -0700 Message-Id: <1450502540-8744-8-git-send-email-ross.zwisler@linux.intel.com> X-Mailer: git-send-email 2.5.0 In-Reply-To: <1450502540-8744-1-git-send-email-ross.zwisler@linux.intel.com> References: <1450502540-8744-1-git-send-email-ross.zwisler@linux.intel.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org To properly support the new DAX fsync/msync infrastructure filesystems need to call dax_pfn_mkwrite() so that DAX can track when user pages are dirtied. Signed-off-by: Ross Zwisler --- fs/xfs/xfs_file.c | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c index f5392ab..40ffbb1 100644 --- a/fs/xfs/xfs_file.c +++ b/fs/xfs/xfs_file.c @@ -1603,9 +1603,8 @@ xfs_filemap_pmd_fault( /* * pfn_mkwrite was originally inteneded to ensure we capture time stamp * updates on write faults. In reality, it's need to serialise against - * truncate similar to page_mkwrite. Hence we open-code dax_pfn_mkwrite() - * here and cycle the XFS_MMAPLOCK_SHARED to ensure we serialise the fault - * barrier in place. + * truncate similar to page_mkwrite. Hence we cycle the XFS_MMAPLOCK_SHARED + * to ensure we serialise the fault barrier in place. 
*/ static int xfs_filemap_pfn_mkwrite( @@ -1628,6 +1627,8 @@ xfs_filemap_pfn_mkwrite( size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT; if (vmf->pgoff >= size) ret = VM_FAULT_SIGBUS; + else if (IS_DAX(inode)) + ret = dax_pfn_mkwrite(vma, vmf); xfs_iunlock(ip, XFS_MMAPLOCK_SHARED); sb_end_pagefault(inode->i_sb); return ret; -- 2.5.0 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S965260AbbLSSiF (ORCPT ); Sat, 19 Dec 2015 13:38:05 -0500 Received: from mail-yk0-f174.google.com ([209.85.160.174]:34968 "EHLO mail-yk0-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S965216AbbLSShr (ORCPT ); Sat, 19 Dec 2015 13:37:47 -0500 MIME-Version: 1.0 In-Reply-To: <1450502540-8744-5-git-send-email-ross.zwisler@linux.intel.com> References: <1450502540-8744-1-git-send-email-ross.zwisler@linux.intel.com> <1450502540-8744-5-git-send-email-ross.zwisler@linux.intel.com> Date: Sat, 19 Dec 2015 10:37:46 -0800 Message-ID: Subject: Re: [PATCH v5 4/7] dax: add support for fsync/sync From: Dan Williams To: Ross Zwisler Cc: "linux-kernel@vger.kernel.org" , "H. Peter Anvin" , "J. Bruce Fields" , "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Dave Chinner , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4 , linux-fsdevel , Linux MM , "linux-nvdimm@lists.01.org" , X86 ML , XFS Developers , Andrew Morton , Matthew Wilcox , Dave Hansen Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Dec 18, 2015 at 9:22 PM, Ross Zwisler wrote: > To properly handle fsync/msync in an efficient way DAX needs to track dirty > pages so it is able to flush them durably to media on demand. > > The tracking of dirty pages is done via the radix tree in struct > address_space. This radix tree is already used by the page writeback > infrastructure for tracking dirty pages associated with an open file, and > it already has support for exceptional (non struct page*) entries. We > build upon these features to add exceptional entries to the radix tree for > DAX dirty PMD or PTE pages at fault time. > > Signed-off-by: Ross Zwisler [..] > +static void dax_writeback_one(struct address_space *mapping, pgoff_t index, > + void *entry) > +{ > + struct radix_tree_root *page_tree = &mapping->page_tree; > + int type = RADIX_DAX_TYPE(entry); > + struct radix_tree_node *node; > + void **slot; > + > + if (type != RADIX_DAX_PTE && type != RADIX_DAX_PMD) { > + WARN_ON_ONCE(1); > + return; > + } > + > + spin_lock_irq(&mapping->tree_lock); > + /* > + * Regular page slots are stabilized by the page lock even > + * without the tree itself locked. These unlocked entries > + * need verification under the tree lock. > + */ > + if (!__radix_tree_lookup(page_tree, index, &node, &slot)) > + goto unlock; > + if (*slot != entry) > + goto unlock; > + > + /* another fsync thread may have already written back this entry */ > + if (!radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE)) > + goto unlock; > + > + radix_tree_tag_clear(page_tree, index, PAGECACHE_TAG_TOWRITE); > + > + if (type == RADIX_DAX_PMD) > + wb_cache_pmem(RADIX_DAX_ADDR(entry), PMD_SIZE); > + else > + wb_cache_pmem(RADIX_DAX_ADDR(entry), PAGE_SIZE); Hi Ross, I should have realized this sooner, but what guarantees that the address returned by RADIX_DAX_ADDR(entry) is still valid at this point? 
I think we need to store the sector in the radix tree and then perform a new dax_map_atomic() operation to either lookup a valid address or fail the sync request. Otherwise, if the device is gone we'll crash, or write into some other random vmalloc address space. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751832AbbLURGF (ORCPT ); Mon, 21 Dec 2015 12:06:05 -0500 Received: from mga02.intel.com ([134.134.136.20]:48068 "EHLO mga02.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751439AbbLURGB (ORCPT ); Mon, 21 Dec 2015 12:06:01 -0500 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.20,460,1444719600"; d="scan'208";a="621703778" Date: Mon, 21 Dec 2015 10:05:45 -0700 From: Ross Zwisler To: Dan Williams Cc: Ross Zwisler , "linux-kernel@vger.kernel.org" , "H. Peter Anvin" , "J. Bruce Fields" , "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Dave Chinner , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4 , linux-fsdevel , Linux MM , "linux-nvdimm@lists.01.org" , X86 ML , XFS Developers , Andrew Morton , Matthew Wilcox , Dave Hansen Subject: Re: [PATCH v5 4/7] dax: add support for fsync/sync Message-ID: <20151221170545.GA13494@linux.intel.com> Mail-Followup-To: Ross Zwisler , Dan Williams , "linux-kernel@vger.kernel.org" , "H. Peter Anvin" , "J. Bruce Fields" , Theodore Ts'o , Alexander Viro , Andreas Dilger , Dave Chinner , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4 , linux-fsdevel , Linux MM , "linux-nvdimm@lists.01.org" , X86 ML , XFS Developers , Andrew Morton , Matthew Wilcox , Dave Hansen References: <1450502540-8744-1-git-send-email-ross.zwisler@linux.intel.com> <1450502540-8744-5-git-send-email-ross.zwisler@linux.intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.23 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Sat, Dec 19, 2015 at 10:37:46AM -0800, Dan Williams wrote: > On Fri, Dec 18, 2015 at 9:22 PM, Ross Zwisler > wrote: > > To properly handle fsync/msync in an efficient way DAX needs to track dirty > > pages so it is able to flush them durably to media on demand. > > > > The tracking of dirty pages is done via the radix tree in struct > > address_space. This radix tree is already used by the page writeback > > infrastructure for tracking dirty pages associated with an open file, and > > it already has support for exceptional (non struct page*) entries. We > > build upon these features to add exceptional entries to the radix tree for > > DAX dirty PMD or PTE pages at fault time. > > > > Signed-off-by: Ross Zwisler > [..] > > +static void dax_writeback_one(struct address_space *mapping, pgoff_t index, > > + void *entry) > > +{ > > + struct radix_tree_root *page_tree = &mapping->page_tree; > > + int type = RADIX_DAX_TYPE(entry); > > + struct radix_tree_node *node; > > + void **slot; > > + > > + if (type != RADIX_DAX_PTE && type != RADIX_DAX_PMD) { > > + WARN_ON_ONCE(1); > > + return; > > + } > > + > > + spin_lock_irq(&mapping->tree_lock); > > + /* > > + * Regular page slots are stabilized by the page lock even > > + * without the tree itself locked. These unlocked entries > > + * need verification under the tree lock. 
> > + */ > > + if (!__radix_tree_lookup(page_tree, index, &node, &slot)) > > + goto unlock; > > + if (*slot != entry) > > + goto unlock; > > + > > + /* another fsync thread may have already written back this entry */ > > + if (!radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE)) > > + goto unlock; > > + > > + radix_tree_tag_clear(page_tree, index, PAGECACHE_TAG_TOWRITE); > > + > > + if (type == RADIX_DAX_PMD) > > + wb_cache_pmem(RADIX_DAX_ADDR(entry), PMD_SIZE); > > + else > > + wb_cache_pmem(RADIX_DAX_ADDR(entry), PAGE_SIZE); > > Hi Ross, I should have realized this sooner, but what guarantees that > the address returned by RADIX_DAX_ADDR(entry) is still valid at this > point? I think we need to store the sector in the radix tree and then > perform a new dax_map_atomic() operation to either lookup a valid > address or fail the sync request. Otherwise, if the device is gone > we'll crash, or write into some other random vmalloc address space. Ah, good point, thank you. v4 of this series is based on a version of DAX where we aren't properly dealing with PMEM device removal. I've got an updated version that merges with your dax_map_atomic() changes, and I'll add this change into v5 which I will send out today. Thank you for the suggestion. One clarification, with the code as it is in v4 we are only doing clflush/clflushopt/clwb instructions on the kaddr we've stored in the radix tree, so I don't think that there is actually a risk of us doing a "write into some other random vmalloc address space"? I think at worse we will end up clflushing an address that either isn't mapped or has been remapped by someone else. Or are you worried that the clflush would trigger a cache writeback to a memory address where writes have side effects, thus triggering the side effect? I definitely think it needs to be fixed, I'm just trying to make sure I understood your comment. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751476AbbLURPb (ORCPT ); Mon, 21 Dec 2015 12:15:31 -0500 Received: from mx2.suse.de ([195.135.220.15]:46076 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751208AbbLURP2 (ORCPT ); Mon, 21 Dec 2015 12:15:28 -0500 Date: Mon, 21 Dec 2015 18:15:12 +0100 From: Jan Kara To: Ross Zwisler Cc: linux-kernel@vger.kernel.org, "H. Peter Anvin" , "J. Bruce Fields" , "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Dave Chinner , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@ml01.01.org, x86@kernel.org, xfs@oss.sgi.com, Andrew Morton , Dan Williams , Matthew Wilcox , Dave Hansen Subject: Re: [PATCH v5 2/7] dax: support dirty DAX entries in radix tree Message-ID: <20151221171512.GA7030@quack.suse.cz> References: <1450502540-8744-1-git-send-email-ross.zwisler@linux.intel.com> <1450502540-8744-3-git-send-email-ross.zwisler@linux.intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1450502540-8744-3-git-send-email-ross.zwisler@linux.intel.com> User-Agent: Mutt/1.5.24 (2015-08-30) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri 18-12-15 22:22:15, Ross Zwisler wrote: > Add support for tracking dirty DAX entries in the struct address_space > radix tree. 
This tree is already used for dirty page writeback, and it > already supports the use of exceptional (non struct page*) entries. > > In order to properly track dirty DAX pages we will insert new exceptional > entries into the radix tree that represent dirty DAX PTE or PMD pages. > These exceptional entries will also contain the writeback addresses for the > PTE or PMD faults that we can use at fsync/msync time. > > There are currently two types of exceptional entries (shmem and shadow) > that can be placed into the radix tree, and this adds a third. We rely on > the fact that only one type of exceptional entry can be found in a given > radix tree based on its usage. This happens for free with DAX vs shmem but > we explicitly prevent shadow entries from being added to radix trees for > DAX mappings. > > The only shadow entries that would be generated for DAX radix trees would > be to track zero page mappings that were created for holes. These pages > would receive minimal benefit from having shadow entries, and the choice > to have only one type of exceptional entry in a given radix tree makes the > logic simpler both in clear_exceptional_entry() and in the rest of DAX. > > Signed-off-by: Ross Zwisler The patch looks good to me. Just one comment: When we have this exclusion between different types of exceptional entries, there is no real need to have separate counters of 'shadow' and 'dax' entries, is there? We can have one 'nrexceptional' counter and don't have to grow struct inode unnecessarily which would be really welcome since DAX isn't a mainstream feature. Could you please change the code? Thanks! Honza > --- > fs/block_dev.c | 3 ++- > fs/inode.c | 1 + > include/linux/dax.h | 5 ++++ > include/linux/fs.h | 1 + > include/linux/radix-tree.h | 9 +++++++ > mm/filemap.c | 13 +++++++--- > mm/truncate.c | 64 +++++++++++++++++++++++++++------------------- > mm/vmscan.c | 9 ++++++- > 8 files changed, 73 insertions(+), 32 deletions(-) > > diff --git a/fs/block_dev.c b/fs/block_dev.c > index c25639e..226dacc 100644 > --- a/fs/block_dev.c > +++ b/fs/block_dev.c > @@ -75,7 +75,8 @@ void kill_bdev(struct block_device *bdev) > { > struct address_space *mapping = bdev->bd_inode->i_mapping; > > - if (mapping->nrpages == 0 && mapping->nrshadows == 0) > + if (mapping->nrpages == 0 && mapping->nrshadows == 0 && > + mapping->nrdax == 0) > return; > > invalidate_bh_lrus(); > diff --git a/fs/inode.c b/fs/inode.c > index 1be5f90..79d828f 100644 > --- a/fs/inode.c > +++ b/fs/inode.c > @@ -496,6 +496,7 @@ void clear_inode(struct inode *inode) > spin_lock_irq(&inode->i_data.tree_lock); > BUG_ON(inode->i_data.nrpages); > BUG_ON(inode->i_data.nrshadows); > + BUG_ON(inode->i_data.nrdax); > spin_unlock_irq(&inode->i_data.tree_lock); > BUG_ON(!list_empty(&inode->i_data.private_list)); > BUG_ON(!(inode->i_state & I_FREEING)); > diff --git a/include/linux/dax.h b/include/linux/dax.h > index b415e52..e9d57f68 100644 > --- a/include/linux/dax.h > +++ b/include/linux/dax.h > @@ -36,4 +36,9 @@ static inline bool vma_is_dax(struct vm_area_struct *vma) > { > return vma->vm_file && IS_DAX(vma->vm_file->f_mapping->host); > } > + > +static inline bool dax_mapping(struct address_space *mapping) > +{ > + return mapping->host && IS_DAX(mapping->host); > +} > #endif > diff --git a/include/linux/fs.h b/include/linux/fs.h > index 3aa5142..b9ac534 100644 > --- a/include/linux/fs.h > +++ b/include/linux/fs.h > @@ -433,6 +433,7 @@ struct address_space { > /* Protected by tree_lock together with the radix tree */ > unsigned long 
nrpages; /* number of total pages */ > unsigned long nrshadows; /* number of shadow entries */ > + unsigned long nrdax; /* number of DAX entries */ > pgoff_t writeback_index;/* writeback starts here */ > const struct address_space_operations *a_ops; /* methods */ > unsigned long flags; /* error bits/gfp mask */ > diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h > index 33170db..f793c99 100644 > --- a/include/linux/radix-tree.h > +++ b/include/linux/radix-tree.h > @@ -51,6 +51,15 @@ > #define RADIX_TREE_EXCEPTIONAL_ENTRY 2 > #define RADIX_TREE_EXCEPTIONAL_SHIFT 2 > > +#define RADIX_DAX_MASK 0xf > +#define RADIX_DAX_PTE (0x4 | RADIX_TREE_EXCEPTIONAL_ENTRY) > +#define RADIX_DAX_PMD (0x8 | RADIX_TREE_EXCEPTIONAL_ENTRY) > +#define RADIX_DAX_TYPE(entry) ((__force unsigned long)entry & RADIX_DAX_MASK) > +#define RADIX_DAX_ADDR(entry) ((void __pmem *)((unsigned long)entry & \ > + ~RADIX_DAX_MASK)) > +#define RADIX_DAX_ENTRY(addr, pmd) ((void *)((__force unsigned long)addr | \ > + (pmd ? RADIX_DAX_PMD : RADIX_DAX_PTE))) > + > static inline int radix_tree_is_indirect_ptr(void *ptr) > { > return (int)((unsigned long)ptr & RADIX_TREE_INDIRECT_PTR); > diff --git a/mm/filemap.c b/mm/filemap.c > index 1bb0076..167a4d9 100644 > --- a/mm/filemap.c > +++ b/mm/filemap.c > @@ -11,6 +11,7 @@ > */ > #include > #include > +#include > #include > #include > #include > @@ -579,6 +580,12 @@ static int page_cache_tree_insert(struct address_space *mapping, > p = radix_tree_deref_slot_protected(slot, &mapping->tree_lock); > if (!radix_tree_exceptional_entry(p)) > return -EEXIST; > + > + if (dax_mapping(mapping)) { > + WARN_ON(1); > + return -EINVAL; > + } > + > if (shadowp) > *shadowp = p; > mapping->nrshadows--; > @@ -1242,9 +1249,9 @@ repeat: > if (radix_tree_deref_retry(page)) > goto restart; > /* > - * A shadow entry of a recently evicted page, > - * or a swap entry from shmem/tmpfs. Return > - * it without attempting to raise page count. > + * A shadow entry of a recently evicted page, a swap > + * entry from shmem/tmpfs or a DAX entry. Return it > + * without attempting to raise page count. > */ > goto export; > } > diff --git a/mm/truncate.c b/mm/truncate.c > index 76e35ad..1dc9f29 100644 > --- a/mm/truncate.c > +++ b/mm/truncate.c > @@ -9,6 +9,7 @@ > > #include > #include > +#include > #include > #include > #include > @@ -34,31 +35,39 @@ static void clear_exceptional_entry(struct address_space *mapping, > return; > > spin_lock_irq(&mapping->tree_lock); > - /* > - * Regular page slots are stabilized by the page lock even > - * without the tree itself locked. These unlocked entries > - * need verification under the tree lock. > - */ > - if (!__radix_tree_lookup(&mapping->page_tree, index, &node, &slot)) > - goto unlock; > - if (*slot != entry) > - goto unlock; > - radix_tree_replace_slot(slot, NULL); > - mapping->nrshadows--; > - if (!node) > - goto unlock; > - workingset_node_shadows_dec(node); > - /* > - * Don't track node without shadow entries. > - * > - * Avoid acquiring the list_lru lock if already untracked. > - * The list_empty() test is safe as node->private_list is > - * protected by mapping->tree_lock. 
> - */ > - if (!workingset_node_shadows(node) && > - !list_empty(&node->private_list)) > - list_lru_del(&workingset_shadow_nodes, &node->private_list); > - __radix_tree_delete_node(&mapping->page_tree, node); > + > + if (dax_mapping(mapping)) { > + if (radix_tree_delete_item(&mapping->page_tree, index, entry)) > + mapping->nrdax--; > + } else { > + /* > + * Regular page slots are stabilized by the page lock even > + * without the tree itself locked. These unlocked entries > + * need verification under the tree lock. > + */ > + if (!__radix_tree_lookup(&mapping->page_tree, index, &node, > + &slot)) > + goto unlock; > + if (*slot != entry) > + goto unlock; > + radix_tree_replace_slot(slot, NULL); > + mapping->nrshadows--; > + if (!node) > + goto unlock; > + workingset_node_shadows_dec(node); > + /* > + * Don't track node without shadow entries. > + * > + * Avoid acquiring the list_lru lock if already untracked. > + * The list_empty() test is safe as node->private_list is > + * protected by mapping->tree_lock. > + */ > + if (!workingset_node_shadows(node) && > + !list_empty(&node->private_list)) > + list_lru_del(&workingset_shadow_nodes, > + &node->private_list); > + __radix_tree_delete_node(&mapping->page_tree, node); > + } > unlock: > spin_unlock_irq(&mapping->tree_lock); > } > @@ -228,7 +237,8 @@ void truncate_inode_pages_range(struct address_space *mapping, > int i; > > cleancache_invalidate_inode(mapping); > - if (mapping->nrpages == 0 && mapping->nrshadows == 0) > + if (mapping->nrpages == 0 && mapping->nrshadows == 0 && > + mapping->nrdax == 0) > return; > > /* Offsets within partial pages */ > @@ -423,7 +433,7 @@ void truncate_inode_pages_final(struct address_space *mapping) > smp_rmb(); > nrshadows = mapping->nrshadows; > > - if (nrpages || nrshadows) { > + if (nrpages || nrshadows || mapping->nrdax) { > /* > * As truncation uses a lockless tree lookup, cycle > * the tree lock to make sure any ongoing tree > diff --git a/mm/vmscan.c b/mm/vmscan.c > index 2aec424..8071956 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -46,6 +46,7 @@ > #include > #include > #include > +#include > > #include > #include > @@ -671,9 +672,15 @@ static int __remove_mapping(struct address_space *mapping, struct page *page, > * inode reclaim needs to empty out the radix tree or > * the nodes are lost. Don't plant shadows behind its > * back. > + * > + * We also don't store shadows for DAX mappings because the > + * only page cache pages found in these are zero pages > + * covering holes, and because we don't want to mix DAX > + * exceptional entries and shadow exceptional entries in the > + * same page_tree. > */ > if (reclaimed && page_is_file_cache(page) && > - !mapping_exiting(mapping)) > + !mapping_exiting(mapping) && !dax_mapping(mapping)) > shadow = workingset_eviction(mapping, page); > __delete_from_page_cache(page, shadow, memcg); > spin_unlock_irqrestore(&mapping->tree_lock, flags); > -- > 2.5.0 > > -- Jan Kara SUSE Labs, CR From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751727AbbLURcK (ORCPT ); Mon, 21 Dec 2015 12:32:10 -0500 Received: from mx2.suse.de ([195.135.220.15]:47152 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751369AbbLURcF (ORCPT ); Mon, 21 Dec 2015 12:32:05 -0500 Date: Mon, 21 Dec 2015 18:32:02 +0100 From: Jan Kara To: Ross Zwisler Cc: linux-kernel@vger.kernel.org, "H. Peter Anvin" , "J. 
Bruce Fields" , "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Dave Chinner , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@ml01.01.org, x86@kernel.org, xfs@oss.sgi.com, Andrew Morton , Dan Williams , Matthew Wilcox , Dave Hansen Subject: Re: [PATCH v5 5/7] ext2: call dax_pfn_mkwrite() for DAX fsync/msync Message-ID: <20151221173202.GB7030@quack.suse.cz> References: <1450502540-8744-1-git-send-email-ross.zwisler@linux.intel.com> <1450502540-8744-6-git-send-email-ross.zwisler@linux.intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1450502540-8744-6-git-send-email-ross.zwisler@linux.intel.com> User-Agent: Mutt/1.5.24 (2015-08-30) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri 18-12-15 22:22:18, Ross Zwisler wrote: > To properly support the new DAX fsync/msync infrastructure filesystems > need to call dax_pfn_mkwrite() so that DAX can track when user pages are > dirtied. The patch looks good to me. You can add: Reviewed-by: Jan Kara Honza > > Signed-off-by: Ross Zwisler > --- > fs/ext2/file.c | 4 +++- > 1 file changed, 3 insertions(+), 1 deletion(-) > > diff --git a/fs/ext2/file.c b/fs/ext2/file.c > index 11a42c5..2c88d68 100644 > --- a/fs/ext2/file.c > +++ b/fs/ext2/file.c > @@ -102,8 +102,8 @@ static int ext2_dax_pfn_mkwrite(struct vm_area_struct *vma, > { > struct inode *inode = file_inode(vma->vm_file); > struct ext2_inode_info *ei = EXT2_I(inode); > - int ret = VM_FAULT_NOPAGE; > loff_t size; > + int ret; > > sb_start_pagefault(inode->i_sb); > file_update_time(vma->vm_file); > @@ -113,6 +113,8 @@ static int ext2_dax_pfn_mkwrite(struct vm_area_struct *vma, > size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT; > if (vmf->pgoff >= size) > ret = VM_FAULT_SIGBUS; > + else > + ret = dax_pfn_mkwrite(vma, vmf); > > up_read(&ei->dax_sem); > sb_end_pagefault(inode->i_sb); > -- > 2.5.0 > > -- Jan Kara SUSE Labs, CR From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751818AbbLURc3 (ORCPT ); Mon, 21 Dec 2015 12:32:29 -0500 Received: from mx2.suse.de ([195.135.220.15]:47179 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751690AbbLURc0 (ORCPT ); Mon, 21 Dec 2015 12:32:26 -0500 Date: Mon, 21 Dec 2015 18:32:24 +0100 From: Jan Kara To: Ross Zwisler Cc: linux-kernel@vger.kernel.org, "H. Peter Anvin" , "J. 
Bruce Fields" , "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Dave Chinner , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@ml01.01.org, x86@kernel.org, xfs@oss.sgi.com, Andrew Morton , Dan Williams , Matthew Wilcox , Dave Hansen Subject: Re: [PATCH v5 6/7] ext4: call dax_pfn_mkwrite() for DAX fsync/msync Message-ID: <20151221173223.GC7030@quack.suse.cz> References: <1450502540-8744-1-git-send-email-ross.zwisler@linux.intel.com> <1450502540-8744-7-git-send-email-ross.zwisler@linux.intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1450502540-8744-7-git-send-email-ross.zwisler@linux.intel.com> User-Agent: Mutt/1.5.24 (2015-08-30) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri 18-12-15 22:22:19, Ross Zwisler wrote: > To properly support the new DAX fsync/msync infrastructure filesystems > need to call dax_pfn_mkwrite() so that DAX can track when user pages are > dirtied. The patch looks good to me. You can add: Reviewed-by: Jan Kara Honza > > Signed-off-by: Ross Zwisler > --- > fs/ext4/file.c | 4 +++- > 1 file changed, 3 insertions(+), 1 deletion(-) > > diff --git a/fs/ext4/file.c b/fs/ext4/file.c > index 749b222..8c8965c 100644 > --- a/fs/ext4/file.c > +++ b/fs/ext4/file.c > @@ -291,8 +291,8 @@ static int ext4_dax_pfn_mkwrite(struct vm_area_struct *vma, > { > struct inode *inode = file_inode(vma->vm_file); > struct super_block *sb = inode->i_sb; > - int ret = VM_FAULT_NOPAGE; > loff_t size; > + int ret; > > sb_start_pagefault(sb); > file_update_time(vma->vm_file); > @@ -300,6 +300,8 @@ static int ext4_dax_pfn_mkwrite(struct vm_area_struct *vma, > size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT; > if (vmf->pgoff >= size) > ret = VM_FAULT_SIGBUS; > + else > + ret = dax_pfn_mkwrite(vma, vmf); > up_read(&EXT4_I(inode)->i_mmap_sem); > sb_end_pagefault(sb); > > -- > 2.5.0 > > -- Jan Kara SUSE Labs, CR From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751938AbbLURq3 (ORCPT ); Mon, 21 Dec 2015 12:46:29 -0500 Received: from mga14.intel.com ([192.55.52.115]:21450 "EHLO mga14.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751815AbbLURqX (ORCPT ); Mon, 21 Dec 2015 12:46:23 -0500 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.20,460,1444719600"; d="scan'208";a="878371481" Date: Mon, 21 Dec 2015 10:45:34 -0700 From: Ross Zwisler To: Jan Kara Cc: Ross Zwisler , linux-kernel@vger.kernel.org, "H. Peter Anvin" , "J. Bruce Fields" , "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Dave Chinner , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@ml01.01.org, x86@kernel.org, xfs@oss.sgi.com, Andrew Morton , Dan Williams , Matthew Wilcox , Dave Hansen Subject: Re: [PATCH v5 2/7] dax: support dirty DAX entries in radix tree Message-ID: <20151221174534.GA4978@linux.intel.com> Mail-Followup-To: Ross Zwisler , Jan Kara , linux-kernel@vger.kernel.org, "H. Peter Anvin" , "J. 
Bruce Fields" , Theodore Ts'o , Alexander Viro , Andreas Dilger , Dave Chinner , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@lists.01.org, x86@kernel.org, xfs@oss.sgi.com, Andrew Morton , Dan Williams , Matthew Wilcox , Dave Hansen References: <1450502540-8744-1-git-send-email-ross.zwisler@linux.intel.com> <1450502540-8744-3-git-send-email-ross.zwisler@linux.intel.com> <20151221171512.GA7030@quack.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20151221171512.GA7030@quack.suse.cz> User-Agent: Mutt/1.5.24 (2015-08-30) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Dec 21, 2015 at 06:15:12PM +0100, Jan Kara wrote: > On Fri 18-12-15 22:22:15, Ross Zwisler wrote: > > Add support for tracking dirty DAX entries in the struct address_space > > radix tree. This tree is already used for dirty page writeback, and it > > already supports the use of exceptional (non struct page*) entries. > > > > In order to properly track dirty DAX pages we will insert new exceptional > > entries into the radix tree that represent dirty DAX PTE or PMD pages. > > These exceptional entries will also contain the writeback addresses for the > > PTE or PMD faults that we can use at fsync/msync time. > > > > There are currently two types of exceptional entries (shmem and shadow) > > that can be placed into the radix tree, and this adds a third. We rely on > > the fact that only one type of exceptional entry can be found in a given > > radix tree based on its usage. This happens for free with DAX vs shmem but > > we explicitly prevent shadow entries from being added to radix trees for > > DAX mappings. > > > > The only shadow entries that would be generated for DAX radix trees would > > be to track zero page mappings that were created for holes. These pages > > would receive minimal benefit from having shadow entries, and the choice > > to have only one type of exceptional entry in a given radix tree makes the > > logic simpler both in clear_exceptional_entry() and in the rest of DAX. > > > > Signed-off-by: Ross Zwisler > > The patch looks good to me. Just one comment: When we have this exclusion > between different types of exceptional entries, there is no real need to > have separate counters of 'shadow' and 'dax' entries, is there? We can have > one 'nrexceptional' counter and don't have to grow struct inode > unnecessarily which would be really welcome since DAX isn't a mainstream > feature. Could you please change the code? Thanks! Sure, this sounds good. Thanks! 
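[A sketch of the single-counter direction discussed above. The field name "nrexceptional" and the hunks shown are assumptions for illustration, not the final patch; they are condensed from the fs.h and block_dev.c hunks quoted earlier in the thread.]

	--- a/include/linux/fs.h	(sketch only)
	+++ b/include/linux/fs.h
	@@ ... @@ struct address_space {
	 	/* Protected by tree_lock together with the radix tree */
	 	unsigned long		nrpages;	/* number of total pages */
	-	unsigned long		nrshadows;	/* number of shadow entries */
	-	unsigned long		nrdax;		/* number of DAX entries */
	+	unsigned long		nrexceptional;	/* shadow or DAX entries;
	+						 * never both in one tree */
	 	pgoff_t			writeback_index;/* writeback starts here */

	--- a/fs/block_dev.c	(sketch only)
	+++ b/fs/block_dev.c
	@@ ... @@ void kill_bdev(struct block_device *bdev)
	-	if (mapping->nrpages == 0 && mapping->nrshadows == 0 &&
	-	    mapping->nrdax == 0)
	+	if (mapping->nrpages == 0 && mapping->nrexceptional == 0)
	 		return;

Because a given radix tree only ever holds one kind of exceptional entry, a single counter is sufficient and struct address_space does not need to grow.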
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751699AbbLURtF (ORCPT ); Mon, 21 Dec 2015 12:49:05 -0500 Received: from mail-oi0-f49.google.com ([209.85.218.49]:36000 "EHLO mail-oi0-f49.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750965AbbLURtC (ORCPT ); Mon, 21 Dec 2015 12:49:02 -0500 MIME-Version: 1.0 In-Reply-To: <20151221170545.GA13494@linux.intel.com> References: <1450502540-8744-1-git-send-email-ross.zwisler@linux.intel.com> <1450502540-8744-5-git-send-email-ross.zwisler@linux.intel.com> <20151221170545.GA13494@linux.intel.com> Date: Mon, 21 Dec 2015 09:49:01 -0800 Message-ID: Subject: Re: [PATCH v5 4/7] dax: add support for fsync/sync From: Dan Williams To: Ross Zwisler , Dan Williams , "linux-kernel@vger.kernel.org" , "H. Peter Anvin" , "J. Bruce Fields" , "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Dave Chinner , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4 , linux-fsdevel , Linux MM , "linux-nvdimm@lists.01.org" , X86 ML , XFS Developers , Andrew Morton , Matthew Wilcox , Dave Hansen Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Dec 21, 2015 at 9:05 AM, Ross Zwisler wrote: > On Sat, Dec 19, 2015 at 10:37:46AM -0800, Dan Williams wrote: >> On Fri, Dec 18, 2015 at 9:22 PM, Ross Zwisler >> wrote: [..] >> Hi Ross, I should have realized this sooner, but what guarantees that >> the address returned by RADIX_DAX_ADDR(entry) is still valid at this >> point? I think we need to store the sector in the radix tree and then >> perform a new dax_map_atomic() operation to either lookup a valid >> address or fail the sync request. Otherwise, if the device is gone >> we'll crash, or write into some other random vmalloc address space. > > Ah, good point, thank you. v4 of this series is based on a version of > DAX where we aren't properly dealing with PMEM device removal. I've got an > updated version that merges with your dax_map_atomic() changes, and I'll add > this change into v5 which I will send out today. Thank you for the > suggestion. > > One clarification, with the code as it is in v4 we are only doing > clflush/clflushopt/clwb instructions on the kaddr we've stored in the radix > tree, so I don't think that there is actually a risk of us doing a "write into > some other random vmalloc address space"? I think at worse we will end up > clflushing an address that either isn't mapped or has been remapped by someone > else. Or are you worried that the clflush would trigger a cache writeback to > a memory address where writes have side effects, thus triggering the side > effect? > > I definitely think it needs to be fixed, I'm just trying to make sure I > understood your comment. True, this would be flushing an address that was dirtied while valid. Should be ok in practice for now since dax is effectively limited to x86, but we should not be leaning on x86 details in an architecture generic implementation like this. 
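[A rough sketch of the sector-based writeback Dan describes above: keep the sector, not the kaddr, in the radix tree entry and re-resolve it at fsync time. The helper name dax_flush_sector() and the dax_map_atomic()/dax_unmap_atomic() signatures below are simplified placeholders, not the in-tree interface.]

	/*
	 * Sketch only: translate the stored sector back to a pmem address
	 * under dax_map_atomic() before flushing.  If the device has been
	 * removed, the lookup fails and the sync request fails with it,
	 * instead of clflushing a stale (possibly reused) virtual address.
	 */
	static int dax_flush_sector(struct block_device *bdev, sector_t sector,
				    size_t size)
	{
		void __pmem *addr;

		/* placeholder signature for illustration */
		if (dax_map_atomic(bdev, sector, size, &addr) < 0)
			return -EIO;		/* device gone: fail the fsync */

		wb_cache_pmem(addr, size);	/* unordered; caller issues wmb_pmem() */
		dax_unmap_atomic(bdev, addr);	/* placeholder signature */
		return 0;
	}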
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751770AbbLUT1j (ORCPT ); Mon, 21 Dec 2015 14:27:39 -0500 Received: from mail-yk0-f170.google.com ([209.85.160.170]:33805 "EHLO mail-yk0-f170.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751693AbbLUT1g (ORCPT ); Mon, 21 Dec 2015 14:27:36 -0500 MIME-Version: 1.0 In-Reply-To: <20151221170545.GA13494@linux.intel.com> References: <1450502540-8744-1-git-send-email-ross.zwisler@linux.intel.com> <1450502540-8744-5-git-send-email-ross.zwisler@linux.intel.com> <20151221170545.GA13494@linux.intel.com> Date: Mon, 21 Dec 2015 11:27:35 -0800 Message-ID: Subject: Re: [PATCH v5 4/7] dax: add support for fsync/sync From: Dan Williams To: Ross Zwisler , Dan Williams , "linux-kernel@vger.kernel.org" , "H. Peter Anvin" , "J. Bruce Fields" , "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Dave Chinner , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4 , linux-fsdevel , Linux MM , "linux-nvdimm@lists.01.org" , X86 ML , XFS Developers , Andrew Morton , Matthew Wilcox , Dave Hansen Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Dec 21, 2015 at 9:05 AM, Ross Zwisler wrote: > On Sat, Dec 19, 2015 at 10:37:46AM -0800, Dan Williams wrote: >> On Fri, Dec 18, 2015 at 9:22 PM, Ross Zwisler >> wrote: >> > To properly handle fsync/msync in an efficient way DAX needs to track dirty >> > pages so it is able to flush them durably to media on demand. >> > >> > The tracking of dirty pages is done via the radix tree in struct >> > address_space. This radix tree is already used by the page writeback >> > infrastructure for tracking dirty pages associated with an open file, and >> > it already has support for exceptional (non struct page*) entries. We >> > build upon these features to add exceptional entries to the radix tree for >> > DAX dirty PMD or PTE pages at fault time. >> > >> > Signed-off-by: Ross Zwisler >> [..] >> > +static void dax_writeback_one(struct address_space *mapping, pgoff_t index, >> > + void *entry) >> > +{ >> > + struct radix_tree_root *page_tree = &mapping->page_tree; >> > + int type = RADIX_DAX_TYPE(entry); >> > + struct radix_tree_node *node; >> > + void **slot; >> > + >> > + if (type != RADIX_DAX_PTE && type != RADIX_DAX_PMD) { >> > + WARN_ON_ONCE(1); >> > + return; >> > + } >> > + >> > + spin_lock_irq(&mapping->tree_lock); >> > + /* >> > + * Regular page slots are stabilized by the page lock even >> > + * without the tree itself locked. These unlocked entries >> > + * need verification under the tree lock. >> > + */ >> > + if (!__radix_tree_lookup(page_tree, index, &node, &slot)) >> > + goto unlock; >> > + if (*slot != entry) >> > + goto unlock; >> > + >> > + /* another fsync thread may have already written back this entry */ >> > + if (!radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE)) >> > + goto unlock; >> > + >> > + radix_tree_tag_clear(page_tree, index, PAGECACHE_TAG_TOWRITE); >> > + >> > + if (type == RADIX_DAX_PMD) >> > + wb_cache_pmem(RADIX_DAX_ADDR(entry), PMD_SIZE); >> > + else >> > + wb_cache_pmem(RADIX_DAX_ADDR(entry), PAGE_SIZE); >> >> Hi Ross, I should have realized this sooner, but what guarantees that >> the address returned by RADIX_DAX_ADDR(entry) is still valid at this >> point? 
I think we need to store the sector in the radix tree and then >> perform a new dax_map_atomic() operation to either lookup a valid >> address or fail the sync request. Otherwise, if the device is gone >> we'll crash, or write into some other random vmalloc address space. > > Ah, good point, thank you. v4 of this series is based on a version of > DAX where we aren't properly dealing with PMEM device removal. I've got an > updated version that merges with your dax_map_atomic() changes, and I'll add > this change into v5 which I will send out today. Thank you for the > suggestion. To make the merge simpler you could skip the rebase for now and just call blk_queue_enter() / blk_queue_exit() around the calls to wb_cache_pmem. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932980AbbLVWoo (ORCPT ); Tue, 22 Dec 2015 17:44:44 -0500 Received: from mail.linuxfoundation.org ([140.211.169.12]:59583 "EHLO mail.linuxfoundation.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753856AbbLVWom (ORCPT ); Tue, 22 Dec 2015 17:44:42 -0500 Date: Tue, 22 Dec 2015 14:44:40 -0800 From: Andrew Morton To: Ross Zwisler Cc: linux-kernel@vger.kernel.org, "H. Peter Anvin" , "J. Bruce Fields" , "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Dave Chinner , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@ml01.01.org, x86@kernel.org, xfs@oss.sgi.com, Dan Williams , Matthew Wilcox , Dave Hansen Subject: Re: [PATCH v5 1/7] pmem: add wb_cache_pmem() to the PMEM API Message-Id: <20151222144440.1ad9e076464f4751f3de6a1f@linux-foundation.org> In-Reply-To: <1450502540-8744-2-git-send-email-ross.zwisler@linux.intel.com> References: <1450502540-8744-1-git-send-email-ross.zwisler@linux.intel.com> <1450502540-8744-2-git-send-email-ross.zwisler@linux.intel.com> X-Mailer: Sylpheed 3.4.1 (GTK+ 2.24.23; x86_64-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, 18 Dec 2015 22:22:14 -0700 Ross Zwisler wrote: > The function __arch_wb_cache_pmem() was already an internal implementation > detail of the x86 PMEM API, but this functionality needs to be exported as > part of the general PMEM API to handle the fsync/msync case for DAX mmaps. > > One thing worth noting is that we really do want this to be part of the > PMEM API as opposed to a stand-alone function like clflush_cache_range() > because of ordering restrictions. By having wb_cache_pmem() as part of the > PMEM API we can leave it unordered, call it multiple times to write back > large amounts of memory, and then order the multiple calls with a single > wmb_pmem(). > > @@ -138,7 +139,7 @@ static inline void arch_clear_pmem(void __pmem *addr, size_t size) > else > memset(vaddr, 0, size); > > - __arch_wb_cache_pmem(vaddr, size); > + arch_wb_cache_pmem(addr, size); > } > reject. 
I made this arch_wb_cache_pmem(vaddr, size); due to Dan's http://www.ozlabs.org/~akpm/mmots/broken-out/pmem-dax-clean-up-clear_pmem.patch From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S933105AbbLVWqJ (ORCPT ); Tue, 22 Dec 2015 17:46:09 -0500 Received: from mail.linuxfoundation.org ([140.211.169.12]:59630 "EHLO mail.linuxfoundation.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932711AbbLVWqH (ORCPT ); Tue, 22 Dec 2015 17:46:07 -0500 Date: Tue, 22 Dec 2015 14:46:05 -0800 From: Andrew Morton To: Ross Zwisler Cc: linux-kernel@vger.kernel.org, "H. Peter Anvin" , "J. Bruce Fields" , "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Dave Chinner , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@ml01.01.org, x86@kernel.org, xfs@oss.sgi.com, Dan Williams , Matthew Wilcox , Dave Hansen Subject: Re: [PATCH v5 2/7] dax: support dirty DAX entries in radix tree Message-Id: <20151222144605.08a84ded98a42d6125a7991e@linux-foundation.org> In-Reply-To: <1450502540-8744-3-git-send-email-ross.zwisler@linux.intel.com> References: <1450502540-8744-1-git-send-email-ross.zwisler@linux.intel.com> <1450502540-8744-3-git-send-email-ross.zwisler@linux.intel.com> X-Mailer: Sylpheed 3.4.1 (GTK+ 2.24.23; x86_64-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, 18 Dec 2015 22:22:15 -0700 Ross Zwisler wrote: > Add support for tracking dirty DAX entries in the struct address_space > radix tree. This tree is already used for dirty page writeback, and it > already supports the use of exceptional (non struct page*) entries. > > In order to properly track dirty DAX pages we will insert new exceptional > entries into the radix tree that represent dirty DAX PTE or PMD pages. > These exceptional entries will also contain the writeback addresses for the > PTE or PMD faults that we can use at fsync/msync time. > > There are currently two types of exceptional entries (shmem and shadow) > that can be placed into the radix tree, and this adds a third. We rely on > the fact that only one type of exceptional entry can be found in a given > radix tree based on its usage. This happens for free with DAX vs shmem but > we explicitly prevent shadow entries from being added to radix trees for > DAX mappings. > > The only shadow entries that would be generated for DAX radix trees would > be to track zero page mappings that were created for holes. These pages > would receive minimal benefit from having shadow entries, and the choice > to have only one type of exceptional entry in a given radix tree makes the > logic simpler both in clear_exceptional_entry() and in the rest of DAX. > > > ... > > --- a/include/linux/dax.h > +++ b/include/linux/dax.h > @@ -36,4 +36,9 @@ static inline bool vma_is_dax(struct vm_area_struct *vma) > { > return vma->vm_file && IS_DAX(vma->vm_file->f_mapping->host); > } > + > +static inline bool dax_mapping(struct address_space *mapping) > +{ > + return mapping->host && IS_DAX(mapping->host); > +} Can we make this evaluate to plain old "0" when CONFIG_FS_DAX=n? That way a bunch of code in callers will fall away as well. 
If the compiler has any brains then a good way to do this would be to make IS_DAX be "0" but one would need to check that the zeroness properly propagated out of the inline. > #endif > diff --git a/include/linux/fs.h b/include/linux/fs.h > index 3aa5142..b9ac534 100644 > --- a/include/linux/fs.h > +++ b/include/linux/fs.h > @@ -433,6 +433,7 @@ struct address_space { > /* Protected by tree_lock together with the radix tree */ > unsigned long nrpages; /* number of total pages */ > unsigned long nrshadows; /* number of shadow entries */ > + unsigned long nrdax; /* number of DAX entries */ hm, that's unfortunate - machines commonly carry tremendous numbers of address_spaces in memory and adding pork to them is rather a big deal. We can't avoid this somehow? Maybe share the space with nrshadows by some means? Find some other field which is unused for dax files? > --- a/mm/filemap.c > +++ b/mm/filemap.c > @@ -11,6 +11,7 @@ > */ > #include > #include > +#include > #include > #include > #include > @@ -579,6 +580,12 @@ static int page_cache_tree_insert(struct address_space *mapping, > p = radix_tree_deref_slot_protected(slot, &mapping->tree_lock); > if (!radix_tree_exceptional_entry(p)) > return -EEXIST; > + > + if (dax_mapping(mapping)) { > + WARN_ON(1); > + return -EINVAL; > + } this: --- a/mm/filemap.c~dax-support-dirty-dax-entries-in-radix-tree-fix +++ a/mm/filemap.c @@ -581,10 +581,8 @@ static int page_cache_tree_insert(struct if (!radix_tree_exceptional_entry(p)) return -EEXIST; - if (dax_mapping(mapping)) { - WARN_ON(1); + if (WARN_ON(dax_mapping(mapping))) return -EINVAL; - } if (shadowp) *shadowp = p; From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S933269AbbLVWqS (ORCPT ); Tue, 22 Dec 2015 17:46:18 -0500 Received: from mail.linuxfoundation.org ([140.211.169.12]:59649 "EHLO mail.linuxfoundation.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932711AbbLVWqN (ORCPT ); Tue, 22 Dec 2015 17:46:13 -0500 Date: Tue, 22 Dec 2015 14:46:11 -0800 From: Andrew Morton To: Ross Zwisler Cc: linux-kernel@vger.kernel.org, "H. Peter Anvin" , "J. Bruce Fields" , "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Dave Chinner , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@ml01.01.org, x86@kernel.org, xfs@oss.sgi.com, Dan Williams , Matthew Wilcox , Dave Hansen Subject: Re: [PATCH v5 3/7] mm: add find_get_entries_tag() Message-Id: <20151222144611.07002cfde41d035125da2fa5@linux-foundation.org> In-Reply-To: <1450502540-8744-4-git-send-email-ross.zwisler@linux.intel.com> References: <1450502540-8744-1-git-send-email-ross.zwisler@linux.intel.com> <1450502540-8744-4-git-send-email-ross.zwisler@linux.intel.com> X-Mailer: Sylpheed 3.4.1 (GTK+ 2.24.23; x86_64-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, 18 Dec 2015 22:22:16 -0700 Ross Zwisler wrote: > Add find_get_entries_tag() to the family of functions that include > find_get_entries(), find_get_pages() and find_get_pages_tag(). This is > needed for DAX dirty page handling because we need a list of both page > offsets and radix tree entries ('indices' and 'entries' in this function) > that are marked with the PAGECACHE_TAG_TOWRITE tag. > > ... 
> > +EXPORT_SYMBOL(find_get_entries_tag); This is actually a pretty crappy name because it doesn't describe what subsystem it belongs to. scheduler? scatter/gather? filesystem? But given what we've already done, I don't see an obvious fix. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S934038AbbLVWq3 (ORCPT ); Tue, 22 Dec 2015 17:46:29 -0500 Received: from mail.linuxfoundation.org ([140.211.169.12]:59664 "EHLO mail.linuxfoundation.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932711AbbLVWq1 (ORCPT ); Tue, 22 Dec 2015 17:46:27 -0500 Date: Tue, 22 Dec 2015 14:46:25 -0800 From: Andrew Morton To: Ross Zwisler Cc: linux-kernel@vger.kernel.org, "H. Peter Anvin" , "J. Bruce Fields" , "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Dave Chinner , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@ml01.01.org, x86@kernel.org, xfs@oss.sgi.com, Dan Williams , Matthew Wilcox , Dave Hansen Subject: Re: [PATCH v5 4/7] dax: add support for fsync/sync Message-Id: <20151222144625.f400e12e362cf9b00f6ffb36@linux-foundation.org> In-Reply-To: <1450502540-8744-5-git-send-email-ross.zwisler@linux.intel.com> References: <1450502540-8744-1-git-send-email-ross.zwisler@linux.intel.com> <1450502540-8744-5-git-send-email-ross.zwisler@linux.intel.com> X-Mailer: Sylpheed 3.4.1 (GTK+ 2.24.23; x86_64-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, 18 Dec 2015 22:22:17 -0700 Ross Zwisler wrote: > To properly handle fsync/msync in an efficient way DAX needs to track dirty > pages so it is able to flush them durably to media on demand. > > The tracking of dirty pages is done via the radix tree in struct > address_space. This radix tree is already used by the page writeback > infrastructure for tracking dirty pages associated with an open file, and > it already has support for exceptional (non struct page*) entries. We > build upon these features to add exceptional entries to the radix tree for > DAX dirty PMD or PTE pages at fault time. I'm getting a few rejects here against other pending changes. Things look OK to me but please do runtime test the end result as it resides in linux-next. Which will be next year. > > ... > > +static void dax_writeback_one(struct address_space *mapping, pgoff_t index, > + void *entry) > +{ > + struct radix_tree_root *page_tree = &mapping->page_tree; > + int type = RADIX_DAX_TYPE(entry); > + struct radix_tree_node *node; > + void **slot; > + > + if (type != RADIX_DAX_PTE && type != RADIX_DAX_PMD) { > + WARN_ON_ONCE(1); > + return; > + } --- a/fs/dax.c~dax-add-support-for-fsync-sync-fix +++ a/fs/dax.c @@ -383,10 +383,8 @@ static void dax_writeback_one(struct add struct radix_tree_node *node; void **slot; - if (type != RADIX_DAX_PTE && type != RADIX_DAX_PMD) { - WARN_ON_ONCE(1); + if (WARN_ON_ONCE(type != RADIX_DAX_PTE && type != RADIX_DAX_PMD)) return; - } spin_lock_irq(&mapping->tree_lock); /* > + spin_lock_irq(&mapping->tree_lock); > + /* > + * Regular page slots are stabilized by the page lock even > + * without the tree itself locked. These unlocked entries > + * need verification under the tree lock. 
> + */ > + if (!__radix_tree_lookup(page_tree, index, &node, &slot)) > + goto unlock; > + if (*slot != entry) > + goto unlock; > + > + /* another fsync thread may have already written back this entry */ > + if (!radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE)) > + goto unlock; > + > + radix_tree_tag_clear(page_tree, index, PAGECACHE_TAG_TOWRITE); > + > + if (type == RADIX_DAX_PMD) > + wb_cache_pmem(RADIX_DAX_ADDR(entry), PMD_SIZE); > + else > + wb_cache_pmem(RADIX_DAX_ADDR(entry), PAGE_SIZE); > + unlock: > + spin_unlock_irq(&mapping->tree_lock); > +} > + > +/* > + * Flush the mapping to the persistent domain within the byte range of [start, > + * end]. This is required by data integrity operations to ensure file data is > + * on persistent storage prior to completion of the operation. > + */ > +void dax_writeback_mapping_range(struct address_space *mapping, loff_t start, > + loff_t end) > +{ > + struct inode *inode = mapping->host; > + pgoff_t indices[PAGEVEC_SIZE]; > + pgoff_t start_page, end_page; > + struct pagevec pvec; > + void *entry; > + int i; > + > + if (inode->i_blkbits != PAGE_SHIFT) { > + WARN_ON_ONCE(1); > + return; > + } again > + rcu_read_lock(); > + entry = radix_tree_lookup(&mapping->page_tree, start & PMD_MASK); > + rcu_read_unlock(); What stabilizes the memory at *entry after rcu_read_unlock()? > + /* see if the start of our range is covered by a PMD entry */ > + if (entry && RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD) > + start &= PMD_MASK; > + > + start_page = start >> PAGE_CACHE_SHIFT; > + end_page = end >> PAGE_CACHE_SHIFT; > + > + tag_pages_for_writeback(mapping, start_page, end_page); > + > + pagevec_init(&pvec, 0); > + while (1) { > + pvec.nr = find_get_entries_tag(mapping, start_page, > + PAGECACHE_TAG_TOWRITE, PAGEVEC_SIZE, > + pvec.pages, indices); > + > + if (pvec.nr == 0) > + break; > + > + for (i = 0; i < pvec.nr; i++) > + dax_writeback_one(mapping, indices[i], pvec.pages[i]); > + } > + wmb_pmem(); > +} > +EXPORT_SYMBOL_GPL(dax_writeback_mapping_range); > + > > ... > From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755194AbbLVXva (ORCPT ); Tue, 22 Dec 2015 18:51:30 -0500 Received: from mga04.intel.com ([192.55.52.120]:12821 "EHLO mga04.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751859AbbLVXv2 (ORCPT ); Tue, 22 Dec 2015 18:51:28 -0500 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.20,466,1444719600"; d="scan'208";a="867762403" Date: Tue, 22 Dec 2015 16:51:23 -0700 From: Ross Zwisler To: Andrew Morton Cc: Ross Zwisler , linux-kernel@vger.kernel.org, "H. Peter Anvin" , "J. Bruce Fields" , "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Dave Chinner , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@ml01.01.org, x86@kernel.org, xfs@oss.sgi.com, Dan Williams , Matthew Wilcox , Dave Hansen Subject: Re: [PATCH v5 4/7] dax: add support for fsync/sync Message-ID: <20151222235123.GA24124@linux.intel.com> Mail-Followup-To: Ross Zwisler , Andrew Morton , linux-kernel@vger.kernel.org, "H. Peter Anvin" , "J. 
Bruce Fields" , Theodore Ts'o , Alexander Viro , Andreas Dilger , Dave Chinner , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@ml01.01.org, x86@kernel.org, xfs@oss.sgi.com, Dan Williams , Matthew Wilcox , Dave Hansen References: <1450502540-8744-1-git-send-email-ross.zwisler@linux.intel.com> <1450502540-8744-5-git-send-email-ross.zwisler@linux.intel.com> <20151222144625.f400e12e362cf9b00f6ffb36@linux-foundation.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20151222144625.f400e12e362cf9b00f6ffb36@linux-foundation.org> User-Agent: Mutt/1.5.24 (2015-08-30) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Dec 22, 2015 at 02:46:25PM -0800, Andrew Morton wrote: > On Fri, 18 Dec 2015 22:22:17 -0700 Ross Zwisler wrote: > > > To properly handle fsync/msync in an efficient way DAX needs to track dirty > > pages so it is able to flush them durably to media on demand. > > > > The tracking of dirty pages is done via the radix tree in struct > > address_space. This radix tree is already used by the page writeback > > infrastructure for tracking dirty pages associated with an open file, and > > it already has support for exceptional (non struct page*) entries. We > > build upon these features to add exceptional entries to the radix tree for > > DAX dirty PMD or PTE pages at fault time. > > I'm getting a few rejects here against other pending changes. Things > look OK to me but please do runtime test the end result as it resides > in linux-next. Which will be next year. Sounds good. I'm hoping to soon send out an updated version of this series which merges with Dan's changes to dax.c. Thank you for pulling these into -mm. > --- a/fs/dax.c~dax-add-support-for-fsync-sync-fix > +++ a/fs/dax.c > @@ -383,10 +383,8 @@ static void dax_writeback_one(struct add > struct radix_tree_node *node; > void **slot; > > - if (type != RADIX_DAX_PTE && type != RADIX_DAX_PMD) { > - WARN_ON_ONCE(1); > + if (WARN_ON_ONCE(type != RADIX_DAX_PTE && type != RADIX_DAX_PMD)) > return; > - } This is much cleaner, thanks. I'll make this change throughout my set. > > +/* > > + * Flush the mapping to the persistent domain within the byte range of [start, > > + * end]. This is required by data integrity operations to ensure file data is > > + * on persistent storage prior to completion of the operation. > > + */ > > +void dax_writeback_mapping_range(struct address_space *mapping, loff_t start, > > + loff_t end) > > +{ > > + struct inode *inode = mapping->host; > > + pgoff_t indices[PAGEVEC_SIZE]; > > + pgoff_t start_page, end_page; > > + struct pagevec pvec; > > + void *entry; > > + int i; > > + > > + if (inode->i_blkbits != PAGE_SHIFT) { > > + WARN_ON_ONCE(1); > > + return; > > + } > > again > > > + rcu_read_lock(); > > + entry = radix_tree_lookup(&mapping->page_tree, start & PMD_MASK); > > + rcu_read_unlock(); > > What stabilizes the memory at *entry after rcu_read_unlock()? Nothing in this function. We use the entry that is currently in the tree to know whether or not to expand the range of offsets that we need to flush. Even if we are racing with someone, expanding our flushing range is non-destructive. 
We get a list of entries based on what is dirty later in this function via find_get_entries_tag(), and before we take any action on those entries we re-verify them while holding the tree_lock in dax_writeback_one(). The next version of this series will have updated version of this code which also accounts for block device removal via dax_map_atomic() inside of dax_writeback_one(). From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S934101AbbLWAAO (ORCPT ); Tue, 22 Dec 2015 19:00:14 -0500 Received: from mga11.intel.com ([192.55.52.93]:61496 "EHLO mga11.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S933267AbbLWAAM (ORCPT ); Tue, 22 Dec 2015 19:00:12 -0500 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.20,466,1444719600"; d="scan'208";a="877113571" Date: Tue, 22 Dec 2015 17:00:10 -0700 From: Ross Zwisler To: Andrew Morton Cc: Ross Zwisler , linux-kernel@vger.kernel.org, "H. Peter Anvin" , "J. Bruce Fields" , "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Dave Chinner , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@ml01.01.org, x86@kernel.org, xfs@oss.sgi.com, Dan Williams , Matthew Wilcox , Dave Hansen Subject: Re: [PATCH v5 1/7] pmem: add wb_cache_pmem() to the PMEM API Message-ID: <20151223000010.GB24124@linux.intel.com> Mail-Followup-To: Ross Zwisler , Andrew Morton , linux-kernel@vger.kernel.org, "H. Peter Anvin" , "J. Bruce Fields" , Theodore Ts'o , Alexander Viro , Andreas Dilger , Dave Chinner , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@ml01.01.org, x86@kernel.org, xfs@oss.sgi.com, Dan Williams , Matthew Wilcox , Dave Hansen References: <1450502540-8744-1-git-send-email-ross.zwisler@linux.intel.com> <1450502540-8744-2-git-send-email-ross.zwisler@linux.intel.com> <20151222144440.1ad9e076464f4751f3de6a1f@linux-foundation.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20151222144440.1ad9e076464f4751f3de6a1f@linux-foundation.org> User-Agent: Mutt/1.5.24 (2015-08-30) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Dec 22, 2015 at 02:44:40PM -0800, Andrew Morton wrote: > On Fri, 18 Dec 2015 22:22:14 -0700 Ross Zwisler wrote: > > > The function __arch_wb_cache_pmem() was already an internal implementation > > detail of the x86 PMEM API, but this functionality needs to be exported as > > part of the general PMEM API to handle the fsync/msync case for DAX mmaps. > > > > One thing worth noting is that we really do want this to be part of the > > PMEM API as opposed to a stand-alone function like clflush_cache_range() > > because of ordering restrictions. By having wb_cache_pmem() as part of the > > PMEM API we can leave it unordered, call it multiple times to write back > > large amounts of memory, and then order the multiple calls with a single > > wmb_pmem(). > > > > @@ -138,7 +139,7 @@ static inline void arch_clear_pmem(void __pmem *addr, size_t size) > > else > > memset(vaddr, 0, size); > > > > - __arch_wb_cache_pmem(vaddr, size); > > + arch_wb_cache_pmem(addr, size); > > } > > > > reject. 
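[For reference, the lookup Andrew asked about, annotated with the rationale from Ross's reply; the code is condensed from the dax_writeback_mapping_range() hunk quoted earlier and the comments are added here.]

	/*
	 * The unlocked lookup is only a hint used to widen the flush range:
	 * if the entry covering 'start' is a PMD entry, align down so the
	 * whole huge-page mapping gets written back.  Racing with an update
	 * is harmless -- at worst the range is wider than needed, and every
	 * entry found later via find_get_entries_tag() is re-verified under
	 * tree_lock in dax_writeback_one() before anything is flushed.
	 */
	rcu_read_lock();
	entry = radix_tree_lookup(&mapping->page_tree, start & PMD_MASK);
	rcu_read_unlock();

	if (entry && RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD)
		start &= PMD_MASK;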
I made this > > arch_wb_cache_pmem(vaddr, size); > > due to Dan's > http://www.ozlabs.org/~akpm/mmots/broken-out/pmem-dax-clean-up-clear_pmem.patch The first argument seems wrong to me - in arch_clear_pmem() 'addr' and 'vaddr' are the same address, with the only difference being 'addr' has the __pmem annotation. As of this patch arch_wb_cache_pmem() follows the lead of the rest of the exported PMEM API functions and takes an argument that has the __pmem annotation, so I believe it should be: arch_wb_cache_pmem(addr, size); Without this I think you'll get a sparse warning. This will be fixed up in the next version of my series which build upon Dan's patches. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S934141AbbLWAQd (ORCPT ); Tue, 22 Dec 2015 19:16:33 -0500 Received: from mga04.intel.com ([192.55.52.120]:62680 "EHLO mga04.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932594AbbLWAQb (ORCPT ); Tue, 22 Dec 2015 19:16:31 -0500 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.20,466,1444719600"; d="scan'208";a="17900463" Date: Tue, 22 Dec 2015 17:16:27 -0700 From: Ross Zwisler To: Andrew Morton Cc: Ross Zwisler , linux-kernel@vger.kernel.org, "H. Peter Anvin" , "J. Bruce Fields" , "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Dave Chinner , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@ml01.01.org, x86@kernel.org, xfs@oss.sgi.com, Dan Williams , Matthew Wilcox , Dave Hansen Subject: Re: [PATCH v5 2/7] dax: support dirty DAX entries in radix tree Message-ID: <20151223001627.GC24124@linux.intel.com> Mail-Followup-To: Ross Zwisler , Andrew Morton , linux-kernel@vger.kernel.org, "H. Peter Anvin" , "J. Bruce Fields" , Theodore Ts'o , Alexander Viro , Andreas Dilger , Dave Chinner , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@ml01.01.org, x86@kernel.org, xfs@oss.sgi.com, Dan Williams , Matthew Wilcox , Dave Hansen References: <1450502540-8744-1-git-send-email-ross.zwisler@linux.intel.com> <1450502540-8744-3-git-send-email-ross.zwisler@linux.intel.com> <20151222144605.08a84ded98a42d6125a7991e@linux-foundation.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20151222144605.08a84ded98a42d6125a7991e@linux-foundation.org> User-Agent: Mutt/1.5.24 (2015-08-30) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Dec 22, 2015 at 02:46:05PM -0800, Andrew Morton wrote: > On Fri, 18 Dec 2015 22:22:15 -0700 Ross Zwisler wrote: > > > Add support for tracking dirty DAX entries in the struct address_space > > radix tree. This tree is already used for dirty page writeback, and it > > already supports the use of exceptional (non struct page*) entries. > > > > In order to properly track dirty DAX pages we will insert new exceptional > > entries into the radix tree that represent dirty DAX PTE or PMD pages. > > These exceptional entries will also contain the writeback addresses for the > > PTE or PMD faults that we can use at fsync/msync time. > > > > There are currently two types of exceptional entries (shmem and shadow) > > that can be placed into the radix tree, and this adds a third. 
We rely on > > the fact that only one type of exceptional entry can be found in a given > > radix tree based on its usage. This happens for free with DAX vs shmem but > > we explicitly prevent shadow entries from being added to radix trees for > > DAX mappings. > > > > The only shadow entries that would be generated for DAX radix trees would > > be to track zero page mappings that were created for holes. These pages > > would receive minimal benefit from having shadow entries, and the choice > > to have only one type of exceptional entry in a given radix tree makes the > > logic simpler both in clear_exceptional_entry() and in the rest of DAX. > > > > > > ... > > > > --- a/include/linux/dax.h > > +++ b/include/linux/dax.h > > @@ -36,4 +36,9 @@ static inline bool vma_is_dax(struct vm_area_struct *vma) > > { > > return vma->vm_file && IS_DAX(vma->vm_file->f_mapping->host); > > } > > + > > +static inline bool dax_mapping(struct address_space *mapping) > > +{ > > + return mapping->host && IS_DAX(mapping->host); > > +} > > Can we make this evaluate to plain old "0" when CONFIG_FS_DAX=n? That > way a bunch of code in callers will fall away as well. > > If the compiler has any brains then a good way to do this would be to > make IS_DAX be "0" but one would need to check that the zeroness > properly propagated out of the inline. Ah, it already works that way due to some magic with IS_DAX(). I believe we already use the fact that blocks protected by IS_DAX() go away if CONFIG_FS_DAX isn't set. The trick is that S_DAX is defined to be 0 if CONFIG_FS_DAX isn't set. I'm pretty sure this is working because of the code in filemap_write_and_wait_range(). I added a block with the later "dax: add support for fsync/msync" patch which looks like this: @@ -482,6 +482,9 @@ int filemap_write_and_wait_range(struct address_space *mapping, { int err = 0; + if (dax_mapping(mapping) && mapping->nrdax) + dax_writeback_mapping_range(mapping, lstart, lend); + Without the dax_mapping() check there the behavior is the same, but we fail to compile if CONFIG_FS_DAX isn't set because dax_writeback_mapping_range() isn't defined. (Guess how I found that out. :) ) > > #endif > > diff --git a/include/linux/fs.h b/include/linux/fs.h > > index 3aa5142..b9ac534 100644 > > --- a/include/linux/fs.h > > +++ b/include/linux/fs.h > > @@ -433,6 +433,7 @@ struct address_space { > > /* Protected by tree_lock together with the radix tree */ > > unsigned long nrpages; /* number of total pages */ > > unsigned long nrshadows; /* number of shadow entries */ > > + unsigned long nrdax; /* number of DAX entries */ > > hm, that's unfortunate - machines commonly carry tremendous numbers of > address_spaces in memory and adding pork to them is rather a big deal. > We can't avoid this somehow? Maybe share the space with nrshadows by > some means? Find some other field which is unused for dax files? Jan Kara noticed the same thing: https://lists.01.org/pipermail/linux-nvdimm/2015-December/003626.html It'll be fixed in the next spin of the patch set. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Date: Tue, 22 Dec 2015 17:16:27 -0700 From: Ross Zwisler To: Andrew Morton Cc: Ross Zwisler , linux-kernel@vger.kernel.org, "H. Peter Anvin" , "J. 
Bruce Fields" , Theodore Ts'o , Alexander Viro , Andreas Dilger , Dave Chinner , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@ml01.01.org, x86@kernel.org, xfs@oss.sgi.com, Dan Williams , Matthew Wilcox , Dave Hansen Subject: Re: [PATCH v5 2/7] dax: support dirty DAX entries in radix tree Message-ID: <20151223001627.GC24124@linux.intel.com> References: <1450502540-8744-1-git-send-email-ross.zwisler@linux.intel.com> <1450502540-8744-3-git-send-email-ross.zwisler@linux.intel.com> <20151222144605.08a84ded98a42d6125a7991e@linux-foundation.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20151222144605.08a84ded98a42d6125a7991e@linux-foundation.org> Sender: owner-linux-mm@kvack.org List-ID: On Tue, Dec 22, 2015 at 02:46:05PM -0800, Andrew Morton wrote: > On Fri, 18 Dec 2015 22:22:15 -0700 Ross Zwisler wrote: > > > Add support for tracking dirty DAX entries in the struct address_space > > radix tree. This tree is already used for dirty page writeback, and it > > already supports the use of exceptional (non struct page*) entries. > > > > In order to properly track dirty DAX pages we will insert new exceptional > > entries into the radix tree that represent dirty DAX PTE or PMD pages. > > These exceptional entries will also contain the writeback addresses for the > > PTE or PMD faults that we can use at fsync/msync time. > > > > There are currently two types of exceptional entries (shmem and shadow) > > that can be placed into the radix tree, and this adds a third. We rely on > > the fact that only one type of exceptional entry can be found in a given > > radix tree based on its usage. This happens for free with DAX vs shmem but > > we explicitly prevent shadow entries from being added to radix trees for > > DAX mappings. > > > > The only shadow entries that would be generated for DAX radix trees would > > be to track zero page mappings that were created for holes. These pages > > would receive minimal benefit from having shadow entries, and the choice > > to have only one type of exceptional entry in a given radix tree makes the > > logic simpler both in clear_exceptional_entry() and in the rest of DAX. > > > > > > ... > > > > --- a/include/linux/dax.h > > +++ b/include/linux/dax.h > > @@ -36,4 +36,9 @@ static inline bool vma_is_dax(struct vm_area_struct *vma) > > { > > return vma->vm_file && IS_DAX(vma->vm_file->f_mapping->host); > > } > > + > > +static inline bool dax_mapping(struct address_space *mapping) > > +{ > > + return mapping->host && IS_DAX(mapping->host); > > +} > > Can we make this evaluate to plain old "0" when CONFIG_FS_DAX=n? That > way a bunch of code in callers will fall away as well. > > If the compiler has any brains then a good way to do this would be to > make IS_DAX be "0" but one would need to check that the zeroness > properly propagated out of the inline. Ah, it already works that way due to some magic with IS_DAX(). I believe we already use the fact that blocks protected by IS_DAX() go away if CONFIG_FS_DAX isn't set. The trick is that S_DAX is defined to be 0 if CONFIG_FS_DAX isn't set. I'm pretty sure this is working because of the code in filemap_write_and_wait_range(). 
I added a block with the later "dax: add support for fsync/msync" patch which looks like this: @@ -482,6 +482,9 @@ int filemap_write_and_wait_range(struct address_space *mapping, { int err = 0; + if (dax_mapping(mapping) && mapping->nrdax) + dax_writeback_mapping_range(mapping, lstart, lend); + Without the dax_mapping() check there the behavior is the same, but we fail to compile if CONFIG_FS_DAX isn't set because dax_writeback_mapping_range() isn't defined. (Guess how I found that out. :) ) > > #endif > > diff --git a/include/linux/fs.h b/include/linux/fs.h > > index 3aa5142..b9ac534 100644 > > --- a/include/linux/fs.h > > +++ b/include/linux/fs.h > > @@ -433,6 +433,7 @@ struct address_space { > > /* Protected by tree_lock together with the radix tree */ > > unsigned long nrpages; /* number of total pages */ > > unsigned long nrshadows; /* number of shadow entries */ > > + unsigned long nrdax; /* number of DAX entries */ > > hm, that's unfortunate - machines commonly carry tremendous numbers of > address_spaces in memory and adding pork to them is rather a big deal. > We can't avoid this somehow? Maybe share the space with nrshadows by > some means? Find some other field which is unused for dax files? Jan Kara noticed the same thing: https://lists.01.org/pipermail/linux-nvdimm/2015-December/003626.html It'll be fixed in the next spin of the patch set. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Date: Tue, 22 Dec 2015 14:46:05 -0800 From: Andrew Morton To: Ross Zwisler Cc: linux-kernel@vger.kernel.org, "H. Peter Anvin" , "J. Bruce Fields" , "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Dave Chinner , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@ml01.01.org, x86@kernel.org, xfs@oss.sgi.com, Dan Williams , Matthew Wilcox , Dave Hansen Subject: Re: [PATCH v5 2/7] dax: support dirty DAX entries in radix tree Message-Id: <20151222144605.08a84ded98a42d6125a7991e@linux-foundation.org> In-Reply-To: <1450502540-8744-3-git-send-email-ross.zwisler@linux.intel.com> References: <1450502540-8744-1-git-send-email-ross.zwisler@linux.intel.com> <1450502540-8744-3-git-send-email-ross.zwisler@linux.intel.com> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: On Fri, 18 Dec 2015 22:22:15 -0700 Ross Zwisler wrote: > Add support for tracking dirty DAX entries in the struct address_space > radix tree. This tree is already used for dirty page writeback, and it > already supports the use of exceptional (non struct page*) entries. > > In order to properly track dirty DAX pages we will insert new exceptional > entries into the radix tree that represent dirty DAX PTE or PMD pages. > These exceptional entries will also contain the writeback addresses for the > PTE or PMD faults that we can use at fsync/msync time. > > There are currently two types of exceptional entries (shmem and shadow) > that can be placed into the radix tree, and this adds a third. We rely on > the fact that only one type of exceptional entry can be found in a given > radix tree based on its usage. This happens for free with DAX vs shmem but > we explicitly prevent shadow entries from being added to radix trees for > DAX mappings. 
> > The only shadow entries that would be generated for DAX radix trees would > be to track zero page mappings that were created for holes. These pages > would receive minimal benefit from having shadow entries, and the choice > to have only one type of exceptional entry in a given radix tree makes the > logic simpler both in clear_exceptional_entry() and in the rest of DAX. > > > ... > > --- a/include/linux/dax.h > +++ b/include/linux/dax.h > @@ -36,4 +36,9 @@ static inline bool vma_is_dax(struct vm_area_struct *vma) > { > return vma->vm_file && IS_DAX(vma->vm_file->f_mapping->host); > } > + > +static inline bool dax_mapping(struct address_space *mapping) > +{ > + return mapping->host && IS_DAX(mapping->host); > +} Can we make this evaluate to plain old "0" when CONFIG_FS_DAX=n? That way a bunch of code in callers will fall away as well. If the compiler has any brains then a good way to do this would be to make IS_DAX be "0" but one would need to check that the zeroness properly propagated out of the inline. > #endif > diff --git a/include/linux/fs.h b/include/linux/fs.h > index 3aa5142..b9ac534 100644 > --- a/include/linux/fs.h > +++ b/include/linux/fs.h > @@ -433,6 +433,7 @@ struct address_space { > /* Protected by tree_lock together with the radix tree */ > unsigned long nrpages; /* number of total pages */ > unsigned long nrshadows; /* number of shadow entries */ > + unsigned long nrdax; /* number of DAX entries */ hm, that's unfortunate - machines commonly carry tremendous numbers of address_spaces in memory and adding pork to them is rather a big deal. We can't avoid this somehow? Maybe share the space with nrshadows by some means? Find some other field which is unused for dax files? > --- a/mm/filemap.c > +++ b/mm/filemap.c > @@ -11,6 +11,7 @@ > */ > #include > #include > +#include > #include > #include > #include > @@ -579,6 +580,12 @@ static int page_cache_tree_insert(struct address_space *mapping, > p = radix_tree_deref_slot_protected(slot, &mapping->tree_lock); > if (!radix_tree_exceptional_entry(p)) > return -EEXIST; > + > + if (dax_mapping(mapping)) { > + WARN_ON(1); > + return -EINVAL; > + } this: --- a/mm/filemap.c~dax-support-dirty-dax-entries-in-radix-tree-fix +++ a/mm/filemap.c @@ -581,10 +581,8 @@ static int page_cache_tree_insert(struct if (!radix_tree_exceptional_entry(p)) return -EEXIST; - if (dax_mapping(mapping)) { - WARN_ON(1); + if (WARN_ON(dax_mapping(mapping))) return -EINVAL; - } if (shadowp) *shadowp = p; -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Date: Tue, 22 Dec 2015 17:00:10 -0700 From: Ross Zwisler To: Andrew Morton Cc: Ross Zwisler , linux-kernel@vger.kernel.org, "H. Peter Anvin" , "J. 
Bruce Fields" , Theodore Ts'o , Alexander Viro , Andreas Dilger , Dave Chinner , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@ml01.01.org, x86@kernel.org, xfs@oss.sgi.com, Dan Williams , Matthew Wilcox , Dave Hansen Subject: Re: [PATCH v5 1/7] pmem: add wb_cache_pmem() to the PMEM API Message-ID: <20151223000010.GB24124@linux.intel.com> References: <1450502540-8744-1-git-send-email-ross.zwisler@linux.intel.com> <1450502540-8744-2-git-send-email-ross.zwisler@linux.intel.com> <20151222144440.1ad9e076464f4751f3de6a1f@linux-foundation.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20151222144440.1ad9e076464f4751f3de6a1f@linux-foundation.org> Sender: owner-linux-mm@kvack.org List-ID: On Tue, Dec 22, 2015 at 02:44:40PM -0800, Andrew Morton wrote: > On Fri, 18 Dec 2015 22:22:14 -0700 Ross Zwisler wrote: > > > The function __arch_wb_cache_pmem() was already an internal implementation > > detail of the x86 PMEM API, but this functionality needs to be exported as > > part of the general PMEM API to handle the fsync/msync case for DAX mmaps. > > > > One thing worth noting is that we really do want this to be part of the > > PMEM API as opposed to a stand-alone function like clflush_cache_range() > > because of ordering restrictions. By having wb_cache_pmem() as part of the > > PMEM API we can leave it unordered, call it multiple times to write back > > large amounts of memory, and then order the multiple calls with a single > > wmb_pmem(). > > > > @@ -138,7 +139,7 @@ static inline void arch_clear_pmem(void __pmem *addr, size_t size) > > else > > memset(vaddr, 0, size); > > > > - __arch_wb_cache_pmem(vaddr, size); > > + arch_wb_cache_pmem(addr, size); > > } > > > > reject. I made this > > arch_wb_cache_pmem(vaddr, size); > > due to Dan's > http://www.ozlabs.org/~akpm/mmots/broken-out/pmem-dax-clean-up-clear_pmem.patch The first argument seems wrong to me - in arch_clear_pmem() 'addr' and 'vaddr' are the same address, with the only difference being 'addr' has the __pmem annotation. As of this patch arch_wb_cache_pmem() follows the lead of the rest of the exported PMEM API functions and takes an argument that has the __pmem annotation, so I believe it should be: arch_wb_cache_pmem(addr, size); Without this I think you'll get a sparse warning. This will be fixed up in the next version of my series which build upon Dan's patches. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Date: Tue, 22 Dec 2015 14:44:40 -0800 From: Andrew Morton To: Ross Zwisler Cc: linux-kernel@vger.kernel.org, "H. Peter Anvin" , "J. 
Bruce Fields" , "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Dave Chinner , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@ml01.01.org, x86@kernel.org, xfs@oss.sgi.com, Dan Williams , Matthew Wilcox , Dave Hansen Subject: Re: [PATCH v5 1/7] pmem: add wb_cache_pmem() to the PMEM API Message-Id: <20151222144440.1ad9e076464f4751f3de6a1f@linux-foundation.org> In-Reply-To: <1450502540-8744-2-git-send-email-ross.zwisler@linux.intel.com> References: <1450502540-8744-1-git-send-email-ross.zwisler@linux.intel.com> <1450502540-8744-2-git-send-email-ross.zwisler@linux.intel.com> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: On Fri, 18 Dec 2015 22:22:14 -0700 Ross Zwisler wrote: > The function __arch_wb_cache_pmem() was already an internal implementation > detail of the x86 PMEM API, but this functionality needs to be exported as > part of the general PMEM API to handle the fsync/msync case for DAX mmaps. > > One thing worth noting is that we really do want this to be part of the > PMEM API as opposed to a stand-alone function like clflush_cache_range() > because of ordering restrictions. By having wb_cache_pmem() as part of the > PMEM API we can leave it unordered, call it multiple times to write back > large amounts of memory, and then order the multiple calls with a single > wmb_pmem(). > > @@ -138,7 +139,7 @@ static inline void arch_clear_pmem(void __pmem *addr, size_t size) > else > memset(vaddr, 0, size); > > - __arch_wb_cache_pmem(vaddr, size); > + arch_wb_cache_pmem(addr, size); > } > reject. I made this arch_wb_cache_pmem(vaddr, size); due to Dan's http://www.ozlabs.org/~akpm/mmots/broken-out/pmem-dax-clean-up-clear_pmem.patch -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Date: Tue, 22 Dec 2015 16:51:23 -0700 From: Ross Zwisler To: Andrew Morton Cc: Ross Zwisler , linux-kernel@vger.kernel.org, "H. Peter Anvin" , "J. Bruce Fields" , Theodore Ts'o , Alexander Viro , Andreas Dilger , Dave Chinner , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@ml01.01.org, x86@kernel.org, xfs@oss.sgi.com, Dan Williams , Matthew Wilcox , Dave Hansen Subject: Re: [PATCH v5 4/7] dax: add support for fsync/sync Message-ID: <20151222235123.GA24124@linux.intel.com> References: <1450502540-8744-1-git-send-email-ross.zwisler@linux.intel.com> <1450502540-8744-5-git-send-email-ross.zwisler@linux.intel.com> <20151222144625.f400e12e362cf9b00f6ffb36@linux-foundation.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20151222144625.f400e12e362cf9b00f6ffb36@linux-foundation.org> Sender: owner-linux-mm@kvack.org List-ID: On Tue, Dec 22, 2015 at 02:46:25PM -0800, Andrew Morton wrote: > On Fri, 18 Dec 2015 22:22:17 -0700 Ross Zwisler wrote: > > > To properly handle fsync/msync in an efficient way DAX needs to track dirty > > pages so it is able to flush them durably to media on demand. > > > > The tracking of dirty pages is done via the radix tree in struct > > address_space. 
This radix tree is already used by the page writeback > > infrastructure for tracking dirty pages associated with an open file, and > > it already has support for exceptional (non struct page*) entries. We > > build upon these features to add exceptional entries to the radix tree for > > DAX dirty PMD or PTE pages at fault time. > > I'm getting a few rejects here against other pending changes. Things > look OK to me but please do runtime test the end result as it resides > in linux-next. Which will be next year. Sounds good. I'm hoping to soon send out an updated version of this series which merges with Dan's changes to dax.c. Thank you for pulling these into -mm. > --- a/fs/dax.c~dax-add-support-for-fsync-sync-fix > +++ a/fs/dax.c > @@ -383,10 +383,8 @@ static void dax_writeback_one(struct add > struct radix_tree_node *node; > void **slot; > > - if (type != RADIX_DAX_PTE && type != RADIX_DAX_PMD) { > - WARN_ON_ONCE(1); > + if (WARN_ON_ONCE(type != RADIX_DAX_PTE && type != RADIX_DAX_PMD)) > return; > - } This is much cleaner, thanks. I'll make this change throughout my set. > > +/* > > + * Flush the mapping to the persistent domain within the byte range of [start, > > + * end]. This is required by data integrity operations to ensure file data is > > + * on persistent storage prior to completion of the operation. > > + */ > > +void dax_writeback_mapping_range(struct address_space *mapping, loff_t start, > > + loff_t end) > > +{ > > + struct inode *inode = mapping->host; > > + pgoff_t indices[PAGEVEC_SIZE]; > > + pgoff_t start_page, end_page; > > + struct pagevec pvec; > > + void *entry; > > + int i; > > + > > + if (inode->i_blkbits != PAGE_SHIFT) { > > + WARN_ON_ONCE(1); > > + return; > > + } > > again > > > + rcu_read_lock(); > > + entry = radix_tree_lookup(&mapping->page_tree, start & PMD_MASK); > > + rcu_read_unlock(); > > What stabilizes the memory at *entry after rcu_read_unlock()? Nothing in this function. We use the entry that is currently in the tree to know whether or not to expand the range of offsets that we need to flush. Even if we are racing with someone, expanding our flushing range is non-destructive. We get a list of entries based on what is dirty later in this function via find_get_entries_tag(), and before we take any action on those entries we re-verify them while holding the tree_lock in dax_writeback_one(). The next version of this series will have updated version of this code which also accounts for block device removal via dax_map_atomic() inside of dax_writeback_one(). -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Date: Tue, 22 Dec 2015 14:46:25 -0800 From: Andrew Morton To: Ross Zwisler Cc: linux-kernel@vger.kernel.org, "H. Peter Anvin" , "J. 
Bruce Fields" , "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Dave Chinner , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@ml01.01.org, x86@kernel.org, xfs@oss.sgi.com, Dan Williams , Matthew Wilcox , Dave Hansen Subject: Re: [PATCH v5 4/7] dax: add support for fsync/sync Message-Id: <20151222144625.f400e12e362cf9b00f6ffb36@linux-foundation.org> In-Reply-To: <1450502540-8744-5-git-send-email-ross.zwisler@linux.intel.com> References: <1450502540-8744-1-git-send-email-ross.zwisler@linux.intel.com> <1450502540-8744-5-git-send-email-ross.zwisler@linux.intel.com> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: On Fri, 18 Dec 2015 22:22:17 -0700 Ross Zwisler wrote: > To properly handle fsync/msync in an efficient way DAX needs to track dirty > pages so it is able to flush them durably to media on demand. > > The tracking of dirty pages is done via the radix tree in struct > address_space. This radix tree is already used by the page writeback > infrastructure for tracking dirty pages associated with an open file, and > it already has support for exceptional (non struct page*) entries. We > build upon these features to add exceptional entries to the radix tree for > DAX dirty PMD or PTE pages at fault time. I'm getting a few rejects here against other pending changes. Things look OK to me but please do runtime test the end result as it resides in linux-next. Which will be next year. > > ... > > +static void dax_writeback_one(struct address_space *mapping, pgoff_t index, > + void *entry) > +{ > + struct radix_tree_root *page_tree = &mapping->page_tree; > + int type = RADIX_DAX_TYPE(entry); > + struct radix_tree_node *node; > + void **slot; > + > + if (type != RADIX_DAX_PTE && type != RADIX_DAX_PMD) { > + WARN_ON_ONCE(1); > + return; > + } --- a/fs/dax.c~dax-add-support-for-fsync-sync-fix +++ a/fs/dax.c @@ -383,10 +383,8 @@ static void dax_writeback_one(struct add struct radix_tree_node *node; void **slot; - if (type != RADIX_DAX_PTE && type != RADIX_DAX_PMD) { - WARN_ON_ONCE(1); + if (WARN_ON_ONCE(type != RADIX_DAX_PTE && type != RADIX_DAX_PMD)) return; - } spin_lock_irq(&mapping->tree_lock); /* > + spin_lock_irq(&mapping->tree_lock); > + /* > + * Regular page slots are stabilized by the page lock even > + * without the tree itself locked. These unlocked entries > + * need verification under the tree lock. > + */ > + if (!__radix_tree_lookup(page_tree, index, &node, &slot)) > + goto unlock; > + if (*slot != entry) > + goto unlock; > + > + /* another fsync thread may have already written back this entry */ > + if (!radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE)) > + goto unlock; > + > + radix_tree_tag_clear(page_tree, index, PAGECACHE_TAG_TOWRITE); > + > + if (type == RADIX_DAX_PMD) > + wb_cache_pmem(RADIX_DAX_ADDR(entry), PMD_SIZE); > + else > + wb_cache_pmem(RADIX_DAX_ADDR(entry), PAGE_SIZE); > + unlock: > + spin_unlock_irq(&mapping->tree_lock); > +} > + > +/* > + * Flush the mapping to the persistent domain within the byte range of [start, > + * end]. This is required by data integrity operations to ensure file data is > + * on persistent storage prior to completion of the operation. 
> + */ > +void dax_writeback_mapping_range(struct address_space *mapping, loff_t start, > + loff_t end) > +{ > + struct inode *inode = mapping->host; > + pgoff_t indices[PAGEVEC_SIZE]; > + pgoff_t start_page, end_page; > + struct pagevec pvec; > + void *entry; > + int i; > + > + if (inode->i_blkbits != PAGE_SHIFT) { > + WARN_ON_ONCE(1); > + return; > + } again > + rcu_read_lock(); > + entry = radix_tree_lookup(&mapping->page_tree, start & PMD_MASK); > + rcu_read_unlock(); What stabilizes the memory at *entry after rcu_read_unlock()? > + /* see if the start of our range is covered by a PMD entry */ > + if (entry && RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD) > + start &= PMD_MASK; > + > + start_page = start >> PAGE_CACHE_SHIFT; > + end_page = end >> PAGE_CACHE_SHIFT; > + > + tag_pages_for_writeback(mapping, start_page, end_page); > + > + pagevec_init(&pvec, 0); > + while (1) { > + pvec.nr = find_get_entries_tag(mapping, start_page, > + PAGECACHE_TAG_TOWRITE, PAGEVEC_SIZE, > + pvec.pages, indices); > + > + if (pvec.nr == 0) > + break; > + > + for (i = 0; i < pvec.nr; i++) > + dax_writeback_one(mapping, indices[i], pvec.pages[i]); > + } > + wmb_pmem(); > +} > +EXPORT_SYMBOL_GPL(dax_writeback_mapping_range); > + > > ... > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Date: Tue, 22 Dec 2015 14:46:11 -0800 From: Andrew Morton To: Ross Zwisler Cc: linux-kernel@vger.kernel.org, "H. Peter Anvin" , "J. Bruce Fields" , "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Dave Chinner , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@ml01.01.org, x86@kernel.org, xfs@oss.sgi.com, Dan Williams , Matthew Wilcox , Dave Hansen Subject: Re: [PATCH v5 3/7] mm: add find_get_entries_tag() Message-Id: <20151222144611.07002cfde41d035125da2fa5@linux-foundation.org> In-Reply-To: <1450502540-8744-4-git-send-email-ross.zwisler@linux.intel.com> References: <1450502540-8744-1-git-send-email-ross.zwisler@linux.intel.com> <1450502540-8744-4-git-send-email-ross.zwisler@linux.intel.com> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: On Fri, 18 Dec 2015 22:22:16 -0700 Ross Zwisler wrote: > Add find_get_entries_tag() to the family of functions that include > find_get_entries(), find_get_pages() and find_get_pages_tag(). This is > needed for DAX dirty page handling because we need a list of both page > offsets and radix tree entries ('indices' and 'entries' in this function) > that are marked with the PAGECACHE_TAG_TOWRITE tag. > > ... > > +EXPORT_SYMBOL(find_get_entries_tag); This is actually a pretty crappy name because it doesn't describe what subsystem it belongs to. scheduler? scatter/gather? filesystem? But given what we've already done, I don't see an obvious fix. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Ross Zwisler Subject: Re: [PATCH v5 4/7] dax: add support for fsync/sync Date: Mon, 21 Dec 2015 10:05:45 -0700 Message-ID: <20151221170545.GA13494@linux.intel.com> References: <1450502540-8744-1-git-send-email-ross.zwisler@linux.intel.com> <1450502540-8744-5-git-send-email-ross.zwisler@linux.intel.com> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Cc: Dave Hansen , "J. Bruce Fields" , Linux MM , Andreas Dilger , "H. Peter Anvin" , Jeff Layton , "linux-nvdimm@lists.01.org" , X86 ML , Ingo Molnar , Matthew Wilcox , Ross Zwisler , linux-ext4 , XFS Developers , Alexander Viro , Thomas Gleixner , Theodore Ts'o , "linux-kernel@vger.kernel.org" , Jan Kara , linux-fsdevel , Andrew Morton , Matthew Wilcox To: Dan Williams Return-path: Content-Disposition: inline In-Reply-To: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: xfs-bounces@oss.sgi.com Sender: xfs-bounces@oss.sgi.com List-Id: linux-ext4.vger.kernel.org On Sat, Dec 19, 2015 at 10:37:46AM -0800, Dan Williams wrote: > On Fri, Dec 18, 2015 at 9:22 PM, Ross Zwisler > wrote: > > To properly handle fsync/msync in an efficient way DAX needs to track dirty > > pages so it is able to flush them durably to media on demand. > > > > The tracking of dirty pages is done via the radix tree in struct > > address_space. This radix tree is already used by the page writeback > > infrastructure for tracking dirty pages associated with an open file, and > > it already has support for exceptional (non struct page*) entries. We > > build upon these features to add exceptional entries to the radix tree for > > DAX dirty PMD or PTE pages at fault time. > > > > Signed-off-by: Ross Zwisler > [..] > > +static void dax_writeback_one(struct address_space *mapping, pgoff_t index, > > + void *entry) > > +{ > > + struct radix_tree_root *page_tree = &mapping->page_tree; > > + int type = RADIX_DAX_TYPE(entry); > > + struct radix_tree_node *node; > > + void **slot; > > + > > + if (type != RADIX_DAX_PTE && type != RADIX_DAX_PMD) { > > + WARN_ON_ONCE(1); > > + return; > > + } > > + > > + spin_lock_irq(&mapping->tree_lock); > > + /* > > + * Regular page slots are stabilized by the page lock even > > + * without the tree itself locked. These unlocked entries > > + * need verification under the tree lock. > > + */ > > + if (!__radix_tree_lookup(page_tree, index, &node, &slot)) > > + goto unlock; > > + if (*slot != entry) > > + goto unlock; > > + > > + /* another fsync thread may have already written back this entry */ > > + if (!radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE)) > > + goto unlock; > > + > > + radix_tree_tag_clear(page_tree, index, PAGECACHE_TAG_TOWRITE); > > + > > + if (type == RADIX_DAX_PMD) > > + wb_cache_pmem(RADIX_DAX_ADDR(entry), PMD_SIZE); > > + else > > + wb_cache_pmem(RADIX_DAX_ADDR(entry), PAGE_SIZE); > > Hi Ross, I should have realized this sooner, but what guarantees that > the address returned by RADIX_DAX_ADDR(entry) is still valid at this > point? I think we need to store the sector in the radix tree and then > perform a new dax_map_atomic() operation to either lookup a valid > address or fail the sync request. Otherwise, if the device is gone > we'll crash, or write into some other random vmalloc address space. Ah, good point, thank you. v4 of this series is based on a version of DAX where we aren't properly dealing with PMEM device removal. 
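(Purely to illustrate the sector-based scheme suggested above, and not code from any version of this series, the flush path would stop caching a kernel virtual address and instead re-resolve the entry each time it is written back. All helper names below are hypothetical stand-ins:

	/*
	 * Hypothetical sketch: the radix tree entry records a sector,
	 * and the address is looked up again at flush time so a dead
	 * or remapped device fails the sync instead of being flushed
	 * through a stale pointer.
	 */
	static void sketch_flush_entry(struct block_device *bdev,
				       void *entry, size_t size)
	{
		void __pmem *addr;

		addr = sketch_map_sector(bdev, RADIX_DAX_SECTOR(entry), size);
		if (!addr)
			return;		/* device gone: fail the sync */
		wb_cache_pmem(addr, size);
		sketch_unmap_sector(bdev, addr, size);
	}

where sketch_map_sector(), sketch_unmap_sector() and RADIX_DAX_SECTOR() stand in for whatever the real lookup ends up being.)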
I've got an updated version that merges with your dax_map_atomic() changes, and I'll add this change into v5 which I will send out today. Thank you for the suggestion. One clarification, with the code as it is in v4 we are only doing clflush/clflushopt/clwb instructions on the kaddr we've stored in the radix tree, so I don't think that there is actually a risk of us doing a "write into some other random vmalloc address space"? I think at worse we will end up clflushing an address that either isn't mapped or has been remapped by someone else. Or are you worried that the clflush would trigger a cache writeback to a memory address where writes have side effects, thus triggering the side effect? I definitely think it needs to be fixed, I'm just trying to make sure I understood your comment. _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jan Kara Subject: Re: [PATCH v5 2/7] dax: support dirty DAX entries in radix tree Date: Mon, 21 Dec 2015 18:15:12 +0100 Message-ID: <20151221171512.GA7030@quack.suse.cz> References: <1450502540-8744-1-git-send-email-ross.zwisler@linux.intel.com> <1450502540-8744-3-git-send-email-ross.zwisler@linux.intel.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: linux-kernel@vger.kernel.org, "H. Peter Anvin" , "J. Bruce Fields" , Theodore Ts'o , Alexander Viro , Andreas Dilger , Dave Chinner , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@lists.01.org, x86@kernel.org, xfs@oss.sgi.com, Andrew Morton , Dan Williams , Matthew Wilcox , Dave Hansen To: Ross Zwisler Return-path: Received: from mx2.suse.de ([195.135.220.15]:46076 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751208AbbLURP2 (ORCPT ); Mon, 21 Dec 2015 12:15:28 -0500 Content-Disposition: inline In-Reply-To: <1450502540-8744-3-git-send-email-ross.zwisler@linux.intel.com> Sender: linux-ext4-owner@vger.kernel.org List-ID: On Fri 18-12-15 22:22:15, Ross Zwisler wrote: > Add support for tracking dirty DAX entries in the struct address_space > radix tree. This tree is already used for dirty page writeback, and it > already supports the use of exceptional (non struct page*) entries. > > In order to properly track dirty DAX pages we will insert new exceptional > entries into the radix tree that represent dirty DAX PTE or PMD pages. > These exceptional entries will also contain the writeback addresses for the > PTE or PMD faults that we can use at fsync/msync time. > > There are currently two types of exceptional entries (shmem and shadow) > that can be placed into the radix tree, and this adds a third. We rely on > the fact that only one type of exceptional entry can be found in a given > radix tree based on its usage. This happens for free with DAX vs shmem but > we explicitly prevent shadow entries from being added to radix trees for > DAX mappings. > > The only shadow entries that would be generated for DAX radix trees would > be to track zero page mappings that were created for holes. These pages > would receive minimal benefit from having shadow entries, and the choice > to have only one type of exceptional entry in a given radix tree makes the > logic simpler both in clear_exceptional_entry() and in the rest of DAX. > > Signed-off-by: Ross Zwisler The patch looks good to me. 
Just one comment: When we have this exclusion between different types of exceptional entries, there is no real need to have separate counters of 'shadow' and 'dax' entries, is there? We can have one 'nrexceptional' counter and don't have to grow struct inode unnecessarily which would be really welcome since DAX isn't a mainstream feature. Could you please change the code? Thanks! Honza > --- > fs/block_dev.c | 3 ++- > fs/inode.c | 1 + > include/linux/dax.h | 5 ++++ > include/linux/fs.h | 1 + > include/linux/radix-tree.h | 9 +++++++ > mm/filemap.c | 13 +++++++--- > mm/truncate.c | 64 +++++++++++++++++++++++++++------------------- > mm/vmscan.c | 9 ++++++- > 8 files changed, 73 insertions(+), 32 deletions(-) > > diff --git a/fs/block_dev.c b/fs/block_dev.c > index c25639e..226dacc 100644 > --- a/fs/block_dev.c > +++ b/fs/block_dev.c > @@ -75,7 +75,8 @@ void kill_bdev(struct block_device *bdev) > { > struct address_space *mapping = bdev->bd_inode->i_mapping; > > - if (mapping->nrpages == 0 && mapping->nrshadows == 0) > + if (mapping->nrpages == 0 && mapping->nrshadows == 0 && > + mapping->nrdax == 0) > return; > > invalidate_bh_lrus(); > diff --git a/fs/inode.c b/fs/inode.c > index 1be5f90..79d828f 100644 > --- a/fs/inode.c > +++ b/fs/inode.c > @@ -496,6 +496,7 @@ void clear_inode(struct inode *inode) > spin_lock_irq(&inode->i_data.tree_lock); > BUG_ON(inode->i_data.nrpages); > BUG_ON(inode->i_data.nrshadows); > + BUG_ON(inode->i_data.nrdax); > spin_unlock_irq(&inode->i_data.tree_lock); > BUG_ON(!list_empty(&inode->i_data.private_list)); > BUG_ON(!(inode->i_state & I_FREEING)); > diff --git a/include/linux/dax.h b/include/linux/dax.h > index b415e52..e9d57f68 100644 > --- a/include/linux/dax.h > +++ b/include/linux/dax.h > @@ -36,4 +36,9 @@ static inline bool vma_is_dax(struct vm_area_struct *vma) > { > return vma->vm_file && IS_DAX(vma->vm_file->f_mapping->host); > } > + > +static inline bool dax_mapping(struct address_space *mapping) > +{ > + return mapping->host && IS_DAX(mapping->host); > +} > #endif > diff --git a/include/linux/fs.h b/include/linux/fs.h > index 3aa5142..b9ac534 100644 > --- a/include/linux/fs.h > +++ b/include/linux/fs.h > @@ -433,6 +433,7 @@ struct address_space { > /* Protected by tree_lock together with the radix tree */ > unsigned long nrpages; /* number of total pages */ > unsigned long nrshadows; /* number of shadow entries */ > + unsigned long nrdax; /* number of DAX entries */ > pgoff_t writeback_index;/* writeback starts here */ > const struct address_space_operations *a_ops; /* methods */ > unsigned long flags; /* error bits/gfp mask */ > diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h > index 33170db..f793c99 100644 > --- a/include/linux/radix-tree.h > +++ b/include/linux/radix-tree.h > @@ -51,6 +51,15 @@ > #define RADIX_TREE_EXCEPTIONAL_ENTRY 2 > #define RADIX_TREE_EXCEPTIONAL_SHIFT 2 > > +#define RADIX_DAX_MASK 0xf > +#define RADIX_DAX_PTE (0x4 | RADIX_TREE_EXCEPTIONAL_ENTRY) > +#define RADIX_DAX_PMD (0x8 | RADIX_TREE_EXCEPTIONAL_ENTRY) > +#define RADIX_DAX_TYPE(entry) ((__force unsigned long)entry & RADIX_DAX_MASK) > +#define RADIX_DAX_ADDR(entry) ((void __pmem *)((unsigned long)entry & \ > + ~RADIX_DAX_MASK)) > +#define RADIX_DAX_ENTRY(addr, pmd) ((void *)((__force unsigned long)addr | \ > + (pmd ? 
RADIX_DAX_PMD : RADIX_DAX_PTE))) > + > static inline int radix_tree_is_indirect_ptr(void *ptr) > { > return (int)((unsigned long)ptr & RADIX_TREE_INDIRECT_PTR); > diff --git a/mm/filemap.c b/mm/filemap.c > index 1bb0076..167a4d9 100644 > --- a/mm/filemap.c > +++ b/mm/filemap.c > @@ -11,6 +11,7 @@ > */ > #include > #include > +#include > #include > #include > #include > @@ -579,6 +580,12 @@ static int page_cache_tree_insert(struct address_space *mapping, > p = radix_tree_deref_slot_protected(slot, &mapping->tree_lock); > if (!radix_tree_exceptional_entry(p)) > return -EEXIST; > + > + if (dax_mapping(mapping)) { > + WARN_ON(1); > + return -EINVAL; > + } > + > if (shadowp) > *shadowp = p; > mapping->nrshadows--; > @@ -1242,9 +1249,9 @@ repeat: > if (radix_tree_deref_retry(page)) > goto restart; > /* > - * A shadow entry of a recently evicted page, > - * or a swap entry from shmem/tmpfs. Return > - * it without attempting to raise page count. > + * A shadow entry of a recently evicted page, a swap > + * entry from shmem/tmpfs or a DAX entry. Return it > + * without attempting to raise page count. > */ > goto export; > } > diff --git a/mm/truncate.c b/mm/truncate.c > index 76e35ad..1dc9f29 100644 > --- a/mm/truncate.c > +++ b/mm/truncate.c > @@ -9,6 +9,7 @@ > > #include > #include > +#include > #include > #include > #include > @@ -34,31 +35,39 @@ static void clear_exceptional_entry(struct address_space *mapping, > return; > > spin_lock_irq(&mapping->tree_lock); > - /* > - * Regular page slots are stabilized by the page lock even > - * without the tree itself locked. These unlocked entries > - * need verification under the tree lock. > - */ > - if (!__radix_tree_lookup(&mapping->page_tree, index, &node, &slot)) > - goto unlock; > - if (*slot != entry) > - goto unlock; > - radix_tree_replace_slot(slot, NULL); > - mapping->nrshadows--; > - if (!node) > - goto unlock; > - workingset_node_shadows_dec(node); > - /* > - * Don't track node without shadow entries. > - * > - * Avoid acquiring the list_lru lock if already untracked. > - * The list_empty() test is safe as node->private_list is > - * protected by mapping->tree_lock. > - */ > - if (!workingset_node_shadows(node) && > - !list_empty(&node->private_list)) > - list_lru_del(&workingset_shadow_nodes, &node->private_list); > - __radix_tree_delete_node(&mapping->page_tree, node); > + > + if (dax_mapping(mapping)) { > + if (radix_tree_delete_item(&mapping->page_tree, index, entry)) > + mapping->nrdax--; > + } else { > + /* > + * Regular page slots are stabilized by the page lock even > + * without the tree itself locked. These unlocked entries > + * need verification under the tree lock. > + */ > + if (!__radix_tree_lookup(&mapping->page_tree, index, &node, > + &slot)) > + goto unlock; > + if (*slot != entry) > + goto unlock; > + radix_tree_replace_slot(slot, NULL); > + mapping->nrshadows--; > + if (!node) > + goto unlock; > + workingset_node_shadows_dec(node); > + /* > + * Don't track node without shadow entries. > + * > + * Avoid acquiring the list_lru lock if already untracked. > + * The list_empty() test is safe as node->private_list is > + * protected by mapping->tree_lock. 
> + */ > + if (!workingset_node_shadows(node) && > + !list_empty(&node->private_list)) > + list_lru_del(&workingset_shadow_nodes, > + &node->private_list); > + __radix_tree_delete_node(&mapping->page_tree, node); > + } > unlock: > spin_unlock_irq(&mapping->tree_lock); > } > @@ -228,7 +237,8 @@ void truncate_inode_pages_range(struct address_space *mapping, > int i; > > cleancache_invalidate_inode(mapping); > - if (mapping->nrpages == 0 && mapping->nrshadows == 0) > + if (mapping->nrpages == 0 && mapping->nrshadows == 0 && > + mapping->nrdax == 0) > return; > > /* Offsets within partial pages */ > @@ -423,7 +433,7 @@ void truncate_inode_pages_final(struct address_space *mapping) > smp_rmb(); > nrshadows = mapping->nrshadows; > > - if (nrpages || nrshadows) { > + if (nrpages || nrshadows || mapping->nrdax) { > /* > * As truncation uses a lockless tree lookup, cycle > * the tree lock to make sure any ongoing tree > diff --git a/mm/vmscan.c b/mm/vmscan.c > index 2aec424..8071956 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -46,6 +46,7 @@ > #include > #include > #include > +#include > > #include > #include > @@ -671,9 +672,15 @@ static int __remove_mapping(struct address_space *mapping, struct page *page, > * inode reclaim needs to empty out the radix tree or > * the nodes are lost. Don't plant shadows behind its > * back. > + * > + * We also don't store shadows for DAX mappings because the > + * only page cache pages found in these are zero pages > + * covering holes, and because we don't want to mix DAX > + * exceptional entries and shadow exceptional entries in the > + * same page_tree. > */ > if (reclaimed && page_is_file_cache(page) && > - !mapping_exiting(mapping)) > + !mapping_exiting(mapping) && !dax_mapping(mapping)) > shadow = workingset_eviction(mapping, page); > __delete_from_page_cache(page, shadow, memcg); > spin_unlock_irqrestore(&mapping->tree_lock, flags); > -- > 2.5.0 > > -- Jan Kara SUSE Labs, CR From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jan Kara Subject: Re: [PATCH v5 5/7] ext2: call dax_pfn_mkwrite() for DAX fsync/msync Date: Mon, 21 Dec 2015 18:32:02 +0100 Message-ID: <20151221173202.GB7030@quack.suse.cz> References: <1450502540-8744-1-git-send-email-ross.zwisler@linux.intel.com> <1450502540-8744-6-git-send-email-ross.zwisler@linux.intel.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: linux-kernel@vger.kernel.org, "H. Peter Anvin" , "J. Bruce Fields" , Theodore Ts'o , Alexander Viro , Andreas Dilger , Dave Chinner , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@lists.01.org, x86@kernel.org, xfs@oss.sgi.com, Andrew Morton , Dan Williams , Matthew Wilcox , Dave Hansen To: Ross Zwisler Return-path: Received: from mx2.suse.de ([195.135.220.15]:47152 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751369AbbLURcF (ORCPT ); Mon, 21 Dec 2015 12:32:05 -0500 Content-Disposition: inline In-Reply-To: <1450502540-8744-6-git-send-email-ross.zwisler@linux.intel.com> Sender: linux-ext4-owner@vger.kernel.org List-ID: On Fri 18-12-15 22:22:18, Ross Zwisler wrote: > To properly support the new DAX fsync/msync infrastructure filesystems > need to call dax_pfn_mkwrite() so that DAX can track when user pages are > dirtied. The patch looks good to me. 
You can add: Reviewed-by: Jan Kara Honza > > Signed-off-by: Ross Zwisler > --- > fs/ext2/file.c | 4 +++- > 1 file changed, 3 insertions(+), 1 deletion(-) > > diff --git a/fs/ext2/file.c b/fs/ext2/file.c > index 11a42c5..2c88d68 100644 > --- a/fs/ext2/file.c > +++ b/fs/ext2/file.c > @@ -102,8 +102,8 @@ static int ext2_dax_pfn_mkwrite(struct vm_area_struct *vma, > { > struct inode *inode = file_inode(vma->vm_file); > struct ext2_inode_info *ei = EXT2_I(inode); > - int ret = VM_FAULT_NOPAGE; > loff_t size; > + int ret; > > sb_start_pagefault(inode->i_sb); > file_update_time(vma->vm_file); > @@ -113,6 +113,8 @@ static int ext2_dax_pfn_mkwrite(struct vm_area_struct *vma, > size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT; > if (vmf->pgoff >= size) > ret = VM_FAULT_SIGBUS; > + else > + ret = dax_pfn_mkwrite(vma, vmf); > > up_read(&ei->dax_sem); > sb_end_pagefault(inode->i_sb); > -- > 2.5.0 > > -- Jan Kara SUSE Labs, CR From mboxrd@z Thu Jan 1 00:00:00 1970 From: Dan Williams Subject: Re: [PATCH v5 4/7] dax: add support for fsync/sync Date: Mon, 21 Dec 2015 09:49:01 -0800 Message-ID: References: <1450502540-8744-1-git-send-email-ross.zwisler@linux.intel.com> <1450502540-8744-5-git-send-email-ross.zwisler@linux.intel.com> <20151221170545.GA13494@linux.intel.com> Mime-Version: 1.0 Content-Type: text/plain; charset=UTF-8 To: Ross Zwisler , Dan Williams , "linux-kernel@vger.kernel.org" , "H. Peter Anvin" , "J. Bruce Fields" , "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Dave Chinner , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4 , linux-fsdevel , Linux MM , "linux-nvdimm@lists.01.org" , X86 ML , XFS Developers , Andrew Morton , Matthew Wilcox , Return-path: Received: from mail-oi0-f51.google.com ([209.85.218.51]:36000 "EHLO mail-oi0-f51.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751387AbbLURtD (ORCPT ); Mon, 21 Dec 2015 12:49:03 -0500 Received: by mail-oi0-f51.google.com with SMTP id o62so90590369oif.3 for ; Mon, 21 Dec 2015 09:49:01 -0800 (PST) In-Reply-To: <20151221170545.GA13494@linux.intel.com> Sender: linux-ext4-owner@vger.kernel.org List-ID: On Mon, Dec 21, 2015 at 9:05 AM, Ross Zwisler wrote: > On Sat, Dec 19, 2015 at 10:37:46AM -0800, Dan Williams wrote: >> On Fri, Dec 18, 2015 at 9:22 PM, Ross Zwisler >> wrote: [..] >> Hi Ross, I should have realized this sooner, but what guarantees that >> the address returned by RADIX_DAX_ADDR(entry) is still valid at this >> point? I think we need to store the sector in the radix tree and then >> perform a new dax_map_atomic() operation to either lookup a valid >> address or fail the sync request. Otherwise, if the device is gone >> we'll crash, or write into some other random vmalloc address space. > > Ah, good point, thank you. v4 of this series is based on a version of > DAX where we aren't properly dealing with PMEM device removal. I've got an > updated version that merges with your dax_map_atomic() changes, and I'll add > this change into v5 which I will send out today. Thank you for the > suggestion. > > One clarification, with the code as it is in v4 we are only doing > clflush/clflushopt/clwb instructions on the kaddr we've stored in the radix > tree, so I don't think that there is actually a risk of us doing a "write into > some other random vmalloc address space"? I think at worse we will end up > clflushing an address that either isn't mapped or has been remapped by someone > else. 
Or are you worried that the clflush would trigger a cache writeback to > a memory address where writes have side effects, thus triggering the side > effect? > > I definitely think it needs to be fixed, I'm just trying to make sure I > understood your comment. True, this would be flushing an address that was dirtied while valid. Should be ok in practice for now since dax is effectively limited to x86, but we should not be leaning on x86 details in an architecture generic implementation like this. From mboxrd@z Thu Jan 1 00:00:00 1970 From: Dan Williams Subject: Re: [PATCH v5 4/7] dax: add support for fsync/sync Date: Mon, 21 Dec 2015 11:27:35 -0800 Message-ID: References: <1450502540-8744-1-git-send-email-ross.zwisler@linux.intel.com> <1450502540-8744-5-git-send-email-ross.zwisler@linux.intel.com> <20151221170545.GA13494@linux.intel.com> Mime-Version: 1.0 Content-Type: text/plain; charset=UTF-8 To: Ross Zwisler , Dan Williams , "linux-kernel@vger.kernel.org" , "H. Peter Anvin" , "J. Bruce Fields" , "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Dave Chinner , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4 , linux-fsdevel , Linux MM , "linux-nvdimm@lists.01.org" , X86 ML , XFS Developers , Andrew Morton , Matthew Wilcox , Return-path: In-Reply-To: <20151221170545.GA13494@linux.intel.com> Sender: owner-linux-mm@kvack.org List-Id: linux-ext4.vger.kernel.org On Mon, Dec 21, 2015 at 9:05 AM, Ross Zwisler wrote: > On Sat, Dec 19, 2015 at 10:37:46AM -0800, Dan Williams wrote: >> On Fri, Dec 18, 2015 at 9:22 PM, Ross Zwisler >> wrote: >> > To properly handle fsync/msync in an efficient way DAX needs to track dirty >> > pages so it is able to flush them durably to media on demand. >> > >> > The tracking of dirty pages is done via the radix tree in struct >> > address_space. This radix tree is already used by the page writeback >> > infrastructure for tracking dirty pages associated with an open file, and >> > it already has support for exceptional (non struct page*) entries. We >> > build upon these features to add exceptional entries to the radix tree for >> > DAX dirty PMD or PTE pages at fault time. >> > >> > Signed-off-by: Ross Zwisler >> [..] >> > +static void dax_writeback_one(struct address_space *mapping, pgoff_t index, >> > + void *entry) >> > +{ >> > + struct radix_tree_root *page_tree = &mapping->page_tree; >> > + int type = RADIX_DAX_TYPE(entry); >> > + struct radix_tree_node *node; >> > + void **slot; >> > + >> > + if (type != RADIX_DAX_PTE && type != RADIX_DAX_PMD) { >> > + WARN_ON_ONCE(1); >> > + return; >> > + } >> > + >> > + spin_lock_irq(&mapping->tree_lock); >> > + /* >> > + * Regular page slots are stabilized by the page lock even >> > + * without the tree itself locked. These unlocked entries >> > + * need verification under the tree lock. 
>> > + */ >> > + if (!__radix_tree_lookup(page_tree, index, &node, &slot)) >> > + goto unlock; >> > + if (*slot != entry) >> > + goto unlock; >> > + >> > + /* another fsync thread may have already written back this entry */ >> > + if (!radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE)) >> > + goto unlock; >> > + >> > + radix_tree_tag_clear(page_tree, index, PAGECACHE_TAG_TOWRITE); >> > + >> > + if (type == RADIX_DAX_PMD) >> > + wb_cache_pmem(RADIX_DAX_ADDR(entry), PMD_SIZE); >> > + else >> > + wb_cache_pmem(RADIX_DAX_ADDR(entry), PAGE_SIZE); >> >> Hi Ross, I should have realized this sooner, but what guarantees that >> the address returned by RADIX_DAX_ADDR(entry) is still valid at this >> point? I think we need to store the sector in the radix tree and then >> perform a new dax_map_atomic() operation to either lookup a valid >> address or fail the sync request. Otherwise, if the device is gone >> we'll crash, or write into some other random vmalloc address space. > > Ah, good point, thank you. v4 of this series is based on a version of > DAX where we aren't properly dealing with PMEM device removal. I've got an > updated version that merges with your dax_map_atomic() changes, and I'll add > this change into v5 which I will send out today. Thank you for the > suggestion. To make the merge simpler you could skip the rebase for now and just call blk_queue_enter() / blk_queue_exit() around the calls to wb_cache_pmem. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Ross Zwisler Subject: Re: [PATCH v5 2/7] dax: support dirty DAX entries in radix tree Date: Tue, 22 Dec 2015 17:16:27 -0700 Message-ID: <20151223001627.GC24124@linux.intel.com> References: <1450502540-8744-1-git-send-email-ross.zwisler@linux.intel.com> <1450502540-8744-3-git-send-email-ross.zwisler@linux.intel.com> <20151222144605.08a84ded98a42d6125a7991e@linux-foundation.org> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Cc: linux-nvdimm@ml01.01.org, Dave Hansen , "J. Bruce Fields" , linux-mm@kvack.org, Andreas Dilger , "H. Peter Anvin" , Jeff Layton , Dan Williams , x86@kernel.org, Ingo Molnar , Matthew Wilcox , Ross Zwisler , linux-ext4@vger.kernel.org, xfs@oss.sgi.com, Alexander Viro , Thomas Gleixner , Theodore Ts'o , linux-kernel@vger.kernel.org, Jan Kara , linux-fsdevel@vger.kernel.org, Matthew Wilcox To: Andrew Morton Return-path: Content-Disposition: inline In-Reply-To: <20151222144605.08a84ded98a42d6125a7991e@linux-foundation.org> List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: xfs-bounces@oss.sgi.com Sender: xfs-bounces@oss.sgi.com List-Id: linux-ext4.vger.kernel.org On Tue, Dec 22, 2015 at 02:46:05PM -0800, Andrew Morton wrote: > On Fri, 18 Dec 2015 22:22:15 -0700 Ross Zwisler wrote: > > > Add support for tracking dirty DAX entries in the struct address_space > > radix tree. This tree is already used for dirty page writeback, and it > > already supports the use of exceptional (non struct page*) entries. > > > > In order to properly track dirty DAX pages we will insert new exceptional > > entries into the radix tree that represent dirty DAX PTE or PMD pages. > > These exceptional entries will also contain the writeback addresses for the > > PTE or PMD faults that we can use at fsync/msync time. 
> > > > There are currently two types of exceptional entries (shmem and shadow) > > that can be placed into the radix tree, and this adds a third. We rely on > > the fact that only one type of exceptional entry can be found in a given > > radix tree based on its usage. This happens for free with DAX vs shmem but > > we explicitly prevent shadow entries from being added to radix trees for > > DAX mappings. > > > > The only shadow entries that would be generated for DAX radix trees would > > be to track zero page mappings that were created for holes. These pages > > would receive minimal benefit from having shadow entries, and the choice > > to have only one type of exceptional entry in a given radix tree makes the > > logic simpler both in clear_exceptional_entry() and in the rest of DAX. > > > > > > ... > > > > --- a/include/linux/dax.h > > +++ b/include/linux/dax.h > > @@ -36,4 +36,9 @@ static inline bool vma_is_dax(struct vm_area_struct *vma) > > { > > return vma->vm_file && IS_DAX(vma->vm_file->f_mapping->host); > > } > > + > > +static inline bool dax_mapping(struct address_space *mapping) > > +{ > > + return mapping->host && IS_DAX(mapping->host); > > +} > > Can we make this evaluate to plain old "0" when CONFIG_FS_DAX=n? That > way a bunch of code in callers will fall away as well. > > If the compiler has any brains then a good way to do this would be to > make IS_DAX be "0" but one would need to check that the zeroness > properly propagated out of the inline. Ah, it already works that way due to some magic with IS_DAX(). I believe we already use the fact that blocks protected by IS_DAX() go away if CONFIG_FS_DAX isn't set. The trick is that S_DAX is defined to be 0 if CONFIG_FS_DAX isn't set. I'm pretty sure this is working because of the code in filemap_write_and_wait_range(). I added a block with the later "dax: add support for fsync/msync" patch which looks like this: @@ -482,6 +482,9 @@ int filemap_write_and_wait_range(struct address_space *mapping, { int err = 0; + if (dax_mapping(mapping) && mapping->nrdax) + dax_writeback_mapping_range(mapping, lstart, lend); + Without the dax_mapping() check there the behavior is the same, but we fail to compile if CONFIG_FS_DAX isn't set because dax_writeback_mapping_range() isn't defined. (Guess how I found that out. :) ) > > #endif > > diff --git a/include/linux/fs.h b/include/linux/fs.h > > index 3aa5142..b9ac534 100644 > > --- a/include/linux/fs.h > > +++ b/include/linux/fs.h > > @@ -433,6 +433,7 @@ struct address_space { > > /* Protected by tree_lock together with the radix tree */ > > unsigned long nrpages; /* number of total pages */ > > unsigned long nrshadows; /* number of shadow entries */ > > + unsigned long nrdax; /* number of DAX entries */ > > hm, that's unfortunate - machines commonly carry tremendous numbers of > address_spaces in memory and adding pork to them is rather a big deal. > We can't avoid this somehow? Maybe share the space with nrshadows by > some means? Find some other field which is unused for dax files? Jan Kara noticed the same thing: https://lists.01.org/pipermail/linux-nvdimm/2015-December/003626.html It'll be fixed in the next spin of the patch set. 
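(The consolidation Jan is pointing to would look roughly like this; a sketch of the idea, not the actual follow-up patch:

	--- a/include/linux/fs.h
	+++ b/include/linux/fs.h
	@@ struct address_space {
	 	/* Protected by tree_lock together with the radix tree */
	 	unsigned long		nrpages;	/* number of total pages */
	-	unsigned long		nrshadows;	/* number of shadow entries */
	-	unsigned long		nrdax;		/* number of DAX entries */
	+	unsigned long		nrexceptional;	/* shadow or DAX entries */

Because a given radix tree only ever holds one kind of exceptional entry, a single counter is sufficient and struct address_space does not grow.)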
_______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from relay.sgi.com (relay3.corp.sgi.com [198.149.34.15]) by oss.sgi.com (Postfix) with ESMTP id DBA8A7F51 for ; Fri, 18 Dec 2015 23:22:33 -0600 (CST) Received: from cuda.sgi.com (cuda2.sgi.com [192.48.176.25]) by relay3.corp.sgi.com (Postfix) with ESMTP id 7CAE6AC003 for ; Fri, 18 Dec 2015 21:22:30 -0800 (PST) Received: from mga09.intel.com (mga09.intel.com [134.134.136.24]) by cuda.sgi.com with ESMTP id GKFOAF5gCHHUSIRH for ; Fri, 18 Dec 2015 21:22:28 -0800 (PST) From: Ross Zwisler Subject: [PATCH v5 0/7] DAX fsync/msync support Date: Fri, 18 Dec 2015 22:22:13 -0700 Message-Id: <1450502540-8744-1-git-send-email-ross.zwisler@linux.intel.com> List-Id: XFS Filesystem from SGI List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: xfs-bounces@oss.sgi.com Sender: xfs-bounces@oss.sgi.com To: linux-kernel@vger.kernel.org Cc: Dave Hansen , "J. Bruce Fields" , linux-mm@kvack.org, Andreas Dilger , "H. Peter Anvin" , Jeff Layton , Dan Williams , linux-nvdimm@lists.01.org, x86@kernel.org, Ingo Molnar , Matthew Wilcox , Ross Zwisler , linux-ext4@vger.kernel.org, xfs@oss.sgi.com, Alexander Viro , Thomas Gleixner , Theodore Ts'o , Jan Kara , linux-fsdevel@vger.kernel.org, Andrew Morton , Matthew Wilcox Changes from v4: - Explicity prevent shadow entries from being added to radix trees for DAX mappings in patch 2. The only shadow entries that would be generated for DAX radix trees would be to track zero page mappings that were created for holes. These pages would receive minimal benefit from having shadow entries, and the choice to have only one type of exceptional entry in a given radix tree makes the logic simpler both in clear_exceptional_entry() and in the rest of DAX. (Jan) - Added Reviewed-by from Jan to patch 3. This series is built upon ext4/master. 
A working tree with this series applied can be found here: https://git.kernel.org/cgit/linux/kernel/git/zwisler/linux.git/log/?h=fsync_v5 Ross Zwisler (7): pmem: add wb_cache_pmem() to the PMEM API dax: support dirty DAX entries in radix tree mm: add find_get_entries_tag() dax: add support for fsync/sync ext2: call dax_pfn_mkwrite() for DAX fsync/msync ext4: call dax_pfn_mkwrite() for DAX fsync/msync xfs: call dax_pfn_mkwrite() for DAX fsync/msync arch/x86/include/asm/pmem.h | 11 +-- fs/block_dev.c | 3 +- fs/dax.c | 159 ++++++++++++++++++++++++++++++++++++++++++-- fs/ext2/file.c | 4 +- fs/ext4/file.c | 4 +- fs/inode.c | 1 + fs/xfs/xfs_file.c | 7 +- include/linux/dax.h | 7 ++ include/linux/fs.h | 1 + include/linux/pagemap.h | 3 + include/linux/pmem.h | 22 +++++- include/linux/radix-tree.h | 9 +++ mm/filemap.c | 84 ++++++++++++++++++++++- mm/truncate.c | 64 ++++++++++-------- mm/vmscan.c | 9 ++- 15 files changed, 339 insertions(+), 49 deletions(-) -- 2.5.0 _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from relay.sgi.com (relay1.corp.sgi.com [137.38.102.111]) by oss.sgi.com (Postfix) with ESMTP id C241D7F53 for ; Fri, 18 Dec 2015 23:22:34 -0600 (CST) Received: from cuda.sgi.com (cuda3.sgi.com [192.48.176.15]) by relay1.corp.sgi.com (Postfix) with ESMTP id B26B18F806F for ; Fri, 18 Dec 2015 21:22:34 -0800 (PST) Received: from mga09.intel.com (mga09.intel.com [134.134.136.24]) by cuda.sgi.com with ESMTP id SrO5jZyBpxDxIK1p for ; Fri, 18 Dec 2015 21:22:33 -0800 (PST) From: Ross Zwisler Subject: [PATCH v5 3/7] mm: add find_get_entries_tag() Date: Fri, 18 Dec 2015 22:22:16 -0700 Message-Id: <1450502540-8744-4-git-send-email-ross.zwisler@linux.intel.com> In-Reply-To: <1450502540-8744-1-git-send-email-ross.zwisler@linux.intel.com> References: <1450502540-8744-1-git-send-email-ross.zwisler@linux.intel.com> List-Id: XFS Filesystem from SGI List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: xfs-bounces@oss.sgi.com Sender: xfs-bounces@oss.sgi.com To: linux-kernel@vger.kernel.org Cc: Dave Hansen , "J. Bruce Fields" , linux-mm@kvack.org, Andreas Dilger , "H. Peter Anvin" , Jeff Layton , Dan Williams , linux-nvdimm@lists.01.org, x86@kernel.org, Ingo Molnar , Matthew Wilcox , Ross Zwisler , linux-ext4@vger.kernel.org, xfs@oss.sgi.com, Alexander Viro , Thomas Gleixner , Theodore Ts'o , Jan Kara , linux-fsdevel@vger.kernel.org, Andrew Morton , Matthew Wilcox Add find_get_entries_tag() to the family of functions that include find_get_entries(), find_get_pages() and find_get_pages_tag(). This is needed for DAX dirty page handling because we need a list of both page offsets and radix tree entries ('indices' and 'entries' in this function) that are marked with the PAGECACHE_TAG_TOWRITE tag. 
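(A typical caller of this interface, sketched from the way patch 4/7 in this series uses it, loops until no tagged entries remain:

	pgoff_t indices[PAGEVEC_SIZE];
	struct pagevec pvec;
	unsigned i;

	pagevec_init(&pvec, 0);
	while (1) {
		pvec.nr = find_get_entries_tag(mapping, start_page,
				PAGECACHE_TAG_TOWRITE, PAGEVEC_SIZE,
				pvec.pages, indices);
		if (pvec.nr == 0)
			break;
		for (i = 0; i < pvec.nr; i++)
			handle_one_entry(mapping, indices[i], pvec.pages[i]);
	}

Here handle_one_entry() is only a stand-in for the real per-entry work (dax_writeback_one() in patch 4/7); the loop terminates because that work clears the TOWRITE tag on every entry it handles.)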
Signed-off-by: Ross Zwisler Reviewed-by: Jan Kara --- include/linux/pagemap.h | 3 +++ mm/filemap.c | 68 +++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 71 insertions(+) diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h index 26eabf5..4db0425 100644 --- a/include/linux/pagemap.h +++ b/include/linux/pagemap.h @@ -361,6 +361,9 @@ unsigned find_get_pages_contig(struct address_space *mapping, pgoff_t start, unsigned int nr_pages, struct page **pages); unsigned find_get_pages_tag(struct address_space *mapping, pgoff_t *index, int tag, unsigned int nr_pages, struct page **pages); +unsigned find_get_entries_tag(struct address_space *mapping, pgoff_t start, + int tag, unsigned int nr_entries, + struct page **entries, pgoff_t *indices); struct page *grab_cache_page_write_begin(struct address_space *mapping, pgoff_t index, unsigned flags); diff --git a/mm/filemap.c b/mm/filemap.c index 167a4d9..99dfbc9 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -1498,6 +1498,74 @@ repeat: } EXPORT_SYMBOL(find_get_pages_tag); +/** + * find_get_entries_tag - find and return entries that match @tag + * @mapping: the address_space to search + * @start: the starting page cache index + * @tag: the tag index + * @nr_entries: the maximum number of entries + * @entries: where the resulting entries are placed + * @indices: the cache indices corresponding to the entries in @entries + * + * Like find_get_entries, except we only return entries which are tagged with + * @tag. + */ +unsigned find_get_entries_tag(struct address_space *mapping, pgoff_t start, + int tag, unsigned int nr_entries, + struct page **entries, pgoff_t *indices) +{ + void **slot; + unsigned int ret = 0; + struct radix_tree_iter iter; + + if (!nr_entries) + return 0; + + rcu_read_lock(); +restart: + radix_tree_for_each_tagged(slot, &mapping->page_tree, + &iter, start, tag) { + struct page *page; +repeat: + page = radix_tree_deref_slot(slot); + if (unlikely(!page)) + continue; + if (radix_tree_exception(page)) { + if (radix_tree_deref_retry(page)) { + /* + * Transient condition which can only trigger + * when entry at index 0 moves out of or back + * to root: none yet gotten, safe to restart. + */ + goto restart; + } + + /* + * A shadow entry of a recently evicted page, a swap + * entry from shmem/tmpfs or a DAX entry. Return it + * without attempting to raise page count. + */ + goto export; + } + if (!page_cache_get_speculative(page)) + goto repeat; + + /* Has the page moved? */ + if (unlikely(page != *slot)) { + page_cache_release(page); + goto repeat; + } +export: + indices[ret] = iter.index; + entries[ret] = page; + if (++ret == nr_entries) + break; + } + rcu_read_unlock(); + return ret; +} +EXPORT_SYMBOL(find_get_entries_tag); + /* * CD/DVDs are error prone. When a medium error occurs, the driver may fail * a _large_ part of the i/o request. 
Imagine the worst scenario: -- 2.5.0 _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from relay.sgi.com (relay2.corp.sgi.com [137.38.102.29]) by oss.sgi.com (Postfix) with ESMTP id 5BAD27F51 for ; Fri, 18 Dec 2015 23:22:36 -0600 (CST) Received: from cuda.sgi.com (cuda3.sgi.com [192.48.176.15]) by relay2.corp.sgi.com (Postfix) with ESMTP id 4B3D6304059 for ; Fri, 18 Dec 2015 21:22:36 -0800 (PST) Received: from mga09.intel.com (mga09.intel.com [134.134.136.24]) by cuda.sgi.com with ESMTP id hk2YhpVXQvAlicxM for ; Fri, 18 Dec 2015 21:22:34 -0800 (PST) From: Ross Zwisler Subject: [PATCH v5 4/7] dax: add support for fsync/sync Date: Fri, 18 Dec 2015 22:22:17 -0700 Message-Id: <1450502540-8744-5-git-send-email-ross.zwisler@linux.intel.com> In-Reply-To: <1450502540-8744-1-git-send-email-ross.zwisler@linux.intel.com> References: <1450502540-8744-1-git-send-email-ross.zwisler@linux.intel.com> List-Id: XFS Filesystem from SGI List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: xfs-bounces@oss.sgi.com Sender: xfs-bounces@oss.sgi.com To: linux-kernel@vger.kernel.org Cc: Dave Hansen , "J. Bruce Fields" , linux-mm@kvack.org, Andreas Dilger , "H. Peter Anvin" , Jeff Layton , Dan Williams , linux-nvdimm@lists.01.org, x86@kernel.org, Ingo Molnar , Matthew Wilcox , Ross Zwisler , linux-ext4@vger.kernel.org, xfs@oss.sgi.com, Alexander Viro , Thomas Gleixner , Theodore Ts'o , Jan Kara , linux-fsdevel@vger.kernel.org, Andrew Morton , Matthew Wilcox To properly handle fsync/msync in an efficient way DAX needs to track dirty pages so it is able to flush them durably to media on demand. The tracking of dirty pages is done via the radix tree in struct address_space. This radix tree is already used by the page writeback infrastructure for tracking dirty pages associated with an open file, and it already has support for exceptional (non struct page*) entries. We build upon these features to add exceptional entries to the radix tree for DAX dirty PMD or PTE pages at fault time. Signed-off-by: Ross Zwisler --- fs/dax.c | 159 ++++++++++++++++++++++++++++++++++++++++++++++++++-- include/linux/dax.h | 2 + mm/filemap.c | 3 + 3 files changed, 158 insertions(+), 6 deletions(-) diff --git a/fs/dax.c b/fs/dax.c index 43671b6..19347cf 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -24,6 +24,7 @@ #include #include #include +#include #include #include #include @@ -289,6 +290,143 @@ static int copy_user_bh(struct page *to, struct buffer_head *bh, return 0; } +static int dax_radix_entry(struct address_space *mapping, pgoff_t index, + void __pmem *addr, bool pmd_entry, bool dirty) +{ + struct radix_tree_root *page_tree = &mapping->page_tree; + int error = 0; + void *entry; + + __mark_inode_dirty(mapping->host, I_DIRTY_PAGES); + + spin_lock_irq(&mapping->tree_lock); + entry = radix_tree_lookup(page_tree, index); + + if (entry) { + if (!pmd_entry || RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD) + goto dirty; + radix_tree_delete(&mapping->page_tree, index); + mapping->nrdax--; + } + + if (!addr) { + /* + * This can happen during correct operation if our pfn_mkwrite + * fault raced against a hole punch operation. 
If this + * happens the pte that was hole punched will have been + * unmapped and the radix tree entry will have been removed by + * the time we are called, but the call will still happen. We + * will return all the way up to wp_pfn_shared(), where the + * pte_same() check will fail, eventually causing page fault + * to be retried by the CPU. + */ + goto unlock; + } else if (RADIX_DAX_TYPE(addr)) { + WARN_ONCE(1, "%s: invalid address %p\n", __func__, addr); + goto unlock; + } + + error = radix_tree_insert(page_tree, index, + RADIX_DAX_ENTRY(addr, pmd_entry)); + if (error) + goto unlock; + + mapping->nrdax++; + dirty: + if (dirty) + radix_tree_tag_set(page_tree, index, PAGECACHE_TAG_DIRTY); + unlock: + spin_unlock_irq(&mapping->tree_lock); + return error; +} + +static void dax_writeback_one(struct address_space *mapping, pgoff_t index, + void *entry) +{ + struct radix_tree_root *page_tree = &mapping->page_tree; + int type = RADIX_DAX_TYPE(entry); + struct radix_tree_node *node; + void **slot; + + if (type != RADIX_DAX_PTE && type != RADIX_DAX_PMD) { + WARN_ON_ONCE(1); + return; + } + + spin_lock_irq(&mapping->tree_lock); + /* + * Regular page slots are stabilized by the page lock even + * without the tree itself locked. These unlocked entries + * need verification under the tree lock. + */ + if (!__radix_tree_lookup(page_tree, index, &node, &slot)) + goto unlock; + if (*slot != entry) + goto unlock; + + /* another fsync thread may have already written back this entry */ + if (!radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE)) + goto unlock; + + radix_tree_tag_clear(page_tree, index, PAGECACHE_TAG_TOWRITE); + + if (type == RADIX_DAX_PMD) + wb_cache_pmem(RADIX_DAX_ADDR(entry), PMD_SIZE); + else + wb_cache_pmem(RADIX_DAX_ADDR(entry), PAGE_SIZE); + unlock: + spin_unlock_irq(&mapping->tree_lock); +} + +/* + * Flush the mapping to the persistent domain within the byte range of [start, + * end]. This is required by data integrity operations to ensure file data is + * on persistent storage prior to completion of the operation. 
+ */ +void dax_writeback_mapping_range(struct address_space *mapping, loff_t start, + loff_t end) +{ + struct inode *inode = mapping->host; + pgoff_t indices[PAGEVEC_SIZE]; + pgoff_t start_page, end_page; + struct pagevec pvec; + void *entry; + int i; + + if (inode->i_blkbits != PAGE_SHIFT) { + WARN_ON_ONCE(1); + return; + } + + rcu_read_lock(); + entry = radix_tree_lookup(&mapping->page_tree, start & PMD_MASK); + rcu_read_unlock(); + + /* see if the start of our range is covered by a PMD entry */ + if (entry && RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD) + start &= PMD_MASK; + + start_page = start >> PAGE_CACHE_SHIFT; + end_page = end >> PAGE_CACHE_SHIFT; + + tag_pages_for_writeback(mapping, start_page, end_page); + + pagevec_init(&pvec, 0); + while (1) { + pvec.nr = find_get_entries_tag(mapping, start_page, + PAGECACHE_TAG_TOWRITE, PAGEVEC_SIZE, + pvec.pages, indices); + + if (pvec.nr == 0) + break; + + for (i = 0; i < pvec.nr; i++) + dax_writeback_one(mapping, indices[i], pvec.pages[i]); + } + wmb_pmem(); +} +EXPORT_SYMBOL_GPL(dax_writeback_mapping_range); + static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh, struct vm_area_struct *vma, struct vm_fault *vmf) { @@ -329,7 +467,11 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh, } error = vm_insert_mixed(vma, vaddr, pfn); + if (error) + goto out; + error = dax_radix_entry(mapping, vmf->pgoff, addr, false, + vmf->flags & FAULT_FLAG_WRITE); out: i_mmap_unlock_read(mapping); @@ -452,6 +594,7 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf, delete_from_page_cache(page); unlock_page(page); page_cache_release(page); + page = NULL; } /* @@ -539,7 +682,7 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address, pgoff_t size, pgoff; sector_t block, sector; unsigned long pfn; - int result = 0; + int error, result = 0; /* dax pmd mappings are broken wrt gup and fork */ if (!IS_ENABLED(CONFIG_FS_DAX_PMD)) @@ -651,6 +794,13 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address, } result |= vmf_insert_pfn_pmd(vma, address, pmd, pfn, write); + + if (write) { + error = dax_radix_entry(mapping, pgoff, kaddr, true, + true); + if (error) + goto fallback; + } } out: @@ -702,15 +852,12 @@ EXPORT_SYMBOL_GPL(dax_pmd_fault); * dax_pfn_mkwrite - handle first write to DAX page * @vma: The virtual memory area where the fault occurred * @vmf: The description of the fault - * */ int dax_pfn_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) { - struct super_block *sb = file_inode(vma->vm_file)->i_sb; + struct file *file = vma->vm_file; - sb_start_pagefault(sb); - file_update_time(vma->vm_file); - sb_end_pagefault(sb); + dax_radix_entry(file->f_mapping, vmf->pgoff, NULL, false, true); return VM_FAULT_NOPAGE; } EXPORT_SYMBOL_GPL(dax_pfn_mkwrite); diff --git a/include/linux/dax.h b/include/linux/dax.h index e9d57f68..11eb183 100644 --- a/include/linux/dax.h +++ b/include/linux/dax.h @@ -41,4 +41,6 @@ static inline bool dax_mapping(struct address_space *mapping) { return mapping->host && IS_DAX(mapping->host); } +void dax_writeback_mapping_range(struct address_space *mapping, loff_t start, + loff_t end); #endif diff --git a/mm/filemap.c b/mm/filemap.c index 99dfbc9..9577783 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -482,6 +482,9 @@ int filemap_write_and_wait_range(struct address_space *mapping, { int err = 0; + if (dax_mapping(mapping) && mapping->nrdax) + dax_writeback_mapping_range(mapping, lstart, lend); + if (mapping->nrpages) { err = 
__filemap_fdatawrite_range(mapping, lstart, lend, WB_SYNC_ALL); -- 2.5.0 _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from relay.sgi.com (relay1.corp.sgi.com [137.38.102.111]) by oss.sgi.com (Postfix) with ESMTP id 5489D7F52 for ; Fri, 18 Dec 2015 23:22:34 -0600 (CST) Received: from cuda.sgi.com (cuda2.sgi.com [192.48.176.25]) by relay1.corp.sgi.com (Postfix) with ESMTP id 3729E8F806F for ; Fri, 18 Dec 2015 21:22:31 -0800 (PST) Received: from mga09.intel.com (mga09.intel.com [134.134.136.24]) by cuda.sgi.com with ESMTP id VRTukCxGukWGwbKn for ; Fri, 18 Dec 2015 21:22:29 -0800 (PST) From: Ross Zwisler Subject: [PATCH v5 1/7] pmem: add wb_cache_pmem() to the PMEM API Date: Fri, 18 Dec 2015 22:22:14 -0700 Message-Id: <1450502540-8744-2-git-send-email-ross.zwisler@linux.intel.com> In-Reply-To: <1450502540-8744-1-git-send-email-ross.zwisler@linux.intel.com> References: <1450502540-8744-1-git-send-email-ross.zwisler@linux.intel.com> List-Id: XFS Filesystem from SGI List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: xfs-bounces@oss.sgi.com Sender: xfs-bounces@oss.sgi.com To: linux-kernel@vger.kernel.org Cc: Dave Hansen , "J. Bruce Fields" , linux-mm@kvack.org, Andreas Dilger , "H. Peter Anvin" , Jeff Layton , Dan Williams , linux-nvdimm@lists.01.org, x86@kernel.org, Ingo Molnar , Matthew Wilcox , Ross Zwisler , linux-ext4@vger.kernel.org, xfs@oss.sgi.com, Alexander Viro , Thomas Gleixner , Theodore Ts'o , Jan Kara , linux-fsdevel@vger.kernel.org, Andrew Morton , Matthew Wilcox The function __arch_wb_cache_pmem() was already an internal implementation detail of the x86 PMEM API, but this functionality needs to be exported as part of the general PMEM API to handle the fsync/msync case for DAX mmaps. One thing worth noting is that we really do want this to be part of the PMEM API as opposed to a stand-alone function like clflush_cache_range() because of ordering restrictions. By having wb_cache_pmem() as part of the PMEM API we can leave it unordered, call it multiple times to write back large amounts of memory, and then order the multiple calls with a single wmb_pmem(). Signed-off-by: Ross Zwisler --- arch/x86/include/asm/pmem.h | 11 ++++++----- include/linux/pmem.h | 22 +++++++++++++++++++++- 2 files changed, 27 insertions(+), 6 deletions(-) diff --git a/arch/x86/include/asm/pmem.h b/arch/x86/include/asm/pmem.h index d8ce3ec..6c7ade0 100644 --- a/arch/x86/include/asm/pmem.h +++ b/arch/x86/include/asm/pmem.h @@ -67,18 +67,19 @@ static inline void arch_wmb_pmem(void) } /** - * __arch_wb_cache_pmem - write back a cache range with CLWB + * arch_wb_cache_pmem - write back a cache range with CLWB * @vaddr: virtual start address * @size: number of bytes to write back * * Write back a cache range using the CLWB (cache line write back) * instruction. This function requires explicit ordering with an - * arch_wmb_pmem() call. This API is internal to the x86 PMEM implementation. + * arch_wmb_pmem() call. 
*/ -static inline void __arch_wb_cache_pmem(void *vaddr, size_t size) +static inline void arch_wb_cache_pmem(void __pmem *addr, size_t size) { u16 x86_clflush_size = boot_cpu_data.x86_clflush_size; unsigned long clflush_mask = x86_clflush_size - 1; + void *vaddr = (void __force *)addr; void *vend = vaddr + size; void *p; @@ -115,7 +116,7 @@ static inline size_t arch_copy_from_iter_pmem(void __pmem *addr, size_t bytes, len = copy_from_iter_nocache(vaddr, bytes, i); if (__iter_needs_pmem_wb(i)) - __arch_wb_cache_pmem(vaddr, bytes); + arch_wb_cache_pmem(addr, bytes); return len; } @@ -138,7 +139,7 @@ static inline void arch_clear_pmem(void __pmem *addr, size_t size) else memset(vaddr, 0, size); - __arch_wb_cache_pmem(vaddr, size); + arch_wb_cache_pmem(addr, size); } static inline bool __arch_has_wmb_pmem(void) diff --git a/include/linux/pmem.h b/include/linux/pmem.h index acfea8c..7c3d11a 100644 --- a/include/linux/pmem.h +++ b/include/linux/pmem.h @@ -53,12 +53,18 @@ static inline void arch_clear_pmem(void __pmem *addr, size_t size) { BUG(); } + +static inline void arch_wb_cache_pmem(void __pmem *addr, size_t size) +{ + BUG(); +} #endif /* * Architectures that define ARCH_HAS_PMEM_API must provide * implementations for arch_memcpy_to_pmem(), arch_wmb_pmem(), - * arch_copy_from_iter_pmem(), arch_clear_pmem() and arch_has_wmb_pmem(). + * arch_copy_from_iter_pmem(), arch_clear_pmem(), arch_wb_cache_pmem() + * and arch_has_wmb_pmem(). */ static inline void memcpy_from_pmem(void *dst, void __pmem const *src, size_t size) { @@ -178,4 +184,18 @@ static inline void clear_pmem(void __pmem *addr, size_t size) else default_clear_pmem(addr, size); } + +/** + * wb_cache_pmem - write back processor cache for PMEM memory range + * @addr: virtual start address + * @size: number of bytes to write back + * + * Write back the processor cache range starting at 'addr' for 'size' bytes. + * This function requires explicit ordering with a wmb_pmem() call. + */ +static inline void wb_cache_pmem(void __pmem *addr, size_t size) +{ + if (arch_has_pmem_api()) + arch_wb_cache_pmem(addr, size); +} #endif /* __PMEM_H__ */ -- 2.5.0 _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from relay.sgi.com (relay1.corp.sgi.com [137.38.102.111]) by oss.sgi.com (Postfix) with ESMTP id D5D837F5A for ; Fri, 18 Dec 2015 23:22:39 -0600 (CST) Received: from cuda.sgi.com (cuda2.sgi.com [192.48.176.25]) by relay1.corp.sgi.com (Postfix) with ESMTP id C738C8F8065 for ; Fri, 18 Dec 2015 21:22:39 -0800 (PST) Received: from mga09.intel.com (mga09.intel.com [134.134.136.24]) by cuda.sgi.com with ESMTP id 7Nr72cFVBqLo6p0Z for ; Fri, 18 Dec 2015 21:22:37 -0800 (PST) From: Ross Zwisler Subject: [PATCH v5 6/7] ext4: call dax_pfn_mkwrite() for DAX fsync/msync Date: Fri, 18 Dec 2015 22:22:19 -0700 Message-Id: <1450502540-8744-7-git-send-email-ross.zwisler@linux.intel.com> In-Reply-To: <1450502540-8744-1-git-send-email-ross.zwisler@linux.intel.com> References: <1450502540-8744-1-git-send-email-ross.zwisler@linux.intel.com> List-Id: XFS Filesystem from SGI List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: xfs-bounces@oss.sgi.com Sender: xfs-bounces@oss.sgi.com To: linux-kernel@vger.kernel.org Cc: Dave Hansen , "J. Bruce Fields" , linux-mm@kvack.org, Andreas Dilger , "H. 
Peter Anvin" , Jeff Layton , Dan Williams , linux-nvdimm@lists.01.org, x86@kernel.org, Ingo Molnar , Matthew Wilcox , Ross Zwisler , linux-ext4@vger.kernel.org, xfs@oss.sgi.com, Alexander Viro , Thomas Gleixner , Theodore Ts'o , Jan Kara , linux-fsdevel@vger.kernel.org, Andrew Morton , Matthew Wilcox To properly support the new DAX fsync/msync infrastructure filesystems need to call dax_pfn_mkwrite() so that DAX can track when user pages are dirtied. Signed-off-by: Ross Zwisler --- fs/ext4/file.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/fs/ext4/file.c b/fs/ext4/file.c index 749b222..8c8965c 100644 --- a/fs/ext4/file.c +++ b/fs/ext4/file.c @@ -291,8 +291,8 @@ static int ext4_dax_pfn_mkwrite(struct vm_area_struct *vma, { struct inode *inode = file_inode(vma->vm_file); struct super_block *sb = inode->i_sb; - int ret = VM_FAULT_NOPAGE; loff_t size; + int ret; sb_start_pagefault(sb); file_update_time(vma->vm_file); @@ -300,6 +300,8 @@ static int ext4_dax_pfn_mkwrite(struct vm_area_struct *vma, size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT; if (vmf->pgoff >= size) ret = VM_FAULT_SIGBUS; + else + ret = dax_pfn_mkwrite(vma, vmf); up_read(&EXT4_I(inode)->i_mmap_sem); sb_end_pagefault(sb); -- 2.5.0 _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from relay.sgi.com (relay1.corp.sgi.com [137.38.102.111]) by oss.sgi.com (Postfix) with ESMTP id DBFC029E03 for ; Fri, 18 Dec 2015 23:22:42 -0600 (CST) Received: from cuda.sgi.com (cuda1.sgi.com [192.48.157.11]) by relay1.corp.sgi.com (Postfix) with ESMTP id BC3A48F8068 for ; Fri, 18 Dec 2015 21:22:42 -0800 (PST) Received: from mga09.intel.com (mga09.intel.com [134.134.136.24]) by cuda.sgi.com with ESMTP id KmxdSNLxLdEmHmPK for ; Fri, 18 Dec 2015 21:22:39 -0800 (PST) From: Ross Zwisler Subject: [PATCH v5 7/7] xfs: call dax_pfn_mkwrite() for DAX fsync/msync Date: Fri, 18 Dec 2015 22:22:20 -0700 Message-Id: <1450502540-8744-8-git-send-email-ross.zwisler@linux.intel.com> In-Reply-To: <1450502540-8744-1-git-send-email-ross.zwisler@linux.intel.com> References: <1450502540-8744-1-git-send-email-ross.zwisler@linux.intel.com> List-Id: XFS Filesystem from SGI List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: xfs-bounces@oss.sgi.com Sender: xfs-bounces@oss.sgi.com To: linux-kernel@vger.kernel.org Cc: Dave Hansen , "J. Bruce Fields" , linux-mm@kvack.org, Andreas Dilger , "H. Peter Anvin" , Jeff Layton , Dan Williams , linux-nvdimm@lists.01.org, x86@kernel.org, Ingo Molnar , Matthew Wilcox , Ross Zwisler , linux-ext4@vger.kernel.org, xfs@oss.sgi.com, Alexander Viro , Thomas Gleixner , Theodore Ts'o , Jan Kara , linux-fsdevel@vger.kernel.org, Andrew Morton , Matthew Wilcox To properly support the new DAX fsync/msync infrastructure filesystems need to call dax_pfn_mkwrite() so that DAX can track when user pages are dirtied. Signed-off-by: Ross Zwisler --- fs/xfs/xfs_file.c | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c index f5392ab..40ffbb1 100644 --- a/fs/xfs/xfs_file.c +++ b/fs/xfs/xfs_file.c @@ -1603,9 +1603,8 @@ xfs_filemap_pmd_fault( /* * pfn_mkwrite was originally inteneded to ensure we capture time stamp * updates on write faults. 
In reality, it's need to serialise against - * truncate similar to page_mkwrite. Hence we open-code dax_pfn_mkwrite() - * here and cycle the XFS_MMAPLOCK_SHARED to ensure we serialise the fault - * barrier in place. + * truncate similar to page_mkwrite. Hence we cycle the XFS_MMAPLOCK_SHARED + * to ensure we serialise the fault barrier in place. */ static int xfs_filemap_pfn_mkwrite( @@ -1628,6 +1627,8 @@ xfs_filemap_pfn_mkwrite( size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT; if (vmf->pgoff >= size) ret = VM_FAULT_SIGBUS; + else if (IS_DAX(inode)) + ret = dax_pfn_mkwrite(vma, vmf); xfs_iunlock(ip, XFS_MMAPLOCK_SHARED); sb_end_pagefault(inode->i_sb); return ret; -- 2.5.0 _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from relay.sgi.com (relay3.corp.sgi.com [198.149.34.15]) by oss.sgi.com (Postfix) with ESMTP id D410F29DFE for ; Fri, 18 Dec 2015 23:22:42 -0600 (CST) Received: from cuda.sgi.com (cuda1.sgi.com [192.48.157.11]) by relay3.corp.sgi.com (Postfix) with ESMTP id 7F238AC005 for ; Fri, 18 Dec 2015 21:22:42 -0800 (PST) Received: from mga09.intel.com (mga09.intel.com [134.134.136.24]) by cuda.sgi.com with ESMTP id euiGA44skiPuqqM3 for ; Fri, 18 Dec 2015 21:22:31 -0800 (PST) From: Ross Zwisler Subject: [PATCH v5 2/7] dax: support dirty DAX entries in radix tree Date: Fri, 18 Dec 2015 22:22:15 -0700 Message-Id: <1450502540-8744-3-git-send-email-ross.zwisler@linux.intel.com> In-Reply-To: <1450502540-8744-1-git-send-email-ross.zwisler@linux.intel.com> References: <1450502540-8744-1-git-send-email-ross.zwisler@linux.intel.com> List-Id: XFS Filesystem from SGI List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: xfs-bounces@oss.sgi.com Sender: xfs-bounces@oss.sgi.com To: linux-kernel@vger.kernel.org Cc: Dave Hansen , "J. Bruce Fields" , linux-mm@kvack.org, Andreas Dilger , "H. Peter Anvin" , Jeff Layton , Dan Williams , linux-nvdimm@lists.01.org, x86@kernel.org, Ingo Molnar , Matthew Wilcox , Ross Zwisler , linux-ext4@vger.kernel.org, xfs@oss.sgi.com, Alexander Viro , Thomas Gleixner , Theodore Ts'o , Jan Kara , linux-fsdevel@vger.kernel.org, Andrew Morton , Matthew Wilcox Add support for tracking dirty DAX entries in the struct address_space radix tree. This tree is already used for dirty page writeback, and it already supports the use of exceptional (non struct page*) entries. In order to properly track dirty DAX pages we will insert new exceptional entries into the radix tree that represent dirty DAX PTE or PMD pages. These exceptional entries will also contain the writeback addresses for the PTE or PMD faults that we can use at fsync/msync time. There are currently two types of exceptional entries (shmem and shadow) that can be placed into the radix tree, and this adds a third. We rely on the fact that only one type of exceptional entry can be found in a given radix tree based on its usage. This happens for free with DAX vs shmem but we explicitly prevent shadow entries from being added to radix trees for DAX mappings. The only shadow entries that would be generated for DAX radix trees would be to track zero page mappings that were created for holes. 
These pages would receive minimal benefit from having shadow entries, and the choice to have only one type of exceptional entry in a given radix tree makes the logic simpler both in clear_exceptional_entry() and in the rest of DAX. Signed-off-by: Ross Zwisler --- fs/block_dev.c | 3 ++- fs/inode.c | 1 + include/linux/dax.h | 5 ++++ include/linux/fs.h | 1 + include/linux/radix-tree.h | 9 +++++++ mm/filemap.c | 13 +++++++--- mm/truncate.c | 64 +++++++++++++++++++++++++++------------------- mm/vmscan.c | 9 ++++++- 8 files changed, 73 insertions(+), 32 deletions(-) diff --git a/fs/block_dev.c b/fs/block_dev.c index c25639e..226dacc 100644 --- a/fs/block_dev.c +++ b/fs/block_dev.c @@ -75,7 +75,8 @@ void kill_bdev(struct block_device *bdev) { struct address_space *mapping = bdev->bd_inode->i_mapping; - if (mapping->nrpages == 0 && mapping->nrshadows == 0) + if (mapping->nrpages == 0 && mapping->nrshadows == 0 && + mapping->nrdax == 0) return; invalidate_bh_lrus(); diff --git a/fs/inode.c b/fs/inode.c index 1be5f90..79d828f 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -496,6 +496,7 @@ void clear_inode(struct inode *inode) spin_lock_irq(&inode->i_data.tree_lock); BUG_ON(inode->i_data.nrpages); BUG_ON(inode->i_data.nrshadows); + BUG_ON(inode->i_data.nrdax); spin_unlock_irq(&inode->i_data.tree_lock); BUG_ON(!list_empty(&inode->i_data.private_list)); BUG_ON(!(inode->i_state & I_FREEING)); diff --git a/include/linux/dax.h b/include/linux/dax.h index b415e52..e9d57f68 100644 --- a/include/linux/dax.h +++ b/include/linux/dax.h @@ -36,4 +36,9 @@ static inline bool vma_is_dax(struct vm_area_struct *vma) { return vma->vm_file && IS_DAX(vma->vm_file->f_mapping->host); } + +static inline bool dax_mapping(struct address_space *mapping) +{ + return mapping->host && IS_DAX(mapping->host); +} #endif diff --git a/include/linux/fs.h b/include/linux/fs.h index 3aa5142..b9ac534 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -433,6 +433,7 @@ struct address_space { /* Protected by tree_lock together with the radix tree */ unsigned long nrpages; /* number of total pages */ unsigned long nrshadows; /* number of shadow entries */ + unsigned long nrdax; /* number of DAX entries */ pgoff_t writeback_index;/* writeback starts here */ const struct address_space_operations *a_ops; /* methods */ unsigned long flags; /* error bits/gfp mask */ diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h index 33170db..f793c99 100644 --- a/include/linux/radix-tree.h +++ b/include/linux/radix-tree.h @@ -51,6 +51,15 @@ #define RADIX_TREE_EXCEPTIONAL_ENTRY 2 #define RADIX_TREE_EXCEPTIONAL_SHIFT 2 +#define RADIX_DAX_MASK 0xf +#define RADIX_DAX_PTE (0x4 | RADIX_TREE_EXCEPTIONAL_ENTRY) +#define RADIX_DAX_PMD (0x8 | RADIX_TREE_EXCEPTIONAL_ENTRY) +#define RADIX_DAX_TYPE(entry) ((__force unsigned long)entry & RADIX_DAX_MASK) +#define RADIX_DAX_ADDR(entry) ((void __pmem *)((unsigned long)entry & \ + ~RADIX_DAX_MASK)) +#define RADIX_DAX_ENTRY(addr, pmd) ((void *)((__force unsigned long)addr | \ + (pmd ? 
RADIX_DAX_PMD : RADIX_DAX_PTE))) + static inline int radix_tree_is_indirect_ptr(void *ptr) { return (int)((unsigned long)ptr & RADIX_TREE_INDIRECT_PTR); diff --git a/mm/filemap.c b/mm/filemap.c index 1bb0076..167a4d9 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -11,6 +11,7 @@ */ #include #include +#include #include #include #include @@ -579,6 +580,12 @@ static int page_cache_tree_insert(struct address_space *mapping, p = radix_tree_deref_slot_protected(slot, &mapping->tree_lock); if (!radix_tree_exceptional_entry(p)) return -EEXIST; + + if (dax_mapping(mapping)) { + WARN_ON(1); + return -EINVAL; + } + if (shadowp) *shadowp = p; mapping->nrshadows--; @@ -1242,9 +1249,9 @@ repeat: if (radix_tree_deref_retry(page)) goto restart; /* - * A shadow entry of a recently evicted page, - * or a swap entry from shmem/tmpfs. Return - * it without attempting to raise page count. + * A shadow entry of a recently evicted page, a swap + * entry from shmem/tmpfs or a DAX entry. Return it + * without attempting to raise page count. */ goto export; } diff --git a/mm/truncate.c b/mm/truncate.c index 76e35ad..1dc9f29 100644 --- a/mm/truncate.c +++ b/mm/truncate.c @@ -9,6 +9,7 @@ #include #include +#include #include #include #include @@ -34,31 +35,39 @@ static void clear_exceptional_entry(struct address_space *mapping, return; spin_lock_irq(&mapping->tree_lock); - /* - * Regular page slots are stabilized by the page lock even - * without the tree itself locked. These unlocked entries - * need verification under the tree lock. - */ - if (!__radix_tree_lookup(&mapping->page_tree, index, &node, &slot)) - goto unlock; - if (*slot != entry) - goto unlock; - radix_tree_replace_slot(slot, NULL); - mapping->nrshadows--; - if (!node) - goto unlock; - workingset_node_shadows_dec(node); - /* - * Don't track node without shadow entries. - * - * Avoid acquiring the list_lru lock if already untracked. - * The list_empty() test is safe as node->private_list is - * protected by mapping->tree_lock. - */ - if (!workingset_node_shadows(node) && - !list_empty(&node->private_list)) - list_lru_del(&workingset_shadow_nodes, &node->private_list); - __radix_tree_delete_node(&mapping->page_tree, node); + + if (dax_mapping(mapping)) { + if (radix_tree_delete_item(&mapping->page_tree, index, entry)) + mapping->nrdax--; + } else { + /* + * Regular page slots are stabilized by the page lock even + * without the tree itself locked. These unlocked entries + * need verification under the tree lock. + */ + if (!__radix_tree_lookup(&mapping->page_tree, index, &node, + &slot)) + goto unlock; + if (*slot != entry) + goto unlock; + radix_tree_replace_slot(slot, NULL); + mapping->nrshadows--; + if (!node) + goto unlock; + workingset_node_shadows_dec(node); + /* + * Don't track node without shadow entries. + * + * Avoid acquiring the list_lru lock if already untracked. + * The list_empty() test is safe as node->private_list is + * protected by mapping->tree_lock. 
+ */ + if (!workingset_node_shadows(node) && + !list_empty(&node->private_list)) + list_lru_del(&workingset_shadow_nodes, + &node->private_list); + __radix_tree_delete_node(&mapping->page_tree, node); + } unlock: spin_unlock_irq(&mapping->tree_lock); } @@ -228,7 +237,8 @@ void truncate_inode_pages_range(struct address_space *mapping, int i; cleancache_invalidate_inode(mapping); - if (mapping->nrpages == 0 && mapping->nrshadows == 0) + if (mapping->nrpages == 0 && mapping->nrshadows == 0 && + mapping->nrdax == 0) return; /* Offsets within partial pages */ @@ -423,7 +433,7 @@ void truncate_inode_pages_final(struct address_space *mapping) smp_rmb(); nrshadows = mapping->nrshadows; - if (nrpages || nrshadows) { + if (nrpages || nrshadows || mapping->nrdax) { /* * As truncation uses a lockless tree lookup, cycle * the tree lock to make sure any ongoing tree diff --git a/mm/vmscan.c b/mm/vmscan.c index 2aec424..8071956 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -46,6 +46,7 @@ #include #include #include +#include #include #include @@ -671,9 +672,15 @@ static int __remove_mapping(struct address_space *mapping, struct page *page, * inode reclaim needs to empty out the radix tree or * the nodes are lost. Don't plant shadows behind its * back. + * + * We also don't store shadows for DAX mappings because the + * only page cache pages found in these are zero pages + * covering holes, and because we don't want to mix DAX + * exceptional entries and shadow exceptional entries in the + * same page_tree. */ if (reclaimed && page_is_file_cache(page) && - !mapping_exiting(mapping)) + !mapping_exiting(mapping) && !dax_mapping(mapping)) shadow = workingset_eviction(mapping, page); __delete_from_page_cache(page, shadow, memcg); spin_unlock_irqrestore(&mapping->tree_lock, flags); -- 2.5.0 _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from relay.sgi.com (relay3.corp.sgi.com [198.149.34.15]) by oss.sgi.com (Postfix) with ESMTP id AA3EF29DFD for ; Fri, 18 Dec 2015 23:22:42 -0600 (CST) Received: from cuda.sgi.com (cuda1.sgi.com [192.48.157.11]) by relay3.corp.sgi.com (Postfix) with ESMTP id 4BA15AC003 for ; Fri, 18 Dec 2015 21:22:42 -0800 (PST) Received: from mga09.intel.com (mga09.intel.com [134.134.136.24]) by cuda.sgi.com with ESMTP id novhpUJy2v8agpsu for ; Fri, 18 Dec 2015 21:22:36 -0800 (PST) From: Ross Zwisler Subject: [PATCH v5 5/7] ext2: call dax_pfn_mkwrite() for DAX fsync/msync Date: Fri, 18 Dec 2015 22:22:18 -0700 Message-Id: <1450502540-8744-6-git-send-email-ross.zwisler@linux.intel.com> In-Reply-To: <1450502540-8744-1-git-send-email-ross.zwisler@linux.intel.com> References: <1450502540-8744-1-git-send-email-ross.zwisler@linux.intel.com> List-Id: XFS Filesystem from SGI List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: xfs-bounces@oss.sgi.com Sender: xfs-bounces@oss.sgi.com To: linux-kernel@vger.kernel.org Cc: Dave Hansen , "J. Bruce Fields" , linux-mm@kvack.org, Andreas Dilger , "H. 
Peter Anvin" , Jeff Layton , Dan Williams , linux-nvdimm@lists.01.org, x86@kernel.org, Ingo Molnar , Matthew Wilcox , Ross Zwisler , linux-ext4@vger.kernel.org, xfs@oss.sgi.com, Alexander Viro , Thomas Gleixner , Theodore Ts'o , Jan Kara , linux-fsdevel@vger.kernel.org, Andrew Morton , Matthew Wilcox To properly support the new DAX fsync/msync infrastructure filesystems need to call dax_pfn_mkwrite() so that DAX can track when user pages are dirtied. Signed-off-by: Ross Zwisler --- fs/ext2/file.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/fs/ext2/file.c b/fs/ext2/file.c index 11a42c5..2c88d68 100644 --- a/fs/ext2/file.c +++ b/fs/ext2/file.c @@ -102,8 +102,8 @@ static int ext2_dax_pfn_mkwrite(struct vm_area_struct *vma, { struct inode *inode = file_inode(vma->vm_file); struct ext2_inode_info *ei = EXT2_I(inode); - int ret = VM_FAULT_NOPAGE; loff_t size; + int ret; sb_start_pagefault(inode->i_sb); file_update_time(vma->vm_file); @@ -113,6 +113,8 @@ static int ext2_dax_pfn_mkwrite(struct vm_area_struct *vma, size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT; if (vmf->pgoff >= size) ret = VM_FAULT_SIGBUS; + else + ret = dax_pfn_mkwrite(vma, vmf); up_read(&ei->dax_sem); sb_end_pagefault(inode->i_sb); -- 2.5.0 _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from relay.sgi.com (relay2.corp.sgi.com [137.38.102.29]) by oss.sgi.com (Postfix) with ESMTP id 521AF7F85 for ; Sat, 19 Dec 2015 12:37:52 -0600 (CST) Received: from cuda.sgi.com (cuda2.sgi.com [192.48.176.25]) by relay2.corp.sgi.com (Postfix) with ESMTP id 3E13B304032 for ; Sat, 19 Dec 2015 10:37:49 -0800 (PST) Received: from mail-yk0-f171.google.com (mail-yk0-f171.google.com [209.85.160.171]) by cuda.sgi.com with ESMTP id jsmvcTBJD6mJN7tW (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128 verify=NO) for ; Sat, 19 Dec 2015 10:37:47 -0800 (PST) Received: by mail-yk0-f171.google.com with SMTP id 140so93534406ykp.0 for ; Sat, 19 Dec 2015 10:37:47 -0800 (PST) MIME-Version: 1.0 In-Reply-To: <1450502540-8744-5-git-send-email-ross.zwisler@linux.intel.com> References: <1450502540-8744-1-git-send-email-ross.zwisler@linux.intel.com> <1450502540-8744-5-git-send-email-ross.zwisler@linux.intel.com> Date: Sat, 19 Dec 2015 10:37:46 -0800 Message-ID: Subject: Re: [PATCH v5 4/7] dax: add support for fsync/sync From: Dan Williams List-Id: XFS Filesystem from SGI List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: xfs-bounces@oss.sgi.com Sender: xfs-bounces@oss.sgi.com To: Ross Zwisler Cc: X86 ML , Theodore Ts'o , Andrew Morton , "linux-nvdimm@lists.01.org" , Jan Kara , "linux-kernel@vger.kernel.org" , Dave Hansen , XFS Developers , "J. Bruce Fields" , Linux MM , Ingo Molnar , Andreas Dilger , Alexander Viro , "H. Peter Anvin" , linux-fsdevel , Matthew Wilcox , Jeff Layton , linux-ext4 , Thomas Gleixner , Matthew Wilcox On Fri, Dec 18, 2015 at 9:22 PM, Ross Zwisler wrote: > To properly handle fsync/msync in an efficient way DAX needs to track dirty > pages so it is able to flush them durably to media on demand. > > The tracking of dirty pages is done via the radix tree in struct > address_space. 
This radix tree is already used by the page writeback > infrastructure for tracking dirty pages associated with an open file, and > it already has support for exceptional (non struct page*) entries. We > build upon these features to add exceptional entries to the radix tree for > DAX dirty PMD or PTE pages at fault time. > > Signed-off-by: Ross Zwisler [..] > +static void dax_writeback_one(struct address_space *mapping, pgoff_t index, > + void *entry) > +{ > + struct radix_tree_root *page_tree = &mapping->page_tree; > + int type = RADIX_DAX_TYPE(entry); > + struct radix_tree_node *node; > + void **slot; > + > + if (type != RADIX_DAX_PTE && type != RADIX_DAX_PMD) { > + WARN_ON_ONCE(1); > + return; > + } > + > + spin_lock_irq(&mapping->tree_lock); > + /* > + * Regular page slots are stabilized by the page lock even > + * without the tree itself locked. These unlocked entries > + * need verification under the tree lock. > + */ > + if (!__radix_tree_lookup(page_tree, index, &node, &slot)) > + goto unlock; > + if (*slot != entry) > + goto unlock; > + > + /* another fsync thread may have already written back this entry */ > + if (!radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE)) > + goto unlock; > + > + radix_tree_tag_clear(page_tree, index, PAGECACHE_TAG_TOWRITE); > + > + if (type == RADIX_DAX_PMD) > + wb_cache_pmem(RADIX_DAX_ADDR(entry), PMD_SIZE); > + else > + wb_cache_pmem(RADIX_DAX_ADDR(entry), PAGE_SIZE); Hi Ross, I should have realized this sooner, but what guarantees that the address returned by RADIX_DAX_ADDR(entry) is still valid at this point? I think we need to store the sector in the radix tree and then perform a new dax_map_atomic() operation to either lookup a valid address or fail the sync request. Otherwise, if the device is gone we'll crash, or write into some other random vmalloc address space. _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from relay.sgi.com (relay1.corp.sgi.com [137.38.102.111]) by oss.sgi.com (Postfix) with ESMTP id 1BB2A7F37 for ; Mon, 21 Dec 2015 11:32:27 -0600 (CST) Received: from cuda.sgi.com (cuda2.sgi.com [192.48.176.25]) by relay1.corp.sgi.com (Postfix) with ESMTP id 0B8378F8039 for ; Mon, 21 Dec 2015 09:32:26 -0800 (PST) Received: from mx2.suse.de (mx2.suse.de [195.135.220.15]) by cuda.sgi.com with ESMTP id 3c8lRq8Zt1NlkNXY (version=TLSv1 cipher=ECDHE-RSA-AES256-SHA bits=256 verify=NO) for ; Mon, 21 Dec 2015 09:32:25 -0800 (PST) Date: Mon, 21 Dec 2015 18:32:24 +0100 From: Jan Kara Subject: Re: [PATCH v5 6/7] ext4: call dax_pfn_mkwrite() for DAX fsync/msync Message-ID: <20151221173223.GC7030@quack.suse.cz> References: <1450502540-8744-1-git-send-email-ross.zwisler@linux.intel.com> <1450502540-8744-7-git-send-email-ross.zwisler@linux.intel.com> MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: <1450502540-8744-7-git-send-email-ross.zwisler@linux.intel.com> List-Id: XFS Filesystem from SGI List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: xfs-bounces@oss.sgi.com Sender: xfs-bounces@oss.sgi.com To: Ross Zwisler Cc: Dave Hansen , "J. Bruce Fields" , linux-mm@kvack.org, Andreas Dilger , "H. 
Peter Anvin" , Jeff Layton , Dan Williams , linux-nvdimm@lists.01.org, x86@kernel.org, Ingo Molnar , Matthew Wilcox , linux-ext4@vger.kernel.org, xfs@oss.sgi.com, Alexander Viro , Thomas Gleixner , Theodore Ts'o , linux-kernel@vger.kernel.org, Jan Kara , linux-fsdevel@vger.kernel.org, Andrew Morton , Matthew Wilcox On Fri 18-12-15 22:22:19, Ross Zwisler wrote: > To properly support the new DAX fsync/msync infrastructure filesystems > need to call dax_pfn_mkwrite() so that DAX can track when user pages are > dirtied. The patch looks good to me. You can add: Reviewed-by: Jan Kara Honza > > Signed-off-by: Ross Zwisler > --- > fs/ext4/file.c | 4 +++- > 1 file changed, 3 insertions(+), 1 deletion(-) > > diff --git a/fs/ext4/file.c b/fs/ext4/file.c > index 749b222..8c8965c 100644 > --- a/fs/ext4/file.c > +++ b/fs/ext4/file.c > @@ -291,8 +291,8 @@ static int ext4_dax_pfn_mkwrite(struct vm_area_struct *vma, > { > struct inode *inode = file_inode(vma->vm_file); > struct super_block *sb = inode->i_sb; > - int ret = VM_FAULT_NOPAGE; > loff_t size; > + int ret; > > sb_start_pagefault(sb); > file_update_time(vma->vm_file); > @@ -300,6 +300,8 @@ static int ext4_dax_pfn_mkwrite(struct vm_area_struct *vma, > size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT; > if (vmf->pgoff >= size) > ret = VM_FAULT_SIGBUS; > + else > + ret = dax_pfn_mkwrite(vma, vmf); > up_read(&EXT4_I(inode)->i_mmap_sem); > sb_end_pagefault(sb); > > -- > 2.5.0 > > -- Jan Kara SUSE Labs, CR _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from relay.sgi.com (relay2.corp.sgi.com [137.38.102.29]) by oss.sgi.com (Postfix) with ESMTP id BC5CC7F37 for ; Mon, 21 Dec 2015 11:46:25 -0600 (CST) Received: from cuda.sgi.com (cuda2.sgi.com [192.48.176.25]) by relay2.corp.sgi.com (Postfix) with ESMTP id 9953530404E for ; Mon, 21 Dec 2015 09:46:25 -0800 (PST) Received: from mga14.intel.com ([192.55.52.115]) by cuda.sgi.com with ESMTP id pvjJOEjgyKwVzmbg for ; Mon, 21 Dec 2015 09:46:23 -0800 (PST) Date: Mon, 21 Dec 2015 10:45:34 -0700 From: Ross Zwisler Subject: Re: [PATCH v5 2/7] dax: support dirty DAX entries in radix tree Message-ID: <20151221174534.GA4978@linux.intel.com> References: <1450502540-8744-1-git-send-email-ross.zwisler@linux.intel.com> <1450502540-8744-3-git-send-email-ross.zwisler@linux.intel.com> <20151221171512.GA7030@quack.suse.cz> MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: <20151221171512.GA7030@quack.suse.cz> List-Id: XFS Filesystem from SGI List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: xfs-bounces@oss.sgi.com Sender: xfs-bounces@oss.sgi.com To: Jan Kara Cc: Dave Hansen , "J. Bruce Fields" , linux-mm@kvack.org, Andreas Dilger , "H. Peter Anvin" , Jeff Layton , Dan Williams , linux-nvdimm@lists.01.org, x86@kernel.org, Ingo Molnar , Matthew Wilcox , Ross Zwisler , linux-ext4@vger.kernel.org, xfs@oss.sgi.com, Alexander Viro , Thomas Gleixner , Theodore Ts'o , linux-kernel@vger.kernel.org, Jan Kara , linux-fsdevel@vger.kernel.org, Andrew Morton , Matthew Wilcox On Mon, Dec 21, 2015 at 06:15:12PM +0100, Jan Kara wrote: > On Fri 18-12-15 22:22:15, Ross Zwisler wrote: > > Add support for tracking dirty DAX entries in the struct address_space > > radix tree. 
This tree is already used for dirty page writeback, and it > > already supports the use of exceptional (non struct page*) entries. > > > > In order to properly track dirty DAX pages we will insert new exceptional > > entries into the radix tree that represent dirty DAX PTE or PMD pages. > > These exceptional entries will also contain the writeback addresses for the > > PTE or PMD faults that we can use at fsync/msync time. > > > > There are currently two types of exceptional entries (shmem and shadow) > > that can be placed into the radix tree, and this adds a third. We rely on > > the fact that only one type of exceptional entry can be found in a given > > radix tree based on its usage. This happens for free with DAX vs shmem but > > we explicitly prevent shadow entries from being added to radix trees for > > DAX mappings. > > > > The only shadow entries that would be generated for DAX radix trees would > > be to track zero page mappings that were created for holes. These pages > > would receive minimal benefit from having shadow entries, and the choice > > to have only one type of exceptional entry in a given radix tree makes the > > logic simpler both in clear_exceptional_entry() and in the rest of DAX. > > > > Signed-off-by: Ross Zwisler > > The patch looks good to me. Just one comment: When we have this exclusion > between different types of exceptional entries, there is no real need to > have separate counters of 'shadow' and 'dax' entries, is there? We can have > one 'nrexceptional' counter and don't have to grow struct inode > unnecessarily which would be really welcome since DAX isn't a mainstream > feature. Could you please change the code? Thanks! Sure, this sounds good. Thanks! _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from relay.sgi.com (relay2.corp.sgi.com [137.38.102.29]) by oss.sgi.com (Postfix) with ESMTP id D333529DF5 for ; Tue, 22 Dec 2015 16:44:47 -0600 (CST) Received: from cuda.sgi.com (cuda2.sgi.com [192.48.176.25]) by relay2.corp.sgi.com (Postfix) with ESMTP id B4F88304032 for ; Tue, 22 Dec 2015 14:44:44 -0800 (PST) Received: from mail.linuxfoundation.org (mail.linuxfoundation.org [140.211.169.12]) by cuda.sgi.com with ESMTP id ijMW8cWMOPPvyDgb (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NO) for ; Tue, 22 Dec 2015 14:44:42 -0800 (PST) Date: Tue, 22 Dec 2015 14:44:40 -0800 From: Andrew Morton Subject: Re: [PATCH v5 1/7] pmem: add wb_cache_pmem() to the PMEM API Message-Id: <20151222144440.1ad9e076464f4751f3de6a1f@linux-foundation.org> In-Reply-To: <1450502540-8744-2-git-send-email-ross.zwisler@linux.intel.com> References: <1450502540-8744-1-git-send-email-ross.zwisler@linux.intel.com> <1450502540-8744-2-git-send-email-ross.zwisler@linux.intel.com> Mime-Version: 1.0 List-Id: XFS Filesystem from SGI List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: xfs-bounces@oss.sgi.com Sender: xfs-bounces@oss.sgi.com To: Ross Zwisler Cc: x86@kernel.org, Theodore Ts'o , linux-nvdimm@ml01.01.org, Jan Kara , linux-kernel@vger.kernel.org, Dave Hansen , xfs@oss.sgi.com, "J. Bruce Fields" , linux-mm@kvack.org, Ingo Molnar , Andreas Dilger , Alexander Viro , "H. 
Peter Anvin" , linux-fsdevel@vger.kernel.org, Matthew Wilcox , Jeff Layton , linux-ext4@vger.kernel.org, Thomas Gleixner , Dan Williams , Matthew Wilcox On Fri, 18 Dec 2015 22:22:14 -0700 Ross Zwisler wrote: > The function __arch_wb_cache_pmem() was already an internal implementation > detail of the x86 PMEM API, but this functionality needs to be exported as > part of the general PMEM API to handle the fsync/msync case for DAX mmaps. > > One thing worth noting is that we really do want this to be part of the > PMEM API as opposed to a stand-alone function like clflush_cache_range() > because of ordering restrictions. By having wb_cache_pmem() as part of the > PMEM API we can leave it unordered, call it multiple times to write back > large amounts of memory, and then order the multiple calls with a single > wmb_pmem(). > > @@ -138,7 +139,7 @@ static inline void arch_clear_pmem(void __pmem *addr, size_t size) > else > memset(vaddr, 0, size); > > - __arch_wb_cache_pmem(vaddr, size); > + arch_wb_cache_pmem(addr, size); > } > reject. I made this arch_wb_cache_pmem(vaddr, size); due to Dan's http://www.ozlabs.org/~akpm/mmots/broken-out/pmem-dax-clean-up-clear_pmem.patch _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from relay.sgi.com (relay3.corp.sgi.com [198.149.34.15]) by oss.sgi.com (Postfix) with ESMTP id 4E99829DF5 for ; Tue, 22 Dec 2015 16:46:12 -0600 (CST) Received: from cuda.sgi.com (cuda3.sgi.com [192.48.176.15]) by relay3.corp.sgi.com (Postfix) with ESMTP id D3DBFAC001 for ; Tue, 22 Dec 2015 14:46:08 -0800 (PST) Received: from mail.linuxfoundation.org (mail.linuxfoundation.org [140.211.169.12]) by cuda.sgi.com with ESMTP id 1EO5TdLYj2mS1oUp (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NO) for ; Tue, 22 Dec 2015 14:46:06 -0800 (PST) Date: Tue, 22 Dec 2015 14:46:05 -0800 From: Andrew Morton Subject: Re: [PATCH v5 2/7] dax: support dirty DAX entries in radix tree Message-Id: <20151222144605.08a84ded98a42d6125a7991e@linux-foundation.org> In-Reply-To: <1450502540-8744-3-git-send-email-ross.zwisler@linux.intel.com> References: <1450502540-8744-1-git-send-email-ross.zwisler@linux.intel.com> <1450502540-8744-3-git-send-email-ross.zwisler@linux.intel.com> Mime-Version: 1.0 List-Id: XFS Filesystem from SGI List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: xfs-bounces@oss.sgi.com Sender: xfs-bounces@oss.sgi.com To: Ross Zwisler Cc: x86@kernel.org, Theodore Ts'o , linux-nvdimm@ml01.01.org, Jan Kara , linux-kernel@vger.kernel.org, Dave Hansen , xfs@oss.sgi.com, "J. Bruce Fields" , linux-mm@kvack.org, Ingo Molnar , Andreas Dilger , Alexander Viro , "H. Peter Anvin" , linux-fsdevel@vger.kernel.org, Matthew Wilcox , Jeff Layton , linux-ext4@vger.kernel.org, Thomas Gleixner , Dan Williams , Matthew Wilcox On Fri, 18 Dec 2015 22:22:15 -0700 Ross Zwisler wrote: > Add support for tracking dirty DAX entries in the struct address_space > radix tree. This tree is already used for dirty page writeback, and it > already supports the use of exceptional (non struct page*) entries. > > In order to properly track dirty DAX pages we will insert new exceptional > entries into the radix tree that represent dirty DAX PTE or PMD pages. 
> These exceptional entries will also contain the writeback addresses for the > PTE or PMD faults that we can use at fsync/msync time. > > There are currently two types of exceptional entries (shmem and shadow) > that can be placed into the radix tree, and this adds a third. We rely on > the fact that only one type of exceptional entry can be found in a given > radix tree based on its usage. This happens for free with DAX vs shmem but > we explicitly prevent shadow entries from being added to radix trees for > DAX mappings. > > The only shadow entries that would be generated for DAX radix trees would > be to track zero page mappings that were created for holes. These pages > would receive minimal benefit from having shadow entries, and the choice > to have only one type of exceptional entry in a given radix tree makes the > logic simpler both in clear_exceptional_entry() and in the rest of DAX. > > > ... > > --- a/include/linux/dax.h > +++ b/include/linux/dax.h > @@ -36,4 +36,9 @@ static inline bool vma_is_dax(struct vm_area_struct *vma) > { > return vma->vm_file && IS_DAX(vma->vm_file->f_mapping->host); > } > + > +static inline bool dax_mapping(struct address_space *mapping) > +{ > + return mapping->host && IS_DAX(mapping->host); > +} Can we make this evaluate to plain old "0" when CONFIG_FS_DAX=n? That way a bunch of code in callers will fall away as well. If the compiler has any brains then a good way to do this would be to make IS_DAX be "0" but one would need to check that the zeroness properly propagated out of the inline. > #endif > diff --git a/include/linux/fs.h b/include/linux/fs.h > index 3aa5142..b9ac534 100644 > --- a/include/linux/fs.h > +++ b/include/linux/fs.h > @@ -433,6 +433,7 @@ struct address_space { > /* Protected by tree_lock together with the radix tree */ > unsigned long nrpages; /* number of total pages */ > unsigned long nrshadows; /* number of shadow entries */ > + unsigned long nrdax; /* number of DAX entries */ hm, that's unfortunate - machines commonly carry tremendous numbers of address_spaces in memory and adding pork to them is rather a big deal. We can't avoid this somehow? Maybe share the space with nrshadows by some means? Find some other field which is unused for dax files? 
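For illustration only (not a hunk from this series): a minimal sketch of the shared-counter idea being discussed, with nrexceptional used purely as a stand-in name. Since a given mapping only ever holds one kind of exceptional entry, one field can do the accounting and the emptiness checks stay a single comparison.

#include <linux/types.h>

/* illustrative stand-in for the relevant address_space fields */
struct example_mapping_counters {
	unsigned long nrpages;		/* number of total pages */
	unsigned long nrexceptional;	/* shadow or DAX entries */
};

/* kill_bdev()/truncate-style emptiness test with the merged counter */
static inline bool example_mapping_empty(struct example_mapping_counters *m)
{
	return m->nrpages == 0 && m->nrexceptional == 0;
}

The insert/remove paths would then bump nrexceptional wherever they currently touch nrshadows or nrdax.
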
> --- a/mm/filemap.c > +++ b/mm/filemap.c > @@ -11,6 +11,7 @@ > */ > #include > #include > +#include > #include > #include > #include > @@ -579,6 +580,12 @@ static int page_cache_tree_insert(struct address_space *mapping, > p = radix_tree_deref_slot_protected(slot, &mapping->tree_lock); > if (!radix_tree_exceptional_entry(p)) > return -EEXIST; > + > + if (dax_mapping(mapping)) { > + WARN_ON(1); > + return -EINVAL; > + } this: --- a/mm/filemap.c~dax-support-dirty-dax-entries-in-radix-tree-fix +++ a/mm/filemap.c @@ -581,10 +581,8 @@ static int page_cache_tree_insert(struct if (!radix_tree_exceptional_entry(p)) return -EEXIST; - if (dax_mapping(mapping)) { - WARN_ON(1); + if (WARN_ON(dax_mapping(mapping))) return -EINVAL; - } if (shadowp) *shadowp = p; _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from relay.sgi.com (relay1.corp.sgi.com [137.38.102.111]) by oss.sgi.com (Postfix) with ESMTP id 5434729DF5 for ; Tue, 22 Dec 2015 16:46:14 -0600 (CST) Received: from cuda.sgi.com (cuda2.sgi.com [192.48.176.25]) by relay1.corp.sgi.com (Postfix) with ESMTP id 355B98F8040 for ; Tue, 22 Dec 2015 14:46:14 -0800 (PST) Received: from mail.linuxfoundation.org (mail.linuxfoundation.org [140.211.169.12]) by cuda.sgi.com with ESMTP id ZO8Pow1V8r8C0GiD (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NO) for ; Tue, 22 Dec 2015 14:46:13 -0800 (PST) Date: Tue, 22 Dec 2015 14:46:11 -0800 From: Andrew Morton Subject: Re: [PATCH v5 3/7] mm: add find_get_entries_tag() Message-Id: <20151222144611.07002cfde41d035125da2fa5@linux-foundation.org> In-Reply-To: <1450502540-8744-4-git-send-email-ross.zwisler@linux.intel.com> References: <1450502540-8744-1-git-send-email-ross.zwisler@linux.intel.com> <1450502540-8744-4-git-send-email-ross.zwisler@linux.intel.com> Mime-Version: 1.0 List-Id: XFS Filesystem from SGI List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: xfs-bounces@oss.sgi.com Sender: xfs-bounces@oss.sgi.com To: Ross Zwisler Cc: x86@kernel.org, Theodore Ts'o , linux-nvdimm@ml01.01.org, Jan Kara , linux-kernel@vger.kernel.org, Dave Hansen , xfs@oss.sgi.com, "J. Bruce Fields" , linux-mm@kvack.org, Ingo Molnar , Andreas Dilger , Alexander Viro , "H. Peter Anvin" , linux-fsdevel@vger.kernel.org, Matthew Wilcox , Jeff Layton , linux-ext4@vger.kernel.org, Thomas Gleixner , Dan Williams , Matthew Wilcox On Fri, 18 Dec 2015 22:22:16 -0700 Ross Zwisler wrote: > Add find_get_entries_tag() to the family of functions that include > find_get_entries(), find_get_pages() and find_get_pages_tag(). This is > needed for DAX dirty page handling because we need a list of both page > offsets and radix tree entries ('indices' and 'entries' in this function) > that are marked with the PAGECACHE_TAG_TOWRITE tag. > > ... > > +EXPORT_SYMBOL(find_get_entries_tag); This is actually a pretty crappy name because it doesn't describe what subsystem it belongs to. scheduler? scatter/gather? filesystem? But given what we've already done, I don't see an obvious fix. 
_______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from relay.sgi.com (relay2.corp.sgi.com [137.38.102.29]) by oss.sgi.com (Postfix) with ESMTP id 301CB29DF5 for ; Tue, 22 Dec 2015 16:46:29 -0600 (CST) Received: from cuda.sgi.com (cuda3.sgi.com [192.48.176.15]) by relay2.corp.sgi.com (Postfix) with ESMTP id 1E271304048 for ; Tue, 22 Dec 2015 14:46:29 -0800 (PST) Received: from mail.linuxfoundation.org (mail.linuxfoundation.org [140.211.169.12]) by cuda.sgi.com with ESMTP id RkPDmp2SSLRGt7Bd (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NO) for ; Tue, 22 Dec 2015 14:46:27 -0800 (PST) Date: Tue, 22 Dec 2015 14:46:25 -0800 From: Andrew Morton Subject: Re: [PATCH v5 4/7] dax: add support for fsync/sync Message-Id: <20151222144625.f400e12e362cf9b00f6ffb36@linux-foundation.org> In-Reply-To: <1450502540-8744-5-git-send-email-ross.zwisler@linux.intel.com> References: <1450502540-8744-1-git-send-email-ross.zwisler@linux.intel.com> <1450502540-8744-5-git-send-email-ross.zwisler@linux.intel.com> Mime-Version: 1.0 List-Id: XFS Filesystem from SGI List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: xfs-bounces@oss.sgi.com Sender: xfs-bounces@oss.sgi.com To: Ross Zwisler Cc: x86@kernel.org, Theodore Ts'o , linux-nvdimm@ml01.01.org, Jan Kara , linux-kernel@vger.kernel.org, Dave Hansen , xfs@oss.sgi.com, "J. Bruce Fields" , linux-mm@kvack.org, Ingo Molnar , Andreas Dilger , Alexander Viro , "H. Peter Anvin" , linux-fsdevel@vger.kernel.org, Matthew Wilcox , Jeff Layton , linux-ext4@vger.kernel.org, Thomas Gleixner , Dan Williams , Matthew Wilcox On Fri, 18 Dec 2015 22:22:17 -0700 Ross Zwisler wrote: > To properly handle fsync/msync in an efficient way DAX needs to track dirty > pages so it is able to flush them durably to media on demand. > > The tracking of dirty pages is done via the radix tree in struct > address_space. This radix tree is already used by the page writeback > infrastructure for tracking dirty pages associated with an open file, and > it already has support for exceptional (non struct page*) entries. We > build upon these features to add exceptional entries to the radix tree for > DAX dirty PMD or PTE pages at fault time. I'm getting a few rejects here against other pending changes. Things look OK to me but please do runtime test the end result as it resides in linux-next. Which will be next year. > > ... > > +static void dax_writeback_one(struct address_space *mapping, pgoff_t index, > + void *entry) > +{ > + struct radix_tree_root *page_tree = &mapping->page_tree; > + int type = RADIX_DAX_TYPE(entry); > + struct radix_tree_node *node; > + void **slot; > + > + if (type != RADIX_DAX_PTE && type != RADIX_DAX_PMD) { > + WARN_ON_ONCE(1); > + return; > + } --- a/fs/dax.c~dax-add-support-for-fsync-sync-fix +++ a/fs/dax.c @@ -383,10 +383,8 @@ static void dax_writeback_one(struct add struct radix_tree_node *node; void **slot; - if (type != RADIX_DAX_PTE && type != RADIX_DAX_PMD) { - WARN_ON_ONCE(1); + if (WARN_ON_ONCE(type != RADIX_DAX_PTE && type != RADIX_DAX_PMD)) return; - } spin_lock_irq(&mapping->tree_lock); /* > + spin_lock_irq(&mapping->tree_lock); > + /* > + * Regular page slots are stabilized by the page lock even > + * without the tree itself locked. These unlocked entries > + * need verification under the tree lock. 
> + */ > + if (!__radix_tree_lookup(page_tree, index, &node, &slot)) > + goto unlock; > + if (*slot != entry) > + goto unlock; > + > + /* another fsync thread may have already written back this entry */ > + if (!radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE)) > + goto unlock; > + > + radix_tree_tag_clear(page_tree, index, PAGECACHE_TAG_TOWRITE); > + > + if (type == RADIX_DAX_PMD) > + wb_cache_pmem(RADIX_DAX_ADDR(entry), PMD_SIZE); > + else > + wb_cache_pmem(RADIX_DAX_ADDR(entry), PAGE_SIZE); > + unlock: > + spin_unlock_irq(&mapping->tree_lock); > +} > + > +/* > + * Flush the mapping to the persistent domain within the byte range of [start, > + * end]. This is required by data integrity operations to ensure file data is > + * on persistent storage prior to completion of the operation. > + */ > +void dax_writeback_mapping_range(struct address_space *mapping, loff_t start, > + loff_t end) > +{ > + struct inode *inode = mapping->host; > + pgoff_t indices[PAGEVEC_SIZE]; > + pgoff_t start_page, end_page; > + struct pagevec pvec; > + void *entry; > + int i; > + > + if (inode->i_blkbits != PAGE_SHIFT) { > + WARN_ON_ONCE(1); > + return; > + } again > + rcu_read_lock(); > + entry = radix_tree_lookup(&mapping->page_tree, start & PMD_MASK); > + rcu_read_unlock(); What stabilizes the memory at *entry after rcu_read_unlock()? > + /* see if the start of our range is covered by a PMD entry */ > + if (entry && RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD) > + start &= PMD_MASK; > + > + start_page = start >> PAGE_CACHE_SHIFT; > + end_page = end >> PAGE_CACHE_SHIFT; > + > + tag_pages_for_writeback(mapping, start_page, end_page); > + > + pagevec_init(&pvec, 0); > + while (1) { > + pvec.nr = find_get_entries_tag(mapping, start_page, > + PAGECACHE_TAG_TOWRITE, PAGEVEC_SIZE, > + pvec.pages, indices); > + > + if (pvec.nr == 0) > + break; > + > + for (i = 0; i < pvec.nr; i++) > + dax_writeback_one(mapping, indices[i], pvec.pages[i]); > + } > + wmb_pmem(); > +} > +EXPORT_SYMBOL_GPL(dax_writeback_mapping_range); > + > > ... > _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from relay.sgi.com (relay1.corp.sgi.com [137.38.102.111]) by oss.sgi.com (Postfix) with ESMTP id 951ED29DF5 for ; Tue, 22 Dec 2015 17:51:33 -0600 (CST) Received: from cuda.sgi.com (cuda2.sgi.com [192.48.176.25]) by relay1.corp.sgi.com (Postfix) with ESMTP id 820C48F8052 for ; Tue, 22 Dec 2015 15:51:33 -0800 (PST) Received: from mga04.intel.com ([192.55.52.120]) by cuda.sgi.com with ESMTP id x82SZMBueey3G9Ky for ; Tue, 22 Dec 2015 15:51:28 -0800 (PST) Date: Tue, 22 Dec 2015 16:51:23 -0700 From: Ross Zwisler Subject: Re: [PATCH v5 4/7] dax: add support for fsync/sync Message-ID: <20151222235123.GA24124@linux.intel.com> References: <1450502540-8744-1-git-send-email-ross.zwisler@linux.intel.com> <1450502540-8744-5-git-send-email-ross.zwisler@linux.intel.com> <20151222144625.f400e12e362cf9b00f6ffb36@linux-foundation.org> MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: <20151222144625.f400e12e362cf9b00f6ffb36@linux-foundation.org> List-Id: XFS Filesystem from SGI List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: xfs-bounces@oss.sgi.com Sender: xfs-bounces@oss.sgi.com To: Andrew Morton Cc: linux-nvdimm@ml01.01.org, Dave Hansen , "J. Bruce Fields" , linux-mm@kvack.org, Andreas Dilger , "H. 
Peter Anvin" , Jeff Layton , Dan Williams , x86@kernel.org, Ingo Molnar , Matthew Wilcox , Ross Zwisler , linux-ext4@vger.kernel.org, xfs@oss.sgi.com, Alexander Viro , Thomas Gleixner , Theodore Ts'o , linux-kernel@vger.kernel.org, Jan Kara , linux-fsdevel@vger.kernel.org, Matthew Wilcox On Tue, Dec 22, 2015 at 02:46:25PM -0800, Andrew Morton wrote: > On Fri, 18 Dec 2015 22:22:17 -0700 Ross Zwisler wrote: > > > To properly handle fsync/msync in an efficient way DAX needs to track dirty > > pages so it is able to flush them durably to media on demand. > > > > The tracking of dirty pages is done via the radix tree in struct > > address_space. This radix tree is already used by the page writeback > > infrastructure for tracking dirty pages associated with an open file, and > > it already has support for exceptional (non struct page*) entries. We > > build upon these features to add exceptional entries to the radix tree for > > DAX dirty PMD or PTE pages at fault time. > > I'm getting a few rejects here against other pending changes. Things > look OK to me but please do runtime test the end result as it resides > in linux-next. Which will be next year. Sounds good. I'm hoping to soon send out an updated version of this series which merges with Dan's changes to dax.c. Thank you for pulling these into -mm. > --- a/fs/dax.c~dax-add-support-for-fsync-sync-fix > +++ a/fs/dax.c > @@ -383,10 +383,8 @@ static void dax_writeback_one(struct add > struct radix_tree_node *node; > void **slot; > > - if (type != RADIX_DAX_PTE && type != RADIX_DAX_PMD) { > - WARN_ON_ONCE(1); > + if (WARN_ON_ONCE(type != RADIX_DAX_PTE && type != RADIX_DAX_PMD)) > return; > - } This is much cleaner, thanks. I'll make this change throughout my set. > > +/* > > + * Flush the mapping to the persistent domain within the byte range of [start, > > + * end]. This is required by data integrity operations to ensure file data is > > + * on persistent storage prior to completion of the operation. > > + */ > > +void dax_writeback_mapping_range(struct address_space *mapping, loff_t start, > > + loff_t end) > > +{ > > + struct inode *inode = mapping->host; > > + pgoff_t indices[PAGEVEC_SIZE]; > > + pgoff_t start_page, end_page; > > + struct pagevec pvec; > > + void *entry; > > + int i; > > + > > + if (inode->i_blkbits != PAGE_SHIFT) { > > + WARN_ON_ONCE(1); > > + return; > > + } > > again > > > + rcu_read_lock(); > > + entry = radix_tree_lookup(&mapping->page_tree, start & PMD_MASK); > > + rcu_read_unlock(); > > What stabilizes the memory at *entry after rcu_read_unlock()? Nothing in this function. We use the entry that is currently in the tree to know whether or not to expand the range of offsets that we need to flush. Even if we are racing with someone, expanding our flushing range is non-destructive. We get a list of entries based on what is dirty later in this function via find_get_entries_tag(), and before we take any action on those entries we re-verify them while holding the tree_lock in dax_writeback_one(). The next version of this series will have updated version of this code which also accounts for block device removal via dax_map_atomic() inside of dax_writeback_one(). 
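To restate that in code terms (a hedged paraphrase of the quoted patch, comments mine): the unlocked lookup is purely advisory and can at worst widen the range that gets flushed, while the authoritative check is redone under the tree lock before anything is written back:

	/* advisory, unlocked: only used to decide whether to widen the range */
	rcu_read_lock();
	entry = radix_tree_lookup(&mapping->page_tree, start & PMD_MASK);
	rcu_read_unlock();
	if (entry && RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD)
		start &= PMD_MASK;	/* worst case we flush a little extra */

	/* ... later, per entry, in dax_writeback_one(), the real check: */
	spin_lock_irq(&mapping->tree_lock);
	if (!__radix_tree_lookup(page_tree, index, &node, &slot) ||
	    *slot != entry)
		goto unlock;		/* entry was removed or replaced */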
_______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from relay.sgi.com (relay3.corp.sgi.com [198.149.34.15]) by oss.sgi.com (Postfix) with ESMTP id 190037F37 for ; Tue, 22 Dec 2015 18:00:21 -0600 (CST) Received: from cuda.sgi.com (cuda3.sgi.com [192.48.176.15]) by relay3.corp.sgi.com (Postfix) with ESMTP id 914CDAC004 for ; Tue, 22 Dec 2015 16:00:17 -0800 (PST) Received: from mga01.intel.com ([192.55.52.88]) by cuda.sgi.com with ESMTP id 3nriHoIE9miGi9qF for ; Tue, 22 Dec 2015 16:00:16 -0800 (PST) Date: Tue, 22 Dec 2015 17:00:10 -0700 From: Ross Zwisler Subject: Re: [PATCH v5 1/7] pmem: add wb_cache_pmem() to the PMEM API Message-ID: <20151223000010.GB24124@linux.intel.com> References: <1450502540-8744-1-git-send-email-ross.zwisler@linux.intel.com> <1450502540-8744-2-git-send-email-ross.zwisler@linux.intel.com> <20151222144440.1ad9e076464f4751f3de6a1f@linux-foundation.org> MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: <20151222144440.1ad9e076464f4751f3de6a1f@linux-foundation.org> List-Id: XFS Filesystem from SGI List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: xfs-bounces@oss.sgi.com Sender: xfs-bounces@oss.sgi.com To: Andrew Morton Cc: linux-nvdimm@ml01.01.org, Dave Hansen , "J. Bruce Fields" , linux-mm@kvack.org, Andreas Dilger , "H. Peter Anvin" , Jeff Layton , Dan Williams , x86@kernel.org, Ingo Molnar , Matthew Wilcox , Ross Zwisler , linux-ext4@vger.kernel.org, xfs@oss.sgi.com, Alexander Viro , Thomas Gleixner , Theodore Ts'o , linux-kernel@vger.kernel.org, Jan Kara , linux-fsdevel@vger.kernel.org, Matthew Wilcox On Tue, Dec 22, 2015 at 02:44:40PM -0800, Andrew Morton wrote: > On Fri, 18 Dec 2015 22:22:14 -0700 Ross Zwisler wrote: > > > The function __arch_wb_cache_pmem() was already an internal implementation > > detail of the x86 PMEM API, but this functionality needs to be exported as > > part of the general PMEM API to handle the fsync/msync case for DAX mmaps. > > > > One thing worth noting is that we really do want this to be part of the > > PMEM API as opposed to a stand-alone function like clflush_cache_range() > > because of ordering restrictions. By having wb_cache_pmem() as part of the > > PMEM API we can leave it unordered, call it multiple times to write back > > large amounts of memory, and then order the multiple calls with a single > > wmb_pmem(). > > > > @@ -138,7 +139,7 @@ static inline void arch_clear_pmem(void __pmem *addr, size_t size) > > else > > memset(vaddr, 0, size); > > > > - __arch_wb_cache_pmem(vaddr, size); > > + arch_wb_cache_pmem(addr, size); > > } > > > > reject. I made this > > arch_wb_cache_pmem(vaddr, size); > > due to Dan's > http://www.ozlabs.org/~akpm/mmots/broken-out/pmem-dax-clean-up-clear_pmem.patch The first argument seems wrong to me - in arch_clear_pmem() 'addr' and 'vaddr' are the same address, with the only difference being 'addr' has the __pmem annotation. As of this patch arch_wb_cache_pmem() follows the lead of the rest of the exported PMEM API functions and takes an argument that has the __pmem annotation, so I believe it should be: arch_wb_cache_pmem(addr, size); Without this I think you'll get a sparse warning. This will be fixed up in the next version of my series which build upon Dan's patches. 
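For reference, a simplified (hedged) sketch of the arch_clear_pmem() tail in question - the point is simply that 'addr' carries the __pmem annotation while 'vaddr' is the __force-cast alias, so passing 'addr' matches the annotated arch_wb_cache_pmem() prototype and keeps sparse quiet:

static inline void arch_clear_pmem(void __pmem *addr, size_t size)
{
	void *vaddr = (void __force *)addr;

	/* (the real function also has a clear_page() fast path) */
	memset(vaddr, 0, size);
	arch_wb_cache_pmem(addr, size);	/* annotated pointer, no sparse warning */
}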
_______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from relay.sgi.com (relay2.corp.sgi.com [137.38.102.29]) by oss.sgi.com (Postfix) with ESMTP id 244017F52 for ; Mon, 21 Dec 2015 11:15:29 -0600 (CST) Received: from cuda.sgi.com (cuda3.sgi.com [192.48.176.15]) by relay2.corp.sgi.com (Postfix) with ESMTP id 02830304051 for ; Mon, 21 Dec 2015 09:15:28 -0800 (PST) Received: from mx2.suse.de (mx2.suse.de [195.135.220.15]) by cuda.sgi.com with ESMTP id VPAzzzIG1mzluM5m (version=TLSv1 cipher=ECDHE-RSA-AES256-SHA bits=256 verify=NO) for ; Mon, 21 Dec 2015 09:15:25 -0800 (PST) Date: Mon, 21 Dec 2015 18:15:12 +0100 From: Jan Kara Subject: Re: [PATCH v5 2/7] dax: support dirty DAX entries in radix tree Message-ID: <20151221171512.GA7030@quack.suse.cz> References: <1450502540-8744-1-git-send-email-ross.zwisler@linux.intel.com> <1450502540-8744-3-git-send-email-ross.zwisler@linux.intel.com> MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: <1450502540-8744-3-git-send-email-ross.zwisler@linux.intel.com> List-Id: XFS Filesystem from SGI List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: xfs-bounces@oss.sgi.com Sender: xfs-bounces@oss.sgi.com To: Ross Zwisler Cc: Dave Hansen , "J. Bruce Fields" , linux-mm@kvack.org, Andreas Dilger , "H. Peter Anvin" , Jeff Layton , Dan Williams , linux-nvdimm@lists.01.org, x86@kernel.org, Ingo Molnar , Matthew Wilcox , linux-ext4@vger.kernel.org, xfs@oss.sgi.com, Alexander Viro , Thomas Gleixner , Theodore Ts'o , linux-kernel@vger.kernel.org, Jan Kara , linux-fsdevel@vger.kernel.org, Andrew Morton , Matthew Wilcox On Fri 18-12-15 22:22:15, Ross Zwisler wrote: > Add support for tracking dirty DAX entries in the struct address_space > radix tree. This tree is already used for dirty page writeback, and it > already supports the use of exceptional (non struct page*) entries. > > In order to properly track dirty DAX pages we will insert new exceptional > entries into the radix tree that represent dirty DAX PTE or PMD pages. > These exceptional entries will also contain the writeback addresses for the > PTE or PMD faults that we can use at fsync/msync time. > > There are currently two types of exceptional entries (shmem and shadow) > that can be placed into the radix tree, and this adds a third. We rely on > the fact that only one type of exceptional entry can be found in a given > radix tree based on its usage. This happens for free with DAX vs shmem but > we explicitly prevent shadow entries from being added to radix trees for > DAX mappings. > > The only shadow entries that would be generated for DAX radix trees would > be to track zero page mappings that were created for holes. These pages > would receive minimal benefit from having shadow entries, and the choice > to have only one type of exceptional entry in a given radix tree makes the > logic simpler both in clear_exceptional_entry() and in the rest of DAX. > > Signed-off-by: Ross Zwisler The patch looks good to me. Just one comment: When we have this exclusion between different types of exceptional entries, there is no real need to have separate counters of 'shadow' and 'dax' entries, is there? We can have one 'nrexceptional' counter and don't have to grow struct inode unnecessarily which would be really welcome since DAX isn't a mainstream feature. 
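Concretely, something along these lines (the field name is only a suggestion, not in the posted patch), which would also let the nrshadows/nrdax checks in kill_bdev() and the truncate paths collapse back to a single test:

	/* in struct address_space, protected by tree_lock */
	unsigned long	nrpages;	/* number of total pages */
	unsigned long	nrexceptional;	/* shadow or DAX entries, never both */

	/* e.g. kill_bdev() would then remain: */
	if (mapping->nrpages == 0 && mapping->nrexceptional == 0)
		return;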
Could you please change the code? Thanks! Honza > --- > fs/block_dev.c | 3 ++- > fs/inode.c | 1 + > include/linux/dax.h | 5 ++++ > include/linux/fs.h | 1 + > include/linux/radix-tree.h | 9 +++++++ > mm/filemap.c | 13 +++++++--- > mm/truncate.c | 64 +++++++++++++++++++++++++++------------------- > mm/vmscan.c | 9 ++++++- > 8 files changed, 73 insertions(+), 32 deletions(-) > > diff --git a/fs/block_dev.c b/fs/block_dev.c > index c25639e..226dacc 100644 > --- a/fs/block_dev.c > +++ b/fs/block_dev.c > @@ -75,7 +75,8 @@ void kill_bdev(struct block_device *bdev) > { > struct address_space *mapping = bdev->bd_inode->i_mapping; > > - if (mapping->nrpages == 0 && mapping->nrshadows == 0) > + if (mapping->nrpages == 0 && mapping->nrshadows == 0 && > + mapping->nrdax == 0) > return; > > invalidate_bh_lrus(); > diff --git a/fs/inode.c b/fs/inode.c > index 1be5f90..79d828f 100644 > --- a/fs/inode.c > +++ b/fs/inode.c > @@ -496,6 +496,7 @@ void clear_inode(struct inode *inode) > spin_lock_irq(&inode->i_data.tree_lock); > BUG_ON(inode->i_data.nrpages); > BUG_ON(inode->i_data.nrshadows); > + BUG_ON(inode->i_data.nrdax); > spin_unlock_irq(&inode->i_data.tree_lock); > BUG_ON(!list_empty(&inode->i_data.private_list)); > BUG_ON(!(inode->i_state & I_FREEING)); > diff --git a/include/linux/dax.h b/include/linux/dax.h > index b415e52..e9d57f68 100644 > --- a/include/linux/dax.h > +++ b/include/linux/dax.h > @@ -36,4 +36,9 @@ static inline bool vma_is_dax(struct vm_area_struct *vma) > { > return vma->vm_file && IS_DAX(vma->vm_file->f_mapping->host); > } > + > +static inline bool dax_mapping(struct address_space *mapping) > +{ > + return mapping->host && IS_DAX(mapping->host); > +} > #endif > diff --git a/include/linux/fs.h b/include/linux/fs.h > index 3aa5142..b9ac534 100644 > --- a/include/linux/fs.h > +++ b/include/linux/fs.h > @@ -433,6 +433,7 @@ struct address_space { > /* Protected by tree_lock together with the radix tree */ > unsigned long nrpages; /* number of total pages */ > unsigned long nrshadows; /* number of shadow entries */ > + unsigned long nrdax; /* number of DAX entries */ > pgoff_t writeback_index;/* writeback starts here */ > const struct address_space_operations *a_ops; /* methods */ > unsigned long flags; /* error bits/gfp mask */ > diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h > index 33170db..f793c99 100644 > --- a/include/linux/radix-tree.h > +++ b/include/linux/radix-tree.h > @@ -51,6 +51,15 @@ > #define RADIX_TREE_EXCEPTIONAL_ENTRY 2 > #define RADIX_TREE_EXCEPTIONAL_SHIFT 2 > > +#define RADIX_DAX_MASK 0xf > +#define RADIX_DAX_PTE (0x4 | RADIX_TREE_EXCEPTIONAL_ENTRY) > +#define RADIX_DAX_PMD (0x8 | RADIX_TREE_EXCEPTIONAL_ENTRY) > +#define RADIX_DAX_TYPE(entry) ((__force unsigned long)entry & RADIX_DAX_MASK) > +#define RADIX_DAX_ADDR(entry) ((void __pmem *)((unsigned long)entry & \ > + ~RADIX_DAX_MASK)) > +#define RADIX_DAX_ENTRY(addr, pmd) ((void *)((__force unsigned long)addr | \ > + (pmd ? 
RADIX_DAX_PMD : RADIX_DAX_PTE))) > + > static inline int radix_tree_is_indirect_ptr(void *ptr) > { > return (int)((unsigned long)ptr & RADIX_TREE_INDIRECT_PTR); > diff --git a/mm/filemap.c b/mm/filemap.c > index 1bb0076..167a4d9 100644 > --- a/mm/filemap.c > +++ b/mm/filemap.c > @@ -11,6 +11,7 @@ > */ > #include > #include > +#include > #include > #include > #include > @@ -579,6 +580,12 @@ static int page_cache_tree_insert(struct address_space *mapping, > p = radix_tree_deref_slot_protected(slot, &mapping->tree_lock); > if (!radix_tree_exceptional_entry(p)) > return -EEXIST; > + > + if (dax_mapping(mapping)) { > + WARN_ON(1); > + return -EINVAL; > + } > + > if (shadowp) > *shadowp = p; > mapping->nrshadows--; > @@ -1242,9 +1249,9 @@ repeat: > if (radix_tree_deref_retry(page)) > goto restart; > /* > - * A shadow entry of a recently evicted page, > - * or a swap entry from shmem/tmpfs. Return > - * it without attempting to raise page count. > + * A shadow entry of a recently evicted page, a swap > + * entry from shmem/tmpfs or a DAX entry. Return it > + * without attempting to raise page count. > */ > goto export; > } > diff --git a/mm/truncate.c b/mm/truncate.c > index 76e35ad..1dc9f29 100644 > --- a/mm/truncate.c > +++ b/mm/truncate.c > @@ -9,6 +9,7 @@ > > #include > #include > +#include > #include > #include > #include > @@ -34,31 +35,39 @@ static void clear_exceptional_entry(struct address_space *mapping, > return; > > spin_lock_irq(&mapping->tree_lock); > - /* > - * Regular page slots are stabilized by the page lock even > - * without the tree itself locked. These unlocked entries > - * need verification under the tree lock. > - */ > - if (!__radix_tree_lookup(&mapping->page_tree, index, &node, &slot)) > - goto unlock; > - if (*slot != entry) > - goto unlock; > - radix_tree_replace_slot(slot, NULL); > - mapping->nrshadows--; > - if (!node) > - goto unlock; > - workingset_node_shadows_dec(node); > - /* > - * Don't track node without shadow entries. > - * > - * Avoid acquiring the list_lru lock if already untracked. > - * The list_empty() test is safe as node->private_list is > - * protected by mapping->tree_lock. > - */ > - if (!workingset_node_shadows(node) && > - !list_empty(&node->private_list)) > - list_lru_del(&workingset_shadow_nodes, &node->private_list); > - __radix_tree_delete_node(&mapping->page_tree, node); > + > + if (dax_mapping(mapping)) { > + if (radix_tree_delete_item(&mapping->page_tree, index, entry)) > + mapping->nrdax--; > + } else { > + /* > + * Regular page slots are stabilized by the page lock even > + * without the tree itself locked. These unlocked entries > + * need verification under the tree lock. > + */ > + if (!__radix_tree_lookup(&mapping->page_tree, index, &node, > + &slot)) > + goto unlock; > + if (*slot != entry) > + goto unlock; > + radix_tree_replace_slot(slot, NULL); > + mapping->nrshadows--; > + if (!node) > + goto unlock; > + workingset_node_shadows_dec(node); > + /* > + * Don't track node without shadow entries. > + * > + * Avoid acquiring the list_lru lock if already untracked. > + * The list_empty() test is safe as node->private_list is > + * protected by mapping->tree_lock. 
> + */ > + if (!workingset_node_shadows(node) && > + !list_empty(&node->private_list)) > + list_lru_del(&workingset_shadow_nodes, > + &node->private_list); > + __radix_tree_delete_node(&mapping->page_tree, node); > + } > unlock: > spin_unlock_irq(&mapping->tree_lock); > } > @@ -228,7 +237,8 @@ void truncate_inode_pages_range(struct address_space *mapping, > int i; > > cleancache_invalidate_inode(mapping); > - if (mapping->nrpages == 0 && mapping->nrshadows == 0) > + if (mapping->nrpages == 0 && mapping->nrshadows == 0 && > + mapping->nrdax == 0) > return; > > /* Offsets within partial pages */ > @@ -423,7 +433,7 @@ void truncate_inode_pages_final(struct address_space *mapping) > smp_rmb(); > nrshadows = mapping->nrshadows; > > - if (nrpages || nrshadows) { > + if (nrpages || nrshadows || mapping->nrdax) { > /* > * As truncation uses a lockless tree lookup, cycle > * the tree lock to make sure any ongoing tree > diff --git a/mm/vmscan.c b/mm/vmscan.c > index 2aec424..8071956 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -46,6 +46,7 @@ > #include > #include > #include > +#include > > #include > #include > @@ -671,9 +672,15 @@ static int __remove_mapping(struct address_space *mapping, struct page *page, > * inode reclaim needs to empty out the radix tree or > * the nodes are lost. Don't plant shadows behind its > * back. > + * > + * We also don't store shadows for DAX mappings because the > + * only page cache pages found in these are zero pages > + * covering holes, and because we don't want to mix DAX > + * exceptional entries and shadow exceptional entries in the > + * same page_tree. > */ > if (reclaimed && page_is_file_cache(page) && > - !mapping_exiting(mapping)) > + !mapping_exiting(mapping) && !dax_mapping(mapping)) > shadow = workingset_eviction(mapping, page); > __delete_from_page_cache(page, shadow, memcg); > spin_unlock_irqrestore(&mapping->tree_lock, flags); > -- > 2.5.0 > > -- Jan Kara SUSE Labs, CR _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from relay.sgi.com (relay3.corp.sgi.com [198.149.34.15]) by oss.sgi.com (Postfix) with ESMTP id 0C1C87F37 for ; Mon, 21 Dec 2015 11:32:09 -0600 (CST) Received: from cuda.sgi.com (cuda1.sgi.com [192.48.157.11]) by relay3.corp.sgi.com (Postfix) with ESMTP id A005DAC002 for ; Mon, 21 Dec 2015 09:32:08 -0800 (PST) Received: from mx2.suse.de (mx2.suse.de [195.135.220.15]) by cuda.sgi.com with ESMTP id BsdSPvm6hAM6dean (version=TLSv1 cipher=ECDHE-RSA-AES256-SHA bits=256 verify=NO) for ; Mon, 21 Dec 2015 09:32:05 -0800 (PST) Date: Mon, 21 Dec 2015 18:32:02 +0100 From: Jan Kara Subject: Re: [PATCH v5 5/7] ext2: call dax_pfn_mkwrite() for DAX fsync/msync Message-ID: <20151221173202.GB7030@quack.suse.cz> References: <1450502540-8744-1-git-send-email-ross.zwisler@linux.intel.com> <1450502540-8744-6-git-send-email-ross.zwisler@linux.intel.com> MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: <1450502540-8744-6-git-send-email-ross.zwisler@linux.intel.com> List-Id: XFS Filesystem from SGI List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: xfs-bounces@oss.sgi.com Sender: xfs-bounces@oss.sgi.com To: Ross Zwisler Cc: Dave Hansen , "J. Bruce Fields" , linux-mm@kvack.org, Andreas Dilger , "H. 
Peter Anvin" , Jeff Layton , Dan Williams , linux-nvdimm@lists.01.org, x86@kernel.org, Ingo Molnar , Matthew Wilcox , linux-ext4@vger.kernel.org, xfs@oss.sgi.com, Alexander Viro , Thomas Gleixner , Theodore Ts'o , linux-kernel@vger.kernel.org, Jan Kara , linux-fsdevel@vger.kernel.org, Andrew Morton , Matthew Wilcox On Fri 18-12-15 22:22:18, Ross Zwisler wrote: > To properly support the new DAX fsync/msync infrastructure filesystems > need to call dax_pfn_mkwrite() so that DAX can track when user pages are > dirtied. The patch looks good to me. You can add: Reviewed-by: Jan Kara Honza > > Signed-off-by: Ross Zwisler > --- > fs/ext2/file.c | 4 +++- > 1 file changed, 3 insertions(+), 1 deletion(-) > > diff --git a/fs/ext2/file.c b/fs/ext2/file.c > index 11a42c5..2c88d68 100644 > --- a/fs/ext2/file.c > +++ b/fs/ext2/file.c > @@ -102,8 +102,8 @@ static int ext2_dax_pfn_mkwrite(struct vm_area_struct *vma, > { > struct inode *inode = file_inode(vma->vm_file); > struct ext2_inode_info *ei = EXT2_I(inode); > - int ret = VM_FAULT_NOPAGE; > loff_t size; > + int ret; > > sb_start_pagefault(inode->i_sb); > file_update_time(vma->vm_file); > @@ -113,6 +113,8 @@ static int ext2_dax_pfn_mkwrite(struct vm_area_struct *vma, > size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT; > if (vmf->pgoff >= size) > ret = VM_FAULT_SIGBUS; > + else > + ret = dax_pfn_mkwrite(vma, vmf); > > up_read(&ei->dax_sem); > sb_end_pagefault(inode->i_sb); > -- > 2.5.0 > > -- Jan Kara SUSE Labs, CR _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from relay.sgi.com (relay2.corp.sgi.com [137.38.102.29]) by oss.sgi.com (Postfix) with ESMTP id 267AA7F37 for ; Mon, 21 Dec 2015 11:49:04 -0600 (CST) Received: from cuda.sgi.com (cuda3.sgi.com [192.48.176.15]) by relay2.corp.sgi.com (Postfix) with ESMTP id 06E1C304053 for ; Mon, 21 Dec 2015 09:49:03 -0800 (PST) Received: from mail-oi0-f42.google.com (mail-oi0-f42.google.com [209.85.218.42]) by cuda.sgi.com with ESMTP id GHSVCeLgVCZ2EuYB (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128 verify=NO) for ; Mon, 21 Dec 2015 09:49:01 -0800 (PST) Received: by mail-oi0-f42.google.com with SMTP id y66so95616838oig.0 for ; Mon, 21 Dec 2015 09:49:01 -0800 (PST) MIME-Version: 1.0 In-Reply-To: <20151221170545.GA13494@linux.intel.com> References: <1450502540-8744-1-git-send-email-ross.zwisler@linux.intel.com> <1450502540-8744-5-git-send-email-ross.zwisler@linux.intel.com> <20151221170545.GA13494@linux.intel.com> Date: Mon, 21 Dec 2015 09:49:01 -0800 Message-ID: Subject: Re: [PATCH v5 4/7] dax: add support for fsync/sync From: Dan Williams List-Id: XFS Filesystem from SGI List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: xfs-bounces@oss.sgi.com Sender: xfs-bounces@oss.sgi.com To: Ross Zwisler , Dan Williams , "linux-kernel@vger.kernel.org" , "H. Peter Anvin" , "J. Bruce Fields" , Theodore Ts'o , Alexander Viro , Andreas Dilger , Dave Chinner , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4 , linux-fsdevel , Linux MM , "linux-nvdimm@lists.01.org" , X86 ML , XFS Developers , Andrew Morton , Matthew Wilcox , Dave Hansen On Mon, Dec 21, 2015 at 9:05 AM, Ross Zwisler wrote: > On Sat, Dec 19, 2015 at 10:37:46AM -0800, Dan Williams wrote: >> On Fri, Dec 18, 2015 at 9:22 PM, Ross Zwisler >> wrote: [..] 
>> Hi Ross, I should have realized this sooner, but what guarantees that >> the address returned by RADIX_DAX_ADDR(entry) is still valid at this >> point? I think we need to store the sector in the radix tree and then >> perform a new dax_map_atomic() operation to either lookup a valid >> address or fail the sync request. Otherwise, if the device is gone >> we'll crash, or write into some other random vmalloc address space. > > Ah, good point, thank you. v4 of this series is based on a version of > DAX where we aren't properly dealing with PMEM device removal. I've got an > updated version that merges with your dax_map_atomic() changes, and I'll add > this change into v5 which I will send out today. Thank you for the > suggestion. > > One clarification, with the code as it is in v4 we are only doing > clflush/clflushopt/clwb instructions on the kaddr we've stored in the radix > tree, so I don't think that there is actually a risk of us doing a "write into > some other random vmalloc address space"? I think at worse we will end up > clflushing an address that either isn't mapped or has been remapped by someone > else. Or are you worried that the clflush would trigger a cache writeback to > a memory address where writes have side effects, thus triggering the side > effect? > > I definitely think it needs to be fixed, I'm just trying to make sure I > understood your comment. True, this would be flushing an address that was dirtied while valid. Should be ok in practice for now since dax is effectively limited to x86, but we should not be leaning on x86 details in an architecture generic implementation like this. _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from relay.sgi.com (relay2.corp.sgi.com [137.38.102.29]) by oss.sgi.com (Postfix) with ESMTP id 5FD2A29DF5 for ; Mon, 21 Dec 2015 13:27:41 -0600 (CST) Received: from cuda.sgi.com (cuda3.sgi.com [192.48.176.15]) by relay2.corp.sgi.com (Postfix) with ESMTP id 4E232304053 for ; Mon, 21 Dec 2015 11:27:38 -0800 (PST) Received: from mail-yk0-f170.google.com (mail-yk0-f170.google.com [209.85.160.170]) by cuda.sgi.com with ESMTP id TPAoiVMWctfyklI8 (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128 verify=NO) for ; Mon, 21 Dec 2015 11:27:36 -0800 (PST) Received: by mail-yk0-f170.google.com with SMTP id p130so140390335yka.1 for ; Mon, 21 Dec 2015 11:27:36 -0800 (PST) MIME-Version: 1.0 In-Reply-To: <20151221170545.GA13494@linux.intel.com> References: <1450502540-8744-1-git-send-email-ross.zwisler@linux.intel.com> <1450502540-8744-5-git-send-email-ross.zwisler@linux.intel.com> <20151221170545.GA13494@linux.intel.com> Date: Mon, 21 Dec 2015 11:27:35 -0800 Message-ID: Subject: Re: [PATCH v5 4/7] dax: add support for fsync/sync From: Dan Williams List-Id: XFS Filesystem from SGI List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: xfs-bounces@oss.sgi.com Sender: xfs-bounces@oss.sgi.com To: Ross Zwisler , Dan Williams , "linux-kernel@vger.kernel.org" , "H. Peter Anvin" , "J. 
Bruce Fields" , Theodore Ts'o , Alexander Viro , Andreas Dilger , Dave Chinner , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4 , linux-fsdevel , Linux MM , "linux-nvdimm@lists.01.org" , X86 ML , XFS Developers , Andrew Morton , Matthew Wilcox , Dave Hansen On Mon, Dec 21, 2015 at 9:05 AM, Ross Zwisler wrote: > On Sat, Dec 19, 2015 at 10:37:46AM -0800, Dan Williams wrote: >> On Fri, Dec 18, 2015 at 9:22 PM, Ross Zwisler >> wrote: >> > To properly handle fsync/msync in an efficient way DAX needs to track dirty >> > pages so it is able to flush them durably to media on demand. >> > >> > The tracking of dirty pages is done via the radix tree in struct >> > address_space. This radix tree is already used by the page writeback >> > infrastructure for tracking dirty pages associated with an open file, and >> > it already has support for exceptional (non struct page*) entries. We >> > build upon these features to add exceptional entries to the radix tree for >> > DAX dirty PMD or PTE pages at fault time. >> > >> > Signed-off-by: Ross Zwisler >> [..] >> > +static void dax_writeback_one(struct address_space *mapping, pgoff_t index, >> > + void *entry) >> > +{ >> > + struct radix_tree_root *page_tree = &mapping->page_tree; >> > + int type = RADIX_DAX_TYPE(entry); >> > + struct radix_tree_node *node; >> > + void **slot; >> > + >> > + if (type != RADIX_DAX_PTE && type != RADIX_DAX_PMD) { >> > + WARN_ON_ONCE(1); >> > + return; >> > + } >> > + >> > + spin_lock_irq(&mapping->tree_lock); >> > + /* >> > + * Regular page slots are stabilized by the page lock even >> > + * without the tree itself locked. These unlocked entries >> > + * need verification under the tree lock. >> > + */ >> > + if (!__radix_tree_lookup(page_tree, index, &node, &slot)) >> > + goto unlock; >> > + if (*slot != entry) >> > + goto unlock; >> > + >> > + /* another fsync thread may have already written back this entry */ >> > + if (!radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE)) >> > + goto unlock; >> > + >> > + radix_tree_tag_clear(page_tree, index, PAGECACHE_TAG_TOWRITE); >> > + >> > + if (type == RADIX_DAX_PMD) >> > + wb_cache_pmem(RADIX_DAX_ADDR(entry), PMD_SIZE); >> > + else >> > + wb_cache_pmem(RADIX_DAX_ADDR(entry), PAGE_SIZE); >> >> Hi Ross, I should have realized this sooner, but what guarantees that >> the address returned by RADIX_DAX_ADDR(entry) is still valid at this >> point? I think we need to store the sector in the radix tree and then >> perform a new dax_map_atomic() operation to either lookup a valid >> address or fail the sync request. Otherwise, if the device is gone >> we'll crash, or write into some other random vmalloc address space. > > Ah, good point, thank you. v4 of this series is based on a version of > DAX where we aren't properly dealing with PMEM device removal. I've got an > updated version that merges with your dax_map_atomic() changes, and I'll add > this change into v5 which I will send out today. Thank you for the > suggestion. To make the merge simpler you could skip the rebase for now and just call blk_queue_enter() / blk_queue_exit() around the calls to wb_cache_pmem. _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs
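(Following up on the blk_queue_enter() / blk_queue_exit() suggestion above - a hedged sketch of what bracketing the flush with a queue reference could look like inside dax_writeback_one(). It assumes the v4.4-era blk_queue_enter(q, gfp) signature, and the way 'q' is obtained here is illustrative only, not taken from the posted series:)

	struct request_queue *q = bdev_get_queue(mapping->host->i_sb->s_bdev);

	if (blk_queue_enter(q, GFP_NOWAIT))
		goto unlock;	/* device is being torn down, skip the flush */

	if (type == RADIX_DAX_PMD)
		wb_cache_pmem(RADIX_DAX_ADDR(entry), PMD_SIZE);
	else
		wb_cache_pmem(RADIX_DAX_ADDR(entry), PAGE_SIZE);

	blk_queue_exit(q);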