From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Date: Mon, 21 Dec 2015 10:05:45 -0700
From: Ross Zwisler
Subject: Re: [PATCH v5 4/7] dax: add support for fsync/sync
Message-ID: <20151221170545.GA13494@linux.intel.com>
References: <1450502540-8744-1-git-send-email-ross.zwisler@linux.intel.com>
 <1450502540-8744-5-git-send-email-ross.zwisler@linux.intel.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To:
Sender: owner-linux-mm@kvack.org
To: Dan Williams
Cc: Ross Zwisler, "linux-kernel@vger.kernel.org", "H. Peter Anvin",
 "J. Bruce Fields", Theodore Ts'o, Alexander Viro, Andreas Dilger,
 Dave Chinner, Ingo Molnar, Jan Kara, Jeff Layton, Matthew Wilcox,
 Thomas Gleixner, linux-ext4, linux-fsdevel, Linux MM,
 "linux-nvdimm@lists.01.org", X86 ML, XFS Developers, Andrew Morton,
 Matthew Wilcox, Dave Hansen
List-ID:

On Sat, Dec 19, 2015 at 10:37:46AM -0800, Dan Williams wrote:
> On Fri, Dec 18, 2015 at 9:22 PM, Ross Zwisler wrote:
> > To properly handle fsync/msync in an efficient way DAX needs to track dirty
> > pages so it is able to flush them durably to media on demand.
> >
> > The tracking of dirty pages is done via the radix tree in struct
> > address_space. This radix tree is already used by the page writeback
> > infrastructure for tracking dirty pages associated with an open file, and
> > it already has support for exceptional (non struct page*) entries. We
> > build upon these features to add exceptional entries to the radix tree for
> > DAX dirty PMD or PTE pages at fault time.
> >
> > Signed-off-by: Ross Zwisler
> [..]
> > +static void dax_writeback_one(struct address_space *mapping, pgoff_t index,
> > +		void *entry)
> > +{
> > +	struct radix_tree_root *page_tree = &mapping->page_tree;
> > +	int type = RADIX_DAX_TYPE(entry);
> > +	struct radix_tree_node *node;
> > +	void **slot;
> > +
> > +	if (type != RADIX_DAX_PTE && type != RADIX_DAX_PMD) {
> > +		WARN_ON_ONCE(1);
> > +		return;
> > +	}
> > +
> > +	spin_lock_irq(&mapping->tree_lock);
> > +	/*
> > +	 * Regular page slots are stabilized by the page lock even
> > +	 * without the tree itself locked. These unlocked entries
> > +	 * need verification under the tree lock.
> > +	 */
> > +	if (!__radix_tree_lookup(page_tree, index, &node, &slot))
> > +		goto unlock;
> > +	if (*slot != entry)
> > +		goto unlock;
> > +
> > +	/* another fsync thread may have already written back this entry */
> > +	if (!radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE))
> > +		goto unlock;
> > +
> > +	radix_tree_tag_clear(page_tree, index, PAGECACHE_TAG_TOWRITE);
> > +
> > +	if (type == RADIX_DAX_PMD)
> > +		wb_cache_pmem(RADIX_DAX_ADDR(entry), PMD_SIZE);
> > +	else
> > +		wb_cache_pmem(RADIX_DAX_ADDR(entry), PAGE_SIZE);
>
> Hi Ross, I should have realized this sooner, but what guarantees that
> the address returned by RADIX_DAX_ADDR(entry) is still valid at this
> point? I think we need to store the sector in the radix tree and then
> perform a new dax_map_atomic() operation to either lookup a valid
> address or fail the sync request. Otherwise, if the device is gone
> we'll crash, or write into some other random vmalloc address space.

Ah, good point, thank you. v4 of this series is based on a version of DAX
where we aren't properly dealing with PMEM device removal. I've got an
updated version that merges with your dax_map_atomic() changes, and I'll add
this change into v5 which I will send out today. Thank you for the
suggestion.
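For concreteness, here is a rough sketch of the direction I have in mind for
v5, assuming the blk_dax_ctl / dax_map_atomic() interface from your series.
The RADIX_DAX_SECTOR() helper, the bdev argument and the exact error handling
are illustrative only, not final code:

static int dax_writeback_one(struct block_device *bdev,
		struct address_space *mapping, pgoff_t index, void *entry)
{
	struct radix_tree_root *page_tree = &mapping->page_tree;
	int type = RADIX_DAX_TYPE(entry);
	struct radix_tree_node *node;
	struct blk_dax_ctl dax;
	void **slot;
	int ret = 0;

	if (type != RADIX_DAX_PTE && type != RADIX_DAX_PMD) {
		WARN_ON_ONCE(1);
		return -EIO;
	}

	spin_lock_irq(&mapping->tree_lock);
	/* revalidate the unlocked entry under tree_lock, as before */
	if (!__radix_tree_lookup(page_tree, index, &node, &slot))
		goto unlock;
	if (*slot != entry)
		goto unlock;

	/* another fsync thread may have already written back this entry */
	if (!radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE))
		goto unlock;

	radix_tree_tag_clear(page_tree, index, PAGECACHE_TAG_TOWRITE);

	/* the radix tree entry now carries a sector, not a kernel address */
	dax.sector = RADIX_DAX_SECTOR(entry);
	dax.size = (type == RADIX_DAX_PMD) ? PMD_SIZE : PAGE_SIZE;
	spin_unlock_irq(&mapping->tree_lock);

	/*
	 * Re-map before flushing: if the device has gone away,
	 * dax_map_atomic() fails and we fail the sync request instead
	 * of touching a stale virtual address.
	 */
	ret = dax_map_atomic(bdev, &dax);
	if (ret < 0)
		return ret;

	wb_cache_pmem(dax.addr, dax.size);

	dax_unmap_atomic(bdev, &dax);
	return ret;

 unlock:
	spin_unlock_irq(&mapping->tree_lock);
	return ret;
}

The key difference from v4 is that a stale mapping turns into a failed sync
rather than a flush of an arbitrary address.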
One clarification: with the code as it is in v4 we are only doing
clflush/clflushopt/clwb instructions on the kaddr we've stored in the radix
tree, so I don't think there is actually a risk of us doing a "write into
some other random vmalloc address space"? I think at worst we will end up
clflushing an address that either isn't mapped or has been remapped by
someone else.

Or are you worried that the clflush would trigger a cache writeback to a
memory address where writes have side effects, thus triggering the side
effect? I definitely think it needs to be fixed, I'm just trying to make sure
I understood your comment.