From: Boaz Harrosh
Date: Sun, 14 Feb 2016 10:51:07 +0200
Subject: Re: Another proposal for DAX fault locking

On 02/11/2016 12:38 PM, Jan Kara wrote:
> On Wed 10-02-16 19:38:21, Boaz Harrosh wrote:
>> On 02/09/2016 07:24 PM, Jan Kara wrote:
>>> Hello,
>>> <>
>>>
>>> DAX will have an array of mutexes (the array can be made per device but
>>> initially a global one should be OK). We will use mutexes in the array as a
>>> replacement for the page lock - we will use hashfn(mapping, index) to get
>>> the particular mutex protecting our offset in the mapping. On fault / page
>>> mkwrite, we'll grab the mutex similarly to the page lock and release it once
>>> we are done updating page tables. This deals with races in [1]. When flushing
>>> caches we grab the mutex before clearing the writeable bit in page tables
>>> and clearing the dirty bit in the radix tree, and drop it after we have
>>> flushed caches for the pfn. This deals with races in [2].
>>>
>>> Thoughts?
>>>
>>
>> You could also use one of the radix-tree's special bits as a bit lock.
>> So no need for any extra allocations.
>
> Yes and I've suggested that once as well. But since we need sleeping
> locks, you need some wait queues somewhere as well. So some allocations are
> going to be needed anyway.

They are already sleeping locks, and all the proper "wait queues" are
already in place. I'm talking about:

lock:
	err = wait_on_bit_lock(&some_long, SOME_BIT_LOCK, ...);

unlock:
	WARN_ON(!test_and_clear_bit(SOME_BIT_LOCK, &some_long));
	wake_up_bit(&some_long, SOME_BIT_LOCK);

> And mutexes have much better properties than

Just saying that page locks are implemented exactly this way these days,
so these are the performance characteristics we already know. (You are
replacing page locks, no?)

> bit-locks so I prefer mutexes over cramming bit locks into radix tree. Plus
> you'd have to be careful so that someone doesn't remove the bit from the
> radix tree while you are working with it.

Sure! "Need to be careful" is our middle name.

That said, it is your call. Thank you for working on this. Your plan
sounds very good as well, and is very much needed, because DAX's mmap
performance is a success story right now.

[Maybe one small enhancement: perhaps allocate an array of mutexes per
NUMA node and access the proper array through numa_node_id()]

> 								Honza

Thanks
Boaz
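A minimal, self-contained sketch of the bit-lock pattern Boaz's fragments describe, assuming a hypothetical DAX_ENTRY_LOCK bit number and an unsigned long word associated with the radix-tree entry (all names are illustrative, not from the mails):

```c
#include <linux/bitops.h>
#include <linux/sched.h>
#include <linux/wait.h>

#define DAX_ENTRY_LOCK	0	/* hypothetical bit number used as the lock */

/* Sleep until the bit is clear, then atomically set it (the "page lock"). */
static int dax_entry_lock_bit(unsigned long *word)
{
	return wait_on_bit_lock(word, DAX_ENTRY_LOCK, TASK_UNINTERRUPTIBLE);
}

/* Clear the bit and wake anybody sleeping in wait_on_bit_lock(). */
static void dax_entry_unlock_bit(unsigned long *word)
{
	WARN_ON(!test_and_clear_bit(DAX_ENTRY_LOCK, word));
	smp_mb__after_atomic();
	wake_up_bit(word, DAX_ENTRY_LOCK);
}
```

This is exactly how the page lock's bit-lock works underneath: the wait queues come from a shared hashed table inside wait_on_bit_lock()/wake_up_bit(), which is Boaz's point that no extra allocations are needed.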
From: Cedric Blancher
Date: Thu, 11 Feb 2016 22:05:24 +0100
Subject: Re: Another proposal for DAX fault locking

The Solaris 11 sources are still available at Illumos.org (Illumos is
what OpenSolaris once was, minus Sun's bug database).

Also, if you keep performance in mind, remember that the world is moving
towards many cores with many hardware threads per core, so optimising for
the "two cores with low latency" use case is wrong like a sin. More likely
is the "8 cores with 4 threads per core" use case for benchmarking,
because that's what we will end up with even in low-end hardware soon,
maybe with variable bandwidth between cores if something like ScaleMP
is used.

Ced

On 11 February 2016 at 11:55, Jan Kara wrote:
> On Wed 10-02-16 23:39:43, Cedric Blancher wrote:
>> AFAIK Solaris 11 uses a sparse tree instead of an array. Solves the
>> scalability problem AND deals with variable page size.
>
> Well, but then you have to have this locking tree for every inode so the
> memory overhead is relatively large, no? I've played with range locking of
> mappings in the past but its performance was not stellar. Do you have any
> reference for what Solaris does?
>
> 								Honza
>
>> [...]

--
Cedric Blancher
Institute Pasteur
From: Jan Kara
Date: Thu, 11 Feb 2016 12:15:38 +0100
Subject: Re: Another proposal for DAX fault locking

On Wed 10-02-16 16:32:53, Ross Zwisler wrote:
> On Thu, Feb 11, 2016 at 09:09:53AM +1100, Dave Chinner wrote:
>> [...]
>> So how does that file+offset hash work when trying to lock different
>> ranges? file+offset hashing to determine the lock to use only works
>> if we are dealing with fixed size ranges that the locks affect.
>> e.g. offset has 4k granularity for single page faults, but we also
>> need to handle 2MB granularity for huge page faults, and IIRC 1GB
>> granularity for giant page faults...
>>
>> What's the plan here?
>
> I wonder if it makes sense to tie the locking in with the radix tree?
> Meaning, instead of having an array of mutexes, we lock based on the radix
> tree entry.
>
> Right now we already have to check for PTE and PMD entries in the radix
> tree, and with Matthew's suggested radix tree changes a lookup of a random
> address would give you the appropriate PMD or PUD entry, if one was
> present.
>
> This sort of solves the need for having a hash function that works on
> file+offset - that's all already there when using the radix tree...
Yeah, so we need to be careful that there are no aliasing issues (i.e.,
you do not have PTE and PMD entries covering the same offset). Other than
that, using the radix tree entry (or its offset - you need to somehow map
the entry to the mutex anyway) as a basis for locking should deal with the
issues with different page sizes.

We will have to be careful, e.g. when allocating blocks for a PMD fault.
We would have to insert a PMD entry, lock it (so all newcomers will see
the entry and block on it), walk the whole range the fault covers and
clear out the entries we find, waiting on them if they are locked - lock
aliasing may be an issue here - and only after that can we proceed with
the fault. It is more complex than I'd wish, but it is doable and I don't
have anything better.

								Honza
--
Jan Kara
SUSE Labs, CR
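A rough sketch of the PMD-fault protocol Jan describes, with hypothetical helpers dax_make_locked_pmd_entry(), dax_entry_locked() and dax_wait_entry() standing in for details the mail leaves open; radix_tree_preload() and restart-after-wait subtleties are elided, so treat this as illustration rather than the eventual implementation:

```c
#include <linux/fs.h>
#include <linux/radix-tree.h>

/* Number of 4k slots covered by one PMD entry (512 on x86-64). */
#define DAX_PMD_NR	(PMD_SIZE >> PAGE_SHIFT)

/*
 * Install a locked PMD entry at the (PMD-aligned) 'index', then sweep the
 * range it covers, evicting old PTE entries and waiting for locked ones.
 */
static int dax_lock_pmd_range(struct address_space *mapping, pgoff_t index)
{
	void *entry;
	pgoff_t i;
	int err;

	spin_lock_irq(&mapping->tree_lock);
	/* Newcomers faulting anywhere in the range now find this and block. */
	err = radix_tree_insert(&mapping->page_tree, index,
				dax_make_locked_pmd_entry());
	if (err)
		goto out;	/* e.g. something already sits in the PMD slot */

	for (i = index + 1; i < index + DAX_PMD_NR; i++) {
		entry = radix_tree_lookup(&mapping->page_tree, i);
		if (!entry)
			continue;
		if (dax_entry_locked(entry)) {
			/* Drop the tree lock and sleep until the holder unlocks. */
			spin_unlock_irq(&mapping->tree_lock);
			dax_wait_entry(mapping, i);
			spin_lock_irq(&mapping->tree_lock);
			i--;		/* re-examine this slot */
			continue;
		}
		radix_tree_delete(&mapping->page_tree, i);
	}
out:
	spin_unlock_irq(&mapping->tree_lock);
	return err;
}
```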
From: Jan Kara
Date: Thu, 11 Feb 2016 11:55:10 +0100
Subject: Re: Another proposal for DAX fault locking

On Wed 10-02-16 23:39:43, Cedric Blancher wrote:
> AFAIK Solaris 11 uses a sparse tree instead of an array. Solves the
> scalability problem AND deals with variable page size.

Well, but then you have to have this locking tree for every inode so the
memory overhead is relatively large, no? I've played with range locking of
mappings in the past but its performance was not stellar. Do you have any
reference for what Solaris does?

								Honza

> On 10 February 2016 at 23:09, Dave Chinner wrote:
>> [...]
--
Jan Kara
SUSE Labs, CR
From: Jan Kara
Date: Thu, 11 Feb 2016 11:43:13 +0100
Subject: Re: Another proposal for DAX fault locking

On Wed 10-02-16 12:08:12, Dan Williams wrote:
> On Wed, Feb 10, 2016 at 2:32 AM, Jan Kara wrote:
>> On Tue 09-02-16 10:18:53, Dan Williams wrote:
>>> [...]
>>> I like the fact that this makes the locking explicit and
>>> straightforward rather than something more tricky. Can we make the
>>> hashfn pfn based? I'm thinking we could later reuse this as part of
>>> the solution for eliminating the need to allocate struct page, and we
>>> don't have the 'mapping' available in all paths...
>>
>> So Mel originally suggested to use pfn for hashing as well. My concern
>> with using pfn is that e.g. if you want to fill a hole, you don't have a
>> pfn to lock. What you really need to protect is a logical offset in the
>> file, to serialize allocation of underlying blocks, its mapping into page
>> tables, and flushing the blocks out of caches. So using inode/mapping and
>> offset for the hashing is easier (it isn't obvious to me that we can fix
>> hole-filling races with pfn-based locking).
>>
>> I'm not sure for which other purposes you'd like to use this lock and
>> whether propagating file+offset to those call sites would make sense or
>> not. struct page has the advantage that block mapping information is
>> attached only to it, so when filling a hole we can just allocate some
>> page, attach it to the radix tree, use the page lock for synchronization,
>> and allocate blocks only after that. With pfns we cannot do this...
>
> Right, I am thinking of the direct-I/O path's use of the page lock and
> the occasions where it relies on page->mapping lookups.

Well, the main problem with direct IO is that it takes a page *reference*
via get_user_pages(). So that's something different from the page lock.
Maybe the new lock could be abused to provide the necessary exclusion for
direct IO use as well, but that would need deep thinking... So far it
seems problematic to me.

> Given we already have support for dynamically allocating struct page, I
> don't think we need to have a "pfn to lock" lookup in the initial
> implementation of this locking scheme.

Agreed.

								Honza
--
Jan Kara
SUSE Labs, CR
From: Jan Kara
Date: Thu, 11 Feb 2016 11:38:56 +0100
Subject: Re: Another proposal for DAX fault locking

On Wed 10-02-16 19:38:21, Boaz Harrosh wrote:
> On 02/09/2016 07:24 PM, Jan Kara wrote:
>> [...]
>>
>> Thoughts?
>
> You could also use one of the radix-tree's special bits as a bit lock.
> So no need for any extra allocations.

Yes, and I've suggested that once as well. But since we need sleeping
locks, you need some wait queues somewhere as well, so some allocations
are going to be needed anyway. And mutexes have much better properties
than bit-locks, so I prefer mutexes over cramming bit locks into the
radix tree. Plus you'd have to be careful so that someone doesn't remove
the bit from the radix tree while you are working with it.

								Honza
--
Jan Kara
SUSE Labs, CR
From: Ross Zwisler
Date: Wed, 10 Feb 2016 17:13:00 -0700
Subject: Re: Another proposal for DAX fault locking

On Thu, Feb 11, 2016 at 12:51:05AM +0100, Cedric Blancher wrote:
> There is another "twist" in this game: if there is a 1GB huge page with
> a small 4k page as "overlay" (e.g. mmap() MAP_FIXED somewhere in the
> middle of the 1GB huge page), how is that handled?

Ugh - I'm pretty sure we haven't touched overlays with DAX at all.
The man page says this:

	If the memory region specified by addr and len overlaps pages of
	any existing mapping(s), then the overlapped part of the existing
	mapping(s) will be discarded.

I wonder if this would translate into a hole punch for our DAX mapping,
whatever size it may be, plus an insert? If so, it seems like we just
need to handle each of those operations correctly on their own (hole
punch, insert), and things will take care of themselves?

That being said, I know for a fact that PMD hole punch is currently
broken.
> On 11 February 2016 at 00:44, Ross Zwisler wrote:
>> [...]

From: Cedric Blancher
Date: Thu, 11 Feb 2016 00:51:05 +0100
Subject: Re: Another proposal for DAX fault locking

There is another "twist" in this game: if there is a 1GB huge page with
a small 4k page as "overlay" (e.g. mmap() MAP_FIXED somewhere in the
middle of the 1GB huge page), how is that handled?

Ced
On 11 February 2016 at 00:44, Ross Zwisler wrote:
> [...]

--
Cedric Blancher
Institute Pasteur
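To make the scenario concrete, here is a small userspace sketch of the overlay Cedric describes: a 4k page mapped with MAP_FIXED into the middle of an existing large mapping. The path and sizes are made up for illustration; per mmap(2), the overlapped part of the old mapping is discarded, and the open question in the thread is what that does to a huge-page DAX mapping underneath.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/mnt/dax/file", O_RDWR);	/* hypothetical DAX file */
	if (fd < 0) { perror("open"); return 1; }

	/* Large mapping that the filesystem may back with huge pages. */
	size_t big = 1UL << 30;
	char *base = mmap(NULL, big, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (base == MAP_FAILED) { perror("mmap"); return 1; }

	/* Overlay one 4k page in the middle of the big mapping. */
	char *over = mmap(base + big / 2, 4096, PROT_READ | PROT_WRITE,
			  MAP_SHARED | MAP_FIXED, fd, 0);
	if (over == MAP_FAILED) { perror("mmap MAP_FIXED"); return 1; }

	over[0] = 1;	/* fault the overlay page */
	munmap(base, big);
	close(fd);
	return 0;
}
```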
From: Ross Zwisler
Date: Wed, 10 Feb 2016 16:44:06 -0700
Subject: Re: Another proposal for DAX fault locking

On Tue, Feb 09, 2016 at 06:24:16PM +0100, Jan Kara wrote:
> Hello,
>
> I was thinking about current issues with DAX fault locking [1] (data
> corruption due to racing faults allocating blocks) and also races which
> currently don't allow us to clear dirty tags in the radix tree due to
> races between faults and cache flushing [2]. Both of these exist because
> we don't have an equivalent of the page lock available for DAX. While we
> have a reasonable solution available for problem [1], so far I'm not
> aware of a decent solution for [2]. After briefly discussing the issue
> with Mel he had a bright idea that we could use hashed locks to deal with
> [2] (and I think we can solve [1] with them as well). So my proposal
> looks as follows:
>
> DAX will have an array of mutexes (the array can be made per device but
> initially a global one should be OK). We will use mutexes in the array as
> a replacement for the page lock - we will use hashfn(mapping, index) to
> get the particular mutex protecting our offset in the mapping. On fault /
> page mkwrite, we'll grab the mutex similarly to the page lock and release
> it once we are done updating page tables. This deals with races in [1].
> When flushing caches we grab the mutex before clearing the writeable bit
> in page tables and clearing the dirty bit in the radix tree, and drop it
> after we have flushed caches for the pfn. This deals with races in [2].
>
> Thoughts?
>
> 								Honza
>
> [1] http://oss.sgi.com/archives/xfs/2016-01/msg00575.html
> [2] https://lists.01.org/pipermail/linux-nvdimm/2016-January/004057.html

Overall I think this sounds promising. I think a potential tie-in with the
radix tree would maybe take us in a good direction.

I had another idea of how to solve race #2 that involved sticking a seqlock
around the DAX radix tree + pte_mkwrite() sequence, and on the flushing
side, if you noticed that you've raced against a page fault, just leaving
the dirty page tree entry intact.

I *think* this could work - I'd want to bang on it more - but if we have a
general way of handling DAX locking that we can use instead of solving
these issues one-by-one as they come up, that seems like a much better
route.
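As a concrete reference for the proposal quoted above, a minimal kernel-style sketch of the hashed-mutex array; the table size, the trivial mixing of mapping and index, and every name here are illustrative assumptions, not code from the thread:

```c
#include <linux/fs.h>
#include <linux/hash.h>
#include <linux/kernel.h>
#include <linux/mutex.h>

#define DAX_LOCK_BITS	8	/* 256 locks; the size is a tunable */

static struct mutex dax_entry_locks[1 << DAX_LOCK_BITS];

/* hashfn(mapping, index): pick the mutex standing in for the page lock. */
static struct mutex *dax_entry_lock(struct address_space *mapping, pgoff_t index)
{
	unsigned long v = (unsigned long)mapping + index;

	return &dax_entry_locks[hash_long(v, DAX_LOCK_BITS)];
}

static int __init dax_entry_locks_init(void)
{
	int i;

	for (i = 0; i < ARRAY_SIZE(dax_entry_locks); i++)
		mutex_init(&dax_entry_locks[i]);
	return 0;
}

/* Fault / mkwrite path: take the hashed mutex in place of the page lock. */
static void dax_fault_example(struct address_space *mapping, pgoff_t index)
{
	struct mutex *lock = dax_entry_lock(mapping, index);

	mutex_lock(lock);
	/* ... allocate blocks, update page tables ... */
	mutex_unlock(lock);
}
```

The flushing side would take the same mutex around clearing the writeable bit and the radix-tree dirty tag, which is what closes race [2].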
From: Ross Zwisler
Date: Wed, 10 Feb 2016 16:34:25 -0700
Subject: Re: Another proposal for DAX fault locking

On Wed, Feb 10, 2016 at 11:39:43PM +0100, Cedric Blancher wrote:
> AFAIK Solaris 11 uses a sparse tree instead of an array. Solves the
> scalability problem AND deals with variable page size.

Right - it seems like tying the radix tree into the locking instead of
using an array would have these same benefits.

> [...]
From: Ross Zwisler
Date: Wed, 10 Feb 2016 16:32:53 -0700
Subject: Re: Another proposal for DAX fault locking

On Thu, Feb 11, 2016 at 09:09:53AM +1100, Dave Chinner wrote:
> [...]
> So how does that file+offset hash work when trying to lock different
> ranges? file+offset hashing to determine the lock to use only works
> if we are dealing with fixed size ranges that the locks affect.
> e.g. offset has 4k granularity for single page faults, but we also
> need to handle 2MB granularity for huge page faults, and IIRC 1GB
> granularity for giant page faults...
>
> What's the plan here?
I wonder if it makes sense to tie the locking in with the radix tree?
Meaning, instead of having an array of mutexes, we lock based on the radix
tree entry.

Right now we already have to check for PTE and PMD entries in the radix
tree, and with Matthew's suggested radix tree changes a lookup of a random
address would give you the appropriate PMD or PUD entry, if one was
present.

This sort of solves the need for having a hash function that works on
file+offset - that's all already there when using the radix tree...

From: Cedric Blancher
Date: Wed, 10 Feb 2016 23:39:43 +0100
Subject: Re: Another proposal for DAX fault locking

AFAIK Solaris 11 uses a sparse tree instead of an array. Solves the
scalability problem AND deals with variable page size.

Ced
On 10 February 2016 at 23:09, Dave Chinner wrote:
> [...]

--
Cedric Blancher
Institute Pasteur
From: Dave Chinner
Date: Thu, 11 Feb 2016 09:09:53 +1100
Subject: Re: Another proposal for DAX fault locking

On Wed, Feb 10, 2016 at 11:32:49AM +0100, Jan Kara wrote:
> On Tue 09-02-16 10:18:53, Dan Williams wrote:
>> [...]
>> I like the fact that this makes the locking explicit and
>> straightforward rather than something more tricky. Can we make the
>> hashfn pfn based? I'm thinking we could later reuse this as part of
>> the solution for eliminating the need to allocate struct page, and we
>> don't have the 'mapping' available in all paths...
>
> So Mel originally suggested to use pfn for hashing as well. My concern
> with using pfn is that e.g. if you want to fill a hole, you don't have a
> pfn to lock. What you really need to protect is a logical offset in the
> file, to serialize allocation of underlying blocks, its mapping into page
> tables, and flushing the blocks out of caches. So using inode/mapping and
> offset for the hashing is easier (it isn't obvious to me that we can fix
> hole-filling races with pfn-based locking).

So how does that file+offset hash work when trying to lock different
ranges? file+offset hashing to determine the lock to use only works
if we are dealing with fixed size ranges that the locks affect.
e.g. offset has 4k granularity for single page faults, but we also
need to handle 2MB granularity for huge page faults, and IIRC 1GB
granularity for giant page faults...

What's the plan here?

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
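One conceivable scheme for Dave's question (an assumption, not something proposed in the thread - Jan's later replies instead move toward locking on radix tree entries): hash at the coarsest supported fault granularity, so every fault size touching a region agrees on one lock. A sketch with illustrative names:

```c
#include <linux/fs.h>
#include <linux/hash.h>
#include <linux/mm.h>
#include <linux/mutex.h>

#define DAX_LOCK_BITS	8
static struct mutex dax_entry_locks[1 << DAX_LOCK_BITS];

/*
 * Round the page index down to a PMD boundary before hashing, so a 4k
 * fault and the 2MB fault covering it always contend on the same mutex.
 * The price is false sharing between unrelated 4k faults inside one 2MB
 * region; 1GB faults would need an even coarser rounding (or a second,
 * nested level of locks).
 */
static struct mutex *dax_entry_lock(struct address_space *mapping, pgoff_t index)
{
	pgoff_t pmd_index = index & ~((pgoff_t)PTRS_PER_PMD - 1);

	return &dax_entry_locks[hash_long((unsigned long)mapping + pmd_index,
					  DAX_LOCK_BITS)];
}
```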
From: Dan Williams
Date: Wed, 10 Feb 2016 12:08:12 -0800
Subject: Re: Another proposal for DAX fault locking

On Wed, Feb 10, 2016 at 2:32 AM, Jan Kara wrote:
> On Tue 09-02-16 10:18:53, Dan Williams wrote:
>> [...]
>> I like the fact that this makes the locking explicit and
>> straightforward rather than something more tricky. Can we make the
>> hashfn pfn based? I'm thinking we could later reuse this as part of
>> the solution for eliminating the need to allocate struct page, and we
>> don't have the 'mapping' available in all paths...
>
> So Mel originally suggested to use pfn for hashing as well. My concern
> with using pfn is that e.g. if you want to fill a hole, you don't have a
> pfn to lock. What you really need to protect is a logical offset in the
> file, to serialize allocation of underlying blocks, its mapping into page
> tables, and flushing the blocks out of caches. So using inode/mapping and
> offset for the hashing is easier (it isn't obvious to me that we can fix
> hole-filling races with pfn-based locking).
>
> I'm not sure for which other purposes you'd like to use this lock and
> whether propagating file+offset to those call sites would make sense or
> not.
> struct page has the advantage that block mapping information is attached
> only to it, so when filling a hole we can just allocate some page, attach
> it to the radix tree, use the page lock for synchronization, and allocate
> blocks only after that. With pfns we cannot do this...

Right, I am thinking of the direct-I/O path's use of the page lock and
the occasions where it relies on page->mapping lookups.

Given we already have support for dynamically allocating struct page, I
don't think we need to have a "pfn to lock" lookup in the initial
implementation of this locking scheme.

From: Boaz Harrosh
Date: Wed, 10 Feb 2016 19:38:21 +0200
Subject: Re: Another proposal for DAX fault locking

On 02/09/2016 07:24 PM, Jan Kara wrote:
> [...]
>
> Thoughts?

You could also use one of the radix-tree's special bits as a bit lock.
So no need for any extra allocations.

[The latest page lock is a bit-lock, so performance is the same.]

Thanks
Boaz
From: Jan Kara
Date: Wed, 10 Feb 2016 13:35:18 +0100
Message-ID: <20160210123518.GG12245@quack.suse.cz>
Subject: Re: Another proposal for DAX fault locking
In-Reply-To: <87egck4ukx.fsf@openvz.org>

On Wed 10-02-16 15:29:34, Dmitry Monakhov wrote:
> Agree, only a small note:
> Hashed locks have a side effect on batch locking due to collisions.
> Sometimes we want to lock several pages/entries (migration/
> defragmentation), so we would end up with a deadlock on a hash collision.

Yeah, but at least for the purposes we want the locks for, locking just
one 'page' is enough. If we ever needed to lock more 'pages', we would
have to choose a different locking scheme.

								Honza
--
Jan Kara
SUSE Labs, CR
From: Dmitry Monakhov
Date: Wed, 10 Feb 2016 15:29:34 +0300
Message-ID: <87egck4ukx.fsf@openvz.org>
Subject: Re: Another proposal for DAX fault locking
In-Reply-To: <20160209172416.GB12245@quack.suse.cz>

Jan Kara writes:
> Hello,
> <>
> DAX will have an array of mutexes (the array can be made per device, but
> initially a global one should be OK). We will use the mutexes in the
> array as a replacement for the page lock - we will use hashfn(mapping,
> index) to get the particular mutex protecting our offset in the mapping.
> <>
> Thoughts?

Agree, only a small note:
Hashed locks have a side effect on batch locking due to collisions.
Sometimes we want to lock several pages/entries (migration/
defragmentation), so we would end up with a deadlock on a hash collision.
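To make the collision hazard concrete: with hashed locks, two different
offsets can map to the same mutex, so naive nested locking can either
self-deadlock (the same slot taken twice) or ABBA-deadlock between two
tasks taking the slots in opposite order. A hypothetical helper (nothing
from a posted patch; it assumes some hashed array handing out struct
mutex pointers) would have to dedupe and order the slots:

#include <linux/mutex.h>

static void dax_lock_pair(struct mutex *a, struct mutex *b)
{
	if (a == b) {
		mutex_lock(a);		/* hash collision: one slot covers both */
	} else if (a < b) {		/* fixed address order prevents ABBA */
		mutex_lock(a);
		mutex_lock(b);
	} else {
		mutex_lock(b);
		mutex_lock(a);
	}
}

As Jan notes above, the current plan sidesteps this entirely by only ever
locking one entry at a time.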
From: Jan Kara
Date: Wed, 10 Feb 2016 11:32:49 +0100
Message-ID: <20160210103249.GD12245@quack.suse.cz>
Subject: Re: Another proposal for DAX fault locking

On Tue 09-02-16 10:18:53, Dan Williams wrote:
> <>
> I like the fact that this makes the locking explicit and
> straightforward rather than something more tricky. Can we make the
> hashfn pfn based? I'm thinking we could later reuse this as part of
> the solution for eliminating the need to allocate struct page, and we
> don't have the 'mapping' available in all paths...

So Mel originally suggested to use the pfn for hashing as well. My
concern with using the pfn is that e.g. if you want to fill a hole, you
don't have a pfn to lock. What you really need to protect is a logical
offset in the file, to serialize allocation of the underlying blocks,
their mapping into page tables, and flushing the blocks out of caches.
So using inode/mapping and offset for the hashing is easier (it isn't
obvious to me that we can fix hole-filling races with pfn-based locking).

I'm not sure for which other purposes you'd like to use this lock and
whether propagating file+offset to those call sites would make sense or
not. struct page has the advantage that block mapping information is
attached to it, so when filling a hole, we can just allocate some page,
attach it to the radix tree, use the page lock for synchronization, and
allocate blocks only after that. With pfns we cannot do this...

								Honza
--
Jan Kara
SUSE Labs, CR
From: Jan Kara
Date: Wed, 10 Feb 2016 11:18:57 +0100
Message-ID: <20160210101857.GC12245@quack.suse.cz>
Subject: Re: Another proposal for DAX fault locking
In-Reply-To: <20160210081922.GC4763@suse.de>

On Wed 10-02-16 08:19:22, Mel Gorman wrote:
> On Tue, Feb 09, 2016 at 07:46:05PM +0100, Cedric Blancher wrote:
>> One folly here: arrays of mutexes NEVER work unless you manage to align
>> them so that each one occupies a complete L2/L3 cache line.
>> <>
>
> That is a *potential* performance concern, and I agree with you that
> mutexes false-sharing a cache line would be a problem. <> I'd take a
> performance issue over a data corruption issue any day of the week.

Exactly. We have to add *some* locking to fix the data corruption. Cache
aliasing of the hashed mutexes may be an issue, but I believe the result
will still be better than a single mutex.

								Honza
--
Jan Kara
SUSE Labs, CR
From: Mel Gorman
Date: Wed, 10 Feb 2016 08:19:22 +0000
Message-ID: <20160210081922.GC4763@suse.de>
Subject: Re: Another proposal for DAX fault locking

On Tue, Feb 09, 2016 at 07:46:05PM +0100, Cedric Blancher wrote:
> On 9 February 2016 at 18:24, Jan Kara wrote:
>> <>
>> DAX will have an array of mutexes
>
> One folly here: arrays of mutexes NEVER work unless you manage to align
> them so that each one occupies a complete L2/L3 cache line.
> <>

That is a *potential* performance concern, and I agree with you that
mutexes false-sharing a cache line would be a problem. However, it is a
performance concern that can potentially be alleviated by alternative
hashing, whereas AFAIK the issues being faced currently are data
corruption and functional issues. I'd take a performance issue over a
data corruption issue any day of the week.

--
Mel Gorman
SUSE Labs

From: Cedric Blancher
Date: Tue, 9 Feb 2016 19:46:05 +0100
Subject: Re: Another proposal for DAX fault locking
In-Reply-To: <20160209172416.GB12245@quack.suse.cz>

On 9 February 2016 at 18:24, Jan Kara wrote:
> <>
> DAX will have an array of mutexes

One folly here: arrays of mutexes NEVER work unless you manage to align
them so that each one occupies a complete L2/L3 cache line. Otherwise
the CPUs will fight over the cache lines each time they touch (read or
write) a mutex, and it becomes an O(n)-like scalability problem when
multiple mutexes occupy one cache line. It gets worse the more mutexes
fit into a single cache line, and worse again with the number of CPUs
accessing such contested lines.

Ced
--
Cedric Blancher
Institute Pasteur
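A minimal sketch of the padding Cedric is asking for (illustrative only:
the wrapper struct, array size and init hook are assumptions, not from a
posted patch). ____cacheline_aligned_in_smp rounds each slot up to a
full cache line so contention on one hashed lock cannot false-share with
its neighbours:

#include <linux/cache.h>
#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/mutex.h>

struct dax_lock_slot {
	struct mutex lock;
} ____cacheline_aligned_in_smp;	/* one lock per cache line */

static struct dax_lock_slot dax_locks[256];	/* size is a guess */

static int __init dax_locks_init(void)
{
	int i;

	for (i = 0; i < ARRAY_SIZE(dax_locks); i++)
		mutex_init(&dax_locks[i].lock);
	return 0;
}

The memory cost is the point of the trade-off: 256 slots padded to
64-byte lines is 16KB instead of roughly 8KB unpadded, in exchange for
avoiding the false sharing described above.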
From: Dan Williams
Date: Tue, 9 Feb 2016 10:18:53 -0800
Subject: Re: Another proposal for DAX fault locking
In-Reply-To: <20160209172416.GB12245@quack.suse.cz>

On Tue, Feb 9, 2016 at 9:24 AM, Jan Kara wrote:
> Hello,
> <>
> DAX will have an array of mutexes (the array can be made per device, but
> initially a global one should be OK). We will use the mutexes in the
> array as a replacement for the page lock - we will use hashfn(mapping,
> index) to get the particular mutex protecting our offset in the mapping.
> <>
> Thoughts?

I like the fact that this makes the locking explicit and
straightforward rather than something more tricky. Can we make the
hashfn pfn based? I'm thinking we could later reuse this as part of
the solution for eliminating the need to allocate struct page, and we
don't have the 'mapping' available in all paths...
From: Jan Kara
Date: Tue, 9 Feb 2016 18:24:16 +0100
Message-ID: <20160209172416.GB12245@quack.suse.cz>
Subject: Another proposal for DAX fault locking

Hello,

I was thinking about current issues with DAX fault locking [1] (data
corruption due to racing faults allocating blocks) and also races which
currently don't allow us to clear dirty tags in the radix tree due to
races between faults and cache flushing [2]. Both of these exist because
we don't have an equivalent of the page lock available for DAX. While we
have a reasonable solution available for problem [1], so far I'm not
aware of a decent solution for [2]. After briefly discussing the issue
with Mel, he had a bright idea that we could use hashed locks to deal
with [2] (and I think we can solve [1] with them as well). So my
proposal looks as follows:

DAX will have an array of mutexes (the array can be made per device, but
initially a global one should be OK). We will use the mutexes in the
array as a replacement for the page lock - we will use hashfn(mapping,
index) to get the particular mutex protecting our offset in the mapping.
On fault / page mkwrite, we'll grab the mutex similarly to the page lock
and release it once we are done updating the page tables. This deals
with the races in [1]. When flushing caches, we grab the mutex before
clearing the writeable bit in the page tables and the dirty bit in the
radix tree, and drop it after we have flushed the caches for the pfn.
This deals with the races in [2].

Thoughts?

								Honza

[1] http://oss.sgi.com/archives/xfs/2016-01/msg00575.html
[2] https://lists.01.org/pipermail/linux-nvdimm/2016-January/004057.html
--
Jan Kara
SUSE Labs, CR
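Read as code, the proposal might look roughly like this (a sketch under
assumptions: the hash function, table size and all names are
illustrative, and each slot would need mutex_init() at boot as in the
padding sketch above):

#include <linux/fs.h>
#include <linux/hash.h>
#include <linux/mutex.h>

#define DAX_LOCK_BITS	8	/* 256 slots; purely a starting guess */

static struct mutex dax_entry_locks[1 << DAX_LOCK_BITS];

/* hashfn(mapping, index): pick the mutex standing in for the page lock. */
static struct mutex *dax_entry_lock(struct address_space *mapping,
				    pgoff_t index)
{
	unsigned long key = (unsigned long)mapping ^ (index << 1);

	return &dax_entry_locks[hash_long(key, DAX_LOCK_BITS)];
}

/*
 * Fault / page_mkwrite path: hold the hashed mutex across block
 * allocation and the page table update, exactly where the page lock
 * would normally be held. The flush path takes the same mutex around
 * clearing the writeable bit and the radix-tree dirty tag.
 */
static void dax_fault_one(struct address_space *mapping, pgoff_t index)
{
	struct mutex *lock = dax_entry_lock(mapping, index);

	mutex_lock(lock);
	/* ... allocate blocks, install the pfn in the page tables ... */
	mutex_unlock(lock);
}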
From: Dave Chinner
Date: Thu, 11 Feb 2016 09:09:53 +1100
Message-ID: <20160210220953.GW19486@dastard>
Subject: Re: Another proposal for DAX fault locking
In-Reply-To: <20160210103249.GD12245@quack.suse.cz>

On Wed, Feb 10, 2016 at 11:32:49AM +0100, Jan Kara wrote:
> On Tue 09-02-16 10:18:53, Dan Williams wrote:
>> <>
>> Can we make the hashfn pfn based?
>
> So Mel originally suggested to use the pfn for hashing as well. My
> concern with using the pfn is that e.g. if you want to fill a hole, you
> don't have a pfn to lock. What you really need to protect is a logical
> offset in the file, to serialize allocation of the underlying blocks,
> their mapping into page tables, and flushing the blocks out of caches.
> So using inode/mapping and offset for the hashing is easier (it isn't
> obvious to me that we can fix hole-filling races with pfn-based locking).

So how does that file+offset hash work when trying to lock different
ranges? file+offset hashing to determine the lock to use only works if
we are dealing with fixed-size ranges that the locks affect. e.g. the
offset has 4k granularity for single page faults, but we also need to
handle 2MB granularity for huge page faults, and IIRC 1GB granularity
for giant page faults...

What's the plan here?

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
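One conceivable answer to the granularity question (purely illustrative,
not something proposed anywhere in this thread) is to hash an index
rounded down to the largest fault granularity, reusing the
dax_entry_lock() helper from the earlier sketch, so that a 4k fault and
a PMD fault over the same 2MB region always contend on the same slot:

/* DAX_PMD_ORDER assumes 4k base pages; 512 x 4k = 2MB. */
#define DAX_PMD_ORDER	9

static struct mutex *dax_range_lock(struct address_space *mapping,
				    pgoff_t index)
{
	pgoff_t aligned = index & ~((1UL << DAX_PMD_ORDER) - 1);

	return dax_entry_lock(mapping, aligned);
}

The cost is that all 512 small pages under one PMD serialize on a single
lock, and the same rounding argument would have to be repeated at PUD
(1GB) granularity; whether that is acceptable is exactly what Dave is
asking.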
From: Cedric Blancher
Date: Wed, 10 Feb 2016 23:39:43 +0100
Subject: Re: Another proposal for DAX fault locking
In-Reply-To: <20160210220953.GW19486@dastard>

AFAIK Solaris 11 uses a sparse tree instead of an array. That solves the
scalability problem AND deals with variable page sizes.

Ced

On 10 February 2016 at 23:09, Dave Chinner wrote:
> <>
> So how does that file+offset hash work when trying to lock different
> ranges? file+offset hashing to determine the lock to use only works if
> we are dealing with fixed-size ranges that the locks affect. e.g. the
> offset has 4k granularity for single page faults, but we also need to
> handle 2MB granularity for huge page faults, and IIRC 1GB granularity
> for giant page faults...
>
> What's the plan here?
--
Cedric Blancher
Institute Pasteur
> -- > Dave Chinner > david@fromorbit.com > -- > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Cedric Blancher Institute Pasteur From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1750858AbcBJXdI (ORCPT ); Wed, 10 Feb 2016 18:33:08 -0500 Received: from mga09.intel.com ([134.134.136.24]:55876 "EHLO mga09.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750713AbcBJXdG (ORCPT ); Wed, 10 Feb 2016 18:33:06 -0500 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.22,427,1449561600"; d="scan'208";a="743949771" Date: Wed, 10 Feb 2016 16:32:53 -0700 From: Ross Zwisler To: Dave Chinner Cc: Jan Kara , Dan Williams , Ross Zwisler , "linux-kernel@vger.kernel.org" , linux-fsdevel , Linux MM , "linux-nvdimm@lists.01.org" , Mel Gorman , Matthew Wilcox Subject: Re: Another proposal for DAX fault locking Message-ID: <20160210233253.GB30938@linux.intel.com> Mail-Followup-To: Ross Zwisler , Dave Chinner , Jan Kara , Dan Williams , "linux-kernel@vger.kernel.org" , linux-fsdevel , Linux MM , "linux-nvdimm@lists.01.org" , Mel Gorman , Matthew Wilcox References: <20160209172416.GB12245@quack.suse.cz> <20160210103249.GD12245@quack.suse.cz> <20160210220953.GW19486@dastard> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20160210220953.GW19486@dastard> User-Agent: Mutt/1.5.24 (2015-08-30) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Feb 11, 2016 at 09:09:53AM +1100, Dave Chinner wrote: > On Wed, Feb 10, 2016 at 11:32:49AM +0100, Jan Kara wrote: > > On Tue 09-02-16 10:18:53, Dan Williams wrote: > > > On Tue, Feb 9, 2016 at 9:24 AM, Jan Kara wrote: > > > > Hello, > > > > > > > > I was thinking about current issues with DAX fault locking [1] (data > > > > corruption due to racing faults allocating blocks) and also races which > > > > currently don't allow us to clear dirty tags in the radix tree due to races > > > > between faults and cache flushing [2]. Both of these exist because we don't > > > > have an equivalent of page lock available for DAX. While we have a > > > > reasonable solution available for problem [1], so far I'm not aware of a > > > > decent solution for [2]. After briefly discussing the issue with Mel he had > > > > a bright idea that we could used hashed locks to deal with [2] (and I think > > > > we can solve [1] with them as well). So my proposal looks as follows: > > > > > > > > DAX will have an array of mutexes (the array can be made per device but > > > > initially a global one should be OK). We will use mutexes in the array as a > > > > replacement for page lock - we will use hashfn(mapping, index) to get > > > > particular mutex protecting our offset in the mapping. On fault / page > > > > mkwrite, we'll grab the mutex similarly to page lock and release it once we > > > > are done updating page tables. This deals with races in [1]. When flushing > > > > caches we grab the mutex before clearing writeable bit in page tables > > > > and clearing dirty bit in the radix tree and drop it after we have flushed > > > > caches for the pfn. This deals with races in [2]. > > > > > > > > Thoughts? > > > > > > > > > > I like the fact that this makes the locking explicit and > > > straightforward rather than something more tricky. 
Can we make the > > > hashfn pfn based? I'm thinking we could later reuse this as part of > > > the solution for eliminating the need to allocate struct page, and we > > > don't have the 'mapping' available in all paths... > > > > So Mel originally suggested to use pfn for hashing as well. My concern with > > using pfn is that e.g. if you want to fill a hole, you don't have a pfn to > > lock. What you really need to protect is a logical offset in the file to > > serialize allocation of underlying blocks, its mapping into page tables, > > and flushing the blocks out of caches. So using inode/mapping and offset > > for the hashing is easier (it isn't obvious to me we can fix hole filling > > races with pfn-based locking). > > So how does that file+offset hash work when trying to lock different > ranges? file+offset hashing to determine the lock to use only works > if we are dealing with fixed size ranges that the locks affect. > e.g. offset has 4k granularity for a single page faults, but we also > need to handle 2MB granularity for huge page faults, and IIRC 1GB > granularity for giant page faults... > > What's the plan here? I wonder if it makes sense to tie the locking in with the radix tree? Meaning, instead of having an array of mutexes, we lock based on the radix tree entry. Right now we already have to check for PTE and PMD entries in the radix tree, and with Matthew's suggested radix tree changes a lookup of a random address would give you the appropriate PMD or PUD entry, if one was present. This sort of solves the need for having a hash function that works on file+offset - that's all already there when using the radix tree... From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752480AbcBJXej (ORCPT ); Wed, 10 Feb 2016 18:34:39 -0500 Received: from mga09.intel.com ([134.134.136.24]:62113 "EHLO mga09.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752197AbcBJXeh (ORCPT ); Wed, 10 Feb 2016 18:34:37 -0500 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.22,427,1449561600"; d="scan'208";a="743950370" Date: Wed, 10 Feb 2016 16:34:25 -0700 From: Ross Zwisler To: Cedric Blancher Cc: Dave Chinner , Jan Kara , Dan Williams , Ross Zwisler , "linux-kernel@vger.kernel.org" , linux-fsdevel , Linux MM , "linux-nvdimm@lists.01.org" , Mel Gorman , Matthew Wilcox Subject: Re: Another proposal for DAX fault locking Message-ID: <20160210233425.GC30938@linux.intel.com> Mail-Followup-To: Ross Zwisler , Cedric Blancher , Dave Chinner , Jan Kara , Dan Williams , "linux-kernel@vger.kernel.org" , linux-fsdevel , Linux MM , "linux-nvdimm@lists.01.org" , Mel Gorman , Matthew Wilcox References: <20160209172416.GB12245@quack.suse.cz> <20160210103249.GD12245@quack.suse.cz> <20160210220953.GW19486@dastard> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.24 (2015-08-30) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Feb 10, 2016 at 11:39:43PM +0100, Cedric Blancher wrote: > AFAIK Solaris 11 uses a sparse tree instead of a array. Solves the > scalability problem AND deals with variable page size. Right - seems like tying the radix tree into the locking instead of using an array would have these same benefits. 
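Just to make the comparison concrete, the array variant boils down to something like this - a completely untested sketch, with the hash and array size as placeholders rather than anything from a posted patch:

	#include <linux/fs.h>
	#include <linux/hash.h>
	#include <linux/mutex.h>

	#define DAX_LOCK_BITS	10
	#define DAX_NR_LOCKS	(1UL << DAX_LOCK_BITS)

	/* Global array standing in for the per-offset page lock. */
	static struct mutex dax_locks[DAX_NR_LOCKS];

	/* Placeholder for Jan's hashfn(mapping, index). */
	static struct mutex *dax_lock_for(struct address_space *mapping,
					  pgoff_t index)
	{
		unsigned long v = (unsigned long)mapping ^ index;

		return &dax_locks[hash_long(v, DAX_LOCK_BITS)];
	}

	/* Called once from DAX init. */
	static int __init dax_locks_init(void)
	{
		int i;

		for (i = 0; i < DAX_NR_LOCKS; i++)
			mutex_init(&dax_locks[i]);
		return 0;
	}

The radix tree version would replace dax_lock_for() with a lookup of the entry itself, so the tree does the work the hash does here and the lock granularity can follow the entry size.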
> On 10 February 2016 at 23:09, Dave Chinner wrote: > > On Wed, Feb 10, 2016 at 11:32:49AM +0100, Jan Kara wrote: > >> On Tue 09-02-16 10:18:53, Dan Williams wrote: > >> > On Tue, Feb 9, 2016 at 9:24 AM, Jan Kara wrote: > >> > > Hello, > >> > > > >> > > I was thinking about current issues with DAX fault locking [1] (data > >> > > corruption due to racing faults allocating blocks) and also races which > >> > > currently don't allow us to clear dirty tags in the radix tree due to races > >> > > between faults and cache flushing [2]. Both of these exist because we don't > >> > > have an equivalent of page lock available for DAX. While we have a > >> > > reasonable solution available for problem [1], so far I'm not aware of a > >> > > decent solution for [2]. After briefly discussing the issue with Mel he had > >> > > a bright idea that we could used hashed locks to deal with [2] (and I think > >> > > we can solve [1] with them as well). So my proposal looks as follows: > >> > > > >> > > DAX will have an array of mutexes (the array can be made per device but > >> > > initially a global one should be OK). We will use mutexes in the array as a > >> > > replacement for page lock - we will use hashfn(mapping, index) to get > >> > > particular mutex protecting our offset in the mapping. On fault / page > >> > > mkwrite, we'll grab the mutex similarly to page lock and release it once we > >> > > are done updating page tables. This deals with races in [1]. When flushing > >> > > caches we grab the mutex before clearing writeable bit in page tables > >> > > and clearing dirty bit in the radix tree and drop it after we have flushed > >> > > caches for the pfn. This deals with races in [2]. > >> > > > >> > > Thoughts? > >> > > > >> > > >> > I like the fact that this makes the locking explicit and > >> > straightforward rather than something more tricky. Can we make the > >> > hashfn pfn based? I'm thinking we could later reuse this as part of > >> > the solution for eliminating the need to allocate struct page, and we > >> > don't have the 'mapping' available in all paths... > >> > >> So Mel originally suggested to use pfn for hashing as well. My concern with > >> using pfn is that e.g. if you want to fill a hole, you don't have a pfn to > >> lock. What you really need to protect is a logical offset in the file to > >> serialize allocation of underlying blocks, its mapping into page tables, > >> and flushing the blocks out of caches. So using inode/mapping and offset > >> for the hashing is easier (it isn't obvious to me we can fix hole filling > >> races with pfn-based locking). > > > > So how does that file+offset hash work when trying to lock different > > ranges? file+offset hashing to determine the lock to use only works > > if we are dealing with fixed size ranges that the locks affect. > > e.g. offset has 4k granularity for a single page faults, but we also > > need to handle 2MB granularity for huge page faults, and IIRC 1GB > > granularity for giant page faults... > > > > What's the plan here? > > > > Cheers, > > > > Dave. 
> > -- > > Dave Chinner > > david@fromorbit.com > > -- > > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in > > the body of a message to majordomo@vger.kernel.org > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > > -- > Cedric Blancher > Institute Pasteur From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751924AbcBJXoV (ORCPT ); Wed, 10 Feb 2016 18:44:21 -0500 Received: from mga02.intel.com ([134.134.136.20]:50799 "EHLO mga02.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750728AbcBJXoS (ORCPT ); Wed, 10 Feb 2016 18:44:18 -0500 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.22,427,1449561600"; d="scan'208";a="909626863" Date: Wed, 10 Feb 2016 16:44:06 -0700 From: Ross Zwisler To: Jan Kara Cc: Ross Zwisler , linux-kernel@vger.kernel.org, Dave Chinner , linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, Dan Williams , linux-nvdimm@ml01.01.org, mgorman@suse.de, Matthew Wilcox Subject: Re: Another proposal for DAX fault locking Message-ID: <20160210234406.GD30938@linux.intel.com> Mail-Followup-To: Ross Zwisler , Jan Kara , linux-kernel@vger.kernel.org, Dave Chinner , linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, Dan Williams , linux-nvdimm@lists.01.org, mgorman@suse.de, Matthew Wilcox References: <20160209172416.GB12245@quack.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20160209172416.GB12245@quack.suse.cz> User-Agent: Mutt/1.5.24 (2015-08-30) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Feb 09, 2016 at 06:24:16PM +0100, Jan Kara wrote: > Hello, > > I was thinking about current issues with DAX fault locking [1] (data > corruption due to racing faults allocating blocks) and also races which > currently don't allow us to clear dirty tags in the radix tree due to races > between faults and cache flushing [2]. Both of these exist because we don't > have an equivalent of page lock available for DAX. While we have a > reasonable solution available for problem [1], so far I'm not aware of a > decent solution for [2]. After briefly discussing the issue with Mel he had > a bright idea that we could used hashed locks to deal with [2] (and I think > we can solve [1] with them as well). So my proposal looks as follows: > > DAX will have an array of mutexes (the array can be made per device but > initially a global one should be OK). We will use mutexes in the array as a > replacement for page lock - we will use hashfn(mapping, index) to get > particular mutex protecting our offset in the mapping. On fault / page > mkwrite, we'll grab the mutex similarly to page lock and release it once we > are done updating page tables. This deals with races in [1]. When flushing > caches we grab the mutex before clearing writeable bit in page tables > and clearing dirty bit in the radix tree and drop it after we have flushed > caches for the pfn. This deals with races in [2]. > > Thoughts? > > Honza > > [1] http://oss.sgi.com/archives/xfs/2016-01/msg00575.html > [2] https://lists.01.org/pipermail/linux-nvdimm/2016-January/004057.html Overall I think this sounds promising. I think a potential tie-in with the radix tree would maybe take us in a good direction. 
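Concretely, I'm picturing the radix tree entry itself carrying a lock bit, with waiters parked on a wait queue - a very rough sketch, and every name below is invented for illustration:

	#include <linux/fs.h>
	#include <linux/radix-tree.h>
	#include <linux/sched.h>
	#include <linux/wait.h>

	/* Hypothetical spare bit in the exceptional entry, used as the lock. */
	#define DAX_ENTRY_LOCK	(1UL << 2)

	static DECLARE_WAIT_QUEUE_HEAD(dax_entry_waitq);

	static inline bool dax_entry_locked(void *entry)
	{
		return (unsigned long)entry & DAX_ENTRY_LOCK;
	}

	/*
	 * Return the entry for @index with mapping->tree_lock held, sleeping
	 * until nobody holds the entry lock.  The caller sets DAX_ENTRY_LOCK
	 * in the slot before dropping tree_lock; the unlocker clears the bit
	 * under tree_lock and wakes dax_entry_waitq, so wakeups can't be lost.
	 */
	static void *dax_get_unlocked_entry(struct address_space *mapping,
					    pgoff_t index)
	{
		DEFINE_WAIT(wait);
		void *entry;

		for (;;) {
			spin_lock_irq(&mapping->tree_lock);
			entry = radix_tree_lookup(&mapping->page_tree, index);
			if (!entry || !dax_entry_locked(entry))
				return entry;	/* tree_lock still held */
			prepare_to_wait(&dax_entry_waitq, &wait,
					TASK_UNINTERRUPTIBLE);
			spin_unlock_irq(&mapping->tree_lock);
			schedule();
			finish_wait(&dax_entry_waitq, &wait);
		}
	}

Unlock would be the mirror image: clear the bit in the slot under tree_lock and wake the queue.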
I had another idea of how to solve race #2 that involved sticking a seqlock around the DAX radix tree + pte_mkwrite() sequence, and on the flushing side, if you noticed that you've raced against a page fault, just leaving the dirty page tree entry intact. I *think* this could work - I'd want to bang on it more - but if we have a general way of handling DAX locking that we can use instead of solving these issues one-by-one as they come up, that seems like a much better route. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751885AbcBJXvK (ORCPT ); Wed, 10 Feb 2016 18:51:10 -0500 Received: from mail-pf0-f180.google.com ([209.85.192.180]:34815 "EHLO mail-pf0-f180.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751136AbcBJXvH (ORCPT ); Wed, 10 Feb 2016 18:51:07 -0500 MIME-Version: 1.0 In-Reply-To: <20160210234406.GD30938@linux.intel.com> References: <20160209172416.GB12245@quack.suse.cz> <20160210234406.GD30938@linux.intel.com> Date: Thu, 11 Feb 2016 00:51:05 +0100 Message-ID: Subject: Re: Another proposal for DAX fault locking From: Cedric Blancher To: Ross Zwisler , Jan Kara , Linux Kernel Mailing List , Dave Chinner , linux-fsdevel , Linux MM , Dan Williams , linux-nvdimm@ml01.01.org, Mel Gorman , Matthew Wilcox Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org There is another "twist" in this game: if there is a 1GB huge page with a small 4k page as an "overlay" (e.g. mmap() MAP_FIXED somewhere in the middle of the 1GB huge page), how is that handled? Ced On 11 February 2016 at 00:44, Ross Zwisler wrote: > On Tue, Feb 09, 2016 at 06:24:16PM +0100, Jan Kara wrote: >> Hello, >> >> I was thinking about current issues with DAX fault locking [1] (data >> corruption due to racing faults allocating blocks) and also races which >> currently don't allow us to clear dirty tags in the radix tree due to races >> between faults and cache flushing [2]. Both of these exist because we don't >> have an equivalent of page lock available for DAX. While we have a >> reasonable solution available for problem [1], so far I'm not aware of a >> decent solution for [2]. After briefly discussing the issue with Mel he had >> a bright idea that we could used hashed locks to deal with [2] (and I think >> we can solve [1] with them as well). So my proposal looks as follows: >> >> DAX will have an array of mutexes (the array can be made per device but >> initially a global one should be OK). We will use mutexes in the array as a >> replacement for page lock - we will use hashfn(mapping, index) to get >> particular mutex protecting our offset in the mapping. On fault / page >> mkwrite, we'll grab the mutex similarly to page lock and release it once we >> are done updating page tables. This deals with races in [1]. When flushing >> caches we grab the mutex before clearing writeable bit in page tables >> and clearing dirty bit in the radix tree and drop it after we have flushed >> caches for the pfn. This deals with races in [2]. >> >> Thoughts? >> >> Honza >> >> [1] http://oss.sgi.com/archives/xfs/2016-01/msg00575.html >> [2] https://lists.01.org/pipermail/linux-nvdimm/2016-January/004057.html > > Overall I think this sounds promising. I think a potential tie-in with the > radix tree would maybe take us in a good direction.
> > I had another idea of how to solve race #2 that involved sticking a seqlock > around the DAX radix tree + pte_mkwrite() sequence, and on the flushing side > if you noticed that you've raced against a page fault, just leaving the dirty > page tree entry intact. > > I *think* this could work - I'd want to bang on it more - but if we have a > general way of handling DAX locking that we can use instead of solving these > issues one-by-one as they come up, that seems like a much better route. > -- > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Cedric Blancher Institute Pasteur From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752435AbcBKANR (ORCPT ); Wed, 10 Feb 2016 19:13:17 -0500 Received: from mga01.intel.com ([192.55.52.88]:63057 "EHLO mga01.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750760AbcBKANP (ORCPT ); Wed, 10 Feb 2016 19:13:15 -0500 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.22,428,1449561600"; d="scan'208";a="45909007" Date: Wed, 10 Feb 2016 17:13:00 -0700 From: Ross Zwisler To: Cedric Blancher Cc: Ross Zwisler , Jan Kara , Linux Kernel Mailing List , Dave Chinner , linux-fsdevel , Linux MM , Dan Williams , linux-nvdimm@ml01.01.org, Mel Gorman , Matthew Wilcox Subject: Re: Another proposal for DAX fault locking Message-ID: <20160211001300.GB19534@linux.intel.com> Mail-Followup-To: Ross Zwisler , Cedric Blancher , Jan Kara , Linux Kernel Mailing List , Dave Chinner , linux-fsdevel , Linux MM , Dan Williams , linux-nvdimm@lists.01.org, Mel Gorman , Matthew Wilcox References: <20160209172416.GB12245@quack.suse.cz> <20160210234406.GD30938@linux.intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.24 (2015-08-30) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Feb 11, 2016 at 12:51:05AM +0100, Cedric Blancher wrote: > There is another "twist" in this game: If there is a huge page with > 1GB with a small 4k page as "overlay" (e.g. mmap() MAP_FIXED somewhere > in the middle of a 1GB huge page), hows that handled? Ugh - I'm pretty sure we haven't touched overlays with DAX at all. The man page says this: If the memory region specified by addr and len overlaps pages of any existing mapping(s), then the overlapped part of the existing mapping(s) will be discarded. I wonder if this would translate into a hole punch for our DAX mapping, whatever size it may be, plus an insert? If so, it seems like we just need to handle each of those operations correctly on their own (hole punch, insert), and things will take care of themselves? That being said, I know for a fact that PMD hole punch is currently broken. > On 11 February 2016 at 00:44, Ross Zwisler wrote: > > On Tue, Feb 09, 2016 at 06:24:16PM +0100, Jan Kara wrote: > >> Hello, > >> > >> I was thinking about current issues with DAX fault locking [1] (data > >> corruption due to racing faults allocating blocks) and also races which > >> currently don't allow us to clear dirty tags in the radix tree due to races > >> between faults and cache flushing [2]. Both of these exist because we don't > >> have an equivalent of page lock available for DAX. 
While we have a > >> reasonable solution available for problem [1], so far I'm not aware of a > >> decent solution for [2]. After briefly discussing the issue with Mel he had > >> a bright idea that we could used hashed locks to deal with [2] (and I think > >> we can solve [1] with them as well). So my proposal looks as follows: > >> > >> DAX will have an array of mutexes (the array can be made per device but > >> initially a global one should be OK). We will use mutexes in the array as a > >> replacement for page lock - we will use hashfn(mapping, index) to get > >> particular mutex protecting our offset in the mapping. On fault / page > >> mkwrite, we'll grab the mutex similarly to page lock and release it once we > >> are done updating page tables. This deals with races in [1]. When flushing > >> caches we grab the mutex before clearing writeable bit in page tables > >> and clearing dirty bit in the radix tree and drop it after we have flushed > >> caches for the pfn. This deals with races in [2]. > >> > >> Thoughts? > >> > >> Honza > >> > >> [1] http://oss.sgi.com/archives/xfs/2016-01/msg00575.html > >> [2] https://lists.01.org/pipermail/linux-nvdimm/2016-January/004057.html > > > > Overall I think this sounds promising. I think a potential tie-in with the > > radix tree would maybe take us in a good direction. > > > > I had another idea of how to solve race #2 that involved sticking a seqlock > > around the DAX radix tree + pte_mkwrite() sequence, and on the flushing side > > if you noticed that you've raced against a page fault, just leaving the dirty > > page tree entry intact. > > > > I *think* this could work - I'd want to bang on it more - but if we have a > > general way of handling DAX locking that we can use instead of solving these > > issues one-by-one as they come up, that seems like a much better route. > > -- > > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in > > the body of a message to majordomo@vger.kernel.org > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > > -- > Cedric Blancher > Institute Pasteur From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751570AbcBKKin (ORCPT ); Thu, 11 Feb 2016 05:38:43 -0500 Received: from mx2.suse.de ([195.135.220.15]:38946 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750963AbcBKKil (ORCPT ); Thu, 11 Feb 2016 05:38:41 -0500 Date: Thu, 11 Feb 2016 11:38:56 +0100 From: Jan Kara To: Boaz Harrosh Cc: Jan Kara , Ross Zwisler , linux-nvdimm@ml01.01.org, Dave Chinner , linux-kernel@vger.kernel.org, linux-mm@kvack.org, mgorman@suse.de, linux-fsdevel@vger.kernel.org Subject: Re: Another proposal for DAX fault locking Message-ID: <20160211103856.GE21760@quack.suse.cz> References: <20160209172416.GB12245@quack.suse.cz> <56BB758D.1000704@plexistor.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <56BB758D.1000704@plexistor.com> User-Agent: Mutt/1.5.24 (2015-08-30) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed 10-02-16 19:38:21, Boaz Harrosh wrote: > On 02/09/2016 07:24 PM, Jan Kara wrote: > > Hello, > > > > I was thinking about current issues with DAX fault locking [1] (data > > corruption due to racing faults allocating blocks) and also races which > > currently don't allow us to clear dirty tags in the radix tree due to races > > between faults and cache flushing [2]. 
Both of these exist because we don't > > have an equivalent of page lock available for DAX. While we have a > > reasonable solution available for problem [1], so far I'm not aware of a > > decent solution for [2]. After briefly discussing the issue with Mel he had > > a bright idea that we could used hashed locks to deal with [2] (and I think > > we can solve [1] with them as well). So my proposal looks as follows: > > > > DAX will have an array of mutexes (the array can be made per device but > > initially a global one should be OK). We will use mutexes in the array as a > > replacement for page lock - we will use hashfn(mapping, index) to get > > particular mutex protecting our offset in the mapping. On fault / page > > mkwrite, we'll grab the mutex similarly to page lock and release it once we > > are done updating page tables. This deals with races in [1]. When flushing > > caches we grab the mutex before clearing writeable bit in page tables > > and clearing dirty bit in the radix tree and drop it after we have flushed > > caches for the pfn. This deals with races in [2]. > > > > Thoughts? > > > > You could also use one of the radix-tree's special-bits as a bit lock. > So no need for any extra allocations. Yes and I've suggested that once as well. But since we need sleeping locks, you need some wait queues somewhere as well. So some allocations are going to be needed anyway. And mutexes have much better properties than bit-locks so I prefer mutexes over cramming bit locks into radix tree. Plus you'd have to be careful so that someone doesn't remove the bit from the radix tree while you are working with it. Honza -- Jan Kara SUSE Labs, CR From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751735AbcBKKm7 (ORCPT ); Thu, 11 Feb 2016 05:42:59 -0500 Received: from mx2.suse.de ([195.135.220.15]:39345 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750782AbcBKKm5 (ORCPT ); Thu, 11 Feb 2016 05:42:57 -0500 Date: Thu, 11 Feb 2016 11:43:13 +0100 From: Jan Kara To: Dan Williams Cc: Jan Kara , Ross Zwisler , "linux-kernel@vger.kernel.org" , Dave Chinner , linux-fsdevel , Linux MM , "linux-nvdimm@lists.01.org" , Mel Gorman , Matthew Wilcox Subject: Re: Another proposal for DAX fault locking Message-ID: <20160211104313.GF21760@quack.suse.cz> References: <20160209172416.GB12245@quack.suse.cz> <20160210103249.GD12245@quack.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.24 (2015-08-30) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed 10-02-16 12:08:12, Dan Williams wrote: > On Wed, Feb 10, 2016 at 2:32 AM, Jan Kara wrote: > > On Tue 09-02-16 10:18:53, Dan Williams wrote: > >> On Tue, Feb 9, 2016 at 9:24 AM, Jan Kara wrote: > >> > Hello, > >> > > >> > I was thinking about current issues with DAX fault locking [1] (data > >> > corruption due to racing faults allocating blocks) and also races which > >> > currently don't allow us to clear dirty tags in the radix tree due to races > >> > between faults and cache flushing [2]. Both of these exist because we don't > >> > have an equivalent of page lock available for DAX. While we have a > >> > reasonable solution available for problem [1], so far I'm not aware of a > >> > decent solution for [2]. 
After briefly discussing the issue with Mel he had > >> > a bright idea that we could used hashed locks to deal with [2] (and I think > >> > we can solve [1] with them as well). So my proposal looks as follows: > >> > > >> > DAX will have an array of mutexes (the array can be made per device but > >> > initially a global one should be OK). We will use mutexes in the array as a > >> > replacement for page lock - we will use hashfn(mapping, index) to get > >> > particular mutex protecting our offset in the mapping. On fault / page > >> > mkwrite, we'll grab the mutex similarly to page lock and release it once we > >> > are done updating page tables. This deals with races in [1]. When flushing > >> > caches we grab the mutex before clearing writeable bit in page tables > >> > and clearing dirty bit in the radix tree and drop it after we have flushed > >> > caches for the pfn. This deals with races in [2]. > >> > > >> > Thoughts? > >> > > >> > >> I like the fact that this makes the locking explicit and > >> straightforward rather than something more tricky. Can we make the > >> hashfn pfn based? I'm thinking we could later reuse this as part of > >> the solution for eliminating the need to allocate struct page, and we > >> don't have the 'mapping' available in all paths... > > > > So Mel originally suggested to use pfn for hashing as well. My concern with > > using pfn is that e.g. if you want to fill a hole, you don't have a pfn to > > lock. What you really need to protect is a logical offset in the file to > > serialize allocation of underlying blocks, its mapping into page tables, > > and flushing the blocks out of caches. So using inode/mapping and offset > > for the hashing is easier (it isn't obvious to me we can fix hole filling > > races with pfn-based locking). > > > > I'm not sure for which other purposes you'd like to use this lock and > > whether propagating file+offset to those call sites would make sense or > > not. struct page has the advantage that block mapping information is only > > attached to it, so when filling a hole, we can just allocate some page, > > attach it to the radix tree, use page lock for synchronization, and allocate > > blocks only after that. With pfns we cannot do this... > > Right, I am thinking of the direct-I/O path's use of the page lock and > the occasions where it relies on page->mapping lookups. Well, but the main problem with direct IO is that it takes page *reference* via get_user_pages(). So that's something different from page lock. Maybe the new lock could be abused to provide necessary exclusion for direct IO use as well but that would need deep thinking... So far it seems problematic to me. > Given we already have support for dynamically allocating struct page I > don't think we need to have a "pfn to lock" lookup in the initial > implementation of this locking scheme. Agreed. 
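BTW, to spell out the direct IO distinction above: that path pins the pages with references and never takes the page lock, roughly like this (simplified sketch, error handling mostly omitted):

	#include <linux/mm.h>
	#include <linux/pagemap.h>

	/* Simplified shape of the direct IO pin/do-IO/release cycle. */
	static int dio_pin_user_buffer(unsigned long uaddr, int nr,
				       struct page **pages)
	{
		int i, got;

		/* Takes a reference on each page; no lock_page() anywhere. */
		got = get_user_pages_fast(uaddr, nr, 1 /* write */, pages);
		if (got <= 0)
			return got < 0 ? got : -EFAULT;

		/* ... do the I/O against pages[0..got-1] and wait for it ... */

		for (i = 0; i < got; i++)
			page_cache_release(pages[i]);	/* drop the references */
		return 0;
	}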
Honza -- Jan Kara SUSE Labs, CR From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751889AbcBKKy5 (ORCPT ); Thu, 11 Feb 2016 05:54:57 -0500 Received: from mx2.suse.de ([195.135.220.15]:40200 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751212AbcBKKy4 (ORCPT ); Thu, 11 Feb 2016 05:54:56 -0500 Date: Thu, 11 Feb 2016 11:55:10 +0100 From: Jan Kara To: Cedric Blancher Cc: Dave Chinner , Jan Kara , Dan Williams , Ross Zwisler , "linux-kernel@vger.kernel.org" , linux-fsdevel , Linux MM , "linux-nvdimm@lists.01.org" , Mel Gorman , Matthew Wilcox Subject: Re: Another proposal for DAX fault locking Message-ID: <20160211105510.GG21760@quack.suse.cz> References: <20160209172416.GB12245@quack.suse.cz> <20160210103249.GD12245@quack.suse.cz> <20160210220953.GW19486@dastard> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.24 (2015-08-30) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed 10-02-16 23:39:43, Cedric Blancher wrote: > AFAIK Solaris 11 uses a sparse tree instead of a array. Solves the > scalability problem AND deals with variable page size. Well, but then you have to have this locking tree for every inode so the memory overhead is relatively large, no? I've played with range locking of mapping in the past but its performance was not stellar. Do you have any reference for what Solaris does? Honza > On 10 February 2016 at 23:09, Dave Chinner wrote: > > On Wed, Feb 10, 2016 at 11:32:49AM +0100, Jan Kara wrote: > >> On Tue 09-02-16 10:18:53, Dan Williams wrote: > >> > On Tue, Feb 9, 2016 at 9:24 AM, Jan Kara wrote: > >> > > Hello, > >> > > > >> > > I was thinking about current issues with DAX fault locking [1] (data > >> > > corruption due to racing faults allocating blocks) and also races which > >> > > currently don't allow us to clear dirty tags in the radix tree due to races > >> > > between faults and cache flushing [2]. Both of these exist because we don't > >> > > have an equivalent of page lock available for DAX. While we have a > >> > > reasonable solution available for problem [1], so far I'm not aware of a > >> > > decent solution for [2]. After briefly discussing the issue with Mel he had > >> > > a bright idea that we could used hashed locks to deal with [2] (and I think > >> > > we can solve [1] with them as well). So my proposal looks as follows: > >> > > > >> > > DAX will have an array of mutexes (the array can be made per device but > >> > > initially a global one should be OK). We will use mutexes in the array as a > >> > > replacement for page lock - we will use hashfn(mapping, index) to get > >> > > particular mutex protecting our offset in the mapping. On fault / page > >> > > mkwrite, we'll grab the mutex similarly to page lock and release it once we > >> > > are done updating page tables. This deals with races in [1]. When flushing > >> > > caches we grab the mutex before clearing writeable bit in page tables > >> > > and clearing dirty bit in the radix tree and drop it after we have flushed > >> > > caches for the pfn. This deals with races in [2]. > >> > > > >> > > Thoughts? > >> > > > >> > > >> > I like the fact that this makes the locking explicit and > >> > straightforward rather than something more tricky. Can we make the > >> > hashfn pfn based? 
I'm thinking we could later reuse this as part of > >> > the solution for eliminating the need to allocate struct page, and we > >> > don't have the 'mapping' available in all paths... > >> > >> So Mel originally suggested to use pfn for hashing as well. My concern with > >> using pfn is that e.g. if you want to fill a hole, you don't have a pfn to > >> lock. What you really need to protect is a logical offset in the file to > >> serialize allocation of underlying blocks, its mapping into page tables, > >> and flushing the blocks out of caches. So using inode/mapping and offset > >> for the hashing is easier (it isn't obvious to me we can fix hole filling > >> races with pfn-based locking). > > > > So how does that file+offset hash work when trying to lock different > > ranges? file+offset hashing to determine the lock to use only works > > if we are dealing with fixed size ranges that the locks affect. > > e.g. offset has 4k granularity for a single page faults, but we also > > need to handle 2MB granularity for huge page faults, and IIRC 1GB > > granularity for giant page faults... > > > > What's the plan here? > > > > Cheers, > > > > Dave. > > -- > > Dave Chinner > > david@fromorbit.com > > -- > > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in > > the body of a message to majordomo@vger.kernel.org > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > > -- > Cedric Blancher > Institute Pasteur -- Jan Kara SUSE Labs, CR From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752101AbcBKLPZ (ORCPT ); Thu, 11 Feb 2016 06:15:25 -0500 Received: from mx2.suse.de ([195.135.220.15]:42186 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751954AbcBKLPX (ORCPT ); Thu, 11 Feb 2016 06:15:23 -0500 Date: Thu, 11 Feb 2016 12:15:38 +0100 From: Jan Kara To: Ross Zwisler Cc: Dave Chinner , Jan Kara , Dan Williams , "linux-kernel@vger.kernel.org" , linux-fsdevel , Linux MM , "linux-nvdimm@lists.01.org" , Mel Gorman , Matthew Wilcox Subject: Re: Another proposal for DAX fault locking Message-ID: <20160211111538.GH21760@quack.suse.cz> References: <20160209172416.GB12245@quack.suse.cz> <20160210103249.GD12245@quack.suse.cz> <20160210220953.GW19486@dastard> <20160210233253.GB30938@linux.intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20160210233253.GB30938@linux.intel.com> User-Agent: Mutt/1.5.24 (2015-08-30) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed 10-02-16 16:32:53, Ross Zwisler wrote: > On Thu, Feb 11, 2016 at 09:09:53AM +1100, Dave Chinner wrote: > > On Wed, Feb 10, 2016 at 11:32:49AM +0100, Jan Kara wrote: > > > On Tue 09-02-16 10:18:53, Dan Williams wrote: > > > > On Tue, Feb 9, 2016 at 9:24 AM, Jan Kara wrote: > > > > > Hello, > > > > > > > > > > I was thinking about current issues with DAX fault locking [1] (data > > > > > corruption due to racing faults allocating blocks) and also races which > > > > > currently don't allow us to clear dirty tags in the radix tree due to races > > > > > between faults and cache flushing [2]. Both of these exist because we don't > > > > > have an equivalent of page lock available for DAX. While we have a > > > > > reasonable solution available for problem [1], so far I'm not aware of a > > > > > decent solution for [2]. 
After briefly discussing the issue with Mel he had > > > > > a bright idea that we could used hashed locks to deal with [2] (and I think > > > > > we can solve [1] with them as well). So my proposal looks as follows: > > > > > > > > > > DAX will have an array of mutexes (the array can be made per device but > > > > > initially a global one should be OK). We will use mutexes in the array as a > > > > > replacement for page lock - we will use hashfn(mapping, index) to get > > > > > particular mutex protecting our offset in the mapping. On fault / page > > > > > mkwrite, we'll grab the mutex similarly to page lock and release it once we > > > > > are done updating page tables. This deals with races in [1]. When flushing > > > > > caches we grab the mutex before clearing writeable bit in page tables > > > > > and clearing dirty bit in the radix tree and drop it after we have flushed > > > > > caches for the pfn. This deals with races in [2]. > > > > > > > > > > Thoughts? > > > > > > > > > > > > > I like the fact that this makes the locking explicit and > > > > straightforward rather than something more tricky. Can we make the > > > > hashfn pfn based? I'm thinking we could later reuse this as part of > > > > the solution for eliminating the need to allocate struct page, and we > > > > don't have the 'mapping' available in all paths... > > > > > > So Mel originally suggested to use pfn for hashing as well. My concern with > > > using pfn is that e.g. if you want to fill a hole, you don't have a pfn to > > > lock. What you really need to protect is a logical offset in the file to > > > serialize allocation of underlying blocks, its mapping into page tables, > > > and flushing the blocks out of caches. So using inode/mapping and offset > > > for the hashing is easier (it isn't obvious to me we can fix hole filling > > > races with pfn-based locking). > > > > So how does that file+offset hash work when trying to lock different > > ranges? file+offset hashing to determine the lock to use only works > > if we are dealing with fixed size ranges that the locks affect. > > e.g. offset has 4k granularity for a single page faults, but we also > > need to handle 2MB granularity for huge page faults, and IIRC 1GB > > granularity for giant page faults... > > > > What's the plan here? > > I wonder if it makes sense to tie the locking in with the radix tree? > Meaning, instead of having an array of mutexes, we lock based on the radix > tree entry. > > Right now we already have to check for PTE and PMD entries in the radix tree, > and with Matthew's suggested radix tree changes a lookup of a random address > would give you the appropriate PMD or PUD entry, if one was present. > > This sort of solves the need for having a hash function that works on > file+offset - that's all already there when using the radix tree... Yeah, so we need to be careful there are no aliasing issues (i.e., you do not have PTE and PMD entries covering the same offset). Other than that using the radix tree entry (or it's offset - you need to somehow map the entry to the mutex anyway) as a base for mapping should deal with issues with different page sizes. We will have to be careful, e.g. when allocating blocks for a PMD fault. We would have to insert PMD entry, lock it (so all newcomers will see the entry and block on it), walk the whole range the fault covers and clear out entries we find waiting if they are locked - lock aliasing may be an issue here - and only after that we can proceed with the fault. 
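In pseudo-code the PMD case would be something like the following - all the dax_*_entry() helpers are made up here, standing in for radix tree entry operations we'd have to write:

	#include <linux/err.h>
	#include <linux/fs.h>
	#include <linux/mm.h>

	/* Hypothetical helpers -- these entry operations don't exist yet. */
	extern void *dax_insert_locked_entry(struct address_space *mapping,
					     pgoff_t index, bool pmd);
	extern void *dax_lock_entry(struct address_space *mapping, pgoff_t index);
	extern void dax_remove_entry(struct address_space *mapping, pgoff_t index,
				     void *entry);

	static int dax_prepare_pmd_fault(struct address_space *mapping,
					 pgoff_t index)
	{
		pgoff_t pmd_index = round_down(index, PTRS_PER_PMD);
		pgoff_t i;
		void *pmd_entry;

		/* 1) Insert the PMD entry locked so new faults block on it. */
		pmd_entry = dax_insert_locked_entry(mapping, pmd_index, true);
		if (IS_ERR(pmd_entry))
			return PTR_ERR(pmd_entry);

		/*
		 * 2) Sweep the range the fault covers: wait for any locked PTE
		 * entries (beware of aliasing against our own PMD entry's
		 * lock) and clear them out.
		 */
		for (i = pmd_index; i < pmd_index + PTRS_PER_PMD; i++) {
			void *entry = dax_lock_entry(mapping, i); /* may sleep */

			if (entry)
				dax_remove_entry(mapping, i, entry);
		}

		/* 3) Only now allocate blocks and map the PMD. */
		return 0;
	}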
It is more complex than I'd wish but doable and I don't have anything better. Honza -- Jan Kara SUSE Labs, CR From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751436AbcBKVF0 (ORCPT ); Thu, 11 Feb 2016 16:05:26 -0500 Received: from mail-pf0-f172.google.com ([209.85.192.172]:33923 "EHLO mail-pf0-f172.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751067AbcBKVFY (ORCPT ); Thu, 11 Feb 2016 16:05:24 -0500 MIME-Version: 1.0 In-Reply-To: <20160211105510.GG21760@quack.suse.cz> References: <20160209172416.GB12245@quack.suse.cz> <20160210103249.GD12245@quack.suse.cz> <20160210220953.GW19486@dastard> <20160211105510.GG21760@quack.suse.cz> Date: Thu, 11 Feb 2016 22:05:24 +0100 Message-ID: Subject: Re: Another proposal for DAX fault locking From: Cedric Blancher To: Jan Kara Cc: Dave Chinner , Dan Williams , Ross Zwisler , "linux-kernel@vger.kernel.org" , linux-fsdevel , Linux MM , "linux-nvdimm@lists.01.org" , Mel Gorman , Matthew Wilcox Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org The Solaris 11 sources are still available at Illumos.org (Illumos is what Opensolaris was once, minus Suns bug database). Also, if you keep the performance in mind, remember that the world is moving towards many cores with many hardware threads per core, so optimising for the "two core with low latency" use case is wrong like a sin. More likely is the "8 core with 4 threads per core" use case for benchmarking because that's what we will end up in even low end hardware soon, maybe with variable bandwidth between cores if something like ScaleMP is used. Ced On 11 February 2016 at 11:55, Jan Kara wrote: > On Wed 10-02-16 23:39:43, Cedric Blancher wrote: >> AFAIK Solaris 11 uses a sparse tree instead of a array. Solves the >> scalability problem AND deals with variable page size. > > Well, but then you have to have this locking tree for every inode so the > memory overhead is relatively large, no? I've played with range locking of > mapping in the past but its performance was not stellar. Do you have any > reference for what Solaris does? > > Honza > >> On 10 February 2016 at 23:09, Dave Chinner wrote: >> > On Wed, Feb 10, 2016 at 11:32:49AM +0100, Jan Kara wrote: >> >> On Tue 09-02-16 10:18:53, Dan Williams wrote: >> >> > On Tue, Feb 9, 2016 at 9:24 AM, Jan Kara wrote: >> >> > > Hello, >> >> > > >> >> > > I was thinking about current issues with DAX fault locking [1] (data >> >> > > corruption due to racing faults allocating blocks) and also races which >> >> > > currently don't allow us to clear dirty tags in the radix tree due to races >> >> > > between faults and cache flushing [2]. Both of these exist because we don't >> >> > > have an equivalent of page lock available for DAX. While we have a >> >> > > reasonable solution available for problem [1], so far I'm not aware of a >> >> > > decent solution for [2]. After briefly discussing the issue with Mel he had >> >> > > a bright idea that we could used hashed locks to deal with [2] (and I think >> >> > > we can solve [1] with them as well). So my proposal looks as follows: >> >> > > >> >> > > DAX will have an array of mutexes (the array can be made per device but >> >> > > initially a global one should be OK). 
We will use mutexes in the array as a >> >> > > replacement for page lock - we will use hashfn(mapping, index) to get >> >> > > particular mutex protecting our offset in the mapping. On fault / page >> >> > > mkwrite, we'll grab the mutex similarly to page lock and release it once we >> >> > > are done updating page tables. This deals with races in [1]. When flushing >> >> > > caches we grab the mutex before clearing writeable bit in page tables >> >> > > and clearing dirty bit in the radix tree and drop it after we have flushed >> >> > > caches for the pfn. This deals with races in [2]. >> >> > > >> >> > > Thoughts? >> >> > > >> >> > >> >> > I like the fact that this makes the locking explicit and >> >> > straightforward rather than something more tricky. Can we make the >> >> > hashfn pfn based? I'm thinking we could later reuse this as part of >> >> > the solution for eliminating the need to allocate struct page, and we >> >> > don't have the 'mapping' available in all paths... >> >> >> >> So Mel originally suggested to use pfn for hashing as well. My concern with >> >> using pfn is that e.g. if you want to fill a hole, you don't have a pfn to >> >> lock. What you really need to protect is a logical offset in the file to >> >> serialize allocation of underlying blocks, its mapping into page tables, >> >> and flushing the blocks out of caches. So using inode/mapping and offset >> >> for the hashing is easier (it isn't obvious to me we can fix hole filling >> >> races with pfn-based locking). >> > >> > So how does that file+offset hash work when trying to lock different >> > ranges? file+offset hashing to determine the lock to use only works >> > if we are dealing with fixed size ranges that the locks affect. >> > e.g. offset has 4k granularity for a single page faults, but we also >> > need to handle 2MB granularity for huge page faults, and IIRC 1GB >> > granularity for giant page faults... >> > >> > What's the plan here? >> > >> > Cheers, >> > >> > Dave. 
>> > -- >> > Dave Chinner >> > david@fromorbit.com >> > -- >> > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in >> > the body of a message to majordomo@vger.kernel.org >> > More majordomo info at http://vger.kernel.org/majordomo-info.html >> >> >> >> -- >> Cedric Blancher >> Institute Pasteur > -- > Jan Kara > SUSE Labs, CR -- Cedric Blancher Institute Pasteur