From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mga07.intel.com (mga07.intel.com [134.134.136.100])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested) by ml01.01.org (Postfix) with ESMTPS
	id A123B21A04823 for ; Mon, 1 May 2017 15:38:57 -0700 (PDT)
Date: Mon, 1 May 2017 16:38:55 -0600
From: Ross Zwisler
Subject: Re: [PATCH 2/2] dax: fix data corruption due to stale mmap reads
Message-ID: <20170501223855.GA25862@linux.intel.com>
References: <20170420191446.GA21694@linux.intel.com>
 <20170421034437.4359-1-ross.zwisler@linux.intel.com>
 <20170421034437.4359-2-ross.zwisler@linux.intel.com>
 <20170425111043.GH2793@quack2.suse.cz>
 <20170425225936.GA29655@linux.intel.com>
 <20170426085235.GA21738@quack2.suse.cz>
 <20170426225236.GA25838@linux.intel.com>
 <20170427072659.GA29789@quack2.suse.cz>
MIME-Version: 1.0
Content-Disposition: inline
In-Reply-To: <20170427072659.GA29789@quack2.suse.cz>
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Errors-To: linux-nvdimm-bounces@lists.01.org
Sender: "Linux-nvdimm"
To: Jan Kara
Cc: Latchesar Ionkov , Trond Myklebust , linux-mm@kvack.org,
 Christoph Hellwig , linux-cifs@vger.kernel.org, Matthew Wilcox ,
 Andrey Ryabinin , Eric Van Hensbergen , linux-nvdimm@lists.01.org,
 Alexander Viro , v9fs-developer@lists.sourceforge.net, Jens Axboe ,
 linux-nfs@vger.kernel.org, "Darrick J. Wong" ,
 samba-technical@lists.samba.org, linux-kernel@vger.kernel.org,
 Steve French , Alexey Kuznetsov , Johannes Weiner ,
 linux-fsdevel@vger.kernel.org, Ron Minnich , Andrew Morton ,
 Anna Schumaker
List-ID:

On Thu, Apr 27, 2017 at 09:26:59AM +0200, Jan Kara wrote:
> On Wed 26-04-17 16:52:36, Ross Zwisler wrote:
<>
> > I don't think this alone is enough to save us.
The I/O path doesn't currently
> > take any DAX radix tree entry locks, so our race would just become:
> >
> > CPU1 - write(2)                    CPU2 - read fault
> >
> >                                    dax_iomap_pte_fault()
> >                                      grab_mapping_entry() // newly moved
> >                                      ->iomap_begin() - sees hole
> > dax_iomap_rw()
> >   iomap_apply()
> >     ->iomap_begin - allocates blocks
> >       dax_iomap_actor()
> >         invalidate_inode_pages2_range()
> >           - there's nothing to invalidate
> >                                      - we add zero page in the radix
> >                                        tree & map it to page tables
> >
> > In their current form I don't think we want to take DAX radix tree entry locks
> > in the I/O path because that would effectively serialize I/O over a given
> > radix tree entry. For a 2MiB entry, for example, all I/O to that 2MiB range
> > would be serialized.
>
> Note that invalidate_inode_pages2_range() will see the entry created by
> grab_mapping_entry() on CPU2 and block waiting for its lock and this is
> exactly what stops the race. The invalidate_inode_pages2_range()
> effectively makes sure there isn't any page fault in progress for given
> range...

Yep, this is the bit that I was missing. Thanks.

> Also note that writes to a file are serialized by i_rwsem anyway (and at
> least serialization of writes to the overlapping range is required by POSIX)
> so this doesn't add any more serialization than we already have.
>
> > > Another solution would be to grab i_mmap_sem for write when doing write
> > > fault of a page and similarly have it grabbed for writing when doing
> > > write(2). This would scale rather poorly but if we later replaced it with a
> > > range lock (Davidlohr has already posted a nice implementation of it) it
> > > won't be as bad. But I guess option 1) is better...
> >
> > The best idea I had for handling this sounds similar, which would be to
> > convert the radix tree locks to essentially be reader/writer locks. I/O and
> > faults that don't modify the block mapping could just take read-level locks,
> > and could all run concurrently.
I/O or faults that modify a block mapping
> > would take a write lock, and serialize with other writers and readers.
>
> Well, this would be difficult to implement inside the radix tree (not
> enough bits in the entry) so you'd have to go for some external locking
> primitive anyway. And if you do that, read-write range lock Davidlohr has
> implemented is what you describe - well we could also have a radix tree
> with rwsems but I suspect the overhead of maintaining that would be too
> large. It would require larger rewrite than reusing entry locks as I
> suggest above though and it isn't an obvious performance win for realistic
> workloads either so I'd like to see some performance numbers before going
> that way. It likely improves a situation where processes race to fault the
> same page for which we already know the block mapping but I'm not sure if
> that translates to any measurable performance wins for workloads on DAX
> filesystem.
>
> > You could know if you needed a write lock without asking the filesystem - if
> > you're a write and the radix tree entry is empty or is for a zero page, you
> > grab the write lock.
> >
> > This dovetails nicely with the idea of having the radix tree act as a cache
> > for block mappings. You take the appropriate lock on the radix tree entry,
> > and it has the block mapping info for your I/O or fault so you don't have to
> > call into the FS. I/O would also participate so we would keep info about
> > block mappings that we gather from I/O to help shortcut our page faults.
> >
> > How does this sound vs the range lock idea? How hard do you think it would be
> > to convert our current wait queue system to reader/writer style locking?
> >
> > Also, how do you think we should deal with the current PMD corruption? Should
> > we go with the current fix (I can augment the comments as you suggested), and
> > then handle optimizations to that approach and the solution to this larger
> > race as a follow-on?
>
> So for now I'm still more inclined to just stay with the radix tree lock as
> is and just fix up the locking as I suggest and go for larger rewrite only
> if we can demonstrate further performance wins.

Sounds good.

> WRT your second patch, if we go with the locking as I suggest, it is enough
> to unmap the whole range after invalidate_inode_pages2() has cleared radix
> tree entries (*) which will be much cheaper (for large writes) than doing
> unmapping entry by entry.

I'm still not convinced that it is safe to do the unmap in a separate step.
I see your point about it being expensive to do an rmap walk to unmap each
entry in __dax_invalidate_mapping_entry(), but I think we might need to
because the unmap is part of the contract imposed by
invalidate_inode_pages2_range() and invalidate_inode_pages2().  This exists
in the header comment above each:

 * Any pages which are found to be mapped into pagetables are unmapped prior
 * to invalidation.

If you look at the usage of invalidate_inode_pages2_range() in
generic_file_direct_write() for example (which I realize we won't call for a
DAX inode, but still), I think that it really does rely on the fact that
invalidated pages are unmapped, right?  If it didn't, and hole pages were
mapped, the hole pages could remain mapped while a direct I/O write
allocated blocks and then wrote real data.

If we really want to unmap the entire range at once, maybe it would have to
be done in invalidate_inode_pages2_range(), after the loop?  My hesitation
about this is that we'd be leaking yet more DAX special casing up into the
mm/truncate.c code.  Or am I missing something?

> So I'd go for that. I'll prepare a patch for the
> locking change - it will require changes to ext4 transaction handling so it
> won't be completely trivial.
> > (*) The flow of information is: filesystem block mapping info -> radix tree > -> page tables so if 'filesystem block mapping info' changes, we should go > invalidate corresponding radix tree entries (new entries will already have > uptodate info) and then invalidate corresponding page tables (again once > radix tree has no stale entries, we are sure new page table entries will be > uptodate). > > Honza > -- > Jan Kara > SUSE Labs, CR _______________________________________________ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm From mboxrd@z Thu Jan 1 00:00:00 1970 From: Ross Zwisler Subject: Re: [PATCH 2/2] dax: fix data corruption due to stale mmap reads Date: Mon, 1 May 2017 16:38:55 -0600 Message-ID: <20170501223855.GA25862@linux.intel.com> References: <20170420191446.GA21694@linux.intel.com> <20170421034437.4359-1-ross.zwisler@linux.intel.com> <20170421034437.4359-2-ross.zwisler@linux.intel.com> <20170425111043.GH2793@quack2.suse.cz> <20170425225936.GA29655@linux.intel.com> <20170426085235.GA21738@quack2.suse.cz> <20170426225236.GA25838@linux.intel.com> <20170427072659.GA29789@quack2.suse.cz> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Cc: Latchesar Ionkov , Trond Myklebust , linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, Christoph Hellwig , linux-cifs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Matthew Wilcox , Andrey Ryabinin , Eric Van Hensbergen , linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw@public.gmane.org, Alexander Viro , v9fs-developer-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org, Jens Axboe , linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, "Darrick J. 
Wong" , samba-technical-w/Ol4Ecudpl8XjKLYN78aQ@public.gmane.org, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Steve French , Alexey Kuznetsov , Johannes Weiner , linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Ron Minnich , Andrew Morton , Anna Schumaker To: Jan Kara Return-path: Content-Disposition: inline In-Reply-To: <20170427072659.GA29789-4I4JzKEfoa/jFM9bn6wA6Q@public.gmane.org> List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: linux-nvdimm-bounces-hn68Rpc1hR1g9hUCZPvPmw@public.gmane.org Sender: "Linux-nvdimm" List-Id: linux-cifs.vger.kernel.org On Thu, Apr 27, 2017 at 09:26:59AM +0200, Jan Kara wrote: > On Wed 26-04-17 16:52:36, Ross Zwisler wrote: <> > > I don't think this alone is enough to save us. The I/O path doesn't currently > > take any DAX radix tree entry locks, so our race would just become: > > > > CPU1 - write(2) CPU2 - read fault > > > > dax_iomap_pte_fault() > > grab_mapping_entry() // newly moved > > ->iomap_begin() - sees hole > > dax_iomap_rw() > > iomap_apply() > > ->iomap_begin - allocates blocks > > dax_iomap_actor() > > invalidate_inode_pages2_range() > > - there's nothing to invalidate > > - we add zero page in the radix > > tree & map it to page tables > > > > In their current form I don't think we want to take DAX radix tree entry locks > > in the I/O path because that would effectively serialize I/O over a given > > radix tree entry. For a 2MiB entry, for example, all I/O to that 2MiB range > > would be serialized. > > Note that invalidate_inode_pages2_range() will see the entry created by > grab_mapping_entry() on CPU2 and block waiting for its lock and this is > exactly what stops the race. The invalidate_inode_pages2_range() > effectively makes sure there isn't any page fault in progress for given > range... Yep, this is the bit that I was missing. Thanks. 
> Also note that writes to a file are serialized by i_rwsem anyway (and at > least serialization of writes to the overlapping range is required by POSIX) > so this doesn't add any more serialization than we already have. > > > > Another solution would be to grab i_mmap_sem for write when doing write > > > fault of a page and similarly have it grabbed for writing when doing > > > write(2). This would scale rather poorly but if we later replaced it with a > > > range lock (Davidlohr has already posted a nice implementation of it) it > > > won't be as bad. But I guess option 1) is better... > > > > The best idea I had for handling this sounds similar, which would be to > > convert the radix tree locks to essentially be reader/writer locks. I/O and > > faults that don't modify the block mapping could just take read-level locks, > > and could all run concurrently. I/O or faults that modify a block mapping > > would take a write lock, and serialize with other writers and readers. > > Well, this would be difficult to implement inside the radix tree (not > enough bits in the entry) so you'd have to go for some external locking > primitive anyway. And if you do that, read-write range lock Davidlohr has > implemented is what you describe - well we could also have a radix tree > with rwsems but I suspect the overhead of maintaining that would be too > large. It would require larger rewrite than reusing entry locks as I > suggest above though and it isn't an obvious performance win for realistic > workloads either so I'd like to see some performance numbers before going > that way. It likely improves a situation where processes race to fault the > same page for which we already know the block mapping but I'm not sure if > that translates to any measurable performance wins for workloads on DAX > filesystem. 
> > > You could know if you needed a write lock without asking the filesystem - if > > you're a write and the radix tree entry is empty or is for a zero page, you > > grab the write lock. > > > > This dovetails nicely with the idea of having the radix tree act as a cache > > for block mappings. You take the appropriate lock on the radix tree entry, > > and it has the block mapping info for your I/O or fault so you don't have to > > call into the FS. I/O would also participate so we would keep info about > > block mappings that we gather from I/O to help shortcut our page faults. > > > > How does this sound vs the range lock idea? How hard do you think it would be > > to convert our current wait queue system to reader/writer style locking? > > > > Also, how do you think we should deal with the current PMD corruption? Should > > we go with the current fix (I can augment the comments as you suggested), and > > then handle optimizations to that approach and the solution to this larger > > race as a follow-on? > > So for now I'm still more inclined to just stay with the radix tree lock as > is and just fix up the locking as I suggest and go for larger rewrite only > if we can demonstrate further performance wins. Sounds good. > WRT your second patch, if we go with the locking as I suggest, it is enough > to unmap the whole range after invalidate_inode_pages2() has cleared radix > tree entries (*) which will be much cheaper (for large writes) than doing > unmapping entry by entry. I'm still not convinced that it is safe to do the unmap in a separate step. I see your point about it being expensive to do a rmap walk to unmap each entry in __dax_invalidate_mapping_entry(), but I think we might need to because the unmap is part of the contract imposed by invalidate_inode_pages2_range() and invalidate_inode_pages2(). This exists in the header comment above each: * Any pages which are found to be mapped into pagetables are unmapped prior * to invalidation. 
If you look at the usage of invalidate_inode_pages2_range() in generic_file_direct_write() for example (which I realize we won't call for a DAX inode, but still), I think that it really does rely on the fact that invalidated pages are unmapped, right? If it didn't, and hole pages were mapped, the hole pages could remain mapped while a direct I/O write allocated blocks and then wrote real data. If we really want to unmap the entire range at once, maybe it would have to be done in invalidate_inode_pages2_range(), after the loop? My hesitation about this is that we'd be leaking yet more DAX special casing up into the mm/truncate.c code. Or am I missing something? > So I'd go for that. I'll prepare a patch for the > locking change - it will require changes to ext4 transaction handling so it > won't be completely trivial. > > (*) The flow of information is: filesystem block mapping info -> radix tree > -> page tables so if 'filesystem block mapping info' changes, we should go > invalidate corresponding radix tree entries (new entries will already have > uptodate info) and then invalidate corresponding page tables (again once > radix tree has no stale entries, we are sure new page table entries will be > uptodate). > > Honza > -- > Jan Kara > SUSE Labs, CR From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751604AbdEAWjD (ORCPT ); Mon, 1 May 2017 18:39:03 -0400 Received: from mga05.intel.com ([192.55.52.43]:3305 "EHLO mga05.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750818AbdEAWjA (ORCPT ); Mon, 1 May 2017 18:39:00 -0400 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.37,401,1488873600"; d="scan'208";a="963224750" Date: Mon, 1 May 2017 16:38:55 -0600 From: Ross Zwisler To: Jan Kara Cc: Ross Zwisler , Andrew Morton , linux-kernel@vger.kernel.org, Alexander Viro , Alexey Kuznetsov , Andrey Ryabinin , Anna Schumaker , Christoph Hellwig , Dan Williams , "Darrick J. 
Wong" , Eric Van Hensbergen , Jens Axboe , Johannes Weiner , Konrad Rzeszutek Wilk , Latchesar Ionkov , linux-cifs@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nfs@vger.kernel.org, linux-nvdimm@ml01.01.org, Matthew Wilcox , Ron Minnich , samba-technical@lists.samba.org, Steve French , Trond Myklebust , v9fs-developer@lists.sourceforge.net Subject: Re: [PATCH 2/2] dax: fix data corruption due to stale mmap reads Message-ID: <20170501223855.GA25862@linux.intel.com> Mail-Followup-To: Ross Zwisler , Jan Kara , Andrew Morton , linux-kernel@vger.kernel.org, Alexander Viro , Alexey Kuznetsov , Andrey Ryabinin , Anna Schumaker , Christoph Hellwig , Dan Williams , "Darrick J. Wong" , Eric Van Hensbergen , Jens Axboe , Johannes Weiner , Konrad Rzeszutek Wilk , Latchesar Ionkov , linux-cifs@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nfs@vger.kernel.org, linux-nvdimm@lists.01.org, Matthew Wilcox , Ron Minnich , samba-technical@lists.samba.org, Steve French , Trond Myklebust , v9fs-developer@lists.sourceforge.net References: <20170420191446.GA21694@linux.intel.com> <20170421034437.4359-1-ross.zwisler@linux.intel.com> <20170421034437.4359-2-ross.zwisler@linux.intel.com> <20170425111043.GH2793@quack2.suse.cz> <20170425225936.GA29655@linux.intel.com> <20170426085235.GA21738@quack2.suse.cz> <20170426225236.GA25838@linux.intel.com> <20170427072659.GA29789@quack2.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20170427072659.GA29789@quack2.suse.cz> User-Agent: Mutt/1.8.0 (2017-02-23) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Apr 27, 2017 at 09:26:59AM +0200, Jan Kara wrote: > On Wed 26-04-17 16:52:36, Ross Zwisler wrote: <> > > I don't think this alone is enough to save us. 
The I/O path doesn't currently > > take any DAX radix tree entry locks, so our race would just become: > > > > CPU1 - write(2) CPU2 - read fault > > > > dax_iomap_pte_fault() > > grab_mapping_entry() // newly moved > > ->iomap_begin() - sees hole > > dax_iomap_rw() > > iomap_apply() > > ->iomap_begin - allocates blocks > > dax_iomap_actor() > > invalidate_inode_pages2_range() > > - there's nothing to invalidate > > - we add zero page in the radix > > tree & map it to page tables > > > > In their current form I don't think we want to take DAX radix tree entry locks > > in the I/O path because that would effectively serialize I/O over a given > > radix tree entry. For a 2MiB entry, for example, all I/O to that 2MiB range > > would be serialized. > > Note that invalidate_inode_pages2_range() will see the entry created by > grab_mapping_entry() on CPU2 and block waiting for its lock and this is > exactly what stops the race. The invalidate_inode_pages2_range() > effectively makes sure there isn't any page fault in progress for given > range... Yep, this is the bit that I was missing. Thanks. > Also note that writes to a file are serialized by i_rwsem anyway (and at > least serialization of writes to the overlapping range is required by POSIX) > so this doesn't add any more serialization than we already have. > > > > Another solution would be to grab i_mmap_sem for write when doing write > > > fault of a page and similarly have it grabbed for writing when doing > > > write(2). This would scale rather poorly but if we later replaced it with a > > > range lock (Davidlohr has already posted a nice implementation of it) it > > > won't be as bad. But I guess option 1) is better... > > > > The best idea I had for handling this sounds similar, which would be to > > convert the radix tree locks to essentially be reader/writer locks. I/O and > > faults that don't modify the block mapping could just take read-level locks, > > and could all run concurrently. 
I/O or faults that modify a block mapping > > would take a write lock, and serialize with other writers and readers. > > Well, this would be difficult to implement inside the radix tree (not > enough bits in the entry) so you'd have to go for some external locking > primitive anyway. And if you do that, read-write range lock Davidlohr has > implemented is what you describe - well we could also have a radix tree > with rwsems but I suspect the overhead of maintaining that would be too > large. It would require larger rewrite than reusing entry locks as I > suggest above though and it isn't an obvious performance win for realistic > workloads either so I'd like to see some performance numbers before going > that way. It likely improves a situation where processes race to fault the > same page for which we already know the block mapping but I'm not sure if > that translates to any measurable performance wins for workloads on DAX > filesystem. > > > You could know if you needed a write lock without asking the filesystem - if > > you're a write and the radix tree entry is empty or is for a zero page, you > > grab the write lock. > > > > This dovetails nicely with the idea of having the radix tree act as a cache > > for block mappings. You take the appropriate lock on the radix tree entry, > > and it has the block mapping info for your I/O or fault so you don't have to > > call into the FS. I/O would also participate so we would keep info about > > block mappings that we gather from I/O to help shortcut our page faults. > > > > How does this sound vs the range lock idea? How hard do you think it would be > > to convert our current wait queue system to reader/writer style locking? > > > > Also, how do you think we should deal with the current PMD corruption? Should > > we go with the current fix (I can augment the comments as you suggested), and > > then handle optimizations to that approach and the solution to this larger > > race as a follow-on? 
> > So for now I'm still more inclined to just stay with the radix tree lock as > is and just fix up the locking as I suggest and go for larger rewrite only > if we can demonstrate further performance wins. Sounds good. > WRT your second patch, if we go with the locking as I suggest, it is enough > to unmap the whole range after invalidate_inode_pages2() has cleared radix > tree entries (*) which will be much cheaper (for large writes) than doing > unmapping entry by entry. I'm still not convinced that it is safe to do the unmap in a separate step. I see your point about it being expensive to do a rmap walk to unmap each entry in __dax_invalidate_mapping_entry(), but I think we might need to because the unmap is part of the contract imposed by invalidate_inode_pages2_range() and invalidate_inode_pages2(). This exists in the header comment above each: * Any pages which are found to be mapped into pagetables are unmapped prior * to invalidation. If you look at the usage of invalidate_inode_pages2_range() in generic_file_direct_write() for example (which I realize we won't call for a DAX inode, but still), I think that it really does rely on the fact that invalidated pages are unmapped, right? If it didn't, and hole pages were mapped, the hole pages could remain mapped while a direct I/O write allocated blocks and then wrote real data. If we really want to unmap the entire range at once, maybe it would have to be done in invalidate_inode_pages2_range(), after the loop? My hesitation about this is that we'd be leaking yet more DAX special casing up into the mm/truncate.c code. Or am I missing something? > So I'd go for that. I'll prepare a patch for the > locking change - it will require changes to ext4 transaction handling so it > won't be completely trivial. 
> > (*) The flow of information is: filesystem block mapping info -> radix tree > -> page tables so if 'filesystem block mapping info' changes, we should go > invalidate corresponding radix tree entries (new entries will already have > uptodate info) and then invalidate corresponding page tables (again once > radix tree has no stale entries, we are sure new page table entries will be > uptodate). > > Honza > -- > Jan Kara > SUSE Labs, CR From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Date: Mon, 1 May 2017 16:38:55 -0600 From: Ross Zwisler To: Jan Kara Cc: Ross Zwisler , Andrew Morton , linux-kernel@vger.kernel.org, Alexander Viro , Alexey Kuznetsov , Andrey Ryabinin , Anna Schumaker , Christoph Hellwig , Dan Williams , "Darrick J. Wong" , Eric Van Hensbergen , Jens Axboe , Johannes Weiner , Konrad Rzeszutek Wilk , Latchesar Ionkov , linux-cifs@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nfs@vger.kernel.org, linux-nvdimm@lists.01.org, Matthew Wilcox , Ron Minnich , samba-technical@lists.samba.org, Steve French , Trond Myklebust , v9fs-developer@lists.sourceforge.net Subject: Re: [PATCH 2/2] dax: fix data corruption due to stale mmap reads Message-ID: <20170501223855.GA25862@linux.intel.com> References: <20170420191446.GA21694@linux.intel.com> <20170421034437.4359-1-ross.zwisler@linux.intel.com> <20170421034437.4359-2-ross.zwisler@linux.intel.com> <20170425111043.GH2793@quack2.suse.cz> <20170425225936.GA29655@linux.intel.com> <20170426085235.GA21738@quack2.suse.cz> <20170426225236.GA25838@linux.intel.com> <20170427072659.GA29789@quack2.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20170427072659.GA29789@quack2.suse.cz> Sender: owner-linux-mm@kvack.org List-ID: On Thu, Apr 27, 2017 at 09:26:59AM +0200, Jan Kara wrote: > On Wed 26-04-17 16:52:36, Ross Zwisler wrote: <> > > I don't think this alone is enough to save us. 
The I/O path doesn't currently > > take any DAX radix tree entry locks, so our race would just become: > > > > CPU1 - write(2) CPU2 - read fault > > > > dax_iomap_pte_fault() > > grab_mapping_entry() // newly moved > > ->iomap_begin() - sees hole > > dax_iomap_rw() > > iomap_apply() > > ->iomap_begin - allocates blocks > > dax_iomap_actor() > > invalidate_inode_pages2_range() > > - there's nothing to invalidate > > - we add zero page in the radix > > tree & map it to page tables > > > > In their current form I don't think we want to take DAX radix tree entry locks > > in the I/O path because that would effectively serialize I/O over a given > > radix tree entry. For a 2MiB entry, for example, all I/O to that 2MiB range > > would be serialized. > > Note that invalidate_inode_pages2_range() will see the entry created by > grab_mapping_entry() on CPU2 and block waiting for its lock and this is > exactly what stops the race. The invalidate_inode_pages2_range() > effectively makes sure there isn't any page fault in progress for given > range... Yep, this is the bit that I was missing. Thanks. > Also note that writes to a file are serialized by i_rwsem anyway (and at > least serialization of writes to the overlapping range is required by POSIX) > so this doesn't add any more serialization than we already have. > > > > Another solution would be to grab i_mmap_sem for write when doing write > > > fault of a page and similarly have it grabbed for writing when doing > > > write(2). This would scale rather poorly but if we later replaced it with a > > > range lock (Davidlohr has already posted a nice implementation of it) it > > > won't be as bad. But I guess option 1) is better... > > > > The best idea I had for handling this sounds similar, which would be to > > convert the radix tree locks to essentially be reader/writer locks. I/O and > > faults that don't modify the block mapping could just take read-level locks, > > and could all run concurrently. 
I/O or faults that modify a block mapping > > would take a write lock, and serialize with other writers and readers. > > Well, this would be difficult to implement inside the radix tree (not > enough bits in the entry) so you'd have to go for some external locking > primitive anyway. And if you do that, read-write range lock Davidlohr has > implemented is what you describe - well we could also have a radix tree > with rwsems but I suspect the overhead of maintaining that would be too > large. It would require larger rewrite than reusing entry locks as I > suggest above though and it isn't an obvious performance win for realistic > workloads either so I'd like to see some performance numbers before going > that way. It likely improves a situation where processes race to fault the > same page for which we already know the block mapping but I'm not sure if > that translates to any measurable performance wins for workloads on DAX > filesystem. > > > You could know if you needed a write lock without asking the filesystem - if > > you're a write and the radix tree entry is empty or is for a zero page, you > > grab the write lock. > > > > This dovetails nicely with the idea of having the radix tree act as a cache > > for block mappings. You take the appropriate lock on the radix tree entry, > > and it has the block mapping info for your I/O or fault so you don't have to > > call into the FS. I/O would also participate so we would keep info about > > block mappings that we gather from I/O to help shortcut our page faults. > > > > How does this sound vs the range lock idea? How hard do you think it would be > > to convert our current wait queue system to reader/writer style locking? > > > > Also, how do you think we should deal with the current PMD corruption? Should > > we go with the current fix (I can augment the comments as you suggested), and > > then handle optimizations to that approach and the solution to this larger > > race as a follow-on? 
> > So for now I'm still more inclined to just stay with the radix tree lock as > is and just fix up the locking as I suggest and go for larger rewrite only > if we can demonstrate further performance wins. Sounds good. > WRT your second patch, if we go with the locking as I suggest, it is enough > to unmap the whole range after invalidate_inode_pages2() has cleared radix > tree entries (*) which will be much cheaper (for large writes) than doing > unmapping entry by entry. I'm still not convinced that it is safe to do the unmap in a separate step. I see your point about it being expensive to do a rmap walk to unmap each entry in __dax_invalidate_mapping_entry(), but I think we might need to because the unmap is part of the contract imposed by invalidate_inode_pages2_range() and invalidate_inode_pages2(). This exists in the header comment above each: * Any pages which are found to be mapped into pagetables are unmapped prior * to invalidation. If you look at the usage of invalidate_inode_pages2_range() in generic_file_direct_write() for example (which I realize we won't call for a DAX inode, but still), I think that it really does rely on the fact that invalidated pages are unmapped, right? If it didn't, and hole pages were mapped, the hole pages could remain mapped while a direct I/O write allocated blocks and then wrote real data. If we really want to unmap the entire range at once, maybe it would have to be done in invalidate_inode_pages2_range(), after the loop? My hesitation about this is that we'd be leaking yet more DAX special casing up into the mm/truncate.c code. Or am I missing something? > So I'd go for that. I'll prepare a patch for the > locking change - it will require changes to ext4 transaction handling so it > won't be completely trivial. 
>
> (*) The flow of information is: filesystem block mapping info -> radix
> tree -> page tables, so if 'filesystem block mapping info' changes, we
> should go invalidate corresponding radix tree entries (new entries will
> already have uptodate info) and then invalidate corresponding page
> tables (again, once the radix tree has no stale entries, we are sure new
> page table entries will be uptodate).
>
> 								Honza
> --
> Jan Kara
> SUSE Labs, CR

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to
majordomo@kvack.org.  For more info on Linux MM, see:
http://www.linux-mm.org/ .  Don't email: email@kvack.org