From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Date: Thu, 4 May 2017 11:12:33 +0200
From: Jan Kara
Subject: Re: [PATCH 2/2] dax: fix data corruption due to stale mmap reads
Message-ID: <20170504091233.GA808@quack2.suse.cz>
References: <20170420191446.GA21694@linux.intel.com> <20170421034437.4359-1-ross.zwisler@linux.intel.com> <20170421034437.4359-2-ross.zwisler@linux.intel.com> <20170425111043.GH2793@quack2.suse.cz> <20170425225936.GA29655@linux.intel.com> <20170426085235.GA21738@quack2.suse.cz> <20170426225236.GA25838@linux.intel.com> <20170427072659.GA29789@quack2.suse.cz> <20170501223855.GA25862@linux.intel.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20170501223855.GA25862@linux.intel.com>
Sender: owner-linux-mm@kvack.org
To: Ross Zwisler
Cc: Jan Kara , Andrew Morton , linux-kernel@vger.kernel.org, Alexander Viro , Alexey Kuznetsov , Andrey Ryabinin , Anna Schumaker , Christoph Hellwig , Dan Williams , "Darrick J. Wong" , Eric Van Hensbergen , Jens Axboe , Johannes Weiner , Konrad Rzeszutek Wilk , Latchesar Ionkov , linux-cifs@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nfs@vger.kernel.org, linux-nvdimm@lists.01.org, Matthew Wilcox , Ron Minnich , samba-technical@lists.samba.org, Steve French , Trond Myklebust , v9fs-developer@lists.sourceforge.net
List-ID:

On Mon 01-05-17 16:38:55, Ross Zwisler wrote:
> > So for now I'm still more inclined to just stay with the radix tree lock as
> > is and just fix up the locking as I suggest and go for larger rewrite only
> > if we can demonstrate further performance wins.
>
> Sounds good.
>
> > WRT your second patch, if we go with the locking as I suggest, it is enough
> > to unmap the whole range after invalidate_inode_pages2() has cleared radix
> > tree entries (*) which will be much cheaper (for large writes) than doing
> > unmapping entry by entry.
>
> I'm still not convinced that it is safe to do the unmap in a separate step. I
> see your point about it being expensive to do a rmap walk to unmap each entry
> in __dax_invalidate_mapping_entry(), but I think we might need to because the
> unmap is part of the contract imposed by invalidate_inode_pages2_range() and
> invalidate_inode_pages2(). This exists in the header comment above each:
>
>  * Any pages which are found to be mapped into pagetables are unmapped prior
>  * to invalidation.
>
> If you look at the usage of invalidate_inode_pages2_range() in
> generic_file_direct_write() for example (which I realize we won't call for a
> DAX inode, but still), I think that it really does rely on the fact that
> invalidated pages are unmapped, right? If it didn't, and hole pages were
> mapped, the hole pages could remain mapped while a direct I/O write allocated
> blocks and then wrote real data.
>
> If we really want to unmap the entire range at once, maybe it would have to be
> done in invalidate_inode_pages2_range(), after the loop? My hesitation about
> this is that we'd be leaking yet more DAX special casing up into the
> mm/truncate.c code.
>
> Or am I missing something?

No, my thinking was to put the invalidation at the end of
invalidate_inode_pages2_range(). I agree it means more special-casing for
DAX in mm/truncate.c.

								Honza
--
Jan Kara
SUSE Labs, CR

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
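
The trade-off discussed in the message above (one rmap walk per invalidated entry versus a single unmap of the whole range after the loop) can be modelled in a few lines of userspace C. This is only a sketch under stated assumptions, not kernel code and not the patch under review: every model_* name is hypothetical, the page size is assumed to be 4 KiB, and an "unmap" here merely increments a counter. In the kernel, the single range call would presumably be something like unmap_mapping_range(), but the thread itself does not name the function.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption for this model only: 4 KiB pages. */
#define MODEL_PAGE_SHIFT 12

/* Byte range covering a run of page-cache indices. */
struct model_byte_range {
	int64_t start;	/* byte offset of the first page */
	int64_t len;	/* length in bytes */
};

/* Number of simulated rmap walks; a stand-in for the real unmap cost. */
static unsigned long model_rmap_walks;

/* Convert the inclusive index range [first, last] into a byte range. */
static struct model_byte_range model_pages_to_bytes(uint64_t first,
						    uint64_t last)
{
	struct model_byte_range r;

	assert(first <= last);
	r.start = (int64_t)first << MODEL_PAGE_SHIFT;
	r.len = (int64_t)(last - first + 1) << MODEL_PAGE_SHIFT;
	return r;
}

/* Hypothetical stand-in for one unmap call: it only counts the walk. */
static void model_unmap(struct model_byte_range r)
{
	(void)r;
	model_rmap_walks++;
}

/* Variant A: unmap each entry as it is invalidated. */
static unsigned long model_invalidate_per_entry(uint64_t first, uint64_t last)
{
	uint64_t idx;

	model_rmap_walks = 0;
	for (idx = first; idx <= last; idx++)
		model_unmap(model_pages_to_bytes(idx, idx));
	return model_rmap_walks;
}

/* Variant B: clear all entries first, then unmap the range once. */
static unsigned long model_invalidate_then_unmap(uint64_t first, uint64_t last)
{
	model_rmap_walks = 0;
	/* Entries [first, last] would be cleared here, without unmapping. */
	model_unmap(model_pages_to_bytes(first, last));
	return model_rmap_walks;
}
```

For a 256-page write, variant A performs 256 simulated walks and variant B performs one, which is the cost argument for doing the unmap after the loop. The model says nothing about the ordering and locking question raised in the quoted text, which is the part that actually decides whether the separate step is safe.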