Date: Tue, 26 Jan 2016 09:46:54 +0100
From: Jan Kara
To: Matthew Wilcox
Cc: Dave Chinner, Ross Zwisler, Theodore Ts'o, Alexander Viro,
	Andreas Dilger, Andrew Morton, Dan Williams, Jan Kara,
	linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, linux-nvdimm@lists.01.org,
	xfs@oss.sgi.com
Subject: Re: [RFC PATCH] dax, ext2, ext4, XFS: fix data corruption race
Message-ID: <20160126084654.GT24938@quack.suse.cz>
In-Reply-To: <20160125204636.GI2948@linux.intel.com>
References: <1453503971-5319-1-git-send-email-ross.zwisler@linux.intel.com>
	<20160124220107.GI20456@dastard>
	<20160125204636.GI2948@linux.intel.com>

On Mon 25-01-16 15:46:36, Matthew Wilcox wrote:
> On Mon, Jan 25, 2016 at 09:01:07AM +1100, Dave Chinner wrote:
> > On Fri, Jan 22, 2016 at 04:06:11PM -0700, Ross Zwisler wrote:
> > > With the current DAX code the following race exists:
> > >
> > > Process 1                          Process 2
> > > ---------                          ---------
> > >
> > > __dax_fault() - read file f, index 0
> > >   get_block() -> returns hole
> > >                                    __dax_fault() - write file f, index 0
> > >                                      get_block() -> allocates blocks
> > >                                      dax_insert_mapping()
> > >   dax_load_hole()
> > >                                    *data corruption*
> > >
> > > An analogous race exists between __dax_fault() loading a hole and
> > > __dax_pmd_fault() allocating a PMD DAX page and trying to insert
> > > it, and that race also ends in data corruption.
> >
> > Ok, so why doesn't this problem exist for the normal page cache
> > insertion case with concurrent read vs write faults? It's because
> > the write fault first does a read fault, and so the write fault
> > always has a page in the radix tree for the get_block call that
> > allocates the extents, right?
>
> No, it's because allocation of blocks is separated from allocation of
> struct page.
>
> > And DAX has an optimisation in the page fault path where it skips
> > the read fault part of the write fault? And so essentially the DAX
> > write fault is missing the object (the page lock of the page in the
> > radix tree) that the non-DAX write fault uses to avoid this problem?
> >
> > What happens if we get rid of that DAX write fault optimisation that
> > skips the initial read fault? The write fault will always run on a
> > mapping that has a hole loaded, right? So the race between
> > dax_load_hole() and dax_insert_mapping() goes away, because nothing
> > will be calling dax_load_hole() once the write fault is allocating
> > blocks....
>
> So in your proposal, we'd look in the radix tree, find nothing, and
> call get_block(..., 0). If we get something back, we can insert it.
> If we hit a hole, we allocate a struct page, put it in the radix tree
> and return to user space. If that was a write fault after all, it'll
> come back to us through the ->page_mkwrite handler, where we can take
> the page lock on the allocated struct page and then call down to DAX,
> which calls back through get_block to allocate? Then DAX kicks the
> struct page out of the page cache and frees it.
>
> That seems to work to me. And we can get rid of pfn_mkwrite at the
> same time, which seems like a win to me.
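If I understand the proposal right, the flow would look roughly like
this (pure pseudocode: proposed_dax_fault(), alloc_and_add_hole_page()
and the surrounding variables are names I made up for illustration, not
existing kernel API):

	/* Read / initial fault: never allocates blocks. */
	static int proposed_dax_fault(struct vm_area_struct *vma,
				      struct vm_fault *vmf)
	{
		struct page *page = find_get_page(mapping, vmf->pgoff);

		if (!page) {
			get_block(inode, block, &bh, 0);  /* create == 0 */
			if (buffer_mapped(&bh))
				return dax_insert_mapping(inode, &bh, vma, vmf);
			/* Hole: install an ordinary zero page to serve reads. */
			page = alloc_and_add_hole_page(mapping, vmf->pgoff);
		}
		/* ... map 'page' read-only and return ... */
	}

	/* A write to the hole page comes back through ->page_mkwrite. */
	static int proposed_dax_mkwrite(struct vm_area_struct *vma,
					struct vm_fault *vmf)
	{
		struct page *page = vmf->page;	/* the hole page */

		lock_page(page);	/* this is what serializes the faults */
		get_block(inode, block, &bh, 1);  /* allocate blocks */
		dax_insert_mapping(inode, &bh, vma, vmf);
		/* Blocks exist now; the zero page is stale, kick it out. */
		delete_from_page_cache(page);
		unlock_page(page);
		return VM_FAULT_NOPAGE;
	}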
Getting rid of pfn_mkwrite() would be nice, I agree. But the above
scheme still has issues when PMD pages come into play. Or maybe to
start from the beginning: how would you like PMD faults to work?
Because there is no obvious 'struct page' to protect the allocation of
2MB worth of blocks. You could resort to tricks similar to what
transparent huge pages do (compound pages), but then the cure is IMHO
worse than the disease.

There is one option: no allocation of blocks for PMD faults at all,
sketched below. That would make the code much simpler and solve the
races, but I'm not sure whether that is really an acceptable loss of
functionality...
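I.e., the PMD handler would only ever map extents that already exist,
something like (again just a sketch; dax_insert_pmd_mapping() is a
made-up name):

	static int pmd_fault_no_alloc(struct vm_area_struct *vma,
				      struct vm_fault *vmf)
	{
		/* Only look up existing extents, never allocate. */
		get_block(inode, block, &bh, 0);  /* create == 0 */
		if (!buffer_mapped(&bh) || bh.b_size < PMD_SIZE)
			return VM_FAULT_FALLBACK;  /* take the PTE path */
		/* A contiguous 2MB extent already exists; map it as a PMD. */
		return dax_insert_pmd_mapping(inode, &bh, vma, vmf);
	}

Writes into holes would then always fall back to PTE faults, where the
hole page would provide the serialization.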
								Honza
-- 
Jan Kara
SUSE Labs, CR