Date: Tue, 26 Jan 2016 09:46:54 +0100
From: Jan Kara
To: Matthew Wilcox
Cc: Dave Chinner, Ross Zwisler, Theodore Ts'o, Alexander Viro,
	Andreas Dilger, Andrew Morton, Dan Williams, Jan Kara,
	linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, linux-nvdimm@lists.01.org,
	xfs@oss.sgi.com
Subject: Re: [RFC PATCH] dax, ext2, ext4, XFS: fix data corruption race
Message-ID: <20160126084654.GT24938@quack.suse.cz>
In-Reply-To: <20160125204636.GI2948@linux.intel.com>
References: <1453503971-5319-1-git-send-email-ross.zwisler@linux.intel.com>
	<20160124220107.GI20456@dastard>
	<20160125204636.GI2948@linux.intel.com>

On Mon 25-01-16 15:46:36, Matthew Wilcox wrote:
> On Mon, Jan 25, 2016 at 09:01:07AM +1100, Dave Chinner wrote:
> > On Fri, Jan 22, 2016 at 04:06:11PM -0700, Ross Zwisler wrote:
> > > With the current DAX code the following race exists:
> > >
> > > Process 1                          Process 2
> > > ---------                          ---------
> > >
> > > __dax_fault() - read file f, index 0
> > >   get_block() -> returns hole
> > >                                    __dax_fault() - write file f, index 0
> > >                                      get_block() -> allocates blocks
> > >                                      dax_insert_mapping()
> > >   dax_load_hole()
> > >                                    *data corruption*
> > >
> > > An analogous race exists between __dax_fault() loading a hole and
> > > __dax_pmd_fault() allocating a PMD DAX page and trying to insert
> > > it, and that race also ends in data corruption.
> >
> > Ok, so why doesn't this problem exist for the normal page cache
> > insertion case with concurrent read vs write faults? It's because
> > the write fault first does a read fault, and so the write fault
> > always has a page in the radix tree for the get_block call that
> > allocates the extents, right?
>
> No, it's because allocation of blocks is separated from allocation of
> struct page.
>
> > And DAX has an optimisation in the page fault path where it skips
> > the read fault part of the write fault? And so essentially the DAX
> > write fault is missing the object (the page lock of the page in the
> > radix tree) that the non-DAX write fault uses to avoid this problem?
> >
> > What happens if we get rid of that DAX write fault optimisation that
> > skips the initial read fault? The write fault will always run on a
> > mapping that has a hole loaded, right? So the race between
> > dax_load_hole() and dax_insert_mapping() goes away, because nothing
> > will be calling dax_load_hole() once the write fault is allocating
> > blocks....
>
> So in your proposal, we'd look in the radix tree, find nothing, and
> call get_block(..., 0). If we get something back, we can insert it.
> If we hit a hole, we allocate a struct page, put it in the radix tree
> and return to user space. If that was a write fault after all, it'll
> come back to us through the ->page_mkwrite handler, where we can take
> the page lock on the allocated struct page and then call down to DAX,
> which calls back through get_block to allocate? Then DAX kicks the
> struct page out of the page cache and frees it.
>
> That seems to work to me. And we can get rid of pfn_mkwrite at the
> same time, which seems like a win to me.
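If I understand the proposal right, the flow would look roughly like
this (pure pseudocode: proposed_dax_fault(), alloc_and_add_hole_page()
and the surrounding variables are names I made up for illustration, not
existing kernel API):

	/* Read / initial fault: never allocates blocks. */
	static int proposed_dax_fault(struct vm_area_struct *vma,
				      struct vm_fault *vmf)
	{
		struct page *page = find_get_page(mapping, vmf->pgoff);

		if (!page) {
			get_block(inode, block, &bh, 0);  /* create == 0 */
			if (buffer_mapped(&bh))
				return dax_insert_mapping(inode, &bh, vma, vmf);
			/* Hole: install an ordinary zero page to serve reads. */
			page = alloc_and_add_hole_page(mapping, vmf->pgoff);
		}
		/* ... map 'page' read-only and return ... */
	}

	/* A write to the hole page comes back through ->page_mkwrite. */
	static int proposed_dax_mkwrite(struct vm_area_struct *vma,
					struct vm_fault *vmf)
	{
		struct page *page = vmf->page;	/* the hole page */

		lock_page(page);	/* this is what serializes the faults */
		get_block(inode, block, &bh, 1);  /* allocate blocks */
		dax_insert_mapping(inode, &bh, vma, vmf);
		/* Blocks exist now; the zero page is stale, kick it out. */
		delete_from_page_cache(page);
		unlock_page(page);
		return VM_FAULT_NOPAGE;
	}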
Getting rid of pfn_mkwrite() would be nice, I agree. But the above
scheme still has issues when PMD pages come into play. Or maybe to
start from the beginning: how would you like PMD faults to work?
Because there is no obvious 'struct page' to protect the allocation of
2MB worth of blocks. You could resort to tricks similar to what
transparent huge pages do (compound pages), but then the cure is IMHO
worse than the disease.

There is one option: no allocation of blocks for PMD faults at all,
sketched below. That would make the code much simpler and solve the
races, but I'm not sure whether that is really an acceptable loss of
functionality...
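I.e., the PMD handler would only ever map extents that already exist,
something like (again just a sketch; dax_insert_pmd_mapping() is a
made-up name):

	static int pmd_fault_no_alloc(struct vm_area_struct *vma,
				      struct vm_fault *vmf)
	{
		/* Only look up existing extents, never allocate. */
		get_block(inode, block, &bh, 0);  /* create == 0 */
		if (!buffer_mapped(&bh) || bh.b_size < PMD_SIZE)
			return VM_FAULT_FALLBACK;  /* take the PTE path */
		/* A contiguous 2MB extent already exists; map it as a PMD. */
		return dax_insert_pmd_mapping(inode, &bh, vma, vmf);
	}

Writes into holes would then always fall back to PTE faults, where the
hole page would provide the serialization.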
								Honza
-- 
Jan Kara
SUSE Labs, CR