Date: Wed, 26 Apr 2017 10:52:35 +0200
From: Jan Kara
To: Ross Zwisler
Cc: Jan Kara, Andrew Morton, linux-kernel@vger.kernel.org, Alexander Viro,
	Alexey Kuznetsov, Andrey Ryabinin, Anna Schumaker, Christoph Hellwig,
	Dan Williams, "Darrick J. Wong", Eric Van Hensbergen, Jens Axboe,
	Johannes Weiner, Konrad Rzeszutek Wilk, Latchesar Ionkov,
	linux-cifs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org, linux-nfs@vger.kernel.org, linux-nvdimm@ml01.01.org,
	Matthew Wilcox, Ron Minnich, samba-technical@lists.samba.org,
	Steve French, Trond Myklebust, v9fs-developer@lists.sourceforge.net
Subject: Re: [PATCH 2/2] dax: fix data corruption due to stale mmap reads
Message-ID: <20170426085235.GA21738@quack2.suse.cz>
References: <20170420191446.GA21694@linux.intel.com>
	<20170421034437.4359-1-ross.zwisler@linux.intel.com>
	<20170421034437.4359-2-ross.zwisler@linux.intel.com>
	<20170425111043.GH2793@quack2.suse.cz>
	<20170425225936.GA29655@linux.intel.com>
In-Reply-To: <20170425225936.GA29655@linux.intel.com>

On Tue 25-04-17 16:59:36, Ross Zwisler wrote:
> On Tue, Apr 25, 2017 at 01:10:43PM +0200, Jan Kara wrote:
> <>
> > Hum, but now thinking more about it, I have a hard time figuring out why
> > write vs fault cannot actually still race:
> >
> > CPU1 - write(2)                          CPU2 - read fault
> >
> >                                          dax_iomap_pte_fault()
> >                                            ->iomap_begin() - sees hole
> > dax_iomap_rw()
> >   iomap_apply()
> >     ->iomap_begin - allocates blocks
> >     dax_iomap_actor()
> >       invalidate_inode_pages2_range()
> >         - there's nothing to invalidate
> >                                          grab_mapping_entry()
> >                                          - we add zero page in the radix
> >                                            tree & map it to page tables
> >
> > Similarly, a read vs write fault may end up racing in the wrong way and
> > try to replace an already existing exceptional entry with a hole page?
>
> Yep, this race seems real to me, too. This seems very much like the issues
> that exist when a thread is doing direct I/O. One thread is doing I/O to an
> intermediate buffer (page cache for the direct I/O case, zero page for us),
> and the other is going around it directly to media, and they can get out of
> sync.
>
> IIRC the direct I/O code looked something like:
>
> 1/ invalidate existing mappings
> 2/ do direct I/O to media
> 3/ invalidate mappings again, just in case. Should be cheap if there weren't
>    any conflicting faults. This makes sure any new allocations we made are
>    faulted in.

Yeah, the problem is that people generally expect weird behavior when they
mix direct and buffered IO (let alone mmap); however, everyone expects
standard read(2) and write(2) to be completely coherent with mmap(2).

> I guess one option would be to replicate that logic in the DAX I/O path, or
> we could try and enhance our locking so page faults can't race with I/O,
> since both can allocate blocks.
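For reference, the invalidate / write / invalidate-again ordering described
above would look roughly like the following when applied to the DAX write
path. This is only a sketch of the idea, not the actual fix:
dax_write_sketch() is a hypothetical helper, it assumes a non-empty iter,
and error handling is omitted.

#include <linux/dax.h>
#include <linux/fs.h>
#include <linux/pagemap.h>
#include <linux/uio.h>

static ssize_t dax_write_sketch(struct kiocb *iocb, struct iov_iter *iter,
				const struct iomap_ops *ops)
{
	struct address_space *mapping = iocb->ki_filp->f_mapping;
	pgoff_t start = iocb->ki_pos >> PAGE_SHIFT;
	pgoff_t end = (iocb->ki_pos + iov_iter_count(iter) - 1) >> PAGE_SHIFT;
	ssize_t ret;

	/* 1/ drop whatever the fault path has already cached for this range */
	invalidate_inode_pages2_range(mapping, start, end);

	/* 2/ allocate blocks and copy the data directly to media */
	ret = dax_iomap_rw(iocb, iter, ops);

	/*
	 * 3/ invalidate again in case a racing read fault installed a zero
	 * page (or a hole entry) while the write was in flight; cheap if
	 * nothing actually raced.
	 */
	invalidate_inode_pages2_range(mapping, start, end);

	return ret;
}

The second invalidate is what would close the window in the race above: any
zero page a read fault installed between steps 1 and 2 gets thrown out, so
the next fault maps the newly allocated blocks instead.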
Put abstractly, the problem is that the radix tree (and the page tables)
cache block mapping information, and the operation "read block mapping
information, store it in the radix tree" is not serialized in any way
against other block allocations, so the information can already be out of
date by the time we store it.

One way to solve this would be to move the ->iomap_begin call in the fault
paths under the entry lock, although that would mean redoing how ext4
handles DAX faults, because with the current code it would create a lock
inversion wrt. transaction start.

Another solution would be to grab i_mmap_sem for write when handling a
write fault of a page, and similarly have it grabbed for writing when doing
write(2). This would scale rather poorly, but if we later replaced it with
a range lock (Davidlohr has already posted a nice implementation of one),
it wouldn't be as bad. But I guess option 1) is better...

								Honza
--
Jan Kara
SUSE Labs, CR
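For illustration, a minimal sketch of the second option above, assuming
ext4's existing i_mmap_sem and ext4_iomap_ops; this is just the locking
shape, not a proposed patch (ext4_dax_write_serialized() is a hypothetical
helper and error handling is omitted). The write-fault path would likewise
have to take i_mmap_sem for writing instead of for reading, which is the
part that scales poorly until a range lock replaces the rwsem.

/* Sketch only; would live in fs/ext4/. */
#include <linux/dax.h>
#include <linux/fs.h>
#include "ext4.h"

static ssize_t ext4_dax_write_serialized(struct kiocb *iocb,
					 struct iov_iter *iter)
{
	struct inode *inode = file_inode(iocb->ki_filp);
	ssize_t ret;

	/* Excludes all DAX faults on this inode while blocks are allocated. */
	down_write(&EXT4_I(inode)->i_mmap_sem);
	ret = dax_iomap_rw(iocb, iter, &ext4_iomap_ops);
	up_write(&EXT4_I(inode)->i_mmap_sem);

	return ret;
}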