Date: Tue, 11 Sep 2018 17:14:21 +0200
From: Jan Kara
To: Eric Sandeen
Cc: Jan Kara, linux-nvdimm@lists.01.org, darrick.wong@oracle.com, Theodore Ts'o, Ross Zwisler, linux-fsdevel, lczerner@redhat.com, linux-ext4
Subject: Re: [PATCH v4 0/2] ext4: fix DAX dma vs truncate/hole-punch
Message-ID: <20180911151421.GD6104@quack2.suse.cz>
References: <20180710191031.17919-1-ross.zwisler@linux.intel.com> <20180711081741.lmr44sp4cmt3f6um@quack2.suse.cz> <20180725222839.GA28304@linux.intel.com> <20180807084545.6gzxrtrvb34hyhdq@quack2.suse.cz> <8c70d61a-fc5c-b928-334a-fbb2567b8dea@sandeen.net>
In-Reply-To: <8c70d61a-fc5c-b928-334a-fbb2567b8dea@sandeen.net>

On Mon 10-09-18 17:18:49, Eric Sandeen wrote:
> On 8/7/18 3:45 AM, Jan Kara wrote:
> > On Fri 27-07-18 10:28:51, Ross Zwisler wrote:
> >> + fsdevel and the xfs list.
> >>
> >> On Wed, Jul 25, 2018 at 4:28 PM Ross Zwisler wrote:
> >>> On Wed, Jul 11, 2018 at 10:17:41AM +0200, Jan Kara wrote:
> >>>> On Tue 10-07-18 13:10:29, Ross Zwisler wrote:
> >>>>> Changes since v3:
> >>>>>  * Added an ext4_break_layouts() call to ext4_insert_range() to ensure
> >>>>>    that the {ext4,xfs}_break_layouts() calls have the same meaning.
> >>>>>    (Dave, Darrick and Jan)
> >>>>
> >>>> How about the occasional WARN_ON_ONCE you mention below. Were you able to
> >>>> hunt them down?
> >>>
> >>> The root cause of this issue is that while the ei->i_mmap_sem provides
> >>> synchronization between ext4_break_layouts() and page faults, it doesn't
> >>> synchronize us with the direct I/O path.  This exact same issue exists
> >>> in XFS AFAICT, with the synchronization tool there being the XFS_MMAPLOCK.
> >>>
> >>> This allows the direct I/O path to do I/O and raise & lower page->_refcount
> >>> while we're executing a truncate/hole punch.  This leads to us trying to free
> >>> a page with an elevated refcount.
> >>>
> >>> Here's one instance of the race:
> >>>
> >>> CPU 0                                   CPU 1
> >>> -----                                   -----
> >>> ext4_punch_hole()
> >>>   ext4_break_layouts() # all pages have refcount=1
> >>>
> >>>                                         ext4_direct_IO()
> >>>                                           ... lots of layers ...
> >>>                                           follow_page_pte()
> >>>                                             get_page() # elevates refcount
> >>>
> >>>   truncate_pagecache_range()
> >>>    ... a few layers ...
> >>>    dax_disassociate_entry() # sees elevated refcount, WARN_ON_ONCE()
> >>>
> >
> > So this is a very different race from the one below. And it should be
> > impossible to happen. This race is exactly the reason why
> > dax_layout_busy_page() has unmap_mapping_range() call to force GUP to fault
> > which blocks on ei->i_mmap_sem / XFS_MMAPLOCK and thus avoids the race.
> >
> >>> A similar race occurs when the refcount is being dropped while we're running
> >>> ext4_break_layouts(), and this is the one that my test was actually hitting:
> >>>
> >>> CPU 0                                   CPU 1
> >>> -----                                   -----
> >>>                                         ext4_direct_IO()
> >>>                                           ... lots of layers ...
> >>>                                           follow_page_pte()
> >>>                                             get_page()
> >>>                                             # elevates refcount of page X
> >>> ext4_punch_hole()
> >>>   ext4_break_layouts() # two pages, X & Y, have refcount == 2
> >>>     __wait_var_event() # called for page X
> >>>
> >>>                                         __put_devmap_managed_page()
> >>>                                         # drops refcount of X to 1
> >>>
> >>>    # __wait_var_event() checks X's refcount in "if (condition)", and breaks.
> >>>    # We never actually called ext4_wait_dax_page(), so 'retry' in
> >>>    # ext4_break_layouts() is still false.  Exit do/while loop in
> >>>    # ext4_break_layouts(), never attempting to wait on page Y which still has an
> >>>    # elevated refcount of 2.
> >>>
> >>>   truncate_pagecache_range()
> >>>    ... a few layers ...
> >>>    dax_disassociate_entry() # sees elevated refcount for Y, WARN_ON_ONCE()
> >>>
> >>> This second race can be fixed with the patch at the end of this mail,
> >>> which I think should go in, unless there is a benefit to the current retry
> >>> scheme which relies on the 'retry' variable in {ext4,xfs}_break_layouts()?
> >>> With this patch applied I've been able to run my unit test through
> >>> thousands of iterations, where it used to fail consistently within 10 or
> >>> so.
> >>>
> >>> Even so, I wonder if the real solution is to add synchronization between
> >>> the direct I/O path and {ext4,xfs}_break_layouts()?  Other ideas on how
> >>> this should be handled?
> >>>
> >>> --- >8 ---
> >>>
> >>> From a4519b0f40362f0a63ae96acaf986092aff0f0d3 Mon Sep 17 00:00:00 2001
> >>> From: Ross Zwisler
> >>> Date: Wed, 25 Jul 2018 16:16:05 -0600
> >>> Subject: [PATCH] ext4: Close race between direct IO and ext4_break_layouts()
> >>>
> >>> If the refcount of a page is lowered between the time that it is returned
> >>> by dax_layout_busy_page() and when the refcount is again checked in
> >>> ext4_break_layouts() => ___wait_var_event(), the waiting function
> >>> ext4_wait_dax_page() will never be called.  This means that
> >>> ext4_break_layouts() will still have 'retry' set to false, so we'll stop
> >>> looping and never check the refcount of other pages in this inode.
> >>>
> >>> Instead, always continue looping as long as dax_layout_busy_page() gives us
> >>> a page which it found with an elevated refcount.
> >>>
> >>> Note that this works around the race exposed by my unit test, but I think
> >>> that there is another race that needs to be addressed, probably with
> >>> additional synchronization added between direct I/O and
> >>> {ext4,xfs}_break_layouts().
> >>>
> >>> Signed-off-by: Ross Zwisler
> >
> > OK, this is a good catch and the patch looks good. You can add:
> >
> > Reviewed-by: Jan Kara
> >
> > Also please post this fix officially to Ted to include it in his tree (I
> > can see that he has all your other patches queued for the merge window).
>
> Did these ever get on Ted's radar?  I don't see it upstream yet.

Hum, it seems Ted never picked this patch up. I guess I'll gather the two
fixes you pointed out and resend them to Ted.

								Honza
-- 
Jan Kara
SUSE Labs, CR
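
For readers following the second race, here is a minimal userspace sketch in C of why the 'retry' flag can end the loop early. It is a toy model, not the kernel code: sim_inode, sim_busy_page(), sim_wait_for_idle() and the racing_put field are hypothetical stand-ins for the inode's DAX pages, dax_layout_busy_page(), the ___wait_var_event() / ext4_wait_dax_page() pair, and the direct I/O path dropping its reference. The one assumption carried over from the description above is that the wait helper skips its callback when the condition already holds.

```c
#include <stdbool.h>

#define NPAGES 2

/* Hypothetical model of an inode's DAX pages; only refcounts matter here. */
struct sim_inode {
	int refcount[NPAGES];	/* 1 == idle, > 1 == pinned by direct I/O */
	int racing_put;		/* page whose pin is dropped right after lookup, or -1 */
};

/* Stand-in for dax_layout_busy_page(): index of the first busy page, or -1. */
static int sim_busy_page(struct sim_inode *inode)
{
	int found = -1;

	for (int i = 0; i < NPAGES; i++) {
		if (inode->refcount[i] > 1) {
			found = i;
			break;
		}
	}
	/* Inject the race: "CPU 1" drops its reference after we saw the page busy. */
	if (found >= 0 && found == inode->racing_put) {
		inode->refcount[found] = 1;
		inode->racing_put = -1;
	}
	return found;
}

/*
 * Stand-in for the wait step (assumption: the callback is skipped when the
 * condition already holds). If the page is already idle, return without
 * touching *retry. Otherwise pretend we slept until the I/O completed.
 */
static void sim_wait_for_idle(struct sim_inode *inode, int page, bool *retry)
{
	if (inode->refcount[page] == 1)
		return;
	inode->refcount[page] = 1;
	*retry = true;
}

static int sim_count_busy(const struct sim_inode *inode)
{
	int busy = 0;

	for (int i = 0; i < NPAGES; i++)
		busy += inode->refcount[i] > 1;
	return busy;
}

/* Loop driven by 'retry': stops as soon as one wait returns without sleeping. */
static int break_layouts_old(struct sim_inode *inode)
{
	bool retry;

	do {
		retry = false;
		int page = sim_busy_page(inode);
		if (page < 0)
			break;
		sim_wait_for_idle(inode, page, &retry);
	} while (retry);
	return sim_count_busy(inode);	/* pages still pinned on exit */
}

/* Loop from the proposed fix: continue until no busy page is returned. */
static int break_layouts_new(struct sim_inode *inode)
{
	for (;;) {
		bool unused = false;
		int page = sim_busy_page(inode);
		if (page < 0)
			break;
		sim_wait_for_idle(inode, page, &unused);
	}
	return sim_count_busy(inode);
}
```

With pages X and Y both at refcount 2 and the pin on X dropped right after the lookup, break_layouts_old() returns while Y is still busy, whereas break_layouts_new() keeps looping until the lookup finds nothing.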