linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Eric Sandeen <sandeen@sandeen.net>
To: Jan Kara <jack@suse.cz>, Ross Zwisler <zwisler@kernel.org>
Cc: linux-nvdimm@lists.01.org, darrick.wong@oracle.com,
	lczerner@redhat.com, linux-ext4 <linux-ext4@vger.kernel.org>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	Theodore Ts'o <tytso@mit.edu>
Subject: Re: [PATCH v4 0/2] ext4: fix DAX dma vs truncate/hole-punch
Date: Mon, 10 Sep 2018 17:18:49 -0500	[thread overview]
Message-ID: <8c70d61a-fc5c-b928-334a-fbb2567b8dea@sandeen.net> (raw)
In-Reply-To: <20180807084545.6gzxrtrvb34hyhdq@quack2.suse.cz>

On 8/7/18 3:45 AM, Jan Kara wrote:
> On Fri 27-07-18 10:28:51, Ross Zwisler wrote:
>> + fsdevel and the xfs list.
>>
>> On Wed, Jul 25, 2018 at 4:28 PM Ross Zwisler
>> <ross.zwisler@linux.intel.com> wrote:
>>> On Wed, Jul 11, 2018 at 10:17:41AM +0200, Jan Kara wrote:
>>>> On Tue 10-07-18 13:10:29, Ross Zwisler wrote:
>>>>> Changes since v3:
>>>>>  * Added an ext4_break_layouts() call to ext4_insert_range() to ensure
>>>>>    that the {ext4,xfs}_break_layouts() calls have the same meaning.
>>>>>    (Dave, Darrick and Jan)
>>>>
>>>> How about the occasional WARN_ON_ONCE you mention below. Were you able to
>>>> hunt them down?
>>>
>>> The root cause of this issue is that while the ei->i_mmap_sem provides
>>> synchronization between ext4_break_layouts() and page faults, it doesn't
>>> provide synchronize us with the direct I/O path.  This exact same issue exists
>>> in XFS AFAICT, with the synchronization tool there being the XFS_MMAPLOCK.
>>>
>>> This allows the direct I/O path to do I/O and raise & lower page->_refcount
>>> while we're executing a truncate/hole punch.  This leads to us trying to free
>>> a page with an elevated refcount.
>>>
>>> Here's one instance of the race:
>>>
>>> CPU 0                                   CPU 1
>>> -----                                   -----
>>> ext4_punch_hole()
>>>   ext4_break_layouts() # all pages have refcount=1
>>>
>>>                                         ext4_direct_IO()
>>>                                           ... lots of layers ...
>>>                                           follow_page_pte()
>>>                                             get_page() # elevates refcount
>>>
>>>   truncate_pagecache_range()
>>>    ... a few layers ...
>>>    dax_disassociate_entry() # sees elevated refcount, WARN_ON_ONCE()
>>>
> 
> So this is a very different race from the one below. And it should be
> impossible to happen. This race is exactly the reason why
> dax_layout_busy_page() has unmap_mapping_range() call to force GUP to fault
> which blocks on ei->i_mmap_sem / XFS_MMAPLOCK and thus avoids the race.
> 
>>> A similar race occurs when the refcount is being dropped while we're running
>>> ext4_break_layouts(), and this is the one that my test was actually hitting:
>>>
>>> CPU 0                                   CPU 1
>>> -----                                   -----
>>>                                         ext4_direct_IO()
>>>                                           ... lots of layers ...
>>>                                           follow_page_pte()
>>>                                             get_page()
>>>                                             # elevates refcount of page X
>>> ext4_punch_hole()
>>>   ext4_break_layouts() # two pages, X & Y, have refcount == 2
>>>     __wait_var_event() # called for page X
>>>
>>>                                           __put_devmap_managed_page()
>>>                                           # drops refcount of X to 1
>>>
>>>    # __wait_var_events() checks X's refcount in "if (condition)", and breaks.
>>>    # We never actually called ext4_wait_dax_page(), so 'retry' in
>>>    # ext4_break_layouts() is still false.  Exit do/while loop in
>>>    # ext4_break_layouts, never attempting to wait on page Y which still has an
>>>    # elevated refcount of 2.
>>>
>>>   truncate_pagecache_range()
>>>    ... a few layers ...
>>>    dax_disassociate_entry() # sees elevated refcount for Y, WARN_ON_ONCE()
>>>
>>> This second race can be fixed with the patch at the end of this function,
>>> which I think should go in, unless there is a benfit to the current retry
>>> scheme which relies on the 'retry' variable in {ext4,xfs}_break_layouts()?
>>> With this patch applied I've been able to run my unit test through
>>> thousands of iterations, where it used to failed consistently within 10 or
>>> so.
>>>
>>> Even so, I wonder if the real solution is to add synchronization between
>>> the direct I/O path and {ext4,xfs}_break_layouts()?  Other ideas on how
>>> this should be handled?
>>>
>>> --- >8 ---
>>>
>>> From a4519b0f40362f0a63ae96acaf986092aff0f0d3 Mon Sep 17 00:00:00 2001
>>> From: Ross Zwisler <zwisler@kernel.org>
>>> Date: Wed, 25 Jul 2018 16:16:05 -0600
>>> Subject: [PATCH] ext4: Close race between direct IO and ext4_break_layouts()
>>>
>>> If the refcount of a page is lowered between the time that it is returned
>>> by dax_busy_page() and when the refcount is again checked in
>>> ext4_break_layouts() => ___wait_var_event(), the waiting function
>>> ext4_wait_dax_page() will never be called.  This means that
>>> ext4_break_layouts() will still have 'retry' set to false, so we'll stop
>>> looping and never check the refcount of other pages in this inode.
>>>
>>> Instead, always continue looping as long as dax_layout_busy_page() gives us
>>> a page which it found with an elevated refcount.
>>>
>>> Note that this works around the race exposed by my unit test, but I think
>>> that there is another race that needs to be addressed, probably with
>>> additional synchronization added between direct I/O and
>>> {ext4,xfs}_break_layouts().
>>>
>>> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> 
> OK, this is a good catch and the patch looks good. You can add:
> 
> Reviewed-by: Jan Kara <jack@suse.cz>
> 
> Also please post this fix officially to Ted to include it in his tree (I
> can see that he has all your other patches queued for the merge window).

Did these ever get on Ted's radar?  I don't see it upstream yet.

Thanks,
-Eric

  reply	other threads:[~2018-09-11  3:15 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <20180710191031.17919-1-ross.zwisler@linux.intel.com>
     [not found] ` <20180711081741.lmr44sp4cmt3f6um@quack2.suse.cz>
     [not found]   ` <20180725222839.GA28304@linux.intel.com>
2018-07-27 16:28     ` Ross Zwisler
2018-08-06  3:55       ` Dave Chinner
2018-08-06 15:49         ` Christoph Hellwig
2018-08-06 22:25           ` Dave Chinner
2018-08-07  8:45       ` Jan Kara
2018-09-10 22:18         ` Eric Sandeen [this message]
2018-09-11 15:14           ` Jan Kara
2018-09-11 15:20             ` Jan Kara
2018-09-11 17:28               ` Theodore Y. Ts'o
2018-09-11 18:21                 ` Eric Sandeen

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=8c70d61a-fc5c-b928-334a-fbb2567b8dea@sandeen.net \
    --to=sandeen@sandeen.net \
    --cc=darrick.wong@oracle.com \
    --cc=jack@suse.cz \
    --cc=lczerner@redhat.com \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-nvdimm@lists.01.org \
    --cc=tytso@mit.edu \
    --cc=zwisler@kernel.org \
    --subject='Re: [PATCH v4 0/2] ext4: fix DAX dma vs truncate/hole-punch' \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).