linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Dave Chinner <david@fromorbit.com>
To: Nicholas Piggin <npiggin@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>,
	"Oliver O'Halloran" <oohall@gmail.com>,
	Yumei Huang <yuhuang@redhat.com>, Michal Hocko <mhocko@suse.com>,
	Xiao Guangrong <guangrong.xiao@linux.intel.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	KVM list <kvm@vger.kernel.org>, Linux MM <linux-mm@kvack.org>,
	Gleb Natapov <gleb@kernel.org>,
	"linux-nvdimm@lists.01.org" <linux-nvdimm@ml01.01.org>,
	mtosatti@redhat.com,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	Dave Hansen <dave.hansen@intel.com>,
	Stefan Hajnoczi <stefanha@redhat.com>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	Paolo Bonzini <pbonzini@redhat.com>
Subject: Re: DAX mapping detection (was: Re: [PATCH] Fix region lost in /proc/self/smaps)
Date: Fri, 16 Sep 2016 08:33:50 +1000	[thread overview]
Message-ID: <20160915223350.GU22388@dastard> (raw)
In-Reply-To: <20160915214222.505f4888@roar.ozlabs.ibm.com>

On Thu, Sep 15, 2016 at 09:42:22PM +1000, Nicholas Piggin wrote:
> On Thu, 15 Sep 2016 20:32:10 +1000
> Dave Chinner <david@fromorbit.com> wrote:
> > 
> > You still haven't described anything about what a per-block flag
> > design is supposed to look like.... :/
> 
> For the API, or implementation? I'm not quite sure what you mean
> here. For implementation it's possible to carefully ensure metadata
> is persistent when allocating blocks in page fault but before
> mapping pages. Truncate or hole punch or such things can be made to
> work by invalidating all such mappings and holding them off until
> you can cope with them again. Not necessarily for a filesystem with
> *all* capabilities of XFS -- I don't know -- but for a complete basic
> one.

SO, essentially, it comes down to synchrnous metadta updates again.
but synchronous updates would be conditional on whether an extent
metadata with the "nofsync" flag asserted was updated? Where's the
nofsync flag kept? in memory at a generic layer, or in the
filesystem, potentially in an on-disk structure? How would the
application set it for a given range?

> > > > > Filesystem will
> > > > > invalidate all such mappings before it does buffered IOs or hole punch,
> > > > > and will sync metadata after allocating a new block but before returning
> > > > > from a fault.    
> > > > 
> > > > ... requires synchronous metadata updates from page fault context,
> > > > which we already know is not a good solution.  I'll quote one of
> > > > Christoph's previous replies to save me the trouble:
> > > > 
> > > > 	"You could write all metadata synchronously from the page
> > > > 	fault handler, but that's basically asking for all kinds of
> > > > 	deadlocks."
> > > > So, let's redirect back to the "no sync" flag you were talking about
> > > > - can you answer the questions I asked above? It would be especially
> > > > important to highlight how the proposed feature would avoid requiring
> > > > synchronous metadata updates in page fault contexts....  
> > > 
> > > Right. So what deadlocks are you concerned about?  
> > 
> > It basically puts the entire journal checkpoint path under a page
> > fault context. i.e. a whole new global locking context problem is
> 
> Yes there are potentially some new lock orderings created if you
> do that, depending on what locks the filesystem does.

Well, that's the whole issue.

> > created as this path can now be run both inside and outside the
> > mmap_sem. Nothing ever good comes from running filesystem locking
> > code both inside and outside the mmap_sem.
> 
> You mean that some cases journal checkpoint runs with mmap_sem
> held, and others without mmap_sem held? Not that mmap_sem is taken
> inside journal checkpoint.

Maybe not, but now we open up the potential for locks held inside
or outside mmap sem to interact with the journal locks that are now
held inside and outside mmap_sem. See below....

> Then I don't really see why that's a
> problem. I mean performance could suffer a bit, but with fault
> retry you can almost always do the syncing outside mmap_sem in
> practice.
> 
> Yes, I'll preemptively agree with you -- We don't want to add any
> such burden if it is not needed and well justified.
> 
> > FWIW, We've never executed synchronous transactions inside page
> > faults in XFS, and I think ext4 is in the same boat - it may be even
> > worse because of the way it does ordered data dispatch through the
> > journal. I don't really even want to think about the level of hurt
> > this might put btrfs or other COW/log structured filesystems under.
> > I'm sure Christoph can reel off a bunch more issues off the top of
> > his head....
> 
> I asked him, we'll see what he thinks. I don't beleive there is
> anything fundamental about mm or fs core layers that cause deadlocks
> though.

Spent 5 minutes looking at ext4 for an example: filesystems are
allowed to take page locks during transaction commit. e.g ext4
journal commit when using the default ordered data mode:

jbd2_journal_commit_transaction
  journal_submit_data_buffers()
    journal_submit_inode_data_buffers
      generic_writepages()
        ext4_writepages()
	  mpage_prepare_extent_to_map()
	    lock_page()

i.e. if we fault on the user buffer during a write() operation and
that user buffer is a mmapped DAX file that needs to be allocated
and we have synchronous metadata updates during page faults, we
deadlock on the page lock held above the page fault context...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

  reply	other threads:[~2016-09-15 22:35 UTC|newest]

Thread overview: 38+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-09-08  4:32 DAX mapping detection (was: Re: [PATCH] Fix region lost in /proc/self/smaps) Dan Williams
2016-09-08 22:56 ` Ross Zwisler
2016-09-08 23:04   ` Dan Williams
2016-09-09  8:55     ` Xiao Guangrong
2016-09-09 15:40       ` Dan Williams
2016-09-12  6:00         ` Xiao Guangrong
2016-09-12  3:44       ` Rudoff, Andy
2016-09-12  6:31         ` Xiao Guangrong
2016-09-12  1:40   ` Dave Chinner
2016-09-15  5:55     ` Darrick J. Wong
2016-09-15  6:25       ` Dave Chinner
2016-09-12  5:27   ` Christoph Hellwig
2016-09-12  7:25     ` Oliver O'Halloran
2016-09-12  7:51       ` Christoph Hellwig
2016-09-12  8:05         ` Nicholas Piggin
2016-09-12 15:01           ` Christoph Hellwig
2016-09-13  1:31             ` Nicholas Piggin
2016-09-13  4:06               ` Dan Williams
2016-09-13  5:40                 ` Nicholas Piggin
2016-09-12 21:34           ` Dave Chinner
2016-09-13  1:53             ` Nicholas Piggin
2016-09-13  7:17               ` Christoph Hellwig
2016-09-13  9:06                 ` Nicholas Piggin
2016-09-14  7:39               ` Dave Chinner
2016-09-14 10:19                 ` Nicholas Piggin
2016-09-15  2:31                   ` Dave Chinner
2016-09-15  3:49                     ` Nicholas Piggin
2016-09-15 10:32                       ` Dave Chinner
2016-09-15 11:42                         ` Nicholas Piggin
2016-09-15 22:33                           ` Dave Chinner [this message]
2016-09-16  5:54                             ` Nicholas Piggin
2016-12-19 21:11                               ` Ross Zwisler
2016-12-20  1:09                                 ` Darrick J. Wong
2016-12-20  1:18                                   ` Dan Williams
2016-12-21  0:40                                     ` Darrick J. Wong
2016-12-21 16:53                                       ` Dan Williams
2016-12-21 21:24                                         ` Dave Chinner
2016-12-21 21:33                                           ` Dan Williams

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20160915223350.GU22388@dastard \
    --to=david@fromorbit.com \
    --cc=akpm@linux-foundation.org \
    --cc=dave.hansen@intel.com \
    --cc=gleb@kernel.org \
    --cc=guangrong.xiao@linux.intel.com \
    --cc=hch@infradead.org \
    --cc=kvm@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linux-nvdimm@ml01.01.org \
    --cc=mhocko@suse.com \
    --cc=mtosatti@redhat.com \
    --cc=npiggin@gmail.com \
    --cc=oohall@gmail.com \
    --cc=pbonzini@redhat.com \
    --cc=stefanha@redhat.com \
    --cc=yuhuang@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).