linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Nicholas Piggin <npiggin@gmail.com>
To: Dave Chinner <david@fromorbit.com>
Cc: Christoph Hellwig <hch@infradead.org>,
	"Oliver O'Halloran" <oohall@gmail.com>,
	Yumei Huang <yuhuang@redhat.com>, Michal Hocko <mhocko@suse.com>,
	Xiao Guangrong <guangrong.xiao@linux.intel.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	KVM list <kvm@vger.kernel.org>, Linux MM <linux-mm@kvack.org>,
	Gleb Natapov <gleb@kernel.org>,
	"linux-nvdimm@lists.01.org" <linux-nvdimm@ml01.01.org>,
	mtosatti@redhat.com,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	Dave Hansen <dave.hansen@intel.com>,
	Stefan Hajnoczi <stefanha@redhat.com>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	Paolo Bonzini <pbonzini@redhat.com>
Subject: Re: DAX mapping detection (was: Re: [PATCH] Fix region lost in /proc/self/smaps)
Date: Wed, 14 Sep 2016 20:19:36 +1000	[thread overview]
Message-ID: <20160914201936.08315277@roar.ozlabs.ibm.com> (raw)
In-Reply-To: <20160914073902.GQ22388@dastard>

On Wed, 14 Sep 2016 17:39:02 +1000
Dave Chinner <david@fromorbit.com> wrote:

> On Tue, Sep 13, 2016 at 11:53:11AM +1000, Nicholas Piggin wrote:
> > On Tue, 13 Sep 2016 07:34:36 +1000
> > Dave Chinner <david@fromorbit.com> wrote:
> > But let me understand your example in the absence of that.
> > 
> > - Application mmaps a file, faults in block 0
> > - FS allocates block, creates mappings, syncs metadata, sets "no fsync"
> >   flag for that block, and completes the fault.
> > - Application writes some data to block 0, completes userspace flushes
> > 
> > * At this point, a crash must return with above data (or newer).
> > 
> > - Application starts writing more stuff into block 0
> > - Concurrently, fault in block 1
> > - FS starts to allocate, splits trees including mappings to block 0
> > 
> > * Crash
> > 
> > Is that right?  
> 
> No.
> 
> - app write faults block 0, fs allocates
> < time passes while app does stuff to block 0 mapping >
> - fs syncs journal, block 0 metadata now persistent
> < time passes while app does stuff to block 0 mapping >
> - app structure grows, faults block 1, fs allocates
> - app adds pointers to data in block 1 from block 0, does
>   userspace pmem data sync.
> 
> *crash*
> 
> > How does your filesystem lose data before the sync
> > point?  
> 
> After recovery, file has a data in block 0, but no block 1 because
> the allocation transaction for block 1 was not flushed to the
> journal. Data in block 0 points to data in block 1, but block 1 does
> not exist. IOWs, the application has corrupt data because it never
> issued a data synchronisation request to the filesystem....
> 
> ----
> 
> Ok, looking back over your example, you seem to be suggesting a new
> page fault behaviour is required from filesystems that has not been
> described or explained, and that behaviour is triggered
> (persistently) somehow from userspace. You've also suggested
> filesystems store a persistent per-block "no fsync" flag
> in their extent map as part of the implementation. Right?

This is what we're talking about. Of course a filesystem can't just
start supporting the feature without any changes.


> Reading between the lines, I'm guessing that the "no fsync" flag has
> very specific update semantics, constraints and requirements.  Can
> you outline how you expect this flag to be set and updated, how it's
> used consistently between different applications (e.g. cp of a file
> vs the app using the file), behavioural constraints it implies for
> page faults vs non-mmap access to the data in the block, how
> you'd expect filesystems to deal with things like a hole punch
> landing in the middle of an extent marked with "no fsync", etc?

Well that's what's being discussed. An approach close to what I did is
to allow the app request a "no sync" type of mmap. Filesystem will
invalidate all such mappings before it does buffered IOs or hole punch,
and will sync metadata after allocating a new block but before returning
from a fault.

The app could query rather than request, but I found request seemed to
work better. The filesystem might be working with apps that don't use
the feature for example, and doesn't want to flush just in case any one
ever queried in the past.


> [snip]
> 
> > If there is any huge complexity or unsolved problem, it is in XFS.
> > Conceptual problem is simple.  
> 
> Play nice and be constructive, please?

So you agree that the persistent memory people who have come with some
requirements and ideas for an API should not be immediately shut down
with bogus handwaving.

Thanks,
Nick

  reply	other threads:[~2016-09-14 10:19 UTC|newest]

Thread overview: 38+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-09-08  4:32 DAX mapping detection (was: Re: [PATCH] Fix region lost in /proc/self/smaps) Dan Williams
2016-09-08 22:56 ` Ross Zwisler
2016-09-08 23:04   ` Dan Williams
2016-09-09  8:55     ` Xiao Guangrong
2016-09-09 15:40       ` Dan Williams
2016-09-12  6:00         ` Xiao Guangrong
2016-09-12  3:44       ` Rudoff, Andy
2016-09-12  6:31         ` Xiao Guangrong
2016-09-12  1:40   ` Dave Chinner
2016-09-15  5:55     ` Darrick J. Wong
2016-09-15  6:25       ` Dave Chinner
2016-09-12  5:27   ` Christoph Hellwig
2016-09-12  7:25     ` Oliver O'Halloran
2016-09-12  7:51       ` Christoph Hellwig
2016-09-12  8:05         ` Nicholas Piggin
2016-09-12 15:01           ` Christoph Hellwig
2016-09-13  1:31             ` Nicholas Piggin
2016-09-13  4:06               ` Dan Williams
2016-09-13  5:40                 ` Nicholas Piggin
2016-09-12 21:34           ` Dave Chinner
2016-09-13  1:53             ` Nicholas Piggin
2016-09-13  7:17               ` Christoph Hellwig
2016-09-13  9:06                 ` Nicholas Piggin
2016-09-14  7:39               ` Dave Chinner
2016-09-14 10:19                 ` Nicholas Piggin [this message]
2016-09-15  2:31                   ` Dave Chinner
2016-09-15  3:49                     ` Nicholas Piggin
2016-09-15 10:32                       ` Dave Chinner
2016-09-15 11:42                         ` Nicholas Piggin
2016-09-15 22:33                           ` Dave Chinner
2016-09-16  5:54                             ` Nicholas Piggin
2016-12-19 21:11                               ` Ross Zwisler
2016-12-20  1:09                                 ` Darrick J. Wong
2016-12-20  1:18                                   ` Dan Williams
2016-12-21  0:40                                     ` Darrick J. Wong
2016-12-21 16:53                                       ` Dan Williams
2016-12-21 21:24                                         ` Dave Chinner
2016-12-21 21:33                                           ` Dan Williams

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20160914201936.08315277@roar.ozlabs.ibm.com \
    --to=npiggin@gmail.com \
    --cc=akpm@linux-foundation.org \
    --cc=dave.hansen@intel.com \
    --cc=david@fromorbit.com \
    --cc=gleb@kernel.org \
    --cc=guangrong.xiao@linux.intel.com \
    --cc=hch@infradead.org \
    --cc=kvm@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linux-nvdimm@ml01.01.org \
    --cc=mhocko@suse.com \
    --cc=mtosatti@redhat.com \
    --cc=oohall@gmail.com \
    --cc=pbonzini@redhat.com \
    --cc=stefanha@redhat.com \
    --cc=yuhuang@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).