linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Dan Williams <dan.j.williams@intel.com>
To: Ross Zwisler <ross.zwisler@linux.intel.com>,
	Dan Williams <dan.j.williams@intel.com>,
	Xiao Guangrong <guangrong.xiao@linux.intel.com>,
	Dave Hansen <dave.hansen@intel.com>,
	Paolo Bonzini <pbonzini@redhat.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Michal Hocko <mhocko@suse.com>, Gleb Natapov <gleb@kernel.org>,
	mtosatti@redhat.com, KVM list <kvm@vger.kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	Stefan Hajnoczi <stefanha@redhat.com>,
	Yumei Huang <yuhuang@redhat.com>, Linux MM <linux-mm@kvack.org>,
	"linux-nvdimm@lists.01.org" <linux-nvdimm@ml01.01.org>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>
Subject: Re: DAX mapping detection (was: Re: [PATCH] Fix region lost in /proc/self/smaps)
Date: Thu, 8 Sep 2016 16:04:00 -0700	[thread overview]
Message-ID: <CAPcyv4h5y4MHdXtdrdPRtG7L0_KCoxf_xwDGnHQ2r5yZoqkFzQ@mail.gmail.com> (raw)
In-Reply-To: <20160908225636.GB15167@linux.intel.com>

On Thu, Sep 8, 2016 at 3:56 PM, Ross Zwisler
<ross.zwisler@linux.intel.com> wrote:
> On Wed, Sep 07, 2016 at 09:32:36PM -0700, Dan Williams wrote:
>> [ adding linux-fsdevel and linux-nvdimm ]
>>
>> On Wed, Sep 7, 2016 at 8:36 PM, Xiao Guangrong
>> <guangrong.xiao@linux.intel.com> wrote:
>> [..]
>> > However, it is not easy to handle the case that the new VMA overlays with
>> > the old VMA
>> > already got by userspace. I think we have some choices:
>> > 1: One way is completely skipping the new VMA region as current kernel code
>> > does but i
>> >    do not think this is good as the later VMAs will be dropped.
>> >
>> > 2: show the un-overlayed portion of new VMA. In your case, we just show the
>> > region
>> >    (0x2000 -> 0x3000), however, it can not work well if the VMA is a new
>> > created
>> >    region with different attributions.
>> >
>> > 3: completely show the new VMA as this patch does.
>> >
>> > Which one do you prefer?
>> >
>>
>> I don't have a preference, but perhaps this breakage and uncertainty
>> is a good opportunity to propose a more reliable interface for NVML to
>> get the information it needs?
>>
>> My understanding is that it is looking for the VM_MIXEDMAP flag which
>> is already ambiguous for determining if DAX is enabled even if this
>> dynamic listing issue is fixed.  XFS has arranged for DAX to be a
>> per-inode capability and has an XFS-specific inode flag.  We can make
>> that a common inode flag, but it seems we should have a way to
>> interrogate the mapping itself in the case where the inode is unknown
>> or unavailable.  I'm thinking extensions to mincore to have flags for
>> DAX and possibly whether the page is part of a pte, pmd, or pud
>> mapping.  Just floating that idea before starting to look into the
>> implementation, comments or other ideas welcome...
>
> I think this goes back to our previous discussion about support for the PMEM
> programming model.  Really I think what NVML needs isn't a way to tell if it
> is getting a DAX mapping, but whether it is getting a DAX mapping on a
> filesystem that fully supports the PMEM programming model.  This of course is
> defined to be a filesystem where it can do all of its flushes from userspace
> safely and never call fsync/msync, and that allocations that happen in page
> faults will be synchronized to media before the page fault completes.
>
> IIUC this is what NVML needs - a way to decide "do I use fsync/msync for
> everything or can I rely fully on flushes from userspace?"
>
> For all existing implementations, I think the answer is "you need to use
> fsync/msync" because we don't yet have proper support for the PMEM programming
> model.
>
> My best idea of how to support this was a per-inode flag similar to the one
> supported by XFS that says "you have a PMEM capable DAX mapping", which NVML
> would then interpret to mean "you can do flushes from userspace and be fully
> safe".  I think we really want this interface to be common over XFS and ext4.
>
> If we can figure out a better way of doing this interface, say via mincore,
> that's fine, but I don't think we can detangle this from the PMEM API
> discussion.

Whether a persistent memory mapping requires an msync/fsync is a
filesystem specific question.  This mincore proposal is separate from
that.  Consider device-DAX for volatile memory or mincore() called on
an anonymous memory range.  In those cases persistence and filesystem
metadata are not in the picture, but it would still be useful for
userspace to know "is there page cache backing this mapping?" or "what
is the TLB geometry of this mapping?".

  reply	other threads:[~2016-09-08 23:04 UTC|newest]

Thread overview: 38+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-09-08  4:32 DAX mapping detection (was: Re: [PATCH] Fix region lost in /proc/self/smaps) Dan Williams
2016-09-08 22:56 ` Ross Zwisler
2016-09-08 23:04   ` Dan Williams [this message]
2016-09-09  8:55     ` Xiao Guangrong
2016-09-09 15:40       ` Dan Williams
2016-09-12  6:00         ` Xiao Guangrong
2016-09-12  3:44       ` Rudoff, Andy
2016-09-12  6:31         ` Xiao Guangrong
2016-09-12  1:40   ` Dave Chinner
2016-09-15  5:55     ` Darrick J. Wong
2016-09-15  6:25       ` Dave Chinner
2016-09-12  5:27   ` Christoph Hellwig
2016-09-12  7:25     ` Oliver O'Halloran
2016-09-12  7:51       ` Christoph Hellwig
2016-09-12  8:05         ` Nicholas Piggin
2016-09-12 15:01           ` Christoph Hellwig
2016-09-13  1:31             ` Nicholas Piggin
2016-09-13  4:06               ` Dan Williams
2016-09-13  5:40                 ` Nicholas Piggin
2016-09-12 21:34           ` Dave Chinner
2016-09-13  1:53             ` Nicholas Piggin
2016-09-13  7:17               ` Christoph Hellwig
2016-09-13  9:06                 ` Nicholas Piggin
2016-09-14  7:39               ` Dave Chinner
2016-09-14 10:19                 ` Nicholas Piggin
2016-09-15  2:31                   ` Dave Chinner
2016-09-15  3:49                     ` Nicholas Piggin
2016-09-15 10:32                       ` Dave Chinner
2016-09-15 11:42                         ` Nicholas Piggin
2016-09-15 22:33                           ` Dave Chinner
2016-09-16  5:54                             ` Nicholas Piggin
2016-12-19 21:11                               ` Ross Zwisler
2016-12-20  1:09                                 ` Darrick J. Wong
2016-12-20  1:18                                   ` Dan Williams
2016-12-21  0:40                                     ` Darrick J. Wong
2016-12-21 16:53                                       ` Dan Williams
2016-12-21 21:24                                         ` Dave Chinner
2016-12-21 21:33                                           ` Dan Williams

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAPcyv4h5y4MHdXtdrdPRtG7L0_KCoxf_xwDGnHQ2r5yZoqkFzQ@mail.gmail.com \
    --to=dan.j.williams@intel.com \
    --cc=akpm@linux-foundation.org \
    --cc=dave.hansen@intel.com \
    --cc=gleb@kernel.org \
    --cc=guangrong.xiao@linux.intel.com \
    --cc=kvm@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linux-nvdimm@ml01.01.org \
    --cc=mhocko@suse.com \
    --cc=mtosatti@redhat.com \
    --cc=pbonzini@redhat.com \
    --cc=ross.zwisler@linux.intel.com \
    --cc=stefanha@redhat.com \
    --cc=yuhuang@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).