linux-kernel.vger.kernel.org archive mirror
* DAX mapping detection (was: Re: [PATCH] Fix region lost in /proc/self/smaps)
@ 2016-09-08  4:32 Dan Williams
  2016-09-08 22:56 ` Ross Zwisler
  0 siblings, 1 reply; 38+ messages in thread
From: Dan Williams @ 2016-09-08  4:32 UTC (permalink / raw)
  To: Xiao Guangrong
  Cc: Dave Hansen, Paolo Bonzini, Andrew Morton, Michal Hocko,
	Gleb Natapov, mtosatti, KVM list, linux-kernel, Stefan Hajnoczi,
	Yumei Huang, Linux MM, Ross Zwisler, linux-nvdimm@lists.01.org,
	linux-fsdevel

[ adding linux-fsdevel and linux-nvdimm ]

On Wed, Sep 7, 2016 at 8:36 PM, Xiao Guangrong
<guangrong.xiao@linux.intel.com> wrote:
[..]
> However, it is not easy to handle the case where the new VMA overlays an
> old VMA that userspace has already fetched. I think we have some choices:
>
> 1: Completely skip the new VMA region, as the current kernel code does,
>    but I do not think this is good as the later VMAs will be dropped.
>
> 2: Show only the un-overlaid portion of the new VMA. In your case, we
>    would just show the region (0x2000 -> 0x3000); however, this does not
>    work well if the VMA is a newly created region with different
>    attributes.
>
> 3: Completely show the new VMA, as this patch does.
>
> Which one do you prefer?
>

I don't have a preference, but perhaps this breakage and uncertainty
is a good opportunity to propose a more reliable interface for NVML to
get the information it needs?

My understanding is that it is looking for the VM_MIXEDMAP flag which
is already ambiguous for determining if DAX is enabled even if this
dynamic listing issue is fixed.  XFS has arranged for DAX to be a
per-inode capability and has an XFS-specific inode flag.  We can make
that a common inode flag, but it seems we should have a way to
interrogate the mapping itself in the case where the inode is unknown
or unavailable.  I'm thinking of extending mincore to have flags for
DAX and possibly for whether the page is part of a pte, pmd, or pud
mapping.  Just floating that idea before starting to look into the
implementation; comments or other ideas welcome...
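
For context, here is a minimal sketch of the status-quo heuristic in
question: scan /proc/self/smaps for the VMA covering an address and look
for the "mm" (VM_MIXEDMAP) code in its VmFlags line.  This is roughly
what NVML's DAX detection relies on today (the helper below is purely
illustrative, not NVML's actual code), and it is exactly the ambiguity -
plus the smaps VMA-listing breakage - that a better interface would
avoid:

#include <stdio.h>
#include <string.h>
#include <stdint.h>

/* Return 1 if the VMA containing addr has VM_MIXEDMAP, 0 if not, -1 on error. */
static int addr_has_mixedmap(const void *addr)
{
	FILE *f = fopen("/proc/self/smaps", "r");
	char line[4096];
	unsigned long start, end, target = (uintptr_t)addr;
	int in_vma = 0;

	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f)) {
		if (sscanf(line, "%lx-%lx ", &start, &end) == 2)
			in_vma = (target >= start && target < end);
		else if (in_vma && !strncmp(line, "VmFlags:", 8)) {
			fclose(f);
			return strstr(line, " mm") != NULL;
		}
	}
	fclose(f);
	return 0;
}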


* Re: DAX mapping detection (was: Re: [PATCH] Fix region lost in /proc/self/smaps)
  2016-09-08  4:32 DAX mapping detection (was: Re: [PATCH] Fix region lost in /proc/self/smaps) Dan Williams
@ 2016-09-08 22:56 ` Ross Zwisler
  2016-09-08 23:04   ` Dan Williams
                     ` (2 more replies)
  0 siblings, 3 replies; 38+ messages in thread
From: Ross Zwisler @ 2016-09-08 22:56 UTC (permalink / raw)
  To: Dan Williams
  Cc: Xiao Guangrong, Dave Hansen, Paolo Bonzini, Andrew Morton,
	Michal Hocko, Gleb Natapov, mtosatti, KVM list, linux-kernel,
	Stefan Hajnoczi, Yumei Huang, Linux MM, Ross Zwisler,
	linux-nvdimm@lists.01.org, linux-fsdevel

On Wed, Sep 07, 2016 at 09:32:36PM -0700, Dan Williams wrote:
> [ adding linux-fsdevel and linux-nvdimm ]
> 
> On Wed, Sep 7, 2016 at 8:36 PM, Xiao Guangrong
> <guangrong.xiao@linux.intel.com> wrote:
> [..]
> > However, it is not easy to handle the case that the new VMA overlays with
> > the old VMA
> > already got by userspace. I think we have some choices:
> > 1: One way is completely skipping the new VMA region as current kernel code
> > does but i
> >    do not think this is good as the later VMAs will be dropped.
> >
> > 2: show the un-overlayed portion of new VMA. In your case, we just show the
> > region
> >    (0x2000 -> 0x3000), however, it can not work well if the VMA is a new
> > created
> >    region with different attributions.
> >
> > 3: completely show the new VMA as this patch does.
> >
> > Which one do you prefer?
> >
> 
> I don't have a preference, but perhaps this breakage and uncertainty
> is a good opportunity to propose a more reliable interface for NVML to
> get the information it needs?
> 
> My understanding is that it is looking for the VM_MIXEDMAP flag which
> is already ambiguous for determining if DAX is enabled even if this
> dynamic listing issue is fixed.  XFS has arranged for DAX to be a
> per-inode capability and has an XFS-specific inode flag.  We can make
> that a common inode flag, but it seems we should have a way to
> interrogate the mapping itself in the case where the inode is unknown
> or unavailable.  I'm thinking extensions to mincore to have flags for
> DAX and possibly whether the page is part of a pte, pmd, or pud
> mapping.  Just floating that idea before starting to look into the
> implementation, comments or other ideas welcome...

I think this goes back to our previous discussion about support for the PMEM
programming model.  Really I think what NVML needs isn't a way to tell if it
is getting a DAX mapping, but whether it is getting a DAX mapping on a
filesystem that fully supports the PMEM programming model.  This of course is
defined to be a filesystem where it can do all of its flushes from userspace
safely and never call fsync/msync, and where allocations that happen in page
faults are synchronized to media before the page fault completes.
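
To make "all of its flushes from userspace" concrete, here is a minimal
x86 sketch (illustrative only): store into the DAX mapping, flush the
affected cache lines, then fence.  CLWB is assumed to be available
(compile with -mclwb); real libraries such as NVML/libpmem select
CLWB/CLFLUSHOPT/CLFLUSH at runtime.  Note that this only makes the data
durable on ADR platforms, and it does nothing for filesystem metadata -
which is exactly the gap being discussed.

#include <stdint.h>
#include <stddef.h>
#include <immintrin.h>

#define CACHELINE 64

/* Flush the cache lines covering [addr, addr + len) and order the flushes. */
static void pmem_persist(void *addr, size_t len)
{
	uintptr_t p = (uintptr_t)addr & ~(uintptr_t)(CACHELINE - 1);
	uintptr_t end = (uintptr_t)addr + len;

	for (; p < end; p += CACHELINE)
		_mm_clwb((void *)p);
	_mm_sfence();	/* ensure the flushes complete before we report "persistent" */
}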

IIUC this is what NVML needs - a way to decide "do I use fsync/msync for
everything or can I rely fully on flushes from userspace?" 

For all existing implementations, I think the answer is "you need to use
fsync/msync" because we don't yet have proper support for the PMEM programming
model.

My best idea of how to support this was a per-inode flag similar to the one
supported by XFS that says "you have a PMEM capable DAX mapping", which NVML
would then interpret to mean "you can do flushes from userspace and be fully
safe".  I think we really want this interface to be common over XFS and ext4.

If we can figure out a better way of doing this interface, say via mincore,
that's fine, but I don't think we can detangle this from the PMEM API
discussion.
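
Put differently, the decision an NVML-like library has to make looks
roughly like the sketch below.  mapping_is_sync_free() is a stand-in for
whatever interface eventually reports "PMEM capable DAX mapping"; since
no such interface exists today, the stub always answers "no" and the
code falls back to msync, matching the statement above about all
existing implementations.

#include <stddef.h>
#include <sys/mman.h>

static int mapping_is_sync_free(void *addr, size_t len)
{
	(void)addr; (void)len;
	return 0;	/* no kernel interface yet: always use msync */
}

/* userspace cache-line flush, e.g. the CLWB + SFENCE sketch shown earlier */
void pmem_persist(void *addr, size_t len);

/* addr/len must be page-aligned for msync */
static int persist_range(void *addr, size_t len)
{
	if (mapping_is_sync_free(addr, len)) {
		pmem_persist(addr, len);	/* CPU cache flush alone is enough */
		return 0;
	}
	/* let the filesystem sync the data and any metadata from page faults */
	return msync(addr, len, MS_SYNC);
}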


* Re: DAX mapping detection (was: Re: [PATCH] Fix region lost in /proc/self/smaps)
  2016-09-08 22:56 ` Ross Zwisler
@ 2016-09-08 23:04   ` Dan Williams
  2016-09-09  8:55     ` Xiao Guangrong
  2016-09-12  1:40   ` Dave Chinner
  2016-09-12  5:27   ` Christoph Hellwig
  2 siblings, 1 reply; 38+ messages in thread
From: Dan Williams @ 2016-09-08 23:04 UTC (permalink / raw)
  To: Ross Zwisler, Dan Williams, Xiao Guangrong, Dave Hansen,
	Paolo Bonzini, Andrew Morton, Michal Hocko, Gleb Natapov,
	mtosatti, KVM list, linux-kernel, Stefan Hajnoczi, Yumei Huang,
	Linux MM, linux-nvdimm@lists.01.org, linux-fsdevel

On Thu, Sep 8, 2016 at 3:56 PM, Ross Zwisler
<ross.zwisler@linux.intel.com> wrote:
> On Wed, Sep 07, 2016 at 09:32:36PM -0700, Dan Williams wrote:
>> [ adding linux-fsdevel and linux-nvdimm ]
>>
>> On Wed, Sep 7, 2016 at 8:36 PM, Xiao Guangrong
>> <guangrong.xiao@linux.intel.com> wrote:
>> [..]
>> > However, it is not easy to handle the case that the new VMA overlays with
>> > the old VMA
>> > already got by userspace. I think we have some choices:
>> > 1: One way is completely skipping the new VMA region as current kernel code
>> > does but i
>> >    do not think this is good as the later VMAs will be dropped.
>> >
>> > 2: show the un-overlayed portion of new VMA. In your case, we just show the
>> > region
>> >    (0x2000 -> 0x3000), however, it can not work well if the VMA is a new
>> > created
>> >    region with different attributions.
>> >
>> > 3: completely show the new VMA as this patch does.
>> >
>> > Which one do you prefer?
>> >
>>
>> I don't have a preference, but perhaps this breakage and uncertainty
>> is a good opportunity to propose a more reliable interface for NVML to
>> get the information it needs?
>>
>> My understanding is that it is looking for the VM_MIXEDMAP flag which
>> is already ambiguous for determining if DAX is enabled even if this
>> dynamic listing issue is fixed.  XFS has arranged for DAX to be a
>> per-inode capability and has an XFS-specific inode flag.  We can make
>> that a common inode flag, but it seems we should have a way to
>> interrogate the mapping itself in the case where the inode is unknown
>> or unavailable.  I'm thinking extensions to mincore to have flags for
>> DAX and possibly whether the page is part of a pte, pmd, or pud
>> mapping.  Just floating that idea before starting to look into the
>> implementation, comments or other ideas welcome...
>
> I think this goes back to our previous discussion about support for the PMEM
> programming model.  Really I think what NVML needs isn't a way to tell if it
> is getting a DAX mapping, but whether it is getting a DAX mapping on a
> filesystem that fully supports the PMEM programming model.  This of course is
> defined to be a filesystem where it can do all of its flushes from userspace
> safely and never call fsync/msync, and that allocations that happen in page
> faults will be synchronized to media before the page fault completes.
>
> IIUC this is what NVML needs - a way to decide "do I use fsync/msync for
> everything or can I rely fully on flushes from userspace?"
>
> For all existing implementations, I think the answer is "you need to use
> fsync/msync" because we don't yet have proper support for the PMEM programming
> model.
>
> My best idea of how to support this was a per-inode flag similar to the one
> supported by XFS that says "you have a PMEM capable DAX mapping", which NVML
> would then interpret to mean "you can do flushes from userspace and be fully
> safe".  I think we really want this interface to be common over XFS and ext4.
>
> If we can figure out a better way of doing this interface, say via mincore,
> that's fine, but I don't think we can detangle this from the PMEM API
> discussion.

Whether a persistent memory mapping requires an msync/fsync is a
filesystem specific question.  This mincore proposal is separate from
that.  Consider device-DAX for volatile memory or mincore() called on
an anonymous memory range.  In those cases persistence and filesystem
metadata are not in the picture, but it would still be useful for
userspace to know "is there page cache backing this mapping?" or "what
is the TLB geometry of this mapping?".
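
As a purely illustrative sketch of how such a mincore() extension might
be consumed: the MINCORE_DAX/PMD/PUD bits below are hypothetical (the
existing syscall only defines bit 0, "page resident"), so this shows the
shape of the interface being floated, not anything the kernel implements
today.

#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>

#define MINCORE_PRESENT	0x01	/* existing meaning of bit 0 */
#define MINCORE_DAX	0x02	/* hypothetical: no page cache backs this page */
#define MINCORE_PMD	0x04	/* hypothetical: part of a pmd (2MB) mapping */
#define MINCORE_PUD	0x08	/* hypothetical: part of a pud (1GB) mapping */

/* Return 1 if every page in [addr, addr + len) is DAX, 0 if not, -1 on error. */
static int range_is_dax(void *addr, size_t len)
{
	long psz = sysconf(_SC_PAGESIZE);
	size_t i, pages = (len + psz - 1) / psz;
	unsigned char *vec = malloc(pages);
	int ret = 1;

	if (!vec)
		return -1;
	if (mincore(addr, len, vec) < 0)
		ret = -1;
	else
		for (i = 0; i < pages; i++)
			if (!(vec[i] & MINCORE_DAX))
				ret = 0;
	free(vec);
	return ret;
}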


* Re: DAX mapping detection (was: Re: [PATCH] Fix region lost in /proc/self/smaps)
  2016-09-08 23:04   ` Dan Williams
@ 2016-09-09  8:55     ` Xiao Guangrong
  2016-09-09 15:40       ` Dan Williams
  2016-09-12  3:44       ` Rudoff, Andy
  0 siblings, 2 replies; 38+ messages in thread
From: Xiao Guangrong @ 2016-09-09  8:55 UTC (permalink / raw)
  To: Dan Williams, Ross Zwisler, Dave Hansen, Paolo Bonzini,
	Andrew Morton, Michal Hocko, Gleb Natapov, mtosatti, KVM list,
	linux-kernel, Stefan Hajnoczi, Yumei Huang, Linux MM,
	linux-nvdimm@lists.01.org, linux-fsdevel



On 09/09/2016 07:04 AM, Dan Williams wrote:
> On Thu, Sep 8, 2016 at 3:56 PM, Ross Zwisler
> <ross.zwisler@linux.intel.com> wrote:
>> On Wed, Sep 07, 2016 at 09:32:36PM -0700, Dan Williams wrote:
>>> [ adding linux-fsdevel and linux-nvdimm ]
>>>
>>> On Wed, Sep 7, 2016 at 8:36 PM, Xiao Guangrong
>>> <guangrong.xiao@linux.intel.com> wrote:
>>> [..]
>>>> However, it is not easy to handle the case that the new VMA overlays with
>>>> the old VMA
>>>> already got by userspace. I think we have some choices:
>>>> 1: One way is completely skipping the new VMA region as current kernel code
>>>> does but i
>>>>    do not think this is good as the later VMAs will be dropped.
>>>>
>>>> 2: show the un-overlayed portion of new VMA. In your case, we just show the
>>>> region
>>>>    (0x2000 -> 0x3000), however, it can not work well if the VMA is a new
>>>> created
>>>>    region with different attributions.
>>>>
>>>> 3: completely show the new VMA as this patch does.
>>>>
>>>> Which one do you prefer?
>>>>
>>>
>>> I don't have a preference, but perhaps this breakage and uncertainty
>>> is a good opportunity to propose a more reliable interface for NVML to
>>> get the information it needs?
>>>
>>> My understanding is that it is looking for the VM_MIXEDMAP flag which
>>> is already ambiguous for determining if DAX is enabled even if this
>>> dynamic listing issue is fixed.  XFS has arranged for DAX to be a
>>> per-inode capability and has an XFS-specific inode flag.  We can make
>>> that a common inode flag, but it seems we should have a way to
>>> interrogate the mapping itself in the case where the inode is unknown
>>> or unavailable.  I'm thinking extensions to mincore to have flags for
>>> DAX and possibly whether the page is part of a pte, pmd, or pud
>>> mapping.  Just floating that idea before starting to look into the
>>> implementation, comments or other ideas welcome...
>>
>> I think this goes back to our previous discussion about support for the PMEM
>> programming model.  Really I think what NVML needs isn't a way to tell if it
>> is getting a DAX mapping, but whether it is getting a DAX mapping on a
>> filesystem that fully supports the PMEM programming model.  This of course is
>> defined to be a filesystem where it can do all of its flushes from userspace
>> safely and never call fsync/msync, and that allocations that happen in page
>> faults will be synchronized to media before the page fault completes.
>>
>> IIUC this is what NVML needs - a way to decide "do I use fsync/msync for
>> everything or can I rely fully on flushes from userspace?"
>>
>> For all existing implementations, I think the answer is "you need to use
>> fsync/msync" because we don't yet have proper support for the PMEM programming
>> model.
>>
>> My best idea of how to support this was a per-inode flag similar to the one
>> supported by XFS that says "you have a PMEM capable DAX mapping", which NVML
>> would then interpret to mean "you can do flushes from userspace and be fully
>> safe".  I think we really want this interface to be common over XFS and ext4.
>>
>> If we can figure out a better way of doing this interface, say via mincore,
>> that's fine, but I don't think we can detangle this from the PMEM API
>> discussion.
>
> Whether a persistent memory mapping requires an msync/fsync is a
> filesystem specific question.  This mincore proposal is separate from
> that.  Consider device-DAX for volatile memory or mincore() called on
> an anonymous memory range.  In those cases persistence and filesystem
> metadata are not in the picture, but it would still be useful for
> userspace to know "is there page cache backing this mapping?" or "what
> is the TLB geometry of this mapping?".

I have a question about msync/fsync which is beyond the topic of this thread :)

Whether msync/fsync can make data persistent depends on the ADR feature of the
memory controller: if it exists, everything works well; otherwise, we need
another interface, which is why the 'Flush Hint Table' in ACPI comes in. The
'Flush Hint Table' is particularly useful for nvdimm virtualization, where we
use normal memory to emulate an nvdimm with persistent-data semantics (the
data is flushed to persistent storage, e.g. a disk).

Does the current PMEM programming model fully support the 'Flush Hint Table'?
Is userspace allowed to use these addresses?

Thanks!


* Re: DAX mapping detection (was: Re: [PATCH] Fix region lost in /proc/self/smaps)
  2016-09-09  8:55     ` Xiao Guangrong
@ 2016-09-09 15:40       ` Dan Williams
  2016-09-12  6:00         ` Xiao Guangrong
  2016-09-12  3:44       ` Rudoff, Andy
  1 sibling, 1 reply; 38+ messages in thread
From: Dan Williams @ 2016-09-09 15:40 UTC (permalink / raw)
  To: Xiao Guangrong
  Cc: Ross Zwisler, Dave Hansen, Paolo Bonzini, Andrew Morton,
	Michal Hocko, Gleb Natapov, mtosatti, KVM list, linux-kernel,
	Stefan Hajnoczi, Yumei Huang, Linux MM,
	linux-nvdimm@lists.01.org, linux-fsdevel

On Fri, Sep 9, 2016 at 1:55 AM, Xiao Guangrong
<guangrong.xiao@linux.intel.com> wrote:
[..]
>>
>> Whether a persistent memory mapping requires an msync/fsync is a
>> filesystem specific question.  This mincore proposal is separate from
>> that.  Consider device-DAX for volatile memory or mincore() called on
>> an anonymous memory range.  In those cases persistence and filesystem
>> metadata are not in the picture, but it would still be useful for
>> userspace to know "is there page cache backing this mapping?" or "what
>> is the TLB geometry of this mapping?".
>
>
> I got a question about msync/fsync which is beyond the topic of this thread
> :)
>
> Whether msync/fsync can make data persistent depends on ADR feature on
> memory
> controller, if it exists everything works well, otherwise, we need to have
> another
> interface that is why 'Flush hint table' in ACPI comes in. 'Flush hint
> table' is
> particularly useful for nvdimm virtualization if we use normal memory to
> emulate
> nvdimm with data persistent characteristic (the data will be flushed to a
> persistent storage, e.g, disk).
>
> Does current PMEM programming model fully supports 'Flush hint table'? Is
> userspace allowed to use these addresses?

If you publish flush hint addresses in the virtual NFIT the guest VM
will write to them whenever a REQ_FLUSH or REQ_FUA request is sent to
the virtual /dev/pmemX device.  Yes, seems straightforward to take a
VM exit on those events and flush simulated pmem to persistent
storage.
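
A kernel-flavored sketch of what that amounts to in the guest's pmem
driver (not the actual drivers/nvdimm code; the structure below is
hypothetical): fence all prior stores to the region, then perform an
MMIO write to one of the flush hint addresses published in the NFIT.
In the virtualization case, that write is what the hypervisor traps so
it can flush the simulated pmem to its backing storage.

#include <linux/io.h>

struct vpmem_region {			/* hypothetical per-region state */
	void __iomem *flush_hint;	/* one of the NFIT flush hint addresses */
};

static void vpmem_flush(struct vpmem_region *region)
{
	wmb();				/* push out all prior stores to the region */
	writeq(1, region->flush_hint);	/* value is irrelevant; the write triggers the flush */
}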


* Re: DAX mapping detection (was: Re: [PATCH] Fix region lost in /proc/self/smaps)
  2016-09-08 22:56 ` Ross Zwisler
  2016-09-08 23:04   ` Dan Williams
@ 2016-09-12  1:40   ` Dave Chinner
  2016-09-15  5:55     ` Darrick J. Wong
  2016-09-12  5:27   ` Christoph Hellwig
  2 siblings, 1 reply; 38+ messages in thread
From: Dave Chinner @ 2016-09-12  1:40 UTC (permalink / raw)
  To: Ross Zwisler, Dan Williams, Xiao Guangrong, Dave Hansen,
	Paolo Bonzini, Andrew Morton, Michal Hocko, Gleb Natapov,
	mtosatti, KVM list, linux-kernel, Stefan Hajnoczi, Yumei Huang,
	Linux MM, linux-nvdimm@lists.01.org, linux-fsdevel

On Thu, Sep 08, 2016 at 04:56:36PM -0600, Ross Zwisler wrote:
> On Wed, Sep 07, 2016 at 09:32:36PM -0700, Dan Williams wrote:
> > My understanding is that it is looking for the VM_MIXEDMAP flag which
> > is already ambiguous for determining if DAX is enabled even if this
> > dynamic listing issue is fixed.  XFS has arranged for DAX to be a
> > per-inode capability and has an XFS-specific inode flag.  We can make
> > that a common inode flag, but it seems we should have a way to
> > interrogate the mapping itself in the case where the inode is unknown
> > or unavailable.  I'm thinking extensions to mincore to have flags for
> > DAX and possibly whether the page is part of a pte, pmd, or pud
> > mapping.  Just floating that idea before starting to look into the
> > implementation, comments or other ideas welcome...
> 
> I think this goes back to our previous discussion about support for the PMEM
> programming model.  Really I think what NVML needs isn't a way to tell if it
> is getting a DAX mapping, but whether it is getting a DAX mapping on a
> filesystem that fully supports the PMEM programming model.  This of course is
> defined to be a filesystem where it can do all of its flushes from userspace
> safely and never call fsync/msync, and that allocations that happen in page
> faults will be synchronized to media before the page fault completes.
> 
> IIUC this is what NVML needs - a way to decide "do I use fsync/msync for
> everything or can I rely fully on flushes from userspace?" 

"need fsync/msync" is a dynamic state of an inode, not a static
property. i.e. users can do things that change an inode behind the
back of a mapping, even if they are not aware that this might
happen. As such, a filesystem can invalidate an existing mapping
at any time and userspace won't notice because it will simply fault
in a new mapping on the next access...

> For all existing implementations, I think the answer is "you need to use
> fsync/msync" because we don't yet have proper support for the PMEM programming
> model.

Yes, that is correct.

FWIW, I don't think it will ever be possible to support this ....
wonderful "PMEM programming model" from any current or future kernel
filesystem without a very specific set of restrictions on what can
be done to a file.  e.g.

	1. the file has to be fully allocated and zeroed before
	   use. Preallocation/zeroing via unwritten extents is not
	   allowed. Sparse files are not allowed. Shared extents are
	   not allowed.
	2. set the "PMEM_IMMUTABLE" inode flag - filesystem must
	   check the file is fully allocated before allowing it to
	   be set, and caller must have CAP_LINUX_IMMUTABLE.
	3. Inode metadata is now immutable, and file data can only
	   be accessed and/or modified via mmap().
	4. All non-mmap methods of inode data modification
	   will now fail with EPERM.
	5. all methods of inode metadata modification will now fail
	   with EPERM, timestamp updates will be ignored.
	6. PMEM_IMMUTABLE flag can only be removed if the file is
	   not currently mapped and caller has CAP_LINUX_IMMUTABLE.

A flag like this /should/ make it possible to avoid fsync/msync() on
a file for existing filesystems, but it also means that such files
have significant management issues (hence the need for
CAP_LINUX_IMMUTABLE to cover its use).
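
Purely as a sketch of the proposal: if such a flag were wired up through
the generic FS_IOC_GETFLAGS/FS_IOC_SETFLAGS interface, setting it might
look like the code below.  FS_PMEM_IMMUTABLE_FL is hypothetical - no
such flag exists - and per the rules above the filesystem would have to
verify the file is fully allocated and the caller would need
CAP_LINUX_IMMUTABLE.

#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

#define FS_PMEM_IMMUTABLE_FL	0x01000000	/* hypothetical flag bit */

static int mark_pmem_immutable(const char *path)
{
	int fd = open(path, O_RDWR);
	int flags, ret = -1;

	if (fd < 0)
		return -1;
	if (ioctl(fd, FS_IOC_GETFLAGS, &flags) == 0) {
		flags |= FS_PMEM_IMMUTABLE_FL;
		/* the fs would check full allocation and capabilities here */
		ret = ioctl(fd, FS_IOC_SETFLAGS, &flags);
	}
	close(fd);
	return ret;
}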

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: DAX mapping detection (was: Re: [PATCH] Fix region lost in /proc/self/smaps)
  2016-09-09  8:55     ` Xiao Guangrong
  2016-09-09 15:40       ` Dan Williams
@ 2016-09-12  3:44       ` Rudoff, Andy
  2016-09-12  6:31         ` Xiao Guangrong
  1 sibling, 1 reply; 38+ messages in thread
From: Rudoff, Andy @ 2016-09-12  3:44 UTC (permalink / raw)
  To: Xiao Guangrong, Williams, Dan J, Ross Zwisler, Hansen, Dave,
	Paolo Bonzini, Andrew Morton, Michal Hocko, Gleb Natapov,
	mtosatti, KVM list, linux-kernel, Stefan Hajnoczi, Yumei Huang,
	Linux MM, linux-nvdimm@lists.01.org, linux-fsdevel

>Whether msync/fsync can make data persistent depends on ADR feature on
>memory controller, if it exists everything works well, otherwise, we need
>to have another interface that is why 'Flush hint table' in ACPI comes
>in. 'Flush hint table' is particularly useful for nvdimm virtualization if
>we use normal memory to emulate nvdimm with data persistent characteristic
>(the data will be flushed to a persistent storage, e.g, disk).
>
>Does current PMEM programming model fully supports 'Flush hint table'? Is
>userspace allowed to use these addresses?

The Flush hint table is NOT a replacement for ADR.  To support pmem on
the x86 architecture, the platform is required to ensure that a pmem
store flushed from the CPU caches is in the persistent domain so that the
application need not take any additional steps to make it persistent.
The most common way to do this is the ADR feature.

If the above is not true, then your x86 platform does not support pmem.

Flush hints are for use by the BIOS and drivers and are not intended to
be used in user space.  Flush hints provide two things:

First, if a driver needs to write to command registers or movable windows
on a DIMM, the Flush hint (if provided in the NFIT) is required to flush
the command to the DIMM or ensure stores done through the movable window
are complete before moving it somewhere else.

Second, for the rare case where the kernel wants to flush stores to the
smallest possible failure domain (i.e. to the DIMM even though ADR will
handle flushing it from a larger domain), the flush hints provide a way
to do this.  This might be useful for things like file system journals to
help ensure the file system is consistent even in the face of ADR failure.

-andy


* Re: DAX mapping detection (was: Re: [PATCH] Fix region lost in /proc/self/smaps)
  2016-09-08 22:56 ` Ross Zwisler
  2016-09-08 23:04   ` Dan Williams
  2016-09-12  1:40   ` Dave Chinner
@ 2016-09-12  5:27   ` Christoph Hellwig
  2016-09-12  7:25     ` Oliver O'Halloran
  2 siblings, 1 reply; 38+ messages in thread
From: Christoph Hellwig @ 2016-09-12  5:27 UTC (permalink / raw)
  To: Ross Zwisler, Dan Williams, Xiao Guangrong, Dave Hansen,
	Paolo Bonzini, Andrew Morton, Michal Hocko, Gleb Natapov,
	mtosatti, KVM list, linux-kernel, Stefan Hajnoczi, Yumei Huang,
	Linux MM, linux-nvdimm@lists.01.org, linux-fsdevel

On Thu, Sep 08, 2016 at 04:56:36PM -0600, Ross Zwisler wrote:
> I think this goes back to our previous discussion about support for the PMEM
> programming model.  Really I think what NVML needs isn't a way to tell if it
> is getting a DAX mapping, but whether it is getting a DAX mapping on a
> filesystem that fully supports the PMEM programming model.  This of course is
> defined to be a filesystem where it can do all of its flushes from userspace
> safely and never call fsync/msync, and that allocations that happen in page
> faults will be synchronized to media before the page fault completes.

That's an easy way to flag:  you will never get that from a Linux
filesystem, period.

NVML folks really need to stop taking crack and dreaming this could
happen.


* Re: DAX mapping detection (was: Re: [PATCH] Fix region lost in /proc/self/smaps)
  2016-09-09 15:40       ` Dan Williams
@ 2016-09-12  6:00         ` Xiao Guangrong
  0 siblings, 0 replies; 38+ messages in thread
From: Xiao Guangrong @ 2016-09-12  6:00 UTC (permalink / raw)
  To: Dan Williams
  Cc: Ross Zwisler, Dave Hansen, Paolo Bonzini, Andrew Morton,
	Michal Hocko, Gleb Natapov, mtosatti, KVM list, linux-kernel,
	Stefan Hajnoczi, Yumei Huang, Linux MM,
	linux-nvdimm@lists.01.org, linux-fsdevel



On 09/09/2016 11:40 PM, Dan Williams wrote:
> On Fri, Sep 9, 2016 at 1:55 AM, Xiao Guangrong
> <guangrong.xiao@linux.intel.com> wrote:
> [..]
>>>
>>> Whether a persistent memory mapping requires an msync/fsync is a
>>> filesystem specific question.  This mincore proposal is separate from
>>> that.  Consider device-DAX for volatile memory or mincore() called on
>>> an anonymous memory range.  In those cases persistence and filesystem
>>> metadata are not in the picture, but it would still be useful for
>>> userspace to know "is there page cache backing this mapping?" or "what
>>> is the TLB geometry of this mapping?".
>>
>>
>> I got a question about msync/fsync which is beyond the topic of this thread
>> :)
>>
>> Whether msync/fsync can make data persistent depends on ADR feature on
>> memory
>> controller, if it exists everything works well, otherwise, we need to have
>> another
>> interface that is why 'Flush hint table' in ACPI comes in. 'Flush hint
>> table' is
>> particularly useful for nvdimm virtualization if we use normal memory to
>> emulate
>> nvdimm with data persistent characteristic (the data will be flushed to a
>> persistent storage, e.g, disk).
>>
>> Does current PMEM programming model fully supports 'Flush hint table'? Is
>> userspace allowed to use these addresses?
>
> If you publish flush hint addresses in the virtual NFIT the guest VM
> will write to them whenever a REQ_FLUSH or REQ_FUA request is sent to
> the virtual /dev/pmemX device.  Yes, seems straightforward to take a
> VM exit on those events and flush simulated pmem to persistent
> storage.
>

Thank you, Dan!

However, REQ_FLUSH and REQ_FUA are handled in kernel space. Okay - after
following the discussion in this thread, I understand that filesystems do not
currently support the case where userspace itself makes data persistent
without the kernel's involvement. So that works.

Hmm, does device-DAX support this case (making data persistent without
msync/fsync)? I guess not, but I just want to confirm. :)


* Re: DAX mapping detection (was: Re: [PATCH] Fix region lost in /proc/self/smaps)
  2016-09-12  3:44       ` Rudoff, Andy
@ 2016-09-12  6:31         ` Xiao Guangrong
  0 siblings, 0 replies; 38+ messages in thread
From: Xiao Guangrong @ 2016-09-12  6:31 UTC (permalink / raw)
  To: Rudoff, Andy, Williams, Dan J, Ross Zwisler, Hansen, Dave,
	Paolo Bonzini, Andrew Morton, Michal Hocko, Gleb Natapov,
	mtosatti, KVM list, linux-kernel, Stefan Hajnoczi, Yumei Huang,
	Linux MM, linux-nvdimm@lists.01.org, linux-fsdevel



On 09/12/2016 11:44 AM, Rudoff, Andy wrote:
>> Whether msync/fsync can make data persistent depends on ADR feature on
>> memory controller, if it exists everything works well, otherwise, we need
>> to have another interface that is why 'Flush hint table' in ACPI comes
>> in. 'Flush hint table' is particularly useful for nvdimm virtualization if
>> we use normal memory to emulate nvdimm with data persistent characteristic
>> (the data will be flushed to a persistent storage, e.g, disk).
>>
>> Does current PMEM programming model fully supports 'Flush hint table'? Is
>> userspace allowed to use these addresses?
>
> The Flush hint table is NOT a replacement for ADR.  To support pmem on
> the x86 architecture, the platform is required to ensure that a pmem
> store flushed from the CPU caches is in the persistent domain so that the
> application need not take any additional steps to make it persistent.
> The most common way to do this is the ADR feature.
>
> If the above is not true, then your x86 platform does not support pmem.

Understood.

However, virtualization is a special case: we can use normal memory to
emulate an NVDIMM for the VM, so the VM can bypass the local file cache,
reduce memory usage, shorten the I/O path, etc. Currently this usage is
useful for lightweight virtualization, such as Clear Containers.

In this case ADR is available on the physical platform, but it cannot
help us make data persistent for the VM. So virtualizing the 'flush hint
table' is a good way to handle it, based on the ACPI spec:
| software can write to any one of these Flush Hint Addresses to
| cause any preceding writes to the NVDIMM region to be flushed
| out of the intervening platform buffers to the targeted NVDIMM
| (to achieve durability)

>
> Flush hints are for use by the BIOS and drivers and are not intended to
> be used in user space.  Flush hints provide two things:
>
> First, if a driver needs to write to command registers or movable windows
> on a DIMM, the Flush hint (if provided in the NFIT) is required to flush
> the command to the DIMM or ensure stores done through the movable window
> are complete before moving it somewhere else.
>
> Second, for the rare case where the kernel wants to flush stores to the
> smallest possible failure domain (i.e. to the DIMM even though ADR will
> handle flushing it from a larger domain), the flush hints provide a way
> to do this.  This might be useful for things like file system journals to
> help ensure the file system is consistent even in the face of ADR failure.

We are assuming ADR can fail; however, do we have a way to know whether
ADR is working correctly? Maybe an MCE could report it?

This is necessary to support making data persistent from userspace without
fsync/msync. Or do we need to unconditionally use the 'flush hint address'
if it is available, as the current nvdimm driver does?

Thanks!


* Re: DAX mapping detection (was: Re: [PATCH] Fix region lost in /proc/self/smaps)
  2016-09-12  5:27   ` Christoph Hellwig
@ 2016-09-12  7:25     ` Oliver O'Halloran
  2016-09-12  7:51       ` Christoph Hellwig
  0 siblings, 1 reply; 38+ messages in thread
From: Oliver O'Halloran @ 2016-09-12  7:25 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Ross Zwisler, Dan Williams, Xiao Guangrong, Dave Hansen,
	Paolo Bonzini, Andrew Morton, Michal Hocko, Gleb Natapov,
	mtosatti, KVM list, linux-kernel, Stefan Hajnoczi, Yumei Huang,
	Linux MM, linux-nvdimm@lists.01.org, linux-fsdevel

On Mon, Sep 12, 2016 at 3:27 PM, Christoph Hellwig <hch@infradead.org> wrote:
> On Thu, Sep 08, 2016 at 04:56:36PM -0600, Ross Zwisler wrote:
>> I think this goes back to our previous discussion about support for the PMEM
>> programming model.  Really I think what NVML needs isn't a way to tell if it
>> is getting a DAX mapping, but whether it is getting a DAX mapping on a
>> filesystem that fully supports the PMEM programming model.  This of course is
>> defined to be a filesystem where it can do all of its flushes from userspace
>> safely and never call fsync/msync, and that allocations that happen in page
>> faults will be synchronized to media before the page fault completes.
>
> That's a an easy way to flag:  you will never get that from a Linux
> filesystem, period.
>
> NVML folks really need to stop taking crack and dreaming this could
> happen.

Well, that's a bummer.

What are the problems here? Is this a matter of existing filesystems
being unable/unwilling to support this or is it just fundamentally
broken? The end goal is to let applications manage the persistence of
their own data without having to involve the kernel in every IOP, but
if we can't do that then what would a 90% solution look like? I think
most people would be OK with having to do an fsync() occasionally, but
not after every write to pmem.


* Re: DAX mapping detection (was: Re: [PATCH] Fix region lost in /proc/self/smaps)
  2016-09-12  7:25     ` Oliver O'Halloran
@ 2016-09-12  7:51       ` Christoph Hellwig
  2016-09-12  8:05         ` Nicholas Piggin
  0 siblings, 1 reply; 38+ messages in thread
From: Christoph Hellwig @ 2016-09-12  7:51 UTC (permalink / raw)
  To: Oliver O'Halloran
  Cc: Christoph Hellwig, Yumei Huang, Michal Hocko, Xiao Guangrong,
	Andrew Morton, KVM list, Linux MM, Gleb Natapov,
	linux-nvdimm@lists.01.org, mtosatti, linux-kernel, Dave Hansen,
	Stefan Hajnoczi, linux-fsdevel, Paolo Bonzini

On Mon, Sep 12, 2016 at 05:25:15PM +1000, Oliver O'Halloran wrote:
> What are the problems here? Is this a matter of existing filesystems
> being unable/unwilling to support this or is it just fundamentally
> broken?

It's a fundamentally broken model.  See Dave's post, which was actually
sent slightly earlier than mine, for the list of required items, which
is fairly unrealistic.  You could probably try to architect a file
system for it, but I doubt it would gain much traction.

> The end goal is to let applications manage the persistence of
> their own data without having to involve the kernel in every IOP, but
> if we can't do that then what would a 90% solution look like? I think
> most people would be OK with having to do an fsync() occasionally, but
> not after ever write to pmem.

You need an fsync for each write that you want to persist.  This sounds
painful for now.  But I have an implementation that will allow the
atomic commit of more or less arbitrary amounts of previous writes for
XFS that I plan to land once the reflink work is in.

That way you can create almost arbitrarily complex data structures in your
programs and commit them atomically.  It's not going to fit the nvml
model, but that whole thing has been complete bullshit since the
beginning anyway.


* Re: DAX mapping detection (was: Re: [PATCH] Fix region lost in /proc/self/smaps)
  2016-09-12  7:51       ` Christoph Hellwig
@ 2016-09-12  8:05         ` Nicholas Piggin
  2016-09-12 15:01           ` Christoph Hellwig
  2016-09-12 21:34           ` Dave Chinner
  0 siblings, 2 replies; 38+ messages in thread
From: Nicholas Piggin @ 2016-09-12  8:05 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Oliver O'Halloran, Yumei Huang, Michal Hocko, Xiao Guangrong,
	Andrew Morton, KVM list, Linux MM, Gleb Natapov,
	linux-nvdimm@lists.01.org, mtosatti, linux-kernel, Dave Hansen,
	Stefan Hajnoczi, linux-fsdevel, Paolo Bonzini

On Mon, 12 Sep 2016 00:51:28 -0700
Christoph Hellwig <hch@infradead.org> wrote:

> On Mon, Sep 12, 2016 at 05:25:15PM +1000, Oliver O'Halloran wrote:
> > What are the problems here? Is this a matter of existing filesystems
> > being unable/unwilling to support this or is it just fundamentally
> > broken?  
> 
> It's a fundamentally broken model.  See Dave's post that actually was
> sent slightly earlier then mine for the list of required items, which
> is fairly unrealistic.  You could probably try to architect a file
> system for it, but I doubt it would gain much traction.

It's not fundamentally broken, it just doesn't fit existing filesystems
well.

Dave's list of requirements is also wrong. A filesystem does not have
to guarantee all of that; it only has to guarantee that this holds for
a given block after it has a mapping and the page fault returns. Other
operations can be supported by invalidating mappings, etc.


* Re: DAX mapping detection (was: Re: [PATCH] Fix region lost in /proc/self/smaps)
  2016-09-12  8:05         ` Nicholas Piggin
@ 2016-09-12 15:01           ` Christoph Hellwig
  2016-09-13  1:31             ` Nicholas Piggin
  2016-09-12 21:34           ` Dave Chinner
  1 sibling, 1 reply; 38+ messages in thread
From: Christoph Hellwig @ 2016-09-12 15:01 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: Christoph Hellwig, Oliver O'Halloran, Yumei Huang,
	Michal Hocko, Xiao Guangrong, Andrew Morton, KVM list, Linux MM,
	Gleb Natapov, linux-nvdimm@lists.01.org, mtosatti, linux-kernel,
	Dave Hansen, Stefan Hajnoczi, linux-fsdevel, Paolo Bonzini

On Mon, Sep 12, 2016 at 06:05:07PM +1000, Nicholas Piggin wrote:
> It's not fundamentally broken, it just doesn't fit well existing
> filesystems.

Or the existing file system architecture for that matter.  Which makes
it a fundamentally broken model.

> Dave's post of requirements is also wrong. A filesystem does not have
> to guarantee all that, it only has to guarantee that is the case for
> a given block after it has a mapping and page fault returns, other
> operations can be supported by invalidating mappings, etc.

Which doesn't really matter if your use case is manipulating
fully mapped files.

But back to the point: if you want to use a full blown Linux or Unix
filesystem you will always have to fsync (or variants of it like msync),
period.

If you want a volume manager on steroids that hands out large chunks
of storage memory that can't ever be moved, truncated, shared, allocated
on demand, etc - implement it in your library on top of a device file.


* Re: DAX mapping detection (was: Re: [PATCH] Fix region lost in /proc/self/smaps)
  2016-09-12  8:05         ` Nicholas Piggin
  2016-09-12 15:01           ` Christoph Hellwig
@ 2016-09-12 21:34           ` Dave Chinner
  2016-09-13  1:53             ` Nicholas Piggin
  1 sibling, 1 reply; 38+ messages in thread
From: Dave Chinner @ 2016-09-12 21:34 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: Christoph Hellwig, Oliver O'Halloran, Yumei Huang,
	Michal Hocko, Xiao Guangrong, Andrew Morton, KVM list, Linux MM,
	Gleb Natapov, linux-nvdimm@lists.01.org, mtosatti, linux-kernel,
	Dave Hansen, Stefan Hajnoczi, linux-fsdevel, Paolo Bonzini

On Mon, Sep 12, 2016 at 06:05:07PM +1000, Nicholas Piggin wrote:
> On Mon, 12 Sep 2016 00:51:28 -0700
> Christoph Hellwig <hch@infradead.org> wrote:
> 
> > On Mon, Sep 12, 2016 at 05:25:15PM +1000, Oliver O'Halloran wrote:
> > > What are the problems here? Is this a matter of existing filesystems
> > > being unable/unwilling to support this or is it just fundamentally
> > > broken?  
> > 
> > It's a fundamentally broken model.  See Dave's post that actually was
> > sent slightly earlier then mine for the list of required items, which
> > is fairly unrealistic.  You could probably try to architect a file
> > system for it, but I doubt it would gain much traction.
> 
> It's not fundamentally broken, it just doesn't fit well existing
> filesystems.
> 
> Dave's post of requirements is also wrong. A filesystem does not have
> to guarantee all that, it only has to guarantee that is the case for
> a given block after it has a mapping and page fault returns, other
> operations can be supported by invalidating mappings, etc.

Sure, but filesystems are completely unaware of what is mapped at
any given time, or what constraints that mapping might have. Trying
to make filesystems aware of per-page mapping constraints seems like
a fairly significant layering violation based on a flawed
assumption, i.e. that operations on other parts of the file do not
affect the block that requires immutable metadata.

e.g. an extent operation in some other area of the file can cause a
tip-to-root extent tree split or merge, and that moves the metadata
that points to the mapped block that we've told userspace "doesn't
need fsync".  We now need an fsync to ensure that the metadata is
consistent on disk again, even though that block has not physically
been moved. IOWs, the immutable data block updates are now not
ordered correctly w.r.t. other updates done to the file, especially
when we consider crash recovery....

All this will expose is an unfixable problem with ordering of stable
data + metadata operations and their synchronisation. As such, it
seems like nothing but a major cluster-fuck to try to do mapping
specific, per-block immutable metadata - it adds major complexity
and even more intractable problems.

Yes, we /could/ try to solve this but, quite frankly, it's far
easier to change the broken PMEM programming model assumptions than
it is to implement what you are suggesting. Or to do what Christoph
suggested and just use a wrapper around something like device
mapper to hand out chunks of unchanging, static pmem to
applications...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: DAX mapping detection (was: Re: [PATCH] Fix region lost in /proc/self/smaps)
  2016-09-12 15:01           ` Christoph Hellwig
@ 2016-09-13  1:31             ` Nicholas Piggin
  2016-09-13  4:06               ` Dan Williams
  0 siblings, 1 reply; 38+ messages in thread
From: Nicholas Piggin @ 2016-09-13  1:31 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Oliver O'Halloran, Yumei Huang, Michal Hocko, Xiao Guangrong,
	Andrew Morton, KVM list, Linux MM, Gleb Natapov,
	linux-nvdimm@lists.01.org, mtosatti, linux-kernel, Dave Hansen,
	Stefan Hajnoczi, linux-fsdevel, Paolo Bonzini

On Mon, 12 Sep 2016 08:01:48 -0700
Christoph Hellwig <hch@infradead.org> wrote:

> On Mon, Sep 12, 2016 at 06:05:07PM +1000, Nicholas Piggin wrote:
> > It's not fundamentally broken, it just doesn't fit well existing
> > filesystems.  
> 
> Or the existing file system architecture for that matter.  Which makes
> it a fundamentally broken model.

Not really. A few reasonable changes can be made to improve things.
Until just now you thought it was fundamentally impossible to make a
reasonable implementation due to Dave's "constraints".

> 
> > Dave's post of requirements is also wrong. A filesystem does not have
> > to guarantee all that, it only has to guarantee that is the case for
> > a given block after it has a mapping and page fault returns, other
> > operations can be supported by invalidating mappings, etc.  
> 
> Which doesn't really matter if your use case is manipulating
> fully mapped files.

There's nothing that says you always have to use them fully mapped and
never use other APIs on them.


> But back to the point: if you want to use a full blown Linux or Unix
> filesystem you will always have to fsync (or variants of it like msync),
> period.

That's circular logic. First you said that should not be done
because of your imagined constraints.

In fact, it's not unreasonable to describe some additional semantics
of the storage that are unavailable with traditional filesystems.

That said, a noop system call is on the order of 100 cycles nowadays,
so rushing to implement these APIs without seeing good numbers and
actual users ready to go seems premature. *This* is the real reason
not to implement new APIs yet.


> If you want a volume manager on stereoids that hands out large chunks
> of storage memory that can't ever be moved, truncated, shared, allocated
> on demand, etc - implement it in your library on top of a device file.

Those constraints don't exist either. I've written a filesystem
that avoids them. It isn't rocket science.


* Re: DAX mapping detection (was: Re: [PATCH] Fix region lost in /proc/self/smaps)
  2016-09-12 21:34           ` Dave Chinner
@ 2016-09-13  1:53             ` Nicholas Piggin
  2016-09-13  7:17               ` Christoph Hellwig
  2016-09-14  7:39               ` Dave Chinner
  0 siblings, 2 replies; 38+ messages in thread
From: Nicholas Piggin @ 2016-09-13  1:53 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christoph Hellwig, Oliver O'Halloran, Yumei Huang,
	Michal Hocko, Xiao Guangrong, Andrew Morton, KVM list, Linux MM,
	Gleb Natapov, linux-nvdimm@lists.01.org, mtosatti, linux-kernel,
	Dave Hansen, Stefan Hajnoczi, linux-fsdevel, Paolo Bonzini

On Tue, 13 Sep 2016 07:34:36 +1000
Dave Chinner <david@fromorbit.com> wrote:

> On Mon, Sep 12, 2016 at 06:05:07PM +1000, Nicholas Piggin wrote:
> > On Mon, 12 Sep 2016 00:51:28 -0700
> > Christoph Hellwig <hch@infradead.org> wrote:
> >   
> > > On Mon, Sep 12, 2016 at 05:25:15PM +1000, Oliver O'Halloran wrote:  
> > > > What are the problems here? Is this a matter of existing filesystems
> > > > being unable/unwilling to support this or is it just fundamentally
> > > > broken?    
> > > 
> > > It's a fundamentally broken model.  See Dave's post that actually was
> > > sent slightly earlier then mine for the list of required items, which
> > > is fairly unrealistic.  You could probably try to architect a file
> > > system for it, but I doubt it would gain much traction.  
> > 
> > It's not fundamentally broken, it just doesn't fit well existing
> > filesystems.
> > 
> > Dave's post of requirements is also wrong. A filesystem does not have
> > to guarantee all that, it only has to guarantee that is the case for
> > a given block after it has a mapping and page fault returns, other
> > operations can be supported by invalidating mappings, etc.  
> 
> Sure, but filesystems are completely unaware of what is mapped at
> any given time, or what constraints that mapping might have. Trying
> to make filesystems aware of per-page mapping constraints seems like

I'm not sure what you mean. The filesystem can hand out mappings
and fault them in itself. It can invalidate them.


> a fairly significant layering violation based on a flawed
> assumption. i.e. that operations on other parts of the file do not
> affect the block that requires immutable metadata.
> 
> e.g an extent operation in some other area of the file can cause a
> tip-to-root extent tree split or merge, and that moves the metadata
> that points to the mapped block that we've told userspace "doesn't
> need fsync".  We now need an fsync to ensure that the metadata is
> consistent on disk again, even though that block has not physically
> been moved.

You don't, because the filesystem can invalidate existing mappings
and do the right thing when they are faulted in again. That's the
big^Wmedium hammer approach that can cope with most problems.

But let me understand your example in the absence of that.

- Application mmaps a file, faults in block 0
- FS allocates block, creates mappings, syncs metadata, sets "no fsync"
  flag for that block, and completes the fault.
- Application writes some data to block 0, completes userspace flushes

* At this point, a crash must return with above data (or newer).

- Application starts writing more stuff into block 0
- Concurrently, fault in block 1
- FS starts to allocate, splits trees including mappings to block 0

* Crash

Is that right? How does your filesystem lose data before the sync
point?

> IOWs, the immutable data block updates are now not
> ordered correctly w.r.t. other updates done to the file, especially
> when we consider crash recovery....
> 
> All this will expose is an unfixable problem with ordering of stable
> data + metadata operations and their synchronisation. As such, it
> seems like nothing but a major cluster-fuck to try to do mapping
> specific, per-block immutable metadata - it adds major complexity
> and even more untractable problems.
> 
> Yes, we /could/ try to solve this but, quite frankly, it's far
> easier to change the broken PMEM programming model assumptions than
> it is to implement what you are suggesting. Or to do what Christoph
> suggested and just use a wrapper around something like device
> mapper to hand out chunks of unchanging, static pmem to
> applications...

If there is any huge complexity or unsolved problem, it is in XFS.
The conceptual problem is simple.


* Re: DAX mapping detection (was: Re: [PATCH] Fix region lost in /proc/self/smaps)
  2016-09-13  1:31             ` Nicholas Piggin
@ 2016-09-13  4:06               ` Dan Williams
  2016-09-13  5:40                 ` Nicholas Piggin
  0 siblings, 1 reply; 38+ messages in thread
From: Dan Williams @ 2016-09-13  4:06 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: Christoph Hellwig, Yumei Huang, Michal Hocko, Xiao Guangrong,
	KVM list, Dave Hansen, Gleb Natapov, linux-nvdimm@lists.01.org,
	mtosatti, linux-kernel, Linux MM, Stefan Hajnoczi, linux-fsdevel,
	Paolo Bonzini, Andrew Morton

On Mon, Sep 12, 2016 at 6:31 PM, Nicholas Piggin <npiggin@gmail.com> wrote:
> On Mon, 12 Sep 2016 08:01:48 -0700
[..]
> That said, a noop system call is on the order of 100 cycles nowadays,
> so rushing to implement these APIs without seeing good numbers and
> actual users ready to go seems premature. *This* is the real reason
> not to implement new APIs yet.

Yes, and harvesting the current crop of low hanging performance fruit
in the filesystem-DAX I/O path remains on the todo list.

In the meantime we're pursuing this mm api, mincore+ or whatever we
end up with, to allow userspace to distinguish memory address ranges
that are backed by a filesystem requiring coordination of metadata
updates + flushes for updates, vs something like device-dax that does
not.


* Re: DAX mapping detection (was: Re: [PATCH] Fix region lost in /proc/self/smaps)
  2016-09-13  4:06               ` Dan Williams
@ 2016-09-13  5:40                 ` Nicholas Piggin
  0 siblings, 0 replies; 38+ messages in thread
From: Nicholas Piggin @ 2016-09-13  5:40 UTC (permalink / raw)
  To: Dan Williams
  Cc: Christoph Hellwig, Yumei Huang, Michal Hocko, Xiao Guangrong,
	KVM list, Dave Hansen, Gleb Natapov, linux-nvdimm@lists.01.org,
	mtosatti, linux-kernel, Linux MM, Stefan Hajnoczi, linux-fsdevel,
	Paolo Bonzini, Andrew Morton

On Mon, 12 Sep 2016 21:06:49 -0700
Dan Williams <dan.j.williams@intel.com> wrote:

> On Mon, Sep 12, 2016 at 6:31 PM, Nicholas Piggin <npiggin@gmail.com> wrote:
> > On Mon, 12 Sep 2016 08:01:48 -0700  
> [..]
> > That said, a noop system call is on the order of 100 cycles nowadays,
> > so rushing to implement these APIs without seeing good numbers and
> > actual users ready to go seems premature. *This* is the real reason
> > not to implement new APIs yet.  
> 
> Yes, and harvesting the current crop of low hanging performance fruit
> in the filesystem-DAX I/O path remains on the todo list.
> 
> In the meantime we're pursuing this mm api, mincore+ or whatever we
> end up with, to allow userspace to distinguish memory address ranges
> that are backed by a filesystem requiring coordination of metadata
> updates + flushes for updates, vs something like device-dax that does
> not.

Yes, that's reasonable.

Do you need page/block granularity? Do you need a way to advise/request
the fs for a particular capability? Is it enough to request and check
success? Would the capability be likely to change, and if so, how would
you notify the app asynchronously?


* Re: DAX mapping detection (was: Re: [PATCH] Fix region lost in /proc/self/smaps)
  2016-09-13  1:53             ` Nicholas Piggin
@ 2016-09-13  7:17               ` Christoph Hellwig
  2016-09-13  9:06                 ` Nicholas Piggin
  2016-09-14  7:39               ` Dave Chinner
  1 sibling, 1 reply; 38+ messages in thread
From: Christoph Hellwig @ 2016-09-13  7:17 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: Dave Chinner, Christoph Hellwig, Oliver O'Halloran,
	Yumei Huang, Michal Hocko, Xiao Guangrong, Andrew Morton,
	KVM list, Linux MM, Gleb Natapov, linux-nvdimm@lists.01.org,
	mtosatti, linux-kernel, Dave Hansen, Stefan Hajnoczi,
	linux-fsdevel, Paolo Bonzini

On Tue, Sep 13, 2016 at 11:53:11AM +1000, Nicholas Piggin wrote:
> - Application mmaps a file, faults in block 0
> - FS allocates block, creates mappings, syncs metadata, sets "no fsync"
>   flag for that block, and completes the fault.
> - Application writes some data to block 0, completes userspace flushes
> 
> * At this point, a crash must return with above data (or newer).
> 
> - Application starts writing more stuff into block 0
> - Concurrently, fault in block 1
> - FS starts to allocate, splits trees including mappings to block 0
> 
> * Crash
> 
> Is that right? How does your filesystem lose data before the sync
> point?

With all current file systems, chances are your metadata hasn't been
flushed out.  You could write all metadata synchronously from the
page fault handler, but that's basically asking for all kinds of
deadlocks.

> If there is any huge complexity or unsolved problem, it is in XFS.
> Conceptual problem is simple.

Good to have you back to make all the hard things simple :)


* Re: DAX mapping detection (was: Re: [PATCH] Fix region lost in /proc/self/smaps)
  2016-09-13  7:17               ` Christoph Hellwig
@ 2016-09-13  9:06                 ` Nicholas Piggin
  0 siblings, 0 replies; 38+ messages in thread
From: Nicholas Piggin @ 2016-09-13  9:06 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Dave Chinner, Oliver O'Halloran, Yumei Huang, Michal Hocko,
	Xiao Guangrong, Andrew Morton, KVM list, Linux MM, Gleb Natapov,
	linux-nvdimm@lists.01.org, mtosatti, linux-kernel, Dave Hansen,
	Stefan Hajnoczi, linux-fsdevel, Paolo Bonzini

On Tue, 13 Sep 2016 00:17:32 -0700
Christoph Hellwig <hch@infradead.org> wrote:

> On Tue, Sep 13, 2016 at 11:53:11AM +1000, Nicholas Piggin wrote:
> > - Application mmaps a file, faults in block 0
> > - FS allocates block, creates mappings, syncs metadata, sets "no fsync"
> >   flag for that block, and completes the fault.
> > - Application writes some data to block 0, completes userspace flushes
> > 
> > * At this point, a crash must return with above data (or newer).
> > 
> > - Application starts writing more stuff into block 0
> > - Concurrently, fault in block 1
> > - FS starts to allocate, splits trees including mappings to block 0
> > 
> > * Crash
> > 
> > Is that right? How does your filesystem lose data before the sync
> > point?  
> 
> Witht all current file systems chances are your metadata hasn't been
> flushed out.  You could write all metadata synchronously from the

Yes, that's a possibility. Another would be an advise call to
request the capability for a given region.


> page fault handler, but that's basically asking for all kinds of
> deadlocks.

Such as?


> > If there is any huge complexity or unsolved problem, it is in XFS.
> > Conceptual problem is simple.  
> 
> Good to have you back and make all the hard thing simple :)

Thanks...? :)

I don't mean to say it's simple to add it to any filesystem, or
that the vfs and mm don't need any changes at all.

If we can agree on something, it's that no new APIs should be added
without careful thought, justification, and users. I only suggest not
dismissing this out of hand.

Thanks,
Nick


* Re: DAX mapping detection (was: Re: [PATCH] Fix region lost in /proc/self/smaps)
  2016-09-13  1:53             ` Nicholas Piggin
  2016-09-13  7:17               ` Christoph Hellwig
@ 2016-09-14  7:39               ` Dave Chinner
  2016-09-14 10:19                 ` Nicholas Piggin
  1 sibling, 1 reply; 38+ messages in thread
From: Dave Chinner @ 2016-09-14  7:39 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: Christoph Hellwig, Oliver O'Halloran, Yumei Huang,
	Michal Hocko, Xiao Guangrong, Andrew Morton, KVM list, Linux MM,
	Gleb Natapov, linux-nvdimm@lists.01.org, mtosatti, linux-kernel,
	Dave Hansen, Stefan Hajnoczi, linux-fsdevel, Paolo Bonzini

On Tue, Sep 13, 2016 at 11:53:11AM +1000, Nicholas Piggin wrote:
> On Tue, 13 Sep 2016 07:34:36 +1000
> Dave Chinner <david@fromorbit.com> wrote:
> But let me understand your example in the absence of that.
> 
> - Application mmaps a file, faults in block 0
> - FS allocates block, creates mappings, syncs metadata, sets "no fsync"
>   flag for that block, and completes the fault.
> - Application writes some data to block 0, completes userspace flushes
> 
> * At this point, a crash must return with above data (or newer).
> 
> - Application starts writing more stuff into block 0
> - Concurrently, fault in block 1
> - FS starts to allocate, splits trees including mappings to block 0
> 
> * Crash
> 
> Is that right?

No.

- app write faults block 0, fs allocates
< time passes while app does stuff to block 0 mapping >
- fs syncs journal, block 0 metadata now persistent
< time passes while app does stuff to block 0 mapping >
- app structure grows, faults block 1, fs allocates
- app adds pointers to data in block 1 from block 0, does
  userspace pmem data sync.

*crash*

> How does your filesystem lose data before the sync
> point?

After recovery, the file has data in block 0, but no block 1 because
the allocation transaction for block 1 was not flushed to the
journal. Data in block 0 points to data in block 1, but block 1 does
not exist. IOWs, the application has corrupt data because it never
issued a data synchronisation request to the filesystem....
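To make the failure concrete, a minimal sketch of the kind of structure
being described (the names are illustrative only, not from any real
application): the pointer in block 0 is only meaningful if the block 1
allocation it refers to also survived the crash.

#include <stdint.h>

/* Block 0 holds a header recording where the newest record lives.  The
 * application's userspace flush makes hdr.next_off persistent, but if
 * the filesystem never journalled the allocation of block 1, that
 * offset points into a hole after recovery. */
struct header {                 /* lives in block 0 */
        uint64_t next_off;      /* file offset of the newest record */
};

struct record {                 /* appended into block 1 */
        uint64_t len;
        char     data[];
};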

----

Ok, looking back over your example, you seem to be suggesting a new
page fault behaviour is required from filesystems that has not been
described or explained, and that behaviour is triggered
(persistently) somehow from userspace. You've also suggested
filesystems store a persistent per-block "no fsync" flag
in their extent map as part of the implementation. Right?

Reading between the lines, I'm guessing that the "no fsync" flag has
very specific update semantics, constraints and requirements.  Can
you outline how you expect this flag to be set and updated, how it's
used consistently between different applications (e.g. cp of a file
vs the app using the file), behavioural constraints it implies for
page faults vs non-mmap access to the data in the block, how
you'd expect filesystems to deal with things like a hole punch
landing in the middle of an extent marked with "no fsync", etc?

[snip]

> If there is any huge complexity or unsolved problem, it is in XFS.
> Conceptual problem is simple.

Play nice and be constructive, please?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: DAX mapping detection (was: Re: [PATCH] Fix region lost in /proc/self/smaps)
  2016-09-14  7:39               ` Dave Chinner
@ 2016-09-14 10:19                 ` Nicholas Piggin
  2016-09-15  2:31                   ` Dave Chinner
  0 siblings, 1 reply; 38+ messages in thread
From: Nicholas Piggin @ 2016-09-14 10:19 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christoph Hellwig, Oliver O'Halloran, Yumei Huang,
	Michal Hocko, Xiao Guangrong, Andrew Morton, KVM list, Linux MM,
	Gleb Natapov, linux-nvdimm@lists.01.org, mtosatti, linux-kernel,
	Dave Hansen, Stefan Hajnoczi, linux-fsdevel, Paolo Bonzini

On Wed, 14 Sep 2016 17:39:02 +1000
Dave Chinner <david@fromorbit.com> wrote:

> On Tue, Sep 13, 2016 at 11:53:11AM +1000, Nicholas Piggin wrote:
> > On Tue, 13 Sep 2016 07:34:36 +1000
> > Dave Chinner <david@fromorbit.com> wrote:
> > But let me understand your example in the absence of that.
> > 
> > - Application mmaps a file, faults in block 0
> > - FS allocates block, creates mappings, syncs metadata, sets "no fsync"
> >   flag for that block, and completes the fault.
> > - Application writes some data to block 0, completes userspace flushes
> > 
> > * At this point, a crash must return with above data (or newer).
> > 
> > - Application starts writing more stuff into block 0
> > - Concurrently, fault in block 1
> > - FS starts to allocate, splits trees including mappings to block 0
> > 
> > * Crash
> > 
> > Is that right?  
> 
> No.
> 
> - app write faults block 0, fs allocates
> < time passes while app does stuff to block 0 mapping >
> - fs syncs journal, block 0 metadata now persistent
> < time passes while app does stuff to block 0 mapping >
> - app structure grows, faults block 1, fs allocates
> - app adds pointers to data in block 1 from block 0, does
>   userspace pmem data sync.
> 
> *crash*
> 
> > How does your filesystem lose data before the sync
> > point?  
> 
> After recovery, the file has data in block 0, but no block 1 because
> the allocation transaction for block 1 was not flushed to the
> journal. Data in block 0 points to data in block 1, but block 1 does
> not exist. IOWs, the application has corrupt data because it never
> issued a data synchronisation request to the filesystem....
> 
> ----
> 
> Ok, looking back over your example, you seem to be suggesting a new
> page fault behaviour is required from filesystems that has not been
> described or explained, and that behaviour is triggered
> (persistently) somehow from userspace. You've also suggested
> filesystems store a persistent per-block "no fsync" flag
> in their extent map as part of the implementation. Right?

This is what we're talking about. Of course a filesystem can't just
start supporting the feature without any changes.


> Reading between the lines, I'm guessing that the "no fsync" flag has
> very specific update semantics, constraints and requirements.  Can
> you outline how you expect this flag to be set and updated, how it's
> used consistently between different applications (e.g. cp of a file
> vs the app using the file), behavioural constraints it implies for
> page faults vs non-mmap access to the data in the block, how
> you'd expect filesystems to deal with things like a hole punch
> landing in the middle of an extent marked with "no fsync", etc?

Well that's what's being discussed. An approach close to what I did is
to allow the app to request a "no sync" type of mmap. The filesystem will
invalidate all such mappings before it does buffered IOs or hole punch,
and will sync metadata after allocating a new block but before returning
from a fault.

The app could query rather than request, but I found request seemed to
work better. The filesystem might, for example, be working with apps that
don't use the feature, and doesn't want to flush just in case anyone
ever queried in the past.
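As a rough sketch of what such a mapping might look like from the
application side (MAP_NOSYNC_DAX below is purely hypothetical, a
placeholder for whatever request API comes out of this; mmap() and the
x86 cache flush intrinsics are existing interfaces, and error handling
is omitted):

/* Request a "no sync" DAX mapping, store into it, and make the data
 * durable with CPU cache flushes alone, never calling msync() or
 * fdatasync().  MAP_NOSYNC_DAX does not exist in any kernel. */
#include <emmintrin.h>          /* _mm_clflush(), _mm_sfence() */
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define MAP_NOSYNC_DAX  0x80000000      /* hypothetical mmap flag */

static void persist(const void *addr, size_t len)
{
        const char *p = (const char *)((uintptr_t)addr & ~63UL);
        const char *end = (const char *)addr + len;

        for (; p < end; p += 64)        /* flush each cacheline */
                _mm_clflush(p);
        _mm_sfence();                   /* fence the flushes */
}

int main(void)
{
        int fd = open("/mnt/pmem/data", O_RDWR);
        char *buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                         MAP_SHARED | MAP_NOSYNC_DAX, fd, 0);

        strcpy(buf, "record v1");       /* store directly to pmem */
        persist(buf, strlen(buf) + 1);  /* durable without fdatasync() */

        munmap(buf, 4096);
        close(fd);
        return 0;
}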


> [snip]
> 
> > If there is any huge complexity or unsolved problem, it is in XFS.
> > Conceptual problem is simple.  
> 
> Play nice and be constructive, please?

So you agree that the persistent memory people who have come with some
requirements and ideas for an API should not be immediately shut down
with bogus handwaving.

Thanks,
Nick

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: DAX mapping detection (was: Re: [PATCH] Fix region lost in /proc/self/smaps)
  2016-09-14 10:19                 ` Nicholas Piggin
@ 2016-09-15  2:31                   ` Dave Chinner
  2016-09-15  3:49                     ` Nicholas Piggin
  0 siblings, 1 reply; 38+ messages in thread
From: Dave Chinner @ 2016-09-15  2:31 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: Christoph Hellwig, Oliver O'Halloran, Yumei Huang,
	Michal Hocko, Xiao Guangrong, Andrew Morton, KVM list, Linux MM,
	Gleb Natapov, linux-nvdimm@lists.01.org, mtosatti, linux-kernel,
	Dave Hansen, Stefan Hajnoczi, linux-fsdevel, Paolo Bonzini

On Wed, Sep 14, 2016 at 08:19:36PM +1000, Nicholas Piggin wrote:
> On Wed, 14 Sep 2016 17:39:02 +1000
> Dave Chinner <david@fromorbit.com> wrote:
> > Ok, looking back over your example, you seem to be suggesting a new
> > page fault behaviour is required from filesystems that has not been
> > described or explained, and that behaviour is triggered
> > (persistently) somehow from userspace. You've also suggested
> > filesystems store a persistent per-block "no fsync" flag
> > in their extent map as part of the implementation. Right?
> 
> This is what we're talking about. Of course a filesystem can't just
> start supporting the feature without any changes.

Sure, but one first has to describe the feature desired before all
parties can discuss it. We need more than vague references and
allusions from you to define the solution you are proposing.

Once everyone understands what is being described, we might be able
to work out how it can be implemented in a simple, generic manner
rather than requiring every filesystem to change their on-disk
formats. IOWs, we need you to describe /details/ of semantics,
behaviour and data integrity constraints that are required, not
describe an implementation of something we have no knowledge about.

> > Reading between the lines, I'm guessing that the "no fsync" flag has
> > very specific update semantics, constraints and requirements.  Can
> > you outline how you expect this flag to be set and updated, how it's
> > used consistently between different applications (e.g. cp of a file
> > vs the app using the file), behavioural constraints it implies for
> > page faults vs non-mmap access to the data in the block, how
> > you'd expect filesystems to deal with things like a hole punch
> > landing in the middle of an extent marked with "no fsync", etc?
> 
> Well that's what's being discussed.  An approach close to what I did is
> to allow the app request a "no sync" type of mmap.

That's not an answer to the questions I asked about the "no
sync" flag you were proposing. You've redirected to a different
solution, one that ....

> Filesystem will
> invalidate all such mappings before it does buffered IOs or hole punch,
> and will sync metadata after allocating a new block but before returning
> from a fault.

... requires synchronous metadata updates from page fault context,
which we already know is not a good solution.  I'll quote one of
Christoph's previous replies to save me the trouble:

	"You could write all metadata synchronously from the page
	fault handler, but that's basically asking for all kinds of
	deadlocks."

So, let's redirect back to the "no sync" flag you were talking about
- can you answer the questions I asked above? It would be especially
important to highlight how the proposed feature would avoid requiring
synchronous metadata updates in page fault contexts....

> > [snip]
> > 
> > > If there is any huge complexity or unsolved problem, it is in XFS.
> > > Conceptual problem is simple.  
> > 
> > Play nice and be constructive, please?
> 
> So you agree that the persistent memory people who have come with some
> requirements and ideas for an API should not be immediately shut down
> with bogus handwaving.

Pull your head in, Nick.

You've been absent from the community for the last 5 years. You
suddenly barge in with a massive chip on your shoulder and try to
throw your weight around. You're being arrogant, obnoxious, evasive
and petty. You're belittling anyone who dares to question your
proclamations. You're not listening to the replies you are getting.
You're baiting people to try to get an adverse reaction from them
and when someone gives you the adverse reaction you were fishing
for, you play the victim card.

That's textbook bullying behaviour.

Nick, this behaviour does not help progress the discussion in any
way. It only serves to annoy the other people who are sincerely
trying to understand and determine if/how we can solve the problem
in some way.

So, again, play nice and be constructive, please?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: DAX mapping detection (was: Re: [PATCH] Fix region lost in /proc/self/smaps)
  2016-09-15  2:31                   ` Dave Chinner
@ 2016-09-15  3:49                     ` Nicholas Piggin
  2016-09-15 10:32                       ` Dave Chinner
  0 siblings, 1 reply; 38+ messages in thread
From: Nicholas Piggin @ 2016-09-15  3:49 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christoph Hellwig, Oliver O'Halloran, Yumei Huang,
	Michal Hocko, Xiao Guangrong, Andrew Morton, KVM list, Linux MM,
	Gleb Natapov, linux-nvdimm@lists.01.org, mtosatti, linux-kernel,
	Dave Hansen, Stefan Hajnoczi, linux-fsdevel, Paolo Bonzini

On Thu, 15 Sep 2016 12:31:33 +1000
Dave Chinner <david@fromorbit.com> wrote:

> On Wed, Sep 14, 2016 at 08:19:36PM +1000, Nicholas Piggin wrote:
> > On Wed, 14 Sep 2016 17:39:02 +1000
> > Dave Chinner <david@fromorbit.com> wrote:  
> > > Ok, looking back over your example, you seem to be suggesting a new
> > > page fault behaviour is required from filesystems that has not been
> > > described or explained, and that behaviour is triggered
> > > (persistently) somehow from userspace. You've also suggested
> > > filesystems store a persistent per-block "no fsync" flag
> > > in their extent map as part of the implementation. Right?  
> > 
> > This is what we're talking about. Of course a filesystem can't just
> > start supporting the feature without any changes.  
> 
> Sure, but one first has to describe the feature desired before all

The DAX people have been. They want to be able to get mappings
that can be synced without doing fsync. The *exact* extent of
those capabilities and what the API exactly looks like is up for
discussion.


> parties can discuss it. We need more than vague references and
> allusions from you to define the solution you are proposing.
> 
> Once everyone understands what is being described, we might be able
> to work out how it can be implemented in a simple, generic manner
> rather than requiring every filesystem to change their on-disk
> formats. IOWs, we need you to describe /details/ of semantics,
> behaviour and data integrity constraints that are required, not
> describe an implementation of something we have no knowledge about.

Well you said it was impossible already and Christoph told them
they were smoking crack :)

Anyway, there's a few questions. Implementation and API. Some
filesystems may never cope with it. Of course it should be as
generic as possible though.


> > > Reading between the lines, I'm guessing that the "no fsync" flag has
> > > very specific update semantics, constraints and requirements.  Can
> > > you outline how you expect this flag to be set and updated, how it's
> > > used consistently between different applications (e.g. cp of a file
> > > vs the app using the file), behavioural constraints it implies for
> > > page faults vs non-mmap access to the data in the block, how
> > > you'd expect filesystems to deal with things like a hole punch
> > > landing in the middle of an extent marked with "no fsync", etc?  
> > 
> > Well that's what's being discussed.  An approach close to what I did is
> > to allow the app request a "no sync" type of mmap.  
> 
> That's not an answer to the questions I asked about the "no
> sync" flag you were proposing. You've redirected to a different
> solution, one that ....

A "no sync" flag would do exactly the same thing in terms of consistency.
It would just do the no-sync sequence by default rather than being
asked for it. More of an API detail than an implementation one.

> 
> > Filesystem will
> > invalidate all such mappings before it does buffered IOs or hole punch,
> > and will sync metadata after allocating a new block but before returning
> > from a fault.  
> 
> ... requires synchronous metadata updates from page fault context,
> which we already know is not a good solution.  I'll quote one of
> Christoph's previous replies to save me the trouble:
> 
> 	"You could write all metadata synchronously from the page
> 	fault handler, but that's basically asking for all kinds of
> 	deadlocks."
> So, let's redirect back to the "no sync" flag you were talking about
> - can you answer the questions I asked above? It would be especially
> important to highlight how the proposed feature would avoid requiring
> synchronous metadata updates in page fault contexts....

Right. So what deadlocks are you concerned about?

There could be a scale of capabilities here, for different filesystems
that do things differently. 

Some filesystems could require fsync for metadata, but allow fdatasync
to be skipped. Users would need to have some knowledge of block size
or do preallocation and sync.

That might put more burden on libraries/applications if there are
concurrent operations, but that might be something they can deal with
-- fdatasync already requires some knowledge of concurrent operations
(or lack thereof).
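A sketch of that preallocate-and-sync-up-front pattern, using only
existing interfaces (error handling trimmed); whether a given filesystem
then guarantees that no further metadata flushes are needed is exactly
the open question in this thread:

#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Allocate every block and flush the resulting metadata once, so that
 * later stores through the mapping only need userspace cache flushes. */
void *setup_region(const char *path, size_t len)
{
        int fd = open(path, O_CREAT | O_RDWR, 0600);
        void *addr;

        if (fd < 0)
                return NULL;

        posix_fallocate(fd, 0, len);    /* allocate all blocks now */
        fsync(fd);                      /* make the block map durable */

        addr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        close(fd);                      /* the mapping stays valid */
        return addr == MAP_FAILED ? NULL : addr;
}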


> > > [snip]
> > >   
> > > > If there is any huge complexity or unsolved problem, it is in XFS.
> > > > Conceptual problem is simple.    
> > > 
> > > Play nice and be constructive, please?  
> > 
> > So you agree that the persistent memory people who have come with some
> > requirements and ideas for an API should not be immediately shut down
> > with bogus handwaving.  
> 
> Pull your head in, Nick.
> 
> You've been absent from the community for the last 5 years. You
> suddenly barge in with a massive chip on your shoulder and try to

I'm trying to give some constructive input to the nvdimm guys.

You and Christoph know a huge amount about vfs and filesystems.
But sometimes you shut people down prematurely. It can be very
intimidating for someone who might not know *exactly* what they
are asking for or have not considered some difficult locking case
in a filesystem. I'm sure it's not intentional, but that's how it
can come across.

That said, I don't want to derail their thread any further with
this. So I apologise for my tone to you, Dave.

Thanks,
Nick

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: DAX mapping detection (was: Re: [PATCH] Fix region lost in /proc/self/smaps)
  2016-09-12  1:40   ` Dave Chinner
@ 2016-09-15  5:55     ` Darrick J. Wong
  2016-09-15  6:25       ` Dave Chinner
  0 siblings, 1 reply; 38+ messages in thread
From: Darrick J. Wong @ 2016-09-15  5:55 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Ross Zwisler, Dan Williams, Xiao Guangrong, Dave Hansen,
	Paolo Bonzini, Andrew Morton, Michal Hocko, Gleb Natapov,
	mtosatti, KVM list, linux-kernel, Stefan Hajnoczi, Yumei Huang,
	Linux MM, linux-nvdimm@lists.01.org, linux-fsdevel

On Mon, Sep 12, 2016 at 11:40:35AM +1000, Dave Chinner wrote:
> On Thu, Sep 08, 2016 at 04:56:36PM -0600, Ross Zwisler wrote:
> > On Wed, Sep 07, 2016 at 09:32:36PM -0700, Dan Williams wrote:
> > > My understanding is that it is looking for the VM_MIXEDMAP flag which
> > > is already ambiguous for determining if DAX is enabled even if this
> > > dynamic listing issue is fixed.  XFS has arranged for DAX to be a
> > > per-inode capability and has an XFS-specific inode flag.  We can make
> > > that a common inode flag, but it seems we should have a way to
> > > interrogate the mapping itself in the case where the inode is unknown
> > > or unavailable.  I'm thinking extensions to mincore to have flags for
> > > DAX and possibly whether the page is part of a pte, pmd, or pud
> > > mapping.  Just floating that idea before starting to look into the
> > > implementation, comments or other ideas welcome...
> > 
> > I think this goes back to our previous discussion about support for the PMEM
> > programming model.  Really I think what NVML needs isn't a way to tell if it
> > is getting a DAX mapping, but whether it is getting a DAX mapping on a
> > filesystem that fully supports the PMEM programming model.  This of course is
> > defined to be a filesystem where it can do all of its flushes from userspace
> > safely and never call fsync/msync, and that allocations that happen in page
> > faults will be synchronized to media before the page fault completes.
> > 
> > IIUC this is what NVML needs - a way to decide "do I use fsync/msync for
> > everything or can I rely fully on flushes from userspace?" 
> 
> "need fsync/msync" is a dynamic state of an inode, not a static
> property. i.e. users can do things that change an inode behind the
> back of a mapping, even if they are not aware that this might
> happen. As such, a filesystem can invalidate an existing mapping
> at any time and userspace won't notice because it will simply fault
> in a new mapping on the next access...
> 
> > For all existing implementations, I think the answer is "you need to use
> > fsync/msync" because we don't yet have proper support for the PMEM programming
> > model.
> 
> Yes, that is correct.
> 
> FWIW, I don't think it will ever be possible to support this ....
> wonderful "PMEM programming model" from any current or future kernel
> filesystem without a very specific set of restrictions on what can
> be done to a file.  e.g.
> 
> 	1. the file has to be fully allocated and zeroed before
> 	   use. Preallocation/zeroing via unwritten extents is not
> 	   allowed. Sparse files are not allowed. Shared extents are
> 	   not allowed.
> 	2. set the "PMEM_IMMUTABLE" inode flag - filesystem must
> 	   check the file is fully allocated before allowing it to
> 	   be set, and caller must have CAP_LINUX_IMMUTABLE.
> 	3. Inode metadata is now immutable, and file data can only
> 	   be accessed and/or modified via mmap().
> 	4. All non-mmap methods of inode data modification
> 	   will now fail with EPERM.
> 	5. all methods of inode metadata modification will now fail
> 	   with EPERM, timestamp updates will be ignored.
> 	6. PMEM_IMMUTABLE flag can only be removed if the file is
> 	   not currently mapped and caller has CAP_LINUX_IMMUTABLE.
> 
> A flag like this /should/ make it possible to avoid fsync/msync() on
> a file for existing filesystems, but it also means that such files
> have significant management issues (hence the need for
> CAP_LINUX_IMMUTABLE to cover it's use).

Hmmm... I started to ponder such a flag, but ran into some questions.
If it's PMEM_IMMUTABLE, does this mean that none of 1-6 apply if the
filesystem discovers it isn't on pmem?

I thought about just having an 'immutable metadata' flag where any
timestamp, xattr, or block mapping update just returns EPERM.  There
wouldn't be any checks as in (1); if you left a hole in the file prior
to setting the flag then you won't be filling it unless you clear the
flag.  OTOH if it merely made the metadata unchangeable then it's a
stretch to get to non-mmap data accesses also being disallowed.

Maybe the immutable metadata and mmap-only properties would only be
implied if both DAX and IMMUTABLE_META are set on a file?

Ok no more rambling until sleep. :)

--D

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: DAX mapping detection (was: Re: [PATCH] Fix region lost in /proc/self/smaps)
  2016-09-15  5:55     ` Darrick J. Wong
@ 2016-09-15  6:25       ` Dave Chinner
  0 siblings, 0 replies; 38+ messages in thread
From: Dave Chinner @ 2016-09-15  6:25 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Ross Zwisler, Dan Williams, Xiao Guangrong, Dave Hansen,
	Paolo Bonzini, Andrew Morton, Michal Hocko, Gleb Natapov,
	mtosatti, KVM list, linux-kernel, Stefan Hajnoczi, Yumei Huang,
	Linux MM, linux-nvdimm@lists.01.org, linux-fsdevel

On Wed, Sep 14, 2016 at 10:55:03PM -0700, Darrick J. Wong wrote:
> On Mon, Sep 12, 2016 at 11:40:35AM +1000, Dave Chinner wrote:
> > On Thu, Sep 08, 2016 at 04:56:36PM -0600, Ross Zwisler wrote:
> > > On Wed, Sep 07, 2016 at 09:32:36PM -0700, Dan Williams wrote:
> > > > My understanding is that it is looking for the VM_MIXEDMAP flag which
> > > > is already ambiguous for determining if DAX is enabled even if this
> > > > dynamic listing issue is fixed.  XFS has arranged for DAX to be a
> > > > per-inode capability and has an XFS-specific inode flag.  We can make
> > > > that a common inode flag, but it seems we should have a way to
> > > > interrogate the mapping itself in the case where the inode is unknown
> > > > or unavailable.  I'm thinking extensions to mincore to have flags for
> > > > DAX and possibly whether the page is part of a pte, pmd, or pud
> > > > mapping.  Just floating that idea before starting to look into the
> > > > implementation, comments or other ideas welcome...
> > > 
> > > I think this goes back to our previous discussion about support for the PMEM
> > > programming model.  Really I think what NVML needs isn't a way to tell if it
> > > is getting a DAX mapping, but whether it is getting a DAX mapping on a
> > > filesystem that fully supports the PMEM programming model.  This of course is
> > > defined to be a filesystem where it can do all of its flushes from userspace
> > > safely and never call fsync/msync, and that allocations that happen in page
> > > faults will be synchronized to media before the page fault completes.
> > > 
> > > IIUC this is what NVML needs - a way to decide "do I use fsync/msync for
> > > everything or can I rely fully on flushes from userspace?" 
> > 
> > "need fsync/msync" is a dynamic state of an inode, not a static
> > property. i.e. users can do things that change an inode behind the
> > back of a mapping, even if they are not aware that this might
> > happen. As such, a filesystem can invalidate an existing mapping
> > at any time and userspace won't notice because it will simply fault
> > in a new mapping on the next access...
> > 
> > > For all existing implementations, I think the answer is "you need to use
> > > fsync/msync" because we don't yet have proper support for the PMEM programming
> > > model.
> > 
> > Yes, that is correct.
> > 
> > FWIW, I don't think it will ever be possible to support this ....
> > wonderful "PMEM programming model" from any current or future kernel
> > filesystem without a very specific set of restrictions on what can
> > be done to a file.  e.g.
> > 
> > 	1. the file has to be fully allocated and zeroed before
> > 	   use. Preallocation/zeroing via unwritten extents is not
> > 	   allowed. Sparse files are not allowed. Shared extents are
> > 	   not allowed.
> > 	2. set the "PMEM_IMMUTABLE" inode flag - filesystem must
> > 	   check the file is fully allocated before allowing it to
> > 	   be set, and caller must have CAP_LINUX_IMMUTABLE.
> > 	3. Inode metadata is now immutable, and file data can only
> > 	   be accessed and/or modified via mmap().
> > 	4. All non-mmap methods of inode data modification
> > 	   will now fail with EPERM.
> > 	5. all methods of inode metadata modification will now fail
> > 	   with EPERM, timestamp updates will be ignored.
> > 	6. PMEM_IMMUTABLE flag can only be removed if the file is
> > 	   not currently mapped and caller has CAP_LINUX_IMMUTABLE.
> > 
> > A flag like this /should/ make it possible to avoid fsync/msync() on
> > a file for existing filesystems, but it also means that such files
> > have significant management issues (hence the need for
> > CAP_LINUX_IMMUTABLE to cover it's use).
> 
> Hmmm... I started to ponder such a flag, but ran into some questions.
> If it's PMEM_IMMUTABLE, does this mean that none of 1-6 apply if the
> filesystem discovers it isn't on pmem?

Would only be meaningful if the FS_XFLAG_DAX/S_DAX flag is also set
on the inode and the backing store is dax capable. Hence the 'PMEM'
part of the name.

> I thought about just having an 'immutable metadata' flag where any
> timestamp, xattr, or block mapping update just returns EPERM.

And all the rest - no hard links, no perm/owner changes, no security
context changes(!), and so on. And it's even more complex with
filesystems that have COW metadata and pack multiple unrelated
metadata objects into single blocks - they can do all sorts of
interesting things on unrelated metadata updates... :P

You'd also have to turn off background internal filesystem modification
vectors, too, like EOF scanning, defrag, balance, dedupe,
auto-repair, etc.  And, now that I think about it, snapshots are out
of the question too.

This gets more hairy the more I think about what our filesystems can
do these days....

> There
> wouldn't be any checks as in (1); if you left a hole in the file prior
> to setting the flag then you won't be filling it unless you clear the
> flag.

Which means writing into a hole would need to return an error, and a
write page fault into a hole would need to deliver SIGSEGV. Seems like a
great way to cause random application failures to me...

> OTOH if it merely made the metadata unchangeable then it's a
> stretch to get to non-mmap data accesses also being disallowed.

*nod*

> Maybe the immutable metadata and mmap-only properties would only be
> implied if both DAX and IMMUTABLE_META are set on a file?

I'd suggest that PMEM_IMMUTABLE could only be set on an inode that
already has the FS_XFLAG_DAX set on it (or it is being set at the
same time). And clearing the DAX flag would also remove the
PMEM_IMMUTABLE flag. Perhaps it would be better to call it
FS_XFLAG_DAX_IMMUTABLE rather than anything pmem related.
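For illustration only: if such a flag were exposed the same way
FS_XFLAG_DAX is today, setting it from userspace might look like the
sketch below. FS_XFLAG_DAX_IMMUTABLE is hypothetical; struct fsxattr,
FS_XFLAG_DAX and the FS_IOC_FSGETXATTR/FS_IOC_FSSETXATTR ioctls are
existing interfaces.

#include <linux/fs.h>
#include <sys/ioctl.h>

#define FS_XFLAG_DAX_IMMUTABLE  0x40000000      /* hypothetical bit */

/* Set the hypothetical DAX-immutable flag together with FS_XFLAG_DAX
 * through the existing per-inode xflags interface. */
int set_dax_immutable(int fd)
{
        struct fsxattr fsx;

        if (ioctl(fd, FS_IOC_FSGETXATTR, &fsx) < 0)
                return -1;

        fsx.fsx_xflags |= FS_XFLAG_DAX | FS_XFLAG_DAX_IMMUTABLE;

        return ioctl(fd, FS_IOC_FSSETXATTR, &fsx);
}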

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: DAX mapping detection (was: Re: [PATCH] Fix region lost in /proc/self/smaps)
  2016-09-15  3:49                     ` Nicholas Piggin
@ 2016-09-15 10:32                       ` Dave Chinner
  2016-09-15 11:42                         ` Nicholas Piggin
  0 siblings, 1 reply; 38+ messages in thread
From: Dave Chinner @ 2016-09-15 10:32 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: Christoph Hellwig, Oliver O'Halloran, Yumei Huang,
	Michal Hocko, Xiao Guangrong, Andrew Morton, KVM list, Linux MM,
	Gleb Natapov, linux-nvdimm@lists.01.org, mtosatti, linux-kernel,
	Dave Hansen, Stefan Hajnoczi, linux-fsdevel, Paolo Bonzini

On Thu, Sep 15, 2016 at 01:49:45PM +1000, Nicholas Piggin wrote:
> On Thu, 15 Sep 2016 12:31:33 +1000
> Dave Chinner <david@fromorbit.com> wrote:
> 
> > On Wed, Sep 14, 2016 at 08:19:36PM +1000, Nicholas Piggin wrote:
> > > On Wed, 14 Sep 2016 17:39:02 +1000
> > Sure, but one first has to describe the feature desired before all
> 
> The DAX people have been.

Hmmmm. The only "DAX people" I know of are kernel developers who
have been working on implementing DAX in the kernel - Willy, Ross,
Dan, Jan, Christoph, Kirill, myself and a few others around the
fringes.

> They want to be able to get mappings
> that can be synced without doing fsync.

Oh, ok, the Intel Userspace PMEM library requirement. I thought you
had something more than this - whatever extra problem the per-block
no fsync flag would solve?

> The *exact* extent of
> those capabilities and what the API exactly looks like is up for
> discussion.

Yup.

> Well you said it was impossible already and Christoph told them
> they were smoking crack :)

I have not said that. I have said bad things about bad
proposals and called the PMEM library model broken, but I most
definitely have not said that solving the problem is impossible.

> > That's not an answer to the questions I asked about about the "no
> > sync" flag you were proposing. You've redirected to the a different
> > solution, one that ....
> 
> No sync flag would do the same thing exactly in terms of consistency.
> It would just do the no-sync sequence by default rather than being
> asked for it. More of an API detail than implementation.

You still haven't described anything about what a per-block flag
design is supposed to look like.... :/

> > > Filesystem will
> > > invalidate all such mappings before it does buffered IOs or hole punch,
> > > and will sync metadata after allocating a new block but before returning
> > > from a fault.  
> > 
> > ... requires synchronous metadata updates from page fault context,
> > which we already know is not a good solution.  I'll quote one of
> > Christoph's previous replies to save me the trouble:
> > 
> > 	"You could write all metadata synchronously from the page
> > 	fault handler, but that's basically asking for all kinds of
> > 	deadlocks."
> > So, let's redirect back to the "no sync" flag you were talking about
> > - can you answer the questions I asked above? It would be especially
> > important to highlight how the proposed feature would avoid requiring
> > synchronous metadata updates in page fault contexts....
> 
> Right. So what deadlocks are you concerned about?

It basically puts the entire journal checkpoint path under a page
fault context. i.e. a whole new global locking context problem is
created as this path can now be run both inside and outside the
mmap_sem. Nothing ever good comes from running filesystem locking
code both inside and outside the mmap_sem.

FWIW, We've never executed synchronous transactions inside page
faults in XFS, and I think ext4 is in the same boat - it may be even
worse because of the way it does ordered data dispatch through the
journal. I don't really even want to think about the level of hurt
this might put btrfs or other COW/log structured filesystems under.
I'm sure Christoph can reel off a bunch more issues off the top of
his head....

> There could be a scale of capabilities here, for different filesystems
> that do things differently. 

Why do we need such complexity to be defined?

I'm tending towards just adding a new fallocate() operation that sets
up a fully allocated and zeroed file of fixed length that has
immutable metadata once created. Single syscall, with well defined
semantics, and it doesn't dictate the implementation any filesystem
must use. All it dictates is that the data region can be written
safely on dax-enabled storage without needing fsync() to be issued.

i.e. the implementation can be filesystem specific, and it is simple
to implement the basic functionality and constraints in both XFS and
ext4 right away, and as other filesystems come along they can
implement it in the way that best suits them. e.g. btrfs would need
to add the "no COW" flag to the file as well.

If someone wants to implement a per-block no-fsync flag, and do
synchronous metadata updates in the page fault path, then they are
welcome to do so. But we don't /need/ such complexity to implement
the functionality that the pmem programming model requires.
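As a sketch of what that could look like to an application
(FALLOC_FL_PMEM_IMMUTABLE is a made-up name for the proposed operation;
fallocate() itself and <linux/falloc.h> are real):

#define _GNU_SOURCE             /* for fallocate() */
#include <fcntl.h>
#include <sys/types.h>
#include <linux/falloc.h>

#define FALLOC_FL_PMEM_IMMUTABLE  0x80  /* hypothetical mode bit */

/* Fully allocate and zero [0, len) and freeze the file's metadata in a
 * single call, so a DAX mapping of it never needs fsync() for data. */
int make_nofsync_file(int fd, off_t len)
{
        return fallocate(fd, FALLOC_FL_PMEM_IMMUTABLE, 0, len);
}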

> Some filesystems could require fsync for metadata, but allow fdatasync
> to be skipped. Users would need to have some knowledge of block size
> or do preallocation and sync.

Not sure what you mean here -  avoiding the need for using fsync()
by using fsync() seems a little circular to me.  :/

> That might put more burden on libraries/applications if there are
> concurrent operations, but that might be something they can deal with
> -- fdatasync already requires some knowledge of concurrent operations
> (or lack thereof).

Additional userspace complexity is something we should avoid.


> You and Christoph know a huge amount about vfs and filesystems.
> But sometimes you shut people down prematurely.

Appearances can be deceiving.  I don't shut discussions down unless
my time is being wasted, and that's pretty rare.

[You probably know most of what I'm about to write, but I'm not
actually writing it for you.... ]

> It can be very
> intimidating for someone who might not know *exactly* what they
> are asking for or have not considered some difficult locking case
> in a filesystem.

Yup, most kernel developers are aware that this is how the mailing
list discussions appear from the outside.

Unfortunately, too many people think they have expert knowledge when
they don't (the Dunning-Kruger cognitive bias), and so they simply
can't understand an expert response that points out problems 5 or 6
steps further along the logic chain and assumes the reader knows
that chain intimately. They don't, so they think they've been shut
down.  It's closer to the truth that they've suddenly been made aware
of how little they know about the topic under discussion and they
don't know how to react.  At this point, we see a classic
fight-or-flight reflex response, and then the person either runs away
scared or creates heat and light....

Occasionally, someone comes back with "hey, I don't quite understand
that, can you explain it more/differently", and then I will try to
explain it as best I can in the time I have. History has proven
that most kernel developers react in this way, including Christoph.
And when this happens, we often end up with a new contributor....

As Confucius says: "Real knowledge is to know the extent of one's
ignorance."

[ I talk about this whole issue in more detail mid-way through this
LCA 2015 presentation: https://www.youtube.com/watch?v=VpuVDfSXs-g ]

This is why we tend to ask people for their problem descriptions,
requirements and constraints, rather than expecting them to explain
how they think their problem needs to be solved. We do not
expect anyone other than the regular kernel developers to understand
the deep, dark details needed to craft a workable solution. If they do,
great.  If they don't, then don't get upset or angry about it.
Instead, while one may not like or fully understand the answer that
is given, accept the answers that are given and respect that there
is usually a very good reason that answer was given.

> I'm sure it's not intentional, but that's how it
> can come across.

Yup, and there's really very little we can do about it. It's one of
the consequences of having experts hang around in public....

> That said, I don't want to derail their thread any further with
> this. So I apologise for my tone to you, Dave.

Accepted. Let's start over, eh?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: DAX mapping detection (was: Re: [PATCH] Fix region lost in /proc/self/smaps)
  2016-09-15 10:32                       ` Dave Chinner
@ 2016-09-15 11:42                         ` Nicholas Piggin
  2016-09-15 22:33                           ` Dave Chinner
  0 siblings, 1 reply; 38+ messages in thread
From: Nicholas Piggin @ 2016-09-15 11:42 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christoph Hellwig, Oliver O'Halloran, Yumei Huang,
	Michal Hocko, Xiao Guangrong, Andrew Morton, KVM list, Linux MM,
	Gleb Natapov, linux-nvdimm@lists.01.org, mtosatti, linux-kernel,
	Dave Hansen, Stefan Hajnoczi, linux-fsdevel, Paolo Bonzini

On Thu, 15 Sep 2016 20:32:10 +1000
Dave Chinner <david@fromorbit.com> wrote:

> On Thu, Sep 15, 2016 at 01:49:45PM +1000, Nicholas Piggin wrote:
> > On Thu, 15 Sep 2016 12:31:33 +1000
> > Dave Chinner <david@fromorbit.com> wrote:
> >   
> > > On Wed, Sep 14, 2016 at 08:19:36PM +1000, Nicholas Piggin wrote:  
> > > > On Wed, 14 Sep 2016 17:39:02 +1000  
> > > Sure, but one first has to describe the feature desired before all  
> > 
> > The DAX people have been.  
> 
> Hmmmm. the only "DAX people" I know of are kernel developers who
> have been working on implementing DAX in the kernel - Willy, Ross,
> Dan, Jan, Christoph, Kirill, myelf and a few others around the
> fringes.
> 
> > They want to be able to get mappings
> > that can be synced without doing fsync.  
> 
> Oh, ok, the Intel Userspace PMEM library requirement. I thought you
> had something more than this - whatever extra problem the per-block
> no fsync flag would solve?

Only the PMEM really. I don't want to add more complexity than
required.

> > The *exact* extent of
> > those capabilities and what the API exactly looks like is up for
> > discussion.  
> 
> Yup.
> 
> > Well you said it was impossible already and Christoph told them
> > they were smoking crack :)  
> 
> I have not said that. I have said bad things about bad
> proposals and called the PMEM library model broken, but I most
> definitely have not said that solving the problem is impossible.
>
> > > That's not an answer to the questions I asked about about the "no
> > > sync" flag you were proposing. You've redirected to the a different
> > > solution, one that ....  
> > 
> > No sync flag would do the same thing exactly in terms of consistency.
> > It would just do the no-sync sequence by default rather than being
> > asked for it. More of an API detail than implementation.  
> 
> You still haven't described anything about what a per-block flag
> design is supposed to look like.... :/

For the API, or implementation? I'm not quite sure what you mean
here. For implementation it's possible to carefully ensure metadata
is persistent when allocating blocks in page fault but before
mapping pages. Truncate or hole punch or such things can be made to
work by invalidating all such mappings and holding them off until
you can cope with them again. Not necessarily for a filesystem with
*all* capabilities of XFS -- I don't know -- but for a complete basic
one.

> > > > Filesystem will
> > > > invalidate all such mappings before it does buffered IOs or hole punch,
> > > > and will sync metadata after allocating a new block but before returning
> > > > from a fault.    
> > > 
> > > ... requires synchronous metadata updates from page fault context,
> > > which we already know is not a good solution.  I'll quote one of
> > > Christoph's previous replies to save me the trouble:
> > > 
> > > 	"You could write all metadata synchronously from the page
> > > 	fault handler, but that's basically asking for all kinds of
> > > 	deadlocks."
> > > So, let's redirect back to the "no sync" flag you were talking about
> > > - can you answer the questions I asked above? It would be especially
> > > important to highlight how the proposed feature would avoid requiring
> > > synchronous metadata updates in page fault contexts....  
> > 
> > Right. So what deadlocks are you concerned about?  
> 
> It basically puts the entire journal checkpoint path under a page
> fault context. i.e. a whole new global locking context problem is

Yes, there are potentially some new lock orderings created if you
do that, depending on what locks the filesystem takes.

> created as this path can now be run both inside and outside the
> mmap_sem. Nothing ever good comes from running filesystem locking
> code both inside and outside the mmap_sem.

You mean that in some cases the journal checkpoint runs with mmap_sem
held, and in others without mmap_sem held? Not that mmap_sem is taken
inside the journal checkpoint. Then I don't really see why that's a
problem. I mean performance could suffer a bit, but with fault
retry you can almost always do the syncing outside mmap_sem in
practice.

Yes, I'll preemptively agree with you -- We don't want to add any
such burden if it is not needed and well justified.

> FWIW, We've never executed synchronous transactions inside page
> faults in XFS, and I think ext4 is in the same boat - it may be even
> worse because of the way it does ordered data dispatch through the
> journal. I don't really even want to think about the level of hurt
> this might put btrfs or other COW/log structured filesystems under.
> I'm sure Christoph can reel off a bunch more issues off the top of
> his head....

I asked him, we'll see what he thinks. I don't believe there is
anything fundamental about the mm or fs core layers that causes
deadlocks though.

> 
> > There could be a scale of capabilities here, for different filesystems
> > that do things differently.   
> 
> Why do we need such complexity to be defined?
> 
> I'm tending towards just adding a new fallocate() operation that sets
> up a fully allocated and zeroed file of fixed length that has
> immutable metadata once created. Single syscall, with well defined
> semantics, and it doesn't dictate the implementation any filesystem
> must use. All it dictates is that the data region can be written
> safely on dax-enabled storage without needing fsync() to be issued.
> 
> i.e. the implementation can be filesystem specific, and it is simple
> to implement the basic functionality and constraints in both XFS and
> ext4 right away, and as other filesystems come along they can
> implement it in the way that best suits them. e.g. btrfs would need
> to add the "no COW" flag to the file as well.
> 
> If someone wants to implement a per-block no-fsync flag, and do
> synchronous metadata updates in the page fault path, then they are
> welcome to do so. But we don't /need/ such complexity to implement
> the functionality that the pmem programming model requires.

Sure. That's about what I meant by scale of capabilities. And beyond
that I would be as happy as you if that was sufficient. It raises
the bar for justifying more complexity, which is always good.


> > Some filesystems could require fsync for metadata, but allow fdatasync
> > to be skipped. Users would need to have some knowledge of block size
> > or do preallocation and sync.  
> 
> Not sure what you mean here -  avoiding the need for using fsync()
> by using fsync() seems a little circular to me.  :/

I meant the case where blocks are already preallocated and the metadata
is unchanging, basically like your proposal above.

> 
> > That might put more burden on libraries/applications if there are
> > concurrent operations, but that might be something they can deal with
> > -- fdatasync already requires some knowledge of concurrent operations
> > (or lack thereof).  
> 
> Additional userspace complexity is something we should avoid.
> 
> 
> > You and Christoph know a huge amount about vfs and filesystems.
> > But sometimes you shut people down prematurely.  
> 
> Appearances can be deceiving.  I don't shut discussions down unless
> my time is being wasted, and that's pretty rare.
> 
> [You probably know most of what I'm about to write, but I'm not
> actually writing it for you.... ]
> 
> > It can be very
> > intimidating for someone who might not know *exactly* what they
> > are asking for or have not considered some difficult locking case
> > in a filesystem.  
> 
> Yup, most kernel developers are aware that this is how the mailing
> list discussions appear from the outside.

[...]

Point taken and I don't want to harp on about it. You guys can
still be intimidating to good kernel programmers who otherwise
don't know vfs and several filesystems inside out, that's all.


> > That said, I don't want to derail their thread any further with
> > this. So I apologise for my tone to you, Dave.  
> 
> Accepted. Let's start over, eh?

That would be good.

Thanks,
Nick

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: DAX mapping detection (was: Re: [PATCH] Fix region lost in /proc/self/smaps)
  2016-09-15 11:42                         ` Nicholas Piggin
@ 2016-09-15 22:33                           ` Dave Chinner
  2016-09-16  5:54                             ` Nicholas Piggin
  0 siblings, 1 reply; 38+ messages in thread
From: Dave Chinner @ 2016-09-15 22:33 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: Christoph Hellwig, Oliver O'Halloran, Yumei Huang,
	Michal Hocko, Xiao Guangrong, Andrew Morton, KVM list, Linux MM,
	Gleb Natapov, linux-nvdimm@lists.01.org, mtosatti, linux-kernel,
	Dave Hansen, Stefan Hajnoczi, linux-fsdevel, Paolo Bonzini

On Thu, Sep 15, 2016 at 09:42:22PM +1000, Nicholas Piggin wrote:
> On Thu, 15 Sep 2016 20:32:10 +1000
> Dave Chinner <david@fromorbit.com> wrote:
> > 
> > You still haven't described anything about what a per-block flag
> > design is supposed to look like.... :/
> 
> For the API, or implementation? I'm not quite sure what you mean
> here. For implementation it's possible to carefully ensure metadata
> is persistent when allocating blocks in page fault but before
> mapping pages. Truncate or hole punch or such things can be made to
> work by invalidating all such mappings and holding them off until
> you can cope with them again. Not necessarily for a filesystem with
> *all* capabilities of XFS -- I don't know -- but for a complete basic
> one.

So, essentially, it comes down to synchronous metadata updates again,
but synchronous updates would be conditional on whether an extent
metadata with the "nofsync" flag asserted was updated? Where's the
nofsync flag kept? in memory at a generic layer, or in the
filesystem, potentially in an on-disk structure? How would the
application set it for a given range?

> > > > > Filesystem will
> > > > > invalidate all such mappings before it does buffered IOs or hole punch,
> > > > > and will sync metadata after allocating a new block but before returning
> > > > > from a fault.    
> > > > 
> > > > ... requires synchronous metadata updates from page fault context,
> > > > which we already know is not a good solution.  I'll quote one of
> > > > Christoph's previous replies to save me the trouble:
> > > > 
> > > > 	"You could write all metadata synchronously from the page
> > > > 	fault handler, but that's basically asking for all kinds of
> > > > 	deadlocks."
> > > > So, let's redirect back to the "no sync" flag you were talking about
> > > > - can you answer the questions I asked above? It would be especially
> > > > important to highlight how the proposed feature would avoid requiring
> > > > synchronous metadata updates in page fault contexts....  
> > > 
> > > Right. So what deadlocks are you concerned about?  
> > 
> > It basically puts the entire journal checkpoint path under a page
> > fault context. i.e. a whole new global locking context problem is
> 
> Yes there are potentially some new lock orderings created if you
> do that, depending on what locks the filesystem does.

Well, that's the whole issue.

> > created as this path can now be run both inside and outside the
> > mmap_sem. Nothing ever good comes from running filesystem locking
> > code both inside and outside the mmap_sem.
> 
> You mean that some cases journal checkpoint runs with mmap_sem
> held, and others without mmap_sem held? Not that mmap_sem is taken
> inside journal checkpoint.

Maybe not, but now we open up the potential for locks held inside
or outside mmap_sem to interact with the journal locks that are now
held inside and outside mmap_sem. See below....

> Then I don't really see why that's a
> problem. I mean performance could suffer a bit, but with fault
> retry you can almost always do the syncing outside mmap_sem in
> practice.
> 
> Yes, I'll preemptively agree with you -- We don't want to add any
> such burden if it is not needed and well justified.
> 
> > FWIW, We've never executed synchronous transactions inside page
> > faults in XFS, and I think ext4 is in the same boat - it may be even
> > worse because of the way it does ordered data dispatch through the
> > journal. I don't really even want to think about the level of hurt
> > this might put btrfs or other COW/log structured filesystems under.
> > I'm sure Christoph can reel off a bunch more issues off the top of
> > his head....
> 
> I asked him, we'll see what he thinks. I don't believe there is
> anything fundamental about the mm or fs core layers that causes
> deadlocks though.

Spent 5 minutes looking at ext4 for an example: filesystems are
allowed to take page locks during transaction commit, e.g. the ext4
journal commit when using the default ordered data mode:

jbd2_journal_commit_transaction
  journal_submit_data_buffers()
    journal_submit_inode_data_buffers
      generic_writepages()
        ext4_writepages()
	  mpage_prepare_extent_to_map()
	    lock_page()

i.e. if we fault on the user buffer during a write() operation and
that user buffer is a mmapped DAX file that needs to be allocated
and we have synchronous metadata updates during page faults, we
deadlock on the page lock held above the page fault context...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: DAX mapping detection (was: Re: [PATCH] Fix region lost in /proc/self/smaps)
  2016-09-15 22:33                           ` Dave Chinner
@ 2016-09-16  5:54                             ` Nicholas Piggin
  2016-12-19 21:11                               ` Ross Zwisler
  0 siblings, 1 reply; 38+ messages in thread
From: Nicholas Piggin @ 2016-09-16  5:54 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christoph Hellwig, Oliver O'Halloran, Yumei Huang,
	Michal Hocko, Xiao Guangrong, Andrew Morton, KVM list, Linux MM,
	Gleb Natapov, linux-nvdimm@lists.01.org, mtosatti, linux-kernel,
	Dave Hansen, Stefan Hajnoczi, linux-fsdevel, Paolo Bonzini

On Fri, 16 Sep 2016 08:33:50 +1000
Dave Chinner <david@fromorbit.com> wrote:

> On Thu, Sep 15, 2016 at 09:42:22PM +1000, Nicholas Piggin wrote:
> > On Thu, 15 Sep 2016 20:32:10 +1000
> > Dave Chinner <david@fromorbit.com> wrote:  
> > > 
> > > You still haven't described anything about what a per-block flag
> > > design is supposed to look like.... :/  
> > 
> > For the API, or implementation? I'm not quite sure what you mean
> > here. For implementation it's possible to carefully ensure metadata
> > is persistent when allocating blocks in page fault but before
> > mapping pages. Truncate or hole punch or such things can be made to
> > work by invalidating all such mappings and holding them off until
> > you can cope with them again. Not necessarily for a filesystem with
> > *all* capabilities of XFS -- I don't know -- but for a complete basic
> > one.  
> 
> So, essentially, it comes down to synchronous metadata updates again.

Yes. I guess fundamentally you can't get away from either that or
preloading at some level.

(Also I don't know that there's a sane way to handle [cm]time properly,
so some things like that -- this is just about block allocation /
avoiding fdatasync).

> but synchronous updates would be conditional on whether an extent
> metadata with the "nofsync" flag asserted was updated? Where's the
> nofsync flag kept? in memory at a generic layer, or in the
> filesystem, potentially in an on-disk structure? How would the
> application set it for a given range?

I guess that comes back to the API: whether you want it to be persistent,
request based, etc. It could be derived from the type of storage blocks
that are mapped there, stored per-inode, in memory, or on extents on disk.
I'm not advocating for a particular API and of course less complexity is
better.

> 
> > > > > > Filesystem will
> > > > > > invalidate all such mappings before it does buffered IOs or hole punch,
> > > > > > and will sync metadata after allocating a new block but before returning
> > > > > > from a fault.      
> > > > > 
> > > > > ... requires synchronous metadata updates from page fault context,
> > > > > which we already know is not a good solution.  I'll quote one of
> > > > > Christoph's previous replies to save me the trouble:
> > > > > 
> > > > > 	"You could write all metadata synchronously from the page
> > > > > 	fault handler, but that's basically asking for all kinds of
> > > > > 	deadlocks."
> > > > > So, let's redirect back to the "no sync" flag you were talking about
> > > > > - can you answer the questions I asked above? It would be especially
> > > > > important to highlight how the proposed feature would avoid requiring
> > > > > synchronous metadata updates in page fault contexts....    
> > > > 
> > > > Right. So what deadlocks are you concerned about?    
> > > 
> > > It basically puts the entire journal checkpoint path under a page
> > > fault context. i.e. a whole new global locking context problem is  
> > 
> > Yes there are potentially some new lock orderings created if you
> > do that, depending on what locks the filesystem does.  
> 
> Well, that's the whole issue.

For filesystem implementations, but perhaps not the mm/vfs implementation,
AFAICS.

> 
> > > created as this path can now be run both inside and outside the
> > > mmap_sem. Nothing ever good comes from running filesystem locking
> > > code both inside and outside the mmap_sem.  
> > 
> > You mean that some cases journal checkpoint runs with mmap_sem
> > held, and others without mmap_sem held? Not that mmap_sem is taken
> > inside journal checkpoint.  
> 
> Maybe not, but now we open up the potential for locks held inside
> or outside mmap sem to interact with the journal locks that are now
> held inside and outside mmap_sem. See below....
> 
> > Then I don't really see why that's a
> > problem. I mean performance could suffer a bit, but with fault
> > retry you can almost always do the syncing outside mmap_sem in
> > practice.
> > 
> > Yes, I'll preemptively agree with you -- We don't want to add any
> > such burden if it is not needed and well justified.
> >   
> > > FWIW, We've never executed synchronous transactions inside page
> > > faults in XFS, and I think ext4 is in the same boat - it may be even
> > > worse because of the way it does ordered data dispatch through the
> > > journal. I don't really even want to think about the level of hurt
> > > this might put btrfs or other COW/log structured filesystems under.
> > > I'm sure Christoph can reel off a bunch more issues off the top of
> > > his head....  
> > 
> > I asked him, we'll see what he thinks. I don't beleive there is
> > anything fundamental about mm or fs core layers that cause deadlocks
> > though.  
> 
> Spent 5 minutes looking at ext4 for an example: filesystems are
> allowed to take page locks during transaction commit. e.g ext4
> journal commit when using the default ordered data mode:
> 
> jbd2_journal_commit_transaction
>   journal_submit_data_buffers()
>     journal_submit_inode_data_buffers
>       generic_writepages()
>         ext4_writepages()
> 	  mpage_prepare_extent_to_map()
> 	    lock_page()
> 
> i.e. if we fault on the user buffer during a write() operation and
> that user buffer is a mmapped DAX file that needs to be allocated
> and we have synchronous metadata updates during page faults, we
> deadlock on the page lock held above the page fault context...

Yeah, the page lock is probably a bigger issue for filesystems than
mmap_sem, but that is still a filesystem implementation detail. Again,
I'm not suggesting you could just switch all filesystems today to do a
metadata sync with mmap_sem and the page lock held, only that there
are no fundamental deadlocks enforced by the mm/vfs.

Filesystems already take metadata page locks in the read path while
holding a data page lock, so there has long been some amount of page
lock nesting.

It would also be possible to change the page fault handler to allow a
sync without holding the page lock, if it came to that. But I don't
want to go too far into implementation handwaving before it's even
established that this would be worthwhile.

Definitely the first step would be your simple preallocated per
inode approach until it is shown to be insufficient.
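
To make that concrete, a minimal userspace sketch of the preallocated
approach (the /mnt/pmem path and 1 GiB size are assumptions, and error
handling is trimmed): allocate and fsync every block before mmap(), so
DAX faults only install mappings to blocks that already exist. Note that
plain fallocate leaves unwritten extents behind, which is part of why a
zeroing allocation mode comes up later in this thread.

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	size_t len = 1UL << 30;		/* assumed 1 GiB working set */
	int fd = open("/mnt/pmem/data", O_CREAT | O_RDWR, 0644);

	if (fd < 0)
		return 1;
	/* Allocate all blocks up front instead of lazily at fault time. */
	if (posix_fallocate(fd, 0, len))
		return 1;
	/* Persist the allocation metadata before any mmap'd store. */
	if (fsync(fd))
		return 1;

	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		return 1;

	p[42] = 1;	/* fault maps an already-allocated block */
	return 0;
}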

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: DAX mapping detection (was: Re: [PATCH] Fix region lost in /proc/self/smaps)
  2016-09-16  5:54                             ` Nicholas Piggin
@ 2016-12-19 21:11                               ` Ross Zwisler
  2016-12-20  1:09                                 ` Darrick J. Wong
  0 siblings, 1 reply; 38+ messages in thread
From: Ross Zwisler @ 2016-12-19 21:11 UTC (permalink / raw)
  To: Nicholas Piggin, Dave Chinner, Jan Kara, Jeff Moyer, Darrick J. Wong
  Cc: Dave Chinner, Yumei Huang, Michal Hocko, Xiao Guangrong,
	KVM list, Dave Hansen, Gleb Natapov, linux-nvdimm@lists.01.org,
	mtosatti, linux-kernel, Christoph Hellwig, Linux MM,
	Stefan Hajnoczi, linux-fsdevel, Paolo Bonzini, Andrew Morton

On Fri, Sep 16, 2016 at 03:54:05PM +1000, Nicholas Piggin wrote:
<>
> Definitely the first step would be your simple preallocated per
> inode approach until it is shown to be insufficient.

Reviving this thread a few months later...

Dave, we're interested in taking a serious look at what it would take to get
PMEM_IMMUTABLE working.  Do you still hold the opinion that this is (or could
become, with some amount of work) a workable solution?

We're happy to do the grunt work for this feature, but we will probably need
guidance from someone with more XFS experience.  With you out on extended leave
the first half of 2017, who would be the best person to ask for this guidance?
Darrick?

Thanks,
- Ross

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: DAX mapping detection (was: Re: [PATCH] Fix region lost in /proc/self/smaps)
  2016-12-19 21:11                               ` Ross Zwisler
@ 2016-12-20  1:09                                 ` Darrick J. Wong
  2016-12-20  1:18                                   ` Dan Williams
  0 siblings, 1 reply; 38+ messages in thread
From: Darrick J. Wong @ 2016-12-20  1:09 UTC (permalink / raw)
  To: Ross Zwisler, Nicholas Piggin, Dave Chinner, Jan Kara,
	Jeff Moyer, Yumei Huang, Michal Hocko, Xiao Guangrong, KVM list,
	Dave Hansen, Gleb Natapov, linux-nvdimm@lists.01.org, mtosatti,
	linux-kernel, Christoph Hellwig, Linux MM, Stefan Hajnoczi,
	linux-fsdevel, Paolo Bonzini, Andrew Morton

On Mon, Dec 19, 2016 at 02:11:49PM -0700, Ross Zwisler wrote:
> On Fri, Sep 16, 2016 at 03:54:05PM +1000, Nicholas Piggin wrote:
> <>
> > Definitely the first step would be your simple preallocated per
> > inode approach until it is shown to be insufficient.
> 
> Reviving this thread a few months later...
> 
> Dave, we're interested in taking a serious look at what it would take to get
> PMEM_IMMUTABLE working.  Do you still hold the opinion that this is (or could
> become, with some amount of work) a workable solution?
> 
> We're happy to do the grunt work for this feature, but we will probably need
> guidance from someone with more XFS experience.  With you out on extended leave
> the first half of 2017, who would be the best person to ask for this guidance?
> Darrick?

Yes, probably. :)

I think where we left off with this (on the XFS side) is some sort of
fallocate mode that would allocate blocks, zero them, and then set the
DAX and PMEM_IMMUTABLE on-disk inode flags.  After that, you'd mmap the
file and thereby gain the ability to control write persistence behavior
without having to worry about fs metadata updates.  As an added plus, I
think zeroing the pmem also clears media errors, or something like that.

<shrug> Is that a reasonable starting point?  My memory is a little foggy.

Hmm, I see Dan just posted something about blockdev fallocate.  I'll go
read that.

--D

> 
> Thanks,
> - Ross

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: DAX mapping detection (was: Re: [PATCH] Fix region lost in /proc/self/smaps)
  2016-12-20  1:09                                 ` Darrick J. Wong
@ 2016-12-20  1:18                                   ` Dan Williams
  2016-12-21  0:40                                     ` Darrick J. Wong
  0 siblings, 1 reply; 38+ messages in thread
From: Dan Williams @ 2016-12-20  1:18 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Ross Zwisler, Nicholas Piggin, Dave Chinner, Jan Kara,
	Jeff Moyer, Yumei Huang, Michal Hocko, Xiao Guangrong, KVM list,
	Dave Hansen, Gleb Natapov, linux-nvdimm@lists.01.org, mtosatti,
	linux-kernel, Christoph Hellwig, Linux MM, Stefan Hajnoczi,
	linux-fsdevel, Paolo Bonzini, Andrew Morton

On Mon, Dec 19, 2016 at 5:09 PM, Darrick J. Wong
<darrick.wong@oracle.com> wrote:
> On Mon, Dec 19, 2016 at 02:11:49PM -0700, Ross Zwisler wrote:
>> On Fri, Sep 16, 2016 at 03:54:05PM +1000, Nicholas Piggin wrote:
>> <>
>> > Definitely the first step would be your simple preallocated per
>> > inode approach until it is shown to be insufficient.
>>
>> Reviving this thread a few months later...
>>
>> Dave, we're interested in taking a serious look at what it would take to get
>> PMEM_IMMUTABLE working.  Do you still hold the opinion that this is (or could
>> become, with some amount of work) a workable solution?
>>
>> We're happy to do the grunt work for this feature, but we will probably need
>> guidance from someone with more XFS experience.  With you out on extended leave
>> the first half of 2017, who would be the best person to ask for this guidance?
>> Darrick?
>
> Yes, probably. :)
>
> I think where we left off with this (on the XFS side) is some sort of
> fallocate mode that would allocate blocks, zero them, and then set the
> DAX and PMEM_IMMUTABLE on-disk inode flags.  After that, you'd mmap the
> file and thereby gain the ability to control write persistence behavior
> without having to worry about fs metadata updates.  As an added plus, I
> think zeroing the pmem also clears media errors, or something like that.
>
> <shrug> Is that a reasonable starting point?  My memory is a little foggy.
>
> Hmm, I see Dan just posted something about blockdev fallocate.  I'll go
> read that.

That's for device-dax, which is basically a poor man's PMEM_IMMUTABLE
via a character device interface. It's useful for cases where you want
an entire nvdimm namespace/volume in "no fs-metadata to worry about"
mode.  But, for sub-allocations of a namespace and support for
existing tooling, PMEM_IMMUTABLE is much more usable.
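
For reference, the device-dax usage model is roughly the following sketch
(the /dev/dax0.0 path and 2 MiB alignment are assumptions that depend on
the namespace configuration, and the flush is illustrative only; real code
would use clwb/clflushopt with proper fencing, or a library such as
libpmem):

#include <fcntl.h>
#include <sys/mman.h>

#define DAX_ALIGN	(2UL << 20)	/* assumed device-dax alignment */

int main(void)
{
	int fd = open("/dev/dax0.0", O_RDWR);	/* whole namespace, no fs */
	size_t len = 4 * DAX_ALIGN;		/* length must stay aligned */

	if (fd < 0)
		return 1;
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		return 1;

	p[0] = 42;	/* store lands in pmem directly, no page cache */
	/* Illustrative flush; production code would use clwb/clflushopt
	 * plus a fence, or libpmem's pmem_persist(). */
	asm volatile("clflush %0" : "+m" (p[0]));
	return 0;
}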

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: DAX mapping detection (was: Re: [PATCH] Fix region lost in /proc/self/smaps)
  2016-12-20  1:18                                   ` Dan Williams
@ 2016-12-21  0:40                                     ` Darrick J. Wong
  2016-12-21 16:53                                       ` Dan Williams
  0 siblings, 1 reply; 38+ messages in thread
From: Darrick J. Wong @ 2016-12-21  0:40 UTC (permalink / raw)
  To: Dan Williams
  Cc: Ross Zwisler, Nicholas Piggin, Dave Chinner, Jan Kara,
	Jeff Moyer, Yumei Huang, Michal Hocko, Xiao Guangrong, KVM list,
	Dave Hansen, Gleb Natapov, linux-nvdimm@lists.01.org, mtosatti,
	linux-kernel, Christoph Hellwig, Linux MM, Stefan Hajnoczi,
	linux-fsdevel, Paolo Bonzini, Andrew Morton

On Mon, Dec 19, 2016 at 05:18:40PM -0800, Dan Williams wrote:
> On Mon, Dec 19, 2016 at 5:09 PM, Darrick J. Wong
> <darrick.wong@oracle.com> wrote:
> > On Mon, Dec 19, 2016 at 02:11:49PM -0700, Ross Zwisler wrote:
> >> On Fri, Sep 16, 2016 at 03:54:05PM +1000, Nicholas Piggin wrote:
> >> <>
> >> > Definitely the first step would be your simple preallocated per
> >> > inode approach until it is shown to be insufficient.
> >>
> >> Reviving this thread a few months later...
> >>
> >> Dave, we're interested in taking a serious look at what it would take to get
> >> PMEM_IMMUTABLE working.  Do you still hold the opinion that this is (or could
> >> become, with some amount of work) a workable solution?
> >>
> >> We're happy to do the grunt work for this feature, but we will probably need
> >> guidance from someone with more XFS experience.  With you out on extended leave
> >> the first half of 2017, who would be the best person to ask for this guidance?
> >> Darrick?
> >
> > Yes, probably. :)
> >
> > I think where we left off with this (on the XFS side) is some sort of
> > fallocate mode that would allocate blocks, zero them, and then set the
> > DAX and PMEM_IMMUTABLE on-disk inode flags.  After that, you'd mmap the
> > file and thereby gain the ability to control write persistence behavior
> > without having to worry about fs metadata updates.  As an added plus, I
> > think zeroing the pmem also clears media errors, or something like that.
> >
> > <shrug> Is that a reasonable starting point?  My memory is a little foggy.
> >
> > Hmm, I see Dan just posted something about blockdev fallocate.  I'll go
> > read that.
> 
> That's for device-dax, which is basically a poor man's PMEM_IMMUTABLE
> via a character device interface. It's useful for cases where you want
> an entire nvdimm namespace/volume in "no fs-metadata to worry about"
> mode.  But, for sub-allocations of a namespace and support for
> existing tooling, PMEM_IMMUTABLE is much more usable.

Well sure... but otoh I was thinking that it'd be pretty neat if we
could use the same code regardless of whether the target file was a
dax-device or an xfs file:

fd = open("<some path>", O_RDWR);
fstat(fd, &statbuf);
fallocate(fd, FALLOC_FL_PMEM_IMMUTABLE, 0, statbuf.st_size);
p = mmap(NULL, statbuf.st_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

*(p + 42) = 0xDEADBEEF;
asm { clflush; } /* or whatever */

...so perhaps it would be a good idea to design the fallocate primitive
around "prepare this fd for mmap-only pmem semantics" and let it the
backend do zeroing and inode flag changes as necessary to make it
happen.  We'd need to do some bikeshedding about what the other falloc
flags mean when we're dealing with pmem files and devices, but I think
we should try to keep the userland presentation the same unless there's
a really good reason not to.
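
As a footnote on the "/* or whatever */" flush above: one way to spell
that persist step from userspace is libpmem's pmem_persist(), which picks
clflush/clflushopt/clwb plus the right fencing per platform. A minimal
sketch, assuming the mapping has already been established as DAX (which is
what this thread is trying to make detectable):

#include <libpmem.h>

static void store_and_persist(char *p)
{
	p[42] = 0xDE;			/* the store we want durable */
	pmem_persist(&p[42], 1);	/* flush + drain as the platform requires */
}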

--D

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: DAX mapping detection (was: Re: [PATCH] Fix region lost in /proc/self/smaps)
  2016-12-21  0:40                                     ` Darrick J. Wong
@ 2016-12-21 16:53                                       ` Dan Williams
  2016-12-21 21:24                                         ` Dave Chinner
  0 siblings, 1 reply; 38+ messages in thread
From: Dan Williams @ 2016-12-21 16:53 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Ross Zwisler, Nicholas Piggin, Dave Chinner, Jan Kara,
	Jeff Moyer, Yumei Huang, Michal Hocko, Xiao Guangrong, KVM list,
	Dave Hansen, Gleb Natapov, linux-nvdimm@lists.01.org, mtosatti,
	linux-kernel, Christoph Hellwig, Linux MM, Stefan Hajnoczi,
	linux-fsdevel, Paolo Bonzini, Andrew Morton

On Tue, Dec 20, 2016 at 4:40 PM, Darrick J. Wong
<darrick.wong@oracle.com> wrote:
> On Mon, Dec 19, 2016 at 05:18:40PM -0800, Dan Williams wrote:
>> On Mon, Dec 19, 2016 at 5:09 PM, Darrick J. Wong
>> <darrick.wong@oracle.com> wrote:
>> > On Mon, Dec 19, 2016 at 02:11:49PM -0700, Ross Zwisler wrote:
>> >> On Fri, Sep 16, 2016 at 03:54:05PM +1000, Nicholas Piggin wrote:
>> >> <>
>> >> > Definitely the first step would be your simple preallocated per
>> >> > inode approach until it is shown to be insufficient.
>> >>
>> >> Reviving this thread a few months later...
>> >>
>> >> Dave, we're interested in taking a serious look at what it would take to get
>> >> PMEM_IMMUTABLE working.  Do you still hold the opinion that this is (or could
>> >> become, with some amount of work) a workable solution?
>> >>
>> >> We're happy to do the grunt work for this feature, but we will probably need
>> >> guidance from someone with more XFS experience.  With you out on extended leave
>> >> the first half of 2017, who would be the best person to ask for this guidance?
>> >> Darrick?
>> >
>> > Yes, probably. :)
>> >
>> > I think where we left off with this (on the XFS side) is some sort of
>> > fallocate mode that would allocate blocks, zero them, and then set the
>> > DAX and PMEM_IMMUTABLE on-disk inode flags.  After that, you'd mmap the
>> > file and thereby gain the ability to control write persistence behavior
>> > without having to worry about fs metadata updates.  As an added plus, I
>> > think zeroing the pmem also clears media errors, or something like that.
>> >
>> > <shrug> Is that a reasonable starting point?  My memory is a little foggy.
>> >
>> > Hmm, I see Dan just posted something about blockdev fallocate.  I'll go
>> > read that.
>>
>> That's for device-dax, which is basically a poor man's PMEM_IMMUTABLE
>> via a character device interface. It's useful for cases where you want
>> an entire nvdimm namespace/volume in "no fs-metadata to worry about"
>> mode.  But, for sub-allocations of a namespace and support for
>> existing tooling, PMEM_IMMUTABLE is much more usable.
>
> Well sure... but otoh I was thinking that it'd be pretty neat if we
> could use the same code regardless of whether the target file was a
> dax-device or an xfs file:
>
> fd = open("<some path>", O_RDWR);
> fstat(fd, &statbuf):
> fallocate(fd, FALLOC_FL_PMEM_IMMUTABLE, 0, statbuf.st_size);
> p = mmap(NULL, statbuf.st_size, PROT_READ | PROT_WRITE, fd, 0);
>
> *(p + 42) = 0xDEADBEEF;
> asm { clflush; } /* or whatever */
>
> ...so perhaps it would be a good idea to design the fallocate primitive
> around "prepare this fd for mmap-only pmem semantics" and let it the
> backend do zeroing and inode flag changes as necessary to make it
> happen.  We'd need to do some bikeshedding about what the other falloc
> flags mean when we're dealing with pmem files and devices, but I think
> we should try to keep the userland presentation the same unless there's
> a really good reason not to.

It would be interesting to use fallocate to size device-dax files...

However, the fallocate userland presentation I was looking to unify
was the semantic for how to clear errors. The current method for
creating a device-dax instance is 'echo "<some size>" > "<some sysfs
file>"'. Christoph put the kibosh on carrying the
fallocate-error-clearing semantic over, so we'll need to fall back to
using the existing ND_IOCTL_CLEAR_ERROR. All that's missing for error
clearing is a secondary mechanism to report the media error list for a
physical address range.
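
One partial mechanism already exists: the generic block layer badblocks
list exported through sysfs. A small sketch of reading it (the pmem0 name
is an assumption, ranges are reported in 512-byte sectors, and it covers
the block device rather than an arbitrary physical address range, which is
exactly the gap above):

#include <stdio.h>

int main(void)
{
	/* Generic badblocks list for the assumed pmem0 namespace. */
	FILE *f = fopen("/sys/block/pmem0/badblocks", "r");
	unsigned long long sector, count;

	if (!f)
		return 1;
	while (fscanf(f, "%llu %llu", &sector, &count) == 2)
		printf("media error: sector %llu, %llu sectors\n",
		       sector, count);
	fclose(f);
	return 0;
}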

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: DAX mapping detection (was: Re: [PATCH] Fix region lost in /proc/self/smaps)
  2016-12-21 16:53                                       ` Dan Williams
@ 2016-12-21 21:24                                         ` Dave Chinner
  2016-12-21 21:33                                           ` Dan Williams
  0 siblings, 1 reply; 38+ messages in thread
From: Dave Chinner @ 2016-12-21 21:24 UTC (permalink / raw)
  To: Dan Williams
  Cc: Darrick J. Wong, Ross Zwisler, Nicholas Piggin, Jan Kara,
	Jeff Moyer, Yumei Huang, Michal Hocko, Xiao Guangrong, KVM list,
	Dave Hansen, Gleb Natapov, linux-nvdimm@lists.01.org, mtosatti,
	linux-kernel, Christoph Hellwig, Linux MM, Stefan Hajnoczi,
	linux-fsdevel, Paolo Bonzini, Andrew Morton

On Wed, Dec 21, 2016 at 08:53:46AM -0800, Dan Williams wrote:
> On Tue, Dec 20, 2016 at 4:40 PM, Darrick J. Wong
> <darrick.wong@oracle.com> wrote:
> > On Mon, Dec 19, 2016 at 05:18:40PM -0800, Dan Williams wrote:
> >> On Mon, Dec 19, 2016 at 5:09 PM, Darrick J. Wong
> >> <darrick.wong@oracle.com> wrote:
> >> > On Mon, Dec 19, 2016 at 02:11:49PM -0700, Ross Zwisler wrote:
> >> >> On Fri, Sep 16, 2016 at 03:54:05PM +1000, Nicholas Piggin wrote:
> >> >> <>
> >> >> > Definitely the first step would be your simple preallocated per
> >> >> > inode approach until it is shown to be insufficient.
> >> >>
> >> >> Reviving this thread a few months later...
> >> >>
> >> >> Dave, we're interested in taking a serious look at what it would take to get
> >> >> PMEM_IMMUTABLE working.  Do you still hold the opinion that this is (or could
> >> >> become, with some amount of work) a workable solution?
> >> >>
> >> >> We're happy to do the grunt work for this feature, but we will probably need
> >> >> guidance from someone with more XFS experience.  With you out on extended leave
> >> >> the first half of 2017, who would be the best person to ask for this guidance?
> >> >> Darrick?
> >> >
> >> > Yes, probably. :)
> >> >
> >> > I think where we left off with this (on the XFS side) is some sort of
> >> > fallocate mode that would allocate blocks, zero them, and then set the
> >> > DAX and PMEM_IMMUTABLE on-disk inode flags.  After that, you'd mmap the
> >> > file and thereby gain the ability to control write persistence behavior
> >> > without having to worry about fs metadata updates.  As an added plus, I
> >> > think zeroing the pmem also clears media errors, or something like that.
> >> >
> >> > <shrug> Is that a reasonable starting point?  My memory is a little foggy.
> >> >
> >> > Hmm, I see Dan just posted something about blockdev fallocate.  I'll go
> >> > read that.
> >>
> >> That's for device-dax, which is basically a poor man's PMEM_IMMUTABLE
> >> via a character device interface. It's useful for cases where you want
> >> an entire nvdimm namespace/volume in "no fs-metadata to worry about"
> >> mode.  But, for sub-allocations of a namespace and support for
> >> existing tooling, PMEM_IMMUTABLE is much more usable.
> >
> > Well sure... but otoh I was thinking that it'd be pretty neat if we
> > could use the same code regardless of whether the target file was a
> > dax-device or an xfs file:
> >
> > fd = open("<some path>", O_RDWR);
> > fstat(fd, &statbuf):
> > fallocate(fd, FALLOC_FL_PMEM_IMMUTABLE, 0, statbuf.st_size);
> > p = mmap(NULL, statbuf.st_size, PROT_READ | PROT_WRITE, fd, 0);
> >
> > *(p + 42) = 0xDEADBEEF;
> > asm { clflush; } /* or whatever */
> >
> > ...so perhaps it would be a good idea to design the fallocate primitive
> > around "prepare this fd for mmap-only pmem semantics" and let it the
> > backend do zeroing and inode flag changes as necessary to make it
> > happen.  We'd need to do some bikeshedding about what the other falloc
> > flags mean when we're dealing with pmem files and devices, but I think
> > we should try to keep the userland presentation the same unless there's
> > a really good reason not to.
> 
> It would be interesting to use fallocate to size device-dax files...

No. device-dax needs to die, not poison a bunch of existing file and
block device APIs and behaviours with special snowflakes.  Get
DAX-enabled filesystems to do what you need, and get rid of this
ugly, nasty hack.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: DAX mapping detection (was: Re: [PATCH] Fix region lost in /proc/self/smaps)
  2016-12-21 21:24                                         ` Dave Chinner
@ 2016-12-21 21:33                                           ` Dan Williams
  0 siblings, 0 replies; 38+ messages in thread
From: Dan Williams @ 2016-12-21 21:33 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Darrick J. Wong, Ross Zwisler, Nicholas Piggin, Jan Kara,
	Jeff Moyer, Yumei Huang, Michal Hocko, Xiao Guangrong, KVM list,
	Dave Hansen, Gleb Natapov, linux-nvdimm@lists.01.org, mtosatti,
	linux-kernel, Christoph Hellwig, Linux MM, Stefan Hajnoczi,
	linux-fsdevel, Paolo Bonzini, Andrew Morton

On Wed, Dec 21, 2016 at 1:24 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Wed, Dec 21, 2016 at 08:53:46AM -0800, Dan Williams wrote:
>> On Tue, Dec 20, 2016 at 4:40 PM, Darrick J. Wong
>> <darrick.wong@oracle.com> wrote:
>> > On Mon, Dec 19, 2016 at 05:18:40PM -0800, Dan Williams wrote:
>> >> On Mon, Dec 19, 2016 at 5:09 PM, Darrick J. Wong
>> >> <darrick.wong@oracle.com> wrote:
>> >> > On Mon, Dec 19, 2016 at 02:11:49PM -0700, Ross Zwisler wrote:
>> >> >> On Fri, Sep 16, 2016 at 03:54:05PM +1000, Nicholas Piggin wrote:
>> >> >> <>
>> >> >> > Definitely the first step would be your simple preallocated per
>> >> >> > inode approach until it is shown to be insufficient.
>> >> >>
>> >> >> Reviving this thread a few months later...
>> >> >>
>> >> >> Dave, we're interested in taking a serious look at what it would take to get
>> >> >> PMEM_IMMUTABLE working.  Do you still hold the opinion that this is (or could
>> >> >> become, with some amount of work) a workable solution?
>> >> >>
>> >> >> We're happy to do the grunt work for this feature, but we will probably need
>> >> >> guidance from someone with more XFS experience.  With you out on extended leave
>> >> >> the first half of 2017, who would be the best person to ask for this guidance?
>> >> >> Darrick?
>> >> >
>> >> > Yes, probably. :)
>> >> >
>> >> > I think where we left off with this (on the XFS side) is some sort of
>> >> > fallocate mode that would allocate blocks, zero them, and then set the
>> >> > DAX and PMEM_IMMUTABLE on-disk inode flags.  After that, you'd mmap the
>> >> > file and thereby gain the ability to control write persistence behavior
>> >> > without having to worry about fs metadata updates.  As an added plus, I
>> >> > think zeroing the pmem also clears media errors, or something like that.
>> >> >
>> >> > <shrug> Is that a reasonable starting point?  My memory is a little foggy.
>> >> >
>> >> > Hmm, I see Dan just posted something about blockdev fallocate.  I'll go
>> >> > read that.
>> >>
>> >> That's for device-dax, which is basically a poor man's PMEM_IMMUTABLE
>> >> via a character device interface. It's useful for cases where you want
>> >> an entire nvdimm namespace/volume in "no fs-metadata to worry about"
>> >> mode.  But, for sub-allocations of a namespace and support for
>> >> existing tooling, PMEM_IMMUTABLE is much more usable.
>> >
>> > Well sure... but otoh I was thinking that it'd be pretty neat if we
>> > could use the same code regardless of whether the target file was a
>> > dax-device or an xfs file:
>> >
>> > fd = open("<some path>", O_RDWR);
>> > fstat(fd, &statbuf):
>> > fallocate(fd, FALLOC_FL_PMEM_IMMUTABLE, 0, statbuf.st_size);
>> > p = mmap(NULL, statbuf.st_size, PROT_READ | PROT_WRITE, fd, 0);
>> >
>> > *(p + 42) = 0xDEADBEEF;
>> > asm { clflush; } /* or whatever */
>> >
>> > ...so perhaps it would be a good idea to design the fallocate primitive
>> > around "prepare this fd for mmap-only pmem semantics" and let it the
>> > backend do zeroing and inode flag changes as necessary to make it
>> > happen.  We'd need to do some bikeshedding about what the other falloc
>> > flags mean when we're dealing with pmem files and devices, but I think
>> > we should try to keep the userland presentation the same unless there's
>> > a really good reason not to.
>>
>> It would be interesting to use fallocate to size device-dax files...
>
> No. device-dax needs to die, not poison a bunch of existing file and
> block device APIs and behaviours with special snowflakes.  Get
> DAX-enabled filesystems to do what you need, and get rid of this
> ugly, nasty hack.
>

Right, Christoph already killed fallocate for device-dax.

What we're looking for now is the next level of detail on how to get
started on PMEM_IMMUTABLE, as Ross asked a few messages back in this
thread, so we have a reasonable replacement for device-dax.

^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2016-12-21 21:33 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-09-08  4:32 DAX mapping detection (was: Re: [PATCH] Fix region lost in /proc/self/smaps) Dan Williams
2016-09-08 22:56 ` Ross Zwisler
2016-09-08 23:04   ` Dan Williams
2016-09-09  8:55     ` Xiao Guangrong
2016-09-09 15:40       ` Dan Williams
2016-09-12  6:00         ` Xiao Guangrong
2016-09-12  3:44       ` Rudoff, Andy
2016-09-12  6:31         ` Xiao Guangrong
2016-09-12  1:40   ` Dave Chinner
2016-09-15  5:55     ` Darrick J. Wong
2016-09-15  6:25       ` Dave Chinner
2016-09-12  5:27   ` Christoph Hellwig
2016-09-12  7:25     ` Oliver O'Halloran
2016-09-12  7:51       ` Christoph Hellwig
2016-09-12  8:05         ` Nicholas Piggin
2016-09-12 15:01           ` Christoph Hellwig
2016-09-13  1:31             ` Nicholas Piggin
2016-09-13  4:06               ` Dan Williams
2016-09-13  5:40                 ` Nicholas Piggin
2016-09-12 21:34           ` Dave Chinner
2016-09-13  1:53             ` Nicholas Piggin
2016-09-13  7:17               ` Christoph Hellwig
2016-09-13  9:06                 ` Nicholas Piggin
2016-09-14  7:39               ` Dave Chinner
2016-09-14 10:19                 ` Nicholas Piggin
2016-09-15  2:31                   ` Dave Chinner
2016-09-15  3:49                     ` Nicholas Piggin
2016-09-15 10:32                       ` Dave Chinner
2016-09-15 11:42                         ` Nicholas Piggin
2016-09-15 22:33                           ` Dave Chinner
2016-09-16  5:54                             ` Nicholas Piggin
2016-12-19 21:11                               ` Ross Zwisler
2016-12-20  1:09                                 ` Darrick J. Wong
2016-12-20  1:18                                   ` Dan Williams
2016-12-21  0:40                                     ` Darrick J. Wong
2016-12-21 16:53                                       ` Dan Williams
2016-12-21 21:24                                         ` Dave Chinner
2016-12-21 21:33                                           ` Dan Williams
