* [RFC 0/2] New MAP_PMEM_AWARE mmap flag
@ 2016-02-21 17:03 Boaz Harrosh
  2016-02-21 17:04 ` [RFC 1/2] mmap: Define a new " Boaz Harrosh
                   ` (3 more replies)
  0 siblings, 4 replies; 70+ messages in thread
From: Boaz Harrosh @ 2016-02-21 17:03 UTC (permalink / raw)
  To: Dan Williams, Ross Zwisler, linux-nvdimm, Matthew Wilcox,
	Kirill A. Shutemov, Dave Chinner
  Cc: Oleg Nesterov, Mel Gorman, Johannes Weiner, linux-mm, Arnd Bergmann

Hi all

Recent DAX code fixed the cl_flushing, i.e. the durability, of mmap
access to direct persistent-memory from applications. It uses the
per-inode radix-tree to track the indexes of a file that were
page-faulted for write. Then at m/fsync time it cl_flushes these pages
and clears the radix-tree for the next round.

Sigh, that is life; for legacy applications this is the price we must
pay. But for NV-aware applications like the nvml library, we pay an
extra price even if we never actually call m/fsync. For these
applications the extra resources, and especially the extra radix-tree
locking per page-fault, cost a lot, like 3x a lot.

What we propose here is a way for those applications to enjoy the
boost without sacrificing any correctness for legacy applications.
Any concurrent access from legacy apps and NV-aware apps, even to the
same file or the same page, will work correctly.

We do that by defining a new mmap flag that is set by the NV-aware
app. The flag is carried by the VMA. In the DAX code we bypass any
radix-tree handling of the page if this flag is set. Pages accessed
*without* this flag will be added to the radix-tree; those accessed
with it will not. At m/fsync time, if the radix tree is empty, nothing
will happen.
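
For illustration, here is a minimal sketch of how an NV-aware application
would opt in (it assumes the MAP_PMEM_AWARE value proposed for asm-generic
in patch 1, and a made-up /mnt/dax path; on a kernel without these patches
the extra flag is simply ignored):

	#include <fcntl.h>
	#include <stdio.h>
	#include <sys/mman.h>
	#include <unistd.h>

	#ifndef MAP_PMEM_AWARE
	#define MAP_PMEM_AWARE 0x80000	/* value proposed in patch 1 */
	#endif

	int main(void)
	{
		int fd = open("/mnt/dax/file", O_RDWR);
		if (fd < 0) {
			perror("open");
			return 1;
		}

		/* Everything is a normal shared file mapping, plus the
		 * new flag telling DAX not to track dirty pages for us. */
		void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
			       MAP_SHARED | MAP_PMEM_AWARE, fd, 0);
		if (p == MAP_FAILED) {
			perror("mmap");
			return 1;
		}

		/* ... movnt stores + fences as done by nvml ... */

		munmap(p, 4096);
		close(fd);
		return 0;
	}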

These are very simple, non-intrusive patches with minimal risk (I think).
They are based on v4.5-rc5. If you need a rebase on any other tree please
say so.

Please consider this new flag for those of us who specialize in
persistent-memory setups and want to extract every possible bit of
mileage out of our systems.

Also attached for reference is a 3rd patch to the nvml library that
uses the new flag. Which brings me to the issue of persistent_memcpy /
persistent_flush: currently this library is for x86_64 only, using the
movnt instructions. The gcc compiler should have a per-arch facility
for durable memory accesses, so applications can be portable across
systems.
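
For reference, the x86_64-only flavor of such a durable copy looks roughly
like the sketch below (this is only an illustration of the movnt approach,
not nvml's actual code); a per-arch compiler builtin would let the same
thing be written portably:

	#include <emmintrin.h>	/* _mm_stream_si64, _mm_sfence (x86_64) */
	#include <stddef.h>
	#include <stdint.h>

	/* Durable copy of n 8-byte words: non-temporal stores bypass the
	 * CPU caches on their way to pmem, so no cl_flush is needed for
	 * the data and the kernel has nothing to track. */
	static void durable_copy64(uint64_t *dst, const uint64_t *src, size_t n)
	{
		size_t i;

		for (i = 0; i < n; i++)
			_mm_stream_si64((long long *)&dst[i], (long long)src[i]);

		_mm_sfence();	/* fence the non-temporal stores */
	}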

Please advise?

list of patches:
[RFC 1/2] mmap: Define a new MAP_PMEM_AWARE mmap flag
[RFC 2/2] REVIEWME: dax: Support MAP_PMEM_AWARE for optimal

	Two Kernel patches

[RFC 1/1] util: add pmem-aware flag to mmap

	A patch for the nvml library

Thanks
Boaz


* [RFC 1/2] mmap: Define a new MAP_PMEM_AWARE mmap flag
  2016-02-21 17:03 [RFC 0/2] New MAP_PMEM_AWARE mmap flag Boaz Harrosh
@ 2016-02-21 17:04 ` Boaz Harrosh
  2016-02-21 17:06 ` [RFC 2/2] dax: Support " Boaz Harrosh
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 70+ messages in thread
From: Boaz Harrosh @ 2016-02-21 17:04 UTC (permalink / raw)
  To: Dan Williams, Ross Zwisler, linux-nvdimm, Matthew Wilcox,
	Kirill A. Shutemov, Dave Chinner
  Cc: Oleg Nesterov, Mel Gorman, Johannes Weiner, linux-mm, Arnd Bergmann


In dax.c we go to great lengths to keep track of write-faulted
pages, so that at m/fsync time we can cl_flush all these
"dirty" pages and make them durable.

This is heavy on locking and resources and slows down
write-mmap performance considerably.

But some applications might already be aware of PMEM and
might use the fast movnt instructions to directly persist
to pmem storage, bypassing CPU caches.

For these applications we define a new MAP_PMEM_AWARE mmap
flag.

In a later patch we use this flag in fs/dax.c to optimize
for these applications.

NOTE: In the current code we also need the vma to
carry this flag, so a new VM_PMEM_AWARE flag is also defined
and do_mmap() translates between the two constants.

NOTE2: vm_flags has already exhausted its 32 bits, but there
is a hole left at value 0x00800000
(after VM_HUGETLB and before VM_ARCH_1).
I hope this does not step on anyone's toes.

CC: Dan Williams <dan.j.williams@intel.com>
CC: Ross Zwisler <ross.zwisler@linux.intel.com>
CC: Matthew Wilcox <willy@linux.intel.com>
CC: linux-nvdimm <linux-nvdimm@ml01.01.org>
CC: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
CC: Oleg Nesterov <oleg@redhat.com>
CC: Mel Gorman <mgorman@suse.de>
CC: Johannes Weiner <hannes@cmpxchg.org>
CC: linux-mm@kvack.org (open list:MEMORY MANAGEMENT)

Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
---
 include/linux/mm.h              | 1 +
 include/uapi/asm-generic/mman.h | 1 +
 mm/mmap.c                       | 2 ++
 3 files changed, 4 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 376f373..fe992c0 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -155,6 +155,7 @@ extern unsigned int kobjsize(const void *objp);
 #define VM_ACCOUNT	0x00100000	/* Is a VM accounted object */
 #define VM_NORESERVE	0x00200000	/* should the VM suppress accounting */
 #define VM_HUGETLB	0x00400000	/* Huge TLB Page VM */
+#define VM_PMEM_AWARE	0x00800000	/* Carries MAP_PMEM_AWARE */
 #define VM_ARCH_1	0x01000000	/* Architecture-specific flag */
 #define VM_ARCH_2	0x02000000
 #define VM_DONTDUMP	0x04000000	/* Do not include in the core dump */
diff --git a/include/uapi/asm-generic/mman.h b/include/uapi/asm-generic/mman.h
index 7162cd4..0dc14d7 100644
--- a/include/uapi/asm-generic/mman.h
+++ b/include/uapi/asm-generic/mman.h
@@ -12,6 +12,7 @@
 #define MAP_NONBLOCK	0x10000		/* do not block on IO */
 #define MAP_STACK	0x20000		/* give out an address that is best suited for process/thread stacks */
 #define MAP_HUGETLB	0x40000		/* create a huge page mapping */
+#define MAP_PMEM_AWARE	0x80000		/* dax.c: Do not cl_flush dirty pages */
 
 /* Bits [26:31] are reserved, see mman-common.h for MAP_HUGETLB usage */
 
diff --git a/mm/mmap.c b/mm/mmap.c
index 76d1ec2..5ebc525 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1402,6 +1402,8 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
 		if (file && is_file_hugepages(file))
 			vm_flags |= VM_NORESERVE;
 	}
+	if (flags & MAP_PMEM_AWARE)
+		vm_flags |= VM_PMEM_AWARE;
 
 	addr = mmap_region(file, addr, len, vm_flags, pgoff);
 	if (!IS_ERR_VALUE(addr) &&
-- 
1.9.3



* [RFC 2/2] dax: Support MAP_PMEM_AWARE mmap flag
  2016-02-21 17:03 [RFC 0/2] New MAP_PMEM_AWARE mmap flag Boaz Harrosh
  2016-02-21 17:04 ` [RFC 1/2] mmap: Define a new " Boaz Harrosh
@ 2016-02-21 17:06 ` Boaz Harrosh
  2016-02-21 19:51 ` [RFC 0/2] New " Dan Williams
  2016-03-11  6:44 ` Andy Lutomirski
  3 siblings, 0 replies; 70+ messages in thread
From: Boaz Harrosh @ 2016-02-21 17:06 UTC (permalink / raw)
  To: Dan Williams, Ross Zwisler, linux-nvdimm, Matthew Wilcox,
	Kirill A. Shutemov, Dave Chinner
  Cc: Oleg Nesterov, Mel Gorman, Johannes Weiner, linux-mm, Arnd Bergmann


It is possible that an application like nvml is aware that
it is working with pmem, and is already using movnt instructions
and cl_flushes to keep its data persistent.

It is not enough that these applications do not call m/fsync;
in the current code we already pay extra locking and resources in
the radix tree on every write page-fault, even before we call
m/fsync.

Such an application can do an mmap call with the new MAP_PMEM_AWARE
flag, and for these mappings flush tracking will not be maintained.
This will not hurt any legacy applications that do regular mmap and
memcpy, even if they work on the same file; even legacy libraries in
the same process space that do mmap calls will have their page-faults
accounted for, since this is per-vma.

CC: Dan Williams <dan.j.williams@intel.com>
CC: Ross Zwisler <ross.zwisler@linux.intel.com>
CC: Matthew Wilcox <willy@linux.intel.com>
CC: linux-nvdimm <linux-nvdimm@ml01.01.org>
Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
---
 fs/dax.c | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 64e3fc1..f8aec85 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -579,10 +579,12 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
 	}
 	dax_unmap_atomic(bdev, &dax);
 
-	error = dax_radix_entry(mapping, vmf->pgoff, dax.sector, false,
+	if (!(vma->vm_flags & VM_PMEM_AWARE)) {
+		error = dax_radix_entry(mapping, vmf->pgoff, dax.sector, false,
 			vmf->flags & FAULT_FLAG_WRITE);
-	if (error)
-		goto out;
+		if (error)
+			goto out;
+	}
 
 	error = vm_insert_mixed_rw(vma, vaddr, dax.pfn,
 				     0 != (vmf->flags & FAULT_FLAG_WRITE));
@@ -984,7 +986,7 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 		 * entry completely on the initial read and just wait until
 		 * the write to insert a dirty entry.
 		 */
-		if (write) {
+		if (write && !(vma->vm_flags & VM_PMEM_AWARE)) {
 			error = dax_radix_entry(mapping, pgoff, dax.sector,
 					true, true);
 			if (error) {
@@ -1065,7 +1067,9 @@ int dax_pfn_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 	 * saves us from having to make a call to get_block() here to look
 	 * up the sector.
 	 */
-	dax_radix_entry(file->f_mapping, vmf->pgoff, NO_SECTOR, false, true);
+	if (!(vma->vm_flags & VM_PMEM_AWARE))
+		dax_radix_entry(file->f_mapping, vmf->pgoff, NO_SECTOR, false,
+				true);
 	return VM_FAULT_NOPAGE;
 }
 EXPORT_SYMBOL_GPL(dax_pfn_mkwrite);
-- 
1.9.3



* Re: [RFC 0/2] New MAP_PMEM_AWARE mmap flag
  2016-02-21 17:03 [RFC 0/2] New MAP_PMEM_AWARE mmap flag Boaz Harrosh
  2016-02-21 17:04 ` [RFC 1/2] mmap: Define a new " Boaz Harrosh
  2016-02-21 17:06 ` [RFC 2/2] dax: Support " Boaz Harrosh
@ 2016-02-21 19:51 ` Dan Williams
  2016-02-21 20:24   ` Boaz Harrosh
  2016-03-11  6:44 ` Andy Lutomirski
  3 siblings, 1 reply; 70+ messages in thread
From: Dan Williams @ 2016-02-21 19:51 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Ross Zwisler, linux-nvdimm, Matthew Wilcox, Kirill A. Shutemov,
	Dave Chinner, Oleg Nesterov, Mel Gorman, Johannes Weiner,
	linux-mm, Arnd Bergmann

On Sun, Feb 21, 2016 at 9:03 AM, Boaz Harrosh <boaz@plexistor.com> wrote:
> Hi all
>
> Recent DAX code fixed the cl_flushing ie durability of mmap access
> of direct persistent-memory from applications. It uses the radix-tree
> per inode to track the indexes of a file that where page-faulted for
> write. Then at m/fsync time it would cl_flush these pages and clean
> the radix-tree, for the next round.
>
> Sigh, that is life, for legacy applications this is the price we must
> pay. But for NV aware applications like nvml library, we pay extra extra
> price, even if we do not actually call m/fsync eventually. For these
> applications these extra resources and especially the extra radix locking
> per page-fault, costs a lot, like x3 a lot.
>
> What we propose here is a way for those applications to enjoy the
> boost and still not sacrifice any correctness of legacy applications.
> Any concurrent access from legacy apps vs nv-aware apps even to the same
> file / same page, will work correctly.
>
> We do that by defining a new MMAP flag that is set by the nv-aware
> app. this flag is carried by the VMA. In the dax code we bypass any
> radix handling of the page if this flag is set. Those pages accessed *without*
> this flag will be added to the radix-tree, those with will not.
> At m/fsync time if the radix tree is then empty nothing will happen.
>
> These are very simple none intrusive patches with minimum risk. (I think)
> They are based on v4.5-rc5. If you need a rebase on any other tree please
> say.
>
> Please consider this new flag for those of us people who specialize in
> persistent-memory setups and want to extract any possible mileage out
> of our systems.
>
> Also attached for reference a 3rd patch to the nvml library to use
> the new flag. Which brings me to the issue of persistent_memcpy / persistent_flush.
> Currently this library is for x86_64 only, using the movnt instructions. The gcc
> compiler should have a per ARCH facility for durable memory accesses. So applications
> can be portable across systems.
>
> Please advise?

When this came up a couple weeks ago [1], the conclusion I came away
with is that if an application wants to avoid the overhead of DAX
semantics it needs to use an alternative to DAX access methods.  Maybe
a new pmem aware fs like Nova [2], or some other mechanism that
bypasses the semantics that existing applications on top of ext4 and
xfs expect.

[1]: https://lists.01.org/pipermail/linux-nvdimm/2016-February/004411.html
[2]: http://sched.co/68kS


* Re: [RFC 0/2] New MAP_PMEM_AWARE mmap flag
  2016-02-21 19:51 ` [RFC 0/2] New " Dan Williams
@ 2016-02-21 20:24   ` Boaz Harrosh
  2016-02-21 20:57     ` Dan Williams
  0 siblings, 1 reply; 70+ messages in thread
From: Boaz Harrosh @ 2016-02-21 20:24 UTC (permalink / raw)
  To: Dan Williams
  Cc: Ross Zwisler, linux-nvdimm, Matthew Wilcox, Kirill A. Shutemov,
	Dave Chinner, Oleg Nesterov, Mel Gorman, Johannes Weiner,
	linux-mm, Arnd Bergmann

On 02/21/2016 09:51 PM, Dan Williams wrote:
<>
>> Please advise?
> 
> When this came up a couple weeks ago [1], the conclusion I came away
> with is 

I think I saw that talk; no, this was not suggested. What was suggested
was an FS / mount knob. That would break semantics; this here does not
break anything.

> that if an application wants to avoid the overhead of DAX
> semantics it needs to use an alternative to DAX access methods.  Maybe
> a new pmem aware fs like Nova [2], or some other mechanism that
> bypasses the semantics that existing applications on top of ext4 and
> xfs expect.
> 

But my suggestion does not break any "existing applications" and does
not break any semantics of ext4 or xfs (that I can see).

As I said above, it perfectly coexists with existing applications and
is the best of both worlds. Both kinds of applications can write to the
same page without breaking any application's expectations, old or
new.

Please point me to where I'm wrong in the code submitted?

Besides, even an FS like Nova will need a per-vma flag like this;
it will need to sort out the different types of applications. So
here is how this is communicated, on the mmap call, how else?
And it also works for xfs or ext4.

Do you not see how this is entirely different than what was
proposed? Or am I totally missing something? Again, please show
me how this breaks anything's expectations.

Thanks
Boaz


* Re: [RFC 0/2] New MAP_PMEM_AWARE mmap flag
  2016-02-21 20:24   ` Boaz Harrosh
@ 2016-02-21 20:57     ` Dan Williams
  2016-02-21 21:23       ` Boaz Harrosh
  0 siblings, 1 reply; 70+ messages in thread
From: Dan Williams @ 2016-02-21 20:57 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Ross Zwisler, linux-nvdimm, Matthew Wilcox, Kirill A. Shutemov,
	Dave Chinner, Oleg Nesterov, Mel Gorman, Johannes Weiner,
	linux-mm, Arnd Bergmann

On Sun, Feb 21, 2016 at 12:24 PM, Boaz Harrosh <boaz@plexistor.com> wrote:
> On 02/21/2016 09:51 PM, Dan Williams wrote:
> <>
>>> Please advise?
>>
>> When this came up a couple weeks ago [1], the conclusion I came away
>> with is
>
> I think I saw that talk, no this was not suggested. What was suggested
> was an FS / mount knob. That would break semantics, this here does not
> break anything.

No, it was a MAP_DAX mmap flag, similar to this proposal.  The
difference being that MAP_DAX was all or nothing (DAX vs page cache)
to address MAP_SHARED semantics.

>
>> that if an application wants to avoid the overhead of DAX
>> semantics it needs to use an alternative to DAX access methods.  Maybe
>> a new pmem aware fs like Nova [2], or some other mechanism that
>> bypasses the semantics that existing applications on top of ext4 and
>> xfs expect.
>>
>
> But my suggestion does not break any "existing applications" and does
> not break any semantics of ext4 or xfs. (That I can see)
>
> As I said above it perfectly co exists with existing applications and
> is the best of both worlds. The both applications can write to the
> same page and will not break any of application's expectation. Old or
> new.
>
> Please point me to where I'm wrong in the code submitted?
>
> Besides even an FS like Nova will need a flag per vma like this,
> it will need to sort out the different type of application. So
> here is how this is communicated, on the mmap call, how else?
> And also works for xfs or ext4
>
> Do you not see how this is entirely different then what was
> proposed? or am I totally missing something? Again please show
> me how this breaks anything's expectations.
>

What happens for MAP_SHARED mappings with mixed pmem aware/unaware
applications?  Does MAP_PMEM_AWARE also imply awareness of other
applications that may be dirtying cachelines without taking
responsibility for making them persistent?


* Re: [RFC 0/2] New MAP_PMEM_AWARE mmap flag
  2016-02-21 20:57     ` Dan Williams
@ 2016-02-21 21:23       ` Boaz Harrosh
  2016-02-21 22:03         ` Dan Williams
  0 siblings, 1 reply; 70+ messages in thread
From: Boaz Harrosh @ 2016-02-21 21:23 UTC (permalink / raw)
  To: Dan Williams
  Cc: Ross Zwisler, linux-nvdimm, Matthew Wilcox, Kirill A. Shutemov,
	Dave Chinner, Oleg Nesterov, Mel Gorman, Johannes Weiner,
	linux-mm, Arnd Bergmann

On 02/21/2016 10:57 PM, Dan Williams wrote:
> On Sun, Feb 21, 2016 at 12:24 PM, Boaz Harrosh <boaz@plexistor.com> wrote:
>> On 02/21/2016 09:51 PM, Dan Williams wrote:
>> <>
>>>> Please advise?
>>>
>>> When this came up a couple weeks ago [1], the conclusion I came away
>>> with is
>>
>> I think I saw that talk, no this was not suggested. What was suggested
>> was an FS / mount knob. That would break semantics, this here does not
>> break anything.
> 
> No, it was a MAP_DAX mmap flag, similar to this proposal.  The
> difference being that MAP_DAX was all or nothing (DAX vs page cache)
> to address MAP_SHARED semantics.
> 

Big difference no? I'm not talking about cached access at all.

>>
>>> that if an application wants to avoid the overhead of DAX
>>> semantics it needs to use an alternative to DAX access methods.  Maybe
>>> a new pmem aware fs like Nova [2], or some other mechanism that
>>> bypasses the semantics that existing applications on top of ext4 and
>>> xfs expect.
>>>
>>
>> But my suggestion does not break any "existing applications" and does
>> not break any semantics of ext4 or xfs. (That I can see)
>>
>> As I said above it perfectly co exists with existing applications and
>> is the best of both worlds. The both applications can write to the
>> same page and will not break any of application's expectation. Old or
>> new.
>>
>> Please point me to where I'm wrong in the code submitted?
>>
>> Besides even an FS like Nova will need a flag per vma like this,
>> it will need to sort out the different type of application. So
>> here is how this is communicated, on the mmap call, how else?
>> And also works for xfs or ext4
>>
>> Do you not see how this is entirely different then what was
>> proposed? or am I totally missing something? Again please show
>> me how this breaks anything's expectations.
>>
> 
> What happens for MAP_SHARED mappings with mixed pmem aware/unaware
> applications?  Does MAP_PMEM_AWARE also imply awareness of other
> applications that may be dirtying cachelines without taking
> responsibility for making them persistent?
> 

Sure, please have a look. What happens is that the legacy app
will add the page to the radix tree, and come the fsync it will be
flushed, even though a "new-type" app might fault on the same page
before or after without adding it to the radix tree.
So yes, all pages faulted by legacy apps will be flushed.

I have manually tested all this and it seems to work. Can you see
a theoretical scenario where it would not?

We have yet to set up our NVDIMM machines to test all this with
automatic power-off cycles and see how it holds up, hence the RFC status.

Thanks
Boaz


* Re: [RFC 0/2] New MAP_PMEM_AWARE mmap flag
  2016-02-21 21:23       ` Boaz Harrosh
@ 2016-02-21 22:03         ` Dan Williams
  2016-02-21 22:31           ` Dave Chinner
  2016-02-22 11:05           ` Boaz Harrosh
  0 siblings, 2 replies; 70+ messages in thread
From: Dan Williams @ 2016-02-21 22:03 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Ross Zwisler, linux-nvdimm, Matthew Wilcox, Kirill A. Shutemov,
	Dave Chinner, Oleg Nesterov, Mel Gorman, Johannes Weiner,
	linux-mm, Arnd Bergmann

On Sun, Feb 21, 2016 at 1:23 PM, Boaz Harrosh <boaz@plexistor.com> wrote:
> On 02/21/2016 10:57 PM, Dan Williams wrote:
>> On Sun, Feb 21, 2016 at 12:24 PM, Boaz Harrosh <boaz@plexistor.com> wrote:
>>> On 02/21/2016 09:51 PM, Dan Williams wrote:
>>> <>
>>>>> Please advise?
>>>>
>>>> When this came up a couple weeks ago [1], the conclusion I came away
>>>> with is
>>>
>>> I think I saw that talk, no this was not suggested. What was suggested
>>> was an FS / mount knob. That would break semantics, this here does not
>>> break anything.
>>
>> No, it was a MAP_DAX mmap flag, similar to this proposal.  The
>> difference being that MAP_DAX was all or nothing (DAX vs page cache)
>> to address MAP_SHARED semantics.
>>
>
> Big difference no? I'm not talking about cached access at all.
>
>>>
>>>> that if an application wants to avoid the overhead of DAX
>>>> semantics it needs to use an alternative to DAX access methods.  Maybe
>>>> a new pmem aware fs like Nova [2], or some other mechanism that
>>>> bypasses the semantics that existing applications on top of ext4 and
>>>> xfs expect.
>>>>
>>>
>>> But my suggestion does not break any "existing applications" and does
>>> not break any semantics of ext4 or xfs. (That I can see)
>>>
>>> As I said above it perfectly co exists with existing applications and
>>> is the best of both worlds. The both applications can write to the
>>> same page and will not break any of application's expectation. Old or
>>> new.
>>>
>>> Please point me to where I'm wrong in the code submitted?
>>>
>>> Besides even an FS like Nova will need a flag per vma like this,
>>> it will need to sort out the different type of application. So
>>> here is how this is communicated, on the mmap call, how else?
>>> And also works for xfs or ext4
>>>
>>> Do you not see how this is entirely different then what was
>>> proposed? or am I totally missing something? Again please show
>>> me how this breaks anything's expectations.
>>>
>>
>> What happens for MAP_SHARED mappings with mixed pmem aware/unaware
>> applications?  Does MAP_PMEM_AWARE also imply awareness of other
>> applications that may be dirtying cachelines without taking
>> responsibility for making them persistent?
>>
>
> Sure. please have a look. What happens is that the legacy app
> will add the page to the radix tree, come the fsync it will be
> flushed. Even though a "new-type" app might fault on the same page
> before or after, which did not add it to the radix tree.
> So yes, all pages faulted by legacy apps will be flushed.
>
> I have manually tested all this and it seems to work. Can you see
> a theoretical scenario where it would not?

I'm worried about the scenario where the pmem aware app assumes that
none of the cachelines in its mapping are dirty when it goes to issue
pcommit.  We'll have two applications with different perceptions of
when writes are durable.  Maybe it's not a problem in practice, at
least current generation x86 cpus flush existing dirty cachelines when
performing non-temporal stores.  However, it bothers me that there are
cpus where a pmem-unaware app could prevent a pmem-aware app from
making writes durable.  It seems if one app has established a
MAP_PMEM_AWARE mapping it needs guarantees that all apps participating
in that shared mapping have the same awareness.

Another potential issue is that MAP_PMEM_AWARE is not enough on its
own.  If the filesystem or inode does not support DAX the application
needs to assume page cache semantics.  At a minimum MAP_PMEM_AWARE
requests would need to fail if DAX is not available.


* Re: [RFC 0/2] New MAP_PMEM_AWARE mmap flag
  2016-02-21 22:03         ` Dan Williams
@ 2016-02-21 22:31           ` Dave Chinner
  2016-02-22  9:57             ` Boaz Harrosh
  2016-02-22 15:34             ` Jeff Moyer
  2016-02-22 11:05           ` Boaz Harrosh
  1 sibling, 2 replies; 70+ messages in thread
From: Dave Chinner @ 2016-02-21 22:31 UTC (permalink / raw)
  To: Dan Williams
  Cc: Boaz Harrosh, Ross Zwisler, linux-nvdimm, Matthew Wilcox,
	Kirill A. Shutemov, Oleg Nesterov, Mel Gorman, Johannes Weiner,
	linux-mm, Arnd Bergmann

On Sun, Feb 21, 2016 at 02:03:43PM -0800, Dan Williams wrote:
> On Sun, Feb 21, 2016 at 1:23 PM, Boaz Harrosh <boaz@plexistor.com> wrote:
> > On 02/21/2016 10:57 PM, Dan Williams wrote:
> >> On Sun, Feb 21, 2016 at 12:24 PM, Boaz Harrosh <boaz@plexistor.com> wrote:
> >>> On 02/21/2016 09:51 PM, Dan Williams wrote:
> > Sure. please have a look. What happens is that the legacy app
> > will add the page to the radix tree, come the fsync it will be
> > flushed. Even though a "new-type" app might fault on the same page
> > before or after, which did not add it to the radix tree.
> > So yes, all pages faulted by legacy apps will be flushed.
> >
> > I have manually tested all this and it seems to work. Can you see
> > a theoretical scenario where it would not?
> 
> I'm worried about the scenario where the pmem aware app assumes that
> none of the cachelines in its mapping are dirty when it goes to issue
> pcommit.  We'll have two applications with different perceptions of
> when writes are durable.  Maybe it's not a problem in practice, at
> least current generation x86 cpus flush existing dirty cachelines when
> performing non-temporal stores.  However, it bothers me that there are
> cpus where a pmem-unaware app could prevent a pmem-aware app from
> making writes durable.  It seems if one app has established a
> MAP_PMEM_AWARE mapping it needs guarantees that all apps participating
> in that shared mapping have the same awareness.

Which, in practice, cannot work. Think cp, rsync, or any other
program a user can run that can read the file the MAP_PMEM_AWARE
application is using.

> Another potential issue is that MAP_PMEM_AWARE is not enough on its
> own.  If the filesystem or inode does not support DAX the application
> needs to assume page cache semantics.  At a minimum MAP_PMEM_AWARE
> requests would need to fail if DAX is not available.

They will always still need to call msync()/fsync() to guarantee
data integrity, because the filesystem metadata that indexes the
data still needs to be committed before data integrity can be
guaranteed, i.e. MAP_PMEM_AWARE by itself is not sufficient for data
integrity, and so the app will have to be written like any other app
that uses page cache based mmap().

Indeed, the application cannot even assume that a fully allocated
file does not require msync/fsync because the filesystem may be
doing things like dedupe, defrag, copy on write, etc behind the back
of the application and so file metadata changes may still be in
volatile RAM even though the application has flushed its data.
Applications have no idea what the underlying filesystem and storage
is doing and so they cannot assume that complete data integrity is
provided by userspace driven CPU cache flush instructions on their
file data.

This "pmem aware applications only need to commit their data"
thinking is what got us into this mess in the first place. It's
wrong, and we need to stop trying to make pmem work this way because
it's a fundamentally broken concept.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [RFC 0/2] New MAP_PMEM_AWARE mmap flag
  2016-02-21 22:31           ` Dave Chinner
@ 2016-02-22  9:57             ` Boaz Harrosh
  2016-02-22 15:34             ` Jeff Moyer
  1 sibling, 0 replies; 70+ messages in thread
From: Boaz Harrosh @ 2016-02-22  9:57 UTC (permalink / raw)
  To: Dave Chinner, Dan Williams
  Cc: Ross Zwisler, linux-nvdimm, Matthew Wilcox, Kirill A. Shutemov,
	Oleg Nesterov, Mel Gorman, Johannes Weiner, linux-mm,
	Arnd Bergmann

On 02/22/2016 12:31 AM, Dave Chinner wrote:
> On Sun, Feb 21, 2016 at 02:03:43PM -0800, Dan Williams wrote:
>> On Sun, Feb 21, 2016 at 1:23 PM, Boaz Harrosh <boaz@plexistor.com> wrote:
>>> On 02/21/2016 10:57 PM, Dan Williams wrote:
>>>> On Sun, Feb 21, 2016 at 12:24 PM, Boaz Harrosh <boaz@plexistor.com> wrote:
>>>>> On 02/21/2016 09:51 PM, Dan Williams wrote:
>>> Sure. please have a look. What happens is that the legacy app
>>> will add the page to the radix tree, come the fsync it will be
>>> flushed. Even though a "new-type" app might fault on the same page
>>> before or after, which did not add it to the radix tree.
>>> So yes, all pages faulted by legacy apps will be flushed.
>>>
>>> I have manually tested all this and it seems to work. Can you see
>>> a theoretical scenario where it would not?
>>
>> I'm worried about the scenario where the pmem aware app assumes that
>> none of the cachelines in its mapping are dirty when it goes to issue
>> pcommit.  We'll have two applications with different perceptions of
>> when writes are durable.  Maybe it's not a problem in practice, at
>> least current generation x86 cpus flush existing dirty cachelines when
>> performing non-temporal stores.  However, it bothers me that there are
>> cpus where a pmem-unaware app could prevent a pmem-aware app from
>> making writes durable.  It seems if one app has established a
>> MAP_PMEM_AWARE mapping it needs guarantees that all apps participating
>> in that shared mapping have the same awareness.
> 
> Which, in practice, cannot work. Think cp, rsync, or any other
> program a user can run that can read the file the MAP_PMEM_AWARE
> application is using.
> 

Yes, what of it? Nothing will happen; it all just works.

Perhaps you did not understand: we are talking about a DAX-mapped
file, not a combination of a DAX vs a page-cached system.

One thread stores a value X to memory movnt-style, another thread reads
the same value X from memory; CPUs do this all the time. What of it?

>> Another potential issue is that MAP_PMEM_AWARE is not enough on its
>> own.  If the filesystem or inode does not support DAX the application
>> needs to assume page cache semantics.  At a minimum MAP_PMEM_AWARE
>> requests would need to fail if DAX is not available.

Dan, this is a good idea. I will add it. From a system perspective it
is not needed. In fact, what will happen today if you load nvml on a
non-DAX mounted fs? Nothing will work at all, even though at the
beginning all the data seems to be there, right?
But I think with this here it is a chance for us to let nvml unload
gracefully before any destructive changes are made.

> 
> They will always still need to call msync()/fsync() to guarantee
> data integrity, because the filesystem metadata that indexes the
> data still needs to be committed before data integrity can be
> guaranteed. i.e. MAP_PMEM_AWARE by itself it not sufficient for data
> integrity, and so the app will have to be written like any other app
> that uses page cache based mmap().
> 

Yes, sure. I agree completely. msync()/fsync() will need to be called.

I apologize; you have missed the motivation of this patch because I
did not explain it very well. Our motivation is speed.

One can have durable data by (model [1] is sketched in code below):
1. Doing movnt  - done, and even faster than memcpy
2. radix-tree-add; memcpy; cl_flush;
   Surely this one is much slower, lock-heavy, and resource consuming.
   Our micro-benchmarks show a 3-8x slowdown. (Memory speeds, remember.)

So sure, a MAP_PMEM_AWARE app *must* call m/fsync() for data integrity,
but it will not pay the "slow" price at all; it will all be very fast
because the O(n) radix-tree management+traversal+cl_flush will not be
there, only the metadata bits will sync.
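
Roughly, model [1] from the application side looks like the sketch below
(assuming the mapping was created with MAP_PMEM_AWARE and that nvml's
libpmem is available; pmem_memcpy_persist() is its movnt-based copy):

	#include <libpmem.h>	/* pmem_memcpy_persist(), from nvml */
	#include <stddef.h>
	#include <unistd.h>

	/* The data itself is made durable by the movnt copy; the single
	 * fsync() afterwards only has to commit filesystem metadata,
	 * because no dirty pages were tracked in the radix tree. */
	static int pmem_aware_write(int fd, void *dst, const void *src, size_t len)
	{
		pmem_memcpy_persist(dst, src, len);	/* movnt stores + sfence */
		return fsync(fd);			/* metadata only */
	}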

> Indeed, the application cannot even assume that a fully allocated
> file does not require msync/fsync because the filesystem may be
> doing things like dedupe, defrag, copy on write, etc behind the back
> of the application and so file metadata changes may still be in
> volatile RAM even though the application has flushed it's data.
> Applications have no idea what the underlying filesystem and storage
> is doing and so they cannot assume that complete data integrity is
> provided by userspace driven CPU cache flush instructions on their
> file data.
> 

Exactly, m/fsync() is needed, only it will be much *faster*.

> This "pmem aware applications only need to commit their data"
> thinking is what got us into this mess in the first place. It's
> wrong, and we need to stop trying to make pmem work this way because
> it's a fundamentally broken concept.
> 

Hey, sir Dave, please hold your horses. What mess are you talking about?
There is no mess. All we are trying to do is enable model [1] above vs
the current model [2], which costs a lot.

Every bit of data integrity, and the FS's freedom to manage data behind
the scenes, is kept intact.
	YES, apps need to fsync!

Thank you, I will add this warning in the next submission, to explain
it better.

> Cheers,
> Dave.
> 

Cheers
Boaz


* Re: [RFC 0/2] New MAP_PMEM_AWARE mmap flag
  2016-02-21 22:03         ` Dan Williams
  2016-02-21 22:31           ` Dave Chinner
@ 2016-02-22 11:05           ` Boaz Harrosh
  1 sibling, 0 replies; 70+ messages in thread
From: Boaz Harrosh @ 2016-02-22 11:05 UTC (permalink / raw)
  To: Dan Williams
  Cc: Ross Zwisler, linux-nvdimm, Matthew Wilcox, Kirill A. Shutemov,
	Dave Chinner, Oleg Nesterov, Mel Gorman, Johannes Weiner,
	linux-mm, Arnd Bergmann

On 02/22/2016 12:03 AM, Dan Williams wrote:
> On Sun, Feb 21, 2016 at 1:23 PM, Boaz Harrosh <boaz@plexistor.com> wrote:
<>
>> I have manually tested all this and it seems to work. Can you see
>> a theoretical scenario where it would not?
> 
> I'm worried about the scenario where the pmem aware app assumes that
> none of the cachelines in its mapping are dirty when it goes to issue
> pcommit.  We'll have two applications with different perceptions of
> when writes are durable.  

Warning, rant: Rrrr, the theoretical pcommit. We have built mountains
on a non-existent CPU. Show me a pcommit already.

But yes, pcommit changes nothing.

> Maybe it's not a problem in practice, at
> least current generation x86 cpus flush existing dirty cachelines when
> performing non-temporal stores.  However, it bothers me that there are
> cpus where a pmem-unaware app could prevent a pmem-aware app from
> making writes durable.  It seems if one app has established a
> MAP_PMEM_AWARE mapping it needs guarantees that all apps participating
> in that shared mapping have the same awareness.
> 

But we are not breaking any current POSIX guarantees. You are thinking
memory, but this is POSIX filesystem semantics. This is all up to the
application.

Consider a regular page-cached FS and your two applications above
(which BTW do not exist, exactly because of this). Both are doing a
write, not just to a cacheline but to a whole page even:

App 1			app2
- write block X		...
- sync			write block X

- 		POWER OFF

There is no guarantee that app 1's version is what will be read
after mount; any random amount of app 2's changes can be seen.
In fact, even while the pages are in DMA they can change.

All that is guaranteed is that the page will be marked dirty,
because app 2 dirtied it even though app 1 submitted it to be
cleaned.
And that is what we have: if app 2 is pmem-unaware the page is added
to the radix tree, and come sync time it will be cl_flushed.

In any case, after the write storms end and a final
sync is performed, we should have an image of the very last
writes. This is POSIX. And this is kept here.

So no, there is no need for "shared mapping have the same awareness".

[BTW: coming from the NFS world all this is one big laugh,
 because there we don't even have a read-vs-concurrent-write
 guarantee, let alone a write-vs-write guarantee.]

> Another potential issue is that MAP_PMEM_AWARE is not enough on its
> own.  If the filesystem or inode does not support DAX the application
> needs to assume page cache semantics.  At a minimum MAP_PMEM_AWARE
> requests would need to fail if DAX is not available.
> 

Yes good idea, will do.

Shalom
Boaz


* Re: [RFC 0/2] New MAP_PMEM_AWARE mmap flag
  2016-02-21 22:31           ` Dave Chinner
  2016-02-22  9:57             ` Boaz Harrosh
@ 2016-02-22 15:34             ` Jeff Moyer
  2016-02-22 17:44               ` Christoph Hellwig
                                 ` (2 more replies)
  1 sibling, 3 replies; 70+ messages in thread
From: Jeff Moyer @ 2016-02-22 15:34 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Dan Williams, Arnd Bergmann, linux-nvdimm, Oleg Nesterov,
	linux-mm, Mel Gorman, Johannes Weiner, Kirill A. Shutemov

Hi, Dave,

Dave Chinner <david@fromorbit.com> writes:

>> Another potential issue is that MAP_PMEM_AWARE is not enough on its
>> own.  If the filesystem or inode does not support DAX the application
>> needs to assume page cache semantics.  At a minimum MAP_PMEM_AWARE
>> requests would need to fail if DAX is not available.
>
> They will always still need to call msync()/fsync() to guarantee
> data integrity, because the filesystem metadata that indexes the
> data still needs to be committed before data integrity can be
> guaranteed. i.e. MAP_PMEM_AWARE by itself it not sufficient for data
> integrity, and so the app will have to be written like any other app
> that uses page cache based mmap().
>
> Indeed, the application cannot even assume that a fully allocated
> file does not require msync/fsync because the filesystem may be
> doing things like dedupe, defrag, copy on write, etc behind the back
> of the application and so file metadata changes may still be in
> volatile RAM even though the application has flushed it's data.

Once you hand out a persistent memory mapping, you sure as heck can't
switch blocks around behind the back of the application.

But even if we're not dealing with persistent memory, you seem to imply
that applications need to fsync just in case the file system did
something behind its back.  In other words, an application opening a
fully allocated file and using fdatasync will also need to call fsync,
just in case.  Is that really what you're suggesting?

> Applications have no idea what the underlying filesystem and storage
> is doing and so they cannot assume that complete data integrity is
> provided by userspace driven CPU cache flush instructions on their
> file data.

This is surprising to me, and goes completely against the proposed
programming model.  In fact, this is a very basic tenet of the operation
of the nvml libraries on pmem.io.

That aside, let me see if I understand you correctly.

An application creates a file and writes to every single block in the
thing, syncs it, closes it.  It then opens it back up, calls mmap with
this new MAP_DAX flag or on a file system mounted with -o dax, and
proceeds to access the file using loads and stores.  It persists its
data by using non-temporal stores and flushing and fencing CPU
instructions.
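
In code, that workflow is roughly the following sketch (path and size are
made up, and pmem_persist() from nvml's libpmem stands in for the explicit
flush/fence instructions):

	#include <fcntl.h>
	#include <string.h>
	#include <sys/mman.h>
	#include <unistd.h>
	#include <libpmem.h>	/* pmem_persist() */

	#define FILE_SZ (1 << 20)

	int main(void)
	{
		/* Create and fully allocate the file, sync it, close it. */
		int fd = open("/mnt/dax/data", O_CREAT | O_RDWR, 0644);
		if (fd < 0 || posix_fallocate(fd, 0, FILE_SZ) || fsync(fd))
			return 1;
		close(fd);

		/* Open it back up and map it (DAX mount assumed). */
		fd = open("/mnt/dax/data", O_RDWR);
		char *p = mmap(NULL, FILE_SZ, PROT_READ | PROT_WRITE,
			       MAP_SHARED, fd, 0);
		if (fd < 0 || p == MAP_FAILED)
			return 1;

		/* Store directly, then flush and fence from userspace. */
		strcpy(p, "hello, pmem");
		pmem_persist(p, strlen(p) + 1);

		/* The question under discussion: is an fsync()/fdatasync()
		 * still required here to make the metadata durable? */
		munmap(p, FILE_SZ);
		close(fd);
		return 0;
	}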

If I understand you correctly, you're saying that that application is
not written correctly, because it needs to call fsync to persist
metadata (that it presumably did not modify).  Is that right?

-Jeff


* Re: [RFC 0/2] New MAP_PMEM_AWARE mmap flag
  2016-02-22 15:34             ` Jeff Moyer
@ 2016-02-22 17:44               ` Christoph Hellwig
  2016-02-22 17:58                 ` Jeff Moyer
  2016-02-22 20:05                 ` Rudoff, Andy
  2016-02-22 21:50               ` Dave Chinner
  2016-02-23 13:51               ` Boaz Harrosh
  2 siblings, 2 replies; 70+ messages in thread
From: Christoph Hellwig @ 2016-02-22 17:44 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Dave Chinner, Dan Williams, Arnd Bergmann, linux-nvdimm,
	Oleg Nesterov, linux-mm, Mel Gorman, Johannes Weiner,
	Kirill A. Shutemov

On Mon, Feb 22, 2016 at 10:34:45AM -0500, Jeff Moyer wrote:
> > of the application and so file metadata changes may still be in
> > volatile RAM even though the application has flushed it's data.
> 
> Once you hand out a persistent memory mapping, you sure as heck can't
> switch blocks around behind the back of the application.

You might not even have allocated the blocks at the time of the mmap,
although for pmem remapping it after a page fault has actually allocated
the block would be rather painful.

> But even if we're not dealing with persistent memory, you seem to imply
> that applications needs to fsync just in case the file system did
> something behind its back.  In other words, an application opening a
> fully allocated file and using fdatasync will also need to call fsync,
> just in case.  Is that really what you're suggesting?

Your above statement looks rather confused.  The only difference between
fdatasync and fsync is that the former does not write out metadata not
required to find the file data (usually that's just timestamps).  So if
you already use fdatasync or msync properly you don't need to fsync
again.  But you need to use one of the above methods to ensure your
data is persistent on the medium.
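
Concretely, for a DAX mapping that means something like this sketch
(addr/len being whatever page-aligned range the stores touched):

	#include <stddef.h>
	#include <sys/mman.h>
	#include <unistd.h>

	/* Either call is enough to also commit the metadata needed to
	 * find the data; a full fsync() would additionally write out
	 * timestamps and the like. */
	static int commit_stores(int fd, void *page_aligned_addr, size_t len)
	{
		(void)fd;	/* fdatasync(fd) would work just as well */
		return msync(page_aligned_addr, len, MS_SYNC);
	}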

> > Applications have no idea what the underlying filesystem and storage
> > is doing and so they cannot assume that complete data integrity is
> > provided by userspace driven CPU cache flush instructions on their
> > file data.
> 
> This is surprising to me, and goes completely against the proposed
> programming model.  In fact, this is a very basic tenet of the operation
> of the nvml libraries on pmem.io.

It's simply impossible to provide.  But then again pmem.io seems to be
much more about hype than reality anyway.

> An application creates a file and writes to every single block in the
> thing, sync's it, closes it.  It then opens it back up, calls mmap with
> this new MAP_DAX flag or on a file system mounted with -o dax, and
> proceeds to access the file using loads and stores.  It persists its
> data by using non-temporal stores, flushing and fencing cpu
> instructions.
> 
> If I understand you correctly, you're saying that that application is
> not written correctly, because it needs to call fsync to persist
> metadata (that it presumably did not modify).  Is that right?

Exactly.


* Re: [RFC 0/2] New MAP_PMEM_AWARE mmap flag
  2016-02-22 17:44               ` Christoph Hellwig
@ 2016-02-22 17:58                 ` Jeff Moyer
  2016-02-22 18:03                   ` Christoph Hellwig
  2016-02-22 20:05                 ` Rudoff, Andy
  1 sibling, 1 reply; 70+ messages in thread
From: Jeff Moyer @ 2016-02-22 17:58 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Dave Chinner, Dan Williams, Arnd Bergmann, linux-nvdimm,
	Oleg Nesterov, linux-mm, Mel Gorman, Johannes Weiner,
	Kirill A. Shutemov

Hi, Christoph,

Christoph Hellwig <hch@infradead.org> writes:

> On Mon, Feb 22, 2016 at 10:34:45AM -0500, Jeff Moyer wrote:
>> > of the application and so file metadata changes may still be in
>> > volatile RAM even though the application has flushed it's data.
>> 
>> Once you hand out a persistent memory mapping, you sure as heck can't
>> switch blocks around behind the back of the application.
>
> You might not even have allocated the blocks at the time of the mmap,
> although for pmem remapping it after a page fault has actually allocated
> the block would be rather painful.

Right, I meant after fault, but it seems like you're suggesting that
that is even possible.  I wouldn't mind discussing this part more, but I
think it detracts from the main question I have, which is at the end.
Maybe we can take it up over beers some time.

>> But even if we're not dealing with persistent memory, you seem to imply
>> that applications needs to fsync just in case the file system did
>> something behind its back.  In other words, an application opening a
>> fully allocated file and using fdatasync will also need to call fsync,
>> just in case.  Is that really what you're suggesting?
>
> You above statement looks rather confused.  The only difference between
> fdatasync and sync is that the former does not write out metadata not
> required to find the file data (usually that's just timestamps).  So if
> you already use fdatasync or msync properly you don't need to fsync
> again.  But you need to use one of the above methods to ensure your
> data is persistent on the medium.

Duh, yeah.  I forgot about the "metadata necessary to find the file
data" part (which is, admittedly, a big part).

>> An application creates a file and writes to every single block in the
>> thing, sync's it, closes it.  It then opens it back up, calls mmap with
>> this new MAP_DAX flag or on a file system mounted with -o dax, and
>> proceeds to access the file using loads and stores.  It persists its
>> data by using non-temporal stores, flushing and fencing cpu
>> instructions.
>> 
>> If I understand you correctly, you're saying that that application is
>> not written correctly, because it needs to call fsync to persist
>> metadata (that it presumably did not modify).  Is that right?
>
> Exactly.

Sorry for being dense, but why, exactly?  If the file system is making
changes without the application's involvement, then the file system
should be responsible for ensuring its own consistency, irrespective of
whether the application issues an fsync.  Clearly I'm missing some key
point here.

-Jeff


* Re: [RFC 0/2] New MAP_PMEM_AWARE mmap flag
  2016-02-22 17:58                 ` Jeff Moyer
@ 2016-02-22 18:03                   ` Christoph Hellwig
  2016-02-22 18:52                     ` Jeff Moyer
  0 siblings, 1 reply; 70+ messages in thread
From: Christoph Hellwig @ 2016-02-22 18:03 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Christoph Hellwig, Dave Chinner, Dan Williams, Arnd Bergmann,
	linux-nvdimm, Oleg Nesterov, linux-mm, Mel Gorman,
	Johannes Weiner, Kirill A. Shutemov

On Mon, Feb 22, 2016 at 12:58:18PM -0500, Jeff Moyer wrote:
> Sorry for being dense, but why, exactly?  If the file system is making
> changes without the application's involvement, then the file system
> should be responsible for ensuring its own consistency, irrespective of
> whether the application issues an fsync.  Clearly I'm missing some key
> point here.

The simplest example is a copy on write file system (or simply a copy on
write file, which can exist with ocfs2 and will with xfs very soon),
where each write will allocate a new block, which will require metadata
updates.

We've built the whole I/O model around the concept that by default our
I/O will require fsync/msync.  For read/write-style I/O you can opt out
using O_DSYNC.  There currently is no way to opt out for memory mapped
I/O, mostly because it's

  a) useless without something like DAX, and
  b) much harder to implement

So a MAP_SYNC option might not be entirely off the table, but I think
it would be a lot of hard work and I'm not even sure it's possible
to handle it in the general case.


* Re: [RFC 0/2] New MAP_PMEM_AWARE mmap flag
  2016-02-22 18:03                   ` Christoph Hellwig
@ 2016-02-22 18:52                     ` Jeff Moyer
  2016-02-23  9:45                       ` Christoph Hellwig
  0 siblings, 1 reply; 70+ messages in thread
From: Jeff Moyer @ 2016-02-22 18:52 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Dave Chinner, Dan Williams, Arnd Bergmann, linux-nvdimm,
	Oleg Nesterov, linux-mm, Mel Gorman, Johannes Weiner,
	Kirill A. Shutemov

Christoph Hellwig <hch@infradead.org> writes:

> On Mon, Feb 22, 2016 at 12:58:18PM -0500, Jeff Moyer wrote:
>> Sorry for being dense, but why, exactly?  If the file system is making
>> changes without the application's involvement, then the file system
>> should be responsible for ensuring its own consistency, irrespective of
>> whether the application issues an fsync.  Clearly I'm missing some key
>> point here.
>
> The simplest example is a copy on write file system (or simply a copy on
> write file, which can exist with ocfs2 and will with xfs very soon),
> where each write will allocate a new block, which will require metadata
> updates.
>
> We've built the whole I/O model around the concept that by default our
> I/O will required fsync/msync.  For read/write-style I/O you can opt out
> using O_DSYNC.  There currently is no way to opt out for memory mapped
> I/O, mostly because it's
>
>   a) useless without something like DAX, and
>   b) much harder to implement
>
> So a MAP_SYNC option might not be entirely off the table, but I think
> it would be a lot of hard work and I'm not even sure it's possible
> to handle it in the general case.

I see.  So, at write fault time, you're saying that new blocks may be
allocated, and that in order to make that persistent, we need a sync
operation.  Presumably this MAP_SYNC option could sync out the necessary
metadata updates to the log before returning from the write fault
handler.  The arguments against making this work are that it isn't
generally useful, and that we don't want more dax special cases in the
code.  Did I get that right?

Thanks,
Jeff


* RE: [RFC 0/2] New MAP_PMEM_AWARE mmap flag
  2016-02-22 17:44               ` Christoph Hellwig
  2016-02-22 17:58                 ` Jeff Moyer
@ 2016-02-22 20:05                 ` Rudoff, Andy
  2016-02-23  9:52                   ` Christoph Hellwig
  1 sibling, 1 reply; 70+ messages in thread
From: Rudoff, Andy @ 2016-02-22 20:05 UTC (permalink / raw)
  To: Christoph Hellwig, Jeff Moyer
  Cc: Arnd Bergmann, linux-nvdimm, Dave Chinner, Oleg Nesterov,
	linux-mm, Mel Gorman, Johannes Weiner, Kirill A. Shutemov

>> This is surprising to me, and goes completely against the proposed 
>> programming model.  In fact, this is a very basic tenet of the 
>> operation of the nvml libraries on pmem.io.
>
>It's simply impossible to provide.  But then again pmem.io seems to be much more about hype than reality anyway.

Well that comment woke me up :-)

I think several things are getting mixed together in this discussion:

First, one primary reason DAX exists is so that applications can access
persistence directly.  Once mappings are set up, latency-sensitive apps
get load/store access and can flush stores themselves using instructions
rather than kernel calls.

Second, programming to load/store persistence is tricky, but the usual
API for programming to memory-mapped files will "just work" and we built
on that to avoid needlessly creating new permission & naming models.  If
you want to use msync() or fsync(), it will work, but may not perform as
well as using the instructions.  The instructions give you very
fine-grain flushing control, but the downside is that the app must track
what it changes at that fine granularity.  Both models work, but there's
a trade-off.

So what can be done to make persistent memory easier to use?  I think
this is where the debate really is.  Using memory-mapped files and the
instructions directly is difficult.  The libraries available on pmem.io
are meant to make it easier (providing transactions, memory allocation,
etc) but it is still difficult.  But what about just taking applications
that use mmap() and giving them DAX without their knowledge?  Is that a
way to leverage pmem more easily, without forcing an application to
change?  I think this is analogous to forcing O_DIRECT on applications
without their knowledge.  There may be cases where it works, but there
will always be better leverage of the technology if the application is
architected to use it.

There are applications already modified to use DAX for pmem and to flush
stores themselves (using NVDIMMs for testing, but planning for the
higher-capacity pmem to become available).  Some are using the libraries
on pmem.io, some are not.  Those are pmem-aware applications and I
haven't seen any incorrect expectations on what happens with
copy-on-write or page faults that fill in holes in a file.  Maybe
there's a case to be made for applications getting DAX transparently,
but I think that's not the only usage and the model we've been pushing
where an application is pmem aware seems to be getting traction.

-andy


* Re: [RFC 0/2] New MAP_PMEM_AWARE mmap flag
  2016-02-22 15:34             ` Jeff Moyer
  2016-02-22 17:44               ` Christoph Hellwig
@ 2016-02-22 21:50               ` Dave Chinner
  2016-02-23 13:51               ` Boaz Harrosh
  2 siblings, 0 replies; 70+ messages in thread
From: Dave Chinner @ 2016-02-22 21:50 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Dan Williams, Arnd Bergmann, linux-nvdimm, Oleg Nesterov,
	linux-mm, Mel Gorman, Johannes Weiner, Kirill A. Shutemov

On Mon, Feb 22, 2016 at 10:34:45AM -0500, Jeff Moyer wrote:
> Hi, Dave,
> 
> Dave Chinner <david@fromorbit.com> writes:
> 
> >> Another potential issue is that MAP_PMEM_AWARE is not enough on its
> >> own.  If the filesystem or inode does not support DAX the application
> >> needs to assume page cache semantics.  At a minimum MAP_PMEM_AWARE
> >> requests would need to fail if DAX is not available.
> >
> > They will always still need to call msync()/fsync() to guarantee
> > data integrity, because the filesystem metadata that indexes the
> > data still needs to be committed before data integrity can be
> > guaranteed. i.e. MAP_PMEM_AWARE by itself it not sufficient for data
> > integrity, and so the app will have to be written like any other app
> > that uses page cache based mmap().
> >
> > Indeed, the application cannot even assume that a fully allocated
> > file does not require msync/fsync because the filesystem may be
> > doing things like dedupe, defrag, copy on write, etc behind the back
> > of the application and so file metadata changes may still be in
> > volatile RAM even though the application has flushed it's data.
> 
> Once you hand out a persistent memory mapping, you sure as heck can't
> switch blocks around behind the back of the application.

Yes we can. All we need to do is lock out page faults, invalidate
the mappings, and change the underlying blocks.  The app using mmap
will refault on its next access, and get the new block mapped into
its address space.

I'll point to hole punching as an example of how we do these
invalidate/modify operations right now, and we expect them to work
and not result in data corruption. We even have tests (e.g. fsx in
xfstests has all these operations enabled) to make sure it works.

> That aside, let me see if I understand you correctly.
> 
> An application creates a file and writes to every single block in the
> thing, sync's it, closes it.  It then opens it back up, calls mmap with
> this new MAP_DAX flag or on a file system mounted with -o dax, and
> proceeds to access the file using loads and stores.  It persists its
> data by using non-temporal stores, flushing and fencing cpu
> instructions.

The moment the app does a write to the file data, we can no longer
assume the filesystem metadata references to the file data are
durable.

> If I understand you correctly, you're saying that that application is
> not written correctly, because it needs to call fsync to persist
> metadata (that it presumably did not modify).  Is that right?

Yes, though fdatasync() would be sufficient because the app only
modified data.

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [RFC 0/2] New MAP_PMEM_AWARE mmap flag
  2016-02-22 18:52                     ` Jeff Moyer
@ 2016-02-23  9:45                       ` Christoph Hellwig
  0 siblings, 0 replies; 70+ messages in thread
From: Christoph Hellwig @ 2016-02-23  9:45 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Christoph Hellwig, Dave Chinner, Dan Williams, Arnd Bergmann,
	linux-nvdimm, Oleg Nesterov, linux-mm, Mel Gorman,
	Johannes Weiner, Kirill A. Shutemov

On Mon, Feb 22, 2016 at 01:52:28PM -0500, Jeff Moyer wrote:
> I see.  So, at write fault time, you're saying that new blocks may be
> allocated, and that in order to make that persistent, we need a sync
> operation.

Yes.

> Presumably this MAP_SYNC option could sync out the necessary
> metadata updates to the log before returning from the write fault
> handler.  The arguments against making this work are that it isn't
> generally useful, and that we don't want more dax special cases in the
> code.  Did I get that right?

The argument is that it's non-trivial, and we haven't even sorted out
basic semantics for directly mapped storage.  Let's finish up getting
this right, and then look into optimizing it further in the next step.


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [RFC 0/2] New MAP_PMEM_AWARE mmap flag
  2016-02-22 20:05                 ` Rudoff, Andy
@ 2016-02-23  9:52                   ` Christoph Hellwig
  2016-02-23 10:07                     ` Rudoff, Andy
  2016-02-23 14:10                     ` Boaz Harrosh
  0 siblings, 2 replies; 70+ messages in thread
From: Christoph Hellwig @ 2016-02-23  9:52 UTC (permalink / raw)
  To: Rudoff, Andy
  Cc: Christoph Hellwig, Jeff Moyer, Arnd Bergmann, linux-nvdimm,
	Dave Chinner, Oleg Nesterov, linux-mm, Mel Gorman,
	Johannes Weiner, Kirill A. Shutemov

[Hi Andy - care to properly line break after ~75 characters? That makes
 reading the message a lot easier, thanks!]

On Mon, Feb 22, 2016 at 08:05:44PM +0000, Rudoff, Andy wrote:
> I think several things are getting mixed together in this discussion:
> 
> First, one primary reason DAX exists is so that applications can access
> persistence directly.

Agreed.

> Once mappings are set up, latency-sensitive apps get load/store access
> and can flush stores themselves using instructions rather than kernel calls.

Disagreed.  That's not how the architecture has worked at any point
since the humble ext2/XIP days.  It might be a worthwhile goal in the
long run, but it's never been part of the architecture as discussed on
the Linux lists, and it's not trivially implementable.

> Second, programming to load/store persistence is tricky, but the usual API
> for programming to memory-mapped files will "just work" and we built on
> that to avoid needlessly creating new permission & naming models.

Agreed.

> If you want to use msync() or fsync(), it will work, but may not perform as
> well as using the instructions.

And this is BS.  Using msync or fsync might not perform as well as not
actually using them, but without them you do not get persistence.  If
you use your pmem as a throw away cache that's fine, but for most people
that is not the case.

> The instructions give you very fine-grain flushing control, but the
> downside is that the app must track what it changes at that fine
> granularity.  Both models work, but there's a trade-off.

No, the cache flush model simply does not work without a lot of hard
work to enable it first.


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [RFC 0/2] New MAP_PMEM_AWARE mmap flag
  2016-02-23  9:52                   ` Christoph Hellwig
@ 2016-02-23 10:07                     ` Rudoff, Andy
  2016-02-23 12:06                       ` Dave Chinner
  2016-02-23 14:10                     ` Boaz Harrosh
  1 sibling, 1 reply; 70+ messages in thread
From: Rudoff, Andy @ 2016-02-23 10:07 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jeff Moyer, Arnd Bergmann, linux-nvdimm, Dave Chinner,
	Oleg Nesterov, linux-mm, Mel Gorman, Johannes Weiner,
	Kirill A. Shutemov


> [Hi Andy - care to properly line break after ~75 character, that makes
> ready the message a lot easier, thanks!]

My bad. 

>> The instructions give you very fine-grain flushing control, but the
>> downside is that the app must track what it changes at that fine
>> granularity.  Both models work, but there's a trade-off.
> 
> No, the cache flush model simply does not work without a lot of hard
> work to enable it first.

It's working well enough to pass tests that simulate crashes and
various workload tests for the apps involved. And I agree there
has been a lot of hard work behind it. I guess I'm not sure why you're
saying it is impossible or not working.

Let's take an example: an app uses fallocate() to create a DAX file,
mmap() to map it, msync() to flush changes. The app follows POSIX
meaning it doesn't expect file metadata to be flushed magically, etc.
The app is tested carefully and it works correctly.  Now the msync()
call used to flush stores is replaced by flushing instructions.
What's broken?
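
(For concreteness, a rough sketch of that example -- path made up, error
handling omitted, the flushing instructions shown via x86 intrinsics:)

#include <emmintrin.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        int fd = open("/mnt/pmem/data", O_CREAT | O_RDWR, 0644);
        posix_fallocate(fd, 0, 4096);   /* space reserved, typically as unwritten extents */

        char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        strcpy(p, "payload");           /* write fault; the FS converts/zeroes the extent */

        /* original app: one call flushes the data and commits FS metadata */
        msync(p, 4096, MS_SYNC);

        /* modified app: flush only the stores it knows about; whether the
         * fault-time metadata also needs a sync is the question above */
        _mm_clflush(p);
        _mm_sfence();

        munmap(p, 4096);
        close(fd);
        return 0;
}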

Thanks,

-andy


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [RFC 0/2] New MAP_PMEM_AWARE mmap flag
  2016-02-23 10:07                     ` Rudoff, Andy
@ 2016-02-23 12:06                       ` Dave Chinner
  2016-02-23 17:10                         ` Ross Zwisler
  0 siblings, 1 reply; 70+ messages in thread
From: Dave Chinner @ 2016-02-23 12:06 UTC (permalink / raw)
  To: Rudoff, Andy
  Cc: Christoph Hellwig, Jeff Moyer, Arnd Bergmann, linux-nvdimm,
	Oleg Nesterov, linux-mm, Mel Gorman, Johannes Weiner,
	Kirill A. Shutemov

On Tue, Feb 23, 2016 at 10:07:07AM +0000, Rudoff, Andy wrote:
> 
> > [Hi Andy - care to properly line break after ~75 character, that makes
> > ready the message a lot easier, thanks!]
> 
> My bad. 
> 
> >> The instructions give you very fine-grain flushing control, but the
> >> downside is that the app must track what it changes at that fine
> >> granularity.  Both models work, but there's a trade-off.
> > 
> > No, the cache flush model simply does not work without a lot of hard
> > work to enable it first.
> 
> It's working well enough to pass tests that simulate crashes and
> various workload tests for the apps involved. And I agree there
> has been a lot of hard work behind it. I guess I'm not sure why you're
> saying it is impossible or not working.
> 
> Let's take an example: an app uses fallocate() to create a DAX file,
> mmap() to map it, msync() to flush changes. The app follows POSIX
> meaning it doesn't expect file metadata to be flushed magically, etc.
> The app is tested carefully and it works correctly.  Now the msync()
> call used to flush stores is replaced by flushing instructions.
> What's broken?

You haven't told the filesystem to flush any dirty metadata required
to access the user data to persistent storage.  If the zeroing and
unwritten extent conversion that is run by the filesystem during
write faults into preallocated blocks isn't persistent, then after a
crash the file will read back as unwritten extents, returning zeros
rather than the data that was written.

msync() calls fsync() on file backed pages, which makes file metadata
changes persistent.  Indeed, if you read the fdatasync man page, you
might have noticed that it makes explicit reference that it requires
the filesystem to flush the metadata needed to access the data that
is being synced. IOWs, the filesystem knows about this dirty
metadata that needs to be flushed to ensure data integrity,
userspace doesn't.

Not to mention that the filesystem will convert and zero much more
than just a single cacheline (whole pages at minimum, could be 2MB
extents for large pages, etc) so the filesystem may require CPU
cache flushes over a much wider range of cachelines than the
application realises are dirty and require flushing for data
integrity purposes. The filesystem knows about these dirty cache
lines, userspace doesn't.

IOWs, your userspace library may have made sure the data it modifies
is in the physical location via your userspace CPU cache flushes,
but there can be a lot of stuff it doesn't know about internal to
the filesystem that also needs to be flushed to ensure data integrity
is maintained.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [RFC 0/2] New MAP_PMEM_AWARE mmap flag
  2016-02-22 15:34             ` Jeff Moyer
  2016-02-22 17:44               ` Christoph Hellwig
  2016-02-22 21:50               ` Dave Chinner
@ 2016-02-23 13:51               ` Boaz Harrosh
  2016-02-23 14:22                 ` Jeff Moyer
  2 siblings, 1 reply; 70+ messages in thread
From: Boaz Harrosh @ 2016-02-23 13:51 UTC (permalink / raw)
  To: Jeff Moyer, Dave Chinner
  Cc: Arnd Bergmann, linux-nvdimm, Oleg Nesterov, linux-mm, Mel Gorman,
	Johannes Weiner, Kirill A. Shutemov

On 02/22/2016 05:34 PM, Jeff Moyer wrote:
> Hi, Dave,
> 
> Dave Chinner <david@fromorbit.com> writes:
> 
>>> Another potential issue is that MAP_PMEM_AWARE is not enough on its
>>> own.  If the filesystem or inode does not support DAX the application
>>> needs to assume page cache semantics.  At a minimum MAP_PMEM_AWARE
>>> requests would need to fail if DAX is not available.
>>
>> They will always still need to call msync()/fsync() to guarantee
>> data integrity, because the filesystem metadata that indexes the
>> data still needs to be committed before data integrity can be
>> guaranteed. i.e. MAP_PMEM_AWARE by itself it not sufficient for data
>> integrity, and so the app will have to be written like any other app
>> that uses page cache based mmap().
>>
>> Indeed, the application cannot even assume that a fully allocated
>> file does not require msync/fsync because the filesystem may be
>> doing things like dedupe, defrag, copy on write, etc behind the back
>> of the application and so file metadata changes may still be in
>> volatile RAM even though the application has flushed it's data.
> 
> Once you hand out a persistent memory mapping, you sure as heck can't
> switch blocks around behind the back of the application.
> 
> But even if we're not dealing with persistent memory, you seem to imply
> that applications needs to fsync just in case the file system did
> something behind its back.  In other words, an application opening a
> fully allocated file and using fdatasync will also need to call fsync,
> just in case.  Is that really what you're suggesting?
> 
>> Applications have no idea what the underlying filesystem and storage
>> is doing and so they cannot assume that complete data integrity is
>> provided by userspace driven CPU cache flush instructions on their
>> file data.
> 
> This is surprising to me, and goes completely against the proposed
> programming model.  In fact, this is a very basic tenet of the operation
> of the nvml libraries on pmem.io.
> 
> That aside, let me see if I understand you correctly.
> 
> An application creates a file and writes to every single block in the
> thing, sync's it, closes it.  It then opens it back up, calls mmap with
> this new MAP_DAX flag or on a file system mounted with -o dax, and
> proceeds to access the file using loads and stores.  It persists its
> data by using non-temporal stores, flushing and fencing cpu
> instructions.
> 
> If I understand you correctly, you're saying that that application is
> not written correctly, because it needs to call fsync to persist
> metadata (that it presumably did not modify).  Is that right?
> 

Hi Jeff

I do not understand why you chose to drop my email address from your
reply. How should I feel when this happens?

And to your questions above, as I answered to Dave:
this is the novelty of my approach and the big difference between
what you guys thought with MAP_DAX and my patches as submitted.
 1. The application will/needs to call m/fsync to give the FS the freedom it needs.
 2. The m/fsync as well as the page faults will be very lightweight and fast;
    all that is required from the pmem-aware app is to do movnt stores and
    cl_flushes. (A rough sketch of this split follows below.)
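
A rough sketch of that split from the app's point of view (MAP_PMEM_AWARE is
the flag proposed by these patches; the numeric value and the flush helper
below are only illustrative):

#include <emmintrin.h>
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MAP_PMEM_AWARE
#define MAP_PMEM_AWARE 0x40000          /* placeholder value for the proposed flag */
#endif

/* flush the cachelines covering [addr, addr + len) and fence */
static void pmem_flush(const void *addr, size_t len)
{
        const char *p = (const char *)((uintptr_t)addr & ~63UL);

        for (; p < (const char *)addr + len; p += 64)
                _mm_clflush(p);
        _mm_sfence();
}

int main(void)
{
        int fd = open("/mnt/pmem/log", O_RDWR);
        char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_PMEM_AWARE, fd, 0);

        memcpy(p, "record", 7);
        pmem_flush(p, 7);       /* 1. the app makes its own data durable */
        fsync(fd);              /* 2. the FS makes its metadata durable; cheap,
                                   since no per-page dirty tracking was done */

        munmap(p, 4096);
        close(fd);
        return 0;
}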

So we enjoy both worlds. And actually more:
with your approach of fallocat(ing) all the space in advance you might as well
just partition the storage and use the DAX(ed) block device. But with my
approach you need not pre-allocate, and you enjoy the over-provisioned model and
the space allocation management of a modern FS. And even with all that you still
enjoy very fast direct-mapped stores by not requiring the current slow m/fsync().

I hope you guys stand behind me in my effort to accelerate userspace pmem apps
and still not break any built in assumptions.

> -Jeff

Cheers
Boaz


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [RFC 0/2] New MAP_PMEM_AWARE mmap flag
  2016-02-23  9:52                   ` Christoph Hellwig
  2016-02-23 10:07                     ` Rudoff, Andy
@ 2016-02-23 14:10                     ` Boaz Harrosh
  2016-02-23 16:56                       ` Dan Williams
  2016-02-23 17:25                       ` Ross Zwisler
  1 sibling, 2 replies; 70+ messages in thread
From: Boaz Harrosh @ 2016-02-23 14:10 UTC (permalink / raw)
  To: Christoph Hellwig, Rudoff, Andy, Dave Chinner, Jeff Moyer
  Cc: Arnd Bergmann, linux-nvdimm, Oleg Nesterov, linux-mm, Mel Gorman,
	Johannes Weiner, Kirill A. Shutemov

On 02/23/2016 11:52 AM, Christoph Hellwig wrote:
<>
> 
> And this is BS.  Using msync or fsync might not perform as well as not
> actually using them, but without them you do not get persistence.  If
> you use your pmem as a throw away cache that's fine, but for most people
> that is not the case.
> 

Hi Christoph

This is exactly my suggestion. My approach is *not* that we do not call
m/fsync to let the FS clean up.

In my model we still do that, only we eliminate the m/fsync slowness
and all the page-fault overhead, because the application instructs us
that we do not need to track the modified data cachelines; the
application is telling us that it will do so itself.

In my model the job is split:
 The app takes care of data persistence by passing MAP_PMEM_AWARE
 and doing its own cl_flushing / movnt.
 This is the heavy cost.

 The FS keeps track of the meta-data persistence as it already does, via the
 call to m/fsync. This is marginal compared to the above heavy
 IO.

Note that the FS is still free to move blocks around, as Dave said:
lock out page faults, unmap from user space, let the app fault again on a new
block. This will still work as before; already in COW we flush the old
block, so no persistence will be lost.

So this whole thread started with my patches, and my patches do not say
"no m/fsync"; they say: make this 3-8 times faster than today if the app
is participating in the heavy lifting.

Please tell me what you find wrong with my approach?

Thanks
Boaz


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [RFC 0/2] New MAP_PMEM_AWARE mmap flag
  2016-02-23 13:51               ` Boaz Harrosh
@ 2016-02-23 14:22                 ` Jeff Moyer
  0 siblings, 0 replies; 70+ messages in thread
From: Jeff Moyer @ 2016-02-23 14:22 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Dave Chinner, Arnd Bergmann, linux-nvdimm, Oleg Nesterov,
	linux-mm, Mel Gorman, Johannes Weiner, Kirill A. Shutemov

Boaz Harrosh <boaz@plexistor.com> writes:

>> An application creates a file and writes to every single block in the
>> thing, sync's it, closes it.  It then opens it back up, calls mmap with
>> this new MAP_DAX flag or on a file system mounted with -o dax, and
>> proceeds to access the file using loads and stores.  It persists its
>> data by using non-temporal stores, flushing and fencing cpu
>> instructions.
>> 
>> If I understand you correctly, you're saying that that application is
>> not written correctly, because it needs to call fsync to persist
>> metadata (that it presumably did not modify).  Is that right?
>> 
>
> Hi Jeff
>
> I do not understand why you chose to drop my email address from your
> reply? What do I need to feel when this happens?

Hi Boaz,

Sorry you were dropped, that was not my intention; I blame my mailer, as
I did hit reply-all.  No hard feelings?

> And to your questions above. As I answered to Dave.
> This is the novelty of my approach and the big difference between
> what you guys thought with MAP_DAX and my patches as submitted.
>  1. Application will/need to call m/fsync to let the FS the freedom it needs
>  2. The m/fsync as well as the page faults will be very light wait and fast,
>     all that is required from the pmem aware app is to do movnt stores and cl_flushes.

I like the approach for these existing file systems.

> So enjoying both worlds. And actually more:
> With your approach of fallocat(ing) the all space in advance you might as well
> just partition the storage and use the DAX(ed) block device. But with my
> approach you need not pre-allocate and enjoy the over provisioned model and
> the space allocation management of a modern FS. And even with all that still
> enjoy very fast direct mapped stores by not requiring the current slow m/fsync()

Well, that remains to be seen.  Certainly for O_DIRECT appends or hole
filling, there is extra overhead involved when compared to writes to
already-existing blocks.  Apply that to DAX and the overhead will be
much more prominent.  I'm not saying that this is definitely the case,
but I think it's something we'll have to measure going forward.

> I hope you guys stand behind me in my effort to accelerate userspace pmem apps
> and still not break any built in assumptions.

I do like the idea of reducing the msync/fsync overhead, though I admit
I haven't yet looked at the patches in any detail.  My mail in this
thread was primarily an attempt to wrap my head around why the fs needs
the fsync/msync at all.  I've got that cleared up now.

Cheers,
Jeff


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [RFC 0/2] New MAP_PMEM_AWARE mmap flag
  2016-02-23 14:10                     ` Boaz Harrosh
@ 2016-02-23 16:56                       ` Dan Williams
  2016-02-23 17:05                         ` Ross Zwisler
  2016-02-23 21:55                         ` Boaz Harrosh
  2016-02-23 17:25                       ` Ross Zwisler
  1 sibling, 2 replies; 70+ messages in thread
From: Dan Williams @ 2016-02-23 16:56 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Christoph Hellwig, Rudoff, Andy, Dave Chinner, Jeff Moyer,
	Arnd Bergmann, linux-nvdimm, Oleg Nesterov, linux-mm, Mel Gorman,
	Johannes Weiner, Kirill A. Shutemov

On Tue, Feb 23, 2016 at 6:10 AM, Boaz Harrosh <boaz@plexistor.com> wrote:
> On 02/23/2016 11:52 AM, Christoph Hellwig wrote:
[..]
> Please tell me what you find wrong with my approach?

Setting aside fs interactions you didn't respond to my note about
architectures where the pmem-aware app needs to flush caches due to
other non-pmem aware apps sharing the mapping.  Non-temporal stores
guaranteeing persistence on their own is an architecture-specific
feature.  I don't see how we can have generic support for mixed
MAP_PMEM_AWARE / unaware shared mappings when the architecture
dependency exists [1].
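
(For reference, the store sequence in question looks roughly like the sketch
below; whether the fence alone is enough, without an explicit flush of the
line, is exactly the architecture-specific point above:)

#include <emmintrin.h>

/* Non-temporal store of one 64-bit value on x86-64.  On parts where movnt
 * really bypasses the cache, the data has left the CPU caches after the
 * fence; on parts that update an already-cached line in place it has not,
 * and a clflush/clwb of that line would still be required. */
static void nt_store64(long long *dst, long long v)
{
        _mm_stream_si64(dst, v);        /* movnti */
        _mm_sfence();                   /* complete/order the NT store */
}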

I think Christoph has already pointed out the roadmap.  Get the
existing crop of DAX bugs squashed and then *maybe* look at something
like a MAP_SYNC to opt-out of userspace needing to call *sync.

[1]: 10.4.6.2 Caching of Temporal vs. Non-Temporal Data
"Some older CPU implementations (e.g., Pentium M) allowed addresses
being written with a non-temporal store instruction to be updated
in-place if the memory type was not WC and line was already in the
cache."

I wouldn't be surprised if other architectures had similar constraints.


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [RFC 0/2] New MAP_PMEM_AWARE mmap flag
  2016-02-23 16:56                       ` Dan Williams
@ 2016-02-23 17:05                         ` Ross Zwisler
  2016-02-23 17:26                           ` Dan Williams
  2016-02-23 21:55                         ` Boaz Harrosh
  1 sibling, 1 reply; 70+ messages in thread
From: Ross Zwisler @ 2016-02-23 17:05 UTC (permalink / raw)
  To: Dan Williams
  Cc: Boaz Harrosh, Arnd Bergmann, linux-nvdimm, Dave Chinner,
	Oleg Nesterov, Christoph Hellwig, linux-mm, Mel Gorman,
	Johannes Weiner, Kirill A. Shutemov

On Tue, Feb 23, 2016 at 08:56:57AM -0800, Dan Williams wrote:
> On Tue, Feb 23, 2016 at 6:10 AM, Boaz Harrosh <boaz@plexistor.com> wrote:
> > On 02/23/2016 11:52 AM, Christoph Hellwig wrote:
> [..]
> > Please tell me what you find wrong with my approach?
> 
> Setting aside fs interactions you didn't respond to my note about
> architectures where the pmem-aware app needs to flush caches due to
> other non-pmem aware apps sharing the mapping.  Non-temporal stores
> guaranteeing persistence on their own is an architecture specific
> feature.  I don't see how we can have a generic support for mixed
> MAP_PMEM_AWARE / unaware shared mappings when the architecture
> dependency exists [1].
> 
> I think Christoph has already pointed out the roadmap.  Get the
> existing crop of DAX bugs squashed and then *maybe* look at something
> like a MAP_SYNC to opt-out of userspace needing to call *sync.
> 
> [1]: 10.4.6.2 Caching of Temporal vs. Non-Temporal Data
> "Some older CPU implementations (e.g., Pentium M) allowed addresses
> being written with a non-temporal store instruction to be updated
> in-place if the memory type was not WC and line was already in the
> cache."
> 
> I wouldn't be surprised if other architectures had similar constraints.

I don't understand how this is an argument against Boaz's approach.  If
non-temporal stores are essentially broken, they are broken for both the
kernel use case and for the userspace use case, and (if we want to support
these platforms, which I'm not sure we do) we would need to fall back to
writes + explicit flushes for both kernel space and userspace.

As long as each of userspace and kernel space are doing the right thing on
whatever platform we are on to get persistence for the writes that they do, I
think that everything works out essentially the same way.


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [RFC 0/2] New MAP_PMEM_AWARE mmap flag
  2016-02-23 12:06                       ` Dave Chinner
@ 2016-02-23 17:10                         ` Ross Zwisler
  2016-02-23 21:47                           ` Dave Chinner
  0 siblings, 1 reply; 70+ messages in thread
From: Ross Zwisler @ 2016-02-23 17:10 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Rudoff, Andy, Arnd Bergmann, linux-nvdimm, Oleg Nesterov,
	Christoph Hellwig, linux-mm, Mel Gorman, Johannes Weiner,
	Kirill A. Shutemov

On Tue, Feb 23, 2016 at 11:06:44PM +1100, Dave Chinner wrote:
> On Tue, Feb 23, 2016 at 10:07:07AM +0000, Rudoff, Andy wrote:
> > 
> > > [Hi Andy - care to properly line break after ~75 character, that makes
> > > ready the message a lot easier, thanks!]
> > 
> > My bad. 
> > 
> > >> The instructions give you very fine-grain flushing control, but the
> > >> downside is that the app must track what it changes at that fine
> > >> granularity.  Both models work, but there's a trade-off.
> > > 
> > > No, the cache flush model simply does not work without a lot of hard
> > > work to enable it first.
> > 
> > It's working well enough to pass tests that simulate crashes and
> > various workload tests for the apps involved. And I agree there
> > has been a lot of hard work behind it. I guess I'm not sure why you're
> > saying it is impossible or not working.
> > 
> > Let's take an example: an app uses fallocate() to create a DAX file,
> > mmap() to map it, msync() to flush changes. The app follows POSIX
> > meaning it doesn't expect file metadata to be flushed magically, etc.
> > The app is tested carefully and it works correctly.  Now the msync()
> > call used to flush stores is replaced by flushing instructions.
> > What's broken?
> 
> You haven't told the filesytem to flush any dirty metadata required
> to access the user data to persistent storage.  If the zeroing and
> unwritten extent conversion that is run by the filesytem during
> write faults into preallocated blocks isn't persistent, then after a
> crash the file will read back as unwritten extents, returning zeros
> rather than the data that was written.
> 
> msync() calls fsync() on file back pages, which makes file metadata
> changes persistent.  Indeed, if you read the fdatasync man page, you
> might have noticed that it makes explicit reference that it requires
> the filesystem to flush the metadata needed to access the data that
> is being synced. IOWs, the filesystem knows about this dirty
> metadata that needs to be flushed to ensure data integrity,
> userspace doesn't.
> 
> Not to mention that the filesystem will convert and zero much more
> than just a single cacheline (whole pages at minimum, could be 2MB
> extents for large pages, etc) so the filesystem may require CPU
> cache flushes over a much wider range of cachelines that the
> application realises are dirty and require flushing for data
> integrity purposes. The filesytem knows about these dirty cache
> lines, userspace doesn't.

With the current code at least dax_zero_page_range() doesn't rely on
fsync/msync from userspace to make the zeroes that it writes persistent.  It
does all the necessary flushing and wmb_pmem() calls itself.  I agree that
this does not address your concern about metadata being in sync, though.

> IOWs, your userspace library may have made sure the data it modifies
> is in the physical location via your userspace CPU cache flushes,
> but there can be a lot of stuff it doesn't know about internal to
> the filesytem that also needs to be flushed to ensure data integrity
> is maintained.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [RFC 0/2] New MAP_PMEM_AWARE mmap flag
  2016-02-23 14:10                     ` Boaz Harrosh
  2016-02-23 16:56                       ` Dan Williams
@ 2016-02-23 17:25                       ` Ross Zwisler
  2016-02-23 22:47                         ` Boaz Harrosh
  1 sibling, 1 reply; 70+ messages in thread
From: Ross Zwisler @ 2016-02-23 17:25 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Christoph Hellwig, Rudoff, Andy, Dave Chinner, Jeff Moyer,
	Arnd Bergmann, linux-nvdimm, Oleg Nesterov, linux-mm, Mel Gorman,
	Johannes Weiner, Kirill A. Shutemov

On Tue, Feb 23, 2016 at 04:10:50PM +0200, Boaz Harrosh wrote:
> On 02/23/2016 11:52 AM, Christoph Hellwig wrote:
> <>
> > 
> > And this is BS.  Using msync or fsync might not perform as well as not
> > actually using them, but without them you do not get persistence.  If
> > you use your pmem as a throw away cache that's fine, but for most people
> > that is not the case.
> > 
> 
> Hi Christoph
> 
> So is exactly my suggestion. My approach is *not* the we do not call
> m/fsync to let the FS clean up.
> 
> In my model we still do that, only we eliminate the m/fsync slowness
> and the all page faults overhead by being instructed by the application
> that we do not need to track the data modified cachelines. Since the
> application is telling us that it will do so.
> 
> In my model the job is split:
>  App will take care of data persistence by instructing a MAP_PMEM_AWARE,
>  and doing its own cl_flushing / movnt.
>  Which is the heavy cost
> 
>  The FS will keep track of the Meta-Data persistence as it already does, via the
>  call to m/fsync. Which is marginal performance compared to the above heavy
>  IO.
> 
> Note that the FS is still free to move blocks around, as Dave said:
> lockout pagefaultes, unmap from user space, let app fault again on a new
> block. this will still work as before, already in COW we flush the old
> block so there will be no persistence lost.
> 
> So this all thread started with my patches, and my patches do not say
> "no m/fsync" they say, make this 3-8 times faster than today if the app
> is participating in the heavy lifting.
> 
> Please tell me what you find wrong with my approach?

It seems like we are trying to solve a couple of different problems:

1) Make page faults faster by skipping any radix tree insertions, tag updates,
etc.

2) Make fsync/msync faster by not flushing data that the application says it
is already making durable from userspace.

I agree that your approach seems to improve both of these problems, but I
would argue that it is an incomplete solution for problem #2 because a
fsync/msync from the PMEM aware application would still flush any radix tree
entries from *other* threads that were writing to the same file.

It seems like a more direct solution for #2 above would be to have a
metadata-only equivalent of fsync/fdatasync, say "fmetasync", which says "I'll
make the writes I do to my mmaps durable from userspace, but I need you to
sync all filesystem metadata for me, please".

This would allow a complete separation of data synchronization in userspace
from metadata synchronization in kernel space by the filesystem code.
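
A sketch of how that could look from the application side (fmetasync() is
purely hypothetical, as above, so the stub below just falls back to fsync();
pmem_persist() is the nvml/libpmem helper for flushing a range from
userspace):

#include <libpmem.h>    /* pmem_persist(), from the nvml libpmem library */
#include <stddef.h>
#include <unistd.h>

/* hypothetical metadata-only sync; no such call exists today, so use
 * fsync() as a stand-in just to make the sketch compile */
static int fmetasync(int fd)
{
        return fsync(fd);
}

/* app-driven data durability plus FS-driven metadata durability */
static void commit(int fd, const void *addr, size_t len)
{
        pmem_persist(addr, len);        /* userspace flush of the dirtied range */
        fmetasync(fd);                  /* commit the filesystem metadata only */
}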

By itself a fmetasync() type solution of course would do nothing for issue #1
- if that was a compelling issue you'd need something like the mmap tag you're
proposing to skip work on page faults.

All that being said, though, I agree with others in the thread that we should
still be focused on correctness, as we have a lot of correctness issues
remaining.  When we eventually get to the place where we are trying to do
performance optimizations, those optimizations should be measurement driven.

- Ross


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [RFC 0/2] New MAP_PMEM_AWARE mmap flag
  2016-02-23 17:05                         ` Ross Zwisler
@ 2016-02-23 17:26                           ` Dan Williams
  0 siblings, 0 replies; 70+ messages in thread
From: Dan Williams @ 2016-02-23 17:26 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Boaz Harrosh, Arnd Bergmann, linux-nvdimm, Dave Chinner,
	Oleg Nesterov, Christoph Hellwig, linux-mm, Mel Gorman,
	Johannes Weiner, Kirill A. Shutemov

On Tue, Feb 23, 2016 at 9:05 AM, Ross Zwisler
<ross.zwisler@linux.intel.com> wrote:
> On Tue, Feb 23, 2016 at 08:56:57AM -0800, Dan Williams wrote:
>> On Tue, Feb 23, 2016 at 6:10 AM, Boaz Harrosh <boaz@plexistor.com> wrote:
>> > On 02/23/2016 11:52 AM, Christoph Hellwig wrote:
>> [..]
>> > Please tell me what you find wrong with my approach?
>>
>> Setting aside fs interactions you didn't respond to my note about
>> architectures where the pmem-aware app needs to flush caches due to
>> other non-pmem aware apps sharing the mapping.  Non-temporal stores
>> guaranteeing persistence on their own is an architecture specific
>> feature.  I don't see how we can have a generic support for mixed
>> MAP_PMEM_AWARE / unaware shared mappings when the architecture
>> dependency exists [1].
>>
>> I think Christoph has already pointed out the roadmap.  Get the
>> existing crop of DAX bugs squashed and then *maybe* look at something
>> like a MAP_SYNC to opt-out of userspace needing to call *sync.
>>
>> [1]: 10.4.6.2 Caching of Temporal vs. Non-Temporal Data
>> "Some older CPU implementations (e.g., Pentium M) allowed addresses
>> being written with a non-temporal store instruction to be updated
>> in-place if the memory type was not WC and line was already in the
>> cache."
>>
>> I wouldn't be surprised if other architectures had similar constraints.
>
> I don't understand how this is an argument against Boaz's approach.  If
> non-temporal stores are essentially broken, they are broken for both the
> kernel use case and for the userspace use case, and (if we want to support
> these platforms, which I'm not sure we do) we would need to fall back to
> writes + explicit flushes for both kernel space and userspace.

MAP_PMEM_AWARE only declares self-awareness; it does not guarantee that
everyone else sharing the mapping is equally aware.  A pmem-aware app
on such an architecture would be free to flush once and use
non-temporal stores going forward, but if the mapping is shared it
needs to flush all the time.  Like I said before it needs to be
all-aware apps in a shared mapping or none, but it's moot because I
think something like MAP_SYNC is semantically much clearer.


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [RFC 0/2] New MAP_PMEM_AWARE mmap flag
  2016-02-23 17:10                         ` Ross Zwisler
@ 2016-02-23 21:47                           ` Dave Chinner
  2016-02-23 22:15                             ` Boaz Harrosh
  0 siblings, 1 reply; 70+ messages in thread
From: Dave Chinner @ 2016-02-23 21:47 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Rudoff, Andy, Arnd Bergmann, linux-nvdimm, Oleg Nesterov,
	Christoph Hellwig, linux-mm, Mel Gorman, Johannes Weiner,
	Kirill A. Shutemov

On Tue, Feb 23, 2016 at 10:10:59AM -0700, Ross Zwisler wrote:
> On Tue, Feb 23, 2016 at 11:06:44PM +1100, Dave Chinner wrote:
> > On Tue, Feb 23, 2016 at 10:07:07AM +0000, Rudoff, Andy wrote:
> > Not to mention that the filesystem will convert and zero much
> > more than just a single cacheline (whole pages at minimum, could
> > be 2MB extents for large pages, etc) so the filesystem may
> > require CPU cache flushes over a much wider range of cachelines
> > that the application realises are dirty and require flushing for
> > data integrity purposes. The filesytem knows about these dirty
> > cache lines, userspace doesn't.
> 
> With the current code at least dax_zero_page_range() doesn't rely

dax_clear_sectors(), actually.

> on fsync/msync from userspace to make the zeroes that it writes
> persistent.  It does all the necessary flushing and wmb_pmem()
> calls itself. 

Yes, that's the current implementation. We don't actually depend on
those semantics, though, and assuming we do is a demonstration of
the problems we're having right now. We could get rid of all the
synchronous cache flushes and just mark the range dirty in the
mapping radix tree and ensure that the cache flushes occur before
the conversion transaction is made durable. And to make my point
even clearer, that "flush data then transactions" ordering is
exactly how fsync is implemented.

i.e. what we've implemented right now is a basic, slow,
easy-to-make-work-correctly brute force solution. That doesn't mean
we always need to implement it this way, or that we are bound by the
way dax_clear_sectors() currently flushes cachelines before it
returns. It's just a simple implementation that provides the
ordering the *filesystem requires* to provide the correct data
integrity semantics to userspace.

pmem cache flushing is a durability mechanism, it's not a data
integrity solution. We have to flush CPU caches to provide
durability, but that alone is not sufficient to guarantee that
application data is complete and accessible after a crash.

> I agree that this does not address your concern
> about metadata being in sync, though.

Right, and msync/fsync is the only way to guarantee that.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [RFC 0/2] New MAP_PMEM_AWARE mmap flag
  2016-02-23 16:56                       ` Dan Williams
  2016-02-23 17:05                         ` Ross Zwisler
@ 2016-02-23 21:55                         ` Boaz Harrosh
  2016-02-23 22:33                           ` Dan Williams
  1 sibling, 1 reply; 70+ messages in thread
From: Boaz Harrosh @ 2016-02-23 21:55 UTC (permalink / raw)
  To: Dan Williams
  Cc: Christoph Hellwig, Rudoff, Andy, Dave Chinner, Jeff Moyer,
	Arnd Bergmann, linux-nvdimm, Oleg Nesterov, linux-mm, Mel Gorman,
	Johannes Weiner, Kirill A. Shutemov

On 02/23/2016 06:56 PM, Dan Williams wrote:
> On Tue, Feb 23, 2016 at 6:10 AM, Boaz Harrosh <boaz@plexistor.com> wrote:
>> On 02/23/2016 11:52 AM, Christoph Hellwig wrote:
> [..]
>> Please tell me what you find wrong with my approach?
> 
> Setting aside fs interactions you didn't respond to my note about
> architectures where the pmem-aware app needs to flush caches due to
> other non-pmem aware apps sharing the mapping.  Non-temporal stores
> guaranteeing persistence on their own is an architecture specific
> feature.  I don't see how we can have a generic support for mixed
> MAP_PMEM_AWARE / unaware shared mappings when the architecture
> dependency exists [1].
> 

I thought I did. Your Pentium M example below is just fine.
Or I'm missing something really big here, so you will need
to step me through it real slow.

Let's say we have a very silly system
[which BTW will never exist, because again of the precedent of NFS, and
applications are written to also work over NFS].

So say in this system there are two applications: one writes to all the even
addressed longs and the second writes to all the odd addressed longs in
a given page. Then app 1 syncs and so does app 2. Only after both syncs
is the system stable at a known checkpoint, because before the union of the
two syncs we do not know what will persist to the harddisk, right?

Now say we are DAX and app 1 is MAP_PMEM_AWARE and app 2 is old.

app 1] faults in page X; does its "evens" stores Pentium M movnt style directly
      to memory; the new values at all odd addresses are still in cache.

app 2] faults in page X; the page is in the radix tree because it is "old-style";
       does its cached "odds" stores; calls a sync that does cl_flush.

Let's look at a single cacheline.
- If app 2's sync came before app 1's movnt, then in memory we have a zebra of zeros and app 2's values,
  but once app 1 came along and did its movnt all expected values are there, persistent.

- If app 1's stores came before app 2's sync, then we have a zebra of app 1's values + zeros,
  but once the sync came both sets of values are persistent.

In any case we are guaranteed persistence when both apps have finished their
run. If we interrupt the run at any point before that, we will have zebra cachelines
even if we are talking about a regular harddisk with a regular volatile page cache.

So I fail to see what is broken, please explain. What broken scenario are you
seeing that would have worked before with DAX/non-DAX?

(For me, BTW, two applications that intimately share a single cacheline are really one
 multi-process application, and for me they need to understand what they are doing:
 if the admin upgrades the one he should also upgrade the other. Look at the real
 world, at who the heavy users of MAP_SHARED are: can you imagine the gcc linker sharing the same
 file with another concurrent application? The only one that I know that remotely does
 that is git. And git makes sure to take file locks when it writes such shared records.
 Git works over NFS as well.)

But seriously please explain the problem. I do not see one.

> I think Christoph has already pointed out the roadmap.  Get the
> existing crop of DAX bugs squashed 

Sure, that's always true. I'm a stability freak through and through, ask
the guys who work with me. I like to sleep at night ;-)

> and then *maybe* look at something
> like a MAP_SYNC to opt-out of userspace needing to call *sync.
> 

MAP_SYNC is another novelty, which as Dave showed will not be implemented
by such a legacy filesystem as xfs any time soon. Sync is needed not only
for memory stores. For me this is a superset of what I proposed, because
again any file write's persistence is built of two parts: durable data, and
durable meta-data. My flag says the app takes care of data, then the other part
can be done another way. For performance's sake, which is what I care about,
the heavy lifting is done on the data path; the meta-data is marginal.
If you want another flag for completeness' sake, then fine, have another flag.

The newly written app will need to do its new pmem_memcpy magic anyway.
Then we are only arguing about "do we need to call fsync() or not?"

I hate that you postpone this to never because it would be nice, for
philosophy's sake, to not have the app call sync at all, and all these
years we suffer the performance penalty. Instead we could put in a 10-line
patch today that has no risks, and yes it forces new apps to keep the ugly
fsync() call, but we get the targeted performance today instead of *maybe* never.

My path is a nice intermediate progression towards yours. Yours blocks my needs
indefinitely?

> [1]: 10.4.6.2 Caching of Temporal vs. Non-Temporal Data
> "Some older CPU implementations (e.g., Pentium M) allowed addresses
> being written with a non-temporal store instruction to be updated
> in-place if the memory type was not WC and line was already in the
> cache."
> 
> I wouldn't be surprised if other architectures had similar constraints.
> 

Perhaps you are looking at this from the wrong perspective. The Pentium M
can do this because the two cores shared the same cache. But we are talking
about POSIX file semantics, not CPU memory semantics, so some of our problems
go away.

Or am I missing something and I'm completely clueless? Please explain
slowly.

Thanks
Boaz


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [RFC 0/2] New MAP_PMEM_AWARE mmap flag
  2016-02-23 21:47                           ` Dave Chinner
@ 2016-02-23 22:15                             ` Boaz Harrosh
  2016-02-23 23:28                               ` Dave Chinner
  0 siblings, 1 reply; 70+ messages in thread
From: Boaz Harrosh @ 2016-02-23 22:15 UTC (permalink / raw)
  To: Dave Chinner, Ross Zwisler
  Cc: Arnd Bergmann, linux-nvdimm, Oleg Nesterov, Christoph Hellwig,
	linux-mm, Mel Gorman, Johannes Weiner, Kirill A. Shutemov

On 02/23/2016 11:47 PM, Dave Chinner wrote:
<>
> 
> i.e. what we've implemented right now is a basic, slow,
> easy-to-make-work-correctly brute force solution. That doesn't mean
> we always need to implement it this way, or that we are bound by the
> way dax_clear_sectors() currently flushes cachelines before it
> returns. It's just a simple implementation that provides the
> ordering the *filesystem requires* to provide the correct data
> integrity semantics to userspace.
> 

Or it can be written properly with movnt instructions and be even
faster than a simple memset, with no need for any cl_flushing, let alone
any radix-tree locking. (A rough sketch of such a loop is below.)
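
A minimal sketch of the kind of non-temporal zeroing loop meant here
(userspace intrinsics for illustration only -- the kernel would use its own
movnt helpers; assumes a 16-byte-aligned destination and a length that is a
multiple of 16):

#include <emmintrin.h>
#include <stddef.h>

/* zero a range with non-temporal stores: the stores bypass the CPU cache,
 * so a single fence at the end is enough and no per-line cl_flush (and no
 * dirty tracking) is needed */
static void nt_zero(void *dst, size_t len)
{
        __m128i zero = _mm_setzero_si128();
        char *p = dst;
        size_t i;

        for (i = 0; i < len; i += 16)
                _mm_stream_si128((__m128i *)(p + i), zero);     /* movntdq */
        _mm_sfence();
}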

That said, your suggestion above is 25%-100% slower than the current code,
because the cl_flushes will be needed eventually, and the atomics of a
lock take 25% of the time of a full page copy. You are forgetting we are
talking about memory and not a harddisk; the rules are different.
(Coming from NFS, it took me a long time to adjust.)

I'll send a patch to fix this
Thanks
Boaz


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [RFC 0/2] New MAP_PMEM_AWARE mmap flag
  2016-02-23 21:55                         ` Boaz Harrosh
@ 2016-02-23 22:33                           ` Dan Williams
  2016-02-23 23:07                             ` Boaz Harrosh
  2016-02-23 23:28                             ` Jeff Moyer
  0 siblings, 2 replies; 70+ messages in thread
From: Dan Williams @ 2016-02-23 22:33 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Christoph Hellwig, Rudoff, Andy, Dave Chinner, Jeff Moyer,
	Arnd Bergmann, linux-nvdimm, Oleg Nesterov, linux-mm, Mel Gorman,
	Johannes Weiner, Kirill A. Shutemov

On Tue, Feb 23, 2016 at 1:55 PM, Boaz Harrosh <boaz@plexistor.com> wrote:
[..]
> But seriously please explain the problem. I do not see one.
>
>> I think Christoph has already pointed out the roadmap.  Get the
>> existing crop of DAX bugs squashed
>
> Sure that's always true, I'm a stability freak through and through ask
> the guys who work with me. I like to sleep at night ;-)
>
>> and then *maybe* look at something
>> like a MAP_SYNC to opt-out of userspace needing to call *sync.
>>
>
> MAP_SYNC Is another novelty, which as Dave showed will not be implemented
> by such a legacy filesystem as xfs. any time soon. sync is needed not only
> for memory stores. For me this is a supper set of what I proposed. because
> again any file writes persistence is built of two parts durable data, and
> durable meta-data. My flag says, app takes care of data, then the other part
> can be done another way. For performance sake which is what I care about
> the heavy lifting is done at the data path. the meta data is marginal.
> If you want for completeness sake then fine have another flag.
>
> The new app written will need to do its new pmem_memcpy magic any way.
> then we are saying "do we need to call fsync() or not?"
>
> I hate it that you postpone that to never because it would be nice for
> philosophical sake to not have the app call sync at all. and all these
> years suffer the performance penalty. Instead of putting in a 10 liners
> patch today that has no risks, and yes forces new apps to keep the ugly
> fsync() call, but have the targeted performance today instead of *maybe* never.
>
> my path is a nice intermediate  progression to yours. Yours blocks my needs
> indefinitely?
>
>> [1]: 10.4.6.2 Caching of Temporal vs. Non-Temporal Data
>> "Some older CPU implementations (e.g., Pentium M) allowed addresses
>> being written with a non-temporal store instruction to be updated
>> in-place if the memory type was not WC and line was already in the
>> cache."
>>
>> I wouldn't be surprised if other architectures had similar constraints.
>>
>
> Perhaps you are looking at this from the wrong perspective. Pentium M
> can do this because the two cores shared the same cache. But we are talking
> about POSIX files semantics. Not CPU memory semantics. Some of our problems
> go away.
>
> Or am I missing something out and I'm completely clueless. Please explain
> slowly.
>

So I need to step back from the Pentium M example.  It's already a red
herring because, as Ross points out, prefetch concerns would require
that strawman application to be doing cache flushes itself.

Set that aside and sorry for that diversion.

In general MAP_SYNC makes more semantic sense in that the
filesystem knows that the application is not going to be calling *sync,
and it avoids triggering flushes for cachelines we don't care about.
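
A sketch of what that would mean for the app (MAP_SYNC is still hypothetical
at this point, so the value below is a placeholder; the app still flushes its
own stores, it just never calls m/fsync for the metadata):

#include <emmintrin.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MAP_SYNC
#define MAP_SYNC 0x80000        /* placeholder value for the hypothetical flag */
#endif

int main(void)
{
        int fd = open("/mnt/pmem/file", O_RDWR);
        char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_SYNC, fd, 0);

        memcpy(p, "hello", 6);
        _mm_clflush(p);         /* the app still flushes its own stores ... */
        _mm_sfence();
        /* ... but there is no msync()/fsync(): a MAP_SYNC-style mapping means
         * the FS keeps the metadata needed to reach these blocks durable on
         * its own */

        munmap(p, 4096);
        close(fd);
        return 0;
}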

Although if we had MAP_SYNC today we'd still be in the situation that
an app that fails to do its own cache flushes / bypass correctly gets
to keep the broken pieces.

The crux of the problem, in my opinion, is that we're asking for an "I
know what I'm doing" flag, and I expect that's an impossible statement
for a filesystem to trust generically.  If you can get MAP_PMEM_AWARE
in, great, but I'm more and more of the opinion that the "I know what
I'm doing" interface should be something separate from today's trusted
filesystems.


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [RFC 0/2] New MAP_PMEM_AWARE mmap flag
  2016-02-23 17:25                       ` Ross Zwisler
@ 2016-02-23 22:47                         ` Boaz Harrosh
  0 siblings, 0 replies; 70+ messages in thread
From: Boaz Harrosh @ 2016-02-23 22:47 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Christoph Hellwig, Rudoff, Andy, Dave Chinner, Jeff Moyer,
	Arnd Bergmann, linux-nvdimm, Oleg Nesterov, linux-mm, Mel Gorman,
	Johannes Weiner, Kirill A. Shutemov

On 02/23/2016 07:25 PM, Ross Zwisler wrote:
<>
> 
> It seems like we are trying to solve a couple of different problems:
> 
> 1) Make page faults faster by skipping any radix tree insertions, tag updates,
> etc.
> 
> 2) Make fsync/msync faster by not flushing data that the application says it
> is already making durable from userspace.
> 

I fail to see how these are separate issues: the reason you are keeping track
of pages in [1] is exactly because you want to know which they are in [2].
Only [2] matters, and [1] is what you thought was a necessary cost.
If you remember, I wanted to solve [1] differently, by iterating over the
extent lists of the mmap range and cl_flushing all the "written" pages in the
range, not only the ones actually dirtied. In our tests we found that for most real world
applications and benchmarks this works better than your approach, because
page-faults are fast.
There are however workloads that are much worse. In any case your way was easier
because it was a generic solution instead of an FS-specific implementation.

> I agree that your approach seems to improve both of these problems, but I
> would argue that it is an incomplete solution for problem #2 because a
> fsync/msync from the PMEM aware application would still flush any radix tree
> entries from *other* threads that were writing to the same file.
> 

No!! You meant applications, because threads are from the same application. If
a programmer is dumb enough to upgrade one mmap call site to the new flag and keep
all other sites legacy, without the flag and pmem_memcpy, then he can suffer; I do
not care for dumb programmers.

As for two applications, one new and one legacy, writing to the same file, each written
by a different team of programmers: for one, such setups barely exist. But for two,
this is an administrator issue. Yes, if he allows such a setup he knows that the
performance will not be as good as if both apps had upgraded, but it will still be better than
two legacy apps, because at least all the pages from the new app will not slow-sync.

> It seems like a more direct solution for #2 above would be to have a
> metadata-only equivalent of fsync/fdatasync, say "fmetasync", which says "I'll
> make the writes I do to my mmaps durable from userspace, but I need you to
> sync all filesystem metadata for me, please".
> 
> This would allow a complete separation of data synchronization in userspace
> from metadata synchronization in kernel space by the filesystem code.
> 
> By itself a fmetasync() type solution of course would do nothing for issue #1
> - if that was a compelling issue you'd need something like the mmap tag you're
> proposing to skip work on page faults.
> 

Again, a novelty solution to a theoretical-only problem, with only very marginal
performance gains, no users that I can see, and lots of work, including
FS-specific work.

> All that being said, though, I agree with others in the thread that we should
> still be focused on correctness, as we have a lot of correctness issues
> remaining.  When we eventually get to the place where we are trying to do
> performance optimizations, those optimizations should be measurement driven.
> 

What I'm hoping to do is establish a good practice for pmem-aware apps
that everyone can agree on and that will give us ground to optimize for,
so that pmem apps can start to be written and experimented with.

The patch I sent is so simple and non-intrusive that it can easily
be carried in the noise, and I cannot see how it breaks anything. And yes,
I am measurement driven, which is why I even bother.

And hence the RFC: let us establish a programming model first.

> - Ross
> 

Thanks
Boaz


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [RFC 0/2] New MAP_PMEM_AWARE mmap flag
  2016-02-23 22:33                           ` Dan Williams
@ 2016-02-23 23:07                             ` Boaz Harrosh
  2016-02-23 23:23                               ` Dan Williams
  2016-02-23 23:28                             ` Jeff Moyer
  1 sibling, 1 reply; 70+ messages in thread
From: Boaz Harrosh @ 2016-02-23 23:07 UTC (permalink / raw)
  To: Dan Williams
  Cc: Christoph Hellwig, Rudoff, Andy, Dave Chinner, Jeff Moyer,
	Arnd Bergmann, linux-nvdimm, Oleg Nesterov, linux-mm, Mel Gorman,
	Johannes Weiner, Kirill A. Shutemov

On 02/24/2016 12:33 AM, Dan Williams wrote:
<>
> 
> In general MAP_SYNC, makes more sense semantic sense in that the
> filesystem knows that the application is not going to be calling *sync
> and it avoids triggering flushes for cachelines we don't care about.
> 

I'm not sure I understand what you meant by
	"avoids triggering flushes for cachelines we don't care about". 

But again, MAP_SYNC is nice but too nice, and will never just happen. And
why does it need to be either/or, why not a progression?
[In fact our system already has MAP_SYNC.]

And you are contradicting yourself, because with MAP_SYNC an application
still needs to do its magical pmem_memcpy().

> Although if we had MAP_SYNC today we'd still be in the situation that
> an app that fails to do its own cache flushes / bypass correctly gets
> to keep the broken pieces.
> 

Yes that is true today and was always true and will always be true, your
point being?

> The crux of the problem, in my opinion, is that we're asking for an "I
> know what I'm doing" flag, and I expect that's an impossible statement
> for a filesystem to trust generically.  If you can get MAP_PMEM_AWARE
> in, great, but I'm more and more of the opinion that the "I know what
> I'm doing" interface should be something separate from today's trusted
> filesystems.
> 

I disagree. I'm not proposing any "trust me I know what I'm doing" flag;
the FS reveals nothing and trusts nothing.
All I'm saying is that the libc library I'm using has the new pmem_memcpy(),
and I'm using that instead of the old memcpy(). So the FS does not need to
wipe my face after I eat. Failing to do so just means a bug in the application,
which failed to actually move the proper data to the place it needs to go.
The FS kept its contract by providing back the exact blocks as they were written to
pmem, this time by the app using pmem_memcpy(). So the FS did not violate any
trust, and nothing the app did can cause any breakage to the shared filesystem,
only a bad thing to itself.

This is true anyway and was not invented by this patch.

Cheers
Boaz


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [RFC 0/2] New MAP_PMEM_AWARE mmap flag
  2016-02-23 23:07                             ` Boaz Harrosh
@ 2016-02-23 23:23                               ` Dan Williams
  2016-02-23 23:40                                 ` Boaz Harrosh
  0 siblings, 1 reply; 70+ messages in thread
From: Dan Williams @ 2016-02-23 23:23 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Christoph Hellwig, Rudoff, Andy, Dave Chinner, Jeff Moyer,
	Arnd Bergmann, linux-nvdimm, Oleg Nesterov, linux-mm, Mel Gorman,
	Johannes Weiner, Kirill A. Shutemov

On Tue, Feb 23, 2016 at 3:07 PM, Boaz Harrosh <boaz@plexistor.com> wrote:
> On 02/24/2016 12:33 AM, Dan Williams wrote:

>> The crux of the problem, in my opinion, is that we're asking for an "I
>> know what I'm doing" flag, and I expect that's an impossible statement
>> for a filesystem to trust generically.  If you can get MAP_PMEM_AWARE
>> in, great, but I'm more and more of the opinion that the "I know what
>> I'm doing" interface should be something separate from today's trusted
>> filesystems.
>>
>
> I disagree. I'm not saying any "trust me I know what I'm doing" flag.
> the FS reveals nothing and trusts nothing.
> All I'm saying is that the libc library I'm using as the new pmem_memecpy()
> and I'm using that instead of the old memecpy(). So the FS does not need to
> wipe my face after I eat. Failing to do so just means a bug in the application

"just means a bug in the application"

Who gets the bug report when an app gets its cache syncing wrong and
data corruption ensues?  And why isn't the fix for that bug that the
filesystem simply stops trusting MAP_PMEM_AWARE and goes back to syncing
cachelines on behalf of the app when it calls sync, as it must for
metadata consistency?  Problem solved globally for all broken usages
of MAP_PMEM_AWARE, and the flag loses all meaning as a result.

This is the takeaway I've internalized from Dave's pushback of these
new mmap flags.


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [RFC 0/2] New MAP_PMEM_AWARE mmap flag
  2016-02-23 22:15                             ` Boaz Harrosh
@ 2016-02-23 23:28                               ` Dave Chinner
  2016-02-24  0:08                                 ` Boaz Harrosh
  0 siblings, 1 reply; 70+ messages in thread
From: Dave Chinner @ 2016-02-23 23:28 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Ross Zwisler, Arnd Bergmann, linux-nvdimm, Oleg Nesterov,
	Christoph Hellwig, linux-mm, Mel Gorman, Johannes Weiner,
	Kirill A. Shutemov

On Wed, Feb 24, 2016 at 12:15:34AM +0200, Boaz Harrosh wrote:
> On 02/23/2016 11:47 PM, Dave Chinner wrote:
> <>
> > 
> > i.e. what we've implemented right now is a basic, slow,
> > easy-to-make-work-correctly brute force solution. That doesn't mean
> > we always need to implement it this way, or that we are bound by the
> > way dax_clear_sectors() currently flushes cachelines before it
> > returns. It's just a simple implementation that provides the
> > ordering the *filesystem requires* to provide the correct data
> > integrity semantics to userspace.
> > 
> 
> Or it can be written properly with movnt instructions and be even
> faster than a simple memset, and no need for any cl_flushing let alone
> any radix-tree locking.

Precisely my point - semantics of persistent memory durability are
going to change from kernel to kernel, architecture to architecture,
and hardware to hardware.

Assuming applications are going to handle all these wacky
differences to provide their users with robust data integrity is a
recipe for disaster. If application writers can't even use fsync
properly, I can guarantee you they are going to completely fuck up
data integrity when targeting pmem.

> That said your suggestion above is 25%-100% slower than current code
> because the cl_flushes will be needed eventually, and the atomics of a
> lock takes 25% the time of a full page copy.

So what? We can optimise for performance later, once we've provided
correct and resilient infrastructure. We've been fighting against
premature optimisation for performance from the start with DAX -
we've repeatedly had to undo stuff that was fast but broken, and
we're not doing that any more. Correctness comes first, then we can
address the performance issues via iterative improvement, like we do
with everything else.

> You are forgetting we are
> talking about memory and not harddisk. the rules are different.

That's bullshit, Boaz. I'm sick and tired of people saying "but pmem
is different" as justification for not providing correct, reliable
data integrity behaviour. Filesytems on PMEM have to follow all the
same rules as any other type of persistent storage we put
filesystems on.

Yes, the speed of the storage may expose the fact that an
unoptimised correct implementation is a lot more expensive than
ignoring correctness, but that does not mean we can ignore
correctness. Nor does it mean that a correct implementation will be
slow - it just means we haven't optimised for speed yet because
getting it correct is a hard problem and our primary focus.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [RFC 0/2] New MAP_PMEM_AWARE mmap flag
  2016-02-23 22:33                           ` Dan Williams
  2016-02-23 23:07                             ` Boaz Harrosh
@ 2016-02-23 23:28                             ` Jeff Moyer
  2016-02-23 23:34                               ` Dan Williams
  1 sibling, 1 reply; 70+ messages in thread
From: Jeff Moyer @ 2016-02-23 23:28 UTC (permalink / raw)
  To: Dan Williams
  Cc: Boaz Harrosh, Christoph Hellwig, Rudoff, Andy, Dave Chinner,
	Arnd Bergmann, linux-nvdimm, Oleg Nesterov, linux-mm, Mel Gorman,
	Johannes Weiner, Kirill A. Shutemov

Dan Williams <dan.j.williams@intel.com> writes:

> In general MAP_SYNC, makes more sense semantic sense in that the
> filesystem knows that the application is not going to be calling *sync

and so it makes sure its metadata is consistent after a write fault.

What you wrote is true for both MAP_SYNC and MAP_PMEM_AWARE.  :)

I assume you meant that MAP_SYNC is semantically cleaner from the file
system developer's point of view, yes?  Boaz, it might be helpful for
you to write down how an application might be structured to make use of
MAP_PMEM_AWARE.  Up to this point, I've been assuming you'd call it
whenever an application would call pcommit (or whatever the incantation
is on current CPUs).

> Although if we had MAP_SYNC today we'd still be in the situation that
> an app that fails to do its own cache flushes / bypass correctly gets
> to keep the broken pieces.

Dan, we already have this problem with existing storage and existing
interfaces.  Nothing changes with dax.

> The crux of the problem, in my opinion, is that we're asking for an "I
> know what I'm doing" flag, and I expect that's an impossible statement
> for a filesystem to trust generically.

The file system already trusts that.  If an application doesn't use
fsync properly, guess what, it will break.  This line of reasoning
doesn't make any sense to me.

> If you can get MAP_PMEM_AWARE in, great, but I'm more and more of the
> opinion that the "I know what I'm doing" interface should be something
> separate from today's trusted filesystems.

Just so I understand you, MAP_PMEM_AWARE isn't the "I know what I'm
doing" interface, right?

It sounds like we're a long way off from anything like MAP_SYNC going
in.  What I think would be useful at this stage is to come up with a
programming model we can all agree on.  ;-)  Crucially, I want to avoid
the O_DIRECT quagmire of different file systems behaving differently,
and having no way to actually query what behavior you're going to get.

Cheers,
Jeff


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [RFC 0/2] New MAP_PMEM_AWARE mmap flag
  2016-02-23 23:28                             ` Jeff Moyer
@ 2016-02-23 23:34                               ` Dan Williams
  2016-02-23 23:43                                 ` Jeff Moyer
  0 siblings, 1 reply; 70+ messages in thread
From: Dan Williams @ 2016-02-23 23:34 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Boaz Harrosh, Christoph Hellwig, Rudoff, Andy, Dave Chinner,
	Arnd Bergmann, linux-nvdimm, Oleg Nesterov, linux-mm, Mel Gorman,
	Johannes Weiner, Kirill A. Shutemov

On Tue, Feb 23, 2016 at 3:28 PM, Jeff Moyer <jmoyer@redhat.com> wrote:
>> The crux of the problem, in my opinion, is that we're asking for an "I
>> know what I'm doing" flag, and I expect that's an impossible statement
>> for a filesystem to trust generically.
>
> The file system already trusts that.  If an application doesn't use
> fsync properly, guess what, it will break.  This line of reasoning
> doesn't make any sense to me.

No, I'm worried about the case where an app specifies MAP_PMEM_AWARE,
uses fsync correctly, and fails to flush the cpu cache.

>> If you can get MAP_PMEM_AWARE in, great, but I'm more and more of the
>> opinion that the "I know what I'm doing" interface should be something
>> separate from today's trusted filesystems.
>
> Just so I understand you, MAP_PMEM_AWARE isn't the "I know what I'm
> doing" interface, right?

It is the "I know what I'm doing" interface, MAP_PMEM_AWARE asserts "I
know when to flush the cpu relative to an fsync()".


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [RFC 0/2] New MAP_PMEM_AWARE mmap flag
  2016-02-23 23:23                               ` Dan Williams
@ 2016-02-23 23:40                                 ` Boaz Harrosh
  2016-02-24  0:08                                   ` Dave Chinner
  0 siblings, 1 reply; 70+ messages in thread
From: Boaz Harrosh @ 2016-02-23 23:40 UTC (permalink / raw)
  To: Dan Williams
  Cc: Christoph Hellwig, Rudoff, Andy, Dave Chinner, Jeff Moyer,
	Arnd Bergmann, linux-nvdimm, Oleg Nesterov, linux-mm, Mel Gorman,
	Johannes Weiner, Kirill A. Shutemov

On 02/24/2016 01:23 AM, Dan Williams wrote:
> On Tue, Feb 23, 2016 at 3:07 PM, Boaz Harrosh <boaz@plexistor.com> wrote:
>> On 02/24/2016 12:33 AM, Dan Williams wrote:
> 
>>> The crux of the problem, in my opinion, is that we're asking for an "I
>>> know what I'm doing" flag, and I expect that's an impossible statement
>>> for a filesystem to trust generically.  If you can get MAP_PMEM_AWARE
>>> in, great, but I'm more and more of the opinion that the "I know what
>>> I'm doing" interface should be something separate from today's trusted
>>> filesystems.
>>>
>>
>> I disagree. I'm not saying any "trust me I know what I'm doing" flag.
>> the FS reveals nothing and trusts nothing.
>> All I'm saying is that the libc library I'm using as the new pmem_memecpy()
>> and I'm using that instead of the old memecpy(). So the FS does not need to
>> wipe my face after I eat. Failing to do so just means a bug in the application
> 
> "just means a bug in the application"
> 
> Who gets the bug report when an app gets its cache syncing wrong and
> data corruption ensues, and why isn't the fix for that bug that the
> filesystem simply stops trusting MAP_PMEM_AWARE and synching
> cachelines on behalf of the app when it calls sync as it must for
> metadata consistency.  Problem solved globally for all broken usages
> of MAP_PMEM_AWARE and the flag loses all meaning as a result.
> 

Because this will not fix the application's bugs. If the application
is broken then you do not know that this will fix it. It is broken: it failed
to uphold the contract it had with the Kernel.

It is like saying let's call fsync on file close because broken apps keep
forgetting to call fsync(), and file close is called even if the app crashes.
Will Dave do that?

No, if an app has a bug like this, failing to call the proper pmem_xxx routine
in the proper work flow, it might just as well have forgotten to call fsync, or maybe
it is still modifying memory after fsync was called. And your babysitting the app
will not help.

> This is the takeaway I've internalized from Dave's pushback of these
> new mmap flags.
> 

We are already used to telling the Firefox guys: you did not call fsync and
you lost data on a crash.

We will have a new mantra: "You did not use pmem_memcpy() but you used MAP_PMEM_AWARE."
We have contracts like that between the Kernel and apps all the time. I fail to see why
this one crosses the line for you.

Again the question is: can an app do something so stupid that it can break other
apps?

Cheers
Boaz


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [RFC 0/2] New MAP_PMEM_AWARE mmap flag
  2016-02-23 23:34                               ` Dan Williams
@ 2016-02-23 23:43                                 ` Jeff Moyer
  2016-02-23 23:56                                   ` Dan Williams
  0 siblings, 1 reply; 70+ messages in thread
From: Jeff Moyer @ 2016-02-23 23:43 UTC (permalink / raw)
  To: Dan Williams
  Cc: Boaz Harrosh, Christoph Hellwig, Rudoff, Andy, Dave Chinner,
	Arnd Bergmann, linux-nvdimm, Oleg Nesterov, linux-mm, Mel Gorman,
	Johannes Weiner, Kirill A. Shutemov

Dan Williams <dan.j.williams@intel.com> writes:

> On Tue, Feb 23, 2016 at 3:28 PM, Jeff Moyer <jmoyer@redhat.com> wrote:
>>> The crux of the problem, in my opinion, is that we're asking for an "I
>>> know what I'm doing" flag, and I expect that's an impossible statement
>>> for a filesystem to trust generically.
>>
>> The file system already trusts that.  If an application doesn't use
>> fsync properly, guess what, it will break.  This line of reasoning
>> doesn't make any sense to me.
>
> No, I'm worried about the case where an app specifies MAP_PMEM_AWARE
> uses fsync correctly, and fails to flush cpu cache.

I don't think the kernel needs to put training wheels on applications.

>>> If you can get MAP_PMEM_AWARE in, great, but I'm more and more of the
>>> opinion that the "I know what I'm doing" interface should be something
>>> separate from today's trusted filesystems.
>>
>> Just so I understand you, MAP_PMEM_AWARE isn't the "I know what I'm
>> doing" interface, right?
>
> It is the "I know what I'm doing" interface, MAP_PMEM_AWARE asserts "I
> know when to flush the cpu relative to an fsync()".

I see.  So I think your argument is that new file systems (such as Nova)
can have whacky new semantics, but existing file systems should provide
the more conservative semantics that they have provided since the dawn
of time (even if we add a new mmap flag to control the behavior).

I don't agree with that.  :)

Cheers,
Jeff


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [RFC 0/2] New MAP_PMEM_AWARE mmap flag
  2016-02-23 23:43                                 ` Jeff Moyer
@ 2016-02-23 23:56                                   ` Dan Williams
  2016-02-24  4:09                                     ` Ross Zwisler
  2016-02-24 15:02                                     ` Jeff Moyer
  0 siblings, 2 replies; 70+ messages in thread
From: Dan Williams @ 2016-02-23 23:56 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Boaz Harrosh, Christoph Hellwig, Rudoff, Andy, Dave Chinner,
	Arnd Bergmann, linux-nvdimm, Oleg Nesterov, linux-mm, Mel Gorman,
	Johannes Weiner, Kirill A. Shutemov

On Tue, Feb 23, 2016 at 3:43 PM, Jeff Moyer <jmoyer@redhat.com> wrote:
> Dan Williams <dan.j.williams@intel.com> writes:
>
>> On Tue, Feb 23, 2016 at 3:28 PM, Jeff Moyer <jmoyer@redhat.com> wrote:
>>>> The crux of the problem, in my opinion, is that we're asking for an "I
>>>> know what I'm doing" flag, and I expect that's an impossible statement
>>>> for a filesystem to trust generically.
>>>
>>> The file system already trusts that.  If an application doesn't use
>>> fsync properly, guess what, it will break.  This line of reasoning
>>> doesn't make any sense to me.
>>
>> No, I'm worried about the case where an app specifies MAP_PMEM_AWARE
>> uses fsync correctly, and fails to flush cpu cache.
>
> I don't think the kernel needs to put training wheels on applications.
>
>>>> If you can get MAP_PMEM_AWARE in, great, but I'm more and more of the
>>>> opinion that the "I know what I'm doing" interface should be something
>>>> separate from today's trusted filesystems.
>>>
>>> Just so I understand you, MAP_PMEM_AWARE isn't the "I know what I'm
>>> doing" interface, right?
>>
>> It is the "I know what I'm doing" interface, MAP_PMEM_AWARE asserts "I
>> know when to flush the cpu relative to an fsync()".
>
> I see.  So I think your argument is that new file systems (such as Nova)
> can have whacky new semantics, but existing file systems should provide
> the more conservative semantics that they have provided since the dawn
> of time (even if we add a new mmap flag to control the behavior).
>
> I don't agree with that.  :)
>

Fair enough.  Recall, I was pushing MAP_DAX not too long ago.  It just
seems like a Sisyphean effort to push an mmap flag up the XFS hill and
maybe that effort is better spent somewhere else.


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [RFC 0/2] New MAP_PMEM_AWARE mmap flag
  2016-02-23 23:40                                 ` Boaz Harrosh
@ 2016-02-24  0:08                                   ` Dave Chinner
  0 siblings, 0 replies; 70+ messages in thread
From: Dave Chinner @ 2016-02-24  0:08 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Dan Williams, Christoph Hellwig, Rudoff, Andy, Jeff Moyer,
	Arnd Bergmann, linux-nvdimm, Oleg Nesterov, linux-mm, Mel Gorman,
	Johannes Weiner, Kirill A. Shutemov

On Wed, Feb 24, 2016 at 01:40:57AM +0200, Boaz Harrosh wrote:
> On 02/24/2016 01:23 AM, Dan Williams wrote:
> > On Tue, Feb 23, 2016 at 3:07 PM, Boaz Harrosh <boaz@plexistor.com> wrote:
> >> On 02/24/2016 12:33 AM, Dan Williams wrote:
> > 
> >>> The crux of the problem, in my opinion, is that we're asking for an "I
> >>> know what I'm doing" flag, and I expect that's an impossible statement
> >>> for a filesystem to trust generically.  If you can get MAP_PMEM_AWARE
> >>> in, great, but I'm more and more of the opinion that the "I know what
> >>> I'm doing" interface should be something separate from today's trusted
> >>> filesystems.
> >>>
> >>
> >> I disagree. I'm not saying any "trust me I know what I'm doing" flag.
> >> the FS reveals nothing and trusts nothing.
> >> All I'm saying is that the libc library I'm using as the new pmem_memecpy()
> >> and I'm using that instead of the old memecpy(). So the FS does not need to
> >> wipe my face after I eat. Failing to do so just means a bug in the application
> > 
> > "just means a bug in the application"
> > 
> > Who gets the bug report when an app gets its cache syncing wrong and
> > data corruption ensues, and why isn't the fix for that bug that the
> > filesystem simply stops trusting MAP_PMEM_AWARE and synching
> > cachelines on behalf of the app when it calls sync as it must for
> > metadata consistency.  Problem solved globally for all broken usages
> > of MAP_PMEM_AWARE and the flag loses all meaning as a result.
> > 
> 
> Because this will not fix the application's bugs. Because if the application
> is broken then you do not know that this will fix it. It is broken it failed
> to uphold the contract it had with the Kernel.

That's not the point Dan was making. Data corruption bugs are going
to get reported to the filesystem developers, not the application
developers, because users think that data corruption is always the
fault of the filesystem. How is the filesystem developer going to
know that a) the app is using DAX, b) the app has set some special
"I know what I'm doing" flag, and c) the app doesn't actually know
what it is doing?

We are simply going to assume c) - from long experience I don't
trust any application developer to understand how data integrity
works. Almost any app developer who says they understand how
filesystems provide data integrity is almost always completely
wrong.

Hell, this thread has made me understand that most pmem developers
don't understand how filesystems provide data integrity guarantees.
Why should we trust application developers to do better?

> It is like saying lets call fsync on file close because broken apps keep
> forgetting to call fsync(). And file close is called even if the app crashes.
> Will Dave do that?

/me points to XFS_ITRUNCATE and xfs_release().

Yes, we already flush data on close in situations where data loss is
common due to stupid application developers refusing to use fsync
because "it's too slow".

ext4 has similar flush on close behaviours for the same reasons.

> No if an app has a bug like this falling to call the proper pmem_xxx routine
> in the proper work flow, it might has just forgotten to call fsync, or maybe
> still modifying memory after fsync was called. And your babysitting the app
> will not help.

History tells us otherwise. Users always blame the filesystem first,
and then app developers will refuse to fix their applications
because it would either make their app slow or they think it's a
filesystem problem to solve because they tested on some other
filesystem and it didn't display that behaviour. The result is we
end up working around such problems in the filesystem so that users
don't end up losing data due to shit applications.

The same will happen here - filesystems will end up ignoring this
special "I know what I'm doing" flag because the vast majority of
app developers don't know enough to even realise that they don't
know what they are doing.

I *really* don't care about speed and performance here. I care about
reliability, resilience and data integrity. Speed comes from the
storage hardware being fast, not from filesystems ignoring
reliability, resilience and data integrity.

> > This is the takeaway I've internalized from Dave's pushback of these
> > new mmap flags.
> > 
> 
> We are already used to tell the firefox guys, you did not call fsync and
> you lost data on a crash.
> 
> We will have a new mantra, "You did not use pmem_memcpy() but used MAP_PMEM_AWARE"
> We have contracts like that between Kernel and apps all the time. I fail to see why
> this one crossed the line for you?

So, you prefer to repeat past mistakes rather than learning from
them. I prefer that we don't make the same mistakes again and so
have to live with them for the next 20 years.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [RFC 0/2] New MAP_PMEM_AWARE mmap flag
  2016-02-23 23:28                               ` Dave Chinner
@ 2016-02-24  0:08                                 ` Boaz Harrosh
  0 siblings, 0 replies; 70+ messages in thread
From: Boaz Harrosh @ 2016-02-24  0:08 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Ross Zwisler, Arnd Bergmann, linux-nvdimm, Oleg Nesterov,
	Christoph Hellwig, linux-mm, Mel Gorman, Johannes Weiner,
	Kirill A. Shutemov

On 02/24/2016 01:28 AM, Dave Chinner wrote:
> On Wed, Feb 24, 2016 at 12:15:34AM +0200, Boaz Harrosh wrote:
>> On 02/23/2016 11:47 PM, Dave Chinner wrote:
>> <>
>>>
>>> i.e. what we've implemented right now is a basic, slow,
>>> easy-to-make-work-correctly brute force solution. That doesn't mean
>>> we always need to implement it this way, or that we are bound by the
>>> way dax_clear_sectors() currently flushes cachelines before it
>>> returns. It's just a simple implementation that provides the
>>> ordering the *filesystem requires* to provide the correct data
>>> integrity semantics to userspace.
>>>
>>
>> Or it can be written properly with movnt instructions and be even
>> faster than a simple memset, and no need for any cl_flushing let alone
>> any radix-tree locking.
> 
> Precisely my point - semantics of persistent memory durability are
> going to change from kernel to kernel, architecture to architecture,
> and hardware to hardware.
> 
> Assuming applications are going to handle all these wacky
> differences to provide their users with robust data integrity is a
> recipe for disaster. If applications writers can't even use fsync
> properly, I can guarantee you they are going to completely fuck up
> data integrity when targeting pmem.
> 
>> That said your suggestion above is 25%-100% slower than current code
>> because the cl_flushes will be needed eventually, and the atomics of a
>> lock takes 25% the time of a full page copy.
> 
> So what? We can optimise for performance later, once we've provided
> correct and resilient infrastructure. We've been fighting against
> premature optimisation for performance from teh start with DAX -
> we've repeatedly had to undo stuff that was fast but broken, and
> were not doing that any more. Correctness comes first, then we can
> address the performance issues via iterative improvement, like we do
> with everything else.
> 
>> You are forgetting we are
>> talking about memory and not harddisk. the rules are different.
> 
> That's bullshit, Boaz. I'm sick and tired of people saying "but pmem
> is different" as justification for not providing correct, reliable
> data integrity behaviour. Filesytems on PMEM have to follow all the
> same rules as any other type of persistent storage we put
> filesystems on.
> 
> Yes, the speed of the storage may expose the fact that am
> unoptimised correct implementation is a lot more expensive than
> ignoring correctness, but that does not mean we can ignore
> correctness. Nor does it mean that a correct implementation will be
> slow - it just means we haven't optimised for speed yet because
> getting it correct is a hard problem and our primary focus.
> 
> Cheers,
> 

Cheers indeed. Only, you failed to say where I have sacrificed correctness.
You are pushing at an open door. People who know me know I'm a sucker
for stability and correctness.
	YES!!! Correctness first! Must call fsync!! No BUGS!!
You will get no argument from me on that.

You yourself said that the current dax_clear_sectors() *is correct* but
is doing cl_flushes, and that it could just be dirtying the radix-tree plus
doing regular memory sets. I pointed out that this is slower because the
performance rules of memory are different from the performance rules of
block storage.

I never said anything about data correctness or transactions or data
placements. Did I?

And I agree with you. All the wacky details of pmem need to hide under
a gcc ARCH-specific library behind a generic, portable API, and not be trusted
to the app.

The API is really simple:

pmem_memcpy()
pmem_memset()
pmem_flush()

All the wacky craft is hidden under these three basic old C concepts.
For now they are written, implemented and tested under nvml; soon
enough they can move to a more generic place.
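
Just to make this concrete, here is a rough x86_64-only sketch of what two of
these primitives boil down to. This is only my illustration, assuming 64-byte
cache lines and aligned, multiple-of-8 copy sizes; the real nvml code handles
alignment and sizes, and picks clflush/clflushopt/movnt variants per CPU:

	/* illustration only, not the nvml implementation */
	#include <stddef.h>
	#include <stdint.h>
	#include <emmintrin.h>	/* _mm_clflush, _mm_mfence, _mm_stream_si64 */

	#define CACHELINE 64UL

	/* write back every cache line covering [addr, addr + len) */
	static void pmem_flush_sketch(const void *addr, size_t len)
	{
		uintptr_t p = (uintptr_t)addr & ~(CACHELINE - 1);
		uintptr_t end = (uintptr_t)addr + len;

		for (; p < end; p += CACHELINE)
			_mm_clflush((const void *)p);
		_mm_mfence();	/* order the flushes before returning */
	}

	/* copy with non-temporal (movnt) stores: the data goes around the
	 * CPU cache straight towards memory, so no clflush is needed after */
	static void *pmem_memcpy_sketch(void *dst, const void *src, size_t len)
	{
		long long *d = dst;
		const long long *s = src;

		for (size_t i = 0; i < len / sizeof(*d); i++)
			_mm_stream_si64(&d[i], s[i]);
		_mm_mfence();	/* fence the non-temporal stores */
		return dst;
	}

The point is that data copied with the movnt flavour never sits dirty in the
CPU cache, so there is nothing left for the kernel to cl_flush at m/fsync
time, which is exactly what MAP_PMEM_AWARE declares.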

> Dave.
> 

Yes, I usually do like to bullshit a lot in my personal life, it is lots of fun,
but not in computer work, because computers are boring and I'd rather go dancing
instead. And bullshit is a waste of time. I do know what I'm doing, and like you
I hate shortcuts, complications and wacky code. I like correctness and stability.

Cheers indeed ;-)
Boaz


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [RFC 0/2] New MAP_PMEM_AWARE mmap flag
  2016-02-23 23:56                                   ` Dan Williams
@ 2016-02-24  4:09                                     ` Ross Zwisler
  2016-02-24 19:30                                       ` Ross Zwisler
  2016-02-25  7:44                                       ` Boaz Harrosh
  2016-02-24 15:02                                     ` Jeff Moyer
  1 sibling, 2 replies; 70+ messages in thread
From: Ross Zwisler @ 2016-02-24  4:09 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jeff Moyer, Arnd Bergmann, linux-nvdimm, Dave Chinner,
	Oleg Nesterov, Christoph Hellwig, linux-mm, Mel Gorman,
	Johannes Weiner, Kirill A. Shutemov

On Tue, Feb 23, 2016 at 03:56:17PM -0800, Dan Williams wrote:
> On Tue, Feb 23, 2016 at 3:43 PM, Jeff Moyer <jmoyer@redhat.com> wrote:
> > Dan Williams <dan.j.williams@intel.com> writes:
> >
> >> On Tue, Feb 23, 2016 at 3:28 PM, Jeff Moyer <jmoyer@redhat.com> wrote:
> >>>> The crux of the problem, in my opinion, is that we're asking for an "I
> >>>> know what I'm doing" flag, and I expect that's an impossible statement
> >>>> for a filesystem to trust generically.
> >>>
> >>> The file system already trusts that.  If an application doesn't use
> >>> fsync properly, guess what, it will break.  This line of reasoning
> >>> doesn't make any sense to me.
> >>
> >> No, I'm worried about the case where an app specifies MAP_PMEM_AWARE
> >> uses fsync correctly, and fails to flush cpu cache.
> >
> > I don't think the kernel needs to put training wheels on applications.
> >
> >>>> If you can get MAP_PMEM_AWARE in, great, but I'm more and more of the
> >>>> opinion that the "I know what I'm doing" interface should be something
> >>>> separate from today's trusted filesystems.
> >>>
> >>> Just so I understand you, MAP_PMEM_AWARE isn't the "I know what I'm
> >>> doing" interface, right?
> >>
> >> It is the "I know what I'm doing" interface, MAP_PMEM_AWARE asserts "I
> >> know when to flush the cpu relative to an fsync()".
> >
> > I see.  So I think your argument is that new file systems (such as Nova)
> > can have whacky new semantics, but existing file systems should provide
> > the more conservative semantics that they have provided since the dawn
> > of time (even if we add a new mmap flag to control the behavior).
> >
> > I don't agree with that.  :)
> >
> 
> Fair enough.  Recall, I was pushing MAP_DAX not to long ago.  It just
> seems like a Sisyphean effort to push an mmap flag up the XFS hill and
> maybe that effort is better spent somewhere else.

Well, for what it's worth MAP_SYNC feels like the "right" solution to me.  I
understand that we are a ways from having it implemented, but it seems like
the correct way to have applications work with persistent memory in a perfect
world, and worth the effort.

MAP_PMEM_AWARE is interesting, but even in a perfect world it seems like a
partial solution - applications still need to call *sync to get the FS
metadata to be durable, and they have no reliable way of knowing which of
their actions will cause the metadata to be out of sync.

Dave, is your objection to the MAP_SYNC idea a practical one about complexity
and time to get it implemented, or do you think it is the wrong solution?


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [RFC 0/2] New MAP_PMEM_AWARE mmap flag
  2016-02-23 23:56                                   ` Dan Williams
  2016-02-24  4:09                                     ` Ross Zwisler
@ 2016-02-24 15:02                                     ` Jeff Moyer
  2016-02-24 22:56                                       ` Dave Chinner
  1 sibling, 1 reply; 70+ messages in thread
From: Jeff Moyer @ 2016-02-24 15:02 UTC (permalink / raw)
  To: Dan Williams
  Cc: Boaz Harrosh, Christoph Hellwig, Rudoff, Andy, Dave Chinner,
	Arnd Bergmann, linux-nvdimm, Oleg Nesterov, linux-mm, Mel Gorman,
	Johannes Weiner, Kirill A. Shutemov

Dan Williams <dan.j.williams@intel.com> writes:

>> I see.  So I think your argument is that new file systems (such as Nova)
>> can have whacky new semantics, but existing file systems should provide
>> the more conservative semantics that they have provided since the dawn
>> of time (even if we add a new mmap flag to control the behavior).
>>
>> I don't agree with that.  :)
>>
>
> Fair enough.  Recall, I was pushing MAP_DAX not to long ago.  It just
> seems like a Sisyphean effort to push an mmap flag up the XFS hill and
> maybe that effort is better spent somewhere else.

Given Dave's last response to Boaz, I see what you mean, and I also
understand Dave's reasoning better, now.  FWIW, I never disagreed with
spending effort elsewhere for now.  I did think that the mmap flag was
on the horizon, though.  From Dave's comments, I think the prospects of
that are slim to none.  That's fine, at least we have a definite
direction.  Time to update all of the slide decks.  =)

Cheers,
Jeff


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [RFC 0/2] New MAP_PMEM_AWARE mmap flag
  2016-02-24  4:09                                     ` Ross Zwisler
@ 2016-02-24 19:30                                       ` Ross Zwisler
  2016-02-25  9:46                                         ` Jan Kara
  2016-02-25  7:44                                       ` Boaz Harrosh
  1 sibling, 1 reply; 70+ messages in thread
From: Ross Zwisler @ 2016-02-24 19:30 UTC (permalink / raw)
  To: Jan Kara
  Cc: Dan Williams, Jeff Moyer, Arnd Bergmann, linux-nvdimm,
	Dave Chinner, Oleg Nesterov, Christoph Hellwig, linux-mm,
	Mel Gorman, Johannes Weiner, Kirill A. Shutemov, boaz

On Tue, Feb 23, 2016 at 09:09:47PM -0700, Ross Zwisler wrote:
> On Tue, Feb 23, 2016 at 03:56:17PM -0800, Dan Williams wrote:
> > On Tue, Feb 23, 2016 at 3:43 PM, Jeff Moyer <jmoyer@redhat.com> wrote:
> > > Dan Williams <dan.j.williams@intel.com> writes:
> > >
> > >> On Tue, Feb 23, 2016 at 3:28 PM, Jeff Moyer <jmoyer@redhat.com> wrote:
> > >>>> The crux of the problem, in my opinion, is that we're asking for an "I
> > >>>> know what I'm doing" flag, and I expect that's an impossible statement
> > >>>> for a filesystem to trust generically.
> > >>>
> > >>> The file system already trusts that.  If an application doesn't use
> > >>> fsync properly, guess what, it will break.  This line of reasoning
> > >>> doesn't make any sense to me.
> > >>
> > >> No, I'm worried about the case where an app specifies MAP_PMEM_AWARE
> > >> uses fsync correctly, and fails to flush cpu cache.
> > >
> > > I don't think the kernel needs to put training wheels on applications.
> > >
> > >>>> If you can get MAP_PMEM_AWARE in, great, but I'm more and more of the
> > >>>> opinion that the "I know what I'm doing" interface should be something
> > >>>> separate from today's trusted filesystems.
> > >>>
> > >>> Just so I understand you, MAP_PMEM_AWARE isn't the "I know what I'm
> > >>> doing" interface, right?
> > >>
> > >> It is the "I know what I'm doing" interface, MAP_PMEM_AWARE asserts "I
> > >> know when to flush the cpu relative to an fsync()".
> > >
> > > I see.  So I think your argument is that new file systems (such as Nova)
> > > can have whacky new semantics, but existing file systems should provide
> > > the more conservative semantics that they have provided since the dawn
> > > of time (even if we add a new mmap flag to control the behavior).
> > >
> > > I don't agree with that.  :)
> > >
> > 
> > Fair enough.  Recall, I was pushing MAP_DAX not to long ago.  It just
> > seems like a Sisyphean effort to push an mmap flag up the XFS hill and
> > maybe that effort is better spent somewhere else.
> 
> Well, for what it's worth MAP_SYNC feels like the "right" solution to me.  I
> understand that we are a ways from having it implemented, but it seems like
> the correct way to have applications work with persistent memory in a perfect
> world, and worth the effort.
> 
> MAP_PMEM_AWARE is interesting, but even in a perfect world it seems like a
> partial solution - applications still need to call *sync to get the FS
> metadata to be durable, and they have no reliable way of knowing which of
> their actions will cause the metadata to be out of sync.
> 
> Dave, is your objection to the MAP_SYNC idea a practical one about complexity
> and time to get it implemented, or do you think it's is the wrong solution?

Jan, I just noticed that this chain didn't CC you nor linux-fsdevel, so you
may have missed it.  All the gory details are here:

http://thread.gmane.org/gmane.linux.kernel.mm/146691

Let me provide a little background for my question.  (Everyone else on the
thread feel free to jump in if you feel like my summary is incorrect or
incomplete.)

There is a new persistent memory programming model outlined on pmem.io and
implemented by the NVM Library (nvml).

http://pmem.io/
http://pmem.io/nvml/
https://github.com/pmem/nvml/

This new programming model is based on the idea that an application should be
able to create a DAX MMAP, and then from then on satisfy the data durability
requirements of the application purely in userspace.  This is done by using
non-temporal stores or cached writes followed by flushes, the same way that we
do things in the kernel.
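
To make that model concrete, here is a minimal sketch (my illustration only:
it assumes /mnt/pmem is a DAX mount and uses the libpmem calls from nvml;
error handling is trimmed):

	#include <fcntl.h>
	#include <string.h>
	#include <sys/mman.h>
	#include <unistd.h>
	#include <libpmem.h>	/* nvml: pmem_is_pmem(), pmem_persist(), pmem_memcpy_persist() */

	#define LEN 4096

	int main(void)
	{
		/* assumption: /mnt/pmem is a DAX-mounted filesystem */
		int fd = open("/mnt/pmem/data", O_CREAT | O_RDWR, 0644);
		char *p;

		ftruncate(fd, LEN);
		p = mmap(NULL, LEN, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

		if (pmem_is_pmem(p, LEN)) {
			/* durability handled purely in userspace: either a
			 * non-temporal copy, or cached stores followed by
			 * explicit flushes; no msync()/fsync() for the data */
			pmem_memcpy_persist(p, "hello", 6);
			strcpy(p + 64, "world");
			pmem_persist(p + 64, 6);
		} else {
			/* not actually pmem/DAX: fall back to the classic model */
			memcpy(p, "hello", 6);
			msync(p, LEN, MS_SYNC);
		}
		return 0;
	}

The debate in this thread is exactly about whether that data-only guarantee
from userspace is enough, given the filesystem metadata changes the
application cannot see.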

Dave was concerned that this breaks down for XFS because even if the
application were to sync all its writes to media, the filesystem could be
making associated metadata changes that the application wouldn't and couldn't
know about:

http://article.gmane.org/gmane.linux.kernel.mm/146699

To sync these metadata changes to media, the application would still need to
call *sync.

One proposal from Christoph was that we could add a MAP_SYNC flag that
essentially says "make all metadata operations synchronous":

http://article.gmane.org/gmane.linux.kernel.mm/146753

The worry is that this would be complex to implement, and that we maybe don't
want yet another DAX special case in the FS code.

Another way that we could implement this would be to key off of the DAX mount
option / inode setting for all mmaps that use DAX.  This would preclude the
need for changes to the mmap() API.

My question: How far away are we from having such a metadata durability
guarantee in ext4?  Do we have cases where the metadata changes associated
with a page fault, etc. could be out of sync with the data writes that are
being made durable by the application in userspace?

I see ext4 creating journal entries around page faults in places like
ext4_dax_fault() - this should durably record any metadata changes associated
with that page fault before the fault completes, correct?

Are there other cases you can think of with ext4 where we would need to call
*sync for DAX just to be sure we are safely synchronizing metadata?

Thanks,
- Ross


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [RFC 0/2] New MAP_PMEM_AWARE mmap flag
  2016-02-24 15:02                                     ` Jeff Moyer
@ 2016-02-24 22:56                                       ` Dave Chinner
  2016-02-25 16:24                                         ` Jeff Moyer
  0 siblings, 1 reply; 70+ messages in thread
From: Dave Chinner @ 2016-02-24 22:56 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Dan Williams, Boaz Harrosh, Christoph Hellwig, Rudoff, Andy,
	Arnd Bergmann, linux-nvdimm, Oleg Nesterov, linux-mm, Mel Gorman,
	Johannes Weiner, Kirill A. Shutemov

On Wed, Feb 24, 2016 at 10:02:26AM -0500, Jeff Moyer wrote:
> Dan Williams <dan.j.williams@intel.com> writes:
> 
> >> I see.  So I think your argument is that new file systems (such as Nova)
> >> can have whacky new semantics, but existing file systems should provide
> >> the more conservative semantics that they have provided since the dawn
> >> of time (even if we add a new mmap flag to control the behavior).
> >>
> >> I don't agree with that.  :)
> >>
> >
> > Fair enough.  Recall, I was pushing MAP_DAX not to long ago.  It just
> > seems like a Sisyphean effort to push an mmap flag up the XFS hill and
> > maybe that effort is better spent somewhere else.
> 
> Given Dave's last response to Boaz, I see what you mean, and I also
> understand Dave's reasoning better, now.  FWIW, I never disagreed with
> spending effort elsewhere for now.  I did think that the mmap flag was
> on the horizon, though.  From Dave's comments, I think the prospects of
> that are slim to none.  That's fine, at least we have a definite
> direction.  Time to update all of the slide decks.  =)

Well, let me clarify what I said a bit here, because I feel like I'm
being unfairly blamed for putting data integrity as the highest
priority for DAX+pmem instead of falling in line and chanting
"Performance! Performance! Performance!" with everyone else.

Let me state this clearly: I'm not opposed to making optimisations
that change the way applications and the kernel interact. I like the
idea of MAP_SYNC, but I see this sort of API/behaviour change as a
last resort when all else fails, not a "first and only" optimisation
option.

The big issue we have right now is that we haven't made the DAX/pmem
infrastructure work correctly and reliably for general use.  Hence
adding new APIs to work around cases where we haven't yet provided
correct behaviour, let alone optimised for performance, is, quite
frankly, a clear case of premature optimisation.

We need a solid foundation on which to build a fast, safe pmem
storage stack. Rushing to add checkbox performance requirements or
features to demonstrate "progress" leads us down the path of btrfs -
a code base that we are forever struggling with because the
foundation didn't solve known hard problems at an early stage of
developement (e.g. ENOSPC, single tree lock, using generic RAID and
device layers, etc). This results in a code base full of entrenched
deficiencies that are almost impossible to fix and I, personally, do
not want to end up with DAX being in a similar place.

Getting fsync to work with DAX is one of these "known hard problems"
that we really need to solve before we try to optimise for
performance. Once we have solid, workable infrastructure, we'll be
in a much better place to evaluate the merits of optimisations that
reduce or eliminate dirty tracking overhead that is required for
providing data integrity.

From this perspective, I'd much prefer that we look to generic
mapping infrastructure optimisations before we look to one-off API
additions for systems running PMEM. Yes, it's harder to do, but the
end result of such an approach is that everyone benefits, not just
some proprietary application that almost nobody uses.

Indeed, it may be that we need to revisit previous work like using an
rcu-aware btree for the mapping tree instead of a radix tree, as was
prototyped way back in ~2007 by Peter Zijlstra. If we can make
infrastructure changes that mostly remove the overhead of tracking
everything in the kernel, then we don't need to add special
userspace API changes to minimise the kernel tracking overhead.

Only if we can't bring the overhead of kernel-side dirty tracking
down to a reasonable level should we be considering a new API
that puts the responsibility on userspace for syncing data, and even
then we'll need to be very, very careful about it.

However, such discussions are a complete distraction from the problems
we need to solve right now. i.e. we need to focus on making DAX+pmem
work safely and reliably. Once we've done that, then we can focus on
performance optimisations and, perhaps, new interfaces to userspace.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [RFC 0/2] New MAP_PMEM_AWARE mmap flag
  2016-02-24  4:09                                     ` Ross Zwisler
  2016-02-24 19:30                                       ` Ross Zwisler
@ 2016-02-25  7:44                                       ` Boaz Harrosh
  1 sibling, 0 replies; 70+ messages in thread
From: Boaz Harrosh @ 2016-02-25  7:44 UTC (permalink / raw)
  To: Ross Zwisler, Dan Williams
  Cc: Arnd Bergmann, linux-nvdimm, Dave Chinner, Oleg Nesterov,
	Christoph Hellwig, linux-mm, Mel Gorman, Johannes Weiner,
	Kirill A. Shutemov

On 02/24/2016 06:09 AM, Ross Zwisler wrote:
> On Tue, Feb 23, 2016 at 03:56:17PM -0800, Dan Williams wrote:
<>
> 
> MAP_PMEM_AWARE is interesting, but even in a perfect world it seems like a
> partial solution - applications still need to call *sync to get the FS
> metadata to be durable, and they have no reliable way of knowing which of
> their actions will cause the metadata to be out of sync.
> 

So there is a very simple answer:
	Just like today.

Today you need to call m/fsync after you have finished all modifications
and you want a persistent point. This of course will work, i.e. write
the application the same as if the mount were not DAX, but do set the flag
and switch to pmem_memcpy all over. BTW, pmem_memcpy() will give you
a 10% gain on memory performance with a fully-cached FS as well.

I do not mind that. Just that with MAP_PMEM_AWARE the call to sync will
be fast and the page-faults much, much faster. I'm a pragmatic person; I'm saying
to application writers:
	Change nothing, have the same source code for both DAX and non-DAX
	mode. Just switch to pmem_memcpy() / pmem_flush() everywhere and set
	the mmap flag, and you get a 3x boost on your mmap performance.
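
Roughly, such an app looks like the sketch below. This is only an illustration:
the MAP_PMEM_AWARE value is a placeholder (the real one is whatever the kernel
patch defines), pmem_memcpy() is declared with the prototype used in this
thread, and update_record() is a made-up helper:

	#include <fcntl.h>
	#include <stddef.h>
	#include <sys/mman.h>
	#include <unistd.h>

	#ifndef MAP_PMEM_AWARE
	#define MAP_PMEM_AWARE	0x40000	/* placeholder value, for illustration only */
	#endif

	void *pmem_memcpy(void *dst, const void *src, size_t len);	/* from the pmem library */

	static void update_record(const char *path, const void *rec, size_t len)
	{
		int fd = open(path, O_RDWR);
		char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
			       MAP_SHARED | MAP_PMEM_AWARE, fd, 0);

		/* data durability is the app's job, done via pmem_memcpy() ... */
		pmem_memcpy(p, rec, len);

		/* ... but m/fsync is still called, exactly like the non-DAX build
		 * of the same app, so the FS can make its metadata durable. With
		 * MAP_PMEM_AWARE there are simply no dirty radix-tree entries
		 * left for the kernel to cl_flush here, so the call is fast. */
		msync(p, len, MS_SYNC);

		munmap(p, len);
		close(fd);
	}

The only source change versus the legacy version is the extra mmap flag and
swapping memcpy() for pmem_memcpy(); the m/fsync call sites stay exactly
where they were.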

> Dave, is your objection to the MAP_SYNC idea a practical one about complexity
> and time to get it implemented, or do you think it's is the wrong solution?

So you see, with MAP_SYNC you are asking developers to write two versions of their
app, the latter of which does not call m/fsync.

[BTW MAP_SYNC is a *very bad* name because with it you are requiring the applications
 to switch to pmem_memcpy() and persistent stores everywhere. It might be very
 confusing and people might assume that the Kernel can magically guess every time
 an mmap pointer was modified, even after the page-fault.
 It should be called something like MAP_PMEM_SYNC
]



^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [RFC 0/2] New MAP_PMEM_AWARE mmap flag
  2016-02-24 19:30                                       ` Ross Zwisler
@ 2016-02-25  9:46                                         ` Jan Kara
  0 siblings, 0 replies; 70+ messages in thread
From: Jan Kara @ 2016-02-25  9:46 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Jan Kara, Dan Williams, Jeff Moyer, Arnd Bergmann, linux-nvdimm,
	Dave Chinner, Oleg Nesterov, Christoph Hellwig, linux-mm,
	Mel Gorman, Johannes Weiner, Kirill A. Shutemov, boaz

On Wed 24-02-16 12:30:59, Ross Zwisler wrote:
> On Tue, Feb 23, 2016 at 09:09:47PM -0700, Ross Zwisler wrote:
> > On Tue, Feb 23, 2016 at 03:56:17PM -0800, Dan Williams wrote:
> > > On Tue, Feb 23, 2016 at 3:43 PM, Jeff Moyer <jmoyer@redhat.com> wrote:
> > > > Dan Williams <dan.j.williams@intel.com> writes:
> > > >
> > > >> On Tue, Feb 23, 2016 at 3:28 PM, Jeff Moyer <jmoyer@redhat.com> wrote:
> > > >>>> The crux of the problem, in my opinion, is that we're asking for an "I
> > > >>>> know what I'm doing" flag, and I expect that's an impossible statement
> > > >>>> for a filesystem to trust generically.
> > > >>>
> > > >>> The file system already trusts that.  If an application doesn't use
> > > >>> fsync properly, guess what, it will break.  This line of reasoning
> > > >>> doesn't make any sense to me.
> > > >>
> > > >> No, I'm worried about the case where an app specifies MAP_PMEM_AWARE
> > > >> uses fsync correctly, and fails to flush cpu cache.
> > > >
> > > > I don't think the kernel needs to put training wheels on applications.
> > > >
> > > >>>> If you can get MAP_PMEM_AWARE in, great, but I'm more and more of the
> > > >>>> opinion that the "I know what I'm doing" interface should be something
> > > >>>> separate from today's trusted filesystems.
> > > >>>
> > > >>> Just so I understand you, MAP_PMEM_AWARE isn't the "I know what I'm
> > > >>> doing" interface, right?
> > > >>
> > > >> It is the "I know what I'm doing" interface, MAP_PMEM_AWARE asserts "I
> > > >> know when to flush the cpu relative to an fsync()".
> > > >
> > > > I see.  So I think your argument is that new file systems (such as Nova)
> > > > can have whacky new semantics, but existing file systems should provide
> > > > the more conservative semantics that they have provided since the dawn
> > > > of time (even if we add a new mmap flag to control the behavior).
> > > >
> > > > I don't agree with that.  :)
> > > >
> > > 
> > > Fair enough.  Recall, I was pushing MAP_DAX not to long ago.  It just
> > > seems like a Sisyphean effort to push an mmap flag up the XFS hill and
> > > maybe that effort is better spent somewhere else.
> > 
> > Well, for what it's worth MAP_SYNC feels like the "right" solution to me.  I
> > understand that we are a ways from having it implemented, but it seems like
> > the correct way to have applications work with persistent memory in a perfect
> > world, and worth the effort.
> > 
> > MAP_PMEM_AWARE is interesting, but even in a perfect world it seems like a
> > partial solution - applications still need to call *sync to get the FS
> > metadata to be durable, and they have no reliable way of knowing which of
> > their actions will cause the metadata to be out of sync.
> > 
> > Dave, is your objection to the MAP_SYNC idea a practical one about complexity
> > and time to get it implemented, or do you think it's is the wrong solution?
> 
> Jan, I just noticed that this chain didn't CC you nor linux-fsdevel, so you
> may have missed it.  All the gory details are here:
> 
> http://thread.gmane.org/gmane.linux.kernel.mm/146691
> 
> Let me provide a little background for my question.  (Everyone else on the
> thread feel free to jump in if you feel like my summary is incorrect or
> incomplete.)
> 
> There is a new persistent memory programming model outlined on pmem.io and
> implemented by the NVM Library (nvml).
> 
> http://pmem.io/
> http://pmem.io/nvml/
> https://github.com/pmem/nvml/
> 
> This new programming model is based on the idea that an application should be
> able to create a DAX MMAP, and then from then on satisfy the data durability
> requirements of the application purely in userspace.  This is done by using
> non-temporal stores or cached writes followed by flushes, the same way that we
> do things in the kernel.
> 
> Dave was concerned that this breaks down for XFS because even if the
> application were to sync all its writes to media, the filesystem could be
> making associated metadata changes that the application wouldn't and couldn't
> know about:
> 
> http://article.gmane.org/gmane.linux.kernel.mm/146699
> 
> To sync these metadata changes to media, the application would still need to
> call *sync.
> 
> One proposal from Christoph was that we could add a MAP_SYNC flag that
> essentially says "make all metadata operations synchronous":
> 
> http://article.gmane.org/gmane.linux.kernel.mm/146753
> 
> The worry is that this would be complex to implement, and that we maybe don't
> want yet another DAX special case in the FS code.
> 
> Another way that we could implement this would be to key off of the DAX mount
> option / inode setting for all mmaps that use DAX.  This would preclude the
> need for changes to the mmap() API.
> 
> My question: How far away are we from having such a metadata durability
> guarantee in ext4?  Do we have cases where the metadata changes associated
> with a page fault, etc. could be out of sync with the data writes that are
> being made durable by the application in userspace?
> 
> I see ext4 creating journal entries around page faults in places like
> ext4_dax_fault() - this should durably record any metadata changes associated
> with that page fault before the fault completes, correct?

That is not true. Journalling makes sure metadata changes are recorded in
the journal but you have to commit the transaction to make the change
really durable. That happens either in response to sync / fsync or
asynchronously every couple of seconds. So with ext4 you have exactly the
same issues with durability as with XFS.

> Are there other cases you can think of with ext4 where we would need to call
> *sync for DAX just to be sure we are safely synchronizing metadata?

So I think implementing something like MAP_SYNC semantics for ext4 is
reasonably doable. Basically we would have to make sure that we commit the
transaction during the page fault itself, which is not that hard to do.
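
To make that concrete, here is a rough sketch of the idea (the names follow
the v4.5-era ext4 DAX code, but this is only an illustration, not a tested
patch): run the normal DAX fault and then force a journal commit before
returning, so whatever metadata the fault created is durable by the time
userspace can touch the page.

static int ext4_dax_sync_fault(struct vm_area_struct *vma,
                               struct vm_fault *vmf)
{
        struct inode *inode = file_inode(vma->vm_file);
        int ret;

        /* Run the normal DAX fault; for a write fault this may
         * allocate blocks or convert unwritten extents under a
         * running transaction. */
        ret = ext4_dax_fault(vma, vmf);
        if (ret & VM_FAULT_ERROR)
                return ret;

        /* Commit the transaction so the metadata changes made by the
         * fault are durable before userspace relies on the mapping. */
        if (ext4_force_commit(inode->i_sb))
                return VM_FAULT_SIGBUS;

        return ret;
}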

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [RFC 0/2] New MAP_PMEM_AWARE mmap flag
  2016-02-24 22:56                                       ` Dave Chinner
@ 2016-02-25 16:24                                         ` Jeff Moyer
  2016-02-25 19:11                                           ` Jeff Moyer
  2016-02-25 21:20                                           ` Dave Chinner
  0 siblings, 2 replies; 70+ messages in thread
From: Jeff Moyer @ 2016-02-25 16:24 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Dan Williams, Boaz Harrosh, Christoph Hellwig, Rudoff, Andy,
	Arnd Bergmann, linux-nvdimm, Oleg Nesterov, linux-mm, Mel Gorman,
	Johannes Weiner, Kirill A. Shutemov

Hi, Dave,

Dave Chinner <david@fromorbit.com> writes:

> Well, let me clarify what I said a bit here, because I feel like I'm
> being unfairly blamed for putting data integrity as the highest
> priority for DAX+pmem instead of falling in line and chanting
> "Performance! Performance! Performance!" with everyone else.

It's totally fair.  ;-)

> Let me state this clearly: I'm not opposed to making optimisations
> that change the way applications and the kernel interact. I like the
> idea of MAP_SYNC, but I see this sort of API/behaviour change as a
> last resort when all else fails, not a "first and only" optimisation
> option.

So, calling it "first and only" seems a bit unfair on your part.  I
don't think anyone asking for a MAP_SYNC option doesn't also want other
applications to work well.  That aside, this is where your opinion
differs from mine: I don't see MAP_SYNC as a last resort option.  And
let me be clear, this /is/ an opinion.  I have no hard facts to back it
up, precisely because we don't have any application we can use for a
comparison.  But, it seems plausible to me that no matter how well you
optimize your msync implementation, it will still be more expensive than
an application that doesn't call msync at all.  This obviously depends
on how the application is using the programming model, among other
things.  I agree that we would need real data to back this up.  However,
I don't see any reason to preclude such an implementation, or to leave
it as a last resort.  I think it should be part of our planning process
if it's reasonably feasible.
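
To make the comparison concrete, here is a minimal userspace sketch of the
two models (x86 only, assuming 'dst' is a page-aligned DAX mapping of
persistent memory, 'len' a multiple of the page size, and 64-byte
cachelines; this is an illustration, not nvml code):

#include <emmintrin.h>          /* _mm_clflush(), _mm_sfence() */
#include <string.h>
#include <sys/mman.h>

/* Legacy model: let the kernel track the dirtied pages and make the
 * data durable with msync(). */
static void store_then_msync(void *dst, const void *src, size_t len)
{
        memcpy(dst, src, len);
        msync(dst, len, MS_SYNC);
}

/* "pmem aware" model: flush the written cachelines from userspace and
 * never enter the kernel at all. */
static void store_then_flush(void *dst, const void *src, size_t len)
{
        char *p = dst;
        size_t i;

        memcpy(dst, src, len);
        for (i = 0; i < len; i += 64)           /* 64-byte cachelines */
                _mm_clflush(p + i);
        _mm_sfence();                           /* order the flushes */
}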

> The big issue we have right now is that we haven't made the DAX/pmem
> infrastructure work correctly and reliably for general use.  Hence
> adding new APIs to workaround cases where we haven't yet provided
> correct behaviour, let alone optimised for performance is, quite
> frankly, a clear case premature optimisation.

Again, I see the two things as separate issues.  You need both.
Implementing MAP_SYNC doesn't mean we don't have to solve the bigger
issue of making existing applications work safely.

Cheers,
Jeff

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [RFC 0/2] New MAP_PMEM_AWARE mmap flag
  2016-02-25 16:24                                         ` Jeff Moyer
@ 2016-02-25 19:11                                           ` Jeff Moyer
  2016-02-25 20:15                                             ` Dave Chinner
  2016-02-25 21:20                                           ` Dave Chinner
  1 sibling, 1 reply; 70+ messages in thread
From: Jeff Moyer @ 2016-02-25 19:11 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Arnd Bergmann, linux-nvdimm, Oleg Nesterov, Christoph Hellwig,
	linux-mm, Mel Gorman, Johannes Weiner, Kirill A. Shutemov

Jeff Moyer <jmoyer@redhat.com> writes:

>> The big issue we have right now is that we haven't made the DAX/pmem
>> infrastructure work correctly and reliably for general use.  Hence
>> adding new APIs to workaround cases where we haven't yet provided
>> correct behaviour, let alone optimised for performance is, quite
>> frankly, a clear case premature optimisation.
>
> Again, I see the two things as separate issues.  You need both.
> Implementing MAP_SYNC doesn't mean we don't have to solve the bigger
> issue of making existing applications work safely.

I want to add one more thing to this discussion, just for the sake of
clarity.  When I talk about existing applications and pmem, I mean
applications that already know how to detect and recover from torn
sectors.  Any application that assumes hardware does not tear sectors
should be run on a file system layered on top of the btt.

I think this underlying assumption may have been overlooked in this
discussion, and could very well be a source of confusion.

Cheers,
Jeff

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [RFC 0/2] New MAP_PMEM_AWARE mmap flag
  2016-02-25 19:11                                           ` Jeff Moyer
@ 2016-02-25 20:15                                             ` Dave Chinner
  2016-02-25 20:57                                               ` Jeff Moyer
  2016-02-25 21:08                                               ` Phil Terry
  0 siblings, 2 replies; 70+ messages in thread
From: Dave Chinner @ 2016-02-25 20:15 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Arnd Bergmann, linux-nvdimm, Oleg Nesterov, Christoph Hellwig,
	linux-mm, Mel Gorman, Johannes Weiner, Kirill A. Shutemov

On Thu, Feb 25, 2016 at 02:11:49PM -0500, Jeff Moyer wrote:
> Jeff Moyer <jmoyer@redhat.com> writes:
> 
> >> The big issue we have right now is that we haven't made the DAX/pmem
> >> infrastructure work correctly and reliably for general use.  Hence
> >> adding new APIs to workaround cases where we haven't yet provided
> >> correct behaviour, let alone optimised for performance is, quite
> >> frankly, a clear case premature optimisation.
> >
> > Again, I see the two things as separate issues.  You need both.
> > Implementing MAP_SYNC doesn't mean we don't have to solve the bigger
> > issue of making existing applications work safely.
> 
> I want to add one more thing to this discussion, just for the sake of
> clarity.  When I talk about existing applications and pmem, I mean
> applications that already know how to detect and recover from torn
> sectors.  Any application that assumes hardware does not tear sectors
> should be run on a file system layered on top of the btt.

Which turns off DAX, and hence makes this a moot discussion because
mmap is then buffered through the page cache and hence applications
*must use msync/fsync* to provide data integrity. Which also makes
them safe to use with DAX if we have a working fsync.

Keep in mind that existing storage technologies tear filesystem data
writes, too, because user data writes are filesystem block sized and
not atomic at the device level (i.e.  typical is 512 byte sector, 4k
filesystem block size, so there are 7 points in a single write where
a tear can occur on a crash).

IOWs existing storage already has the capability of tearing user
data on crash and has been doing so for at least the last 30 years.
Hence I really don't see any fundamental difference here with
pmem+DAX - the only difference is that the tear granularity is
smaller (CPU cacheline rather than sector).
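
(To put numbers on it: a 4k filesystem block is 4096/512 = 8 sectors, hence
the 7 internal boundaries above; at 64-byte cacheline granularity the same
4k page has 4096/64 = 64 lines and thus 63 possible tear points - smaller
pieces, but the same class of failure.)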

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [RFC 0/2] New MAP_PMEM_AWARE mmap flag
  2016-02-25 20:15                                             ` Dave Chinner
@ 2016-02-25 20:57                                               ` Jeff Moyer
  2016-02-25 22:27                                                 ` Dave Chinner
  2016-02-25 21:08                                               ` Phil Terry
  1 sibling, 1 reply; 70+ messages in thread
From: Jeff Moyer @ 2016-02-25 20:57 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Arnd Bergmann, linux-nvdimm, Oleg Nesterov, Christoph Hellwig,
	linux-mm, Mel Gorman, Johannes Weiner, Kirill A. Shutemov

Good morning, Dave,

Dave Chinner <david@fromorbit.com> writes:

> On Thu, Feb 25, 2016 at 02:11:49PM -0500, Jeff Moyer wrote:
>> Jeff Moyer <jmoyer@redhat.com> writes:
>> 
>> >> The big issue we have right now is that we haven't made the DAX/pmem
>> >> infrastructure work correctly and reliably for general use.  Hence
>> >> adding new APIs to workaround cases where we haven't yet provided
>> >> correct behaviour, let alone optimised for performance is, quite
>> >> frankly, a clear case premature optimisation.
>> >
>> > Again, I see the two things as separate issues.  You need both.
>> > Implementing MAP_SYNC doesn't mean we don't have to solve the bigger
>> > issue of making existing applications work safely.
>> 
>> I want to add one more thing to this discussion, just for the sake of
>> clarity.  When I talk about existing applications and pmem, I mean
>> applications that already know how to detect and recover from torn
>> sectors.  Any application that assumes hardware does not tear sectors
>> should be run on a file system layered on top of the btt.
>
> Which turns off DAX, and hence makes this a moot discussion because

You're missing the point.  You can't take applications that don't know
how to deal with torn sectors and put them on a block device that does
not provide power fail write atomicity of a single sector.  That said,
there are two classes of applications that /can/ make use of file
systems layered on top of /dev/pmem devices:

1) applications that know how to deal with torn sectors
2) these new-fangled applications written for persistent memory

Thus, it's not a moot point.  There are existing applications that can
make use of the msync/fsync code we've been discussing.  And then there
are these other applications that want to take care of the persistence
all on their own.

> Keep in mind that existing storage technologies tear fileystem data
> writes, too, because user data writes are filesystem block sized and
> not atomic at the device level (i.e.  typical is 512 byte sector, 4k
> filesystem block size, so there are 7 points in a single write where
> a tear can occur on a crash).

You are conflating torn pages (pages being a generic term for anything
greater than a sector) and torn sectors.  That point aside, you can do
O_DIRECT I/O on a sector granularity, even on a file system that has a
block size larger than the device logical block size.  Thus,
applications can control the blast radius of a write.
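
For illustration, a sketch of what controlling the blast radius looks like
from userspace - a single aligned 512-byte O_DIRECT write (the helper name
is made up; the alignment requirements are the point):

#define _GNU_SOURCE             /* O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Write exactly one 512-byte sector at 'offset'.  With O_DIRECT the
 * buffer, offset and length must all be aligned to the device's
 * logical block size (512 bytes assumed here). */
int write_one_sector(const char *path, off_t offset, const void *data)
{
        void *buf;
        int fd, ret = -1;

        if (posix_memalign(&buf, 512, 512))
                return -1;
        memcpy(buf, data, 512);

        fd = open(path, O_WRONLY | O_DIRECT);
        if (fd >= 0) {
                if (pwrite(fd, buf, 512, offset) == 512)
                        ret = 0;
                close(fd);
        }
        free(buf);
        return ret;
}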

> IOWs existing storage already has the capability of tearing user
> data on crash and has been doing so for a least they last 30 years.

And yet applications assume that this doesn't happen.  Have a look at
this:
  https://www.sqlite.org/psow.html

> Hence I really don't see any fundamental difference here with
> pmem+DAX - the only difference is that the tear granuarlity is
> smaller (CPU cacheline rather than sector).

Like it or not, applications have been assuming that they get power fail
write atomicity of a single sector, and they have (mostly) been right.
With persistent memory, I am certain there will be torn writes.  We've
already seen it in testing.  This is why I don't see file systems on a
pmem device as general purpose.

Irrespective of what storage systems do today, I think it's good
practice to not leave landmines for applications that will use
persistent memory.  Let's be very clear on what is expected to work and
what isn't.  I hope I've made my stance clear.

Cheers,
Jeff

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [RFC 0/2] New MAP_PMEM_AWARE mmap flag
  2016-02-25 20:15                                             ` Dave Chinner
  2016-02-25 20:57                                               ` Jeff Moyer
@ 2016-02-25 21:08                                               ` Phil Terry
  2016-02-25 21:39                                                 ` Dave Chinner
  1 sibling, 1 reply; 70+ messages in thread
From: Phil Terry @ 2016-02-25 21:08 UTC (permalink / raw)
  To: Dave Chinner, Jeff Moyer
  Cc: Arnd Bergmann, linux-nvdimm, Oleg Nesterov, Christoph Hellwig,
	linux-mm, Mel Gorman, Johannes Weiner, Kirill A. Shutemov

On 02/25/2016 12:15 PM, Dave Chinner wrote:
> On Thu, Feb 25, 2016 at 02:11:49PM -0500, Jeff Moyer wrote:
>> Jeff Moyer <jmoyer@redhat.com> writes:
>>
>>>> The big issue we have right now is that we haven't made the DAX/pmem
>>>> infrastructure work correctly and reliably for general use.  Hence
>>>> adding new APIs to workaround cases where we haven't yet provided
>>>> correct behaviour, let alone optimised for performance is, quite
>>>> frankly, a clear case premature optimisation.
>>> Again, I see the two things as separate issues.  You need both.
>>> Implementing MAP_SYNC doesn't mean we don't have to solve the bigger
>>> issue of making existing applications work safely.
>> I want to add one more thing to this discussion, just for the sake of
>> clarity.  When I talk about existing applications and pmem, I mean
>> applications that already know how to detect and recover from torn
>> sectors.  Any application that assumes hardware does not tear sectors
>> should be run on a file system layered on top of the btt.
> Which turns off DAX, and hence makes this a moot discussion because
> mmap is then buffered through the page cache and hence applications
> *must use msync/fsync* to provide data integrity. Which also makes
> them safe to use with DAX if we have a working fsync.
>
> Keep in mind that existing storage technologies tear fileystem data
> writes, too, because user data writes are filesystem block sized and
> not atomic at the device level (i.e.  typical is 512 byte sector, 4k
> filesystem block size, so there are 7 points in a single write where
> a tear can occur on a crash).
Is that really true? Storage to date sits on the PCIe/SATA etc. IO chain.
The locks and application crash scenarios when traversing down this
chain are such that the device will not have its DMA programmed until
the whole 4K etc. page is flushed to memory, pinned for DMA, and so on. Then
the DMA to the device is kicked off. If power fails during the DMA,
either we have devices which are supercapped or battery backed to flush
their write caches, and/or firmware which will abort the damaged
results of the torn DMA during the device's internal metadata recovery when
power is restored. (The hardware/firmware on an HDD has been way more
complex than the simple mental model might lead one to expect for years.)
All of this is wrapped inside filesystem transaction semantics.

This is a crucial difference for "storage class memory" on the DRAM bus.
The NVDIMMs cannot be DMA masters and instead passively receive
cache-line writes. A "buffered DIMM", as alluded to in the pmem.io Device
Writer's Guide, might have intelligence on the DIMM to detect, map and
recover from tearing via the Block Window Aperture driver interface, but
it cannot do so on a PMEM interface. Hence btt on the host, with full
transparency to manage the memory on the NVDIMM, is required for the PMEM
driver. Given this it doesn't make sense to try to put it on the device
for the BW driver either.

In both cases, btt is not indirecting the buffer (as for a DMA-master IO
type device) but is simply using the same pmem API primitives to manage
its own metadata about the filesystem writes, to detect and recover from
tears after the event. In what sense is DAX disabled for this?

So I think (please correct me if I'm wrong) that actually the
hardware/firmware guys have been fixing the torn sector problem for the
last 30 years, and "storage on the memory channel" has reintroduced
the problem. To use an SSD analogy, you fix this problem with the
FTL, and as we've seen with recent software-defined flash and
open-channel approaches, you can have the FTL either on the device or on
the host. Absence of bus-master DMA on a DIMM (even with the BW
aperture software) makes a device-based solution problematic, so the host
solution a la btt is required for both PMEM and BW.

>
> IOWs existing storage already has the capability of tearing user
> data on crash and has been doing so for a least they last 30 years.
> Hence I really don't see any fundamental difference here with
> pmem+DAX - the only difference is that the tear granuarlity is
> smaller (CPU cacheline rather than sector).
>
> Cheers,
>
> Dave.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [RFC 0/2] New MAP_PMEM_AWARE mmap flag
  2016-02-25 16:24                                         ` Jeff Moyer
  2016-02-25 19:11                                           ` Jeff Moyer
@ 2016-02-25 21:20                                           ` Dave Chinner
  2016-02-29 20:32                                             ` Jeff Moyer
  1 sibling, 1 reply; 70+ messages in thread
From: Dave Chinner @ 2016-02-25 21:20 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Dan Williams, Boaz Harrosh, Christoph Hellwig, Rudoff, Andy,
	Arnd Bergmann, linux-nvdimm, Oleg Nesterov, linux-mm, Mel Gorman,
	Johannes Weiner, Kirill A. Shutemov

On Thu, Feb 25, 2016 at 11:24:57AM -0500, Jeff Moyer wrote:
> Hi, Dave,
> 
> Dave Chinner <david@fromorbit.com> writes:
> 
> > Well, let me clarify what I said a bit here, because I feel like I'm
> > being unfairly blamed for putting data integrity as the highest
> > priority for DAX+pmem instead of falling in line and chanting
> > "Performance! Performance! Performance!" with everyone else.
> 
> It's totally fair.  ;-)
> 
> > Let me state this clearly: I'm not opposed to making optimisations
> > that change the way applications and the kernel interact. I like the
> > idea of MAP_SYNC, but I see this sort of API/behaviour change as a
> > last resort when all else fails, not a "first and only" optimisation
> > option.
> 
> So, calling it "first and only" seems a bit unfair on your part.

Maybe so, but it's a valid observation - it's being pushed as a way
of avoiding the need to make the kernel code work correctly and
fast. i.e. the argument is "new, unoptimised code is too slow, so we
want a knob to avoid it completely".

Boaz keeps saying that we can make the kernel code faster, but he's
still pushing to enable bypassing that code rather than sending
patches to make the kernel pmem infrastructure faster.  Such
bypasses lead to the situation that the kernel code isn't used by
the applications that could benefit from optimisation and
improvement of the kernel code because they don't use it anymore.
This is what I meant as "first and only" kernel optimisation.

> I
> don't think anyone asking for a MAP_SYNC option doesn't also want other
> applications to work well.  That aside, this is where your opinion
> differs from mine: I don't see MAP_SYNC as a last resort option.  And
> let me be clear, this /is/ an opinion.  I have no hard facts to back it
> up, precisely because we don't have any application we can use for a
> comparison.

Right, we have no numbers, and we don't yet have an optimised kernel
side implementation to compare against. Until we have the ability to
compare apples with apples, we should be pushing back against API
changes that are based on oranges being tastier than apples.

> But, it seems plausible to me that no matter how well you
> optimize your msync implementation, it will still be more expensive than
> an application that doesn't call msync at all.  This obviously depends
> on how the application is using the programming model, among other
> things.  I agree that we would need real data to back this up.  However,
> I don't see any reason to preclude such an implementation, or to leave
> it as a last resort.  I think it should be part of our planning process
> if it's reasonably feasible.

Essentially I see this situation/request as conceptually the same as
O_DIRECT for read/write - O_DIRECT bypasses the kernel dirty range
tracking and, as such, has nasty cache coherency issues when you mix
it with buffered IO. Nor does it play well with mmap, it has
different semantics for every filesystem and the kernel code has
been optimised to the point of fragility.

And, of course, O_DIRECT requires applications to do exactly the
right things to extract performance gains and maintain data
integrity. If they get it right, they will be faster than using the
page cache, but we know that applications often get it very wrong.
And even when they get it right, data corruption can still occur
because some third party accessed the file in a different manner (e.g. a
backup) and triggered one of the known, fundamentally unfixable
coherency problems.

However, despite the fact we are stuck with O_DIRECT and its
deranged monkeys (which I am one of), we should not be ignoring the
problems that bypassing the kernel infrastructure has caused us and
continues to cause us. As such, we really need to think hard about
whether we should be repeating the development of such a bypass
feature. If we do, we stand a very good chance of ending up in the
same place - a bunch of code that does not play well with others,
and a nightmare to test because it's expected to work and not
corrupt data...

We should try very hard not to repeat the biggest mistake O_DIRECT
made: we need to define and document exactly what behaviour we
guarantee, how it works and exactly what responsibilities the
kernel and userspace have in *great detail* /before/ we add the
mechanism to the kernel.

Think it through carefully - API changes and semantics are forever.
We don't want to add something that in a couple of years we are
wishing we never added....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [RFC 0/2] New MAP_PMEM_AWARE mmap flag
  2016-02-25 21:08                                               ` Phil Terry
@ 2016-02-25 21:39                                                 ` Dave Chinner
  0 siblings, 0 replies; 70+ messages in thread
From: Dave Chinner @ 2016-02-25 21:39 UTC (permalink / raw)
  To: Phil Terry
  Cc: Jeff Moyer, Arnd Bergmann, linux-nvdimm, Oleg Nesterov,
	Christoph Hellwig, linux-mm, Mel Gorman, Johannes Weiner,
	Kirill A. Shutemov

On Thu, Feb 25, 2016 at 01:08:28PM -0800, Phil Terry wrote:
> On 02/25/2016 12:15 PM, Dave Chinner wrote:
> >On Thu, Feb 25, 2016 at 02:11:49PM -0500, Jeff Moyer wrote:
> >>Jeff Moyer <jmoyer@redhat.com> writes:
> >>
> >>>>The big issue we have right now is that we haven't made the DAX/pmem
> >>>>infrastructure work correctly and reliably for general use.  Hence
> >>>>adding new APIs to workaround cases where we haven't yet provided
> >>>>correct behaviour, let alone optimised for performance is, quite
> >>>>frankly, a clear case premature optimisation.
> >>>Again, I see the two things as separate issues.  You need both.
> >>>Implementing MAP_SYNC doesn't mean we don't have to solve the bigger
> >>>issue of making existing applications work safely.
> >>I want to add one more thing to this discussion, just for the sake of
> >>clarity.  When I talk about existing applications and pmem, I mean
> >>applications that already know how to detect and recover from torn
> >>sectors.  Any application that assumes hardware does not tear sectors
> >>should be run on a file system layered on top of the btt.
> >Which turns off DAX, and hence makes this a moot discussion because
> >mmap is then buffered through the page cache and hence applications
> >*must use msync/fsync* to provide data integrity. Which also makes
> >them safe to use with DAX if we have a working fsync.
> >
> >Keep in mind that existing storage technologies tear fileystem data
> >writes, too, because user data writes are filesystem block sized and
> >not atomic at the device level (i.e.  typical is 512 byte sector, 4k
> >filesystem block size, so there are 7 points in a single write where
> >a tear can occur on a crash).
> Is that really true? Storage to date is on the PCIE/SATA etc IO
> chain. The locks and application crash scenarios when traversing
> down this chain are such that the device will not have its DMA
> programmed until the whole 4K etc page is flushed to memory, pinned

Has nothing to do with DMA semantics. Storage devices we have to
deal with have volatile write caches, and we can't assume anything
about what they write when power fails except that single sector
writes are atomic.

> In both cases, btt is not indirecting the buffer (as for a DMA
> master IO type device) but is simply using the same pmem api
> primitives to manage its own meta data about the filesystem writes
> to detect and recover from tears after the event. In what sense is
> DAX disabled for this?

BTT is, IIRC, using write-ahead logging to stage every IO into pmem
so that after a crash the entire write can be recovered and replayed
to overwrite any torn sectors. This requires buffering at page cache
level, as direct writes to the pmem will not get logged. Hence DAX
cannot be used on BTT devices. Indeed:

static const struct block_device_operations btt_fops = {
        .owner =                THIS_MODULE,
        .rw_page =              btt_rw_page,
        .getgeo =               btt_getgeo,
        .revalidate_disk =      nvdimm_revalidate_disk,
};

There's no .direct_access method implemented for btt devices, so
it's clear that filesystems on BTT devices cannot enable DAX.
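
A DAX-capable driver, by contrast, wires that method up - roughly like the
v4.5-era pmem driver does (a sketch from memory, not a verbatim copy of
drivers/nvdimm/pmem.c):

static const struct block_device_operations pmem_fops = {
        .owner =                THIS_MODULE,
        .rw_page =              pmem_rw_page,
        .direct_access =        pmem_direct_access,    /* what DAX needs */
        .revalidate_disk =      nvdimm_revalidate_disk,
};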

> So I think (please correct me if I'm wrong) but actually the
> hardware/firmware guys have been fixing the torn sector problem for

I was not talking about torn /sectors/. I was talking about a user
data write being made up of *multiple sectors*, and so there is no
atomicity guarantee for a user data write on existing storage when
the filesystem block size (user data IO size) is larger than the
device sector size. 

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [RFC 0/2] New MAP_PMEM_AWARE mmap flag
  2016-02-25 20:57                                               ` Jeff Moyer
@ 2016-02-25 22:27                                                 ` Dave Chinner
  2016-02-26  4:02                                                   ` Dan Williams
  2016-02-29 20:25                                                   ` Jeff Moyer
  0 siblings, 2 replies; 70+ messages in thread
From: Dave Chinner @ 2016-02-25 22:27 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Arnd Bergmann, linux-nvdimm, Oleg Nesterov, Christoph Hellwig,
	linux-mm, Mel Gorman, Johannes Weiner, Kirill A. Shutemov

On Thu, Feb 25, 2016 at 03:57:14PM -0500, Jeff Moyer wrote:
> Good morning, Dave,
> 
> Dave Chinner <david@fromorbit.com> writes:
> 
> > On Thu, Feb 25, 2016 at 02:11:49PM -0500, Jeff Moyer wrote:
> >> Jeff Moyer <jmoyer@redhat.com> writes:
> >> 
> >> >> The big issue we have right now is that we haven't made the DAX/pmem
> >> >> infrastructure work correctly and reliably for general use.  Hence
> >> >> adding new APIs to workaround cases where we haven't yet provided
> >> >> correct behaviour, let alone optimised for performance is, quite
> >> >> frankly, a clear case premature optimisation.
> >> >
> >> > Again, I see the two things as separate issues.  You need both.
> >> > Implementing MAP_SYNC doesn't mean we don't have to solve the bigger
> >> > issue of making existing applications work safely.
> >> 
> >> I want to add one more thing to this discussion, just for the sake of
> >> clarity.  When I talk about existing applications and pmem, I mean
> >> applications that already know how to detect and recover from torn
> >> sectors.  Any application that assumes hardware does not tear sectors
> >> should be run on a file system layered on top of the btt.
> >
> > Which turns off DAX, and hence makes this a moot discussion because
> 
> You're missing the point.  You can't take applications that don't know
> how to deal with torn sectors and put them on a block device that does
> not provide power fail write atomicity of a single sector.

Very few applications actually care about atomic sector writes.
Databases are probably the only class of application that really do
care about both single sector and multi-sector atomic write
behaviour, and many of them can be configured to assume single
sector writes can be torn.

Torn user data writes have always been possible, and so pmem does
not introduce any new semantics that applications have to handle.

> > Keep in mind that existing storage technologies tear fileystem data
> > writes, too, because user data writes are filesystem block sized and
> > not atomic at the device level (i.e.  typical is 512 byte sector, 4k
> > filesystem block size, so there are 7 points in a single write where
> > a tear can occur on a crash).
> 
> You are conflating torn pages (pages being a generic term for anything
> greater than a sector) and torn sectors.

No, I'm not. I'm pointing out that applications that really care
about data integrity already have the capability to recover from
torn sectors in the event of a crash. pmem+DAX does not introduce
any new way of corrupting user data for these applications.

> > IOWs existing storage already has the capability of tearing user
> > data on crash and has been doing so for a least they last 30 years.
> 
> And yet applications assume that this doesn't happen.  Have a look at
> this:
>   https://www.sqlite.org/psow.html

Quote:

"All versions of SQLite up to and including version 3.7.9 assume
that the filesystem does not provide powersafe overwrite. [...]

Hence it seems reasonable to assume powersafe overwrite for modern
disks. [...] Caution is advised though. As Roger Binns noted on the
SQLite developers mailing list: "'poorly written' should be the main
assumption about drive firmware."

IOWs, SQLite used to always assume that single sector overwrites can
be torn, and now that it is optional it recommends that users should
assume this is the way their storage behaves in order to be safe. In
this config, it uses the write ahead log even for single sector
writes, and hence can recover from torn sector writes without having
to detect that the write was torn.

Quote:

"SQLite never assumes that database page writes are atomic,
 regardless of the PSOW setting.(1) And hence SQLite is always able
 to automatically recover from torn pages induced by a crash."

This is because multi-sector writes are always staged through the
write ahead log and hence are cleanly recoverable after a crash
without having to detect whether a torn write occurred or not.

IOWs, you've just pointed to an application that demonstrates
pmem-safe behaviour - just configure the database files with
"file:somefile.db?psow=0" and it will assume that individual sector
writes can be torn, and it will always recover.
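
For reference, a minimal example of opening a database that way (URI
filenames must be enabled, e.g. by passing SQLITE_OPEN_URI):

#include <sqlite3.h>
#include <stdio.h>

int main(void)
{
        sqlite3 *db;
        int rc;

        /* psow=0: tell SQLite to assume overwrites are NOT powersafe,
         * i.e. that even a single-sector write may be torn on a crash. */
        rc = sqlite3_open_v2("file:somefile.db?psow=0", &db,
                             SQLITE_OPEN_READWRITE | SQLITE_OPEN_CREATE |
                             SQLITE_OPEN_URI, NULL);
        if (rc != SQLITE_OK) {
                fprintf(stderr, "open failed: %s\n", sqlite3_errmsg(db));
                sqlite3_close(db);
                return 1;
        }
        sqlite3_close(db);
        return 0;
}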

Hence I'm not sure exactly what point you are trying to make with
this example.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [RFC 0/2] New MAP_PMEM_AWARE mmap flag
  2016-02-25 22:27                                                 ` Dave Chinner
@ 2016-02-26  4:02                                                   ` Dan Williams
  2016-02-26 10:04                                                     ` Thanumalayan Sankaranarayana Pillai
  2016-02-29 20:25                                                   ` Jeff Moyer
  1 sibling, 1 reply; 70+ messages in thread
From: Dan Williams @ 2016-02-26  4:02 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jeff Moyer, Arnd Bergmann, linux-nvdimm, Oleg Nesterov,
	Christoph Hellwig, linux-mm, Mel Gorman, Johannes Weiner,
	Kirill A. Shutemov, madthanu

[ adding Thanu ]

On Thu, Feb 25, 2016 at 2:27 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Thu, Feb 25, 2016 at 03:57:14PM -0500, Jeff Moyer wrote:
>> Good morning, Dave,
>>
>> Dave Chinner <david@fromorbit.com> writes:
>>
>> > On Thu, Feb 25, 2016 at 02:11:49PM -0500, Jeff Moyer wrote:
>> >> Jeff Moyer <jmoyer@redhat.com> writes:
>> >>
>> >> >> The big issue we have right now is that we haven't made the DAX/pmem
>> >> >> infrastructure work correctly and reliably for general use.  Hence
>> >> >> adding new APIs to workaround cases where we haven't yet provided
>> >> >> correct behaviour, let alone optimised for performance is, quite
>> >> >> frankly, a clear case premature optimisation.
>> >> >
>> >> > Again, I see the two things as separate issues.  You need both.
>> >> > Implementing MAP_SYNC doesn't mean we don't have to solve the bigger
>> >> > issue of making existing applications work safely.
>> >>
>> >> I want to add one more thing to this discussion, just for the sake of
>> >> clarity.  When I talk about existing applications and pmem, I mean
>> >> applications that already know how to detect and recover from torn
>> >> sectors.  Any application that assumes hardware does not tear sectors
>> >> should be run on a file system layered on top of the btt.
>> >
>> > Which turns off DAX, and hence makes this a moot discussion because
>>
>> You're missing the point.  You can't take applications that don't know
>> how to deal with torn sectors and put them on a block device that does
>> not provide power fail write atomicity of a single sector.
>
> Very few applications actually care about atomic sector writes.
> Databases are probably the only class of application that really do
> care about both single sector and multi-sector atomic write
> behaviour, and many of them can be configured to assume single
> sector writes can be torn.
>
> Torn user data writes have always been possible, and so pmem does
> not introduce any new semantics that applications have to handle.
>
>> > Keep in mind that existing storage technologies tear fileystem data
>> > writes, too, because user data writes are filesystem block sized and
>> > not atomic at the device level (i.e.  typical is 512 byte sector, 4k
>> > filesystem block size, so there are 7 points in a single write where
>> > a tear can occur on a crash).
>>
>> You are conflating torn pages (pages being a generic term for anything
>> greater than a sector) and torn sectors.
>
> No, I'm not. I'm pointing out that applications that really care
> about data integrity already have the capability to recovery from
> torn sectors in the event of a crash. pmem+DAX does not introduce
> any new way of corrupting user data for these applications.
>
>> > IOWs existing storage already has the capability of tearing user
>> > data on crash and has been doing so for a least they last 30 years.
>>
>> And yet applications assume that this doesn't happen.  Have a look at
>> this:
>>   https://www.sqlite.org/psow.html
>
> Quote:
>
> "All versions of SQLite up to and including version 3.7.9 assume
> that the filesystem does not provide powersafe overwrite. [...]
>
> Hence it seems reasonable to assume powersafe overwrite for modern
> disks. [...] Caution is advised though. As Roger Binns noted on the
> SQLite developers mailing list: "'poorly written' should be the main
> assumption about drive firmware."
>
> IOWs, SQLite used to always assume that single sector overwrites can
> be torn, and now that it is optional it recommends that users should
> assume this is the way their storage behaves in order to be safe. In
> this config, it uses the write ahead log even for single sector
> writes, and hence can recover from torn sector writes without having
> to detect that the write was torn.
>
> Quote:
>
> "SQLite never assumes that database page writes are atomic,
>  regardless of the PSOW setting.(1) And hence SQLite is always able
>  to automatically recover from torn pages induced by a crash."
>
> This is Because multi-sector writes are always staged through the
> write ahead log and hence are cleanly recoverable after a crash
> without having to detect whether a torn write occurred or not.
>
> IOWs, you've just pointed to an application that demonstrates
> pmem-safe behaviour - just configure the database files with
> "file:somefile.db?psow=0" and it will assume that individual sector
> writes can be torn, and it will always recover.
>
> Hence I'm not sure exactly what point you are trying to make with
> this example.

I met Thanu today at USENIX Fast'16 and his research [1] has
found other applications that assume sector atomicity.  Also, here's a
thread he pointed to about the sector atomicity dependencies of LMDB
[2].

BTT is needed because existing software assumes sectors are not torn
and may not yet have settings like "psow=0" to work around that
assumption.  Jeff's right, we would be mistaken not to recommend BTT
by default.  In that respect, applications running on top of raw pmem,
sans BTT, are already making an "I know what I am doing" decision.

[1]: http://research.cs.wisc.edu/wind/Publications/alice-osdi14.pdf
[2]: http://www.openldap.org/lists/openldap-devel/201410/msg00004.html

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [RFC 0/2] New MAP_PMEM_AWARE mmap flag
  2016-02-26  4:02                                                   ` Dan Williams
@ 2016-02-26 10:04                                                     ` Thanumalayan Sankaranarayana Pillai
  2016-02-28 10:17                                                         ` Boaz Harrosh
  0 siblings, 1 reply; 70+ messages in thread
From: Thanumalayan Sankaranarayana Pillai @ 2016-02-26 10:04 UTC (permalink / raw)
  To: Dan Williams
  Cc: Dave Chinner, Jeff Moyer, Arnd Bergmann, linux-nvdimm,
	Oleg Nesterov, Christoph Hellwig, linux-mm, Mel Gorman,
	Johannes Weiner, Kirill A. Shutemov

On Thu, Feb 25, 2016 at 10:02 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> [ adding Thanu ]
>
>> Very few applications actually care about atomic sector writes.
>> Databases are probably the only class of application that really do
>> care about both single sector and multi-sector atomic write
>> behaviour, and many of them can be configured to assume single
>> sector writes can be torn.
>>
>> Torn user data writes have always been possible, and so pmem does
>> not introduce any new semantics that applications have to handle.
>>

I know about BTT and DAX only at a conceptual level and hence do not understand
this mailing thread fully. But I can provide examples of important applications
expecting atomicity at a 512B or a smaller granularity. Here is a list:

(1) LMDB [1] that Dan mentioned, which expects "linear writes" (i.e., don't
need atomicity, but need the first byte to be written before the second byte)

(2) PostgreSQL expects atomicity [2]

(3) SQLite depends on linear writes [3] (we were unable to find these
dependencies during our testing, however). Also, PSOW in SQLite is not relevant
to this discussion as I understand it; PSOW deals with corruption of data
*around* the actual written bytes.

(4) We found that ZooKeeper depends on atomicity during our testing, but we did
not contact the ZooKeeper developers about this. Some details in our paper [4].

It is tempting to assume that applications do not use the concept of disk
sectors and deal with only file-system blocks (which are not atomic in
practice), and take measures to deal with the non-atomic file-system blocks.
But, in reality, applications seem to assume that 512B (more or less) sectors
are atomic or linear, and build their consistency mechanisms around that.

[1] http://www.openldap.org/lists/openldap-devel/201410/msg00004.html
[2] http://www.postgresql.org/docs/9.5/static/wal-internals.html , "To deal
with the case where pg_control is corrupt" ...
[3] https://www.sqlite.org/atomiccommit.html , "SQLite does always assume that
a sector write is linear" ...
[4] http://research.cs.wisc.edu/wind/Publications/alice-osdi14.pdf

Regards,
Thanu

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [RFC 0/2] New MAP_PMEM_AWARE mmap flag
  2016-02-26 10:04                                                     ` Thanumalayan Sankaranarayana Pillai
@ 2016-02-28 10:17                                                         ` Boaz Harrosh
  0 siblings, 0 replies; 70+ messages in thread
From: Boaz Harrosh @ 2016-02-28 10:17 UTC (permalink / raw)
  To: Thanumalayan Sankaranarayana Pillai, Dan Williams
  Cc: Arnd Bergmann, linux-nvdimm, Dave Chinner, Oleg Nesterov,
	Christoph Hellwig, linux-mm, Mel Gorman, Johannes Weiner,
	Kirill A. Shutemov, NFS list

On 02/26/2016 12:04 PM, Thanumalayan Sankaranarayana Pillai wrote:
> On Thu, Feb 25, 2016 at 10:02 PM, Dan Williams <dan.j.williams@intel.com> wrote:
>> [ adding Thanu ]
>>
>>> Very few applications actually care about atomic sector writes.
>>> Databases are probably the only class of application that really do
>>> care about both single sector and multi-sector atomic write
>>> behaviour, and many of them can be configured to assume single
>>> sector writes can be torn.
>>>
>>> Torn user data writes have always been possible, and so pmem does
>>> not introduce any new semantics that applications have to handle.
>>>
> 
> I know about BTT and DAX only at a conceptual level and hence do not understand
> this mailing thread fully. But I can provide examples of important applications
> expecting atomicity at a 512B or a smaller granularity. Here is a list:
> 
> (1) LMDB [1] that Dan mentioned, which expects "linear writes" (i.e., don't
> need atomicity, but need the first byte to be written before the second byte)
> 
> (2) PostgreSQL expects atomicity [2]
> 
> (3) SQLite depends on linear writes [3] (we were unable to find these
> dependencies during our testing, however). Also, PSOW in SQLite is not relevant
> to this discussion as I understand it; PSOW deals with corruption of data
> *around* the actual written bytes.
> 
> (4) We found that ZooKeeper depends on atomicity during our testing, but we did
> not contact the ZooKeeper developers about this. Some details in our paper [4].
> 
> It is tempting to assume that applications do not use the concept of disk
> sectors and deal with only file-system blocks (which are not atomic in
> practice), and take measures to deal with the non-atomic file-system blocks.
> But, in reality, applications seem to assume that 512B (more or less) sectors
> are atomic or linear, and build their consistency mechanisms around that.
> 

This whole discussion is a shock to me. Where were these guys hiding, under a rock?

In the NFS world you can get not just torn sectors but torn words. You may have
reordering of writes, you may have data holes, the whole deal. Until you get back
a successful sync nothing is guaranteed. It is not only a client
crash but also a network breach, and so on. So you never know what can happen.

So are you saying all these applications do not run on NFS?

Thanks
Boaz

> [1] http://www.openldap.org/lists/openldap-devel/201410/msg00004.html
> [2] http://www.postgresql.org/docs/9.5/static/wal-internals.html , "To deal
> with the case where pg_control is corrupt" ...
> [3] https://www.sqlite.org/atomiccommit.html , "SQLite does always assume that
> a sector write is linear" ...
> [4] http://research.cs.wisc.edu/wind/Publications/alice-osdi14.pdf
> 
> Regards,
> Thanu
> _______________________________________________
> Linux-nvdimm mailing list
> Linux-nvdimm@lists.01.org
> https://lists.01.org/mailman/listinfo/linux-nvdimm
> 


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [RFC 0/2] New MAP_PMEM_AWARE mmap flag
  2016-02-25 22:27                                                 ` Dave Chinner
  2016-02-26  4:02                                                   ` Dan Williams
@ 2016-02-29 20:25                                                   ` Jeff Moyer
  1 sibling, 0 replies; 70+ messages in thread
From: Jeff Moyer @ 2016-02-29 20:25 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Arnd Bergmann, linux-nvdimm, Oleg Nesterov, Christoph Hellwig,
	linux-mm, Mel Gorman, Johannes Weiner, Kirill A. Shutemov

Hi, Dave,

Dave Chinner <david@fromorbit.com> writes:

>> You're missing the point.  You can't take applications that don't know
>> how to deal with torn sectors and put them on a block device that does
>> not provide power fail write atomicity of a single sector.
>
> Very few applications actually care about atomic sector writes.

I agree that most applications do not care about power-fail write
atomicity of a single sector.  However, of those applications that do
care about it, how many will/can run safely when atomic sector writes
are not provided?  Thanu gave some examples of applications that require
atomic sector writes today, and I'm sure there are more.  It sounds like
you are comfortable with running those applications on a file system
layered on top of a raw pmem device.  (Again, I'm coming from the angle
that block storage already provides this guarantee, at least mostly.)

> IOWs, you've just pointed to an application that demonstrates
> pmem-safe behaviour - just configure the database files with
> "file:somefile.db?psow=0" and it will assume that individual sector
> writes can be torn, and it will always recover.
>
> Hence I'm not sure exactly what point you are trying to make with
> this example.

Sorry, what I meant to point out was that the sqlite developers changed
from assuming sectors could be torn to assuming they were not.  So, *by
default*, the database assumes that sectors will not be torn.

Dave, on one hand you're arguing fervently for data integrity (over
premature optimisation).  But on the other hand you're willing to
ignore data integrity completely for a set of existing applications.
This is not internally consistent.  :)  Please explain.

Cheers,
Jeff

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [RFC 0/2] New MAP_PMEM_AWARE mmap flag
  2016-02-25 21:20                                           ` Dave Chinner
@ 2016-02-29 20:32                                             ` Jeff Moyer
  0 siblings, 0 replies; 70+ messages in thread
From: Jeff Moyer @ 2016-02-29 20:32 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Dan Williams, Boaz Harrosh, Christoph Hellwig, Rudoff, Andy,
	Arnd Bergmann, linux-nvdimm, Oleg Nesterov, linux-mm, Mel Gorman,
	Johannes Weiner, Kirill A. Shutemov

Dave Chinner <david@fromorbit.com> writes:

> On Thu, Feb 25, 2016 at 11:24:57AM -0500, Jeff Moyer wrote:
>> But, it seems plausible to me that no matter how well you
>> optimize your msync implementation, it will still be more expensive than
>> an application that doesn't call msync at all.  This obviously depends
>> on how the application is using the programming model, among other
>> things.  I agree that we would need real data to back this up.  However,
>> I don't see any reason to preclude such an implementation, or to leave
>> it as a last resort.  I think it should be part of our planning process
>> if it's reasonably feasible.
>
> Essentially I see this situation/request as conceptually the same as
> O_DIRECT for read/write - O_DIRECT bypasses the kernel dirty range
> tracking and, as such, has nasty cache coherency issues when you mix
> it with buffered IO. Nor does it play well with mmap, it has
> different semantics for every filesystem and the kernel code has
> been optimised to the point of fragility.
>
> And, of course, O_DIRECT requires applications to do exactly the
> right things to extract performance gains and maintain data
> integrity. If they get it right, they will be faster than using the
> page cache, but we know that applications often get it very wrong.
> And even when they get it right, data corruption can still occur
> because some thrid party accessed file in a different manner (e.g. a
> backup) and triggered one of the known, fundamentally unfixable
> coherency problems.
>
> However, despite the fact we are stuck with O_DIRECT and it's
> deranged monkeys (which I am one of), we should not be ignoring the
> problems that bypassing the kernel infrastructure has caused us and
> continues to cause us. As such, we really need to think hard about
> whether we should be repeating the development of such a bypass
> feature. If we do, we stand a very good chance of ending up in the
> same place - a bunch of code that does not play well with others,
> and a nightmare to test because it's expected to work and not
> corrupt data...
>
> We should try very hard not to repeat the biggest mistake O_DIRECT
> made: we need to define and document exactly what behaviour we
> guarantee, how it works and exaclty what responsisbilities the
> kernel and userspace have in *great detail* /before/ we add the
> mechanism to the kernel.
>
> Think it through carefully - API changes and semantics are forever.
> We don't want to add something that in a couple of years we are
> wishing we never added....

I agree with everything you wrote, there.

Cheers,
Jeff

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [RFC 0/2] New MAP_PMEM_AWARE mmap flag
  2016-02-28 10:17                                                         ` Boaz Harrosh
  (?)
@ 2016-03-03 17:38                                                         ` Howard Chu
  -1 siblings, 0 replies; 70+ messages in thread
From: Howard Chu @ 2016-03-03 17:38 UTC (permalink / raw)
  To: linux-mm

Boaz Harrosh <boaz <at> plexistor.com> writes:

> 
> On 02/26/2016 12:04 PM, Thanumalayan Sankaranarayana Pillai wrote:
> > On Thu, Feb 25, 2016 at 10:02 PM, Dan Williams <dan.j.williams <at> intel.com> wrote:
> >> [ adding Thanu ]
> >>
> >>> Very few applications actually care about atomic sector writes.
> >>> Databases are probably the only class of application that really do
> >>> care about both single sector and multi-sector atomic write
> >>> behaviour, and many of them can be configured to assume single
> >>> sector writes can be torn.
> >>>
> >>> Torn user data writes have always been possible, and so pmem does
> >>> not introduce any new semantics that applications have to handle.
> >>>
> > 
> > I know about BTT and DAX only at a conceptual level and hence do not
> > understand this mailing thread fully. But I can provide examples of
> > important applications expecting atomicity at a 512B or a smaller
> > granularity. Here is a list:
> > 
> > (1) LMDB [1] that Dan mentioned, which expects "linear writes" (i.e., don't
> > need atomicity, but need the first byte to be written before the second byte)
> > 
> > (2) PostgreSQL expects atomicity [2]
> > 
> > (3) SQLite depends on linear writes [3] (we were unable to find these
> > dependencies during our testing, however). Also, PSOW in SQLite is not
> > relevant to this discussion as I understand it; PSOW deals with corruption
> > of data *around* the actual written bytes.
> > 
> > (4) We found that ZooKeeper depends on atomicity during our testing, but
> > we did not contact the ZooKeeper developers about this. Some details in
> > our paper [4].
> > 
> > It is tempting to assume that applications do not use the concept of disk
> > sectors and deal with only file-system blocks (which are not atomic in
> > practice), and take measures to deal with the non-atomic file-system blocks.
> > But, in reality, applications seem to assume that 512B (more or less)
> > sectors are atomic or linear, and build their consistency mechanisms
> > around that.
> > 
> 
> This whole discussion is a shock to me. Where were these guys hiding, under
> a rock?
> 
> In the NFS world you can get not just torn sectors but torn words. You may
> have reordering of writes, you may have data holes, the whole deal. Until
> you get back a successful sync nothing is guaranteed. It is not only a client
> crash but also a network breach, and so on. So you never know what can happen.
> 
> So are you saying all these applications do not run on NFS?

Speaking for LMDB: LMDB is entirely dependent on mmap, and the coherence of
a unified buffer cache. None of this is supported on NFS, so NFS has never
been a concern for us. We explicitly document that LMDB cannot be used over NFS.

Speaking more generally, you're talking nonsense. NFS by default transmits
*pages* over UDP - datagrams are all-or-nothing, you can't get torn words.
Likewise, NFS over TCP means individual pages are transmitted with
individual bytes in order within a page.

> Thanks
> Boaz
> 
> > [1] http://www.openldap.org/list~s/openldap-devel/201410/msg00004.html
> > [2] http://www.postgresql.org/docs/9.5/static/wal-internals.html , "To deal
> > with the case where pg_control is corrupt" ...
> > [3] https://www.sqlite.org/atomiccommit.html , "SQLite does always assume
> > that a sector write is linear" ...
> > [4] http://research.cs.wisc.edu/wind/Publications/alice-osdi14.pdf
> > 
> > Regards,
> > Thanu
> > _______________________________________________
> > Linux-nvdimm mailing list
> > Linux-nvdimm <at> lists.01.org
> > https://lists.01.org/mailman/listinfo/linux-nvdimm
> > 
--
  -- Howard Chu
  CTO, Symas Corp.           http://www.symas.com
  Director, Highland Sun     http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/




^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [RFC 0/2] New MAP_PMEM_AWARE mmap flag
  2016-02-21 17:03 [RFC 0/2] New MAP_PMEM_AWARE mmap flag Boaz Harrosh
                   ` (2 preceding siblings ...)
  2016-02-21 19:51 ` [RFC 0/2] New " Dan Williams
@ 2016-03-11  6:44 ` Andy Lutomirski
  2016-03-11 19:07   ` Dan Williams
  3 siblings, 1 reply; 70+ messages in thread
From: Andy Lutomirski @ 2016-03-11  6:44 UTC (permalink / raw)
  To: Boaz Harrosh, Dan Williams, Ross Zwisler, linux-nvdimm,
	Matthew Wilcox, Kirill A. Shutemov, Dave Chinner
  Cc: Oleg Nesterov, Mel Gorman, Johannes Weiner, linux-mm, Arnd Bergmann

On 02/21/2016 09:03 AM, Boaz Harrosh wrote:
> Hi all
>
> Recent DAX code fixed the cl_flushing ie durability of mmap access
> of direct persistent-memory from applications. It uses the radix-tree
> per inode to track the indexes of a file that where page-faulted for
> write. Then at m/fsync time it would cl_flush these pages and clean
> the radix-tree, for the next round.
>
> Sigh, that is life, for legacy applications this is the price we must
> pay. But for NV aware applications like nvml library, we pay extra extra
> price, even if we do not actually call m/fsync eventually. For these
> applications these extra resources and especially the extra radix locking
> per page-fault, costs a lot, like x3 a lot.
>
> What we propose here is a way for those applications to enjoy the
> boost and still not sacrifice any correctness of legacy applications.
> Any concurrent access from legacy apps vs nv-aware apps even to the same
> file / same page, will work correctly.
>
> We do that by defining a new MMAP flag that is set by the nv-aware
> app. this flag is carried by the VMA. In the dax code we bypass any
> radix handling of the page if this flag is set. Those pages accessed *without*
> this flag will be added to the radix-tree, those with will not.
> At m/fsync time if the radix tree is then empty nothing will happen.
>

I'm a little late to the party, but let me offer a variant that might be 
considerably safer:

Add a flag MAP_DAX_WRITETHROUGH (name could be debated -- 
MAP_DAX_FASTFLUSH might be more architecture-neutral, but I'm only 
familiar with the x86 semantics).

MAP_DAX_WRITETHROUGH does whatever is needed to ensure that writing 
through the mapping and then calling fsync is both safe and fast.  On 
x86, it would (surprise, surprise!) map the pages writethrough and skip 
adding them to the radix tree.  fsync makes sure to do sfence before 
pcommit.

This is totally safe.  You *can't* abuse this to cause fsync to leave 
non-persistent dirty cached data anywhere.

It makes sufficiently DAX-aware applications very fast.  Reads are 
unaffected, and non-temporal writes should be the same speed as they are 
under any other circumstances.

It makes applications that set it blindly very slow.  Applications that 
use standard writes (i.e. plain stores that are neither fast string 
operations nor explicit non-temporal writes) will suffer.  But they'll 
still work correctly.

Applications that want a WB mapping with manually-managed persistence 
can still do it, but fsync will be slow.  Adding an fmetadatasync() for 
their benefit might be a decent idea, but it would just be icing on the 
cake.

Unlike with MAP_DAX_AWARE, there's no issue with malicious users who map 
the thing with the wrong flag, write, call fsync, and snicker because 
now the other applications might read data and be surprised that the 
data they just read isn't persistent even if they subsequently call fsync.

There would be details to be hashed out in case a page is mapped 
normally and with MAP_DAX_WRITETHROUGH in separate mappings.
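
To make the intended usage concrete, here is a minimal user-space sketch.
MAP_DAX_WRITETHROUGH is the hypothetical flag proposed above; it does not
exist in current kernel headers, so the value below is made up purely for
illustration, and the record is assumed to fit inside one page:

#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MAP_DAX_WRITETHROUGH
#define MAP_DAX_WRITETHROUGH 0x100000   /* made-up value, see above */
#endif

/* Update one record at a page-aligned offset in a DAX file, durably. */
int update_record(const char *path, off_t page_off, const void *buf, size_t len)
{
        int fd = open(path, O_RDWR);
        if (fd < 0)
                return -1;

        /* Write-through mapping: plain stores get slower, but no dirty
         * cache lines are left behind, so fsync() stays cheap. */
        char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_DAX_WRITETHROUGH, fd, page_off);
        if (p == MAP_FAILED) {
                close(fd);
                return -1;
        }

        memcpy(p, buf, len);    /* ideally non-temporal stores for bulk data */

        /* With the proposed semantics this needs only a fence/commit,
         * not a radix-tree walk over dirty pages. */
        int ret = fsync(fd);

        munmap(p, 4096);
        close(fd);
        return ret;
}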

--Andy


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [RFC 0/2] New MAP_PMEM_AWARE mmap flag
  2016-03-11  6:44 ` Andy Lutomirski
@ 2016-03-11 19:07   ` Dan Williams
  2016-03-11 19:10     ` Andy Lutomirski
  0 siblings, 1 reply; 70+ messages in thread
From: Dan Williams @ 2016-03-11 19:07 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Boaz Harrosh, Ross Zwisler, linux-nvdimm, Matthew Wilcox,
	Kirill A. Shutemov, Dave Chinner, Oleg Nesterov, Mel Gorman,
	Johannes Weiner, linux-mm, Arnd Bergmann

On Thu, Mar 10, 2016 at 10:44 PM, Andy Lutomirski <luto@kernel.org> wrote:
> On 02/21/2016 09:03 AM, Boaz Harrosh wrote:
>>
>> Hi all
>>
>> Recent DAX code fixed the cl_flushing ie durability of mmap access
>> of direct persistent-memory from applications. It uses the radix-tree
>> per inode to track the indexes of a file that where page-faulted for
>> write. Then at m/fsync time it would cl_flush these pages and clean
>> the radix-tree, for the next round.
>>
>> Sigh, that is life, for legacy applications this is the price we must
>> pay. But for NV aware applications like nvml library, we pay extra extra
>> price, even if we do not actually call m/fsync eventually. For these
>> applications these extra resources and especially the extra radix locking
>> per page-fault, costs a lot, like x3 a lot.
>>
>> What we propose here is a way for those applications to enjoy the
>> boost and still not sacrifice any correctness of legacy applications.
>> Any concurrent access from legacy apps vs nv-aware apps even to the same
>> file / same page, will work correctly.
>>
>> We do that by defining a new MMAP flag that is set by the nv-aware
>> app. this flag is carried by the VMA. In the dax code we bypass any
>> radix handling of the page if this flag is set. Those pages accessed
>> *without*
>> this flag will be added to the radix-tree, those with will not.
>> At m/fsync time if the radix tree is then empty nothing will happen.
>>
>
> I'm a little late to the party, but let me offer a variant that might be
> considerably safer:
>
> Add a flag MAP_DAX_WRITETHROUGH (name could be debated -- MAP_DAX_FASTFLUSH
> might be more architecture-neutral, but I'm only familiar with the x86
> semantics).
>
> MAP_DAX_WRITETHROUGH does whatever is needed to ensure that writing through
> the mapping and then calling fsync is both safe and fast.  On x86, it would
> (surprise, surprise!) map the pages writethrough and skip adding them to the
> radix tree.  fsync makes sure to do sfence before pcommit.
>
> This is totally safe.  You *can't* abuse this to cause fsync to leave
> non-persistent dirty cached data anywhere.
>
> It makes sufficiently DAX-aware applications very fast.  Reads are
> unaffected, and non-temporal writes should be the same speed as they are
> under any other circumstances.
>
> It makes applications that set it blindly very slow.  Applications that use
> standard writes (i.e. plain stores that are neither fast string operations
> nor explicit non-temporal writes) will suffer.  But they'll still work
> correctly.
>
> Applications that want a WB mapping with manually-managed persistence can
> still do it, but fsync will be slow.  Adding an fmetadatasync() for their
> benefit might be a decent idea, but it would just be icing on the cake.
>
> Unlike with MAP_DAX_AWARE, there's no issue with malicious users who map the
> thing with the wrong flag, write, call fsync, and snicker because now the
> other applications might read data and be surprised that the data they just
> read isn't persistent even if they subsequently call fsync.
>
> There would be details to be hashed out in case a page is mapped normally
> and with MAP_DAX_WRITETHROUGH in separate mappings.
>

Interesting...

The mixed mapping problem is made slightly more difficult by the fact
that we add persistent memory to the direct-map when allocating struct
page, but probably not insurmountable.  Also, this still has the
syscall overhead that a MAP_SYNC semantic eliminates, but we need to
collect numbers to see if that matters.

However, chatting with Andy R. about the NVML use case, the library
alternates between streaming non-temporal writes and byte-accesses +
clwb().  The byte accesses get slower with a write-through mapping.
So, performance data is needed all around to see where these options
land.
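
For reference, the two access patterns being compared look roughly like
the sketch below. This uses the x86 intrinsics (_mm_stream_si64, _mm_clwb,
_mm_sfence; CLWB needs a new enough CPU and -mclwb) and illustrative
function names, not NVML's actual code:

#include <emmintrin.h>   /* _mm_stream_si64, _mm_sfence */
#include <immintrin.h>   /* _mm_clwb */
#include <stddef.h>
#include <stdint.h>

/* Streaming non-temporal stores: nothing dirty stays in the cache, and a
 * single sfence orders the stores before the commit point. */
static void stream_copy64(uint64_t *dst, const uint64_t *src, size_t n)
{
        for (size_t i = 0; i < n; i++)
                _mm_stream_si64((long long *)&dst[i], (long long)src[i]);
        _mm_sfence();
}

/* Ordinary cached byte stores followed by explicit cache-line write-back.
 * With a write-through mapping it is these plain stores that get slower.
 * Assumes dst is cache-line aligned. */
static void cached_update(uint8_t *dst, const uint8_t *src, size_t n)
{
        for (size_t i = 0; i < n; i++)
                dst[i] = src[i];
        for (size_t i = 0; i < n; i += 64)      /* 64 = cache line size */
                _mm_clwb(&dst[i]);
        _mm_sfence();
}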


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [RFC 0/2] New MAP_PMEM_AWARE mmap flag
  2016-03-11 19:07   ` Dan Williams
@ 2016-03-11 19:10     ` Andy Lutomirski
  2016-03-11 23:02       ` Rudoff, Andy
  0 siblings, 1 reply; 70+ messages in thread
From: Andy Lutomirski @ 2016-03-11 19:10 UTC (permalink / raw)
  To: Dan Williams
  Cc: Andy Lutomirski, Boaz Harrosh, Ross Zwisler, linux-nvdimm,
	Matthew Wilcox, Kirill A. Shutemov, Dave Chinner, Oleg Nesterov,
	Mel Gorman, Johannes Weiner, linux-mm, Arnd Bergmann

On Fri, Mar 11, 2016 at 11:07 AM, Dan Williams <dan.j.williams@intel.com> wrote:
> On Thu, Mar 10, 2016 at 10:44 PM, Andy Lutomirski <luto@kernel.org> wrote:
>> On 02/21/2016 09:03 AM, Boaz Harrosh wrote:
>>>
>>> Hi all
>>>
>>> Recent DAX code fixed the cl_flushing ie durability of mmap access
>>> of direct persistent-memory from applications. It uses the radix-tree
>>> per inode to track the indexes of a file that where page-faulted for
>>> write. Then at m/fsync time it would cl_flush these pages and clean
>>> the radix-tree, for the next round.
>>>
>>> Sigh, that is life, for legacy applications this is the price we must
>>> pay. But for NV aware applications like nvml library, we pay extra extra
>>> price, even if we do not actually call m/fsync eventually. For these
>>> applications these extra resources and especially the extra radix locking
>>> per page-fault, costs a lot, like x3 a lot.
>>>
>>> What we propose here is a way for those applications to enjoy the
>>> boost and still not sacrifice any correctness of legacy applications.
>>> Any concurrent access from legacy apps vs nv-aware apps even to the same
>>> file / same page, will work correctly.
>>>
>>> We do that by defining a new MMAP flag that is set by the nv-aware
>>> app. this flag is carried by the VMA. In the dax code we bypass any
>>> radix handling of the page if this flag is set. Those pages accessed
>>> *without*
>>> this flag will be added to the radix-tree, those with will not.
>>> At m/fsync time if the radix tree is then empty nothing will happen.
>>>
>>
>> I'm a little late to the party, but let me offer a variant that might be
>> considerably safer:
>>
>> Add a flag MAP_DAX_WRITETHROUGH (name could be debated -- MAP_DAX_FASTFLUSH
>> might be more architecture-neutral, but I'm only familiar with the x86
>> semantics).
>>
>> MAP_DAX_WRITETHROUGH does whatever is needed to ensure that writing through
>> the mapping and then calling fsync is both safe and fast.  On x86, it would
>> (surprise, surprise!) map the pages writethrough and skip adding them to the
>> radix tree.  fsync makes sure to do sfence before pcommit.
>>
>> This is totally safe.  You *can't* abuse this to cause fsync to leave
>> non-persistent dirty cached data anywhere.
>>
>> It makes sufficiently DAX-aware applications very fast.  Reads are
>> unaffected, and non-temporal writes should be the same speed as they are
>> under any other circumstances.
>>
>> It makes applications that set it blindly very slow.  Applications that use
>> standard writes (i.e. plain stores that are neither fast string operations
>> nor explicit non-temporal writes) will suffer.  But they'll still work
>> correctly.
>>
>> Applications that want a WB mapping with manually-managed persistence can
>> still do it, but fsync will be slow.  Adding an fmetadatasync() for their
>> benefit might be a decent idea, but it would just be icing on the cake.
>>
>> Unlike with MAP_DAX_AWARE, there's no issue with malicious users who map the
>> thing with the wrong flag, write, call fsync, and snicker because now the
>> other applications might read data and be surprised that the data they just
>> read isn't persistent even if they subsequently call fsync.
>>
>> There would be details to be hashed out in case a page is mapped normally
>> and with MAP_DAX_WRITETHROUGH in separate mappings.
>>
>
> Interesting...
>
> The mixed mapping problem is made slightly more difficult by the fact
> that we add persistent memory to the direct-map when allocating struct
> page, but probably not insurmountable.  Also, this still has the
> syscall overhead that a MAP_SYNC semantic eliminates, but we need to
> collect numbers to see if that matters.
>
> However, chatting with Andy R. about the NVML use case, the library
> alternates between streaming non-temporal writes and byte-accesses +
> clwb().  The byte accesses get slower with a write-through mapping.
> So, performance data is needed all around to see where these options
> land.

When you say  "byte-access + clwb()", do you mean literally write a
byte, clwb, write a byte, clwb... or do you mean lots of byte accesses
and then one clwb?  If the former, I suspect it could be changed to
non-temporal store + sfence and be faster.

My understanding is that non-temporal store + sfence doesn't populate
the cache, though, which is unfortunate for some use cases.

The real solution would be for Intel to add an efficient operation to
force writeback on a large region of physical pages.

--Andy


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [RFC 0/2] New MAP_PMEM_AWARE mmap flag
  2016-03-11 19:10     ` Andy Lutomirski
@ 2016-03-11 23:02       ` Rudoff, Andy
  0 siblings, 0 replies; 70+ messages in thread
From: Rudoff, Andy @ 2016-03-11 23:02 UTC (permalink / raw)
  To: Andy Lutomirski, Williams, Dan J
  Cc: Arnd Bergmann, linux-nvdimm, Dave Chinner, Oleg Nesterov,
	linux-mm, Mel Gorman, Andy Lutomirski, Johannes Weiner,
	Kirill A. Shutemov



>>
>> The mixed mapping problem is made slightly more difficult by the fact
>> that we add persistent memory to the direct-map when allocating struct
>> page, but probably not insurmountable.  Also, this still has the
>> syscall overhead that a MAP_SYNC semantic eliminates, but we need to
>> collect numbers to see if that matters.
>>
>> However, chatting with Andy R. about the NVML use case, the library
>> alternates between streaming non-temporal writes and byte-accesses +
>> clwb().  The byte accesses get slower with a write-through mapping.
>> So, performance data is needed all around to see where these options
>> land.
>
>When you say  "byte-access + clwb()", do you mean literally write a
>byte, clwb, write a byte, clwb... or do you mean lots of byte accesses
>and then one clwb?  If the former, I suspect it could be changed to
>non-temporal store + sfence and be faster.

Typically a mixture.  That is, there are times where we store a pointer
and follow it immediately with CLWB, and there are times where we do
lots of work and then decide to commit what we've done by running over
a range doing CLWB.  In our libraries, NT stores are easy to use because
we control the code.  But one of the benefits of pmem is that applications
can access data structures in-place, without calling through APIs for
every pointer de-reference, so it gets sort of impractical to require
NT stores.  Imagine, for example, as part of an update to pmem you want
to strcpy() or sprintf() or some other function you didn't write.  Following
that with a call to a commit API that flushes things is easier on the
app developer than requiring them to have NT store versions of all those
routines.
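
The commit step described here boils down to something like the sketch
below -- a rough illustration, not NVML's actual pmem_persist() code, and
it assumes CLWB is available (-mclwb):

#include <immintrin.h>   /* _mm_clwb, _mm_sfence */
#include <stddef.h>
#include <stdint.h>

#define CACHELINE 64

/* Write back every cache line touched by [addr, addr + len) and fence.
 * The application can use ordinary stores, strcpy(), sprintf(), etc.,
 * and then call this once to commit the range. */
static void commit_range(const void *addr, size_t len)
{
        uintptr_t p = (uintptr_t)addr & ~((uintptr_t)CACHELINE - 1);
        uintptr_t end = (uintptr_t)addr + len;

        for (; p < end; p += CACHELINE)
                _mm_clwb((const void *)p);
        _mm_sfence();
}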

>My understanding is that non-temporal store + sfence doesn't populate
>the cache, though, which is unfortunate for some use cases.

That matches my understanding.

>The real solution would be for Intel to add an efficient operation to
>force writeback on a large region of physical pages.

This is under investigation, but unfortunately not available just yet...

-andy

^ permalink raw reply	[flat|nested] 70+ messages in thread

end of thread, other threads:[~2016-03-11 23:02 UTC | newest]

Thread overview: 70+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-02-21 17:03 [RFC 0/2] New MAP_PMEM_AWARE mmap flag Boaz Harrosh
2016-02-21 17:04 ` [RFC 1/2] mmap: Define a new " Boaz Harrosh
2016-02-21 17:06 ` [RFC 2/2] dax: Support " Boaz Harrosh
2016-02-21 19:51 ` [RFC 0/2] New " Dan Williams
2016-02-21 20:24   ` Boaz Harrosh
2016-02-21 20:57     ` Dan Williams
2016-02-21 21:23       ` Boaz Harrosh
2016-02-21 22:03         ` Dan Williams
2016-02-21 22:31           ` Dave Chinner
2016-02-22  9:57             ` Boaz Harrosh
2016-02-22 15:34             ` Jeff Moyer
2016-02-22 17:44               ` Christoph Hellwig
2016-02-22 17:58                 ` Jeff Moyer
2016-02-22 18:03                   ` Christoph Hellwig
2016-02-22 18:52                     ` Jeff Moyer
2016-02-23  9:45                       ` Christoph Hellwig
2016-02-22 20:05                 ` Rudoff, Andy
2016-02-23  9:52                   ` Christoph Hellwig
2016-02-23 10:07                     ` Rudoff, Andy
2016-02-23 12:06                       ` Dave Chinner
2016-02-23 17:10                         ` Ross Zwisler
2016-02-23 21:47                           ` Dave Chinner
2016-02-23 22:15                             ` Boaz Harrosh
2016-02-23 23:28                               ` Dave Chinner
2016-02-24  0:08                                 ` Boaz Harrosh
2016-02-23 14:10                     ` Boaz Harrosh
2016-02-23 16:56                       ` Dan Williams
2016-02-23 17:05                         ` Ross Zwisler
2016-02-23 17:26                           ` Dan Williams
2016-02-23 21:55                         ` Boaz Harrosh
2016-02-23 22:33                           ` Dan Williams
2016-02-23 23:07                             ` Boaz Harrosh
2016-02-23 23:23                               ` Dan Williams
2016-02-23 23:40                                 ` Boaz Harrosh
2016-02-24  0:08                                   ` Dave Chinner
2016-02-23 23:28                             ` Jeff Moyer
2016-02-23 23:34                               ` Dan Williams
2016-02-23 23:43                                 ` Jeff Moyer
2016-02-23 23:56                                   ` Dan Williams
2016-02-24  4:09                                     ` Ross Zwisler
2016-02-24 19:30                                       ` Ross Zwisler
2016-02-25  9:46                                         ` Jan Kara
2016-02-25  7:44                                       ` Boaz Harrosh
2016-02-24 15:02                                     ` Jeff Moyer
2016-02-24 22:56                                       ` Dave Chinner
2016-02-25 16:24                                         ` Jeff Moyer
2016-02-25 19:11                                           ` Jeff Moyer
2016-02-25 20:15                                             ` Dave Chinner
2016-02-25 20:57                                               ` Jeff Moyer
2016-02-25 22:27                                                 ` Dave Chinner
2016-02-26  4:02                                                   ` Dan Williams
2016-02-26 10:04                                                     ` Thanumalayan Sankaranarayana Pillai
2016-02-28 10:17                                                       ` Boaz Harrosh
2016-02-28 10:17                                                         ` Boaz Harrosh
2016-03-03 17:38                                                         ` Howard Chu
2016-02-29 20:25                                                   ` Jeff Moyer
2016-02-25 21:08                                               ` Phil Terry
2016-02-25 21:39                                                 ` Dave Chinner
2016-02-25 21:20                                           ` Dave Chinner
2016-02-29 20:32                                             ` Jeff Moyer
2016-02-23 17:25                       ` Ross Zwisler
2016-02-23 22:47                         ` Boaz Harrosh
2016-02-22 21:50               ` Dave Chinner
2016-02-23 13:51               ` Boaz Harrosh
2016-02-23 14:22                 ` Jeff Moyer
2016-02-22 11:05           ` Boaz Harrosh
2016-03-11  6:44 ` Andy Lutomirski
2016-03-11 19:07   ` Dan Williams
2016-03-11 19:10     ` Andy Lutomirski
2016-03-11 23:02       ` Rudoff, Andy
