All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/3 v3] dax: Fix mmap-write not updating c/mtime
@ 2015-03-23 12:47 ` Boaz Harrosh
  0 siblings, 0 replies; 36+ messages in thread
From: Boaz Harrosh @ 2015-03-23 12:47 UTC (permalink / raw)
  To: Dave Chinner, Matthew Wilcox, Andrew Morton, Kirill A. Shutemov,
	Jan Kara, Hugh Dickins, Mel Gorman, linux-mm, linux-nvdimm,
	linux-fsdevel, Eryu Guan

Hi

[v3]
* I'm re-posting the two DAX patches that fix the mmap-write after read
  problem with DAX. (No changes since [v2])
* I'm also posting a 3rd RFC patch to address what Jan said about fs_freeze
  and making mapping read-only. 
  Jan Please review and see if this is what you meant.

[v2]
Jan Kara has pointed out that if we add the
sb_start/end_pagefault pair in the new pfn_mkwrite we
are then fixing another bug where: A user could start
writing to the page while filesystem is frozen.

[v1]
The main problem is that current mm/memory.c will no call us with page_mkwrite
if we do not have an actual page mapping, which is what DAX uses.
The solution presented here introduces a new pfn_mkwrite to solve this problem.
Please see patch-2 for details.

I've been running with this patch for 4 month both HW and VMs with no apparent
danger, but see patch-1 I played it safe.

I am also posting an xfstest 080 that demonstrate this problem, I believe
that also some git operations (can't remember which) suffer from this problem.
Actually Eryu Guan found that this test fails on some other FS as well.

List of patches:
  [PATCH 1/3] mm: New pfn_mkwrite same as page_mkwrite for VM_PFNMAP
  [PATCH 2/3] dax: use pfn_mkwrite to update c/mtime + freeze
  [PATCH 3/3] RFC: dax: dax_prepare_freeze

  [PATCH v4] xfstest: generic/080 test that mmap-write updates c/mtime

Please I need that some mm person review the first patch?

Andrew hi
I believe this needs to eventually go through your tree. Please pick it
up when you feel it is ready. I believe the first 2 are ready and fix real
bugs.

Matthew hi
I would love to have your ACK on these patches?

Thanks
Boaz

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH 0/3 v3] dax: Fix mmap-write not updating c/mtime
@ 2015-03-23 12:47 ` Boaz Harrosh
  0 siblings, 0 replies; 36+ messages in thread
From: Boaz Harrosh @ 2015-03-23 12:47 UTC (permalink / raw)
  To: Dave Chinner, Matthew Wilcox, Andrew Morton, Kirill A. Shutemov,
	Jan Kara, Hugh Dickins, Mel Gorman, linux-mm, linux-nvdimm,
	linux-fsdevel, Eryu Guan

Hi

[v3]
* I'm re-posting the two DAX patches that fix the mmap-write after read
  problem with DAX. (No changes since [v2])
* I'm also posting a 3rd RFC patch to address what Jan said about fs_freeze
  and making mapping read-only. 
  Jan Please review and see if this is what you meant.

[v2]
Jan Kara has pointed out that if we add the
sb_start/end_pagefault pair in the new pfn_mkwrite we
are then fixing another bug where: A user could start
writing to the page while filesystem is frozen.

[v1]
The main problem is that current mm/memory.c will no call us with page_mkwrite
if we do not have an actual page mapping, which is what DAX uses.
The solution presented here introduces a new pfn_mkwrite to solve this problem.
Please see patch-2 for details.

I've been running with this patch for 4 month both HW and VMs with no apparent
danger, but see patch-1 I played it safe.

I am also posting an xfstest 080 that demonstrate this problem, I believe
that also some git operations (can't remember which) suffer from this problem.
Actually Eryu Guan found that this test fails on some other FS as well.

List of patches:
  [PATCH 1/3] mm: New pfn_mkwrite same as page_mkwrite for VM_PFNMAP
  [PATCH 2/3] dax: use pfn_mkwrite to update c/mtime + freeze
  [PATCH 3/3] RFC: dax: dax_prepare_freeze

  [PATCH v4] xfstest: generic/080 test that mmap-write updates c/mtime

Please I need that some mm person review the first patch?

Andrew hi
I believe this needs to eventually go through your tree. Please pick it
up when you feel it is ready. I believe the first 2 are ready and fix real
bugs.

Matthew hi
I would love to have your ACK on these patches?

Thanks
Boaz

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH 1/3] mm: New pfn_mkwrite same as page_mkwrite for VM_PFNMAP
  2015-03-23 12:47 ` Boaz Harrosh
  (?)
@ 2015-03-23 12:49 ` Boaz Harrosh
  2015-03-23 22:49     ` Andrew Morton
  -1 siblings, 1 reply; 36+ messages in thread
From: Boaz Harrosh @ 2015-03-23 12:49 UTC (permalink / raw)
  To: Dave Chinner, Matthew Wilcox, Andrew Morton, Kirill A. Shutemov,
	Jan Kara, Hugh Dickins, Mel Gorman, linux-mm, linux-nvdimm,
	linux-fsdevel, Eryu Guan

From: Yigal Korman <yigal@plexistor.com>

This will allow FS that uses VM_PFNMAP | VM_MIXEDMAP (no page structs)
to get notified when access is a write to a read-only PFN.

This can happen if we mmap() a file then first mmap-read from it
to page-in a read-only PFN, than we mmap-write to the same page.

We need this functionality to fix a DAX bug, where in the scenario
above we fail to set ctime/mtime though we modified the file.
An xfstest is attached to this patchset that shows the failure
and the fix. (A DAX patch will follow)

This functionality is extra important for us, because upon
dirtying of a pmem page we also want to RDMA the page to a
remote cluster node.

We define a new pfn_mkwrite and do not reuse page_mkwrite because
  1 - The name ;-)
  2 - But mainly because it would take a very long and tedious
      audit of all page_mkwrite functions of VM_MIXEDMAP/VM_PFNMAP
      users. To make sure they do not now CRASH. For example current
      DAX code (which this is for) would crash.
      If we would want to reuse page_mkwrite, We will need to first
      patch all users, so to not-crash-on-no-page. Then enable this
      patch. But even if I did that I would not sleep so well at night.
      Adding a new vector is the safest thing to do, and is not that
      expensive. an extra pointer at a static function vector per driver.
      Also the new vector is better for performance, because else we
      Will call all current Kernel vectors, so to:
	check-ha-no-page-do-nothing and return.

No need to call it from do_shared_fault because do_wp_page is called to
change pte permissions anyway.

CC: Matthew Wilcox <matthew.r.wilcox@intel.com>
CC: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
CC: Jan Kara <jack@suse.cz>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Hugh Dickins <hughd@google.com>
CC: Mel Gorman <mgorman@suse.de>
CC: linux-mm@kvack.org

Signed-off-by: Yigal Korman <yigal@plexistor.com>
Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
---
 include/linux/mm.h |  2 ++
 mm/memory.c        | 27 ++++++++++++++++++++++++++-
 2 files changed, 28 insertions(+), 1 deletion(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 47a9392..1cd820c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -250,6 +250,8 @@ struct vm_operations_struct {
 	/* notification that a previously read-only page is about to become
 	 * writable, if an error is returned it will cause a SIGBUS */
 	int (*page_mkwrite)(struct vm_area_struct *vma, struct vm_fault *vmf);
+	/* same as page_mkwrite when using VM_PFNMAP|VM_MIXEDMAP */
+	int (*pfn_mkwrite)(struct vm_area_struct *vma, struct vm_fault *vmf);
 
 	/* called by access_process_vm when get_user_pages() fails, typically
 	 * for use by special VMAs that can switch between memory and hardware
diff --git a/mm/memory.c b/mm/memory.c
index 8068893..ebef8f6 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1982,6 +1982,22 @@ static int do_page_mkwrite(struct vm_area_struct *vma, struct page *page,
 	return ret;
 }
 
+static int do_pfn_mkwrite(struct vm_area_struct *vma, unsigned long address)
+{
+	struct vm_fault vmf;
+
+	if (!vma->vm_ops || !vma->vm_ops->pfn_mkwrite)
+		return 0;
+
+	vmf.page = 0;
+	vmf.pgoff = (((address & PAGE_MASK) - vma->vm_start) >> PAGE_SHIFT) +
+			vma->vm_pgoff;
+	vmf.virtual_address = (void __user *)(address & PAGE_MASK);
+	vmf.flags = FAULT_FLAG_WRITE|FAULT_FLAG_MKWRITE;
+
+	return vma->vm_ops->pfn_mkwrite(vma, &vmf);
+}
+
 /*
  * This routine handles present pages, when users try to write
  * to a shared page. It is done by copying the page to a new address
@@ -2025,8 +2041,17 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		 * accounting on raw pfn maps.
 		 */
 		if ((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
-				     (VM_WRITE|VM_SHARED))
+				     (VM_WRITE|VM_SHARED)) {
+			pte_unmap_unlock(page_table, ptl);
+			ret = do_pfn_mkwrite(vma, address);
+			if (ret & VM_FAULT_ERROR)
+				return ret;
+			page_table = pte_offset_map_lock(mm, pmd, address,
+							 &ptl);
+			if (!pte_same(*page_table, orig_pte))
+				goto unlock;
 			goto reuse;
+		}
 		goto gotten;
 	}
 
-- 
1.9.3


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 2/3] dax: use pfn_mkwrite to update c/mtime + freeze protection
  2015-03-23 12:47 ` Boaz Harrosh
  (?)
  (?)
@ 2015-03-23 12:52 ` Boaz Harrosh
  -1 siblings, 0 replies; 36+ messages in thread
From: Boaz Harrosh @ 2015-03-23 12:52 UTC (permalink / raw)
  To: Dave Chinner, Matthew Wilcox, Andrew Morton, Kirill A. Shutemov,
	Jan Kara, Hugh Dickins, Mel Gorman, linux-mm, linux-nvdimm,
	linux-fsdevel, Eryu Guan

From: Yigal Korman <yigal@plexistor.com>

[v1]
Without this patch, c/mtime is not updated correctly when mmap'ed page is
first read from and then written to.

A new xfstest is submitted for testing this (generic/080)

[v2]
Jan Kara has pointed out that if we add the
sb_start/end_pagefault pair in the new pfn_mkwrite we
are then fixing another bug where: A user could start
writing to the page while filesystem is frozen.

CC: Jan Kara <jack@suse.cz>
CC: Matthew Wilcox <matthew.r.wilcox@intel.com>
CC: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Yigal Korman <yigal@plexistor.com>
Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
---
 fs/dax.c           | 17 +++++++++++++++++
 fs/ext2/file.c     |  1 +
 fs/ext4/file.c     |  1 +
 include/linux/fs.h |  1 +
 4 files changed, 20 insertions(+)

diff --git a/fs/dax.c b/fs/dax.c
index ed1619e..d0bd1f4 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -464,6 +464,23 @@ int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 EXPORT_SYMBOL_GPL(dax_fault);
 
 /**
+ * dax_pfn_mkwrite - handle first write to DAX page
+ * @vma: The virtual memory area where the fault occurred
+ * @vmf: The description of the fault
+ *
+ */
+int dax_pfn_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+	struct super_block *sb = file_inode(vma->vm_file)->i_sb;
+
+	sb_start_pagefault(sb);
+	file_update_time(vma->vm_file);
+	sb_end_pagefault(sb);
+	return VM_FAULT_NOPAGE;
+}
+EXPORT_SYMBOL_GPL(dax_pfn_mkwrite);
+
+/**
  * dax_zero_page_range - zero a range within a page of a DAX file
  * @inode: The file being truncated
  * @from: The file offset that is being truncated to
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index e317017..866a3ce 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -39,6 +39,7 @@ static int ext2_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 static const struct vm_operations_struct ext2_dax_vm_ops = {
 	.fault		= ext2_dax_fault,
 	.page_mkwrite	= ext2_dax_mkwrite,
+	.pfn_mkwrite	= dax_pfn_mkwrite,
 };
 
 static int ext2_file_mmap(struct file *file, struct vm_area_struct *vma)
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 33a09da..b43a7a6 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -206,6 +206,7 @@ static int ext4_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 static const struct vm_operations_struct ext4_dax_vm_ops = {
 	.fault		= ext4_dax_fault,
 	.page_mkwrite	= ext4_dax_mkwrite,
+	.pfn_mkwrite	= dax_pfn_mkwrite,
 };
 #else
 #define ext4_dax_vm_ops	ext4_file_vm_ops
diff --git a/include/linux/fs.h b/include/linux/fs.h
index b4d71b5..24af817 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2597,6 +2597,7 @@ int dax_clear_blocks(struct inode *, sector_t block, long size);
 int dax_zero_page_range(struct inode *, loff_t from, unsigned len, get_block_t);
 int dax_truncate_page(struct inode *, loff_t from, get_block_t);
 int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t);
+int dax_pfn_mkwrite(struct vm_area_struct *, struct vm_fault *);
 #define dax_mkwrite(vma, vmf, gb)	dax_fault(vma, vmf, gb)
 
 #ifdef CONFIG_BLOCK
-- 
1.9.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 3/3] RFC: dax: dax_prepare_freeze
  2015-03-23 12:47 ` Boaz Harrosh
                   ` (2 preceding siblings ...)
  (?)
@ 2015-03-23 12:54 ` Boaz Harrosh
  2015-03-23 22:40     ` Dave Chinner
  2015-03-24 12:37     ` Boaz Harrosh
  -1 siblings, 2 replies; 36+ messages in thread
From: Boaz Harrosh @ 2015-03-23 12:54 UTC (permalink / raw)
  To: Dave Chinner, Matthew Wilcox, Andrew Morton, Kirill A. Shutemov,
	Jan Kara, Hugh Dickins, Mel Gorman, linux-mm, linux-nvdimm,
	linux-fsdevel, Eryu Guan

From: Boaz Harrosh <boaz@plexistor.com>

When freezing an FS, we must write protect all IS_DAX()
inodes that have an mmap mapping on an inode. Otherwise
application will be able to modify previously faulted-in
file pages.

I'm actually doing a full unmap_mapping_range because
there is no readily available "mapping_write_protect" like
functionality. I do not think it is worth it to define one
just for here and just for some extra read-faults after an
fs_freeze.

How hot-path is fs_freeze at all?

CC: Jan Kara <jack@suse.cz>
CC: Matthew Wilcox <matthew.r.wilcox@intel.com>
CC: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
---
 fs/dax.c           | 30 ++++++++++++++++++++++++++++++
 fs/super.c         |  3 +++
 include/linux/fs.h |  1 +
 3 files changed, 34 insertions(+)

diff --git a/fs/dax.c b/fs/dax.c
index d0bd1f4..f3fc28b 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -549,3 +549,33 @@ int dax_truncate_page(struct inode *inode, loff_t from, get_block_t get_block)
 	return dax_zero_page_range(inode, from, length, get_block);
 }
 EXPORT_SYMBOL_GPL(dax_truncate_page);
+
+/* This is meant to be called as part of freeze_super. otherwise we might
+ * Need some extra locking before calling here.
+ */
+void dax_prepare_freeze(struct super_block *sb)
+{
+	struct inode *inode;
+
+	/* TODO: each DAX fs has some private mount option to enable DAX. If
+	 * We made that option a generic MS_DAX_ENABLE super_block flag we could
+	 * Avoid the 95% extra unneeded loop-on-all-inodes every freeze.
+	 * if (!(sb->s_flags & MS_DAX_ENABLE))
+	 *	return 0;
+	 */
+
+	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+		/* TODO: For freezing we can actually do with write-protecting
+		 * the page. But I cannot find a ready made function that does
+		 * that for a giving mapping (with all the proper locking).
+		 * How performance sensitive is the all sb_freeze API?
+		 * For now we can just unmap the all mapping, and pay extra
+		 * on read faults.
+		 */
+		/* NOTE: Do not unmap private COW mapped pages it will not
+		 * modify the FS.
+		 */
+		if (IS_DAX(inode))
+			unmap_mapping_range(inode->i_mapping, 0, 0, 0);
+	}
+}
diff --git a/fs/super.c b/fs/super.c
index 2b7dc90..9ef490c 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -1329,6 +1329,9 @@ int freeze_super(struct super_block *sb)
 	/* All writers are done so after syncing there won't be dirty data */
 	sync_filesystem(sb);
 
+	/* Need to take care of DAX mmaped inodes */
+	dax_prepare_freeze(sb);
+
 	/* Now wait for internal filesystem counter */
 	sb->s_writers.frozen = SB_FREEZE_FS;
 	smp_wmb();
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 24af817..3b943d4 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2599,6 +2599,7 @@ int dax_truncate_page(struct inode *, loff_t from, get_block_t);
 int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t);
 int dax_pfn_mkwrite(struct vm_area_struct *, struct vm_fault *);
 #define dax_mkwrite(vma, vmf, gb)	dax_fault(vma, vmf, gb)
+void dax_prepare_freeze(struct super_block *sb);
 
 #ifdef CONFIG_BLOCK
 typedef void (dio_submit_t)(int rw, struct bio *bio, struct inode *inode,
-- 
1.9.3


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v4] xfstest: generic/080 test that mmap-write updates c/mtime
  2015-03-23 12:47 ` Boaz Harrosh
                   ` (3 preceding siblings ...)
  (?)
@ 2015-03-23 12:56 ` Boaz Harrosh
  -1 siblings, 0 replies; 36+ messages in thread
From: Boaz Harrosh @ 2015-03-23 12:56 UTC (permalink / raw)
  To: Dave Chinner, Matthew Wilcox, Andrew Morton, Kirill A. Shutemov,
	Jan Kara, Hugh Dickins, Mel Gorman, linux-mm, linux-nvdimm,
	linux-fsdevel, Eryu Guan

From: Dave Chinner <dchinner@redhat.com>

when using mmap() for file i/o, writing to the file should update
it's c/mtime. Specifically if we first mmap-read from a page, then
memap-write to the same page.

This test was failing for the initial submission of DAX because
pfn based mapping do not have an page_mkwrite called for them.
The new Kernel patches that introduce pfn_mkwrite fixes this test.

Written by Dave Chinner but edited and tested by:
	Omer Zilberberg

Dave hands-up man, it looks like you edited this directly
in the email, but there was not even a single typo.

Tested-by: Omer Zilberberg <omzg@plexistor.com>
Tested-by: Boaz Harrosh <boaz@plexistor.com>
Signed-off-by: Omer Zilberberg <omzg@plexistor.com>
Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
Reviewed-by: Eryu Guan <eguan@redhat.com>
---
 tests/generic/080     | 78 +++++++++++++++++++++++++++++++++++++++++++++++++++
 tests/generic/080.out |  2 ++
 tests/generic/group   |  1 +
 3 files changed, 81 insertions(+)
 create mode 100755 tests/generic/080
 create mode 100644 tests/generic/080.out

diff --git a/tests/generic/080 b/tests/generic/080
new file mode 100755
index 0000000..43c93d7
--- /dev/null
+++ b/tests/generic/080
@@ -0,0 +1,78 @@
+#! /bin/bash
+# FS QA Test No. 080
+#
+# Verify that mtime is updated when writing to mmap-ed pages
+#
+#-----------------------------------------------------------------------
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#-----------------------------------------------------------------------
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=0
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+	cd /
+	rm -f $tmp.*
+	rm -f $testfile
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+
+# real QA test starts here
+_supported_fs generic
+_supported_os IRIX Linux
+_require_test
+
+echo "Silence is golden."
+rm -f $seqres.full
+
+# pattern the file.
+testfile=$TEST_DIR/mmap_mtime_testfile
+$XFS_IO_PROG -f -c "pwrite 0 4k" -c fsync $testfile >> $seqres.full
+
+# sample timestamps.
+mtime1=`stat -c %Y $testfile`
+ctime1=`stat -c %Z $testfile`
+echo "before mwrite: $mtime1 $ctime1" >> $seqres.full
+
+# map read followed by map write to trigger timestamp change
+sleep 2
+$XFS_IO_PROG -c "mmap 0 4k" -c "mread 0 4k" -c "mwrite 0 4k" $testfile \
+	>> $seqres.full
+
+# sample and verify that timestamps have changed.
+mtime2=`stat -c %Y $testfile`
+ctime2=`stat -c %Z $testfile`
+echo "after mwrite : $mtime2 $ctime2" >> $seqres.full
+
+if [ "$mtime1" == "$mtime2" ]; then
+	echo "mtime not updated"
+	let status=$status+1
+fi
+if [ "$ctime1" == "$ctime2" ]; then
+	echo "ctime not updated"
+	let status=$status+1
+fi
+
+exit
diff --git a/tests/generic/080.out b/tests/generic/080.out
new file mode 100644
index 0000000..cccac52
--- /dev/null
+++ b/tests/generic/080.out
@@ -0,0 +1,2 @@
+QA output created by 080
+Silence is golden.
diff --git a/tests/generic/group b/tests/generic/group
index d56d3ce..8154401 100644
--- a/tests/generic/group
+++ b/tests/generic/group
@@ -79,6 +79,7 @@
 077 acl attr auto enospc
 078 auto quick metadata
 079 acl attr ioctl metadata auto quick
+080 auto quick
 083 rw auto enospc stress
 088 perms auto quick
 089 metadata auto
-- 
1.9.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH 3/3] RFC: dax: dax_prepare_freeze
  2015-03-23 12:54 ` [PATCH 3/3] RFC: dax: dax_prepare_freeze Boaz Harrosh
@ 2015-03-23 22:40     ` Dave Chinner
  2015-03-24 12:37     ` Boaz Harrosh
  1 sibling, 0 replies; 36+ messages in thread
From: Dave Chinner @ 2015-03-23 22:40 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Matthew Wilcox, Andrew Morton, Kirill A. Shutemov, Jan Kara,
	Hugh Dickins, Mel Gorman, linux-mm, linux-nvdimm, linux-fsdevel,
	Eryu Guan

On Mon, Mar 23, 2015 at 02:54:40PM +0200, Boaz Harrosh wrote:
> From: Boaz Harrosh <boaz@plexistor.com>
> 
> When freezing an FS, we must write protect all IS_DAX()
> inodes that have an mmap mapping on an inode. Otherwise
> application will be able to modify previously faulted-in
> file pages.

All you need to do is lock out page faults once the page is clean;
that's what the sb_start_pagefault() calls are for in the page fault
path - they catch write faults and block them until the filesystem
is unfrozen. Hence I'm not sure why this would be necessary if you
are catching write faults in .pfn_mkwrite....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 3/3] RFC: dax: dax_prepare_freeze
@ 2015-03-23 22:40     ` Dave Chinner
  0 siblings, 0 replies; 36+ messages in thread
From: Dave Chinner @ 2015-03-23 22:40 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Matthew Wilcox, Andrew Morton, Kirill A. Shutemov, Jan Kara,
	Hugh Dickins, Mel Gorman, linux-mm, linux-nvdimm, linux-fsdevel,
	Eryu Guan

On Mon, Mar 23, 2015 at 02:54:40PM +0200, Boaz Harrosh wrote:
> From: Boaz Harrosh <boaz@plexistor.com>
> 
> When freezing an FS, we must write protect all IS_DAX()
> inodes that have an mmap mapping on an inode. Otherwise
> application will be able to modify previously faulted-in
> file pages.

All you need to do is lock out page faults once the page is clean;
that's what the sb_start_pagefault() calls are for in the page fault
path - they catch write faults and block them until the filesystem
is unfrozen. Hence I'm not sure why this would be necessary if you
are catching write faults in .pfn_mkwrite....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/3] mm: New pfn_mkwrite same as page_mkwrite for VM_PFNMAP
  2015-03-23 12:49 ` [PATCH 1/3] mm: New pfn_mkwrite same as page_mkwrite for VM_PFNMAP Boaz Harrosh
@ 2015-03-23 22:49     ` Andrew Morton
  0 siblings, 0 replies; 36+ messages in thread
From: Andrew Morton @ 2015-03-23 22:49 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Dave Chinner, Matthew Wilcox, Kirill A. Shutemov, Jan Kara,
	Hugh Dickins, Mel Gorman, linux-mm, linux-nvdimm, linux-fsdevel,
	Eryu Guan

On Mon, 23 Mar 2015 14:49:32 +0200 Boaz Harrosh <boaz@plexistor.com> wrote:

> From: Yigal Korman <yigal@plexistor.com>
> 
> This will allow FS that uses VM_PFNMAP | VM_MIXEDMAP (no page structs)
> to get notified when access is a write to a read-only PFN.
> 
> This can happen if we mmap() a file then first mmap-read from it
> to page-in a read-only PFN, than we mmap-write to the same page.
> 
> We need this functionality to fix a DAX bug, where in the scenario
> above we fail to set ctime/mtime though we modified the file.
> An xfstest is attached to this patchset that shows the failure
> and the fix. (A DAX patch will follow)
> 
> This functionality is extra important for us, because upon
> dirtying of a pmem page we also want to RDMA the page to a
> remote cluster node.
> 
> We define a new pfn_mkwrite and do not reuse page_mkwrite because
>   1 - The name ;-)
>   2 - But mainly because it would take a very long and tedious
>       audit of all page_mkwrite functions of VM_MIXEDMAP/VM_PFNMAP
>       users. To make sure they do not now CRASH. For example current
>       DAX code (which this is for) would crash.
>       If we would want to reuse page_mkwrite, We will need to first
>       patch all users, so to not-crash-on-no-page. Then enable this
>       patch. But even if I did that I would not sleep so well at night.
>       Adding a new vector is the safest thing to do, and is not that
>       expensive. an extra pointer at a static function vector per driver.
>       Also the new vector is better for performance, because else we
>       Will call all current Kernel vectors, so to:
> 	check-ha-no-page-do-nothing and return.
> 
> No need to call it from do_shared_fault because do_wp_page is called to
> change pte permissions anyway.

Looks OK to me.

> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1982,6 +1982,22 @@ static int do_page_mkwrite(struct vm_area_struct *vma, struct page *page,
>  	return ret;
>  }
>  
> +static int do_pfn_mkwrite(struct vm_area_struct *vma, unsigned long address)
> +{
> +	struct vm_fault vmf;
> +
> +	if (!vma->vm_ops || !vma->vm_ops->pfn_mkwrite)
> +		return 0;
> +
> +	vmf.page = 0;
> +	vmf.pgoff = (((address & PAGE_MASK) - vma->vm_start) >> PAGE_SHIFT) +
> +			vma->vm_pgoff;
> +	vmf.virtual_address = (void __user *)(address & PAGE_MASK);
> +	vmf.flags = FAULT_FLAG_WRITE|FAULT_FLAG_MKWRITE;
> +
> +	return vma->vm_ops->pfn_mkwrite(vma, &vmf);
> +}

It might be a little neater to use

	if (vma->vm_ops && vma->vm_ops->pfn_mkwrite) {
		struct vm_fault vmf = {
			...
		};
		...
	}

>  /*
>   * This routine handles present pages, when users try to write
>   * to a shared page. It is done by copying the page to a new address
> @@ -2025,8 +2041,17 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  		 * accounting on raw pfn maps.
>  		 */
>  		if ((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
> -				     (VM_WRITE|VM_SHARED))
> +				     (VM_WRITE|VM_SHARED)) {
> +			pte_unmap_unlock(page_table, ptl);
> +			ret = do_pfn_mkwrite(vma, address);
> +			if (ret & VM_FAULT_ERROR)
> +				return ret;
> +			page_table = pte_offset_map_lock(mm, pmd, address,
> +							 &ptl);
> +			if (!pte_same(*page_table, orig_pte))
> +				goto unlock;
>  			goto reuse;
> +		}
>  		goto gotten;

There are significant pending changes in this area.  See linux-next,
or http://ozlabs.org/~akpm/mmots/broken-out/mm-refactor-*

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/3] mm: New pfn_mkwrite same as page_mkwrite for VM_PFNMAP
@ 2015-03-23 22:49     ` Andrew Morton
  0 siblings, 0 replies; 36+ messages in thread
From: Andrew Morton @ 2015-03-23 22:49 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Dave Chinner, Matthew Wilcox, Kirill A. Shutemov, Jan Kara,
	Hugh Dickins, Mel Gorman, linux-mm, linux-nvdimm, linux-fsdevel,
	Eryu Guan

On Mon, 23 Mar 2015 14:49:32 +0200 Boaz Harrosh <boaz@plexistor.com> wrote:

> From: Yigal Korman <yigal@plexistor.com>
> 
> This will allow FS that uses VM_PFNMAP | VM_MIXEDMAP (no page structs)
> to get notified when access is a write to a read-only PFN.
> 
> This can happen if we mmap() a file then first mmap-read from it
> to page-in a read-only PFN, than we mmap-write to the same page.
> 
> We need this functionality to fix a DAX bug, where in the scenario
> above we fail to set ctime/mtime though we modified the file.
> An xfstest is attached to this patchset that shows the failure
> and the fix. (A DAX patch will follow)
> 
> This functionality is extra important for us, because upon
> dirtying of a pmem page we also want to RDMA the page to a
> remote cluster node.
> 
> We define a new pfn_mkwrite and do not reuse page_mkwrite because
>   1 - The name ;-)
>   2 - But mainly because it would take a very long and tedious
>       audit of all page_mkwrite functions of VM_MIXEDMAP/VM_PFNMAP
>       users. To make sure they do not now CRASH. For example current
>       DAX code (which this is for) would crash.
>       If we would want to reuse page_mkwrite, We will need to first
>       patch all users, so to not-crash-on-no-page. Then enable this
>       patch. But even if I did that I would not sleep so well at night.
>       Adding a new vector is the safest thing to do, and is not that
>       expensive. an extra pointer at a static function vector per driver.
>       Also the new vector is better for performance, because else we
>       Will call all current Kernel vectors, so to:
> 	check-ha-no-page-do-nothing and return.
> 
> No need to call it from do_shared_fault because do_wp_page is called to
> change pte permissions anyway.

Looks OK to me.

> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1982,6 +1982,22 @@ static int do_page_mkwrite(struct vm_area_struct *vma, struct page *page,
>  	return ret;
>  }
>  
> +static int do_pfn_mkwrite(struct vm_area_struct *vma, unsigned long address)
> +{
> +	struct vm_fault vmf;
> +
> +	if (!vma->vm_ops || !vma->vm_ops->pfn_mkwrite)
> +		return 0;
> +
> +	vmf.page = 0;
> +	vmf.pgoff = (((address & PAGE_MASK) - vma->vm_start) >> PAGE_SHIFT) +
> +			vma->vm_pgoff;
> +	vmf.virtual_address = (void __user *)(address & PAGE_MASK);
> +	vmf.flags = FAULT_FLAG_WRITE|FAULT_FLAG_MKWRITE;
> +
> +	return vma->vm_ops->pfn_mkwrite(vma, &vmf);
> +}

It might be a little neater to use

	if (vma->vm_ops && vma->vm_ops->pfn_mkwrite) {
		struct vm_fault vmf = {
			...
		};
		...
	}

>  /*
>   * This routine handles present pages, when users try to write
>   * to a shared page. It is done by copying the page to a new address
> @@ -2025,8 +2041,17 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  		 * accounting on raw pfn maps.
>  		 */
>  		if ((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
> -				     (VM_WRITE|VM_SHARED))
> +				     (VM_WRITE|VM_SHARED)) {
> +			pte_unmap_unlock(page_table, ptl);
> +			ret = do_pfn_mkwrite(vma, address);
> +			if (ret & VM_FAULT_ERROR)
> +				return ret;
> +			page_table = pte_offset_map_lock(mm, pmd, address,
> +							 &ptl);
> +			if (!pte_same(*page_table, orig_pte))
> +				goto unlock;
>  			goto reuse;
> +		}
>  		goto gotten;

There are significant pending changes in this area.  See linux-next,
or http://ozlabs.org/~akpm/mmots/broken-out/mm-refactor-*

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 3/3] RFC: dax: dax_prepare_freeze
  2015-03-23 22:40     ` Dave Chinner
@ 2015-03-24  6:14       ` Boaz Harrosh
  -1 siblings, 0 replies; 36+ messages in thread
From: Boaz Harrosh @ 2015-03-24  6:14 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Matthew Wilcox, Andrew Morton, Kirill A. Shutemov, Jan Kara,
	Hugh Dickins, Mel Gorman, linux-mm, linux-nvdimm, linux-fsdevel,
	Eryu Guan

On 03/24/2015 12:40 AM, Dave Chinner wrote:
> On Mon, Mar 23, 2015 at 02:54:40PM +0200, Boaz Harrosh wrote:
>> From: Boaz Harrosh <boaz@plexistor.com>
>>
>> When freezing an FS, we must write protect all IS_DAX()
>> inodes that have an mmap mapping on an inode. Otherwise
>> application will be able to modify previously faulted-in
>> file pages.
> 
> All you need to do is lock out page faults once the page is clean;
> that's what the sb_start_pagefault() calls are for in the page fault
> path - they catch write faults and block them until the filesystem
> is unfrozen. Hence I'm not sure why this would be necessary if you
> are catching write faults in .pfn_mkwrite....
> 

Jan pointed it out and he was right I have a test for this. What
happens is that since we had a mapping from before the freeze we will
not have a page-fault. And the buffers will be modified.

As Jan explained in the cache path we do a writeback which turns
all pages to read-only. But with dax we do not have writeback
so the pages stay read-write mapped. Something needs to loop
through the pages and write-protect them. I chose to unmap
them because it is the much-much smaller code, and I do not care
to optimize the freeze.

[Yes, sigh, I will convert the test to an xfstest. May I just add
 it to some existing fs_freeze test. Only novelty is that we need
 to write-access an mmap block before the freeze-start, then continue
 access after the freeze and see modifications
]

> Cheers,
> Dave.
> 

Thanks
Boaz


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 3/3] RFC: dax: dax_prepare_freeze
@ 2015-03-24  6:14       ` Boaz Harrosh
  0 siblings, 0 replies; 36+ messages in thread
From: Boaz Harrosh @ 2015-03-24  6:14 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Matthew Wilcox, Andrew Morton, Kirill A. Shutemov, Jan Kara,
	Hugh Dickins, Mel Gorman, linux-mm, linux-nvdimm, linux-fsdevel,
	Eryu Guan

On 03/24/2015 12:40 AM, Dave Chinner wrote:
> On Mon, Mar 23, 2015 at 02:54:40PM +0200, Boaz Harrosh wrote:
>> From: Boaz Harrosh <boaz@plexistor.com>
>>
>> When freezing an FS, we must write protect all IS_DAX()
>> inodes that have an mmap mapping on an inode. Otherwise
>> application will be able to modify previously faulted-in
>> file pages.
> 
> All you need to do is lock out page faults once the page is clean;
> that's what the sb_start_pagefault() calls are for in the page fault
> path - they catch write faults and block them until the filesystem
> is unfrozen. Hence I'm not sure why this would be necessary if you
> are catching write faults in .pfn_mkwrite....
> 

Jan pointed it out and he was right I have a test for this. What
happens is that since we had a mapping from before the freeze we will
not have a page-fault. And the buffers will be modified.

As Jan explained in the cache path we do a writeback which turns
all pages to read-only. But with dax we do not have writeback
so the pages stay read-write mapped. Something needs to loop
through the pages and write-protect them. I chose to unmap
them because it is the much-much smaller code, and I do not care
to optimize the freeze.

[Yes, sigh, I will convert the test to an xfstest. May I just add
 it to some existing fs_freeze test. Only novelty is that we need
 to write-access an mmap block before the freeze-start, then continue
 access after the freeze and see modifications
]

> Cheers,
> Dave.
> 

Thanks
Boaz

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 3/3] RFC: dax: dax_prepare_freeze
  2015-03-23 12:54 ` [PATCH 3/3] RFC: dax: dax_prepare_freeze Boaz Harrosh
@ 2015-03-24 12:37     ` Boaz Harrosh
  2015-03-24 12:37     ` Boaz Harrosh
  1 sibling, 0 replies; 36+ messages in thread
From: Boaz Harrosh @ 2015-03-24 12:37 UTC (permalink / raw)
  To: Boaz Harrosh, Dave Chinner, Matthew Wilcox, Andrew Morton,
	Kirill A. Shutemov, Jan Kara, Hugh Dickins, Mel Gorman, linux-mm,
	linux-nvdimm, linux-fsdevel, Eryu Guan

On 03/23/2015 02:54 PM, Boaz Harrosh wrote:
> From: Boaz Harrosh <boaz@plexistor.com>
> 
> When freezing an FS, we must write protect all IS_DAX()
> inodes that have an mmap mapping on an inode. Otherwise
> application will be able to modify previously faulted-in
> file pages.
> 
> I'm actually doing a full unmap_mapping_range because
> there is no readily available "mapping_write_protect" like
> functionality. I do not think it is worth it to define one
> just for here and just for some extra read-faults after an
> fs_freeze.
> 
> How hot-path is fs_freeze at all?
> 

OK So reinspecting this was a complete raw RFC. I need to do
more work on this thing

comments below ...

> CC: Jan Kara <jack@suse.cz>
> CC: Matthew Wilcox <matthew.r.wilcox@intel.com>
> CC: Andrew Morton <akpm@linux-foundation.org>
> Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
> ---
>  fs/dax.c           | 30 ++++++++++++++++++++++++++++++
>  fs/super.c         |  3 +++
>  include/linux/fs.h |  1 +
>  3 files changed, 34 insertions(+)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index d0bd1f4..f3fc28b 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -549,3 +549,33 @@ int dax_truncate_page(struct inode *inode, loff_t from, get_block_t get_block)
>  	return dax_zero_page_range(inode, from, length, get_block);
>  }
>  EXPORT_SYMBOL_GPL(dax_truncate_page);
> +
> +/* This is meant to be called as part of freeze_super. otherwise we might
> + * Need some extra locking before calling here.
> + */
> +void dax_prepare_freeze(struct super_block *sb)
> +{
> +	struct inode *inode;
> +
> +	/* TODO: each DAX fs has some private mount option to enable DAX. If
> +	 * We made that option a generic MS_DAX_ENABLE super_block flag we could
> +	 * Avoid the 95% extra unneeded loop-on-all-inodes every freeze.
> +	 * if (!(sb->s_flags & MS_DAX_ENABLE))
> +	 *	return 0;
> +	 */
> +
> +	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> +		/* TODO: For freezing we can actually do with write-protecting
> +		 * the page. But I cannot find a ready made function that does
> +		 * that for a giving mapping (with all the proper locking).
> +		 * How performance sensitive is the all sb_freeze API?
> +		 * For now we can just unmap the all mapping, and pay extra
> +		 * on read faults.
> +		 */
> +		/* NOTE: Do not unmap private COW mapped pages it will not
> +		 * modify the FS.
> +		 */
> +		if (IS_DAX(inode))
> +			unmap_mapping_range(inode->i_mapping, 0, 0, 0);

So what happens here is that we loop on all sb->s_inodes every freeze
and in the not DAX case just do nothing.

It could be nice to have a flag at the sb level to tel us if we need
to expect IS_DAX() inodes at all, for example when we are mounted on
an harddisk it should not be set.

All of ext2/4 and now Dave's xfs have their own
	XFS_MOUNT_DAX / EXT2_MOUNT_DAX / EXT4_MOUNT_DAX

Is it OK if I unify all this on sb->s_flags |= MS_MOUNT_DAX so I can check it
here in Generic code? The option parsing will be done by each FS but
the flag be global?

> +	}
> +}
> diff --git a/fs/super.c b/fs/super.c
> index 2b7dc90..9ef490c 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -1329,6 +1329,9 @@ int freeze_super(struct super_block *sb)
>  	/* All writers are done so after syncing there won't be dirty data */
>  	sync_filesystem(sb);
>  
> +	/* Need to take care of DAX mmaped inodes */
> +	dax_prepare_freeze(sb);
> +

So if CONFIG_FS_DAX is not set this will not compile I need to
define an empty one if not set

Cheers
Boaz


>  	/* Now wait for internal filesystem counter */
>  	sb->s_writers.frozen = SB_FREEZE_FS;
>  	smp_wmb();
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 24af817..3b943d4 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2599,6 +2599,7 @@ int dax_truncate_page(struct inode *, loff_t from, get_block_t);
>  int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t);
>  int dax_pfn_mkwrite(struct vm_area_struct *, struct vm_fault *);
>  #define dax_mkwrite(vma, vmf, gb)	dax_fault(vma, vmf, gb)
> +void dax_prepare_freeze(struct super_block *sb);
>  
>  #ifdef CONFIG_BLOCK
>  typedef void (dio_submit_t)(int rw, struct bio *bio, struct inode *inode,
> 


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 3/3] RFC: dax: dax_prepare_freeze
@ 2015-03-24 12:37     ` Boaz Harrosh
  0 siblings, 0 replies; 36+ messages in thread
From: Boaz Harrosh @ 2015-03-24 12:37 UTC (permalink / raw)
  To: Boaz Harrosh, Dave Chinner, Matthew Wilcox, Andrew Morton,
	Kirill A. Shutemov, Jan Kara, Hugh Dickins, Mel Gorman, linux-mm,
	linux-nvdimm, linux-fsdevel, Eryu Guan

On 03/23/2015 02:54 PM, Boaz Harrosh wrote:
> From: Boaz Harrosh <boaz@plexistor.com>
> 
> When freezing an FS, we must write protect all IS_DAX()
> inodes that have an mmap mapping on an inode. Otherwise
> application will be able to modify previously faulted-in
> file pages.
> 
> I'm actually doing a full unmap_mapping_range because
> there is no readily available "mapping_write_protect" like
> functionality. I do not think it is worth it to define one
> just for here and just for some extra read-faults after an
> fs_freeze.
> 
> How hot-path is fs_freeze at all?
> 

OK So reinspecting this was a complete raw RFC. I need to do
more work on this thing

comments below ...

> CC: Jan Kara <jack@suse.cz>
> CC: Matthew Wilcox <matthew.r.wilcox@intel.com>
> CC: Andrew Morton <akpm@linux-foundation.org>
> Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
> ---
>  fs/dax.c           | 30 ++++++++++++++++++++++++++++++
>  fs/super.c         |  3 +++
>  include/linux/fs.h |  1 +
>  3 files changed, 34 insertions(+)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index d0bd1f4..f3fc28b 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -549,3 +549,33 @@ int dax_truncate_page(struct inode *inode, loff_t from, get_block_t get_block)
>  	return dax_zero_page_range(inode, from, length, get_block);
>  }
>  EXPORT_SYMBOL_GPL(dax_truncate_page);
> +
> +/* This is meant to be called as part of freeze_super. otherwise we might
> + * Need some extra locking before calling here.
> + */
> +void dax_prepare_freeze(struct super_block *sb)
> +{
> +	struct inode *inode;
> +
> +	/* TODO: each DAX fs has some private mount option to enable DAX. If
> +	 * We made that option a generic MS_DAX_ENABLE super_block flag we could
> +	 * Avoid the 95% extra unneeded loop-on-all-inodes every freeze.
> +	 * if (!(sb->s_flags & MS_DAX_ENABLE))
> +	 *	return 0;
> +	 */
> +
> +	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> +		/* TODO: For freezing we can actually do with write-protecting
> +		 * the page. But I cannot find a ready made function that does
> +		 * that for a giving mapping (with all the proper locking).
> +		 * How performance sensitive is the all sb_freeze API?
> +		 * For now we can just unmap the all mapping, and pay extra
> +		 * on read faults.
> +		 */
> +		/* NOTE: Do not unmap private COW mapped pages it will not
> +		 * modify the FS.
> +		 */
> +		if (IS_DAX(inode))
> +			unmap_mapping_range(inode->i_mapping, 0, 0, 0);

So what happens here is that we loop on all sb->s_inodes every freeze
and in the not DAX case just do nothing.

It could be nice to have a flag at the sb level to tel us if we need
to expect IS_DAX() inodes at all, for example when we are mounted on
an harddisk it should not be set.

All of ext2/4 and now Dave's xfs have their own
	XFS_MOUNT_DAX / EXT2_MOUNT_DAX / EXT4_MOUNT_DAX

Is it OK if I unify all this on sb->s_flags |= MS_MOUNT_DAX so I can check it
here in Generic code? The option parsing will be done by each FS but
the flag be global?

> +	}
> +}
> diff --git a/fs/super.c b/fs/super.c
> index 2b7dc90..9ef490c 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -1329,6 +1329,9 @@ int freeze_super(struct super_block *sb)
>  	/* All writers are done so after syncing there won't be dirty data */
>  	sync_filesystem(sb);
>  
> +	/* Need to take care of DAX mmaped inodes */
> +	dax_prepare_freeze(sb);
> +

So if CONFIG_FS_DAX is not set this will not compile I need to
define an empty one if not set

Cheers
Boaz


>  	/* Now wait for internal filesystem counter */
>  	sb->s_writers.frozen = SB_FREEZE_FS;
>  	smp_wmb();
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 24af817..3b943d4 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2599,6 +2599,7 @@ int dax_truncate_page(struct inode *, loff_t from, get_block_t);
>  int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t);
>  int dax_pfn_mkwrite(struct vm_area_struct *, struct vm_fault *);
>  #define dax_mkwrite(vma, vmf, gb)	dax_fault(vma, vmf, gb)
> +void dax_prepare_freeze(struct super_block *sb);
>  
>  #ifdef CONFIG_BLOCK
>  typedef void (dio_submit_t)(int rw, struct bio *bio, struct inode *inode,
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 3/3] RFC: dax: dax_prepare_freeze
  2015-03-24  6:14       ` Boaz Harrosh
@ 2015-03-25  2:22         ` Dave Chinner
  -1 siblings, 0 replies; 36+ messages in thread
From: Dave Chinner @ 2015-03-25  2:22 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Matthew Wilcox, Andrew Morton, Kirill A. Shutemov, Jan Kara,
	Hugh Dickins, Mel Gorman, linux-mm, linux-nvdimm, linux-fsdevel,
	Eryu Guan

On Tue, Mar 24, 2015 at 08:14:59AM +0200, Boaz Harrosh wrote:
> On 03/24/2015 12:40 AM, Dave Chinner wrote:
> > On Mon, Mar 23, 2015 at 02:54:40PM +0200, Boaz Harrosh wrote:
> >> From: Boaz Harrosh <boaz@plexistor.com>
> >>
> >> When freezing an FS, we must write protect all IS_DAX()
> >> inodes that have an mmap mapping on an inode. Otherwise
> >> application will be able to modify previously faulted-in
> >> file pages.
> > 
> > All you need to do is lock out page faults once the page is clean;
> > that's what the sb_start_pagefault() calls are for in the page fault
> > path - they catch write faults and block them until the filesystem
> > is unfrozen. Hence I'm not sure why this would be necessary if you
> > are catching write faults in .pfn_mkwrite....
> > 
> 
> Jan pointed it out and he was right I have a test for this. What
> happens is that since we had a mapping from before the freeze we will
> not have a page-fault. And the buffers will be modified.
> 
> As Jan explained in the cache path we do a writeback which turns
> all pages to read-only. But with dax we do not have writeback
> so the pages stay read-write mapped. Something needs to loop
> through the pages and write-protect them. I chose to unmap
> them because it is the much-much smaller code, and I do not care
> to optimize the freeze.

Then we have wider problem with DAX, then: sync doesn't work
properly. i.e. if we still has write mapped pages, then we haven't
flushed dirty cache lines on write-mapped files to the persistent
domain by the time sync completes.

So, this shouldn't be some special case that only the freeze code
takes into account - we need to make sure that sync (and therefore
freeze) flushes all dirty cache lines and marks all mappings
clean....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 3/3] RFC: dax: dax_prepare_freeze
@ 2015-03-25  2:22         ` Dave Chinner
  0 siblings, 0 replies; 36+ messages in thread
From: Dave Chinner @ 2015-03-25  2:22 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Matthew Wilcox, Andrew Morton, Kirill A. Shutemov, Jan Kara,
	Hugh Dickins, Mel Gorman, linux-mm, linux-nvdimm, linux-fsdevel,
	Eryu Guan

On Tue, Mar 24, 2015 at 08:14:59AM +0200, Boaz Harrosh wrote:
> On 03/24/2015 12:40 AM, Dave Chinner wrote:
> > On Mon, Mar 23, 2015 at 02:54:40PM +0200, Boaz Harrosh wrote:
> >> From: Boaz Harrosh <boaz@plexistor.com>
> >>
> >> When freezing an FS, we must write protect all IS_DAX()
> >> inodes that have an mmap mapping on an inode. Otherwise
> >> application will be able to modify previously faulted-in
> >> file pages.
> > 
> > All you need to do is lock out page faults once the page is clean;
> > that's what the sb_start_pagefault() calls are for in the page fault
> > path - they catch write faults and block them until the filesystem
> > is unfrozen. Hence I'm not sure why this would be necessary if you
> > are catching write faults in .pfn_mkwrite....
> > 
> 
> Jan pointed it out and he was right I have a test for this. What
> happens is that since we had a mapping from before the freeze we will
> not have a page-fault. And the buffers will be modified.
> 
> As Jan explained in the cache path we do a writeback which turns
> all pages to read-only. But with dax we do not have writeback
> so the pages stay read-write mapped. Something needs to loop
> through the pages and write-protect them. I chose to unmap
> them because it is the much-much smaller code, and I do not care
> to optimize the freeze.

Then we have wider problem with DAX, then: sync doesn't work
properly. i.e. if we still has write mapped pages, then we haven't
flushed dirty cache lines on write-mapped files to the persistent
domain by the time sync completes.

So, this shouldn't be some special case that only the freeze code
takes into account - we need to make sure that sync (and therefore
freeze) flushes all dirty cache lines and marks all mappings
clean....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 3/3] RFC: dax: dax_prepare_freeze
  2015-03-24 12:37     ` Boaz Harrosh
@ 2015-03-25  2:26       ` Dave Chinner
  -1 siblings, 0 replies; 36+ messages in thread
From: Dave Chinner @ 2015-03-25  2:26 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Matthew Wilcox, Andrew Morton, Kirill A. Shutemov, Jan Kara,
	Hugh Dickins, Mel Gorman, linux-mm, linux-nvdimm, linux-fsdevel,
	Eryu Guan

On Tue, Mar 24, 2015 at 02:37:45PM +0200, Boaz Harrosh wrote:
> On 03/23/2015 02:54 PM, Boaz Harrosh wrote:
> > From: Boaz Harrosh <boaz@plexistor.com>
> > 
> > When freezing an FS, we must write protect all IS_DAX()
> > inodes that have an mmap mapping on an inode. Otherwise
> > application will be able to modify previously faulted-in
> > file pages.
> > 
> > I'm actually doing a full unmap_mapping_range because
> > there is no readily available "mapping_write_protect" like
> > functionality. I do not think it is worth it to define one
> > just for here and just for some extra read-faults after an
> > fs_freeze.
> > 
> > How hot-path is fs_freeze at all?
> > 
> 
> OK So reinspecting this was a complete raw RFC. I need to do
> more work on this thing
> 
> comments below ...
> 
> > CC: Jan Kara <jack@suse.cz>
> > CC: Matthew Wilcox <matthew.r.wilcox@intel.com>
> > CC: Andrew Morton <akpm@linux-foundation.org>
> > Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
> > ---
> >  fs/dax.c           | 30 ++++++++++++++++++++++++++++++
> >  fs/super.c         |  3 +++
> >  include/linux/fs.h |  1 +
> >  3 files changed, 34 insertions(+)
> > 
> > diff --git a/fs/dax.c b/fs/dax.c
> > index d0bd1f4..f3fc28b 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -549,3 +549,33 @@ int dax_truncate_page(struct inode *inode, loff_t from, get_block_t get_block)
> >  	return dax_zero_page_range(inode, from, length, get_block);
> >  }
> >  EXPORT_SYMBOL_GPL(dax_truncate_page);
> > +
> > +/* This is meant to be called as part of freeze_super. otherwise we might
> > + * Need some extra locking before calling here.
> > + */
> > +void dax_prepare_freeze(struct super_block *sb)
> > +{
> > +	struct inode *inode;
> > +
> > +	/* TODO: each DAX fs has some private mount option to enable DAX. If
> > +	 * We made that option a generic MS_DAX_ENABLE super_block flag we could
> > +	 * Avoid the 95% extra unneeded loop-on-all-inodes every freeze.
> > +	 * if (!(sb->s_flags & MS_DAX_ENABLE))
> > +	 *	return 0;
> > +	 */
> > +
> > +	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {

missing locking.

> > +		/* TODO: For freezing we can actually do with write-protecting
> > +		 * the page. But I cannot find a ready made function that does
> > +		 * that for a giving mapping (with all the proper locking).
> > +		 * How performance sensitive is the all sb_freeze API?
> > +		 * For now we can just unmap the all mapping, and pay extra
> > +		 * on read faults.
> > +		 */
> > +		/* NOTE: Do not unmap private COW mapped pages it will not
> > +		 * modify the FS.
> > +		 */
> > +		if (IS_DAX(inode))
> > +			unmap_mapping_range(inode->i_mapping, 0, 0, 0);
> 
> So what happens here is that we loop on all sb->s_inodes every freeze
> and in the not DAX case just do nothing.

Which is real bad and known to be a performance issue. See Josef's
recent sync scalability patchset posting that only tracks and walks
dirty inodes...

> It could be nice to have a flag at the sb level to tel us if we need
> to expect IS_DAX() inodes at all, for example when we are mounted on
> an harddisk it should not be set.
> 
> All of ext2/4 and now Dave's xfs have their own
> 	XFS_MOUNT_DAX / EXT2_MOUNT_DAX / EXT4_MOUNT_DAX
> 
> Is it OK if I unify all this on sb->s_flags |= MS_MOUNT_DAX so I can check it
> here in Generic code? The option parsing will be done by each FS but
> the flag be global?

No, because as I mentioned in another thread we're going to end up
with filesystems that don't have "mount wide" DAX behaviour, and we
have to check every dirty inode anyway. And....

> > diff --git a/fs/super.c b/fs/super.c
> > index 2b7dc90..9ef490c 100644
> > --- a/fs/super.c
> > +++ b/fs/super.c
> > @@ -1329,6 +1329,9 @@ int freeze_super(struct super_block *sb)
> >  	/* All writers are done so after syncing there won't be dirty data */
> >  	sync_filesystem(sb);
> >  
> > +	/* Need to take care of DAX mmaped inodes */
> > +	dax_prepare_freeze(sb);
> > +
> 
> So if CONFIG_FS_DAX is not set this will not compile I need to
> define an empty one if not set

... it's the wrong approach - sync_filesystem(sb) shoul dbe handling
this problem, so that sync and fsync work correctly, and then you
don't care about whether DAX is supported or not...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 3/3] RFC: dax: dax_prepare_freeze
@ 2015-03-25  2:26       ` Dave Chinner
  0 siblings, 0 replies; 36+ messages in thread
From: Dave Chinner @ 2015-03-25  2:26 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Matthew Wilcox, Andrew Morton, Kirill A. Shutemov, Jan Kara,
	Hugh Dickins, Mel Gorman, linux-mm, linux-nvdimm, linux-fsdevel,
	Eryu Guan

On Tue, Mar 24, 2015 at 02:37:45PM +0200, Boaz Harrosh wrote:
> On 03/23/2015 02:54 PM, Boaz Harrosh wrote:
> > From: Boaz Harrosh <boaz@plexistor.com>
> > 
> > When freezing an FS, we must write protect all IS_DAX()
> > inodes that have an mmap mapping on an inode. Otherwise
> > application will be able to modify previously faulted-in
> > file pages.
> > 
> > I'm actually doing a full unmap_mapping_range because
> > there is no readily available "mapping_write_protect" like
> > functionality. I do not think it is worth it to define one
> > just for here and just for some extra read-faults after an
> > fs_freeze.
> > 
> > How hot-path is fs_freeze at all?
> > 
> 
> OK So reinspecting this was a complete raw RFC. I need to do
> more work on this thing
> 
> comments below ...
> 
> > CC: Jan Kara <jack@suse.cz>
> > CC: Matthew Wilcox <matthew.r.wilcox@intel.com>
> > CC: Andrew Morton <akpm@linux-foundation.org>
> > Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
> > ---
> >  fs/dax.c           | 30 ++++++++++++++++++++++++++++++
> >  fs/super.c         |  3 +++
> >  include/linux/fs.h |  1 +
> >  3 files changed, 34 insertions(+)
> > 
> > diff --git a/fs/dax.c b/fs/dax.c
> > index d0bd1f4..f3fc28b 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -549,3 +549,33 @@ int dax_truncate_page(struct inode *inode, loff_t from, get_block_t get_block)
> >  	return dax_zero_page_range(inode, from, length, get_block);
> >  }
> >  EXPORT_SYMBOL_GPL(dax_truncate_page);
> > +
> > +/* This is meant to be called as part of freeze_super. otherwise we might
> > + * Need some extra locking before calling here.
> > + */
> > +void dax_prepare_freeze(struct super_block *sb)
> > +{
> > +	struct inode *inode;
> > +
> > +	/* TODO: each DAX fs has some private mount option to enable DAX. If
> > +	 * We made that option a generic MS_DAX_ENABLE super_block flag we could
> > +	 * Avoid the 95% extra unneeded loop-on-all-inodes every freeze.
> > +	 * if (!(sb->s_flags & MS_DAX_ENABLE))
> > +	 *	return 0;
> > +	 */
> > +
> > +	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {

missing locking.

> > +		/* TODO: For freezing we can actually do with write-protecting
> > +		 * the page. But I cannot find a ready made function that does
> > +		 * that for a giving mapping (with all the proper locking).
> > +		 * How performance sensitive is the all sb_freeze API?
> > +		 * For now we can just unmap the all mapping, and pay extra
> > +		 * on read faults.
> > +		 */
> > +		/* NOTE: Do not unmap private COW mapped pages it will not
> > +		 * modify the FS.
> > +		 */
> > +		if (IS_DAX(inode))
> > +			unmap_mapping_range(inode->i_mapping, 0, 0, 0);
> 
> So what happens here is that we loop on all sb->s_inodes every freeze
> and in the not DAX case just do nothing.

Which is real bad and known to be a performance issue. See Josef's
recent sync scalability patchset posting that only tracks and walks
dirty inodes...

> It could be nice to have a flag at the sb level to tel us if we need
> to expect IS_DAX() inodes at all, for example when we are mounted on
> an harddisk it should not be set.
> 
> All of ext2/4 and now Dave's xfs have their own
> 	XFS_MOUNT_DAX / EXT2_MOUNT_DAX / EXT4_MOUNT_DAX
> 
> Is it OK if I unify all this on sb->s_flags |= MS_MOUNT_DAX so I can check it
> here in Generic code? The option parsing will be done by each FS but
> the flag be global?

No, because as I mentioned in another thread we're going to end up
with filesystems that don't have "mount wide" DAX behaviour, and we
have to check every dirty inode anyway. And....

> > diff --git a/fs/super.c b/fs/super.c
> > index 2b7dc90..9ef490c 100644
> > --- a/fs/super.c
> > +++ b/fs/super.c
> > @@ -1329,6 +1329,9 @@ int freeze_super(struct super_block *sb)
> >  	/* All writers are done so after syncing there won't be dirty data */
> >  	sync_filesystem(sb);
> >  
> > +	/* Need to take care of DAX mmaped inodes */
> > +	dax_prepare_freeze(sb);
> > +
> 
> So if CONFIG_FS_DAX is not set this will not compile I need to
> define an empty one if not set

... it's the wrong approach - sync_filesystem(sb) shoul dbe handling
this problem, so that sync and fsync work correctly, and then you
don't care about whether DAX is supported or not...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 3/3] RFC: dax: dax_prepare_freeze
  2015-03-25  2:22         ` Dave Chinner
  (?)
@ 2015-03-25  8:10         ` Boaz Harrosh
  2015-03-25  9:29             ` Dave Chinner
  -1 siblings, 1 reply; 36+ messages in thread
From: Boaz Harrosh @ 2015-03-25  8:10 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Matthew Wilcox, Andrew Morton, Kirill A. Shutemov, Jan Kara,
	Hugh Dickins, Mel Gorman, linux-mm, linux-nvdimm, linux-fsdevel,
	Eryu Guan

On 03/25/2015 04:22 AM, Dave Chinner wrote:
> On Tue, Mar 24, 2015 at 08:14:59AM +0200, Boaz Harrosh wrote:
<>
> 
> Then we have wider problem with DAX, then: sync doesn't work
> properly. i.e. if we still has write mapped pages, then we haven't
> flushed dirty cache lines on write-mapped files to the persistent
> domain by the time sync completes.
> 
> So, this shouldn't be some special case that only the freeze code
> takes into account - we need to make sure that sync (and therefore
> freeze) flushes all dirty cache lines and marks all mappings
> clean....
> 

This is not how I understood it and how I read the code.

The sync does happen, .fsync of the FS is called on each
file just as if the user called it. If this is broken it just
needs to be fixed there at the .fsync vector. POSIX mandate
persistence at .fsync so at the vfs layer we rely on that.

So everything at this stage should be synced to real media.

What does not happen is writeback. since dax does not have
any writeback. And because of that nothing turned the
user mappings to read only. This is what I do here but
instead of write-protecting I just unmap because it is
easier for me to code it.

> Cheers,
> Dave.

Cheers
Boaz

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 3/3] RFC: dax: dax_prepare_freeze
  2015-03-25  2:26       ` Dave Chinner
@ 2015-03-25  8:31         ` Boaz Harrosh
  -1 siblings, 0 replies; 36+ messages in thread
From: Boaz Harrosh @ 2015-03-25  8:31 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Matthew Wilcox, Andrew Morton, Kirill A. Shutemov, Jan Kara,
	Hugh Dickins, Mel Gorman, linux-mm, linux-nvdimm, linux-fsdevel,
	Eryu Guan

On 03/25/2015 04:26 AM, Dave Chinner wrote:
<>
>>> +	/* TODO: each DAX fs has some private mount option to enable DAX. If
>>> +	 * We made that option a generic MS_DAX_ENABLE super_block flag we could
>>> +	 * Avoid the 95% extra unneeded loop-on-all-inodes every freeze.
>>> +	 * if (!(sb->s_flags & MS_DAX_ENABLE))
>>> +	 *	return 0;
>>> +	 */
>>> +
>>> +	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> 
> missing locking.
> 

I will please need help here. This is very deep inside the freeze process
we area already holding bunch of locks. We know that nothing can be modified
at this stage. We are completely read-only.

Only thing I can see that can happen is inode eviction do to oom. So
do I need an iget. But how do I know the iget is allowed here?

OK I do not have a clue what locks do I need, this deep in the freeze?

>>> +		/* TODO: For freezing we can actually do with write-protecting
>>> +		 * the page. But I cannot find a ready made function that does
>>> +		 * that for a giving mapping (with all the proper locking).
>>> +		 * How performance sensitive is the all sb_freeze API?
>>> +		 * For now we can just unmap the all mapping, and pay extra
>>> +		 * on read faults.
>>> +		 */
>>> +		/* NOTE: Do not unmap private COW mapped pages it will not
>>> +		 * modify the FS.
>>> +		 */
>>> +		if (IS_DAX(inode))
>>> +			unmap_mapping_range(inode->i_mapping, 0, 0, 0);
>>
>> So what happens here is that we loop on all sb->s_inodes every freeze
>> and in the not DAX case just do nothing.
> 
> Which is real bad and known to be a performance issue. See Josef's
> recent sync scalability patchset posting that only tracks and walks
> dirty inodes...
> 

Sure but how hot is freeze? Josef's fixed the very hot sync path,
but freeze happens once in a blue moon. Do we care?

>> It could be nice to have a flag at the sb level to tel us if we need
>> to expect IS_DAX() inodes at all, for example when we are mounted on
>> an harddisk it should not be set.
>>
>> All of ext2/4 and now Dave's xfs have their own
>> 	XFS_MOUNT_DAX / EXT2_MOUNT_DAX / EXT4_MOUNT_DAX
>>
>> Is it OK if I unify all this on sb->s_flags |= MS_MOUNT_DAX so I can check it
>> here in Generic code? The option parsing will be done by each FS but
>> the flag be global?
> 
> No, because as I mentioned in another thread we're going to end up
> with filesystems that don't have "mount wide" DAX behaviour, and we
> have to check every dirty inode anyway. And....
> 

Sure! but let us contract with the FS, that please set the MS_MOUNT_DAX
if there is any chance at all that IS_DAX() comes out true, so we loop
here. 

OK You know what, I will change this check to be:
	if (sb->s_bdev->bd_disk->fops->direct_access)

BTW: We must loop this way on every sb inode because we do not have
dirty inodes. There is no "dirty"ing going on in dax, not of inodes
and not of pages.

>>> diff --git a/fs/super.c b/fs/super.c
>>> index 2b7dc90..9ef490c 100644
>>> --- a/fs/super.c
>>> +++ b/fs/super.c
>>> @@ -1329,6 +1329,9 @@ int freeze_super(struct super_block *sb)
>>>  	/* All writers are done so after syncing there won't be dirty data */
>>>  	sync_filesystem(sb);
>>>  
>>> +	/* Need to take care of DAX mmaped inodes */
>>> +	dax_prepare_freeze(sb);
>>> +
>>
>> So if CONFIG_FS_DAX is not set this will not compile I need to
>> define an empty one if not set
> 
> ... it's the wrong approach - sync_filesystem(sb) shoul dbe handling
> this problem, so that sync and fsync work correctly, and then you
> don't care about whether DAX is supported or not...
> 

sync and fsync should and will work correctly, but this does not
solve our problem. because what turns pages to read-only is the
writeback. And we do not have this in dax. Therefore we need to
do this here as a special case.

> Cheers,
> Dave.
> 

I have a new patchset with all this, I will send it once it is fully
tested, I have problems testing both freeze and splice there are not
any good tests that I could find that do what I want, so still working.

Thanks
Boaz


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 3/3] RFC: dax: dax_prepare_freeze
@ 2015-03-25  8:31         ` Boaz Harrosh
  0 siblings, 0 replies; 36+ messages in thread
From: Boaz Harrosh @ 2015-03-25  8:31 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Matthew Wilcox, Andrew Morton, Kirill A. Shutemov, Jan Kara,
	Hugh Dickins, Mel Gorman, linux-mm, linux-nvdimm, linux-fsdevel,
	Eryu Guan

On 03/25/2015 04:26 AM, Dave Chinner wrote:
<>
>>> +	/* TODO: each DAX fs has some private mount option to enable DAX. If
>>> +	 * We made that option a generic MS_DAX_ENABLE super_block flag we could
>>> +	 * Avoid the 95% extra unneeded loop-on-all-inodes every freeze.
>>> +	 * if (!(sb->s_flags & MS_DAX_ENABLE))
>>> +	 *	return 0;
>>> +	 */
>>> +
>>> +	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> 
> missing locking.
> 

I will please need help here. This is very deep inside the freeze process
we area already holding bunch of locks. We know that nothing can be modified
at this stage. We are completely read-only.

Only thing I can see that can happen is inode eviction do to oom. So
do I need an iget. But how do I know the iget is allowed here?

OK I do not have a clue what locks do I need, this deep in the freeze?

>>> +		/* TODO: For freezing we can actually do with write-protecting
>>> +		 * the page. But I cannot find a ready made function that does
>>> +		 * that for a giving mapping (with all the proper locking).
>>> +		 * How performance sensitive is the all sb_freeze API?
>>> +		 * For now we can just unmap the all mapping, and pay extra
>>> +		 * on read faults.
>>> +		 */
>>> +		/* NOTE: Do not unmap private COW mapped pages it will not
>>> +		 * modify the FS.
>>> +		 */
>>> +		if (IS_DAX(inode))
>>> +			unmap_mapping_range(inode->i_mapping, 0, 0, 0);
>>
>> So what happens here is that we loop on all sb->s_inodes every freeze
>> and in the not DAX case just do nothing.
> 
> Which is real bad and known to be a performance issue. See Josef's
> recent sync scalability patchset posting that only tracks and walks
> dirty inodes...
> 

Sure but how hot is freeze? Josef's fixed the very hot sync path,
but freeze happens once in a blue moon. Do we care?

>> It could be nice to have a flag at the sb level to tel us if we need
>> to expect IS_DAX() inodes at all, for example when we are mounted on
>> an harddisk it should not be set.
>>
>> All of ext2/4 and now Dave's xfs have their own
>> 	XFS_MOUNT_DAX / EXT2_MOUNT_DAX / EXT4_MOUNT_DAX
>>
>> Is it OK if I unify all this on sb->s_flags |= MS_MOUNT_DAX so I can check it
>> here in Generic code? The option parsing will be done by each FS but
>> the flag be global?
> 
> No, because as I mentioned in another thread we're going to end up
> with filesystems that don't have "mount wide" DAX behaviour, and we
> have to check every dirty inode anyway. And....
> 

Sure! but let us contract with the FS, that please set the MS_MOUNT_DAX
if there is any chance at all that IS_DAX() comes out true, so we loop
here. 

OK You know what, I will change this check to be:
	if (sb->s_bdev->bd_disk->fops->direct_access)

BTW: We must loop this way on every sb inode because we do not have
dirty inodes. There is no "dirty"ing going on in dax, not of inodes
and not of pages.

>>> diff --git a/fs/super.c b/fs/super.c
>>> index 2b7dc90..9ef490c 100644
>>> --- a/fs/super.c
>>> +++ b/fs/super.c
>>> @@ -1329,6 +1329,9 @@ int freeze_super(struct super_block *sb)
>>>  	/* All writers are done so after syncing there won't be dirty data */
>>>  	sync_filesystem(sb);
>>>  
>>> +	/* Need to take care of DAX mmaped inodes */
>>> +	dax_prepare_freeze(sb);
>>> +
>>
>> So if CONFIG_FS_DAX is not set this will not compile I need to
>> define an empty one if not set
> 
> ... it's the wrong approach - sync_filesystem(sb) shoul dbe handling
> this problem, so that sync and fsync work correctly, and then you
> don't care about whether DAX is supported or not...
> 

sync and fsync should and will work correctly, but this does not
solve our problem. because what turns pages to read-only is the
writeback. And we do not have this in dax. Therefore we need to
do this here as a special case.

> Cheers,
> Dave.
> 

I have a new patchset with all this, I will send it once it is fully
tested, I have problems testing both freeze and splice there are not
any good tests that I could find that do what I want, so still working.

Thanks
Boaz

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 3/3] RFC: dax: dax_prepare_freeze
  2015-03-25  8:10         ` Boaz Harrosh
@ 2015-03-25  9:29             ` Dave Chinner
  0 siblings, 0 replies; 36+ messages in thread
From: Dave Chinner @ 2015-03-25  9:29 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Matthew Wilcox, Andrew Morton, Kirill A. Shutemov, Jan Kara,
	Hugh Dickins, Mel Gorman, linux-mm, linux-nvdimm, linux-fsdevel,
	Eryu Guan

On Wed, Mar 25, 2015 at 10:10:31AM +0200, Boaz Harrosh wrote:
> On 03/25/2015 04:22 AM, Dave Chinner wrote:
> > On Tue, Mar 24, 2015 at 08:14:59AM +0200, Boaz Harrosh wrote:
> <>
> > 
> > Then we have wider problem with DAX, then: sync doesn't work
> > properly. i.e. if we still has write mapped pages, then we haven't
> > flushed dirty cache lines on write-mapped files to the persistent
> > domain by the time sync completes.
> > 
> > So, this shouldn't be some special case that only the freeze code
> > takes into account - we need to make sure that sync (and therefore
> > freeze) flushes all dirty cache lines and marks all mappings
> > clean....
> > 
> 
> This is not how I understood it and how I read the code.
> 
> The sync does happen, .fsync of the FS is called on each
> file just as if the user called it. If this is broken it just
> needs to be fixed there at the .fsync vector. POSIX mandate
> persistence at .fsync so at the vfs layer we rely on that.

right now, the filesystems will see that there are no dirty pages
on the inode, and then just sync the inode metadata. They will do
nothing else as filesystems are not aware of CPU cachelines at all.

> So everything at this stage should be synced to real media.

Actually no. This is what intel are introducing new CPU instructions
for - so fsync can flush the cpu caches and commit them to th
persistence domain correctly.

> What does not happen is writeback. since dax does not have
> any writeback.

Which is precisely the problem we need to address - we don't need
writeback to a block device, but we do need the dirty CPU cachelines
flushed and the mappings cleaned.

> And because of that nothing turned the
> user mappings to read only. This is what I do here but
> instead of write-protecting I just unmap because it is
> easier for me to code it.

That doesn't mean it is the correct solution.

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 3/3] RFC: dax: dax_prepare_freeze
@ 2015-03-25  9:29             ` Dave Chinner
  0 siblings, 0 replies; 36+ messages in thread
From: Dave Chinner @ 2015-03-25  9:29 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Matthew Wilcox, Andrew Morton, Kirill A. Shutemov, Jan Kara,
	Hugh Dickins, Mel Gorman, linux-mm, linux-nvdimm, linux-fsdevel,
	Eryu Guan

On Wed, Mar 25, 2015 at 10:10:31AM +0200, Boaz Harrosh wrote:
> On 03/25/2015 04:22 AM, Dave Chinner wrote:
> > On Tue, Mar 24, 2015 at 08:14:59AM +0200, Boaz Harrosh wrote:
> <>
> > 
> > Then we have wider problem with DAX, then: sync doesn't work
> > properly. i.e. if we still has write mapped pages, then we haven't
> > flushed dirty cache lines on write-mapped files to the persistent
> > domain by the time sync completes.
> > 
> > So, this shouldn't be some special case that only the freeze code
> > takes into account - we need to make sure that sync (and therefore
> > freeze) flushes all dirty cache lines and marks all mappings
> > clean....
> > 
> 
> This is not how I understood it and how I read the code.
> 
> The sync does happen, .fsync of the FS is called on each
> file just as if the user called it. If this is broken it just
> needs to be fixed there at the .fsync vector. POSIX mandate
> persistence at .fsync so at the vfs layer we rely on that.

right now, the filesystems will see that there are no dirty pages
on the inode, and then just sync the inode metadata. They will do
nothing else as filesystems are not aware of CPU cachelines at all.

> So everything at this stage should be synced to real media.

Actually no. This is what intel are introducing new CPU instructions
for - so fsync can flush the cpu caches and commit them to th
persistence domain correctly.

> What does not happen is writeback. since dax does not have
> any writeback.

Which is precisely the problem we need to address - we don't need
writeback to a block device, but we do need the dirty CPU cachelines
flushed and the mappings cleaned.

> And because of that nothing turned the
> user mappings to read only. This is what I do here but
> instead of write-protecting I just unmap because it is
> easier for me to code it.

That doesn't mean it is the correct solution.

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 3/3] RFC: dax: dax_prepare_freeze
  2015-03-25  8:31         ` Boaz Harrosh
@ 2015-03-25  9:41           ` Dave Chinner
  -1 siblings, 0 replies; 36+ messages in thread
From: Dave Chinner @ 2015-03-25  9:41 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Matthew Wilcox, Andrew Morton, Kirill A. Shutemov, Jan Kara,
	Hugh Dickins, Mel Gorman, linux-mm, linux-nvdimm, linux-fsdevel,
	Eryu Guan

On Wed, Mar 25, 2015 at 10:31:22AM +0200, Boaz Harrosh wrote:
> On 03/25/2015 04:26 AM, Dave Chinner wrote:
> <>
> >>> +	/* TODO: each DAX fs has some private mount option to enable DAX. If
> >>> +	 * We made that option a generic MS_DAX_ENABLE super_block flag we could
> >>> +	 * Avoid the 95% extra unneeded loop-on-all-inodes every freeze.
> >>> +	 * if (!(sb->s_flags & MS_DAX_ENABLE))
> >>> +	 *	return 0;
> >>> +	 */
> >>> +
> >>> +	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> > 
> > missing locking.
> > 
> 
> I will please need help here. This is very deep inside the freeze process
> we area already holding bunch of locks. We know that nothing can be modified
> at this stage. We are completely read-only.

Which means we could stillbe reading new inodes in off disk and
hence the sb->s_inodes list can be changing. Memory reclaim can be
running via the shrinker, freeing clean inodes, hence the
sb->s_inodes list can be changing.

>From fs/inode.c:

/*
 * Inode locking rules:
.....
 * inode_sb_list_lock protects:
 *   sb->s_inodes, inode->i_sb_list

This...

> >>> +		/* TODO: For freezing we can actually do with write-protecting
> >>> +		 * the page. But I cannot find a ready made function that does
> >>> +		 * that for a giving mapping (with all the proper locking).
> >>> +		 * How performance sensitive is the all sb_freeze API?
> >>> +		 * For now we can just unmap the all mapping, and pay extra
> >>> +		 * on read faults.
> >>> +		 */
> >>> +		/* NOTE: Do not unmap private COW mapped pages it will not
> >>> +		 * modify the FS.
> >>> +		 */
> >>> +		if (IS_DAX(inode))
> >>> +			unmap_mapping_range(inode->i_mapping, 0, 0, 0);
> >>
> >> So what happens here is that we loop on all sb->s_inodes every freeze
> >> and in the not DAX case just do nothing.
> > 
> > Which is real bad and known to be a performance issue. See Josef's
> > recent sync scalability patchset posting that only tracks and walks
> > dirty inodes...
> 
> Sure but how hot is freeze? Josef's fixed the very hot sync path,
> but freeze happens once in a blue moon. Do we care?

Yes, because if you have 50 million cached inodes on a filesystem,
it's going to take a long time to traverse them all, and right now
the inode_sb_list_lock is a *global lock*.

> >> It could be nice to have a flag at the sb level to tel us if we need
> >> to expect IS_DAX() inodes at all, for example when we are mounted on
> >> an harddisk it should not be set.
> >>
> >> All of ext2/4 and now Dave's xfs have their own
> >> 	XFS_MOUNT_DAX / EXT2_MOUNT_DAX / EXT4_MOUNT_DAX
> >>
> >> Is it OK if I unify all this on sb->s_flags |= MS_MOUNT_DAX so I can check it
> >> here in Generic code? The option parsing will be done by each FS but
> >> the flag be global?
> > 
> > No, because as I mentioned in another thread we're going to end up
> > with filesystems that don't have "mount wide" DAX behaviour, and we
> > have to check every dirty inode anyway. And....
> > 
> 
> Sure! but let us contract with the FS, that please set the MS_MOUNT_DAX
> if there is any chance at all that IS_DAX() comes out true, so we loop
> here. 

The mount option is irrelevant here - we should only be looping over
dirty inodes.  We don't care if they are DAX or not - we have to
iterate them and ensure they are properly clean. We already have
infrastructure to do this - we should use it and fix the problem
once and for all rather than hacking special case code into random
places.

> BTW: We must loop this way on every sb inode because we do not have
> dirty inodes. There is no "dirty"ing going on in dax, not of inodes
> and not of pages.

Precisely the problem we need to address. We do have dirty inodes,
we just never set the fact they have dirty "pages" on them and hence
never do data writeback on them.

> > ... it's the wrong approach - sync_filesystem(sb) shoul dbe handling
> > this problem, so that sync and fsync work correctly, and then you
> > don't care about whether DAX is supported or not...
> > 
> 
> sync and fsync should and will work correctly, but this does not
> solve our problem. because what turns pages to read-only is the
> writeback. And we do not have this in dax. Therefore we need to
> do this here as a special case.

We can still use exactly the same dirty tracking as we use for data
writeback. The difference is that we don't need to go through all
teh page writeback; we can just flush the CPU caches and mark all
the mappings clean, then clear the I_DIRTY_PAGES flag and move on to
inode writeback....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 3/3] RFC: dax: dax_prepare_freeze
@ 2015-03-25  9:41           ` Dave Chinner
  0 siblings, 0 replies; 36+ messages in thread
From: Dave Chinner @ 2015-03-25  9:41 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Matthew Wilcox, Andrew Morton, Kirill A. Shutemov, Jan Kara,
	Hugh Dickins, Mel Gorman, linux-mm, linux-nvdimm, linux-fsdevel,
	Eryu Guan

On Wed, Mar 25, 2015 at 10:31:22AM +0200, Boaz Harrosh wrote:
> On 03/25/2015 04:26 AM, Dave Chinner wrote:
> <>
> >>> +	/* TODO: each DAX fs has some private mount option to enable DAX. If
> >>> +	 * We made that option a generic MS_DAX_ENABLE super_block flag we could
> >>> +	 * Avoid the 95% extra unneeded loop-on-all-inodes every freeze.
> >>> +	 * if (!(sb->s_flags & MS_DAX_ENABLE))
> >>> +	 *	return 0;
> >>> +	 */
> >>> +
> >>> +	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> > 
> > missing locking.
> > 
> 
> I will please need help here. This is very deep inside the freeze process
> we area already holding bunch of locks. We know that nothing can be modified
> at this stage. We are completely read-only.

Which means we could stillbe reading new inodes in off disk and
hence the sb->s_inodes list can be changing. Memory reclaim can be
running via the shrinker, freeing clean inodes, hence the
sb->s_inodes list can be changing.

>From fs/inode.c:

/*
 * Inode locking rules:
.....
 * inode_sb_list_lock protects:
 *   sb->s_inodes, inode->i_sb_list

This...

> >>> +		/* TODO: For freezing we can actually do with write-protecting
> >>> +		 * the page. But I cannot find a ready made function that does
> >>> +		 * that for a giving mapping (with all the proper locking).
> >>> +		 * How performance sensitive is the all sb_freeze API?
> >>> +		 * For now we can just unmap the all mapping, and pay extra
> >>> +		 * on read faults.
> >>> +		 */
> >>> +		/* NOTE: Do not unmap private COW mapped pages it will not
> >>> +		 * modify the FS.
> >>> +		 */
> >>> +		if (IS_DAX(inode))
> >>> +			unmap_mapping_range(inode->i_mapping, 0, 0, 0);
> >>
> >> So what happens here is that we loop on all sb->s_inodes every freeze
> >> and in the not DAX case just do nothing.
> > 
> > Which is real bad and known to be a performance issue. See Josef's
> > recent sync scalability patchset posting that only tracks and walks
> > dirty inodes...
> 
> Sure but how hot is freeze? Josef's fixed the very hot sync path,
> but freeze happens once in a blue moon. Do we care?

Yes, because if you have 50 million cached inodes on a filesystem,
it's going to take a long time to traverse them all, and right now
the inode_sb_list_lock is a *global lock*.

> >> It could be nice to have a flag at the sb level to tel us if we need
> >> to expect IS_DAX() inodes at all, for example when we are mounted on
> >> an harddisk it should not be set.
> >>
> >> All of ext2/4 and now Dave's xfs have their own
> >> 	XFS_MOUNT_DAX / EXT2_MOUNT_DAX / EXT4_MOUNT_DAX
> >>
> >> Is it OK if I unify all this on sb->s_flags |= MS_MOUNT_DAX so I can check it
> >> here in Generic code? The option parsing will be done by each FS but
> >> the flag be global?
> > 
> > No, because as I mentioned in another thread we're going to end up
> > with filesystems that don't have "mount wide" DAX behaviour, and we
> > have to check every dirty inode anyway. And....
> > 
> 
> Sure! but let us contract with the FS, that please set the MS_MOUNT_DAX
> if there is any chance at all that IS_DAX() comes out true, so we loop
> here. 

The mount option is irrelevant here - we should only be looping over
dirty inodes.  We don't care if they are DAX or not - we have to
iterate them and ensure they are properly clean. We already have
infrastructure to do this - we should use it and fix the problem
once and for all rather than hacking special case code into random
places.

> BTW: We must loop this way on every sb inode because we do not have
> dirty inodes. There is no "dirty"ing going on in dax, not of inodes
> and not of pages.

Precisely the problem we need to address. We do have dirty inodes,
we just never set the fact they have dirty "pages" on them and hence
never do data writeback on them.

> > ... it's the wrong approach - sync_filesystem(sb) shoul dbe handling
> > this problem, so that sync and fsync work correctly, and then you
> > don't care about whether DAX is supported or not...
> > 
> 
> sync and fsync should and will work correctly, but this does not
> solve our problem. because what turns pages to read-only is the
> writeback. And we do not have this in dax. Therefore we need to
> do this here as a special case.

We can still use exactly the same dirty tracking as we use for data
writeback. The difference is that we don't need to go through all
teh page writeback; we can just flush the CPU caches and mark all
the mappings clean, then clear the I_DIRTY_PAGES flag and move on to
inode writeback....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 3/3] RFC: dax: dax_prepare_freeze
  2015-03-25  9:29             ` Dave Chinner
@ 2015-03-25 10:19               ` Boaz Harrosh
  -1 siblings, 0 replies; 36+ messages in thread
From: Boaz Harrosh @ 2015-03-25 10:19 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Matthew Wilcox, Andrew Morton, Kirill A. Shutemov, Jan Kara,
	Hugh Dickins, Mel Gorman, linux-mm, linux-nvdimm, linux-fsdevel,
	Eryu Guan

On 03/25/2015 11:29 AM, Dave Chinner wrote:
> On Wed, Mar 25, 2015 at 10:10:31AM +0200, Boaz Harrosh wrote:
>> On 03/25/2015 04:22 AM, Dave Chinner wrote:
>>> On Tue, Mar 24, 2015 at 08:14:59AM +0200, Boaz Harrosh wrote:
>> <>
<>
>> The sync does happen, .fsync of the FS is called on each
>> file just as if the user called it. If this is broken it just
>> needs to be fixed there at the .fsync vector. POSIX mandate
>> persistence at .fsync so at the vfs layer we rely on that.
> 
> right now, the filesystems will see that there are no dirty pages
> on the inode, and then just sync the inode metadata. They will do
> nothing else as filesystems are not aware of CPU cachelines at all.
> 

Sigh yes. There is this bug. And I am sitting on a wide fix for this.

The strategy is. All Kernel writes are done with a new copy_user_nt.
NT stands for none-temporal. This shows 20% improvements since cachelines
need not be fetched when written too.

The arches that do not have NT instructions, will use a generic
copy_user_nt that does a copy_user and then flush cashes.
Same flush cashes we do before DMA IO. (effectively every 4k)
[Its more complicated with the edges and all, by I have solved
 all this. Will post in a week or two]

So what is left is the mmaped inodes. The logic here is that
at .fsync vector dax inodes will do a cl_flush only if mapping_mapped()
is true. Also .msync is the same as .fsync

And one last thing we also call .fsync at vm_operations_struct->close
because it is allowed for an app to do mmap, munmap, .fsync so we just
call dax .fsync at munmap always.

So by now we should be covered for fsync guaranty.

>> So everything at this stage should be synced to real media.
> 
> Actually no. This is what intel are introducing new CPU instructions
> for - so fsync can flush the cpu caches and commit them to th
> persistence domain correctly.
> 

The new intel instructions are for an optimization, and they will
fit in the picture for the CPUs that have it. But there are already
NT instructions for existing CPUs. (Just not as fast and precise)

Every ARCH will do its best under a small API
	copy_user_nt	- data is at media
	memset_nt	- data is at media
	cl_flush	- partial written cachelines flushed to media 
	sfence		- New data seen by all CPUs

>> What does not happen is writeback. since dax does not have
>> any writeback.
> 
> Which is precisely the problem we need to address - we don't need
> writeback to a block device, but we do need the dirty CPU cachelines
> flushed and the mappings cleaned.
> 

I see what you mean. Since nothing dirtied the inode then above
.fsync will not be called and we have not pushed mmap data to
media.

Again here we only need to do this for mmaped inodes, because
Kernel written data is (will be) written NT style.

>> And because of that nothing turned the
>> user mappings to read only. This is what I do here but
>> instead of write-protecting I just unmap because it is
>> easier for me to code it.
> 
> That doesn't mean it is the correct solution.

Please note that even if we properly .fsync cachlines the page-faults
are orthogonal to this. There is no point in making mmapped dax pages
read-only after every .fsync and pay a page-fault. We should leave them
mapped has is. The only place that we need page protection is at freeze
time.

But I see that we might have a problem with .fsync not being called.
I see that you sent a second mail. I'll try to answer there.

> Cheers,
> Dave.

Thanks
Boaz


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 3/3] RFC: dax: dax_prepare_freeze
@ 2015-03-25 10:19               ` Boaz Harrosh
  0 siblings, 0 replies; 36+ messages in thread
From: Boaz Harrosh @ 2015-03-25 10:19 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Matthew Wilcox, Andrew Morton, Kirill A. Shutemov, Jan Kara,
	Hugh Dickins, Mel Gorman, linux-mm, linux-nvdimm, linux-fsdevel,
	Eryu Guan

On 03/25/2015 11:29 AM, Dave Chinner wrote:
> On Wed, Mar 25, 2015 at 10:10:31AM +0200, Boaz Harrosh wrote:
>> On 03/25/2015 04:22 AM, Dave Chinner wrote:
>>> On Tue, Mar 24, 2015 at 08:14:59AM +0200, Boaz Harrosh wrote:
>> <>
<>
>> The sync does happen, .fsync of the FS is called on each
>> file just as if the user called it. If this is broken it just
>> needs to be fixed there at the .fsync vector. POSIX mandate
>> persistence at .fsync so at the vfs layer we rely on that.
> 
> right now, the filesystems will see that there are no dirty pages
> on the inode, and then just sync the inode metadata. They will do
> nothing else as filesystems are not aware of CPU cachelines at all.
> 

Sigh yes. There is this bug. And I am sitting on a wide fix for this.

The strategy is. All Kernel writes are done with a new copy_user_nt.
NT stands for none-temporal. This shows 20% improvements since cachelines
need not be fetched when written too.

The arches that do not have NT instructions, will use a generic
copy_user_nt that does a copy_user and then flush cashes.
Same flush cashes we do before DMA IO. (effectively every 4k)
[Its more complicated with the edges and all, by I have solved
 all this. Will post in a week or two]

So what is left is the mmaped inodes. The logic here is that
at .fsync vector dax inodes will do a cl_flush only if mapping_mapped()
is true. Also .msync is the same as .fsync

And one last thing we also call .fsync at vm_operations_struct->close
because it is allowed for an app to do mmap, munmap, .fsync so we just
call dax .fsync at munmap always.

So by now we should be covered for fsync guaranty.

>> So everything at this stage should be synced to real media.
> 
> Actually no. This is what intel are introducing new CPU instructions
> for - so fsync can flush the cpu caches and commit them to th
> persistence domain correctly.
> 

The new intel instructions are for an optimization, and they will
fit in the picture for the CPUs that have it. But there are already
NT instructions for existing CPUs. (Just not as fast and precise)

Every ARCH will do its best under a small API
	copy_user_nt	- data is at media
	memset_nt	- data is at media
	cl_flush	- partial written cachelines flushed to media 
	sfence		- New data seen by all CPUs

>> What does not happen is writeback. since dax does not have
>> any writeback.
> 
> Which is precisely the problem we need to address - we don't need
> writeback to a block device, but we do need the dirty CPU cachelines
> flushed and the mappings cleaned.
> 

I see what you mean. Since nothing dirtied the inode then above
.fsync will not be called and we have not pushed mmap data to
media.

Again here we only need to do this for mmaped inodes, because
Kernel written data is (will be) written NT style.

>> And because of that nothing turned the
>> user mappings to read only. This is what I do here but
>> instead of write-protecting I just unmap because it is
>> easier for me to code it.
> 
> That doesn't mean it is the correct solution.

Please note that even if we properly .fsync cachlines the page-faults
are orthogonal to this. There is no point in making mmapped dax pages
read-only after every .fsync and pay a page-fault. We should leave them
mapped has is. The only place that we need page protection is at freeze
time.

But I see that we might have a problem with .fsync not being called.
I see that you sent a second mail. I'll try to answer there.

> Cheers,
> Dave.

Thanks
Boaz

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 3/3] RFC: dax: dax_prepare_freeze
  2015-03-25  9:41           ` Dave Chinner
@ 2015-03-25 10:40             ` Boaz Harrosh
  -1 siblings, 0 replies; 36+ messages in thread
From: Boaz Harrosh @ 2015-03-25 10:40 UTC (permalink / raw)
  To: Dave Chinner, Boaz Harrosh
  Cc: Matthew Wilcox, Andrew Morton, Kirill A. Shutemov, Jan Kara,
	Hugh Dickins, Mel Gorman, linux-mm, linux-nvdimm, linux-fsdevel,
	Eryu Guan

On 03/25/2015 11:41 AM, Dave Chinner wrote:
> On Wed, Mar 25, 2015 at 10:31:22AM +0200, Boaz Harrosh wrote:
>> On 03/25/2015 04:26 AM, Dave Chinner wrote:
<>
>> sync and fsync should and will work correctly, but this does not
>> solve our problem. because what turns pages to read-only is the
>> writeback. And we do not have this in dax. Therefore we need to
>> do this here as a special case.
> 
> We can still use exactly the same dirty tracking as we use for data
> writeback. The difference is that we don't need to go through all
> teh page writeback; we can just flush the CPU caches and mark all
> the mappings clean, then clear the I_DIRTY_PAGES flag and move on to
> inode writeback....
> 

I see what you mean. the sb wide sync will not step into mmaped inodes
and fsync them.

If we go my way and write NT (None Temporal) style in Kernel.
NT instructions exist since xeon and all the Intel iX core CPUs have
them. In tests we conducted doing xeon NT-writes vs
regular-writes-and-cl_flush at .fsync showed minimum of 20% improvement.
That is on very large IOs. On 4k IOs it was even better.

It looks like you have a much better picture in your mind how to
fit this properly at the inode-dirty picture. Can you attempt a rough draft?

If we are going the NT way. Then we can only I_DIRTY_ track the mmaped
inodes. For me this is really scary because I do not want to trigger
any writeback threads. If you could please draw me an outline (or write
something up ;-)) it would be great.

> Cheers,
> Dave.

Thanks
Boaz



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 3/3] RFC: dax: dax_prepare_freeze
@ 2015-03-25 10:40             ` Boaz Harrosh
  0 siblings, 0 replies; 36+ messages in thread
From: Boaz Harrosh @ 2015-03-25 10:40 UTC (permalink / raw)
  To: Dave Chinner, Boaz Harrosh
  Cc: Matthew Wilcox, Andrew Morton, Kirill A. Shutemov, Jan Kara,
	Hugh Dickins, Mel Gorman, linux-mm, linux-nvdimm, linux-fsdevel,
	Eryu Guan

On 03/25/2015 11:41 AM, Dave Chinner wrote:
> On Wed, Mar 25, 2015 at 10:31:22AM +0200, Boaz Harrosh wrote:
>> On 03/25/2015 04:26 AM, Dave Chinner wrote:
<>
>> sync and fsync should and will work correctly, but this does not
>> solve our problem. because what turns pages to read-only is the
>> writeback. And we do not have this in dax. Therefore we need to
>> do this here as a special case.
> 
> We can still use exactly the same dirty tracking as we use for data
> writeback. The difference is that we don't need to go through all
> teh page writeback; we can just flush the CPU caches and mark all
> the mappings clean, then clear the I_DIRTY_PAGES flag and move on to
> inode writeback....
> 

I see what you mean. the sb wide sync will not step into mmaped inodes
and fsync them.

If we go my way and write NT (None Temporal) style in Kernel.
NT instructions exist since xeon and all the Intel iX core CPUs have
them. In tests we conducted doing xeon NT-writes vs
regular-writes-and-cl_flush at .fsync showed minimum of 20% improvement.
That is on very large IOs. On 4k IOs it was even better.

It looks like you have a much better picture in your mind how to
fit this properly at the inode-dirty picture. Can you attempt a rough draft?

If we are going the NT way. Then we can only I_DIRTY_ track the mmaped
inodes. For me this is really scary because I do not want to trigger
any writeback threads. If you could please draw me an outline (or write
something up ;-)) it would be great.

> Cheers,
> Dave.

Thanks
Boaz


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 3/3] RFC: dax: dax_prepare_freeze
  2015-03-25 10:19               ` Boaz Harrosh
@ 2015-03-25 20:00                 ` Dave Chinner
  -1 siblings, 0 replies; 36+ messages in thread
From: Dave Chinner @ 2015-03-25 20:00 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Matthew Wilcox, Andrew Morton, Kirill A. Shutemov, Jan Kara,
	Hugh Dickins, Mel Gorman, linux-mm, linux-nvdimm, linux-fsdevel,
	Eryu Guan

On Wed, Mar 25, 2015 at 12:19:50PM +0200, Boaz Harrosh wrote:
> On 03/25/2015 11:29 AM, Dave Chinner wrote:
> > On Wed, Mar 25, 2015 at 10:10:31AM +0200, Boaz Harrosh wrote:
> >> On 03/25/2015 04:22 AM, Dave Chinner wrote:
> >>> On Tue, Mar 24, 2015 at 08:14:59AM +0200, Boaz Harrosh wrote:
> >> <>
> <>
> >> The sync does happen, .fsync of the FS is called on each
> >> file just as if the user called it. If this is broken it just
> >> needs to be fixed there at the .fsync vector. POSIX mandate
> >> persistence at .fsync so at the vfs layer we rely on that.
> > 
> > right now, the filesystems will see that there are no dirty pages
> > on the inode, and then just sync the inode metadata. They will do
> > nothing else as filesystems are not aware of CPU cachelines at all.
> > 
> 
> Sigh yes. There is this bug. And I am sitting on a wide fix for this.
> 
> The strategy is. All Kernel writes are done with a new copy_user_nt.
> NT stands for none-temporal. This shows 20% improvements since cachelines
> need not be fetched when written too.

That's unenforcable for mmap writes from userspace. And those are
the writes that will trigger the dirty write mapping problem.

> >> And because of that nothing turned the
> >> user mappings to read only. This is what I do here but
> >> instead of write-protecting I just unmap because it is
> >> easier for me to code it.
> > 
> > That doesn't mean it is the correct solution.
> 
> Please note that even if we properly .fsync cachlines the page-faults
> are orthogonal to this. There is no point in making mmapped dax pages
> read-only after every .fsync and pay a page-fault. We should leave them
> mapped has is. The only place that we need page protection is at freeze
> time.

Actually, current behaviour of filesystems is that fsync cleans all
the pages in the range, and means all the mappings are marked
read-only and so we get new calls into .page_mkwrite when write
faults occur. We need that .page_mkwrite call to be able to a)
update the mtime of the inode, and b) mark the inode "data dirty" so
that fsync knows it needs to do something....

Hence I'd much prefer we start with identical behaviour to normal
files, then we can optimise from a sane start point when write page
faults show up as a performance problem. i.e. Correctness first,
performance second.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 3/3] RFC: dax: dax_prepare_freeze
@ 2015-03-25 20:00                 ` Dave Chinner
  0 siblings, 0 replies; 36+ messages in thread
From: Dave Chinner @ 2015-03-25 20:00 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Matthew Wilcox, Andrew Morton, Kirill A. Shutemov, Jan Kara,
	Hugh Dickins, Mel Gorman, linux-mm, linux-nvdimm, linux-fsdevel,
	Eryu Guan

On Wed, Mar 25, 2015 at 12:19:50PM +0200, Boaz Harrosh wrote:
> On 03/25/2015 11:29 AM, Dave Chinner wrote:
> > On Wed, Mar 25, 2015 at 10:10:31AM +0200, Boaz Harrosh wrote:
> >> On 03/25/2015 04:22 AM, Dave Chinner wrote:
> >>> On Tue, Mar 24, 2015 at 08:14:59AM +0200, Boaz Harrosh wrote:
> >> <>
> <>
> >> The sync does happen, .fsync of the FS is called on each
> >> file just as if the user called it. If this is broken it just
> >> needs to be fixed there at the .fsync vector. POSIX mandate
> >> persistence at .fsync so at the vfs layer we rely on that.
> > 
> > right now, the filesystems will see that there are no dirty pages
> > on the inode, and then just sync the inode metadata. They will do
> > nothing else as filesystems are not aware of CPU cachelines at all.
> > 
> 
> Sigh yes. There is this bug. And I am sitting on a wide fix for this.
> 
> The strategy is. All Kernel writes are done with a new copy_user_nt.
> NT stands for none-temporal. This shows 20% improvements since cachelines
> need not be fetched when written too.

That's unenforcable for mmap writes from userspace. And those are
the writes that will trigger the dirty write mapping problem.

> >> And because of that nothing turned the
> >> user mappings to read only. This is what I do here but
> >> instead of write-protecting I just unmap because it is
> >> easier for me to code it.
> > 
> > That doesn't mean it is the correct solution.
> 
> Please note that even if we properly .fsync cachlines the page-faults
> are orthogonal to this. There is no point in making mmapped dax pages
> read-only after every .fsync and pay a page-fault. We should leave them
> mapped has is. The only place that we need page protection is at freeze
> time.

Actually, current behaviour of filesystems is that fsync cleans all
the pages in the range, and means all the mappings are marked
read-only and so we get new calls into .page_mkwrite when write
faults occur. We need that .page_mkwrite call to be able to a)
update the mtime of the inode, and b) mark the inode "data dirty" so
that fsync knows it needs to do something....

Hence I'd much prefer we start with identical behaviour to normal
files, then we can optimise from a sane start point when write page
faults show up as a performance problem. i.e. Correctness first,
performance second.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 3/3] RFC: dax: dax_prepare_freeze
  2015-03-25 10:40             ` Boaz Harrosh
@ 2015-03-25 20:05               ` Dave Chinner
  -1 siblings, 0 replies; 36+ messages in thread
From: Dave Chinner @ 2015-03-25 20:05 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Matthew Wilcox, Andrew Morton, Kirill A. Shutemov, Jan Kara,
	Hugh Dickins, Mel Gorman, linux-mm, linux-nvdimm, linux-fsdevel,
	Eryu Guan

On Wed, Mar 25, 2015 at 12:40:44PM +0200, Boaz Harrosh wrote:
> On 03/25/2015 11:41 AM, Dave Chinner wrote:
> > On Wed, Mar 25, 2015 at 10:31:22AM +0200, Boaz Harrosh wrote:
> >> On 03/25/2015 04:26 AM, Dave Chinner wrote:
> <>
> >> sync and fsync should and will work correctly, but this does not
> >> solve our problem. because what turns pages to read-only is the
> >> writeback. And we do not have this in dax. Therefore we need to
> >> do this here as a special case.
> > 
> > We can still use exactly the same dirty tracking as we use for data
> > writeback. The difference is that we don't need to go through all
> > teh page writeback; we can just flush the CPU caches and mark all
> > the mappings clean, then clear the I_DIRTY_PAGES flag and move on to
> > inode writeback....
> > 
> 
> I see what you mean. the sb wide sync will not step into mmaped inodes
> and fsync them.
> 
> If we go my way and write NT (None Temporal) style in Kernel.
> NT instructions exist since xeon and all the Intel iX core CPUs have
> them. In tests we conducted doing xeon NT-writes vs
> regular-writes-and-cl_flush at .fsync showed minimum of 20% improvement.
> That is on very large IOs. On 4k IOs it was even better.

As I said before, relying on specific instructions is a non-starter
for mmap writes, and that's the problem we need to solve here.

> It looks like you have a much better picture in your mind how to
> fit this properly at the inode-dirty picture. Can you attempt a rough draft?
> 
> If we are going the NT way. Then we can only I_DIRTY_ track the mmaped
> inodes. For me this is really scary because I do not want to trigger
> any writeback threads. If you could please draw me an outline (or write
> something up ;-)) it would be great.

Writeback threads are not used for fsync - they are used for sync
and background writeback. They are already active on DAX filesystems
that track dirty inodes on the VFS superblock, as this is the way
inodes are written back on some filesystems.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 3/3] RFC: dax: dax_prepare_freeze
@ 2015-03-25 20:05               ` Dave Chinner
  0 siblings, 0 replies; 36+ messages in thread
From: Dave Chinner @ 2015-03-25 20:05 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Matthew Wilcox, Andrew Morton, Kirill A. Shutemov, Jan Kara,
	Hugh Dickins, Mel Gorman, linux-mm, linux-nvdimm, linux-fsdevel,
	Eryu Guan

On Wed, Mar 25, 2015 at 12:40:44PM +0200, Boaz Harrosh wrote:
> On 03/25/2015 11:41 AM, Dave Chinner wrote:
> > On Wed, Mar 25, 2015 at 10:31:22AM +0200, Boaz Harrosh wrote:
> >> On 03/25/2015 04:26 AM, Dave Chinner wrote:
> <>
> >> sync and fsync should and will work correctly, but this does not
> >> solve our problem. because what turns pages to read-only is the
> >> writeback. And we do not have this in dax. Therefore we need to
> >> do this here as a special case.
> > 
> > We can still use exactly the same dirty tracking as we use for data
> > writeback. The difference is that we don't need to go through all
> > teh page writeback; we can just flush the CPU caches and mark all
> > the mappings clean, then clear the I_DIRTY_PAGES flag and move on to
> > inode writeback....
> > 
> 
> I see what you mean. the sb wide sync will not step into mmaped inodes
> and fsync them.
> 
> If we go my way and write NT (None Temporal) style in Kernel.
> NT instructions exist since xeon and all the Intel iX core CPUs have
> them. In tests we conducted doing xeon NT-writes vs
> regular-writes-and-cl_flush at .fsync showed minimum of 20% improvement.
> That is on very large IOs. On 4k IOs it was even better.

As I said before, relying on specific instructions is a non-starter
for mmap writes, and that's the problem we need to solve here.

> It looks like you have a much better picture in your mind how to
> fit this properly at the inode-dirty picture. Can you attempt a rough draft?
> 
> If we are going the NT way. Then we can only I_DIRTY_ track the mmaped
> inodes. For me this is really scary because I do not want to trigger
> any writeback threads. If you could please draw me an outline (or write
> something up ;-)) it would be great.

Writeback threads are not used for fsync - they are used for sync
and background writeback. They are already active on DAX filesystems
that track dirty inodes on the VFS superblock, as this is the way
inodes are written back on some filesystems.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 3/3] RFC: dax: dax_prepare_freeze
  2015-03-25 20:00                 ` Dave Chinner
  (?)
@ 2015-03-26  8:02                 ` Boaz Harrosh
  2015-03-26 20:58                     ` Dave Chinner
  -1 siblings, 1 reply; 36+ messages in thread
From: Boaz Harrosh @ 2015-03-26  8:02 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Matthew Wilcox, Andrew Morton, Kirill A. Shutemov, Jan Kara,
	Hugh Dickins, Mel Gorman, linux-mm, linux-nvdimm, linux-fsdevel,
	Eryu Guan

On 03/25/2015 10:00 PM, Dave Chinner wrote:
> On Wed, Mar 25, 2015 at 12:19:50PM +0200, Boaz Harrosh wrote:
>> On 03/25/2015 11:29 AM, Dave Chinner wrote:
>>> On Wed, Mar 25, 2015 at 10:10:31AM +0200, Boaz Harrosh wrote:
>>>> On 03/25/2015 04:22 AM, Dave Chinner wrote:
>>>>> On Tue, Mar 24, 2015 at 08:14:59AM +0200, Boaz Harrosh wrote:
>>>> <>
>> <>
>>>> The sync does happen, .fsync of the FS is called on each
>>>> file just as if the user called it. If this is broken it just
>>>> needs to be fixed there at the .fsync vector. POSIX mandate
>>>> persistence at .fsync so at the vfs layer we rely on that.
>>>
>>> right now, the filesystems will see that there are no dirty pages
>>> on the inode, and then just sync the inode metadata. They will do
>>> nothing else as filesystems are not aware of CPU cachelines at all.
>>>
>>
>> Sigh yes. There is this bug. And I am sitting on a wide fix for this.
>>
>> The strategy is. All Kernel writes are done with a new copy_user_nt.
>> NT stands for none-temporal. This shows 20% improvements since cachelines
>> need not be fetched when written too.
> 
> That's unenforcable for mmap writes from userspace. And those are
> the writes that will trigger the dirty write mapping problem.
> 

So for them I was thinking of just doing the .fsync on every
unmap (ie vm_operations_struct->close)

So now we know that only inodes that have an active vm mapping
are in need of sync.

>>>> And because of that nothing turned the
>>>> user mappings to read only. This is what I do here but
>>>> instead of write-protecting I just unmap because it is
>>>> easier for me to code it.
>>>
>>> That doesn't mean it is the correct solution.
>>
>> Please note that even if we properly .fsync cachlines the page-faults
>> are orthogonal to this. There is no point in making mmapped dax pages
>> read-only after every .fsync and pay a page-fault. We should leave them
>> mapped has is. The only place that we need page protection is at freeze
>> time.
> 
> Actually, current behaviour of filesystems is that fsync cleans all
> the pages in the range, and means all the mappings are marked
> read-only and so we get new calls into .page_mkwrite when write
> faults occur. We need that .page_mkwrite call to be able to a)
> update the mtime of the inode, and b) mark the inode "data dirty" so
> that fsync knows it needs to do something....
> 
> Hence I'd much prefer we start with identical behaviour to normal
> files, then we can optimise from a sane start point when write page
> faults show up as a performance problem. i.e. Correctness first,
> performance second.
> 

OK, (you see when you speak slow I understand fast ;-)). I agree then
I'll see if I can schedule some time for this. My boss will be very
angry with me about this. But I will need help please, and some hands
holding. Unless someone else volunteers to work on this ?

> Cheers,
> Dave.
> 

Thanks
Boaz

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 3/3] RFC: dax: dax_prepare_freeze
  2015-03-26  8:02                 ` Boaz Harrosh
@ 2015-03-26 20:58                     ` Dave Chinner
  0 siblings, 0 replies; 36+ messages in thread
From: Dave Chinner @ 2015-03-26 20:58 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Matthew Wilcox, Andrew Morton, Kirill A. Shutemov, Jan Kara,
	Hugh Dickins, Mel Gorman, linux-mm, linux-nvdimm, linux-fsdevel,
	Eryu Guan

On Thu, Mar 26, 2015 at 10:02:09AM +0200, Boaz Harrosh wrote:
> On 03/25/2015 10:00 PM, Dave Chinner wrote:
> > On Wed, Mar 25, 2015 at 12:19:50PM +0200, Boaz Harrosh wrote:
> >> On 03/25/2015 11:29 AM, Dave Chinner wrote:
> >>> On Wed, Mar 25, 2015 at 10:10:31AM +0200, Boaz Harrosh wrote:
> >>>> On 03/25/2015 04:22 AM, Dave Chinner wrote:
> >>>>> On Tue, Mar 24, 2015 at 08:14:59AM +0200, Boaz Harrosh wrote:
> >>>> <>
> >> <>
> >>>> The sync does happen, .fsync of the FS is called on each
> >>>> file just as if the user called it. If this is broken it just
> >>>> needs to be fixed there at the .fsync vector. POSIX mandate
> >>>> persistence at .fsync so at the vfs layer we rely on that.
> >>>
> >>> right now, the filesystems will see that there are no dirty pages
> >>> on the inode, and then just sync the inode metadata. They will do
> >>> nothing else as filesystems are not aware of CPU cachelines at all.
> >>>
> >>
> >> Sigh yes. There is this bug. And I am sitting on a wide fix for this.
> >>
> >> The strategy is. All Kernel writes are done with a new copy_user_nt.
> >> NT stands for none-temporal. This shows 20% improvements since cachelines
> >> need not be fetched when written too.
> > 
> > That's unenforcable for mmap writes from userspace. And those are
> > the writes that will trigger the dirty write mapping problem.
> > 
> 
> So for them I was thinking of just doing the .fsync on every
> unmap (ie vm_operations_struct->close)

That is not necessary, I think - it can be handled by the background
writeback thread just fine.

> So now we know that only inodes that have an active vm mapping
> are in need of sync.

Easy enough.

> >> Please note that even if we properly .fsync cachlines the page-faults
> >> are orthogonal to this. There is no point in making mmapped dax pages
> >> read-only after every .fsync and pay a page-fault. We should leave them
> >> mapped has is. The only place that we need page protection is at freeze
> >> time.
> > 
> > Actually, current behaviour of filesystems is that fsync cleans all
> > the pages in the range, and means all the mappings are marked
> > read-only and so we get new calls into .page_mkwrite when write
> > faults occur. We need that .page_mkwrite call to be able to a)
> > update the mtime of the inode, and b) mark the inode "data dirty" so
> > that fsync knows it needs to do something....
> > 
> > Hence I'd much prefer we start with identical behaviour to normal
> > files, then we can optimise from a sane start point when write page
> > faults show up as a performance problem. i.e. Correctness first,
> > performance second.
> 
> OK, (you see when you speak slow I understand fast ;-)). I agree then
> I'll see if I can schedule some time for this. My boss will be very
> angry with me about this. But I will need help please, and some hands
> holding. Unless someone else volunteers to work on this ?

It's not hard - you should be able to make somethign work from the
untested, uncompiled skeleton below....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

dax: hack in dirty mapping tracking to fsync/sync/writeback

Not-signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/dax.c            | 27 ++++++++++++++++++++++++++-
 fs/xfs/xfs_file.c   |  2 ++
 mm/page-writeback.c |  5 +++++
 3 files changed, 33 insertions(+), 1 deletion(-)

diff --git a/fs/dax.c b/fs/dax.c
index 0121f7d..61cbd76 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -27,6 +27,29 @@
 #include <linux/uio.h>
 #include <linux/vmstat.h>
 
+/*
+ * flush the mapping to the persistent domain within the byte range of (start,
+ * end). This is required by data integrity operations to ensure file data is on
+ * persistent storage prior to completion of the operation. It also requires us
+ * to clean the mappings (i.e. write -> RO) so that we'll get a new fault when
+ * the file is written to again so wehave an indication that we need to flush
+ * the mapping if a data integrity operation takes place.
+ *
+ * We don't need commits to storage here - the filesystems will issue flushes
+ * appropriately at the conclusion of the data integrity operation via REQ_FUA
+ * writes or blkdev_issue_flush() commands.  This requires the DAX block device
+ * to implement persistent storage domain fencing/commits on receiving a
+ * REQ_FLUSH or REQ_FUA request so that this works as expected by the higher
+ * layers.
+ */
+int dax_flush_mapping(struct address_space *mapping, loff_t start, loff_t end)
+{
+	/* XXX: make ptes in range clean */
+
+	/* XXX: flush CPU caches  */
+	return 0;
+}
+
 int dax_clear_blocks(struct inode *inode, sector_t block, long size)
 {
 	struct block_device *bdev = inode->i_sb->s_bdev;
@@ -472,8 +495,10 @@ int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 		file_update_time(vma->vm_file);
 	}
 	result = __dax_fault(vma, vmf, get_block, complete_unwritten);
-	if (vmf->flags & FAULT_FLAG_WRITE)
+	if (vmf->flags & FAULT_FLAG_WRITE) {
+		__mark_inode_dirty(file_inode(vma->vm_file, I_DIRTY_PAGES);
 		sb_end_pagefault(sb);
+	}
 
 	return result;
 }
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 8017175..43e6c8e 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1453,6 +1453,8 @@ xfs_filemap_page_mkwrite(
 	}
 
 	xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
+	if (IS_DAX(inode))
+		__mark_inode_dirty(file_inode(vma->vm_file, I_DIRTY_PAGES);
 	sb_end_pagefault(inode->i_sb);
 
 	return ret;
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 45e187b..aa2fa76 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2029,6 +2029,11 @@ int do_writepages(struct address_space *mapping, struct writeback_control *wbc)
 
 	if (wbc->nr_to_write <= 0)
 		return 0;
+
+	if (wbc->sync == WB_SYNC_ALL && IS_DAX(mapping->host))
+		return dax_flush_mapping(mapping, wbc->range_start,
+						  wbc->range_end);
+
 	if (mapping->a_ops->writepages)
 		ret = mapping->a_ops->writepages(mapping, wbc);
 	else

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH 3/3] RFC: dax: dax_prepare_freeze
@ 2015-03-26 20:58                     ` Dave Chinner
  0 siblings, 0 replies; 36+ messages in thread
From: Dave Chinner @ 2015-03-26 20:58 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Matthew Wilcox, Andrew Morton, Kirill A. Shutemov, Jan Kara,
	Hugh Dickins, Mel Gorman, linux-mm, linux-nvdimm, linux-fsdevel,
	Eryu Guan

On Thu, Mar 26, 2015 at 10:02:09AM +0200, Boaz Harrosh wrote:
> On 03/25/2015 10:00 PM, Dave Chinner wrote:
> > On Wed, Mar 25, 2015 at 12:19:50PM +0200, Boaz Harrosh wrote:
> >> On 03/25/2015 11:29 AM, Dave Chinner wrote:
> >>> On Wed, Mar 25, 2015 at 10:10:31AM +0200, Boaz Harrosh wrote:
> >>>> On 03/25/2015 04:22 AM, Dave Chinner wrote:
> >>>>> On Tue, Mar 24, 2015 at 08:14:59AM +0200, Boaz Harrosh wrote:
> >>>> <>
> >> <>
> >>>> The sync does happen, .fsync of the FS is called on each
> >>>> file just as if the user called it. If this is broken it just
> >>>> needs to be fixed there at the .fsync vector. POSIX mandate
> >>>> persistence at .fsync so at the vfs layer we rely on that.
> >>>
> >>> right now, the filesystems will see that there are no dirty pages
> >>> on the inode, and then just sync the inode metadata. They will do
> >>> nothing else as filesystems are not aware of CPU cachelines at all.
> >>>
> >>
> >> Sigh yes. There is this bug. And I am sitting on a wide fix for this.
> >>
> >> The strategy is. All Kernel writes are done with a new copy_user_nt.
> >> NT stands for none-temporal. This shows 20% improvements since cachelines
> >> need not be fetched when written too.
> > 
> > That's unenforcable for mmap writes from userspace. And those are
> > the writes that will trigger the dirty write mapping problem.
> > 
> 
> So for them I was thinking of just doing the .fsync on every
> unmap (ie vm_operations_struct->close)

That is not necessary, I think - it can be handled by the background
writeback thread just fine.

> So now we know that only inodes that have an active vm mapping
> are in need of sync.

Easy enough.

> >> Please note that even if we properly .fsync cachlines the page-faults
> >> are orthogonal to this. There is no point in making mmapped dax pages
> >> read-only after every .fsync and pay a page-fault. We should leave them
> >> mapped has is. The only place that we need page protection is at freeze
> >> time.
> > 
> > Actually, current behaviour of filesystems is that fsync cleans all
> > the pages in the range, and means all the mappings are marked
> > read-only and so we get new calls into .page_mkwrite when write
> > faults occur. We need that .page_mkwrite call to be able to a)
> > update the mtime of the inode, and b) mark the inode "data dirty" so
> > that fsync knows it needs to do something....
> > 
> > Hence I'd much prefer we start with identical behaviour to normal
> > files, then we can optimise from a sane start point when write page
> > faults show up as a performance problem. i.e. Correctness first,
> > performance second.
> 
> OK, (you see when you speak slow I understand fast ;-)). I agree then
> I'll see if I can schedule some time for this. My boss will be very
> angry with me about this. But I will need help please, and some hands
> holding. Unless someone else volunteers to work on this ?

It's not hard - you should be able to make somethign work from the
untested, uncompiled skeleton below....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

dax: hack in dirty mapping tracking to fsync/sync/writeback

Not-signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/dax.c            | 27 ++++++++++++++++++++++++++-
 fs/xfs/xfs_file.c   |  2 ++
 mm/page-writeback.c |  5 +++++
 3 files changed, 33 insertions(+), 1 deletion(-)

diff --git a/fs/dax.c b/fs/dax.c
index 0121f7d..61cbd76 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -27,6 +27,29 @@
 #include <linux/uio.h>
 #include <linux/vmstat.h>
 
+/*
+ * flush the mapping to the persistent domain within the byte range of (start,
+ * end). This is required by data integrity operations to ensure file data is on
+ * persistent storage prior to completion of the operation. It also requires us
+ * to clean the mappings (i.e. write -> RO) so that we'll get a new fault when
+ * the file is written to again so wehave an indication that we need to flush
+ * the mapping if a data integrity operation takes place.
+ *
+ * We don't need commits to storage here - the filesystems will issue flushes
+ * appropriately at the conclusion of the data integrity operation via REQ_FUA
+ * writes or blkdev_issue_flush() commands.  This requires the DAX block device
+ * to implement persistent storage domain fencing/commits on receiving a
+ * REQ_FLUSH or REQ_FUA request so that this works as expected by the higher
+ * layers.
+ */
+int dax_flush_mapping(struct address_space *mapping, loff_t start, loff_t end)
+{
+	/* XXX: make ptes in range clean */
+
+	/* XXX: flush CPU caches  */
+	return 0;
+}
+
 int dax_clear_blocks(struct inode *inode, sector_t block, long size)
 {
 	struct block_device *bdev = inode->i_sb->s_bdev;
@@ -472,8 +495,10 @@ int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 		file_update_time(vma->vm_file);
 	}
 	result = __dax_fault(vma, vmf, get_block, complete_unwritten);
-	if (vmf->flags & FAULT_FLAG_WRITE)
+	if (vmf->flags & FAULT_FLAG_WRITE) {
+		__mark_inode_dirty(file_inode(vma->vm_file, I_DIRTY_PAGES);
 		sb_end_pagefault(sb);
+	}
 
 	return result;
 }
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 8017175..43e6c8e 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1453,6 +1453,8 @@ xfs_filemap_page_mkwrite(
 	}
 
 	xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
+	if (IS_DAX(inode))
+		__mark_inode_dirty(file_inode(vma->vm_file, I_DIRTY_PAGES);
 	sb_end_pagefault(inode->i_sb);
 
 	return ret;
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 45e187b..aa2fa76 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2029,6 +2029,11 @@ int do_writepages(struct address_space *mapping, struct writeback_control *wbc)
 
 	if (wbc->nr_to_write <= 0)
 		return 0;
+
+	if (wbc->sync == WB_SYNC_ALL && IS_DAX(mapping->host))
+		return dax_flush_mapping(mapping, wbc->range_start,
+						  wbc->range_end);
+
 	if (mapping->a_ops->writepages)
 		ret = mapping->a_ops->writepages(mapping, wbc);
 	else

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2015-03-26 20:58 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-03-23 12:47 [PATCH 0/3 v3] dax: Fix mmap-write not updating c/mtime Boaz Harrosh
2015-03-23 12:47 ` Boaz Harrosh
2015-03-23 12:49 ` [PATCH 1/3] mm: New pfn_mkwrite same as page_mkwrite for VM_PFNMAP Boaz Harrosh
2015-03-23 22:49   ` Andrew Morton
2015-03-23 22:49     ` Andrew Morton
2015-03-23 12:52 ` [PATCH 2/3] dax: use pfn_mkwrite to update c/mtime + freeze protection Boaz Harrosh
2015-03-23 12:54 ` [PATCH 3/3] RFC: dax: dax_prepare_freeze Boaz Harrosh
2015-03-23 22:40   ` Dave Chinner
2015-03-23 22:40     ` Dave Chinner
2015-03-24  6:14     ` Boaz Harrosh
2015-03-24  6:14       ` Boaz Harrosh
2015-03-25  2:22       ` Dave Chinner
2015-03-25  2:22         ` Dave Chinner
2015-03-25  8:10         ` Boaz Harrosh
2015-03-25  9:29           ` Dave Chinner
2015-03-25  9:29             ` Dave Chinner
2015-03-25 10:19             ` Boaz Harrosh
2015-03-25 10:19               ` Boaz Harrosh
2015-03-25 20:00               ` Dave Chinner
2015-03-25 20:00                 ` Dave Chinner
2015-03-26  8:02                 ` Boaz Harrosh
2015-03-26 20:58                   ` Dave Chinner
2015-03-26 20:58                     ` Dave Chinner
2015-03-24 12:37   ` Boaz Harrosh
2015-03-24 12:37     ` Boaz Harrosh
2015-03-25  2:26     ` Dave Chinner
2015-03-25  2:26       ` Dave Chinner
2015-03-25  8:31       ` Boaz Harrosh
2015-03-25  8:31         ` Boaz Harrosh
2015-03-25  9:41         ` Dave Chinner
2015-03-25  9:41           ` Dave Chinner
2015-03-25 10:40           ` Boaz Harrosh
2015-03-25 10:40             ` Boaz Harrosh
2015-03-25 20:05             ` Dave Chinner
2015-03-25 20:05               ` Dave Chinner
2015-03-23 12:56 ` [PATCH v4] xfstest: generic/080 test that mmap-write updates c/mtime Boaz Harrosh

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.