All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/3 v4] dax: some dax fixes and cleanups
@ 2015-03-25 13:34 ` Boaz Harrosh
  0 siblings, 0 replies; 16+ messages in thread
From: Boaz Harrosh @ 2015-03-25 13:34 UTC (permalink / raw)
  To: Dave Chinner, Matthew Wilcox, Andrew Morton, Kirill A. Shutemov,
	Jan Kara, Hugh Dickins, Mel Gorman, linux-mm, linux-nvdimm,
	linux-fsdevel, Eryu Guan

Hi

[v4] dax: some dax fixes and cleanups
* First patch fixed according to Andrew's comments. Thanks Andrew.
  1st and 2nd patch can go into current Kernel as they fix something
  that was merged this release.
* Added a new patch to fix up splice in the dax case, and cleanup.
  This one can wait for 4.1 (Also the first two not that anyone uses dax
  in production.)
* DAX freeze is not fixed yet. As we have more problems then I originally
  hoped for, as pointed out by Dave.
  (Just as a referance I'm sending a NO-GOOD additional patch to show what
   is not good enough to do. Was the RFC of [v3])
* Not re-posting the xfstest Dave please pick this up (It already found bugs
  in none dax FSs)

[v3] dax: Fix mmap-write not updating c/mtime
* I'm re-posting the two DAX patches that fix the mmap-write after read
  problem with DAX. (No changes since [v2])
* I'm also posting a 3rd RFC patch to address what Jan said about fs_freeze
  and making mapping read-only. 
  Jan Please review and see if this is what you meant.

[v2]
Jan Kara has pointed out that if we add the
sb_start/end_pagefault pair in the new pfn_mkwrite we
are then fixing another bug where: A user could start
writing to the page while filesystem is frozen.

[v1]
The main problem is that current mm/memory.c will no call us with page_mkwrite
if we do not have an actual page mapping, which is what DAX uses.
The solution presented here introduces a new pfn_mkwrite to solve this problem.
Please see patch-2 for details.

I've been running with this patch for 4 month both HW and VMs with no apparent
danger, but see patch-1 I played it safe.

I am also posting an xfstest 080 that demonstrate this problem, I believe
that also some git operations (can't remember which) suffer from this problem.
Actually Eryu Guan found that this test fails on some other FS as well.

List of patches:
 [PATCH 1/3] mm: New pfn_mkwrite same as page_mkwrite for VM_PFNMAP
 [PATCH 2/3] dax: use pfn_mkwrite to update c/mtime + freeze
 [PATCH 3/3] dax: Unify ext2/4_{dax,}_file_operations

 [PATCH] NOTGOOD: dax: dax_prepare_freeze

Andrew hi
I believe this needs to eventually go through your tree. Please pick it
up when you feel it is ready. I believe all 3 are ready and fix real
bugs.

Matthew hi
I would love to have your ACK on these patches?

Thanks
Boaz

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH 0/3 v4] dax: some dax fixes and cleanups
@ 2015-03-25 13:34 ` Boaz Harrosh
  0 siblings, 0 replies; 16+ messages in thread
From: Boaz Harrosh @ 2015-03-25 13:34 UTC (permalink / raw)
  To: Dave Chinner, Matthew Wilcox, Andrew Morton, Kirill A. Shutemov,
	Jan Kara, Hugh Dickins, Mel Gorman, linux-mm, linux-nvdimm,
	linux-fsdevel, Eryu Guan

Hi

[v4] dax: some dax fixes and cleanups
* First patch fixed according to Andrew's comments. Thanks Andrew.
  1st and 2nd patch can go into current Kernel as they fix something
  that was merged this release.
* Added a new patch to fix up splice in the dax case, and cleanup.
  This one can wait for 4.1 (Also the first two not that anyone uses dax
  in production.)
* DAX freeze is not fixed yet. As we have more problems then I originally
  hoped for, as pointed out by Dave.
  (Just as a referance I'm sending a NO-GOOD additional patch to show what
   is not good enough to do. Was the RFC of [v3])
* Not re-posting the xfstest Dave please pick this up (It already found bugs
  in none dax FSs)

[v3] dax: Fix mmap-write not updating c/mtime
* I'm re-posting the two DAX patches that fix the mmap-write after read
  problem with DAX. (No changes since [v2])
* I'm also posting a 3rd RFC patch to address what Jan said about fs_freeze
  and making mapping read-only. 
  Jan Please review and see if this is what you meant.

[v2]
Jan Kara has pointed out that if we add the
sb_start/end_pagefault pair in the new pfn_mkwrite we
are then fixing another bug where: A user could start
writing to the page while filesystem is frozen.

[v1]
The main problem is that current mm/memory.c will no call us with page_mkwrite
if we do not have an actual page mapping, which is what DAX uses.
The solution presented here introduces a new pfn_mkwrite to solve this problem.
Please see patch-2 for details.

I've been running with this patch for 4 month both HW and VMs with no apparent
danger, but see patch-1 I played it safe.

I am also posting an xfstest 080 that demonstrate this problem, I believe
that also some git operations (can't remember which) suffer from this problem.
Actually Eryu Guan found that this test fails on some other FS as well.

List of patches:
 [PATCH 1/3] mm: New pfn_mkwrite same as page_mkwrite for VM_PFNMAP
 [PATCH 2/3] dax: use pfn_mkwrite to update c/mtime + freeze
 [PATCH 3/3] dax: Unify ext2/4_{dax,}_file_operations

 [PATCH] NOTGOOD: dax: dax_prepare_freeze

Andrew hi
I believe this needs to eventually go through your tree. Please pick it
up when you feel it is ready. I believe all 3 are ready and fix real
bugs.

Matthew hi
I would love to have your ACK on these patches?

Thanks
Boaz

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH 1/3] mm: New pfn_mkwrite same as page_mkwrite for VM_PFNMAP
  2015-03-25 13:34 ` Boaz Harrosh
  (?)
@ 2015-03-25 13:38 ` Boaz Harrosh
  2015-03-25 14:34   ` Kirill A. Shutemov
  2015-03-25 15:08   ` Dave Hansen
  -1 siblings, 2 replies; 16+ messages in thread
From: Boaz Harrosh @ 2015-03-25 13:38 UTC (permalink / raw)
  To: Dave Chinner, Matthew Wilcox, Andrew Morton, Kirill A. Shutemov,
	Jan Kara, Hugh Dickins, Mel Gorman, linux-mm, linux-nvdimm,
	linux-fsdevel, Eryu Guan

From: Yigal Korman <yigal@plexistor.com>

This will allow FS that uses VM_PFNMAP | VM_MIXEDMAP (no page structs)
to get notified when access is a write to a read-only PFN.

This can happen if we mmap() a file then first mmap-read from it
to page-in a read-only PFN, than we mmap-write to the same page.

We need this functionality to fix a DAX bug, where in the scenario
above we fail to set ctime/mtime though we modified the file.
An xfstest is attached to this patchset that shows the failure
and the fix. (A DAX patch will follow)

This functionality is extra important for us, because upon
dirtying of a pmem page we also want to RDMA the page to a
remote cluster node.

We define a new pfn_mkwrite and do not reuse page_mkwrite because
  1 - The name ;-)
  2 - But mainly because it would take a very long and tedious
      audit of all page_mkwrite functions of VM_MIXEDMAP/VM_PFNMAP
      users. To make sure they do not now CRASH. For example current
      DAX code (which this is for) would crash.
      If we would want to reuse page_mkwrite, We will need to first
      patch all users, so to not-crash-on-no-page. Then enable this
      patch. But even if I did that I would not sleep so well at night.
      Adding a new vector is the safest thing to do, and is not that
      expensive. an extra pointer at a static function vector per driver.
      Also the new vector is better for performance, because else we
      Will call all current Kernel vectors, so to:
	check-ha-no-page-do-nothing and return.

No need to call it from do_shared_fault because do_wp_page is called to
change pte permissions anyway.

CC: Matthew Wilcox <matthew.r.wilcox@intel.com>
CC: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
CC: Jan Kara <jack@suse.cz>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Hugh Dickins <hughd@google.com>
CC: Mel Gorman <mgorman@suse.de>
CC: linux-mm@kvack.org

Signed-off-by: Yigal Korman <yigal@plexistor.com>
Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
---
 include/linux/mm.h |  2 ++
 mm/memory.c        | 28 +++++++++++++++++++++++++++-
 2 files changed, 29 insertions(+), 1 deletion(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 47a9392..1cd820c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -250,6 +250,8 @@ struct vm_operations_struct {
 	/* notification that a previously read-only page is about to become
 	 * writable, if an error is returned it will cause a SIGBUS */
 	int (*page_mkwrite)(struct vm_area_struct *vma, struct vm_fault *vmf);
+	/* same as page_mkwrite when using VM_PFNMAP|VM_MIXEDMAP */
+	int (*pfn_mkwrite)(struct vm_area_struct *vma, struct vm_fault *vmf);
 
 	/* called by access_process_vm when get_user_pages() fails, typically
 	 * for use by special VMAs that can switch between memory and hardware
diff --git a/mm/memory.c b/mm/memory.c
index 8068893..8d640d1 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1982,6 +1982,23 @@ static int do_page_mkwrite(struct vm_area_struct *vma, struct page *page,
 	return ret;
 }
 
+static int do_pfn_mkwrite(struct vm_area_struct *vma, unsigned long address)
+{
+	if (vma->vm_ops && vma->vm_ops->pfn_mkwrite) {
+		struct vm_fault vmf = {
+			.page = 0,
+			.pgoff = (((address & PAGE_MASK) - vma->vm_start)
+						>> PAGE_SHIFT) + vma->vm_pgoff,
+			.virtual_address = (void __user *)(address & PAGE_MASK),
+			.flags = FAULT_FLAG_WRITE | FAULT_FLAG_MKWRITE,
+		};
+
+		return vma->vm_ops->pfn_mkwrite(vma, &vmf);
+	}
+
+	return 0;
+}
+
 /*
  * This routine handles present pages, when users try to write
  * to a shared page. It is done by copying the page to a new address
@@ -2025,8 +2042,17 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		 * accounting on raw pfn maps.
 		 */
 		if ((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
-				     (VM_WRITE|VM_SHARED))
+				     (VM_WRITE|VM_SHARED)) {
+			pte_unmap_unlock(page_table, ptl);
+			ret = do_pfn_mkwrite(vma, address);
+			if (ret & VM_FAULT_ERROR)
+				return ret;
+			page_table = pte_offset_map_lock(mm, pmd, address,
+							 &ptl);
+			if (!pte_same(*page_table, orig_pte))
+				goto unlock;
 			goto reuse;
+		}
 		goto gotten;
 	}
 
-- 
1.9.3


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 2/3] dax: pfn_mkwrite update c/mtime + freeze protection
  2015-03-25 13:34 ` Boaz Harrosh
@ 2015-03-25 13:41   ` Boaz Harrosh
  -1 siblings, 0 replies; 16+ messages in thread
From: Boaz Harrosh @ 2015-03-25 13:41 UTC (permalink / raw)
  To: Dave Chinner, Matthew Wilcox, Andrew Morton, Kirill A. Shutemov,
	Jan Kara, Hugh Dickins, Mel Gorman, linux-mm, linux-nvdimm,
	linux-fsdevel, Eryu Guan

From: Yigal Korman <yigal@plexistor.com>

[v1]
Without this patch, c/mtime is not updated correctly when mmap'ed page is
first read from and then written to.

A new xfstest is submitted for testing this (generic/080)

[v2]
Jan Kara has pointed out that if we add the
sb_start/end_pagefault pair in the new pfn_mkwrite we
are then fixing another bug where: A user could start
writing to the page while filesystem is frozen.

Dave you need to add the
	.pfn_mkwrite	= dax_pfn_mkwrite
In the xfs patches

CC: Dave Chinner <david@fromorbit.com>
CC: Jan Kara <jack@suse.cz>
Signed-off-by: Yigal Korman <yigal@plexistor.com>
Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
---
 fs/dax.c           | 17 +++++++++++++++++
 fs/ext2/file.c     |  1 +
 fs/ext4/file.c     |  1 +
 include/linux/fs.h |  1 +
 4 files changed, 20 insertions(+)

diff --git a/fs/dax.c b/fs/dax.c
index ed1619e..d0bd1f4 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -464,6 +464,23 @@ int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 EXPORT_SYMBOL_GPL(dax_fault);
 
 /**
+ * dax_pfn_mkwrite - handle first write to DAX page
+ * @vma: The virtual memory area where the fault occurred
+ * @vmf: The description of the fault
+ *
+ */
+int dax_pfn_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+	struct super_block *sb = file_inode(vma->vm_file)->i_sb;
+
+	sb_start_pagefault(sb);
+	file_update_time(vma->vm_file);
+	sb_end_pagefault(sb);
+	return VM_FAULT_NOPAGE;
+}
+EXPORT_SYMBOL_GPL(dax_pfn_mkwrite);
+
+/**
  * dax_zero_page_range - zero a range within a page of a DAX file
  * @inode: The file being truncated
  * @from: The file offset that is being truncated to
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index e317017..866a3ce 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -39,6 +39,7 @@ static int ext2_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 static const struct vm_operations_struct ext2_dax_vm_ops = {
 	.fault		= ext2_dax_fault,
 	.page_mkwrite	= ext2_dax_mkwrite,
+	.pfn_mkwrite	= dax_pfn_mkwrite,
 };
 
 static int ext2_file_mmap(struct file *file, struct vm_area_struct *vma)
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 33a09da..b43a7a6 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -206,6 +206,7 @@ static int ext4_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 static const struct vm_operations_struct ext4_dax_vm_ops = {
 	.fault		= ext4_dax_fault,
 	.page_mkwrite	= ext4_dax_mkwrite,
+	.pfn_mkwrite	= dax_pfn_mkwrite,
 };
 #else
 #define ext4_dax_vm_ops	ext4_file_vm_ops
diff --git a/include/linux/fs.h b/include/linux/fs.h
index b4d71b5..24af817 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2597,6 +2597,7 @@ int dax_clear_blocks(struct inode *, sector_t block, long size);
 int dax_zero_page_range(struct inode *, loff_t from, unsigned len, get_block_t);
 int dax_truncate_page(struct inode *, loff_t from, get_block_t);
 int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t);
+int dax_pfn_mkwrite(struct vm_area_struct *, struct vm_fault *);
 #define dax_mkwrite(vma, vmf, gb)	dax_fault(vma, vmf, gb)
 
 #ifdef CONFIG_BLOCK
-- 
1.9.3



^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 2/3] dax: pfn_mkwrite update c/mtime + freeze protection
@ 2015-03-25 13:41   ` Boaz Harrosh
  0 siblings, 0 replies; 16+ messages in thread
From: Boaz Harrosh @ 2015-03-25 13:41 UTC (permalink / raw)
  To: Dave Chinner, Matthew Wilcox, Andrew Morton, Kirill A. Shutemov,
	Jan Kara, Hugh Dickins, Mel Gorman, linux-mm, linux-nvdimm,
	linux-fsdevel, Eryu Guan

From: Yigal Korman <yigal@plexistor.com>

[v1]
Without this patch, c/mtime is not updated correctly when mmap'ed page is
first read from and then written to.

A new xfstest is submitted for testing this (generic/080)

[v2]
Jan Kara has pointed out that if we add the
sb_start/end_pagefault pair in the new pfn_mkwrite we
are then fixing another bug where: A user could start
writing to the page while filesystem is frozen.

Dave you need to add the
	.pfn_mkwrite	= dax_pfn_mkwrite
In the xfs patches

CC: Dave Chinner <david@fromorbit.com>
CC: Jan Kara <jack@suse.cz>
Signed-off-by: Yigal Korman <yigal@plexistor.com>
Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
---
 fs/dax.c           | 17 +++++++++++++++++
 fs/ext2/file.c     |  1 +
 fs/ext4/file.c     |  1 +
 include/linux/fs.h |  1 +
 4 files changed, 20 insertions(+)

diff --git a/fs/dax.c b/fs/dax.c
index ed1619e..d0bd1f4 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -464,6 +464,23 @@ int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 EXPORT_SYMBOL_GPL(dax_fault);
 
 /**
+ * dax_pfn_mkwrite - handle first write to DAX page
+ * @vma: The virtual memory area where the fault occurred
+ * @vmf: The description of the fault
+ *
+ */
+int dax_pfn_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+	struct super_block *sb = file_inode(vma->vm_file)->i_sb;
+
+	sb_start_pagefault(sb);
+	file_update_time(vma->vm_file);
+	sb_end_pagefault(sb);
+	return VM_FAULT_NOPAGE;
+}
+EXPORT_SYMBOL_GPL(dax_pfn_mkwrite);
+
+/**
  * dax_zero_page_range - zero a range within a page of a DAX file
  * @inode: The file being truncated
  * @from: The file offset that is being truncated to
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index e317017..866a3ce 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -39,6 +39,7 @@ static int ext2_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 static const struct vm_operations_struct ext2_dax_vm_ops = {
 	.fault		= ext2_dax_fault,
 	.page_mkwrite	= ext2_dax_mkwrite,
+	.pfn_mkwrite	= dax_pfn_mkwrite,
 };
 
 static int ext2_file_mmap(struct file *file, struct vm_area_struct *vma)
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 33a09da..b43a7a6 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -206,6 +206,7 @@ static int ext4_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 static const struct vm_operations_struct ext4_dax_vm_ops = {
 	.fault		= ext4_dax_fault,
 	.page_mkwrite	= ext4_dax_mkwrite,
+	.pfn_mkwrite	= dax_pfn_mkwrite,
 };
 #else
 #define ext4_dax_vm_ops	ext4_file_vm_ops
diff --git a/include/linux/fs.h b/include/linux/fs.h
index b4d71b5..24af817 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2597,6 +2597,7 @@ int dax_clear_blocks(struct inode *, sector_t block, long size);
 int dax_zero_page_range(struct inode *, loff_t from, unsigned len, get_block_t);
 int dax_truncate_page(struct inode *, loff_t from, get_block_t);
 int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t);
+int dax_pfn_mkwrite(struct vm_area_struct *, struct vm_fault *);
 #define dax_mkwrite(vma, vmf, gb)	dax_fault(vma, vmf, gb)
 
 #ifdef CONFIG_BLOCK
-- 
1.9.3


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 3/3] dax: Unify ext2/4_{dax,}_file_operations
  2015-03-25 13:34 ` Boaz Harrosh
@ 2015-03-25 13:44   ` Boaz Harrosh
  -1 siblings, 0 replies; 16+ messages in thread
From: Boaz Harrosh @ 2015-03-25 13:44 UTC (permalink / raw)
  To: Dave Chinner, Matthew Wilcox, Andrew Morton, Kirill A. Shutemov,
	Jan Kara, Hugh Dickins, Mel Gorman, linux-mm, linux-nvdimm,
	linux-fsdevel, Eryu Guan


The original dax patchset split the ext2/4_file_operations
because of the two NULL splice_read/splice_write in the dax
case.

At vfs if splice_read/splice_write are NULL would then call
default_splice_read/write.

What we do here is make generic_file_splice_read aware of
IS_DAX() so the original ext2/4_file_operations can be used
as is.

For write it appears that iter_file_splice_write is just fine.
It uses the regular f_op->write(file,..) or new_sync_write(file, ...).

CC: Dave Chinner <dchinner@redhat.com>
CC: Matthew Wilcox <willy@linux.intel.com>
Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
---
 fs/ext2/ext2.h  |  1 -
 fs/ext2/file.c  | 18 ------------------
 fs/ext2/inode.c |  5 +----
 fs/ext2/namei.c | 10 ++--------
 fs/ext4/ext4.h  |  1 -
 fs/ext4/file.c  | 20 --------------------
 fs/ext4/inode.c |  5 +----
 fs/ext4/namei.c | 10 ++--------
 fs/splice.c     |  3 +++
 9 files changed, 9 insertions(+), 64 deletions(-)

diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h
index 678f9ab..8d15feb 100644
--- a/fs/ext2/ext2.h
+++ b/fs/ext2/ext2.h
@@ -793,7 +793,6 @@ extern int ext2_fsync(struct file *file, loff_t start, loff_t end,
 		      int datasync);
 extern const struct inode_operations ext2_file_inode_operations;
 extern const struct file_operations ext2_file_operations;
-extern const struct file_operations ext2_dax_file_operations;
 
 /* inode.c */
 extern const struct address_space_operations ext2_aops;
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index 866a3ce..19cac93 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -109,24 +109,6 @@ const struct file_operations ext2_file_operations = {
 	.splice_write	= iter_file_splice_write,
 };
 
-#ifdef CONFIG_FS_DAX
-const struct file_operations ext2_dax_file_operations = {
-	.llseek		= generic_file_llseek,
-	.read		= new_sync_read,
-	.write		= new_sync_write,
-	.read_iter	= generic_file_read_iter,
-	.write_iter	= generic_file_write_iter,
-	.unlocked_ioctl = ext2_ioctl,
-#ifdef CONFIG_COMPAT
-	.compat_ioctl	= ext2_compat_ioctl,
-#endif
-	.mmap		= ext2_file_mmap,
-	.open		= dquot_file_open,
-	.release	= ext2_release_file,
-	.fsync		= ext2_fsync,
-};
-#endif
-
 const struct inode_operations ext2_file_inode_operations = {
 #ifdef CONFIG_EXT2_FS_XATTR
 	.setxattr	= generic_setxattr,
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 6434bc0..6fdbe80 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -1388,10 +1388,7 @@ struct inode *ext2_iget (struct super_block *sb, unsigned long ino)
 
 	if (S_ISREG(inode->i_mode)) {
 		inode->i_op = &ext2_file_inode_operations;
-		if (test_opt(inode->i_sb, DAX)) {
-			inode->i_mapping->a_ops = &ext2_aops;
-			inode->i_fop = &ext2_dax_file_operations;
-		} else if (test_opt(inode->i_sb, NOBH)) {
+		if (test_opt(inode->i_sb, NOBH)) {
 			inode->i_mapping->a_ops = &ext2_nobh_aops;
 			inode->i_fop = &ext2_file_operations;
 		} else {
diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c
index 148f6e3..ce42293 100644
--- a/fs/ext2/namei.c
+++ b/fs/ext2/namei.c
@@ -104,10 +104,7 @@ static int ext2_create (struct inode * dir, struct dentry * dentry, umode_t mode
 		return PTR_ERR(inode);
 
 	inode->i_op = &ext2_file_inode_operations;
-	if (test_opt(inode->i_sb, DAX)) {
-		inode->i_mapping->a_ops = &ext2_aops;
-		inode->i_fop = &ext2_dax_file_operations;
-	} else if (test_opt(inode->i_sb, NOBH)) {
+	if (test_opt(inode->i_sb, NOBH)) {
 		inode->i_mapping->a_ops = &ext2_nobh_aops;
 		inode->i_fop = &ext2_file_operations;
 	} else {
@@ -125,10 +122,7 @@ static int ext2_tmpfile(struct inode *dir, struct dentry *dentry, umode_t mode)
 		return PTR_ERR(inode);
 
 	inode->i_op = &ext2_file_inode_operations;
-	if (test_opt(inode->i_sb, DAX)) {
-		inode->i_mapping->a_ops = &ext2_aops;
-		inode->i_fop = &ext2_dax_file_operations;
-	} else if (test_opt(inode->i_sb, NOBH)) {
+	if (test_opt(inode->i_sb, NOBH)) {
 		inode->i_mapping->a_ops = &ext2_nobh_aops;
 		inode->i_fop = &ext2_file_operations;
 	} else {
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index f63c3d5..8a3981e 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2593,7 +2593,6 @@ extern const struct file_operations ext4_dir_operations;
 /* file.c */
 extern const struct inode_operations ext4_file_inode_operations;
 extern const struct file_operations ext4_file_operations;
-extern const struct file_operations ext4_dax_file_operations;
 extern loff_t ext4_llseek(struct file *file, loff_t offset, int origin);
 
 /* inline.c */
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index b43a7a6..557140a 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -625,26 +625,6 @@ const struct file_operations ext4_file_operations = {
 	.fallocate	= ext4_fallocate,
 };
 
-#ifdef CONFIG_FS_DAX
-const struct file_operations ext4_dax_file_operations = {
-	.llseek		= ext4_llseek,
-	.read		= new_sync_read,
-	.write		= new_sync_write,
-	.read_iter	= generic_file_read_iter,
-	.write_iter	= ext4_file_write_iter,
-	.unlocked_ioctl = ext4_ioctl,
-#ifdef CONFIG_COMPAT
-	.compat_ioctl	= ext4_compat_ioctl,
-#endif
-	.mmap		= ext4_file_mmap,
-	.open		= ext4_file_open,
-	.release	= ext4_release_file,
-	.fsync		= ext4_sync_file,
-	/* Splice not yet supported with DAX */
-	.fallocate	= ext4_fallocate,
-};
-#endif
-
 const struct inode_operations ext4_file_inode_operations = {
 	.setattr	= ext4_setattr,
 	.getattr	= ext4_getattr,
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 5cb9a21..ec60deb 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4091,10 +4091,7 @@ struct inode *ext4_iget(struct super_block *sb, unsigned long ino)
 
 	if (S_ISREG(inode->i_mode)) {
 		inode->i_op = &ext4_file_inode_operations;
-		if (test_opt(inode->i_sb, DAX))
-			inode->i_fop = &ext4_dax_file_operations;
-		else
-			inode->i_fop = &ext4_file_operations;
+		inode->i_fop = &ext4_file_operations;
 		ext4_set_aops(inode);
 	} else if (S_ISDIR(inode->i_mode)) {
 		inode->i_op = &ext4_dir_inode_operations;
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 28fe71a..2291923 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -2235,10 +2235,7 @@ retry:
 	err = PTR_ERR(inode);
 	if (!IS_ERR(inode)) {
 		inode->i_op = &ext4_file_inode_operations;
-		if (test_opt(inode->i_sb, DAX))
-			inode->i_fop = &ext4_dax_file_operations;
-		else
-			inode->i_fop = &ext4_file_operations;
+		inode->i_fop = &ext4_file_operations;
 		ext4_set_aops(inode);
 		err = ext4_add_nondir(handle, dentry, inode);
 		if (!err && IS_DIRSYNC(dir))
@@ -2302,10 +2299,7 @@ retry:
 	err = PTR_ERR(inode);
 	if (!IS_ERR(inode)) {
 		inode->i_op = &ext4_file_inode_operations;
-		if (test_opt(inode->i_sb, DAX))
-			inode->i_fop = &ext4_dax_file_operations;
-		else
-			inode->i_fop = &ext4_file_operations;
+		inode->i_fop = &ext4_file_operations;
 		ext4_set_aops(inode);
 		d_tmpfile(dentry, inode);
 		err = ext4_orphan_add(handle, inode);
diff --git a/fs/splice.c b/fs/splice.c
index 7968da9..7d2fbb7 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -524,6 +524,9 @@ ssize_t generic_file_splice_read(struct file *in, loff_t *ppos,
 	loff_t isize, left;
 	int ret;
 
+	if (IS_DAX(in->f_mapping->host))
+		return default_file_splice_read(in, ppos, pipe, len, flags);
+
 	isize = i_size_read(in->f_mapping->host);
 	if (unlikely(*ppos >= isize))
 		return 0;
-- 
1.9.3



^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 3/3] dax: Unify ext2/4_{dax,}_file_operations
@ 2015-03-25 13:44   ` Boaz Harrosh
  0 siblings, 0 replies; 16+ messages in thread
From: Boaz Harrosh @ 2015-03-25 13:44 UTC (permalink / raw)
  To: Dave Chinner, Matthew Wilcox, Andrew Morton, Kirill A. Shutemov,
	Jan Kara, Hugh Dickins, Mel Gorman, linux-mm, linux-nvdimm,
	linux-fsdevel, Eryu Guan


The original dax patchset split the ext2/4_file_operations
because of the two NULL splice_read/splice_write in the dax
case.

At vfs if splice_read/splice_write are NULL would then call
default_splice_read/write.

What we do here is make generic_file_splice_read aware of
IS_DAX() so the original ext2/4_file_operations can be used
as is.

For write it appears that iter_file_splice_write is just fine.
It uses the regular f_op->write(file,..) or new_sync_write(file, ...).

CC: Dave Chinner <dchinner@redhat.com>
CC: Matthew Wilcox <willy@linux.intel.com>
Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
---
 fs/ext2/ext2.h  |  1 -
 fs/ext2/file.c  | 18 ------------------
 fs/ext2/inode.c |  5 +----
 fs/ext2/namei.c | 10 ++--------
 fs/ext4/ext4.h  |  1 -
 fs/ext4/file.c  | 20 --------------------
 fs/ext4/inode.c |  5 +----
 fs/ext4/namei.c | 10 ++--------
 fs/splice.c     |  3 +++
 9 files changed, 9 insertions(+), 64 deletions(-)

diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h
index 678f9ab..8d15feb 100644
--- a/fs/ext2/ext2.h
+++ b/fs/ext2/ext2.h
@@ -793,7 +793,6 @@ extern int ext2_fsync(struct file *file, loff_t start, loff_t end,
 		      int datasync);
 extern const struct inode_operations ext2_file_inode_operations;
 extern const struct file_operations ext2_file_operations;
-extern const struct file_operations ext2_dax_file_operations;
 
 /* inode.c */
 extern const struct address_space_operations ext2_aops;
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index 866a3ce..19cac93 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -109,24 +109,6 @@ const struct file_operations ext2_file_operations = {
 	.splice_write	= iter_file_splice_write,
 };
 
-#ifdef CONFIG_FS_DAX
-const struct file_operations ext2_dax_file_operations = {
-	.llseek		= generic_file_llseek,
-	.read		= new_sync_read,
-	.write		= new_sync_write,
-	.read_iter	= generic_file_read_iter,
-	.write_iter	= generic_file_write_iter,
-	.unlocked_ioctl = ext2_ioctl,
-#ifdef CONFIG_COMPAT
-	.compat_ioctl	= ext2_compat_ioctl,
-#endif
-	.mmap		= ext2_file_mmap,
-	.open		= dquot_file_open,
-	.release	= ext2_release_file,
-	.fsync		= ext2_fsync,
-};
-#endif
-
 const struct inode_operations ext2_file_inode_operations = {
 #ifdef CONFIG_EXT2_FS_XATTR
 	.setxattr	= generic_setxattr,
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 6434bc0..6fdbe80 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -1388,10 +1388,7 @@ struct inode *ext2_iget (struct super_block *sb, unsigned long ino)
 
 	if (S_ISREG(inode->i_mode)) {
 		inode->i_op = &ext2_file_inode_operations;
-		if (test_opt(inode->i_sb, DAX)) {
-			inode->i_mapping->a_ops = &ext2_aops;
-			inode->i_fop = &ext2_dax_file_operations;
-		} else if (test_opt(inode->i_sb, NOBH)) {
+		if (test_opt(inode->i_sb, NOBH)) {
 			inode->i_mapping->a_ops = &ext2_nobh_aops;
 			inode->i_fop = &ext2_file_operations;
 		} else {
diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c
index 148f6e3..ce42293 100644
--- a/fs/ext2/namei.c
+++ b/fs/ext2/namei.c
@@ -104,10 +104,7 @@ static int ext2_create (struct inode * dir, struct dentry * dentry, umode_t mode
 		return PTR_ERR(inode);
 
 	inode->i_op = &ext2_file_inode_operations;
-	if (test_opt(inode->i_sb, DAX)) {
-		inode->i_mapping->a_ops = &ext2_aops;
-		inode->i_fop = &ext2_dax_file_operations;
-	} else if (test_opt(inode->i_sb, NOBH)) {
+	if (test_opt(inode->i_sb, NOBH)) {
 		inode->i_mapping->a_ops = &ext2_nobh_aops;
 		inode->i_fop = &ext2_file_operations;
 	} else {
@@ -125,10 +122,7 @@ static int ext2_tmpfile(struct inode *dir, struct dentry *dentry, umode_t mode)
 		return PTR_ERR(inode);
 
 	inode->i_op = &ext2_file_inode_operations;
-	if (test_opt(inode->i_sb, DAX)) {
-		inode->i_mapping->a_ops = &ext2_aops;
-		inode->i_fop = &ext2_dax_file_operations;
-	} else if (test_opt(inode->i_sb, NOBH)) {
+	if (test_opt(inode->i_sb, NOBH)) {
 		inode->i_mapping->a_ops = &ext2_nobh_aops;
 		inode->i_fop = &ext2_file_operations;
 	} else {
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index f63c3d5..8a3981e 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2593,7 +2593,6 @@ extern const struct file_operations ext4_dir_operations;
 /* file.c */
 extern const struct inode_operations ext4_file_inode_operations;
 extern const struct file_operations ext4_file_operations;
-extern const struct file_operations ext4_dax_file_operations;
 extern loff_t ext4_llseek(struct file *file, loff_t offset, int origin);
 
 /* inline.c */
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index b43a7a6..557140a 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -625,26 +625,6 @@ const struct file_operations ext4_file_operations = {
 	.fallocate	= ext4_fallocate,
 };
 
-#ifdef CONFIG_FS_DAX
-const struct file_operations ext4_dax_file_operations = {
-	.llseek		= ext4_llseek,
-	.read		= new_sync_read,
-	.write		= new_sync_write,
-	.read_iter	= generic_file_read_iter,
-	.write_iter	= ext4_file_write_iter,
-	.unlocked_ioctl = ext4_ioctl,
-#ifdef CONFIG_COMPAT
-	.compat_ioctl	= ext4_compat_ioctl,
-#endif
-	.mmap		= ext4_file_mmap,
-	.open		= ext4_file_open,
-	.release	= ext4_release_file,
-	.fsync		= ext4_sync_file,
-	/* Splice not yet supported with DAX */
-	.fallocate	= ext4_fallocate,
-};
-#endif
-
 const struct inode_operations ext4_file_inode_operations = {
 	.setattr	= ext4_setattr,
 	.getattr	= ext4_getattr,
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 5cb9a21..ec60deb 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4091,10 +4091,7 @@ struct inode *ext4_iget(struct super_block *sb, unsigned long ino)
 
 	if (S_ISREG(inode->i_mode)) {
 		inode->i_op = &ext4_file_inode_operations;
-		if (test_opt(inode->i_sb, DAX))
-			inode->i_fop = &ext4_dax_file_operations;
-		else
-			inode->i_fop = &ext4_file_operations;
+		inode->i_fop = &ext4_file_operations;
 		ext4_set_aops(inode);
 	} else if (S_ISDIR(inode->i_mode)) {
 		inode->i_op = &ext4_dir_inode_operations;
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 28fe71a..2291923 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -2235,10 +2235,7 @@ retry:
 	err = PTR_ERR(inode);
 	if (!IS_ERR(inode)) {
 		inode->i_op = &ext4_file_inode_operations;
-		if (test_opt(inode->i_sb, DAX))
-			inode->i_fop = &ext4_dax_file_operations;
-		else
-			inode->i_fop = &ext4_file_operations;
+		inode->i_fop = &ext4_file_operations;
 		ext4_set_aops(inode);
 		err = ext4_add_nondir(handle, dentry, inode);
 		if (!err && IS_DIRSYNC(dir))
@@ -2302,10 +2299,7 @@ retry:
 	err = PTR_ERR(inode);
 	if (!IS_ERR(inode)) {
 		inode->i_op = &ext4_file_inode_operations;
-		if (test_opt(inode->i_sb, DAX))
-			inode->i_fop = &ext4_dax_file_operations;
-		else
-			inode->i_fop = &ext4_file_operations;
+		inode->i_fop = &ext4_file_operations;
 		ext4_set_aops(inode);
 		d_tmpfile(dentry, inode);
 		err = ext4_orphan_add(handle, inode);
diff --git a/fs/splice.c b/fs/splice.c
index 7968da9..7d2fbb7 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -524,6 +524,9 @@ ssize_t generic_file_splice_read(struct file *in, loff_t *ppos,
 	loff_t isize, left;
 	int ret;
 
+	if (IS_DAX(in->f_mapping->host))
+		return default_file_splice_read(in, ppos, pipe, len, flags);
+
 	isize = i_size_read(in->f_mapping->host);
 	if (unlikely(*ppos >= isize))
 		return 0;
-- 
1.9.3


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [FIXME] NOT-GOOD: dax: dax_prepare_freeze
  2015-03-25 13:34 ` Boaz Harrosh
@ 2015-03-25 13:47   ` Boaz Harrosh
  -1 siblings, 0 replies; 16+ messages in thread
From: Boaz Harrosh @ 2015-03-25 13:47 UTC (permalink / raw)
  To: Dave Chinner, Matthew Wilcox, Andrew Morton, Kirill A. Shutemov,
	Jan Kara, Hugh Dickins, Mel Gorman, linux-mm, linux-nvdimm,
	linux-fsdevel, Eryu Guan


This is just for reference!

When freezing an FS, we must write protect all IS_DAX()
inodes that have an mmap mapping on an inode. Otherwise
application will be able to modify previously faulted-in
file pages.

I'm actually doing a full unmap_mapping_range because
there is no readily available "mapping_write_protect" like
functionality. I do not think it is worth it to define one
just for here and just for some extra read-faults after an
fs_freeze.

How hot-path is fs_freeze at all?

FIXME: As pointed out by Dave this is completely the wrong
       fix because we need to first fsync all cache dirty
       inodes, and only for those write protect. So maybe
       plug this in the regular sb_sync path, checking the
       FREEZE flag.

CC: Dave Chinner <dchinner@redhat.com>
CC: Jan Kara <jack@suse.cz>
CC: Matthew Wilcox <matthew.r.wilcox@intel.com>
NOT-Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
---
 fs/dax.c           | 29 +++++++++++++++++++++++++++++
 fs/super.c         |  3 +++
 include/linux/fs.h |  5 +++++
 3 files changed, 37 insertions(+)

diff --git a/fs/dax.c b/fs/dax.c
index d0bd1f4..ec99d1c 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -26,6 +26,7 @@
 #include <linux/sched.h>
 #include <linux/uio.h>
 #include <linux/vmstat.h>
+#include "internal.h"
 
 int dax_clear_blocks(struct inode *inode, sector_t block, long size)
 {
@@ -549,3 +550,31 @@ int dax_truncate_page(struct inode *inode, loff_t from, get_block_t get_block)
 	return dax_zero_page_range(inode, from, length, get_block);
 }
 EXPORT_SYMBOL_GPL(dax_truncate_page);
+
+/* This is meant to be called as part of freeze_super. otherwise we might
+ * Need some extra locking before calling here.
+ */
+void dax_prepare_freeze(struct super_block *sb)
+{
+	struct inode *inode;
+
+	if (!(sb->s_bdev && sb->s_bdev->bd_disk->fops->direct_access))
+		return;
+
+	spin_lock(&inode_sb_list_lock);
+	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+		/* TODO: For freezing we can actually do with write-protecting
+		 * the page. But I cannot find a ready made function that does
+		 * that for a giving mapping (with all the proper locking).
+		 * How performance sensitive is the all sb_freeze API?
+		 * For now we can just unmap the all mapping, and pay extra
+		 * on read faults.
+		 */
+		/* NOTE: Do not unmap private COW mapped pages it will not
+		 * modify the FS.
+		 */
+		if (IS_DAX(inode))
+			unmap_mapping_range(inode->i_mapping, 0, 0, 0);
+	}
+	spin_unlock(&inode_sb_list_lock);
+}
diff --git a/fs/super.c b/fs/super.c
index 2b7dc90..9ef490c 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -1329,6 +1329,9 @@ int freeze_super(struct super_block *sb)
 	/* All writers are done so after syncing there won't be dirty data */
 	sync_filesystem(sb);
 
+	/* Need to take care of DAX mmaped inodes */
+	dax_prepare_freeze(sb);
+
 	/* Now wait for internal filesystem counter */
 	sb->s_writers.frozen = SB_FREEZE_FS;
 	smp_wmb();
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 24af817..ac48ba6 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2599,6 +2599,11 @@ int dax_truncate_page(struct inode *, loff_t from, get_block_t);
 int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t);
 int dax_pfn_mkwrite(struct vm_area_struct *, struct vm_fault *);
 #define dax_mkwrite(vma, vmf, gb)	dax_fault(vma, vmf, gb)
+#ifdef CONFIG_FS_DAX
+void dax_prepare_freeze(struct super_block *sb);
+#else /* !CONFIG_FS_DAX */
+static inline void dax_prepare_freeze(struct super_block *sb){}
+#endif /* !CONFIG_FS_DAX */
 
 #ifdef CONFIG_BLOCK
 typedef void (dio_submit_t)(int rw, struct bio *bio, struct inode *inode,
-- 
1.9.3



^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [FIXME] NOT-GOOD: dax: dax_prepare_freeze
@ 2015-03-25 13:47   ` Boaz Harrosh
  0 siblings, 0 replies; 16+ messages in thread
From: Boaz Harrosh @ 2015-03-25 13:47 UTC (permalink / raw)
  To: Dave Chinner, Matthew Wilcox, Andrew Morton, Kirill A. Shutemov,
	Jan Kara, Hugh Dickins, Mel Gorman, linux-mm, linux-nvdimm,
	linux-fsdevel, Eryu Guan


This is just for reference!

When freezing an FS, we must write protect all IS_DAX()
inodes that have an mmap mapping on an inode. Otherwise
application will be able to modify previously faulted-in
file pages.

I'm actually doing a full unmap_mapping_range because
there is no readily available "mapping_write_protect" like
functionality. I do not think it is worth it to define one
just for here and just for some extra read-faults after an
fs_freeze.

How hot-path is fs_freeze at all?

FIXME: As pointed out by Dave this is completely the wrong
       fix because we need to first fsync all cache dirty
       inodes, and only for those write protect. So maybe
       plug this in the regular sb_sync path, checking the
       FREEZE flag.

CC: Dave Chinner <dchinner@redhat.com>
CC: Jan Kara <jack@suse.cz>
CC: Matthew Wilcox <matthew.r.wilcox@intel.com>
NOT-Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
---
 fs/dax.c           | 29 +++++++++++++++++++++++++++++
 fs/super.c         |  3 +++
 include/linux/fs.h |  5 +++++
 3 files changed, 37 insertions(+)

diff --git a/fs/dax.c b/fs/dax.c
index d0bd1f4..ec99d1c 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -26,6 +26,7 @@
 #include <linux/sched.h>
 #include <linux/uio.h>
 #include <linux/vmstat.h>
+#include "internal.h"
 
 int dax_clear_blocks(struct inode *inode, sector_t block, long size)
 {
@@ -549,3 +550,31 @@ int dax_truncate_page(struct inode *inode, loff_t from, get_block_t get_block)
 	return dax_zero_page_range(inode, from, length, get_block);
 }
 EXPORT_SYMBOL_GPL(dax_truncate_page);
+
+/* This is meant to be called as part of freeze_super. otherwise we might
+ * Need some extra locking before calling here.
+ */
+void dax_prepare_freeze(struct super_block *sb)
+{
+	struct inode *inode;
+
+	if (!(sb->s_bdev && sb->s_bdev->bd_disk->fops->direct_access))
+		return;
+
+	spin_lock(&inode_sb_list_lock);
+	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+		/* TODO: For freezing we can actually do with write-protecting
+		 * the page. But I cannot find a ready made function that does
+		 * that for a giving mapping (with all the proper locking).
+		 * How performance sensitive is the all sb_freeze API?
+		 * For now we can just unmap the all mapping, and pay extra
+		 * on read faults.
+		 */
+		/* NOTE: Do not unmap private COW mapped pages it will not
+		 * modify the FS.
+		 */
+		if (IS_DAX(inode))
+			unmap_mapping_range(inode->i_mapping, 0, 0, 0);
+	}
+	spin_unlock(&inode_sb_list_lock);
+}
diff --git a/fs/super.c b/fs/super.c
index 2b7dc90..9ef490c 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -1329,6 +1329,9 @@ int freeze_super(struct super_block *sb)
 	/* All writers are done so after syncing there won't be dirty data */
 	sync_filesystem(sb);
 
+	/* Need to take care of DAX mmaped inodes */
+	dax_prepare_freeze(sb);
+
 	/* Now wait for internal filesystem counter */
 	sb->s_writers.frozen = SB_FREEZE_FS;
 	smp_wmb();
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 24af817..ac48ba6 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2599,6 +2599,11 @@ int dax_truncate_page(struct inode *, loff_t from, get_block_t);
 int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t);
 int dax_pfn_mkwrite(struct vm_area_struct *, struct vm_fault *);
 #define dax_mkwrite(vma, vmf, gb)	dax_fault(vma, vmf, gb)
+#ifdef CONFIG_FS_DAX
+void dax_prepare_freeze(struct super_block *sb);
+#else /* !CONFIG_FS_DAX */
+static inline void dax_prepare_freeze(struct super_block *sb){}
+#endif /* !CONFIG_FS_DAX */
 
 #ifdef CONFIG_BLOCK
 typedef void (dio_submit_t)(int rw, struct bio *bio, struct inode *inode,
-- 
1.9.3


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/3] mm: New pfn_mkwrite same as page_mkwrite for VM_PFNMAP
  2015-03-25 13:38 ` [PATCH 1/3] mm: New pfn_mkwrite same as page_mkwrite for VM_PFNMAP Boaz Harrosh
@ 2015-03-25 14:34   ` Kirill A. Shutemov
  2015-03-26  7:49     ` Boaz Harrosh
  2015-03-25 15:08   ` Dave Hansen
  1 sibling, 1 reply; 16+ messages in thread
From: Kirill A. Shutemov @ 2015-03-25 14:34 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Dave Chinner, Matthew Wilcox, Andrew Morton, Kirill A. Shutemov,
	Jan Kara, Hugh Dickins, Mel Gorman, linux-mm, linux-nvdimm,
	linux-fsdevel, Eryu Guan

On Wed, Mar 25, 2015 at 03:38:37PM +0200, Boaz Harrosh wrote:
> From: Yigal Korman <yigal@plexistor.com>
> 
> This will allow FS that uses VM_PFNMAP | VM_MIXEDMAP (no page structs)
> to get notified when access is a write to a read-only PFN.
> 
> This can happen if we mmap() a file then first mmap-read from it
> to page-in a read-only PFN, than we mmap-write to the same page.
> 
> We need this functionality to fix a DAX bug, where in the scenario
> above we fail to set ctime/mtime though we modified the file.
> An xfstest is attached to this patchset that shows the failure
> and the fix. (A DAX patch will follow)
> 
> This functionality is extra important for us, because upon
> dirtying of a pmem page we also want to RDMA the page to a
> remote cluster node.
> 
> We define a new pfn_mkwrite and do not reuse page_mkwrite because
>   1 - The name ;-)
>   2 - But mainly because it would take a very long and tedious
>       audit of all page_mkwrite functions of VM_MIXEDMAP/VM_PFNMAP
>       users. To make sure they do not now CRASH. For example current
>       DAX code (which this is for) would crash.
>       If we would want to reuse page_mkwrite, We will need to first
>       patch all users, so to not-crash-on-no-page. Then enable this
>       patch. But even if I did that I would not sleep so well at night.
>       Adding a new vector is the safest thing to do, and is not that
>       expensive. an extra pointer at a static function vector per driver.
>       Also the new vector is better for performance, because else we
>       Will call all current Kernel vectors, so to:
> 	check-ha-no-page-do-nothing and return.
> 
> No need to call it from do_shared_fault because do_wp_page is called to
> change pte permissions anyway.
> 
> CC: Matthew Wilcox <matthew.r.wilcox@intel.com>
> CC: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> CC: Jan Kara <jack@suse.cz>
> CC: Andrew Morton <akpm@linux-foundation.org>
> CC: Hugh Dickins <hughd@google.com>
> CC: Mel Gorman <mgorman@suse.de>
> CC: linux-mm@kvack.org
> 
> Signed-off-by: Yigal Korman <yigal@plexistor.com>
> Signed-off-by: Boaz Harrosh <boaz@plexistor.com>

This is not going to apply to -mm. do_wp_page() is reworked there.
BTW, shouldn't we rename it to do_wp_fault() or something?

> ---
>  include/linux/mm.h |  2 ++
>  mm/memory.c        | 28 +++++++++++++++++++++++++++-

Documentation/filesystems/Locking ?

>  2 files changed, 29 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 47a9392..1cd820c 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -250,6 +250,8 @@ struct vm_operations_struct {
>  	/* notification that a previously read-only page is about to become
>  	 * writable, if an error is returned it will cause a SIGBUS */
>  	int (*page_mkwrite)(struct vm_area_struct *vma, struct vm_fault *vmf);
> +	/* same as page_mkwrite when using VM_PFNMAP|VM_MIXEDMAP */

New line before the comment?

> +	int (*pfn_mkwrite)(struct vm_area_struct *vma, struct vm_fault *vmf);
>  
>  	/* called by access_process_vm when get_user_pages() fails, typically
>  	 * for use by special VMAs that can switch between memory and hardware
> diff --git a/mm/memory.c b/mm/memory.c
> index 8068893..8d640d1 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1982,6 +1982,23 @@ static int do_page_mkwrite(struct vm_area_struct *vma, struct page *page,
>  	return ret;
>  }
>  
> +static int do_pfn_mkwrite(struct vm_area_struct *vma, unsigned long address)
> +{
> +	if (vma->vm_ops && vma->vm_ops->pfn_mkwrite) {
> +		struct vm_fault vmf = {
> +			.page = 0,

.page = NULL,

> +			.pgoff = (((address & PAGE_MASK) - vma->vm_start)
> +						>> PAGE_SHIFT) + vma->vm_pgoff,

.pgoff = linear_page_index(vma, address),

> +			.virtual_address = (void __user *)(address & PAGE_MASK),
> +			.flags = FAULT_FLAG_WRITE | FAULT_FLAG_MKWRITE,
> +		};
> +
> +		return vma->vm_ops->pfn_mkwrite(vma, &vmf);
> +	}
> +
> +	return 0;
> +}
> +
>  /*
>   * This routine handles present pages, when users try to write
>   * to a shared page. It is done by copying the page to a new address
> @@ -2025,8 +2042,17 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  		 * accounting on raw pfn maps.
>  		 */
>  		if ((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
> -				     (VM_WRITE|VM_SHARED))
> +				     (VM_WRITE|VM_SHARED)) {
> +			pte_unmap_unlock(page_table, ptl);

It would be nice to avoid ptl drop if ->pfn_mkwrite is not defined for the
vma.

> +			ret = do_pfn_mkwrite(vma, address);
> +			if (ret & VM_FAULT_ERROR)
> +				return ret;
> +			page_table = pte_offset_map_lock(mm, pmd, address,
> +							 &ptl);
> +			if (!pte_same(*page_table, orig_pte))
> +				goto unlock;
>  			goto reuse;
> +		}
>  		goto gotten;
>  	}
>  
-- 
 Kirill A. Shutemov

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/3] mm: New pfn_mkwrite same as page_mkwrite for VM_PFNMAP
  2015-03-25 13:38 ` [PATCH 1/3] mm: New pfn_mkwrite same as page_mkwrite for VM_PFNMAP Boaz Harrosh
  2015-03-25 14:34   ` Kirill A. Shutemov
@ 2015-03-25 15:08   ` Dave Hansen
  2015-03-25 15:13     ` Kirill A. Shutemov
  1 sibling, 1 reply; 16+ messages in thread
From: Dave Hansen @ 2015-03-25 15:08 UTC (permalink / raw)
  To: Boaz Harrosh, Dave Chinner, Matthew Wilcox, Andrew Morton,
	Kirill A. Shutemov, Jan Kara, Hugh Dickins, Mel Gorman, linux-mm,
	linux-nvdimm, linux-fsdevel, Eryu Guan

On 03/25/2015 06:38 AM, Boaz Harrosh wrote:
>  /*
>   * This routine handles present pages, when users try to write
>   * to a shared page. It is done by copying the page to a new address
> @@ -2025,8 +2042,17 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  		 * accounting on raw pfn maps.
>  		 */
>  		if ((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
> -				     (VM_WRITE|VM_SHARED))
> +				     (VM_WRITE|VM_SHARED)) {
> +			pte_unmap_unlock(page_table, ptl);
> +			ret = do_pfn_mkwrite(vma, address);
> +			if (ret & VM_FAULT_ERROR)
> +				return ret;
> +			page_table = pte_offset_map_lock(mm, pmd, address,
> +							 &ptl);
> +			if (!pte_same(*page_table, orig_pte))
> +				goto unlock;
>  			goto reuse;
> +		}
>  		goto gotten;
>  	}

This adds a lock release/reacquire in a place where the lock was
previously just held.  Could you explain a bit why this is safe?

Also, that pte_same() check looks a bit fragile.  It seems like it would
fail if the hardware, for instance, set the accessed bit in here
somewhere.  Is that what we want?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/3] mm: New pfn_mkwrite same as page_mkwrite for VM_PFNMAP
  2015-03-25 15:08   ` Dave Hansen
@ 2015-03-25 15:13     ` Kirill A. Shutemov
  0 siblings, 0 replies; 16+ messages in thread
From: Kirill A. Shutemov @ 2015-03-25 15:13 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Boaz Harrosh, Dave Chinner, Matthew Wilcox, Andrew Morton,
	Kirill A. Shutemov, Jan Kara, Hugh Dickins, Mel Gorman, linux-mm,
	linux-nvdimm, linux-fsdevel, Eryu Guan

On Wed, Mar 25, 2015 at 08:08:24AM -0700, Dave Hansen wrote:
> On 03/25/2015 06:38 AM, Boaz Harrosh wrote:
> >  /*
> >   * This routine handles present pages, when users try to write
> >   * to a shared page. It is done by copying the page to a new address
> > @@ -2025,8 +2042,17 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
> >  		 * accounting on raw pfn maps.
> >  		 */
> >  		if ((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
> > -				     (VM_WRITE|VM_SHARED))
> > +				     (VM_WRITE|VM_SHARED)) {
> > +			pte_unmap_unlock(page_table, ptl);
> > +			ret = do_pfn_mkwrite(vma, address);
> > +			if (ret & VM_FAULT_ERROR)
> > +				return ret;
> > +			page_table = pte_offset_map_lock(mm, pmd, address,
> > +							 &ptl);
> > +			if (!pte_same(*page_table, orig_pte))
> > +				goto unlock;
> >  			goto reuse;
> > +		}
> >  		goto gotten;
> >  	}
> 
> This adds a lock release/reacquire in a place where the lock was
> previously just held.  Could you explain a bit why this is safe?

It's common practice in page fault codepath. See code around
->page_mkwrite for example.
> 
> Also, that pte_same() check looks a bit fragile.  It seems like it would
> fail if the hardware, for instance, set the accessed bit in here
> somewhere.  Is that what we want?

In this case we will cancel this fault handling and fault again. No
problems here.

-- 
 Kirill A. Shutemov

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/3] mm: New pfn_mkwrite same as page_mkwrite for VM_PFNMAP
  2015-03-25 14:34   ` Kirill A. Shutemov
@ 2015-03-26  7:49     ` Boaz Harrosh
  0 siblings, 0 replies; 16+ messages in thread
From: Boaz Harrosh @ 2015-03-26  7:49 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Dave Chinner, Matthew Wilcox, Andrew Morton, Kirill A. Shutemov,
	Jan Kara, Hugh Dickins, Mel Gorman, linux-mm, linux-nvdimm,
	linux-fsdevel, Eryu Guan

On 03/25/2015 04:34 PM, Kirill A. Shutemov wrote:
> On Wed, Mar 25, 2015 at 03:38:37PM +0200, Boaz Harrosh wrote:
>> From: Yigal Korman <yigal@plexistor.com>
>>
>> This will allow FS that uses VM_PFNMAP | VM_MIXEDMAP (no page structs)
>> to get notified when access is a write to a read-only PFN.
>>
>> This can happen if we mmap() a file then first mmap-read from it
>> to page-in a read-only PFN, than we mmap-write to the same page.
>>
>> We need this functionality to fix a DAX bug, where in the scenario
>> above we fail to set ctime/mtime though we modified the file.
>> An xfstest is attached to this patchset that shows the failure
>> and the fix. (A DAX patch will follow)
>>
>> This functionality is extra important for us, because upon
>> dirtying of a pmem page we also want to RDMA the page to a
>> remote cluster node.
>>
>> We define a new pfn_mkwrite and do not reuse page_mkwrite because
>>   1 - The name ;-)
>>   2 - But mainly because it would take a very long and tedious
>>       audit of all page_mkwrite functions of VM_MIXEDMAP/VM_PFNMAP
>>       users. To make sure they do not now CRASH. For example current
>>       DAX code (which this is for) would crash.
>>       If we would want to reuse page_mkwrite, We will need to first
>>       patch all users, so to not-crash-on-no-page. Then enable this
>>       patch. But even if I did that I would not sleep so well at night.
>>       Adding a new vector is the safest thing to do, and is not that
>>       expensive. an extra pointer at a static function vector per driver.
>>       Also the new vector is better for performance, because else we
>>       Will call all current Kernel vectors, so to:
>> 	check-ha-no-page-do-nothing and return.
>>
>> No need to call it from do_shared_fault because do_wp_page is called to
>> change pte permissions anyway.
>>
>> CC: Matthew Wilcox <matthew.r.wilcox@intel.com>
>> CC: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>> CC: Jan Kara <jack@suse.cz>
>> CC: Andrew Morton <akpm@linux-foundation.org>
>> CC: Hugh Dickins <hughd@google.com>
>> CC: Mel Gorman <mgorman@suse.de>
>> CC: linux-mm@kvack.org
>>
>> Signed-off-by: Yigal Korman <yigal@plexistor.com>
>> Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
> 
> This is not going to apply to -mm. do_wp_page() is reworked there.
> BTW, shouldn't we rename it to do_wp_fault() or something?
> 

Wowhoo you were not kidding ;-)

I'll redo this patch based on linux-next/akpm branch. I will
need an hard up testing. Current patch I had for 6 month and
I'm confident about it. I'll need to stare at this real hard.

>> ---
>>  include/linux/mm.h |  2 ++
>>  mm/memory.c        | 28 +++++++++++++++++++++++++++-
> 
> Documentation/filesystems/Locking ?
> 
>>  2 files changed, 29 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 47a9392..1cd820c 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -250,6 +250,8 @@ struct vm_operations_struct {
>>  	/* notification that a previously read-only page is about to become
>>  	 * writable, if an error is returned it will cause a SIGBUS */
>>  	int (*page_mkwrite)(struct vm_area_struct *vma, struct vm_fault *vmf);
>> +	/* same as page_mkwrite when using VM_PFNMAP|VM_MIXEDMAP */
> 
> New line before the comment?
> 
>> +	int (*pfn_mkwrite)(struct vm_area_struct *vma, struct vm_fault *vmf);
>>  
>>  	/* called by access_process_vm when get_user_pages() fails, typically
>>  	 * for use by special VMAs that can switch between memory and hardware
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 8068893..8d640d1 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -1982,6 +1982,23 @@ static int do_page_mkwrite(struct vm_area_struct *vma, struct page *page,
>>  	return ret;
>>  }
>>  
>> +static int do_pfn_mkwrite(struct vm_area_struct *vma, unsigned long address)
>> +{
>> +	if (vma->vm_ops && vma->vm_ops->pfn_mkwrite) {
>> +		struct vm_fault vmf = {
>> +			.page = 0,
> 
> .page = NULL,
> 
>> +			.pgoff = (((address & PAGE_MASK) - vma->vm_start)
>> +						>> PAGE_SHIFT) + vma->vm_pgoff,
> 
> .pgoff = linear_page_index(vma, address),
> 
>> +			.virtual_address = (void __user *)(address & PAGE_MASK),
>> +			.flags = FAULT_FLAG_WRITE | FAULT_FLAG_MKWRITE,
>> +		};
>> +
>> +		return vma->vm_ops->pfn_mkwrite(vma, &vmf);
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>>  /*
>>   * This routine handles present pages, when users try to write
>>   * to a shared page. It is done by copying the page to a new address
>> @@ -2025,8 +2042,17 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
>>  		 * accounting on raw pfn maps.
>>  		 */
>>  		if ((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
>> -				     (VM_WRITE|VM_SHARED))
>> +				     (VM_WRITE|VM_SHARED)) {
>> +			pte_unmap_unlock(page_table, ptl);
> 
> It would be nice to avoid ptl drop if ->pfn_mkwrite is not defined for the
> vma.

OK Yes, I will move the if (vma->vm_ops && vma->vm_ops->pfn_mkwrite)
to out here surrounding the unlock/lock

> 
>> +			ret = do_pfn_mkwrite(vma, address);
>> +			if (ret & VM_FAULT_ERROR)
>> +				return ret;
>> +			page_table = pte_offset_map_lock(mm, pmd, address,
>> +							 &ptl);
>> +			if (!pte_same(*page_table, orig_pte))
>> +				goto unlock;
>>  			goto reuse;
>> +		}
>>  		goto gotten;
>>  	}
>>  

Thank you Kirill, very much. I was hopping you'll have a look at this
see all the fine implications.

I will fix and send, after some hard testing.

Boaz

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/3] mm: New pfn_mkwrite same as page_mkwrite for VM_PFNMAP
  2015-03-23 12:49 ` [PATCH 1/3] mm: New pfn_mkwrite same as page_mkwrite for VM_PFNMAP Boaz Harrosh
@ 2015-03-23 22:49     ` Andrew Morton
  0 siblings, 0 replies; 16+ messages in thread
From: Andrew Morton @ 2015-03-23 22:49 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Dave Chinner, Matthew Wilcox, Kirill A. Shutemov, Jan Kara,
	Hugh Dickins, Mel Gorman, linux-mm, linux-nvdimm, linux-fsdevel,
	Eryu Guan

On Mon, 23 Mar 2015 14:49:32 +0200 Boaz Harrosh <boaz@plexistor.com> wrote:

> From: Yigal Korman <yigal@plexistor.com>
> 
> This will allow FS that uses VM_PFNMAP | VM_MIXEDMAP (no page structs)
> to get notified when access is a write to a read-only PFN.
> 
> This can happen if we mmap() a file then first mmap-read from it
> to page-in a read-only PFN, than we mmap-write to the same page.
> 
> We need this functionality to fix a DAX bug, where in the scenario
> above we fail to set ctime/mtime though we modified the file.
> An xfstest is attached to this patchset that shows the failure
> and the fix. (A DAX patch will follow)
> 
> This functionality is extra important for us, because upon
> dirtying of a pmem page we also want to RDMA the page to a
> remote cluster node.
> 
> We define a new pfn_mkwrite and do not reuse page_mkwrite because
>   1 - The name ;-)
>   2 - But mainly because it would take a very long and tedious
>       audit of all page_mkwrite functions of VM_MIXEDMAP/VM_PFNMAP
>       users. To make sure they do not now CRASH. For example current
>       DAX code (which this is for) would crash.
>       If we would want to reuse page_mkwrite, We will need to first
>       patch all users, so to not-crash-on-no-page. Then enable this
>       patch. But even if I did that I would not sleep so well at night.
>       Adding a new vector is the safest thing to do, and is not that
>       expensive. an extra pointer at a static function vector per driver.
>       Also the new vector is better for performance, because else we
>       Will call all current Kernel vectors, so to:
> 	check-ha-no-page-do-nothing and return.
> 
> No need to call it from do_shared_fault because do_wp_page is called to
> change pte permissions anyway.

Looks OK to me.

> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1982,6 +1982,22 @@ static int do_page_mkwrite(struct vm_area_struct *vma, struct page *page,
>  	return ret;
>  }
>  
> +static int do_pfn_mkwrite(struct vm_area_struct *vma, unsigned long address)
> +{
> +	struct vm_fault vmf;
> +
> +	if (!vma->vm_ops || !vma->vm_ops->pfn_mkwrite)
> +		return 0;
> +
> +	vmf.page = 0;
> +	vmf.pgoff = (((address & PAGE_MASK) - vma->vm_start) >> PAGE_SHIFT) +
> +			vma->vm_pgoff;
> +	vmf.virtual_address = (void __user *)(address & PAGE_MASK);
> +	vmf.flags = FAULT_FLAG_WRITE|FAULT_FLAG_MKWRITE;
> +
> +	return vma->vm_ops->pfn_mkwrite(vma, &vmf);
> +}

It might be a little neater to use

	if (vma->vm_ops && vma->vm_ops->pfn_mkwrite) {
		struct vm_fault vmf = {
			...
		};
		...
	}

>  /*
>   * This routine handles present pages, when users try to write
>   * to a shared page. It is done by copying the page to a new address
> @@ -2025,8 +2041,17 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  		 * accounting on raw pfn maps.
>  		 */
>  		if ((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
> -				     (VM_WRITE|VM_SHARED))
> +				     (VM_WRITE|VM_SHARED)) {
> +			pte_unmap_unlock(page_table, ptl);
> +			ret = do_pfn_mkwrite(vma, address);
> +			if (ret & VM_FAULT_ERROR)
> +				return ret;
> +			page_table = pte_offset_map_lock(mm, pmd, address,
> +							 &ptl);
> +			if (!pte_same(*page_table, orig_pte))
> +				goto unlock;
>  			goto reuse;
> +		}
>  		goto gotten;

There are significant pending changes in this area.  See linux-next,
or http://ozlabs.org/~akpm/mmots/broken-out/mm-refactor-*

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/3] mm: New pfn_mkwrite same as page_mkwrite for VM_PFNMAP
@ 2015-03-23 22:49     ` Andrew Morton
  0 siblings, 0 replies; 16+ messages in thread
From: Andrew Morton @ 2015-03-23 22:49 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Dave Chinner, Matthew Wilcox, Kirill A. Shutemov, Jan Kara,
	Hugh Dickins, Mel Gorman, linux-mm, linux-nvdimm, linux-fsdevel,
	Eryu Guan

On Mon, 23 Mar 2015 14:49:32 +0200 Boaz Harrosh <boaz@plexistor.com> wrote:

> From: Yigal Korman <yigal@plexistor.com>
> 
> This will allow FS that uses VM_PFNMAP | VM_MIXEDMAP (no page structs)
> to get notified when access is a write to a read-only PFN.
> 
> This can happen if we mmap() a file then first mmap-read from it
> to page-in a read-only PFN, than we mmap-write to the same page.
> 
> We need this functionality to fix a DAX bug, where in the scenario
> above we fail to set ctime/mtime though we modified the file.
> An xfstest is attached to this patchset that shows the failure
> and the fix. (A DAX patch will follow)
> 
> This functionality is extra important for us, because upon
> dirtying of a pmem page we also want to RDMA the page to a
> remote cluster node.
> 
> We define a new pfn_mkwrite and do not reuse page_mkwrite because
>   1 - The name ;-)
>   2 - But mainly because it would take a very long and tedious
>       audit of all page_mkwrite functions of VM_MIXEDMAP/VM_PFNMAP
>       users. To make sure they do not now CRASH. For example current
>       DAX code (which this is for) would crash.
>       If we would want to reuse page_mkwrite, We will need to first
>       patch all users, so to not-crash-on-no-page. Then enable this
>       patch. But even if I did that I would not sleep so well at night.
>       Adding a new vector is the safest thing to do, and is not that
>       expensive. an extra pointer at a static function vector per driver.
>       Also the new vector is better for performance, because else we
>       Will call all current Kernel vectors, so to:
> 	check-ha-no-page-do-nothing and return.
> 
> No need to call it from do_shared_fault because do_wp_page is called to
> change pte permissions anyway.

Looks OK to me.

> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1982,6 +1982,22 @@ static int do_page_mkwrite(struct vm_area_struct *vma, struct page *page,
>  	return ret;
>  }
>  
> +static int do_pfn_mkwrite(struct vm_area_struct *vma, unsigned long address)
> +{
> +	struct vm_fault vmf;
> +
> +	if (!vma->vm_ops || !vma->vm_ops->pfn_mkwrite)
> +		return 0;
> +
> +	vmf.page = 0;
> +	vmf.pgoff = (((address & PAGE_MASK) - vma->vm_start) >> PAGE_SHIFT) +
> +			vma->vm_pgoff;
> +	vmf.virtual_address = (void __user *)(address & PAGE_MASK);
> +	vmf.flags = FAULT_FLAG_WRITE|FAULT_FLAG_MKWRITE;
> +
> +	return vma->vm_ops->pfn_mkwrite(vma, &vmf);
> +}

It might be a little neater to use

	if (vma->vm_ops && vma->vm_ops->pfn_mkwrite) {
		struct vm_fault vmf = {
			...
		};
		...
	}

>  /*
>   * This routine handles present pages, when users try to write
>   * to a shared page. It is done by copying the page to a new address
> @@ -2025,8 +2041,17 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  		 * accounting on raw pfn maps.
>  		 */
>  		if ((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
> -				     (VM_WRITE|VM_SHARED))
> +				     (VM_WRITE|VM_SHARED)) {
> +			pte_unmap_unlock(page_table, ptl);
> +			ret = do_pfn_mkwrite(vma, address);
> +			if (ret & VM_FAULT_ERROR)
> +				return ret;
> +			page_table = pte_offset_map_lock(mm, pmd, address,
> +							 &ptl);
> +			if (!pte_same(*page_table, orig_pte))
> +				goto unlock;
>  			goto reuse;
> +		}
>  		goto gotten;

There are significant pending changes in this area.  See linux-next,
or http://ozlabs.org/~akpm/mmots/broken-out/mm-refactor-*

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH 1/3] mm: New pfn_mkwrite same as page_mkwrite for VM_PFNMAP
  2015-03-23 12:47 [PATCH 0/3 v3] dax: Fix mmap-write not updating c/mtime Boaz Harrosh
@ 2015-03-23 12:49 ` Boaz Harrosh
  2015-03-23 22:49     ` Andrew Morton
  0 siblings, 1 reply; 16+ messages in thread
From: Boaz Harrosh @ 2015-03-23 12:49 UTC (permalink / raw)
  To: Dave Chinner, Matthew Wilcox, Andrew Morton, Kirill A. Shutemov,
	Jan Kara, Hugh Dickins, Mel Gorman, linux-mm, linux-nvdimm,
	linux-fsdevel, Eryu Guan

From: Yigal Korman <yigal@plexistor.com>

This will allow FS that uses VM_PFNMAP | VM_MIXEDMAP (no page structs)
to get notified when access is a write to a read-only PFN.

This can happen if we mmap() a file then first mmap-read from it
to page-in a read-only PFN, than we mmap-write to the same page.

We need this functionality to fix a DAX bug, where in the scenario
above we fail to set ctime/mtime though we modified the file.
An xfstest is attached to this patchset that shows the failure
and the fix. (A DAX patch will follow)

This functionality is extra important for us, because upon
dirtying of a pmem page we also want to RDMA the page to a
remote cluster node.

We define a new pfn_mkwrite and do not reuse page_mkwrite because
  1 - The name ;-)
  2 - But mainly because it would take a very long and tedious
      audit of all page_mkwrite functions of VM_MIXEDMAP/VM_PFNMAP
      users. To make sure they do not now CRASH. For example current
      DAX code (which this is for) would crash.
      If we would want to reuse page_mkwrite, We will need to first
      patch all users, so to not-crash-on-no-page. Then enable this
      patch. But even if I did that I would not sleep so well at night.
      Adding a new vector is the safest thing to do, and is not that
      expensive. an extra pointer at a static function vector per driver.
      Also the new vector is better for performance, because else we
      Will call all current Kernel vectors, so to:
	check-ha-no-page-do-nothing and return.

No need to call it from do_shared_fault because do_wp_page is called to
change pte permissions anyway.

CC: Matthew Wilcox <matthew.r.wilcox@intel.com>
CC: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
CC: Jan Kara <jack@suse.cz>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Hugh Dickins <hughd@google.com>
CC: Mel Gorman <mgorman@suse.de>
CC: linux-mm@kvack.org

Signed-off-by: Yigal Korman <yigal@plexistor.com>
Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
---
 include/linux/mm.h |  2 ++
 mm/memory.c        | 27 ++++++++++++++++++++++++++-
 2 files changed, 28 insertions(+), 1 deletion(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 47a9392..1cd820c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -250,6 +250,8 @@ struct vm_operations_struct {
 	/* notification that a previously read-only page is about to become
 	 * writable, if an error is returned it will cause a SIGBUS */
 	int (*page_mkwrite)(struct vm_area_struct *vma, struct vm_fault *vmf);
+	/* same as page_mkwrite when using VM_PFNMAP|VM_MIXEDMAP */
+	int (*pfn_mkwrite)(struct vm_area_struct *vma, struct vm_fault *vmf);
 
 	/* called by access_process_vm when get_user_pages() fails, typically
 	 * for use by special VMAs that can switch between memory and hardware
diff --git a/mm/memory.c b/mm/memory.c
index 8068893..ebef8f6 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1982,6 +1982,22 @@ static int do_page_mkwrite(struct vm_area_struct *vma, struct page *page,
 	return ret;
 }
 
+static int do_pfn_mkwrite(struct vm_area_struct *vma, unsigned long address)
+{
+	struct vm_fault vmf;
+
+	if (!vma->vm_ops || !vma->vm_ops->pfn_mkwrite)
+		return 0;
+
+	vmf.page = 0;
+	vmf.pgoff = (((address & PAGE_MASK) - vma->vm_start) >> PAGE_SHIFT) +
+			vma->vm_pgoff;
+	vmf.virtual_address = (void __user *)(address & PAGE_MASK);
+	vmf.flags = FAULT_FLAG_WRITE|FAULT_FLAG_MKWRITE;
+
+	return vma->vm_ops->pfn_mkwrite(vma, &vmf);
+}
+
 /*
  * This routine handles present pages, when users try to write
  * to a shared page. It is done by copying the page to a new address
@@ -2025,8 +2041,17 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		 * accounting on raw pfn maps.
 		 */
 		if ((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
-				     (VM_WRITE|VM_SHARED))
+				     (VM_WRITE|VM_SHARED)) {
+			pte_unmap_unlock(page_table, ptl);
+			ret = do_pfn_mkwrite(vma, address);
+			if (ret & VM_FAULT_ERROR)
+				return ret;
+			page_table = pte_offset_map_lock(mm, pmd, address,
+							 &ptl);
+			if (!pte_same(*page_table, orig_pte))
+				goto unlock;
 			goto reuse;
+		}
 		goto gotten;
 	}
 
-- 
1.9.3


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2015-03-26  7:49 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-03-25 13:34 [PATCH 0/3 v4] dax: some dax fixes and cleanups Boaz Harrosh
2015-03-25 13:34 ` Boaz Harrosh
2015-03-25 13:38 ` [PATCH 1/3] mm: New pfn_mkwrite same as page_mkwrite for VM_PFNMAP Boaz Harrosh
2015-03-25 14:34   ` Kirill A. Shutemov
2015-03-26  7:49     ` Boaz Harrosh
2015-03-25 15:08   ` Dave Hansen
2015-03-25 15:13     ` Kirill A. Shutemov
2015-03-25 13:41 ` [PATCH 2/3] dax: pfn_mkwrite update c/mtime + freeze protection Boaz Harrosh
2015-03-25 13:41   ` Boaz Harrosh
2015-03-25 13:44 ` [PATCH 3/3] dax: Unify ext2/4_{dax,}_file_operations Boaz Harrosh
2015-03-25 13:44   ` Boaz Harrosh
2015-03-25 13:47 ` [FIXME] NOT-GOOD: dax: dax_prepare_freeze Boaz Harrosh
2015-03-25 13:47   ` Boaz Harrosh
  -- strict thread matches above, loose matches on Subject: below --
2015-03-23 12:47 [PATCH 0/3 v3] dax: Fix mmap-write not updating c/mtime Boaz Harrosh
2015-03-23 12:49 ` [PATCH 1/3] mm: New pfn_mkwrite same as page_mkwrite for VM_PFNMAP Boaz Harrosh
2015-03-23 22:49   ` Andrew Morton
2015-03-23 22:49     ` Andrew Morton

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.