All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/3 v5] dax: some dax fixes and cleanups
@ 2015-04-07  8:33 ` Boaz Harrosh
  0 siblings, 0 replies; 33+ messages in thread
From: Boaz Harrosh @ 2015-04-07  8:33 UTC (permalink / raw)
  To: Dave Chinner, Matthew Wilcox, Andrew Morton, Kirill A. Shutemov,
	Jan Kara, Hugh Dickins, Mel Gorman, linux-mm, linux-nvdimm,
	linux-fsdevel, Eryu Guan, Christoph Hellwig
  Cc: Stable Tree

Hi Andrew

I finally had the time to beat up these fixes based on linux-next/akpm
and it looks OK.
I'm sending the two fix patches with @stable + a patch-1 for the 4.0
Kernel. The 4.1-rc Kernel will need a different patch.

It is your call if you want these in stable. It is a breakage in the dax
code that went into 4.0. But I guess it will not have that many users right
at the get go. So feel free to remove the CC:@stable. (Also the old XIP that
this DAX changed had all the same problems)

[v5]
* A new patch-1 Based on linux-next/akpm branch because mm/memory.c
  completely changed there.
  Also a 4.0 version of the same patch-1 if needed for stable@

List of patches:
 [PATCH 1/3] mm(v4.1): New pfn_mkwrite same as page_mkwrite for VM_PFNMAP
 [PATCH 2/3] dax: use pfn_mkwrite to update c/mtime + freeze
 [PATCH 3/3] dax: Unify ext2/4_{dax,}_file_operations

	All these patches are based on linux-next/akpm. I'm not sure how
	it will interact with ext4-next though.

 [PATCH 1/3 @stable] mm(v4.0): New pfn_mkwrite same as page_mkwrite for VM_PFNMAP

	This patch is for 4.0 based tree if we decide to send
	[PATCH 2/3] to stable.


[v4] dax: some dax fixes and cleanups
* First patch fixed according to Andrew's comments. Thanks Andrew.
  1st and 2nd patch can go into current Kernel as they fix something
  that was merged this release.
* Added a new patch to fix up splice in the dax case, and cleanup.
  This one can wait for 4.1 (Also the first two not that anyone uses dax
  in production.)
* DAX freeze is not fixed yet. As we have more problems then I originally
  hoped for, as pointed out by Dave.
  (Just as a referance I'm sending a NO-GOOD additional patch to show what
   is not good enough to do. Was the RFC of [v3])
* Not re-posting the xfstest Dave please pick this up (It already found bugs
  in none dax FSs)

[v3] dax: Fix mmap-write not updating c/mtime
* I'm re-posting the two DAX patches that fix the mmap-write after read
  problem with DAX. (No changes since [v2])
* I'm also posting a 3rd RFC patch to address what Jan said about fs_freeze
  and making mapping read-only. 
  Jan Please review and see if this is what you meant.

[v2]
Jan Kara has pointed out that if we add the
sb_start/end_pagefault pair in the new pfn_mkwrite we
are then fixing another bug where: A user could start
writing to the page while filesystem is frozen.

[v1]
The main problem is that current mm/memory.c will no call us with page_mkwrite
if we do not have an actual page mapping, which is what DAX uses.
The solution presented here introduces a new pfn_mkwrite to solve this problem.
Please see patch-2 for details.

I've been running with this patch for 4 month both HW and VMs with no apparent
danger, but see patch-1 I played it safe.

I am also posting an xfstest 080 that demonstrate this problem, I believe
that also some git operations (can't remember which) suffer from this problem.
Actually Eryu Guan found that this test fails on some other FS as well.

Matthew hi
I would love to have your ACK on these patches?

Thanks
Boaz

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH 0/3 v5] dax: some dax fixes and cleanups
@ 2015-04-07  8:33 ` Boaz Harrosh
  0 siblings, 0 replies; 33+ messages in thread
From: Boaz Harrosh @ 2015-04-07  8:33 UTC (permalink / raw)
  To: Dave Chinner, Matthew Wilcox, Andrew Morton, Kirill A. Shutemov,
	Jan Kara, Hugh Dickins, Mel Gorman, linux-mm, linux-nvdimm,
	linux-fsdevel, Eryu Guan, Christoph Hellwig
  Cc: Stable Tree

Hi Andrew

I finally had the time to beat up these fixes based on linux-next/akpm
and it looks OK.
I'm sending the two fix patches with @stable + a patch-1 for the 4.0
Kernel. The 4.1-rc Kernel will need a different patch.

It is your call if you want these in stable. It is a breakage in the dax
code that went into 4.0. But I guess it will not have that many users right
at the get go. So feel free to remove the CC:@stable. (Also the old XIP that
this DAX changed had all the same problems)

[v5]
* A new patch-1 Based on linux-next/akpm branch because mm/memory.c
  completely changed there.
  Also a 4.0 version of the same patch-1 if needed for stable@

List of patches:
 [PATCH 1/3] mm(v4.1): New pfn_mkwrite same as page_mkwrite for VM_PFNMAP
 [PATCH 2/3] dax: use pfn_mkwrite to update c/mtime + freeze
 [PATCH 3/3] dax: Unify ext2/4_{dax,}_file_operations

	All these patches are based on linux-next/akpm. I'm not sure how
	it will interact with ext4-next though.

 [PATCH 1/3 @stable] mm(v4.0): New pfn_mkwrite same as page_mkwrite for VM_PFNMAP

	This patch is for 4.0 based tree if we decide to send
	[PATCH 2/3] to stable.


[v4] dax: some dax fixes and cleanups
* First patch fixed according to Andrew's comments. Thanks Andrew.
  1st and 2nd patch can go into current Kernel as they fix something
  that was merged this release.
* Added a new patch to fix up splice in the dax case, and cleanup.
  This one can wait for 4.1 (Also the first two not that anyone uses dax
  in production.)
* DAX freeze is not fixed yet. As we have more problems then I originally
  hoped for, as pointed out by Dave.
  (Just as a referance I'm sending a NO-GOOD additional patch to show what
   is not good enough to do. Was the RFC of [v3])
* Not re-posting the xfstest Dave please pick this up (It already found bugs
  in none dax FSs)

[v3] dax: Fix mmap-write not updating c/mtime
* I'm re-posting the two DAX patches that fix the mmap-write after read
  problem with DAX. (No changes since [v2])
* I'm also posting a 3rd RFC patch to address what Jan said about fs_freeze
  and making mapping read-only. 
  Jan Please review and see if this is what you meant.

[v2]
Jan Kara has pointed out that if we add the
sb_start/end_pagefault pair in the new pfn_mkwrite we
are then fixing another bug where: A user could start
writing to the page while filesystem is frozen.

[v1]
The main problem is that current mm/memory.c will no call us with page_mkwrite
if we do not have an actual page mapping, which is what DAX uses.
The solution presented here introduces a new pfn_mkwrite to solve this problem.
Please see patch-2 for details.

I've been running with this patch for 4 month both HW and VMs with no apparent
danger, but see patch-1 I played it safe.

I am also posting an xfstest 080 that demonstrate this problem, I believe
that also some git operations (can't remember which) suffer from this problem.
Actually Eryu Guan found that this test fails on some other FS as well.

Matthew hi
I would love to have your ACK on these patches?

Thanks
Boaz

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH 1/3] mm(v4.1): New pfn_mkwrite same as page_mkwrite for VM_PFNMAP
  2015-04-07  8:33 ` Boaz Harrosh
  (?)
@ 2015-04-07  8:40 ` Boaz Harrosh
  2015-04-07  8:51     ` Boaz Harrosh
                     ` (3 more replies)
  -1 siblings, 4 replies; 33+ messages in thread
From: Boaz Harrosh @ 2015-04-07  8:40 UTC (permalink / raw)
  To: Dave Chinner, Matthew Wilcox, Andrew Morton, Kirill A. Shutemov,
	Jan Kara, Hugh Dickins, Mel Gorman, linux-mm, linux-nvdimm,
	linux-fsdevel, Eryu Guan, Christoph Hellwig
  Cc: Stable Tree


[v2]
Based on linux-next/akpm [3dc4623]. For v4.1 merge window
Incorporated comments from Andrew And Kirill

[v1]
This will allow FS that uses VM_PFNMAP | VM_MIXEDMAP (no page structs)
to get notified when access is a write to a read-only PFN.

This can happen if we mmap() a file then first mmap-read from it
to page-in a read-only PFN, than we mmap-write to the same page.

We need this functionality to fix a DAX bug, where in the scenario
above we fail to set ctime/mtime though we modified the file.
An xfstest is attached to this patchset that shows the failure
and the fix. (A DAX patch will follow)

This functionality is extra important for us, because upon
dirtying of a pmem page we also want to RDMA the page to a
remote cluster node.

We define a new pfn_mkwrite and do not reuse page_mkwrite because
  1 - The name ;-)
  2 - But mainly because it would take a very long and tedious
      audit of all page_mkwrite functions of VM_MIXEDMAP/VM_PFNMAP
      users. To make sure they do not now CRASH. For example current
      DAX code (which this is for) would crash.
      If we would want to reuse page_mkwrite, We will need to first
      patch all users, so to not-crash-on-no-page. Then enable this
      patch. But even if I did that I would not sleep so well at night.
      Adding a new vector is the safest thing to do, and is not that
      expensive. an extra pointer at a static function vector per driver.
      Also the new vector is better for performance, because else we
      Will call all current Kernel vectors, so to:
	check-ha-no-page-do-nothing and return.

No need to call it from do_shared_fault because do_wp_page is called to
change pte permissions anyway.

CC: Matthew Wilcox <matthew.r.wilcox@intel.com>
CC: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
CC: Jan Kara <jack@suse.cz>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Hugh Dickins <hughd@google.com>
CC: Mel Gorman <mgorman@suse.de>
CC: linux-mm@kvack.org

Signed-off-by: Yigal Korman <yigal@plexistor.com>
Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
---
 include/linux/mm.h |  3 +++
 mm/memory.c        | 35 +++++++++++++++++++++++++++++++----
 2 files changed, 34 insertions(+), 4 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index d584b95..70c47f2 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -251,6 +251,9 @@ struct vm_operations_struct {
 	 * writable, if an error is returned it will cause a SIGBUS */
 	int (*page_mkwrite)(struct vm_area_struct *vma, struct vm_fault *vmf);
 
+	/* same as page_mkwrite when using VM_PFNMAP|VM_MIXEDMAP */
+	int (*pfn_mkwrite)(struct vm_area_struct *vma, struct vm_fault *vmf);
+
 	/* called by access_process_vm when get_user_pages() fails, typically
 	 * for use by special VMAs that can switch between memory and hardware
 	 */
diff --git a/mm/memory.c b/mm/memory.c
index 59f6268..6e8f3f6 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1982,6 +1982,19 @@ static int do_page_mkwrite(struct vm_area_struct *vma, struct page *page,
 	return ret;
 }
 
+static int do_pfn_mkwrite(struct vm_area_struct *vma, unsigned long address)
+{
+	struct vm_fault vmf = {
+		.page = 0,
+		.pgoff = (((address & PAGE_MASK) - vma->vm_start)
+					>> PAGE_SHIFT) + vma->vm_pgoff,
+		.virtual_address = (void __user *)(address & PAGE_MASK),
+		.flags = FAULT_FLAG_WRITE | FAULT_FLAG_MKWRITE,
+	};
+
+	return vma->vm_ops->pfn_mkwrite(vma, &vmf);
+}
+
 /*
  * Handle write page faults for pages that can be reused in the current vma
  *
@@ -2259,14 +2272,28 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		 * VM_PFNMAP VMA.
 		 *
 		 * We should not cow pages in a shared writeable mapping.
-		 * Just mark the pages writable as we can't do any dirty
-		 * accounting on raw pfn maps.
+		 * Just mark the pages writable and/or call ops->pfn_mkwrite.
 		 */
 		if ((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
-				     (VM_WRITE|VM_SHARED))
+				     (VM_WRITE|VM_SHARED)) {
+			if (vma->vm_ops && vma->vm_ops->pfn_mkwrite) {
+				int ret;
+
+				pte_unmap_unlock(page_table, ptl);
+				ret = do_pfn_mkwrite(vma, address);
+				if (ret & VM_FAULT_ERROR)
+					return ret;
+				page_table = pte_offset_map_lock(mm, pmd,
+								 address, &ptl);
+				/* Did pfn_mkwrite already fixed up the pte */
+				if (!pte_same(*page_table, orig_pte)) {
+					pte_unmap_unlock(page_table, ptl);
+					return ret;
+				}
+			}
 			return wp_page_reuse(mm, vma, address, page_table, ptl,
 					     orig_pte, old_page, 0, 0);
-
+		}
 		pte_unmap_unlock(page_table, ptl);
 		return wp_page_copy(mm, vma, address, page_table, pmd,
 				    orig_pte, old_page);
-- 
1.9.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 2/3] dax: use pfn_mkwrite to update c/mtime + freeze protection
  2015-04-07  8:33 ` Boaz Harrosh
@ 2015-04-07  8:43   ` Boaz Harrosh
  -1 siblings, 0 replies; 33+ messages in thread
From: Boaz Harrosh @ 2015-04-07  8:43 UTC (permalink / raw)
  To: Dave Chinner, Matthew Wilcox, Andrew Morton, Kirill A. Shutemov,
	Jan Kara, Hugh Dickins, Mel Gorman, linux-mm, linux-nvdimm,
	linux-fsdevel, Eryu Guan, Christoph Hellwig
  Cc: Stable Tree

From: Yigal Korman <yigal@plexistor.com>

[v1]
Without this patch, c/mtime is not updated correctly when mmap'ed page is
first read from and then written to.

A new xfstest is submitted for testing this (generic/080)

[v2]
Jan Kara has pointed out that if we add the
sb_start/end_pagefault pair in the new pfn_mkwrite we
are then fixing another bug where: A user could start
writing to the page while filesystem is frozen.

CC: Jan Kara <jack@suse.cz>
CC: Matthew Wilcox <matthew.r.wilcox@intel.com>
CC: Dave Chinner <david@fromorbit.com>
CC: Stable Tree <stable@vger.kernel.org>

Signed-off-by: Yigal Korman <yigal@plexistor.com>
Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
---
 fs/dax.c           | 17 +++++++++++++++++
 fs/ext2/file.c     |  1 +
 fs/ext4/file.c     |  1 +
 include/linux/fs.h |  1 +
 4 files changed, 20 insertions(+)

diff --git a/fs/dax.c b/fs/dax.c
index ed1619e..d0bd1f4 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -464,6 +464,23 @@ int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 EXPORT_SYMBOL_GPL(dax_fault);
 
 /**
+ * dax_pfn_mkwrite - handle first write to DAX page
+ * @vma: The virtual memory area where the fault occurred
+ * @vmf: The description of the fault
+ *
+ */
+int dax_pfn_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+	struct super_block *sb = file_inode(vma->vm_file)->i_sb;
+
+	sb_start_pagefault(sb);
+	file_update_time(vma->vm_file);
+	sb_end_pagefault(sb);
+	return VM_FAULT_NOPAGE;
+}
+EXPORT_SYMBOL_GPL(dax_pfn_mkwrite);
+
+/**
  * dax_zero_page_range - zero a range within a page of a DAX file
  * @inode: The file being truncated
  * @from: The file offset that is being truncated to
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index e317017..866a3ce 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -39,6 +39,7 @@ static int ext2_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 static const struct vm_operations_struct ext2_dax_vm_ops = {
 	.fault		= ext2_dax_fault,
 	.page_mkwrite	= ext2_dax_mkwrite,
+	.pfn_mkwrite	= dax_pfn_mkwrite,
 };
 
 static int ext2_file_mmap(struct file *file, struct vm_area_struct *vma)
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 598abbb..aa78c70 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -206,6 +206,7 @@ static int ext4_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 static const struct vm_operations_struct ext4_dax_vm_ops = {
 	.fault		= ext4_dax_fault,
 	.page_mkwrite	= ext4_dax_mkwrite,
+	.pfn_mkwrite	= dax_pfn_mkwrite,
 };
 #else
 #define ext4_dax_vm_ops	ext4_file_vm_ops
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 368e349..394035f 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2628,6 +2628,7 @@ int dax_clear_blocks(struct inode *, sector_t block, long size);
 int dax_zero_page_range(struct inode *, loff_t from, unsigned len, get_block_t);
 int dax_truncate_page(struct inode *, loff_t from, get_block_t);
 int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t);
+int dax_pfn_mkwrite(struct vm_area_struct *, struct vm_fault *);
 #define dax_mkwrite(vma, vmf, gb)	dax_fault(vma, vmf, gb)
 
 #ifdef CONFIG_BLOCK
-- 
1.9.3



^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 2/3] dax: use pfn_mkwrite to update c/mtime + freeze protection
@ 2015-04-07  8:43   ` Boaz Harrosh
  0 siblings, 0 replies; 33+ messages in thread
From: Boaz Harrosh @ 2015-04-07  8:43 UTC (permalink / raw)
  To: Dave Chinner, Matthew Wilcox, Andrew Morton, Kirill A. Shutemov,
	Jan Kara, Hugh Dickins, Mel Gorman, linux-mm, linux-nvdimm,
	linux-fsdevel, Eryu Guan, Christoph Hellwig
  Cc: Stable Tree

From: Yigal Korman <yigal@plexistor.com>

[v1]
Without this patch, c/mtime is not updated correctly when mmap'ed page is
first read from and then written to.

A new xfstest is submitted for testing this (generic/080)

[v2]
Jan Kara has pointed out that if we add the
sb_start/end_pagefault pair in the new pfn_mkwrite we
are then fixing another bug where: A user could start
writing to the page while filesystem is frozen.

CC: Jan Kara <jack@suse.cz>
CC: Matthew Wilcox <matthew.r.wilcox@intel.com>
CC: Dave Chinner <david@fromorbit.com>
CC: Stable Tree <stable@vger.kernel.org>

Signed-off-by: Yigal Korman <yigal@plexistor.com>
Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
---
 fs/dax.c           | 17 +++++++++++++++++
 fs/ext2/file.c     |  1 +
 fs/ext4/file.c     |  1 +
 include/linux/fs.h |  1 +
 4 files changed, 20 insertions(+)

diff --git a/fs/dax.c b/fs/dax.c
index ed1619e..d0bd1f4 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -464,6 +464,23 @@ int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 EXPORT_SYMBOL_GPL(dax_fault);
 
 /**
+ * dax_pfn_mkwrite - handle first write to DAX page
+ * @vma: The virtual memory area where the fault occurred
+ * @vmf: The description of the fault
+ *
+ */
+int dax_pfn_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+	struct super_block *sb = file_inode(vma->vm_file)->i_sb;
+
+	sb_start_pagefault(sb);
+	file_update_time(vma->vm_file);
+	sb_end_pagefault(sb);
+	return VM_FAULT_NOPAGE;
+}
+EXPORT_SYMBOL_GPL(dax_pfn_mkwrite);
+
+/**
  * dax_zero_page_range - zero a range within a page of a DAX file
  * @inode: The file being truncated
  * @from: The file offset that is being truncated to
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index e317017..866a3ce 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -39,6 +39,7 @@ static int ext2_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 static const struct vm_operations_struct ext2_dax_vm_ops = {
 	.fault		= ext2_dax_fault,
 	.page_mkwrite	= ext2_dax_mkwrite,
+	.pfn_mkwrite	= dax_pfn_mkwrite,
 };
 
 static int ext2_file_mmap(struct file *file, struct vm_area_struct *vma)
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 598abbb..aa78c70 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -206,6 +206,7 @@ static int ext4_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 static const struct vm_operations_struct ext4_dax_vm_ops = {
 	.fault		= ext4_dax_fault,
 	.page_mkwrite	= ext4_dax_mkwrite,
+	.pfn_mkwrite	= dax_pfn_mkwrite,
 };
 #else
 #define ext4_dax_vm_ops	ext4_file_vm_ops
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 368e349..394035f 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2628,6 +2628,7 @@ int dax_clear_blocks(struct inode *, sector_t block, long size);
 int dax_zero_page_range(struct inode *, loff_t from, unsigned len, get_block_t);
 int dax_truncate_page(struct inode *, loff_t from, get_block_t);
 int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t);
+int dax_pfn_mkwrite(struct vm_area_struct *, struct vm_fault *);
 #define dax_mkwrite(vma, vmf, gb)	dax_fault(vma, vmf, gb)
 
 #ifdef CONFIG_BLOCK
-- 
1.9.3


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 3/3] dax: Unify ext2/4_{dax,}_file_operations
  2015-04-07  8:33 ` Boaz Harrosh
@ 2015-04-07  8:45   ` Boaz Harrosh
  -1 siblings, 0 replies; 33+ messages in thread
From: Boaz Harrosh @ 2015-04-07  8:45 UTC (permalink / raw)
  To: Dave Chinner, Matthew Wilcox, Andrew Morton, Kirill A. Shutemov,
	Jan Kara, Hugh Dickins, Mel Gorman, linux-mm, linux-nvdimm,
	linux-fsdevel, Eryu Guan, Christoph Hellwig
  Cc: Stable Tree


The original dax patchset split the ext2/4_file_operations
because of the two NULL splice_read/splice_write in the dax
case.

At vfs if splice_read/splice_write are NULL would then call
default_splice_read/write.

What we do here is make generic_file_splice_read aware of
IS_DAX() so the original ext2/4_file_operations can be used
as is.

For write it appears that iter_file_splice_write is just fine.
It uses the regular f_op->write(file,..) or new_sync_write(file, ...).

CC: Dave Chinner <dchinner@redhat.com>
CC: Matthew Wilcox <willy@linux.intel.com>
Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
---
 fs/ext2/ext2.h  |  1 -
 fs/ext2/file.c  | 18 ------------------
 fs/ext2/inode.c |  5 +----
 fs/ext2/namei.c | 10 ++--------
 fs/ext4/ext4.h  |  1 -
 fs/ext4/file.c  | 20 --------------------
 fs/ext4/inode.c |  5 +----
 fs/ext4/namei.c | 10 ++--------
 fs/splice.c     |  3 +++
 9 files changed, 9 insertions(+), 64 deletions(-)

diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h
index 678f9ab..8d15feb 100644
--- a/fs/ext2/ext2.h
+++ b/fs/ext2/ext2.h
@@ -793,7 +793,6 @@ extern int ext2_fsync(struct file *file, loff_t start, loff_t end,
 		      int datasync);
 extern const struct inode_operations ext2_file_inode_operations;
 extern const struct file_operations ext2_file_operations;
-extern const struct file_operations ext2_dax_file_operations;
 
 /* inode.c */
 extern const struct address_space_operations ext2_aops;
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index 866a3ce..19cac93 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -109,24 +109,6 @@ const struct file_operations ext2_file_operations = {
 	.splice_write	= iter_file_splice_write,
 };
 
-#ifdef CONFIG_FS_DAX
-const struct file_operations ext2_dax_file_operations = {
-	.llseek		= generic_file_llseek,
-	.read		= new_sync_read,
-	.write		= new_sync_write,
-	.read_iter	= generic_file_read_iter,
-	.write_iter	= generic_file_write_iter,
-	.unlocked_ioctl = ext2_ioctl,
-#ifdef CONFIG_COMPAT
-	.compat_ioctl	= ext2_compat_ioctl,
-#endif
-	.mmap		= ext2_file_mmap,
-	.open		= dquot_file_open,
-	.release	= ext2_release_file,
-	.fsync		= ext2_fsync,
-};
-#endif
-
 const struct inode_operations ext2_file_inode_operations = {
 #ifdef CONFIG_EXT2_FS_XATTR
 	.setxattr	= generic_setxattr,
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index df9d6af..b29eb67 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -1388,10 +1388,7 @@ struct inode *ext2_iget (struct super_block *sb, unsigned long ino)
 
 	if (S_ISREG(inode->i_mode)) {
 		inode->i_op = &ext2_file_inode_operations;
-		if (test_opt(inode->i_sb, DAX)) {
-			inode->i_mapping->a_ops = &ext2_aops;
-			inode->i_fop = &ext2_dax_file_operations;
-		} else if (test_opt(inode->i_sb, NOBH)) {
+		if (test_opt(inode->i_sb, NOBH)) {
 			inode->i_mapping->a_ops = &ext2_nobh_aops;
 			inode->i_fop = &ext2_file_operations;
 		} else {
diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c
index 148f6e3..ce42293 100644
--- a/fs/ext2/namei.c
+++ b/fs/ext2/namei.c
@@ -104,10 +104,7 @@ static int ext2_create (struct inode * dir, struct dentry * dentry, umode_t mode
 		return PTR_ERR(inode);
 
 	inode->i_op = &ext2_file_inode_operations;
-	if (test_opt(inode->i_sb, DAX)) {
-		inode->i_mapping->a_ops = &ext2_aops;
-		inode->i_fop = &ext2_dax_file_operations;
-	} else if (test_opt(inode->i_sb, NOBH)) {
+	if (test_opt(inode->i_sb, NOBH)) {
 		inode->i_mapping->a_ops = &ext2_nobh_aops;
 		inode->i_fop = &ext2_file_operations;
 	} else {
@@ -125,10 +122,7 @@ static int ext2_tmpfile(struct inode *dir, struct dentry *dentry, umode_t mode)
 		return PTR_ERR(inode);
 
 	inode->i_op = &ext2_file_inode_operations;
-	if (test_opt(inode->i_sb, DAX)) {
-		inode->i_mapping->a_ops = &ext2_aops;
-		inode->i_fop = &ext2_dax_file_operations;
-	} else if (test_opt(inode->i_sb, NOBH)) {
+	if (test_opt(inode->i_sb, NOBH)) {
 		inode->i_mapping->a_ops = &ext2_nobh_aops;
 		inode->i_fop = &ext2_file_operations;
 	} else {
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index f63c3d5..8a3981e 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2593,7 +2593,6 @@ extern const struct file_operations ext4_dir_operations;
 /* file.c */
 extern const struct inode_operations ext4_file_inode_operations;
 extern const struct file_operations ext4_file_operations;
-extern const struct file_operations ext4_dax_file_operations;
 extern loff_t ext4_llseek(struct file *file, loff_t offset, int origin);
 
 /* inline.c */
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index aa78c70..e6d4280 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -625,26 +625,6 @@ const struct file_operations ext4_file_operations = {
 	.fallocate	= ext4_fallocate,
 };
 
-#ifdef CONFIG_FS_DAX
-const struct file_operations ext4_dax_file_operations = {
-	.llseek		= ext4_llseek,
-	.read		= new_sync_read,
-	.write		= new_sync_write,
-	.read_iter	= generic_file_read_iter,
-	.write_iter	= ext4_file_write_iter,
-	.unlocked_ioctl = ext4_ioctl,
-#ifdef CONFIG_COMPAT
-	.compat_ioctl	= ext4_compat_ioctl,
-#endif
-	.mmap		= ext4_file_mmap,
-	.open		= ext4_file_open,
-	.release	= ext4_release_file,
-	.fsync		= ext4_sync_file,
-	/* Splice not yet supported with DAX */
-	.fallocate	= ext4_fallocate,
-};
-#endif
-
 const struct inode_operations ext4_file_inode_operations = {
 	.setattr	= ext4_setattr,
 	.getattr	= ext4_getattr,
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index a3f4513..035b7a0 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4090,10 +4090,7 @@ struct inode *ext4_iget(struct super_block *sb, unsigned long ino)
 
 	if (S_ISREG(inode->i_mode)) {
 		inode->i_op = &ext4_file_inode_operations;
-		if (test_opt(inode->i_sb, DAX))
-			inode->i_fop = &ext4_dax_file_operations;
-		else
-			inode->i_fop = &ext4_file_operations;
+		inode->i_fop = &ext4_file_operations;
 		ext4_set_aops(inode);
 	} else if (S_ISDIR(inode->i_mode)) {
 		inode->i_op = &ext4_dir_inode_operations;
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 28fe71a..2291923 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -2235,10 +2235,7 @@ retry:
 	err = PTR_ERR(inode);
 	if (!IS_ERR(inode)) {
 		inode->i_op = &ext4_file_inode_operations;
-		if (test_opt(inode->i_sb, DAX))
-			inode->i_fop = &ext4_dax_file_operations;
-		else
-			inode->i_fop = &ext4_file_operations;
+		inode->i_fop = &ext4_file_operations;
 		ext4_set_aops(inode);
 		err = ext4_add_nondir(handle, dentry, inode);
 		if (!err && IS_DIRSYNC(dir))
@@ -2302,10 +2299,7 @@ retry:
 	err = PTR_ERR(inode);
 	if (!IS_ERR(inode)) {
 		inode->i_op = &ext4_file_inode_operations;
-		if (test_opt(inode->i_sb, DAX))
-			inode->i_fop = &ext4_dax_file_operations;
-		else
-			inode->i_fop = &ext4_file_operations;
+		inode->i_fop = &ext4_file_operations;
 		ext4_set_aops(inode);
 		d_tmpfile(dentry, inode);
 		err = ext4_orphan_add(handle, inode);
diff --git a/fs/splice.c b/fs/splice.c
index 41cbb16..476024b 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -523,6 +523,9 @@ ssize_t generic_file_splice_read(struct file *in, loff_t *ppos,
 	loff_t isize, left;
 	int ret;
 
+	if (IS_DAX(in->f_mapping->host))
+		return default_file_splice_read(in, ppos, pipe, len, flags);
+
 	isize = i_size_read(in->f_mapping->host);
 	if (unlikely(*ppos >= isize))
 		return 0;
-- 
1.9.3



^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 3/3] dax: Unify ext2/4_{dax,}_file_operations
@ 2015-04-07  8:45   ` Boaz Harrosh
  0 siblings, 0 replies; 33+ messages in thread
From: Boaz Harrosh @ 2015-04-07  8:45 UTC (permalink / raw)
  To: Dave Chinner, Matthew Wilcox, Andrew Morton, Kirill A. Shutemov,
	Jan Kara, Hugh Dickins, Mel Gorman, linux-mm, linux-nvdimm,
	linux-fsdevel, Eryu Guan, Christoph Hellwig
  Cc: Stable Tree


The original dax patchset split the ext2/4_file_operations
because of the two NULL splice_read/splice_write in the dax
case.

At vfs if splice_read/splice_write are NULL would then call
default_splice_read/write.

What we do here is make generic_file_splice_read aware of
IS_DAX() so the original ext2/4_file_operations can be used
as is.

For write it appears that iter_file_splice_write is just fine.
It uses the regular f_op->write(file,..) or new_sync_write(file, ...).

CC: Dave Chinner <dchinner@redhat.com>
CC: Matthew Wilcox <willy@linux.intel.com>
Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
---
 fs/ext2/ext2.h  |  1 -
 fs/ext2/file.c  | 18 ------------------
 fs/ext2/inode.c |  5 +----
 fs/ext2/namei.c | 10 ++--------
 fs/ext4/ext4.h  |  1 -
 fs/ext4/file.c  | 20 --------------------
 fs/ext4/inode.c |  5 +----
 fs/ext4/namei.c | 10 ++--------
 fs/splice.c     |  3 +++
 9 files changed, 9 insertions(+), 64 deletions(-)

diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h
index 678f9ab..8d15feb 100644
--- a/fs/ext2/ext2.h
+++ b/fs/ext2/ext2.h
@@ -793,7 +793,6 @@ extern int ext2_fsync(struct file *file, loff_t start, loff_t end,
 		      int datasync);
 extern const struct inode_operations ext2_file_inode_operations;
 extern const struct file_operations ext2_file_operations;
-extern const struct file_operations ext2_dax_file_operations;
 
 /* inode.c */
 extern const struct address_space_operations ext2_aops;
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index 866a3ce..19cac93 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -109,24 +109,6 @@ const struct file_operations ext2_file_operations = {
 	.splice_write	= iter_file_splice_write,
 };
 
-#ifdef CONFIG_FS_DAX
-const struct file_operations ext2_dax_file_operations = {
-	.llseek		= generic_file_llseek,
-	.read		= new_sync_read,
-	.write		= new_sync_write,
-	.read_iter	= generic_file_read_iter,
-	.write_iter	= generic_file_write_iter,
-	.unlocked_ioctl = ext2_ioctl,
-#ifdef CONFIG_COMPAT
-	.compat_ioctl	= ext2_compat_ioctl,
-#endif
-	.mmap		= ext2_file_mmap,
-	.open		= dquot_file_open,
-	.release	= ext2_release_file,
-	.fsync		= ext2_fsync,
-};
-#endif
-
 const struct inode_operations ext2_file_inode_operations = {
 #ifdef CONFIG_EXT2_FS_XATTR
 	.setxattr	= generic_setxattr,
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index df9d6af..b29eb67 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -1388,10 +1388,7 @@ struct inode *ext2_iget (struct super_block *sb, unsigned long ino)
 
 	if (S_ISREG(inode->i_mode)) {
 		inode->i_op = &ext2_file_inode_operations;
-		if (test_opt(inode->i_sb, DAX)) {
-			inode->i_mapping->a_ops = &ext2_aops;
-			inode->i_fop = &ext2_dax_file_operations;
-		} else if (test_opt(inode->i_sb, NOBH)) {
+		if (test_opt(inode->i_sb, NOBH)) {
 			inode->i_mapping->a_ops = &ext2_nobh_aops;
 			inode->i_fop = &ext2_file_operations;
 		} else {
diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c
index 148f6e3..ce42293 100644
--- a/fs/ext2/namei.c
+++ b/fs/ext2/namei.c
@@ -104,10 +104,7 @@ static int ext2_create (struct inode * dir, struct dentry * dentry, umode_t mode
 		return PTR_ERR(inode);
 
 	inode->i_op = &ext2_file_inode_operations;
-	if (test_opt(inode->i_sb, DAX)) {
-		inode->i_mapping->a_ops = &ext2_aops;
-		inode->i_fop = &ext2_dax_file_operations;
-	} else if (test_opt(inode->i_sb, NOBH)) {
+	if (test_opt(inode->i_sb, NOBH)) {
 		inode->i_mapping->a_ops = &ext2_nobh_aops;
 		inode->i_fop = &ext2_file_operations;
 	} else {
@@ -125,10 +122,7 @@ static int ext2_tmpfile(struct inode *dir, struct dentry *dentry, umode_t mode)
 		return PTR_ERR(inode);
 
 	inode->i_op = &ext2_file_inode_operations;
-	if (test_opt(inode->i_sb, DAX)) {
-		inode->i_mapping->a_ops = &ext2_aops;
-		inode->i_fop = &ext2_dax_file_operations;
-	} else if (test_opt(inode->i_sb, NOBH)) {
+	if (test_opt(inode->i_sb, NOBH)) {
 		inode->i_mapping->a_ops = &ext2_nobh_aops;
 		inode->i_fop = &ext2_file_operations;
 	} else {
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index f63c3d5..8a3981e 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2593,7 +2593,6 @@ extern const struct file_operations ext4_dir_operations;
 /* file.c */
 extern const struct inode_operations ext4_file_inode_operations;
 extern const struct file_operations ext4_file_operations;
-extern const struct file_operations ext4_dax_file_operations;
 extern loff_t ext4_llseek(struct file *file, loff_t offset, int origin);
 
 /* inline.c */
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index aa78c70..e6d4280 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -625,26 +625,6 @@ const struct file_operations ext4_file_operations = {
 	.fallocate	= ext4_fallocate,
 };
 
-#ifdef CONFIG_FS_DAX
-const struct file_operations ext4_dax_file_operations = {
-	.llseek		= ext4_llseek,
-	.read		= new_sync_read,
-	.write		= new_sync_write,
-	.read_iter	= generic_file_read_iter,
-	.write_iter	= ext4_file_write_iter,
-	.unlocked_ioctl = ext4_ioctl,
-#ifdef CONFIG_COMPAT
-	.compat_ioctl	= ext4_compat_ioctl,
-#endif
-	.mmap		= ext4_file_mmap,
-	.open		= ext4_file_open,
-	.release	= ext4_release_file,
-	.fsync		= ext4_sync_file,
-	/* Splice not yet supported with DAX */
-	.fallocate	= ext4_fallocate,
-};
-#endif
-
 const struct inode_operations ext4_file_inode_operations = {
 	.setattr	= ext4_setattr,
 	.getattr	= ext4_getattr,
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index a3f4513..035b7a0 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4090,10 +4090,7 @@ struct inode *ext4_iget(struct super_block *sb, unsigned long ino)
 
 	if (S_ISREG(inode->i_mode)) {
 		inode->i_op = &ext4_file_inode_operations;
-		if (test_opt(inode->i_sb, DAX))
-			inode->i_fop = &ext4_dax_file_operations;
-		else
-			inode->i_fop = &ext4_file_operations;
+		inode->i_fop = &ext4_file_operations;
 		ext4_set_aops(inode);
 	} else if (S_ISDIR(inode->i_mode)) {
 		inode->i_op = &ext4_dir_inode_operations;
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 28fe71a..2291923 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -2235,10 +2235,7 @@ retry:
 	err = PTR_ERR(inode);
 	if (!IS_ERR(inode)) {
 		inode->i_op = &ext4_file_inode_operations;
-		if (test_opt(inode->i_sb, DAX))
-			inode->i_fop = &ext4_dax_file_operations;
-		else
-			inode->i_fop = &ext4_file_operations;
+		inode->i_fop = &ext4_file_operations;
 		ext4_set_aops(inode);
 		err = ext4_add_nondir(handle, dentry, inode);
 		if (!err && IS_DIRSYNC(dir))
@@ -2302,10 +2299,7 @@ retry:
 	err = PTR_ERR(inode);
 	if (!IS_ERR(inode)) {
 		inode->i_op = &ext4_file_inode_operations;
-		if (test_opt(inode->i_sb, DAX))
-			inode->i_fop = &ext4_dax_file_operations;
-		else
-			inode->i_fop = &ext4_file_operations;
+		inode->i_fop = &ext4_file_operations;
 		ext4_set_aops(inode);
 		d_tmpfile(dentry, inode);
 		err = ext4_orphan_add(handle, inode);
diff --git a/fs/splice.c b/fs/splice.c
index 41cbb16..476024b 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -523,6 +523,9 @@ ssize_t generic_file_splice_read(struct file *in, loff_t *ppos,
 	loff_t isize, left;
 	int ret;
 
+	if (IS_DAX(in->f_mapping->host))
+		return default_file_splice_read(in, ppos, pipe, len, flags);
+
 	isize = i_size_read(in->f_mapping->host);
 	if (unlikely(*ppos >= isize))
 		return 0;
-- 
1.9.3


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH 1/3] mm(v4.1): New pfn_mkwrite same as page_mkwrite for VM_PFNMAP
  2015-04-07  8:40 ` [PATCH 1/3] mm(v4.1): New pfn_mkwrite same as page_mkwrite for VM_PFNMAP Boaz Harrosh
@ 2015-04-07  8:51     ` Boaz Harrosh
  2015-04-07  9:03   ` Kirill A. Shutemov
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 33+ messages in thread
From: Boaz Harrosh @ 2015-04-07  8:51 UTC (permalink / raw)
  To: Boaz Harrosh, Dave Chinner, Matthew Wilcox, Andrew Morton,
	Kirill A. Shutemov, Jan Kara, Hugh Dickins, Mel Gorman, linux-mm,
	linux-nvdimm, linux-fsdevel, Eryu Guan, Christoph Hellwig
  Cc: Stable Tree

On 04/07/2015 11:40 AM, Boaz Harrosh wrote:
> 

Crap this is the wrong version I have a [v3] with some more
of Kirill's comments fixes.

Will resend

Sorry for the noise
Boaz

> [v2]
> Based on linux-next/akpm [3dc4623]. For v4.1 merge window
> Incorporated comments from Andrew And Kirill
> 
> [v1]
> This will allow FS that uses VM_PFNMAP | VM_MIXEDMAP (no page structs)
> to get notified when access is a write to a read-only PFN.
> 
> This can happen if we mmap() a file then first mmap-read from it
> to page-in a read-only PFN, than we mmap-write to the same page.
> 
> We need this functionality to fix a DAX bug, where in the scenario
> above we fail to set ctime/mtime though we modified the file.
> An xfstest is attached to this patchset that shows the failure
> and the fix. (A DAX patch will follow)
> 
> This functionality is extra important for us, because upon
> dirtying of a pmem page we also want to RDMA the page to a
> remote cluster node.
> 
> We define a new pfn_mkwrite and do not reuse page_mkwrite because
>   1 - The name ;-)
>   2 - But mainly because it would take a very long and tedious
>       audit of all page_mkwrite functions of VM_MIXEDMAP/VM_PFNMAP
>       users. To make sure they do not now CRASH. For example current
>       DAX code (which this is for) would crash.
>       If we would want to reuse page_mkwrite, We will need to first
>       patch all users, so to not-crash-on-no-page. Then enable this
>       patch. But even if I did that I would not sleep so well at night.
>       Adding a new vector is the safest thing to do, and is not that
>       expensive. an extra pointer at a static function vector per driver.
>       Also the new vector is better for performance, because else we
>       Will call all current Kernel vectors, so to:
> 	check-ha-no-page-do-nothing and return.
> 
> No need to call it from do_shared_fault because do_wp_page is called to
> change pte permissions anyway.
> 
> CC: Matthew Wilcox <matthew.r.wilcox@intel.com>
> CC: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> CC: Jan Kara <jack@suse.cz>
> CC: Andrew Morton <akpm@linux-foundation.org>
> CC: Hugh Dickins <hughd@google.com>
> CC: Mel Gorman <mgorman@suse.de>
> CC: linux-mm@kvack.org
> 
> Signed-off-by: Yigal Korman <yigal@plexistor.com>
> Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
> ---
>  include/linux/mm.h |  3 +++
>  mm/memory.c        | 35 +++++++++++++++++++++++++++++++----
>  2 files changed, 34 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index d584b95..70c47f2 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -251,6 +251,9 @@ struct vm_operations_struct {
>  	 * writable, if an error is returned it will cause a SIGBUS */
>  	int (*page_mkwrite)(struct vm_area_struct *vma, struct vm_fault *vmf);
>  
> +	/* same as page_mkwrite when using VM_PFNMAP|VM_MIXEDMAP */
> +	int (*pfn_mkwrite)(struct vm_area_struct *vma, struct vm_fault *vmf);
> +
>  	/* called by access_process_vm when get_user_pages() fails, typically
>  	 * for use by special VMAs that can switch between memory and hardware
>  	 */
> diff --git a/mm/memory.c b/mm/memory.c
> index 59f6268..6e8f3f6 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1982,6 +1982,19 @@ static int do_page_mkwrite(struct vm_area_struct *vma, struct page *page,
>  	return ret;
>  }
>  
> +static int do_pfn_mkwrite(struct vm_area_struct *vma, unsigned long address)
> +{
> +	struct vm_fault vmf = {
> +		.page = 0,
> +		.pgoff = (((address & PAGE_MASK) - vma->vm_start)
> +					>> PAGE_SHIFT) + vma->vm_pgoff,
> +		.virtual_address = (void __user *)(address & PAGE_MASK),
> +		.flags = FAULT_FLAG_WRITE | FAULT_FLAG_MKWRITE,
> +	};
> +
> +	return vma->vm_ops->pfn_mkwrite(vma, &vmf);
> +}
> +
>  /*
>   * Handle write page faults for pages that can be reused in the current vma
>   *
> @@ -2259,14 +2272,28 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  		 * VM_PFNMAP VMA.
>  		 *
>  		 * We should not cow pages in a shared writeable mapping.
> -		 * Just mark the pages writable as we can't do any dirty
> -		 * accounting on raw pfn maps.
> +		 * Just mark the pages writable and/or call ops->pfn_mkwrite.
>  		 */
>  		if ((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
> -				     (VM_WRITE|VM_SHARED))
> +				     (VM_WRITE|VM_SHARED)) {
> +			if (vma->vm_ops && vma->vm_ops->pfn_mkwrite) {
> +				int ret;
> +
> +				pte_unmap_unlock(page_table, ptl);
> +				ret = do_pfn_mkwrite(vma, address);
> +				if (ret & VM_FAULT_ERROR)
> +					return ret;
> +				page_table = pte_offset_map_lock(mm, pmd,
> +								 address, &ptl);
> +				/* Did pfn_mkwrite already fixed up the pte */
> +				if (!pte_same(*page_table, orig_pte)) {
> +					pte_unmap_unlock(page_table, ptl);
> +					return ret;
> +				}
> +			}
>  			return wp_page_reuse(mm, vma, address, page_table, ptl,
>  					     orig_pte, old_page, 0, 0);
> -
> +		}
>  		pte_unmap_unlock(page_table, ptl);
>  		return wp_page_copy(mm, vma, address, page_table, pmd,
>  				    orig_pte, old_page);
> 


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 1/3] mm(v4.1): New pfn_mkwrite same as page_mkwrite for VM_PFNMAP
@ 2015-04-07  8:51     ` Boaz Harrosh
  0 siblings, 0 replies; 33+ messages in thread
From: Boaz Harrosh @ 2015-04-07  8:51 UTC (permalink / raw)
  To: Boaz Harrosh, Dave Chinner, Matthew Wilcox, Andrew Morton,
	Kirill A. Shutemov, Jan Kara, Hugh Dickins, Mel Gorman, linux-mm,
	linux-nvdimm, linux-fsdevel, Eryu Guan, Christoph Hellwig
  Cc: Stable Tree

On 04/07/2015 11:40 AM, Boaz Harrosh wrote:
> 

Crap this is the wrong version I have a [v3] with some more
of Kirill's comments fixes.

Will resend

Sorry for the noise
Boaz

> [v2]
> Based on linux-next/akpm [3dc4623]. For v4.1 merge window
> Incorporated comments from Andrew And Kirill
> 
> [v1]
> This will allow FS that uses VM_PFNMAP | VM_MIXEDMAP (no page structs)
> to get notified when access is a write to a read-only PFN.
> 
> This can happen if we mmap() a file then first mmap-read from it
> to page-in a read-only PFN, than we mmap-write to the same page.
> 
> We need this functionality to fix a DAX bug, where in the scenario
> above we fail to set ctime/mtime though we modified the file.
> An xfstest is attached to this patchset that shows the failure
> and the fix. (A DAX patch will follow)
> 
> This functionality is extra important for us, because upon
> dirtying of a pmem page we also want to RDMA the page to a
> remote cluster node.
> 
> We define a new pfn_mkwrite and do not reuse page_mkwrite because
>   1 - The name ;-)
>   2 - But mainly because it would take a very long and tedious
>       audit of all page_mkwrite functions of VM_MIXEDMAP/VM_PFNMAP
>       users. To make sure they do not now CRASH. For example current
>       DAX code (which this is for) would crash.
>       If we would want to reuse page_mkwrite, We will need to first
>       patch all users, so to not-crash-on-no-page. Then enable this
>       patch. But even if I did that I would not sleep so well at night.
>       Adding a new vector is the safest thing to do, and is not that
>       expensive. an extra pointer at a static function vector per driver.
>       Also the new vector is better for performance, because else we
>       Will call all current Kernel vectors, so to:
> 	check-ha-no-page-do-nothing and return.
> 
> No need to call it from do_shared_fault because do_wp_page is called to
> change pte permissions anyway.
> 
> CC: Matthew Wilcox <matthew.r.wilcox@intel.com>
> CC: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> CC: Jan Kara <jack@suse.cz>
> CC: Andrew Morton <akpm@linux-foundation.org>
> CC: Hugh Dickins <hughd@google.com>
> CC: Mel Gorman <mgorman@suse.de>
> CC: linux-mm@kvack.org
> 
> Signed-off-by: Yigal Korman <yigal@plexistor.com>
> Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
> ---
>  include/linux/mm.h |  3 +++
>  mm/memory.c        | 35 +++++++++++++++++++++++++++++++----
>  2 files changed, 34 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index d584b95..70c47f2 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -251,6 +251,9 @@ struct vm_operations_struct {
>  	 * writable, if an error is returned it will cause a SIGBUS */
>  	int (*page_mkwrite)(struct vm_area_struct *vma, struct vm_fault *vmf);
>  
> +	/* same as page_mkwrite when using VM_PFNMAP|VM_MIXEDMAP */
> +	int (*pfn_mkwrite)(struct vm_area_struct *vma, struct vm_fault *vmf);
> +
>  	/* called by access_process_vm when get_user_pages() fails, typically
>  	 * for use by special VMAs that can switch between memory and hardware
>  	 */
> diff --git a/mm/memory.c b/mm/memory.c
> index 59f6268..6e8f3f6 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1982,6 +1982,19 @@ static int do_page_mkwrite(struct vm_area_struct *vma, struct page *page,
>  	return ret;
>  }
>  
> +static int do_pfn_mkwrite(struct vm_area_struct *vma, unsigned long address)
> +{
> +	struct vm_fault vmf = {
> +		.page = 0,
> +		.pgoff = (((address & PAGE_MASK) - vma->vm_start)
> +					>> PAGE_SHIFT) + vma->vm_pgoff,
> +		.virtual_address = (void __user *)(address & PAGE_MASK),
> +		.flags = FAULT_FLAG_WRITE | FAULT_FLAG_MKWRITE,
> +	};
> +
> +	return vma->vm_ops->pfn_mkwrite(vma, &vmf);
> +}
> +
>  /*
>   * Handle write page faults for pages that can be reused in the current vma
>   *
> @@ -2259,14 +2272,28 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  		 * VM_PFNMAP VMA.
>  		 *
>  		 * We should not cow pages in a shared writeable mapping.
> -		 * Just mark the pages writable as we can't do any dirty
> -		 * accounting on raw pfn maps.
> +		 * Just mark the pages writable and/or call ops->pfn_mkwrite.
>  		 */
>  		if ((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
> -				     (VM_WRITE|VM_SHARED))
> +				     (VM_WRITE|VM_SHARED)) {
> +			if (vma->vm_ops && vma->vm_ops->pfn_mkwrite) {
> +				int ret;
> +
> +				pte_unmap_unlock(page_table, ptl);
> +				ret = do_pfn_mkwrite(vma, address);
> +				if (ret & VM_FAULT_ERROR)
> +					return ret;
> +				page_table = pte_offset_map_lock(mm, pmd,
> +								 address, &ptl);
> +				/* Did pfn_mkwrite already fixed up the pte */
> +				if (!pte_same(*page_table, orig_pte)) {
> +					pte_unmap_unlock(page_table, ptl);
> +					return ret;
> +				}
> +			}
>  			return wp_page_reuse(mm, vma, address, page_table, ptl,
>  					     orig_pte, old_page, 0, 0);
> -
> +		}
>  		pte_unmap_unlock(page_table, ptl);
>  		return wp_page_copy(mm, vma, address, page_table, pmd,
>  				    orig_pte, old_page);
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 1/3] mm(v4.1): New pfn_mkwrite same as page_mkwrite for VM_PFNMAP
  2015-04-07  8:40 ` [PATCH 1/3] mm(v4.1): New pfn_mkwrite same as page_mkwrite for VM_PFNMAP Boaz Harrosh
  2015-04-07  8:51     ` Boaz Harrosh
@ 2015-04-07  9:03   ` Kirill A. Shutemov
  2015-04-07  9:22     ` Boaz Harrosh
  2015-04-07 12:57     ` Boaz Harrosh
  2015-04-07 14:06     ` Boaz Harrosh
  3 siblings, 1 reply; 33+ messages in thread
From: Kirill A. Shutemov @ 2015-04-07  9:03 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Dave Chinner, Matthew Wilcox, Andrew Morton, Kirill A. Shutemov,
	Jan Kara, Hugh Dickins, Mel Gorman, linux-mm, linux-nvdimm,
	linux-fsdevel, Eryu Guan, Christoph Hellwig, Stable Tree

On Tue, Apr 07, 2015 at 11:40:06AM +0300, Boaz Harrosh wrote:
> 
> [v2]
> Based on linux-next/akpm [3dc4623]. For v4.1 merge window
> Incorporated comments from Andrew And Kirill

Not really. You've ignored most of them. See below.

> [v1]
> This will allow FS that uses VM_PFNMAP | VM_MIXEDMAP (no page structs)
> to get notified when access is a write to a read-only PFN.
> 
> This can happen if we mmap() a file then first mmap-read from it
> to page-in a read-only PFN, than we mmap-write to the same page.
> 
> We need this functionality to fix a DAX bug, where in the scenario
> above we fail to set ctime/mtime though we modified the file.
> An xfstest is attached to this patchset that shows the failure
> and the fix. (A DAX patch will follow)
> 
> This functionality is extra important for us, because upon
> dirtying of a pmem page we also want to RDMA the page to a
> remote cluster node.
> 
> We define a new pfn_mkwrite and do not reuse page_mkwrite because
>   1 - The name ;-)
>   2 - But mainly because it would take a very long and tedious
>       audit of all page_mkwrite functions of VM_MIXEDMAP/VM_PFNMAP
>       users. To make sure they do not now CRASH. For example current
>       DAX code (which this is for) would crash.
>       If we would want to reuse page_mkwrite, We will need to first
>       patch all users, so to not-crash-on-no-page. Then enable this
>       patch. But even if I did that I would not sleep so well at night.
>       Adding a new vector is the safest thing to do, and is not that
>       expensive. an extra pointer at a static function vector per driver.
>       Also the new vector is better for performance, because else we
>       Will call all current Kernel vectors, so to:
> 	check-ha-no-page-do-nothing and return.
> 
> No need to call it from do_shared_fault because do_wp_page is called to
> change pte permissions anyway.
> 
> CC: Matthew Wilcox <matthew.r.wilcox@intel.com>
> CC: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> CC: Jan Kara <jack@suse.cz>
> CC: Andrew Morton <akpm@linux-foundation.org>
> CC: Hugh Dickins <hughd@google.com>
> CC: Mel Gorman <mgorman@suse.de>
> CC: linux-mm@kvack.org
> 
> Signed-off-by: Yigal Korman <yigal@plexistor.com>
> Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
> ---
>  include/linux/mm.h |  3 +++
>  mm/memory.c        | 35 +++++++++++++++++++++++++++++++----

Please, document it in Documentation/filesystems/Locking.

>  2 files changed, 34 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index d584b95..70c47f2 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -251,6 +251,9 @@ struct vm_operations_struct {
>  	 * writable, if an error is returned it will cause a SIGBUS */
>  	int (*page_mkwrite)(struct vm_area_struct *vma, struct vm_fault *vmf);
>  
> +	/* same as page_mkwrite when using VM_PFNMAP|VM_MIXEDMAP */
> +	int (*pfn_mkwrite)(struct vm_area_struct *vma, struct vm_fault *vmf);
> +
>  	/* called by access_process_vm when get_user_pages() fails, typically
>  	 * for use by special VMAs that can switch between memory and hardware
>  	 */
> diff --git a/mm/memory.c b/mm/memory.c
> index 59f6268..6e8f3f6 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1982,6 +1982,19 @@ static int do_page_mkwrite(struct vm_area_struct *vma, struct page *page,
>  	return ret;
>  }
>  
> +static int do_pfn_mkwrite(struct vm_area_struct *vma, unsigned long address)
> +{
> +	struct vm_fault vmf = {
> +		.page = 0,

.page = NULL,

> +		.pgoff = (((address & PAGE_MASK) - vma->vm_start)
> +					>> PAGE_SHIFT) + vma->vm_pgoff,

.pgoff = linear_page_index(vma, address),

> +		.virtual_address = (void __user *)(address & PAGE_MASK),
> +		.flags = FAULT_FLAG_WRITE | FAULT_FLAG_MKWRITE,
> +	};
> +
> +	return vma->vm_ops->pfn_mkwrite(vma, &vmf);
> +}
> +
>  /*
>   * Handle write page faults for pages that can be reused in the current vma
>   *
> @@ -2259,14 +2272,28 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  		 * VM_PFNMAP VMA.
>  		 *
>  		 * We should not cow pages in a shared writeable mapping.
> -		 * Just mark the pages writable as we can't do any dirty
> -		 * accounting on raw pfn maps.
> +		 * Just mark the pages writable and/or call ops->pfn_mkwrite.
>  		 */
>  		if ((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
> -				     (VM_WRITE|VM_SHARED))
> +				     (VM_WRITE|VM_SHARED)) {

Let's move this case in separate function -- wp_pfn_shared(). As we do for
wp_page_shared().

> +			if (vma->vm_ops && vma->vm_ops->pfn_mkwrite) {
> +				int ret;
> +
> +				pte_unmap_unlock(page_table, ptl);
> +				ret = do_pfn_mkwrite(vma, address);
> +				if (ret & VM_FAULT_ERROR)
> +					return ret;
> +				page_table = pte_offset_map_lock(mm, pmd,
> +								 address, &ptl);
> +				/* Did pfn_mkwrite already fixed up the pte */
> +				if (!pte_same(*page_table, orig_pte)) {
> +					pte_unmap_unlock(page_table, ptl);
> +					return ret;
> +				}
> +			}
>  			return wp_page_reuse(mm, vma, address, page_table, ptl,
>  					     orig_pte, old_page, 0, 0);
> -
> +		}
>  		pte_unmap_unlock(page_table, ptl);
>  		return wp_page_copy(mm, vma, address, page_table, pmd,
>  				    orig_pte, old_page);
-- 
 Kirill A. Shutemov

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 1/3] mm(v4.1): New pfn_mkwrite same as page_mkwrite for VM_PFNMAP
  2015-04-07  9:03   ` Kirill A. Shutemov
@ 2015-04-07  9:22     ` Boaz Harrosh
  0 siblings, 0 replies; 33+ messages in thread
From: Boaz Harrosh @ 2015-04-07  9:22 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Dave Chinner, Matthew Wilcox, Andrew Morton, Kirill A. Shutemov,
	Jan Kara, Hugh Dickins, Mel Gorman, linux-mm, linux-nvdimm,
	linux-fsdevel, Eryu Guan, Christoph Hellwig, Stable Tree

On 04/07/2015 12:03 PM, Kirill A. Shutemov wrote:
> On Tue, Apr 07, 2015 at 11:40:06AM +0300, Boaz Harrosh wrote:
>>
>> [v2]
>> Based on linux-next/akpm [3dc4623]. For v4.1 merge window
>> Incorporated comments from Andrew And Kirill
> 
> Not really. You've ignored most of them. See below.
> 

Yes sorry about that I sent the wrong version.

<>
>> ---
>>  include/linux/mm.h |  3 +++
>>  mm/memory.c        | 35 +++++++++++++++++++++++++++++++----
> 
> Please, document it in Documentation/filesystems/Locking.
> 

Ha, I missed this one. Ok will try to put something coherent.

<>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 59f6268..6e8f3f6 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -1982,6 +1982,19 @@ static int do_page_mkwrite(struct vm_area_struct *vma, struct page *page,
>>  	return ret;
>>  }
>>  
>> +static int do_pfn_mkwrite(struct vm_area_struct *vma, unsigned long address)
>> +{
>> +	struct vm_fault vmf = {
>> +		.page = 0,
> 
> .page = NULL,
> 
>> +		.pgoff = (((address & PAGE_MASK) - vma->vm_start)
>> +					>> PAGE_SHIFT) + vma->vm_pgoff,
> 
> .pgoff = linear_page_index(vma, address),
> 

Yes I had fixes for these two

>> +		.virtual_address = (void __user *)(address & PAGE_MASK),
>> +		.flags = FAULT_FLAG_WRITE | FAULT_FLAG_MKWRITE,
>> +	};
>> +
>> +	return vma->vm_ops->pfn_mkwrite(vma, &vmf);
>> +}
>> +
>>  /*
>>   * Handle write page faults for pages that can be reused in the current vma
>>   *
>> @@ -2259,14 +2272,28 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
>>  		 * VM_PFNMAP VMA.
>>  		 *
>>  		 * We should not cow pages in a shared writeable mapping.
>> -		 * Just mark the pages writable as we can't do any dirty
>> -		 * accounting on raw pfn maps.
>> +		 * Just mark the pages writable and/or call ops->pfn_mkwrite.
>>  		 */
>>  		if ((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
>> -				     (VM_WRITE|VM_SHARED))
>> +				     (VM_WRITE|VM_SHARED)) {
> 
> Let's move this case in separate function -- wp_pfn_shared(). As we do for
> wp_page_shared().
> 

Ha, OK I will try that. I will need to re-run tests to make sure I did
not mess up

Thanks will fix, makes sense
Boaz

>> +			if (vma->vm_ops && vma->vm_ops->pfn_mkwrite) {
>> +				int ret;
>> +
>> +				pte_unmap_unlock(page_table, ptl);
>> +				ret = do_pfn_mkwrite(vma, address);
>> +				if (ret & VM_FAULT_ERROR)
>> +					return ret;
>> +				page_table = pte_offset_map_lock(mm, pmd,
>> +								 address, &ptl);
>> +				/* Did pfn_mkwrite already fixed up the pte */
>> +				if (!pte_same(*page_table, orig_pte)) {
>> +					pte_unmap_unlock(page_table, ptl);
>> +					return ret;
>> +				}
>> +			}
>>  			return wp_page_reuse(mm, vma, address, page_table, ptl,
>>  					     orig_pte, old_page, 0, 0);
>> -
>> +		}
>>  		pte_unmap_unlock(page_table, ptl);
>>  		return wp_page_copy(mm, vma, address, page_table, pmd,
>>  				    orig_pte, old_page);

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH 1/3 v6] mm(v4.1): New pfn_mkwrite same as page_mkwrite for VM_PFNMAP
  2015-04-07  8:40 ` [PATCH 1/3] mm(v4.1): New pfn_mkwrite same as page_mkwrite for VM_PFNMAP Boaz Harrosh
@ 2015-04-07 12:57     ` Boaz Harrosh
  2015-04-07  9:03   ` Kirill A. Shutemov
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 33+ messages in thread
From: Boaz Harrosh @ 2015-04-07 12:57 UTC (permalink / raw)
  To: Dave Chinner, Matthew Wilcox, Andrew Morton, Kirill A. Shutemov,
	Jan Kara, Hugh Dickins, Mel Gorman, linux-mm, linux-nvdimm,
	linux-fsdevel, Eryu Guan, Christoph Hellwig
  Cc: Stable Tree


[v4]
Kirill's comments about splitting out a new wp_pfn_shared().
Add Documentation/filesystems/Locking text about pfn_mkwrite.

[v3]
Kirill's comments about use of linear_page_index()

[v2]
Based on linux-next/akpm [3dc4623]. For v4.1 merge window
Incorporated comments from Andrew And Kirill

[v1]
This will allow FS that uses VM_PFNMAP | VM_MIXEDMAP (no page structs)
to get notified when access is a write to a read-only PFN.

This can happen if we mmap() a file then first mmap-read from it
to page-in a read-only PFN, than we mmap-write to the same page.

We need this functionality to fix a DAX bug, where in the scenario
above we fail to set ctime/mtime though we modified the file.
An xfstest is attached to this patchset that shows the failure
and the fix. (A DAX patch will follow)

This functionality is extra important for us, because upon
dirtying of a pmem page we also want to RDMA the page to a
remote cluster node.

We define a new pfn_mkwrite and do not reuse page_mkwrite because
  1 - The name ;-)
  2 - But mainly because it would take a very long and tedious
      audit of all page_mkwrite functions of VM_MIXEDMAP/VM_PFNMAP
      users. To make sure they do not now CRASH. For example current
      DAX code (which this is for) would crash.
      If we would want to reuse page_mkwrite, We will need to first
      patch all users, so to not-crash-on-no-page. Then enable this
      patch. But even if I did that I would not sleep so well at night.
      Adding a new vector is the safest thing to do, and is not that
      expensive. an extra pointer at a static function vector per driver.
      Also the new vector is better for performance, because else we
      Will call all current Kernel vectors, so to:
	check-ha-no-page-do-nothing and return.

No need to call it from do_shared_fault because do_wp_page is called to
change pte permissions anyway.

CC: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
CC: Matthew Wilcox <matthew.r.wilcox@intel.com>
CC: Jan Kara <jack@suse.cz>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Hugh Dickins <hughd@google.com>
CC: Mel Gorman <mgorman@suse.de>
CC: linux-mm@kvack.org

Signed-off-by: Yigal Korman <yigal@plexistor.com>
Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
---
 Documentation/filesystems/Locking |  8 ++++++++
 include/linux/mm.h                |  3 +++
 mm/memory.c                       | 40 +++++++++++++++++++++++++++++++++++----
 3 files changed, 47 insertions(+), 4 deletions(-)

diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
index f91926f..25f36e6 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -525,6 +525,7 @@ prototypes:
 	void (*close)(struct vm_area_struct*);
 	int (*fault)(struct vm_area_struct*, struct vm_fault *);
 	int (*page_mkwrite)(struct vm_area_struct *, struct vm_fault *);
+	int (*pfn_mkwrite)(struct vm_area_struct *, struct vm_fault *);
 	int (*access)(struct vm_area_struct *, unsigned long, void*, int, int);
 
 locking rules:
@@ -534,6 +535,7 @@ close:		yes
 fault:		yes		can return with page locked
 map_pages:	yes
 page_mkwrite:	yes		can return with page locked
+pfn_mkwrite:	yes
 access:		yes
 
 	->fault() is called when a previously not present pte is about
@@ -560,6 +562,12 @@ the page has been truncated, the filesystem should not look up a new page
 like the ->fault() handler, but simply return with VM_FAULT_NOPAGE, which
 will cause the VM to retry the fault.
 
+	->pfn_mkwrite() is the same as page_mkwrite but when the pte is
+VM_PFNMAP or VM_MIXEDMAP with a page-less entry. Expected return is
+VM_FAULT_NOPAGE. Or one of the VM_FAULT_ERROR types. The default behavior
+after this call is to make the pte read-write, unless pfn_mkwrite()
+already touched the pte, in that case it is untouched.
+
 	->access() is called when get_user_pages() fails in
 access_process_vm(), typically used to debug a process through
 /proc/pid/mem or ptrace.  This function is needed only for
diff --git a/include/linux/mm.h b/include/linux/mm.h
index d584b95..70c47f2 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -251,6 +251,9 @@ struct vm_operations_struct {
 	 * writable, if an error is returned it will cause a SIGBUS */
 	int (*page_mkwrite)(struct vm_area_struct *vma, struct vm_fault *vmf);
 
+	/* same as page_mkwrite when using VM_PFNMAP|VM_MIXEDMAP */
+	int (*pfn_mkwrite)(struct vm_area_struct *vma, struct vm_fault *vmf);
+
 	/* called by access_process_vm when get_user_pages() fails, typically
 	 * for use by special VMAs that can switch between memory and hardware
 	 */
diff --git a/mm/memory.c b/mm/memory.c
index 59f6268..67b1ee8 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2181,6 +2181,39 @@ oom:
 	return VM_FAULT_OOM;
 }
 
+/*
+ * Handle write page faults for VM_MIXEDMAP or VM_PFNMAP for a VM_SHARED
+ * mapping
+ */
+static int wp_pfn_shared(struct mm_struct *mm,
+			struct vm_area_struct *vma, unsigned long address,
+			pte_t *page_table, spinlock_t *ptl, pte_t orig_pte,
+			pmd_t *pmd)
+{
+	if (vma->vm_ops && vma->vm_ops->pfn_mkwrite) {
+		struct vm_fault vmf = {
+			.page = NULL,
+			.pgoff = linear_page_index(vma, address),
+			.virtual_address = (void __user *)(address & PAGE_MASK),
+			.flags = FAULT_FLAG_WRITE | FAULT_FLAG_MKWRITE,
+		};
+		int ret;
+
+		pte_unmap_unlock(page_table, ptl);
+		ret = vma->vm_ops->pfn_mkwrite(vma, &vmf);
+		if (ret & VM_FAULT_ERROR)
+			return ret;
+		page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
+		/* Did pfn_mkwrite already fixed up the pte */
+		if (!pte_same(*page_table, orig_pte)) {
+			pte_unmap_unlock(page_table, ptl);
+			return ret;
+		}
+	}
+	return wp_page_reuse(mm, vma, address, page_table, ptl, orig_pte,
+			     NULL, 0, 0);
+}
+
 static int wp_page_shared(struct mm_struct *mm, struct vm_area_struct *vma,
 			  unsigned long address, pte_t *page_table,
 			  pmd_t *pmd, spinlock_t *ptl, pte_t orig_pte,
@@ -2259,13 +2292,12 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		 * VM_PFNMAP VMA.
 		 *
 		 * We should not cow pages in a shared writeable mapping.
-		 * Just mark the pages writable as we can't do any dirty
-		 * accounting on raw pfn maps.
+		 * Just mark the pages writable and/or call ops->pfn_mkwrite.
 		 */
 		if ((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
 				     (VM_WRITE|VM_SHARED))
-			return wp_page_reuse(mm, vma, address, page_table, ptl,
-					     orig_pte, old_page, 0, 0);
+			return wp_pfn_shared(mm, vma, address, page_table, ptl,
+					     orig_pte, pmd);
 
 		pte_unmap_unlock(page_table, ptl);
 		return wp_page_copy(mm, vma, address, page_table, pmd,
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 1/3 v6] mm(v4.1): New pfn_mkwrite same as page_mkwrite for VM_PFNMAP
@ 2015-04-07 12:57     ` Boaz Harrosh
  0 siblings, 0 replies; 33+ messages in thread
From: Boaz Harrosh @ 2015-04-07 12:57 UTC (permalink / raw)
  To: Dave Chinner, Matthew Wilcox, Andrew Morton, Kirill A. Shutemov,
	Jan Kara, Hugh Dickins, Mel Gorman, linux-mm, linux-nvdimm,
	linux-fsdevel, Eryu Guan, Christoph Hellwig
  Cc: Stable Tree


[v4]
Kirill's comments about splitting out a new wp_pfn_shared().
Add Documentation/filesystems/Locking text about pfn_mkwrite.

[v3]
Kirill's comments about use of linear_page_index()

[v2]
Based on linux-next/akpm [3dc4623]. For v4.1 merge window
Incorporated comments from Andrew And Kirill

[v1]
This will allow FS that uses VM_PFNMAP | VM_MIXEDMAP (no page structs)
to get notified when access is a write to a read-only PFN.

This can happen if we mmap() a file then first mmap-read from it
to page-in a read-only PFN, than we mmap-write to the same page.

We need this functionality to fix a DAX bug, where in the scenario
above we fail to set ctime/mtime though we modified the file.
An xfstest is attached to this patchset that shows the failure
and the fix. (A DAX patch will follow)

This functionality is extra important for us, because upon
dirtying of a pmem page we also want to RDMA the page to a
remote cluster node.

We define a new pfn_mkwrite and do not reuse page_mkwrite because
  1 - The name ;-)
  2 - But mainly because it would take a very long and tedious
      audit of all page_mkwrite functions of VM_MIXEDMAP/VM_PFNMAP
      users. To make sure they do not now CRASH. For example current
      DAX code (which this is for) would crash.
      If we would want to reuse page_mkwrite, We will need to first
      patch all users, so to not-crash-on-no-page. Then enable this
      patch. But even if I did that I would not sleep so well at night.
      Adding a new vector is the safest thing to do, and is not that
      expensive. an extra pointer at a static function vector per driver.
      Also the new vector is better for performance, because else we
      Will call all current Kernel vectors, so to:
	check-ha-no-page-do-nothing and return.

No need to call it from do_shared_fault because do_wp_page is called to
change pte permissions anyway.

CC: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
CC: Matthew Wilcox <matthew.r.wilcox@intel.com>
CC: Jan Kara <jack@suse.cz>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Hugh Dickins <hughd@google.com>
CC: Mel Gorman <mgorman@suse.de>
CC: linux-mm@kvack.org

Signed-off-by: Yigal Korman <yigal@plexistor.com>
Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
---
 Documentation/filesystems/Locking |  8 ++++++++
 include/linux/mm.h                |  3 +++
 mm/memory.c                       | 40 +++++++++++++++++++++++++++++++++++----
 3 files changed, 47 insertions(+), 4 deletions(-)

diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
index f91926f..25f36e6 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -525,6 +525,7 @@ prototypes:
 	void (*close)(struct vm_area_struct*);
 	int (*fault)(struct vm_area_struct*, struct vm_fault *);
 	int (*page_mkwrite)(struct vm_area_struct *, struct vm_fault *);
+	int (*pfn_mkwrite)(struct vm_area_struct *, struct vm_fault *);
 	int (*access)(struct vm_area_struct *, unsigned long, void*, int, int);
 
 locking rules:
@@ -534,6 +535,7 @@ close:		yes
 fault:		yes		can return with page locked
 map_pages:	yes
 page_mkwrite:	yes		can return with page locked
+pfn_mkwrite:	yes
 access:		yes
 
 	->fault() is called when a previously not present pte is about
@@ -560,6 +562,12 @@ the page has been truncated, the filesystem should not look up a new page
 like the ->fault() handler, but simply return with VM_FAULT_NOPAGE, which
 will cause the VM to retry the fault.
 
+	->pfn_mkwrite() is the same as page_mkwrite but when the pte is
+VM_PFNMAP or VM_MIXEDMAP with a page-less entry. Expected return is
+VM_FAULT_NOPAGE. Or one of the VM_FAULT_ERROR types. The default behavior
+after this call is to make the pte read-write, unless pfn_mkwrite()
+already touched the pte, in that case it is untouched.
+
 	->access() is called when get_user_pages() fails in
 access_process_vm(), typically used to debug a process through
 /proc/pid/mem or ptrace.  This function is needed only for
diff --git a/include/linux/mm.h b/include/linux/mm.h
index d584b95..70c47f2 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -251,6 +251,9 @@ struct vm_operations_struct {
 	 * writable, if an error is returned it will cause a SIGBUS */
 	int (*page_mkwrite)(struct vm_area_struct *vma, struct vm_fault *vmf);
 
+	/* same as page_mkwrite when using VM_PFNMAP|VM_MIXEDMAP */
+	int (*pfn_mkwrite)(struct vm_area_struct *vma, struct vm_fault *vmf);
+
 	/* called by access_process_vm when get_user_pages() fails, typically
 	 * for use by special VMAs that can switch between memory and hardware
 	 */
diff --git a/mm/memory.c b/mm/memory.c
index 59f6268..67b1ee8 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2181,6 +2181,39 @@ oom:
 	return VM_FAULT_OOM;
 }
 
+/*
+ * Handle write page faults for VM_MIXEDMAP or VM_PFNMAP for a VM_SHARED
+ * mapping
+ */
+static int wp_pfn_shared(struct mm_struct *mm,
+			struct vm_area_struct *vma, unsigned long address,
+			pte_t *page_table, spinlock_t *ptl, pte_t orig_pte,
+			pmd_t *pmd)
+{
+	if (vma->vm_ops && vma->vm_ops->pfn_mkwrite) {
+		struct vm_fault vmf = {
+			.page = NULL,
+			.pgoff = linear_page_index(vma, address),
+			.virtual_address = (void __user *)(address & PAGE_MASK),
+			.flags = FAULT_FLAG_WRITE | FAULT_FLAG_MKWRITE,
+		};
+		int ret;
+
+		pte_unmap_unlock(page_table, ptl);
+		ret = vma->vm_ops->pfn_mkwrite(vma, &vmf);
+		if (ret & VM_FAULT_ERROR)
+			return ret;
+		page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
+		/* Did pfn_mkwrite already fixed up the pte */
+		if (!pte_same(*page_table, orig_pte)) {
+			pte_unmap_unlock(page_table, ptl);
+			return ret;
+		}
+	}
+	return wp_page_reuse(mm, vma, address, page_table, ptl, orig_pte,
+			     NULL, 0, 0);
+}
+
 static int wp_page_shared(struct mm_struct *mm, struct vm_area_struct *vma,
 			  unsigned long address, pte_t *page_table,
 			  pmd_t *pmd, spinlock_t *ptl, pte_t orig_pte,
@@ -2259,13 +2292,12 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		 * VM_PFNMAP VMA.
 		 *
 		 * We should not cow pages in a shared writeable mapping.
-		 * Just mark the pages writable as we can't do any dirty
-		 * accounting on raw pfn maps.
+		 * Just mark the pages writable and/or call ops->pfn_mkwrite.
 		 */
 		if ((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
 				     (VM_WRITE|VM_SHARED))
-			return wp_page_reuse(mm, vma, address, page_table, ptl,
-					     orig_pte, old_page, 0, 0);
+			return wp_pfn_shared(mm, vma, address, page_table, ptl,
+					     orig_pte, pmd);
 
 		pte_unmap_unlock(page_table, ptl);
 		return wp_page_copy(mm, vma, address, page_table, pmd,
-- 
1.9.3


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH 1/3 v6] mm(v4.1): New pfn_mkwrite same as page_mkwrite for VM_PFNMAP
  2015-04-07 12:57     ` Boaz Harrosh
  (?)
@ 2015-04-07 13:17     ` Kirill A. Shutemov
  2015-04-07 13:26       ` Kirill A. Shutemov
  -1 siblings, 1 reply; 33+ messages in thread
From: Kirill A. Shutemov @ 2015-04-07 13:17 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Dave Chinner, Matthew Wilcox, Andrew Morton, Kirill A. Shutemov,
	Jan Kara, Hugh Dickins, Mel Gorman, linux-mm, linux-nvdimm,
	linux-fsdevel, Eryu Guan, Christoph Hellwig, Stable Tree

On Tue, Apr 07, 2015 at 03:57:32PM +0300, Boaz Harrosh wrote:
> 
> [v4]
> Kirill's comments about splitting out a new wp_pfn_shared().
> Add Documentation/filesystems/Locking text about pfn_mkwrite.
> 
> [v3]
> Kirill's comments about use of linear_page_index()
> 
> [v2]
> Based on linux-next/akpm [3dc4623]. For v4.1 merge window
> Incorporated comments from Andrew And Kirill
> 
> [v1]
> This will allow FS that uses VM_PFNMAP | VM_MIXEDMAP (no page structs)
> to get notified when access is a write to a read-only PFN.
> 
> This can happen if we mmap() a file then first mmap-read from it
> to page-in a read-only PFN, than we mmap-write to the same page.
> 
> We need this functionality to fix a DAX bug, where in the scenario
> above we fail to set ctime/mtime though we modified the file.
> An xfstest is attached to this patchset that shows the failure
> and the fix. (A DAX patch will follow)
> 
> This functionality is extra important for us, because upon
> dirtying of a pmem page we also want to RDMA the page to a
> remote cluster node.
> 
> We define a new pfn_mkwrite and do not reuse page_mkwrite because
>   1 - The name ;-)
>   2 - But mainly because it would take a very long and tedious
>       audit of all page_mkwrite functions of VM_MIXEDMAP/VM_PFNMAP
>       users. To make sure they do not now CRASH. For example current
>       DAX code (which this is for) would crash.
>       If we would want to reuse page_mkwrite, We will need to first
>       patch all users, so to not-crash-on-no-page. Then enable this
>       patch. But even if I did that I would not sleep so well at night.
>       Adding a new vector is the safest thing to do, and is not that
>       expensive. an extra pointer at a static function vector per driver.
>       Also the new vector is better for performance, because else we
>       Will call all current Kernel vectors, so to:
> 	check-ha-no-page-do-nothing and return.
> 
> No need to call it from do_shared_fault because do_wp_page is called to
> change pte permissions anyway.
> 
> CC: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> CC: Matthew Wilcox <matthew.r.wilcox@intel.com>
> CC: Jan Kara <jack@suse.cz>
> CC: Andrew Morton <akpm@linux-foundation.org>
> CC: Hugh Dickins <hughd@google.com>
> CC: Mel Gorman <mgorman@suse.de>
> CC: linux-mm@kvack.org
> 
> Signed-off-by: Yigal Korman <yigal@plexistor.com>
> Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
> ---
>  Documentation/filesystems/Locking |  8 ++++++++
>  include/linux/mm.h                |  3 +++
>  mm/memory.c                       | 40 +++++++++++++++++++++++++++++++++++----
>  3 files changed, 47 insertions(+), 4 deletions(-)
> 
> diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
> index f91926f..25f36e6 100644
> --- a/Documentation/filesystems/Locking
> +++ b/Documentation/filesystems/Locking
> @@ -525,6 +525,7 @@ prototypes:
>  	void (*close)(struct vm_area_struct*);
>  	int (*fault)(struct vm_area_struct*, struct vm_fault *);
>  	int (*page_mkwrite)(struct vm_area_struct *, struct vm_fault *);
> +	int (*pfn_mkwrite)(struct vm_area_struct *, struct vm_fault *);
>  	int (*access)(struct vm_area_struct *, unsigned long, void*, int, int);
>  
>  locking rules:
> @@ -534,6 +535,7 @@ close:		yes
>  fault:		yes		can return with page locked
>  map_pages:	yes
>  page_mkwrite:	yes		can return with page locked
> +pfn_mkwrite:	yes
>  access:		yes
>  
>  	->fault() is called when a previously not present pte is about
> @@ -560,6 +562,12 @@ the page has been truncated, the filesystem should not look up a new page
>  like the ->fault() handler, but simply return with VM_FAULT_NOPAGE, which
>  will cause the VM to retry the fault.
>  
> +	->pfn_mkwrite() is the same as page_mkwrite but when the pte is
> +VM_PFNMAP or VM_MIXEDMAP with a page-less entry. Expected return is
> +VM_FAULT_NOPAGE. Or one of the VM_FAULT_ERROR types. The default behavior
> +after this call is to make the pte read-write, unless pfn_mkwrite()
> +already touched the pte, in that case it is untouched.
> +
>  	->access() is called when get_user_pages() fails in
>  access_process_vm(), typically used to debug a process through
>  /proc/pid/mem or ptrace.  This function is needed only for
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index d584b95..70c47f2 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -251,6 +251,9 @@ struct vm_operations_struct {
>  	 * writable, if an error is returned it will cause a SIGBUS */
>  	int (*page_mkwrite)(struct vm_area_struct *vma, struct vm_fault *vmf);
>  
> +	/* same as page_mkwrite when using VM_PFNMAP|VM_MIXEDMAP */
> +	int (*pfn_mkwrite)(struct vm_area_struct *vma, struct vm_fault *vmf);
> +
>  	/* called by access_process_vm when get_user_pages() fails, typically
>  	 * for use by special VMAs that can switch between memory and hardware
>  	 */
> diff --git a/mm/memory.c b/mm/memory.c
> index 59f6268..67b1ee8 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2181,6 +2181,39 @@ oom:
>  	return VM_FAULT_OOM;
>  }
>  
> +/*
> + * Handle write page faults for VM_MIXEDMAP or VM_PFNMAP for a VM_SHARED
> + * mapping
> + */
> +static int wp_pfn_shared(struct mm_struct *mm,
> +			struct vm_area_struct *vma, unsigned long address,
> +			pte_t *page_table, spinlock_t *ptl, pte_t orig_pte,
> +			pmd_t *pmd)
> +{
> +	if (vma->vm_ops && vma->vm_ops->pfn_mkwrite) {
> +		struct vm_fault vmf = {
> +			.page = NULL,
> +			.pgoff = linear_page_index(vma, address),
> +			.virtual_address = (void __user *)(address & PAGE_MASK),
> +			.flags = FAULT_FLAG_WRITE | FAULT_FLAG_MKWRITE,
> +		};
> +		int ret;
> +
> +		pte_unmap_unlock(page_table, ptl);
> +		ret = vma->vm_ops->pfn_mkwrite(vma, &vmf);
> +		if (ret & VM_FAULT_ERROR)
> +			return ret;
> +		page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
> +		/* Did pfn_mkwrite already fixed up the pte */
> +		if (!pte_same(*page_table, orig_pte)) {
> +			pte_unmap_unlock(page_table, ptl);
> +			return ret;

This should be "return 0;", shouldn't it?

VM_FAULT_NOPAGE would imply you've installed new pte, but you did not.

> +		}
> +	}
> +	return wp_page_reuse(mm, vma, address, page_table, ptl, orig_pte,
> +			     NULL, 0, 0);
> +}
> +
>  static int wp_page_shared(struct mm_struct *mm, struct vm_area_struct *vma,
>  			  unsigned long address, pte_t *page_table,
>  			  pmd_t *pmd, spinlock_t *ptl, pte_t orig_pte,
> @@ -2259,13 +2292,12 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  		 * VM_PFNMAP VMA.
>  		 *
>  		 * We should not cow pages in a shared writeable mapping.
> -		 * Just mark the pages writable as we can't do any dirty
> -		 * accounting on raw pfn maps.
> +		 * Just mark the pages writable and/or call ops->pfn_mkwrite.
>  		 */
>  		if ((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
>  				     (VM_WRITE|VM_SHARED))
> -			return wp_page_reuse(mm, vma, address, page_table, ptl,
> -					     orig_pte, old_page, 0, 0);
> +			return wp_pfn_shared(mm, vma, address, page_table, ptl,
> +					     orig_pte, pmd);
>  
>  		pte_unmap_unlock(page_table, ptl);
>  		return wp_page_copy(mm, vma, address, page_table, pmd,
> -- 
> 1.9.3
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
 Kirill A. Shutemov

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 1/3 v6] mm(v4.1): New pfn_mkwrite same as page_mkwrite for VM_PFNMAP
  2015-04-07 13:17     ` Kirill A. Shutemov
@ 2015-04-07 13:26       ` Kirill A. Shutemov
  2015-04-07 13:37           ` Boaz Harrosh
  0 siblings, 1 reply; 33+ messages in thread
From: Kirill A. Shutemov @ 2015-04-07 13:26 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Dave Chinner, Matthew Wilcox, Andrew Morton, Kirill A. Shutemov,
	Jan Kara, Hugh Dickins, Mel Gorman, linux-mm, linux-nvdimm,
	linux-fsdevel, Eryu Guan, Christoph Hellwig, Stable Tree

On Tue, Apr 07, 2015 at 04:17:00PM +0300, Kirill A. Shutemov wrote:
> On Tue, Apr 07, 2015 at 03:57:32PM +0300, Boaz Harrosh wrote:
> > +/*
> > + * Handle write page faults for VM_MIXEDMAP or VM_PFNMAP for a VM_SHARED
> > + * mapping
> > + */
> > +static int wp_pfn_shared(struct mm_struct *mm,
> > +			struct vm_area_struct *vma, unsigned long address,
> > +			pte_t *page_table, spinlock_t *ptl, pte_t orig_pte,
> > +			pmd_t *pmd)
> > +{
> > +	if (vma->vm_ops && vma->vm_ops->pfn_mkwrite) {
> > +		struct vm_fault vmf = {
> > +			.page = NULL,
> > +			.pgoff = linear_page_index(vma, address),
> > +			.virtual_address = (void __user *)(address & PAGE_MASK),
> > +			.flags = FAULT_FLAG_WRITE | FAULT_FLAG_MKWRITE,
> > +		};
> > +		int ret;
> > +
> > +		pte_unmap_unlock(page_table, ptl);
> > +		ret = vma->vm_ops->pfn_mkwrite(vma, &vmf);
> > +		if (ret & VM_FAULT_ERROR)
> > +			return ret;
> > +		page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
> > +		/* Did pfn_mkwrite already fixed up the pte */

Oh. I guess you've missunderstood why we need pte_same() check below.
It's not about ->pfn_mkwrite() changing the pte (generatlly, it should
not). It's requited to address race with parallel page fault to the pte.

> > +		if (!pte_same(*page_table, orig_pte)) {
> > +			pte_unmap_unlock(page_table, ptl);
> > +			return ret;
> 
> This should be "return 0;", shouldn't it?
> 
> VM_FAULT_NOPAGE would imply you've installed new pte, but you did not.
-- 
 Kirill A. Shutemov

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 1/3 v6] mm(v4.1): New pfn_mkwrite same as page_mkwrite for VM_PFNMAP
  2015-04-07 13:26       ` Kirill A. Shutemov
@ 2015-04-07 13:37           ` Boaz Harrosh
  0 siblings, 0 replies; 33+ messages in thread
From: Boaz Harrosh @ 2015-04-07 13:37 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Dave Chinner, Matthew Wilcox, Andrew Morton, Kirill A. Shutemov,
	Jan Kara, Hugh Dickins, Mel Gorman, linux-mm, linux-nvdimm,
	linux-fsdevel, Eryu Guan, Christoph Hellwig, Stable Tree

On 04/07/2015 04:26 PM, Kirill A. Shutemov wrote:
> On Tue, Apr 07, 2015 at 04:17:00PM +0300, Kirill A. Shutemov wrote:
>> On Tue, Apr 07, 2015 at 03:57:32PM +0300, Boaz Harrosh wrote:
>>> +/*
>>> + * Handle write page faults for VM_MIXEDMAP or VM_PFNMAP for a VM_SHARED
>>> + * mapping
>>> + */
>>> +static int wp_pfn_shared(struct mm_struct *mm,
>>> +			struct vm_area_struct *vma, unsigned long address,
>>> +			pte_t *page_table, spinlock_t *ptl, pte_t orig_pte,
>>> +			pmd_t *pmd)
>>> +{
>>> +	if (vma->vm_ops && vma->vm_ops->pfn_mkwrite) {
>>> +		struct vm_fault vmf = {
>>> +			.page = NULL,
>>> +			.pgoff = linear_page_index(vma, address),
>>> +			.virtual_address = (void __user *)(address & PAGE_MASK),
>>> +			.flags = FAULT_FLAG_WRITE | FAULT_FLAG_MKWRITE,
>>> +		};
>>> +		int ret;
>>> +
>>> +		pte_unmap_unlock(page_table, ptl);
>>> +		ret = vma->vm_ops->pfn_mkwrite(vma, &vmf);
>>> +		if (ret & VM_FAULT_ERROR)
>>> +			return ret;
>>> +		page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
>>> +		/* Did pfn_mkwrite already fixed up the pte */
> 
> Oh. I guess you've missunderstood why we need pte_same() check below.
> It's not about ->pfn_mkwrite() changing the pte (generatlly, it should
> not). It's requited to address race with parallel page fault to the pte.
> 
>>> +		if (!pte_same(*page_table, orig_pte)) {
>>> +			pte_unmap_unlock(page_table, ptl);
>>> +			return ret;
>>
>> This should be "return 0;", shouldn't it?
>>
>> VM_FAULT_NOPAGE would imply you've installed new pte, but you did not.

Changing this to "return 0" would be very scary for me. Because I'm running
with this code for 1/2 a year now. And it is stable. You see since the original
code it was always doing just that pte_unmap_unlock && return ret. (See the patch
based on 4.0)

I did not understand if you want that I keep it "return ret".

I gather that you would like the comment changed, about the changed pte.
Both here and at Documentation/.../locking.

I'll send a new patch just tell me if you want the reurn thing

Thank you
Boaz


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 1/3 v6] mm(v4.1): New pfn_mkwrite same as page_mkwrite for VM_PFNMAP
@ 2015-04-07 13:37           ` Boaz Harrosh
  0 siblings, 0 replies; 33+ messages in thread
From: Boaz Harrosh @ 2015-04-07 13:37 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Dave Chinner, Matthew Wilcox, Andrew Morton, Kirill A. Shutemov,
	Jan Kara, Hugh Dickins, Mel Gorman, linux-mm, linux-nvdimm,
	linux-fsdevel, Eryu Guan, Christoph Hellwig, Stable Tree

On 04/07/2015 04:26 PM, Kirill A. Shutemov wrote:
> On Tue, Apr 07, 2015 at 04:17:00PM +0300, Kirill A. Shutemov wrote:
>> On Tue, Apr 07, 2015 at 03:57:32PM +0300, Boaz Harrosh wrote:
>>> +/*
>>> + * Handle write page faults for VM_MIXEDMAP or VM_PFNMAP for a VM_SHARED
>>> + * mapping
>>> + */
>>> +static int wp_pfn_shared(struct mm_struct *mm,
>>> +			struct vm_area_struct *vma, unsigned long address,
>>> +			pte_t *page_table, spinlock_t *ptl, pte_t orig_pte,
>>> +			pmd_t *pmd)
>>> +{
>>> +	if (vma->vm_ops && vma->vm_ops->pfn_mkwrite) {
>>> +		struct vm_fault vmf = {
>>> +			.page = NULL,
>>> +			.pgoff = linear_page_index(vma, address),
>>> +			.virtual_address = (void __user *)(address & PAGE_MASK),
>>> +			.flags = FAULT_FLAG_WRITE | FAULT_FLAG_MKWRITE,
>>> +		};
>>> +		int ret;
>>> +
>>> +		pte_unmap_unlock(page_table, ptl);
>>> +		ret = vma->vm_ops->pfn_mkwrite(vma, &vmf);
>>> +		if (ret & VM_FAULT_ERROR)
>>> +			return ret;
>>> +		page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
>>> +		/* Did pfn_mkwrite already fixed up the pte */
> 
> Oh. I guess you've missunderstood why we need pte_same() check below.
> It's not about ->pfn_mkwrite() changing the pte (generatlly, it should
> not). It's requited to address race with parallel page fault to the pte.
> 
>>> +		if (!pte_same(*page_table, orig_pte)) {
>>> +			pte_unmap_unlock(page_table, ptl);
>>> +			return ret;
>>
>> This should be "return 0;", shouldn't it?
>>
>> VM_FAULT_NOPAGE would imply you've installed new pte, but you did not.

Changing this to "return 0" would be very scary for me. Because I'm running
with this code for 1/2 a year now. And it is stable. You see since the original
code it was always doing just that pte_unmap_unlock && return ret. (See the patch
based on 4.0)

I did not understand if you want that I keep it "return ret".

I gather that you would like the comment changed, about the changed pte.
Both here and at Documentation/.../locking.

I'll send a new patch just tell me if you want the reurn thing

Thank you
Boaz

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 1/3 v6] mm(v4.1): New pfn_mkwrite same as page_mkwrite for VM_PFNMAP
  2015-04-07 13:37           ` Boaz Harrosh
  (?)
@ 2015-04-07 13:47           ` Kirill A. Shutemov
  2015-04-07 14:09               ` Boaz Harrosh
  -1 siblings, 1 reply; 33+ messages in thread
From: Kirill A. Shutemov @ 2015-04-07 13:47 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Dave Chinner, Matthew Wilcox, Andrew Morton, Kirill A. Shutemov,
	Jan Kara, Hugh Dickins, Mel Gorman, linux-mm, linux-nvdimm,
	linux-fsdevel, Eryu Guan, Christoph Hellwig, Stable Tree

On Tue, Apr 07, 2015 at 04:37:07PM +0300, Boaz Harrosh wrote:
> On 04/07/2015 04:26 PM, Kirill A. Shutemov wrote:
> > On Tue, Apr 07, 2015 at 04:17:00PM +0300, Kirill A. Shutemov wrote:
> >> On Tue, Apr 07, 2015 at 03:57:32PM +0300, Boaz Harrosh wrote:
> >>> +/*
> >>> + * Handle write page faults for VM_MIXEDMAP or VM_PFNMAP for a VM_SHARED
> >>> + * mapping
> >>> + */
> >>> +static int wp_pfn_shared(struct mm_struct *mm,
> >>> +			struct vm_area_struct *vma, unsigned long address,
> >>> +			pte_t *page_table, spinlock_t *ptl, pte_t orig_pte,
> >>> +			pmd_t *pmd)
> >>> +{
> >>> +	if (vma->vm_ops && vma->vm_ops->pfn_mkwrite) {
> >>> +		struct vm_fault vmf = {
> >>> +			.page = NULL,
> >>> +			.pgoff = linear_page_index(vma, address),
> >>> +			.virtual_address = (void __user *)(address & PAGE_MASK),
> >>> +			.flags = FAULT_FLAG_WRITE | FAULT_FLAG_MKWRITE,
> >>> +		};
> >>> +		int ret;
> >>> +
> >>> +		pte_unmap_unlock(page_table, ptl);
> >>> +		ret = vma->vm_ops->pfn_mkwrite(vma, &vmf);
> >>> +		if (ret & VM_FAULT_ERROR)
> >>> +			return ret;
> >>> +		page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
> >>> +		/* Did pfn_mkwrite already fixed up the pte */
> > 
> > Oh. I guess you've missunderstood why we need pte_same() check below.
> > It's not about ->pfn_mkwrite() changing the pte (generatlly, it should
> > not). It's requited to address race with parallel page fault to the pte.
> > 
> >>> +		if (!pte_same(*page_table, orig_pte)) {
> >>> +			pte_unmap_unlock(page_table, ptl);
> >>> +			return ret;
> >>
> >> This should be "return 0;", shouldn't it?
> >>
> >> VM_FAULT_NOPAGE would imply you've installed new pte, but you did not.
> 
> Changing this to "return 0" would be very scary for me. Because I'm running
> with this code for 1/2 a year now. And it is stable. You see since the original
> code it was always doing just that pte_unmap_unlock && return ret. (See the patch
> based on 4.0)
> 
> I did not understand if you want that I keep it "return ret".

I think "return 0;" is right way to go. It matches wp_page_shared()
behaviour.

> I gather that you would like the comment changed, about the changed pte.
> Both here and at Documentation/.../locking.
> 
> I'll send a new patch just tell me if you want the reurn thing
> 
> Thank you
> Boaz
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
 Kirill A. Shutemov

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH 1/3 v7] mm(v4.1): New pfn_mkwrite same as page_mkwrite for VM_PFNMAP
  2015-04-07  8:40 ` [PATCH 1/3] mm(v4.1): New pfn_mkwrite same as page_mkwrite for VM_PFNMAP Boaz Harrosh
@ 2015-04-07 14:06     ` Boaz Harrosh
  2015-04-07  9:03   ` Kirill A. Shutemov
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 33+ messages in thread
From: Boaz Harrosh @ 2015-04-07 14:06 UTC (permalink / raw)
  To: Dave Chinner, Matthew Wilcox, Andrew Morton, Kirill A. Shutemov,
	Jan Kara, Hugh Dickins, Mel Gorman, linux-mm, linux-nvdimm,
	linux-fsdevel, Eryu Guan, Christoph Hellwig
  Cc: Stable Tree


[v5]
Changed comments about pte_same check after the call to
pfn_mkwrite and the return value.

[v4]
Kirill's comments about splitting out a new wp_pfn_shared().
Add Documentation/filesystems/Locking text about pfn_mkwrite.

[v3]
Kirill's comments about use of linear_page_index()

[v2]
Based on linux-next/akpm [3dc4623]. For v4.1 merge window
Incorporated comments from Andrew And Kirill

[v1]
This will allow FS that uses VM_PFNMAP | VM_MIXEDMAP (no page structs)
to get notified when access is a write to a read-only PFN.

This can happen if we mmap() a file then first mmap-read from it
to page-in a read-only PFN, than we mmap-write to the same page.

We need this functionality to fix a DAX bug, where in the scenario
above we fail to set ctime/mtime though we modified the file.
An xfstest is attached to this patchset that shows the failure
and the fix. (A DAX patch will follow)

This functionality is extra important for us, because upon
dirtying of a pmem page we also want to RDMA the page to a
remote cluster node.

We define a new pfn_mkwrite and do not reuse page_mkwrite because
  1 - The name ;-)
  2 - But mainly because it would take a very long and tedious
      audit of all page_mkwrite functions of VM_MIXEDMAP/VM_PFNMAP
      users. To make sure they do not now CRASH. For example current
      DAX code (which this is for) would crash.
      If we would want to reuse page_mkwrite, We will need to first
      patch all users, so to not-crash-on-no-page. Then enable this
      patch. But even if I did that I would not sleep so well at night.
      Adding a new vector is the safest thing to do, and is not that
      expensive. an extra pointer at a static function vector per driver.
      Also the new vector is better for performance, because else we
      Will call all current Kernel vectors, so to:
	check-ha-no-page-do-nothing and return.

No need to call it from do_shared_fault because do_wp_page is called to
change pte permissions anyway.

CC: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
CC: Matthew Wilcox <matthew.r.wilcox@intel.com>
CC: Jan Kara <jack@suse.cz>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Hugh Dickins <hughd@google.com>
CC: Mel Gorman <mgorman@suse.de>
CC: linux-mm@kvack.org

Signed-off-by: Yigal Korman <yigal@plexistor.com>
Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
---
 Documentation/filesystems/Locking |  8 ++++++++
 include/linux/mm.h                |  3 +++
 mm/memory.c                       | 43 +++++++++++++++++++++++++++++++++++----
 3 files changed, 50 insertions(+), 4 deletions(-)

diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
index f91926f..8bb8a7e 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -525,6 +525,7 @@ prototypes:
 	void (*close)(struct vm_area_struct*);
 	int (*fault)(struct vm_area_struct*, struct vm_fault *);
 	int (*page_mkwrite)(struct vm_area_struct *, struct vm_fault *);
+	int (*pfn_mkwrite)(struct vm_area_struct *, struct vm_fault *);
 	int (*access)(struct vm_area_struct *, unsigned long, void*, int, int);
 
 locking rules:
@@ -534,6 +535,7 @@ close:		yes
 fault:		yes		can return with page locked
 map_pages:	yes
 page_mkwrite:	yes		can return with page locked
+pfn_mkwrite:	yes
 access:		yes
 
 	->fault() is called when a previously not present pte is about
@@ -560,6 +562,12 @@ the page has been truncated, the filesystem should not look up a new page
 like the ->fault() handler, but simply return with VM_FAULT_NOPAGE, which
 will cause the VM to retry the fault.
 
+	->pfn_mkwrite() is the same as page_mkwrite but when the pte is
+VM_PFNMAP or VM_MIXEDMAP with a page-less entry. Expected return is
+VM_FAULT_NOPAGE. Or one of the VM_FAULT_ERROR types. The default behavior
+after this call is to make the pte read-write, unless pfn_mkwrite returns
+an error.
+
 	->access() is called when get_user_pages() fails in
 access_process_vm(), typically used to debug a process through
 /proc/pid/mem or ptrace.  This function is needed only for
diff --git a/include/linux/mm.h b/include/linux/mm.h
index d584b95..70c47f2 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -251,6 +251,9 @@ struct vm_operations_struct {
 	 * writable, if an error is returned it will cause a SIGBUS */
 	int (*page_mkwrite)(struct vm_area_struct *vma, struct vm_fault *vmf);
 
+	/* same as page_mkwrite when using VM_PFNMAP|VM_MIXEDMAP */
+	int (*pfn_mkwrite)(struct vm_area_struct *vma, struct vm_fault *vmf);
+
 	/* called by access_process_vm when get_user_pages() fails, typically
 	 * for use by special VMAs that can switch between memory and hardware
 	 */
diff --git a/mm/memory.c b/mm/memory.c
index 59f6268..d839cbc 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2181,6 +2181,42 @@ oom:
 	return VM_FAULT_OOM;
 }
 
+/*
+ * Handle write page faults for VM_MIXEDMAP or VM_PFNMAP for a VM_SHARED
+ * mapping
+ */
+static int wp_pfn_shared(struct mm_struct *mm,
+			struct vm_area_struct *vma, unsigned long address,
+			pte_t *page_table, spinlock_t *ptl, pte_t orig_pte,
+			pmd_t *pmd)
+{
+	if (vma->vm_ops && vma->vm_ops->pfn_mkwrite) {
+		struct vm_fault vmf = {
+			.page = NULL,
+			.pgoff = linear_page_index(vma, address),
+			.virtual_address = (void __user *)(address & PAGE_MASK),
+			.flags = FAULT_FLAG_WRITE | FAULT_FLAG_MKWRITE,
+		};
+		int ret;
+
+		pte_unmap_unlock(page_table, ptl);
+		ret = vma->vm_ops->pfn_mkwrite(vma, &vmf);
+		if (ret & VM_FAULT_ERROR)
+			return ret;
+		page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
+		/*
+		 * We might have raced with another page fault while we
+		 * released the pte_offset_map_lock.
+		 */
+		if (!pte_same(*page_table, orig_pte)) {
+			pte_unmap_unlock(page_table, ptl);
+			return 0;
+		}
+	}
+	return wp_page_reuse(mm, vma, address, page_table, ptl, orig_pte,
+			     NULL, 0, 0);
+}
+
 static int wp_page_shared(struct mm_struct *mm, struct vm_area_struct *vma,
 			  unsigned long address, pte_t *page_table,
 			  pmd_t *pmd, spinlock_t *ptl, pte_t orig_pte,
@@ -2259,13 +2295,12 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		 * VM_PFNMAP VMA.
 		 *
 		 * We should not cow pages in a shared writeable mapping.
-		 * Just mark the pages writable as we can't do any dirty
-		 * accounting on raw pfn maps.
+		 * Just mark the pages writable and/or call ops->pfn_mkwrite.
 		 */
 		if ((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
 				     (VM_WRITE|VM_SHARED))
-			return wp_page_reuse(mm, vma, address, page_table, ptl,
-					     orig_pte, old_page, 0, 0);
+			return wp_pfn_shared(mm, vma, address, page_table, ptl,
+					     orig_pte, pmd);
 
 		pte_unmap_unlock(page_table, ptl);
 		return wp_page_copy(mm, vma, address, page_table, pmd,
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 1/3 v7] mm(v4.1): New pfn_mkwrite same as page_mkwrite for VM_PFNMAP
@ 2015-04-07 14:06     ` Boaz Harrosh
  0 siblings, 0 replies; 33+ messages in thread
From: Boaz Harrosh @ 2015-04-07 14:06 UTC (permalink / raw)
  To: Dave Chinner, Matthew Wilcox, Andrew Morton, Kirill A. Shutemov,
	Jan Kara, Hugh Dickins, Mel Gorman, linux-mm, linux-nvdimm,
	linux-fsdevel, Eryu Guan, Christoph Hellwig
  Cc: Stable Tree


[v5]
Changed comments about pte_same check after the call to
pfn_mkwrite and the return value.

[v4]
Kirill's comments about splitting out a new wp_pfn_shared().
Add Documentation/filesystems/Locking text about pfn_mkwrite.

[v3]
Kirill's comments about use of linear_page_index()

[v2]
Based on linux-next/akpm [3dc4623]. For v4.1 merge window
Incorporated comments from Andrew And Kirill

[v1]
This will allow FS that uses VM_PFNMAP | VM_MIXEDMAP (no page structs)
to get notified when access is a write to a read-only PFN.

This can happen if we mmap() a file then first mmap-read from it
to page-in a read-only PFN, than we mmap-write to the same page.

We need this functionality to fix a DAX bug, where in the scenario
above we fail to set ctime/mtime though we modified the file.
An xfstest is attached to this patchset that shows the failure
and the fix. (A DAX patch will follow)

This functionality is extra important for us, because upon
dirtying of a pmem page we also want to RDMA the page to a
remote cluster node.

We define a new pfn_mkwrite and do not reuse page_mkwrite because
  1 - The name ;-)
  2 - But mainly because it would take a very long and tedious
      audit of all page_mkwrite functions of VM_MIXEDMAP/VM_PFNMAP
      users. To make sure they do not now CRASH. For example current
      DAX code (which this is for) would crash.
      If we would want to reuse page_mkwrite, We will need to first
      patch all users, so to not-crash-on-no-page. Then enable this
      patch. But even if I did that I would not sleep so well at night.
      Adding a new vector is the safest thing to do, and is not that
      expensive. an extra pointer at a static function vector per driver.
      Also the new vector is better for performance, because else we
      Will call all current Kernel vectors, so to:
	check-ha-no-page-do-nothing and return.

No need to call it from do_shared_fault because do_wp_page is called to
change pte permissions anyway.

CC: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
CC: Matthew Wilcox <matthew.r.wilcox@intel.com>
CC: Jan Kara <jack@suse.cz>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Hugh Dickins <hughd@google.com>
CC: Mel Gorman <mgorman@suse.de>
CC: linux-mm@kvack.org

Signed-off-by: Yigal Korman <yigal@plexistor.com>
Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
---
 Documentation/filesystems/Locking |  8 ++++++++
 include/linux/mm.h                |  3 +++
 mm/memory.c                       | 43 +++++++++++++++++++++++++++++++++++----
 3 files changed, 50 insertions(+), 4 deletions(-)

diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
index f91926f..8bb8a7e 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -525,6 +525,7 @@ prototypes:
 	void (*close)(struct vm_area_struct*);
 	int (*fault)(struct vm_area_struct*, struct vm_fault *);
 	int (*page_mkwrite)(struct vm_area_struct *, struct vm_fault *);
+	int (*pfn_mkwrite)(struct vm_area_struct *, struct vm_fault *);
 	int (*access)(struct vm_area_struct *, unsigned long, void*, int, int);
 
 locking rules:
@@ -534,6 +535,7 @@ close:		yes
 fault:		yes		can return with page locked
 map_pages:	yes
 page_mkwrite:	yes		can return with page locked
+pfn_mkwrite:	yes
 access:		yes
 
 	->fault() is called when a previously not present pte is about
@@ -560,6 +562,12 @@ the page has been truncated, the filesystem should not look up a new page
 like the ->fault() handler, but simply return with VM_FAULT_NOPAGE, which
 will cause the VM to retry the fault.
 
+	->pfn_mkwrite() is the same as page_mkwrite but when the pte is
+VM_PFNMAP or VM_MIXEDMAP with a page-less entry. Expected return is
+VM_FAULT_NOPAGE. Or one of the VM_FAULT_ERROR types. The default behavior
+after this call is to make the pte read-write, unless pfn_mkwrite returns
+an error.
+
 	->access() is called when get_user_pages() fails in
 access_process_vm(), typically used to debug a process through
 /proc/pid/mem or ptrace.  This function is needed only for
diff --git a/include/linux/mm.h b/include/linux/mm.h
index d584b95..70c47f2 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -251,6 +251,9 @@ struct vm_operations_struct {
 	 * writable, if an error is returned it will cause a SIGBUS */
 	int (*page_mkwrite)(struct vm_area_struct *vma, struct vm_fault *vmf);
 
+	/* same as page_mkwrite when using VM_PFNMAP|VM_MIXEDMAP */
+	int (*pfn_mkwrite)(struct vm_area_struct *vma, struct vm_fault *vmf);
+
 	/* called by access_process_vm when get_user_pages() fails, typically
 	 * for use by special VMAs that can switch between memory and hardware
 	 */
diff --git a/mm/memory.c b/mm/memory.c
index 59f6268..d839cbc 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2181,6 +2181,42 @@ oom:
 	return VM_FAULT_OOM;
 }
 
+/*
+ * Handle write page faults for VM_MIXEDMAP or VM_PFNMAP for a VM_SHARED
+ * mapping
+ */
+static int wp_pfn_shared(struct mm_struct *mm,
+			struct vm_area_struct *vma, unsigned long address,
+			pte_t *page_table, spinlock_t *ptl, pte_t orig_pte,
+			pmd_t *pmd)
+{
+	if (vma->vm_ops && vma->vm_ops->pfn_mkwrite) {
+		struct vm_fault vmf = {
+			.page = NULL,
+			.pgoff = linear_page_index(vma, address),
+			.virtual_address = (void __user *)(address & PAGE_MASK),
+			.flags = FAULT_FLAG_WRITE | FAULT_FLAG_MKWRITE,
+		};
+		int ret;
+
+		pte_unmap_unlock(page_table, ptl);
+		ret = vma->vm_ops->pfn_mkwrite(vma, &vmf);
+		if (ret & VM_FAULT_ERROR)
+			return ret;
+		page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
+		/*
+		 * We might have raced with another page fault while we
+		 * released the pte_offset_map_lock.
+		 */
+		if (!pte_same(*page_table, orig_pte)) {
+			pte_unmap_unlock(page_table, ptl);
+			return 0;
+		}
+	}
+	return wp_page_reuse(mm, vma, address, page_table, ptl, orig_pte,
+			     NULL, 0, 0);
+}
+
 static int wp_page_shared(struct mm_struct *mm, struct vm_area_struct *vma,
 			  unsigned long address, pte_t *page_table,
 			  pmd_t *pmd, spinlock_t *ptl, pte_t orig_pte,
@@ -2259,13 +2295,12 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		 * VM_PFNMAP VMA.
 		 *
 		 * We should not cow pages in a shared writeable mapping.
-		 * Just mark the pages writable as we can't do any dirty
-		 * accounting on raw pfn maps.
+		 * Just mark the pages writable and/or call ops->pfn_mkwrite.
 		 */
 		if ((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
 				     (VM_WRITE|VM_SHARED))
-			return wp_page_reuse(mm, vma, address, page_table, ptl,
-					     orig_pte, old_page, 0, 0);
+			return wp_pfn_shared(mm, vma, address, page_table, ptl,
+					     orig_pte, pmd);
 
 		pte_unmap_unlock(page_table, ptl);
 		return wp_page_copy(mm, vma, address, page_table, pmd,
-- 
1.9.3


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH 1/3 v6] mm(v4.1): New pfn_mkwrite same as page_mkwrite for VM_PFNMAP
  2015-04-07 13:47           ` Kirill A. Shutemov
@ 2015-04-07 14:09               ` Boaz Harrosh
  0 siblings, 0 replies; 33+ messages in thread
From: Boaz Harrosh @ 2015-04-07 14:09 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Dave Chinner, Matthew Wilcox, Andrew Morton, Kirill A. Shutemov,
	Jan Kara, Hugh Dickins, Mel Gorman, linux-mm, linux-nvdimm,
	linux-fsdevel, Eryu Guan, Christoph Hellwig, Stable Tree

On 04/07/2015 04:47 PM, Kirill A. Shutemov wrote:
<>
>> I did not understand if you want that I keep it "return ret".
> 
> I think "return 0;" is right way to go. It matches wp_page_shared()
> behaviour.
> 

Ok I sent a v7 with "return 0;" as is in wp_page_shared(). I guess
it is better. It survived of course an xfstests quick but I'll let
it smoke for the night as well.

Thanks for all your help
Boaz

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 1/3 v6] mm(v4.1): New pfn_mkwrite same as page_mkwrite for VM_PFNMAP
@ 2015-04-07 14:09               ` Boaz Harrosh
  0 siblings, 0 replies; 33+ messages in thread
From: Boaz Harrosh @ 2015-04-07 14:09 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Dave Chinner, Matthew Wilcox, Andrew Morton, Kirill A. Shutemov,
	Jan Kara, Hugh Dickins, Mel Gorman, linux-mm, linux-nvdimm,
	linux-fsdevel, Eryu Guan, Christoph Hellwig, Stable Tree

On 04/07/2015 04:47 PM, Kirill A. Shutemov wrote:
<>
>> I did not understand if you want that I keep it "return ret".
> 
> I think "return 0;" is right way to go. It matches wp_page_shared()
> behaviour.
> 

Ok I sent a v7 with "return 0;" as is in wp_page_shared(). I guess
it is better. It survived of course an xfstests quick but I'll let
it smoke for the night as well.

Thanks for all your help
Boaz

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 1/3 v7] mm(v4.1): New pfn_mkwrite same as page_mkwrite for VM_PFNMAP
  2015-04-07 14:06     ` Boaz Harrosh
  (?)
@ 2015-04-07 14:12     ` Kirill A. Shutemov
  -1 siblings, 0 replies; 33+ messages in thread
From: Kirill A. Shutemov @ 2015-04-07 14:12 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Dave Chinner, Matthew Wilcox, Andrew Morton, Kirill A. Shutemov,
	Jan Kara, Hugh Dickins, Mel Gorman, linux-mm, linux-nvdimm,
	linux-fsdevel, Eryu Guan, Christoph Hellwig, Stable Tree

On Tue, Apr 07, 2015 at 05:06:11PM +0300, Boaz Harrosh wrote:
> 
> [v5]
> Changed comments about pte_same check after the call to
> pfn_mkwrite and the return value.
> 
> [v4]
> Kirill's comments about splitting out a new wp_pfn_shared().
> Add Documentation/filesystems/Locking text about pfn_mkwrite.
> 
> [v3]
> Kirill's comments about use of linear_page_index()
> 
> [v2]
> Based on linux-next/akpm [3dc4623]. For v4.1 merge window
> Incorporated comments from Andrew And Kirill
> 
> [v1]
> This will allow FS that uses VM_PFNMAP | VM_MIXEDMAP (no page structs)
> to get notified when access is a write to a read-only PFN.
> 
> This can happen if we mmap() a file then first mmap-read from it
> to page-in a read-only PFN, than we mmap-write to the same page.
> 
> We need this functionality to fix a DAX bug, where in the scenario
> above we fail to set ctime/mtime though we modified the file.
> An xfstest is attached to this patchset that shows the failure
> and the fix. (A DAX patch will follow)
> 
> This functionality is extra important for us, because upon
> dirtying of a pmem page we also want to RDMA the page to a
> remote cluster node.
> 
> We define a new pfn_mkwrite and do not reuse page_mkwrite because
>   1 - The name ;-)
>   2 - But mainly because it would take a very long and tedious
>       audit of all page_mkwrite functions of VM_MIXEDMAP/VM_PFNMAP
>       users. To make sure they do not now CRASH. For example current
>       DAX code (which this is for) would crash.
>       If we would want to reuse page_mkwrite, We will need to first
>       patch all users, so to not-crash-on-no-page. Then enable this
>       patch. But even if I did that I would not sleep so well at night.
>       Adding a new vector is the safest thing to do, and is not that
>       expensive. an extra pointer at a static function vector per driver.
>       Also the new vector is better for performance, because else we
>       Will call all current Kernel vectors, so to:
> 	check-ha-no-page-do-nothing and return.
> 
> No need to call it from do_shared_fault because do_wp_page is called to
> change pte permissions anyway.
> 
> CC: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> CC: Matthew Wilcox <matthew.r.wilcox@intel.com>
> CC: Jan Kara <jack@suse.cz>
> CC: Andrew Morton <akpm@linux-foundation.org>
> CC: Hugh Dickins <hughd@google.com>
> CC: Mel Gorman <mgorman@suse.de>
> CC: linux-mm@kvack.org
> 
> Signed-off-by: Yigal Korman <yigal@plexistor.com>
> Signed-off-by: Boaz Harrosh <boaz@plexistor.com>

Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
-- 
 Kirill A. Shutemov

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 3/3] dax: Unify ext2/4_{dax,}_file_operations
  2015-04-07  8:45   ` Boaz Harrosh
  (?)
@ 2015-04-07 16:26   ` Jan Kara
  -1 siblings, 0 replies; 33+ messages in thread
From: Jan Kara @ 2015-04-07 16:26 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Dave Chinner, Matthew Wilcox, Andrew Morton, Kirill A. Shutemov,
	Jan Kara, Hugh Dickins, Mel Gorman, linux-mm, linux-nvdimm,
	linux-fsdevel, Eryu Guan, Christoph Hellwig, Stable Tree

On Tue 07-04-15 11:45:09, Boaz Harrosh wrote:
> 
> The original dax patchset split the ext2/4_file_operations
> because of the two NULL splice_read/splice_write in the dax
> case.
> 
> At vfs if splice_read/splice_write are NULL would then call
> default_splice_read/write.
> 
> What we do here is make generic_file_splice_read aware of
> IS_DAX() so the original ext2/4_file_operations can be used
> as is.
> 
> For write it appears that iter_file_splice_write is just fine.
> It uses the regular f_op->write(file,..) or new_sync_write(file, ...).
> 
> CC: Dave Chinner <dchinner@redhat.com>
> CC: Matthew Wilcox <willy@linux.intel.com>
> Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
  The patch looks good to me. You can add:
Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/ext2/ext2.h  |  1 -
>  fs/ext2/file.c  | 18 ------------------
>  fs/ext2/inode.c |  5 +----
>  fs/ext2/namei.c | 10 ++--------
>  fs/ext4/ext4.h  |  1 -
>  fs/ext4/file.c  | 20 --------------------
>  fs/ext4/inode.c |  5 +----
>  fs/ext4/namei.c | 10 ++--------
>  fs/splice.c     |  3 +++
>  9 files changed, 9 insertions(+), 64 deletions(-)
> 
> diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h
> index 678f9ab..8d15feb 100644
> --- a/fs/ext2/ext2.h
> +++ b/fs/ext2/ext2.h
> @@ -793,7 +793,6 @@ extern int ext2_fsync(struct file *file, loff_t start, loff_t end,
>  		      int datasync);
>  extern const struct inode_operations ext2_file_inode_operations;
>  extern const struct file_operations ext2_file_operations;
> -extern const struct file_operations ext2_dax_file_operations;
>  
>  /* inode.c */
>  extern const struct address_space_operations ext2_aops;
> diff --git a/fs/ext2/file.c b/fs/ext2/file.c
> index 866a3ce..19cac93 100644
> --- a/fs/ext2/file.c
> +++ b/fs/ext2/file.c
> @@ -109,24 +109,6 @@ const struct file_operations ext2_file_operations = {
>  	.splice_write	= iter_file_splice_write,
>  };
>  
> -#ifdef CONFIG_FS_DAX
> -const struct file_operations ext2_dax_file_operations = {
> -	.llseek		= generic_file_llseek,
> -	.read		= new_sync_read,
> -	.write		= new_sync_write,
> -	.read_iter	= generic_file_read_iter,
> -	.write_iter	= generic_file_write_iter,
> -	.unlocked_ioctl = ext2_ioctl,
> -#ifdef CONFIG_COMPAT
> -	.compat_ioctl	= ext2_compat_ioctl,
> -#endif
> -	.mmap		= ext2_file_mmap,
> -	.open		= dquot_file_open,
> -	.release	= ext2_release_file,
> -	.fsync		= ext2_fsync,
> -};
> -#endif
> -
>  const struct inode_operations ext2_file_inode_operations = {
>  #ifdef CONFIG_EXT2_FS_XATTR
>  	.setxattr	= generic_setxattr,
> diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
> index df9d6af..b29eb67 100644
> --- a/fs/ext2/inode.c
> +++ b/fs/ext2/inode.c
> @@ -1388,10 +1388,7 @@ struct inode *ext2_iget (struct super_block *sb, unsigned long ino)
>  
>  	if (S_ISREG(inode->i_mode)) {
>  		inode->i_op = &ext2_file_inode_operations;
> -		if (test_opt(inode->i_sb, DAX)) {
> -			inode->i_mapping->a_ops = &ext2_aops;
> -			inode->i_fop = &ext2_dax_file_operations;
> -		} else if (test_opt(inode->i_sb, NOBH)) {
> +		if (test_opt(inode->i_sb, NOBH)) {
>  			inode->i_mapping->a_ops = &ext2_nobh_aops;
>  			inode->i_fop = &ext2_file_operations;
>  		} else {
> diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c
> index 148f6e3..ce42293 100644
> --- a/fs/ext2/namei.c
> +++ b/fs/ext2/namei.c
> @@ -104,10 +104,7 @@ static int ext2_create (struct inode * dir, struct dentry * dentry, umode_t mode
>  		return PTR_ERR(inode);
>  
>  	inode->i_op = &ext2_file_inode_operations;
> -	if (test_opt(inode->i_sb, DAX)) {
> -		inode->i_mapping->a_ops = &ext2_aops;
> -		inode->i_fop = &ext2_dax_file_operations;
> -	} else if (test_opt(inode->i_sb, NOBH)) {
> +	if (test_opt(inode->i_sb, NOBH)) {
>  		inode->i_mapping->a_ops = &ext2_nobh_aops;
>  		inode->i_fop = &ext2_file_operations;
>  	} else {
> @@ -125,10 +122,7 @@ static int ext2_tmpfile(struct inode *dir, struct dentry *dentry, umode_t mode)
>  		return PTR_ERR(inode);
>  
>  	inode->i_op = &ext2_file_inode_operations;
> -	if (test_opt(inode->i_sb, DAX)) {
> -		inode->i_mapping->a_ops = &ext2_aops;
> -		inode->i_fop = &ext2_dax_file_operations;
> -	} else if (test_opt(inode->i_sb, NOBH)) {
> +	if (test_opt(inode->i_sb, NOBH)) {
>  		inode->i_mapping->a_ops = &ext2_nobh_aops;
>  		inode->i_fop = &ext2_file_operations;
>  	} else {
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index f63c3d5..8a3981e 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -2593,7 +2593,6 @@ extern const struct file_operations ext4_dir_operations;
>  /* file.c */
>  extern const struct inode_operations ext4_file_inode_operations;
>  extern const struct file_operations ext4_file_operations;
> -extern const struct file_operations ext4_dax_file_operations;
>  extern loff_t ext4_llseek(struct file *file, loff_t offset, int origin);
>  
>  /* inline.c */
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index aa78c70..e6d4280 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -625,26 +625,6 @@ const struct file_operations ext4_file_operations = {
>  	.fallocate	= ext4_fallocate,
>  };
>  
> -#ifdef CONFIG_FS_DAX
> -const struct file_operations ext4_dax_file_operations = {
> -	.llseek		= ext4_llseek,
> -	.read		= new_sync_read,
> -	.write		= new_sync_write,
> -	.read_iter	= generic_file_read_iter,
> -	.write_iter	= ext4_file_write_iter,
> -	.unlocked_ioctl = ext4_ioctl,
> -#ifdef CONFIG_COMPAT
> -	.compat_ioctl	= ext4_compat_ioctl,
> -#endif
> -	.mmap		= ext4_file_mmap,
> -	.open		= ext4_file_open,
> -	.release	= ext4_release_file,
> -	.fsync		= ext4_sync_file,
> -	/* Splice not yet supported with DAX */
> -	.fallocate	= ext4_fallocate,
> -};
> -#endif
> -
>  const struct inode_operations ext4_file_inode_operations = {
>  	.setattr	= ext4_setattr,
>  	.getattr	= ext4_getattr,
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index a3f4513..035b7a0 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -4090,10 +4090,7 @@ struct inode *ext4_iget(struct super_block *sb, unsigned long ino)
>  
>  	if (S_ISREG(inode->i_mode)) {
>  		inode->i_op = &ext4_file_inode_operations;
> -		if (test_opt(inode->i_sb, DAX))
> -			inode->i_fop = &ext4_dax_file_operations;
> -		else
> -			inode->i_fop = &ext4_file_operations;
> +		inode->i_fop = &ext4_file_operations;
>  		ext4_set_aops(inode);
>  	} else if (S_ISDIR(inode->i_mode)) {
>  		inode->i_op = &ext4_dir_inode_operations;
> diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
> index 28fe71a..2291923 100644
> --- a/fs/ext4/namei.c
> +++ b/fs/ext4/namei.c
> @@ -2235,10 +2235,7 @@ retry:
>  	err = PTR_ERR(inode);
>  	if (!IS_ERR(inode)) {
>  		inode->i_op = &ext4_file_inode_operations;
> -		if (test_opt(inode->i_sb, DAX))
> -			inode->i_fop = &ext4_dax_file_operations;
> -		else
> -			inode->i_fop = &ext4_file_operations;
> +		inode->i_fop = &ext4_file_operations;
>  		ext4_set_aops(inode);
>  		err = ext4_add_nondir(handle, dentry, inode);
>  		if (!err && IS_DIRSYNC(dir))
> @@ -2302,10 +2299,7 @@ retry:
>  	err = PTR_ERR(inode);
>  	if (!IS_ERR(inode)) {
>  		inode->i_op = &ext4_file_inode_operations;
> -		if (test_opt(inode->i_sb, DAX))
> -			inode->i_fop = &ext4_dax_file_operations;
> -		else
> -			inode->i_fop = &ext4_file_operations;
> +		inode->i_fop = &ext4_file_operations;
>  		ext4_set_aops(inode);
>  		d_tmpfile(dentry, inode);
>  		err = ext4_orphan_add(handle, inode);
> diff --git a/fs/splice.c b/fs/splice.c
> index 41cbb16..476024b 100644
> --- a/fs/splice.c
> +++ b/fs/splice.c
> @@ -523,6 +523,9 @@ ssize_t generic_file_splice_read(struct file *in, loff_t *ppos,
>  	loff_t isize, left;
>  	int ret;
>  
> +	if (IS_DAX(in->f_mapping->host))
> +		return default_file_splice_read(in, ppos, pipe, len, flags);
> +
>  	isize = i_size_read(in->f_mapping->host);
>  	if (unlikely(*ppos >= isize))
>  		return 0;
> -- 
> 1.9.3
> 
> 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 2/3] dax: use pfn_mkwrite to update c/mtime + freeze protection
  2015-04-07  8:43   ` Boaz Harrosh
@ 2015-04-07 16:28     ` Jan Kara
  -1 siblings, 0 replies; 33+ messages in thread
From: Jan Kara @ 2015-04-07 16:28 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Dave Chinner, Matthew Wilcox, Andrew Morton, Kirill A. Shutemov,
	Jan Kara, Hugh Dickins, Mel Gorman, linux-mm, linux-nvdimm,
	linux-fsdevel, Eryu Guan, Christoph Hellwig, Stable Tree

On Tue 07-04-15 11:43:50, Boaz Harrosh wrote:
> From: Yigal Korman <yigal@plexistor.com>
> 
> [v1]
> Without this patch, c/mtime is not updated correctly when mmap'ed page is
> first read from and then written to.
> 
> A new xfstest is submitted for testing this (generic/080)
> 
> [v2]
> Jan Kara has pointed out that if we add the
> sb_start/end_pagefault pair in the new pfn_mkwrite we
> are then fixing another bug where: A user could start
> writing to the page while filesystem is frozen.
  The patch looks good to me. You can add:
Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> CC: Jan Kara <jack@suse.cz>
> CC: Matthew Wilcox <matthew.r.wilcox@intel.com>
> CC: Dave Chinner <david@fromorbit.com>
> CC: Stable Tree <stable@vger.kernel.org>
> 
> Signed-off-by: Yigal Korman <yigal@plexistor.com>
> Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
> ---
>  fs/dax.c           | 17 +++++++++++++++++
>  fs/ext2/file.c     |  1 +
>  fs/ext4/file.c     |  1 +
>  include/linux/fs.h |  1 +
>  4 files changed, 20 insertions(+)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index ed1619e..d0bd1f4 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -464,6 +464,23 @@ int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
>  EXPORT_SYMBOL_GPL(dax_fault);
>  
>  /**
> + * dax_pfn_mkwrite - handle first write to DAX page
> + * @vma: The virtual memory area where the fault occurred
> + * @vmf: The description of the fault
> + *
> + */
> +int dax_pfn_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> +{
> +	struct super_block *sb = file_inode(vma->vm_file)->i_sb;
> +
> +	sb_start_pagefault(sb);
> +	file_update_time(vma->vm_file);
> +	sb_end_pagefault(sb);
> +	return VM_FAULT_NOPAGE;
> +}
> +EXPORT_SYMBOL_GPL(dax_pfn_mkwrite);
> +
> +/**
>   * dax_zero_page_range - zero a range within a page of a DAX file
>   * @inode: The file being truncated
>   * @from: The file offset that is being truncated to
> diff --git a/fs/ext2/file.c b/fs/ext2/file.c
> index e317017..866a3ce 100644
> --- a/fs/ext2/file.c
> +++ b/fs/ext2/file.c
> @@ -39,6 +39,7 @@ static int ext2_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
>  static const struct vm_operations_struct ext2_dax_vm_ops = {
>  	.fault		= ext2_dax_fault,
>  	.page_mkwrite	= ext2_dax_mkwrite,
> +	.pfn_mkwrite	= dax_pfn_mkwrite,
>  };
>  
>  static int ext2_file_mmap(struct file *file, struct vm_area_struct *vma)
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index 598abbb..aa78c70 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -206,6 +206,7 @@ static int ext4_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
>  static const struct vm_operations_struct ext4_dax_vm_ops = {
>  	.fault		= ext4_dax_fault,
>  	.page_mkwrite	= ext4_dax_mkwrite,
> +	.pfn_mkwrite	= dax_pfn_mkwrite,
>  };
>  #else
>  #define ext4_dax_vm_ops	ext4_file_vm_ops
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 368e349..394035f 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2628,6 +2628,7 @@ int dax_clear_blocks(struct inode *, sector_t block, long size);
>  int dax_zero_page_range(struct inode *, loff_t from, unsigned len, get_block_t);
>  int dax_truncate_page(struct inode *, loff_t from, get_block_t);
>  int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t);
> +int dax_pfn_mkwrite(struct vm_area_struct *, struct vm_fault *);
>  #define dax_mkwrite(vma, vmf, gb)	dax_fault(vma, vmf, gb)
>  
>  #ifdef CONFIG_BLOCK
> -- 
> 1.9.3
> 
> 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 2/3] dax: use pfn_mkwrite to update c/mtime + freeze protection
@ 2015-04-07 16:28     ` Jan Kara
  0 siblings, 0 replies; 33+ messages in thread
From: Jan Kara @ 2015-04-07 16:28 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Dave Chinner, Matthew Wilcox, Andrew Morton, Kirill A. Shutemov,
	Jan Kara, Hugh Dickins, Mel Gorman, linux-mm, linux-nvdimm,
	linux-fsdevel, Eryu Guan, Christoph Hellwig, Stable Tree

On Tue 07-04-15 11:43:50, Boaz Harrosh wrote:
> From: Yigal Korman <yigal@plexistor.com>
> 
> [v1]
> Without this patch, c/mtime is not updated correctly when mmap'ed page is
> first read from and then written to.
> 
> A new xfstest is submitted for testing this (generic/080)
> 
> [v2]
> Jan Kara has pointed out that if we add the
> sb_start/end_pagefault pair in the new pfn_mkwrite we
> are then fixing another bug where: A user could start
> writing to the page while filesystem is frozen.
  The patch looks good to me. You can add:
Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> CC: Jan Kara <jack@suse.cz>
> CC: Matthew Wilcox <matthew.r.wilcox@intel.com>
> CC: Dave Chinner <david@fromorbit.com>
> CC: Stable Tree <stable@vger.kernel.org>
> 
> Signed-off-by: Yigal Korman <yigal@plexistor.com>
> Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
> ---
>  fs/dax.c           | 17 +++++++++++++++++
>  fs/ext2/file.c     |  1 +
>  fs/ext4/file.c     |  1 +
>  include/linux/fs.h |  1 +
>  4 files changed, 20 insertions(+)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index ed1619e..d0bd1f4 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -464,6 +464,23 @@ int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
>  EXPORT_SYMBOL_GPL(dax_fault);
>  
>  /**
> + * dax_pfn_mkwrite - handle first write to DAX page
> + * @vma: The virtual memory area where the fault occurred
> + * @vmf: The description of the fault
> + *
> + */
> +int dax_pfn_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> +{
> +	struct super_block *sb = file_inode(vma->vm_file)->i_sb;
> +
> +	sb_start_pagefault(sb);
> +	file_update_time(vma->vm_file);
> +	sb_end_pagefault(sb);
> +	return VM_FAULT_NOPAGE;
> +}
> +EXPORT_SYMBOL_GPL(dax_pfn_mkwrite);
> +
> +/**
>   * dax_zero_page_range - zero a range within a page of a DAX file
>   * @inode: The file being truncated
>   * @from: The file offset that is being truncated to
> diff --git a/fs/ext2/file.c b/fs/ext2/file.c
> index e317017..866a3ce 100644
> --- a/fs/ext2/file.c
> +++ b/fs/ext2/file.c
> @@ -39,6 +39,7 @@ static int ext2_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
>  static const struct vm_operations_struct ext2_dax_vm_ops = {
>  	.fault		= ext2_dax_fault,
>  	.page_mkwrite	= ext2_dax_mkwrite,
> +	.pfn_mkwrite	= dax_pfn_mkwrite,
>  };
>  
>  static int ext2_file_mmap(struct file *file, struct vm_area_struct *vma)
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index 598abbb..aa78c70 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -206,6 +206,7 @@ static int ext4_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
>  static const struct vm_operations_struct ext4_dax_vm_ops = {
>  	.fault		= ext4_dax_fault,
>  	.page_mkwrite	= ext4_dax_mkwrite,
> +	.pfn_mkwrite	= dax_pfn_mkwrite,
>  };
>  #else
>  #define ext4_dax_vm_ops	ext4_file_vm_ops
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 368e349..394035f 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2628,6 +2628,7 @@ int dax_clear_blocks(struct inode *, sector_t block, long size);
>  int dax_zero_page_range(struct inode *, loff_t from, unsigned len, get_block_t);
>  int dax_truncate_page(struct inode *, loff_t from, get_block_t);
>  int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t);
> +int dax_pfn_mkwrite(struct vm_area_struct *, struct vm_fault *);
>  #define dax_mkwrite(vma, vmf, gb)	dax_fault(vma, vmf, gb)
>  
>  #ifdef CONFIG_BLOCK
> -- 
> 1.9.3
> 
> 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH 1/3 @stable] mm(v4.0): New pfn_mkwrite same as page_mkwrite for VM_PFNMAP
  2015-04-07  8:33 ` Boaz Harrosh
@ 2015-04-08 15:56   ` Boaz Harrosh
  -1 siblings, 0 replies; 33+ messages in thread
From: Boaz Harrosh @ 2015-04-08 15:56 UTC (permalink / raw)
  To: Stable Tree
  Cc: Dave Chinner, Matthew Wilcox, Andrew Morton, Kirill A. Shutemov,
	Jan Kara, Hugh Dickins, Mel Gorman, linux-mm, linux-nvdimm,
	linux-fsdevel, Eryu Guan, Christoph Hellwig

From: Yigal Korman <yigal@plexistor.com>

[For Stable 4.0.X]
The parallel patch at 4.1-rc1 to this patch is:
  Subject: mm: new pfn_mkwrite same as page_mkwrite for VM_PFNMAP

We need this patch for the 4.0.X stable tree if the patch
  Subject: dax: use pfn_mkwrite to update c/mtime + freeze protection

Was decided to be pulled into stable since it is a dependency
of this patch. The file mm/memory.c was heavily changed in 4.1
hence this here.

[v3]
In the case of !pte_same when we lost the race better
return 0 instead of FAULT_NO_PAGE

[v2]
Fixed according to Kirill's comments

[v1]
This will allow FS that uses VM_PFNMAP | VM_MIXEDMAP (no page structs)
to get notified when access is a write to a read-only PFN.

This can happen if we mmap() a file then first mmap-read from it
to page-in a read-only PFN, than we mmap-write to the same page.

We need this functionality to fix a DAX bug, where in the scenario
above we fail to set ctime/mtime though we modified the file.
An xfstest is attached to this patchset that shows the failure
and the fix. (A DAX patch will follow)

This functionality is extra important for us, because upon
dirtying of a pmem page we also want to RDMA the page to a
remote cluster node.

We define a new pfn_mkwrite and do not reuse page_mkwrite because
  1 - The name ;-)
  2 - But mainly because it would take a very long and tedious
      audit of all page_mkwrite functions of VM_MIXEDMAP/VM_PFNMAP
      users. To make sure they do not now CRASH. For example current
      DAX code (which this is for) would crash.
      If we would want to reuse page_mkwrite, We will need to first
      patch all users, so to not-crash-on-no-page. Then enable this
      patch. But even if I did that I would not sleep so well at night.
      Adding a new vector is the safest thing to do, and is not that
      expensive. an extra pointer at a static function vector per driver.
      Also the new vector is better for performance, because else we
      Will call all current Kernel vectors, so to:
	check-ha-no-page-do-nothing and return.

No need to call it from do_shared_fault because do_wp_page is called to
change pte permissions anyway.

Signed-off-by: Yigal Korman <yigal@plexistor.com>
Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
CC: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
CC: Matthew Wilcox <matthew.r.wilcox@intel.com>
CC: Jan Kara <jack@suse.cz>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Hugh Dickins <hughd@google.com>
CC: Mel Gorman <mgorman@suse.de>
CC: Konstantin Khlebnikov <koct9i@gmail.com>
CC: linux-mm@kvack.org
CC: Stable Tree <stable@vger.kernel.org>
---
 Documentation/filesystems/Locking |  8 ++++++++
 include/linux/mm.h                |  3 +++
 mm/memory.c                       | 27 ++++++++++++++++++++++++++-
 3 files changed, 37 insertions(+), 1 deletion(-)

diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
index f91926f..25f36e6 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -525,6 +525,7 @@ prototypes:
 	void (*close)(struct vm_area_struct*);
 	int (*fault)(struct vm_area_struct*, struct vm_fault *);
 	int (*page_mkwrite)(struct vm_area_struct *, struct vm_fault *);
+	int (*pfn_mkwrite)(struct vm_area_struct *, struct vm_fault *);
 	int (*access)(struct vm_area_struct *, unsigned long, void*, int, int);
 
 locking rules:
@@ -534,6 +535,7 @@ close:		yes
 fault:		yes		can return with page locked
 map_pages:	yes
 page_mkwrite:	yes		can return with page locked
+pfn_mkwrite:	yes
 access:		yes
 
 	->fault() is called when a previously not present pte is about
@@ -560,6 +562,12 @@ the page has been truncated, the filesystem should not look up a new page
 like the ->fault() handler, but simply return with VM_FAULT_NOPAGE, which
 will cause the VM to retry the fault.
 
+	->pfn_mkwrite() is the same as page_mkwrite but when the pte is
+VM_PFNMAP or VM_MIXEDMAP with a page-less entry. Expected return is
+VM_FAULT_NOPAGE. Or one of the VM_FAULT_ERROR types. The default behavior
+after this call is to make the pte read-write, unless pfn_mkwrite()
+already touched the pte, in that case it is untouched.
+
 	->access() is called when get_user_pages() fails in
 access_process_vm(), typically used to debug a process through
 /proc/pid/mem or ptrace.  This function is needed only for
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 47a9392..85ba9c2 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -251,6 +251,9 @@ struct vm_operations_struct {
 	 * writable, if an error is returned it will cause a SIGBUS */
 	int (*page_mkwrite)(struct vm_area_struct *vma, struct vm_fault *vmf);
 
+	/* same as page_mkwrite when using VM_PFNMAP|VM_MIXEDMAP */
+	int (*pfn_mkwrite)(struct vm_area_struct *vma, struct vm_fault *vmf);
+
 	/* called by access_process_vm when get_user_pages() fails, typically
 	 * for use by special VMAs that can switch between memory and hardware
 	 */
diff --git a/mm/memory.c b/mm/memory.c
index 97839f5..6029777 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1982,6 +1982,18 @@ static int do_page_mkwrite(struct vm_area_struct *vma, struct page *page,
 	return ret;
 }
 
+static int do_pfn_mkwrite(struct vm_area_struct *vma, unsigned long address)
+{
+	struct vm_fault vmf = {
+		.page = NULL,
+		.pgoff = linear_page_index(vma, address),
+		.virtual_address = (void __user *)(address & PAGE_MASK),
+		.flags = FAULT_FLAG_WRITE | FAULT_FLAG_MKWRITE,
+	};
+
+	return vma->vm_ops->pfn_mkwrite(vma, &vmf);
+}
+
 /*
  * This routine handles present pages, when users try to write
  * to a shared page. It is done by copying the page to a new address
@@ -2025,8 +2037,21 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		 * accounting on raw pfn maps.
 		 */
 		if ((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
-				     (VM_WRITE|VM_SHARED))
+				     (VM_WRITE|VM_SHARED)) {
+			if (vma->vm_ops && vma->vm_ops->pfn_mkwrite) {
+				pte_unmap_unlock(page_table, ptl);
+				ret = do_pfn_mkwrite(vma, address);
+				if (ret & VM_FAULT_ERROR)
+					return ret;
+				page_table = pte_offset_map_lock(mm, pmd,
+								 address, &ptl);
+				if (!pte_same(*page_table, orig_pte)) {
+					ret = 0;
+					goto unlock;
+				}
+			}
 			goto reuse;
+		}
 		goto gotten;
 	}
 
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 1/3 @stable] mm(v4.0): New pfn_mkwrite same as page_mkwrite for VM_PFNMAP
@ 2015-04-08 15:56   ` Boaz Harrosh
  0 siblings, 0 replies; 33+ messages in thread
From: Boaz Harrosh @ 2015-04-08 15:56 UTC (permalink / raw)
  To: Stable Tree
  Cc: Dave Chinner, Matthew Wilcox, Andrew Morton, Kirill A. Shutemov,
	Jan Kara, Hugh Dickins, Mel Gorman, linux-mm, linux-nvdimm,
	linux-fsdevel, Eryu Guan, Christoph Hellwig

From: Yigal Korman <yigal@plexistor.com>

[For Stable 4.0.X]
The parallel patch at 4.1-rc1 to this patch is:
  Subject: mm: new pfn_mkwrite same as page_mkwrite for VM_PFNMAP

We need this patch for the 4.0.X stable tree if the patch
  Subject: dax: use pfn_mkwrite to update c/mtime + freeze protection

Was decided to be pulled into stable since it is a dependency
of this patch. The file mm/memory.c was heavily changed in 4.1
hence this here.

[v3]
In the case of !pte_same when we lost the race better
return 0 instead of FAULT_NO_PAGE

[v2]
Fixed according to Kirill's comments

[v1]
This will allow FS that uses VM_PFNMAP | VM_MIXEDMAP (no page structs)
to get notified when access is a write to a read-only PFN.

This can happen if we mmap() a file then first mmap-read from it
to page-in a read-only PFN, than we mmap-write to the same page.

We need this functionality to fix a DAX bug, where in the scenario
above we fail to set ctime/mtime though we modified the file.
An xfstest is attached to this patchset that shows the failure
and the fix. (A DAX patch will follow)

This functionality is extra important for us, because upon
dirtying of a pmem page we also want to RDMA the page to a
remote cluster node.

We define a new pfn_mkwrite and do not reuse page_mkwrite because
  1 - The name ;-)
  2 - But mainly because it would take a very long and tedious
      audit of all page_mkwrite functions of VM_MIXEDMAP/VM_PFNMAP
      users. To make sure they do not now CRASH. For example current
      DAX code (which this is for) would crash.
      If we would want to reuse page_mkwrite, We will need to first
      patch all users, so to not-crash-on-no-page. Then enable this
      patch. But even if I did that I would not sleep so well at night.
      Adding a new vector is the safest thing to do, and is not that
      expensive. an extra pointer at a static function vector per driver.
      Also the new vector is better for performance, because else we
      Will call all current Kernel vectors, so to:
	check-ha-no-page-do-nothing and return.

No need to call it from do_shared_fault because do_wp_page is called to
change pte permissions anyway.

Signed-off-by: Yigal Korman <yigal@plexistor.com>
Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
CC: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
CC: Matthew Wilcox <matthew.r.wilcox@intel.com>
CC: Jan Kara <jack@suse.cz>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Hugh Dickins <hughd@google.com>
CC: Mel Gorman <mgorman@suse.de>
CC: Konstantin Khlebnikov <koct9i@gmail.com>
CC: linux-mm@kvack.org
CC: Stable Tree <stable@vger.kernel.org>
---
 Documentation/filesystems/Locking |  8 ++++++++
 include/linux/mm.h                |  3 +++
 mm/memory.c                       | 27 ++++++++++++++++++++++++++-
 3 files changed, 37 insertions(+), 1 deletion(-)

diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
index f91926f..25f36e6 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -525,6 +525,7 @@ prototypes:
 	void (*close)(struct vm_area_struct*);
 	int (*fault)(struct vm_area_struct*, struct vm_fault *);
 	int (*page_mkwrite)(struct vm_area_struct *, struct vm_fault *);
+	int (*pfn_mkwrite)(struct vm_area_struct *, struct vm_fault *);
 	int (*access)(struct vm_area_struct *, unsigned long, void*, int, int);
 
 locking rules:
@@ -534,6 +535,7 @@ close:		yes
 fault:		yes		can return with page locked
 map_pages:	yes
 page_mkwrite:	yes		can return with page locked
+pfn_mkwrite:	yes
 access:		yes
 
 	->fault() is called when a previously not present pte is about
@@ -560,6 +562,12 @@ the page has been truncated, the filesystem should not look up a new page
 like the ->fault() handler, but simply return with VM_FAULT_NOPAGE, which
 will cause the VM to retry the fault.
 
+	->pfn_mkwrite() is the same as page_mkwrite but when the pte is
+VM_PFNMAP or VM_MIXEDMAP with a page-less entry. Expected return is
+VM_FAULT_NOPAGE. Or one of the VM_FAULT_ERROR types. The default behavior
+after this call is to make the pte read-write, unless pfn_mkwrite()
+already touched the pte, in that case it is untouched.
+
 	->access() is called when get_user_pages() fails in
 access_process_vm(), typically used to debug a process through
 /proc/pid/mem or ptrace.  This function is needed only for
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 47a9392..85ba9c2 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -251,6 +251,9 @@ struct vm_operations_struct {
 	 * writable, if an error is returned it will cause a SIGBUS */
 	int (*page_mkwrite)(struct vm_area_struct *vma, struct vm_fault *vmf);
 
+	/* same as page_mkwrite when using VM_PFNMAP|VM_MIXEDMAP */
+	int (*pfn_mkwrite)(struct vm_area_struct *vma, struct vm_fault *vmf);
+
 	/* called by access_process_vm when get_user_pages() fails, typically
 	 * for use by special VMAs that can switch between memory and hardware
 	 */
diff --git a/mm/memory.c b/mm/memory.c
index 97839f5..6029777 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1982,6 +1982,18 @@ static int do_page_mkwrite(struct vm_area_struct *vma, struct page *page,
 	return ret;
 }
 
+static int do_pfn_mkwrite(struct vm_area_struct *vma, unsigned long address)
+{
+	struct vm_fault vmf = {
+		.page = NULL,
+		.pgoff = linear_page_index(vma, address),
+		.virtual_address = (void __user *)(address & PAGE_MASK),
+		.flags = FAULT_FLAG_WRITE | FAULT_FLAG_MKWRITE,
+	};
+
+	return vma->vm_ops->pfn_mkwrite(vma, &vmf);
+}
+
 /*
  * This routine handles present pages, when users try to write
  * to a shared page. It is done by copying the page to a new address
@@ -2025,8 +2037,21 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		 * accounting on raw pfn maps.
 		 */
 		if ((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
-				     (VM_WRITE|VM_SHARED))
+				     (VM_WRITE|VM_SHARED)) {
+			if (vma->vm_ops && vma->vm_ops->pfn_mkwrite) {
+				pte_unmap_unlock(page_table, ptl);
+				ret = do_pfn_mkwrite(vma, address);
+				if (ret & VM_FAULT_ERROR)
+					return ret;
+				page_table = pte_offset_map_lock(mm, pmd,
+								 address, &ptl);
+				if (!pte_same(*page_table, orig_pte)) {
+					ret = 0;
+					goto unlock;
+				}
+			}
 			goto reuse;
+		}
 		goto gotten;
 	}
 
-- 
1.9.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH 1/3 @stable] mm(v4.0): New pfn_mkwrite same as page_mkwrite for VM_PFNMAP
  2015-04-08 15:56   ` Boaz Harrosh
@ 2015-04-08 16:00     ` Boaz Harrosh
  -1 siblings, 0 replies; 33+ messages in thread
From: Boaz Harrosh @ 2015-04-08 16:00 UTC (permalink / raw)
  To: Stable Tree
  Cc: Dave Chinner, Matthew Wilcox, Andrew Morton, Kirill A. Shutemov,
	Jan Kara, Hugh Dickins, Mel Gorman, linux-mm, linux-nvdimm,
	linux-fsdevel, Eryu Guan, Christoph Hellwig

On 04/08/2015 06:56 PM, Boaz Harrosh wrote:
> From: Yigal Korman <yigal@plexistor.com>
> 
> [For Stable 4.0.X]
> The parallel patch at 4.1-rc1 to this patch is:
>   Subject: mm: new pfn_mkwrite same as page_mkwrite for VM_PFNMAP
> 
> We need this patch for the 4.0.X stable tree if the patch
>   Subject: dax: use pfn_mkwrite to update c/mtime + freeze protection
> 
> Was decided to be pulled into stable since it is a dependency
> of this patch. The file mm/memory.c was heavily changed in 4.1
> hence this here.
> 

I forgot to send this patch for the stables tree, 4.0 only.

Again this one is only needed if we are truing to pull
   Subject: dax: use pfn_mkwrite to update c/mtime + freeze protection

Which has the Stable@ tag. The problem it fixes is minor and might
be skipped if causes problems.

Thanks
Boaz

> [v3]
> In the case of !pte_same when we lost the race better
> return 0 instead of FAULT_NO_PAGE
> 
> [v2]
> Fixed according to Kirill's comments
> 
> [v1]
> This will allow FS that uses VM_PFNMAP | VM_MIXEDMAP (no page structs)
> to get notified when access is a write to a read-only PFN.
> 
> This can happen if we mmap() a file then first mmap-read from it
> to page-in a read-only PFN, than we mmap-write to the same page.
> 
> We need this functionality to fix a DAX bug, where in the scenario
> above we fail to set ctime/mtime though we modified the file.
> An xfstest is attached to this patchset that shows the failure
> and the fix. (A DAX patch will follow)
> 
> This functionality is extra important for us, because upon
> dirtying of a pmem page we also want to RDMA the page to a
> remote cluster node.
> 
> We define a new pfn_mkwrite and do not reuse page_mkwrite because
>   1 - The name ;-)
>   2 - But mainly because it would take a very long and tedious
>       audit of all page_mkwrite functions of VM_MIXEDMAP/VM_PFNMAP
>       users. To make sure they do not now CRASH. For example current
>       DAX code (which this is for) would crash.
>       If we would want to reuse page_mkwrite, We will need to first
>       patch all users, so to not-crash-on-no-page. Then enable this
>       patch. But even if I did that I would not sleep so well at night.
>       Adding a new vector is the safest thing to do, and is not that
>       expensive. an extra pointer at a static function vector per driver.
>       Also the new vector is better for performance, because else we
>       Will call all current Kernel vectors, so to:
> 	check-ha-no-page-do-nothing and return.
> 
> No need to call it from do_shared_fault because do_wp_page is called to
> change pte permissions anyway.
> 
> Signed-off-by: Yigal Korman <yigal@plexistor.com>
> Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
> CC: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> CC: Matthew Wilcox <matthew.r.wilcox@intel.com>
> CC: Jan Kara <jack@suse.cz>
> CC: Andrew Morton <akpm@linux-foundation.org>
> CC: Hugh Dickins <hughd@google.com>
> CC: Mel Gorman <mgorman@suse.de>
> CC: Konstantin Khlebnikov <koct9i@gmail.com>
> CC: linux-mm@kvack.org
> CC: Stable Tree <stable@vger.kernel.org>
> ---
>  Documentation/filesystems/Locking |  8 ++++++++
>  include/linux/mm.h                |  3 +++
>  mm/memory.c                       | 27 ++++++++++++++++++++++++++-
>  3 files changed, 37 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
> index f91926f..25f36e6 100644
> --- a/Documentation/filesystems/Locking
> +++ b/Documentation/filesystems/Locking
> @@ -525,6 +525,7 @@ prototypes:
>  	void (*close)(struct vm_area_struct*);
>  	int (*fault)(struct vm_area_struct*, struct vm_fault *);
>  	int (*page_mkwrite)(struct vm_area_struct *, struct vm_fault *);
> +	int (*pfn_mkwrite)(struct vm_area_struct *, struct vm_fault *);
>  	int (*access)(struct vm_area_struct *, unsigned long, void*, int, int);
>  
>  locking rules:
> @@ -534,6 +535,7 @@ close:		yes
>  fault:		yes		can return with page locked
>  map_pages:	yes
>  page_mkwrite:	yes		can return with page locked
> +pfn_mkwrite:	yes
>  access:		yes
>  
>  	->fault() is called when a previously not present pte is about
> @@ -560,6 +562,12 @@ the page has been truncated, the filesystem should not look up a new page
>  like the ->fault() handler, but simply return with VM_FAULT_NOPAGE, which
>  will cause the VM to retry the fault.
>  
> +	->pfn_mkwrite() is the same as page_mkwrite but when the pte is
> +VM_PFNMAP or VM_MIXEDMAP with a page-less entry. Expected return is
> +VM_FAULT_NOPAGE. Or one of the VM_FAULT_ERROR types. The default behavior
> +after this call is to make the pte read-write, unless pfn_mkwrite()
> +already touched the pte, in that case it is untouched.
> +
>  	->access() is called when get_user_pages() fails in
>  access_process_vm(), typically used to debug a process through
>  /proc/pid/mem or ptrace.  This function is needed only for
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 47a9392..85ba9c2 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -251,6 +251,9 @@ struct vm_operations_struct {
>  	 * writable, if an error is returned it will cause a SIGBUS */
>  	int (*page_mkwrite)(struct vm_area_struct *vma, struct vm_fault *vmf);
>  
> +	/* same as page_mkwrite when using VM_PFNMAP|VM_MIXEDMAP */
> +	int (*pfn_mkwrite)(struct vm_area_struct *vma, struct vm_fault *vmf);
> +
>  	/* called by access_process_vm when get_user_pages() fails, typically
>  	 * for use by special VMAs that can switch between memory and hardware
>  	 */
> diff --git a/mm/memory.c b/mm/memory.c
> index 97839f5..6029777 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1982,6 +1982,18 @@ static int do_page_mkwrite(struct vm_area_struct *vma, struct page *page,
>  	return ret;
>  }
>  
> +static int do_pfn_mkwrite(struct vm_area_struct *vma, unsigned long address)
> +{
> +	struct vm_fault vmf = {
> +		.page = NULL,
> +		.pgoff = linear_page_index(vma, address),
> +		.virtual_address = (void __user *)(address & PAGE_MASK),
> +		.flags = FAULT_FLAG_WRITE | FAULT_FLAG_MKWRITE,
> +	};
> +
> +	return vma->vm_ops->pfn_mkwrite(vma, &vmf);
> +}
> +
>  /*
>   * This routine handles present pages, when users try to write
>   * to a shared page. It is done by copying the page to a new address
> @@ -2025,8 +2037,21 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  		 * accounting on raw pfn maps.
>  		 */
>  		if ((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
> -				     (VM_WRITE|VM_SHARED))
> +				     (VM_WRITE|VM_SHARED)) {
> +			if (vma->vm_ops && vma->vm_ops->pfn_mkwrite) {
> +				pte_unmap_unlock(page_table, ptl);
> +				ret = do_pfn_mkwrite(vma, address);
> +				if (ret & VM_FAULT_ERROR)
> +					return ret;
> +				page_table = pte_offset_map_lock(mm, pmd,
> +								 address, &ptl);
> +				if (!pte_same(*page_table, orig_pte)) {
> +					ret = 0;
> +					goto unlock;
> +				}
> +			}
>  			goto reuse;
> +		}
>  		goto gotten;
>  	}
>  
> 

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 1/3 @stable] mm(v4.0): New pfn_mkwrite same as page_mkwrite for VM_PFNMAP
@ 2015-04-08 16:00     ` Boaz Harrosh
  0 siblings, 0 replies; 33+ messages in thread
From: Boaz Harrosh @ 2015-04-08 16:00 UTC (permalink / raw)
  To: Stable Tree
  Cc: Dave Chinner, Matthew Wilcox, Andrew Morton, Kirill A. Shutemov,
	Jan Kara, Hugh Dickins, Mel Gorman, linux-mm, linux-nvdimm,
	linux-fsdevel, Eryu Guan, Christoph Hellwig

On 04/08/2015 06:56 PM, Boaz Harrosh wrote:
> From: Yigal Korman <yigal@plexistor.com>
> 
> [For Stable 4.0.X]
> The parallel patch at 4.1-rc1 to this patch is:
>   Subject: mm: new pfn_mkwrite same as page_mkwrite for VM_PFNMAP
> 
> We need this patch for the 4.0.X stable tree if the patch
>   Subject: dax: use pfn_mkwrite to update c/mtime + freeze protection
> 
> Was decided to be pulled into stable since it is a dependency
> of this patch. The file mm/memory.c was heavily changed in 4.1
> hence this here.
> 

I forgot to send this patch for the stables tree, 4.0 only.

Again this one is only needed if we are truing to pull
   Subject: dax: use pfn_mkwrite to update c/mtime + freeze protection

Which has the Stable@ tag. The problem it fixes is minor and might
be skipped if causes problems.

Thanks
Boaz

> [v3]
> In the case of !pte_same when we lost the race better
> return 0 instead of FAULT_NO_PAGE
> 
> [v2]
> Fixed according to Kirill's comments
> 
> [v1]
> This will allow FS that uses VM_PFNMAP | VM_MIXEDMAP (no page structs)
> to get notified when access is a write to a read-only PFN.
> 
> This can happen if we mmap() a file then first mmap-read from it
> to page-in a read-only PFN, than we mmap-write to the same page.
> 
> We need this functionality to fix a DAX bug, where in the scenario
> above we fail to set ctime/mtime though we modified the file.
> An xfstest is attached to this patchset that shows the failure
> and the fix. (A DAX patch will follow)
> 
> This functionality is extra important for us, because upon
> dirtying of a pmem page we also want to RDMA the page to a
> remote cluster node.
> 
> We define a new pfn_mkwrite and do not reuse page_mkwrite because
>   1 - The name ;-)
>   2 - But mainly because it would take a very long and tedious
>       audit of all page_mkwrite functions of VM_MIXEDMAP/VM_PFNMAP
>       users. To make sure they do not now CRASH. For example current
>       DAX code (which this is for) would crash.
>       If we would want to reuse page_mkwrite, We will need to first
>       patch all users, so to not-crash-on-no-page. Then enable this
>       patch. But even if I did that I would not sleep so well at night.
>       Adding a new vector is the safest thing to do, and is not that
>       expensive. an extra pointer at a static function vector per driver.
>       Also the new vector is better for performance, because else we
>       Will call all current Kernel vectors, so to:
> 	check-ha-no-page-do-nothing and return.
> 
> No need to call it from do_shared_fault because do_wp_page is called to
> change pte permissions anyway.
> 
> Signed-off-by: Yigal Korman <yigal@plexistor.com>
> Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
> CC: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> CC: Matthew Wilcox <matthew.r.wilcox@intel.com>
> CC: Jan Kara <jack@suse.cz>
> CC: Andrew Morton <akpm@linux-foundation.org>
> CC: Hugh Dickins <hughd@google.com>
> CC: Mel Gorman <mgorman@suse.de>
> CC: Konstantin Khlebnikov <koct9i@gmail.com>
> CC: linux-mm@kvack.org
> CC: Stable Tree <stable@vger.kernel.org>
> ---
>  Documentation/filesystems/Locking |  8 ++++++++
>  include/linux/mm.h                |  3 +++
>  mm/memory.c                       | 27 ++++++++++++++++++++++++++-
>  3 files changed, 37 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
> index f91926f..25f36e6 100644
> --- a/Documentation/filesystems/Locking
> +++ b/Documentation/filesystems/Locking
> @@ -525,6 +525,7 @@ prototypes:
>  	void (*close)(struct vm_area_struct*);
>  	int (*fault)(struct vm_area_struct*, struct vm_fault *);
>  	int (*page_mkwrite)(struct vm_area_struct *, struct vm_fault *);
> +	int (*pfn_mkwrite)(struct vm_area_struct *, struct vm_fault *);
>  	int (*access)(struct vm_area_struct *, unsigned long, void*, int, int);
>  
>  locking rules:
> @@ -534,6 +535,7 @@ close:		yes
>  fault:		yes		can return with page locked
>  map_pages:	yes
>  page_mkwrite:	yes		can return with page locked
> +pfn_mkwrite:	yes
>  access:		yes
>  
>  	->fault() is called when a previously not present pte is about
> @@ -560,6 +562,12 @@ the page has been truncated, the filesystem should not look up a new page
>  like the ->fault() handler, but simply return with VM_FAULT_NOPAGE, which
>  will cause the VM to retry the fault.
>  
> +	->pfn_mkwrite() is the same as page_mkwrite but when the pte is
> +VM_PFNMAP or VM_MIXEDMAP with a page-less entry. Expected return is
> +VM_FAULT_NOPAGE. Or one of the VM_FAULT_ERROR types. The default behavior
> +after this call is to make the pte read-write, unless pfn_mkwrite()
> +already touched the pte, in that case it is untouched.
> +
>  	->access() is called when get_user_pages() fails in
>  access_process_vm(), typically used to debug a process through
>  /proc/pid/mem or ptrace.  This function is needed only for
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 47a9392..85ba9c2 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -251,6 +251,9 @@ struct vm_operations_struct {
>  	 * writable, if an error is returned it will cause a SIGBUS */
>  	int (*page_mkwrite)(struct vm_area_struct *vma, struct vm_fault *vmf);
>  
> +	/* same as page_mkwrite when using VM_PFNMAP|VM_MIXEDMAP */
> +	int (*pfn_mkwrite)(struct vm_area_struct *vma, struct vm_fault *vmf);
> +
>  	/* called by access_process_vm when get_user_pages() fails, typically
>  	 * for use by special VMAs that can switch between memory and hardware
>  	 */
> diff --git a/mm/memory.c b/mm/memory.c
> index 97839f5..6029777 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1982,6 +1982,18 @@ static int do_page_mkwrite(struct vm_area_struct *vma, struct page *page,
>  	return ret;
>  }
>  
> +static int do_pfn_mkwrite(struct vm_area_struct *vma, unsigned long address)
> +{
> +	struct vm_fault vmf = {
> +		.page = NULL,
> +		.pgoff = linear_page_index(vma, address),
> +		.virtual_address = (void __user *)(address & PAGE_MASK),
> +		.flags = FAULT_FLAG_WRITE | FAULT_FLAG_MKWRITE,
> +	};
> +
> +	return vma->vm_ops->pfn_mkwrite(vma, &vmf);
> +}
> +
>  /*
>   * This routine handles present pages, when users try to write
>   * to a shared page. It is done by copying the page to a new address
> @@ -2025,8 +2037,21 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  		 * accounting on raw pfn maps.
>  		 */
>  		if ((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
> -				     (VM_WRITE|VM_SHARED))
> +				     (VM_WRITE|VM_SHARED)) {
> +			if (vma->vm_ops && vma->vm_ops->pfn_mkwrite) {
> +				pte_unmap_unlock(page_table, ptl);
> +				ret = do_pfn_mkwrite(vma, address);
> +				if (ret & VM_FAULT_ERROR)
> +					return ret;
> +				page_table = pte_offset_map_lock(mm, pmd,
> +								 address, &ptl);
> +				if (!pte_same(*page_table, orig_pte)) {
> +					ret = 0;
> +					goto unlock;
> +				}
> +			}
>  			goto reuse;
> +		}
>  		goto gotten;
>  	}
>  
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 1/3 @stable] mm(v4.0): New pfn_mkwrite same as page_mkwrite for VM_PFNMAP
  2015-04-08 16:00     ` Boaz Harrosh
  (?)
@ 2015-04-08 20:26     ` Greg KH
  2015-04-12  7:49         ` Boaz Harrosh
  -1 siblings, 1 reply; 33+ messages in thread
From: Greg KH @ 2015-04-08 20:26 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Stable Tree, Dave Chinner, Matthew Wilcox, Andrew Morton,
	Kirill A. Shutemov, Jan Kara, Hugh Dickins, Mel Gorman, linux-mm,
	linux-nvdimm, linux-fsdevel, Eryu Guan, Christoph Hellwig

On Wed, Apr 08, 2015 at 07:00:37PM +0300, Boaz Harrosh wrote:
> On 04/08/2015 06:56 PM, Boaz Harrosh wrote:
> > From: Yigal Korman <yigal@plexistor.com>
> > 
> > [For Stable 4.0.X]
> > The parallel patch at 4.1-rc1 to this patch is:
> >   Subject: mm: new pfn_mkwrite same as page_mkwrite for VM_PFNMAP
> > 
> > We need this patch for the 4.0.X stable tree if the patch
> >   Subject: dax: use pfn_mkwrite to update c/mtime + freeze protection
> > 
> > Was decided to be pulled into stable since it is a dependency
> > of this patch. The file mm/memory.c was heavily changed in 4.1
> > hence this here.
> > 
> 
> I forgot to send this patch for the stables tree, 4.0 only.
> 
> Again this one is only needed if we are truing to pull
>    Subject: dax: use pfn_mkwrite to update c/mtime + freeze protection
> 
> Which has the Stable@ tag. The problem it fixes is minor and might
> be skipped if causes problems.

I can't take patches in the stable tree that are not in Linus's tree
also.  Why can't I just take a corrisponding patch that is already in
Linus's tree, why do we need something "special" here?

thanks,

greg k-h

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 1/3 @stable] mm(v4.0): New pfn_mkwrite same as page_mkwrite for VM_PFNMAP
  2015-04-08 20:26     ` Greg KH
@ 2015-04-12  7:49         ` Boaz Harrosh
  0 siblings, 0 replies; 33+ messages in thread
From: Boaz Harrosh @ 2015-04-12  7:49 UTC (permalink / raw)
  To: Greg KH
  Cc: Stable Tree, Dave Chinner, Matthew Wilcox, Andrew Morton,
	Kirill A. Shutemov, Jan Kara, Hugh Dickins, Mel Gorman, linux-mm,
	linux-nvdimm, linux-fsdevel, Eryu Guan, Christoph Hellwig

On 04/08/2015 11:26 PM, Greg KH wrote:
> On Wed, Apr 08, 2015 at 07:00:37PM +0300, Boaz Harrosh wrote:
>> On 04/08/2015 06:56 PM, Boaz Harrosh wrote:
>>> From: Yigal Korman <yigal@plexistor.com>
>>>
>>> [For Stable 4.0.X]
>>> The parallel patch at 4.1-rc1 to this patch is:
>>>   Subject: mm: new pfn_mkwrite same as page_mkwrite for VM_PFNMAP
>>>
>>> We need this patch for the 4.0.X stable tree if the patch
>>>   Subject: dax: use pfn_mkwrite to update c/mtime + freeze protection
>>>
>>> Was decided to be pulled into stable since it is a dependency
>>> of this patch. The file mm/memory.c was heavily changed in 4.1
>>> hence this here.
>>>
>>
>> I forgot to send this patch for the stables tree, 4.0 only.
>>
>> Again this one is only needed if we are truing to pull
>>    Subject: dax: use pfn_mkwrite to update c/mtime + freeze protection
>>
>> Which has the Stable@ tag. The problem it fixes is minor and might
>> be skipped if causes problems.
> 
> I can't take patches in the stable tree that are not in Linus's tree
> also.  Why can't I just take a corrisponding patch that is already in
> Linus's tree, why do we need something "special" here?
> 
> thanks,
> 

Hi greg

Yes sorry I did not explain very well.

the akpm tree in linux-next as two patches:
  8dca515 mm: new pfn_mkwrite same as page_mkwrite for VM_PFNMAP
  dac1bd2 dax: use pfn_mkwrite to update c/mtime + freeze protection

Now these patches will hit Linus tree in 4.1 merge window.
The second patch is tagged with stable@ CC, because it fixes DAX
which was introduced in 4.0. It depends on the 1st patch.

However the first patch is not tagged stable@ because it will not
apply at all to 4.0. This is because it patches mm/memory.c which
will completely change in 4.1. This is why I sent this special
patch which has the same exact title, and does exactly the same
as the 1st patch but on the 4.0 Kernel.

So when you encounter this 2nd patch with the Stable@ tag. I think
the best is to just ignore it, and wait for complains, which will
most probably not come because DAX is pretty experimental.
(But if we do pull it we will need this here)

> greg k-h
> 

Thanks
Boaz


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 1/3 @stable] mm(v4.0): New pfn_mkwrite same as page_mkwrite for VM_PFNMAP
@ 2015-04-12  7:49         ` Boaz Harrosh
  0 siblings, 0 replies; 33+ messages in thread
From: Boaz Harrosh @ 2015-04-12  7:49 UTC (permalink / raw)
  To: Greg KH
  Cc: Stable Tree, Dave Chinner, Matthew Wilcox, Andrew Morton,
	Kirill A. Shutemov, Jan Kara, Hugh Dickins, Mel Gorman, linux-mm,
	linux-nvdimm, linux-fsdevel, Eryu Guan, Christoph Hellwig

On 04/08/2015 11:26 PM, Greg KH wrote:
> On Wed, Apr 08, 2015 at 07:00:37PM +0300, Boaz Harrosh wrote:
>> On 04/08/2015 06:56 PM, Boaz Harrosh wrote:
>>> From: Yigal Korman <yigal@plexistor.com>
>>>
>>> [For Stable 4.0.X]
>>> The parallel patch at 4.1-rc1 to this patch is:
>>>   Subject: mm: new pfn_mkwrite same as page_mkwrite for VM_PFNMAP
>>>
>>> We need this patch for the 4.0.X stable tree if the patch
>>>   Subject: dax: use pfn_mkwrite to update c/mtime + freeze protection
>>>
>>> Was decided to be pulled into stable since it is a dependency
>>> of this patch. The file mm/memory.c was heavily changed in 4.1
>>> hence this here.
>>>
>>
>> I forgot to send this patch for the stables tree, 4.0 only.
>>
>> Again this one is only needed if we are truing to pull
>>    Subject: dax: use pfn_mkwrite to update c/mtime + freeze protection
>>
>> Which has the Stable@ tag. The problem it fixes is minor and might
>> be skipped if causes problems.
> 
> I can't take patches in the stable tree that are not in Linus's tree
> also.  Why can't I just take a corrisponding patch that is already in
> Linus's tree, why do we need something "special" here?
> 
> thanks,
> 

Hi greg

Yes sorry I did not explain very well.

the akpm tree in linux-next as two patches:
  8dca515 mm: new pfn_mkwrite same as page_mkwrite for VM_PFNMAP
  dac1bd2 dax: use pfn_mkwrite to update c/mtime + freeze protection

Now these patches will hit Linus tree in 4.1 merge window.
The second patch is tagged with stable@ CC, because it fixes DAX
which was introduced in 4.0. It depends on the 1st patch.

However the first patch is not tagged stable@ because it will not
apply at all to 4.0. This is because it patches mm/memory.c which
will completely change in 4.1. This is why I sent this special
patch which has the same exact title, and does exactly the same
as the 1st patch but on the 4.0 Kernel.

So when you encounter this 2nd patch with the Stable@ tag. I think
the best is to just ignore it, and wait for complains, which will
most probably not come because DAX is pretty experimental.
(But if we do pull it we will need this here)

> greg k-h
> 

Thanks
Boaz

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 33+ messages in thread

end of thread, other threads:[~2015-04-12  7:49 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-04-07  8:33 [PATCH 0/3 v5] dax: some dax fixes and cleanups Boaz Harrosh
2015-04-07  8:33 ` Boaz Harrosh
2015-04-07  8:40 ` [PATCH 1/3] mm(v4.1): New pfn_mkwrite same as page_mkwrite for VM_PFNMAP Boaz Harrosh
2015-04-07  8:51   ` Boaz Harrosh
2015-04-07  8:51     ` Boaz Harrosh
2015-04-07  9:03   ` Kirill A. Shutemov
2015-04-07  9:22     ` Boaz Harrosh
2015-04-07 12:57   ` [PATCH 1/3 v6] " Boaz Harrosh
2015-04-07 12:57     ` Boaz Harrosh
2015-04-07 13:17     ` Kirill A. Shutemov
2015-04-07 13:26       ` Kirill A. Shutemov
2015-04-07 13:37         ` Boaz Harrosh
2015-04-07 13:37           ` Boaz Harrosh
2015-04-07 13:47           ` Kirill A. Shutemov
2015-04-07 14:09             ` Boaz Harrosh
2015-04-07 14:09               ` Boaz Harrosh
2015-04-07 14:06   ` [PATCH 1/3 v7] " Boaz Harrosh
2015-04-07 14:06     ` Boaz Harrosh
2015-04-07 14:12     ` Kirill A. Shutemov
2015-04-07  8:43 ` [PATCH 2/3] dax: use pfn_mkwrite to update c/mtime + freeze protection Boaz Harrosh
2015-04-07  8:43   ` Boaz Harrosh
2015-04-07 16:28   ` Jan Kara
2015-04-07 16:28     ` Jan Kara
2015-04-07  8:45 ` [PATCH 3/3] dax: Unify ext2/4_{dax,}_file_operations Boaz Harrosh
2015-04-07  8:45   ` Boaz Harrosh
2015-04-07 16:26   ` Jan Kara
2015-04-08 15:56 ` [PATCH 1/3 @stable] mm(v4.0): New pfn_mkwrite same as page_mkwrite for VM_PFNMAP Boaz Harrosh
2015-04-08 15:56   ` Boaz Harrosh
2015-04-08 16:00   ` Boaz Harrosh
2015-04-08 16:00     ` Boaz Harrosh
2015-04-08 20:26     ` Greg KH
2015-04-12  7:49       ` Boaz Harrosh
2015-04-12  7:49         ` Boaz Harrosh

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.