* [PATCH 00/11] DAX fixes for 4.3
From: Matthew Wilcox @ 2015-08-04 19:57 UTC
  To: Andrew Morton, linux-fsdevel, linux-kernel, linux-mm; +Cc: Matthew Wilcox

From: Matthew Wilcox <willy@linux.intel.com>

Hi Andrew,

I believe this entire patch series should apply cleanly to your tree.

I have already sent the first patch to Ted; it should be applied for 4.2.

Patch 4 depends on patch 1 having been applied first, patch 5 depends
on patch 4, and patch 6 depends on patch 5, so I can't wait for those
patches to go in via the ext4 tree.

Most of these patches aren't needed for 4.2 and earlier, because they
pertain to the PMD support which is only in the -mm tree.

Kirill A. Shutemov (3):
  thp: Decrement refcount on huge zero page if it is split
  thp: Fix zap_huge_pmd() for DAX
  dax: Don't use set_huge_zero_page()

Matthew Wilcox (8):
  ext4: Use ext4_get_block_write() for DAX
  thp: Change insert_pfn's return type to void
  dax: Improve comment about truncate race
  ext4: Add ext4_get_block_dax()
  ext4: Start transaction before calling into DAX
  dax: Fix race between simultaneous faults
  dax: Ensure that zero pages are removed from other processes
  dax: Use linear_page_index()

 fs/dax.c                | 66 ++++++++++++++++++++++++---------------
 fs/ext4/ext4.h          |  2 ++
 fs/ext4/file.c          | 61 ++++++++++++++++++++++++++++++++----
 fs/ext4/inode.c         | 11 +++++++
 include/linux/huge_mm.h |  3 --
 mm/huge_memory.c        | 83 ++++++++++++++++++++++---------------------------
 mm/memory.c             | 11 +++++--
 7 files changed, 155 insertions(+), 82 deletions(-)

-- 
2.1.4


* [PATCH 01/11] ext4: Use ext4_get_block_write() for DAX
From: Matthew Wilcox @ 2015-08-04 19:57 UTC
  To: Andrew Morton, linux-fsdevel, linux-kernel, linux-mm; +Cc: Matthew Wilcox

From: Matthew Wilcox <willy@linux.intel.com>

DAX relies on the get_block function either zeroing newly allocated blocks
before they're findable by subsequent calls to get_block, or marking newly
allocated blocks as unwritten.  ext4_get_block() cannot create unwritten
extents, but ext4_get_block_write() can.
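
As an illustration, here is a simplified sketch (not part of this
patch) of how the DAX fault path consumes that guarantee, keyed off
the buffer state returned by get_block:

	error = get_block(inode, block, &bh, 1 /* create */);
	if (error)
		goto out;
	if (buffer_unwritten(&bh) || buffer_new(&bh))
		clear_page(addr);	/* never expose stale block contents */

ext4_get_block() allocates blocks without marking them unwritten, so
stale data could become visible before it is zeroed; the unwritten
extents from ext4_get_block_write() are caught by the check above.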

Reported-by: Andy Rudoff <andy.rudoff@intel.com>
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
---
 fs/ext4/file.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 953d519..ca5302a 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -196,7 +196,7 @@ out:
 static void ext4_end_io_unwritten(struct buffer_head *bh, int uptodate)
 {
 	struct inode *inode = bh->b_assoc_map->host;
-	/* XXX: breaks on 32-bit > 16GB. Is that even supported? */
+	/* XXX: breaks on 32-bit > 16TB. Is that even supported? */
 	loff_t offset = (loff_t)(uintptr_t)bh->b_private << inode->i_blkbits;
 	int err;
 	if (!uptodate)
@@ -207,8 +207,7 @@ static void ext4_end_io_unwritten(struct buffer_head *bh, int uptodate)
 
 static int ext4_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 {
-	return dax_fault(vma, vmf, ext4_get_block, ext4_end_io_unwritten);
-					/* Is this the right get_block? */
+	return dax_fault(vma, vmf, ext4_get_block_write, ext4_end_io_unwritten);
 }
 
 static int ext4_dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
@@ -220,7 +219,8 @@ static int ext4_dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
 
 static int ext4_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 {
-	return dax_mkwrite(vma, vmf, ext4_get_block, ext4_end_io_unwritten);
+	return dax_mkwrite(vma, vmf, ext4_get_block_write,
+				ext4_end_io_unwritten);
 }
 
 static const struct vm_operations_struct ext4_dax_vm_ops = {
-- 
2.1.4


* [PATCH 02/11] thp: Change insert_pfn_pmd's return type to void
From: Matthew Wilcox @ 2015-08-04 19:57 UTC
  To: Andrew Morton, linux-fsdevel, linux-kernel, linux-mm; +Cc: Matthew Wilcox

From: Matthew Wilcox <willy@linux.intel.com>

It would make more sense to have all the return values from
vmf_insert_pfn_pmd() encoded in one place, instead of having to chase
the convention into insert_pfn_pmd().  Suggested by Jeff Moyer.

Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
---
 mm/huge_memory.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 26d0fc1..5ffdcaa 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -837,7 +837,7 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	return 0;
 }
 
-static int insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
+static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
 		pmd_t *pmd, unsigned long pfn, pgprot_t prot, bool write)
 {
 	struct mm_struct *mm = vma->vm_mm;
@@ -855,7 +855,6 @@ static int insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
 		update_mmu_cache_pmd(vma, addr, pmd);
 	}
 	spin_unlock(ptl);
-	return VM_FAULT_NOPAGE;
 }
 
 int vmf_insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
@@ -877,7 +876,8 @@ int vmf_insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
 		return VM_FAULT_SIGBUS;
 	if (track_pfn_insert(vma, &pgprot, pfn))
 		return VM_FAULT_SIGBUS;
-	return insert_pfn_pmd(vma, addr, pmd, pfn, pgprot, write);
+	insert_pfn_pmd(vma, addr, pmd, pfn, pgprot, write);
+	return VM_FAULT_NOPAGE;
 }
 
 int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
-- 
2.1.4


* [PATCH 03/11] dax: Improve comment about truncate race
From: Matthew Wilcox @ 2015-08-04 19:57 UTC
  To: Andrew Morton, linux-fsdevel, linux-kernel, linux-mm; +Cc: Matthew Wilcox

From: Matthew Wilcox <willy@linux.intel.com>

Jan Kara pointed out that I should be more explicit here about the
perils of racing against truncate.  The comment is mostly the same as
for the PTE case.

Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
---
 fs/dax.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/fs/dax.c b/fs/dax.c
index 15f8ffc..0a13118 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -553,7 +553,12 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 	if (!buffer_size_valid(&bh) || bh.b_size < PMD_SIZE)
 		goto fallback;
 
-	/* Guard against a race with truncate */
+	/*
+	 * If a truncate happened while we were allocating blocks, we may
+	 * leave blocks allocated to the file that are beyond EOF.  We can't
+	 * take i_mutex here, so just leave them hanging; they'll be freed
+	 * when the file is deleted.
+	 */
 	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
 	if (pgoff >= size) {
 		result = VM_FAULT_SIGBUS;
-- 
2.1.4


* [PATCH 04/11] ext4: Add ext4_get_block_dax()
From: Matthew Wilcox @ 2015-08-04 19:57 UTC
  To: Andrew Morton, linux-fsdevel, linux-kernel, linux-mm; +Cc: Matthew Wilcox

From: Matthew Wilcox <willy@linux.intel.com>

DAX wants different semantics from any existing ext4 get_block
callback.  Unlike ext4_get_block_write(), it needs to honour the
'create' flag, and unlike ext4_get_block(), it needs to be able
to return unwritten extents.  So introduce a new ext4_get_block_dax()
which has those semantics.  We could also change ext4_get_block_write()
to honour the 'create' flag, but that might have consequences for other
users that I do not currently understand.
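
To summarise the semantics (paraphrasing the above; _ext4_get_block()
remains the authoritative flag handling):

	ext4_get_block()	honours 'create', but allocates
				written extents
	ext4_get_block_write()	returns unwritten extents, but
				creates regardless of 'create'
	ext4_get_block_dax()	honours 'create' and returns
				unwritten extents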

Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
---
 fs/ext4/ext4.h  |  2 ++
 fs/ext4/file.c  |  6 +++---
 fs/ext4/inode.c | 11 +++++++++++
 3 files changed, 16 insertions(+), 3 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index d743d93..51c5008 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2274,6 +2274,8 @@ struct buffer_head *ext4_getblk(handle_t *, struct inode *, ext4_lblk_t, int);
 struct buffer_head *ext4_bread(handle_t *, struct inode *, ext4_lblk_t, int);
 int ext4_get_block_write(struct inode *inode, sector_t iblock,
 			 struct buffer_head *bh_result, int create);
+int ext4_get_block_dax(struct inode *inode, sector_t iblock,
+			 struct buffer_head *bh_result, int create);
 int ext4_get_block(struct inode *inode, sector_t iblock,
 				struct buffer_head *bh_result, int create);
 int ext4_da_get_block_prep(struct inode *inode, sector_t iblock,
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index ca5302a..d5219e4 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -207,19 +207,19 @@ static void ext4_end_io_unwritten(struct buffer_head *bh, int uptodate)
 
 static int ext4_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 {
-	return dax_fault(vma, vmf, ext4_get_block_write, ext4_end_io_unwritten);
+	return dax_fault(vma, vmf, ext4_get_block_dax, ext4_end_io_unwritten);
 }
 
 static int ext4_dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
 						pmd_t *pmd, unsigned int flags)
 {
-	return dax_pmd_fault(vma, addr, pmd, flags, ext4_get_block_write,
+	return dax_pmd_fault(vma, addr, pmd, flags, ext4_get_block_dax,
 				ext4_end_io_unwritten);
 }
 
 static int ext4_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 {
-	return dax_mkwrite(vma, vmf, ext4_get_block_write,
+	return dax_mkwrite(vma, vmf, ext4_get_block_dax,
 				ext4_end_io_unwritten);
 }
 
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 40e6c66..75146d1 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3025,6 +3025,17 @@ static int ext4_get_block_write_nolock(struct inode *inode, sector_t iblock,
 			       EXT4_GET_BLOCKS_NO_LOCK);
 }
 
+int ext4_get_block_dax(struct inode *inode, sector_t iblock,
+		   struct buffer_head *bh_result, int create)
+{
+	int flags = EXT4_GET_BLOCKS_PRE_IO | EXT4_GET_BLOCKS_UNWRIT_EXT;
+	if (create)
+		flags |= EXT4_GET_BLOCKS_CREATE;
+	ext4_debug("ext4_get_block_dax: inode %lu, create flag %d\n",
+		   inode->i_ino, create);
+	return _ext4_get_block(inode, iblock, bh_result, flags);
+}
+
 static void ext4_end_io_dio(struct kiocb *iocb, loff_t offset,
 			    ssize_t size, void *private)
 {
-- 
2.1.4


* [PATCH 05/11] ext4: Start transaction before calling into DAX
From: Matthew Wilcox @ 2015-08-04 19:57 UTC
  To: Andrew Morton, linux-fsdevel, linux-kernel, linux-mm; +Cc: Matthew Wilcox

From: Matthew Wilcox <willy@linux.intel.com>

Jan Kara pointed out that in the case where we are writing to a hole,
we can end up with a lock inversion between the page lock and the
journal lock.  We can avoid this by starting the transaction in ext4
before calling into DAX.  The journal lock nests inside the superblock
pagefault lock, so we have to duplicate that code from dax_fault(), as
XFS does.
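
The resulting ordering for a write fault, sketched (the journal
handle nests inside the pagefault freeze protection):

	sb_start_pagefault(sb);
	file_update_time(vma->vm_file);
	handle = ext4_journal_start_sb(sb, EXT4_HT_WRITE_PAGE, ...);
	result = __dax_fault(...);	/* may call get_block() */
	ext4_journal_stop(handle);
	sb_end_pagefault(sb);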

Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
---
 fs/ext4/file.c | 55 ++++++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 52 insertions(+), 3 deletions(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index d5219e4..113837e 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -207,14 +207,63 @@ static void ext4_end_io_unwritten(struct buffer_head *bh, int uptodate)
 
 static int ext4_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 {
-	return dax_fault(vma, vmf, ext4_get_block_dax, ext4_end_io_unwritten);
+	int result;
+	handle_t *handle = NULL;
+	struct super_block *sb = file_inode(vma->vm_file)->i_sb;
+	bool write = vmf->flags & FAULT_FLAG_WRITE;
+
+	if (write) {
+		sb_start_pagefault(sb);
+		file_update_time(vma->vm_file);
+		handle = ext4_journal_start_sb(sb, EXT4_HT_WRITE_PAGE,
+						EXT4_DATA_TRANS_BLOCKS(sb));
+	}
+
+	if (IS_ERR(handle))
+		result = VM_FAULT_SIGBUS;
+	else
+		result = __dax_fault(vma, vmf, ext4_get_block_dax,
+						ext4_end_io_unwritten);
+
+	if (write) {
+		if (!IS_ERR(handle))
+			ext4_journal_stop(handle);
+		sb_end_pagefault(sb);
+	}
+
+	return result;
 }
 
 static int ext4_dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
 						pmd_t *pmd, unsigned int flags)
 {
-	return dax_pmd_fault(vma, addr, pmd, flags, ext4_get_block_dax,
-				ext4_end_io_unwritten);
+	int result;
+	handle_t *handle = NULL;
+	struct inode *inode = file_inode(vma->vm_file);
+	struct super_block *sb = inode->i_sb;
+	bool write = flags & FAULT_FLAG_WRITE;
+
+	if (write) {
+		sb_start_pagefault(sb);
+		file_update_time(vma->vm_file);
+		handle = ext4_journal_start_sb(sb, EXT4_HT_WRITE_PAGE,
+				ext4_chunk_trans_blocks(inode,
+							PMD_SIZE / PAGE_SIZE));
+	}
+
+	if (IS_ERR(handle))
+		result = VM_FAULT_SIGBUS;
+	else
+		result = __dax_pmd_fault(vma, addr, pmd, flags,
+				ext4_get_block_dax, ext4_end_io_unwritten);
+
+	if (write) {
+		if (!IS_ERR(handle))
+			ext4_journal_stop(handle);
+		sb_end_pagefault(sb);
+	}
+
+	return result;
 }
 
 static int ext4_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
-- 
2.1.4


* [PATCH 06/11] dax: Fix race between simultaneous faults
From: Matthew Wilcox @ 2015-08-04 19:58 UTC
  To: Andrew Morton, linux-fsdevel, linux-kernel, linux-mm; +Cc: Matthew Wilcox

From: Matthew Wilcox <willy@linux.intel.com>

If two threads write-fault on the same hole at the same time, the
winner of the race returns to userspace and completes its store, only
to have the loser overwrite that store with zeroes.  Fix this for now
by taking i_mmap_rwsem for write instead of read, and by taking it
before the call to get_block().  Now the loser of the race sees that
the block has already been zeroed, and does not zero it again.

This severely limits our scalability.  I have ideas for improving it,
but those can wait for a later patch.
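
The race, as a timeline (illustrative):

	Thread A			Thread B
	--------			--------
	write fault on hole
	get_block() allocates, zeroes
	maps pfn, returns to userspace
	store completes
					write fault on same hole
					get_block() sees a new block
					and zeroes it: A's store is lost

Holding i_mmap_rwsem for write across the call to get_block()
serialises the two faults, so B finds the block already allocated
and zeroed.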

Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
---
 fs/dax.c    | 33 +++++++++++++++++----------------
 mm/memory.c | 11 ++++++++---
 2 files changed, 25 insertions(+), 19 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 0a13118..9daf005 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -272,7 +272,6 @@ static int copy_user_bh(struct page *to, struct buffer_head *bh,
 static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
 			struct vm_area_struct *vma, struct vm_fault *vmf)
 {
-	struct address_space *mapping = inode->i_mapping;
 	sector_t sector = bh->b_blocknr << (inode->i_blkbits - 9);
 	unsigned long vaddr = (unsigned long)vmf->virtual_address;
 	void *addr;
@@ -280,8 +279,6 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
 	pgoff_t size;
 	int error;
 
-	i_mmap_lock_read(mapping);
-
 	/*
 	 * Check truncate didn't happen while we were allocating a block.
 	 * If it did, this block may or may not be still allocated to the
@@ -309,8 +306,6 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
 	error = vm_insert_mixed(vma, vaddr, pfn);
 
  out:
-	i_mmap_unlock_read(mapping);
-
 	return error;
 }
 
@@ -372,15 +367,17 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 			 * from a read fault and we've raced with a truncate
 			 */
 			error = -EIO;
-			goto unlock_page;
+			goto unlock;
 		}
+	} else {
+		i_mmap_lock_write(mapping);
 	}
 
 	error = get_block(inode, block, &bh, 0);
 	if (!error && (bh.b_size < PAGE_SIZE))
 		error = -EIO;		/* fs corruption? */
 	if (error)
-		goto unlock_page;
+		goto unlock;
 
 	if (!buffer_mapped(&bh) && !buffer_unwritten(&bh) && !vmf->cow_page) {
 		if (vmf->flags & FAULT_FLAG_WRITE) {
@@ -391,8 +388,9 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 			if (!error && (bh.b_size < PAGE_SIZE))
 				error = -EIO;
 			if (error)
-				goto unlock_page;
+				goto unlock;
 		} else {
+			i_mmap_unlock_write(mapping);
 			return dax_load_hole(mapping, page, vmf);
 		}
 	}
@@ -404,17 +402,15 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 		else
 			clear_user_highpage(new_page, vaddr);
 		if (error)
-			goto unlock_page;
+			goto unlock;
 		vmf->page = page;
 		if (!page) {
-			i_mmap_lock_read(mapping);
 			/* Check we didn't race with truncate */
 			size = (i_size_read(inode) + PAGE_SIZE - 1) >>
 								PAGE_SHIFT;
 			if (vmf->pgoff >= size) {
-				i_mmap_unlock_read(mapping);
 				error = -EIO;
-				goto out;
+				goto unlock;
 			}
 		}
 		return VM_FAULT_LOCKED;
@@ -450,6 +446,8 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 			WARN_ON_ONCE(!(vmf->flags & FAULT_FLAG_WRITE));
 	}
 
+	if (!page)
+		i_mmap_unlock_write(mapping);
  out:
 	if (error == -ENOMEM)
 		return VM_FAULT_OOM | major;
@@ -458,11 +456,14 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 		return VM_FAULT_SIGBUS | major;
 	return VM_FAULT_NOPAGE | major;
 
- unlock_page:
+ unlock:
 	if (page) {
 		unlock_page(page);
 		page_cache_release(page);
+	} else {
+		i_mmap_unlock_write(mapping);
 	}
+
 	goto out;
 }
 EXPORT_SYMBOL(__dax_fault);
@@ -540,10 +541,10 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 	block = (sector_t)pgoff << (PAGE_SHIFT - blkbits);
 
 	bh.b_size = PMD_SIZE;
+	i_mmap_lock_write(mapping);
 	length = get_block(inode, block, &bh, write);
 	if (length)
 		return VM_FAULT_SIGBUS;
-	i_mmap_lock_read(mapping);
 
 	/*
 	 * If the filesystem isn't willing to tell us the length of a hole,
@@ -607,11 +608,11 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 	}
 
  out:
-	i_mmap_unlock_read(mapping);
-
 	if (buffer_unwritten(&bh))
 		complete_unwritten(&bh, !(result & VM_FAULT_ERROR));
 
+	i_mmap_unlock_write(mapping);
+
 	return result;
 
  fallback:
diff --git a/mm/memory.c b/mm/memory.c
index b94b587..5f46350 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2426,11 +2426,16 @@ void unmap_mapping_range(struct address_space *mapping,
 		details.last_index = ULONG_MAX;
 
 
-	/* DAX uses i_mmap_lock to serialise file truncate vs page fault */
-	i_mmap_lock_write(mapping);
+	/*
+	 * DAX already holds i_mmap_lock to serialise file truncate vs
+	 * page fault and page fault vs page fault.
+	 */
+	if (!IS_DAX(mapping->host))
+		i_mmap_lock_write(mapping);
 	if (unlikely(!RB_EMPTY_ROOT(&mapping->i_mmap)))
 		unmap_mapping_range_tree(&mapping->i_mmap, &details);
-	i_mmap_unlock_write(mapping);
+	if (!IS_DAX(mapping->host))
+		i_mmap_unlock_write(mapping);
 }
 EXPORT_SYMBOL(unmap_mapping_range);
 
-- 
2.1.4


* [PATCH 07/11] thp: Decrement refcount on huge zero page if it is split
From: Matthew Wilcox @ 2015-08-04 19:58 UTC
  To: Andrew Morton, linux-fsdevel, linux-kernel, linux-mm
  Cc: Kirill A. Shutemov, willy

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

The DAX code neglected to put the refcount on the huge zero page.
We must also call the mmu notifiers when splitting.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
---
 mm/huge_memory.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 5ffdcaa..326d17e 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2951,7 +2951,9 @@ again:
 	if (unlikely(!pmd_trans_huge(*pmd)))
 		goto unlock;
 	if (vma_is_dax(vma)) {
-		pmdp_huge_clear_flush(vma, haddr, pmd);
+		pmd_t _pmd = pmdp_huge_clear_flush_notify(vma, haddr, pmd);
+		if (is_huge_zero_pmd(_pmd))
+			put_huge_zero_page();
 	} else if (is_huge_zero_pmd(*pmd)) {
 		__split_huge_zero_page_pmd(vma, haddr, pmd);
 	} else {
-- 
2.1.4


* [PATCH 08/11] thp: Fix zap_huge_pmd() for DAX
From: Matthew Wilcox @ 2015-08-04 19:58 UTC
  To: Andrew Morton, linux-fsdevel, linux-kernel, linux-mm
  Cc: Kirill A. Shutemov, willy

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

The original DAX code assumed that pgtable_t was a pointer, which isn't
true on all architectures.  Restructure the code to not rely on that
assumption.
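
The problematic pattern, for illustration (pgtable_t is an
architecture-defined type and need not be a pointer):

	pgtable_t pgtable;
	/* ... */
	pgtable = NULL;		/* only valid if pgtable_t is a pointer */
	/* ... */
	if (pgtable)		/* ditto */
		pte_free(tlb->mm, pgtable);

The restructured version never stores a sentinel in a pgtable_t; it
calls pgtable_trans_huge_withdraw() only on the paths that actually
deposited a page table.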

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
[further fixes from Matthew integrated into this patch]
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
---
 mm/huge_memory.c | 71 +++++++++++++++++++++++++-------------------------------
 1 file changed, 31 insertions(+), 40 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 326d17e..b51bed1 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1426,50 +1426,41 @@ out:
 int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		 pmd_t *pmd, unsigned long addr)
 {
+	pmd_t orig_pmd;
 	spinlock_t *ptl;
-	int ret = 0;
 
-	if (__pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
-		pgtable_t pgtable;
-		pmd_t orig_pmd;
-		/*
-		 * For architectures like ppc64 we look at deposited pgtable
-		 * when calling pmdp_huge_get_and_clear. So do the
-		 * pgtable_trans_huge_withdraw after finishing pmdp related
-		 * operations.
-		 */
-		orig_pmd = pmdp_huge_get_and_clear_full(tlb->mm, addr, pmd,
-							tlb->fullmm);
-		tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
-		if (vma_is_dax(vma)) {
-			if (is_huge_zero_pmd(orig_pmd)) {
-				pgtable = NULL;
-			} else {
-				spin_unlock(ptl);
-				return 1;
-			}
-		} else {
-			pgtable = pgtable_trans_huge_withdraw(tlb->mm, pmd);
-		}
-		if (is_huge_zero_pmd(orig_pmd)) {
-			atomic_long_dec(&tlb->mm->nr_ptes);
-			spin_unlock(ptl);
+	if (__pmd_trans_huge_lock(pmd, vma, &ptl) != 1)
+		return 0;
+	/*
+	 * For architectures like ppc64 we look at deposited pgtable
+	 * when calling pmdp_huge_get_and_clear. So do the
+	 * pgtable_trans_huge_withdraw after finishing pmdp related
+	 * operations.
+	 */
+	orig_pmd = pmdp_huge_get_and_clear_full(tlb->mm, addr, pmd,
+			tlb->fullmm);
+	tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
+	if (vma_is_dax(vma)) {
+		spin_unlock(ptl);
+		if (is_huge_zero_pmd(orig_pmd))
 			put_huge_zero_page();
-		} else {
-			struct page *page = pmd_page(orig_pmd);
-			page_remove_rmap(page);
-			VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
-			add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
-			VM_BUG_ON_PAGE(!PageHead(page), page);
-			atomic_long_dec(&tlb->mm->nr_ptes);
-			spin_unlock(ptl);
-			tlb_remove_page(tlb, page);
-		}
-		if (pgtable)
-			pte_free(tlb->mm, pgtable);
-		ret = 1;
+	} else if (is_huge_zero_pmd(orig_pmd)) {
+		pte_free(tlb->mm, pgtable_trans_huge_withdraw(tlb->mm, pmd));
+		atomic_long_dec(&tlb->mm->nr_ptes);
+		spin_unlock(ptl);
+		put_huge_zero_page();
+	} else {
+		struct page *page = pmd_page(orig_pmd);
+		page_remove_rmap(page);
+		VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
+		add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
+		VM_BUG_ON_PAGE(!PageHead(page), page);
+		pte_free(tlb->mm, pgtable_trans_huge_withdraw(tlb->mm, pmd));
+		atomic_long_dec(&tlb->mm->nr_ptes);
+		spin_unlock(ptl);
+		tlb_remove_page(tlb, page);
 	}
-	return ret;
+	return 1;
 }
 
 int move_huge_pmd(struct vm_area_struct *vma, struct vm_area_struct *new_vma,
-- 
2.1.4


* [PATCH 09/11] dax: Don't use set_huge_zero_page()
From: Matthew Wilcox @ 2015-08-04 19:58 UTC
  To: Andrew Morton, linux-fsdevel, linux-kernel, linux-mm
  Cc: Kirill A. Shutemov, willy

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

This is another place where DAX assumed that pgtable_t was a pointer.
Open code the important parts of set_huge_zero_page() in DAX and make
set_huge_zero_page() static again.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
---
 fs/dax.c                | 18 ++++++++++++------
 include/linux/huge_mm.h |  3 ---
 mm/huge_memory.c        |  2 +-
 3 files changed, 13 insertions(+), 10 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 9daf005..bf9a22b 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -572,18 +572,24 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 		unmap_mapping_range(mapping, pgoff << PAGE_SHIFT, PMD_SIZE, 0);
 
 	if (!write && !buffer_mapped(&bh) && buffer_uptodate(&bh)) {
-		bool set;
 		spinlock_t *ptl;
-		struct mm_struct *mm = vma->vm_mm;
+		pmd_t entry;
 		struct page *zero_page = get_huge_zero_page();
+
 		if (unlikely(!zero_page))
 			goto fallback;
 
-		ptl = pmd_lock(mm, pmd);
-		set = set_huge_zero_page(NULL, mm, vma, pmd_addr, pmd,
-								zero_page);
-		spin_unlock(ptl);
+		ptl = pmd_lock(vma->vm_mm, pmd);
+		if (!pmd_none(*pmd)) {
+			spin_unlock(ptl);
+			goto fallback;
+		}
+
+		entry = mk_pmd(zero_page, vma->vm_page_prot);
+		entry = pmd_mkhuge(entry);
+		set_pmd_at(vma->vm_mm, pmd_addr, pmd, entry);
 		result = VM_FAULT_NOPAGE;
+		spin_unlock(ptl);
 	} else {
 		sector = bh.b_blocknr << (blkbits - 9);
 		length = bdev_direct_access(bh.b_bdev, sector, &kaddr, &pfn,
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index f9b612f..ecb080d 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -163,9 +163,6 @@ static inline bool is_huge_zero_pmd(pmd_t pmd)
 }
 
 struct page *get_huge_zero_page(void);
-bool set_huge_zero_page(pgtable_t pgtable, struct mm_struct *mm,
-		struct vm_area_struct *vma, unsigned long haddr,
-		pmd_t *pmd, struct page *zero_page);
 
 #else /* CONFIG_TRANSPARENT_HUGEPAGE */
 #define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index b51bed1..63c8fe5 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -767,7 +767,7 @@ static inline gfp_t alloc_hugepage_gfpmask(int defrag, gfp_t extra_gfp)
 }
 
 /* Caller must hold page table lock. */
-bool set_huge_zero_page(pgtable_t pgtable, struct mm_struct *mm,
+static bool set_huge_zero_page(pgtable_t pgtable, struct mm_struct *mm,
 		struct vm_area_struct *vma, unsigned long haddr, pmd_t *pmd,
 		struct page *zero_page)
 {
-- 
2.1.4


* [PATCH 10/11] dax: Ensure that zero pages are removed from other processes
From: Matthew Wilcox @ 2015-08-04 19:58 UTC
  To: Andrew Morton, linux-fsdevel, linux-kernel, linux-mm; +Cc: Matthew Wilcox

From: Matthew Wilcox <willy@linux.intel.com>

If the first access to a huge page was a store, there would be no
existing zero pmd in this process's page tables.  There could still be
a zero pmd in another process's page tables, if it had done a load.
We can detect this case by noticing that the buffer_head returned from
the filesystem is marked new (BH_New), and ensure that other processes
mapping this huge page have their page tables flushed.

Reported-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
---
 fs/dax.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/fs/dax.c b/fs/dax.c
index bf9a22b..233e61e 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -568,7 +568,11 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 	if ((pgoff | PG_PMD_COLOUR) >= size)
 		goto fallback;
 
-	if (is_huge_zero_pmd(*pmd))
+	/*
+	 * If we allocated new storage, make sure no process has any
+	 * zero pages covering this hole
+	 */
+	if (buffer_new(&bh))
 		unmap_mapping_range(mapping, pgoff << PAGE_SHIFT, PMD_SIZE, 0);
 
 	if (!write && !buffer_mapped(&bh) && buffer_uptodate(&bh)) {
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 30+ messages in thread
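
"New" above refers to the BH_New flag that the filesystem's get_block
callback sets on the buffer_head when it allocates fresh blocks.  The
predicate is generated by BUFFER_FNS(New, new) in
include/linux/buffer_head.h and expands to roughly:

static inline int buffer_new(const struct buffer_head *bh)
{
	return test_bit(BH_New, &bh->b_state);
}

So a write fault that triggered an allocation sees buffer_new(&bh)
return true, and knows that any zero pmd another process holds now
covers real storage and must be shot down.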

* [PATCH 11/11] dax: Use linear_page_index()
  2015-08-04 19:57 ` Matthew Wilcox
@ 2015-08-04 19:58   ` Matthew Wilcox
  -1 siblings, 0 replies; 30+ messages in thread
From: Matthew Wilcox @ 2015-08-04 19:58 UTC (permalink / raw)
  To: Andrew Morton, linux-fsdevel, linux-kernel, linux-mm; +Cc: Matthew Wilcox

From: Matthew Wilcox <willy@linux.intel.com>

I was basically open-coding linear_page_index() (thanks to copying code
from do_fault(), which probably also needs the same fix).

Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
---
 fs/dax.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/dax.c b/fs/dax.c
index 233e61e..b6769ce 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -529,7 +529,7 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 	if ((pmd_addr + PMD_SIZE) > vma->vm_end)
 		return VM_FAULT_FALLBACK;
 
-	pgoff = ((pmd_addr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
+	pgoff = linear_page_index(vma, pmd_addr);
 	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
 	if (pgoff >= size)
 		return VM_FAULT_SIGBUS;
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 30+ messages in thread
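
For reference, linear_page_index() in include/linux/pagemap.h computes
the same offset, roughly as follows (a sketch; note the hugetlb branch,
which the open-coded version did not handle):

static inline pgoff_t linear_page_index(struct vm_area_struct *vma,
					unsigned long address)
{
	pgoff_t pgoff;
	if (unlikely(is_vm_hugetlb_page(vma)))
		return linear_hugepage_index(vma, address);
	pgoff = (address - vma->vm_start) >> PAGE_SHIFT;
	pgoff += vma->vm_pgoff;
	return pgoff;
}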

* Re: [PATCH 04/11] ext4: Add ext4_get_block_dax()
  2015-08-04 19:57   ` Matthew Wilcox
@ 2015-08-05  2:03     ` Dave Chinner
  -1 siblings, 0 replies; 30+ messages in thread
From: Dave Chinner @ 2015-08-05  2:03 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andrew Morton, linux-fsdevel, linux-kernel, linux-mm, Matthew Wilcox

On Tue, Aug 04, 2015 at 03:57:58PM -0400, Matthew Wilcox wrote:
> From: Matthew Wilcox <willy@linux.intel.com>
> 
> DAX wants different semantics from any currently-existing ext4
> get_block callback.  Unlike ext4_get_block_write(), it needs to honour
> the 'create' flag, and unlike ext4_get_block(), it needs to be able
> to return unwritten extents.  So introduce a new ext4_get_block_dax()
> which has those semantics.  We could also change ext4_get_block_write()
> to honour the 'create' flag, but that might have consequences on other
> users that I do not currently understand.
> 
> Signed-off-by: Matthew Wilcox <willy@linux.intel.com>

Doesn't this make the first patch in the series redundant?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 30+ messages in thread
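
The semantics described in the quoted commit message suggest a helper
along these lines (a sketch only, assuming ext4's internal
_ext4_get_block() and the EXT4_GET_BLOCKS_* flags; the actual patch may
pick different flags):

int ext4_get_block_dax(struct inode *inode, sector_t iblock,
		struct buffer_head *bh, int create)
{
	int flags = EXT4_GET_BLOCKS_PRE_IO | EXT4_GET_BLOCKS_UNWRIT_EXT;

	if (create)
		flags |= EXT4_GET_BLOCKS_CREATE;
	return _ext4_get_block(inode, iblock, bh, flags);
}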

* Re: [PATCH 06/11] dax: Fix race between simultaneous faults
  2015-08-04 19:58   ` Matthew Wilcox
@ 2015-08-05 11:43     ` Kirill A. Shutemov
  -1 siblings, 0 replies; 30+ messages in thread
From: Kirill A. Shutemov @ 2015-08-05 11:43 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andrew Morton, linux-fsdevel, linux-kernel, linux-mm, Matthew Wilcox

On Tue, Aug 04, 2015 at 03:58:00PM -0400, Matthew Wilcox wrote:
> diff --git a/mm/memory.c b/mm/memory.c
> index b94b587..5f46350 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2426,11 +2426,16 @@ void unmap_mapping_range(struct address_space *mapping,
>  		details.last_index = ULONG_MAX;
>  
>  
> -	/* DAX uses i_mmap_lock to serialise file truncate vs page fault */
> -	i_mmap_lock_write(mapping);
> +	/*
> +	 * DAX already holds i_mmap_lock to serialise file truncate vs
> +	 * page fault and page fault vs page fault.
> +	 */
> +	if (!IS_DAX(mapping->host))
> +		i_mmap_lock_write(mapping);
>  	if (unlikely(!RB_EMPTY_ROOT(&mapping->i_mmap)))
>  		unmap_mapping_range_tree(&mapping->i_mmap, &details);
> -	i_mmap_unlock_write(mapping);
> +	if (!IS_DAX(mapping->host))
> +		i_mmap_unlock_write(mapping);
>  }
>  EXPORT_SYMBOL(unmap_mapping_range);

Huh? What protects mapping->i_mmap here? I don't see anything up the
stack taking the lock.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 30+ messages in thread
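
One concrete instance of the concern: truncate_pagecache() reaches
unmap_mapping_range() without taking i_mmap_lock itself, so skipping the
lock for DAX inodes would leave that walk of mapping->i_mmap unprotected
on such paths unless every caller is converted.  Roughly, from
mm/truncate.c (a sketch):

void truncate_pagecache(struct inode *inode, loff_t newsize)
{
	struct address_space *mapping = inode->i_mapping;
	loff_t holebegin = round_up(newsize, PAGE_SIZE);

	/*
	 * Unmap first, truncate, then unmap again: the second pass
	 * catches pages faulted in while the truncate was running.
	 */
	unmap_mapping_range(mapping, holebegin, 0, 1);
	truncate_inode_pages(mapping, newsize);
	unmap_mapping_range(mapping, holebegin, 0, 1);
}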

* Re: [PATCH 04/11] ext4: Add ext4_get_block_dax()
  2015-08-05  2:03     ` Dave Chinner
@ 2015-08-05 15:19       ` Matthew Wilcox
  -1 siblings, 0 replies; 30+ messages in thread
From: Matthew Wilcox @ 2015-08-05 15:19 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Matthew Wilcox, Andrew Morton, linux-fsdevel, linux-kernel, linux-mm

On Wed, Aug 05, 2015 at 12:03:57PM +1000, Dave Chinner wrote:
> On Tue, Aug 04, 2015 at 03:57:58PM -0400, Matthew Wilcox wrote:
> > From: Matthew Wilcox <willy@linux.intel.com>
> > 
> > DAX wants different semantics from any currently-existing ext4
> > get_block callback.  Unlike ext4_get_block_write(), it needs to honour
> > the 'create' flag, and unlike ext4_get_block(), it needs to be able
> > to return unwritten extents.  So introduce a new ext4_get_block_dax()
> > which has those semantics.  We could also change ext4_get_block_write()
> > to honour the 'create' flag, but that might have consequences on other
> > users that I do not currently understand.
> > 
> > Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
> 
> Doesn't this make the first patch in the series redundant?

As I explained in the cover letter, patch 1 has already gone to Ted.  It
might be on its way in, and it might not.  Rather than sending a series
that applies only to current mainline and forcing someone to fix up a
conflict later, it is better to resend the patch as part of this series;
our tools will do the right thing no matter which order the patches go
into Linus' tree.

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2015-08-05 15:19 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-08-04 19:57 [PATCH 00/11] DAX fixes for 4.3 Matthew Wilcox
2015-08-04 19:57 ` [PATCH 01/11] ext4: Use ext4_get_block_write() for DAX Matthew Wilcox
2015-08-04 19:57 ` [PATCH 02/11] thp: Change insert_pfn's return type to void Matthew Wilcox
2015-08-04 19:57 ` [PATCH 03/11] dax: Improve comment about truncate race Matthew Wilcox
2015-08-04 19:57 ` [PATCH 04/11] ext4: Add ext4_get_block_dax() Matthew Wilcox
2015-08-05  2:03   ` Dave Chinner
2015-08-05 15:19     ` Matthew Wilcox
2015-08-04 19:57 ` [PATCH 05/11] ext4: Start transaction before calling into DAX Matthew Wilcox
2015-08-04 19:58 ` [PATCH 06/11] dax: Fix race between simultaneous faults Matthew Wilcox
2015-08-05 11:43   ` Kirill A. Shutemov
2015-08-04 19:58 ` [PATCH 07/11] thp: Decrement refcount on huge zero page if it is split Matthew Wilcox
2015-08-04 19:58 ` [PATCH 08/11] thp: Fix zap_huge_pmd() for DAX Matthew Wilcox
2015-08-04 19:58 ` [PATCH 09/11] dax: Don't use set_huge_zero_page() Matthew Wilcox
2015-08-04 19:58 ` [PATCH 10/11] dax: Ensure that zero pages are removed from other processes Matthew Wilcox
2015-08-04 19:58 ` [PATCH 11/11] dax: Use linear_page_index() Matthew Wilcox
