* [PATCH 0/8] dax fixes / cleanups: pmd vs thp, lifetime, and locking
@ 2015-11-17 20:15 Dan Williams
  2015-11-17 20:15 ` [PATCH 1/8] ext2, ext4: warn when mounting with dax enabled Dan Williams
                   ` (7 more replies)
  0 siblings, 8 replies; 20+ messages in thread
From: Dan Williams @ 2015-11-17 20:15 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Jan Kara, Boaz Harrosh, Theodore Ts'o, Dave Hansen,
	Dave Chinner, Yigal Korman, stable, Jens Axboe, Jeff Moyer,
	linux-block, Matthew Wilcox, Dave Chinner, linux-fsdevel, willy,
	ross.zwisler, linux-ext4, akpm, Kirill A. Shutemov,
	Alexander Viro

Changes since last posting [1]:

1/ Further cleanups to dax_clear_blocks(): Dropped increments of 'addr'
   since we call bdev_direct_access() before the next use, and dropped the
   BUG_ON for sector-unaligned return values from bdev_direct_access().

2/ In [PATCH 8/8] introduce blk_dax_ctl to remove the need for
   separate dax_map_atomic and __dax_map_atomic routines.  Note,
   blk_dax_ctl is not passed through to drivers; it gets unpacked in
   bdev_direct_access().  (Willy)

3/ New [PATCH 2/8]: Disable huge page dax mappings while we resolve
   various crash scenarios in this development cycle.

4/ New [PATCH 4/8]: Unmap all dax mappings at block device shutdown.

I have kept the reviewed-by's received to date, let me know if these
incremental updates invalidate that review.

[1]: https://lists.01.org/pipermail/linux-nvdimm/2015-November/002733.html

---

The first 4 patches in this series I consider 4.4-rc / -stable material.
The rest are for 4.5.  [PATCH 4/8] needs scrutiny.  It is yet another
example of where DAX behavior necessarily differs from page cache
behavior.  I still maintain that we should not be surprising unaware
applications with DAX semantics, i.e. that DAX should be per-inode
opt-in, not globally enabled for all inodes at fs mount time.

The largest patch in the set, [PATCH 8/8], addresses the lifetime of the
'addr' returned by bdev_direct_access().  That address is only valid
while the device driver is enabled.  The new dax_map_atomic() /
dax_unmap_atomic() pairing guarantees that 'addr' stays valid for the
duration of that mapping.

While dax_map_atomic() protects against 'addr' going invalid, the new
calls to truncate_pagecache() via invalidate_inodes() protect against
the 'pfn' returned from bdev_direct_access() going invalid.  Otherwise,
the storage media can be directly accessed after the driver has been
disabled.
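
For reference, the usage pattern that the new pairing enforces looks
roughly like this (a condensed sketch of the pattern in [PATCH 8/8],
not a verbatim excerpt):

	struct blk_dax_ctl dax = {
		.sector = to_sector(bh, inode),
		.size = PAGE_SIZE,
	};

	if (dax_map_atomic(bdev, &dax) < 0)
		return PTR_ERR(dax.addr);
	/*
	 * dax.addr and dax.pfn stay valid here: dax_map_atomic() took a
	 * queue reference via blk_queue_enter(), so blk_cleanup_queue()
	 * cannot complete until dax_unmap_atomic() drops it.
	 */
	clear_pmem(dax.addr, PAGE_SIZE);
	wmb_pmem();
	dax_unmap_atomic(bdev, &dax);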

---

[PATCH 1/8] ext2, ext4: warn when mounting with dax enabled
[PATCH 2/8] dax: disable pmd mappings
[PATCH 3/8] mm, dax: fix DAX deadlocks (COW fault)
[PATCH 4/8] mm, dax: truncate dax mappings at bdev or fs shutdown
[PATCH 5/8] pmem, dax: clean up clear_pmem()
[PATCH 6/8] dax: increase granularity of dax_clear_blocks() operations
[PATCH 7/8] dax: guarantee page aligned results from bdev_direct_access()
[PATCH 8/8] dax: fix lifetime of in-kernel dax mappings with dax_map_atomic()

 arch/x86/include/asm/pmem.h |    7 -
 block/blk.h                 |    2 
 fs/Kconfig                  |    6 +
 fs/block_dev.c              |   15 +--
 fs/dax.c                    |  228 ++++++++++++++++++++++++++-----------------
 fs/ext2/super.c             |    2 
 fs/ext4/super.c             |    6 +
 fs/inode.c                  |   27 +++++
 include/linux/blkdev.h      |   19 +++-
 mm/memory.c                 |    8 +-
 mm/truncate.c               |   13 ++
 11 files changed, 217 insertions(+), 116 deletions(-)

* [PATCH 1/8] ext2, ext4: warn when mounting with dax enabled
  2015-11-17 20:15 [PATCH 0/8] dax fixes / cleanups: pmd vs thp, lifetime, and locking Dan Williams
@ 2015-11-17 20:15 ` Dan Williams
  2015-11-17 20:16 ` [PATCH 2/8] dax: disable pmd mappings Dan Williams
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 20+ messages in thread
From: Dan Williams @ 2015-11-17 20:15 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Theodore Ts'o, Dave Chinner, stable, linux-block, Jan Kara,
	linux-fsdevel, willy, ross.zwisler, linux-ext4, akpm,
	Kirill A. Shutemov

Similar to XFS, warn when mounting with DAX enabled while DAX is still
considered under development.  Also, aspects of the DAX implementation,
for example synchronization against multiple faults and faults causing
block allocation, depend on a correct implementation in the filesystem.
The maturity of a given DAX implementation is filesystem specific.

Cc: <stable@vger.kernel.org>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: linux-ext4@vger.kernel.org
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Dave Chinner <david@fromorbit.com>
Acked-by: Jan Kara <jack@suse.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/ext2/super.c |    2 ++
 fs/ext4/super.c |    6 +++++-
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index 3a71cea68420..748d35afc902 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -569,6 +569,8 @@ static int parse_options(char *options, struct super_block *sb)
 			/* Fall through */
 		case Opt_dax:
 #ifdef CONFIG_FS_DAX
+			ext2_msg(sb, KERN_WARNING,
+		"DAX enabled. Warning: EXPERIMENTAL, use at your own risk");
 			set_opt(sbi->s_mount_opt, DAX);
 #else
 			ext2_msg(sb, KERN_INFO, "dax option not supported");
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 753f4e68b820..c9ab67da6e5a 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1664,8 +1664,12 @@ static int handle_mount_opt(struct super_block *sb, char *opt, int token,
 		}
 		sbi->s_jquota_fmt = m->mount_opt;
 #endif
-#ifndef CONFIG_FS_DAX
 	} else if (token == Opt_dax) {
+#ifdef CONFIG_FS_DAX
+		ext4_msg(sb, KERN_WARNING,
+		"DAX enabled. Warning: EXPERIMENTAL, use at your own risk");
+			sbi->s_mount_opt |= m->mount_opt;
+#else
 		ext4_msg(sb, KERN_INFO, "dax option not supported");
 		return -1;
 #endif


* [PATCH 2/8] dax: disable pmd mappings
  2015-11-17 20:15 [PATCH 0/8] dax fixes / cleanups: pmd vs thp, lifetime, and locking Dan Williams
  2015-11-17 20:15 ` [PATCH 1/8] ext2, ext4: warn when mounting with dax enabled Dan Williams
@ 2015-11-17 20:16 ` Dan Williams
  2015-11-17 20:51   ` Ross Zwisler
  2015-11-17 20:16 ` [PATCH 3/8] mm, dax: fix DAX deadlocks (COW fault) Dan Williams
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 20+ messages in thread
From: Dan Williams @ 2015-11-17 20:16 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Dave Chinner, stable, linux-block, Jan Kara, linux-fsdevel,
	willy, ross.zwisler, akpm, Kirill A. Shutemov

While dax pmd mappings are functional in the nominal path, they trigger
kernel crashes in the following paths:

 BUG: unable to handle kernel paging request at ffffea0004098000
 IP: [<ffffffff812362f7>] follow_trans_huge_pmd+0x117/0x3b0
 [..]
 Call Trace:
  [<ffffffff811f6573>] follow_page_mask+0x2d3/0x380
  [<ffffffff811f6708>] __get_user_pages+0xe8/0x6f0
  [<ffffffff811f7045>] get_user_pages_unlocked+0x165/0x1e0
  [<ffffffff8106f5b1>] get_user_pages_fast+0xa1/0x1b0

 kernel BUG at arch/x86/mm/gup.c:131!
 [..]
 Call Trace:
  [<ffffffff8106f34c>] gup_pud_range+0x1bc/0x220
  [<ffffffff8106f634>] get_user_pages_fast+0x124/0x1b0

 BUG: unable to handle kernel paging request at ffffea0004088000
 IP: [<ffffffff81235f49>] copy_huge_pmd+0x159/0x350
 [..]
 Call Trace:
  [<ffffffff811fad3c>] copy_page_range+0x34c/0x9f0
  [<ffffffff810a0daf>] copy_process+0x1b7f/0x1e10
  [<ffffffff810a11c1>] _do_fork+0x91/0x590

All of these paths are interpreting a dax pmd mapping as a transparent
huge page and making the assumption that the pfn is covered by the
memmap, i.e. that the pfn has an associated struct page.  PTE mappings
do not suffer the same fate since they have the _PAGE_SPECIAL flag to
cause the gup path to fault.  We can do something similar for the PMD
path, or otherwise defer pmd support for cases where a struct page is
available.  For now, 4.4-rc and -stable need to disable dax pmd support
by default.

For development the "depends on BROKEN" line can be removed from
CONFIG_FS_DAX_PMD.
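
For context, pte-sized mappings survive gup_fast because of a check
along the following lines (condensed and paraphrased from
arch/x86/mm/gup.c of this era, not quoted verbatim); the pmd path has
no equivalent check for mappings without a struct page:

	static noinline int gup_pte_range(pmd_t pmd, unsigned long addr,
		unsigned long end, int write, struct page **pages, int *nr)
	{
		[..]
		/*
		 * A _PAGE_SPECIAL pte has no struct page; bail out so
		 * the slow path takes a fault instead of touching the
		 * memmap.
		 */
		if (pte_special(pte))
			return 0;
		[..]
	}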

Cc: <stable@vger.kernel.org>
Cc: Jan Kara <jack@suse.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/Kconfig |    6 ++++++
 fs/dax.c   |    4 ++++
 2 files changed, 10 insertions(+)

diff --git a/fs/Kconfig b/fs/Kconfig
index da3f32f1a4e4..6ce72d8d1ee1 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -46,6 +46,12 @@ config FS_DAX
 	  or if unsure, say N.  Saying Y will increase the size of the kernel
 	  by about 5kB.
 
+config FS_DAX_PMD
+	bool
+	default FS_DAX
+	depends on FS_DAX
+	depends on BROKEN
+
 endif # BLOCK
 
 # Posix ACL utility routines
diff --git a/fs/dax.c b/fs/dax.c
index d1e5cb7311a1..43671b68220e 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -541,6 +541,10 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 	unsigned long pfn;
 	int result = 0;
 
+	/* dax pmd mappings are broken wrt gup and fork */
+	if (!IS_ENABLED(CONFIG_FS_DAX_PMD))
+		return VM_FAULT_FALLBACK;
+
 	/* Fall back to PTEs if we're going to COW */
 	if (write && !(vma->vm_flags & VM_SHARED))
 		return VM_FAULT_FALLBACK;


* [PATCH 3/8] mm, dax: fix DAX deadlocks (COW fault)
  2015-11-17 20:15 [PATCH 0/8] dax fixes / cleanups: pmd vs thp, lifetime, and locking Dan Williams
  2015-11-17 20:15 ` [PATCH 1/8] ext2, ext4: warn when mounting with dax enabled Dan Williams
  2015-11-17 20:16 ` [PATCH 2/8] dax: disable pmd mappings Dan Williams
@ 2015-11-17 20:16 ` Dan Williams
  2015-11-17 20:16 ` [PATCH 4/8] mm, dax: truncate dax mappings at bdev or fs shutdown Dan Williams
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 20+ messages in thread
From: Dan Williams @ 2015-11-17 20:16 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Boaz Harrosh, Dave Chinner, stable, linux-block, Yigal Korman,
	Alexander Viro, Jan Kara, linux-fsdevel, willy, ross.zwisler,
	akpm, Kirill A. Shutemov, Matthew Wilcox

From: Yigal Korman <yigal@plexistor.com>

DAX handling of COW faults has the wrong locking sequence:
	dax_fault does i_mmap_lock_read
	do_cow_fault does i_mmap_unlock_write

Ross's commit[1] missed a fix[2] that Kirill added to Matthew's
commit[3].

Original COW locking logic was introduced by Matthew here[4].

This should be applied to v4.3 as well.

[1] 0f90cc6609c7 mm, dax: fix DAX deadlocks
[2] 52a2b53ffde6 mm, dax: use i_mmap_unlock_write() in do_cow_fault()
[3] 843172978bb9 dax: fix race between simultaneous faults
[4] 2e4cdab0584f mm: allow page fault handlers to perform the COW
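
In sketch form, the pairing after this fix (condensed from the diff
below for illustration):

	/* __dax_fault(): */
	i_mmap_lock_read(mapping);

	/* do_cow_fault(), no-page error paths: */
	i_mmap_unlock_read(vma->vm_file->f_mapping);
	/* previously i_mmap_unlock_write(), mismatched with the read
	 * lock taken by the fault handler */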

Cc: <stable@vger.kernel.org>
Cc: Boaz Harrosh <boaz@plexistor.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Jan Kara <jack@suse.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Acked-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Yigal Korman <yigal@plexistor.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 mm/memory.c |    8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index deb679c31f2a..c387430f06c3 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3015,9 +3015,9 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		} else {
 			/*
 			 * The fault handler has no page to lock, so it holds
-			 * i_mmap_lock for write to protect against truncate.
+			 * i_mmap_lock for read to protect against truncate.
 			 */
-			i_mmap_unlock_write(vma->vm_file->f_mapping);
+			i_mmap_unlock_read(vma->vm_file->f_mapping);
 		}
 		goto uncharge_out;
 	}
@@ -3031,9 +3031,9 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	} else {
 		/*
 		 * The fault handler has no page to lock, so it holds
-		 * i_mmap_lock for write to protect against truncate.
+		 * i_mmap_lock for read to protect against truncate.
 		 */
-		i_mmap_unlock_write(vma->vm_file->f_mapping);
+		i_mmap_unlock_read(vma->vm_file->f_mapping);
 	}
 	return ret;
 uncharge_out:


* [PATCH 4/8] mm, dax: truncate dax mappings at bdev or fs shutdown
  2015-11-17 20:15 [PATCH 0/8] dax fixes / cleanups: pmd vs thp, lifetime, and locking Dan Williams
                   ` (2 preceding siblings ...)
  2015-11-17 20:16 ` [PATCH 3/8] mm, dax: fix DAX deadlocks (COW fault) Dan Williams
@ 2015-11-17 20:16 ` Dan Williams
  2015-11-18 15:09   ` Jan Kara
  2015-11-17 20:16 ` [PATCH 5/8] pmem, dax: clean up clear_pmem() Dan Williams
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 20+ messages in thread
From: Dan Williams @ 2015-11-17 20:16 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Dave Chinner, stable, linux-block, Jan Kara, linux-fsdevel,
	willy, ross.zwisler, akpm

Currently dax mappings survive block_device shutdown.  While page cache
pages are permitted to be read/written after the block_device is torn
down, this is not acceptable in the dax case as all media access must
end when the device is disabled.  The pfn backing a dax mapping is
permitted to be invalidated after bdev shutdown, and this is indeed the
case with brd.

When a dax capable block_device driver calls del_gendisk() in its
shutdown path, or a filesystem evicts an inode it needs to ensure that
all the pfns that had been mapped via bdev_direct_access() are unmapped.
This is different than the pagecache backed case where
truncate_inode_pages() is sufficient to end I/O to pages mapped to a
dying inode.

Since dax bypasses the page cache we need to unmap in addition to
truncating pages.  Also, since dax mappings are not accounted in the
mapping radix tree, we unconditionally truncate all inodes with the
S_DAX flag.  Later, when we add support for dynamic dax enable/disable
control, we will likely have infrastructure to detect whether the inode
is unmapped and can skip the truncate.
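
In sketch form, the teardown ordering this patch cares about looks
like the following (illustrative only, the exact call chain is driver
specific):

	/* dax-capable driver ->remove() path */
	del_gendisk(disk);	/* __invalidate_device() ->
				 * invalidate_inodes() truncates and
				 * unmaps every S_DAX inode */
	blk_cleanup_queue(q);	/* ends all I/O to the device */
	/* only past this point may the pfns backing the device
	 * become invalid */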

Cc: <stable@vger.kernel.org>
Cc: Jan Kara <jack@suse.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/inode.c    |   27 +++++++++++++++++++++++++++
 mm/truncate.c |   13 +++++++++++--
 2 files changed, 38 insertions(+), 2 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index 1be5f9003eb3..1029e033e991 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -579,6 +579,18 @@ static void dispose_list(struct list_head *head)
 	}
 }
 
+static void truncate_list(struct list_head *head)
+{
+	struct inode *inode, *_i;
+
+	list_for_each_entry_safe(inode, _i, head, i_lru) {
+		list_del_init(&inode->i_lru);
+		truncate_pagecache(inode, 0);
+		iput(inode);
+		cond_resched();
+	}
+}
+
 /**
  * evict_inodes	- evict all evictable inodes for a superblock
  * @sb:		superblock to operate on
@@ -642,6 +654,7 @@ int invalidate_inodes(struct super_block *sb, bool kill_dirty)
 	int busy = 0;
 	struct inode *inode, *next;
 	LIST_HEAD(dispose);
+	LIST_HEAD(truncate);
 
 	spin_lock(&sb->s_inode_list_lock);
 	list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
@@ -655,6 +668,19 @@ int invalidate_inodes(struct super_block *sb, bool kill_dirty)
 			busy = 1;
 			continue;
 		}
+		if (IS_DAX(inode) && atomic_read(&inode->i_count)) {
+			/*
+			 * dax mappings can't live past this invalidation event
+			 * as there is no page cache present to allow the data
+			 * to remain accessible.
+			 */
+			__iget(inode);
+			inode_lru_list_del(inode);
+			spin_unlock(&inode->i_lock);
+			list_add(&inode->i_lru, &truncate);
+			busy = 1;
+			continue;
+		}
 		if (atomic_read(&inode->i_count)) {
 			spin_unlock(&inode->i_lock);
 			busy = 1;
@@ -669,6 +695,7 @@ int invalidate_inodes(struct super_block *sb, bool kill_dirty)
 	spin_unlock(&sb->s_inode_list_lock);
 
 	dispose_list(&dispose);
+	truncate_list(&truncate);
 
 	return busy;
 }
diff --git a/mm/truncate.c b/mm/truncate.c
index 76e35ad97102..ff1fb3b0980e 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -402,6 +402,7 @@ EXPORT_SYMBOL(truncate_inode_pages);
  */
 void truncate_inode_pages_final(struct address_space *mapping)
 {
+	struct inode *inode = mapping->host;
 	unsigned long nrshadows;
 	unsigned long nrpages;
 
@@ -423,7 +424,7 @@ void truncate_inode_pages_final(struct address_space *mapping)
 	smp_rmb();
 	nrshadows = mapping->nrshadows;
 
-	if (nrpages || nrshadows) {
+	if (nrpages || nrshadows || IS_DAX(inode)) {
 		/*
 		 * As truncation uses a lockless tree lookup, cycle
 		 * the tree lock to make sure any ongoing tree
@@ -433,7 +434,15 @@ void truncate_inode_pages_final(struct address_space *mapping)
 		spin_lock_irq(&mapping->tree_lock);
 		spin_unlock_irq(&mapping->tree_lock);
 
-		truncate_inode_pages(mapping, 0);
+		/*
+		 * In the case of DAX we also need to unmap the inode
+		 * since the pfn backing the mapping may be invalidated
+		 * after this returns
+		 */
+		if (IS_DAX(inode))
+			truncate_pagecache(inode, 0);
+		else
+			truncate_inode_pages(mapping, 0);
 	}
 }
 EXPORT_SYMBOL(truncate_inode_pages_final);


* [PATCH 5/8] pmem, dax: clean up clear_pmem()
  2015-11-17 20:15 [PATCH 0/8] dax fixes / cleanups: pmd vs thp, lifetime, and locking Dan Williams
                   ` (3 preceding siblings ...)
  2015-11-17 20:16 ` [PATCH 4/8] mm, dax: truncate dax mappings at bdev or fs shutdown Dan Williams
@ 2015-11-17 20:16 ` Dan Williams
  2015-11-17 20:16 ` [PATCH 6/8] dax: increase granularity of dax_clear_blocks() operations Dan Williams
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 20+ messages in thread
From: Dan Williams @ 2015-11-17 20:16 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Dave Hansen, linux-block, Jeff Moyer, linux-fsdevel, willy,
	ross.zwisler, akpm

Both __dax_pmd_fault() and clear_pmem() were taking special steps to
clear memory a page at a time to take advantage of non-temporal
clear_page() implementations.  However, x86_64 does not use
non-temporal instructions for clear_page(), and arch_clear_pmem() was
always incurring the cost of __arch_wb_cache_pmem().

Clean up the assumption that doing clear_pmem() a page at a time is more
performant.

Reported-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/x86/include/asm/pmem.h |    7 +------
 fs/dax.c                    |    4 +---
 2 files changed, 2 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/pmem.h b/arch/x86/include/asm/pmem.h
index d8ce3ec816ab..1544fabcd7f9 100644
--- a/arch/x86/include/asm/pmem.h
+++ b/arch/x86/include/asm/pmem.h
@@ -132,12 +132,7 @@ static inline void arch_clear_pmem(void __pmem *addr, size_t size)
 {
 	void *vaddr = (void __force *)addr;
 
-	/* TODO: implement the zeroing via non-temporal writes */
-	if (size == PAGE_SIZE && ((unsigned long)vaddr & ~PAGE_MASK) == 0)
-		clear_page(vaddr);
-	else
-		memset(vaddr, 0, size);
-
+	memset(vaddr, 0, size);
 	__arch_wb_cache_pmem(vaddr, size);
 }
 
diff --git a/fs/dax.c b/fs/dax.c
index 43671b68220e..19492cc65a30 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -641,9 +641,7 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 			goto fallback;
 
 		if (buffer_unwritten(&bh) || buffer_new(&bh)) {
-			int i;
-			for (i = 0; i < PTRS_PER_PMD; i++)
-				clear_pmem(kaddr + i * PAGE_SIZE, PAGE_SIZE);
+			clear_pmem(kaddr, PMD_SIZE);
 			wmb_pmem();
 			count_vm_event(PGMAJFAULT);
 			mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);


* [PATCH 6/8] dax: increase granularity of dax_clear_blocks() operations
  2015-11-17 20:15 [PATCH 0/8] dax fixes / cleanups: pmd vs thp, lifetime, and locking Dan Williams
                   ` (4 preceding siblings ...)
  2015-11-17 20:16 ` [PATCH 5/8] pmem, dax: clean up clear_pmem() Dan Williams
@ 2015-11-17 20:16 ` Dan Williams
  2015-11-17 20:16 ` [PATCH 7/8] dax: guarantee page aligned results from bdev_direct_access() Dan Williams
  2015-11-17 20:16 ` [PATCH 8/8] dax: fix lifetime of in-kernel dax mappings with dax_map_atomic() Dan Williams
  7 siblings, 0 replies; 20+ messages in thread
From: Dan Williams @ 2015-11-17 20:16 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: linux-block, Jeff Moyer, Jan Kara, linux-fsdevel, willy,
	ross.zwisler, akpm

dax_clear_blocks() currently performs a cond_resched() after every
PAGE_SIZE memset.  We need not check so frequently; for example, md-raid
only calls cond_resched() at stripe granularity.  Also, in preparation
for introducing a dax_map_atomic() operation that temporarily pins a dax
mapping, move the call to cond_resched() to the outer loop.

Reviewed-by: Jan Kara <jack@suse.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/dax.c |   22 ++++++++--------------
 1 file changed, 8 insertions(+), 14 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 19492cc65a30..e11d88835bb2 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -28,6 +28,7 @@
 #include <linux/sched.h>
 #include <linux/uio.h>
 #include <linux/vmstat.h>
+#include <linux/sizes.h>
 
 /*
  * dax_clear_blocks() is called from within transaction context from XFS,
@@ -43,24 +44,17 @@ int dax_clear_blocks(struct inode *inode, sector_t block, long size)
 	do {
 		void __pmem *addr;
 		unsigned long pfn;
-		long count;
+		long count, sz;
 
 		count = bdev_direct_access(bdev, sector, &addr, &pfn, size);
 		if (count < 0)
 			return count;
-		BUG_ON(size < count);
-		while (count > 0) {
-			unsigned pgsz = PAGE_SIZE - offset_in_page(addr);
-			if (pgsz > count)
-				pgsz = count;
-			clear_pmem(addr, pgsz);
-			addr += pgsz;
-			size -= pgsz;
-			count -= pgsz;
-			BUG_ON(pgsz & 511);
-			sector += pgsz / 512;
-			cond_resched();
-		}
+		sz = min_t(long, count, SZ_1M);
+		clear_pmem(addr, sz);
+		size -= sz;
+		BUG_ON(sz & 511);
+		sector += sz / 512;
+		cond_resched();
 	} while (size);
 
 	wmb_pmem();


* [PATCH 7/8] dax: guarantee page aligned results from bdev_direct_access()
  2015-11-17 20:15 [PATCH 0/8] dax fixes / cleanups: pmd vs thp, lifetime, and locking Dan Williams
                   ` (5 preceding siblings ...)
  2015-11-17 20:16 ` [PATCH 6/8] dax: increase granularity of dax_clear_blocks() operations Dan Williams
@ 2015-11-17 20:16 ` Dan Williams
  2015-11-17 20:16 ` [PATCH 8/8] dax: fix lifetime of in-kernel dax mappings with dax_map_atomic() Dan Williams
  7 siblings, 0 replies; 20+ messages in thread
From: Dan Williams @ 2015-11-17 20:16 UTC (permalink / raw)
  To: linux-nvdimm; +Cc: linux-block, willy, ross.zwisler, linux-fsdevel, akpm

If a ->direct_access() implementation ever returns a map count that is
not a multiple of PAGE_SIZE, catch the error in bdev_direct_access().
This simplifies error checking in upper layers.

Reported-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/block_dev.c |    2 ++
 fs/dax.c       |    1 -
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index bb0dfb1c7af1..e2cc681d4afe 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -475,6 +475,8 @@ long bdev_direct_access(struct block_device *bdev, sector_t sector,
 	avail = ops->direct_access(bdev, sector, addr, pfn);
 	if (!avail)
 		return -ERANGE;
+	if (avail > 0 && avail & ~PAGE_MASK)
+		return -ENXIO;
 	return min(avail, size);
 }
 EXPORT_SYMBOL_GPL(bdev_direct_access);
diff --git a/fs/dax.c b/fs/dax.c
index e11d88835bb2..6e498c2570bf 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -52,7 +52,6 @@ int dax_clear_blocks(struct inode *inode, sector_t block, long size)
 		sz = min_t(long, count, SZ_1M);
 		clear_pmem(addr, sz);
 		size -= sz;
-		BUG_ON(sz & 511);
 		sector += sz / 512;
 		cond_resched();
 	} while (size);


* [PATCH 8/8] dax: fix lifetime of in-kernel dax mappings with dax_map_atomic()
  2015-11-17 20:15 [PATCH 0/8] dax fixes / cleanups: pmd vs thp, lifetime, and locking Dan Williams
                   ` (6 preceding siblings ...)
  2015-11-17 20:16 ` [PATCH 7/8] dax: guarantee page aligned results from bdev_direct_access() Dan Williams
@ 2015-11-17 20:16 ` Dan Williams
  7 siblings, 0 replies; 20+ messages in thread
From: Dan Williams @ 2015-11-17 20:16 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: linux-block, Dave Chinner, Jens Axboe, Jeff Moyer, Jan Kara,
	linux-fsdevel, willy, ross.zwisler, akpm

The DAX implementation needs to protect new calls to ->direct_access(),
and usage of its return value, against the underlying block device's
driver being disabled.  Use blk_queue_enter()/blk_queue_exit() to hold
off blk_cleanup_queue() from proceeding, or otherwise to fail new
mapping requests if the request_queue is being torn down.

This also introduces blk_dax_ctl to simplify the interface from fs/dax.c
through dax_map_atomic() to bdev_direct_access().
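
Condensed from the diff below, the interface change looks like:

	/* before */
	long bdev_direct_access(struct block_device *bdev, sector_t sector,
			void __pmem **addr, unsigned long *pfn, long size);

	/* after: inputs and outputs travel in one struct */
	struct blk_dax_ctl dax = {
		.sector = sector,
		.size = size,
	};
	long rc = bdev_direct_access(bdev, &dax);
	/* on success, dax.addr and dax.pfn are populated */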

Cc: Jan Kara <jack@suse.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 block/blk.h            |    2 
 fs/block_dev.c         |   13 +--
 fs/dax.c               |  207 ++++++++++++++++++++++++++++++------------------
 include/linux/blkdev.h |   19 ++++
 4 files changed, 151 insertions(+), 90 deletions(-)

diff --git a/block/blk.h b/block/blk.h
index da722eb786df..c43926d3d74d 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -72,8 +72,6 @@ void blk_dequeue_request(struct request *rq);
 void __blk_queue_free_tags(struct request_queue *q);
 bool __blk_end_bidi_request(struct request *rq, int error,
 			    unsigned int nr_bytes, unsigned int bidi_bytes);
-int blk_queue_enter(struct request_queue *q, gfp_t gfp);
-void blk_queue_exit(struct request_queue *q);
 void blk_freeze_queue(struct request_queue *q);
 
 static inline void blk_queue_enter_live(struct request_queue *q)
diff --git a/fs/block_dev.c b/fs/block_dev.c
index e2cc681d4afe..228c313c0ac1 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -436,10 +436,7 @@ EXPORT_SYMBOL_GPL(bdev_write_page);
 /**
  * bdev_direct_access() - Get the address for directly-accessibly memory
  * @bdev: The device containing the memory
- * @sector: The offset within the device
- * @addr: Where to put the address of the memory
- * @pfn: The Page Frame Number for the memory
- * @size: The number of bytes requested
+ * @dax: control and output parameters for ->direct_access
  *
  * If a block device is made up of directly addressable memory, this function
  * will tell the caller the PFN and the address of the memory.  The address
@@ -450,10 +447,10 @@ EXPORT_SYMBOL_GPL(bdev_write_page);
  * Return: negative errno if an error occurs, otherwise the number of bytes
  * accessible at this address.
  */
-long bdev_direct_access(struct block_device *bdev, sector_t sector,
-			void __pmem **addr, unsigned long *pfn, long size)
+long bdev_direct_access(struct block_device *bdev, struct blk_dax_ctl *dax)
 {
-	long avail;
+	sector_t sector = dax->sector;
+	long avail, size = dax->size;
 	const struct block_device_operations *ops = bdev->bd_disk->fops;
 
 	/*
@@ -472,7 +469,7 @@ long bdev_direct_access(struct block_device *bdev, sector_t sector,
 	sector += get_start_sect(bdev);
 	if (sector % (PAGE_SIZE / 512))
 		return -EINVAL;
-	avail = ops->direct_access(bdev, sector, addr, pfn);
+	avail = ops->direct_access(bdev, sector, &dax->addr, &dax->pfn);
 	if (!avail)
 		return -ERANGE;
 	if (avail > 0 && avail & ~PAGE_MASK)
diff --git a/fs/dax.c b/fs/dax.c
index 6e498c2570bf..a6e29d5ad6fd 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -30,45 +30,65 @@
 #include <linux/vmstat.h>
 #include <linux/sizes.h>
 
+static long dax_map_atomic(struct block_device *bdev, struct blk_dax_ctl *dax)
+{
+	struct request_queue *q = bdev->bd_queue;
+	long rc = -EIO;
+
+	dax->addr = (void __pmem *) ERR_PTR(-EIO);
+	if (blk_queue_enter(q, GFP_NOWAIT) != 0)
+		return rc;
+
+	rc = bdev_direct_access(bdev, dax);
+	if (rc < 0) {
+		dax->addr = (void __pmem *) ERR_PTR(rc);
+		blk_queue_exit(q);
+		return rc;
+	}
+	return rc;
+}
+
+static void dax_unmap_atomic(struct block_device *bdev,
+		const struct blk_dax_ctl *dax)
+{
+	if (IS_ERR(dax->addr))
+		return;
+	blk_queue_exit(bdev->bd_queue);
+}
+
 /*
  * dax_clear_blocks() is called from within transaction context from XFS,
  * and hence this means the stack from this point must follow GFP_NOFS
  * semantics for all operations.
  */
-int dax_clear_blocks(struct inode *inode, sector_t block, long size)
+int dax_clear_blocks(struct inode *inode, sector_t block, long _size)
 {
 	struct block_device *bdev = inode->i_sb->s_bdev;
-	sector_t sector = block << (inode->i_blkbits - 9);
+	struct blk_dax_ctl dax = {
+		.sector = block << (inode->i_blkbits - 9),
+		.size = _size,
+	};
 
 	might_sleep();
 	do {
-		void __pmem *addr;
-		unsigned long pfn;
 		long count, sz;
 
-		count = bdev_direct_access(bdev, sector, &addr, &pfn, size);
+		count = dax_map_atomic(bdev, &dax);
 		if (count < 0)
 			return count;
 		sz = min_t(long, count, SZ_1M);
-		clear_pmem(addr, sz);
-		size -= sz;
-		sector += sz / 512;
+		clear_pmem(dax.addr, sz);
+		dax.size -= sz;
+		dax.sector += sz / 512;
+		dax_unmap_atomic(bdev, &dax);
 		cond_resched();
-	} while (size);
+	} while (dax.size);
 
 	wmb_pmem();
 	return 0;
 }
 EXPORT_SYMBOL_GPL(dax_clear_blocks);
 
-static long dax_get_addr(struct buffer_head *bh, void __pmem **addr,
-		unsigned blkbits)
-{
-	unsigned long pfn;
-	sector_t sector = bh->b_blocknr << (blkbits - 9);
-	return bdev_direct_access(bh->b_bdev, sector, addr, &pfn, bh->b_size);
-}
-
 /* the clear_pmem() calls are ordered by a wmb_pmem() in the caller */
 static void dax_new_buf(void __pmem *addr, unsigned size, unsigned first,
 		loff_t pos, loff_t end)
@@ -98,19 +118,29 @@ static bool buffer_size_valid(struct buffer_head *bh)
 	return bh->b_state != 0;
 }
 
+
+static sector_t to_sector(const struct buffer_head *bh,
+		const struct inode *inode)
+{
+	sector_t sector = bh->b_blocknr << (inode->i_blkbits - 9);
+
+	return sector;
+}
+
 static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
 		      loff_t start, loff_t end, get_block_t get_block,
 		      struct buffer_head *bh)
 {
-	ssize_t retval = 0;
-	loff_t pos = start;
-	loff_t max = start;
-	loff_t bh_max = start;
-	void __pmem *addr;
-	bool hole = false;
-	bool need_wmb = false;
-
-	if (iov_iter_rw(iter) != WRITE)
+	loff_t pos = start, max = start, bh_max = start;
+	bool hole = false, need_wmb = false;
+	struct block_device *bdev = NULL;
+	int rw = iov_iter_rw(iter), rc;
+	long map_len = 0;
+	struct blk_dax_ctl dax = {
+		.addr = (void __pmem *) ERR_PTR(-EIO),
+	};
+
+	if (rw == READ)
 		end = min(end, i_size_read(inode));
 
 	while (pos < end) {
@@ -125,13 +155,13 @@ static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
 			if (pos == bh_max) {
 				bh->b_size = PAGE_ALIGN(end - pos);
 				bh->b_state = 0;
-				retval = get_block(inode, block, bh,
-						   iov_iter_rw(iter) == WRITE);
-				if (retval)
+				rc = get_block(inode, block, bh, rw == WRITE);
+				if (rc)
 					break;
 				if (!buffer_size_valid(bh))
 					bh->b_size = 1 << blkbits;
 				bh_max = pos - first + bh->b_size;
+				bdev = bh->b_bdev;
 			} else {
 				unsigned done = bh->b_size -
 						(bh_max - (pos - first));
@@ -139,47 +169,52 @@ static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
 				bh->b_size -= done;
 			}
 
-			hole = iov_iter_rw(iter) != WRITE && !buffer_written(bh);
+			hole = rw == READ && !buffer_written(bh);
 			if (hole) {
-				addr = NULL;
 				size = bh->b_size - first;
 			} else {
-				retval = dax_get_addr(bh, &addr, blkbits);
-				if (retval < 0)
+				dax_unmap_atomic(bdev, &dax);
+				dax.sector = to_sector(bh, inode);
+				dax.size = bh->b_size;
+				map_len = dax_map_atomic(bdev, &dax);
+				if (map_len < 0) {
+					rc = map_len;
 					break;
+				}
 				if (buffer_unwritten(bh) || buffer_new(bh)) {
-					dax_new_buf(addr, retval, first, pos,
-									end);
+					dax_new_buf(dax.addr, map_len, first,
+							pos, end);
 					need_wmb = true;
 				}
-				addr += first;
-				size = retval - first;
+				dax.addr += first;
+				size = map_len - first;
 			}
 			max = min(pos + size, end);
 		}
 
 		if (iov_iter_rw(iter) == WRITE) {
-			len = copy_from_iter_pmem(addr, max - pos, iter);
+			len = copy_from_iter_pmem(dax.addr, max - pos, iter);
 			need_wmb = true;
 		} else if (!hole)
-			len = copy_to_iter((void __force *)addr, max - pos,
+			len = copy_to_iter((void __force *) dax.addr, max - pos,
 					iter);
 		else
 			len = iov_iter_zero(max - pos, iter);
 
 		if (!len) {
-			retval = -EFAULT;
+			rc = -EFAULT;
 			break;
 		}
 
 		pos += len;
-		addr += len;
+		dax.addr += len;
 	}
 
 	if (need_wmb)
 		wmb_pmem();
+	dax_unmap_atomic(bdev, &dax);
 
-	return (pos == start) ? retval : pos - start;
+	return (pos == start) ? rc : pos - start;
 }
 
 /**
@@ -268,28 +303,35 @@ static int dax_load_hole(struct address_space *mapping, struct page *page,
 	return VM_FAULT_LOCKED;
 }
 
-static int copy_user_bh(struct page *to, struct buffer_head *bh,
-			unsigned blkbits, unsigned long vaddr)
+static int copy_user_bh(struct page *to, struct inode *inode,
+		struct buffer_head *bh, unsigned long vaddr)
 {
-	void __pmem *vfrom;
+	struct blk_dax_ctl dax = {
+		.sector = to_sector(bh, inode),
+		.size = bh->b_size,
+	};
+	struct block_device *bdev = bh->b_bdev;
 	void *vto;
 
-	if (dax_get_addr(bh, &vfrom, blkbits) < 0)
-		return -EIO;
+	if (dax_map_atomic(bdev, &dax) < 0)
+		return PTR_ERR(dax.addr);
 	vto = kmap_atomic(to);
-	copy_user_page(vto, (void __force *)vfrom, vaddr, to);
+	copy_user_page(vto, (void __force *)dax.addr, vaddr, to);
 	kunmap_atomic(vto);
+	dax_unmap_atomic(bdev, &dax);
 	return 0;
 }
 
 static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
 			struct vm_area_struct *vma, struct vm_fault *vmf)
 {
-	struct address_space *mapping = inode->i_mapping;
-	sector_t sector = bh->b_blocknr << (inode->i_blkbits - 9);
 	unsigned long vaddr = (unsigned long)vmf->virtual_address;
-	void __pmem *addr;
-	unsigned long pfn;
+	struct address_space *mapping = inode->i_mapping;
+	struct block_device *bdev = bh->b_bdev;
+	struct blk_dax_ctl dax = {
+		.sector = to_sector(bh, inode),
+		.size = bh->b_size,
+	};
 	pgoff_t size;
 	int error;
 
@@ -308,20 +350,18 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
 		goto out;
 	}
 
-	error = bdev_direct_access(bh->b_bdev, sector, &addr, &pfn, bh->b_size);
-	if (error < 0)
-		goto out;
-	if (error < PAGE_SIZE) {
-		error = -EIO;
+	if (dax_map_atomic(bdev, &dax) < 0) {
+		error = PTR_ERR(dax.addr);
 		goto out;
 	}
 
 	if (buffer_unwritten(bh) || buffer_new(bh)) {
-		clear_pmem(addr, PAGE_SIZE);
+		clear_pmem(dax.addr, PAGE_SIZE);
 		wmb_pmem();
 	}
+	dax_unmap_atomic(bdev, &dax);
 
-	error = vm_insert_mixed(vma, vaddr, pfn);
+	error = vm_insert_mixed(vma, vaddr, dax.pfn);
 
  out:
 	i_mmap_unlock_read(mapping);
@@ -415,7 +455,7 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 	if (vmf->cow_page) {
 		struct page *new_page = vmf->cow_page;
 		if (buffer_written(&bh))
-			error = copy_user_bh(new_page, &bh, blkbits, vaddr);
+			error = copy_user_bh(new_page, inode, &bh, vaddr);
 		else
 			clear_user_highpage(new_page, vaddr);
 		if (error)
@@ -527,11 +567,9 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 	unsigned blkbits = inode->i_blkbits;
 	unsigned long pmd_addr = address & PMD_MASK;
 	bool write = flags & FAULT_FLAG_WRITE;
-	long length;
-	void __pmem *kaddr;
+	struct block_device *bdev;
 	pgoff_t size, pgoff;
-	sector_t block, sector;
-	unsigned long pfn;
+	sector_t block;
 	int result = 0;
 
 	/* dax pmd mappings are broken wrt gup and fork */
@@ -559,9 +597,9 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 	block = (sector_t)pgoff << (PAGE_SHIFT - blkbits);
 
 	bh.b_size = PMD_SIZE;
-	length = get_block(inode, block, &bh, write);
-	if (length)
+	if (get_block(inode, block, &bh, write) != 0)
 		return VM_FAULT_SIGBUS;
+	bdev = bh.b_bdev;
 	i_mmap_lock_read(mapping);
 
 	/*
@@ -616,32 +654,40 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 		result = VM_FAULT_NOPAGE;
 		spin_unlock(ptl);
 	} else {
-		sector = bh.b_blocknr << (blkbits - 9);
-		length = bdev_direct_access(bh.b_bdev, sector, &kaddr, &pfn,
-						bh.b_size);
+		struct blk_dax_ctl dax = {
+			.sector = to_sector(&bh, inode),
+			.size = PMD_SIZE,
+		};
+		long length = dax_map_atomic(bdev, &dax);
+
 		if (length < 0) {
 			result = VM_FAULT_SIGBUS;
 			goto out;
 		}
-		if ((length < PMD_SIZE) || (pfn & PG_PMD_COLOUR))
+		if ((length < PMD_SIZE) || (dax.pfn & PG_PMD_COLOUR)) {
+			dax_unmap_atomic(bdev, &dax);
 			goto fallback;
+		}
 
 		/*
 		 * TODO: teach vmf_insert_pfn_pmd() to support
 		 * 'pte_special' for pmds
 		 */
-		if (pfn_valid(pfn))
+		if (pfn_valid(dax.pfn)) {
+			dax_unmap_atomic(bdev, &dax);
 			goto fallback;
+		}
 
 		if (buffer_unwritten(&bh) || buffer_new(&bh)) {
-			clear_pmem(kaddr, PMD_SIZE);
+			clear_pmem(dax.addr, PMD_SIZE);
 			wmb_pmem();
 			count_vm_event(PGMAJFAULT);
 			mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
 			result |= VM_FAULT_MAJOR;
 		}
+		dax_unmap_atomic(bdev, &dax);
 
-		result |= vmf_insert_pfn_pmd(vma, address, pmd, pfn, write);
+		result |= vmf_insert_pfn_pmd(vma, address, pmd, dax.pfn, write);
 	}
 
  out:
@@ -743,12 +789,17 @@ int dax_zero_page_range(struct inode *inode, loff_t from, unsigned length,
 	if (err < 0)
 		return err;
 	if (buffer_written(&bh)) {
-		void __pmem *addr;
-		err = dax_get_addr(&bh, &addr, inode->i_blkbits);
-		if (err < 0)
-			return err;
-		clear_pmem(addr + offset, length);
+		struct block_device *bdev = bh.b_bdev;
+		struct blk_dax_ctl dax = {
+			.sector = to_sector(&bh, inode),
+			.size = PAGE_CACHE_SIZE,
+		};
+
+		if (dax_map_atomic(bdev, &dax) < 0)
+			return PTR_ERR(dax.addr);
+		clear_pmem(dax.addr + offset, length);
 		wmb_pmem();
+		dax_unmap_atomic(bdev, &dax);
 	}
 
 	return 0;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 3fe27f8d91f0..8aa53454ce27 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -794,6 +794,8 @@ extern int scsi_cmd_ioctl(struct request_queue *, struct gendisk *, fmode_t,
 extern int sg_scsi_ioctl(struct request_queue *, struct gendisk *, fmode_t,
 			 struct scsi_ioctl_command __user *);
 
+extern int blk_queue_enter(struct request_queue *q, gfp_t gfp);
+extern void blk_queue_exit(struct request_queue *q);
 extern void blk_start_queue(struct request_queue *q);
 extern void blk_stop_queue(struct request_queue *q);
 extern void blk_sync_queue(struct request_queue *q);
@@ -1615,6 +1617,20 @@ static inline bool integrity_req_gap_front_merge(struct request *req,
 
 #endif /* CONFIG_BLK_DEV_INTEGRITY */
 
+/**
+ * struct blk_dax_ctl - control and output parameters for ->direct_access
+ * @sector: (input) offset relative to a block_device
+ * @addr: (output) kernel virtual address for @sector populated by driver
+ * @pfn: (output) page frame number for @addr populated by driver
+ * @size: (input) number of bytes requested
+ */
+struct blk_dax_ctl {
+	sector_t sector;
+	void __pmem *addr;
+	long size;
+	unsigned long pfn;
+};
+
 struct block_device_operations {
 	int (*open) (struct block_device *, fmode_t);
 	void (*release) (struct gendisk *, fmode_t);
@@ -1641,8 +1657,7 @@ extern int __blkdev_driver_ioctl(struct block_device *, fmode_t, unsigned int,
 extern int bdev_read_page(struct block_device *, sector_t, struct page *);
 extern int bdev_write_page(struct block_device *, sector_t, struct page *,
 						struct writeback_control *);
-extern long bdev_direct_access(struct block_device *, sector_t,
-		void __pmem **addr, unsigned long *pfn, long size);
+extern long bdev_direct_access(struct block_device *, struct blk_dax_ctl *);
 #else /* CONFIG_BLOCK */
 
 struct block_device;


* Re: [PATCH 2/8] dax: disable pmd mappings
  2015-11-17 20:16 ` [PATCH 2/8] dax: disable pmd mappings Dan Williams
@ 2015-11-17 20:51   ` Ross Zwisler
  0 siblings, 0 replies; 20+ messages in thread
From: Ross Zwisler @ 2015-11-17 20:51 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-nvdimm, Dave Chinner, stable, linux-block, Jan Kara,
	linux-fsdevel, willy, ross.zwisler, akpm, Kirill A. Shutemov

On Tue, Nov 17, 2015 at 12:16:03PM -0800, Dan Williams wrote:
> While dax pmd mappings are functional in the nominal path, they trigger
> kernel crashes in the following paths:
> 
>  BUG: unable to handle kernel paging request at ffffea0004098000
>  IP: [<ffffffff812362f7>] follow_trans_huge_pmd+0x117/0x3b0
>  [..]
>  Call Trace:
>   [<ffffffff811f6573>] follow_page_mask+0x2d3/0x380
>   [<ffffffff811f6708>] __get_user_pages+0xe8/0x6f0
>   [<ffffffff811f7045>] get_user_pages_unlocked+0x165/0x1e0
>   [<ffffffff8106f5b1>] get_user_pages_fast+0xa1/0x1b0
> 
>  kernel BUG at arch/x86/mm/gup.c:131!
>  [..]
>  Call Trace:
>   [<ffffffff8106f34c>] gup_pud_range+0x1bc/0x220
>   [<ffffffff8106f634>] get_user_pages_fast+0x124/0x1b0
> 
>  BUG: unable to handle kernel paging request at ffffea0004088000
>  IP: [<ffffffff81235f49>] copy_huge_pmd+0x159/0x350
>  [..]
>  Call Trace:
>   [<ffffffff811fad3c>] copy_page_range+0x34c/0x9f0
>   [<ffffffff810a0daf>] copy_process+0x1b7f/0x1e10
>   [<ffffffff810a11c1>] _do_fork+0x91/0x590
> 
> All of these paths are interpreting a dax pmd mapping as a transparent
> huge page and making the assumption that the pfn is covered by the
> memmap, i.e. that the pfn has an associated struct page.  PTE mappings
> do not suffer the same fate since they have the _PAGE_SPECIAL flag to
> cause the gup path to fault.  We can do something similar for the PMD
> path, or otherwise defer pmd support for cases where a struct page is
> available.  For now, 4.4-rc and -stable need to disable dax pmd support
> by default.
> 
> For development the "depends on BROKEN" line can be removed from
> CONFIG_FS_DAX_PMD.
> 
> Cc: <stable@vger.kernel.org>
> Cc: Jan Kara <jack@suse.com>
> Cc: Dave Chinner <david@fromorbit.com>
> Cc: Matthew Wilcox <willy@linux.intel.com>
> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Reported-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>

Acked-by: Ross Zwisler <ross.zwisler@linux.intel.com>

* Re: [PATCH 4/8] mm, dax: truncate dax mappings at bdev or fs shutdown
  2015-11-17 20:16 ` [PATCH 4/8] mm, dax: truncate dax mappings at bdev or fs shutdown Dan Williams
@ 2015-11-18 15:09   ` Jan Kara
  2015-11-19  0:22     ` Williams, Dan J
  0 siblings, 1 reply; 20+ messages in thread
From: Jan Kara @ 2015-11-18 15:09 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-nvdimm, Dave Chinner, stable, linux-block, Jan Kara,
	linux-fsdevel, willy, ross.zwisler, akpm

On Tue 17-11-15 12:16:14, Dan Williams wrote:
> Currently dax mappings survive block_device shutdown.  While page cache
> pages are permitted to be read/written after the block_device is torn
> down, this is not acceptable in the dax case as all media access must
> end when the device is disabled.  The pfn backing a dax mapping is
> permitted to be invalidated after bdev shutdown, and this is indeed the
> case with brd.
> 
> When a dax capable block_device driver calls del_gendisk() in its
> shutdown path, or a filesystem evicts an inode it needs to ensure that
> all the pfns that had been mapped via bdev_direct_access() are unmapped.
> This is different than the pagecache backed case where
> truncate_inode_pages() is sufficient to end I/O to pages mapped to a
> dying inode.
> 
> Since dax bypasses the page cache we need to unmap in addition to
> truncating pages.  Also, since dax mappings are not accounted in the
> mapping radix tree, we unconditionally truncate all inodes with the
> S_DAX flag.  Later, when we add support for dynamic dax enable/disable
> control, we will likely have infrastructure to detect whether the inode
> is unmapped and can skip the truncate.
> 
> Cc: <stable@vger.kernel.org>
> Cc: Jan Kara <jack@suse.com>
> Cc: Dave Chinner <david@fromorbit.com>
> Cc: Matthew Wilcox <willy@linux.intel.com>
> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  fs/inode.c    |   27 +++++++++++++++++++++++++++
>  mm/truncate.c |   13 +++++++++++--
>  2 files changed, 38 insertions(+), 2 deletions(-)
...
> @@ -433,7 +434,15 @@ void truncate_inode_pages_final(struct address_space *mapping)
>  		spin_lock_irq(&mapping->tree_lock);
>  		spin_unlock_irq(&mapping->tree_lock);
>  
> -		truncate_inode_pages(mapping, 0);
> +		/*
> +		 * In the case of DAX we also need to unmap the inode
> +		 * since the pfn backing the mapping may be invalidated
> +		 * after this returns
> +		 */
> +		if (IS_DAX(inode))
> +			truncate_pagecache(inode, 0);
> +		else
> +			truncate_inode_pages(mapping, 0);
>  	}

Hum, I don't get this. truncate_inode_pages_final() gets called when the
inode has no more users, so there are no mappings of the inode. So how
could truncate_pagecache() possibly make a difference?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

* Re: [PATCH 4/8] mm, dax: truncate dax mappings at bdev or fs shutdown
  2015-11-18 15:09   ` Jan Kara
@ 2015-11-19  0:22     ` Williams, Dan J
  2015-11-19 12:55       ` Jan Kara
  0 siblings, 1 reply; 20+ messages in thread
From: Williams, Dan J @ 2015-11-19  0:22 UTC (permalink / raw)
  To: jack
  Cc: linux-block, stable, akpm, linux-nvdimm, willy, linux-fsdevel,
	ross.zwisler, jack, david

On Wed, 2015-11-18 at 16:09 +0100, Jan Kara wrote:
> Hum, I don't get this. truncate_inode_pages_final() gets called when the
> inode has no more users, so there are no mappings of the inode. So how
> could truncate_pagecache() possibly make a difference?

True.  I confirmed with more focused testing that the change to
truncate_inode_pages_final() is not necessary.  After
invalidate_inodes() does unmap_mapping_range(), we are protected by
future calls to get_block() and blk_queue_enter() failing when there
are attempts to re-establish a mapping after the block device has been
torn down.

Here's a revised patch.  Note that the call to truncate_pagecache() is
replaced with a call to unmap_mapping_range(), since it is fine to
access zero pages that might still be in the page cache.

8<----
Subject: mm, dax: unmap dax mappings at bdev or fs shutdown

From: Dan Williams <dan.j.williams@intel.com>

Currently dax mappings leak past / survive block_device shutdown.  While
page cache pages are permitted to be read/written after the block_device
is torn down, this is not acceptable in the dax case as all media access
must end when the device is disabled.  The pfn backing a dax mapping is
permitted to be invalidated after bdev shutdown, and this is indeed the
case with brd.

When a dax capable block_device driver calls del_gendisk() in its
shutdown path, del_gendisk() needs to ensure that all DAX pfns are
unmapped.  This is different than the pagecache backed case where the
disk is protected by the queue being torn down, which ends I/O to the
device.  Since dax bypasses the page cache we need to unconditionally
unmap the inode.

Cc: <stable@vger.kernel.org>
Cc: Jan Kara <jack@suse.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
[honza: drop changes to truncate_inode_pages_final]
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/inode.c |   27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)

diff --git a/fs/inode.c b/fs/inode.c
index 1be5f9003eb3..dcb31d2c15e6 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -579,6 +579,18 @@ static void dispose_list(struct list_head *head)
 	}
 }
 
+static void unmap_list(struct list_head *head)
+{
+	struct inode *inode, *_i;
+
+	list_for_each_entry_safe(inode, _i, head, i_lru) {
+		list_del_init(&inode->i_lru);
+		unmap_mapping_range(&inode->i_data, 0, 0, 1);
+		iput(inode);
+		cond_resched();
+	}
+}
+
 /**
  * evict_inodes	- evict all evictable inodes for a superblock
  * @sb:		superblock to operate on
@@ -642,6 +654,7 @@ int invalidate_inodes(struct super_block *sb, bool kill_dirty)
 	int busy = 0;
 	struct inode *inode, *next;
 	LIST_HEAD(dispose);
+	LIST_HEAD(unmap);
 
 	spin_lock(&sb->s_inode_list_lock);
 	list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
@@ -655,6 +668,19 @@ int invalidate_inodes(struct super_block *sb, bool kill_dirty)
 			busy = 1;
 			continue;
 		}
+		if (IS_DAX(inode) && atomic_read(&inode->i_count)) {
+			/*
+			 * dax mappings can't live past this invalidation event
+			 * as there is no page cache present to allow the data
+			 * to remain accessible.
+			 */
+			__iget(inode);
+			inode_lru_list_del(inode);
+			spin_unlock(&inode->i_lock);
+			list_add(&inode->i_lru, &unmap);
+			busy = 1;
+			continue;
+		}
 		if (atomic_read(&inode->i_count)) {
 			spin_unlock(&inode->i_lock);
 			busy = 1;
@@ -669,6 +695,7 @@ int invalidate_inodes(struct super_block *sb, bool kill_dirty)
 	spin_unlock(&sb->s_inode_list_lock);
 
 	dispose_list(&dispose);
+	unmap_list(&unmap);
 
 	return busy;
 }

* Re: [PATCH 4/8] mm, dax: truncate dax mappings at bdev or fs shutdown
  2015-11-19  0:22     ` Williams, Dan J
@ 2015-11-19 12:55       ` Jan Kara
  2015-11-19 16:55         ` Dan Williams
  0 siblings, 1 reply; 20+ messages in thread
From: Jan Kara @ 2015-11-19 12:55 UTC (permalink / raw)
  To: Williams, Dan J
  Cc: jack, linux-block, stable, akpm, linux-nvdimm, willy,
	linux-fsdevel, ross.zwisler, jack, david

> Subject: mm, dax: unmap dax mappings at bdev or fs shutdown
> From: Dan Williams <dan.j.williams@intel.com>
> 
> Currently dax mappings leak past / survive block_device shutdown.  While
> page cache pages are permitted to be read/written after the block_device
> is torn down, this is not acceptable in the dax case as all media access
> must end when the device is disabled.  The pfn backing a dax mapping is
> permitted to be invalidated after bdev shutdown, and this is indeed the
> case with brd.
> 
> When a dax capable block_device driver calls del_gendisk() in its
> shutdown path, del_gendisk() needs to ensure that all DAX pfns are
> unmapped.  This is different than the pagecache backed case where the
> disk is protected by the queue being torn down, which ends I/O to the
> device.  Since dax bypasses the page cache we need to unconditionally
> unmap the inode.
...
> +static void unmap_list(struct list_head *head)
> +{
> +	struct inode *inode, *_i;
> +
> +	list_for_each_entry_safe(inode, _i, head, i_lru) {
> +		list_del_init(&inode->i_lru);
> +		unmap_mapping_range(&inode->i_data, 0, 0, 1);
> +		iput(inode);
> +		cond_resched();
> +	}
> +}
> +
>  /**
>   * evict_inodes	- evict all evictable inodes for a superblock
>   * @sb:		superblock to operate on
> @@ -642,6 +654,7 @@ int invalidate_inodes(struct super_block *sb, bool kill_dirty)
>  	int busy = 0;
>  	struct inode *inode, *next;
>  	LIST_HEAD(dispose);
> +	LIST_HEAD(unmap);
>  
>  	spin_lock(&sb->s_inode_list_lock);
>  	list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
> @@ -655,6 +668,19 @@ int invalidate_inodes(struct super_block *sb, bool kill_dirty)
>  			busy = 1;
>  			continue;
>  		}
> +		if (IS_DAX(inode) && atomic_read(&inode->i_count)) {
> +			/*
> +			 * dax mappings can't live past this invalidation event
> +			 * as there is no page cache present to allow the data
> +			 * to remain accessible.
> +			 */
> +			__iget(inode);
> +			inode_lru_list_del(inode);
> +			spin_unlock(&inode->i_lock);
> +			list_add(&inode->i_lru, &unmap);
> +			busy = 1;
> +			continue;
> +		}

I'm somewhat concerned about the abuse of the i_lru list here. The inodes
have i_count > 0 and so the code generally assumes such inodes should be
removed from the LRU list if they are there. Now I didn't find a place
where this could really hit you, but it is fragile. And when that happens,
you have a corruption of your unmap list (since you access it without any
locks) and you will also fail to unmap your inode.

Also, are you sure that your unmapping cannot race with another process
mapping the pfns again?

BTW what happens when you access a DAX pfn that got removed?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

* Re: [PATCH 4/8] mm, dax: truncate dax mappings at bdev or fs shutdown
  2015-11-19 12:55       ` Jan Kara
@ 2015-11-19 16:55         ` Dan Williams
  2015-11-19 17:12           ` Jan Kara
  2015-11-19 23:17           ` Dave Chinner
  0 siblings, 2 replies; 20+ messages in thread
From: Dan Williams @ 2015-11-19 16:55 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-block, stable, akpm, linux-nvdimm, willy, linux-fsdevel,
	ross.zwisler, jack, david

On Thu, Nov 19, 2015 at 4:55 AM, Jan Kara <jack@suse.cz> wrote:
>> Subject: mm, dax: unmap dax mappings at bdev or fs shutdown
>> From: Dan Williams <dan.j.williams@intel.com>
>>
>> Currently dax mappings leak past / survive block_device shutdown.  While
>> page cache pages are permitted to be read/written after the block_device
>> is torn down, this is not acceptable in the dax case as all media access
>> must end when the device is disabled.  The pfn backing a dax mapping is
>> permitted to be invalidated after bdev shutdown, and this is indeed the
>> case with brd.
>>
>> When a dax capable block_device driver calls del_gendisk() in its
>> shutdown path, del_gendisk() needs to ensure that all DAX pfns are
>> unmapped.  This is different than the pagecache backed case where the
>> disk is protected by the queue being torn down, which ends I/O to the
>> device.  Since dax bypasses the page cache we need to unconditionally
>> unmap the inode.
> ...
>> +static void unmap_list(struct list_head *head)
>> +{
>> +     struct inode *inode, *_i;
>> +
>> +     list_for_each_entry_safe(inode, _i, head, i_lru) {
>> +             list_del_init(&inode->i_lru);
>> +             unmap_mapping_range(&inode->i_data, 0, 0, 1);
>> +             iput(inode);
>> +             cond_resched();
>> +     }
>> +}
>> +
>>  /**
>>   * evict_inodes      - evict all evictable inodes for a superblock
>>   * @sb:              superblock to operate on
>> @@ -642,6 +654,7 @@ int invalidate_inodes(struct super_block *sb, bool kill_dirty)
>>       int busy = 0;
>>       struct inode *inode, *next;
>>       LIST_HEAD(dispose);
>> +     LIST_HEAD(unmap);
>>
>>       spin_lock(&sb->s_inode_list_lock);
>>       list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
>> @@ -655,6 +668,19 @@ int invalidate_inodes(struct super_block *sb, bool kill_dirty)
>>                       busy = 1;
>>                       continue;
>>               }
>> +             if (IS_DAX(inode) && atomic_read(&inode->i_count)) {
>> +                     /*
>> +                      * dax mappings can't live past this invalidation event
>> +                      * as there is no page cache present to allow the data
>> +                      * to remain accessible.
>> +                      */
>> +                     __iget(inode);
>> +                     inode_lru_list_del(inode);
>> +                     spin_unlock(&inode->i_lock);
>> +                     list_add(&inode->i_lru, &unmap);
>> +                     busy = 1;
>> +                     continue;
>> +             }
>
> I'm somewhat concerned about the abuse of the i_lru list here. The inodes
> have i_count > 0 and so the code generally assumes such inodes should be
> removed from the LRU list if they are there. Now I didn't find a place
> where this could really hit you, but it is fragile. And when that happens,
> you have a corruption of your unmap list (since you access it without any
> locks) and you will also fail to unmap your inode.

"i_lru" list abuse was my main concern with this patch, I'll look into
a different way.

> Also are you sure that your unmapping cannot race with other process
> mapping the pfns again?

You're right, there is indeed a gap between the unmap and when
get_blocks() will start returning errors in the fault path.  I think I
need to defer the unmapping until after blk_cleanup_queue() where we
know that no new I/O to the device is possible.
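
A sketch of that ordering (illustrative only; unmap_partition() is
the helper a later revision in this thread introduces, and error
handling is omitted):

	/*
	 * Close the remap race: once blk_cleanup_queue() has run, no
	 * new I/O or bdev_direct_access() calls can start, so a fault
	 * racing with the unmap fails instead of re-establishing a
	 * mapping.
	 */
	invalidate_partition(disk, 0);	/* sync fs, evict idle inodes */
	blk_cleanup_queue(disk->queue);	/* queue is now dead */
	unmap_partition(disk, 0);	/* drop remaining DAX mappings */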

> BTW what happens when you access a DAX pfn that got removed?

SIGBUS.  I don't see a way to be any kinder to the application.  After
the ->remove() method for the block_device is complete we can't be
sure that the pfn is valid or even present in the system (think brd,
or VM hot provisioning).
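
Roughly, with the dax_map_atomic() / dax_unmap_atomic() pairing from
[PATCH 8/8] the fault path fails like this (a sketch, not the exact
code from that patch; sector and bdev come from the fs fault
context):

	struct blk_dax_ctl dax = { .sector = sector, .size = PAGE_SIZE };

	/* refuses to map once the queue is dead */
	if (dax_map_atomic(bdev, &dax) < 0)
		return VM_FAULT_SIGBUS;	/* pfn may no longer exist */
	/* ... insert dax.pfn into the page tables ... */
	dax_unmap_atomic(bdev, &dax);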


* Re: [PATCH 4/8] mm, dax: truncate dax mappings at bdev or fs shutdown
  2015-11-19 16:55         ` Dan Williams
@ 2015-11-19 17:12           ` Jan Kara
  2015-11-19 23:17           ` Dave Chinner
  1 sibling, 0 replies; 20+ messages in thread
From: Jan Kara @ 2015-11-19 17:12 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jan Kara, linux-block, stable, akpm, linux-nvdimm, willy,
	linux-fsdevel, ross.zwisler, jack, david

On Thu 19-11-15 08:55:48, Dan Williams wrote:
> On Thu, Nov 19, 2015 at 4:55 AM, Jan Kara <jack@suse.cz> wrote:
> > Also are you sure that your unmapping cannot race with other process
> > mapping the pfns again?
> 
> You're right, there is indeed a gap between the unmap and when
> get_blocks() will start returning errors in the fault path.  I think I
> need to defer the unmapping until after blk_cleanup_queue() where we
> know that no new I/O to the device is possible.

Yeah, you need to squeeze it somewhere after the point where get_blocks()
starts returning errors and before the point where the pfn can go away.

> > BTW what happens when you access a DAX pfn that got removed?
> 
> SIGBUS.  I don't see a way to be any kinder to the application.  After
> the ->remove() method for the block_device is complete we can't be
> sure that the pfn is valid or even present in the system (think brd,
> or VM hot provisioning).

I see. So if we kept the PFN mapped, it could result e.g. in memory
corruption (at least in the case of brd). So we really need this to be 100%
reliable. That's what I was interested in.

									Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH 4/8] mm, dax: truncate dax mappings at bdev or fs shutdown
  2015-11-19 16:55         ` Dan Williams
  2015-11-19 17:12           ` Jan Kara
@ 2015-11-19 23:17           ` Dave Chinner
  2015-11-20  0:05             ` Williams, Dan J
  1 sibling, 1 reply; 20+ messages in thread
From: Dave Chinner @ 2015-11-19 23:17 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jan Kara, linux-block, stable, akpm, linux-nvdimm, willy,
	linux-fsdevel, ross.zwisler, jack

On Thu, Nov 19, 2015 at 08:55:48AM -0800, Dan Williams wrote:
> On Thu, Nov 19, 2015 at 4:55 AM, Jan Kara <jack@suse.cz> wrote:
> >> Subject: mm, dax: unmap dax mappings at bdev or fs shutdown
> >> From: Dan Williams <dan.j.williams@intel.com>
> >>
> >> Currently dax mappings leak past / survive block_device shutdown.  While
> >> page cache pages are permitted to be read/written after the block_device
> >> is torn down this is not acceptable in the dax case as all media access
> >> must end when the device is disabled.  The pfn backing a dax mapping is
> >> permitted to be invalidated after bdev shutdown and this is indeed the
> >> case with brd.
> >>
> >> When a dax capable block_device driver calls del_gendisk() in its
> >> shutdown path, del_gendisk() needs to ensure that all DAX pfns are
> >> unmapped.  This is different than the pagecache backed case where the
> >> disk is protected by the queue being torn down which ends I/O to the
> >> device.  Since dax bypasses the page cache we need to unconditionally
> >> unmap the inode.
> > ...
> >> +static void unmap_list(struct list_head *head)
> >> +{
> >> +     struct inode *inode, *_i;
> >> +
> >> +     list_for_each_entry_safe(inode, _i, head, i_lru) {
> >> +             list_del_init(&inode->i_lru);
> >> +             unmap_mapping_range(&inode->i_data, 0, 0, 1);
> >> +             iput(inode);
> >> +             cond_resched();
> >> +     }
> >> +}
> >> +
> >>  /**
> >>   * evict_inodes      - evict all evictable inodes for a superblock
> >>   * @sb:              superblock to operate on
> >> @@ -642,6 +654,7 @@ int invalidate_inodes(struct super_block *sb, bool kill_dirty)
> >>       int busy = 0;
> >>       struct inode *inode, *next;
> >>       LIST_HEAD(dispose);
> >> +     LIST_HEAD(unmap);
> >>
> >>       spin_lock(&sb->s_inode_list_lock);
> >>       list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
> >> @@ -655,6 +668,19 @@ int invalidate_inodes(struct super_block *sb, bool kill_dirty)
> >>                       busy = 1;
> >>                       continue;
> >>               }
> >> +             if (IS_DAX(inode) && atomic_read(&inode->i_count)) {
> >> +                     /*
> >> +                      * dax mappings can't live past this invalidation event
> >> +                      * as there is no page cache present to allow the data
> >> +                      * to remain accessible.
> >> +                      */
> >> +                     __iget(inode);
> >> +                     inode_lru_list_del(inode);
> >> +                     spin_unlock(&inode->i_lock);
> >> +                     list_add(&inode->i_lru, &unmap);
> >> +                     busy = 1;
> >> +                     continue;
> >> +             }
> >
> > I'm somewhat concerned about the abuse of the i_lru list here. The inodes
> > have i_count > 0, and the code generally assumes such inodes should be
> > removed from the LRU list if they are there. I didn't find a place where
> > this could really hit you, but it is fragile. And when it does happen, you
> > get corruption of your unmap list (since you access it without any locks)
> > and the inode will also not get unmapped.
> 
> "i_lru" list abuse was my main concern with this patch, I'll look into
> a different way.

Yeah, you can't use i_lru at all for purposes like this - even if
there are active references to the inode, the shrinker could be
walking the LRU list and accessing this inode (e.g.
inode_lru_isolate()) at the same time this code is removing it from
the LRU. Then things will go real bad....

> > Also are you sure that your unmapping cannot race with other process
> > mapping the pfns again?
> 
> You're right, there is indeed a gap between the unmap and when
> get_blocks() will start returning errors in the fault path.

get_blocks() won't start returning errors until the filesystem has
had an IO error. Given they cache metadata and do async
transactions, it could be some time before the filesystem notices
that the device has gone away in a fatal way.

> I think I
> need to defer the unmapping until after blk_cleanup_queue() where we
> know that no new I/O to the device is possible.

Actually, I think we need to trigger a filesystem shutdown before
doing anything else (e.g. before unmapping the inodes). That way the
filesystem will error out new calls to get_blocks() and prevent any
new IO submission while the block device teardown and inode
unmapping is done. i.e. solving the problem at the block device
level is hard, but we already have all the necessary infrastructure
to shut off IO and new block mappings at the filesystem level....
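
(For illustration: XFS already exposes a forced-shutdown path to
userspace via the XFS_IOC_GOINGDOWN ioctl, and an in-kernel trigger
would end up in the same xfs_force_shutdown() machinery. A minimal
userspace sketch, assuming the xfsprogs headers are installed:)

	#include <fcntl.h>
	#include <stdint.h>
	#include <unistd.h>
	#include <sys/ioctl.h>
	#include <xfs/xfs_fs.h>	/* XFS_IOC_GOINGDOWN */

	/* force-shutdown the XFS filesystem mounted at mntpt */
	static int xfs_goingdown(const char *mntpt)
	{
		uint32_t flags = XFS_FSOP_GOING_FLAGS_LOGFLUSH;
		int ret, fd = open(mntpt, O_RDONLY);

		if (fd < 0)
			return -1;
		ret = ioctl(fd, XFS_IOC_GOINGDOWN, &flags);
		close(fd);
		return ret;
	}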

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 4/8] mm, dax: truncate dax mappings at bdev or fs shutdown
  2015-11-19 23:17           ` Dave Chinner
@ 2015-11-20  0:05             ` Williams, Dan J
  2015-11-20  4:06               ` Dave Chinner
  0 siblings, 1 reply; 20+ messages in thread
From: Williams, Dan J @ 2015-11-20  0:05 UTC (permalink / raw)
  To: david
  Cc: linux-block, stable, akpm, linux-nvdimm, willy, linux-fsdevel,
	ross.zwisler, jack, jack

On Fri, 2015-11-20 at 10:17 +1100, Dave Chinner wrote:
> Actually, I think we need to trigger a filesystem shutdown before
> doing anything else (e.g. before unmapping the inodes). That way the
> filesystem will error out new calls to get_blocks() and prevent any
> new IO submission while the block device teardown and inode
> unmapping is done. i.e. solving the problem at the block device
> level is hard, but we already have all the necessary infrastructure
> to shut off IO and new block mappings at the filesystem level....
> 

Shutting down the filesystem on block_device remove seems a more
invasive behavior change than what we've historically done.  I.e. a
best effort "let the fs struggle on to service whatever it can that is
not dependent on new disk I/O".

Solving it at the block layer isn't that hard now that we have
blk_queue_enter() to cheaply test if the block_device is dying or dead.

Below is an updated attempt that borrows a common inode-walking
pattern and simply reorders the operations done by del_gendisk()
and blk_cleanup_queue().

Note that this depends on dax_map_atomic() from "[PATCH 8/8] dax: fix
lifetime of in-kernel dax mappings with dax_map_atomic()"
(https://lists.01.org/pipermail/linux-nvdimm/2015-November/002882.html),
where dax_map_atomic() checks that the queue is still live before
allowing new calls to bdev_direct_access().
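
Simplified, the liveness gate in that patch looks like this (a
sketch; the posted code differs in details such as the error value
stashed in dax->addr):

	static long dax_map_atomic(struct block_device *bdev,
			struct blk_dax_ctl *dax)
	{
		struct request_queue *q = bdev->bd_queue;
		long rc;

		/* fails once blk_cleanup_queue() has marked q dying */
		if (blk_queue_enter(q, GFP_NOWAIT) != 0)
			return -ENXIO;

		rc = bdev_direct_access(bdev, dax);
		if (rc < 0)
			blk_queue_exit(q);	/* drop queue ref on error */
		return rc;
	}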

8<-----
Subject: mm, dax: unmap dax mappings at bdev shutdown

From: Dan Williams <dan.j.williams@intel.com>

Currently dax mappings leak past / survive block_device shutdown.  While
page cache pages are permitted to be read/written after the block_device
is torn down this is not acceptable in the dax case as all media access
must end when the device is disabled.  The pfn backing a dax mapping is
permitted to be invalidated after bdev shutdown and this is indeed the
case with brd.

When a dax capable block_device driver calls del_gendisk_queue() in its
shutdown path, it needs to ensure that all DAX pfns are unmapped, and
that no new mappings can be established.  This is different than the
pagecache backed case where the disk is protected by the queue being
torn down which ends I/O to the device.  Since dax bypasses the page
cache we need to unconditionally unmap the inode.

Cc: Jan Kara <jack@suse.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
[honza: drop changes to truncate_inode_pages_final]
[honza: ensure mappings can't be re-established]
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 block/genhd.c                |   88 +++++++++++++++++++++++++++++++++++-------
 drivers/block/brd.c          |    3 -
 drivers/nvdimm/pmem.c        |    3 -
 drivers/s390/block/dcssblk.c |    6 +--
 fs/block_dev.c               |   41 ++++++++++++++++++++
 include/linux/fs.h           |    1 
 include/linux/genhd.h        |    1 
 7 files changed, 121 insertions(+), 22 deletions(-)

diff --git a/block/genhd.c b/block/genhd.c
index e5cafa51567c..37ab780c0937 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -634,24 +634,14 @@ void add_disk(struct gendisk *disk)
 }
 EXPORT_SYMBOL(add_disk);
 
-void del_gendisk(struct gendisk *disk)
+static void del_gendisk_start(struct gendisk *disk)
 {
-	struct disk_part_iter piter;
-	struct hd_struct *part;
-
 	blk_integrity_del(disk);
 	disk_del_events(disk);
+}
 
-	/* invalidate stuff */
-	disk_part_iter_init(&piter, disk,
-			     DISK_PITER_INCL_EMPTY | DISK_PITER_REVERSE);
-	while ((part = disk_part_iter_next(&piter))) {
-		invalidate_partition(disk, part->partno);
-		delete_partition(disk, part->partno);
-	}
-	disk_part_iter_exit(&piter);
-
-	invalidate_partition(disk, 0);
+static void del_gendisk_end(struct gendisk *disk)
+{
 	set_capacity(disk, 0);
 	disk->flags &= ~GENHD_FL_UP;
 
@@ -670,9 +660,79 @@ void del_gendisk(struct gendisk *disk)
 	pm_runtime_set_memalloc_noio(disk_to_dev(disk), false);
 	device_del(disk_to_dev(disk));
 }
+
+#define for_each_part(part, piter) \
+	for (part = disk_part_iter_next(piter); part; \
+			part = disk_part_iter_next(piter))
+void del_gendisk(struct gendisk *disk)
+{
+	struct disk_part_iter piter;
+	struct hd_struct *part;
+
+	del_gendisk_start(disk);
+
+	/* invalidate stuff */
+	disk_part_iter_init(&piter, disk,
+			     DISK_PITER_INCL_EMPTY | DISK_PITER_REVERSE);
+	for_each_part(part, &piter) {
+		invalidate_partition(disk, part->partno);
+		delete_partition(disk, part->partno);
+	}
+	disk_part_iter_exit(&piter);
+	invalidate_partition(disk, 0);
+
+	del_gendisk_end(disk);
+}
 EXPORT_SYMBOL(del_gendisk);
 
 /**
+ * del_gendisk_queue - combined del_gendisk + blk_cleanup_queue
+ * @disk: disk to delete, invalidate, unmap, and end i/o
+ *
+ * This alternative to open-coded calls to:
+ *     del_gendisk()
+ *     blk_cleanup_queue()
+ * ...is for ->direct_access() (DAX) capable devices.  DAX bypasses page
+ * cache and mappings go directly to storage media.  When such a disk is
+ * removed, the pfn backing a mapping may be invalid or removed from the
+ * system.  Upon return, accessing DAX mappings of this disk will trigger
+ * SIGBUS.
+ */
+void del_gendisk_queue(struct gendisk *disk)
+{
+	struct disk_part_iter piter;
+	struct hd_struct *part;
+
+	del_gendisk_start(disk);
+
+	/* pass1: sync fs + evict idle inodes */
+	disk_part_iter_init(&piter, disk,
+			     DISK_PITER_INCL_EMPTY | DISK_PITER_REVERSE);
+	for_each_part(part, &piter)
+		invalidate_partition(disk, part->partno);
+	disk_part_iter_exit(&piter);
+	invalidate_partition(disk, 0);
+
+	blk_cleanup_queue(disk->queue);
+
+	/*
+	 * pass2: now that the queue is dead, unmap DAX inodes and
+	 * delete partitions
+	 */
+	disk_part_iter_init(&piter, disk,
+			     DISK_PITER_INCL_EMPTY | DISK_PITER_REVERSE);
+	for_each_part(part, &piter) {
+		unmap_partition(disk, part->partno);
+		delete_partition(disk, part->partno);
+	}
+	disk_part_iter_exit(&piter);
+	unmap_partition(disk, 0);
+
+	del_gendisk_end(disk);
+}
+EXPORT_SYMBOL(del_gendisk_queue);
+
+/**
  * get_gendisk - get partitioning information for a given device
  * @devt: device to get partitioning information for
  * @partno: returned partition index
diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index a5880f4ab40e..f149dd57bba3 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -532,7 +532,6 @@ out:
 static void brd_free(struct brd_device *brd)
 {
 	put_disk(brd->brd_disk);
-	blk_cleanup_queue(brd->brd_queue);
 	brd_free_pages(brd);
 	kfree(brd);
 }
@@ -560,7 +559,7 @@ out:
 static void brd_del_one(struct brd_device *brd)
 {
 	list_del(&brd->brd_list);
-	del_gendisk(brd->brd_disk);
+	del_gendisk_queue(brd->brd_disk);
 	brd_free(brd);
 }
 
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 8ee79893d2f5..6dd06e9d34b0 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -158,9 +158,8 @@ static void pmem_detach_disk(struct pmem_device *pmem)
 	if (!pmem->pmem_disk)
 		return;
 
-	del_gendisk(pmem->pmem_disk);
+	del_gendisk_queue(pmem->pmem_disk);
 	put_disk(pmem->pmem_disk);
-	blk_cleanup_queue(pmem->pmem_queue);
 }
 
 static int pmem_attach_disk(struct device *dev,
diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
index 94a8f4ab57bc..0c3c968b57d9 100644
--- a/drivers/s390/block/dcssblk.c
+++ b/drivers/s390/block/dcssblk.c
@@ -388,8 +388,7 @@ removeseg:
 	}
 	list_del(&dev_info->lh);
 
-	del_gendisk(dev_info->gd);
-	blk_cleanup_queue(dev_info->dcssblk_queue);
+	del_gendisk_queue(dev_info->gd);
 	dev_info->gd->queue = NULL;
 	put_disk(dev_info->gd);
 	up_write(&dcssblk_devices_sem);
@@ -751,8 +750,7 @@ dcssblk_remove_store(struct device *dev, struct device_attribute *attr, const ch
 	}
 
 	list_del(&dev_info->lh);
-	del_gendisk(dev_info->gd);
-	blk_cleanup_queue(dev_info->dcssblk_queue);
+	del_gendisk_queue(dev_info->gd);
 	dev_info->gd->queue = NULL;
 	put_disk(dev_info->gd);
 	device_unregister(&dev_info->dev);
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 4c14d4467bbf..975d32b5623b 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1795,6 +1795,47 @@ int __invalidate_device(struct block_device *bdev, bool kill_dirty)
 }
 EXPORT_SYMBOL(__invalidate_device);
 
+void unmap_partition(struct gendisk *disk, int partno)
+{
+	bool dax = !!disk->fops->direct_access;
+	struct block_device *bdev = dax ? bdget_disk(disk, partno) : NULL;
+	struct super_block *sb = get_super(bdev);
+	struct inode *inode, *_inode = NULL;
+
+	if (!bdev)
+		return;
+
+	if (!sb) {
+		bdput(bdev);
+		return;
+	}
+
+	spin_lock(&sb->s_inode_list_lock);
+	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+		spin_lock(&inode->i_lock);
+		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW))
+				|| !IS_DAX(inode)) {
+			spin_unlock(&inode->i_lock);
+			continue;
+		}
+		__iget(inode);
+		spin_unlock(&inode->i_lock);
+		spin_unlock(&sb->s_inode_list_lock);
+
+		unmap_mapping_range(inode->i_mapping, 0, 0, 1);
+		iput(_inode);
+		_inode = inode;
+		cond_resched();
+
+		spin_lock(&sb->s_inode_list_lock);
+	}
+	spin_unlock(&sb->s_inode_list_lock);
+	iput(_inode);
+
+	drop_super(sb);
+	bdput(bdev);
+}
+
 void iterate_bdevs(void (*func)(struct block_device *, void *), void *arg)
 {
 	struct inode *inode, *old_inode = NULL;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 3aa514254161..d2dda249abf8 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2390,6 +2390,7 @@ extern int revalidate_disk(struct gendisk *);
 extern int check_disk_change(struct block_device *);
 extern int __invalidate_device(struct block_device *, bool);
 extern int invalidate_partition(struct gendisk *, int);
+extern void unmap_partition(struct gendisk *, int);
 #endif
 unsigned long invalidate_mapping_pages(struct address_space *mapping,
 					pgoff_t start, pgoff_t end);
diff --git a/include/linux/genhd.h b/include/linux/genhd.h
index 847cc1d91634..028cf15a8a57 100644
--- a/include/linux/genhd.h
+++ b/include/linux/genhd.h
@@ -431,6 +431,7 @@ extern void part_round_stats(int cpu, struct hd_struct *part);
 /* block/genhd.c */
 extern void add_disk(struct gendisk *disk);
 extern void del_gendisk(struct gendisk *gp);
+extern void del_gendisk_queue(struct gendisk *disk);
 extern struct gendisk *get_gendisk(dev_t dev, int *partno);
 extern struct block_device *bdget_disk(struct gendisk *disk, int partno);
 


* Re: [PATCH 4/8] mm, dax: truncate dax mappings at bdev or fs shutdown
  2015-11-20  0:05             ` Williams, Dan J
@ 2015-11-20  4:06               ` Dave Chinner
  2015-11-20  4:25                 ` Dan Williams
  0 siblings, 1 reply; 20+ messages in thread
From: Dave Chinner @ 2015-11-20  4:06 UTC (permalink / raw)
  To: Williams, Dan J
  Cc: linux-block, stable, akpm, linux-nvdimm, willy, linux-fsdevel,
	ross.zwisler, jack, jack

On Fri, Nov 20, 2015 at 12:05:11AM +0000, Williams, Dan J wrote:
> On Fri, 2015-11-20 at 10:17 +1100, Dave Chinner wrote:
> > Actually, I think we need to trigger a filesystem shutdown before
> > doing anything else (e.g. before unmapping the inodes). That way the
> > filesystem will error out new calls to get_blocks() and prevent any
> > new IO submission while the block device teardown and inode
> > unmapping is done. i.e. solving the problem at the block device
> > level is hard, but we already have all the necessary infrastructure
> > to shut off IO and new block mappings at the filesystem level....
> > 
> 
> Shutting down the filesystem on block_device remove seems a more
> invasive behavior change than what we've historically done.

I've heard that so many times I figured that would be your answer.
Yet we've got a clear situation where we have a race between file
level access and block device operations that is clearly solved by
doing an upfront filesystem shutdown on unplug, but still the answer
is "ignore the filesystem, we need to do everything in the block
layer, no matter how hard or complex it is to solve"...

> I.e. a
> best effort "let the fs struggle on to service whatever it can that is
> not dependent on new disk I/O".

And so we still have this limbo fs state that is an utter nightmare
to handle sanely. We don't know what the cause of the IO errors is,
and so we have to try to handle them as though we can recover in
some way from the error. Only when we get an error we can't possibly
recover from do we shut the filesystem down and then stop all
attempts at issuing IO, mapping page faults, etc.

However, if the device has been unplugged then we *can never
recover*, and so continuing on with our eyes closed and fingers in
our ears shouting "lalalalalalala" as loud as we can won't change
the fact that we are going to shut down the filesystem in the near
future.

-Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 4/8] mm, dax: truncate dax mappings at bdev or fs shutdown
  2015-11-20  4:06               ` Dave Chinner
@ 2015-11-20  4:25                 ` Dan Williams
  2015-11-20 17:08                   ` Dan Williams
  0 siblings, 1 reply; 20+ messages in thread
From: Dan Williams @ 2015-11-20  4:25 UTC (permalink / raw)
  To: Dave Chinner
  Cc: jack, linux-nvdimm, stable, linux-block, jack, linux-fsdevel, akpm

On Thu, Nov 19, 2015 at 8:06 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Fri, Nov 20, 2015 at 12:05:11AM +0000, Williams, Dan J wrote:
>> On Fri, 2015-11-20 at 10:17 +1100, Dave Chinner wrote:
>> > Actually, I think we need to trigger a filesystem shutdown before
>> > doing anything else (e.g. before unmapping the inodes). That way the
>> > filesystem will error out new calls to get_blocks() and prevent any
>> > new IO submission while the block device teardown and inode
>> > unmapping is done. i.e. solving the problem at the block device
>> > level is hard, but we already have all the necessary infrastructure
>> > to shut off IO and new block mappings at the filesystem level....
>> >
>>
>> Shutting down the filesystem on block_device remove seems a more
>> invasive behavior change than what we've historically done.
>
> I've heard that so many times I figured that would be your answer.
> Yet we've got a clear situation where we have a race between file
> level access and block device operations that is clearly solved by
> doing an upfront filesystem shutdown on unplug, but still the answer
> is "ignore the filesystem, we need to do everything in the block
> layer, no matter how hard or complex it is to solve"...

I was only disagreeing with your assertion that solving this
particular problem in the block layer was hard, and not asserting that
this needs to be solved in the block layer at all costs.

>> I.e. a
>> best effort "let the fs struggle on to service whatever it can that is
>> not dependent on new disk I/O".
>
> And so we still have this limbo fs state that is an utter nightmare
> to handle sanely. We don't know what the cause of the IO error are
> and so we have to try to handle them as though we can recover in
> some way from the error. Only when we get an error we can't possibly
> recover from do we shut the fileystem down and then stop all
> attempts at issuing IO, mapping page faults, etc.
>
> However, if the device has been unplugged then we *can never
> recover* and so continuing on with out eyes closed and fingers in
> our eyes shouting 'lalalalalalala" as loud as we can won't change
> the fact that we are going to shut down the filesystem in the near
> future.
>

Can't argue with that, and XFS takes a lot longer to mourn the loss of
the block device than ext4.

What would be the recommended interface to tell XFS to sync if it can,
but give up quickly if it hits an error and then shutdown permanently?


* Re: [PATCH 4/8] mm, dax: truncate dax mappings at bdev or fs shutdown
  2015-11-20  4:25                 ` Dan Williams
@ 2015-11-20 17:08                   ` Dan Williams
  0 siblings, 0 replies; 20+ messages in thread
From: Dan Williams @ 2015-11-20 17:08 UTC (permalink / raw)
  To: Dave Chinner
  Cc: jack, linux-nvdimm, stable, linux-block, jack, linux-fsdevel, akpm

On Thu, Nov 19, 2015 at 8:25 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> On Thu, Nov 19, 2015 at 8:06 PM, Dave Chinner <david@fromorbit.com> wrote:
[..]
> What would be the recommended interface to tell XFS to sync if it can,
> but give up quickly if it hits an error and then shutdown permanently?

Thinking about this a bit further, I think a "give up" notification
to a filesystem is an incremental addition to the common vfs-layer
invalidate_partition() and now unmap_partition(), because "sync if
you can, but give up quickly" is already handled by fsync_bdev() and
by generic_make_request() returning immediate errors.  It's only the
filesystem-triggered retries that we need to turn off.
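
For context, the "sync if you can" half is what invalidate_partition()
already provides; its rough shape in the current tree (quoted from
memory, not part of any patch here):

	int invalidate_partition(struct gendisk *disk, int partno)
	{
		int res = 0;
		struct block_device *bdev = bdget_disk(disk, partno);

		if (bdev) {
			fsync_bdev(bdev);	/* best-effort sync */
			res = __invalidate_device(bdev, true);
			bdput(bdev);
		}
		return res;
	}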


end of thread

Thread overview: 20+ messages
2015-11-17 20:15 [PATCH 0/8] dax fixes / cleanups: pmd vs thp, lifetime, and locking Dan Williams
2015-11-17 20:15 ` [PATCH 1/8] ext2, ext4: warn when mounting with dax enabled Dan Williams
2015-11-17 20:16 ` [PATCH 2/8] dax: disable pmd mappings Dan Williams
2015-11-17 20:51   ` Ross Zwisler
2015-11-17 20:16 ` [PATCH 3/8] mm, dax: fix DAX deadlocks (COW fault) Dan Williams
2015-11-17 20:16 ` [PATCH 4/8] mm, dax: truncate dax mappings at bdev or fs shutdown Dan Williams
2015-11-18 15:09   ` Jan Kara
2015-11-19  0:22     ` Williams, Dan J
2015-11-19 12:55       ` Jan Kara
2015-11-19 16:55         ` Dan Williams
2015-11-19 17:12           ` Jan Kara
2015-11-19 23:17           ` Dave Chinner
2015-11-20  0:05             ` Williams, Dan J
2015-11-20  4:06               ` Dave Chinner
2015-11-20  4:25                 ` Dan Williams
2015-11-20 17:08                   ` Dan Williams
2015-11-17 20:16 ` [PATCH 5/8] pmem, dax: clean up clear_pmem() Dan Williams
2015-11-17 20:16 ` [PATCH 6/8] dax: increase granularity of dax_clear_blocks() operations Dan Williams
2015-11-17 20:16 ` [PATCH 7/8] dax: guarantee page aligned results from bdev_direct_access() Dan Williams
2015-11-17 20:16 ` [PATCH 8/8] dax: fix lifetime of in-kernel dax mappings with dax_map_atomic() Dan Williams
