* [PATCH v5 00/11] dax: fix dma vs truncate/hole-punch
@ 2018-03-10  6:54 Dan Williams
  2018-03-10  6:54 ` [PATCH v5 01/11] dax: store pfns in the radix Dan Williams
                   ` (10 more replies)
  0 siblings, 11 replies; 32+ messages in thread
From: Dan Williams @ 2018-03-10  6:54 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Theodore Ts'o, Michal Hocko, jack, Andrew Morton,
	Matthew Wilcox, Peter Zijlstra, Dave Hansen, Dave Chinner,
	Jérôme Glisse, linux-kernel, linux-xfs, Ingo Molnar,
	Andreas Dilger, Alexander Viro, Jan Kara, linux-fsdevel,
	Darrick J. Wong, linux-ext4, hch

Changes since v4 [1]:
* Kill the DEFINE_FSDAX_AOPS macro and just open code new
  address_space_operations instances for each fs (Matthew, Jan, Dave,
  Christoph)

* Rename routines that had a 'dma_' prefix to use a 'dax_layout_'
  prefix, and merge the dax-layout-break logic into
  xfs_break_layouts() (Dave, Christoph)

* Rework the implementation to have the fsdax core find the pages, but
  leave the responsibility of waiting on those pages to the filesystem
  (Dave).

* Drop the nfit_test infrastructure for testing this mechanism. I plan
  to investigate better mechanisms for injecting arbitrary put_page()
  delays for dax pages relative to an extent unmap operation. The
  dm_delay target does not do what I want since it operates at the
  whole-device level. A better test interface would be a mechanism to
  delay I/O completion based on whether a bio referenced a given LBA.

Not changed since v4:
* This implementation still relies on RCU for synchronizing
  get_user_pages() and get_user_pages_fast() against
  dax_layout_busy_page(). We could perform the operation with just
  barriers if we knew at get_user_pages() time that the pages were flagged
  for truncation.  However, dax_layout_busy_page() does not have the
  information to flag that a page is actually going to be truncated, only
  that it *might* be truncated.

[1]: https://lists.01.org/pipermail/linux-nvdimm/2017-December/013704.html

----

Background:

get_user_pages() in the filesystem pins file-backed memory pages for
access by devices performing dma. However, it only pins the memory
pages, not the page-to-file offset association. If a file is truncated,
the pages are unmapped from the file and dma may continue indefinitely
into pages that are now only referenced by the device driver. This
breaks coherency of the file vs dma, but the assumption is that if
userspace wants the file-space truncated it does not matter what data
is inbound from the device; it is no longer relevant. The only
expectation is that dma can safely continue while the filesystem
reallocates the block(s).
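
For illustration, a minimal sketch of the driver pattern described
above (the buffer size, 'user_addr', and flags are assumptions for the
example; error handling elided):

	struct page *pages[16];
	long i, pinned;

	/*
	 * Pin the user pages backing [user_addr, user_addr + 16 pages)
	 * for a device write. Only the pages are pinned; nothing holds
	 * the page-to-file association while the device is active.
	 */
	pinned = get_user_pages(user_addr, 16, FOLL_WRITE, pages, NULL);

	/* ... program the device to dma into the pinned pages ... */

	/* drop the references once the transfer completes */
	for (i = 0; i < pinned; i++)
		put_page(pages[i]);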

Problem:

This expectation that dma can safely continue while the filesystem
changes the block map is broken by dax. With dax the target dma page
*is* the filesystem block. The model of leaving the page pinned for
dma, but truncating the file block out of the file, means that the
filesystem is free to reallocate a block under active dma to another
file. The expected data-incoherency situation thus turns into active
data corruption.

Solution:

Defer all filesystem operations (fallocate(), truncate()) on a
dax-mode file while any page/block in the file is under active dma.
This solution assumes that dma is transient. Cases where dma
operations are known to not be transient, like RDMA, have been
explicitly disabled via commits like 5f1d43de5416 ("IB/core: disable
memory registration of filesystem-dax vmas").

The dax_layout_busy_page() routine is called by filesystems, with a
lock held against mm faults (i_mmap_lock), to find pinned / busy dax
pages. The process of looking up a busy page invalidates all mappings
so that any subsequent get_user_pages() blocks on i_mmap_lock. The
filesystem continues to call dax_layout_busy_page() until it finally
returns no more active pages. This approach assumes that the page
pinning is transient; if that assumption is violated, the system would
likely have hung on the uncompleted I/O anyway.
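
To make the retry loop concrete, here is a hedged sketch of the
filesystem-side usage (only dax_layout_busy_page() is real; the
function name and locking comments are illustrative, and the actual
xfs implementation arrives in the later patches of this series):

	static int fs_break_dax_layouts(struct inode *inode)
	{
		struct page *page;

		/*
		 * The lookup invalidates all mappings, so new
		 * get_user_pages() attempts block on i_mmap_lock.
		 * Retry until no busy pages remain.
		 */
		while ((page = dax_layout_busy_page(inode->i_mapping))) {
			/* drop locks and wait for 'page' to go idle */
		}
		return 0;
	}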


---

Dan Williams (11):
      dax: store pfns in the radix
      xfs, dax: introduce xfs_dax_aops
      ext4, dax: introduce ext4_dax_aops
      ext2, dax: introduce ext2_dax_aops
      fs, dax: use page->mapping to warn if truncate collides with a busy page
      mm, dax: enable filesystems to trigger dev_pagemap ->page_free callbacks
      mm, dev_pagemap: introduce CONFIG_DEV_PAGEMAP_OPS
      wait_bit: introduce {wait_on,wake_up}_atomic_one
      mm, fs, dax: handle layout changes to pinned dax mappings
      xfs: prepare xfs_break_layouts() for another layout type
      xfs, dax: introduce xfs_break_dax_layouts()


 drivers/dax/super.c      |   96 +++++++++++++++--
 drivers/nvdimm/pmem.c    |    3 -
 fs/Kconfig               |    1 
 fs/dax.c                 |  259 +++++++++++++++++++++++++++++++++++++---------
 fs/ext2/ext2.h           |    1 
 fs/ext2/inode.c          |   28 ++++-
 fs/ext2/namei.c          |   18 ---
 fs/ext2/super.c          |    6 +
 fs/ext4/inode.c          |   11 ++
 fs/ext4/super.c          |    6 +
 fs/xfs/xfs_aops.c        |    7 +
 fs/xfs/xfs_aops.h        |    1 
 fs/xfs/xfs_file.c        |   94 ++++++++++++++++-
 fs/xfs/xfs_inode.h       |    9 ++
 fs/xfs/xfs_ioctl.c       |    9 +-
 fs/xfs/xfs_iops.c        |   17 ++-
 fs/xfs/xfs_pnfs.c        |    8 +
 fs/xfs/xfs_pnfs.h        |    4 -
 fs/xfs/xfs_super.c       |   20 ++--
 include/linux/dax.h      |   45 +++++++-
 include/linux/memremap.h |   28 ++---
 include/linux/mm.h       |   61 ++++++++---
 include/linux/wait_bit.h |   13 ++
 kernel/memremap.c        |   30 +++++
 kernel/sched/wait_bit.c  |   59 +++++++++-
 mm/Kconfig               |    5 +
 mm/gup.c                 |    5 +
 mm/hmm.c                 |   13 --
 mm/swap.c                |    3 -
 29 files changed, 663 insertions(+), 197 deletions(-)

* [PATCH v5 01/11] dax: store pfns in the radix
  2018-03-10  6:54 [PATCH v5 00/11] dax: fix dma vs truncate/hole-punch Dan Williams
@ 2018-03-10  6:54 ` Dan Williams
  2018-03-10  6:54 ` [PATCH v5 02/11] xfs, dax: introduce xfs_dax_aops Dan Williams
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 32+ messages in thread
From: Dan Williams @ 2018-03-10  6:54 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: jack, Matthew Wilcox, david, linux-kernel, linux-xfs, linux-fsdevel, hch

In preparation for examining the busy state of dax pages in the
truncate path, switch from storing sectors to storing pfns in the
radix tree.
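
[ Illustrative aside, not part of the change: the new helpers
  round-trip a pfn through the radix entry encoding. ]

	unsigned long pfn = 0x1234;	/* arbitrary example pfn */
	void *entry = dax_radix_locked_entry(pfn, 0);

	/* the pfn is recoverable directly from the entry */
	WARN_ON(dax_radix_pfn(entry) != pfn);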

Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/dax/super.c |   15 +++++++--
 fs/dax.c            |   83 +++++++++++++++++++--------------------------------
 2 files changed, 43 insertions(+), 55 deletions(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index ecdc292aa4e4..2b2332b605e4 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -124,10 +124,19 @@ int __bdev_dax_supported(struct super_block *sb, int blocksize)
 		return len < 0 ? len : -EIO;
 	}
 
-	if ((IS_ENABLED(CONFIG_FS_DAX_LIMITED) && pfn_t_special(pfn))
-			|| pfn_t_devmap(pfn))
+	if (IS_ENABLED(CONFIG_FS_DAX_LIMITED) && pfn_t_special(pfn)) {
+		/*
+		 * An arch that has enabled the pmem api should also
+		 * have its drivers support pfn_t_devmap()
+		 *
+		 * This is a developer warning and should not trigger in
+		 * production. dax_flush() will crash since it depends
+		 * on being able to do (page_address(pfn_to_page())).
+		 */
+		WARN_ON(IS_ENABLED(CONFIG_ARCH_HAS_PMEM_API));
+	} else if (pfn_t_devmap(pfn)) {
 		/* pass */;
-	else {
+	} else {
 		pr_debug("VFS (%s): error: dax support not enabled\n",
 				sb->s_id);
 		return -EOPNOTSUPP;
diff --git a/fs/dax.c b/fs/dax.c
index 0276df90e86c..b646a46e4d12 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -73,16 +73,15 @@ fs_initcall(init_dax_wait_table);
 #define RADIX_DAX_ZERO_PAGE	(1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 2))
 #define RADIX_DAX_EMPTY		(1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 3))
 
-static unsigned long dax_radix_sector(void *entry)
+static unsigned long dax_radix_pfn(void *entry)
 {
 	return (unsigned long)entry >> RADIX_DAX_SHIFT;
 }
 
-static void *dax_radix_locked_entry(sector_t sector, unsigned long flags)
+static void *dax_radix_locked_entry(unsigned long pfn, unsigned long flags)
 {
 	return (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY | flags |
-			((unsigned long)sector << RADIX_DAX_SHIFT) |
-			RADIX_DAX_ENTRY_LOCK);
+			(pfn << RADIX_DAX_SHIFT) | RADIX_DAX_ENTRY_LOCK);
 }
 
 static unsigned int dax_radix_order(void *entry)
@@ -526,12 +525,13 @@ static int copy_user_dax(struct block_device *bdev, struct dax_device *dax_dev,
  */
 static void *dax_insert_mapping_entry(struct address_space *mapping,
 				      struct vm_fault *vmf,
-				      void *entry, sector_t sector,
+				      void *entry, pfn_t pfn_t,
 				      unsigned long flags, bool dirty)
 {
 	struct radix_tree_root *page_tree = &mapping->page_tree;
-	void *new_entry;
+	unsigned long pfn = pfn_t_to_pfn(pfn_t);
 	pgoff_t index = vmf->pgoff;
+	void *new_entry;
 
 	if (dirty)
 		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
@@ -546,7 +546,7 @@ static void *dax_insert_mapping_entry(struct address_space *mapping,
 	}
 
 	spin_lock_irq(&mapping->tree_lock);
-	new_entry = dax_radix_locked_entry(sector, flags);
+	new_entry = dax_radix_locked_entry(pfn, flags);
 
 	if (dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
 		/*
@@ -657,17 +657,14 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping,
 	i_mmap_unlock_read(mapping);
 }
 
-static int dax_writeback_one(struct block_device *bdev,
-		struct dax_device *dax_dev, struct address_space *mapping,
-		pgoff_t index, void *entry)
+static int dax_writeback_one(struct dax_device *dax_dev,
+		struct address_space *mapping, pgoff_t index, void *entry)
 {
 	struct radix_tree_root *page_tree = &mapping->page_tree;
-	void *entry2, **slot, *kaddr;
-	long ret = 0, id;
-	sector_t sector;
-	pgoff_t pgoff;
+	void *entry2, **slot;
+	unsigned long pfn;
+	long ret = 0;
 	size_t size;
-	pfn_t pfn;
 
 	/*
 	 * A page got tagged dirty in DAX mapping? Something is seriously
@@ -683,10 +680,10 @@ static int dax_writeback_one(struct block_device *bdev,
 		goto put_unlocked;
 	/*
 	 * Entry got reallocated elsewhere? No need to writeback. We have to
-	 * compare sectors as we must not bail out due to difference in lockbit
+	 * compare pfns as we must not bail out due to difference in lockbit
 	 * or entry type.
 	 */
-	if (dax_radix_sector(entry2) != dax_radix_sector(entry))
+	if (dax_radix_pfn(entry2) != dax_radix_pfn(entry))
 		goto put_unlocked;
 	if (WARN_ON_ONCE(dax_is_empty_entry(entry) ||
 				dax_is_zero_entry(entry))) {
@@ -712,33 +709,15 @@ static int dax_writeback_one(struct block_device *bdev,
 	/*
 	 * Even if dax_writeback_mapping_range() was given a wbc->range_start
 	 * in the middle of a PMD, the 'index' we are given will be aligned to
-	 * the start index of the PMD, as will the sector we pull from
-	 * 'entry'.  This allows us to flush for PMD_SIZE and not have to
-	 * worry about partial PMD writebacks.
+	 * the start index of the PMD, as will the pfn we pull from 'entry'.
+	 * This allows us to flush for PMD_SIZE and not have to worry about
+	 * partial PMD writebacks.
 	 */
-	sector = dax_radix_sector(entry);
+	pfn = dax_radix_pfn(entry);
 	size = PAGE_SIZE << dax_radix_order(entry);
 
-	id = dax_read_lock();
-	ret = bdev_dax_pgoff(bdev, sector, size, &pgoff);
-	if (ret)
-		goto dax_unlock;
-
-	/*
-	 * dax_direct_access() may sleep, so cannot hold tree_lock over
-	 * its invocation.
-	 */
-	ret = dax_direct_access(dax_dev, pgoff, size / PAGE_SIZE, &kaddr, &pfn);
-	if (ret < 0)
-		goto dax_unlock;
-
-	if (WARN_ON_ONCE(ret < size / PAGE_SIZE)) {
-		ret = -EIO;
-		goto dax_unlock;
-	}
-
-	dax_mapping_entry_mkclean(mapping, index, pfn_t_to_pfn(pfn));
-	dax_flush(dax_dev, kaddr, size);
+	dax_mapping_entry_mkclean(mapping, index, pfn);
+	dax_flush(dax_dev, page_address(pfn_to_page(pfn)), size);
 	/*
 	 * After we have flushed the cache, we can clear the dirty tag. There
 	 * cannot be new dirty data in the pfn after the flush has completed as
@@ -749,8 +728,6 @@ static int dax_writeback_one(struct block_device *bdev,
 	radix_tree_tag_clear(page_tree, index, PAGECACHE_TAG_DIRTY);
 	spin_unlock_irq(&mapping->tree_lock);
 	trace_dax_writeback_one(mapping->host, index, size >> PAGE_SHIFT);
- dax_unlock:
-	dax_read_unlock(id);
 	put_locked_mapping_entry(mapping, index);
 	return ret;
 
@@ -808,8 +785,8 @@ int dax_writeback_mapping_range(struct address_space *mapping,
 				break;
 			}
 
-			ret = dax_writeback_one(bdev, dax_dev, mapping,
-					indices[i], pvec.pages[i]);
+			ret = dax_writeback_one(dax_dev, mapping, indices[i],
+					pvec.pages[i]);
 			if (ret < 0) {
 				mapping_set_error(mapping, ret);
 				goto out;
@@ -877,6 +854,7 @@ static int dax_load_hole(struct address_space *mapping, void *entry,
 	int ret = VM_FAULT_NOPAGE;
 	struct page *zero_page;
 	void *entry2;
+	pfn_t pfn;
 
 	zero_page = ZERO_PAGE(0);
 	if (unlikely(!zero_page)) {
@@ -884,14 +862,15 @@ static int dax_load_hole(struct address_space *mapping, void *entry,
 		goto out;
 	}
 
-	entry2 = dax_insert_mapping_entry(mapping, vmf, entry, 0,
+	pfn = page_to_pfn_t(zero_page);
+	entry2 = dax_insert_mapping_entry(mapping, vmf, entry, pfn,
 			RADIX_DAX_ZERO_PAGE, false);
 	if (IS_ERR(entry2)) {
 		ret = VM_FAULT_SIGBUS;
 		goto out;
 	}
 
-	vm_insert_mixed(vmf->vma, vaddr, page_to_pfn_t(zero_page));
+	vm_insert_mixed(vmf->vma, vaddr, pfn);
 out:
 	trace_dax_load_hole(inode, vmf, ret);
 	return ret;
@@ -1200,8 +1179,7 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
 		if (error < 0)
 			goto error_finish_iomap;
 
-		entry = dax_insert_mapping_entry(mapping, vmf, entry,
-						 dax_iomap_sector(&iomap, pos),
+		entry = dax_insert_mapping_entry(mapping, vmf, entry, pfn,
 						 0, write && !sync);
 		if (IS_ERR(entry)) {
 			error = PTR_ERR(entry);
@@ -1280,13 +1258,15 @@ static int dax_pmd_load_hole(struct vm_fault *vmf, struct iomap *iomap,
 	void *ret = NULL;
 	spinlock_t *ptl;
 	pmd_t pmd_entry;
+	pfn_t pfn;
 
 	zero_page = mm_get_huge_zero_page(vmf->vma->vm_mm);
 
 	if (unlikely(!zero_page))
 		goto fallback;
 
-	ret = dax_insert_mapping_entry(mapping, vmf, entry, 0,
+	pfn = page_to_pfn_t(zero_page);
+	ret = dax_insert_mapping_entry(mapping, vmf, entry, pfn,
 			RADIX_DAX_PMD | RADIX_DAX_ZERO_PAGE, false);
 	if (IS_ERR(ret))
 		goto fallback;
@@ -1409,8 +1389,7 @@ static int dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
 		if (error < 0)
 			goto finish_iomap;
 
-		entry = dax_insert_mapping_entry(mapping, vmf, entry,
-						dax_iomap_sector(&iomap, pos),
+		entry = dax_insert_mapping_entry(mapping, vmf, entry, pfn,
 						RADIX_DAX_PMD, write && !sync);
 		if (IS_ERR(entry))
 			goto finish_iomap;


* [PATCH v5 02/11] xfs, dax: introduce xfs_dax_aops
  2018-03-10  6:54 [PATCH v5 00/11] dax: fix dma vs truncate/hole-punch Dan Williams
  2018-03-10  6:54 ` [PATCH v5 01/11] dax: store pfns in the radix Dan Williams
@ 2018-03-10  6:54 ` Dan Williams
  2018-03-10  9:46   ` Christoph Hellwig
  2018-03-10  6:55 ` [PATCH v5 03/11] ext4, dax: introduce ext4_dax_aops Dan Williams
                   ` (8 subsequent siblings)
  10 siblings, 1 reply; 32+ messages in thread
From: Dan Williams @ 2018-03-10  6:54 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: jack, Matthew Wilcox, Dave Chinner, linux-kernel, linux-xfs,
	linux-fsdevel, hch

In preparation for the dax implementation to start associating dax
pages with inodes via page->mapping, we need to provide a 'struct
address_space_operations' instance for dax. Otherwise, direct-I/O
triggers incorrect page cache assumptions and warnings like the
following:

 WARNING: CPU: 27 PID: 1783 at fs/xfs/xfs_aops.c:1468
 xfs_vm_set_page_dirty+0xf3/0x1b0 [xfs]
 [..]
 CPU: 27 PID: 1783 Comm: dma-collision Tainted: G           O 4.15.0-rc2+ #984
 [..]
 Call Trace:
  set_page_dirty_lock+0x40/0x60
  bio_set_pages_dirty+0x37/0x50
  iomap_dio_actor+0x2b7/0x3b0
  ? iomap_dio_zero+0x110/0x110
  iomap_apply+0xa4/0x110
  iomap_dio_rw+0x29e/0x3b0
  ? iomap_dio_zero+0x110/0x110
  ? xfs_file_dio_aio_read+0x7c/0x1a0 [xfs]
  xfs_file_dio_aio_read+0x7c/0x1a0 [xfs]
  xfs_file_read_iter+0xa0/0xc0 [xfs]
  __vfs_read+0xf9/0x170
  vfs_read+0xa6/0x150
  SyS_pread64+0x93/0xb0
  entry_SYSCALL_64_fastpath+0x1f/0x96

...where the default set_page_dirty() handler assumes that dirty state
is being tracked in 'struct page' flags.

Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Suggested-by: Jan Kara <jack@suse.cz>
Suggested-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/dax.c            |   27 +++++++++++++++++++++++++++
 fs/xfs/xfs_aops.c   |    7 +++++++
 fs/xfs/xfs_aops.h   |    1 +
 fs/xfs/xfs_iops.c   |    5 ++++-
 include/linux/dax.h |    6 ++++++
 5 files changed, 45 insertions(+), 1 deletion(-)

diff --git a/fs/dax.c b/fs/dax.c
index b646a46e4d12..ba02772fccbc 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -46,6 +46,33 @@
 #define PG_PMD_COLOUR	((PMD_SIZE >> PAGE_SHIFT) - 1)
 #define PG_PMD_NR	(PMD_SIZE >> PAGE_SHIFT)
 
+int dax_set_page_dirty(struct page *page)
+{
+	/*
+	 * Unlike __set_page_dirty_no_writeback that handles dirty page
+	 * tracking in the page object, dax does all dirty tracking in
+	 * the inode address_space in response to mkwrite faults. In the
+	 * dax case we only need to worry about potentially dirty CPU
+	 * caches, not dirty page cache pages to write back.
+	 *
+	 * This callback is defined to prevent fallback to
+	 * __set_page_dirty_buffers() in set_page_dirty().
+	 */
+	return 0;
+}
+EXPORT_SYMBOL(dax_set_page_dirty);
+
+void dax_invalidatepage(struct page *page, unsigned int offset,
+		unsigned int length)
+{
+	/*
+	 * There is no page cache to invalidate in the dax case, however
+	 * we need this callback defined to prevent falling back to
+	 * block_invalidatepage() in do_invalidatepage().
+	 */
+}
+EXPORT_SYMBOL(dax_invalidatepage);
+
 static wait_queue_head_t wait_table[DAX_WAIT_TABLE_ENTRIES];
 
 static int __init init_dax_wait_table(void)
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 9c6a830da0ee..5788b680fa01 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -1505,3 +1505,10 @@ const struct address_space_operations xfs_address_space_operations = {
 	.is_partially_uptodate  = block_is_partially_uptodate,
 	.error_remove_page	= generic_error_remove_page,
 };
+
+const struct address_space_operations xfs_dax_aops = {
+	.writepages		= xfs_vm_writepages,
+	.direct_IO		= xfs_vm_direct_IO,
+	.set_page_dirty		= dax_set_page_dirty,
+	.invalidatepage		= dax_invalidatepage,
+};
diff --git a/fs/xfs/xfs_aops.h b/fs/xfs/xfs_aops.h
index 88c85ea63da0..69346d460dfa 100644
--- a/fs/xfs/xfs_aops.h
+++ b/fs/xfs/xfs_aops.h
@@ -54,6 +54,7 @@ struct xfs_ioend {
 };
 
 extern const struct address_space_operations xfs_address_space_operations;
+extern const struct address_space_operations xfs_dax_aops;
 
 int	xfs_setfilesize(struct xfs_inode *ip, xfs_off_t offset, size_t size);
 
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 56475fcd76f2..951e84df5576 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -1272,7 +1272,10 @@ xfs_setup_iops(
 	case S_IFREG:
 		inode->i_op = &xfs_inode_operations;
 		inode->i_fop = &xfs_file_operations;
-		inode->i_mapping->a_ops = &xfs_address_space_operations;
+		if (IS_DAX(inode))
+			inode->i_mapping->a_ops = &xfs_dax_aops;
+		else
+			inode->i_mapping->a_ops = &xfs_address_space_operations;
 		break;
 	case S_IFDIR:
 		if (xfs_sb_version_hasasciici(&XFS_M(inode->i_sb)->m_sb))
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 0185ecdae135..3045c0d9c804 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -57,6 +57,9 @@ static inline void fs_put_dax(struct dax_device *dax_dev)
 }
 
 struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev);
+int dax_set_page_dirty(struct page *page);
+void dax_invalidatepage(struct page *page, unsigned int offset,
+		unsigned int length);
 #else
 static inline int bdev_dax_supported(struct super_block *sb, int blocksize)
 {
@@ -76,6 +79,9 @@ static inline struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev)
 {
 	return NULL;
 }
+
+#define dax_set_page_dirty NULL
+#define dax_invalidatepage NULL
 #endif
 
 int dax_read_lock(void);


* [PATCH v5 03/11] ext4, dax: introduce ext4_dax_aops
  2018-03-10  6:54 [PATCH v5 00/11] dax: fix dma vs truncate/hole-punch Dan Williams
  2018-03-10  6:54 ` [PATCH v5 01/11] dax: store pfns in the radix Dan Williams
  2018-03-10  6:54 ` [PATCH v5 02/11] xfs, dax: introduce xfs_dax_aops Dan Williams
@ 2018-03-10  6:55 ` Dan Williams
  2018-03-10  6:55 ` [PATCH v5 04/11] ext2, dax: introduce ext2_dax_aops Dan Williams
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 32+ messages in thread
From: Dan Williams @ 2018-03-10  6:55 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: jack, david, linux-kernel, linux-xfs, Andreas Dilger,
	linux-fsdevel, Theodore Ts'o, linux-ext4, hch

In preparation for the dax implementation to start associating dax
pages with inodes via page->mapping, we need to provide a 'struct
address_space_operations' instance for dax. Otherwise, direct-I/O
triggers incorrect page cache assumptions and warnings.

Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: linux-ext4@vger.kernel.org
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/ext4/inode.c |   11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index c94780075b04..ef21f0ad38ff 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3946,6 +3946,13 @@ static const struct address_space_operations ext4_da_aops = {
 	.error_remove_page	= generic_error_remove_page,
 };
 
+static const struct address_space_operations ext4_dax_aops = {
+	.writepages		= ext4_writepages,
+	.direct_IO		= ext4_direct_IO,
+	.set_page_dirty		= dax_set_page_dirty,
+	.invalidatepage		= dax_invalidatepage,
+};
+
 void ext4_set_aops(struct inode *inode)
 {
 	switch (ext4_inode_journal_mode(inode)) {
@@ -3958,7 +3965,9 @@ void ext4_set_aops(struct inode *inode)
 	default:
 		BUG();
 	}
-	if (test_opt(inode->i_sb, DELALLOC))
+	if (IS_DAX(inode))
+		inode->i_mapping->a_ops = &ext4_dax_aops;
+	else if (test_opt(inode->i_sb, DELALLOC))
 		inode->i_mapping->a_ops = &ext4_da_aops;
 	else
 		inode->i_mapping->a_ops = &ext4_aops;


* [PATCH v5 04/11] ext2, dax: introduce ext2_dax_aops
  2018-03-10  6:54 [PATCH v5 00/11] dax: fix dma vs truncate/hole-punch Dan Williams
                   ` (2 preceding siblings ...)
  2018-03-10  6:55 ` [PATCH v5 03/11] ext4, dax: introduce ext4_dax_aops Dan Williams
@ 2018-03-10  6:55 ` Dan Williams
  2018-03-10  6:55 ` [PATCH v5 05/11] fs, dax: use page->mapping to warn if truncate collides with a busy page Dan Williams
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 32+ messages in thread
From: Dan Williams @ 2018-03-10  6:55 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: jack, david, linux-kernel, linux-xfs, Jan Kara, linux-fsdevel, hch

In preparation for the dax implementation to start associating dax
pages with inodes via page->mapping, we need to provide a 'struct
address_space_operations' instance for dax. Otherwise, direct-I/O
triggers incorrect page cache assumptions and warnings.

Cc: Jan Kara <jack@suse.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/ext2/ext2.h  |    1 +
 fs/ext2/inode.c |   28 ++++++++++++++++++++--------
 fs/ext2/namei.c |   18 ++----------------
 3 files changed, 23 insertions(+), 24 deletions(-)

diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h
index 032295e1d386..cc40802ddfa8 100644
--- a/fs/ext2/ext2.h
+++ b/fs/ext2/ext2.h
@@ -814,6 +814,7 @@ extern const struct inode_operations ext2_file_inode_operations;
 extern const struct file_operations ext2_file_operations;
 
 /* inode.c */
+extern void ext2_set_file_ops(struct inode *inode);
 extern const struct address_space_operations ext2_aops;
 extern const struct address_space_operations ext2_nobh_aops;
 extern const struct iomap_ops ext2_iomap_ops;
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 9b2ac55ac34f..09608f7e9e39 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -990,6 +990,13 @@ const struct address_space_operations ext2_nobh_aops = {
 	.error_remove_page	= generic_error_remove_page,
 };
 
+const struct address_space_operations ext2_dax_aops = {
+	.writepages		= ext2_writepages,
+	.direct_IO		= ext2_direct_IO,
+	.set_page_dirty		= dax_set_page_dirty,
+	.invalidatepage		= dax_invalidatepage,
+};
+
 /*
  * Probably it should be a library function... search for first non-zero word
  * or memcmp with zero_page, whatever is better for particular architecture.
@@ -1388,6 +1395,18 @@ void ext2_set_inode_flags(struct inode *inode)
 		inode->i_flags |= S_DAX;
 }
 
+void ext2_set_file_ops(struct inode *inode)
+{
+	inode->i_op = &ext2_file_inode_operations;
+	inode->i_fop = &ext2_file_operations;
+	if (IS_DAX(inode))
+		inode->i_mapping->a_ops = &ext2_dax_aops;
+	else if (test_opt(inode->i_sb, NOBH))
+		inode->i_mapping->a_ops = &ext2_nobh_aops;
+	else
+		inode->i_mapping->a_ops = &ext2_aops;
+}
+
 struct inode *ext2_iget (struct super_block *sb, unsigned long ino)
 {
 	struct ext2_inode_info *ei;
@@ -1480,14 +1499,7 @@ struct inode *ext2_iget (struct super_block *sb, unsigned long ino)
 		ei->i_data[n] = raw_inode->i_block[n];
 
 	if (S_ISREG(inode->i_mode)) {
-		inode->i_op = &ext2_file_inode_operations;
-		if (test_opt(inode->i_sb, NOBH)) {
-			inode->i_mapping->a_ops = &ext2_nobh_aops;
-			inode->i_fop = &ext2_file_operations;
-		} else {
-			inode->i_mapping->a_ops = &ext2_aops;
-			inode->i_fop = &ext2_file_operations;
-		}
+		ext2_set_file_ops(inode);
 	} else if (S_ISDIR(inode->i_mode)) {
 		inode->i_op = &ext2_dir_inode_operations;
 		inode->i_fop = &ext2_dir_operations;
diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c
index e078075dc66f..55f7caadb093 100644
--- a/fs/ext2/namei.c
+++ b/fs/ext2/namei.c
@@ -107,14 +107,7 @@ static int ext2_create (struct inode * dir, struct dentry * dentry, umode_t mode
 	if (IS_ERR(inode))
 		return PTR_ERR(inode);
 
-	inode->i_op = &ext2_file_inode_operations;
-	if (test_opt(inode->i_sb, NOBH)) {
-		inode->i_mapping->a_ops = &ext2_nobh_aops;
-		inode->i_fop = &ext2_file_operations;
-	} else {
-		inode->i_mapping->a_ops = &ext2_aops;
-		inode->i_fop = &ext2_file_operations;
-	}
+	ext2_set_file_ops(inode);
 	mark_inode_dirty(inode);
 	return ext2_add_nondir(dentry, inode);
 }
@@ -125,14 +118,7 @@ static int ext2_tmpfile(struct inode *dir, struct dentry *dentry, umode_t mode)
 	if (IS_ERR(inode))
 		return PTR_ERR(inode);
 
-	inode->i_op = &ext2_file_inode_operations;
-	if (test_opt(inode->i_sb, NOBH)) {
-		inode->i_mapping->a_ops = &ext2_nobh_aops;
-		inode->i_fop = &ext2_file_operations;
-	} else {
-		inode->i_mapping->a_ops = &ext2_aops;
-		inode->i_fop = &ext2_file_operations;
-	}
+	ext2_set_file_ops(inode);
 	mark_inode_dirty(inode);
 	d_tmpfile(dentry, inode);
 	unlock_new_inode(inode);


* [PATCH v5 05/11] fs, dax: use page->mapping to warn if truncate collides with a busy page
  2018-03-10  6:54 [PATCH v5 00/11] dax: fix dma vs truncate/hole-punch Dan Williams
                   ` (3 preceding siblings ...)
  2018-03-10  6:55 ` [PATCH v5 04/11] ext2, dax: introduce ext2_dax_aops Dan Williams
@ 2018-03-10  6:55 ` Dan Williams
  2018-03-10  6:55 ` [PATCH v5 06/11] mm, dax: enable filesystems to trigger dev_pagemap ->page_free callbacks Dan Williams
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 32+ messages in thread
From: Dan Williams @ 2018-03-10  6:55 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: jack, Matthew Wilcox, david, linux-kernel, linux-xfs, linux-fsdevel, hch

Catch cases where extent unmap operations encounter pages that are
pinned / busy. Typically these are pages pinned for in-flight dma.
This warning is a canary for potential data corruption, as truncated
blocks could be allocated to a new file while the device is still
performing I/O.

Here is an example of a collision that this implementation catches:

 WARNING: CPU: 2 PID: 1286 at fs/dax.c:343 dax_disassociate_entry+0x55/0x80
 [..]
 Call Trace:
  __dax_invalidate_mapping_entry+0x6c/0xf0
  dax_delete_mapping_entry+0xf/0x20
  truncate_exceptional_pvec_entries.part.12+0x1af/0x200
  truncate_inode_pages_range+0x268/0x970
  ? tlb_gather_mmu+0x10/0x20
  ? up_write+0x1c/0x40
  ? unmap_mapping_range+0x73/0x140
  xfs_free_file_space+0x1b6/0x5b0 [xfs]
  ? xfs_file_fallocate+0x7f/0x320 [xfs]
  ? down_write_nested+0x40/0x70
  ? xfs_ilock+0x21d/0x2f0 [xfs]
  xfs_file_fallocate+0x162/0x320 [xfs]
  ? rcu_read_lock_sched_held+0x3f/0x70
  ? rcu_sync_lockdep_assert+0x2a/0x50
  ? __sb_start_write+0xd0/0x1b0
  ? vfs_fallocate+0x20c/0x270
  vfs_fallocate+0x154/0x270
  SyS_fallocate+0x43/0x80
  entry_SYSCALL_64_fastpath+0x1f/0x96

Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/dax.c |   56 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 56 insertions(+)

diff --git a/fs/dax.c b/fs/dax.c
index ba02772fccbc..fecf463a1468 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -325,6 +325,56 @@ static void put_unlocked_mapping_entry(struct address_space *mapping,
 	dax_wake_mapping_entry_waiter(mapping, index, entry, false);
 }
 
+static unsigned long dax_entry_size(void *entry)
+{
+	if (dax_is_zero_entry(entry))
+		return 0;
+	else if (dax_is_empty_entry(entry))
+		return 0;
+	else if (dax_is_pmd_entry(entry))
+		return HPAGE_SIZE;
+	else
+		return PAGE_SIZE;
+}
+
+#define for_each_entry_pfn(entry, pfn, end_pfn) \
+	for (pfn = dax_radix_pfn(entry), \
+			end_pfn = pfn + dax_entry_size(entry) / PAGE_SIZE; \
+			pfn < end_pfn; \
+			pfn++)
+
+static void dax_associate_entry(void *entry, struct address_space *mapping)
+{
+	unsigned long pfn, end_pfn;
+
+	if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
+		return;
+
+	for_each_entry_pfn(entry, pfn, end_pfn) {
+		struct page *page = pfn_to_page(pfn);
+
+		WARN_ON_ONCE(page->mapping);
+		page->mapping = mapping;
+	}
+}
+
+static void dax_disassociate_entry(void *entry, struct address_space *mapping,
+		bool trunc)
+{
+	unsigned long pfn, end_pfn;
+
+	if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
+		return;
+
+	for_each_entry_pfn(entry, pfn, end_pfn) {
+		struct page *page = pfn_to_page(pfn);
+
+		WARN_ON_ONCE(trunc && page_ref_count(page) > 1);
+		WARN_ON_ONCE(page->mapping && page->mapping != mapping);
+		page->mapping = NULL;
+	}
+}
+
 /*
  * Find radix tree entry at given index. If it points to an exceptional entry,
  * return it with the radix tree entry locked. If the radix tree doesn't
@@ -431,6 +481,7 @@ static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index,
 		}
 
 		if (pmd_downgrade) {
+			dax_disassociate_entry(entry, mapping, false);
 			radix_tree_delete(&mapping->page_tree, index);
 			mapping->nrexceptional--;
 			dax_wake_mapping_entry_waiter(mapping, index, entry,
@@ -480,6 +531,7 @@ static int __dax_invalidate_mapping_entry(struct address_space *mapping,
 	    (radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_DIRTY) ||
 	     radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE)))
 		goto out;
+	dax_disassociate_entry(entry, mapping, trunc);
 	radix_tree_delete(page_tree, index);
 	mapping->nrexceptional--;
 	ret = 1;
@@ -574,6 +626,10 @@ static void *dax_insert_mapping_entry(struct address_space *mapping,
 
 	spin_lock_irq(&mapping->tree_lock);
 	new_entry = dax_radix_locked_entry(pfn, flags);
+	if (dax_entry_size(entry) != dax_entry_size(new_entry)) {
+		dax_disassociate_entry(entry, mapping, false);
+		dax_associate_entry(new_entry, mapping);
+	}
 
 	if (dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
 		/*


* [PATCH v5 06/11] mm, dax: enable filesystems to trigger dev_pagemap ->page_free callbacks
  2018-03-10  6:54 [PATCH v5 00/11] dax: fix dma vs truncate/hole-punch Dan Williams
                   ` (4 preceding siblings ...)
  2018-03-10  6:55 ` [PATCH v5 05/11] fs, dax: use page->mapping to warn if truncate collides with a busy page Dan Williams
@ 2018-03-10  6:55 ` Dan Williams
  2018-03-12 14:09   ` Jerome Glisse
  2018-03-10  6:55 ` [PATCH v5 07/11] mm, dev_pagemap: introduce CONFIG_DEV_PAGEMAP_OPS Dan Williams
                   ` (4 subsequent siblings)
  10 siblings, 1 reply; 32+ messages in thread
From: Dan Williams @ 2018-03-10  6:55 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Michal Hocko, jack, david, linux-kernel, linux-xfs,
	Jérôme Glisse, linux-fsdevel, hch

In order to resolve collisions between filesystem operations and DMA to
DAX mapped pages we need a callback when DMA completes. With a callback
we can hold off filesystem operations while DMA is in-flight and then
resume those operations when the last put_page() occurs on a DMA page.

Recall that the 'struct page' entries for DAX memory are created with
devm_memremap_pages(). That routine arranges for the pages to be
allocated, but never onlined, so a DAX page is DMA-idle when its
reference count reaches one.

Also recall that the HMM sub-system added infrastructure to trap the
page-idle (2-to-1 reference count) transition of the pages allocated by
devm_memremap_pages() and trigger a callback via the 'struct
dev_pagemap' associated with the page range. Whereas the HMM callbacks
go to a device driver to manage bounce pages in device memory, in the
filesystem-dax case we call back to a filesystem-specified handler.

Since the callback is not known at devm_memremap_pages() time, we
arrange for the filesystem to install it at mount time. No functional
changes are expected, as this only registers a nop handler for the
->page_free() event on device-mapped pages.
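
For illustration, the mount-time pattern a filesystem follows with
this change (mirroring the ext2 hunk below; the function names here
are made up for the sketch):

	/* at mount: route ->page_free() events to this super_block */
	static struct dax_device *example_claim(struct super_block *sb)
	{
		return fs_dax_claim_bdev(sb->s_bdev, sb);
	}

	/* at unmount: restore the nop / unclaimed state */
	static void example_release(struct super_block *sb,
			struct dax_device *dax_dev)
	{
		fs_dax_release(dax_dev, sb);
	}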

Cc: Michal Hocko <mhocko@suse.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/dax/super.c      |   79 ++++++++++++++++++++++++++++++++++++++++------
 drivers/nvdimm/pmem.c    |    3 +-
 fs/ext2/super.c          |    6 ++-
 fs/ext4/super.c          |    6 ++-
 fs/xfs/xfs_super.c       |   20 ++++++------
 include/linux/dax.h      |   17 +++++-----
 include/linux/memremap.h |    8 +++++
 7 files changed, 103 insertions(+), 36 deletions(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 2b2332b605e4..ecefe9f7eb60 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -29,6 +29,7 @@ static struct vfsmount *dax_mnt;
 static DEFINE_IDA(dax_minor_ida);
 static struct kmem_cache *dax_cache __read_mostly;
 static struct super_block *dax_superblock __read_mostly;
+static DEFINE_MUTEX(devmap_lock);
 
 #define DAX_HASH_SIZE (PAGE_SIZE / sizeof(struct hlist_head))
 static struct hlist_head dax_host_list[DAX_HASH_SIZE];
@@ -62,16 +63,6 @@ int bdev_dax_pgoff(struct block_device *bdev, sector_t sector, size_t size,
 }
 EXPORT_SYMBOL(bdev_dax_pgoff);
 
-#if IS_ENABLED(CONFIG_FS_DAX)
-struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev)
-{
-	if (!blk_queue_dax(bdev->bd_queue))
-		return NULL;
-	return fs_dax_get_by_host(bdev->bd_disk->disk_name);
-}
-EXPORT_SYMBOL_GPL(fs_dax_get_by_bdev);
-#endif
-
 /**
  * __bdev_dax_supported() - Check if the device supports dax for filesystem
  * @sb: The superblock of the device
@@ -169,9 +160,66 @@ struct dax_device {
 	const char *host;
 	void *private;
 	unsigned long flags;
+	struct dev_pagemap *pgmap;
 	const struct dax_operations *ops;
 };
 
+#if IS_ENABLED(CONFIG_FS_DAX)
+static void generic_dax_pagefree(struct page *page, void *data)
+{
+	/* TODO: wakeup page-idle waiters */
+}
+
+struct dax_device *fs_dax_claim_bdev(struct block_device *bdev, void *owner)
+{
+	struct dax_device *dax_dev;
+	struct dev_pagemap *pgmap;
+
+	if (!blk_queue_dax(bdev->bd_queue))
+		return NULL;
+	dax_dev = fs_dax_get_by_host(bdev->bd_disk->disk_name);
+	if (!dax_dev->pgmap)
+		return dax_dev;
+	pgmap = dax_dev->pgmap;
+
+	mutex_lock(&devmap_lock);
+	if ((pgmap->data && pgmap->data != owner) || pgmap->page_free
+			|| pgmap->page_fault
+			|| pgmap->type != MEMORY_DEVICE_HOST) {
+		put_dax(dax_dev);
+		mutex_unlock(&devmap_lock);
+		return NULL;
+	}
+
+	pgmap->type = MEMORY_DEVICE_FS_DAX;
+	pgmap->page_free = generic_dax_pagefree;
+	pgmap->data = owner;
+	mutex_unlock(&devmap_lock);
+
+	return dax_dev;
+}
+EXPORT_SYMBOL_GPL(fs_dax_claim_bdev);
+
+void fs_dax_release(struct dax_device *dax_dev, void *owner)
+{
+	struct dev_pagemap *pgmap = dax_dev ? dax_dev->pgmap : NULL;
+
+	put_dax(dax_dev);
+	if (!pgmap)
+		return;
+	if (!pgmap->data)
+		return;
+
+	mutex_lock(&devmap_lock);
+	WARN_ON(pgmap->data != owner);
+	pgmap->type = MEMORY_DEVICE_HOST;
+	pgmap->page_free = NULL;
+	pgmap->data = NULL;
+	mutex_unlock(&devmap_lock);
+}
+EXPORT_SYMBOL_GPL(fs_dax_release);
+#endif
+
 static ssize_t write_cache_show(struct device *dev,
 		struct device_attribute *attr, char *buf)
 {
@@ -499,6 +547,17 @@ struct dax_device *alloc_dax(void *private, const char *__host,
 }
 EXPORT_SYMBOL_GPL(alloc_dax);
 
+struct dax_device *alloc_dax_devmap(void *private, const char *host,
+		const struct dax_operations *ops, struct dev_pagemap *pgmap)
+{
+	struct dax_device *dax_dev = alloc_dax(private, host, ops);
+
+	if (dax_dev)
+		dax_dev->pgmap = pgmap;
+	return dax_dev;
+}
+EXPORT_SYMBOL_GPL(alloc_dax_devmap);
+
 void put_dax(struct dax_device *dax_dev)
 {
 	if (!dax_dev)
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 06f8dcc52ca6..e6d7351f3379 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -408,7 +408,8 @@ static int pmem_attach_disk(struct device *dev,
 	nvdimm_badblocks_populate(nd_region, &pmem->bb, &bb_res);
 	disk->bb = &pmem->bb;
 
-	dax_dev = alloc_dax(pmem, disk->disk_name, &pmem_dax_ops);
+	dax_dev = alloc_dax_devmap(pmem, disk->disk_name, &pmem_dax_ops,
+			&pmem->pgmap);
 	if (!dax_dev) {
 		put_disk(disk);
 		return -ENOMEM;
diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index 7666c065b96f..6ae20e319bc4 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -172,7 +172,7 @@ static void ext2_put_super (struct super_block * sb)
 	brelse (sbi->s_sbh);
 	sb->s_fs_info = NULL;
 	kfree(sbi->s_blockgroup_lock);
-	fs_put_dax(sbi->s_daxdev);
+	fs_dax_release(sbi->s_daxdev, sb);
 	kfree(sbi);
 }
 
@@ -817,7 +817,7 @@ static unsigned long descriptor_loc(struct super_block *sb,
 
 static int ext2_fill_super(struct super_block *sb, void *data, int silent)
 {
-	struct dax_device *dax_dev = fs_dax_get_by_bdev(sb->s_bdev);
+	struct dax_device *dax_dev = fs_dax_claim_bdev(sb->s_bdev, sb);
 	struct buffer_head * bh;
 	struct ext2_sb_info * sbi;
 	struct ext2_super_block * es;
@@ -1213,7 +1213,7 @@ static int ext2_fill_super(struct super_block *sb, void *data, int silent)
 	kfree(sbi->s_blockgroup_lock);
 	kfree(sbi);
 failed:
-	fs_put_dax(dax_dev);
+	fs_dax_release(dax_dev, sb);
 	return ret;
 }
 
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 39bf464c35f1..315a323729e3 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -952,7 +952,7 @@ static void ext4_put_super(struct super_block *sb)
 	if (sbi->s_chksum_driver)
 		crypto_free_shash(sbi->s_chksum_driver);
 	kfree(sbi->s_blockgroup_lock);
-	fs_put_dax(sbi->s_daxdev);
+	fs_dax_release(sbi->s_daxdev, sb);
 	kfree(sbi);
 }
 
@@ -3398,7 +3398,7 @@ static void ext4_set_resv_clusters(struct super_block *sb)
 
 static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 {
-	struct dax_device *dax_dev = fs_dax_get_by_bdev(sb->s_bdev);
+	struct dax_device *dax_dev = fs_dax_claim_bdev(sb->s_bdev, sb);
 	char *orig_data = kstrdup(data, GFP_KERNEL);
 	struct buffer_head *bh;
 	struct ext4_super_block *es = NULL;
@@ -4408,7 +4408,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 out_free_base:
 	kfree(sbi);
 	kfree(orig_data);
-	fs_put_dax(dax_dev);
+	fs_dax_release(dax_dev, sb);
 	return err ? err : ret;
 }
 
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 93588ea3d3d2..ef7dd7148c0b 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -724,7 +724,7 @@ xfs_close_devices(
 
 		xfs_free_buftarg(mp, mp->m_logdev_targp);
 		xfs_blkdev_put(logdev);
-		fs_put_dax(dax_logdev);
+		fs_dax_release(dax_logdev, mp);
 	}
 	if (mp->m_rtdev_targp) {
 		struct block_device *rtdev = mp->m_rtdev_targp->bt_bdev;
@@ -732,10 +732,10 @@ xfs_close_devices(
 
 		xfs_free_buftarg(mp, mp->m_rtdev_targp);
 		xfs_blkdev_put(rtdev);
-		fs_put_dax(dax_rtdev);
+		fs_dax_release(dax_rtdev, mp);
 	}
 	xfs_free_buftarg(mp, mp->m_ddev_targp);
-	fs_put_dax(dax_ddev);
+	fs_dax_release(dax_ddev, mp);
 }
 
 /*
@@ -753,9 +753,9 @@ xfs_open_devices(
 	struct xfs_mount	*mp)
 {
 	struct block_device	*ddev = mp->m_super->s_bdev;
-	struct dax_device	*dax_ddev = fs_dax_get_by_bdev(ddev);
-	struct dax_device	*dax_logdev = NULL, *dax_rtdev = NULL;
+	struct dax_device	*dax_ddev = fs_dax_claim_bdev(ddev, mp);
 	struct block_device	*logdev = NULL, *rtdev = NULL;
+	struct dax_device	*dax_logdev = NULL, *dax_rtdev = NULL;
 	int			error;
 
 	/*
@@ -765,7 +765,7 @@ xfs_open_devices(
 		error = xfs_blkdev_get(mp, mp->m_logname, &logdev);
 		if (error)
 			goto out;
-		dax_logdev = fs_dax_get_by_bdev(logdev);
+		dax_logdev = fs_dax_claim_bdev(logdev, mp);
 	}
 
 	if (mp->m_rtname) {
@@ -779,7 +779,7 @@ xfs_open_devices(
 			error = -EINVAL;
 			goto out_close_rtdev;
 		}
-		dax_rtdev = fs_dax_get_by_bdev(rtdev);
+		dax_rtdev = fs_dax_claim_bdev(rtdev, mp);
 	}
 
 	/*
@@ -813,14 +813,14 @@ xfs_open_devices(
 	xfs_free_buftarg(mp, mp->m_ddev_targp);
  out_close_rtdev:
 	xfs_blkdev_put(rtdev);
-	fs_put_dax(dax_rtdev);
+	fs_dax_release(dax_rtdev, mp);
  out_close_logdev:
 	if (logdev && logdev != ddev) {
 		xfs_blkdev_put(logdev);
-		fs_put_dax(dax_logdev);
+		fs_dax_release(dax_logdev, mp);
 	}
  out:
-	fs_put_dax(dax_ddev);
+	fs_dax_release(dax_ddev, mp);
 	return error;
 }
 
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 3045c0d9c804..9b4259aee016 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -51,12 +51,8 @@ static inline struct dax_device *fs_dax_get_by_host(const char *host)
 	return dax_get_by_host(host);
 }
 
-static inline void fs_put_dax(struct dax_device *dax_dev)
-{
-	put_dax(dax_dev);
-}
-
-struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev);
+struct dax_device *fs_dax_claim_bdev(struct block_device *bdev, void *owner);
+void fs_dax_release(struct dax_device *dax_dev, void *owner);
 int dax_set_page_dirty(struct page *page);
 void dax_invalidatepage(struct page *page, unsigned int offset,
 		unsigned int length);
@@ -71,13 +67,14 @@ static inline struct dax_device *fs_dax_get_by_host(const char *host)
 	return NULL;
 }
 
-static inline void fs_put_dax(struct dax_device *dax_dev)
+static inline struct dax_device *fs_dax_claim_bdev(struct block_device *bdev,
+		void *owner)
 {
+	return NULL;
 }
 
-static inline struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev)
+static inline void fs_dax_release(struct dax_device *dax_dev, void *owner)
 {
-	return NULL;
 }
 
 #define dax_set_page_dirty NULL
@@ -88,6 +85,8 @@ int dax_read_lock(void);
 void dax_read_unlock(int id);
 struct dax_device *alloc_dax(void *private, const char *host,
 		const struct dax_operations *ops);
+struct dax_device *alloc_dax_devmap(void *private, const char *host,
+		const struct dax_operations *ops, struct dev_pagemap *pgmap);
 bool dax_alive(struct dax_device *dax_dev);
 void kill_dax(struct dax_device *dax_dev);
 void *dax_get_private(struct dax_device *dax_dev);
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 7b4899c06f49..02d6d042ee7f 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -53,11 +53,19 @@ struct vmem_altmap {
  * driver can hotplug the device memory using ZONE_DEVICE and with that memory
  * type. Any page of a process can be migrated to such memory. However no one
  * should be allow to pin such memory so that it can always be evicted.
+ *
+ * MEMORY_DEVICE_FS_DAX:
+ * When MEMORY_DEVICE_HOST memory is represented by a device that can
+ * host a filesystem, for example /dev/pmem0, that filesystem can
+ * register for a callback when a page is idled. For the filesystem-dax
+ * case page idle callbacks are used to coordinate DMA vs
+ * hole-punch/truncate.
  */
 enum memory_type {
 	MEMORY_DEVICE_HOST = 0,
 	MEMORY_DEVICE_PRIVATE,
 	MEMORY_DEVICE_PUBLIC,
+	MEMORY_DEVICE_FS_DAX,
 };
 
 /*


* [PATCH v5 07/11] mm, dev_pagemap: introduce CONFIG_DEV_PAGEMAP_OPS
  2018-03-10  6:54 [PATCH v5 00/11] dax: fix dma vs truncate/hole-punch Dan Williams
                   ` (5 preceding siblings ...)
  2018-03-10  6:55 ` [PATCH v5 06/11] mm, dax: enable filesystems to trigger dev_pagemap ->page_free callbacks Dan Williams
@ 2018-03-10  6:55 ` Dan Williams
  2018-03-12 14:17   ` Jerome Glisse
  2018-03-10  6:55 ` [PATCH v5 08/11] wait_bit: introduce {wait_on,wake_up}_atomic_one Dan Williams
                   ` (3 subsequent siblings)
  10 siblings, 1 reply; 32+ messages in thread
From: Dan Williams @ 2018-03-10  6:55 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Michal Hocko, jack, david, linux-kernel, linux-xfs,
	Jérôme Glisse, linux-fsdevel, hch

The HMM sub-system extended dev_pagemap to arrange a callback when a
dev_pagemap managed page is freed. Since a dev_pagemap page is free /
idle when its reference count is 1, it requires an additional branch
to check the page-type at put_page() time. Given that put_page() is a
hot path, we do not want to incur that check if HMM is not in use, so
a static branch is used to avoid that overhead when not necessary.

Now, the FS_DAX implementation wants to reuse this mechanism for
receiving dev_pagemap ->page_free() callbacks. Rework the HMM-specific
static-key into a generic mechanism that either HMM or FS_DAX code paths
can enable.
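
As a hedged restatement of the registration pattern (mirroring
fs_dax_claim_bdev()/fs_dax_release() from the previous patch;
'my_pagefree' is illustrative):

	dev_pagemap_get_ops();		/* enable the put_page() branch */
	pgmap->type = MEMORY_DEVICE_FS_DAX;
	pgmap->page_free = my_pagefree;

	/* ... later, at teardown ... */

	pgmap->page_free = NULL;
	pgmap->type = MEMORY_DEVICE_HOST;
	dev_pagemap_put_ops();		/* branch may go cold again */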

Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/dax/super.c      |    2 ++
 fs/Kconfig               |    1 +
 include/linux/memremap.h |   20 ++-------------
 include/linux/mm.h       |   61 ++++++++++++++++++++++++++++++++--------------
 kernel/memremap.c        |   30 ++++++++++++++++++++---
 mm/Kconfig               |    5 ++++
 mm/hmm.c                 |   13 ++--------
 mm/swap.c                |    3 ++
 8 files changed, 84 insertions(+), 51 deletions(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index ecefe9f7eb60..619b1ed6434c 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -191,6 +191,7 @@ struct dax_device *fs_dax_claim_bdev(struct block_device *bdev, void *owner)
 		return NULL;
 	}
 
+	dev_pagemap_get_ops();
 	pgmap->type = MEMORY_DEVICE_FS_DAX;
 	pgmap->page_free = generic_dax_pagefree;
 	pgmap->data = owner;
@@ -215,6 +216,7 @@ void fs_dax_release(struct dax_device *dax_dev, void *owner)
 	pgmap->type = MEMORY_DEVICE_HOST;
 	pgmap->page_free = NULL;
 	pgmap->data = NULL;
+	dev_pagemap_put_ops();
 	mutex_unlock(&devmap_lock);
 }
 EXPORT_SYMBOL_GPL(fs_dax_release);
diff --git a/fs/Kconfig b/fs/Kconfig
index bc821a86d965..1f0832bbc32f 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -38,6 +38,7 @@ config FS_DAX
 	bool "Direct Access (DAX) support"
 	depends on MMU
 	depends on !(ARM || MIPS || SPARC)
+	select DEV_PAGEMAP_OPS
 	select FS_IOMAP
 	select DAX
 	help
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 02d6d042ee7f..9faf25d6abef 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -1,7 +1,6 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 #ifndef _LINUX_MEMREMAP_H_
 #define _LINUX_MEMREMAP_H_
-#include <linux/mm.h>
 #include <linux/ioport.h>
 #include <linux/percpu-refcount.h>
 
@@ -130,6 +129,9 @@ struct dev_pagemap {
 	enum memory_type type;
 };
 
+void dev_pagemap_get_ops(void);
+void dev_pagemap_put_ops(void);
+
 #ifdef CONFIG_ZONE_DEVICE
 void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap);
 struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
@@ -137,8 +139,6 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
 
 unsigned long vmem_altmap_offset(struct vmem_altmap *altmap);
 void vmem_altmap_free(struct vmem_altmap *altmap, unsigned long nr_pfns);
-
-static inline bool is_zone_device_page(const struct page *page);
 #else
 static inline void *devm_memremap_pages(struct device *dev,
 		struct dev_pagemap *pgmap)
@@ -169,20 +169,6 @@ static inline void vmem_altmap_free(struct vmem_altmap *altmap,
 }
 #endif /* CONFIG_ZONE_DEVICE */
 
-#if defined(CONFIG_DEVICE_PRIVATE) || defined(CONFIG_DEVICE_PUBLIC)
-static inline bool is_device_private_page(const struct page *page)
-{
-	return is_zone_device_page(page) &&
-		page->pgmap->type == MEMORY_DEVICE_PRIVATE;
-}
-
-static inline bool is_device_public_page(const struct page *page)
-{
-	return is_zone_device_page(page) &&
-		page->pgmap->type == MEMORY_DEVICE_PUBLIC;
-}
-#endif /* CONFIG_DEVICE_PRIVATE || CONFIG_DEVICE_PUBLIC */
-
 static inline void put_dev_pagemap(struct dev_pagemap *pgmap)
 {
 	if (pgmap)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ad06d42adb1a..088c76bce360 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -812,27 +812,55 @@ static inline bool is_zone_device_page(const struct page *page)
 }
 #endif
 
-#if defined(CONFIG_DEVICE_PRIVATE) || defined(CONFIG_DEVICE_PUBLIC)
-void put_zone_device_private_or_public_page(struct page *page);
-DECLARE_STATIC_KEY_FALSE(device_private_key);
-#define IS_HMM_ENABLED static_branch_unlikely(&device_private_key)
-static inline bool is_device_private_page(const struct page *page);
-static inline bool is_device_public_page(const struct page *page);
-#else /* CONFIG_DEVICE_PRIVATE || CONFIG_DEVICE_PUBLIC */
-static inline void put_zone_device_private_or_public_page(struct page *page)
+#ifdef CONFIG_DEV_PAGEMAP_OPS
+void __put_devmap_managed_page(struct page *page);
+DECLARE_STATIC_KEY_FALSE(devmap_managed_key);
+static inline bool put_devmap_managed_page(struct page *page)
 {
+	if (!static_branch_unlikely(&devmap_managed_key))
+		return false;
+	if (!is_zone_device_page(page))
+		return false;
+	switch (page->pgmap->type) {
+	case MEMORY_DEVICE_PRIVATE:
+	case MEMORY_DEVICE_PUBLIC:
+	case MEMORY_DEVICE_FS_DAX:
+		__put_devmap_managed_page(page);
+		return true;
+	default:
+		break;
+	}
+	return false;
 }
-#define IS_HMM_ENABLED 0
+
 static inline bool is_device_private_page(const struct page *page)
 {
-	return false;
+	return is_zone_device_page(page) &&
+		page->pgmap->type == MEMORY_DEVICE_PRIVATE;
 }
+
 static inline bool is_device_public_page(const struct page *page)
 {
+	return is_zone_device_page(page) &&
+		page->pgmap->type == MEMORY_DEVICE_PUBLIC;
+}
+
+#else /* CONFIG_DEV_PAGEMAP_OPS */
+static inline bool put_devmap_managed_page(struct page *page)
+{
 	return false;
 }
-#endif /* CONFIG_DEVICE_PRIVATE || CONFIG_DEVICE_PUBLIC */
 
+static inline bool is_device_private_page(const struct page *page)
+{
+	return false;
+}
+
+static inline bool is_device_public_page(const struct page *page)
+{
+	return false;
+}
+#endif /* CONFIG_DEV_PAGEMAP_OPS */
 
 static inline void get_page(struct page *page)
 {
@@ -850,16 +878,13 @@ static inline void put_page(struct page *page)
 	page = compound_head(page);
 
 	/*
-	 * For private device pages we need to catch refcount transition from
-	 * 2 to 1, when refcount reach one it means the private device page is
-	 * free and we need to inform the device driver through callback. See
+	 * For devmap managed pages we need to catch refcount transition from
+	 * 2 to 1, when refcount reach one it means the page is free and we
+	 * need to inform the device driver through callback. See
 	 * include/linux/memremap.h and HMM for details.
 	 */
-	if (IS_HMM_ENABLED && unlikely(is_device_private_page(page) ||
-	    unlikely(is_device_public_page(page)))) {
-		put_zone_device_private_or_public_page(page);
+	if (put_devmap_managed_page(page))
 		return;
-	}
 
 	if (put_page_testzero(page))
 		__put_page(page);
diff --git a/kernel/memremap.c b/kernel/memremap.c
index 4dd4274cabe2..bd8300d02d7c 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -476,8 +476,30 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
 }
 #endif /* CONFIG_ZONE_DEVICE */
 
-#if IS_ENABLED(CONFIG_DEVICE_PRIVATE) ||  IS_ENABLED(CONFIG_DEVICE_PUBLIC)
-void put_zone_device_private_or_public_page(struct page *page)
+#ifdef CONFIG_DEV_PAGEMAP_OPS
+DEFINE_STATIC_KEY_FALSE(devmap_managed_key);
+EXPORT_SYMBOL(devmap_managed_key);
+static atomic_t devmap_enable;
+
+/*
+ * Toggle the static key for ->page_free() callbacks when dev_pagemap
+ * pages go idle.
+ */
+void dev_pagemap_get_ops(void)
+{
+	if (atomic_inc_return(&devmap_enable) == 1)
+		static_branch_enable(&devmap_managed_key);
+}
+EXPORT_SYMBOL_GPL(dev_pagemap_get_ops);
+
+void dev_pagemap_put_ops(void)
+{
+	if (atomic_dec_and_test(&devmap_enable))
+		static_branch_disable(&devmap_managed_key);
+}
+EXPORT_SYMBOL_GPL(dev_pagemap_put_ops);
+
+void __put_devmap_managed_page(struct page *page)
 {
 	int count = page_ref_dec_return(page);
 
@@ -497,5 +519,5 @@ void put_zone_device_private_or_public_page(struct page *page)
 	} else if (!count)
 		__put_page(page);
 }
-EXPORT_SYMBOL(put_zone_device_private_or_public_page);
-#endif /* CONFIG_DEVICE_PRIVATE || CONFIG_DEVICE_PUBLIC */
+EXPORT_SYMBOL_GPL(__put_devmap_managed_page);
+#endif /* CONFIG_DEV_PAGEMAP_OPS */
diff --git a/mm/Kconfig b/mm/Kconfig
index c782e8fb7235..dc32828984a3 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -700,6 +700,9 @@ config ARCH_HAS_HMM
 config MIGRATE_VMA_HELPER
 	bool
 
+config DEV_PAGEMAP_OPS
+	bool
+
 config HMM
 	bool
 	select MIGRATE_VMA_HELPER
@@ -720,6 +723,7 @@ config DEVICE_PRIVATE
 	bool "Unaddressable device memory (GPU memory, ...)"
 	depends on ARCH_HAS_HMM
 	select HMM
+	select DEV_PAGEMAP_OPS
 
 	help
 	  Allows creation of struct pages to represent unaddressable device
@@ -730,6 +734,7 @@ config DEVICE_PUBLIC
 	bool "Addressable device memory (like GPU memory)"
 	depends on ARCH_HAS_HMM
 	select HMM
+	select DEV_PAGEMAP_OPS
 
 	help
 	  Allows creation of struct pages to represent addressable device
diff --git a/mm/hmm.c b/mm/hmm.c
index 320545b98ff5..4aa554e76d06 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -35,15 +35,6 @@
 
 #define PA_SECTION_SIZE (1UL << PA_SECTION_SHIFT)
 
-#if defined(CONFIG_DEVICE_PRIVATE) || defined(CONFIG_DEVICE_PUBLIC)
-/*
- * Device private memory see HMM (Documentation/vm/hmm.txt) or hmm.h
- */
-DEFINE_STATIC_KEY_FALSE(device_private_key);
-EXPORT_SYMBOL(device_private_key);
-#endif /* CONFIG_DEVICE_PRIVATE || CONFIG_DEVICE_PUBLIC */
-
-
 #if IS_ENABLED(CONFIG_HMM_MIRROR)
 static const struct mmu_notifier_ops hmm_mmu_notifier_ops;
 
@@ -996,7 +987,7 @@ struct hmm_devmem *hmm_devmem_add(const struct hmm_devmem_ops *ops,
 	resource_size_t addr;
 	int ret;
 
-	static_branch_enable(&device_private_key);
+	dev_pagemap_get_ops();
 
 	devmem = devres_alloc_node(&hmm_devmem_release, sizeof(*devmem),
 				   GFP_KERNEL, dev_to_node(device));
@@ -1090,7 +1081,7 @@ struct hmm_devmem *hmm_devmem_add_resource(const struct hmm_devmem_ops *ops,
 	if (res->desc != IORES_DESC_DEVICE_PUBLIC_MEMORY)
 		return ERR_PTR(-EINVAL);
 
-	static_branch_enable(&device_private_key);
+	dev_pagemap_get_ops();
 
 	devmem = devres_alloc_node(&hmm_devmem_release, sizeof(*devmem),
 				   GFP_KERNEL, dev_to_node(device));
diff --git a/mm/swap.c b/mm/swap.c
index 0f17330dd0e5..eed846cfc8b8 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -29,6 +29,7 @@
 #include <linux/cpu.h>
 #include <linux/notifier.h>
 #include <linux/backing-dev.h>
+#include <linux/memremap.h>
 #include <linux/memcontrol.h>
 #include <linux/gfp.h>
 #include <linux/uio.h>
@@ -744,7 +745,7 @@ void release_pages(struct page **pages, int nr)
 						       flags);
 				locked_pgdat = NULL;
 			}
-			put_zone_device_private_or_public_page(page);
+			put_devmap_managed_page(page);
 			continue;
 		}
 

_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v5 08/11] wait_bit: introduce {wait_on,wake_up}_atomic_one
  2018-03-10  6:54 [PATCH v5 00/11] dax: fix dma vs truncate/hole-punch Dan Williams
                   ` (6 preceding siblings ...)
  2018-03-10  6:55 ` [PATCH v5 07/11] mm, dev_pagemap: introduce CONFIG_DEV_PAGEMAP_OPS Dan Williams
@ 2018-03-10  6:55 ` Dan Williams
  2018-03-11 11:27   ` [PATCH v5 08/11] wait_bit: introduce {wait_on, wake_up}_atomic_one Peter Zijlstra
  2018-03-10  6:55 ` [PATCH v5 09/11] mm, fs, dax: handle layout changes to pinned dax mappings Dan Williams
                   ` (2 subsequent siblings)
  10 siblings, 1 reply; 32+ messages in thread
From: Dan Williams @ 2018-03-10  6:55 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: jack, Peter Zijlstra, david, linux-kernel, linux-xfs,
	Ingo Molnar, linux-fsdevel, hch

Add a generic facility for awaiting an atomic_t to reach a value of 1.

Page reference counts typically need to reach 0 to be considered a
free / inactive page. However, ZONE_DEVICE pages allocated via
devm_memremap_pages() are never 'onlined', i.e. the put_page() typically
done at init time to assign pages to the page allocator is skipped.

These pages will have their reference count elevated > 1 by
get_user_pages() when they are under DMA. In order to coordinate DMA to
these pages vs filesystem operations like hole-punch and truncate, the
filesystem-dax implementation needs to capture the DMA-idle event, i.e.
the 2-to-1 count transition.

For now, this implementation does not introduce a functional behavior
change; follow-on patches will add waiters for these page-idle events.
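
As a rough sketch of the intended calling convention (the action
callback below is hypothetical; the real waiter arrives with the xfs
patches later in this series):

	static int example_page_idle_wait(atomic_t *count, unsigned int mode)
	{
		/* Called with the waiter already queued; sleep until woken. */
		schedule();
		if (signal_pending_state(mode, current))
			return -EINTR;
		return 0;
	}

	/* Block until the count reads 1, i.e. the DMA-idle condition. */
	ret = wait_on_atomic_one(&page->_refcount, example_page_idle_wait,
			TASK_INTERRUPTIBLE);

The wake side is generic_dax_pagefree() in the hunk below, which calls
wake_up_atomic_one(&page->_refcount) when a dax page goes idle.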

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/dax/super.c      |    2 +-
 include/linux/wait_bit.h |   13 ++++++++++
 kernel/sched/wait_bit.c  |   59 +++++++++++++++++++++++++++++++++++++++-------
 3 files changed, 64 insertions(+), 10 deletions(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 619b1ed6434c..7e10fa3460e2 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -167,7 +167,7 @@ struct dax_device {
 #if IS_ENABLED(CONFIG_FS_DAX)
 static void generic_dax_pagefree(struct page *page, void *data)
 {
-	/* TODO: wakeup page-idle waiters */
+	wake_up_atomic_one(&page->_refcount);
 }
 
 struct dax_device *fs_dax_claim_bdev(struct block_device *bdev, void *owner)
diff --git a/include/linux/wait_bit.h b/include/linux/wait_bit.h
index 61b39eaf7cad..564c9a0141cd 100644
--- a/include/linux/wait_bit.h
+++ b/include/linux/wait_bit.h
@@ -33,10 +33,15 @@ int __wait_on_bit(struct wait_queue_head *wq_head, struct wait_bit_queue_entry *
 int __wait_on_bit_lock(struct wait_queue_head *wq_head, struct wait_bit_queue_entry *wbq_entry, wait_bit_action_f *action, unsigned int mode);
 void wake_up_bit(void *word, int bit);
 void wake_up_atomic_t(atomic_t *p);
+static inline void wake_up_atomic_one(atomic_t *p)
+{
+	wake_up_atomic_t(p);
+}
 int out_of_line_wait_on_bit(void *word, int, wait_bit_action_f *action, unsigned int mode);
 int out_of_line_wait_on_bit_timeout(void *word, int, wait_bit_action_f *action, unsigned int mode, unsigned long timeout);
 int out_of_line_wait_on_bit_lock(void *word, int, wait_bit_action_f *action, unsigned int mode);
 int out_of_line_wait_on_atomic_t(atomic_t *p, wait_atomic_t_action_f action, unsigned int mode);
+int out_of_line_wait_on_atomic_one(atomic_t *p, wait_atomic_t_action_f action, unsigned int mode);
 struct wait_queue_head *bit_waitqueue(void *word, int bit);
 extern void __init wait_bit_init(void);
 
@@ -262,4 +267,12 @@ int wait_on_atomic_t(atomic_t *val, wait_atomic_t_action_f action, unsigned mode
 	return out_of_line_wait_on_atomic_t(val, action, mode);
 }
 
+static inline
+int wait_on_atomic_one(atomic_t *val, wait_atomic_t_action_f action, unsigned mode)
+{
+	might_sleep();
+	if (atomic_read(val) == 1)
+		return 0;
+	return out_of_line_wait_on_atomic_one(val, action, mode);
+}
 #endif /* _LINUX_WAIT_BIT_H */
diff --git a/kernel/sched/wait_bit.c b/kernel/sched/wait_bit.c
index 84cb3acd9260..8739b1e50df5 100644
--- a/kernel/sched/wait_bit.c
+++ b/kernel/sched/wait_bit.c
@@ -162,28 +162,47 @@ static inline wait_queue_head_t *atomic_t_waitqueue(atomic_t *p)
 	return bit_waitqueue(p, 0);
 }
 
-static int wake_atomic_t_function(struct wait_queue_entry *wq_entry, unsigned mode, int sync,
-				  void *arg)
+static struct wait_bit_queue_entry *to_wait_bit_q(
+		struct wait_queue_entry *wq_entry)
+{
+	return container_of(wq_entry, struct wait_bit_queue_entry, wq_entry);
+}
+
+static int __wake_atomic_t_function(struct wait_queue_entry *wq_entry,
+		unsigned mode, int sync, void *arg, int target)
 {
 	struct wait_bit_key *key = arg;
-	struct wait_bit_queue_entry *wait_bit = container_of(wq_entry, struct wait_bit_queue_entry, wq_entry);
+	struct wait_bit_queue_entry *wait_bit = to_wait_bit_q(wq_entry);
 	atomic_t *val = key->flags;
 
 	if (wait_bit->key.flags != key->flags ||
 	    wait_bit->key.bit_nr != key->bit_nr ||
-	    atomic_read(val) != 0)
+	    atomic_read(val) != target)
 		return 0;
 	return autoremove_wake_function(wq_entry, mode, sync, key);
 }
 
+static int wake_atomic_t_function(struct wait_queue_entry *wq_entry,
+		unsigned mode, int sync, void *arg)
+{
+	return __wake_atomic_t_function(wq_entry, mode, sync, arg, 0);
+}
+
+static int wake_atomic_one_function(struct wait_queue_entry *wq_entry,
+		unsigned mode, int sync, void *arg)
+{
+	return __wake_atomic_t_function(wq_entry, mode, sync, arg, 1);
+}
+
 /*
  * To allow interruptible waiting and asynchronous (i.e. nonblocking) waiting,
  * the actions of __wait_on_atomic_t() are permitted return codes.  Nonzero
  * return codes halt waiting and return.
  */
 static __sched
-int __wait_on_atomic_t(struct wait_queue_head *wq_head, struct wait_bit_queue_entry *wbq_entry,
-		       wait_atomic_t_action_f action, unsigned int mode)
+int __wait_on_atomic_t(struct wait_queue_head *wq_head,
+		struct wait_bit_queue_entry *wbq_entry,
+		wait_atomic_t_action_f action, unsigned int mode, int target)
 {
 	atomic_t *val;
 	int ret = 0;
@@ -191,10 +210,10 @@ int __wait_on_atomic_t(struct wait_queue_head *wq_head, struct wait_bit_queue_en
 	do {
 		prepare_to_wait(wq_head, &wbq_entry->wq_entry, mode);
 		val = wbq_entry->key.flags;
-		if (atomic_read(val) == 0)
+		if (atomic_read(val) == target)
 			break;
 		ret = (*action)(val, mode);
-	} while (!ret && atomic_read(val) != 0);
+	} while (!ret && atomic_read(val) != target);
 	finish_wait(wq_head, &wbq_entry->wq_entry);
 	return ret;
 }
@@ -210,6 +229,17 @@ int __wait_on_atomic_t(struct wait_queue_head *wq_head, struct wait_bit_queue_en
 		},							\
 	}
 
+#define DEFINE_WAIT_ATOMIC_ONE(name, p)					\
+	struct wait_bit_queue_entry name = {				\
+		.key = __WAIT_ATOMIC_T_KEY_INITIALIZER(p),		\
+		.wq_entry = {						\
+			.private	= current,			\
+			.func		= wake_atomic_one_function,	\
+			.entry		=				\
+				LIST_HEAD_INIT((name).wq_entry.entry),	\
+		},							\
+	}
+
 __sched int out_of_line_wait_on_atomic_t(atomic_t *p,
 					 wait_atomic_t_action_f action,
 					 unsigned int mode)
@@ -217,7 +247,7 @@ __sched int out_of_line_wait_on_atomic_t(atomic_t *p,
 	struct wait_queue_head *wq_head = atomic_t_waitqueue(p);
 	DEFINE_WAIT_ATOMIC_T(wq_entry, p);
 
-	return __wait_on_atomic_t(wq_head, &wq_entry, action, mode);
+	return __wait_on_atomic_t(wq_head, &wq_entry, action, mode, 0);
 }
 EXPORT_SYMBOL(out_of_line_wait_on_atomic_t);
 
@@ -230,6 +260,17 @@ __sched int atomic_t_wait(atomic_t *counter, unsigned int mode)
 }
 EXPORT_SYMBOL(atomic_t_wait);
 
+__sched int out_of_line_wait_on_atomic_one(atomic_t *p,
+					   wait_atomic_t_action_f action,
+					   unsigned int mode)
+{
+	struct wait_queue_head *wq_head = atomic_t_waitqueue(p);
+	DEFINE_WAIT_ATOMIC_ONE(wq_entry, p);
+
+	return __wait_on_atomic_t(wq_head, &wq_entry, action, mode, 1);
+}
+EXPORT_SYMBOL(out_of_line_wait_on_atomic_one);
+
 /**
  * wake_up_atomic_t - Wake up a waiter on a atomic_t
  * @p: The atomic_t being waited on, a kernel virtual address

_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v5 09/11] mm, fs, dax: handle layout changes to pinned dax mappings
  2018-03-10  6:54 [PATCH v5 00/11] dax: fix dma vs truncate/hole-punch Dan Williams
                   ` (7 preceding siblings ...)
  2018-03-10  6:55 ` [PATCH v5 08/11] wait_bit: introduce {wait_on,wake_up}_atomic_one Dan Williams
@ 2018-03-10  6:55 ` Dan Williams
  2018-03-10  6:55 ` [PATCH v5 10/11] xfs: prepare xfs_break_layouts() for another layout type Dan Williams
  2018-03-10  6:55 ` [PATCH v5 11/11] xfs, dax: introduce xfs_break_dax_layouts() Dan Williams
  10 siblings, 0 replies; 32+ messages in thread
From: Dan Williams @ 2018-03-10  6:55 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Dave Hansen, jack, Darrick J. Wong, Matthew Wilcox, Dave Chinner,
	linux-kernel, hch, linux-xfs, Alexander Viro, linux-fsdevel,
	Andrew Morton

Background:

get_user_pages() in the filesystem pins file backed memory pages for
access by devices performing dma. However, it only pins the memory pages
not the page-to-file offset association. If a file is truncated the
pages are mapped out of the file and dma may continue indefinitely into
a page that is owned by a device driver. This breaks coherency of the
file vs dma, but the assumption is that if userspace wants the
file-space truncated it does not matter what data is inbound from the
device, it is not relevant anymore. The only expectation is that dma can
safely continue while the filesystem reallocates the block(s).

Problem:

This expectation that dma can safely continue while the filesystem
changes the block map is broken by dax. With dax the target dma page
*is* the filesystem block. The model of leaving the page pinned for dma,
but truncating the file block out of the file, means that the filesystem
is free to reallocate a block under active dma to another file and now
the expected data-incoherency situation has turned into active
data-corruption.

Solution:

Defer all filesystem operations (fallocate(), truncate()) on a dax mode
file while any page/block in the file is under active dma. This solution
assumes that dma is transient. Cases where dma operations are known to
not be transient, like RDMA, have been explicitly disabled via
commits like 5f1d43de5416 "IB/core: disable memory registration of
filesystem-dax vmas".

The dax_layout_busy_page() routine is called by filesystems with a lock
held against mm faults (i_mmap_lock) to find pinned / busy dax pages.
The process of looking up a busy page invalidates all mappings
to trigger any subsequent get_user_pages() to block on i_mmap_lock.
The filesystem continues to call dax_layout_busy_page() until it finally
returns no more active pages. This approach assumes that the page
pinning is transient; if that assumption is violated, the system would
likely have hung from the uncompleted I/O.
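
A minimal sketch of that retry loop as a filesystem would run it
(fs_wait_for_page_idle() is a hypothetical stand-in; xfs grows a real
waiter later in this series):

	struct page *page;
	int error;

	/* Caller holds locks that block establishment of new mappings. */
	while ((page = dax_layout_busy_page(inode->i_mapping)) != NULL) {
		error = fs_wait_for_page_idle(page);	/* hypothetical */
		if (error)
			return error;
	}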

Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Reported-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/dax.c            |   93 +++++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/dax.h |   30 ++++++++++++++++
 mm/gup.c            |    5 +++
 3 files changed, 128 insertions(+)

diff --git a/fs/dax.c b/fs/dax.c
index fecf463a1468..cfaaf31fae85 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -375,6 +375,19 @@ static void dax_disassociate_entry(void *entry, struct address_space *mapping,
 	}
 }
 
+static struct page *dax_busy_page(void *entry)
+{
+	unsigned long pfn, end_pfn;
+
+	for_each_entry_pfn(entry, pfn, end_pfn) {
+		struct page *page = pfn_to_page(pfn);
+
+		if (page_ref_count(page) > 1)
+			return page;
+	}
+	return NULL;
+}
+
 /*
  * Find radix tree entry at given index. If it points to an exceptional entry,
  * return it with the radix tree entry locked. If the radix tree doesn't
@@ -516,6 +529,85 @@ static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index,
 	return entry;
 }
 
+/**
+ * dax_layout_busy_page - find first pinned page in @mapping
+ * @mapping: address space to scan for a page with ref count > 1
+ *
+ * DAX requires ZONE_DEVICE mapped pages. These pages are never
+ * 'onlined' to the page allocator so they are considered idle when
+ * page->count == 1. A filesystem uses this interface to determine if
+ * any page in the mapping is busy, i.e. for DMA, or other
+ * get_user_pages() usages.
+ *
+ * It is expected that the filesystem is holding locks to block the
+ * establishment of new mappings in this address_space. I.e. it expects
+ * to be able to run unmap_mapping_range() and subsequently not race
+ * mapping_mapped() becoming true. It expects that get_user_pages() pte
+ * walks are performed under rcu_read_lock().
+ */
+struct page *dax_layout_busy_page(struct address_space *mapping)
+{
+	pgoff_t	indices[PAGEVEC_SIZE];
+	struct page *page = NULL;
+	struct pagevec pvec;
+	pgoff_t	index, end;
+	unsigned i;
+
+	/*
+	 * In the 'limited' case get_user_pages() for dax is disabled.
+	 */
+	if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
+		return NULL;
+
+	if (!dax_mapping(mapping) || !mapping_mapped(mapping))
+		return NULL;
+
+	pagevec_init(&pvec);
+	index = 0;
+	end = -1;
+	/*
+	 * Flush dax_layout_lock() sections to ensure all possible page
+	 * references have been taken, or otherwise arrange for faults
+	 * to block on the filesystem lock that is taken for
+	 * establishing new mappings.
+	 */
+	unmap_mapping_range(mapping, 0, 0, 1);
+	synchronize_rcu();
+
+	while (index < end && pagevec_lookup_entries(&pvec, mapping, index,
+				min(end - index, (pgoff_t)PAGEVEC_SIZE),
+				indices)) {
+		for (i = 0; i < pagevec_count(&pvec); i++) {
+			struct page *pvec_ent = pvec.pages[i];
+			void *entry;
+
+			index = indices[i];
+			if (index >= end)
+				break;
+
+			if (!radix_tree_exceptional_entry(pvec_ent))
+				continue;
+
+			spin_lock_irq(&mapping->tree_lock);
+			entry = get_unlocked_mapping_entry(mapping, index, NULL);
+			if (entry)
+				page = dax_busy_page(entry);
+			put_unlocked_mapping_entry(mapping, index, entry);
+			spin_unlock_irq(&mapping->tree_lock);
+			if (page)
+				break;
+		}
+		pagevec_remove_exceptionals(&pvec);
+		pagevec_release(&pvec);
+		index++;
+
+		if (page)
+			break;
+	}
+	return page;
+}
+EXPORT_SYMBOL_GPL(dax_layout_busy_page);
+
 static int __dax_invalidate_mapping_entry(struct address_space *mapping,
 					  pgoff_t index, bool trunc)
 {
@@ -540,6 +632,7 @@ static int __dax_invalidate_mapping_entry(struct address_space *mapping,
 	spin_unlock_irq(&mapping->tree_lock);
 	return ret;
 }
+
 /*
  * Delete exceptional DAX entry at @index from @mapping. Wait for radix tree
  * entry to get unlocked before deleting it.
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 9b4259aee016..62671a636512 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -56,6 +56,18 @@ void fs_dax_release(struct dax_device *dax_dev, void *owner);
 int dax_set_page_dirty(struct page *page);
 void dax_invalidatepage(struct page *page, unsigned int offset,
 		unsigned int length);
+
+static inline void dax_layout_lock(void)
+{
+	rcu_read_lock();
+}
+
+static inline void dax_layout_unlock(void)
+{
+	rcu_read_unlock();
+}
+
+struct page *dax_layout_busy_page(struct address_space *mapping);
 #else
 static inline int bdev_dax_supported(struct super_block *sb, int blocksize)
 {
@@ -79,6 +91,19 @@ static inline void fs_dax_release(struct dax_device *dax_dev, void *owner)
 
 #define dax_set_page_dirty NULL
 #define dax_invalidatepage NULL
+
+static inline void dax_layout_lock(void)
+{
+}
+
+static inline void dax_layout_unlock(void)
+{
+}
+
+static inline struct page *dax_layout_busy_page(struct address_space *mapping)
+{
+	return NULL;
+}
 #endif
 
 int dax_read_lock(void);
@@ -108,6 +133,11 @@ int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index);
 int dax_invalidate_mapping_entry_sync(struct address_space *mapping,
 				      pgoff_t index);
 
+static inline struct page *refcount_to_page(atomic_t *c)
+{
+	return container_of(c, struct page, _refcount);
+}
+
 #ifdef CONFIG_FS_DAX
 int __dax_zero_page_range(struct block_device *bdev,
 		struct dax_device *dax_dev, sector_t sector,
diff --git a/mm/gup.c b/mm/gup.c
index 1b46e6e74881..a81efac6983a 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -13,6 +13,7 @@
 #include <linux/sched/signal.h>
 #include <linux/rwsem.h>
 #include <linux/hugetlb.h>
+#include <linux/dax.h>
 
 #include <asm/mmu_context.h>
 #include <asm/pgtable.h>
@@ -693,7 +694,9 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 		if (unlikely(fatal_signal_pending(current)))
 			return i ? i : -ERESTARTSYS;
 		cond_resched();
+		dax_layout_lock();
 		page = follow_page_mask(vma, start, foll_flags, &page_mask);
+		dax_layout_unlock();
 		if (!page) {
 			int ret;
 			ret = faultin_page(tsk, vma, start, &foll_flags,
@@ -1809,7 +1812,9 @@ int get_user_pages_fast(unsigned long start, int nr_pages, int write,
 
 	if (gup_fast_permitted(start, nr_pages, write)) {
 		local_irq_disable();
+		dax_layout_lock();
 		gup_pgd_range(addr, end, write, pages, &nr);
+		dax_layout_unlock();
 		local_irq_enable();
 		ret = nr;
 	}

_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v5 10/11] xfs: prepare xfs_break_layouts() for another layout type
  2018-03-10  6:54 [PATCH v5 00/11] dax: fix dma vs truncate/hole-punch Dan Williams
                   ` (8 preceding siblings ...)
  2018-03-10  6:55 ` [PATCH v5 09/11] mm, fs, dax: handle layout changes to pinned dax mappings Dan Williams
@ 2018-03-10  6:55 ` Dan Williams
  2018-03-10  9:51   ` Christoph Hellwig
  2018-03-10  6:55 ` [PATCH v5 11/11] xfs, dax: introduce xfs_break_dax_layouts() Dan Williams
  10 siblings, 1 reply; 32+ messages in thread
From: Dan Williams @ 2018-03-10  6:55 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: jack, Darrick J. Wong, Dave Chinner, linux-kernel, linux-xfs,
	linux-fsdevel, hch

When xfs is operating as the back-end of a pNFS block server, it prevents
collisions between local and remote operations by requiring a lease to
be held for remotely accessed blocks. Local filesystem operations break
those leases before writing or mutating the extent map of the file.

A similar mechanism is needed to prevent operations on pinned dax
mappings, like device-DMA, from colliding with extent unmap operations.

XFS_BREAK_REMOTE and XFS_BREAK_MAPS are introduced as flags to control
the layouts that need to be broken by xfs_break_layouts(). While
XFS_BREAK_REMOTE is invoked in all calls to the new xfs_break_layouts(),
XFS_BREAK_MAPS only needs to specified when extents may be unmapped,
i.e. xfs_file_fallocate() and xfs_ioc_space(). XFS_BREAK_MAPS also
imposes the additional locking constraint of breaking (awaiting) pinned
dax mappings while holding XFS_MMAPLOCK_EXCL.

There is a small functional change in this rework. For the cases where
XFS_BREAK_MAPS is specified to xfs_break_layouts(), the
XFS_MMAPLOCK_EXCL ilock is held over the break_layouts() loop in
xfs_break_leased_layouts().
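
For example, an extent-unmapping path takes both locks up front and
passes both flags (a condensed sketch of the fallocate hunk below):

	uint	iolock = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL;

	xfs_ilock(ip, iolock);
	error = xfs_break_layouts(inode, &iolock,
			XFS_BREAK_REMOTE | XFS_BREAK_MAPS);
	if (error)
		goto out_unlock;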

Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Reported-by: Dave Chinner <david@fromorbit.com>
Reported-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/xfs/xfs_file.c  |   32 ++++++++++++++++++++++++++------
 fs/xfs/xfs_inode.h |    9 +++++++++
 fs/xfs/xfs_ioctl.c |    9 +++------
 fs/xfs/xfs_iops.c  |   12 +++++++-----
 fs/xfs/xfs_pnfs.c  |    8 +++-----
 fs/xfs/xfs_pnfs.h  |    4 ++--
 6 files changed, 50 insertions(+), 24 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 9ea08326f876..f914f0628dc2 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -350,7 +350,7 @@ xfs_file_aio_write_checks(
 	if (error <= 0)
 		return error;
 
-	error = xfs_break_layouts(inode, iolock);
+	error = xfs_break_layouts(inode, iolock, XFS_BREAK_REMOTE);
 	if (error)
 		return error;
 
@@ -752,6 +752,28 @@ xfs_file_write_iter(
 	return ret;
 }
 
+int
+xfs_break_layouts(
+	struct inode		*inode,
+	uint			*iolock,
+	unsigned long		flags)
+{
+	struct xfs_inode	*ip = XFS_I(inode);
+	uint			iolock_assert = 0;
+	int			ret = 0;
+
+	if (flags & XFS_BREAK_REMOTE)
+		iolock_assert |= XFS_IOLOCK_SHARED|XFS_IOLOCK_EXCL;
+	if (flags & XFS_BREAK_MAPS)
+		iolock_assert |= XFS_MMAPLOCK_EXCL;
+
+	ASSERT(xfs_isilocked(ip, iolock_assert));
+
+	if (flags & XFS_BREAK_REMOTE)
+		ret = xfs_break_leased_layouts(inode, iolock);
+	return ret;
+}
+
 #define	XFS_FALLOC_FL_SUPPORTED						\
 		(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE |		\
 		 FALLOC_FL_COLLAPSE_RANGE | FALLOC_FL_ZERO_RANGE |	\
@@ -768,7 +790,7 @@ xfs_file_fallocate(
 	struct xfs_inode	*ip = XFS_I(inode);
 	long			error;
 	enum xfs_prealloc_flags	flags = 0;
-	uint			iolock = XFS_IOLOCK_EXCL;
+	uint			iolock = XFS_IOLOCK_EXCL|XFS_MMAPLOCK_EXCL;
 	loff_t			new_size = 0;
 	bool			do_file_insert = false;
 
@@ -778,13 +800,11 @@ xfs_file_fallocate(
 		return -EOPNOTSUPP;
 
 	xfs_ilock(ip, iolock);
-	error = xfs_break_layouts(inode, &iolock);
+	error = xfs_break_layouts(inode, &iolock,
+			XFS_BREAK_REMOTE | XFS_BREAK_MAPS);
 	if (error)
 		goto out_unlock;
 
-	xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
-	iolock |= XFS_MMAPLOCK_EXCL;
-
 	if (mode & FALLOC_FL_PUNCH_HOLE) {
 		error = xfs_free_file_space(ip, offset, len);
 		if (error)
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 3e8dc990d41c..9b73ceb09cb1 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -379,6 +379,12 @@ static inline void xfs_ifunlock(struct xfs_inode *ip)
 					>> XFS_ILOCK_SHIFT)
 
 /*
+ * Flags for layout breaks
+ */
+#define XFS_BREAK_REMOTE (1<<0) /* break remote layout leases */
+#define XFS_BREAK_MAPS   (1<<1) /* break local direct (dax) mappings */
+
+/*
  * For multiple groups support: if S_ISGID bit is set in the parent
  * directory, group of new file is set to that of the parent, and
  * new subdirectory gets S_ISGID bit from parent.
@@ -447,6 +453,9 @@ int	xfs_zero_eof(struct xfs_inode *ip, xfs_off_t offset,
 		     xfs_fsize_t isize, bool *did_zeroing);
 int	xfs_zero_range(struct xfs_inode *ip, xfs_off_t pos, xfs_off_t count,
 		bool *did_zero);
+int	xfs_break_layouts(struct inode *inode, uint *iolock,
+		unsigned long flags);
+
 
 /* from xfs_iops.c */
 extern void xfs_setup_inode(struct xfs_inode *ip);
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 89fb1eb80aae..31288d900c8a 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -39,7 +39,6 @@
 #include "xfs_icache.h"
 #include "xfs_symlink.h"
 #include "xfs_trans.h"
-#include "xfs_pnfs.h"
 #include "xfs_acl.h"
 #include "xfs_btree.h"
 #include <linux/fsmap.h>
@@ -614,7 +613,7 @@ xfs_ioc_space(
 	struct xfs_inode	*ip = XFS_I(inode);
 	struct iattr		iattr;
 	enum xfs_prealloc_flags	flags = 0;
-	uint			iolock = XFS_IOLOCK_EXCL;
+	uint			iolock = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL;
 	int			error;
 
 	/*
@@ -644,13 +643,11 @@ xfs_ioc_space(
 		return error;
 
 	xfs_ilock(ip, iolock);
-	error = xfs_break_layouts(inode, &iolock);
+	error = xfs_break_layouts(inode, &iolock,
+			XFS_BREAK_REMOTE | XFS_BREAK_MAPS);
 	if (error)
 		goto out_unlock;
 
-	xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
-	iolock |= XFS_MMAPLOCK_EXCL;
-
 	switch (bf->l_whence) {
 	case 0: /*SEEK_SET*/
 		break;
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 951e84df5576..15a1c7c874d9 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -37,7 +37,6 @@
 #include "xfs_da_btree.h"
 #include "xfs_dir2.h"
 #include "xfs_trans_space.h"
-#include "xfs_pnfs.h"
 #include "xfs_iomap.h"
 
 #include <linux/capability.h>
@@ -1030,11 +1029,14 @@ xfs_vn_setattr(
 		struct xfs_inode	*ip = XFS_I(d_inode(dentry));
 		uint			iolock = XFS_IOLOCK_EXCL;
 
-		error = xfs_break_layouts(d_inode(dentry), &iolock);
-		if (error)
-			return error;
-
 		xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
+		error = xfs_break_layouts(d_inode(dentry), &iolock,
+				XFS_BREAK_REMOTE | XFS_BREAK_MAPS);
+
+		if (error) {
+			xfs_iunlock(ip, XFS_MMAPLOCK_EXCL);
+			return error;
+		}
 		error = xfs_vn_setattr_size(dentry, iattr);
 		xfs_iunlock(ip, XFS_MMAPLOCK_EXCL);
 	} else {
diff --git a/fs/xfs/xfs_pnfs.c b/fs/xfs/xfs_pnfs.c
index aa6c5c193f45..ed13c9baffff 100644
--- a/fs/xfs/xfs_pnfs.c
+++ b/fs/xfs/xfs_pnfs.c
@@ -31,15 +31,13 @@
  * rules in the page fault path we don't bother.
  */
 int
-xfs_break_layouts(
+xfs_break_leased_layouts(
 	struct inode		*inode,
 	uint			*iolock)
 {
 	struct xfs_inode	*ip = XFS_I(inode);
 	int			error;
 
-	ASSERT(xfs_isilocked(ip, XFS_IOLOCK_SHARED|XFS_IOLOCK_EXCL));
-
 	while ((error = break_layout(inode, false) == -EWOULDBLOCK)) {
 		xfs_iunlock(ip, *iolock);
 		error = break_layout(inode, true);
@@ -120,8 +118,8 @@ xfs_fs_map_blocks(
 	 * Lock out any other I/O before we flush and invalidate the pagecache,
 	 * and then hand out a layout to the remote system.  This is very
 	 * similar to direct I/O, except that the synchronization is much more
-	 * complicated.  See the comment near xfs_break_layouts for a detailed
-	 * explanation.
+	 * complicated.  See the comment near xfs_break_leased_layouts
+	 * for a detailed explanation.
 	 */
 	xfs_ilock(ip, XFS_IOLOCK_EXCL);
 
diff --git a/fs/xfs/xfs_pnfs.h b/fs/xfs/xfs_pnfs.h
index bf45951e28fe..12f46fe6d902 100644
--- a/fs/xfs/xfs_pnfs.h
+++ b/fs/xfs/xfs_pnfs.h
@@ -9,10 +9,10 @@ int xfs_fs_map_blocks(struct inode *inode, loff_t offset, u64 length,
 int xfs_fs_commit_blocks(struct inode *inode, struct iomap *maps, int nr_maps,
 		struct iattr *iattr);
 
-int xfs_break_layouts(struct inode *inode, uint *iolock);
+int xfs_break_leased_layouts(struct inode *inode, uint *iolock);
 #else
 static inline int
-xfs_break_layouts(struct inode *inode, uint *iolock)
+xfs_break_leased_layouts(struct inode *inode, uint *iolock)
 {
 	return 0;
 }

_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v5 11/11] xfs, dax: introduce xfs_break_dax_layouts()
  2018-03-10  6:54 [PATCH v5 00/11] dax: fix dma vs truncate/hole-punch Dan Williams
                   ` (9 preceding siblings ...)
  2018-03-10  6:55 ` [PATCH v5 10/11] xfs: prepare xfs_break_layouts() for another layout type Dan Williams
@ 2018-03-10  6:55 ` Dan Williams
  2018-03-10  9:55   ` Christoph Hellwig
  10 siblings, 1 reply; 32+ messages in thread
From: Dan Williams @ 2018-03-10  6:55 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: jack, Darrick J. Wong, Dave Chinner, linux-kernel, linux-xfs,
	linux-fsdevel, hch

xfs_break_dax_layouts(), similar to xfs_break_leased_layouts(), scans
for busy / pinned dax pages and waits for those pages to go idle before
any potential extent unmap operation.

dax_layout_busy_page() handles synchronizing against new page-busy
events (get_user_pages). It invalidates all mappings to trigger the
get_user_pages slow path, which will eventually block on the xfs inode
lock held in XFS_MMAPLOCK_EXCL mode. If dax_layout_busy_page() finds a
busy page it returns it for xfs to wait for the page-idle event that
will fire when the page reference count reaches 1 (recall ZONE_DEVICE
pages are idle at count 1). While waiting, the XFS_MMAPLOCK_EXCL lock is
dropped in order to not deadlock the process that might be trying to
elevate the page count of more pages before arranging for any of them to
go idle. I.e. the typical case of submitting I/O is that
iov_iter_get_pages() elevates the reference count of all pages in the
I/O before starting I/O on the first page.
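
To make the deadlock being avoided concrete, here is an illustrative
interleaving (a sketch, not code from this patch):

	/*
	 * I/O submitter                     truncate
	 * -------------                     --------
	 * iov_iter_get_pages()
	 *   pins page A (count 1 -> 2)
	 *                                   xfs_ilock(XFS_MMAPLOCK_EXCL)
	 *                                   dax_layout_busy_page() -> A
	 * faults while pinning page B,
	 * blocks on XFS_MMAPLOCK_EXCL <---  waits for page A to go idle
	 *
	 * Dropping the locks around schedule() lets the submitter finish
	 * pinning its pages and complete the I/O, which eventually idles
	 * page A and wakes the waiter.
	 */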

Cc: Jan Kara <jack@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/xfs/xfs_file.c |   68 +++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 65 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index f914f0628dc2..3e7a69cebf95 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -752,6 +752,55 @@ xfs_file_write_iter(
 	return ret;
 }
 
+static int xfs_wait_dax_page(
+		atomic_t                *count,
+		unsigned int            mode)
+{
+	uint                    iolock = XFS_IOLOCK_EXCL|XFS_MMAPLOCK_EXCL;
+	struct page             *page = refcount_to_page(count);
+	struct address_space    *mapping = page->mapping;
+	struct inode            *inode = mapping->host;
+	struct xfs_inode        *ip = XFS_I(inode);
+
+	ASSERT(xfs_isilocked(ip, XFS_IOLOCK_EXCL|XFS_MMAPLOCK_EXCL));
+
+	if (page_ref_count(page) == 1)
+		return 0;
+
+	xfs_iunlock(ip, iolock);
+	schedule();
+	xfs_ilock(ip, iolock);
+
+	if (signal_pending_state(mode, current))
+		return -EINTR;
+	return 1;
+}
+
+static int
+xfs_break_dax_layouts(
+	struct inode		*inode,
+	uint			iolock)
+{
+	struct page		*page;
+	int			ret;
+
+	page = dax_layout_busy_page(inode->i_mapping);
+	if (!page)
+		return 0;
+
+	ret = wait_on_atomic_one(&page->_refcount, xfs_wait_dax_page,
+			TASK_INTERRUPTIBLE);
+
+	if (ret <= 0)
+		return ret;
+
+	/*
+	 * We slept, so need to retry. Yes, this assumes transient page
+	 * pins.
+	 */
+	return -EBUSY;
+}
+
 int
 xfs_break_layouts(
 	struct inode		*inode,
@@ -765,12 +814,25 @@ xfs_break_layouts(
 	if (flags & XFS_BREAK_REMOTE)
 		iolock_assert |= XFS_IOLOCK_SHARED|XFS_IOLOCK_EXCL;
 	if (flags & XFS_BREAK_MAPS)
-		iolock_assert |= XFS_MMAPLOCK_EXCL;
+		iolock_assert |= XFS_IOLOCK_EXCL|XFS_MMAPLOCK_EXCL;
 
 	ASSERT(xfs_isilocked(ip, iolock_assert));
 
-	if (flags & XFS_BREAK_REMOTE)
-		ret = xfs_break_leased_layouts(inode, iolock);
+	do {
+		if (flags & XFS_BREAK_REMOTE)
+			ret = xfs_break_leased_layouts(inode, iolock);
+		if (ret)
+			return ret;
+		if (flags & XFS_BREAK_MAPS)
+			ret = xfs_break_dax_layouts(inode, *iolock);
+		/*
+		 * EBUSY indicates that we dropped locks and waited for
+		 * the dax layout to be released. When that happens we
+		 * need to revalidate that no new leases or pinned dax
+		 * mappings have been established.
+		 */
+	} while (ret == -EBUSY);
+
 	return ret;
 }
 

_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH v5 02/11] xfs, dax: introduce xfs_dax_aops
  2018-03-10  6:54 ` [PATCH v5 02/11] xfs, dax: introduce xfs_dax_aops Dan Williams
@ 2018-03-10  9:46   ` Christoph Hellwig
  2018-03-10 17:40     ` Dan Williams
  0 siblings, 1 reply; 32+ messages in thread
From: Christoph Hellwig @ 2018-03-10  9:46 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jan Kara, Matthew Wilcox, linux-nvdimm, Dave Chinner,
	linux-kernel, linux-xfs, linux-fsdevel, Christoph Hellwig

> +int dax_set_page_dirty(struct page *page)
> +{
> +	/*
> +	 * Unlike __set_page_dirty_no_writeback that handles dirty page
> +	 * tracking in the page object, dax does all dirty tracking in
> +	 * the inode address_space in response to mkwrite faults. In the
> +	 * dax case we only need to worry about potentially dirty CPU
> +	 * caches, not dirty page cache pages to write back.
> +	 *
> +	 * This callback is defined to prevent fallback to
> +	 * __set_page_dirty_buffers() in set_page_dirty().
> +	 */
> +	return 0;
> +}

Make this a generic noop_set_page_dirty maybe?

> +EXPORT_SYMBOL(dax_set_page_dirty);
> +
> +void dax_invalidatepage(struct page *page, unsigned int offset,
> +		unsigned int length)
> +{
> +	/*
> +	 * There is no page cache to invalidate in the dax case, however
> +	 * we need this callback defined to prevent falling back to
> +	 * block_invalidatepage() in do_invalidatepage().
> +	 */
> +}

Same here.

> +EXPORT_SYMBOL(dax_invalidatepage);

And EXPORT_SYMBOL_GPL for anything dax-related, please.

> +const struct address_space_operations xfs_dax_aops = {
> +	.writepages		= xfs_vm_writepages,

Please split out the DAX case from xfs_vm_writepages.

This patch should probably also split into VFS and XFS parts.
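
Something like this minimal sketch, lifted from the dax bodies above
(the noop_ names are a suggestion, not existing helpers):

	int noop_set_page_dirty(struct page *page)
	{
		/* Dirty tracking, if any, happens elsewhere; nothing to do. */
		return 0;
	}

	void noop_invalidatepage(struct page *page, unsigned int offset,
			unsigned int length)
	{
		/* No page cache to invalidate. */
	}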
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v5 10/11] xfs: prepare xfs_break_layouts() for another layout type
  2018-03-10  6:55 ` [PATCH v5 10/11] xfs: prepare xfs_break_layouts() for another layout type Dan Williams
@ 2018-03-10  9:51   ` Christoph Hellwig
  0 siblings, 0 replies; 32+ messages in thread
From: Christoph Hellwig @ 2018-03-10  9:51 UTC (permalink / raw)
  To: Dan Williams
  Cc: jack, Darrick J. Wong, linux-nvdimm, Dave Chinner, linux-kernel,
	linux-xfs, linux-fsdevel, Christoph Hellwig

>  
> +int
> +xfs_break_layouts(
> +	struct inode		*inode,
> +	uint			*iolock,
> +	unsigned long		flags)
> +{
> +	struct xfs_inode	*ip = XFS_I(inode);
> +	uint			iolock_assert = 0;
> +	int			ret = 0;
> +
> +	if (flags & XFS_BREAK_REMOTE)
> +		iolock_assert |= XFS_IOLOCK_SHARED|XFS_IOLOCK_EXCL;
> +	if (flags & XFS_BREAK_MAPS)
> +		iolock_assert |= XFS_MMAPLOCK_EXCL;
> +
> +	ASSERT(xfs_isilocked(ip, iolock_assert));
> +
> +	if (flags & XFS_BREAK_REMOTE)
> +		ret = xfs_break_leased_layouts(inode, iolock);
> +	return ret;

This just looks weird as hell.  We already pass in what to drop/reacquire
in the iolock argument.  I don't think we need another argument controlled
by the same callers to assert it.

> @@ -768,7 +790,7 @@ xfs_file_fallocate(
>  	struct xfs_inode	*ip = XFS_I(inode);
>  	long			error;
>  	enum xfs_prealloc_flags	flags = 0;
> -	uint			iolock = XFS_IOLOCK_EXCL;
> +	uint			iolock = XFS_IOLOCK_EXCL|XFS_MMAPLOCK_EXCL;

This is a behavior change that should not be in a patch titled
"prepare xfs_break_layouts() for another layout type" but in one
explicitly changing this and documenting why.

In summary:  I think this should be replaced with a patch that
allows xfs_break_layouts to be called with the mmap lock held, and
change the callers that want the mmap lock to pass it with a good
explanation, and we should get rid of the XFS_BREAK_* flags here.
(need to check the next patch if there is any other good reason for
them to be added later, though).
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v5 11/11] xfs, dax: introduce xfs_break_dax_layouts()
  2018-03-10  6:55 ` [PATCH v5 11/11] xfs, dax: introduce xfs_break_dax_layouts() Dan Williams
@ 2018-03-10  9:55   ` Christoph Hellwig
  0 siblings, 0 replies; 32+ messages in thread
From: Christoph Hellwig @ 2018-03-10  9:55 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jan Kara, Darrick J. Wong, linux-nvdimm, Dave Chinner,
	linux-kernel, linux-xfs, linux-fsdevel, Christoph Hellwig

> +static int xfs_wait_dax_page(
> +		atomic_t                *count,
> +		unsigned int            mode)
> +{

Normal XFS style would be:

static int
xfs_wait_dax_page(
	atomic_t		*count,
	unsigned int		mode)
{

> +	struct page             *page = refcount_to_page(count);
> +	struct address_space    *mapping = page->mapping;
> +	struct inode            *inode = mapping->host;
> +	struct xfs_inode        *ip = XFS_I(inode);

Looks like we don't really need the mapping and inode variables:

	struct page		*page = refcount_to_page(count);
	struct xfs_inode	*ip = XFS_I(page->mapping->host);

> +	do {
> +		if (flags & XFS_BREAK_REMOTE)
> +			ret = xfs_break_leased_layouts(inode, iolock);
> +		if (ret)
> +			return ret;
> +		if (flags & XFS_BREAK_MAPS)
> +			ret = xfs_break_dax_layouts(inode, *iolock);
> +		/*
> +		 * EBUSY indicates that we dropped locks and waited for
> +		 * the dax layout to be released. When that happens we
> +		 * need to revalidate that no new leases or pinned dax
> +		 * mappings have been established.
> +		 */
> +	} while (ret == -EBUSY);

Maybe instead of the flags argument this should be a type argument
of something like

enum layout_break_reason {
	BREAK_WRITE,		/* write to file */
	BREAK_TRUNCATE,		/* truncate or hole punch */
};

as that makes the intent more clear?
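
Callers would then read something like (sketch):

	error = xfs_break_layouts(inode, &iolock, BREAK_TRUNCATE);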
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v5 02/11] xfs, dax: introduce xfs_dax_aops
  2018-03-10  9:46   ` Christoph Hellwig
@ 2018-03-10 17:40     ` Dan Williams
  2018-03-11 19:16       ` Dan Williams
  0 siblings, 1 reply; 32+ messages in thread
From: Dan Williams @ 2018-03-10 17:40 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, Matthew Wilcox, linux-nvdimm, Dave Chinner,
	Linux Kernel Mailing List, linux-xfs, linux-fsdevel

On Sat, Mar 10, 2018 at 1:46 AM, Christoph Hellwig <hch@lst.de> wrote:
>> +int dax_set_page_dirty(struct page *page)
>> +{
>> +     /*
>> +      * Unlike __set_page_dirty_no_writeback that handles dirty page
>> +      * tracking in the page object, dax does all dirty tracking in
>> +      * the inode address_space in response to mkwrite faults. In the
>> +      * dax case we only need to worry about potentially dirty CPU
>> +      * caches, not dirty page cache pages to write back.
>> +      *
>> +      * This callback is defined to prevent fallback to
>> +      * __set_page_dirty_buffers() in set_page_dirty().
>> +      */
>> +     return 0;
>> +}
>
> Make this a generic noop_set_page_dirty maybe?
>
>> +EXPORT_SYMBOL(dax_set_page_dirty);
>> +
>> +void dax_invalidatepage(struct page *page, unsigned int offset,
>> +             unsigned int length)
>> +{
>> +     /*
>> +      * There is no page cache to invalidate in the dax case, however
>> +      * we need this callback defined to prevent falling back to
>> +      * block_invalidatepage() in do_invalidatepage().
>> +      */
>> +}
>
> Same here.

I guess I'm not sure what you mean. These nops are specific to dax; I
don't think they make sense in another context besides dax.

>
>> +EXPORT_SYMBOL(dax_invalidatepage);
>
> And EXPORT_SYMBOL_GPL for anything dax-related, please.
>
>> +const struct address_space_operations xfs_dax_aops = {
>> +     .writepages             = xfs_vm_writepages,
>
> Please split out the DAX case from xfs_vm_writepages.

Will do.

> This patch should probably also split into VFS and XFS parts.

Ok.
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v5 08/11] wait_bit: introduce {wait_on, wake_up}_atomic_one
  2018-03-10  6:55 ` [PATCH v5 08/11] wait_bit: introduce {wait_on,wake_up}_atomic_one Dan Williams
@ 2018-03-11 11:27   ` Peter Zijlstra
  2018-03-11 17:15     ` Dan Williams
  0 siblings, 1 reply; 32+ messages in thread
From: Peter Zijlstra @ 2018-03-11 11:27 UTC (permalink / raw)
  To: Dan Williams
  Cc: jack, linux-nvdimm, david, linux-kernel, linux-xfs, Ingo Molnar,
	linux-fsdevel, Christoph Hellwig

On Fri, Mar 09, 2018 at 10:55:32PM -0800, Dan Williams wrote:
> Add a generic facility for awaiting an atomic_t to reach a value of 1.
> 
> Page reference counts typically need to reach 0 to be considered a
> free / inactive page. However, ZONE_DEVICE pages allocated via
> devm_memremap_pages() are never 'onlined', i.e. the put_page() typically
> done at init time to assign pages to the page allocator is skipped.
> 
> These pages will have their reference count elevated > 1 by
> get_user_pages() when they are under DMA. In order to coordinate DMA to
> these pages vs filesystem operations like hole-punch and truncate, the
> filesystem-dax implementation needs to capture the DMA-idle event, i.e.
> the 2-to-1 count transition.
> 
> For now, this implementation does not introduce a functional behavior
> change; follow-on patches will add waiters for these page-idle events.

Argh, no no no.. That whole wait_for_atomic_t thing is a giant
trainwreck already and now you're making it worse still.

Please have a look here:

  https://lkml.kernel.org/r/20171101190644.chwhfpoz3ywxx2m7@hirez.programming.kicks-ass.net
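
The shape proposed there waits on an arbitrary variable address instead
of special-casing atomic_t; roughly (a sketch using the names from that
proposal, not from this series):

	/* waiter */
	wait_var_event(&page->_refcount,
			atomic_read(&page->_refcount) == 1);

	/* waker, e.g. from the ->page_free() callback */
	wake_up_var(&page->_refcount);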



_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v5 08/11] wait_bit: introduce {wait_on, wake_up}_atomic_one
  2018-03-11 11:27   ` [PATCH v5 08/11] wait_bit: introduce {wait_on, wake_up}_atomic_one Peter Zijlstra
@ 2018-03-11 17:15     ` Dan Williams
  2018-03-12 19:32       ` Dan Williams
                         ` (2 more replies)
  0 siblings, 3 replies; 32+ messages in thread
From: Dan Williams @ 2018-03-11 17:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jan Kara, linux-nvdimm, david, Linux Kernel Mailing List,
	linux-xfs, Ingo Molnar, linux-fsdevel, Christoph Hellwig

On Sun, Mar 11, 2018 at 4:27 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, Mar 09, 2018 at 10:55:32PM -0800, Dan Williams wrote:
>> Add a generic facility for awaiting an atomic_t to reach a value of 1.
>>
>> Page reference counts typically need to reach 0 to be considered a
>> free / inactive page. However, ZONE_DEVICE pages allocated via
>> devm_memremap_pages() are never 'onlined', i.e. the put_page() typically
>> done at init time to assign pages to the page allocator is skipped.
>>
>> These pages will have their reference count elevated > 1 by
>> get_user_pages() when they are under DMA. In order to coordinate DMA to
>> these pages vs filesystem operations like hole-punch and truncate, the
>> filesystem-dax implementation needs to capture the DMA-idle event, i.e.
>> the 2-to-1 count transition.
>>
>> For now, this implementation does not introduce a functional behavior
>> change; follow-on patches will add waiters for these page-idle events.
>
> Argh, no no no.. That whole wait_for_atomic_t thing is a giant
> trainwreck already and now you're making it worse still.
>
> Please have a look here:
>
>   https://lkml.kernel.org/r/20171101190644.chwhfpoz3ywxx2m7@hirez.programming.kicks-ass.net

That thread seems to be worried about the object disappearing the
moment its reference count reaches a target. That isn't the case with
the memmap / struct page objects for ZONE_DEVICE pages. I understand
wait_for_atomic_one() is broken in the general case, but as far as I
can see it works fine specifically for ZONE_DEVICE page busy tracking,
just not generic object lifetime.
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v5 02/11] xfs, dax: introduce xfs_dax_aops
  2018-03-10 17:40     ` Dan Williams
@ 2018-03-11 19:16       ` Dan Williams
  2018-03-12  7:51         ` Christoph Hellwig
  0 siblings, 1 reply; 32+ messages in thread
From: Dan Williams @ 2018-03-11 19:16 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, Matthew Wilcox, linux-nvdimm, Dave Chinner,
	Linux Kernel Mailing List, linux-xfs, linux-fsdevel

On Sat, Mar 10, 2018 at 9:40 AM, Dan Williams <dan.j.williams@intel.com> wrote:
> On Sat, Mar 10, 2018 at 1:46 AM, Christoph Hellwig <hch@lst.de> wrote:
>>> +int dax_set_page_dirty(struct page *page)
>>> +{
>>> +     /*
>>> +      * Unlike __set_page_dirty_no_writeback that handles dirty page
>>> +      * tracking in the page object, dax does all dirty tracking in
>>> +      * the inode address_space in response to mkwrite faults. In the
>>> +      * dax case we only need to worry about potentially dirty CPU
>>> +      * caches, not dirty page cache pages to write back.
>>> +      *
>>> +      * This callback is defined to prevent fallback to
>>> +      * __set_page_dirty_buffers() in set_page_dirty().
>>> +      */
>>> +     return 0;
>>> +}
>>
>> Make this a generic noop_set_page_dirty maybe?
>>
>>> +EXPORT_SYMBOL(dax_set_page_dirty);
>>> +
>>> +void dax_invalidatepage(struct page *page, unsigned int offset,
>>> +             unsigned int length)
>>> +{
>>> +     /*
>>> +      * There is no page cache to invalidate in the dax case, however
>>> +      * we need this callback defined to prevent falling back to
>>> +      * block_invalidatepage() in do_invalidatepage().
>>> +      */
>>> +}
>>
>> Same here.
>
> I guess I'm not sure what you mean. These nops are specific to dax I
> don't think they make sense in another context besides dax.
>

I did the rename, and am housing these in fs/dax.c, I assume that's
what you wanted.
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v5 02/11] xfs, dax: introduce xfs_dax_aops
  2018-03-11 19:16       ` Dan Williams
@ 2018-03-12  7:51         ` Christoph Hellwig
  0 siblings, 0 replies; 32+ messages in thread
From: Christoph Hellwig @ 2018-03-12  7:51 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jan Kara, Matthew Wilcox, linux-nvdimm, Dave Chinner,
	Linux Kernel Mailing List, linux-xfs, linux-fsdevel,
	Christoph Hellwig

On Sun, Mar 11, 2018 at 12:16:25PM -0700, Dan Williams wrote:
> I did the rename, and am housing these in fs/dax.c, I assume that's
> what you wanted.

libfs.c would seem ok too, but we're into micro-management land now :)
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v5 06/11] mm, dax: enable filesystems to trigger dev_pagemap ->page_free callbacks
  2018-03-10  6:55 ` [PATCH v5 06/11] mm, dax: enable filesystems to trigger dev_pagemap ->page_free callbacks Dan Williams
@ 2018-03-12 14:09   ` Jerome Glisse
  0 siblings, 0 replies; 32+ messages in thread
From: Jerome Glisse @ 2018-03-12 14:09 UTC (permalink / raw)
  To: Dan Williams
  Cc: Michal Hocko, jack, linux-nvdimm, david, linux-kernel, linux-xfs,
	linux-fsdevel, Christoph Hellwig

On Fri, Mar 09, 2018 at 10:55:21PM -0800, Dan Williams wrote:
> In order to resolve collisions between filesystem operations and DMA to
> DAX mapped pages we need a callback when DMA completes. With a callback
> we can hold off filesystem operations while DMA is in-flight and then
> resume those operations when the last put_page() occurs on a DMA page.
> 
> Recall that the 'struct page' entries for DAX memory are created with
> devm_memremap_pages(). That routine arranges for the pages to be
> allocated, but never onlined, so a DAX page is DMA-idle when its
> reference count reaches one.
> 
> Also recall that the HMM sub-system added infrastructure to trap the
> page-idle (2-to-1 reference count) transition of the pages allocated by
> devm_memremap_pages() and trigger a callback via the 'struct
> dev_pagemap' associated with the page range. Whereas the HMM callbacks
> go to a device driver to manage bounce pages in device memory, in the
> filesystem-dax case we will call back to a filesystem-specified
> callback.
> 
> Since the callback is not known at devm_memremap_pages() time we arrange
> for the filesystem to install it at mount time. No functional changes
> are expected as this only registers a nop handler for the ->page_free()
> event for device-mapped pages.
> 
> Cc: Michal Hocko <mhocko@suse.com>

Looks good to me

Reviewed-by: "Jérôme Glisse" <jglisse@redhat.com>

> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  drivers/dax/super.c      |   79 ++++++++++++++++++++++++++++++++++++++++------
>  drivers/nvdimm/pmem.c    |    3 +-
>  fs/ext2/super.c          |    6 ++-
>  fs/ext4/super.c          |    6 ++-
>  fs/xfs/xfs_super.c       |   20 ++++++------
>  include/linux/dax.h      |   17 +++++-----
>  include/linux/memremap.h |    8 +++++
>  7 files changed, 103 insertions(+), 36 deletions(-)
> 
> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index 2b2332b605e4..ecefe9f7eb60 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -29,6 +29,7 @@ static struct vfsmount *dax_mnt;
>  static DEFINE_IDA(dax_minor_ida);
>  static struct kmem_cache *dax_cache __read_mostly;
>  static struct super_block *dax_superblock __read_mostly;
> +static DEFINE_MUTEX(devmap_lock);
>  
>  #define DAX_HASH_SIZE (PAGE_SIZE / sizeof(struct hlist_head))
>  static struct hlist_head dax_host_list[DAX_HASH_SIZE];
> @@ -62,16 +63,6 @@ int bdev_dax_pgoff(struct block_device *bdev, sector_t sector, size_t size,
>  }
>  EXPORT_SYMBOL(bdev_dax_pgoff);
>  
> -#if IS_ENABLED(CONFIG_FS_DAX)
> -struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev)
> -{
> -	if (!blk_queue_dax(bdev->bd_queue))
> -		return NULL;
> -	return fs_dax_get_by_host(bdev->bd_disk->disk_name);
> -}
> -EXPORT_SYMBOL_GPL(fs_dax_get_by_bdev);
> -#endif
> -
>  /**
>   * __bdev_dax_supported() - Check if the device supports dax for filesystem
>   * @sb: The superblock of the device
> @@ -169,9 +160,66 @@ struct dax_device {
>  	const char *host;
>  	void *private;
>  	unsigned long flags;
> +	struct dev_pagemap *pgmap;
>  	const struct dax_operations *ops;
>  };
>  
> +#if IS_ENABLED(CONFIG_FS_DAX)
> +static void generic_dax_pagefree(struct page *page, void *data)
> +{
> +	/* TODO: wakeup page-idle waiters */
> +}
> +
> +struct dax_device *fs_dax_claim_bdev(struct block_device *bdev, void *owner)
> +{
> +	struct dax_device *dax_dev;
> +	struct dev_pagemap *pgmap;
> +
> +	if (!blk_queue_dax(bdev->bd_queue))
> +		return NULL;
> +	dax_dev = fs_dax_get_by_host(bdev->bd_disk->disk_name);
> +	if (!dax_dev->pgmap)
> +		return dax_dev;
> +	pgmap = dax_dev->pgmap;
> +
> +	mutex_lock(&devmap_lock);
> +	if ((pgmap->data && pgmap->data != owner) || pgmap->page_free
> +			|| pgmap->page_fault
> +			|| pgmap->type != MEMORY_DEVICE_HOST) {
> +		put_dax(dax_dev);
> +		mutex_unlock(&devmap_lock);
> +		return NULL;
> +	}
> +
> +	pgmap->type = MEMORY_DEVICE_FS_DAX;
> +	pgmap->page_free = generic_dax_pagefree;
> +	pgmap->data = owner;
> +	mutex_unlock(&devmap_lock);
> +
> +	return dax_dev;
> +}
> +EXPORT_SYMBOL_GPL(fs_dax_claim_bdev);
> +
> +void fs_dax_release(struct dax_device *dax_dev, void *owner)
> +{
> +	struct dev_pagemap *pgmap = dax_dev ? dax_dev->pgmap : NULL;
> +
> +	put_dax(dax_dev);
> +	if (!pgmap)
> +		return;
> +	if (!pgmap->data)
> +		return;
> +
> +	mutex_lock(&devmap_lock);
> +	WARN_ON(pgmap->data != owner);
> +	pgmap->type = MEMORY_DEVICE_HOST;
> +	pgmap->page_free = NULL;
> +	pgmap->data = NULL;
> +	mutex_unlock(&devmap_lock);
> +}
> +EXPORT_SYMBOL_GPL(fs_dax_release);
> +#endif
> +
>  static ssize_t write_cache_show(struct device *dev,
>  		struct device_attribute *attr, char *buf)
>  {
> @@ -499,6 +547,17 @@ struct dax_device *alloc_dax(void *private, const char *__host,
>  }
>  EXPORT_SYMBOL_GPL(alloc_dax);
>  
> +struct dax_device *alloc_dax_devmap(void *private, const char *host,
> +		const struct dax_operations *ops, struct dev_pagemap *pgmap)
> +{
> +	struct dax_device *dax_dev = alloc_dax(private, host, ops);
> +
> +	if (dax_dev)
> +		dax_dev->pgmap = pgmap;
> +	return dax_dev;
> +}
> +EXPORT_SYMBOL_GPL(alloc_dax_devmap);
> +
>  void put_dax(struct dax_device *dax_dev)
>  {
>  	if (!dax_dev)
> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> index 06f8dcc52ca6..e6d7351f3379 100644
> --- a/drivers/nvdimm/pmem.c
> +++ b/drivers/nvdimm/pmem.c
> @@ -408,7 +408,8 @@ static int pmem_attach_disk(struct device *dev,
>  	nvdimm_badblocks_populate(nd_region, &pmem->bb, &bb_res);
>  	disk->bb = &pmem->bb;
>  
> -	dax_dev = alloc_dax(pmem, disk->disk_name, &pmem_dax_ops);
> +	dax_dev = alloc_dax_devmap(pmem, disk->disk_name, &pmem_dax_ops,
> +			&pmem->pgmap);
>  	if (!dax_dev) {
>  		put_disk(disk);
>  		return -ENOMEM;
> diff --git a/fs/ext2/super.c b/fs/ext2/super.c
> index 7666c065b96f..6ae20e319bc4 100644
> --- a/fs/ext2/super.c
> +++ b/fs/ext2/super.c
> @@ -172,7 +172,7 @@ static void ext2_put_super (struct super_block * sb)
>  	brelse (sbi->s_sbh);
>  	sb->s_fs_info = NULL;
>  	kfree(sbi->s_blockgroup_lock);
> -	fs_put_dax(sbi->s_daxdev);
> +	fs_dax_release(sbi->s_daxdev, sb);
>  	kfree(sbi);
>  }
>  
> @@ -817,7 +817,7 @@ static unsigned long descriptor_loc(struct super_block *sb,
>  
>  static int ext2_fill_super(struct super_block *sb, void *data, int silent)
>  {
> -	struct dax_device *dax_dev = fs_dax_get_by_bdev(sb->s_bdev);
> +	struct dax_device *dax_dev = fs_dax_claim_bdev(sb->s_bdev, sb);
>  	struct buffer_head * bh;
>  	struct ext2_sb_info * sbi;
>  	struct ext2_super_block * es;
> @@ -1213,7 +1213,7 @@ static int ext2_fill_super(struct super_block *sb, void *data, int silent)
>  	kfree(sbi->s_blockgroup_lock);
>  	kfree(sbi);
>  failed:
> -	fs_put_dax(dax_dev);
> +	fs_dax_release(dax_dev, sb);
>  	return ret;
>  }
>  
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 39bf464c35f1..315a323729e3 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -952,7 +952,7 @@ static void ext4_put_super(struct super_block *sb)
>  	if (sbi->s_chksum_driver)
>  		crypto_free_shash(sbi->s_chksum_driver);
>  	kfree(sbi->s_blockgroup_lock);
> -	fs_put_dax(sbi->s_daxdev);
> +	fs_dax_release(sbi->s_daxdev, sb);
>  	kfree(sbi);
>  }
>  
> @@ -3398,7 +3398,7 @@ static void ext4_set_resv_clusters(struct super_block *sb)
>  
>  static int ext4_fill_super(struct super_block *sb, void *data, int silent)
>  {
> -	struct dax_device *dax_dev = fs_dax_get_by_bdev(sb->s_bdev);
> +	struct dax_device *dax_dev = fs_dax_claim_bdev(sb->s_bdev, sb);
>  	char *orig_data = kstrdup(data, GFP_KERNEL);
>  	struct buffer_head *bh;
>  	struct ext4_super_block *es = NULL;
> @@ -4408,7 +4408,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
>  out_free_base:
>  	kfree(sbi);
>  	kfree(orig_data);
> -	fs_put_dax(dax_dev);
> +	fs_dax_release(dax_dev, sb);
>  	return err ? err : ret;
>  }
>  
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index 93588ea3d3d2..ef7dd7148c0b 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -724,7 +724,7 @@ xfs_close_devices(
>  
>  		xfs_free_buftarg(mp, mp->m_logdev_targp);
>  		xfs_blkdev_put(logdev);
> -		fs_put_dax(dax_logdev);
> +		fs_dax_release(dax_logdev, mp);
>  	}
>  	if (mp->m_rtdev_targp) {
>  		struct block_device *rtdev = mp->m_rtdev_targp->bt_bdev;
> @@ -732,10 +732,10 @@ xfs_close_devices(
>  
>  		xfs_free_buftarg(mp, mp->m_rtdev_targp);
>  		xfs_blkdev_put(rtdev);
> -		fs_put_dax(dax_rtdev);
> +		fs_dax_release(dax_rtdev, mp);
>  	}
>  	xfs_free_buftarg(mp, mp->m_ddev_targp);
> -	fs_put_dax(dax_ddev);
> +	fs_dax_release(dax_ddev, mp);
>  }
>  
>  /*
> @@ -753,9 +753,9 @@ xfs_open_devices(
>  	struct xfs_mount	*mp)
>  {
>  	struct block_device	*ddev = mp->m_super->s_bdev;
> -	struct dax_device	*dax_ddev = fs_dax_get_by_bdev(ddev);
> -	struct dax_device	*dax_logdev = NULL, *dax_rtdev = NULL;
> +	struct dax_device	*dax_ddev = fs_dax_claim_bdev(ddev, mp);
>  	struct block_device	*logdev = NULL, *rtdev = NULL;
> +	struct dax_device	*dax_logdev = NULL, *dax_rtdev = NULL;
>  	int			error;
>  
>  	/*
> @@ -765,7 +765,7 @@ xfs_open_devices(
>  		error = xfs_blkdev_get(mp, mp->m_logname, &logdev);
>  		if (error)
>  			goto out;
> -		dax_logdev = fs_dax_get_by_bdev(logdev);
> +		dax_logdev = fs_dax_claim_bdev(logdev, mp);
>  	}
>  
>  	if (mp->m_rtname) {
> @@ -779,7 +779,7 @@ xfs_open_devices(
>  			error = -EINVAL;
>  			goto out_close_rtdev;
>  		}
> -		dax_rtdev = fs_dax_get_by_bdev(rtdev);
> +		dax_rtdev = fs_dax_claim_bdev(rtdev, mp);
>  	}
>  
>  	/*
> @@ -813,14 +813,14 @@ xfs_open_devices(
>  	xfs_free_buftarg(mp, mp->m_ddev_targp);
>   out_close_rtdev:
>  	xfs_blkdev_put(rtdev);
> -	fs_put_dax(dax_rtdev);
> +	fs_dax_release(dax_rtdev, mp);
>   out_close_logdev:
>  	if (logdev && logdev != ddev) {
>  		xfs_blkdev_put(logdev);
> -		fs_put_dax(dax_logdev);
> +		fs_dax_release(dax_logdev, mp);
>  	}
>   out:
> -	fs_put_dax(dax_ddev);
> +	fs_dax_release(dax_ddev, mp);
>  	return error;
>  }
>  
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index 3045c0d9c804..9b4259aee016 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -51,12 +51,8 @@ static inline struct dax_device *fs_dax_get_by_host(const char *host)
>  	return dax_get_by_host(host);
>  }
>  
> -static inline void fs_put_dax(struct dax_device *dax_dev)
> -{
> -	put_dax(dax_dev);
> -}
> -
> -struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev);
> +struct dax_device *fs_dax_claim_bdev(struct block_device *bdev, void *owner);
> +void fs_dax_release(struct dax_device *dax_dev, void *owner);
>  int dax_set_page_dirty(struct page *page);
>  void dax_invalidatepage(struct page *page, unsigned int offset,
>  		unsigned int length);
> @@ -71,13 +67,14 @@ static inline struct dax_device *fs_dax_get_by_host(const char *host)
>  	return NULL;
>  }
>  
> -static inline void fs_put_dax(struct dax_device *dax_dev)
> +static inline struct dax_device *fs_dax_claim_bdev(struct block_device *bdev,
> +		void *owner)
>  {
> +	return NULL;
>  }
>  
> -static inline struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev)
> +static inline void fs_dax_release(struct dax_device *dax_dev, void *owner)
>  {
> -	return NULL;
>  }
>  
>  #define dax_set_page_dirty NULL
> @@ -88,6 +85,8 @@ int dax_read_lock(void);
>  void dax_read_unlock(int id);
>  struct dax_device *alloc_dax(void *private, const char *host,
>  		const struct dax_operations *ops);
> +struct dax_device *alloc_dax_devmap(void *private, const char *host,
> +		const struct dax_operations *ops, struct dev_pagemap *pgmap);
>  bool dax_alive(struct dax_device *dax_dev);
>  void kill_dax(struct dax_device *dax_dev);
>  void *dax_get_private(struct dax_device *dax_dev);
> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> index 7b4899c06f49..02d6d042ee7f 100644
> --- a/include/linux/memremap.h
> +++ b/include/linux/memremap.h
> @@ -53,11 +53,19 @@ struct vmem_altmap {
>   * driver can hotplug the device memory using ZONE_DEVICE and with that memory
>   * type. Any page of a process can be migrated to such memory. However no one
>   * should be allow to pin such memory so that it can always be evicted.
> + *
> + * MEMORY_DEVICE_FS_DAX:
> + * When MEMORY_DEVICE_HOST memory is represented by a device that can
> + * host a filesystem, for example /dev/pmem0, that filesystem can
> + * register for a callback when a page is idled. For the filesystem-dax
> + * case, page idle callbacks are used to coordinate DMA vs
> + * hole-punch/truncate.
>   */
>  enum memory_type {
>  	MEMORY_DEVICE_HOST = 0,
>  	MEMORY_DEVICE_PRIVATE,
>  	MEMORY_DEVICE_PUBLIC,
> +	MEMORY_DEVICE_FS_DAX,
>  };
>  
>  /*
> 
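To illustrate the claim/release pairing above, a minimal sketch of a
filesystem mount path follows. It assumes only the fs_dax_claim_bdev()
and fs_dax_release() signatures from this patch; the example_* names
are hypothetical stand-ins:

#include <linux/fs.h>
#include <linux/dax.h>

/* stand-in for filesystem-specific setup (hypothetical) */
static int example_setup(struct super_block *sb, void *data, int silent)
{
	return 0;
}

static int example_fill_super(struct super_block *sb, void *data, int silent)
{
	/* claim the dax_device with the superblock as the owner token */
	struct dax_device *dax_dev = fs_dax_claim_bdev(sb->s_bdev, sb);
	int ret = example_setup(sb, data, silent);

	if (ret)
		/* on failure, release with the owner used at claim time */
		fs_dax_release(dax_dev, sb);
	return ret;
}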

* Re: [PATCH v5 07/11] mm, dev_pagemap: introduce CONFIG_DEV_PAGEMAP_OPS
  2018-03-10  6:55 ` [PATCH v5 07/11] mm, dev_pagemap: introduce CONFIG_DEV_PAGEMAP_OPS Dan Williams
@ 2018-03-12 14:17   ` Jerome Glisse
  2018-03-12 18:17     ` Dan Williams
  0 siblings, 1 reply; 32+ messages in thread
From: Jerome Glisse @ 2018-03-12 14:17 UTC (permalink / raw)
  To: Dan Williams
  Cc: Michal Hocko, jack, linux-nvdimm, david, linux-kernel, linux-xfs,
	linux-fsdevel, hch

On Fri, Mar 09, 2018 at 10:55:26PM -0800, Dan Williams wrote:
> The HMM sub-system extended dev_pagemap to arrange a callback when a
> dev_pagemap managed page is freed. Since a dev_pagemap page is free /
> idle when its reference count is 1, it requires an additional branch to
> check the page-type at put_page() time. Given put_page() is a hot-path,
> we do not want to incur that check if HMM is not in use, so a static
> branch is used to avoid that overhead when not necessary.
> 
> Now, the FS_DAX implementation wants to reuse this mechanism for
> receiving dev_pagemap ->page_free() callbacks. Rework the HMM-specific
> static-key into a generic mechanism that either HMM or FS_DAX code paths
> can enable.

Why EXPORT_SYMBOL_GPL and not EXPORT_SYMBOL like it was prior to this
patch? Not that I care much about it, just wondering. Maybe keep it
EXPORT_SYMBOL()?

In any case
Reviewed-by: "Jérôme Glisse" <jglisse@redhat.com>

> Cc: Michal Hocko <mhocko@suse.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  drivers/dax/super.c      |    2 ++
>  fs/Kconfig               |    1 +
>  include/linux/memremap.h |   20 ++-------------
>  include/linux/mm.h       |   61 ++++++++++++++++++++++++++++++++--------------
>  kernel/memremap.c        |   30 ++++++++++++++++++++---
>  mm/Kconfig               |    5 ++++
>  mm/hmm.c                 |   13 ++--------
>  mm/swap.c                |    3 ++
>  8 files changed, 84 insertions(+), 51 deletions(-)
> 
> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index ecefe9f7eb60..619b1ed6434c 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -191,6 +191,7 @@ struct dax_device *fs_dax_claim_bdev(struct block_device *bdev, void *owner)
>  		return NULL;
>  	}
>  
> +	dev_pagemap_get_ops();
>  	pgmap->type = MEMORY_DEVICE_FS_DAX;
>  	pgmap->page_free = generic_dax_pagefree;
>  	pgmap->data = owner;
> @@ -215,6 +216,7 @@ void fs_dax_release(struct dax_device *dax_dev, void *owner)
>  	pgmap->type = MEMORY_DEVICE_HOST;
>  	pgmap->page_free = NULL;
>  	pgmap->data = NULL;
> +	dev_pagemap_put_ops();
>  	mutex_unlock(&devmap_lock);
>  }
>  EXPORT_SYMBOL_GPL(fs_dax_release);
> diff --git a/fs/Kconfig b/fs/Kconfig
> index bc821a86d965..1f0832bbc32f 100644
> --- a/fs/Kconfig
> +++ b/fs/Kconfig
> @@ -38,6 +38,7 @@ config FS_DAX
>  	bool "Direct Access (DAX) support"
>  	depends on MMU
>  	depends on !(ARM || MIPS || SPARC)
> +	select DEV_PAGEMAP_OPS
>  	select FS_IOMAP
>  	select DAX
>  	help
> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> index 02d6d042ee7f..9faf25d6abef 100644
> --- a/include/linux/memremap.h
> +++ b/include/linux/memremap.h
> @@ -1,7 +1,6 @@
>  /* SPDX-License-Identifier: GPL-2.0 */
>  #ifndef _LINUX_MEMREMAP_H_
>  #define _LINUX_MEMREMAP_H_
> -#include <linux/mm.h>
>  #include <linux/ioport.h>
>  #include <linux/percpu-refcount.h>
>  
> @@ -130,6 +129,9 @@ struct dev_pagemap {
>  	enum memory_type type;
>  };
>  
> +void dev_pagemap_get_ops(void);
> +void dev_pagemap_put_ops(void);
> +
>  #ifdef CONFIG_ZONE_DEVICE
>  void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap);
>  struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
> @@ -137,8 +139,6 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
>  
>  unsigned long vmem_altmap_offset(struct vmem_altmap *altmap);
>  void vmem_altmap_free(struct vmem_altmap *altmap, unsigned long nr_pfns);
> -
> -static inline bool is_zone_device_page(const struct page *page);
>  #else
>  static inline void *devm_memremap_pages(struct device *dev,
>  		struct dev_pagemap *pgmap)
> @@ -169,20 +169,6 @@ static inline void vmem_altmap_free(struct vmem_altmap *altmap,
>  }
>  #endif /* CONFIG_ZONE_DEVICE */
>  
> -#if defined(CONFIG_DEVICE_PRIVATE) || defined(CONFIG_DEVICE_PUBLIC)
> -static inline bool is_device_private_page(const struct page *page)
> -{
> -	return is_zone_device_page(page) &&
> -		page->pgmap->type == MEMORY_DEVICE_PRIVATE;
> -}
> -
> -static inline bool is_device_public_page(const struct page *page)
> -{
> -	return is_zone_device_page(page) &&
> -		page->pgmap->type == MEMORY_DEVICE_PUBLIC;
> -}
> -#endif /* CONFIG_DEVICE_PRIVATE || CONFIG_DEVICE_PUBLIC */
> -
>  static inline void put_dev_pagemap(struct dev_pagemap *pgmap)
>  {
>  	if (pgmap)
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index ad06d42adb1a..088c76bce360 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -812,27 +812,55 @@ static inline bool is_zone_device_page(const struct page *page)
>  }
>  #endif
>  
> -#if defined(CONFIG_DEVICE_PRIVATE) || defined(CONFIG_DEVICE_PUBLIC)
> -void put_zone_device_private_or_public_page(struct page *page);
> -DECLARE_STATIC_KEY_FALSE(device_private_key);
> -#define IS_HMM_ENABLED static_branch_unlikely(&device_private_key)
> -static inline bool is_device_private_page(const struct page *page);
> -static inline bool is_device_public_page(const struct page *page);
> -#else /* CONFIG_DEVICE_PRIVATE || CONFIG_DEVICE_PUBLIC */
> -static inline void put_zone_device_private_or_public_page(struct page *page)
> +#ifdef CONFIG_DEV_PAGEMAP_OPS
> +void __put_devmap_managed_page(struct page *page);
> +DECLARE_STATIC_KEY_FALSE(devmap_managed_key);
> +static inline bool put_devmap_managed_page(struct page *page)
>  {
> +	if (!static_branch_unlikely(&devmap_managed_key))
> +		return false;
> +	if (!is_zone_device_page(page))
> +		return false;
> +	switch (page->pgmap->type) {
> +	case MEMORY_DEVICE_PRIVATE:
> +	case MEMORY_DEVICE_PUBLIC:
> +	case MEMORY_DEVICE_FS_DAX:
> +		__put_devmap_managed_page(page);
> +		return true;
> +	default:
> +		break;
> +	}
> +	return false;
>  }
> -#define IS_HMM_ENABLED 0
> +
>  static inline bool is_device_private_page(const struct page *page)
>  {
> -	return false;
> +	return is_zone_device_page(page) &&
> +		page->pgmap->type == MEMORY_DEVICE_PRIVATE;
>  }
> +
>  static inline bool is_device_public_page(const struct page *page)
>  {
> +	return is_zone_device_page(page) &&
> +		page->pgmap->type == MEMORY_DEVICE_PUBLIC;
> +}
> +
> +#else /* CONFIG_DEV_PAGEMAP_OPS */
> +static inline bool put_devmap_managed_page(struct page *page)
> +{
>  	return false;
>  }
> -#endif /* CONFIG_DEVICE_PRIVATE || CONFIG_DEVICE_PUBLIC */
>  
> +static inline bool is_device_private_page(const struct page *page)
> +{
> +	return false;
> +}
> +
> +static inline bool is_device_public_page(const struct page *page)
> +{
> +	return false;
> +}
> +#endif /* CONFIG_DEV_PAGEMAP_OPS */
>  
>  static inline void get_page(struct page *page)
>  {
> @@ -850,16 +878,13 @@ static inline void put_page(struct page *page)
>  	page = compound_head(page);
>  
>  	/*
> -	 * For private device pages we need to catch refcount transition from
> -	 * 2 to 1, when refcount reach one it means the private device page is
> -	 * free and we need to inform the device driver through callback. See
> +	 * For devmap managed pages we need to catch the refcount transition
> +	 * from 2 to 1; when the refcount reaches one the page is free and we
> +	 * need to inform the device driver through a callback. See
>  	 * include/linux/memremap.h and HMM for details.
>  	 */
> -	if (IS_HMM_ENABLED && unlikely(is_device_private_page(page) ||
> -	    unlikely(is_device_public_page(page)))) {
> -		put_zone_device_private_or_public_page(page);
> +	if (put_devmap_managed_page(page))
>  		return;
> -	}
>  
>  	if (put_page_testzero(page))
>  		__put_page(page);
> diff --git a/kernel/memremap.c b/kernel/memremap.c
> index 4dd4274cabe2..bd8300d02d7c 100644
> --- a/kernel/memremap.c
> +++ b/kernel/memremap.c
> @@ -476,8 +476,30 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
>  }
>  #endif /* CONFIG_ZONE_DEVICE */
>  
> -#if IS_ENABLED(CONFIG_DEVICE_PRIVATE) ||  IS_ENABLED(CONFIG_DEVICE_PUBLIC)
> -void put_zone_device_private_or_public_page(struct page *page)
> +#ifdef CONFIG_DEV_PAGEMAP_OPS
> +DEFINE_STATIC_KEY_FALSE(devmap_managed_key);
> +EXPORT_SYMBOL(devmap_managed_key);
> +static atomic_t devmap_enable;
> +
> +/*
> + * Toggle the static key for ->page_free() callbacks when dev_pagemap
> + * pages go idle.
> + */
> +void dev_pagemap_get_ops(void)
> +{
> +	if (atomic_inc_return(&devmap_enable) == 1)
> +		static_branch_enable(&devmap_managed_key);
> +}
> +EXPORT_SYMBOL_GPL(dev_pagemap_get_ops);
> +
> +void dev_pagemap_put_ops(void)
> +{
> +	if (atomic_dec_and_test(&devmap_enable))
> +		static_branch_disable(&devmap_managed_key);
> +}
> +EXPORT_SYMBOL_GPL(dev_pagemap_put_ops);
> +
> +void __put_devmap_managed_page(struct page *page)
>  {
>  	int count = page_ref_dec_return(page);
>  
> @@ -497,5 +519,5 @@ void put_zone_device_private_or_public_page(struct page *page)
>  	} else if (!count)
>  		__put_page(page);
>  }
> -EXPORT_SYMBOL(put_zone_device_private_or_public_page);
> -#endif /* CONFIG_DEVICE_PRIVATE || CONFIG_DEVICE_PUBLIC */
> +EXPORT_SYMBOL_GPL(__put_devmap_managed_page);
> +#endif /* CONFIG_DEV_PAGEMAP_OPS */
> diff --git a/mm/Kconfig b/mm/Kconfig
> index c782e8fb7235..dc32828984a3 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -700,6 +700,9 @@ config ARCH_HAS_HMM
>  config MIGRATE_VMA_HELPER
>  	bool
>  
> +config DEV_PAGEMAP_OPS
> +	bool
> +
>  config HMM
>  	bool
>  	select MIGRATE_VMA_HELPER
> @@ -720,6 +723,7 @@ config DEVICE_PRIVATE
>  	bool "Unaddressable device memory (GPU memory, ...)"
>  	depends on ARCH_HAS_HMM
>  	select HMM
> +	select DEV_PAGEMAP_OPS
>  
>  	help
>  	  Allows creation of struct pages to represent unaddressable device
> @@ -730,6 +734,7 @@ config DEVICE_PUBLIC
>  	bool "Addressable device memory (like GPU memory)"
>  	depends on ARCH_HAS_HMM
>  	select HMM
> +	select DEV_PAGEMAP_OPS
>  
>  	help
>  	  Allows creation of struct pages to represent addressable device
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 320545b98ff5..4aa554e76d06 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -35,15 +35,6 @@
>  
>  #define PA_SECTION_SIZE (1UL << PA_SECTION_SHIFT)
>  
> -#if defined(CONFIG_DEVICE_PRIVATE) || defined(CONFIG_DEVICE_PUBLIC)
> -/*
> - * Device private memory see HMM (Documentation/vm/hmm.txt) or hmm.h
> - */
> -DEFINE_STATIC_KEY_FALSE(device_private_key);
> -EXPORT_SYMBOL(device_private_key);
> -#endif /* CONFIG_DEVICE_PRIVATE || CONFIG_DEVICE_PUBLIC */
> -
> -
>  #if IS_ENABLED(CONFIG_HMM_MIRROR)
>  static const struct mmu_notifier_ops hmm_mmu_notifier_ops;
>  
> @@ -996,7 +987,7 @@ struct hmm_devmem *hmm_devmem_add(const struct hmm_devmem_ops *ops,
>  	resource_size_t addr;
>  	int ret;
>  
> -	static_branch_enable(&device_private_key);
> +	dev_pagemap_get_ops();
>  
>  	devmem = devres_alloc_node(&hmm_devmem_release, sizeof(*devmem),
>  				   GFP_KERNEL, dev_to_node(device));
> @@ -1090,7 +1081,7 @@ struct hmm_devmem *hmm_devmem_add_resource(const struct hmm_devmem_ops *ops,
>  	if (res->desc != IORES_DESC_DEVICE_PUBLIC_MEMORY)
>  		return ERR_PTR(-EINVAL);
>  
> -	static_branch_enable(&device_private_key);
> +	dev_pagemap_get_ops();
>  
>  	devmem = devres_alloc_node(&hmm_devmem_release, sizeof(*devmem),
>  				   GFP_KERNEL, dev_to_node(device));
> diff --git a/mm/swap.c b/mm/swap.c
> index 0f17330dd0e5..eed846cfc8b8 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -29,6 +29,7 @@
>  #include <linux/cpu.h>
>  #include <linux/notifier.h>
>  #include <linux/backing-dev.h>
> +#include <linux/memremap.h>
>  #include <linux/memcontrol.h>
>  #include <linux/gfp.h>
>  #include <linux/uio.h>
> @@ -744,7 +745,7 @@ void release_pages(struct page **pages, int nr)
>  						       flags);
>  				locked_pgdat = NULL;
>  			}
> -			put_zone_device_private_or_public_page(page);
> +			put_devmap_managed_page(page);
>  			continue;
>  		}
>  
> 
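To make the static-key gating concrete, a short sketch of how a devmap
user would pair dev_pagemap_get_ops()/dev_pagemap_put_ops() around a
pagemap's lifetime; this mirrors the fs_dax_claim_bdev()/fs_dax_release()
hunks above, and the example_* names are hypothetical:

#include <linux/mm.h>
#include <linux/memremap.h>

/* hypothetical ->page_free() callback: the page has gone idle again */
static void example_page_free(struct page *page, void *data)
{
}

static void example_enable_callbacks(struct dev_pagemap *pgmap, void *owner)
{
	dev_pagemap_get_ops();	/* first user enables devmap_managed_key */
	pgmap->type = MEMORY_DEVICE_FS_DAX;
	pgmap->page_free = example_page_free;
	pgmap->data = owner;
}

static void example_disable_callbacks(struct dev_pagemap *pgmap)
{
	pgmap->type = MEMORY_DEVICE_HOST;
	pgmap->page_free = NULL;
	pgmap->data = NULL;
	dev_pagemap_put_ops();	/* last user turns the static key back off */
}

With this pairing, put_page() only pays the page->pgmap->type check
while at least one devmap user is registered.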

* Re: [PATCH v5 07/11] mm, dev_pagemap: introduce CONFIG_DEV_PAGEMAP_OPS
  2018-03-12 14:17   ` Jerome Glisse
@ 2018-03-12 18:17     ` Dan Williams
  0 siblings, 0 replies; 32+ messages in thread
From: Dan Williams @ 2018-03-12 18:17 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Michal Hocko, Jan Kara, linux-nvdimm, david,
	Linux Kernel Mailing List, linux-xfs, linux-fsdevel,
	Christoph Hellwig

On Mon, Mar 12, 2018 at 7:17 AM, Jerome Glisse <jglisse@redhat.com> wrote:
> On Fri, Mar 09, 2018 at 10:55:26PM -0800, Dan Williams wrote:
>> The HMM sub-system extended dev_pagemap to arrange a callback when a
>> dev_pagemap managed page is freed. Since a dev_pagemap page is free /
>> idle when its reference count is 1, it requires an additional branch to
>> check the page-type at put_page() time. Given put_page() is a hot-path,
>> we do not want to incur that check if HMM is not in use, so a static
>> branch is used to avoid that overhead when not necessary.
>>
>> Now, the FS_DAX implementation wants to reuse this mechanism for
>> receiving dev_pagemap ->page_free() callbacks. Rework the HMM-specific
>> static-key into a generic mechanism that either HMM or FS_DAX code paths
>> can enable.
>
> Why EXPORT_SYMBOL_GPL and not EXPORT_SYMBOL like it was prior to this
> patch? Not that I care much about it, just wondering. Maybe keep it
> EXPORT_SYMBOL()?

HMM has significant integrations and useful hacks within the mm
sub-system; I think we should go further than just these symbols and
also make devm_memremap_pages() EXPORT_SYMBOL_GPL. It was simply
an oversight that these were EXPORT_SYMBOL previously.

> In any case
> Reviewed-by: "Jérôme Glisse" <jglisse@redhat.com>

Thanks!

* Re: [PATCH v5 08/11] wait_bit: introduce {wait_on, wake_up}_atomic_one
  2018-03-11 17:15     ` Dan Williams
@ 2018-03-12 19:32       ` Dan Williams
  2018-03-13 10:20       ` [RFC][PATCH] sched/wait_bit: Introduce wait_var_event()/wake_up_var() Peter Zijlstra
  2018-03-15  9:58       ` David Howells
  2 siblings, 0 replies; 32+ messages in thread
From: Dan Williams @ 2018-03-12 19:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jan Kara, linux-nvdimm, david, Linux Kernel Mailing List,
	linux-xfs, Ingo Molnar, linux-fsdevel, Christoph Hellwig

On Sun, Mar 11, 2018 at 10:15 AM, Dan Williams <dan.j.williams@intel.com> wrote:
> On Sun, Mar 11, 2018 at 4:27 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>> On Fri, Mar 09, 2018 at 10:55:32PM -0800, Dan Williams wrote:
>>> Add a generic facility for awaiting an atomic_t to reach a value of 1.
>>>
>>> Page reference counts typically need to reach 0 to be considered a
>>> free / inactive page. However, ZONE_DEVICE pages allocated via
>>> devm_memremap_pages() are never 'onlined', i.e. the put_page() typically
>>> done at init time to assign pages to the page allocator is skipped.
>>>
>>> These pages will have their reference count elevated > 1 by
>>> get_user_pages() when they are under DMA. In order to coordinate DMA to
>>> these pages vs filesystem operations like hole-punch and truncate, the
>>> filesystem-dax implementation needs to capture the DMA-idle event (i.e.
>>> the 2 to 1 count transition).
>>>
>>> For now, this implementation does not have functional behavior change,
>>> follow-on patches will add waiters for these page-idle events.
>>
>> Argh, no no no.. That whole wait_for_atomic_t thing is a giant
>> trainwreck already and now you're making it worse still.
>>
>> Please have a look here:
>>
>>   https://lkml.kernel.org/r/20171101190644.chwhfpoz3ywxx2m7@hirez.programming.kicks-ass.net
>
> That thread seems to be worried about the object disappearing the
> moment its reference count reaches a target. That isn't the case with
> the memmap / struct page objects for ZONE_DEVICE pages. I understand
> wait_for_atomic_one() is broken in the general case, but as far as I
> can see it works fine specifically for ZONE_DEVICE page busy tracking,
> just not generic object lifetime.

Ok, that thread is also concerned with cleaning up the
wait_for_atomic_* pattern into something more idiomatic with
wait_event(). I agree that would be better, but I'm running short of
time to refactor this for 4.17 inclusion, especially as I expect
another couple of rounds of review on the more urgent data-corruption
fix series that depends on this new api. I think the addition of
wait_for_atomic_one() makes it clear that we need a way to pass a
conditional expression rather than create a variant api for each
different condition. Can you help me out with an attempt of your own,
or at least point me in a direction that you would accept for solving
the "Except the current wait_event() doesn't do the whole key part
that makes the hash-table 'work'." problem that you highlighted?
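
For reference, the condition the fsdax code ultimately needs to wait
for can be written against exactly such a conditional-wait primitive;
a sketch assuming the wait_var_event()/wake_up_var() names that appear
later in this thread, with the dax_* helpers as hypothetical names:

#include <linux/mm.h>
#include <linux/page_ref.h>
#include <linux/wait_bit.h>

/* ZONE_DEVICE pages are DMA-idle at refcount 1, not 0 */
static bool dax_page_idle(struct page *page)	/* hypothetical */
{
	return page_ref_count(page) == 1;
}

static void dax_wait_page(struct page *page)	/* hypothetical */
{
	wait_var_event(&page->_refcount, dax_page_idle(page));
}

/* the matching waker, e.g. from a ->page_free() callback */
static void dax_wake_page(struct page *page)	/* hypothetical */
{
	wake_up_var(&page->_refcount);
}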

* [RFC][PATCH] sched/wait_bit: Introduce wait_var_event()/wake_up_var()
  2018-03-11 17:15     ` Dan Williams
  2018-03-12 19:32       ` Dan Williams
@ 2018-03-13 10:20       ` Peter Zijlstra
  2018-03-14  4:12         ` Dan Williams
  2018-03-15  5:46         ` Dan Williams
  2018-03-15  9:58       ` David Howells
  2 siblings, 2 replies; 32+ messages in thread
From: Peter Zijlstra @ 2018-03-13 10:20 UTC (permalink / raw)
  To: Dan Williams
  Cc: David Howells, Jan Kara, linux-nvdimm, david,
	Linux Kernel Mailing List, Ralf Baechle, linux-xfs, Ingo Molnar,
	linux-fsdevel, Christoph Hellwig

On Sun, Mar 11, 2018 at 10:15:55AM -0700, Dan Williams wrote:
> On Sun, Mar 11, 2018 at 4:27 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Fri, Mar 09, 2018 at 10:55:32PM -0800, Dan Williams wrote:
> >> Add a generic facility for awaiting an atomic_t to reach a value of 1.
> >>
> >> Page reference counts typically need to reach 0 to be considered a
> >> free / inactive page. However, ZONE_DEVICE pages allocated via
> >> devm_memremap_pages() are never 'onlined', i.e. the put_page() typically
> >> done at init time to assign pages to the page allocator is skipped.
> >>
> >> These pages will have their reference count elevated > 1 by
> >> get_user_pages() when they are under DMA. In order to coordinate DMA to
> >> these pages vs filesystem operations like hole-punch and truncate, the
> >> filesystem-dax implementation needs to capture the DMA-idle event (i.e.
> >> the 2 to 1 count transition).
> >>
> >> For now, this implementation does not have functional behavior change,
> >> follow-on patches will add waiters for these page-idle events.
> >
> > Argh, no no no.. That whole wait_for_atomic_t thing is a giant
> > trainwreck already and now you're making it worse still.
> >
> > Please have a look here:
> >
> >   https://lkml.kernel.org/r/20171101190644.chwhfpoz3ywxx2m7@hirez.programming.kicks-ass.net
> 
> That thread seems to be worried about the object disappearing the
> > moment its reference count reaches a target. That isn't the case with
> the memmap / struct page objects for ZONE_DEVICE pages. I understand
> wait_for_atomic_one() is broken in the general case, but as far as I
> can see it works fine specifically for ZONE_DEVICE page busy tracking,
> just not generic object lifetime.

How's this? Compile-tested (x86_64-allmodconfig) only.

This allows you to write:

	wait_var_event(&your_atomic, atomic_read(&your_atomic) == 1);

Ralf, please have a look at the MIPS thing; the current code seems to be
busted in that it can wait indefinitely due to a missing wakeup.

---
 arch/mips/kernel/process.c                         |   3 +
 arch/mips/kernel/traps.c                           |   6 +-
 drivers/gpu/drm/drm_dp_aux_dev.c                   |  13 +--
 drivers/gpu/drm/i915/selftests/intel_breadcrumbs.c |  14 +--
 drivers/media/platform/qcom/venus/hfi.c            |   8 +-
 fs/afs/cell.c                                      |   6 +-
 fs/afs/rxrpc.c                                     |   6 +-
 fs/afs/server.c                                    |   6 +-
 fs/btrfs/extent-tree.c                             |  14 ++-
 fs/btrfs/ioctl.c                                   |   2 +-
 fs/fscache/cookie.c                                |   7 +-
 fs/nfs/inode.c                                     |   5 -
 fs/nfs/pagelist.c                                  |   6 +-
 fs/nfs/pnfs_nfs.c                                  |   2 +-
 fs/nfs/write.c                                     |   6 +-
 fs/ocfs2/filecheck.c                               |   9 +-
 include/linux/fscache-cache.h                      |   2 +-
 include/linux/wait_bit.h                           |  95 +++++++++++++------
 kernel/sched/wait_bit.c                            | 103 +++++----------------
 19 files changed, 149 insertions(+), 164 deletions(-)

diff --git a/arch/mips/kernel/process.c b/arch/mips/kernel/process.c
index 57028d49c202..4369401361c3 100644
--- a/arch/mips/kernel/process.c
+++ b/arch/mips/kernel/process.c
@@ -781,6 +781,9 @@ int mips_set_process_fp_mode(struct task_struct *task, unsigned int value)
 	atomic_set(&task->mm->context.fp_mode_switching, 0);
 	preempt_enable();
 
+	/* XXX is this right ?! */
+	wake_up_var(&task->mm->context.fp_mode_switching);
+
 	return 0;
 }
 
diff --git a/arch/mips/kernel/traps.c b/arch/mips/kernel/traps.c
index 0ae4a731cc12..fd536b2f22ee 100644
--- a/arch/mips/kernel/traps.c
+++ b/arch/mips/kernel/traps.c
@@ -1247,9 +1247,11 @@ static int enable_restore_fp_context(int msa)
 	/*
 	 * If an FP mode switch is currently underway, wait for it to
 	 * complete before proceeding.
+	 *
+	 * XXX where is the wakeup ?!?
 	 */
-	wait_on_atomic_t(&current->mm->context.fp_mode_switching,
-			 atomic_t_wait, TASK_KILLABLE);
+	wait_var_event(&current->mm->context.fp_mode_switching,
+		       !atomic_read(&current->mm->context.fp_mode_switching));
 
 	if (!used_math()) {
 		/* First time FP context user. */
diff --git a/drivers/gpu/drm/drm_dp_aux_dev.c b/drivers/gpu/drm/drm_dp_aux_dev.c
index 053044201e31..74832c0920b6 100644
--- a/drivers/gpu/drm/drm_dp_aux_dev.c
+++ b/drivers/gpu/drm/drm_dp_aux_dev.c
@@ -177,8 +177,9 @@ static ssize_t auxdev_read_iter(struct kiocb *iocb, struct iov_iter *to)
 		res = pos - iocb->ki_pos;
 	iocb->ki_pos = pos;
 
-	atomic_dec(&aux_dev->usecount);
-	wake_up_atomic_t(&aux_dev->usecount);
+	if (atomic_dec_and_test(&aux_dev->usecount))
+		wake_up_var(&aux_dev->usecount);
+
 	return res;
 }
 
@@ -218,8 +219,9 @@ static ssize_t auxdev_write_iter(struct kiocb *iocb, struct iov_iter *from)
 		res = pos - iocb->ki_pos;
 	iocb->ki_pos = pos;
 
-	atomic_dec(&aux_dev->usecount);
-	wake_up_atomic_t(&aux_dev->usecount);
+	if (atomic_dec_and_test(&aux_dev->usecount))
+		wake_up_var(&aux_dev->usecount);
+
 	return res;
 }
 
@@ -277,8 +279,7 @@ void drm_dp_aux_unregister_devnode(struct drm_dp_aux *aux)
 	mutex_unlock(&aux_idr_mutex);
 
 	atomic_dec(&aux_dev->usecount);
-	wait_on_atomic_t(&aux_dev->usecount, atomic_t_wait,
-			 TASK_UNINTERRUPTIBLE);
+	wait_var_event(&aux_dev->usecount, !atomic_read(&aux_dev->usecount));
 
 	minor = aux_dev->index;
 	if (aux_dev->dev)
diff --git a/drivers/gpu/drm/i915/selftests/intel_breadcrumbs.c b/drivers/gpu/drm/i915/selftests/intel_breadcrumbs.c
index 54fc571b1102..f594926b8e9f 100644
--- a/drivers/gpu/drm/i915/selftests/intel_breadcrumbs.c
+++ b/drivers/gpu/drm/i915/selftests/intel_breadcrumbs.c
@@ -271,18 +271,13 @@ struct igt_wakeup {
 	u32 seqno;
 };
 
-static int wait_atomic_timeout(atomic_t *p, unsigned int mode)
-{
-	return schedule_timeout(10 * HZ) ? 0 : -ETIMEDOUT;
-}
-
 static bool wait_for_ready(struct igt_wakeup *w)
 {
 	DEFINE_WAIT(ready);
 
 	set_bit(IDLE, &w->flags);
 	if (atomic_dec_and_test(w->done))
-		wake_up_atomic_t(w->done);
+		wake_up_var(w->done);
 
 	if (test_bit(STOP, &w->flags))
 		goto out;
@@ -299,7 +294,7 @@ static bool wait_for_ready(struct igt_wakeup *w)
 out:
 	clear_bit(IDLE, &w->flags);
 	if (atomic_dec_and_test(w->set))
-		wake_up_atomic_t(w->set);
+		wake_up_var(w->set);
 
 	return !test_bit(STOP, &w->flags);
 }
@@ -342,7 +337,7 @@ static void igt_wake_all_sync(atomic_t *ready,
 	atomic_set(ready, 0);
 	wake_up_all(wq);
 
-	wait_on_atomic_t(set, atomic_t_wait, TASK_UNINTERRUPTIBLE);
+	wait_var_event(set, !atomic_read(set));
 	atomic_set(ready, count);
 	atomic_set(done, count);
 }
@@ -350,7 +345,6 @@ static void igt_wake_all_sync(atomic_t *ready,
 static int igt_wakeup(void *arg)
 {
 	I915_RND_STATE(prng);
-	const int state = TASK_UNINTERRUPTIBLE;
 	struct intel_engine_cs *engine = arg;
 	struct igt_wakeup *waiters;
 	DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
@@ -418,7 +412,7 @@ static int igt_wakeup(void *arg)
 		 * that they are ready for the next test. We wait until all
 		 * threads are complete and waiting for us (i.e. not a seqno).
 		 */
-		err = wait_on_atomic_t(&done, wait_atomic_timeout, state);
+		err = wait_var_event_timeout(&done, !atomic_read(&done), 10 * HZ);
 		if (err) {
 			pr_err("Timed out waiting for %d remaining waiters\n",
 			       atomic_read(&done));
diff --git a/drivers/media/platform/qcom/venus/hfi.c b/drivers/media/platform/qcom/venus/hfi.c
index 1baf78d3c02d..785627cf172c 100644
--- a/drivers/media/platform/qcom/venus/hfi.c
+++ b/drivers/media/platform/qcom/venus/hfi.c
@@ -106,8 +106,8 @@ int hfi_core_deinit(struct venus_core *core, bool blocking)
 
 	if (!empty) {
 		mutex_unlock(&core->lock);
-		wait_on_atomic_t(&core->insts_count, atomic_t_wait,
-				 TASK_UNINTERRUPTIBLE);
+		wait_var_event(&core->insts_count,
+			       !atomic_read(&core->insts_count));
 		mutex_lock(&core->lock);
 	}
 
@@ -229,8 +229,8 @@ void hfi_session_destroy(struct venus_inst *inst)
 
 	mutex_lock(&core->lock);
 	list_del_init(&inst->list);
-	atomic_dec(&core->insts_count);
-	wake_up_atomic_t(&core->insts_count);
+	if (atomic_dec_and_test(&core->insts_count))
+		wake_up_var(&core->insts_count);
 	mutex_unlock(&core->lock);
 }
 EXPORT_SYMBOL_GPL(hfi_session_destroy);
diff --git a/fs/afs/cell.c b/fs/afs/cell.c
index 9bb921d120d0..6393107e05cb 100644
--- a/fs/afs/cell.c
+++ b/fs/afs/cell.c
@@ -25,7 +25,7 @@ static void afs_manage_cell(struct work_struct *);
 static void afs_dec_cells_outstanding(struct afs_net *net)
 {
 	if (atomic_dec_and_test(&net->cells_outstanding))
-		wake_up_atomic_t(&net->cells_outstanding);
+		wake_up_var(&net->cells_outstanding);
 }
 
 /*
@@ -764,7 +764,7 @@ void afs_cell_purge(struct afs_net *net)
 	afs_queue_cell_manager(net);
 
 	_debug("wait");
-	wait_on_atomic_t(&net->cells_outstanding, atomic_t_wait,
-			 TASK_UNINTERRUPTIBLE);
+	wait_var_event(&net->cells_outstanding,
+		       !atomic_read(&net->cells_outstanding));
 	_leave("");
 }
diff --git a/fs/afs/rxrpc.c b/fs/afs/rxrpc.c
index e1126659f043..e613dd754383 100644
--- a/fs/afs/rxrpc.c
+++ b/fs/afs/rxrpc.c
@@ -103,8 +103,8 @@ void afs_close_socket(struct afs_net *net)
 	}
 
 	_debug("outstanding %u", atomic_read(&net->nr_outstanding_calls));
-	wait_on_atomic_t(&net->nr_outstanding_calls, atomic_t_wait,
-			 TASK_UNINTERRUPTIBLE);
+	wait_var_event(&net->nr_outstanding_calls,
+		       !atomic_read(&net->nr_outstanding_calls));
 	_debug("no outstanding calls");
 
 	kernel_sock_shutdown(net->socket, SHUT_RDWR);
@@ -175,7 +175,7 @@ void afs_put_call(struct afs_call *call)
 		trace_afs_call(call, afs_call_trace_free, 0, o,
 			       __builtin_return_address(0));
 		if (o == 0)
-			wake_up_atomic_t(&net->nr_outstanding_calls);
+			wake_up_var(&net->nr_outstanding_calls);
 	}
 }
 
diff --git a/fs/afs/server.c b/fs/afs/server.c
index 1880f1b6a9f1..a43ef77dabae 100644
--- a/fs/afs/server.c
+++ b/fs/afs/server.c
@@ -25,7 +25,7 @@ static void afs_inc_servers_outstanding(struct afs_net *net)
 static void afs_dec_servers_outstanding(struct afs_net *net)
 {
 	if (atomic_dec_and_test(&net->servers_outstanding))
-		wake_up_atomic_t(&net->servers_outstanding);
+		wake_up_var(&net->servers_outstanding);
 }
 
 /*
@@ -521,8 +521,8 @@ void afs_purge_servers(struct afs_net *net)
 	afs_queue_server_manager(net);
 
 	_debug("wait");
-	wait_on_atomic_t(&net->servers_outstanding, atomic_t_wait,
-			 TASK_UNINTERRUPTIBLE);
+	wait_var_event(&net->servers_outstanding,
+		       !atomic_read(&net->servers_outstanding));
 	_leave("");
 }
 
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index c1618ab9fecf..e0460d7b5622 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3990,7 +3990,7 @@ void btrfs_dec_nocow_writers(struct btrfs_fs_info *fs_info, u64 bytenr)
 	bg = btrfs_lookup_block_group(fs_info, bytenr);
 	ASSERT(bg);
 	if (atomic_dec_and_test(&bg->nocow_writers))
-		wake_up_atomic_t(&bg->nocow_writers);
+		wake_up_var(&bg->nocow_writers);
 	/*
 	 * Once for our lookup and once for the lookup done by a previous call
 	 * to btrfs_inc_nocow_writers()
@@ -4001,8 +4001,7 @@ void btrfs_dec_nocow_writers(struct btrfs_fs_info *fs_info, u64 bytenr)
 
 void btrfs_wait_nocow_writers(struct btrfs_block_group_cache *bg)
 {
-	wait_on_atomic_t(&bg->nocow_writers, atomic_t_wait,
-			 TASK_UNINTERRUPTIBLE);
+	wait_var_event(&bg->nocow_writers, !atomic_read(&bg->nocow_writers));
 }
 
 static const char *alloc_name(u64 flags)
@@ -6526,7 +6525,7 @@ void btrfs_dec_block_group_reservations(struct btrfs_fs_info *fs_info,
 	bg = btrfs_lookup_block_group(fs_info, start);
 	ASSERT(bg);
 	if (atomic_dec_and_test(&bg->reservations))
-		wake_up_atomic_t(&bg->reservations);
+		wake_up_var(&bg->reservations);
 	btrfs_put_block_group(bg);
 }
 
@@ -6552,8 +6551,7 @@ void btrfs_wait_block_group_reservations(struct btrfs_block_group_cache *bg)
 	down_write(&space_info->groups_sem);
 	up_write(&space_info->groups_sem);
 
-	wait_on_atomic_t(&bg->reservations, atomic_t_wait,
-			 TASK_UNINTERRUPTIBLE);
+	wait_var_event(&bg->reservations, !atomic_read(&bg->reservations));
 }
 
 /**
@@ -11061,7 +11059,7 @@ void btrfs_wait_for_snapshot_creation(struct btrfs_root *root)
 		ret = btrfs_start_write_no_snapshotting(root);
 		if (ret)
 			break;
-		wait_on_atomic_t(&root->will_be_snapshotted, atomic_t_wait,
-				 TASK_UNINTERRUPTIBLE);
+		wait_var_event(&root->will_be_snapshotted,
+			       !atomic_read(&root->will_be_snapshotted));
 	}
 }
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 111ee282b777..3278ae592a2c 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -723,7 +723,7 @@ static int create_snapshot(struct btrfs_root *root, struct inode *dir,
 	btrfs_subvolume_release_metadata(fs_info, &pending_snapshot->block_rsv);
 dec_and_free:
 	if (atomic_dec_and_test(&root->will_be_snapshotted))
-		wake_up_atomic_t(&root->will_be_snapshotted);
+		wake_up_var(&root->will_be_snapshotted);
 free_pending:
 	kfree(pending_snapshot->root_item);
 	btrfs_free_path(pending_snapshot->path);
diff --git a/fs/fscache/cookie.c b/fs/fscache/cookie.c
index ff84258132bb..d705125665f0 100644
--- a/fs/fscache/cookie.c
+++ b/fs/fscache/cookie.c
@@ -557,9 +557,10 @@ void __fscache_disable_cookie(struct fscache_cookie *cookie, bool invalidate)
 	 * n_active reaches 0).  This makes sure outstanding reads and writes
 	 * have completed.
 	 */
-	if (!atomic_dec_and_test(&cookie->n_active))
-		wait_on_atomic_t(&cookie->n_active, atomic_t_wait,
-				 TASK_UNINTERRUPTIBLE);
+	if (!atomic_dec_and_test(&cookie->n_active)) {
+		wait_var_event(&cookie->n_active,
+			       !atomic_read(&cookie->n_active));
+	}
 
 	/* Make sure any pending writes are cancelled. */
 	if (cookie->def->type != FSCACHE_COOKIE_TYPE_INDEX)
diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index 7d893543cf3b..d17a90c4fa37 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -85,11 +85,6 @@ int nfs_wait_bit_killable(struct wait_bit_key *key, int mode)
 }
 EXPORT_SYMBOL_GPL(nfs_wait_bit_killable);
 
-int nfs_wait_atomic_killable(atomic_t *p, unsigned int mode)
-{
-	return nfs_wait_killable(mode);
-}
-
 /**
  * nfs_compat_user_ino64 - returns the user-visible inode number
  * @fileid: 64-bit fileid
diff --git a/fs/nfs/pagelist.c b/fs/nfs/pagelist.c
index 18a7626ac638..67d19cd92e44 100644
--- a/fs/nfs/pagelist.c
+++ b/fs/nfs/pagelist.c
@@ -98,8 +98,8 @@ nfs_page_free(struct nfs_page *p)
 int
 nfs_iocounter_wait(struct nfs_lock_context *l_ctx)
 {
-	return wait_on_atomic_t(&l_ctx->io_count, nfs_wait_atomic_killable,
-			TASK_KILLABLE);
+	return wait_var_event_killable(&l_ctx->io_count,
+				       !atomic_read(&l_ctx->io_count));
 }
 
 /**
@@ -395,7 +395,7 @@ static void nfs_clear_request(struct nfs_page *req)
 	}
 	if (l_ctx != NULL) {
 		if (atomic_dec_and_test(&l_ctx->io_count)) {
-			wake_up_atomic_t(&l_ctx->io_count);
+			wake_up_var(&l_ctx->io_count);
 			if (test_bit(NFS_CONTEXT_UNLOCK, &ctx->flags))
 				rpc_wake_up(&NFS_SERVER(d_inode(ctx->dentry))->uoc_rpcwaitq);
 		}
diff --git a/fs/nfs/pnfs_nfs.c b/fs/nfs/pnfs_nfs.c
index 03aaa60c7768..32ba2d471853 100644
--- a/fs/nfs/pnfs_nfs.c
+++ b/fs/nfs/pnfs_nfs.c
@@ -245,7 +245,7 @@ pnfs_generic_commit_cancel_empty_pagelist(struct list_head *pages,
 {
 	if (list_empty(pages)) {
 		if (atomic_dec_and_test(&cinfo->mds->rpcs_out))
-			wake_up_atomic_t(&cinfo->mds->rpcs_out);
+			wake_up_var(&cinfo->mds->rpcs_out);
 		/* don't call nfs_commitdata_release - it tries to put
 		 * the open_context which is not acquired until nfs_init_commit
 		 * which has not been called on @data */
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 7428a669d7a7..fd805771ea2f 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -1620,8 +1620,8 @@ static void nfs_writeback_result(struct rpc_task *task,
 
 static int wait_on_commit(struct nfs_mds_commit_info *cinfo)
 {
-	return wait_on_atomic_t(&cinfo->rpcs_out,
-			nfs_wait_atomic_killable, TASK_KILLABLE);
+	return wait_var_event_killable(&cinfo->rpcs_out,
+				       !atomic_read(&cinfo->rpcs_out));
 }
 
 static void nfs_commit_begin(struct nfs_mds_commit_info *cinfo)
@@ -1632,7 +1632,7 @@ static void nfs_commit_begin(struct nfs_mds_commit_info *cinfo)
 static void nfs_commit_end(struct nfs_mds_commit_info *cinfo)
 {
 	if (atomic_dec_and_test(&cinfo->rpcs_out))
-		wake_up_atomic_t(&cinfo->rpcs_out);
+		wake_up_var(&cinfo->rpcs_out);
 }
 
 void nfs_commitdata_release(struct nfs_commit_data *data)
diff --git a/fs/ocfs2/filecheck.c b/fs/ocfs2/filecheck.c
index e87279e49ba3..6b92cb241138 100644
--- a/fs/ocfs2/filecheck.c
+++ b/fs/ocfs2/filecheck.c
@@ -134,9 +134,10 @@ ocfs2_filecheck_sysfs_free(struct ocfs2_filecheck_sysfs_entry *entry)
 {
 	struct ocfs2_filecheck_entry *p;
 
-	if (!atomic_dec_and_test(&entry->fs_count))
-		wait_on_atomic_t(&entry->fs_count, atomic_t_wait,
-				 TASK_UNINTERRUPTIBLE);
+	if (!atomic_dec_and_test(&entry->fs_count)) {
+		wait_var_event(&entry->fs_count,
+			       !atomic_read(&entry->fs_count));
+	}
 
 	spin_lock(&entry->fs_fcheck->fc_lock);
 	while (!list_empty(&entry->fs_fcheck->fc_head)) {
@@ -183,7 +184,7 @@ static void
 ocfs2_filecheck_sysfs_put(struct ocfs2_filecheck_sysfs_entry *entry)
 {
 	if (atomic_dec_and_test(&entry->fs_count))
-		wake_up_atomic_t(&entry->fs_count);
+		wake_up_var(&entry->fs_count);
 }
 
 static struct ocfs2_filecheck_sysfs_entry *
diff --git a/include/linux/fscache-cache.h b/include/linux/fscache-cache.h
index 4c467ef50159..3b03e29e2f1a 100644
--- a/include/linux/fscache-cache.h
+++ b/include/linux/fscache-cache.h
@@ -496,7 +496,7 @@ static inline bool __fscache_unuse_cookie(struct fscache_cookie *cookie)
 
 static inline void __fscache_wake_unused_cookie(struct fscache_cookie *cookie)
 {
-	wake_up_atomic_t(&cookie->n_active);
+	wake_up_var(&cookie->n_active);
 }
 
 /**
diff --git a/include/linux/wait_bit.h b/include/linux/wait_bit.h
index 61b39eaf7cad..9318b2166439 100644
--- a/include/linux/wait_bit.h
+++ b/include/linux/wait_bit.h
@@ -10,7 +10,6 @@
 struct wait_bit_key {
 	void			*flags;
 	int			bit_nr;
-#define WAIT_ATOMIC_T_BIT_NR	-1
 	unsigned long		timeout;
 };
 
@@ -22,21 +21,15 @@ struct wait_bit_queue_entry {
 #define __WAIT_BIT_KEY_INITIALIZER(word, bit)					\
 	{ .flags = word, .bit_nr = bit, }
 
-#define __WAIT_ATOMIC_T_KEY_INITIALIZER(p)					\
-	{ .flags = p, .bit_nr = WAIT_ATOMIC_T_BIT_NR, }
-
 typedef int wait_bit_action_f(struct wait_bit_key *key, int mode);
-typedef int wait_atomic_t_action_f(atomic_t *counter, unsigned int mode);
 
 void __wake_up_bit(struct wait_queue_head *wq_head, void *word, int bit);
 int __wait_on_bit(struct wait_queue_head *wq_head, struct wait_bit_queue_entry *wbq_entry, wait_bit_action_f *action, unsigned int mode);
 int __wait_on_bit_lock(struct wait_queue_head *wq_head, struct wait_bit_queue_entry *wbq_entry, wait_bit_action_f *action, unsigned int mode);
 void wake_up_bit(void *word, int bit);
-void wake_up_atomic_t(atomic_t *p);
 int out_of_line_wait_on_bit(void *word, int, wait_bit_action_f *action, unsigned int mode);
 int out_of_line_wait_on_bit_timeout(void *word, int, wait_bit_action_f *action, unsigned int mode, unsigned long timeout);
 int out_of_line_wait_on_bit_lock(void *word, int, wait_bit_action_f *action, unsigned int mode);
-int out_of_line_wait_on_atomic_t(atomic_t *p, wait_atomic_t_action_f action, unsigned int mode);
 struct wait_queue_head *bit_waitqueue(void *word, int bit);
 extern void __init wait_bit_init(void);
 
@@ -57,7 +50,6 @@ extern int bit_wait(struct wait_bit_key *key, int mode);
 extern int bit_wait_io(struct wait_bit_key *key, int mode);
 extern int bit_wait_timeout(struct wait_bit_key *key, int mode);
 extern int bit_wait_io_timeout(struct wait_bit_key *key, int mode);
-extern int atomic_t_wait(atomic_t *counter, unsigned int mode);
 
 /**
  * wait_on_bit - wait for a bit to be cleared
@@ -243,23 +235,74 @@ wait_on_bit_lock_action(unsigned long *word, int bit, wait_bit_action_f *action,
 	return out_of_line_wait_on_bit_lock(word, bit, action, mode);
 }
 
-/**
- * wait_on_atomic_t - Wait for an atomic_t to become 0
- * @val: The atomic value being waited on, a kernel virtual address
- * @action: the function used to sleep, which may take special actions
- * @mode: the task state to sleep in
- *
- * Wait for an atomic_t to become 0.  We abuse the bit-wait waitqueue table for
- * the purpose of getting a waitqueue, but we set the key to a bit number
- * outside of the target 'word'.
- */
-static inline
-int wait_on_atomic_t(atomic_t *val, wait_atomic_t_action_f action, unsigned mode)
-{
-	might_sleep();
-	if (atomic_read(val) == 0)
-		return 0;
-	return out_of_line_wait_on_atomic_t(val, action, mode);
-}
+extern void init_wait_var_entry(struct wait_bit_queue_entry *wbq_entry, void *var, int flags);
+extern void wake_up_var(void *var);
+extern wait_queue_head_t *__var_waitqueue(void *p);
+
+#define ___wait_var_event(var, condition, state, exclusive, ret, cmd)	\
+({									\
+	__label__ __out;						\
+	struct wait_queue_head *__wq_head = __var_waitqueue(var);	\
+	struct wait_bit_queue_entry __wbq_entry;			\
+	long __ret = ret; /* explicit shadow */				\
+									\
+	init_wait_var_entry(&__wbq_entry, var,				\
+			    exclusive ? WQ_FLAG_EXCLUSIVE : 0);		\
+	for (;;) {							\
+		long __int = prepare_to_wait_event(__wq_head,		\
+						   &__wbq_entry.wq_entry, \
+						   state);		\
+		if (condition)						\
+			break;						\
+									\
+		if (___wait_is_interruptible(state) && __int) {		\
+			__ret = __int;					\
+			goto __out;					\
+		}							\
+									\
+		cmd;							\
+	}								\
+	finish_wait(__wq_head, &__wbq_entry.wq_entry);			\
+__out:	__ret;								\
+})
+
+#define __wait_var_event(var, condition)				\
+	___wait_var_event(var, condition, TASK_UNINTERRUPTIBLE, 0, 0,	\
+			  schedule())
+
+#define wait_var_event(var, condition)					\
+do {									\
+	might_sleep();							\
+	if (condition)							\
+		break;							\
+	__wait_var_event(var, condition);				\
+} while (0)
+
+#define __wait_var_event_killable(var, condition)			\
+	___wait_var_event(var, condition, TASK_KILLABLE, 0, 0,		\
+			  schedule())
+
+#define wait_var_event_killable(var, condition)				\
+({									\
+	int __ret = 0;							\
+	might_sleep();							\
+	if (!(condition))						\
+		__ret = __wait_var_event_killable(var, condition);	\
+	__ret;								\
+})
+
+#define __wait_var_event_timeout(var, condition, timeout)		\
+	___wait_var_event(var, ___wait_cond_timeout(condition),		\
+			  TASK_UNINTERRUPTIBLE, 0, timeout,		\
+			  __ret = schedule_timeout(__ret))
+
+#define wait_var_event_timeout(var, condition, timeout)			\
+({									\
+	long __ret = timeout;						\
+	might_sleep();							\
+	if (!___wait_cond_timeout(condition))				\
+		__ret = __wait_var_event_timeout(var, condition, timeout); \
+	__ret;								\
+})
 
 #endif /* _LINUX_WAIT_BIT_H */
diff --git a/kernel/sched/wait_bit.c b/kernel/sched/wait_bit.c
index 4239c78f5cd3..60a84f5a6cb4 100644
--- a/kernel/sched/wait_bit.c
+++ b/kernel/sched/wait_bit.c
@@ -149,12 +149,7 @@ void wake_up_bit(void *word, int bit)
 }
 EXPORT_SYMBOL(wake_up_bit);
 
-/*
- * Manipulate the atomic_t address to produce a better bit waitqueue table hash
- * index (we're keying off bit -1, but that would produce a horrible hash
- * value).
- */
-static inline wait_queue_head_t *atomic_t_waitqueue(atomic_t *p)
+wait_queue_head_t *__var_waitqueue(void *p)
 {
 	if (BITS_PER_LONG == 64) {
 		unsigned long q = (unsigned long)p;
@@ -163,92 +158,44 @@ static inline wait_queue_head_t *atomic_t_waitqueue(atomic_t *p)
 	}
 	return bit_waitqueue(p, 0);
 }
+EXPORT_SYMBOL(__var_waitqueue);
 
-static int wake_atomic_t_function(struct wait_queue_entry *wq_entry, unsigned mode, int sync,
-				  void *arg)
+static int
+var_wake_function(struct wait_queue_entry *wq_entry, unsigned int mode,
+		  int sync, void *arg)
 {
 	struct wait_bit_key *key = arg;
-	struct wait_bit_queue_entry *wait_bit = container_of(wq_entry, struct wait_bit_queue_entry, wq_entry);
-	atomic_t *val = key->flags;
+	struct wait_bit_queue_entry *wbq_entry =
+		container_of(wq_entry, struct wait_bit_queue_entry, wq_entry);
 
-	if (wait_bit->key.flags != key->flags ||
-	    wait_bit->key.bit_nr != key->bit_nr ||
-	    atomic_read(val) != 0)
+	if (wbq_entry->key.flags != key->flags ||
+	    wbq_entry->key.bit_nr != key->bit_nr)
 		return 0;
 
 	return autoremove_wake_function(wq_entry, mode, sync, key);
 }
 
-/*
- * To allow interruptible waiting and asynchronous (i.e. nonblocking) waiting,
- * the actions of __wait_on_atomic_t() are permitted return codes.  Nonzero
- * return codes halt waiting and return.
- */
-static __sched
-int __wait_on_atomic_t(struct wait_queue_head *wq_head, struct wait_bit_queue_entry *wbq_entry,
-		       wait_atomic_t_action_f action, unsigned int mode)
+void init_wait_var_entry(struct wait_bit_queue_entry *wbq_entry, void *var, int flags)
 {
-	atomic_t *val;
-	int ret = 0;
-
-	do {
-		prepare_to_wait(wq_head, &wbq_entry->wq_entry, mode);
-		val = wbq_entry->key.flags;
-		if (atomic_read(val) == 0)
-			break;
-		ret = (*action)(val, mode);
-	} while (!ret && atomic_read(val) != 0);
-	finish_wait(wq_head, &wbq_entry->wq_entry);
-
-	return ret;
+	*wbq_entry = (struct wait_bit_queue_entry){
+		.key = {
+			.flags	= (var),
+			.bit_nr = -1,
+		},
+		.wq_entry = {
+			.private = current,
+			.func	 = var_wake_function,
+			.entry	 = LIST_HEAD_INIT(wbq_entry->wq_entry.entry),
+		},
+	};
 }
+EXPORT_SYMBOL(init_wait_var_entry);
 
-#define DEFINE_WAIT_ATOMIC_T(name, p)					\
-	struct wait_bit_queue_entry name = {				\
-		.key = __WAIT_ATOMIC_T_KEY_INITIALIZER(p),		\
-		.wq_entry = {						\
-			.private	= current,			\
-			.func		= wake_atomic_t_function,	\
-			.entry		=				\
-				LIST_HEAD_INIT((name).wq_entry.entry),	\
-		},							\
-	}
-
-__sched int out_of_line_wait_on_atomic_t(atomic_t *p,
-					 wait_atomic_t_action_f action,
-					 unsigned int mode)
-{
-	struct wait_queue_head *wq_head = atomic_t_waitqueue(p);
-	DEFINE_WAIT_ATOMIC_T(wq_entry, p);
-
-	return __wait_on_atomic_t(wq_head, &wq_entry, action, mode);
-}
-EXPORT_SYMBOL(out_of_line_wait_on_atomic_t);
-
-__sched int atomic_t_wait(atomic_t *counter, unsigned int mode)
-{
-	schedule();
-	if (signal_pending_state(mode, current))
-		return -EINTR;
-
-	return 0;
-}
-EXPORT_SYMBOL(atomic_t_wait);
-
-/**
- * wake_up_atomic_t - Wake up a waiter on a atomic_t
- * @p: The atomic_t being waited on, a kernel virtual address
- *
- * Wake up anyone waiting for the atomic_t to go to zero.
- *
- * Abuse the bit-waker function and its waitqueue hash table set (the atomic_t
- * check is done by the waiter's wake function, not the by the waker itself).
- */
-void wake_up_atomic_t(atomic_t *p)
+void wake_up_var(void *var)
 {
-	__wake_up_bit(atomic_t_waitqueue(p), p, WAIT_ATOMIC_T_BIT_NR);
+	__wake_up_bit(__var_waitqueue(var), var, -1);
 }
-EXPORT_SYMBOL(wake_up_atomic_t);
+EXPORT_SYMBOL(wake_up_var);
 
 __sched int bit_wait(struct wait_bit_key *word, int mode)
 {

* Re: [RFC][PATCH] sched/wait_bit: Introduce wait_var_event()/wake_up_var()
  2018-03-13 10:20       ` [RFC][PATCH] sched/wait_bit: Introduce wait_var_event()/wake_up_var() Peter Zijlstra
@ 2018-03-14  4:12         ` Dan Williams
  2018-03-15  5:46         ` Dan Williams
  1 sibling, 0 replies; 32+ messages in thread
From: Dan Williams @ 2018-03-14  4:12 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: David Howells, Jan Kara, linux-nvdimm, david,
	Linux Kernel Mailing List, Ralf Baechle, linux-xfs, Ingo Molnar,
	linux-fsdevel, Christoph Hellwig

On Tue, Mar 13, 2018 at 3:20 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Sun, Mar 11, 2018 at 10:15:55AM -0700, Dan Williams wrote:
>> On Sun, Mar 11, 2018 at 4:27 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>> > On Fri, Mar 09, 2018 at 10:55:32PM -0800, Dan Williams wrote:
>> >> Add a generic facility for awaiting an atomic_t to reach a value of 1.
>> >>
>> >> Page reference counts typically need to reach 0 to be considered a
>> >> free / inactive page. However, ZONE_DEVICE pages allocated via
>> >> devm_memremap_pages() are never 'onlined', i.e. the put_page() typically
>> >> done at init time to assign pages to the page allocator is skipped.
>> >>
>> >> These pages will have their reference count elevated > 1 by
>> >> get_user_pages() when they are under DMA. In order to coordinate DMA to
>> >> these pages vs filesystem operations like hole-punch and truncate, the
>> >> filesystem-dax implementation needs to capture the DMA-idle event (i.e.
>> >> the 2 to 1 count transition).
>> >>
>> >> For now, this implementation does not have functional behavior change,
>> >> follow-on patches will add waiters for these page-idle events.
>> >
>> > Argh, no no no.. That whole wait_for_atomic_t thing is a giant
>> > trainwreck already and now you're making it worse still.
>> >
>> > Please have a look here:
>> >
>> >   https://lkml.kernel.org/r/20171101190644.chwhfpoz3ywxx2m7@hirez.programming.kicks-ass.net
>>
>> That thread seems to be worried about the object disappearing the
>> moment its reference count reaches a target. That isn't the case with
>> the memmap / struct page objects for ZONE_DEVICE pages. I understand
>> wait_for_atomic_one() is broken in the general case, but as far as I
>> can see it works fine specifically for ZONE_DEVICE page busy tracking,
>> just not generic object lifetime.
>
> How's this? Compile-tested (x86_64-allmodconfig) only.
>
> This allows you to write:
>
>         wait_var_event(&your_atomic, atomic_read(&your_atomic) == 1);

Nice!

I'll give this a shot. I will need to add
wait_var_event_interruptible(), but other than that it looks workable
to me.
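
Such a wait_var_event_interruptible() follows directly from the
killable variant in the patch above; a sketch by analogy (not part of
the posted patch):

#define __wait_var_event_interruptible(var, condition)			\
	___wait_var_event(var, condition, TASK_INTERRUPTIBLE, 0, 0,	\
			  schedule())

#define wait_var_event_interruptible(var, condition)			\
({									\
	int __ret = 0;							\
	might_sleep();							\
	if (!(condition))						\
		__ret = __wait_var_event_interruptible(var, condition);	\
	__ret;								\
})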

* Re: [RFC][PATCH] sched/wait_bit: Introduce wait_var_event()/wake_up_var()
  2018-03-13 10:20       ` [RFC][PATCH] sched/wait_bit: Introduce wait_var_event()/wake_up_var() Peter Zijlstra
  2018-03-14  4:12         ` Dan Williams
@ 2018-03-15  5:46         ` Dan Williams
  1 sibling, 0 replies; 32+ messages in thread
From: Dan Williams @ 2018-03-15  5:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: David Howells, Jan Kara, linux-nvdimm, david,
	Linux Kernel Mailing List, Ralf Baechle, linux-xfs, Ingo Molnar,
	linux-fsdevel, Christoph Hellwig

On Tue, Mar 13, 2018 at 3:20 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Sun, Mar 11, 2018 at 10:15:55AM -0700, Dan Williams wrote:
>> On Sun, Mar 11, 2018 at 4:27 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>> > On Fri, Mar 09, 2018 at 10:55:32PM -0800, Dan Williams wrote:
>> >> Add a generic facility for awaiting an atomic_t to reach a value of 1.
>> >>
>> >> Page reference counts typically need to reach 0 to be considered a
>> >> free / inactive page. However, ZONE_DEVICE pages allocated via
>> >> devm_memremap_pages() are never 'onlined', i.e. the put_page() typically
>> >> done at init time to assign pages to the page allocator is skipped.
>> >>
>> >> These pages will have their reference count elevated > 1 by
>> >> get_user_pages() when they are under DMA. In order to coordinate DMA to
>> >> these pages vs filesystem operations like hole-punch and truncate, the
>> >> filesystem-dax implementation needs to capture the DMA-idle event (i.e.
>> >> the 2 to 1 count transition).
>> >>
>> >> For now, this implementation does not have functional behavior change,
>> >> follow-on patches will add waiters for these page-idle events.
>> >
>> > Argh, no no no.. That whole wait_for_atomic_t thing is a giant
>> > trainwreck already and now you're making it worse still.
>> >
>> > Please have a look here:
>> >
>> >   https://lkml.kernel.org/r/20171101190644.chwhfpoz3ywxx2m7@hirez.programming.kicks-ass.net
>>
>> That thread seems to be worried about the object disappearing the
>> moment its reference count reaches a target. That isn't the case with
>> the memmap / struct page objects for ZONE_DEVICE pages. I understand
>> wait_for_atomic_one() is broken in the general case, but as far as I
>> can see it works fine specifically for ZONE_DEVICE page busy tracking,
>> just not generic object lifetime.
>
> How's this? Compile-tested (x86_64-allmodconfig) only.
>
> This allows you to write:
>
>         wait_var_event(&your_atomic, atomic_read(&your_atomic) == 1);

This works for me, you can add

Tested-by: Dan Williams <dan.j.williams@intel.com>

...to the upstream version.

Can we add this new api in an immutable commit tip/sched/core tree, so
I can base my fix on it? The wait_for_atomic_t removal can then come
in follow-on patches.

* Re: [RFC][PATCH] sched/wait_bit: Introduce wait_var_event()/wake_up_var()
  2018-03-11 17:15     ` Dan Williams
  2018-03-12 19:32       ` Dan Williams
  2018-03-13 10:20       ` [RFC][PATCH] sched/wait_bit: Introduce wait_var_event()/wake_up_var() Peter Zijlstra
@ 2018-03-15  9:58       ` David Howells
  2018-03-15 11:19         ` Peter Zijlstra
                           ` (2 more replies)
  2 siblings, 3 replies; 32+ messages in thread
From: David Howells @ 2018-03-15  9:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-xfs, Jan Kara, linux-nvdimm, david,
	Linux Kernel Mailing List, Ralf Baechle, dhowells, Ingo Molnar,
	linux-fsdevel, Christoph Hellwig

Peter Zijlstra <peterz@infradead.org> wrote:

> > > Argh, no no no.. That whole wait_for_atomic_t thing is a giant
> > > trainwreck already and now you're making it worse still.

Your patch description needs to say why this isn't a trainwreck when you
consider wait_for_atomic_t() to be one since it does things in a very similar
way.

David

* Re: [RFC][PATCH] sched/wait_bit: Introduce wait_var_event()/wake_up_var()
  2018-03-15  9:58       ` David Howells
@ 2018-03-15 11:19         ` Peter Zijlstra
  2018-03-15 11:51         ` Peter Zijlstra
  2018-03-15 14:45         ` David Howells
  2 siblings, 0 replies; 32+ messages in thread
From: Peter Zijlstra @ 2018-03-15 11:19 UTC (permalink / raw)
  To: David Howells
  Cc: Jan Kara, linux-nvdimm, david, Linux Kernel Mailing List,
	Ralf Baechle, linux-xfs, Ingo Molnar, linux-fsdevel,
	Christoph Hellwig

On Thu, Mar 15, 2018 at 09:58:42AM +0000, David Howells wrote:
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > > > Argh, no no no.. That whole wait_for_atomic_t thing is a giant
> > > > trainwreck already and now you're making it worse still.
> 
> Your patch description needs to say why this isn't a trainwreck when you
> consider wait_for_atomic_t() to be one since it does things in a very similar
> way.

Yeah, still writing changelogs..

* Re: [RFC][PATCH] sched/wait_bit: Introduce wait_var_event()/wake_up_var()
  2018-03-15  9:58       ` David Howells
  2018-03-15 11:19         ` Peter Zijlstra
@ 2018-03-15 11:51         ` Peter Zijlstra
  2018-03-15 14:45         ` David Howells
  2 siblings, 0 replies; 32+ messages in thread
From: Peter Zijlstra @ 2018-03-15 11:51 UTC (permalink / raw)
  To: David Howells
  Cc: Jan Kara, linux-nvdimm, david, Linux Kernel Mailing List,
	Ralf Baechle, linux-xfs, Ingo Molnar, linux-fsdevel,
	Christoph Hellwig

On Thu, Mar 15, 2018 at 09:58:42AM +0000, David Howells wrote:
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > > > Argh, no no no.. That whole wait_for_atomic_t thing is a giant
> > > > trainwreck already and now you're making it worse still.
> 
> Your patch description needs to say why this isn't a trainwreck when you
> consider wait_for_atomic_t() to be one since it does things in a very similar
> way.

Does the below address things sufficiently clearly?

---

Subject: sched/wait: Introduce wait_var_event()
From: Peter Zijlstra <peterz@infradead.org>
Date: Thu Mar 15 11:40:33 CET 2018

As a replacement for the wait_on_atomic_t() API provide the
wait_var_event() API.

The wait_var_event() API is based on the very same hashed-waitqueue
idea, but doesn't care about the type (atomic_t) or the specific
condition (atomic_read() == 0). IOW, it's much more widely
applicable/flexible.

It shares all the benefits/disadvantages of a hashed-waitqueue
approach with the existing wait_on_atomic_t/wait_on_bit() APIs.

The API is modeled after the existing wait_event() API, but instead of
taking a wait_queue_head, it takes an address. This address is
hashed to obtain a wait_queue_head from the bit_wait_table.

Similar to the wait_event() API, it takes a condition expression as
second argument and will wait until this expression becomes true.

The following are (mostly) identical replacements:

	wait_on_atomic_t(&my_atomic, atomic_t_wait, TASK_UNINTERRUPTIBLE);
	wake_up_atomic_t(&my_atomic);

	wait_var_event(&my_atomic, !atomic_read(&my_atomic));
	wake_up_var(&my_atomic);

The only difference is that wake_up_var() is an unconditional wakeup
and doesn't check the previously hard-coded (atomic_read() == 0)
condition here. This is of little consequence, since most callers are
already conditional on atomic_dec_and_test(), and the ones that are
not are trivial to make so.
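
As an example, a typical put path converted to the new API only issues
the wakeup on the final reference drop (a sketch with made-up names,
not code from this patch):

	/* drop a reference; only the final put wakes any waiters */
	static void my_obj_put(struct my_obj *obj)
	{
		if (atomic_dec_and_test(&obj->refs))
			wake_up_var(&obj->refs);
	}

	/* elsewhere: block until all references are gone */
	wait_var_event(&obj->refs, !atomic_read(&obj->refs));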

Cc: David Howells <dhowells@redhat.com>
Tested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/wait_bit.h |   70 +++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/wait_bit.c  |   48 ++++++++++++++++++++++++++++++++
 2 files changed, 118 insertions(+)

--- a/include/linux/wait_bit.h
+++ b/include/linux/wait_bit.h
@@ -262,4 +262,74 @@ int wait_on_atomic_t(atomic_t *val, wait
 	return out_of_line_wait_on_atomic_t(val, action, mode);
 }
 
+extern void init_wait_var_entry(struct wait_bit_queue_entry *wbq_entry, void *var, int flags);
+extern void wake_up_var(void *var);
+extern wait_queue_head_t *__var_waitqueue(void *p);
+
+#define ___wait_var_event(var, condition, state, exclusive, ret, cmd)	\
+({									\
+	__label__ __out;						\
+	struct wait_queue_head *__wq_head = __var_waitqueue(var);	\
+	struct wait_bit_queue_entry __wbq_entry;			\
+	long __ret = ret; /* explicit shadow */				\
+									\
+	init_wait_var_entry(&__wbq_entry, var,				\
+			    exclusive ? WQ_FLAG_EXCLUSIVE : 0);		\
+	for (;;) {							\
+		long __int = prepare_to_wait_event(__wq_head,		\
+						   &__wbq_entry.wq_entry, \
+						   state);		\
+		if (condition)						\
+			break;						\
+									\
+		if (___wait_is_interruptible(state) && __int) {		\
+			__ret = __int;					\
+			goto __out;					\
+		}							\
+									\
+		cmd;							\
+	}								\
+	finish_wait(__wq_head, &__wbq_entry.wq_entry);			\
+__out:	__ret;								\
+})
+
+#define __wait_var_event(var, condition)				\
+	___wait_var_event(var, condition, TASK_UNINTERRUPTIBLE, 0, 0,	\
+			  schedule())
+
+#define wait_var_event(var, condition)					\
+do {									\
+	might_sleep();							\
+	if (condition)							\
+		break;							\
+	__wait_var_event(var, condition);				\
+} while (0)
+
+#define __wait_var_event_killable(var, condition)			\
+	___wait_var_event(var, condition, TASK_KILLABLE, 0, 0,		\
+			  schedule())
+
+#define wait_var_event_killable(var, condition)				\
+({									\
+	int __ret = 0;							\
+	might_sleep();							\
+	if (!(condition))						\
+		__ret = __wait_var_event_killable(var, condition);	\
+	__ret;								\
+})
+
+#define __wait_var_event_timeout(var, condition, timeout)		\
+	___wait_var_event(var, ___wait_cond_timeout(condition),		\
+			  TASK_UNINTERRUPTIBLE, 0, timeout,		\
+			  __ret = schedule_timeout(__ret))
+
+#define wait_var_event_timeout(var, condition, timeout)			\
+({									\
+	long __ret = timeout;						\
+	might_sleep();							\
+	if (!___wait_cond_timeout(condition))				\
+		__ret = __wait_var_event_timeout(var, condition, timeout); \
+	__ret;								\
+})
+
 #endif /* _LINUX_WAIT_BIT_H */
--- a/kernel/sched/wait_bit.c
+++ b/kernel/sched/wait_bit.c
@@ -149,6 +149,54 @@ void wake_up_bit(void *word, int bit)
 }
 EXPORT_SYMBOL(wake_up_bit);
 
+wait_queue_head_t *__var_waitqueue(void *p)
+{
+	if (BITS_PER_LONG == 64) {
+		unsigned long q = (unsigned long)p;
+
+		return bit_waitqueue((void *)(q & ~1), q & 1);
+	}
+	return bit_waitqueue(p, 0);
+}
+EXPORT_SYMBOL(__var_waitqueue);
+
+static int
+var_wake_function(struct wait_queue_entry *wq_entry, unsigned int mode,
+		  int sync, void *arg)
+{
+	struct wait_bit_key *key = arg;
+	struct wait_bit_queue_entry *wbq_entry =
+		container_of(wq_entry, struct wait_bit_queue_entry, wq_entry);
+
+	if (wbq_entry->key.flags != key->flags ||
+	    wbq_entry->key.bit_nr != key->bit_nr)
+		return 0;
+
+	return autoremove_wake_function(wq_entry, mode, sync, key);
+}
+
+void init_wait_var_entry(struct wait_bit_queue_entry *wbq_entry, void *var, int flags)
+{
+	*wbq_entry = (struct wait_bit_queue_entry){
+		.key = {
+			.flags	= (var),
+			.bit_nr = -1,
+		},
+		.wq_entry = {
+			.private = current,
+			.func	 = var_wake_function,
+			.entry	 = LIST_HEAD_INIT(wbq_entry->wq_entry.entry),
+		},
+	};
+}
+EXPORT_SYMBOL(init_wait_var_entry);
+
+void wake_up_var(void *var)
+{
+	__wake_up_bit(__var_waitqueue(var), var, -1);
+}
+EXPORT_SYMBOL(wake_up_var);
+
 /*
  * Manipulate the atomic_t address to produce a better bit waitqueue table hash
  * index (we're keying off bit -1, but that would produce a horrible hash

* Re: [RFC][PATCH] sched/wait_bit: Introduce wait_var_event()/wake_up_var()
  2018-03-15  9:58       ` David Howells
  2018-03-15 11:19         ` Peter Zijlstra
  2018-03-15 11:51         ` Peter Zijlstra
@ 2018-03-15 14:45         ` David Howells
  2018-03-15 14:53           ` Peter Zijlstra
  2 siblings, 1 reply; 32+ messages in thread
From: David Howells @ 2018-03-15 14:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-xfs, Jan Kara, linux-nvdimm, david,
	Linux Kernel Mailing List, Ralf Baechle, dhowells, Ingo Molnar,
	linux-fsdevel, Christoph Hellwig

Peter Zijlstra <peterz@infradead.org> wrote:

> Does the below address things sufficiently clearly?

Yep.

> +wait_queue_head_t *__var_waitqueue(void *p)
> +{
> +	if (BITS_PER_LONG == 64) {
> +		unsigned long q = (unsigned long)p;
> +
> +		return bit_waitqueue((void *)(q & ~1), q & 1);
> +	}
> +	return bit_waitqueue(p, 0);
> +}

You might be better off not using bit_waitqueue() but rather doing the
calculation directly, since you don't actually have a bit number.

David

* Re: [RFC][PATCH] sched/wait_bit: Introduce wait_var_event()/wake_up_var()
  2018-03-15 14:45         ` David Howells
@ 2018-03-15 14:53           ` Peter Zijlstra
  0 siblings, 0 replies; 32+ messages in thread
From: Peter Zijlstra @ 2018-03-15 14:53 UTC (permalink / raw)
  To: David Howells
  Cc: Jan Kara, linux-nvdimm, david, Linux Kernel Mailing List,
	Ralf Baechle, linux-xfs, Ingo Molnar, linux-fsdevel,
	Christoph Hellwig

On Thu, Mar 15, 2018 at 02:45:20PM +0000, David Howells wrote:
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > Does the below address things sufficiently clearly?
> 
> Yep.

Thanks!

> > +wait_queue_head_t *__var_waitqueue(void *p)
> > +{
> > +	if (BITS_PER_LONG == 64) {
> > +		unsigned long q = (unsigned long)p;
> > +
> > +		return bit_waitqueue((void *)(q & ~1), q & 1);
> > +	}
> > +	return bit_waitqueue(p, 0);
> > +}
> 
> You might be better off not using bit_waitqueue() but rather do the
> calculation directly since you don't actually have a bit number.

Yes, I did that in patch 11. The initial version uses the exact same
hashing that wait_on_atomic_t() uses, to avoid spurious changes.
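
For reference, a direct calculation could hash the variable address
itself, along the lines of the sketch below; it assumes the hashed
wait-table internals of kernel/sched/wait_bit.c (bit_wait_table,
WAIT_TABLE_BITS) and is an illustration, not necessarily the patch 11
code:

	#include <linux/hash.h>

	wait_queue_head_t *__var_waitqueue(void *p)
	{
		/* hash the address directly; no (word, bit) pair needed */
		return bit_wait_table + hash_ptr(p, WAIT_TABLE_BITS);
	}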

Thread overview: 32+ messages
2018-03-10  6:54 [PATCH v5 00/11] dax: fix dma vs truncate/hole-punch Dan Williams
2018-03-10  6:54 ` [PATCH v5 01/11] dax: store pfns in the radix Dan Williams
2018-03-10  6:54 ` [PATCH v5 02/11] xfs, dax: introduce xfs_dax_aops Dan Williams
2018-03-10  9:46   ` Christoph Hellwig
2018-03-10 17:40     ` Dan Williams
2018-03-11 19:16       ` Dan Williams
2018-03-12  7:51         ` Christoph Hellwig
2018-03-10  6:55 ` [PATCH v5 03/11] ext4, dax: introduce ext4_dax_aops Dan Williams
2018-03-10  6:55 ` [PATCH v5 04/11] ext2, dax: introduce ext2_dax_aops Dan Williams
2018-03-10  6:55 ` [PATCH v5 05/11] fs, dax: use page->mapping to warn if truncate collides with a busy page Dan Williams
2018-03-10  6:55 ` [PATCH v5 06/11] mm, dax: enable filesystems to trigger dev_pagemap ->page_free callbacks Dan Williams
2018-03-12 14:09   ` Jerome Glisse
2018-03-10  6:55 ` [PATCH v5 07/11] mm, dev_pagemap: introduce CONFIG_DEV_PAGEMAP_OPS Dan Williams
2018-03-12 14:17   ` Jerome Glisse
2018-03-12 18:17     ` Dan Williams
2018-03-10  6:55 ` [PATCH v5 08/11] wait_bit: introduce {wait_on,wake_up}_atomic_one Dan Williams
2018-03-11 11:27   ` [PATCH v5 08/11] wait_bit: introduce {wait_on, wake_up}_atomic_one Peter Zijlstra
2018-03-11 17:15     ` Dan Williams
2018-03-12 19:32       ` Dan Williams
2018-03-13 10:20       ` [RFC][PATCH] sched/wait_bit: Introduce wait_var_event()/wake_up_var() Peter Zijlstra
2018-03-14  4:12         ` Dan Williams
2018-03-15  5:46         ` Dan Williams
2018-03-15  9:58       ` David Howells
2018-03-15 11:19         ` Peter Zijlstra
2018-03-15 11:51         ` Peter Zijlstra
2018-03-15 14:45         ` David Howells
2018-03-15 14:53           ` Peter Zijlstra
2018-03-10  6:55 ` [PATCH v5 09/11] mm, fs, dax: handle layout changes to pinned dax mappings Dan Williams
2018-03-10  6:55 ` [PATCH v5 10/11] xfs: prepare xfs_break_layouts() for another layout type Dan Williams
2018-03-10  9:51   ` Christoph Hellwig
2018-03-10  6:55 ` [PATCH v5 11/11] xfs, dax: introduce xfs_break_dax_layouts() Dan Williams
2018-03-10  9:55   ` Christoph Hellwig
