All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v5 00/17] re-enable DAX PMD support
@ 2016-10-07 21:08 ` Ross Zwisler
  0 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-07 21:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: Theodore Ts'o, Matthew Wilcox, Dave Chinner, linux-nvdimm,
	Christoph Hellwig, linux-xfs, linux-mm, Andreas Dilger,
	Alexander Viro, Jan Kara, linux-fsdevel, linux-ext4,
	Andrew Morton

DAX PMDs have been disabled since Jan Kara introduced DAX radix tree based
locking.  This series allows DAX PMDs to participate in the DAX radix tree
based locking scheme so that they can be re-enabled.

Dave, can you please take this through the XFS tree as we discussed during
the v4 review?

Changes since v4:
 - Reworked the DAX flags handling to simplify things and get rid of
   RADIX_DAX_PTE. (Jan & Christoph)
 - Moved RADIX_DAX_* macros to be inline functions in include/linux/dax.h.
   (Christoph)
 - Got rid of unneeded macros RADIX_DAX_HZP_ENTRY() and
   RADIX_DAX_EMPTY_ENTRY(), and instead just pass arbitrary flags to
   radix_dax_entry().
 - Re-ordered the arguments to dax_wake_mapping_entry_waiter() to be more
   consistent with the rest of the code. (Jan)
 - Moved radix_dax_order() inside of the #ifdef CONFIG_FS_DAX_PMD block.
   This was causing a build error on various systems that don't define
   PMD_SHIFT.
 - Patch 5 fixes what I believe is a missing error return in
   ext2_iomap_end().
 - Fixed the page_start calculation for PMDs that was previously found in
   dax_entry_start().  (Jan)  This code is now included directly in
   dax_entry_waitqueue().  (Christoph)
 - dax_entry_waitqueue() now sets up the struct exceptional_entry_key() of
   the caller as a service to reduce code duplication. (Christoph)
 - In grab_mapping_entry() we now hold the radix tree entry lock for PMD
   downgrades while we release the tree_lock and do an
   unmap_mapping_range().  (Jan)
 - Removed our last BUG_ON() in dax.c, replacing it with a WARN_ON_ONCE()
   and an error return.
 - The dax_iomap_fault() and dax_iomap_pmd_fault() handlers both now call
   ops->iomap_end() to ensure that we properly balance the
   ops->iomap_begin() calls with respect to locking, allocations, etc.
   (Jan)
 - Removed __GFP_FS from the vmf.gfp_mask used in dax_iomap_pmd_fault().
   (Jan)

Thank you again to Jan, Christoph and Dave for their review feedback.

Here are some related things that are not included in this patch set, but
which I plan on doing in the near future:
 - Add tracepoint support for the PTE and PMD based DAX fault handlers.
   (Dave)
 - Move the DAX 4k zero page handling to use a single 4k zero page instead
   of allocating pages on demand.  This will mirror the way that things are
   done for the 2 MiB case, and will reduce the amount of memory we use
   when reading 4k holes in DAX.
 - Change the API to the PMD fault hanlder so it takes a vmf, and at a
   layer above DAX make sure that the vmf.gfp_mask given to DAX for both
   PMD and PTE faults doesn't include __GFP_FS. (Jan)

These work items will happen after review & integration with Jan's patch
set for DAX radix tree cleaning.

This series was built upon xfs/xfs-4.9-reflink with PMD performance fixes
from Toshi Kani and Dan Williams.  Dan's patch has already been merged for
v4.8, and Toshi's patches are currently queued in Andrew Morton's mm tree
for v4.9 inclusion.  These patches are not needed for correct operation,
only for good performance.

Here is a tree containing my changes:
https://git.kernel.org/cgit/linux/kernel/git/zwisler/linux.git/log/?h=dax_pmd_v5

This tree has passed xfstests for ext2, ext4 and XFS both with and without
DAX, and has passed targeted testing where I inserted, removed and flushed
DAX PTEs and PMDs in every combination I could think of.

Previously reported performance numbers:

In some simple mmap I/O testing with FIO the use of PMD faults more than
doubles I/O performance as compared with PTE faults.  Here is the FIO
script I used for my testing:

  [global]
  bs=4k
  size=2G
  directory=/mnt/pmem0
  ioengine=mmap
  [randrw]
  rw=randrw

Here are the performance results with XFS using only pte faults:
   READ: io=1022.7MB, aggrb=557610KB/s, minb=557610KB/s, maxb=557610KB/s, mint=1878msec, maxt=1878msec
  WRITE: io=1025.4MB, aggrb=559084KB/s, minb=559084KB/s, maxb=559084KB/s, mint=1878msec, maxt=1878msec

Here are performance numbers for that same test using PMD faults:
   READ: io=1022.7MB, aggrb=1406.7MB/s, minb=1406.7MB/s, maxb=1406.7MB/s, mint=727msec, maxt=727msec
  WRITE: io=1025.4MB, aggrb=1410.4MB/s, minb=1410.4MB/s, maxb=1410.4MB/s, mint=727msec, maxt=727msec

This was done on a random lab machine with a PMEM device made from memmap'd
RAM.  To get XFS to use PMD faults, I did the following:

  mkfs.xfs -f -d su=2m,sw=1 /dev/pmem0
  mount -o dax /dev/pmem0 /mnt/pmem0
  xfs_io -c "extsize 2m" /mnt/pmem0

Ross Zwisler (17):
  ext4: allow DAX writeback for hole punch
  ext4: tell DAX the size of allocation holes
  dax: remove buffer_size_valid()
  ext2: remove support for DAX PMD faults
  ext2: return -EIO on ext2_iomap_end() failure
  dax: make 'wait_table' global variable static
  dax: remove the last BUG_ON() from fs/dax.c
  dax: consistent variable naming for DAX entries
  dax: coordinate locking for offsets in PMD range
  dax: remove dax_pmd_fault()
  dax: correct dax iomap code namespace
  dax: add dax_iomap_sector() helper function
  dax: dax_iomap_fault() needs to call iomap_end()
  dax: move RADIX_DAX_* defines to dax.h
  dax: add struct iomap based DAX PMD support
  xfs: use struct iomap based DAX PMD fault path
  dax: remove "depends on BROKEN" from FS_DAX_PMD

 fs/Kconfig          |   1 -
 fs/dax.c            | 718 ++++++++++++++++++++++++++++------------------------
 fs/ext2/file.c      |  35 +--
 fs/ext2/inode.c     |   4 +-
 fs/ext4/inode.c     |   7 +-
 fs/xfs/xfs_aops.c   |  26 +-
 fs/xfs/xfs_aops.h   |   3 -
 fs/xfs/xfs_file.c   |  10 +-
 include/linux/dax.h |  60 ++++-
 mm/filemap.c        |   6 +-
 10 files changed, 466 insertions(+), 404 deletions(-)

-- 
2.7.4

_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 182+ messages in thread

* [PATCH v5 00/17] re-enable DAX PMD support
@ 2016-10-07 21:08 ` Ross Zwisler
  0 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-07 21:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

DAX PMDs have been disabled since Jan Kara introduced DAX radix tree based
locking.  This series allows DAX PMDs to participate in the DAX radix tree
based locking scheme so that they can be re-enabled.

Dave, can you please take this through the XFS tree as we discussed during
the v4 review?

Changes since v4:
 - Reworked the DAX flags handling to simplify things and get rid of
   RADIX_DAX_PTE. (Jan & Christoph)
 - Moved RADIX_DAX_* macros to be inline functions in include/linux/dax.h.
   (Christoph)
 - Got rid of unneeded macros RADIX_DAX_HZP_ENTRY() and
   RADIX_DAX_EMPTY_ENTRY(), and instead just pass arbitrary flags to
   radix_dax_entry().
 - Re-ordered the arguments to dax_wake_mapping_entry_waiter() to be more
   consistent with the rest of the code. (Jan)
 - Moved radix_dax_order() inside of the #ifdef CONFIG_FS_DAX_PMD block.
   This was causing a build error on various systems that don't define
   PMD_SHIFT.
 - Patch 5 fixes what I believe is a missing error return in
   ext2_iomap_end().
 - Fixed the page_start calculation for PMDs that was previously found in
   dax_entry_start().  (Jan)  This code is now included directly in
   dax_entry_waitqueue().  (Christoph)
 - dax_entry_waitqueue() now sets up the struct exceptional_entry_key() of
   the caller as a service to reduce code duplication. (Christoph)
 - In grab_mapping_entry() we now hold the radix tree entry lock for PMD
   downgrades while we release the tree_lock and do an
   unmap_mapping_range().  (Jan)
 - Removed our last BUG_ON() in dax.c, replacing it with a WARN_ON_ONCE()
   and an error return.
 - The dax_iomap_fault() and dax_iomap_pmd_fault() handlers both now call
   ops->iomap_end() to ensure that we properly balance the
   ops->iomap_begin() calls with respect to locking, allocations, etc.
   (Jan)
 - Removed __GFP_FS from the vmf.gfp_mask used in dax_iomap_pmd_fault().
   (Jan)

Thank you again to Jan, Christoph and Dave for their review feedback.

Here are some related things that are not included in this patch set, but
which I plan on doing in the near future:
 - Add tracepoint support for the PTE and PMD based DAX fault handlers.
   (Dave)
 - Move the DAX 4k zero page handling to use a single 4k zero page instead
   of allocating pages on demand.  This will mirror the way that things are
   done for the 2 MiB case, and will reduce the amount of memory we use
   when reading 4k holes in DAX.
 - Change the API to the PMD fault hanlder so it takes a vmf, and at a
   layer above DAX make sure that the vmf.gfp_mask given to DAX for both
   PMD and PTE faults doesn't include __GFP_FS. (Jan)

These work items will happen after review & integration with Jan's patch
set for DAX radix tree cleaning.

This series was built upon xfs/xfs-4.9-reflink with PMD performance fixes
from Toshi Kani and Dan Williams.  Dan's patch has already been merged for
v4.8, and Toshi's patches are currently queued in Andrew Morton's mm tree
for v4.9 inclusion.  These patches are not needed for correct operation,
only for good performance.

Here is a tree containing my changes:
https://git.kernel.org/cgit/linux/kernel/git/zwisler/linux.git/log/?h=dax_pmd_v5

This tree has passed xfstests for ext2, ext4 and XFS both with and without
DAX, and has passed targeted testing where I inserted, removed and flushed
DAX PTEs and PMDs in every combination I could think of.

Previously reported performance numbers:

In some simple mmap I/O testing with FIO the use of PMD faults more than
doubles I/O performance as compared with PTE faults.  Here is the FIO
script I used for my testing:

  [global]
  bs=4k
  size=2G
  directory=/mnt/pmem0
  ioengine=mmap
  [randrw]
  rw=randrw

Here are the performance results with XFS using only pte faults:
   READ: io=1022.7MB, aggrb=557610KB/s, minb=557610KB/s, maxb=557610KB/s, mint=1878msec, maxt=1878msec
  WRITE: io=1025.4MB, aggrb=559084KB/s, minb=559084KB/s, maxb=559084KB/s, mint=1878msec, maxt=1878msec

Here are performance numbers for that same test using PMD faults:
   READ: io=1022.7MB, aggrb=1406.7MB/s, minb=1406.7MB/s, maxb=1406.7MB/s, mint=727msec, maxt=727msec
  WRITE: io=1025.4MB, aggrb=1410.4MB/s, minb=1410.4MB/s, maxb=1410.4MB/s, mint=727msec, maxt=727msec

This was done on a random lab machine with a PMEM device made from memmap'd
RAM.  To get XFS to use PMD faults, I did the following:

  mkfs.xfs -f -d su=2m,sw=1 /dev/pmem0
  mount -o dax /dev/pmem0 /mnt/pmem0
  xfs_io -c "extsize 2m" /mnt/pmem0

Ross Zwisler (17):
  ext4: allow DAX writeback for hole punch
  ext4: tell DAX the size of allocation holes
  dax: remove buffer_size_valid()
  ext2: remove support for DAX PMD faults
  ext2: return -EIO on ext2_iomap_end() failure
  dax: make 'wait_table' global variable static
  dax: remove the last BUG_ON() from fs/dax.c
  dax: consistent variable naming for DAX entries
  dax: coordinate locking for offsets in PMD range
  dax: remove dax_pmd_fault()
  dax: correct dax iomap code namespace
  dax: add dax_iomap_sector() helper function
  dax: dax_iomap_fault() needs to call iomap_end()
  dax: move RADIX_DAX_* defines to dax.h
  dax: add struct iomap based DAX PMD support
  xfs: use struct iomap based DAX PMD fault path
  dax: remove "depends on BROKEN" from FS_DAX_PMD

 fs/Kconfig          |   1 -
 fs/dax.c            | 718 ++++++++++++++++++++++++++++------------------------
 fs/ext2/file.c      |  35 +--
 fs/ext2/inode.c     |   4 +-
 fs/ext4/inode.c     |   7 +-
 fs/xfs/xfs_aops.c   |  26 +-
 fs/xfs/xfs_aops.h   |   3 -
 fs/xfs/xfs_file.c   |  10 +-
 include/linux/dax.h |  60 ++++-
 mm/filemap.c        |   6 +-
 10 files changed, 466 insertions(+), 404 deletions(-)

-- 
2.7.4

^ permalink raw reply	[flat|nested] 182+ messages in thread

* [PATCH v5 00/17] re-enable DAX PMD support
@ 2016-10-07 21:08 ` Ross Zwisler
  0 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-07 21:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

DAX PMDs have been disabled since Jan Kara introduced DAX radix tree based
locking.  This series allows DAX PMDs to participate in the DAX radix tree
based locking scheme so that they can be re-enabled.

Dave, can you please take this through the XFS tree as we discussed during
the v4 review?

Changes since v4:
 - Reworked the DAX flags handling to simplify things and get rid of
   RADIX_DAX_PTE. (Jan & Christoph)
 - Moved RADIX_DAX_* macros to be inline functions in include/linux/dax.h.
   (Christoph)
 - Got rid of unneeded macros RADIX_DAX_HZP_ENTRY() and
   RADIX_DAX_EMPTY_ENTRY(), and instead just pass arbitrary flags to
   radix_dax_entry().
 - Re-ordered the arguments to dax_wake_mapping_entry_waiter() to be more
   consistent with the rest of the code. (Jan)
 - Moved radix_dax_order() inside of the #ifdef CONFIG_FS_DAX_PMD block.
   This was causing a build error on various systems that don't define
   PMD_SHIFT.
 - Patch 5 fixes what I believe is a missing error return in
   ext2_iomap_end().
 - Fixed the page_start calculation for PMDs that was previously found in
   dax_entry_start().  (Jan)  This code is now included directly in
   dax_entry_waitqueue().  (Christoph)
 - dax_entry_waitqueue() now sets up the struct exceptional_entry_key() of
   the caller as a service to reduce code duplication. (Christoph)
 - In grab_mapping_entry() we now hold the radix tree entry lock for PMD
   downgrades while we release the tree_lock and do an
   unmap_mapping_range().  (Jan)
 - Removed our last BUG_ON() in dax.c, replacing it with a WARN_ON_ONCE()
   and an error return.
 - The dax_iomap_fault() and dax_iomap_pmd_fault() handlers both now call
   ops->iomap_end() to ensure that we properly balance the
   ops->iomap_begin() calls with respect to locking, allocations, etc.
   (Jan)
 - Removed __GFP_FS from the vmf.gfp_mask used in dax_iomap_pmd_fault().
   (Jan)

Thank you again to Jan, Christoph and Dave for their review feedback.

Here are some related things that are not included in this patch set, but
which I plan on doing in the near future:
 - Add tracepoint support for the PTE and PMD based DAX fault handlers.
   (Dave)
 - Move the DAX 4k zero page handling to use a single 4k zero page instead
   of allocating pages on demand.  This will mirror the way that things are
   done for the 2 MiB case, and will reduce the amount of memory we use
   when reading 4k holes in DAX.
 - Change the API to the PMD fault hanlder so it takes a vmf, and at a
   layer above DAX make sure that the vmf.gfp_mask given to DAX for both
   PMD and PTE faults doesn't include __GFP_FS. (Jan)

These work items will happen after review & integration with Jan's patch
set for DAX radix tree cleaning.

This series was built upon xfs/xfs-4.9-reflink with PMD performance fixes
from Toshi Kani and Dan Williams.  Dan's patch has already been merged for
v4.8, and Toshi's patches are currently queued in Andrew Morton's mm tree
for v4.9 inclusion.  These patches are not needed for correct operation,
only for good performance.

Here is a tree containing my changes:
https://git.kernel.org/cgit/linux/kernel/git/zwisler/linux.git/log/?h=dax_pmd_v5

This tree has passed xfstests for ext2, ext4 and XFS both with and without
DAX, and has passed targeted testing where I inserted, removed and flushed
DAX PTEs and PMDs in every combination I could think of.

Previously reported performance numbers:

In some simple mmap I/O testing with FIO the use of PMD faults more than
doubles I/O performance as compared with PTE faults.  Here is the FIO
script I used for my testing:

  [global]
  bs=4k
  size=2G
  directory=/mnt/pmem0
  ioengine=mmap
  [randrw]
  rw=randrw

Here are the performance results with XFS using only pte faults:
   READ: io=1022.7MB, aggrb=557610KB/s, minb=557610KB/s, maxb=557610KB/s, mint=1878msec, maxt=1878msec
  WRITE: io=1025.4MB, aggrb=559084KB/s, minb=559084KB/s, maxb=559084KB/s, mint=1878msec, maxt=1878msec

Here are performance numbers for that same test using PMD faults:
   READ: io=1022.7MB, aggrb=1406.7MB/s, minb=1406.7MB/s, maxb=1406.7MB/s, mint=727msec, maxt=727msec
  WRITE: io=1025.4MB, aggrb=1410.4MB/s, minb=1410.4MB/s, maxb=1410.4MB/s, mint=727msec, maxt=727msec

This was done on a random lab machine with a PMEM device made from memmap'd
RAM.  To get XFS to use PMD faults, I did the following:

  mkfs.xfs -f -d su=2m,sw=1 /dev/pmem0
  mount -o dax /dev/pmem0 /mnt/pmem0
  xfs_io -c "extsize 2m" /mnt/pmem0

Ross Zwisler (17):
  ext4: allow DAX writeback for hole punch
  ext4: tell DAX the size of allocation holes
  dax: remove buffer_size_valid()
  ext2: remove support for DAX PMD faults
  ext2: return -EIO on ext2_iomap_end() failure
  dax: make 'wait_table' global variable static
  dax: remove the last BUG_ON() from fs/dax.c
  dax: consistent variable naming for DAX entries
  dax: coordinate locking for offsets in PMD range
  dax: remove dax_pmd_fault()
  dax: correct dax iomap code namespace
  dax: add dax_iomap_sector() helper function
  dax: dax_iomap_fault() needs to call iomap_end()
  dax: move RADIX_DAX_* defines to dax.h
  dax: add struct iomap based DAX PMD support
  xfs: use struct iomap based DAX PMD fault path
  dax: remove "depends on BROKEN" from FS_DAX_PMD

 fs/Kconfig          |   1 -
 fs/dax.c            | 718 ++++++++++++++++++++++++++++------------------------
 fs/ext2/file.c      |  35 +--
 fs/ext2/inode.c     |   4 +-
 fs/ext4/inode.c     |   7 +-
 fs/xfs/xfs_aops.c   |  26 +-
 fs/xfs/xfs_aops.h   |   3 -
 fs/xfs/xfs_file.c   |  10 +-
 include/linux/dax.h |  60 ++++-
 mm/filemap.c        |   6 +-
 10 files changed, 466 insertions(+), 404 deletions(-)

-- 
2.7.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 182+ messages in thread

* [PATCH v5 00/17] re-enable DAX PMD support
@ 2016-10-07 21:08 ` Ross Zwisler
  0 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-07 21:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

DAX PMDs have been disabled since Jan Kara introduced DAX radix tree based
locking.  This series allows DAX PMDs to participate in the DAX radix tree
based locking scheme so that they can be re-enabled.

Dave, can you please take this through the XFS tree as we discussed during
the v4 review?

Changes since v4:
 - Reworked the DAX flags handling to simplify things and get rid of
   RADIX_DAX_PTE. (Jan & Christoph)
 - Moved RADIX_DAX_* macros to be inline functions in include/linux/dax.h.
   (Christoph)
 - Got rid of unneeded macros RADIX_DAX_HZP_ENTRY() and
   RADIX_DAX_EMPTY_ENTRY(), and instead just pass arbitrary flags to
   radix_dax_entry().
 - Re-ordered the arguments to dax_wake_mapping_entry_waiter() to be more
   consistent with the rest of the code. (Jan)
 - Moved radix_dax_order() inside of the #ifdef CONFIG_FS_DAX_PMD block.
   This was causing a build error on various systems that don't define
   PMD_SHIFT.
 - Patch 5 fixes what I believe is a missing error return in
   ext2_iomap_end().
 - Fixed the page_start calculation for PMDs that was previously found in
   dax_entry_start().  (Jan)  This code is now included directly in
   dax_entry_waitqueue().  (Christoph)
 - dax_entry_waitqueue() now sets up the struct exceptional_entry_key() of
   the caller as a service to reduce code duplication. (Christoph)
 - In grab_mapping_entry() we now hold the radix tree entry lock for PMD
   downgrades while we release the tree_lock and do an
   unmap_mapping_range().  (Jan)
 - Removed our last BUG_ON() in dax.c, replacing it with a WARN_ON_ONCE()
   and an error return.
 - The dax_iomap_fault() and dax_iomap_pmd_fault() handlers both now call
   ops->iomap_end() to ensure that we properly balance the
   ops->iomap_begin() calls with respect to locking, allocations, etc.
   (Jan)
 - Removed __GFP_FS from the vmf.gfp_mask used in dax_iomap_pmd_fault().
   (Jan)

Thank you again to Jan, Christoph and Dave for their review feedback.

Here are some related things that are not included in this patch set, but
which I plan on doing in the near future:
 - Add tracepoint support for the PTE and PMD based DAX fault handlers.
   (Dave)
 - Move the DAX 4k zero page handling to use a single 4k zero page instead
   of allocating pages on demand.  This will mirror the way that things are
   done for the 2 MiB case, and will reduce the amount of memory we use
   when reading 4k holes in DAX.
 - Change the API to the PMD fault hanlder so it takes a vmf, and at a
   layer above DAX make sure that the vmf.gfp_mask given to DAX for both
   PMD and PTE faults doesn't include __GFP_FS. (Jan)

These work items will happen after review & integration with Jan's patch
set for DAX radix tree cleaning.

This series was built upon xfs/xfs-4.9-reflink with PMD performance fixes
from Toshi Kani and Dan Williams.  Dan's patch has already been merged for
v4.8, and Toshi's patches are currently queued in Andrew Morton's mm tree
for v4.9 inclusion.  These patches are not needed for correct operation,
only for good performance.

Here is a tree containing my changes:
https://git.kernel.org/cgit/linux/kernel/git/zwisler/linux.git/log/?h=dax_pmd_v5

This tree has passed xfstests for ext2, ext4 and XFS both with and without
DAX, and has passed targeted testing where I inserted, removed and flushed
DAX PTEs and PMDs in every combination I could think of.

Previously reported performance numbers:

In some simple mmap I/O testing with FIO the use of PMD faults more than
doubles I/O performance as compared with PTE faults.  Here is the FIO
script I used for my testing:

  [global]
  bs=4k
  size=2G
  directory=/mnt/pmem0
  ioengine=mmap
  [randrw]
  rw=randrw

Here are the performance results with XFS using only pte faults:
   READ: io=1022.7MB, aggrb=557610KB/s, minb=557610KB/s, maxb=557610KB/s, mint=1878msec, maxt=1878msec
  WRITE: io=1025.4MB, aggrb=559084KB/s, minb=559084KB/s, maxb=559084KB/s, mint=1878msec, maxt=1878msec

Here are performance numbers for that same test using PMD faults:
   READ: io=1022.7MB, aggrb=1406.7MB/s, minb=1406.7MB/s, maxb=1406.7MB/s, mint=727msec, maxt=727msec
  WRITE: io=1025.4MB, aggrb=1410.4MB/s, minb=1410.4MB/s, maxb=1410.4MB/s, mint=727msec, maxt=727msec

This was done on a random lab machine with a PMEM device made from memmap'd
RAM.  To get XFS to use PMD faults, I did the following:

  mkfs.xfs -f -d su=2m,sw=1 /dev/pmem0
  mount -o dax /dev/pmem0 /mnt/pmem0
  xfs_io -c "extsize 2m" /mnt/pmem0

Ross Zwisler (17):
  ext4: allow DAX writeback for hole punch
  ext4: tell DAX the size of allocation holes
  dax: remove buffer_size_valid()
  ext2: remove support for DAX PMD faults
  ext2: return -EIO on ext2_iomap_end() failure
  dax: make 'wait_table' global variable static
  dax: remove the last BUG_ON() from fs/dax.c
  dax: consistent variable naming for DAX entries
  dax: coordinate locking for offsets in PMD range
  dax: remove dax_pmd_fault()
  dax: correct dax iomap code namespace
  dax: add dax_iomap_sector() helper function
  dax: dax_iomap_fault() needs to call iomap_end()
  dax: move RADIX_DAX_* defines to dax.h
  dax: add struct iomap based DAX PMD support
  xfs: use struct iomap based DAX PMD fault path
  dax: remove "depends on BROKEN" from FS_DAX_PMD

 fs/Kconfig          |   1 -
 fs/dax.c            | 718 ++++++++++++++++++++++++++++------------------------
 fs/ext2/file.c      |  35 +--
 fs/ext2/inode.c     |   4 +-
 fs/ext4/inode.c     |   7 +-
 fs/xfs/xfs_aops.c   |  26 +-
 fs/xfs/xfs_aops.h   |   3 -
 fs/xfs/xfs_file.c   |  10 +-
 include/linux/dax.h |  60 ++++-
 mm/filemap.c        |   6 +-
 10 files changed, 466 insertions(+), 404 deletions(-)

-- 
2.7.4


^ permalink raw reply	[flat|nested] 182+ messages in thread

* [PATCH v5 01/17] ext4: allow DAX writeback for hole punch
  2016-10-07 21:08 ` Ross Zwisler
                     ` (2 preceding siblings ...)
  (?)
@ 2016-10-07 21:08   ` Ross Zwisler
  -1 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-07 21:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: Theodore Ts'o, Matthew Wilcox, Dave Chinner, linux-nvdimm,
	stable, Christoph Hellwig, linux-xfs, linux-mm, Andreas Dilger,
	Alexander Viro, Jan Kara, linux-fsdevel, linux-ext4,
	Andrew Morton

Currently when doing a DAX hole punch with ext4 we fail to do a writeback.
This is because the logic around filemap_write_and_wait_range() in
ext4_punch_hole() only looks for dirty page cache pages in the radix tree,
not for dirty DAX exceptional entries.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: <stable@vger.kernel.org>
---
 fs/ext4/inode.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 3131747..0900cb4 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3890,7 +3890,7 @@ int ext4_update_disksize_before_punch(struct inode *inode, loff_t offset,
 }
 
 /*
- * ext4_punch_hole: punches a hole in a file by releaseing the blocks
+ * ext4_punch_hole: punches a hole in a file by releasing the blocks
  * associated with the given offset and length
  *
  * @inode:  File inode
@@ -3919,7 +3919,7 @@ int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length)
 	 * Write out all dirty pages to avoid race conditions
 	 * Then release them.
 	 */
-	if (mapping->nrpages && mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
+	if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
 		ret = filemap_write_and_wait_range(mapping, offset,
 						   offset + length - 1);
 		if (ret)
-- 
2.7.4

_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH v5 01/17] ext4: allow DAX writeback for hole punch
@ 2016-10-07 21:08   ` Ross Zwisler
  0 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-07 21:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs, stable

Currently when doing a DAX hole punch with ext4 we fail to do a writeback.
This is because the logic around filemap_write_and_wait_range() in
ext4_punch_hole() only looks for dirty page cache pages in the radix tree,
not for dirty DAX exceptional entries.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: <stable@vger.kernel.org>
---
 fs/ext4/inode.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 3131747..0900cb4 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3890,7 +3890,7 @@ int ext4_update_disksize_before_punch(struct inode *inode, loff_t offset,
 }
 
 /*
- * ext4_punch_hole: punches a hole in a file by releaseing the blocks
+ * ext4_punch_hole: punches a hole in a file by releasing the blocks
  * associated with the given offset and length
  *
  * @inode:  File inode
@@ -3919,7 +3919,7 @@ int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length)
 	 * Write out all dirty pages to avoid race conditions
 	 * Then release them.
 	 */
-	if (mapping->nrpages && mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
+	if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
 		ret = filemap_write_and_wait_range(mapping, offset,
 						   offset + length - 1);
 		if (ret)
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH v5 01/17] ext4: allow DAX writeback for hole punch
@ 2016-10-07 21:08   ` Ross Zwisler
  0 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-07 21:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs, stable

Currently when doing a DAX hole punch with ext4 we fail to do a writeback.
This is because the logic around filemap_write_and_wait_range() in
ext4_punch_hole() only looks for dirty page cache pages in the radix tree,
not for dirty DAX exceptional entries.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: <stable@vger.kernel.org>
---
 fs/ext4/inode.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 3131747..0900cb4 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3890,7 +3890,7 @@ int ext4_update_disksize_before_punch(struct inode *inode, loff_t offset,
 }
 
 /*
- * ext4_punch_hole: punches a hole in a file by releaseing the blocks
+ * ext4_punch_hole: punches a hole in a file by releasing the blocks
  * associated with the given offset and length
  *
  * @inode:  File inode
@@ -3919,7 +3919,7 @@ int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length)
 	 * Write out all dirty pages to avoid race conditions
 	 * Then release them.
 	 */
-	if (mapping->nrpages && mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
+	if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
 		ret = filemap_write_and_wait_range(mapping, offset,
 						   offset + length - 1);
 		if (ret)
-- 
2.7.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH v5 01/17] ext4: allow DAX writeback for hole punch
@ 2016-10-07 21:08   ` Ross Zwisler
  0 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-07 21:08 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: Theodore Ts'o, Matthew Wilcox, Dave Chinner,
	linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw,
	stable-u79uwXL29TY76Z2rM5mHXA, Christoph Hellwig,
	linux-xfs-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Andreas Dilger, Alexander Viro,
	Jan Kara, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-ext4-u79uwXL29TY76Z2rM5mHXA, Andrew Morton

Currently when doing a DAX hole punch with ext4 we fail to do a writeback.
This is because the logic around filemap_write_and_wait_range() in
ext4_punch_hole() only looks for dirty page cache pages in the radix tree,
not for dirty DAX exceptional entries.

Signed-off-by: Ross Zwisler <ross.zwisler-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
Reviewed-by: Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org>
Cc: <stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>
---
 fs/ext4/inode.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 3131747..0900cb4 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3890,7 +3890,7 @@ int ext4_update_disksize_before_punch(struct inode *inode, loff_t offset,
 }
 
 /*
- * ext4_punch_hole: punches a hole in a file by releaseing the blocks
+ * ext4_punch_hole: punches a hole in a file by releasing the blocks
  * associated with the given offset and length
  *
  * @inode:  File inode
@@ -3919,7 +3919,7 @@ int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length)
 	 * Write out all dirty pages to avoid race conditions
 	 * Then release them.
 	 */
-	if (mapping->nrpages && mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
+	if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
 		ret = filemap_write_and_wait_range(mapping, offset,
 						   offset + length - 1);
 		if (ret)
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH v5 01/17] ext4: allow DAX writeback for hole punch
@ 2016-10-07 21:08   ` Ross Zwisler
  0 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-07 21:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs, stable

Currently when doing a DAX hole punch with ext4 we fail to do a writeback.
This is because the logic around filemap_write_and_wait_range() in
ext4_punch_hole() only looks for dirty page cache pages in the radix tree,
not for dirty DAX exceptional entries.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: <stable@vger.kernel.org>
---
 fs/ext4/inode.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 3131747..0900cb4 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3890,7 +3890,7 @@ int ext4_update_disksize_before_punch(struct inode *inode, loff_t offset,
 }
 
 /*
- * ext4_punch_hole: punches a hole in a file by releaseing the blocks
+ * ext4_punch_hole: punches a hole in a file by releasing the blocks
  * associated with the given offset and length
  *
  * @inode:  File inode
@@ -3919,7 +3919,7 @@ int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length)
 	 * Write out all dirty pages to avoid race conditions
 	 * Then release them.
 	 */
-	if (mapping->nrpages && mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
+	if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
 		ret = filemap_write_and_wait_range(mapping, offset,
 						   offset + length - 1);
 		if (ret)
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH v5 02/17] ext4: tell DAX the size of allocation holes
  2016-10-07 21:08 ` Ross Zwisler
                     ` (2 preceding siblings ...)
  (?)
@ 2016-10-07 21:08   ` Ross Zwisler
  -1 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-07 21:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: Theodore Ts'o, Matthew Wilcox, Dave Chinner, linux-nvdimm,
	Christoph Hellwig, linux-xfs, linux-mm, Andreas Dilger,
	Alexander Viro, Jan Kara, linux-fsdevel, linux-ext4,
	Andrew Morton

When DAX calls _ext4_get_block() and the file offset points to a hole we
currently don't set bh->b_size.  This is current worked around via
buffer_size_valid() in fs/dax.c.

_ext4_get_block() has the hole size information from ext4_map_blocks(), so
populate bh->b_size so we can remove buffer_size_valid() in a later patch.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/inode.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 0900cb4..9075fac 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -759,6 +759,9 @@ static int _ext4_get_block(struct inode *inode, sector_t iblock,
 		ext4_update_bh_state(bh, map.m_flags);
 		bh->b_size = inode->i_sb->s_blocksize * map.m_len;
 		ret = 0;
+	} else if (ret == 0) {
+		/* hole case, need to fill in bh->b_size */
+		bh->b_size = inode->i_sb->s_blocksize * map.m_len;
 	}
 	return ret;
 }
-- 
2.7.4

_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH v5 02/17] ext4: tell DAX the size of allocation holes
@ 2016-10-07 21:08   ` Ross Zwisler
  0 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-07 21:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

When DAX calls _ext4_get_block() and the file offset points to a hole we
currently don't set bh->b_size.  This is current worked around via
buffer_size_valid() in fs/dax.c.

_ext4_get_block() has the hole size information from ext4_map_blocks(), so
populate bh->b_size so we can remove buffer_size_valid() in a later patch.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/inode.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 0900cb4..9075fac 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -759,6 +759,9 @@ static int _ext4_get_block(struct inode *inode, sector_t iblock,
 		ext4_update_bh_state(bh, map.m_flags);
 		bh->b_size = inode->i_sb->s_blocksize * map.m_len;
 		ret = 0;
+	} else if (ret == 0) {
+		/* hole case, need to fill in bh->b_size */
+		bh->b_size = inode->i_sb->s_blocksize * map.m_len;
 	}
 	return ret;
 }
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH v5 02/17] ext4: tell DAX the size of allocation holes
@ 2016-10-07 21:08   ` Ross Zwisler
  0 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-07 21:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

When DAX calls _ext4_get_block() and the file offset points to a hole we
currently don't set bh->b_size.  This is current worked around via
buffer_size_valid() in fs/dax.c.

_ext4_get_block() has the hole size information from ext4_map_blocks(), so
populate bh->b_size so we can remove buffer_size_valid() in a later patch.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/inode.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 0900cb4..9075fac 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -759,6 +759,9 @@ static int _ext4_get_block(struct inode *inode, sector_t iblock,
 		ext4_update_bh_state(bh, map.m_flags);
 		bh->b_size = inode->i_sb->s_blocksize * map.m_len;
 		ret = 0;
+	} else if (ret == 0) {
+		/* hole case, need to fill in bh->b_size */
+		bh->b_size = inode->i_sb->s_blocksize * map.m_len;
 	}
 	return ret;
 }
-- 
2.7.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH v5 02/17] ext4: tell DAX the size of allocation holes
@ 2016-10-07 21:08   ` Ross Zwisler
  0 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-07 21:08 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: Theodore Ts'o, Matthew Wilcox, Dave Chinner,
	linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw, Christoph Hellwig,
	linux-xfs-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Andreas Dilger, Alexander Viro,
	Jan Kara, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-ext4-u79uwXL29TY76Z2rM5mHXA, Andrew Morton

When DAX calls _ext4_get_block() and the file offset points to a hole we
currently don't set bh->b_size.  This is current worked around via
buffer_size_valid() in fs/dax.c.

_ext4_get_block() has the hole size information from ext4_map_blocks(), so
populate bh->b_size so we can remove buffer_size_valid() in a later patch.

Signed-off-by: Ross Zwisler <ross.zwisler-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
Reviewed-by: Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org>
---
 fs/ext4/inode.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 0900cb4..9075fac 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -759,6 +759,9 @@ static int _ext4_get_block(struct inode *inode, sector_t iblock,
 		ext4_update_bh_state(bh, map.m_flags);
 		bh->b_size = inode->i_sb->s_blocksize * map.m_len;
 		ret = 0;
+	} else if (ret == 0) {
+		/* hole case, need to fill in bh->b_size */
+		bh->b_size = inode->i_sb->s_blocksize * map.m_len;
 	}
 	return ret;
 }
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH v5 02/17] ext4: tell DAX the size of allocation holes
@ 2016-10-07 21:08   ` Ross Zwisler
  0 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-07 21:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

When DAX calls _ext4_get_block() and the file offset points to a hole we
currently don't set bh->b_size.  This is current worked around via
buffer_size_valid() in fs/dax.c.

_ext4_get_block() has the hole size information from ext4_map_blocks(), so
populate bh->b_size so we can remove buffer_size_valid() in a later patch.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/inode.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 0900cb4..9075fac 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -759,6 +759,9 @@ static int _ext4_get_block(struct inode *inode, sector_t iblock,
 		ext4_update_bh_state(bh, map.m_flags);
 		bh->b_size = inode->i_sb->s_blocksize * map.m_len;
 		ret = 0;
+	} else if (ret == 0) {
+		/* hole case, need to fill in bh->b_size */
+		bh->b_size = inode->i_sb->s_blocksize * map.m_len;
 	}
 	return ret;
 }
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH v5 03/17] dax: remove buffer_size_valid()
  2016-10-07 21:08 ` Ross Zwisler
                     ` (2 preceding siblings ...)
  (?)
@ 2016-10-07 21:08   ` Ross Zwisler
  -1 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-07 21:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: Theodore Ts'o, Matthew Wilcox, Dave Chinner, linux-nvdimm,
	Christoph Hellwig, linux-xfs, linux-mm, Andreas Dilger,
	Alexander Viro, Jan Kara, linux-fsdevel, linux-ext4,
	Andrew Morton

Now that ext4 properly sets bh.b_size when we call get_block() for a hole,
rely on that value and remove the buffer_size_valid() sanity check.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/dax.c | 22 +---------------------
 1 file changed, 1 insertion(+), 21 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index cc025f8..9b9be8a 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -123,19 +123,6 @@ static bool buffer_written(struct buffer_head *bh)
 	return buffer_mapped(bh) && !buffer_unwritten(bh);
 }
 
-/*
- * When ext4 encounters a hole, it returns without modifying the buffer_head
- * which means that we can't trust b_size.  To cope with this, we set b_state
- * to 0 before calling get_block and, if any bit is set, we know we can trust
- * b_size.  Unfortunate, really, since ext4 knows precisely how long a hole is
- * and would save us time calling get_block repeatedly.
- */
-static bool buffer_size_valid(struct buffer_head *bh)
-{
-	return bh->b_state != 0;
-}
-
-
 static sector_t to_sector(const struct buffer_head *bh,
 		const struct inode *inode)
 {
@@ -177,8 +164,6 @@ static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
 				rc = get_block(inode, block, bh, rw == WRITE);
 				if (rc)
 					break;
-				if (!buffer_size_valid(bh))
-					bh->b_size = 1 << blkbits;
 				bh_max = pos - first + bh->b_size;
 				bdev = bh->b_bdev;
 				/*
@@ -1012,12 +997,7 @@ int dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 
 	bdev = bh.b_bdev;
 
-	/*
-	 * If the filesystem isn't willing to tell us the length of a hole,
-	 * just fall back to PTEs.  Calling get_block 512 times in a loop
-	 * would be silly.
-	 */
-	if (!buffer_size_valid(&bh) || bh.b_size < PMD_SIZE) {
+	if (bh.b_size < PMD_SIZE) {
 		dax_pmd_dbg(&bh, address, "allocated block too small");
 		return VM_FAULT_FALLBACK;
 	}
-- 
2.7.4

_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH v5 03/17] dax: remove buffer_size_valid()
@ 2016-10-07 21:08   ` Ross Zwisler
  0 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-07 21:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

Now that ext4 properly sets bh.b_size when we call get_block() for a hole,
rely on that value and remove the buffer_size_valid() sanity check.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/dax.c | 22 +---------------------
 1 file changed, 1 insertion(+), 21 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index cc025f8..9b9be8a 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -123,19 +123,6 @@ static bool buffer_written(struct buffer_head *bh)
 	return buffer_mapped(bh) && !buffer_unwritten(bh);
 }
 
-/*
- * When ext4 encounters a hole, it returns without modifying the buffer_head
- * which means that we can't trust b_size.  To cope with this, we set b_state
- * to 0 before calling get_block and, if any bit is set, we know we can trust
- * b_size.  Unfortunate, really, since ext4 knows precisely how long a hole is
- * and would save us time calling get_block repeatedly.
- */
-static bool buffer_size_valid(struct buffer_head *bh)
-{
-	return bh->b_state != 0;
-}
-
-
 static sector_t to_sector(const struct buffer_head *bh,
 		const struct inode *inode)
 {
@@ -177,8 +164,6 @@ static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
 				rc = get_block(inode, block, bh, rw == WRITE);
 				if (rc)
 					break;
-				if (!buffer_size_valid(bh))
-					bh->b_size = 1 << blkbits;
 				bh_max = pos - first + bh->b_size;
 				bdev = bh->b_bdev;
 				/*
@@ -1012,12 +997,7 @@ int dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 
 	bdev = bh.b_bdev;
 
-	/*
-	 * If the filesystem isn't willing to tell us the length of a hole,
-	 * just fall back to PTEs.  Calling get_block 512 times in a loop
-	 * would be silly.
-	 */
-	if (!buffer_size_valid(&bh) || bh.b_size < PMD_SIZE) {
+	if (bh.b_size < PMD_SIZE) {
 		dax_pmd_dbg(&bh, address, "allocated block too small");
 		return VM_FAULT_FALLBACK;
 	}
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH v5 03/17] dax: remove buffer_size_valid()
@ 2016-10-07 21:08   ` Ross Zwisler
  0 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-07 21:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

Now that ext4 properly sets bh.b_size when we call get_block() for a hole,
rely on that value and remove the buffer_size_valid() sanity check.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/dax.c | 22 +---------------------
 1 file changed, 1 insertion(+), 21 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index cc025f8..9b9be8a 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -123,19 +123,6 @@ static bool buffer_written(struct buffer_head *bh)
 	return buffer_mapped(bh) && !buffer_unwritten(bh);
 }
 
-/*
- * When ext4 encounters a hole, it returns without modifying the buffer_head
- * which means that we can't trust b_size.  To cope with this, we set b_state
- * to 0 before calling get_block and, if any bit is set, we know we can trust
- * b_size.  Unfortunate, really, since ext4 knows precisely how long a hole is
- * and would save us time calling get_block repeatedly.
- */
-static bool buffer_size_valid(struct buffer_head *bh)
-{
-	return bh->b_state != 0;
-}
-
-
 static sector_t to_sector(const struct buffer_head *bh,
 		const struct inode *inode)
 {
@@ -177,8 +164,6 @@ static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
 				rc = get_block(inode, block, bh, rw == WRITE);
 				if (rc)
 					break;
-				if (!buffer_size_valid(bh))
-					bh->b_size = 1 << blkbits;
 				bh_max = pos - first + bh->b_size;
 				bdev = bh->b_bdev;
 				/*
@@ -1012,12 +997,7 @@ int dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 
 	bdev = bh.b_bdev;
 
-	/*
-	 * If the filesystem isn't willing to tell us the length of a hole,
-	 * just fall back to PTEs.  Calling get_block 512 times in a loop
-	 * would be silly.
-	 */
-	if (!buffer_size_valid(&bh) || bh.b_size < PMD_SIZE) {
+	if (bh.b_size < PMD_SIZE) {
 		dax_pmd_dbg(&bh, address, "allocated block too small");
 		return VM_FAULT_FALLBACK;
 	}
-- 
2.7.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH v5 03/17] dax: remove buffer_size_valid()
@ 2016-10-07 21:08   ` Ross Zwisler
  0 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-07 21:08 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: Theodore Ts'o, Matthew Wilcox, Dave Chinner,
	linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw, Christoph Hellwig,
	linux-xfs-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Andreas Dilger, Alexander Viro,
	Jan Kara, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-ext4-u79uwXL29TY76Z2rM5mHXA, Andrew Morton

Now that ext4 properly sets bh.b_size when we call get_block() for a hole,
rely on that value and remove the buffer_size_valid() sanity check.

Signed-off-by: Ross Zwisler <ross.zwisler-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
Reviewed-by: Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org>
Reviewed-by: Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org>
---
 fs/dax.c | 22 +---------------------
 1 file changed, 1 insertion(+), 21 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index cc025f8..9b9be8a 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -123,19 +123,6 @@ static bool buffer_written(struct buffer_head *bh)
 	return buffer_mapped(bh) && !buffer_unwritten(bh);
 }
 
-/*
- * When ext4 encounters a hole, it returns without modifying the buffer_head
- * which means that we can't trust b_size.  To cope with this, we set b_state
- * to 0 before calling get_block and, if any bit is set, we know we can trust
- * b_size.  Unfortunate, really, since ext4 knows precisely how long a hole is
- * and would save us time calling get_block repeatedly.
- */
-static bool buffer_size_valid(struct buffer_head *bh)
-{
-	return bh->b_state != 0;
-}
-
-
 static sector_t to_sector(const struct buffer_head *bh,
 		const struct inode *inode)
 {
@@ -177,8 +164,6 @@ static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
 				rc = get_block(inode, block, bh, rw == WRITE);
 				if (rc)
 					break;
-				if (!buffer_size_valid(bh))
-					bh->b_size = 1 << blkbits;
 				bh_max = pos - first + bh->b_size;
 				bdev = bh->b_bdev;
 				/*
@@ -1012,12 +997,7 @@ int dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 
 	bdev = bh.b_bdev;
 
-	/*
-	 * If the filesystem isn't willing to tell us the length of a hole,
-	 * just fall back to PTEs.  Calling get_block 512 times in a loop
-	 * would be silly.
-	 */
-	if (!buffer_size_valid(&bh) || bh.b_size < PMD_SIZE) {
+	if (bh.b_size < PMD_SIZE) {
 		dax_pmd_dbg(&bh, address, "allocated block too small");
 		return VM_FAULT_FALLBACK;
 	}
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH v5 03/17] dax: remove buffer_size_valid()
@ 2016-10-07 21:08   ` Ross Zwisler
  0 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-07 21:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

Now that ext4 properly sets bh.b_size when we call get_block() for a hole,
rely on that value and remove the buffer_size_valid() sanity check.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/dax.c | 22 +---------------------
 1 file changed, 1 insertion(+), 21 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index cc025f8..9b9be8a 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -123,19 +123,6 @@ static bool buffer_written(struct buffer_head *bh)
 	return buffer_mapped(bh) && !buffer_unwritten(bh);
 }
 
-/*
- * When ext4 encounters a hole, it returns without modifying the buffer_head
- * which means that we can't trust b_size.  To cope with this, we set b_state
- * to 0 before calling get_block and, if any bit is set, we know we can trust
- * b_size.  Unfortunate, really, since ext4 knows precisely how long a hole is
- * and would save us time calling get_block repeatedly.
- */
-static bool buffer_size_valid(struct buffer_head *bh)
-{
-	return bh->b_state != 0;
-}
-
-
 static sector_t to_sector(const struct buffer_head *bh,
 		const struct inode *inode)
 {
@@ -177,8 +164,6 @@ static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
 				rc = get_block(inode, block, bh, rw == WRITE);
 				if (rc)
 					break;
-				if (!buffer_size_valid(bh))
-					bh->b_size = 1 << blkbits;
 				bh_max = pos - first + bh->b_size;
 				bdev = bh->b_bdev;
 				/*
@@ -1012,12 +997,7 @@ int dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 
 	bdev = bh.b_bdev;
 
-	/*
-	 * If the filesystem isn't willing to tell us the length of a hole,
-	 * just fall back to PTEs.  Calling get_block 512 times in a loop
-	 * would be silly.
-	 */
-	if (!buffer_size_valid(&bh) || bh.b_size < PMD_SIZE) {
+	if (bh.b_size < PMD_SIZE) {
 		dax_pmd_dbg(&bh, address, "allocated block too small");
 		return VM_FAULT_FALLBACK;
 	}
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH v5 04/17] ext2: remove support for DAX PMD faults
  2016-10-07 21:08 ` Ross Zwisler
  (?)
  (?)
@ 2016-10-07 21:08   ` Ross Zwisler
  -1 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-07 21:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: Theodore Ts'o, Matthew Wilcox, Dave Chinner, linux-nvdimm,
	Christoph Hellwig, linux-xfs, linux-mm, Andreas Dilger,
	Alexander Viro, Jan Kara, linux-fsdevel, linux-ext4,
	Andrew Morton

DAX PMD support was added via the following commit:

commit e7b1ea2ad658 ("ext2: huge page fault support")

I believe this path to be untested as ext2 doesn't reliably provide block
allocations that are aligned to 2MiB.  In my testing I've been unable to
get ext2 to actually fault in a PMD.  It always fails with a "pfn
unaligned" message because the sector returned by ext2_get_block() isn't
aligned.

I've tried various settings for the "stride" and "stripe_width" extended
options to mkfs.ext2, without any luck.

Since we can't reliably get PMDs, remove support so that we don't have an
untested code path that we may someday traverse when we happen to get an
aligned block allocation.  This should also make 4k DAX faults in ext2 a
bit faster since they will no longer have to call the PMD fault handler
only to get a response of VM_FAULT_FALLBACK.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/ext2/file.c | 29 ++++++-----------------------
 1 file changed, 6 insertions(+), 23 deletions(-)

diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index 0ca363d..0f257f8 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -107,27 +107,6 @@ static int ext2_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 	return ret;
 }
 
-static int ext2_dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
-						pmd_t *pmd, unsigned int flags)
-{
-	struct inode *inode = file_inode(vma->vm_file);
-	struct ext2_inode_info *ei = EXT2_I(inode);
-	int ret;
-
-	if (flags & FAULT_FLAG_WRITE) {
-		sb_start_pagefault(inode->i_sb);
-		file_update_time(vma->vm_file);
-	}
-	down_read(&ei->dax_sem);
-
-	ret = dax_pmd_fault(vma, addr, pmd, flags, ext2_get_block);
-
-	up_read(&ei->dax_sem);
-	if (flags & FAULT_FLAG_WRITE)
-		sb_end_pagefault(inode->i_sb);
-	return ret;
-}
-
 static int ext2_dax_pfn_mkwrite(struct vm_area_struct *vma,
 		struct vm_fault *vmf)
 {
@@ -154,7 +133,11 @@ static int ext2_dax_pfn_mkwrite(struct vm_area_struct *vma,
 
 static const struct vm_operations_struct ext2_dax_vm_ops = {
 	.fault		= ext2_dax_fault,
-	.pmd_fault	= ext2_dax_pmd_fault,
+	/*
+	 * .pmd_fault is not supported for DAX because allocation in ext2
+	 * cannot be reliably aligned to huge page sizes and so pmd faults
+	 * will always fail and fail back to regular faults.
+	 */
 	.page_mkwrite	= ext2_dax_fault,
 	.pfn_mkwrite	= ext2_dax_pfn_mkwrite,
 };
@@ -166,7 +149,7 @@ static int ext2_file_mmap(struct file *file, struct vm_area_struct *vma)
 
 	file_accessed(file);
 	vma->vm_ops = &ext2_dax_vm_ops;
-	vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE;
+	vma->vm_flags |= VM_MIXEDMAP;
 	return 0;
 }
 #else
-- 
2.7.4

_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH v5 04/17] ext2: remove support for DAX PMD faults
@ 2016-10-07 21:08   ` Ross Zwisler
  0 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-07 21:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

DAX PMD support was added via the following commit:

commit e7b1ea2ad658 ("ext2: huge page fault support")

I believe this path to be untested as ext2 doesn't reliably provide block
allocations that are aligned to 2MiB.  In my testing I've been unable to
get ext2 to actually fault in a PMD.  It always fails with a "pfn
unaligned" message because the sector returned by ext2_get_block() isn't
aligned.

I've tried various settings for the "stride" and "stripe_width" extended
options to mkfs.ext2, without any luck.

Since we can't reliably get PMDs, remove support so that we don't have an
untested code path that we may someday traverse when we happen to get an
aligned block allocation.  This should also make 4k DAX faults in ext2 a
bit faster since they will no longer have to call the PMD fault handler
only to get a response of VM_FAULT_FALLBACK.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/ext2/file.c | 29 ++++++-----------------------
 1 file changed, 6 insertions(+), 23 deletions(-)

diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index 0ca363d..0f257f8 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -107,27 +107,6 @@ static int ext2_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 	return ret;
 }
 
-static int ext2_dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
-						pmd_t *pmd, unsigned int flags)
-{
-	struct inode *inode = file_inode(vma->vm_file);
-	struct ext2_inode_info *ei = EXT2_I(inode);
-	int ret;
-
-	if (flags & FAULT_FLAG_WRITE) {
-		sb_start_pagefault(inode->i_sb);
-		file_update_time(vma->vm_file);
-	}
-	down_read(&ei->dax_sem);
-
-	ret = dax_pmd_fault(vma, addr, pmd, flags, ext2_get_block);
-
-	up_read(&ei->dax_sem);
-	if (flags & FAULT_FLAG_WRITE)
-		sb_end_pagefault(inode->i_sb);
-	return ret;
-}
-
 static int ext2_dax_pfn_mkwrite(struct vm_area_struct *vma,
 		struct vm_fault *vmf)
 {
@@ -154,7 +133,11 @@ static int ext2_dax_pfn_mkwrite(struct vm_area_struct *vma,
 
 static const struct vm_operations_struct ext2_dax_vm_ops = {
 	.fault		= ext2_dax_fault,
-	.pmd_fault	= ext2_dax_pmd_fault,
+	/*
+	 * .pmd_fault is not supported for DAX because allocation in ext2
+	 * cannot be reliably aligned to huge page sizes and so pmd faults
+	 * will always fail and fail back to regular faults.
+	 */
 	.page_mkwrite	= ext2_dax_fault,
 	.pfn_mkwrite	= ext2_dax_pfn_mkwrite,
 };
@@ -166,7 +149,7 @@ static int ext2_file_mmap(struct file *file, struct vm_area_struct *vma)
 
 	file_accessed(file);
 	vma->vm_ops = &ext2_dax_vm_ops;
-	vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE;
+	vma->vm_flags |= VM_MIXEDMAP;
 	return 0;
 }
 #else
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH v5 04/17] ext2: remove support for DAX PMD faults
@ 2016-10-07 21:08   ` Ross Zwisler
  0 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-07 21:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

DAX PMD support was added via the following commit:

commit e7b1ea2ad658 ("ext2: huge page fault support")

I believe this path to be untested as ext2 doesn't reliably provide block
allocations that are aligned to 2MiB.  In my testing I've been unable to
get ext2 to actually fault in a PMD.  It always fails with a "pfn
unaligned" message because the sector returned by ext2_get_block() isn't
aligned.

I've tried various settings for the "stride" and "stripe_width" extended
options to mkfs.ext2, without any luck.

Since we can't reliably get PMDs, remove support so that we don't have an
untested code path that we may someday traverse when we happen to get an
aligned block allocation.  This should also make 4k DAX faults in ext2 a
bit faster since they will no longer have to call the PMD fault handler
only to get a response of VM_FAULT_FALLBACK.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/ext2/file.c | 29 ++++++-----------------------
 1 file changed, 6 insertions(+), 23 deletions(-)

diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index 0ca363d..0f257f8 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -107,27 +107,6 @@ static int ext2_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 	return ret;
 }
 
-static int ext2_dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
-						pmd_t *pmd, unsigned int flags)
-{
-	struct inode *inode = file_inode(vma->vm_file);
-	struct ext2_inode_info *ei = EXT2_I(inode);
-	int ret;
-
-	if (flags & FAULT_FLAG_WRITE) {
-		sb_start_pagefault(inode->i_sb);
-		file_update_time(vma->vm_file);
-	}
-	down_read(&ei->dax_sem);
-
-	ret = dax_pmd_fault(vma, addr, pmd, flags, ext2_get_block);
-
-	up_read(&ei->dax_sem);
-	if (flags & FAULT_FLAG_WRITE)
-		sb_end_pagefault(inode->i_sb);
-	return ret;
-}
-
 static int ext2_dax_pfn_mkwrite(struct vm_area_struct *vma,
 		struct vm_fault *vmf)
 {
@@ -154,7 +133,11 @@ static int ext2_dax_pfn_mkwrite(struct vm_area_struct *vma,
 
 static const struct vm_operations_struct ext2_dax_vm_ops = {
 	.fault		= ext2_dax_fault,
-	.pmd_fault	= ext2_dax_pmd_fault,
+	/*
+	 * .pmd_fault is not supported for DAX because allocation in ext2
+	 * cannot be reliably aligned to huge page sizes and so pmd faults
+	 * will always fail and fail back to regular faults.
+	 */
 	.page_mkwrite	= ext2_dax_fault,
 	.pfn_mkwrite	= ext2_dax_pfn_mkwrite,
 };
@@ -166,7 +149,7 @@ static int ext2_file_mmap(struct file *file, struct vm_area_struct *vma)
 
 	file_accessed(file);
 	vma->vm_ops = &ext2_dax_vm_ops;
-	vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE;
+	vma->vm_flags |= VM_MIXEDMAP;
 	return 0;
 }
 #else
-- 
2.7.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH v5 04/17] ext2: remove support for DAX PMD faults
@ 2016-10-07 21:08   ` Ross Zwisler
  0 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-07 21:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

DAX PMD support was added via the following commit:

commit e7b1ea2ad658 ("ext2: huge page fault support")

I believe this path to be untested as ext2 doesn't reliably provide block
allocations that are aligned to 2MiB.  In my testing I've been unable to
get ext2 to actually fault in a PMD.  It always fails with a "pfn
unaligned" message because the sector returned by ext2_get_block() isn't
aligned.

I've tried various settings for the "stride" and "stripe_width" extended
options to mkfs.ext2, without any luck.

Since we can't reliably get PMDs, remove support so that we don't have an
untested code path that we may someday traverse when we happen to get an
aligned block allocation.  This should also make 4k DAX faults in ext2 a
bit faster since they will no longer have to call the PMD fault handler
only to get a response of VM_FAULT_FALLBACK.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/ext2/file.c | 29 ++++++-----------------------
 1 file changed, 6 insertions(+), 23 deletions(-)

diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index 0ca363d..0f257f8 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -107,27 +107,6 @@ static int ext2_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 	return ret;
 }
 
-static int ext2_dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
-						pmd_t *pmd, unsigned int flags)
-{
-	struct inode *inode = file_inode(vma->vm_file);
-	struct ext2_inode_info *ei = EXT2_I(inode);
-	int ret;
-
-	if (flags & FAULT_FLAG_WRITE) {
-		sb_start_pagefault(inode->i_sb);
-		file_update_time(vma->vm_file);
-	}
-	down_read(&ei->dax_sem);
-
-	ret = dax_pmd_fault(vma, addr, pmd, flags, ext2_get_block);
-
-	up_read(&ei->dax_sem);
-	if (flags & FAULT_FLAG_WRITE)
-		sb_end_pagefault(inode->i_sb);
-	return ret;
-}
-
 static int ext2_dax_pfn_mkwrite(struct vm_area_struct *vma,
 		struct vm_fault *vmf)
 {
@@ -154,7 +133,11 @@ static int ext2_dax_pfn_mkwrite(struct vm_area_struct *vma,
 
 static const struct vm_operations_struct ext2_dax_vm_ops = {
 	.fault		= ext2_dax_fault,
-	.pmd_fault	= ext2_dax_pmd_fault,
+	/*
+	 * .pmd_fault is not supported for DAX because allocation in ext2
+	 * cannot be reliably aligned to huge page sizes and so pmd faults
+	 * will always fail and fail back to regular faults.
+	 */
 	.page_mkwrite	= ext2_dax_fault,
 	.pfn_mkwrite	= ext2_dax_pfn_mkwrite,
 };
@@ -166,7 +149,7 @@ static int ext2_file_mmap(struct file *file, struct vm_area_struct *vma)
 
 	file_accessed(file);
 	vma->vm_ops = &ext2_dax_vm_ops;
-	vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE;
+	vma->vm_flags |= VM_MIXEDMAP;
 	return 0;
 }
 #else
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH v5 05/17] ext2: return -EIO on ext2_iomap_end() failure
  2016-10-07 21:08 ` Ross Zwisler
                     ` (2 preceding siblings ...)
  (?)
@ 2016-10-07 21:08   ` Ross Zwisler
  -1 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-07 21:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: Theodore Ts'o, Matthew Wilcox, Dave Chinner, linux-nvdimm,
	Christoph Hellwig, linux-xfs, linux-mm, Andreas Dilger,
	Alexander Viro, Jan Kara, linux-fsdevel, linux-ext4,
	Andrew Morton

Right now we just return 0 for success, but we really want to let callers
know about this failure.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/ext2/inode.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index c7dbb46..368913c 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -830,8 +830,10 @@ ext2_iomap_end(struct inode *inode, loff_t offset, loff_t length,
 {
 	if (iomap->type == IOMAP_MAPPED &&
 	    written < length &&
-	    (flags & IOMAP_WRITE))
+	    (flags & IOMAP_WRITE)) {
 		ext2_write_failed(inode->i_mapping, offset + length);
+		return -EIO;
+	}
 	return 0;
 }
 
-- 
2.7.4

_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH v5 05/17] ext2: return -EIO on ext2_iomap_end() failure
@ 2016-10-07 21:08   ` Ross Zwisler
  0 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-07 21:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

Right now we just return 0 for success, but we really want to let callers
know about this failure.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/ext2/inode.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index c7dbb46..368913c 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -830,8 +830,10 @@ ext2_iomap_end(struct inode *inode, loff_t offset, loff_t length,
 {
 	if (iomap->type == IOMAP_MAPPED &&
 	    written < length &&
-	    (flags & IOMAP_WRITE))
+	    (flags & IOMAP_WRITE)) {
 		ext2_write_failed(inode->i_mapping, offset + length);
+		return -EIO;
+	}
 	return 0;
 }
 
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH v5 05/17] ext2: return -EIO on ext2_iomap_end() failure
@ 2016-10-07 21:08   ` Ross Zwisler
  0 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-07 21:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

Right now we just return 0 for success, but we really want to let callers
know about this failure.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/ext2/inode.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index c7dbb46..368913c 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -830,8 +830,10 @@ ext2_iomap_end(struct inode *inode, loff_t offset, loff_t length,
 {
 	if (iomap->type == IOMAP_MAPPED &&
 	    written < length &&
-	    (flags & IOMAP_WRITE))
+	    (flags & IOMAP_WRITE)) {
 		ext2_write_failed(inode->i_mapping, offset + length);
+		return -EIO;
+	}
 	return 0;
 }
 
-- 
2.7.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH v5 05/17] ext2: return -EIO on ext2_iomap_end() failure
@ 2016-10-07 21:08   ` Ross Zwisler
  0 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-07 21:08 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: Theodore Ts'o, Matthew Wilcox, Dave Chinner,
	linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw, Christoph Hellwig,
	linux-xfs-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Andreas Dilger, Alexander Viro,
	Jan Kara, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-ext4-u79uwXL29TY76Z2rM5mHXA, Andrew Morton

Right now we just return 0 for success, but we really want to let callers
know about this failure.

Signed-off-by: Ross Zwisler <ross.zwisler-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
---
 fs/ext2/inode.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index c7dbb46..368913c 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -830,8 +830,10 @@ ext2_iomap_end(struct inode *inode, loff_t offset, loff_t length,
 {
 	if (iomap->type == IOMAP_MAPPED &&
 	    written < length &&
-	    (flags & IOMAP_WRITE))
+	    (flags & IOMAP_WRITE)) {
 		ext2_write_failed(inode->i_mapping, offset + length);
+		return -EIO;
+	}
 	return 0;
 }
 
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH v5 05/17] ext2: return -EIO on ext2_iomap_end() failure
@ 2016-10-07 21:08   ` Ross Zwisler
  0 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-07 21:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

Right now we just return 0 for success, but we really want to let callers
know about this failure.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/ext2/inode.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index c7dbb46..368913c 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -830,8 +830,10 @@ ext2_iomap_end(struct inode *inode, loff_t offset, loff_t length,
 {
 	if (iomap->type == IOMAP_MAPPED &&
 	    written < length &&
-	    (flags & IOMAP_WRITE))
+	    (flags & IOMAP_WRITE)) {
 		ext2_write_failed(inode->i_mapping, offset + length);
+		return -EIO;
+	}
 	return 0;
 }
 
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH v5 06/17] dax: make 'wait_table' global variable static
  2016-10-07 21:08 ` Ross Zwisler
                     ` (2 preceding siblings ...)
  (?)
@ 2016-10-07 21:08   ` Ross Zwisler
  -1 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-07 21:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: Theodore Ts'o, Matthew Wilcox, Dave Chinner, linux-nvdimm,
	Christoph Hellwig, linux-xfs, linux-mm, Andreas Dilger,
	Alexander Viro, Jan Kara, linux-fsdevel, linux-ext4,
	Andrew Morton

The global 'wait_table' variable is only used within fs/dax.c, and
generates the following sparse warning:

fs/dax.c:39:19: warning: symbol 'wait_table' was not declared. Should it be static?

Make it static so it has scope local to fs/dax.c, and to make sparse happy.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/dax.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/dax.c b/fs/dax.c
index 9b9be8a..ac28cdf 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -52,7 +52,7 @@
 #define DAX_WAIT_TABLE_BITS 12
 #define DAX_WAIT_TABLE_ENTRIES (1 << DAX_WAIT_TABLE_BITS)
 
-wait_queue_head_t wait_table[DAX_WAIT_TABLE_ENTRIES];
+static wait_queue_head_t wait_table[DAX_WAIT_TABLE_ENTRIES];
 
 static int __init init_dax_wait_table(void)
 {
-- 
2.7.4

_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH v5 06/17] dax: make 'wait_table' global variable static
@ 2016-10-07 21:08   ` Ross Zwisler
  0 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-07 21:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

The global 'wait_table' variable is only used within fs/dax.c, and
generates the following sparse warning:

fs/dax.c:39:19: warning: symbol 'wait_table' was not declared. Should it be static?

Make it static so it has scope local to fs/dax.c, and to make sparse happy.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/dax.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/dax.c b/fs/dax.c
index 9b9be8a..ac28cdf 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -52,7 +52,7 @@
 #define DAX_WAIT_TABLE_BITS 12
 #define DAX_WAIT_TABLE_ENTRIES (1 << DAX_WAIT_TABLE_BITS)
 
-wait_queue_head_t wait_table[DAX_WAIT_TABLE_ENTRIES];
+static wait_queue_head_t wait_table[DAX_WAIT_TABLE_ENTRIES];
 
 static int __init init_dax_wait_table(void)
 {
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH v5 06/17] dax: make 'wait_table' global variable static
@ 2016-10-07 21:08   ` Ross Zwisler
  0 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-07 21:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

The global 'wait_table' variable is only used within fs/dax.c, and
generates the following sparse warning:

fs/dax.c:39:19: warning: symbol 'wait_table' was not declared. Should it be static?

Make it static so it has scope local to fs/dax.c, and to make sparse happy.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/dax.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/dax.c b/fs/dax.c
index 9b9be8a..ac28cdf 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -52,7 +52,7 @@
 #define DAX_WAIT_TABLE_BITS 12
 #define DAX_WAIT_TABLE_ENTRIES (1 << DAX_WAIT_TABLE_BITS)
 
-wait_queue_head_t wait_table[DAX_WAIT_TABLE_ENTRIES];
+static wait_queue_head_t wait_table[DAX_WAIT_TABLE_ENTRIES];
 
 static int __init init_dax_wait_table(void)
 {
-- 
2.7.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH v5 06/17] dax: make 'wait_table' global variable static
@ 2016-10-07 21:08   ` Ross Zwisler
  0 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-07 21:08 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: Theodore Ts'o, Matthew Wilcox, Dave Chinner,
	linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw, Christoph Hellwig,
	linux-xfs-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Andreas Dilger, Alexander Viro,
	Jan Kara, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-ext4-u79uwXL29TY76Z2rM5mHXA, Andrew Morton

The global 'wait_table' variable is only used within fs/dax.c, and
generates the following sparse warning:

fs/dax.c:39:19: warning: symbol 'wait_table' was not declared. Should it be static?

Make it static so it has scope local to fs/dax.c, and to make sparse happy.

Signed-off-by: Ross Zwisler <ross.zwisler-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
Reviewed-by: Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org>
Reviewed-by: Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org>
---
 fs/dax.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/dax.c b/fs/dax.c
index 9b9be8a..ac28cdf 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -52,7 +52,7 @@
 #define DAX_WAIT_TABLE_BITS 12
 #define DAX_WAIT_TABLE_ENTRIES (1 << DAX_WAIT_TABLE_BITS)
 
-wait_queue_head_t wait_table[DAX_WAIT_TABLE_ENTRIES];
+static wait_queue_head_t wait_table[DAX_WAIT_TABLE_ENTRIES];
 
 static int __init init_dax_wait_table(void)
 {
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH v5 06/17] dax: make 'wait_table' global variable static
@ 2016-10-07 21:08   ` Ross Zwisler
  0 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-07 21:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

The global 'wait_table' variable is only used within fs/dax.c, and
generates the following sparse warning:

fs/dax.c:39:19: warning: symbol 'wait_table' was not declared. Should it be static?

Make it static so it has scope local to fs/dax.c, and to make sparse happy.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/dax.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/dax.c b/fs/dax.c
index 9b9be8a..ac28cdf 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -52,7 +52,7 @@
 #define DAX_WAIT_TABLE_BITS 12
 #define DAX_WAIT_TABLE_ENTRIES (1 << DAX_WAIT_TABLE_BITS)
 
-wait_queue_head_t wait_table[DAX_WAIT_TABLE_ENTRIES];
+static wait_queue_head_t wait_table[DAX_WAIT_TABLE_ENTRIES];
 
 static int __init init_dax_wait_table(void)
 {
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH v5 07/17] dax: remove the last BUG_ON() from fs/dax.c
  2016-10-07 21:08 ` Ross Zwisler
                     ` (2 preceding siblings ...)
  (?)
@ 2016-10-07 21:08   ` Ross Zwisler
  -1 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-07 21:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: Theodore Ts'o, Matthew Wilcox, Dave Chinner, linux-nvdimm,
	Christoph Hellwig, linux-xfs, linux-mm, Andreas Dilger,
	Alexander Viro, Jan Kara, linux-fsdevel, linux-ext4,
	Andrew Morton

Don't take down the kernel if we get an invalid 'from' and 'length'
argument pair.  Just warn once and return an error.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/dax.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/dax.c b/fs/dax.c
index ac28cdf..98189ac 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1194,7 +1194,8 @@ int dax_zero_page_range(struct inode *inode, loff_t from, unsigned length,
 	/* Block boundary? Nothing to do */
 	if (!length)
 		return 0;
-	BUG_ON((offset + length) > PAGE_SIZE);
+	if (WARN_ON_ONCE((offset + length) > PAGE_SIZE))
+		return -EINVAL;
 
 	memset(&bh, 0, sizeof(bh));
 	bh.b_bdev = inode->i_sb->s_bdev;
-- 
2.7.4

_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH v5 07/17] dax: remove the last BUG_ON() from fs/dax.c
@ 2016-10-07 21:08   ` Ross Zwisler
  0 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-07 21:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

Don't take down the kernel if we get an invalid 'from' and 'length'
argument pair.  Just warn once and return an error.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/dax.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/dax.c b/fs/dax.c
index ac28cdf..98189ac 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1194,7 +1194,8 @@ int dax_zero_page_range(struct inode *inode, loff_t from, unsigned length,
 	/* Block boundary? Nothing to do */
 	if (!length)
 		return 0;
-	BUG_ON((offset + length) > PAGE_SIZE);
+	if (WARN_ON_ONCE((offset + length) > PAGE_SIZE))
+		return -EINVAL;
 
 	memset(&bh, 0, sizeof(bh));
 	bh.b_bdev = inode->i_sb->s_bdev;
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH v5 07/17] dax: remove the last BUG_ON() from fs/dax.c
@ 2016-10-07 21:08   ` Ross Zwisler
  0 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-07 21:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

Don't take down the kernel if we get an invalid 'from' and 'length'
argument pair.  Just warn once and return an error.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/dax.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/dax.c b/fs/dax.c
index ac28cdf..98189ac 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1194,7 +1194,8 @@ int dax_zero_page_range(struct inode *inode, loff_t from, unsigned length,
 	/* Block boundary? Nothing to do */
 	if (!length)
 		return 0;
-	BUG_ON((offset + length) > PAGE_SIZE);
+	if (WARN_ON_ONCE((offset + length) > PAGE_SIZE))
+		return -EINVAL;
 
 	memset(&bh, 0, sizeof(bh));
 	bh.b_bdev = inode->i_sb->s_bdev;
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH v5 07/17] dax: remove the last BUG_ON() from fs/dax.c
@ 2016-10-07 21:08   ` Ross Zwisler
  0 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-07 21:08 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: Theodore Ts'o, Matthew Wilcox, Dave Chinner,
	linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw, Christoph Hellwig,
	linux-xfs-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Andreas Dilger, Alexander Viro,
	Jan Kara, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-ext4-u79uwXL29TY76Z2rM5mHXA, Andrew Morton

Don't take down the kernel if we get an invalid 'from' and 'length'
argument pair.  Just warn once and return an error.

Signed-off-by: Ross Zwisler <ross.zwisler-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
---
 fs/dax.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/dax.c b/fs/dax.c
index ac28cdf..98189ac 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1194,7 +1194,8 @@ int dax_zero_page_range(struct inode *inode, loff_t from, unsigned length,
 	/* Block boundary? Nothing to do */
 	if (!length)
 		return 0;
-	BUG_ON((offset + length) > PAGE_SIZE);
+	if (WARN_ON_ONCE((offset + length) > PAGE_SIZE))
+		return -EINVAL;
 
 	memset(&bh, 0, sizeof(bh));
 	bh.b_bdev = inode->i_sb->s_bdev;
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH v5 07/17] dax: remove the last BUG_ON() from fs/dax.c
@ 2016-10-07 21:08   ` Ross Zwisler
  0 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-07 21:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

Don't take down the kernel if we get an invalid 'from' and 'length'
argument pair.  Just warn once and return an error.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/dax.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/dax.c b/fs/dax.c
index ac28cdf..98189ac 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1194,7 +1194,8 @@ int dax_zero_page_range(struct inode *inode, loff_t from, unsigned length,
 	/* Block boundary? Nothing to do */
 	if (!length)
 		return 0;
-	BUG_ON((offset + length) > PAGE_SIZE);
+	if (WARN_ON_ONCE((offset + length) > PAGE_SIZE))
+		return -EINVAL;
 
 	memset(&bh, 0, sizeof(bh));
 	bh.b_bdev = inode->i_sb->s_bdev;
-- 
2.7.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH v5 08/17] dax: consistent variable naming for DAX entries
  2016-10-07 21:08 ` Ross Zwisler
                     ` (2 preceding siblings ...)
  (?)
@ 2016-10-07 21:08   ` Ross Zwisler
  -1 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-07 21:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: Theodore Ts'o, Matthew Wilcox, Dave Chinner, linux-nvdimm,
	Christoph Hellwig, linux-xfs, linux-mm, Andreas Dilger,
	Alexander Viro, Jan Kara, linux-fsdevel, linux-ext4,
	Andrew Morton

No functional change.

Consistently use the variable name 'entry' instead of 'ret' for DAX radix
tree entries.  This was already happening in most of the code, so update
get_unlocked_mapping_entry(), grab_mapping_entry() and
dax_unlock_mapping_entry().

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/dax.c | 34 +++++++++++++++++-----------------
 1 file changed, 17 insertions(+), 17 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 98189ac..152a6e1 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -357,7 +357,7 @@ static inline void *unlock_slot(struct address_space *mapping, void **slot)
 static void *get_unlocked_mapping_entry(struct address_space *mapping,
 					pgoff_t index, void ***slotp)
 {
-	void *ret, **slot;
+	void *entry, **slot;
 	struct wait_exceptional_entry_queue ewait;
 	wait_queue_head_t *wq = dax_entry_waitqueue(mapping, index);
 
@@ -367,13 +367,13 @@ static void *get_unlocked_mapping_entry(struct address_space *mapping,
 	ewait.key.index = index;
 
 	for (;;) {
-		ret = __radix_tree_lookup(&mapping->page_tree, index, NULL,
+		entry = __radix_tree_lookup(&mapping->page_tree, index, NULL,
 					  &slot);
-		if (!ret || !radix_tree_exceptional_entry(ret) ||
+		if (!entry || !radix_tree_exceptional_entry(entry) ||
 		    !slot_locked(mapping, slot)) {
 			if (slotp)
 				*slotp = slot;
-			return ret;
+			return entry;
 		}
 		prepare_to_wait_exclusive(wq, &ewait.wait,
 					  TASK_UNINTERRUPTIBLE);
@@ -396,13 +396,13 @@ static void *get_unlocked_mapping_entry(struct address_space *mapping,
  */
 static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index)
 {
-	void *ret, **slot;
+	void *entry, **slot;
 
 restart:
 	spin_lock_irq(&mapping->tree_lock);
-	ret = get_unlocked_mapping_entry(mapping, index, &slot);
+	entry = get_unlocked_mapping_entry(mapping, index, &slot);
 	/* No entry for given index? Make sure radix tree is big enough. */
-	if (!ret) {
+	if (!entry) {
 		int err;
 
 		spin_unlock_irq(&mapping->tree_lock);
@@ -410,10 +410,10 @@ restart:
 				mapping_gfp_mask(mapping) & ~__GFP_HIGHMEM);
 		if (err)
 			return ERR_PTR(err);
-		ret = (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY |
+		entry = (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY |
 			       RADIX_DAX_ENTRY_LOCK);
 		spin_lock_irq(&mapping->tree_lock);
-		err = radix_tree_insert(&mapping->page_tree, index, ret);
+		err = radix_tree_insert(&mapping->page_tree, index, entry);
 		radix_tree_preload_end();
 		if (err) {
 			spin_unlock_irq(&mapping->tree_lock);
@@ -425,11 +425,11 @@ restart:
 		/* Good, we have inserted empty locked entry into the tree. */
 		mapping->nrexceptional++;
 		spin_unlock_irq(&mapping->tree_lock);
-		return ret;
+		return entry;
 	}
 	/* Normal page in radix tree? */
-	if (!radix_tree_exceptional_entry(ret)) {
-		struct page *page = ret;
+	if (!radix_tree_exceptional_entry(entry)) {
+		struct page *page = entry;
 
 		get_page(page);
 		spin_unlock_irq(&mapping->tree_lock);
@@ -442,9 +442,9 @@ restart:
 		}
 		return page;
 	}
-	ret = lock_slot(mapping, slot);
+	entry = lock_slot(mapping, slot);
 	spin_unlock_irq(&mapping->tree_lock);
-	return ret;
+	return entry;
 }
 
 void dax_wake_mapping_entry_waiter(struct address_space *mapping,
@@ -469,11 +469,11 @@ void dax_wake_mapping_entry_waiter(struct address_space *mapping,
 
 void dax_unlock_mapping_entry(struct address_space *mapping, pgoff_t index)
 {
-	void *ret, **slot;
+	void *entry, **slot;
 
 	spin_lock_irq(&mapping->tree_lock);
-	ret = __radix_tree_lookup(&mapping->page_tree, index, NULL, &slot);
-	if (WARN_ON_ONCE(!ret || !radix_tree_exceptional_entry(ret) ||
+	entry = __radix_tree_lookup(&mapping->page_tree, index, NULL, &slot);
+	if (WARN_ON_ONCE(!entry || !radix_tree_exceptional_entry(entry) ||
 			 !slot_locked(mapping, slot))) {
 		spin_unlock_irq(&mapping->tree_lock);
 		return;
-- 
2.7.4

_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH v5 08/17] dax: consistent variable naming for DAX entries
@ 2016-10-07 21:08   ` Ross Zwisler
  0 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-07 21:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

No functional change.

Consistently use the variable name 'entry' instead of 'ret' for DAX radix
tree entries.  This was already happening in most of the code, so update
get_unlocked_mapping_entry(), grab_mapping_entry() and
dax_unlock_mapping_entry().

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/dax.c | 34 +++++++++++++++++-----------------
 1 file changed, 17 insertions(+), 17 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 98189ac..152a6e1 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -357,7 +357,7 @@ static inline void *unlock_slot(struct address_space *mapping, void **slot)
 static void *get_unlocked_mapping_entry(struct address_space *mapping,
 					pgoff_t index, void ***slotp)
 {
-	void *ret, **slot;
+	void *entry, **slot;
 	struct wait_exceptional_entry_queue ewait;
 	wait_queue_head_t *wq = dax_entry_waitqueue(mapping, index);
 
@@ -367,13 +367,13 @@ static void *get_unlocked_mapping_entry(struct address_space *mapping,
 	ewait.key.index = index;
 
 	for (;;) {
-		ret = __radix_tree_lookup(&mapping->page_tree, index, NULL,
+		entry = __radix_tree_lookup(&mapping->page_tree, index, NULL,
 					  &slot);
-		if (!ret || !radix_tree_exceptional_entry(ret) ||
+		if (!entry || !radix_tree_exceptional_entry(entry) ||
 		    !slot_locked(mapping, slot)) {
 			if (slotp)
 				*slotp = slot;
-			return ret;
+			return entry;
 		}
 		prepare_to_wait_exclusive(wq, &ewait.wait,
 					  TASK_UNINTERRUPTIBLE);
@@ -396,13 +396,13 @@ static void *get_unlocked_mapping_entry(struct address_space *mapping,
  */
 static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index)
 {
-	void *ret, **slot;
+	void *entry, **slot;
 
 restart:
 	spin_lock_irq(&mapping->tree_lock);
-	ret = get_unlocked_mapping_entry(mapping, index, &slot);
+	entry = get_unlocked_mapping_entry(mapping, index, &slot);
 	/* No entry for given index? Make sure radix tree is big enough. */
-	if (!ret) {
+	if (!entry) {
 		int err;
 
 		spin_unlock_irq(&mapping->tree_lock);
@@ -410,10 +410,10 @@ restart:
 				mapping_gfp_mask(mapping) & ~__GFP_HIGHMEM);
 		if (err)
 			return ERR_PTR(err);
-		ret = (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY |
+		entry = (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY |
 			       RADIX_DAX_ENTRY_LOCK);
 		spin_lock_irq(&mapping->tree_lock);
-		err = radix_tree_insert(&mapping->page_tree, index, ret);
+		err = radix_tree_insert(&mapping->page_tree, index, entry);
 		radix_tree_preload_end();
 		if (err) {
 			spin_unlock_irq(&mapping->tree_lock);
@@ -425,11 +425,11 @@ restart:
 		/* Good, we have inserted empty locked entry into the tree. */
 		mapping->nrexceptional++;
 		spin_unlock_irq(&mapping->tree_lock);
-		return ret;
+		return entry;
 	}
 	/* Normal page in radix tree? */
-	if (!radix_tree_exceptional_entry(ret)) {
-		struct page *page = ret;
+	if (!radix_tree_exceptional_entry(entry)) {
+		struct page *page = entry;
 
 		get_page(page);
 		spin_unlock_irq(&mapping->tree_lock);
@@ -442,9 +442,9 @@ restart:
 		}
 		return page;
 	}
-	ret = lock_slot(mapping, slot);
+	entry = lock_slot(mapping, slot);
 	spin_unlock_irq(&mapping->tree_lock);
-	return ret;
+	return entry;
 }
 
 void dax_wake_mapping_entry_waiter(struct address_space *mapping,
@@ -469,11 +469,11 @@ void dax_wake_mapping_entry_waiter(struct address_space *mapping,
 
 void dax_unlock_mapping_entry(struct address_space *mapping, pgoff_t index)
 {
-	void *ret, **slot;
+	void *entry, **slot;
 
 	spin_lock_irq(&mapping->tree_lock);
-	ret = __radix_tree_lookup(&mapping->page_tree, index, NULL, &slot);
-	if (WARN_ON_ONCE(!ret || !radix_tree_exceptional_entry(ret) ||
+	entry = __radix_tree_lookup(&mapping->page_tree, index, NULL, &slot);
+	if (WARN_ON_ONCE(!entry || !radix_tree_exceptional_entry(entry) ||
 			 !slot_locked(mapping, slot))) {
 		spin_unlock_irq(&mapping->tree_lock);
 		return;
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH v5 08/17] dax: consistent variable naming for DAX entries
@ 2016-10-07 21:08   ` Ross Zwisler
  0 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-07 21:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

No functional change.

Consistently use the variable name 'entry' instead of 'ret' for DAX radix
tree entries.  This was already happening in most of the code, so update
get_unlocked_mapping_entry(), grab_mapping_entry() and
dax_unlock_mapping_entry().

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/dax.c | 34 +++++++++++++++++-----------------
 1 file changed, 17 insertions(+), 17 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 98189ac..152a6e1 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -357,7 +357,7 @@ static inline void *unlock_slot(struct address_space *mapping, void **slot)
 static void *get_unlocked_mapping_entry(struct address_space *mapping,
 					pgoff_t index, void ***slotp)
 {
-	void *ret, **slot;
+	void *entry, **slot;
 	struct wait_exceptional_entry_queue ewait;
 	wait_queue_head_t *wq = dax_entry_waitqueue(mapping, index);
 
@@ -367,13 +367,13 @@ static void *get_unlocked_mapping_entry(struct address_space *mapping,
 	ewait.key.index = index;
 
 	for (;;) {
-		ret = __radix_tree_lookup(&mapping->page_tree, index, NULL,
+		entry = __radix_tree_lookup(&mapping->page_tree, index, NULL,
 					  &slot);
-		if (!ret || !radix_tree_exceptional_entry(ret) ||
+		if (!entry || !radix_tree_exceptional_entry(entry) ||
 		    !slot_locked(mapping, slot)) {
 			if (slotp)
 				*slotp = slot;
-			return ret;
+			return entry;
 		}
 		prepare_to_wait_exclusive(wq, &ewait.wait,
 					  TASK_UNINTERRUPTIBLE);
@@ -396,13 +396,13 @@ static void *get_unlocked_mapping_entry(struct address_space *mapping,
  */
 static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index)
 {
-	void *ret, **slot;
+	void *entry, **slot;
 
 restart:
 	spin_lock_irq(&mapping->tree_lock);
-	ret = get_unlocked_mapping_entry(mapping, index, &slot);
+	entry = get_unlocked_mapping_entry(mapping, index, &slot);
 	/* No entry for given index? Make sure radix tree is big enough. */
-	if (!ret) {
+	if (!entry) {
 		int err;
 
 		spin_unlock_irq(&mapping->tree_lock);
@@ -410,10 +410,10 @@ restart:
 				mapping_gfp_mask(mapping) & ~__GFP_HIGHMEM);
 		if (err)
 			return ERR_PTR(err);
-		ret = (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY |
+		entry = (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY |
 			       RADIX_DAX_ENTRY_LOCK);
 		spin_lock_irq(&mapping->tree_lock);
-		err = radix_tree_insert(&mapping->page_tree, index, ret);
+		err = radix_tree_insert(&mapping->page_tree, index, entry);
 		radix_tree_preload_end();
 		if (err) {
 			spin_unlock_irq(&mapping->tree_lock);
@@ -425,11 +425,11 @@ restart:
 		/* Good, we have inserted empty locked entry into the tree. */
 		mapping->nrexceptional++;
 		spin_unlock_irq(&mapping->tree_lock);
-		return ret;
+		return entry;
 	}
 	/* Normal page in radix tree? */
-	if (!radix_tree_exceptional_entry(ret)) {
-		struct page *page = ret;
+	if (!radix_tree_exceptional_entry(entry)) {
+		struct page *page = entry;
 
 		get_page(page);
 		spin_unlock_irq(&mapping->tree_lock);
@@ -442,9 +442,9 @@ restart:
 		}
 		return page;
 	}
-	ret = lock_slot(mapping, slot);
+	entry = lock_slot(mapping, slot);
 	spin_unlock_irq(&mapping->tree_lock);
-	return ret;
+	return entry;
 }
 
 void dax_wake_mapping_entry_waiter(struct address_space *mapping,
@@ -469,11 +469,11 @@ void dax_wake_mapping_entry_waiter(struct address_space *mapping,
 
 void dax_unlock_mapping_entry(struct address_space *mapping, pgoff_t index)
 {
-	void *ret, **slot;
+	void *entry, **slot;
 
 	spin_lock_irq(&mapping->tree_lock);
-	ret = __radix_tree_lookup(&mapping->page_tree, index, NULL, &slot);
-	if (WARN_ON_ONCE(!ret || !radix_tree_exceptional_entry(ret) ||
+	entry = __radix_tree_lookup(&mapping->page_tree, index, NULL, &slot);
+	if (WARN_ON_ONCE(!entry || !radix_tree_exceptional_entry(entry) ||
 			 !slot_locked(mapping, slot))) {
 		spin_unlock_irq(&mapping->tree_lock);
 		return;
-- 
2.7.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH v5 08/17] dax: consistent variable naming for DAX entries
@ 2016-10-07 21:08   ` Ross Zwisler
  0 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-07 21:08 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: Theodore Ts'o, Matthew Wilcox, Dave Chinner,
	linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw, Christoph Hellwig,
	linux-xfs-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Andreas Dilger, Alexander Viro,
	Jan Kara, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-ext4-u79uwXL29TY76Z2rM5mHXA, Andrew Morton

No functional change.

Consistently use the variable name 'entry' instead of 'ret' for DAX radix
tree entries.  This was already happening in most of the code, so update
get_unlocked_mapping_entry(), grab_mapping_entry() and
dax_unlock_mapping_entry().

Signed-off-by: Ross Zwisler <ross.zwisler-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
Reviewed-by: Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org>
Reviewed-by: Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org>
---
 fs/dax.c | 34 +++++++++++++++++-----------------
 1 file changed, 17 insertions(+), 17 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 98189ac..152a6e1 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -357,7 +357,7 @@ static inline void *unlock_slot(struct address_space *mapping, void **slot)
 static void *get_unlocked_mapping_entry(struct address_space *mapping,
 					pgoff_t index, void ***slotp)
 {
-	void *ret, **slot;
+	void *entry, **slot;
 	struct wait_exceptional_entry_queue ewait;
 	wait_queue_head_t *wq = dax_entry_waitqueue(mapping, index);
 
@@ -367,13 +367,13 @@ static void *get_unlocked_mapping_entry(struct address_space *mapping,
 	ewait.key.index = index;
 
 	for (;;) {
-		ret = __radix_tree_lookup(&mapping->page_tree, index, NULL,
+		entry = __radix_tree_lookup(&mapping->page_tree, index, NULL,
 					  &slot);
-		if (!ret || !radix_tree_exceptional_entry(ret) ||
+		if (!entry || !radix_tree_exceptional_entry(entry) ||
 		    !slot_locked(mapping, slot)) {
 			if (slotp)
 				*slotp = slot;
-			return ret;
+			return entry;
 		}
 		prepare_to_wait_exclusive(wq, &ewait.wait,
 					  TASK_UNINTERRUPTIBLE);
@@ -396,13 +396,13 @@ static void *get_unlocked_mapping_entry(struct address_space *mapping,
  */
 static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index)
 {
-	void *ret, **slot;
+	void *entry, **slot;
 
 restart:
 	spin_lock_irq(&mapping->tree_lock);
-	ret = get_unlocked_mapping_entry(mapping, index, &slot);
+	entry = get_unlocked_mapping_entry(mapping, index, &slot);
 	/* No entry for given index? Make sure radix tree is big enough. */
-	if (!ret) {
+	if (!entry) {
 		int err;
 
 		spin_unlock_irq(&mapping->tree_lock);
@@ -410,10 +410,10 @@ restart:
 				mapping_gfp_mask(mapping) & ~__GFP_HIGHMEM);
 		if (err)
 			return ERR_PTR(err);
-		ret = (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY |
+		entry = (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY |
 			       RADIX_DAX_ENTRY_LOCK);
 		spin_lock_irq(&mapping->tree_lock);
-		err = radix_tree_insert(&mapping->page_tree, index, ret);
+		err = radix_tree_insert(&mapping->page_tree, index, entry);
 		radix_tree_preload_end();
 		if (err) {
 			spin_unlock_irq(&mapping->tree_lock);
@@ -425,11 +425,11 @@ restart:
 		/* Good, we have inserted empty locked entry into the tree. */
 		mapping->nrexceptional++;
 		spin_unlock_irq(&mapping->tree_lock);
-		return ret;
+		return entry;
 	}
 	/* Normal page in radix tree? */
-	if (!radix_tree_exceptional_entry(ret)) {
-		struct page *page = ret;
+	if (!radix_tree_exceptional_entry(entry)) {
+		struct page *page = entry;
 
 		get_page(page);
 		spin_unlock_irq(&mapping->tree_lock);
@@ -442,9 +442,9 @@ restart:
 		}
 		return page;
 	}
-	ret = lock_slot(mapping, slot);
+	entry = lock_slot(mapping, slot);
 	spin_unlock_irq(&mapping->tree_lock);
-	return ret;
+	return entry;
 }
 
 void dax_wake_mapping_entry_waiter(struct address_space *mapping,
@@ -469,11 +469,11 @@ void dax_wake_mapping_entry_waiter(struct address_space *mapping,
 
 void dax_unlock_mapping_entry(struct address_space *mapping, pgoff_t index)
 {
-	void *ret, **slot;
+	void *entry, **slot;
 
 	spin_lock_irq(&mapping->tree_lock);
-	ret = __radix_tree_lookup(&mapping->page_tree, index, NULL, &slot);
-	if (WARN_ON_ONCE(!ret || !radix_tree_exceptional_entry(ret) ||
+	entry = __radix_tree_lookup(&mapping->page_tree, index, NULL, &slot);
+	if (WARN_ON_ONCE(!entry || !radix_tree_exceptional_entry(entry) ||
 			 !slot_locked(mapping, slot))) {
 		spin_unlock_irq(&mapping->tree_lock);
 		return;
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH v5 08/17] dax: consistent variable naming for DAX entries
@ 2016-10-07 21:08   ` Ross Zwisler
  0 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-07 21:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

No functional change.

Consistently use the variable name 'entry' instead of 'ret' for DAX radix
tree entries.  This was already happening in most of the code, so update
get_unlocked_mapping_entry(), grab_mapping_entry() and
dax_unlock_mapping_entry().

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/dax.c | 34 +++++++++++++++++-----------------
 1 file changed, 17 insertions(+), 17 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 98189ac..152a6e1 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -357,7 +357,7 @@ static inline void *unlock_slot(struct address_space *mapping, void **slot)
 static void *get_unlocked_mapping_entry(struct address_space *mapping,
 					pgoff_t index, void ***slotp)
 {
-	void *ret, **slot;
+	void *entry, **slot;
 	struct wait_exceptional_entry_queue ewait;
 	wait_queue_head_t *wq = dax_entry_waitqueue(mapping, index);
 
@@ -367,13 +367,13 @@ static void *get_unlocked_mapping_entry(struct address_space *mapping,
 	ewait.key.index = index;
 
 	for (;;) {
-		ret = __radix_tree_lookup(&mapping->page_tree, index, NULL,
+		entry = __radix_tree_lookup(&mapping->page_tree, index, NULL,
 					  &slot);
-		if (!ret || !radix_tree_exceptional_entry(ret) ||
+		if (!entry || !radix_tree_exceptional_entry(entry) ||
 		    !slot_locked(mapping, slot)) {
 			if (slotp)
 				*slotp = slot;
-			return ret;
+			return entry;
 		}
 		prepare_to_wait_exclusive(wq, &ewait.wait,
 					  TASK_UNINTERRUPTIBLE);
@@ -396,13 +396,13 @@ static void *get_unlocked_mapping_entry(struct address_space *mapping,
  */
 static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index)
 {
-	void *ret, **slot;
+	void *entry, **slot;
 
 restart:
 	spin_lock_irq(&mapping->tree_lock);
-	ret = get_unlocked_mapping_entry(mapping, index, &slot);
+	entry = get_unlocked_mapping_entry(mapping, index, &slot);
 	/* No entry for given index? Make sure radix tree is big enough. */
-	if (!ret) {
+	if (!entry) {
 		int err;
 
 		spin_unlock_irq(&mapping->tree_lock);
@@ -410,10 +410,10 @@ restart:
 				mapping_gfp_mask(mapping) & ~__GFP_HIGHMEM);
 		if (err)
 			return ERR_PTR(err);
-		ret = (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY |
+		entry = (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY |
 			       RADIX_DAX_ENTRY_LOCK);
 		spin_lock_irq(&mapping->tree_lock);
-		err = radix_tree_insert(&mapping->page_tree, index, ret);
+		err = radix_tree_insert(&mapping->page_tree, index, entry);
 		radix_tree_preload_end();
 		if (err) {
 			spin_unlock_irq(&mapping->tree_lock);
@@ -425,11 +425,11 @@ restart:
 		/* Good, we have inserted empty locked entry into the tree. */
 		mapping->nrexceptional++;
 		spin_unlock_irq(&mapping->tree_lock);
-		return ret;
+		return entry;
 	}
 	/* Normal page in radix tree? */
-	if (!radix_tree_exceptional_entry(ret)) {
-		struct page *page = ret;
+	if (!radix_tree_exceptional_entry(entry)) {
+		struct page *page = entry;
 
 		get_page(page);
 		spin_unlock_irq(&mapping->tree_lock);
@@ -442,9 +442,9 @@ restart:
 		}
 		return page;
 	}
-	ret = lock_slot(mapping, slot);
+	entry = lock_slot(mapping, slot);
 	spin_unlock_irq(&mapping->tree_lock);
-	return ret;
+	return entry;
 }
 
 void dax_wake_mapping_entry_waiter(struct address_space *mapping,
@@ -469,11 +469,11 @@ void dax_wake_mapping_entry_waiter(struct address_space *mapping,
 
 void dax_unlock_mapping_entry(struct address_space *mapping, pgoff_t index)
 {
-	void *ret, **slot;
+	void *entry, **slot;
 
 	spin_lock_irq(&mapping->tree_lock);
-	ret = __radix_tree_lookup(&mapping->page_tree, index, NULL, &slot);
-	if (WARN_ON_ONCE(!ret || !radix_tree_exceptional_entry(ret) ||
+	entry = __radix_tree_lookup(&mapping->page_tree, index, NULL, &slot);
+	if (WARN_ON_ONCE(!entry || !radix_tree_exceptional_entry(entry) ||
 			 !slot_locked(mapping, slot))) {
 		spin_unlock_irq(&mapping->tree_lock);
 		return;
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH v5 09/17] dax: coordinate locking for offsets in PMD range
  2016-10-07 21:08 ` Ross Zwisler
  (?)
  (?)
@ 2016-10-07 21:08   ` Ross Zwisler
  -1 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-07 21:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: Theodore Ts'o, Matthew Wilcox, Dave Chinner, linux-nvdimm,
	Christoph Hellwig, linux-xfs, linux-mm, Andreas Dilger,
	Alexander Viro, Jan Kara, linux-fsdevel, linux-ext4,
	Andrew Morton

DAX radix tree locking currently locks entries based on the unique
combination of the 'mapping' pointer and the pgoff_t 'index' for the entry.
This works for PTEs, but as we move to PMDs we will need to have all the
offsets within the range covered by the PMD to map to the same bit lock.
To accomplish this, for ranges covered by a PMD entry we will instead lock
based on the page offset of the beginning of the PMD entry.  The 'mapping'
pointer is still used in the same way.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/dax.c            | 58 +++++++++++++++++++++++++++++++----------------------
 include/linux/dax.h |  2 +-
 mm/filemap.c        |  2 +-
 3 files changed, 36 insertions(+), 26 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 152a6e1..608cee9 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -64,14 +64,6 @@ static int __init init_dax_wait_table(void)
 }
 fs_initcall(init_dax_wait_table);
 
-static wait_queue_head_t *dax_entry_waitqueue(struct address_space *mapping,
-					      pgoff_t index)
-{
-	unsigned long hash = hash_long((unsigned long)mapping ^ index,
-				       DAX_WAIT_TABLE_BITS);
-	return wait_table + hash;
-}
-
 static long dax_map_atomic(struct block_device *bdev, struct blk_dax_ctl *dax)
 {
 	struct request_queue *q = bdev->bd_queue;
@@ -285,7 +277,7 @@ EXPORT_SYMBOL_GPL(dax_do_io);
  */
 struct exceptional_entry_key {
 	struct address_space *mapping;
-	unsigned long index;
+	pgoff_t entry_start;
 };
 
 struct wait_exceptional_entry_queue {
@@ -293,6 +285,26 @@ struct wait_exceptional_entry_queue {
 	struct exceptional_entry_key key;
 };
 
+static wait_queue_head_t *dax_entry_waitqueue(struct address_space *mapping,
+		pgoff_t index, void *entry, struct exceptional_entry_key *key)
+{
+	unsigned long hash;
+
+	/*
+	 * If 'entry' is a PMD, align the 'index' that we use for the wait
+	 * queue to the start of that PMD.  This ensures that all offsets in
+	 * the range covered by the PMD map to the same bit lock.
+	 */
+	if (RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD)
+		index &= ~((1UL << (PMD_SHIFT - PAGE_SHIFT)) - 1);
+
+	key->mapping = mapping;
+	key->entry_start = index;
+
+	hash = hash_long((unsigned long)mapping ^ index, DAX_WAIT_TABLE_BITS);
+	return wait_table + hash;
+}
+
 static int wake_exceptional_entry_func(wait_queue_t *wait, unsigned int mode,
 				       int sync, void *keyp)
 {
@@ -301,7 +313,7 @@ static int wake_exceptional_entry_func(wait_queue_t *wait, unsigned int mode,
 		container_of(wait, struct wait_exceptional_entry_queue, wait);
 
 	if (key->mapping != ewait->key.mapping ||
-	    key->index != ewait->key.index)
+	    key->entry_start != ewait->key.entry_start)
 		return 0;
 	return autoremove_wake_function(wait, mode, sync, NULL);
 }
@@ -359,12 +371,10 @@ static void *get_unlocked_mapping_entry(struct address_space *mapping,
 {
 	void *entry, **slot;
 	struct wait_exceptional_entry_queue ewait;
-	wait_queue_head_t *wq = dax_entry_waitqueue(mapping, index);
+	wait_queue_head_t *wq;
 
 	init_wait(&ewait.wait);
 	ewait.wait.func = wake_exceptional_entry_func;
-	ewait.key.mapping = mapping;
-	ewait.key.index = index;
 
 	for (;;) {
 		entry = __radix_tree_lookup(&mapping->page_tree, index, NULL,
@@ -375,6 +385,8 @@ static void *get_unlocked_mapping_entry(struct address_space *mapping,
 				*slotp = slot;
 			return entry;
 		}
+
+		wq = dax_entry_waitqueue(mapping, index, entry, &ewait.key);
 		prepare_to_wait_exclusive(wq, &ewait.wait,
 					  TASK_UNINTERRUPTIBLE);
 		spin_unlock_irq(&mapping->tree_lock);
@@ -448,9 +460,12 @@ restart:
 }
 
 void dax_wake_mapping_entry_waiter(struct address_space *mapping,
-				   pgoff_t index, bool wake_all)
+		pgoff_t index, void *entry, bool wake_all)
 {
-	wait_queue_head_t *wq = dax_entry_waitqueue(mapping, index);
+	struct exceptional_entry_key key;
+	wait_queue_head_t *wq;
+
+	wq = dax_entry_waitqueue(mapping, index, entry, &key);
 
 	/*
 	 * Checking for locked entry and prepare_to_wait_exclusive() happens
@@ -458,13 +473,8 @@ void dax_wake_mapping_entry_waiter(struct address_space *mapping,
 	 * So at this point all tasks that could have seen our entry locked
 	 * must be in the waitqueue and the following check will see them.
 	 */
-	if (waitqueue_active(wq)) {
-		struct exceptional_entry_key key;
-
-		key.mapping = mapping;
-		key.index = index;
+	if (waitqueue_active(wq))
 		__wake_up(wq, TASK_NORMAL, wake_all ? 0 : 1, &key);
-	}
 }
 
 void dax_unlock_mapping_entry(struct address_space *mapping, pgoff_t index)
@@ -480,7 +490,7 @@ void dax_unlock_mapping_entry(struct address_space *mapping, pgoff_t index)
 	}
 	unlock_slot(mapping, slot);
 	spin_unlock_irq(&mapping->tree_lock);
-	dax_wake_mapping_entry_waiter(mapping, index, false);
+	dax_wake_mapping_entry_waiter(mapping, index, entry, false);
 }
 
 static void put_locked_mapping_entry(struct address_space *mapping,
@@ -505,7 +515,7 @@ static void put_unlocked_mapping_entry(struct address_space *mapping,
 		return;
 
 	/* We have to wake up next waiter for the radix tree entry lock */
-	dax_wake_mapping_entry_waiter(mapping, index, false);
+	dax_wake_mapping_entry_waiter(mapping, index, entry, false);
 }
 
 /*
@@ -532,7 +542,7 @@ int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index)
 	radix_tree_delete(&mapping->page_tree, index);
 	mapping->nrexceptional--;
 	spin_unlock_irq(&mapping->tree_lock);
-	dax_wake_mapping_entry_waiter(mapping, index, true);
+	dax_wake_mapping_entry_waiter(mapping, index, entry, true);
 
 	return 1;
 }
diff --git a/include/linux/dax.h b/include/linux/dax.h
index add6c4b..a41a747 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -22,7 +22,7 @@ int iomap_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t);
 int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index);
 void dax_wake_mapping_entry_waiter(struct address_space *mapping,
-				   pgoff_t index, bool wake_all);
+		pgoff_t index, void *entry, bool wake_all);
 
 #ifdef CONFIG_FS_DAX
 struct page *read_dax_sector(struct block_device *bdev, sector_t n);
diff --git a/mm/filemap.c b/mm/filemap.c
index 2f1175e..a596462 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -617,7 +617,7 @@ static int page_cache_tree_insert(struct address_space *mapping,
 			if (node)
 				workingset_node_pages_dec(node);
 			/* Wakeup waiters for exceptional entry lock */
-			dax_wake_mapping_entry_waiter(mapping, page->index,
+			dax_wake_mapping_entry_waiter(mapping, page->index, p,
 						      false);
 		}
 	}
-- 
2.7.4

_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH v5 09/17] dax: coordinate locking for offsets in PMD range
@ 2016-10-07 21:08   ` Ross Zwisler
  0 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-07 21:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

DAX radix tree locking currently locks entries based on the unique
combination of the 'mapping' pointer and the pgoff_t 'index' for the entry.
This works for PTEs, but as we move to PMDs we will need to have all the
offsets within the range covered by the PMD to map to the same bit lock.
To accomplish this, for ranges covered by a PMD entry we will instead lock
based on the page offset of the beginning of the PMD entry.  The 'mapping'
pointer is still used in the same way.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/dax.c            | 58 +++++++++++++++++++++++++++++++----------------------
 include/linux/dax.h |  2 +-
 mm/filemap.c        |  2 +-
 3 files changed, 36 insertions(+), 26 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 152a6e1..608cee9 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -64,14 +64,6 @@ static int __init init_dax_wait_table(void)
 }
 fs_initcall(init_dax_wait_table);
 
-static wait_queue_head_t *dax_entry_waitqueue(struct address_space *mapping,
-					      pgoff_t index)
-{
-	unsigned long hash = hash_long((unsigned long)mapping ^ index,
-				       DAX_WAIT_TABLE_BITS);
-	return wait_table + hash;
-}
-
 static long dax_map_atomic(struct block_device *bdev, struct blk_dax_ctl *dax)
 {
 	struct request_queue *q = bdev->bd_queue;
@@ -285,7 +277,7 @@ EXPORT_SYMBOL_GPL(dax_do_io);
  */
 struct exceptional_entry_key {
 	struct address_space *mapping;
-	unsigned long index;
+	pgoff_t entry_start;
 };
 
 struct wait_exceptional_entry_queue {
@@ -293,6 +285,26 @@ struct wait_exceptional_entry_queue {
 	struct exceptional_entry_key key;
 };
 
+static wait_queue_head_t *dax_entry_waitqueue(struct address_space *mapping,
+		pgoff_t index, void *entry, struct exceptional_entry_key *key)
+{
+	unsigned long hash;
+
+	/*
+	 * If 'entry' is a PMD, align the 'index' that we use for the wait
+	 * queue to the start of that PMD.  This ensures that all offsets in
+	 * the range covered by the PMD map to the same bit lock.
+	 */
+	if (RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD)
+		index &= ~((1UL << (PMD_SHIFT - PAGE_SHIFT)) - 1);
+
+	key->mapping = mapping;
+	key->entry_start = index;
+
+	hash = hash_long((unsigned long)mapping ^ index, DAX_WAIT_TABLE_BITS);
+	return wait_table + hash;
+}
+
 static int wake_exceptional_entry_func(wait_queue_t *wait, unsigned int mode,
 				       int sync, void *keyp)
 {
@@ -301,7 +313,7 @@ static int wake_exceptional_entry_func(wait_queue_t *wait, unsigned int mode,
 		container_of(wait, struct wait_exceptional_entry_queue, wait);
 
 	if (key->mapping != ewait->key.mapping ||
-	    key->index != ewait->key.index)
+	    key->entry_start != ewait->key.entry_start)
 		return 0;
 	return autoremove_wake_function(wait, mode, sync, NULL);
 }
@@ -359,12 +371,10 @@ static void *get_unlocked_mapping_entry(struct address_space *mapping,
 {
 	void *entry, **slot;
 	struct wait_exceptional_entry_queue ewait;
-	wait_queue_head_t *wq = dax_entry_waitqueue(mapping, index);
+	wait_queue_head_t *wq;
 
 	init_wait(&ewait.wait);
 	ewait.wait.func = wake_exceptional_entry_func;
-	ewait.key.mapping = mapping;
-	ewait.key.index = index;
 
 	for (;;) {
 		entry = __radix_tree_lookup(&mapping->page_tree, index, NULL,
@@ -375,6 +385,8 @@ static void *get_unlocked_mapping_entry(struct address_space *mapping,
 				*slotp = slot;
 			return entry;
 		}
+
+		wq = dax_entry_waitqueue(mapping, index, entry, &ewait.key);
 		prepare_to_wait_exclusive(wq, &ewait.wait,
 					  TASK_UNINTERRUPTIBLE);
 		spin_unlock_irq(&mapping->tree_lock);
@@ -448,9 +460,12 @@ restart:
 }
 
 void dax_wake_mapping_entry_waiter(struct address_space *mapping,
-				   pgoff_t index, bool wake_all)
+		pgoff_t index, void *entry, bool wake_all)
 {
-	wait_queue_head_t *wq = dax_entry_waitqueue(mapping, index);
+	struct exceptional_entry_key key;
+	wait_queue_head_t *wq;
+
+	wq = dax_entry_waitqueue(mapping, index, entry, &key);
 
 	/*
 	 * Checking for locked entry and prepare_to_wait_exclusive() happens
@@ -458,13 +473,8 @@ void dax_wake_mapping_entry_waiter(struct address_space *mapping,
 	 * So at this point all tasks that could have seen our entry locked
 	 * must be in the waitqueue and the following check will see them.
 	 */
-	if (waitqueue_active(wq)) {
-		struct exceptional_entry_key key;
-
-		key.mapping = mapping;
-		key.index = index;
+	if (waitqueue_active(wq))
 		__wake_up(wq, TASK_NORMAL, wake_all ? 0 : 1, &key);
-	}
 }
 
 void dax_unlock_mapping_entry(struct address_space *mapping, pgoff_t index)
@@ -480,7 +490,7 @@ void dax_unlock_mapping_entry(struct address_space *mapping, pgoff_t index)
 	}
 	unlock_slot(mapping, slot);
 	spin_unlock_irq(&mapping->tree_lock);
-	dax_wake_mapping_entry_waiter(mapping, index, false);
+	dax_wake_mapping_entry_waiter(mapping, index, entry, false);
 }
 
 static void put_locked_mapping_entry(struct address_space *mapping,
@@ -505,7 +515,7 @@ static void put_unlocked_mapping_entry(struct address_space *mapping,
 		return;
 
 	/* We have to wake up next waiter for the radix tree entry lock */
-	dax_wake_mapping_entry_waiter(mapping, index, false);
+	dax_wake_mapping_entry_waiter(mapping, index, entry, false);
 }
 
 /*
@@ -532,7 +542,7 @@ int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index)
 	radix_tree_delete(&mapping->page_tree, index);
 	mapping->nrexceptional--;
 	spin_unlock_irq(&mapping->tree_lock);
-	dax_wake_mapping_entry_waiter(mapping, index, true);
+	dax_wake_mapping_entry_waiter(mapping, index, entry, true);
 
 	return 1;
 }
diff --git a/include/linux/dax.h b/include/linux/dax.h
index add6c4b..a41a747 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -22,7 +22,7 @@ int iomap_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t);
 int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index);
 void dax_wake_mapping_entry_waiter(struct address_space *mapping,
-				   pgoff_t index, bool wake_all);
+		pgoff_t index, void *entry, bool wake_all);
 
 #ifdef CONFIG_FS_DAX
 struct page *read_dax_sector(struct block_device *bdev, sector_t n);
diff --git a/mm/filemap.c b/mm/filemap.c
index 2f1175e..a596462 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -617,7 +617,7 @@ static int page_cache_tree_insert(struct address_space *mapping,
 			if (node)
 				workingset_node_pages_dec(node);
 			/* Wakeup waiters for exceptional entry lock */
-			dax_wake_mapping_entry_waiter(mapping, page->index,
+			dax_wake_mapping_entry_waiter(mapping, page->index, p,
 						      false);
 		}
 	}
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH v5 09/17] dax: coordinate locking for offsets in PMD range
@ 2016-10-07 21:08   ` Ross Zwisler
  0 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-07 21:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

DAX radix tree locking currently locks entries based on the unique
combination of the 'mapping' pointer and the pgoff_t 'index' for the entry.
This works for PTEs, but as we move to PMDs we will need to have all the
offsets within the range covered by the PMD to map to the same bit lock.
To accomplish this, for ranges covered by a PMD entry we will instead lock
based on the page offset of the beginning of the PMD entry.  The 'mapping'
pointer is still used in the same way.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/dax.c            | 58 +++++++++++++++++++++++++++++++----------------------
 include/linux/dax.h |  2 +-
 mm/filemap.c        |  2 +-
 3 files changed, 36 insertions(+), 26 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 152a6e1..608cee9 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -64,14 +64,6 @@ static int __init init_dax_wait_table(void)
 }
 fs_initcall(init_dax_wait_table);
 
-static wait_queue_head_t *dax_entry_waitqueue(struct address_space *mapping,
-					      pgoff_t index)
-{
-	unsigned long hash = hash_long((unsigned long)mapping ^ index,
-				       DAX_WAIT_TABLE_BITS);
-	return wait_table + hash;
-}
-
 static long dax_map_atomic(struct block_device *bdev, struct blk_dax_ctl *dax)
 {
 	struct request_queue *q = bdev->bd_queue;
@@ -285,7 +277,7 @@ EXPORT_SYMBOL_GPL(dax_do_io);
  */
 struct exceptional_entry_key {
 	struct address_space *mapping;
-	unsigned long index;
+	pgoff_t entry_start;
 };
 
 struct wait_exceptional_entry_queue {
@@ -293,6 +285,26 @@ struct wait_exceptional_entry_queue {
 	struct exceptional_entry_key key;
 };
 
+static wait_queue_head_t *dax_entry_waitqueue(struct address_space *mapping,
+		pgoff_t index, void *entry, struct exceptional_entry_key *key)
+{
+	unsigned long hash;
+
+	/*
+	 * If 'entry' is a PMD, align the 'index' that we use for the wait
+	 * queue to the start of that PMD.  This ensures that all offsets in
+	 * the range covered by the PMD map to the same bit lock.
+	 */
+	if (RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD)
+		index &= ~((1UL << (PMD_SHIFT - PAGE_SHIFT)) - 1);
+
+	key->mapping = mapping;
+	key->entry_start = index;
+
+	hash = hash_long((unsigned long)mapping ^ index, DAX_WAIT_TABLE_BITS);
+	return wait_table + hash;
+}
+
 static int wake_exceptional_entry_func(wait_queue_t *wait, unsigned int mode,
 				       int sync, void *keyp)
 {
@@ -301,7 +313,7 @@ static int wake_exceptional_entry_func(wait_queue_t *wait, unsigned int mode,
 		container_of(wait, struct wait_exceptional_entry_queue, wait);
 
 	if (key->mapping != ewait->key.mapping ||
-	    key->index != ewait->key.index)
+	    key->entry_start != ewait->key.entry_start)
 		return 0;
 	return autoremove_wake_function(wait, mode, sync, NULL);
 }
@@ -359,12 +371,10 @@ static void *get_unlocked_mapping_entry(struct address_space *mapping,
 {
 	void *entry, **slot;
 	struct wait_exceptional_entry_queue ewait;
-	wait_queue_head_t *wq = dax_entry_waitqueue(mapping, index);
+	wait_queue_head_t *wq;
 
 	init_wait(&ewait.wait);
 	ewait.wait.func = wake_exceptional_entry_func;
-	ewait.key.mapping = mapping;
-	ewait.key.index = index;
 
 	for (;;) {
 		entry = __radix_tree_lookup(&mapping->page_tree, index, NULL,
@@ -375,6 +385,8 @@ static void *get_unlocked_mapping_entry(struct address_space *mapping,
 				*slotp = slot;
 			return entry;
 		}
+
+		wq = dax_entry_waitqueue(mapping, index, entry, &ewait.key);
 		prepare_to_wait_exclusive(wq, &ewait.wait,
 					  TASK_UNINTERRUPTIBLE);
 		spin_unlock_irq(&mapping->tree_lock);
@@ -448,9 +460,12 @@ restart:
 }
 
 void dax_wake_mapping_entry_waiter(struct address_space *mapping,
-				   pgoff_t index, bool wake_all)
+		pgoff_t index, void *entry, bool wake_all)
 {
-	wait_queue_head_t *wq = dax_entry_waitqueue(mapping, index);
+	struct exceptional_entry_key key;
+	wait_queue_head_t *wq;
+
+	wq = dax_entry_waitqueue(mapping, index, entry, &key);
 
 	/*
 	 * Checking for locked entry and prepare_to_wait_exclusive() happens
@@ -458,13 +473,8 @@ void dax_wake_mapping_entry_waiter(struct address_space *mapping,
 	 * So at this point all tasks that could have seen our entry locked
 	 * must be in the waitqueue and the following check will see them.
 	 */
-	if (waitqueue_active(wq)) {
-		struct exceptional_entry_key key;
-
-		key.mapping = mapping;
-		key.index = index;
+	if (waitqueue_active(wq))
 		__wake_up(wq, TASK_NORMAL, wake_all ? 0 : 1, &key);
-	}
 }
 
 void dax_unlock_mapping_entry(struct address_space *mapping, pgoff_t index)
@@ -480,7 +490,7 @@ void dax_unlock_mapping_entry(struct address_space *mapping, pgoff_t index)
 	}
 	unlock_slot(mapping, slot);
 	spin_unlock_irq(&mapping->tree_lock);
-	dax_wake_mapping_entry_waiter(mapping, index, false);
+	dax_wake_mapping_entry_waiter(mapping, index, entry, false);
 }
 
 static void put_locked_mapping_entry(struct address_space *mapping,
@@ -505,7 +515,7 @@ static void put_unlocked_mapping_entry(struct address_space *mapping,
 		return;
 
 	/* We have to wake up next waiter for the radix tree entry lock */
-	dax_wake_mapping_entry_waiter(mapping, index, false);
+	dax_wake_mapping_entry_waiter(mapping, index, entry, false);
 }
 
 /*
@@ -532,7 +542,7 @@ int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index)
 	radix_tree_delete(&mapping->page_tree, index);
 	mapping->nrexceptional--;
 	spin_unlock_irq(&mapping->tree_lock);
-	dax_wake_mapping_entry_waiter(mapping, index, true);
+	dax_wake_mapping_entry_waiter(mapping, index, entry, true);
 
 	return 1;
 }
diff --git a/include/linux/dax.h b/include/linux/dax.h
index add6c4b..a41a747 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -22,7 +22,7 @@ int iomap_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t);
 int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index);
 void dax_wake_mapping_entry_waiter(struct address_space *mapping,
-				   pgoff_t index, bool wake_all);
+		pgoff_t index, void *entry, bool wake_all);
 
 #ifdef CONFIG_FS_DAX
 struct page *read_dax_sector(struct block_device *bdev, sector_t n);
diff --git a/mm/filemap.c b/mm/filemap.c
index 2f1175e..a596462 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -617,7 +617,7 @@ static int page_cache_tree_insert(struct address_space *mapping,
 			if (node)
 				workingset_node_pages_dec(node);
 			/* Wakeup waiters for exceptional entry lock */
-			dax_wake_mapping_entry_waiter(mapping, page->index,
+			dax_wake_mapping_entry_waiter(mapping, page->index, p,
 						      false);
 		}
 	}
-- 
2.7.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH v5 09/17] dax: coordinate locking for offsets in PMD range
@ 2016-10-07 21:08   ` Ross Zwisler
  0 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-07 21:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

DAX radix tree locking currently locks entries based on the unique
combination of the 'mapping' pointer and the pgoff_t 'index' for the entry.
This works for PTEs, but as we move to PMDs we will need to have all the
offsets within the range covered by the PMD to map to the same bit lock.
To accomplish this, for ranges covered by a PMD entry we will instead lock
based on the page offset of the beginning of the PMD entry.  The 'mapping'
pointer is still used in the same way.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/dax.c            | 58 +++++++++++++++++++++++++++++++----------------------
 include/linux/dax.h |  2 +-
 mm/filemap.c        |  2 +-
 3 files changed, 36 insertions(+), 26 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 152a6e1..608cee9 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -64,14 +64,6 @@ static int __init init_dax_wait_table(void)
 }
 fs_initcall(init_dax_wait_table);
 
-static wait_queue_head_t *dax_entry_waitqueue(struct address_space *mapping,
-					      pgoff_t index)
-{
-	unsigned long hash = hash_long((unsigned long)mapping ^ index,
-				       DAX_WAIT_TABLE_BITS);
-	return wait_table + hash;
-}
-
 static long dax_map_atomic(struct block_device *bdev, struct blk_dax_ctl *dax)
 {
 	struct request_queue *q = bdev->bd_queue;
@@ -285,7 +277,7 @@ EXPORT_SYMBOL_GPL(dax_do_io);
  */
 struct exceptional_entry_key {
 	struct address_space *mapping;
-	unsigned long index;
+	pgoff_t entry_start;
 };
 
 struct wait_exceptional_entry_queue {
@@ -293,6 +285,26 @@ struct wait_exceptional_entry_queue {
 	struct exceptional_entry_key key;
 };
 
+static wait_queue_head_t *dax_entry_waitqueue(struct address_space *mapping,
+		pgoff_t index, void *entry, struct exceptional_entry_key *key)
+{
+	unsigned long hash;
+
+	/*
+	 * If 'entry' is a PMD, align the 'index' that we use for the wait
+	 * queue to the start of that PMD.  This ensures that all offsets in
+	 * the range covered by the PMD map to the same bit lock.
+	 */
+	if (RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD)
+		index &= ~((1UL << (PMD_SHIFT - PAGE_SHIFT)) - 1);
+
+	key->mapping = mapping;
+	key->entry_start = index;
+
+	hash = hash_long((unsigned long)mapping ^ index, DAX_WAIT_TABLE_BITS);
+	return wait_table + hash;
+}
+
 static int wake_exceptional_entry_func(wait_queue_t *wait, unsigned int mode,
 				       int sync, void *keyp)
 {
@@ -301,7 +313,7 @@ static int wake_exceptional_entry_func(wait_queue_t *wait, unsigned int mode,
 		container_of(wait, struct wait_exceptional_entry_queue, wait);
 
 	if (key->mapping != ewait->key.mapping ||
-	    key->index != ewait->key.index)
+	    key->entry_start != ewait->key.entry_start)
 		return 0;
 	return autoremove_wake_function(wait, mode, sync, NULL);
 }
@@ -359,12 +371,10 @@ static void *get_unlocked_mapping_entry(struct address_space *mapping,
 {
 	void *entry, **slot;
 	struct wait_exceptional_entry_queue ewait;
-	wait_queue_head_t *wq = dax_entry_waitqueue(mapping, index);
+	wait_queue_head_t *wq;
 
 	init_wait(&ewait.wait);
 	ewait.wait.func = wake_exceptional_entry_func;
-	ewait.key.mapping = mapping;
-	ewait.key.index = index;
 
 	for (;;) {
 		entry = __radix_tree_lookup(&mapping->page_tree, index, NULL,
@@ -375,6 +385,8 @@ static void *get_unlocked_mapping_entry(struct address_space *mapping,
 				*slotp = slot;
 			return entry;
 		}
+
+		wq = dax_entry_waitqueue(mapping, index, entry, &ewait.key);
 		prepare_to_wait_exclusive(wq, &ewait.wait,
 					  TASK_UNINTERRUPTIBLE);
 		spin_unlock_irq(&mapping->tree_lock);
@@ -448,9 +460,12 @@ restart:
 }
 
 void dax_wake_mapping_entry_waiter(struct address_space *mapping,
-				   pgoff_t index, bool wake_all)
+		pgoff_t index, void *entry, bool wake_all)
 {
-	wait_queue_head_t *wq = dax_entry_waitqueue(mapping, index);
+	struct exceptional_entry_key key;
+	wait_queue_head_t *wq;
+
+	wq = dax_entry_waitqueue(mapping, index, entry, &key);
 
 	/*
 	 * Checking for locked entry and prepare_to_wait_exclusive() happens
@@ -458,13 +473,8 @@ void dax_wake_mapping_entry_waiter(struct address_space *mapping,
 	 * So at this point all tasks that could have seen our entry locked
 	 * must be in the waitqueue and the following check will see them.
 	 */
-	if (waitqueue_active(wq)) {
-		struct exceptional_entry_key key;
-
-		key.mapping = mapping;
-		key.index = index;
+	if (waitqueue_active(wq))
 		__wake_up(wq, TASK_NORMAL, wake_all ? 0 : 1, &key);
-	}
 }
 
 void dax_unlock_mapping_entry(struct address_space *mapping, pgoff_t index)
@@ -480,7 +490,7 @@ void dax_unlock_mapping_entry(struct address_space *mapping, pgoff_t index)
 	}
 	unlock_slot(mapping, slot);
 	spin_unlock_irq(&mapping->tree_lock);
-	dax_wake_mapping_entry_waiter(mapping, index, false);
+	dax_wake_mapping_entry_waiter(mapping, index, entry, false);
 }
 
 static void put_locked_mapping_entry(struct address_space *mapping,
@@ -505,7 +515,7 @@ static void put_unlocked_mapping_entry(struct address_space *mapping,
 		return;
 
 	/* We have to wake up next waiter for the radix tree entry lock */
-	dax_wake_mapping_entry_waiter(mapping, index, false);
+	dax_wake_mapping_entry_waiter(mapping, index, entry, false);
 }
 
 /*
@@ -532,7 +542,7 @@ int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index)
 	radix_tree_delete(&mapping->page_tree, index);
 	mapping->nrexceptional--;
 	spin_unlock_irq(&mapping->tree_lock);
-	dax_wake_mapping_entry_waiter(mapping, index, true);
+	dax_wake_mapping_entry_waiter(mapping, index, entry, true);
 
 	return 1;
 }
diff --git a/include/linux/dax.h b/include/linux/dax.h
index add6c4b..a41a747 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -22,7 +22,7 @@ int iomap_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t);
 int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index);
 void dax_wake_mapping_entry_waiter(struct address_space *mapping,
-				   pgoff_t index, bool wake_all);
+		pgoff_t index, void *entry, bool wake_all);
 
 #ifdef CONFIG_FS_DAX
 struct page *read_dax_sector(struct block_device *bdev, sector_t n);
diff --git a/mm/filemap.c b/mm/filemap.c
index 2f1175e..a596462 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -617,7 +617,7 @@ static int page_cache_tree_insert(struct address_space *mapping,
 			if (node)
 				workingset_node_pages_dec(node);
 			/* Wakeup waiters for exceptional entry lock */
-			dax_wake_mapping_entry_waiter(mapping, page->index,
+			dax_wake_mapping_entry_waiter(mapping, page->index, p,
 						      false);
 		}
 	}
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH v5 10/17] dax: remove dax_pmd_fault()
  2016-10-07 21:08 ` Ross Zwisler
                     ` (2 preceding siblings ...)
  (?)
@ 2016-10-07 21:08   ` Ross Zwisler
  -1 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-07 21:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: Theodore Ts'o, Matthew Wilcox, Dave Chinner, linux-nvdimm,
	Christoph Hellwig, linux-xfs, linux-mm, Andreas Dilger,
	Alexander Viro, Jan Kara, linux-fsdevel, linux-ext4,
	Andrew Morton

dax_pmd_fault() is the old struct buffer_head + get_block_t based 2 MiB DAX
fault handler.  This fault handler has been disabled for several kernel
releases, and support for PMDs will be reintroduced using the struct iomap
interface instead.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/dax.c            | 213 ----------------------------------------------------
 include/linux/dax.h |   6 +-
 2 files changed, 1 insertion(+), 218 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 608cee9..1b9154a 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -908,219 +908,6 @@ int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 }
 EXPORT_SYMBOL_GPL(dax_fault);
 
-#if defined(CONFIG_TRANSPARENT_HUGEPAGE)
-/*
- * The 'colour' (ie low bits) within a PMD of a page offset.  This comes up
- * more often than one might expect in the below function.
- */
-#define PG_PMD_COLOUR	((PMD_SIZE >> PAGE_SHIFT) - 1)
-
-static void __dax_dbg(struct buffer_head *bh, unsigned long address,
-		const char *reason, const char *fn)
-{
-	if (bh) {
-		char bname[BDEVNAME_SIZE];
-		bdevname(bh->b_bdev, bname);
-		pr_debug("%s: %s addr: %lx dev %s state %lx start %lld "
-			"length %zd fallback: %s\n", fn, current->comm,
-			address, bname, bh->b_state, (u64)bh->b_blocknr,
-			bh->b_size, reason);
-	} else {
-		pr_debug("%s: %s addr: %lx fallback: %s\n", fn,
-			current->comm, address, reason);
-	}
-}
-
-#define dax_pmd_dbg(bh, address, reason)	__dax_dbg(bh, address, reason, "dax_pmd")
-
-/**
- * dax_pmd_fault - handle a PMD fault on a DAX file
- * @vma: The virtual memory area where the fault occurred
- * @vmf: The description of the fault
- * @get_block: The filesystem method used to translate file offsets to blocks
- *
- * When a page fault occurs, filesystems may call this helper in their
- * pmd_fault handler for DAX files.
- */
-int dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
-		pmd_t *pmd, unsigned int flags, get_block_t get_block)
-{
-	struct file *file = vma->vm_file;
-	struct address_space *mapping = file->f_mapping;
-	struct inode *inode = mapping->host;
-	struct buffer_head bh;
-	unsigned blkbits = inode->i_blkbits;
-	unsigned long pmd_addr = address & PMD_MASK;
-	bool write = flags & FAULT_FLAG_WRITE;
-	struct block_device *bdev;
-	pgoff_t size, pgoff;
-	sector_t block;
-	int result = 0;
-	bool alloc = false;
-
-	/* dax pmd mappings require pfn_t_devmap() */
-	if (!IS_ENABLED(CONFIG_FS_DAX_PMD))
-		return VM_FAULT_FALLBACK;
-
-	/* Fall back to PTEs if we're going to COW */
-	if (write && !(vma->vm_flags & VM_SHARED)) {
-		split_huge_pmd(vma, pmd, address);
-		dax_pmd_dbg(NULL, address, "cow write");
-		return VM_FAULT_FALLBACK;
-	}
-	/* If the PMD would extend outside the VMA */
-	if (pmd_addr < vma->vm_start) {
-		dax_pmd_dbg(NULL, address, "vma start unaligned");
-		return VM_FAULT_FALLBACK;
-	}
-	if ((pmd_addr + PMD_SIZE) > vma->vm_end) {
-		dax_pmd_dbg(NULL, address, "vma end unaligned");
-		return VM_FAULT_FALLBACK;
-	}
-
-	pgoff = linear_page_index(vma, pmd_addr);
-	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
-	if (pgoff >= size)
-		return VM_FAULT_SIGBUS;
-	/* If the PMD would cover blocks out of the file */
-	if ((pgoff | PG_PMD_COLOUR) >= size) {
-		dax_pmd_dbg(NULL, address,
-				"offset + huge page size > file size");
-		return VM_FAULT_FALLBACK;
-	}
-
-	memset(&bh, 0, sizeof(bh));
-	bh.b_bdev = inode->i_sb->s_bdev;
-	block = (sector_t)pgoff << (PAGE_SHIFT - blkbits);
-
-	bh.b_size = PMD_SIZE;
-
-	if (get_block(inode, block, &bh, 0) != 0)
-		return VM_FAULT_SIGBUS;
-
-	if (!buffer_mapped(&bh) && write) {
-		if (get_block(inode, block, &bh, 1) != 0)
-			return VM_FAULT_SIGBUS;
-		alloc = true;
-		WARN_ON_ONCE(buffer_unwritten(&bh) || buffer_new(&bh));
-	}
-
-	bdev = bh.b_bdev;
-
-	if (bh.b_size < PMD_SIZE) {
-		dax_pmd_dbg(&bh, address, "allocated block too small");
-		return VM_FAULT_FALLBACK;
-	}
-
-	/*
-	 * If we allocated new storage, make sure no process has any
-	 * zero pages covering this hole
-	 */
-	if (alloc) {
-		loff_t lstart = pgoff << PAGE_SHIFT;
-		loff_t lend = lstart + PMD_SIZE - 1; /* inclusive */
-
-		truncate_pagecache_range(inode, lstart, lend);
-	}
-
-	if (!write && !buffer_mapped(&bh)) {
-		spinlock_t *ptl;
-		pmd_t entry;
-		struct page *zero_page = get_huge_zero_page();
-
-		if (unlikely(!zero_page)) {
-			dax_pmd_dbg(&bh, address, "no zero page");
-			goto fallback;
-		}
-
-		ptl = pmd_lock(vma->vm_mm, pmd);
-		if (!pmd_none(*pmd)) {
-			spin_unlock(ptl);
-			dax_pmd_dbg(&bh, address, "pmd already present");
-			goto fallback;
-		}
-
-		dev_dbg(part_to_dev(bdev->bd_part),
-				"%s: %s addr: %lx pfn: <zero> sect: %llx\n",
-				__func__, current->comm, address,
-				(unsigned long long) to_sector(&bh, inode));
-
-		entry = mk_pmd(zero_page, vma->vm_page_prot);
-		entry = pmd_mkhuge(entry);
-		set_pmd_at(vma->vm_mm, pmd_addr, pmd, entry);
-		result = VM_FAULT_NOPAGE;
-		spin_unlock(ptl);
-	} else {
-		struct blk_dax_ctl dax = {
-			.sector = to_sector(&bh, inode),
-			.size = PMD_SIZE,
-		};
-		long length = dax_map_atomic(bdev, &dax);
-
-		if (length < 0) {
-			dax_pmd_dbg(&bh, address, "dax-error fallback");
-			goto fallback;
-		}
-		if (length < PMD_SIZE) {
-			dax_pmd_dbg(&bh, address, "dax-length too small");
-			dax_unmap_atomic(bdev, &dax);
-			goto fallback;
-		}
-		if (pfn_t_to_pfn(dax.pfn) & PG_PMD_COLOUR) {
-			dax_pmd_dbg(&bh, address, "pfn unaligned");
-			dax_unmap_atomic(bdev, &dax);
-			goto fallback;
-		}
-
-		if (!pfn_t_devmap(dax.pfn)) {
-			dax_unmap_atomic(bdev, &dax);
-			dax_pmd_dbg(&bh, address, "pfn not in memmap");
-			goto fallback;
-		}
-		dax_unmap_atomic(bdev, &dax);
-
-		/*
-		 * For PTE faults we insert a radix tree entry for reads, and
-		 * leave it clean.  Then on the first write we dirty the radix
-		 * tree entry via the dax_pfn_mkwrite() path.  This sequence
-		 * allows the dax_pfn_mkwrite() call to be simpler and avoid a
-		 * call into get_block() to translate the pgoff to a sector in
-		 * order to be able to create a new radix tree entry.
-		 *
-		 * The PMD path doesn't have an equivalent to
-		 * dax_pfn_mkwrite(), though, so for a read followed by a
-		 * write we traverse all the way through dax_pmd_fault()
-		 * twice.  This means we can just skip inserting a radix tree
-		 * entry completely on the initial read and just wait until
-		 * the write to insert a dirty entry.
-		 */
-		if (write) {
-			/*
-			 * We should insert radix-tree entry and dirty it here.
-			 * For now this is broken...
-			 */
-		}
-
-		dev_dbg(part_to_dev(bdev->bd_part),
-				"%s: %s addr: %lx pfn: %lx sect: %llx\n",
-				__func__, current->comm, address,
-				pfn_t_to_pfn(dax.pfn),
-				(unsigned long long) dax.sector);
-		result |= vmf_insert_pfn_pmd(vma, address, pmd,
-				dax.pfn, write);
-	}
-
- out:
-	return result;
-
- fallback:
-	count_vm_event(THP_FAULT_FALLBACK);
-	result = VM_FAULT_FALLBACK;
-	goto out;
-}
-EXPORT_SYMBOL_GPL(dax_pmd_fault);
-#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
-
 /**
  * dax_pfn_mkwrite - handle first write to DAX page
  * @vma: The virtual memory area where the fault occurred
diff --git a/include/linux/dax.h b/include/linux/dax.h
index a41a747..0f74866 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -48,16 +48,12 @@ static inline int __dax_zero_page_range(struct block_device *bdev,
 }
 #endif
 
-#if defined(CONFIG_TRANSPARENT_HUGEPAGE)
-int dax_pmd_fault(struct vm_area_struct *, unsigned long addr, pmd_t *,
-				unsigned int flags, get_block_t);
-#else
 static inline int dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
 				pmd_t *pmd, unsigned int flags, get_block_t gb)
 {
 	return VM_FAULT_FALLBACK;
 }
-#endif
+
 int dax_pfn_mkwrite(struct vm_area_struct *, struct vm_fault *);
 #define dax_mkwrite(vma, vmf, gb)	dax_fault(vma, vmf, gb)
 
-- 
2.7.4

_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH v5 10/17] dax: remove dax_pmd_fault()
@ 2016-10-07 21:08   ` Ross Zwisler
  0 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-07 21:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

dax_pmd_fault() is the old struct buffer_head + get_block_t based 2 MiB DAX
fault handler.  This fault handler has been disabled for several kernel
releases, and support for PMDs will be reintroduced using the struct iomap
interface instead.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/dax.c            | 213 ----------------------------------------------------
 include/linux/dax.h |   6 +-
 2 files changed, 1 insertion(+), 218 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 608cee9..1b9154a 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -908,219 +908,6 @@ int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 }
 EXPORT_SYMBOL_GPL(dax_fault);
 
-#if defined(CONFIG_TRANSPARENT_HUGEPAGE)
-/*
- * The 'colour' (ie low bits) within a PMD of a page offset.  This comes up
- * more often than one might expect in the below function.
- */
-#define PG_PMD_COLOUR	((PMD_SIZE >> PAGE_SHIFT) - 1)
-
-static void __dax_dbg(struct buffer_head *bh, unsigned long address,
-		const char *reason, const char *fn)
-{
-	if (bh) {
-		char bname[BDEVNAME_SIZE];
-		bdevname(bh->b_bdev, bname);
-		pr_debug("%s: %s addr: %lx dev %s state %lx start %lld "
-			"length %zd fallback: %s\n", fn, current->comm,
-			address, bname, bh->b_state, (u64)bh->b_blocknr,
-			bh->b_size, reason);
-	} else {
-		pr_debug("%s: %s addr: %lx fallback: %s\n", fn,
-			current->comm, address, reason);
-	}
-}
-
-#define dax_pmd_dbg(bh, address, reason)	__dax_dbg(bh, address, reason, "dax_pmd")
-
-/**
- * dax_pmd_fault - handle a PMD fault on a DAX file
- * @vma: The virtual memory area where the fault occurred
- * @vmf: The description of the fault
- * @get_block: The filesystem method used to translate file offsets to blocks
- *
- * When a page fault occurs, filesystems may call this helper in their
- * pmd_fault handler for DAX files.
- */
-int dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
-		pmd_t *pmd, unsigned int flags, get_block_t get_block)
-{
-	struct file *file = vma->vm_file;
-	struct address_space *mapping = file->f_mapping;
-	struct inode *inode = mapping->host;
-	struct buffer_head bh;
-	unsigned blkbits = inode->i_blkbits;
-	unsigned long pmd_addr = address & PMD_MASK;
-	bool write = flags & FAULT_FLAG_WRITE;
-	struct block_device *bdev;
-	pgoff_t size, pgoff;
-	sector_t block;
-	int result = 0;
-	bool alloc = false;
-
-	/* dax pmd mappings require pfn_t_devmap() */
-	if (!IS_ENABLED(CONFIG_FS_DAX_PMD))
-		return VM_FAULT_FALLBACK;
-
-	/* Fall back to PTEs if we're going to COW */
-	if (write && !(vma->vm_flags & VM_SHARED)) {
-		split_huge_pmd(vma, pmd, address);
-		dax_pmd_dbg(NULL, address, "cow write");
-		return VM_FAULT_FALLBACK;
-	}
-	/* If the PMD would extend outside the VMA */
-	if (pmd_addr < vma->vm_start) {
-		dax_pmd_dbg(NULL, address, "vma start unaligned");
-		return VM_FAULT_FALLBACK;
-	}
-	if ((pmd_addr + PMD_SIZE) > vma->vm_end) {
-		dax_pmd_dbg(NULL, address, "vma end unaligned");
-		return VM_FAULT_FALLBACK;
-	}
-
-	pgoff = linear_page_index(vma, pmd_addr);
-	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
-	if (pgoff >= size)
-		return VM_FAULT_SIGBUS;
-	/* If the PMD would cover blocks out of the file */
-	if ((pgoff | PG_PMD_COLOUR) >= size) {
-		dax_pmd_dbg(NULL, address,
-				"offset + huge page size > file size");
-		return VM_FAULT_FALLBACK;
-	}
-
-	memset(&bh, 0, sizeof(bh));
-	bh.b_bdev = inode->i_sb->s_bdev;
-	block = (sector_t)pgoff << (PAGE_SHIFT - blkbits);
-
-	bh.b_size = PMD_SIZE;
-
-	if (get_block(inode, block, &bh, 0) != 0)
-		return VM_FAULT_SIGBUS;
-
-	if (!buffer_mapped(&bh) && write) {
-		if (get_block(inode, block, &bh, 1) != 0)
-			return VM_FAULT_SIGBUS;
-		alloc = true;
-		WARN_ON_ONCE(buffer_unwritten(&bh) || buffer_new(&bh));
-	}
-
-	bdev = bh.b_bdev;
-
-	if (bh.b_size < PMD_SIZE) {
-		dax_pmd_dbg(&bh, address, "allocated block too small");
-		return VM_FAULT_FALLBACK;
-	}
-
-	/*
-	 * If we allocated new storage, make sure no process has any
-	 * zero pages covering this hole
-	 */
-	if (alloc) {
-		loff_t lstart = pgoff << PAGE_SHIFT;
-		loff_t lend = lstart + PMD_SIZE - 1; /* inclusive */
-
-		truncate_pagecache_range(inode, lstart, lend);
-	}
-
-	if (!write && !buffer_mapped(&bh)) {
-		spinlock_t *ptl;
-		pmd_t entry;
-		struct page *zero_page = get_huge_zero_page();
-
-		if (unlikely(!zero_page)) {
-			dax_pmd_dbg(&bh, address, "no zero page");
-			goto fallback;
-		}
-
-		ptl = pmd_lock(vma->vm_mm, pmd);
-		if (!pmd_none(*pmd)) {
-			spin_unlock(ptl);
-			dax_pmd_dbg(&bh, address, "pmd already present");
-			goto fallback;
-		}
-
-		dev_dbg(part_to_dev(bdev->bd_part),
-				"%s: %s addr: %lx pfn: <zero> sect: %llx\n",
-				__func__, current->comm, address,
-				(unsigned long long) to_sector(&bh, inode));
-
-		entry = mk_pmd(zero_page, vma->vm_page_prot);
-		entry = pmd_mkhuge(entry);
-		set_pmd_at(vma->vm_mm, pmd_addr, pmd, entry);
-		result = VM_FAULT_NOPAGE;
-		spin_unlock(ptl);
-	} else {
-		struct blk_dax_ctl dax = {
-			.sector = to_sector(&bh, inode),
-			.size = PMD_SIZE,
-		};
-		long length = dax_map_atomic(bdev, &dax);
-
-		if (length < 0) {
-			dax_pmd_dbg(&bh, address, "dax-error fallback");
-			goto fallback;
-		}
-		if (length < PMD_SIZE) {
-			dax_pmd_dbg(&bh, address, "dax-length too small");
-			dax_unmap_atomic(bdev, &dax);
-			goto fallback;
-		}
-		if (pfn_t_to_pfn(dax.pfn) & PG_PMD_COLOUR) {
-			dax_pmd_dbg(&bh, address, "pfn unaligned");
-			dax_unmap_atomic(bdev, &dax);
-			goto fallback;
-		}
-
-		if (!pfn_t_devmap(dax.pfn)) {
-			dax_unmap_atomic(bdev, &dax);
-			dax_pmd_dbg(&bh, address, "pfn not in memmap");
-			goto fallback;
-		}
-		dax_unmap_atomic(bdev, &dax);
-
-		/*
-		 * For PTE faults we insert a radix tree entry for reads, and
-		 * leave it clean.  Then on the first write we dirty the radix
-		 * tree entry via the dax_pfn_mkwrite() path.  This sequence
-		 * allows the dax_pfn_mkwrite() call to be simpler and avoid a
-		 * call into get_block() to translate the pgoff to a sector in
-		 * order to be able to create a new radix tree entry.
-		 *
-		 * The PMD path doesn't have an equivalent to
-		 * dax_pfn_mkwrite(), though, so for a read followed by a
-		 * write we traverse all the way through dax_pmd_fault()
-		 * twice.  This means we can just skip inserting a radix tree
-		 * entry completely on the initial read and just wait until
-		 * the write to insert a dirty entry.
-		 */
-		if (write) {
-			/*
-			 * We should insert radix-tree entry and dirty it here.
-			 * For now this is broken...
-			 */
-		}
-
-		dev_dbg(part_to_dev(bdev->bd_part),
-				"%s: %s addr: %lx pfn: %lx sect: %llx\n",
-				__func__, current->comm, address,
-				pfn_t_to_pfn(dax.pfn),
-				(unsigned long long) dax.sector);
-		result |= vmf_insert_pfn_pmd(vma, address, pmd,
-				dax.pfn, write);
-	}
-
- out:
-	return result;
-
- fallback:
-	count_vm_event(THP_FAULT_FALLBACK);
-	result = VM_FAULT_FALLBACK;
-	goto out;
-}
-EXPORT_SYMBOL_GPL(dax_pmd_fault);
-#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
-
 /**
  * dax_pfn_mkwrite - handle first write to DAX page
  * @vma: The virtual memory area where the fault occurred
diff --git a/include/linux/dax.h b/include/linux/dax.h
index a41a747..0f74866 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -48,16 +48,12 @@ static inline int __dax_zero_page_range(struct block_device *bdev,
 }
 #endif
 
-#if defined(CONFIG_TRANSPARENT_HUGEPAGE)
-int dax_pmd_fault(struct vm_area_struct *, unsigned long addr, pmd_t *,
-				unsigned int flags, get_block_t);
-#else
 static inline int dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
 				pmd_t *pmd, unsigned int flags, get_block_t gb)
 {
 	return VM_FAULT_FALLBACK;
 }
-#endif
+
 int dax_pfn_mkwrite(struct vm_area_struct *, struct vm_fault *);
 #define dax_mkwrite(vma, vmf, gb)	dax_fault(vma, vmf, gb)
 
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH v5 10/17] dax: remove dax_pmd_fault()
@ 2016-10-07 21:08   ` Ross Zwisler
  0 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-07 21:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

dax_pmd_fault() is the old struct buffer_head + get_block_t based 2 MiB DAX
fault handler.  This fault handler has been disabled for several kernel
releases, and support for PMDs will be reintroduced using the struct iomap
interface instead.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/dax.c            | 213 ----------------------------------------------------
 include/linux/dax.h |   6 +-
 2 files changed, 1 insertion(+), 218 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 608cee9..1b9154a 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -908,219 +908,6 @@ int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 }
 EXPORT_SYMBOL_GPL(dax_fault);
 
-#if defined(CONFIG_TRANSPARENT_HUGEPAGE)
-/*
- * The 'colour' (ie low bits) within a PMD of a page offset.  This comes up
- * more often than one might expect in the below function.
- */
-#define PG_PMD_COLOUR	((PMD_SIZE >> PAGE_SHIFT) - 1)
-
-static void __dax_dbg(struct buffer_head *bh, unsigned long address,
-		const char *reason, const char *fn)
-{
-	if (bh) {
-		char bname[BDEVNAME_SIZE];
-		bdevname(bh->b_bdev, bname);
-		pr_debug("%s: %s addr: %lx dev %s state %lx start %lld "
-			"length %zd fallback: %s\n", fn, current->comm,
-			address, bname, bh->b_state, (u64)bh->b_blocknr,
-			bh->b_size, reason);
-	} else {
-		pr_debug("%s: %s addr: %lx fallback: %s\n", fn,
-			current->comm, address, reason);
-	}
-}
-
-#define dax_pmd_dbg(bh, address, reason)	__dax_dbg(bh, address, reason, "dax_pmd")
-
-/**
- * dax_pmd_fault - handle a PMD fault on a DAX file
- * @vma: The virtual memory area where the fault occurred
- * @vmf: The description of the fault
- * @get_block: The filesystem method used to translate file offsets to blocks
- *
- * When a page fault occurs, filesystems may call this helper in their
- * pmd_fault handler for DAX files.
- */
-int dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
-		pmd_t *pmd, unsigned int flags, get_block_t get_block)
-{
-	struct file *file = vma->vm_file;
-	struct address_space *mapping = file->f_mapping;
-	struct inode *inode = mapping->host;
-	struct buffer_head bh;
-	unsigned blkbits = inode->i_blkbits;
-	unsigned long pmd_addr = address & PMD_MASK;
-	bool write = flags & FAULT_FLAG_WRITE;
-	struct block_device *bdev;
-	pgoff_t size, pgoff;
-	sector_t block;
-	int result = 0;
-	bool alloc = false;
-
-	/* dax pmd mappings require pfn_t_devmap() */
-	if (!IS_ENABLED(CONFIG_FS_DAX_PMD))
-		return VM_FAULT_FALLBACK;
-
-	/* Fall back to PTEs if we're going to COW */
-	if (write && !(vma->vm_flags & VM_SHARED)) {
-		split_huge_pmd(vma, pmd, address);
-		dax_pmd_dbg(NULL, address, "cow write");
-		return VM_FAULT_FALLBACK;
-	}
-	/* If the PMD would extend outside the VMA */
-	if (pmd_addr < vma->vm_start) {
-		dax_pmd_dbg(NULL, address, "vma start unaligned");
-		return VM_FAULT_FALLBACK;
-	}
-	if ((pmd_addr + PMD_SIZE) > vma->vm_end) {
-		dax_pmd_dbg(NULL, address, "vma end unaligned");
-		return VM_FAULT_FALLBACK;
-	}
-
-	pgoff = linear_page_index(vma, pmd_addr);
-	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
-	if (pgoff >= size)
-		return VM_FAULT_SIGBUS;
-	/* If the PMD would cover blocks out of the file */
-	if ((pgoff | PG_PMD_COLOUR) >= size) {
-		dax_pmd_dbg(NULL, address,
-				"offset + huge page size > file size");
-		return VM_FAULT_FALLBACK;
-	}
-
-	memset(&bh, 0, sizeof(bh));
-	bh.b_bdev = inode->i_sb->s_bdev;
-	block = (sector_t)pgoff << (PAGE_SHIFT - blkbits);
-
-	bh.b_size = PMD_SIZE;
-
-	if (get_block(inode, block, &bh, 0) != 0)
-		return VM_FAULT_SIGBUS;
-
-	if (!buffer_mapped(&bh) && write) {
-		if (get_block(inode, block, &bh, 1) != 0)
-			return VM_FAULT_SIGBUS;
-		alloc = true;
-		WARN_ON_ONCE(buffer_unwritten(&bh) || buffer_new(&bh));
-	}
-
-	bdev = bh.b_bdev;
-
-	if (bh.b_size < PMD_SIZE) {
-		dax_pmd_dbg(&bh, address, "allocated block too small");
-		return VM_FAULT_FALLBACK;
-	}
-
-	/*
-	 * If we allocated new storage, make sure no process has any
-	 * zero pages covering this hole
-	 */
-	if (alloc) {
-		loff_t lstart = pgoff << PAGE_SHIFT;
-		loff_t lend = lstart + PMD_SIZE - 1; /* inclusive */
-
-		truncate_pagecache_range(inode, lstart, lend);
-	}
-
-	if (!write && !buffer_mapped(&bh)) {
-		spinlock_t *ptl;
-		pmd_t entry;
-		struct page *zero_page = get_huge_zero_page();
-
-		if (unlikely(!zero_page)) {
-			dax_pmd_dbg(&bh, address, "no zero page");
-			goto fallback;
-		}
-
-		ptl = pmd_lock(vma->vm_mm, pmd);
-		if (!pmd_none(*pmd)) {
-			spin_unlock(ptl);
-			dax_pmd_dbg(&bh, address, "pmd already present");
-			goto fallback;
-		}
-
-		dev_dbg(part_to_dev(bdev->bd_part),
-				"%s: %s addr: %lx pfn: <zero> sect: %llx\n",
-				__func__, current->comm, address,
-				(unsigned long long) to_sector(&bh, inode));
-
-		entry = mk_pmd(zero_page, vma->vm_page_prot);
-		entry = pmd_mkhuge(entry);
-		set_pmd_at(vma->vm_mm, pmd_addr, pmd, entry);
-		result = VM_FAULT_NOPAGE;
-		spin_unlock(ptl);
-	} else {
-		struct blk_dax_ctl dax = {
-			.sector = to_sector(&bh, inode),
-			.size = PMD_SIZE,
-		};
-		long length = dax_map_atomic(bdev, &dax);
-
-		if (length < 0) {
-			dax_pmd_dbg(&bh, address, "dax-error fallback");
-			goto fallback;
-		}
-		if (length < PMD_SIZE) {
-			dax_pmd_dbg(&bh, address, "dax-length too small");
-			dax_unmap_atomic(bdev, &dax);
-			goto fallback;
-		}
-		if (pfn_t_to_pfn(dax.pfn) & PG_PMD_COLOUR) {
-			dax_pmd_dbg(&bh, address, "pfn unaligned");
-			dax_unmap_atomic(bdev, &dax);
-			goto fallback;
-		}
-
-		if (!pfn_t_devmap(dax.pfn)) {
-			dax_unmap_atomic(bdev, &dax);
-			dax_pmd_dbg(&bh, address, "pfn not in memmap");
-			goto fallback;
-		}
-		dax_unmap_atomic(bdev, &dax);
-
-		/*
-		 * For PTE faults we insert a radix tree entry for reads, and
-		 * leave it clean.  Then on the first write we dirty the radix
-		 * tree entry via the dax_pfn_mkwrite() path.  This sequence
-		 * allows the dax_pfn_mkwrite() call to be simpler and avoid a
-		 * call into get_block() to translate the pgoff to a sector in
-		 * order to be able to create a new radix tree entry.
-		 *
-		 * The PMD path doesn't have an equivalent to
-		 * dax_pfn_mkwrite(), though, so for a read followed by a
-		 * write we traverse all the way through dax_pmd_fault()
-		 * twice.  This means we can just skip inserting a radix tree
-		 * entry completely on the initial read and just wait until
-		 * the write to insert a dirty entry.
-		 */
-		if (write) {
-			/*
-			 * We should insert radix-tree entry and dirty it here.
-			 * For now this is broken...
-			 */
-		}
-
-		dev_dbg(part_to_dev(bdev->bd_part),
-				"%s: %s addr: %lx pfn: %lx sect: %llx\n",
-				__func__, current->comm, address,
-				pfn_t_to_pfn(dax.pfn),
-				(unsigned long long) dax.sector);
-		result |= vmf_insert_pfn_pmd(vma, address, pmd,
-				dax.pfn, write);
-	}
-
- out:
-	return result;
-
- fallback:
-	count_vm_event(THP_FAULT_FALLBACK);
-	result = VM_FAULT_FALLBACK;
-	goto out;
-}
-EXPORT_SYMBOL_GPL(dax_pmd_fault);
-#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
-
 /**
  * dax_pfn_mkwrite - handle first write to DAX page
  * @vma: The virtual memory area where the fault occurred
diff --git a/include/linux/dax.h b/include/linux/dax.h
index a41a747..0f74866 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -48,16 +48,12 @@ static inline int __dax_zero_page_range(struct block_device *bdev,
 }
 #endif
 
-#if defined(CONFIG_TRANSPARENT_HUGEPAGE)
-int dax_pmd_fault(struct vm_area_struct *, unsigned long addr, pmd_t *,
-				unsigned int flags, get_block_t);
-#else
 static inline int dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
 				pmd_t *pmd, unsigned int flags, get_block_t gb)
 {
 	return VM_FAULT_FALLBACK;
 }
-#endif
+
 int dax_pfn_mkwrite(struct vm_area_struct *, struct vm_fault *);
 #define dax_mkwrite(vma, vmf, gb)	dax_fault(vma, vmf, gb)
 
-- 
2.7.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH v5 10/17] dax: remove dax_pmd_fault()
@ 2016-10-07 21:08   ` Ross Zwisler
  0 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-07 21:08 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: Theodore Ts'o, Matthew Wilcox, Dave Chinner,
	linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw, Christoph Hellwig,
	linux-xfs-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Andreas Dilger, Alexander Viro,
	Jan Kara, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-ext4-u79uwXL29TY76Z2rM5mHXA, Andrew Morton

dax_pmd_fault() is the old struct buffer_head + get_block_t based 2 MiB DAX
fault handler.  This fault handler has been disabled for several kernel
releases, and support for PMDs will be reintroduced using the struct iomap
interface instead.

Signed-off-by: Ross Zwisler <ross.zwisler-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
Reviewed-by: Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org>
Reviewed-by: Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org>
---
 fs/dax.c            | 213 ----------------------------------------------------
 include/linux/dax.h |   6 +-
 2 files changed, 1 insertion(+), 218 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 608cee9..1b9154a 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -908,219 +908,6 @@ int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 }
 EXPORT_SYMBOL_GPL(dax_fault);
 
-#if defined(CONFIG_TRANSPARENT_HUGEPAGE)
-/*
- * The 'colour' (ie low bits) within a PMD of a page offset.  This comes up
- * more often than one might expect in the below function.
- */
-#define PG_PMD_COLOUR	((PMD_SIZE >> PAGE_SHIFT) - 1)
-
-static void __dax_dbg(struct buffer_head *bh, unsigned long address,
-		const char *reason, const char *fn)
-{
-	if (bh) {
-		char bname[BDEVNAME_SIZE];
-		bdevname(bh->b_bdev, bname);
-		pr_debug("%s: %s addr: %lx dev %s state %lx start %lld "
-			"length %zd fallback: %s\n", fn, current->comm,
-			address, bname, bh->b_state, (u64)bh->b_blocknr,
-			bh->b_size, reason);
-	} else {
-		pr_debug("%s: %s addr: %lx fallback: %s\n", fn,
-			current->comm, address, reason);
-	}
-}
-
-#define dax_pmd_dbg(bh, address, reason)	__dax_dbg(bh, address, reason, "dax_pmd")
-
-/**
- * dax_pmd_fault - handle a PMD fault on a DAX file
- * @vma: The virtual memory area where the fault occurred
- * @vmf: The description of the fault
- * @get_block: The filesystem method used to translate file offsets to blocks
- *
- * When a page fault occurs, filesystems may call this helper in their
- * pmd_fault handler for DAX files.
- */
-int dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
-		pmd_t *pmd, unsigned int flags, get_block_t get_block)
-{
-	struct file *file = vma->vm_file;
-	struct address_space *mapping = file->f_mapping;
-	struct inode *inode = mapping->host;
-	struct buffer_head bh;
-	unsigned blkbits = inode->i_blkbits;
-	unsigned long pmd_addr = address & PMD_MASK;
-	bool write = flags & FAULT_FLAG_WRITE;
-	struct block_device *bdev;
-	pgoff_t size, pgoff;
-	sector_t block;
-	int result = 0;
-	bool alloc = false;
-
-	/* dax pmd mappings require pfn_t_devmap() */
-	if (!IS_ENABLED(CONFIG_FS_DAX_PMD))
-		return VM_FAULT_FALLBACK;
-
-	/* Fall back to PTEs if we're going to COW */
-	if (write && !(vma->vm_flags & VM_SHARED)) {
-		split_huge_pmd(vma, pmd, address);
-		dax_pmd_dbg(NULL, address, "cow write");
-		return VM_FAULT_FALLBACK;
-	}
-	/* If the PMD would extend outside the VMA */
-	if (pmd_addr < vma->vm_start) {
-		dax_pmd_dbg(NULL, address, "vma start unaligned");
-		return VM_FAULT_FALLBACK;
-	}
-	if ((pmd_addr + PMD_SIZE) > vma->vm_end) {
-		dax_pmd_dbg(NULL, address, "vma end unaligned");
-		return VM_FAULT_FALLBACK;
-	}
-
-	pgoff = linear_page_index(vma, pmd_addr);
-	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
-	if (pgoff >= size)
-		return VM_FAULT_SIGBUS;
-	/* If the PMD would cover blocks out of the file */
-	if ((pgoff | PG_PMD_COLOUR) >= size) {
-		dax_pmd_dbg(NULL, address,
-				"offset + huge page size > file size");
-		return VM_FAULT_FALLBACK;
-	}
-
-	memset(&bh, 0, sizeof(bh));
-	bh.b_bdev = inode->i_sb->s_bdev;
-	block = (sector_t)pgoff << (PAGE_SHIFT - blkbits);
-
-	bh.b_size = PMD_SIZE;
-
-	if (get_block(inode, block, &bh, 0) != 0)
-		return VM_FAULT_SIGBUS;
-
-	if (!buffer_mapped(&bh) && write) {
-		if (get_block(inode, block, &bh, 1) != 0)
-			return VM_FAULT_SIGBUS;
-		alloc = true;
-		WARN_ON_ONCE(buffer_unwritten(&bh) || buffer_new(&bh));
-	}
-
-	bdev = bh.b_bdev;
-
-	if (bh.b_size < PMD_SIZE) {
-		dax_pmd_dbg(&bh, address, "allocated block too small");
-		return VM_FAULT_FALLBACK;
-	}
-
-	/*
-	 * If we allocated new storage, make sure no process has any
-	 * zero pages covering this hole
-	 */
-	if (alloc) {
-		loff_t lstart = pgoff << PAGE_SHIFT;
-		loff_t lend = lstart + PMD_SIZE - 1; /* inclusive */
-
-		truncate_pagecache_range(inode, lstart, lend);
-	}
-
-	if (!write && !buffer_mapped(&bh)) {
-		spinlock_t *ptl;
-		pmd_t entry;
-		struct page *zero_page = get_huge_zero_page();
-
-		if (unlikely(!zero_page)) {
-			dax_pmd_dbg(&bh, address, "no zero page");
-			goto fallback;
-		}
-
-		ptl = pmd_lock(vma->vm_mm, pmd);
-		if (!pmd_none(*pmd)) {
-			spin_unlock(ptl);
-			dax_pmd_dbg(&bh, address, "pmd already present");
-			goto fallback;
-		}
-
-		dev_dbg(part_to_dev(bdev->bd_part),
-				"%s: %s addr: %lx pfn: <zero> sect: %llx\n",
-				__func__, current->comm, address,
-				(unsigned long long) to_sector(&bh, inode));
-
-		entry = mk_pmd(zero_page, vma->vm_page_prot);
-		entry = pmd_mkhuge(entry);
-		set_pmd_at(vma->vm_mm, pmd_addr, pmd, entry);
-		result = VM_FAULT_NOPAGE;
-		spin_unlock(ptl);
-	} else {
-		struct blk_dax_ctl dax = {
-			.sector = to_sector(&bh, inode),
-			.size = PMD_SIZE,
-		};
-		long length = dax_map_atomic(bdev, &dax);
-
-		if (length < 0) {
-			dax_pmd_dbg(&bh, address, "dax-error fallback");
-			goto fallback;
-		}
-		if (length < PMD_SIZE) {
-			dax_pmd_dbg(&bh, address, "dax-length too small");
-			dax_unmap_atomic(bdev, &dax);
-			goto fallback;
-		}
-		if (pfn_t_to_pfn(dax.pfn) & PG_PMD_COLOUR) {
-			dax_pmd_dbg(&bh, address, "pfn unaligned");
-			dax_unmap_atomic(bdev, &dax);
-			goto fallback;
-		}
-
-		if (!pfn_t_devmap(dax.pfn)) {
-			dax_unmap_atomic(bdev, &dax);
-			dax_pmd_dbg(&bh, address, "pfn not in memmap");
-			goto fallback;
-		}
-		dax_unmap_atomic(bdev, &dax);
-
-		/*
-		 * For PTE faults we insert a radix tree entry for reads, and
-		 * leave it clean.  Then on the first write we dirty the radix
-		 * tree entry via the dax_pfn_mkwrite() path.  This sequence
-		 * allows the dax_pfn_mkwrite() call to be simpler and avoid a
-		 * call into get_block() to translate the pgoff to a sector in
-		 * order to be able to create a new radix tree entry.
-		 *
-		 * The PMD path doesn't have an equivalent to
-		 * dax_pfn_mkwrite(), though, so for a read followed by a
-		 * write we traverse all the way through dax_pmd_fault()
-		 * twice.  This means we can just skip inserting a radix tree
-		 * entry completely on the initial read and just wait until
-		 * the write to insert a dirty entry.
-		 */
-		if (write) {
-			/*
-			 * We should insert radix-tree entry and dirty it here.
-			 * For now this is broken...
-			 */
-		}
-
-		dev_dbg(part_to_dev(bdev->bd_part),
-				"%s: %s addr: %lx pfn: %lx sect: %llx\n",
-				__func__, current->comm, address,
-				pfn_t_to_pfn(dax.pfn),
-				(unsigned long long) dax.sector);
-		result |= vmf_insert_pfn_pmd(vma, address, pmd,
-				dax.pfn, write);
-	}
-
- out:
-	return result;
-
- fallback:
-	count_vm_event(THP_FAULT_FALLBACK);
-	result = VM_FAULT_FALLBACK;
-	goto out;
-}
-EXPORT_SYMBOL_GPL(dax_pmd_fault);
-#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
-
 /**
  * dax_pfn_mkwrite - handle first write to DAX page
  * @vma: The virtual memory area where the fault occurred
diff --git a/include/linux/dax.h b/include/linux/dax.h
index a41a747..0f74866 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -48,16 +48,12 @@ static inline int __dax_zero_page_range(struct block_device *bdev,
 }
 #endif
 
-#if defined(CONFIG_TRANSPARENT_HUGEPAGE)
-int dax_pmd_fault(struct vm_area_struct *, unsigned long addr, pmd_t *,
-				unsigned int flags, get_block_t);
-#else
 static inline int dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
 				pmd_t *pmd, unsigned int flags, get_block_t gb)
 {
 	return VM_FAULT_FALLBACK;
 }
-#endif
+
 int dax_pfn_mkwrite(struct vm_area_struct *, struct vm_fault *);
 #define dax_mkwrite(vma, vmf, gb)	dax_fault(vma, vmf, gb)
 
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH v5 10/17] dax: remove dax_pmd_fault()
@ 2016-10-07 21:08   ` Ross Zwisler
  0 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-07 21:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

dax_pmd_fault() is the old struct buffer_head + get_block_t based 2 MiB DAX
fault handler.  This fault handler has been disabled for several kernel
releases, and support for PMDs will be reintroduced using the struct iomap
interface instead.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/dax.c            | 213 ----------------------------------------------------
 include/linux/dax.h |   6 +-
 2 files changed, 1 insertion(+), 218 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 608cee9..1b9154a 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -908,219 +908,6 @@ int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 }
 EXPORT_SYMBOL_GPL(dax_fault);
 
-#if defined(CONFIG_TRANSPARENT_HUGEPAGE)
-/*
- * The 'colour' (ie low bits) within a PMD of a page offset.  This comes up
- * more often than one might expect in the below function.
- */
-#define PG_PMD_COLOUR	((PMD_SIZE >> PAGE_SHIFT) - 1)
-
-static void __dax_dbg(struct buffer_head *bh, unsigned long address,
-		const char *reason, const char *fn)
-{
-	if (bh) {
-		char bname[BDEVNAME_SIZE];
-		bdevname(bh->b_bdev, bname);
-		pr_debug("%s: %s addr: %lx dev %s state %lx start %lld "
-			"length %zd fallback: %s\n", fn, current->comm,
-			address, bname, bh->b_state, (u64)bh->b_blocknr,
-			bh->b_size, reason);
-	} else {
-		pr_debug("%s: %s addr: %lx fallback: %s\n", fn,
-			current->comm, address, reason);
-	}
-}
-
-#define dax_pmd_dbg(bh, address, reason)	__dax_dbg(bh, address, reason, "dax_pmd")
-
-/**
- * dax_pmd_fault - handle a PMD fault on a DAX file
- * @vma: The virtual memory area where the fault occurred
- * @vmf: The description of the fault
- * @get_block: The filesystem method used to translate file offsets to blocks
- *
- * When a page fault occurs, filesystems may call this helper in their
- * pmd_fault handler for DAX files.
- */
-int dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
-		pmd_t *pmd, unsigned int flags, get_block_t get_block)
-{
-	struct file *file = vma->vm_file;
-	struct address_space *mapping = file->f_mapping;
-	struct inode *inode = mapping->host;
-	struct buffer_head bh;
-	unsigned blkbits = inode->i_blkbits;
-	unsigned long pmd_addr = address & PMD_MASK;
-	bool write = flags & FAULT_FLAG_WRITE;
-	struct block_device *bdev;
-	pgoff_t size, pgoff;
-	sector_t block;
-	int result = 0;
-	bool alloc = false;
-
-	/* dax pmd mappings require pfn_t_devmap() */
-	if (!IS_ENABLED(CONFIG_FS_DAX_PMD))
-		return VM_FAULT_FALLBACK;
-
-	/* Fall back to PTEs if we're going to COW */
-	if (write && !(vma->vm_flags & VM_SHARED)) {
-		split_huge_pmd(vma, pmd, address);
-		dax_pmd_dbg(NULL, address, "cow write");
-		return VM_FAULT_FALLBACK;
-	}
-	/* If the PMD would extend outside the VMA */
-	if (pmd_addr < vma->vm_start) {
-		dax_pmd_dbg(NULL, address, "vma start unaligned");
-		return VM_FAULT_FALLBACK;
-	}
-	if ((pmd_addr + PMD_SIZE) > vma->vm_end) {
-		dax_pmd_dbg(NULL, address, "vma end unaligned");
-		return VM_FAULT_FALLBACK;
-	}
-
-	pgoff = linear_page_index(vma, pmd_addr);
-	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
-	if (pgoff >= size)
-		return VM_FAULT_SIGBUS;
-	/* If the PMD would cover blocks out of the file */
-	if ((pgoff | PG_PMD_COLOUR) >= size) {
-		dax_pmd_dbg(NULL, address,
-				"offset + huge page size > file size");
-		return VM_FAULT_FALLBACK;
-	}
-
-	memset(&bh, 0, sizeof(bh));
-	bh.b_bdev = inode->i_sb->s_bdev;
-	block = (sector_t)pgoff << (PAGE_SHIFT - blkbits);
-
-	bh.b_size = PMD_SIZE;
-
-	if (get_block(inode, block, &bh, 0) != 0)
-		return VM_FAULT_SIGBUS;
-
-	if (!buffer_mapped(&bh) && write) {
-		if (get_block(inode, block, &bh, 1) != 0)
-			return VM_FAULT_SIGBUS;
-		alloc = true;
-		WARN_ON_ONCE(buffer_unwritten(&bh) || buffer_new(&bh));
-	}
-
-	bdev = bh.b_bdev;
-
-	if (bh.b_size < PMD_SIZE) {
-		dax_pmd_dbg(&bh, address, "allocated block too small");
-		return VM_FAULT_FALLBACK;
-	}
-
-	/*
-	 * If we allocated new storage, make sure no process has any
-	 * zero pages covering this hole
-	 */
-	if (alloc) {
-		loff_t lstart = pgoff << PAGE_SHIFT;
-		loff_t lend = lstart + PMD_SIZE - 1; /* inclusive */
-
-		truncate_pagecache_range(inode, lstart, lend);
-	}
-
-	if (!write && !buffer_mapped(&bh)) {
-		spinlock_t *ptl;
-		pmd_t entry;
-		struct page *zero_page = get_huge_zero_page();
-
-		if (unlikely(!zero_page)) {
-			dax_pmd_dbg(&bh, address, "no zero page");
-			goto fallback;
-		}
-
-		ptl = pmd_lock(vma->vm_mm, pmd);
-		if (!pmd_none(*pmd)) {
-			spin_unlock(ptl);
-			dax_pmd_dbg(&bh, address, "pmd already present");
-			goto fallback;
-		}
-
-		dev_dbg(part_to_dev(bdev->bd_part),
-				"%s: %s addr: %lx pfn: <zero> sect: %llx\n",
-				__func__, current->comm, address,
-				(unsigned long long) to_sector(&bh, inode));
-
-		entry = mk_pmd(zero_page, vma->vm_page_prot);
-		entry = pmd_mkhuge(entry);
-		set_pmd_at(vma->vm_mm, pmd_addr, pmd, entry);
-		result = VM_FAULT_NOPAGE;
-		spin_unlock(ptl);
-	} else {
-		struct blk_dax_ctl dax = {
-			.sector = to_sector(&bh, inode),
-			.size = PMD_SIZE,
-		};
-		long length = dax_map_atomic(bdev, &dax);
-
-		if (length < 0) {
-			dax_pmd_dbg(&bh, address, "dax-error fallback");
-			goto fallback;
-		}
-		if (length < PMD_SIZE) {
-			dax_pmd_dbg(&bh, address, "dax-length too small");
-			dax_unmap_atomic(bdev, &dax);
-			goto fallback;
-		}
-		if (pfn_t_to_pfn(dax.pfn) & PG_PMD_COLOUR) {
-			dax_pmd_dbg(&bh, address, "pfn unaligned");
-			dax_unmap_atomic(bdev, &dax);
-			goto fallback;
-		}
-
-		if (!pfn_t_devmap(dax.pfn)) {
-			dax_unmap_atomic(bdev, &dax);
-			dax_pmd_dbg(&bh, address, "pfn not in memmap");
-			goto fallback;
-		}
-		dax_unmap_atomic(bdev, &dax);
-
-		/*
-		 * For PTE faults we insert a radix tree entry for reads, and
-		 * leave it clean.  Then on the first write we dirty the radix
-		 * tree entry via the dax_pfn_mkwrite() path.  This sequence
-		 * allows the dax_pfn_mkwrite() call to be simpler and avoid a
-		 * call into get_block() to translate the pgoff to a sector in
-		 * order to be able to create a new radix tree entry.
-		 *
-		 * The PMD path doesn't have an equivalent to
-		 * dax_pfn_mkwrite(), though, so for a read followed by a
-		 * write we traverse all the way through dax_pmd_fault()
-		 * twice.  This means we can just skip inserting a radix tree
-		 * entry completely on the initial read and just wait until
-		 * the write to insert a dirty entry.
-		 */
-		if (write) {
-			/*
-			 * We should insert radix-tree entry and dirty it here.
-			 * For now this is broken...
-			 */
-		}
-
-		dev_dbg(part_to_dev(bdev->bd_part),
-				"%s: %s addr: %lx pfn: %lx sect: %llx\n",
-				__func__, current->comm, address,
-				pfn_t_to_pfn(dax.pfn),
-				(unsigned long long) dax.sector);
-		result |= vmf_insert_pfn_pmd(vma, address, pmd,
-				dax.pfn, write);
-	}
-
- out:
-	return result;
-
- fallback:
-	count_vm_event(THP_FAULT_FALLBACK);
-	result = VM_FAULT_FALLBACK;
-	goto out;
-}
-EXPORT_SYMBOL_GPL(dax_pmd_fault);
-#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
-
 /**
  * dax_pfn_mkwrite - handle first write to DAX page
  * @vma: The virtual memory area where the fault occurred
diff --git a/include/linux/dax.h b/include/linux/dax.h
index a41a747..0f74866 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -48,16 +48,12 @@ static inline int __dax_zero_page_range(struct block_device *bdev,
 }
 #endif
 
-#if defined(CONFIG_TRANSPARENT_HUGEPAGE)
-int dax_pmd_fault(struct vm_area_struct *, unsigned long addr, pmd_t *,
-				unsigned int flags, get_block_t);
-#else
 static inline int dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
 				pmd_t *pmd, unsigned int flags, get_block_t gb)
 {
 	return VM_FAULT_FALLBACK;
 }
-#endif
+
 int dax_pfn_mkwrite(struct vm_area_struct *, struct vm_fault *);
 #define dax_mkwrite(vma, vmf, gb)	dax_fault(vma, vmf, gb)
 
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH v5 11/17] dax: correct dax iomap code namespace
  2016-10-07 21:08 ` Ross Zwisler
  (?)
  (?)
@ 2016-10-07 21:08   ` Ross Zwisler
  -1 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-07 21:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

The recently added DAX functions that use the new struct iomap data
structure were named iomap_dax_rw(), iomap_dax_fault() and
iomap_dax_actor().  These are actually defined in fs/dax.c, though, so
should be part of the "dax" namespace and not the "iomap" namespace.
Rename them to dax_iomap_rw(), dax_iomap_fault() and dax_iomap_actor()
respectively.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Suggested-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/dax.c            | 16 ++++++++--------
 fs/ext2/file.c      |  6 +++---
 fs/xfs/xfs_file.c   |  8 ++++----
 include/linux/dax.h |  4 ++--
 4 files changed, 17 insertions(+), 17 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 1b9154a..982ccbb 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1024,7 +1024,7 @@ EXPORT_SYMBOL_GPL(dax_truncate_page);
 
 #ifdef CONFIG_FS_IOMAP
 static loff_t
-iomap_dax_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
+dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 		struct iomap *iomap)
 {
 	struct iov_iter *iter = data;
@@ -1081,7 +1081,7 @@ iomap_dax_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 }
 
 /**
- * iomap_dax_rw - Perform I/O to a DAX file
+ * dax_iomap_rw - Perform I/O to a DAX file
  * @iocb:	The control block for this I/O
  * @iter:	The addresses to do I/O from or to
  * @ops:	iomap ops passed from the file system
@@ -1091,7 +1091,7 @@ iomap_dax_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
  * and evicting any page cache pages in the region under I/O.
  */
 ssize_t
-iomap_dax_rw(struct kiocb *iocb, struct iov_iter *iter,
+dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
 		struct iomap_ops *ops)
 {
 	struct address_space *mapping = iocb->ki_filp->f_mapping;
@@ -1121,7 +1121,7 @@ iomap_dax_rw(struct kiocb *iocb, struct iov_iter *iter,
 
 	while (iov_iter_count(iter)) {
 		ret = iomap_apply(inode, pos, iov_iter_count(iter), flags, ops,
-				iter, iomap_dax_actor);
+				iter, dax_iomap_actor);
 		if (ret <= 0)
 			break;
 		pos += ret;
@@ -1131,10 +1131,10 @@ iomap_dax_rw(struct kiocb *iocb, struct iov_iter *iter,
 	iocb->ki_pos += done;
 	return done ? done : ret;
 }
-EXPORT_SYMBOL_GPL(iomap_dax_rw);
+EXPORT_SYMBOL_GPL(dax_iomap_rw);
 
 /**
- * iomap_dax_fault - handle a page fault on a DAX file
+ * dax_iomap_fault - handle a page fault on a DAX file
  * @vma: The virtual memory area where the fault occurred
  * @vmf: The description of the fault
  * @ops: iomap ops passed from the file system
@@ -1143,7 +1143,7 @@ EXPORT_SYMBOL_GPL(iomap_dax_rw);
  * or mkwrite handler for DAX files. Assumes the caller has done all the
  * necessary locking for the page fault to proceed successfully.
  */
-int iomap_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
+int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 			struct iomap_ops *ops)
 {
 	struct address_space *mapping = vma->vm_file->f_mapping;
@@ -1245,5 +1245,5 @@ int iomap_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 		return VM_FAULT_SIGBUS | major;
 	return VM_FAULT_NOPAGE | major;
 }
-EXPORT_SYMBOL_GPL(iomap_dax_fault);
+EXPORT_SYMBOL_GPL(dax_iomap_fault);
 #endif /* CONFIG_FS_IOMAP */
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index 0f257f8..32a4913 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -38,7 +38,7 @@ static ssize_t ext2_dax_read_iter(struct kiocb *iocb, struct iov_iter *to)
 		return 0; /* skip atime */
 
 	inode_lock_shared(inode);
-	ret = iomap_dax_rw(iocb, to, &ext2_iomap_ops);
+	ret = dax_iomap_rw(iocb, to, &ext2_iomap_ops);
 	inode_unlock_shared(inode);
 
 	file_accessed(iocb->ki_filp);
@@ -62,7 +62,7 @@ static ssize_t ext2_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	if (ret)
 		goto out_unlock;
 
-	ret = iomap_dax_rw(iocb, from, &ext2_iomap_ops);
+	ret = dax_iomap_rw(iocb, from, &ext2_iomap_ops);
 	if (ret > 0 && iocb->ki_pos > i_size_read(inode)) {
 		i_size_write(inode, iocb->ki_pos);
 		mark_inode_dirty(inode);
@@ -99,7 +99,7 @@ static int ext2_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 	}
 	down_read(&ei->dax_sem);
 
-	ret = iomap_dax_fault(vma, vmf, &ext2_iomap_ops);
+	ret = dax_iomap_fault(vma, vmf, &ext2_iomap_ops);
 
 	up_read(&ei->dax_sem);
 	if (vmf->flags & FAULT_FLAG_WRITE)
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index de7c53c..8f12152 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -344,7 +344,7 @@ xfs_file_dax_read(
 		return 0; /* skip atime */
 
 	xfs_rw_ilock(ip, XFS_IOLOCK_SHARED);
-	ret = iomap_dax_rw(iocb, to, &xfs_iomap_ops);
+	ret = dax_iomap_rw(iocb, to, &xfs_iomap_ops);
 	xfs_rw_iunlock(ip, XFS_IOLOCK_SHARED);
 
 	file_accessed(iocb->ki_filp);
@@ -730,7 +730,7 @@ xfs_file_dax_write(
 
 	trace_xfs_file_dax_write(ip, count, pos);
 
-	ret = iomap_dax_rw(iocb, from, &xfs_iomap_ops);
+	ret = dax_iomap_rw(iocb, from, &xfs_iomap_ops);
 	if (ret > 0 && iocb->ki_pos > i_size_read(inode)) {
 		i_size_write(inode, iocb->ki_pos);
 		error = xfs_setfilesize(ip, pos, ret);
@@ -1679,7 +1679,7 @@ xfs_filemap_page_mkwrite(
 	xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
 
 	if (IS_DAX(inode)) {
-		ret = iomap_dax_fault(vma, vmf, &xfs_iomap_ops);
+		ret = dax_iomap_fault(vma, vmf, &xfs_iomap_ops);
 	} else {
 		ret = iomap_page_mkwrite(vma, vmf, &xfs_iomap_ops);
 		ret = block_page_mkwrite_return(ret);
@@ -1713,7 +1713,7 @@ xfs_filemap_fault(
 		 * changes to xfs_get_blocks_direct() to map unwritten extent
 		 * ioend for conversion on read-only mappings.
 		 */
-		ret = iomap_dax_fault(vma, vmf, &xfs_iomap_ops);
+		ret = dax_iomap_fault(vma, vmf, &xfs_iomap_ops);
 	} else
 		ret = filemap_fault(vma, vmf);
 	xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 0f74866..a3dfee4 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -11,13 +11,13 @@ struct iomap_ops;
 /* We use lowest available exceptional entry bit for locking */
 #define RADIX_DAX_ENTRY_LOCK (1 << RADIX_TREE_EXCEPTIONAL_SHIFT)
 
-ssize_t iomap_dax_rw(struct kiocb *iocb, struct iov_iter *iter,
+ssize_t dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
 		struct iomap_ops *ops);
 ssize_t dax_do_io(struct kiocb *, struct inode *, struct iov_iter *,
 		  get_block_t, dio_iodone_t, int flags);
 int dax_zero_page_range(struct inode *, loff_t from, unsigned len, get_block_t);
 int dax_truncate_page(struct inode *, loff_t from, get_block_t);
-int iomap_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
+int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 			struct iomap_ops *ops);
 int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t);
 int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index);
-- 
2.7.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH v5 11/17] dax: correct dax iomap code namespace
@ 2016-10-07 21:08   ` Ross Zwisler
  0 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-07 21:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

The recently added DAX functions that use the new struct iomap data
structure were named iomap_dax_rw(), iomap_dax_fault() and
iomap_dax_actor().  These are actually defined in fs/dax.c, though, so
should be part of the "dax" namespace and not the "iomap" namespace.
Rename them to dax_iomap_rw(), dax_iomap_fault() and dax_iomap_actor()
respectively.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Suggested-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/dax.c            | 16 ++++++++--------
 fs/ext2/file.c      |  6 +++---
 fs/xfs/xfs_file.c   |  8 ++++----
 include/linux/dax.h |  4 ++--
 4 files changed, 17 insertions(+), 17 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 1b9154a..982ccbb 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1024,7 +1024,7 @@ EXPORT_SYMBOL_GPL(dax_truncate_page);
 
 #ifdef CONFIG_FS_IOMAP
 static loff_t
-iomap_dax_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
+dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 		struct iomap *iomap)
 {
 	struct iov_iter *iter = data;
@@ -1081,7 +1081,7 @@ iomap_dax_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 }
 
 /**
- * iomap_dax_rw - Perform I/O to a DAX file
+ * dax_iomap_rw - Perform I/O to a DAX file
  * @iocb:	The control block for this I/O
  * @iter:	The addresses to do I/O from or to
  * @ops:	iomap ops passed from the file system
@@ -1091,7 +1091,7 @@ iomap_dax_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
  * and evicting any page cache pages in the region under I/O.
  */
 ssize_t
-iomap_dax_rw(struct kiocb *iocb, struct iov_iter *iter,
+dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
 		struct iomap_ops *ops)
 {
 	struct address_space *mapping = iocb->ki_filp->f_mapping;
@@ -1121,7 +1121,7 @@ iomap_dax_rw(struct kiocb *iocb, struct iov_iter *iter,
 
 	while (iov_iter_count(iter)) {
 		ret = iomap_apply(inode, pos, iov_iter_count(iter), flags, ops,
-				iter, iomap_dax_actor);
+				iter, dax_iomap_actor);
 		if (ret <= 0)
 			break;
 		pos += ret;
@@ -1131,10 +1131,10 @@ iomap_dax_rw(struct kiocb *iocb, struct iov_iter *iter,
 	iocb->ki_pos += done;
 	return done ? done : ret;
 }
-EXPORT_SYMBOL_GPL(iomap_dax_rw);
+EXPORT_SYMBOL_GPL(dax_iomap_rw);
 
 /**
- * iomap_dax_fault - handle a page fault on a DAX file
+ * dax_iomap_fault - handle a page fault on a DAX file
  * @vma: The virtual memory area where the fault occurred
  * @vmf: The description of the fault
  * @ops: iomap ops passed from the file system
@@ -1143,7 +1143,7 @@ EXPORT_SYMBOL_GPL(iomap_dax_rw);
  * or mkwrite handler for DAX files. Assumes the caller has done all the
  * necessary locking for the page fault to proceed successfully.
  */
-int iomap_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
+int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 			struct iomap_ops *ops)
 {
 	struct address_space *mapping = vma->vm_file->f_mapping;
@@ -1245,5 +1245,5 @@ int iomap_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 		return VM_FAULT_SIGBUS | major;
 	return VM_FAULT_NOPAGE | major;
 }
-EXPORT_SYMBOL_GPL(iomap_dax_fault);
+EXPORT_SYMBOL_GPL(dax_iomap_fault);
 #endif /* CONFIG_FS_IOMAP */
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index 0f257f8..32a4913 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -38,7 +38,7 @@ static ssize_t ext2_dax_read_iter(struct kiocb *iocb, struct iov_iter *to)
 		return 0; /* skip atime */
 
 	inode_lock_shared(inode);
-	ret = iomap_dax_rw(iocb, to, &ext2_iomap_ops);
+	ret = dax_iomap_rw(iocb, to, &ext2_iomap_ops);
 	inode_unlock_shared(inode);
 
 	file_accessed(iocb->ki_filp);
@@ -62,7 +62,7 @@ static ssize_t ext2_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	if (ret)
 		goto out_unlock;
 
-	ret = iomap_dax_rw(iocb, from, &ext2_iomap_ops);
+	ret = dax_iomap_rw(iocb, from, &ext2_iomap_ops);
 	if (ret > 0 && iocb->ki_pos > i_size_read(inode)) {
 		i_size_write(inode, iocb->ki_pos);
 		mark_inode_dirty(inode);
@@ -99,7 +99,7 @@ static int ext2_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 	}
 	down_read(&ei->dax_sem);
 
-	ret = iomap_dax_fault(vma, vmf, &ext2_iomap_ops);
+	ret = dax_iomap_fault(vma, vmf, &ext2_iomap_ops);
 
 	up_read(&ei->dax_sem);
 	if (vmf->flags & FAULT_FLAG_WRITE)
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index de7c53c..8f12152 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -344,7 +344,7 @@ xfs_file_dax_read(
 		return 0; /* skip atime */
 
 	xfs_rw_ilock(ip, XFS_IOLOCK_SHARED);
-	ret = iomap_dax_rw(iocb, to, &xfs_iomap_ops);
+	ret = dax_iomap_rw(iocb, to, &xfs_iomap_ops);
 	xfs_rw_iunlock(ip, XFS_IOLOCK_SHARED);
 
 	file_accessed(iocb->ki_filp);
@@ -730,7 +730,7 @@ xfs_file_dax_write(
 
 	trace_xfs_file_dax_write(ip, count, pos);
 
-	ret = iomap_dax_rw(iocb, from, &xfs_iomap_ops);
+	ret = dax_iomap_rw(iocb, from, &xfs_iomap_ops);
 	if (ret > 0 && iocb->ki_pos > i_size_read(inode)) {
 		i_size_write(inode, iocb->ki_pos);
 		error = xfs_setfilesize(ip, pos, ret);
@@ -1679,7 +1679,7 @@ xfs_filemap_page_mkwrite(
 	xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
 
 	if (IS_DAX(inode)) {
-		ret = iomap_dax_fault(vma, vmf, &xfs_iomap_ops);
+		ret = dax_iomap_fault(vma, vmf, &xfs_iomap_ops);
 	} else {
 		ret = iomap_page_mkwrite(vma, vmf, &xfs_iomap_ops);
 		ret = block_page_mkwrite_return(ret);
@@ -1713,7 +1713,7 @@ xfs_filemap_fault(
 		 * changes to xfs_get_blocks_direct() to map unwritten extent
 		 * ioend for conversion on read-only mappings.
 		 */
-		ret = iomap_dax_fault(vma, vmf, &xfs_iomap_ops);
+		ret = dax_iomap_fault(vma, vmf, &xfs_iomap_ops);
 	} else
 		ret = filemap_fault(vma, vmf);
 	xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 0f74866..a3dfee4 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -11,13 +11,13 @@ struct iomap_ops;
 /* We use lowest available exceptional entry bit for locking */
 #define RADIX_DAX_ENTRY_LOCK (1 << RADIX_TREE_EXCEPTIONAL_SHIFT)
 
-ssize_t iomap_dax_rw(struct kiocb *iocb, struct iov_iter *iter,
+ssize_t dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
 		struct iomap_ops *ops);
 ssize_t dax_do_io(struct kiocb *, struct inode *, struct iov_iter *,
 		  get_block_t, dio_iodone_t, int flags);
 int dax_zero_page_range(struct inode *, loff_t from, unsigned len, get_block_t);
 int dax_truncate_page(struct inode *, loff_t from, get_block_t);
-int iomap_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
+int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 			struct iomap_ops *ops);
 int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t);
 int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index);
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH v5 11/17] dax: correct dax iomap code namespace
@ 2016-10-07 21:08   ` Ross Zwisler
  0 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-07 21:08 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: Theodore Ts'o, Matthew Wilcox, Dave Chinner,
	linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw, Christoph Hellwig,
	linux-xfs-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Andreas Dilger, Alexander Viro,
	Jan Kara, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-ext4-u79uwXL29TY76Z2rM5mHXA, Andrew Morton

The recently added DAX functions that use the new struct iomap data
structure were named iomap_dax_rw(), iomap_dax_fault() and
iomap_dax_actor().  These are actually defined in fs/dax.c, though, so
should be part of the "dax" namespace and not the "iomap" namespace.
Rename them to dax_iomap_rw(), dax_iomap_fault() and dax_iomap_actor()
respectively.

Signed-off-by: Ross Zwisler <ross.zwisler-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
Suggested-by: Dave Chinner <david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org>
Reviewed-by: Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org>
Reviewed-by: Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org>
---
 fs/dax.c            | 16 ++++++++--------
 fs/ext2/file.c      |  6 +++---
 fs/xfs/xfs_file.c   |  8 ++++----
 include/linux/dax.h |  4 ++--
 4 files changed, 17 insertions(+), 17 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 1b9154a..982ccbb 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1024,7 +1024,7 @@ EXPORT_SYMBOL_GPL(dax_truncate_page);
 
 #ifdef CONFIG_FS_IOMAP
 static loff_t
-iomap_dax_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
+dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 		struct iomap *iomap)
 {
 	struct iov_iter *iter = data;
@@ -1081,7 +1081,7 @@ iomap_dax_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 }
 
 /**
- * iomap_dax_rw - Perform I/O to a DAX file
+ * dax_iomap_rw - Perform I/O to a DAX file
  * @iocb:	The control block for this I/O
  * @iter:	The addresses to do I/O from or to
  * @ops:	iomap ops passed from the file system
@@ -1091,7 +1091,7 @@ iomap_dax_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
  * and evicting any page cache pages in the region under I/O.
  */
 ssize_t
-iomap_dax_rw(struct kiocb *iocb, struct iov_iter *iter,
+dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
 		struct iomap_ops *ops)
 {
 	struct address_space *mapping = iocb->ki_filp->f_mapping;
@@ -1121,7 +1121,7 @@ iomap_dax_rw(struct kiocb *iocb, struct iov_iter *iter,
 
 	while (iov_iter_count(iter)) {
 		ret = iomap_apply(inode, pos, iov_iter_count(iter), flags, ops,
-				iter, iomap_dax_actor);
+				iter, dax_iomap_actor);
 		if (ret <= 0)
 			break;
 		pos += ret;
@@ -1131,10 +1131,10 @@ iomap_dax_rw(struct kiocb *iocb, struct iov_iter *iter,
 	iocb->ki_pos += done;
 	return done ? done : ret;
 }
-EXPORT_SYMBOL_GPL(iomap_dax_rw);
+EXPORT_SYMBOL_GPL(dax_iomap_rw);
 
 /**
- * iomap_dax_fault - handle a page fault on a DAX file
+ * dax_iomap_fault - handle a page fault on a DAX file
  * @vma: The virtual memory area where the fault occurred
  * @vmf: The description of the fault
  * @ops: iomap ops passed from the file system
@@ -1143,7 +1143,7 @@ EXPORT_SYMBOL_GPL(iomap_dax_rw);
  * or mkwrite handler for DAX files. Assumes the caller has done all the
  * necessary locking for the page fault to proceed successfully.
  */
-int iomap_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
+int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 			struct iomap_ops *ops)
 {
 	struct address_space *mapping = vma->vm_file->f_mapping;
@@ -1245,5 +1245,5 @@ int iomap_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 		return VM_FAULT_SIGBUS | major;
 	return VM_FAULT_NOPAGE | major;
 }
-EXPORT_SYMBOL_GPL(iomap_dax_fault);
+EXPORT_SYMBOL_GPL(dax_iomap_fault);
 #endif /* CONFIG_FS_IOMAP */
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index 0f257f8..32a4913 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -38,7 +38,7 @@ static ssize_t ext2_dax_read_iter(struct kiocb *iocb, struct iov_iter *to)
 		return 0; /* skip atime */
 
 	inode_lock_shared(inode);
-	ret = iomap_dax_rw(iocb, to, &ext2_iomap_ops);
+	ret = dax_iomap_rw(iocb, to, &ext2_iomap_ops);
 	inode_unlock_shared(inode);
 
 	file_accessed(iocb->ki_filp);
@@ -62,7 +62,7 @@ static ssize_t ext2_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	if (ret)
 		goto out_unlock;
 
-	ret = iomap_dax_rw(iocb, from, &ext2_iomap_ops);
+	ret = dax_iomap_rw(iocb, from, &ext2_iomap_ops);
 	if (ret > 0 && iocb->ki_pos > i_size_read(inode)) {
 		i_size_write(inode, iocb->ki_pos);
 		mark_inode_dirty(inode);
@@ -99,7 +99,7 @@ static int ext2_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 	}
 	down_read(&ei->dax_sem);
 
-	ret = iomap_dax_fault(vma, vmf, &ext2_iomap_ops);
+	ret = dax_iomap_fault(vma, vmf, &ext2_iomap_ops);
 
 	up_read(&ei->dax_sem);
 	if (vmf->flags & FAULT_FLAG_WRITE)
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index de7c53c..8f12152 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -344,7 +344,7 @@ xfs_file_dax_read(
 		return 0; /* skip atime */
 
 	xfs_rw_ilock(ip, XFS_IOLOCK_SHARED);
-	ret = iomap_dax_rw(iocb, to, &xfs_iomap_ops);
+	ret = dax_iomap_rw(iocb, to, &xfs_iomap_ops);
 	xfs_rw_iunlock(ip, XFS_IOLOCK_SHARED);
 
 	file_accessed(iocb->ki_filp);
@@ -730,7 +730,7 @@ xfs_file_dax_write(
 
 	trace_xfs_file_dax_write(ip, count, pos);
 
-	ret = iomap_dax_rw(iocb, from, &xfs_iomap_ops);
+	ret = dax_iomap_rw(iocb, from, &xfs_iomap_ops);
 	if (ret > 0 && iocb->ki_pos > i_size_read(inode)) {
 		i_size_write(inode, iocb->ki_pos);
 		error = xfs_setfilesize(ip, pos, ret);
@@ -1679,7 +1679,7 @@ xfs_filemap_page_mkwrite(
 	xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
 
 	if (IS_DAX(inode)) {
-		ret = iomap_dax_fault(vma, vmf, &xfs_iomap_ops);
+		ret = dax_iomap_fault(vma, vmf, &xfs_iomap_ops);
 	} else {
 		ret = iomap_page_mkwrite(vma, vmf, &xfs_iomap_ops);
 		ret = block_page_mkwrite_return(ret);
@@ -1713,7 +1713,7 @@ xfs_filemap_fault(
 		 * changes to xfs_get_blocks_direct() to map unwritten extent
 		 * ioend for conversion on read-only mappings.
 		 */
-		ret = iomap_dax_fault(vma, vmf, &xfs_iomap_ops);
+		ret = dax_iomap_fault(vma, vmf, &xfs_iomap_ops);
 	} else
 		ret = filemap_fault(vma, vmf);
 	xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 0f74866..a3dfee4 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -11,13 +11,13 @@ struct iomap_ops;
 /* We use lowest available exceptional entry bit for locking */
 #define RADIX_DAX_ENTRY_LOCK (1 << RADIX_TREE_EXCEPTIONAL_SHIFT)
 
-ssize_t iomap_dax_rw(struct kiocb *iocb, struct iov_iter *iter,
+ssize_t dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
 		struct iomap_ops *ops);
 ssize_t dax_do_io(struct kiocb *, struct inode *, struct iov_iter *,
 		  get_block_t, dio_iodone_t, int flags);
 int dax_zero_page_range(struct inode *, loff_t from, unsigned len, get_block_t);
 int dax_truncate_page(struct inode *, loff_t from, get_block_t);
-int iomap_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
+int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 			struct iomap_ops *ops);
 int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t);
 int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index);
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH v5 11/17] dax: correct dax iomap code namespace
@ 2016-10-07 21:08   ` Ross Zwisler
  0 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-07 21:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

The recently added DAX functions that use the new struct iomap data
structure were named iomap_dax_rw(), iomap_dax_fault() and
iomap_dax_actor().  These are actually defined in fs/dax.c, though, so
should be part of the "dax" namespace and not the "iomap" namespace.
Rename them to dax_iomap_rw(), dax_iomap_fault() and dax_iomap_actor()
respectively.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Suggested-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/dax.c            | 16 ++++++++--------
 fs/ext2/file.c      |  6 +++---
 fs/xfs/xfs_file.c   |  8 ++++----
 include/linux/dax.h |  4 ++--
 4 files changed, 17 insertions(+), 17 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 1b9154a..982ccbb 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1024,7 +1024,7 @@ EXPORT_SYMBOL_GPL(dax_truncate_page);
 
 #ifdef CONFIG_FS_IOMAP
 static loff_t
-iomap_dax_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
+dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 		struct iomap *iomap)
 {
 	struct iov_iter *iter = data;
@@ -1081,7 +1081,7 @@ iomap_dax_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 }
 
 /**
- * iomap_dax_rw - Perform I/O to a DAX file
+ * dax_iomap_rw - Perform I/O to a DAX file
  * @iocb:	The control block for this I/O
  * @iter:	The addresses to do I/O from or to
  * @ops:	iomap ops passed from the file system
@@ -1091,7 +1091,7 @@ iomap_dax_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
  * and evicting any page cache pages in the region under I/O.
  */
 ssize_t
-iomap_dax_rw(struct kiocb *iocb, struct iov_iter *iter,
+dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
 		struct iomap_ops *ops)
 {
 	struct address_space *mapping = iocb->ki_filp->f_mapping;
@@ -1121,7 +1121,7 @@ iomap_dax_rw(struct kiocb *iocb, struct iov_iter *iter,
 
 	while (iov_iter_count(iter)) {
 		ret = iomap_apply(inode, pos, iov_iter_count(iter), flags, ops,
-				iter, iomap_dax_actor);
+				iter, dax_iomap_actor);
 		if (ret <= 0)
 			break;
 		pos += ret;
@@ -1131,10 +1131,10 @@ iomap_dax_rw(struct kiocb *iocb, struct iov_iter *iter,
 	iocb->ki_pos += done;
 	return done ? done : ret;
 }
-EXPORT_SYMBOL_GPL(iomap_dax_rw);
+EXPORT_SYMBOL_GPL(dax_iomap_rw);
 
 /**
- * iomap_dax_fault - handle a page fault on a DAX file
+ * dax_iomap_fault - handle a page fault on a DAX file
  * @vma: The virtual memory area where the fault occurred
  * @vmf: The description of the fault
  * @ops: iomap ops passed from the file system
@@ -1143,7 +1143,7 @@ EXPORT_SYMBOL_GPL(iomap_dax_rw);
  * or mkwrite handler for DAX files. Assumes the caller has done all the
  * necessary locking for the page fault to proceed successfully.
  */
-int iomap_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
+int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 			struct iomap_ops *ops)
 {
 	struct address_space *mapping = vma->vm_file->f_mapping;
@@ -1245,5 +1245,5 @@ int iomap_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 		return VM_FAULT_SIGBUS | major;
 	return VM_FAULT_NOPAGE | major;
 }
-EXPORT_SYMBOL_GPL(iomap_dax_fault);
+EXPORT_SYMBOL_GPL(dax_iomap_fault);
 #endif /* CONFIG_FS_IOMAP */
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index 0f257f8..32a4913 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -38,7 +38,7 @@ static ssize_t ext2_dax_read_iter(struct kiocb *iocb, struct iov_iter *to)
 		return 0; /* skip atime */
 
 	inode_lock_shared(inode);
-	ret = iomap_dax_rw(iocb, to, &ext2_iomap_ops);
+	ret = dax_iomap_rw(iocb, to, &ext2_iomap_ops);
 	inode_unlock_shared(inode);
 
 	file_accessed(iocb->ki_filp);
@@ -62,7 +62,7 @@ static ssize_t ext2_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	if (ret)
 		goto out_unlock;
 
-	ret = iomap_dax_rw(iocb, from, &ext2_iomap_ops);
+	ret = dax_iomap_rw(iocb, from, &ext2_iomap_ops);
 	if (ret > 0 && iocb->ki_pos > i_size_read(inode)) {
 		i_size_write(inode, iocb->ki_pos);
 		mark_inode_dirty(inode);
@@ -99,7 +99,7 @@ static int ext2_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 	}
 	down_read(&ei->dax_sem);
 
-	ret = iomap_dax_fault(vma, vmf, &ext2_iomap_ops);
+	ret = dax_iomap_fault(vma, vmf, &ext2_iomap_ops);
 
 	up_read(&ei->dax_sem);
 	if (vmf->flags & FAULT_FLAG_WRITE)
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index de7c53c..8f12152 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -344,7 +344,7 @@ xfs_file_dax_read(
 		return 0; /* skip atime */
 
 	xfs_rw_ilock(ip, XFS_IOLOCK_SHARED);
-	ret = iomap_dax_rw(iocb, to, &xfs_iomap_ops);
+	ret = dax_iomap_rw(iocb, to, &xfs_iomap_ops);
 	xfs_rw_iunlock(ip, XFS_IOLOCK_SHARED);
 
 	file_accessed(iocb->ki_filp);
@@ -730,7 +730,7 @@ xfs_file_dax_write(
 
 	trace_xfs_file_dax_write(ip, count, pos);
 
-	ret = iomap_dax_rw(iocb, from, &xfs_iomap_ops);
+	ret = dax_iomap_rw(iocb, from, &xfs_iomap_ops);
 	if (ret > 0 && iocb->ki_pos > i_size_read(inode)) {
 		i_size_write(inode, iocb->ki_pos);
 		error = xfs_setfilesize(ip, pos, ret);
@@ -1679,7 +1679,7 @@ xfs_filemap_page_mkwrite(
 	xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
 
 	if (IS_DAX(inode)) {
-		ret = iomap_dax_fault(vma, vmf, &xfs_iomap_ops);
+		ret = dax_iomap_fault(vma, vmf, &xfs_iomap_ops);
 	} else {
 		ret = iomap_page_mkwrite(vma, vmf, &xfs_iomap_ops);
 		ret = block_page_mkwrite_return(ret);
@@ -1713,7 +1713,7 @@ xfs_filemap_fault(
 		 * changes to xfs_get_blocks_direct() to map unwritten extent
 		 * ioend for conversion on read-only mappings.
 		 */
-		ret = iomap_dax_fault(vma, vmf, &xfs_iomap_ops);
+		ret = dax_iomap_fault(vma, vmf, &xfs_iomap_ops);
 	} else
 		ret = filemap_fault(vma, vmf);
 	xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 0f74866..a3dfee4 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -11,13 +11,13 @@ struct iomap_ops;
 /* We use lowest available exceptional entry bit for locking */
 #define RADIX_DAX_ENTRY_LOCK (1 << RADIX_TREE_EXCEPTIONAL_SHIFT)
 
-ssize_t iomap_dax_rw(struct kiocb *iocb, struct iov_iter *iter,
+ssize_t dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
 		struct iomap_ops *ops);
 ssize_t dax_do_io(struct kiocb *, struct inode *, struct iov_iter *,
 		  get_block_t, dio_iodone_t, int flags);
 int dax_zero_page_range(struct inode *, loff_t from, unsigned len, get_block_t);
 int dax_truncate_page(struct inode *, loff_t from, get_block_t);
-int iomap_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
+int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 			struct iomap_ops *ops);
 int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t);
 int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index);
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH v5 12/17] dax: add dax_iomap_sector() helper function
  2016-10-07 21:08 ` Ross Zwisler
                     ` (2 preceding siblings ...)
  (?)
@ 2016-10-07 21:08   ` Ross Zwisler
  -1 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-07 21:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: Theodore Ts'o, Matthew Wilcox, Dave Chinner, linux-nvdimm,
	Christoph Hellwig, linux-xfs, linux-mm, Andreas Dilger,
	Alexander Viro, Jan Kara, linux-fsdevel, linux-ext4,
	Andrew Morton

To be able to correctly calculate the sector from a file position and a
struct iomap there is a complex little bit of logic that currently happens
in both dax_iomap_actor() and dax_iomap_fault().  This will need to be
repeated yet again in the DAX PMD fault handler when it is added, so break
it out into a helper function.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/dax.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 982ccbb..7689ab0 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1023,6 +1023,11 @@ int dax_truncate_page(struct inode *inode, loff_t from, get_block_t get_block)
 EXPORT_SYMBOL_GPL(dax_truncate_page);
 
 #ifdef CONFIG_FS_IOMAP
+static inline sector_t dax_iomap_sector(struct iomap *iomap, loff_t pos)
+{
+	return iomap->blkno + (((pos & PAGE_MASK) - iomap->offset) >> 9);
+}
+
 static loff_t
 dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 		struct iomap *iomap)
@@ -1048,8 +1053,7 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 		struct blk_dax_ctl dax = { 0 };
 		ssize_t map_len;
 
-		dax.sector = iomap->blkno +
-			(((pos & PAGE_MASK) - iomap->offset) >> 9);
+		dax.sector = dax_iomap_sector(iomap, pos);
 		dax.size = (length + offset + PAGE_SIZE - 1) & PAGE_MASK;
 		map_len = dax_map_atomic(iomap->bdev, &dax);
 		if (map_len < 0) {
@@ -1186,7 +1190,7 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 		goto unlock_entry;
 	}
 
-	sector = iomap.blkno + (((pos & PAGE_MASK) - iomap.offset) >> 9);
+	sector = dax_iomap_sector(&iomap, pos);
 
 	if (vmf->cow_page) {
 		switch (iomap.type) {
-- 
2.7.4

_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH v5 12/17] dax: add dax_iomap_sector() helper function
@ 2016-10-07 21:08   ` Ross Zwisler
  0 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-07 21:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

To be able to correctly calculate the sector from a file position and a
struct iomap there is a complex little bit of logic that currently happens
in both dax_iomap_actor() and dax_iomap_fault().  This will need to be
repeated yet again in the DAX PMD fault handler when it is added, so break
it out into a helper function.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/dax.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 982ccbb..7689ab0 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1023,6 +1023,11 @@ int dax_truncate_page(struct inode *inode, loff_t from, get_block_t get_block)
 EXPORT_SYMBOL_GPL(dax_truncate_page);
 
 #ifdef CONFIG_FS_IOMAP
+static inline sector_t dax_iomap_sector(struct iomap *iomap, loff_t pos)
+{
+	return iomap->blkno + (((pos & PAGE_MASK) - iomap->offset) >> 9);
+}
+
 static loff_t
 dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 		struct iomap *iomap)
@@ -1048,8 +1053,7 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 		struct blk_dax_ctl dax = { 0 };
 		ssize_t map_len;
 
-		dax.sector = iomap->blkno +
-			(((pos & PAGE_MASK) - iomap->offset) >> 9);
+		dax.sector = dax_iomap_sector(iomap, pos);
 		dax.size = (length + offset + PAGE_SIZE - 1) & PAGE_MASK;
 		map_len = dax_map_atomic(iomap->bdev, &dax);
 		if (map_len < 0) {
@@ -1186,7 +1190,7 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 		goto unlock_entry;
 	}
 
-	sector = iomap.blkno + (((pos & PAGE_MASK) - iomap.offset) >> 9);
+	sector = dax_iomap_sector(&iomap, pos);
 
 	if (vmf->cow_page) {
 		switch (iomap.type) {
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH v5 12/17] dax: add dax_iomap_sector() helper function
@ 2016-10-07 21:08   ` Ross Zwisler
  0 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-07 21:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

To be able to correctly calculate the sector from a file position and a
struct iomap there is a complex little bit of logic that currently happens
in both dax_iomap_actor() and dax_iomap_fault().  This will need to be
repeated yet again in the DAX PMD fault handler when it is added, so break
it out into a helper function.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/dax.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 982ccbb..7689ab0 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1023,6 +1023,11 @@ int dax_truncate_page(struct inode *inode, loff_t from, get_block_t get_block)
 EXPORT_SYMBOL_GPL(dax_truncate_page);
 
 #ifdef CONFIG_FS_IOMAP
+static inline sector_t dax_iomap_sector(struct iomap *iomap, loff_t pos)
+{
+	return iomap->blkno + (((pos & PAGE_MASK) - iomap->offset) >> 9);
+}
+
 static loff_t
 dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 		struct iomap *iomap)
@@ -1048,8 +1053,7 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 		struct blk_dax_ctl dax = { 0 };
 		ssize_t map_len;
 
-		dax.sector = iomap->blkno +
-			(((pos & PAGE_MASK) - iomap->offset) >> 9);
+		dax.sector = dax_iomap_sector(iomap, pos);
 		dax.size = (length + offset + PAGE_SIZE - 1) & PAGE_MASK;
 		map_len = dax_map_atomic(iomap->bdev, &dax);
 		if (map_len < 0) {
@@ -1186,7 +1190,7 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 		goto unlock_entry;
 	}
 
-	sector = iomap.blkno + (((pos & PAGE_MASK) - iomap.offset) >> 9);
+	sector = dax_iomap_sector(&iomap, pos);
 
 	if (vmf->cow_page) {
 		switch (iomap.type) {
-- 
2.7.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH v5 12/17] dax: add dax_iomap_sector() helper function
@ 2016-10-07 21:08   ` Ross Zwisler
  0 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-07 21:08 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: Theodore Ts'o, Matthew Wilcox, Dave Chinner,
	linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw, Christoph Hellwig,
	linux-xfs-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Andreas Dilger, Alexander Viro,
	Jan Kara, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-ext4-u79uwXL29TY76Z2rM5mHXA, Andrew Morton

To be able to correctly calculate the sector from a file position and a
struct iomap there is a complex little bit of logic that currently happens
in both dax_iomap_actor() and dax_iomap_fault().  This will need to be
repeated yet again in the DAX PMD fault handler when it is added, so break
it out into a helper function.

Signed-off-by: Ross Zwisler <ross.zwisler-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
---
 fs/dax.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 982ccbb..7689ab0 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1023,6 +1023,11 @@ int dax_truncate_page(struct inode *inode, loff_t from, get_block_t get_block)
 EXPORT_SYMBOL_GPL(dax_truncate_page);
 
 #ifdef CONFIG_FS_IOMAP
+static inline sector_t dax_iomap_sector(struct iomap *iomap, loff_t pos)
+{
+	return iomap->blkno + (((pos & PAGE_MASK) - iomap->offset) >> 9);
+}
+
 static loff_t
 dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 		struct iomap *iomap)
@@ -1048,8 +1053,7 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 		struct blk_dax_ctl dax = { 0 };
 		ssize_t map_len;
 
-		dax.sector = iomap->blkno +
-			(((pos & PAGE_MASK) - iomap->offset) >> 9);
+		dax.sector = dax_iomap_sector(iomap, pos);
 		dax.size = (length + offset + PAGE_SIZE - 1) & PAGE_MASK;
 		map_len = dax_map_atomic(iomap->bdev, &dax);
 		if (map_len < 0) {
@@ -1186,7 +1190,7 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 		goto unlock_entry;
 	}
 
-	sector = iomap.blkno + (((pos & PAGE_MASK) - iomap.offset) >> 9);
+	sector = dax_iomap_sector(&iomap, pos);
 
 	if (vmf->cow_page) {
 		switch (iomap.type) {
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH v5 12/17] dax: add dax_iomap_sector() helper function
@ 2016-10-07 21:08   ` Ross Zwisler
  0 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-07 21:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

To be able to correctly calculate the sector from a file position and a
struct iomap there is a complex little bit of logic that currently happens
in both dax_iomap_actor() and dax_iomap_fault().  This will need to be
repeated yet again in the DAX PMD fault handler when it is added, so break
it out into a helper function.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/dax.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 982ccbb..7689ab0 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1023,6 +1023,11 @@ int dax_truncate_page(struct inode *inode, loff_t from, get_block_t get_block)
 EXPORT_SYMBOL_GPL(dax_truncate_page);
 
 #ifdef CONFIG_FS_IOMAP
+static inline sector_t dax_iomap_sector(struct iomap *iomap, loff_t pos)
+{
+	return iomap->blkno + (((pos & PAGE_MASK) - iomap->offset) >> 9);
+}
+
 static loff_t
 dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 		struct iomap *iomap)
@@ -1048,8 +1053,7 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 		struct blk_dax_ctl dax = { 0 };
 		ssize_t map_len;
 
-		dax.sector = iomap->blkno +
-			(((pos & PAGE_MASK) - iomap->offset) >> 9);
+		dax.sector = dax_iomap_sector(iomap, pos);
 		dax.size = (length + offset + PAGE_SIZE - 1) & PAGE_MASK;
 		map_len = dax_map_atomic(iomap->bdev, &dax);
 		if (map_len < 0) {
@@ -1186,7 +1190,7 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 		goto unlock_entry;
 	}
 
-	sector = iomap.blkno + (((pos & PAGE_MASK) - iomap.offset) >> 9);
+	sector = dax_iomap_sector(&iomap, pos);
 
 	if (vmf->cow_page) {
 		switch (iomap.type) {
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH v5 13/17] dax: dax_iomap_fault() needs to call iomap_end()
  2016-10-07 21:08 ` Ross Zwisler
  (?)
  (?)
@ 2016-10-07 21:09   ` Ross Zwisler
  -1 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-07 21:09 UTC (permalink / raw)
  To: linux-kernel
  Cc: Theodore Ts'o, Matthew Wilcox, Dave Chinner, linux-nvdimm,
	Christoph Hellwig, linux-xfs, linux-mm, Andreas Dilger,
	Alexander Viro, Jan Kara, linux-fsdevel, linux-ext4,
	Andrew Morton

Currently iomap_end() doesn't do anything for DAX page faults for both ext2
and XFS.  ext2_iomap_end() just checks for a write underrun, and
xfs_file_iomap_end() checks to see if it needs to finish a delayed
allocation.  However, in the future iomap_end() calls might be needed to
make sure we have balanced allocations, locks, etc.  So, add calls to
iomap_end() with appropriate error handling to dax_iomap_fault().

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Suggested-by: Jan Kara <jack@suse.cz>
---
 fs/dax.c | 31 ++++++++++++++++++++++++++++---
 1 file changed, 28 insertions(+), 3 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 7689ab0..5e8febe 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1187,7 +1187,7 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 		goto unlock_entry;
 	if (WARN_ON_ONCE(iomap.offset + iomap.length < pos + PAGE_SIZE)) {
 		error = -EIO;		/* fs corruption? */
-		goto unlock_entry;
+		goto finish_iomap;
 	}
 
 	sector = dax_iomap_sector(&iomap, pos);
@@ -1209,7 +1209,14 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 		}
 
 		if (error)
-			goto unlock_entry;
+			goto finish_iomap;
+
+		if (ops->iomap_end) {
+			error = ops->iomap_end(inode, pos, PAGE_SIZE,
+					PAGE_SIZE, flags, &iomap);
+			if (error)
+				goto unlock_entry;
+		}
 		if (!radix_tree_exceptional_entry(entry)) {
 			vmf->page = entry;
 			return VM_FAULT_LOCKED;
@@ -1230,8 +1237,15 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 		break;
 	case IOMAP_UNWRITTEN:
 	case IOMAP_HOLE:
-		if (!(vmf->flags & FAULT_FLAG_WRITE))
+		if (!(vmf->flags & FAULT_FLAG_WRITE)) {
+			if (ops->iomap_end)  {
+				error = ops->iomap_end(inode, pos, PAGE_SIZE,
+						PAGE_SIZE, flags, &iomap);
+				if (error)
+					goto unlock_entry;
+			}
 			return dax_load_hole(mapping, entry, vmf);
+		}
 		/*FALLTHRU*/
 	default:
 		WARN_ON_ONCE(1);
@@ -1239,6 +1253,17 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 		break;
 	}
 
+ finish_iomap:
+	if (ops->iomap_end) {
+		if (error) {
+			/* keep previous error */
+			ops->iomap_end(inode, pos, PAGE_SIZE, PAGE_SIZE, flags,
+					&iomap);
+		} else {
+			error = ops->iomap_end(inode, pos, PAGE_SIZE,
+					PAGE_SIZE, flags, &iomap);
+		}
+	}
  unlock_entry:
 	put_locked_mapping_entry(mapping, vmf->pgoff, entry);
  out:
-- 
2.7.4

_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH v5 13/17] dax: dax_iomap_fault() needs to call iomap_end()
@ 2016-10-07 21:09   ` Ross Zwisler
  0 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-07 21:09 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

Currently iomap_end() doesn't do anything for DAX page faults for both ext2
and XFS.  ext2_iomap_end() just checks for a write underrun, and
xfs_file_iomap_end() checks to see if it needs to finish a delayed
allocation.  However, in the future iomap_end() calls might be needed to
make sure we have balanced allocations, locks, etc.  So, add calls to
iomap_end() with appropriate error handling to dax_iomap_fault().

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Suggested-by: Jan Kara <jack@suse.cz>
---
 fs/dax.c | 31 ++++++++++++++++++++++++++++---
 1 file changed, 28 insertions(+), 3 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 7689ab0..5e8febe 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1187,7 +1187,7 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 		goto unlock_entry;
 	if (WARN_ON_ONCE(iomap.offset + iomap.length < pos + PAGE_SIZE)) {
 		error = -EIO;		/* fs corruption? */
-		goto unlock_entry;
+		goto finish_iomap;
 	}
 
 	sector = dax_iomap_sector(&iomap, pos);
@@ -1209,7 +1209,14 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 		}
 
 		if (error)
-			goto unlock_entry;
+			goto finish_iomap;
+
+		if (ops->iomap_end) {
+			error = ops->iomap_end(inode, pos, PAGE_SIZE,
+					PAGE_SIZE, flags, &iomap);
+			if (error)
+				goto unlock_entry;
+		}
 		if (!radix_tree_exceptional_entry(entry)) {
 			vmf->page = entry;
 			return VM_FAULT_LOCKED;
@@ -1230,8 +1237,15 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 		break;
 	case IOMAP_UNWRITTEN:
 	case IOMAP_HOLE:
-		if (!(vmf->flags & FAULT_FLAG_WRITE))
+		if (!(vmf->flags & FAULT_FLAG_WRITE)) {
+			if (ops->iomap_end)  {
+				error = ops->iomap_end(inode, pos, PAGE_SIZE,
+						PAGE_SIZE, flags, &iomap);
+				if (error)
+					goto unlock_entry;
+			}
 			return dax_load_hole(mapping, entry, vmf);
+		}
 		/*FALLTHRU*/
 	default:
 		WARN_ON_ONCE(1);
@@ -1239,6 +1253,17 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 		break;
 	}
 
+ finish_iomap:
+	if (ops->iomap_end) {
+		if (error) {
+			/* keep previous error */
+			ops->iomap_end(inode, pos, PAGE_SIZE, PAGE_SIZE, flags,
+					&iomap);
+		} else {
+			error = ops->iomap_end(inode, pos, PAGE_SIZE,
+					PAGE_SIZE, flags, &iomap);
+		}
+	}
  unlock_entry:
 	put_locked_mapping_entry(mapping, vmf->pgoff, entry);
  out:
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH v5 13/17] dax: dax_iomap_fault() needs to call iomap_end()
@ 2016-10-07 21:09   ` Ross Zwisler
  0 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-07 21:09 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

Currently iomap_end() doesn't do anything for DAX page faults for both ext2
and XFS.  ext2_iomap_end() just checks for a write underrun, and
xfs_file_iomap_end() checks to see if it needs to finish a delayed
allocation.  However, in the future iomap_end() calls might be needed to
make sure we have balanced allocations, locks, etc.  So, add calls to
iomap_end() with appropriate error handling to dax_iomap_fault().

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Suggested-by: Jan Kara <jack@suse.cz>
---
 fs/dax.c | 31 ++++++++++++++++++++++++++++---
 1 file changed, 28 insertions(+), 3 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 7689ab0..5e8febe 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1187,7 +1187,7 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 		goto unlock_entry;
 	if (WARN_ON_ONCE(iomap.offset + iomap.length < pos + PAGE_SIZE)) {
 		error = -EIO;		/* fs corruption? */
-		goto unlock_entry;
+		goto finish_iomap;
 	}
 
 	sector = dax_iomap_sector(&iomap, pos);
@@ -1209,7 +1209,14 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 		}
 
 		if (error)
-			goto unlock_entry;
+			goto finish_iomap;
+
+		if (ops->iomap_end) {
+			error = ops->iomap_end(inode, pos, PAGE_SIZE,
+					PAGE_SIZE, flags, &iomap);
+			if (error)
+				goto unlock_entry;
+		}
 		if (!radix_tree_exceptional_entry(entry)) {
 			vmf->page = entry;
 			return VM_FAULT_LOCKED;
@@ -1230,8 +1237,15 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 		break;
 	case IOMAP_UNWRITTEN:
 	case IOMAP_HOLE:
-		if (!(vmf->flags & FAULT_FLAG_WRITE))
+		if (!(vmf->flags & FAULT_FLAG_WRITE)) {
+			if (ops->iomap_end)  {
+				error = ops->iomap_end(inode, pos, PAGE_SIZE,
+						PAGE_SIZE, flags, &iomap);
+				if (error)
+					goto unlock_entry;
+			}
 			return dax_load_hole(mapping, entry, vmf);
+		}
 		/*FALLTHRU*/
 	default:
 		WARN_ON_ONCE(1);
@@ -1239,6 +1253,17 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 		break;
 	}
 
+ finish_iomap:
+	if (ops->iomap_end) {
+		if (error) {
+			/* keep previous error */
+			ops->iomap_end(inode, pos, PAGE_SIZE, PAGE_SIZE, flags,
+					&iomap);
+		} else {
+			error = ops->iomap_end(inode, pos, PAGE_SIZE,
+					PAGE_SIZE, flags, &iomap);
+		}
+	}
  unlock_entry:
 	put_locked_mapping_entry(mapping, vmf->pgoff, entry);
  out:
-- 
2.7.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH v5 13/17] dax: dax_iomap_fault() needs to call iomap_end()
@ 2016-10-07 21:09   ` Ross Zwisler
  0 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-07 21:09 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

Currently iomap_end() doesn't do anything for DAX page faults for both ext2
and XFS.  ext2_iomap_end() just checks for a write underrun, and
xfs_file_iomap_end() checks to see if it needs to finish a delayed
allocation.  However, in the future iomap_end() calls might be needed to
make sure we have balanced allocations, locks, etc.  So, add calls to
iomap_end() with appropriate error handling to dax_iomap_fault().

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Suggested-by: Jan Kara <jack@suse.cz>
---
 fs/dax.c | 31 ++++++++++++++++++++++++++++---
 1 file changed, 28 insertions(+), 3 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 7689ab0..5e8febe 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1187,7 +1187,7 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 		goto unlock_entry;
 	if (WARN_ON_ONCE(iomap.offset + iomap.length < pos + PAGE_SIZE)) {
 		error = -EIO;		/* fs corruption? */
-		goto unlock_entry;
+		goto finish_iomap;
 	}
 
 	sector = dax_iomap_sector(&iomap, pos);
@@ -1209,7 +1209,14 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 		}
 
 		if (error)
-			goto unlock_entry;
+			goto finish_iomap;
+
+		if (ops->iomap_end) {
+			error = ops->iomap_end(inode, pos, PAGE_SIZE,
+					PAGE_SIZE, flags, &iomap);
+			if (error)
+				goto unlock_entry;
+		}
 		if (!radix_tree_exceptional_entry(entry)) {
 			vmf->page = entry;
 			return VM_FAULT_LOCKED;
@@ -1230,8 +1237,15 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 		break;
 	case IOMAP_UNWRITTEN:
 	case IOMAP_HOLE:
-		if (!(vmf->flags & FAULT_FLAG_WRITE))
+		if (!(vmf->flags & FAULT_FLAG_WRITE)) {
+			if (ops->iomap_end)  {
+				error = ops->iomap_end(inode, pos, PAGE_SIZE,
+						PAGE_SIZE, flags, &iomap);
+				if (error)
+					goto unlock_entry;
+			}
 			return dax_load_hole(mapping, entry, vmf);
+		}
 		/*FALLTHRU*/
 	default:
 		WARN_ON_ONCE(1);
@@ -1239,6 +1253,17 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 		break;
 	}
 
+ finish_iomap:
+	if (ops->iomap_end) {
+		if (error) {
+			/* keep previous error */
+			ops->iomap_end(inode, pos, PAGE_SIZE, PAGE_SIZE, flags,
+					&iomap);
+		} else {
+			error = ops->iomap_end(inode, pos, PAGE_SIZE,
+					PAGE_SIZE, flags, &iomap);
+		}
+	}
  unlock_entry:
 	put_locked_mapping_entry(mapping, vmf->pgoff, entry);
  out:
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH v5 14/17] dax: move RADIX_DAX_* defines to dax.h
  2016-10-07 21:08 ` Ross Zwisler
  (?)
  (?)
@ 2016-10-07 21:09   ` Ross Zwisler
  -1 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-07 21:09 UTC (permalink / raw)
  To: linux-kernel
  Cc: Theodore Ts'o, Matthew Wilcox, Dave Chinner, linux-nvdimm,
	Christoph Hellwig, linux-xfs, linux-mm, Andreas Dilger,
	Alexander Viro, Jan Kara, linux-fsdevel, linux-ext4,
	Andrew Morton

The RADIX_DAX_* defines currently mostly live in fs/dax.c, with just
RADIX_DAX_ENTRY_LOCK being in include/linux/dax.h so it can be used in
mm/filemap.c.  When we add PMD support, though, mm/filemap.c will also need
access to the RADIX_DAX_PTE type so it can properly construct a 4k sized
empty entry.

Instead of shifting the defines between dax.c and dax.h as they are
individually used in other code, just move them wholesale to dax.h so
they'll be available when we need them.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/dax.c            | 14 --------------
 include/linux/dax.h | 15 ++++++++++++++-
 2 files changed, 14 insertions(+), 15 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 5e8febe..ac3cd05 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -34,20 +34,6 @@
 #include <linux/iomap.h>
 #include "internal.h"
 
-/*
- * We use lowest available bit in exceptional entry for locking, other two
- * bits to determine entry type. In total 3 special bits.
- */
-#define RADIX_DAX_SHIFT	(RADIX_TREE_EXCEPTIONAL_SHIFT + 3)
-#define RADIX_DAX_PTE (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 1))
-#define RADIX_DAX_PMD (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 2))
-#define RADIX_DAX_TYPE_MASK (RADIX_DAX_PTE | RADIX_DAX_PMD)
-#define RADIX_DAX_TYPE(entry) ((unsigned long)entry & RADIX_DAX_TYPE_MASK)
-#define RADIX_DAX_SECTOR(entry) (((unsigned long)entry >> RADIX_DAX_SHIFT))
-#define RADIX_DAX_ENTRY(sector, pmd) ((void *)((unsigned long)sector << \
-		RADIX_DAX_SHIFT | (pmd ? RADIX_DAX_PMD : RADIX_DAX_PTE) | \
-		RADIX_TREE_EXCEPTIONAL_ENTRY))
-
 /* We choose 4096 entries - same as per-zone page wait tables */
 #define DAX_WAIT_TABLE_BITS 12
 #define DAX_WAIT_TABLE_ENTRIES (1 << DAX_WAIT_TABLE_BITS)
diff --git a/include/linux/dax.h b/include/linux/dax.h
index a3dfee4..e9ea78c 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -8,8 +8,21 @@
 
 struct iomap_ops;
 
-/* We use lowest available exceptional entry bit for locking */
+/*
+ * We use lowest available bit in exceptional entry for locking, other two
+ * bits to determine entry type. In total 3 special bits.
+ */
+#define RADIX_DAX_SHIFT	(RADIX_TREE_EXCEPTIONAL_SHIFT + 3)
 #define RADIX_DAX_ENTRY_LOCK (1 << RADIX_TREE_EXCEPTIONAL_SHIFT)
+#define RADIX_DAX_PTE (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 1))
+#define RADIX_DAX_PMD (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 2))
+#define RADIX_DAX_TYPE_MASK (RADIX_DAX_PTE | RADIX_DAX_PMD)
+#define RADIX_DAX_TYPE(entry) ((unsigned long)entry & RADIX_DAX_TYPE_MASK)
+#define RADIX_DAX_SECTOR(entry) (((unsigned long)entry >> RADIX_DAX_SHIFT))
+#define RADIX_DAX_ENTRY(sector, pmd) ((void *)((unsigned long)sector << \
+		RADIX_DAX_SHIFT | (pmd ? RADIX_DAX_PMD : RADIX_DAX_PTE) | \
+		RADIX_TREE_EXCEPTIONAL_ENTRY))
+
 
 ssize_t dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
 		struct iomap_ops *ops);
-- 
2.7.4

_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH v5 14/17] dax: move RADIX_DAX_* defines to dax.h
@ 2016-10-07 21:09   ` Ross Zwisler
  0 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-07 21:09 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

The RADIX_DAX_* defines currently mostly live in fs/dax.c, with just
RADIX_DAX_ENTRY_LOCK being in include/linux/dax.h so it can be used in
mm/filemap.c.  When we add PMD support, though, mm/filemap.c will also need
access to the RADIX_DAX_PTE type so it can properly construct a 4k sized
empty entry.

Instead of shifting the defines between dax.c and dax.h as they are
individually used in other code, just move them wholesale to dax.h so
they'll be available when we need them.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/dax.c            | 14 --------------
 include/linux/dax.h | 15 ++++++++++++++-
 2 files changed, 14 insertions(+), 15 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 5e8febe..ac3cd05 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -34,20 +34,6 @@
 #include <linux/iomap.h>
 #include "internal.h"
 
-/*
- * We use lowest available bit in exceptional entry for locking, other two
- * bits to determine entry type. In total 3 special bits.
- */
-#define RADIX_DAX_SHIFT	(RADIX_TREE_EXCEPTIONAL_SHIFT + 3)
-#define RADIX_DAX_PTE (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 1))
-#define RADIX_DAX_PMD (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 2))
-#define RADIX_DAX_TYPE_MASK (RADIX_DAX_PTE | RADIX_DAX_PMD)
-#define RADIX_DAX_TYPE(entry) ((unsigned long)entry & RADIX_DAX_TYPE_MASK)
-#define RADIX_DAX_SECTOR(entry) (((unsigned long)entry >> RADIX_DAX_SHIFT))
-#define RADIX_DAX_ENTRY(sector, pmd) ((void *)((unsigned long)sector << \
-		RADIX_DAX_SHIFT | (pmd ? RADIX_DAX_PMD : RADIX_DAX_PTE) | \
-		RADIX_TREE_EXCEPTIONAL_ENTRY))
-
 /* We choose 4096 entries - same as per-zone page wait tables */
 #define DAX_WAIT_TABLE_BITS 12
 #define DAX_WAIT_TABLE_ENTRIES (1 << DAX_WAIT_TABLE_BITS)
diff --git a/include/linux/dax.h b/include/linux/dax.h
index a3dfee4..e9ea78c 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -8,8 +8,21 @@
 
 struct iomap_ops;
 
-/* We use lowest available exceptional entry bit for locking */
+/*
+ * We use lowest available bit in exceptional entry for locking, other two
+ * bits to determine entry type. In total 3 special bits.
+ */
+#define RADIX_DAX_SHIFT	(RADIX_TREE_EXCEPTIONAL_SHIFT + 3)
 #define RADIX_DAX_ENTRY_LOCK (1 << RADIX_TREE_EXCEPTIONAL_SHIFT)
+#define RADIX_DAX_PTE (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 1))
+#define RADIX_DAX_PMD (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 2))
+#define RADIX_DAX_TYPE_MASK (RADIX_DAX_PTE | RADIX_DAX_PMD)
+#define RADIX_DAX_TYPE(entry) ((unsigned long)entry & RADIX_DAX_TYPE_MASK)
+#define RADIX_DAX_SECTOR(entry) (((unsigned long)entry >> RADIX_DAX_SHIFT))
+#define RADIX_DAX_ENTRY(sector, pmd) ((void *)((unsigned long)sector << \
+		RADIX_DAX_SHIFT | (pmd ? RADIX_DAX_PMD : RADIX_DAX_PTE) | \
+		RADIX_TREE_EXCEPTIONAL_ENTRY))
+
 
 ssize_t dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
 		struct iomap_ops *ops);
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH v5 14/17] dax: move RADIX_DAX_* defines to dax.h
@ 2016-10-07 21:09   ` Ross Zwisler
  0 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-07 21:09 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

The RADIX_DAX_* defines currently mostly live in fs/dax.c, with just
RADIX_DAX_ENTRY_LOCK being in include/linux/dax.h so it can be used in
mm/filemap.c.  When we add PMD support, though, mm/filemap.c will also need
access to the RADIX_DAX_PTE type so it can properly construct a 4k sized
empty entry.

Instead of shifting the defines between dax.c and dax.h as they are
individually used in other code, just move them wholesale to dax.h so
they'll be available when we need them.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/dax.c            | 14 --------------
 include/linux/dax.h | 15 ++++++++++++++-
 2 files changed, 14 insertions(+), 15 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 5e8febe..ac3cd05 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -34,20 +34,6 @@
 #include <linux/iomap.h>
 #include "internal.h"
 
-/*
- * We use lowest available bit in exceptional entry for locking, other two
- * bits to determine entry type. In total 3 special bits.
- */
-#define RADIX_DAX_SHIFT	(RADIX_TREE_EXCEPTIONAL_SHIFT + 3)
-#define RADIX_DAX_PTE (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 1))
-#define RADIX_DAX_PMD (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 2))
-#define RADIX_DAX_TYPE_MASK (RADIX_DAX_PTE | RADIX_DAX_PMD)
-#define RADIX_DAX_TYPE(entry) ((unsigned long)entry & RADIX_DAX_TYPE_MASK)
-#define RADIX_DAX_SECTOR(entry) (((unsigned long)entry >> RADIX_DAX_SHIFT))
-#define RADIX_DAX_ENTRY(sector, pmd) ((void *)((unsigned long)sector << \
-		RADIX_DAX_SHIFT | (pmd ? RADIX_DAX_PMD : RADIX_DAX_PTE) | \
-		RADIX_TREE_EXCEPTIONAL_ENTRY))
-
 /* We choose 4096 entries - same as per-zone page wait tables */
 #define DAX_WAIT_TABLE_BITS 12
 #define DAX_WAIT_TABLE_ENTRIES (1 << DAX_WAIT_TABLE_BITS)
diff --git a/include/linux/dax.h b/include/linux/dax.h
index a3dfee4..e9ea78c 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -8,8 +8,21 @@
 
 struct iomap_ops;
 
-/* We use lowest available exceptional entry bit for locking */
+/*
+ * We use lowest available bit in exceptional entry for locking, other two
+ * bits to determine entry type. In total 3 special bits.
+ */
+#define RADIX_DAX_SHIFT	(RADIX_TREE_EXCEPTIONAL_SHIFT + 3)
 #define RADIX_DAX_ENTRY_LOCK (1 << RADIX_TREE_EXCEPTIONAL_SHIFT)
+#define RADIX_DAX_PTE (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 1))
+#define RADIX_DAX_PMD (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 2))
+#define RADIX_DAX_TYPE_MASK (RADIX_DAX_PTE | RADIX_DAX_PMD)
+#define RADIX_DAX_TYPE(entry) ((unsigned long)entry & RADIX_DAX_TYPE_MASK)
+#define RADIX_DAX_SECTOR(entry) (((unsigned long)entry >> RADIX_DAX_SHIFT))
+#define RADIX_DAX_ENTRY(sector, pmd) ((void *)((unsigned long)sector << \
+		RADIX_DAX_SHIFT | (pmd ? RADIX_DAX_PMD : RADIX_DAX_PTE) | \
+		RADIX_TREE_EXCEPTIONAL_ENTRY))
+
 
 ssize_t dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
 		struct iomap_ops *ops);
-- 
2.7.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH v5 14/17] dax: move RADIX_DAX_* defines to dax.h
@ 2016-10-07 21:09   ` Ross Zwisler
  0 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-07 21:09 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

The RADIX_DAX_* defines currently mostly live in fs/dax.c, with just
RADIX_DAX_ENTRY_LOCK being in include/linux/dax.h so it can be used in
mm/filemap.c.  When we add PMD support, though, mm/filemap.c will also need
access to the RADIX_DAX_PTE type so it can properly construct a 4k sized
empty entry.

Instead of shifting the defines between dax.c and dax.h as they are
individually used in other code, just move them wholesale to dax.h so
they'll be available when we need them.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/dax.c            | 14 --------------
 include/linux/dax.h | 15 ++++++++++++++-
 2 files changed, 14 insertions(+), 15 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 5e8febe..ac3cd05 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -34,20 +34,6 @@
 #include <linux/iomap.h>
 #include "internal.h"
 
-/*
- * We use lowest available bit in exceptional entry for locking, other two
- * bits to determine entry type. In total 3 special bits.
- */
-#define RADIX_DAX_SHIFT	(RADIX_TREE_EXCEPTIONAL_SHIFT + 3)
-#define RADIX_DAX_PTE (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 1))
-#define RADIX_DAX_PMD (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 2))
-#define RADIX_DAX_TYPE_MASK (RADIX_DAX_PTE | RADIX_DAX_PMD)
-#define RADIX_DAX_TYPE(entry) ((unsigned long)entry & RADIX_DAX_TYPE_MASK)
-#define RADIX_DAX_SECTOR(entry) (((unsigned long)entry >> RADIX_DAX_SHIFT))
-#define RADIX_DAX_ENTRY(sector, pmd) ((void *)((unsigned long)sector << \
-		RADIX_DAX_SHIFT | (pmd ? RADIX_DAX_PMD : RADIX_DAX_PTE) | \
-		RADIX_TREE_EXCEPTIONAL_ENTRY))
-
 /* We choose 4096 entries - same as per-zone page wait tables */
 #define DAX_WAIT_TABLE_BITS 12
 #define DAX_WAIT_TABLE_ENTRIES (1 << DAX_WAIT_TABLE_BITS)
diff --git a/include/linux/dax.h b/include/linux/dax.h
index a3dfee4..e9ea78c 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -8,8 +8,21 @@
 
 struct iomap_ops;
 
-/* We use lowest available exceptional entry bit for locking */
+/*
+ * We use lowest available bit in exceptional entry for locking, other two
+ * bits to determine entry type. In total 3 special bits.
+ */
+#define RADIX_DAX_SHIFT	(RADIX_TREE_EXCEPTIONAL_SHIFT + 3)
 #define RADIX_DAX_ENTRY_LOCK (1 << RADIX_TREE_EXCEPTIONAL_SHIFT)
+#define RADIX_DAX_PTE (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 1))
+#define RADIX_DAX_PMD (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 2))
+#define RADIX_DAX_TYPE_MASK (RADIX_DAX_PTE | RADIX_DAX_PMD)
+#define RADIX_DAX_TYPE(entry) ((unsigned long)entry & RADIX_DAX_TYPE_MASK)
+#define RADIX_DAX_SECTOR(entry) (((unsigned long)entry >> RADIX_DAX_SHIFT))
+#define RADIX_DAX_ENTRY(sector, pmd) ((void *)((unsigned long)sector << \
+		RADIX_DAX_SHIFT | (pmd ? RADIX_DAX_PMD : RADIX_DAX_PTE) | \
+		RADIX_TREE_EXCEPTIONAL_ENTRY))
+
 
 ssize_t dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
 		struct iomap_ops *ops);
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH v5 15/17] dax: add struct iomap based DAX PMD support
  2016-10-07 21:08 ` Ross Zwisler
                     ` (2 preceding siblings ...)
  (?)
@ 2016-10-07 21:09   ` Ross Zwisler
  -1 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-07 21:09 UTC (permalink / raw)
  To: linux-kernel
  Cc: Theodore Ts'o, Matthew Wilcox, Dave Chinner, linux-nvdimm,
	Christoph Hellwig, linux-xfs, linux-mm, Andreas Dilger,
	Alexander Viro, Jan Kara, linux-fsdevel, linux-ext4,
	Andrew Morton

DAX PMDs have been disabled since Jan Kara introduced DAX radix tree based
locking.  This patch allows DAX PMDs to participate in the DAX radix tree
based locking scheme so that they can be re-enabled using the new struct
iomap based fault handlers.

There are currently three types of DAX 4k entries: 4k zero pages, 4k DAX
mappings that have an associated block allocation, and 4k DAX empty
entries.  The empty entries exist to provide locking for the duration of a
given page fault.

This patch adds three equivalent 2MiB DAX entries: Huge Zero Page (HZP)
entries, PMD DAX entries that have associated block allocations, and 2 MiB
DAX empty entries.

Unlike the 4k case where we insert a struct page* into the radix tree for
4k zero pages, for HZP we insert a DAX exceptional entry with the new
RADIX_DAX_HZP flag set.  This is because we use a single 2 MiB zero page in
every 2MiB hole mapping, and it doesn't make sense to have that same struct
page* with multiple entries in multiple trees.  This would cause contention
on the single page lock for the one Huge Zero Page, and it would break the
page->index and page->mapping associations that are assumed to be valid in
many other places in the kernel.

One difficult use case is when one thread is trying to use 4k entries in
radix tree for a given offset, and another thread is using 2 MiB entries
for that same offset.  The current code handles this by making the 2 MiB
user fall back to 4k entries for most cases.  This was done because it is
the simplest solution, and because the use of 2MiB pages is already
opportunistic.

If we were to try to upgrade from 4k pages to 2MiB pages for a given range,
we run into the problem of how we lock out 4k page faults for the entire
2MiB range while we clean out the radix tree so we can insert the 2MiB
entry.  We can solve this problem if we need to, but I think that the cases
where both 2MiB entries and 4K entries are being used for the same range
will be rare enough and the gain small enough that it probably won't be
worth the complexity.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/dax.c            | 325 ++++++++++++++++++++++++++++++++++++++++++++++------
 include/linux/dax.h |  57 +++++++--
 mm/filemap.c        |   4 +-
 3 files changed, 339 insertions(+), 47 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index ac3cd05..e51d51f 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -281,7 +281,7 @@ static wait_queue_head_t *dax_entry_waitqueue(struct address_space *mapping,
 	 * queue to the start of that PMD.  This ensures that all offsets in
 	 * the range covered by the PMD map to the same bit lock.
 	 */
-	if (RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD)
+	if ((unsigned long)entry & RADIX_DAX_PMD)
 		index &= ~((1UL << (PMD_SHIFT - PAGE_SHIFT)) - 1);
 
 	key->mapping = mapping;
@@ -388,35 +388,103 @@ static void *get_unlocked_mapping_entry(struct address_space *mapping,
  * radix tree entry locked. If the radix tree doesn't contain given index,
  * create empty exceptional entry for the index and return with it locked.
  *
+ * When requesting an entry with size RADIX_DAX_PMD, grab_mapping_entry() will
+ * either return that locked entry or will return an error.  This error will
+ * happen if there are any 4k entries (either zero pages or DAX entries)
+ * within the 2MiB range that we are requesting.
+ *
+ * We always favor 4k entries over 2MiB entries. There isn't a flow where we
+ * evict 4k entries in order to 'upgrade' them to a 2MiB entry.  A 2MiB
+ * insertion will fail if it finds any 4k entries already in the tree, and a
+ * 4k insertion will cause an existing 2MiB entry to be unmapped and
+ * downgraded to 4k entries.  This happens for both 2MiB huge zero pages as
+ * well as 2MiB empty entries.
+ *
+ * The exception to this downgrade path is for 2MiB DAX PMD entries that have
+ * real storage backing them.  We will leave these real 2MiB DAX entries in
+ * the tree, and PTE writes will simply dirty the entire 2MiB DAX entry.
+ *
  * Note: Unlike filemap_fault() we don't honor FAULT_FLAG_RETRY flags. For
  * persistent memory the benefit is doubtful. We can add that later if we can
  * show it helps.
  */
-static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index)
+static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index,
+		unsigned long size_flag)
 {
+	bool pmd_downgrade = false; /* splitting 2MiB entry into 4k entries? */
 	void *entry, **slot;
 
 restart:
 	spin_lock_irq(&mapping->tree_lock);
 	entry = get_unlocked_mapping_entry(mapping, index, &slot);
+
+	if (entry) {
+		if (size_flag & RADIX_DAX_PMD) {
+			if (!radix_tree_exceptional_entry(entry) ||
+			    !((unsigned long)entry & RADIX_DAX_PMD)) {
+				entry = ERR_PTR(-EEXIST);
+				goto out_unlock;
+			}
+		} else { /* trying to grab a PTE entry */
+			if (radix_tree_exceptional_entry(entry) &&
+			    ((unsigned long)entry & RADIX_DAX_PMD) &&
+			    ((unsigned long)entry &
+			     (RADIX_DAX_HZP|RADIX_DAX_EMPTY))) {
+				pmd_downgrade = true;
+			}
+		}
+	}
+
 	/* No entry for given index? Make sure radix tree is big enough. */
-	if (!entry) {
+	if (!entry || pmd_downgrade) {
 		int err;
 
+		if (pmd_downgrade) {
+			/*
+			 * Make sure 'entry' remains valid while we drop
+			 * mapping->tree_lock.
+			 */
+			entry = lock_slot(mapping, slot);
+		}
+
 		spin_unlock_irq(&mapping->tree_lock);
 		err = radix_tree_preload(
 				mapping_gfp_mask(mapping) & ~__GFP_HIGHMEM);
 		if (err)
 			return ERR_PTR(err);
-		entry = (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY |
-			       RADIX_DAX_ENTRY_LOCK);
+
+		/*
+		 * Besides huge zero pages the only other thing that gets
+		 * downgraded are empty entries which don't need to be
+		 * unmapped.
+		 */
+		if (pmd_downgrade && ((unsigned long)entry & RADIX_DAX_HZP))
+			unmap_mapping_range(mapping,
+				(index << PAGE_SHIFT) & PMD_MASK, PMD_SIZE, 0);
+
 		spin_lock_irq(&mapping->tree_lock);
-		err = radix_tree_insert(&mapping->page_tree, index, entry);
+
+		if (pmd_downgrade) {
+			radix_tree_delete(&mapping->page_tree, index);
+			mapping->nrexceptional--;
+			dax_wake_mapping_entry_waiter(mapping, index, entry,
+					false);
+		}
+
+		entry = dax_radix_entry(0, size_flag | RADIX_DAX_EMPTY);
+
+		err = __radix_tree_insert(&mapping->page_tree, index,
+				dax_radix_order(entry), entry);
 		radix_tree_preload_end();
 		if (err) {
 			spin_unlock_irq(&mapping->tree_lock);
-			/* Someone already created the entry? */
-			if (err == -EEXIST)
+			/*
+			 * Someone already created the entry?  This is a
+			 * normal failure when inserting PMDs in a range
+			 * that already contains PTEs.  In that case we want
+			 * to return -EEXIST immediately.
+			 */
+			if (err == -EEXIST && !(size_flag & RADIX_DAX_PMD))
 				goto restart;
 			return ERR_PTR(err);
 		}
@@ -441,6 +509,7 @@ restart:
 		return page;
 	}
 	entry = lock_slot(mapping, slot);
+ out_unlock:
 	spin_unlock_irq(&mapping->tree_lock);
 	return entry;
 }
@@ -581,11 +650,17 @@ static int copy_user_dax(struct block_device *bdev, sector_t sector, size_t size
 	return 0;
 }
 
-#define DAX_PMD_INDEX(page_index) (page_index & (PMD_MASK >> PAGE_SHIFT))
-
+/*
+ * By this point grab_mapping_entry() has ensured that we have a locked entry
+ * of the appropriate size so we don't have to worry about downgrading PMDs to
+ * PTEs.  If we happen to be trying to insert a PTE and there is a PMD
+ * already in the tree, we will skip the insertion and just dirty the PMD as
+ * appropriate.
+ */
 static void *dax_insert_mapping_entry(struct address_space *mapping,
 				      struct vm_fault *vmf,
-				      void *entry, sector_t sector)
+				      void *entry, sector_t sector,
+				      unsigned long flags)
 {
 	struct radix_tree_root *page_tree = &mapping->page_tree;
 	int error = 0;
@@ -608,22 +683,28 @@ static void *dax_insert_mapping_entry(struct address_space *mapping,
 		error = radix_tree_preload(vmf->gfp_mask & ~__GFP_HIGHMEM);
 		if (error)
 			return ERR_PTR(error);
+	} else if (((unsigned long)entry & RADIX_DAX_HZP) &&
+			!(flags & RADIX_DAX_HZP)) {
+		/* replacing huge zero page with PMD block mapping */
+		unmap_mapping_range(mapping,
+			(vmf->pgoff << PAGE_SHIFT) & PMD_MASK, PMD_SIZE, 0);
 	}
 
 	spin_lock_irq(&mapping->tree_lock);
-	new_entry = (void *)((unsigned long)RADIX_DAX_ENTRY(sector, false) |
-		       RADIX_DAX_ENTRY_LOCK);
+	new_entry = dax_radix_entry(sector, flags);
+
 	if (hole_fill) {
 		__delete_from_page_cache(entry, NULL);
 		/* Drop pagecache reference */
 		put_page(entry);
-		error = radix_tree_insert(page_tree, index, new_entry);
+		error = __radix_tree_insert(page_tree, index,
+				dax_radix_order(new_entry), new_entry);
 		if (error) {
 			new_entry = ERR_PTR(error);
 			goto unlock;
 		}
 		mapping->nrexceptional++;
-	} else {
+	} else if ((unsigned long)entry & (RADIX_DAX_HZP|RADIX_DAX_EMPTY)) {
 		void **slot;
 		void *ret;
 
@@ -653,7 +734,6 @@ static int dax_writeback_one(struct block_device *bdev,
 		struct address_space *mapping, pgoff_t index, void *entry)
 {
 	struct radix_tree_root *page_tree = &mapping->page_tree;
-	int type = RADIX_DAX_TYPE(entry);
 	struct radix_tree_node *node;
 	struct blk_dax_ctl dax;
 	void **slot;
@@ -674,13 +754,20 @@ static int dax_writeback_one(struct block_device *bdev,
 	if (!radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE))
 		goto unlock;
 
-	if (WARN_ON_ONCE(type != RADIX_DAX_PTE && type != RADIX_DAX_PMD)) {
+	if (WARN_ON_ONCE((unsigned long)entry & RADIX_DAX_EMPTY)) {
 		ret = -EIO;
 		goto unlock;
 	}
 
-	dax.sector = RADIX_DAX_SECTOR(entry);
-	dax.size = (type == RADIX_DAX_PMD ? PMD_SIZE : PAGE_SIZE);
+	/*
+	 * Even if dax_writeback_mapping_range() was given a wbc->range_start
+	 * in the middle of a PMD, the 'index' we are given will be aligned to
+	 * the start index of the PMD, as will the sector we pull from
+	 * 'entry'.  This allows us to flush for PMD_SIZE and not have to
+	 * worry about partial PMD writebacks.
+	 */
+	dax.sector = dax_radix_sector(entry);
+	dax.size = PAGE_SIZE << dax_radix_order(entry);
 	spin_unlock_irq(&mapping->tree_lock);
 
 	/*
@@ -719,12 +806,11 @@ int dax_writeback_mapping_range(struct address_space *mapping,
 		struct block_device *bdev, struct writeback_control *wbc)
 {
 	struct inode *inode = mapping->host;
-	pgoff_t start_index, end_index, pmd_index;
+	pgoff_t start_index, end_index;
 	pgoff_t indices[PAGEVEC_SIZE];
 	struct pagevec pvec;
 	bool done = false;
 	int i, ret = 0;
-	void *entry;
 
 	if (WARN_ON_ONCE(inode->i_blkbits != PAGE_SHIFT))
 		return -EIO;
@@ -734,15 +820,6 @@ int dax_writeback_mapping_range(struct address_space *mapping,
 
 	start_index = wbc->range_start >> PAGE_SHIFT;
 	end_index = wbc->range_end >> PAGE_SHIFT;
-	pmd_index = DAX_PMD_INDEX(start_index);
-
-	rcu_read_lock();
-	entry = radix_tree_lookup(&mapping->page_tree, pmd_index);
-	rcu_read_unlock();
-
-	/* see if the start of our range is covered by a PMD entry */
-	if (entry && RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD)
-		start_index = pmd_index;
 
 	tag_pages_for_writeback(mapping, start_index, end_index);
 
@@ -787,7 +864,7 @@ static int dax_insert_mapping(struct address_space *mapping,
 		return PTR_ERR(dax.addr);
 	dax_unmap_atomic(bdev, &dax);
 
-	ret = dax_insert_mapping_entry(mapping, vmf, entry, dax.sector);
+	ret = dax_insert_mapping_entry(mapping, vmf, entry, dax.sector, 0);
 	if (IS_ERR(ret))
 		return PTR_ERR(ret);
 	*entryp = ret;
@@ -834,7 +911,7 @@ int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 	bh.b_bdev = inode->i_sb->s_bdev;
 	bh.b_size = PAGE_SIZE;
 
-	entry = grab_mapping_entry(mapping, vmf->pgoff);
+	entry = grab_mapping_entry(mapping, vmf->pgoff, 0);
 	if (IS_ERR(entry)) {
 		error = PTR_ERR(entry);
 		goto out;
@@ -1154,7 +1231,7 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 	if (pos >= i_size_read(inode))
 		return VM_FAULT_SIGBUS;
 
-	entry = grab_mapping_entry(mapping, vmf->pgoff);
+	entry = grab_mapping_entry(mapping, vmf->pgoff, 0);
 	if (IS_ERR(entry)) {
 		error = PTR_ERR(entry);
 		goto out;
@@ -1261,4 +1338,186 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 	return VM_FAULT_NOPAGE | major;
 }
 EXPORT_SYMBOL_GPL(dax_iomap_fault);
+
+#ifdef CONFIG_FS_DAX_PMD
+/*
+ * The 'colour' (ie low bits) within a PMD of a page offset.  This comes up
+ * more often than one might expect in the below functions.
+ */
+#define PG_PMD_COLOUR	((PMD_SIZE >> PAGE_SHIFT) - 1)
+
+static int dax_pmd_insert_mapping(struct vm_area_struct *vma, pmd_t *pmd,
+		struct vm_fault *vmf, unsigned long address,
+		struct iomap *iomap, loff_t pos, bool write, void **entryp)
+{
+	struct address_space *mapping = vma->vm_file->f_mapping;
+	struct block_device *bdev = iomap->bdev;
+	struct blk_dax_ctl dax = {
+		.sector = dax_iomap_sector(iomap, pos),
+		.size = PMD_SIZE,
+	};
+	long length = dax_map_atomic(bdev, &dax);
+	void *ret;
+
+	if (length < 0) /* dax_map_atomic() failed */
+		return VM_FAULT_FALLBACK;
+	if (length < PMD_SIZE)
+		goto unmap_fallback;
+	if (pfn_t_to_pfn(dax.pfn) & PG_PMD_COLOUR)
+		goto unmap_fallback;
+	if (!pfn_t_devmap(dax.pfn))
+		goto unmap_fallback;
+
+	dax_unmap_atomic(bdev, &dax);
+
+	ret = dax_insert_mapping_entry(mapping, vmf, *entryp, dax.sector,
+			RADIX_DAX_PMD);
+	if (IS_ERR(ret))
+		return VM_FAULT_FALLBACK;
+	*entryp = ret;
+
+	return vmf_insert_pfn_pmd(vma, address, pmd, dax.pfn, write);
+
+ unmap_fallback:
+	dax_unmap_atomic(bdev, &dax);
+	return VM_FAULT_FALLBACK;
+}
+
+static int dax_pmd_load_hole(struct vm_area_struct *vma, pmd_t *pmd,
+		struct vm_fault *vmf, unsigned long address,
+		struct iomap *iomap, void **entryp)
+{
+	struct address_space *mapping = vma->vm_file->f_mapping;
+	unsigned long pmd_addr = address & PMD_MASK;
+	struct page *zero_page;
+	spinlock_t *ptl;
+	pmd_t pmd_entry;
+	void *ret;
+
+	zero_page = get_huge_zero_page();
+
+	if (unlikely(!zero_page))
+		return VM_FAULT_FALLBACK;
+
+	ret = dax_insert_mapping_entry(mapping, vmf, *entryp, 0,
+			RADIX_DAX_PMD | RADIX_DAX_HZP);
+	if (IS_ERR(ret))
+		return VM_FAULT_FALLBACK;
+	*entryp = ret;
+
+	ptl = pmd_lock(vma->vm_mm, pmd);
+	if (!pmd_none(*pmd)) {
+		spin_unlock(ptl);
+		return VM_FAULT_FALLBACK;
+	}
+
+	pmd_entry = mk_pmd(zero_page, vma->vm_page_prot);
+	pmd_entry = pmd_mkhuge(pmd_entry);
+	set_pmd_at(vma->vm_mm, pmd_addr, pmd, pmd_entry);
+	spin_unlock(ptl);
+	return VM_FAULT_NOPAGE;
+}
+
+int dax_iomap_pmd_fault(struct vm_area_struct *vma, unsigned long address,
+		pmd_t *pmd, unsigned int flags, struct iomap_ops *ops)
+{
+	struct address_space *mapping = vma->vm_file->f_mapping;
+	unsigned long pmd_addr = address & PMD_MASK;
+	bool write = flags & FAULT_FLAG_WRITE;
+	unsigned int iomap_flags = write ? IOMAP_WRITE : 0;
+	struct inode *inode = mapping->host;
+	int result = VM_FAULT_FALLBACK;
+	struct iomap iomap = { 0 };
+	pgoff_t size, pgoff;
+	struct vm_fault vmf;
+	void *entry;
+	loff_t pos;
+	int error;
+
+	/* Fall back to PTEs if we're going to COW */
+	if (write && !(vma->vm_flags & VM_SHARED)) {
+		split_huge_pmd(vma, pmd, address);
+		goto fallback;
+	}
+
+	/* If the PMD would extend outside the VMA */
+	if (pmd_addr < vma->vm_start)
+		goto fallback;
+	if ((pmd_addr + PMD_SIZE) > vma->vm_end)
+		goto fallback;
+
+	/*
+	 * Check whether offset isn't beyond end of file now. Caller is
+	 * supposed to hold locks serializing us with truncate / punch hole so
+	 * this is a reliable test.
+	 */
+	pgoff = linear_page_index(vma, pmd_addr);
+	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
+
+	if (pgoff >= size)
+		return VM_FAULT_SIGBUS;
+
+	/* If the PMD would extend beyond the file size */
+	if ((pgoff | PG_PMD_COLOUR) >= size)
+		goto fallback;
+
+	/*
+	 * grab_mapping_entry() will make sure we get a 2M empty entry, a DAX
+	 * PMD or a HZP entry.  If it can't (because a 4k page is already in
+	 * the tree, for instance), it will return -EEXIST and we just fall
+	 * back to 4k entries.
+	 */
+	entry = grab_mapping_entry(mapping, pgoff, RADIX_DAX_PMD);
+	if (IS_ERR(entry))
+		goto fallback;
+
+	/*
+	 * Note that we don't use iomap_apply here.  We aren't doing I/O, only
+	 * setting up a mapping, so really we're using iomap_begin() as a way
+	 * to look up our filesystem block.
+	 */
+	pos = (loff_t)pgoff << PAGE_SHIFT;
+	error = ops->iomap_begin(inode, pos, PMD_SIZE, iomap_flags, &iomap);
+	if (error)
+		goto unlock_entry;
+	if (iomap.offset + iomap.length < pos + PMD_SIZE)
+		goto finish_iomap;
+
+	vmf.pgoff = pgoff;
+	vmf.flags = flags;
+	vmf.gfp_mask = mapping_gfp_mask(mapping) | __GFP_IO;
+
+	switch (iomap.type) {
+	case IOMAP_MAPPED:
+		result = dax_pmd_insert_mapping(vma, pmd, &vmf, address,
+				&iomap, pos, write, &entry);
+		break;
+	case IOMAP_UNWRITTEN:
+	case IOMAP_HOLE:
+		if (WARN_ON_ONCE(write))
+			goto finish_iomap;
+		result = dax_pmd_load_hole(vma, pmd, &vmf, address, &iomap,
+				&entry);
+		break;
+	default:
+		WARN_ON_ONCE(1);
+		break;
+	}
+
+ finish_iomap:
+	if (ops->iomap_end) {
+		error = ops->iomap_end(inode, pos, PMD_SIZE, PMD_SIZE,
+				iomap_flags, &iomap);
+		if (error)
+			result = VM_FAULT_FALLBACK;
+	}
+ unlock_entry:
+	put_locked_mapping_entry(mapping, pgoff, entry);
+ fallback:
+	if (result == VM_FAULT_FALLBACK)
+		count_vm_event(THP_FAULT_FALLBACK);
+	return result;
+}
+EXPORT_SYMBOL_GPL(dax_iomap_pmd_fault);
+#endif /* CONFIG_FS_DAX_PMD */
 #endif /* CONFIG_FS_IOMAP */
diff --git a/include/linux/dax.h b/include/linux/dax.h
index e9ea78c..5d70367 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -9,19 +9,33 @@
 struct iomap_ops;
 
 /*
- * We use lowest available bit in exceptional entry for locking, other two
- * bits to determine entry type. In total 3 special bits.
+ * We use lowest available bit in exceptional entry for locking, one bit for
+ * the entry size (PMD) and two more to tell us if the entry is a huge zero
+ * page (HZP) or an empty entry that is just used for locking.  In total four
+ * special bits.
+ *
+ * If the PMD bit isn't set the entry has size PAGE_SIZE, and if the HZP and
+ * EMPTY bits aren't set the entry is a normal DAX entry with a filesystem
+ * block allocation.
  */
-#define RADIX_DAX_SHIFT	(RADIX_TREE_EXCEPTIONAL_SHIFT + 3)
+#define RADIX_DAX_SHIFT	(RADIX_TREE_EXCEPTIONAL_SHIFT + 4)
 #define RADIX_DAX_ENTRY_LOCK (1 << RADIX_TREE_EXCEPTIONAL_SHIFT)
-#define RADIX_DAX_PTE (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 1))
-#define RADIX_DAX_PMD (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 2))
-#define RADIX_DAX_TYPE_MASK (RADIX_DAX_PTE | RADIX_DAX_PMD)
-#define RADIX_DAX_TYPE(entry) ((unsigned long)entry & RADIX_DAX_TYPE_MASK)
-#define RADIX_DAX_SECTOR(entry) (((unsigned long)entry >> RADIX_DAX_SHIFT))
-#define RADIX_DAX_ENTRY(sector, pmd) ((void *)((unsigned long)sector << \
-		RADIX_DAX_SHIFT | (pmd ? RADIX_DAX_PMD : RADIX_DAX_PTE) | \
-		RADIX_TREE_EXCEPTIONAL_ENTRY))
+#define RADIX_DAX_PMD (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 1))
+#define RADIX_DAX_HZP (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 2))
+#define RADIX_DAX_EMPTY (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 3))
+
+static inline unsigned long dax_radix_sector(void *entry)
+{
+	return (unsigned long)entry >> RADIX_DAX_SHIFT;
+}
+
+/* entries begin locked */
+static inline void *dax_radix_entry(sector_t sector, unsigned long flags)
+{
+	return (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY | flags |
+			((unsigned long)sector << RADIX_DAX_SHIFT) |
+			RADIX_DAX_ENTRY_LOCK);
+}
 
 
 ssize_t dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
@@ -67,6 +81,27 @@ static inline int dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
 	return VM_FAULT_FALLBACK;
 }
 
+#ifdef CONFIG_FS_DAX_PMD
+static inline unsigned int dax_radix_order(void *entry)
+{
+	if ((unsigned long)entry & RADIX_DAX_PMD)
+		return PMD_SHIFT - PAGE_SHIFT;
+	return 0;
+}
+int dax_iomap_pmd_fault(struct vm_area_struct *vma, unsigned long address,
+		pmd_t *pmd, unsigned int flags, struct iomap_ops *ops);
+#else
+static inline unsigned int dax_radix_order(void *entry)
+{
+	return 0;
+}
+static inline int dax_iomap_pmd_fault(struct vm_area_struct *vma,
+		unsigned long address, pmd_t *pmd, unsigned int flags,
+		struct iomap_ops *ops)
+{
+	return VM_FAULT_FALLBACK;
+}
+#endif
 int dax_pfn_mkwrite(struct vm_area_struct *, struct vm_fault *);
 #define dax_mkwrite(vma, vmf, gb)	dax_fault(vma, vmf, gb)
 
diff --git a/mm/filemap.c b/mm/filemap.c
index a596462..4ed2df5 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -610,9 +610,7 @@ static int page_cache_tree_insert(struct address_space *mapping,
 				workingset_node_shadows_dec(node);
 		} else {
 			/* DAX can replace empty locked entry with a hole */
-			WARN_ON_ONCE(p !=
-				(void *)(RADIX_TREE_EXCEPTIONAL_ENTRY |
-					 RADIX_DAX_ENTRY_LOCK));
+			WARN_ON_ONCE(p != dax_radix_entry(0, RADIX_DAX_EMPTY));
 			/* DAX accounts exceptional entries as normal pages */
 			if (node)
 				workingset_node_pages_dec(node);
-- 
2.7.4

_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH v5 15/17] dax: add struct iomap based DAX PMD support
@ 2016-10-07 21:09   ` Ross Zwisler
  0 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-07 21:09 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

DAX PMDs have been disabled since Jan Kara introduced DAX radix tree based
locking.  This patch allows DAX PMDs to participate in the DAX radix tree
based locking scheme so that they can be re-enabled using the new struct
iomap based fault handlers.

There are currently three types of DAX 4k entries: 4k zero pages, 4k DAX
mappings that have an associated block allocation, and 4k DAX empty
entries.  The empty entries exist to provide locking for the duration of a
given page fault.

This patch adds three equivalent 2MiB DAX entries: Huge Zero Page (HZP)
entries, PMD DAX entries that have associated block allocations, and 2 MiB
DAX empty entries.

Unlike the 4k case where we insert a struct page* into the radix tree for
4k zero pages, for HZP we insert a DAX exceptional entry with the new
RADIX_DAX_HZP flag set.  This is because we use a single 2 MiB zero page in
every 2MiB hole mapping, and it doesn't make sense to have that same struct
page* with multiple entries in multiple trees.  This would cause contention
on the single page lock for the one Huge Zero Page, and it would break the
page->index and page->mapping associations that are assumed to be valid in
many other places in the kernel.

One difficult use case is when one thread is trying to use 4k entries in
radix tree for a given offset, and another thread is using 2 MiB entries
for that same offset.  The current code handles this by making the 2 MiB
user fall back to 4k entries for most cases.  This was done because it is
the simplest solution, and because the use of 2MiB pages is already
opportunistic.

If we were to try to upgrade from 4k pages to 2MiB pages for a given range,
we run into the problem of how we lock out 4k page faults for the entire
2MiB range while we clean out the radix tree so we can insert the 2MiB
entry.  We can solve this problem if we need to, but I think that the cases
where both 2MiB entries and 4K entries are being used for the same range
will be rare enough and the gain small enough that it probably won't be
worth the complexity.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/dax.c            | 325 ++++++++++++++++++++++++++++++++++++++++++++++------
 include/linux/dax.h |  57 +++++++--
 mm/filemap.c        |   4 +-
 3 files changed, 339 insertions(+), 47 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index ac3cd05..e51d51f 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -281,7 +281,7 @@ static wait_queue_head_t *dax_entry_waitqueue(struct address_space *mapping,
 	 * queue to the start of that PMD.  This ensures that all offsets in
 	 * the range covered by the PMD map to the same bit lock.
 	 */
-	if (RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD)
+	if ((unsigned long)entry & RADIX_DAX_PMD)
 		index &= ~((1UL << (PMD_SHIFT - PAGE_SHIFT)) - 1);
 
 	key->mapping = mapping;
@@ -388,35 +388,103 @@ static void *get_unlocked_mapping_entry(struct address_space *mapping,
  * radix tree entry locked. If the radix tree doesn't contain given index,
  * create empty exceptional entry for the index and return with it locked.
  *
+ * When requesting an entry with size RADIX_DAX_PMD, grab_mapping_entry() will
+ * either return that locked entry or will return an error.  This error will
+ * happen if there are any 4k entries (either zero pages or DAX entries)
+ * within the 2MiB range that we are requesting.
+ *
+ * We always favor 4k entries over 2MiB entries. There isn't a flow where we
+ * evict 4k entries in order to 'upgrade' them to a 2MiB entry.  A 2MiB
+ * insertion will fail if it finds any 4k entries already in the tree, and a
+ * 4k insertion will cause an existing 2MiB entry to be unmapped and
+ * downgraded to 4k entries.  This happens for both 2MiB huge zero pages as
+ * well as 2MiB empty entries.
+ *
+ * The exception to this downgrade path is for 2MiB DAX PMD entries that have
+ * real storage backing them.  We will leave these real 2MiB DAX entries in
+ * the tree, and PTE writes will simply dirty the entire 2MiB DAX entry.
+ *
  * Note: Unlike filemap_fault() we don't honor FAULT_FLAG_RETRY flags. For
  * persistent memory the benefit is doubtful. We can add that later if we can
  * show it helps.
  */
-static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index)
+static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index,
+		unsigned long size_flag)
 {
+	bool pmd_downgrade = false; /* splitting 2MiB entry into 4k entries? */
 	void *entry, **slot;
 
 restart:
 	spin_lock_irq(&mapping->tree_lock);
 	entry = get_unlocked_mapping_entry(mapping, index, &slot);
+
+	if (entry) {
+		if (size_flag & RADIX_DAX_PMD) {
+			if (!radix_tree_exceptional_entry(entry) ||
+			    !((unsigned long)entry & RADIX_DAX_PMD)) {
+				entry = ERR_PTR(-EEXIST);
+				goto out_unlock;
+			}
+		} else { /* trying to grab a PTE entry */
+			if (radix_tree_exceptional_entry(entry) &&
+			    ((unsigned long)entry & RADIX_DAX_PMD) &&
+			    ((unsigned long)entry &
+			     (RADIX_DAX_HZP|RADIX_DAX_EMPTY))) {
+				pmd_downgrade = true;
+			}
+		}
+	}
+
 	/* No entry for given index? Make sure radix tree is big enough. */
-	if (!entry) {
+	if (!entry || pmd_downgrade) {
 		int err;
 
+		if (pmd_downgrade) {
+			/*
+			 * Make sure 'entry' remains valid while we drop
+			 * mapping->tree_lock.
+			 */
+			entry = lock_slot(mapping, slot);
+		}
+
 		spin_unlock_irq(&mapping->tree_lock);
 		err = radix_tree_preload(
 				mapping_gfp_mask(mapping) & ~__GFP_HIGHMEM);
 		if (err)
 			return ERR_PTR(err);
-		entry = (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY |
-			       RADIX_DAX_ENTRY_LOCK);
+
+		/*
+		 * Besides huge zero pages the only other thing that gets
+		 * downgraded are empty entries which don't need to be
+		 * unmapped.
+		 */
+		if (pmd_downgrade && ((unsigned long)entry & RADIX_DAX_HZP))
+			unmap_mapping_range(mapping,
+				(index << PAGE_SHIFT) & PMD_MASK, PMD_SIZE, 0);
+
 		spin_lock_irq(&mapping->tree_lock);
-		err = radix_tree_insert(&mapping->page_tree, index, entry);
+
+		if (pmd_downgrade) {
+			radix_tree_delete(&mapping->page_tree, index);
+			mapping->nrexceptional--;
+			dax_wake_mapping_entry_waiter(mapping, index, entry,
+					false);
+		}
+
+		entry = dax_radix_entry(0, size_flag | RADIX_DAX_EMPTY);
+
+		err = __radix_tree_insert(&mapping->page_tree, index,
+				dax_radix_order(entry), entry);
 		radix_tree_preload_end();
 		if (err) {
 			spin_unlock_irq(&mapping->tree_lock);
-			/* Someone already created the entry? */
-			if (err == -EEXIST)
+			/*
+			 * Someone already created the entry?  This is a
+			 * normal failure when inserting PMDs in a range
+			 * that already contains PTEs.  In that case we want
+			 * to return -EEXIST immediately.
+			 */
+			if (err == -EEXIST && !(size_flag & RADIX_DAX_PMD))
 				goto restart;
 			return ERR_PTR(err);
 		}
@@ -441,6 +509,7 @@ restart:
 		return page;
 	}
 	entry = lock_slot(mapping, slot);
+ out_unlock:
 	spin_unlock_irq(&mapping->tree_lock);
 	return entry;
 }
@@ -581,11 +650,17 @@ static int copy_user_dax(struct block_device *bdev, sector_t sector, size_t size
 	return 0;
 }
 
-#define DAX_PMD_INDEX(page_index) (page_index & (PMD_MASK >> PAGE_SHIFT))
-
+/*
+ * By this point grab_mapping_entry() has ensured that we have a locked entry
+ * of the appropriate size so we don't have to worry about downgrading PMDs to
+ * PTEs.  If we happen to be trying to insert a PTE and there is a PMD
+ * already in the tree, we will skip the insertion and just dirty the PMD as
+ * appropriate.
+ */
 static void *dax_insert_mapping_entry(struct address_space *mapping,
 				      struct vm_fault *vmf,
-				      void *entry, sector_t sector)
+				      void *entry, sector_t sector,
+				      unsigned long flags)
 {
 	struct radix_tree_root *page_tree = &mapping->page_tree;
 	int error = 0;
@@ -608,22 +683,28 @@ static void *dax_insert_mapping_entry(struct address_space *mapping,
 		error = radix_tree_preload(vmf->gfp_mask & ~__GFP_HIGHMEM);
 		if (error)
 			return ERR_PTR(error);
+	} else if (((unsigned long)entry & RADIX_DAX_HZP) &&
+			!(flags & RADIX_DAX_HZP)) {
+		/* replacing huge zero page with PMD block mapping */
+		unmap_mapping_range(mapping,
+			(vmf->pgoff << PAGE_SHIFT) & PMD_MASK, PMD_SIZE, 0);
 	}
 
 	spin_lock_irq(&mapping->tree_lock);
-	new_entry = (void *)((unsigned long)RADIX_DAX_ENTRY(sector, false) |
-		       RADIX_DAX_ENTRY_LOCK);
+	new_entry = dax_radix_entry(sector, flags);
+
 	if (hole_fill) {
 		__delete_from_page_cache(entry, NULL);
 		/* Drop pagecache reference */
 		put_page(entry);
-		error = radix_tree_insert(page_tree, index, new_entry);
+		error = __radix_tree_insert(page_tree, index,
+				dax_radix_order(new_entry), new_entry);
 		if (error) {
 			new_entry = ERR_PTR(error);
 			goto unlock;
 		}
 		mapping->nrexceptional++;
-	} else {
+	} else if ((unsigned long)entry & (RADIX_DAX_HZP|RADIX_DAX_EMPTY)) {
 		void **slot;
 		void *ret;
 
@@ -653,7 +734,6 @@ static int dax_writeback_one(struct block_device *bdev,
 		struct address_space *mapping, pgoff_t index, void *entry)
 {
 	struct radix_tree_root *page_tree = &mapping->page_tree;
-	int type = RADIX_DAX_TYPE(entry);
 	struct radix_tree_node *node;
 	struct blk_dax_ctl dax;
 	void **slot;
@@ -674,13 +754,20 @@ static int dax_writeback_one(struct block_device *bdev,
 	if (!radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE))
 		goto unlock;
 
-	if (WARN_ON_ONCE(type != RADIX_DAX_PTE && type != RADIX_DAX_PMD)) {
+	if (WARN_ON_ONCE((unsigned long)entry & RADIX_DAX_EMPTY)) {
 		ret = -EIO;
 		goto unlock;
 	}
 
-	dax.sector = RADIX_DAX_SECTOR(entry);
-	dax.size = (type == RADIX_DAX_PMD ? PMD_SIZE : PAGE_SIZE);
+	/*
+	 * Even if dax_writeback_mapping_range() was given a wbc->range_start
+	 * in the middle of a PMD, the 'index' we are given will be aligned to
+	 * the start index of the PMD, as will the sector we pull from
+	 * 'entry'.  This allows us to flush for PMD_SIZE and not have to
+	 * worry about partial PMD writebacks.
+	 */
+	dax.sector = dax_radix_sector(entry);
+	dax.size = PAGE_SIZE << dax_radix_order(entry);
 	spin_unlock_irq(&mapping->tree_lock);
 
 	/*
@@ -719,12 +806,11 @@ int dax_writeback_mapping_range(struct address_space *mapping,
 		struct block_device *bdev, struct writeback_control *wbc)
 {
 	struct inode *inode = mapping->host;
-	pgoff_t start_index, end_index, pmd_index;
+	pgoff_t start_index, end_index;
 	pgoff_t indices[PAGEVEC_SIZE];
 	struct pagevec pvec;
 	bool done = false;
 	int i, ret = 0;
-	void *entry;
 
 	if (WARN_ON_ONCE(inode->i_blkbits != PAGE_SHIFT))
 		return -EIO;
@@ -734,15 +820,6 @@ int dax_writeback_mapping_range(struct address_space *mapping,
 
 	start_index = wbc->range_start >> PAGE_SHIFT;
 	end_index = wbc->range_end >> PAGE_SHIFT;
-	pmd_index = DAX_PMD_INDEX(start_index);
-
-	rcu_read_lock();
-	entry = radix_tree_lookup(&mapping->page_tree, pmd_index);
-	rcu_read_unlock();
-
-	/* see if the start of our range is covered by a PMD entry */
-	if (entry && RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD)
-		start_index = pmd_index;
 
 	tag_pages_for_writeback(mapping, start_index, end_index);
 
@@ -787,7 +864,7 @@ static int dax_insert_mapping(struct address_space *mapping,
 		return PTR_ERR(dax.addr);
 	dax_unmap_atomic(bdev, &dax);
 
-	ret = dax_insert_mapping_entry(mapping, vmf, entry, dax.sector);
+	ret = dax_insert_mapping_entry(mapping, vmf, entry, dax.sector, 0);
 	if (IS_ERR(ret))
 		return PTR_ERR(ret);
 	*entryp = ret;
@@ -834,7 +911,7 @@ int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 	bh.b_bdev = inode->i_sb->s_bdev;
 	bh.b_size = PAGE_SIZE;
 
-	entry = grab_mapping_entry(mapping, vmf->pgoff);
+	entry = grab_mapping_entry(mapping, vmf->pgoff, 0);
 	if (IS_ERR(entry)) {
 		error = PTR_ERR(entry);
 		goto out;
@@ -1154,7 +1231,7 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 	if (pos >= i_size_read(inode))
 		return VM_FAULT_SIGBUS;
 
-	entry = grab_mapping_entry(mapping, vmf->pgoff);
+	entry = grab_mapping_entry(mapping, vmf->pgoff, 0);
 	if (IS_ERR(entry)) {
 		error = PTR_ERR(entry);
 		goto out;
@@ -1261,4 +1338,186 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 	return VM_FAULT_NOPAGE | major;
 }
 EXPORT_SYMBOL_GPL(dax_iomap_fault);
+
+#ifdef CONFIG_FS_DAX_PMD
+/*
+ * The 'colour' (ie low bits) within a PMD of a page offset.  This comes up
+ * more often than one might expect in the below functions.
+ */
+#define PG_PMD_COLOUR	((PMD_SIZE >> PAGE_SHIFT) - 1)
+
+static int dax_pmd_insert_mapping(struct vm_area_struct *vma, pmd_t *pmd,
+		struct vm_fault *vmf, unsigned long address,
+		struct iomap *iomap, loff_t pos, bool write, void **entryp)
+{
+	struct address_space *mapping = vma->vm_file->f_mapping;
+	struct block_device *bdev = iomap->bdev;
+	struct blk_dax_ctl dax = {
+		.sector = dax_iomap_sector(iomap, pos),
+		.size = PMD_SIZE,
+	};
+	long length = dax_map_atomic(bdev, &dax);
+	void *ret;
+
+	if (length < 0) /* dax_map_atomic() failed */
+		return VM_FAULT_FALLBACK;
+	if (length < PMD_SIZE)
+		goto unmap_fallback;
+	if (pfn_t_to_pfn(dax.pfn) & PG_PMD_COLOUR)
+		goto unmap_fallback;
+	if (!pfn_t_devmap(dax.pfn))
+		goto unmap_fallback;
+
+	dax_unmap_atomic(bdev, &dax);
+
+	ret = dax_insert_mapping_entry(mapping, vmf, *entryp, dax.sector,
+			RADIX_DAX_PMD);
+	if (IS_ERR(ret))
+		return VM_FAULT_FALLBACK;
+	*entryp = ret;
+
+	return vmf_insert_pfn_pmd(vma, address, pmd, dax.pfn, write);
+
+ unmap_fallback:
+	dax_unmap_atomic(bdev, &dax);
+	return VM_FAULT_FALLBACK;
+}
+
+static int dax_pmd_load_hole(struct vm_area_struct *vma, pmd_t *pmd,
+		struct vm_fault *vmf, unsigned long address,
+		struct iomap *iomap, void **entryp)
+{
+	struct address_space *mapping = vma->vm_file->f_mapping;
+	unsigned long pmd_addr = address & PMD_MASK;
+	struct page *zero_page;
+	spinlock_t *ptl;
+	pmd_t pmd_entry;
+	void *ret;
+
+	zero_page = get_huge_zero_page();
+
+	if (unlikely(!zero_page))
+		return VM_FAULT_FALLBACK;
+
+	ret = dax_insert_mapping_entry(mapping, vmf, *entryp, 0,
+			RADIX_DAX_PMD | RADIX_DAX_HZP);
+	if (IS_ERR(ret))
+		return VM_FAULT_FALLBACK;
+	*entryp = ret;
+
+	ptl = pmd_lock(vma->vm_mm, pmd);
+	if (!pmd_none(*pmd)) {
+		spin_unlock(ptl);
+		return VM_FAULT_FALLBACK;
+	}
+
+	pmd_entry = mk_pmd(zero_page, vma->vm_page_prot);
+	pmd_entry = pmd_mkhuge(pmd_entry);
+	set_pmd_at(vma->vm_mm, pmd_addr, pmd, pmd_entry);
+	spin_unlock(ptl);
+	return VM_FAULT_NOPAGE;
+}
+
+int dax_iomap_pmd_fault(struct vm_area_struct *vma, unsigned long address,
+		pmd_t *pmd, unsigned int flags, struct iomap_ops *ops)
+{
+	struct address_space *mapping = vma->vm_file->f_mapping;
+	unsigned long pmd_addr = address & PMD_MASK;
+	bool write = flags & FAULT_FLAG_WRITE;
+	unsigned int iomap_flags = write ? IOMAP_WRITE : 0;
+	struct inode *inode = mapping->host;
+	int result = VM_FAULT_FALLBACK;
+	struct iomap iomap = { 0 };
+	pgoff_t size, pgoff;
+	struct vm_fault vmf;
+	void *entry;
+	loff_t pos;
+	int error;
+
+	/* Fall back to PTEs if we're going to COW */
+	if (write && !(vma->vm_flags & VM_SHARED)) {
+		split_huge_pmd(vma, pmd, address);
+		goto fallback;
+	}
+
+	/* If the PMD would extend outside the VMA */
+	if (pmd_addr < vma->vm_start)
+		goto fallback;
+	if ((pmd_addr + PMD_SIZE) > vma->vm_end)
+		goto fallback;
+
+	/*
+	 * Check whether offset isn't beyond end of file now. Caller is
+	 * supposed to hold locks serializing us with truncate / punch hole so
+	 * this is a reliable test.
+	 */
+	pgoff = linear_page_index(vma, pmd_addr);
+	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
+
+	if (pgoff >= size)
+		return VM_FAULT_SIGBUS;
+
+	/* If the PMD would extend beyond the file size */
+	if ((pgoff | PG_PMD_COLOUR) >= size)
+		goto fallback;
+
+	/*
+	 * grab_mapping_entry() will make sure we get a 2M empty entry, a DAX
+	 * PMD or a HZP entry.  If it can't (because a 4k page is already in
+	 * the tree, for instance), it will return -EEXIST and we just fall
+	 * back to 4k entries.
+	 */
+	entry = grab_mapping_entry(mapping, pgoff, RADIX_DAX_PMD);
+	if (IS_ERR(entry))
+		goto fallback;
+
+	/*
+	 * Note that we don't use iomap_apply here.  We aren't doing I/O, only
+	 * setting up a mapping, so really we're using iomap_begin() as a way
+	 * to look up our filesystem block.
+	 */
+	pos = (loff_t)pgoff << PAGE_SHIFT;
+	error = ops->iomap_begin(inode, pos, PMD_SIZE, iomap_flags, &iomap);
+	if (error)
+		goto unlock_entry;
+	if (iomap.offset + iomap.length < pos + PMD_SIZE)
+		goto finish_iomap;
+
+	vmf.pgoff = pgoff;
+	vmf.flags = flags;
+	vmf.gfp_mask = mapping_gfp_mask(mapping) | __GFP_IO;
+
+	switch (iomap.type) {
+	case IOMAP_MAPPED:
+		result = dax_pmd_insert_mapping(vma, pmd, &vmf, address,
+				&iomap, pos, write, &entry);
+		break;
+	case IOMAP_UNWRITTEN:
+	case IOMAP_HOLE:
+		if (WARN_ON_ONCE(write))
+			goto finish_iomap;
+		result = dax_pmd_load_hole(vma, pmd, &vmf, address, &iomap,
+				&entry);
+		break;
+	default:
+		WARN_ON_ONCE(1);
+		break;
+	}
+
+ finish_iomap:
+	if (ops->iomap_end) {
+		error = ops->iomap_end(inode, pos, PMD_SIZE, PMD_SIZE,
+				iomap_flags, &iomap);
+		if (error)
+			result = VM_FAULT_FALLBACK;
+	}
+ unlock_entry:
+	put_locked_mapping_entry(mapping, pgoff, entry);
+ fallback:
+	if (result == VM_FAULT_FALLBACK)
+		count_vm_event(THP_FAULT_FALLBACK);
+	return result;
+}
+EXPORT_SYMBOL_GPL(dax_iomap_pmd_fault);
+#endif /* CONFIG_FS_DAX_PMD */
 #endif /* CONFIG_FS_IOMAP */
diff --git a/include/linux/dax.h b/include/linux/dax.h
index e9ea78c..5d70367 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -9,19 +9,33 @@
 struct iomap_ops;
 
 /*
- * We use lowest available bit in exceptional entry for locking, other two
- * bits to determine entry type. In total 3 special bits.
+ * We use lowest available bit in exceptional entry for locking, one bit for
+ * the entry size (PMD) and two more to tell us if the entry is a huge zero
+ * page (HZP) or an empty entry that is just used for locking.  In total four
+ * special bits.
+ *
+ * If the PMD bit isn't set the entry has size PAGE_SIZE, and if the HZP and
+ * EMPTY bits aren't set the entry is a normal DAX entry with a filesystem
+ * block allocation.
  */
-#define RADIX_DAX_SHIFT	(RADIX_TREE_EXCEPTIONAL_SHIFT + 3)
+#define RADIX_DAX_SHIFT	(RADIX_TREE_EXCEPTIONAL_SHIFT + 4)
 #define RADIX_DAX_ENTRY_LOCK (1 << RADIX_TREE_EXCEPTIONAL_SHIFT)
-#define RADIX_DAX_PTE (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 1))
-#define RADIX_DAX_PMD (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 2))
-#define RADIX_DAX_TYPE_MASK (RADIX_DAX_PTE | RADIX_DAX_PMD)
-#define RADIX_DAX_TYPE(entry) ((unsigned long)entry & RADIX_DAX_TYPE_MASK)
-#define RADIX_DAX_SECTOR(entry) (((unsigned long)entry >> RADIX_DAX_SHIFT))
-#define RADIX_DAX_ENTRY(sector, pmd) ((void *)((unsigned long)sector << \
-		RADIX_DAX_SHIFT | (pmd ? RADIX_DAX_PMD : RADIX_DAX_PTE) | \
-		RADIX_TREE_EXCEPTIONAL_ENTRY))
+#define RADIX_DAX_PMD (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 1))
+#define RADIX_DAX_HZP (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 2))
+#define RADIX_DAX_EMPTY (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 3))
+
+static inline unsigned long dax_radix_sector(void *entry)
+{
+	return (unsigned long)entry >> RADIX_DAX_SHIFT;
+}
+
+/* entries begin locked */
+static inline void *dax_radix_entry(sector_t sector, unsigned long flags)
+{
+	return (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY | flags |
+			((unsigned long)sector << RADIX_DAX_SHIFT) |
+			RADIX_DAX_ENTRY_LOCK);
+}
 
 
 ssize_t dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
@@ -67,6 +81,27 @@ static inline int dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
 	return VM_FAULT_FALLBACK;
 }
 
+#ifdef CONFIG_FS_DAX_PMD
+static inline unsigned int dax_radix_order(void *entry)
+{
+	if ((unsigned long)entry & RADIX_DAX_PMD)
+		return PMD_SHIFT - PAGE_SHIFT;
+	return 0;
+}
+int dax_iomap_pmd_fault(struct vm_area_struct *vma, unsigned long address,
+		pmd_t *pmd, unsigned int flags, struct iomap_ops *ops);
+#else
+static inline unsigned int dax_radix_order(void *entry)
+{
+	return 0;
+}
+static inline int dax_iomap_pmd_fault(struct vm_area_struct *vma,
+		unsigned long address, pmd_t *pmd, unsigned int flags,
+		struct iomap_ops *ops)
+{
+	return VM_FAULT_FALLBACK;
+}
+#endif
 int dax_pfn_mkwrite(struct vm_area_struct *, struct vm_fault *);
 #define dax_mkwrite(vma, vmf, gb)	dax_fault(vma, vmf, gb)
 
diff --git a/mm/filemap.c b/mm/filemap.c
index a596462..4ed2df5 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -610,9 +610,7 @@ static int page_cache_tree_insert(struct address_space *mapping,
 				workingset_node_shadows_dec(node);
 		} else {
 			/* DAX can replace empty locked entry with a hole */
-			WARN_ON_ONCE(p !=
-				(void *)(RADIX_TREE_EXCEPTIONAL_ENTRY |
-					 RADIX_DAX_ENTRY_LOCK));
+			WARN_ON_ONCE(p != dax_radix_entry(0, RADIX_DAX_EMPTY));
 			/* DAX accounts exceptional entries as normal pages */
 			if (node)
 				workingset_node_pages_dec(node);
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH v5 15/17] dax: add struct iomap based DAX PMD support
@ 2016-10-07 21:09   ` Ross Zwisler
  0 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-07 21:09 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

DAX PMDs have been disabled since Jan Kara introduced DAX radix tree based
locking.  This patch allows DAX PMDs to participate in the DAX radix tree
based locking scheme so that they can be re-enabled using the new struct
iomap based fault handlers.

There are currently three types of DAX 4k entries: 4k zero pages, 4k DAX
mappings that have an associated block allocation, and 4k DAX empty
entries.  The empty entries exist to provide locking for the duration of a
given page fault.

This patch adds three equivalent 2MiB DAX entries: Huge Zero Page (HZP)
entries, PMD DAX entries that have associated block allocations, and 2 MiB
DAX empty entries.

Unlike the 4k case where we insert a struct page* into the radix tree for
4k zero pages, for HZP we insert a DAX exceptional entry with the new
RADIX_DAX_HZP flag set.  This is because we use a single 2 MiB zero page in
every 2MiB hole mapping, and it doesn't make sense to have that same struct
page* with multiple entries in multiple trees.  This would cause contention
on the single page lock for the one Huge Zero Page, and it would break the
page->index and page->mapping associations that are assumed to be valid in
many other places in the kernel.

One difficult use case is when one thread is trying to use 4k entries in
radix tree for a given offset, and another thread is using 2 MiB entries
for that same offset.  The current code handles this by making the 2 MiB
user fall back to 4k entries for most cases.  This was done because it is
the simplest solution, and because the use of 2MiB pages is already
opportunistic.

If we were to try to upgrade from 4k pages to 2MiB pages for a given range,
we run into the problem of how we lock out 4k page faults for the entire
2MiB range while we clean out the radix tree so we can insert the 2MiB
entry.  We can solve this problem if we need to, but I think that the cases
where both 2MiB entries and 4K entries are being used for the same range
will be rare enough and the gain small enough that it probably won't be
worth the complexity.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/dax.c            | 325 ++++++++++++++++++++++++++++++++++++++++++++++------
 include/linux/dax.h |  57 +++++++--
 mm/filemap.c        |   4 +-
 3 files changed, 339 insertions(+), 47 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index ac3cd05..e51d51f 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -281,7 +281,7 @@ static wait_queue_head_t *dax_entry_waitqueue(struct address_space *mapping,
 	 * queue to the start of that PMD.  This ensures that all offsets in
 	 * the range covered by the PMD map to the same bit lock.
 	 */
-	if (RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD)
+	if ((unsigned long)entry & RADIX_DAX_PMD)
 		index &= ~((1UL << (PMD_SHIFT - PAGE_SHIFT)) - 1);
 
 	key->mapping = mapping;
@@ -388,35 +388,103 @@ static void *get_unlocked_mapping_entry(struct address_space *mapping,
  * radix tree entry locked. If the radix tree doesn't contain given index,
  * create empty exceptional entry for the index and return with it locked.
  *
+ * When requesting an entry with size RADIX_DAX_PMD, grab_mapping_entry() will
+ * either return that locked entry or will return an error.  This error will
+ * happen if there are any 4k entries (either zero pages or DAX entries)
+ * within the 2MiB range that we are requesting.
+ *
+ * We always favor 4k entries over 2MiB entries. There isn't a flow where we
+ * evict 4k entries in order to 'upgrade' them to a 2MiB entry.  A 2MiB
+ * insertion will fail if it finds any 4k entries already in the tree, and a
+ * 4k insertion will cause an existing 2MiB entry to be unmapped and
+ * downgraded to 4k entries.  This happens for both 2MiB huge zero pages as
+ * well as 2MiB empty entries.
+ *
+ * The exception to this downgrade path is for 2MiB DAX PMD entries that have
+ * real storage backing them.  We will leave these real 2MiB DAX entries in
+ * the tree, and PTE writes will simply dirty the entire 2MiB DAX entry.
+ *
  * Note: Unlike filemap_fault() we don't honor FAULT_FLAG_RETRY flags. For
  * persistent memory the benefit is doubtful. We can add that later if we can
  * show it helps.
  */
-static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index)
+static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index,
+		unsigned long size_flag)
 {
+	bool pmd_downgrade = false; /* splitting 2MiB entry into 4k entries? */
 	void *entry, **slot;
 
 restart:
 	spin_lock_irq(&mapping->tree_lock);
 	entry = get_unlocked_mapping_entry(mapping, index, &slot);
+
+	if (entry) {
+		if (size_flag & RADIX_DAX_PMD) {
+			if (!radix_tree_exceptional_entry(entry) ||
+			    !((unsigned long)entry & RADIX_DAX_PMD)) {
+				entry = ERR_PTR(-EEXIST);
+				goto out_unlock;
+			}
+		} else { /* trying to grab a PTE entry */
+			if (radix_tree_exceptional_entry(entry) &&
+			    ((unsigned long)entry & RADIX_DAX_PMD) &&
+			    ((unsigned long)entry &
+			     (RADIX_DAX_HZP|RADIX_DAX_EMPTY))) {
+				pmd_downgrade = true;
+			}
+		}
+	}
+
 	/* No entry for given index? Make sure radix tree is big enough. */
-	if (!entry) {
+	if (!entry || pmd_downgrade) {
 		int err;
 
+		if (pmd_downgrade) {
+			/*
+			 * Make sure 'entry' remains valid while we drop
+			 * mapping->tree_lock.
+			 */
+			entry = lock_slot(mapping, slot);
+		}
+
 		spin_unlock_irq(&mapping->tree_lock);
 		err = radix_tree_preload(
 				mapping_gfp_mask(mapping) & ~__GFP_HIGHMEM);
 		if (err)
 			return ERR_PTR(err);
-		entry = (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY |
-			       RADIX_DAX_ENTRY_LOCK);
+
+		/*
+		 * Besides huge zero pages the only other thing that gets
+		 * downgraded are empty entries which don't need to be
+		 * unmapped.
+		 */
+		if (pmd_downgrade && ((unsigned long)entry & RADIX_DAX_HZP))
+			unmap_mapping_range(mapping,
+				(index << PAGE_SHIFT) & PMD_MASK, PMD_SIZE, 0);
+
 		spin_lock_irq(&mapping->tree_lock);
-		err = radix_tree_insert(&mapping->page_tree, index, entry);
+
+		if (pmd_downgrade) {
+			radix_tree_delete(&mapping->page_tree, index);
+			mapping->nrexceptional--;
+			dax_wake_mapping_entry_waiter(mapping, index, entry,
+					false);
+		}
+
+		entry = dax_radix_entry(0, size_flag | RADIX_DAX_EMPTY);
+
+		err = __radix_tree_insert(&mapping->page_tree, index,
+				dax_radix_order(entry), entry);
 		radix_tree_preload_end();
 		if (err) {
 			spin_unlock_irq(&mapping->tree_lock);
-			/* Someone already created the entry? */
-			if (err == -EEXIST)
+			/*
+			 * Someone already created the entry?  This is a
+			 * normal failure when inserting PMDs in a range
+			 * that already contains PTEs.  In that case we want
+			 * to return -EEXIST immediately.
+			 */
+			if (err == -EEXIST && !(size_flag & RADIX_DAX_PMD))
 				goto restart;
 			return ERR_PTR(err);
 		}
@@ -441,6 +509,7 @@ restart:
 		return page;
 	}
 	entry = lock_slot(mapping, slot);
+ out_unlock:
 	spin_unlock_irq(&mapping->tree_lock);
 	return entry;
 }
@@ -581,11 +650,17 @@ static int copy_user_dax(struct block_device *bdev, sector_t sector, size_t size
 	return 0;
 }
 
-#define DAX_PMD_INDEX(page_index) (page_index & (PMD_MASK >> PAGE_SHIFT))
-
+/*
+ * By this point grab_mapping_entry() has ensured that we have a locked entry
+ * of the appropriate size so we don't have to worry about downgrading PMDs to
+ * PTEs.  If we happen to be trying to insert a PTE and there is a PMD
+ * already in the tree, we will skip the insertion and just dirty the PMD as
+ * appropriate.
+ */
 static void *dax_insert_mapping_entry(struct address_space *mapping,
 				      struct vm_fault *vmf,
-				      void *entry, sector_t sector)
+				      void *entry, sector_t sector,
+				      unsigned long flags)
 {
 	struct radix_tree_root *page_tree = &mapping->page_tree;
 	int error = 0;
@@ -608,22 +683,28 @@ static void *dax_insert_mapping_entry(struct address_space *mapping,
 		error = radix_tree_preload(vmf->gfp_mask & ~__GFP_HIGHMEM);
 		if (error)
 			return ERR_PTR(error);
+	} else if (((unsigned long)entry & RADIX_DAX_HZP) &&
+			!(flags & RADIX_DAX_HZP)) {
+		/* replacing huge zero page with PMD block mapping */
+		unmap_mapping_range(mapping,
+			(vmf->pgoff << PAGE_SHIFT) & PMD_MASK, PMD_SIZE, 0);
 	}
 
 	spin_lock_irq(&mapping->tree_lock);
-	new_entry = (void *)((unsigned long)RADIX_DAX_ENTRY(sector, false) |
-		       RADIX_DAX_ENTRY_LOCK);
+	new_entry = dax_radix_entry(sector, flags);
+
 	if (hole_fill) {
 		__delete_from_page_cache(entry, NULL);
 		/* Drop pagecache reference */
 		put_page(entry);
-		error = radix_tree_insert(page_tree, index, new_entry);
+		error = __radix_tree_insert(page_tree, index,
+				dax_radix_order(new_entry), new_entry);
 		if (error) {
 			new_entry = ERR_PTR(error);
 			goto unlock;
 		}
 		mapping->nrexceptional++;
-	} else {
+	} else if ((unsigned long)entry & (RADIX_DAX_HZP|RADIX_DAX_EMPTY)) {
 		void **slot;
 		void *ret;
 
@@ -653,7 +734,6 @@ static int dax_writeback_one(struct block_device *bdev,
 		struct address_space *mapping, pgoff_t index, void *entry)
 {
 	struct radix_tree_root *page_tree = &mapping->page_tree;
-	int type = RADIX_DAX_TYPE(entry);
 	struct radix_tree_node *node;
 	struct blk_dax_ctl dax;
 	void **slot;
@@ -674,13 +754,20 @@ static int dax_writeback_one(struct block_device *bdev,
 	if (!radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE))
 		goto unlock;
 
-	if (WARN_ON_ONCE(type != RADIX_DAX_PTE && type != RADIX_DAX_PMD)) {
+	if (WARN_ON_ONCE((unsigned long)entry & RADIX_DAX_EMPTY)) {
 		ret = -EIO;
 		goto unlock;
 	}
 
-	dax.sector = RADIX_DAX_SECTOR(entry);
-	dax.size = (type == RADIX_DAX_PMD ? PMD_SIZE : PAGE_SIZE);
+	/*
+	 * Even if dax_writeback_mapping_range() was given a wbc->range_start
+	 * in the middle of a PMD, the 'index' we are given will be aligned to
+	 * the start index of the PMD, as will the sector we pull from
+	 * 'entry'.  This allows us to flush for PMD_SIZE and not have to
+	 * worry about partial PMD writebacks.
+	 */
+	dax.sector = dax_radix_sector(entry);
+	dax.size = PAGE_SIZE << dax_radix_order(entry);
 	spin_unlock_irq(&mapping->tree_lock);
 
 	/*
@@ -719,12 +806,11 @@ int dax_writeback_mapping_range(struct address_space *mapping,
 		struct block_device *bdev, struct writeback_control *wbc)
 {
 	struct inode *inode = mapping->host;
-	pgoff_t start_index, end_index, pmd_index;
+	pgoff_t start_index, end_index;
 	pgoff_t indices[PAGEVEC_SIZE];
 	struct pagevec pvec;
 	bool done = false;
 	int i, ret = 0;
-	void *entry;
 
 	if (WARN_ON_ONCE(inode->i_blkbits != PAGE_SHIFT))
 		return -EIO;
@@ -734,15 +820,6 @@ int dax_writeback_mapping_range(struct address_space *mapping,
 
 	start_index = wbc->range_start >> PAGE_SHIFT;
 	end_index = wbc->range_end >> PAGE_SHIFT;
-	pmd_index = DAX_PMD_INDEX(start_index);
-
-	rcu_read_lock();
-	entry = radix_tree_lookup(&mapping->page_tree, pmd_index);
-	rcu_read_unlock();
-
-	/* see if the start of our range is covered by a PMD entry */
-	if (entry && RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD)
-		start_index = pmd_index;
 
 	tag_pages_for_writeback(mapping, start_index, end_index);
 
@@ -787,7 +864,7 @@ static int dax_insert_mapping(struct address_space *mapping,
 		return PTR_ERR(dax.addr);
 	dax_unmap_atomic(bdev, &dax);
 
-	ret = dax_insert_mapping_entry(mapping, vmf, entry, dax.sector);
+	ret = dax_insert_mapping_entry(mapping, vmf, entry, dax.sector, 0);
 	if (IS_ERR(ret))
 		return PTR_ERR(ret);
 	*entryp = ret;
@@ -834,7 +911,7 @@ int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 	bh.b_bdev = inode->i_sb->s_bdev;
 	bh.b_size = PAGE_SIZE;
 
-	entry = grab_mapping_entry(mapping, vmf->pgoff);
+	entry = grab_mapping_entry(mapping, vmf->pgoff, 0);
 	if (IS_ERR(entry)) {
 		error = PTR_ERR(entry);
 		goto out;
@@ -1154,7 +1231,7 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 	if (pos >= i_size_read(inode))
 		return VM_FAULT_SIGBUS;
 
-	entry = grab_mapping_entry(mapping, vmf->pgoff);
+	entry = grab_mapping_entry(mapping, vmf->pgoff, 0);
 	if (IS_ERR(entry)) {
 		error = PTR_ERR(entry);
 		goto out;
@@ -1261,4 +1338,186 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 	return VM_FAULT_NOPAGE | major;
 }
 EXPORT_SYMBOL_GPL(dax_iomap_fault);
+
+#ifdef CONFIG_FS_DAX_PMD
+/*
+ * The 'colour' (ie low bits) within a PMD of a page offset.  This comes up
+ * more often than one might expect in the below functions.
+ */
+#define PG_PMD_COLOUR	((PMD_SIZE >> PAGE_SHIFT) - 1)
+
+static int dax_pmd_insert_mapping(struct vm_area_struct *vma, pmd_t *pmd,
+		struct vm_fault *vmf, unsigned long address,
+		struct iomap *iomap, loff_t pos, bool write, void **entryp)
+{
+	struct address_space *mapping = vma->vm_file->f_mapping;
+	struct block_device *bdev = iomap->bdev;
+	struct blk_dax_ctl dax = {
+		.sector = dax_iomap_sector(iomap, pos),
+		.size = PMD_SIZE,
+	};
+	long length = dax_map_atomic(bdev, &dax);
+	void *ret;
+
+	if (length < 0) /* dax_map_atomic() failed */
+		return VM_FAULT_FALLBACK;
+	if (length < PMD_SIZE)
+		goto unmap_fallback;
+	if (pfn_t_to_pfn(dax.pfn) & PG_PMD_COLOUR)
+		goto unmap_fallback;
+	if (!pfn_t_devmap(dax.pfn))
+		goto unmap_fallback;
+
+	dax_unmap_atomic(bdev, &dax);
+
+	ret = dax_insert_mapping_entry(mapping, vmf, *entryp, dax.sector,
+			RADIX_DAX_PMD);
+	if (IS_ERR(ret))
+		return VM_FAULT_FALLBACK;
+	*entryp = ret;
+
+	return vmf_insert_pfn_pmd(vma, address, pmd, dax.pfn, write);
+
+ unmap_fallback:
+	dax_unmap_atomic(bdev, &dax);
+	return VM_FAULT_FALLBACK;
+}
+
+static int dax_pmd_load_hole(struct vm_area_struct *vma, pmd_t *pmd,
+		struct vm_fault *vmf, unsigned long address,
+		struct iomap *iomap, void **entryp)
+{
+	struct address_space *mapping = vma->vm_file->f_mapping;
+	unsigned long pmd_addr = address & PMD_MASK;
+	struct page *zero_page;
+	spinlock_t *ptl;
+	pmd_t pmd_entry;
+	void *ret;
+
+	zero_page = get_huge_zero_page();
+
+	if (unlikely(!zero_page))
+		return VM_FAULT_FALLBACK;
+
+	ret = dax_insert_mapping_entry(mapping, vmf, *entryp, 0,
+			RADIX_DAX_PMD | RADIX_DAX_HZP);
+	if (IS_ERR(ret))
+		return VM_FAULT_FALLBACK;
+	*entryp = ret;
+
+	ptl = pmd_lock(vma->vm_mm, pmd);
+	if (!pmd_none(*pmd)) {
+		spin_unlock(ptl);
+		return VM_FAULT_FALLBACK;
+	}
+
+	pmd_entry = mk_pmd(zero_page, vma->vm_page_prot);
+	pmd_entry = pmd_mkhuge(pmd_entry);
+	set_pmd_at(vma->vm_mm, pmd_addr, pmd, pmd_entry);
+	spin_unlock(ptl);
+	return VM_FAULT_NOPAGE;
+}
+
+int dax_iomap_pmd_fault(struct vm_area_struct *vma, unsigned long address,
+		pmd_t *pmd, unsigned int flags, struct iomap_ops *ops)
+{
+	struct address_space *mapping = vma->vm_file->f_mapping;
+	unsigned long pmd_addr = address & PMD_MASK;
+	bool write = flags & FAULT_FLAG_WRITE;
+	unsigned int iomap_flags = write ? IOMAP_WRITE : 0;
+	struct inode *inode = mapping->host;
+	int result = VM_FAULT_FALLBACK;
+	struct iomap iomap = { 0 };
+	pgoff_t size, pgoff;
+	struct vm_fault vmf;
+	void *entry;
+	loff_t pos;
+	int error;
+
+	/* Fall back to PTEs if we're going to COW */
+	if (write && !(vma->vm_flags & VM_SHARED)) {
+		split_huge_pmd(vma, pmd, address);
+		goto fallback;
+	}
+
+	/* If the PMD would extend outside the VMA */
+	if (pmd_addr < vma->vm_start)
+		goto fallback;
+	if ((pmd_addr + PMD_SIZE) > vma->vm_end)
+		goto fallback;
+
+	/*
+	 * Check whether offset isn't beyond end of file now. Caller is
+	 * supposed to hold locks serializing us with truncate / punch hole so
+	 * this is a reliable test.
+	 */
+	pgoff = linear_page_index(vma, pmd_addr);
+	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
+
+	if (pgoff >= size)
+		return VM_FAULT_SIGBUS;
+
+	/* If the PMD would extend beyond the file size */
+	if ((pgoff | PG_PMD_COLOUR) >= size)
+		goto fallback;
+
+	/*
+	 * grab_mapping_entry() will make sure we get a 2M empty entry, a DAX
+	 * PMD or a HZP entry.  If it can't (because a 4k page is already in
+	 * the tree, for instance), it will return -EEXIST and we just fall
+	 * back to 4k entries.
+	 */
+	entry = grab_mapping_entry(mapping, pgoff, RADIX_DAX_PMD);
+	if (IS_ERR(entry))
+		goto fallback;
+
+	/*
+	 * Note that we don't use iomap_apply here.  We aren't doing I/O, only
+	 * setting up a mapping, so really we're using iomap_begin() as a way
+	 * to look up our filesystem block.
+	 */
+	pos = (loff_t)pgoff << PAGE_SHIFT;
+	error = ops->iomap_begin(inode, pos, PMD_SIZE, iomap_flags, &iomap);
+	if (error)
+		goto unlock_entry;
+	if (iomap.offset + iomap.length < pos + PMD_SIZE)
+		goto finish_iomap;
+
+	vmf.pgoff = pgoff;
+	vmf.flags = flags;
+	vmf.gfp_mask = mapping_gfp_mask(mapping) | __GFP_IO;
+
+	switch (iomap.type) {
+	case IOMAP_MAPPED:
+		result = dax_pmd_insert_mapping(vma, pmd, &vmf, address,
+				&iomap, pos, write, &entry);
+		break;
+	case IOMAP_UNWRITTEN:
+	case IOMAP_HOLE:
+		if (WARN_ON_ONCE(write))
+			goto finish_iomap;
+		result = dax_pmd_load_hole(vma, pmd, &vmf, address, &iomap,
+				&entry);
+		break;
+	default:
+		WARN_ON_ONCE(1);
+		break;
+	}
+
+ finish_iomap:
+	if (ops->iomap_end) {
+		error = ops->iomap_end(inode, pos, PMD_SIZE, PMD_SIZE,
+				iomap_flags, &iomap);
+		if (error)
+			result = VM_FAULT_FALLBACK;
+	}
+ unlock_entry:
+	put_locked_mapping_entry(mapping, pgoff, entry);
+ fallback:
+	if (result == VM_FAULT_FALLBACK)
+		count_vm_event(THP_FAULT_FALLBACK);
+	return result;
+}
+EXPORT_SYMBOL_GPL(dax_iomap_pmd_fault);
+#endif /* CONFIG_FS_DAX_PMD */
 #endif /* CONFIG_FS_IOMAP */
diff --git a/include/linux/dax.h b/include/linux/dax.h
index e9ea78c..5d70367 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -9,19 +9,33 @@
 struct iomap_ops;
 
 /*
- * We use lowest available bit in exceptional entry for locking, other two
- * bits to determine entry type. In total 3 special bits.
+ * We use lowest available bit in exceptional entry for locking, one bit for
+ * the entry size (PMD) and two more to tell us if the entry is a huge zero
+ * page (HZP) or an empty entry that is just used for locking.  In total four
+ * special bits.
+ *
+ * If the PMD bit isn't set the entry has size PAGE_SIZE, and if the HZP and
+ * EMPTY bits aren't set the entry is a normal DAX entry with a filesystem
+ * block allocation.
  */
-#define RADIX_DAX_SHIFT	(RADIX_TREE_EXCEPTIONAL_SHIFT + 3)
+#define RADIX_DAX_SHIFT	(RADIX_TREE_EXCEPTIONAL_SHIFT + 4)
 #define RADIX_DAX_ENTRY_LOCK (1 << RADIX_TREE_EXCEPTIONAL_SHIFT)
-#define RADIX_DAX_PTE (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 1))
-#define RADIX_DAX_PMD (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 2))
-#define RADIX_DAX_TYPE_MASK (RADIX_DAX_PTE | RADIX_DAX_PMD)
-#define RADIX_DAX_TYPE(entry) ((unsigned long)entry & RADIX_DAX_TYPE_MASK)
-#define RADIX_DAX_SECTOR(entry) (((unsigned long)entry >> RADIX_DAX_SHIFT))
-#define RADIX_DAX_ENTRY(sector, pmd) ((void *)((unsigned long)sector << \
-		RADIX_DAX_SHIFT | (pmd ? RADIX_DAX_PMD : RADIX_DAX_PTE) | \
-		RADIX_TREE_EXCEPTIONAL_ENTRY))
+#define RADIX_DAX_PMD (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 1))
+#define RADIX_DAX_HZP (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 2))
+#define RADIX_DAX_EMPTY (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 3))
+
+static inline unsigned long dax_radix_sector(void *entry)
+{
+	return (unsigned long)entry >> RADIX_DAX_SHIFT;
+}
+
+/* entries begin locked */
+static inline void *dax_radix_entry(sector_t sector, unsigned long flags)
+{
+	return (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY | flags |
+			((unsigned long)sector << RADIX_DAX_SHIFT) |
+			RADIX_DAX_ENTRY_LOCK);
+}
 
 
 ssize_t dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
@@ -67,6 +81,27 @@ static inline int dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
 	return VM_FAULT_FALLBACK;
 }
 
+#ifdef CONFIG_FS_DAX_PMD
+static inline unsigned int dax_radix_order(void *entry)
+{
+	if ((unsigned long)entry & RADIX_DAX_PMD)
+		return PMD_SHIFT - PAGE_SHIFT;
+	return 0;
+}
+int dax_iomap_pmd_fault(struct vm_area_struct *vma, unsigned long address,
+		pmd_t *pmd, unsigned int flags, struct iomap_ops *ops);
+#else
+static inline unsigned int dax_radix_order(void *entry)
+{
+	return 0;
+}
+static inline int dax_iomap_pmd_fault(struct vm_area_struct *vma,
+		unsigned long address, pmd_t *pmd, unsigned int flags,
+		struct iomap_ops *ops)
+{
+	return VM_FAULT_FALLBACK;
+}
+#endif
 int dax_pfn_mkwrite(struct vm_area_struct *, struct vm_fault *);
 #define dax_mkwrite(vma, vmf, gb)	dax_fault(vma, vmf, gb)
 
diff --git a/mm/filemap.c b/mm/filemap.c
index a596462..4ed2df5 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -610,9 +610,7 @@ static int page_cache_tree_insert(struct address_space *mapping,
 				workingset_node_shadows_dec(node);
 		} else {
 			/* DAX can replace empty locked entry with a hole */
-			WARN_ON_ONCE(p !=
-				(void *)(RADIX_TREE_EXCEPTIONAL_ENTRY |
-					 RADIX_DAX_ENTRY_LOCK));
+			WARN_ON_ONCE(p != dax_radix_entry(0, RADIX_DAX_EMPTY));
 			/* DAX accounts exceptional entries as normal pages */
 			if (node)
 				workingset_node_pages_dec(node);
-- 
2.7.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH v5 15/17] dax: add struct iomap based DAX PMD support
@ 2016-10-07 21:09   ` Ross Zwisler
  0 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-07 21:09 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: Theodore Ts'o, Matthew Wilcox, Dave Chinner,
	linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw, Christoph Hellwig,
	linux-xfs-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Andreas Dilger, Alexander Viro,
	Jan Kara, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-ext4-u79uwXL29TY76Z2rM5mHXA, Andrew Morton

DAX PMDs have been disabled since Jan Kara introduced DAX radix tree based
locking.  This patch allows DAX PMDs to participate in the DAX radix tree
based locking scheme so that they can be re-enabled using the new struct
iomap based fault handlers.

There are currently three types of DAX 4k entries: 4k zero pages, 4k DAX
mappings that have an associated block allocation, and 4k DAX empty
entries.  The empty entries exist to provide locking for the duration of a
given page fault.

This patch adds three equivalent 2MiB DAX entries: Huge Zero Page (HZP)
entries, PMD DAX entries that have associated block allocations, and 2 MiB
DAX empty entries.

Unlike the 4k case where we insert a struct page* into the radix tree for
4k zero pages, for HZP we insert a DAX exceptional entry with the new
RADIX_DAX_HZP flag set.  This is because we use a single 2 MiB zero page in
every 2MiB hole mapping, and it doesn't make sense to have that same struct
page* with multiple entries in multiple trees.  This would cause contention
on the single page lock for the one Huge Zero Page, and it would break the
page->index and page->mapping associations that are assumed to be valid in
many other places in the kernel.

One difficult use case is when one thread is trying to use 4k entries in
radix tree for a given offset, and another thread is using 2 MiB entries
for that same offset.  The current code handles this by making the 2 MiB
user fall back to 4k entries for most cases.  This was done because it is
the simplest solution, and because the use of 2MiB pages is already
opportunistic.

If we were to try to upgrade from 4k pages to 2MiB pages for a given range,
we run into the problem of how we lock out 4k page faults for the entire
2MiB range while we clean out the radix tree so we can insert the 2MiB
entry.  We can solve this problem if we need to, but I think that the cases
where both 2MiB entries and 4K entries are being used for the same range
will be rare enough and the gain small enough that it probably won't be
worth the complexity.

Signed-off-by: Ross Zwisler <ross.zwisler-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
---
 fs/dax.c            | 325 ++++++++++++++++++++++++++++++++++++++++++++++------
 include/linux/dax.h |  57 +++++++--
 mm/filemap.c        |   4 +-
 3 files changed, 339 insertions(+), 47 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index ac3cd05..e51d51f 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -281,7 +281,7 @@ static wait_queue_head_t *dax_entry_waitqueue(struct address_space *mapping,
 	 * queue to the start of that PMD.  This ensures that all offsets in
 	 * the range covered by the PMD map to the same bit lock.
 	 */
-	if (RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD)
+	if ((unsigned long)entry & RADIX_DAX_PMD)
 		index &= ~((1UL << (PMD_SHIFT - PAGE_SHIFT)) - 1);
 
 	key->mapping = mapping;
@@ -388,35 +388,103 @@ static void *get_unlocked_mapping_entry(struct address_space *mapping,
  * radix tree entry locked. If the radix tree doesn't contain given index,
  * create empty exceptional entry for the index and return with it locked.
  *
+ * When requesting an entry with size RADIX_DAX_PMD, grab_mapping_entry() will
+ * either return that locked entry or will return an error.  This error will
+ * happen if there are any 4k entries (either zero pages or DAX entries)
+ * within the 2MiB range that we are requesting.
+ *
+ * We always favor 4k entries over 2MiB entries. There isn't a flow where we
+ * evict 4k entries in order to 'upgrade' them to a 2MiB entry.  A 2MiB
+ * insertion will fail if it finds any 4k entries already in the tree, and a
+ * 4k insertion will cause an existing 2MiB entry to be unmapped and
+ * downgraded to 4k entries.  This happens for both 2MiB huge zero pages as
+ * well as 2MiB empty entries.
+ *
+ * The exception to this downgrade path is for 2MiB DAX PMD entries that have
+ * real storage backing them.  We will leave these real 2MiB DAX entries in
+ * the tree, and PTE writes will simply dirty the entire 2MiB DAX entry.
+ *
  * Note: Unlike filemap_fault() we don't honor FAULT_FLAG_RETRY flags. For
  * persistent memory the benefit is doubtful. We can add that later if we can
  * show it helps.
  */
-static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index)
+static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index,
+		unsigned long size_flag)
 {
+	bool pmd_downgrade = false; /* splitting 2MiB entry into 4k entries? */
 	void *entry, **slot;
 
 restart:
 	spin_lock_irq(&mapping->tree_lock);
 	entry = get_unlocked_mapping_entry(mapping, index, &slot);
+
+	if (entry) {
+		if (size_flag & RADIX_DAX_PMD) {
+			if (!radix_tree_exceptional_entry(entry) ||
+			    !((unsigned long)entry & RADIX_DAX_PMD)) {
+				entry = ERR_PTR(-EEXIST);
+				goto out_unlock;
+			}
+		} else { /* trying to grab a PTE entry */
+			if (radix_tree_exceptional_entry(entry) &&
+			    ((unsigned long)entry & RADIX_DAX_PMD) &&
+			    ((unsigned long)entry &
+			     (RADIX_DAX_HZP|RADIX_DAX_EMPTY))) {
+				pmd_downgrade = true;
+			}
+		}
+	}
+
 	/* No entry for given index? Make sure radix tree is big enough. */
-	if (!entry) {
+	if (!entry || pmd_downgrade) {
 		int err;
 
+		if (pmd_downgrade) {
+			/*
+			 * Make sure 'entry' remains valid while we drop
+			 * mapping->tree_lock.
+			 */
+			entry = lock_slot(mapping, slot);
+		}
+
 		spin_unlock_irq(&mapping->tree_lock);
 		err = radix_tree_preload(
 				mapping_gfp_mask(mapping) & ~__GFP_HIGHMEM);
 		if (err)
 			return ERR_PTR(err);
-		entry = (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY |
-			       RADIX_DAX_ENTRY_LOCK);
+
+		/*
+		 * Besides huge zero pages the only other thing that gets
+		 * downgraded are empty entries which don't need to be
+		 * unmapped.
+		 */
+		if (pmd_downgrade && ((unsigned long)entry & RADIX_DAX_HZP))
+			unmap_mapping_range(mapping,
+				(index << PAGE_SHIFT) & PMD_MASK, PMD_SIZE, 0);
+
 		spin_lock_irq(&mapping->tree_lock);
-		err = radix_tree_insert(&mapping->page_tree, index, entry);
+
+		if (pmd_downgrade) {
+			radix_tree_delete(&mapping->page_tree, index);
+			mapping->nrexceptional--;
+			dax_wake_mapping_entry_waiter(mapping, index, entry,
+					false);
+		}
+
+		entry = dax_radix_entry(0, size_flag | RADIX_DAX_EMPTY);
+
+		err = __radix_tree_insert(&mapping->page_tree, index,
+				dax_radix_order(entry), entry);
 		radix_tree_preload_end();
 		if (err) {
 			spin_unlock_irq(&mapping->tree_lock);
-			/* Someone already created the entry? */
-			if (err == -EEXIST)
+			/*
+			 * Someone already created the entry?  This is a
+			 * normal failure when inserting PMDs in a range
+			 * that already contains PTEs.  In that case we want
+			 * to return -EEXIST immediately.
+			 */
+			if (err == -EEXIST && !(size_flag & RADIX_DAX_PMD))
 				goto restart;
 			return ERR_PTR(err);
 		}
@@ -441,6 +509,7 @@ restart:
 		return page;
 	}
 	entry = lock_slot(mapping, slot);
+ out_unlock:
 	spin_unlock_irq(&mapping->tree_lock);
 	return entry;
 }
@@ -581,11 +650,17 @@ static int copy_user_dax(struct block_device *bdev, sector_t sector, size_t size
 	return 0;
 }
 
-#define DAX_PMD_INDEX(page_index) (page_index & (PMD_MASK >> PAGE_SHIFT))
-
+/*
+ * By this point grab_mapping_entry() has ensured that we have a locked entry
+ * of the appropriate size so we don't have to worry about downgrading PMDs to
+ * PTEs.  If we happen to be trying to insert a PTE and there is a PMD
+ * already in the tree, we will skip the insertion and just dirty the PMD as
+ * appropriate.
+ */
 static void *dax_insert_mapping_entry(struct address_space *mapping,
 				      struct vm_fault *vmf,
-				      void *entry, sector_t sector)
+				      void *entry, sector_t sector,
+				      unsigned long flags)
 {
 	struct radix_tree_root *page_tree = &mapping->page_tree;
 	int error = 0;
@@ -608,22 +683,28 @@ static void *dax_insert_mapping_entry(struct address_space *mapping,
 		error = radix_tree_preload(vmf->gfp_mask & ~__GFP_HIGHMEM);
 		if (error)
 			return ERR_PTR(error);
+	} else if (((unsigned long)entry & RADIX_DAX_HZP) &&
+			!(flags & RADIX_DAX_HZP)) {
+		/* replacing huge zero page with PMD block mapping */
+		unmap_mapping_range(mapping,
+			(vmf->pgoff << PAGE_SHIFT) & PMD_MASK, PMD_SIZE, 0);
 	}
 
 	spin_lock_irq(&mapping->tree_lock);
-	new_entry = (void *)((unsigned long)RADIX_DAX_ENTRY(sector, false) |
-		       RADIX_DAX_ENTRY_LOCK);
+	new_entry = dax_radix_entry(sector, flags);
+
 	if (hole_fill) {
 		__delete_from_page_cache(entry, NULL);
 		/* Drop pagecache reference */
 		put_page(entry);
-		error = radix_tree_insert(page_tree, index, new_entry);
+		error = __radix_tree_insert(page_tree, index,
+				dax_radix_order(new_entry), new_entry);
 		if (error) {
 			new_entry = ERR_PTR(error);
 			goto unlock;
 		}
 		mapping->nrexceptional++;
-	} else {
+	} else if ((unsigned long)entry & (RADIX_DAX_HZP|RADIX_DAX_EMPTY)) {
 		void **slot;
 		void *ret;
 
@@ -653,7 +734,6 @@ static int dax_writeback_one(struct block_device *bdev,
 		struct address_space *mapping, pgoff_t index, void *entry)
 {
 	struct radix_tree_root *page_tree = &mapping->page_tree;
-	int type = RADIX_DAX_TYPE(entry);
 	struct radix_tree_node *node;
 	struct blk_dax_ctl dax;
 	void **slot;
@@ -674,13 +754,20 @@ static int dax_writeback_one(struct block_device *bdev,
 	if (!radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE))
 		goto unlock;
 
-	if (WARN_ON_ONCE(type != RADIX_DAX_PTE && type != RADIX_DAX_PMD)) {
+	if (WARN_ON_ONCE((unsigned long)entry & RADIX_DAX_EMPTY)) {
 		ret = -EIO;
 		goto unlock;
 	}
 
-	dax.sector = RADIX_DAX_SECTOR(entry);
-	dax.size = (type == RADIX_DAX_PMD ? PMD_SIZE : PAGE_SIZE);
+	/*
+	 * Even if dax_writeback_mapping_range() was given a wbc->range_start
+	 * in the middle of a PMD, the 'index' we are given will be aligned to
+	 * the start index of the PMD, as will the sector we pull from
+	 * 'entry'.  This allows us to flush for PMD_SIZE and not have to
+	 * worry about partial PMD writebacks.
+	 */
+	dax.sector = dax_radix_sector(entry);
+	dax.size = PAGE_SIZE << dax_radix_order(entry);
 	spin_unlock_irq(&mapping->tree_lock);
 
 	/*
@@ -719,12 +806,11 @@ int dax_writeback_mapping_range(struct address_space *mapping,
 		struct block_device *bdev, struct writeback_control *wbc)
 {
 	struct inode *inode = mapping->host;
-	pgoff_t start_index, end_index, pmd_index;
+	pgoff_t start_index, end_index;
 	pgoff_t indices[PAGEVEC_SIZE];
 	struct pagevec pvec;
 	bool done = false;
 	int i, ret = 0;
-	void *entry;
 
 	if (WARN_ON_ONCE(inode->i_blkbits != PAGE_SHIFT))
 		return -EIO;
@@ -734,15 +820,6 @@ int dax_writeback_mapping_range(struct address_space *mapping,
 
 	start_index = wbc->range_start >> PAGE_SHIFT;
 	end_index = wbc->range_end >> PAGE_SHIFT;
-	pmd_index = DAX_PMD_INDEX(start_index);
-
-	rcu_read_lock();
-	entry = radix_tree_lookup(&mapping->page_tree, pmd_index);
-	rcu_read_unlock();
-
-	/* see if the start of our range is covered by a PMD entry */
-	if (entry && RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD)
-		start_index = pmd_index;
 
 	tag_pages_for_writeback(mapping, start_index, end_index);
 
@@ -787,7 +864,7 @@ static int dax_insert_mapping(struct address_space *mapping,
 		return PTR_ERR(dax.addr);
 	dax_unmap_atomic(bdev, &dax);
 
-	ret = dax_insert_mapping_entry(mapping, vmf, entry, dax.sector);
+	ret = dax_insert_mapping_entry(mapping, vmf, entry, dax.sector, 0);
 	if (IS_ERR(ret))
 		return PTR_ERR(ret);
 	*entryp = ret;
@@ -834,7 +911,7 @@ int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 	bh.b_bdev = inode->i_sb->s_bdev;
 	bh.b_size = PAGE_SIZE;
 
-	entry = grab_mapping_entry(mapping, vmf->pgoff);
+	entry = grab_mapping_entry(mapping, vmf->pgoff, 0);
 	if (IS_ERR(entry)) {
 		error = PTR_ERR(entry);
 		goto out;
@@ -1154,7 +1231,7 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 	if (pos >= i_size_read(inode))
 		return VM_FAULT_SIGBUS;
 
-	entry = grab_mapping_entry(mapping, vmf->pgoff);
+	entry = grab_mapping_entry(mapping, vmf->pgoff, 0);
 	if (IS_ERR(entry)) {
 		error = PTR_ERR(entry);
 		goto out;
@@ -1261,4 +1338,186 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 	return VM_FAULT_NOPAGE | major;
 }
 EXPORT_SYMBOL_GPL(dax_iomap_fault);
+
+#ifdef CONFIG_FS_DAX_PMD
+/*
+ * The 'colour' (ie low bits) within a PMD of a page offset.  This comes up
+ * more often than one might expect in the below functions.
+ */
+#define PG_PMD_COLOUR	((PMD_SIZE >> PAGE_SHIFT) - 1)
+
+static int dax_pmd_insert_mapping(struct vm_area_struct *vma, pmd_t *pmd,
+		struct vm_fault *vmf, unsigned long address,
+		struct iomap *iomap, loff_t pos, bool write, void **entryp)
+{
+	struct address_space *mapping = vma->vm_file->f_mapping;
+	struct block_device *bdev = iomap->bdev;
+	struct blk_dax_ctl dax = {
+		.sector = dax_iomap_sector(iomap, pos),
+		.size = PMD_SIZE,
+	};
+	long length = dax_map_atomic(bdev, &dax);
+	void *ret;
+
+	if (length < 0) /* dax_map_atomic() failed */
+		return VM_FAULT_FALLBACK;
+	if (length < PMD_SIZE)
+		goto unmap_fallback;
+	if (pfn_t_to_pfn(dax.pfn) & PG_PMD_COLOUR)
+		goto unmap_fallback;
+	if (!pfn_t_devmap(dax.pfn))
+		goto unmap_fallback;
+
+	dax_unmap_atomic(bdev, &dax);
+
+	ret = dax_insert_mapping_entry(mapping, vmf, *entryp, dax.sector,
+			RADIX_DAX_PMD);
+	if (IS_ERR(ret))
+		return VM_FAULT_FALLBACK;
+	*entryp = ret;
+
+	return vmf_insert_pfn_pmd(vma, address, pmd, dax.pfn, write);
+
+ unmap_fallback:
+	dax_unmap_atomic(bdev, &dax);
+	return VM_FAULT_FALLBACK;
+}
+
+static int dax_pmd_load_hole(struct vm_area_struct *vma, pmd_t *pmd,
+		struct vm_fault *vmf, unsigned long address,
+		struct iomap *iomap, void **entryp)
+{
+	struct address_space *mapping = vma->vm_file->f_mapping;
+	unsigned long pmd_addr = address & PMD_MASK;
+	struct page *zero_page;
+	spinlock_t *ptl;
+	pmd_t pmd_entry;
+	void *ret;
+
+	zero_page = get_huge_zero_page();
+
+	if (unlikely(!zero_page))
+		return VM_FAULT_FALLBACK;
+
+	ret = dax_insert_mapping_entry(mapping, vmf, *entryp, 0,
+			RADIX_DAX_PMD | RADIX_DAX_HZP);
+	if (IS_ERR(ret))
+		return VM_FAULT_FALLBACK;
+	*entryp = ret;
+
+	ptl = pmd_lock(vma->vm_mm, pmd);
+	if (!pmd_none(*pmd)) {
+		spin_unlock(ptl);
+		return VM_FAULT_FALLBACK;
+	}
+
+	pmd_entry = mk_pmd(zero_page, vma->vm_page_prot);
+	pmd_entry = pmd_mkhuge(pmd_entry);
+	set_pmd_at(vma->vm_mm, pmd_addr, pmd, pmd_entry);
+	spin_unlock(ptl);
+	return VM_FAULT_NOPAGE;
+}
+
+int dax_iomap_pmd_fault(struct vm_area_struct *vma, unsigned long address,
+		pmd_t *pmd, unsigned int flags, struct iomap_ops *ops)
+{
+	struct address_space *mapping = vma->vm_file->f_mapping;
+	unsigned long pmd_addr = address & PMD_MASK;
+	bool write = flags & FAULT_FLAG_WRITE;
+	unsigned int iomap_flags = write ? IOMAP_WRITE : 0;
+	struct inode *inode = mapping->host;
+	int result = VM_FAULT_FALLBACK;
+	struct iomap iomap = { 0 };
+	pgoff_t size, pgoff;
+	struct vm_fault vmf;
+	void *entry;
+	loff_t pos;
+	int error;
+
+	/* Fall back to PTEs if we're going to COW */
+	if (write && !(vma->vm_flags & VM_SHARED)) {
+		split_huge_pmd(vma, pmd, address);
+		goto fallback;
+	}
+
+	/* If the PMD would extend outside the VMA */
+	if (pmd_addr < vma->vm_start)
+		goto fallback;
+	if ((pmd_addr + PMD_SIZE) > vma->vm_end)
+		goto fallback;
+
+	/*
+	 * Check whether offset isn't beyond end of file now. Caller is
+	 * supposed to hold locks serializing us with truncate / punch hole so
+	 * this is a reliable test.
+	 */
+	pgoff = linear_page_index(vma, pmd_addr);
+	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
+
+	if (pgoff >= size)
+		return VM_FAULT_SIGBUS;
+
+	/* If the PMD would extend beyond the file size */
+	if ((pgoff | PG_PMD_COLOUR) >= size)
+		goto fallback;
+
+	/*
+	 * grab_mapping_entry() will make sure we get a 2M empty entry, a DAX
+	 * PMD or a HZP entry.  If it can't (because a 4k page is already in
+	 * the tree, for instance), it will return -EEXIST and we just fall
+	 * back to 4k entries.
+	 */
+	entry = grab_mapping_entry(mapping, pgoff, RADIX_DAX_PMD);
+	if (IS_ERR(entry))
+		goto fallback;
+
+	/*
+	 * Note that we don't use iomap_apply here.  We aren't doing I/O, only
+	 * setting up a mapping, so really we're using iomap_begin() as a way
+	 * to look up our filesystem block.
+	 */
+	pos = (loff_t)pgoff << PAGE_SHIFT;
+	error = ops->iomap_begin(inode, pos, PMD_SIZE, iomap_flags, &iomap);
+	if (error)
+		goto unlock_entry;
+	if (iomap.offset + iomap.length < pos + PMD_SIZE)
+		goto finish_iomap;
+
+	vmf.pgoff = pgoff;
+	vmf.flags = flags;
+	vmf.gfp_mask = mapping_gfp_mask(mapping) | __GFP_IO;
+
+	switch (iomap.type) {
+	case IOMAP_MAPPED:
+		result = dax_pmd_insert_mapping(vma, pmd, &vmf, address,
+				&iomap, pos, write, &entry);
+		break;
+	case IOMAP_UNWRITTEN:
+	case IOMAP_HOLE:
+		if (WARN_ON_ONCE(write))
+			goto finish_iomap;
+		result = dax_pmd_load_hole(vma, pmd, &vmf, address, &iomap,
+				&entry);
+		break;
+	default:
+		WARN_ON_ONCE(1);
+		break;
+	}
+
+ finish_iomap:
+	if (ops->iomap_end) {
+		error = ops->iomap_end(inode, pos, PMD_SIZE, PMD_SIZE,
+				iomap_flags, &iomap);
+		if (error)
+			result = VM_FAULT_FALLBACK;
+	}
+ unlock_entry:
+	put_locked_mapping_entry(mapping, pgoff, entry);
+ fallback:
+	if (result == VM_FAULT_FALLBACK)
+		count_vm_event(THP_FAULT_FALLBACK);
+	return result;
+}
+EXPORT_SYMBOL_GPL(dax_iomap_pmd_fault);
+#endif /* CONFIG_FS_DAX_PMD */
 #endif /* CONFIG_FS_IOMAP */
diff --git a/include/linux/dax.h b/include/linux/dax.h
index e9ea78c..5d70367 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -9,19 +9,33 @@
 struct iomap_ops;
 
 /*
- * We use lowest available bit in exceptional entry for locking, other two
- * bits to determine entry type. In total 3 special bits.
+ * We use lowest available bit in exceptional entry for locking, one bit for
+ * the entry size (PMD) and two more to tell us if the entry is a huge zero
+ * page (HZP) or an empty entry that is just used for locking.  In total four
+ * special bits.
+ *
+ * If the PMD bit isn't set the entry has size PAGE_SIZE, and if the HZP and
+ * EMPTY bits aren't set the entry is a normal DAX entry with a filesystem
+ * block allocation.
  */
-#define RADIX_DAX_SHIFT	(RADIX_TREE_EXCEPTIONAL_SHIFT + 3)
+#define RADIX_DAX_SHIFT	(RADIX_TREE_EXCEPTIONAL_SHIFT + 4)
 #define RADIX_DAX_ENTRY_LOCK (1 << RADIX_TREE_EXCEPTIONAL_SHIFT)
-#define RADIX_DAX_PTE (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 1))
-#define RADIX_DAX_PMD (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 2))
-#define RADIX_DAX_TYPE_MASK (RADIX_DAX_PTE | RADIX_DAX_PMD)
-#define RADIX_DAX_TYPE(entry) ((unsigned long)entry & RADIX_DAX_TYPE_MASK)
-#define RADIX_DAX_SECTOR(entry) (((unsigned long)entry >> RADIX_DAX_SHIFT))
-#define RADIX_DAX_ENTRY(sector, pmd) ((void *)((unsigned long)sector << \
-		RADIX_DAX_SHIFT | (pmd ? RADIX_DAX_PMD : RADIX_DAX_PTE) | \
-		RADIX_TREE_EXCEPTIONAL_ENTRY))
+#define RADIX_DAX_PMD (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 1))
+#define RADIX_DAX_HZP (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 2))
+#define RADIX_DAX_EMPTY (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 3))
+
+static inline unsigned long dax_radix_sector(void *entry)
+{
+	return (unsigned long)entry >> RADIX_DAX_SHIFT;
+}
+
+/* entries begin locked */
+static inline void *dax_radix_entry(sector_t sector, unsigned long flags)
+{
+	return (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY | flags |
+			((unsigned long)sector << RADIX_DAX_SHIFT) |
+			RADIX_DAX_ENTRY_LOCK);
+}
 
 
 ssize_t dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
@@ -67,6 +81,27 @@ static inline int dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
 	return VM_FAULT_FALLBACK;
 }
 
+#ifdef CONFIG_FS_DAX_PMD
+static inline unsigned int dax_radix_order(void *entry)
+{
+	if ((unsigned long)entry & RADIX_DAX_PMD)
+		return PMD_SHIFT - PAGE_SHIFT;
+	return 0;
+}
+int dax_iomap_pmd_fault(struct vm_area_struct *vma, unsigned long address,
+		pmd_t *pmd, unsigned int flags, struct iomap_ops *ops);
+#else
+static inline unsigned int dax_radix_order(void *entry)
+{
+	return 0;
+}
+static inline int dax_iomap_pmd_fault(struct vm_area_struct *vma,
+		unsigned long address, pmd_t *pmd, unsigned int flags,
+		struct iomap_ops *ops)
+{
+	return VM_FAULT_FALLBACK;
+}
+#endif
 int dax_pfn_mkwrite(struct vm_area_struct *, struct vm_fault *);
 #define dax_mkwrite(vma, vmf, gb)	dax_fault(vma, vmf, gb)
 
diff --git a/mm/filemap.c b/mm/filemap.c
index a596462..4ed2df5 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -610,9 +610,7 @@ static int page_cache_tree_insert(struct address_space *mapping,
 				workingset_node_shadows_dec(node);
 		} else {
 			/* DAX can replace empty locked entry with a hole */
-			WARN_ON_ONCE(p !=
-				(void *)(RADIX_TREE_EXCEPTIONAL_ENTRY |
-					 RADIX_DAX_ENTRY_LOCK));
+			WARN_ON_ONCE(p != dax_radix_entry(0, RADIX_DAX_EMPTY));
 			/* DAX accounts exceptional entries as normal pages */
 			if (node)
 				workingset_node_pages_dec(node);
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH v5 15/17] dax: add struct iomap based DAX PMD support
@ 2016-10-07 21:09   ` Ross Zwisler
  0 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-07 21:09 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

DAX PMDs have been disabled since Jan Kara introduced DAX radix tree based
locking.  This patch allows DAX PMDs to participate in the DAX radix tree
based locking scheme so that they can be re-enabled using the new struct
iomap based fault handlers.

There are currently three types of DAX 4k entries: 4k zero pages, 4k DAX
mappings that have an associated block allocation, and 4k DAX empty
entries.  The empty entries exist to provide locking for the duration of a
given page fault.

This patch adds three equivalent 2MiB DAX entries: Huge Zero Page (HZP)
entries, PMD DAX entries that have associated block allocations, and 2 MiB
DAX empty entries.

Unlike the 4k case where we insert a struct page* into the radix tree for
4k zero pages, for HZP we insert a DAX exceptional entry with the new
RADIX_DAX_HZP flag set.  This is because we use a single 2 MiB zero page in
every 2MiB hole mapping, and it doesn't make sense to have that same struct
page* with multiple entries in multiple trees.  This would cause contention
on the single page lock for the one Huge Zero Page, and it would break the
page->index and page->mapping associations that are assumed to be valid in
many other places in the kernel.

One difficult use case is when one thread is trying to use 4k entries in
radix tree for a given offset, and another thread is using 2 MiB entries
for that same offset.  The current code handles this by making the 2 MiB
user fall back to 4k entries for most cases.  This was done because it is
the simplest solution, and because the use of 2MiB pages is already
opportunistic.

If we were to try to upgrade from 4k pages to 2MiB pages for a given range,
we run into the problem of how we lock out 4k page faults for the entire
2MiB range while we clean out the radix tree so we can insert the 2MiB
entry.  We can solve this problem if we need to, but I think that the cases
where both 2MiB entries and 4K entries are being used for the same range
will be rare enough and the gain small enough that it probably won't be
worth the complexity.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/dax.c            | 325 ++++++++++++++++++++++++++++++++++++++++++++++------
 include/linux/dax.h |  57 +++++++--
 mm/filemap.c        |   4 +-
 3 files changed, 339 insertions(+), 47 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index ac3cd05..e51d51f 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -281,7 +281,7 @@ static wait_queue_head_t *dax_entry_waitqueue(struct address_space *mapping,
 	 * queue to the start of that PMD.  This ensures that all offsets in
 	 * the range covered by the PMD map to the same bit lock.
 	 */
-	if (RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD)
+	if ((unsigned long)entry & RADIX_DAX_PMD)
 		index &= ~((1UL << (PMD_SHIFT - PAGE_SHIFT)) - 1);
 
 	key->mapping = mapping;
@@ -388,35 +388,103 @@ static void *get_unlocked_mapping_entry(struct address_space *mapping,
  * radix tree entry locked. If the radix tree doesn't contain given index,
  * create empty exceptional entry for the index and return with it locked.
  *
+ * When requesting an entry with size RADIX_DAX_PMD, grab_mapping_entry() will
+ * either return that locked entry or will return an error.  This error will
+ * happen if there are any 4k entries (either zero pages or DAX entries)
+ * within the 2MiB range that we are requesting.
+ *
+ * We always favor 4k entries over 2MiB entries. There isn't a flow where we
+ * evict 4k entries in order to 'upgrade' them to a 2MiB entry.  A 2MiB
+ * insertion will fail if it finds any 4k entries already in the tree, and a
+ * 4k insertion will cause an existing 2MiB entry to be unmapped and
+ * downgraded to 4k entries.  This happens for both 2MiB huge zero pages as
+ * well as 2MiB empty entries.
+ *
+ * The exception to this downgrade path is for 2MiB DAX PMD entries that have
+ * real storage backing them.  We will leave these real 2MiB DAX entries in
+ * the tree, and PTE writes will simply dirty the entire 2MiB DAX entry.
+ *
  * Note: Unlike filemap_fault() we don't honor FAULT_FLAG_RETRY flags. For
  * persistent memory the benefit is doubtful. We can add that later if we can
  * show it helps.
  */
-static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index)
+static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index,
+		unsigned long size_flag)
 {
+	bool pmd_downgrade = false; /* splitting 2MiB entry into 4k entries? */
 	void *entry, **slot;
 
 restart:
 	spin_lock_irq(&mapping->tree_lock);
 	entry = get_unlocked_mapping_entry(mapping, index, &slot);
+
+	if (entry) {
+		if (size_flag & RADIX_DAX_PMD) {
+			if (!radix_tree_exceptional_entry(entry) ||
+			    !((unsigned long)entry & RADIX_DAX_PMD)) {
+				entry = ERR_PTR(-EEXIST);
+				goto out_unlock;
+			}
+		} else { /* trying to grab a PTE entry */
+			if (radix_tree_exceptional_entry(entry) &&
+			    ((unsigned long)entry & RADIX_DAX_PMD) &&
+			    ((unsigned long)entry &
+			     (RADIX_DAX_HZP|RADIX_DAX_EMPTY))) {
+				pmd_downgrade = true;
+			}
+		}
+	}
+
 	/* No entry for given index? Make sure radix tree is big enough. */
-	if (!entry) {
+	if (!entry || pmd_downgrade) {
 		int err;
 
+		if (pmd_downgrade) {
+			/*
+			 * Make sure 'entry' remains valid while we drop
+			 * mapping->tree_lock.
+			 */
+			entry = lock_slot(mapping, slot);
+		}
+
 		spin_unlock_irq(&mapping->tree_lock);
 		err = radix_tree_preload(
 				mapping_gfp_mask(mapping) & ~__GFP_HIGHMEM);
 		if (err)
 			return ERR_PTR(err);
-		entry = (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY |
-			       RADIX_DAX_ENTRY_LOCK);
+
+		/*
+		 * Besides huge zero pages the only other thing that gets
+		 * downgraded are empty entries which don't need to be
+		 * unmapped.
+		 */
+		if (pmd_downgrade && ((unsigned long)entry & RADIX_DAX_HZP))
+			unmap_mapping_range(mapping,
+				(index << PAGE_SHIFT) & PMD_MASK, PMD_SIZE, 0);
+
 		spin_lock_irq(&mapping->tree_lock);
-		err = radix_tree_insert(&mapping->page_tree, index, entry);
+
+		if (pmd_downgrade) {
+			radix_tree_delete(&mapping->page_tree, index);
+			mapping->nrexceptional--;
+			dax_wake_mapping_entry_waiter(mapping, index, entry,
+					false);
+		}
+
+		entry = dax_radix_entry(0, size_flag | RADIX_DAX_EMPTY);
+
+		err = __radix_tree_insert(&mapping->page_tree, index,
+				dax_radix_order(entry), entry);
 		radix_tree_preload_end();
 		if (err) {
 			spin_unlock_irq(&mapping->tree_lock);
-			/* Someone already created the entry? */
-			if (err == -EEXIST)
+			/*
+			 * Someone already created the entry?  This is a
+			 * normal failure when inserting PMDs in a range
+			 * that already contains PTEs.  In that case we want
+			 * to return -EEXIST immediately.
+			 */
+			if (err == -EEXIST && !(size_flag & RADIX_DAX_PMD))
 				goto restart;
 			return ERR_PTR(err);
 		}
@@ -441,6 +509,7 @@ restart:
 		return page;
 	}
 	entry = lock_slot(mapping, slot);
+ out_unlock:
 	spin_unlock_irq(&mapping->tree_lock);
 	return entry;
 }
@@ -581,11 +650,17 @@ static int copy_user_dax(struct block_device *bdev, sector_t sector, size_t size
 	return 0;
 }
 
-#define DAX_PMD_INDEX(page_index) (page_index & (PMD_MASK >> PAGE_SHIFT))
-
+/*
+ * By this point grab_mapping_entry() has ensured that we have a locked entry
+ * of the appropriate size so we don't have to worry about downgrading PMDs to
+ * PTEs.  If we happen to be trying to insert a PTE and there is a PMD
+ * already in the tree, we will skip the insertion and just dirty the PMD as
+ * appropriate.
+ */
 static void *dax_insert_mapping_entry(struct address_space *mapping,
 				      struct vm_fault *vmf,
-				      void *entry, sector_t sector)
+				      void *entry, sector_t sector,
+				      unsigned long flags)
 {
 	struct radix_tree_root *page_tree = &mapping->page_tree;
 	int error = 0;
@@ -608,22 +683,28 @@ static void *dax_insert_mapping_entry(struct address_space *mapping,
 		error = radix_tree_preload(vmf->gfp_mask & ~__GFP_HIGHMEM);
 		if (error)
 			return ERR_PTR(error);
+	} else if (((unsigned long)entry & RADIX_DAX_HZP) &&
+			!(flags & RADIX_DAX_HZP)) {
+		/* replacing huge zero page with PMD block mapping */
+		unmap_mapping_range(mapping,
+			(vmf->pgoff << PAGE_SHIFT) & PMD_MASK, PMD_SIZE, 0);
 	}
 
 	spin_lock_irq(&mapping->tree_lock);
-	new_entry = (void *)((unsigned long)RADIX_DAX_ENTRY(sector, false) |
-		       RADIX_DAX_ENTRY_LOCK);
+	new_entry = dax_radix_entry(sector, flags);
+
 	if (hole_fill) {
 		__delete_from_page_cache(entry, NULL);
 		/* Drop pagecache reference */
 		put_page(entry);
-		error = radix_tree_insert(page_tree, index, new_entry);
+		error = __radix_tree_insert(page_tree, index,
+				dax_radix_order(new_entry), new_entry);
 		if (error) {
 			new_entry = ERR_PTR(error);
 			goto unlock;
 		}
 		mapping->nrexceptional++;
-	} else {
+	} else if ((unsigned long)entry & (RADIX_DAX_HZP|RADIX_DAX_EMPTY)) {
 		void **slot;
 		void *ret;
 
@@ -653,7 +734,6 @@ static int dax_writeback_one(struct block_device *bdev,
 		struct address_space *mapping, pgoff_t index, void *entry)
 {
 	struct radix_tree_root *page_tree = &mapping->page_tree;
-	int type = RADIX_DAX_TYPE(entry);
 	struct radix_tree_node *node;
 	struct blk_dax_ctl dax;
 	void **slot;
@@ -674,13 +754,20 @@ static int dax_writeback_one(struct block_device *bdev,
 	if (!radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE))
 		goto unlock;
 
-	if (WARN_ON_ONCE(type != RADIX_DAX_PTE && type != RADIX_DAX_PMD)) {
+	if (WARN_ON_ONCE((unsigned long)entry & RADIX_DAX_EMPTY)) {
 		ret = -EIO;
 		goto unlock;
 	}
 
-	dax.sector = RADIX_DAX_SECTOR(entry);
-	dax.size = (type == RADIX_DAX_PMD ? PMD_SIZE : PAGE_SIZE);
+	/*
+	 * Even if dax_writeback_mapping_range() was given a wbc->range_start
+	 * in the middle of a PMD, the 'index' we are given will be aligned to
+	 * the start index of the PMD, as will the sector we pull from
+	 * 'entry'.  This allows us to flush for PMD_SIZE and not have to
+	 * worry about partial PMD writebacks.
+	 */
+	dax.sector = dax_radix_sector(entry);
+	dax.size = PAGE_SIZE << dax_radix_order(entry);
 	spin_unlock_irq(&mapping->tree_lock);
 
 	/*
@@ -719,12 +806,11 @@ int dax_writeback_mapping_range(struct address_space *mapping,
 		struct block_device *bdev, struct writeback_control *wbc)
 {
 	struct inode *inode = mapping->host;
-	pgoff_t start_index, end_index, pmd_index;
+	pgoff_t start_index, end_index;
 	pgoff_t indices[PAGEVEC_SIZE];
 	struct pagevec pvec;
 	bool done = false;
 	int i, ret = 0;
-	void *entry;
 
 	if (WARN_ON_ONCE(inode->i_blkbits != PAGE_SHIFT))
 		return -EIO;
@@ -734,15 +820,6 @@ int dax_writeback_mapping_range(struct address_space *mapping,
 
 	start_index = wbc->range_start >> PAGE_SHIFT;
 	end_index = wbc->range_end >> PAGE_SHIFT;
-	pmd_index = DAX_PMD_INDEX(start_index);
-
-	rcu_read_lock();
-	entry = radix_tree_lookup(&mapping->page_tree, pmd_index);
-	rcu_read_unlock();
-
-	/* see if the start of our range is covered by a PMD entry */
-	if (entry && RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD)
-		start_index = pmd_index;
 
 	tag_pages_for_writeback(mapping, start_index, end_index);
 
@@ -787,7 +864,7 @@ static int dax_insert_mapping(struct address_space *mapping,
 		return PTR_ERR(dax.addr);
 	dax_unmap_atomic(bdev, &dax);
 
-	ret = dax_insert_mapping_entry(mapping, vmf, entry, dax.sector);
+	ret = dax_insert_mapping_entry(mapping, vmf, entry, dax.sector, 0);
 	if (IS_ERR(ret))
 		return PTR_ERR(ret);
 	*entryp = ret;
@@ -834,7 +911,7 @@ int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 	bh.b_bdev = inode->i_sb->s_bdev;
 	bh.b_size = PAGE_SIZE;
 
-	entry = grab_mapping_entry(mapping, vmf->pgoff);
+	entry = grab_mapping_entry(mapping, vmf->pgoff, 0);
 	if (IS_ERR(entry)) {
 		error = PTR_ERR(entry);
 		goto out;
@@ -1154,7 +1231,7 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 	if (pos >= i_size_read(inode))
 		return VM_FAULT_SIGBUS;
 
-	entry = grab_mapping_entry(mapping, vmf->pgoff);
+	entry = grab_mapping_entry(mapping, vmf->pgoff, 0);
 	if (IS_ERR(entry)) {
 		error = PTR_ERR(entry);
 		goto out;
@@ -1261,4 +1338,186 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 	return VM_FAULT_NOPAGE | major;
 }
 EXPORT_SYMBOL_GPL(dax_iomap_fault);
+
+#ifdef CONFIG_FS_DAX_PMD
+/*
+ * The 'colour' (ie low bits) within a PMD of a page offset.  This comes up
+ * more often than one might expect in the below functions.
+ */
+#define PG_PMD_COLOUR	((PMD_SIZE >> PAGE_SHIFT) - 1)
+
+static int dax_pmd_insert_mapping(struct vm_area_struct *vma, pmd_t *pmd,
+		struct vm_fault *vmf, unsigned long address,
+		struct iomap *iomap, loff_t pos, bool write, void **entryp)
+{
+	struct address_space *mapping = vma->vm_file->f_mapping;
+	struct block_device *bdev = iomap->bdev;
+	struct blk_dax_ctl dax = {
+		.sector = dax_iomap_sector(iomap, pos),
+		.size = PMD_SIZE,
+	};
+	long length = dax_map_atomic(bdev, &dax);
+	void *ret;
+
+	if (length < 0) /* dax_map_atomic() failed */
+		return VM_FAULT_FALLBACK;
+	if (length < PMD_SIZE)
+		goto unmap_fallback;
+	if (pfn_t_to_pfn(dax.pfn) & PG_PMD_COLOUR)
+		goto unmap_fallback;
+	if (!pfn_t_devmap(dax.pfn))
+		goto unmap_fallback;
+
+	dax_unmap_atomic(bdev, &dax);
+
+	ret = dax_insert_mapping_entry(mapping, vmf, *entryp, dax.sector,
+			RADIX_DAX_PMD);
+	if (IS_ERR(ret))
+		return VM_FAULT_FALLBACK;
+	*entryp = ret;
+
+	return vmf_insert_pfn_pmd(vma, address, pmd, dax.pfn, write);
+
+ unmap_fallback:
+	dax_unmap_atomic(bdev, &dax);
+	return VM_FAULT_FALLBACK;
+}
+
+static int dax_pmd_load_hole(struct vm_area_struct *vma, pmd_t *pmd,
+		struct vm_fault *vmf, unsigned long address,
+		struct iomap *iomap, void **entryp)
+{
+	struct address_space *mapping = vma->vm_file->f_mapping;
+	unsigned long pmd_addr = address & PMD_MASK;
+	struct page *zero_page;
+	spinlock_t *ptl;
+	pmd_t pmd_entry;
+	void *ret;
+
+	zero_page = get_huge_zero_page();
+
+	if (unlikely(!zero_page))
+		return VM_FAULT_FALLBACK;
+
+	ret = dax_insert_mapping_entry(mapping, vmf, *entryp, 0,
+			RADIX_DAX_PMD | RADIX_DAX_HZP);
+	if (IS_ERR(ret))
+		return VM_FAULT_FALLBACK;
+	*entryp = ret;
+
+	ptl = pmd_lock(vma->vm_mm, pmd);
+	if (!pmd_none(*pmd)) {
+		spin_unlock(ptl);
+		return VM_FAULT_FALLBACK;
+	}
+
+	pmd_entry = mk_pmd(zero_page, vma->vm_page_prot);
+	pmd_entry = pmd_mkhuge(pmd_entry);
+	set_pmd_at(vma->vm_mm, pmd_addr, pmd, pmd_entry);
+	spin_unlock(ptl);
+	return VM_FAULT_NOPAGE;
+}
+
+int dax_iomap_pmd_fault(struct vm_area_struct *vma, unsigned long address,
+		pmd_t *pmd, unsigned int flags, struct iomap_ops *ops)
+{
+	struct address_space *mapping = vma->vm_file->f_mapping;
+	unsigned long pmd_addr = address & PMD_MASK;
+	bool write = flags & FAULT_FLAG_WRITE;
+	unsigned int iomap_flags = write ? IOMAP_WRITE : 0;
+	struct inode *inode = mapping->host;
+	int result = VM_FAULT_FALLBACK;
+	struct iomap iomap = { 0 };
+	pgoff_t size, pgoff;
+	struct vm_fault vmf;
+	void *entry;
+	loff_t pos;
+	int error;
+
+	/* Fall back to PTEs if we're going to COW */
+	if (write && !(vma->vm_flags & VM_SHARED)) {
+		split_huge_pmd(vma, pmd, address);
+		goto fallback;
+	}
+
+	/* If the PMD would extend outside the VMA */
+	if (pmd_addr < vma->vm_start)
+		goto fallback;
+	if ((pmd_addr + PMD_SIZE) > vma->vm_end)
+		goto fallback;
+
+	/*
+	 * Check whether offset isn't beyond end of file now. Caller is
+	 * supposed to hold locks serializing us with truncate / punch hole so
+	 * this is a reliable test.
+	 */
+	pgoff = linear_page_index(vma, pmd_addr);
+	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
+
+	if (pgoff >= size)
+		return VM_FAULT_SIGBUS;
+
+	/* If the PMD would extend beyond the file size */
+	if ((pgoff | PG_PMD_COLOUR) >= size)
+		goto fallback;
+
+	/*
+	 * grab_mapping_entry() will make sure we get a 2M empty entry, a DAX
+	 * PMD or a HZP entry.  If it can't (because a 4k page is already in
+	 * the tree, for instance), it will return -EEXIST and we just fall
+	 * back to 4k entries.
+	 */
+	entry = grab_mapping_entry(mapping, pgoff, RADIX_DAX_PMD);
+	if (IS_ERR(entry))
+		goto fallback;
+
+	/*
+	 * Note that we don't use iomap_apply here.  We aren't doing I/O, only
+	 * setting up a mapping, so really we're using iomap_begin() as a way
+	 * to look up our filesystem block.
+	 */
+	pos = (loff_t)pgoff << PAGE_SHIFT;
+	error = ops->iomap_begin(inode, pos, PMD_SIZE, iomap_flags, &iomap);
+	if (error)
+		goto unlock_entry;
+	if (iomap.offset + iomap.length < pos + PMD_SIZE)
+		goto finish_iomap;
+
+	vmf.pgoff = pgoff;
+	vmf.flags = flags;
+	vmf.gfp_mask = mapping_gfp_mask(mapping) | __GFP_IO;
+
+	switch (iomap.type) {
+	case IOMAP_MAPPED:
+		result = dax_pmd_insert_mapping(vma, pmd, &vmf, address,
+				&iomap, pos, write, &entry);
+		break;
+	case IOMAP_UNWRITTEN:
+	case IOMAP_HOLE:
+		if (WARN_ON_ONCE(write))
+			goto finish_iomap;
+		result = dax_pmd_load_hole(vma, pmd, &vmf, address, &iomap,
+				&entry);
+		break;
+	default:
+		WARN_ON_ONCE(1);
+		break;
+	}
+
+ finish_iomap:
+	if (ops->iomap_end) {
+		error = ops->iomap_end(inode, pos, PMD_SIZE, PMD_SIZE,
+				iomap_flags, &iomap);
+		if (error)
+			result = VM_FAULT_FALLBACK;
+	}
+ unlock_entry:
+	put_locked_mapping_entry(mapping, pgoff, entry);
+ fallback:
+	if (result == VM_FAULT_FALLBACK)
+		count_vm_event(THP_FAULT_FALLBACK);
+	return result;
+}
+EXPORT_SYMBOL_GPL(dax_iomap_pmd_fault);
+#endif /* CONFIG_FS_DAX_PMD */
 #endif /* CONFIG_FS_IOMAP */
diff --git a/include/linux/dax.h b/include/linux/dax.h
index e9ea78c..5d70367 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -9,19 +9,33 @@
 struct iomap_ops;
 
 /*
- * We use lowest available bit in exceptional entry for locking, other two
- * bits to determine entry type. In total 3 special bits.
+ * We use lowest available bit in exceptional entry for locking, one bit for
+ * the entry size (PMD) and two more to tell us if the entry is a huge zero
+ * page (HZP) or an empty entry that is just used for locking.  In total four
+ * special bits.
+ *
+ * If the PMD bit isn't set the entry has size PAGE_SIZE, and if the HZP and
+ * EMPTY bits aren't set the entry is a normal DAX entry with a filesystem
+ * block allocation.
  */
-#define RADIX_DAX_SHIFT	(RADIX_TREE_EXCEPTIONAL_SHIFT + 3)
+#define RADIX_DAX_SHIFT	(RADIX_TREE_EXCEPTIONAL_SHIFT + 4)
 #define RADIX_DAX_ENTRY_LOCK (1 << RADIX_TREE_EXCEPTIONAL_SHIFT)
-#define RADIX_DAX_PTE (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 1))
-#define RADIX_DAX_PMD (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 2))
-#define RADIX_DAX_TYPE_MASK (RADIX_DAX_PTE | RADIX_DAX_PMD)
-#define RADIX_DAX_TYPE(entry) ((unsigned long)entry & RADIX_DAX_TYPE_MASK)
-#define RADIX_DAX_SECTOR(entry) (((unsigned long)entry >> RADIX_DAX_SHIFT))
-#define RADIX_DAX_ENTRY(sector, pmd) ((void *)((unsigned long)sector << \
-		RADIX_DAX_SHIFT | (pmd ? RADIX_DAX_PMD : RADIX_DAX_PTE) | \
-		RADIX_TREE_EXCEPTIONAL_ENTRY))
+#define RADIX_DAX_PMD (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 1))
+#define RADIX_DAX_HZP (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 2))
+#define RADIX_DAX_EMPTY (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 3))
+
+static inline unsigned long dax_radix_sector(void *entry)
+{
+	return (unsigned long)entry >> RADIX_DAX_SHIFT;
+}
+
+/* entries begin locked */
+static inline void *dax_radix_entry(sector_t sector, unsigned long flags)
+{
+	return (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY | flags |
+			((unsigned long)sector << RADIX_DAX_SHIFT) |
+			RADIX_DAX_ENTRY_LOCK);
+}
 
 
 ssize_t dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
@@ -67,6 +81,27 @@ static inline int dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
 	return VM_FAULT_FALLBACK;
 }
 
+#ifdef CONFIG_FS_DAX_PMD
+static inline unsigned int dax_radix_order(void *entry)
+{
+	if ((unsigned long)entry & RADIX_DAX_PMD)
+		return PMD_SHIFT - PAGE_SHIFT;
+	return 0;
+}
+int dax_iomap_pmd_fault(struct vm_area_struct *vma, unsigned long address,
+		pmd_t *pmd, unsigned int flags, struct iomap_ops *ops);
+#else
+static inline unsigned int dax_radix_order(void *entry)
+{
+	return 0;
+}
+static inline int dax_iomap_pmd_fault(struct vm_area_struct *vma,
+		unsigned long address, pmd_t *pmd, unsigned int flags,
+		struct iomap_ops *ops)
+{
+	return VM_FAULT_FALLBACK;
+}
+#endif
 int dax_pfn_mkwrite(struct vm_area_struct *, struct vm_fault *);
 #define dax_mkwrite(vma, vmf, gb)	dax_fault(vma, vmf, gb)
 
diff --git a/mm/filemap.c b/mm/filemap.c
index a596462..4ed2df5 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -610,9 +610,7 @@ static int page_cache_tree_insert(struct address_space *mapping,
 				workingset_node_shadows_dec(node);
 		} else {
 			/* DAX can replace empty locked entry with a hole */
-			WARN_ON_ONCE(p !=
-				(void *)(RADIX_TREE_EXCEPTIONAL_ENTRY |
-					 RADIX_DAX_ENTRY_LOCK));
+			WARN_ON_ONCE(p != dax_radix_entry(0, RADIX_DAX_EMPTY));
 			/* DAX accounts exceptional entries as normal pages */
 			if (node)
 				workingset_node_pages_dec(node);
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH v5 16/17] xfs: use struct iomap based DAX PMD fault path
  2016-10-07 21:08 ` Ross Zwisler
  (?)
  (?)
@ 2016-10-07 21:09   ` Ross Zwisler
  -1 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-07 21:09 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

Switch xfs_filemap_pmd_fault() from using dax_pmd_fault() to the new and
improved dax_iomap_pmd_fault().  Also, now that it has no more users,
remove xfs_get_blocks_dax_fault().

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/xfs/xfs_aops.c | 26 +++++---------------------
 fs/xfs/xfs_aops.h |  3 ---
 fs/xfs/xfs_file.c |  2 +-
 3 files changed, 6 insertions(+), 25 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 0e2a931..1c73d0a 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -1298,8 +1298,7 @@ __xfs_get_blocks(
 	sector_t		iblock,
 	struct buffer_head	*bh_result,
 	int			create,
-	bool			direct,
-	bool			dax_fault)
+	bool			direct)
 {
 	struct xfs_inode	*ip = XFS_I(inode);
 	struct xfs_mount	*mp = ip->i_mount;
@@ -1420,13 +1419,8 @@ __xfs_get_blocks(
 		if (ISUNWRITTEN(&imap))
 			set_buffer_unwritten(bh_result);
 		/* direct IO needs special help */
-		if (create) {
-			if (dax_fault)
-				ASSERT(!ISUNWRITTEN(&imap));
-			else
-				xfs_map_direct(inode, bh_result, &imap, offset,
-						is_cow);
-		}
+		if (create)
+			xfs_map_direct(inode, bh_result, &imap, offset, is_cow);
 	}
 
 	/*
@@ -1466,7 +1460,7 @@ xfs_get_blocks(
 	struct buffer_head	*bh_result,
 	int			create)
 {
-	return __xfs_get_blocks(inode, iblock, bh_result, create, false, false);
+	return __xfs_get_blocks(inode, iblock, bh_result, create, false);
 }
 
 int
@@ -1476,17 +1470,7 @@ xfs_get_blocks_direct(
 	struct buffer_head	*bh_result,
 	int			create)
 {
-	return __xfs_get_blocks(inode, iblock, bh_result, create, true, false);
-}
-
-int
-xfs_get_blocks_dax_fault(
-	struct inode		*inode,
-	sector_t		iblock,
-	struct buffer_head	*bh_result,
-	int			create)
-{
-	return __xfs_get_blocks(inode, iblock, bh_result, create, true, true);
+	return __xfs_get_blocks(inode, iblock, bh_result, create, true);
 }
 
 /*
diff --git a/fs/xfs/xfs_aops.h b/fs/xfs/xfs_aops.h
index b3c6634..34dc00d 100644
--- a/fs/xfs/xfs_aops.h
+++ b/fs/xfs/xfs_aops.h
@@ -59,9 +59,6 @@ int	xfs_get_blocks(struct inode *inode, sector_t offset,
 		       struct buffer_head *map_bh, int create);
 int	xfs_get_blocks_direct(struct inode *inode, sector_t offset,
 			      struct buffer_head *map_bh, int create);
-int	xfs_get_blocks_dax_fault(struct inode *inode, sector_t offset,
-			         struct buffer_head *map_bh, int create);
-
 int	xfs_end_io_direct_write(struct kiocb *iocb, loff_t offset,
 		ssize_t size, void *private);
 int	xfs_setfilesize(struct xfs_inode *ip, xfs_off_t offset, size_t size);
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 8f12152..7b13dda 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1750,7 +1750,7 @@ xfs_filemap_pmd_fault(
 	}
 
 	xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
-	ret = dax_pmd_fault(vma, addr, pmd, flags, xfs_get_blocks_dax_fault);
+	ret = dax_iomap_pmd_fault(vma, addr, pmd, flags, &xfs_iomap_ops);
 	xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
 
 	if (flags & FAULT_FLAG_WRITE)
-- 
2.7.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH v5 16/17] xfs: use struct iomap based DAX PMD fault path
@ 2016-10-07 21:09   ` Ross Zwisler
  0 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-07 21:09 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

Switch xfs_filemap_pmd_fault() from using dax_pmd_fault() to the new and
improved dax_iomap_pmd_fault().  Also, now that it has no more users,
remove xfs_get_blocks_dax_fault().

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/xfs/xfs_aops.c | 26 +++++---------------------
 fs/xfs/xfs_aops.h |  3 ---
 fs/xfs/xfs_file.c |  2 +-
 3 files changed, 6 insertions(+), 25 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 0e2a931..1c73d0a 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -1298,8 +1298,7 @@ __xfs_get_blocks(
 	sector_t		iblock,
 	struct buffer_head	*bh_result,
 	int			create,
-	bool			direct,
-	bool			dax_fault)
+	bool			direct)
 {
 	struct xfs_inode	*ip = XFS_I(inode);
 	struct xfs_mount	*mp = ip->i_mount;
@@ -1420,13 +1419,8 @@ __xfs_get_blocks(
 		if (ISUNWRITTEN(&imap))
 			set_buffer_unwritten(bh_result);
 		/* direct IO needs special help */
-		if (create) {
-			if (dax_fault)
-				ASSERT(!ISUNWRITTEN(&imap));
-			else
-				xfs_map_direct(inode, bh_result, &imap, offset,
-						is_cow);
-		}
+		if (create)
+			xfs_map_direct(inode, bh_result, &imap, offset, is_cow);
 	}
 
 	/*
@@ -1466,7 +1460,7 @@ xfs_get_blocks(
 	struct buffer_head	*bh_result,
 	int			create)
 {
-	return __xfs_get_blocks(inode, iblock, bh_result, create, false, false);
+	return __xfs_get_blocks(inode, iblock, bh_result, create, false);
 }
 
 int
@@ -1476,17 +1470,7 @@ xfs_get_blocks_direct(
 	struct buffer_head	*bh_result,
 	int			create)
 {
-	return __xfs_get_blocks(inode, iblock, bh_result, create, true, false);
-}
-
-int
-xfs_get_blocks_dax_fault(
-	struct inode		*inode,
-	sector_t		iblock,
-	struct buffer_head	*bh_result,
-	int			create)
-{
-	return __xfs_get_blocks(inode, iblock, bh_result, create, true, true);
+	return __xfs_get_blocks(inode, iblock, bh_result, create, true);
 }
 
 /*
diff --git a/fs/xfs/xfs_aops.h b/fs/xfs/xfs_aops.h
index b3c6634..34dc00d 100644
--- a/fs/xfs/xfs_aops.h
+++ b/fs/xfs/xfs_aops.h
@@ -59,9 +59,6 @@ int	xfs_get_blocks(struct inode *inode, sector_t offset,
 		       struct buffer_head *map_bh, int create);
 int	xfs_get_blocks_direct(struct inode *inode, sector_t offset,
 			      struct buffer_head *map_bh, int create);
-int	xfs_get_blocks_dax_fault(struct inode *inode, sector_t offset,
-			         struct buffer_head *map_bh, int create);
-
 int	xfs_end_io_direct_write(struct kiocb *iocb, loff_t offset,
 		ssize_t size, void *private);
 int	xfs_setfilesize(struct xfs_inode *ip, xfs_off_t offset, size_t size);
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 8f12152..7b13dda 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1750,7 +1750,7 @@ xfs_filemap_pmd_fault(
 	}
 
 	xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
-	ret = dax_pmd_fault(vma, addr, pmd, flags, xfs_get_blocks_dax_fault);
+	ret = dax_iomap_pmd_fault(vma, addr, pmd, flags, &xfs_iomap_ops);
 	xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
 
 	if (flags & FAULT_FLAG_WRITE)
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH v5 16/17] xfs: use struct iomap based DAX PMD fault path
@ 2016-10-07 21:09   ` Ross Zwisler
  0 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-07 21:09 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: Theodore Ts'o, Matthew Wilcox, Dave Chinner,
	linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw, Christoph Hellwig,
	linux-xfs-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Andreas Dilger, Alexander Viro,
	Jan Kara, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-ext4-u79uwXL29TY76Z2rM5mHXA, Andrew Morton

Switch xfs_filemap_pmd_fault() from using dax_pmd_fault() to the new and
improved dax_iomap_pmd_fault().  Also, now that it has no more users,
remove xfs_get_blocks_dax_fault().

Signed-off-by: Ross Zwisler <ross.zwisler-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
---
 fs/xfs/xfs_aops.c | 26 +++++---------------------
 fs/xfs/xfs_aops.h |  3 ---
 fs/xfs/xfs_file.c |  2 +-
 3 files changed, 6 insertions(+), 25 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 0e2a931..1c73d0a 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -1298,8 +1298,7 @@ __xfs_get_blocks(
 	sector_t		iblock,
 	struct buffer_head	*bh_result,
 	int			create,
-	bool			direct,
-	bool			dax_fault)
+	bool			direct)
 {
 	struct xfs_inode	*ip = XFS_I(inode);
 	struct xfs_mount	*mp = ip->i_mount;
@@ -1420,13 +1419,8 @@ __xfs_get_blocks(
 		if (ISUNWRITTEN(&imap))
 			set_buffer_unwritten(bh_result);
 		/* direct IO needs special help */
-		if (create) {
-			if (dax_fault)
-				ASSERT(!ISUNWRITTEN(&imap));
-			else
-				xfs_map_direct(inode, bh_result, &imap, offset,
-						is_cow);
-		}
+		if (create)
+			xfs_map_direct(inode, bh_result, &imap, offset, is_cow);
 	}
 
 	/*
@@ -1466,7 +1460,7 @@ xfs_get_blocks(
 	struct buffer_head	*bh_result,
 	int			create)
 {
-	return __xfs_get_blocks(inode, iblock, bh_result, create, false, false);
+	return __xfs_get_blocks(inode, iblock, bh_result, create, false);
 }
 
 int
@@ -1476,17 +1470,7 @@ xfs_get_blocks_direct(
 	struct buffer_head	*bh_result,
 	int			create)
 {
-	return __xfs_get_blocks(inode, iblock, bh_result, create, true, false);
-}
-
-int
-xfs_get_blocks_dax_fault(
-	struct inode		*inode,
-	sector_t		iblock,
-	struct buffer_head	*bh_result,
-	int			create)
-{
-	return __xfs_get_blocks(inode, iblock, bh_result, create, true, true);
+	return __xfs_get_blocks(inode, iblock, bh_result, create, true);
 }
 
 /*
diff --git a/fs/xfs/xfs_aops.h b/fs/xfs/xfs_aops.h
index b3c6634..34dc00d 100644
--- a/fs/xfs/xfs_aops.h
+++ b/fs/xfs/xfs_aops.h
@@ -59,9 +59,6 @@ int	xfs_get_blocks(struct inode *inode, sector_t offset,
 		       struct buffer_head *map_bh, int create);
 int	xfs_get_blocks_direct(struct inode *inode, sector_t offset,
 			      struct buffer_head *map_bh, int create);
-int	xfs_get_blocks_dax_fault(struct inode *inode, sector_t offset,
-			         struct buffer_head *map_bh, int create);
-
 int	xfs_end_io_direct_write(struct kiocb *iocb, loff_t offset,
 		ssize_t size, void *private);
 int	xfs_setfilesize(struct xfs_inode *ip, xfs_off_t offset, size_t size);
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 8f12152..7b13dda 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1750,7 +1750,7 @@ xfs_filemap_pmd_fault(
 	}
 
 	xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
-	ret = dax_pmd_fault(vma, addr, pmd, flags, xfs_get_blocks_dax_fault);
+	ret = dax_iomap_pmd_fault(vma, addr, pmd, flags, &xfs_iomap_ops);
 	xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
 
 	if (flags & FAULT_FLAG_WRITE)
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH v5 16/17] xfs: use struct iomap based DAX PMD fault path
@ 2016-10-07 21:09   ` Ross Zwisler
  0 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-07 21:09 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

Switch xfs_filemap_pmd_fault() from using dax_pmd_fault() to the new and
improved dax_iomap_pmd_fault().  Also, now that it has no more users,
remove xfs_get_blocks_dax_fault().

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/xfs/xfs_aops.c | 26 +++++---------------------
 fs/xfs/xfs_aops.h |  3 ---
 fs/xfs/xfs_file.c |  2 +-
 3 files changed, 6 insertions(+), 25 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 0e2a931..1c73d0a 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -1298,8 +1298,7 @@ __xfs_get_blocks(
 	sector_t		iblock,
 	struct buffer_head	*bh_result,
 	int			create,
-	bool			direct,
-	bool			dax_fault)
+	bool			direct)
 {
 	struct xfs_inode	*ip = XFS_I(inode);
 	struct xfs_mount	*mp = ip->i_mount;
@@ -1420,13 +1419,8 @@ __xfs_get_blocks(
 		if (ISUNWRITTEN(&imap))
 			set_buffer_unwritten(bh_result);
 		/* direct IO needs special help */
-		if (create) {
-			if (dax_fault)
-				ASSERT(!ISUNWRITTEN(&imap));
-			else
-				xfs_map_direct(inode, bh_result, &imap, offset,
-						is_cow);
-		}
+		if (create)
+			xfs_map_direct(inode, bh_result, &imap, offset, is_cow);
 	}
 
 	/*
@@ -1466,7 +1460,7 @@ xfs_get_blocks(
 	struct buffer_head	*bh_result,
 	int			create)
 {
-	return __xfs_get_blocks(inode, iblock, bh_result, create, false, false);
+	return __xfs_get_blocks(inode, iblock, bh_result, create, false);
 }
 
 int
@@ -1476,17 +1470,7 @@ xfs_get_blocks_direct(
 	struct buffer_head	*bh_result,
 	int			create)
 {
-	return __xfs_get_blocks(inode, iblock, bh_result, create, true, false);
-}
-
-int
-xfs_get_blocks_dax_fault(
-	struct inode		*inode,
-	sector_t		iblock,
-	struct buffer_head	*bh_result,
-	int			create)
-{
-	return __xfs_get_blocks(inode, iblock, bh_result, create, true, true);
+	return __xfs_get_blocks(inode, iblock, bh_result, create, true);
 }
 
 /*
diff --git a/fs/xfs/xfs_aops.h b/fs/xfs/xfs_aops.h
index b3c6634..34dc00d 100644
--- a/fs/xfs/xfs_aops.h
+++ b/fs/xfs/xfs_aops.h
@@ -59,9 +59,6 @@ int	xfs_get_blocks(struct inode *inode, sector_t offset,
 		       struct buffer_head *map_bh, int create);
 int	xfs_get_blocks_direct(struct inode *inode, sector_t offset,
 			      struct buffer_head *map_bh, int create);
-int	xfs_get_blocks_dax_fault(struct inode *inode, sector_t offset,
-			         struct buffer_head *map_bh, int create);
-
 int	xfs_end_io_direct_write(struct kiocb *iocb, loff_t offset,
 		ssize_t size, void *private);
 int	xfs_setfilesize(struct xfs_inode *ip, xfs_off_t offset, size_t size);
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 8f12152..7b13dda 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1750,7 +1750,7 @@ xfs_filemap_pmd_fault(
 	}
 
 	xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
-	ret = dax_pmd_fault(vma, addr, pmd, flags, xfs_get_blocks_dax_fault);
+	ret = dax_iomap_pmd_fault(vma, addr, pmd, flags, &xfs_iomap_ops);
 	xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
 
 	if (flags & FAULT_FLAG_WRITE)
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH v5 17/17] dax: remove "depends on BROKEN" from FS_DAX_PMD
  2016-10-07 21:08 ` Ross Zwisler
                     ` (2 preceding siblings ...)
  (?)
@ 2016-10-07 21:09   ` Ross Zwisler
  -1 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-07 21:09 UTC (permalink / raw)
  To: linux-kernel
  Cc: Theodore Ts'o, Matthew Wilcox, Dave Chinner, linux-nvdimm,
	Christoph Hellwig, linux-xfs, linux-mm, Andreas Dilger,
	Alexander Viro, Jan Kara, linux-fsdevel, linux-ext4,
	Andrew Morton

Now that DAX PMD faults are once again working and are now participating in
DAX's radix tree locking scheme, allow their config option to be enabled.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/Kconfig | 1 -
 1 file changed, 1 deletion(-)

diff --git a/fs/Kconfig b/fs/Kconfig
index 2bc7ad7..b6f0fce 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -55,7 +55,6 @@ config FS_DAX_PMD
 	depends on FS_DAX
 	depends on ZONE_DEVICE
 	depends on TRANSPARENT_HUGEPAGE
-	depends on BROKEN
 
 endif # BLOCK
 
-- 
2.7.4

_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH v5 17/17] dax: remove "depends on BROKEN" from FS_DAX_PMD
@ 2016-10-07 21:09   ` Ross Zwisler
  0 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-07 21:09 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

Now that DAX PMD faults are once again working and are now participating in
DAX's radix tree locking scheme, allow their config option to be enabled.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/Kconfig | 1 -
 1 file changed, 1 deletion(-)

diff --git a/fs/Kconfig b/fs/Kconfig
index 2bc7ad7..b6f0fce 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -55,7 +55,6 @@ config FS_DAX_PMD
 	depends on FS_DAX
 	depends on ZONE_DEVICE
 	depends on TRANSPARENT_HUGEPAGE
-	depends on BROKEN
 
 endif # BLOCK
 
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH v5 17/17] dax: remove "depends on BROKEN" from FS_DAX_PMD
@ 2016-10-07 21:09   ` Ross Zwisler
  0 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-07 21:09 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

Now that DAX PMD faults are once again working and are now participating in
DAX's radix tree locking scheme, allow their config option to be enabled.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/Kconfig | 1 -
 1 file changed, 1 deletion(-)

diff --git a/fs/Kconfig b/fs/Kconfig
index 2bc7ad7..b6f0fce 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -55,7 +55,6 @@ config FS_DAX_PMD
 	depends on FS_DAX
 	depends on ZONE_DEVICE
 	depends on TRANSPARENT_HUGEPAGE
-	depends on BROKEN
 
 endif # BLOCK
 
-- 
2.7.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH v5 17/17] dax: remove "depends on BROKEN" from FS_DAX_PMD
@ 2016-10-07 21:09   ` Ross Zwisler
  0 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-07 21:09 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: Theodore Ts'o, Matthew Wilcox, Dave Chinner,
	linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw, Christoph Hellwig,
	linux-xfs-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Andreas Dilger, Alexander Viro,
	Jan Kara, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-ext4-u79uwXL29TY76Z2rM5mHXA, Andrew Morton

Now that DAX PMD faults are once again working and are now participating in
DAX's radix tree locking scheme, allow their config option to be enabled.

Signed-off-by: Ross Zwisler <ross.zwisler-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
---
 fs/Kconfig | 1 -
 1 file changed, 1 deletion(-)

diff --git a/fs/Kconfig b/fs/Kconfig
index 2bc7ad7..b6f0fce 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -55,7 +55,6 @@ config FS_DAX_PMD
 	depends on FS_DAX
 	depends on ZONE_DEVICE
 	depends on TRANSPARENT_HUGEPAGE
-	depends on BROKEN
 
 endif # BLOCK
 
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH v5 17/17] dax: remove "depends on BROKEN" from FS_DAX_PMD
@ 2016-10-07 21:09   ` Ross Zwisler
  0 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-07 21:09 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

Now that DAX PMD faults are once again working and are now participating in
DAX's radix tree locking scheme, allow their config option to be enabled.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/Kconfig | 1 -
 1 file changed, 1 deletion(-)

diff --git a/fs/Kconfig b/fs/Kconfig
index 2bc7ad7..b6f0fce 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -55,7 +55,6 @@ config FS_DAX_PMD
 	depends on FS_DAX
 	depends on ZONE_DEVICE
 	depends on TRANSPARENT_HUGEPAGE
-	depends on BROKEN
 
 endif # BLOCK
 
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 182+ messages in thread

* Re: [PATCH v5 11/17] dax: correct dax iomap code namespace
  2016-10-07 21:08   ` Ross Zwisler
  (?)
  (?)
@ 2016-10-09 15:28     ` Christoph Hellwig
  -1 siblings, 0 replies; 182+ messages in thread
From: Christoph Hellwig @ 2016-10-09 15:28 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

Can you send this one to Dave for 4.9?  It would be silly to rename
something one merge window after it's just been introduced.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 182+ messages in thread

* Re: [PATCH v5 11/17] dax: correct dax iomap code namespace
@ 2016-10-09 15:28     ` Christoph Hellwig
  0 siblings, 0 replies; 182+ messages in thread
From: Christoph Hellwig @ 2016-10-09 15:28 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

Can you send this one to Dave for 4.9?  It would be silly to rename
something one merge window after it's just been introduced.

^ permalink raw reply	[flat|nested] 182+ messages in thread

* Re: [PATCH v5 11/17] dax: correct dax iomap code namespace
@ 2016-10-09 15:28     ` Christoph Hellwig
  0 siblings, 0 replies; 182+ messages in thread
From: Christoph Hellwig @ 2016-10-09 15:28 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Theodore Ts'o, Matthew Wilcox, Dave Chinner,
	linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-xfs-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Andreas Dilger, Alexander Viro,
	Jan Kara, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Andrew Morton,
	linux-ext4-u79uwXL29TY76Z2rM5mHXA, Christoph Hellwig

Can you send this one to Dave for 4.9?  It would be silly to rename
something one merge window after it's just been introduced.

^ permalink raw reply	[flat|nested] 182+ messages in thread

* Re: [PATCH v5 11/17] dax: correct dax iomap code namespace
@ 2016-10-09 15:28     ` Christoph Hellwig
  0 siblings, 0 replies; 182+ messages in thread
From: Christoph Hellwig @ 2016-10-09 15:28 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

Can you send this one to Dave for 4.9?  It would be silly to rename
something one merge window after it's just been introduced.

^ permalink raw reply	[flat|nested] 182+ messages in thread

* Re: [PATCH v5 07/17] dax: remove the last BUG_ON() from fs/dax.c
  2016-10-07 21:08   ` Ross Zwisler
                       ` (2 preceding siblings ...)
  (?)
@ 2016-10-10 15:45     ` Christoph Hellwig
  -1 siblings, 0 replies; 182+ messages in thread
From: Christoph Hellwig @ 2016-10-10 15:45 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Theodore Ts'o, Matthew Wilcox, Dave Chinner, linux-nvdimm,
	linux-kernel, linux-xfs, linux-mm, Andreas Dilger,
	Alexander Viro, Jan Kara, linux-fsdevel, Andrew Morton,
	linux-ext4, Christoph Hellwig

Looks fine,

Reviewed-by: Christoph Hellwig <hch@lst.de>
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 182+ messages in thread

* Re: [PATCH v5 07/17] dax: remove the last BUG_ON() from fs/dax.c
@ 2016-10-10 15:45     ` Christoph Hellwig
  0 siblings, 0 replies; 182+ messages in thread
From: Christoph Hellwig @ 2016-10-10 15:45 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

Looks fine,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 182+ messages in thread

* Re: [PATCH v5 07/17] dax: remove the last BUG_ON() from fs/dax.c
@ 2016-10-10 15:45     ` Christoph Hellwig
  0 siblings, 0 replies; 182+ messages in thread
From: Christoph Hellwig @ 2016-10-10 15:45 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

Looks fine,

Reviewed-by: Christoph Hellwig <hch@lst.de>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 182+ messages in thread

* Re: [PATCH v5 07/17] dax: remove the last BUG_ON() from fs/dax.c
@ 2016-10-10 15:45     ` Christoph Hellwig
  0 siblings, 0 replies; 182+ messages in thread
From: Christoph Hellwig @ 2016-10-10 15:45 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Theodore Ts'o, Matthew Wilcox, Dave Chinner,
	linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-xfs-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Andreas Dilger, Alexander Viro,
	Jan Kara, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Andrew Morton,
	linux-ext4-u79uwXL29TY76Z2rM5mHXA, Christoph Hellwig

Looks fine,

Reviewed-by: Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org>

^ permalink raw reply	[flat|nested] 182+ messages in thread

* Re: [PATCH v5 07/17] dax: remove the last BUG_ON() from fs/dax.c
@ 2016-10-10 15:45     ` Christoph Hellwig
  0 siblings, 0 replies; 182+ messages in thread
From: Christoph Hellwig @ 2016-10-10 15:45 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

Looks fine,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 182+ messages in thread

* Re: [PATCH v5 09/17] dax: coordinate locking for offsets in PMD range
  2016-10-07 21:08   ` Ross Zwisler
  (?)
@ 2016-10-10 15:46     ` Christoph Hellwig
  -1 siblings, 0 replies; 182+ messages in thread
From: Christoph Hellwig @ 2016-10-10 15:46 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

Looks fine,

Reviewed-by: Christoph Hellwig <hch@lst.de>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 182+ messages in thread

* Re: [PATCH v5 09/17] dax: coordinate locking for offsets in PMD range
@ 2016-10-10 15:46     ` Christoph Hellwig
  0 siblings, 0 replies; 182+ messages in thread
From: Christoph Hellwig @ 2016-10-10 15:46 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

Looks fine,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 182+ messages in thread

* Re: [PATCH v5 09/17] dax: coordinate locking for offsets in PMD range
@ 2016-10-10 15:46     ` Christoph Hellwig
  0 siblings, 0 replies; 182+ messages in thread
From: Christoph Hellwig @ 2016-10-10 15:46 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

Looks fine,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 182+ messages in thread

* Re: [PATCH v5 12/17] dax: add dax_iomap_sector() helper function
  2016-10-07 21:08   ` Ross Zwisler
                       ` (2 preceding siblings ...)
  (?)
@ 2016-10-10 15:46     ` Christoph Hellwig
  -1 siblings, 0 replies; 182+ messages in thread
From: Christoph Hellwig @ 2016-10-10 15:46 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Theodore Ts'o, Matthew Wilcox, Dave Chinner, linux-nvdimm,
	linux-kernel, linux-xfs, linux-mm, Andreas Dilger,
	Alexander Viro, Jan Kara, linux-fsdevel, Andrew Morton,
	linux-ext4, Christoph Hellwig

Looks fine,

Reviewed-by: Christoph Hellwig <hch@lst.de>
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 182+ messages in thread

* Re: [PATCH v5 12/17] dax: add dax_iomap_sector() helper function
@ 2016-10-10 15:46     ` Christoph Hellwig
  0 siblings, 0 replies; 182+ messages in thread
From: Christoph Hellwig @ 2016-10-10 15:46 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

Looks fine,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 182+ messages in thread

* Re: [PATCH v5 12/17] dax: add dax_iomap_sector() helper function
@ 2016-10-10 15:46     ` Christoph Hellwig
  0 siblings, 0 replies; 182+ messages in thread
From: Christoph Hellwig @ 2016-10-10 15:46 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

Looks fine,

Reviewed-by: Christoph Hellwig <hch@lst.de>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 182+ messages in thread

* Re: [PATCH v5 12/17] dax: add dax_iomap_sector() helper function
@ 2016-10-10 15:46     ` Christoph Hellwig
  0 siblings, 0 replies; 182+ messages in thread
From: Christoph Hellwig @ 2016-10-10 15:46 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Theodore Ts'o, Matthew Wilcox, Dave Chinner,
	linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-xfs-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Andreas Dilger, Alexander Viro,
	Jan Kara, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Andrew Morton,
	linux-ext4-u79uwXL29TY76Z2rM5mHXA, Christoph Hellwig

Looks fine,

Reviewed-by: Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org>

^ permalink raw reply	[flat|nested] 182+ messages in thread

* Re: [PATCH v5 12/17] dax: add dax_iomap_sector() helper function
@ 2016-10-10 15:46     ` Christoph Hellwig
  0 siblings, 0 replies; 182+ messages in thread
From: Christoph Hellwig @ 2016-10-10 15:46 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

Looks fine,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 182+ messages in thread

* Re: [PATCH v5 13/17] dax: dax_iomap_fault() needs to call iomap_end()
  2016-10-07 21:09   ` Ross Zwisler
  (?)
  (?)
@ 2016-10-10 15:50     ` Christoph Hellwig
  -1 siblings, 0 replies; 182+ messages in thread
From: Christoph Hellwig @ 2016-10-10 15:50 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Theodore Ts'o, Matthew Wilcox, Dave Chinner, linux-nvdimm,
	linux-kernel, linux-xfs, linux-mm, Andreas Dilger,
	Alexander Viro, Jan Kara, linux-fsdevel, Andrew Morton,
	linux-ext4, Christoph Hellwig

On Fri, Oct 07, 2016 at 03:09:00PM -0600, Ross Zwisler wrote:
> Currently iomap_end() doesn't do anything for DAX page faults for both ext2
> and XFS.  ext2_iomap_end() just checks for a write underrun, and
> xfs_file_iomap_end() checks to see if it needs to finish a delayed
> allocation.  However, in the future iomap_end() calls might be needed to
> make sure we have balanced allocations, locks, etc.  So, add calls to
> iomap_end() with appropriate error handling to dax_iomap_fault().

Is there a way to just have a single call to iomap_end at the end of
the function, after which we just return a previosuly setup return
value?

e.g.

out:
	if (ops->iomap_end) {
		error = ops->iomap_end(inode, pos, PAGE_SIZE,
				PAGE_SIZE, flags, &iomap);
	}

	if (error == -ENOMEM)
		return VM_FAULT_OOM | major;
	if (error < 0 && error != -EBUSY)
		return VM_FAULT_SIGBUS | major;
	return ret;
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 182+ messages in thread

* Re: [PATCH v5 13/17] dax: dax_iomap_fault() needs to call iomap_end()
@ 2016-10-10 15:50     ` Christoph Hellwig
  0 siblings, 0 replies; 182+ messages in thread
From: Christoph Hellwig @ 2016-10-10 15:50 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

On Fri, Oct 07, 2016 at 03:09:00PM -0600, Ross Zwisler wrote:
> Currently iomap_end() doesn't do anything for DAX page faults for both ext2
> and XFS.  ext2_iomap_end() just checks for a write underrun, and
> xfs_file_iomap_end() checks to see if it needs to finish a delayed
> allocation.  However, in the future iomap_end() calls might be needed to
> make sure we have balanced allocations, locks, etc.  So, add calls to
> iomap_end() with appropriate error handling to dax_iomap_fault().

Is there a way to just have a single call to iomap_end at the end of
the function, after which we just return a previosuly setup return
value?

e.g.

out:
	if (ops->iomap_end) {
		error = ops->iomap_end(inode, pos, PAGE_SIZE,
				PAGE_SIZE, flags, &iomap);
	}

	if (error == -ENOMEM)
		return VM_FAULT_OOM | major;
	if (error < 0 && error != -EBUSY)
		return VM_FAULT_SIGBUS | major;
	return ret;

^ permalink raw reply	[flat|nested] 182+ messages in thread

* Re: [PATCH v5 13/17] dax: dax_iomap_fault() needs to call iomap_end()
@ 2016-10-10 15:50     ` Christoph Hellwig
  0 siblings, 0 replies; 182+ messages in thread
From: Christoph Hellwig @ 2016-10-10 15:50 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

On Fri, Oct 07, 2016 at 03:09:00PM -0600, Ross Zwisler wrote:
> Currently iomap_end() doesn't do anything for DAX page faults for both ext2
> and XFS.  ext2_iomap_end() just checks for a write underrun, and
> xfs_file_iomap_end() checks to see if it needs to finish a delayed
> allocation.  However, in the future iomap_end() calls might be needed to
> make sure we have balanced allocations, locks, etc.  So, add calls to
> iomap_end() with appropriate error handling to dax_iomap_fault().

Is there a way to just have a single call to iomap_end at the end of
the function, after which we just return a previosuly setup return
value?

e.g.

out:
	if (ops->iomap_end) {
		error = ops->iomap_end(inode, pos, PAGE_SIZE,
				PAGE_SIZE, flags, &iomap);
	}

	if (error == -ENOMEM)
		return VM_FAULT_OOM | major;
	if (error < 0 && error != -EBUSY)
		return VM_FAULT_SIGBUS | major;
	return ret;

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 182+ messages in thread

* Re: [PATCH v5 13/17] dax: dax_iomap_fault() needs to call iomap_end()
@ 2016-10-10 15:50     ` Christoph Hellwig
  0 siblings, 0 replies; 182+ messages in thread
From: Christoph Hellwig @ 2016-10-10 15:50 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

On Fri, Oct 07, 2016 at 03:09:00PM -0600, Ross Zwisler wrote:
> Currently iomap_end() doesn't do anything for DAX page faults for both ext2
> and XFS.  ext2_iomap_end() just checks for a write underrun, and
> xfs_file_iomap_end() checks to see if it needs to finish a delayed
> allocation.  However, in the future iomap_end() calls might be needed to
> make sure we have balanced allocations, locks, etc.  So, add calls to
> iomap_end() with appropriate error handling to dax_iomap_fault().

Is there a way to just have a single call to iomap_end at the end of
the function, after which we just return a previosuly setup return
value?

e.g.

out:
	if (ops->iomap_end) {
		error = ops->iomap_end(inode, pos, PAGE_SIZE,
				PAGE_SIZE, flags, &iomap);
	}

	if (error == -ENOMEM)
		return VM_FAULT_OOM | major;
	if (error < 0 && error != -EBUSY)
		return VM_FAULT_SIGBUS | major;
	return ret;

^ permalink raw reply	[flat|nested] 182+ messages in thread

* Re: [PATCH v5 14/17] dax: move RADIX_DAX_* defines to dax.h
  2016-10-07 21:09   ` Ross Zwisler
                       ` (2 preceding siblings ...)
  (?)
@ 2016-10-10 15:50     ` Christoph Hellwig
  -1 siblings, 0 replies; 182+ messages in thread
From: Christoph Hellwig @ 2016-10-10 15:50 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Theodore Ts'o, Matthew Wilcox, Dave Chinner, linux-nvdimm,
	linux-kernel, linux-xfs, linux-mm, Andreas Dilger,
	Alexander Viro, Jan Kara, linux-fsdevel, Andrew Morton,
	linux-ext4, Christoph Hellwig

On Fri, Oct 07, 2016 at 03:09:01PM -0600, Ross Zwisler wrote:
> The RADIX_DAX_* defines currently mostly live in fs/dax.c, with just
> RADIX_DAX_ENTRY_LOCK being in include/linux/dax.h so it can be used in
> mm/filemap.c.  When we add PMD support, though, mm/filemap.c will also need
> access to the RADIX_DAX_PTE type so it can properly construct a 4k sized
> empty entry.
> 
> Instead of shifting the defines between dax.c and dax.h as they are
> individually used in other code, just move them wholesale to dax.h so
> they'll be available when we need them.

Looks fine, assuming that the macros get cleaned up in the next patches..

Reviewed-by: Christoph Hellwig <hch@lst.de>
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 182+ messages in thread

* Re: [PATCH v5 14/17] dax: move RADIX_DAX_* defines to dax.h
@ 2016-10-10 15:50     ` Christoph Hellwig
  0 siblings, 0 replies; 182+ messages in thread
From: Christoph Hellwig @ 2016-10-10 15:50 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

On Fri, Oct 07, 2016 at 03:09:01PM -0600, Ross Zwisler wrote:
> The RADIX_DAX_* defines currently mostly live in fs/dax.c, with just
> RADIX_DAX_ENTRY_LOCK being in include/linux/dax.h so it can be used in
> mm/filemap.c.  When we add PMD support, though, mm/filemap.c will also need
> access to the RADIX_DAX_PTE type so it can properly construct a 4k sized
> empty entry.
> 
> Instead of shifting the defines between dax.c and dax.h as they are
> individually used in other code, just move them wholesale to dax.h so
> they'll be available when we need them.

Looks fine, assuming that the macros get cleaned up in the next patches..

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 182+ messages in thread

* Re: [PATCH v5 14/17] dax: move RADIX_DAX_* defines to dax.h
@ 2016-10-10 15:50     ` Christoph Hellwig
  0 siblings, 0 replies; 182+ messages in thread
From: Christoph Hellwig @ 2016-10-10 15:50 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

On Fri, Oct 07, 2016 at 03:09:01PM -0600, Ross Zwisler wrote:
> The RADIX_DAX_* defines currently mostly live in fs/dax.c, with just
> RADIX_DAX_ENTRY_LOCK being in include/linux/dax.h so it can be used in
> mm/filemap.c.  When we add PMD support, though, mm/filemap.c will also need
> access to the RADIX_DAX_PTE type so it can properly construct a 4k sized
> empty entry.
> 
> Instead of shifting the defines between dax.c and dax.h as they are
> individually used in other code, just move them wholesale to dax.h so
> they'll be available when we need them.

Looks fine, assuming that the macros get cleaned up in the next patches..

Reviewed-by: Christoph Hellwig <hch@lst.de>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 182+ messages in thread

* Re: [PATCH v5 14/17] dax: move RADIX_DAX_* defines to dax.h
@ 2016-10-10 15:50     ` Christoph Hellwig
  0 siblings, 0 replies; 182+ messages in thread
From: Christoph Hellwig @ 2016-10-10 15:50 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Theodore Ts'o, Matthew Wilcox, Dave Chinner,
	linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-xfs-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Andreas Dilger, Alexander Viro,
	Jan Kara, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Andrew Morton,
	linux-ext4-u79uwXL29TY76Z2rM5mHXA, Christoph Hellwig

On Fri, Oct 07, 2016 at 03:09:01PM -0600, Ross Zwisler wrote:
> The RADIX_DAX_* defines currently mostly live in fs/dax.c, with just
> RADIX_DAX_ENTRY_LOCK being in include/linux/dax.h so it can be used in
> mm/filemap.c.  When we add PMD support, though, mm/filemap.c will also need
> access to the RADIX_DAX_PTE type so it can properly construct a 4k sized
> empty entry.
> 
> Instead of shifting the defines between dax.c and dax.h as they are
> individually used in other code, just move them wholesale to dax.h so
> they'll be available when we need them.

Looks fine, assuming that the macros get cleaned up in the next patches..

Reviewed-by: Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org>

^ permalink raw reply	[flat|nested] 182+ messages in thread

* Re: [PATCH v5 14/17] dax: move RADIX_DAX_* defines to dax.h
@ 2016-10-10 15:50     ` Christoph Hellwig
  0 siblings, 0 replies; 182+ messages in thread
From: Christoph Hellwig @ 2016-10-10 15:50 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

On Fri, Oct 07, 2016 at 03:09:01PM -0600, Ross Zwisler wrote:
> The RADIX_DAX_* defines currently mostly live in fs/dax.c, with just
> RADIX_DAX_ENTRY_LOCK being in include/linux/dax.h so it can be used in
> mm/filemap.c.  When we add PMD support, though, mm/filemap.c will also need
> access to the RADIX_DAX_PTE type so it can properly construct a 4k sized
> empty entry.
> 
> Instead of shifting the defines between dax.c and dax.h as they are
> individually used in other code, just move them wholesale to dax.h so
> they'll be available when we need them.

Looks fine, assuming that the macros get cleaned up in the next patches..

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 182+ messages in thread

* Re: [PATCH v5 15/17] dax: add struct iomap based DAX PMD support
  2016-10-07 21:09   ` Ross Zwisler
  (?)
  (?)
@ 2016-10-10 15:59     ` Christoph Hellwig
  -1 siblings, 0 replies; 182+ messages in thread
From: Christoph Hellwig @ 2016-10-10 15:59 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

On Fri, Oct 07, 2016 at 03:09:02PM -0600, Ross Zwisler wrote:
> -	if (RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD)
> +	if ((unsigned long)entry & RADIX_DAX_PMD)

Please introduce a proper inline helper that mask all the possible type
bits out of the radix tree entry, and use them wherever you do the
open cast.

>  restart:
>  	spin_lock_irq(&mapping->tree_lock);
>  	entry = get_unlocked_mapping_entry(mapping, index, &slot);
> +
> +	if (entry) {
> +		if (size_flag & RADIX_DAX_PMD) {
> +			if (!radix_tree_exceptional_entry(entry) ||
> +			    !((unsigned long)entry & RADIX_DAX_PMD)) {
> +				entry = ERR_PTR(-EEXIST);
> +				goto out_unlock;
> +			}
> +		} else { /* trying to grab a PTE entry */
> +			if (radix_tree_exceptional_entry(entry) &&
> +			    ((unsigned long)entry & RADIX_DAX_PMD) &&
> +			    ((unsigned long)entry &
> +			     (RADIX_DAX_HZP|RADIX_DAX_EMPTY))) {

And when we do these cases N times next to each other we should
have a local variable the valid flag bits of entry.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 182+ messages in thread

* Re: [PATCH v5 15/17] dax: add struct iomap based DAX PMD support
@ 2016-10-10 15:59     ` Christoph Hellwig
  0 siblings, 0 replies; 182+ messages in thread
From: Christoph Hellwig @ 2016-10-10 15:59 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

On Fri, Oct 07, 2016 at 03:09:02PM -0600, Ross Zwisler wrote:
> -	if (RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD)
> +	if ((unsigned long)entry & RADIX_DAX_PMD)

Please introduce a proper inline helper that mask all the possible type
bits out of the radix tree entry, and use them wherever you do the
open cast.

>  restart:
>  	spin_lock_irq(&mapping->tree_lock);
>  	entry = get_unlocked_mapping_entry(mapping, index, &slot);
> +
> +	if (entry) {
> +		if (size_flag & RADIX_DAX_PMD) {
> +			if (!radix_tree_exceptional_entry(entry) ||
> +			    !((unsigned long)entry & RADIX_DAX_PMD)) {
> +				entry = ERR_PTR(-EEXIST);
> +				goto out_unlock;
> +			}
> +		} else { /* trying to grab a PTE entry */
> +			if (radix_tree_exceptional_entry(entry) &&
> +			    ((unsigned long)entry & RADIX_DAX_PMD) &&
> +			    ((unsigned long)entry &
> +			     (RADIX_DAX_HZP|RADIX_DAX_EMPTY))) {

And when we do these cases N times next to each other we should
have a local variable the valid flag bits of entry.

^ permalink raw reply	[flat|nested] 182+ messages in thread

* Re: [PATCH v5 15/17] dax: add struct iomap based DAX PMD support
@ 2016-10-10 15:59     ` Christoph Hellwig
  0 siblings, 0 replies; 182+ messages in thread
From: Christoph Hellwig @ 2016-10-10 15:59 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Theodore Ts'o, Matthew Wilcox, Dave Chinner,
	linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-xfs-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Andreas Dilger, Alexander Viro,
	Jan Kara, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Andrew Morton,
	linux-ext4-u79uwXL29TY76Z2rM5mHXA, Christoph Hellwig

On Fri, Oct 07, 2016 at 03:09:02PM -0600, Ross Zwisler wrote:
> -	if (RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD)
> +	if ((unsigned long)entry & RADIX_DAX_PMD)

Please introduce a proper inline helper that mask all the possible type
bits out of the radix tree entry, and use them wherever you do the
open cast.

>  restart:
>  	spin_lock_irq(&mapping->tree_lock);
>  	entry = get_unlocked_mapping_entry(mapping, index, &slot);
> +
> +	if (entry) {
> +		if (size_flag & RADIX_DAX_PMD) {
> +			if (!radix_tree_exceptional_entry(entry) ||
> +			    !((unsigned long)entry & RADIX_DAX_PMD)) {
> +				entry = ERR_PTR(-EEXIST);
> +				goto out_unlock;
> +			}
> +		} else { /* trying to grab a PTE entry */
> +			if (radix_tree_exceptional_entry(entry) &&
> +			    ((unsigned long)entry & RADIX_DAX_PMD) &&
> +			    ((unsigned long)entry &
> +			     (RADIX_DAX_HZP|RADIX_DAX_EMPTY))) {

And when we do these cases N times next to each other we should
have a local variable the valid flag bits of entry.

^ permalink raw reply	[flat|nested] 182+ messages in thread

* Re: [PATCH v5 15/17] dax: add struct iomap based DAX PMD support
@ 2016-10-10 15:59     ` Christoph Hellwig
  0 siblings, 0 replies; 182+ messages in thread
From: Christoph Hellwig @ 2016-10-10 15:59 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

On Fri, Oct 07, 2016 at 03:09:02PM -0600, Ross Zwisler wrote:
> -	if (RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD)
> +	if ((unsigned long)entry & RADIX_DAX_PMD)

Please introduce a proper inline helper that mask all the possible type
bits out of the radix tree entry, and use them wherever you do the
open cast.

>  restart:
>  	spin_lock_irq(&mapping->tree_lock);
>  	entry = get_unlocked_mapping_entry(mapping, index, &slot);
> +
> +	if (entry) {
> +		if (size_flag & RADIX_DAX_PMD) {
> +			if (!radix_tree_exceptional_entry(entry) ||
> +			    !((unsigned long)entry & RADIX_DAX_PMD)) {
> +				entry = ERR_PTR(-EEXIST);
> +				goto out_unlock;
> +			}
> +		} else { /* trying to grab a PTE entry */
> +			if (radix_tree_exceptional_entry(entry) &&
> +			    ((unsigned long)entry & RADIX_DAX_PMD) &&
> +			    ((unsigned long)entry &
> +			     (RADIX_DAX_HZP|RADIX_DAX_EMPTY))) {

And when we do these cases N times next to each other we should
have a local variable the valid flag bits of entry.

^ permalink raw reply	[flat|nested] 182+ messages in thread

* [PATCH] dax: correct dax iomap code namespace
  2016-10-09 15:28     ` Christoph Hellwig
  (?)
  (?)
@ 2016-10-10 19:04       ` Ross Zwisler
  -1 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-10 19:04 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Matthew Wilcox, linux-nvdimm, Dave Chinner,
	Christoph Hellwig, linux-xfs, Alexander Viro, Jan Kara,
	linux-fsdevel, linux-ext4

The recently added DAX functions that use the new struct iomap data
structure were named iomap_dax_rw(), iomap_dax_fault() and
iomap_dax_actor().  These are actually defined in fs/dax.c, though, so
should be part of the "dax" namespace and not the "iomap" namespace.
Rename them to dax_iomap_rw(), dax_iomap_fault() and dax_iomap_actor()
respectively.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Suggested-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
---

Dave, per Christoph's suggestion can you please send this to Linus for
v4.9?  This patch applies cleanly to both the current linux/master and the
current xfs/for-next.

---
 fs/dax.c            | 16 ++++++++--------
 fs/ext2/file.c      |  6 +++---
 fs/xfs/xfs_file.c   |  8 ++++----
 include/linux/dax.h |  4 ++--
 4 files changed, 17 insertions(+), 17 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 1b9154a..982ccbb 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1024,7 +1024,7 @@ EXPORT_SYMBOL_GPL(dax_truncate_page);
 
 #ifdef CONFIG_FS_IOMAP
 static loff_t
-iomap_dax_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
+dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 		struct iomap *iomap)
 {
 	struct iov_iter *iter = data;
@@ -1081,7 +1081,7 @@ iomap_dax_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 }
 
 /**
- * iomap_dax_rw - Perform I/O to a DAX file
+ * dax_iomap_rw - Perform I/O to a DAX file
  * @iocb:	The control block for this I/O
  * @iter:	The addresses to do I/O from or to
  * @ops:	iomap ops passed from the file system
@@ -1091,7 +1091,7 @@ iomap_dax_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
  * and evicting any page cache pages in the region under I/O.
  */
 ssize_t
-iomap_dax_rw(struct kiocb *iocb, struct iov_iter *iter,
+dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
 		struct iomap_ops *ops)
 {
 	struct address_space *mapping = iocb->ki_filp->f_mapping;
@@ -1121,7 +1121,7 @@ iomap_dax_rw(struct kiocb *iocb, struct iov_iter *iter,
 
 	while (iov_iter_count(iter)) {
 		ret = iomap_apply(inode, pos, iov_iter_count(iter), flags, ops,
-				iter, iomap_dax_actor);
+				iter, dax_iomap_actor);
 		if (ret <= 0)
 			break;
 		pos += ret;
@@ -1131,10 +1131,10 @@ iomap_dax_rw(struct kiocb *iocb, struct iov_iter *iter,
 	iocb->ki_pos += done;
 	return done ? done : ret;
 }
-EXPORT_SYMBOL_GPL(iomap_dax_rw);
+EXPORT_SYMBOL_GPL(dax_iomap_rw);
 
 /**
- * iomap_dax_fault - handle a page fault on a DAX file
+ * dax_iomap_fault - handle a page fault on a DAX file
  * @vma: The virtual memory area where the fault occurred
  * @vmf: The description of the fault
  * @ops: iomap ops passed from the file system
@@ -1143,7 +1143,7 @@ EXPORT_SYMBOL_GPL(iomap_dax_rw);
  * or mkwrite handler for DAX files. Assumes the caller has done all the
  * necessary locking for the page fault to proceed successfully.
  */
-int iomap_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
+int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 			struct iomap_ops *ops)
 {
 	struct address_space *mapping = vma->vm_file->f_mapping;
@@ -1245,5 +1245,5 @@ int iomap_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 		return VM_FAULT_SIGBUS | major;
 	return VM_FAULT_NOPAGE | major;
 }
-EXPORT_SYMBOL_GPL(iomap_dax_fault);
+EXPORT_SYMBOL_GPL(dax_iomap_fault);
 #endif /* CONFIG_FS_IOMAP */
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index 0f257f8..32a4913 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -38,7 +38,7 @@ static ssize_t ext2_dax_read_iter(struct kiocb *iocb, struct iov_iter *to)
 		return 0; /* skip atime */
 
 	inode_lock_shared(inode);
-	ret = iomap_dax_rw(iocb, to, &ext2_iomap_ops);
+	ret = dax_iomap_rw(iocb, to, &ext2_iomap_ops);
 	inode_unlock_shared(inode);
 
 	file_accessed(iocb->ki_filp);
@@ -62,7 +62,7 @@ static ssize_t ext2_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	if (ret)
 		goto out_unlock;
 
-	ret = iomap_dax_rw(iocb, from, &ext2_iomap_ops);
+	ret = dax_iomap_rw(iocb, from, &ext2_iomap_ops);
 	if (ret > 0 && iocb->ki_pos > i_size_read(inode)) {
 		i_size_write(inode, iocb->ki_pos);
 		mark_inode_dirty(inode);
@@ -99,7 +99,7 @@ static int ext2_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 	}
 	down_read(&ei->dax_sem);
 
-	ret = iomap_dax_fault(vma, vmf, &ext2_iomap_ops);
+	ret = dax_iomap_fault(vma, vmf, &ext2_iomap_ops);
 
 	up_read(&ei->dax_sem);
 	if (vmf->flags & FAULT_FLAG_WRITE)
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index de7c53c..8f12152 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -344,7 +344,7 @@ xfs_file_dax_read(
 		return 0; /* skip atime */
 
 	xfs_rw_ilock(ip, XFS_IOLOCK_SHARED);
-	ret = iomap_dax_rw(iocb, to, &xfs_iomap_ops);
+	ret = dax_iomap_rw(iocb, to, &xfs_iomap_ops);
 	xfs_rw_iunlock(ip, XFS_IOLOCK_SHARED);
 
 	file_accessed(iocb->ki_filp);
@@ -730,7 +730,7 @@ xfs_file_dax_write(
 
 	trace_xfs_file_dax_write(ip, count, pos);
 
-	ret = iomap_dax_rw(iocb, from, &xfs_iomap_ops);
+	ret = dax_iomap_rw(iocb, from, &xfs_iomap_ops);
 	if (ret > 0 && iocb->ki_pos > i_size_read(inode)) {
 		i_size_write(inode, iocb->ki_pos);
 		error = xfs_setfilesize(ip, pos, ret);
@@ -1679,7 +1679,7 @@ xfs_filemap_page_mkwrite(
 	xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
 
 	if (IS_DAX(inode)) {
-		ret = iomap_dax_fault(vma, vmf, &xfs_iomap_ops);
+		ret = dax_iomap_fault(vma, vmf, &xfs_iomap_ops);
 	} else {
 		ret = iomap_page_mkwrite(vma, vmf, &xfs_iomap_ops);
 		ret = block_page_mkwrite_return(ret);
@@ -1713,7 +1713,7 @@ xfs_filemap_fault(
 		 * changes to xfs_get_blocks_direct() to map unwritten extent
 		 * ioend for conversion on read-only mappings.
 		 */
-		ret = iomap_dax_fault(vma, vmf, &xfs_iomap_ops);
+		ret = dax_iomap_fault(vma, vmf, &xfs_iomap_ops);
 	} else
 		ret = filemap_fault(vma, vmf);
 	xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 0f74866..a3dfee4 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -11,13 +11,13 @@ struct iomap_ops;
 /* We use lowest available exceptional entry bit for locking */
 #define RADIX_DAX_ENTRY_LOCK (1 << RADIX_TREE_EXCEPTIONAL_SHIFT)
 
-ssize_t iomap_dax_rw(struct kiocb *iocb, struct iov_iter *iter,
+ssize_t dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
 		struct iomap_ops *ops);
 ssize_t dax_do_io(struct kiocb *, struct inode *, struct iov_iter *,
 		  get_block_t, dio_iodone_t, int flags);
 int dax_zero_page_range(struct inode *, loff_t from, unsigned len, get_block_t);
 int dax_truncate_page(struct inode *, loff_t from, get_block_t);
-int iomap_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
+int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 			struct iomap_ops *ops);
 int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t);
 int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index);
-- 
2.7.4

_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH] dax: correct dax iomap code namespace
@ 2016-10-10 19:04       ` Ross Zwisler
  0 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-10 19:04 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, Alexander Viro, Andrew Morton, Christoph Hellwig,
	Dan Williams, Dave Chinner, Jan Kara, Matthew Wilcox, linux-ext4,
	linux-fsdevel, linux-nvdimm, linux-xfs

The recently added DAX functions that use the new struct iomap data
structure were named iomap_dax_rw(), iomap_dax_fault() and
iomap_dax_actor().  These are actually defined in fs/dax.c, though, so
should be part of the "dax" namespace and not the "iomap" namespace.
Rename them to dax_iomap_rw(), dax_iomap_fault() and dax_iomap_actor()
respectively.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Suggested-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
---

Dave, per Christoph's suggestion can you please send this to Linus for
v4.9?  This patch applies cleanly to both the current linux/master and the
current xfs/for-next.

---
 fs/dax.c            | 16 ++++++++--------
 fs/ext2/file.c      |  6 +++---
 fs/xfs/xfs_file.c   |  8 ++++----
 include/linux/dax.h |  4 ++--
 4 files changed, 17 insertions(+), 17 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 1b9154a..982ccbb 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1024,7 +1024,7 @@ EXPORT_SYMBOL_GPL(dax_truncate_page);
 
 #ifdef CONFIG_FS_IOMAP
 static loff_t
-iomap_dax_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
+dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 		struct iomap *iomap)
 {
 	struct iov_iter *iter = data;
@@ -1081,7 +1081,7 @@ iomap_dax_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 }
 
 /**
- * iomap_dax_rw - Perform I/O to a DAX file
+ * dax_iomap_rw - Perform I/O to a DAX file
  * @iocb:	The control block for this I/O
  * @iter:	The addresses to do I/O from or to
  * @ops:	iomap ops passed from the file system
@@ -1091,7 +1091,7 @@ iomap_dax_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
  * and evicting any page cache pages in the region under I/O.
  */
 ssize_t
-iomap_dax_rw(struct kiocb *iocb, struct iov_iter *iter,
+dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
 		struct iomap_ops *ops)
 {
 	struct address_space *mapping = iocb->ki_filp->f_mapping;
@@ -1121,7 +1121,7 @@ iomap_dax_rw(struct kiocb *iocb, struct iov_iter *iter,
 
 	while (iov_iter_count(iter)) {
 		ret = iomap_apply(inode, pos, iov_iter_count(iter), flags, ops,
-				iter, iomap_dax_actor);
+				iter, dax_iomap_actor);
 		if (ret <= 0)
 			break;
 		pos += ret;
@@ -1131,10 +1131,10 @@ iomap_dax_rw(struct kiocb *iocb, struct iov_iter *iter,
 	iocb->ki_pos += done;
 	return done ? done : ret;
 }
-EXPORT_SYMBOL_GPL(iomap_dax_rw);
+EXPORT_SYMBOL_GPL(dax_iomap_rw);
 
 /**
- * iomap_dax_fault - handle a page fault on a DAX file
+ * dax_iomap_fault - handle a page fault on a DAX file
  * @vma: The virtual memory area where the fault occurred
  * @vmf: The description of the fault
  * @ops: iomap ops passed from the file system
@@ -1143,7 +1143,7 @@ EXPORT_SYMBOL_GPL(iomap_dax_rw);
  * or mkwrite handler for DAX files. Assumes the caller has done all the
  * necessary locking for the page fault to proceed successfully.
  */
-int iomap_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
+int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 			struct iomap_ops *ops)
 {
 	struct address_space *mapping = vma->vm_file->f_mapping;
@@ -1245,5 +1245,5 @@ int iomap_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 		return VM_FAULT_SIGBUS | major;
 	return VM_FAULT_NOPAGE | major;
 }
-EXPORT_SYMBOL_GPL(iomap_dax_fault);
+EXPORT_SYMBOL_GPL(dax_iomap_fault);
 #endif /* CONFIG_FS_IOMAP */
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index 0f257f8..32a4913 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -38,7 +38,7 @@ static ssize_t ext2_dax_read_iter(struct kiocb *iocb, struct iov_iter *to)
 		return 0; /* skip atime */
 
 	inode_lock_shared(inode);
-	ret = iomap_dax_rw(iocb, to, &ext2_iomap_ops);
+	ret = dax_iomap_rw(iocb, to, &ext2_iomap_ops);
 	inode_unlock_shared(inode);
 
 	file_accessed(iocb->ki_filp);
@@ -62,7 +62,7 @@ static ssize_t ext2_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	if (ret)
 		goto out_unlock;
 
-	ret = iomap_dax_rw(iocb, from, &ext2_iomap_ops);
+	ret = dax_iomap_rw(iocb, from, &ext2_iomap_ops);
 	if (ret > 0 && iocb->ki_pos > i_size_read(inode)) {
 		i_size_write(inode, iocb->ki_pos);
 		mark_inode_dirty(inode);
@@ -99,7 +99,7 @@ static int ext2_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 	}
 	down_read(&ei->dax_sem);
 
-	ret = iomap_dax_fault(vma, vmf, &ext2_iomap_ops);
+	ret = dax_iomap_fault(vma, vmf, &ext2_iomap_ops);
 
 	up_read(&ei->dax_sem);
 	if (vmf->flags & FAULT_FLAG_WRITE)
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index de7c53c..8f12152 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -344,7 +344,7 @@ xfs_file_dax_read(
 		return 0; /* skip atime */
 
 	xfs_rw_ilock(ip, XFS_IOLOCK_SHARED);
-	ret = iomap_dax_rw(iocb, to, &xfs_iomap_ops);
+	ret = dax_iomap_rw(iocb, to, &xfs_iomap_ops);
 	xfs_rw_iunlock(ip, XFS_IOLOCK_SHARED);
 
 	file_accessed(iocb->ki_filp);
@@ -730,7 +730,7 @@ xfs_file_dax_write(
 
 	trace_xfs_file_dax_write(ip, count, pos);
 
-	ret = iomap_dax_rw(iocb, from, &xfs_iomap_ops);
+	ret = dax_iomap_rw(iocb, from, &xfs_iomap_ops);
 	if (ret > 0 && iocb->ki_pos > i_size_read(inode)) {
 		i_size_write(inode, iocb->ki_pos);
 		error = xfs_setfilesize(ip, pos, ret);
@@ -1679,7 +1679,7 @@ xfs_filemap_page_mkwrite(
 	xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
 
 	if (IS_DAX(inode)) {
-		ret = iomap_dax_fault(vma, vmf, &xfs_iomap_ops);
+		ret = dax_iomap_fault(vma, vmf, &xfs_iomap_ops);
 	} else {
 		ret = iomap_page_mkwrite(vma, vmf, &xfs_iomap_ops);
 		ret = block_page_mkwrite_return(ret);
@@ -1713,7 +1713,7 @@ xfs_filemap_fault(
 		 * changes to xfs_get_blocks_direct() to map unwritten extent
 		 * ioend for conversion on read-only mappings.
 		 */
-		ret = iomap_dax_fault(vma, vmf, &xfs_iomap_ops);
+		ret = dax_iomap_fault(vma, vmf, &xfs_iomap_ops);
 	} else
 		ret = filemap_fault(vma, vmf);
 	xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 0f74866..a3dfee4 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -11,13 +11,13 @@ struct iomap_ops;
 /* We use lowest available exceptional entry bit for locking */
 #define RADIX_DAX_ENTRY_LOCK (1 << RADIX_TREE_EXCEPTIONAL_SHIFT)
 
-ssize_t iomap_dax_rw(struct kiocb *iocb, struct iov_iter *iter,
+ssize_t dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
 		struct iomap_ops *ops);
 ssize_t dax_do_io(struct kiocb *, struct inode *, struct iov_iter *,
 		  get_block_t, dio_iodone_t, int flags);
 int dax_zero_page_range(struct inode *, loff_t from, unsigned len, get_block_t);
 int dax_truncate_page(struct inode *, loff_t from, get_block_t);
-int iomap_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
+int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 			struct iomap_ops *ops);
 int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t);
 int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index);
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH] dax: correct dax iomap code namespace
@ 2016-10-10 19:04       ` Ross Zwisler
  0 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-10 19:04 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, Alexander Viro, Andrew Morton, Christoph Hellwig,
	Dan Williams, Dave Chinner, Jan Kara, Matthew Wilcox, linux-ext4,
	linux-fsdevel, linux-nvdimm, linux-xfs

The recently added DAX functions that use the new struct iomap data
structure were named iomap_dax_rw(), iomap_dax_fault() and
iomap_dax_actor().  These are actually defined in fs/dax.c, though, so
should be part of the "dax" namespace and not the "iomap" namespace.
Rename them to dax_iomap_rw(), dax_iomap_fault() and dax_iomap_actor()
respectively.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Suggested-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
---

Dave, per Christoph's suggestion can you please send this to Linus for
v4.9?  This patch applies cleanly to both the current linux/master and the
current xfs/for-next.

---
 fs/dax.c            | 16 ++++++++--------
 fs/ext2/file.c      |  6 +++---
 fs/xfs/xfs_file.c   |  8 ++++----
 include/linux/dax.h |  4 ++--
 4 files changed, 17 insertions(+), 17 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 1b9154a..982ccbb 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1024,7 +1024,7 @@ EXPORT_SYMBOL_GPL(dax_truncate_page);
 
 #ifdef CONFIG_FS_IOMAP
 static loff_t
-iomap_dax_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
+dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 		struct iomap *iomap)
 {
 	struct iov_iter *iter = data;
@@ -1081,7 +1081,7 @@ iomap_dax_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 }
 
 /**
- * iomap_dax_rw - Perform I/O to a DAX file
+ * dax_iomap_rw - Perform I/O to a DAX file
  * @iocb:	The control block for this I/O
  * @iter:	The addresses to do I/O from or to
  * @ops:	iomap ops passed from the file system
@@ -1091,7 +1091,7 @@ iomap_dax_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
  * and evicting any page cache pages in the region under I/O.
  */
 ssize_t
-iomap_dax_rw(struct kiocb *iocb, struct iov_iter *iter,
+dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
 		struct iomap_ops *ops)
 {
 	struct address_space *mapping = iocb->ki_filp->f_mapping;
@@ -1121,7 +1121,7 @@ iomap_dax_rw(struct kiocb *iocb, struct iov_iter *iter,
 
 	while (iov_iter_count(iter)) {
 		ret = iomap_apply(inode, pos, iov_iter_count(iter), flags, ops,
-				iter, iomap_dax_actor);
+				iter, dax_iomap_actor);
 		if (ret <= 0)
 			break;
 		pos += ret;
@@ -1131,10 +1131,10 @@ iomap_dax_rw(struct kiocb *iocb, struct iov_iter *iter,
 	iocb->ki_pos += done;
 	return done ? done : ret;
 }
-EXPORT_SYMBOL_GPL(iomap_dax_rw);
+EXPORT_SYMBOL_GPL(dax_iomap_rw);
 
 /**
- * iomap_dax_fault - handle a page fault on a DAX file
+ * dax_iomap_fault - handle a page fault on a DAX file
  * @vma: The virtual memory area where the fault occurred
  * @vmf: The description of the fault
  * @ops: iomap ops passed from the file system
@@ -1143,7 +1143,7 @@ EXPORT_SYMBOL_GPL(iomap_dax_rw);
  * or mkwrite handler for DAX files. Assumes the caller has done all the
  * necessary locking for the page fault to proceed successfully.
  */
-int iomap_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
+int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 			struct iomap_ops *ops)
 {
 	struct address_space *mapping = vma->vm_file->f_mapping;
@@ -1245,5 +1245,5 @@ int iomap_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 		return VM_FAULT_SIGBUS | major;
 	return VM_FAULT_NOPAGE | major;
 }
-EXPORT_SYMBOL_GPL(iomap_dax_fault);
+EXPORT_SYMBOL_GPL(dax_iomap_fault);
 #endif /* CONFIG_FS_IOMAP */
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index 0f257f8..32a4913 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -38,7 +38,7 @@ static ssize_t ext2_dax_read_iter(struct kiocb *iocb, struct iov_iter *to)
 		return 0; /* skip atime */
 
 	inode_lock_shared(inode);
-	ret = iomap_dax_rw(iocb, to, &ext2_iomap_ops);
+	ret = dax_iomap_rw(iocb, to, &ext2_iomap_ops);
 	inode_unlock_shared(inode);
 
 	file_accessed(iocb->ki_filp);
@@ -62,7 +62,7 @@ static ssize_t ext2_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	if (ret)
 		goto out_unlock;
 
-	ret = iomap_dax_rw(iocb, from, &ext2_iomap_ops);
+	ret = dax_iomap_rw(iocb, from, &ext2_iomap_ops);
 	if (ret > 0 && iocb->ki_pos > i_size_read(inode)) {
 		i_size_write(inode, iocb->ki_pos);
 		mark_inode_dirty(inode);
@@ -99,7 +99,7 @@ static int ext2_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 	}
 	down_read(&ei->dax_sem);
 
-	ret = iomap_dax_fault(vma, vmf, &ext2_iomap_ops);
+	ret = dax_iomap_fault(vma, vmf, &ext2_iomap_ops);
 
 	up_read(&ei->dax_sem);
 	if (vmf->flags & FAULT_FLAG_WRITE)
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index de7c53c..8f12152 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -344,7 +344,7 @@ xfs_file_dax_read(
 		return 0; /* skip atime */
 
 	xfs_rw_ilock(ip, XFS_IOLOCK_SHARED);
-	ret = iomap_dax_rw(iocb, to, &xfs_iomap_ops);
+	ret = dax_iomap_rw(iocb, to, &xfs_iomap_ops);
 	xfs_rw_iunlock(ip, XFS_IOLOCK_SHARED);
 
 	file_accessed(iocb->ki_filp);
@@ -730,7 +730,7 @@ xfs_file_dax_write(
 
 	trace_xfs_file_dax_write(ip, count, pos);
 
-	ret = iomap_dax_rw(iocb, from, &xfs_iomap_ops);
+	ret = dax_iomap_rw(iocb, from, &xfs_iomap_ops);
 	if (ret > 0 && iocb->ki_pos > i_size_read(inode)) {
 		i_size_write(inode, iocb->ki_pos);
 		error = xfs_setfilesize(ip, pos, ret);
@@ -1679,7 +1679,7 @@ xfs_filemap_page_mkwrite(
 	xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
 
 	if (IS_DAX(inode)) {
-		ret = iomap_dax_fault(vma, vmf, &xfs_iomap_ops);
+		ret = dax_iomap_fault(vma, vmf, &xfs_iomap_ops);
 	} else {
 		ret = iomap_page_mkwrite(vma, vmf, &xfs_iomap_ops);
 		ret = block_page_mkwrite_return(ret);
@@ -1713,7 +1713,7 @@ xfs_filemap_fault(
 		 * changes to xfs_get_blocks_direct() to map unwritten extent
 		 * ioend for conversion on read-only mappings.
 		 */
-		ret = iomap_dax_fault(vma, vmf, &xfs_iomap_ops);
+		ret = dax_iomap_fault(vma, vmf, &xfs_iomap_ops);
 	} else
 		ret = filemap_fault(vma, vmf);
 	xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 0f74866..a3dfee4 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -11,13 +11,13 @@ struct iomap_ops;
 /* We use lowest available exceptional entry bit for locking */
 #define RADIX_DAX_ENTRY_LOCK (1 << RADIX_TREE_EXCEPTIONAL_SHIFT)
 
-ssize_t iomap_dax_rw(struct kiocb *iocb, struct iov_iter *iter,
+ssize_t dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
 		struct iomap_ops *ops);
 ssize_t dax_do_io(struct kiocb *, struct inode *, struct iov_iter *,
 		  get_block_t, dio_iodone_t, int flags);
 int dax_zero_page_range(struct inode *, loff_t from, unsigned len, get_block_t);
 int dax_truncate_page(struct inode *, loff_t from, get_block_t);
-int iomap_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
+int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 			struct iomap_ops *ops);
 int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t);
 int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index);
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH] dax: correct dax iomap code namespace
@ 2016-10-10 19:04       ` Ross Zwisler
  0 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-10 19:04 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: Andrew Morton, Matthew Wilcox,
	linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw, Dave Chinner,
	Christoph Hellwig, linux-xfs-u79uwXL29TY76Z2rM5mHXA,
	Alexander Viro, Jan Kara, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-ext4-u79uwXL29TY76Z2rM5mHXA

The recently added DAX functions that use the new struct iomap data
structure were named iomap_dax_rw(), iomap_dax_fault() and
iomap_dax_actor().  These are actually defined in fs/dax.c, though, so
should be part of the "dax" namespace and not the "iomap" namespace.
Rename them to dax_iomap_rw(), dax_iomap_fault() and dax_iomap_actor()
respectively.

Signed-off-by: Ross Zwisler <ross.zwisler-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
Suggested-by: Dave Chinner <david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org>
Reviewed-by: Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org>
Reviewed-by: Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org>
---

Dave, per Christoph's suggestion can you please send this to Linus for
v4.9?  This patch applies cleanly to both the current linux/master and the
current xfs/for-next.

---
 fs/dax.c            | 16 ++++++++--------
 fs/ext2/file.c      |  6 +++---
 fs/xfs/xfs_file.c   |  8 ++++----
 include/linux/dax.h |  4 ++--
 4 files changed, 17 insertions(+), 17 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 1b9154a..982ccbb 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1024,7 +1024,7 @@ EXPORT_SYMBOL_GPL(dax_truncate_page);
 
 #ifdef CONFIG_FS_IOMAP
 static loff_t
-iomap_dax_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
+dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 		struct iomap *iomap)
 {
 	struct iov_iter *iter = data;
@@ -1081,7 +1081,7 @@ iomap_dax_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 }
 
 /**
- * iomap_dax_rw - Perform I/O to a DAX file
+ * dax_iomap_rw - Perform I/O to a DAX file
  * @iocb:	The control block for this I/O
  * @iter:	The addresses to do I/O from or to
  * @ops:	iomap ops passed from the file system
@@ -1091,7 +1091,7 @@ iomap_dax_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
  * and evicting any page cache pages in the region under I/O.
  */
 ssize_t
-iomap_dax_rw(struct kiocb *iocb, struct iov_iter *iter,
+dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
 		struct iomap_ops *ops)
 {
 	struct address_space *mapping = iocb->ki_filp->f_mapping;
@@ -1121,7 +1121,7 @@ iomap_dax_rw(struct kiocb *iocb, struct iov_iter *iter,
 
 	while (iov_iter_count(iter)) {
 		ret = iomap_apply(inode, pos, iov_iter_count(iter), flags, ops,
-				iter, iomap_dax_actor);
+				iter, dax_iomap_actor);
 		if (ret <= 0)
 			break;
 		pos += ret;
@@ -1131,10 +1131,10 @@ iomap_dax_rw(struct kiocb *iocb, struct iov_iter *iter,
 	iocb->ki_pos += done;
 	return done ? done : ret;
 }
-EXPORT_SYMBOL_GPL(iomap_dax_rw);
+EXPORT_SYMBOL_GPL(dax_iomap_rw);
 
 /**
- * iomap_dax_fault - handle a page fault on a DAX file
+ * dax_iomap_fault - handle a page fault on a DAX file
  * @vma: The virtual memory area where the fault occurred
  * @vmf: The description of the fault
  * @ops: iomap ops passed from the file system
@@ -1143,7 +1143,7 @@ EXPORT_SYMBOL_GPL(iomap_dax_rw);
  * or mkwrite handler for DAX files. Assumes the caller has done all the
  * necessary locking for the page fault to proceed successfully.
  */
-int iomap_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
+int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 			struct iomap_ops *ops)
 {
 	struct address_space *mapping = vma->vm_file->f_mapping;
@@ -1245,5 +1245,5 @@ int iomap_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 		return VM_FAULT_SIGBUS | major;
 	return VM_FAULT_NOPAGE | major;
 }
-EXPORT_SYMBOL_GPL(iomap_dax_fault);
+EXPORT_SYMBOL_GPL(dax_iomap_fault);
 #endif /* CONFIG_FS_IOMAP */
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index 0f257f8..32a4913 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -38,7 +38,7 @@ static ssize_t ext2_dax_read_iter(struct kiocb *iocb, struct iov_iter *to)
 		return 0; /* skip atime */
 
 	inode_lock_shared(inode);
-	ret = iomap_dax_rw(iocb, to, &ext2_iomap_ops);
+	ret = dax_iomap_rw(iocb, to, &ext2_iomap_ops);
 	inode_unlock_shared(inode);
 
 	file_accessed(iocb->ki_filp);
@@ -62,7 +62,7 @@ static ssize_t ext2_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	if (ret)
 		goto out_unlock;
 
-	ret = iomap_dax_rw(iocb, from, &ext2_iomap_ops);
+	ret = dax_iomap_rw(iocb, from, &ext2_iomap_ops);
 	if (ret > 0 && iocb->ki_pos > i_size_read(inode)) {
 		i_size_write(inode, iocb->ki_pos);
 		mark_inode_dirty(inode);
@@ -99,7 +99,7 @@ static int ext2_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 	}
 	down_read(&ei->dax_sem);
 
-	ret = iomap_dax_fault(vma, vmf, &ext2_iomap_ops);
+	ret = dax_iomap_fault(vma, vmf, &ext2_iomap_ops);
 
 	up_read(&ei->dax_sem);
 	if (vmf->flags & FAULT_FLAG_WRITE)
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index de7c53c..8f12152 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -344,7 +344,7 @@ xfs_file_dax_read(
 		return 0; /* skip atime */
 
 	xfs_rw_ilock(ip, XFS_IOLOCK_SHARED);
-	ret = iomap_dax_rw(iocb, to, &xfs_iomap_ops);
+	ret = dax_iomap_rw(iocb, to, &xfs_iomap_ops);
 	xfs_rw_iunlock(ip, XFS_IOLOCK_SHARED);
 
 	file_accessed(iocb->ki_filp);
@@ -730,7 +730,7 @@ xfs_file_dax_write(
 
 	trace_xfs_file_dax_write(ip, count, pos);
 
-	ret = iomap_dax_rw(iocb, from, &xfs_iomap_ops);
+	ret = dax_iomap_rw(iocb, from, &xfs_iomap_ops);
 	if (ret > 0 && iocb->ki_pos > i_size_read(inode)) {
 		i_size_write(inode, iocb->ki_pos);
 		error = xfs_setfilesize(ip, pos, ret);
@@ -1679,7 +1679,7 @@ xfs_filemap_page_mkwrite(
 	xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
 
 	if (IS_DAX(inode)) {
-		ret = iomap_dax_fault(vma, vmf, &xfs_iomap_ops);
+		ret = dax_iomap_fault(vma, vmf, &xfs_iomap_ops);
 	} else {
 		ret = iomap_page_mkwrite(vma, vmf, &xfs_iomap_ops);
 		ret = block_page_mkwrite_return(ret);
@@ -1713,7 +1713,7 @@ xfs_filemap_fault(
 		 * changes to xfs_get_blocks_direct() to map unwritten extent
 		 * ioend for conversion on read-only mappings.
 		 */
-		ret = iomap_dax_fault(vma, vmf, &xfs_iomap_ops);
+		ret = dax_iomap_fault(vma, vmf, &xfs_iomap_ops);
 	} else
 		ret = filemap_fault(vma, vmf);
 	xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 0f74866..a3dfee4 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -11,13 +11,13 @@ struct iomap_ops;
 /* We use lowest available exceptional entry bit for locking */
 #define RADIX_DAX_ENTRY_LOCK (1 << RADIX_TREE_EXCEPTIONAL_SHIFT)
 
-ssize_t iomap_dax_rw(struct kiocb *iocb, struct iov_iter *iter,
+ssize_t dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
 		struct iomap_ops *ops);
 ssize_t dax_do_io(struct kiocb *, struct inode *, struct iov_iter *,
 		  get_block_t, dio_iodone_t, int flags);
 int dax_zero_page_range(struct inode *, loff_t from, unsigned len, get_block_t);
 int dax_truncate_page(struct inode *, loff_t from, get_block_t);
-int iomap_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
+int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 			struct iomap_ops *ops);
 int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t);
 int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index);
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* Re: [PATCH] dax: correct dax iomap code namespace
  2016-10-10 19:04       ` Ross Zwisler
  (?)
  (?)
@ 2016-10-10 20:19         ` Dave Chinner
  -1 siblings, 0 replies; 182+ messages in thread
From: Dave Chinner @ 2016-10-10 20:19 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Matthew Wilcox, linux-nvdimm, linux-kernel, linux-xfs,
	Alexander Viro, Jan Kara, linux-fsdevel, Andrew Morton,
	linux-ext4, Christoph Hellwig

On Mon, Oct 10, 2016 at 01:04:04PM -0600, Ross Zwisler wrote:
> The recently added DAX functions that use the new struct iomap data
> structure were named iomap_dax_rw(), iomap_dax_fault() and
> iomap_dax_actor().  These are actually defined in fs/dax.c, though, so
> should be part of the "dax" namespace and not the "iomap" namespace.
> Rename them to dax_iomap_rw(), dax_iomap_fault() and dax_iomap_actor()
> respectively.
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> Suggested-by: Dave Chinner <david@fromorbit.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: Jan Kara <jack@suse.cz>
> ---
> 
> Dave, per Christoph's suggestion can you please send this to Linus for
> v4.9?  This patch applies cleanly to both the current linux/master and the
> current xfs/for-next.

I'll plan to send this for 4.9-rc2 or 4.9-rc3 in the first batch of
fixes for the -rc1 merge.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 182+ messages in thread

* Re: [PATCH] dax: correct dax iomap code namespace
@ 2016-10-10 20:19         ` Dave Chinner
  0 siblings, 0 replies; 182+ messages in thread
From: Dave Chinner @ 2016-10-10 20:19 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, Alexander Viro, Andrew Morton, Christoph Hellwig,
	Dan Williams, Jan Kara, Matthew Wilcox, linux-ext4,
	linux-fsdevel, linux-nvdimm, linux-xfs

On Mon, Oct 10, 2016 at 01:04:04PM -0600, Ross Zwisler wrote:
> The recently added DAX functions that use the new struct iomap data
> structure were named iomap_dax_rw(), iomap_dax_fault() and
> iomap_dax_actor().  These are actually defined in fs/dax.c, though, so
> should be part of the "dax" namespace and not the "iomap" namespace.
> Rename them to dax_iomap_rw(), dax_iomap_fault() and dax_iomap_actor()
> respectively.
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> Suggested-by: Dave Chinner <david@fromorbit.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: Jan Kara <jack@suse.cz>
> ---
> 
> Dave, per Christoph's suggestion can you please send this to Linus for
> v4.9?  This patch applies cleanly to both the current linux/master and the
> current xfs/for-next.

I'll plan to send this for 4.9-rc2 or 4.9-rc3 in the first batch of
fixes for the -rc1 merge.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 182+ messages in thread

* Re: [PATCH] dax: correct dax iomap code namespace
@ 2016-10-10 20:19         ` Dave Chinner
  0 siblings, 0 replies; 182+ messages in thread
From: Dave Chinner @ 2016-10-10 20:19 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, Alexander Viro, Andrew Morton, Christoph Hellwig,
	Dan Williams, Jan Kara, Matthew Wilcox, linux-ext4,
	linux-fsdevel, linux-nvdimm, linux-xfs

On Mon, Oct 10, 2016 at 01:04:04PM -0600, Ross Zwisler wrote:
> The recently added DAX functions that use the new struct iomap data
> structure were named iomap_dax_rw(), iomap_dax_fault() and
> iomap_dax_actor().  These are actually defined in fs/dax.c, though, so
> should be part of the "dax" namespace and not the "iomap" namespace.
> Rename them to dax_iomap_rw(), dax_iomap_fault() and dax_iomap_actor()
> respectively.
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> Suggested-by: Dave Chinner <david@fromorbit.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: Jan Kara <jack@suse.cz>
> ---
> 
> Dave, per Christoph's suggestion can you please send this to Linus for
> v4.9?  This patch applies cleanly to both the current linux/master and the
> current xfs/for-next.

I'll plan to send this for 4.9-rc2 or 4.9-rc3 in the first batch of
fixes for the -rc1 merge.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 182+ messages in thread

* Re: [PATCH] dax: correct dax iomap code namespace
@ 2016-10-10 20:19         ` Dave Chinner
  0 siblings, 0 replies; 182+ messages in thread
From: Dave Chinner @ 2016-10-10 20:19 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Matthew Wilcox, linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-xfs-u79uwXL29TY76Z2rM5mHXA, Alexander Viro, Jan Kara,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Andrew Morton,
	linux-ext4-u79uwXL29TY76Z2rM5mHXA, Christoph Hellwig

On Mon, Oct 10, 2016 at 01:04:04PM -0600, Ross Zwisler wrote:
> The recently added DAX functions that use the new struct iomap data
> structure were named iomap_dax_rw(), iomap_dax_fault() and
> iomap_dax_actor().  These are actually defined in fs/dax.c, though, so
> should be part of the "dax" namespace and not the "iomap" namespace.
> Rename them to dax_iomap_rw(), dax_iomap_fault() and dax_iomap_actor()
> respectively.
> 
> Signed-off-by: Ross Zwisler <ross.zwisler-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
> Suggested-by: Dave Chinner <david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org>
> Reviewed-by: Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org>
> Reviewed-by: Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org>
> ---
> 
> Dave, per Christoph's suggestion can you please send this to Linus for
> v4.9?  This patch applies cleanly to both the current linux/master and the
> current xfs/for-next.

I'll plan to send this for 4.9-rc2 or 4.9-rc3 in the first batch of
fixes for the -rc1 merge.

Cheers,

Dave.
-- 
Dave Chinner
david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org

^ permalink raw reply	[flat|nested] 182+ messages in thread

* Re: [PATCH v5 13/17] dax: dax_iomap_fault() needs to call iomap_end()
  2016-10-10 15:50     ` Christoph Hellwig
  (?)
  (?)
@ 2016-10-10 22:05       ` Ross Zwisler
  -1 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-10 22:05 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Ross Zwisler, linux-kernel, Theodore Ts'o, Alexander Viro,
	Andreas Dilger, Andrew Morton, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

On Mon, Oct 10, 2016 at 05:50:04PM +0200, Christoph Hellwig wrote:
> On Fri, Oct 07, 2016 at 03:09:00PM -0600, Ross Zwisler wrote:
> > Currently iomap_end() doesn't do anything for DAX page faults for both ext2
> > and XFS.  ext2_iomap_end() just checks for a write underrun, and
> > xfs_file_iomap_end() checks to see if it needs to finish a delayed
> > allocation.  However, in the future iomap_end() calls might be needed to
> > make sure we have balanced allocations, locks, etc.  So, add calls to
> > iomap_end() with appropriate error handling to dax_iomap_fault().
> 
> Is there a way to just have a single call to iomap_end at the end of
> the function, after which we just return a previosuly setup return
> value?
> 
> e.g.
> 
> out:
> 	if (ops->iomap_end) {
> 		error = ops->iomap_end(inode, pos, PAGE_SIZE,
> 				PAGE_SIZE, flags, &iomap);
> 	}
> 
> 	if (error == -ENOMEM)
> 		return VM_FAULT_OOM | major;
> 	if (error < 0 && error != -EBUSY)
> 		return VM_FAULT_SIGBUS | major;
> 	return ret;

Sure, will do.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 182+ messages in thread

* Re: [PATCH v5 13/17] dax: dax_iomap_fault() needs to call iomap_end()
@ 2016-10-10 22:05       ` Ross Zwisler
  0 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-10 22:05 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Ross Zwisler, linux-kernel, Theodore Ts'o, Alexander Viro,
	Andreas Dilger, Andrew Morton, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

On Mon, Oct 10, 2016 at 05:50:04PM +0200, Christoph Hellwig wrote:
> On Fri, Oct 07, 2016 at 03:09:00PM -0600, Ross Zwisler wrote:
> > Currently iomap_end() doesn't do anything for DAX page faults for both ext2
> > and XFS.  ext2_iomap_end() just checks for a write underrun, and
> > xfs_file_iomap_end() checks to see if it needs to finish a delayed
> > allocation.  However, in the future iomap_end() calls might be needed to
> > make sure we have balanced allocations, locks, etc.  So, add calls to
> > iomap_end() with appropriate error handling to dax_iomap_fault().
> 
> Is there a way to just have a single call to iomap_end at the end of
> the function, after which we just return a previosuly setup return
> value?
> 
> e.g.
> 
> out:
> 	if (ops->iomap_end) {
> 		error = ops->iomap_end(inode, pos, PAGE_SIZE,
> 				PAGE_SIZE, flags, &iomap);
> 	}
> 
> 	if (error == -ENOMEM)
> 		return VM_FAULT_OOM | major;
> 	if (error < 0 && error != -EBUSY)
> 		return VM_FAULT_SIGBUS | major;
> 	return ret;

Sure, will do.

^ permalink raw reply	[flat|nested] 182+ messages in thread

* Re: [PATCH v5 13/17] dax: dax_iomap_fault() needs to call iomap_end()
@ 2016-10-10 22:05       ` Ross Zwisler
  0 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-10 22:05 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Theodore Ts'o, Matthew Wilcox, Dave Chinner,
	linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-xfs-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Andreas Dilger, Alexander Viro,
	Jan Kara, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-ext4-u79uwXL29TY76Z2rM5mHXA, Andrew Morton

On Mon, Oct 10, 2016 at 05:50:04PM +0200, Christoph Hellwig wrote:
> On Fri, Oct 07, 2016 at 03:09:00PM -0600, Ross Zwisler wrote:
> > Currently iomap_end() doesn't do anything for DAX page faults for both ext2
> > and XFS.  ext2_iomap_end() just checks for a write underrun, and
> > xfs_file_iomap_end() checks to see if it needs to finish a delayed
> > allocation.  However, in the future iomap_end() calls might be needed to
> > make sure we have balanced allocations, locks, etc.  So, add calls to
> > iomap_end() with appropriate error handling to dax_iomap_fault().
> 
> Is there a way to just have a single call to iomap_end at the end of
> the function, after which we just return a previosuly setup return
> value?
> 
> e.g.
> 
> out:
> 	if (ops->iomap_end) {
> 		error = ops->iomap_end(inode, pos, PAGE_SIZE,
> 				PAGE_SIZE, flags, &iomap);
> 	}
> 
> 	if (error == -ENOMEM)
> 		return VM_FAULT_OOM | major;
> 	if (error < 0 && error != -EBUSY)
> 		return VM_FAULT_SIGBUS | major;
> 	return ret;

Sure, will do.

^ permalink raw reply	[flat|nested] 182+ messages in thread

* Re: [PATCH v5 13/17] dax: dax_iomap_fault() needs to call iomap_end()
@ 2016-10-10 22:05       ` Ross Zwisler
  0 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-10 22:05 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Ross Zwisler, linux-kernel, Theodore Ts'o, Alexander Viro,
	Andreas Dilger, Andrew Morton, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

On Mon, Oct 10, 2016 at 05:50:04PM +0200, Christoph Hellwig wrote:
> On Fri, Oct 07, 2016 at 03:09:00PM -0600, Ross Zwisler wrote:
> > Currently iomap_end() doesn't do anything for DAX page faults for both ext2
> > and XFS.  ext2_iomap_end() just checks for a write underrun, and
> > xfs_file_iomap_end() checks to see if it needs to finish a delayed
> > allocation.  However, in the future iomap_end() calls might be needed to
> > make sure we have balanced allocations, locks, etc.  So, add calls to
> > iomap_end() with appropriate error handling to dax_iomap_fault().
> 
> Is there a way to just have a single call to iomap_end at the end of
> the function, after which we just return a previosuly setup return
> value?
> 
> e.g.
> 
> out:
> 	if (ops->iomap_end) {
> 		error = ops->iomap_end(inode, pos, PAGE_SIZE,
> 				PAGE_SIZE, flags, &iomap);
> 	}
> 
> 	if (error == -ENOMEM)
> 		return VM_FAULT_OOM | major;
> 	if (error < 0 && error != -EBUSY)
> 		return VM_FAULT_SIGBUS | major;
> 	return ret;

Sure, will do.

^ permalink raw reply	[flat|nested] 182+ messages in thread

* Re: [PATCH v5 15/17] dax: add struct iomap based DAX PMD support
  2016-10-10 15:59     ` Christoph Hellwig
                         ` (2 preceding siblings ...)
  (?)
@ 2016-10-10 22:06       ` Ross Zwisler
  -1 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-10 22:06 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Theodore Ts'o, Matthew Wilcox, Dave Chinner, linux-nvdimm,
	linux-kernel, linux-xfs, linux-mm, Andreas Dilger,
	Alexander Viro, Jan Kara, linux-fsdevel, linux-ext4,
	Andrew Morton

On Mon, Oct 10, 2016 at 05:59:17PM +0200, Christoph Hellwig wrote:
> On Fri, Oct 07, 2016 at 03:09:02PM -0600, Ross Zwisler wrote:
> > -	if (RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD)
> > +	if ((unsigned long)entry & RADIX_DAX_PMD)
> 
> Please introduce a proper inline helper that mask all the possible type
> bits out of the radix tree entry, and use them wherever you do the
> open cast.

Sure, will do.

> >  restart:
> >  	spin_lock_irq(&mapping->tree_lock);
> >  	entry = get_unlocked_mapping_entry(mapping, index, &slot);
> > +
> > +	if (entry) {
> > +		if (size_flag & RADIX_DAX_PMD) {
> > +			if (!radix_tree_exceptional_entry(entry) ||
> > +			    !((unsigned long)entry & RADIX_DAX_PMD)) {
> > +				entry = ERR_PTR(-EEXIST);
> > +				goto out_unlock;
> > +			}
> > +		} else { /* trying to grab a PTE entry */
> > +			if (radix_tree_exceptional_entry(entry) &&
> > +			    ((unsigned long)entry & RADIX_DAX_PMD) &&
> > +			    ((unsigned long)entry &
> > +			     (RADIX_DAX_HZP|RADIX_DAX_EMPTY))) {
> 
> And when we do these cases N times next to each other we should
> have a local variable the valid flag bits of entry.

Okay, will fix.
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 182+ messages in thread

* Re: [PATCH v5 15/17] dax: add struct iomap based DAX PMD support
@ 2016-10-10 22:06       ` Ross Zwisler
  0 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-10 22:06 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Ross Zwisler, linux-kernel, Theodore Ts'o, Alexander Viro,
	Andreas Dilger, Andrew Morton, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

On Mon, Oct 10, 2016 at 05:59:17PM +0200, Christoph Hellwig wrote:
> On Fri, Oct 07, 2016 at 03:09:02PM -0600, Ross Zwisler wrote:
> > -	if (RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD)
> > +	if ((unsigned long)entry & RADIX_DAX_PMD)
> 
> Please introduce a proper inline helper that mask all the possible type
> bits out of the radix tree entry, and use them wherever you do the
> open cast.

Sure, will do.

> >  restart:
> >  	spin_lock_irq(&mapping->tree_lock);
> >  	entry = get_unlocked_mapping_entry(mapping, index, &slot);
> > +
> > +	if (entry) {
> > +		if (size_flag & RADIX_DAX_PMD) {
> > +			if (!radix_tree_exceptional_entry(entry) ||
> > +			    !((unsigned long)entry & RADIX_DAX_PMD)) {
> > +				entry = ERR_PTR(-EEXIST);
> > +				goto out_unlock;
> > +			}
> > +		} else { /* trying to grab a PTE entry */
> > +			if (radix_tree_exceptional_entry(entry) &&
> > +			    ((unsigned long)entry & RADIX_DAX_PMD) &&
> > +			    ((unsigned long)entry &
> > +			     (RADIX_DAX_HZP|RADIX_DAX_EMPTY))) {
> 
> And when we do these cases N times next to each other we should
> have a local variable the valid flag bits of entry.

Okay, will fix.

^ permalink raw reply	[flat|nested] 182+ messages in thread

* Re: [PATCH v5 15/17] dax: add struct iomap based DAX PMD support
@ 2016-10-10 22:06       ` Ross Zwisler
  0 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-10 22:06 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Ross Zwisler, linux-kernel, Theodore Ts'o, Alexander Viro,
	Andreas Dilger, Andrew Morton, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

On Mon, Oct 10, 2016 at 05:59:17PM +0200, Christoph Hellwig wrote:
> On Fri, Oct 07, 2016 at 03:09:02PM -0600, Ross Zwisler wrote:
> > -	if (RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD)
> > +	if ((unsigned long)entry & RADIX_DAX_PMD)
> 
> Please introduce a proper inline helper that mask all the possible type
> bits out of the radix tree entry, and use them wherever you do the
> open cast.

Sure, will do.

> >  restart:
> >  	spin_lock_irq(&mapping->tree_lock);
> >  	entry = get_unlocked_mapping_entry(mapping, index, &slot);
> > +
> > +	if (entry) {
> > +		if (size_flag & RADIX_DAX_PMD) {
> > +			if (!radix_tree_exceptional_entry(entry) ||
> > +			    !((unsigned long)entry & RADIX_DAX_PMD)) {
> > +				entry = ERR_PTR(-EEXIST);
> > +				goto out_unlock;
> > +			}
> > +		} else { /* trying to grab a PTE entry */
> > +			if (radix_tree_exceptional_entry(entry) &&
> > +			    ((unsigned long)entry & RADIX_DAX_PMD) &&
> > +			    ((unsigned long)entry &
> > +			     (RADIX_DAX_HZP|RADIX_DAX_EMPTY))) {
> 
> And when we do these cases N times next to each other we should
> have a local variable the valid flag bits of entry.

Okay, will fix.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 182+ messages in thread

* Re: [PATCH v5 15/17] dax: add struct iomap based DAX PMD support
@ 2016-10-10 22:06       ` Ross Zwisler
  0 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-10 22:06 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Theodore Ts'o, Matthew Wilcox, Dave Chinner,
	linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-xfs-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Andreas Dilger, Alexander Viro,
	Jan Kara, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-ext4-u79uwXL29TY76Z2rM5mHXA, Andrew Morton

On Mon, Oct 10, 2016 at 05:59:17PM +0200, Christoph Hellwig wrote:
> On Fri, Oct 07, 2016 at 03:09:02PM -0600, Ross Zwisler wrote:
> > -	if (RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD)
> > +	if ((unsigned long)entry & RADIX_DAX_PMD)
> 
> Please introduce a proper inline helper that mask all the possible type
> bits out of the radix tree entry, and use them wherever you do the
> open cast.

Sure, will do.

> >  restart:
> >  	spin_lock_irq(&mapping->tree_lock);
> >  	entry = get_unlocked_mapping_entry(mapping, index, &slot);
> > +
> > +	if (entry) {
> > +		if (size_flag & RADIX_DAX_PMD) {
> > +			if (!radix_tree_exceptional_entry(entry) ||
> > +			    !((unsigned long)entry & RADIX_DAX_PMD)) {
> > +				entry = ERR_PTR(-EEXIST);
> > +				goto out_unlock;
> > +			}
> > +		} else { /* trying to grab a PTE entry */
> > +			if (radix_tree_exceptional_entry(entry) &&
> > +			    ((unsigned long)entry & RADIX_DAX_PMD) &&
> > +			    ((unsigned long)entry &
> > +			     (RADIX_DAX_HZP|RADIX_DAX_EMPTY))) {
> 
> And when we do these cases N times next to each other we should
> have a local variable the valid flag bits of entry.

Okay, will fix.

^ permalink raw reply	[flat|nested] 182+ messages in thread

* Re: [PATCH v5 15/17] dax: add struct iomap based DAX PMD support
@ 2016-10-10 22:06       ` Ross Zwisler
  0 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-10 22:06 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Ross Zwisler, linux-kernel, Theodore Ts'o, Alexander Viro,
	Andreas Dilger, Andrew Morton, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

On Mon, Oct 10, 2016 at 05:59:17PM +0200, Christoph Hellwig wrote:
> On Fri, Oct 07, 2016 at 03:09:02PM -0600, Ross Zwisler wrote:
> > -	if (RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD)
> > +	if ((unsigned long)entry & RADIX_DAX_PMD)
> 
> Please introduce a proper inline helper that mask all the possible type
> bits out of the radix tree entry, and use them wherever you do the
> open cast.

Sure, will do.

> >  restart:
> >  	spin_lock_irq(&mapping->tree_lock);
> >  	entry = get_unlocked_mapping_entry(mapping, index, &slot);
> > +
> > +	if (entry) {
> > +		if (size_flag & RADIX_DAX_PMD) {
> > +			if (!radix_tree_exceptional_entry(entry) ||
> > +			    !((unsigned long)entry & RADIX_DAX_PMD)) {
> > +				entry = ERR_PTR(-EEXIST);
> > +				goto out_unlock;
> > +			}
> > +		} else { /* trying to grab a PTE entry */
> > +			if (radix_tree_exceptional_entry(entry) &&
> > +			    ((unsigned long)entry & RADIX_DAX_PMD) &&
> > +			    ((unsigned long)entry &
> > +			     (RADIX_DAX_HZP|RADIX_DAX_EMPTY))) {
> 
> And when we do these cases N times next to each other we should
> have a local variable the valid flag bits of entry.

Okay, will fix.

^ permalink raw reply	[flat|nested] 182+ messages in thread

* Re: [PATCH v5 05/17] ext2: return -EIO on ext2_iomap_end() failure
  2016-10-07 21:08   ` Ross Zwisler
  (?)
  (?)
@ 2016-10-11  6:48     ` Jan Kara
  -1 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2016-10-11  6:48 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Theodore Ts'o, Matthew Wilcox, Dave Chinner, linux-nvdimm,
	linux-kernel, linux-xfs, linux-mm, Andreas Dilger,
	Alexander Viro, Jan Kara, linux-fsdevel, Andrew Morton,
	linux-ext4, Christoph Hellwig

On Fri 07-10-16 15:08:52, Ross Zwisler wrote:
> Right now we just return 0 for success, but we really want to let callers
> know about this failure.
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> ---
>  fs/ext2/inode.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
> index c7dbb46..368913c 100644
> --- a/fs/ext2/inode.c
> +++ b/fs/ext2/inode.c
> @@ -830,8 +830,10 @@ ext2_iomap_end(struct inode *inode, loff_t offset, loff_t length,
>  {
>  	if (iomap->type == IOMAP_MAPPED &&
>  	    written < length &&
> -	    (flags & IOMAP_WRITE))
> +	    (flags & IOMAP_WRITE)) {
>  		ext2_write_failed(inode->i_mapping, offset + length);
> +		return -EIO;
> +	}

So this is wrong. This (written < length) happens when we fail to copy data
to / from userspace buffer into pagecache pages / DAX blocks. It may be
because the passed buffer pointer is just wrong, or just because the page
got swapped out and we have to swap it back in. It is a role of upper
layers to decide what went wrong and proceed accordingly but from filesystem
point of view we just have to cancel the operation we have prepared and
return to upper layers. So returning 0 in this case is correct.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 182+ messages in thread

* Re: [PATCH v5 05/17] ext2: return -EIO on ext2_iomap_end() failure
@ 2016-10-11  6:48     ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2016-10-11  6:48 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

On Fri 07-10-16 15:08:52, Ross Zwisler wrote:
> Right now we just return 0 for success, but we really want to let callers
> know about this failure.
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> ---
>  fs/ext2/inode.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
> index c7dbb46..368913c 100644
> --- a/fs/ext2/inode.c
> +++ b/fs/ext2/inode.c
> @@ -830,8 +830,10 @@ ext2_iomap_end(struct inode *inode, loff_t offset, loff_t length,
>  {
>  	if (iomap->type == IOMAP_MAPPED &&
>  	    written < length &&
> -	    (flags & IOMAP_WRITE))
> +	    (flags & IOMAP_WRITE)) {
>  		ext2_write_failed(inode->i_mapping, offset + length);
> +		return -EIO;
> +	}

So this is wrong. This (written < length) happens when we fail to copy data
to / from userspace buffer into pagecache pages / DAX blocks. It may be
because the passed buffer pointer is just wrong, or just because the page
got swapped out and we have to swap it back in. It is a role of upper
layers to decide what went wrong and proceed accordingly but from filesystem
point of view we just have to cancel the operation we have prepared and
return to upper layers. So returning 0 in this case is correct.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 182+ messages in thread

* Re: [PATCH v5 05/17] ext2: return -EIO on ext2_iomap_end() failure
@ 2016-10-11  6:48     ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2016-10-11  6:48 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

On Fri 07-10-16 15:08:52, Ross Zwisler wrote:
> Right now we just return 0 for success, but we really want to let callers
> know about this failure.
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> ---
>  fs/ext2/inode.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
> index c7dbb46..368913c 100644
> --- a/fs/ext2/inode.c
> +++ b/fs/ext2/inode.c
> @@ -830,8 +830,10 @@ ext2_iomap_end(struct inode *inode, loff_t offset, loff_t length,
>  {
>  	if (iomap->type == IOMAP_MAPPED &&
>  	    written < length &&
> -	    (flags & IOMAP_WRITE))
> +	    (flags & IOMAP_WRITE)) {
>  		ext2_write_failed(inode->i_mapping, offset + length);
> +		return -EIO;
> +	}

So this is wrong. This (written < length) happens when we fail to copy data
to / from userspace buffer into pagecache pages / DAX blocks. It may be
because the passed buffer pointer is just wrong, or just because the page
got swapped out and we have to swap it back in. It is a role of upper
layers to decide what went wrong and proceed accordingly but from filesystem
point of view we just have to cancel the operation we have prepared and
return to upper layers. So returning 0 in this case is correct.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 182+ messages in thread

* Re: [PATCH v5 05/17] ext2: return -EIO on ext2_iomap_end() failure
@ 2016-10-11  6:48     ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2016-10-11  6:48 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

On Fri 07-10-16 15:08:52, Ross Zwisler wrote:
> Right now we just return 0 for success, but we really want to let callers
> know about this failure.
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> ---
>  fs/ext2/inode.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
> index c7dbb46..368913c 100644
> --- a/fs/ext2/inode.c
> +++ b/fs/ext2/inode.c
> @@ -830,8 +830,10 @@ ext2_iomap_end(struct inode *inode, loff_t offset, loff_t length,
>  {
>  	if (iomap->type == IOMAP_MAPPED &&
>  	    written < length &&
> -	    (flags & IOMAP_WRITE))
> +	    (flags & IOMAP_WRITE)) {
>  		ext2_write_failed(inode->i_mapping, offset + length);
> +		return -EIO;
> +	}

So this is wrong. This (written < length) happens when we fail to copy data
to / from userspace buffer into pagecache pages / DAX blocks. It may be
because the passed buffer pointer is just wrong, or just because the page
got swapped out and we have to swap it back in. It is a role of upper
layers to decide what went wrong and proceed accordingly but from filesystem
point of view we just have to cancel the operation we have prepared and
return to upper layers. So returning 0 in this case is correct.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 182+ messages in thread

* Re: [PATCH v5 07/17] dax: remove the last BUG_ON() from fs/dax.c
  2016-10-07 21:08   ` Ross Zwisler
                       ` (2 preceding siblings ...)
  (?)
@ 2016-10-11  6:50     ` Jan Kara
  -1 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2016-10-11  6:50 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Theodore Ts'o, Matthew Wilcox, Dave Chinner, linux-nvdimm,
	linux-kernel, linux-xfs, linux-mm, Andreas Dilger,
	Alexander Viro, Jan Kara, linux-fsdevel, Andrew Morton,
	linux-ext4, Christoph Hellwig

On Fri 07-10-16 15:08:54, Ross Zwisler wrote:
> Don't take down the kernel if we get an invalid 'from' and 'length'
> argument pair.  Just warn once and return an error.
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>

Looks good. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza
> ---
>  fs/dax.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index ac28cdf..98189ac 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1194,7 +1194,8 @@ int dax_zero_page_range(struct inode *inode, loff_t from, unsigned length,
>  	/* Block boundary? Nothing to do */
>  	if (!length)
>  		return 0;
> -	BUG_ON((offset + length) > PAGE_SIZE);
> +	if (WARN_ON_ONCE((offset + length) > PAGE_SIZE))
> +		return -EINVAL;
>  
>  	memset(&bh, 0, sizeof(bh));
>  	bh.b_bdev = inode->i_sb->s_bdev;
> -- 
> 2.7.4
> 
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 182+ messages in thread

* Re: [PATCH v5 07/17] dax: remove the last BUG_ON() from fs/dax.c
@ 2016-10-11  6:50     ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2016-10-11  6:50 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

On Fri 07-10-16 15:08:54, Ross Zwisler wrote:
> Don't take down the kernel if we get an invalid 'from' and 'length'
> argument pair.  Just warn once and return an error.
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>

Looks good. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza
> ---
>  fs/dax.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index ac28cdf..98189ac 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1194,7 +1194,8 @@ int dax_zero_page_range(struct inode *inode, loff_t from, unsigned length,
>  	/* Block boundary? Nothing to do */
>  	if (!length)
>  		return 0;
> -	BUG_ON((offset + length) > PAGE_SIZE);
> +	if (WARN_ON_ONCE((offset + length) > PAGE_SIZE))
> +		return -EINVAL;
>  
>  	memset(&bh, 0, sizeof(bh));
>  	bh.b_bdev = inode->i_sb->s_bdev;
> -- 
> 2.7.4
> 
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 182+ messages in thread

* Re: [PATCH v5 07/17] dax: remove the last BUG_ON() from fs/dax.c
@ 2016-10-11  6:50     ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2016-10-11  6:50 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

On Fri 07-10-16 15:08:54, Ross Zwisler wrote:
> Don't take down the kernel if we get an invalid 'from' and 'length'
> argument pair.  Just warn once and return an error.
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>

Looks good. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza
> ---
>  fs/dax.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index ac28cdf..98189ac 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1194,7 +1194,8 @@ int dax_zero_page_range(struct inode *inode, loff_t from, unsigned length,
>  	/* Block boundary? Nothing to do */
>  	if (!length)
>  		return 0;
> -	BUG_ON((offset + length) > PAGE_SIZE);
> +	if (WARN_ON_ONCE((offset + length) > PAGE_SIZE))
> +		return -EINVAL;
>  
>  	memset(&bh, 0, sizeof(bh));
>  	bh.b_bdev = inode->i_sb->s_bdev;
> -- 
> 2.7.4
> 
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 182+ messages in thread

* Re: [PATCH v5 07/17] dax: remove the last BUG_ON() from fs/dax.c
@ 2016-10-11  6:50     ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2016-10-11  6:50 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Theodore Ts'o, Matthew Wilcox, Dave Chinner,
	linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-xfs-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Andreas Dilger, Alexander Viro,
	Jan Kara, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Andrew Morton,
	linux-ext4-u79uwXL29TY76Z2rM5mHXA, Christoph Hellwig

On Fri 07-10-16 15:08:54, Ross Zwisler wrote:
> Don't take down the kernel if we get an invalid 'from' and 'length'
> argument pair.  Just warn once and return an error.
> 
> Signed-off-by: Ross Zwisler <ross.zwisler-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>

Looks good. You can add:

Reviewed-by: Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org>

								Honza
> ---
>  fs/dax.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index ac28cdf..98189ac 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1194,7 +1194,8 @@ int dax_zero_page_range(struct inode *inode, loff_t from, unsigned length,
>  	/* Block boundary? Nothing to do */
>  	if (!length)
>  		return 0;
> -	BUG_ON((offset + length) > PAGE_SIZE);
> +	if (WARN_ON_ONCE((offset + length) > PAGE_SIZE))
> +		return -EINVAL;
>  
>  	memset(&bh, 0, sizeof(bh));
>  	bh.b_bdev = inode->i_sb->s_bdev;
> -- 
> 2.7.4
> 
> 
-- 
Jan Kara <jack-IBi9RG/b67k@public.gmane.org>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 182+ messages in thread

* Re: [PATCH v5 07/17] dax: remove the last BUG_ON() from fs/dax.c
@ 2016-10-11  6:50     ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2016-10-11  6:50 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

On Fri 07-10-16 15:08:54, Ross Zwisler wrote:
> Don't take down the kernel if we get an invalid 'from' and 'length'
> argument pair.  Just warn once and return an error.
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>

Looks good. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza
> ---
>  fs/dax.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index ac28cdf..98189ac 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1194,7 +1194,8 @@ int dax_zero_page_range(struct inode *inode, loff_t from, unsigned length,
>  	/* Block boundary? Nothing to do */
>  	if (!length)
>  		return 0;
> -	BUG_ON((offset + length) > PAGE_SIZE);
> +	if (WARN_ON_ONCE((offset + length) > PAGE_SIZE))
> +		return -EINVAL;
>  
>  	memset(&bh, 0, sizeof(bh));
>  	bh.b_bdev = inode->i_sb->s_bdev;
> -- 
> 2.7.4
> 
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 182+ messages in thread

* Re: [PATCH v5 09/17] dax: coordinate locking for offsets in PMD range
  2016-10-07 21:08   ` Ross Zwisler
  (?)
  (?)
@ 2016-10-11  7:04     ` Jan Kara
  -1 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2016-10-11  7:04 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Theodore Ts'o, Matthew Wilcox, Dave Chinner, linux-nvdimm,
	linux-kernel, linux-xfs, linux-mm, Andreas Dilger,
	Alexander Viro, Jan Kara, linux-fsdevel, Andrew Morton,
	linux-ext4, Christoph Hellwig

On Fri 07-10-16 15:08:56, Ross Zwisler wrote:
> DAX radix tree locking currently locks entries based on the unique
> combination of the 'mapping' pointer and the pgoff_t 'index' for the entry.
> This works for PTEs, but as we move to PMDs we will need to have all the
> offsets within the range covered by the PMD to map to the same bit lock.
> To accomplish this, for ranges covered by a PMD entry we will instead lock
> based on the page offset of the beginning of the PMD entry.  The 'mapping'
> pointer is still used in the same way.
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>

The patch looks good to me. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

Just one thing which IMO deserves a comment below:

> @@ -448,9 +460,12 @@ restart:
>  }
>  
>  void dax_wake_mapping_entry_waiter(struct address_space *mapping,
> -				   pgoff_t index, bool wake_all)
> +		pgoff_t index, void *entry, bool wake_all)
>  {
> -	wait_queue_head_t *wq = dax_entry_waitqueue(mapping, index);
> +	struct exceptional_entry_key key;
> +	wait_queue_head_t *wq;
> +
> +	wq = dax_entry_waitqueue(mapping, index, entry, &key);

So I believe we should comment above this function that the 'entry' it gets
may be invalid by the time it gets it (we call it without tree_lock held so
the passed entry may be changed in the radix tree as we work) but we use it
only to find appropriate waitqueue where tasks sleep waiting for that old
entry to unlock so we indeed wake up all tasks we need.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 182+ messages in thread

* Re: [PATCH v5 09/17] dax: coordinate locking for offsets in PMD range
@ 2016-10-11  7:04     ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2016-10-11  7:04 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

On Fri 07-10-16 15:08:56, Ross Zwisler wrote:
> DAX radix tree locking currently locks entries based on the unique
> combination of the 'mapping' pointer and the pgoff_t 'index' for the entry.
> This works for PTEs, but as we move to PMDs we will need to have all the
> offsets within the range covered by the PMD to map to the same bit lock.
> To accomplish this, for ranges covered by a PMD entry we will instead lock
> based on the page offset of the beginning of the PMD entry.  The 'mapping'
> pointer is still used in the same way.
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>

The patch looks good to me. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

Just one thing which IMO deserves a comment below:

> @@ -448,9 +460,12 @@ restart:
>  }
>  
>  void dax_wake_mapping_entry_waiter(struct address_space *mapping,
> -				   pgoff_t index, bool wake_all)
> +		pgoff_t index, void *entry, bool wake_all)
>  {
> -	wait_queue_head_t *wq = dax_entry_waitqueue(mapping, index);
> +	struct exceptional_entry_key key;
> +	wait_queue_head_t *wq;
> +
> +	wq = dax_entry_waitqueue(mapping, index, entry, &key);

So I believe we should comment above this function that the 'entry' it gets
may be invalid by the time it gets it (we call it without tree_lock held so
the passed entry may be changed in the radix tree as we work) but we use it
only to find appropriate waitqueue where tasks sleep waiting for that old
entry to unlock so we indeed wake up all tasks we need.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 182+ messages in thread

* Re: [PATCH v5 09/17] dax: coordinate locking for offsets in PMD range
@ 2016-10-11  7:04     ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2016-10-11  7:04 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

On Fri 07-10-16 15:08:56, Ross Zwisler wrote:
> DAX radix tree locking currently locks entries based on the unique
> combination of the 'mapping' pointer and the pgoff_t 'index' for the entry.
> This works for PTEs, but as we move to PMDs we will need to have all the
> offsets within the range covered by the PMD to map to the same bit lock.
> To accomplish this, for ranges covered by a PMD entry we will instead lock
> based on the page offset of the beginning of the PMD entry.  The 'mapping'
> pointer is still used in the same way.
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>

The patch looks good to me. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

Just one thing which IMO deserves a comment below:

> @@ -448,9 +460,12 @@ restart:
>  }
>  
>  void dax_wake_mapping_entry_waiter(struct address_space *mapping,
> -				   pgoff_t index, bool wake_all)
> +		pgoff_t index, void *entry, bool wake_all)
>  {
> -	wait_queue_head_t *wq = dax_entry_waitqueue(mapping, index);
> +	struct exceptional_entry_key key;
> +	wait_queue_head_t *wq;
> +
> +	wq = dax_entry_waitqueue(mapping, index, entry, &key);

So I believe we should comment above this function that the 'entry' it gets
may be invalid by the time it gets it (we call it without tree_lock held so
the passed entry may be changed in the radix tree as we work) but we use it
only to find appropriate waitqueue where tasks sleep waiting for that old
entry to unlock so we indeed wake up all tasks we need.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 182+ messages in thread

* Re: [PATCH v5 09/17] dax: coordinate locking for offsets in PMD range
@ 2016-10-11  7:04     ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2016-10-11  7:04 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

On Fri 07-10-16 15:08:56, Ross Zwisler wrote:
> DAX radix tree locking currently locks entries based on the unique
> combination of the 'mapping' pointer and the pgoff_t 'index' for the entry.
> This works for PTEs, but as we move to PMDs we will need to have all the
> offsets within the range covered by the PMD to map to the same bit lock.
> To accomplish this, for ranges covered by a PMD entry we will instead lock
> based on the page offset of the beginning of the PMD entry.  The 'mapping'
> pointer is still used in the same way.
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>

The patch looks good to me. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

Just one thing which IMO deserves a comment below:

> @@ -448,9 +460,12 @@ restart:
>  }
>  
>  void dax_wake_mapping_entry_waiter(struct address_space *mapping,
> -				   pgoff_t index, bool wake_all)
> +		pgoff_t index, void *entry, bool wake_all)
>  {
> -	wait_queue_head_t *wq = dax_entry_waitqueue(mapping, index);
> +	struct exceptional_entry_key key;
> +	wait_queue_head_t *wq;
> +
> +	wq = dax_entry_waitqueue(mapping, index, entry, &key);

So I believe we should comment above this function that the 'entry' it gets
may be invalid by the time it gets it (we call it without tree_lock held so
the passed entry may be changed in the radix tree as we work) but we use it
only to find appropriate waitqueue where tasks sleep waiting for that old
entry to unlock so we indeed wake up all tasks we need.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 182+ messages in thread

* Re: [PATCH v5 12/17] dax: add dax_iomap_sector() helper function
  2016-10-07 21:08   ` Ross Zwisler
  (?)
  (?)
@ 2016-10-11  7:06     ` Jan Kara
  -1 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2016-10-11  7:06 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Theodore Ts'o, Matthew Wilcox, Dave Chinner, linux-nvdimm,
	linux-kernel, linux-xfs, linux-mm, Andreas Dilger,
	Alexander Viro, Jan Kara, linux-fsdevel, Andrew Morton,
	linux-ext4, Christoph Hellwig

On Fri 07-10-16 15:08:59, Ross Zwisler wrote:
> To be able to correctly calculate the sector from a file position and a
> struct iomap there is a complex little bit of logic that currently happens
> in both dax_iomap_actor() and dax_iomap_fault().  This will need to be
> repeated yet again in the DAX PMD fault handler when it is added, so break
> it out into a helper function.
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>

Looks good. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/dax.c | 10 +++++++---
>  1 file changed, 7 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 982ccbb..7689ab0 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1023,6 +1023,11 @@ int dax_truncate_page(struct inode *inode, loff_t from, get_block_t get_block)
>  EXPORT_SYMBOL_GPL(dax_truncate_page);
>  
>  #ifdef CONFIG_FS_IOMAP
> +static inline sector_t dax_iomap_sector(struct iomap *iomap, loff_t pos)
> +{
> +	return iomap->blkno + (((pos & PAGE_MASK) - iomap->offset) >> 9);
> +}
> +
>  static loff_t
>  dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
>  		struct iomap *iomap)
> @@ -1048,8 +1053,7 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
>  		struct blk_dax_ctl dax = { 0 };
>  		ssize_t map_len;
>  
> -		dax.sector = iomap->blkno +
> -			(((pos & PAGE_MASK) - iomap->offset) >> 9);
> +		dax.sector = dax_iomap_sector(iomap, pos);
>  		dax.size = (length + offset + PAGE_SIZE - 1) & PAGE_MASK;
>  		map_len = dax_map_atomic(iomap->bdev, &dax);
>  		if (map_len < 0) {
> @@ -1186,7 +1190,7 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
>  		goto unlock_entry;
>  	}
>  
> -	sector = iomap.blkno + (((pos & PAGE_MASK) - iomap.offset) >> 9);
> +	sector = dax_iomap_sector(&iomap, pos);
>  
>  	if (vmf->cow_page) {
>  		switch (iomap.type) {
> -- 
> 2.7.4
> 
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 182+ messages in thread

* Re: [PATCH v5 12/17] dax: add dax_iomap_sector() helper function
@ 2016-10-11  7:06     ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2016-10-11  7:06 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

On Fri 07-10-16 15:08:59, Ross Zwisler wrote:
> To be able to correctly calculate the sector from a file position and a
> struct iomap there is a complex little bit of logic that currently happens
> in both dax_iomap_actor() and dax_iomap_fault().  This will need to be
> repeated yet again in the DAX PMD fault handler when it is added, so break
> it out into a helper function.
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>

Looks good. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/dax.c | 10 +++++++---
>  1 file changed, 7 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 982ccbb..7689ab0 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1023,6 +1023,11 @@ int dax_truncate_page(struct inode *inode, loff_t from, get_block_t get_block)
>  EXPORT_SYMBOL_GPL(dax_truncate_page);
>  
>  #ifdef CONFIG_FS_IOMAP
> +static inline sector_t dax_iomap_sector(struct iomap *iomap, loff_t pos)
> +{
> +	return iomap->blkno + (((pos & PAGE_MASK) - iomap->offset) >> 9);
> +}
> +
>  static loff_t
>  dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
>  		struct iomap *iomap)
> @@ -1048,8 +1053,7 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
>  		struct blk_dax_ctl dax = { 0 };
>  		ssize_t map_len;
>  
> -		dax.sector = iomap->blkno +
> -			(((pos & PAGE_MASK) - iomap->offset) >> 9);
> +		dax.sector = dax_iomap_sector(iomap, pos);
>  		dax.size = (length + offset + PAGE_SIZE - 1) & PAGE_MASK;
>  		map_len = dax_map_atomic(iomap->bdev, &dax);
>  		if (map_len < 0) {
> @@ -1186,7 +1190,7 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
>  		goto unlock_entry;
>  	}
>  
> -	sector = iomap.blkno + (((pos & PAGE_MASK) - iomap.offset) >> 9);
> +	sector = dax_iomap_sector(&iomap, pos);
>  
>  	if (vmf->cow_page) {
>  		switch (iomap.type) {
> -- 
> 2.7.4
> 
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 182+ messages in thread

* Re: [PATCH v5 12/17] dax: add dax_iomap_sector() helper function
@ 2016-10-11  7:06     ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2016-10-11  7:06 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

On Fri 07-10-16 15:08:59, Ross Zwisler wrote:
> To be able to correctly calculate the sector from a file position and a
> struct iomap there is a complex little bit of logic that currently happens
> in both dax_iomap_actor() and dax_iomap_fault().  This will need to be
> repeated yet again in the DAX PMD fault handler when it is added, so break
> it out into a helper function.
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>

Looks good. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/dax.c | 10 +++++++---
>  1 file changed, 7 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 982ccbb..7689ab0 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1023,6 +1023,11 @@ int dax_truncate_page(struct inode *inode, loff_t from, get_block_t get_block)
>  EXPORT_SYMBOL_GPL(dax_truncate_page);
>  
>  #ifdef CONFIG_FS_IOMAP
> +static inline sector_t dax_iomap_sector(struct iomap *iomap, loff_t pos)
> +{
> +	return iomap->blkno + (((pos & PAGE_MASK) - iomap->offset) >> 9);
> +}
> +
>  static loff_t
>  dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
>  		struct iomap *iomap)
> @@ -1048,8 +1053,7 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
>  		struct blk_dax_ctl dax = { 0 };
>  		ssize_t map_len;
>  
> -		dax.sector = iomap->blkno +
> -			(((pos & PAGE_MASK) - iomap->offset) >> 9);
> +		dax.sector = dax_iomap_sector(iomap, pos);
>  		dax.size = (length + offset + PAGE_SIZE - 1) & PAGE_MASK;
>  		map_len = dax_map_atomic(iomap->bdev, &dax);
>  		if (map_len < 0) {
> @@ -1186,7 +1190,7 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
>  		goto unlock_entry;
>  	}
>  
> -	sector = iomap.blkno + (((pos & PAGE_MASK) - iomap.offset) >> 9);
> +	sector = dax_iomap_sector(&iomap, pos);
>  
>  	if (vmf->cow_page) {
>  		switch (iomap.type) {
> -- 
> 2.7.4
> 
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 182+ messages in thread

* Re: [PATCH v5 12/17] dax: add dax_iomap_sector() helper function
@ 2016-10-11  7:06     ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2016-10-11  7:06 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

On Fri 07-10-16 15:08:59, Ross Zwisler wrote:
> To be able to correctly calculate the sector from a file position and a
> struct iomap there is a complex little bit of logic that currently happens
> in both dax_iomap_actor() and dax_iomap_fault().  This will need to be
> repeated yet again in the DAX PMD fault handler when it is added, so break
> it out into a helper function.
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>

Looks good. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/dax.c | 10 +++++++---
>  1 file changed, 7 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 982ccbb..7689ab0 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1023,6 +1023,11 @@ int dax_truncate_page(struct inode *inode, loff_t from, get_block_t get_block)
>  EXPORT_SYMBOL_GPL(dax_truncate_page);
>  
>  #ifdef CONFIG_FS_IOMAP
> +static inline sector_t dax_iomap_sector(struct iomap *iomap, loff_t pos)
> +{
> +	return iomap->blkno + (((pos & PAGE_MASK) - iomap->offset) >> 9);
> +}
> +
>  static loff_t
>  dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
>  		struct iomap *iomap)
> @@ -1048,8 +1053,7 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
>  		struct blk_dax_ctl dax = { 0 };
>  		ssize_t map_len;
>  
> -		dax.sector = iomap->blkno +
> -			(((pos & PAGE_MASK) - iomap->offset) >> 9);
> +		dax.sector = dax_iomap_sector(iomap, pos);
>  		dax.size = (length + offset + PAGE_SIZE - 1) & PAGE_MASK;
>  		map_len = dax_map_atomic(iomap->bdev, &dax);
>  		if (map_len < 0) {
> @@ -1186,7 +1190,7 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
>  		goto unlock_entry;
>  	}
>  
> -	sector = iomap.blkno + (((pos & PAGE_MASK) - iomap.offset) >> 9);
> +	sector = dax_iomap_sector(&iomap, pos);
>  
>  	if (vmf->cow_page) {
>  		switch (iomap.type) {
> -- 
> 2.7.4
> 
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 182+ messages in thread

* Re: [PATCH v5 13/17] dax: dax_iomap_fault() needs to call iomap_end()
  2016-10-07 21:09   ` Ross Zwisler
                       ` (2 preceding siblings ...)
  (?)
@ 2016-10-11  7:21     ` Jan Kara
  -1 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2016-10-11  7:21 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Theodore Ts'o, Matthew Wilcox, Dave Chinner, linux-nvdimm,
	linux-kernel, linux-xfs, linux-mm, Andreas Dilger,
	Alexander Viro, Jan Kara, linux-fsdevel, Andrew Morton,
	linux-ext4, Christoph Hellwig

On Fri 07-10-16 15:09:00, Ross Zwisler wrote:
> Currently iomap_end() doesn't do anything for DAX page faults for both ext2
> and XFS.  ext2_iomap_end() just checks for a write underrun, and
> xfs_file_iomap_end() checks to see if it needs to finish a delayed
> allocation.  However, in the future iomap_end() calls might be needed to
> make sure we have balanced allocations, locks, etc.  So, add calls to
> iomap_end() with appropriate error handling to dax_iomap_fault().
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> Suggested-by: Jan Kara <jack@suse.cz>
...
> @@ -1239,6 +1253,17 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
>  		break;
>  	}
>  
> + finish_iomap:
> +	if (ops->iomap_end) {
> +		if (error) {
> +			/* keep previous error */
> +			ops->iomap_end(inode, pos, PAGE_SIZE, PAGE_SIZE, flags,
> +					&iomap);

I think for the error case we should set number of 'written' bytes to 0 to
tell fs to cancel what it has prepared. This is mostly cosmetic since the
only case where I can imagine this would matter is shared write fault and
in that case we have currently no error path but still it could bite us in
the future.

Other than that the patch looks good so you can add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 182+ messages in thread

* Re: [PATCH v5 13/17] dax: dax_iomap_fault() needs to call iomap_end()
@ 2016-10-11  7:21     ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2016-10-11  7:21 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

On Fri 07-10-16 15:09:00, Ross Zwisler wrote:
> Currently iomap_end() doesn't do anything for DAX page faults for both ext2
> and XFS.  ext2_iomap_end() just checks for a write underrun, and
> xfs_file_iomap_end() checks to see if it needs to finish a delayed
> allocation.  However, in the future iomap_end() calls might be needed to
> make sure we have balanced allocations, locks, etc.  So, add calls to
> iomap_end() with appropriate error handling to dax_iomap_fault().
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> Suggested-by: Jan Kara <jack@suse.cz>
...
> @@ -1239,6 +1253,17 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
>  		break;
>  	}
>  
> + finish_iomap:
> +	if (ops->iomap_end) {
> +		if (error) {
> +			/* keep previous error */
> +			ops->iomap_end(inode, pos, PAGE_SIZE, PAGE_SIZE, flags,
> +					&iomap);

I think for the error case we should set number of 'written' bytes to 0 to
tell fs to cancel what it has prepared. This is mostly cosmetic since the
only case where I can imagine this would matter is shared write fault and
in that case we have currently no error path but still it could bite us in
the future.

Other than that the patch looks good so you can add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 182+ messages in thread

* Re: [PATCH v5 13/17] dax: dax_iomap_fault() needs to call iomap_end()
@ 2016-10-11  7:21     ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2016-10-11  7:21 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

On Fri 07-10-16 15:09:00, Ross Zwisler wrote:
> Currently iomap_end() doesn't do anything for DAX page faults for both ext2
> and XFS.  ext2_iomap_end() just checks for a write underrun, and
> xfs_file_iomap_end() checks to see if it needs to finish a delayed
> allocation.  However, in the future iomap_end() calls might be needed to
> make sure we have balanced allocations, locks, etc.  So, add calls to
> iomap_end() with appropriate error handling to dax_iomap_fault().
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> Suggested-by: Jan Kara <jack@suse.cz>
...
> @@ -1239,6 +1253,17 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
>  		break;
>  	}
>  
> + finish_iomap:
> +	if (ops->iomap_end) {
> +		if (error) {
> +			/* keep previous error */
> +			ops->iomap_end(inode, pos, PAGE_SIZE, PAGE_SIZE, flags,
> +					&iomap);

I think for the error case we should set number of 'written' bytes to 0 to
tell fs to cancel what it has prepared. This is mostly cosmetic since the
only case where I can imagine this would matter is shared write fault and
in that case we have currently no error path but still it could bite us in
the future.

Other than that the patch looks good so you can add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 182+ messages in thread

* Re: [PATCH v5 13/17] dax: dax_iomap_fault() needs to call iomap_end()
@ 2016-10-11  7:21     ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2016-10-11  7:21 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Theodore Ts'o, Matthew Wilcox, Dave Chinner,
	linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-xfs-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Andreas Dilger, Alexander Viro,
	Jan Kara, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Andrew Morton,
	linux-ext4-u79uwXL29TY76Z2rM5mHXA, Christoph Hellwig

On Fri 07-10-16 15:09:00, Ross Zwisler wrote:
> Currently iomap_end() doesn't do anything for DAX page faults for both ext2
> and XFS.  ext2_iomap_end() just checks for a write underrun, and
> xfs_file_iomap_end() checks to see if it needs to finish a delayed
> allocation.  However, in the future iomap_end() calls might be needed to
> make sure we have balanced allocations, locks, etc.  So, add calls to
> iomap_end() with appropriate error handling to dax_iomap_fault().
> 
> Signed-off-by: Ross Zwisler <ross.zwisler-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
> Suggested-by: Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org>
...
> @@ -1239,6 +1253,17 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
>  		break;
>  	}
>  
> + finish_iomap:
> +	if (ops->iomap_end) {
> +		if (error) {
> +			/* keep previous error */
> +			ops->iomap_end(inode, pos, PAGE_SIZE, PAGE_SIZE, flags,
> +					&iomap);

I think for the error case we should set number of 'written' bytes to 0 to
tell fs to cancel what it has prepared. This is mostly cosmetic since the
only case where I can imagine this would matter is shared write fault and
in that case we have currently no error path but still it could bite us in
the future.

Other than that the patch looks good so you can add:

Reviewed-by: Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org>

								Honza
-- 
Jan Kara <jack-IBi9RG/b67k@public.gmane.org>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 182+ messages in thread

* Re: [PATCH v5 13/17] dax: dax_iomap_fault() needs to call iomap_end()
@ 2016-10-11  7:21     ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2016-10-11  7:21 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

On Fri 07-10-16 15:09:00, Ross Zwisler wrote:
> Currently iomap_end() doesn't do anything for DAX page faults for both ext2
> and XFS.  ext2_iomap_end() just checks for a write underrun, and
> xfs_file_iomap_end() checks to see if it needs to finish a delayed
> allocation.  However, in the future iomap_end() calls might be needed to
> make sure we have balanced allocations, locks, etc.  So, add calls to
> iomap_end() with appropriate error handling to dax_iomap_fault().
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> Suggested-by: Jan Kara <jack@suse.cz>
...
> @@ -1239,6 +1253,17 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
>  		break;
>  	}
>  
> + finish_iomap:
> +	if (ops->iomap_end) {
> +		if (error) {
> +			/* keep previous error */
> +			ops->iomap_end(inode, pos, PAGE_SIZE, PAGE_SIZE, flags,
> +					&iomap);

I think for the error case we should set number of 'written' bytes to 0 to
tell fs to cancel what it has prepared. This is mostly cosmetic since the
only case where I can imagine this would matter is shared write fault and
in that case we have currently no error path but still it could bite us in
the future.

Other than that the patch looks good so you can add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 182+ messages in thread

* Re: [PATCH v5 14/17] dax: move RADIX_DAX_* defines to dax.h
  2016-10-07 21:09   ` Ross Zwisler
                       ` (2 preceding siblings ...)
  (?)
@ 2016-10-11  7:23     ` Jan Kara
  -1 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2016-10-11  7:23 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Theodore Ts'o, Matthew Wilcox, Dave Chinner, linux-nvdimm,
	linux-kernel, linux-xfs, linux-mm, Andreas Dilger,
	Alexander Viro, Jan Kara, linux-fsdevel, Andrew Morton,
	linux-ext4, Christoph Hellwig

On Fri 07-10-16 15:09:01, Ross Zwisler wrote:
> The RADIX_DAX_* defines currently mostly live in fs/dax.c, with just
> RADIX_DAX_ENTRY_LOCK being in include/linux/dax.h so it can be used in
> mm/filemap.c.  When we add PMD support, though, mm/filemap.c will also need
> access to the RADIX_DAX_PTE type so it can properly construct a 4k sized
> empty entry.
> 
> Instead of shifting the defines between dax.c and dax.h as they are
> individually used in other code, just move them wholesale to dax.h so
> they'll be available when we need them.
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>

Looks good to me. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/dax.c            | 14 --------------
>  include/linux/dax.h | 15 ++++++++++++++-
>  2 files changed, 14 insertions(+), 15 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 5e8febe..ac3cd05 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -34,20 +34,6 @@
>  #include <linux/iomap.h>
>  #include "internal.h"
>  
> -/*
> - * We use lowest available bit in exceptional entry for locking, other two
> - * bits to determine entry type. In total 3 special bits.
> - */
> -#define RADIX_DAX_SHIFT	(RADIX_TREE_EXCEPTIONAL_SHIFT + 3)
> -#define RADIX_DAX_PTE (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 1))
> -#define RADIX_DAX_PMD (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 2))
> -#define RADIX_DAX_TYPE_MASK (RADIX_DAX_PTE | RADIX_DAX_PMD)
> -#define RADIX_DAX_TYPE(entry) ((unsigned long)entry & RADIX_DAX_TYPE_MASK)
> -#define RADIX_DAX_SECTOR(entry) (((unsigned long)entry >> RADIX_DAX_SHIFT))
> -#define RADIX_DAX_ENTRY(sector, pmd) ((void *)((unsigned long)sector << \
> -		RADIX_DAX_SHIFT | (pmd ? RADIX_DAX_PMD : RADIX_DAX_PTE) | \
> -		RADIX_TREE_EXCEPTIONAL_ENTRY))
> -
>  /* We choose 4096 entries - same as per-zone page wait tables */
>  #define DAX_WAIT_TABLE_BITS 12
>  #define DAX_WAIT_TABLE_ENTRIES (1 << DAX_WAIT_TABLE_BITS)
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index a3dfee4..e9ea78c 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -8,8 +8,21 @@
>  
>  struct iomap_ops;
>  
> -/* We use lowest available exceptional entry bit for locking */
> +/*
> + * We use lowest available bit in exceptional entry for locking, other two
> + * bits to determine entry type. In total 3 special bits.
> + */
> +#define RADIX_DAX_SHIFT	(RADIX_TREE_EXCEPTIONAL_SHIFT + 3)
>  #define RADIX_DAX_ENTRY_LOCK (1 << RADIX_TREE_EXCEPTIONAL_SHIFT)
> +#define RADIX_DAX_PTE (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 1))
> +#define RADIX_DAX_PMD (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 2))
> +#define RADIX_DAX_TYPE_MASK (RADIX_DAX_PTE | RADIX_DAX_PMD)
> +#define RADIX_DAX_TYPE(entry) ((unsigned long)entry & RADIX_DAX_TYPE_MASK)
> +#define RADIX_DAX_SECTOR(entry) (((unsigned long)entry >> RADIX_DAX_SHIFT))
> +#define RADIX_DAX_ENTRY(sector, pmd) ((void *)((unsigned long)sector << \
> +		RADIX_DAX_SHIFT | (pmd ? RADIX_DAX_PMD : RADIX_DAX_PTE) | \
> +		RADIX_TREE_EXCEPTIONAL_ENTRY))
> +
>  
>  ssize_t dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
>  		struct iomap_ops *ops);
> -- 
> 2.7.4
> 
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 182+ messages in thread

* Re: [PATCH v5 14/17] dax: move RADIX_DAX_* defines to dax.h
@ 2016-10-11  7:23     ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2016-10-11  7:23 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

On Fri 07-10-16 15:09:01, Ross Zwisler wrote:
> The RADIX_DAX_* defines currently mostly live in fs/dax.c, with just
> RADIX_DAX_ENTRY_LOCK being in include/linux/dax.h so it can be used in
> mm/filemap.c.  When we add PMD support, though, mm/filemap.c will also need
> access to the RADIX_DAX_PTE type so it can properly construct a 4k sized
> empty entry.
> 
> Instead of shifting the defines between dax.c and dax.h as they are
> individually used in other code, just move them wholesale to dax.h so
> they'll be available when we need them.
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>

Looks good to me. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/dax.c            | 14 --------------
>  include/linux/dax.h | 15 ++++++++++++++-
>  2 files changed, 14 insertions(+), 15 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 5e8febe..ac3cd05 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -34,20 +34,6 @@
>  #include <linux/iomap.h>
>  #include "internal.h"
>  
> -/*
> - * We use lowest available bit in exceptional entry for locking, other two
> - * bits to determine entry type. In total 3 special bits.
> - */
> -#define RADIX_DAX_SHIFT	(RADIX_TREE_EXCEPTIONAL_SHIFT + 3)
> -#define RADIX_DAX_PTE (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 1))
> -#define RADIX_DAX_PMD (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 2))
> -#define RADIX_DAX_TYPE_MASK (RADIX_DAX_PTE | RADIX_DAX_PMD)
> -#define RADIX_DAX_TYPE(entry) ((unsigned long)entry & RADIX_DAX_TYPE_MASK)
> -#define RADIX_DAX_SECTOR(entry) (((unsigned long)entry >> RADIX_DAX_SHIFT))
> -#define RADIX_DAX_ENTRY(sector, pmd) ((void *)((unsigned long)sector << \
> -		RADIX_DAX_SHIFT | (pmd ? RADIX_DAX_PMD : RADIX_DAX_PTE) | \
> -		RADIX_TREE_EXCEPTIONAL_ENTRY))
> -
>  /* We choose 4096 entries - same as per-zone page wait tables */
>  #define DAX_WAIT_TABLE_BITS 12
>  #define DAX_WAIT_TABLE_ENTRIES (1 << DAX_WAIT_TABLE_BITS)
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index a3dfee4..e9ea78c 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -8,8 +8,21 @@
>  
>  struct iomap_ops;
>  
> -/* We use lowest available exceptional entry bit for locking */
> +/*
> + * We use lowest available bit in exceptional entry for locking, other two
> + * bits to determine entry type. In total 3 special bits.
> + */
> +#define RADIX_DAX_SHIFT	(RADIX_TREE_EXCEPTIONAL_SHIFT + 3)
>  #define RADIX_DAX_ENTRY_LOCK (1 << RADIX_TREE_EXCEPTIONAL_SHIFT)
> +#define RADIX_DAX_PTE (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 1))
> +#define RADIX_DAX_PMD (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 2))
> +#define RADIX_DAX_TYPE_MASK (RADIX_DAX_PTE | RADIX_DAX_PMD)
> +#define RADIX_DAX_TYPE(entry) ((unsigned long)entry & RADIX_DAX_TYPE_MASK)
> +#define RADIX_DAX_SECTOR(entry) (((unsigned long)entry >> RADIX_DAX_SHIFT))
> +#define RADIX_DAX_ENTRY(sector, pmd) ((void *)((unsigned long)sector << \
> +		RADIX_DAX_SHIFT | (pmd ? RADIX_DAX_PMD : RADIX_DAX_PTE) | \
> +		RADIX_TREE_EXCEPTIONAL_ENTRY))
> +
>  
>  ssize_t dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
>  		struct iomap_ops *ops);
> -- 
> 2.7.4
> 
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 182+ messages in thread

* Re: [PATCH v5 14/17] dax: move RADIX_DAX_* defines to dax.h
@ 2016-10-11  7:23     ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2016-10-11  7:23 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

On Fri 07-10-16 15:09:01, Ross Zwisler wrote:
> The RADIX_DAX_* defines currently mostly live in fs/dax.c, with just
> RADIX_DAX_ENTRY_LOCK being in include/linux/dax.h so it can be used in
> mm/filemap.c.  When we add PMD support, though, mm/filemap.c will also need
> access to the RADIX_DAX_PTE type so it can properly construct a 4k sized
> empty entry.
> 
> Instead of shifting the defines between dax.c and dax.h as they are
> individually used in other code, just move them wholesale to dax.h so
> they'll be available when we need them.
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>

Looks good to me. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/dax.c            | 14 --------------
>  include/linux/dax.h | 15 ++++++++++++++-
>  2 files changed, 14 insertions(+), 15 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 5e8febe..ac3cd05 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -34,20 +34,6 @@
>  #include <linux/iomap.h>
>  #include "internal.h"
>  
> -/*
> - * We use lowest available bit in exceptional entry for locking, other two
> - * bits to determine entry type. In total 3 special bits.
> - */
> -#define RADIX_DAX_SHIFT	(RADIX_TREE_EXCEPTIONAL_SHIFT + 3)
> -#define RADIX_DAX_PTE (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 1))
> -#define RADIX_DAX_PMD (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 2))
> -#define RADIX_DAX_TYPE_MASK (RADIX_DAX_PTE | RADIX_DAX_PMD)
> -#define RADIX_DAX_TYPE(entry) ((unsigned long)entry & RADIX_DAX_TYPE_MASK)
> -#define RADIX_DAX_SECTOR(entry) (((unsigned long)entry >> RADIX_DAX_SHIFT))
> -#define RADIX_DAX_ENTRY(sector, pmd) ((void *)((unsigned long)sector << \
> -		RADIX_DAX_SHIFT | (pmd ? RADIX_DAX_PMD : RADIX_DAX_PTE) | \
> -		RADIX_TREE_EXCEPTIONAL_ENTRY))
> -
>  /* We choose 4096 entries - same as per-zone page wait tables */
>  #define DAX_WAIT_TABLE_BITS 12
>  #define DAX_WAIT_TABLE_ENTRIES (1 << DAX_WAIT_TABLE_BITS)
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index a3dfee4..e9ea78c 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -8,8 +8,21 @@
>  
>  struct iomap_ops;
>  
> -/* We use lowest available exceptional entry bit for locking */
> +/*
> + * We use lowest available bit in exceptional entry for locking, other two
> + * bits to determine entry type. In total 3 special bits.
> + */
> +#define RADIX_DAX_SHIFT	(RADIX_TREE_EXCEPTIONAL_SHIFT + 3)
>  #define RADIX_DAX_ENTRY_LOCK (1 << RADIX_TREE_EXCEPTIONAL_SHIFT)
> +#define RADIX_DAX_PTE (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 1))
> +#define RADIX_DAX_PMD (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 2))
> +#define RADIX_DAX_TYPE_MASK (RADIX_DAX_PTE | RADIX_DAX_PMD)
> +#define RADIX_DAX_TYPE(entry) ((unsigned long)entry & RADIX_DAX_TYPE_MASK)
> +#define RADIX_DAX_SECTOR(entry) (((unsigned long)entry >> RADIX_DAX_SHIFT))
> +#define RADIX_DAX_ENTRY(sector, pmd) ((void *)((unsigned long)sector << \
> +		RADIX_DAX_SHIFT | (pmd ? RADIX_DAX_PMD : RADIX_DAX_PTE) | \
> +		RADIX_TREE_EXCEPTIONAL_ENTRY))
> +
>  
>  ssize_t dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
>  		struct iomap_ops *ops);
> -- 
> 2.7.4
> 
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 182+ messages in thread

* Re: [PATCH v5 14/17] dax: move RADIX_DAX_* defines to dax.h
@ 2016-10-11  7:23     ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2016-10-11  7:23 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Theodore Ts'o, Matthew Wilcox, Dave Chinner,
	linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-xfs-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Andreas Dilger, Alexander Viro,
	Jan Kara, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Andrew Morton,
	linux-ext4-u79uwXL29TY76Z2rM5mHXA, Christoph Hellwig

On Fri 07-10-16 15:09:01, Ross Zwisler wrote:
> The RADIX_DAX_* defines currently mostly live in fs/dax.c, with just
> RADIX_DAX_ENTRY_LOCK being in include/linux/dax.h so it can be used in
> mm/filemap.c.  When we add PMD support, though, mm/filemap.c will also need
> access to the RADIX_DAX_PTE type so it can properly construct a 4k sized
> empty entry.
> 
> Instead of shifting the defines between dax.c and dax.h as they are
> individually used in other code, just move them wholesale to dax.h so
> they'll be available when we need them.
> 
> Signed-off-by: Ross Zwisler <ross.zwisler-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>

Looks good to me. You can add:

Reviewed-by: Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org>

								Honza

> ---
>  fs/dax.c            | 14 --------------
>  include/linux/dax.h | 15 ++++++++++++++-
>  2 files changed, 14 insertions(+), 15 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 5e8febe..ac3cd05 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -34,20 +34,6 @@
>  #include <linux/iomap.h>
>  #include "internal.h"
>  
> -/*
> - * We use lowest available bit in exceptional entry for locking, other two
> - * bits to determine entry type. In total 3 special bits.
> - */
> -#define RADIX_DAX_SHIFT	(RADIX_TREE_EXCEPTIONAL_SHIFT + 3)
> -#define RADIX_DAX_PTE (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 1))
> -#define RADIX_DAX_PMD (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 2))
> -#define RADIX_DAX_TYPE_MASK (RADIX_DAX_PTE | RADIX_DAX_PMD)
> -#define RADIX_DAX_TYPE(entry) ((unsigned long)entry & RADIX_DAX_TYPE_MASK)
> -#define RADIX_DAX_SECTOR(entry) (((unsigned long)entry >> RADIX_DAX_SHIFT))
> -#define RADIX_DAX_ENTRY(sector, pmd) ((void *)((unsigned long)sector << \
> -		RADIX_DAX_SHIFT | (pmd ? RADIX_DAX_PMD : RADIX_DAX_PTE) | \
> -		RADIX_TREE_EXCEPTIONAL_ENTRY))
> -
>  /* We choose 4096 entries - same as per-zone page wait tables */
>  #define DAX_WAIT_TABLE_BITS 12
>  #define DAX_WAIT_TABLE_ENTRIES (1 << DAX_WAIT_TABLE_BITS)
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index a3dfee4..e9ea78c 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -8,8 +8,21 @@
>  
>  struct iomap_ops;
>  
> -/* We use lowest available exceptional entry bit for locking */
> +/*
> + * We use lowest available bit in exceptional entry for locking, other two
> + * bits to determine entry type. In total 3 special bits.
> + */
> +#define RADIX_DAX_SHIFT	(RADIX_TREE_EXCEPTIONAL_SHIFT + 3)
>  #define RADIX_DAX_ENTRY_LOCK (1 << RADIX_TREE_EXCEPTIONAL_SHIFT)
> +#define RADIX_DAX_PTE (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 1))
> +#define RADIX_DAX_PMD (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 2))
> +#define RADIX_DAX_TYPE_MASK (RADIX_DAX_PTE | RADIX_DAX_PMD)
> +#define RADIX_DAX_TYPE(entry) ((unsigned long)entry & RADIX_DAX_TYPE_MASK)
> +#define RADIX_DAX_SECTOR(entry) (((unsigned long)entry >> RADIX_DAX_SHIFT))
> +#define RADIX_DAX_ENTRY(sector, pmd) ((void *)((unsigned long)sector << \
> +		RADIX_DAX_SHIFT | (pmd ? RADIX_DAX_PMD : RADIX_DAX_PTE) | \
> +		RADIX_TREE_EXCEPTIONAL_ENTRY))
> +
>  
>  ssize_t dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
>  		struct iomap_ops *ops);
> -- 
> 2.7.4
> 
> 
-- 
Jan Kara <jack-IBi9RG/b67k@public.gmane.org>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 182+ messages in thread

* Re: [PATCH v5 14/17] dax: move RADIX_DAX_* defines to dax.h
@ 2016-10-11  7:23     ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2016-10-11  7:23 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

On Fri 07-10-16 15:09:01, Ross Zwisler wrote:
> The RADIX_DAX_* defines currently mostly live in fs/dax.c, with just
> RADIX_DAX_ENTRY_LOCK being in include/linux/dax.h so it can be used in
> mm/filemap.c.  When we add PMD support, though, mm/filemap.c will also need
> access to the RADIX_DAX_PTE type so it can properly construct a 4k sized
> empty entry.
> 
> Instead of shifting the defines between dax.c and dax.h as they are
> individually used in other code, just move them wholesale to dax.h so
> they'll be available when we need them.
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>

Looks good to me. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/dax.c            | 14 --------------
>  include/linux/dax.h | 15 ++++++++++++++-
>  2 files changed, 14 insertions(+), 15 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 5e8febe..ac3cd05 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -34,20 +34,6 @@
>  #include <linux/iomap.h>
>  #include "internal.h"
>  
> -/*
> - * We use lowest available bit in exceptional entry for locking, other two
> - * bits to determine entry type. In total 3 special bits.
> - */
> -#define RADIX_DAX_SHIFT	(RADIX_TREE_EXCEPTIONAL_SHIFT + 3)
> -#define RADIX_DAX_PTE (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 1))
> -#define RADIX_DAX_PMD (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 2))
> -#define RADIX_DAX_TYPE_MASK (RADIX_DAX_PTE | RADIX_DAX_PMD)
> -#define RADIX_DAX_TYPE(entry) ((unsigned long)entry & RADIX_DAX_TYPE_MASK)
> -#define RADIX_DAX_SECTOR(entry) (((unsigned long)entry >> RADIX_DAX_SHIFT))
> -#define RADIX_DAX_ENTRY(sector, pmd) ((void *)((unsigned long)sector << \
> -		RADIX_DAX_SHIFT | (pmd ? RADIX_DAX_PMD : RADIX_DAX_PTE) | \
> -		RADIX_TREE_EXCEPTIONAL_ENTRY))
> -
>  /* We choose 4096 entries - same as per-zone page wait tables */
>  #define DAX_WAIT_TABLE_BITS 12
>  #define DAX_WAIT_TABLE_ENTRIES (1 << DAX_WAIT_TABLE_BITS)
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index a3dfee4..e9ea78c 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -8,8 +8,21 @@
>  
>  struct iomap_ops;
>  
> -/* We use lowest available exceptional entry bit for locking */
> +/*
> + * We use lowest available bit in exceptional entry for locking, other two
> + * bits to determine entry type. In total 3 special bits.
> + */
> +#define RADIX_DAX_SHIFT	(RADIX_TREE_EXCEPTIONAL_SHIFT + 3)
>  #define RADIX_DAX_ENTRY_LOCK (1 << RADIX_TREE_EXCEPTIONAL_SHIFT)
> +#define RADIX_DAX_PTE (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 1))
> +#define RADIX_DAX_PMD (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 2))
> +#define RADIX_DAX_TYPE_MASK (RADIX_DAX_PTE | RADIX_DAX_PMD)
> +#define RADIX_DAX_TYPE(entry) ((unsigned long)entry & RADIX_DAX_TYPE_MASK)
> +#define RADIX_DAX_SECTOR(entry) (((unsigned long)entry >> RADIX_DAX_SHIFT))
> +#define RADIX_DAX_ENTRY(sector, pmd) ((void *)((unsigned long)sector << \
> +		RADIX_DAX_SHIFT | (pmd ? RADIX_DAX_PMD : RADIX_DAX_PTE) | \
> +		RADIX_TREE_EXCEPTIONAL_ENTRY))
> +
>  
>  ssize_t dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
>  		struct iomap_ops *ops);
> -- 
> 2.7.4
> 
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 182+ messages in thread

* Re: [PATCH v5 15/17] dax: add struct iomap based DAX PMD support
  2016-10-07 21:09   ` Ross Zwisler
                       ` (2 preceding siblings ...)
  (?)
@ 2016-10-11  8:31     ` Jan Kara
  -1 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2016-10-11  8:31 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Theodore Ts'o, Matthew Wilcox, Dave Chinner, linux-nvdimm,
	linux-kernel, linux-xfs, linux-mm, Andreas Dilger,
	Alexander Viro, Jan Kara, linux-fsdevel, Andrew Morton,
	linux-ext4, Christoph Hellwig

On Fri 07-10-16 15:09:02, Ross Zwisler wrote:
> diff --git a/fs/dax.c b/fs/dax.c
> index ac3cd05..e51d51f 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -281,7 +281,7 @@ static wait_queue_head_t *dax_entry_waitqueue(struct address_space *mapping,
>  	 * queue to the start of that PMD.  This ensures that all offsets in
>  	 * the range covered by the PMD map to the same bit lock.
>  	 */
> -	if (RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD)
> +	if ((unsigned long)entry & RADIX_DAX_PMD)
>  		index &= ~((1UL << (PMD_SHIFT - PAGE_SHIFT)) - 1);

I agree with Christoph - helper for masking type bits would make this
nicer.

...
> -static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index)
> +static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index,
> +		unsigned long size_flag)
>  {
> +	bool pmd_downgrade = false; /* splitting 2MiB entry into 4k entries? */
>  	void *entry, **slot;
>  
>  restart:
>  	spin_lock_irq(&mapping->tree_lock);
>  	entry = get_unlocked_mapping_entry(mapping, index, &slot);
> +
> +	if (entry) {
> +		if (size_flag & RADIX_DAX_PMD) {
> +			if (!radix_tree_exceptional_entry(entry) ||
> +			    !((unsigned long)entry & RADIX_DAX_PMD)) {
> +				entry = ERR_PTR(-EEXIST);
> +				goto out_unlock;

You need to call put_unlocked_mapping_entry() if you use
get_unlocked_mapping_entry() and then decide not to lock it. The reason is
that the waitqueues we use are exclusive (we wake up only a single waiter
waiting for the lock) and so there can be some waiters for the entry lock
although we have not locked the entry ourselves.

> +			}
> +		} else { /* trying to grab a PTE entry */
> +			if (radix_tree_exceptional_entry(entry) &&
> +			    ((unsigned long)entry & RADIX_DAX_PMD) &&
> +			    ((unsigned long)entry &
> +			     (RADIX_DAX_HZP|RADIX_DAX_EMPTY))) {
> +				pmd_downgrade = true;
> +			}
> +		}
> +	}
> +
>  	/* No entry for given index? Make sure radix tree is big enough. */
> -	if (!entry) {
> +	if (!entry || pmd_downgrade) {
>  		int err;
>  
> +		if (pmd_downgrade) {
> +			/*
> +			 * Make sure 'entry' remains valid while we drop
> +			 * mapping->tree_lock.
> +			 */
> +			entry = lock_slot(mapping, slot);
> +		}
> +
>  		spin_unlock_irq(&mapping->tree_lock);
>  		err = radix_tree_preload(
>  				mapping_gfp_mask(mapping) & ~__GFP_HIGHMEM);
>  		if (err)
>  			return ERR_PTR(err);

You need to unlock the locked entry before you return here...

> -		entry = (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY |
> -			       RADIX_DAX_ENTRY_LOCK);
> +
> +		/*
> +		 * Besides huge zero pages the only other thing that gets
> +		 * downgraded are empty entries which don't need to be
> +		 * unmapped.
> +		 */
> +		if (pmd_downgrade && ((unsigned long)entry & RADIX_DAX_HZP))
> +			unmap_mapping_range(mapping,
> +				(index << PAGE_SHIFT) & PMD_MASK, PMD_SIZE, 0);
> +
>  		spin_lock_irq(&mapping->tree_lock);
> -		err = radix_tree_insert(&mapping->page_tree, index, entry);
> +
> +		if (pmd_downgrade) {
> +			radix_tree_delete(&mapping->page_tree, index);
> +			mapping->nrexceptional--;
> +			dax_wake_mapping_entry_waiter(mapping, index, entry,
> +					false);

You need to set 'wake_all' argument here to true. Otherwise there could be
waiters waiting for non-existent entry forever...

> +		}
> +
> +		entry = dax_radix_entry(0, size_flag | RADIX_DAX_EMPTY);
> +
> +		err = __radix_tree_insert(&mapping->page_tree, index,
> +				dax_radix_order(entry), entry);
>  		radix_tree_preload_end();
>  		if (err) {
>  			spin_unlock_irq(&mapping->tree_lock);
> -			/* Someone already created the entry? */
> -			if (err == -EEXIST)
> +			/*
> +			 * Someone already created the entry?  This is a
> +			 * normal failure when inserting PMDs in a range
> +			 * that already contains PTEs.  In that case we want
> +			 * to return -EEXIST immediately.
> +			 */
> +			if (err == -EEXIST && !(size_flag & RADIX_DAX_PMD))
>  				goto restart;

Add a comment here that we can get here only if there was no radix tree
entry at 'index' and thus there can be no waiters to wake.

>  			return ERR_PTR(err);
>  		}
> @@ -441,6 +509,7 @@ restart:
>  		return page;
>  	}
>  	entry = lock_slot(mapping, slot);
> + out_unlock:
>  	spin_unlock_irq(&mapping->tree_lock);
>  	return entry;
>  }
> @@ -581,11 +650,17 @@ static int copy_user_dax(struct block_device *bdev, sector_t sector, size_t size
>  	return 0;
>  }
>  
> -#define DAX_PMD_INDEX(page_index) (page_index & (PMD_MASK >> PAGE_SHIFT))
> -
> +/*
> + * By this point grab_mapping_entry() has ensured that we have a locked entry
> + * of the appropriate size so we don't have to worry about downgrading PMDs to
> + * PTEs.  If we happen to be trying to insert a PTE and there is a PMD
> + * already in the tree, we will skip the insertion and just dirty the PMD as
> + * appropriate.
> + */
>  static void *dax_insert_mapping_entry(struct address_space *mapping,
>  				      struct vm_fault *vmf,
> -				      void *entry, sector_t sector)
> +				      void *entry, sector_t sector,
> +				      unsigned long flags)
>  {
>  	struct radix_tree_root *page_tree = &mapping->page_tree;
>  	int error = 0;
> @@ -608,22 +683,28 @@ static void *dax_insert_mapping_entry(struct address_space *mapping,
>  		error = radix_tree_preload(vmf->gfp_mask & ~__GFP_HIGHMEM);
>  		if (error)
>  			return ERR_PTR(error);
> +	} else if (((unsigned long)entry & RADIX_DAX_HZP) &&
> +			!(flags & RADIX_DAX_HZP)) {
> +		/* replacing huge zero page with PMD block mapping */
> +		unmap_mapping_range(mapping,
> +			(vmf->pgoff << PAGE_SHIFT) & PMD_MASK, PMD_SIZE, 0);
>  	}
>  
>  	spin_lock_irq(&mapping->tree_lock);
> -	new_entry = (void *)((unsigned long)RADIX_DAX_ENTRY(sector, false) |
> -		       RADIX_DAX_ENTRY_LOCK);
> +	new_entry = dax_radix_entry(sector, flags);
> +

You've lost the RADIX_DAX_ENTRY_LOCK flag here?

>  	if (hole_fill) {
>  		__delete_from_page_cache(entry, NULL);
>  		/* Drop pagecache reference */
>  		put_page(entry);
> -		error = radix_tree_insert(page_tree, index, new_entry);
> +		error = __radix_tree_insert(page_tree, index,
> +				dax_radix_order(new_entry), new_entry);
>  		if (error) {
>  			new_entry = ERR_PTR(error);
>  			goto unlock;
>  		}
>  		mapping->nrexceptional++;
> -	} else {
> +	} else if ((unsigned long)entry & (RADIX_DAX_HZP|RADIX_DAX_EMPTY)) {
>  		void **slot;
>  		void *ret;

Uh, why this condition need to change? Is it some protection so that we
don't replace a mapped PMD entry with PTE one?

<snip>

> @@ -1261,4 +1338,186 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
>  	return VM_FAULT_NOPAGE | major;
>  }
>  EXPORT_SYMBOL_GPL(dax_iomap_fault);
> +
> +#ifdef CONFIG_FS_DAX_PMD
> +/*
> + * The 'colour' (ie low bits) within a PMD of a page offset.  This comes up
> + * more often than one might expect in the below functions.
> + */
> +#define PG_PMD_COLOUR	((PMD_SIZE >> PAGE_SHIFT) - 1)

Just out of curiosity: Why the british spelling of 'colour'?

> +
> +static int dax_pmd_insert_mapping(struct vm_area_struct *vma, pmd_t *pmd,
> +		struct vm_fault *vmf, unsigned long address,
> +		struct iomap *iomap, loff_t pos, bool write, void **entryp)
> +{
> +	struct address_space *mapping = vma->vm_file->f_mapping;
> +	struct block_device *bdev = iomap->bdev;
> +	struct blk_dax_ctl dax = {
> +		.sector = dax_iomap_sector(iomap, pos),
> +		.size = PMD_SIZE,
> +	};
> +	long length = dax_map_atomic(bdev, &dax);
> +	void *ret;
> +
> +	if (length < 0) /* dax_map_atomic() failed */
> +		return VM_FAULT_FALLBACK;
> +	if (length < PMD_SIZE)
> +		goto unmap_fallback;
> +	if (pfn_t_to_pfn(dax.pfn) & PG_PMD_COLOUR)
> +		goto unmap_fallback;
> +	if (!pfn_t_devmap(dax.pfn))
> +		goto unmap_fallback;
> +
> +	dax_unmap_atomic(bdev, &dax);
> +
> +	ret = dax_insert_mapping_entry(mapping, vmf, *entryp, dax.sector,
> +			RADIX_DAX_PMD);
> +	if (IS_ERR(ret))
> +		return VM_FAULT_FALLBACK;
> +	*entryp = ret;
> +
> +	return vmf_insert_pfn_pmd(vma, address, pmd, dax.pfn, write);
> +
> + unmap_fallback:
> +	dax_unmap_atomic(bdev, &dax);
> +	return VM_FAULT_FALLBACK;
> +}
> +
> +static int dax_pmd_load_hole(struct vm_area_struct *vma, pmd_t *pmd,
> +		struct vm_fault *vmf, unsigned long address,
> +		struct iomap *iomap, void **entryp)
> +{
> +	struct address_space *mapping = vma->vm_file->f_mapping;
> +	unsigned long pmd_addr = address & PMD_MASK;
> +	struct page *zero_page;
> +	spinlock_t *ptl;
> +	pmd_t pmd_entry;
> +	void *ret;
> +
> +	zero_page = get_huge_zero_page();
> +
> +	if (unlikely(!zero_page))
> +		return VM_FAULT_FALLBACK;
> +
> +	ret = dax_insert_mapping_entry(mapping, vmf, *entryp, 0,
> +			RADIX_DAX_PMD | RADIX_DAX_HZP);
> +	if (IS_ERR(ret))
> +		return VM_FAULT_FALLBACK;
> +	*entryp = ret;
> +
> +	ptl = pmd_lock(vma->vm_mm, pmd);
> +	if (!pmd_none(*pmd)) {
> +		spin_unlock(ptl);
> +		return VM_FAULT_FALLBACK;
> +	}
> +
> +	pmd_entry = mk_pmd(zero_page, vma->vm_page_prot);
> +	pmd_entry = pmd_mkhuge(pmd_entry);
> +	set_pmd_at(vma->vm_mm, pmd_addr, pmd, pmd_entry);
> +	spin_unlock(ptl);
> +	return VM_FAULT_NOPAGE;
> +}
> +
> +int dax_iomap_pmd_fault(struct vm_area_struct *vma, unsigned long address,
> +		pmd_t *pmd, unsigned int flags, struct iomap_ops *ops)
> +{
> +	struct address_space *mapping = vma->vm_file->f_mapping;
> +	unsigned long pmd_addr = address & PMD_MASK;
> +	bool write = flags & FAULT_FLAG_WRITE;
> +	unsigned int iomap_flags = write ? IOMAP_WRITE : 0;
> +	struct inode *inode = mapping->host;
> +	int result = VM_FAULT_FALLBACK;
> +	struct iomap iomap = { 0 };

Why the 0 here? Just empty braces are enough to initialize the structure to
zeros.

> +	pgoff_t size, pgoff;
> +	struct vm_fault vmf;
> +	void *entry;
> +	loff_t pos;
> +	int error;
> +
> +	/* Fall back to PTEs if we're going to COW */
> +	if (write && !(vma->vm_flags & VM_SHARED)) {
> +		split_huge_pmd(vma, pmd, address);
> +		goto fallback;
> +	}
> +
> +	/* If the PMD would extend outside the VMA */
> +	if (pmd_addr < vma->vm_start)
> +		goto fallback;
> +	if ((pmd_addr + PMD_SIZE) > vma->vm_end)
> +		goto fallback;
> +
> +	/*
> +	 * Check whether offset isn't beyond end of file now. Caller is
> +	 * supposed to hold locks serializing us with truncate / punch hole so
> +	 * this is a reliable test.
> +	 */
> +	pgoff = linear_page_index(vma, pmd_addr);
> +	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;

Nitpick - 'size' does not express that this is in pages and rounded up.
Maybe we could have:

	max_pgoff = (i_size_read(inode) - 1) >> PAGE_SHIFT;

and then use strict inequalities below?


								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 182+ messages in thread

* Re: [PATCH v5 15/17] dax: add struct iomap based DAX PMD support
@ 2016-10-11  8:31     ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2016-10-11  8:31 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

On Fri 07-10-16 15:09:02, Ross Zwisler wrote:
> diff --git a/fs/dax.c b/fs/dax.c
> index ac3cd05..e51d51f 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -281,7 +281,7 @@ static wait_queue_head_t *dax_entry_waitqueue(struct address_space *mapping,
>  	 * queue to the start of that PMD.  This ensures that all offsets in
>  	 * the range covered by the PMD map to the same bit lock.
>  	 */
> -	if (RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD)
> +	if ((unsigned long)entry & RADIX_DAX_PMD)
>  		index &= ~((1UL << (PMD_SHIFT - PAGE_SHIFT)) - 1);

I agree with Christoph - helper for masking type bits would make this
nicer.

...
> -static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index)
> +static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index,
> +		unsigned long size_flag)
>  {
> +	bool pmd_downgrade = false; /* splitting 2MiB entry into 4k entries? */
>  	void *entry, **slot;
>  
>  restart:
>  	spin_lock_irq(&mapping->tree_lock);
>  	entry = get_unlocked_mapping_entry(mapping, index, &slot);
> +
> +	if (entry) {
> +		if (size_flag & RADIX_DAX_PMD) {
> +			if (!radix_tree_exceptional_entry(entry) ||
> +			    !((unsigned long)entry & RADIX_DAX_PMD)) {
> +				entry = ERR_PTR(-EEXIST);
> +				goto out_unlock;

You need to call put_unlocked_mapping_entry() if you use
get_unlocked_mapping_entry() and then decide not to lock it. The reason is
that the waitqueues we use are exclusive (we wake up only a single waiter
waiting for the lock) and so there can be some waiters for the entry lock
although we have not locked the entry ourselves.

> +			}
> +		} else { /* trying to grab a PTE entry */
> +			if (radix_tree_exceptional_entry(entry) &&
> +			    ((unsigned long)entry & RADIX_DAX_PMD) &&
> +			    ((unsigned long)entry &
> +			     (RADIX_DAX_HZP|RADIX_DAX_EMPTY))) {
> +				pmd_downgrade = true;
> +			}
> +		}
> +	}
> +
>  	/* No entry for given index? Make sure radix tree is big enough. */
> -	if (!entry) {
> +	if (!entry || pmd_downgrade) {
>  		int err;
>  
> +		if (pmd_downgrade) {
> +			/*
> +			 * Make sure 'entry' remains valid while we drop
> +			 * mapping->tree_lock.
> +			 */
> +			entry = lock_slot(mapping, slot);
> +		}
> +
>  		spin_unlock_irq(&mapping->tree_lock);
>  		err = radix_tree_preload(
>  				mapping_gfp_mask(mapping) & ~__GFP_HIGHMEM);
>  		if (err)
>  			return ERR_PTR(err);

You need to unlock the locked entry before you return here...

> -		entry = (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY |
> -			       RADIX_DAX_ENTRY_LOCK);
> +
> +		/*
> +		 * Besides huge zero pages the only other thing that gets
> +		 * downgraded are empty entries which don't need to be
> +		 * unmapped.
> +		 */
> +		if (pmd_downgrade && ((unsigned long)entry & RADIX_DAX_HZP))
> +			unmap_mapping_range(mapping,
> +				(index << PAGE_SHIFT) & PMD_MASK, PMD_SIZE, 0);
> +
>  		spin_lock_irq(&mapping->tree_lock);
> -		err = radix_tree_insert(&mapping->page_tree, index, entry);
> +
> +		if (pmd_downgrade) {
> +			radix_tree_delete(&mapping->page_tree, index);
> +			mapping->nrexceptional--;
> +			dax_wake_mapping_entry_waiter(mapping, index, entry,
> +					false);

You need to set 'wake_all' argument here to true. Otherwise there could be
waiters waiting for non-existent entry forever...

> +		}
> +
> +		entry = dax_radix_entry(0, size_flag | RADIX_DAX_EMPTY);
> +
> +		err = __radix_tree_insert(&mapping->page_tree, index,
> +				dax_radix_order(entry), entry);
>  		radix_tree_preload_end();
>  		if (err) {
>  			spin_unlock_irq(&mapping->tree_lock);
> -			/* Someone already created the entry? */
> -			if (err == -EEXIST)
> +			/*
> +			 * Someone already created the entry?  This is a
> +			 * normal failure when inserting PMDs in a range
> +			 * that already contains PTEs.  In that case we want
> +			 * to return -EEXIST immediately.
> +			 */
> +			if (err == -EEXIST && !(size_flag & RADIX_DAX_PMD))
>  				goto restart;

Add a comment here that we can get here only if there was no radix tree
entry at 'index' and thus there can be no waiters to wake.

>  			return ERR_PTR(err);
>  		}
> @@ -441,6 +509,7 @@ restart:
>  		return page;
>  	}
>  	entry = lock_slot(mapping, slot);
> + out_unlock:
>  	spin_unlock_irq(&mapping->tree_lock);
>  	return entry;
>  }
> @@ -581,11 +650,17 @@ static int copy_user_dax(struct block_device *bdev, sector_t sector, size_t size
>  	return 0;
>  }
>  
> -#define DAX_PMD_INDEX(page_index) (page_index & (PMD_MASK >> PAGE_SHIFT))
> -
> +/*
> + * By this point grab_mapping_entry() has ensured that we have a locked entry
> + * of the appropriate size so we don't have to worry about downgrading PMDs to
> + * PTEs.  If we happen to be trying to insert a PTE and there is a PMD
> + * already in the tree, we will skip the insertion and just dirty the PMD as
> + * appropriate.
> + */
>  static void *dax_insert_mapping_entry(struct address_space *mapping,
>  				      struct vm_fault *vmf,
> -				      void *entry, sector_t sector)
> +				      void *entry, sector_t sector,
> +				      unsigned long flags)
>  {
>  	struct radix_tree_root *page_tree = &mapping->page_tree;
>  	int error = 0;
> @@ -608,22 +683,28 @@ static void *dax_insert_mapping_entry(struct address_space *mapping,
>  		error = radix_tree_preload(vmf->gfp_mask & ~__GFP_HIGHMEM);
>  		if (error)
>  			return ERR_PTR(error);
> +	} else if (((unsigned long)entry & RADIX_DAX_HZP) &&
> +			!(flags & RADIX_DAX_HZP)) {
> +		/* replacing huge zero page with PMD block mapping */
> +		unmap_mapping_range(mapping,
> +			(vmf->pgoff << PAGE_SHIFT) & PMD_MASK, PMD_SIZE, 0);
>  	}
>  
>  	spin_lock_irq(&mapping->tree_lock);
> -	new_entry = (void *)((unsigned long)RADIX_DAX_ENTRY(sector, false) |
> -		       RADIX_DAX_ENTRY_LOCK);
> +	new_entry = dax_radix_entry(sector, flags);
> +

You've lost the RADIX_DAX_ENTRY_LOCK flag here?

>  	if (hole_fill) {
>  		__delete_from_page_cache(entry, NULL);
>  		/* Drop pagecache reference */
>  		put_page(entry);
> -		error = radix_tree_insert(page_tree, index, new_entry);
> +		error = __radix_tree_insert(page_tree, index,
> +				dax_radix_order(new_entry), new_entry);
>  		if (error) {
>  			new_entry = ERR_PTR(error);
>  			goto unlock;
>  		}
>  		mapping->nrexceptional++;
> -	} else {
> +	} else if ((unsigned long)entry & (RADIX_DAX_HZP|RADIX_DAX_EMPTY)) {
>  		void **slot;
>  		void *ret;

Uh, why this condition need to change? Is it some protection so that we
don't replace a mapped PMD entry with PTE one?

<snip>

> @@ -1261,4 +1338,186 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
>  	return VM_FAULT_NOPAGE | major;
>  }
>  EXPORT_SYMBOL_GPL(dax_iomap_fault);
> +
> +#ifdef CONFIG_FS_DAX_PMD
> +/*
> + * The 'colour' (ie low bits) within a PMD of a page offset.  This comes up
> + * more often than one might expect in the below functions.
> + */
> +#define PG_PMD_COLOUR	((PMD_SIZE >> PAGE_SHIFT) - 1)

Just out of curiosity: Why the british spelling of 'colour'?

> +
> +static int dax_pmd_insert_mapping(struct vm_area_struct *vma, pmd_t *pmd,
> +		struct vm_fault *vmf, unsigned long address,
> +		struct iomap *iomap, loff_t pos, bool write, void **entryp)
> +{
> +	struct address_space *mapping = vma->vm_file->f_mapping;
> +	struct block_device *bdev = iomap->bdev;
> +	struct blk_dax_ctl dax = {
> +		.sector = dax_iomap_sector(iomap, pos),
> +		.size = PMD_SIZE,
> +	};
> +	long length = dax_map_atomic(bdev, &dax);
> +	void *ret;
> +
> +	if (length < 0) /* dax_map_atomic() failed */
> +		return VM_FAULT_FALLBACK;
> +	if (length < PMD_SIZE)
> +		goto unmap_fallback;
> +	if (pfn_t_to_pfn(dax.pfn) & PG_PMD_COLOUR)
> +		goto unmap_fallback;
> +	if (!pfn_t_devmap(dax.pfn))
> +		goto unmap_fallback;
> +
> +	dax_unmap_atomic(bdev, &dax);
> +
> +	ret = dax_insert_mapping_entry(mapping, vmf, *entryp, dax.sector,
> +			RADIX_DAX_PMD);
> +	if (IS_ERR(ret))
> +		return VM_FAULT_FALLBACK;
> +	*entryp = ret;
> +
> +	return vmf_insert_pfn_pmd(vma, address, pmd, dax.pfn, write);
> +
> + unmap_fallback:
> +	dax_unmap_atomic(bdev, &dax);
> +	return VM_FAULT_FALLBACK;
> +}
> +
> +static int dax_pmd_load_hole(struct vm_area_struct *vma, pmd_t *pmd,
> +		struct vm_fault *vmf, unsigned long address,
> +		struct iomap *iomap, void **entryp)
> +{
> +	struct address_space *mapping = vma->vm_file->f_mapping;
> +	unsigned long pmd_addr = address & PMD_MASK;
> +	struct page *zero_page;
> +	spinlock_t *ptl;
> +	pmd_t pmd_entry;
> +	void *ret;
> +
> +	zero_page = get_huge_zero_page();
> +
> +	if (unlikely(!zero_page))
> +		return VM_FAULT_FALLBACK;
> +
> +	ret = dax_insert_mapping_entry(mapping, vmf, *entryp, 0,
> +			RADIX_DAX_PMD | RADIX_DAX_HZP);
> +	if (IS_ERR(ret))
> +		return VM_FAULT_FALLBACK;
> +	*entryp = ret;
> +
> +	ptl = pmd_lock(vma->vm_mm, pmd);
> +	if (!pmd_none(*pmd)) {
> +		spin_unlock(ptl);
> +		return VM_FAULT_FALLBACK;
> +	}
> +
> +	pmd_entry = mk_pmd(zero_page, vma->vm_page_prot);
> +	pmd_entry = pmd_mkhuge(pmd_entry);
> +	set_pmd_at(vma->vm_mm, pmd_addr, pmd, pmd_entry);
> +	spin_unlock(ptl);
> +	return VM_FAULT_NOPAGE;
> +}
> +
> +int dax_iomap_pmd_fault(struct vm_area_struct *vma, unsigned long address,
> +		pmd_t *pmd, unsigned int flags, struct iomap_ops *ops)
> +{
> +	struct address_space *mapping = vma->vm_file->f_mapping;
> +	unsigned long pmd_addr = address & PMD_MASK;
> +	bool write = flags & FAULT_FLAG_WRITE;
> +	unsigned int iomap_flags = write ? IOMAP_WRITE : 0;
> +	struct inode *inode = mapping->host;
> +	int result = VM_FAULT_FALLBACK;
> +	struct iomap iomap = { 0 };

Why the 0 here? Just empty braces are enough to initialize the structure to
zeros.

> +	pgoff_t size, pgoff;
> +	struct vm_fault vmf;
> +	void *entry;
> +	loff_t pos;
> +	int error;
> +
> +	/* Fall back to PTEs if we're going to COW */
> +	if (write && !(vma->vm_flags & VM_SHARED)) {
> +		split_huge_pmd(vma, pmd, address);
> +		goto fallback;
> +	}
> +
> +	/* If the PMD would extend outside the VMA */
> +	if (pmd_addr < vma->vm_start)
> +		goto fallback;
> +	if ((pmd_addr + PMD_SIZE) > vma->vm_end)
> +		goto fallback;
> +
> +	/*
> +	 * Check whether offset isn't beyond end of file now. Caller is
> +	 * supposed to hold locks serializing us with truncate / punch hole so
> +	 * this is a reliable test.
> +	 */
> +	pgoff = linear_page_index(vma, pmd_addr);
> +	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;

Nitpick - 'size' does not express that this is in pages and rounded up.
Maybe we could have:

	max_pgoff = (i_size_read(inode) - 1) >> PAGE_SHIFT;

and then use strict inequalities below?


								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 182+ messages in thread

* Re: [PATCH v5 15/17] dax: add struct iomap based DAX PMD support
@ 2016-10-11  8:31     ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2016-10-11  8:31 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

On Fri 07-10-16 15:09:02, Ross Zwisler wrote:
> diff --git a/fs/dax.c b/fs/dax.c
> index ac3cd05..e51d51f 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -281,7 +281,7 @@ static wait_queue_head_t *dax_entry_waitqueue(struct address_space *mapping,
>  	 * queue to the start of that PMD.  This ensures that all offsets in
>  	 * the range covered by the PMD map to the same bit lock.
>  	 */
> -	if (RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD)
> +	if ((unsigned long)entry & RADIX_DAX_PMD)
>  		index &= ~((1UL << (PMD_SHIFT - PAGE_SHIFT)) - 1);

I agree with Christoph - helper for masking type bits would make this
nicer.

...
> -static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index)
> +static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index,
> +		unsigned long size_flag)
>  {
> +	bool pmd_downgrade = false; /* splitting 2MiB entry into 4k entries? */
>  	void *entry, **slot;
>  
>  restart:
>  	spin_lock_irq(&mapping->tree_lock);
>  	entry = get_unlocked_mapping_entry(mapping, index, &slot);
> +
> +	if (entry) {
> +		if (size_flag & RADIX_DAX_PMD) {
> +			if (!radix_tree_exceptional_entry(entry) ||
> +			    !((unsigned long)entry & RADIX_DAX_PMD)) {
> +				entry = ERR_PTR(-EEXIST);
> +				goto out_unlock;

You need to call put_unlocked_mapping_entry() if you use
get_unlocked_mapping_entry() and then decide not to lock it. The reason is
that the waitqueues we use are exclusive (we wake up only a single waiter
waiting for the lock) and so there can be some waiters for the entry lock
although we have not locked the entry ourselves.

> +			}
> +		} else { /* trying to grab a PTE entry */
> +			if (radix_tree_exceptional_entry(entry) &&
> +			    ((unsigned long)entry & RADIX_DAX_PMD) &&
> +			    ((unsigned long)entry &
> +			     (RADIX_DAX_HZP|RADIX_DAX_EMPTY))) {
> +				pmd_downgrade = true;
> +			}
> +		}
> +	}
> +
>  	/* No entry for given index? Make sure radix tree is big enough. */
> -	if (!entry) {
> +	if (!entry || pmd_downgrade) {
>  		int err;
>  
> +		if (pmd_downgrade) {
> +			/*
> +			 * Make sure 'entry' remains valid while we drop
> +			 * mapping->tree_lock.
> +			 */
> +			entry = lock_slot(mapping, slot);
> +		}
> +
>  		spin_unlock_irq(&mapping->tree_lock);
>  		err = radix_tree_preload(
>  				mapping_gfp_mask(mapping) & ~__GFP_HIGHMEM);
>  		if (err)
>  			return ERR_PTR(err);

You need to unlock the locked entry before you return here...

> -		entry = (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY |
> -			       RADIX_DAX_ENTRY_LOCK);
> +
> +		/*
> +		 * Besides huge zero pages the only other thing that gets
> +		 * downgraded are empty entries which don't need to be
> +		 * unmapped.
> +		 */
> +		if (pmd_downgrade && ((unsigned long)entry & RADIX_DAX_HZP))
> +			unmap_mapping_range(mapping,
> +				(index << PAGE_SHIFT) & PMD_MASK, PMD_SIZE, 0);
> +
>  		spin_lock_irq(&mapping->tree_lock);
> -		err = radix_tree_insert(&mapping->page_tree, index, entry);
> +
> +		if (pmd_downgrade) {
> +			radix_tree_delete(&mapping->page_tree, index);
> +			mapping->nrexceptional--;
> +			dax_wake_mapping_entry_waiter(mapping, index, entry,
> +					false);

You need to set 'wake_all' argument here to true. Otherwise there could be
waiters waiting for non-existent entry forever...

> +		}
> +
> +		entry = dax_radix_entry(0, size_flag | RADIX_DAX_EMPTY);
> +
> +		err = __radix_tree_insert(&mapping->page_tree, index,
> +				dax_radix_order(entry), entry);
>  		radix_tree_preload_end();
>  		if (err) {
>  			spin_unlock_irq(&mapping->tree_lock);
> -			/* Someone already created the entry? */
> -			if (err == -EEXIST)
> +			/*
> +			 * Someone already created the entry?  This is a
> +			 * normal failure when inserting PMDs in a range
> +			 * that already contains PTEs.  In that case we want
> +			 * to return -EEXIST immediately.
> +			 */
> +			if (err == -EEXIST && !(size_flag & RADIX_DAX_PMD))
>  				goto restart;

Add a comment here that we can get here only if there was no radix tree
entry at 'index' and thus there can be no waiters to wake.

>  			return ERR_PTR(err);
>  		}
> @@ -441,6 +509,7 @@ restart:
>  		return page;
>  	}
>  	entry = lock_slot(mapping, slot);
> + out_unlock:
>  	spin_unlock_irq(&mapping->tree_lock);
>  	return entry;
>  }
> @@ -581,11 +650,17 @@ static int copy_user_dax(struct block_device *bdev, sector_t sector, size_t size
>  	return 0;
>  }
>  
> -#define DAX_PMD_INDEX(page_index) (page_index & (PMD_MASK >> PAGE_SHIFT))
> -
> +/*
> + * By this point grab_mapping_entry() has ensured that we have a locked entry
> + * of the appropriate size so we don't have to worry about downgrading PMDs to
> + * PTEs.  If we happen to be trying to insert a PTE and there is a PMD
> + * already in the tree, we will skip the insertion and just dirty the PMD as
> + * appropriate.
> + */
>  static void *dax_insert_mapping_entry(struct address_space *mapping,
>  				      struct vm_fault *vmf,
> -				      void *entry, sector_t sector)
> +				      void *entry, sector_t sector,
> +				      unsigned long flags)
>  {
>  	struct radix_tree_root *page_tree = &mapping->page_tree;
>  	int error = 0;
> @@ -608,22 +683,28 @@ static void *dax_insert_mapping_entry(struct address_space *mapping,
>  		error = radix_tree_preload(vmf->gfp_mask & ~__GFP_HIGHMEM);
>  		if (error)
>  			return ERR_PTR(error);
> +	} else if (((unsigned long)entry & RADIX_DAX_HZP) &&
> +			!(flags & RADIX_DAX_HZP)) {
> +		/* replacing huge zero page with PMD block mapping */
> +		unmap_mapping_range(mapping,
> +			(vmf->pgoff << PAGE_SHIFT) & PMD_MASK, PMD_SIZE, 0);
>  	}
>  
>  	spin_lock_irq(&mapping->tree_lock);
> -	new_entry = (void *)((unsigned long)RADIX_DAX_ENTRY(sector, false) |
> -		       RADIX_DAX_ENTRY_LOCK);
> +	new_entry = dax_radix_entry(sector, flags);
> +

You've lost the RADIX_DAX_ENTRY_LOCK flag here?

>  	if (hole_fill) {
>  		__delete_from_page_cache(entry, NULL);
>  		/* Drop pagecache reference */
>  		put_page(entry);
> -		error = radix_tree_insert(page_tree, index, new_entry);
> +		error = __radix_tree_insert(page_tree, index,
> +				dax_radix_order(new_entry), new_entry);
>  		if (error) {
>  			new_entry = ERR_PTR(error);
>  			goto unlock;
>  		}
>  		mapping->nrexceptional++;
> -	} else {
> +	} else if ((unsigned long)entry & (RADIX_DAX_HZP|RADIX_DAX_EMPTY)) {
>  		void **slot;
>  		void *ret;

Uh, why this condition need to change? Is it some protection so that we
don't replace a mapped PMD entry with PTE one?

<snip>

> @@ -1261,4 +1338,186 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
>  	return VM_FAULT_NOPAGE | major;
>  }
>  EXPORT_SYMBOL_GPL(dax_iomap_fault);
> +
> +#ifdef CONFIG_FS_DAX_PMD
> +/*
> + * The 'colour' (ie low bits) within a PMD of a page offset.  This comes up
> + * more often than one might expect in the below functions.
> + */
> +#define PG_PMD_COLOUR	((PMD_SIZE >> PAGE_SHIFT) - 1)

Just out of curiosity: Why the british spelling of 'colour'?

> +
> +static int dax_pmd_insert_mapping(struct vm_area_struct *vma, pmd_t *pmd,
> +		struct vm_fault *vmf, unsigned long address,
> +		struct iomap *iomap, loff_t pos, bool write, void **entryp)
> +{
> +	struct address_space *mapping = vma->vm_file->f_mapping;
> +	struct block_device *bdev = iomap->bdev;
> +	struct blk_dax_ctl dax = {
> +		.sector = dax_iomap_sector(iomap, pos),
> +		.size = PMD_SIZE,
> +	};
> +	long length = dax_map_atomic(bdev, &dax);
> +	void *ret;
> +
> +	if (length < 0) /* dax_map_atomic() failed */
> +		return VM_FAULT_FALLBACK;
> +	if (length < PMD_SIZE)
> +		goto unmap_fallback;
> +	if (pfn_t_to_pfn(dax.pfn) & PG_PMD_COLOUR)
> +		goto unmap_fallback;
> +	if (!pfn_t_devmap(dax.pfn))
> +		goto unmap_fallback;
> +
> +	dax_unmap_atomic(bdev, &dax);
> +
> +	ret = dax_insert_mapping_entry(mapping, vmf, *entryp, dax.sector,
> +			RADIX_DAX_PMD);
> +	if (IS_ERR(ret))
> +		return VM_FAULT_FALLBACK;
> +	*entryp = ret;
> +
> +	return vmf_insert_pfn_pmd(vma, address, pmd, dax.pfn, write);
> +
> + unmap_fallback:
> +	dax_unmap_atomic(bdev, &dax);
> +	return VM_FAULT_FALLBACK;
> +}
> +
> +static int dax_pmd_load_hole(struct vm_area_struct *vma, pmd_t *pmd,
> +		struct vm_fault *vmf, unsigned long address,
> +		struct iomap *iomap, void **entryp)
> +{
> +	struct address_space *mapping = vma->vm_file->f_mapping;
> +	unsigned long pmd_addr = address & PMD_MASK;
> +	struct page *zero_page;
> +	spinlock_t *ptl;
> +	pmd_t pmd_entry;
> +	void *ret;
> +
> +	zero_page = get_huge_zero_page();
> +
> +	if (unlikely(!zero_page))
> +		return VM_FAULT_FALLBACK;
> +
> +	ret = dax_insert_mapping_entry(mapping, vmf, *entryp, 0,
> +			RADIX_DAX_PMD | RADIX_DAX_HZP);
> +	if (IS_ERR(ret))
> +		return VM_FAULT_FALLBACK;
> +	*entryp = ret;
> +
> +	ptl = pmd_lock(vma->vm_mm, pmd);
> +	if (!pmd_none(*pmd)) {
> +		spin_unlock(ptl);
> +		return VM_FAULT_FALLBACK;
> +	}
> +
> +	pmd_entry = mk_pmd(zero_page, vma->vm_page_prot);
> +	pmd_entry = pmd_mkhuge(pmd_entry);
> +	set_pmd_at(vma->vm_mm, pmd_addr, pmd, pmd_entry);
> +	spin_unlock(ptl);
> +	return VM_FAULT_NOPAGE;
> +}
> +
> +int dax_iomap_pmd_fault(struct vm_area_struct *vma, unsigned long address,
> +		pmd_t *pmd, unsigned int flags, struct iomap_ops *ops)
> +{
> +	struct address_space *mapping = vma->vm_file->f_mapping;
> +	unsigned long pmd_addr = address & PMD_MASK;
> +	bool write = flags & FAULT_FLAG_WRITE;
> +	unsigned int iomap_flags = write ? IOMAP_WRITE : 0;
> +	struct inode *inode = mapping->host;
> +	int result = VM_FAULT_FALLBACK;
> +	struct iomap iomap = { 0 };

Why the 0 here? Just empty braces are enough to initialize the structure to
zeros.

> +	pgoff_t size, pgoff;
> +	struct vm_fault vmf;
> +	void *entry;
> +	loff_t pos;
> +	int error;
> +
> +	/* Fall back to PTEs if we're going to COW */
> +	if (write && !(vma->vm_flags & VM_SHARED)) {
> +		split_huge_pmd(vma, pmd, address);
> +		goto fallback;
> +	}
> +
> +	/* If the PMD would extend outside the VMA */
> +	if (pmd_addr < vma->vm_start)
> +		goto fallback;
> +	if ((pmd_addr + PMD_SIZE) > vma->vm_end)
> +		goto fallback;
> +
> +	/*
> +	 * Check whether offset isn't beyond end of file now. Caller is
> +	 * supposed to hold locks serializing us with truncate / punch hole so
> +	 * this is a reliable test.
> +	 */
> +	pgoff = linear_page_index(vma, pmd_addr);
> +	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;

Nitpick - 'size' does not express that this is in pages and rounded up.
Maybe we could have:

	max_pgoff = (i_size_read(inode) - 1) >> PAGE_SHIFT;

and then use strict inequalities below?


								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 182+ messages in thread

* Re: [PATCH v5 15/17] dax: add struct iomap based DAX PMD support
@ 2016-10-11  8:31     ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2016-10-11  8:31 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Theodore Ts'o, Matthew Wilcox, Dave Chinner,
	linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-xfs-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Andreas Dilger, Alexander Viro,
	Jan Kara, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Andrew Morton,
	linux-ext4-u79uwXL29TY76Z2rM5mHXA, Christoph Hellwig

On Fri 07-10-16 15:09:02, Ross Zwisler wrote:
> diff --git a/fs/dax.c b/fs/dax.c
> index ac3cd05..e51d51f 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -281,7 +281,7 @@ static wait_queue_head_t *dax_entry_waitqueue(struct address_space *mapping,
>  	 * queue to the start of that PMD.  This ensures that all offsets in
>  	 * the range covered by the PMD map to the same bit lock.
>  	 */
> -	if (RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD)
> +	if ((unsigned long)entry & RADIX_DAX_PMD)
>  		index &= ~((1UL << (PMD_SHIFT - PAGE_SHIFT)) - 1);

I agree with Christoph - helper for masking type bits would make this
nicer.

...
> -static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index)
> +static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index,
> +		unsigned long size_flag)
>  {
> +	bool pmd_downgrade = false; /* splitting 2MiB entry into 4k entries? */
>  	void *entry, **slot;
>  
>  restart:
>  	spin_lock_irq(&mapping->tree_lock);
>  	entry = get_unlocked_mapping_entry(mapping, index, &slot);
> +
> +	if (entry) {
> +		if (size_flag & RADIX_DAX_PMD) {
> +			if (!radix_tree_exceptional_entry(entry) ||
> +			    !((unsigned long)entry & RADIX_DAX_PMD)) {
> +				entry = ERR_PTR(-EEXIST);
> +				goto out_unlock;

You need to call put_unlocked_mapping_entry() if you use
get_unlocked_mapping_entry() and then decide not to lock it. The reason is
that the waitqueues we use are exclusive (we wake up only a single waiter
waiting for the lock) and so there can be some waiters for the entry lock
although we have not locked the entry ourselves.

> +			}
> +		} else { /* trying to grab a PTE entry */
> +			if (radix_tree_exceptional_entry(entry) &&
> +			    ((unsigned long)entry & RADIX_DAX_PMD) &&
> +			    ((unsigned long)entry &
> +			     (RADIX_DAX_HZP|RADIX_DAX_EMPTY))) {
> +				pmd_downgrade = true;
> +			}
> +		}
> +	}
> +
>  	/* No entry for given index? Make sure radix tree is big enough. */
> -	if (!entry) {
> +	if (!entry || pmd_downgrade) {
>  		int err;
>  
> +		if (pmd_downgrade) {
> +			/*
> +			 * Make sure 'entry' remains valid while we drop
> +			 * mapping->tree_lock.
> +			 */
> +			entry = lock_slot(mapping, slot);
> +		}
> +
>  		spin_unlock_irq(&mapping->tree_lock);
>  		err = radix_tree_preload(
>  				mapping_gfp_mask(mapping) & ~__GFP_HIGHMEM);
>  		if (err)
>  			return ERR_PTR(err);

You need to unlock the locked entry before you return here...

> -		entry = (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY |
> -			       RADIX_DAX_ENTRY_LOCK);
> +
> +		/*
> +		 * Besides huge zero pages the only other thing that gets
> +		 * downgraded are empty entries which don't need to be
> +		 * unmapped.
> +		 */
> +		if (pmd_downgrade && ((unsigned long)entry & RADIX_DAX_HZP))
> +			unmap_mapping_range(mapping,
> +				(index << PAGE_SHIFT) & PMD_MASK, PMD_SIZE, 0);
> +
>  		spin_lock_irq(&mapping->tree_lock);
> -		err = radix_tree_insert(&mapping->page_tree, index, entry);
> +
> +		if (pmd_downgrade) {
> +			radix_tree_delete(&mapping->page_tree, index);
> +			mapping->nrexceptional--;
> +			dax_wake_mapping_entry_waiter(mapping, index, entry,
> +					false);

You need to set 'wake_all' argument here to true. Otherwise there could be
waiters waiting for non-existent entry forever...

> +		}
> +
> +		entry = dax_radix_entry(0, size_flag | RADIX_DAX_EMPTY);
> +
> +		err = __radix_tree_insert(&mapping->page_tree, index,
> +				dax_radix_order(entry), entry);
>  		radix_tree_preload_end();
>  		if (err) {
>  			spin_unlock_irq(&mapping->tree_lock);
> -			/* Someone already created the entry? */
> -			if (err == -EEXIST)
> +			/*
> +			 * Someone already created the entry?  This is a
> +			 * normal failure when inserting PMDs in a range
> +			 * that already contains PTEs.  In that case we want
> +			 * to return -EEXIST immediately.
> +			 */
> +			if (err == -EEXIST && !(size_flag & RADIX_DAX_PMD))
>  				goto restart;

Add a comment here that we can get here only if there was no radix tree
entry at 'index' and thus there can be no waiters to wake.

>  			return ERR_PTR(err);
>  		}
> @@ -441,6 +509,7 @@ restart:
>  		return page;
>  	}
>  	entry = lock_slot(mapping, slot);
> + out_unlock:
>  	spin_unlock_irq(&mapping->tree_lock);
>  	return entry;
>  }
> @@ -581,11 +650,17 @@ static int copy_user_dax(struct block_device *bdev, sector_t sector, size_t size
>  	return 0;
>  }
>  
> -#define DAX_PMD_INDEX(page_index) (page_index & (PMD_MASK >> PAGE_SHIFT))
> -
> +/*
> + * By this point grab_mapping_entry() has ensured that we have a locked entry
> + * of the appropriate size so we don't have to worry about downgrading PMDs to
> + * PTEs.  If we happen to be trying to insert a PTE and there is a PMD
> + * already in the tree, we will skip the insertion and just dirty the PMD as
> + * appropriate.
> + */
>  static void *dax_insert_mapping_entry(struct address_space *mapping,
>  				      struct vm_fault *vmf,
> -				      void *entry, sector_t sector)
> +				      void *entry, sector_t sector,
> +				      unsigned long flags)
>  {
>  	struct radix_tree_root *page_tree = &mapping->page_tree;
>  	int error = 0;
> @@ -608,22 +683,28 @@ static void *dax_insert_mapping_entry(struct address_space *mapping,
>  		error = radix_tree_preload(vmf->gfp_mask & ~__GFP_HIGHMEM);
>  		if (error)
>  			return ERR_PTR(error);
> +	} else if (((unsigned long)entry & RADIX_DAX_HZP) &&
> +			!(flags & RADIX_DAX_HZP)) {
> +		/* replacing huge zero page with PMD block mapping */
> +		unmap_mapping_range(mapping,
> +			(vmf->pgoff << PAGE_SHIFT) & PMD_MASK, PMD_SIZE, 0);
>  	}
>  
>  	spin_lock_irq(&mapping->tree_lock);
> -	new_entry = (void *)((unsigned long)RADIX_DAX_ENTRY(sector, false) |
> -		       RADIX_DAX_ENTRY_LOCK);
> +	new_entry = dax_radix_entry(sector, flags);
> +

You've lost the RADIX_DAX_ENTRY_LOCK flag here?

>  	if (hole_fill) {
>  		__delete_from_page_cache(entry, NULL);
>  		/* Drop pagecache reference */
>  		put_page(entry);
> -		error = radix_tree_insert(page_tree, index, new_entry);
> +		error = __radix_tree_insert(page_tree, index,
> +				dax_radix_order(new_entry), new_entry);
>  		if (error) {
>  			new_entry = ERR_PTR(error);
>  			goto unlock;
>  		}
>  		mapping->nrexceptional++;
> -	} else {
> +	} else if ((unsigned long)entry & (RADIX_DAX_HZP|RADIX_DAX_EMPTY)) {
>  		void **slot;
>  		void *ret;

Uh, why this condition need to change? Is it some protection so that we
don't replace a mapped PMD entry with PTE one?

<snip>

> @@ -1261,4 +1338,186 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
>  	return VM_FAULT_NOPAGE | major;
>  }
>  EXPORT_SYMBOL_GPL(dax_iomap_fault);
> +
> +#ifdef CONFIG_FS_DAX_PMD
> +/*
> + * The 'colour' (ie low bits) within a PMD of a page offset.  This comes up
> + * more often than one might expect in the below functions.
> + */
> +#define PG_PMD_COLOUR	((PMD_SIZE >> PAGE_SHIFT) - 1)

Just out of curiosity: Why the british spelling of 'colour'?

> +
> +static int dax_pmd_insert_mapping(struct vm_area_struct *vma, pmd_t *pmd,
> +		struct vm_fault *vmf, unsigned long address,
> +		struct iomap *iomap, loff_t pos, bool write, void **entryp)
> +{
> +	struct address_space *mapping = vma->vm_file->f_mapping;
> +	struct block_device *bdev = iomap->bdev;
> +	struct blk_dax_ctl dax = {
> +		.sector = dax_iomap_sector(iomap, pos),
> +		.size = PMD_SIZE,
> +	};
> +	long length = dax_map_atomic(bdev, &dax);
> +	void *ret;
> +
> +	if (length < 0) /* dax_map_atomic() failed */
> +		return VM_FAULT_FALLBACK;
> +	if (length < PMD_SIZE)
> +		goto unmap_fallback;
> +	if (pfn_t_to_pfn(dax.pfn) & PG_PMD_COLOUR)
> +		goto unmap_fallback;
> +	if (!pfn_t_devmap(dax.pfn))
> +		goto unmap_fallback;
> +
> +	dax_unmap_atomic(bdev, &dax);
> +
> +	ret = dax_insert_mapping_entry(mapping, vmf, *entryp, dax.sector,
> +			RADIX_DAX_PMD);
> +	if (IS_ERR(ret))
> +		return VM_FAULT_FALLBACK;
> +	*entryp = ret;
> +
> +	return vmf_insert_pfn_pmd(vma, address, pmd, dax.pfn, write);
> +
> + unmap_fallback:
> +	dax_unmap_atomic(bdev, &dax);
> +	return VM_FAULT_FALLBACK;
> +}
> +
> +static int dax_pmd_load_hole(struct vm_area_struct *vma, pmd_t *pmd,
> +		struct vm_fault *vmf, unsigned long address,
> +		struct iomap *iomap, void **entryp)
> +{
> +	struct address_space *mapping = vma->vm_file->f_mapping;
> +	unsigned long pmd_addr = address & PMD_MASK;
> +	struct page *zero_page;
> +	spinlock_t *ptl;
> +	pmd_t pmd_entry;
> +	void *ret;
> +
> +	zero_page = get_huge_zero_page();
> +
> +	if (unlikely(!zero_page))
> +		return VM_FAULT_FALLBACK;
> +
> +	ret = dax_insert_mapping_entry(mapping, vmf, *entryp, 0,
> +			RADIX_DAX_PMD | RADIX_DAX_HZP);
> +	if (IS_ERR(ret))
> +		return VM_FAULT_FALLBACK;
> +	*entryp = ret;
> +
> +	ptl = pmd_lock(vma->vm_mm, pmd);
> +	if (!pmd_none(*pmd)) {
> +		spin_unlock(ptl);
> +		return VM_FAULT_FALLBACK;
> +	}
> +
> +	pmd_entry = mk_pmd(zero_page, vma->vm_page_prot);
> +	pmd_entry = pmd_mkhuge(pmd_entry);
> +	set_pmd_at(vma->vm_mm, pmd_addr, pmd, pmd_entry);
> +	spin_unlock(ptl);
> +	return VM_FAULT_NOPAGE;
> +}
> +
> +int dax_iomap_pmd_fault(struct vm_area_struct *vma, unsigned long address,
> +		pmd_t *pmd, unsigned int flags, struct iomap_ops *ops)
> +{
> +	struct address_space *mapping = vma->vm_file->f_mapping;
> +	unsigned long pmd_addr = address & PMD_MASK;
> +	bool write = flags & FAULT_FLAG_WRITE;
> +	unsigned int iomap_flags = write ? IOMAP_WRITE : 0;
> +	struct inode *inode = mapping->host;
> +	int result = VM_FAULT_FALLBACK;
> +	struct iomap iomap = { 0 };

Why the 0 here? Just empty braces are enough to initialize the structure to
zeros.

> +	pgoff_t size, pgoff;
> +	struct vm_fault vmf;
> +	void *entry;
> +	loff_t pos;
> +	int error;
> +
> +	/* Fall back to PTEs if we're going to COW */
> +	if (write && !(vma->vm_flags & VM_SHARED)) {
> +		split_huge_pmd(vma, pmd, address);
> +		goto fallback;
> +	}
> +
> +	/* If the PMD would extend outside the VMA */
> +	if (pmd_addr < vma->vm_start)
> +		goto fallback;
> +	if ((pmd_addr + PMD_SIZE) > vma->vm_end)
> +		goto fallback;
> +
> +	/*
> +	 * Check whether offset isn't beyond end of file now. Caller is
> +	 * supposed to hold locks serializing us with truncate / punch hole so
> +	 * this is a reliable test.
> +	 */
> +	pgoff = linear_page_index(vma, pmd_addr);
> +	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;

Nitpick - 'size' does not express that this is in pages and rounded up.
Maybe we could have:

	max_pgoff = (i_size_read(inode) - 1) >> PAGE_SHIFT;

and then use strict inequalities below?


								Honza
-- 
Jan Kara <jack-IBi9RG/b67k@public.gmane.org>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 182+ messages in thread

* Re: [PATCH v5 15/17] dax: add struct iomap based DAX PMD support
@ 2016-10-11  8:31     ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2016-10-11  8:31 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

On Fri 07-10-16 15:09:02, Ross Zwisler wrote:
> diff --git a/fs/dax.c b/fs/dax.c
> index ac3cd05..e51d51f 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -281,7 +281,7 @@ static wait_queue_head_t *dax_entry_waitqueue(struct address_space *mapping,
>  	 * queue to the start of that PMD.  This ensures that all offsets in
>  	 * the range covered by the PMD map to the same bit lock.
>  	 */
> -	if (RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD)
> +	if ((unsigned long)entry & RADIX_DAX_PMD)
>  		index &= ~((1UL << (PMD_SHIFT - PAGE_SHIFT)) - 1);

I agree with Christoph - helper for masking type bits would make this
nicer.

...
> -static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index)
> +static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index,
> +		unsigned long size_flag)
>  {
> +	bool pmd_downgrade = false; /* splitting 2MiB entry into 4k entries? */
>  	void *entry, **slot;
>  
>  restart:
>  	spin_lock_irq(&mapping->tree_lock);
>  	entry = get_unlocked_mapping_entry(mapping, index, &slot);
> +
> +	if (entry) {
> +		if (size_flag & RADIX_DAX_PMD) {
> +			if (!radix_tree_exceptional_entry(entry) ||
> +			    !((unsigned long)entry & RADIX_DAX_PMD)) {
> +				entry = ERR_PTR(-EEXIST);
> +				goto out_unlock;

You need to call put_unlocked_mapping_entry() if you use
get_unlocked_mapping_entry() and then decide not to lock it. The reason is
that the waitqueues we use are exclusive (we wake up only a single waiter
waiting for the lock) and so there can be some waiters for the entry lock
although we have not locked the entry ourselves.

> +			}
> +		} else { /* trying to grab a PTE entry */
> +			if (radix_tree_exceptional_entry(entry) &&
> +			    ((unsigned long)entry & RADIX_DAX_PMD) &&
> +			    ((unsigned long)entry &
> +			     (RADIX_DAX_HZP|RADIX_DAX_EMPTY))) {
> +				pmd_downgrade = true;
> +			}
> +		}
> +	}
> +
>  	/* No entry for given index? Make sure radix tree is big enough. */
> -	if (!entry) {
> +	if (!entry || pmd_downgrade) {
>  		int err;
>  
> +		if (pmd_downgrade) {
> +			/*
> +			 * Make sure 'entry' remains valid while we drop
> +			 * mapping->tree_lock.
> +			 */
> +			entry = lock_slot(mapping, slot);
> +		}
> +
>  		spin_unlock_irq(&mapping->tree_lock);
>  		err = radix_tree_preload(
>  				mapping_gfp_mask(mapping) & ~__GFP_HIGHMEM);
>  		if (err)
>  			return ERR_PTR(err);

You need to unlock the locked entry before you return here...

> -		entry = (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY |
> -			       RADIX_DAX_ENTRY_LOCK);
> +
> +		/*
> +		 * Besides huge zero pages the only other thing that gets
> +		 * downgraded are empty entries which don't need to be
> +		 * unmapped.
> +		 */
> +		if (pmd_downgrade && ((unsigned long)entry & RADIX_DAX_HZP))
> +			unmap_mapping_range(mapping,
> +				(index << PAGE_SHIFT) & PMD_MASK, PMD_SIZE, 0);
> +
>  		spin_lock_irq(&mapping->tree_lock);
> -		err = radix_tree_insert(&mapping->page_tree, index, entry);
> +
> +		if (pmd_downgrade) {
> +			radix_tree_delete(&mapping->page_tree, index);
> +			mapping->nrexceptional--;
> +			dax_wake_mapping_entry_waiter(mapping, index, entry,
> +					false);

You need to set 'wake_all' argument here to true. Otherwise there could be
waiters waiting for non-existent entry forever...

> +		}
> +
> +		entry = dax_radix_entry(0, size_flag | RADIX_DAX_EMPTY);
> +
> +		err = __radix_tree_insert(&mapping->page_tree, index,
> +				dax_radix_order(entry), entry);
>  		radix_tree_preload_end();
>  		if (err) {
>  			spin_unlock_irq(&mapping->tree_lock);
> -			/* Someone already created the entry? */
> -			if (err == -EEXIST)
> +			/*
> +			 * Someone already created the entry?  This is a
> +			 * normal failure when inserting PMDs in a range
> +			 * that already contains PTEs.  In that case we want
> +			 * to return -EEXIST immediately.
> +			 */
> +			if (err == -EEXIST && !(size_flag & RADIX_DAX_PMD))
>  				goto restart;

Add a comment here that we can get here only if there was no radix tree
entry at 'index' and thus there can be no waiters to wake.

>  			return ERR_PTR(err);
>  		}
> @@ -441,6 +509,7 @@ restart:
>  		return page;
>  	}
>  	entry = lock_slot(mapping, slot);
> + out_unlock:
>  	spin_unlock_irq(&mapping->tree_lock);
>  	return entry;
>  }
> @@ -581,11 +650,17 @@ static int copy_user_dax(struct block_device *bdev, sector_t sector, size_t size
>  	return 0;
>  }
>  
> -#define DAX_PMD_INDEX(page_index) (page_index & (PMD_MASK >> PAGE_SHIFT))
> -
> +/*
> + * By this point grab_mapping_entry() has ensured that we have a locked entry
> + * of the appropriate size so we don't have to worry about downgrading PMDs to
> + * PTEs.  If we happen to be trying to insert a PTE and there is a PMD
> + * already in the tree, we will skip the insertion and just dirty the PMD as
> + * appropriate.
> + */
>  static void *dax_insert_mapping_entry(struct address_space *mapping,
>  				      struct vm_fault *vmf,
> -				      void *entry, sector_t sector)
> +				      void *entry, sector_t sector,
> +				      unsigned long flags)
>  {
>  	struct radix_tree_root *page_tree = &mapping->page_tree;
>  	int error = 0;
> @@ -608,22 +683,28 @@ static void *dax_insert_mapping_entry(struct address_space *mapping,
>  		error = radix_tree_preload(vmf->gfp_mask & ~__GFP_HIGHMEM);
>  		if (error)
>  			return ERR_PTR(error);
> +	} else if (((unsigned long)entry & RADIX_DAX_HZP) &&
> +			!(flags & RADIX_DAX_HZP)) {
> +		/* replacing huge zero page with PMD block mapping */
> +		unmap_mapping_range(mapping,
> +			(vmf->pgoff << PAGE_SHIFT) & PMD_MASK, PMD_SIZE, 0);
>  	}
>  
>  	spin_lock_irq(&mapping->tree_lock);
> -	new_entry = (void *)((unsigned long)RADIX_DAX_ENTRY(sector, false) |
> -		       RADIX_DAX_ENTRY_LOCK);
> +	new_entry = dax_radix_entry(sector, flags);
> +

You've lost the RADIX_DAX_ENTRY_LOCK flag here?

>  	if (hole_fill) {
>  		__delete_from_page_cache(entry, NULL);
>  		/* Drop pagecache reference */
>  		put_page(entry);
> -		error = radix_tree_insert(page_tree, index, new_entry);
> +		error = __radix_tree_insert(page_tree, index,
> +				dax_radix_order(new_entry), new_entry);
>  		if (error) {
>  			new_entry = ERR_PTR(error);
>  			goto unlock;
>  		}
>  		mapping->nrexceptional++;
> -	} else {
> +	} else if ((unsigned long)entry & (RADIX_DAX_HZP|RADIX_DAX_EMPTY)) {
>  		void **slot;
>  		void *ret;

Uh, why this condition need to change? Is it some protection so that we
don't replace a mapped PMD entry with PTE one?

<snip>

> @@ -1261,4 +1338,186 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
>  	return VM_FAULT_NOPAGE | major;
>  }
>  EXPORT_SYMBOL_GPL(dax_iomap_fault);
> +
> +#ifdef CONFIG_FS_DAX_PMD
> +/*
> + * The 'colour' (ie low bits) within a PMD of a page offset.  This comes up
> + * more often than one might expect in the below functions.
> + */
> +#define PG_PMD_COLOUR	((PMD_SIZE >> PAGE_SHIFT) - 1)

Just out of curiosity: Why the british spelling of 'colour'?

> +
> +static int dax_pmd_insert_mapping(struct vm_area_struct *vma, pmd_t *pmd,
> +		struct vm_fault *vmf, unsigned long address,
> +		struct iomap *iomap, loff_t pos, bool write, void **entryp)
> +{
> +	struct address_space *mapping = vma->vm_file->f_mapping;
> +	struct block_device *bdev = iomap->bdev;
> +	struct blk_dax_ctl dax = {
> +		.sector = dax_iomap_sector(iomap, pos),
> +		.size = PMD_SIZE,
> +	};
> +	long length = dax_map_atomic(bdev, &dax);
> +	void *ret;
> +
> +	if (length < 0) /* dax_map_atomic() failed */
> +		return VM_FAULT_FALLBACK;
> +	if (length < PMD_SIZE)
> +		goto unmap_fallback;
> +	if (pfn_t_to_pfn(dax.pfn) & PG_PMD_COLOUR)
> +		goto unmap_fallback;
> +	if (!pfn_t_devmap(dax.pfn))
> +		goto unmap_fallback;
> +
> +	dax_unmap_atomic(bdev, &dax);
> +
> +	ret = dax_insert_mapping_entry(mapping, vmf, *entryp, dax.sector,
> +			RADIX_DAX_PMD);
> +	if (IS_ERR(ret))
> +		return VM_FAULT_FALLBACK;
> +	*entryp = ret;
> +
> +	return vmf_insert_pfn_pmd(vma, address, pmd, dax.pfn, write);
> +
> + unmap_fallback:
> +	dax_unmap_atomic(bdev, &dax);
> +	return VM_FAULT_FALLBACK;
> +}
> +
> +static int dax_pmd_load_hole(struct vm_area_struct *vma, pmd_t *pmd,
> +		struct vm_fault *vmf, unsigned long address,
> +		struct iomap *iomap, void **entryp)
> +{
> +	struct address_space *mapping = vma->vm_file->f_mapping;
> +	unsigned long pmd_addr = address & PMD_MASK;
> +	struct page *zero_page;
> +	spinlock_t *ptl;
> +	pmd_t pmd_entry;
> +	void *ret;
> +
> +	zero_page = get_huge_zero_page();
> +
> +	if (unlikely(!zero_page))
> +		return VM_FAULT_FALLBACK;
> +
> +	ret = dax_insert_mapping_entry(mapping, vmf, *entryp, 0,
> +			RADIX_DAX_PMD | RADIX_DAX_HZP);
> +	if (IS_ERR(ret))
> +		return VM_FAULT_FALLBACK;
> +	*entryp = ret;
> +
> +	ptl = pmd_lock(vma->vm_mm, pmd);
> +	if (!pmd_none(*pmd)) {
> +		spin_unlock(ptl);
> +		return VM_FAULT_FALLBACK;
> +	}
> +
> +	pmd_entry = mk_pmd(zero_page, vma->vm_page_prot);
> +	pmd_entry = pmd_mkhuge(pmd_entry);
> +	set_pmd_at(vma->vm_mm, pmd_addr, pmd, pmd_entry);
> +	spin_unlock(ptl);
> +	return VM_FAULT_NOPAGE;
> +}
> +
> +int dax_iomap_pmd_fault(struct vm_area_struct *vma, unsigned long address,
> +		pmd_t *pmd, unsigned int flags, struct iomap_ops *ops)
> +{
> +	struct address_space *mapping = vma->vm_file->f_mapping;
> +	unsigned long pmd_addr = address & PMD_MASK;
> +	bool write = flags & FAULT_FLAG_WRITE;
> +	unsigned int iomap_flags = write ? IOMAP_WRITE : 0;
> +	struct inode *inode = mapping->host;
> +	int result = VM_FAULT_FALLBACK;
> +	struct iomap iomap = { 0 };

Why the 0 here? Just empty braces are enough to initialize the structure to
zeros.

> +	pgoff_t size, pgoff;
> +	struct vm_fault vmf;
> +	void *entry;
> +	loff_t pos;
> +	int error;
> +
> +	/* Fall back to PTEs if we're going to COW */
> +	if (write && !(vma->vm_flags & VM_SHARED)) {
> +		split_huge_pmd(vma, pmd, address);
> +		goto fallback;
> +	}
> +
> +	/* If the PMD would extend outside the VMA */
> +	if (pmd_addr < vma->vm_start)
> +		goto fallback;
> +	if ((pmd_addr + PMD_SIZE) > vma->vm_end)
> +		goto fallback;
> +
> +	/*
> +	 * Check whether offset isn't beyond end of file now. Caller is
> +	 * supposed to hold locks serializing us with truncate / punch hole so
> +	 * this is a reliable test.
> +	 */
> +	pgoff = linear_page_index(vma, pmd_addr);
> +	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;

Nitpick - 'size' does not express that this is in pages and rounded up.
Maybe we could have:

	max_pgoff = (i_size_read(inode) - 1) >> PAGE_SHIFT;

and then use strict inequalities below?


								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 182+ messages in thread

* Re: [PATCH v5 16/17] xfs: use struct iomap based DAX PMD fault path
  2016-10-07 21:09   ` Ross Zwisler
  (?)
  (?)
@ 2016-10-11  8:34     ` Jan Kara
  -1 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2016-10-11  8:34 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Theodore Ts'o, Matthew Wilcox, Dave Chinner, linux-nvdimm,
	linux-kernel, linux-xfs, linux-mm, Andreas Dilger,
	Alexander Viro, Jan Kara, linux-fsdevel, Andrew Morton,
	linux-ext4, Christoph Hellwig

On Fri 07-10-16 15:09:03, Ross Zwisler wrote:
> Switch xfs_filemap_pmd_fault() from using dax_pmd_fault() to the new and
> improved dax_iomap_pmd_fault().  Also, now that it has no more users,
> remove xfs_get_blocks_dax_fault().
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>

The patch looks good to me. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza
> ---
>  fs/xfs/xfs_aops.c | 26 +++++---------------------
>  fs/xfs/xfs_aops.h |  3 ---
>  fs/xfs/xfs_file.c |  2 +-
>  3 files changed, 6 insertions(+), 25 deletions(-)
> 
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index 0e2a931..1c73d0a 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -1298,8 +1298,7 @@ __xfs_get_blocks(
>  	sector_t		iblock,
>  	struct buffer_head	*bh_result,
>  	int			create,
> -	bool			direct,
> -	bool			dax_fault)
> +	bool			direct)
>  {
>  	struct xfs_inode	*ip = XFS_I(inode);
>  	struct xfs_mount	*mp = ip->i_mount;
> @@ -1420,13 +1419,8 @@ __xfs_get_blocks(
>  		if (ISUNWRITTEN(&imap))
>  			set_buffer_unwritten(bh_result);
>  		/* direct IO needs special help */
> -		if (create) {
> -			if (dax_fault)
> -				ASSERT(!ISUNWRITTEN(&imap));
> -			else
> -				xfs_map_direct(inode, bh_result, &imap, offset,
> -						is_cow);
> -		}
> +		if (create)
> +			xfs_map_direct(inode, bh_result, &imap, offset, is_cow);
>  	}
>  
>  	/*
> @@ -1466,7 +1460,7 @@ xfs_get_blocks(
>  	struct buffer_head	*bh_result,
>  	int			create)
>  {
> -	return __xfs_get_blocks(inode, iblock, bh_result, create, false, false);
> +	return __xfs_get_blocks(inode, iblock, bh_result, create, false);
>  }
>  
>  int
> @@ -1476,17 +1470,7 @@ xfs_get_blocks_direct(
>  	struct buffer_head	*bh_result,
>  	int			create)
>  {
> -	return __xfs_get_blocks(inode, iblock, bh_result, create, true, false);
> -}
> -
> -int
> -xfs_get_blocks_dax_fault(
> -	struct inode		*inode,
> -	sector_t		iblock,
> -	struct buffer_head	*bh_result,
> -	int			create)
> -{
> -	return __xfs_get_blocks(inode, iblock, bh_result, create, true, true);
> +	return __xfs_get_blocks(inode, iblock, bh_result, create, true);
>  }
>  
>  /*
> diff --git a/fs/xfs/xfs_aops.h b/fs/xfs/xfs_aops.h
> index b3c6634..34dc00d 100644
> --- a/fs/xfs/xfs_aops.h
> +++ b/fs/xfs/xfs_aops.h
> @@ -59,9 +59,6 @@ int	xfs_get_blocks(struct inode *inode, sector_t offset,
>  		       struct buffer_head *map_bh, int create);
>  int	xfs_get_blocks_direct(struct inode *inode, sector_t offset,
>  			      struct buffer_head *map_bh, int create);
> -int	xfs_get_blocks_dax_fault(struct inode *inode, sector_t offset,
> -			         struct buffer_head *map_bh, int create);
> -
>  int	xfs_end_io_direct_write(struct kiocb *iocb, loff_t offset,
>  		ssize_t size, void *private);
>  int	xfs_setfilesize(struct xfs_inode *ip, xfs_off_t offset, size_t size);
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 8f12152..7b13dda 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -1750,7 +1750,7 @@ xfs_filemap_pmd_fault(
>  	}
>  
>  	xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
> -	ret = dax_pmd_fault(vma, addr, pmd, flags, xfs_get_blocks_dax_fault);
> +	ret = dax_iomap_pmd_fault(vma, addr, pmd, flags, &xfs_iomap_ops);
>  	xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
>  
>  	if (flags & FAULT_FLAG_WRITE)
> -- 
> 2.7.4
> 
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 182+ messages in thread

* Re: [PATCH v5 16/17] xfs: use struct iomap based DAX PMD fault path
@ 2016-10-11  8:34     ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2016-10-11  8:34 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

On Fri 07-10-16 15:09:03, Ross Zwisler wrote:
> Switch xfs_filemap_pmd_fault() from using dax_pmd_fault() to the new and
> improved dax_iomap_pmd_fault().  Also, now that it has no more users,
> remove xfs_get_blocks_dax_fault().
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>

The patch looks good to me. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza
> ---
>  fs/xfs/xfs_aops.c | 26 +++++---------------------
>  fs/xfs/xfs_aops.h |  3 ---
>  fs/xfs/xfs_file.c |  2 +-
>  3 files changed, 6 insertions(+), 25 deletions(-)
> 
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index 0e2a931..1c73d0a 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -1298,8 +1298,7 @@ __xfs_get_blocks(
>  	sector_t		iblock,
>  	struct buffer_head	*bh_result,
>  	int			create,
> -	bool			direct,
> -	bool			dax_fault)
> +	bool			direct)
>  {
>  	struct xfs_inode	*ip = XFS_I(inode);
>  	struct xfs_mount	*mp = ip->i_mount;
> @@ -1420,13 +1419,8 @@ __xfs_get_blocks(
>  		if (ISUNWRITTEN(&imap))
>  			set_buffer_unwritten(bh_result);
>  		/* direct IO needs special help */
> -		if (create) {
> -			if (dax_fault)
> -				ASSERT(!ISUNWRITTEN(&imap));
> -			else
> -				xfs_map_direct(inode, bh_result, &imap, offset,
> -						is_cow);
> -		}
> +		if (create)
> +			xfs_map_direct(inode, bh_result, &imap, offset, is_cow);
>  	}
>  
>  	/*
> @@ -1466,7 +1460,7 @@ xfs_get_blocks(
>  	struct buffer_head	*bh_result,
>  	int			create)
>  {
> -	return __xfs_get_blocks(inode, iblock, bh_result, create, false, false);
> +	return __xfs_get_blocks(inode, iblock, bh_result, create, false);
>  }
>  
>  int
> @@ -1476,17 +1470,7 @@ xfs_get_blocks_direct(
>  	struct buffer_head	*bh_result,
>  	int			create)
>  {
> -	return __xfs_get_blocks(inode, iblock, bh_result, create, true, false);
> -}
> -
> -int
> -xfs_get_blocks_dax_fault(
> -	struct inode		*inode,
> -	sector_t		iblock,
> -	struct buffer_head	*bh_result,
> -	int			create)
> -{
> -	return __xfs_get_blocks(inode, iblock, bh_result, create, true, true);
> +	return __xfs_get_blocks(inode, iblock, bh_result, create, true);
>  }
>  
>  /*
> diff --git a/fs/xfs/xfs_aops.h b/fs/xfs/xfs_aops.h
> index b3c6634..34dc00d 100644
> --- a/fs/xfs/xfs_aops.h
> +++ b/fs/xfs/xfs_aops.h
> @@ -59,9 +59,6 @@ int	xfs_get_blocks(struct inode *inode, sector_t offset,
>  		       struct buffer_head *map_bh, int create);
>  int	xfs_get_blocks_direct(struct inode *inode, sector_t offset,
>  			      struct buffer_head *map_bh, int create);
> -int	xfs_get_blocks_dax_fault(struct inode *inode, sector_t offset,
> -			         struct buffer_head *map_bh, int create);
> -
>  int	xfs_end_io_direct_write(struct kiocb *iocb, loff_t offset,
>  		ssize_t size, void *private);
>  int	xfs_setfilesize(struct xfs_inode *ip, xfs_off_t offset, size_t size);
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 8f12152..7b13dda 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -1750,7 +1750,7 @@ xfs_filemap_pmd_fault(
>  	}
>  
>  	xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
> -	ret = dax_pmd_fault(vma, addr, pmd, flags, xfs_get_blocks_dax_fault);
> +	ret = dax_iomap_pmd_fault(vma, addr, pmd, flags, &xfs_iomap_ops);
>  	xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
>  
>  	if (flags & FAULT_FLAG_WRITE)
> -- 
> 2.7.4
> 
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 182+ messages in thread

* Re: [PATCH v5 16/17] xfs: use struct iomap based DAX PMD fault path
@ 2016-10-11  8:34     ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2016-10-11  8:34 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

On Fri 07-10-16 15:09:03, Ross Zwisler wrote:
> Switch xfs_filemap_pmd_fault() from using dax_pmd_fault() to the new and
> improved dax_iomap_pmd_fault().  Also, now that it has no more users,
> remove xfs_get_blocks_dax_fault().
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>

The patch looks good to me. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza
> ---
>  fs/xfs/xfs_aops.c | 26 +++++---------------------
>  fs/xfs/xfs_aops.h |  3 ---
>  fs/xfs/xfs_file.c |  2 +-
>  3 files changed, 6 insertions(+), 25 deletions(-)
> 
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index 0e2a931..1c73d0a 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -1298,8 +1298,7 @@ __xfs_get_blocks(
>  	sector_t		iblock,
>  	struct buffer_head	*bh_result,
>  	int			create,
> -	bool			direct,
> -	bool			dax_fault)
> +	bool			direct)
>  {
>  	struct xfs_inode	*ip = XFS_I(inode);
>  	struct xfs_mount	*mp = ip->i_mount;
> @@ -1420,13 +1419,8 @@ __xfs_get_blocks(
>  		if (ISUNWRITTEN(&imap))
>  			set_buffer_unwritten(bh_result);
>  		/* direct IO needs special help */
> -		if (create) {
> -			if (dax_fault)
> -				ASSERT(!ISUNWRITTEN(&imap));
> -			else
> -				xfs_map_direct(inode, bh_result, &imap, offset,
> -						is_cow);
> -		}
> +		if (create)
> +			xfs_map_direct(inode, bh_result, &imap, offset, is_cow);
>  	}
>  
>  	/*
> @@ -1466,7 +1460,7 @@ xfs_get_blocks(
>  	struct buffer_head	*bh_result,
>  	int			create)
>  {
> -	return __xfs_get_blocks(inode, iblock, bh_result, create, false, false);
> +	return __xfs_get_blocks(inode, iblock, bh_result, create, false);
>  }
>  
>  int
> @@ -1476,17 +1470,7 @@ xfs_get_blocks_direct(
>  	struct buffer_head	*bh_result,
>  	int			create)
>  {
> -	return __xfs_get_blocks(inode, iblock, bh_result, create, true, false);
> -}
> -
> -int
> -xfs_get_blocks_dax_fault(
> -	struct inode		*inode,
> -	sector_t		iblock,
> -	struct buffer_head	*bh_result,
> -	int			create)
> -{
> -	return __xfs_get_blocks(inode, iblock, bh_result, create, true, true);
> +	return __xfs_get_blocks(inode, iblock, bh_result, create, true);
>  }
>  
>  /*
> diff --git a/fs/xfs/xfs_aops.h b/fs/xfs/xfs_aops.h
> index b3c6634..34dc00d 100644
> --- a/fs/xfs/xfs_aops.h
> +++ b/fs/xfs/xfs_aops.h
> @@ -59,9 +59,6 @@ int	xfs_get_blocks(struct inode *inode, sector_t offset,
>  		       struct buffer_head *map_bh, int create);
>  int	xfs_get_blocks_direct(struct inode *inode, sector_t offset,
>  			      struct buffer_head *map_bh, int create);
> -int	xfs_get_blocks_dax_fault(struct inode *inode, sector_t offset,
> -			         struct buffer_head *map_bh, int create);
> -
>  int	xfs_end_io_direct_write(struct kiocb *iocb, loff_t offset,
>  		ssize_t size, void *private);
>  int	xfs_setfilesize(struct xfs_inode *ip, xfs_off_t offset, size_t size);
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 8f12152..7b13dda 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -1750,7 +1750,7 @@ xfs_filemap_pmd_fault(
>  	}
>  
>  	xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
> -	ret = dax_pmd_fault(vma, addr, pmd, flags, xfs_get_blocks_dax_fault);
> +	ret = dax_iomap_pmd_fault(vma, addr, pmd, flags, &xfs_iomap_ops);
>  	xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
>  
>  	if (flags & FAULT_FLAG_WRITE)
> -- 
> 2.7.4
> 
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 182+ messages in thread

* Re: [PATCH v5 16/17] xfs: use struct iomap based DAX PMD fault path
@ 2016-10-11  8:34     ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2016-10-11  8:34 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

On Fri 07-10-16 15:09:03, Ross Zwisler wrote:
> Switch xfs_filemap_pmd_fault() from using dax_pmd_fault() to the new and
> improved dax_iomap_pmd_fault().  Also, now that it has no more users,
> remove xfs_get_blocks_dax_fault().
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>

The patch looks good to me. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza
> ---
>  fs/xfs/xfs_aops.c | 26 +++++---------------------
>  fs/xfs/xfs_aops.h |  3 ---
>  fs/xfs/xfs_file.c |  2 +-
>  3 files changed, 6 insertions(+), 25 deletions(-)
> 
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index 0e2a931..1c73d0a 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -1298,8 +1298,7 @@ __xfs_get_blocks(
>  	sector_t		iblock,
>  	struct buffer_head	*bh_result,
>  	int			create,
> -	bool			direct,
> -	bool			dax_fault)
> +	bool			direct)
>  {
>  	struct xfs_inode	*ip = XFS_I(inode);
>  	struct xfs_mount	*mp = ip->i_mount;
> @@ -1420,13 +1419,8 @@ __xfs_get_blocks(
>  		if (ISUNWRITTEN(&imap))
>  			set_buffer_unwritten(bh_result);
>  		/* direct IO needs special help */
> -		if (create) {
> -			if (dax_fault)
> -				ASSERT(!ISUNWRITTEN(&imap));
> -			else
> -				xfs_map_direct(inode, bh_result, &imap, offset,
> -						is_cow);
> -		}
> +		if (create)
> +			xfs_map_direct(inode, bh_result, &imap, offset, is_cow);
>  	}
>  
>  	/*
> @@ -1466,7 +1460,7 @@ xfs_get_blocks(
>  	struct buffer_head	*bh_result,
>  	int			create)
>  {
> -	return __xfs_get_blocks(inode, iblock, bh_result, create, false, false);
> +	return __xfs_get_blocks(inode, iblock, bh_result, create, false);
>  }
>  
>  int
> @@ -1476,17 +1470,7 @@ xfs_get_blocks_direct(
>  	struct buffer_head	*bh_result,
>  	int			create)
>  {
> -	return __xfs_get_blocks(inode, iblock, bh_result, create, true, false);
> -}
> -
> -int
> -xfs_get_blocks_dax_fault(
> -	struct inode		*inode,
> -	sector_t		iblock,
> -	struct buffer_head	*bh_result,
> -	int			create)
> -{
> -	return __xfs_get_blocks(inode, iblock, bh_result, create, true, true);
> +	return __xfs_get_blocks(inode, iblock, bh_result, create, true);
>  }
>  
>  /*
> diff --git a/fs/xfs/xfs_aops.h b/fs/xfs/xfs_aops.h
> index b3c6634..34dc00d 100644
> --- a/fs/xfs/xfs_aops.h
> +++ b/fs/xfs/xfs_aops.h
> @@ -59,9 +59,6 @@ int	xfs_get_blocks(struct inode *inode, sector_t offset,
>  		       struct buffer_head *map_bh, int create);
>  int	xfs_get_blocks_direct(struct inode *inode, sector_t offset,
>  			      struct buffer_head *map_bh, int create);
> -int	xfs_get_blocks_dax_fault(struct inode *inode, sector_t offset,
> -			         struct buffer_head *map_bh, int create);
> -
>  int	xfs_end_io_direct_write(struct kiocb *iocb, loff_t offset,
>  		ssize_t size, void *private);
>  int	xfs_setfilesize(struct xfs_inode *ip, xfs_off_t offset, size_t size);
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 8f12152..7b13dda 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -1750,7 +1750,7 @@ xfs_filemap_pmd_fault(
>  	}
>  
>  	xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
> -	ret = dax_pmd_fault(vma, addr, pmd, flags, xfs_get_blocks_dax_fault);
> +	ret = dax_iomap_pmd_fault(vma, addr, pmd, flags, &xfs_iomap_ops);
>  	xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
>  
>  	if (flags & FAULT_FLAG_WRITE)
> -- 
> 2.7.4
> 
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 182+ messages in thread

* Re: [PATCH v5 09/17] dax: coordinate locking for offsets in PMD range
  2016-10-11  7:04     ` Jan Kara
  (?)
@ 2016-10-11 21:18       ` Ross Zwisler
  -1 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-11 21:18 UTC (permalink / raw)
  To: Jan Kara
  Cc: Ross Zwisler, linux-kernel, Theodore Ts'o, Alexander Viro,
	Andreas Dilger, Andrew Morton, Christoph Hellwig, Dan Williams,
	Dave Chinner, Jan Kara, Matthew Wilcox, linux-ext4,
	linux-fsdevel, linux-mm, linux-nvdimm, linux-xfs

On Tue, Oct 11, 2016 at 09:04:09AM +0200, Jan Kara wrote:
> On Fri 07-10-16 15:08:56, Ross Zwisler wrote:
> > DAX radix tree locking currently locks entries based on the unique
> > combination of the 'mapping' pointer and the pgoff_t 'index' for the entry.
> > This works for PTEs, but as we move to PMDs we will need to have all the
> > offsets within the range covered by the PMD to map to the same bit lock.
> > To accomplish this, for ranges covered by a PMD entry we will instead lock
> > based on the page offset of the beginning of the PMD entry.  The 'mapping'
> > pointer is still used in the same way.
> > 
> > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> 
> The patch looks good to me. You can add:
> 
> Reviewed-by: Jan Kara <jack@suse.cz>
> 
> Just one thing which IMO deserves a comment below:
> 
> > @@ -448,9 +460,12 @@ restart:
> >  }
> >  
> >  void dax_wake_mapping_entry_waiter(struct address_space *mapping,
> > -				   pgoff_t index, bool wake_all)
> > +		pgoff_t index, void *entry, bool wake_all)
> >  {
> > -	wait_queue_head_t *wq = dax_entry_waitqueue(mapping, index);
> > +	struct exceptional_entry_key key;
> > +	wait_queue_head_t *wq;
> > +
> > +	wq = dax_entry_waitqueue(mapping, index, entry, &key);
> 
> So I believe we should comment above this function that the 'entry' it gets
> may be invalid by the time it gets it (we call it without tree_lock held so
> the passed entry may be changed in the radix tree as we work) but we use it
> only to find appropriate waitqueue where tasks sleep waiting for that old
> entry to unlock so we indeed wake up all tasks we need.

Added, thanks for the suggestion.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 182+ messages in thread

* Re: [PATCH v5 09/17] dax: coordinate locking for offsets in PMD range
@ 2016-10-11 21:18       ` Ross Zwisler
  0 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-11 21:18 UTC (permalink / raw)
  To: Jan Kara
  Cc: Ross Zwisler, linux-kernel, Theodore Ts'o, Alexander Viro,
	Andreas Dilger, Andrew Morton, Christoph Hellwig, Dan Williams,
	Dave Chinner, Jan Kara, Matthew Wilcox, linux-ext4,
	linux-fsdevel, linux-mm, linux-nvdimm, linux-xfs

On Tue, Oct 11, 2016 at 09:04:09AM +0200, Jan Kara wrote:
> On Fri 07-10-16 15:08:56, Ross Zwisler wrote:
> > DAX radix tree locking currently locks entries based on the unique
> > combination of the 'mapping' pointer and the pgoff_t 'index' for the entry.
> > This works for PTEs, but as we move to PMDs we will need to have all the
> > offsets within the range covered by the PMD to map to the same bit lock.
> > To accomplish this, for ranges covered by a PMD entry we will instead lock
> > based on the page offset of the beginning of the PMD entry.  The 'mapping'
> > pointer is still used in the same way.
> > 
> > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> 
> The patch looks good to me. You can add:
> 
> Reviewed-by: Jan Kara <jack@suse.cz>
> 
> Just one thing which IMO deserves a comment below:
> 
> > @@ -448,9 +460,12 @@ restart:
> >  }
> >  
> >  void dax_wake_mapping_entry_waiter(struct address_space *mapping,
> > -				   pgoff_t index, bool wake_all)
> > +		pgoff_t index, void *entry, bool wake_all)
> >  {
> > -	wait_queue_head_t *wq = dax_entry_waitqueue(mapping, index);
> > +	struct exceptional_entry_key key;
> > +	wait_queue_head_t *wq;
> > +
> > +	wq = dax_entry_waitqueue(mapping, index, entry, &key);
> 
> So I believe we should comment above this function that the 'entry' it gets
> may be invalid by the time it gets it (we call it without tree_lock held so
> the passed entry may be changed in the radix tree as we work) but we use it
> only to find appropriate waitqueue where tasks sleep waiting for that old
> entry to unlock so we indeed wake up all tasks we need.

Added, thanks for the suggestion.

^ permalink raw reply	[flat|nested] 182+ messages in thread

* Re: [PATCH v5 09/17] dax: coordinate locking for offsets in PMD range
@ 2016-10-11 21:18       ` Ross Zwisler
  0 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-11 21:18 UTC (permalink / raw)
  To: Jan Kara
  Cc: Ross Zwisler, linux-kernel, Theodore Ts'o, Alexander Viro,
	Andreas Dilger, Andrew Morton, Christoph Hellwig, Dan Williams,
	Dave Chinner, Jan Kara, Matthew Wilcox, linux-ext4,
	linux-fsdevel, linux-mm, linux-nvdimm, linux-xfs

On Tue, Oct 11, 2016 at 09:04:09AM +0200, Jan Kara wrote:
> On Fri 07-10-16 15:08:56, Ross Zwisler wrote:
> > DAX radix tree locking currently locks entries based on the unique
> > combination of the 'mapping' pointer and the pgoff_t 'index' for the entry.
> > This works for PTEs, but as we move to PMDs we will need to have all the
> > offsets within the range covered by the PMD to map to the same bit lock.
> > To accomplish this, for ranges covered by a PMD entry we will instead lock
> > based on the page offset of the beginning of the PMD entry.  The 'mapping'
> > pointer is still used in the same way.
> > 
> > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> 
> The patch looks good to me. You can add:
> 
> Reviewed-by: Jan Kara <jack@suse.cz>
> 
> Just one thing which IMO deserves a comment below:
> 
> > @@ -448,9 +460,12 @@ restart:
> >  }
> >  
> >  void dax_wake_mapping_entry_waiter(struct address_space *mapping,
> > -				   pgoff_t index, bool wake_all)
> > +		pgoff_t index, void *entry, bool wake_all)
> >  {
> > -	wait_queue_head_t *wq = dax_entry_waitqueue(mapping, index);
> > +	struct exceptional_entry_key key;
> > +	wait_queue_head_t *wq;
> > +
> > +	wq = dax_entry_waitqueue(mapping, index, entry, &key);
> 
> So I believe we should comment above this function that the 'entry' it gets
> may be invalid by the time it gets it (we call it without tree_lock held so
> the passed entry may be changed in the radix tree as we work) but we use it
> only to find appropriate waitqueue where tasks sleep waiting for that old
> entry to unlock so we indeed wake up all tasks we need.

Added, thanks for the suggestion.

^ permalink raw reply	[flat|nested] 182+ messages in thread

* Re: [PATCH v5 15/17] dax: add struct iomap based DAX PMD support
  2016-10-10 15:59     ` Christoph Hellwig
                         ` (2 preceding siblings ...)
  (?)
@ 2016-10-11 21:48       ` Ross Zwisler
  -1 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-11 21:48 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Theodore Ts'o, Matthew Wilcox, Dave Chinner, linux-nvdimm,
	linux-kernel, linux-xfs, linux-mm, Andreas Dilger,
	Alexander Viro, Jan Kara, linux-fsdevel, linux-ext4,
	Andrew Morton

On Mon, Oct 10, 2016 at 05:59:17PM +0200, Christoph Hellwig wrote:
> On Fri, Oct 07, 2016 at 03:09:02PM -0600, Ross Zwisler wrote:
> > -	if (RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD)
> > +	if ((unsigned long)entry & RADIX_DAX_PMD)
> 
> Please introduce a proper inline helper that mask all the possible type
> bits out of the radix tree entry, and use them wherever you do the
> open cast.

Yea, this is messy.  I tried having temporary flags, but that's basically
where we came from with the old 'type' thing which used to be
RADIX_DAX_PTE|RADIX_DAX_PMD.  

After playing with it a bit it seems like the cleanest way is to have a little
flag test helper, like this:

static int dax_flag_test(void *entry, int flags)
{
	return (unsigned long)entry & flags;
}

...

		/*
		 * Besides huge zero pages the only other thing that gets
		 * downgraded are empty entries which don't need to be
		 * unmapped.
		 */
		if (pmd_downgrade && dax_flag_test(entry, RADIX_DAX_HZP))
			unmap_mapping_range(mapping,
				(index << PAGE_SHIFT) & PMD_MASK, PMD_SIZE, 0);

etc.   Please let me know if this is undesirable for some reason.  Vs keeping
the flags in a local variable, this is good because a) it doesn't require
callers to cast, and b) it makes operator precedence easy because the flags
are a param, so no "flags & (flag1|flag2)" nesting, and c) we don't keep a
local variable that could get out of sync with 'entry'.

> >  restart:
> >  	spin_lock_irq(&mapping->tree_lock);
> >  	entry = get_unlocked_mapping_entry(mapping, index, &slot);
> > +
> > +	if (entry) {
> > +		if (size_flag & RADIX_DAX_PMD) {
> > +			if (!radix_tree_exceptional_entry(entry) ||
> > +			    !((unsigned long)entry & RADIX_DAX_PMD)) {
> > +				entry = ERR_PTR(-EEXIST);
> > +				goto out_unlock;
> > +			}
> > +		} else { /* trying to grab a PTE entry */
> > +			if (radix_tree_exceptional_entry(entry) &&
> > +			    ((unsigned long)entry & RADIX_DAX_PMD) &&
> > +			    ((unsigned long)entry &
> > +			     (RADIX_DAX_HZP|RADIX_DAX_EMPTY))) {
> 
> And when we do these cases N times next to each other we should
> have a local variable the valid flag bits of entry.
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 182+ messages in thread

* Re: [PATCH v5 15/17] dax: add struct iomap based DAX PMD support
@ 2016-10-11 21:48       ` Ross Zwisler
  0 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-11 21:48 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Ross Zwisler, linux-kernel, Theodore Ts'o, Alexander Viro,
	Andreas Dilger, Andrew Morton, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

On Mon, Oct 10, 2016 at 05:59:17PM +0200, Christoph Hellwig wrote:
> On Fri, Oct 07, 2016 at 03:09:02PM -0600, Ross Zwisler wrote:
> > -	if (RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD)
> > +	if ((unsigned long)entry & RADIX_DAX_PMD)
> 
> Please introduce a proper inline helper that mask all the possible type
> bits out of the radix tree entry, and use them wherever you do the
> open cast.

Yea, this is messy.  I tried having temporary flags, but that's basically
where we came from with the old 'type' thing which used to be
RADIX_DAX_PTE|RADIX_DAX_PMD.  

After playing with it a bit it seems like the cleanest way is to have a little
flag test helper, like this:

static int dax_flag_test(void *entry, int flags)
{
	return (unsigned long)entry & flags;
}

...

		/*
		 * Besides huge zero pages the only other thing that gets
		 * downgraded are empty entries which don't need to be
		 * unmapped.
		 */
		if (pmd_downgrade && dax_flag_test(entry, RADIX_DAX_HZP))
			unmap_mapping_range(mapping,
				(index << PAGE_SHIFT) & PMD_MASK, PMD_SIZE, 0);

etc.   Please let me know if this is undesirable for some reason.  Vs keeping
the flags in a local variable, this is good because a) it doesn't require
callers to cast, and b) it makes operator precedence easy because the flags
are a param, so no "flags & (flag1|flag2)" nesting, and c) we don't keep a
local variable that could get out of sync with 'entry'.

> >  restart:
> >  	spin_lock_irq(&mapping->tree_lock);
> >  	entry = get_unlocked_mapping_entry(mapping, index, &slot);
> > +
> > +	if (entry) {
> > +		if (size_flag & RADIX_DAX_PMD) {
> > +			if (!radix_tree_exceptional_entry(entry) ||
> > +			    !((unsigned long)entry & RADIX_DAX_PMD)) {
> > +				entry = ERR_PTR(-EEXIST);
> > +				goto out_unlock;
> > +			}
> > +		} else { /* trying to grab a PTE entry */
> > +			if (radix_tree_exceptional_entry(entry) &&
> > +			    ((unsigned long)entry & RADIX_DAX_PMD) &&
> > +			    ((unsigned long)entry &
> > +			     (RADIX_DAX_HZP|RADIX_DAX_EMPTY))) {
> 
> And when we do these cases N times next to each other we should
> have a local variable the valid flag bits of entry.

^ permalink raw reply	[flat|nested] 182+ messages in thread

* Re: [PATCH v5 15/17] dax: add struct iomap based DAX PMD support
@ 2016-10-11 21:48       ` Ross Zwisler
  0 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-11 21:48 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Ross Zwisler, linux-kernel, Theodore Ts'o, Alexander Viro,
	Andreas Dilger, Andrew Morton, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

On Mon, Oct 10, 2016 at 05:59:17PM +0200, Christoph Hellwig wrote:
> On Fri, Oct 07, 2016 at 03:09:02PM -0600, Ross Zwisler wrote:
> > -	if (RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD)
> > +	if ((unsigned long)entry & RADIX_DAX_PMD)
> 
> Please introduce a proper inline helper that mask all the possible type
> bits out of the radix tree entry, and use them wherever you do the
> open cast.

Yea, this is messy.  I tried having temporary flags, but that's basically
where we came from with the old 'type' thing which used to be
RADIX_DAX_PTE|RADIX_DAX_PMD.  

After playing with it a bit it seems like the cleanest way is to have a little
flag test helper, like this:

static int dax_flag_test(void *entry, int flags)
{
	return (unsigned long)entry & flags;
}

...

		/*
		 * Besides huge zero pages the only other thing that gets
		 * downgraded are empty entries which don't need to be
		 * unmapped.
		 */
		if (pmd_downgrade && dax_flag_test(entry, RADIX_DAX_HZP))
			unmap_mapping_range(mapping,
				(index << PAGE_SHIFT) & PMD_MASK, PMD_SIZE, 0);

etc.   Please let me know if this is undesirable for some reason.  Vs keeping
the flags in a local variable, this is good because a) it doesn't require
callers to cast, and b) it makes operator precedence easy because the flags
are a param, so no "flags & (flag1|flag2)" nesting, and c) we don't keep a
local variable that could get out of sync with 'entry'.

> >  restart:
> >  	spin_lock_irq(&mapping->tree_lock);
> >  	entry = get_unlocked_mapping_entry(mapping, index, &slot);
> > +
> > +	if (entry) {
> > +		if (size_flag & RADIX_DAX_PMD) {
> > +			if (!radix_tree_exceptional_entry(entry) ||
> > +			    !((unsigned long)entry & RADIX_DAX_PMD)) {
> > +				entry = ERR_PTR(-EEXIST);
> > +				goto out_unlock;
> > +			}
> > +		} else { /* trying to grab a PTE entry */
> > +			if (radix_tree_exceptional_entry(entry) &&
> > +			    ((unsigned long)entry & RADIX_DAX_PMD) &&
> > +			    ((unsigned long)entry &
> > +			     (RADIX_DAX_HZP|RADIX_DAX_EMPTY))) {
> 
> And when we do these cases N times next to each other we should
> have a local variable the valid flag bits of entry.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 182+ messages in thread

* Re: [PATCH v5 15/17] dax: add struct iomap based DAX PMD support
@ 2016-10-11 21:48       ` Ross Zwisler
  0 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-11 21:48 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Theodore Ts'o, Matthew Wilcox, Dave Chinner,
	linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-xfs-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Andreas Dilger, Alexander Viro,
	Jan Kara, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-ext4-u79uwXL29TY76Z2rM5mHXA, Andrew Morton

On Mon, Oct 10, 2016 at 05:59:17PM +0200, Christoph Hellwig wrote:
> On Fri, Oct 07, 2016 at 03:09:02PM -0600, Ross Zwisler wrote:
> > -	if (RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD)
> > +	if ((unsigned long)entry & RADIX_DAX_PMD)
> 
> Please introduce a proper inline helper that mask all the possible type
> bits out of the radix tree entry, and use them wherever you do the
> open cast.

Yea, this is messy.  I tried having temporary flags, but that's basically
where we came from with the old 'type' thing which used to be
RADIX_DAX_PTE|RADIX_DAX_PMD.  

After playing with it a bit it seems like the cleanest way is to have a little
flag test helper, like this:

static int dax_flag_test(void *entry, int flags)
{
	return (unsigned long)entry & flags;
}

...

		/*
		 * Besides huge zero pages the only other thing that gets
		 * downgraded are empty entries which don't need to be
		 * unmapped.
		 */
		if (pmd_downgrade && dax_flag_test(entry, RADIX_DAX_HZP))
			unmap_mapping_range(mapping,
				(index << PAGE_SHIFT) & PMD_MASK, PMD_SIZE, 0);

etc.   Please let me know if this is undesirable for some reason.  Vs keeping
the flags in a local variable, this is good because a) it doesn't require
callers to cast, and b) it makes operator precedence easy because the flags
are a param, so no "flags & (flag1|flag2)" nesting, and c) we don't keep a
local variable that could get out of sync with 'entry'.

> >  restart:
> >  	spin_lock_irq(&mapping->tree_lock);
> >  	entry = get_unlocked_mapping_entry(mapping, index, &slot);
> > +
> > +	if (entry) {
> > +		if (size_flag & RADIX_DAX_PMD) {
> > +			if (!radix_tree_exceptional_entry(entry) ||
> > +			    !((unsigned long)entry & RADIX_DAX_PMD)) {
> > +				entry = ERR_PTR(-EEXIST);
> > +				goto out_unlock;
> > +			}
> > +		} else { /* trying to grab a PTE entry */
> > +			if (radix_tree_exceptional_entry(entry) &&
> > +			    ((unsigned long)entry & RADIX_DAX_PMD) &&
> > +			    ((unsigned long)entry &
> > +			     (RADIX_DAX_HZP|RADIX_DAX_EMPTY))) {
> 
> And when we do these cases N times next to each other we should
> have a local variable the valid flag bits of entry.

^ permalink raw reply	[flat|nested] 182+ messages in thread

* Re: [PATCH v5 15/17] dax: add struct iomap based DAX PMD support
@ 2016-10-11 21:48       ` Ross Zwisler
  0 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-11 21:48 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Ross Zwisler, linux-kernel, Theodore Ts'o, Alexander Viro,
	Andreas Dilger, Andrew Morton, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

On Mon, Oct 10, 2016 at 05:59:17PM +0200, Christoph Hellwig wrote:
> On Fri, Oct 07, 2016 at 03:09:02PM -0600, Ross Zwisler wrote:
> > -	if (RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD)
> > +	if ((unsigned long)entry & RADIX_DAX_PMD)
> 
> Please introduce a proper inline helper that mask all the possible type
> bits out of the radix tree entry, and use them wherever you do the
> open cast.

Yea, this is messy.  I tried having temporary flags, but that's basically
where we came from with the old 'type' thing which used to be
RADIX_DAX_PTE|RADIX_DAX_PMD.  

After playing with it a bit it seems like the cleanest way is to have a little
flag test helper, like this:

static int dax_flag_test(void *entry, int flags)
{
	return (unsigned long)entry & flags;
}

...

		/*
		 * Besides huge zero pages the only other thing that gets
		 * downgraded are empty entries which don't need to be
		 * unmapped.
		 */
		if (pmd_downgrade && dax_flag_test(entry, RADIX_DAX_HZP))
			unmap_mapping_range(mapping,
				(index << PAGE_SHIFT) & PMD_MASK, PMD_SIZE, 0);

etc.   Please let me know if this is undesirable for some reason.  Vs keeping
the flags in a local variable, this is good because a) it doesn't require
callers to cast, and b) it makes operator precedence easy because the flags
are a param, so no "flags & (flag1|flag2)" nesting, and c) we don't keep a
local variable that could get out of sync with 'entry'.

> >  restart:
> >  	spin_lock_irq(&mapping->tree_lock);
> >  	entry = get_unlocked_mapping_entry(mapping, index, &slot);
> > +
> > +	if (entry) {
> > +		if (size_flag & RADIX_DAX_PMD) {
> > +			if (!radix_tree_exceptional_entry(entry) ||
> > +			    !((unsigned long)entry & RADIX_DAX_PMD)) {
> > +				entry = ERR_PTR(-EEXIST);
> > +				goto out_unlock;
> > +			}
> > +		} else { /* trying to grab a PTE entry */
> > +			if (radix_tree_exceptional_entry(entry) &&
> > +			    ((unsigned long)entry & RADIX_DAX_PMD) &&
> > +			    ((unsigned long)entry &
> > +			     (RADIX_DAX_HZP|RADIX_DAX_EMPTY))) {
> 
> And when we do these cases N times next to each other we should
> have a local variable the valid flag bits of entry.

^ permalink raw reply	[flat|nested] 182+ messages in thread

* Re: [PATCH v5 15/17] dax: add struct iomap based DAX PMD support
  2016-10-11  8:31     ` Jan Kara
                         ` (2 preceding siblings ...)
  (?)
@ 2016-10-11 22:51       ` Ross Zwisler
  -1 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-11 22:51 UTC (permalink / raw)
  To: Jan Kara
  Cc: Theodore Ts'o, Matthew Wilcox, Dave Chinner, linux-nvdimm,
	linux-kernel, Christoph Hellwig, linux-xfs, linux-mm,
	Andreas Dilger, Alexander Viro, Jan Kara, linux-fsdevel,
	linux-ext4, Andrew Morton

On Tue, Oct 11, 2016 at 10:31:52AM +0200, Jan Kara wrote:
> On Fri 07-10-16 15:09:02, Ross Zwisler wrote:
> > diff --git a/fs/dax.c b/fs/dax.c
> > index ac3cd05..e51d51f 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -281,7 +281,7 @@ static wait_queue_head_t *dax_entry_waitqueue(struct address_space *mapping,
> >  	 * queue to the start of that PMD.  This ensures that all offsets in
> >  	 * the range covered by the PMD map to the same bit lock.
> >  	 */
> > -	if (RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD)
> > +	if ((unsigned long)entry & RADIX_DAX_PMD)
> >  		index &= ~((1UL << (PMD_SHIFT - PAGE_SHIFT)) - 1);
> 
> I agree with Christoph - helper for masking type bits would make this
> nicer.

Fixed via a dax_flag_test() helper as I outlined in the mail to Christoph.  It
seems clean to me, but if you or Christoph feel strongly that it would be
cleaner as a local 'flags' variable, I'll make the change.

> ...
> > -static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index)
> > +static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index,
> > +		unsigned long size_flag)
> >  {
> > +	bool pmd_downgrade = false; /* splitting 2MiB entry into 4k entries? */
> >  	void *entry, **slot;
> >  
> >  restart:
> >  	spin_lock_irq(&mapping->tree_lock);
> >  	entry = get_unlocked_mapping_entry(mapping, index, &slot);
> > +
> > +	if (entry) {
> > +		if (size_flag & RADIX_DAX_PMD) {
> > +			if (!radix_tree_exceptional_entry(entry) ||
> > +			    !((unsigned long)entry & RADIX_DAX_PMD)) {
> > +				entry = ERR_PTR(-EEXIST);
> > +				goto out_unlock;
> 
> You need to call put_unlocked_mapping_entry() if you use
> get_unlocked_mapping_entry() and then decide not to lock it. The reason is
> that the waitqueues we use are exclusive (we wake up only a single waiter
> waiting for the lock) and so there can be some waiters for the entry lock
> although we have not locked the entry ourselves.

Oh, makes sense, thanks for the catch.

> > +			}
> > +		} else { /* trying to grab a PTE entry */
> > +			if (radix_tree_exceptional_entry(entry) &&
> > +			    ((unsigned long)entry & RADIX_DAX_PMD) &&
> > +			    ((unsigned long)entry &
> > +			     (RADIX_DAX_HZP|RADIX_DAX_EMPTY))) {
> > +				pmd_downgrade = true;
> > +			}
> > +		}
> > +	}
> > +
> >  	/* No entry for given index? Make sure radix tree is big enough. */
> > -	if (!entry) {
> > +	if (!entry || pmd_downgrade) {
> >  		int err;
> >  
> > +		if (pmd_downgrade) {
> > +			/*
> > +			 * Make sure 'entry' remains valid while we drop
> > +			 * mapping->tree_lock.
> > +			 */
> > +			entry = lock_slot(mapping, slot);
> > +		}
> > +
> >  		spin_unlock_irq(&mapping->tree_lock);
> >  		err = radix_tree_preload(
> >  				mapping_gfp_mask(mapping) & ~__GFP_HIGHMEM);
> >  		if (err)
> >  			return ERR_PTR(err);
> 
> You need to unlock the locked entry before you return here...

Yep, thanks, fixed.

> > -		entry = (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY |
> > -			       RADIX_DAX_ENTRY_LOCK);
> > +
> > +		/*
> > +		 * Besides huge zero pages the only other thing that gets
> > +		 * downgraded are empty entries which don't need to be
> > +		 * unmapped.
> > +		 */
> > +		if (pmd_downgrade && ((unsigned long)entry & RADIX_DAX_HZP))
> > +			unmap_mapping_range(mapping,
> > +				(index << PAGE_SHIFT) & PMD_MASK, PMD_SIZE, 0);
> > +
> >  		spin_lock_irq(&mapping->tree_lock);
> > -		err = radix_tree_insert(&mapping->page_tree, index, entry);
> > +
> > +		if (pmd_downgrade) {
> > +			radix_tree_delete(&mapping->page_tree, index);
> > +			mapping->nrexceptional--;
> > +			dax_wake_mapping_entry_waiter(mapping, index, entry,
> > +					false);
> 
> You need to set 'wake_all' argument here to true. Otherwise there could be
> waiters waiting for non-existent entry forever...

Interesting.   Fixed, but let me make sure I understand.  So is the issue that
you could have say 2 tasks waiting on a PMD index that has been rounded down
to the PMD index via dax_entry_waitqueue()?

The person holding the lock on the entry would remove the PMD, insert a PTE
and wake just one of the PMD aligned waiters.  That waiter would wake up, do
something PTE based (since the PMD space is now polluted with PTEs), and then
wake any waiters on it's PTE index.  Meanwhile, the second waiter could sleep
forever on the PMD aligned index.  Is this correct?

So, perhaps more succinctly:

Thread 1		Thread 2		Thread 3
--------		--------		--------
index 0x202, hold PMD lock 0x200
			index 0x203, sleep on 0x200
						index 0x204, sleep on 0x200
downgrade, removing 0x200
wake one waiter on 0x200
insert PTE @ 0x202
			wake up, grab index 0x203
			...
			wake one waiter on index 0x203

						... sleeps forever
Right?

> > +		}
> > +
> > +		entry = dax_radix_entry(0, size_flag | RADIX_DAX_EMPTY);
> > +
> > +		err = __radix_tree_insert(&mapping->page_tree, index,
> > +				dax_radix_order(entry), entry);
> >  		radix_tree_preload_end();
> >  		if (err) {
> >  			spin_unlock_irq(&mapping->tree_lock);
> > -			/* Someone already created the entry? */
> > -			if (err == -EEXIST)
> > +			/*
> > +			 * Someone already created the entry?  This is a
> > +			 * normal failure when inserting PMDs in a range
> > +			 * that already contains PTEs.  In that case we want
> > +			 * to return -EEXIST immediately.
> > +			 */
> > +			if (err == -EEXIST && !(size_flag & RADIX_DAX_PMD))
> >  				goto restart;
> 
> Add a comment here that we can get here only if there was no radix tree
> entry at 'index' and thus there can be no waiters to wake.

Done.

> >  			return ERR_PTR(err);
> >  		}
> > @@ -441,6 +509,7 @@ restart:
> >  		return page;
> >  	}
> >  	entry = lock_slot(mapping, slot);
> > + out_unlock:
> >  	spin_unlock_irq(&mapping->tree_lock);
> >  	return entry;
> >  }
> > @@ -581,11 +650,17 @@ static int copy_user_dax(struct block_device *bdev, sector_t sector, size_t size
> >  	return 0;
> >  }
> >  
> > -#define DAX_PMD_INDEX(page_index) (page_index & (PMD_MASK >> PAGE_SHIFT))
> > -
> > +/*
> > + * By this point grab_mapping_entry() has ensured that we have a locked entry
> > + * of the appropriate size so we don't have to worry about downgrading PMDs to
> > + * PTEs.  If we happen to be trying to insert a PTE and there is a PMD
> > + * already in the tree, we will skip the insertion and just dirty the PMD as
> > + * appropriate.
> > + */
> >  static void *dax_insert_mapping_entry(struct address_space *mapping,
> >  				      struct vm_fault *vmf,
> > -				      void *entry, sector_t sector)
> > +				      void *entry, sector_t sector,
> > +				      unsigned long flags)
> >  {
> >  	struct radix_tree_root *page_tree = &mapping->page_tree;
> >  	int error = 0;
> > @@ -608,22 +683,28 @@ static void *dax_insert_mapping_entry(struct address_space *mapping,
> >  		error = radix_tree_preload(vmf->gfp_mask & ~__GFP_HIGHMEM);
> >  		if (error)
> >  			return ERR_PTR(error);
> > +	} else if (((unsigned long)entry & RADIX_DAX_HZP) &&
> > +			!(flags & RADIX_DAX_HZP)) {
> > +		/* replacing huge zero page with PMD block mapping */
> > +		unmap_mapping_range(mapping,
> > +			(vmf->pgoff << PAGE_SHIFT) & PMD_MASK, PMD_SIZE, 0);
> >  	}
> >  
> >  	spin_lock_irq(&mapping->tree_lock);
> > -	new_entry = (void *)((unsigned long)RADIX_DAX_ENTRY(sector, false) |
> > -		       RADIX_DAX_ENTRY_LOCK);
> > +	new_entry = dax_radix_entry(sector, flags);
> > +
> 
> You've lost the RADIX_DAX_ENTRY_LOCK flag here?

Oh, nope, that's embedded in the dax_radix_entry() helper:

/* entries begin locked */
static inline void *dax_radix_entry(sector_t sector, unsigned long flags)
{
	return (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY | flags |
			((unsigned long)sector << RADIX_DAX_SHIFT) |
			RADIX_DAX_ENTRY_LOCK);
}

I'll s/dax_radix_entry/dax_radix_locked_entry/ or something to make this
clearer to the reader.

> >  	if (hole_fill) {
> >  		__delete_from_page_cache(entry, NULL);
> >  		/* Drop pagecache reference */
> >  		put_page(entry);
> > -		error = radix_tree_insert(page_tree, index, new_entry);
> > +		error = __radix_tree_insert(page_tree, index,
> > +				dax_radix_order(new_entry), new_entry);
> >  		if (error) {
> >  			new_entry = ERR_PTR(error);
> >  			goto unlock;
> >  		}
> >  		mapping->nrexceptional++;
> > -	} else {
> > +	} else if ((unsigned long)entry & (RADIX_DAX_HZP|RADIX_DAX_EMPTY)) {
> >  		void **slot;
> >  		void *ret;
> 
> Uh, why this condition need to change? Is it some protection so that we
> don't replace a mapped PMD entry with PTE one?

Yea, the logic was that if we have a radix tree that has PMDs in it, and some
new process comes along doing PTE faults, we can just leave the DAX mapped PMD
entries in the tree.  Locking, dirtying and flushing will happen on PMD basis
for all processes.

If you think this is dangerous or not worth the effort we can make it like the
hole case where any PTE sized faults will result in an unmap and a downgrade
to PTE sized entries in the radix tree.

> > @@ -1261,4 +1338,186 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
> >  	return VM_FAULT_NOPAGE | major;
> >  }
> >  EXPORT_SYMBOL_GPL(dax_iomap_fault);
> > +
> > +#ifdef CONFIG_FS_DAX_PMD
> > +/*
> > + * The 'colour' (ie low bits) within a PMD of a page offset.  This comes up
> > + * more often than one might expect in the below functions.
> > + */
> > +#define PG_PMD_COLOUR	((PMD_SIZE >> PAGE_SHIFT) - 1)
> 
> Just out of curiosity: Why the british spelling of 'colour'?

This was copied from the old dax_pmd_fault() from Willy. :)

> > +
> > +static int dax_pmd_insert_mapping(struct vm_area_struct *vma, pmd_t *pmd,
> > +		struct vm_fault *vmf, unsigned long address,
> > +		struct iomap *iomap, loff_t pos, bool write, void **entryp)
> > +{
> > +	struct address_space *mapping = vma->vm_file->f_mapping;
> > +	struct block_device *bdev = iomap->bdev;
> > +	struct blk_dax_ctl dax = {
> > +		.sector = dax_iomap_sector(iomap, pos),
> > +		.size = PMD_SIZE,
> > +	};
> > +	long length = dax_map_atomic(bdev, &dax);
> > +	void *ret;
> > +
> > +	if (length < 0) /* dax_map_atomic() failed */
> > +		return VM_FAULT_FALLBACK;
> > +	if (length < PMD_SIZE)
> > +		goto unmap_fallback;
> > +	if (pfn_t_to_pfn(dax.pfn) & PG_PMD_COLOUR)
> > +		goto unmap_fallback;
> > +	if (!pfn_t_devmap(dax.pfn))
> > +		goto unmap_fallback;
> > +
> > +	dax_unmap_atomic(bdev, &dax);
> > +
> > +	ret = dax_insert_mapping_entry(mapping, vmf, *entryp, dax.sector,
> > +			RADIX_DAX_PMD);
> > +	if (IS_ERR(ret))
> > +		return VM_FAULT_FALLBACK;
> > +	*entryp = ret;
> > +
> > +	return vmf_insert_pfn_pmd(vma, address, pmd, dax.pfn, write);
> > +
> > + unmap_fallback:
> > +	dax_unmap_atomic(bdev, &dax);
> > +	return VM_FAULT_FALLBACK;
> > +}
> > +
> > +static int dax_pmd_load_hole(struct vm_area_struct *vma, pmd_t *pmd,
> > +		struct vm_fault *vmf, unsigned long address,
> > +		struct iomap *iomap, void **entryp)
> > +{
> > +	struct address_space *mapping = vma->vm_file->f_mapping;
> > +	unsigned long pmd_addr = address & PMD_MASK;
> > +	struct page *zero_page;
> > +	spinlock_t *ptl;
> > +	pmd_t pmd_entry;
> > +	void *ret;
> > +
> > +	zero_page = get_huge_zero_page();
> > +
> > +	if (unlikely(!zero_page))
> > +		return VM_FAULT_FALLBACK;
> > +
> > +	ret = dax_insert_mapping_entry(mapping, vmf, *entryp, 0,
> > +			RADIX_DAX_PMD | RADIX_DAX_HZP);
> > +	if (IS_ERR(ret))
> > +		return VM_FAULT_FALLBACK;
> > +	*entryp = ret;
> > +
> > +	ptl = pmd_lock(vma->vm_mm, pmd);
> > +	if (!pmd_none(*pmd)) {
> > +		spin_unlock(ptl);
> > +		return VM_FAULT_FALLBACK;
> > +	}
> > +
> > +	pmd_entry = mk_pmd(zero_page, vma->vm_page_prot);
> > +	pmd_entry = pmd_mkhuge(pmd_entry);
> > +	set_pmd_at(vma->vm_mm, pmd_addr, pmd, pmd_entry);
> > +	spin_unlock(ptl);
> > +	return VM_FAULT_NOPAGE;
> > +}
> > +
> > +int dax_iomap_pmd_fault(struct vm_area_struct *vma, unsigned long address,
> > +		pmd_t *pmd, unsigned int flags, struct iomap_ops *ops)
> > +{
> > +	struct address_space *mapping = vma->vm_file->f_mapping;
> > +	unsigned long pmd_addr = address & PMD_MASK;
> > +	bool write = flags & FAULT_FLAG_WRITE;
> > +	unsigned int iomap_flags = write ? IOMAP_WRITE : 0;
> > +	struct inode *inode = mapping->host;
> > +	int result = VM_FAULT_FALLBACK;
> > +	struct iomap iomap = { 0 };
> 
> Why the 0 here? Just empty braces are enough to initialize the structure to
> zeros.

Ah, I just copied this from Christoph's dax_iomap_fault() code.

> > +	pgoff_t size, pgoff;
> > +	struct vm_fault vmf;
> > +	void *entry;
> > +	loff_t pos;
> > +	int error;
> > +
> > +	/* Fall back to PTEs if we're going to COW */
> > +	if (write && !(vma->vm_flags & VM_SHARED)) {
> > +		split_huge_pmd(vma, pmd, address);
> > +		goto fallback;
> > +	}
> > +
> > +	/* If the PMD would extend outside the VMA */
> > +	if (pmd_addr < vma->vm_start)
> > +		goto fallback;
> > +	if ((pmd_addr + PMD_SIZE) > vma->vm_end)
> > +		goto fallback;
> > +
> > +	/*
> > +	 * Check whether offset isn't beyond end of file now. Caller is
> > +	 * supposed to hold locks serializing us with truncate / punch hole so
> > +	 * this is a reliable test.
> > +	 */
> > +	pgoff = linear_page_index(vma, pmd_addr);
> > +	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
> 
> Nitpick - 'size' does not express that this is in pages and rounded up.
> Maybe we could have:
> 
> 	max_pgoff = (i_size_read(inode) - 1) >> PAGE_SHIFT;
> 
> and then use strict inequalities below?

Yep, that's cleaner.  Fixed.

Jan, thank you for the review!
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 182+ messages in thread

* Re: [PATCH v5 15/17] dax: add struct iomap based DAX PMD support
@ 2016-10-11 22:51       ` Ross Zwisler
  0 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-11 22:51 UTC (permalink / raw)
  To: Jan Kara
  Cc: Ross Zwisler, linux-kernel, Theodore Ts'o, Alexander Viro,
	Andreas Dilger, Andrew Morton, Christoph Hellwig, Dan Williams,
	Dave Chinner, Jan Kara, Matthew Wilcox, linux-ext4,
	linux-fsdevel, linux-mm, linux-nvdimm, linux-xfs

On Tue, Oct 11, 2016 at 10:31:52AM +0200, Jan Kara wrote:
> On Fri 07-10-16 15:09:02, Ross Zwisler wrote:
> > diff --git a/fs/dax.c b/fs/dax.c
> > index ac3cd05..e51d51f 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -281,7 +281,7 @@ static wait_queue_head_t *dax_entry_waitqueue(struct address_space *mapping,
> >  	 * queue to the start of that PMD.  This ensures that all offsets in
> >  	 * the range covered by the PMD map to the same bit lock.
> >  	 */
> > -	if (RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD)
> > +	if ((unsigned long)entry & RADIX_DAX_PMD)
> >  		index &= ~((1UL << (PMD_SHIFT - PAGE_SHIFT)) - 1);
> 
> I agree with Christoph - helper for masking type bits would make this
> nicer.

Fixed via a dax_flag_test() helper as I outlined in the mail to Christoph.  It
seems clean to me, but if you or Christoph feel strongly that it would be
cleaner as a local 'flags' variable, I'll make the change.

> ...
> > -static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index)
> > +static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index,
> > +		unsigned long size_flag)
> >  {
> > +	bool pmd_downgrade = false; /* splitting 2MiB entry into 4k entries? */
> >  	void *entry, **slot;
> >  
> >  restart:
> >  	spin_lock_irq(&mapping->tree_lock);
> >  	entry = get_unlocked_mapping_entry(mapping, index, &slot);
> > +
> > +	if (entry) {
> > +		if (size_flag & RADIX_DAX_PMD) {
> > +			if (!radix_tree_exceptional_entry(entry) ||
> > +			    !((unsigned long)entry & RADIX_DAX_PMD)) {
> > +				entry = ERR_PTR(-EEXIST);
> > +				goto out_unlock;
> 
> You need to call put_unlocked_mapping_entry() if you use
> get_unlocked_mapping_entry() and then decide not to lock it. The reason is
> that the waitqueues we use are exclusive (we wake up only a single waiter
> waiting for the lock) and so there can be some waiters for the entry lock
> although we have not locked the entry ourselves.

Oh, makes sense, thanks for the catch.

> > +			}
> > +		} else { /* trying to grab a PTE entry */
> > +			if (radix_tree_exceptional_entry(entry) &&
> > +			    ((unsigned long)entry & RADIX_DAX_PMD) &&
> > +			    ((unsigned long)entry &
> > +			     (RADIX_DAX_HZP|RADIX_DAX_EMPTY))) {
> > +				pmd_downgrade = true;
> > +			}
> > +		}
> > +	}
> > +
> >  	/* No entry for given index? Make sure radix tree is big enough. */
> > -	if (!entry) {
> > +	if (!entry || pmd_downgrade) {
> >  		int err;
> >  
> > +		if (pmd_downgrade) {
> > +			/*
> > +			 * Make sure 'entry' remains valid while we drop
> > +			 * mapping->tree_lock.
> > +			 */
> > +			entry = lock_slot(mapping, slot);
> > +		}
> > +
> >  		spin_unlock_irq(&mapping->tree_lock);
> >  		err = radix_tree_preload(
> >  				mapping_gfp_mask(mapping) & ~__GFP_HIGHMEM);
> >  		if (err)
> >  			return ERR_PTR(err);
> 
> You need to unlock the locked entry before you return here...

Yep, thanks, fixed.

> > -		entry = (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY |
> > -			       RADIX_DAX_ENTRY_LOCK);
> > +
> > +		/*
> > +		 * Besides huge zero pages the only other thing that gets
> > +		 * downgraded are empty entries which don't need to be
> > +		 * unmapped.
> > +		 */
> > +		if (pmd_downgrade && ((unsigned long)entry & RADIX_DAX_HZP))
> > +			unmap_mapping_range(mapping,
> > +				(index << PAGE_SHIFT) & PMD_MASK, PMD_SIZE, 0);
> > +
> >  		spin_lock_irq(&mapping->tree_lock);
> > -		err = radix_tree_insert(&mapping->page_tree, index, entry);
> > +
> > +		if (pmd_downgrade) {
> > +			radix_tree_delete(&mapping->page_tree, index);
> > +			mapping->nrexceptional--;
> > +			dax_wake_mapping_entry_waiter(mapping, index, entry,
> > +					false);
> 
> You need to set 'wake_all' argument here to true. Otherwise there could be
> waiters waiting for non-existent entry forever...

Interesting.   Fixed, but let me make sure I understand.  So is the issue that
you could have say 2 tasks waiting on a PMD index that has been rounded down
to the PMD index via dax_entry_waitqueue()?

The person holding the lock on the entry would remove the PMD, insert a PTE
and wake just one of the PMD aligned waiters.  That waiter would wake up, do
something PTE based (since the PMD space is now polluted with PTEs), and then
wake any waiters on it's PTE index.  Meanwhile, the second waiter could sleep
forever on the PMD aligned index.  Is this correct?

So, perhaps more succinctly:

Thread 1		Thread 2		Thread 3
--------		--------		--------
index 0x202, hold PMD lock 0x200
			index 0x203, sleep on 0x200
						index 0x204, sleep on 0x200
downgrade, removing 0x200
wake one waiter on 0x200
insert PTE @ 0x202
			wake up, grab index 0x203
			...
			wake one waiter on index 0x203

						... sleeps forever
Right?

> > +		}
> > +
> > +		entry = dax_radix_entry(0, size_flag | RADIX_DAX_EMPTY);
> > +
> > +		err = __radix_tree_insert(&mapping->page_tree, index,
> > +				dax_radix_order(entry), entry);
> >  		radix_tree_preload_end();
> >  		if (err) {
> >  			spin_unlock_irq(&mapping->tree_lock);
> > -			/* Someone already created the entry? */
> > -			if (err == -EEXIST)
> > +			/*
> > +			 * Someone already created the entry?  This is a
> > +			 * normal failure when inserting PMDs in a range
> > +			 * that already contains PTEs.  In that case we want
> > +			 * to return -EEXIST immediately.
> > +			 */
> > +			if (err == -EEXIST && !(size_flag & RADIX_DAX_PMD))
> >  				goto restart;
> 
> Add a comment here that we can get here only if there was no radix tree
> entry at 'index' and thus there can be no waiters to wake.

Done.

> >  			return ERR_PTR(err);
> >  		}
> > @@ -441,6 +509,7 @@ restart:
> >  		return page;
> >  	}
> >  	entry = lock_slot(mapping, slot);
> > + out_unlock:
> >  	spin_unlock_irq(&mapping->tree_lock);
> >  	return entry;
> >  }
> > @@ -581,11 +650,17 @@ static int copy_user_dax(struct block_device *bdev, sector_t sector, size_t size
> >  	return 0;
> >  }
> >  
> > -#define DAX_PMD_INDEX(page_index) (page_index & (PMD_MASK >> PAGE_SHIFT))
> > -
> > +/*
> > + * By this point grab_mapping_entry() has ensured that we have a locked entry
> > + * of the appropriate size so we don't have to worry about downgrading PMDs to
> > + * PTEs.  If we happen to be trying to insert a PTE and there is a PMD
> > + * already in the tree, we will skip the insertion and just dirty the PMD as
> > + * appropriate.
> > + */
> >  static void *dax_insert_mapping_entry(struct address_space *mapping,
> >  				      struct vm_fault *vmf,
> > -				      void *entry, sector_t sector)
> > +				      void *entry, sector_t sector,
> > +				      unsigned long flags)
> >  {
> >  	struct radix_tree_root *page_tree = &mapping->page_tree;
> >  	int error = 0;
> > @@ -608,22 +683,28 @@ static void *dax_insert_mapping_entry(struct address_space *mapping,
> >  		error = radix_tree_preload(vmf->gfp_mask & ~__GFP_HIGHMEM);
> >  		if (error)
> >  			return ERR_PTR(error);
> > +	} else if (((unsigned long)entry & RADIX_DAX_HZP) &&
> > +			!(flags & RADIX_DAX_HZP)) {
> > +		/* replacing huge zero page with PMD block mapping */
> > +		unmap_mapping_range(mapping,
> > +			(vmf->pgoff << PAGE_SHIFT) & PMD_MASK, PMD_SIZE, 0);
> >  	}
> >  
> >  	spin_lock_irq(&mapping->tree_lock);
> > -	new_entry = (void *)((unsigned long)RADIX_DAX_ENTRY(sector, false) |
> > -		       RADIX_DAX_ENTRY_LOCK);
> > +	new_entry = dax_radix_entry(sector, flags);
> > +
> 
> You've lost the RADIX_DAX_ENTRY_LOCK flag here?

Oh, nope, that's embedded in the dax_radix_entry() helper:

/* entries begin locked */
static inline void *dax_radix_entry(sector_t sector, unsigned long flags)
{
	return (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY | flags |
			((unsigned long)sector << RADIX_DAX_SHIFT) |
			RADIX_DAX_ENTRY_LOCK);
}

I'll s/dax_radix_entry/dax_radix_locked_entry/ or something to make this
clearer to the reader.

> >  	if (hole_fill) {
> >  		__delete_from_page_cache(entry, NULL);
> >  		/* Drop pagecache reference */
> >  		put_page(entry);
> > -		error = radix_tree_insert(page_tree, index, new_entry);
> > +		error = __radix_tree_insert(page_tree, index,
> > +				dax_radix_order(new_entry), new_entry);
> >  		if (error) {
> >  			new_entry = ERR_PTR(error);
> >  			goto unlock;
> >  		}
> >  		mapping->nrexceptional++;
> > -	} else {
> > +	} else if ((unsigned long)entry & (RADIX_DAX_HZP|RADIX_DAX_EMPTY)) {
> >  		void **slot;
> >  		void *ret;
> 
> Uh, why this condition need to change? Is it some protection so that we
> don't replace a mapped PMD entry with PTE one?

Yea, the logic was that if we have a radix tree that has PMDs in it, and some
new process comes along doing PTE faults, we can just leave the DAX mapped PMD
entries in the tree.  Locking, dirtying and flushing will happen on PMD basis
for all processes.

If you think this is dangerous or not worth the effort we can make it like the
hole case where any PTE sized faults will result in an unmap and a downgrade
to PTE sized entries in the radix tree.

> > @@ -1261,4 +1338,186 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
> >  	return VM_FAULT_NOPAGE | major;
> >  }
> >  EXPORT_SYMBOL_GPL(dax_iomap_fault);
> > +
> > +#ifdef CONFIG_FS_DAX_PMD
> > +/*
> > + * The 'colour' (ie low bits) within a PMD of a page offset.  This comes up
> > + * more often than one might expect in the below functions.
> > + */
> > +#define PG_PMD_COLOUR	((PMD_SIZE >> PAGE_SHIFT) - 1)
> 
> Just out of curiosity: Why the british spelling of 'colour'?

This was copied from the old dax_pmd_fault() from Willy. :)

> > +
> > +static int dax_pmd_insert_mapping(struct vm_area_struct *vma, pmd_t *pmd,
> > +		struct vm_fault *vmf, unsigned long address,
> > +		struct iomap *iomap, loff_t pos, bool write, void **entryp)
> > +{
> > +	struct address_space *mapping = vma->vm_file->f_mapping;
> > +	struct block_device *bdev = iomap->bdev;
> > +	struct blk_dax_ctl dax = {
> > +		.sector = dax_iomap_sector(iomap, pos),
> > +		.size = PMD_SIZE,
> > +	};
> > +	long length = dax_map_atomic(bdev, &dax);
> > +	void *ret;
> > +
> > +	if (length < 0) /* dax_map_atomic() failed */
> > +		return VM_FAULT_FALLBACK;
> > +	if (length < PMD_SIZE)
> > +		goto unmap_fallback;
> > +	if (pfn_t_to_pfn(dax.pfn) & PG_PMD_COLOUR)
> > +		goto unmap_fallback;
> > +	if (!pfn_t_devmap(dax.pfn))
> > +		goto unmap_fallback;
> > +
> > +	dax_unmap_atomic(bdev, &dax);
> > +
> > +	ret = dax_insert_mapping_entry(mapping, vmf, *entryp, dax.sector,
> > +			RADIX_DAX_PMD);
> > +	if (IS_ERR(ret))
> > +		return VM_FAULT_FALLBACK;
> > +	*entryp = ret;
> > +
> > +	return vmf_insert_pfn_pmd(vma, address, pmd, dax.pfn, write);
> > +
> > + unmap_fallback:
> > +	dax_unmap_atomic(bdev, &dax);
> > +	return VM_FAULT_FALLBACK;
> > +}
> > +
> > +static int dax_pmd_load_hole(struct vm_area_struct *vma, pmd_t *pmd,
> > +		struct vm_fault *vmf, unsigned long address,
> > +		struct iomap *iomap, void **entryp)
> > +{
> > +	struct address_space *mapping = vma->vm_file->f_mapping;
> > +	unsigned long pmd_addr = address & PMD_MASK;
> > +	struct page *zero_page;
> > +	spinlock_t *ptl;
> > +	pmd_t pmd_entry;
> > +	void *ret;
> > +
> > +	zero_page = get_huge_zero_page();
> > +
> > +	if (unlikely(!zero_page))
> > +		return VM_FAULT_FALLBACK;
> > +
> > +	ret = dax_insert_mapping_entry(mapping, vmf, *entryp, 0,
> > +			RADIX_DAX_PMD | RADIX_DAX_HZP);
> > +	if (IS_ERR(ret))
> > +		return VM_FAULT_FALLBACK;
> > +	*entryp = ret;
> > +
> > +	ptl = pmd_lock(vma->vm_mm, pmd);
> > +	if (!pmd_none(*pmd)) {
> > +		spin_unlock(ptl);
> > +		return VM_FAULT_FALLBACK;
> > +	}
> > +
> > +	pmd_entry = mk_pmd(zero_page, vma->vm_page_prot);
> > +	pmd_entry = pmd_mkhuge(pmd_entry);
> > +	set_pmd_at(vma->vm_mm, pmd_addr, pmd, pmd_entry);
> > +	spin_unlock(ptl);
> > +	return VM_FAULT_NOPAGE;
> > +}
> > +
> > +int dax_iomap_pmd_fault(struct vm_area_struct *vma, unsigned long address,
> > +		pmd_t *pmd, unsigned int flags, struct iomap_ops *ops)
> > +{
> > +	struct address_space *mapping = vma->vm_file->f_mapping;
> > +	unsigned long pmd_addr = address & PMD_MASK;
> > +	bool write = flags & FAULT_FLAG_WRITE;
> > +	unsigned int iomap_flags = write ? IOMAP_WRITE : 0;
> > +	struct inode *inode = mapping->host;
> > +	int result = VM_FAULT_FALLBACK;
> > +	struct iomap iomap = { 0 };
> 
> Why the 0 here? Just empty braces are enough to initialize the structure to
> zeros.

Ah, I just copied this from Christoph's dax_iomap_fault() code.

> > +	pgoff_t size, pgoff;
> > +	struct vm_fault vmf;
> > +	void *entry;
> > +	loff_t pos;
> > +	int error;
> > +
> > +	/* Fall back to PTEs if we're going to COW */
> > +	if (write && !(vma->vm_flags & VM_SHARED)) {
> > +		split_huge_pmd(vma, pmd, address);
> > +		goto fallback;
> > +	}
> > +
> > +	/* If the PMD would extend outside the VMA */
> > +	if (pmd_addr < vma->vm_start)
> > +		goto fallback;
> > +	if ((pmd_addr + PMD_SIZE) > vma->vm_end)
> > +		goto fallback;
> > +
> > +	/*
> > +	 * Check whether offset isn't beyond end of file now. Caller is
> > +	 * supposed to hold locks serializing us with truncate / punch hole so
> > +	 * this is a reliable test.
> > +	 */
> > +	pgoff = linear_page_index(vma, pmd_addr);
> > +	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
> 
> Nitpick - 'size' does not express that this is in pages and rounded up.
> Maybe we could have:
> 
> 	max_pgoff = (i_size_read(inode) - 1) >> PAGE_SHIFT;
> 
> and then use strict inequalities below?

Yep, that's cleaner.  Fixed.

Jan, thank you for the review!

^ permalink raw reply	[flat|nested] 182+ messages in thread

* Re: [PATCH v5 15/17] dax: add struct iomap based DAX PMD support
@ 2016-10-11 22:51       ` Ross Zwisler
  0 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-11 22:51 UTC (permalink / raw)
  To: Jan Kara
  Cc: Ross Zwisler, linux-kernel, Theodore Ts'o, Alexander Viro,
	Andreas Dilger, Andrew Morton, Christoph Hellwig, Dan Williams,
	Dave Chinner, Jan Kara, Matthew Wilcox, linux-ext4,
	linux-fsdevel, linux-mm, linux-nvdimm, linux-xfs

On Tue, Oct 11, 2016 at 10:31:52AM +0200, Jan Kara wrote:
> On Fri 07-10-16 15:09:02, Ross Zwisler wrote:
> > diff --git a/fs/dax.c b/fs/dax.c
> > index ac3cd05..e51d51f 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -281,7 +281,7 @@ static wait_queue_head_t *dax_entry_waitqueue(struct address_space *mapping,
> >  	 * queue to the start of that PMD.  This ensures that all offsets in
> >  	 * the range covered by the PMD map to the same bit lock.
> >  	 */
> > -	if (RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD)
> > +	if ((unsigned long)entry & RADIX_DAX_PMD)
> >  		index &= ~((1UL << (PMD_SHIFT - PAGE_SHIFT)) - 1);
> 
> I agree with Christoph - helper for masking type bits would make this
> nicer.

Fixed via a dax_flag_test() helper as I outlined in the mail to Christoph.  It
seems clean to me, but if you or Christoph feel strongly that it would be
cleaner as a local 'flags' variable, I'll make the change.

> ...
> > -static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index)
> > +static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index,
> > +		unsigned long size_flag)
> >  {
> > +	bool pmd_downgrade = false; /* splitting 2MiB entry into 4k entries? */
> >  	void *entry, **slot;
> >  
> >  restart:
> >  	spin_lock_irq(&mapping->tree_lock);
> >  	entry = get_unlocked_mapping_entry(mapping, index, &slot);
> > +
> > +	if (entry) {
> > +		if (size_flag & RADIX_DAX_PMD) {
> > +			if (!radix_tree_exceptional_entry(entry) ||
> > +			    !((unsigned long)entry & RADIX_DAX_PMD)) {
> > +				entry = ERR_PTR(-EEXIST);
> > +				goto out_unlock;
> 
> You need to call put_unlocked_mapping_entry() if you use
> get_unlocked_mapping_entry() and then decide not to lock it. The reason is
> that the waitqueues we use are exclusive (we wake up only a single waiter
> waiting for the lock) and so there can be some waiters for the entry lock
> although we have not locked the entry ourselves.

Oh, makes sense, thanks for the catch.

> > +			}
> > +		} else { /* trying to grab a PTE entry */
> > +			if (radix_tree_exceptional_entry(entry) &&
> > +			    ((unsigned long)entry & RADIX_DAX_PMD) &&
> > +			    ((unsigned long)entry &
> > +			     (RADIX_DAX_HZP|RADIX_DAX_EMPTY))) {
> > +				pmd_downgrade = true;
> > +			}
> > +		}
> > +	}
> > +
> >  	/* No entry for given index? Make sure radix tree is big enough. */
> > -	if (!entry) {
> > +	if (!entry || pmd_downgrade) {
> >  		int err;
> >  
> > +		if (pmd_downgrade) {
> > +			/*
> > +			 * Make sure 'entry' remains valid while we drop
> > +			 * mapping->tree_lock.
> > +			 */
> > +			entry = lock_slot(mapping, slot);
> > +		}
> > +
> >  		spin_unlock_irq(&mapping->tree_lock);
> >  		err = radix_tree_preload(
> >  				mapping_gfp_mask(mapping) & ~__GFP_HIGHMEM);
> >  		if (err)
> >  			return ERR_PTR(err);
> 
> You need to unlock the locked entry before you return here...

Yep, thanks, fixed.

> > -		entry = (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY |
> > -			       RADIX_DAX_ENTRY_LOCK);
> > +
> > +		/*
> > +		 * Besides huge zero pages the only other thing that gets
> > +		 * downgraded are empty entries which don't need to be
> > +		 * unmapped.
> > +		 */
> > +		if (pmd_downgrade && ((unsigned long)entry & RADIX_DAX_HZP))
> > +			unmap_mapping_range(mapping,
> > +				(index << PAGE_SHIFT) & PMD_MASK, PMD_SIZE, 0);
> > +
> >  		spin_lock_irq(&mapping->tree_lock);
> > -		err = radix_tree_insert(&mapping->page_tree, index, entry);
> > +
> > +		if (pmd_downgrade) {
> > +			radix_tree_delete(&mapping->page_tree, index);
> > +			mapping->nrexceptional--;
> > +			dax_wake_mapping_entry_waiter(mapping, index, entry,
> > +					false);
> 
> You need to set 'wake_all' argument here to true. Otherwise there could be
> waiters waiting for non-existent entry forever...

Interesting.   Fixed, but let me make sure I understand.  So is the issue that
you could have say 2 tasks waiting on a PMD index that has been rounded down
to the PMD index via dax_entry_waitqueue()?

The person holding the lock on the entry would remove the PMD, insert a PTE
and wake just one of the PMD aligned waiters.  That waiter would wake up, do
something PTE based (since the PMD space is now polluted with PTEs), and then
wake any waiters on it's PTE index.  Meanwhile, the second waiter could sleep
forever on the PMD aligned index.  Is this correct?

So, perhaps more succinctly:

Thread 1		Thread 2		Thread 3
--------		--------		--------
index 0x202, hold PMD lock 0x200
			index 0x203, sleep on 0x200
						index 0x204, sleep on 0x200
downgrade, removing 0x200
wake one waiter on 0x200
insert PTE @ 0x202
			wake up, grab index 0x203
			...
			wake one waiter on index 0x203

						... sleeps forever
Right?

> > +		}
> > +
> > +		entry = dax_radix_entry(0, size_flag | RADIX_DAX_EMPTY);
> > +
> > +		err = __radix_tree_insert(&mapping->page_tree, index,
> > +				dax_radix_order(entry), entry);
> >  		radix_tree_preload_end();
> >  		if (err) {
> >  			spin_unlock_irq(&mapping->tree_lock);
> > -			/* Someone already created the entry? */
> > -			if (err == -EEXIST)
> > +			/*
> > +			 * Someone already created the entry?  This is a
> > +			 * normal failure when inserting PMDs in a range
> > +			 * that already contains PTEs.  In that case we want
> > +			 * to return -EEXIST immediately.
> > +			 */
> > +			if (err == -EEXIST && !(size_flag & RADIX_DAX_PMD))
> >  				goto restart;
> 
> Add a comment here that we can get here only if there was no radix tree
> entry at 'index' and thus there can be no waiters to wake.

Done.

> >  			return ERR_PTR(err);
> >  		}
> > @@ -441,6 +509,7 @@ restart:
> >  		return page;
> >  	}
> >  	entry = lock_slot(mapping, slot);
> > + out_unlock:
> >  	spin_unlock_irq(&mapping->tree_lock);
> >  	return entry;
> >  }
> > @@ -581,11 +650,17 @@ static int copy_user_dax(struct block_device *bdev, sector_t sector, size_t size
> >  	return 0;
> >  }
> >  
> > -#define DAX_PMD_INDEX(page_index) (page_index & (PMD_MASK >> PAGE_SHIFT))
> > -
> > +/*
> > + * By this point grab_mapping_entry() has ensured that we have a locked entry
> > + * of the appropriate size so we don't have to worry about downgrading PMDs to
> > + * PTEs.  If we happen to be trying to insert a PTE and there is a PMD
> > + * already in the tree, we will skip the insertion and just dirty the PMD as
> > + * appropriate.
> > + */
> >  static void *dax_insert_mapping_entry(struct address_space *mapping,
> >  				      struct vm_fault *vmf,
> > -				      void *entry, sector_t sector)
> > +				      void *entry, sector_t sector,
> > +				      unsigned long flags)
> >  {
> >  	struct radix_tree_root *page_tree = &mapping->page_tree;
> >  	int error = 0;
> > @@ -608,22 +683,28 @@ static void *dax_insert_mapping_entry(struct address_space *mapping,
> >  		error = radix_tree_preload(vmf->gfp_mask & ~__GFP_HIGHMEM);
> >  		if (error)
> >  			return ERR_PTR(error);
> > +	} else if (((unsigned long)entry & RADIX_DAX_HZP) &&
> > +			!(flags & RADIX_DAX_HZP)) {
> > +		/* replacing huge zero page with PMD block mapping */
> > +		unmap_mapping_range(mapping,
> > +			(vmf->pgoff << PAGE_SHIFT) & PMD_MASK, PMD_SIZE, 0);
> >  	}
> >  
> >  	spin_lock_irq(&mapping->tree_lock);
> > -	new_entry = (void *)((unsigned long)RADIX_DAX_ENTRY(sector, false) |
> > -		       RADIX_DAX_ENTRY_LOCK);
> > +	new_entry = dax_radix_entry(sector, flags);
> > +
> 
> You've lost the RADIX_DAX_ENTRY_LOCK flag here?

Oh, nope, that's embedded in the dax_radix_entry() helper:

/* entries begin locked */
static inline void *dax_radix_entry(sector_t sector, unsigned long flags)
{
	return (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY | flags |
			((unsigned long)sector << RADIX_DAX_SHIFT) |
			RADIX_DAX_ENTRY_LOCK);
}

I'll s/dax_radix_entry/dax_radix_locked_entry/ or something to make this
clearer to the reader.

> >  	if (hole_fill) {
> >  		__delete_from_page_cache(entry, NULL);
> >  		/* Drop pagecache reference */
> >  		put_page(entry);
> > -		error = radix_tree_insert(page_tree, index, new_entry);
> > +		error = __radix_tree_insert(page_tree, index,
> > +				dax_radix_order(new_entry), new_entry);
> >  		if (error) {
> >  			new_entry = ERR_PTR(error);
> >  			goto unlock;
> >  		}
> >  		mapping->nrexceptional++;
> > -	} else {
> > +	} else if ((unsigned long)entry & (RADIX_DAX_HZP|RADIX_DAX_EMPTY)) {
> >  		void **slot;
> >  		void *ret;
> 
> Uh, why this condition need to change? Is it some protection so that we
> don't replace a mapped PMD entry with PTE one?

Yea, the logic was that if we have a radix tree that has PMDs in it, and some
new process comes along doing PTE faults, we can just leave the DAX mapped PMD
entries in the tree.  Locking, dirtying and flushing will happen on PMD basis
for all processes.

If you think this is dangerous or not worth the effort we can make it like the
hole case where any PTE sized faults will result in an unmap and a downgrade
to PTE sized entries in the radix tree.

> > @@ -1261,4 +1338,186 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
> >  	return VM_FAULT_NOPAGE | major;
> >  }
> >  EXPORT_SYMBOL_GPL(dax_iomap_fault);
> > +
> > +#ifdef CONFIG_FS_DAX_PMD
> > +/*
> > + * The 'colour' (ie low bits) within a PMD of a page offset.  This comes up
> > + * more often than one might expect in the below functions.
> > + */
> > +#define PG_PMD_COLOUR	((PMD_SIZE >> PAGE_SHIFT) - 1)
> 
> Just out of curiosity: Why the british spelling of 'colour'?

This was copied from the old dax_pmd_fault() from Willy. :)

> > +
> > +static int dax_pmd_insert_mapping(struct vm_area_struct *vma, pmd_t *pmd,
> > +		struct vm_fault *vmf, unsigned long address,
> > +		struct iomap *iomap, loff_t pos, bool write, void **entryp)
> > +{
> > +	struct address_space *mapping = vma->vm_file->f_mapping;
> > +	struct block_device *bdev = iomap->bdev;
> > +	struct blk_dax_ctl dax = {
> > +		.sector = dax_iomap_sector(iomap, pos),
> > +		.size = PMD_SIZE,
> > +	};
> > +	long length = dax_map_atomic(bdev, &dax);
> > +	void *ret;
> > +
> > +	if (length < 0) /* dax_map_atomic() failed */
> > +		return VM_FAULT_FALLBACK;
> > +	if (length < PMD_SIZE)
> > +		goto unmap_fallback;
> > +	if (pfn_t_to_pfn(dax.pfn) & PG_PMD_COLOUR)
> > +		goto unmap_fallback;
> > +	if (!pfn_t_devmap(dax.pfn))
> > +		goto unmap_fallback;
> > +
> > +	dax_unmap_atomic(bdev, &dax);
> > +
> > +	ret = dax_insert_mapping_entry(mapping, vmf, *entryp, dax.sector,
> > +			RADIX_DAX_PMD);
> > +	if (IS_ERR(ret))
> > +		return VM_FAULT_FALLBACK;
> > +	*entryp = ret;
> > +
> > +	return vmf_insert_pfn_pmd(vma, address, pmd, dax.pfn, write);
> > +
> > + unmap_fallback:
> > +	dax_unmap_atomic(bdev, &dax);
> > +	return VM_FAULT_FALLBACK;
> > +}
> > +
> > +static int dax_pmd_load_hole(struct vm_area_struct *vma, pmd_t *pmd,
> > +		struct vm_fault *vmf, unsigned long address,
> > +		struct iomap *iomap, void **entryp)
> > +{
> > +	struct address_space *mapping = vma->vm_file->f_mapping;
> > +	unsigned long pmd_addr = address & PMD_MASK;
> > +	struct page *zero_page;
> > +	spinlock_t *ptl;
> > +	pmd_t pmd_entry;
> > +	void *ret;
> > +
> > +	zero_page = get_huge_zero_page();
> > +
> > +	if (unlikely(!zero_page))
> > +		return VM_FAULT_FALLBACK;
> > +
> > +	ret = dax_insert_mapping_entry(mapping, vmf, *entryp, 0,
> > +			RADIX_DAX_PMD | RADIX_DAX_HZP);
> > +	if (IS_ERR(ret))
> > +		return VM_FAULT_FALLBACK;
> > +	*entryp = ret;
> > +
> > +	ptl = pmd_lock(vma->vm_mm, pmd);
> > +	if (!pmd_none(*pmd)) {
> > +		spin_unlock(ptl);
> > +		return VM_FAULT_FALLBACK;
> > +	}
> > +
> > +	pmd_entry = mk_pmd(zero_page, vma->vm_page_prot);
> > +	pmd_entry = pmd_mkhuge(pmd_entry);
> > +	set_pmd_at(vma->vm_mm, pmd_addr, pmd, pmd_entry);
> > +	spin_unlock(ptl);
> > +	return VM_FAULT_NOPAGE;
> > +}
> > +
> > +int dax_iomap_pmd_fault(struct vm_area_struct *vma, unsigned long address,
> > +		pmd_t *pmd, unsigned int flags, struct iomap_ops *ops)
> > +{
> > +	struct address_space *mapping = vma->vm_file->f_mapping;
> > +	unsigned long pmd_addr = address & PMD_MASK;
> > +	bool write = flags & FAULT_FLAG_WRITE;
> > +	unsigned int iomap_flags = write ? IOMAP_WRITE : 0;
> > +	struct inode *inode = mapping->host;
> > +	int result = VM_FAULT_FALLBACK;
> > +	struct iomap iomap = { 0 };
> 
> Why the 0 here? Just empty braces are enough to initialize the structure to
> zeros.

Ah, I just copied this from Christoph's dax_iomap_fault() code.

> > +	pgoff_t size, pgoff;
> > +	struct vm_fault vmf;
> > +	void *entry;
> > +	loff_t pos;
> > +	int error;
> > +
> > +	/* Fall back to PTEs if we're going to COW */
> > +	if (write && !(vma->vm_flags & VM_SHARED)) {
> > +		split_huge_pmd(vma, pmd, address);
> > +		goto fallback;
> > +	}
> > +
> > +	/* If the PMD would extend outside the VMA */
> > +	if (pmd_addr < vma->vm_start)
> > +		goto fallback;
> > +	if ((pmd_addr + PMD_SIZE) > vma->vm_end)
> > +		goto fallback;
> > +
> > +	/*
> > +	 * Check whether offset isn't beyond end of file now. Caller is
> > +	 * supposed to hold locks serializing us with truncate / punch hole so
> > +	 * this is a reliable test.
> > +	 */
> > +	pgoff = linear_page_index(vma, pmd_addr);
> > +	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
> 
> Nitpick - 'size' does not express that this is in pages and rounded up.
> Maybe we could have:
> 
> 	max_pgoff = (i_size_read(inode) - 1) >> PAGE_SHIFT;
> 
> and then use strict inequalities below?

Yep, that's cleaner.  Fixed.

Jan, thank you for the review!

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 182+ messages in thread

* Re: [PATCH v5 15/17] dax: add struct iomap based DAX PMD support
@ 2016-10-11 22:51       ` Ross Zwisler
  0 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-11 22:51 UTC (permalink / raw)
  To: Jan Kara
  Cc: Theodore Ts'o, Matthew Wilcox, Dave Chinner,
	linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Christoph Hellwig,
	linux-xfs-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Andreas Dilger, Alexander Viro,
	Jan Kara, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-ext4-u79uwXL29TY76Z2rM5mHXA, Andrew Morton

On Tue, Oct 11, 2016 at 10:31:52AM +0200, Jan Kara wrote:
> On Fri 07-10-16 15:09:02, Ross Zwisler wrote:
> > diff --git a/fs/dax.c b/fs/dax.c
> > index ac3cd05..e51d51f 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -281,7 +281,7 @@ static wait_queue_head_t *dax_entry_waitqueue(struct address_space *mapping,
> >  	 * queue to the start of that PMD.  This ensures that all offsets in
> >  	 * the range covered by the PMD map to the same bit lock.
> >  	 */
> > -	if (RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD)
> > +	if ((unsigned long)entry & RADIX_DAX_PMD)
> >  		index &= ~((1UL << (PMD_SHIFT - PAGE_SHIFT)) - 1);
> 
> I agree with Christoph - helper for masking type bits would make this
> nicer.

Fixed via a dax_flag_test() helper as I outlined in the mail to Christoph.  It
seems clean to me, but if you or Christoph feel strongly that it would be
cleaner as a local 'flags' variable, I'll make the change.

> ...
> > -static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index)
> > +static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index,
> > +		unsigned long size_flag)
> >  {
> > +	bool pmd_downgrade = false; /* splitting 2MiB entry into 4k entries? */
> >  	void *entry, **slot;
> >  
> >  restart:
> >  	spin_lock_irq(&mapping->tree_lock);
> >  	entry = get_unlocked_mapping_entry(mapping, index, &slot);
> > +
> > +	if (entry) {
> > +		if (size_flag & RADIX_DAX_PMD) {
> > +			if (!radix_tree_exceptional_entry(entry) ||
> > +			    !((unsigned long)entry & RADIX_DAX_PMD)) {
> > +				entry = ERR_PTR(-EEXIST);
> > +				goto out_unlock;
> 
> You need to call put_unlocked_mapping_entry() if you use
> get_unlocked_mapping_entry() and then decide not to lock it. The reason is
> that the waitqueues we use are exclusive (we wake up only a single waiter
> waiting for the lock) and so there can be some waiters for the entry lock
> although we have not locked the entry ourselves.

Oh, makes sense, thanks for the catch.

> > +			}
> > +		} else { /* trying to grab a PTE entry */
> > +			if (radix_tree_exceptional_entry(entry) &&
> > +			    ((unsigned long)entry & RADIX_DAX_PMD) &&
> > +			    ((unsigned long)entry &
> > +			     (RADIX_DAX_HZP|RADIX_DAX_EMPTY))) {
> > +				pmd_downgrade = true;
> > +			}
> > +		}
> > +	}
> > +
> >  	/* No entry for given index? Make sure radix tree is big enough. */
> > -	if (!entry) {
> > +	if (!entry || pmd_downgrade) {
> >  		int err;
> >  
> > +		if (pmd_downgrade) {
> > +			/*
> > +			 * Make sure 'entry' remains valid while we drop
> > +			 * mapping->tree_lock.
> > +			 */
> > +			entry = lock_slot(mapping, slot);
> > +		}
> > +
> >  		spin_unlock_irq(&mapping->tree_lock);
> >  		err = radix_tree_preload(
> >  				mapping_gfp_mask(mapping) & ~__GFP_HIGHMEM);
> >  		if (err)
> >  			return ERR_PTR(err);
> 
> You need to unlock the locked entry before you return here...

Yep, thanks, fixed.

> > -		entry = (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY |
> > -			       RADIX_DAX_ENTRY_LOCK);
> > +
> > +		/*
> > +		 * Besides huge zero pages the only other thing that gets
> > +		 * downgraded are empty entries which don't need to be
> > +		 * unmapped.
> > +		 */
> > +		if (pmd_downgrade && ((unsigned long)entry & RADIX_DAX_HZP))
> > +			unmap_mapping_range(mapping,
> > +				(index << PAGE_SHIFT) & PMD_MASK, PMD_SIZE, 0);
> > +
> >  		spin_lock_irq(&mapping->tree_lock);
> > -		err = radix_tree_insert(&mapping->page_tree, index, entry);
> > +
> > +		if (pmd_downgrade) {
> > +			radix_tree_delete(&mapping->page_tree, index);
> > +			mapping->nrexceptional--;
> > +			dax_wake_mapping_entry_waiter(mapping, index, entry,
> > +					false);
> 
> You need to set 'wake_all' argument here to true. Otherwise there could be
> waiters waiting for non-existent entry forever...

Interesting.   Fixed, but let me make sure I understand.  So is the issue that
you could have say 2 tasks waiting on a PMD index that has been rounded down
to the PMD index via dax_entry_waitqueue()?

The person holding the lock on the entry would remove the PMD, insert a PTE
and wake just one of the PMD aligned waiters.  That waiter would wake up, do
something PTE based (since the PMD space is now polluted with PTEs), and then
wake any waiters on it's PTE index.  Meanwhile, the second waiter could sleep
forever on the PMD aligned index.  Is this correct?

So, perhaps more succinctly:

Thread 1		Thread 2		Thread 3
--------		--------		--------
index 0x202, hold PMD lock 0x200
			index 0x203, sleep on 0x200
						index 0x204, sleep on 0x200
downgrade, removing 0x200
wake one waiter on 0x200
insert PTE @ 0x202
			wake up, grab index 0x203
			...
			wake one waiter on index 0x203

						... sleeps forever
Right?

> > +		}
> > +
> > +		entry = dax_radix_entry(0, size_flag | RADIX_DAX_EMPTY);
> > +
> > +		err = __radix_tree_insert(&mapping->page_tree, index,
> > +				dax_radix_order(entry), entry);
> >  		radix_tree_preload_end();
> >  		if (err) {
> >  			spin_unlock_irq(&mapping->tree_lock);
> > -			/* Someone already created the entry? */
> > -			if (err == -EEXIST)
> > +			/*
> > +			 * Someone already created the entry?  This is a
> > +			 * normal failure when inserting PMDs in a range
> > +			 * that already contains PTEs.  In that case we want
> > +			 * to return -EEXIST immediately.
> > +			 */
> > +			if (err == -EEXIST && !(size_flag & RADIX_DAX_PMD))
> >  				goto restart;
> 
> Add a comment here that we can get here only if there was no radix tree
> entry at 'index' and thus there can be no waiters to wake.

Done.

> >  			return ERR_PTR(err);
> >  		}
> > @@ -441,6 +509,7 @@ restart:
> >  		return page;
> >  	}
> >  	entry = lock_slot(mapping, slot);
> > + out_unlock:
> >  	spin_unlock_irq(&mapping->tree_lock);
> >  	return entry;
> >  }
> > @@ -581,11 +650,17 @@ static int copy_user_dax(struct block_device *bdev, sector_t sector, size_t size
> >  	return 0;
> >  }
> >  
> > -#define DAX_PMD_INDEX(page_index) (page_index & (PMD_MASK >> PAGE_SHIFT))
> > -
> > +/*
> > + * By this point grab_mapping_entry() has ensured that we have a locked entry
> > + * of the appropriate size so we don't have to worry about downgrading PMDs to
> > + * PTEs.  If we happen to be trying to insert a PTE and there is a PMD
> > + * already in the tree, we will skip the insertion and just dirty the PMD as
> > + * appropriate.
> > + */
> >  static void *dax_insert_mapping_entry(struct address_space *mapping,
> >  				      struct vm_fault *vmf,
> > -				      void *entry, sector_t sector)
> > +				      void *entry, sector_t sector,
> > +				      unsigned long flags)
> >  {
> >  	struct radix_tree_root *page_tree = &mapping->page_tree;
> >  	int error = 0;
> > @@ -608,22 +683,28 @@ static void *dax_insert_mapping_entry(struct address_space *mapping,
> >  		error = radix_tree_preload(vmf->gfp_mask & ~__GFP_HIGHMEM);
> >  		if (error)
> >  			return ERR_PTR(error);
> > +	} else if (((unsigned long)entry & RADIX_DAX_HZP) &&
> > +			!(flags & RADIX_DAX_HZP)) {
> > +		/* replacing huge zero page with PMD block mapping */
> > +		unmap_mapping_range(mapping,
> > +			(vmf->pgoff << PAGE_SHIFT) & PMD_MASK, PMD_SIZE, 0);
> >  	}
> >  
> >  	spin_lock_irq(&mapping->tree_lock);
> > -	new_entry = (void *)((unsigned long)RADIX_DAX_ENTRY(sector, false) |
> > -		       RADIX_DAX_ENTRY_LOCK);
> > +	new_entry = dax_radix_entry(sector, flags);
> > +
> 
> You've lost the RADIX_DAX_ENTRY_LOCK flag here?

Oh, nope, that's embedded in the dax_radix_entry() helper:

/* entries begin locked */
static inline void *dax_radix_entry(sector_t sector, unsigned long flags)
{
	return (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY | flags |
			((unsigned long)sector << RADIX_DAX_SHIFT) |
			RADIX_DAX_ENTRY_LOCK);
}

I'll s/dax_radix_entry/dax_radix_locked_entry/ or something to make this
clearer to the reader.

> >  	if (hole_fill) {
> >  		__delete_from_page_cache(entry, NULL);
> >  		/* Drop pagecache reference */
> >  		put_page(entry);
> > -		error = radix_tree_insert(page_tree, index, new_entry);
> > +		error = __radix_tree_insert(page_tree, index,
> > +				dax_radix_order(new_entry), new_entry);
> >  		if (error) {
> >  			new_entry = ERR_PTR(error);
> >  			goto unlock;
> >  		}
> >  		mapping->nrexceptional++;
> > -	} else {
> > +	} else if ((unsigned long)entry & (RADIX_DAX_HZP|RADIX_DAX_EMPTY)) {
> >  		void **slot;
> >  		void *ret;
> 
> Uh, why this condition need to change? Is it some protection so that we
> don't replace a mapped PMD entry with PTE one?

Yea, the logic was that if we have a radix tree that has PMDs in it, and some
new process comes along doing PTE faults, we can just leave the DAX mapped PMD
entries in the tree.  Locking, dirtying and flushing will happen on PMD basis
for all processes.

If you think this is dangerous or not worth the effort we can make it like the
hole case where any PTE sized faults will result in an unmap and a downgrade
to PTE sized entries in the radix tree.

> > @@ -1261,4 +1338,186 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
> >  	return VM_FAULT_NOPAGE | major;
> >  }
> >  EXPORT_SYMBOL_GPL(dax_iomap_fault);
> > +
> > +#ifdef CONFIG_FS_DAX_PMD
> > +/*
> > + * The 'colour' (ie low bits) within a PMD of a page offset.  This comes up
> > + * more often than one might expect in the below functions.
> > + */
> > +#define PG_PMD_COLOUR	((PMD_SIZE >> PAGE_SHIFT) - 1)
> 
> Just out of curiosity: Why the british spelling of 'colour'?

This was copied from the old dax_pmd_fault() from Willy. :)

> > +
> > +static int dax_pmd_insert_mapping(struct vm_area_struct *vma, pmd_t *pmd,
> > +		struct vm_fault *vmf, unsigned long address,
> > +		struct iomap *iomap, loff_t pos, bool write, void **entryp)
> > +{
> > +	struct address_space *mapping = vma->vm_file->f_mapping;
> > +	struct block_device *bdev = iomap->bdev;
> > +	struct blk_dax_ctl dax = {
> > +		.sector = dax_iomap_sector(iomap, pos),
> > +		.size = PMD_SIZE,
> > +	};
> > +	long length = dax_map_atomic(bdev, &dax);
> > +	void *ret;
> > +
> > +	if (length < 0) /* dax_map_atomic() failed */
> > +		return VM_FAULT_FALLBACK;
> > +	if (length < PMD_SIZE)
> > +		goto unmap_fallback;
> > +	if (pfn_t_to_pfn(dax.pfn) & PG_PMD_COLOUR)
> > +		goto unmap_fallback;
> > +	if (!pfn_t_devmap(dax.pfn))
> > +		goto unmap_fallback;
> > +
> > +	dax_unmap_atomic(bdev, &dax);
> > +
> > +	ret = dax_insert_mapping_entry(mapping, vmf, *entryp, dax.sector,
> > +			RADIX_DAX_PMD);
> > +	if (IS_ERR(ret))
> > +		return VM_FAULT_FALLBACK;
> > +	*entryp = ret;
> > +
> > +	return vmf_insert_pfn_pmd(vma, address, pmd, dax.pfn, write);
> > +
> > + unmap_fallback:
> > +	dax_unmap_atomic(bdev, &dax);
> > +	return VM_FAULT_FALLBACK;
> > +}
> > +
> > +static int dax_pmd_load_hole(struct vm_area_struct *vma, pmd_t *pmd,
> > +		struct vm_fault *vmf, unsigned long address,
> > +		struct iomap *iomap, void **entryp)
> > +{
> > +	struct address_space *mapping = vma->vm_file->f_mapping;
> > +	unsigned long pmd_addr = address & PMD_MASK;
> > +	struct page *zero_page;
> > +	spinlock_t *ptl;
> > +	pmd_t pmd_entry;
> > +	void *ret;
> > +
> > +	zero_page = get_huge_zero_page();
> > +
> > +	if (unlikely(!zero_page))
> > +		return VM_FAULT_FALLBACK;
> > +
> > +	ret = dax_insert_mapping_entry(mapping, vmf, *entryp, 0,
> > +			RADIX_DAX_PMD | RADIX_DAX_HZP);
> > +	if (IS_ERR(ret))
> > +		return VM_FAULT_FALLBACK;
> > +	*entryp = ret;
> > +
> > +	ptl = pmd_lock(vma->vm_mm, pmd);
> > +	if (!pmd_none(*pmd)) {
> > +		spin_unlock(ptl);
> > +		return VM_FAULT_FALLBACK;
> > +	}
> > +
> > +	pmd_entry = mk_pmd(zero_page, vma->vm_page_prot);
> > +	pmd_entry = pmd_mkhuge(pmd_entry);
> > +	set_pmd_at(vma->vm_mm, pmd_addr, pmd, pmd_entry);
> > +	spin_unlock(ptl);
> > +	return VM_FAULT_NOPAGE;
> > +}
> > +
> > +int dax_iomap_pmd_fault(struct vm_area_struct *vma, unsigned long address,
> > +		pmd_t *pmd, unsigned int flags, struct iomap_ops *ops)
> > +{
> > +	struct address_space *mapping = vma->vm_file->f_mapping;
> > +	unsigned long pmd_addr = address & PMD_MASK;
> > +	bool write = flags & FAULT_FLAG_WRITE;
> > +	unsigned int iomap_flags = write ? IOMAP_WRITE : 0;
> > +	struct inode *inode = mapping->host;
> > +	int result = VM_FAULT_FALLBACK;
> > +	struct iomap iomap = { 0 };
> 
> Why the 0 here? Just empty braces are enough to initialize the structure to
> zeros.

Ah, I just copied this from Christoph's dax_iomap_fault() code.

> > +	pgoff_t size, pgoff;
> > +	struct vm_fault vmf;
> > +	void *entry;
> > +	loff_t pos;
> > +	int error;
> > +
> > +	/* Fall back to PTEs if we're going to COW */
> > +	if (write && !(vma->vm_flags & VM_SHARED)) {
> > +		split_huge_pmd(vma, pmd, address);
> > +		goto fallback;
> > +	}
> > +
> > +	/* If the PMD would extend outside the VMA */
> > +	if (pmd_addr < vma->vm_start)
> > +		goto fallback;
> > +	if ((pmd_addr + PMD_SIZE) > vma->vm_end)
> > +		goto fallback;
> > +
> > +	/*
> > +	 * Check whether offset isn't beyond end of file now. Caller is
> > +	 * supposed to hold locks serializing us with truncate / punch hole so
> > +	 * this is a reliable test.
> > +	 */
> > +	pgoff = linear_page_index(vma, pmd_addr);
> > +	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
> 
> Nitpick - 'size' does not express that this is in pages and rounded up.
> Maybe we could have:
> 
> 	max_pgoff = (i_size_read(inode) - 1) >> PAGE_SHIFT;
> 
> and then use strict inequalities below?

Yep, that's cleaner.  Fixed.

Jan, thank you for the review!

^ permalink raw reply	[flat|nested] 182+ messages in thread

* Re: [PATCH v5 15/17] dax: add struct iomap based DAX PMD support
@ 2016-10-11 22:51       ` Ross Zwisler
  0 siblings, 0 replies; 182+ messages in thread
From: Ross Zwisler @ 2016-10-11 22:51 UTC (permalink / raw)
  To: Jan Kara
  Cc: Ross Zwisler, linux-kernel, Theodore Ts'o, Alexander Viro,
	Andreas Dilger, Andrew Morton, Christoph Hellwig, Dan Williams,
	Dave Chinner, Jan Kara, Matthew Wilcox, linux-ext4,
	linux-fsdevel, linux-mm, linux-nvdimm, linux-xfs

On Tue, Oct 11, 2016 at 10:31:52AM +0200, Jan Kara wrote:
> On Fri 07-10-16 15:09:02, Ross Zwisler wrote:
> > diff --git a/fs/dax.c b/fs/dax.c
> > index ac3cd05..e51d51f 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -281,7 +281,7 @@ static wait_queue_head_t *dax_entry_waitqueue(struct address_space *mapping,
> >  	 * queue to the start of that PMD.  This ensures that all offsets in
> >  	 * the range covered by the PMD map to the same bit lock.
> >  	 */
> > -	if (RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD)
> > +	if ((unsigned long)entry & RADIX_DAX_PMD)
> >  		index &= ~((1UL << (PMD_SHIFT - PAGE_SHIFT)) - 1);
> 
> I agree with Christoph - helper for masking type bits would make this
> nicer.

Fixed via a dax_flag_test() helper as I outlined in the mail to Christoph.  It
seems clean to me, but if you or Christoph feel strongly that it would be
cleaner as a local 'flags' variable, I'll make the change.

> ...
> > -static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index)
> > +static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index,
> > +		unsigned long size_flag)
> >  {
> > +	bool pmd_downgrade = false; /* splitting 2MiB entry into 4k entries? */
> >  	void *entry, **slot;
> >  
> >  restart:
> >  	spin_lock_irq(&mapping->tree_lock);
> >  	entry = get_unlocked_mapping_entry(mapping, index, &slot);
> > +
> > +	if (entry) {
> > +		if (size_flag & RADIX_DAX_PMD) {
> > +			if (!radix_tree_exceptional_entry(entry) ||
> > +			    !((unsigned long)entry & RADIX_DAX_PMD)) {
> > +				entry = ERR_PTR(-EEXIST);
> > +				goto out_unlock;
> 
> You need to call put_unlocked_mapping_entry() if you use
> get_unlocked_mapping_entry() and then decide not to lock it. The reason is
> that the waitqueues we use are exclusive (we wake up only a single waiter
> waiting for the lock) and so there can be some waiters for the entry lock
> although we have not locked the entry ourselves.

Oh, makes sense, thanks for the catch.

> > +			}
> > +		} else { /* trying to grab a PTE entry */
> > +			if (radix_tree_exceptional_entry(entry) &&
> > +			    ((unsigned long)entry & RADIX_DAX_PMD) &&
> > +			    ((unsigned long)entry &
> > +			     (RADIX_DAX_HZP|RADIX_DAX_EMPTY))) {
> > +				pmd_downgrade = true;
> > +			}
> > +		}
> > +	}
> > +
> >  	/* No entry for given index? Make sure radix tree is big enough. */
> > -	if (!entry) {
> > +	if (!entry || pmd_downgrade) {
> >  		int err;
> >  
> > +		if (pmd_downgrade) {
> > +			/*
> > +			 * Make sure 'entry' remains valid while we drop
> > +			 * mapping->tree_lock.
> > +			 */
> > +			entry = lock_slot(mapping, slot);
> > +		}
> > +
> >  		spin_unlock_irq(&mapping->tree_lock);
> >  		err = radix_tree_preload(
> >  				mapping_gfp_mask(mapping) & ~__GFP_HIGHMEM);
> >  		if (err)
> >  			return ERR_PTR(err);
> 
> You need to unlock the locked entry before you return here...

Yep, thanks, fixed.

> > -		entry = (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY |
> > -			       RADIX_DAX_ENTRY_LOCK);
> > +
> > +		/*
> > +		 * Besides huge zero pages the only other thing that gets
> > +		 * downgraded are empty entries which don't need to be
> > +		 * unmapped.
> > +		 */
> > +		if (pmd_downgrade && ((unsigned long)entry & RADIX_DAX_HZP))
> > +			unmap_mapping_range(mapping,
> > +				(index << PAGE_SHIFT) & PMD_MASK, PMD_SIZE, 0);
> > +
> >  		spin_lock_irq(&mapping->tree_lock);
> > -		err = radix_tree_insert(&mapping->page_tree, index, entry);
> > +
> > +		if (pmd_downgrade) {
> > +			radix_tree_delete(&mapping->page_tree, index);
> > +			mapping->nrexceptional--;
> > +			dax_wake_mapping_entry_waiter(mapping, index, entry,
> > +					false);
> 
> You need to set 'wake_all' argument here to true. Otherwise there could be
> waiters waiting for non-existent entry forever...

Interesting.   Fixed, but let me make sure I understand.  So is the issue that
you could have say 2 tasks waiting on a PMD index that has been rounded down
to the PMD index via dax_entry_waitqueue()?

The person holding the lock on the entry would remove the PMD, insert a PTE
and wake just one of the PMD aligned waiters.  That waiter would wake up, do
something PTE based (since the PMD space is now polluted with PTEs), and then
wake any waiters on it's PTE index.  Meanwhile, the second waiter could sleep
forever on the PMD aligned index.  Is this correct?

So, perhaps more succinctly:

Thread 1		Thread 2		Thread 3
--------		--------		--------
index 0x202, hold PMD lock 0x200
			index 0x203, sleep on 0x200
						index 0x204, sleep on 0x200
downgrade, removing 0x200
wake one waiter on 0x200
insert PTE @ 0x202
			wake up, grab index 0x203
			...
			wake one waiter on index 0x203

						... sleeps forever
Right?

> > +		}
> > +
> > +		entry = dax_radix_entry(0, size_flag | RADIX_DAX_EMPTY);
> > +
> > +		err = __radix_tree_insert(&mapping->page_tree, index,
> > +				dax_radix_order(entry), entry);
> >  		radix_tree_preload_end();
> >  		if (err) {
> >  			spin_unlock_irq(&mapping->tree_lock);
> > -			/* Someone already created the entry? */
> > -			if (err == -EEXIST)
> > +			/*
> > +			 * Someone already created the entry?  This is a
> > +			 * normal failure when inserting PMDs in a range
> > +			 * that already contains PTEs.  In that case we want
> > +			 * to return -EEXIST immediately.
> > +			 */
> > +			if (err == -EEXIST && !(size_flag & RADIX_DAX_PMD))
> >  				goto restart;
> 
> Add a comment here that we can get here only if there was no radix tree
> entry at 'index' and thus there can be no waiters to wake.

Done.

> >  			return ERR_PTR(err);
> >  		}
> > @@ -441,6 +509,7 @@ restart:
> >  		return page;
> >  	}
> >  	entry = lock_slot(mapping, slot);
> > + out_unlock:
> >  	spin_unlock_irq(&mapping->tree_lock);
> >  	return entry;
> >  }
> > @@ -581,11 +650,17 @@ static int copy_user_dax(struct block_device *bdev, sector_t sector, size_t size
> >  	return 0;
> >  }
> >  
> > -#define DAX_PMD_INDEX(page_index) (page_index & (PMD_MASK >> PAGE_SHIFT))
> > -
> > +/*
> > + * By this point grab_mapping_entry() has ensured that we have a locked entry
> > + * of the appropriate size so we don't have to worry about downgrading PMDs to
> > + * PTEs.  If we happen to be trying to insert a PTE and there is a PMD
> > + * already in the tree, we will skip the insertion and just dirty the PMD as
> > + * appropriate.
> > + */
> >  static void *dax_insert_mapping_entry(struct address_space *mapping,
> >  				      struct vm_fault *vmf,
> > -				      void *entry, sector_t sector)
> > +				      void *entry, sector_t sector,
> > +				      unsigned long flags)
> >  {
> >  	struct radix_tree_root *page_tree = &mapping->page_tree;
> >  	int error = 0;
> > @@ -608,22 +683,28 @@ static void *dax_insert_mapping_entry(struct address_space *mapping,
> >  		error = radix_tree_preload(vmf->gfp_mask & ~__GFP_HIGHMEM);
> >  		if (error)
> >  			return ERR_PTR(error);
> > +	} else if (((unsigned long)entry & RADIX_DAX_HZP) &&
> > +			!(flags & RADIX_DAX_HZP)) {
> > +		/* replacing huge zero page with PMD block mapping */
> > +		unmap_mapping_range(mapping,
> > +			(vmf->pgoff << PAGE_SHIFT) & PMD_MASK, PMD_SIZE, 0);
> >  	}
> >  
> >  	spin_lock_irq(&mapping->tree_lock);
> > -	new_entry = (void *)((unsigned long)RADIX_DAX_ENTRY(sector, false) |
> > -		       RADIX_DAX_ENTRY_LOCK);
> > +	new_entry = dax_radix_entry(sector, flags);
> > +
> 
> You've lost the RADIX_DAX_ENTRY_LOCK flag here?

Oh, nope, that's embedded in the dax_radix_entry() helper:

/* entries begin locked */
static inline void *dax_radix_entry(sector_t sector, unsigned long flags)
{
	return (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY | flags |
			((unsigned long)sector << RADIX_DAX_SHIFT) |
			RADIX_DAX_ENTRY_LOCK);
}

I'll s/dax_radix_entry/dax_radix_locked_entry/ or something to make this
clearer to the reader.

> >  	if (hole_fill) {
> >  		__delete_from_page_cache(entry, NULL);
> >  		/* Drop pagecache reference */
> >  		put_page(entry);
> > -		error = radix_tree_insert(page_tree, index, new_entry);
> > +		error = __radix_tree_insert(page_tree, index,
> > +				dax_radix_order(new_entry), new_entry);
> >  		if (error) {
> >  			new_entry = ERR_PTR(error);
> >  			goto unlock;
> >  		}
> >  		mapping->nrexceptional++;
> > -	} else {
> > +	} else if ((unsigned long)entry & (RADIX_DAX_HZP|RADIX_DAX_EMPTY)) {
> >  		void **slot;
> >  		void *ret;
> 
> Uh, why this condition need to change? Is it some protection so that we
> don't replace a mapped PMD entry with PTE one?

Yea, the logic was that if we have a radix tree that has PMDs in it, and some
new process comes along doing PTE faults, we can just leave the DAX mapped PMD
entries in the tree.  Locking, dirtying and flushing will happen on PMD basis
for all processes.

If you think this is dangerous or not worth the effort we can make it like the
hole case where any PTE sized faults will result in an unmap and a downgrade
to PTE sized entries in the radix tree.

> > @@ -1261,4 +1338,186 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
> >  	return VM_FAULT_NOPAGE | major;
> >  }
> >  EXPORT_SYMBOL_GPL(dax_iomap_fault);
> > +
> > +#ifdef CONFIG_FS_DAX_PMD
> > +/*
> > + * The 'colour' (ie low bits) within a PMD of a page offset.  This comes up
> > + * more often than one might expect in the below functions.
> > + */
> > +#define PG_PMD_COLOUR	((PMD_SIZE >> PAGE_SHIFT) - 1)
> 
> Just out of curiosity: Why the british spelling of 'colour'?

This was copied from the old dax_pmd_fault() from Willy. :)

> > +
> > +static int dax_pmd_insert_mapping(struct vm_area_struct *vma, pmd_t *pmd,
> > +		struct vm_fault *vmf, unsigned long address,
> > +		struct iomap *iomap, loff_t pos, bool write, void **entryp)
> > +{
> > +	struct address_space *mapping = vma->vm_file->f_mapping;
> > +	struct block_device *bdev = iomap->bdev;
> > +	struct blk_dax_ctl dax = {
> > +		.sector = dax_iomap_sector(iomap, pos),
> > +		.size = PMD_SIZE,
> > +	};
> > +	long length = dax_map_atomic(bdev, &dax);
> > +	void *ret;
> > +
> > +	if (length < 0) /* dax_map_atomic() failed */
> > +		return VM_FAULT_FALLBACK;
> > +	if (length < PMD_SIZE)
> > +		goto unmap_fallback;
> > +	if (pfn_t_to_pfn(dax.pfn) & PG_PMD_COLOUR)
> > +		goto unmap_fallback;
> > +	if (!pfn_t_devmap(dax.pfn))
> > +		goto unmap_fallback;
> > +
> > +	dax_unmap_atomic(bdev, &dax);
> > +
> > +	ret = dax_insert_mapping_entry(mapping, vmf, *entryp, dax.sector,
> > +			RADIX_DAX_PMD);
> > +	if (IS_ERR(ret))
> > +		return VM_FAULT_FALLBACK;
> > +	*entryp = ret;
> > +
> > +	return vmf_insert_pfn_pmd(vma, address, pmd, dax.pfn, write);
> > +
> > + unmap_fallback:
> > +	dax_unmap_atomic(bdev, &dax);
> > +	return VM_FAULT_FALLBACK;
> > +}
> > +
> > +static int dax_pmd_load_hole(struct vm_area_struct *vma, pmd_t *pmd,
> > +		struct vm_fault *vmf, unsigned long address,
> > +		struct iomap *iomap, void **entryp)
> > +{
> > +	struct address_space *mapping = vma->vm_file->f_mapping;
> > +	unsigned long pmd_addr = address & PMD_MASK;
> > +	struct page *zero_page;
> > +	spinlock_t *ptl;
> > +	pmd_t pmd_entry;
> > +	void *ret;
> > +
> > +	zero_page = get_huge_zero_page();
> > +
> > +	if (unlikely(!zero_page))
> > +		return VM_FAULT_FALLBACK;
> > +
> > +	ret = dax_insert_mapping_entry(mapping, vmf, *entryp, 0,
> > +			RADIX_DAX_PMD | RADIX_DAX_HZP);
> > +	if (IS_ERR(ret))
> > +		return VM_FAULT_FALLBACK;
> > +	*entryp = ret;
> > +
> > +	ptl = pmd_lock(vma->vm_mm, pmd);
> > +	if (!pmd_none(*pmd)) {
> > +		spin_unlock(ptl);
> > +		return VM_FAULT_FALLBACK;
> > +	}
> > +
> > +	pmd_entry = mk_pmd(zero_page, vma->vm_page_prot);
> > +	pmd_entry = pmd_mkhuge(pmd_entry);
> > +	set_pmd_at(vma->vm_mm, pmd_addr, pmd, pmd_entry);
> > +	spin_unlock(ptl);
> > +	return VM_FAULT_NOPAGE;
> > +}
> > +
> > +int dax_iomap_pmd_fault(struct vm_area_struct *vma, unsigned long address,
> > +		pmd_t *pmd, unsigned int flags, struct iomap_ops *ops)
> > +{
> > +	struct address_space *mapping = vma->vm_file->f_mapping;
> > +	unsigned long pmd_addr = address & PMD_MASK;
> > +	bool write = flags & FAULT_FLAG_WRITE;
> > +	unsigned int iomap_flags = write ? IOMAP_WRITE : 0;
> > +	struct inode *inode = mapping->host;
> > +	int result = VM_FAULT_FALLBACK;
> > +	struct iomap iomap = { 0 };
> 
> Why the 0 here? Just empty braces are enough to initialize the structure to
> zeros.

Ah, I just copied this from Christoph's dax_iomap_fault() code.

> > +	pgoff_t size, pgoff;
> > +	struct vm_fault vmf;
> > +	void *entry;
> > +	loff_t pos;
> > +	int error;
> > +
> > +	/* Fall back to PTEs if we're going to COW */
> > +	if (write && !(vma->vm_flags & VM_SHARED)) {
> > +		split_huge_pmd(vma, pmd, address);
> > +		goto fallback;
> > +	}
> > +
> > +	/* If the PMD would extend outside the VMA */
> > +	if (pmd_addr < vma->vm_start)
> > +		goto fallback;
> > +	if ((pmd_addr + PMD_SIZE) > vma->vm_end)
> > +		goto fallback;
> > +
> > +	/*
> > +	 * Check whether offset isn't beyond end of file now. Caller is
> > +	 * supposed to hold locks serializing us with truncate / punch hole so
> > +	 * this is a reliable test.
> > +	 */
> > +	pgoff = linear_page_index(vma, pmd_addr);
> > +	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
> 
> Nitpick - 'size' does not express that this is in pages and rounded up.
> Maybe we could have:
> 
> 	max_pgoff = (i_size_read(inode) - 1) >> PAGE_SHIFT;
> 
> and then use strict inequalities below?

Yep, that's cleaner.  Fixed.

Jan, thank you for the review!

^ permalink raw reply	[flat|nested] 182+ messages in thread

* Re: [PATCH v5 15/17] dax: add struct iomap based DAX PMD support
  2016-10-11 22:51       ` Ross Zwisler
  (?)
@ 2016-10-12  7:45         ` Jan Kara
  -1 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2016-10-12  7:45 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Jan Kara, linux-kernel, Theodore Ts'o, Alexander Viro,
	Andreas Dilger, Andrew Morton, Christoph Hellwig, Dan Williams,
	Dave Chinner, Jan Kara, Matthew Wilcox, linux-ext4,
	linux-fsdevel, linux-mm, linux-nvdimm, linux-xfs

On Tue 11-10-16 16:51:30, Ross Zwisler wrote:
> On Tue, Oct 11, 2016 at 10:31:52AM +0200, Jan Kara wrote:
> > On Fri 07-10-16 15:09:02, Ross Zwisler wrote:
> > > diff --git a/fs/dax.c b/fs/dax.c
> > > index ac3cd05..e51d51f 100644
> > > --- a/fs/dax.c
> > > +++ b/fs/dax.c
> > > @@ -281,7 +281,7 @@ static wait_queue_head_t *dax_entry_waitqueue(struct address_space *mapping,
> > >  	 * queue to the start of that PMD.  This ensures that all offsets in
> > >  	 * the range covered by the PMD map to the same bit lock.
> > >  	 */
> > > -	if (RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD)
> > > +	if ((unsigned long)entry & RADIX_DAX_PMD)
> > >  		index &= ~((1UL << (PMD_SHIFT - PAGE_SHIFT)) - 1);
> > 
> > I agree with Christoph - helper for masking type bits would make this
> > nicer.
> 
> Fixed via a dax_flag_test() helper as I outlined in the mail to Christoph.  It
> seems clean to me, but if you or Christoph feel strongly that it would be
> cleaner as a local 'flags' variable, I'll make the change.

One idea I had is that you could have helpers like:

dax_is_pmd_entry()
dax_is_pte_entry()
dax_is_empty_entry()
dax_is_hole_entry()

And then you would use these helpers - all the flags would be hidden in the
helpers so even if we decide to change the flagging scheme to compress
things or so, it should be pretty local change.

> > > -		entry = (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY |
> > > -			       RADIX_DAX_ENTRY_LOCK);
> > > +
> > > +		/*
> > > +		 * Besides huge zero pages the only other thing that gets
> > > +		 * downgraded are empty entries which don't need to be
> > > +		 * unmapped.
> > > +		 */
> > > +		if (pmd_downgrade && ((unsigned long)entry & RADIX_DAX_HZP))
> > > +			unmap_mapping_range(mapping,
> > > +				(index << PAGE_SHIFT) & PMD_MASK, PMD_SIZE, 0);
> > > +
> > >  		spin_lock_irq(&mapping->tree_lock);
> > > -		err = radix_tree_insert(&mapping->page_tree, index, entry);
> > > +
> > > +		if (pmd_downgrade) {
> > > +			radix_tree_delete(&mapping->page_tree, index);
> > > +			mapping->nrexceptional--;
> > > +			dax_wake_mapping_entry_waiter(mapping, index, entry,
> > > +					false);
> > 
> > You need to set 'wake_all' argument here to true. Otherwise there could be
> > waiters waiting for non-existent entry forever...
> 
> Interesting.   Fixed, but let me make sure I understand.  So is the issue that
> you could have say 2 tasks waiting on a PMD index that has been rounded down
> to the PMD index via dax_entry_waitqueue()?
> 
> The person holding the lock on the entry would remove the PMD, insert a PTE
> and wake just one of the PMD aligned waiters.  That waiter would wake up, do
> something PTE based (since the PMD space is now polluted with PTEs), and then
> wake any waiters on it's PTE index.  Meanwhile, the second waiter could sleep
> forever on the PMD aligned index.  Is this correct?

Yes.

> So, perhaps more succinctly:
> 
> Thread 1		Thread 2		Thread 3
> --------		--------		--------
> index 0x202, hold PMD lock 0x200
> 			index 0x203, sleep on 0x200
> 						index 0x204, sleep on 0x200
> downgrade, removing 0x200
> wake one waiter on 0x200
> insert PTE @ 0x202
> 			wake up, grab index 0x203
> 			...
> 			wake one waiter on index 0x203
> 
> 						... sleeps forever
> Right?
 
Exactly.

> > > @@ -608,22 +683,28 @@ static void *dax_insert_mapping_entry(struct address_space *mapping,
> > >  		error = radix_tree_preload(vmf->gfp_mask & ~__GFP_HIGHMEM);
> > >  		if (error)
> > >  			return ERR_PTR(error);
> > > +	} else if (((unsigned long)entry & RADIX_DAX_HZP) &&
> > > +			!(flags & RADIX_DAX_HZP)) {
> > > +		/* replacing huge zero page with PMD block mapping */
> > > +		unmap_mapping_range(mapping,
> > > +			(vmf->pgoff << PAGE_SHIFT) & PMD_MASK, PMD_SIZE, 0);
> > >  	}
> > >  
> > >  	spin_lock_irq(&mapping->tree_lock);
> > > -	new_entry = (void *)((unsigned long)RADIX_DAX_ENTRY(sector, false) |
> > > -		       RADIX_DAX_ENTRY_LOCK);
> > > +	new_entry = dax_radix_entry(sector, flags);
> > > +
> > 
> > You've lost the RADIX_DAX_ENTRY_LOCK flag here?
> 
> Oh, nope, that's embedded in the dax_radix_entry() helper:
> 
> /* entries begin locked */
> static inline void *dax_radix_entry(sector_t sector, unsigned long flags)
> {
> 	return (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY | flags |
> 			((unsigned long)sector << RADIX_DAX_SHIFT) |
> 			RADIX_DAX_ENTRY_LOCK);
> }
> 
> I'll s/dax_radix_entry/dax_radix_locked_entry/ or something to make this
> clearer to the reader.

Yep, that would be better. Thanks!

> > >  	if (hole_fill) {
> > >  		__delete_from_page_cache(entry, NULL);
> > >  		/* Drop pagecache reference */
> > >  		put_page(entry);
> > > -		error = radix_tree_insert(page_tree, index, new_entry);
> > > +		error = __radix_tree_insert(page_tree, index,
> > > +				dax_radix_order(new_entry), new_entry);
> > >  		if (error) {
> > >  			new_entry = ERR_PTR(error);
> > >  			goto unlock;
> > >  		}
> > >  		mapping->nrexceptional++;
> > > -	} else {
> > > +	} else if ((unsigned long)entry & (RADIX_DAX_HZP|RADIX_DAX_EMPTY)) {
> > >  		void **slot;
> > >  		void *ret;
> > 
> > Uh, why this condition need to change? Is it some protection so that we
> > don't replace a mapped PMD entry with PTE one?
> 
> Yea, the logic was that if we have a radix tree that has PMDs in it, and some
> new process comes along doing PTE faults, we can just leave the DAX mapped PMD
> entries in the tree.  Locking, dirtying and flushing will happen on PMD basis
> for all processes.

Yup, I got this and I agree with that.

> If you think this is dangerous or not worth the effort we can make it like the
> hole case where any PTE sized faults will result in an unmap and a downgrade
> to PTE sized entries in the radix tree.

Ok, so probably I'm somewhat surprised that the logic that mapped PMD entry
is not replaced by a PTE entry is handled down in
dax_insert_mapping_entry(). But now looking at this it is just how we work
in other cases as well so please just add a comment there.

Longer term we may just shortcut the fault if grab_mapping_entry() will
return us entry of type we can use. That can save us quite some work if
several processes are mapping the same file. But again that's an
optimization to do once the PMD, iomap, and page protection works settle.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 182+ messages in thread

* Re: [PATCH v5 15/17] dax: add struct iomap based DAX PMD support
@ 2016-10-12  7:45         ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2016-10-12  7:45 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Jan Kara, linux-kernel, Theodore Ts'o, Alexander Viro,
	Andreas Dilger, Andrew Morton, Christoph Hellwig, Dan Williams,
	Dave Chinner, Jan Kara, Matthew Wilcox, linux-ext4,
	linux-fsdevel, linux-mm, linux-nvdimm, linux-xfs

On Tue 11-10-16 16:51:30, Ross Zwisler wrote:
> On Tue, Oct 11, 2016 at 10:31:52AM +0200, Jan Kara wrote:
> > On Fri 07-10-16 15:09:02, Ross Zwisler wrote:
> > > diff --git a/fs/dax.c b/fs/dax.c
> > > index ac3cd05..e51d51f 100644
> > > --- a/fs/dax.c
> > > +++ b/fs/dax.c
> > > @@ -281,7 +281,7 @@ static wait_queue_head_t *dax_entry_waitqueue(struct address_space *mapping,
> > >  	 * queue to the start of that PMD.  This ensures that all offsets in
> > >  	 * the range covered by the PMD map to the same bit lock.
> > >  	 */
> > > -	if (RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD)
> > > +	if ((unsigned long)entry & RADIX_DAX_PMD)
> > >  		index &= ~((1UL << (PMD_SHIFT - PAGE_SHIFT)) - 1);
> > 
> > I agree with Christoph - helper for masking type bits would make this
> > nicer.
> 
> Fixed via a dax_flag_test() helper as I outlined in the mail to Christoph.  It
> seems clean to me, but if you or Christoph feel strongly that it would be
> cleaner as a local 'flags' variable, I'll make the change.

One idea I had is that you could have helpers like:

dax_is_pmd_entry()
dax_is_pte_entry()
dax_is_empty_entry()
dax_is_hole_entry()

And then you would use these helpers - all the flags would be hidden in the
helpers so even if we decide to change the flagging scheme to compress
things or so, it should be pretty local change.

> > > -		entry = (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY |
> > > -			       RADIX_DAX_ENTRY_LOCK);
> > > +
> > > +		/*
> > > +		 * Besides huge zero pages the only other thing that gets
> > > +		 * downgraded are empty entries which don't need to be
> > > +		 * unmapped.
> > > +		 */
> > > +		if (pmd_downgrade && ((unsigned long)entry & RADIX_DAX_HZP))
> > > +			unmap_mapping_range(mapping,
> > > +				(index << PAGE_SHIFT) & PMD_MASK, PMD_SIZE, 0);
> > > +
> > >  		spin_lock_irq(&mapping->tree_lock);
> > > -		err = radix_tree_insert(&mapping->page_tree, index, entry);
> > > +
> > > +		if (pmd_downgrade) {
> > > +			radix_tree_delete(&mapping->page_tree, index);
> > > +			mapping->nrexceptional--;
> > > +			dax_wake_mapping_entry_waiter(mapping, index, entry,
> > > +					false);
> > 
> > You need to set 'wake_all' argument here to true. Otherwise there could be
> > waiters waiting for non-existent entry forever...
> 
> Interesting.   Fixed, but let me make sure I understand.  So is the issue that
> you could have say 2 tasks waiting on a PMD index that has been rounded down
> to the PMD index via dax_entry_waitqueue()?
> 
> The person holding the lock on the entry would remove the PMD, insert a PTE
> and wake just one of the PMD aligned waiters.  That waiter would wake up, do
> something PTE based (since the PMD space is now polluted with PTEs), and then
> wake any waiters on it's PTE index.  Meanwhile, the second waiter could sleep
> forever on the PMD aligned index.  Is this correct?

Yes.

> So, perhaps more succinctly:
> 
> Thread 1		Thread 2		Thread 3
> --------		--------		--------
> index 0x202, hold PMD lock 0x200
> 			index 0x203, sleep on 0x200
> 						index 0x204, sleep on 0x200
> downgrade, removing 0x200
> wake one waiter on 0x200
> insert PTE @ 0x202
> 			wake up, grab index 0x203
> 			...
> 			wake one waiter on index 0x203
> 
> 						... sleeps forever
> Right?
 
Exactly.

> > > @@ -608,22 +683,28 @@ static void *dax_insert_mapping_entry(struct address_space *mapping,
> > >  		error = radix_tree_preload(vmf->gfp_mask & ~__GFP_HIGHMEM);
> > >  		if (error)
> > >  			return ERR_PTR(error);
> > > +	} else if (((unsigned long)entry & RADIX_DAX_HZP) &&
> > > +			!(flags & RADIX_DAX_HZP)) {
> > > +		/* replacing huge zero page with PMD block mapping */
> > > +		unmap_mapping_range(mapping,
> > > +			(vmf->pgoff << PAGE_SHIFT) & PMD_MASK, PMD_SIZE, 0);
> > >  	}
> > >  
> > >  	spin_lock_irq(&mapping->tree_lock);
> > > -	new_entry = (void *)((unsigned long)RADIX_DAX_ENTRY(sector, false) |
> > > -		       RADIX_DAX_ENTRY_LOCK);
> > > +	new_entry = dax_radix_entry(sector, flags);
> > > +
> > 
> > You've lost the RADIX_DAX_ENTRY_LOCK flag here?
> 
> Oh, nope, that's embedded in the dax_radix_entry() helper:
> 
> /* entries begin locked */
> static inline void *dax_radix_entry(sector_t sector, unsigned long flags)
> {
> 	return (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY | flags |
> 			((unsigned long)sector << RADIX_DAX_SHIFT) |
> 			RADIX_DAX_ENTRY_LOCK);
> }
> 
> I'll s/dax_radix_entry/dax_radix_locked_entry/ or something to make this
> clearer to the reader.

Yep, that would be better. Thanks!

> > >  	if (hole_fill) {
> > >  		__delete_from_page_cache(entry, NULL);
> > >  		/* Drop pagecache reference */
> > >  		put_page(entry);
> > > -		error = radix_tree_insert(page_tree, index, new_entry);
> > > +		error = __radix_tree_insert(page_tree, index,
> > > +				dax_radix_order(new_entry), new_entry);
> > >  		if (error) {
> > >  			new_entry = ERR_PTR(error);
> > >  			goto unlock;
> > >  		}
> > >  		mapping->nrexceptional++;
> > > -	} else {
> > > +	} else if ((unsigned long)entry & (RADIX_DAX_HZP|RADIX_DAX_EMPTY)) {
> > >  		void **slot;
> > >  		void *ret;
> > 
> > Uh, why this condition need to change? Is it some protection so that we
> > don't replace a mapped PMD entry with PTE one?
> 
> Yea, the logic was that if we have a radix tree that has PMDs in it, and some
> new process comes along doing PTE faults, we can just leave the DAX mapped PMD
> entries in the tree.  Locking, dirtying and flushing will happen on PMD basis
> for all processes.

Yup, I got this and I agree with that.

> If you think this is dangerous or not worth the effort we can make it like the
> hole case where any PTE sized faults will result in an unmap and a downgrade
> to PTE sized entries in the radix tree.

Ok, so probably I'm somewhat surprised that the logic that mapped PMD entry
is not replaced by a PTE entry is handled down in
dax_insert_mapping_entry(). But now looking at this it is just how we work
in other cases as well so please just add a comment there.

Longer term we may just shortcut the fault if grab_mapping_entry() will
return us entry of type we can use. That can save us quite some work if
several processes are mapping the same file. But again that's an
optimization to do once the PMD, iomap, and page protection works settle.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 182+ messages in thread

* Re: [PATCH v5 15/17] dax: add struct iomap based DAX PMD support
@ 2016-10-12  7:45         ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2016-10-12  7:45 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Jan Kara, linux-kernel, Theodore Ts'o, Alexander Viro,
	Andreas Dilger, Andrew Morton, Christoph Hellwig, Dan Williams,
	Dave Chinner, Jan Kara, Matthew Wilcox, linux-ext4,
	linux-fsdevel, linux-mm, linux-nvdimm, linux-xfs

On Tue 11-10-16 16:51:30, Ross Zwisler wrote:
> On Tue, Oct 11, 2016 at 10:31:52AM +0200, Jan Kara wrote:
> > On Fri 07-10-16 15:09:02, Ross Zwisler wrote:
> > > diff --git a/fs/dax.c b/fs/dax.c
> > > index ac3cd05..e51d51f 100644
> > > --- a/fs/dax.c
> > > +++ b/fs/dax.c
> > > @@ -281,7 +281,7 @@ static wait_queue_head_t *dax_entry_waitqueue(struct address_space *mapping,
> > >  	 * queue to the start of that PMD.  This ensures that all offsets in
> > >  	 * the range covered by the PMD map to the same bit lock.
> > >  	 */
> > > -	if (RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD)
> > > +	if ((unsigned long)entry & RADIX_DAX_PMD)
> > >  		index &= ~((1UL << (PMD_SHIFT - PAGE_SHIFT)) - 1);
> > 
> > I agree with Christoph - helper for masking type bits would make this
> > nicer.
> 
> Fixed via a dax_flag_test() helper as I outlined in the mail to Christoph.  It
> seems clean to me, but if you or Christoph feel strongly that it would be
> cleaner as a local 'flags' variable, I'll make the change.

One idea I had is that you could have helpers like:

dax_is_pmd_entry()
dax_is_pte_entry()
dax_is_empty_entry()
dax_is_hole_entry()

And then you would use these helpers - all the flags would be hidden in the
helpers so even if we decide to change the flagging scheme to compress
things or so, it should be pretty local change.

> > > -		entry = (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY |
> > > -			       RADIX_DAX_ENTRY_LOCK);
> > > +
> > > +		/*
> > > +		 * Besides huge zero pages the only other thing that gets
> > > +		 * downgraded are empty entries which don't need to be
> > > +		 * unmapped.
> > > +		 */
> > > +		if (pmd_downgrade && ((unsigned long)entry & RADIX_DAX_HZP))
> > > +			unmap_mapping_range(mapping,
> > > +				(index << PAGE_SHIFT) & PMD_MASK, PMD_SIZE, 0);
> > > +
> > >  		spin_lock_irq(&mapping->tree_lock);
> > > -		err = radix_tree_insert(&mapping->page_tree, index, entry);
> > > +
> > > +		if (pmd_downgrade) {
> > > +			radix_tree_delete(&mapping->page_tree, index);
> > > +			mapping->nrexceptional--;
> > > +			dax_wake_mapping_entry_waiter(mapping, index, entry,
> > > +					false);
> > 
> > You need to set 'wake_all' argument here to true. Otherwise there could be
> > waiters waiting for non-existent entry forever...
> 
> Interesting.   Fixed, but let me make sure I understand.  So is the issue that
> you could have say 2 tasks waiting on a PMD index that has been rounded down
> to the PMD index via dax_entry_waitqueue()?
> 
> The person holding the lock on the entry would remove the PMD, insert a PTE
> and wake just one of the PMD aligned waiters.  That waiter would wake up, do
> something PTE based (since the PMD space is now polluted with PTEs), and then
> wake any waiters on it's PTE index.  Meanwhile, the second waiter could sleep
> forever on the PMD aligned index.  Is this correct?

Yes.

> So, perhaps more succinctly:
> 
> Thread 1		Thread 2		Thread 3
> --------		--------		--------
> index 0x202, hold PMD lock 0x200
> 			index 0x203, sleep on 0x200
> 						index 0x204, sleep on 0x200
> downgrade, removing 0x200
> wake one waiter on 0x200
> insert PTE @ 0x202
> 			wake up, grab index 0x203
> 			...
> 			wake one waiter on index 0x203
> 
> 						... sleeps forever
> Right?
 
Exactly.

> > > @@ -608,22 +683,28 @@ static void *dax_insert_mapping_entry(struct address_space *mapping,
> > >  		error = radix_tree_preload(vmf->gfp_mask & ~__GFP_HIGHMEM);
> > >  		if (error)
> > >  			return ERR_PTR(error);
> > > +	} else if (((unsigned long)entry & RADIX_DAX_HZP) &&
> > > +			!(flags & RADIX_DAX_HZP)) {
> > > +		/* replacing huge zero page with PMD block mapping */
> > > +		unmap_mapping_range(mapping,
> > > +			(vmf->pgoff << PAGE_SHIFT) & PMD_MASK, PMD_SIZE, 0);
> > >  	}
> > >  
> > >  	spin_lock_irq(&mapping->tree_lock);
> > > -	new_entry = (void *)((unsigned long)RADIX_DAX_ENTRY(sector, false) |
> > > -		       RADIX_DAX_ENTRY_LOCK);
> > > +	new_entry = dax_radix_entry(sector, flags);
> > > +
> > 
> > You've lost the RADIX_DAX_ENTRY_LOCK flag here?
> 
> Oh, nope, that's embedded in the dax_radix_entry() helper:
> 
> /* entries begin locked */
> static inline void *dax_radix_entry(sector_t sector, unsigned long flags)
> {
> 	return (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY | flags |
> 			((unsigned long)sector << RADIX_DAX_SHIFT) |
> 			RADIX_DAX_ENTRY_LOCK);
> }
> 
> I'll s/dax_radix_entry/dax_radix_locked_entry/ or something to make this
> clearer to the reader.

Yep, that would be better. Thanks!

> > >  	if (hole_fill) {
> > >  		__delete_from_page_cache(entry, NULL);
> > >  		/* Drop pagecache reference */
> > >  		put_page(entry);
> > > -		error = radix_tree_insert(page_tree, index, new_entry);
> > > +		error = __radix_tree_insert(page_tree, index,
> > > +				dax_radix_order(new_entry), new_entry);
> > >  		if (error) {
> > >  			new_entry = ERR_PTR(error);
> > >  			goto unlock;
> > >  		}
> > >  		mapping->nrexceptional++;
> > > -	} else {
> > > +	} else if ((unsigned long)entry & (RADIX_DAX_HZP|RADIX_DAX_EMPTY)) {
> > >  		void **slot;
> > >  		void *ret;
> > 
> > Uh, why this condition need to change? Is it some protection so that we
> > don't replace a mapped PMD entry with PTE one?
> 
> Yea, the logic was that if we have a radix tree that has PMDs in it, and some
> new process comes along doing PTE faults, we can just leave the DAX mapped PMD
> entries in the tree.  Locking, dirtying and flushing will happen on PMD basis
> for all processes.

Yup, I got this and I agree with that.

> If you think this is dangerous or not worth the effort we can make it like the
> hole case where any PTE sized faults will result in an unmap and a downgrade
> to PTE sized entries in the radix tree.

Ok, so probably I'm somewhat surprised that the logic that mapped PMD entry
is not replaced by a PTE entry is handled down in
dax_insert_mapping_entry(). But now looking at this it is just how we work
in other cases as well so please just add a comment there.

Longer term we may just shortcut the fault if grab_mapping_entry() will
return us entry of type we can use. That can save us quite some work if
several processes are mapping the same file. But again that's an
optimization to do once the PMD, iomap, and page protection works settle.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 182+ messages in thread

end of thread, other threads:[~2016-10-12  7:46 UTC | newest]

Thread overview: 182+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-10-07 21:08 [PATCH v5 00/17] re-enable DAX PMD support Ross Zwisler
2016-10-07 21:08 ` Ross Zwisler
2016-10-07 21:08 ` Ross Zwisler
2016-10-07 21:08 ` Ross Zwisler
2016-10-07 21:08 ` [PATCH v5 01/17] ext4: allow DAX writeback for hole punch Ross Zwisler
2016-10-07 21:08   ` Ross Zwisler
2016-10-07 21:08   ` Ross Zwisler
2016-10-07 21:08   ` Ross Zwisler
2016-10-07 21:08   ` Ross Zwisler
2016-10-07 21:08 ` [PATCH v5 02/17] ext4: tell DAX the size of allocation holes Ross Zwisler
2016-10-07 21:08   ` Ross Zwisler
2016-10-07 21:08   ` Ross Zwisler
2016-10-07 21:08   ` Ross Zwisler
2016-10-07 21:08   ` Ross Zwisler
2016-10-07 21:08 ` [PATCH v5 03/17] dax: remove buffer_size_valid() Ross Zwisler
2016-10-07 21:08   ` Ross Zwisler
2016-10-07 21:08   ` Ross Zwisler
2016-10-07 21:08   ` Ross Zwisler
2016-10-07 21:08   ` Ross Zwisler
2016-10-07 21:08 ` [PATCH v5 04/17] ext2: remove support for DAX PMD faults Ross Zwisler
2016-10-07 21:08   ` Ross Zwisler
2016-10-07 21:08   ` Ross Zwisler
2016-10-07 21:08   ` Ross Zwisler
2016-10-07 21:08 ` [PATCH v5 05/17] ext2: return -EIO on ext2_iomap_end() failure Ross Zwisler
2016-10-07 21:08   ` Ross Zwisler
2016-10-07 21:08   ` Ross Zwisler
2016-10-07 21:08   ` Ross Zwisler
2016-10-07 21:08   ` Ross Zwisler
2016-10-11  6:48   ` Jan Kara
2016-10-11  6:48     ` Jan Kara
2016-10-11  6:48     ` Jan Kara
2016-10-11  6:48     ` Jan Kara
2016-10-07 21:08 ` [PATCH v5 06/17] dax: make 'wait_table' global variable static Ross Zwisler
2016-10-07 21:08   ` Ross Zwisler
2016-10-07 21:08   ` Ross Zwisler
2016-10-07 21:08   ` Ross Zwisler
2016-10-07 21:08   ` Ross Zwisler
2016-10-07 21:08 ` [PATCH v5 07/17] dax: remove the last BUG_ON() from fs/dax.c Ross Zwisler
2016-10-07 21:08   ` Ross Zwisler
2016-10-07 21:08   ` Ross Zwisler
2016-10-07 21:08   ` Ross Zwisler
2016-10-07 21:08   ` Ross Zwisler
2016-10-10 15:45   ` Christoph Hellwig
2016-10-10 15:45     ` Christoph Hellwig
2016-10-10 15:45     ` Christoph Hellwig
2016-10-10 15:45     ` Christoph Hellwig
2016-10-10 15:45     ` Christoph Hellwig
2016-10-11  6:50   ` Jan Kara
2016-10-11  6:50     ` Jan Kara
2016-10-11  6:50     ` Jan Kara
2016-10-11  6:50     ` Jan Kara
2016-10-11  6:50     ` Jan Kara
2016-10-07 21:08 ` [PATCH v5 08/17] dax: consistent variable naming for DAX entries Ross Zwisler
2016-10-07 21:08   ` Ross Zwisler
2016-10-07 21:08   ` Ross Zwisler
2016-10-07 21:08   ` Ross Zwisler
2016-10-07 21:08   ` Ross Zwisler
2016-10-07 21:08 ` [PATCH v5 09/17] dax: coordinate locking for offsets in PMD range Ross Zwisler
2016-10-07 21:08   ` Ross Zwisler
2016-10-07 21:08   ` Ross Zwisler
2016-10-07 21:08   ` Ross Zwisler
2016-10-10 15:46   ` Christoph Hellwig
2016-10-10 15:46     ` Christoph Hellwig
2016-10-10 15:46     ` Christoph Hellwig
2016-10-11  7:04   ` Jan Kara
2016-10-11  7:04     ` Jan Kara
2016-10-11  7:04     ` Jan Kara
2016-10-11  7:04     ` Jan Kara
2016-10-11 21:18     ` Ross Zwisler
2016-10-11 21:18       ` Ross Zwisler
2016-10-11 21:18       ` Ross Zwisler
2016-10-07 21:08 ` [PATCH v5 10/17] dax: remove dax_pmd_fault() Ross Zwisler
2016-10-07 21:08   ` Ross Zwisler
2016-10-07 21:08   ` Ross Zwisler
2016-10-07 21:08   ` Ross Zwisler
2016-10-07 21:08   ` Ross Zwisler
2016-10-07 21:08 ` [PATCH v5 11/17] dax: correct dax iomap code namespace Ross Zwisler
2016-10-07 21:08   ` Ross Zwisler
2016-10-07 21:08   ` Ross Zwisler
2016-10-07 21:08   ` Ross Zwisler
2016-10-09 15:28   ` Christoph Hellwig
2016-10-09 15:28     ` Christoph Hellwig
2016-10-09 15:28     ` Christoph Hellwig
2016-10-09 15:28     ` Christoph Hellwig
2016-10-10 19:04     ` [PATCH] " Ross Zwisler
2016-10-10 19:04       ` Ross Zwisler
2016-10-10 19:04       ` Ross Zwisler
2016-10-10 19:04       ` Ross Zwisler
2016-10-10 20:19       ` Dave Chinner
2016-10-10 20:19         ` Dave Chinner
2016-10-10 20:19         ` Dave Chinner
2016-10-10 20:19         ` Dave Chinner
2016-10-07 21:08 ` [PATCH v5 12/17] dax: add dax_iomap_sector() helper function Ross Zwisler
2016-10-07 21:08   ` Ross Zwisler
2016-10-07 21:08   ` Ross Zwisler
2016-10-07 21:08   ` Ross Zwisler
2016-10-07 21:08   ` Ross Zwisler
2016-10-10 15:46   ` Christoph Hellwig
2016-10-10 15:46     ` Christoph Hellwig
2016-10-10 15:46     ` Christoph Hellwig
2016-10-10 15:46     ` Christoph Hellwig
2016-10-10 15:46     ` Christoph Hellwig
2016-10-11  7:06   ` Jan Kara
2016-10-11  7:06     ` Jan Kara
2016-10-11  7:06     ` Jan Kara
2016-10-11  7:06     ` Jan Kara
2016-10-07 21:09 ` [PATCH v5 13/17] dax: dax_iomap_fault() needs to call iomap_end() Ross Zwisler
2016-10-07 21:09   ` Ross Zwisler
2016-10-07 21:09   ` Ross Zwisler
2016-10-07 21:09   ` Ross Zwisler
2016-10-10 15:50   ` Christoph Hellwig
2016-10-10 15:50     ` Christoph Hellwig
2016-10-10 15:50     ` Christoph Hellwig
2016-10-10 15:50     ` Christoph Hellwig
2016-10-10 22:05     ` Ross Zwisler
2016-10-10 22:05       ` Ross Zwisler
2016-10-10 22:05       ` Ross Zwisler
2016-10-10 22:05       ` Ross Zwisler
2016-10-11  7:21   ` Jan Kara
2016-10-11  7:21     ` Jan Kara
2016-10-11  7:21     ` Jan Kara
2016-10-11  7:21     ` Jan Kara
2016-10-11  7:21     ` Jan Kara
2016-10-07 21:09 ` [PATCH v5 14/17] dax: move RADIX_DAX_* defines to dax.h Ross Zwisler
2016-10-07 21:09   ` Ross Zwisler
2016-10-07 21:09   ` Ross Zwisler
2016-10-07 21:09   ` Ross Zwisler
2016-10-10 15:50   ` Christoph Hellwig
2016-10-10 15:50     ` Christoph Hellwig
2016-10-10 15:50     ` Christoph Hellwig
2016-10-10 15:50     ` Christoph Hellwig
2016-10-10 15:50     ` Christoph Hellwig
2016-10-11  7:23   ` Jan Kara
2016-10-11  7:23     ` Jan Kara
2016-10-11  7:23     ` Jan Kara
2016-10-11  7:23     ` Jan Kara
2016-10-11  7:23     ` Jan Kara
2016-10-07 21:09 ` [PATCH v5 15/17] dax: add struct iomap based DAX PMD support Ross Zwisler
2016-10-07 21:09   ` Ross Zwisler
2016-10-07 21:09   ` Ross Zwisler
2016-10-07 21:09   ` Ross Zwisler
2016-10-07 21:09   ` Ross Zwisler
2016-10-10 15:59   ` Christoph Hellwig
2016-10-10 15:59     ` Christoph Hellwig
2016-10-10 15:59     ` Christoph Hellwig
2016-10-10 15:59     ` Christoph Hellwig
2016-10-10 22:06     ` Ross Zwisler
2016-10-10 22:06       ` Ross Zwisler
2016-10-10 22:06       ` Ross Zwisler
2016-10-10 22:06       ` Ross Zwisler
2016-10-10 22:06       ` Ross Zwisler
2016-10-11 21:48     ` Ross Zwisler
2016-10-11 21:48       ` Ross Zwisler
2016-10-11 21:48       ` Ross Zwisler
2016-10-11 21:48       ` Ross Zwisler
2016-10-11 21:48       ` Ross Zwisler
2016-10-11  8:31   ` Jan Kara
2016-10-11  8:31     ` Jan Kara
2016-10-11  8:31     ` Jan Kara
2016-10-11  8:31     ` Jan Kara
2016-10-11  8:31     ` Jan Kara
2016-10-11 22:51     ` Ross Zwisler
2016-10-11 22:51       ` Ross Zwisler
2016-10-11 22:51       ` Ross Zwisler
2016-10-11 22:51       ` Ross Zwisler
2016-10-11 22:51       ` Ross Zwisler
2016-10-12  7:45       ` Jan Kara
2016-10-12  7:45         ` Jan Kara
2016-10-12  7:45         ` Jan Kara
2016-10-07 21:09 ` [PATCH v5 16/17] xfs: use struct iomap based DAX PMD fault path Ross Zwisler
2016-10-07 21:09   ` Ross Zwisler
2016-10-07 21:09   ` Ross Zwisler
2016-10-07 21:09   ` Ross Zwisler
2016-10-11  8:34   ` Jan Kara
2016-10-11  8:34     ` Jan Kara
2016-10-11  8:34     ` Jan Kara
2016-10-11  8:34     ` Jan Kara
2016-10-07 21:09 ` [PATCH v5 17/17] dax: remove "depends on BROKEN" from FS_DAX_PMD Ross Zwisler
2016-10-07 21:09   ` Ross Zwisler
2016-10-07 21:09   ` Ross Zwisler
2016-10-07 21:09   ` Ross Zwisler
2016-10-07 21:09   ` Ross Zwisler

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.