* [PATCH 0/7] re-enable DAX PMD support
From: Ross Zwisler @ 2016-08-15 19:09 UTC (permalink / raw)
  To: linux-kernel
  Cc: Theodore Ts'o, Andrew Morton, linux-nvdimm, Dave Chinner,
	linux-mm, Andreas Dilger, Alexander Viro, Jan Kara,
	linux-fsdevel, linux-ext4

DAX PMDs have been disabled since Jan Kara introduced DAX radix tree based
locking.  This series allows DAX PMDs to participate in the DAX radix tree
based locking scheme so that they can be re-enabled.

This series restores DAX PMD functionality to what it was before it was
disabled.  There is still a known issue between DAX PMDs and hole punch,
which I am currently working on and plan to address in a separate series.
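
For background, the locking scheme these patches extend works roughly as
follows (a simplified sketch; see fs/dax.c for the real code): DAX entries
in the radix tree are exceptional entries carrying a lock bit, e.g.

	entry = (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY | RADIX_DAX_ENTRY_LOCK);

and contenders sleep on a hashed wait queue until the lock holder clears
the bit and wakes them (see dax_wake_mapping_entry_waiter()).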

Ross Zwisler (7):
  ext2: tell DAX the size of allocation holes
  ext4: tell DAX the size of allocation holes
  dax: remove buffer_size_valid()
  dax: rename 'ret' to 'entry' in grab_mapping_entry
  dax: lock based on slot instead of [mapping, index]
  dax: re-enable DAX PMD support
  dax: remove "depends on BROKEN" from FS_DAX_PMD

 fs/Kconfig          |   1 -
 fs/dax.c            | 301 ++++++++++++++++++++++++++--------------------------
 fs/ext2/inode.c     |   6 ++
 fs/ext4/inode.c     |   3 +
 include/linux/dax.h |  30 +++++-
 mm/filemap.c        |   7 +-
 6 files changed, 191 insertions(+), 157 deletions(-)

-- 
2.9.0

* [PATCH 1/7] ext2: tell DAX the size of allocation holes
From: Ross Zwisler @ 2016-08-15 19:09 UTC (permalink / raw)
  To: linux-kernel
  Cc: Theodore Ts'o, Andrew Morton, linux-nvdimm, Dave Chinner,
	linux-mm, Andreas Dilger, Alexander Viro, Jan Kara,
	linux-fsdevel, linux-ext4

When DAX calls ext2_get_block() and the file offset points to a hole, we
currently don't set bh_result->b_size.  When we re-enable PMD faults, DAX
will need bh_result->b_size to tell it the size of the hole so it can
decide whether to fault in a 4 KiB zero page or a 2 MiB zero page.

For ext2 we always want DAX to use 4 KiB zero pages, so we just tell DAX
that all holes are 4 KiB in size.
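
For reference, a rough sketch of how the DAX PMD fault path will consume
this value (simplified from the dax_pmd_fault() check adjusted later in
this series; surrounding code elided):

	/* in dax_pmd_fault(), after get_block() has filled in bh: */
	if (bh.b_size < PMD_SIZE) {
		/* hole or mapping smaller than 2 MiB - use PTE faults */
		return VM_FAULT_FALLBACK;
	}

Reporting a 4 KiB hole here is what keeps ext2 on the PTE path.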

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/ext2/inode.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index d5c7d09..c6d9763 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -773,6 +773,12 @@ int ext2_get_block(struct inode *inode, sector_t iblock, struct buffer_head *bh_
 	if (ret > 0) {
 		bh_result->b_size = (ret << inode->i_blkbits);
 		ret = 0;
+	} else if (ret == 0 && IS_DAX(inode)) {
+		/*
+		 * We have hit a hole.  Tell DAX it is 4k in size so that it
+		 * uses PTE faults.
+		 */
+		bh_result->b_size = PAGE_SIZE;
 	}
 	return ret;
 
-- 
2.9.0

* [PATCH 2/7] ext4: tell DAX the size of allocation holes
From: Ross Zwisler @ 2016-08-15 19:09 UTC (permalink / raw)
  To: linux-kernel
  Cc: Theodore Ts'o, Andrew Morton, linux-nvdimm, Dave Chinner,
	linux-mm, Andreas Dilger, Alexander Viro, Jan Kara,
	linux-fsdevel, linux-ext4

When DAX calls _ext4_get_block() and the file offset points to a hole, we
currently don't set bh->b_size.  When we re-enable PMD faults, DAX will
need bh->b_size to tell it the size of the hole so it can decide whether
to fault in a 4 KiB zero page or a 2 MiB zero page.

_ext4_get_block() has the hole size information from ext4_map_blocks(), so
populate bh->b_size.
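
As an illustration (not part of this patch): with a 4 KiB block size, a
2 MiB hole that ext4_map_blocks() reports as map.m_len == 512 gives

	bh->b_size = 4096 * 512;	/* 2 MiB, i.e. >= PMD_SIZE */

so DAX can use a 2 MiB zero page, while any smaller hole keeps it on the
4 KiB PTE path.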

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/ext4/inode.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 3131747..1808013 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -759,6 +759,9 @@ static int _ext4_get_block(struct inode *inode, sector_t iblock,
 		ext4_update_bh_state(bh, map.m_flags);
 		bh->b_size = inode->i_sb->s_blocksize * map.m_len;
 		ret = 0;
+	} else if (ret == 0) {
+		/* hole case, need to fill in bh->b_size */
+		bh->b_size = inode->i_sb->s_blocksize * map.m_len;
 	}
 	return ret;
 }
-- 
2.9.0

* [PATCH 3/7] dax: remove buffer_size_valid()
From: Ross Zwisler @ 2016-08-15 19:09 UTC (permalink / raw)
  To: linux-kernel
  Cc: Theodore Ts'o, Andrew Morton, linux-nvdimm, Dave Chinner,
	linux-mm, Andreas Dilger, Alexander Viro, Jan Kara,
	linux-fsdevel, linux-ext4

Now that all our supported filesystems (ext2, ext4 and XFS) properly set
bh.b_size when we call get_block() for a hole, rely on that value and
remove the buffer_size_valid() sanity check.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/dax.c | 22 +---------------------
 1 file changed, 1 insertion(+), 21 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 993dc6f..8030f93 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -121,19 +121,6 @@ static bool buffer_written(struct buffer_head *bh)
 	return buffer_mapped(bh) && !buffer_unwritten(bh);
 }
 
-/*
- * When ext4 encounters a hole, it returns without modifying the buffer_head
- * which means that we can't trust b_size.  To cope with this, we set b_state
- * to 0 before calling get_block and, if any bit is set, we know we can trust
- * b_size.  Unfortunate, really, since ext4 knows precisely how long a hole is
- * and would save us time calling get_block repeatedly.
- */
-static bool buffer_size_valid(struct buffer_head *bh)
-{
-	return bh->b_state != 0;
-}
-
-
 static sector_t to_sector(const struct buffer_head *bh,
 		const struct inode *inode)
 {
@@ -175,8 +162,6 @@ static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
 				rc = get_block(inode, block, bh, rw == WRITE);
 				if (rc)
 					break;
-				if (!buffer_size_valid(bh))
-					bh->b_size = 1 << blkbits;
 				bh_max = pos - first + bh->b_size;
 				bdev = bh->b_bdev;
 				/*
@@ -1010,12 +995,7 @@ int dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 
 	bdev = bh.b_bdev;
 
-	/*
-	 * If the filesystem isn't willing to tell us the length of a hole,
-	 * just fall back to PTEs.  Calling get_block 512 times in a loop
-	 * would be silly.
-	 */
-	if (!buffer_size_valid(&bh) || bh.b_size < PMD_SIZE) {
+	if (bh.b_size < PMD_SIZE) {
 		dax_pmd_dbg(&bh, address, "allocated block too small");
 		return VM_FAULT_FALLBACK;
 	}
-- 
2.9.0

* [PATCH 4/7] dax: rename 'ret' to 'entry' in grab_mapping_entry
From: Ross Zwisler @ 2016-08-15 19:09 UTC (permalink / raw)
  To: linux-kernel
  Cc: Theodore Ts'o, Andrew Morton, linux-nvdimm, Dave Chinner,
	linux-mm, Andreas Dilger, Alexander Viro, Jan Kara,
	linux-fsdevel, linux-ext4

No functional change.

Everywhere else that we get entries via get_unlocked_mapping_entry(), we
save them in 'entry' variables.  Rename 'ret' here as well so the name is
more descriptive.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/dax.c | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 8030f93..fed6a52 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -394,13 +394,13 @@ static void *get_unlocked_mapping_entry(struct address_space *mapping,
  */
 static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index)
 {
-	void *ret, **slot;
+	void *entry, **slot;
 
 restart:
 	spin_lock_irq(&mapping->tree_lock);
-	ret = get_unlocked_mapping_entry(mapping, index, &slot);
+	entry = get_unlocked_mapping_entry(mapping, index, &slot);
 	/* No entry for given index? Make sure radix tree is big enough. */
-	if (!ret) {
+	if (!entry) {
 		int err;
 
 		spin_unlock_irq(&mapping->tree_lock);
@@ -408,10 +408,10 @@ restart:
 				mapping_gfp_mask(mapping) & ~__GFP_HIGHMEM);
 		if (err)
 			return ERR_PTR(err);
-		ret = (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY |
+		entry = (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY |
 			       RADIX_DAX_ENTRY_LOCK);
 		spin_lock_irq(&mapping->tree_lock);
-		err = radix_tree_insert(&mapping->page_tree, index, ret);
+		err = radix_tree_insert(&mapping->page_tree, index, entry);
 		radix_tree_preload_end();
 		if (err) {
 			spin_unlock_irq(&mapping->tree_lock);
@@ -423,11 +423,11 @@ restart:
 		/* Good, we have inserted empty locked entry into the tree. */
 		mapping->nrexceptional++;
 		spin_unlock_irq(&mapping->tree_lock);
-		return ret;
+		return entry;
 	}
 	/* Normal page in radix tree? */
-	if (!radix_tree_exceptional_entry(ret)) {
-		struct page *page = ret;
+	if (!radix_tree_exceptional_entry(entry)) {
+		struct page *page = entry;
 
 		get_page(page);
 		spin_unlock_irq(&mapping->tree_lock);
@@ -440,9 +440,9 @@ restart:
 		}
 		return page;
 	}
-	ret = lock_slot(mapping, slot);
+	entry = lock_slot(mapping, slot);
 	spin_unlock_irq(&mapping->tree_lock);
-	return ret;
+	return entry;
 }
 
 void dax_wake_mapping_entry_waiter(struct address_space *mapping,
-- 
2.9.0

* [PATCH 5/7] dax: lock based on slot instead of [mapping, index]
From: Ross Zwisler @ 2016-08-15 19:09 UTC (permalink / raw)
  To: linux-kernel
  Cc: Theodore Ts'o, Andrew Morton, linux-nvdimm, Dave Chinner,
	linux-mm, Andreas Dilger, Alexander Viro, Jan Kara,
	linux-fsdevel, linux-ext4

DAX radix tree locking currently locks entries based on the unique
combination of the 'mapping' pointer and the pgoff_t 'index' for the entry.
This works for PTEs, but as we move to PMDs we will need all the offsets
within the range covered by the PMD to map to the same bit lock.
To accomplish this, lock based on the 'slot' pointer in the radix tree
instead of [mapping, index].

When a PMD entry is present in the tree, all offsets will map to the same
'slot' via radix tree lookups, and they will all share the same locking.
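
A simplified view of the effect (illustrative only): for a PMD entry
covering pgoff 0-511, __radix_tree_lookup() on any index in that range
returns the same slot pointer, so

	wq = wait_table + hash_long((unsigned long)slot, DAX_WAIT_TABLE_BITS);

selects one shared wait queue for the whole 2 MiB range, where hashing
[mapping, index] could have put each of the 512 indices on a different
queue.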

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/dax.c            | 59 +++++++++++++++++++++--------------------------------
 include/linux/dax.h |  3 +--
 mm/filemap.c        |  3 +--
 3 files changed, 25 insertions(+), 40 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index fed6a52..0f1d053 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -62,11 +62,10 @@ static int __init init_dax_wait_table(void)
 }
 fs_initcall(init_dax_wait_table);
 
-static wait_queue_head_t *dax_entry_waitqueue(struct address_space *mapping,
-					      pgoff_t index)
+static wait_queue_head_t *dax_entry_waitqueue(void **slot)
 {
-	unsigned long hash = hash_long((unsigned long)mapping ^ index,
-				       DAX_WAIT_TABLE_BITS);
+	unsigned long hash = hash_long((unsigned long)slot,
+					DAX_WAIT_TABLE_BITS);
 	return wait_table + hash;
 }
 
@@ -281,25 +280,19 @@ EXPORT_SYMBOL_GPL(dax_do_io);
 /*
  * DAX radix tree locking
  */
-struct exceptional_entry_key {
-	struct address_space *mapping;
-	unsigned long index;
-};
-
 struct wait_exceptional_entry_queue {
 	wait_queue_t wait;
-	struct exceptional_entry_key key;
+	void **slot;
 };
 
 static int wake_exceptional_entry_func(wait_queue_t *wait, unsigned int mode,
 				       int sync, void *keyp)
 {
-	struct exceptional_entry_key *key = keyp;
+	void **slot = keyp;
 	struct wait_exceptional_entry_queue *ewait =
 		container_of(wait, struct wait_exceptional_entry_queue, wait);
 
-	if (key->mapping != ewait->key.mapping ||
-	    key->index != ewait->key.index)
+	if (slot != ewait->slot)
 		return 0;
 	return autoremove_wake_function(wait, mode, sync, NULL);
 }
@@ -357,12 +350,10 @@ static void *get_unlocked_mapping_entry(struct address_space *mapping,
 {
 	void *ret, **slot;
 	struct wait_exceptional_entry_queue ewait;
-	wait_queue_head_t *wq = dax_entry_waitqueue(mapping, index);
+	wait_queue_head_t *wq;
 
 	init_wait(&ewait.wait);
 	ewait.wait.func = wake_exceptional_entry_func;
-	ewait.key.mapping = mapping;
-	ewait.key.index = index;
 
 	for (;;) {
 		ret = __radix_tree_lookup(&mapping->page_tree, index, NULL,
@@ -373,6 +364,9 @@ static void *get_unlocked_mapping_entry(struct address_space *mapping,
 				*slotp = slot;
 			return ret;
 		}
+
+		wq = dax_entry_waitqueue(slot);
+		ewait.slot = slot;
 		prepare_to_wait_exclusive(wq, &ewait.wait,
 					  TASK_UNINTERRUPTIBLE);
 		spin_unlock_irq(&mapping->tree_lock);
@@ -445,10 +439,9 @@ restart:
 	return entry;
 }
 
-void dax_wake_mapping_entry_waiter(struct address_space *mapping,
-				   pgoff_t index, bool wake_all)
+void dax_wake_mapping_entry_waiter(void **slot, bool wake_all)
 {
-	wait_queue_head_t *wq = dax_entry_waitqueue(mapping, index);
+	wait_queue_head_t *wq = dax_entry_waitqueue(slot);
 
 	/*
 	 * Checking for locked entry and prepare_to_wait_exclusive() happens
@@ -456,13 +449,8 @@ void dax_wake_mapping_entry_waiter(struct address_space *mapping,
 	 * So at this point all tasks that could have seen our entry locked
 	 * must be in the waitqueue and the following check will see them.
 	 */
-	if (waitqueue_active(wq)) {
-		struct exceptional_entry_key key;
-
-		key.mapping = mapping;
-		key.index = index;
-		__wake_up(wq, TASK_NORMAL, wake_all ? 0 : 1, &key);
-	}
+	if (waitqueue_active(wq))
+		__wake_up(wq, TASK_NORMAL, wake_all ? 0 : 1, slot);
 }
 
 void dax_unlock_mapping_entry(struct address_space *mapping, pgoff_t index)
@@ -478,7 +466,7 @@ void dax_unlock_mapping_entry(struct address_space *mapping, pgoff_t index)
 	}
 	unlock_slot(mapping, slot);
 	spin_unlock_irq(&mapping->tree_lock);
-	dax_wake_mapping_entry_waiter(mapping, index, false);
+	dax_wake_mapping_entry_waiter(slot, false);
 }
 
 static void put_locked_mapping_entry(struct address_space *mapping,
@@ -496,14 +484,13 @@ static void put_locked_mapping_entry(struct address_space *mapping,
  * Called when we are done with radix tree entry we looked up via
  * get_unlocked_mapping_entry() and which we didn't lock in the end.
  */
-static void put_unlocked_mapping_entry(struct address_space *mapping,
-				       pgoff_t index, void *entry)
+static void put_unlocked_mapping_entry(void **slot, void *entry)
 {
 	if (!radix_tree_exceptional_entry(entry))
 		return;
 
 	/* We have to wake up next waiter for the radix tree entry lock */
-	dax_wake_mapping_entry_waiter(mapping, index, false);
+	dax_wake_mapping_entry_waiter(slot, false);
 }
 
 /*
@@ -512,10 +499,10 @@ static void put_unlocked_mapping_entry(struct address_space *mapping,
  */
 int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index)
 {
-	void *entry;
+	void *entry, **slot;
 
 	spin_lock_irq(&mapping->tree_lock);
-	entry = get_unlocked_mapping_entry(mapping, index, NULL);
+	entry = get_unlocked_mapping_entry(mapping, index, &slot);
 	/*
 	 * This gets called from truncate / punch_hole path. As such, the caller
 	 * must hold locks protecting against concurrent modifications of the
@@ -530,7 +517,7 @@ int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index)
 	radix_tree_delete(&mapping->page_tree, index);
 	mapping->nrexceptional--;
 	spin_unlock_irq(&mapping->tree_lock);
-	dax_wake_mapping_entry_waiter(mapping, index, true);
+	dax_wake_mapping_entry_waiter(slot, true);
 
 	return 1;
 }
@@ -1118,15 +1105,15 @@ int dax_pfn_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 {
 	struct file *file = vma->vm_file;
 	struct address_space *mapping = file->f_mapping;
-	void *entry;
+	void *entry, **slot;
 	pgoff_t index = vmf->pgoff;
 
 	spin_lock_irq(&mapping->tree_lock);
-	entry = get_unlocked_mapping_entry(mapping, index, NULL);
+	entry = get_unlocked_mapping_entry(mapping, index, &slot);
 	if (!entry || !radix_tree_exceptional_entry(entry))
 		goto out;
 	radix_tree_tag_set(&mapping->page_tree, index, PAGECACHE_TAG_DIRTY);
-	put_unlocked_mapping_entry(mapping, index, entry);
+	put_unlocked_mapping_entry(slot, entry);
 out:
 	spin_unlock_irq(&mapping->tree_lock);
 	return VM_FAULT_NOPAGE;
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 9c6dc77..8bcb852 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -15,8 +15,7 @@ int dax_zero_page_range(struct inode *, loff_t from, unsigned len, get_block_t);
 int dax_truncate_page(struct inode *, loff_t from, get_block_t);
 int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t);
 int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index);
-void dax_wake_mapping_entry_waiter(struct address_space *mapping,
-				   pgoff_t index, bool wake_all);
+void dax_wake_mapping_entry_waiter(void **slot, bool wake_all);
 
 #ifdef CONFIG_FS_DAX
 struct page *read_dax_sector(struct block_device *bdev, sector_t n);
diff --git a/mm/filemap.c b/mm/filemap.c
index 8a287df..56c4ac7 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -617,8 +617,7 @@ static int page_cache_tree_insert(struct address_space *mapping,
 			if (node)
 				workingset_node_pages_dec(node);
 			/* Wakeup waiters for exceptional entry lock */
-			dax_wake_mapping_entry_waiter(mapping, page->index,
-						      false);
+			dax_wake_mapping_entry_waiter(slot, false);
 		}
 	}
 	radix_tree_replace_slot(slot, page);
-- 
2.9.0

_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH 5/7] dax: lock based on slot instead of [mapping, index]
@ 2016-08-15 19:09   ` Ross Zwisler
  0 siblings, 0 replies; 72+ messages in thread
From: Ross Zwisler @ 2016-08-15 19:09 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Dan Williams, Dave Chinner, Jan Kara, linux-ext4,
	linux-fsdevel, linux-mm, linux-nvdimm

DAX radix tree locking currently locks entries based on the unique
combination of the 'mapping' pointer and the pgoff_t 'index' for the entry.
This works for PTEs, but as we move to PMDs we will need to have all the
offsets within the range covered by the PMD to map to the same bit lock.
To accomplish this, lock based on the 'slot' pointer in the radix tree
instead of [mapping, index].

When a PMD entry is present in the tree, all offsets will map to the same
'slot' via radix tree lookups, and they will all share the same locking.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/dax.c            | 59 +++++++++++++++++++++--------------------------------
 include/linux/dax.h |  3 +--
 mm/filemap.c        |  3 +--
 3 files changed, 25 insertions(+), 40 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index fed6a52..0f1d053 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -62,11 +62,10 @@ static int __init init_dax_wait_table(void)
 }
 fs_initcall(init_dax_wait_table);
 
-static wait_queue_head_t *dax_entry_waitqueue(struct address_space *mapping,
-					      pgoff_t index)
+static wait_queue_head_t *dax_entry_waitqueue(void **slot)
 {
-	unsigned long hash = hash_long((unsigned long)mapping ^ index,
-				       DAX_WAIT_TABLE_BITS);
+	unsigned long hash = hash_long((unsigned long)slot,
+					DAX_WAIT_TABLE_BITS);
 	return wait_table + hash;
 }
 
@@ -281,25 +280,19 @@ EXPORT_SYMBOL_GPL(dax_do_io);
 /*
  * DAX radix tree locking
  */
-struct exceptional_entry_key {
-	struct address_space *mapping;
-	unsigned long index;
-};
-
 struct wait_exceptional_entry_queue {
 	wait_queue_t wait;
-	struct exceptional_entry_key key;
+	void **slot;
 };
 
 static int wake_exceptional_entry_func(wait_queue_t *wait, unsigned int mode,
 				       int sync, void *keyp)
 {
-	struct exceptional_entry_key *key = keyp;
+	void **slot = keyp;
 	struct wait_exceptional_entry_queue *ewait =
 		container_of(wait, struct wait_exceptional_entry_queue, wait);
 
-	if (key->mapping != ewait->key.mapping ||
-	    key->index != ewait->key.index)
+	if (slot != ewait->slot)
 		return 0;
 	return autoremove_wake_function(wait, mode, sync, NULL);
 }
@@ -357,12 +350,10 @@ static void *get_unlocked_mapping_entry(struct address_space *mapping,
 {
 	void *ret, **slot;
 	struct wait_exceptional_entry_queue ewait;
-	wait_queue_head_t *wq = dax_entry_waitqueue(mapping, index);
+	wait_queue_head_t *wq;
 
 	init_wait(&ewait.wait);
 	ewait.wait.func = wake_exceptional_entry_func;
-	ewait.key.mapping = mapping;
-	ewait.key.index = index;
 
 	for (;;) {
 		ret = __radix_tree_lookup(&mapping->page_tree, index, NULL,
@@ -373,6 +364,9 @@ static void *get_unlocked_mapping_entry(struct address_space *mapping,
 				*slotp = slot;
 			return ret;
 		}
+
+		wq = dax_entry_waitqueue(slot);
+		ewait.slot = slot;
 		prepare_to_wait_exclusive(wq, &ewait.wait,
 					  TASK_UNINTERRUPTIBLE);
 		spin_unlock_irq(&mapping->tree_lock);
@@ -445,10 +439,9 @@ restart:
 	return entry;
 }
 
-void dax_wake_mapping_entry_waiter(struct address_space *mapping,
-				   pgoff_t index, bool wake_all)
+void dax_wake_mapping_entry_waiter(void **slot, bool wake_all)
 {
-	wait_queue_head_t *wq = dax_entry_waitqueue(mapping, index);
+	wait_queue_head_t *wq = dax_entry_waitqueue(slot);
 
 	/*
 	 * Checking for locked entry and prepare_to_wait_exclusive() happens
@@ -456,13 +449,8 @@ void dax_wake_mapping_entry_waiter(struct address_space *mapping,
 	 * So at this point all tasks that could have seen our entry locked
 	 * must be in the waitqueue and the following check will see them.
 	 */
-	if (waitqueue_active(wq)) {
-		struct exceptional_entry_key key;
-
-		key.mapping = mapping;
-		key.index = index;
-		__wake_up(wq, TASK_NORMAL, wake_all ? 0 : 1, &key);
-	}
+	if (waitqueue_active(wq))
+		__wake_up(wq, TASK_NORMAL, wake_all ? 0 : 1, slot);
 }
 
 void dax_unlock_mapping_entry(struct address_space *mapping, pgoff_t index)
@@ -478,7 +466,7 @@ void dax_unlock_mapping_entry(struct address_space *mapping, pgoff_t index)
 	}
 	unlock_slot(mapping, slot);
 	spin_unlock_irq(&mapping->tree_lock);
-	dax_wake_mapping_entry_waiter(mapping, index, false);
+	dax_wake_mapping_entry_waiter(slot, false);
 }
 
 static void put_locked_mapping_entry(struct address_space *mapping,
@@ -496,14 +484,13 @@ static void put_locked_mapping_entry(struct address_space *mapping,
  * Called when we are done with radix tree entry we looked up via
  * get_unlocked_mapping_entry() and which we didn't lock in the end.
  */
-static void put_unlocked_mapping_entry(struct address_space *mapping,
-				       pgoff_t index, void *entry)
+static void put_unlocked_mapping_entry(void **slot, void *entry)
 {
 	if (!radix_tree_exceptional_entry(entry))
 		return;
 
 	/* We have to wake up next waiter for the radix tree entry lock */
-	dax_wake_mapping_entry_waiter(mapping, index, false);
+	dax_wake_mapping_entry_waiter(slot, false);
 }
 
 /*
@@ -512,10 +499,10 @@ static void put_unlocked_mapping_entry(struct address_space *mapping,
  */
 int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index)
 {
-	void *entry;
+	void *entry, **slot;
 
 	spin_lock_irq(&mapping->tree_lock);
-	entry = get_unlocked_mapping_entry(mapping, index, NULL);
+	entry = get_unlocked_mapping_entry(mapping, index, &slot);
 	/*
 	 * This gets called from truncate / punch_hole path. As such, the caller
 	 * must hold locks protecting against concurrent modifications of the
@@ -530,7 +517,7 @@ int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index)
 	radix_tree_delete(&mapping->page_tree, index);
 	mapping->nrexceptional--;
 	spin_unlock_irq(&mapping->tree_lock);
-	dax_wake_mapping_entry_waiter(mapping, index, true);
+	dax_wake_mapping_entry_waiter(slot, true);
 
 	return 1;
 }
@@ -1118,15 +1105,15 @@ int dax_pfn_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 {
 	struct file *file = vma->vm_file;
 	struct address_space *mapping = file->f_mapping;
-	void *entry;
+	void *entry, **slot;
 	pgoff_t index = vmf->pgoff;
 
 	spin_lock_irq(&mapping->tree_lock);
-	entry = get_unlocked_mapping_entry(mapping, index, NULL);
+	entry = get_unlocked_mapping_entry(mapping, index, &slot);
 	if (!entry || !radix_tree_exceptional_entry(entry))
 		goto out;
 	radix_tree_tag_set(&mapping->page_tree, index, PAGECACHE_TAG_DIRTY);
-	put_unlocked_mapping_entry(mapping, index, entry);
+	put_unlocked_mapping_entry(slot, entry);
 out:
 	spin_unlock_irq(&mapping->tree_lock);
 	return VM_FAULT_NOPAGE;
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 9c6dc77..8bcb852 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -15,8 +15,7 @@ int dax_zero_page_range(struct inode *, loff_t from, unsigned len, get_block_t);
 int dax_truncate_page(struct inode *, loff_t from, get_block_t);
 int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t);
 int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index);
-void dax_wake_mapping_entry_waiter(struct address_space *mapping,
-				   pgoff_t index, bool wake_all);
+void dax_wake_mapping_entry_waiter(void **slot, bool wake_all);
 
 #ifdef CONFIG_FS_DAX
 struct page *read_dax_sector(struct block_device *bdev, sector_t n);
diff --git a/mm/filemap.c b/mm/filemap.c
index 8a287df..56c4ac7 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -617,8 +617,7 @@ static int page_cache_tree_insert(struct address_space *mapping,
 			if (node)
 				workingset_node_pages_dec(node);
 			/* Wakeup waiters for exceptional entry lock */
-			dax_wake_mapping_entry_waiter(mapping, page->index,
-						      false);
+			dax_wake_mapping_entry_waiter(slot, false);
 		}
 	}
 	radix_tree_replace_slot(slot, page);
-- 
2.9.0

^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH 5/7] dax: lock based on slot instead of [mapping, index]
@ 2016-08-15 19:09   ` Ross Zwisler
  0 siblings, 0 replies; 72+ messages in thread
From: Ross Zwisler @ 2016-08-15 19:09 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Dan Williams, Dave Chinner, Jan Kara, linux-ext4,
	linux-fsdevel, linux-mm, linux-nvdimm

DAX radix tree locking currently locks entries based on the unique
combination of the 'mapping' pointer and the pgoff_t 'index' for the entry.
This works for PTEs, but as we move to PMDs we will need all the
offsets within the range covered by the PMD to map to the same bit lock.
To accomplish this, lock based on the 'slot' pointer in the radix tree
instead of [mapping, index].

When a PMD entry is present in the tree, all offsets will map to the same
'slot' via radix tree lookups, and they will all share the same locking.
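
As a purely illustrative aside (not part of this patch), the effect can be
pictured with the small sketch below.  It assumes it lives in fs/dax.c next
to dax_entry_waitqueue(), that mapping->tree_lock is held, and that a
multi-order PMD entry (added later in this series) covers 'pmd_index':

/*
 * Sketch only: two offsets that land inside the same multi-order (PMD)
 * entry resolve to the same radix tree slot, so they hash to the same
 * wait queue and contend on the same bit lock.
 */
static void dax_slot_sharing_sketch(struct address_space *mapping,
				    pgoff_t pmd_index)
{
	void **slot_a, **slot_b;

	__radix_tree_lookup(&mapping->page_tree, pmd_index, NULL, &slot_a);
	__radix_tree_lookup(&mapping->page_tree, pmd_index + 1, NULL, &slot_b);

	WARN_ON(slot_a != slot_b);
	WARN_ON(dax_entry_waitqueue(slot_a) != dax_entry_waitqueue(slot_b));
}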

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/dax.c            | 59 +++++++++++++++++++++--------------------------------
 include/linux/dax.h |  3 +--
 mm/filemap.c        |  3 +--
 3 files changed, 25 insertions(+), 40 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index fed6a52..0f1d053 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -62,11 +62,10 @@ static int __init init_dax_wait_table(void)
 }
 fs_initcall(init_dax_wait_table);
 
-static wait_queue_head_t *dax_entry_waitqueue(struct address_space *mapping,
-					      pgoff_t index)
+static wait_queue_head_t *dax_entry_waitqueue(void **slot)
 {
-	unsigned long hash = hash_long((unsigned long)mapping ^ index,
-				       DAX_WAIT_TABLE_BITS);
+	unsigned long hash = hash_long((unsigned long)slot,
+					DAX_WAIT_TABLE_BITS);
 	return wait_table + hash;
 }
 
@@ -281,25 +280,19 @@ EXPORT_SYMBOL_GPL(dax_do_io);
 /*
  * DAX radix tree locking
  */
-struct exceptional_entry_key {
-	struct address_space *mapping;
-	unsigned long index;
-};
-
 struct wait_exceptional_entry_queue {
 	wait_queue_t wait;
-	struct exceptional_entry_key key;
+	void **slot;
 };
 
 static int wake_exceptional_entry_func(wait_queue_t *wait, unsigned int mode,
 				       int sync, void *keyp)
 {
-	struct exceptional_entry_key *key = keyp;
+	void **slot = keyp;
 	struct wait_exceptional_entry_queue *ewait =
 		container_of(wait, struct wait_exceptional_entry_queue, wait);
 
-	if (key->mapping != ewait->key.mapping ||
-	    key->index != ewait->key.index)
+	if (slot != ewait->slot)
 		return 0;
 	return autoremove_wake_function(wait, mode, sync, NULL);
 }
@@ -357,12 +350,10 @@ static void *get_unlocked_mapping_entry(struct address_space *mapping,
 {
 	void *ret, **slot;
 	struct wait_exceptional_entry_queue ewait;
-	wait_queue_head_t *wq = dax_entry_waitqueue(mapping, index);
+	wait_queue_head_t *wq;
 
 	init_wait(&ewait.wait);
 	ewait.wait.func = wake_exceptional_entry_func;
-	ewait.key.mapping = mapping;
-	ewait.key.index = index;
 
 	for (;;) {
 		ret = __radix_tree_lookup(&mapping->page_tree, index, NULL,
@@ -373,6 +364,9 @@ static void *get_unlocked_mapping_entry(struct address_space *mapping,
 				*slotp = slot;
 			return ret;
 		}
+
+		wq = dax_entry_waitqueue(slot);
+		ewait.slot = slot;
 		prepare_to_wait_exclusive(wq, &ewait.wait,
 					  TASK_UNINTERRUPTIBLE);
 		spin_unlock_irq(&mapping->tree_lock);
@@ -445,10 +439,9 @@ restart:
 	return entry;
 }
 
-void dax_wake_mapping_entry_waiter(struct address_space *mapping,
-				   pgoff_t index, bool wake_all)
+void dax_wake_mapping_entry_waiter(void **slot, bool wake_all)
 {
-	wait_queue_head_t *wq = dax_entry_waitqueue(mapping, index);
+	wait_queue_head_t *wq = dax_entry_waitqueue(slot);
 
 	/*
 	 * Checking for locked entry and prepare_to_wait_exclusive() happens
@@ -456,13 +449,8 @@ void dax_wake_mapping_entry_waiter(struct address_space *mapping,
 	 * So at this point all tasks that could have seen our entry locked
 	 * must be in the waitqueue and the following check will see them.
 	 */
-	if (waitqueue_active(wq)) {
-		struct exceptional_entry_key key;
-
-		key.mapping = mapping;
-		key.index = index;
-		__wake_up(wq, TASK_NORMAL, wake_all ? 0 : 1, &key);
-	}
+	if (waitqueue_active(wq))
+		__wake_up(wq, TASK_NORMAL, wake_all ? 0 : 1, slot);
 }
 
 void dax_unlock_mapping_entry(struct address_space *mapping, pgoff_t index)
@@ -478,7 +466,7 @@ void dax_unlock_mapping_entry(struct address_space *mapping, pgoff_t index)
 	}
 	unlock_slot(mapping, slot);
 	spin_unlock_irq(&mapping->tree_lock);
-	dax_wake_mapping_entry_waiter(mapping, index, false);
+	dax_wake_mapping_entry_waiter(slot, false);
 }
 
 static void put_locked_mapping_entry(struct address_space *mapping,
@@ -496,14 +484,13 @@ static void put_locked_mapping_entry(struct address_space *mapping,
  * Called when we are done with radix tree entry we looked up via
  * get_unlocked_mapping_entry() and which we didn't lock in the end.
  */
-static void put_unlocked_mapping_entry(struct address_space *mapping,
-				       pgoff_t index, void *entry)
+static void put_unlocked_mapping_entry(void **slot, void *entry)
 {
 	if (!radix_tree_exceptional_entry(entry))
 		return;
 
 	/* We have to wake up next waiter for the radix tree entry lock */
-	dax_wake_mapping_entry_waiter(mapping, index, false);
+	dax_wake_mapping_entry_waiter(slot, false);
 }
 
 /*
@@ -512,10 +499,10 @@ static void put_unlocked_mapping_entry(struct address_space *mapping,
  */
 int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index)
 {
-	void *entry;
+	void *entry, **slot;
 
 	spin_lock_irq(&mapping->tree_lock);
-	entry = get_unlocked_mapping_entry(mapping, index, NULL);
+	entry = get_unlocked_mapping_entry(mapping, index, &slot);
 	/*
 	 * This gets called from truncate / punch_hole path. As such, the caller
 	 * must hold locks protecting against concurrent modifications of the
@@ -530,7 +517,7 @@ int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index)
 	radix_tree_delete(&mapping->page_tree, index);
 	mapping->nrexceptional--;
 	spin_unlock_irq(&mapping->tree_lock);
-	dax_wake_mapping_entry_waiter(mapping, index, true);
+	dax_wake_mapping_entry_waiter(slot, true);
 
 	return 1;
 }
@@ -1118,15 +1105,15 @@ int dax_pfn_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 {
 	struct file *file = vma->vm_file;
 	struct address_space *mapping = file->f_mapping;
-	void *entry;
+	void *entry, **slot;
 	pgoff_t index = vmf->pgoff;
 
 	spin_lock_irq(&mapping->tree_lock);
-	entry = get_unlocked_mapping_entry(mapping, index, NULL);
+	entry = get_unlocked_mapping_entry(mapping, index, &slot);
 	if (!entry || !radix_tree_exceptional_entry(entry))
 		goto out;
 	radix_tree_tag_set(&mapping->page_tree, index, PAGECACHE_TAG_DIRTY);
-	put_unlocked_mapping_entry(mapping, index, entry);
+	put_unlocked_mapping_entry(slot, entry);
 out:
 	spin_unlock_irq(&mapping->tree_lock);
 	return VM_FAULT_NOPAGE;
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 9c6dc77..8bcb852 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -15,8 +15,7 @@ int dax_zero_page_range(struct inode *, loff_t from, unsigned len, get_block_t);
 int dax_truncate_page(struct inode *, loff_t from, get_block_t);
 int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t);
 int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index);
-void dax_wake_mapping_entry_waiter(struct address_space *mapping,
-				   pgoff_t index, bool wake_all);
+void dax_wake_mapping_entry_waiter(void **slot, bool wake_all);
 
 #ifdef CONFIG_FS_DAX
 struct page *read_dax_sector(struct block_device *bdev, sector_t n);
diff --git a/mm/filemap.c b/mm/filemap.c
index 8a287df..56c4ac7 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -617,8 +617,7 @@ static int page_cache_tree_insert(struct address_space *mapping,
 			if (node)
 				workingset_node_pages_dec(node);
 			/* Wakeup waiters for exceptional entry lock */
-			dax_wake_mapping_entry_waiter(mapping, page->index,
-						      false);
+			dax_wake_mapping_entry_waiter(slot, false);
 		}
 	}
 	radix_tree_replace_slot(slot, page);
-- 
2.9.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .

^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH 5/7] dax: lock based on slot instead of [mapping, index]
@ 2016-08-15 19:09   ` Ross Zwisler
  0 siblings, 0 replies; 72+ messages in thread
From: Ross Zwisler @ 2016-08-15 19:09 UTC (permalink / raw)
  To: linux-kernel
  Cc: Theodore Ts'o, Andrew Morton, linux-nvdimm, Dave Chinner,
	linux-mm, Andreas Dilger, Alexander Viro, Jan Kara,
	linux-fsdevel, linux-ext4

DAX radix tree locking currently locks entries based on the unique
combination of the 'mapping' pointer and the pgoff_t 'index' for the entry.
This works for PTEs, but as we move to PMDs we will need all the
offsets within the range covered by the PMD to map to the same bit lock.
To accomplish this, lock based on the 'slot' pointer in the radix tree
instead of [mapping, index].

When a PMD entry is present in the tree, all offsets will map to the same
'slot' via radix tree lookups, and they will all share the same locking.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/dax.c            | 59 +++++++++++++++++++++--------------------------------
 include/linux/dax.h |  3 +--
 mm/filemap.c        |  3 +--
 3 files changed, 25 insertions(+), 40 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index fed6a52..0f1d053 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -62,11 +62,10 @@ static int __init init_dax_wait_table(void)
 }
 fs_initcall(init_dax_wait_table);
 
-static wait_queue_head_t *dax_entry_waitqueue(struct address_space *mapping,
-					      pgoff_t index)
+static wait_queue_head_t *dax_entry_waitqueue(void **slot)
 {
-	unsigned long hash = hash_long((unsigned long)mapping ^ index,
-				       DAX_WAIT_TABLE_BITS);
+	unsigned long hash = hash_long((unsigned long)slot,
+					DAX_WAIT_TABLE_BITS);
 	return wait_table + hash;
 }
 
@@ -281,25 +280,19 @@ EXPORT_SYMBOL_GPL(dax_do_io);
 /*
  * DAX radix tree locking
  */
-struct exceptional_entry_key {
-	struct address_space *mapping;
-	unsigned long index;
-};
-
 struct wait_exceptional_entry_queue {
 	wait_queue_t wait;
-	struct exceptional_entry_key key;
+	void **slot;
 };
 
 static int wake_exceptional_entry_func(wait_queue_t *wait, unsigned int mode,
 				       int sync, void *keyp)
 {
-	struct exceptional_entry_key *key = keyp;
+	void **slot = keyp;
 	struct wait_exceptional_entry_queue *ewait =
 		container_of(wait, struct wait_exceptional_entry_queue, wait);
 
-	if (key->mapping != ewait->key.mapping ||
-	    key->index != ewait->key.index)
+	if (slot != ewait->slot)
 		return 0;
 	return autoremove_wake_function(wait, mode, sync, NULL);
 }
@@ -357,12 +350,10 @@ static void *get_unlocked_mapping_entry(struct address_space *mapping,
 {
 	void *ret, **slot;
 	struct wait_exceptional_entry_queue ewait;
-	wait_queue_head_t *wq = dax_entry_waitqueue(mapping, index);
+	wait_queue_head_t *wq;
 
 	init_wait(&ewait.wait);
 	ewait.wait.func = wake_exceptional_entry_func;
-	ewait.key.mapping = mapping;
-	ewait.key.index = index;
 
 	for (;;) {
 		ret = __radix_tree_lookup(&mapping->page_tree, index, NULL,
@@ -373,6 +364,9 @@ static void *get_unlocked_mapping_entry(struct address_space *mapping,
 				*slotp = slot;
 			return ret;
 		}
+
+		wq = dax_entry_waitqueue(slot);
+		ewait.slot = slot;
 		prepare_to_wait_exclusive(wq, &ewait.wait,
 					  TASK_UNINTERRUPTIBLE);
 		spin_unlock_irq(&mapping->tree_lock);
@@ -445,10 +439,9 @@ restart:
 	return entry;
 }
 
-void dax_wake_mapping_entry_waiter(struct address_space *mapping,
-				   pgoff_t index, bool wake_all)
+void dax_wake_mapping_entry_waiter(void **slot, bool wake_all)
 {
-	wait_queue_head_t *wq = dax_entry_waitqueue(mapping, index);
+	wait_queue_head_t *wq = dax_entry_waitqueue(slot);
 
 	/*
 	 * Checking for locked entry and prepare_to_wait_exclusive() happens
@@ -456,13 +449,8 @@ void dax_wake_mapping_entry_waiter(struct address_space *mapping,
 	 * So at this point all tasks that could have seen our entry locked
 	 * must be in the waitqueue and the following check will see them.
 	 */
-	if (waitqueue_active(wq)) {
-		struct exceptional_entry_key key;
-
-		key.mapping = mapping;
-		key.index = index;
-		__wake_up(wq, TASK_NORMAL, wake_all ? 0 : 1, &key);
-	}
+	if (waitqueue_active(wq))
+		__wake_up(wq, TASK_NORMAL, wake_all ? 0 : 1, slot);
 }
 
 void dax_unlock_mapping_entry(struct address_space *mapping, pgoff_t index)
@@ -478,7 +466,7 @@ void dax_unlock_mapping_entry(struct address_space *mapping, pgoff_t index)
 	}
 	unlock_slot(mapping, slot);
 	spin_unlock_irq(&mapping->tree_lock);
-	dax_wake_mapping_entry_waiter(mapping, index, false);
+	dax_wake_mapping_entry_waiter(slot, false);
 }
 
 static void put_locked_mapping_entry(struct address_space *mapping,
@@ -496,14 +484,13 @@ static void put_locked_mapping_entry(struct address_space *mapping,
  * Called when we are done with radix tree entry we looked up via
  * get_unlocked_mapping_entry() and which we didn't lock in the end.
  */
-static void put_unlocked_mapping_entry(struct address_space *mapping,
-				       pgoff_t index, void *entry)
+static void put_unlocked_mapping_entry(void **slot, void *entry)
 {
 	if (!radix_tree_exceptional_entry(entry))
 		return;
 
 	/* We have to wake up next waiter for the radix tree entry lock */
-	dax_wake_mapping_entry_waiter(mapping, index, false);
+	dax_wake_mapping_entry_waiter(slot, false);
 }
 
 /*
@@ -512,10 +499,10 @@ static void put_unlocked_mapping_entry(struct address_space *mapping,
  */
 int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index)
 {
-	void *entry;
+	void *entry, **slot;
 
 	spin_lock_irq(&mapping->tree_lock);
-	entry = get_unlocked_mapping_entry(mapping, index, NULL);
+	entry = get_unlocked_mapping_entry(mapping, index, &slot);
 	/*
 	 * This gets called from truncate / punch_hole path. As such, the caller
 	 * must hold locks protecting against concurrent modifications of the
@@ -530,7 +517,7 @@ int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index)
 	radix_tree_delete(&mapping->page_tree, index);
 	mapping->nrexceptional--;
 	spin_unlock_irq(&mapping->tree_lock);
-	dax_wake_mapping_entry_waiter(mapping, index, true);
+	dax_wake_mapping_entry_waiter(slot, true);
 
 	return 1;
 }
@@ -1118,15 +1105,15 @@ int dax_pfn_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 {
 	struct file *file = vma->vm_file;
 	struct address_space *mapping = file->f_mapping;
-	void *entry;
+	void *entry, **slot;
 	pgoff_t index = vmf->pgoff;
 
 	spin_lock_irq(&mapping->tree_lock);
-	entry = get_unlocked_mapping_entry(mapping, index, NULL);
+	entry = get_unlocked_mapping_entry(mapping, index, &slot);
 	if (!entry || !radix_tree_exceptional_entry(entry))
 		goto out;
 	radix_tree_tag_set(&mapping->page_tree, index, PAGECACHE_TAG_DIRTY);
-	put_unlocked_mapping_entry(mapping, index, entry);
+	put_unlocked_mapping_entry(slot, entry);
 out:
 	spin_unlock_irq(&mapping->tree_lock);
 	return VM_FAULT_NOPAGE;
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 9c6dc77..8bcb852 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -15,8 +15,7 @@ int dax_zero_page_range(struct inode *, loff_t from, unsigned len, get_block_t);
 int dax_truncate_page(struct inode *, loff_t from, get_block_t);
 int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t);
 int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index);
-void dax_wake_mapping_entry_waiter(struct address_space *mapping,
-				   pgoff_t index, bool wake_all);
+void dax_wake_mapping_entry_waiter(void **slot, bool wake_all);
 
 #ifdef CONFIG_FS_DAX
 struct page *read_dax_sector(struct block_device *bdev, sector_t n);
diff --git a/mm/filemap.c b/mm/filemap.c
index 8a287df..56c4ac7 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -617,8 +617,7 @@ static int page_cache_tree_insert(struct address_space *mapping,
 			if (node)
 				workingset_node_pages_dec(node);
 			/* Wakeup waiters for exceptional entry lock */
-			dax_wake_mapping_entry_waiter(mapping, page->index,
-						      false);
+			dax_wake_mapping_entry_waiter(slot, false);
 		}
 	}
 	radix_tree_replace_slot(slot, page);
-- 
2.9.0

^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH 6/7] dax: re-enable DAX PMD support
  2016-08-15 19:09 ` Ross Zwisler
@ 2016-08-15 19:09   ` Ross Zwisler
  -1 siblings, 0 replies; 72+ messages in thread
From: Ross Zwisler @ 2016-08-15 19:09 UTC (permalink / raw)
  To: linux-kernel
  Cc: Theodore Ts'o, Andrew Morton, linux-nvdimm, Dave Chinner,
	linux-mm, Andreas Dilger, Alexander Viro, Jan Kara,
	linux-fsdevel, linux-ext4

DAX PMDs have been disabled since Jan Kara introduced DAX radix tree based
locking.  This patch allows DAX PMDs to participate in the DAX radix tree
based locking scheme so that they can be re-enabled.

There are currently three types of DAX 4k entries: 4k zero pages, 4k DAX
mappings that have an associated block allocation, and 4k DAX empty
entries.  The empty entries exist to provide locking for the duration of a
given page fault.

This patch adds three equivalent 2MiB DAX entries: Huge Zero Page (HZP)
entries, PMD DAX entries that have associated block allocations, and 2 MiB
DAX empty entries.
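
As a hedged illustration (not part of the patch text proper), the three 2 MiB
flavors can be built with the macros this patch adds to include/linux/dax.h;
all of them begin life as locked exceptional entries:

/* Sketch only; the sector value is a placeholder. */
static void dax_pmd_entry_flavors_sketch(void)
{
	sector_t sector = 0;
	void *hzp, *block, *empty;

	hzp   = RADIX_DAX_HZP_ENTRY();                  /* 2 MiB huge zero page */
	block = RADIX_DAX_ENTRY(sector, RADIX_DAX_PMD); /* 2 MiB block-backed   */
	empty = RADIX_DAX_EMPTY_ENTRY(RADIX_DAX_PMD);   /* 2 MiB empty/locking  */
}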

Unlike the 4k case where we insert a struct page* into the radix tree for
4k zero pages, for HZP we insert a DAX exceptional entry with the new
RADIX_DAX_HZP flag set.  This is because we use a single 2 MiB zero page in
every 2 MiB hole mapping, and it doesn't make sense to have that same struct
page* with multiple entries in multiple trees.  This would cause contention
on the single page lock for the one Huge Zero Page, and it would break the
page->index and page->mapping associations that are assumed to be valid in
many other places in the kernel.
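
Purely as a decoder-ring sketch (again, not code from this patch), the
resulting entry kinds can be told apart as below; only the 4k hole case
stores a struct page pointer, everything else is an exceptional entry:

/* Sketch only; assumes an entry written by this series. */
static const char *dax_entry_kind_sketch(void *entry)
{
	unsigned long v = (unsigned long)entry;

	if (!radix_tree_exceptional_entry(entry))
		return "4k zero page (struct page *)";
	if (RADIX_DAX_TYPE(entry) == RADIX_DAX_PTE)
		return (v & RADIX_DAX_EMPTY) ? "4k empty" : "4k block-backed";
	/* HZP entries also carry RADIX_DAX_EMPTY, so test HZP first. */
	if (v & RADIX_DAX_HZP)
		return "2 MiB huge zero page";
	return (v & RADIX_DAX_EMPTY) ? "2 MiB empty" : "2 MiB block-backed";
}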

One difficult use case is when one thread is trying to use 4k entries in
the radix tree for a given offset, and another thread is using 2 MiB entries
for that same offset.  The current code handles this by making the 2 MiB
user fall back to 4k entries for most cases.  This was done because it is
the simplest solution, and because the use of 2MiB pages is already
opportunistic.
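
In condensed form (an excerpt-style sketch of the code further down, with
tree_lock handling and error paths elided), the two directions of that
conflict handling look like this:

	/* In grab_mapping_entry(), after looking up the current entry: */
	if (entry && new_type == RADIX_DAX_PMD) {
		if (!radix_tree_exceptional_entry(entry) ||
		    RADIX_DAX_TYPE(entry) == RADIX_DAX_PTE)
			return ERR_PTR(-EEXIST);	/* PMD user falls back to 4k */
	} else if (entry && new_type == RADIX_DAX_PTE) {
		if (radix_tree_exceptional_entry(entry) &&
		    RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD &&
		    ((unsigned long)entry & (RADIX_DAX_HZP | RADIX_DAX_EMPTY)))
			pmd_downgrade = true;		/* replace PMD with a 4k entry */
	}

	/* ... and in dax_pmd_fault(), -EEXIST becomes a plain fallback: */
	entry = grab_mapping_entry(mapping, pgoff, RADIX_DAX_PMD);
	if (IS_ERR(entry))
		return VM_FAULT_FALLBACK;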

If we were to try to upgrade from 4k pages to 2MiB pages for a given range,
we would run into the problem of how to lock out 4k page faults for the entire
2MiB range while we clean out the radix tree so we can insert the 2MiB
entry.  We can solve this problem if we need to, but I think that the cases
where both 2MiB entries and 4k entries are being used for the same range
will be rare enough and the gain small enough that it probably won't be
worth the complexity.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/dax.c            | 206 +++++++++++++++++++++++++++++++---------------------
 include/linux/dax.h |  27 ++++++-
 mm/filemap.c        |   4 +-
 3 files changed, 149 insertions(+), 88 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 0f1d053..482e616 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -32,20 +32,6 @@
 #include <linux/pfn_t.h>
 #include <linux/sizes.h>
 
-/*
- * We use lowest available bit in exceptional entry for locking, other two
- * bits to determine entry type. In total 3 special bits.
- */
-#define RADIX_DAX_SHIFT	(RADIX_TREE_EXCEPTIONAL_SHIFT + 3)
-#define RADIX_DAX_PTE (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 1))
-#define RADIX_DAX_PMD (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 2))
-#define RADIX_DAX_TYPE_MASK (RADIX_DAX_PTE | RADIX_DAX_PMD)
-#define RADIX_DAX_TYPE(entry) ((unsigned long)entry & RADIX_DAX_TYPE_MASK)
-#define RADIX_DAX_SECTOR(entry) (((unsigned long)entry >> RADIX_DAX_SHIFT))
-#define RADIX_DAX_ENTRY(sector, pmd) ((void *)((unsigned long)sector << \
-		RADIX_DAX_SHIFT | (pmd ? RADIX_DAX_PMD : RADIX_DAX_PTE) | \
-		RADIX_TREE_EXCEPTIONAL_ENTRY))
-
 /* We choose 4096 entries - same as per-zone page wait tables */
 #define DAX_WAIT_TABLE_BITS 12
 #define DAX_WAIT_TABLE_ENTRIES (1 << DAX_WAIT_TABLE_BITS)
@@ -386,15 +372,32 @@ static void *get_unlocked_mapping_entry(struct address_space *mapping,
  * persistent memory the benefit is doubtful. We can add that later if we can
  * show it helps.
  */
-static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index)
+static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index,
+		unsigned long new_type)
 {
+	bool pmd_downgrade = false;
 	void *entry, **slot;
 
 restart:
 	spin_lock_irq(&mapping->tree_lock);
 	entry = get_unlocked_mapping_entry(mapping, index, &slot);
+
+	if (entry && new_type == RADIX_DAX_PMD) {
+		if (!radix_tree_exceptional_entry(entry) ||
+				RADIX_DAX_TYPE(entry) == RADIX_DAX_PTE) {
+			spin_unlock_irq(&mapping->tree_lock);
+			return ERR_PTR(-EEXIST);
+		}
+	} else if (entry && new_type == RADIX_DAX_PTE) {
+		if (radix_tree_exceptional_entry(entry) &&
+		    RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD &&
+		    (unsigned long)entry & (RADIX_DAX_HZP|RADIX_DAX_EMPTY)) {
+			pmd_downgrade = true;
+		}
+	}
+
 	/* No entry for given index? Make sure radix tree is big enough. */
-	if (!entry) {
+	if (!entry || pmd_downgrade) {
 		int err;
 
 		spin_unlock_irq(&mapping->tree_lock);
@@ -402,15 +405,27 @@ restart:
 				mapping_gfp_mask(mapping) & ~__GFP_HIGHMEM);
 		if (err)
 			return ERR_PTR(err);
-		entry = (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY |
-			       RADIX_DAX_ENTRY_LOCK);
+
+		if (pmd_downgrade && ((unsigned long)entry & RADIX_DAX_HZP))
+			unmap_mapping_range(mapping,
+				(index << PAGE_SHIFT) & PMD_MASK, PMD_SIZE, 0);
+
+		entry = RADIX_DAX_EMPTY_ENTRY(new_type);
 		spin_lock_irq(&mapping->tree_lock);
-		err = radix_tree_insert(&mapping->page_tree, index, entry);
+
+		if (pmd_downgrade) {
+			radix_tree_delete(&mapping->page_tree, index);
+			mapping->nrexceptional--;
+			dax_wake_mapping_entry_waiter(slot, false);
+		}
+
+		err = __radix_tree_insert(&mapping->page_tree, index,
+				RADIX_DAX_ORDER(new_type), entry);
 		radix_tree_preload_end();
 		if (err) {
 			spin_unlock_irq(&mapping->tree_lock);
 			/* Someone already created the entry? */
-			if (err == -EEXIST)
+			if (err == -EEXIST && new_type == RADIX_DAX_PTE)
 				goto restart;
 			return ERR_PTR(err);
 		}
@@ -571,15 +586,15 @@ static int copy_user_bh(struct page *to, struct inode *inode,
 	return 0;
 }
 
-#define DAX_PMD_INDEX(page_index) (page_index & (PMD_MASK >> PAGE_SHIFT))
-
 static void *dax_insert_mapping_entry(struct address_space *mapping,
 				      struct vm_fault *vmf,
-				      void *entry, sector_t sector)
+				      void *entry, sector_t sector,
+				      unsigned long new_type, bool hzp)
 {
 	struct radix_tree_root *page_tree = &mapping->page_tree;
 	int error = 0;
 	bool hole_fill = false;
+	bool hzp_fill = false;
 	void *new_entry;
 	pgoff_t index = vmf->pgoff;
 
@@ -598,22 +613,30 @@ static void *dax_insert_mapping_entry(struct address_space *mapping,
 		error = radix_tree_preload(vmf->gfp_mask & ~__GFP_HIGHMEM);
 		if (error)
 			return ERR_PTR(error);
+	} else if ((unsigned long)entry & RADIX_DAX_HZP && !hzp) {
+		hzp_fill = true;
+		unmap_mapping_range(mapping,
+			(vmf->pgoff << PAGE_SHIFT) & PMD_MASK, PMD_SIZE, 0);
 	}
 
 	spin_lock_irq(&mapping->tree_lock);
-	new_entry = (void *)((unsigned long)RADIX_DAX_ENTRY(sector, false) |
-		       RADIX_DAX_ENTRY_LOCK);
+	if (hzp)
+		new_entry = RADIX_DAX_HZP_ENTRY();
+	else
+		new_entry = RADIX_DAX_ENTRY(sector, new_type);
+
 	if (hole_fill) {
 		__delete_from_page_cache(entry, NULL);
 		/* Drop pagecache reference */
 		put_page(entry);
-		error = radix_tree_insert(page_tree, index, new_entry);
+		error = __radix_tree_insert(page_tree, index,
+				RADIX_DAX_ORDER(new_type), new_entry);
 		if (error) {
 			new_entry = ERR_PTR(error);
 			goto unlock;
 		}
 		mapping->nrexceptional++;
-	} else {
+	} else if ((unsigned long)entry & (RADIX_DAX_HZP|RADIX_DAX_EMPTY)) {
 		void **slot;
 		void *ret;
 
@@ -669,6 +692,18 @@ static int dax_writeback_one(struct block_device *bdev,
 		goto unlock;
 	}
 
+	if (WARN_ON_ONCE((unsigned long)entry & RADIX_DAX_EMPTY)) {
+		ret = -EIO;
+		goto unlock;
+	}
+
+	/*
+	 * Even if dax_writeback_mapping_range() was given a wbc->range_start
+	 * in the middle of a PMD, the 'index' we are given will be aligned to
+	 * the start index of the PMD, as will the sector we pull from
+	 * 'entry'.  This allows us to flush for PMD_SIZE and not have to
+	 * worry about partial PMD writebacks.
+	 */
 	dax.sector = RADIX_DAX_SECTOR(entry);
 	dax.size = (type == RADIX_DAX_PMD ? PMD_SIZE : PAGE_SIZE);
 	spin_unlock_irq(&mapping->tree_lock);
@@ -709,12 +744,11 @@ int dax_writeback_mapping_range(struct address_space *mapping,
 		struct block_device *bdev, struct writeback_control *wbc)
 {
 	struct inode *inode = mapping->host;
-	pgoff_t start_index, end_index, pmd_index;
+	pgoff_t start_index, end_index;
 	pgoff_t indices[PAGEVEC_SIZE];
 	struct pagevec pvec;
 	bool done = false;
 	int i, ret = 0;
-	void *entry;
 
 	if (WARN_ON_ONCE(inode->i_blkbits != PAGE_SHIFT))
 		return -EIO;
@@ -724,15 +758,6 @@ int dax_writeback_mapping_range(struct address_space *mapping,
 
 	start_index = wbc->range_start >> PAGE_SHIFT;
 	end_index = wbc->range_end >> PAGE_SHIFT;
-	pmd_index = DAX_PMD_INDEX(start_index);
-
-	rcu_read_lock();
-	entry = radix_tree_lookup(&mapping->page_tree, pmd_index);
-	rcu_read_unlock();
-
-	/* see if the start of our range is covered by a PMD entry */
-	if (entry && RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD)
-		start_index = pmd_index;
 
 	tag_pages_for_writeback(mapping, start_index, end_index);
 
@@ -778,7 +803,8 @@ static int dax_insert_mapping(struct address_space *mapping,
 		return PTR_ERR(dax.addr);
 	dax_unmap_atomic(bdev, &dax);
 
-	ret = dax_insert_mapping_entry(mapping, vmf, entry, dax.sector);
+	ret = dax_insert_mapping_entry(mapping, vmf, entry, dax.sector,
+			RADIX_DAX_PTE, false);
 	if (IS_ERR(ret))
 		return PTR_ERR(ret);
 	*entryp = ret;
@@ -825,7 +851,7 @@ int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 	bh.b_bdev = inode->i_sb->s_bdev;
 	bh.b_size = PAGE_SIZE;
 
-	entry = grab_mapping_entry(mapping, vmf->pgoff);
+	entry = grab_mapping_entry(mapping, vmf->pgoff, RADIX_DAX_PTE);
 	if (IS_ERR(entry)) {
 		error = PTR_ERR(entry);
 		goto out;
@@ -929,9 +955,11 @@ int dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 	bool write = flags & FAULT_FLAG_WRITE;
 	struct block_device *bdev;
 	pgoff_t size, pgoff;
+	struct vm_fault vmf;
 	sector_t block;
 	int result = 0;
-	bool alloc = false;
+	void *entry, *ret;
+
 
 	/* dax pmd mappings require pfn_t_devmap() */
 	if (!IS_ENABLED(CONFIG_FS_DAX_PMD))
@@ -953,6 +981,11 @@ int dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 		return VM_FAULT_FALLBACK;
 	}
 
+	/*
+	 * Check whether offset isn't beyond end of file now. Caller is supposed
+	 * to hold locks serializing us with truncate / punch hole so this is
+	 * a reliable test.
+	 */
 	pgoff = linear_page_index(vma, pmd_addr);
 	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
 	if (pgoff >= size)
@@ -970,37 +1003,45 @@ int dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 
 	bh.b_size = PMD_SIZE;
 
-	if (get_block(inode, block, &bh, 0) != 0)
-		return VM_FAULT_SIGBUS;
+	/*
+	 * grab_mapping_entry() will make sure we get a 2M empty entry, a DAX
+	 * PMD or a HZP entry.  If it can't (because a 4k page is already in
+	 * the tree, for instance), it will return -EEXIST and we just fall
+	 * back to 4k entries.
+	 */
+	entry = grab_mapping_entry(mapping, pgoff, RADIX_DAX_PMD);
+	if (IS_ERR(entry))
+		return VM_FAULT_FALLBACK;
+
+	if (get_block(inode, block, &bh, 0) != 0) {
+		result = VM_FAULT_SIGBUS;
+		goto unlock_entry;
+	}
 
 	if (!buffer_mapped(&bh) && write) {
-		if (get_block(inode, block, &bh, 1) != 0)
-			return VM_FAULT_SIGBUS;
-		alloc = true;
-		WARN_ON_ONCE(buffer_unwritten(&bh) || buffer_new(&bh));
+		if (get_block(inode, block, &bh, 1) != 0) {
+			result = VM_FAULT_SIGBUS;
+			goto unlock_entry;
+		}
 	}
 
+	/* Filesystem should not return unwritten buffers to us! */
+	WARN_ON_ONCE(buffer_unwritten(&bh) || buffer_new(&bh));
+
 	bdev = bh.b_bdev;
 
 	if (bh.b_size < PMD_SIZE) {
 		dax_pmd_dbg(&bh, address, "allocated block too small");
-		return VM_FAULT_FALLBACK;
+		goto fallback;
 	}
 
-	/*
-	 * If we allocated new storage, make sure no process has any
-	 * zero pages covering this hole
-	 */
-	if (alloc) {
-		loff_t lstart = pgoff << PAGE_SHIFT;
-		loff_t lend = lstart + PMD_SIZE - 1; /* inclusive */
-
-		truncate_pagecache_range(inode, lstart, lend);
-	}
+	vmf.pgoff = pgoff;
+	vmf.flags = flags;
+	vmf.gfp_mask = mapping_gfp_mask(mapping) | __GFP_FS | __GFP_IO;
 
 	if (!write && !buffer_mapped(&bh)) {
 		spinlock_t *ptl;
-		pmd_t entry;
+		pmd_t pmd_entry;
 		struct page *zero_page = get_huge_zero_page();
 
 		if (unlikely(!zero_page)) {
@@ -1008,6 +1049,15 @@ int dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 			goto fallback;
 		}
 
+		ret = dax_insert_mapping_entry(mapping, &vmf, entry,
+				0, RADIX_DAX_PMD, true);
+		if (IS_ERR(ret)) {
+			dax_pmd_dbg(&bh, address,
+					"PMD radix insertion failed");
+			goto fallback;
+		}
+		entry = ret;
+
 		ptl = pmd_lock(vma->vm_mm, pmd);
 		if (!pmd_none(*pmd)) {
 			spin_unlock(ptl);
@@ -1020,9 +1070,9 @@ int dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 				__func__, current->comm, address,
 				(unsigned long long) to_sector(&bh, inode));
 
-		entry = mk_pmd(zero_page, vma->vm_page_prot);
-		entry = pmd_mkhuge(entry);
-		set_pmd_at(vma->vm_mm, pmd_addr, pmd, entry);
+		pmd_entry = mk_pmd(zero_page, vma->vm_page_prot);
+		pmd_entry = pmd_mkhuge(pmd_entry);
+		set_pmd_at(vma->vm_mm, pmd_addr, pmd, pmd_entry);
 		result = VM_FAULT_NOPAGE;
 		spin_unlock(ptl);
 	} else {
@@ -1054,27 +1104,14 @@ int dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 		}
 		dax_unmap_atomic(bdev, &dax);
 
-		/*
-		 * For PTE faults we insert a radix tree entry for reads, and
-		 * leave it clean.  Then on the first write we dirty the radix
-		 * tree entry via the dax_pfn_mkwrite() path.  This sequence
-		 * allows the dax_pfn_mkwrite() call to be simpler and avoid a
-		 * call into get_block() to translate the pgoff to a sector in
-		 * order to be able to create a new radix tree entry.
-		 *
-		 * The PMD path doesn't have an equivalent to
-		 * dax_pfn_mkwrite(), though, so for a read followed by a
-		 * write we traverse all the way through dax_pmd_fault()
-		 * twice.  This means we can just skip inserting a radix tree
-		 * entry completely on the initial read and just wait until
-		 * the write to insert a dirty entry.
-		 */
-		if (write) {
-			/*
-			 * We should insert radix-tree entry and dirty it here.
-			 * For now this is broken...
-			 */
+		ret = dax_insert_mapping_entry(mapping, &vmf, entry,
+				dax.sector, RADIX_DAX_PMD, false);
+		if (IS_ERR(ret)) {
+			dax_pmd_dbg(&bh, address,
+					"PMD radix insertion failed");
+			goto fallback;
 		}
+		entry = ret;
 
 		dev_dbg(part_to_dev(bdev->bd_part),
 				"%s: %s addr: %lx pfn: %lx sect: %llx\n",
@@ -1085,13 +1122,14 @@ int dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 				dax.pfn, write);
 	}
 
- out:
+ unlock_entry:
+	put_locked_mapping_entry(mapping, pgoff, entry);
 	return result;
 
  fallback:
 	count_vm_event(THP_FAULT_FALLBACK);
 	result = VM_FAULT_FALLBACK;
-	goto out;
+	goto unlock_entry;
 }
 EXPORT_SYMBOL_GPL(dax_pmd_fault);
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 8bcb852..7151147 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -6,8 +6,33 @@
 #include <linux/radix-tree.h>
 #include <asm/pgtable.h>
 
-/* We use lowest available exceptional entry bit for locking */
+/*
+ * We use lowest available bit in exceptional entry for locking, two bits for
+ * the entry type (PMD & PTE), and two more for flags (HZP and empty).  In
+ * total five special bits.
+ */
+#define RADIX_DAX_SHIFT	(RADIX_TREE_EXCEPTIONAL_SHIFT + 5)
 #define RADIX_DAX_ENTRY_LOCK (1 << RADIX_TREE_EXCEPTIONAL_SHIFT)
+/* PTE and PMD types */
+#define RADIX_DAX_PTE (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 1))
+#define RADIX_DAX_PMD (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 2))
+/* huge zero page and empty entry flags */
+#define RADIX_DAX_HZP (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 3))
+#define RADIX_DAX_EMPTY (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 4))
+
+#define RADIX_DAX_TYPE_MASK (RADIX_DAX_PTE | RADIX_DAX_PMD)
+#define RADIX_DAX_TYPE(entry) ((unsigned long)entry & RADIX_DAX_TYPE_MASK)
+#define RADIX_DAX_SECTOR(entry) (((unsigned long)entry >> RADIX_DAX_SHIFT))
+
+/* entries begin locked */
+#define RADIX_DAX_ENTRY(sector, type) ((void *)(RADIX_TREE_EXCEPTIONAL_ENTRY |\
+	type | (unsigned long)sector << RADIX_DAX_SHIFT | RADIX_DAX_ENTRY_LOCK))
+#define RADIX_DAX_HZP_ENTRY() ((void *)(RADIX_TREE_EXCEPTIONAL_ENTRY | \
+	RADIX_DAX_PMD | RADIX_DAX_HZP | RADIX_DAX_EMPTY | RADIX_DAX_ENTRY_LOCK))
+#define RADIX_DAX_EMPTY_ENTRY(type) ((void *)(RADIX_TREE_EXCEPTIONAL_ENTRY | \
+		type | RADIX_DAX_EMPTY | RADIX_DAX_ENTRY_LOCK))
+
+#define RADIX_DAX_ORDER(type) (type == RADIX_DAX_PMD ? PMD_SHIFT-PAGE_SHIFT : 0)
 
 ssize_t dax_do_io(struct kiocb *, struct inode *, struct iov_iter *,
 		  get_block_t, dio_iodone_t, int flags);
diff --git a/mm/filemap.c b/mm/filemap.c
index 56c4ac7..1994a2a 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -610,9 +610,7 @@ static int page_cache_tree_insert(struct address_space *mapping,
 				workingset_node_shadows_dec(node);
 		} else {
 			/* DAX can replace empty locked entry with a hole */
-			WARN_ON_ONCE(p !=
-				(void *)(RADIX_TREE_EXCEPTIONAL_ENTRY |
-					 RADIX_DAX_ENTRY_LOCK));
+			WARN_ON_ONCE(p != RADIX_DAX_EMPTY_ENTRY(RADIX_DAX_PTE));
 			/* DAX accounts exceptional entries as normal pages */
 			if (node)
 				workingset_node_pages_dec(node);
-- 
2.9.0

_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH 6/7] dax: re-enable DAX PMD support
@ 2016-08-15 19:09   ` Ross Zwisler
  0 siblings, 0 replies; 72+ messages in thread
From: Ross Zwisler @ 2016-08-15 19:09 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Dan Williams, Dave Chinner, Jan Kara, linux-ext4,
	linux-fsdevel, linux-mm, linux-nvdimm

DAX PMDs have been disabled since Jan Kara introduced DAX radix tree based
locking.  This patch allows DAX PMDs to participate in the DAX radix tree
based locking scheme so that they can be re-enabled.

There are currently three types of DAX 4k entries: 4k zero pages, 4k DAX
mappings that have an associated block allocation, and 4k DAX empty
entries.  The empty entries exist to provide locking for the duration of a
given page fault.

This patch adds three equivalent 2MiB DAX entries: Huge Zero Page (HZP)
entries, PMD DAX entries that have associated block allocations, and 2 MiB
DAX empty entries.

Unlike the 4k case where we insert a struct page* into the radix tree for
4k zero pages, for HZP we insert a DAX exceptional entry with the new
RADIX_DAX_HZP flag set.  This is because we use a single 2 MiB zero page in
every 2 MiB hole mapping, and it doesn't make sense to have that same struct
page* with multiple entries in multiple trees.  This would cause contention
on the single page lock for the one Huge Zero Page, and it would break the
page->index and page->mapping associations that are assumed to be valid in
many other places in the kernel.

One difficult use case is when one thread is trying to use 4k entries in
the radix tree for a given offset, and another thread is using 2 MiB entries
for that same offset.  The current code handles this by making the 2 MiB
user fall back to 4k entries for most cases.  This was done because it is
the simplest solution, and because the use of 2MiB pages is already
opportunistic.

If we were to try to upgrade from 4k pages to 2MiB pages for a given range,
we would run into the problem of how to lock out 4k page faults for the entire
2MiB range while we clean out the radix tree so we can insert the 2MiB
entry.  We can solve this problem if we need to, but I think that the cases
where both 2MiB entries and 4k entries are being used for the same range
will be rare enough and the gain small enough that it probably won't be
worth the complexity.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/dax.c            | 206 +++++++++++++++++++++++++++++++---------------------
 include/linux/dax.h |  27 ++++++-
 mm/filemap.c        |   4 +-
 3 files changed, 149 insertions(+), 88 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 0f1d053..482e616 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -32,20 +32,6 @@
 #include <linux/pfn_t.h>
 #include <linux/sizes.h>
 
-/*
- * We use lowest available bit in exceptional entry for locking, other two
- * bits to determine entry type. In total 3 special bits.
- */
-#define RADIX_DAX_SHIFT	(RADIX_TREE_EXCEPTIONAL_SHIFT + 3)
-#define RADIX_DAX_PTE (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 1))
-#define RADIX_DAX_PMD (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 2))
-#define RADIX_DAX_TYPE_MASK (RADIX_DAX_PTE | RADIX_DAX_PMD)
-#define RADIX_DAX_TYPE(entry) ((unsigned long)entry & RADIX_DAX_TYPE_MASK)
-#define RADIX_DAX_SECTOR(entry) (((unsigned long)entry >> RADIX_DAX_SHIFT))
-#define RADIX_DAX_ENTRY(sector, pmd) ((void *)((unsigned long)sector << \
-		RADIX_DAX_SHIFT | (pmd ? RADIX_DAX_PMD : RADIX_DAX_PTE) | \
-		RADIX_TREE_EXCEPTIONAL_ENTRY))
-
 /* We choose 4096 entries - same as per-zone page wait tables */
 #define DAX_WAIT_TABLE_BITS 12
 #define DAX_WAIT_TABLE_ENTRIES (1 << DAX_WAIT_TABLE_BITS)
@@ -386,15 +372,32 @@ static void *get_unlocked_mapping_entry(struct address_space *mapping,
  * persistent memory the benefit is doubtful. We can add that later if we can
  * show it helps.
  */
-static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index)
+static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index,
+		unsigned long new_type)
 {
+	bool pmd_downgrade = false;
 	void *entry, **slot;
 
 restart:
 	spin_lock_irq(&mapping->tree_lock);
 	entry = get_unlocked_mapping_entry(mapping, index, &slot);
+
+	if (entry && new_type == RADIX_DAX_PMD) {
+		if (!radix_tree_exceptional_entry(entry) ||
+				RADIX_DAX_TYPE(entry) == RADIX_DAX_PTE) {
+			spin_unlock_irq(&mapping->tree_lock);
+			return ERR_PTR(-EEXIST);
+		}
+	} else if (entry && new_type == RADIX_DAX_PTE) {
+		if (radix_tree_exceptional_entry(entry) &&
+		    RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD &&
+		    (unsigned long)entry & (RADIX_DAX_HZP|RADIX_DAX_EMPTY)) {
+			pmd_downgrade = true;
+		}
+	}
+
 	/* No entry for given index? Make sure radix tree is big enough. */
-	if (!entry) {
+	if (!entry || pmd_downgrade) {
 		int err;
 
 		spin_unlock_irq(&mapping->tree_lock);
@@ -402,15 +405,27 @@ restart:
 				mapping_gfp_mask(mapping) & ~__GFP_HIGHMEM);
 		if (err)
 			return ERR_PTR(err);
-		entry = (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY |
-			       RADIX_DAX_ENTRY_LOCK);
+
+		if (pmd_downgrade && ((unsigned long)entry & RADIX_DAX_HZP))
+			unmap_mapping_range(mapping,
+				(index << PAGE_SHIFT) & PMD_MASK, PMD_SIZE, 0);
+
+		entry = RADIX_DAX_EMPTY_ENTRY(new_type);
 		spin_lock_irq(&mapping->tree_lock);
-		err = radix_tree_insert(&mapping->page_tree, index, entry);
+
+		if (pmd_downgrade) {
+			radix_tree_delete(&mapping->page_tree, index);
+			mapping->nrexceptional--;
+			dax_wake_mapping_entry_waiter(slot, false);
+		}
+
+		err = __radix_tree_insert(&mapping->page_tree, index,
+				RADIX_DAX_ORDER(new_type), entry);
 		radix_tree_preload_end();
 		if (err) {
 			spin_unlock_irq(&mapping->tree_lock);
 			/* Someone already created the entry? */
-			if (err == -EEXIST)
+			if (err == -EEXIST && new_type == RADIX_DAX_PTE)
 				goto restart;
 			return ERR_PTR(err);
 		}
@@ -571,15 +586,15 @@ static int copy_user_bh(struct page *to, struct inode *inode,
 	return 0;
 }
 
-#define DAX_PMD_INDEX(page_index) (page_index & (PMD_MASK >> PAGE_SHIFT))
-
 static void *dax_insert_mapping_entry(struct address_space *mapping,
 				      struct vm_fault *vmf,
-				      void *entry, sector_t sector)
+				      void *entry, sector_t sector,
+				      unsigned long new_type, bool hzp)
 {
 	struct radix_tree_root *page_tree = &mapping->page_tree;
 	int error = 0;
 	bool hole_fill = false;
+	bool hzp_fill = false;
 	void *new_entry;
 	pgoff_t index = vmf->pgoff;
 
@@ -598,22 +613,30 @@ static void *dax_insert_mapping_entry(struct address_space *mapping,
 		error = radix_tree_preload(vmf->gfp_mask & ~__GFP_HIGHMEM);
 		if (error)
 			return ERR_PTR(error);
+	} else if ((unsigned long)entry & RADIX_DAX_HZP && !hzp) {
+		hzp_fill = true;
+		unmap_mapping_range(mapping,
+			(vmf->pgoff << PAGE_SHIFT) & PMD_MASK, PMD_SIZE, 0);
 	}
 
 	spin_lock_irq(&mapping->tree_lock);
-	new_entry = (void *)((unsigned long)RADIX_DAX_ENTRY(sector, false) |
-		       RADIX_DAX_ENTRY_LOCK);
+	if (hzp)
+		new_entry = RADIX_DAX_HZP_ENTRY();
+	else
+		new_entry = RADIX_DAX_ENTRY(sector, new_type);
+
 	if (hole_fill) {
 		__delete_from_page_cache(entry, NULL);
 		/* Drop pagecache reference */
 		put_page(entry);
-		error = radix_tree_insert(page_tree, index, new_entry);
+		error = __radix_tree_insert(page_tree, index,
+				RADIX_DAX_ORDER(new_type), new_entry);
 		if (error) {
 			new_entry = ERR_PTR(error);
 			goto unlock;
 		}
 		mapping->nrexceptional++;
-	} else {
+	} else if ((unsigned long)entry & (RADIX_DAX_HZP|RADIX_DAX_EMPTY)) {
 		void **slot;
 		void *ret;
 
@@ -669,6 +692,18 @@ static int dax_writeback_one(struct block_device *bdev,
 		goto unlock;
 	}
 
+	if (WARN_ON_ONCE((unsigned long)entry & RADIX_DAX_EMPTY)) {
+		ret = -EIO;
+		goto unlock;
+	}
+
+	/*
+	 * Even if dax_writeback_mapping_range() was given a wbc->range_start
+	 * in the middle of a PMD, the 'index' we are given will be aligned to
+	 * the start index of the PMD, as will the sector we pull from
+	 * 'entry'.  This allows us to flush for PMD_SIZE and not have to
+	 * worry about partial PMD writebacks.
+	 */
 	dax.sector = RADIX_DAX_SECTOR(entry);
 	dax.size = (type == RADIX_DAX_PMD ? PMD_SIZE : PAGE_SIZE);
 	spin_unlock_irq(&mapping->tree_lock);
@@ -709,12 +744,11 @@ int dax_writeback_mapping_range(struct address_space *mapping,
 		struct block_device *bdev, struct writeback_control *wbc)
 {
 	struct inode *inode = mapping->host;
-	pgoff_t start_index, end_index, pmd_index;
+	pgoff_t start_index, end_index;
 	pgoff_t indices[PAGEVEC_SIZE];
 	struct pagevec pvec;
 	bool done = false;
 	int i, ret = 0;
-	void *entry;
 
 	if (WARN_ON_ONCE(inode->i_blkbits != PAGE_SHIFT))
 		return -EIO;
@@ -724,15 +758,6 @@ int dax_writeback_mapping_range(struct address_space *mapping,
 
 	start_index = wbc->range_start >> PAGE_SHIFT;
 	end_index = wbc->range_end >> PAGE_SHIFT;
-	pmd_index = DAX_PMD_INDEX(start_index);
-
-	rcu_read_lock();
-	entry = radix_tree_lookup(&mapping->page_tree, pmd_index);
-	rcu_read_unlock();
-
-	/* see if the start of our range is covered by a PMD entry */
-	if (entry && RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD)
-		start_index = pmd_index;
 
 	tag_pages_for_writeback(mapping, start_index, end_index);
 
@@ -778,7 +803,8 @@ static int dax_insert_mapping(struct address_space *mapping,
 		return PTR_ERR(dax.addr);
 	dax_unmap_atomic(bdev, &dax);
 
-	ret = dax_insert_mapping_entry(mapping, vmf, entry, dax.sector);
+	ret = dax_insert_mapping_entry(mapping, vmf, entry, dax.sector,
+			RADIX_DAX_PTE, false);
 	if (IS_ERR(ret))
 		return PTR_ERR(ret);
 	*entryp = ret;
@@ -825,7 +851,7 @@ int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 	bh.b_bdev = inode->i_sb->s_bdev;
 	bh.b_size = PAGE_SIZE;
 
-	entry = grab_mapping_entry(mapping, vmf->pgoff);
+	entry = grab_mapping_entry(mapping, vmf->pgoff, RADIX_DAX_PTE);
 	if (IS_ERR(entry)) {
 		error = PTR_ERR(entry);
 		goto out;
@@ -929,9 +955,11 @@ int dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 	bool write = flags & FAULT_FLAG_WRITE;
 	struct block_device *bdev;
 	pgoff_t size, pgoff;
+	struct vm_fault vmf;
 	sector_t block;
 	int result = 0;
-	bool alloc = false;
+	void *entry, *ret;
+
 
 	/* dax pmd mappings require pfn_t_devmap() */
 	if (!IS_ENABLED(CONFIG_FS_DAX_PMD))
@@ -953,6 +981,11 @@ int dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 		return VM_FAULT_FALLBACK;
 	}
 
+	/*
+	 * Check whether offset isn't beyond end of file now. Caller is supposed
+	 * to hold locks serializing us with truncate / punch hole so this is
+	 * a reliable test.
+	 */
 	pgoff = linear_page_index(vma, pmd_addr);
 	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
 	if (pgoff >= size)
@@ -970,37 +1003,45 @@ int dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 
 	bh.b_size = PMD_SIZE;
 
-	if (get_block(inode, block, &bh, 0) != 0)
-		return VM_FAULT_SIGBUS;
+	/*
+	 * grab_mapping_entry() will make sure we get a 2M empty entry, a DAX
+	 * PMD or a HZP entry.  If it can't (because a 4k page is already in
+	 * the tree, for instance), it will return -EEXIST and we just fall
+	 * back to 4k entries.
+	 */
+	entry = grab_mapping_entry(mapping, pgoff, RADIX_DAX_PMD);
+	if (IS_ERR(entry))
+		return VM_FAULT_FALLBACK;
+
+	if (get_block(inode, block, &bh, 0) != 0) {
+		result = VM_FAULT_SIGBUS;
+		goto unlock_entry;
+	}
 
 	if (!buffer_mapped(&bh) && write) {
-		if (get_block(inode, block, &bh, 1) != 0)
-			return VM_FAULT_SIGBUS;
-		alloc = true;
-		WARN_ON_ONCE(buffer_unwritten(&bh) || buffer_new(&bh));
+		if (get_block(inode, block, &bh, 1) != 0) {
+			result = VM_FAULT_SIGBUS;
+			goto unlock_entry;
+		}
 	}
 
+	/* Filesystem should not return unwritten buffers to us! */
+	WARN_ON_ONCE(buffer_unwritten(&bh) || buffer_new(&bh));
+
 	bdev = bh.b_bdev;
 
 	if (bh.b_size < PMD_SIZE) {
 		dax_pmd_dbg(&bh, address, "allocated block too small");
-		return VM_FAULT_FALLBACK;
+		goto fallback;
 	}
 
-	/*
-	 * If we allocated new storage, make sure no process has any
-	 * zero pages covering this hole
-	 */
-	if (alloc) {
-		loff_t lstart = pgoff << PAGE_SHIFT;
-		loff_t lend = lstart + PMD_SIZE - 1; /* inclusive */
-
-		truncate_pagecache_range(inode, lstart, lend);
-	}
+	vmf.pgoff = pgoff;
+	vmf.flags = flags;
+	vmf.gfp_mask = mapping_gfp_mask(mapping) | __GFP_FS | __GFP_IO;
 
 	if (!write && !buffer_mapped(&bh)) {
 		spinlock_t *ptl;
-		pmd_t entry;
+		pmd_t pmd_entry;
 		struct page *zero_page = get_huge_zero_page();
 
 		if (unlikely(!zero_page)) {
@@ -1008,6 +1049,15 @@ int dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 			goto fallback;
 		}
 
+		ret = dax_insert_mapping_entry(mapping, &vmf, entry,
+				0, RADIX_DAX_PMD, true);
+		if (IS_ERR(ret)) {
+			dax_pmd_dbg(&bh, address,
+					"PMD radix insertion failed");
+			goto fallback;
+		}
+		entry = ret;
+
 		ptl = pmd_lock(vma->vm_mm, pmd);
 		if (!pmd_none(*pmd)) {
 			spin_unlock(ptl);
@@ -1020,9 +1070,9 @@ int dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 				__func__, current->comm, address,
 				(unsigned long long) to_sector(&bh, inode));
 
-		entry = mk_pmd(zero_page, vma->vm_page_prot);
-		entry = pmd_mkhuge(entry);
-		set_pmd_at(vma->vm_mm, pmd_addr, pmd, entry);
+		pmd_entry = mk_pmd(zero_page, vma->vm_page_prot);
+		pmd_entry = pmd_mkhuge(pmd_entry);
+		set_pmd_at(vma->vm_mm, pmd_addr, pmd, pmd_entry);
 		result = VM_FAULT_NOPAGE;
 		spin_unlock(ptl);
 	} else {
@@ -1054,27 +1104,14 @@ int dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 		}
 		dax_unmap_atomic(bdev, &dax);
 
-		/*
-		 * For PTE faults we insert a radix tree entry for reads, and
-		 * leave it clean.  Then on the first write we dirty the radix
-		 * tree entry via the dax_pfn_mkwrite() path.  This sequence
-		 * allows the dax_pfn_mkwrite() call to be simpler and avoid a
-		 * call into get_block() to translate the pgoff to a sector in
-		 * order to be able to create a new radix tree entry.
-		 *
-		 * The PMD path doesn't have an equivalent to
-		 * dax_pfn_mkwrite(), though, so for a read followed by a
-		 * write we traverse all the way through dax_pmd_fault()
-		 * twice.  This means we can just skip inserting a radix tree
-		 * entry completely on the initial read and just wait until
-		 * the write to insert a dirty entry.
-		 */
-		if (write) {
-			/*
-			 * We should insert radix-tree entry and dirty it here.
-			 * For now this is broken...
-			 */
+		ret = dax_insert_mapping_entry(mapping, &vmf, entry,
+				dax.sector, RADIX_DAX_PMD, false);
+		if (IS_ERR(ret)) {
+			dax_pmd_dbg(&bh, address,
+					"PMD radix insertion failed");
+			goto fallback;
 		}
+		entry = ret;
 
 		dev_dbg(part_to_dev(bdev->bd_part),
 				"%s: %s addr: %lx pfn: %lx sect: %llx\n",
@@ -1085,13 +1122,14 @@ int dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 				dax.pfn, write);
 	}
 
- out:
+ unlock_entry:
+	put_locked_mapping_entry(mapping, pgoff, entry);
 	return result;
 
  fallback:
 	count_vm_event(THP_FAULT_FALLBACK);
 	result = VM_FAULT_FALLBACK;
-	goto out;
+	goto unlock_entry;
 }
 EXPORT_SYMBOL_GPL(dax_pmd_fault);
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 8bcb852..7151147 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -6,8 +6,33 @@
 #include <linux/radix-tree.h>
 #include <asm/pgtable.h>
 
-/* We use lowest available exceptional entry bit for locking */
+/*
+ * We use lowest available bit in exceptional entry for locking, two bits for
+ * the entry type (PMD & PTE), and two more for flags (HZP and empty).  In
+ * total five special bits.
+ */
+#define RADIX_DAX_SHIFT	(RADIX_TREE_EXCEPTIONAL_SHIFT + 5)
 #define RADIX_DAX_ENTRY_LOCK (1 << RADIX_TREE_EXCEPTIONAL_SHIFT)
+/* PTE and PMD types */
+#define RADIX_DAX_PTE (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 1))
+#define RADIX_DAX_PMD (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 2))
+/* huge zero page and empty entry flags */
+#define RADIX_DAX_HZP (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 3))
+#define RADIX_DAX_EMPTY (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 4))
+
+#define RADIX_DAX_TYPE_MASK (RADIX_DAX_PTE | RADIX_DAX_PMD)
+#define RADIX_DAX_TYPE(entry) ((unsigned long)entry & RADIX_DAX_TYPE_MASK)
+#define RADIX_DAX_SECTOR(entry) (((unsigned long)entry >> RADIX_DAX_SHIFT))
+
+/* entries begin locked */
+#define RADIX_DAX_ENTRY(sector, type) ((void *)(RADIX_TREE_EXCEPTIONAL_ENTRY |\
+	type | (unsigned long)sector << RADIX_DAX_SHIFT | RADIX_DAX_ENTRY_LOCK))
+#define RADIX_DAX_HZP_ENTRY() ((void *)(RADIX_TREE_EXCEPTIONAL_ENTRY | \
+	RADIX_DAX_PMD | RADIX_DAX_HZP | RADIX_DAX_EMPTY | RADIX_DAX_ENTRY_LOCK))
+#define RADIX_DAX_EMPTY_ENTRY(type) ((void *)(RADIX_TREE_EXCEPTIONAL_ENTRY | \
+		type | RADIX_DAX_EMPTY | RADIX_DAX_ENTRY_LOCK))
+
+#define RADIX_DAX_ORDER(type) (type == RADIX_DAX_PMD ? PMD_SHIFT-PAGE_SHIFT : 0)
 
 ssize_t dax_do_io(struct kiocb *, struct inode *, struct iov_iter *,
 		  get_block_t, dio_iodone_t, int flags);
diff --git a/mm/filemap.c b/mm/filemap.c
index 56c4ac7..1994a2a 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -610,9 +610,7 @@ static int page_cache_tree_insert(struct address_space *mapping,
 				workingset_node_shadows_dec(node);
 		} else {
 			/* DAX can replace empty locked entry with a hole */
-			WARN_ON_ONCE(p !=
-				(void *)(RADIX_TREE_EXCEPTIONAL_ENTRY |
-					 RADIX_DAX_ENTRY_LOCK));
+			WARN_ON_ONCE(p != RADIX_DAX_EMPTY_ENTRY(RADIX_DAX_PTE));
 			/* DAX accounts exceptional entries as normal pages */
 			if (node)
 				workingset_node_pages_dec(node);
-- 
2.9.0

^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH 6/7] dax: re-enable DAX PMD support
@ 2016-08-15 19:09   ` Ross Zwisler
  0 siblings, 0 replies; 72+ messages in thread
From: Ross Zwisler @ 2016-08-15 19:09 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Dan Williams, Dave Chinner, Jan Kara, linux-ext4,
	linux-fsdevel, linux-mm, linux-nvdimm

DAX PMDs have been disabled since Jan Kara introduced DAX radix tree based
locking.  This patch allows DAX PMDs to participate in the DAX radix tree
based locking scheme so that they can be re-enabled.

There are currently three types of DAX 4k entries: 4k zero pages, 4k DAX
mappings that have an associated block allocation, and 4k DAX empty
entries.  The empty entries exist to provide locking for the duration of a
given page fault.

This patch adds three equivalent 2MiB DAX entries: Huge Zero Page (HZP)
entries, PMD DAX entries that have associated block allocations, and 2 MiB
DAX empty entries.

Unlike the 4k case where we insert a struct page* into the radix tree for
4k zero pages, for HZP we insert a DAX exceptional entry with the new
RADIX_DAX_HZP flag set.  This is because we use a single 2 MiB zero page in
every 2 MiB hole mapping, and it doesn't make sense to have that same struct
page* with multiple entries in multiple trees.  This would cause contention
on the single page lock for the one Huge Zero Page, and it would break the
page->index and page->mapping associations that are assumed to be valid in
many other places in the kernel.

One difficult use case is when one thread is trying to use 4k entries in
the radix tree for a given offset, and another thread is using 2 MiB entries
for that same offset.  The current code handles this by making the 2 MiB
user fall back to 4k entries for most cases.  This was done because it is
the simplest solution, and because the use of 2MiB pages is already
opportunistic.

If we were to try to upgrade from 4k pages to 2MiB pages for a given range,
we would run into the problem of how to lock out 4k page faults for the entire
2MiB range while we clean out the radix tree so we can insert the 2MiB
entry.  We can solve this problem if we need to, but I think that the cases
where both 2MiB entries and 4k entries are being used for the same range
will be rare enough and the gain small enough that it probably won't be
worth the complexity.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/dax.c            | 206 +++++++++++++++++++++++++++++++---------------------
 include/linux/dax.h |  27 ++++++-
 mm/filemap.c        |   4 +-
 3 files changed, 149 insertions(+), 88 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 0f1d053..482e616 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -32,20 +32,6 @@
 #include <linux/pfn_t.h>
 #include <linux/sizes.h>
 
-/*
- * We use lowest available bit in exceptional entry for locking, other two
- * bits to determine entry type. In total 3 special bits.
- */
-#define RADIX_DAX_SHIFT	(RADIX_TREE_EXCEPTIONAL_SHIFT + 3)
-#define RADIX_DAX_PTE (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 1))
-#define RADIX_DAX_PMD (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 2))
-#define RADIX_DAX_TYPE_MASK (RADIX_DAX_PTE | RADIX_DAX_PMD)
-#define RADIX_DAX_TYPE(entry) ((unsigned long)entry & RADIX_DAX_TYPE_MASK)
-#define RADIX_DAX_SECTOR(entry) (((unsigned long)entry >> RADIX_DAX_SHIFT))
-#define RADIX_DAX_ENTRY(sector, pmd) ((void *)((unsigned long)sector << \
-		RADIX_DAX_SHIFT | (pmd ? RADIX_DAX_PMD : RADIX_DAX_PTE) | \
-		RADIX_TREE_EXCEPTIONAL_ENTRY))
-
 /* We choose 4096 entries - same as per-zone page wait tables */
 #define DAX_WAIT_TABLE_BITS 12
 #define DAX_WAIT_TABLE_ENTRIES (1 << DAX_WAIT_TABLE_BITS)
@@ -386,15 +372,32 @@ static void *get_unlocked_mapping_entry(struct address_space *mapping,
  * persistent memory the benefit is doubtful. We can add that later if we can
  * show it helps.
  */
-static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index)
+static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index,
+		unsigned long new_type)
 {
+	bool pmd_downgrade = false;
 	void *entry, **slot;
 
 restart:
 	spin_lock_irq(&mapping->tree_lock);
 	entry = get_unlocked_mapping_entry(mapping, index, &slot);
+
+	if (entry && new_type == RADIX_DAX_PMD) {
+		if (!radix_tree_exceptional_entry(entry) ||
+				RADIX_DAX_TYPE(entry) == RADIX_DAX_PTE) {
+			spin_unlock_irq(&mapping->tree_lock);
+			return ERR_PTR(-EEXIST);
+		}
+	} else if (entry && new_type == RADIX_DAX_PTE) {
+		if (radix_tree_exceptional_entry(entry) &&
+		    RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD &&
+		    (unsigned long)entry & (RADIX_DAX_HZP|RADIX_DAX_EMPTY)) {
+			pmd_downgrade = true;
+		}
+	}
+
 	/* No entry for given index? Make sure radix tree is big enough. */
-	if (!entry) {
+	if (!entry || pmd_downgrade) {
 		int err;
 
 		spin_unlock_irq(&mapping->tree_lock);
@@ -402,15 +405,27 @@ restart:
 				mapping_gfp_mask(mapping) & ~__GFP_HIGHMEM);
 		if (err)
 			return ERR_PTR(err);
-		entry = (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY |
-			       RADIX_DAX_ENTRY_LOCK);
+
+		if (pmd_downgrade && ((unsigned long)entry & RADIX_DAX_HZP))
+			unmap_mapping_range(mapping,
+				(index << PAGE_SHIFT) & PMD_MASK, PMD_SIZE, 0);
+
+		entry = RADIX_DAX_EMPTY_ENTRY(new_type);
 		spin_lock_irq(&mapping->tree_lock);
-		err = radix_tree_insert(&mapping->page_tree, index, entry);
+
+		if (pmd_downgrade) {
+			radix_tree_delete(&mapping->page_tree, index);
+			mapping->nrexceptional--;
+			dax_wake_mapping_entry_waiter(slot, false);
+		}
+
+		err = __radix_tree_insert(&mapping->page_tree, index,
+				RADIX_DAX_ORDER(new_type), entry);
 		radix_tree_preload_end();
 		if (err) {
 			spin_unlock_irq(&mapping->tree_lock);
 			/* Someone already created the entry? */
-			if (err == -EEXIST)
+			if (err == -EEXIST && new_type == RADIX_DAX_PTE)
 				goto restart;
 			return ERR_PTR(err);
 		}
@@ -571,15 +586,15 @@ static int copy_user_bh(struct page *to, struct inode *inode,
 	return 0;
 }
 
-#define DAX_PMD_INDEX(page_index) (page_index & (PMD_MASK >> PAGE_SHIFT))
-
 static void *dax_insert_mapping_entry(struct address_space *mapping,
 				      struct vm_fault *vmf,
-				      void *entry, sector_t sector)
+				      void *entry, sector_t sector,
+				      unsigned long new_type, bool hzp)
 {
 	struct radix_tree_root *page_tree = &mapping->page_tree;
 	int error = 0;
 	bool hole_fill = false;
+	bool hzp_fill = false;
 	void *new_entry;
 	pgoff_t index = vmf->pgoff;
 
@@ -598,22 +613,30 @@ static void *dax_insert_mapping_entry(struct address_space *mapping,
 		error = radix_tree_preload(vmf->gfp_mask & ~__GFP_HIGHMEM);
 		if (error)
 			return ERR_PTR(error);
+	} else if ((unsigned long)entry & RADIX_DAX_HZP && !hzp) {
+		hzp_fill = true;
+		unmap_mapping_range(mapping,
+			(vmf->pgoff << PAGE_SHIFT) & PMD_MASK, PMD_SIZE, 0);
 	}
 
 	spin_lock_irq(&mapping->tree_lock);
-	new_entry = (void *)((unsigned long)RADIX_DAX_ENTRY(sector, false) |
-		       RADIX_DAX_ENTRY_LOCK);
+	if (hzp)
+		new_entry = RADIX_DAX_HZP_ENTRY();
+	else
+		new_entry = RADIX_DAX_ENTRY(sector, new_type);
+
 	if (hole_fill) {
 		__delete_from_page_cache(entry, NULL);
 		/* Drop pagecache reference */
 		put_page(entry);
-		error = radix_tree_insert(page_tree, index, new_entry);
+		error = __radix_tree_insert(page_tree, index,
+				RADIX_DAX_ORDER(new_type), new_entry);
 		if (error) {
 			new_entry = ERR_PTR(error);
 			goto unlock;
 		}
 		mapping->nrexceptional++;
-	} else {
+	} else if ((unsigned long)entry & (RADIX_DAX_HZP|RADIX_DAX_EMPTY)) {
 		void **slot;
 		void *ret;
 
@@ -669,6 +692,18 @@ static int dax_writeback_one(struct block_device *bdev,
 		goto unlock;
 	}
 
+	if (WARN_ON_ONCE((unsigned long)entry & RADIX_DAX_EMPTY)) {
+		ret = -EIO;
+		goto unlock;
+	}
+
+	/*
+	 * Even if dax_writeback_mapping_range() was given a wbc->range_start
+	 * in the middle of a PMD, the 'index' we are given will be aligned to
+	 * the start index of the PMD, as will the sector we pull from
+	 * 'entry'.  This allows us to flush for PMD_SIZE and not have to
+	 * worry about partial PMD writebacks.
+	 */
 	dax.sector = RADIX_DAX_SECTOR(entry);
 	dax.size = (type == RADIX_DAX_PMD ? PMD_SIZE : PAGE_SIZE);
 	spin_unlock_irq(&mapping->tree_lock);
@@ -709,12 +744,11 @@ int dax_writeback_mapping_range(struct address_space *mapping,
 		struct block_device *bdev, struct writeback_control *wbc)
 {
 	struct inode *inode = mapping->host;
-	pgoff_t start_index, end_index, pmd_index;
+	pgoff_t start_index, end_index;
 	pgoff_t indices[PAGEVEC_SIZE];
 	struct pagevec pvec;
 	bool done = false;
 	int i, ret = 0;
-	void *entry;
 
 	if (WARN_ON_ONCE(inode->i_blkbits != PAGE_SHIFT))
 		return -EIO;
@@ -724,15 +758,6 @@ int dax_writeback_mapping_range(struct address_space *mapping,
 
 	start_index = wbc->range_start >> PAGE_SHIFT;
 	end_index = wbc->range_end >> PAGE_SHIFT;
-	pmd_index = DAX_PMD_INDEX(start_index);
-
-	rcu_read_lock();
-	entry = radix_tree_lookup(&mapping->page_tree, pmd_index);
-	rcu_read_unlock();
-
-	/* see if the start of our range is covered by a PMD entry */
-	if (entry && RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD)
-		start_index = pmd_index;
 
 	tag_pages_for_writeback(mapping, start_index, end_index);
 
@@ -778,7 +803,8 @@ static int dax_insert_mapping(struct address_space *mapping,
 		return PTR_ERR(dax.addr);
 	dax_unmap_atomic(bdev, &dax);
 
-	ret = dax_insert_mapping_entry(mapping, vmf, entry, dax.sector);
+	ret = dax_insert_mapping_entry(mapping, vmf, entry, dax.sector,
+			RADIX_DAX_PTE, false);
 	if (IS_ERR(ret))
 		return PTR_ERR(ret);
 	*entryp = ret;
@@ -825,7 +851,7 @@ int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 	bh.b_bdev = inode->i_sb->s_bdev;
 	bh.b_size = PAGE_SIZE;
 
-	entry = grab_mapping_entry(mapping, vmf->pgoff);
+	entry = grab_mapping_entry(mapping, vmf->pgoff, RADIX_DAX_PTE);
 	if (IS_ERR(entry)) {
 		error = PTR_ERR(entry);
 		goto out;
@@ -929,9 +955,11 @@ int dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 	bool write = flags & FAULT_FLAG_WRITE;
 	struct block_device *bdev;
 	pgoff_t size, pgoff;
+	struct vm_fault vmf;
 	sector_t block;
 	int result = 0;
-	bool alloc = false;
+	void *entry, *ret;
+
 
 	/* dax pmd mappings require pfn_t_devmap() */
 	if (!IS_ENABLED(CONFIG_FS_DAX_PMD))
@@ -953,6 +981,11 @@ int dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 		return VM_FAULT_FALLBACK;
 	}
 
+	/*
+	 * Check whether offset isn't beyond end of file now. Caller is supposed
+	 * to hold locks serializing us with truncate / punch hole so this is
+	 * a reliable test.
+	 */
 	pgoff = linear_page_index(vma, pmd_addr);
 	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
 	if (pgoff >= size)
@@ -970,37 +1003,45 @@ int dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 
 	bh.b_size = PMD_SIZE;
 
-	if (get_block(inode, block, &bh, 0) != 0)
-		return VM_FAULT_SIGBUS;
+	/*
+	 * grab_mapping_entry() will make sure we get a 2M empty entry, a DAX
+	 * PMD or a HZP entry.  If it can't (because a 4k page is already in
+	 * the tree, for instance), it will return -EEXIST and we just fall
+	 * back to 4k entries.
+	 */
+	entry = grab_mapping_entry(mapping, pgoff, RADIX_DAX_PMD);
+	if (IS_ERR(entry))
+		return VM_FAULT_FALLBACK;
+
+	if (get_block(inode, block, &bh, 0) != 0) {
+		result = VM_FAULT_SIGBUS;
+		goto unlock_entry;
+	}
 
 	if (!buffer_mapped(&bh) && write) {
-		if (get_block(inode, block, &bh, 1) != 0)
-			return VM_FAULT_SIGBUS;
-		alloc = true;
-		WARN_ON_ONCE(buffer_unwritten(&bh) || buffer_new(&bh));
+		if (get_block(inode, block, &bh, 1) != 0) {
+			result = VM_FAULT_SIGBUS;
+			goto unlock_entry;
+		}
 	}
 
+	/* Filesystem should not return unwritten buffers to us! */
+	WARN_ON_ONCE(buffer_unwritten(&bh) || buffer_new(&bh));
+
 	bdev = bh.b_bdev;
 
 	if (bh.b_size < PMD_SIZE) {
 		dax_pmd_dbg(&bh, address, "allocated block too small");
-		return VM_FAULT_FALLBACK;
+		goto fallback;
 	}
 
-	/*
-	 * If we allocated new storage, make sure no process has any
-	 * zero pages covering this hole
-	 */
-	if (alloc) {
-		loff_t lstart = pgoff << PAGE_SHIFT;
-		loff_t lend = lstart + PMD_SIZE - 1; /* inclusive */
-
-		truncate_pagecache_range(inode, lstart, lend);
-	}
+	vmf.pgoff = pgoff;
+	vmf.flags = flags;
+	vmf.gfp_mask = mapping_gfp_mask(mapping) | __GFP_FS | __GFP_IO;
 
 	if (!write && !buffer_mapped(&bh)) {
 		spinlock_t *ptl;
-		pmd_t entry;
+		pmd_t pmd_entry;
 		struct page *zero_page = get_huge_zero_page();
 
 		if (unlikely(!zero_page)) {
@@ -1008,6 +1049,15 @@ int dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 			goto fallback;
 		}
 
+		ret = dax_insert_mapping_entry(mapping, &vmf, entry,
+				0, RADIX_DAX_PMD, true);
+		if (IS_ERR(ret)) {
+			dax_pmd_dbg(&bh, address,
+					"PMD radix insertion failed");
+			goto fallback;
+		}
+		entry = ret;
+
 		ptl = pmd_lock(vma->vm_mm, pmd);
 		if (!pmd_none(*pmd)) {
 			spin_unlock(ptl);
@@ -1020,9 +1070,9 @@ int dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 				__func__, current->comm, address,
 				(unsigned long long) to_sector(&bh, inode));
 
-		entry = mk_pmd(zero_page, vma->vm_page_prot);
-		entry = pmd_mkhuge(entry);
-		set_pmd_at(vma->vm_mm, pmd_addr, pmd, entry);
+		pmd_entry = mk_pmd(zero_page, vma->vm_page_prot);
+		pmd_entry = pmd_mkhuge(pmd_entry);
+		set_pmd_at(vma->vm_mm, pmd_addr, pmd, pmd_entry);
 		result = VM_FAULT_NOPAGE;
 		spin_unlock(ptl);
 	} else {
@@ -1054,27 +1104,14 @@ int dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 		}
 		dax_unmap_atomic(bdev, &dax);
 
-		/*
-		 * For PTE faults we insert a radix tree entry for reads, and
-		 * leave it clean.  Then on the first write we dirty the radix
-		 * tree entry via the dax_pfn_mkwrite() path.  This sequence
-		 * allows the dax_pfn_mkwrite() call to be simpler and avoid a
-		 * call into get_block() to translate the pgoff to a sector in
-		 * order to be able to create a new radix tree entry.
-		 *
-		 * The PMD path doesn't have an equivalent to
-		 * dax_pfn_mkwrite(), though, so for a read followed by a
-		 * write we traverse all the way through dax_pmd_fault()
-		 * twice.  This means we can just skip inserting a radix tree
-		 * entry completely on the initial read and just wait until
-		 * the write to insert a dirty entry.
-		 */
-		if (write) {
-			/*
-			 * We should insert radix-tree entry and dirty it here.
-			 * For now this is broken...
-			 */
+		ret = dax_insert_mapping_entry(mapping, &vmf, entry,
+				dax.sector, RADIX_DAX_PMD, false);
+		if (IS_ERR(ret)) {
+			dax_pmd_dbg(&bh, address,
+					"PMD radix insertion failed");
+			goto fallback;
 		}
+		entry = ret;
 
 		dev_dbg(part_to_dev(bdev->bd_part),
 				"%s: %s addr: %lx pfn: %lx sect: %llx\n",
@@ -1085,13 +1122,14 @@ int dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 				dax.pfn, write);
 	}
 
- out:
+ unlock_entry:
+	put_locked_mapping_entry(mapping, pgoff, entry);
 	return result;
 
  fallback:
 	count_vm_event(THP_FAULT_FALLBACK);
 	result = VM_FAULT_FALLBACK;
-	goto out;
+	goto unlock_entry;
 }
 EXPORT_SYMBOL_GPL(dax_pmd_fault);
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 8bcb852..7151147 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -6,8 +6,33 @@
 #include <linux/radix-tree.h>
 #include <asm/pgtable.h>
 
-/* We use lowest available exceptional entry bit for locking */
+/*
+ * We use lowest available bit in exceptional entry for locking, two bits for
+ * the entry type (PMD & PTE), and two more for flags (HZP and empty).  In
+ * total five special bits.
+ */
+#define RADIX_DAX_SHIFT	(RADIX_TREE_EXCEPTIONAL_SHIFT + 5)
 #define RADIX_DAX_ENTRY_LOCK (1 << RADIX_TREE_EXCEPTIONAL_SHIFT)
+/* PTE and PMD types */
+#define RADIX_DAX_PTE (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 1))
+#define RADIX_DAX_PMD (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 2))
+/* huge zero page and empty entry flags */
+#define RADIX_DAX_HZP (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 3))
+#define RADIX_DAX_EMPTY (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 4))
+
+#define RADIX_DAX_TYPE_MASK (RADIX_DAX_PTE | RADIX_DAX_PMD)
+#define RADIX_DAX_TYPE(entry) ((unsigned long)entry & RADIX_DAX_TYPE_MASK)
+#define RADIX_DAX_SECTOR(entry) (((unsigned long)entry >> RADIX_DAX_SHIFT))
+
+/* entries begin locked */
+#define RADIX_DAX_ENTRY(sector, type) ((void *)(RADIX_TREE_EXCEPTIONAL_ENTRY |\
+	type | (unsigned long)sector << RADIX_DAX_SHIFT | RADIX_DAX_ENTRY_LOCK))
+#define RADIX_DAX_HZP_ENTRY() ((void *)(RADIX_TREE_EXCEPTIONAL_ENTRY | \
+	RADIX_DAX_PMD | RADIX_DAX_HZP | RADIX_DAX_EMPTY | RADIX_DAX_ENTRY_LOCK))
+#define RADIX_DAX_EMPTY_ENTRY(type) ((void *)(RADIX_TREE_EXCEPTIONAL_ENTRY | \
+		type | RADIX_DAX_EMPTY | RADIX_DAX_ENTRY_LOCK))
+
+#define RADIX_DAX_ORDER(type) (type == RADIX_DAX_PMD ? PMD_SHIFT-PAGE_SHIFT : 0)
 
 ssize_t dax_do_io(struct kiocb *, struct inode *, struct iov_iter *,
 		  get_block_t, dio_iodone_t, int flags);
diff --git a/mm/filemap.c b/mm/filemap.c
index 56c4ac7..1994a2a 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -610,9 +610,7 @@ static int page_cache_tree_insert(struct address_space *mapping,
 				workingset_node_shadows_dec(node);
 		} else {
 			/* DAX can replace empty locked entry with a hole */
-			WARN_ON_ONCE(p !=
-				(void *)(RADIX_TREE_EXCEPTIONAL_ENTRY |
-					 RADIX_DAX_ENTRY_LOCK));
+			WARN_ON_ONCE(p != RADIX_DAX_EMPTY_ENTRY(RADIX_DAX_PTE));
 			/* DAX accounts exceptional entries as normal pages */
 			if (node)
 				workingset_node_pages_dec(node);
-- 
2.9.0

^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH 7/7] dax: remove "depends on BROKEN" from FS_DAX_PMD
  2016-08-15 19:09 ` Ross Zwisler
  (?)
  (?)
@ 2016-08-15 19:09   ` Ross Zwisler
  -1 siblings, 0 replies; 72+ messages in thread
From: Ross Zwisler @ 2016-08-15 19:09 UTC (permalink / raw)
  To: linux-kernel
  Cc: Theodore Ts'o, Andrew Morton, linux-nvdimm, Dave Chinner,
	linux-mm, Andreas Dilger, Alexander Viro, Jan Kara,
	linux-fsdevel, linux-ext4

Now that DAX PMD faults are once again working and participating in DAX's
radix tree locking scheme, allow their config option to be enabled.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/Kconfig | 1 -
 1 file changed, 1 deletion(-)

diff --git a/fs/Kconfig b/fs/Kconfig
index 2bc7ad7..b6f0fce 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -55,7 +55,6 @@ config FS_DAX_PMD
 	depends on FS_DAX
 	depends on ZONE_DEVICE
 	depends on TRANSPARENT_HUGEPAGE
-	depends on BROKEN
 
 endif # BLOCK
 
-- 
2.9.0

^ permalink raw reply related	[flat|nested] 72+ messages in thread

* Re: [PATCH 0/7] re-enable DAX PMD support
  2016-08-15 19:09 ` Ross Zwisler
  (?)
  (?)
@ 2016-08-15 20:21   ` Dan Williams
  -1 siblings, 0 replies; 72+ messages in thread
From: Dan Williams @ 2016-08-15 20:21 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Theodore Ts'o, linux-nvdimm, Dave Chinner, linux-kernel,
	Linux MM, Andreas Dilger, Alexander Viro, Jan Kara,
	linux-fsdevel, Andrew Morton, linux-ext4

On Mon, Aug 15, 2016 at 12:09 PM, Ross Zwisler
<ross.zwisler@linux.intel.com> wrote:
> DAX PMDs have been disabled since Jan Kara introduced DAX radix tree based
> locking.  This series allows DAX PMDs to participate in the DAX radix tree
> based locking scheme so that they can be re-enabled.

Looks good to me.

> This series restores DAX PMD functionality back to what it was before it
> was disabled.  There is still a known issue between DAX PMDs and hole
> punch, which I am currently working on and which I plan to address with a
> separate series.

Perhaps we should hold off on applying patch 6 and 7 until after the
hole-punch fix is ready?

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 0/7] re-enable DAX PMD support
  2016-08-15 20:21   ` Dan Williams
  (?)
@ 2016-08-15 21:11     ` Ross Zwisler
  -1 siblings, 0 replies; 72+ messages in thread
From: Ross Zwisler @ 2016-08-15 21:11 UTC (permalink / raw)
  To: Dan Williams
  Cc: Theodore Ts'o, linux-nvdimm, Dave Chinner, linux-kernel,
	Linux MM, Andreas Dilger, Alexander Viro, Jan Kara,
	linux-fsdevel, linux-ext4, Andrew Morton

On Mon, Aug 15, 2016 at 01:21:47PM -0700, Dan Williams wrote:
> On Mon, Aug 15, 2016 at 12:09 PM, Ross Zwisler
> <ross.zwisler@linux.intel.com> wrote:
> > DAX PMDs have been disabled since Jan Kara introduced DAX radix tree based
> > locking.  This series allows DAX PMDs to participate in the DAX radix tree
> > based locking scheme so that they can be re-enabled.
> 
> Looks good to me.
> 
> > This series restores DAX PMD functionality back to what it was before it
> > was disabled.  There is still a known issue between DAX PMDs and hole
> > punch, which I am currently working on and which I plan to address with a
> > separate series.
> 
> Perhaps we should hold off on applying patch 6 and 7 until after the
> hole-punch fix is ready?

Sure, I'm cool with holding off on patch 7 (the Kconfig change) until after
the hole punch fix is ready.

I don't see a reason to hold off on patch 6, though?  It stands on its own,
implements the correct locking, and doesn't break anything.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 0/7] re-enable DAX PMD support
  2016-08-15 21:11     ` Ross Zwisler
@ 2016-08-15 21:14       ` Dan Williams
  -1 siblings, 0 replies; 72+ messages in thread
From: Dan Williams @ 2016-08-15 21:14 UTC (permalink / raw)
  To: Ross Zwisler, Dan Williams, linux-kernel, Theodore Ts'o,
	Alexander Viro, Andreas Dilger, Andrew Morton, Dave Chinner,
	Jan Kara, linux-ext4, linux-fsdevel, Linux MM, linux-nvdimm

On Mon, Aug 15, 2016 at 2:11 PM, Ross Zwisler
<ross.zwisler@linux.intel.com> wrote:
> On Mon, Aug 15, 2016 at 01:21:47PM -0700, Dan Williams wrote:
>> On Mon, Aug 15, 2016 at 12:09 PM, Ross Zwisler
>> <ross.zwisler@linux.intel.com> wrote:
>> > DAX PMDs have been disabled since Jan Kara introduced DAX radix tree based
>> > locking.  This series allows DAX PMDs to participate in the DAX radix tree
>> > based locking scheme so that they can be re-enabled.
>>
>> Looks good to me.
>>
>> > This series restores DAX PMD functionality back to what it was before it
>> > was disabled.  There is still a known issue between DAX PMDs and hole
>> > punch, which I am currently working on and which I plan to address with a
>> > separate series.
>>
>> Perhaps we should hold off on applying patch 6 and 7 until after the
>> hole-punch fix is ready?
>
> Sure, I'm cool with holding off on patch 7 (the Kconfig change) until after
> the hole punch fix is ready.
>
> I don't see a reason to hold off on patch 6, though?  It stands on its own,
> implements the correct locking, and doesn't break anything.

Whoops, I just meant 7.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 1/7] ext2: tell DAX the size of allocation holes
  2016-08-15 19:09   ` Ross Zwisler
  (?)
@ 2016-08-16  9:10     ` Jan Kara
  -1 siblings, 0 replies; 72+ messages in thread
From: Jan Kara @ 2016-08-16  9:10 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Theodore Ts'o, linux-nvdimm, Dave Chinner, linux-kernel,
	linux-mm, Andreas Dilger, Alexander Viro, Jan Kara,
	linux-fsdevel, Andrew Morton, linux-ext4

On Mon 15-08-16 13:09:12, Ross Zwisler wrote:
> When DAX calls ext2_get_block() and the file offset points to a hole we
> currently don't set bh_result->b_size.  When we re-enable PMD faults DAX
> will need bh_result->b_size to tell it the size of the hole so it can
> decide whether to fault in a 4 KiB zero page or a 2 MiB zero page.
> 
> For ext2 we always want DAX to use 4 KiB zero pages, so we just tell DAX
> that all holes are 4 KiB in size.
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> ---
>  fs/ext2/inode.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
> index d5c7d09..c6d9763 100644
> --- a/fs/ext2/inode.c
> +++ b/fs/ext2/inode.c
> @@ -773,6 +773,12 @@ int ext2_get_block(struct inode *inode, sector_t iblock, struct buffer_head *bh_
>  	if (ret > 0) {
>  		bh_result->b_size = (ret << inode->i_blkbits);
>  		ret = 0;
> +	} else if (ret == 0 && IS_DAX(inode)) {

I'd just drop the IS_DAX() check and set

	bh_result->b_size = 1 << inode->i_blkbits;

IMO it's better to have things consistent between DAX & !DAX whenever
possible.
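
For reference, a minimal sketch of the hole branch with that change
applied (illustrative only, not what was posted):

	} else if (ret == 0) {
		/* hole: report a single block, for DAX and !DAX alike */
		bh_result->b_size = 1 << inode->i_blkbits;
	}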

								Honza

> +		/*
> +		 * We have hit a hole.  Tell DAX it is 4k in size so that it
> +		 * uses PTE faults.
> +		 */
> +		bh_result->b_size = PAGE_SIZE;
>  	}
>  	return ret;
>  
> -- 
> 2.9.0
> 
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 2/7] ext4: tell DAX the size of allocation holes
  2016-08-15 19:09   ` Ross Zwisler
  (?)
@ 2016-08-16  9:12     ` Jan Kara
  -1 siblings, 0 replies; 72+ messages in thread
From: Jan Kara @ 2016-08-16  9:12 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Theodore Ts'o, linux-nvdimm, Dave Chinner, linux-kernel,
	linux-mm, Andreas Dilger, Alexander Viro, Jan Kara,
	linux-fsdevel, Andrew Morton, linux-ext4

On Mon 15-08-16 13:09:13, Ross Zwisler wrote:
> When DAX calls _ext4_get_block() and the file offset points to a hole we
> currently don't set bh->b_size.  When we re-enable PMD faults DAX will
> need bh->b_size to tell it the size of the hole so it can decide whether to
> fault in a 4 KiB zero page or a 2 MiB zero page.
> 
> _ext4_get_block() has the hole size information from ext4_map_blocks(), so
> populate bh->b_size.
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>

Looks good. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/ext4/inode.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 3131747..1808013 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -759,6 +759,9 @@ static int _ext4_get_block(struct inode *inode, sector_t iblock,
>  		ext4_update_bh_state(bh, map.m_flags);
>  		bh->b_size = inode->i_sb->s_blocksize * map.m_len;
>  		ret = 0;
> +	} else if (ret == 0) {
> +		/* hole case, need to fill in bh->b_size */
> +		bh->b_size = inode->i_sb->s_blocksize * map.m_len;
>  	}
>  	return ret;
>  }
> -- 
> 2.9.0
> 
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 2/7] ext4: tell DAX the size of allocation holes
@ 2016-08-16  9:12     ` Jan Kara
  0 siblings, 0 replies; 72+ messages in thread
From: Jan Kara @ 2016-08-16  9:12 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Dan Williams, Dave Chinner, Jan Kara, linux-ext4,
	linux-fsdevel, linux-mm, linux-nvdimm

On Mon 15-08-16 13:09:13, Ross Zwisler wrote:
> When DAX calls _ext4_get_block() and the file offset points to a hole we
> currently don't set bh->b_size.  When we re-enable PMD faults DAX will
> need bh->b_size to tell it the size of the hole so it can decide whether to
> fault in a 4 KiB zero page or a 2 MiB zero page.
> 
> _ext4_get_block() has the hole size information from ext4_map_blocks(), so
> populate bh->b_size.
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>

Looks good. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/ext4/inode.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 3131747..1808013 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -759,6 +759,9 @@ static int _ext4_get_block(struct inode *inode, sector_t iblock,
>  		ext4_update_bh_state(bh, map.m_flags);
>  		bh->b_size = inode->i_sb->s_blocksize * map.m_len;
>  		ret = 0;
> +	} else if (ret == 0) {
> +		/* hole case, need to fill in bh->b_size */
> +		bh->b_size = inode->i_sb->s_blocksize * map.m_len;
>  	}
>  	return ret;
>  }
> -- 
> 2.9.0
> 
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 2/7] ext4: tell DAX the size of allocation holes
@ 2016-08-16  9:12     ` Jan Kara
  0 siblings, 0 replies; 72+ messages in thread
From: Jan Kara @ 2016-08-16  9:12 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Dan Williams, Dave Chinner, Jan Kara, linux-ext4,
	linux-fsdevel, linux-mm, linux-nvdimm

On Mon 15-08-16 13:09:13, Ross Zwisler wrote:
> When DAX calls _ext4_get_block() and the file offset points to a hole we
> currently don't set bh->b_size.  When we re-enable PMD faults DAX will
> need bh->b_size to tell it the size of the hole so it can decide whether to
> fault in a 4 KiB zero page or a 2 MiB zero page.
> 
> _ext4_get_block() has the hole size information from ext4_map_blocks(), so
> populate bh->b_size.
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>

Looks good. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/ext4/inode.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 3131747..1808013 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -759,6 +759,9 @@ static int _ext4_get_block(struct inode *inode, sector_t iblock,
>  		ext4_update_bh_state(bh, map.m_flags);
>  		bh->b_size = inode->i_sb->s_blocksize * map.m_len;
>  		ret = 0;
> +	} else if (ret == 0) {
> +		/* hole case, need to fill in bh->b_size */
> +		bh->b_size = inode->i_sb->s_blocksize * map.m_len;
>  	}
>  	return ret;
>  }
> -- 
> 2.9.0
> 
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 3/7] dax: remove buffer_size_valid()
  2016-08-15 19:09   ` Ross Zwisler
@ 2016-08-16  9:13     ` Jan Kara
  -1 siblings, 0 replies; 72+ messages in thread
From: Jan Kara @ 2016-08-16  9:13 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Dan Williams, Dave Chinner, Jan Kara, linux-ext4,
	linux-fsdevel, linux-mm, linux-nvdimm

On Mon 15-08-16 13:09:14, Ross Zwisler wrote:
> Now that all our supported filesystems (ext2, ext4 and XFS) properly set
> bh.b_size when we call get_block() for a hole, rely on that value and
> remove the buffer_size_valid() sanity check.
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>

Looks good. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/dax.c | 22 +---------------------
>  1 file changed, 1 insertion(+), 21 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 993dc6f..8030f93 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -121,19 +121,6 @@ static bool buffer_written(struct buffer_head *bh)
>  	return buffer_mapped(bh) && !buffer_unwritten(bh);
>  }
>  
> -/*
> - * When ext4 encounters a hole, it returns without modifying the buffer_head
> - * which means that we can't trust b_size.  To cope with this, we set b_state
> - * to 0 before calling get_block and, if any bit is set, we know we can trust
> - * b_size.  Unfortunate, really, since ext4 knows precisely how long a hole is
> - * and would save us time calling get_block repeatedly.
> - */
> -static bool buffer_size_valid(struct buffer_head *bh)
> -{
> -	return bh->b_state != 0;
> -}
> -
> -
>  static sector_t to_sector(const struct buffer_head *bh,
>  		const struct inode *inode)
>  {
> @@ -175,8 +162,6 @@ static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
>  				rc = get_block(inode, block, bh, rw == WRITE);
>  				if (rc)
>  					break;
> -				if (!buffer_size_valid(bh))
> -					bh->b_size = 1 << blkbits;
>  				bh_max = pos - first + bh->b_size;
>  				bdev = bh->b_bdev;
>  				/*
> @@ -1010,12 +995,7 @@ int dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
>  
>  	bdev = bh.b_bdev;
>  
> -	/*
> -	 * If the filesystem isn't willing to tell us the length of a hole,
> -	 * just fall back to PTEs.  Calling get_block 512 times in a loop
> -	 * would be silly.
> -	 */
> -	if (!buffer_size_valid(&bh) || bh.b_size < PMD_SIZE) {
> +	if (bh.b_size < PMD_SIZE) {
>  		dax_pmd_dbg(&bh, address, "allocated block too small");
>  		return VM_FAULT_FALLBACK;
>  	}
> -- 
> 2.9.0
> 
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 3/7] dax: remove buffer_size_valid()
@ 2016-08-16  9:13     ` Jan Kara
  0 siblings, 0 replies; 72+ messages in thread
From: Jan Kara @ 2016-08-16  9:13 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Dan Williams, Dave Chinner, Jan Kara, linux-ext4,
	linux-fsdevel, linux-mm, linux-nvdimm

On Mon 15-08-16 13:09:14, Ross Zwisler wrote:
> Now that all our supported filesystems (ext2, ext4 and XFS) properly set
> bh.b_size when we call get_block() for a hole, rely on that value and
> remove the buffer_size_valid() sanity check.
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>

Looks good. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/dax.c | 22 +---------------------
>  1 file changed, 1 insertion(+), 21 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 993dc6f..8030f93 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -121,19 +121,6 @@ static bool buffer_written(struct buffer_head *bh)
>  	return buffer_mapped(bh) && !buffer_unwritten(bh);
>  }
>  
> -/*
> - * When ext4 encounters a hole, it returns without modifying the buffer_head
> - * which means that we can't trust b_size.  To cope with this, we set b_state
> - * to 0 before calling get_block and, if any bit is set, we know we can trust
> - * b_size.  Unfortunate, really, since ext4 knows precisely how long a hole is
> - * and would save us time calling get_block repeatedly.
> - */
> -static bool buffer_size_valid(struct buffer_head *bh)
> -{
> -	return bh->b_state != 0;
> -}
> -
> -
>  static sector_t to_sector(const struct buffer_head *bh,
>  		const struct inode *inode)
>  {
> @@ -175,8 +162,6 @@ static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
>  				rc = get_block(inode, block, bh, rw == WRITE);
>  				if (rc)
>  					break;
> -				if (!buffer_size_valid(bh))
> -					bh->b_size = 1 << blkbits;
>  				bh_max = pos - first + bh->b_size;
>  				bdev = bh->b_bdev;
>  				/*
> @@ -1010,12 +995,7 @@ int dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
>  
>  	bdev = bh.b_bdev;
>  
> -	/*
> -	 * If the filesystem isn't willing to tell us the length of a hole,
> -	 * just fall back to PTEs.  Calling get_block 512 times in a loop
> -	 * would be silly.
> -	 */
> -	if (!buffer_size_valid(&bh) || bh.b_size < PMD_SIZE) {
> +	if (bh.b_size < PMD_SIZE) {
>  		dax_pmd_dbg(&bh, address, "allocated block too small");
>  		return VM_FAULT_FALLBACK;
>  	}
> -- 
> 2.9.0
> 
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 4/7] dax: rename 'ret' to 'entry' in grab_mapping_entry
  2016-08-15 19:09   ` Ross Zwisler
@ 2016-08-16  9:14     ` Jan Kara
  -1 siblings, 0 replies; 72+ messages in thread
From: Jan Kara @ 2016-08-16  9:14 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Dan Williams, Dave Chinner, Jan Kara, linux-ext4,
	linux-fsdevel, linux-mm, linux-nvdimm

On Mon 15-08-16 13:09:15, Ross Zwisler wrote:
> No functional change.
> 
> Everywhere else that we get entries via get_unlocked_mapping_entry(), we
> save them in 'entry' variables.  Just change this one to be more descriptive.
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>

Looks good. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/dax.c | 20 ++++++++++----------
>  1 file changed, 10 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 8030f93..fed6a52 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -394,13 +394,13 @@ static void *get_unlocked_mapping_entry(struct address_space *mapping,
>   */
>  static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index)
>  {
> -	void *ret, **slot;
> +	void *entry, **slot;
>  
>  restart:
>  	spin_lock_irq(&mapping->tree_lock);
> -	ret = get_unlocked_mapping_entry(mapping, index, &slot);
> +	entry = get_unlocked_mapping_entry(mapping, index, &slot);
>  	/* No entry for given index? Make sure radix tree is big enough. */
> -	if (!ret) {
> +	if (!entry) {
>  		int err;
>  
>  		spin_unlock_irq(&mapping->tree_lock);
> @@ -408,10 +408,10 @@ restart:
>  				mapping_gfp_mask(mapping) & ~__GFP_HIGHMEM);
>  		if (err)
>  			return ERR_PTR(err);
> -		ret = (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY |
> +		entry = (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY |
>  			       RADIX_DAX_ENTRY_LOCK);
>  		spin_lock_irq(&mapping->tree_lock);
> -		err = radix_tree_insert(&mapping->page_tree, index, ret);
> +		err = radix_tree_insert(&mapping->page_tree, index, entry);
>  		radix_tree_preload_end();
>  		if (err) {
>  			spin_unlock_irq(&mapping->tree_lock);
> @@ -423,11 +423,11 @@ restart:
>  		/* Good, we have inserted empty locked entry into the tree. */
>  		mapping->nrexceptional++;
>  		spin_unlock_irq(&mapping->tree_lock);
> -		return ret;
> +		return entry;
>  	}
>  	/* Normal page in radix tree? */
> -	if (!radix_tree_exceptional_entry(ret)) {
> -		struct page *page = ret;
> +	if (!radix_tree_exceptional_entry(entry)) {
> +		struct page *page = entry;
>  
>  		get_page(page);
>  		spin_unlock_irq(&mapping->tree_lock);
> @@ -440,9 +440,9 @@ restart:
>  		}
>  		return page;
>  	}
> -	ret = lock_slot(mapping, slot);
> +	entry = lock_slot(mapping, slot);
>  	spin_unlock_irq(&mapping->tree_lock);
> -	return ret;
> +	return entry;
>  }
>  
>  void dax_wake_mapping_entry_waiter(struct address_space *mapping,
> -- 
> 2.9.0
> 
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 4/7] dax: rename 'ret' to 'entry' in grab_mapping_entry
@ 2016-08-16  9:14     ` Jan Kara
  0 siblings, 0 replies; 72+ messages in thread
From: Jan Kara @ 2016-08-16  9:14 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Dan Williams, Dave Chinner, Jan Kara, linux-ext4,
	linux-fsdevel, linux-mm, linux-nvdimm

On Mon 15-08-16 13:09:15, Ross Zwisler wrote:
> No functional change.
> 
> Everywhere else that we get entries via get_unlocked_mapping_entry(), we
> save them in 'entry' variables.  Just change this one to be more descriptive.
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>

Looks good. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/dax.c | 20 ++++++++++----------
>  1 file changed, 10 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 8030f93..fed6a52 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -394,13 +394,13 @@ static void *get_unlocked_mapping_entry(struct address_space *mapping,
>   */
>  static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index)
>  {
> -	void *ret, **slot;
> +	void *entry, **slot;
>  
>  restart:
>  	spin_lock_irq(&mapping->tree_lock);
> -	ret = get_unlocked_mapping_entry(mapping, index, &slot);
> +	entry = get_unlocked_mapping_entry(mapping, index, &slot);
>  	/* No entry for given index? Make sure radix tree is big enough. */
> -	if (!ret) {
> +	if (!entry) {
>  		int err;
>  
>  		spin_unlock_irq(&mapping->tree_lock);
> @@ -408,10 +408,10 @@ restart:
>  				mapping_gfp_mask(mapping) & ~__GFP_HIGHMEM);
>  		if (err)
>  			return ERR_PTR(err);
> -		ret = (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY |
> +		entry = (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY |
>  			       RADIX_DAX_ENTRY_LOCK);
>  		spin_lock_irq(&mapping->tree_lock);
> -		err = radix_tree_insert(&mapping->page_tree, index, ret);
> +		err = radix_tree_insert(&mapping->page_tree, index, entry);
>  		radix_tree_preload_end();
>  		if (err) {
>  			spin_unlock_irq(&mapping->tree_lock);
> @@ -423,11 +423,11 @@ restart:
>  		/* Good, we have inserted empty locked entry into the tree. */
>  		mapping->nrexceptional++;
>  		spin_unlock_irq(&mapping->tree_lock);
> -		return ret;
> +		return entry;
>  	}
>  	/* Normal page in radix tree? */
> -	if (!radix_tree_exceptional_entry(ret)) {
> -		struct page *page = ret;
> +	if (!radix_tree_exceptional_entry(entry)) {
> +		struct page *page = entry;
>  
>  		get_page(page);
>  		spin_unlock_irq(&mapping->tree_lock);
> @@ -440,9 +440,9 @@ restart:
>  		}
>  		return page;
>  	}
> -	ret = lock_slot(mapping, slot);
> +	entry = lock_slot(mapping, slot);
>  	spin_unlock_irq(&mapping->tree_lock);
> -	return ret;
> +	return entry;
>  }
>  
>  void dax_wake_mapping_entry_waiter(struct address_space *mapping,
> -- 
> 2.9.0
> 
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 5/7] dax: lock based on slot instead of [mapping, index]
  2016-08-15 19:09   ` Ross Zwisler
  (?)
@ 2016-08-16  9:28     ` Jan Kara
  -1 siblings, 0 replies; 72+ messages in thread
From: Jan Kara @ 2016-08-16  9:28 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Theodore Ts'o, linux-nvdimm, Dave Chinner, linux-kernel,
	linux-mm, Andreas Dilger, Alexander Viro, Jan Kara,
	linux-fsdevel, Andrew Morton, linux-ext4

On Mon 15-08-16 13:09:16, Ross Zwisler wrote:
> DAX radix tree locking currently locks entries based on the unique
> combination of the 'mapping' pointer and the pgoff_t 'index' for the entry.
> This works for PTEs, but as we move to PMDs we will need all the offsets
> within the range covered by the PMD to map to the same bit lock.
> To accomplish this, lock based on the 'slot' pointer in the radix tree
> instead of [mapping, index].

I'm not convinced this is safe. What makes the slot pointer still valid
after you drop tree_lock? At least radix_tree_shrink() or
radix_tree_expand() could move your slot without letting the waiter know,
and it would never be woken.
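
To make that concrete, the waiter side of the patch effectively does
(trimmed sketch of get_unlocked_mapping_entry() from this patch):

	ret = __radix_tree_lookup(&mapping->page_tree, index, NULL, &slot);
	wq = dax_entry_waitqueue(slot);
	ewait.slot = slot;		/* remember the slot address */
	prepare_to_wait_exclusive(wq, &ewait.wait, TASK_UNINTERRUPTIBLE);
	spin_unlock_irq(&mapping->tree_lock);
	/*
	 * Nothing pins the radix tree node from here on.  If the tree is
	 * shrunk or extended, the entry can be moved into a different node
	 * and the old node freed, so 'slot' is stale.  The eventual
	 * unlocker looks the entry up again, hashes the *new* slot address
	 * to pick a wait queue and passes that slot as the wake key, which
	 * never matches ewait.slot.
	 */
	schedule();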

								Honza

> 
> When a PMD entry is present in the tree, all offsets will map to the same
> 'slot' via radix tree lookups, and they will all share the same locking.
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> ---
>  fs/dax.c            | 59 +++++++++++++++++++++--------------------------------
>  include/linux/dax.h |  3 +--
>  mm/filemap.c        |  3 +--
>  3 files changed, 25 insertions(+), 40 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index fed6a52..0f1d053 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -62,11 +62,10 @@ static int __init init_dax_wait_table(void)
>  }
>  fs_initcall(init_dax_wait_table);
>  
> -static wait_queue_head_t *dax_entry_waitqueue(struct address_space *mapping,
> -					      pgoff_t index)
> +static wait_queue_head_t *dax_entry_waitqueue(void **slot)
>  {
> -	unsigned long hash = hash_long((unsigned long)mapping ^ index,
> -				       DAX_WAIT_TABLE_BITS);
> +	unsigned long hash = hash_long((unsigned long)slot,
> +					DAX_WAIT_TABLE_BITS);
>  	return wait_table + hash;
>  }
>  
> @@ -281,25 +280,19 @@ EXPORT_SYMBOL_GPL(dax_do_io);
>  /*
>   * DAX radix tree locking
>   */
> -struct exceptional_entry_key {
> -	struct address_space *mapping;
> -	unsigned long index;
> -};
> -
>  struct wait_exceptional_entry_queue {
>  	wait_queue_t wait;
> -	struct exceptional_entry_key key;
> +	void **slot;
>  };
>  
>  static int wake_exceptional_entry_func(wait_queue_t *wait, unsigned int mode,
>  				       int sync, void *keyp)
>  {
> -	struct exceptional_entry_key *key = keyp;
> +	void **slot = keyp;
>  	struct wait_exceptional_entry_queue *ewait =
>  		container_of(wait, struct wait_exceptional_entry_queue, wait);
>  
> -	if (key->mapping != ewait->key.mapping ||
> -	    key->index != ewait->key.index)
> +	if (slot != ewait->slot)
>  		return 0;
>  	return autoremove_wake_function(wait, mode, sync, NULL);
>  }
> @@ -357,12 +350,10 @@ static void *get_unlocked_mapping_entry(struct address_space *mapping,
>  {
>  	void *ret, **slot;
>  	struct wait_exceptional_entry_queue ewait;
> -	wait_queue_head_t *wq = dax_entry_waitqueue(mapping, index);
> +	wait_queue_head_t *wq;
>  
>  	init_wait(&ewait.wait);
>  	ewait.wait.func = wake_exceptional_entry_func;
> -	ewait.key.mapping = mapping;
> -	ewait.key.index = index;
>  
>  	for (;;) {
>  		ret = __radix_tree_lookup(&mapping->page_tree, index, NULL,
> @@ -373,6 +364,9 @@ static void *get_unlocked_mapping_entry(struct address_space *mapping,
>  				*slotp = slot;
>  			return ret;
>  		}
> +
> +		wq = dax_entry_waitqueue(slot);
> +		ewait.slot = slot;
>  		prepare_to_wait_exclusive(wq, &ewait.wait,
>  					  TASK_UNINTERRUPTIBLE);
>  		spin_unlock_irq(&mapping->tree_lock);
> @@ -445,10 +439,9 @@ restart:
>  	return entry;
>  }
>  
> -void dax_wake_mapping_entry_waiter(struct address_space *mapping,
> -				   pgoff_t index, bool wake_all)
> +void dax_wake_mapping_entry_waiter(void **slot, bool wake_all)
>  {
> -	wait_queue_head_t *wq = dax_entry_waitqueue(mapping, index);
> +	wait_queue_head_t *wq = dax_entry_waitqueue(slot);
>  
>  	/*
>  	 * Checking for locked entry and prepare_to_wait_exclusive() happens
> @@ -456,13 +449,8 @@ void dax_wake_mapping_entry_waiter(struct address_space *mapping,
>  	 * So at this point all tasks that could have seen our entry locked
>  	 * must be in the waitqueue and the following check will see them.
>  	 */
> -	if (waitqueue_active(wq)) {
> -		struct exceptional_entry_key key;
> -
> -		key.mapping = mapping;
> -		key.index = index;
> -		__wake_up(wq, TASK_NORMAL, wake_all ? 0 : 1, &key);
> -	}
> +	if (waitqueue_active(wq))
> +		__wake_up(wq, TASK_NORMAL, wake_all ? 0 : 1, slot);
>  }
>  
>  void dax_unlock_mapping_entry(struct address_space *mapping, pgoff_t index)
> @@ -478,7 +466,7 @@ void dax_unlock_mapping_entry(struct address_space *mapping, pgoff_t index)
>  	}
>  	unlock_slot(mapping, slot);
>  	spin_unlock_irq(&mapping->tree_lock);
> -	dax_wake_mapping_entry_waiter(mapping, index, false);
> +	dax_wake_mapping_entry_waiter(slot, false);
>  }
>  
>  static void put_locked_mapping_entry(struct address_space *mapping,
> @@ -496,14 +484,13 @@ static void put_locked_mapping_entry(struct address_space *mapping,
>   * Called when we are done with radix tree entry we looked up via
>   * get_unlocked_mapping_entry() and which we didn't lock in the end.
>   */
> -static void put_unlocked_mapping_entry(struct address_space *mapping,
> -				       pgoff_t index, void *entry)
> +static void put_unlocked_mapping_entry(void **slot, void *entry)
>  {
>  	if (!radix_tree_exceptional_entry(entry))
>  		return;
>  
>  	/* We have to wake up next waiter for the radix tree entry lock */
> -	dax_wake_mapping_entry_waiter(mapping, index, false);
> +	dax_wake_mapping_entry_waiter(slot, false);
>  }
>  
>  /*
> @@ -512,10 +499,10 @@ static void put_unlocked_mapping_entry(struct address_space *mapping,
>   */
>  int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index)
>  {
> -	void *entry;
> +	void *entry, **slot;
>  
>  	spin_lock_irq(&mapping->tree_lock);
> -	entry = get_unlocked_mapping_entry(mapping, index, NULL);
> +	entry = get_unlocked_mapping_entry(mapping, index, &slot);
>  	/*
>  	 * This gets called from truncate / punch_hole path. As such, the caller
>  	 * must hold locks protecting against concurrent modifications of the
> @@ -530,7 +517,7 @@ int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index)
>  	radix_tree_delete(&mapping->page_tree, index);
>  	mapping->nrexceptional--;
>  	spin_unlock_irq(&mapping->tree_lock);
> -	dax_wake_mapping_entry_waiter(mapping, index, true);
> +	dax_wake_mapping_entry_waiter(slot, true);
>  
>  	return 1;
>  }
> @@ -1118,15 +1105,15 @@ int dax_pfn_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
>  {
>  	struct file *file = vma->vm_file;
>  	struct address_space *mapping = file->f_mapping;
> -	void *entry;
> +	void *entry, **slot;
>  	pgoff_t index = vmf->pgoff;
>  
>  	spin_lock_irq(&mapping->tree_lock);
> -	entry = get_unlocked_mapping_entry(mapping, index, NULL);
> +	entry = get_unlocked_mapping_entry(mapping, index, &slot);
>  	if (!entry || !radix_tree_exceptional_entry(entry))
>  		goto out;
>  	radix_tree_tag_set(&mapping->page_tree, index, PAGECACHE_TAG_DIRTY);
> -	put_unlocked_mapping_entry(mapping, index, entry);
> +	put_unlocked_mapping_entry(slot, entry);
>  out:
>  	spin_unlock_irq(&mapping->tree_lock);
>  	return VM_FAULT_NOPAGE;
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index 9c6dc77..8bcb852 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -15,8 +15,7 @@ int dax_zero_page_range(struct inode *, loff_t from, unsigned len, get_block_t);
>  int dax_truncate_page(struct inode *, loff_t from, get_block_t);
>  int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t);
>  int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index);
> -void dax_wake_mapping_entry_waiter(struct address_space *mapping,
> -				   pgoff_t index, bool wake_all);
> +void dax_wake_mapping_entry_waiter(void **slot, bool wake_all);
>  
>  #ifdef CONFIG_FS_DAX
>  struct page *read_dax_sector(struct block_device *bdev, sector_t n);
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 8a287df..56c4ac7 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -617,8 +617,7 @@ static int page_cache_tree_insert(struct address_space *mapping,
>  			if (node)
>  				workingset_node_pages_dec(node);
>  			/* Wakeup waiters for exceptional entry lock */
> -			dax_wake_mapping_entry_waiter(mapping, page->index,
> -						      false);
> +			dax_wake_mapping_entry_waiter(slot, false);
>  		}
>  	}
>  	radix_tree_replace_slot(slot, page);
> -- 
> 2.9.0
> 
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 5/7] dax: lock based on slot instead of [mapping, index]
@ 2016-08-16  9:28     ` Jan Kara
  0 siblings, 0 replies; 72+ messages in thread
From: Jan Kara @ 2016-08-16  9:28 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Dan Williams, Dave Chinner, Jan Kara, linux-ext4,
	linux-fsdevel, linux-mm, linux-nvdimm

On Mon 15-08-16 13:09:16, Ross Zwisler wrote:
> DAX radix tree locking currently locks entries based on the unique
> combination of the 'mapping' pointer and the pgoff_t 'index' for the entry.
> This works for PTEs, but as we move to PMDs we will need all the offsets
> within the range covered by the PMD to map to the same bit lock.
> To accomplish this, lock based on the 'slot' pointer in the radix tree
> instead of [mapping, index].

I'm not convinced this is safe. What makes the slot pointer still valid
after you drop tree_lock? At least radix_tree_shrink() or
radix_tree_expand() could move your slot without letting the waiter know,
and it would never be woken.

								Honza

> 
> When a PMD entry is present in the tree, all offsets will map to the same
> 'slot' via radix tree lookups, and they will all share the same locking.
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> ---
>  fs/dax.c            | 59 +++++++++++++++++++++--------------------------------
>  include/linux/dax.h |  3 +--
>  mm/filemap.c        |  3 +--
>  3 files changed, 25 insertions(+), 40 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index fed6a52..0f1d053 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -62,11 +62,10 @@ static int __init init_dax_wait_table(void)
>  }
>  fs_initcall(init_dax_wait_table);
>  
> -static wait_queue_head_t *dax_entry_waitqueue(struct address_space *mapping,
> -					      pgoff_t index)
> +static wait_queue_head_t *dax_entry_waitqueue(void **slot)
>  {
> -	unsigned long hash = hash_long((unsigned long)mapping ^ index,
> -				       DAX_WAIT_TABLE_BITS);
> +	unsigned long hash = hash_long((unsigned long)slot,
> +					DAX_WAIT_TABLE_BITS);
>  	return wait_table + hash;
>  }
>  
> @@ -281,25 +280,19 @@ EXPORT_SYMBOL_GPL(dax_do_io);
>  /*
>   * DAX radix tree locking
>   */
> -struct exceptional_entry_key {
> -	struct address_space *mapping;
> -	unsigned long index;
> -};
> -
>  struct wait_exceptional_entry_queue {
>  	wait_queue_t wait;
> -	struct exceptional_entry_key key;
> +	void **slot;
>  };
>  
>  static int wake_exceptional_entry_func(wait_queue_t *wait, unsigned int mode,
>  				       int sync, void *keyp)
>  {
> -	struct exceptional_entry_key *key = keyp;
> +	void **slot = keyp;
>  	struct wait_exceptional_entry_queue *ewait =
>  		container_of(wait, struct wait_exceptional_entry_queue, wait);
>  
> -	if (key->mapping != ewait->key.mapping ||
> -	    key->index != ewait->key.index)
> +	if (slot != ewait->slot)
>  		return 0;
>  	return autoremove_wake_function(wait, mode, sync, NULL);
>  }
> @@ -357,12 +350,10 @@ static void *get_unlocked_mapping_entry(struct address_space *mapping,
>  {
>  	void *ret, **slot;
>  	struct wait_exceptional_entry_queue ewait;
> -	wait_queue_head_t *wq = dax_entry_waitqueue(mapping, index);
> +	wait_queue_head_t *wq;
>  
>  	init_wait(&ewait.wait);
>  	ewait.wait.func = wake_exceptional_entry_func;
> -	ewait.key.mapping = mapping;
> -	ewait.key.index = index;
>  
>  	for (;;) {
>  		ret = __radix_tree_lookup(&mapping->page_tree, index, NULL,
> @@ -373,6 +364,9 @@ static void *get_unlocked_mapping_entry(struct address_space *mapping,
>  				*slotp = slot;
>  			return ret;
>  		}
> +
> +		wq = dax_entry_waitqueue(slot);
> +		ewait.slot = slot;
>  		prepare_to_wait_exclusive(wq, &ewait.wait,
>  					  TASK_UNINTERRUPTIBLE);
>  		spin_unlock_irq(&mapping->tree_lock);
> @@ -445,10 +439,9 @@ restart:
>  	return entry;
>  }
>  
> -void dax_wake_mapping_entry_waiter(struct address_space *mapping,
> -				   pgoff_t index, bool wake_all)
> +void dax_wake_mapping_entry_waiter(void **slot, bool wake_all)
>  {
> -	wait_queue_head_t *wq = dax_entry_waitqueue(mapping, index);
> +	wait_queue_head_t *wq = dax_entry_waitqueue(slot);
>  
>  	/*
>  	 * Checking for locked entry and prepare_to_wait_exclusive() happens
> @@ -456,13 +449,8 @@ void dax_wake_mapping_entry_waiter(struct address_space *mapping,
>  	 * So at this point all tasks that could have seen our entry locked
>  	 * must be in the waitqueue and the following check will see them.
>  	 */
> -	if (waitqueue_active(wq)) {
> -		struct exceptional_entry_key key;
> -
> -		key.mapping = mapping;
> -		key.index = index;
> -		__wake_up(wq, TASK_NORMAL, wake_all ? 0 : 1, &key);
> -	}
> +	if (waitqueue_active(wq))
> +		__wake_up(wq, TASK_NORMAL, wake_all ? 0 : 1, slot);
>  }
>  
>  void dax_unlock_mapping_entry(struct address_space *mapping, pgoff_t index)
> @@ -478,7 +466,7 @@ void dax_unlock_mapping_entry(struct address_space *mapping, pgoff_t index)
>  	}
>  	unlock_slot(mapping, slot);
>  	spin_unlock_irq(&mapping->tree_lock);
> -	dax_wake_mapping_entry_waiter(mapping, index, false);
> +	dax_wake_mapping_entry_waiter(slot, false);
>  }
>  
>  static void put_locked_mapping_entry(struct address_space *mapping,
> @@ -496,14 +484,13 @@ static void put_locked_mapping_entry(struct address_space *mapping,
>   * Called when we are done with radix tree entry we looked up via
>   * get_unlocked_mapping_entry() and which we didn't lock in the end.
>   */
> -static void put_unlocked_mapping_entry(struct address_space *mapping,
> -				       pgoff_t index, void *entry)
> +static void put_unlocked_mapping_entry(void **slot, void *entry)
>  {
>  	if (!radix_tree_exceptional_entry(entry))
>  		return;
>  
>  	/* We have to wake up next waiter for the radix tree entry lock */
> -	dax_wake_mapping_entry_waiter(mapping, index, false);
> +	dax_wake_mapping_entry_waiter(slot, false);
>  }
>  
>  /*
> @@ -512,10 +499,10 @@ static void put_unlocked_mapping_entry(struct address_space *mapping,
>   */
>  int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index)
>  {
> -	void *entry;
> +	void *entry, **slot;
>  
>  	spin_lock_irq(&mapping->tree_lock);
> -	entry = get_unlocked_mapping_entry(mapping, index, NULL);
> +	entry = get_unlocked_mapping_entry(mapping, index, &slot);
>  	/*
>  	 * This gets called from truncate / punch_hole path. As such, the caller
>  	 * must hold locks protecting against concurrent modifications of the
> @@ -530,7 +517,7 @@ int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index)
>  	radix_tree_delete(&mapping->page_tree, index);
>  	mapping->nrexceptional--;
>  	spin_unlock_irq(&mapping->tree_lock);
> -	dax_wake_mapping_entry_waiter(mapping, index, true);
> +	dax_wake_mapping_entry_waiter(slot, true);
>  
>  	return 1;
>  }
> @@ -1118,15 +1105,15 @@ int dax_pfn_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
>  {
>  	struct file *file = vma->vm_file;
>  	struct address_space *mapping = file->f_mapping;
> -	void *entry;
> +	void *entry, **slot;
>  	pgoff_t index = vmf->pgoff;
>  
>  	spin_lock_irq(&mapping->tree_lock);
> -	entry = get_unlocked_mapping_entry(mapping, index, NULL);
> +	entry = get_unlocked_mapping_entry(mapping, index, &slot);
>  	if (!entry || !radix_tree_exceptional_entry(entry))
>  		goto out;
>  	radix_tree_tag_set(&mapping->page_tree, index, PAGECACHE_TAG_DIRTY);
> -	put_unlocked_mapping_entry(mapping, index, entry);
> +	put_unlocked_mapping_entry(slot, entry);
>  out:
>  	spin_unlock_irq(&mapping->tree_lock);
>  	return VM_FAULT_NOPAGE;
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index 9c6dc77..8bcb852 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -15,8 +15,7 @@ int dax_zero_page_range(struct inode *, loff_t from, unsigned len, get_block_t);
>  int dax_truncate_page(struct inode *, loff_t from, get_block_t);
>  int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t);
>  int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index);
> -void dax_wake_mapping_entry_waiter(struct address_space *mapping,
> -				   pgoff_t index, bool wake_all);
> +void dax_wake_mapping_entry_waiter(void **slot, bool wake_all);
>  
>  #ifdef CONFIG_FS_DAX
>  struct page *read_dax_sector(struct block_device *bdev, sector_t n);
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 8a287df..56c4ac7 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -617,8 +617,7 @@ static int page_cache_tree_insert(struct address_space *mapping,
>  			if (node)
>  				workingset_node_pages_dec(node);
>  			/* Wakeup waiters for exceptional entry lock */
> -			dax_wake_mapping_entry_waiter(mapping, page->index,
> -						      false);
> +			dax_wake_mapping_entry_waiter(slot, false);
>  		}
>  	}
>  	radix_tree_replace_slot(slot, page);
> -- 
> 2.9.0
> 
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 5/7] dax: lock based on slot instead of [mapping, index]
@ 2016-08-16  9:28     ` Jan Kara
  0 siblings, 0 replies; 72+ messages in thread
From: Jan Kara @ 2016-08-16  9:28 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Dan Williams, Dave Chinner, Jan Kara, linux-ext4,
	linux-fsdevel, linux-mm, linux-nvdimm

On Mon 15-08-16 13:09:16, Ross Zwisler wrote:
> DAX radix tree locking currently locks entries based on the unique
> combination of the 'mapping' pointer and the pgoff_t 'index' for the entry.
> This works for PTEs, but as we move to PMDs we will need all the offsets
> within the range covered by the PMD to map to the same bit lock.
> To accomplish this, lock based on the 'slot' pointer in the radix tree
> instead of [mapping, index].

I'm not convinced this is safe. What makes the slot pointer still valid
after you drop tree_lock? At least radix_tree_shrink() or
radix_tree_expand() could move your slot without letting the waiter know,
and it would never be woken.

								Honza

> 
> When a PMD entry is present in the tree, all offsets will map to the same
> 'slot' via radix tree lookups, and they will all share the same locking.
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> ---
>  fs/dax.c            | 59 +++++++++++++++++++++--------------------------------
>  include/linux/dax.h |  3 +--
>  mm/filemap.c        |  3 +--
>  3 files changed, 25 insertions(+), 40 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index fed6a52..0f1d053 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -62,11 +62,10 @@ static int __init init_dax_wait_table(void)
>  }
>  fs_initcall(init_dax_wait_table);
>  
> -static wait_queue_head_t *dax_entry_waitqueue(struct address_space *mapping,
> -					      pgoff_t index)
> +static wait_queue_head_t *dax_entry_waitqueue(void **slot)
>  {
> -	unsigned long hash = hash_long((unsigned long)mapping ^ index,
> -				       DAX_WAIT_TABLE_BITS);
> +	unsigned long hash = hash_long((unsigned long)slot,
> +					DAX_WAIT_TABLE_BITS);
>  	return wait_table + hash;
>  }
>  
> @@ -281,25 +280,19 @@ EXPORT_SYMBOL_GPL(dax_do_io);
>  /*
>   * DAX radix tree locking
>   */
> -struct exceptional_entry_key {
> -	struct address_space *mapping;
> -	unsigned long index;
> -};
> -
>  struct wait_exceptional_entry_queue {
>  	wait_queue_t wait;
> -	struct exceptional_entry_key key;
> +	void **slot;
>  };
>  
>  static int wake_exceptional_entry_func(wait_queue_t *wait, unsigned int mode,
>  				       int sync, void *keyp)
>  {
> -	struct exceptional_entry_key *key = keyp;
> +	void **slot = keyp;
>  	struct wait_exceptional_entry_queue *ewait =
>  		container_of(wait, struct wait_exceptional_entry_queue, wait);
>  
> -	if (key->mapping != ewait->key.mapping ||
> -	    key->index != ewait->key.index)
> +	if (slot != ewait->slot)
>  		return 0;
>  	return autoremove_wake_function(wait, mode, sync, NULL);
>  }
> @@ -357,12 +350,10 @@ static void *get_unlocked_mapping_entry(struct address_space *mapping,
>  {
>  	void *ret, **slot;
>  	struct wait_exceptional_entry_queue ewait;
> -	wait_queue_head_t *wq = dax_entry_waitqueue(mapping, index);
> +	wait_queue_head_t *wq;
>  
>  	init_wait(&ewait.wait);
>  	ewait.wait.func = wake_exceptional_entry_func;
> -	ewait.key.mapping = mapping;
> -	ewait.key.index = index;
>  
>  	for (;;) {
>  		ret = __radix_tree_lookup(&mapping->page_tree, index, NULL,
> @@ -373,6 +364,9 @@ static void *get_unlocked_mapping_entry(struct address_space *mapping,
>  				*slotp = slot;
>  			return ret;
>  		}
> +
> +		wq = dax_entry_waitqueue(slot);
> +		ewait.slot = slot;
>  		prepare_to_wait_exclusive(wq, &ewait.wait,
>  					  TASK_UNINTERRUPTIBLE);
>  		spin_unlock_irq(&mapping->tree_lock);
> @@ -445,10 +439,9 @@ restart:
>  	return entry;
>  }
>  
> -void dax_wake_mapping_entry_waiter(struct address_space *mapping,
> -				   pgoff_t index, bool wake_all)
> +void dax_wake_mapping_entry_waiter(void **slot, bool wake_all)
>  {
> -	wait_queue_head_t *wq = dax_entry_waitqueue(mapping, index);
> +	wait_queue_head_t *wq = dax_entry_waitqueue(slot);
>  
>  	/*
>  	 * Checking for locked entry and prepare_to_wait_exclusive() happens
> @@ -456,13 +449,8 @@ void dax_wake_mapping_entry_waiter(struct address_space *mapping,
>  	 * So at this point all tasks that could have seen our entry locked
>  	 * must be in the waitqueue and the following check will see them.
>  	 */
> -	if (waitqueue_active(wq)) {
> -		struct exceptional_entry_key key;
> -
> -		key.mapping = mapping;
> -		key.index = index;
> -		__wake_up(wq, TASK_NORMAL, wake_all ? 0 : 1, &key);
> -	}
> +	if (waitqueue_active(wq))
> +		__wake_up(wq, TASK_NORMAL, wake_all ? 0 : 1, slot);
>  }
>  
>  void dax_unlock_mapping_entry(struct address_space *mapping, pgoff_t index)
> @@ -478,7 +466,7 @@ void dax_unlock_mapping_entry(struct address_space *mapping, pgoff_t index)
>  	}
>  	unlock_slot(mapping, slot);
>  	spin_unlock_irq(&mapping->tree_lock);
> -	dax_wake_mapping_entry_waiter(mapping, index, false);
> +	dax_wake_mapping_entry_waiter(slot, false);
>  }
>  
>  static void put_locked_mapping_entry(struct address_space *mapping,
> @@ -496,14 +484,13 @@ static void put_locked_mapping_entry(struct address_space *mapping,
>   * Called when we are done with radix tree entry we looked up via
>   * get_unlocked_mapping_entry() and which we didn't lock in the end.
>   */
> -static void put_unlocked_mapping_entry(struct address_space *mapping,
> -				       pgoff_t index, void *entry)
> +static void put_unlocked_mapping_entry(void **slot, void *entry)
>  {
>  	if (!radix_tree_exceptional_entry(entry))
>  		return;
>  
>  	/* We have to wake up next waiter for the radix tree entry lock */
> -	dax_wake_mapping_entry_waiter(mapping, index, false);
> +	dax_wake_mapping_entry_waiter(slot, false);
>  }
>  
>  /*
> @@ -512,10 +499,10 @@ static void put_unlocked_mapping_entry(struct address_space *mapping,
>   */
>  int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index)
>  {
> -	void *entry;
> +	void *entry, **slot;
>  
>  	spin_lock_irq(&mapping->tree_lock);
> -	entry = get_unlocked_mapping_entry(mapping, index, NULL);
> +	entry = get_unlocked_mapping_entry(mapping, index, &slot);
>  	/*
>  	 * This gets called from truncate / punch_hole path. As such, the caller
>  	 * must hold locks protecting against concurrent modifications of the
> @@ -530,7 +517,7 @@ int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index)
>  	radix_tree_delete(&mapping->page_tree, index);
>  	mapping->nrexceptional--;
>  	spin_unlock_irq(&mapping->tree_lock);
> -	dax_wake_mapping_entry_waiter(mapping, index, true);
> +	dax_wake_mapping_entry_waiter(slot, true);
>  
>  	return 1;
>  }
> @@ -1118,15 +1105,15 @@ int dax_pfn_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
>  {
>  	struct file *file = vma->vm_file;
>  	struct address_space *mapping = file->f_mapping;
> -	void *entry;
> +	void *entry, **slot;
>  	pgoff_t index = vmf->pgoff;
>  
>  	spin_lock_irq(&mapping->tree_lock);
> -	entry = get_unlocked_mapping_entry(mapping, index, NULL);
> +	entry = get_unlocked_mapping_entry(mapping, index, &slot);
>  	if (!entry || !radix_tree_exceptional_entry(entry))
>  		goto out;
>  	radix_tree_tag_set(&mapping->page_tree, index, PAGECACHE_TAG_DIRTY);
> -	put_unlocked_mapping_entry(mapping, index, entry);
> +	put_unlocked_mapping_entry(slot, entry);
>  out:
>  	spin_unlock_irq(&mapping->tree_lock);
>  	return VM_FAULT_NOPAGE;
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index 9c6dc77..8bcb852 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -15,8 +15,7 @@ int dax_zero_page_range(struct inode *, loff_t from, unsigned len, get_block_t);
>  int dax_truncate_page(struct inode *, loff_t from, get_block_t);
>  int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t);
>  int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index);
> -void dax_wake_mapping_entry_waiter(struct address_space *mapping,
> -				   pgoff_t index, bool wake_all);
> +void dax_wake_mapping_entry_waiter(void **slot, bool wake_all);
>  
>  #ifdef CONFIG_FS_DAX
>  struct page *read_dax_sector(struct block_device *bdev, sector_t n);
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 8a287df..56c4ac7 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -617,8 +617,7 @@ static int page_cache_tree_insert(struct address_space *mapping,
>  			if (node)
>  				workingset_node_pages_dec(node);
>  			/* Wakeup waiters for exceptional entry lock */
> -			dax_wake_mapping_entry_waiter(mapping, page->index,
> -						      false);
> +			dax_wake_mapping_entry_waiter(slot, false);
>  		}
>  	}
>  	radix_tree_replace_slot(slot, page);
> -- 
> 2.9.0
> 
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 1/7] ext2: tell DAX the size of allocation holes
  2016-08-16  9:10     ` Jan Kara
  (?)
  (?)
@ 2016-08-16 22:52       ` Ross Zwisler
  -1 siblings, 0 replies; 72+ messages in thread
From: Ross Zwisler @ 2016-08-16 22:52 UTC (permalink / raw)
  To: Jan Kara
  Cc: Theodore Ts'o, linux-nvdimm, Dave Chinner, linux-kernel,
	linux-mm, Andreas Dilger, Alexander Viro, Jan Kara,
	linux-fsdevel, linux-ext4, Andrew Morton

On Tue, Aug 16, 2016 at 11:10:25AM +0200, Jan Kara wrote:
> On Mon 15-08-16 13:09:12, Ross Zwisler wrote:
> > When DAX calls ext2_get_block() and the file offset points to a hole we
> > currently don't set bh_result->b_size.  When we re-enable PMD faults DAX
> > will need bh_result->b_size to tell it the size of the hole so it can
> > decide whether to fault in a 4 KiB zero page or a 2 MiB zero page.
> > 
> > For ext2 we always want DAX to use 4 KiB zero pages, so we just tell DAX
> > that all holes are 4 KiB in size.
> > 
> > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> > ---
> >  fs/ext2/inode.c | 6 ++++++
> >  1 file changed, 6 insertions(+)
> > 
> > diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
> > index d5c7d09..c6d9763 100644
> > --- a/fs/ext2/inode.c
> > +++ b/fs/ext2/inode.c
> > @@ -773,6 +773,12 @@ int ext2_get_block(struct inode *inode, sector_t iblock, struct buffer_head *bh_
> >  	if (ret > 0) {
> >  		bh_result->b_size = (ret << inode->i_blkbits);
> >  		ret = 0;
> > +	} else if (ret == 0 && IS_DAX(inode)) {
> 
> I'd just drop the IS_DAX() check and set
> 
> 	bh_result->b_size = 1 << inode->i_blkbits;
> 
> IMO it's better to have things consistent between DAX & !DAX whenever
> possible.

Agreed, this is better.  Fixed for v2, thanks!

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 1/7] ext2: tell DAX the size of allocation holes
@ 2016-08-16 22:52       ` Ross Zwisler
  0 siblings, 0 replies; 72+ messages in thread
From: Ross Zwisler @ 2016-08-16 22:52 UTC (permalink / raw)
  To: Jan Kara
  Cc: Ross Zwisler, linux-kernel, Theodore Ts'o, Alexander Viro,
	Andreas Dilger, Andrew Morton, Dan Williams, Dave Chinner,
	Jan Kara, linux-ext4, linux-fsdevel, linux-mm, linux-nvdimm

On Tue, Aug 16, 2016 at 11:10:25AM +0200, Jan Kara wrote:
> On Mon 15-08-16 13:09:12, Ross Zwisler wrote:
> > When DAX calls ext2_get_block() and the file offset points to a hole we
> > currently don't set bh_result->b_size.  When we re-enable PMD faults DAX
> > will need bh_result->b_size to tell it the size of the hole so it can
> > decide whether to fault in a 4 KiB zero page or a 2 MiB zero page.
> > 
> > For ext2 we always want DAX to use 4 KiB zero pages, so we just tell DAX
> > that all holes are 4 KiB in size.
> > 
> > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> > ---
> >  fs/ext2/inode.c | 6 ++++++
> >  1 file changed, 6 insertions(+)
> > 
> > diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
> > index d5c7d09..c6d9763 100644
> > --- a/fs/ext2/inode.c
> > +++ b/fs/ext2/inode.c
> > @@ -773,6 +773,12 @@ int ext2_get_block(struct inode *inode, sector_t iblock, struct buffer_head *bh_
> >  	if (ret > 0) {
> >  		bh_result->b_size = (ret << inode->i_blkbits);
> >  		ret = 0;
> > +	} else if (ret == 0 && IS_DAX(inode)) {
> 
> I'd just drop the IS_DAX() check and set
> 
> 	bh_result->b_size = 1 << inode->i_blkbits;
> 
> IMO it's better to have things consistent between DAX & !DAX whenever
> possible.

Agreed, this is better.  Fixed for v2, thanks!

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 1/7] ext2: tell DAX the size of allocation holes
@ 2016-08-16 22:52       ` Ross Zwisler
  0 siblings, 0 replies; 72+ messages in thread
From: Ross Zwisler @ 2016-08-16 22:52 UTC (permalink / raw)
  To: Jan Kara
  Cc: Ross Zwisler, linux-kernel, Theodore Ts'o, Alexander Viro,
	Andreas Dilger, Andrew Morton, Dan Williams, Dave Chinner,
	Jan Kara, linux-ext4, linux-fsdevel, linux-mm, linux-nvdimm

On Tue, Aug 16, 2016 at 11:10:25AM +0200, Jan Kara wrote:
> On Mon 15-08-16 13:09:12, Ross Zwisler wrote:
> > When DAX calls ext2_get_block() and the file offset points to a hole we
> > currently don't set bh_result->b_size.  When we re-enable PMD faults DAX
> > will need bh_result->b_size to tell it the size of the hole so it can
> > decide whether to fault in a 4 KiB zero page or a 2 MiB zero page.
> > 
> > For ext2 we always want DAX to use 4 KiB zero pages, so we just tell DAX
> > that all holes are 4 KiB in size.
> > 
> > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> > ---
> >  fs/ext2/inode.c | 6 ++++++
> >  1 file changed, 6 insertions(+)
> > 
> > diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
> > index d5c7d09..c6d9763 100644
> > --- a/fs/ext2/inode.c
> > +++ b/fs/ext2/inode.c
> > @@ -773,6 +773,12 @@ int ext2_get_block(struct inode *inode, sector_t iblock, struct buffer_head *bh_
> >  	if (ret > 0) {
> >  		bh_result->b_size = (ret << inode->i_blkbits);
> >  		ret = 0;
> > +	} else if (ret == 0 && IS_DAX(inode)) {
> 
> I'd just drop the IS_DAX() check and set
> 
> 	bh_result->b_size = 1 << inode->i_blkbits;
> 
> IMO it's better to have things consistent between DAX & !DAX whenever
> possible.

Agreed, this is better.  Fixed for v2, thanks!


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 1/7] ext2: tell DAX the size of allocation holes
@ 2016-08-16 22:52       ` Ross Zwisler
  0 siblings, 0 replies; 72+ messages in thread
From: Ross Zwisler @ 2016-08-16 22:52 UTC (permalink / raw)
  To: Jan Kara
  Cc: Theodore Ts'o, linux-nvdimm, Dave Chinner, linux-kernel,
	linux-mm, Andreas Dilger, Alexander Viro, Jan Kara,
	linux-fsdevel, linux-ext4, Andrew Morton

On Tue, Aug 16, 2016 at 11:10:25AM +0200, Jan Kara wrote:
> On Mon 15-08-16 13:09:12, Ross Zwisler wrote:
> > When DAX calls ext2_get_block() and the file offset points to a hole we
> > currently don't set bh_result->b_size.  When we re-enable PMD faults DAX
> > will need bh_result->b_size to tell it the size of the hole so it can
> > decide whether to fault in a 4 KiB zero page or a 2 MiB zero page.
> > 
> > For ext2 we always want DAX to use 4 KiB zero pages, so we just tell DAX
> > that all holes are 4 KiB in size.
> > 
> > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> > ---
> >  fs/ext2/inode.c | 6 ++++++
> >  1 file changed, 6 insertions(+)
> > 
> > diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
> > index d5c7d09..c6d9763 100644
> > --- a/fs/ext2/inode.c
> > +++ b/fs/ext2/inode.c
> > @@ -773,6 +773,12 @@ int ext2_get_block(struct inode *inode, sector_t iblock, struct buffer_head *bh_
> >  	if (ret > 0) {
> >  		bh_result->b_size = (ret << inode->i_blkbits);
> >  		ret = 0;
> > +	} else if (ret == 0 && IS_DAX(inode)) {
> 
> I'd just drop the IS_DAX() check and set
> 
> 	bh_result->b_size = 1 << inode->i_blkbits;
> 
> IMO it's better to have things consistent between DAX & !DAX whenever
> possible.

Agreed, this is better.  Fixed for v2, thanks!

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 0/7] re-enable DAX PMD support
  2016-08-15 21:14       ` Dan Williams
  (?)
@ 2016-08-17 16:21         ` Ross Zwisler
  -1 siblings, 0 replies; 72+ messages in thread
From: Ross Zwisler @ 2016-08-17 16:21 UTC (permalink / raw)
  To: Dan Williams, Jan Kara
  Cc: Ross Zwisler, linux-kernel, Theodore Ts'o, Alexander Viro,
	Andreas Dilger, Andrew Morton, Dave Chinner, Jan Kara,
	linux-ext4, linux-fsdevel, Linux MM, linux-nvdimm

On Mon, Aug 15, 2016 at 02:14:14PM -0700, Dan Williams wrote:
> On Mon, Aug 15, 2016 at 2:11 PM, Ross Zwisler
> <ross.zwisler@linux.intel.com> wrote:
> > On Mon, Aug 15, 2016 at 01:21:47PM -0700, Dan Williams wrote:
> >> On Mon, Aug 15, 2016 at 12:09 PM, Ross Zwisler
> >> <ross.zwisler@linux.intel.com> wrote:
> >> > DAX PMDs have been disabled since Jan Kara introduced DAX radix tree based
> >> > locking.  This series allows DAX PMDs to participate in the DAX radix tree
> >> > based locking scheme so that they can be re-enabled.
> >>
> >> Looks good to me.
> >>
> >> > This series restores DAX PMD functionality back to what it was before it
> >> > was disabled.  There is still a known issue between DAX PMDs and hole
> >> > punch, which I am currently working on and which I plan to address with a
> >> > separate series.
> >>
> >> Perhaps we should hold off on applying patch 6 and 7 until after the
> >> hole-punch fix is ready?
> >
> > Sure, I'm cool with holding off on patch 7 (the Kconfig change) until after
> > the hole punch fix is ready.
> >
> > I don't see a reason to hold off on patch 6, though?  It stands on its own,
> > implements the correct locking, and doesn't break anything.
> 
> Whoops, I just meant 7.

Well, it looks like the hole punch case is much improved since I tested it
last!  :)  I used to be able to generate a few different kernel BUGs when hole
punching DAX PMDs, but those have apparently been fixed in the mm layer since
I was last testing, which admittedly was quite a long time ago (February?).

The only issue I was able to find with DAX PMD hole punching was that ext4
wasn't properly doing a writeback before the hole was unmapped and the radix
tree entries were removed.  This issue applies equally to the 4k case, so I've
submitted a bug fix for v4.8:

https://lists.01.org/pipermail/linux-nvdimm/2016-August/006621.html

With that applied, I don't know of any more issues related to DAX PMDs and
hole punch.  I've tested ext4 and XFS (ext2 doesn't support hole punch), and
they both properly do a writeback of all affected PMDs, fully unmap all
affected PMDs, and remove the radix tree entries.  I've tested that accesses
to addresses previously covered by the old PMDs generate new page faults, and
4k pages are now faulted in because the block allocator no longer has 2 MiB
contiguous allocations.
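
For anyone who wants to reproduce this kind of test, the core of it is just
mmap + fallocate over a 2 MiB aligned range, along these lines (illustrative
userspace sketch with error handling omitted, not the exact test I ran;
whether the initial faults are PMD-sized depends on block and virtual
address alignment):

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <string.h>
	#include <sys/mman.h>
	#include <unistd.h>

	#define MB (1UL << 20)

	int main(void)
	{
		/* assumes a DAX-capable fs mounted with -o dax at /mnt/dax */
		int fd = open("/mnt/dax/testfile", O_CREAT | O_RDWR, 0644);
		char *p;

		ftruncate(fd, 16 * MB);
		p = mmap(NULL, 16 * MB, PROT_READ | PROT_WRITE, MAP_SHARED,
			 fd, 0);

		/* fault in and dirty the whole file */
		memset(p, 0xab, 16 * MB);

		/* punch a 2 MiB hole at a 2 MiB aligned file offset */
		fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
			  2 * MB, 2 * MB);

		/* this access must take a fresh fault, now served with 4k pages */
		p[2 * MB] = 1;

		munmap(p, 16 * MB);
		close(fd);
		return 0;
	}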

One question (probably for Jan): should the above ext4 fix be marked for
stable?

Thanks,
- Ross


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 0/7] re-enable DAX PMD support
@ 2016-08-17 16:21         ` Ross Zwisler
  0 siblings, 0 replies; 72+ messages in thread
From: Ross Zwisler @ 2016-08-17 16:21 UTC (permalink / raw)
  To: Dan Williams, Jan Kara
  Cc: Ross Zwisler, linux-kernel, Theodore Ts'o, Alexander Viro,
	Andreas Dilger, Andrew Morton, Dave Chinner, Jan Kara,
	linux-ext4, linux-fsdevel, Linux MM, linux-nvdimm@lists.01.org

On Mon, Aug 15, 2016 at 02:14:14PM -0700, Dan Williams wrote:
> On Mon, Aug 15, 2016 at 2:11 PM, Ross Zwisler
> <ross.zwisler@linux.intel.com> wrote:
> > On Mon, Aug 15, 2016 at 01:21:47PM -0700, Dan Williams wrote:
> >> On Mon, Aug 15, 2016 at 12:09 PM, Ross Zwisler
> >> <ross.zwisler@linux.intel.com> wrote:
> >> > DAX PMDs have been disabled since Jan Kara introduced DAX radix tree based
> >> > locking.  This series allows DAX PMDs to participate in the DAX radix tree
> >> > based locking scheme so that they can be re-enabled.
> >>
> >> Looks good to me.
> >>
> >> > This series restores DAX PMD functionality back to what it was before it
> >> > was disabled.  There is still a known issue between DAX PMDs and hole
> >> > punch, which I am currently working on and which I plan to address with a
> >> > separate series.
> >>
> >> Perhaps we should hold off on applying patch 6 and 7 until after the
> >> hole-punch fix is ready?
> >
> > Sure, I'm cool with holding off on patch 7 (the Kconfig change) until after
> > the hole punch fix is ready.
> >
> > I don't see a reason to hold off on patch 6, though?  It stands on its own,
> > implements the correct locking, and doesn't break anything.
> 
> Whoops, I just meant 7.

Well, it looks like the hole punch case is much improved since I tested it
last!  :)  I used to be able to generate a few different kernel BUGs when hole
punching DAX PMDs, but those have apparently been fixed in the mm layer since
I was last testing, which admittedly was quite a long time ago (February?).

The only issue I was able to find with DAX PMD hole punching was that ext4
wasn't properly doing a writeback before the hole was unmapped and the radix
tree entries were removed.  This issue applies equally to the 4k case, so I've
submitted a bug fix for v4.8:

https://lists.01.org/pipermail/linux-nvdimm/2016-August/006621.html

With that applied, I don't know of any more issues related to DAX PMDs and
hole punch.  I've tested ext4 and XFS (ext2 doesn't support hole punch), and
they both properly do a writeback of all affected PMDs, fully unmap all
affected PMDs, and remove the radix tree entries.  I've tested that accesses
to addresses previously covered by the old PMDs generate new page faults, and
4k pages are now faulted in because the block allocator no longer has 2 MiB
contiguous allocations.

One question (probably for Jan): should the above ext4 fix be marked for
stable?

Thanks,
- Ross

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 0/7] re-enable DAX PMD support
@ 2016-08-17 16:21         ` Ross Zwisler
  0 siblings, 0 replies; 72+ messages in thread
From: Ross Zwisler @ 2016-08-17 16:21 UTC (permalink / raw)
  To: Dan Williams, Jan Kara
  Cc: Theodore Ts'o, linux-nvdimm,
	Dave Chinner, linux-kernel, Linux MM,
	Andreas Dilger, Alexander Viro, Jan Kara, linux-fsdevel,
	linux-ext4, Andrew Morton

On Mon, Aug 15, 2016 at 02:14:14PM -0700, Dan Williams wrote:
> On Mon, Aug 15, 2016 at 2:11 PM, Ross Zwisler
> <ross.zwisler@linux.intel.com> wrote:
> > On Mon, Aug 15, 2016 at 01:21:47PM -0700, Dan Williams wrote:
> >> On Mon, Aug 15, 2016 at 12:09 PM, Ross Zwisler
> >> <ross.zwisler@linux.intel.com> wrote:
> >> > DAX PMDs have been disabled since Jan Kara introduced DAX radix tree based
> >> > locking.  This series allows DAX PMDs to participate in the DAX radix tree
> >> > based locking scheme so that they can be re-enabled.
> >>
> >> Looks good to me.
> >>
> >> > This series restores DAX PMD functionality back to what it was before it
> >> > was disabled.  There is still a known issue between DAX PMDs and hole
> >> > punch, which I am currently working on and which I plan to address with a
> >> > separate series.
> >>
> >> Perhaps we should hold off on applying patch 6 and 7 until after the
> >> hole-punch fix is ready?
> >
> > Sure, I'm cool with holding off on patch 7 (the Kconfig change) until after
> > the hole punch fix is ready.
> >
> > I don't see a reason to hold off on patch 6, though?  It stands on its own,
> > implements the correct locking, and doesn't break anything.
> 
> Whoops, I just meant 7.

Well, it looks like the hole punch case is much improved since I tested it
last!  :)  I used to be able to generate a few different kernel BUGs when hole
punching DAX PMDs, but those have apparently been fixed in the mm layer since
I was last testing, which admittedly was quite a long time ago (February?).

The only issue I was able to find with DAX PMD hole punching was that ext4
wasn't properly doing a writeback before the hole was unmapped and the radix
tree entries were removed.  This issue applies equally to the 4k case, so I've
submitted a bug fix for v4.8:

https://lists.01.org/pipermail/linux-nvdimm/2016-August/006621.html

With that applied, I don't know of any more issues related to DAX PMDs and
hole punch.  I've tested ext4 and XFS (ext2 doesn't support hole punch), and
they both properly do a writeback of all affected PMDs, fully unmap all
affected PMDs, and remove the radix tree entries.  I've tested that new page
faults for addresses previously covered by the old PMDs generate new page
faults, and 4k pages are now faulted in because the block allocator no longer
has 2MiB contiguous allocations.

One question (probably for Jan): should the above ext4 fix be marked for
stable?

Thanks,
- Ross

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 0/7] re-enable DAX PMD support
  2016-08-17 16:21         ` Ross Zwisler
@ 2016-08-17 17:21           ` Jan Kara
  -1 siblings, 0 replies; 72+ messages in thread
From: Jan Kara @ 2016-08-17 17:21 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Theodore Ts'o, linux-nvdimm, Dave Chinner, linux-kernel,
	Linux MM, Andreas Dilger, Alexander Viro, Jan Kara,
	linux-fsdevel, Jan Kara, linux-ext4, Andrew Morton

On Wed 17-08-16 10:21:24, Ross Zwisler wrote:
> On Mon, Aug 15, 2016 at 02:14:14PM -0700, Dan Williams wrote:
> > On Mon, Aug 15, 2016 at 2:11 PM, Ross Zwisler
> > <ross.zwisler@linux.intel.com> wrote:
> > > On Mon, Aug 15, 2016 at 01:21:47PM -0700, Dan Williams wrote:
> > >> On Mon, Aug 15, 2016 at 12:09 PM, Ross Zwisler
> > >> <ross.zwisler@linux.intel.com> wrote:
> > >> > DAX PMDs have been disabled since Jan Kara introduced DAX radix tree based
> > >> > locking.  This series allows DAX PMDs to participate in the DAX radix tree
> > >> > based locking scheme so that they can be re-enabled.
> > >>
> > >> Looks good to me.
> > >>
> > >> > This series restores DAX PMD functionality back to what it was before it
> > >> > was disabled.  There is still a known issue between DAX PMDs and hole
> > >> > punch, which I am currently working on and which I plan to address with a
> > >> > separate series.
> > >>
> > >> Perhaps we should hold off on applying patch 6 and 7 until after the
> > >> hole-punch fix is ready?
> > >
> > > Sure, I'm cool with holding off on patch 7 (the Kconfig change) until after
> > > the hole punch fix is ready.
> > >
> > > I don't see a reason to hold off on patch 6, though?  It stands on its own,
> > > implements the correct locking, and doesn't break anything.
> > 
> > Whoops, I just meant 7.
> 
> Well, it looks like the hole punch case is much improved since I tested it
> last!  :)  I used to be able to generate a few different kernel BUGs when hole
> punching DAX PMDs, but those have apparently been fixed in the mm layer since
> I was last testing, which admittedly was quite a long time ago (February?).
> 
> The only issue I was able to find with DAX PMD hole punching was that ext4
> wasn't properly doing a writeback before the hole was unmapped and the radix
> tree entries were removed.  This issue applies equally to the 4k case, so I've
> submitted a bug fix for v4.8:
> 
> https://lists.01.org/pipermail/linux-nvdimm/2016-August/006621.html
> 
> With that applied, I don't know of any more issues related to DAX PMDs and
> hole punch.  I've tested ext4 and XFS (ext2 doesn't support hole punch), and
> they both properly do a writeback of all affected PMDs, fully unmap all
> affected PMDs, and remove the radix tree entries.  I've also verified that
> accesses to addresses previously covered by the old PMDs generate new page
> faults, and that 4k pages are now faulted in because the block allocator no
> longer has 2MiB contiguous allocations.
> 
> One question (probably for Jan): should the above ext4 fix be marked for
> stable?

Yes, probably it should be.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 5/7] dax: lock based on slot instead of [mapping, index]
  2016-08-16  9:28     ` Jan Kara
@ 2016-08-17 20:25       ` Ross Zwisler
  -1 siblings, 0 replies; 72+ messages in thread
From: Ross Zwisler @ 2016-08-17 20:25 UTC (permalink / raw)
  To: Jan Kara
  Cc: Ross Zwisler, linux-kernel, Theodore Ts'o, Alexander Viro,
	Andreas Dilger, Andrew Morton, Dan Williams, Dave Chinner,
	Jan Kara, linux-ext4, linux-fsdevel, linux-mm, linux-nvdimm

On Tue, Aug 16, 2016 at 11:28:16AM +0200, Jan Kara wrote:
> On Mon 15-08-16 13:09:16, Ross Zwisler wrote:
> > DAX radix tree locking currently locks entries based on the unique
> > combination of the 'mapping' pointer and the pgoff_t 'index' for the entry.
> > This works for PTEs, but as we move to PMDs we will need to have all the
> > offsets within the range covered by the PMD to map to the same bit lock.
> > To accomplish this, lock based on the 'slot' pointer in the radix tree
> > instead of [mapping, index].
> 
> I'm not convinced this is safe. What makes the slot pointer still valid
> after you drop tree_lock? At least radix_tree_shrink() or
> radix_tree_expand() could move your slot without letting the waiter know,
> and the waiter would never be woken.
> 
> 								Honza

Yep, you're right, thanks for catching that.

Given that we can't rely on 'slot' being stable, my next idea is to use a
combination of [mapping, index], but tweak 'index' so that it's always the
beginning of the entry.  So for 4k entries we'd leave it alone, but for 2MiB
entries we'd mask it down to the appropriate 2MiB boundary.

Let me hack on that for a bit, unless you've a better idea.
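
As a rough standalone sketch of that masking idea (not the eventual patch;
the helper name and the PAGE_SHIFT/PMD_SHIFT values below are assumptions,
matching x86-64 with 4k pages):

#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT  12
#define PMD_SHIFT   21
#define PG_PMD_NR   (1UL << (PMD_SHIFT - PAGE_SHIFT))  /* 512 pages per 2MiB PMD */

/* Hypothetical helper: return the index of the first page covered by the
 * entry, so every offset inside a PMD entry maps to one [mapping, index] key. */
static uint64_t dax_entry_start_index(uint64_t index, int pmd_entry)
{
    return pmd_entry ? index & ~(uint64_t)(PG_PMD_NR - 1) : index;
}

int main(void)
{
    printf("%llu\n", (unsigned long long)dax_entry_start_index(1300, 1)); /* 1024 */
    printf("%llu\n", (unsigned long long)dax_entry_start_index(1300, 0)); /* 1300 */
    return 0;
}

4k entries keep their index; PMD entries are rounded down to the 2MiB-aligned
index, so waiters and wakers always agree on the key even though the slot
itself may move.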

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 5/7] dax: lock based on slot instead of [mapping, index]
  2016-08-17 20:25       ` Ross Zwisler
@ 2016-08-18 14:15         ` Jan Kara
  -1 siblings, 0 replies; 72+ messages in thread
From: Jan Kara @ 2016-08-18 14:15 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Theodore Ts'o, linux-nvdimm, Dave Chinner, linux-kernel,
	linux-mm, Andreas Dilger, Alexander Viro, Jan Kara,
	linux-fsdevel, Jan Kara, Andrew Morton, linux-ext4

On Wed 17-08-16 14:25:56, Ross Zwisler wrote:
> On Tue, Aug 16, 2016 at 11:28:16AM +0200, Jan Kara wrote:
> > On Mon 15-08-16 13:09:16, Ross Zwisler wrote:
> > > DAX radix tree locking currently locks entries based on the unique
> > > combination of the 'mapping' pointer and the pgoff_t 'index' for the entry.
> > > This works for PTEs, but as we move to PMDs we will need to have all the
> > > offsets within the range covered by the PMD to map to the same bit lock.
> > > To accomplish this, lock based on the 'slot' pointer in the radix tree
> > > instead of [mapping, index].
> > 
> > I'm not convinced this is safe. What makes the slot pointer still valid
> > after you drop tree_lock? At least radix_tree_shrink() or
> > radix_tree_expand() could move your slot without letting the waiter know,
> > and the waiter would never be woken.
> > 
> > 								Honza
> 
> Yep, you're right, thanks for catching that.
> 
> Given that we can't rely on 'slot' being stable, my next idea is to use a
> combination of [mapping, index], but tweak 'index' so that it's always the
> beginning of the entry.  So for 4k entries we'd leave it alone, but for 2MiB
> entries we'd mask it down to the appropriate 2MiB boundary.
> 
> Let me hack on that for a bit, unless you've a better idea.

No, that's what I'd do as well.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 72+ messages in thread

end of thread, other threads:[~2016-08-18 14:16 UTC | newest]

Thread overview: 72+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-08-15 19:09 [PATCH 0/7] re-enable DAX PMD support Ross Zwisler
2016-08-15 19:09 ` Ross Zwisler
2016-08-15 19:09 ` Ross Zwisler
2016-08-15 19:09 ` Ross Zwisler
2016-08-15 19:09 ` [PATCH 1/7] ext2: tell DAX the size of allocation holes Ross Zwisler
2016-08-15 19:09   ` Ross Zwisler
2016-08-15 19:09   ` Ross Zwisler
2016-08-15 19:09   ` Ross Zwisler
2016-08-16  9:10   ` Jan Kara
2016-08-16  9:10     ` Jan Kara
2016-08-16  9:10     ` Jan Kara
2016-08-16 22:52     ` Ross Zwisler
2016-08-16 22:52       ` Ross Zwisler
2016-08-16 22:52       ` Ross Zwisler
2016-08-16 22:52       ` Ross Zwisler
2016-08-15 19:09 ` [PATCH 2/7] ext4: " Ross Zwisler
2016-08-15 19:09   ` Ross Zwisler
2016-08-15 19:09   ` Ross Zwisler
2016-08-15 19:09   ` Ross Zwisler
2016-08-16  9:12   ` Jan Kara
2016-08-16  9:12     ` Jan Kara
2016-08-16  9:12     ` Jan Kara
2016-08-15 19:09 ` [PATCH 3/7] dax: remove buffer_size_valid() Ross Zwisler
2016-08-15 19:09   ` Ross Zwisler
2016-08-15 19:09   ` Ross Zwisler
2016-08-15 19:09   ` Ross Zwisler
2016-08-16  9:13   ` Jan Kara
2016-08-16  9:13     ` Jan Kara
2016-08-15 19:09 ` [PATCH 4/7] dax: rename 'ret' to 'entry' in grab_mapping_entry Ross Zwisler
2016-08-15 19:09   ` Ross Zwisler
2016-08-15 19:09   ` Ross Zwisler
2016-08-15 19:09   ` Ross Zwisler
2016-08-16  9:14   ` Jan Kara
2016-08-16  9:14     ` Jan Kara
2016-08-15 19:09 ` [PATCH 5/7] dax: lock based on slot instead of [mapping, index] Ross Zwisler
2016-08-15 19:09   ` Ross Zwisler
2016-08-15 19:09   ` Ross Zwisler
2016-08-15 19:09   ` Ross Zwisler
2016-08-16  9:28   ` Jan Kara
2016-08-16  9:28     ` Jan Kara
2016-08-16  9:28     ` Jan Kara
2016-08-17 20:25     ` Ross Zwisler
2016-08-17 20:25       ` Ross Zwisler
2016-08-17 20:25       ` Ross Zwisler
2016-08-18 14:15       ` Jan Kara
2016-08-18 14:15         ` Jan Kara
2016-08-18 14:15         ` Jan Kara
2016-08-18 14:15         ` Jan Kara
2016-08-15 19:09 ` [PATCH 6/7] dax: re-enable DAX PMD support Ross Zwisler
2016-08-15 19:09   ` Ross Zwisler
2016-08-15 19:09   ` Ross Zwisler
2016-08-15 19:09   ` Ross Zwisler
2016-08-15 19:09 ` [PATCH 7/7] dax: remove "depends on BROKEN" from FS_DAX_PMD Ross Zwisler
2016-08-15 19:09   ` Ross Zwisler
2016-08-15 19:09   ` Ross Zwisler
2016-08-15 19:09   ` Ross Zwisler
2016-08-15 20:21 ` [PATCH 0/7] re-enable DAX PMD support Dan Williams
2016-08-15 20:21   ` Dan Williams
2016-08-15 20:21   ` Dan Williams
2016-08-15 20:21   ` Dan Williams
2016-08-15 21:11   ` Ross Zwisler
2016-08-15 21:11     ` Ross Zwisler
2016-08-15 21:11     ` Ross Zwisler
2016-08-15 21:14     ` Dan Williams
2016-08-15 21:14       ` Dan Williams
2016-08-17 16:21       ` Ross Zwisler
2016-08-17 16:21         ` Ross Zwisler
2016-08-17 16:21         ` Ross Zwisler
2016-08-17 17:21         ` Jan Kara
2016-08-17 17:21           ` Jan Kara
2016-08-17 17:21           ` Jan Kara
2016-08-17 17:21           ` Jan Kara
