linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v2 0/6] Miscellaneous DAX patches, take 2
@ 2015-07-03 14:40 Matthew Wilcox
  2015-07-03 14:40 ` [PATCH v2 1/6] dax: Add block size note to documentation Matthew Wilcox
                   ` (5 more replies)
  0 siblings, 6 replies; 15+ messages in thread
From: Matthew Wilcox @ 2015-07-03 14:40 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, Alexander Viro; +Cc: Matthew Wilcox, willy

A miscellaneous set of patches to improve DAX.  All are independent of
each other.

I dropped the DAX mmap support for block devices; I started trying
to put in a sysfs toggle and ran into some difficulties, so just drop
that part of the patch for now.  There's no problem (that I know of)
with the read()/write() portion of the patch.

I fixed up the conflict between Dave Chinner's changes to ext4 and mine.

I forgot the 'bdev_direct_access() may sleep' patch in the previous
submission.

Matthew Wilcox (6):
  dax: Add block size note to documentation
  dax: Use copy_from_iter_nocache
  ext4: Use ext4_get_block_write() for DAX
  vfs: Allow truncate, chomd and chown to be interrupted by fatal
    signals
  block: Add support for DAX reads/writes to block devices
  dax: bdev_direct_access() may sleep

 Documentation/filesystems/dax.txt |  6 ++++--
 fs/block_dev.c                    | 10 ++++++++++
 fs/dax.c                          |  8 +++++---
 fs/ext4/file.c                    |  8 ++++----
 fs/open.c                         |  9 ++++++---
 5 files changed, 29 insertions(+), 12 deletions(-)

-- 
2.1.4


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH v2 1/6] dax: Add block size note to documentation
  2015-07-03 14:40 [PATCH v2 0/6] Miscellaneous DAX patches, take 2 Matthew Wilcox
@ 2015-07-03 14:40 ` Matthew Wilcox
  2015-07-04  5:03   ` Christoph Hellwig
  2015-07-03 14:40 ` [PATCH v2 2/6] dax: Use copy_from_iter_nocache Matthew Wilcox
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 15+ messages in thread
From: Matthew Wilcox @ 2015-07-03 14:40 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, Alexander Viro; +Cc: Matthew Wilcox

From: Matthew Wilcox <willy@linux.intel.com>

For block devices which are small enough, mkfs will default to creating
a filesystem with block sizes smaller than page size.

Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
---
 Documentation/filesystems/dax.txt | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/Documentation/filesystems/dax.txt b/Documentation/filesystems/dax.txt
index baf4111..7af2851 100644
--- a/Documentation/filesystems/dax.txt
+++ b/Documentation/filesystems/dax.txt
@@ -18,8 +18,10 @@ Usage
 -----
 
 If you have a block device which supports DAX, you can make a filesystem
-on it as usual.  When mounting it, use the -o dax option manually
-or add 'dax' to the options in /etc/fstab.
+on it as usual.  The DAX code currently only supports files with a block
+size equal to your kernel's PAGE_SIZE, so you may need to specify a block
+size when creating the filesystem.  When mounting it, use the "-o dax"
+option on the command line or add 'dax' to the options in /etc/fstab.
 
 
 Implementation Tips for Block Driver Writers
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v2 2/6] dax: Use copy_from_iter_nocache
  2015-07-03 14:40 [PATCH v2 0/6] Miscellaneous DAX patches, take 2 Matthew Wilcox
  2015-07-03 14:40 ` [PATCH v2 1/6] dax: Add block size note to documentation Matthew Wilcox
@ 2015-07-03 14:40 ` Matthew Wilcox
  2015-07-05 13:11   ` Boaz Harrosh
  2015-07-03 14:40 ` [PATCH v2 3/6] ext4: Use ext4_get_block_write() for DAX Matthew Wilcox
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 15+ messages in thread
From: Matthew Wilcox @ 2015-07-03 14:40 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, Alexander Viro; +Cc: Matthew Wilcox

From: Matthew Wilcox <willy@linux.intel.com>

When userspace does a write, there's no need for the written data to
pollute the CPU cache.  This matches the original XIP code.

Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
---
 fs/dax.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/dax.c b/fs/dax.c
index 99b5fbc..eaa9e06 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -155,7 +155,7 @@ static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
 		}
 
 		if (iov_iter_rw(iter) == WRITE)
-			len = copy_from_iter(addr, max - pos, iter);
+			len = copy_from_iter_nocache(addr, max - pos, iter);
 		else if (!hole)
 			len = copy_to_iter(addr, max - pos, iter);
 		else
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v2 3/6] ext4: Use ext4_get_block_write() for DAX
  2015-07-03 14:40 [PATCH v2 0/6] Miscellaneous DAX patches, take 2 Matthew Wilcox
  2015-07-03 14:40 ` [PATCH v2 1/6] dax: Add block size note to documentation Matthew Wilcox
  2015-07-03 14:40 ` [PATCH v2 2/6] dax: Use copy_from_iter_nocache Matthew Wilcox
@ 2015-07-03 14:40 ` Matthew Wilcox
  2015-07-03 18:30   ` Theodore Ts'o
  2015-07-03 14:40 ` [PATCH v2 4/6] vfs: Allow truncate, chomd and chown to be interrupted by fatal signals Matthew Wilcox
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 15+ messages in thread
From: Matthew Wilcox @ 2015-07-03 14:40 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, Alexander Viro
  Cc: Matthew Wilcox, Theodore Ts'o, Andreas Dilger, linux-ext4

From: Matthew Wilcox <willy@linux.intel.com>

DAX relies on the get_block function either zeroing newly allocated blocks
before they're findable by subsequent calls to get_block, or marking newly
allocated blocks as unwritten.  ext4_get_block() cannot create unwritten
extents, but ext4_get_block_write() can.

Reported-by: Andy Rudoff <andy.rudoff@intel.com>
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
---
 fs/ext4/file.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index bc313ac..e5bdcb7 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -195,7 +195,7 @@ out:
 static void ext4_end_io_unwritten(struct buffer_head *bh, int uptodate)
 {
 	struct inode *inode = bh->b_assoc_map->host;
-	/* XXX: breaks on 32-bit > 16GB. Is that even supported? */
+	/* XXX: breaks on 32-bit > 16TB. Is that even supported? */
 	loff_t offset = (loff_t)(uintptr_t)bh->b_private << inode->i_blkbits;
 	int err;
 	if (!uptodate)
@@ -206,13 +206,13 @@ static void ext4_end_io_unwritten(struct buffer_head *bh, int uptodate)
 
 static int ext4_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 {
-	return dax_fault(vma, vmf, ext4_get_block, ext4_end_io_unwritten);
-					/* Is this the right get_block? */
+	return dax_fault(vma, vmf, ext4_get_block_write, ext4_end_io_unwritten);
 }
 
 static int ext4_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 {
-	return dax_mkwrite(vma, vmf, ext4_get_block, ext4_end_io_unwritten);
+	return dax_mkwrite(vma, vmf, ext4_get_block_write,
+				ext4_end_io_unwritten);
 }
 
 static const struct vm_operations_struct ext4_dax_vm_ops = {
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v2 4/6] vfs: Allow truncate, chomd and chown to be interrupted by fatal signals
  2015-07-03 14:40 [PATCH v2 0/6] Miscellaneous DAX patches, take 2 Matthew Wilcox
                   ` (2 preceding siblings ...)
  2015-07-03 14:40 ` [PATCH v2 3/6] ext4: Use ext4_get_block_write() for DAX Matthew Wilcox
@ 2015-07-03 14:40 ` Matthew Wilcox
  2015-07-03 14:40 ` [PATCH v2 5/6] block: Add support for DAX reads/writes to block devices Matthew Wilcox
  2015-07-03 14:40 ` [PATCH v2 6/6] dax: bdev_direct_access() may sleep Matthew Wilcox
  5 siblings, 0 replies; 15+ messages in thread
From: Matthew Wilcox @ 2015-07-03 14:40 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, Alexander Viro; +Cc: Matthew Wilcox, willy

If another task dies while holding the i_mutex, tasks trying to truncate,
chmod or chown the file will hang.  Allowing a fatal interrupt to kill
the task is beneficial for the system administrator trying to get the
machine to shut down more cleanly.

Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
---
 fs/open.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/fs/open.c b/fs/open.c
index e0250bd..4b0061e 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -56,7 +56,8 @@ int do_truncate(struct dentry *dentry, loff_t length, unsigned int time_attrs,
 	if (ret)
 		newattrs.ia_valid |= ret | ATTR_FORCE;
 
-	mutex_lock(&dentry->d_inode->i_mutex);
+	if (mutex_lock_killable(&dentry->d_inode->i_mutex))
+		return -EINTR;
 	/* Note any delegations or leases have already been broken: */
 	ret = notify_change(dentry, &newattrs, NULL);
 	mutex_unlock(&dentry->d_inode->i_mutex);
@@ -508,7 +509,8 @@ static int chmod_common(struct path *path, umode_t mode)
 	if (error)
 		return error;
 retry_deleg:
-	mutex_lock(&inode->i_mutex);
+	if (mutex_lock_killable(&inode->i_mutex))
+		return -EINTR;
 	error = security_path_chmod(path, mode);
 	if (error)
 		goto out_unlock;
@@ -591,7 +593,8 @@ retry_deleg:
 	if (!S_ISDIR(inode->i_mode))
 		newattrs.ia_valid |=
 			ATTR_KILL_SUID | ATTR_KILL_SGID | ATTR_KILL_PRIV;
-	mutex_lock(&inode->i_mutex);
+	if (mutex_lock_killable(&inode->i_mutex))
+		return -EINTR;
 	error = security_path_chown(path, uid, gid);
 	if (!error)
 		error = notify_change(path->dentry, &newattrs, &delegated_inode);
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v2 5/6] block: Add support for DAX reads/writes to block devices
  2015-07-03 14:40 [PATCH v2 0/6] Miscellaneous DAX patches, take 2 Matthew Wilcox
                   ` (3 preceding siblings ...)
  2015-07-03 14:40 ` [PATCH v2 4/6] vfs: Allow truncate, chomd and chown to be interrupted by fatal signals Matthew Wilcox
@ 2015-07-03 14:40 ` Matthew Wilcox
  2015-07-05 13:47   ` Boaz Harrosh
  2015-07-03 14:40 ` [PATCH v2 6/6] dax: bdev_direct_access() may sleep Matthew Wilcox
  5 siblings, 1 reply; 15+ messages in thread
From: Matthew Wilcox @ 2015-07-03 14:40 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, Alexander Viro; +Cc: Matthew Wilcox, willy

If a block device supports the ->direct_access methods, bypass the normal
DIO path and use DAX to go straight to memcpy() instead of allocating
a DIO and a BIO.

Includes support for the DIO_SKIP_DIO_COUNT flag in DAX, as is done in
do_blockdev_direct_IO().

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
---
 fs/block_dev.c | 4 ++++
 fs/dax.c       | 6 ++++--
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 4fe10f9..0bb2993 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -152,6 +152,9 @@ blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, loff_t offset)
 	struct file *file = iocb->ki_filp;
 	struct inode *inode = file->f_mapping->host;
 
+	if (IS_DAX(inode))
+		return dax_do_io(iocb, inode, iter, offset, blkdev_get_block,
+				NULL, DIO_SKIP_DIO_COUNT);
 	return __blockdev_direct_IO(iocb, inode, I_BDEV(inode), iter, offset,
 				    blkdev_get_block, NULL, NULL,
 				    DIO_SKIP_DIO_COUNT);
@@ -1170,6 +1173,7 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
 		bdev->bd_disk = disk;
 		bdev->bd_queue = disk->queue;
 		bdev->bd_contains = bdev;
+		bdev->bd_inode->i_flags = disk->fops->direct_access ? S_DAX : 0;
 		if (!partno) {
 			ret = -ENXIO;
 			bdev->bd_part = disk_get_part(disk, partno);
diff --git a/fs/dax.c b/fs/dax.c
index eaa9e06..c3e21cc 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -209,7 +209,8 @@ ssize_t dax_do_io(struct kiocb *iocb, struct inode *inode,
 	}
 
 	/* Protects against truncate */
-	inode_dio_begin(inode);
+	if (!(flags & DIO_SKIP_DIO_COUNT))
+		inode_dio_begin(inode);
 
 	retval = dax_io(inode, iter, pos, end, get_block, &bh);
 
@@ -219,7 +220,8 @@ ssize_t dax_do_io(struct kiocb *iocb, struct inode *inode,
 	if ((retval > 0) && end_io)
 		end_io(iocb, pos, retval, bh.b_private);
 
-	inode_dio_end(inode);
+	if (!(flags & DIO_SKIP_DIO_COUNT))
+		inode_dio_end(inode);
  out:
 	return retval;
 }
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v2 6/6] dax: bdev_direct_access() may sleep
  2015-07-03 14:40 [PATCH v2 0/6] Miscellaneous DAX patches, take 2 Matthew Wilcox
                   ` (4 preceding siblings ...)
  2015-07-03 14:40 ` [PATCH v2 5/6] block: Add support for DAX reads/writes to block devices Matthew Wilcox
@ 2015-07-03 14:40 ` Matthew Wilcox
  5 siblings, 0 replies; 15+ messages in thread
From: Matthew Wilcox @ 2015-07-03 14:40 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, Alexander Viro; +Cc: Matthew Wilcox, willy

The brd driver is the only in-tree driver that may sleep currently.
After some discussion on linux-fsdevel, we decided that any driver
may choose to sleep in its ->direct_access method.  To ensure that all
callers of bdev_direct_access() are prepared for this, add a call
to might_sleep().

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
---
 fs/block_dev.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 0bb2993..1982437 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -446,6 +446,12 @@ long bdev_direct_access(struct block_device *bdev, sector_t sector,
 	long avail;
 	const struct block_device_operations *ops = bdev->bd_disk->fops;
 
+	/*
+	 * The device driver is allowed to sleep, in order to make the
+	 * memory directly accessible.
+	 */
+	might_sleep();
+
 	if (size < 0)
 		return size;
 	if (!ops->direct_access)
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 3/6] ext4: Use ext4_get_block_write() for DAX
  2015-07-03 14:40 ` [PATCH v2 3/6] ext4: Use ext4_get_block_write() for DAX Matthew Wilcox
@ 2015-07-03 18:30   ` Theodore Ts'o
  2015-07-03 18:48     ` Matthew Wilcox
  0 siblings, 1 reply; 15+ messages in thread
From: Theodore Ts'o @ 2015-07-03 18:30 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-fsdevel, linux-kernel, Alexander Viro, Matthew Wilcox,
	Andreas Dilger, linux-ext4

On Fri, Jul 03, 2015 at 10:40:40AM -0400, Matthew Wilcox wrote:
> From: Matthew Wilcox <willy@linux.intel.com>
> 
> DAX relies on the get_block function either zeroing newly allocated blocks
> before they're findable by subsequent calls to get_block, or marking newly
> allocated blocks as unwritten.  ext4_get_block() cannot create unwritten
> extents, but ext4_get_block_write() can.

To be clear, this patch has no prerequistes or dependencies, right?
That is, it would be fine if I take this through the ext4 git tree?
Or is there a reason or a preference for carrying this patch
somewhere else?

Also, is there a way I can test the DAX functionality in ext4 using
KVM?  If so, can you give me a cheat sheet about how I can do that?

Thanks,

							- Ted

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 3/6] ext4: Use ext4_get_block_write() for DAX
  2015-07-03 18:30   ` Theodore Ts'o
@ 2015-07-03 18:48     ` Matthew Wilcox
  2015-07-03 19:07       ` Theodore Ts'o
  0 siblings, 1 reply; 15+ messages in thread
From: Matthew Wilcox @ 2015-07-03 18:48 UTC (permalink / raw)
  To: Theodore Ts'o, Matthew Wilcox, linux-fsdevel, linux-kernel,
	Alexander Viro, Andreas Dilger, linux-ext4

On Fri, Jul 03, 2015 at 02:30:27PM -0400, Theodore Ts'o wrote:
> On Fri, Jul 03, 2015 at 10:40:40AM -0400, Matthew Wilcox wrote:
> > From: Matthew Wilcox <willy@linux.intel.com>
> > 
> > DAX relies on the get_block function either zeroing newly allocated blocks
> > before they're findable by subsequent calls to get_block, or marking newly
> > allocated blocks as unwritten.  ext4_get_block() cannot create unwritten
> > extents, but ext4_get_block_write() can.
> 
> To be clear, this patch has no prerequistes or dependencies, right?
> That is, it would be fine if I take this through the ext4 git tree?
> Or is there a reason or a preference for carrying this patch
> somewhere else?

Right, no dependencies or prerequisites, completely independent of all
the other patches.

> Also, is there a way I can test the DAX functionality in ext4 using
> KVM?  If so, can you give me a cheat sheet about how I can do that?

I don't use KVM, but I can tell you what I do ... (additional explanation
added, not for Ted's benefit, but because people less familiar with
Linux than Ted is may happen upon this email for their own purposes).

In /etc/default/grub, I have this line:

GRUB_CMDLINE_LINUX="memmap=4G!4G"

        memmap=nn[KMG]!ss[KMG]
                        [KNL,X86] Mark specific memory as protected.
                        Region of memory to be used, from ss to ss+nn.
                        The memory region may be marked as e820 type 12 (0xc)
                        and is NVDIMM or ADR memory.


In my kernel config, I have:

CONFIG_X86_PMEM_LEGACY=y
CONFIG_LIBNVDIMM=y
CONFIG_BLK_DEV_PMEM=m

At boot, I "modprobe pmem".  On the desktop-class system I'm using as
my development machine, the BIOS doesn't clear RAM between boots (only
power cycles), so the partition table and ext4 filesystem stays good,
and all I have to do is:

mount -odax /dev/pmem0p1 /mnt/ram0/

Also my xfstests local.config:

TEST_DEV=/dev/pmem0p1
TEST_DIR=/mnt/ram0
SCRATCH_DEV=/dev/pmem0p2
SCRATCH_MNT=/mnt/ram1
TEST_FS_MOUNT_OPTS="-o dax"
EXT_MOUNT_OPTIONS="-o dax"
MKFS_OPTIONS="-b4096"

Hope I didn't forget anything.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 3/6] ext4: Use ext4_get_block_write() for DAX
  2015-07-03 18:48     ` Matthew Wilcox
@ 2015-07-03 19:07       ` Theodore Ts'o
  2015-07-05 13:29         ` Boaz Harrosh
  0 siblings, 1 reply; 15+ messages in thread
From: Theodore Ts'o @ 2015-07-03 19:07 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Matthew Wilcox, linux-fsdevel, linux-kernel, Alexander Viro,
	Andreas Dilger, linux-ext4

On Fri, Jul 03, 2015 at 02:48:24PM -0400, Matthew Wilcox wrote:
> 
> At boot, I "modprobe pmem".

Is there a reason why it's important to build and load pmem as a
module?  If I use CONFIG_BLK_DEV_PMEM=y (which is more convenient
given how I launch my KVM test appliance), should I expect any
problems?

I assume that this won't detect any bugs caused by missing CLFLUSH
instructions, but I assume that when using NVM as a block device, this
isn't much of an issue, as long as we don't care about torn writes?
(How using NVM with metdata checksums, or any checksums for that
matter, seems to be an interesting question --- how do we recover from
a checksum failure after a power failure?)

					- Ted

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 1/6] dax: Add block size note to documentation
  2015-07-03 14:40 ` [PATCH v2 1/6] dax: Add block size note to documentation Matthew Wilcox
@ 2015-07-04  5:03   ` Christoph Hellwig
  2015-07-05  8:43     ` Boaz Harrosh
  0 siblings, 1 reply; 15+ messages in thread
From: Christoph Hellwig @ 2015-07-04  5:03 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-fsdevel, linux-kernel, Alexander Viro, Matthew Wilcox

On Fri, Jul 03, 2015 at 10:40:38AM -0400, Matthew Wilcox wrote:
> From: Matthew Wilcox <willy@linux.intel.com>
> 
> For block devices which are small enough, mkfs will default to creating
> a filesystem with block sizes smaller than page size.

This seems like an odd note.  At least for XFS and btrfs filesystem size
don't change block sizes.  ff this is an extN oddity you're probably
better off documenting it there.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 1/6] dax: Add block size note to documentation
  2015-07-04  5:03   ` Christoph Hellwig
@ 2015-07-05  8:43     ` Boaz Harrosh
  0 siblings, 0 replies; 15+ messages in thread
From: Boaz Harrosh @ 2015-07-05  8:43 UTC (permalink / raw)
  To: Christoph Hellwig, Matthew Wilcox
  Cc: linux-fsdevel, linux-kernel, Alexander Viro, Matthew Wilcox

On 07/04/2015 08:03 AM, Christoph Hellwig wrote:
> On Fri, Jul 03, 2015 at 10:40:38AM -0400, Matthew Wilcox wrote:
>> From: Matthew Wilcox <willy@linux.intel.com>
>>
>> For block devices which are small enough, mkfs will default to creating
>> a filesystem with block sizes smaller than page size.
> 
> This seems like an odd note.  At least for XFS and btrfs filesystem size
> don't change block sizes.  ff this is an extN oddity you're probably
> better off documenting it there.
> 

Is why I added to brd:
	/* This is so fdisk will align partitions on 4k, because of
	 * direct_access API needing 4k alignment, returning a PFN
	 * (This is only a problem on very small devices <= 4M,
	 *  otherwise fdisk will align on 1M. Regardless this call
	 *  is harmless)
	 */
	blk_queue_physical_block_size(brd->brd_queue, PAGE_SIZE);

Perhaps add the same to pmem, and/or perhaps make mkfs.extX
also inspect this and not create blocks smaller then physical_block_size?

Thanks
Boaz


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 2/6] dax: Use copy_from_iter_nocache
  2015-07-03 14:40 ` [PATCH v2 2/6] dax: Use copy_from_iter_nocache Matthew Wilcox
@ 2015-07-05 13:11   ` Boaz Harrosh
  0 siblings, 0 replies; 15+ messages in thread
From: Boaz Harrosh @ 2015-07-05 13:11 UTC (permalink / raw)
  To: Matthew Wilcox, linux-fsdevel, linux-kernel, Alexander Viro
  Cc: Matthew Wilcox

On 07/03/2015 05:40 PM, Matthew Wilcox wrote:
> From: Matthew Wilcox <willy@linux.intel.com>
> 
> When userspace does a write, there's no need for the written data to
> pollute the CPU cache.  This matches the original XIP code.
> 
> Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
> ---
>  fs/dax.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 99b5fbc..eaa9e06 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -155,7 +155,7 @@ static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
>  		}
>  
>  		if (iov_iter_rw(iter) == WRITE)
> -			len = copy_from_iter(addr, max - pos, iter);
> +			len = copy_from_iter_nocache(addr, max - pos, iter);
>  		else if (!hole)
>  			len = copy_to_iter(addr, max - pos, iter);
>  		else
> 

With the current ioremap_nocache at pmem none of this matters for pmem.

For brd yes, so We've been conducting some measurements and regular ext4
(no DAX) benchmark gives 6-16% increase in performance with this above.
And DAX is almost x2 then no DAX.
Is why the network guys been using this for a long time. So I'd say this
is a good default for any page-cache writes. (Think about it it makes sense,
we will 95% of the time flush these to real memory before DMA)

For pmem with any sane cached mapping (We use page-stuct-pmem actually)
Then DAX, for it to actually work (persist) with pmem, needs this:

static size_t copy_from_iter_nt(void *addr, size_t bytes, struct iov_iter *ii)
{
	size_t ret = copy_from_iter_nocache(addr, bytes, ii);

	if (unlikely((ii->type & ITER_BVEC) || (ii->type & ITER_KVEC))) {
		/* FIXME: copy_from_iter_nocache did regular copy for Kernel
		 * buffers (BVEC or KVEC). Before we fix it do cl_flush
		 * for now.
		 */
		cl_flush(addr, bytes, false);
	} else {
		/* copy_from_iter_nocache only persists in 8-byte aligned words.
		 * Lets persist remaining unaligned edges.
		 */
		if (unlikely((ulong)addr & 0x7))
			cl_flush(addr, 1, false);
		if (unlikely((ulong)(addr + bytes) & 0x7))
			cl_flush((addr + bytes), 1, false);
	}

	return ret;
}

This is based on an not-in-kernel cl_flush().

The first part FIXME above could be fixed with Dan's memcpy_persistent() patches

Cheers
Boaz


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 3/6] ext4: Use ext4_get_block_write() for DAX
  2015-07-03 19:07       ` Theodore Ts'o
@ 2015-07-05 13:29         ` Boaz Harrosh
  0 siblings, 0 replies; 15+ messages in thread
From: Boaz Harrosh @ 2015-07-05 13:29 UTC (permalink / raw)
  To: Theodore Ts'o, Matthew Wilcox, Matthew Wilcox, linux-fsdevel,
	linux-kernel, Alexander Viro, Andreas Dilger, linux-ext4

On 07/03/2015 10:07 PM, Theodore Ts'o wrote:
> On Fri, Jul 03, 2015 at 02:48:24PM -0400, Matthew Wilcox wrote:
>>
>> At boot, I "modprobe pmem".
> 
> Is there a reason why it's important to build and load pmem as a
> module?  If I use CONFIG_BLK_DEV_PMEM=y (which is more convenient
> given how I launch my KVM test appliance), should I expect any
> problems?
> 

This (=y) should work fine. We use it a lot. (with KVM even boot an
image with -dax, with a trick)

Note that DAX need not be tested with pmem only, you can always use brd
at any given point without any reboot.

One more trick for xfstest I use:
	memmap=2G!4G,2G!6G

And have two pmem0/1 and don't need to bother with any fdisk. Do need
to mkfs every boot though.

BTW: with kvm a reboot with above memmap will give you back the exact
same memory. halt and "virsh start" is a different story.

> I assume that this won't detect any bugs caused by missing CLFLUSH
> instructions, but I assume that when using NVM as a block device, this
> isn't much of an issue, as long as we don't care about torn writes?
> (How using NVM with metdata checksums, or any checksums for that
> matter, seems to be an interesting question --- how do we recover from
> a checksum failure after a power failure?)
> 

Currently pmem maps a very-(very) slow ioremap_nocache. So any Kernel
memory access should be pmem persistent. For a real world faster ioremap,
there are few major pieces still missing in the stack to make it
persistent. Note the even today with  ioremap_nocache, any application
mmap (like git) is not persistent.

> 					- Ted

Cheers
Boaz


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 5/6] block: Add support for DAX reads/writes to block devices
  2015-07-03 14:40 ` [PATCH v2 5/6] block: Add support for DAX reads/writes to block devices Matthew Wilcox
@ 2015-07-05 13:47   ` Boaz Harrosh
  0 siblings, 0 replies; 15+ messages in thread
From: Boaz Harrosh @ 2015-07-05 13:47 UTC (permalink / raw)
  To: Matthew Wilcox, linux-fsdevel, linux-kernel, Alexander Viro; +Cc: willy

On 07/03/2015 05:40 PM, Matthew Wilcox wrote:
> If a block device supports the ->direct_access methods, bypass the normal
> DIO path and use DAX to go straight to memcpy() instead of allocating
> a DIO and a BIO.
> 

I can't remember the details but I'm not sure it is safe for mmap to go through
page-cache while DAX is bypassing page-cache. (Because of the pg_mkwrite thing,
while IO), do you find this code safe?

I think you need to force all DAX IO through dax.c. I did not understand why we
need to give the user an option? why would he ever want a cached access to a memory
device (ie two copies of the same thing). Why not DAX hard coded?

> Includes support for the DIO_SKIP_DIO_COUNT flag in DAX, as is done in
> do_blockdev_direct_IO().
> 
> Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
> ---
>  fs/block_dev.c | 4 ++++
>  fs/dax.c       | 6 ++++--
>  2 files changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 4fe10f9..0bb2993 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -152,6 +152,9 @@ blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, loff_t offset)
>  	struct file *file = iocb->ki_filp;
>  	struct inode *inode = file->f_mapping->host;
>  
> +	if (IS_DAX(inode))
> +		return dax_do_io(iocb, inode, iter, offset, blkdev_get_block,
> +				NULL, DIO_SKIP_DIO_COUNT);
>  	return __blockdev_direct_IO(iocb, inode, I_BDEV(inode), iter, offset,
>  				    blkdev_get_block, NULL, NULL,
>  				    DIO_SKIP_DIO_COUNT);
> @@ -1170,6 +1173,7 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
>  		bdev->bd_disk = disk;
>  		bdev->bd_queue = disk->queue;
>  		bdev->bd_contains = bdev;
> +		bdev->bd_inode->i_flags = disk->fops->direct_access ? S_DAX : 0;
>  		if (!partno) {
>  			ret = -ENXIO;
>  			bdev->bd_part = disk_get_part(disk, partno);
> diff --git a/fs/dax.c b/fs/dax.c
> index eaa9e06..c3e21cc 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -209,7 +209,8 @@ ssize_t dax_do_io(struct kiocb *iocb, struct inode *inode,
>  	}
>  
>  	/* Protects against truncate */
> -	inode_dio_begin(inode);
> +	if (!(flags & DIO_SKIP_DIO_COUNT))
> +		inode_dio_begin(inode);

Is really a separate issue, is it not?

>  
>  	retval = dax_io(inode, iter, pos, end, get_block, &bh);
>  
> @@ -219,7 +220,8 @@ ssize_t dax_do_io(struct kiocb *iocb, struct inode *inode,
>  	if ((retval > 0) && end_io)
>  		end_io(iocb, pos, retval, bh.b_private);
>  
> -	inode_dio_end(inode);
> +	if (!(flags & DIO_SKIP_DIO_COUNT))
> +		inode_dio_end(inode);
>   out:
>  	return retval;
>  }
> 

This scares me, the mix of DAX while cached data. I'd vote for hard-coded DAX
all over.

Boaz


^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2015-07-05 13:47 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-07-03 14:40 [PATCH v2 0/6] Miscellaneous DAX patches, take 2 Matthew Wilcox
2015-07-03 14:40 ` [PATCH v2 1/6] dax: Add block size note to documentation Matthew Wilcox
2015-07-04  5:03   ` Christoph Hellwig
2015-07-05  8:43     ` Boaz Harrosh
2015-07-03 14:40 ` [PATCH v2 2/6] dax: Use copy_from_iter_nocache Matthew Wilcox
2015-07-05 13:11   ` Boaz Harrosh
2015-07-03 14:40 ` [PATCH v2 3/6] ext4: Use ext4_get_block_write() for DAX Matthew Wilcox
2015-07-03 18:30   ` Theodore Ts'o
2015-07-03 18:48     ` Matthew Wilcox
2015-07-03 19:07       ` Theodore Ts'o
2015-07-05 13:29         ` Boaz Harrosh
2015-07-03 14:40 ` [PATCH v2 4/6] vfs: Allow truncate, chomd and chown to be interrupted by fatal signals Matthew Wilcox
2015-07-03 14:40 ` [PATCH v2 5/6] block: Add support for DAX reads/writes to block devices Matthew Wilcox
2015-07-05 13:47   ` Boaz Harrosh
2015-07-03 14:40 ` [PATCH v2 6/6] dax: bdev_direct_access() may sleep Matthew Wilcox

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).