* [PATCH v10 00/21] Support ext4 on NV-DIMMs
@ 2014-08-27  3:45 Matthew Wilcox
  2014-08-27  3:45 ` [PATCH v10 01/21] axonram: Fix bug in direct_access Matthew Wilcox
                   ` (23 more replies)
  0 siblings, 24 replies; 52+ messages in thread
From: Matthew Wilcox @ 2014-08-27  3:45 UTC (permalink / raw)
  To: linux-fsdevel, linux-mm, linux-kernel; +Cc: Matthew Wilcox, willy

One of the primary uses for NV-DIMMs is to expose them as a block device
and use a filesystem to store files on the NV-DIMM.  While that works,
it currently wastes memory and CPU time buffering the files in the page
cache.  We have support in ext2 for bypassing the page cache, but it
has some races which are unfixable in the current design.  This series
of patches rewrites the underlying support and adds support for direct
access to ext4.

Note that patch 6/21 has been included in
https://git.kernel.org/cgit/linux/kernel/git/viro/vfs.git/log/?h=for-next-candidate

This iteration of the patchset rebases to 3.17-rc2, changes the page fault
locking, fixes a couple of bugs and makes a few other minor changes.

 - Move the calculation of the maximum size available at the requested
   location from the ->direct_access implementations to bdev_direct_access()
 - Fix a comment typo (Ross Zwisler)
 - Check that the requested length is positive in bdev_direct_access().  If
   it is not, assume that it's an errno, and just return it.
 - Fix some whitespace issues flagged by checkpatch
 - Added the Acked-by responses from Kirill that I forgot in the last round
 - Added myself to MAINTAINERS for DAX
 - Fixed compilation with !CONFIG_DAX (Vishal Verma)
 - Revert the locking in the page fault handler back to an earlier version.
   If we hit the race that we were trying to protect against, we will leave
   blocks allocated past the end of the file.  They will be removed on file
   removal, the next truncate, or fsck.


Matthew Wilcox (20):
  axonram: Fix bug in direct_access
  Change direct_access calling convention
  Fix XIP fault vs truncate race
  Allow page fault handlers to perform the COW
  Introduce IS_DAX(inode)
  Add copy_to_iter(), copy_from_iter() and iov_iter_zero()
  Replace XIP read and write with DAX I/O
  Replace ext2_clear_xip_target with dax_clear_blocks
  Replace the XIP page fault handler with the DAX page fault handler
  Replace xip_truncate_page with dax_truncate_page
  Replace XIP documentation with DAX documentation
  Remove get_xip_mem
  ext2: Remove ext2_xip_verify_sb()
  ext2: Remove ext2_use_xip
  ext2: Remove xip.c and xip.h
  Remove CONFIG_EXT2_FS_XIP and rename CONFIG_FS_XIP to CONFIG_FS_DAX
  ext2: Remove ext2_aops_xip
  Get rid of most mentions of XIP in ext2
  xip: Add xip_zero_page_range
  brd: Rename XIP to DAX

Ross Zwisler (1):
  ext4: Add DAX functionality

 Documentation/filesystems/Locking  |   3 -
 Documentation/filesystems/dax.txt  |  91 +++++++
 Documentation/filesystems/ext4.txt |   2 +
 Documentation/filesystems/xip.txt  |  68 -----
 MAINTAINERS                        |   6 +
 arch/powerpc/sysdev/axonram.c      |  19 +-
 drivers/block/Kconfig              |  13 +-
 drivers/block/brd.c                |  26 +-
 drivers/s390/block/dcssblk.c       |  21 +-
 fs/Kconfig                         |  21 +-
 fs/Makefile                        |   1 +
 fs/block_dev.c                     |  40 +++
 fs/dax.c                           | 497 +++++++++++++++++++++++++++++++++++++
 fs/exofs/inode.c                   |   1 -
 fs/ext2/Kconfig                    |  11 -
 fs/ext2/Makefile                   |   1 -
 fs/ext2/ext2.h                     |  10 +-
 fs/ext2/file.c                     |  45 +++-
 fs/ext2/inode.c                    |  38 +--
 fs/ext2/namei.c                    |  13 +-
 fs/ext2/super.c                    |  53 ++--
 fs/ext2/xip.c                      |  91 -------
 fs/ext2/xip.h                      |  26 --
 fs/ext4/ext4.h                     |   6 +
 fs/ext4/file.c                     |  49 +++-
 fs/ext4/indirect.c                 |  18 +-
 fs/ext4/inode.c                    |  51 ++--
 fs/ext4/namei.c                    |  10 +-
 fs/ext4/super.c                    |  39 ++-
 fs/open.c                          |   5 +-
 include/linux/blkdev.h             |   6 +-
 include/linux/fs.h                 |  49 +++-
 include/linux/mm.h                 |   1 +
 include/linux/uio.h                |   3 +
 mm/Makefile                        |   1 -
 mm/fadvise.c                       |   6 +-
 mm/filemap.c                       |   6 +-
 mm/filemap_xip.c                   | 483 -----------------------------------
 mm/iov_iter.c                      | 237 ++++++++++++++++--
 mm/madvise.c                       |   2 +-
 mm/memory.c                        |  33 ++-
 41 files changed, 1229 insertions(+), 873 deletions(-)
 create mode 100644 Documentation/filesystems/dax.txt
 delete mode 100644 Documentation/filesystems/xip.txt
 create mode 100644 fs/dax.c
 delete mode 100644 fs/ext2/xip.c
 delete mode 100644 fs/ext2/xip.h
 delete mode 100644 mm/filemap_xip.c

-- 
2.0.0



* [PATCH v10 01/21] axonram: Fix bug in direct_access
  2014-08-27  3:45 [PATCH v10 00/21] Support ext4 on NV-DIMMs Matthew Wilcox
@ 2014-08-27  3:45 ` Matthew Wilcox
  2014-08-27  3:45 ` [PATCH v10 02/21] Change direct_access calling convention Matthew Wilcox
                   ` (22 subsequent siblings)
  23 siblings, 0 replies; 52+ messages in thread
From: Matthew Wilcox @ 2014-08-27  3:45 UTC (permalink / raw)
  To: linux-fsdevel, linux-mm, linux-kernel; +Cc: Matthew Wilcox, willy

The 'pfn' returned by axonram's direct_access() has been completely bogus
since 2008: virt_to_phys() was applied to the 'kaddr' pointer argument
itself rather than to the address stored in it.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 arch/powerpc/sysdev/axonram.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/sysdev/axonram.c b/arch/powerpc/sysdev/axonram.c
index 47b6b9f..830edc8 100644
--- a/arch/powerpc/sysdev/axonram.c
+++ b/arch/powerpc/sysdev/axonram.c
@@ -156,7 +156,7 @@ axon_ram_direct_access(struct block_device *device, sector_t sector,
 	}
 
 	*kaddr = (void *)(bank->ph_addr + offset);
-	*pfn = virt_to_phys(kaddr) >> PAGE_SHIFT;
+	*pfn = virt_to_phys(*kaddr) >> PAGE_SHIFT;
 
 	return 0;
 }
-- 
2.0.0



* [PATCH v10 02/21] Change direct_access calling convention
  2014-08-27  3:45 [PATCH v10 00/21] Support ext4 on NV-DIMMs Matthew Wilcox
  2014-08-27  3:45 ` [PATCH v10 01/21] axonram: Fix bug in direct_access Matthew Wilcox
@ 2014-08-27  3:45 ` Matthew Wilcox
  2014-08-27  3:45 ` [PATCH v10 03/21] Fix XIP fault vs truncate race Matthew Wilcox
                   ` (21 subsequent siblings)
  23 siblings, 0 replies; 52+ messages in thread
From: Matthew Wilcox @ 2014-08-27  3:45 UTC (permalink / raw)
  To: linux-fsdevel, linux-mm, linux-kernel; +Cc: Matthew Wilcox, willy

In order to support accesses to larger chunks of memory, pass in a
'size' parameter (counted in bytes), and return the amount available at
that address.

Add a new helper function, bdev_direct_access(), to handle the common
functionality: adjusting for the partition start, checking that the
requested length is positive, checking that the sector is page-aligned,
and checking that the request does not extend past the end of the
partition.
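
Purely for illustration (not part of this patch), a caller of the new
helper might look like the sketch below.  The function name is made up;
the bdev_direct_access() signature is the one added to fs/block_dev.c by
this patch.

static long example_zero_sector(struct block_device *bdev, sector_t sector)
{
	void *addr;
	unsigned long pfn;
	long avail;

	/* Ask for up to one page of directly-addressable memory */
	avail = bdev_direct_access(bdev, sector, &addr, &pfn, PAGE_SIZE);
	if (avail < 0)
		return avail;	/* negative errno from the helper */

	/* 'avail' bytes at 'addr' are usable without ioremap()/kmap() */
	memset(addr, 0, avail);
	return avail;
}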

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Boaz Harrosh <boaz@plexistor.com>
---
 Documentation/filesystems/xip.txt | 15 +++++++++------
 arch/powerpc/sysdev/axonram.c     | 17 ++++-------------
 drivers/block/brd.c               | 12 +++++-------
 drivers/s390/block/dcssblk.c      | 21 +++++++++-----------
 fs/block_dev.c                    | 40 +++++++++++++++++++++++++++++++++++++++
 fs/ext2/xip.c                     | 31 +++++++++++++-----------------
 include/linux/blkdev.h            |  6 ++++--
 7 files changed, 84 insertions(+), 58 deletions(-)

diff --git a/Documentation/filesystems/xip.txt b/Documentation/filesystems/xip.txt
index 0466ee5..b774729 100644
--- a/Documentation/filesystems/xip.txt
+++ b/Documentation/filesystems/xip.txt
@@ -28,12 +28,15 @@ Implementation
 Execute-in-place is implemented in three steps: block device operation,
 address space operation, and file operations.
 
-A block device operation named direct_access is used to retrieve a
-reference (pointer) to a block on-disk. The reference is supposed to be
-cpu-addressable, physical address and remain valid until the release operation
-is performed. A struct block_device reference is used to address the device,
-and a sector_t argument is used to identify the individual block. As an
-alternative, memory technology devices can be used for this.
+A block device operation named direct_access is used to translate the
+block device sector number to a page frame number (pfn) that identifies
+the physical page for the memory.  It also returns a kernel virtual
+address that can be used to access the memory.
+
+The direct_access method takes a 'size' parameter that indicates the
+number of bytes being requested.  The function should return the number
+of bytes that can be contiguously accessed at that offset.  It may also
+return a negative errno if an error occurs.
 
 The block device operation is optional, these block devices support it as of
 today:
diff --git a/arch/powerpc/sysdev/axonram.c b/arch/powerpc/sysdev/axonram.c
index 830edc8..8709b9f 100644
--- a/arch/powerpc/sysdev/axonram.c
+++ b/arch/powerpc/sysdev/axonram.c
@@ -139,26 +139,17 @@ axon_ram_make_request(struct request_queue *queue, struct bio *bio)
  * axon_ram_direct_access - direct_access() method for block device
  * @device, @sector, @data: see block_device_operations method
  */
-static int
+static long
 axon_ram_direct_access(struct block_device *device, sector_t sector,
-		       void **kaddr, unsigned long *pfn)
+		       void **kaddr, unsigned long *pfn, long size)
 {
 	struct axon_ram_bank *bank = device->bd_disk->private_data;
-	loff_t offset;
-
-	offset = sector;
-	if (device->bd_part != NULL)
-		offset += device->bd_part->start_sect;
-	offset <<= AXON_RAM_SECTOR_SHIFT;
-	if (offset >= bank->size) {
-		dev_err(&bank->device->dev, "Access outside of address space\n");
-		return -ERANGE;
-	}
+	loff_t offset = (loff_t)sector << AXON_RAM_SECTOR_SHIFT;
 
 	*kaddr = (void *)(bank->ph_addr + offset);
 	*pfn = virt_to_phys(*kaddr) >> PAGE_SHIFT;
 
-	return 0;
+	return bank->size - offset;
 }
 
 static const struct block_device_operations axon_ram_devops = {
diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index c7d138e..fee10bf 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -370,25 +370,23 @@ static int brd_rw_page(struct block_device *bdev, sector_t sector,
 }
 
 #ifdef CONFIG_BLK_DEV_XIP
-static int brd_direct_access(struct block_device *bdev, sector_t sector,
-			void **kaddr, unsigned long *pfn)
+static long brd_direct_access(struct block_device *bdev, sector_t sector,
+			void **kaddr, unsigned long *pfn, long size)
 {
 	struct brd_device *brd = bdev->bd_disk->private_data;
 	struct page *page;
 
 	if (!brd)
 		return -ENODEV;
-	if (sector & (PAGE_SECTORS-1))
-		return -EINVAL;
-	if (sector + PAGE_SECTORS > get_capacity(bdev->bd_disk))
-		return -ERANGE;
 	page = brd_insert_page(brd, sector);
 	if (!page)
 		return -ENOSPC;
 	*kaddr = page_address(page);
 	*pfn = page_to_pfn(page);
 
-	return 0;
+	/* If size > PAGE_SIZE, we could look to see if the next page in the
+	 * file happens to be mapped to the next page of physical RAM */
+	return PAGE_SIZE;
 }
 #endif
 
diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
index 0f47175..96bc411 100644
--- a/drivers/s390/block/dcssblk.c
+++ b/drivers/s390/block/dcssblk.c
@@ -28,8 +28,8 @@
 static int dcssblk_open(struct block_device *bdev, fmode_t mode);
 static void dcssblk_release(struct gendisk *disk, fmode_t mode);
 static void dcssblk_make_request(struct request_queue *q, struct bio *bio);
-static int dcssblk_direct_access(struct block_device *bdev, sector_t secnum,
-				 void **kaddr, unsigned long *pfn);
+static long dcssblk_direct_access(struct block_device *bdev, sector_t secnum,
+				 void **kaddr, unsigned long *pfn, long size);
 
 static char dcssblk_segments[DCSSBLK_PARM_LEN] = "\0";
 
@@ -866,25 +866,22 @@ fail:
 	bio_io_error(bio);
 }
 
-static int
+static long
 dcssblk_direct_access (struct block_device *bdev, sector_t secnum,
-			void **kaddr, unsigned long *pfn)
+			void **kaddr, unsigned long *pfn, long size)
 {
 	struct dcssblk_dev_info *dev_info;
-	unsigned long pgoff;
+	unsigned long offset, dev_sz;
 
 	dev_info = bdev->bd_disk->private_data;
 	if (!dev_info)
 		return -ENODEV;
-	if (secnum % (PAGE_SIZE/512))
-		return -EINVAL;
-	pgoff = secnum / (PAGE_SIZE / 512);
-	if ((pgoff+1)*PAGE_SIZE-1 > dev_info->end - dev_info->start)
-		return -ERANGE;
-	*kaddr = (void *) (dev_info->start+pgoff*PAGE_SIZE);
+	dev_sz = dev_info->end - dev_info->start;
+	offset = secnum * 512;
+	*kaddr = (void *) (dev_info->start + offset);
 	*pfn = virt_to_phys(*kaddr) >> PAGE_SHIFT;
 
-	return 0;
+	return dev_sz - offset;
 }
 
 static void
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 6d72746..ffe0761 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -427,6 +427,46 @@ int bdev_write_page(struct block_device *bdev, sector_t sector,
 }
 EXPORT_SYMBOL_GPL(bdev_write_page);
 
+/**
+ * bdev_direct_access() - Get the address for directly-accessible memory
+ * @bdev: The device containing the memory
+ * @sector: The offset within the device
+ * @addr: Where to put the address of the memory
+ * @pfn: The Page Frame Number for the memory
+ * @size: The number of bytes requested
+ *
+ * If a block device is made up of directly addressable memory, this function
+ * will tell the caller the PFN and the address of the memory.  The address
+ * may be directly dereferenced within the kernel without the need to call
+ * ioremap(), kmap() or similar.  The PFN is suitable for inserting into
+ * page tables.
+ *
+ * Return: negative errno if an error occurs, otherwise the number of bytes
+ * accessible at this address.
+ */
+long bdev_direct_access(struct block_device *bdev, sector_t sector,
+			void **addr, unsigned long *pfn, long size)
+{
+	long avail;
+	const struct block_device_operations *ops = bdev->bd_disk->fops;
+
+	if (size < 0)
+		return size;
+	if (!ops->direct_access)
+		return -EOPNOTSUPP;
+	if ((sector + DIV_ROUND_UP(size, 512)) >
+					part_nr_sects_read(bdev->bd_part))
+		return -ERANGE;
+	sector += get_start_sect(bdev);
+	if (sector % (PAGE_SIZE / 512))
+		return -EINVAL;
+	avail = ops->direct_access(bdev, sector, addr, pfn, size);
+	if (!avail)
+		return -ERANGE;
+	return min(avail, size);
+}
+EXPORT_SYMBOL_GPL(bdev_direct_access);
+
 /*
  * pseudo-fs
  */
diff --git a/fs/ext2/xip.c b/fs/ext2/xip.c
index e98171a..bbc5fec 100644
--- a/fs/ext2/xip.c
+++ b/fs/ext2/xip.c
@@ -13,18 +13,12 @@
 #include "ext2.h"
 #include "xip.h"
 
-static inline int
-__inode_direct_access(struct inode *inode, sector_t block,
-		      void **kaddr, unsigned long *pfn)
+static inline long __inode_direct_access(struct inode *inode, sector_t block,
+				void **kaddr, unsigned long *pfn, long size)
 {
 	struct block_device *bdev = inode->i_sb->s_bdev;
-	const struct block_device_operations *ops = bdev->bd_disk->fops;
-	sector_t sector;
-
-	sector = block * (PAGE_SIZE / 512); /* ext2 block to bdev sector */
-
-	BUG_ON(!ops->direct_access);
-	return ops->direct_access(bdev, sector, kaddr, pfn);
+	sector_t sector = block * (PAGE_SIZE / 512);
+	return bdev_direct_access(bdev, sector, kaddr, pfn, size);
 }
 
 static inline int
@@ -53,12 +47,13 @@ ext2_clear_xip_target(struct inode *inode, sector_t block)
 {
 	void *kaddr;
 	unsigned long pfn;
-	int rc;
+	long size;
 
-	rc = __inode_direct_access(inode, block, &kaddr, &pfn);
-	if (!rc)
-		clear_page(kaddr);
-	return rc;
+	size = __inode_direct_access(inode, block, &kaddr, &pfn, PAGE_SIZE);
+	if (size < 0)
+		return size;
+	clear_page(kaddr);
+	return 0;
 }
 
 void ext2_xip_verify_sb(struct super_block *sb)
@@ -77,7 +72,7 @@ void ext2_xip_verify_sb(struct super_block *sb)
 int ext2_get_xip_mem(struct address_space *mapping, pgoff_t pgoff, int create,
 				void **kmem, unsigned long *pfn)
 {
-	int rc;
+	long rc;
 	sector_t block;
 
 	/* first, retrieve the sector number */
@@ -86,6 +81,6 @@ int ext2_get_xip_mem(struct address_space *mapping, pgoff_t pgoff, int create,
 		return rc;
 
 	/* retrieve address of the target data */
-	rc = __inode_direct_access(mapping->host, block, kmem, pfn);
-	return rc;
+	rc = __inode_direct_access(mapping->host, block, kmem, pfn, PAGE_SIZE);
+	return (rc < 0) ? rc : 0;
 }
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 518b465..ac25166 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1615,8 +1615,8 @@ struct block_device_operations {
 	int (*rw_page)(struct block_device *, sector_t, struct page *, int rw);
 	int (*ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
 	int (*compat_ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
-	int (*direct_access) (struct block_device *, sector_t,
-						void **, unsigned long *);
+	long (*direct_access)(struct block_device *, sector_t,
+					void **, unsigned long *pfn, long size);
 	unsigned int (*check_events) (struct gendisk *disk,
 				      unsigned int clearing);
 	/* ->media_changed() is DEPRECATED, use ->check_events() instead */
@@ -1634,6 +1634,8 @@ extern int __blkdev_driver_ioctl(struct block_device *, fmode_t, unsigned int,
 extern int bdev_read_page(struct block_device *, sector_t, struct page *);
 extern int bdev_write_page(struct block_device *, sector_t, struct page *,
 						struct writeback_control *);
+extern long bdev_direct_access(struct block_device *, sector_t, void **addr,
+						unsigned long *pfn, long size);
 #else /* CONFIG_BLOCK */
 
 struct block_device;
-- 
2.0.0



* [PATCH v10 03/21] Fix XIP fault vs truncate race
  2014-08-27  3:45 [PATCH v10 00/21] Support ext4 on NV-DIMMs Matthew Wilcox
  2014-08-27  3:45 ` [PATCH v10 01/21] axonram: Fix bug in direct_access Matthew Wilcox
  2014-08-27  3:45 ` [PATCH v10 02/21] Change direct_access calling convention Matthew Wilcox
@ 2014-08-27  3:45 ` Matthew Wilcox
  2014-08-27  3:45 ` [PATCH v10 04/21] Allow page fault handlers to perform the COW Matthew Wilcox
                   ` (20 subsequent siblings)
  23 siblings, 0 replies; 52+ messages in thread
From: Matthew Wilcox @ 2014-08-27  3:45 UTC (permalink / raw)
  To: linux-fsdevel, linux-mm, linux-kernel; +Cc: Matthew Wilcox, willy

Pagecache faults recheck i_size after taking the page lock to ensure that
the fault didn't race against a truncate.  We don't have a page to lock
in the XIP case, so use the i_mmap_mutex instead.  It is locked in the
truncate path in unmap_mapping_range() after updating i_size.  So while
we hold it in the fault path, we are guaranteed that either i_size has
already been updated in the truncate path, or that the truncate will
subsequently call zap_page_range_single() and so remove the mapping we
have just inserted.

There is a window of time in which i_size has been reduced and the
thread has a mapping to a page which will be removed from the file,
but this is harmless as the page will not be allocated to a different
purpose before the thread's access to it is revoked.
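
For clarity, here is the fault-side pattern from the diff below pulled out
into one illustrative helper.  The patch itself open-codes this in both
branches of the fault handler, and the error handling here is simplified:

static int xip_insert_pfn_checked(struct vm_area_struct *vma,
				  struct vm_fault *vmf, unsigned long pfn)
{
	struct address_space *mapping = vma->vm_file->f_mapping;
	struct inode *inode = mapping->host;
	pgoff_t size;
	int err;

	/* Recheck i_size under i_mmap_mutex: truncate updates i_size
	 * before taking this mutex in unmap_mapping_range(). */
	mutex_lock(&mapping->i_mmap_mutex);
	size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
	if (unlikely(vmf->pgoff >= size)) {
		mutex_unlock(&mapping->i_mmap_mutex);
		return VM_FAULT_SIGBUS;
	}
	err = vm_insert_mixed(vma, (unsigned long)vmf->virtual_address, pfn);
	mutex_unlock(&mapping->i_mmap_mutex);

	return (err == -ENOMEM) ? VM_FAULT_OOM : VM_FAULT_NOPAGE;
}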

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/filemap_xip.c | 24 ++++++++++++++++++++++--
 1 file changed, 22 insertions(+), 2 deletions(-)

diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
index d8d9fe3..c8d23e9 100644
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -260,8 +260,17 @@ again:
 		__xip_unmap(mapping, vmf->pgoff);
 
 found:
+		/* We must recheck i_size under i_mmap_mutex */
+		mutex_lock(&mapping->i_mmap_mutex);
+		size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >>
+							PAGE_CACHE_SHIFT;
+		if (unlikely(vmf->pgoff >= size)) {
+			mutex_unlock(&mapping->i_mmap_mutex);
+			return VM_FAULT_SIGBUS;
+		}
 		err = vm_insert_mixed(vma, (unsigned long)vmf->virtual_address,
 							xip_pfn);
+		mutex_unlock(&mapping->i_mmap_mutex);
 		if (err == -ENOMEM)
 			return VM_FAULT_OOM;
 		/*
@@ -285,16 +294,27 @@ found:
 		}
 		if (error != -ENODATA)
 			goto out;
+
+		/* We must recheck i_size under i_mmap_mutex */
+		mutex_lock(&mapping->i_mmap_mutex);
+		size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >>
+							PAGE_CACHE_SHIFT;
+		if (unlikely(vmf->pgoff >= size)) {
+			ret = VM_FAULT_SIGBUS;
+			goto unlock;
+		}
 		/* not shared and writable, use xip_sparse_page() */
 		page = xip_sparse_page();
 		if (!page)
-			goto out;
+			goto unlock;
 		err = vm_insert_page(vma, (unsigned long)vmf->virtual_address,
 							page);
 		if (err == -ENOMEM)
-			goto out;
+			goto unlock;
 
 		ret = VM_FAULT_NOPAGE;
+unlock:
+		mutex_unlock(&mapping->i_mmap_mutex);
 out:
 		write_seqcount_end(&xip_sparse_seq);
 		mutex_unlock(&xip_sparse_mutex);
-- 
2.0.0



* [PATCH v10 04/21] Allow page fault handlers to perform the COW
  2014-08-27  3:45 [PATCH v10 00/21] Support ext4 on NV-DIMMs Matthew Wilcox
                   ` (2 preceding siblings ...)
  2014-08-27  3:45 ` [PATCH v10 03/21] Fix XIP fault vs truncate race Matthew Wilcox
@ 2014-08-27  3:45 ` Matthew Wilcox
  2014-08-27  3:45 ` [PATCH v10 05/21] Introduce IS_DAX(inode) Matthew Wilcox
                   ` (19 subsequent siblings)
  23 siblings, 0 replies; 52+ messages in thread
From: Matthew Wilcox @ 2014-08-27  3:45 UTC (permalink / raw)
  To: linux-fsdevel, linux-mm, linux-kernel; +Cc: Matthew Wilcox, willy

Currently COW of an XIP file is done by first bringing in a read-only
mapping, then retrying the fault and copying the page.  It is much more
efficient to tell the fault handler that a COW is being attempted (by
passing in the pre-allocated page in the vm_fault structure), and allow
the handler to perform the COW operation itself.

The handler cannot insert the page itself if there is already a read-only
mapping at that address, so allow the handler to return VM_FAULT_LOCKED
and set vmf->page to NULL.  This indicates to the MM code that
the i_mmap_mutex is held instead of the page lock.
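
A hypothetical ->fault handler using the new cow_page field might look
like the sketch below.  It is purely illustrative: the real consumer of
cow_page is the DAX fault handler added later in this series, and only
the hole (zero-fill) case is shown.

static int example_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
	struct address_space *mapping = vma->vm_file->f_mapping;

	if (vmf->cow_page) {
		/* The core pre-allocated the COW destination for us.
		 * Fill it directly; a real handler would copy the file
		 * data here when the block is allocated. */
		clear_user_highpage(vmf->cow_page,
				(unsigned long)vmf->virtual_address);

		/* No page to lock: returning VM_FAULT_LOCKED with
		 * vmf->page == NULL tells the core that i_mmap_mutex
		 * is held instead of a page lock. */
		vmf->page = NULL;
		mutex_lock(&mapping->i_mmap_mutex);
		return VM_FAULT_LOCKED;
	}

	/* ... normal read-fault handling ... */
	return VM_FAULT_SIGBUS;
}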

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/mm.h |  1 +
 mm/memory.c        | 33 ++++++++++++++++++++++++---------
 2 files changed, 25 insertions(+), 9 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8981cc8..0a47817 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -208,6 +208,7 @@ struct vm_fault {
 	pgoff_t pgoff;			/* Logical page offset based on vma */
 	void __user *virtual_address;	/* Faulting virtual address */
 
+	struct page *cow_page;		/* Handler may choose to COW */
 	struct page *page;		/* ->fault handlers should return a
 					 * page here, unless VM_FAULT_NOPAGE
 					 * is set (which is also implied by
diff --git a/mm/memory.c b/mm/memory.c
index adeac30..3368785 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2000,6 +2000,7 @@ static int do_page_mkwrite(struct vm_area_struct *vma, struct page *page,
 	vmf.pgoff = page->index;
 	vmf.flags = FAULT_FLAG_WRITE|FAULT_FLAG_MKWRITE;
 	vmf.page = page;
+	vmf.cow_page = NULL;
 
 	ret = vma->vm_ops->page_mkwrite(vma, &vmf);
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE)))
@@ -2698,7 +2699,8 @@ oom:
  * See filemap_fault() and __lock_page_retry().
  */
 static int __do_fault(struct vm_area_struct *vma, unsigned long address,
-		pgoff_t pgoff, unsigned int flags, struct page **page)
+			pgoff_t pgoff, unsigned int flags,
+			struct page *cow_page, struct page **page)
 {
 	struct vm_fault vmf;
 	int ret;
@@ -2707,10 +2709,13 @@ static int __do_fault(struct vm_area_struct *vma, unsigned long address,
 	vmf.pgoff = pgoff;
 	vmf.flags = flags;
 	vmf.page = NULL;
+	vmf.cow_page = cow_page;
 
 	ret = vma->vm_ops->fault(vma, &vmf);
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
 		return ret;
+	if (!vmf.page)
+		goto out;
 
 	if (unlikely(PageHWPoison(vmf.page))) {
 		if (ret & VM_FAULT_LOCKED)
@@ -2724,6 +2729,7 @@ static int __do_fault(struct vm_area_struct *vma, unsigned long address,
 	else
 		VM_BUG_ON_PAGE(!PageLocked(vmf.page), vmf.page);
 
+ out:
 	*page = vmf.page;
 	return ret;
 }
@@ -2897,7 +2903,7 @@ static int do_read_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		pte_unmap_unlock(pte, ptl);
 	}
 
-	ret = __do_fault(vma, address, pgoff, flags, &fault_page);
+	ret = __do_fault(vma, address, pgoff, flags, NULL, &fault_page);
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
 		return ret;
 
@@ -2937,26 +2943,35 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		return VM_FAULT_OOM;
 	}
 
-	ret = __do_fault(vma, address, pgoff, flags, &fault_page);
+	ret = __do_fault(vma, address, pgoff, flags, new_page, &fault_page);
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
 		goto uncharge_out;
 
-	copy_user_highpage(new_page, fault_page, address, vma);
+	if (fault_page)
+		copy_user_highpage(new_page, fault_page, address, vma);
 	__SetPageUptodate(new_page);
 
 	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
 	if (unlikely(!pte_same(*pte, orig_pte))) {
 		pte_unmap_unlock(pte, ptl);
-		unlock_page(fault_page);
-		page_cache_release(fault_page);
+		if (fault_page) {
+			unlock_page(fault_page);
+			page_cache_release(fault_page);
+		} else {
+			mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
+		}
 		goto uncharge_out;
 	}
 	do_set_pte(vma, address, new_page, pte, true, true);
 	mem_cgroup_commit_charge(new_page, memcg, false);
 	lru_cache_add_active_or_unevictable(new_page, vma);
 	pte_unmap_unlock(pte, ptl);
-	unlock_page(fault_page);
-	page_cache_release(fault_page);
+	if (fault_page) {
+		unlock_page(fault_page);
+		page_cache_release(fault_page);
+	} else {
+		mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
+	}
 	return ret;
 uncharge_out:
 	mem_cgroup_cancel_charge(new_page, memcg);
@@ -2975,7 +2990,7 @@ static int do_shared_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	int dirtied = 0;
 	int ret, tmp;
 
-	ret = __do_fault(vma, address, pgoff, flags, &fault_page);
+	ret = __do_fault(vma, address, pgoff, flags, NULL, &fault_page);
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
 		return ret;
 
-- 
2.0.0



* [PATCH v10 05/21] Introduce IS_DAX(inode)
  2014-08-27  3:45 [PATCH v10 00/21] Support ext4 on NV-DIMMs Matthew Wilcox
                   ` (3 preceding siblings ...)
  2014-08-27  3:45 ` [PATCH v10 04/21] Allow page fault handlers to perform the COW Matthew Wilcox
@ 2014-08-27  3:45 ` Matthew Wilcox
  2014-08-27  3:45 ` [PATCH v10 06/21] Add copy_to_iter(), copy_from_iter() and iov_iter_zero() Matthew Wilcox
                   ` (18 subsequent siblings)
  23 siblings, 0 replies; 52+ messages in thread
From: Matthew Wilcox @ 2014-08-27  3:45 UTC (permalink / raw)
  To: linux-fsdevel, linux-mm, linux-kernel; +Cc: Matthew Wilcox, willy

Use an inode flag to tag inodes which should avoid using the page cache.
Convert ext2 to use it instead of mapping_is_xip().

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/ext2/inode.c    | 9 ++++++---
 fs/ext2/xip.h      | 2 --
 include/linux/fs.h | 6 ++++++
 3 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 36d35c3..0cb0448 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -731,7 +731,7 @@ static int ext2_get_blocks(struct inode *inode,
 		goto cleanup;
 	}
 
-	if (ext2_use_xip(inode->i_sb)) {
+	if (IS_DAX(inode)) {
 		/*
 		 * we need to clear the block
 		 */
@@ -1201,7 +1201,7 @@ static int ext2_setsize(struct inode *inode, loff_t newsize)
 
 	inode_dio_wait(inode);
 
-	if (mapping_is_xip(inode->i_mapping))
+	if (IS_DAX(inode))
 		error = xip_truncate_page(inode->i_mapping, newsize);
 	else if (test_opt(inode->i_sb, NOBH))
 		error = nobh_truncate_page(inode->i_mapping,
@@ -1273,7 +1273,8 @@ void ext2_set_inode_flags(struct inode *inode)
 {
 	unsigned int flags = EXT2_I(inode)->i_flags;
 
-	inode->i_flags &= ~(S_SYNC|S_APPEND|S_IMMUTABLE|S_NOATIME|S_DIRSYNC);
+	inode->i_flags &= ~(S_SYNC | S_APPEND | S_IMMUTABLE | S_NOATIME |
+				S_DIRSYNC | S_DAX);
 	if (flags & EXT2_SYNC_FL)
 		inode->i_flags |= S_SYNC;
 	if (flags & EXT2_APPEND_FL)
@@ -1284,6 +1285,8 @@ void ext2_set_inode_flags(struct inode *inode)
 		inode->i_flags |= S_NOATIME;
 	if (flags & EXT2_DIRSYNC_FL)
 		inode->i_flags |= S_DIRSYNC;
+	if (test_opt(inode->i_sb, XIP))
+		inode->i_flags |= S_DAX;
 }
 
 /* Propagate flags from i_flags to EXT2_I(inode)->i_flags */
diff --git a/fs/ext2/xip.h b/fs/ext2/xip.h
index 18b34d2..29be737 100644
--- a/fs/ext2/xip.h
+++ b/fs/ext2/xip.h
@@ -16,9 +16,7 @@ static inline int ext2_use_xip (struct super_block *sb)
 }
 int ext2_get_xip_mem(struct address_space *, pgoff_t, int,
 				void **, unsigned long *);
-#define mapping_is_xip(map) unlikely(map->a_ops->get_xip_mem)
 #else
-#define mapping_is_xip(map)			0
 #define ext2_xip_verify_sb(sb)			do { } while (0)
 #define ext2_use_xip(sb)			0
 #define ext2_clear_xip_target(inode, chain)	0
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 9418772..e99e5c4 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1605,6 +1605,7 @@ struct super_operations {
 #define S_IMA		1024	/* Inode has an associated IMA struct */
 #define S_AUTOMOUNT	2048	/* Automount/referral quasi-directory */
 #define S_NOSEC		4096	/* no suid or xattr security attributes */
+#define S_DAX		8192	/* Direct Access, avoiding the page cache */
 
 /*
  * Note that nosuid etc flags are inode-specific: setting some file-system
@@ -1642,6 +1643,11 @@ struct super_operations {
 #define IS_IMA(inode)		((inode)->i_flags & S_IMA)
 #define IS_AUTOMOUNT(inode)	((inode)->i_flags & S_AUTOMOUNT)
 #define IS_NOSEC(inode)		((inode)->i_flags & S_NOSEC)
+#ifdef CONFIG_FS_XIP
+#define IS_DAX(inode)		((inode)->i_flags & S_DAX)
+#else
+#define IS_DAX(inode)		0
+#endif
 
 /*
  * Inode state bits.  Protected by inode->i_lock
-- 
2.0.0



* [PATCH v10 06/21] Add copy_to_iter(), copy_from_iter() and iov_iter_zero()
  2014-08-27  3:45 [PATCH v10 00/21] Support ext4 on NV-DIMMs Matthew Wilcox
                   ` (4 preceding siblings ...)
  2014-08-27  3:45 ` [PATCH v10 05/21] Introduce IS_DAX(inode) Matthew Wilcox
@ 2014-08-27  3:45 ` Matthew Wilcox
  2014-08-27  3:45 ` [PATCH v10 07/21] Replace XIP read and write with DAX I/O Matthew Wilcox
                   ` (17 subsequent siblings)
  23 siblings, 0 replies; 52+ messages in thread
From: Matthew Wilcox @ 2014-08-27  3:45 UTC (permalink / raw)
  To: linux-fsdevel, linux-mm, linux-kernel; +Cc: Matthew Wilcox

From: Matthew Wilcox <willy@linux.intel.com>

For DAX, we want to be able to copy between iovecs and kernel addresses
that don't necessarily have a struct page.  This is a fairly simple
rearrangement for bvec iters to kmap the pages outside and pass them in,
but for user iovecs it gets more complicated because we might try various
different ways to kmap the memory.  Duplicating the existing logic works
out best in this case.

We need to be able to write zeroes to an iovec for reads from unwritten
ranges in a file.  This is performed by the new iov_iter_zero() function,
again patterned after the existing code that handles iovec iterators.
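
For illustration (not part of this patch), the intended usage is along
these lines, from code that holds a kernel virtual address with no
backing struct page; the function name is invented.  The write path is
the mirror image, using copy_from_iter().

/* Copy 'len' bytes at 'addr' out to the user's iovec, or zero-fill
 * the user buffer if the range is a hole in the file. */
static size_t example_copy_out(void *addr, size_t len, bool hole,
			       struct iov_iter *iter)
{
	if (hole)
		return iov_iter_zero(len, iter);

	/* Returns the number of bytes copied; may be short on a fault */
	return copy_to_iter(addr, len, iter);
}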

Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
---
 include/linux/uio.h |   3 +
 mm/iov_iter.c       | 237 ++++++++++++++++++++++++++++++++++++++++++++++++----
 2 files changed, 226 insertions(+), 14 deletions(-)

diff --git a/include/linux/uio.h b/include/linux/uio.h
index 48d64e6..1863ddd 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -80,6 +80,9 @@ size_t copy_page_to_iter(struct page *page, size_t offset, size_t bytes,
 			 struct iov_iter *i);
 size_t copy_page_from_iter(struct page *page, size_t offset, size_t bytes,
 			 struct iov_iter *i);
+size_t copy_to_iter(void *addr, size_t bytes, struct iov_iter *i);
+size_t copy_from_iter(void *addr, size_t bytes, struct iov_iter *i);
+size_t iov_iter_zero(size_t bytes, struct iov_iter *);
 unsigned long iov_iter_alignment(const struct iov_iter *i);
 void iov_iter_init(struct iov_iter *i, int direction, const struct iovec *iov,
 			unsigned long nr_segs, size_t count);
diff --git a/mm/iov_iter.c b/mm/iov_iter.c
index ab88dc0..d481fd8 100644
--- a/mm/iov_iter.c
+++ b/mm/iov_iter.c
@@ -4,6 +4,96 @@
 #include <linux/slab.h>
 #include <linux/vmalloc.h>
 
+static size_t copy_to_iter_iovec(void *from, size_t bytes, struct iov_iter *i)
+{
+	size_t skip, copy, left, wanted;
+	const struct iovec *iov;
+	char __user *buf;
+
+	if (unlikely(bytes > i->count))
+		bytes = i->count;
+
+	if (unlikely(!bytes))
+		return 0;
+
+	wanted = bytes;
+	iov = i->iov;
+	skip = i->iov_offset;
+	buf = iov->iov_base + skip;
+	copy = min(bytes, iov->iov_len - skip);
+
+	left = __copy_to_user(buf, from, copy);
+	copy -= left;
+	skip += copy;
+	from += copy;
+	bytes -= copy;
+	while (unlikely(!left && bytes)) {
+		iov++;
+		buf = iov->iov_base;
+		copy = min(bytes, iov->iov_len);
+		left = __copy_to_user(buf, from, copy);
+		copy -= left;
+		skip = copy;
+		from += copy;
+		bytes -= copy;
+	}
+
+	if (skip == iov->iov_len) {
+		iov++;
+		skip = 0;
+	}
+	i->count -= wanted - bytes;
+	i->nr_segs -= iov - i->iov;
+	i->iov = iov;
+	i->iov_offset = skip;
+	return wanted - bytes;
+}
+
+static size_t copy_from_iter_iovec(void *to, size_t bytes, struct iov_iter *i)
+{
+	size_t skip, copy, left, wanted;
+	const struct iovec *iov;
+	char __user *buf;
+
+	if (unlikely(bytes > i->count))
+		bytes = i->count;
+
+	if (unlikely(!bytes))
+		return 0;
+
+	wanted = bytes;
+	iov = i->iov;
+	skip = i->iov_offset;
+	buf = iov->iov_base + skip;
+	copy = min(bytes, iov->iov_len - skip);
+
+	left = __copy_from_user(to, buf, copy);
+	copy -= left;
+	skip += copy;
+	to += copy;
+	bytes -= copy;
+	while (unlikely(!left && bytes)) {
+		iov++;
+		buf = iov->iov_base;
+		copy = min(bytes, iov->iov_len);
+		left = __copy_from_user(to, buf, copy);
+		copy -= left;
+		skip = copy;
+		to += copy;
+		bytes -= copy;
+	}
+
+	if (skip == iov->iov_len) {
+		iov++;
+		skip = 0;
+	}
+	i->count -= wanted - bytes;
+	i->nr_segs -= iov - i->iov;
+	i->iov = iov;
+	i->iov_offset = skip;
+	return wanted - bytes;
+}
+
 static size_t copy_page_to_iter_iovec(struct page *page, size_t offset, size_t bytes,
 			 struct iov_iter *i)
 {
@@ -166,6 +256,50 @@ done:
 	return wanted - bytes;
 }
 
+static size_t zero_iovec(size_t bytes, struct iov_iter *i)
+{
+	size_t skip, copy, left, wanted;
+	const struct iovec *iov;
+	char __user *buf;
+
+	if (unlikely(bytes > i->count))
+		bytes = i->count;
+
+	if (unlikely(!bytes))
+		return 0;
+
+	wanted = bytes;
+	iov = i->iov;
+	skip = i->iov_offset;
+	buf = iov->iov_base + skip;
+	copy = min(bytes, iov->iov_len - skip);
+
+	left = __clear_user(buf, copy);
+	copy -= left;
+	skip += copy;
+	bytes -= copy;
+
+	while (unlikely(!left && bytes)) {
+		iov++;
+		buf = iov->iov_base;
+		copy = min(bytes, iov->iov_len);
+		left = __clear_user(buf, copy);
+		copy -= left;
+		skip = copy;
+		bytes -= copy;
+	}
+
+	if (skip == iov->iov_len) {
+		iov++;
+		skip = 0;
+	}
+	i->count -= wanted - bytes;
+	i->nr_segs -= iov - i->iov;
+	i->iov = iov;
+	i->iov_offset = skip;
+	return wanted - bytes;
+}
+
 static size_t __iovec_copy_from_user_inatomic(char *vaddr,
 			const struct iovec *iov, size_t base, size_t bytes)
 {
@@ -412,12 +546,17 @@ static void memcpy_to_page(struct page *page, size_t offset, char *from, size_t
 	kunmap_atomic(to);
 }
 
-static size_t copy_page_to_iter_bvec(struct page *page, size_t offset, size_t bytes,
-			 struct iov_iter *i)
+static void memzero_page(struct page *page, size_t offset, size_t len)
+{
+	char *addr = kmap_atomic(page);
+	memset(addr + offset, 0, len);
+	kunmap_atomic(addr);
+}
+
+static size_t copy_to_iter_bvec(void *from, size_t bytes, struct iov_iter *i)
 {
 	size_t skip, copy, wanted;
 	const struct bio_vec *bvec;
-	void *kaddr, *from;
 
 	if (unlikely(bytes > i->count))
 		bytes = i->count;
@@ -430,8 +569,6 @@ static size_t copy_page_to_iter_bvec(struct page *page, size_t offset, size_t by
 	skip = i->iov_offset;
 	copy = min_t(size_t, bytes, bvec->bv_len - skip);
 
-	kaddr = kmap_atomic(page);
-	from = kaddr + offset;
 	memcpy_to_page(bvec->bv_page, skip + bvec->bv_offset, from, copy);
 	skip += copy;
 	from += copy;
@@ -444,7 +581,6 @@ static size_t copy_page_to_iter_bvec(struct page *page, size_t offset, size_t by
 		from += copy;
 		bytes -= copy;
 	}
-	kunmap_atomic(kaddr);
 	if (skip == bvec->bv_len) {
 		bvec++;
 		skip = 0;
@@ -456,12 +592,10 @@ static size_t copy_page_to_iter_bvec(struct page *page, size_t offset, size_t by
 	return wanted - bytes;
 }
 
-static size_t copy_page_from_iter_bvec(struct page *page, size_t offset, size_t bytes,
-			 struct iov_iter *i)
+static size_t copy_from_iter_bvec(void *to, size_t bytes, struct iov_iter *i)
 {
 	size_t skip, copy, wanted;
 	const struct bio_vec *bvec;
-	void *kaddr, *to;
 
 	if (unlikely(bytes > i->count))
 		bytes = i->count;
@@ -473,10 +607,6 @@ static size_t copy_page_from_iter_bvec(struct page *page, size_t offset, size_t
 	bvec = i->bvec;
 	skip = i->iov_offset;
 
-	kaddr = kmap_atomic(page);
-
-	to = kaddr + offset;
-
 	copy = min(bytes, bvec->bv_len - skip);
 
 	memcpy_from_page(to, bvec->bv_page, bvec->bv_offset + skip, copy);
@@ -493,7 +623,6 @@ static size_t copy_page_from_iter_bvec(struct page *page, size_t offset, size_t
 		to += copy;
 		bytes -= copy;
 	}
-	kunmap_atomic(kaddr);
 	if (skip == bvec->bv_len) {
 		bvec++;
 		skip = 0;
@@ -505,6 +634,61 @@ static size_t copy_page_from_iter_bvec(struct page *page, size_t offset, size_t
 	return wanted;
 }
 
+static size_t copy_page_to_iter_bvec(struct page *page, size_t offset,
+					size_t bytes, struct iov_iter *i)
+{
+	void *kaddr = kmap_atomic(page);
+	size_t wanted = copy_to_iter_bvec(kaddr + offset, bytes, i);
+	kunmap_atomic(kaddr);
+	return wanted;
+}
+
+static size_t copy_page_from_iter_bvec(struct page *page, size_t offset,
+					size_t bytes, struct iov_iter *i)
+{
+	void *kaddr = kmap_atomic(page);
+	size_t wanted = copy_from_iter_bvec(kaddr + offset, bytes, i);
+	kunmap_atomic(kaddr);
+	return wanted;
+}
+
+static size_t zero_bvec(size_t bytes, struct iov_iter *i)
+{
+	size_t skip, copy, wanted;
+	const struct bio_vec *bvec;
+
+	if (unlikely(bytes > i->count))
+		bytes = i->count;
+
+	if (unlikely(!bytes))
+		return 0;
+
+	wanted = bytes;
+	bvec = i->bvec;
+	skip = i->iov_offset;
+	copy = min_t(size_t, bytes, bvec->bv_len - skip);
+
+	memzero_page(bvec->bv_page, skip + bvec->bv_offset, copy);
+	skip += copy;
+	bytes -= copy;
+	while (bytes) {
+		bvec++;
+		copy = min(bytes, (size_t)bvec->bv_len);
+		memzero_page(bvec->bv_page, bvec->bv_offset, copy);
+		skip = copy;
+		bytes -= copy;
+	}
+	if (skip == bvec->bv_len) {
+		bvec++;
+		skip = 0;
+	}
+	i->count -= wanted - bytes;
+	i->nr_segs -= bvec - i->bvec;
+	i->bvec = bvec;
+	i->iov_offset = skip;
+	return wanted - bytes;
+}
+
 static size_t copy_from_user_bvec(struct page *page,
 		struct iov_iter *i, unsigned long offset, size_t bytes)
 {
@@ -668,6 +852,31 @@ size_t copy_page_from_iter(struct page *page, size_t offset, size_t bytes,
 }
 EXPORT_SYMBOL(copy_page_from_iter);
 
+size_t copy_to_iter(void *addr, size_t bytes, struct iov_iter *i)
+{
+	if (i->type & ITER_BVEC)
+		return copy_to_iter_bvec(addr, bytes, i);
+	else
+		return copy_to_iter_iovec(addr, bytes, i);
+}
+
+size_t copy_from_iter(void *addr, size_t bytes, struct iov_iter *i)
+{
+	if (i->type & ITER_BVEC)
+		return copy_from_iter_bvec(addr, bytes, i);
+	else
+		return copy_from_iter_iovec(addr, bytes, i);
+}
+
+size_t iov_iter_zero(size_t bytes, struct iov_iter *i)
+{
+	if (i->type & ITER_BVEC) {
+		return zero_bvec(bytes, i);
+	} else {
+		return zero_iovec(bytes, i);
+	}
+}
+
 size_t iov_iter_copy_from_user_atomic(struct page *page,
 		struct iov_iter *i, unsigned long offset, size_t bytes)
 {
-- 
2.0.0



* [PATCH v10 07/21] Replace XIP read and write with DAX I/O
  2014-08-27  3:45 [PATCH v10 00/21] Support ext4 on NV-DIMMs Matthew Wilcox
                   ` (5 preceding siblings ...)
  2014-08-27  3:45 ` [PATCH v10 06/21] Add copy_to_iter(), copy_from_iter() and iov_iter_zero() Matthew Wilcox
@ 2014-08-27  3:45 ` Matthew Wilcox
  2014-09-14 14:11   ` Boaz Harrosh
  2014-08-27  3:45 ` [PATCH v10 08/21] Replace ext2_clear_xip_target with dax_clear_blocks Matthew Wilcox
                   ` (16 subsequent siblings)
  23 siblings, 1 reply; 52+ messages in thread
From: Matthew Wilcox @ 2014-08-27  3:45 UTC (permalink / raw)
  To: linux-fsdevel, linux-mm, linux-kernel; +Cc: Matthew Wilcox, willy

Use the generic AIO infrastructure instead of custom read and write
methods.  In addition to giving us support for AIO, this adds the missing
locking between read() and truncate().
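
Condensed from the ext2 hunks below, the pattern another filesystem
follows to opt in looks roughly like this (illustrative only;
example_get_block stands in for the filesystem's own get_block_t):

static ssize_t example_direct_IO(int rw, struct kiocb *iocb,
				 struct iov_iter *iter, loff_t offset)
{
	struct inode *inode = file_inode(iocb->ki_filp);

	if (IS_DAX(inode))
		/* Copy directly between the iovec and the device memory */
		return dax_do_io(rw, iocb, inode, iter, offset,
				 example_get_block, NULL, DIO_LOCKING);

	return blockdev_direct_IO(rw, iocb, inode, iter, offset,
				  example_get_block);
}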

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 MAINTAINERS        |   6 ++
 fs/Makefile        |   1 +
 fs/dax.c           | 195 ++++++++++++++++++++++++++++++++++++++++++++
 fs/ext2/file.c     |   6 +-
 fs/ext2/inode.c    |   8 +-
 include/linux/fs.h |  18 ++++-
 mm/filemap.c       |   6 +-
 mm/filemap_xip.c   | 234 -----------------------------------------------------
 8 files changed, 229 insertions(+), 245 deletions(-)
 create mode 100644 fs/dax.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 1ff06de..3f29153 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -2929,6 +2929,12 @@ L:	linux-i2c@vger.kernel.org
 S:	Maintained
 F:	drivers/i2c/busses/i2c-diolan-u2c.c
 
+DIRECT ACCESS (DAX)
+M:	Matthew Wilcox <willy@linux.intel.com>
+L:	linux-fsdevel@vger.kernel.org
+S:	Supported
+F:	fs/dax.c
+
 DIRECTORY NOTIFICATION (DNOTIFY)
 M:	Eric Paris <eparis@parisplace.org>
 S:	Maintained
diff --git a/fs/Makefile b/fs/Makefile
index 90c8852..0325ec3 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -28,6 +28,7 @@ obj-$(CONFIG_SIGNALFD)		+= signalfd.o
 obj-$(CONFIG_TIMERFD)		+= timerfd.o
 obj-$(CONFIG_EVENTFD)		+= eventfd.o
 obj-$(CONFIG_AIO)               += aio.o
+obj-$(CONFIG_FS_XIP)		+= dax.o
 obj-$(CONFIG_FILE_LOCKING)      += locks.o
 obj-$(CONFIG_COMPAT)		+= compat.o compat_ioctl.o
 obj-$(CONFIG_BINFMT_AOUT)	+= binfmt_aout.o
diff --git a/fs/dax.c b/fs/dax.c
new file mode 100644
index 0000000..108c68e
--- /dev/null
+++ b/fs/dax.c
@@ -0,0 +1,195 @@
+/*
+ * fs/dax.c - Direct Access filesystem code
+ * Copyright (c) 2013-2014 Intel Corporation
+ * Author: Matthew Wilcox <matthew.r.wilcox@intel.com>
+ * Author: Ross Zwisler <ross.zwisler@linux.intel.com>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+#include <linux/atomic.h>
+#include <linux/blkdev.h>
+#include <linux/buffer_head.h>
+#include <linux/fs.h>
+#include <linux/genhd.h>
+#include <linux/mutex.h>
+#include <linux/uio.h>
+
+static long dax_get_addr(struct buffer_head *bh, void **addr, unsigned blkbits)
+{
+	unsigned long pfn;
+	sector_t sector = bh->b_blocknr << (blkbits - 9);
+	return bdev_direct_access(bh->b_bdev, sector, addr, &pfn, bh->b_size);
+}
+
+static void dax_new_buf(void *addr, unsigned size, unsigned first, loff_t pos,
+			loff_t end)
+{
+	loff_t final = end - pos + first; /* The final byte of the buffer */
+
+	if (first > 0)
+		memset(addr, 0, first);
+	if (final < size)
+		memset(addr + final, 0, size - final);
+}
+
+static bool buffer_written(struct buffer_head *bh)
+{
+	return buffer_mapped(bh) && !buffer_unwritten(bh);
+}
+
+/*
+ * When ext4 encounters a hole, it returns without modifying the buffer_head
+ * which means that we can't trust b_size.  To cope with this, we set b_state
+ * to 0 before calling get_block and, if any bit is set, we know we can trust
+ * b_size.  Unfortunate, really, since ext4 knows precisely how long a hole is
+ * and would save us time calling get_block repeatedly.
+ */
+static bool buffer_size_valid(struct buffer_head *bh)
+{
+	return bh->b_state != 0;
+}
+
+static ssize_t dax_io(int rw, struct inode *inode, struct iov_iter *iter,
+			loff_t start, loff_t end, get_block_t get_block,
+			struct buffer_head *bh)
+{
+	ssize_t retval = 0;
+	loff_t pos = start;
+	loff_t max = start;
+	loff_t bh_max = start;
+	void *addr;
+	bool hole = false;
+
+	if (rw != WRITE)
+		end = min(end, i_size_read(inode));
+
+	while (pos < end) {
+		unsigned len;
+		if (pos == max) {
+			unsigned blkbits = inode->i_blkbits;
+			sector_t block = pos >> blkbits;
+			unsigned first = pos - (block << blkbits);
+			long size;
+
+			if (pos == bh_max) {
+				bh->b_size = PAGE_ALIGN(end - pos);
+				bh->b_state = 0;
+				retval = get_block(inode, block, bh,
+								rw == WRITE);
+				if (retval)
+					break;
+				if (!buffer_size_valid(bh))
+					bh->b_size = 1 << blkbits;
+				bh_max = pos - first + bh->b_size;
+			} else {
+				unsigned done = bh->b_size -
+						(bh_max - (pos - first));
+				bh->b_blocknr += done >> blkbits;
+				bh->b_size -= done;
+			}
+			if (rw == WRITE) {
+				if (!buffer_mapped(bh)) {
+					retval = -EIO;
+					/* FIXME: fall back to buffered I/O */
+					break;
+				}
+				hole = false;
+			} else {
+				hole = !buffer_written(bh);
+			}
+
+			if (hole) {
+				addr = NULL;
+				size = bh->b_size - first;
+			} else {
+				retval = dax_get_addr(bh, &addr, blkbits);
+				if (retval < 0)
+					break;
+				if (buffer_unwritten(bh) || buffer_new(bh))
+					dax_new_buf(addr, retval, first, pos,
+									end);
+				addr += first;
+				size = retval - first;
+			}
+			max = min(pos + size, end);
+		}
+
+		if (rw == WRITE)
+			len = copy_from_iter(addr, max - pos, iter);
+		else if (!hole)
+			len = copy_to_iter(addr, max - pos, iter);
+		else
+			len = iov_iter_zero(max - pos, iter);
+
+		if (!len)
+			break;
+
+		pos += len;
+		addr += len;
+	}
+
+	return (pos == start) ? retval : pos - start;
+}
+
+/**
+ * dax_do_io - Perform I/O to a DAX file
+ * @rw: READ to read or WRITE to write
+ * @iocb: The control block for this I/O
+ * @inode: The file which the I/O is directed at
+ * @iter: The addresses to do I/O from or to
+ * @pos: The file offset where the I/O starts
+ * @get_block: The filesystem method used to translate file offsets to blocks
+ * @end_io: A filesystem callback for I/O completion
+ * @flags: See below
+ *
+ * This function uses the same locking scheme as do_blockdev_direct_IO:
+ * If @flags has DIO_LOCKING set, we assume that the i_mutex is held by the
+ * caller for writes.  For reads, we take and release the i_mutex ourselves.
+ * If DIO_LOCKING is not set, the filesystem takes care of its own locking.
+ * As with do_blockdev_direct_IO(), we increment i_dio_count while the I/O
+ * is in progress.
+ */
+ssize_t dax_do_io(int rw, struct kiocb *iocb, struct inode *inode,
+			struct iov_iter *iter, loff_t pos,
+			get_block_t get_block, dio_iodone_t end_io, int flags)
+{
+	struct buffer_head bh;
+	ssize_t retval = -EINVAL;
+	loff_t end = pos + iov_iter_count(iter);
+
+	memset(&bh, 0, sizeof(bh));
+
+	if ((flags & DIO_LOCKING) && (rw == READ)) {
+		struct address_space *mapping = inode->i_mapping;
+		mutex_lock(&inode->i_mutex);
+		retval = filemap_write_and_wait_range(mapping, pos, end - 1);
+		if (retval) {
+			mutex_unlock(&inode->i_mutex);
+			goto out;
+		}
+	}
+
+	/* Protects against truncate */
+	atomic_inc(&inode->i_dio_count);
+
+	retval = dax_io(rw, inode, iter, pos, end, get_block, &bh);
+
+	if ((flags & DIO_LOCKING) && (rw == READ))
+		mutex_unlock(&inode->i_mutex);
+
+	if ((retval > 0) && end_io)
+		end_io(iocb, pos, retval, bh.b_private);
+
+	inode_dio_done(inode);
+ out:
+	return retval;
+}
+EXPORT_SYMBOL_GPL(dax_do_io);
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index 7c87b22..a247123 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -81,8 +81,10 @@ const struct file_operations ext2_file_operations = {
 #ifdef CONFIG_EXT2_FS_XIP
 const struct file_operations ext2_xip_file_operations = {
 	.llseek		= generic_file_llseek,
-	.read		= xip_file_read,
-	.write		= xip_file_write,
+	.read		= new_sync_read,
+	.write		= new_sync_write,
+	.read_iter	= generic_file_read_iter,
+	.write_iter	= generic_file_write_iter,
 	.unlocked_ioctl = ext2_ioctl,
 #ifdef CONFIG_COMPAT
 	.compat_ioctl	= ext2_compat_ioctl,
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 0cb0448..3ccd5fd 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -859,7 +859,12 @@ ext2_direct_IO(int rw, struct kiocb *iocb, struct iov_iter *iter,
 	size_t count = iov_iter_count(iter);
 	ssize_t ret;
 
-	ret = blockdev_direct_IO(rw, iocb, inode, iter, offset, ext2_get_block);
+	if (IS_DAX(inode))
+		ret = dax_do_io(rw, iocb, inode, iter, offset, ext2_get_block,
+				NULL, DIO_LOCKING);
+	else
+		ret = blockdev_direct_IO(rw, iocb, inode, iter, offset,
+					 ext2_get_block);
 	if (ret < 0 && (rw & WRITE))
 		ext2_write_failed(mapping, offset + count);
 	return ret;
@@ -888,6 +893,7 @@ const struct address_space_operations ext2_aops = {
 const struct address_space_operations ext2_aops_xip = {
 	.bmap			= ext2_bmap,
 	.get_xip_mem		= ext2_get_xip_mem,
+	.direct_IO		= ext2_direct_IO,
 };
 
 const struct address_space_operations ext2_nobh_aops = {
diff --git a/include/linux/fs.h b/include/linux/fs.h
index e99e5c4..45839e8 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2490,17 +2490,22 @@ extern int generic_file_open(struct inode * inode, struct file * filp);
 extern int nonseekable_open(struct inode * inode, struct file * filp);
 
 #ifdef CONFIG_FS_XIP
-extern ssize_t xip_file_read(struct file *filp, char __user *buf, size_t len,
-			     loff_t *ppos);
 extern int xip_file_mmap(struct file * file, struct vm_area_struct * vma);
-extern ssize_t xip_file_write(struct file *filp, const char __user *buf,
-			      size_t len, loff_t *ppos);
 extern int xip_truncate_page(struct address_space *mapping, loff_t from);
+ssize_t dax_do_io(int rw, struct kiocb *, struct inode *, struct iov_iter *,
+		loff_t, get_block_t, dio_iodone_t, int flags);
 #else
 static inline int xip_truncate_page(struct address_space *mapping, loff_t from)
 {
 	return 0;
 }
+
+static inline ssize_t dax_do_io(int rw, struct kiocb *iocb,
+		struct inode *inode, struct iov_iter *iter, loff_t pos,
+		get_block_t get_block, dio_iodone_t end_io, int flags)
+{
+	return -ENOTTY;
+}
 #endif
 
 #ifdef CONFIG_BLOCK
@@ -2657,6 +2662,11 @@ extern int generic_show_options(struct seq_file *m, struct dentry *root);
 extern void save_mount_options(struct super_block *sb, char *options);
 extern void replace_mount_options(struct super_block *sb, char *options);
 
+static inline bool io_is_direct(struct file *filp)
+{
+	return (filp->f_flags & O_DIRECT) || IS_DAX(file_inode(filp));
+}
+
 static inline ino_t parent_ino(struct dentry *dentry)
 {
 	ino_t res;
diff --git a/mm/filemap.c b/mm/filemap.c
index 90effcd..19bdb68 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1690,8 +1690,7 @@ generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
 	loff_t *ppos = &iocb->ki_pos;
 	loff_t pos = *ppos;
 
-	/* coalesce the iovecs and go direct-to-BIO for O_DIRECT */
-	if (file->f_flags & O_DIRECT) {
+	if (io_is_direct(file)) {
 		struct address_space *mapping = file->f_mapping;
 		struct inode *inode = mapping->host;
 		size_t count = iov_iter_count(iter);
@@ -2579,8 +2578,7 @@ ssize_t __generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	if (err)
 		goto out;
 
-	/* coalesce the iovecs and go direct-to-BIO for O_DIRECT */
-	if (unlikely(file->f_flags & O_DIRECT)) {
+	if (io_is_direct(file)) {
 		loff_t endbyte;
 
 		written = generic_file_direct_write(iocb, from, pos);
diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
index c8d23e9..f7c37a1 100644
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -42,119 +42,6 @@ static struct page *xip_sparse_page(void)
 }
 
 /*
- * This is a file read routine for execute in place files, and uses
- * the mapping->a_ops->get_xip_mem() function for the actual low-level
- * stuff.
- *
- * Note the struct file* is not used at all.  It may be NULL.
- */
-static ssize_t
-do_xip_mapping_read(struct address_space *mapping,
-		    struct file_ra_state *_ra,
-		    struct file *filp,
-		    char __user *buf,
-		    size_t len,
-		    loff_t *ppos)
-{
-	struct inode *inode = mapping->host;
-	pgoff_t index, end_index;
-	unsigned long offset;
-	loff_t isize, pos;
-	size_t copied = 0, error = 0;
-
-	BUG_ON(!mapping->a_ops->get_xip_mem);
-
-	pos = *ppos;
-	index = pos >> PAGE_CACHE_SHIFT;
-	offset = pos & ~PAGE_CACHE_MASK;
-
-	isize = i_size_read(inode);
-	if (!isize)
-		goto out;
-
-	end_index = (isize - 1) >> PAGE_CACHE_SHIFT;
-	do {
-		unsigned long nr, left;
-		void *xip_mem;
-		unsigned long xip_pfn;
-		int zero = 0;
-
-		/* nr is the maximum number of bytes to copy from this page */
-		nr = PAGE_CACHE_SIZE;
-		if (index >= end_index) {
-			if (index > end_index)
-				goto out;
-			nr = ((isize - 1) & ~PAGE_CACHE_MASK) + 1;
-			if (nr <= offset) {
-				goto out;
-			}
-		}
-		nr = nr - offset;
-		if (nr > len - copied)
-			nr = len - copied;
-
-		error = mapping->a_ops->get_xip_mem(mapping, index, 0,
-							&xip_mem, &xip_pfn);
-		if (unlikely(error)) {
-			if (error == -ENODATA) {
-				/* sparse */
-				zero = 1;
-			} else
-				goto out;
-		}
-
-		/* If users can be writing to this page using arbitrary
-		 * virtual addresses, take care about potential aliasing
-		 * before reading the page on the kernel side.
-		 */
-		if (mapping_writably_mapped(mapping))
-			/* address based flush */ ;
-
-		/*
-		 * Ok, we have the mem, so now we can copy it to user space...
-		 *
-		 * The actor routine returns how many bytes were actually used..
-		 * NOTE! This may not be the same as how much of a user buffer
-		 * we filled up (we may be padding etc), so we can only update
-		 * "pos" here (the actor routine has to update the user buffer
-		 * pointers and the remaining count).
-		 */
-		if (!zero)
-			left = __copy_to_user(buf+copied, xip_mem+offset, nr);
-		else
-			left = __clear_user(buf + copied, nr);
-
-		if (left) {
-			error = -EFAULT;
-			goto out;
-		}
-
-		copied += (nr - left);
-		offset += (nr - left);
-		index += offset >> PAGE_CACHE_SHIFT;
-		offset &= ~PAGE_CACHE_MASK;
-	} while (copied < len);
-
-out:
-	*ppos = pos + copied;
-	if (filp)
-		file_accessed(filp);
-
-	return (copied ? copied : error);
-}
-
-ssize_t
-xip_file_read(struct file *filp, char __user *buf, size_t len, loff_t *ppos)
-{
-	if (!access_ok(VERIFY_WRITE, buf, len))
-		return -EFAULT;
-
-	return do_xip_mapping_read(filp->f_mapping, &filp->f_ra, filp,
-			    buf, len, ppos);
-}
-EXPORT_SYMBOL_GPL(xip_file_read);
-
-/*
  * __xip_unmap is invoked from xip_unmap and
  * xip_write
  *
@@ -340,127 +227,6 @@ int xip_file_mmap(struct file * file, struct vm_area_struct * vma)
 }
 EXPORT_SYMBOL_GPL(xip_file_mmap);
 
-static ssize_t
-__xip_file_write(struct file *filp, const char __user *buf,
-		  size_t count, loff_t pos, loff_t *ppos)
-{
-	struct address_space * mapping = filp->f_mapping;
-	const struct address_space_operations *a_ops = mapping->a_ops;
-	struct inode 	*inode = mapping->host;
-	long		status = 0;
-	size_t		bytes;
-	ssize_t		written = 0;
-
-	BUG_ON(!mapping->a_ops->get_xip_mem);
-
-	do {
-		unsigned long index;
-		unsigned long offset;
-		size_t copied;
-		void *xip_mem;
-		unsigned long xip_pfn;
-
-		offset = (pos & (PAGE_CACHE_SIZE -1)); /* Within page */
-		index = pos >> PAGE_CACHE_SHIFT;
-		bytes = PAGE_CACHE_SIZE - offset;
-		if (bytes > count)
-			bytes = count;
-
-		status = a_ops->get_xip_mem(mapping, index, 0,
-						&xip_mem, &xip_pfn);
-		if (status == -ENODATA) {
-			/* we allocate a new page unmap it */
-			mutex_lock(&xip_sparse_mutex);
-			status = a_ops->get_xip_mem(mapping, index, 1,
-							&xip_mem, &xip_pfn);
-			mutex_unlock(&xip_sparse_mutex);
-			if (!status)
-				/* unmap page at pgoff from all other vmas */
-				__xip_unmap(mapping, index);
-		}
-
-		if (status)
-			break;
-
-		copied = bytes -
-			__copy_from_user_nocache(xip_mem + offset, buf, bytes);
-
-		if (likely(copied > 0)) {
-			status = copied;
-
-			if (status >= 0) {
-				written += status;
-				count -= status;
-				pos += status;
-				buf += status;
-			}
-		}
-		if (unlikely(copied != bytes))
-			if (status >= 0)
-				status = -EFAULT;
-		if (status < 0)
-			break;
-	} while (count);
-	*ppos = pos;
-	/*
-	 * No need to use i_size_read() here, the i_size
-	 * cannot change under us because we hold i_mutex.
-	 */
-	if (pos > inode->i_size) {
-		i_size_write(inode, pos);
-		mark_inode_dirty(inode);
-	}
-
-	return written ? written : status;
-}
-
-ssize_t
-xip_file_write(struct file *filp, const char __user *buf, size_t len,
-	       loff_t *ppos)
-{
-	struct address_space *mapping = filp->f_mapping;
-	struct inode *inode = mapping->host;
-	size_t count;
-	loff_t pos;
-	ssize_t ret;
-
-	mutex_lock(&inode->i_mutex);
-
-	if (!access_ok(VERIFY_READ, buf, len)) {
-		ret=-EFAULT;
-		goto out_up;
-	}
-
-	pos = *ppos;
-	count = len;
-
-	/* We can write back this queue in page reclaim */
-	current->backing_dev_info = mapping->backing_dev_info;
-
-	ret = generic_write_checks(filp, &pos, &count, S_ISBLK(inode->i_mode));
-	if (ret)
-		goto out_backing;
-	if (count == 0)
-		goto out_backing;
-
-	ret = file_remove_suid(filp);
-	if (ret)
-		goto out_backing;
-
-	ret = file_update_time(filp);
-	if (ret)
-		goto out_backing;
-
-	ret = __xip_file_write (filp, buf, count, pos, ppos);
-
- out_backing:
-	current->backing_dev_info = NULL;
- out_up:
-	mutex_unlock(&inode->i_mutex);
-	return ret;
-}
-EXPORT_SYMBOL_GPL(xip_file_write);
-
 /*
  * truncate a page used for execute in place
  * functionality is analog to block_truncate_page but does use get_xip_mem
-- 
2.0.0


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v10 08/21] Replace ext2_clear_xip_target with dax_clear_blocks
  2014-08-27  3:45 [PATCH v10 00/21] Support ext4 on NV-DIMMs Matthew Wilcox
                   ` (6 preceding siblings ...)
  2014-08-27  3:45 ` [PATCH v10 07/21] Replace XIP read and write with DAX I/O Matthew Wilcox
@ 2014-08-27  3:45 ` Matthew Wilcox
  2014-08-27  3:45 ` [PATCH v10 09/21] Replace the XIP page fault handler with the DAX page fault handler Matthew Wilcox
                   ` (15 subsequent siblings)
  23 siblings, 0 replies; 52+ messages in thread
From: Matthew Wilcox @ 2014-08-27  3:45 UTC (permalink / raw)
  To: linux-fsdevel, linux-mm, linux-kernel; +Cc: Matthew Wilcox, willy

This is practically generic code: other filesystems will want to call
it from other places, and there's nothing ext2-specific about it.

Make it a little more generic by allowing it to take a count of the number
of bytes to zero rather than fixing it to a single page.  Thanks to Dave
Hansen for suggesting that I need to call cond_resched() if zeroing more
than one page.
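
As a rough illustration of the byte-count interface (not part of this
patch; fs_zero_new_block() is a hypothetical wrapper, and the ext2 hunk
below is the real caller), a filesystem zeroes a freshly allocated block
with a single call, whatever its block size:

	/* Zero a newly allocated block before linking it into the tree. */
	static int fs_zero_new_block(struct inode *inode, sector_t blocknr)
	{
		/* length is in bytes; cond_resched() happens inside */
		return dax_clear_blocks(inode, blocknr, 1 << inode->i_blkbits);
	}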

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
---
 fs/dax.c           | 35 +++++++++++++++++++++++++++++++++++
 fs/ext2/inode.c    |  8 +++++---
 fs/ext2/xip.c      | 14 --------------
 fs/ext2/xip.h      |  3 ---
 include/linux/fs.h |  6 ++++++
 5 files changed, 46 insertions(+), 20 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 108c68e..02e226f 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -20,8 +20,43 @@
 #include <linux/fs.h>
 #include <linux/genhd.h>
 #include <linux/mutex.h>
+#include <linux/sched.h>
 #include <linux/uio.h>
 
+int dax_clear_blocks(struct inode *inode, sector_t block, long size)
+{
+	struct block_device *bdev = inode->i_sb->s_bdev;
+	sector_t sector = block << (inode->i_blkbits - 9);
+
+	might_sleep();
+	do {
+		void *addr;
+		unsigned long pfn;
+		long count;
+
+		count = bdev_direct_access(bdev, sector, &addr, &pfn, size);
+		if (count < 0)
+			return count;
+		while (count > 0) {
+			unsigned pgsz = PAGE_SIZE - offset_in_page(addr);
+			if (pgsz > count)
+				pgsz = count;
+			if (pgsz < PAGE_SIZE)
+				memset(addr, 0, pgsz);
+			else
+				clear_page(addr);
+			addr += pgsz;
+			size -= pgsz;
+			count -= pgsz;
+			sector += pgsz / 512;
+			cond_resched();
+		}
+	} while (size);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(dax_clear_blocks);
+
 static long dax_get_addr(struct buffer_head *bh, void **addr, unsigned blkbits)
 {
 	unsigned long pfn;
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 3ccd5fd..52978b8 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -733,10 +733,12 @@ static int ext2_get_blocks(struct inode *inode,
 
 	if (IS_DAX(inode)) {
 		/*
-		 * we need to clear the block
+		 * block must be initialised before we put it in the tree
+		 * so that it's not found by another thread before it's
+		 * initialised
 		 */
-		err = ext2_clear_xip_target (inode,
-			le32_to_cpu(chain[depth-1].key));
+		err = dax_clear_blocks(inode, le32_to_cpu(chain[depth-1].key),
+						1 << inode->i_blkbits);
 		if (err) {
 			mutex_unlock(&ei->truncate_mutex);
 			goto cleanup;
diff --git a/fs/ext2/xip.c b/fs/ext2/xip.c
index bbc5fec..8cfca3a 100644
--- a/fs/ext2/xip.c
+++ b/fs/ext2/xip.c
@@ -42,20 +42,6 @@ __ext2_get_block(struct inode *inode, pgoff_t pgoff, int create,
 	return rc;
 }
 
-int
-ext2_clear_xip_target(struct inode *inode, sector_t block)
-{
-	void *kaddr;
-	unsigned long pfn;
-	long size;
-
-	size = __inode_direct_access(inode, block, &kaddr, &pfn, PAGE_SIZE);
-	if (size < 0)
-		return size;
-	clear_page(kaddr);
-	return 0;
-}
-
 void ext2_xip_verify_sb(struct super_block *sb)
 {
 	struct ext2_sb_info *sbi = EXT2_SB(sb);
diff --git a/fs/ext2/xip.h b/fs/ext2/xip.h
index 29be737..b2592f2 100644
--- a/fs/ext2/xip.h
+++ b/fs/ext2/xip.h
@@ -7,8 +7,6 @@
 
 #ifdef CONFIG_EXT2_FS_XIP
 extern void ext2_xip_verify_sb (struct super_block *);
-extern int ext2_clear_xip_target (struct inode *, sector_t);
-
 static inline int ext2_use_xip (struct super_block *sb)
 {
 	struct ext2_sb_info *sbi = EXT2_SB(sb);
@@ -19,6 +17,5 @@ int ext2_get_xip_mem(struct address_space *, pgoff_t, int,
 #else
 #define ext2_xip_verify_sb(sb)			do { } while (0)
 #define ext2_use_xip(sb)			0
-#define ext2_clear_xip_target(inode, chain)	0
 #define ext2_get_xip_mem			NULL
 #endif
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 45839e8..c04d371 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2490,11 +2490,17 @@ extern int generic_file_open(struct inode * inode, struct file * filp);
 extern int nonseekable_open(struct inode * inode, struct file * filp);
 
 #ifdef CONFIG_FS_XIP
+int dax_clear_blocks(struct inode *, sector_t block, long size);
 extern int xip_file_mmap(struct file * file, struct vm_area_struct * vma);
 extern int xip_truncate_page(struct address_space *mapping, loff_t from);
 ssize_t dax_do_io(int rw, struct kiocb *, struct inode *, struct iov_iter *,
 		loff_t, get_block_t, dio_iodone_t, int flags);
 #else
+static inline int dax_clear_blocks(struct inode *i, sector_t blk, long sz)
+{
+	return 0;
+}
+
 static inline int xip_truncate_page(struct address_space *mapping, loff_t from)
 {
 	return 0;
-- 
2.0.0


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v10 09/21] Replace the XIP page fault handler with the DAX page fault handler
  2014-08-27  3:45 [PATCH v10 00/21] Support ext4 on NV-DIMMs Matthew Wilcox
                   ` (7 preceding siblings ...)
  2014-08-27  3:45 ` [PATCH v10 08/21] Replace ext2_clear_xip_target with dax_clear_blocks Matthew Wilcox
@ 2014-08-27  3:45 ` Matthew Wilcox
  2014-09-03  7:47   ` Dave Chinner
  2014-08-27  3:45 ` [PATCH v10 10/21] Replace xip_truncate_page with dax_truncate_page Matthew Wilcox
                   ` (14 subsequent siblings)
  23 siblings, 1 reply; 52+ messages in thread
From: Matthew Wilcox @ 2014-08-27  3:45 UTC (permalink / raw)
  To: linux-fsdevel, linux-mm, linux-kernel; +Cc: Matthew Wilcox, willy

Instead of calling aops->get_xip_mem from the fault handler, the
filesystem passes a get_block_t that is used to find the appropriate
blocks.
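
A minimal sketch of the wiring in a filesystem (illustrative only; the
ext2 hunk below is the real version, and myfs_get_block stands in for
whatever get_block_t the filesystem already has):

	static int myfs_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
	{
		/* dax_fault() maps the faulting offset via myfs_get_block */
		return dax_fault(vma, vmf, myfs_get_block);
	}

	static const struct vm_operations_struct myfs_dax_vm_ops = {
		.fault		= myfs_dax_fault,
		.page_mkwrite	= myfs_dax_fault,	/* dax_mkwrite() is the same call */
		.remap_pages	= generic_file_remap_pages,
	};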

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/dax.c           | 215 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/ext2/file.c     |  35 ++++++++-
 include/linux/fs.h |   4 +-
 mm/filemap_xip.c   | 206 --------------------------------------------------
 4 files changed, 251 insertions(+), 209 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 02e226f..f134078 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -19,9 +19,13 @@
 #include <linux/buffer_head.h>
 #include <linux/fs.h>
 #include <linux/genhd.h>
+#include <linux/highmem.h>
+#include <linux/memcontrol.h>
+#include <linux/mm.h>
 #include <linux/mutex.h>
 #include <linux/sched.h>
 #include <linux/uio.h>
+#include <linux/vmstat.h>
 
 int dax_clear_blocks(struct inode *inode, sector_t block, long size)
 {
@@ -64,6 +68,14 @@ static long dax_get_addr(struct buffer_head *bh, void **addr, unsigned blkbits)
 	return bdev_direct_access(bh->b_bdev, sector, addr, &pfn, bh->b_size);
 }
 
+static long dax_get_pfn(struct buffer_head *bh, unsigned long *pfn,
+							unsigned blkbits)
+{
+	void *addr;
+	sector_t sector = bh->b_blocknr << (blkbits - 9);
+	return bdev_direct_access(bh->b_bdev, sector, &addr, pfn, bh->b_size);
+}
+
 static void dax_new_buf(void *addr, unsigned size, unsigned first, loff_t pos,
 			loff_t end)
 {
@@ -228,3 +240,206 @@ ssize_t dax_do_io(int rw, struct kiocb *iocb, struct inode *inode,
 	return retval;
 }
 EXPORT_SYMBOL_GPL(dax_do_io);
+
+/*
+ * The user has performed a load from a hole in the file.  Allocating
+ * a new page in the file would cause excessive storage usage for
+ * workloads with sparse files.  We allocate a page cache page instead.
+ * We'll kick it out of the page cache if it's ever written to,
+ * otherwise it will simply fall out of the page cache under memory
+ * pressure without ever having been dirtied.
+ */
+static int dax_load_hole(struct address_space *mapping, struct page *page,
+							struct vm_fault *vmf)
+{
+	unsigned long size;
+	struct inode *inode = mapping->host;
+	if (!page)
+		page = find_or_create_page(mapping, vmf->pgoff,
+						GFP_KERNEL | __GFP_ZERO);
+	if (!page)
+		return VM_FAULT_OOM;
+	/* Recheck i_size under page lock to avoid truncate race */
+	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
+	if (vmf->pgoff >= size) {
+		unlock_page(page);
+		page_cache_release(page);
+		return VM_FAULT_SIGBUS;
+	}
+
+	vmf->page = page;
+	return VM_FAULT_LOCKED;
+}
+
+static int copy_user_bh(struct page *to, struct buffer_head *bh,
+			unsigned blkbits, unsigned long vaddr)
+{
+	void *vfrom, *vto;
+	if (dax_get_addr(bh, &vfrom, blkbits) < 0)
+		return -EIO;
+	vto = kmap_atomic(to);
+	copy_user_page(vto, vfrom, vaddr, to);
+	kunmap_atomic(vto);
+	return 0;
+}
+
+static int do_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
+			get_block_t get_block)
+{
+	struct file *file = vma->vm_file;
+	struct inode *inode = file_inode(file);
+	struct address_space *mapping = file->f_mapping;
+	struct page *page;
+	struct buffer_head bh;
+	unsigned long vaddr = (unsigned long)vmf->virtual_address;
+	unsigned blkbits = inode->i_blkbits;
+	sector_t block;
+	pgoff_t size;
+	unsigned long pfn;
+	int error;
+	int major = 0;
+
+	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
+	if (vmf->pgoff >= size)
+		return VM_FAULT_SIGBUS;
+
+	memset(&bh, 0, sizeof(bh));
+	block = (sector_t)vmf->pgoff << (PAGE_SHIFT - blkbits);
+	bh.b_size = PAGE_SIZE;
+
+ repeat:
+	page = find_get_page(mapping, vmf->pgoff);
+	if (page) {
+		if (!lock_page_or_retry(page, vma->vm_mm, vmf->flags)) {
+			page_cache_release(page);
+			return VM_FAULT_RETRY;
+		}
+		if (unlikely(page->mapping != mapping)) {
+			unlock_page(page);
+			page_cache_release(page);
+			goto repeat;
+		}
+	}
+
+	error = get_block(inode, block, &bh, 0);
+	if (!error && (bh.b_size < PAGE_SIZE))
+		error = -EIO;
+	if (error)
+		goto unlock_page;
+
+	if (!buffer_written(&bh) && !vmf->cow_page) {
+		if (vmf->flags & FAULT_FLAG_WRITE) {
+			error = get_block(inode, block, &bh, 1);
+			count_vm_event(PGMAJFAULT);
+			mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
+			major = VM_FAULT_MAJOR;
+			if (!error && (bh.b_size < PAGE_SIZE))
+				error = -EIO;
+			if (error)
+				goto unlock_page;
+		} else {
+			return dax_load_hole(mapping, page, vmf);
+		}
+	}
+
+	if (vmf->cow_page) {
+		struct page *new_page = vmf->cow_page;
+		if (buffer_written(&bh))
+			error = copy_user_bh(new_page, &bh, blkbits, vaddr);
+		else
+			clear_user_highpage(new_page, vaddr);
+		if (error)
+			goto unlock_page;
+		vmf->page = page;
+		if (!page) {
+			mutex_lock(&mapping->i_mmap_mutex);
+			/* Check we didn't race with truncate */
+			size = (i_size_read(inode) + PAGE_SIZE - 1) >>
+								PAGE_SHIFT;
+			if (vmf->pgoff >= size) {
+				error = -EIO;
+				goto out;
+			}
+		}
+		return VM_FAULT_LOCKED;
+	}
+
+	if (buffer_unwritten(&bh) || buffer_new(&bh))
+		dax_clear_blocks(inode, bh.b_blocknr, bh.b_size);
+
+	/* Check we didn't race with a read fault installing a new page */
+	if (!page && major)
+		page = find_lock_page(mapping, vmf->pgoff);
+
+	if (page) {
+		unmap_mapping_range(mapping, vmf->pgoff << PAGE_SHIFT,
+							PAGE_CACHE_SIZE, 0);
+		delete_from_page_cache(page);
+		unlock_page(page);
+		page_cache_release(page);
+	}
+
+	mutex_lock(&mapping->i_mmap_mutex);
+
+	/*
+	 * Check truncate didn't happen while we were allocating a block.
+	 * If it did, this block may or may not be still allocated to the
+	 * file.  We can't tell the filesystem to free it because we can't
+	 * take i_mutex here.  In the worst case, the file still has blocks
+	 * allocated past the end of the file.
+	 */
+	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
+	if (unlikely(vmf->pgoff >= size)) {
+		mutex_unlock(&mapping->i_mmap_mutex);
+		error = -EIO;
+		goto out;
+	}
+
+	error = dax_get_pfn(&bh, &pfn, blkbits);
+	if (error > 0)
+		error = vm_insert_mixed(vma, vaddr, pfn);
+
+	mutex_unlock(&mapping->i_mmap_mutex);
+
+ out:
+	if (error == -ENOMEM)
+		return VM_FAULT_OOM | major;
+	/* -EBUSY is fine, somebody else faulted on the same PTE */
+	if ((error < 0) && (error != -EBUSY))
+		return VM_FAULT_SIGBUS | major;
+	return VM_FAULT_NOPAGE | major;
+
+ unlock_page:
+	if (page) {
+		unlock_page(page);
+		page_cache_release(page);
+	}
+	goto out;
+}
+
+/**
+ * dax_fault - handle a page fault on a DAX file
+ * @vma: The virtual memory area where the fault occurred
+ * @vmf: The description of the fault
+ * @get_block: The filesystem method used to translate file offsets to blocks
+ *
+ * When a page fault occurs, filesystems may call this helper in their
+ * fault handler for DAX files.
+ */
+int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
+			get_block_t get_block)
+{
+	int result;
+	struct super_block *sb = file_inode(vma->vm_file)->i_sb;
+
+	if (vmf->flags & FAULT_FLAG_WRITE) {
+		sb_start_pagefault(sb);
+		file_update_time(vma->vm_file);
+	}
+	result = do_dax_fault(vma, vmf, get_block);
+	if (vmf->flags & FAULT_FLAG_WRITE)
+		sb_end_pagefault(sb);
+
+	return result;
+}
+EXPORT_SYMBOL_GPL(dax_fault);
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index a247123..da8dc64 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -25,6 +25,37 @@
 #include "xattr.h"
 #include "acl.h"
 
+#ifdef CONFIG_EXT2_FS_XIP
+static int ext2_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+	return dax_fault(vma, vmf, ext2_get_block);
+}
+
+static int ext2_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+	return dax_mkwrite(vma, vmf, ext2_get_block);
+}
+
+static const struct vm_operations_struct ext2_dax_vm_ops = {
+	.fault		= ext2_dax_fault,
+	.page_mkwrite	= ext2_dax_mkwrite,
+	.remap_pages	= generic_file_remap_pages,
+};
+
+static int ext2_file_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	if (!IS_DAX(file_inode(file)))
+		return generic_file_mmap(file, vma);
+
+	file_accessed(file);
+	vma->vm_ops = &ext2_dax_vm_ops;
+	vma->vm_flags |= VM_MIXEDMAP;
+	return 0;
+}
+#else
+#define ext2_file_mmap	generic_file_mmap
+#endif
+
 /*
  * Called when filp is released. This happens when all file descriptors
  * for a single struct file are closed. Note that different open() calls
@@ -70,7 +101,7 @@ const struct file_operations ext2_file_operations = {
 #ifdef CONFIG_COMPAT
 	.compat_ioctl	= ext2_compat_ioctl,
 #endif
-	.mmap		= generic_file_mmap,
+	.mmap		= ext2_file_mmap,
 	.open		= dquot_file_open,
 	.release	= ext2_release_file,
 	.fsync		= ext2_fsync,
@@ -89,7 +120,7 @@ const struct file_operations ext2_xip_file_operations = {
 #ifdef CONFIG_COMPAT
 	.compat_ioctl	= ext2_compat_ioctl,
 #endif
-	.mmap		= xip_file_mmap,
+	.mmap		= ext2_file_mmap,
 	.open		= dquot_file_open,
 	.release	= ext2_release_file,
 	.fsync		= ext2_fsync,
diff --git a/include/linux/fs.h b/include/linux/fs.h
index c04d371..338f04b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -49,6 +49,7 @@ struct swap_info_struct;
 struct seq_file;
 struct workqueue_struct;
 struct iov_iter;
+struct vm_fault;
 
 extern void __init inode_init(void);
 extern void __init inode_init_early(void);
@@ -2491,10 +2492,11 @@ extern int nonseekable_open(struct inode * inode, struct file * filp);
 
 #ifdef CONFIG_FS_XIP
 int dax_clear_blocks(struct inode *, sector_t block, long size);
-extern int xip_file_mmap(struct file * file, struct vm_area_struct * vma);
 extern int xip_truncate_page(struct address_space *mapping, loff_t from);
 ssize_t dax_do_io(int rw, struct kiocb *, struct inode *, struct iov_iter *,
 		loff_t, get_block_t, dio_iodone_t, int flags);
+int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t);
+#define dax_mkwrite(vma, vmf, gb)	dax_fault(vma, vmf, gb)
 #else
 static inline int dax_clear_blocks(struct inode *i, sector_t blk, long sz)
 {
diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
index f7c37a1..9dd45f3 100644
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -22,212 +22,6 @@
 #include <asm/io.h>
 
 /*
- * We do use our own empty page to avoid interference with other users
- * of ZERO_PAGE(), such as /dev/zero
- */
-static DEFINE_MUTEX(xip_sparse_mutex);
-static seqcount_t xip_sparse_seq = SEQCNT_ZERO(xip_sparse_seq);
-static struct page *__xip_sparse_page;
-
-/* called under xip_sparse_mutex */
-static struct page *xip_sparse_page(void)
-{
-	if (!__xip_sparse_page) {
-		struct page *page = alloc_page(GFP_HIGHUSER | __GFP_ZERO);
-
-		if (page)
-			__xip_sparse_page = page;
-	}
-	return __xip_sparse_page;
-}
-
-/*
- * __xip_unmap is invoked from xip_unmap and
- * xip_write
- *
- * This function walks all vmas of the address_space and unmaps the
- * __xip_sparse_page when found at pgoff.
- */
-static void
-__xip_unmap (struct address_space * mapping,
-		     unsigned long pgoff)
-{
-	struct vm_area_struct *vma;
-	struct mm_struct *mm;
-	unsigned long address;
-	pte_t *pte;
-	pte_t pteval;
-	spinlock_t *ptl;
-	struct page *page;
-	unsigned count;
-	int locked = 0;
-
-	count = read_seqcount_begin(&xip_sparse_seq);
-
-	page = __xip_sparse_page;
-	if (!page)
-		return;
-
-retry:
-	mutex_lock(&mapping->i_mmap_mutex);
-	vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
-		mm = vma->vm_mm;
-		address = vma->vm_start +
-			((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
-		BUG_ON(address < vma->vm_start || address >= vma->vm_end);
-		pte = page_check_address(page, mm, address, &ptl, 1);
-		if (pte) {
-			/* Nuke the page table entry. */
-			flush_cache_page(vma, address, pte_pfn(*pte));
-			pteval = ptep_clear_flush(vma, address, pte);
-			page_remove_rmap(page);
-			dec_mm_counter(mm, MM_FILEPAGES);
-			BUG_ON(pte_dirty(pteval));
-			pte_unmap_unlock(pte, ptl);
-			/* must invalidate_page _before_ freeing the page */
-			mmu_notifier_invalidate_page(mm, address);
-			page_cache_release(page);
-		}
-	}
-	mutex_unlock(&mapping->i_mmap_mutex);
-
-	if (locked) {
-		mutex_unlock(&xip_sparse_mutex);
-	} else if (read_seqcount_retry(&xip_sparse_seq, count)) {
-		mutex_lock(&xip_sparse_mutex);
-		locked = 1;
-		goto retry;
-	}
-}
-
-/*
- * xip_fault() is invoked via the vma operations vector for a
- * mapped memory region to read in file data during a page fault.
- *
- * This function is derived from filemap_fault, but used for execute in place
- */
-static int xip_file_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
-{
-	struct file *file = vma->vm_file;
-	struct address_space *mapping = file->f_mapping;
-	struct inode *inode = mapping->host;
-	pgoff_t size;
-	void *xip_mem;
-	unsigned long xip_pfn;
-	struct page *page;
-	int error;
-
-	/* XXX: are VM_FAULT_ codes OK? */
-again:
-	size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
-	if (vmf->pgoff >= size)
-		return VM_FAULT_SIGBUS;
-
-	error = mapping->a_ops->get_xip_mem(mapping, vmf->pgoff, 0,
-						&xip_mem, &xip_pfn);
-	if (likely(!error))
-		goto found;
-	if (error != -ENODATA)
-		return VM_FAULT_OOM;
-
-	/* sparse block */
-	if ((vma->vm_flags & (VM_WRITE | VM_MAYWRITE)) &&
-	    (vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) &&
-	    (!(mapping->host->i_sb->s_flags & MS_RDONLY))) {
-		int err;
-
-		/* maybe shared writable, allocate new block */
-		mutex_lock(&xip_sparse_mutex);
-		error = mapping->a_ops->get_xip_mem(mapping, vmf->pgoff, 1,
-							&xip_mem, &xip_pfn);
-		mutex_unlock(&xip_sparse_mutex);
-		if (error)
-			return VM_FAULT_SIGBUS;
-		/* unmap sparse mappings at pgoff from all other vmas */
-		__xip_unmap(mapping, vmf->pgoff);
-
-found:
-		/* We must recheck i_size under i_mmap_mutex */
-		mutex_lock(&mapping->i_mmap_mutex);
-		size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >>
-							PAGE_CACHE_SHIFT;
-		if (unlikely(vmf->pgoff >= size)) {
-			mutex_unlock(&mapping->i_mmap_mutex);
-			return VM_FAULT_SIGBUS;
-		}
-		err = vm_insert_mixed(vma, (unsigned long)vmf->virtual_address,
-							xip_pfn);
-		mutex_unlock(&mapping->i_mmap_mutex);
-		if (err == -ENOMEM)
-			return VM_FAULT_OOM;
-		/*
-		 * err == -EBUSY is fine, we've raced against another thread
-		 * that faulted-in the same page
-		 */
-		if (err != -EBUSY)
-			BUG_ON(err);
-		return VM_FAULT_NOPAGE;
-	} else {
-		int err, ret = VM_FAULT_OOM;
-
-		mutex_lock(&xip_sparse_mutex);
-		write_seqcount_begin(&xip_sparse_seq);
-		error = mapping->a_ops->get_xip_mem(mapping, vmf->pgoff, 0,
-							&xip_mem, &xip_pfn);
-		if (unlikely(!error)) {
-			write_seqcount_end(&xip_sparse_seq);
-			mutex_unlock(&xip_sparse_mutex);
-			goto again;
-		}
-		if (error != -ENODATA)
-			goto out;
-
-		/* We must recheck i_size under i_mmap_mutex */
-		mutex_lock(&mapping->i_mmap_mutex);
-		size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >>
-							PAGE_CACHE_SHIFT;
-		if (unlikely(vmf->pgoff >= size)) {
-			ret = VM_FAULT_SIGBUS;
-			goto unlock;
-		}
-		/* not shared and writable, use xip_sparse_page() */
-		page = xip_sparse_page();
-		if (!page)
-			goto unlock;
-		err = vm_insert_page(vma, (unsigned long)vmf->virtual_address,
-							page);
-		if (err == -ENOMEM)
-			goto unlock;
-
-		ret = VM_FAULT_NOPAGE;
-unlock:
-		mutex_unlock(&mapping->i_mmap_mutex);
-out:
-		write_seqcount_end(&xip_sparse_seq);
-		mutex_unlock(&xip_sparse_mutex);
-
-		return ret;
-	}
-}
-
-static const struct vm_operations_struct xip_file_vm_ops = {
-	.fault	= xip_file_fault,
-	.page_mkwrite	= filemap_page_mkwrite,
-	.remap_pages = generic_file_remap_pages,
-};
-
-int xip_file_mmap(struct file * file, struct vm_area_struct * vma)
-{
-	BUG_ON(!file->f_mapping->a_ops->get_xip_mem);
-
-	file_accessed(file);
-	vma->vm_ops = &xip_file_vm_ops;
-	vma->vm_flags |= VM_MIXEDMAP;
-	return 0;
-}
-EXPORT_SYMBOL_GPL(xip_file_mmap);
-
-/*
  * truncate a page used for execute in place
  * functionality is analog to block_truncate_page but does use get_xip_mem
  * to get the page instead of page cache
-- 
2.0.0


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v10 10/21] Replace xip_truncate_page with dax_truncate_page
  2014-08-27  3:45 [PATCH v10 00/21] Support ext4 on NV-DIMMs Matthew Wilcox
                   ` (8 preceding siblings ...)
  2014-08-27  3:45 ` [PATCH v10 09/21] Replace the XIP page fault handler with the DAX page fault handler Matthew Wilcox
@ 2014-08-27  3:45 ` Matthew Wilcox
  2014-08-27  3:45 ` [PATCH v10 11/21] Replace XIP documentation with DAX documentation Matthew Wilcox
                   ` (13 subsequent siblings)
  23 siblings, 0 replies; 52+ messages in thread
From: Matthew Wilcox @ 2014-08-27  3:45 UTC (permalink / raw)
  To: linux-fsdevel, linux-mm, linux-kernel; +Cc: Matthew Wilcox, willy

It takes a get_block parameter, just like nobh_truncate_page() and
block_truncate_page().
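
A sketch of the call site in a filesystem's setsize path (illustrative
only; the ext2 hunk below is the real conversion, and myfs_get_block is
the filesystem's existing get_block_t):

	if (IS_DAX(inode))
		error = dax_truncate_page(inode, newsize, myfs_get_block);
	else
		error = block_truncate_page(inode->i_mapping, newsize,
						myfs_get_block);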

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
---
 fs/dax.c           | 44 ++++++++++++++++++++++++++++++++++++++++++++
 fs/ext2/inode.c    |  2 +-
 include/linux/fs.h |  4 ++--
 mm/filemap_xip.c   | 40 ----------------------------------------
 4 files changed, 47 insertions(+), 43 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index f134078..d54f7d3 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -443,3 +443,47 @@ int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 	return result;
 }
 EXPORT_SYMBOL_GPL(dax_fault);
+
+/**
+ * dax_truncate_page - handle a partial page being truncated in a DAX file
+ * @inode: The file being truncated
+ * @from: The file offset that is being truncated to
+ * @get_block: The filesystem method used to translate file offsets to blocks
+ *
+ * Similar to block_truncate_page(), this function can be called by a
+ * filesystem when it is truncating a DAX file to handle the partial page.
+ *
+ * We work in terms of PAGE_CACHE_SIZE here for commonality with
+ * block_truncate_page(), but we could go down to PAGE_SIZE if the filesystem
+ * took care of disposing of the unnecessary blocks.  Even if the filesystem
+ * block size is smaller than PAGE_SIZE, we have to zero the rest of the page
+ * since the file might be mmaped.
+ */
+int dax_truncate_page(struct inode *inode, loff_t from, get_block_t get_block)
+{
+	struct buffer_head bh;
+	pgoff_t index = from >> PAGE_CACHE_SHIFT;
+	unsigned offset = from & (PAGE_CACHE_SIZE-1);
+	unsigned length = PAGE_CACHE_ALIGN(from) - from;
+	int err;
+
+	/* Block boundary? Nothing to do */
+	if (!length)
+		return 0;
+
+	memset(&bh, 0, sizeof(bh));
+	bh.b_size = PAGE_CACHE_SIZE;
+	err = get_block(inode, index, &bh, 0);
+	if (err < 0)
+		return err;
+	if (buffer_written(&bh)) {
+		void *addr;
+		err = dax_get_addr(&bh, &addr, inode->i_blkbits);
+		if (err < 0)
+			return err;
+		memset(addr + offset, 0, length);
+	}
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(dax_truncate_page);
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 52978b8..5ac0a34 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -1210,7 +1210,7 @@ static int ext2_setsize(struct inode *inode, loff_t newsize)
 	inode_dio_wait(inode);
 
 	if (IS_DAX(inode))
-		error = xip_truncate_page(inode->i_mapping, newsize);
+		error = dax_truncate_page(inode, newsize, ext2_get_block);
 	else if (test_opt(inode->i_sb, NOBH))
 		error = nobh_truncate_page(inode->i_mapping,
 				newsize, ext2_get_block);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 338f04b..eee848d 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2492,7 +2492,7 @@ extern int nonseekable_open(struct inode * inode, struct file * filp);
 
 #ifdef CONFIG_FS_XIP
 int dax_clear_blocks(struct inode *, sector_t block, long size);
-extern int xip_truncate_page(struct address_space *mapping, loff_t from);
+int dax_truncate_page(struct inode *, loff_t from, get_block_t);
 ssize_t dax_do_io(int rw, struct kiocb *, struct inode *, struct iov_iter *,
 		loff_t, get_block_t, dio_iodone_t, int flags);
 int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t);
@@ -2503,7 +2503,7 @@ static inline int dax_clear_blocks(struct inode *i, sector_t blk, long sz)
 	return 0;
 }
 
-static inline int xip_truncate_page(struct address_space *mapping, loff_t from)
+static inline int dax_truncate_page(struct inode *i, loff_t frm, get_block_t gb)
 {
 	return 0;
 }
diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
index 9dd45f3..6316578 100644
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -21,43 +21,3 @@
 #include <asm/tlbflush.h>
 #include <asm/io.h>
 
-/*
- * truncate a page used for execute in place
- * functionality is analog to block_truncate_page but does use get_xip_mem
- * to get the page instead of page cache
- */
-int
-xip_truncate_page(struct address_space *mapping, loff_t from)
-{
-	pgoff_t index = from >> PAGE_CACHE_SHIFT;
-	unsigned offset = from & (PAGE_CACHE_SIZE-1);
-	unsigned blocksize;
-	unsigned length;
-	void *xip_mem;
-	unsigned long xip_pfn;
-	int err;
-
-	BUG_ON(!mapping->a_ops->get_xip_mem);
-
-	blocksize = 1 << mapping->host->i_blkbits;
-	length = offset & (blocksize - 1);
-
-	/* Block boundary? Nothing to do */
-	if (!length)
-		return 0;
-
-	length = blocksize - length;
-
-	err = mapping->a_ops->get_xip_mem(mapping, index, 0,
-						&xip_mem, &xip_pfn);
-	if (unlikely(err)) {
-		if (err == -ENODATA)
-			/* Hole? No need to truncate */
-			return 0;
-		else
-			return err;
-	}
-	memset(xip_mem + offset, 0, length);
-	return 0;
-}
-EXPORT_SYMBOL_GPL(xip_truncate_page);
-- 
2.0.0


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v10 11/21] Replace XIP documentation with DAX documentation
  2014-08-27  3:45 [PATCH v10 00/21] Support ext4 on NV-DIMMs Matthew Wilcox
                   ` (9 preceding siblings ...)
  2014-08-27  3:45 ` [PATCH v10 10/21] Replace xip_truncate_page with dax_truncate_page Matthew Wilcox
@ 2014-08-27  3:45 ` Matthew Wilcox
  2014-08-27  3:45 ` [PATCH v10 12/21] Remove get_xip_mem Matthew Wilcox
                   ` (12 subsequent siblings)
  23 siblings, 0 replies; 52+ messages in thread
From: Matthew Wilcox @ 2014-08-27  3:45 UTC (permalink / raw)
  To: linux-fsdevel, linux-mm, linux-kernel; +Cc: Matthew Wilcox

From: Matthew Wilcox <willy@linux.intel.com>

Based on the original XIP documentation, this documents the current
state of affairs, and includes instructions on how users can enable DAX
if their devices and kernel support it.

Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
---
 Documentation/filesystems/dax.txt | 89 +++++++++++++++++++++++++++++++++++++++
 Documentation/filesystems/xip.txt | 71 -------------------------------
 2 files changed, 89 insertions(+), 71 deletions(-)
 create mode 100644 Documentation/filesystems/dax.txt
 delete mode 100644 Documentation/filesystems/xip.txt

diff --git a/Documentation/filesystems/dax.txt b/Documentation/filesystems/dax.txt
new file mode 100644
index 0000000..635adaa
--- /dev/null
+++ b/Documentation/filesystems/dax.txt
@@ -0,0 +1,89 @@
+Direct Access for files
+-----------------------
+
+Motivation
+----------
+
+The page cache is usually used to buffer reads and writes to files.
+It is also used to provide the pages which are mapped into userspace
+by a call to mmap.
+
+For block devices that are memory-like, the page cache pages would be
+unnecessary copies of the original storage.  The DAX code removes the
+extra copy by performing reads and writes directly to the storage device.
+For file mappings, the storage device is mapped directly into userspace.
+
+
+Usage
+-----
+
+If you have a block device which supports DAX, you can make a filesystem
+on it as usual.  When mounting it, use the -o dax option manually
+or add 'dax' to the options in /etc/fstab.
+
+
+Implementation Tips for Block Driver Writers
+--------------------------------------------
+
+To support DAX in your block driver, implement the 'direct_access'
+block device operation.  It is used to translate the sector number
+(expressed in units of 512-byte sectors) to a page frame number (pfn)
+that identifies the physical page for the memory.  It also returns a
+kernel virtual address that can be used to access the memory.
+
+The direct_access method takes a 'size' parameter that indicates the
+number of bytes being requested.  The function should return the number
+of bytes that can be contiguously accessed at that offset.  It may also
+return a negative errno if an error occurs.
+
+In order to support this method, the storage must be byte-accessible by
+the CPU at all times.  If your device uses paging techniques to expose
+a large amount of memory through a smaller window, then you cannot
+implement direct_access.  Equally, if your device can occasionally
+stall the CPU for an extended period, you should also not attempt to
+implement direct_access.
+
+These block devices may be used for inspiration:
+- axonram: Axon DDR2 device driver
+- brd: RAM backed block device driver
+- dcssblk: s390 dcss block device driver
+
+
+Implementation Tips for Filesystem Writers
+------------------------------------------
+
+Filesystem support consists of
+- adding support to mark inodes as being DAX by setting the S_DAX flag in
+  i_flags
+- implementing the direct_IO address space operation, and calling
+  dax_do_io() instead of blockdev_direct_IO() if S_DAX is set
+- implementing an mmap file operation for DAX files which sets the
+  VM_MIXEDMAP flag on the VMA, and setting the vm_ops to include handlers
+  for fault and page_mkwrite (which should probably call dax_fault() and
+  dax_mkwrite(), passing the appropriate get_block() callback)
+- calling dax_truncate_page() instead of block_truncate_page() for DAX files
+- ensuring that there is sufficient locking between reads, writes,
+  truncates and page faults
+
+The get_block() callback passed to the DAX functions may return
+uninitialised extents.  If it does, it must ensure that simultaneous
+calls to get_block() (for example by a page-fault racing with a read()
+or a write()) work correctly.
+
+These filesystems may be used for inspiration:
+- ext2: the second extended filesystem, see Documentation/filesystems/ext2.txt
+
+
+Shortcomings
+------------
+
+Even if the kernel or its modules are stored on a filesystem that supports
+DAX on a block device that supports DAX, they will still be copied into RAM.
+
+Calling get_user_pages() on a range of user memory that has been mmaped
+from a DAX file will fail as there are no 'struct page' to describe
+those pages.  This problem is being worked on.  That means that O_DIRECT
+reads/writes to those memory ranges from a non-DAX file will fail (note
+that O_DIRECT reads/writes _of a DAX file_ do work, it is the memory
+that is being accessed that is key here).  Other things that will not
+work include RDMA, sendfile() and splice().
diff --git a/Documentation/filesystems/xip.txt b/Documentation/filesystems/xip.txt
deleted file mode 100644
index b774729..0000000
--- a/Documentation/filesystems/xip.txt
+++ /dev/null
@@ -1,71 +0,0 @@
-Execute-in-place for file mappings
-----------------------------------
-
-Motivation
-----------
-File mappings are performed by mapping page cache pages to userspace. In
-addition, read&write type file operations also transfer data from/to the page
-cache.
-
-For memory backed storage devices that use the block device interface, the page
-cache pages are in fact copies of the original storage. Various approaches
-exist to work around the need for an extra copy. The ramdisk driver for example
-does read the data into the page cache, keeps a reference, and discards the
-original data behind later on.
-
-Execute-in-place solves this issue the other way around: instead of keeping
-data in the page cache, the need to have a page cache copy is eliminated
-completely. With execute-in-place, read&write type operations are performed
-directly from/to the memory backed storage device. For file mappings, the
-storage device itself is mapped directly into userspace.
-
-This implementation was initially written for shared memory segments between
-different virtual machines on s390 hardware to allow multiple machines to
-share the same binaries and libraries.
-
-Implementation
---------------
-Execute-in-place is implemented in three steps: block device operation,
-address space operation, and file operations.
-
-A block device operation named direct_access is used to translate the
-block device sector number to a page frame number (pfn) that identifies
-the physical page for the memory.  It also returns a kernel virtual
-address that can be used to access the memory.
-
-The direct_access method takes a 'size' parameter that indicates the
-number of bytes being requested.  The function should return the number
-of bytes that can be contiguously accessed at that offset.  It may also
-return a negative errno if an error occurs.
-
-The block device operation is optional, these block devices support it as of
-today:
-- dcssblk: s390 dcss block device driver
-
-An address space operation named get_xip_mem is used to retrieve references
-to a page frame number and a kernel address. To obtain these values a reference
-to an address_space is provided. This function assigns values to the kmem and
-pfn parameters. The third argument indicates whether the function should allocate
-blocks if needed.
-
-This address space operation is mutually exclusive with readpage&writepage that
-do page cache read/write operations.
-The following filesystems support it as of today:
-- ext2: the second extended filesystem, see Documentation/filesystems/ext2.txt
-
-A set of file operations that do utilize get_xip_page can be found in
-mm/filemap_xip.c . The following file operation implementations are provided:
-- aio_read/aio_write
-- readv/writev
-- sendfile
-
-The generic file operations do_sync_read/do_sync_write can be used to implement
-classic synchronous IO calls.
-
-Shortcomings
-------------
-This implementation is limited to storage devices that are cpu addressable at
-all times (no highmem or such). It works well on rom/ram, but enhancements are
-needed to make it work with flash in read+write mode.
-Putting the Linux kernel and/or its modules on a xip filesystem does not mean
-they are not copied.
-- 
2.0.0


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v10 12/21] Remove get_xip_mem
  2014-08-27  3:45 [PATCH v10 00/21] Support ext4 on NV-DIMMs Matthew Wilcox
                   ` (10 preceding siblings ...)
  2014-08-27  3:45 ` [PATCH v10 11/21] Replace XIP documentation with DAX documentation Matthew Wilcox
@ 2014-08-27  3:45 ` Matthew Wilcox
  2014-08-27  3:45 ` [PATCH v10 13/21] ext2: Remove ext2_xip_verify_sb() Matthew Wilcox
                   ` (11 subsequent siblings)
  23 siblings, 0 replies; 52+ messages in thread
From: Matthew Wilcox @ 2014-08-27  3:45 UTC (permalink / raw)
  To: linux-fsdevel, linux-mm, linux-kernel; +Cc: Matthew Wilcox, willy

All callers of get_xip_mem() are now gone.  Remove checks for it,
initialisers of it, documentation of it and the only implementation of it.
Also remove mm/filemap_xip.c as it is now empty.
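
Callers that used to probe the address_space operations now test the
per-inode flag instead; a side-by-side sketch (the fadvise and madvise
hunks below are the real changes):

	/* old: does this mapping provide XIP memory? */
	if (file->f_mapping->a_ops->get_xip_mem)
		return 0;	/* no page cache, so the advice is a no-op */

	/* new: was the inode marked S_DAX by the filesystem? */
	if (IS_DAX(file_inode(file)))
		return 0;	/* no page cache, so the advice is a no-op */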

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
---
 Documentation/filesystems/Locking |  3 ---
 fs/exofs/inode.c                  |  1 -
 fs/ext2/inode.c                   |  1 -
 fs/ext2/xip.c                     | 45 ---------------------------------------
 fs/ext2/xip.h                     |  3 ---
 fs/open.c                         |  5 +----
 include/linux/fs.h                |  2 --
 mm/Makefile                       |  1 -
 mm/fadvise.c                      |  6 ++++--
 mm/filemap_xip.c                  | 23 --------------------
 mm/madvise.c                      |  2 +-
 11 files changed, 6 insertions(+), 86 deletions(-)
 delete mode 100644 mm/filemap_xip.c

diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
index f1997e9..226ccc3 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -197,8 +197,6 @@ prototypes:
 	int (*releasepage) (struct page *, int);
 	void (*freepage)(struct page *);
 	int (*direct_IO)(int, struct kiocb *, struct iov_iter *iter, loff_t offset);
-	int (*get_xip_mem)(struct address_space *, pgoff_t, int, void **,
-				unsigned long *);
 	int (*migratepage)(struct address_space *, struct page *, struct page *);
 	int (*launder_page)(struct page *);
 	int (*is_partially_uptodate)(struct page *, unsigned long, unsigned long);
@@ -223,7 +221,6 @@ invalidatepage:		yes
 releasepage:		yes
 freepage:		yes
 direct_IO:
-get_xip_mem:					maybe
 migratepage:		yes (both)
 launder_page:		yes
 is_partially_uptodate:	yes
diff --git a/fs/exofs/inode.c b/fs/exofs/inode.c
index 3f9cafd..c408a53 100644
--- a/fs/exofs/inode.c
+++ b/fs/exofs/inode.c
@@ -985,7 +985,6 @@ const struct address_space_operations exofs_aops = {
 	.direct_IO	= exofs_direct_IO,
 
 	/* With these NULL has special meaning or default is not exported */
-	.get_xip_mem	= NULL,
 	.migratepage	= NULL,
 	.launder_page	= NULL,
 	.is_partially_uptodate = NULL,
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 5ac0a34..59d6c7d 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -894,7 +894,6 @@ const struct address_space_operations ext2_aops = {
 
 const struct address_space_operations ext2_aops_xip = {
 	.bmap			= ext2_bmap,
-	.get_xip_mem		= ext2_get_xip_mem,
 	.direct_IO		= ext2_direct_IO,
 };
 
diff --git a/fs/ext2/xip.c b/fs/ext2/xip.c
index 8cfca3a..132d4da 100644
--- a/fs/ext2/xip.c
+++ b/fs/ext2/xip.c
@@ -13,35 +13,6 @@
 #include "ext2.h"
 #include "xip.h"
 
-static inline long __inode_direct_access(struct inode *inode, sector_t block,
-				void **kaddr, unsigned long *pfn, long size)
-{
-	struct block_device *bdev = inode->i_sb->s_bdev;
-	sector_t sector = block * (PAGE_SIZE / 512);
-	return bdev_direct_access(bdev, sector, kaddr, pfn, size);
-}
-
-static inline int
-__ext2_get_block(struct inode *inode, pgoff_t pgoff, int create,
-		   sector_t *result)
-{
-	struct buffer_head tmp;
-	int rc;
-
-	memset(&tmp, 0, sizeof(struct buffer_head));
-	tmp.b_size = 1 << inode->i_blkbits;
-	rc = ext2_get_block(inode, pgoff, &tmp, create);
-	*result = tmp.b_blocknr;
-
-	/* did we get a sparse block (hole in the file)? */
-	if (!tmp.b_blocknr && !rc) {
-		BUG_ON(create);
-		rc = -ENODATA;
-	}
-
-	return rc;
-}
-
 void ext2_xip_verify_sb(struct super_block *sb)
 {
 	struct ext2_sb_info *sbi = EXT2_SB(sb);
@@ -54,19 +25,3 @@ void ext2_xip_verify_sb(struct super_block *sb)
 			     "not supported by bdev");
 	}
 }
-
-int ext2_get_xip_mem(struct address_space *mapping, pgoff_t pgoff, int create,
-				void **kmem, unsigned long *pfn)
-{
-	long rc;
-	sector_t block;
-
-	/* first, retrieve the sector number */
-	rc = __ext2_get_block(mapping->host, pgoff, create, &block);
-	if (rc)
-		return rc;
-
-	/* retrieve address of the target data */
-	rc = __inode_direct_access(mapping->host, block, kmem, pfn, PAGE_SIZE);
-	return (rc < 0) ? rc : 0;
-}
diff --git a/fs/ext2/xip.h b/fs/ext2/xip.h
index b2592f2..e7b9f0a 100644
--- a/fs/ext2/xip.h
+++ b/fs/ext2/xip.h
@@ -12,10 +12,7 @@ static inline int ext2_use_xip (struct super_block *sb)
 	struct ext2_sb_info *sbi = EXT2_SB(sb);
 	return (sbi->s_mount_opt & EXT2_MOUNT_XIP);
 }
-int ext2_get_xip_mem(struct address_space *, pgoff_t, int,
-				void **, unsigned long *);
 #else
 #define ext2_xip_verify_sb(sb)			do { } while (0)
 #define ext2_use_xip(sb)			0
-#define ext2_get_xip_mem			NULL
 #endif
diff --git a/fs/open.c b/fs/open.c
index d6fd3ac..ca68e47 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -655,11 +655,8 @@ int open_check_o_direct(struct file *f)
 {
 	/* NB: we're sure to have correct a_ops only after f_op->open */
 	if (f->f_flags & O_DIRECT) {
-		if (!f->f_mapping->a_ops ||
-		    ((!f->f_mapping->a_ops->direct_IO) &&
-		    (!f->f_mapping->a_ops->get_xip_mem))) {
+		if (!f->f_mapping->a_ops || !f->f_mapping->a_ops->direct_IO)
 			return -EINVAL;
-		}
 	}
 	return 0;
 }
diff --git a/include/linux/fs.h b/include/linux/fs.h
index eee848d..d73db11 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -349,8 +349,6 @@ struct address_space_operations {
 	int (*releasepage) (struct page *, gfp_t);
 	void (*freepage)(struct page *);
 	ssize_t (*direct_IO)(int, struct kiocb *, struct iov_iter *iter, loff_t offset);
-	int (*get_xip_mem)(struct address_space *, pgoff_t, int,
-						void **, unsigned long *);
 	/*
 	 * migrate the contents of a page to the specified target. If
 	 * migrate_mode is MIGRATE_ASYNC, it must not block.
diff --git a/mm/Makefile b/mm/Makefile
index 632ae77..b2c7623 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -47,7 +47,6 @@ obj-$(CONFIG_SLUB) += slub.o
 obj-$(CONFIG_KMEMCHECK) += kmemcheck.o
 obj-$(CONFIG_FAILSLAB) += failslab.o
 obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o
-obj-$(CONFIG_FS_XIP) += filemap_xip.o
 obj-$(CONFIG_MIGRATION) += migrate.o
 obj-$(CONFIG_QUICKLIST) += quicklist.o
 obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o
diff --git a/mm/fadvise.c b/mm/fadvise.c
index 3bcfd81..1f1925f 100644
--- a/mm/fadvise.c
+++ b/mm/fadvise.c
@@ -28,6 +28,7 @@
 SYSCALL_DEFINE4(fadvise64_64, int, fd, loff_t, offset, loff_t, len, int, advice)
 {
 	struct fd f = fdget(fd);
+	struct inode *inode;
 	struct address_space *mapping;
 	struct backing_dev_info *bdi;
 	loff_t endbyte;			/* inclusive */
@@ -39,7 +40,8 @@ SYSCALL_DEFINE4(fadvise64_64, int, fd, loff_t, offset, loff_t, len, int, advice)
 	if (!f.file)
 		return -EBADF;
 
-	if (S_ISFIFO(file_inode(f.file)->i_mode)) {
+	inode = file_inode(f.file);
+	if (S_ISFIFO(inode->i_mode)) {
 		ret = -ESPIPE;
 		goto out;
 	}
@@ -50,7 +52,7 @@ SYSCALL_DEFINE4(fadvise64_64, int, fd, loff_t, offset, loff_t, len, int, advice)
 		goto out;
 	}
 
-	if (mapping->a_ops->get_xip_mem) {
+	if (IS_DAX(inode)) {
 		switch (advice) {
 		case POSIX_FADV_NORMAL:
 		case POSIX_FADV_RANDOM:
diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
deleted file mode 100644
index 6316578..0000000
--- a/mm/filemap_xip.c
+++ /dev/null
@@ -1,23 +0,0 @@
-/*
- *	linux/mm/filemap_xip.c
- *
- * Copyright (C) 2005 IBM Corporation
- * Author: Carsten Otte <cotte@de.ibm.com>
- *
- * derived from linux/mm/filemap.c - Copyright (C) Linus Torvalds
- *
- */
-
-#include <linux/fs.h>
-#include <linux/pagemap.h>
-#include <linux/export.h>
-#include <linux/uio.h>
-#include <linux/rmap.h>
-#include <linux/mmu_notifier.h>
-#include <linux/sched.h>
-#include <linux/seqlock.h>
-#include <linux/mutex.h>
-#include <linux/gfp.h>
-#include <asm/tlbflush.h>
-#include <asm/io.h>
-
diff --git a/mm/madvise.c b/mm/madvise.c
index 0938b30..1611ebf 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -236,7 +236,7 @@ static long madvise_willneed(struct vm_area_struct *vma,
 	if (!file)
 		return -EBADF;
 
-	if (file->f_mapping->a_ops->get_xip_mem) {
+	if (IS_DAX(file_inode(file))) {
 		/* no bad return value, but ignore advice */
 		return 0;
 	}
-- 
2.0.0


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v10 13/21] ext2: Remove ext2_xip_verify_sb()
  2014-08-27  3:45 [PATCH v10 00/21] Support ext4 on NV-DIMMs Matthew Wilcox
                   ` (11 preceding siblings ...)
  2014-08-27  3:45 ` [PATCH v10 12/21] Remove get_xip_mem Matthew Wilcox
@ 2014-08-27  3:45 ` Matthew Wilcox
  2014-08-27  3:45 ` [PATCH v10 14/21] ext2: Remove ext2_use_xip Matthew Wilcox
                   ` (10 subsequent siblings)
  23 siblings, 0 replies; 52+ messages in thread
From: Matthew Wilcox @ 2014-08-27  3:45 UTC (permalink / raw)
  To: linux-fsdevel, linux-mm, linux-kernel; +Cc: Matthew Wilcox, willy

Jan Kara pointed out that calling ext2_xip_verify_sb() in ext2_remount()
doesn't make sense, since changing the XIP option on remount isn't
allowed.  It also doesn't make sense to re-check whether the blocksize is
supported, since it cannot change between mounts.

Replace the call to ext2_xip_verify_sb() in ext2_fill_super() with the
equivalent check and delete the definition.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
---
 fs/ext2/super.c | 33 ++++++++++++---------------------
 fs/ext2/xip.c   | 12 ------------
 fs/ext2/xip.h   |  2 --
 3 files changed, 12 insertions(+), 35 deletions(-)

diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index b88edc0..d862031 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -868,9 +868,6 @@ static int ext2_fill_super(struct super_block *sb, void *data, int silent)
 		((EXT2_SB(sb)->s_mount_opt & EXT2_MOUNT_POSIX_ACL) ?
 		 MS_POSIXACL : 0);
 
-	ext2_xip_verify_sb(sb); /* see if bdev supports xip, unset
-				    EXT2_MOUNT_XIP if not */
-
 	if (le32_to_cpu(es->s_rev_level) == EXT2_GOOD_OLD_REV &&
 	    (EXT2_HAS_COMPAT_FEATURE(sb, ~0U) ||
 	     EXT2_HAS_RO_COMPAT_FEATURE(sb, ~0U) ||
@@ -900,11 +897,17 @@ static int ext2_fill_super(struct super_block *sb, void *data, int silent)
 
 	blocksize = BLOCK_SIZE << le32_to_cpu(sbi->s_es->s_log_block_size);
 
-	if (ext2_use_xip(sb) && blocksize != PAGE_SIZE) {
-		if (!silent)
+	if (sbi->s_mount_opt & EXT2_MOUNT_XIP) {
+		if (blocksize != PAGE_SIZE) {
 			ext2_msg(sb, KERN_ERR,
-				"error: unsupported blocksize for xip");
-		goto failed_mount;
+					"error: unsupported blocksize for xip");
+			goto failed_mount;
+		}
+		if (!sb->s_bdev->bd_disk->fops->direct_access) {
+			ext2_msg(sb, KERN_ERR,
+					"error: device does not support xip");
+			goto failed_mount;
+		}
 	}
 
 	/* If the blocksize doesn't match, re-read the thing.. */
@@ -1249,7 +1252,6 @@ static int ext2_remount (struct super_block * sb, int * flags, char * data)
 {
 	struct ext2_sb_info * sbi = EXT2_SB(sb);
 	struct ext2_super_block * es;
-	unsigned long old_mount_opt = sbi->s_mount_opt;
 	struct ext2_mount_options old_opts;
 	unsigned long old_sb_flags;
 	int err;
@@ -1274,22 +1276,11 @@ static int ext2_remount (struct super_block * sb, int * flags, char * data)
 	sb->s_flags = (sb->s_flags & ~MS_POSIXACL) |
 		((sbi->s_mount_opt & EXT2_MOUNT_POSIX_ACL) ? MS_POSIXACL : 0);
 
-	ext2_xip_verify_sb(sb); /* see if bdev supports xip, unset
-				    EXT2_MOUNT_XIP if not */
-
-	if ((ext2_use_xip(sb)) && (sb->s_blocksize != PAGE_SIZE)) {
-		ext2_msg(sb, KERN_WARNING,
-			"warning: unsupported blocksize for xip");
-		err = -EINVAL;
-		goto restore_opts;
-	}
-
 	es = sbi->s_es;
-	if ((sbi->s_mount_opt ^ old_mount_opt) & EXT2_MOUNT_XIP) {
+	if ((sbi->s_mount_opt ^ old_opts.s_mount_opt) & EXT2_MOUNT_XIP) {
 		ext2_msg(sb, KERN_WARNING, "warning: refusing change of "
 			 "xip flag with busy inodes while remounting");
-		sbi->s_mount_opt &= ~EXT2_MOUNT_XIP;
-		sbi->s_mount_opt |= old_mount_opt & EXT2_MOUNT_XIP;
+		sbi->s_mount_opt ^= EXT2_MOUNT_XIP;
 	}
 	if ((*flags & MS_RDONLY) == (sb->s_flags & MS_RDONLY)) {
 		spin_unlock(&sbi->s_lock);
diff --git a/fs/ext2/xip.c b/fs/ext2/xip.c
index 132d4da..66ca113 100644
--- a/fs/ext2/xip.c
+++ b/fs/ext2/xip.c
@@ -13,15 +13,3 @@
 #include "ext2.h"
 #include "xip.h"
 
-void ext2_xip_verify_sb(struct super_block *sb)
-{
-	struct ext2_sb_info *sbi = EXT2_SB(sb);
-
-	if ((sbi->s_mount_opt & EXT2_MOUNT_XIP) &&
-	    !sb->s_bdev->bd_disk->fops->direct_access) {
-		sbi->s_mount_opt &= (~EXT2_MOUNT_XIP);
-		ext2_msg(sb, KERN_WARNING,
-			     "warning: ignoring xip option - "
-			     "not supported by bdev");
-	}
-}
diff --git a/fs/ext2/xip.h b/fs/ext2/xip.h
index e7b9f0a..87eeb04 100644
--- a/fs/ext2/xip.h
+++ b/fs/ext2/xip.h
@@ -6,13 +6,11 @@
  */
 
 #ifdef CONFIG_EXT2_FS_XIP
-extern void ext2_xip_verify_sb (struct super_block *);
 static inline int ext2_use_xip (struct super_block *sb)
 {
 	struct ext2_sb_info *sbi = EXT2_SB(sb);
 	return (sbi->s_mount_opt & EXT2_MOUNT_XIP);
 }
 #else
-#define ext2_xip_verify_sb(sb)			do { } while (0)
 #define ext2_use_xip(sb)			0
 #endif
-- 
2.0.0


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v10 14/21] ext2: Remove ext2_use_xip
  2014-08-27  3:45 [PATCH v10 00/21] Support ext4 on NV-DIMMs Matthew Wilcox
                   ` (12 preceding siblings ...)
  2014-08-27  3:45 ` [PATCH v10 13/21] ext2: Remove ext2_xip_verify_sb() Matthew Wilcox
@ 2014-08-27  3:45 ` Matthew Wilcox
  2014-08-27  3:45 ` [PATCH v10 15/21] ext2: Remove xip.c and xip.h Matthew Wilcox
                   ` (9 subsequent siblings)
  23 siblings, 0 replies; 52+ messages in thread
From: Matthew Wilcox @ 2014-08-27  3:45 UTC (permalink / raw)
  To: linux-fsdevel, linux-mm, linux-kernel; +Cc: Matthew Wilcox, willy

Replace ext2_use_xip() with test_opt(XIP), which expands to the same code.
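
For reference, test_opt() in fs/ext2/ext2.h is just a mask of
s_mount_opt, so the two spellings generate identical code:

	#define test_opt(sb, opt)	(EXT2_SB(sb)->s_mount_opt & \
					 EXT2_MOUNT_##opt)

	/* test_opt(inode->i_sb, XIP) therefore expands to */
	(EXT2_SB(inode->i_sb)->s_mount_opt & EXT2_MOUNT_XIP)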

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
---
 fs/ext2/ext2.h  | 4 ++++
 fs/ext2/inode.c | 2 +-
 fs/ext2/namei.c | 4 ++--
 3 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h
index d9a17d0..5ecf570 100644
--- a/fs/ext2/ext2.h
+++ b/fs/ext2/ext2.h
@@ -380,7 +380,11 @@ struct ext2_inode {
 #define EXT2_MOUNT_NO_UID32		0x000200  /* Disable 32-bit UIDs */
 #define EXT2_MOUNT_XATTR_USER		0x004000  /* Extended user attributes */
 #define EXT2_MOUNT_POSIX_ACL		0x008000  /* POSIX Access Control Lists */
+#ifdef CONFIG_FS_XIP
 #define EXT2_MOUNT_XIP			0x010000  /* Execute in place */
+#else
+#define EXT2_MOUNT_XIP			0
+#endif
 #define EXT2_MOUNT_USRQUOTA		0x020000  /* user quota */
 #define EXT2_MOUNT_GRPQUOTA		0x040000  /* group quota */
 #define EXT2_MOUNT_RESERVATION		0x080000  /* Preallocation */
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 59d6c7d..cba3833 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -1394,7 +1394,7 @@ struct inode *ext2_iget (struct super_block *sb, unsigned long ino)
 
 	if (S_ISREG(inode->i_mode)) {
 		inode->i_op = &ext2_file_inode_operations;
-		if (ext2_use_xip(inode->i_sb)) {
+		if (test_opt(inode->i_sb, XIP)) {
 			inode->i_mapping->a_ops = &ext2_aops_xip;
 			inode->i_fop = &ext2_xip_file_operations;
 		} else if (test_opt(inode->i_sb, NOBH)) {
diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c
index c268d0a..846c356 100644
--- a/fs/ext2/namei.c
+++ b/fs/ext2/namei.c
@@ -105,7 +105,7 @@ static int ext2_create (struct inode * dir, struct dentry * dentry, umode_t mode
 		return PTR_ERR(inode);
 
 	inode->i_op = &ext2_file_inode_operations;
-	if (ext2_use_xip(inode->i_sb)) {
+	if (test_opt(inode->i_sb, XIP)) {
 		inode->i_mapping->a_ops = &ext2_aops_xip;
 		inode->i_fop = &ext2_xip_file_operations;
 	} else if (test_opt(inode->i_sb, NOBH)) {
@@ -126,7 +126,7 @@ static int ext2_tmpfile(struct inode *dir, struct dentry *dentry, umode_t mode)
 		return PTR_ERR(inode);
 
 	inode->i_op = &ext2_file_inode_operations;
-	if (ext2_use_xip(inode->i_sb)) {
+	if (test_opt(inode->i_sb, XIP)) {
 		inode->i_mapping->a_ops = &ext2_aops_xip;
 		inode->i_fop = &ext2_xip_file_operations;
 	} else if (test_opt(inode->i_sb, NOBH)) {
-- 
2.0.0


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v10 15/21] ext2: Remove xip.c and xip.h
  2014-08-27  3:45 [PATCH v10 00/21] Support ext4 on NV-DIMMs Matthew Wilcox
                   ` (13 preceding siblings ...)
  2014-08-27  3:45 ` [PATCH v10 14/21] ext2: Remove ext2_use_xip Matthew Wilcox
@ 2014-08-27  3:45 ` Matthew Wilcox
  2014-08-27  3:45 ` [PATCH v10 16/21] Remove CONFIG_EXT2_FS_XIP and rename CONFIG_FS_XIP to CONFIG_FS_DAX Matthew Wilcox
                   ` (8 subsequent siblings)
  23 siblings, 0 replies; 52+ messages in thread
From: Matthew Wilcox @ 2014-08-27  3:45 UTC (permalink / raw)
  To: linux-fsdevel, linux-mm, linux-kernel; +Cc: Matthew Wilcox, willy

These files are now empty, so delete them.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
---
 fs/ext2/Makefile |  1 -
 fs/ext2/inode.c  |  1 -
 fs/ext2/namei.c  |  1 -
 fs/ext2/super.c  |  1 -
 fs/ext2/xip.c    | 15 ---------------
 fs/ext2/xip.h    | 16 ----------------
 6 files changed, 35 deletions(-)
 delete mode 100644 fs/ext2/xip.c
 delete mode 100644 fs/ext2/xip.h

diff --git a/fs/ext2/Makefile b/fs/ext2/Makefile
index f42af45..445b0e9 100644
--- a/fs/ext2/Makefile
+++ b/fs/ext2/Makefile
@@ -10,4 +10,3 @@ ext2-y := balloc.o dir.o file.o ialloc.o inode.o \
 ext2-$(CONFIG_EXT2_FS_XATTR)	 += xattr.o xattr_user.o xattr_trusted.o
 ext2-$(CONFIG_EXT2_FS_POSIX_ACL) += acl.o
 ext2-$(CONFIG_EXT2_FS_SECURITY)	 += xattr_security.o
-ext2-$(CONFIG_EXT2_FS_XIP)	 += xip.o
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index cba3833..154cbcf 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -34,7 +34,6 @@
 #include <linux/aio.h>
 #include "ext2.h"
 #include "acl.h"
-#include "xip.h"
 #include "xattr.h"
 
 static int __ext2_write_inode(struct inode *inode, int do_sync);
diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c
index 846c356..7ca803f 100644
--- a/fs/ext2/namei.c
+++ b/fs/ext2/namei.c
@@ -35,7 +35,6 @@
 #include "ext2.h"
 #include "xattr.h"
 #include "acl.h"
-#include "xip.h"
 
 static inline int ext2_add_nondir(struct dentry *dentry, struct inode *inode)
 {
diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index d862031..0393c6d 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -35,7 +35,6 @@
 #include "ext2.h"
 #include "xattr.h"
 #include "acl.h"
-#include "xip.h"
 
 static void ext2_sync_super(struct super_block *sb,
 			    struct ext2_super_block *es, int wait);
diff --git a/fs/ext2/xip.c b/fs/ext2/xip.c
deleted file mode 100644
index 66ca113..0000000
--- a/fs/ext2/xip.c
+++ /dev/null
@@ -1,15 +0,0 @@
-/*
- *  linux/fs/ext2/xip.c
- *
- * Copyright (C) 2005 IBM Corporation
- * Author: Carsten Otte (cotte@de.ibm.com)
- */
-
-#include <linux/mm.h>
-#include <linux/fs.h>
-#include <linux/genhd.h>
-#include <linux/buffer_head.h>
-#include <linux/blkdev.h>
-#include "ext2.h"
-#include "xip.h"
-
diff --git a/fs/ext2/xip.h b/fs/ext2/xip.h
deleted file mode 100644
index 87eeb04..0000000
--- a/fs/ext2/xip.h
+++ /dev/null
@@ -1,16 +0,0 @@
-/*
- *  linux/fs/ext2/xip.h
- *
- * Copyright (C) 2005 IBM Corporation
- * Author: Carsten Otte (cotte@de.ibm.com)
- */
-
-#ifdef CONFIG_EXT2_FS_XIP
-static inline int ext2_use_xip (struct super_block *sb)
-{
-	struct ext2_sb_info *sbi = EXT2_SB(sb);
-	return (sbi->s_mount_opt & EXT2_MOUNT_XIP);
-}
-#else
-#define ext2_use_xip(sb)			0
-#endif
-- 
2.0.0


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v10 16/21] Remove CONFIG_EXT2_FS_XIP and rename CONFIG_FS_XIP to CONFIG_FS_DAX
  2014-08-27  3:45 [PATCH v10 00/21] Support ext4 on NV-DIMMs Matthew Wilcox
                   ` (14 preceding siblings ...)
  2014-08-27  3:45 ` [PATCH v10 15/21] ext2: Remove xip.c and xip.h Matthew Wilcox
@ 2014-08-27  3:45 ` Matthew Wilcox
  2014-08-27  3:45 ` [PATCH v10 17/21] ext2: Remove ext2_aops_xip Matthew Wilcox
                   ` (7 subsequent siblings)
  23 siblings, 0 replies; 52+ messages in thread
From: Matthew Wilcox @ 2014-08-27  3:45 UTC (permalink / raw)
  To: linux-fsdevel, linux-mm, linux-kernel; +Cc: Matthew Wilcox, willy

The fewer Kconfig options we have, the better.  Use the generic
CONFIG_FS_DAX to enable XIP support in ext2 as well as in the core.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
---
 fs/Kconfig         | 21 ++++++++++++++-------
 fs/Makefile        |  2 +-
 fs/ext2/Kconfig    | 11 -----------
 fs/ext2/ext2.h     |  2 +-
 fs/ext2/file.c     |  4 ++--
 fs/ext2/super.c    |  4 ++--
 include/linux/fs.h |  4 ++--
 7 files changed, 22 insertions(+), 26 deletions(-)

diff --git a/fs/Kconfig b/fs/Kconfig
index 312393f..a9eb53d 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -13,13 +13,6 @@ if BLOCK
 source "fs/ext2/Kconfig"
 source "fs/ext3/Kconfig"
 source "fs/ext4/Kconfig"
-
-config FS_XIP
-# execute in place
-	bool
-	depends on EXT2_FS_XIP
-	default y
-
 source "fs/jbd/Kconfig"
 source "fs/jbd2/Kconfig"
 
@@ -40,6 +33,20 @@ source "fs/ocfs2/Kconfig"
 source "fs/btrfs/Kconfig"
 source "fs/nilfs2/Kconfig"
 
+config FS_DAX
+	bool "Direct Access support"
+	depends on MMU
+	help
+	  Direct Access (DAX) can be used on memory-backed block devices.
+	  If the block device supports DAX and the filesystem supports DAX,
+	  then you can avoid using the pagecache to buffer I/Os.  Turning
+	  on this option will compile in support for DAX; you will need to
+	  mount the filesystem using the -o xip option.
+
+	  If you do not have a block device that is capable of using this,
+	  or if unsure, say N.  Saying Y will increase the size of the kernel
+	  by about 2kB.
+
 endif # BLOCK
 
 # Posix ACL utility routines
diff --git a/fs/Makefile b/fs/Makefile
index 0325ec3..df4a4cf 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -28,7 +28,7 @@ obj-$(CONFIG_SIGNALFD)		+= signalfd.o
 obj-$(CONFIG_TIMERFD)		+= timerfd.o
 obj-$(CONFIG_EVENTFD)		+= eventfd.o
 obj-$(CONFIG_AIO)               += aio.o
-obj-$(CONFIG_FS_XIP)		+= dax.o
+obj-$(CONFIG_FS_DAX)		+= dax.o
 obj-$(CONFIG_FILE_LOCKING)      += locks.o
 obj-$(CONFIG_COMPAT)		+= compat.o compat_ioctl.o
 obj-$(CONFIG_BINFMT_AOUT)	+= binfmt_aout.o
diff --git a/fs/ext2/Kconfig b/fs/ext2/Kconfig
index 14a6780..c634874e 100644
--- a/fs/ext2/Kconfig
+++ b/fs/ext2/Kconfig
@@ -42,14 +42,3 @@ config EXT2_FS_SECURITY
 
 	  If you are not using a security module that requires using
 	  extended attributes for file security labels, say N.
-
-config EXT2_FS_XIP
-	bool "Ext2 execute in place support"
-	depends on EXT2_FS && MMU
-	help
-	  Execute in place can be used on memory-backed block devices. If you
-	  enable this option, you can select to mount block devices which are
-	  capable of this feature without using the page cache.
-
-	  If you do not use a block device that is capable of using this,
-	  or if unsure, say N.
diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h
index 5ecf570..b30c3bd 100644
--- a/fs/ext2/ext2.h
+++ b/fs/ext2/ext2.h
@@ -380,7 +380,7 @@ struct ext2_inode {
 #define EXT2_MOUNT_NO_UID32		0x000200  /* Disable 32-bit UIDs */
 #define EXT2_MOUNT_XATTR_USER		0x004000  /* Extended user attributes */
 #define EXT2_MOUNT_POSIX_ACL		0x008000  /* POSIX Access Control Lists */
-#ifdef CONFIG_FS_XIP
+#ifdef CONFIG_FS_DAX
 #define EXT2_MOUNT_XIP			0x010000  /* Execute in place */
 #else
 #define EXT2_MOUNT_XIP			0
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index da8dc64..46b333d 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -25,7 +25,7 @@
 #include "xattr.h"
 #include "acl.h"
 
-#ifdef CONFIG_EXT2_FS_XIP
+#ifdef CONFIG_FS_DAX
 static int ext2_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 {
 	return dax_fault(vma, vmf, ext2_get_block);
@@ -109,7 +109,7 @@ const struct file_operations ext2_file_operations = {
 	.splice_write	= iter_file_splice_write,
 };
 
-#ifdef CONFIG_EXT2_FS_XIP
+#ifdef CONFIG_FS_DAX
 const struct file_operations ext2_xip_file_operations = {
 	.llseek		= generic_file_llseek,
 	.read		= new_sync_read,
diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index 0393c6d..feb53d8 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -287,7 +287,7 @@ static int ext2_show_options(struct seq_file *seq, struct dentry *root)
 		seq_puts(seq, ",grpquota");
 #endif
 
-#if defined(CONFIG_EXT2_FS_XIP)
+#ifdef CONFIG_FS_DAX
 	if (sbi->s_mount_opt & EXT2_MOUNT_XIP)
 		seq_puts(seq, ",xip");
 #endif
@@ -549,7 +549,7 @@ static int parse_options(char *options, struct super_block *sb)
 			break;
 #endif
 		case Opt_xip:
-#ifdef CONFIG_EXT2_FS_XIP
+#ifdef CONFIG_FS_DAX
 			set_opt (sbi->s_mount_opt, XIP);
 #else
 			ext2_msg(sb, KERN_INFO, "xip option not supported");
diff --git a/include/linux/fs.h b/include/linux/fs.h
index d73db11..e6b48cc 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1642,7 +1642,7 @@ struct super_operations {
 #define IS_IMA(inode)		((inode)->i_flags & S_IMA)
 #define IS_AUTOMOUNT(inode)	((inode)->i_flags & S_AUTOMOUNT)
 #define IS_NOSEC(inode)		((inode)->i_flags & S_NOSEC)
-#ifdef CONFIG_FS_XIP
+#ifdef CONFIG_FS_DAX
 #define IS_DAX(inode)		((inode)->i_flags & S_DAX)
 #else
 #define IS_DAX(inode)		0
@@ -2488,7 +2488,7 @@ extern loff_t fixed_size_llseek(struct file *file, loff_t offset,
 extern int generic_file_open(struct inode * inode, struct file * filp);
 extern int nonseekable_open(struct inode * inode, struct file * filp);
 
-#ifdef CONFIG_FS_XIP
+#ifdef CONFIG_FS_DAX
 int dax_clear_blocks(struct inode *, sector_t block, long size);
 int dax_truncate_page(struct inode *, loff_t from, get_block_t);
 ssize_t dax_do_io(int rw, struct kiocb *, struct inode *, struct iov_iter *,
-- 
2.0.0


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v10 17/21] ext2: Remove ext2_aops_xip
  2014-08-27  3:45 [PATCH v10 00/21] Support ext4 on NV-DIMMs Matthew Wilcox
                   ` (15 preceding siblings ...)
  2014-08-27  3:45 ` [PATCH v10 16/21] Remove CONFIG_EXT2_FS_XIP and rename CONFIG_FS_XIP to CONFIG_FS_DAX Matthew Wilcox
@ 2014-08-27  3:45 ` Matthew Wilcox
  2014-08-27  3:45 ` [PATCH v10 18/21] Get rid of most mentions of XIP in ext2 Matthew Wilcox
                   ` (6 subsequent siblings)
  23 siblings, 0 replies; 52+ messages in thread
From: Matthew Wilcox @ 2014-08-27  3:45 UTC (permalink / raw)
  To: linux-fsdevel, linux-mm, linux-kernel; +Cc: Matthew Wilcox, willy

We shouldn't need a special address_space_operations any more.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
---
 fs/ext2/ext2.h  | 1 -
 fs/ext2/inode.c | 7 +------
 fs/ext2/namei.c | 4 ++--
 3 files changed, 3 insertions(+), 9 deletions(-)

diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h
index b30c3bd..b8b1c11 100644
--- a/fs/ext2/ext2.h
+++ b/fs/ext2/ext2.h
@@ -793,7 +793,6 @@ extern const struct file_operations ext2_xip_file_operations;
 
 /* inode.c */
 extern const struct address_space_operations ext2_aops;
-extern const struct address_space_operations ext2_aops_xip;
 extern const struct address_space_operations ext2_nobh_aops;
 
 /* namei.c */
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 154cbcf..034fd42 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -891,11 +891,6 @@ const struct address_space_operations ext2_aops = {
 	.error_remove_page	= generic_error_remove_page,
 };
 
-const struct address_space_operations ext2_aops_xip = {
-	.bmap			= ext2_bmap,
-	.direct_IO		= ext2_direct_IO,
-};
-
 const struct address_space_operations ext2_nobh_aops = {
 	.readpage		= ext2_readpage,
 	.readpages		= ext2_readpages,
@@ -1394,7 +1389,7 @@ struct inode *ext2_iget (struct super_block *sb, unsigned long ino)
 	if (S_ISREG(inode->i_mode)) {
 		inode->i_op = &ext2_file_inode_operations;
 		if (test_opt(inode->i_sb, XIP)) {
-			inode->i_mapping->a_ops = &ext2_aops_xip;
+			inode->i_mapping->a_ops = &ext2_aops;
 			inode->i_fop = &ext2_xip_file_operations;
 		} else if (test_opt(inode->i_sb, NOBH)) {
 			inode->i_mapping->a_ops = &ext2_nobh_aops;
diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c
index 7ca803f..0db888c 100644
--- a/fs/ext2/namei.c
+++ b/fs/ext2/namei.c
@@ -105,7 +105,7 @@ static int ext2_create (struct inode * dir, struct dentry * dentry, umode_t mode
 
 	inode->i_op = &ext2_file_inode_operations;
 	if (test_opt(inode->i_sb, XIP)) {
-		inode->i_mapping->a_ops = &ext2_aops_xip;
+		inode->i_mapping->a_ops = &ext2_aops;
 		inode->i_fop = &ext2_xip_file_operations;
 	} else if (test_opt(inode->i_sb, NOBH)) {
 		inode->i_mapping->a_ops = &ext2_nobh_aops;
@@ -126,7 +126,7 @@ static int ext2_tmpfile(struct inode *dir, struct dentry *dentry, umode_t mode)
 
 	inode->i_op = &ext2_file_inode_operations;
 	if (test_opt(inode->i_sb, XIP)) {
-		inode->i_mapping->a_ops = &ext2_aops_xip;
+		inode->i_mapping->a_ops = &ext2_aops;
 		inode->i_fop = &ext2_xip_file_operations;
 	} else if (test_opt(inode->i_sb, NOBH)) {
 		inode->i_mapping->a_ops = &ext2_nobh_aops;
-- 
2.0.0


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v10 18/21] Get rid of most mentions of XIP in ext2
  2014-08-27  3:45 [PATCH v10 00/21] Support ext4 on NV-DIMMs Matthew Wilcox
                   ` (16 preceding siblings ...)
  2014-08-27  3:45 ` [PATCH v10 17/21] ext2: Remove ext2_aops_xip Matthew Wilcox
@ 2014-08-27  3:45 ` Matthew Wilcox
  2014-08-27  3:45 ` [PATCH v10 19/21] xip: Add xip_zero_page_range Matthew Wilcox
                   ` (5 subsequent siblings)
  23 siblings, 0 replies; 52+ messages in thread
From: Matthew Wilcox @ 2014-08-27  3:45 UTC (permalink / raw)
  To: linux-fsdevel, linux-mm, linux-kernel; +Cc: Matthew Wilcox, willy

To help people transition, accept the 'xip' mount option (and report it
in /proc/mounts), but print a message encouraging people to switch over
to the 'dax' option.
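
For illustration, a minimal userspace sketch of mounting with the new
option (the device and mount point are made up; any DAX-capable block
device such as brd would do, and this is just the C equivalent of
"mount -o dax /dev/ram0 /mnt/pmem"):

    /* illustrative sketch only: mount an ext2 filesystem with -o dax */
    #include <sys/mount.h>
    #include <stdio.h>

    int main(void)
    {
            /* /dev/ram0 and /mnt/pmem are assumptions for this sketch */
            if (mount("/dev/ram0", "/mnt/pmem", "ext2", 0, "dax") == -1) {
                    perror("mount -o dax");
                    return 1;
            }
            return 0;
    }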
---
 fs/ext2/ext2.h  | 13 +++++++------
 fs/ext2/file.c  |  2 +-
 fs/ext2/inode.c |  6 +++---
 fs/ext2/namei.c |  8 ++++----
 fs/ext2/super.c | 25 ++++++++++++++++---------
 5 files changed, 31 insertions(+), 23 deletions(-)

diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h
index b8b1c11..46133a0 100644
--- a/fs/ext2/ext2.h
+++ b/fs/ext2/ext2.h
@@ -380,14 +380,15 @@ struct ext2_inode {
 #define EXT2_MOUNT_NO_UID32		0x000200  /* Disable 32-bit UIDs */
 #define EXT2_MOUNT_XATTR_USER		0x004000  /* Extended user attributes */
 #define EXT2_MOUNT_POSIX_ACL		0x008000  /* POSIX Access Control Lists */
-#ifdef CONFIG_FS_DAX
-#define EXT2_MOUNT_XIP			0x010000  /* Execute in place */
-#else
-#define EXT2_MOUNT_XIP			0
-#endif
+#define EXT2_MOUNT_XIP			0x010000  /* Obsolete, use DAX */
 #define EXT2_MOUNT_USRQUOTA		0x020000  /* user quota */
 #define EXT2_MOUNT_GRPQUOTA		0x040000  /* group quota */
 #define EXT2_MOUNT_RESERVATION		0x080000  /* Preallocation */
+#ifdef CONFIG_FS_DAX
+#define EXT2_MOUNT_DAX			0x100000  /* Direct Access */
+#else
+#define EXT2_MOUNT_DAX			0
+#endif
 
 
 #define clear_opt(o, opt)		o &= ~EXT2_MOUNT_##opt
@@ -789,7 +790,7 @@ extern int ext2_fsync(struct file *file, loff_t start, loff_t end,
 		      int datasync);
 extern const struct inode_operations ext2_file_inode_operations;
 extern const struct file_operations ext2_file_operations;
-extern const struct file_operations ext2_xip_file_operations;
+extern const struct file_operations ext2_dax_file_operations;
 
 /* inode.c */
 extern const struct address_space_operations ext2_aops;
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index 46b333d..5b8cab5 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -110,7 +110,7 @@ const struct file_operations ext2_file_operations = {
 };
 
 #ifdef CONFIG_FS_DAX
-const struct file_operations ext2_xip_file_operations = {
+const struct file_operations ext2_dax_file_operations = {
 	.llseek		= generic_file_llseek,
 	.read		= new_sync_read,
 	.write		= new_sync_write,
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 034fd42..6434bc0 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -1286,7 +1286,7 @@ void ext2_set_inode_flags(struct inode *inode)
 		inode->i_flags |= S_NOATIME;
 	if (flags & EXT2_DIRSYNC_FL)
 		inode->i_flags |= S_DIRSYNC;
-	if (test_opt(inode->i_sb, XIP))
+	if (test_opt(inode->i_sb, DAX))
 		inode->i_flags |= S_DAX;
 }
 
@@ -1388,9 +1388,9 @@ struct inode *ext2_iget (struct super_block *sb, unsigned long ino)
 
 	if (S_ISREG(inode->i_mode)) {
 		inode->i_op = &ext2_file_inode_operations;
-		if (test_opt(inode->i_sb, XIP)) {
+		if (test_opt(inode->i_sb, DAX)) {
 			inode->i_mapping->a_ops = &ext2_aops;
-			inode->i_fop = &ext2_xip_file_operations;
+			inode->i_fop = &ext2_dax_file_operations;
 		} else if (test_opt(inode->i_sb, NOBH)) {
 			inode->i_mapping->a_ops = &ext2_nobh_aops;
 			inode->i_fop = &ext2_file_operations;
diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c
index 0db888c..148f6e3 100644
--- a/fs/ext2/namei.c
+++ b/fs/ext2/namei.c
@@ -104,9 +104,9 @@ static int ext2_create (struct inode * dir, struct dentry * dentry, umode_t mode
 		return PTR_ERR(inode);
 
 	inode->i_op = &ext2_file_inode_operations;
-	if (test_opt(inode->i_sb, XIP)) {
+	if (test_opt(inode->i_sb, DAX)) {
 		inode->i_mapping->a_ops = &ext2_aops;
-		inode->i_fop = &ext2_xip_file_operations;
+		inode->i_fop = &ext2_dax_file_operations;
 	} else if (test_opt(inode->i_sb, NOBH)) {
 		inode->i_mapping->a_ops = &ext2_nobh_aops;
 		inode->i_fop = &ext2_file_operations;
@@ -125,9 +125,9 @@ static int ext2_tmpfile(struct inode *dir, struct dentry *dentry, umode_t mode)
 		return PTR_ERR(inode);
 
 	inode->i_op = &ext2_file_inode_operations;
-	if (test_opt(inode->i_sb, XIP)) {
+	if (test_opt(inode->i_sb, DAX)) {
 		inode->i_mapping->a_ops = &ext2_aops;
-		inode->i_fop = &ext2_xip_file_operations;
+		inode->i_fop = &ext2_dax_file_operations;
 	} else if (test_opt(inode->i_sb, NOBH)) {
 		inode->i_mapping->a_ops = &ext2_nobh_aops;
 		inode->i_fop = &ext2_file_operations;
diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index feb53d8..8b9debf 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -290,6 +290,8 @@ static int ext2_show_options(struct seq_file *seq, struct dentry *root)
 #ifdef CONFIG_FS_DAX
 	if (sbi->s_mount_opt & EXT2_MOUNT_XIP)
 		seq_puts(seq, ",xip");
+	if (sbi->s_mount_opt & EXT2_MOUNT_DAX)
+		seq_puts(seq, ",dax");
 #endif
 
 	if (!test_opt(sb, RESERVATION))
@@ -393,7 +395,7 @@ enum {
 	Opt_resgid, Opt_resuid, Opt_sb, Opt_err_cont, Opt_err_panic,
 	Opt_err_ro, Opt_nouid32, Opt_nocheck, Opt_debug,
 	Opt_oldalloc, Opt_orlov, Opt_nobh, Opt_user_xattr, Opt_nouser_xattr,
-	Opt_acl, Opt_noacl, Opt_xip, Opt_ignore, Opt_err, Opt_quota,
+	Opt_acl, Opt_noacl, Opt_xip, Opt_dax, Opt_ignore, Opt_err, Opt_quota,
 	Opt_usrquota, Opt_grpquota, Opt_reservation, Opt_noreservation
 };
 
@@ -422,6 +424,7 @@ static const match_table_t tokens = {
 	{Opt_acl, "acl"},
 	{Opt_noacl, "noacl"},
 	{Opt_xip, "xip"},
+	{Opt_dax, "dax"},
 	{Opt_grpquota, "grpquota"},
 	{Opt_ignore, "noquota"},
 	{Opt_quota, "quota"},
@@ -549,10 +552,14 @@ static int parse_options(char *options, struct super_block *sb)
 			break;
 #endif
 		case Opt_xip:
+			ext2_msg(sb, KERN_INFO, "use dax instead of xip");
+			set_opt(sbi->s_mount_opt, XIP);
+			/* Fall through */
+		case Opt_dax:
 #ifdef CONFIG_FS_DAX
-			set_opt (sbi->s_mount_opt, XIP);
+			set_opt(sbi->s_mount_opt, DAX);
 #else
-			ext2_msg(sb, KERN_INFO, "xip option not supported");
+			ext2_msg(sb, KERN_INFO, "dax option not supported");
 #endif
 			break;
 
@@ -896,15 +903,15 @@ static int ext2_fill_super(struct super_block *sb, void *data, int silent)
 
 	blocksize = BLOCK_SIZE << le32_to_cpu(sbi->s_es->s_log_block_size);
 
-	if (sbi->s_mount_opt & EXT2_MOUNT_XIP) {
+	if (sbi->s_mount_opt & EXT2_MOUNT_DAX) {
 		if (blocksize != PAGE_SIZE) {
 			ext2_msg(sb, KERN_ERR,
-					"error: unsupported blocksize for xip");
+					"error: unsupported blocksize for dax");
 			goto failed_mount;
 		}
 		if (!sb->s_bdev->bd_disk->fops->direct_access) {
 			ext2_msg(sb, KERN_ERR,
-					"error: device does not support xip");
+					"error: device does not support dax");
 			goto failed_mount;
 		}
 	}
@@ -1276,10 +1283,10 @@ static int ext2_remount (struct super_block * sb, int * flags, char * data)
 		((sbi->s_mount_opt & EXT2_MOUNT_POSIX_ACL) ? MS_POSIXACL : 0);
 
 	es = sbi->s_es;
-	if ((sbi->s_mount_opt ^ old_opts.s_mount_opt) & EXT2_MOUNT_XIP) {
+	if ((sbi->s_mount_opt ^ old_opts.s_mount_opt) & EXT2_MOUNT_DAX) {
 		ext2_msg(sb, KERN_WARNING, "warning: refusing change of "
-			 "xip flag with busy inodes while remounting");
-		sbi->s_mount_opt ^= EXT2_MOUNT_XIP;
+			 "dax flag with busy inodes while remounting");
+		sbi->s_mount_opt ^= EXT2_MOUNT_DAX;
 	}
 	if ((*flags & MS_RDONLY) == (sb->s_flags & MS_RDONLY)) {
 		spin_unlock(&sbi->s_lock);
-- 
2.0.0


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v10 19/21] xip: Add xip_zero_page_range
  2014-08-27  3:45 [PATCH v10 00/21] Support ext4 on NV-DIMMs Matthew Wilcox
                   ` (17 preceding siblings ...)
  2014-08-27  3:45 ` [PATCH v10 18/21] Get rid of most mentions of XIP in ext2 Matthew Wilcox
@ 2014-08-27  3:45 ` Matthew Wilcox
  2014-09-03  9:21   ` Dave Chinner
  2014-08-27  3:45 ` [PATCH v10 20/21] ext4: Add DAX functionality Matthew Wilcox
                   ` (4 subsequent siblings)
  23 siblings, 1 reply; 52+ messages in thread
From: Matthew Wilcox @ 2014-08-27  3:45 UTC (permalink / raw)
  To: linux-fsdevel, linux-mm, linux-kernel; +Cc: Matthew Wilcox, willy, Ross Zwisler

This new function allows us to support hole-punch for XIP files by zeroing
a partial page, as opposed to the dax_truncate_page() function which can
only truncate to the end of the page.  Reimplement dax_truncate_page() as
a macro that calls dax_zero_page_range().
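
As a rough sketch of how a filesystem's hole-punch path might use the
new helper (this is not code from the series; exfs_punch_zero_edges and
exfs_get_block are made-up names, and it assumes the two ends of the
hole land in different pages):

    /* sketch: zero the partial pages at both ends of a punched hole */
    static int exfs_punch_zero_edges(struct inode *inode, loff_t start,
                                     loff_t end)
    {
            unsigned partial_start = start & (PAGE_CACHE_SIZE - 1);
            unsigned partial_end = end & (PAGE_CACHE_SIZE - 1);
            int err = 0;

            if (partial_start)      /* zero from 'start' to the end of its page */
                    err = dax_zero_page_range(inode, start,
                                    PAGE_CACHE_SIZE - partial_start,
                                    exfs_get_block);
            if (!err && partial_end)        /* zero the head of the final page */
                    err = dax_zero_page_range(inode, end - partial_end,
                                    partial_end, exfs_get_block);
            return err;
    }

Whole blocks in the middle of the hole need no zeroing at all; their
allocated blocks are simply freed.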

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
[ported to 3.13-rc2]
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 Documentation/filesystems/dax.txt |  1 +
 fs/dax.c                          | 20 ++++++++++++++------
 include/linux/fs.h                |  9 ++++++++-
 3 files changed, 23 insertions(+), 7 deletions(-)

diff --git a/Documentation/filesystems/dax.txt b/Documentation/filesystems/dax.txt
index 635adaa..ebcd97f 100644
--- a/Documentation/filesystems/dax.txt
+++ b/Documentation/filesystems/dax.txt
@@ -62,6 +62,7 @@ Filesystem support consists of
   for fault and page_mkwrite (which should probably call dax_fault() and
   dax_mkwrite(), passing the appropriate get_block() callback)
 - calling dax_truncate_page() instead of block_truncate_page() for DAX files
+- calling dax_zero_page_range() instead of zero_user() for DAX files
 - ensuring that there is sufficient locking between reads, writes,
   truncates and page faults
 
diff --git a/fs/dax.c b/fs/dax.c
index d54f7d3..96c4fed 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -445,13 +445,16 @@ int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 EXPORT_SYMBOL_GPL(dax_fault);
 
 /**
- * dax_truncate_page - handle a partial page being truncated in a DAX file
+ * dax_zero_page_range - zero a range within a page of a DAX file
  * @inode: The file being truncated
  * @from: The file offset that is being truncated to
+ * @length: The number of bytes to zero
  * @get_block: The filesystem method used to translate file offsets to blocks
  *
- * Similar to block_truncate_page(), this function can be called by a
- * filesystem when it is truncating an DAX file to handle the partial page.
+ * This function can be called by a filesystem when it is zeroing part of a
+ * page in a DAX file.  This is intended for hole-punch operations.  If
+ * you are truncating a file, the helper function dax_truncate_page() may be
+ * more convenient.
  *
  * We work in terms of PAGE_CACHE_SIZE here for commonality with
  * block_truncate_page(), but we could go down to PAGE_SIZE if the filesystem
@@ -459,12 +462,12 @@ EXPORT_SYMBOL_GPL(dax_fault);
  * block size is smaller than PAGE_SIZE, we have to zero the rest of the page
  * since the file might be mmaped.
  */
-int dax_truncate_page(struct inode *inode, loff_t from, get_block_t get_block)
+int dax_zero_page_range(struct inode *inode, loff_t from, unsigned length,
+							get_block_t get_block)
 {
 	struct buffer_head bh;
 	pgoff_t index = from >> PAGE_CACHE_SHIFT;
 	unsigned offset = from & (PAGE_CACHE_SIZE-1);
-	unsigned length = PAGE_CACHE_ALIGN(from) - from;
 	int err;
 
 	/* Block boundary? Nothing to do */
@@ -481,9 +484,14 @@ int dax_truncate_page(struct inode *inode, loff_t from, get_block_t get_block)
 		err = dax_get_addr(&bh, &addr, inode->i_blkbits);
 		if (err < 0)
 			return err;
+		/*
+		 * ext4 sometimes asks to zero past the end of a block.  It
+		 * really just wants to zero to the end of the block.
+		 */
+		length = min_t(unsigned, length, PAGE_CACHE_SIZE - offset);
 		memset(addr + offset, 0, length);
 	}
 
 	return 0;
 }
-EXPORT_SYMBOL_GPL(dax_truncate_page);
+EXPORT_SYMBOL_GPL(dax_zero_page_range);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index e6b48cc..b0078df 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2490,6 +2490,7 @@ extern int nonseekable_open(struct inode * inode, struct file * filp);
 
 #ifdef CONFIG_FS_DAX
 int dax_clear_blocks(struct inode *, sector_t block, long size);
+int dax_zero_page_range(struct inode *, loff_t from, unsigned len, get_block_t);
 int dax_truncate_page(struct inode *, loff_t from, get_block_t);
 ssize_t dax_do_io(int rw, struct kiocb *, struct inode *, struct iov_iter *,
 		loff_t, get_block_t, dio_iodone_t, int flags);
@@ -2501,7 +2502,8 @@ static inline int dax_clear_blocks(struct inode *i, sector_t blk, long sz)
 	return 0;
 }
 
-static inline int dax_truncate_page(struct inode *i, loff_t frm, get_block_t gb)
+static inline int dax_zero_page_range(struct inode *inode, loff_t from,
+						unsigned len, get_block_t gb)
 {
 	return 0;
 }
@@ -2514,6 +2516,11 @@ static inline ssize_t dax_do_io(int rw, struct kiocb *iocb,
 }
 #endif
 
+/* Can't be a function because PAGE_CACHE_SIZE is defined in pagemap.h */
+#define dax_truncate_page(inode, from, get_block)	\
+	dax_zero_page_range(inode, from, PAGE_CACHE_SIZE, get_block)
+
+
 #ifdef CONFIG_BLOCK
 typedef void (dio_submit_t)(int rw, struct bio *bio, struct inode *inode,
 			    loff_t file_offset);
-- 
2.0.0


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v10 20/21] ext4: Add DAX functionality
  2014-08-27  3:45 [PATCH v10 00/21] Support ext4 on NV-DIMMs Matthew Wilcox
                   ` (18 preceding siblings ...)
  2014-08-27  3:45 ` [PATCH v10 19/21] xip: Add xip_zero_page_range Matthew Wilcox
@ 2014-08-27  3:45 ` Matthew Wilcox
  2014-09-03 11:13   ` Dave Chinner
  2014-08-27  3:45 ` [PATCH v10 21/21] brd: Rename XIP to DAX Matthew Wilcox
                   ` (3 subsequent siblings)
  23 siblings, 1 reply; 52+ messages in thread
From: Matthew Wilcox @ 2014-08-27  3:45 UTC (permalink / raw)
  To: linux-fsdevel, linux-mm, linux-kernel; +Cc: Ross Zwisler, willy, Matthew Wilcox

From: Ross Zwisler <ross.zwisler@linux.intel.com>

This is a port of the DAX functionality found in the current version of
ext2.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
[heavily tweaked]
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
---
 Documentation/filesystems/dax.txt  |  1 +
 Documentation/filesystems/ext4.txt |  2 ++
 fs/ext4/ext4.h                     |  6 +++++
 fs/ext4/file.c                     | 49 ++++++++++++++++++++++++++++++++++--
 fs/ext4/indirect.c                 | 18 ++++++++++----
 fs/ext4/inode.c                    | 51 ++++++++++++++++++++++++--------------
 fs/ext4/namei.c                    | 10 ++++++--
 fs/ext4/super.c                    | 39 ++++++++++++++++++++++++++++-
 8 files changed, 148 insertions(+), 28 deletions(-)

diff --git a/Documentation/filesystems/dax.txt b/Documentation/filesystems/dax.txt
index ebcd97f..be376d9 100644
--- a/Documentation/filesystems/dax.txt
+++ b/Documentation/filesystems/dax.txt
@@ -73,6 +73,7 @@ or a write()) work correctly.
 
 These filesystems may be used for inspiration:
 - ext2: the second extended filesystem, see Documentation/filesystems/ext2.txt
+- ext4: the fourth extended filesystem, see Documentation/filesystems/ext4.txt
 
 
 Shortcomings
diff --git a/Documentation/filesystems/ext4.txt b/Documentation/filesystems/ext4.txt
index 919a329..9c511c4 100644
--- a/Documentation/filesystems/ext4.txt
+++ b/Documentation/filesystems/ext4.txt
@@ -386,6 +386,8 @@ max_dir_size_kb=n	This limits the size of directories so that any
 i_version		Enable 64-bit inode version support. This option is
 			off by default.
 
+dax			Use direct access if possible
+
 Data Mode
 =========
 There are 3 different data modes:
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 5b19760..c065a3e 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -969,6 +969,11 @@ struct ext4_inode_info {
 #define EXT4_MOUNT_ERRORS_MASK		0x00070
 #define EXT4_MOUNT_MINIX_DF		0x00080	/* Mimics the Minix statfs */
 #define EXT4_MOUNT_NOLOAD		0x00100	/* Don't use existing journal*/
+#ifdef CONFIG_FS_DAX
+#define EXT4_MOUNT_DAX			0x00200	/* Execute in place */
+#else
+#define EXT4_MOUNT_DAX			0
+#endif
 #define EXT4_MOUNT_DATA_FLAGS		0x00C00	/* Mode for data writes: */
 #define EXT4_MOUNT_JOURNAL_DATA		0x00400	/* Write data to journal */
 #define EXT4_MOUNT_ORDERED_DATA		0x00800	/* Flush data before commit */
@@ -2558,6 +2563,7 @@ extern const struct file_operations ext4_dir_operations;
 /* file.c */
 extern const struct inode_operations ext4_file_inode_operations;
 extern const struct file_operations ext4_file_operations;
+extern const struct file_operations ext4_dax_file_operations;
 extern loff_t ext4_llseek(struct file *file, loff_t offset, int origin);
 
 /* inline.c */
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index aca7b24..9c7bde5 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -95,7 +95,7 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	struct inode *inode = file_inode(iocb->ki_filp);
 	struct mutex *aio_mutex = NULL;
 	struct blk_plug plug;
-	int o_direct = file->f_flags & O_DIRECT;
+	int o_direct = io_is_direct(file);
 	int overwrite = 0;
 	size_t length = iov_iter_count(from);
 	ssize_t ret;
@@ -191,6 +191,27 @@ errout:
 	return ret;
 }
 
+#ifdef CONFIG_FS_DAX
+static int ext4_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+	return dax_fault(vma, vmf, ext4_get_block);
+					/* Is this the right get_block? */
+}
+
+static int ext4_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+	return dax_mkwrite(vma, vmf, ext4_get_block);
+}
+
+static const struct vm_operations_struct ext4_dax_vm_ops = {
+	.fault		= ext4_dax_fault,
+	.page_mkwrite	= ext4_dax_mkwrite,
+	.remap_pages	= generic_file_remap_pages,
+};
+#else
+#define ext4_dax_vm_ops	ext4_file_vm_ops
+#endif
+
 static const struct vm_operations_struct ext4_file_vm_ops = {
 	.fault		= filemap_fault,
 	.map_pages	= filemap_map_pages,
@@ -201,7 +222,12 @@ static const struct vm_operations_struct ext4_file_vm_ops = {
 static int ext4_file_mmap(struct file *file, struct vm_area_struct *vma)
 {
 	file_accessed(file);
-	vma->vm_ops = &ext4_file_vm_ops;
+	if (IS_DAX(file_inode(file))) {
+		vma->vm_ops = &ext4_dax_vm_ops;
+		vma->vm_flags |= VM_MIXEDMAP;
+	} else {
+		vma->vm_ops = &ext4_file_vm_ops;
+	}
 	return 0;
 }
 
@@ -600,6 +626,25 @@ const struct file_operations ext4_file_operations = {
 	.fallocate	= ext4_fallocate,
 };
 
+#ifdef CONFIG_FS_DAX
+const struct file_operations ext4_dax_file_operations = {
+	.llseek		= ext4_llseek,
+	.read		= new_sync_read,
+	.write		= new_sync_write,
+	.read_iter	= generic_file_read_iter,
+	.write_iter	= ext4_file_write_iter,
+	.unlocked_ioctl = ext4_ioctl,
+#ifdef CONFIG_COMPAT
+	.compat_ioctl	= ext4_compat_ioctl,
+#endif
+	.mmap		= ext4_file_mmap,
+	.open		= ext4_file_open,
+	.release	= ext4_release_file,
+	.fsync		= ext4_sync_file,
+	.fallocate	= ext4_fallocate,
+};
+#endif
+
 const struct inode_operations ext4_file_inode_operations = {
 	.setattr	= ext4_setattr,
 	.getattr	= ext4_getattr,
diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
index e75f840..fa9ec8d 100644
--- a/fs/ext4/indirect.c
+++ b/fs/ext4/indirect.c
@@ -691,14 +691,22 @@ retry:
 			inode_dio_done(inode);
 			goto locked;
 		}
-		ret = __blockdev_direct_IO(rw, iocb, inode,
-				 inode->i_sb->s_bdev, iter, offset,
-				 ext4_get_block, NULL, NULL, 0);
+		if (IS_DAX(inode))
+			ret = dax_do_io(rw, iocb, inode, iter, offset,
+					ext4_get_block, NULL, 0);
+		else
+			ret = __blockdev_direct_IO(rw, iocb, inode,
+					inode->i_sb->s_bdev, iter, offset,
+					ext4_get_block, NULL, NULL, 0);
 		inode_dio_done(inode);
 	} else {
 locked:
-		ret = blockdev_direct_IO(rw, iocb, inode, iter,
-				 offset, ext4_get_block);
+		if (IS_DAX(inode))
+			ret = dax_do_io(rw, iocb, inode, iter, offset,
+					ext4_get_block, NULL, DIO_LOCKING);
+		else
+			ret = blockdev_direct_IO(rw, iocb, inode, iter,
+					offset, ext4_get_block);
 
 		if (unlikely((rw & WRITE) && ret < 0)) {
 			loff_t isize = i_size_read(inode);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 367a60c..e71adf6 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3055,13 +3055,14 @@ static ssize_t ext4_ext_direct_IO(int rw, struct kiocb *iocb,
 		get_block_func = ext4_get_block_write;
 		dio_flags = DIO_LOCKING;
 	}
-	ret = __blockdev_direct_IO(rw, iocb, inode,
-				   inode->i_sb->s_bdev, iter,
-				   offset,
-				   get_block_func,
-				   ext4_end_io_dio,
-				   NULL,
-				   dio_flags);
+	if (IS_DAX(inode))
+		ret = dax_do_io(rw, iocb, inode, iter, offset, get_block_func,
+				ext4_end_io_dio, dio_flags);
+	else
+		ret = __blockdev_direct_IO(rw, iocb, inode,
+					   inode->i_sb->s_bdev, iter, offset,
+					   get_block_func,
+					   ext4_end_io_dio, NULL, dio_flags);
 
 	/*
 	 * Put our reference to io_end. This can free the io_end structure e.g.
@@ -3225,14 +3226,7 @@ void ext4_set_aops(struct inode *inode)
 		inode->i_mapping->a_ops = &ext4_aops;
 }
 
-/*
- * ext4_block_zero_page_range() zeros out a mapping of length 'length'
- * starting from file offset 'from'.  The range to be zero'd must
- * be contained with in one block.  If the specified range exceeds
- * the end of the block it will be shortened to end of the block
- * that cooresponds to 'from'
- */
-static int ext4_block_zero_page_range(handle_t *handle,
+static int __ext4_block_zero_page_range(handle_t *handle,
 		struct address_space *mapping, loff_t from, loff_t length)
 {
 	ext4_fsblk_t index = from >> PAGE_CACHE_SHIFT;
@@ -3323,6 +3317,22 @@ unlock:
 }
 
 /*
+ * ext4_block_zero_page_range() zeros out a mapping of length 'length'
+ * starting from file offset 'from'.  The range to be zero'd must
+ * be contained with in one block.  If the specified range exceeds
+ * the end of the block it will be shortened to end of the block
+ * that cooresponds to 'from'
+ */
+static int ext4_block_zero_page_range(handle_t *handle,
+		struct address_space *mapping, loff_t from, loff_t length)
+{
+	struct inode *inode = mapping->host;
+	if (IS_DAX(inode))
+		return dax_zero_page_range(inode, from, length, ext4_get_block);
+	return __ext4_block_zero_page_range(handle, mapping, from, length);
+}
+
+/*
  * ext4_block_truncate_page() zeroes out a mapping from file offset `from'
  * up to the end of the block which corresponds to `from'.
  * This required during truncate. We need to physically zero the tail end
@@ -3843,8 +3853,10 @@ void ext4_set_inode_flags(struct inode *inode)
 		new_fl |= S_NOATIME;
 	if (flags & EXT4_DIRSYNC_FL)
 		new_fl |= S_DIRSYNC;
+	if (test_opt(inode->i_sb, DAX))
+		new_fl |= S_DAX;
 	inode_set_flags(inode, new_fl,
-			S_SYNC|S_APPEND|S_IMMUTABLE|S_NOATIME|S_DIRSYNC);
+			S_SYNC|S_APPEND|S_IMMUTABLE|S_NOATIME|S_DIRSYNC|S_DAX);
 }
 
 /* Propagate flags from i_flags to EXT4_I(inode)->i_flags */
@@ -4098,7 +4110,10 @@ struct inode *ext4_iget(struct super_block *sb, unsigned long ino)
 
 	if (S_ISREG(inode->i_mode)) {
 		inode->i_op = &ext4_file_inode_operations;
-		inode->i_fop = &ext4_file_operations;
+		if (test_opt(inode->i_sb, DAX))
+			inode->i_fop = &ext4_dax_file_operations;
+		else
+			inode->i_fop = &ext4_file_operations;
 		ext4_set_aops(inode);
 	} else if (S_ISDIR(inode->i_mode)) {
 		inode->i_op = &ext4_dir_inode_operations;
@@ -4568,7 +4583,7 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr)
 		 * Truncate pagecache after we've waited for commit
 		 * in data=journal mode to make pages freeable.
 		 */
-			truncate_pagecache(inode, inode->i_size);
+		truncate_pagecache(inode, inode->i_size);
 	}
 	/*
 	 * We want to call ext4_truncate() even if attr->ia_size ==
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index b147a67..4900990 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -2251,7 +2251,10 @@ retry:
 	err = PTR_ERR(inode);
 	if (!IS_ERR(inode)) {
 		inode->i_op = &ext4_file_inode_operations;
-		inode->i_fop = &ext4_file_operations;
+		if (test_opt(inode->i_sb, DAX))
+			inode->i_fop = &ext4_dax_file_operations;
+		else
+			inode->i_fop = &ext4_file_operations;
 		ext4_set_aops(inode);
 		err = ext4_add_nondir(handle, dentry, inode);
 		if (!err && IS_DIRSYNC(dir))
@@ -2315,7 +2318,10 @@ retry:
 	err = PTR_ERR(inode);
 	if (!IS_ERR(inode)) {
 		inode->i_op = &ext4_file_inode_operations;
-		inode->i_fop = &ext4_file_operations;
+		if (test_opt(inode->i_sb, DAX))
+			inode->i_fop = &ext4_dax_file_operations;
+		else
+			inode->i_fop = &ext4_file_operations;
 		ext4_set_aops(inode);
 		d_tmpfile(dentry, inode);
 		err = ext4_orphan_add(handle, inode);
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 32b43ad..d946f16 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1162,7 +1162,7 @@ enum {
 	Opt_usrjquota, Opt_grpjquota, Opt_offusrjquota, Opt_offgrpjquota,
 	Opt_jqfmt_vfsold, Opt_jqfmt_vfsv0, Opt_jqfmt_vfsv1, Opt_quota,
 	Opt_noquota, Opt_barrier, Opt_nobarrier, Opt_err,
-	Opt_usrquota, Opt_grpquota, Opt_i_version,
+	Opt_usrquota, Opt_grpquota, Opt_i_version, Opt_dax,
 	Opt_stripe, Opt_delalloc, Opt_nodelalloc, Opt_mblk_io_submit,
 	Opt_nomblk_io_submit, Opt_block_validity, Opt_noblock_validity,
 	Opt_inode_readahead_blks, Opt_journal_ioprio,
@@ -1224,6 +1224,7 @@ static const match_table_t tokens = {
 	{Opt_barrier, "barrier"},
 	{Opt_nobarrier, "nobarrier"},
 	{Opt_i_version, "i_version"},
+	{Opt_dax, "dax"},
 	{Opt_stripe, "stripe=%u"},
 	{Opt_delalloc, "delalloc"},
 	{Opt_nodelalloc, "nodelalloc"},
@@ -1406,6 +1407,7 @@ static const struct mount_opts {
 	{Opt_min_batch_time, 0, MOPT_GTE0},
 	{Opt_inode_readahead_blks, 0, MOPT_GTE0},
 	{Opt_init_itable, 0, MOPT_GTE0},
+	{Opt_dax, EXT4_MOUNT_DAX, MOPT_SET},
 	{Opt_stripe, 0, MOPT_GTE0},
 	{Opt_resuid, 0, MOPT_GTE0},
 	{Opt_resgid, 0, MOPT_GTE0},
@@ -1642,6 +1644,11 @@ static int handle_mount_opt(struct super_block *sb, char *opt, int token,
 		}
 		sbi->s_jquota_fmt = m->mount_opt;
 #endif
+#ifndef CONFIG_FS_DAX
+	} else if (token == Opt_dax) {
+		ext4_msg(sb, KERN_INFO, "dax option not supported");
+		return -1;
+#endif
 	} else {
 		if (!args->from)
 			arg = 1;
@@ -3571,6 +3578,11 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 				 "both data=journal and dioread_nolock");
 			goto failed_mount;
 		}
+		if (test_opt(sb, DAX)) {
+			ext4_msg(sb, KERN_ERR, "can't mount with "
+				 "both data=journal and dax");
+			goto failed_mount;
+		}
 		if (test_opt(sb, DELALLOC))
 			clear_opt(sb, DELALLOC);
 	}
@@ -3634,6 +3646,19 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 		goto failed_mount;
 	}
 
+	if (sbi->s_mount_opt & EXT4_MOUNT_DAX) {
+		if (blocksize != PAGE_SIZE) {
+			ext4_msg(sb, KERN_ERR,
+					"error: unsupported blocksize for dax");
+			goto failed_mount;
+		}
+		if (!sb->s_bdev->bd_disk->fops->direct_access) {
+			ext4_msg(sb, KERN_ERR,
+					"error: device does not support dax");
+			goto failed_mount;
+		}
+	}
+
 	if (sb->s_blocksize != blocksize) {
 		/* Validate the filesystem blocksize */
 		if (!sb_set_blocksize(sb, blocksize)) {
@@ -4836,6 +4861,18 @@ static int ext4_remount(struct super_block *sb, int *flags, char *data)
 			err = -EINVAL;
 			goto restore_opts;
 		}
+		if (test_opt(sb, DAX)) {
+			ext4_msg(sb, KERN_ERR, "can't mount with "
+				 "both data=journal and dax");
+			err = -EINVAL;
+			goto restore_opts;
+		}
+	}
+
+	if ((sbi->s_mount_opt ^ old_opts.s_mount_opt) & EXT4_MOUNT_DAX) {
+		ext4_msg(sb, KERN_WARNING, "warning: refusing change of "
+			"dax flag with busy inodes while remounting");
+		sbi->s_mount_opt ^= EXT4_MOUNT_DAX;
 	}
 
 	if (sbi->s_mount_flags & EXT4_MF_FS_ABORTED)
-- 
2.0.0


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v10 21/21] brd: Rename XIP to DAX
  2014-08-27  3:45 [PATCH v10 00/21] Support ext4 on NV-DIMMs Matthew Wilcox
                   ` (19 preceding siblings ...)
  2014-08-27  3:45 ` [PATCH v10 20/21] ext4: Add DAX functionality Matthew Wilcox
@ 2014-08-27  3:45 ` Matthew Wilcox
  2014-08-27 20:06 ` [PATCH v10 00/21] Support ext4 on NV-DIMMs Andrew Morton
                   ` (2 subsequent siblings)
  23 siblings, 0 replies; 52+ messages in thread
From: Matthew Wilcox @ 2014-08-27  3:45 UTC (permalink / raw)
  To: linux-fsdevel, linux-mm, linux-kernel; +Cc: Matthew Wilcox, Matthew Wilcox

From: Matthew Wilcox <willy@linux.intel.com>

Since this relates to FS_XIP, not KERNEL_XIP, it should be called
DAX instead of XIP.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
---
 drivers/block/Kconfig | 13 +++++++------
 drivers/block/brd.c   | 14 +++++++-------
 fs/Kconfig            |  4 ++--
 3 files changed, 16 insertions(+), 15 deletions(-)

diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
index 014a1cf..1b8094d 100644
--- a/drivers/block/Kconfig
+++ b/drivers/block/Kconfig
@@ -393,14 +393,15 @@ config BLK_DEV_RAM_SIZE
 	  The default value is 4096 kilobytes. Only change this if you know
 	  what you are doing.
 
-config BLK_DEV_XIP
-	bool "Support XIP filesystems on RAM block device"
-	depends on BLK_DEV_RAM
+config BLK_DEV_RAM_DAX
+	bool "Support Direct Access (DAX) to RAM block devices"
+	depends on BLK_DEV_RAM && FS_DAX
 	default n
 	help
-	  Support XIP filesystems (such as ext2 with XIP support on) on
-	  top of block ram device. This will slightly enlarge the kernel, and
-	  will prevent RAM block device backing store memory from being
+	  Support filesystems using DAX to access RAM block devices.  This
+	  avoids double-buffering data in the page cache before copying it
+	  to the block device.  Answering Y will slightly enlarge the kernel,
+	  and will prevent RAM block device backing store memory from being
 	  allocated from highmem (only a problem for highmem systems).
 
 config CDROM_PKTCDVD
diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index fee10bf..344681a 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -97,13 +97,13 @@ static struct page *brd_insert_page(struct brd_device *brd, sector_t sector)
 	 * Must use NOIO because we don't want to recurse back into the
 	 * block or filesystem layers from page reclaim.
 	 *
-	 * Cannot support XIP and highmem, because our ->direct_access
-	 * routine for XIP must return memory that is always addressable.
-	 * If XIP was reworked to use pfns and kmap throughout, this
+	 * Cannot support DAX and highmem, because our ->direct_access
+	 * routine for DAX must return memory that is always addressable.
+	 * If DAX was reworked to use pfns and kmap throughout, this
 	 * restriction might be able to be lifted.
 	 */
 	gfp_flags = GFP_NOIO | __GFP_ZERO;
-#ifndef CONFIG_BLK_DEV_XIP
+#ifndef CONFIG_BLK_DEV_RAM_DAX
 	gfp_flags |= __GFP_HIGHMEM;
 #endif
 	page = alloc_page(gfp_flags);
@@ -369,7 +369,7 @@ static int brd_rw_page(struct block_device *bdev, sector_t sector,
 	return err;
 }
 
-#ifdef CONFIG_BLK_DEV_XIP
+#ifdef CONFIG_BLK_DEV_RAM_DAX
 static long brd_direct_access(struct block_device *bdev, sector_t sector,
 			void **kaddr, unsigned long *pfn, long size)
 {
@@ -388,6 +388,8 @@ static long brd_direct_access(struct block_device *bdev, sector_t sector,
 	 * file happens to be mapped to the next page of physical RAM */
 	return PAGE_SIZE;
 }
+#else
+#define brd_direct_access NULL
 #endif
 
 static int brd_ioctl(struct block_device *bdev, fmode_t mode,
@@ -428,9 +430,7 @@ static const struct block_device_operations brd_fops = {
 	.owner =		THIS_MODULE,
 	.rw_page =		brd_rw_page,
 	.ioctl =		brd_ioctl,
-#ifdef CONFIG_BLK_DEV_XIP
 	.direct_access =	brd_direct_access,
-#endif
 };
 
 /*
diff --git a/fs/Kconfig b/fs/Kconfig
index a9eb53d..117900f 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -34,7 +34,7 @@ source "fs/btrfs/Kconfig"
 source "fs/nilfs2/Kconfig"
 
 config FS_DAX
-	bool "Direct Access support"
+	bool "Direct Access (DAX) support"
 	depends on MMU
 	help
 	  Direct Access (DAX) can be used on memory-backed block devices.
@@ -45,7 +45,7 @@ config FS_DAX
 
 	  If you do not have a block device that is capable of using this,
 	  or if unsure, say N.  Saying Y will increase the size of the kernel
-	  by about 2kB.
+	  by about 5kB.
 
 endif # BLOCK
 
-- 
2.0.0


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: [PATCH v10 00/21] Support ext4 on NV-DIMMs
  2014-08-27  3:45 [PATCH v10 00/21] Support ext4 on NV-DIMMs Matthew Wilcox
                   ` (20 preceding siblings ...)
  2014-08-27  3:45 ` [PATCH v10 21/21] brd: Rename XIP to DAX Matthew Wilcox
@ 2014-08-27 20:06 ` Andrew Morton
  2014-08-27 21:12   ` Matthew Wilcox
  2014-08-27 21:22   ` Christoph Lameter
  2014-08-28  8:08 ` Boaz Harrosh
  2014-09-03 12:05 ` [PATCH 1/1] xfs: add DAX support Dave Chinner
  23 siblings, 2 replies; 52+ messages in thread
From: Andrew Morton @ 2014-08-27 20:06 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-fsdevel, linux-mm, linux-kernel, willy

On Tue, 26 Aug 2014 23:45:20 -0400 Matthew Wilcox <matthew.r.wilcox@intel.com> wrote:

> One of the primary uses for NV-DIMMs is to expose them as a block device
> and use a filesystem to store files on the NV-DIMM.  While that works,
> it currently wastes memory and CPU time buffering the files in the page
> cache.  We have support in ext2 for bypassing the page cache, but it
> has some races which are unfixable in the current design.  This series
> of patches rewrite the underlying support, and add support for direct
> access to ext4.

Sat down to read all this but I'm finding it rather unwieldy - it's
just a great blob of code.  Is there some overall
what-it-does-and-how-it-does-it roadmap?

Some explanation of why one would use ext4 instead of, say,
suitably-modified ramfs/tmpfs/rd/etc?

Performance testing results?

Carsten Otte wrote filemap_xip.c and may be a useful reviewer of this
work.

All the patch subjects violate Documentation/SubmittingPatches
section 15 ;)

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v10 00/21] Support ext4 on NV-DIMMs
  2014-08-27 20:06 ` [PATCH v10 00/21] Support ext4 on NV-DIMMs Andrew Morton
@ 2014-08-27 21:12   ` Matthew Wilcox
  2014-08-27 21:46     ` Andrew Morton
  2014-08-27 21:22   ` Christoph Lameter
  1 sibling, 1 reply; 52+ messages in thread
From: Matthew Wilcox @ 2014-08-27 21:12 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Matthew Wilcox, linux-fsdevel, linux-mm, linux-kernel

On Wed, Aug 27, 2014 at 01:06:13PM -0700, Andrew Morton wrote:
> On Tue, 26 Aug 2014 23:45:20 -0400 Matthew Wilcox <matthew.r.wilcox@intel.com> wrote:
> 
> > One of the primary uses for NV-DIMMs is to expose them as a block device
> > and use a filesystem to store files on the NV-DIMM.  While that works,
> > it currently wastes memory and CPU time buffering the files in the page
> > cache.  We have support in ext2 for bypassing the page cache, but it
> > has some races which are unfixable in the current design.  This series
> > of patches rewrite the underlying support, and add support for direct
> > access to ext4.
> 
> Sat down to read all this but I'm finding it rather unwieldy - it's
> just a great blob of code.  Is there some overall
> what-it-does-and-how-it-does-it roadmap?

The overall goal is to map persistent memory / NV-DIMMs directly to
userspace.  We have that functionality in the XIP code, but the way
it's structured is unsuitable for filesystems like ext4 & XFS, and
it has some pretty ugly races.
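
To make that concrete, a purely illustrative userspace sketch (the path
is made up): an application mmaps a file on a DAX mount, and its loads
and stores go straight to the NV-DIMM pages, with no page cache copy in
between.

    /* illustrative sketch: store directly to persistent memory via mmap */
    #include <fcntl.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
            int fd = open("/mnt/pmem/log", O_CREAT | O_RDWR, 0644);
            char *p;

            if (fd < 0 || ftruncate(fd, 4096) < 0)
                    return 1;
            p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
            if (p == MAP_FAILED)
                    return 1;
            strcpy(p, "hello, pmem");       /* lands in the NV-DIMM page */
            msync(p, 4096, MS_SYNC);        /* durability, for now */
            munmap(p, 4096);
            close(fd);
            return 0;
    }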

Patches 1 & 3 are simply bug-fixes.  They should go in regardless of
the merits of anything else in this series.

Patch 2 changes the API for the direct_access block_device_operation so
it can report more than a single page at a time.  As the series evolved,
this work also included moving support for partitioning into the VFS
where it belongs, handling various error cases in the VFS and so on.
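
Roughly, a driver's ->direct_access under the changed convention looks
like the sketch below (the names are made up; see the brd conversion at
the end of the series for a real implementation):

    /* sketch: a driver whose device is one physically contiguous region */
    struct exdrv_device {                   /* made-up driver state */
            void *virt_base;                /* kernel mapping of the region */
            phys_addr_t phys_base;          /* physical start of the region */
            loff_t size;                    /* region size in bytes */
    };

    static long exdrv_direct_access(struct block_device *bdev, sector_t sector,
                            void **kaddr, unsigned long *pfn, long size)
    {
            struct exdrv_device *dev = bdev->bd_disk->private_data;
            loff_t offset = (loff_t)sector << 9;

            *kaddr = dev->virt_base + offset;
            *pfn = (dev->phys_base + offset) >> PAGE_SHIFT;
            return dev->size - offset;      /* bytes addressable from here */
    }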

Patch 4 is an optimisation.  It's poor form to make userspace take two
faults for the same dereference.

Patch 5 gives us a VFS flag for the DAX property, which lets us get rid of
the get_xip_mem() method later on.

Patch 6 is also prep work; Al Viro liked it enough that it's now in
his tree.

The new DAX code is then dribbled in over patches 7-11, split up by
functional area.  At each stage, the ext2-xip code is converted over to
the new DAX code.

Patches 12-18 delete the remnants of the old XIP code, and fix the things
in ext2 that Jan didn't like when he reviewed them for ext4 :-)

Patches 19 & 20 are the work to make ext4 use DAX.

Patch 21 is some final cleanup of references to the old XIP code, renaming
it all to DAX.

> Some explanation of why one would use ext4 instead of, say,
> suitably-modified ramfs/tmpfs/rd/etc?

ramfs and tmpfs really rely on the page cache.  They're not exactly
built for permanence either.  brd also relies on the page cache, and
there's a clear desire to use a filesystem instead of a block device
for all the usual reasons of access permissions, grow/shrink, etc.

Some people might want to use XFS instead of ext4.  We're starting with
ext4, but we've been keeping an eye on what other filesystems might want
to use.  btrfs isn't going to use the DAX code, but some of the other
pieces will probably come in handy.

There are also at least three people working on their own filesystems
specially designed for persistent memory.  I wish them all the best
... but I'd like to get this infrastructure into place.

> Performance testing results?

I haven't been running any performance tests.  What sort of performance
tests would be interesting for you to see?

> Carsten Otte wrote filemap_xip.c and may be a useful reviewer of this
> work.

I cc'd him on some earlier versions and didn't hear anything back.  It felt
rude to keep plying him with 20+ patches every month.

> All the patch subjects violate Documentation/SubmittingPatches
> section 15 ;)

errr ... which bit?  I used git format-patch to create them.


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v10 00/21] Support ext4 on NV-DIMMs
  2014-08-27 20:06 ` [PATCH v10 00/21] Support ext4 on NV-DIMMs Andrew Morton
  2014-08-27 21:12   ` Matthew Wilcox
@ 2014-08-27 21:22   ` Christoph Lameter
  2014-08-27 21:30     ` Andrew Morton
  1 sibling, 1 reply; 52+ messages in thread
From: Christoph Lameter @ 2014-08-27 21:22 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Matthew Wilcox, linux-fsdevel, linux-mm, linux-kernel, willy

On Wed, 27 Aug 2014, Andrew Morton wrote:

> Sat down to read all this but I'm finding it rather unwieldy - it's
> just a great blob of code.  Is there some overall
> what-it-does-and-how-it-does-it roadmap?

Matthew gave a talk about DAX at the kernel summit. Its a great feature
because this is another piece of the bare metal hardware technology that
is being improved by him.

> Some explanation of why one would use ext4 instead of, say,
> suitably-modified ramfs/tmpfs/rd/etc?

The NVDIMM contents survive reboot and therefore ramfs and friends won't
work with it.

> Performance testing results?

This obviously avoids kernel buffering and therefore decreases kernel
overhead for non-volatile memory. It avoids useless duplication of data
from the non-volatile memory into regular RAM and allows direct access
to non-volatile memory from user space in a controlled fashion.

I think this should be a priority item.



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v10 00/21] Support ext4 on NV-DIMMs
  2014-08-27 21:22   ` Christoph Lameter
@ 2014-08-27 21:30     ` Andrew Morton
  2014-08-27 23:04       ` One Thousand Gnomes
  2014-08-28  7:17       ` Dave Chinner
  0 siblings, 2 replies; 52+ messages in thread
From: Andrew Morton @ 2014-08-27 21:30 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Matthew Wilcox, linux-fsdevel, linux-mm, linux-kernel, willy

On Wed, 27 Aug 2014 16:22:20 -0500 (CDT) Christoph Lameter <cl@linux.com> wrote:

> > Some explanation of why one would use ext4 instead of, say,
> > suitably-modified ramfs/tmpfs/rd/etc?
> 
> > The NVDIMM contents survive reboot and therefore ramfs and friends won't
> work with it.

See "suitably modified".  Presumably this type of memory would need to
come from a particular page allocator zone.  ramfs would be unwieldy
due to its use of the dentry/inode caches, but rd/etc should be feasible.

I dunno, I'm not proposing implementations - I'm asking obvious
questions.  Stuff which should have been addressed in the changelogs
before one even starts to read the code...


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v10 00/21] Support ext4 on NV-DIMMs
  2014-08-27 21:12   ` Matthew Wilcox
@ 2014-08-27 21:46     ` Andrew Morton
  2014-08-28  1:30       ` Andy Lutomirski
  2014-08-28 15:45       ` Matthew Wilcox
  0 siblings, 2 replies; 52+ messages in thread
From: Andrew Morton @ 2014-08-27 21:46 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Matthew Wilcox, linux-fsdevel, linux-mm, linux-kernel

On Wed, 27 Aug 2014 17:12:50 -0400 Matthew Wilcox <willy@linux.intel.com> wrote:

> On Wed, Aug 27, 2014 at 01:06:13PM -0700, Andrew Morton wrote:
> > On Tue, 26 Aug 2014 23:45:20 -0400 Matthew Wilcox <matthew.r.wilcox@intel.com> wrote:
> > 
> > > One of the primary uses for NV-DIMMs is to expose them as a block device
> > > and use a filesystem to store files on the NV-DIMM.  While that works,
> > > it currently wastes memory and CPU time buffering the files in the page
> > > cache.  We have support in ext2 for bypassing the page cache, but it
> > > has some races which are unfixable in the current design.  This series
> > > of patches rewrite the underlying support, and add support for direct
> > > access to ext4.
> > 
> > Sat down to read all this but I'm finding it rather unwieldy - it's
> > just a great blob of code.  Is there some overall
> > what-it-does-and-how-it-does-it roadmap?
> 
> The overall goal is to map persistent memory / NV-DIMMs directly to
> userspace.  We have that functionality in the XIP code, but the way
> it's structured is unsuitable for filesystems like ext4 & XFS, and
> it has some pretty ugly races.

When thinking about looking at the patchset I wonder things like how
does mmap work, in what situations does a page get COWed, how do we
handle partial pages at EOF, etc.  I guess that's all part of the
filemap_xip legacy, the details of which I've totally forgotten.

> Patches 1 & 3 are simply bug-fixes.  They should go in regardless of
> the merits of anything else in this series.
> 
> Patch 2 changes the API for the direct_access block_device_operation so
> it can report more than a single page at a time.  As the series evolved,
> this work also included moving support for partitioning into the VFS
> where it belongs, handling various error cases in the VFS and so on.
> 
> Patch 4 is an optimisation.  It's poor form to make userspace take two
> faults for the same dereference.
> 
> Patch 5 gives us a VFS flag for the DAX property, which lets us get rid of
> the get_xip_mem() method later on.
> 
> Patch 6 is also prep work; Al Viro liked it enough that it's now in
> his tree.
> 
> The new DAX code is then dribbled in over patches 7-11, split up by
> functional area.  At each stage, the ext2-xip code is converted over to
> the new DAX code.
> 
> Patches 12-18 delete the remnants of the old XIP code, and fix the things
> in ext2 that Jan didn't like when he reviewed them for ext4 :-)
> 
> Patches 19 & 20 are the work to make ext4 use DAX.
> 
> Patch 21 is some final cleanup of references to the old XIP code, renaming
> it all to DAX.

hrm.

> > Some explanation of why one would use ext4 instead of, say,
> > suitably-modified ramfs/tmpfs/rd/etc?
> 
> ramfs and tmpfs really rely on the page cache.  They're not exactly
> built for permanence either.  brd also relies on the page cache, and
> there's a clear desire to use a filesystem instead of a block device
> for all the usual reasons of access permissions, grow/shrink, etc.
> 
> Some people might want to use XFS instead of ext4.  We're starting with
> ext4, but we've been keeping an eye on what other filesystems might want
> to use.  btrfs isn't going to use the DAX code, but some of the other
> pieces will probably come in handy.
> 
> There are also at least three people working on their own filesystems
> specially designed for persistent memory.  I wish them all the best
> ... but I'd like to get this infrastructure into place.

This is the sort of thing which first-timers (this one at least) like
to see in [0/n].

> > Performance testing results?
> 
> I haven't been running any performance tests.  What sort of performance
> tests would be interesting for you to see?

fs benchmarks?  `dd' would be a good start ;)

I assume (because I wasn't told!) that there are two objectives here:

1) reduce memory consumption by not maintaining pagecache and
2) reduce CPU cost by avoiding the double-copies.

These things are pretty easily quantified.  And really they must be
quantified as part of the developer testing, because if you find
they've worsened then holy cow, what went wrong.

> > Carsten Otte wrote filemap_xip.c and may be a useful reviewer of this
> > work.
> 
> I cc'd him on some earlier versions and didn't hear anything back.  It felt
> rude to keep plying him with 20+ patches every month.

OK.

> > All the patch subjects violate Documentation/SubmittingPatches
> > section 15 ;)
> 
> errr ... which bit?  I used git format-patch to create them.

None of the patch titles identify the subsystem(s) which they're
hitting.  eg, "Introduce IS_DAX(inode)" is an ext2 patch, but nobody
would know that from browsing the titles.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v10 00/21] Support ext4 on NV-DIMMs
  2014-08-27 21:30     ` Andrew Morton
@ 2014-08-27 23:04       ` One Thousand Gnomes
  2014-08-28  7:17       ` Dave Chinner
  1 sibling, 0 replies; 52+ messages in thread
From: One Thousand Gnomes @ 2014-08-27 23:04 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, Matthew Wilcox, linux-fsdevel, linux-mm,
	linux-kernel, willy

On Wed, 27 Aug 2014 14:30:55 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:

> On Wed, 27 Aug 2014 16:22:20 -0500 (CDT) Christoph Lameter <cl@linux.com> wrote:
> 
> > > Some explanation of why one would use ext4 instead of, say,
> > > suitably-modified ramfs/tmpfs/rd/etc?
> > 
> > The NVDIMM contents survive reboot and therefore ramfs and friends won't
> > work with it.
> 
> See "suitably modified".  Presumably this type of memory would need to
> come from a particular page allocator zone.  ramfs would be unwieldy
> due to its use of dentry/inode caches, but rd/etc should be feasible.

If you took one of the existing ramfs types you would then need to

- make it persistent in its storage, and put all the objects in the store
- add journalling for failures mid transaction. Your DIMM may retain its
  bits, but if your CPU resets mid fs operation it's got to be recovered
- write an fsck tool for it
- validate it

at which point it's probably turned into ext4 8)

It's persistent but that doesn't solve the 'my box crashed' problem. 

Alan

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v10 00/21] Support ext4 on NV-DIMMs
  2014-08-27 21:46     ` Andrew Morton
@ 2014-08-28  1:30       ` Andy Lutomirski
  2014-08-28 16:50         ` Matthew Wilcox
  2014-08-28 15:45       ` Matthew Wilcox
  1 sibling, 1 reply; 52+ messages in thread
From: Andy Lutomirski @ 2014-08-28  1:30 UTC (permalink / raw)
  To: Andrew Morton, Matthew Wilcox
  Cc: Matthew Wilcox, linux-fsdevel, linux-mm, linux-kernel

On 08/27/2014 02:46 PM, Andrew Morton wrote:
> I assume (because I wasn't told!) that there are two objectives here:
> 
> 1) reduce memory consumption by not maintaining pagecache and
> 2) reduce CPU cost by avoiding the double-copies.
> 
> These things are pretty easily quantified.  And really they must be
> quantified as part of the developer testing, because if you find
> they've worsened then holy cow, what went wrong.
> 

There are two more huge ones:

3) Writes via mmap are immediately durable (or at least they're durable
after a *very* lightweight flush).

4) No page faults ever once a page is writable (I hope -- I'm not sure
whether this series actually achieves that goal).

A note on #3: there is ongoing work to enable write-through memory for
things like this.  Once that's done, then writes via mmap might actually
be synchronously durable, depending on chipset details.

--Andy

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v10 00/21] Support ext4 on NV-DIMMs
  2014-08-27 21:30     ` Andrew Morton
  2014-08-27 23:04       ` One Thousand Gnomes
@ 2014-08-28  7:17       ` Dave Chinner
  2014-08-30 23:11         ` Christian Stroetmann
  1 sibling, 1 reply; 52+ messages in thread
From: Dave Chinner @ 2014-08-28  7:17 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, Matthew Wilcox, linux-fsdevel, linux-mm,
	linux-kernel, willy

On Wed, Aug 27, 2014 at 02:30:55PM -0700, Andrew Morton wrote:
> On Wed, 27 Aug 2014 16:22:20 -0500 (CDT) Christoph Lameter <cl@linux.com> wrote:
> 
> > > Some explanation of why one would use ext4 instead of, say,
> > > suitably-modified ramfs/tmpfs/rd/etc?
> > 
> > The NVDIMM contents survive reboot and therefore ramfs and friends won't
> > work with it.
> 
> See "suitably modified".  Presumably this type of memory would need to
> come from a particular page allocator zone.  ramfs would be unwieldy
> due to its use of dentry/inode caches, but rd/etc should be feasible.

<sigh>

That's where we started about two years ago with that horrible
pramfs trainwreck.

To start with: brd is a block device, not a filesystem. We still
need the filesystem on top of a persistent ram disk to make it
useful to applications. We can do this with ext4/XFS right now, and
that is the fundamental basis on which DAX is built.

For sake of the discussion, however, let's walk through what is
required to make an "existing" ramfs persistent. Persistence means we
can't just wipe it and start again if it gets corrupted, and
rebooting is not a fix for problems.  Hence we need to be able to
identify it, check it, repair it, ensure metadata operations are
persistent across machine crashes, etc, so there is all sorts of
management tools required by a persistent ramfs.

But most important of all: the persistent storage format needs to be
forwards and backwards compatible across kernel versions.  Hence we
can't encode any structure the kernel uses internally into the
persistent storage because they aren't stable structures.  That
means we need to marshall objects between the persistence domain and
the volatile domain in an orderly fashion.

We can avoid using the dentry/inode *caches* by freeing those
volatile objects the moment reference counts drop to zero rather than
putting them on LRUs. However, we can't store them in persistent
storage and we can't avoid using them to interface with the VFS, so
it makes little sense to burn CPU continually marshalling such
structures in and out of volatile memory if we have free RAM to do
so. So even with a "persistent ramfs" caching the working set of
volatile VFS objects makes sense from a performance point of view.

Then you've got crash recovery management: NVDIMMs are not
synchronous: they can still lose data while it is being written on
power loss. And we can't update persistent memory piecemeal as the
VFS code modifies metadata - there needs to be synchronisation
points, otherwise we will always have inconsistent metadata state in
persistent memory.

Persistent memory also can't do atomic writes across multiple,
disjoint CPU cachelines or NVDIMMs, and this is what is needed for
synchronisation points for multi-object metadata modification
operations to be consistent after a crash.  There is some work in
the nvme working groups to define this, but so far there hasn't been
any useful outcome, and then we will have to wait for CPUs to
implement those interfaces.

Hence the metadata that indexes the persistent RAM needs to use COW
techniques, use a log structure or use WAL (journalling).  Hence
that "persistent ramfs" is now looking much more like a database or
traditional filesystem.

Further, it's going to need to scale to very large amounts of
storage.  We're talking about machines with *tens of TB* of NVDIMM
capacity in the immediate future and so free space management and
concurrency of allocation and freeing of used space is going to be
fundamental to the performance of the persistent NVRAM filesystem.
So, you end up with block/allocation groups to subdivide the space.
Looking a lot like ext4 or XFS at this point.

And now you have to scale to indexing tens of millions of
everything. At least tens of millions - hundreds of millions to
billions is more likely, because storing tens of terabytes of small
files is going to require indexing billions of files. And because
there is no performance penalty for doing this, people will use the
filesystem as a great big database. So now you have to have
scalable posix compatible directory structures, scalable freespace
indexation, dynamic, scalable inode allocation, freeing, etc. Oh,
and it also needs to be highly concurrent to handle machines with
hundreds of CPU cores.

Funnily enough, we already have a couple of persistent storage
implementations that solve these problems to varying degrees. ext4
is one of them, if you ignore the scalability and concurrency
requirements. XFS is the other. And both will run unmodified on
a persistent ram block device, which we *already have*.

And so back to DAX. What users actually want from their high speed
persistent RAM storage is direct, cpu addressable access to that
persistent storage. They don't want to have to care about how to
find an object in the persistent storage - that's what filesystems
are for - they just want to be able to read and write to it
directly. That's what DAX does - it provides existing filesystems
a method for exposing direct access to the persistent RAM to
applications in a manner that application developers are already
familiar with. It's a win-win situation all round.

IOWs, ext4/XFS + DAX gets us to a place that is good enough for most
users and the hardware capabilities we expect to see in the next 5
years.  And hopefully that will be long enough to bring a purpose
built, next generation persistent memory filesystem to production
quality that can take full advantage of the technology...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v10 00/21] Support ext4 on NV-DIMMs
  2014-08-27  3:45 [PATCH v10 00/21] Support ext4 on NV-DIMMs Matthew Wilcox
                   ` (21 preceding siblings ...)
  2014-08-27 20:06 ` [PATCH v10 00/21] Support ext4 on NV-DIMMs Andrew Morton
@ 2014-08-28  8:08 ` Boaz Harrosh
  2014-08-28 22:09   ` Zwisler, Ross
  2014-09-03 12:05 ` [PATCH 1/1] xfs: add DAX support Dave Chinner
  23 siblings, 1 reply; 52+ messages in thread
From: Boaz Harrosh @ 2014-08-28  8:08 UTC (permalink / raw)
  To: Matthew Wilcox, linux-fsdevel, linux-mm, linux-kernel; +Cc: willy

On 08/27/2014 06:45 AM, Matthew Wilcox wrote:
> One of the primary uses for NV-DIMMs is to expose them as a block device
> and use a filesystem to store files on the NV-DIMM.  While that works,
> it currently wastes memory and CPU time buffering the files in the page
> cache.  We have support in ext2 for bypassing the page cache, but it
> has some races which are unfixable in the current design.  This series
> of patches rewrite the underlying support, and add support for direct
> access to ext4.
> 
> Note that patch 6/21 has been included in
> https://git.kernel.org/cgit/linux/kernel/git/viro/vfs.git/log/?h=for-next-candidate
> 

Matthew hi

Could you please push this to the regular or a new public tree?

(Old versions are at: https://github.com/01org/prd)

Thanks
Boaz

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v10 00/21] Support ext4 on NV-DIMMs
  2014-08-27 21:46     ` Andrew Morton
  2014-08-28  1:30       ` Andy Lutomirski
@ 2014-08-28 15:45       ` Matthew Wilcox
  1 sibling, 0 replies; 52+ messages in thread
From: Matthew Wilcox @ 2014-08-28 15:45 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Matthew Wilcox, linux-fsdevel, linux-mm, linux-kernel

On Wed, Aug 27, 2014 at 02:46:22PM -0700, Andrew Morton wrote:
> > > Sat down to read all this but I'm finding it rather unwieldy - it's
> > > just a great blob of code.  Is there some overall
> > > what-it-does-and-how-it-does-it roadmap?
> > 
> > The overall goal is to map persistent memory / NV-DIMMs directly to
> > userspace.  We have that functionality in the XIP code, but the way
> > it's structured is unsuitable for filesystems like ext4 & XFS, and
> > it has some pretty ugly races.
> 
> When thinking about looking at the patchset I wonder things like how
> does mmap work, in what situations does a page get COWed, how do we
> handle partial pages at EOF, etc.  I guess that's all part of the
> filemap_xip legacy, the details of which I've totally forgotten.

mmap works by installing a PTE that points to the storage.  This implies
that the NV-DIMM has to be the kind that always has everything mapped
(there are other types that require commands to be sent to move windows
around that point into the storage ... DAX is not for these types
of DIMMs).

We use a VM_MIXEDMAP vma.  The PTEs pointing to PFNs will just get
copied across on fork.  Read-faults on holes are covered by a read-only
page cache page.  On a write to a hole, any page cache page covering it
will be unmapped and evicted from the page cache.  The mapping for the
faulting task will be replaced with a mapping to the newly established
block, but other mappings will take a fresh fault on their next reference.

Partial pages are mmapable, just as they are with page-cache based
files.  You can even store beyond EOF, just as with page-cache files.
Those stores are, of course, going to end up in persistent memory, but they
might well end up being zeroed if the file is extended ... again, this
is no different to page-cache based files.

> > > Performance testing results?
> > 
> > I haven't been running any performance tests.  What sort of performance
> > tests would be interesting for you to see?
> 
> fs benchmarks?  `dd' would be a good start ;)
> 
> I assume (because I wasn't told!) that there are two objectives here:
> 
> 1) reduce memory consumption by not maintaining pagecache and
> 2) reduce CPU cost by avoiding the double-copies.
> 
> These things are pretty easily quantified.  And really they must be
> quantified as part of the developer testing, because if you find
> they've worsened then holy cow, what went wrong.

It's really a functionality argument; the users we anticipate for NV-DIMMs
really want to directly map them into memory and do a lot of work through
loads and stores with the kernel not being involved at all, so we don't
actually have any performance targets for things like read/write.
That said, when running xfstests and comparing results between ext4
with and without DAX, I do see many of the tests completing quicker
with DAX than without (others "run for thirty seconds" so there's no
time difference between with/without).

> None of the patch titles identify the subsystem(s) which they're
> hitting.  eg, "Introduce IS_DAX(inode)" is an ext2 patch, but nobody
> would know that from browsing the titles.

I actually see that one as being a VFS patch ... ext2 changing is just
a side-effect.  I can re-split that patch if desired.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v10 00/21] Support ext4 on NV-DIMMs
  2014-08-28  1:30       ` Andy Lutomirski
@ 2014-08-28 16:50         ` Matthew Wilcox
  0 siblings, 0 replies; 52+ messages in thread
From: Matthew Wilcox @ 2014-08-28 16:50 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andrew Morton, Matthew Wilcox, linux-fsdevel, linux-mm, linux-kernel

On Wed, Aug 27, 2014 at 06:30:27PM -0700, Andy Lutomirski wrote:
> 4) No page faults ever once a page is writable (I hope -- I'm not sure
> whether this series actually achieves that goal).

I can't think of a circumstance in which you'd end up taking a page fault
after a writable mapping is established.

The next part to this series (that I'm working on now) is PMD support.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v10 00/21] Support ext4 on NV-DIMMs
  2014-08-28  8:08 ` Boaz Harrosh
@ 2014-08-28 22:09   ` Zwisler, Ross
  0 siblings, 0 replies; 52+ messages in thread
From: Zwisler, Ross @ 2014-08-28 22:09 UTC (permalink / raw)
  To: openosd; +Cc: linux-kernel, linux-mm, willy, Wilcox, Matthew R, linux-fsdevel

On Thu, 2014-08-28 at 11:08 +0300, Boaz Harrosh wrote:
> On 08/27/2014 06:45 AM, Matthew Wilcox wrote:
> > One of the primary uses for NV-DIMMs is to expose them as a block device
> > and use a filesystem to store files on the NV-DIMM.  While that works,
> > it currently wastes memory and CPU time buffering the files in the page
> > cache.  We have support in ext2 for bypassing the page cache, but it
> > has some races which are unfixable in the current design.  This series
> > of patches rewrite the underlying support, and add support for direct
> > access to ext4.
> > 
> > Note that patch 6/21 has been included in
> > https://git.kernel.org/cgit/linux/kernel/git/viro/vfs.git/log/?h=for-next-candidate
> > 
> 
> Matthew hi
> 
> Could you please push this to the regular or a new public tree?
> 
> (Old versions are at: https://github.com/01org/prd)
> 
> Thanks
> Boaz

Hi Boaz,

I've pushed the updated tree to https://github.com/01org/prd in the master
branch.  All the older versions of the code that we've had while rebasing are
still available in their own branches.

Thanks,
- Ross

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v10 00/21] Support ext4 on NV-DIMMs
  2014-08-28  7:17       ` Dave Chinner
@ 2014-08-30 23:11         ` Christian Stroetmann
  0 siblings, 0 replies; 52+ messages in thread
From: Christian Stroetmann @ 2014-08-30 23:11 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Andrew Morton, Christoph Lameter, Matthew Wilcox, linux-fsdevel,
	linux-mm, linux-kernel, willy

On the 28th of August 2014 at 09:17, Dave Chinner wrote:
> On Wed, Aug 27, 2014 at 02:30:55PM -0700, Andrew Morton wrote:
>> On Wed, 27 Aug 2014 16:22:20 -0500 (CDT) Christoph Lameter<cl@linux.com>  wrote:
>>
>>>> Some explanation of why one would use ext4 instead of, say,
>>>> suitably-modified ramfs/tmpfs/rd/etc?
>>> The NVDIMM contents survive reboot and therefore ramfs and friends won't
>>> work with it.
>> See "suitably modified".  Presumably this type of memory would need to
>> come from a particular page allocator zone.  ramfs would be unwieldy
>> due to its use of dentry/inode caches, but rd/etc should be feasible.
> <sigh>

Hello Dave and the others

Thank you very much for your patience and for the following summary.

> That's where we started about two years ago with that horrible
> pramfs trainwreck.
>
> To start with: brd is a block device, not a filesystem. We still
> need the filesystem on top of a persistent ram disk to make it
> useful to applications. We can do this with ext4/XFS right now, and
> that is the fundamental basis on which DAX is built.
>
> For sake of the discussion, however, let's walk through what is
> required to make an "existing" ramfs persistent. Persistence means we
> can't just wipe it and start again if it gets corrupted, and
> rebooting is not a fix for problems.  Hence we need to be able to
> identify it, check it, repair it, ensure metadata operations are
> persistent across machine crashes, etc, so there is all sorts of
> management tools required by a persistent ramfs.
>
> But most important of all: the persistent storage format needs to be
> forwards and backwards compatible across kernel versions.  Hence we
> can't encode any structure the kernel uses internally into the
> persistent storage because they aren't stable structures.  That
> means we need to marshall objects between the persistence domain and
> the volatile domain in an orderly fashion.

Two little questions:
1. If we were to omit the compatibility across kernel versions, purely for
the sake of argument, would it make sense at all to encode a structure that
the kernel uses internally, and what advantages could be gained that way?
2. Have the said structures used by the kernel really changed that many times?

> We can avoid using the dentry/inode *caches* by freeing those
> volatile objects the moment reference counts drop to zero rather than
> putting them on LRUs. However, we can't store them in persistent
> storage and we can't avoid using them to interface with the VFS, so
> it makes little sense to burn CPU continually marshalling such
> structures in and out of volatile memory if we have free RAM to do
> so. So even with a "persistent ramfs" caching the working set of
> volatile VFS objects makes sense from a performance point of view.

I am sorry to say so, but I am confused again and do not understand this
argument, because we are already talking about NVDIMMs here. If we have
those volatile VFS objects in the NVDIMMs, so to speak, then we have them
in persistent storage and in DRAM at the same time.

>
> Then you've got crash recovery management: NVDIMMs are not
> synchronous: they can still lose data while it is being written on
> power loss. And we can't update persistent memory piecemeal as the
> VFS code modifies metadata - there needs to be synchronisation
> points, otherwise we will always have inconsistent metadata state in
> persistent memory.
>
> Persistent memory also can't do atomic writes across multiple,
> disjoint CPU cachelines or NVDIMMs, and this is what is needed for
> synchronisation points for multi-object metadata modification
> operations to be consistent after a crash.  There is some work in
> the nvme working groups to define this, but so far there hasn't been
> any useful outcome, and then we will have to wait for CPUs to
> implement those interfaces.
>
> Hence the metadata that indexes the persistent RAM needs to use COW
> techniques, use a log structure or use WAL (journalling).  Hence
> that "persistent ramfs" is now looking much more like a database or
> traditional filesystem.
>
> Further, it's going to need to scale to very large amounts of
> storage.  We're talking about machines with *tens of TB* of NVDIMM
> capacity in the immediate future and so free space management and
> concurrency of allocation and freeing of used space is going to be
> fundamental to the performance of the persistent NVRAM filesystem.
> So, you end up with block/allocation groups to subdivide the space.
> Looking a lot like ext4 or XFS at this point.
>
> And now you have to scale to indexing tens of millions of
> everything. At least tens of millions - hundreds of millions to
> billions is more likely, because storing tens of terabytes of small
> files is going to require indexing billions of files. And because
> there is no performance penalty for doing this, people will use the
> filesystem as a great big database. So now you have to have a
> scalable posix compatible directory structures, scalable freespace
> indexation, dynamic, scalable inode allocation, freeing, etc. Oh,
> and it also needs to be highly concurrent to handle machines with
> hundreds of CPU cores.
>
> Funnily enough, we already have a couple of persistent storage
> implementations that solve these problems to varying degrees. ext4
> is one of them, if you ignore the scalability and concurrency
> requirements. XFS is the other. And both will run unmodified on
> a persistent ram block device, which we *already have*.

Yeah! :D

>
> And so back to DAX. What users actually want from their high speed
> persistent RAM storage is direct, cpu addressable access to that
> persistent storage. They don't want to have to care about how to
> find an object in the persistent storage - that's what filesystems
> are for - they just want to be able to read and write to it
> directly. That's what DAX does - it provides existing filesystems
> a method for exposing direct access to the persistent RAM to
> applications in a manner that application developers are already
> familiar with. It's a win-win situation all round.
>
> IOWs, ext4/XFS + DAX gets us to a place that is good enough for most
> users and the hardware capabilities we expect to see in the next 5
> years.  And hopefully that will be long enough to bring a purpose
> built, next generation persistent memory filesystem to production
> quality that can take full advantage of the technology...

If possible, could you please give a similarly good summary, or a sketch,
of the future development path and system architecture?
What would this purpose-built, next generation persistent memory filesystem
look like?
How would it differ from the DAX + FS approach, and which advantages would
it offer?
Would it be some kind of object storage system that possibly uses the said
structures used by the kernel (see the two little questions above)?
Do we have to keep the term "file" for everything?

>
> Cheers,
>
> Dave.

With all the best
Christian Stroetmann


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v10 09/21] Replace the XIP page fault handler with the DAX page fault handler
  2014-08-27  3:45 ` [PATCH v10 09/21] Replace the XIP page fault handler with the DAX page fault handler Matthew Wilcox
@ 2014-09-03  7:47   ` Dave Chinner
  2014-09-10 15:23     ` Matthew Wilcox
  0 siblings, 1 reply; 52+ messages in thread
From: Dave Chinner @ 2014-09-03  7:47 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-fsdevel, linux-mm, linux-kernel, willy

On Tue, Aug 26, 2014 at 11:45:29PM -0400, Matthew Wilcox wrote:
> Instead of calling aops->get_xip_mem from the fault handler, the
> filesystem passes a get_block_t that is used to find the appropriate
> blocks.
> 
> Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
> Reviewed-by: Jan Kara <jack@suse.cz>

There's a problem in this code to do with faults into unwritten
extents.

> +static int do_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
> +			get_block_t get_block)
> +{
> +	struct file *file = vma->vm_file;
> +	struct inode *inode = file_inode(file);
> +	struct address_space *mapping = file->f_mapping;
> +	struct page *page;
> +	struct buffer_head bh;
> +	unsigned long vaddr = (unsigned long)vmf->virtual_address;
> +	unsigned blkbits = inode->i_blkbits;
> +	sector_t block;
> +	pgoff_t size;
> +	unsigned long pfn;
> +	int error;
> +	int major = 0;
> +
> +	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
> +	if (vmf->pgoff >= size)
> +		return VM_FAULT_SIGBUS;
> +
> +	memset(&bh, 0, sizeof(bh));
> +	block = (sector_t)vmf->pgoff << (PAGE_SHIFT - blkbits);
> +	bh.b_size = PAGE_SIZE;
> +
> + repeat:
> +	page = find_get_page(mapping, vmf->pgoff);
> +	if (page) {
> +		if (!lock_page_or_retry(page, vma->vm_mm, vmf->flags)) {
> +			page_cache_release(page);
> +			return VM_FAULT_RETRY;
> +		}
> +		if (unlikely(page->mapping != mapping)) {
> +			unlock_page(page);
> +			page_cache_release(page);
> +			goto repeat;
> +		}
> +	}
> +
> +	error = get_block(inode, block, &bh, 0);
> +	if (!error && (bh.b_size < PAGE_SIZE))
> +		error = -EIO;
> +	if (error)
> +		goto unlock_page;

page fault into unwritten region, returns buffer_unwritten(bh) ==
true. Hence buffer_written(bh) is false, and we take this branch:

> +	if (!buffer_written(&bh) && !vmf->cow_page) {
> +		if (vmf->flags & FAULT_FLAG_WRITE) {
> +			error = get_block(inode, block, &bh, 1);

Exactly what are you expecting to happen here? We don't do
allocation because there are already unwritten blocks over this
extent, and so bh will be unchanged when returning. i.e. it will
still be mapping an unwritten extent.

There's another issue here, too. Allocating the block sets
buffer_new, and we crash before the block is zeroed. Stale data is
exposed to the user if the allocation transaction has already hit
the log. i.e. at minimum data corruption, at worst we just exposed
the contents of /etc/shadow....

....

> +	if (buffer_unwritten(&bh) || buffer_new(&bh))
> +		dax_clear_blocks(inode, bh.b_blocknr, bh.b_size);

Back to unwritten extents, we zero the block here, but the
filesystem still thinks it's an unwritten extent. There's been no IO
completion for the filesystem to mark the extent as containing valid
data.

We do this properly for the dax_do_io() path, but we do not do it
properly in the fault path.

Back to that stale exposure bug: to avoid this stale data exposure,
XFS allocates unwritten extents when doing direct allocation into
holes, then uses IO completion to convert them to written. For DAX,
we are doing direct allocation for page faults (as delayed allocation
makes no sense at all) as well as the IO path, and so we have need
for IO completion callbacks after zeroing just like we do for a
write() via dax_do_io().

Now, I think we can do this pretty easily - the bufferhead has an
endio callback we can use for exactly this purpose. i.e. if the
extent mapping bh is unwritten and the mapping bh->b_end_io is
present, then that end io function needs to be called after
dax_clear_blocks() has run. This will allow the filesystem to then
mark the extents are written, and we have no stale data exposure
issues at all.

In case you hadn't guessed, mmap write IO via DAX doesn't work at
all on XFS with this code. The patch below adds the end_io callback
that makes things work for XFS. I haven't changed the second
get_block() call, but that needs to be removed for unwritten
extents found during the initial lookup (i.e. page fault into
preallocated space).

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

dax: add IO completion callback for page faults

From: Dave Chinner <dchinner@redhat.com>

When a page fault drops into a hole, it needs to allocate an extent.
Filesystems may allocate unwritten extents so that the underlying
contents are not exposed until data is written to the extent. In
that case, we need an io completion callback to run once the blocks
have been zeroed to indicate that it is safe for the filesystem to
mark those blocks written without exposing stale data in the event
of a crash.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/dax.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/fs/dax.c b/fs/dax.c
index 96c4fed..387ca78 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -306,6 +306,7 @@ static int do_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 	memset(&bh, 0, sizeof(bh));
 	block = (sector_t)vmf->pgoff << (PAGE_SHIFT - blkbits);
 	bh.b_size = PAGE_SIZE;
+	bh.b_end_io = NULL;
 
  repeat:
 	page = find_get_page(mapping, vmf->pgoff);
@@ -364,8 +365,12 @@ static int do_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 		return VM_FAULT_LOCKED;
 	}
 
-	if (buffer_unwritten(&bh) || buffer_new(&bh))
+	if (buffer_unwritten(&bh) || buffer_new(&bh)) {
+		/* XXX: errors zeroing the blocks are propagated how? */
 		dax_clear_blocks(inode, bh.b_blocknr, bh.b_size);
+		if (bh.b_end_io)
+			bh.b_end_io(&bh, 1);
+	}
 
 	/* Check we didn't race with a read fault installing a new page */
 	if (!page && major)

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: [PATCH v10 19/21] xip: Add xip_zero_page_range
  2014-08-27  3:45 ` [PATCH v10 19/21] xip: Add xip_zero_page_range Matthew Wilcox
@ 2014-09-03  9:21   ` Dave Chinner
  2014-09-04 21:08     ` Matthew Wilcox
  0 siblings, 1 reply; 52+ messages in thread
From: Dave Chinner @ 2014-09-03  9:21 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-fsdevel, linux-mm, linux-kernel, willy, Ross Zwisler

On Tue, Aug 26, 2014 at 11:45:39PM -0400, Matthew Wilcox wrote:
> This new function allows us to support hole-punch for XIP files by zeroing
> a partial page, as opposed to the xip_truncate_page() function which can
> only truncate to the end of the page.  Reimplement xip_truncate_page() as
> a macro that calls xip_zero_page_range().
> 
> Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
> [ported to 3.13-rc2]
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> ---
>  Documentation/filesystems/dax.txt |  1 +
>  fs/dax.c                          | 20 ++++++++++++++------
>  include/linux/fs.h                |  9 ++++++++-
>  3 files changed, 23 insertions(+), 7 deletions(-)
> 
> diff --git a/Documentation/filesystems/dax.txt b/Documentation/filesystems/dax.txt
> index 635adaa..ebcd97f 100644
> --- a/Documentation/filesystems/dax.txt
> +++ b/Documentation/filesystems/dax.txt
> @@ -62,6 +62,7 @@ Filesystem support consists of
>    for fault and page_mkwrite (which should probably call dax_fault() and
>    dax_mkwrite(), passing the appropriate get_block() callback)
>  - calling dax_truncate_page() instead of block_truncate_page() for DAX files
> +- calling dax_zero_page_range() instead of zero_user() for DAX files
>  - ensuring that there is sufficient locking between reads, writes,
>    truncates and page faults
>  
> diff --git a/fs/dax.c b/fs/dax.c
> index d54f7d3..96c4fed 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -445,13 +445,16 @@ int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
>  EXPORT_SYMBOL_GPL(dax_fault);
>  
>  /**
> - * dax_truncate_page - handle a partial page being truncated in a DAX file
> + * dax_zero_page_range - zero a range within a page of a DAX file
>   * @inode: The file being truncated
>   * @from: The file offset that is being truncated to
> + * @length: The number of bytes to zero
>   * @get_block: The filesystem method used to translate file offsets to blocks
>   *
> - * Similar to block_truncate_page(), this function can be called by a
> - * filesystem when it is truncating an DAX file to handle the partial page.
> + * This function can be called by a filesystem when it is zeroing part of a
> + * page in a DAX file.  This is intended for hole-punch operations.  If
> + * you are truncating a file, the helper function dax_truncate_page() may be
> + * more convenient.
>   *
>   * We work in terms of PAGE_CACHE_SIZE here for commonality with
>   * block_truncate_page(), but we could go down to PAGE_SIZE if the filesystem
> @@ -459,12 +462,12 @@ EXPORT_SYMBOL_GPL(dax_fault);
>   * block size is smaller than PAGE_SIZE, we have to zero the rest of the page
>   * since the file might be mmaped.
>   */
> -int dax_truncate_page(struct inode *inode, loff_t from, get_block_t get_block)
> +int dax_zero_page_range(struct inode *inode, loff_t from, unsigned length,
> +							get_block_t get_block)
>  {
>  	struct buffer_head bh;
>  	pgoff_t index = from >> PAGE_CACHE_SHIFT;
>  	unsigned offset = from & (PAGE_CACHE_SIZE-1);
> -	unsigned length = PAGE_CACHE_ALIGN(from) - from;
>  	int err;
>  
>  	/* Block boundary? Nothing to do */
> @@ -481,9 +484,14 @@ int dax_truncate_page(struct inode *inode, loff_t from, get_block_t get_block)
>  		err = dax_get_addr(&bh, &addr, inode->i_blkbits);
>  		if (err < 0)
>  			return err;
> +		/*
> +		 * ext4 sometimes asks to zero past the end of a block.  It
> +		 * really just wants to zero to the end of the block.
> +		 */
> +		length = min_t(unsigned, length, PAGE_CACHE_SIZE - offset);
>  		memset(addr + offset, 0, length);

Sorry, what?

You introduce that bug with the way dax_truncate_page() is redefined
later on in this patch to always pass PAGE_CACHE_SIZE as the length
into the function. That's hardly an ext4 bug....

>  	}
>  
>  	return 0;
>  }
> -EXPORT_SYMBOL_GPL(dax_truncate_page);
> +EXPORT_SYMBOL_GPL(dax_zero_page_range);
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index e6b48cc..b0078df 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2490,6 +2490,7 @@ extern int nonseekable_open(struct inode * inode, struct file * filp);
>  
>  #ifdef CONFIG_FS_DAX
>  int dax_clear_blocks(struct inode *, sector_t block, long size);
> +int dax_zero_page_range(struct inode *, loff_t from, unsigned len, get_block_t);
>  int dax_truncate_page(struct inode *, loff_t from, get_block_t);

It's still defined as a function that doesn't exist now....

>  ssize_t dax_do_io(int rw, struct kiocb *, struct inode *, struct iov_iter *,
>  		loff_t, get_block_t, dio_iodone_t, int flags);
> @@ -2501,7 +2502,8 @@ static inline int dax_clear_blocks(struct inode *i, sector_t blk, long sz)
>  	return 0;
>  }
>  
> -static inline int dax_truncate_page(struct inode *i, loff_t frm, get_block_t gb)
> +static inline int dax_zero_page_range(struct inode *inode, loff_t from,
> +						unsigned len, get_block_t gb)
>  {
>  	return 0;
>  }
> @@ -2514,6 +2516,11 @@ static inline ssize_t dax_do_io(int rw, struct kiocb *iocb,
>  }
>  #endif
>  
> +/* Can't be a function because PAGE_CACHE_SIZE is defined in pagemap.h */
> +#define dax_truncate_page(inode, from, get_block)	\
> +	dax_zero_page_range(inode, from, PAGE_CACHE_SIZE, get_block)

And then redefined as a macro here. This is wrong, IMO,
dax_truncate_page() should remain as a function and it should
correctly calculate how much of the page should be trimmed, not
leave landmines that other code has to clean up...

(Yup, I'm tracking down a truncate bug in XFS from fsx...)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v10 20/21] ext4: Add DAX functionality
  2014-08-27  3:45 ` [PATCH v10 20/21] ext4: Add DAX functionality Matthew Wilcox
@ 2014-09-03 11:13   ` Dave Chinner
  2014-09-10 16:49     ` Boaz Harrosh
  0 siblings, 1 reply; 52+ messages in thread
From: Dave Chinner @ 2014-09-03 11:13 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-fsdevel, linux-mm, linux-kernel, Ross Zwisler, willy

On Tue, Aug 26, 2014 at 11:45:40PM -0400, Matthew Wilcox wrote:
> From: Ross Zwisler <ross.zwisler@linux.intel.com>
> 
> This is a port of the DAX functionality found in the current version of
> ext2.
....
> diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
> index e75f840..fa9ec8d 100644
> --- a/fs/ext4/indirect.c
> +++ b/fs/ext4/indirect.c
> @@ -691,14 +691,22 @@ retry:
>  			inode_dio_done(inode);
>  			goto locked;
>  		}
> -		ret = __blockdev_direct_IO(rw, iocb, inode,
> -				 inode->i_sb->s_bdev, iter, offset,
> -				 ext4_get_block, NULL, NULL, 0);
> +		if (IS_DAX(inode))
> +			ret = dax_do_io(rw, iocb, inode, iter, offset,
> +					ext4_get_block, NULL, 0);
> +		else
> +			ret = __blockdev_direct_IO(rw, iocb, inode,
> +					inode->i_sb->s_bdev, iter, offset,
> +					ext4_get_block, NULL, NULL, 0);
>  		inode_dio_done(inode);
>  	} else {
>  locked:
> -		ret = blockdev_direct_IO(rw, iocb, inode, iter,
> -				 offset, ext4_get_block);
> +		if (IS_DAX(inode))
> +			ret = dax_do_io(rw, iocb, inode, iter, offset,
> +					ext4_get_block, NULL, DIO_LOCKING);
> +		else
> +			ret = blockdev_direct_IO(rw, iocb, inode, iter,
> +					offset, ext4_get_block);
>  
>  		if (unlikely((rw & WRITE) && ret < 0)) {
>  			loff_t isize = i_size_read(inode);

When direct IO fails ext4 falls back to buffered IO, right? And
dax_do_io() can return partial writes, yes?

So that means if you get, say, ENOSPC part way through a DAX write,
ext4 can start dirtying the page cache from
__generic_file_write_iter() because the DAX write didn't wholly
complete? And say this ENOSPC races with space being freed from
another inode, then the buffered write will succeed and we'll end up
with coherency issues, right?

This is not an idle question - XFS is firing asserts all over the
place when doing ENOSPC testing because DAX is returning partial
writes and the XFS direct IO code is expecting them to either wholly
complete or wholly fail. I can make the DAX variant allow partial
writes, but I'm not going to add a useless fallback to buffered IO
for XFS when the (fully featured) direct allocation fails.

Indeed, I note that in the dax_fault code, any page found in the
page cache is explicitly removed and released, and the direct mapped
block replaces that page in the vma. IOWs, this code expects pages
to be clean as we're only supposed to have regions covered by holes
using cached pages (dax_load_hole()). So if we've done a buffered
write, we're going to toss out dirty pages the moment there is a
page fault on the range and map the unmodified backing store in
instead.

That just seems wrong. Maybe I've forgotten something, but this
looks like a wart that we don't need and shouldn't bake into this
interface as both ext4 and XFS can allocate into holes and extend
files from the direct IO interfaces. Of course, correct me if
I'm wrong about ext4 capabilities...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH 1/1] xfs: add DAX support
  2014-08-27  3:45 [PATCH v10 00/21] Support ext4 on NV-DIMMs Matthew Wilcox
                   ` (22 preceding siblings ...)
  2014-08-28  8:08 ` Boaz Harrosh
@ 2014-09-03 12:05 ` Dave Chinner
  23 siblings, 0 replies; 52+ messages in thread
From: Dave Chinner @ 2014-09-03 12:05 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-fsdevel, linux-mm, linux-kernel, willy


From: Dave Chinner <dchinner@redhat.com>

Add initial DAX support to XFS. This is EXPERIMENTAL, and it *will*
eat your data. You have been warned, and will be repeatedly warned
if you try to use it:

# mount -o dax /dev/ram0 /mnt/test
[ 2539.332402] XFS (ram0): DAX enabled. Warning: EXPERIMENTAL, use
at your own risk
[ 2539.334625] XFS (ram0): Mounting V5 Filesystem
[ 2539.338604] XFS (ram0): Ending clean mount


Notes:
	- uses a temporary mount option to enable. Needs to be able
	  to detect the capability automatically and switch it on
	  on demand. Mount option will go away once pmem devices
	  are in use and detectable.
	- needs per-inode flags to mark inodes as DAX enabled, and
	  an inheritance flag to enable automatic filesystem
	  propagation of the property
	- passes most of xfstests
	- fails occasionally with zero length writes instead of
	  ENOSPC errors, so error propagation inside/from the DAX
	  code needs work
	- no performance testing has been done
	- no stress testing has been done
	- no significant data correctness testing has been done
	- no crash recovery testing has been done (outside what
	  xfstests does)

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_aops.c      | 131 ++++++++++++++++++++++++++++++++----------
 fs/xfs/xfs_aops.h      |   7 ++-
 fs/xfs/xfs_bmap_util.c |  23 ++++++--
 fs/xfs/xfs_file.c      | 151 ++++++++++++++++++++++++++++++++++---------------
 fs/xfs/xfs_iops.c      |  34 ++++++-----
 fs/xfs/xfs_iops.h      |   6 ++
 fs/xfs/xfs_mount.h     |   2 +
 fs/xfs/xfs_super.c     |  25 +++++++-
 8 files changed, 280 insertions(+), 99 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index b984647..67b76b8 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -1233,13 +1233,44 @@ xfs_vm_releasepage(
 	return try_to_free_buffers(page);
 }
 
+/*
+ * For DAX we need a mapping buffer callback for unwritten extent conversion
+ * when page faults allocate blocks and then zero them.
+ */
+static void
+xfs_dax_unwritten_end_io(
+	struct buffer_head	*bh,
+	int			uptodate)
+{
+	struct xfs_ioend	*ioend = bh->b_private;
+	struct xfs_inode	*ip = XFS_I(ioend->io_inode);
+	int			error;
+
+	ASSERT(IS_DAX(ioend->io_inode));
+
+	/* if there was an error zeroing, then don't convert it */
+	if (!uptodate)
+		goto out_free;
+
+	error = xfs_iomap_write_unwritten(ip, ioend->io_offset, ioend->io_size);
+	if (error)
+		xfs_warn(ip->i_mount,
+"%s: conversion failed, ino 0x%llx, offset 0x%llx, len 0x%lx, error %d\n",
+			__func__, ip->i_ino, ioend->io_offset,
+			ioend->io_size, error);
+out_free:
+	mempool_free(ioend, xfs_ioend_pool);
+
+}
+
 STATIC int
 __xfs_get_blocks(
 	struct inode		*inode,
 	sector_t		iblock,
 	struct buffer_head	*bh_result,
 	int			create,
-	int			direct)
+	bool			direct,
+	bool			clear)
 {
 	struct xfs_inode	*ip = XFS_I(inode);
 	struct xfs_mount	*mp = ip->i_mount;
@@ -1304,6 +1335,7 @@ __xfs_get_blocks(
 			if (error)
 				return error;
 			new = 1;
+
 		} else {
 			/*
 			 * Delalloc reservations do not require a transaction,
@@ -1340,7 +1372,20 @@ __xfs_get_blocks(
 		if (create || !ISUNWRITTEN(&imap))
 			xfs_map_buffer(inode, bh_result, &imap, offset);
 		if (create && ISUNWRITTEN(&imap)) {
-			if (direct) {
+			if (clear) {
+				/*
+				 * DAX needs a special io completion for
+				 * clearing the buffer. Abuse the xfs_ioend for
+				 * this.
+				 */
+				struct xfs_ioend *ioend;
+
+				ioend = xfs_alloc_ioend(inode, XFS_IO_UNWRITTEN);
+				ioend->io_offset = offset;
+				ioend->io_size = size;
+				bh_result->b_end_io = xfs_dax_unwritten_end_io;
+				bh_result->b_private = ioend;
+			} else if (direct) {
 				bh_result->b_private = inode;
 				set_buffer_defer_completion(bh_result);
 			}
@@ -1425,7 +1470,7 @@ xfs_get_blocks(
 	struct buffer_head	*bh_result,
 	int			create)
 {
-	return __xfs_get_blocks(inode, iblock, bh_result, create, 0);
+	return __xfs_get_blocks(inode, iblock, bh_result, create, false, false);
 }
 
 STATIC int
@@ -1435,7 +1480,17 @@ xfs_get_blocks_direct(
 	struct buffer_head	*bh_result,
 	int			create)
 {
-	return __xfs_get_blocks(inode, iblock, bh_result, create, 1);
+	return __xfs_get_blocks(inode, iblock, bh_result, create, true, false);
+}
+
+int
+xfs_get_blocks_dax(
+	struct inode		*inode,
+	sector_t		iblock,
+	struct buffer_head	*bh_result,
+	int			create)
+{
+	return __xfs_get_blocks(inode, iblock, bh_result, create, true, true);
 }
 
 /*
@@ -1482,6 +1537,30 @@ xfs_end_io_direct_write(
 	xfs_finish_ioend_sync(ioend);
 }
 
+static inline ssize_t
+xfs_vm_do_dio(
+	struct inode		*inode,
+	int			rw,
+	struct kiocb		*iocb,
+	struct iov_iter		*iter,
+	loff_t			offset,
+	void			(*endio)(struct kiocb	*iocb,
+					 loff_t		offset,
+					 ssize_t	size,
+					 void		*private),
+	int			flags)
+{
+	struct block_device	*bdev;
+
+	if (IS_DAX(inode))
+		return dax_do_io(rw, iocb, inode, iter, offset,
+				 xfs_get_blocks_direct, endio, 0);
+
+	bdev = xfs_find_bdev_for_inode(inode);
+	return  __blockdev_direct_IO(rw, iocb, inode, bdev, iter, offset,
+				     xfs_get_blocks_direct, endio, NULL, flags);
+}
+
 STATIC ssize_t
 xfs_vm_direct_IO(
 	int			rw,
@@ -1490,39 +1569,29 @@ xfs_vm_direct_IO(
 	loff_t			offset)
 {
 	struct inode		*inode = iocb->ki_filp->f_mapping->host;
-	struct block_device	*bdev = xfs_find_bdev_for_inode(inode);
 	struct xfs_ioend	*ioend = NULL;
 	ssize_t			ret;
+	size_t			size;
 
-	if (rw & WRITE) {
-		size_t size = iov_iter_count(iter);
+	if (rw & READ)
+		return xfs_vm_do_dio(inode, rw, iocb, iter, offset, NULL, 0);
 
-		/*
-		 * We cannot preallocate a size update transaction here as we
-		 * don't know whether allocation is necessary or not. Hence we
-		 * can only tell IO completion that one is necessary if we are
-		 * not doing unwritten extent conversion.
-		 */
-		iocb->private = ioend = xfs_alloc_ioend(inode, XFS_IO_DIRECT);
-		if (offset + size > XFS_I(inode)->i_d.di_size)
-			ioend->io_isdirect = 1;
-
-		ret = __blockdev_direct_IO(rw, iocb, inode, bdev, iter,
-					    offset, xfs_get_blocks_direct,
-					    xfs_end_io_direct_write, NULL,
-					    DIO_ASYNC_EXTEND);
-		if (ret != -EIOCBQUEUED && iocb->private)
-			goto out_destroy_ioend;
-	} else {
-		ret = __blockdev_direct_IO(rw, iocb, inode, bdev, iter,
-					    offset, xfs_get_blocks_direct,
-					    NULL, NULL, 0);
-	}
+	/*
+	 * We cannot preallocate a size update transaction here as we
+	 * don't know whether allocation is necessary or not. Hence we
+	 * can only tell IO completion that one is necessary if we are
+	 * not doing unwritten extent conversion.
+	 */
+	size = iov_iter_count(iter);
+	iocb->private = ioend = xfs_alloc_ioend(inode, XFS_IO_DIRECT);
+	if (offset + size > XFS_I(inode)->i_d.di_size)
+		ioend->io_isdirect = 1;
 
-	return ret;
+	ret = xfs_vm_do_dio(inode, rw, iocb, iter, offset,
+			    xfs_end_io_direct_write, DIO_ASYNC_EXTEND);
 
-out_destroy_ioend:
-	xfs_destroy_ioend(ioend);
+	if (ret != -EIOCBQUEUED && iocb->private)
+		xfs_destroy_ioend(ioend);
 	return ret;
 }
 
diff --git a/fs/xfs/xfs_aops.h b/fs/xfs/xfs_aops.h
index f94dd45..0264bc5 100644
--- a/fs/xfs/xfs_aops.h
+++ b/fs/xfs/xfs_aops.h
@@ -56,8 +56,11 @@ typedef struct xfs_ioend {
 } xfs_ioend_t;
 
 extern const struct address_space_operations xfs_address_space_operations;
-extern int xfs_get_blocks(struct inode *, sector_t, struct buffer_head *, int);
+int	xfs_get_blocks(struct inode *inode, sector_t offset,
+		       struct buffer_head *map_bh, int create);
+int	xfs_get_blocks_dax(struct inode *inode, sector_t offset,
+			   struct buffer_head *map_bh, int create);
 
-extern void xfs_count_page_state(struct page *, int *, int *);
+void xfs_count_page_state(struct page *, int *, int *);
 
 #endif /* __XFS_AOPS_H__ */
diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 08979d8..47819a4 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -1136,14 +1136,29 @@ xfs_zero_remaining_bytes(
 			break;
 		ASSERT(imap.br_blockcount >= 1);
 		ASSERT(imap.br_startoff == offset_fsb);
+		ASSERT(imap.br_startblock != DELAYSTARTBLOCK);
+
+		if (imap.br_startblock == HOLESTARTBLOCK ||
+		    imap.br_state == XFS_EXT_UNWRITTEN) {
+			/* skip the entire extent */
+			lastoffset = XFS_FSB_TO_B(mp, imap.br_startoff +
+						      imap.br_blockcount) - 1;
+			continue;
+		}
+
 		lastoffset = XFS_FSB_TO_B(mp, imap.br_startoff + 1) - 1;
 		if (lastoffset > endoff)
 			lastoffset = endoff;
-		if (imap.br_startblock == HOLESTARTBLOCK)
-			continue;
-		ASSERT(imap.br_startblock != DELAYSTARTBLOCK);
-		if (imap.br_state == XFS_EXT_UNWRITTEN)
+
+		/* DAX can just zero the backing device directly */
+		if (IS_DAX(VFS_I(ip))) {
+			error = dax_zero_page_range(VFS_I(ip), offset,
+						    lastoffset - offset + 1,
+						    xfs_get_blocks_dax);
+			if (error)
+				return error;
 			continue;
+		}
 
 		error = xfs_buf_read_uncached(XFS_IS_REALTIME_INODE(ip) ?
 				mp->m_rtdev_targp : mp->m_ddev_targp,
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index eb596b4..d3d101e 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -99,7 +99,8 @@ xfs_iozero(
 {
 	struct page		*page;
 	struct address_space	*mapping;
-	int			status;
+	int			status = 0;
+
 
 	mapping = VFS_I(ip)->i_mapping;
 	do {
@@ -111,20 +112,25 @@ xfs_iozero(
 		if (bytes > count)
 			bytes = count;
 
-		status = pagecache_write_begin(NULL, mapping, pos, bytes,
-					AOP_FLAG_UNINTERRUPTIBLE,
-					&page, &fsdata);
-		if (status)
-			break;
+		if (IS_DAX(VFS_I(ip)))
+			dax_zero_page_range(VFS_I(ip), pos, bytes,
+						   xfs_get_blocks_dax);
+		else {
+			status = pagecache_write_begin(NULL, mapping, pos, bytes,
+						AOP_FLAG_UNINTERRUPTIBLE,
+						&page, &fsdata);
+			if (status)
+				break;
 
-		zero_user(page, offset, bytes);
+			zero_user(page, offset, bytes);
 
-		status = pagecache_write_end(NULL, mapping, pos, bytes, bytes,
-					page, fsdata);
-		WARN_ON(status <= 0); /* can't return less than zero! */
+			status = pagecache_write_end(NULL, mapping, pos, bytes,
+						bytes, page, fsdata);
+			WARN_ON(status <= 0); /* can't return less than zero! */
+			status = 0;
+		}
 		pos += bytes;
 		count -= bytes;
-		status = 0;
 	} while (count);
 
 	return (-status);
@@ -604,7 +610,7 @@ xfs_file_dio_aio_write(
 					mp->m_rtdev_targp : mp->m_ddev_targp;
 
 	/* DIO must be aligned to device logical sector size */
-	if ((pos | count) & target->bt_logical_sectormask)
+	if (!IS_DAX(inode) && (pos | count) & target->bt_logical_sectormask)
 		return -EINVAL;
 
 	/* "unaligned" here means not aligned to a filesystem block */
@@ -674,8 +680,11 @@ xfs_file_dio_aio_write(
 out:
 	xfs_rw_iunlock(ip, iolock);
 
-	/* No fallback to buffered IO on errors for XFS. */
-	ASSERT(ret < 0 || ret == count);
+	/*
+	 * No fallback to buffered IO on errors for XFS. DAX can result in
+	 * partial writes, but direct IO will either complete fully or fail.
+	 */
+	ASSERT(ret < 0 || ret == count || IS_DAX(VFS_I(ip)));
 	return ret;
 }
 
@@ -760,7 +769,7 @@ xfs_file_write_iter(
 	if (XFS_FORCED_SHUTDOWN(ip->i_mount))
 		return -EIO;
 
-	if (unlikely(file->f_flags & O_DIRECT))
+	if ((file->f_flags & O_DIRECT) || IS_DAX(inode))
 		ret = xfs_file_dio_aio_write(iocb, from);
 	else
 		ret = xfs_file_buffered_aio_write(iocb, from);
@@ -956,31 +965,6 @@ xfs_file_readdir(
 	return 0;
 }
 
-STATIC int
-xfs_file_mmap(
-	struct file	*filp,
-	struct vm_area_struct *vma)
-{
-	vma->vm_ops = &xfs_file_vm_ops;
-
-	file_accessed(filp);
-	return 0;
-}
-
-/*
- * mmap()d file has taken write protection fault and is being made
- * writable. We can set the page state up correctly for a writable
- * page, which means we can do correct delalloc accounting (ENOSPC
- * checking!) and unwritten extent mapping.
- */
-STATIC int
-xfs_vm_page_mkwrite(
-	struct vm_area_struct	*vma,
-	struct vm_fault		*vmf)
-{
-	return block_page_mkwrite(vma, vmf, xfs_get_blocks);
-}
-
 /*
  * This type is designed to indicate the type of offset we would like
  * to search from page cache for xfs_seek_hole_data().
@@ -1356,6 +1340,86 @@ xfs_file_llseek(
 	}
 }
 
+/*
+ * mmap()d file has taken write protection fault and is being made
+ * writable. We can set the page state up correctly for a writable
+ * page, which means we can do correct delalloc accounting (ENOSPC
+ * checking!) and unwritten extent mapping.
+ */
+STATIC int
+xfs_vm_page_mkwrite(
+	struct vm_area_struct	*vma,
+	struct vm_fault		*vmf)
+{
+	return block_page_mkwrite(vma, vmf, xfs_get_blocks);
+}
+
+static const struct vm_operations_struct xfs_file_vm_ops = {
+	.fault		= filemap_fault,
+	.map_pages	= filemap_map_pages,
+	.page_mkwrite	= xfs_vm_page_mkwrite,
+	.remap_pages	= generic_file_remap_pages,
+};
+
+#ifdef CONFIG_FS_DAX
+static int
+xfs_vm_dax_fault(
+	struct vm_area_struct	*vma,
+	struct vm_fault		*vmf)
+{
+	return dax_fault(vma, vmf, xfs_get_blocks_dax);
+}
+
+static int
+xfs_vm_dax_page_mkwrite(
+	struct vm_area_struct	*vma,
+	struct vm_fault		*vmf)
+{
+	return dax_mkwrite(vma, vmf, xfs_get_blocks_dax);
+}
+
+static const struct vm_operations_struct xfs_file_dax_vm_ops = {
+	.fault		= xfs_vm_dax_fault,
+	.page_mkwrite	= xfs_vm_dax_page_mkwrite,
+	.remap_pages	= generic_file_remap_pages,
+};
+#else
+#define xfs_file_dax_vm_ops xfs_file_vm_ops
+#endif /* CONFIG_FS_DAX */
+
+STATIC int
+xfs_file_mmap(
+	struct file	*filp,
+	struct vm_area_struct *vma)
+{
+	file_accessed(filp);
+	if (IS_DAX(file_inode(filp))) {
+		vma->vm_ops = &xfs_file_dax_vm_ops;
+		vma->vm_flags |= VM_MIXEDMAP;
+	} else
+		vma->vm_ops = &xfs_file_vm_ops;
+	return 0;
+}
+
+#ifdef CONFIG_FS_DAX
+const struct file_operations xfs_file_dax_operations = {
+	.llseek		= xfs_file_llseek,
+	.read		= new_sync_read,
+	.write		= new_sync_write,
+	.read_iter	= xfs_file_read_iter,
+	.write_iter	= xfs_file_write_iter,
+	.unlocked_ioctl	= xfs_file_ioctl,
+#ifdef CONFIG_COMPAT
+	.compat_ioctl	= xfs_file_compat_ioctl,
+#endif
+	.mmap		= xfs_file_mmap,
+	.open		= xfs_file_open,
+	.release	= xfs_file_release,
+	.fsync		= xfs_file_fsync,
+	.fallocate	= xfs_file_fallocate,
+};
+#endif /* CONFIG_FS_DAX */
+
 const struct file_operations xfs_file_operations = {
 	.llseek		= xfs_file_llseek,
 	.read		= new_sync_read,
@@ -1386,10 +1450,3 @@ const struct file_operations xfs_dir_file_operations = {
 #endif
 	.fsync		= xfs_dir_fsync,
 };
-
-static const struct vm_operations_struct xfs_file_vm_ops = {
-	.fault		= filemap_fault,
-	.map_pages	= filemap_map_pages,
-	.page_mkwrite	= xfs_vm_page_mkwrite,
-	.remap_pages	= generic_file_remap_pages,
-};
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 7212949..63aeca8 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -844,7 +844,11 @@ xfs_setattr_size(
 	 * much we can do about this, except to hope that the caller sees ENOMEM
 	 * and retries the truncate operation.
 	 */
-	error = block_truncate_page(inode->i_mapping, newsize, xfs_get_blocks);
+	if (IS_DAX(inode))
+		error = dax_truncate_page(inode, newsize, xfs_get_blocks_dax);
+	else
+		error = block_truncate_page(inode->i_mapping, newsize,
+					    xfs_get_blocks);
 	if (error)
 		return error;
 	truncate_setsize(inode, newsize);
@@ -1176,22 +1180,22 @@ xfs_diflags_to_iflags(
 	struct inode		*inode,
 	struct xfs_inode	*ip)
 {
-	if (ip->i_d.di_flags & XFS_DIFLAG_IMMUTABLE)
+	uint16_t		flags = ip->i_d.di_flags;
+
+	inode->i_flags &= ~(S_IMMUTABLE | S_APPEND | S_SYNC |
+			    S_NOATIME | S_DAX);
+
+	if (flags & XFS_DIFLAG_IMMUTABLE)
 		inode->i_flags |= S_IMMUTABLE;
-	else
-		inode->i_flags &= ~S_IMMUTABLE;
-	if (ip->i_d.di_flags & XFS_DIFLAG_APPEND)
+	if (flags & XFS_DIFLAG_APPEND)
 		inode->i_flags |= S_APPEND;
-	else
-		inode->i_flags &= ~S_APPEND;
-	if (ip->i_d.di_flags & XFS_DIFLAG_SYNC)
+	if (flags & XFS_DIFLAG_SYNC)
 		inode->i_flags |= S_SYNC;
-	else
-		inode->i_flags &= ~S_SYNC;
-	if (ip->i_d.di_flags & XFS_DIFLAG_NOATIME)
+	if (flags & XFS_DIFLAG_NOATIME)
 		inode->i_flags |= S_NOATIME;
-	else
-		inode->i_flags &= ~S_NOATIME;
+	/* XXX: Also needs an on-disk per inode flag! */
+	if (ip->i_mount->m_flags & XFS_MOUNT_DAX)
+		inode->i_flags |= S_DAX;
 }
 
 /*
@@ -1253,6 +1257,10 @@ xfs_setup_inode(
 	case S_IFREG:
 		inode->i_op = &xfs_inode_operations;
 		inode->i_fop = &xfs_file_operations;
+		if (IS_DAX(inode))
+			inode->i_fop = &xfs_file_dax_operations;
+		else
+			inode->i_fop = &xfs_file_operations;
 		inode->i_mapping->a_ops = &xfs_address_space_operations;
 		break;
 	case S_IFDIR:
diff --git a/fs/xfs/xfs_iops.h b/fs/xfs/xfs_iops.h
index 1c34e43..5aeacd2 100644
--- a/fs/xfs/xfs_iops.h
+++ b/fs/xfs/xfs_iops.h
@@ -23,6 +23,12 @@ struct xfs_inode;
 extern const struct file_operations xfs_file_operations;
 extern const struct file_operations xfs_dir_file_operations;
 
+#ifdef CONFIG_FS_DAX
+extern const struct file_operations xfs_file_dax_operations;
+#else
+#define xfs_file_dax_operations xfs_file_operations
+#endif
+
 extern ssize_t xfs_vn_listxattr(struct dentry *, char *data, size_t size);
 
 extern void xfs_setup_inode(struct xfs_inode *);
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 06f16d5..8f15099 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -208,6 +208,8 @@ typedef struct xfs_mount {
 						   allocator */
 #define XFS_MOUNT_NOATTR2	(1ULL << 25)	/* disable use of attr2 format */
 
+#define XFS_MOUNT_DAX		(1ULL << 62)	/* TEST ONLY! */
+
 
 /*
  * Default minimum read and write sizes.
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index de6dc75..0c86ab4 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -115,6 +115,8 @@ static struct xfs_kobj xfs_dbg_kobj;	/* global debug sysfs attrs */
 #define MNTOPT_DISCARD	   "discard"	/* Discard unused blocks */
 #define MNTOPT_NODISCARD   "nodiscard"	/* Do not discard unused blocks */
 
+#define MNTOPT_DAX	"dax"	/* XXX: TEST ONLY OPTION */
+
 /*
  * Table driven mount option parser.
  *
@@ -362,6 +364,10 @@ xfs_parseargs(
 		} else if (!strcmp(this_char, MNTOPT_GQUOTANOENF)) {
 			mp->m_qflags |= (XFS_GQUOTA_ACCT | XFS_GQUOTA_ACTIVE);
 			mp->m_qflags &= ~XFS_GQUOTA_ENFD;
+#ifdef CONFIG_FS_DAX
+		} else if (!strcmp(this_char, MNTOPT_DAX)) {
+			mp->m_flags |= XFS_MOUNT_DAX;
+#endif
 		} else if (!strcmp(this_char, MNTOPT_DELAYLOG)) {
 			xfs_warn(mp,
 	"delaylog is the default now, option is deprecated.");
@@ -473,8 +479,8 @@ done:
 }
 
 struct proc_xfs_info {
-	int	flag;
-	char	*str;
+	uint64_t	flag;
+	char		*str;
 };
 
 STATIC int
@@ -495,6 +501,7 @@ xfs_showargs(
 		{ XFS_MOUNT_GRPID,		"," MNTOPT_GRPID },
 		{ XFS_MOUNT_DISCARD,		"," MNTOPT_DISCARD },
 		{ XFS_MOUNT_SMALL_INUMS,	"," MNTOPT_32BITINODE },
+		{ XFS_MOUNT_DAX,		"," MNTOPT_DAX },
 		{ 0, NULL }
 	};
 	static struct proc_xfs_info xfs_info_unset[] = {
@@ -1473,6 +1480,20 @@ xfs_fs_fill_super(
 	if (XFS_SB_VERSION_NUM(&mp->m_sb) == XFS_SB_VERSION_5)
 		sb->s_flags |= MS_I_VERSION;
 
+	if (mp->m_flags & XFS_MOUNT_DAX) {
+		xfs_warn(mp,
+	"DAX enabled. Warning: EXPERIMENTAL, use at your own risk");
+		if (sb->s_blocksize != PAGE_SIZE) {
+			xfs_alert(mp,
+		"Filesystem block size invalid for DAX. Turning DAX off.");
+			mp->m_flags &= ~XFS_MOUNT_DAX;
+		} else if (!sb->s_bdev->bd_disk->fops->direct_access) {
+			xfs_alert(mp,
+		"Block device does not support DAX. Turning DAX off.");
+			mp->m_flags &= ~XFS_MOUNT_DAX;
+		}
+	}
+
 	error = xfs_mountfs(mp);
 	if (error)
 		goto out_filestream_unmount;


* Re: [PATCH v10 19/21] xip: Add xip_zero_page_range
  2014-09-03  9:21   ` Dave Chinner
@ 2014-09-04 21:08     ` Matthew Wilcox
  2014-09-04 21:36       ` Theodore Ts'o
  0 siblings, 1 reply; 52+ messages in thread
From: Matthew Wilcox @ 2014-09-04 21:08 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Matthew Wilcox, linux-fsdevel, linux-mm, linux-kernel, willy,
	Ross Zwisler

On Wed, Sep 03, 2014 at 07:21:16PM +1000, Dave Chinner wrote:
> > @@ -481,9 +484,14 @@ int dax_truncate_page(struct inode *inode, loff_t from, get_block_t get_block)
> >  		err = dax_get_addr(&bh, &addr, inode->i_blkbits);
> >  		if (err < 0)
> >  			return err;
> > +		/*
> > +		 * ext4 sometimes asks to zero past the end of a block.  It
> > +		 * really just wants to zero to the end of the block.
> > +		 */
> > +		length = min_t(unsigned, length, PAGE_CACHE_SIZE - offset);
> >  		memset(addr + offset, 0, length);
> 
> Sorry, what?
> 
> You introduce that bug with the way dax_truncate_page() is redefined
> to always pass PAGE_CACHE_SIZE as a length later on in this patch
> into the function. That's hardly an ext4 bug....

ext4 does (or did?) have this bug (expectation?).  I then take advantage
of the fact that we have to accommodate it, so there are now two places
that have to accommodate it.  I forget what the path was that has that
assumption, but xfstests used to display it.

I'm away this week (... bad timing), but I can certainly fix it elsewhere
in ext4 next week.

> >  int dax_clear_blocks(struct inode *, sector_t block, long size);
> > +int dax_zero_page_range(struct inode *, loff_t from, unsigned len, get_block_t);
> >  int dax_truncate_page(struct inode *, loff_t from, get_block_t);
> 
> It's still defined as a function that doesn't exist now....

Oops.

> > +/* Can't be a function because PAGE_CACHE_SIZE is defined in pagemap.h */
> > +#define dax_truncate_page(inode, from, get_block)	\
> > +	dax_zero_page_range(inode, from, PAGE_CACHE_SIZE, get_block)
> 
> And then redefined as a macro here.

Heh, which means we never notice the stale declaration above.  Thanks, C!

> This is wrong, IMO,
> dax_truncate_page() should remain as a function and it should
> correctly calculate how much of the page should be trimmed, not
> leave landmines that other code has to clean up...
> 
> (Yup, I'm tracking down a truncate bug in XFS from fsx...)

I'll put an assert in the rewrite to make sure that nobody's trying to
overtruncate.



* Re: [PATCH v10 19/21] xip: Add xip_zero_page_range
  2014-09-04 21:08     ` Matthew Wilcox
@ 2014-09-04 21:36       ` Theodore Ts'o
  2014-09-08 18:59         ` Matthew Wilcox
  0 siblings, 1 reply; 52+ messages in thread
From: Theodore Ts'o @ 2014-09-04 21:36 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Dave Chinner, Matthew Wilcox, linux-fsdevel, linux-mm,
	linux-kernel, Ross Zwisler

On Thu, Sep 04, 2014 at 05:08:02PM -0400, Matthew Wilcox wrote:
> 
> ext4 does (or did?) have this bug (expectation?).  I then take advantage
> of the fact that we have to accommodate it, so there are now two places
> that have to accommodate it.  I forget what the path was that has that
> assumption, but xfstests used to display it.
> 
> I'm away this week (... bad timing), but I can certainly fix it elsewhere
> in ext4 next week.

Huh?  Can you say more about what it is or was doing?  And where?

I tried to look for it, and I'm not seeing it, but I'm not entirely
sure from your description whether I'm looking in the right place.

Cheers,

     	       		   	       	       - Ted


* Re: [PATCH v10 19/21] xip: Add xip_zero_page_range
  2014-09-04 21:36       ` Theodore Ts'o
@ 2014-09-08 18:59         ` Matthew Wilcox
  0 siblings, 0 replies; 52+ messages in thread
From: Matthew Wilcox @ 2014-09-08 18:59 UTC (permalink / raw)
  To: Theodore Ts'o, Matthew Wilcox, Dave Chinner, Matthew Wilcox,
	linux-fsdevel, linux-mm, linux-kernel, Ross Zwisler

On Thu, Sep 04, 2014 at 05:36:41PM -0400, Theodore Ts'o wrote:
> On Thu, Sep 04, 2014 at 05:08:02PM -0400, Matthew Wilcox wrote:
> > 
> > ext4 does (or did?) have this bug (expectation?).  I then take advantage
> > of the fact that we have to accommodate it, so there are now two places
> > that have to accommodate it.  I forget what the path was that has that
> > assumption, but xfstests used to display it.
> > 
> > I'm away this week (... bad timing), but I can certainly fix it elsewhere
> > in ext4 next week.
> 
> Huh?  Can you say more about what it is or was doing?  And where?
> 
> I tried to look for it, and I'm not seeing it, but I'm not entirely
> sure from your description whether I'm looking in the right place.

I wrote this patch:

diff --git a/fs/dax.c b/fs/dax.c
index 96c4fed..bdf6622 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -473,6 +473,7 @@ int dax_zero_page_range(struct inode *inode, loff_t from, unsigned length,
 	/* Block boundary? Nothing to do */
 	if (!length)
 		return 0;
+	BUG_ON((offset + length) > PAGE_CACHE_SIZE);
 
 	memset(&bh, 0, sizeof(bh));
 	bh.b_size = PAGE_CACHE_SIZE;
@@ -484,14 +485,31 @@ int dax_zero_page_range(struct inode *inode, loff_t from, unsigned length,
 		err = dax_get_addr(&bh, &addr, inode->i_blkbits);
 		if (err < 0)
 			return err;
-		/*
-		 * ext4 sometimes asks to zero past the end of a block.  It
-		 * really just wants to zero to the end of the block.
-		 */
-		length = min_t(unsigned, length, PAGE_CACHE_SIZE - offset);
 		memset(addr + offset, 0, length);
 	}
 
 	return 0;
 }
 EXPORT_SYMBOL_GPL(dax_zero_page_range);
+
+/**
+ * dax_truncate_page - handle a partial page being truncated in a DAX file
+ * @inode: The file being truncated
+ * @from: The file offset that is being truncated to
+ * @get_block: The filesystem method used to translate file offsets to blocks
+ *
+ * Similar to block_truncate_page(), this function can be called by a
+ * filesystem when it is truncating a DAX file to handle the partial page.
+ *
+ * We work in terms of PAGE_CACHE_SIZE here for commonality with
+ * block_truncate_page(), but we could go down to PAGE_SIZE if the filesystem
+ * took care of disposing of the unnecessary blocks.  Even if the filesystem
+ * block size is smaller than PAGE_SIZE, we have to zero the rest of the page
+ * since the file might be mmaped.
+ */
+int dax_truncate_page(struct inode *inode, loff_t from, get_block_t get_block)
+{
+	unsigned length = PAGE_CACHE_ALIGN(from) - from;
+	return dax_zero_page_range(inode, from, length, get_block);
+}
+EXPORT_SYMBOL_GPL(dax_truncate_page);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index b0078df..d0182a5 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2502,6 +2502,12 @@ static inline int dax_clear_blocks(struct inode *i, sector_t blk, long sz)
 	return 0;
 }
 
+static inline int dax_truncate_page(struct inode *inode, loff_t from,
+								get_block_t gb)
+{
+	return 0;
+}
+
 static inline int dax_zero_page_range(struct inode *inode, loff_t from,
 						unsigned len, get_block_t gb)
 {
@@ -2516,11 +2522,6 @@ static inline ssize_t dax_do_io(int rw, struct kiocb *iocb,
 }
 #endif
 
-/* Can't be a function because PAGE_CACHE_SIZE is defined in pagemap.h */
-#define dax_truncate_page(inode, from, get_block)	\
-	dax_zero_page_range(inode, from, PAGE_CACHE_SIZE, get_block)
-
-
 #ifdef CONFIG_BLOCK
 typedef void (dio_submit_t)(int rw, struct bio *bio, struct inode *inode,
 			    loff_t file_offset);

When running generic/008, it hit the BUG_ON in dax_zero_page_range():

[  506.752872] Call Trace:
[  506.752891]  [<ffffffffa02303cb>] ? __ext4_handle_dirty_metadata+0x9b/0x210 [ext4]
[  506.752910]  [<ffffffffa0200ffa>] ext4_block_zero_page_range+0x1ba/0x400 [ext4]
[  506.752930]  [<ffffffffa022f708>] ? ext4_fallocate+0x818/0xb70 [ext4]
[  506.752947]  [<ffffffffa020188e>] ext4_zero_partial_blocks+0xae/0xf0 [ext4]
[  506.752966]  [<ffffffffa022f719>] ext4_fallocate+0x829/0xb70 [ext4]
[  506.752980]  [<ffffffff811fee96>] do_fallocate+0x126/0x1b0
[  506.752992]  [<ffffffff811fef63>] SyS_fallocate+0x43/0x70

Someone appears to already know about this, since this code exists
in the current ext4_block_zero_page_range() [which I renamed to
__ext4_block_zero_page_range() in my patchset]:

        /*
         * correct length if it does not fall between
         * 'from' and the end of the block
         */
        if (length > max || length < 0)
                length = max;

Applying the following patch on top of the DAX patchset and the above
patch fixes everything nicely, but does result in a small amount of
code duplication.

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index e71adf6..5edd903 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3231,7 +3231,7 @@ static int __ext4_block_zero_page_range(handle_t *handle,
 {
 	ext4_fsblk_t index = from >> PAGE_CACHE_SHIFT;
 	unsigned offset = from & (PAGE_CACHE_SIZE-1);
-	unsigned blocksize, max, pos;
+	unsigned blocksize, pos;
 	ext4_lblk_t iblock;
 	struct inode *inode = mapping->host;
 	struct buffer_head *bh;
@@ -3244,14 +3244,6 @@ static int __ext4_block_zero_page_range(handle_t *handle,
 		return -ENOMEM;
 
 	blocksize = inode->i_sb->s_blocksize;
-	max = blocksize - (offset & (blocksize - 1));
-
-	/*
-	 * correct length if it does not fall between
-	 * 'from' and the end of the block
-	 */
-	if (length > max || length < 0)
-		length = max;
 
 	iblock = index << (PAGE_CACHE_SHIFT - inode->i_sb->s_blocksize_bits);
 
@@ -3327,6 +3319,17 @@ static int ext4_block_zero_page_range(handle_t *handle,
 		struct address_space *mapping, loff_t from, loff_t length)
 {
 	struct inode *inode = mapping->host;
+	unsigned offset = from & (PAGE_CACHE_SIZE-1);
+	unsigned blocksize = inode->i_sb->s_blocksize;
+	unsigned max = blocksize - (offset & (blocksize - 1));
+
+	/*
+	 * correct length if it does not fall between
+	 * 'from' and the end of the block
+	 */
+	if (length > max || length < 0)
+		length = max;
+
 	if (IS_DAX(inode))
 		return dax_zero_page_range(inode, from, length, ext4_get_block);
 	return __ext4_block_zero_page_range(handle, mapping, from, length);


* Re: [PATCH v10 09/21] Replace the XIP page fault handler with the DAX page fault handler
  2014-09-03  7:47   ` Dave Chinner
@ 2014-09-10 15:23     ` Matthew Wilcox
  2014-09-11  3:09       ` Dave Chinner
  0 siblings, 1 reply; 52+ messages in thread
From: Matthew Wilcox @ 2014-09-10 15:23 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Matthew Wilcox, linux-fsdevel, linux-mm, linux-kernel, willy

On Wed, Sep 03, 2014 at 05:47:24PM +1000, Dave Chinner wrote:
> > +	error = get_block(inode, block, &bh, 0);
> > +	if (!error && (bh.b_size < PAGE_SIZE))
> > +		error = -EIO;
> > +	if (error)
> > +		goto unlock_page;
> 
> page fault into unwritten region, returns buffer_unwritten(bh) ==
> true. Hence buffer_written(bh) is false, and we take this branch:
> 
> > +	if (!buffer_written(&bh) && !vmf->cow_page) {
> > +		if (vmf->flags & FAULT_FLAG_WRITE) {
> > +			error = get_block(inode, block, &bh, 1);
> 
> Exactly what are you expecting to happen here? We don't do
> allocation because there are already unwritten blocks over this
> extent, and so bh will be unchanged when returning. i.e. it will
> still be mapping an unwritten extent.

I was expecting calling get_block() on an unwritten extent to convert it
to a written extent.  Your suggestion below of using b_end_io() to do that
is a better idea.

So this should be:

	if (!buffer_mapped(&bh) && !vmf->cow_page) {

... right?

> dax: add IO completion callback for page faults
> 
> From: Dave Chinner <dchinner@redhat.com>
> 
> When a page fault drops into a hole, it needs to allocate an extent.
> Filesystems may allocate unwritten extents so that the underlying
> contents are not exposed until data is written to the extent. In
> that case, we need an io completion callback to run once the blocks
> have been zeroed to indicate that it is safe for the filesystem to
> mark those blocks written without exposing stale data in the event
> of a crash.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/dax.c | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 96c4fed..387ca78 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -306,6 +306,7 @@ static int do_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
>  	memset(&bh, 0, sizeof(bh));
>  	block = (sector_t)vmf->pgoff << (PAGE_SHIFT - blkbits);
>  	bh.b_size = PAGE_SIZE;
> +	bh.b_end_io = NULL;

Given the above memset, I don't think we need to explicitly set b_end_io
to NULL.

>   repeat:
>  	page = find_get_page(mapping, vmf->pgoff);
> @@ -364,8 +365,12 @@ static int do_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
>  		return VM_FAULT_LOCKED;
>  	}
>  
> -	if (buffer_unwritten(&bh) || buffer_new(&bh))
> +	if (buffer_unwritten(&bh) || buffer_new(&bh)) {
> +		/* XXX: errors zeroing the blocks are propagated how? */
>  		dax_clear_blocks(inode, bh.b_blocknr, bh.b_size);

That's a great question.  I think we need to segfault here.
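
Something like this, perhaps (an untested sketch; it relies on dax_clear_blocks()
reporting failure through its int return value, and on the existing unlock_page
unwind used for the get_block() failure above turning a negative error into
VM_FAULT_SIGBUS):

	if (buffer_unwritten(&bh) || buffer_new(&bh)) {
		error = dax_clear_blocks(inode, bh.b_blocknr, bh.b_size);
		if (error)
			goto unlock_page;	/* reported as VM_FAULT_SIGBUS */
		if (bh.b_end_io)
			bh.b_end_io(&bh, 1);
	}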

> +		if (bh.b_end_io)
> +			bh.b_end_io(&bh, 1);
> +	}

I think ext4 is going to need to set b_end_io too.  Right now, it uses the
dio_iodone_t to convert unwritten extents to written extents, but we don't
have (and I don't think we should have) a kiocb for page faults.
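
Something along these lines is what I have in mind (an untested sketch; both
ext4_end_dax_fault_io and ext4_dax_get_block are made-up names for illustration,
not code from Ross's ext4 patch):

	/* hypothetical helper: do the unwritten->written conversion that
	 * ext4's dio_iodone_t callback does for direct IO today */
	static void ext4_end_dax_fault_io(struct buffer_head *bh, int uptodate)
	{
		/* left as a stub in this sketch */
	}

	static int ext4_dax_get_block(struct inode *inode, sector_t iblock,
				      struct buffer_head *bh_result, int create)
	{
		int ret = ext4_get_block(inode, iblock, bh_result, create);

		/* have do_dax_fault() call us back once dax_clear_blocks()
		 * has zeroed the newly allocated blocks */
		if (!ret && create && buffer_unwritten(bh_result))
			bh_result->b_end_io = ext4_end_dax_fault_io;
		return ret;
	}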

So, if it's OK with you, I'm going to fold this patch into version 11 and
add your Reviewed-by to it.


* Re: [PATCH v10 20/21] ext4: Add DAX functionality
  2014-09-03 11:13   ` Dave Chinner
@ 2014-09-10 16:49     ` Boaz Harrosh
  2014-09-11  4:38       ` Dave Chinner
  0 siblings, 1 reply; 52+ messages in thread
From: Boaz Harrosh @ 2014-09-10 16:49 UTC (permalink / raw)
  To: Dave Chinner, Matthew Wilcox
  Cc: linux-fsdevel, linux-mm, linux-kernel, Ross Zwisler, willy

On 09/03/2014 02:13 PM, Dave Chinner wrote:
<>
> 
> When direct IO fails ext4 falls back to buffered IO, right? And
> dax_do_io() can return partial writes, yes?
> 

There are no buffered writes with DAX, i.e. buffered writes are always
direct as well. (No page cache)

> So that means if you get, say, ENOSPC part way through a DAX write,
> ext4 can start dirtying the page cache from
> __generic_file_write_iter() because the DAX write didn't wholly
> complete? And say this ENOSPC races with space being freed from
> another inode, then the buffered write will succeed and we'll end up
> with coherency issues, right?
> 
> This is not an idle question - XFS if firing asserts all over the
> place when doing ENOSPC testing because DAX is returning partial
> writes and the XFS direct IO code is expecting them to either wholly
> complete or wholly fail. I can make the DAX variant do allow partial
> writes, but I'm not going to add a useless fallback to buffered IO
> for XFS when the (fully featured) direct allocation fails.
> 

Right, no fall back. Because a fallback is just a retry, and in any
case DAX assumes there is never a page_cache_page for written data

> Indeed, I note that in the dax_fault code, any page found in the
> page cache is explicitly removed and released, and the direct mapped
> block replaces that page in the vma. IOWs, this code expects pages
> to be clean as we're only supposed to have regions covered by holes
> using cached pages (dax_load_hole()). 

Exactly, page_cache_pages are only/always for "regions covered by holes"

Once there is a real block allocated for an offset it will be directly
mapped to the vm without a page_cache_page.

> So if we've done a buffered
> write, we're going to toss out dirty pages the moment there is a
> page fault on the range and map the unmodified backing store in
> instead.
> 

No! There is never a "buffered write" with DAX. That is: there is never
a page_cache_page that holds data which will later belong to the storage.
DAX means zero page cache.

> That just seems wrong. Maybe I've forgotten something, but this
> looks like a wart that we don't need and shouldn't bake into this
> interface as both ext4 and XFS can allocate into holes and extend
> files from from the direct IO interfaces. Of course, correct me if
> I'm wrong about ext4 capabilities...
> 

Yes, you have misread the patchset: all writes are always done directly
to bdev->direct_access(..) memory, *never* via a copy to the page cache.

Currently the only radix-tree pages are ZERO pages that cover holes,
which get thrown out as clean or COWed on mkwrite.

BTW Matthew: It took me a while to figure out the VFS/VMA API, but
I managed to map a single ZERO page to all holes and COW them to
real blocks on mkwrite. It needed a combination of flags, but the
main trick is that at mkwrite I do:

	/* our zero page doesn't really hold the correct offset to the file in
	 * page->index so vmf->pgoff is incorrect, lets fix that */
	vmf->pgoff = vma->vm_pgoff + (((unsigned long)vmf->virtual_address -
			vma->vm_start) >> PAGE_SHIFT);
	/* call fault handler to get a real page for writing */
	ret = _xip_file_fault(vma, vmf);
	/* invalidate all other mappings to that location */
	unmap_mapping_range(mapping, vmf->pgoff << PAGE_SHIFT, PAGE_SIZE, 1);

	/* mkwrite must lock the original page and return VM_FAULT_LOCKED */
	if (ret == VM_FAULT_NOPAGE) {
		lock_page(m1fs_zero_page);
		ret = VM_FAULT_LOCKED;
	}
	return ret;

At _xip_file_fault() also called from .fault I do in the case of a hole:
	if (!(vmf->flags & FAULT_FLAG_WRITE)) {
		...
		block = _find_data_block(inode, vmf->pgoff);
		if (!block) {
			vmf->page = g_zero_page;
			err = vm_insert_page(vma,
					(unsigned long)vmf->virtual_address,
					vmf->page);
			goto after_insert;
		}
	} else {

Above g_zero_page is my own global zero page, PAGE_ZERO will not work.
_find_data_block() is like your get_buffer but only for the read case,
the write case uses a different _get_block_create().

Please tell me if this is interesting to you; I can try to patch your DAX
patchset to do the same. This can always be done later as an optimization.

> Cheers,
> Dave.
> 

Thanks
Boaz



* Re: [PATCH v10 09/21] Replace the XIP page fault handler with the DAX page fault handler
  2014-09-10 15:23     ` Matthew Wilcox
@ 2014-09-11  3:09       ` Dave Chinner
  2014-09-24 15:43         ` Matthew Wilcox
  0 siblings, 1 reply; 52+ messages in thread
From: Dave Chinner @ 2014-09-11  3:09 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Matthew Wilcox, linux-fsdevel, linux-mm, linux-kernel

On Wed, Sep 10, 2014 at 11:23:37AM -0400, Matthew Wilcox wrote:
> On Wed, Sep 03, 2014 at 05:47:24PM +1000, Dave Chinner wrote:
> > > +	error = get_block(inode, block, &bh, 0);
> > > +	if (!error && (bh.b_size < PAGE_SIZE))
> > > +		error = -EIO;
> > > +	if (error)
> > > +		goto unlock_page;
> > 
> > page fault into unwritten region, returns buffer_unwritten(bh) ==
> > true. Hence buffer_written(bh) is false, and we take this branch:
> > 
> > > +	if (!buffer_written(&bh) && !vmf->cow_page) {
> > > +		if (vmf->flags & FAULT_FLAG_WRITE) {
> > > +			error = get_block(inode, block, &bh, 1);
> > 
> > Exactly what are you expecting to happen here? We don't do
> > allocation because there are already unwritten blocks over this
> > extent, and so bh will be unchanged when returning. i.e. it will
> > still be mapping an unwritten extent.
> 
> I was expecting calling get_block() on an unwritten extent to convert it
> to a written extent.  Your suggestion below of using b_end_io() to do that
> is a better idea.
> 
> So this should be:
> 
> 	if (!buffer_mapped(&bh) && !vmf->cow_page) {
> 
> ... right?

Yes, that is the conclusion I reached as well. ;)

> > dax: add IO completion callback for page faults
> > 
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > When a page fault drops into a hole, it needs to allocate an extent.
> > Filesystems may allocate unwritten extents so that the underlying
> > contents are not exposed until data is written to the extent. In
> > that case, we need an io completion callback to run once the blocks
> > have been zeroed to indicate that it is safe for the filesystem to
> > mark those blocks written without exposing stale data in the event
> > of a crash.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/dax.c | 7 ++++++-
> >  1 file changed, 6 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/dax.c b/fs/dax.c
> > index 96c4fed..387ca78 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -306,6 +306,7 @@ static int do_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
> >  	memset(&bh, 0, sizeof(bh));
> >  	block = (sector_t)vmf->pgoff << (PAGE_SHIFT - blkbits);
> >  	bh.b_size = PAGE_SIZE;
> > +	bh.b_end_io = NULL;
> 
> Given the above memset, I don't think we need to explicitly set b_end_io
> to NULL.

I missed that ;)

> >   repeat:
> >  	page = find_get_page(mapping, vmf->pgoff);
> > @@ -364,8 +365,12 @@ static int do_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
> >  		return VM_FAULT_LOCKED;
> >  	}
> >  
> > -	if (buffer_unwritten(&bh) || buffer_new(&bh))
> > +	if (buffer_unwritten(&bh) || buffer_new(&bh)) {
> > +		/* XXX: errors zeroing the blocks are propagated how? */
> >  		dax_clear_blocks(inode, bh.b_blocknr, bh.b_size);
> 
> That's a great question.  I think we need to segfault here.

I suspect there are other cases where we need to do similar "trigger
segv" error handling rather than ignoring errors altogether...

> 
> > +		if (bh.b_end_io)
> > +			bh.b_end_io(&bh, 1);
> > +	}
> 
> I think ext4 is going to need to set b_end_io too.  Right now, it uses the
> dio_iodone_t to convert unwritten extents to written extents, but we don't
> have (and I don't think we should have) a kiocb for page faults.

Yes, ext4 is going to need this as well. After I got XFS running
without problems, I then went back and ran xfstests on ext4 and it
failed many of the tests that do operations into unwritten regions.

> So, if it's OK with you, I'm going to fold this patch into version 11 and
> add your Reviewed-by to it.

Fold it in, I'll review the result ;)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH v10 20/21] ext4: Add DAX functionality
  2014-09-10 16:49     ` Boaz Harrosh
@ 2014-09-11  4:38       ` Dave Chinner
  2014-09-14 12:25         ` Boaz Harrosh
  0 siblings, 1 reply; 52+ messages in thread
From: Dave Chinner @ 2014-09-11  4:38 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Matthew Wilcox, linux-fsdevel, linux-mm, linux-kernel,
	Ross Zwisler, willy

On Wed, Sep 10, 2014 at 07:49:40PM +0300, Boaz Harrosh wrote:
> On 09/03/2014 02:13 PM, Dave Chinner wrote:
> <>
> > 
> > When direct IO fails ext4 falls back to buffered IO, right? And
> > dax_do_io() can return partial writes, yes?
> > 
> 
> There are no buffered writes with DAX, i.e. buffered writes are always
> direct as well. (No page cache)

Yes, I know. But you didn't actually read the code I pointed out,
did you?

> > So that means if you get, say, ENOSPC part way through a DAX write,
> > ext4 can start dirtying the page cache from
> > __generic_file_write_iter() because the DAX write didn't wholly
> > complete? And say this ENOSPC races with space being freed from
> > another inode, then the buffered write will succeed and we'll end up
> > with coherency issues, right?
> > 
> > This is not an idle question - XFS if firing asserts all over the
> > place when doing ENOSPC testing because DAX is returning partial
> > writes and the XFS direct IO code is expecting them to either wholly
> > complete or wholly fail. I can make the DAX variant do allow partial
> > writes, but I'm not going to add a useless fallback to buffered IO
> > for XFS when the (fully featured) direct allocation fails.
> > 
> 
> Right, no fall back.

And so ext4 is buggy, because what ext4 does ....

> Because a fallback is just a retry, and in any
> case DAX assumes there is never a page_cache_page for written data

... is not a retry - it falls back to a fundamentally different
code path. i.e:

sys_write()
....
	new_sync_write
	  ext4_file_write_iter
	    __generic_file_write_iter(O_DIRECT)
	      written = generic_file_direct_write()
	      if (error || complete write)
	        return
	      /* short write! do buffered IO to finish! */
	      generic_perform_write()
	        loop {
			ext4_write_begin
			ext4_write_end
		}

and so we allocate pages in the page cache and do buffered IO into
them because DAX doesn't hook ->write_begin/->write_end as we are
supposed to intercept all buffered IO at a higher level.

This causes data corruption when tested at ENOSPC on DAX enabled
ext4 filesystems. I think that it's an oversight and hence a bug
that needs to be fixed but I'm first asking Willy to see if it was
intentional or not because maybe I missed something in the past 4
months since I've paid really close attention to the DAX code.

And in saying that, Boaz, I'd suggest you spend some time looking at
the history of the DAX patchset. Pay careful note to who came up
with the original idea and architecture that led to the IO path you
are so stridently defending.....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH v10 20/21] ext4: Add DAX functionality
  2014-09-11  4:38       ` Dave Chinner
@ 2014-09-14 12:25         ` Boaz Harrosh
  2014-09-15  6:15           ` Dave Chinner
  0 siblings, 1 reply; 52+ messages in thread
From: Boaz Harrosh @ 2014-09-14 12:25 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Matthew Wilcox, linux-fsdevel, linux-mm, linux-kernel,
	Ross Zwisler, willy

On 09/11/2014 07:38 AM, Dave Chinner wrote:
<>
> 
> And so ext4 is buggy, because what ext4 does ....
> 
> ... is not a retry - it falls back to a fundamentally different
> code path. i.e:
> 
> sys_write()
> ....
> 	new_sync_write
> 	  ext4_file_write_iter
> 	    __generic_file_write_iter(O_DIRECT)
> 	      written = generic_file_direct_write()
> 	      if (error || complete write)
> 	        return
> 	      /* short write! do buffered IO to finish! */
> 	      generic_perform_write()
> 	        loop {
> 			ext4_write_begin
> 			ext4_write_end
> 		}
> 
> and so we allocate pages in the page cache and do buffered IO into
> them because DAX doesn't hook ->write_begin/->write_end as we are
> supposed to intercept all buffered IO at a higher level.
> 
> This causes data corruption when tested at ENOSPC on DAX enabled
> ext4 filesystems. I think that it's an oversight and hence a bug
> that needs to be fixed but I'm first asking Willy to see if it was
> intentional or not because maybe I missed something in the past 4
> months since I've paid really close attention to the DAX code.
> 
> And in saying that, Boaz, I'd suggest you spend some time looking at
> the history of the DAX patchset. Pay careful note to who came up
> with the original idea and architecture that led to the IO path you
> are so stridently defending.....
> 

Yes! You are completely right, and I had not seen this bug. The same bug
exists with ext2 as well. I think this is a bug in patch:
	[PATCH v10 07/21] Replace XIP read and write with DAX I/O

It needs a:
@@ -2584,7 +2584,7 @@ ssize_t __generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 		loff_t endbyte;
 
 		written = generic_file_direct_write(iocb, from, pos);
-		if (written < 0 || written == count)
+		if (written < 0 || written == count || IS_DAX(inode))
 			goto out;
 
 		/*

Or something like that. Is that what you meant?

(You commented on the ext4 patch, but this code appears earlier in ext2,
 so I did not see it, sorry. "If you explain slowly I finally get it ;-)")

> Cheers,
> Dave.

Yes I agree this is a very bad data corruption bug. I also think that the
read path should not be allowed to fall back to buffered IO just the same
for the same reason. We must not allow any real data in page_cache for a
DAX file.

Thanks for explaining
Boaz




* Re: [PATCH v10 07/21] Replace XIP read and write with DAX I/O
  2014-08-27  3:45 ` [PATCH v10 07/21] Replace XIP read and write with DAX I/O Matthew Wilcox
@ 2014-09-14 14:11   ` Boaz Harrosh
  0 siblings, 0 replies; 52+ messages in thread
From: Boaz Harrosh @ 2014-09-14 14:11 UTC (permalink / raw)
  To: Matthew Wilcox, linux-fsdevel, linux-mm, linux-kernel, Dave Chinner; +Cc: willy

On 08/27/2014 06:45 AM, Matthew Wilcox wrote:
> Use the generic AIO infrastructure instead of custom read and write
> methods.  In addition to giving us support for AIO, this adds the missing
> locking between read() and truncate().
> 
> Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> Reviewed-by: Jan Kara <jack@suse.cz>
> ---
>  MAINTAINERS        |   6 ++
>  fs/Makefile        |   1 +
>  fs/dax.c           | 195 ++++++++++++++++++++++++++++++++++++++++++++
>  fs/ext2/file.c     |   6 +-
>  fs/ext2/inode.c    |   8 +-
>  include/linux/fs.h |  18 ++++-
>  mm/filemap.c       |   6 +-
>  mm/filemap_xip.c   | 234 -----------------------------------------------------
>  8 files changed, 229 insertions(+), 245 deletions(-)
>  create mode 100644 fs/dax.c
> 
<>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 90effcd..19bdb68 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1690,8 +1690,7 @@ generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
>  	loff_t *ppos = &iocb->ki_pos;
>  	loff_t pos = *ppos;
>  
> -	/* coalesce the iovecs and go direct-to-BIO for O_DIRECT */
> -	if (file->f_flags & O_DIRECT) {
> +	if (io_is_direct(file)) {
>  		struct address_space *mapping = file->f_mapping;
>  		struct inode *inode = mapping->host;
>  		size_t count = iov_iter_count(iter);
> @@ -2579,8 +2578,7 @@ ssize_t __generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
>  	if (err)
>  		goto out;
>  
> -	/* coalesce the iovecs and go direct-to-BIO for O_DIRECT */
> -	if (unlikely(file->f_flags & O_DIRECT)) {
> +	if (io_is_direct(file)) {
>  		loff_t endbyte;
>  
>  		written = generic_file_direct_write(iocb, from, pos);

Hi Matthew

As pointed out by Dave Chinner, I think we must add the below hunks to this patch.
I do not see a case where the current DAX code allows any FS to
enable both DAX access/mmap and buffered read/write in parallel.

Do we want to also put a
	WARN_ON(IS_DAX(inode));

in generic_perform_write() and/or in extX->write_begin()?
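
For illustration only (a sketch, not one of the hunks below), the
generic_perform_write() variant would sit right at the top of the function,
before any pagecache pages are allocated:

	ssize_t generic_perform_write(struct file *file,
					struct iov_iter *i, loff_t pos)
	{
		struct address_space *mapping = file->f_mapping;

		/* DAX bypasses the page cache entirely, so any buffered
		 * write landing here is the coherency bug discussed above */
		WARN_ON_ONCE(IS_DAX(mapping->host));
		...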

----
diff --git a/mm/filemap.c b/mm/filemap.c
index 19bdb68..22210c9 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1719,7 +1719,8 @@ generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
 		 * and return.  Otherwise fallthrough to buffered io for
 		 * the rest of the read.
 		 */
-		if (retval < 0 || !iov_iter_count(iter) || *ppos >= size) {
+		if (retval < 0 || !iov_iter_count(iter) || *ppos >= size ||
+		    IS_DAX(inode)) {
 			file_accessed(file);
 			goto out;
 		}
@@ -2582,7 +2583,7 @@ ssize_t __generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 		loff_t endbyte;
 
 		written = generic_file_direct_write(iocb, from, pos);
-		if (written < 0 || written == count)
+		if (written < 0 || written == count || IS_DAX(inode))
 			goto out;
 
 		/*
----

Thanks
Boaz



* Re: [PATCH v10 20/21] ext4: Add DAX functionality
  2014-09-14 12:25         ` Boaz Harrosh
@ 2014-09-15  6:15           ` Dave Chinner
  2014-09-15  9:41             ` Boaz Harrosh
  0 siblings, 1 reply; 52+ messages in thread
From: Dave Chinner @ 2014-09-15  6:15 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Matthew Wilcox, linux-fsdevel, linux-mm, linux-kernel,
	Ross Zwisler, willy

On Sun, Sep 14, 2014 at 03:25:45PM +0300, Boaz Harrosh wrote:
> On 09/11/2014 07:38 AM, Dave Chinner wrote:
> <>
> > 
> > And so ext4 is buggy, because what ext4 does ....
> > 
> > ... is not a retry - it falls back to a fundamentally different
> > code path. i.e:
> > 
> > sys_write()
> > ....
> > 	new_sync_write
> > 	  ext4_file_write_iter
> > 	    __generic_file_write_iter(O_DIRECT)
> > 	      written = generic_file_direct_write()
> > 	      if (error || complete write)
> > 	        return
> > 	      /* short write! do buffered IO to finish! */
> > 	      generic_perform_write()
> > 	        loop {
> > 			ext4_write_begin
> > 			ext4_write_end
> > 		}
> > 
> > and so we allocate pages in the page cache and do buffered IO into
> > them because DAX doesn't hook ->write_begin/->write_end as we are
> > supposed to intercept all buffered IO at a higher level.
> > 
> > This causes data corruption when tested at ENOSPC on DAX enabled
> > ext4 filesystems. I think that it's an oversight and hence a bug
> > that needs to be fixed but I'm first asking Willy to see if it was
> > intentional or not because maybe I missed something in the past 4
> > months since I've paid really close attention to the DAX code.
> > 
> > And in saying that, Boaz, I'd suggest you spend some time looking at
> > the history of the DAX patchset. Pay careful note to who came up
> > with the original idea and architecture that led to the IO path you
> > are so stridently defending.....
> > 
> 
> Yes! You are completely right, and I had not seen this bug. The same bug
> exists with ext2 as well. I think this is a bug in patch:
> 	[PATCH v10 07/21] Replace XIP read and write with DAX I/O
> 
> It needs a:
> @@ -2584,7 +2584,7 @@ ssize_t __generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
>  		loff_t endbyte;
>  
>  		written = generic_file_direct_write(iocb, from, pos);
> -		if (written < 0 || written == count)
> +		if (written < 0 || written == count || IS_DAX(inode))
>  			goto out;
>  
>  		/*
> 
> Or something like that. Is that what you meant?

Well, that's one way of working around the immediate issue, but I
don't think it solves the whole problem. e.g. what do you do with the
bit of the partial write that failed? We may have allocated space
for it but not written data to it, so to simply fail exposes stale
data in the file(*).

Hence it's not clear to me that simply returning the short write is
a valid solution for DAX-enabled filesystems. I think that the
above - initially, at least - is much better than falling back to
buffered IO but filesystems are going to have to be updated to work
correctly without that fallback.

> Yes I agree this is a very bad data corruption bug. I also think
> that the read path should not be allowed to fall back to buffered
> IO just the same for the same reason. We must not allow any real
> data in page_cache for a DAX file.

Right, I didn't check the read path for the same issue as XFS won't
return a short read on direct IO unless the read spans EOF. And in
that case it won't ever do buffered reads. ;)

Cheers,

Dave.

(*) XFS avoids this problem by always using unwritten extents for
direct IO allocation, but I'm pretty sure that ext4 doesn't do this.
Using unwritten extents means that we don't expose stale data in the
event we don't end up writing to the allocated space.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH v10 20/21] ext4: Add DAX functionality
  2014-09-15  6:15           ` Dave Chinner
@ 2014-09-15  9:41             ` Boaz Harrosh
  0 siblings, 0 replies; 52+ messages in thread
From: Boaz Harrosh @ 2014-09-15  9:41 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Matthew Wilcox, linux-fsdevel, linux-mm, linux-kernel,
	Ross Zwisler, willy

On 09/15/2014 09:15 AM, Dave Chinner wrote:
> On Sun, Sep 14, 2014 at 03:25:45PM +0300, Boaz Harrosh wrote:
<>
> 
> Well, that's one way of working around the immediate issue, but I
> don't think it solves the whole problem. e.g. what do you do with the
> bit of the partial write that failed? We may have allocated space
> for it but not written data to it, so to simply fail exposes stale
> data in the file(*).
> 

I'm confused. From what you said and what I read of the dax_do_io
code, the only possible error is ENOSPC from get_block()
(since ->direct_access() and the memcpy cannot fail).

Is it possible to fail with ENOSPC and still allocate an unwritten
block?

> Hence it's not clear to me that simply returning the short write is
> a valid solution for DAX-enabled filesystems. I think that the
> above - initially, at least - is much better than falling back to
> buffered IO but filesystems are going to have to be updated to work
> correctly without that fallback.
> 

The way I read dax_do_io, it will call get_block(), write or zero the
block, and only then continue to the next one.
If not, we should establish a handshake that will at least zero out
any error blocks, and/or deallocate them. But can you see such a code
path in dax_do_io?

>> Yes I agree this is a very bad data corruption bug. I also think
>> that the read path should not be allowed to fall back to buffered
>> IO just the same for the same reason. We must not allow any real
>> data in page_cache for a DAX file.
> 
> Right, I didn't check the read path for the same issue as XFS won't
> return a short read on direct IO unless the read spans EOF. And in
> that case it won't ever do buffered reads. ;)
> 

Right, reads are less problematic, I guess. But we should not attempt
a buffered read anyway.

> Cheers,
> Dave.
> 
> (*) XFS avoids this problem by always using unwritten extents for
> direct IO allocation, but I'm pretty sure that ext4 doesn't do this.
> Using unwritten extents means that we don't expose stale data in the
> event we don't end up writing to the allocated space.
> 
If only we had an xfstest for this?

Thanks
Boaz



* Re: [PATCH v10 09/21] Replace the XIP page fault handler with the DAX page fault handler
  2014-09-11  3:09       ` Dave Chinner
@ 2014-09-24 15:43         ` Matthew Wilcox
  2014-09-25  1:01           ` Dave Chinner
  0 siblings, 1 reply; 52+ messages in thread
From: Matthew Wilcox @ 2014-09-24 15:43 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Matthew Wilcox, Matthew Wilcox, linux-fsdevel, linux-mm, linux-kernel

On Thu, Sep 11, 2014 at 01:09:26PM +1000, Dave Chinner wrote:
> On Wed, Sep 10, 2014 at 11:23:37AM -0400, Matthew Wilcox wrote:
> > On Wed, Sep 03, 2014 at 05:47:24PM +1000, Dave Chinner wrote:
> > > > +	error = get_block(inode, block, &bh, 0);
> > > > +	if (!error && (bh.b_size < PAGE_SIZE))
> > > > +		error = -EIO;
> > > > +	if (error)
> > > > +		goto unlock_page;
> > > 
> > > page fault into unwritten region, returns buffer_unwritten(bh) ==
> > > true. Hence buffer_written(bh) is false, and we take this branch:
> > > 
> > > > +	if (!buffer_written(&bh) && !vmf->cow_page) {
> > > > +		if (vmf->flags & FAULT_FLAG_WRITE) {
> > > > +			error = get_block(inode, block, &bh, 1);
> > > 
> > > Exactly what are you expecting to happen here? We don't do
> > > allocation because there are already unwritten blocks over this
> > > extent, and so bh will be unchanged when returning. i.e. it will
> > > still be mapping an unwritten extent.
> > 
> > I was expecting calling get_block() on an unwritten extent to convert it
> > to a written extent.  Your suggestion below of using b_end_io() to do that
> > is a better idea.
> > 
> > So this should be:
> > 
> > 	if (!buffer_mapped(&bh) && !vmf->cow_page) {
> > 
> > ... right?
> 
> Yes, that is the conclusion I reached as well. ;)

Now I know why I was expecting get_block() on an unwritten extent to
convert it to a written extent.  That's the way ext4 behaves!

[  236.660772] got bh ffffffffa06e3bd0 1000
[  236.660814] got bh for write ffffffffa06e3bd0 60
[  236.660821] calling end_io ffffffffa06e3bd0 60

(1000 is BH_Unwritten, 60 is BH_Mapped | BH_New)

The code producing this output:

        error = get_block(inode, block, &bh, 0);
printk("got bh %p %lx\n", bh.b_end_io, bh.b_state);
        if (!error && (bh.b_size < PAGE_SIZE))
                error = -EIO;
        if (error)
                goto unlock_page;

        if (!buffer_mapped(&bh) && !vmf->cow_page) {
                if (vmf->flags & FAULT_FLAG_WRITE) {
                        error = get_block(inode, block, &bh, 1);
printk("got bh for write %p %lx\n", bh.b_end_io, bh.b_state);

# xfs_io -f -c "truncate 20k" -c "fiemap -v" -c "falloc 0 20k" -c "fiemap -v" -c "mmap -w 0 20k" -c "fiemap -v" -c "mwrite 4k 4k" -c "fiemap -v" /mnt/ram0/b
/mnt/ram0/b:
/mnt/ram0/b:
 EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
   0: [0..39]:         263176..263215      40 0x801
/mnt/ram0/b:
 EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
   0: [0..39]:         263176..263215      40 0x801
/mnt/ram0/b:
 EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
   0: [0..39]:         263176..263215      40   0x1

Actually, this looks wrong ... ext4 should only have converted one block
of the extent to written, not all of it.  I think that means ext4 is
exposing stale data :-(  I'll keep digging.


* Re: [PATCH v10 09/21] Replace the XIP page fault handler with the DAX page fault handler
  2014-09-24 15:43         ` Matthew Wilcox
@ 2014-09-25  1:01           ` Dave Chinner
  0 siblings, 0 replies; 52+ messages in thread
From: Dave Chinner @ 2014-09-25  1:01 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Matthew Wilcox, linux-fsdevel, linux-mm, linux-kernel

On Wed, Sep 24, 2014 at 11:43:07AM -0400, Matthew Wilcox wrote:
> On Thu, Sep 11, 2014 at 01:09:26PM +1000, Dave Chinner wrote:
> > On Wed, Sep 10, 2014 at 11:23:37AM -0400, Matthew Wilcox wrote:
> > > On Wed, Sep 03, 2014 at 05:47:24PM +1000, Dave Chinner wrote:
> > > > > +	error = get_block(inode, block, &bh, 0);
> > > > > +	if (!error && (bh.b_size < PAGE_SIZE))
> > > > > +		error = -EIO;
> > > > > +	if (error)
> > > > > +		goto unlock_page;
> > > > 
> > > > page fault into unwritten region, returns buffer_unwritten(bh) ==
> > > > true. Hence buffer_written(bh) is false, and we take this branch:
> > > > 
> > > > > +	if (!buffer_written(&bh) && !vmf->cow_page) {
> > > > > +		if (vmf->flags & FAULT_FLAG_WRITE) {
> > > > > +			error = get_block(inode, block, &bh, 1);
> > > > 
> > > > Exactly what are you expecting to happen here? We don't do
> > > > allocation because there are already unwritten blocks over this
> > > > extent, and so bh will be unchanged when returning. i.e. it will
> > > > still be mapping an unwritten extent.
> > > 
> > > I was expecting calling get_block() on an unwritten extent to convert it
> > > to a written extent.  Your suggestion below of using b_end_io() to do that
> > > is a better idea.
> > > 
> > > So this should be:
> > > 
> > > 	if (!buffer_mapped(&bh) && !vmf->cow_page) {
> > > 
> > > ... right?
> > 
> > Yes, that is the conclusion I reached as well. ;)
> 
> Now I know why I was expecting get_block() on an unwritten extent to
> convert it to a written extent.  That's the way ext4 behaves!

That seems wrong. Unwritten extent conversion should only occur
on IO completion...

> 
> [  236.660772] got bh ffffffffa06e3bd0 1000
> [  236.660814] got bh for write ffffffffa06e3bd0 60
> [  236.660821] calling end_io ffffffffa06e3bd0 60
> 
> (1000 is BH_Unwritten, 60 is BH_Mapped | BH_New)
> 
> The code producing this output:
> 
>         error = get_block(inode, block, &bh, 0);
> printk("got bh %p %lx\n", bh.b_end_io, bh.b_state);
>         if (!error && (bh.b_size < PAGE_SIZE))
>                 error = -EIO;
>         if (error)
>                 goto unlock_page;
> 
>         if (!buffer_mapped(&bh) && !vmf->cow_page) {
>                 if (vmf->flags & FAULT_FLAG_WRITE) {
>                         error = get_block(inode, block, &bh, 1);
> printk("got bh for write %p %lx\n", bh.b_end_io, bh.b_state);

%pF will do symbol decoding for you ;)
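
That is, the first debug line quoted above would simply become:

	printk("got bh %pF %lx\n", bh.b_end_io, bh.b_state);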

> 
> # xfs_io -f -c "truncate 20k" -c "fiemap -v" -c "falloc 0 20k" -c "fiemap -v" -c "mmap -w 0 20k" -c "fiemap -v" -c "mwrite 4k 4k" -c "fiemap -v" /mnt/ram0/b
> /mnt/ram0/b:
> /mnt/ram0/b:
>  EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
>    0: [0..39]:         263176..263215      40 0x801
> /mnt/ram0/b:
>  EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
>    0: [0..39]:         263176..263215      40 0x801
> /mnt/ram0/b:
>  EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
>    0: [0..39]:         263176..263215      40   0x1
> 
> Actually, this looks wrong ... ext4 should only have converted one block
> of the extent to written, not all of it.  I think that means ext4 is
> exposing stale data :-(  I'll keep digging.

Check to see if ext4 has zeroed the entire extent - it does some
convoluted "hole filling" in certain situations where it extends the
range of allocation operations by writing zeros around the range that
it was asked to allocate.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


Thread overview: 52+ messages
2014-08-27  3:45 [PATCH v10 00/21] Support ext4 on NV-DIMMs Matthew Wilcox
2014-08-27  3:45 ` [PATCH v10 01/21] axonram: Fix bug in direct_access Matthew Wilcox
2014-08-27  3:45 ` [PATCH v10 02/21] Change direct_access calling convention Matthew Wilcox
2014-08-27  3:45 ` [PATCH v10 03/21] Fix XIP fault vs truncate race Matthew Wilcox
2014-08-27  3:45 ` [PATCH v10 04/21] Allow page fault handlers to perform the COW Matthew Wilcox
2014-08-27  3:45 ` [PATCH v10 05/21] Introduce IS_DAX(inode) Matthew Wilcox
2014-08-27  3:45 ` [PATCH v10 06/21] Add copy_to_iter(), copy_from_iter() and iov_iter_zero() Matthew Wilcox
2014-08-27  3:45 ` [PATCH v10 07/21] Replace XIP read and write with DAX I/O Matthew Wilcox
2014-09-14 14:11   ` Boaz Harrosh
2014-08-27  3:45 ` [PATCH v10 08/21] Replace ext2_clear_xip_target with dax_clear_blocks Matthew Wilcox
2014-08-27  3:45 ` [PATCH v10 09/21] Replace the XIP page fault handler with the DAX page fault handler Matthew Wilcox
2014-09-03  7:47   ` Dave Chinner
2014-09-10 15:23     ` Matthew Wilcox
2014-09-11  3:09       ` Dave Chinner
2014-09-24 15:43         ` Matthew Wilcox
2014-09-25  1:01           ` Dave Chinner
2014-08-27  3:45 ` [PATCH v10 10/21] Replace xip_truncate_page with dax_truncate_page Matthew Wilcox
2014-08-27  3:45 ` [PATCH v10 11/21] Replace XIP documentation with DAX documentation Matthew Wilcox
2014-08-27  3:45 ` [PATCH v10 12/21] Remove get_xip_mem Matthew Wilcox
2014-08-27  3:45 ` [PATCH v10 13/21] ext2: Remove ext2_xip_verify_sb() Matthew Wilcox
2014-08-27  3:45 ` [PATCH v10 14/21] ext2: Remove ext2_use_xip Matthew Wilcox
2014-08-27  3:45 ` [PATCH v10 15/21] ext2: Remove xip.c and xip.h Matthew Wilcox
2014-08-27  3:45 ` [PATCH v10 16/21] Remove CONFIG_EXT2_FS_XIP and rename CONFIG_FS_XIP to CONFIG_FS_DAX Matthew Wilcox
2014-08-27  3:45 ` [PATCH v10 17/21] ext2: Remove ext2_aops_xip Matthew Wilcox
2014-08-27  3:45 ` [PATCH v10 18/21] Get rid of most mentions of XIP in ext2 Matthew Wilcox
2014-08-27  3:45 ` [PATCH v10 19/21] xip: Add xip_zero_page_range Matthew Wilcox
2014-09-03  9:21   ` Dave Chinner
2014-09-04 21:08     ` Matthew Wilcox
2014-09-04 21:36       ` Theodore Ts'o
2014-09-08 18:59         ` Matthew Wilcox
2014-08-27  3:45 ` [PATCH v10 20/21] ext4: Add DAX functionality Matthew Wilcox
2014-09-03 11:13   ` Dave Chinner
2014-09-10 16:49     ` Boaz Harrosh
2014-09-11  4:38       ` Dave Chinner
2014-09-14 12:25         ` Boaz Harrosh
2014-09-15  6:15           ` Dave Chinner
2014-09-15  9:41             ` Boaz Harrosh
2014-08-27  3:45 ` [PATCH v10 21/21] brd: Rename XIP to DAX Matthew Wilcox
2014-08-27 20:06 ` [PATCH v10 00/21] Support ext4 on NV-DIMMs Andrew Morton
2014-08-27 21:12   ` Matthew Wilcox
2014-08-27 21:46     ` Andrew Morton
2014-08-28  1:30       ` Andy Lutomirski
2014-08-28 16:50         ` Matthew Wilcox
2014-08-28 15:45       ` Matthew Wilcox
2014-08-27 21:22   ` Christoph Lameter
2014-08-27 21:30     ` Andrew Morton
2014-08-27 23:04       ` One Thousand Gnomes
2014-08-28  7:17       ` Dave Chinner
2014-08-30 23:11         ` Christian Stroetmann
2014-08-28  8:08 ` Boaz Harrosh
2014-08-28 22:09   ` Zwisler, Ross
2014-09-03 12:05 ` [PATCH 1/1] xfs: add DAX support Dave Chinner
