linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v12 00/20] DAX: Page cache bypass for filesystems on memory storage
@ 2014-10-24 21:20 Matthew Wilcox
  2014-10-24 21:20 ` [PATCH v12 01/20] axonram: Fix bug in direct_access Matthew Wilcox
                   ` (21 more replies)
  0 siblings, 22 replies; 60+ messages in thread
From: Matthew Wilcox @ 2014-10-24 21:20 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-mm
  Cc: Matthew Wilcox, willy, Andrew Morton

DAX is a replacement for the variation of XIP currently supported by
the ext2 filesystem.  We have three different things in the tree called
'XIP', and the new focus is on access to data rather than executables,
so a name change was in order.  DAX stands for Direct Access.  The X is
for eXciting.

The new focus on data access has resulted in more careful attention to
races that exist in the current XIP code, but are not hit by the use-case
that it was designed for.  XIP's architecture worked fine for ext2, but
DAX is architected to work with modern filsystems such as ext4 and XFS.
DAX is not intended for use with btrfs; the value that btrfs adds relies
on manipulating data and writing data to different locations, while DAX's
value is for write-in-place and keeping the kernel from touching the data.

DAX was developed in order to support NV-DIMMs, but it's become clear that
its usefuless extends beyond NV-DIMMs and there are several potential
customers including the tracing machinery.  Other people want to place
the kernel log in an area of memory, as long as they have a BIOS that
does not clear DRAM on reboot.

Patch 1 is a bug fix.  It is obviously correct, and should be included
into 3.18.

Patch 2 starts the transformation by changing how ->direct_access works.
Much code is moved from the drivers and filesystems into the block layer,
and we add the flexibility of being able to map more than one page at
a time.  It would be good to get this patch into 3.18 as it is also
useful for people who are pursuing non-DAX approaches to working with
persistent memory.

Patch 3 is also a bug fix, probably worth including in 3.18.

Patches 4 & 5 are infrastructure for DAX.

Patches 6-10 replace the XIP code with its DAX equivalents, transforming
ext2 to use the DAX code as we go.  Note that patch 10 is the
Documentation patch.

Patches 11-17 clean up after the XIP code, removing the infrastructure
that is no longer needed and renaming various XIP things to DAX.
Most of these patches were added after Jan found things he didn't
like in an earlier version of the ext4 patch ... that had been copied
from ext2.  So ext2 i being transformed to do things the same way that
ext4 will later.  The ability to mount ext2 filesystems with the 'xip'
option is retained, although the 'dax' option is now preferred.

Patch 18 adds some DAX infrastructure to support ext4.

Patch 19 adds DAX support to ext4.  It is broadly similar to ext2's DAX
support, but it is more efficient than ext4's due to its support for
unwritten extents.

Patch 20 is another cleanup patch renaming XIP to DAX.


My thanks to Mathieu Desnoyers for his reviews of the v11 patchset.  Most
of the changes below were based on his feedback.

Changes since v11:
 - Rebased to 3.18-rc1, dropping patch "vfs: Add copy_to_iter(),
   copy_from_iter() and iov_iter_zero()" as it was merged through Al's tree.
 - Added cc to stable@vger.kernel.org on patch 1
 - Fixed comment style in brd.c (Mathieu)
 - Make more functions in fs.h common with and without CONFIG_FS_DAX set
 - Improve type-checking with !CONFIG_FS_DAX
 - Simplify check for holes in dax_io()
 - Harden the loop in dax_clear_blocks()
 - Add missing check against truncate of a page covering a hole
 - Fix the page-fault handler to work for block devices too
 - Change a few more places that mentioned 'XIP' into 'DAX'
 - Update DAX documentation in a couple of places

Matthew Wilcox (19):
  axonram: Fix bug in direct_access
  block: Change direct_access calling convention
  mm: Fix XIP fault vs truncate race
  mm: Allow page fault handlers to perform the COW
  vfs,ext2: Introduce IS_DAX(inode)
  dax,ext2: Replace XIP read and write with DAX I/O
  dax,ext2: Replace ext2_clear_xip_target with dax_clear_blocks
  dax,ext2: Replace the XIP page fault handler with the DAX page fault
    handler
  dax,ext2: Replace xip_truncate_page with dax_truncate_page
  dax: Replace XIP documentation with DAX documentation
  vfs: Remove get_xip_mem
  ext2: Remove ext2_xip_verify_sb()
  ext2: Remove ext2_use_xip
  ext2: Remove xip.c and xip.h
  vfs,ext2: Remove CONFIG_EXT2_FS_XIP and rename CONFIG_FS_XIP to
    CONFIG_FS_DAX
  ext2: Remove ext2_aops_xip
  ext2: Get rid of most mentions of XIP in ext2
  dax: Add dax_zero_page_range
  brd: Rename XIP to DAX

Ross Zwisler (1):
  ext4: Add DAX functionality

 Documentation/filesystems/00-INDEX |   5 +-
 Documentation/filesystems/Locking  |   3 -
 Documentation/filesystems/dax.txt  |  91 +++++++
 Documentation/filesystems/ext2.txt |   5 +-
 Documentation/filesystems/ext4.txt |   4 +
 Documentation/filesystems/vfs.txt  |   7 -
 Documentation/filesystems/xip.txt  |  68 -----
 MAINTAINERS                        |   6 +
 arch/powerpc/sysdev/axonram.c      |  19 +-
 drivers/block/Kconfig              |  13 +-
 drivers/block/brd.c                |  28 +-
 drivers/s390/block/dcssblk.c       |  21 +-
 fs/Kconfig                         |  21 +-
 fs/Makefile                        |   1 +
 fs/block_dev.c                     |  40 +++
 fs/dax.c                           | 530 +++++++++++++++++++++++++++++++++++++
 fs/exofs/inode.c                   |   1 -
 fs/ext2/Kconfig                    |  11 -
 fs/ext2/Makefile                   |   1 -
 fs/ext2/ext2.h                     |  10 +-
 fs/ext2/file.c                     |  45 +++-
 fs/ext2/inode.c                    |  38 +--
 fs/ext2/namei.c                    |  13 +-
 fs/ext2/super.c                    |  53 ++--
 fs/ext2/xip.c                      |  91 -------
 fs/ext2/xip.h                      |  26 --
 fs/ext4/ext4.h                     |   6 +
 fs/ext4/file.c                     |  50 +++-
 fs/ext4/indirect.c                 |  18 +-
 fs/ext4/inode.c                    |  89 +++++--
 fs/ext4/namei.c                    |  10 +-
 fs/ext4/super.c                    |  39 ++-
 fs/open.c                          |   5 +-
 include/linux/blkdev.h             |   6 +-
 include/linux/fs.h                 |  34 +--
 include/linux/mm.h                 |   1 +
 include/linux/rmap.h               |   2 +-
 mm/Makefile                        |   1 -
 mm/fadvise.c                       |   6 +-
 mm/filemap.c                       |  25 +-
 mm/filemap_xip.c                   | 483 ---------------------------------
 mm/madvise.c                       |   2 +-
 mm/memory.c                        |  33 ++-
 scripts/diffconfig                 |   1 -
 44 files changed, 1069 insertions(+), 893 deletions(-)
 create mode 100644 Documentation/filesystems/dax.txt
 delete mode 100644 Documentation/filesystems/xip.txt
 create mode 100644 fs/dax.c
 delete mode 100644 fs/ext2/xip.c
 delete mode 100644 fs/ext2/xip.h
 delete mode 100644 mm/filemap_xip.c

-- 
2.1.1


^ permalink raw reply	[flat|nested] 60+ messages in thread

* [PATCH v12 01/20] axonram: Fix bug in direct_access
  2014-10-24 21:20 [PATCH v12 00/20] DAX: Page cache bypass for filesystems on memory storage Matthew Wilcox
@ 2014-10-24 21:20 ` Matthew Wilcox
  2014-10-24 21:20 ` [PATCH v12 02/20] block: Change direct_access calling convention Matthew Wilcox
                   ` (20 subsequent siblings)
  21 siblings, 0 replies; 60+ messages in thread
From: Matthew Wilcox @ 2014-10-24 21:20 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-mm
  Cc: Matthew Wilcox, willy, Andrew Morton, stable

The 'pfn' returned by axonram was completely bogus, and has been since
2008.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: stable@vger.kernel.org
---
 arch/powerpc/sysdev/axonram.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/sysdev/axonram.c b/arch/powerpc/sysdev/axonram.c
index ad56edc..e8bb33b 100644
--- a/arch/powerpc/sysdev/axonram.c
+++ b/arch/powerpc/sysdev/axonram.c
@@ -156,7 +156,7 @@ axon_ram_direct_access(struct block_device *device, sector_t sector,
 	}
 
 	*kaddr = (void *)(bank->ph_addr + offset);
-	*pfn = virt_to_phys(kaddr) >> PAGE_SHIFT;
+	*pfn = virt_to_phys(*kaddr) >> PAGE_SHIFT;
 
 	return 0;
 }
-- 
2.1.1


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH v12 02/20] block: Change direct_access calling convention
  2014-10-24 21:20 [PATCH v12 00/20] DAX: Page cache bypass for filesystems on memory storage Matthew Wilcox
  2014-10-24 21:20 ` [PATCH v12 01/20] axonram: Fix bug in direct_access Matthew Wilcox
@ 2014-10-24 21:20 ` Matthew Wilcox
  2014-10-24 21:20 ` [PATCH v12 03/20] mm: Fix XIP fault vs truncate race Matthew Wilcox
                   ` (19 subsequent siblings)
  21 siblings, 0 replies; 60+ messages in thread
From: Matthew Wilcox @ 2014-10-24 21:20 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-mm
  Cc: Matthew Wilcox, willy, Andrew Morton

In order to support accesses to larger chunks of memory, pass in a
'size' parameter (counted in bytes), and return the amount available at
that address.

Add a new helper function, bdev_direct_access(), to handle common
functionality including partition handling, checking the length requested
is positive, checking for the sector being page-aligned, and checking
the length of the request does not pass the end of the partition.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Boaz Harrosh <boaz@plexistor.com>
---
 Documentation/filesystems/xip.txt | 15 +++++++++------
 arch/powerpc/sysdev/axonram.c     | 17 ++++-------------
 drivers/block/brd.c               | 14 +++++++-------
 drivers/s390/block/dcssblk.c      | 21 +++++++++-----------
 fs/block_dev.c                    | 40 +++++++++++++++++++++++++++++++++++++++
 fs/ext2/xip.c                     | 31 +++++++++++++-----------------
 include/linux/blkdev.h            |  6 ++++--
 7 files changed, 86 insertions(+), 58 deletions(-)

diff --git a/Documentation/filesystems/xip.txt b/Documentation/filesystems/xip.txt
index 0466ee5..b774729 100644
--- a/Documentation/filesystems/xip.txt
+++ b/Documentation/filesystems/xip.txt
@@ -28,12 +28,15 @@ Implementation
 Execute-in-place is implemented in three steps: block device operation,
 address space operation, and file operations.
 
-A block device operation named direct_access is used to retrieve a
-reference (pointer) to a block on-disk. The reference is supposed to be
-cpu-addressable, physical address and remain valid until the release operation
-is performed. A struct block_device reference is used to address the device,
-and a sector_t argument is used to identify the individual block. As an
-alternative, memory technology devices can be used for this.
+A block device operation named direct_access is used to translate the
+block device sector number to a page frame number (pfn) that identifies
+the physical page for the memory.  It also returns a kernel virtual
+address that can be used to access the memory.
+
+The direct_access method takes a 'size' parameter that indicates the
+number of bytes being requested.  The function should return the number
+of bytes that can be contiguously accessed at that offset.  It may also
+return a negative errno if an error occurs.
 
 The block device operation is optional, these block devices support it as of
 today:
diff --git a/arch/powerpc/sysdev/axonram.c b/arch/powerpc/sysdev/axonram.c
index e8bb33b..4afff8d 100644
--- a/arch/powerpc/sysdev/axonram.c
+++ b/arch/powerpc/sysdev/axonram.c
@@ -139,26 +139,17 @@ axon_ram_make_request(struct request_queue *queue, struct bio *bio)
  * axon_ram_direct_access - direct_access() method for block device
  * @device, @sector, @data: see block_device_operations method
  */
-static int
+static long
 axon_ram_direct_access(struct block_device *device, sector_t sector,
-		       void **kaddr, unsigned long *pfn)
+		       void **kaddr, unsigned long *pfn, long size)
 {
 	struct axon_ram_bank *bank = device->bd_disk->private_data;
-	loff_t offset;
-
-	offset = sector;
-	if (device->bd_part != NULL)
-		offset += device->bd_part->start_sect;
-	offset <<= AXON_RAM_SECTOR_SHIFT;
-	if (offset >= bank->size) {
-		dev_err(&bank->device->dev, "Access outside of address space\n");
-		return -ERANGE;
-	}
+	loff_t offset = (loff_t)sector << AXON_RAM_SECTOR_SHIFT;
 
 	*kaddr = (void *)(bank->ph_addr + offset);
 	*pfn = virt_to_phys(*kaddr) >> PAGE_SHIFT;
 
-	return 0;
+	return bank->size - offset;
 }
 
 static const struct block_device_operations axon_ram_devops = {
diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 3598110..89e90ec 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -370,25 +370,25 @@ static int brd_rw_page(struct block_device *bdev, sector_t sector,
 }
 
 #ifdef CONFIG_BLK_DEV_XIP
-static int brd_direct_access(struct block_device *bdev, sector_t sector,
-			void **kaddr, unsigned long *pfn)
+static long brd_direct_access(struct block_device *bdev, sector_t sector,
+			void **kaddr, unsigned long *pfn, long size)
 {
 	struct brd_device *brd = bdev->bd_disk->private_data;
 	struct page *page;
 
 	if (!brd)
 		return -ENODEV;
-	if (sector & (PAGE_SECTORS-1))
-		return -EINVAL;
-	if (sector + PAGE_SECTORS > get_capacity(bdev->bd_disk))
-		return -ERANGE;
 	page = brd_insert_page(brd, sector);
 	if (!page)
 		return -ENOSPC;
 	*kaddr = page_address(page);
 	*pfn = page_to_pfn(page);
 
-	return 0;
+	/*
+	 * TODO: If size > PAGE_SIZE, we could look to see if the next page in
+	 * the file happens to be mapped to the next page of physical RAM.
+	 */
+	return PAGE_SIZE;
 }
 #endif
 
diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
index 0f47175..96bc411 100644
--- a/drivers/s390/block/dcssblk.c
+++ b/drivers/s390/block/dcssblk.c
@@ -28,8 +28,8 @@
 static int dcssblk_open(struct block_device *bdev, fmode_t mode);
 static void dcssblk_release(struct gendisk *disk, fmode_t mode);
 static void dcssblk_make_request(struct request_queue *q, struct bio *bio);
-static int dcssblk_direct_access(struct block_device *bdev, sector_t secnum,
-				 void **kaddr, unsigned long *pfn);
+static long dcssblk_direct_access(struct block_device *bdev, sector_t secnum,
+				 void **kaddr, unsigned long *pfn, long size);
 
 static char dcssblk_segments[DCSSBLK_PARM_LEN] = "\0";
 
@@ -866,25 +866,22 @@ fail:
 	bio_io_error(bio);
 }
 
-static int
+static long
 dcssblk_direct_access (struct block_device *bdev, sector_t secnum,
-			void **kaddr, unsigned long *pfn)
+			void **kaddr, unsigned long *pfn, long size)
 {
 	struct dcssblk_dev_info *dev_info;
-	unsigned long pgoff;
+	unsigned long offset, dev_sz;
 
 	dev_info = bdev->bd_disk->private_data;
 	if (!dev_info)
 		return -ENODEV;
-	if (secnum % (PAGE_SIZE/512))
-		return -EINVAL;
-	pgoff = secnum / (PAGE_SIZE / 512);
-	if ((pgoff+1)*PAGE_SIZE-1 > dev_info->end - dev_info->start)
-		return -ERANGE;
-	*kaddr = (void *) (dev_info->start+pgoff*PAGE_SIZE);
+	dev_sz = dev_info->end - dev_info->start;
+	offset = secnum * 512;
+	*kaddr = (void *) (dev_info->start + offset);
 	*pfn = virt_to_phys(*kaddr) >> PAGE_SHIFT;
 
-	return 0;
+	return dev_sz - offset;
 }
 
 static void
diff --git a/fs/block_dev.c b/fs/block_dev.c
index cc9d411..2d2668b 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -423,6 +423,46 @@ int bdev_write_page(struct block_device *bdev, sector_t sector,
 }
 EXPORT_SYMBOL_GPL(bdev_write_page);
 
+/**
+ * bdev_direct_access() - Get the address for directly-accessibly memory
+ * @bdev: The device containing the memory
+ * @sector: The offset within the device
+ * @addr: Where to put the address of the memory
+ * @pfn: The Page Frame Number for the memory
+ * @size: The number of bytes requested
+ *
+ * If a block device is made up of directly addressable memory, this function
+ * will tell the caller the PFN and the address of the memory.  The address
+ * may be directly dereferenced within the kernel without the need to call
+ * ioremap(), kmap() or similar.  The PFN is suitable for inserting into
+ * page tables.
+ *
+ * Return: negative errno if an error occurs, otherwise the number of bytes
+ * accessible at this address.
+ */
+long bdev_direct_access(struct block_device *bdev, sector_t sector,
+			void **addr, unsigned long *pfn, long size)
+{
+	long avail;
+	const struct block_device_operations *ops = bdev->bd_disk->fops;
+
+	if (size < 0)
+		return size;
+	if (!ops->direct_access)
+		return -EOPNOTSUPP;
+	if ((sector + DIV_ROUND_UP(size, 512)) >
+					part_nr_sects_read(bdev->bd_part))
+		return -ERANGE;
+	sector += get_start_sect(bdev);
+	if (sector % (PAGE_SIZE / 512))
+		return -EINVAL;
+	avail = ops->direct_access(bdev, sector, addr, pfn, size);
+	if (!avail)
+		return -ERANGE;
+	return min(avail, size);
+}
+EXPORT_SYMBOL_GPL(bdev_direct_access);
+
 /*
  * pseudo-fs
  */
diff --git a/fs/ext2/xip.c b/fs/ext2/xip.c
index e98171a..bbc5fec 100644
--- a/fs/ext2/xip.c
+++ b/fs/ext2/xip.c
@@ -13,18 +13,12 @@
 #include "ext2.h"
 #include "xip.h"
 
-static inline int
-__inode_direct_access(struct inode *inode, sector_t block,
-		      void **kaddr, unsigned long *pfn)
+static inline long __inode_direct_access(struct inode *inode, sector_t block,
+				void **kaddr, unsigned long *pfn, long size)
 {
 	struct block_device *bdev = inode->i_sb->s_bdev;
-	const struct block_device_operations *ops = bdev->bd_disk->fops;
-	sector_t sector;
-
-	sector = block * (PAGE_SIZE / 512); /* ext2 block to bdev sector */
-
-	BUG_ON(!ops->direct_access);
-	return ops->direct_access(bdev, sector, kaddr, pfn);
+	sector_t sector = block * (PAGE_SIZE / 512);
+	return bdev_direct_access(bdev, sector, kaddr, pfn, size);
 }
 
 static inline int
@@ -53,12 +47,13 @@ ext2_clear_xip_target(struct inode *inode, sector_t block)
 {
 	void *kaddr;
 	unsigned long pfn;
-	int rc;
+	long size;
 
-	rc = __inode_direct_access(inode, block, &kaddr, &pfn);
-	if (!rc)
-		clear_page(kaddr);
-	return rc;
+	size = __inode_direct_access(inode, block, &kaddr, &pfn, PAGE_SIZE);
+	if (size < 0)
+		return size;
+	clear_page(kaddr);
+	return 0;
 }
 
 void ext2_xip_verify_sb(struct super_block *sb)
@@ -77,7 +72,7 @@ void ext2_xip_verify_sb(struct super_block *sb)
 int ext2_get_xip_mem(struct address_space *mapping, pgoff_t pgoff, int create,
 				void **kmem, unsigned long *pfn)
 {
-	int rc;
+	long rc;
 	sector_t block;
 
 	/* first, retrieve the sector number */
@@ -86,6 +81,6 @@ int ext2_get_xip_mem(struct address_space *mapping, pgoff_t pgoff, int create,
 		return rc;
 
 	/* retrieve address of the target data */
-	rc = __inode_direct_access(mapping->host, block, kmem, pfn);
-	return rc;
+	rc = __inode_direct_access(mapping->host, block, kmem, pfn, PAGE_SIZE);
+	return (rc < 0) ? rc : 0;
 }
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 0207a78..3b93ec7 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1604,8 +1604,8 @@ struct block_device_operations {
 	int (*rw_page)(struct block_device *, sector_t, struct page *, int rw);
 	int (*ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
 	int (*compat_ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
-	int (*direct_access) (struct block_device *, sector_t,
-						void **, unsigned long *);
+	long (*direct_access)(struct block_device *, sector_t,
+					void **, unsigned long *pfn, long size);
 	unsigned int (*check_events) (struct gendisk *disk,
 				      unsigned int clearing);
 	/* ->media_changed() is DEPRECATED, use ->check_events() instead */
@@ -1623,6 +1623,8 @@ extern int __blkdev_driver_ioctl(struct block_device *, fmode_t, unsigned int,
 extern int bdev_read_page(struct block_device *, sector_t, struct page *);
 extern int bdev_write_page(struct block_device *, sector_t, struct page *,
 						struct writeback_control *);
+extern long bdev_direct_access(struct block_device *, sector_t, void **addr,
+						unsigned long *pfn, long size);
 #else /* CONFIG_BLOCK */
 
 struct block_device;
-- 
2.1.1


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH v12 03/20] mm: Fix XIP fault vs truncate race
  2014-10-24 21:20 [PATCH v12 00/20] DAX: Page cache bypass for filesystems on memory storage Matthew Wilcox
  2014-10-24 21:20 ` [PATCH v12 01/20] axonram: Fix bug in direct_access Matthew Wilcox
  2014-10-24 21:20 ` [PATCH v12 02/20] block: Change direct_access calling convention Matthew Wilcox
@ 2014-10-24 21:20 ` Matthew Wilcox
  2015-01-12 23:09   ` Andrew Morton
  2014-10-24 21:20 ` [PATCH v12 04/20] mm: Allow page fault handlers to perform the COW Matthew Wilcox
                   ` (18 subsequent siblings)
  21 siblings, 1 reply; 60+ messages in thread
From: Matthew Wilcox @ 2014-10-24 21:20 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-mm
  Cc: Matthew Wilcox, willy, Andrew Morton

Pagecache faults recheck i_size after taking the page lock to ensure that
the fault didn't race against a truncate.  We don't have a page to lock
in the XIP case, so use the i_mmap_mutex instead.  It is locked in the
truncate path in unmap_mapping_range() after updating i_size.  So while
we hold it in the fault path, we are guaranteed that either i_size has
already been updated in the truncate path, or that the truncate will
subsequently call zap_page_range_single() and so remove the mapping we
have just inserted.

There is a window of time in which i_size has been reduced and the
thread has a mapping to a page which will be removed from the file,
but this is harmless as the page will not be allocated to a different
purpose before the thread's access to it is revoked.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
---
 mm/filemap_xip.c | 24 ++++++++++++++++++++++--
 1 file changed, 22 insertions(+), 2 deletions(-)

diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
index d8d9fe3..c8d23e9 100644
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -260,8 +260,17 @@ again:
 		__xip_unmap(mapping, vmf->pgoff);
 
 found:
+		/* We must recheck i_size under i_mmap_mutex */
+		mutex_lock(&mapping->i_mmap_mutex);
+		size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >>
+							PAGE_CACHE_SHIFT;
+		if (unlikely(vmf->pgoff >= size)) {
+			mutex_unlock(&mapping->i_mmap_mutex);
+			return VM_FAULT_SIGBUS;
+		}
 		err = vm_insert_mixed(vma, (unsigned long)vmf->virtual_address,
 							xip_pfn);
+		mutex_unlock(&mapping->i_mmap_mutex);
 		if (err == -ENOMEM)
 			return VM_FAULT_OOM;
 		/*
@@ -285,16 +294,27 @@ found:
 		}
 		if (error != -ENODATA)
 			goto out;
+
+		/* We must recheck i_size under i_mmap_mutex */
+		mutex_lock(&mapping->i_mmap_mutex);
+		size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >>
+							PAGE_CACHE_SHIFT;
+		if (unlikely(vmf->pgoff >= size)) {
+			ret = VM_FAULT_SIGBUS;
+			goto unlock;
+		}
 		/* not shared and writable, use xip_sparse_page() */
 		page = xip_sparse_page();
 		if (!page)
-			goto out;
+			goto unlock;
 		err = vm_insert_page(vma, (unsigned long)vmf->virtual_address,
 							page);
 		if (err == -ENOMEM)
-			goto out;
+			goto unlock;
 
 		ret = VM_FAULT_NOPAGE;
+unlock:
+		mutex_unlock(&mapping->i_mmap_mutex);
 out:
 		write_seqcount_end(&xip_sparse_seq);
 		mutex_unlock(&xip_sparse_mutex);
-- 
2.1.1


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH v12 04/20] mm: Allow page fault handlers to perform the COW
  2014-10-24 21:20 [PATCH v12 00/20] DAX: Page cache bypass for filesystems on memory storage Matthew Wilcox
                   ` (2 preceding siblings ...)
  2014-10-24 21:20 ` [PATCH v12 03/20] mm: Fix XIP fault vs truncate race Matthew Wilcox
@ 2014-10-24 21:20 ` Matthew Wilcox
  2015-01-12 23:09   ` Andrew Morton
  2015-02-05  9:16   ` Yigal Korman
  2014-10-24 21:20 ` [PATCH v12 05/20] vfs,ext2: Introduce IS_DAX(inode) Matthew Wilcox
                   ` (17 subsequent siblings)
  21 siblings, 2 replies; 60+ messages in thread
From: Matthew Wilcox @ 2014-10-24 21:20 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-mm
  Cc: Matthew Wilcox, willy, Andrew Morton

Currently COW of an XIP file is done by first bringing in a read-only
mapping, then retrying the fault and copying the page.  It is much more
efficient to tell the fault handler that a COW is being attempted (by
passing in the pre-allocated page in the vm_fault structure), and allow
the handler to perform the COW operation itself.

The handler cannot insert the page itself if there is already a read-only
mapping at that address, so allow the handler to return VM_FAULT_LOCKED
and set the fault_page to be NULL.  This indicates to the MM code that
the i_mmap_mutex is held instead of the page lock.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/mm.h |  1 +
 mm/memory.c        | 33 ++++++++++++++++++++++++---------
 2 files changed, 25 insertions(+), 9 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 02d11ee..88d1ef4 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -209,6 +209,7 @@ struct vm_fault {
 	pgoff_t pgoff;			/* Logical page offset based on vma */
 	void __user *virtual_address;	/* Faulting virtual address */
 
+	struct page *cow_page;		/* Handler may choose to COW */
 	struct page *page;		/* ->fault handlers should return a
 					 * page here, unless VM_FAULT_NOPAGE
 					 * is set (which is also implied by
diff --git a/mm/memory.c b/mm/memory.c
index 1cc6bfb..6dee424 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2002,6 +2002,7 @@ static int do_page_mkwrite(struct vm_area_struct *vma, struct page *page,
 	vmf.pgoff = page->index;
 	vmf.flags = FAULT_FLAG_WRITE|FAULT_FLAG_MKWRITE;
 	vmf.page = page;
+	vmf.cow_page = NULL;
 
 	ret = vma->vm_ops->page_mkwrite(vma, &vmf);
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE)))
@@ -2701,7 +2702,8 @@ oom:
  * See filemap_fault() and __lock_page_retry().
  */
 static int __do_fault(struct vm_area_struct *vma, unsigned long address,
-		pgoff_t pgoff, unsigned int flags, struct page **page)
+			pgoff_t pgoff, unsigned int flags,
+			struct page *cow_page, struct page **page)
 {
 	struct vm_fault vmf;
 	int ret;
@@ -2710,10 +2712,13 @@ static int __do_fault(struct vm_area_struct *vma, unsigned long address,
 	vmf.pgoff = pgoff;
 	vmf.flags = flags;
 	vmf.page = NULL;
+	vmf.cow_page = cow_page;
 
 	ret = vma->vm_ops->fault(vma, &vmf);
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
 		return ret;
+	if (!vmf.page)
+		goto out;
 
 	if (unlikely(PageHWPoison(vmf.page))) {
 		if (ret & VM_FAULT_LOCKED)
@@ -2727,6 +2732,7 @@ static int __do_fault(struct vm_area_struct *vma, unsigned long address,
 	else
 		VM_BUG_ON_PAGE(!PageLocked(vmf.page), vmf.page);
 
+ out:
 	*page = vmf.page;
 	return ret;
 }
@@ -2900,7 +2906,7 @@ static int do_read_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		pte_unmap_unlock(pte, ptl);
 	}
 
-	ret = __do_fault(vma, address, pgoff, flags, &fault_page);
+	ret = __do_fault(vma, address, pgoff, flags, NULL, &fault_page);
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
 		return ret;
 
@@ -2940,26 +2946,35 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		return VM_FAULT_OOM;
 	}
 
-	ret = __do_fault(vma, address, pgoff, flags, &fault_page);
+	ret = __do_fault(vma, address, pgoff, flags, new_page, &fault_page);
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
 		goto uncharge_out;
 
-	copy_user_highpage(new_page, fault_page, address, vma);
+	if (fault_page)
+		copy_user_highpage(new_page, fault_page, address, vma);
 	__SetPageUptodate(new_page);
 
 	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
 	if (unlikely(!pte_same(*pte, orig_pte))) {
 		pte_unmap_unlock(pte, ptl);
-		unlock_page(fault_page);
-		page_cache_release(fault_page);
+		if (fault_page) {
+			unlock_page(fault_page);
+			page_cache_release(fault_page);
+		} else {
+			mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
+		}
 		goto uncharge_out;
 	}
 	do_set_pte(vma, address, new_page, pte, true, true);
 	mem_cgroup_commit_charge(new_page, memcg, false);
 	lru_cache_add_active_or_unevictable(new_page, vma);
 	pte_unmap_unlock(pte, ptl);
-	unlock_page(fault_page);
-	page_cache_release(fault_page);
+	if (fault_page) {
+		unlock_page(fault_page);
+		page_cache_release(fault_page);
+	} else {
+		mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
+	}
 	return ret;
 uncharge_out:
 	mem_cgroup_cancel_charge(new_page, memcg);
@@ -2978,7 +2993,7 @@ static int do_shared_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	int dirtied = 0;
 	int ret, tmp;
 
-	ret = __do_fault(vma, address, pgoff, flags, &fault_page);
+	ret = __do_fault(vma, address, pgoff, flags, NULL, &fault_page);
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
 		return ret;
 
-- 
2.1.1


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH v12 05/20] vfs,ext2: Introduce IS_DAX(inode)
  2014-10-24 21:20 [PATCH v12 00/20] DAX: Page cache bypass for filesystems on memory storage Matthew Wilcox
                   ` (3 preceding siblings ...)
  2014-10-24 21:20 ` [PATCH v12 04/20] mm: Allow page fault handlers to perform the COW Matthew Wilcox
@ 2014-10-24 21:20 ` Matthew Wilcox
  2014-10-24 21:20 ` [PATCH v12 06/20] dax,ext2: Replace XIP read and write with DAX I/O Matthew Wilcox
                   ` (16 subsequent siblings)
  21 siblings, 0 replies; 60+ messages in thread
From: Matthew Wilcox @ 2014-10-24 21:20 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-mm
  Cc: Matthew Wilcox, willy, Andrew Morton

Use an inode flag to tag inodes which should avoid using the page cache.
Convert ext2 to use it instead of mapping_is_xip().  Prevent I/Os to
files tagged with the DAX flag from falling back to buffered I/O.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
---
 fs/ext2/inode.c    |  9 ++++++---
 fs/ext2/xip.h      |  2 --
 include/linux/fs.h |  6 ++++++
 mm/filemap.c       | 19 ++++++++++++-------
 4 files changed, 24 insertions(+), 12 deletions(-)

diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 36d35c3..0cb0448 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -731,7 +731,7 @@ static int ext2_get_blocks(struct inode *inode,
 		goto cleanup;
 	}
 
-	if (ext2_use_xip(inode->i_sb)) {
+	if (IS_DAX(inode)) {
 		/*
 		 * we need to clear the block
 		 */
@@ -1201,7 +1201,7 @@ static int ext2_setsize(struct inode *inode, loff_t newsize)
 
 	inode_dio_wait(inode);
 
-	if (mapping_is_xip(inode->i_mapping))
+	if (IS_DAX(inode))
 		error = xip_truncate_page(inode->i_mapping, newsize);
 	else if (test_opt(inode->i_sb, NOBH))
 		error = nobh_truncate_page(inode->i_mapping,
@@ -1273,7 +1273,8 @@ void ext2_set_inode_flags(struct inode *inode)
 {
 	unsigned int flags = EXT2_I(inode)->i_flags;
 
-	inode->i_flags &= ~(S_SYNC|S_APPEND|S_IMMUTABLE|S_NOATIME|S_DIRSYNC);
+	inode->i_flags &= ~(S_SYNC | S_APPEND | S_IMMUTABLE | S_NOATIME |
+				S_DIRSYNC | S_DAX);
 	if (flags & EXT2_SYNC_FL)
 		inode->i_flags |= S_SYNC;
 	if (flags & EXT2_APPEND_FL)
@@ -1284,6 +1285,8 @@ void ext2_set_inode_flags(struct inode *inode)
 		inode->i_flags |= S_NOATIME;
 	if (flags & EXT2_DIRSYNC_FL)
 		inode->i_flags |= S_DIRSYNC;
+	if (test_opt(inode->i_sb, XIP))
+		inode->i_flags |= S_DAX;
 }
 
 /* Propagate flags from i_flags to EXT2_I(inode)->i_flags */
diff --git a/fs/ext2/xip.h b/fs/ext2/xip.h
index 18b34d2..29be737 100644
--- a/fs/ext2/xip.h
+++ b/fs/ext2/xip.h
@@ -16,9 +16,7 @@ static inline int ext2_use_xip (struct super_block *sb)
 }
 int ext2_get_xip_mem(struct address_space *, pgoff_t, int,
 				void **, unsigned long *);
-#define mapping_is_xip(map) unlikely(map->a_ops->get_xip_mem)
 #else
-#define mapping_is_xip(map)			0
 #define ext2_xip_verify_sb(sb)			do { } while (0)
 #define ext2_use_xip(sb)			0
 #define ext2_clear_xip_target(inode, chain)	0
diff --git a/include/linux/fs.h b/include/linux/fs.h
index a957d43..ff0acb2 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1587,6 +1587,11 @@ struct super_operations {
 #define S_IMA		1024	/* Inode has an associated IMA struct */
 #define S_AUTOMOUNT	2048	/* Automount/referral quasi-directory */
 #define S_NOSEC		4096	/* no suid or xattr security attributes */
+#ifdef CONFIG_FS_XIP
+#define S_DAX		8192	/* Direct Access, avoiding the page cache */
+#else
+#define S_DAX		0	/* Make all the DAX code disappear */
+#endif
 
 /*
  * Note that nosuid etc flags are inode-specific: setting some file-system
@@ -1624,6 +1629,7 @@ struct super_operations {
 #define IS_IMA(inode)		((inode)->i_flags & S_IMA)
 #define IS_AUTOMOUNT(inode)	((inode)->i_flags & S_AUTOMOUNT)
 #define IS_NOSEC(inode)		((inode)->i_flags & S_NOSEC)
+#define IS_DAX(inode)		((inode)->i_flags & S_DAX)
 
 /*
  * Inode state bits.  Protected by inode->i_lock
diff --git a/mm/filemap.c b/mm/filemap.c
index 14b4642..2b13a4a 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1727,9 +1727,11 @@ generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
 		 * we've already read everything we wanted to, or if
 		 * there was a short read because we hit EOF, go ahead
 		 * and return.  Otherwise fallthrough to buffered io for
-		 * the rest of the read.
+		 * the rest of the read.  Buffered reads will not work for
+		 * DAX files, so don't bother trying.
 		 */
-		if (retval < 0 || !iov_iter_count(iter) || *ppos >= size) {
+		if (retval < 0 || !iov_iter_count(iter) || *ppos >= size ||
+		    IS_DAX(inode)) {
 			file_accessed(file);
 			goto out;
 		}
@@ -2593,13 +2595,16 @@ ssize_t __generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 		loff_t endbyte;
 
 		written = generic_file_direct_write(iocb, from, pos);
-		if (written < 0 || written == count)
-			goto out;
-
 		/*
-		 * direct-io write to a hole: fall through to buffered I/O
-		 * for completing the rest of the request.
+		 * If the write stopped short of completing, fall back to
+		 * buffered writes.  Some filesystems do this for writes to
+		 * holes, for example.  For DAX files, a buffered write will
+		 * not succeed (even if it did, DAX does not handle dirty
+		 * page-cache pages correctly).
 		 */
+		if (written < 0 || written == count || IS_DAX(inode))
+			goto out;
+
 		pos += written;
 		count -= written;
 
-- 
2.1.1


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH v12 06/20] dax,ext2: Replace XIP read and write with DAX I/O
  2014-10-24 21:20 [PATCH v12 00/20] DAX: Page cache bypass for filesystems on memory storage Matthew Wilcox
                   ` (4 preceding siblings ...)
  2014-10-24 21:20 ` [PATCH v12 05/20] vfs,ext2: Introduce IS_DAX(inode) Matthew Wilcox
@ 2014-10-24 21:20 ` Matthew Wilcox
  2015-01-12 23:09   ` Andrew Morton
  2014-10-24 21:20 ` [PATCH v12 07/20] dax,ext2: Replace ext2_clear_xip_target with dax_clear_blocks Matthew Wilcox
                   ` (15 subsequent siblings)
  21 siblings, 1 reply; 60+ messages in thread
From: Matthew Wilcox @ 2014-10-24 21:20 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-mm
  Cc: Matthew Wilcox, willy, Andrew Morton

Use the generic AIO infrastructure instead of custom read and write
methods.  In addition to giving us support for AIO, this adds the missing
locking between read() and truncate().

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 MAINTAINERS        |   6 ++
 fs/Makefile        |   1 +
 fs/dax.c           | 186 ++++++++++++++++++++++++++++++++++++++++++
 fs/ext2/file.c     |   6 +-
 fs/ext2/inode.c    |   8 +-
 include/linux/fs.h |  12 ++-
 mm/filemap.c       |   6 +-
 mm/filemap_xip.c   | 234 -----------------------------------------------------
 8 files changed, 214 insertions(+), 245 deletions(-)
 create mode 100644 fs/dax.c

diff --git a/MAINTAINERS b/MAINTAINERS
index a20df9b..20c28f2 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3006,6 +3006,12 @@ L:	linux-i2c@vger.kernel.org
 S:	Maintained
 F:	drivers/i2c/busses/i2c-diolan-u2c.c
 
+DIRECT ACCESS (DAX)
+M:	Matthew Wilcox <willy@linux.intel.com>
+L:	linux-fsdevel@vger.kernel.org
+S:	Supported
+F:	fs/dax.c
+
 DIRECTORY NOTIFICATION (DNOTIFY)
 M:	Eric Paris <eparis@parisplace.org>
 S:	Maintained
diff --git a/fs/Makefile b/fs/Makefile
index 90c8852..0325ec3 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -28,6 +28,7 @@ obj-$(CONFIG_SIGNALFD)		+= signalfd.o
 obj-$(CONFIG_TIMERFD)		+= timerfd.o
 obj-$(CONFIG_EVENTFD)		+= eventfd.o
 obj-$(CONFIG_AIO)               += aio.o
+obj-$(CONFIG_FS_XIP)		+= dax.o
 obj-$(CONFIG_FILE_LOCKING)      += locks.o
 obj-$(CONFIG_COMPAT)		+= compat.o compat_ioctl.o
 obj-$(CONFIG_BINFMT_AOUT)	+= binfmt_aout.o
diff --git a/fs/dax.c b/fs/dax.c
new file mode 100644
index 0000000..1a2bdbf
--- /dev/null
+++ b/fs/dax.c
@@ -0,0 +1,186 @@
+/*
+ * fs/dax.c - Direct Access filesystem code
+ * Copyright (c) 2013-2014 Intel Corporation
+ * Author: Matthew Wilcox <matthew.r.wilcox@intel.com>
+ * Author: Ross Zwisler <ross.zwisler@linux.intel.com>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+#include <linux/atomic.h>
+#include <linux/blkdev.h>
+#include <linux/buffer_head.h>
+#include <linux/fs.h>
+#include <linux/genhd.h>
+#include <linux/mutex.h>
+#include <linux/uio.h>
+
+static long dax_get_addr(struct buffer_head *bh, void **addr, unsigned blkbits)
+{
+	unsigned long pfn;
+	sector_t sector = bh->b_blocknr << (blkbits - 9);
+	return bdev_direct_access(bh->b_bdev, sector, addr, &pfn, bh->b_size);
+}
+
+static void dax_new_buf(void *addr, unsigned size, unsigned first, loff_t pos,
+			loff_t end)
+{
+	loff_t final = end - pos + first; /* The final byte of the buffer */
+
+	if (first > 0)
+		memset(addr, 0, first);
+	if (final < size)
+		memset(addr + final, 0, size - final);
+}
+
+static bool buffer_written(struct buffer_head *bh)
+{
+	return buffer_mapped(bh) && !buffer_unwritten(bh);
+}
+
+/*
+ * When ext4 encounters a hole, it returns without modifying the buffer_head
+ * which means that we can't trust b_size.  To cope with this, we set b_state
+ * to 0 before calling get_block and, if any bit is set, we know we can trust
+ * b_size.  Unfortunate, really, since ext4 knows precisely how long a hole is
+ * and would save us time calling get_block repeatedly.
+ */
+static bool buffer_size_valid(struct buffer_head *bh)
+{
+	return bh->b_state != 0;
+}
+
+static ssize_t dax_io(int rw, struct inode *inode, struct iov_iter *iter,
+			loff_t start, loff_t end, get_block_t get_block,
+			struct buffer_head *bh)
+{
+	ssize_t retval = 0;
+	loff_t pos = start;
+	loff_t max = start;
+	loff_t bh_max = start;
+	void *addr;
+	bool hole = false;
+
+	if (rw != WRITE)
+		end = min(end, i_size_read(inode));
+
+	while (pos < end) {
+		unsigned len;
+		if (pos == max) {
+			unsigned blkbits = inode->i_blkbits;
+			sector_t block = pos >> blkbits;
+			unsigned first = pos - (block << blkbits);
+			long size;
+
+			if (pos == bh_max) {
+				bh->b_size = PAGE_ALIGN(end - pos);
+				bh->b_state = 0;
+				retval = get_block(inode, block, bh,
+								rw == WRITE);
+				if (retval)
+					break;
+				if (!buffer_size_valid(bh))
+					bh->b_size = 1 << blkbits;
+				bh_max = pos - first + bh->b_size;
+			} else {
+				unsigned done = bh->b_size -
+						(bh_max - (pos - first));
+				bh->b_blocknr += done >> blkbits;
+				bh->b_size -= done;
+			}
+
+			hole = (rw != WRITE) && !buffer_written(bh);
+			if (hole) {
+				addr = NULL;
+				size = bh->b_size - first;
+			} else {
+				retval = dax_get_addr(bh, &addr, blkbits);
+				if (retval < 0)
+					break;
+				if (buffer_unwritten(bh) || buffer_new(bh))
+					dax_new_buf(addr, retval, first, pos,
+									end);
+				addr += first;
+				size = retval - first;
+			}
+			max = min(pos + size, end);
+		}
+
+		if (rw == WRITE)
+			len = copy_from_iter(addr, max - pos, iter);
+		else if (!hole)
+			len = copy_to_iter(addr, max - pos, iter);
+		else
+			len = iov_iter_zero(max - pos, iter);
+
+		if (!len)
+			break;
+
+		pos += len;
+		addr += len;
+	}
+
+	return (pos == start) ? retval : pos - start;
+}
+
+/**
+ * dax_do_io - Perform I/O to a DAX file
+ * @rw: READ to read or WRITE to write
+ * @iocb: The control block for this I/O
+ * @inode: The file which the I/O is directed at
+ * @iter: The addresses to do I/O from or to
+ * @pos: The file offset where the I/O starts
+ * @get_block: The filesystem method used to translate file offsets to blocks
+ * @end_io: A filesystem callback for I/O completion
+ * @flags: See below
+ *
+ * This function uses the same locking scheme as do_blockdev_direct_IO:
+ * If @flags has DIO_LOCKING set, we assume that the i_mutex is held by the
+ * caller for writes.  For reads, we take and release the i_mutex ourselves.
+ * If DIO_LOCKING is not set, the filesystem takes care of its own locking.
+ * As with do_blockdev_direct_IO(), we increment i_dio_count while the I/O
+ * is in progress.
+ */
+ssize_t dax_do_io(int rw, struct kiocb *iocb, struct inode *inode,
+			struct iov_iter *iter, loff_t pos,
+			get_block_t get_block, dio_iodone_t end_io, int flags)
+{
+	struct buffer_head bh;
+	ssize_t retval = -EINVAL;
+	loff_t end = pos + iov_iter_count(iter);
+
+	memset(&bh, 0, sizeof(bh));
+
+	if ((flags & DIO_LOCKING) && (rw == READ)) {
+		struct address_space *mapping = inode->i_mapping;
+		mutex_lock(&inode->i_mutex);
+		retval = filemap_write_and_wait_range(mapping, pos, end - 1);
+		if (retval) {
+			mutex_unlock(&inode->i_mutex);
+			goto out;
+		}
+	}
+
+	/* Protects against truncate */
+	atomic_inc(&inode->i_dio_count);
+
+	retval = dax_io(rw, inode, iter, pos, end, get_block, &bh);
+
+	if ((flags & DIO_LOCKING) && (rw == READ))
+		mutex_unlock(&inode->i_mutex);
+
+	if ((retval > 0) && end_io)
+		end_io(iocb, pos, retval, bh.b_private);
+
+	inode_dio_done(inode);
+ out:
+	return retval;
+}
+EXPORT_SYMBOL_GPL(dax_do_io);
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index 7c87b22..a247123 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -81,8 +81,10 @@ const struct file_operations ext2_file_operations = {
 #ifdef CONFIG_EXT2_FS_XIP
 const struct file_operations ext2_xip_file_operations = {
 	.llseek		= generic_file_llseek,
-	.read		= xip_file_read,
-	.write		= xip_file_write,
+	.read		= new_sync_read,
+	.write		= new_sync_write,
+	.read_iter	= generic_file_read_iter,
+	.write_iter	= generic_file_write_iter,
 	.unlocked_ioctl = ext2_ioctl,
 #ifdef CONFIG_COMPAT
 	.compat_ioctl	= ext2_compat_ioctl,
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 0cb0448..3ccd5fd 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -859,7 +859,12 @@ ext2_direct_IO(int rw, struct kiocb *iocb, struct iov_iter *iter,
 	size_t count = iov_iter_count(iter);
 	ssize_t ret;
 
-	ret = blockdev_direct_IO(rw, iocb, inode, iter, offset, ext2_get_block);
+	if (IS_DAX(inode))
+		ret = dax_do_io(rw, iocb, inode, iter, offset, ext2_get_block,
+				NULL, DIO_LOCKING);
+	else
+		ret = blockdev_direct_IO(rw, iocb, inode, iter, offset,
+					 ext2_get_block);
 	if (ret < 0 && (rw & WRITE))
 		ext2_write_failed(mapping, offset + count);
 	return ret;
@@ -888,6 +893,7 @@ const struct address_space_operations ext2_aops = {
 const struct address_space_operations ext2_aops_xip = {
 	.bmap			= ext2_bmap,
 	.get_xip_mem		= ext2_get_xip_mem,
+	.direct_IO		= ext2_direct_IO,
 };
 
 const struct address_space_operations ext2_nobh_aops = {
diff --git a/include/linux/fs.h b/include/linux/fs.h
index ff0acb2..e024dc3 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2472,12 +2472,11 @@ extern loff_t fixed_size_llseek(struct file *file, loff_t offset,
 extern int generic_file_open(struct inode * inode, struct file * filp);
 extern int nonseekable_open(struct inode * inode, struct file * filp);
 
+ssize_t dax_do_io(int rw, struct kiocb *, struct inode *, struct iov_iter *,
+		loff_t, get_block_t, dio_iodone_t, int flags);
+
 #ifdef CONFIG_FS_XIP
-extern ssize_t xip_file_read(struct file *filp, char __user *buf, size_t len,
-			     loff_t *ppos);
 extern int xip_file_mmap(struct file * file, struct vm_area_struct * vma);
-extern ssize_t xip_file_write(struct file *filp, const char __user *buf,
-			      size_t len, loff_t *ppos);
 extern int xip_truncate_page(struct address_space *mapping, loff_t from);
 #else
 static inline int xip_truncate_page(struct address_space *mapping, loff_t from)
@@ -2641,6 +2640,11 @@ extern int generic_show_options(struct seq_file *m, struct dentry *root);
 extern void save_mount_options(struct super_block *sb, char *options);
 extern void replace_mount_options(struct super_block *sb, char *options);
 
+static inline bool io_is_direct(struct file *filp)
+{
+	return (filp->f_flags & O_DIRECT) || IS_DAX(file_inode(filp));
+}
+
 static inline ino_t parent_ino(struct dentry *dentry)
 {
 	ino_t res;
diff --git a/mm/filemap.c b/mm/filemap.c
index 2b13a4a..743278a 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1699,8 +1699,7 @@ generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
 	loff_t *ppos = &iocb->ki_pos;
 	loff_t pos = *ppos;
 
-	/* coalesce the iovecs and go direct-to-BIO for O_DIRECT */
-	if (file->f_flags & O_DIRECT) {
+	if (io_is_direct(file)) {
 		struct address_space *mapping = file->f_mapping;
 		struct inode *inode = mapping->host;
 		size_t count = iov_iter_count(iter);
@@ -2590,8 +2589,7 @@ ssize_t __generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	if (err)
 		goto out;
 
-	/* coalesce the iovecs and go direct-to-BIO for O_DIRECT */
-	if (unlikely(file->f_flags & O_DIRECT)) {
+	if (io_is_direct(file)) {
 		loff_t endbyte;
 
 		written = generic_file_direct_write(iocb, from, pos);
diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
index c8d23e9..f7c37a1 100644
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -42,119 +42,6 @@ static struct page *xip_sparse_page(void)
 }
 
 /*
- * This is a file read routine for execute in place files, and uses
- * the mapping->a_ops->get_xip_mem() function for the actual low-level
- * stuff.
- *
- * Note the struct file* is not used at all.  It may be NULL.
- */
-static ssize_t
-do_xip_mapping_read(struct address_space *mapping,
-		    struct file_ra_state *_ra,
-		    struct file *filp,
-		    char __user *buf,
-		    size_t len,
-		    loff_t *ppos)
-{
-	struct inode *inode = mapping->host;
-	pgoff_t index, end_index;
-	unsigned long offset;
-	loff_t isize, pos;
-	size_t copied = 0, error = 0;
-
-	BUG_ON(!mapping->a_ops->get_xip_mem);
-
-	pos = *ppos;
-	index = pos >> PAGE_CACHE_SHIFT;
-	offset = pos & ~PAGE_CACHE_MASK;
-
-	isize = i_size_read(inode);
-	if (!isize)
-		goto out;
-
-	end_index = (isize - 1) >> PAGE_CACHE_SHIFT;
-	do {
-		unsigned long nr, left;
-		void *xip_mem;
-		unsigned long xip_pfn;
-		int zero = 0;
-
-		/* nr is the maximum number of bytes to copy from this page */
-		nr = PAGE_CACHE_SIZE;
-		if (index >= end_index) {
-			if (index > end_index)
-				goto out;
-			nr = ((isize - 1) & ~PAGE_CACHE_MASK) + 1;
-			if (nr <= offset) {
-				goto out;
-			}
-		}
-		nr = nr - offset;
-		if (nr > len - copied)
-			nr = len - copied;
-
-		error = mapping->a_ops->get_xip_mem(mapping, index, 0,
-							&xip_mem, &xip_pfn);
-		if (unlikely(error)) {
-			if (error == -ENODATA) {
-				/* sparse */
-				zero = 1;
-			} else
-				goto out;
-		}
-
-		/* If users can be writing to this page using arbitrary
-		 * virtual addresses, take care about potential aliasing
-		 * before reading the page on the kernel side.
-		 */
-		if (mapping_writably_mapped(mapping))
-			/* address based flush */ ;
-
-		/*
-		 * Ok, we have the mem, so now we can copy it to user space...
-		 *
-		 * The actor routine returns how many bytes were actually used..
-		 * NOTE! This may not be the same as how much of a user buffer
-		 * we filled up (we may be padding etc), so we can only update
-		 * "pos" here (the actor routine has to update the user buffer
-		 * pointers and the remaining count).
-		 */
-		if (!zero)
-			left = __copy_to_user(buf+copied, xip_mem+offset, nr);
-		else
-			left = __clear_user(buf + copied, nr);
-
-		if (left) {
-			error = -EFAULT;
-			goto out;
-		}
-
-		copied += (nr - left);
-		offset += (nr - left);
-		index += offset >> PAGE_CACHE_SHIFT;
-		offset &= ~PAGE_CACHE_MASK;
-	} while (copied < len);
-
-out:
-	*ppos = pos + copied;
-	if (filp)
-		file_accessed(filp);
-
-	return (copied ? copied : error);
-}
-
-ssize_t
-xip_file_read(struct file *filp, char __user *buf, size_t len, loff_t *ppos)
-{
-	if (!access_ok(VERIFY_WRITE, buf, len))
-		return -EFAULT;
-
-	return do_xip_mapping_read(filp->f_mapping, &filp->f_ra, filp,
-			    buf, len, ppos);
-}
-EXPORT_SYMBOL_GPL(xip_file_read);
-
-/*
  * __xip_unmap is invoked from xip_unmap and
  * xip_write
  *
@@ -340,127 +227,6 @@ int xip_file_mmap(struct file * file, struct vm_area_struct * vma)
 }
 EXPORT_SYMBOL_GPL(xip_file_mmap);
 
-static ssize_t
-__xip_file_write(struct file *filp, const char __user *buf,
-		  size_t count, loff_t pos, loff_t *ppos)
-{
-	struct address_space * mapping = filp->f_mapping;
-	const struct address_space_operations *a_ops = mapping->a_ops;
-	struct inode 	*inode = mapping->host;
-	long		status = 0;
-	size_t		bytes;
-	ssize_t		written = 0;
-
-	BUG_ON(!mapping->a_ops->get_xip_mem);
-
-	do {
-		unsigned long index;
-		unsigned long offset;
-		size_t copied;
-		void *xip_mem;
-		unsigned long xip_pfn;
-
-		offset = (pos & (PAGE_CACHE_SIZE -1)); /* Within page */
-		index = pos >> PAGE_CACHE_SHIFT;
-		bytes = PAGE_CACHE_SIZE - offset;
-		if (bytes > count)
-			bytes = count;
-
-		status = a_ops->get_xip_mem(mapping, index, 0,
-						&xip_mem, &xip_pfn);
-		if (status == -ENODATA) {
-			/* we allocate a new page unmap it */
-			mutex_lock(&xip_sparse_mutex);
-			status = a_ops->get_xip_mem(mapping, index, 1,
-							&xip_mem, &xip_pfn);
-			mutex_unlock(&xip_sparse_mutex);
-			if (!status)
-				/* unmap page at pgoff from all other vmas */
-				__xip_unmap(mapping, index);
-		}
-
-		if (status)
-			break;
-
-		copied = bytes -
-			__copy_from_user_nocache(xip_mem + offset, buf, bytes);
-
-		if (likely(copied > 0)) {
-			status = copied;
-
-			if (status >= 0) {
-				written += status;
-				count -= status;
-				pos += status;
-				buf += status;
-			}
-		}
-		if (unlikely(copied != bytes))
-			if (status >= 0)
-				status = -EFAULT;
-		if (status < 0)
-			break;
-	} while (count);
-	*ppos = pos;
-	/*
-	 * No need to use i_size_read() here, the i_size
-	 * cannot change under us because we hold i_mutex.
-	 */
-	if (pos > inode->i_size) {
-		i_size_write(inode, pos);
-		mark_inode_dirty(inode);
-	}
-
-	return written ? written : status;
-}
-
-ssize_t
-xip_file_write(struct file *filp, const char __user *buf, size_t len,
-	       loff_t *ppos)
-{
-	struct address_space *mapping = filp->f_mapping;
-	struct inode *inode = mapping->host;
-	size_t count;
-	loff_t pos;
-	ssize_t ret;
-
-	mutex_lock(&inode->i_mutex);
-
-	if (!access_ok(VERIFY_READ, buf, len)) {
-		ret=-EFAULT;
-		goto out_up;
-	}
-
-	pos = *ppos;
-	count = len;
-
-	/* We can write back this queue in page reclaim */
-	current->backing_dev_info = mapping->backing_dev_info;
-
-	ret = generic_write_checks(filp, &pos, &count, S_ISBLK(inode->i_mode));
-	if (ret)
-		goto out_backing;
-	if (count == 0)
-		goto out_backing;
-
-	ret = file_remove_suid(filp);
-	if (ret)
-		goto out_backing;
-
-	ret = file_update_time(filp);
-	if (ret)
-		goto out_backing;
-
-	ret = __xip_file_write (filp, buf, count, pos, ppos);
-
- out_backing:
-	current->backing_dev_info = NULL;
- out_up:
-	mutex_unlock(&inode->i_mutex);
-	return ret;
-}
-EXPORT_SYMBOL_GPL(xip_file_write);
-
 /*
  * truncate a page used for execute in place
  * functionality is analog to block_truncate_page but does use get_xip_mem
-- 
2.1.1


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH v12 07/20] dax,ext2: Replace ext2_clear_xip_target with dax_clear_blocks
  2014-10-24 21:20 [PATCH v12 00/20] DAX: Page cache bypass for filesystems on memory storage Matthew Wilcox
                   ` (5 preceding siblings ...)
  2014-10-24 21:20 ` [PATCH v12 06/20] dax,ext2: Replace XIP read and write with DAX I/O Matthew Wilcox
@ 2014-10-24 21:20 ` Matthew Wilcox
  2015-01-12 23:09   ` Andrew Morton
  2014-10-24 21:20 ` [PATCH v12 08/20] dax,ext2: Replace the XIP page fault handler with the DAX page fault handler Matthew Wilcox
                   ` (14 subsequent siblings)
  21 siblings, 1 reply; 60+ messages in thread
From: Matthew Wilcox @ 2014-10-24 21:20 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-mm
  Cc: Matthew Wilcox, willy, Andrew Morton

This is practically generic code; other filesystems will want to call
it from other places, but there's nothing ext2-specific about it.

Make it a little more generic by allowing it to take a count of the number
of bytes to zero rather than fixing it to a single page.  Thanks to Dave
Hansen for suggesting that I need to call cond_resched() if zeroing more
than one page.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
---
 fs/dax.c           | 37 +++++++++++++++++++++++++++++++++++++
 fs/ext2/inode.c    |  8 +++++---
 fs/ext2/xip.c      | 14 --------------
 fs/ext2/xip.h      |  3 ---
 include/linux/fs.h |  1 +
 5 files changed, 43 insertions(+), 20 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 1a2bdbf..69c3126 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -20,8 +20,45 @@
 #include <linux/fs.h>
 #include <linux/genhd.h>
 #include <linux/mutex.h>
+#include <linux/sched.h>
 #include <linux/uio.h>
 
+int dax_clear_blocks(struct inode *inode, sector_t block, long size)
+{
+	struct block_device *bdev = inode->i_sb->s_bdev;
+	sector_t sector = block << (inode->i_blkbits - 9);
+
+	might_sleep();
+	do {
+		void *addr;
+		unsigned long pfn;
+		long count;
+
+		count = bdev_direct_access(bdev, sector, &addr, &pfn, size);
+		if (count < 0)
+			return count;
+		BUG_ON(size < count);
+		while (count > 0) {
+			unsigned pgsz = PAGE_SIZE - offset_in_page(addr);
+			if (pgsz > count)
+				pgsz = count;
+			if (pgsz < PAGE_SIZE)
+				memset(addr, 0, pgsz);
+			else
+				clear_page(addr);
+			addr += pgsz;
+			size -= pgsz;
+			count -= pgsz;
+			BUG_ON(pgsz & 511);
+			sector += pgsz / 512;
+			cond_resched();
+		}
+	} while (size);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(dax_clear_blocks);
+
 static long dax_get_addr(struct buffer_head *bh, void **addr, unsigned blkbits)
 {
 	unsigned long pfn;
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 3ccd5fd..52978b8 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -733,10 +733,12 @@ static int ext2_get_blocks(struct inode *inode,
 
 	if (IS_DAX(inode)) {
 		/*
-		 * we need to clear the block
+		 * block must be initialised before we put it in the tree
+		 * so that it's not found by another thread before it's
+		 * initialised
 		 */
-		err = ext2_clear_xip_target (inode,
-			le32_to_cpu(chain[depth-1].key));
+		err = dax_clear_blocks(inode, le32_to_cpu(chain[depth-1].key),
+						1 << inode->i_blkbits);
 		if (err) {
 			mutex_unlock(&ei->truncate_mutex);
 			goto cleanup;
diff --git a/fs/ext2/xip.c b/fs/ext2/xip.c
index bbc5fec..8cfca3a 100644
--- a/fs/ext2/xip.c
+++ b/fs/ext2/xip.c
@@ -42,20 +42,6 @@ __ext2_get_block(struct inode *inode, pgoff_t pgoff, int create,
 	return rc;
 }
 
-int
-ext2_clear_xip_target(struct inode *inode, sector_t block)
-{
-	void *kaddr;
-	unsigned long pfn;
-	long size;
-
-	size = __inode_direct_access(inode, block, &kaddr, &pfn, PAGE_SIZE);
-	if (size < 0)
-		return size;
-	clear_page(kaddr);
-	return 0;
-}
-
 void ext2_xip_verify_sb(struct super_block *sb)
 {
 	struct ext2_sb_info *sbi = EXT2_SB(sb);
diff --git a/fs/ext2/xip.h b/fs/ext2/xip.h
index 29be737..b2592f2 100644
--- a/fs/ext2/xip.h
+++ b/fs/ext2/xip.h
@@ -7,8 +7,6 @@
 
 #ifdef CONFIG_EXT2_FS_XIP
 extern void ext2_xip_verify_sb (struct super_block *);
-extern int ext2_clear_xip_target (struct inode *, sector_t);
-
 static inline int ext2_use_xip (struct super_block *sb)
 {
 	struct ext2_sb_info *sbi = EXT2_SB(sb);
@@ -19,6 +17,5 @@ int ext2_get_xip_mem(struct address_space *, pgoff_t, int,
 #else
 #define ext2_xip_verify_sb(sb)			do { } while (0)
 #define ext2_use_xip(sb)			0
-#define ext2_clear_xip_target(inode, chain)	0
 #define ext2_get_xip_mem			NULL
 #endif
diff --git a/include/linux/fs.h b/include/linux/fs.h
index e024dc3..aeff5dd 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2474,6 +2474,7 @@ extern int nonseekable_open(struct inode * inode, struct file * filp);
 
 ssize_t dax_do_io(int rw, struct kiocb *, struct inode *, struct iov_iter *,
 		loff_t, get_block_t, dio_iodone_t, int flags);
+int dax_clear_blocks(struct inode *, sector_t block, long size);
 
 #ifdef CONFIG_FS_XIP
 extern int xip_file_mmap(struct file * file, struct vm_area_struct * vma);
-- 
2.1.1


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH v12 08/20] dax,ext2: Replace the XIP page fault handler with the DAX page fault handler
  2014-10-24 21:20 [PATCH v12 00/20] DAX: Page cache bypass for filesystems on memory storage Matthew Wilcox
                   ` (6 preceding siblings ...)
  2014-10-24 21:20 ` [PATCH v12 07/20] dax,ext2: Replace ext2_clear_xip_target with dax_clear_blocks Matthew Wilcox
@ 2014-10-24 21:20 ` Matthew Wilcox
  2015-01-12 23:09   ` Andrew Morton
  2014-10-24 21:20 ` [PATCH v12 09/20] dax,ext2: Replace xip_truncate_page with dax_truncate_page Matthew Wilcox
                   ` (13 subsequent siblings)
  21 siblings, 1 reply; 60+ messages in thread
From: Matthew Wilcox @ 2014-10-24 21:20 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-mm
  Cc: Matthew Wilcox, willy, Andrew Morton

Instead of calling aops->get_xip_mem from the fault handler, the
filesystem passes a get_block_t that is used to find the appropriate
blocks.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/dax.c           | 237 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/ext2/file.c     |  35 +++++++-
 include/linux/fs.h |   4 +-
 mm/filemap_xip.c   | 206 ----------------------------------------------
 4 files changed, 273 insertions(+), 209 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 69c3126..19b665e 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -19,9 +19,13 @@
 #include <linux/buffer_head.h>
 #include <linux/fs.h>
 #include <linux/genhd.h>
+#include <linux/highmem.h>
+#include <linux/memcontrol.h>
+#include <linux/mm.h>
 #include <linux/mutex.h>
 #include <linux/sched.h>
 #include <linux/uio.h>
+#include <linux/vmstat.h>
 
 int dax_clear_blocks(struct inode *inode, sector_t block, long size)
 {
@@ -221,3 +225,236 @@ ssize_t dax_do_io(int rw, struct kiocb *iocb, struct inode *inode,
 	return retval;
 }
 EXPORT_SYMBOL_GPL(dax_do_io);
+
+/*
+ * The user has performed a load from a hole in the file.  Allocating
+ * a new page in the file would cause excessive storage usage for
+ * workloads with sparse files.  We allocate a page cache page instead.
+ * We'll kick it out of the page cache if it's ever written to,
+ * otherwise it will simply fall out of the page cache under memory
+ * pressure without ever having been dirtied.
+ */
+static int dax_load_hole(struct address_space *mapping, struct page *page,
+							struct vm_fault *vmf)
+{
+	unsigned long size;
+	struct inode *inode = mapping->host;
+	if (!page)
+		page = find_or_create_page(mapping, vmf->pgoff,
+						GFP_KERNEL | __GFP_ZERO);
+	if (!page)
+		return VM_FAULT_OOM;
+	/* Recheck i_size under page lock to avoid truncate race */
+	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
+	if (vmf->pgoff >= size) {
+		unlock_page(page);
+		page_cache_release(page);
+		return VM_FAULT_SIGBUS;
+	}
+
+	vmf->page = page;
+	return VM_FAULT_LOCKED;
+}
+
+static int copy_user_bh(struct page *to, struct buffer_head *bh,
+			unsigned blkbits, unsigned long vaddr)
+{
+	void *vfrom, *vto;
+	if (dax_get_addr(bh, &vfrom, blkbits) < 0)
+		return -EIO;
+	vto = kmap_atomic(to);
+	copy_user_page(vto, vfrom, vaddr, to);
+	kunmap_atomic(vto);
+	return 0;
+}
+
+static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
+			struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+	struct address_space *mapping = inode->i_mapping;
+	sector_t sector = bh->b_blocknr << (inode->i_blkbits - 9);
+	unsigned long vaddr = (unsigned long)vmf->virtual_address;
+	void *addr;
+	unsigned long pfn;
+	pgoff_t size;
+	int error;
+
+	mutex_lock(&mapping->i_mmap_mutex);
+
+	/*
+	 * Check truncate didn't happen while we were allocating a block.
+	 * If it did, this block may or may not be still allocated to the
+	 * file.  We can't tell the filesystem to free it because we can't
+	 * take i_mutex here.  In the worst case, the file still has blocks
+	 * allocated past the end of the file.
+	 */
+	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
+	if (unlikely(vmf->pgoff >= size)) {
+		error = -EIO;
+		goto out;
+	}
+
+	error = bdev_direct_access(bh->b_bdev, sector, &addr, &pfn, bh->b_size);
+	if (error < 0)
+		goto out;
+	if (error < PAGE_SIZE) {
+		error = -EIO;
+		goto out;
+	}
+
+	if (buffer_unwritten(bh) || buffer_new(bh))
+		clear_page(addr);
+
+	error = vm_insert_mixed(vma, vaddr, pfn);
+
+ out:
+	mutex_unlock(&mapping->i_mmap_mutex);
+
+	if (bh->b_end_io)
+		bh->b_end_io(bh, 1);
+
+	return error;
+}
+
+static int do_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
+			get_block_t get_block)
+{
+	struct file *file = vma->vm_file;
+	struct address_space *mapping = file->f_mapping;
+	struct inode *inode = mapping->host;
+	struct page *page;
+	struct buffer_head bh;
+	unsigned long vaddr = (unsigned long)vmf->virtual_address;
+	unsigned blkbits = inode->i_blkbits;
+	sector_t block;
+	pgoff_t size;
+	int error;
+	int major = 0;
+
+	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
+	if (vmf->pgoff >= size)
+		return VM_FAULT_SIGBUS;
+
+	memset(&bh, 0, sizeof(bh));
+	block = (sector_t)vmf->pgoff << (PAGE_SHIFT - blkbits);
+	bh.b_size = PAGE_SIZE;
+
+ repeat:
+	page = find_get_page(mapping, vmf->pgoff);
+	if (page) {
+		if (!lock_page_or_retry(page, vma->vm_mm, vmf->flags)) {
+			page_cache_release(page);
+			return VM_FAULT_RETRY;
+		}
+		if (unlikely(page->mapping != mapping)) {
+			unlock_page(page);
+			page_cache_release(page);
+			goto repeat;
+		}
+		size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
+		if (unlikely(vmf->pgoff >= size)) {
+			error = -EIO;
+			goto unlock_page;
+		}
+	}
+
+	error = get_block(inode, block, &bh, 0);
+	if (!error && (bh.b_size < PAGE_SIZE))
+		error = -EIO;
+	if (error)
+		goto unlock_page;
+
+	if (!buffer_mapped(&bh) && !buffer_unwritten(&bh) && !vmf->cow_page) {
+		if (vmf->flags & FAULT_FLAG_WRITE) {
+			error = get_block(inode, block, &bh, 1);
+			count_vm_event(PGMAJFAULT);
+			mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
+			major = VM_FAULT_MAJOR;
+			if (!error && (bh.b_size < PAGE_SIZE))
+				error = -EIO;
+			if (error)
+				goto unlock_page;
+		} else {
+			return dax_load_hole(mapping, page, vmf);
+		}
+	}
+
+	if (vmf->cow_page) {
+		struct page *new_page = vmf->cow_page;
+		if (buffer_written(&bh))
+			error = copy_user_bh(new_page, &bh, blkbits, vaddr);
+		else
+			clear_user_highpage(new_page, vaddr);
+		if (error)
+			goto unlock_page;
+		vmf->page = page;
+		if (!page) {
+			mutex_lock(&mapping->i_mmap_mutex);
+			/* Check we didn't race with truncate */
+			size = (i_size_read(inode) + PAGE_SIZE - 1) >>
+								PAGE_SHIFT;
+			if (vmf->pgoff >= size) {
+				mutex_unlock(&mapping->i_mmap_mutex);
+				error = -EIO;
+				goto out;
+			}
+		}
+		return VM_FAULT_LOCKED;
+	}
+
+	/* Check we didn't race with a read fault installing a new page */
+	if (!page && major)
+		page = find_lock_page(mapping, vmf->pgoff);
+
+	if (page) {
+		unmap_mapping_range(mapping, vmf->pgoff << PAGE_SHIFT,
+							PAGE_CACHE_SIZE, 0);
+		delete_from_page_cache(page);
+		unlock_page(page);
+		page_cache_release(page);
+	}
+
+	error = dax_insert_mapping(inode, &bh, vma, vmf);
+
+ out:
+	if (error == -ENOMEM)
+		return VM_FAULT_OOM | major;
+	/* -EBUSY is fine, somebody else faulted on the same PTE */
+	if ((error < 0) && (error != -EBUSY))
+		return VM_FAULT_SIGBUS | major;
+	return VM_FAULT_NOPAGE | major;
+
+ unlock_page:
+	if (page) {
+		unlock_page(page);
+		page_cache_release(page);
+	}
+	goto out;
+}
+
+/**
+ * dax_fault - handle a page fault on a DAX file
+ * @vma: The virtual memory area where the fault occurred
+ * @vmf: The description of the fault
+ * @get_block: The filesystem method used to translate file offsets to blocks
+ *
+ * When a page fault occurs, filesystems may call this helper in their
+ * fault handler for DAX files.
+ */
+int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
+			get_block_t get_block)
+{
+	int result;
+	struct super_block *sb = file_inode(vma->vm_file)->i_sb;
+
+	if (vmf->flags & FAULT_FLAG_WRITE) {
+		sb_start_pagefault(sb);
+		file_update_time(vma->vm_file);
+	}
+	result = do_dax_fault(vma, vmf, get_block);
+	if (vmf->flags & FAULT_FLAG_WRITE)
+		sb_end_pagefault(sb);
+
+	return result;
+}
+EXPORT_SYMBOL_GPL(dax_fault);
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index a247123..da8dc64 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -25,6 +25,37 @@
 #include "xattr.h"
 #include "acl.h"
 
+#ifdef CONFIG_EXT2_FS_XIP
+static int ext2_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+	return dax_fault(vma, vmf, ext2_get_block);
+}
+
+static int ext2_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+	return dax_mkwrite(vma, vmf, ext2_get_block);
+}
+
+static const struct vm_operations_struct ext2_dax_vm_ops = {
+	.fault		= ext2_dax_fault,
+	.page_mkwrite	= ext2_dax_mkwrite,
+	.remap_pages	= generic_file_remap_pages,
+};
+
+static int ext2_file_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	if (!IS_DAX(file_inode(file)))
+		return generic_file_mmap(file, vma);
+
+	file_accessed(file);
+	vma->vm_ops = &ext2_dax_vm_ops;
+	vma->vm_flags |= VM_MIXEDMAP;
+	return 0;
+}
+#else
+#define ext2_file_mmap	generic_file_mmap
+#endif
+
 /*
  * Called when filp is released. This happens when all file descriptors
  * for a single struct file are closed. Note that different open() calls
@@ -70,7 +101,7 @@ const struct file_operations ext2_file_operations = {
 #ifdef CONFIG_COMPAT
 	.compat_ioctl	= ext2_compat_ioctl,
 #endif
-	.mmap		= generic_file_mmap,
+	.mmap		= ext2_file_mmap,
 	.open		= dquot_file_open,
 	.release	= ext2_release_file,
 	.fsync		= ext2_fsync,
@@ -89,7 +120,7 @@ const struct file_operations ext2_xip_file_operations = {
 #ifdef CONFIG_COMPAT
 	.compat_ioctl	= ext2_compat_ioctl,
 #endif
-	.mmap		= xip_file_mmap,
+	.mmap		= ext2_file_mmap,
 	.open		= dquot_file_open,
 	.release	= ext2_release_file,
 	.fsync		= ext2_fsync,
diff --git a/include/linux/fs.h b/include/linux/fs.h
index aeff5dd..84ef250 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -49,6 +49,7 @@ struct swap_info_struct;
 struct seq_file;
 struct workqueue_struct;
 struct iov_iter;
+struct vm_fault;
 
 extern void __init inode_init(void);
 extern void __init inode_init_early(void);
@@ -2475,9 +2476,10 @@ extern int nonseekable_open(struct inode * inode, struct file * filp);
 ssize_t dax_do_io(int rw, struct kiocb *, struct inode *, struct iov_iter *,
 		loff_t, get_block_t, dio_iodone_t, int flags);
 int dax_clear_blocks(struct inode *, sector_t block, long size);
+int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t);
+#define dax_mkwrite(vma, vmf, gb)	dax_fault(vma, vmf, gb)
 
 #ifdef CONFIG_FS_XIP
-extern int xip_file_mmap(struct file * file, struct vm_area_struct * vma);
 extern int xip_truncate_page(struct address_space *mapping, loff_t from);
 #else
 static inline int xip_truncate_page(struct address_space *mapping, loff_t from)
diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
index f7c37a1..9dd45f3 100644
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -22,212 +22,6 @@
 #include <asm/io.h>
 
 /*
- * We do use our own empty page to avoid interference with other users
- * of ZERO_PAGE(), such as /dev/zero
- */
-static DEFINE_MUTEX(xip_sparse_mutex);
-static seqcount_t xip_sparse_seq = SEQCNT_ZERO(xip_sparse_seq);
-static struct page *__xip_sparse_page;
-
-/* called under xip_sparse_mutex */
-static struct page *xip_sparse_page(void)
-{
-	if (!__xip_sparse_page) {
-		struct page *page = alloc_page(GFP_HIGHUSER | __GFP_ZERO);
-
-		if (page)
-			__xip_sparse_page = page;
-	}
-	return __xip_sparse_page;
-}
-
-/*
- * __xip_unmap is invoked from xip_unmap and
- * xip_write
- *
- * This function walks all vmas of the address_space and unmaps the
- * __xip_sparse_page when found at pgoff.
- */
-static void
-__xip_unmap (struct address_space * mapping,
-		     unsigned long pgoff)
-{
-	struct vm_area_struct *vma;
-	struct mm_struct *mm;
-	unsigned long address;
-	pte_t *pte;
-	pte_t pteval;
-	spinlock_t *ptl;
-	struct page *page;
-	unsigned count;
-	int locked = 0;
-
-	count = read_seqcount_begin(&xip_sparse_seq);
-
-	page = __xip_sparse_page;
-	if (!page)
-		return;
-
-retry:
-	mutex_lock(&mapping->i_mmap_mutex);
-	vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
-		mm = vma->vm_mm;
-		address = vma->vm_start +
-			((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
-		BUG_ON(address < vma->vm_start || address >= vma->vm_end);
-		pte = page_check_address(page, mm, address, &ptl, 1);
-		if (pte) {
-			/* Nuke the page table entry. */
-			flush_cache_page(vma, address, pte_pfn(*pte));
-			pteval = ptep_clear_flush(vma, address, pte);
-			page_remove_rmap(page);
-			dec_mm_counter(mm, MM_FILEPAGES);
-			BUG_ON(pte_dirty(pteval));
-			pte_unmap_unlock(pte, ptl);
-			/* must invalidate_page _before_ freeing the page */
-			mmu_notifier_invalidate_page(mm, address);
-			page_cache_release(page);
-		}
-	}
-	mutex_unlock(&mapping->i_mmap_mutex);
-
-	if (locked) {
-		mutex_unlock(&xip_sparse_mutex);
-	} else if (read_seqcount_retry(&xip_sparse_seq, count)) {
-		mutex_lock(&xip_sparse_mutex);
-		locked = 1;
-		goto retry;
-	}
-}
-
-/*
- * xip_fault() is invoked via the vma operations vector for a
- * mapped memory region to read in file data during a page fault.
- *
- * This function is derived from filemap_fault, but used for execute in place
- */
-static int xip_file_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
-{
-	struct file *file = vma->vm_file;
-	struct address_space *mapping = file->f_mapping;
-	struct inode *inode = mapping->host;
-	pgoff_t size;
-	void *xip_mem;
-	unsigned long xip_pfn;
-	struct page *page;
-	int error;
-
-	/* XXX: are VM_FAULT_ codes OK? */
-again:
-	size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
-	if (vmf->pgoff >= size)
-		return VM_FAULT_SIGBUS;
-
-	error = mapping->a_ops->get_xip_mem(mapping, vmf->pgoff, 0,
-						&xip_mem, &xip_pfn);
-	if (likely(!error))
-		goto found;
-	if (error != -ENODATA)
-		return VM_FAULT_OOM;
-
-	/* sparse block */
-	if ((vma->vm_flags & (VM_WRITE | VM_MAYWRITE)) &&
-	    (vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) &&
-	    (!(mapping->host->i_sb->s_flags & MS_RDONLY))) {
-		int err;
-
-		/* maybe shared writable, allocate new block */
-		mutex_lock(&xip_sparse_mutex);
-		error = mapping->a_ops->get_xip_mem(mapping, vmf->pgoff, 1,
-							&xip_mem, &xip_pfn);
-		mutex_unlock(&xip_sparse_mutex);
-		if (error)
-			return VM_FAULT_SIGBUS;
-		/* unmap sparse mappings at pgoff from all other vmas */
-		__xip_unmap(mapping, vmf->pgoff);
-
-found:
-		/* We must recheck i_size under i_mmap_mutex */
-		mutex_lock(&mapping->i_mmap_mutex);
-		size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >>
-							PAGE_CACHE_SHIFT;
-		if (unlikely(vmf->pgoff >= size)) {
-			mutex_unlock(&mapping->i_mmap_mutex);
-			return VM_FAULT_SIGBUS;
-		}
-		err = vm_insert_mixed(vma, (unsigned long)vmf->virtual_address,
-							xip_pfn);
-		mutex_unlock(&mapping->i_mmap_mutex);
-		if (err == -ENOMEM)
-			return VM_FAULT_OOM;
-		/*
-		 * err == -EBUSY is fine, we've raced against another thread
-		 * that faulted-in the same page
-		 */
-		if (err != -EBUSY)
-			BUG_ON(err);
-		return VM_FAULT_NOPAGE;
-	} else {
-		int err, ret = VM_FAULT_OOM;
-
-		mutex_lock(&xip_sparse_mutex);
-		write_seqcount_begin(&xip_sparse_seq);
-		error = mapping->a_ops->get_xip_mem(mapping, vmf->pgoff, 0,
-							&xip_mem, &xip_pfn);
-		if (unlikely(!error)) {
-			write_seqcount_end(&xip_sparse_seq);
-			mutex_unlock(&xip_sparse_mutex);
-			goto again;
-		}
-		if (error != -ENODATA)
-			goto out;
-
-		/* We must recheck i_size under i_mmap_mutex */
-		mutex_lock(&mapping->i_mmap_mutex);
-		size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >>
-							PAGE_CACHE_SHIFT;
-		if (unlikely(vmf->pgoff >= size)) {
-			ret = VM_FAULT_SIGBUS;
-			goto unlock;
-		}
-		/* not shared and writable, use xip_sparse_page() */
-		page = xip_sparse_page();
-		if (!page)
-			goto unlock;
-		err = vm_insert_page(vma, (unsigned long)vmf->virtual_address,
-							page);
-		if (err == -ENOMEM)
-			goto unlock;
-
-		ret = VM_FAULT_NOPAGE;
-unlock:
-		mutex_unlock(&mapping->i_mmap_mutex);
-out:
-		write_seqcount_end(&xip_sparse_seq);
-		mutex_unlock(&xip_sparse_mutex);
-
-		return ret;
-	}
-}
-
-static const struct vm_operations_struct xip_file_vm_ops = {
-	.fault	= xip_file_fault,
-	.page_mkwrite	= filemap_page_mkwrite,
-	.remap_pages = generic_file_remap_pages,
-};
-
-int xip_file_mmap(struct file * file, struct vm_area_struct * vma)
-{
-	BUG_ON(!file->f_mapping->a_ops->get_xip_mem);
-
-	file_accessed(file);
-	vma->vm_ops = &xip_file_vm_ops;
-	vma->vm_flags |= VM_MIXEDMAP;
-	return 0;
-}
-EXPORT_SYMBOL_GPL(xip_file_mmap);
-
-/*
  * truncate a page used for execute in place
  * functionality is analog to block_truncate_page but does use get_xip_mem
  * to get the page instead of page cache
-- 
2.1.1


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH v12 09/20] dax,ext2: Replace xip_truncate_page with dax_truncate_page
  2014-10-24 21:20 [PATCH v12 00/20] DAX: Page cache bypass for filesystems on memory storage Matthew Wilcox
                   ` (7 preceding siblings ...)
  2014-10-24 21:20 ` [PATCH v12 08/20] dax,ext2: Replace the XIP page fault handler with the DAX page fault handler Matthew Wilcox
@ 2014-10-24 21:20 ` Matthew Wilcox
  2015-01-12 23:09   ` Andrew Morton
  2014-10-24 21:20 ` [PATCH v12 10/20] dax: Replace XIP documentation with DAX documentation Matthew Wilcox
                   ` (12 subsequent siblings)
  21 siblings, 1 reply; 60+ messages in thread
From: Matthew Wilcox @ 2014-10-24 21:20 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-mm
  Cc: Matthew Wilcox, willy, Andrew Morton

It takes a get_block parameter just like nobh_truncate_page() and
block_truncate_page()

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
---
 fs/dax.c           | 44 ++++++++++++++++++++++++++++++++++++++++++++
 fs/ext2/inode.c    |  2 +-
 include/linux/fs.h | 10 +---------
 mm/filemap_xip.c   | 40 ----------------------------------------
 4 files changed, 46 insertions(+), 50 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 19b665e..e838ec8 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -458,3 +458,47 @@ int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 	return result;
 }
 EXPORT_SYMBOL_GPL(dax_fault);
+
+/**
+ * dax_truncate_page - handle a partial page being truncated in a DAX file
+ * @inode: The file being truncated
+ * @from: The file offset that is being truncated to
+ * @get_block: The filesystem method used to translate file offsets to blocks
+ *
+ * Similar to block_truncate_page(), this function can be called by a
+ * filesystem when it is truncating an DAX file to handle the partial page.
+ *
+ * We work in terms of PAGE_CACHE_SIZE here for commonality with
+ * block_truncate_page(), but we could go down to PAGE_SIZE if the filesystem
+ * took care of disposing of the unnecessary blocks.  Even if the filesystem
+ * block size is smaller than PAGE_SIZE, we have to zero the rest of the page
+ * since the file might be mmaped.
+ */
+int dax_truncate_page(struct inode *inode, loff_t from, get_block_t get_block)
+{
+	struct buffer_head bh;
+	pgoff_t index = from >> PAGE_CACHE_SHIFT;
+	unsigned offset = from & (PAGE_CACHE_SIZE-1);
+	unsigned length = PAGE_CACHE_ALIGN(from) - from;
+	int err;
+
+	/* Block boundary? Nothing to do */
+	if (!length)
+		return 0;
+
+	memset(&bh, 0, sizeof(bh));
+	bh.b_size = PAGE_CACHE_SIZE;
+	err = get_block(inode, index, &bh, 0);
+	if (err < 0)
+		return err;
+	if (buffer_written(&bh)) {
+		void *addr;
+		err = dax_get_addr(&bh, &addr, inode->i_blkbits);
+		if (err < 0)
+			return err;
+		memset(addr + offset, 0, length);
+	}
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(dax_truncate_page);
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 52978b8..5ac0a34 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -1210,7 +1210,7 @@ static int ext2_setsize(struct inode *inode, loff_t newsize)
 	inode_dio_wait(inode);
 
 	if (IS_DAX(inode))
-		error = xip_truncate_page(inode->i_mapping, newsize);
+		error = dax_truncate_page(inode, newsize, ext2_get_block);
 	else if (test_opt(inode->i_sb, NOBH))
 		error = nobh_truncate_page(inode->i_mapping,
 				newsize, ext2_get_block);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 84ef250..d3787b5 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2476,18 +2476,10 @@ extern int nonseekable_open(struct inode * inode, struct file * filp);
 ssize_t dax_do_io(int rw, struct kiocb *, struct inode *, struct iov_iter *,
 		loff_t, get_block_t, dio_iodone_t, int flags);
 int dax_clear_blocks(struct inode *, sector_t block, long size);
+int dax_truncate_page(struct inode *, loff_t from, get_block_t);
 int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t);
 #define dax_mkwrite(vma, vmf, gb)	dax_fault(vma, vmf, gb)
 
-#ifdef CONFIG_FS_XIP
-extern int xip_truncate_page(struct address_space *mapping, loff_t from);
-#else
-static inline int xip_truncate_page(struct address_space *mapping, loff_t from)
-{
-	return 0;
-}
-#endif
-
 #ifdef CONFIG_BLOCK
 typedef void (dio_submit_t)(int rw, struct bio *bio, struct inode *inode,
 			    loff_t file_offset);
diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
index 9dd45f3..6316578 100644
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -21,43 +21,3 @@
 #include <asm/tlbflush.h>
 #include <asm/io.h>
 
-/*
- * truncate a page used for execute in place
- * functionality is analog to block_truncate_page but does use get_xip_mem
- * to get the page instead of page cache
- */
-int
-xip_truncate_page(struct address_space *mapping, loff_t from)
-{
-	pgoff_t index = from >> PAGE_CACHE_SHIFT;
-	unsigned offset = from & (PAGE_CACHE_SIZE-1);
-	unsigned blocksize;
-	unsigned length;
-	void *xip_mem;
-	unsigned long xip_pfn;
-	int err;
-
-	BUG_ON(!mapping->a_ops->get_xip_mem);
-
-	blocksize = 1 << mapping->host->i_blkbits;
-	length = offset & (blocksize - 1);
-
-	/* Block boundary? Nothing to do */
-	if (!length)
-		return 0;
-
-	length = blocksize - length;
-
-	err = mapping->a_ops->get_xip_mem(mapping, index, 0,
-						&xip_mem, &xip_pfn);
-	if (unlikely(err)) {
-		if (err == -ENODATA)
-			/* Hole? No need to truncate */
-			return 0;
-		else
-			return err;
-	}
-	memset(xip_mem + offset, 0, length);
-	return 0;
-}
-EXPORT_SYMBOL_GPL(xip_truncate_page);
-- 
2.1.1


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH v12 10/20] dax: Replace XIP documentation with DAX documentation
  2014-10-24 21:20 [PATCH v12 00/20] DAX: Page cache bypass for filesystems on memory storage Matthew Wilcox
                   ` (8 preceding siblings ...)
  2014-10-24 21:20 ` [PATCH v12 09/20] dax,ext2: Replace xip_truncate_page with dax_truncate_page Matthew Wilcox
@ 2014-10-24 21:20 ` Matthew Wilcox
  2015-01-12 23:10   ` Andrew Morton
  2016-01-21 18:38   ` Jared Hulbert
  2014-10-24 21:20 ` [PATCH v12 11/20] vfs: Remove get_xip_mem Matthew Wilcox
                   ` (11 subsequent siblings)
  21 siblings, 2 replies; 60+ messages in thread
From: Matthew Wilcox @ 2014-10-24 21:20 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-mm; +Cc: Matthew Wilcox, Andrew Morton

From: Matthew Wilcox <willy@linux.intel.com>

Based on the original XIP documentation, this documents the current
state of affairs, and includes instructions on how users can enable DAX
if their devices and kernel support it.

Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
---
 Documentation/filesystems/00-INDEX |  5 ++-
 Documentation/filesystems/dax.txt  | 89 ++++++++++++++++++++++++++++++++++++++
 Documentation/filesystems/xip.txt  | 71 ------------------------------
 3 files changed, 92 insertions(+), 73 deletions(-)
 create mode 100644 Documentation/filesystems/dax.txt
 delete mode 100644 Documentation/filesystems/xip.txt

diff --git a/Documentation/filesystems/00-INDEX b/Documentation/filesystems/00-INDEX
index ac28149..9922939 100644
--- a/Documentation/filesystems/00-INDEX
+++ b/Documentation/filesystems/00-INDEX
@@ -34,6 +34,9 @@ configfs/
 	- directory containing configfs documentation and example code.
 cramfs.txt
 	- info on the cram filesystem for small storage (ROMs etc).
+dax.txt
+	- info on avoiding the page cache for files stored on CPU-addressable
+	  storage devices.
 debugfs.txt
 	- info on the debugfs filesystem.
 devpts.txt
@@ -154,5 +157,3 @@ xfs-self-describing-metadata.txt
 	- info on XFS Self Describing Metadata.
 xfs.txt
 	- info and mount options for the XFS filesystem.
-xip.txt
-	- info on execute-in-place for file mappings.
diff --git a/Documentation/filesystems/dax.txt b/Documentation/filesystems/dax.txt
new file mode 100644
index 0000000..635adaa
--- /dev/null
+++ b/Documentation/filesystems/dax.txt
@@ -0,0 +1,89 @@
+Direct Access for files
+-----------------------
+
+Motivation
+----------
+
+The page cache is usually used to buffer reads and writes to files.
+It is also used to provide the pages which are mapped into userspace
+by a call to mmap.
+
+For block devices that are memory-like, the page cache pages would be
+unnecessary copies of the original storage.  The DAX code removes the
+extra copy by performing reads and writes directly to the storage device.
+For file mappings, the storage device is mapped directly into userspace.
+
+
+Usage
+-----
+
+If you have a block device which supports DAX, you can make a filesystem
+on it as usual.  When mounting it, use the -o dax option manually
+or add 'dax' to the options in /etc/fstab.
+
+
+Implementation Tips for Block Driver Writers
+--------------------------------------------
+
+To support DAX in your block driver, implement the 'direct_access'
+block device operation.  It is used to translate the sector number
+(expressed in units of 512-byte sectors) to a page frame number (pfn)
+that identifies the physical page for the memory.  It also returns a
+kernel virtual address that can be used to access the memory.
+
+The direct_access method takes a 'size' parameter that indicates the
+number of bytes being requested.  The function should return the number
+of bytes that can be contiguously accessed at that offset.  It may also
+return a negative errno if an error occurs.
+
+In order to support this method, the storage must be byte-accessible by
+the CPU at all times.  If your device uses paging techniques to expose
+a large amount of memory through a smaller window, then you cannot
+implement direct_access.  Equally, if your device can occasionally
+stall the CPU for an extended period, you should also not attempt to
+implement direct_access.
+
+These block devices may be used for inspiration:
+- axonram: Axon DDR2 device driver
+- brd: RAM backed block device driver
+- dcssblk: s390 dcss block device driver
+
+
+Implementation Tips for Filesystem Writers
+------------------------------------------
+
+Filesystem support consists of
+- adding support to mark inodes as being DAX by setting the S_DAX flag in
+  i_flags
+- implementing the direct_IO address space operation, and calling
+  dax_do_io() instead of blockdev_direct_IO() if S_DAX is set
+- implementing an mmap file operation for DAX files which sets the
+  VM_MIXEDMAP flag on the VMA, and setting the vm_ops to include handlers
+  for fault and page_mkwrite (which should probably call dax_fault() and
+  dax_mkwrite(), passing the appropriate get_block() callback)
+- calling dax_truncate_page() instead of block_truncate_page() for DAX files
+- ensuring that there is sufficient locking between reads, writes,
+  truncates and page faults
+
+The get_block() callback passed to the DAX functions may return
+uninitialised extents.  If it does, it must ensure that simultaneous
+calls to get_block() (for example by a page-fault racing with a read()
+or a write()) work correctly.
+
+These filesystems may be used for inspiration:
+- ext2: the second extended filesystem, see Documentation/filesystems/ext2.txt
+
+
+Shortcomings
+------------
+
+Even if the kernel or its modules are stored on a filesystem that supports
+DAX on a block device that supports DAX, they will still be copied into RAM.
+
+Calling get_user_pages() on a range of user memory that has been mmaped
+from a DAX file will fail as there are no 'struct page' to describe
+those pages.  This problem is being worked on.  That means that O_DIRECT
+reads/writes to those memory ranges from a non-DAX file will fail (note
+that O_DIRECT reads/writes _of a DAX file_ do work, it is the memory
+that is being accessed that is key here).  Other things that will not
+work include RDMA, sendfile() and splice().
diff --git a/Documentation/filesystems/xip.txt b/Documentation/filesystems/xip.txt
deleted file mode 100644
index b774729..0000000
--- a/Documentation/filesystems/xip.txt
+++ /dev/null
@@ -1,71 +0,0 @@
-Execute-in-place for file mappings
-----------------------------------
-
-Motivation
-----------
-File mappings are performed by mapping page cache pages to userspace. In
-addition, read&write type file operations also transfer data from/to the page
-cache.
-
-For memory backed storage devices that use the block device interface, the page
-cache pages are in fact copies of the original storage. Various approaches
-exist to work around the need for an extra copy. The ramdisk driver for example
-does read the data into the page cache, keeps a reference, and discards the
-original data behind later on.
-
-Execute-in-place solves this issue the other way around: instead of keeping
-data in the page cache, the need to have a page cache copy is eliminated
-completely. With execute-in-place, read&write type operations are performed
-directly from/to the memory backed storage device. For file mappings, the
-storage device itself is mapped directly into userspace.
-
-This implementation was initially written for shared memory segments between
-different virtual machines on s390 hardware to allow multiple machines to
-share the same binaries and libraries.
-
-Implementation
---------------
-Execute-in-place is implemented in three steps: block device operation,
-address space operation, and file operations.
-
-A block device operation named direct_access is used to translate the
-block device sector number to a page frame number (pfn) that identifies
-the physical page for the memory.  It also returns a kernel virtual
-address that can be used to access the memory.
-
-The direct_access method takes a 'size' parameter that indicates the
-number of bytes being requested.  The function should return the number
-of bytes that can be contiguously accessed at that offset.  It may also
-return a negative errno if an error occurs.
-
-The block device operation is optional, these block devices support it as of
-today:
-- dcssblk: s390 dcss block device driver
-
-An address space operation named get_xip_mem is used to retrieve references
-to a page frame number and a kernel address. To obtain these values a reference
-to an address_space is provided. This function assigns values to the kmem and
-pfn parameters. The third argument indicates whether the function should allocate
-blocks if needed.
-
-This address space operation is mutually exclusive with readpage&writepage that
-do page cache read/write operations.
-The following filesystems support it as of today:
-- ext2: the second extended filesystem, see Documentation/filesystems/ext2.txt
-
-A set of file operations that do utilize get_xip_page can be found in
-mm/filemap_xip.c . The following file operation implementations are provided:
-- aio_read/aio_write
-- readv/writev
-- sendfile
-
-The generic file operations do_sync_read/do_sync_write can be used to implement
-classic synchronous IO calls.
-
-Shortcomings
-------------
-This implementation is limited to storage devices that are cpu addressable at
-all times (no highmem or such). It works well on rom/ram, but enhancements are
-needed to make it work with flash in read+write mode.
-Putting the Linux kernel and/or its modules on a xip filesystem does not mean
-they are not copied.
-- 
2.1.1


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH v12 11/20] vfs: Remove get_xip_mem
  2014-10-24 21:20 [PATCH v12 00/20] DAX: Page cache bypass for filesystems on memory storage Matthew Wilcox
                   ` (9 preceding siblings ...)
  2014-10-24 21:20 ` [PATCH v12 10/20] dax: Replace XIP documentation with DAX documentation Matthew Wilcox
@ 2014-10-24 21:20 ` Matthew Wilcox
  2014-10-24 21:20 ` [PATCH v12 12/20] ext2: Remove ext2_xip_verify_sb() Matthew Wilcox
                   ` (10 subsequent siblings)
  21 siblings, 0 replies; 60+ messages in thread
From: Matthew Wilcox @ 2014-10-24 21:20 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-mm
  Cc: Matthew Wilcox, willy, Andrew Morton

All callers of get_xip_mem() are now gone.  Remove checks for it,
initialisers of it, documentation of it and the only implementation
of it.  Also remove mm/filemap_xip.c as it is now empty.  Also remove
documentation of the long-gone get_xip_page().

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
---
 Documentation/filesystems/Locking |  3 ---
 Documentation/filesystems/vfs.txt |  7 ------
 fs/exofs/inode.c                  |  1 -
 fs/ext2/inode.c                   |  1 -
 fs/ext2/xip.c                     | 45 ---------------------------------------
 fs/ext2/xip.h                     |  3 ---
 fs/open.c                         |  5 +----
 include/linux/fs.h                |  2 --
 include/linux/rmap.h              |  2 +-
 mm/Makefile                       |  1 -
 mm/fadvise.c                      |  6 ++++--
 mm/filemap_xip.c                  | 23 --------------------
 mm/madvise.c                      |  2 +-
 13 files changed, 7 insertions(+), 94 deletions(-)
 delete mode 100644 mm/filemap_xip.c

diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
index 94d93b1..a75d5b5 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -197,8 +197,6 @@ prototypes:
 	int (*releasepage) (struct page *, int);
 	void (*freepage)(struct page *);
 	int (*direct_IO)(int, struct kiocb *, struct iov_iter *iter, loff_t offset);
-	int (*get_xip_mem)(struct address_space *, pgoff_t, int, void **,
-				unsigned long *);
 	int (*migratepage)(struct address_space *, struct page *, struct page *);
 	int (*launder_page)(struct page *);
 	int (*is_partially_uptodate)(struct page *, unsigned long, unsigned long);
@@ -223,7 +221,6 @@ invalidatepage:		yes
 releasepage:		yes
 freepage:		yes
 direct_IO:
-get_xip_mem:					maybe
 migratepage:		yes (both)
 launder_page:		yes
 is_partially_uptodate:	yes
diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index fceff7c..9f2e7a9 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -590,8 +590,6 @@ struct address_space_operations {
 	int (*releasepage) (struct page *, int);
 	void (*freepage)(struct page *);
 	ssize_t (*direct_IO)(int, struct kiocb *, struct iov_iter *iter, loff_t offset);
-	struct page* (*get_xip_page)(struct address_space *, sector_t,
-			int);
 	/* migrate the contents of a page to the specified target */
 	int (*migratepage) (struct page *, struct page *);
 	int (*launder_page) (struct page *);
@@ -741,11 +739,6 @@ struct address_space_operations {
         and transfer data directly between the storage and the
         application's address space.
 
-  get_xip_page: called by the VM to translate a block number to a page.
-	The page is valid until the corresponding filesystem is unmounted.
-	Filesystems that want to use execute-in-place (XIP) need to implement
-	it.  An example implementation can be found in fs/ext2/xip.c.
-
   migrate_page:  This is used to compact the physical memory usage.
         If the VM wants to relocate a page (maybe off a memory card
         that is signalling imminent failure) it will pass a new page
diff --git a/fs/exofs/inode.c b/fs/exofs/inode.c
index 3f9cafd..c408a53 100644
--- a/fs/exofs/inode.c
+++ b/fs/exofs/inode.c
@@ -985,7 +985,6 @@ const struct address_space_operations exofs_aops = {
 	.direct_IO	= exofs_direct_IO,
 
 	/* With these NULL has special meaning or default is not exported */
-	.get_xip_mem	= NULL,
 	.migratepage	= NULL,
 	.launder_page	= NULL,
 	.is_partially_uptodate = NULL,
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 5ac0a34..59d6c7d 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -894,7 +894,6 @@ const struct address_space_operations ext2_aops = {
 
 const struct address_space_operations ext2_aops_xip = {
 	.bmap			= ext2_bmap,
-	.get_xip_mem		= ext2_get_xip_mem,
 	.direct_IO		= ext2_direct_IO,
 };
 
diff --git a/fs/ext2/xip.c b/fs/ext2/xip.c
index 8cfca3a..132d4da 100644
--- a/fs/ext2/xip.c
+++ b/fs/ext2/xip.c
@@ -13,35 +13,6 @@
 #include "ext2.h"
 #include "xip.h"
 
-static inline long __inode_direct_access(struct inode *inode, sector_t block,
-				void **kaddr, unsigned long *pfn, long size)
-{
-	struct block_device *bdev = inode->i_sb->s_bdev;
-	sector_t sector = block * (PAGE_SIZE / 512);
-	return bdev_direct_access(bdev, sector, kaddr, pfn, size);
-}
-
-static inline int
-__ext2_get_block(struct inode *inode, pgoff_t pgoff, int create,
-		   sector_t *result)
-{
-	struct buffer_head tmp;
-	int rc;
-
-	memset(&tmp, 0, sizeof(struct buffer_head));
-	tmp.b_size = 1 << inode->i_blkbits;
-	rc = ext2_get_block(inode, pgoff, &tmp, create);
-	*result = tmp.b_blocknr;
-
-	/* did we get a sparse block (hole in the file)? */
-	if (!tmp.b_blocknr && !rc) {
-		BUG_ON(create);
-		rc = -ENODATA;
-	}
-
-	return rc;
-}
-
 void ext2_xip_verify_sb(struct super_block *sb)
 {
 	struct ext2_sb_info *sbi = EXT2_SB(sb);
@@ -54,19 +25,3 @@ void ext2_xip_verify_sb(struct super_block *sb)
 			     "not supported by bdev");
 	}
 }
-
-int ext2_get_xip_mem(struct address_space *mapping, pgoff_t pgoff, int create,
-				void **kmem, unsigned long *pfn)
-{
-	long rc;
-	sector_t block;
-
-	/* first, retrieve the sector number */
-	rc = __ext2_get_block(mapping->host, pgoff, create, &block);
-	if (rc)
-		return rc;
-
-	/* retrieve address of the target data */
-	rc = __inode_direct_access(mapping->host, block, kmem, pfn, PAGE_SIZE);
-	return (rc < 0) ? rc : 0;
-}
diff --git a/fs/ext2/xip.h b/fs/ext2/xip.h
index b2592f2..e7b9f0a 100644
--- a/fs/ext2/xip.h
+++ b/fs/ext2/xip.h
@@ -12,10 +12,7 @@ static inline int ext2_use_xip (struct super_block *sb)
 	struct ext2_sb_info *sbi = EXT2_SB(sb);
 	return (sbi->s_mount_opt & EXT2_MOUNT_XIP);
 }
-int ext2_get_xip_mem(struct address_space *, pgoff_t, int,
-				void **, unsigned long *);
 #else
 #define ext2_xip_verify_sb(sb)			do { } while (0)
 #define ext2_use_xip(sb)			0
-#define ext2_get_xip_mem			NULL
 #endif
diff --git a/fs/open.c b/fs/open.c
index d6fd3ac..ca68e47 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -655,11 +655,8 @@ int open_check_o_direct(struct file *f)
 {
 	/* NB: we're sure to have correct a_ops only after f_op->open */
 	if (f->f_flags & O_DIRECT) {
-		if (!f->f_mapping->a_ops ||
-		    ((!f->f_mapping->a_ops->direct_IO) &&
-		    (!f->f_mapping->a_ops->get_xip_mem))) {
+		if (!f->f_mapping->a_ops || !f->f_mapping->a_ops->direct_IO)
 			return -EINVAL;
-		}
 	}
 	return 0;
 }
diff --git a/include/linux/fs.h b/include/linux/fs.h
index d3787b5..527684e 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -347,8 +347,6 @@ struct address_space_operations {
 	int (*releasepage) (struct page *, gfp_t);
 	void (*freepage)(struct page *);
 	ssize_t (*direct_IO)(int, struct kiocb *, struct iov_iter *iter, loff_t offset);
-	int (*get_xip_mem)(struct address_space *, pgoff_t, int,
-						void **, unsigned long *);
 	/*
 	 * migrate the contents of a page to the specified target. If
 	 * migrate_mode is MIGRATE_ASYNC, it must not block.
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index c0c2bce..9fe2ec2 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -188,7 +188,7 @@ int page_referenced(struct page *, int is_locked,
 int try_to_unmap(struct page *, enum ttu_flags flags);
 
 /*
- * Called from mm/filemap_xip.c to unmap empty zero page
+ * Used by uprobes to replace a userspace page safely
  */
 pte_t *__page_check_address(struct page *, struct mm_struct *,
 				unsigned long, spinlock_t **, int);
diff --git a/mm/Makefile b/mm/Makefile
index 8405eb0..77f9638 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -51,7 +51,6 @@ obj-$(CONFIG_SLUB) += slub.o
 obj-$(CONFIG_KMEMCHECK) += kmemcheck.o
 obj-$(CONFIG_FAILSLAB) += failslab.o
 obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o
-obj-$(CONFIG_FS_XIP) += filemap_xip.o
 obj-$(CONFIG_MIGRATION) += migrate.o
 obj-$(CONFIG_QUICKLIST) += quicklist.o
 obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o
diff --git a/mm/fadvise.c b/mm/fadvise.c
index 3bcfd81..1f1925f 100644
--- a/mm/fadvise.c
+++ b/mm/fadvise.c
@@ -28,6 +28,7 @@
 SYSCALL_DEFINE4(fadvise64_64, int, fd, loff_t, offset, loff_t, len, int, advice)
 {
 	struct fd f = fdget(fd);
+	struct inode *inode;
 	struct address_space *mapping;
 	struct backing_dev_info *bdi;
 	loff_t endbyte;			/* inclusive */
@@ -39,7 +40,8 @@ SYSCALL_DEFINE4(fadvise64_64, int, fd, loff_t, offset, loff_t, len, int, advice)
 	if (!f.file)
 		return -EBADF;
 
-	if (S_ISFIFO(file_inode(f.file)->i_mode)) {
+	inode = file_inode(f.file);
+	if (S_ISFIFO(inode->i_mode)) {
 		ret = -ESPIPE;
 		goto out;
 	}
@@ -50,7 +52,7 @@ SYSCALL_DEFINE4(fadvise64_64, int, fd, loff_t, offset, loff_t, len, int, advice)
 		goto out;
 	}
 
-	if (mapping->a_ops->get_xip_mem) {
+	if (IS_DAX(inode)) {
 		switch (advice) {
 		case POSIX_FADV_NORMAL:
 		case POSIX_FADV_RANDOM:
diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
deleted file mode 100644
index 6316578..0000000
--- a/mm/filemap_xip.c
+++ /dev/null
@@ -1,23 +0,0 @@
-/*
- *	linux/mm/filemap_xip.c
- *
- * Copyright (C) 2005 IBM Corporation
- * Author: Carsten Otte <cotte@de.ibm.com>
- *
- * derived from linux/mm/filemap.c - Copyright (C) Linus Torvalds
- *
- */
-
-#include <linux/fs.h>
-#include <linux/pagemap.h>
-#include <linux/export.h>
-#include <linux/uio.h>
-#include <linux/rmap.h>
-#include <linux/mmu_notifier.h>
-#include <linux/sched.h>
-#include <linux/seqlock.h>
-#include <linux/mutex.h>
-#include <linux/gfp.h>
-#include <asm/tlbflush.h>
-#include <asm/io.h>
-
diff --git a/mm/madvise.c b/mm/madvise.c
index 0938b30..1611ebf 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -236,7 +236,7 @@ static long madvise_willneed(struct vm_area_struct *vma,
 	if (!file)
 		return -EBADF;
 
-	if (file->f_mapping->a_ops->get_xip_mem) {
+	if (IS_DAX(file_inode(file))) {
 		/* no bad return value, but ignore advice */
 		return 0;
 	}
-- 
2.1.1


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH v12 12/20] ext2: Remove ext2_xip_verify_sb()
  2014-10-24 21:20 [PATCH v12 00/20] DAX: Page cache bypass for filesystems on memory storage Matthew Wilcox
                   ` (10 preceding siblings ...)
  2014-10-24 21:20 ` [PATCH v12 11/20] vfs: Remove get_xip_mem Matthew Wilcox
@ 2014-10-24 21:20 ` Matthew Wilcox
  2014-10-24 21:20 ` [PATCH v12 13/20] ext2: Remove ext2_use_xip Matthew Wilcox
                   ` (9 subsequent siblings)
  21 siblings, 0 replies; 60+ messages in thread
From: Matthew Wilcox @ 2014-10-24 21:20 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-mm
  Cc: Matthew Wilcox, willy, Andrew Morton

Jan Kara pointed out that calling ext2_xip_verify_sb() in ext2_remount()
doesn't make sense, since changing the XIP option on remount isn't
allowed.  It also doesn't make sense to re-check whether blocksize is
supported since it can't change between mounts.

Replace the call to ext2_xip_verify_sb() in ext2_fill_super() with the
equivalent check and delete the definition.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
---
 fs/ext2/super.c | 33 ++++++++++++---------------------
 fs/ext2/xip.c   | 12 ------------
 fs/ext2/xip.h   |  2 --
 3 files changed, 12 insertions(+), 35 deletions(-)

diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index 170dc41..f975854 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -868,9 +868,6 @@ static int ext2_fill_super(struct super_block *sb, void *data, int silent)
 		((EXT2_SB(sb)->s_mount_opt & EXT2_MOUNT_POSIX_ACL) ?
 		 MS_POSIXACL : 0);
 
-	ext2_xip_verify_sb(sb); /* see if bdev supports xip, unset
-				    EXT2_MOUNT_XIP if not */
-
 	if (le32_to_cpu(es->s_rev_level) == EXT2_GOOD_OLD_REV &&
 	    (EXT2_HAS_COMPAT_FEATURE(sb, ~0U) ||
 	     EXT2_HAS_RO_COMPAT_FEATURE(sb, ~0U) ||
@@ -900,11 +897,17 @@ static int ext2_fill_super(struct super_block *sb, void *data, int silent)
 
 	blocksize = BLOCK_SIZE << le32_to_cpu(sbi->s_es->s_log_block_size);
 
-	if (ext2_use_xip(sb) && blocksize != PAGE_SIZE) {
-		if (!silent)
+	if (sbi->s_mount_opt & EXT2_MOUNT_XIP) {
+		if (blocksize != PAGE_SIZE) {
 			ext2_msg(sb, KERN_ERR,
-				"error: unsupported blocksize for xip");
-		goto failed_mount;
+					"error: unsupported blocksize for xip");
+			goto failed_mount;
+		}
+		if (!sb->s_bdev->bd_disk->fops->direct_access) {
+			ext2_msg(sb, KERN_ERR,
+					"error: device does not support xip");
+			goto failed_mount;
+		}
 	}
 
 	/* If the blocksize doesn't match, re-read the thing.. */
@@ -1249,7 +1252,6 @@ static int ext2_remount (struct super_block * sb, int * flags, char * data)
 {
 	struct ext2_sb_info * sbi = EXT2_SB(sb);
 	struct ext2_super_block * es;
-	unsigned long old_mount_opt = sbi->s_mount_opt;
 	struct ext2_mount_options old_opts;
 	unsigned long old_sb_flags;
 	int err;
@@ -1274,22 +1276,11 @@ static int ext2_remount (struct super_block * sb, int * flags, char * data)
 	sb->s_flags = (sb->s_flags & ~MS_POSIXACL) |
 		((sbi->s_mount_opt & EXT2_MOUNT_POSIX_ACL) ? MS_POSIXACL : 0);
 
-	ext2_xip_verify_sb(sb); /* see if bdev supports xip, unset
-				    EXT2_MOUNT_XIP if not */
-
-	if ((ext2_use_xip(sb)) && (sb->s_blocksize != PAGE_SIZE)) {
-		ext2_msg(sb, KERN_WARNING,
-			"warning: unsupported blocksize for xip");
-		err = -EINVAL;
-		goto restore_opts;
-	}
-
 	es = sbi->s_es;
-	if ((sbi->s_mount_opt ^ old_mount_opt) & EXT2_MOUNT_XIP) {
+	if ((sbi->s_mount_opt ^ old_opts.s_mount_opt) & EXT2_MOUNT_XIP) {
 		ext2_msg(sb, KERN_WARNING, "warning: refusing change of "
 			 "xip flag with busy inodes while remounting");
-		sbi->s_mount_opt &= ~EXT2_MOUNT_XIP;
-		sbi->s_mount_opt |= old_mount_opt & EXT2_MOUNT_XIP;
+		sbi->s_mount_opt ^= EXT2_MOUNT_XIP;
 	}
 	if ((*flags & MS_RDONLY) == (sb->s_flags & MS_RDONLY)) {
 		spin_unlock(&sbi->s_lock);
diff --git a/fs/ext2/xip.c b/fs/ext2/xip.c
index 132d4da..66ca113 100644
--- a/fs/ext2/xip.c
+++ b/fs/ext2/xip.c
@@ -13,15 +13,3 @@
 #include "ext2.h"
 #include "xip.h"
 
-void ext2_xip_verify_sb(struct super_block *sb)
-{
-	struct ext2_sb_info *sbi = EXT2_SB(sb);
-
-	if ((sbi->s_mount_opt & EXT2_MOUNT_XIP) &&
-	    !sb->s_bdev->bd_disk->fops->direct_access) {
-		sbi->s_mount_opt &= (~EXT2_MOUNT_XIP);
-		ext2_msg(sb, KERN_WARNING,
-			     "warning: ignoring xip option - "
-			     "not supported by bdev");
-	}
-}
diff --git a/fs/ext2/xip.h b/fs/ext2/xip.h
index e7b9f0a..87eeb04 100644
--- a/fs/ext2/xip.h
+++ b/fs/ext2/xip.h
@@ -6,13 +6,11 @@
  */
 
 #ifdef CONFIG_EXT2_FS_XIP
-extern void ext2_xip_verify_sb (struct super_block *);
 static inline int ext2_use_xip (struct super_block *sb)
 {
 	struct ext2_sb_info *sbi = EXT2_SB(sb);
 	return (sbi->s_mount_opt & EXT2_MOUNT_XIP);
 }
 #else
-#define ext2_xip_verify_sb(sb)			do { } while (0)
 #define ext2_use_xip(sb)			0
 #endif
-- 
2.1.1


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH v12 13/20] ext2: Remove ext2_use_xip
  2014-10-24 21:20 [PATCH v12 00/20] DAX: Page cache bypass for filesystems on memory storage Matthew Wilcox
                   ` (11 preceding siblings ...)
  2014-10-24 21:20 ` [PATCH v12 12/20] ext2: Remove ext2_xip_verify_sb() Matthew Wilcox
@ 2014-10-24 21:20 ` Matthew Wilcox
  2014-10-24 21:20 ` [PATCH v12 14/20] ext2: Remove xip.c and xip.h Matthew Wilcox
                   ` (8 subsequent siblings)
  21 siblings, 0 replies; 60+ messages in thread
From: Matthew Wilcox @ 2014-10-24 21:20 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-mm
  Cc: Matthew Wilcox, willy, Andrew Morton

Replace ext2_use_xip() with test_opt(XIP) which expands to the same code

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
---
 fs/ext2/ext2.h  | 4 ++++
 fs/ext2/inode.c | 2 +-
 fs/ext2/namei.c | 4 ++--
 3 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h
index d9a17d0..5ecf570 100644
--- a/fs/ext2/ext2.h
+++ b/fs/ext2/ext2.h
@@ -380,7 +380,11 @@ struct ext2_inode {
 #define EXT2_MOUNT_NO_UID32		0x000200  /* Disable 32-bit UIDs */
 #define EXT2_MOUNT_XATTR_USER		0x004000  /* Extended user attributes */
 #define EXT2_MOUNT_POSIX_ACL		0x008000  /* POSIX Access Control Lists */
+#ifdef CONFIG_FS_XIP
 #define EXT2_MOUNT_XIP			0x010000  /* Execute in place */
+#else
+#define EXT2_MOUNT_XIP			0
+#endif
 #define EXT2_MOUNT_USRQUOTA		0x020000  /* user quota */
 #define EXT2_MOUNT_GRPQUOTA		0x040000  /* group quota */
 #define EXT2_MOUNT_RESERVATION		0x080000  /* Preallocation */
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 59d6c7d..cba3833 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -1394,7 +1394,7 @@ struct inode *ext2_iget (struct super_block *sb, unsigned long ino)
 
 	if (S_ISREG(inode->i_mode)) {
 		inode->i_op = &ext2_file_inode_operations;
-		if (ext2_use_xip(inode->i_sb)) {
+		if (test_opt(inode->i_sb, XIP)) {
 			inode->i_mapping->a_ops = &ext2_aops_xip;
 			inode->i_fop = &ext2_xip_file_operations;
 		} else if (test_opt(inode->i_sb, NOBH)) {
diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c
index c268d0a..846c356 100644
--- a/fs/ext2/namei.c
+++ b/fs/ext2/namei.c
@@ -105,7 +105,7 @@ static int ext2_create (struct inode * dir, struct dentry * dentry, umode_t mode
 		return PTR_ERR(inode);
 
 	inode->i_op = &ext2_file_inode_operations;
-	if (ext2_use_xip(inode->i_sb)) {
+	if (test_opt(inode->i_sb, XIP)) {
 		inode->i_mapping->a_ops = &ext2_aops_xip;
 		inode->i_fop = &ext2_xip_file_operations;
 	} else if (test_opt(inode->i_sb, NOBH)) {
@@ -126,7 +126,7 @@ static int ext2_tmpfile(struct inode *dir, struct dentry *dentry, umode_t mode)
 		return PTR_ERR(inode);
 
 	inode->i_op = &ext2_file_inode_operations;
-	if (ext2_use_xip(inode->i_sb)) {
+	if (test_opt(inode->i_sb, XIP)) {
 		inode->i_mapping->a_ops = &ext2_aops_xip;
 		inode->i_fop = &ext2_xip_file_operations;
 	} else if (test_opt(inode->i_sb, NOBH)) {
-- 
2.1.1


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH v12 14/20] ext2: Remove xip.c and xip.h
  2014-10-24 21:20 [PATCH v12 00/20] DAX: Page cache bypass for filesystems on memory storage Matthew Wilcox
                   ` (12 preceding siblings ...)
  2014-10-24 21:20 ` [PATCH v12 13/20] ext2: Remove ext2_use_xip Matthew Wilcox
@ 2014-10-24 21:20 ` Matthew Wilcox
  2014-10-24 21:20 ` [PATCH v12 15/20] vfs,ext2: Remove CONFIG_EXT2_FS_XIP and rename CONFIG_FS_XIP to CONFIG_FS_DAX Matthew Wilcox
                   ` (7 subsequent siblings)
  21 siblings, 0 replies; 60+ messages in thread
From: Matthew Wilcox @ 2014-10-24 21:20 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-mm
  Cc: Matthew Wilcox, willy, Andrew Morton

These files are now empty, so delete them

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
---
 fs/ext2/Makefile |  1 -
 fs/ext2/inode.c  |  1 -
 fs/ext2/namei.c  |  1 -
 fs/ext2/super.c  |  1 -
 fs/ext2/xip.c    | 15 ---------------
 fs/ext2/xip.h    | 16 ----------------
 6 files changed, 35 deletions(-)
 delete mode 100644 fs/ext2/xip.c
 delete mode 100644 fs/ext2/xip.h

diff --git a/fs/ext2/Makefile b/fs/ext2/Makefile
index f42af45..445b0e9 100644
--- a/fs/ext2/Makefile
+++ b/fs/ext2/Makefile
@@ -10,4 +10,3 @@ ext2-y := balloc.o dir.o file.o ialloc.o inode.o \
 ext2-$(CONFIG_EXT2_FS_XATTR)	 += xattr.o xattr_user.o xattr_trusted.o
 ext2-$(CONFIG_EXT2_FS_POSIX_ACL) += acl.o
 ext2-$(CONFIG_EXT2_FS_SECURITY)	 += xattr_security.o
-ext2-$(CONFIG_EXT2_FS_XIP)	 += xip.o
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index cba3833..154cbcf 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -34,7 +34,6 @@
 #include <linux/aio.h>
 #include "ext2.h"
 #include "acl.h"
-#include "xip.h"
 #include "xattr.h"
 
 static int __ext2_write_inode(struct inode *inode, int do_sync);
diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c
index 846c356..7ca803f 100644
--- a/fs/ext2/namei.c
+++ b/fs/ext2/namei.c
@@ -35,7 +35,6 @@
 #include "ext2.h"
 #include "xattr.h"
 #include "acl.h"
-#include "xip.h"
 
 static inline int ext2_add_nondir(struct dentry *dentry, struct inode *inode)
 {
diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index f975854..b1f25f8 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -35,7 +35,6 @@
 #include "ext2.h"
 #include "xattr.h"
 #include "acl.h"
-#include "xip.h"
 
 static void ext2_sync_super(struct super_block *sb,
 			    struct ext2_super_block *es, int wait);
diff --git a/fs/ext2/xip.c b/fs/ext2/xip.c
deleted file mode 100644
index 66ca113..0000000
--- a/fs/ext2/xip.c
+++ /dev/null
@@ -1,15 +0,0 @@
-/*
- *  linux/fs/ext2/xip.c
- *
- * Copyright (C) 2005 IBM Corporation
- * Author: Carsten Otte (cotte@de.ibm.com)
- */
-
-#include <linux/mm.h>
-#include <linux/fs.h>
-#include <linux/genhd.h>
-#include <linux/buffer_head.h>
-#include <linux/blkdev.h>
-#include "ext2.h"
-#include "xip.h"
-
diff --git a/fs/ext2/xip.h b/fs/ext2/xip.h
deleted file mode 100644
index 87eeb04..0000000
--- a/fs/ext2/xip.h
+++ /dev/null
@@ -1,16 +0,0 @@
-/*
- *  linux/fs/ext2/xip.h
- *
- * Copyright (C) 2005 IBM Corporation
- * Author: Carsten Otte (cotte@de.ibm.com)
- */
-
-#ifdef CONFIG_EXT2_FS_XIP
-static inline int ext2_use_xip (struct super_block *sb)
-{
-	struct ext2_sb_info *sbi = EXT2_SB(sb);
-	return (sbi->s_mount_opt & EXT2_MOUNT_XIP);
-}
-#else
-#define ext2_use_xip(sb)			0
-#endif
-- 
2.1.1


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH v12 15/20] vfs,ext2: Remove CONFIG_EXT2_FS_XIP and rename CONFIG_FS_XIP to CONFIG_FS_DAX
  2014-10-24 21:20 [PATCH v12 00/20] DAX: Page cache bypass for filesystems on memory storage Matthew Wilcox
                   ` (13 preceding siblings ...)
  2014-10-24 21:20 ` [PATCH v12 14/20] ext2: Remove xip.c and xip.h Matthew Wilcox
@ 2014-10-24 21:20 ` Matthew Wilcox
  2014-10-24 21:20 ` [PATCH v12 16/20] ext2: Remove ext2_aops_xip Matthew Wilcox
                   ` (6 subsequent siblings)
  21 siblings, 0 replies; 60+ messages in thread
From: Matthew Wilcox @ 2014-10-24 21:20 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-mm
  Cc: Matthew Wilcox, willy, Andrew Morton

The fewer Kconfig options we have the better.  Use the generic
CONFIG_FS_DAX to enable XIP support in ext2 as well as in the core.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
---
 fs/Kconfig         | 21 ++++++++++++++-------
 fs/Makefile        |  2 +-
 fs/ext2/Kconfig    | 11 -----------
 fs/ext2/ext2.h     |  2 +-
 fs/ext2/file.c     |  4 ++--
 fs/ext2/super.c    |  4 ++--
 include/linux/fs.h |  2 +-
 scripts/diffconfig |  1 -
 8 files changed, 21 insertions(+), 26 deletions(-)

diff --git a/fs/Kconfig b/fs/Kconfig
index db5dc15..731e702 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -13,13 +13,6 @@ if BLOCK
 source "fs/ext2/Kconfig"
 source "fs/ext3/Kconfig"
 source "fs/ext4/Kconfig"
-
-config FS_XIP
-# execute in place
-	bool
-	depends on EXT2_FS_XIP
-	default y
-
 source "fs/jbd/Kconfig"
 source "fs/jbd2/Kconfig"
 
@@ -40,6 +33,20 @@ source "fs/ocfs2/Kconfig"
 source "fs/btrfs/Kconfig"
 source "fs/nilfs2/Kconfig"
 
+config FS_DAX
+	bool "Direct Access (DAX) support"
+	depends on MMU
+	help
+	  Direct Access (DAX) can be used on memory-backed block devices.
+	  If the block device supports DAX and the filesystem supports DAX,
+	  then you can avoid using the pagecache to buffer I/Os.  Turning
+	  on this option will compile in support for DAX; you will need to
+	  mount the filesystem using the -o dax option.
+
+	  If you do not have a block device that is capable of using this,
+	  or if unsure, say N.  Saying Y will increase the size of the kernel
+	  by about 5kB.
+
 endif # BLOCK
 
 # Posix ACL utility routines
diff --git a/fs/Makefile b/fs/Makefile
index 0325ec3..df4a4cf 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -28,7 +28,7 @@ obj-$(CONFIG_SIGNALFD)		+= signalfd.o
 obj-$(CONFIG_TIMERFD)		+= timerfd.o
 obj-$(CONFIG_EVENTFD)		+= eventfd.o
 obj-$(CONFIG_AIO)               += aio.o
-obj-$(CONFIG_FS_XIP)		+= dax.o
+obj-$(CONFIG_FS_DAX)		+= dax.o
 obj-$(CONFIG_FILE_LOCKING)      += locks.o
 obj-$(CONFIG_COMPAT)		+= compat.o compat_ioctl.o
 obj-$(CONFIG_BINFMT_AOUT)	+= binfmt_aout.o
diff --git a/fs/ext2/Kconfig b/fs/ext2/Kconfig
index 14a6780..c634874e 100644
--- a/fs/ext2/Kconfig
+++ b/fs/ext2/Kconfig
@@ -42,14 +42,3 @@ config EXT2_FS_SECURITY
 
 	  If you are not using a security module that requires using
 	  extended attributes for file security labels, say N.
-
-config EXT2_FS_XIP
-	bool "Ext2 execute in place support"
-	depends on EXT2_FS && MMU
-	help
-	  Execute in place can be used on memory-backed block devices. If you
-	  enable this option, you can select to mount block devices which are
-	  capable of this feature without using the page cache.
-
-	  If you do not use a block device that is capable of using this,
-	  or if unsure, say N.
diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h
index 5ecf570..b30c3bd 100644
--- a/fs/ext2/ext2.h
+++ b/fs/ext2/ext2.h
@@ -380,7 +380,7 @@ struct ext2_inode {
 #define EXT2_MOUNT_NO_UID32		0x000200  /* Disable 32-bit UIDs */
 #define EXT2_MOUNT_XATTR_USER		0x004000  /* Extended user attributes */
 #define EXT2_MOUNT_POSIX_ACL		0x008000  /* POSIX Access Control Lists */
-#ifdef CONFIG_FS_XIP
+#ifdef CONFIG_FS_DAX
 #define EXT2_MOUNT_XIP			0x010000  /* Execute in place */
 #else
 #define EXT2_MOUNT_XIP			0
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index da8dc64..46b333d 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -25,7 +25,7 @@
 #include "xattr.h"
 #include "acl.h"
 
-#ifdef CONFIG_EXT2_FS_XIP
+#ifdef CONFIG_FS_DAX
 static int ext2_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 {
 	return dax_fault(vma, vmf, ext2_get_block);
@@ -109,7 +109,7 @@ const struct file_operations ext2_file_operations = {
 	.splice_write	= iter_file_splice_write,
 };
 
-#ifdef CONFIG_EXT2_FS_XIP
+#ifdef CONFIG_FS_DAX
 const struct file_operations ext2_xip_file_operations = {
 	.llseek		= generic_file_llseek,
 	.read		= new_sync_read,
diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index b1f25f8..60f7b53 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -287,7 +287,7 @@ static int ext2_show_options(struct seq_file *seq, struct dentry *root)
 		seq_puts(seq, ",grpquota");
 #endif
 
-#if defined(CONFIG_EXT2_FS_XIP)
+#ifdef CONFIG_FS_DAX
 	if (sbi->s_mount_opt & EXT2_MOUNT_XIP)
 		seq_puts(seq, ",xip");
 #endif
@@ -549,7 +549,7 @@ static int parse_options(char *options, struct super_block *sb)
 			break;
 #endif
 		case Opt_xip:
-#ifdef CONFIG_EXT2_FS_XIP
+#ifdef CONFIG_FS_DAX
 			set_opt (sbi->s_mount_opt, XIP);
 #else
 			ext2_msg(sb, KERN_INFO, "xip option not supported");
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 527684e..dad6628 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1586,7 +1586,7 @@ struct super_operations {
 #define S_IMA		1024	/* Inode has an associated IMA struct */
 #define S_AUTOMOUNT	2048	/* Automount/referral quasi-directory */
 #define S_NOSEC		4096	/* no suid or xattr security attributes */
-#ifdef CONFIG_FS_XIP
+#ifdef CONFIG_FS_DAX
 #define S_DAX		8192	/* Direct Access, avoiding the page cache */
 #else
 #define S_DAX		0	/* Make all the DAX code disappear */
diff --git a/scripts/diffconfig b/scripts/diffconfig
index 6d67283..0db267d 100755
--- a/scripts/diffconfig
+++ b/scripts/diffconfig
@@ -28,7 +28,6 @@ If no config files are specified, .config and .config.old are used.
 Example usage:
  $ diffconfig .config config-with-some-changes
 -EXT2_FS_XATTR  n
--EXT2_FS_XIP  n
  CRAMFS  n -> y
  EXT2_FS  y -> n
  LOG_BUF_SHIFT  14 -> 16
-- 
2.1.1


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH v12 16/20] ext2: Remove ext2_aops_xip
  2014-10-24 21:20 [PATCH v12 00/20] DAX: Page cache bypass for filesystems on memory storage Matthew Wilcox
                   ` (14 preceding siblings ...)
  2014-10-24 21:20 ` [PATCH v12 15/20] vfs,ext2: Remove CONFIG_EXT2_FS_XIP and rename CONFIG_FS_XIP to CONFIG_FS_DAX Matthew Wilcox
@ 2014-10-24 21:20 ` Matthew Wilcox
  2014-10-24 21:20 ` [PATCH v12 17/20] ext2: Get rid of most mentions of XIP in ext2 Matthew Wilcox
                   ` (5 subsequent siblings)
  21 siblings, 0 replies; 60+ messages in thread
From: Matthew Wilcox @ 2014-10-24 21:20 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-mm
  Cc: Matthew Wilcox, willy, Andrew Morton

We shouldn't need a special address_space_operations any more

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
---
 fs/ext2/ext2.h  | 1 -
 fs/ext2/inode.c | 7 +------
 fs/ext2/namei.c | 4 ++--
 3 files changed, 3 insertions(+), 9 deletions(-)

diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h
index b30c3bd..b8b1c11 100644
--- a/fs/ext2/ext2.h
+++ b/fs/ext2/ext2.h
@@ -793,7 +793,6 @@ extern const struct file_operations ext2_xip_file_operations;
 
 /* inode.c */
 extern const struct address_space_operations ext2_aops;
-extern const struct address_space_operations ext2_aops_xip;
 extern const struct address_space_operations ext2_nobh_aops;
 
 /* namei.c */
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 154cbcf..034fd42 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -891,11 +891,6 @@ const struct address_space_operations ext2_aops = {
 	.error_remove_page	= generic_error_remove_page,
 };
 
-const struct address_space_operations ext2_aops_xip = {
-	.bmap			= ext2_bmap,
-	.direct_IO		= ext2_direct_IO,
-};
-
 const struct address_space_operations ext2_nobh_aops = {
 	.readpage		= ext2_readpage,
 	.readpages		= ext2_readpages,
@@ -1394,7 +1389,7 @@ struct inode *ext2_iget (struct super_block *sb, unsigned long ino)
 	if (S_ISREG(inode->i_mode)) {
 		inode->i_op = &ext2_file_inode_operations;
 		if (test_opt(inode->i_sb, XIP)) {
-			inode->i_mapping->a_ops = &ext2_aops_xip;
+			inode->i_mapping->a_ops = &ext2_aops;
 			inode->i_fop = &ext2_xip_file_operations;
 		} else if (test_opt(inode->i_sb, NOBH)) {
 			inode->i_mapping->a_ops = &ext2_nobh_aops;
diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c
index 7ca803f..0db888c 100644
--- a/fs/ext2/namei.c
+++ b/fs/ext2/namei.c
@@ -105,7 +105,7 @@ static int ext2_create (struct inode * dir, struct dentry * dentry, umode_t mode
 
 	inode->i_op = &ext2_file_inode_operations;
 	if (test_opt(inode->i_sb, XIP)) {
-		inode->i_mapping->a_ops = &ext2_aops_xip;
+		inode->i_mapping->a_ops = &ext2_aops;
 		inode->i_fop = &ext2_xip_file_operations;
 	} else if (test_opt(inode->i_sb, NOBH)) {
 		inode->i_mapping->a_ops = &ext2_nobh_aops;
@@ -126,7 +126,7 @@ static int ext2_tmpfile(struct inode *dir, struct dentry *dentry, umode_t mode)
 
 	inode->i_op = &ext2_file_inode_operations;
 	if (test_opt(inode->i_sb, XIP)) {
-		inode->i_mapping->a_ops = &ext2_aops_xip;
+		inode->i_mapping->a_ops = &ext2_aops;
 		inode->i_fop = &ext2_xip_file_operations;
 	} else if (test_opt(inode->i_sb, NOBH)) {
 		inode->i_mapping->a_ops = &ext2_nobh_aops;
-- 
2.1.1


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH v12 17/20] ext2: Get rid of most mentions of XIP in ext2
  2014-10-24 21:20 [PATCH v12 00/20] DAX: Page cache bypass for filesystems on memory storage Matthew Wilcox
                   ` (15 preceding siblings ...)
  2014-10-24 21:20 ` [PATCH v12 16/20] ext2: Remove ext2_aops_xip Matthew Wilcox
@ 2014-10-24 21:20 ` Matthew Wilcox
  2014-10-24 21:20 ` [PATCH v12 18/20] dax: Add dax_zero_page_range Matthew Wilcox
                   ` (4 subsequent siblings)
  21 siblings, 0 replies; 60+ messages in thread
From: Matthew Wilcox @ 2014-10-24 21:20 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-mm
  Cc: Matthew Wilcox, willy, Andrew Morton

To help people transition, accept the 'xip' mount option (and report it
in /proc/mounts), but print a message encouraging people to switch over
to the 'dax' option.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
---
 Documentation/filesystems/ext2.txt |  5 +++--
 fs/ext2/ext2.h                     | 13 +++++++------
 fs/ext2/file.c                     |  2 +-
 fs/ext2/inode.c                    |  6 +++---
 fs/ext2/namei.c                    |  8 ++++----
 fs/ext2/super.c                    | 25 ++++++++++++++++---------
 6 files changed, 34 insertions(+), 25 deletions(-)

diff --git a/Documentation/filesystems/ext2.txt b/Documentation/filesystems/ext2.txt
index 67639f9..b971456 100644
--- a/Documentation/filesystems/ext2.txt
+++ b/Documentation/filesystems/ext2.txt
@@ -20,6 +20,9 @@ minixdf				Makes `df' act like Minix.
 check=none, nocheck	(*)	Don't do extra checking of bitmaps on mount
 				(check=normal and check=strict options removed)
 
+dax				Use direct access (no page cache).  See
+				Documentation/filesystems/dax.txt.
+
 debug				Extra debugging information is sent to the
 				kernel syslog.  Useful for developers.
 
@@ -56,8 +59,6 @@ noacl				Don't support POSIX ACLs.
 
 nobh				Do not attach buffer_heads to file pagecache.
 
-xip				Use execute in place (no caching) if possible
-
 grpquota,noquota,quota,usrquota	Quota options are silently ignored by ext2.
 
 
diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h
index b8b1c11..46133a0 100644
--- a/fs/ext2/ext2.h
+++ b/fs/ext2/ext2.h
@@ -380,14 +380,15 @@ struct ext2_inode {
 #define EXT2_MOUNT_NO_UID32		0x000200  /* Disable 32-bit UIDs */
 #define EXT2_MOUNT_XATTR_USER		0x004000  /* Extended user attributes */
 #define EXT2_MOUNT_POSIX_ACL		0x008000  /* POSIX Access Control Lists */
-#ifdef CONFIG_FS_DAX
-#define EXT2_MOUNT_XIP			0x010000  /* Execute in place */
-#else
-#define EXT2_MOUNT_XIP			0
-#endif
+#define EXT2_MOUNT_XIP			0x010000  /* Obsolete, use DAX */
 #define EXT2_MOUNT_USRQUOTA		0x020000  /* user quota */
 #define EXT2_MOUNT_GRPQUOTA		0x040000  /* group quota */
 #define EXT2_MOUNT_RESERVATION		0x080000  /* Preallocation */
+#ifdef CONFIG_FS_DAX
+#define EXT2_MOUNT_DAX			0x100000  /* Direct Access */
+#else
+#define EXT2_MOUNT_DAX			0
+#endif
 
 
 #define clear_opt(o, opt)		o &= ~EXT2_MOUNT_##opt
@@ -789,7 +790,7 @@ extern int ext2_fsync(struct file *file, loff_t start, loff_t end,
 		      int datasync);
 extern const struct inode_operations ext2_file_inode_operations;
 extern const struct file_operations ext2_file_operations;
-extern const struct file_operations ext2_xip_file_operations;
+extern const struct file_operations ext2_dax_file_operations;
 
 /* inode.c */
 extern const struct address_space_operations ext2_aops;
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index 46b333d..5b8cab5 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -110,7 +110,7 @@ const struct file_operations ext2_file_operations = {
 };
 
 #ifdef CONFIG_FS_DAX
-const struct file_operations ext2_xip_file_operations = {
+const struct file_operations ext2_dax_file_operations = {
 	.llseek		= generic_file_llseek,
 	.read		= new_sync_read,
 	.write		= new_sync_write,
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 034fd42..6434bc0 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -1286,7 +1286,7 @@ void ext2_set_inode_flags(struct inode *inode)
 		inode->i_flags |= S_NOATIME;
 	if (flags & EXT2_DIRSYNC_FL)
 		inode->i_flags |= S_DIRSYNC;
-	if (test_opt(inode->i_sb, XIP))
+	if (test_opt(inode->i_sb, DAX))
 		inode->i_flags |= S_DAX;
 }
 
@@ -1388,9 +1388,9 @@ struct inode *ext2_iget (struct super_block *sb, unsigned long ino)
 
 	if (S_ISREG(inode->i_mode)) {
 		inode->i_op = &ext2_file_inode_operations;
-		if (test_opt(inode->i_sb, XIP)) {
+		if (test_opt(inode->i_sb, DAX)) {
 			inode->i_mapping->a_ops = &ext2_aops;
-			inode->i_fop = &ext2_xip_file_operations;
+			inode->i_fop = &ext2_dax_file_operations;
 		} else if (test_opt(inode->i_sb, NOBH)) {
 			inode->i_mapping->a_ops = &ext2_nobh_aops;
 			inode->i_fop = &ext2_file_operations;
diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c
index 0db888c..148f6e3 100644
--- a/fs/ext2/namei.c
+++ b/fs/ext2/namei.c
@@ -104,9 +104,9 @@ static int ext2_create (struct inode * dir, struct dentry * dentry, umode_t mode
 		return PTR_ERR(inode);
 
 	inode->i_op = &ext2_file_inode_operations;
-	if (test_opt(inode->i_sb, XIP)) {
+	if (test_opt(inode->i_sb, DAX)) {
 		inode->i_mapping->a_ops = &ext2_aops;
-		inode->i_fop = &ext2_xip_file_operations;
+		inode->i_fop = &ext2_dax_file_operations;
 	} else if (test_opt(inode->i_sb, NOBH)) {
 		inode->i_mapping->a_ops = &ext2_nobh_aops;
 		inode->i_fop = &ext2_file_operations;
@@ -125,9 +125,9 @@ static int ext2_tmpfile(struct inode *dir, struct dentry *dentry, umode_t mode)
 		return PTR_ERR(inode);
 
 	inode->i_op = &ext2_file_inode_operations;
-	if (test_opt(inode->i_sb, XIP)) {
+	if (test_opt(inode->i_sb, DAX)) {
 		inode->i_mapping->a_ops = &ext2_aops;
-		inode->i_fop = &ext2_xip_file_operations;
+		inode->i_fop = &ext2_dax_file_operations;
 	} else if (test_opt(inode->i_sb, NOBH)) {
 		inode->i_mapping->a_ops = &ext2_nobh_aops;
 		inode->i_fop = &ext2_file_operations;
diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index 60f7b53..2e82bf2 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -290,6 +290,8 @@ static int ext2_show_options(struct seq_file *seq, struct dentry *root)
 #ifdef CONFIG_FS_DAX
 	if (sbi->s_mount_opt & EXT2_MOUNT_XIP)
 		seq_puts(seq, ",xip");
+	if (sbi->s_mount_opt & EXT2_MOUNT_DAX)
+		seq_puts(seq, ",dax");
 #endif
 
 	if (!test_opt(sb, RESERVATION))
@@ -393,7 +395,7 @@ enum {
 	Opt_resgid, Opt_resuid, Opt_sb, Opt_err_cont, Opt_err_panic,
 	Opt_err_ro, Opt_nouid32, Opt_nocheck, Opt_debug,
 	Opt_oldalloc, Opt_orlov, Opt_nobh, Opt_user_xattr, Opt_nouser_xattr,
-	Opt_acl, Opt_noacl, Opt_xip, Opt_ignore, Opt_err, Opt_quota,
+	Opt_acl, Opt_noacl, Opt_xip, Opt_dax, Opt_ignore, Opt_err, Opt_quota,
 	Opt_usrquota, Opt_grpquota, Opt_reservation, Opt_noreservation
 };
 
@@ -422,6 +424,7 @@ static const match_table_t tokens = {
 	{Opt_acl, "acl"},
 	{Opt_noacl, "noacl"},
 	{Opt_xip, "xip"},
+	{Opt_dax, "dax"},
 	{Opt_grpquota, "grpquota"},
 	{Opt_ignore, "noquota"},
 	{Opt_quota, "quota"},
@@ -549,10 +552,14 @@ static int parse_options(char *options, struct super_block *sb)
 			break;
 #endif
 		case Opt_xip:
+			ext2_msg(sb, KERN_INFO, "use dax instead of xip");
+			set_opt(sbi->s_mount_opt, XIP);
+			/* Fall through */
+		case Opt_dax:
 #ifdef CONFIG_FS_DAX
-			set_opt (sbi->s_mount_opt, XIP);
+			set_opt(sbi->s_mount_opt, DAX);
 #else
-			ext2_msg(sb, KERN_INFO, "xip option not supported");
+			ext2_msg(sb, KERN_INFO, "dax option not supported");
 #endif
 			break;
 
@@ -896,15 +903,15 @@ static int ext2_fill_super(struct super_block *sb, void *data, int silent)
 
 	blocksize = BLOCK_SIZE << le32_to_cpu(sbi->s_es->s_log_block_size);
 
-	if (sbi->s_mount_opt & EXT2_MOUNT_XIP) {
+	if (sbi->s_mount_opt & EXT2_MOUNT_DAX) {
 		if (blocksize != PAGE_SIZE) {
 			ext2_msg(sb, KERN_ERR,
-					"error: unsupported blocksize for xip");
+					"error: unsupported blocksize for dax");
 			goto failed_mount;
 		}
 		if (!sb->s_bdev->bd_disk->fops->direct_access) {
 			ext2_msg(sb, KERN_ERR,
-					"error: device does not support xip");
+					"error: device does not support dax");
 			goto failed_mount;
 		}
 	}
@@ -1276,10 +1283,10 @@ static int ext2_remount (struct super_block * sb, int * flags, char * data)
 		((sbi->s_mount_opt & EXT2_MOUNT_POSIX_ACL) ? MS_POSIXACL : 0);
 
 	es = sbi->s_es;
-	if ((sbi->s_mount_opt ^ old_opts.s_mount_opt) & EXT2_MOUNT_XIP) {
+	if ((sbi->s_mount_opt ^ old_opts.s_mount_opt) & EXT2_MOUNT_DAX) {
 		ext2_msg(sb, KERN_WARNING, "warning: refusing change of "
-			 "xip flag with busy inodes while remounting");
-		sbi->s_mount_opt ^= EXT2_MOUNT_XIP;
+			 "dax flag with busy inodes while remounting");
+		sbi->s_mount_opt ^= EXT2_MOUNT_DAX;
 	}
 	if ((*flags & MS_RDONLY) == (sb->s_flags & MS_RDONLY)) {
 		spin_unlock(&sbi->s_lock);
-- 
2.1.1


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH v12 18/20] dax: Add dax_zero_page_range
  2014-10-24 21:20 [PATCH v12 00/20] DAX: Page cache bypass for filesystems on memory storage Matthew Wilcox
                   ` (16 preceding siblings ...)
  2014-10-24 21:20 ` [PATCH v12 17/20] ext2: Get rid of most mentions of XIP in ext2 Matthew Wilcox
@ 2014-10-24 21:20 ` Matthew Wilcox
  2015-01-12 23:10   ` Andrew Morton
  2014-10-24 21:20 ` [PATCH v12 19/20] ext4: Add DAX functionality Matthew Wilcox
                   ` (3 subsequent siblings)
  21 siblings, 1 reply; 60+ messages in thread
From: Matthew Wilcox @ 2014-10-24 21:20 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-mm
  Cc: Matthew Wilcox, willy, Andrew Morton, Ross Zwisler

This new function allows us to support hole-punch for DAX files by zeroing
a partial page, as opposed to the dax_truncate_page() function which can
only truncate to the end of the page.  Reimplement dax_truncate_page() to
call dax_zero_page_range().

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
[ported to 3.13-rc2]
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 Documentation/filesystems/dax.txt |  1 +
 fs/dax.c                          | 36 +++++++++++++++++++++++++++++++-----
 include/linux/fs.h                |  1 +
 3 files changed, 33 insertions(+), 5 deletions(-)

diff --git a/Documentation/filesystems/dax.txt b/Documentation/filesystems/dax.txt
index 635adaa..ebcd97f 100644
--- a/Documentation/filesystems/dax.txt
+++ b/Documentation/filesystems/dax.txt
@@ -62,6 +62,7 @@ Filesystem support consists of
   for fault and page_mkwrite (which should probably call dax_fault() and
   dax_mkwrite(), passing the appropriate get_block() callback)
 - calling dax_truncate_page() instead of block_truncate_page() for DAX files
+- calling dax_zero_page_range() instead of zero_user() for DAX files
 - ensuring that there is sufficient locking between reads, writes,
   truncates and page faults
 
diff --git a/fs/dax.c b/fs/dax.c
index e838ec8..24f6e14 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -460,13 +460,16 @@ int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 EXPORT_SYMBOL_GPL(dax_fault);
 
 /**
- * dax_truncate_page - handle a partial page being truncated in a DAX file
+ * dax_zero_page_range - zero a range within a page of a DAX file
  * @inode: The file being truncated
  * @from: The file offset that is being truncated to
+ * @length: The number of bytes to zero
  * @get_block: The filesystem method used to translate file offsets to blocks
  *
- * Similar to block_truncate_page(), this function can be called by a
- * filesystem when it is truncating an DAX file to handle the partial page.
+ * This function can be called by a filesystem when it is zeroing part of a
+ * page in a DAX file.  This is intended for hole-punch operations.  If
+ * you are truncating a file, the helper function dax_truncate_page() may be
+ * more convenient.
  *
  * We work in terms of PAGE_CACHE_SIZE here for commonality with
  * block_truncate_page(), but we could go down to PAGE_SIZE if the filesystem
@@ -474,17 +477,18 @@ EXPORT_SYMBOL_GPL(dax_fault);
  * block size is smaller than PAGE_SIZE, we have to zero the rest of the page
  * since the file might be mmaped.
  */
-int dax_truncate_page(struct inode *inode, loff_t from, get_block_t get_block)
+int dax_zero_page_range(struct inode *inode, loff_t from, unsigned length,
+							get_block_t get_block)
 {
 	struct buffer_head bh;
 	pgoff_t index = from >> PAGE_CACHE_SHIFT;
 	unsigned offset = from & (PAGE_CACHE_SIZE-1);
-	unsigned length = PAGE_CACHE_ALIGN(from) - from;
 	int err;
 
 	/* Block boundary? Nothing to do */
 	if (!length)
 		return 0;
+	BUG_ON((offset + length) > PAGE_CACHE_SIZE);
 
 	memset(&bh, 0, sizeof(bh));
 	bh.b_size = PAGE_CACHE_SIZE;
@@ -501,4 +505,26 @@ int dax_truncate_page(struct inode *inode, loff_t from, get_block_t get_block)
 
 	return 0;
 }
+EXPORT_SYMBOL_GPL(dax_zero_page_range);
+
+/**
+ * dax_truncate_page - handle a partial page being truncated in a DAX file
+ * @inode: The file being truncated
+ * @from: The file offset that is being truncated to
+ * @get_block: The filesystem method used to translate file offsets to blocks
+ *
+ * Similar to block_truncate_page(), this function can be called by a
+ * filesystem when it is truncating an DAX file to handle the partial page.
+ *
+ * We work in terms of PAGE_CACHE_SIZE here for commonality with
+ * block_truncate_page(), but we could go down to PAGE_SIZE if the filesystem
+ * took care of disposing of the unnecessary blocks.  Even if the filesystem
+ * block size is smaller than PAGE_SIZE, we have to zero the rest of the page
+ * since the file might be mmaped.
+ */
+int dax_truncate_page(struct inode *inode, loff_t from, get_block_t get_block)
+{
+	unsigned length = PAGE_CACHE_ALIGN(from) - from;
+	return dax_zero_page_range(inode, from, length, get_block);
+}
 EXPORT_SYMBOL_GPL(dax_truncate_page);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index dad6628..563a6ca 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2474,6 +2474,7 @@ extern int nonseekable_open(struct inode * inode, struct file * filp);
 ssize_t dax_do_io(int rw, struct kiocb *, struct inode *, struct iov_iter *,
 		loff_t, get_block_t, dio_iodone_t, int flags);
 int dax_clear_blocks(struct inode *, sector_t block, long size);
+int dax_zero_page_range(struct inode *, loff_t from, unsigned len, get_block_t);
 int dax_truncate_page(struct inode *, loff_t from, get_block_t);
 int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t);
 #define dax_mkwrite(vma, vmf, gb)	dax_fault(vma, vmf, gb)
-- 
2.1.1


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH v12 19/20] ext4: Add DAX functionality
  2014-10-24 21:20 [PATCH v12 00/20] DAX: Page cache bypass for filesystems on memory storage Matthew Wilcox
                   ` (17 preceding siblings ...)
  2014-10-24 21:20 ` [PATCH v12 18/20] dax: Add dax_zero_page_range Matthew Wilcox
@ 2014-10-24 21:20 ` Matthew Wilcox
  2014-10-24 21:20 ` [PATCH v12 20/20] brd: Rename XIP to DAX Matthew Wilcox
                   ` (2 subsequent siblings)
  21 siblings, 0 replies; 60+ messages in thread
From: Matthew Wilcox @ 2014-10-24 21:20 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-mm
  Cc: Ross Zwisler, willy, Andrew Morton, Matthew Wilcox

From: Ross Zwisler <ross.zwisler@linux.intel.com>

This is a port of the DAX functionality found in the current version of
ext2.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
[heavily tweaked]
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
---
 Documentation/filesystems/dax.txt  |  1 +
 Documentation/filesystems/ext4.txt |  4 ++
 fs/ext4/ext4.h                     |  6 +++
 fs/ext4/file.c                     | 50 ++++++++++++++++++++-
 fs/ext4/indirect.c                 | 18 +++++---
 fs/ext4/inode.c                    | 89 ++++++++++++++++++++++++++------------
 fs/ext4/namei.c                    | 10 ++++-
 fs/ext4/super.c                    | 39 ++++++++++++++++-
 8 files changed, 180 insertions(+), 37 deletions(-)

diff --git a/Documentation/filesystems/dax.txt b/Documentation/filesystems/dax.txt
index ebcd97f..be376d9 100644
--- a/Documentation/filesystems/dax.txt
+++ b/Documentation/filesystems/dax.txt
@@ -73,6 +73,7 @@ or a write()) work correctly.
 
 These filesystems may be used for inspiration:
 - ext2: the second extended filesystem, see Documentation/filesystems/ext2.txt
+- ext4: the fourth extended filesystem, see Documentation/filesystems/ext4.txt
 
 
 Shortcomings
diff --git a/Documentation/filesystems/ext4.txt b/Documentation/filesystems/ext4.txt
index 919a329..6c0108e 100644
--- a/Documentation/filesystems/ext4.txt
+++ b/Documentation/filesystems/ext4.txt
@@ -386,6 +386,10 @@ max_dir_size_kb=n	This limits the size of directories so that any
 i_version		Enable 64-bit inode version support. This option is
 			off by default.
 
+dax			Use direct access (no page cache).  See
+			Documentation/filesystems/dax.txt.  Note that
+			this option is incompatible with data=journal.
+
 Data Mode
 =========
 There are 3 different data modes:
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index b0c225c..a390cb6 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -969,6 +969,11 @@ struct ext4_inode_info {
 #define EXT4_MOUNT_ERRORS_MASK		0x00070
 #define EXT4_MOUNT_MINIX_DF		0x00080	/* Mimics the Minix statfs */
 #define EXT4_MOUNT_NOLOAD		0x00100	/* Don't use existing journal*/
+#ifdef CONFIG_FS_DAX
+#define EXT4_MOUNT_DAX			0x00200	/* Direct Access */
+#else
+#define EXT4_MOUNT_DAX			0
+#endif
 #define EXT4_MOUNT_DATA_FLAGS		0x00C00	/* Mode for data writes: */
 #define EXT4_MOUNT_JOURNAL_DATA		0x00400	/* Write data to journal */
 #define EXT4_MOUNT_ORDERED_DATA		0x00800	/* Flush data before commit */
@@ -2574,6 +2579,7 @@ extern const struct file_operations ext4_dir_operations;
 /* file.c */
 extern const struct inode_operations ext4_file_inode_operations;
 extern const struct file_operations ext4_file_operations;
+extern const struct file_operations ext4_dax_file_operations;
 extern loff_t ext4_llseek(struct file *file, loff_t offset, int origin);
 
 /* inline.c */
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index aca7b24..1c837b7 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -95,7 +95,7 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	struct inode *inode = file_inode(iocb->ki_filp);
 	struct mutex *aio_mutex = NULL;
 	struct blk_plug plug;
-	int o_direct = file->f_flags & O_DIRECT;
+	int o_direct = io_is_direct(file);
 	int overwrite = 0;
 	size_t length = iov_iter_count(from);
 	ssize_t ret;
@@ -191,6 +191,27 @@ errout:
 	return ret;
 }
 
+#ifdef CONFIG_FS_DAX
+static int ext4_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+	return dax_fault(vma, vmf, ext4_get_block);
+					/* Is this the right get_block? */
+}
+
+static int ext4_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+	return dax_mkwrite(vma, vmf, ext4_get_block);
+}
+
+static const struct vm_operations_struct ext4_dax_vm_ops = {
+	.fault		= ext4_dax_fault,
+	.page_mkwrite	= ext4_dax_mkwrite,
+	.remap_pages	= generic_file_remap_pages,
+};
+#else
+#define ext4_dax_vm_ops	ext4_file_vm_ops
+#endif
+
 static const struct vm_operations_struct ext4_file_vm_ops = {
 	.fault		= filemap_fault,
 	.map_pages	= filemap_map_pages,
@@ -201,7 +222,12 @@ static const struct vm_operations_struct ext4_file_vm_ops = {
 static int ext4_file_mmap(struct file *file, struct vm_area_struct *vma)
 {
 	file_accessed(file);
-	vma->vm_ops = &ext4_file_vm_ops;
+	if (IS_DAX(file_inode(file))) {
+		vma->vm_ops = &ext4_dax_vm_ops;
+		vma->vm_flags |= VM_MIXEDMAP;
+	} else {
+		vma->vm_ops = &ext4_file_vm_ops;
+	}
 	return 0;
 }
 
@@ -600,6 +626,26 @@ const struct file_operations ext4_file_operations = {
 	.fallocate	= ext4_fallocate,
 };
 
+#ifdef CONFIG_FS_DAX
+const struct file_operations ext4_dax_file_operations = {
+	.llseek		= ext4_llseek,
+	.read		= new_sync_read,
+	.write		= new_sync_write,
+	.read_iter	= generic_file_read_iter,
+	.write_iter	= ext4_file_write_iter,
+	.unlocked_ioctl = ext4_ioctl,
+#ifdef CONFIG_COMPAT
+	.compat_ioctl	= ext4_compat_ioctl,
+#endif
+	.mmap		= ext4_file_mmap,
+	.open		= ext4_file_open,
+	.release	= ext4_release_file,
+	.fsync		= ext4_sync_file,
+	/* Splice not yet supported with DAX */
+	.fallocate	= ext4_fallocate,
+};
+#endif
+
 const struct inode_operations ext4_file_inode_operations = {
 	.setattr	= ext4_setattr,
 	.getattr	= ext4_getattr,
diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
index e75f840..fa9ec8d 100644
--- a/fs/ext4/indirect.c
+++ b/fs/ext4/indirect.c
@@ -691,14 +691,22 @@ retry:
 			inode_dio_done(inode);
 			goto locked;
 		}
-		ret = __blockdev_direct_IO(rw, iocb, inode,
-				 inode->i_sb->s_bdev, iter, offset,
-				 ext4_get_block, NULL, NULL, 0);
+		if (IS_DAX(inode))
+			ret = dax_do_io(rw, iocb, inode, iter, offset,
+					ext4_get_block, NULL, 0);
+		else
+			ret = __blockdev_direct_IO(rw, iocb, inode,
+					inode->i_sb->s_bdev, iter, offset,
+					ext4_get_block, NULL, NULL, 0);
 		inode_dio_done(inode);
 	} else {
 locked:
-		ret = blockdev_direct_IO(rw, iocb, inode, iter,
-				 offset, ext4_get_block);
+		if (IS_DAX(inode))
+			ret = dax_do_io(rw, iocb, inode, iter, offset,
+					ext4_get_block, NULL, DIO_LOCKING);
+		else
+			ret = blockdev_direct_IO(rw, iocb, inode, iter,
+					offset, ext4_get_block);
 
 		if (unlikely((rw & WRITE) && ret < 0)) {
 			loff_t isize = i_size_read(inode);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 3aa26e9..542205f 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -676,6 +676,18 @@ has_zeroout:
 	return retval;
 }
 
+static void ext4_end_io_unwritten(struct buffer_head *bh, int uptodate)
+{
+	struct inode *inode = bh->b_assoc_map->host;
+	/* XXX: breaks on 32-bit > 16GB. Is that even supported? */
+	loff_t offset = (loff_t)(uintptr_t)bh->b_private << inode->i_blkbits;
+	int err;
+	if (!uptodate)
+		return;
+	WARN_ON(!buffer_unwritten(bh));
+	err = ext4_convert_unwritten_extents(NULL, inode, offset, bh->b_size);
+}
+
 /* Maximum number of blocks we map for direct IO at once. */
 #define DIO_MAX_BLOCKS 4096
 
@@ -713,6 +725,11 @@ static int _ext4_get_block(struct inode *inode, sector_t iblock,
 
 		map_bh(bh, inode->i_sb, map.m_pblk);
 		bh->b_state = (bh->b_state & ~EXT4_MAP_FLAGS) | map.m_flags;
+		if (IS_DAX(inode) && buffer_unwritten(bh) && !io_end) {
+			bh->b_assoc_map = inode->i_mapping;
+			bh->b_private = (void *)(unsigned long)iblock;
+			bh->b_end_io = ext4_end_io_unwritten;
+		}
 		if (io_end && io_end->flag & EXT4_IO_END_UNWRITTEN)
 			set_buffer_defer_completion(bh);
 		bh->b_size = inode->i_sb->s_blocksize * map.m_len;
@@ -3043,13 +3060,14 @@ static ssize_t ext4_ext_direct_IO(int rw, struct kiocb *iocb,
 		get_block_func = ext4_get_block_write;
 		dio_flags = DIO_LOCKING;
 	}
-	ret = __blockdev_direct_IO(rw, iocb, inode,
-				   inode->i_sb->s_bdev, iter,
-				   offset,
-				   get_block_func,
-				   ext4_end_io_dio,
-				   NULL,
-				   dio_flags);
+	if (IS_DAX(inode))
+		ret = dax_do_io(rw, iocb, inode, iter, offset, get_block_func,
+				ext4_end_io_dio, dio_flags);
+	else
+		ret = __blockdev_direct_IO(rw, iocb, inode,
+					   inode->i_sb->s_bdev, iter, offset,
+					   get_block_func,
+					   ext4_end_io_dio, NULL, dio_flags);
 
 	/*
 	 * Put our reference to io_end. This can free the io_end structure e.g.
@@ -3213,19 +3231,12 @@ void ext4_set_aops(struct inode *inode)
 		inode->i_mapping->a_ops = &ext4_aops;
 }
 
-/*
- * ext4_block_zero_page_range() zeros out a mapping of length 'length'
- * starting from file offset 'from'.  The range to be zero'd must
- * be contained with in one block.  If the specified range exceeds
- * the end of the block it will be shortened to end of the block
- * that cooresponds to 'from'
- */
-static int ext4_block_zero_page_range(handle_t *handle,
+static int __ext4_block_zero_page_range(handle_t *handle,
 		struct address_space *mapping, loff_t from, loff_t length)
 {
 	ext4_fsblk_t index = from >> PAGE_CACHE_SHIFT;
 	unsigned offset = from & (PAGE_CACHE_SIZE-1);
-	unsigned blocksize, max, pos;
+	unsigned blocksize, pos;
 	ext4_lblk_t iblock;
 	struct inode *inode = mapping->host;
 	struct buffer_head *bh;
@@ -3238,14 +3249,6 @@ static int ext4_block_zero_page_range(handle_t *handle,
 		return -ENOMEM;
 
 	blocksize = inode->i_sb->s_blocksize;
-	max = blocksize - (offset & (blocksize - 1));
-
-	/*
-	 * correct length if it does not fall between
-	 * 'from' and the end of the block
-	 */
-	if (length > max || length < 0)
-		length = max;
 
 	iblock = index << (PAGE_CACHE_SHIFT - inode->i_sb->s_blocksize_bits);
 
@@ -3311,6 +3314,33 @@ unlock:
 }
 
 /*
+ * ext4_block_zero_page_range() zeros out a mapping of length 'length'
+ * starting from file offset 'from'.  The range to be zero'd must
+ * be contained with in one block.  If the specified range exceeds
+ * the end of the block it will be shortened to end of the block
+ * that cooresponds to 'from'
+ */
+static int ext4_block_zero_page_range(handle_t *handle,
+		struct address_space *mapping, loff_t from, loff_t length)
+{
+	struct inode *inode = mapping->host;
+	unsigned offset = from & (PAGE_CACHE_SIZE-1);
+	unsigned blocksize = inode->i_sb->s_blocksize;
+	unsigned max = blocksize - (offset & (blocksize - 1));
+
+	/*
+	 * correct length if it does not fall between
+	 * 'from' and the end of the block
+	 */
+	if (length > max || length < 0)
+		length = max;
+
+	if (IS_DAX(inode))
+		return dax_zero_page_range(inode, from, length, ext4_get_block);
+	return __ext4_block_zero_page_range(handle, mapping, from, length);
+}
+
+/*
  * ext4_block_truncate_page() zeroes out a mapping from file offset `from'
  * up to the end of the block which corresponds to `from'.
  * This required during truncate. We need to physically zero the tail end
@@ -3831,8 +3861,10 @@ void ext4_set_inode_flags(struct inode *inode)
 		new_fl |= S_NOATIME;
 	if (flags & EXT4_DIRSYNC_FL)
 		new_fl |= S_DIRSYNC;
+	if (test_opt(inode->i_sb, DAX))
+		new_fl |= S_DAX;
 	inode_set_flags(inode, new_fl,
-			S_SYNC|S_APPEND|S_IMMUTABLE|S_NOATIME|S_DIRSYNC);
+			S_SYNC|S_APPEND|S_IMMUTABLE|S_NOATIME|S_DIRSYNC|S_DAX);
 }
 
 /* Propagate flags from i_flags to EXT4_I(inode)->i_flags */
@@ -4086,7 +4118,10 @@ struct inode *ext4_iget(struct super_block *sb, unsigned long ino)
 
 	if (S_ISREG(inode->i_mode)) {
 		inode->i_op = &ext4_file_inode_operations;
-		inode->i_fop = &ext4_file_operations;
+		if (test_opt(inode->i_sb, DAX))
+			inode->i_fop = &ext4_dax_file_operations;
+		else
+			inode->i_fop = &ext4_file_operations;
 		ext4_set_aops(inode);
 	} else if (S_ISDIR(inode->i_mode)) {
 		inode->i_op = &ext4_dir_inode_operations;
@@ -4556,7 +4591,7 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr)
 		 * Truncate pagecache after we've waited for commit
 		 * in data=journal mode to make pages freeable.
 		 */
-			truncate_pagecache(inode, inode->i_size);
+		truncate_pagecache(inode, inode->i_size);
 	}
 	/*
 	 * We want to call ext4_truncate() even if attr->ia_size ==
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 603e4eb..8d744a5 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -2264,7 +2264,10 @@ retry:
 	err = PTR_ERR(inode);
 	if (!IS_ERR(inode)) {
 		inode->i_op = &ext4_file_inode_operations;
-		inode->i_fop = &ext4_file_operations;
+		if (test_opt(inode->i_sb, DAX))
+			inode->i_fop = &ext4_dax_file_operations;
+		else
+			inode->i_fop = &ext4_file_operations;
 		ext4_set_aops(inode);
 		err = ext4_add_nondir(handle, dentry, inode);
 		if (!err && IS_DIRSYNC(dir))
@@ -2328,7 +2331,10 @@ retry:
 	err = PTR_ERR(inode);
 	if (!IS_ERR(inode)) {
 		inode->i_op = &ext4_file_inode_operations;
-		inode->i_fop = &ext4_file_operations;
+		if (test_opt(inode->i_sb, DAX))
+			inode->i_fop = &ext4_dax_file_operations;
+		else
+			inode->i_fop = &ext4_file_operations;
 		ext4_set_aops(inode);
 		d_tmpfile(dentry, inode);
 		err = ext4_orphan_add(handle, inode);
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 05c1592..a68662d 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1162,7 +1162,7 @@ enum {
 	Opt_usrjquota, Opt_grpjquota, Opt_offusrjquota, Opt_offgrpjquota,
 	Opt_jqfmt_vfsold, Opt_jqfmt_vfsv0, Opt_jqfmt_vfsv1, Opt_quota,
 	Opt_noquota, Opt_barrier, Opt_nobarrier, Opt_err,
-	Opt_usrquota, Opt_grpquota, Opt_i_version,
+	Opt_usrquota, Opt_grpquota, Opt_i_version, Opt_dax,
 	Opt_stripe, Opt_delalloc, Opt_nodelalloc, Opt_mblk_io_submit,
 	Opt_nomblk_io_submit, Opt_block_validity, Opt_noblock_validity,
 	Opt_inode_readahead_blks, Opt_journal_ioprio,
@@ -1224,6 +1224,7 @@ static const match_table_t tokens = {
 	{Opt_barrier, "barrier"},
 	{Opt_nobarrier, "nobarrier"},
 	{Opt_i_version, "i_version"},
+	{Opt_dax, "dax"},
 	{Opt_stripe, "stripe=%u"},
 	{Opt_delalloc, "delalloc"},
 	{Opt_nodelalloc, "nodelalloc"},
@@ -1406,6 +1407,7 @@ static const struct mount_opts {
 	{Opt_min_batch_time, 0, MOPT_GTE0},
 	{Opt_inode_readahead_blks, 0, MOPT_GTE0},
 	{Opt_init_itable, 0, MOPT_GTE0},
+	{Opt_dax, EXT4_MOUNT_DAX, MOPT_SET},
 	{Opt_stripe, 0, MOPT_GTE0},
 	{Opt_resuid, 0, MOPT_GTE0},
 	{Opt_resgid, 0, MOPT_GTE0},
@@ -1642,6 +1644,11 @@ static int handle_mount_opt(struct super_block *sb, char *opt, int token,
 		}
 		sbi->s_jquota_fmt = m->mount_opt;
 #endif
+#ifndef CONFIG_FS_DAX
+	} else if (token == Opt_dax) {
+		ext4_msg(sb, KERN_INFO, "dax option not supported");
+		return -1;
+#endif
 	} else {
 		if (!args->from)
 			arg = 1;
@@ -3572,6 +3579,11 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 				 "both data=journal and dioread_nolock");
 			goto failed_mount;
 		}
+		if (test_opt(sb, DAX)) {
+			ext4_msg(sb, KERN_ERR, "can't mount with "
+				 "both data=journal and dax");
+			goto failed_mount;
+		}
 		if (test_opt(sb, DELALLOC))
 			clear_opt(sb, DELALLOC);
 	}
@@ -3635,6 +3647,19 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 		goto failed_mount;
 	}
 
+	if (sbi->s_mount_opt & EXT4_MOUNT_DAX) {
+		if (blocksize != PAGE_SIZE) {
+			ext4_msg(sb, KERN_ERR,
+					"error: unsupported blocksize for dax");
+			goto failed_mount;
+		}
+		if (!sb->s_bdev->bd_disk->fops->direct_access) {
+			ext4_msg(sb, KERN_ERR,
+					"error: device does not support dax");
+			goto failed_mount;
+		}
+	}
+
 	if (sb->s_blocksize != blocksize) {
 		/* Validate the filesystem blocksize */
 		if (!sb_set_blocksize(sb, blocksize)) {
@@ -4841,6 +4866,18 @@ static int ext4_remount(struct super_block *sb, int *flags, char *data)
 			err = -EINVAL;
 			goto restore_opts;
 		}
+		if (test_opt(sb, DAX)) {
+			ext4_msg(sb, KERN_ERR, "can't mount with "
+				 "both data=journal and dax");
+			err = -EINVAL;
+			goto restore_opts;
+		}
+	}
+
+	if ((sbi->s_mount_opt ^ old_opts.s_mount_opt) & EXT4_MOUNT_DAX) {
+		ext4_msg(sb, KERN_WARNING, "warning: refusing change of "
+			"dax flag with busy inodes while remounting");
+		sbi->s_mount_opt ^= EXT4_MOUNT_DAX;
 	}
 
 	if (sbi->s_mount_flags & EXT4_MF_FS_ABORTED)
-- 
2.1.1


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH v12 20/20] brd: Rename XIP to DAX
  2014-10-24 21:20 [PATCH v12 00/20] DAX: Page cache bypass for filesystems on memory storage Matthew Wilcox
                   ` (18 preceding siblings ...)
  2014-10-24 21:20 ` [PATCH v12 19/20] ext4: Add DAX functionality Matthew Wilcox
@ 2014-10-24 21:20 ` Matthew Wilcox
  2014-12-10 14:03 ` [PATCH v12 00/20] DAX: Page cache bypass for filesystems on memory storage Christoph Hellwig
  2015-01-12 23:09 ` Andrew Morton
  21 siblings, 0 replies; 60+ messages in thread
From: Matthew Wilcox @ 2014-10-24 21:20 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-mm
  Cc: Matthew Wilcox, Andrew Morton, Matthew Wilcox

From: Matthew Wilcox <willy@linux.intel.com>

Since this is relating to FS_XIP, not KERNEL_XIP, it should be called
DAX instead of XIP.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
---
 drivers/block/Kconfig | 13 +++++++------
 drivers/block/brd.c   | 14 +++++++-------
 2 files changed, 14 insertions(+), 13 deletions(-)

diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
index 014a1cf..1b8094d 100644
--- a/drivers/block/Kconfig
+++ b/drivers/block/Kconfig
@@ -393,14 +393,15 @@ config BLK_DEV_RAM_SIZE
 	  The default value is 4096 kilobytes. Only change this if you know
 	  what you are doing.
 
-config BLK_DEV_XIP
-	bool "Support XIP filesystems on RAM block device"
-	depends on BLK_DEV_RAM
+config BLK_DEV_RAM_DAX
+	bool "Support Direct Access (DAX) to RAM block devices"
+	depends on BLK_DEV_RAM && FS_DAX
 	default n
 	help
-	  Support XIP filesystems (such as ext2 with XIP support on) on
-	  top of block ram device. This will slightly enlarge the kernel, and
-	  will prevent RAM block device backing store memory from being
+	  Support filesystems using DAX to access RAM block devices.  This
+	  avoids double-buffering data in the page cache before copying it
+	  to the block device.  Answering Y will slightly enlarge the kernel,
+	  and will prevent RAM block device backing store memory from being
 	  allocated from highmem (only a problem for highmem systems).
 
 config CDROM_PKTCDVD
diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 89e90ec..898b4f2 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -97,13 +97,13 @@ static struct page *brd_insert_page(struct brd_device *brd, sector_t sector)
 	 * Must use NOIO because we don't want to recurse back into the
 	 * block or filesystem layers from page reclaim.
 	 *
-	 * Cannot support XIP and highmem, because our ->direct_access
-	 * routine for XIP must return memory that is always addressable.
-	 * If XIP was reworked to use pfns and kmap throughout, this
+	 * Cannot support DAX and highmem, because our ->direct_access
+	 * routine for DAX must return memory that is always addressable.
+	 * If DAX was reworked to use pfns and kmap throughout, this
 	 * restriction might be able to be lifted.
 	 */
 	gfp_flags = GFP_NOIO | __GFP_ZERO;
-#ifndef CONFIG_BLK_DEV_XIP
+#ifndef CONFIG_BLK_DEV_RAM_DAX
 	gfp_flags |= __GFP_HIGHMEM;
 #endif
 	page = alloc_page(gfp_flags);
@@ -369,7 +369,7 @@ static int brd_rw_page(struct block_device *bdev, sector_t sector,
 	return err;
 }
 
-#ifdef CONFIG_BLK_DEV_XIP
+#ifdef CONFIG_BLK_DEV_RAM_DAX
 static long brd_direct_access(struct block_device *bdev, sector_t sector,
 			void **kaddr, unsigned long *pfn, long size)
 {
@@ -390,6 +390,8 @@ static long brd_direct_access(struct block_device *bdev, sector_t sector,
 	 */
 	return PAGE_SIZE;
 }
+#else
+#define brd_direct_access NULL
 #endif
 
 static int brd_ioctl(struct block_device *bdev, fmode_t mode,
@@ -430,9 +432,7 @@ static const struct block_device_operations brd_fops = {
 	.owner =		THIS_MODULE,
 	.rw_page =		brd_rw_page,
 	.ioctl =		brd_ioctl,
-#ifdef CONFIG_BLK_DEV_XIP
 	.direct_access =	brd_direct_access,
-#endif
 };
 
 /*
-- 
2.1.1


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* Re: [PATCH v12 00/20] DAX: Page cache bypass for filesystems on memory storage
  2014-10-24 21:20 [PATCH v12 00/20] DAX: Page cache bypass for filesystems on memory storage Matthew Wilcox
                   ` (19 preceding siblings ...)
  2014-10-24 21:20 ` [PATCH v12 20/20] brd: Rename XIP to DAX Matthew Wilcox
@ 2014-12-10 14:03 ` Christoph Hellwig
  2014-12-10 14:12   ` Matthew Wilcox
  2015-01-12 23:09 ` Andrew Morton
  21 siblings, 1 reply; 60+ messages in thread
From: Christoph Hellwig @ 2014-12-10 14:03 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-fsdevel, linux-kernel, linux-mm, willy, Andrew Morton

What is the status of this patch set?

On Fri, Oct 24, 2014 at 05:20:32PM -0400, Matthew Wilcox wrote:
> DAX is a replacement for the variation of XIP currently supported by
> the ext2 filesystem.  We have three different things in the tree called
> 'XIP', and the new focus is on access to data rather than executables,
> so a name change was in order.  DAX stands for Direct Access.  The X is
> for eXciting.
> 
> The new focus on data access has resulted in more careful attention to
> races that exist in the current XIP code, but are not hit by the use-case
> that it was designed for.  XIP's architecture worked fine for ext2, but
> DAX is architected to work with modern filsystems such as ext4 and XFS.
> DAX is not intended for use with btrfs; the value that btrfs adds relies
> on manipulating data and writing data to different locations, while DAX's
> value is for write-in-place and keeping the kernel from touching the data.
> 
> DAX was developed in order to support NV-DIMMs, but it's become clear that
> its usefuless extends beyond NV-DIMMs and there are several potential
> customers including the tracing machinery.  Other people want to place
> the kernel log in an area of memory, as long as they have a BIOS that
> does not clear DRAM on reboot.
> 
> Patch 1 is a bug fix.  It is obviously correct, and should be included
> into 3.18.
> 
> Patch 2 starts the transformation by changing how ->direct_access works.
> Much code is moved from the drivers and filesystems into the block layer,
> and we add the flexibility of being able to map more than one page at
> a time.  It would be good to get this patch into 3.18 as it is also
> useful for people who are pursuing non-DAX approaches to working with
> persistent memory.
> 
> Patch 3 is also a bug fix, probably worth including in 3.18.
> 
> Patches 4 & 5 are infrastructure for DAX.
> 
> Patches 6-10 replace the XIP code with its DAX equivalents, transforming
> ext2 to use the DAX code as we go.  Note that patch 10 is the
> Documentation patch.
> 
> Patches 11-17 clean up after the XIP code, removing the infrastructure
> that is no longer needed and renaming various XIP things to DAX.
> Most of these patches were added after Jan found things he didn't
> like in an earlier version of the ext4 patch ... that had been copied
> from ext2.  So ext2 i being transformed to do things the same way that
> ext4 will later.  The ability to mount ext2 filesystems with the 'xip'
> option is retained, although the 'dax' option is now preferred.
> 
> Patch 18 adds some DAX infrastructure to support ext4.
> 
> Patch 19 adds DAX support to ext4.  It is broadly similar to ext2's DAX
> support, but it is more efficient than ext4's due to its support for
> unwritten extents.
> 
> Patch 20 is another cleanup patch renaming XIP to DAX.
> 
> 
> My thanks to Mathieu Desnoyers for his reviews of the v11 patchset.  Most
> of the changes below were based on his feedback.
> 
> Changes since v11:
>  - Rebased to 3.18-rc1, dropping patch "vfs: Add copy_to_iter(),
>    copy_from_iter() and iov_iter_zero()" as it was merged through Al's tree.
>  - Added cc to stable@vger.kernel.org on patch 1
>  - Fixed comment style in brd.c (Mathieu)
>  - Make more functions in fs.h common with and without CONFIG_FS_DAX set
>  - Improve type-checking with !CONFIG_FS_DAX
>  - Simplify check for holes in dax_io()
>  - Harden the loop in dax_clear_blocks()
>  - Add missing check against truncate of a page covering a hole
>  - Fix the page-fault handler to work for block devices too
>  - Change a few more places that mentioned 'XIP' into 'DAX'
>  - Update DAX documentation in a couple of places
> 
> Matthew Wilcox (19):
>   axonram: Fix bug in direct_access
>   block: Change direct_access calling convention
>   mm: Fix XIP fault vs truncate race
>   mm: Allow page fault handlers to perform the COW
>   vfs,ext2: Introduce IS_DAX(inode)
>   dax,ext2: Replace XIP read and write with DAX I/O
>   dax,ext2: Replace ext2_clear_xip_target with dax_clear_blocks
>   dax,ext2: Replace the XIP page fault handler with the DAX page fault
>     handler
>   dax,ext2: Replace xip_truncate_page with dax_truncate_page
>   dax: Replace XIP documentation with DAX documentation
>   vfs: Remove get_xip_mem
>   ext2: Remove ext2_xip_verify_sb()
>   ext2: Remove ext2_use_xip
>   ext2: Remove xip.c and xip.h
>   vfs,ext2: Remove CONFIG_EXT2_FS_XIP and rename CONFIG_FS_XIP to
>     CONFIG_FS_DAX
>   ext2: Remove ext2_aops_xip
>   ext2: Get rid of most mentions of XIP in ext2
>   dax: Add dax_zero_page_range
>   brd: Rename XIP to DAX
> 
> Ross Zwisler (1):
>   ext4: Add DAX functionality
> 
>  Documentation/filesystems/00-INDEX |   5 +-
>  Documentation/filesystems/Locking  |   3 -
>  Documentation/filesystems/dax.txt  |  91 +++++++
>  Documentation/filesystems/ext2.txt |   5 +-
>  Documentation/filesystems/ext4.txt |   4 +
>  Documentation/filesystems/vfs.txt  |   7 -
>  Documentation/filesystems/xip.txt  |  68 -----
>  MAINTAINERS                        |   6 +
>  arch/powerpc/sysdev/axonram.c      |  19 +-
>  drivers/block/Kconfig              |  13 +-
>  drivers/block/brd.c                |  28 +-
>  drivers/s390/block/dcssblk.c       |  21 +-
>  fs/Kconfig                         |  21 +-
>  fs/Makefile                        |   1 +
>  fs/block_dev.c                     |  40 +++
>  fs/dax.c                           | 530 +++++++++++++++++++++++++++++++++++++
>  fs/exofs/inode.c                   |   1 -
>  fs/ext2/Kconfig                    |  11 -
>  fs/ext2/Makefile                   |   1 -
>  fs/ext2/ext2.h                     |  10 +-
>  fs/ext2/file.c                     |  45 +++-
>  fs/ext2/inode.c                    |  38 +--
>  fs/ext2/namei.c                    |  13 +-
>  fs/ext2/super.c                    |  53 ++--
>  fs/ext2/xip.c                      |  91 -------
>  fs/ext2/xip.h                      |  26 --
>  fs/ext4/ext4.h                     |   6 +
>  fs/ext4/file.c                     |  50 +++-
>  fs/ext4/indirect.c                 |  18 +-
>  fs/ext4/inode.c                    |  89 +++++--
>  fs/ext4/namei.c                    |  10 +-
>  fs/ext4/super.c                    |  39 ++-
>  fs/open.c                          |   5 +-
>  include/linux/blkdev.h             |   6 +-
>  include/linux/fs.h                 |  34 +--
>  include/linux/mm.h                 |   1 +
>  include/linux/rmap.h               |   2 +-
>  mm/Makefile                        |   1 -
>  mm/fadvise.c                       |   6 +-
>  mm/filemap.c                       |  25 +-
>  mm/filemap_xip.c                   | 483 ---------------------------------
>  mm/madvise.c                       |   2 +-
>  mm/memory.c                        |  33 ++-
>  scripts/diffconfig                 |   1 -
>  44 files changed, 1069 insertions(+), 893 deletions(-)
>  create mode 100644 Documentation/filesystems/dax.txt
>  delete mode 100644 Documentation/filesystems/xip.txt
>  create mode 100644 fs/dax.c
>  delete mode 100644 fs/ext2/xip.c
>  delete mode 100644 fs/ext2/xip.h
>  delete mode 100644 mm/filemap_xip.c
> 
> -- 
> 2.1.1
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
---end quoted text---

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v12 00/20] DAX: Page cache bypass for filesystems on memory storage
  2014-12-10 14:03 ` [PATCH v12 00/20] DAX: Page cache bypass for filesystems on memory storage Christoph Hellwig
@ 2014-12-10 14:12   ` Matthew Wilcox
  2014-12-10 14:28     ` Jeff Moyer
                       ` (2 more replies)
  0 siblings, 3 replies; 60+ messages in thread
From: Matthew Wilcox @ 2014-12-10 14:12 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Matthew Wilcox, linux-fsdevel, linux-kernel, linux-mm, willy,
	Andrew Morton

On Wed, Dec 10, 2014 at 06:03:47AM -0800, Christoph Hellwig wrote:
> What is the status of this patch set?

I have no outstanding bug reports against it.  Linus told me that he
wants to see it come through Andrew's tree.  I have an email two weeks
ago from Andrew saying that it's on his list.  I would love to see it
merged since it's almost a year old at this point.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v12 00/20] DAX: Page cache bypass for filesystems on memory storage
  2014-12-10 14:12   ` Matthew Wilcox
@ 2014-12-10 14:28     ` Jeff Moyer
  2014-12-10 20:53     ` Dave Chinner
  2015-01-05 18:41     ` Christoph Hellwig
  2 siblings, 0 replies; 60+ messages in thread
From: Jeff Moyer @ 2014-12-10 14:28 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Christoph Hellwig, Matthew Wilcox, linux-fsdevel, linux-kernel,
	linux-mm, Andrew Morton

Matthew Wilcox <willy@linux.intel.com> writes:

> On Wed, Dec 10, 2014 at 06:03:47AM -0800, Christoph Hellwig wrote:
>> What is the status of this patch set?
>
> I have no outstanding bug reports against it.  Linus told me that he
> wants to see it come through Andrew's tree.  I have an email two weeks
> ago from Andrew saying that it's on his list.  I would love to see it
> merged since it's almost a year old at this point.

I'd also like to see this go in soon.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v12 00/20] DAX: Page cache bypass for filesystems on memory storage
  2014-12-10 14:12   ` Matthew Wilcox
  2014-12-10 14:28     ` Jeff Moyer
@ 2014-12-10 20:53     ` Dave Chinner
  2015-01-05 18:41     ` Christoph Hellwig
  2 siblings, 0 replies; 60+ messages in thread
From: Dave Chinner @ 2014-12-10 20:53 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Christoph Hellwig, Matthew Wilcox, linux-fsdevel, linux-kernel,
	linux-mm, Andrew Morton

On Wed, Dec 10, 2014 at 09:12:11AM -0500, Matthew Wilcox wrote:
> On Wed, Dec 10, 2014 at 06:03:47AM -0800, Christoph Hellwig wrote:
> > What is the status of this patch set?
> 
> I have no outstanding bug reports against it.  Linus told me that he
> wants to see it come through Andrew's tree.  I have an email two weeks
> ago from Andrew saying that it's on his list.  I would love to see it
> merged since it's almost a year old at this point.

Yup, and I've been sitting on the XFS patches to enable DAX for
quite a few months. I'm waiting for it to hit the upstream trees so
I can push it...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v12 00/20] DAX: Page cache bypass for filesystems on memory storage
  2014-12-10 14:12   ` Matthew Wilcox
  2014-12-10 14:28     ` Jeff Moyer
  2014-12-10 20:53     ` Dave Chinner
@ 2015-01-05 18:41     ` Christoph Hellwig
  2015-01-06  8:47       ` Andrew Morton
  2 siblings, 1 reply; 60+ messages in thread
From: Christoph Hellwig @ 2015-01-05 18:41 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Matthew Wilcox, linux-fsdevel, linux-kernel, linux-mm,
	Andrew Morton, Linus Torvalds

On Wed, Dec 10, 2014 at 09:12:11AM -0500, Matthew Wilcox wrote:
> On Wed, Dec 10, 2014 at 06:03:47AM -0800, Christoph Hellwig wrote:
> > What is the status of this patch set?
> 
> I have no outstanding bug reports against it.  Linus told me that he
> wants to see it come through Andrew's tree.  I have an email two weeks
> ago from Andrew saying that it's on his list.  I would love to see it
> merged since it's almost a year old at this point.

And since then another month and aother merge window has passed.  Is
there any way to speed up merging big patch sets like this one?

Another one is non-blocking read one that has real life use on one
of the biggest server side webapp frameworks but doesn't seem to make
progress, which is a bit frustrating.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v12 00/20] DAX: Page cache bypass for filesystems on memory storage
  2015-01-05 18:41     ` Christoph Hellwig
@ 2015-01-06  8:47       ` Andrew Morton
  2015-01-08 11:49         ` pread2/ pwrite2 Christoph Hellwig
                           ` (2 more replies)
  0 siblings, 3 replies; 60+ messages in thread
From: Andrew Morton @ 2015-01-06  8:47 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Matthew Wilcox, Matthew Wilcox, linux-fsdevel, linux-kernel,
	linux-mm, Linus Torvalds, Milosz Tanski

On Mon, 5 Jan 2015 10:41:43 -0800 Christoph Hellwig <hch@infradead.org> wrote:

> On Wed, Dec 10, 2014 at 09:12:11AM -0500, Matthew Wilcox wrote:
> > On Wed, Dec 10, 2014 at 06:03:47AM -0800, Christoph Hellwig wrote:
> > > What is the status of this patch set?
> > 
> > I have no outstanding bug reports against it.  Linus told me that he
> > wants to see it come through Andrew's tree.  I have an email two weeks
> > ago from Andrew saying that it's on his list.  I would love to see it
> > merged since it's almost a year old at this point.
> 
> And since then another month and aother merge window has passed.  Is
> there any way to speed up merging big patch sets like this one?

I took a look at dax last time and found it to be unreviewable due to
lack of design description, objectives and code comments.  Hopefully
that's been addressed - I should get back to it fairly soon as I chew
through merge window and holiday backlog.

> Another one is non-blocking read one that has real life use on one
> of the biggest server side webapp frameworks but doesn't seem to make
> progress, which is a bit frustrating.

I took a look at pread2() as well and I have two main issues:

- The patchset includes a pwrite2() syscall which has nothing to do
  with nonblocking reads and which was poorly described and had little
  justification for inclusion.

- We've talked for years about implementing this via fincore+pread
  and at least two fincore implementations are floating about.  Now
  along comes pread2() which does it all in one hit.

  Which approach is best?  I expect fincore+pread is simpler, more
  flexible and more maintainable.  But pread2() will have lower CPU
  consumption and lower average-case latency.

  But how *much* better is pread2()?  I expect the difference will be
  minor because these operations are associated with a great big
  cache-stomping memcpy.  If the pread2() advantage is "insignificant
  for real world workloads" then perhaps it isn't the best way to go.

  I just don't know, and diligence requires that we answer the
  question.  But all I've seen in response to these questions is
  handwaving.  It would be a shame to make a mistake because nobody
  found the time to perform the investigation.

Also, integration of pread2() into xfstests is (or was) happening and
the results of that aren't yet known.


^ permalink raw reply	[flat|nested] 60+ messages in thread

* pread2/ pwrite2
  2015-01-06  8:47       ` Andrew Morton
@ 2015-01-08 11:49         ` Christoph Hellwig
  2015-01-09 19:30           ` Steve French
  2015-01-08 16:28         ` [PATCH v12 00/20] DAX: Page cache bypass for filesystems on memory storage Milosz Tanski
  2015-01-12 14:47         ` Matthew Wilcox
  2 siblings, 1 reply; 60+ messages in thread
From: Christoph Hellwig @ 2015-01-08 11:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-fsdevel, linux-kernel, linux-mm, Linus Torvalds, Milosz Tanski

On Tue, Jan 06, 2015 at 12:47:14AM -0800, Andrew Morton wrote:
> > progress, which is a bit frustrating.
> 
> I took a look at pread2() as well and I have two main issues:
> 
> - The patchset includes a pwrite2() syscall which has nothing to do
>   with nonblocking reads and which was poorly described and had little
>   justification for inclusion.

It allows to do O_SYNC writes on a per-I/O basis.  This is very useful
for file servers (smb, cifs) as well as storage target devices.

Note: that part was my addition, and the complaint about lacking
description ever made it to me.  Can you point to the relevant
questions?

> - We've talked for years about implementing this via fincore+pread
>   and at least two fincore implementations are floating about.  Now
>   along comes pread2() which does it all in one hit.
> 
>   Which approach is best?  I expect fincore+pread is simpler, more
>   flexible and more maintainable.  But pread2() will have lower CPU
>   consumption and lower average-case latency.

fincore+pread is inherently racy and thus entirely unsuitable for the
use case of a non-blockign main thread.

Nevermind that the pread2 path is way simpler than any of the proposed
fincore patches.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v12 00/20] DAX: Page cache bypass for filesystems on memory storage
  2015-01-06  8:47       ` Andrew Morton
  2015-01-08 11:49         ` pread2/ pwrite2 Christoph Hellwig
@ 2015-01-08 16:28         ` Milosz Tanski
  2015-01-08 17:36           ` Jeremy Allison
  2015-01-12 14:47         ` Matthew Wilcox
  2 siblings, 1 reply; 60+ messages in thread
From: Milosz Tanski @ 2015-01-08 16:28 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Hellwig, Matthew Wilcox, Matthew Wilcox, linux-fsdevel,
	LKML, linux-mm, Linus Torvalds

On Tue, Jan 6, 2015 at 3:47 AM, Andrew Morton <akpm@linux-foundation.org> wrote:
> On Mon, 5 Jan 2015 10:41:43 -0800 Christoph Hellwig <hch@infradead.org> wrote:
>
>> On Wed, Dec 10, 2014 at 09:12:11AM -0500, Matthew Wilcox wrote:
>> > On Wed, Dec 10, 2014 at 06:03:47AM -0800, Christoph Hellwig wrote:
>> > > What is the status of this patch set?
>> >
>> > I have no outstanding bug reports against it.  Linus told me that he
>> > wants to see it come through Andrew's tree.  I have an email two weeks
>> > ago from Andrew saying that it's on his list.  I would love to see it
>> > merged since it's almost a year old at this point.
>>
>> And since then another month and aother merge window has passed.  Is
>> there any way to speed up merging big patch sets like this one?
>
> I took a look at dax last time and found it to be unreviewable due to
> lack of design description, objectives and code comments.  Hopefully
> that's been addressed - I should get back to it fairly soon as I chew
> through merge window and holiday backlog.
>
>> Another one is non-blocking read one that has real life use on one
>> of the biggest server side webapp frameworks but doesn't seem to make
>> progress, which is a bit frustrating.
>
> I took a look at pread2() as well and I have two main issues:
>
> - The patchset includes a pwrite2() syscall which has nothing to do
>   with nonblocking reads and which was poorly described and had little
>   justification for inclusion.
>
> - We've talked for years about implementing this via fincore+pread
>   and at least two fincore implementations are floating about.  Now
>   along comes pread2() which does it all in one hit.
>
>   Which approach is best?  I expect fincore+pread is simpler, more
>   flexible and more maintainable.  But pread2() will have lower CPU
>   consumption and lower average-case latency.
>
>   But how *much* better is pread2()?  I expect the difference will be
>   minor because these operations are associated with a great big
>   cache-stomping memcpy.  If the pread2() advantage is "insignificant
>   for real world workloads" then perhaps it isn't the best way to go.
>
>   I just don't know, and diligence requires that we answer the
>   question.  But all I've seen in response to these questions is
>   handwaving.  It would be a shame to make a mistake because nobody
>   found the time to perform the investigation.
>
> Also, integration of pread2() into xfstests is (or was) happening and
> the results of that aren't yet known.
>

Andrew I  got busier with my other job related things between the
Thanksgiving & Christmas then anticipated. However, I have updated and
taken apart the patchset into two pieces (preadv2 and pwritev2). That
should make evaluating the two separately easier. With the help of
Volker I hacked up preadv2 support into samba and I hopefully have
some numbers from it soon. Finally, I'm putting together a test case
for the typical webapp middle-tier service (epoll + threadpool for
diskio).

Haven't stopped, just progressing on that slower due to external factors.

P.S: Sorry for re-send. On the road and was using gmail to respond
with... it randomly forgets plain-text only settings.

-- 
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: milosz@adfin.com

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v12 00/20] DAX: Page cache bypass for filesystems on memory storage
  2015-01-08 16:28         ` [PATCH v12 00/20] DAX: Page cache bypass for filesystems on memory storage Milosz Tanski
@ 2015-01-08 17:36           ` Jeremy Allison
  0 siblings, 0 replies; 60+ messages in thread
From: Jeremy Allison @ 2015-01-08 17:36 UTC (permalink / raw)
  To: Milosz Tanski
  Cc: Andrew Morton, Christoph Hellwig, Matthew Wilcox, Matthew Wilcox,
	linux-fsdevel, LKML, linux-mm, Linus Torvalds

On Thu, Jan 08, 2015 at 11:28:40AM -0500, Milosz Tanski wrote:
> >
> 
> Andrew I  got busier with my other job related things between the
> Thanksgiving & Christmas then anticipated. However, I have updated and
> taken apart the patchset into two pieces (preadv2 and pwritev2). That
> should make evaluating the two separately easier. With the help of
> Volker I hacked up preadv2 support into samba and I hopefully have
> some numbers from it soon. Finally, I'm putting together a test case

I'd be very interested in seeing that patch code and those
numbers !

Cheers,

	Jeremy.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: pread2/ pwrite2
  2015-01-08 11:49         ` pread2/ pwrite2 Christoph Hellwig
@ 2015-01-09 19:30           ` Steve French
  0 siblings, 0 replies; 60+ messages in thread
From: Steve French @ 2015-01-09 19:30 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Andrew Morton, linux-fsdevel, LKML, linux-mm, Linus Torvalds,
	Milosz Tanski

On Thu, Jan 8, 2015 at 5:49 AM, Christoph Hellwig <hch@infradead.org> wrote:
> On Tue, Jan 06, 2015 at 12:47:14AM -0800, Andrew Morton wrote:
>> > progress, which is a bit frustrating.
>>
>> I took a look at pread2() as well and I have two main issues:
>>
>> - The patchset includes a pwrite2() syscall which has nothing to do
>>   with nonblocking reads and which was poorly described and had little
>>   justification for inclusion.
>
> It allows to do O_SYNC writes on a per-I/O basis.  This is very useful
> for file servers (smb, cifs) as well as storage target devices.

This would be particularly useful for SMB3 as the protocol now allows
write-through vs. no-write-through flag on every write request (not just
on an open, it can be changed on a particular i/o to write-through).
There is also a cache/no-cache hint that can be sent on reads/writes in
the newest SMB3 dialect well (but it is less clear to me how we would
ever decide to set that on the Linux client).




-- 
Thanks,

Steve

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v12 00/20] DAX: Page cache bypass for filesystems on memory storage
  2015-01-06  8:47       ` Andrew Morton
  2015-01-08 11:49         ` pread2/ pwrite2 Christoph Hellwig
  2015-01-08 16:28         ` [PATCH v12 00/20] DAX: Page cache bypass for filesystems on memory storage Milosz Tanski
@ 2015-01-12 14:47         ` Matthew Wilcox
  2 siblings, 0 replies; 60+ messages in thread
From: Matthew Wilcox @ 2015-01-12 14:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Hellwig, Matthew Wilcox, Matthew Wilcox, linux-fsdevel,
	linux-kernel, linux-mm, Linus Torvalds, Milosz Tanski

On Tue, Jan 06, 2015 at 12:47:14AM -0800, Andrew Morton wrote:
> On Mon, 5 Jan 2015 10:41:43 -0800 Christoph Hellwig <hch@infradead.org> wrote:
> 
> > On Wed, Dec 10, 2014 at 09:12:11AM -0500, Matthew Wilcox wrote:
> > > On Wed, Dec 10, 2014 at 06:03:47AM -0800, Christoph Hellwig wrote:
> > > > What is the status of this patch set?
> > > 
> > > I have no outstanding bug reports against it.  Linus told me that he
> > > wants to see it come through Andrew's tree.  I have an email two weeks
> > > ago from Andrew saying that it's on his list.  I would love to see it
> > > merged since it's almost a year old at this point.
> > 
> > And since then another month and aother merge window has passed.  Is
> > there any way to speed up merging big patch sets like this one?
> 
> I took a look at dax last time and found it to be unreviewable due to
> lack of design description, objectives and code comments.  Hopefully
> that's been addressed - I should get back to it fairly soon as I chew
> through merge window and holiday backlog.

Now that Jens has merged patches 1 and 2 into his block tree, you don't
need to spend any time looking at those.  If I could trouble you to
merge patches 3 & 4 through mm, the rest of the patches are VFS/ext2,
and maye we could merge those through Al's tree instead of taking your
valuable time?

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v12 00/20] DAX: Page cache bypass for filesystems on memory storage
  2014-10-24 21:20 [PATCH v12 00/20] DAX: Page cache bypass for filesystems on memory storage Matthew Wilcox
                   ` (20 preceding siblings ...)
  2014-12-10 14:03 ` [PATCH v12 00/20] DAX: Page cache bypass for filesystems on memory storage Christoph Hellwig
@ 2015-01-12 23:09 ` Andrew Morton
  21 siblings, 0 replies; 60+ messages in thread
From: Andrew Morton @ 2015-01-12 23:09 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-fsdevel, linux-kernel, linux-mm, willy

On Fri, 24 Oct 2014 17:20:32 -0400 Matthew Wilcox <matthew.r.wilcox@intel.com> wrote:

> DAX is a replacement for the variation of XIP currently supported by
> the ext2 filesystem.

Looks pretty nice to me, thanks.  I had a bunch of relatively minor
review questions - mainly stuff which would benefit from some short
comments.

I had to do some mangling due to the intervening
i_mmap_mutex->i_mmap_lock_read/write.  I ended up choosing
i_mmap_lock_read() throughout, which needs careful checking please.  I
also converted the changelogs.  It still compiles!


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v12 03/20] mm: Fix XIP fault vs truncate race
  2014-10-24 21:20 ` [PATCH v12 03/20] mm: Fix XIP fault vs truncate race Matthew Wilcox
@ 2015-01-12 23:09   ` Andrew Morton
  2015-01-13 18:50     ` Matthew Wilcox
  0 siblings, 1 reply; 60+ messages in thread
From: Andrew Morton @ 2015-01-12 23:09 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-fsdevel, linux-kernel, linux-mm, willy

On Fri, 24 Oct 2014 17:20:35 -0400 Matthew Wilcox <matthew.r.wilcox@intel.com> wrote:

> Pagecache faults recheck i_size after taking the page lock to ensure that
> the fault didn't race against a truncate.  We don't have a page to lock
> in the XIP case, so use the i_mmap_mutex instead.  It is locked in the
> truncate path in unmap_mapping_range() after updating i_size.  So while
> we hold it in the fault path, we are guaranteed that either i_size has
> already been updated in the truncate path, or that the truncate will
> subsequently call zap_page_range_single() and so remove the mapping we
> have just inserted.
> 
> There is a window of time in which i_size has been reduced and the
> thread has a mapping to a page which will be removed from the file,
> but this is harmless as the page will not be allocated to a different
> purpose before the thread's access to it is revoked.
> 

i_mmap_mutex is no more.  I made what are hopefulyl the appropriate
changes.

Also, that new locking rule is pretty subtle and we need to find a way
of alerting readers (and modifiers) of mm/memory.c to DAX's use of
i_mmap_lock().  Please review my suggested addition for accuracy and
cmopleteness.


From: Andrew Morton <akpm@linux-foundation.org>
Subject: mm-fix-xip-fault-vs-truncate-race-fix

switch to i_mmap_lock_read(), add comment in unmap_single_vma()

Cc: Jan Kara <jack@suse.cz>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/filemap_xip.c |   20 +++++++++++++-------
 mm/memory.c      |    5 +++++
 2 files changed, 18 insertions(+), 7 deletions(-)

diff -puN mm/filemap_xip.c~mm-fix-xip-fault-vs-truncate-race-fix mm/filemap_xip.c
--- a/mm/filemap_xip.c~mm-fix-xip-fault-vs-truncate-race-fix
+++ a/mm/filemap_xip.c
@@ -255,17 +255,20 @@ again:
 		__xip_unmap(mapping, vmf->pgoff);
 
 found:
-		/* We must recheck i_size under i_mmap_mutex */
-		mutex_lock(&mapping->i_mmap_mutex);
+		/*
+		 * We must recheck i_size under i_mmap_rwsem to prevent races
+		 * with truncation
+		 */
+		i_mmap_lock_read(mapping);
 		size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >>
 							PAGE_CACHE_SHIFT;
 		if (unlikely(vmf->pgoff >= size)) {
-			mutex_unlock(&mapping->i_mmap_mutex);
+			i_mmap_unlock_read(mapping);
 			return VM_FAULT_SIGBUS;
 		}
 		err = vm_insert_mixed(vma, (unsigned long)vmf->virtual_address,
 							xip_pfn);
-		mutex_unlock(&mapping->i_mmap_mutex);
+		i_mmap_unlock_read(mapping);
 		if (err == -ENOMEM)
 			return VM_FAULT_OOM;
 		/*
@@ -290,8 +293,11 @@ found:
 		if (error != -ENODATA)
 			goto out;
 
-		/* We must recheck i_size under i_mmap_mutex */
-		mutex_lock(&mapping->i_mmap_mutex);
+		/*
+		 * We must recheck i_size under i_mmap_rwsem to prevent races
+		 * with truncation
+		 */
+		i_mmap_lock_read(mapping);
 		size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >>
 							PAGE_CACHE_SHIFT;
 		if (unlikely(vmf->pgoff >= size)) {
@@ -309,7 +315,7 @@ found:
 
 		ret = VM_FAULT_NOPAGE;
 unlock:
-		mutex_unlock(&mapping->i_mmap_mutex);
+		i_mmap_unlock_read(mapping);
 out:
 		write_seqcount_end(&xip_sparse_seq);
 		mutex_unlock(&xip_sparse_mutex);
diff -puN mm/memory.c~mm-fix-xip-fault-vs-truncate-race-fix mm/memory.c
--- a/mm/memory.c~mm-fix-xip-fault-vs-truncate-race-fix
+++ a/mm/memory.c
@@ -1327,6 +1327,11 @@ static void unmap_single_vma(struct mmu_
 			 * safe to do nothing in this case.
 			 */
 			if (vma->vm_file) {
+				/*
+				 * Note that DAX uses i_mmap_lock to serialise
+				 * against file truncate - truncate calls into
+				 * unmap_single_vma().
+				 */
 				i_mmap_lock_write(vma->vm_file->f_mapping);
 				__unmap_hugepage_range_final(tlb, vma, start, end, NULL);
 				i_mmap_unlock_write(vma->vm_file->f_mapping);
_


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v12 04/20] mm: Allow page fault handlers to perform the COW
  2014-10-24 21:20 ` [PATCH v12 04/20] mm: Allow page fault handlers to perform the COW Matthew Wilcox
@ 2015-01-12 23:09   ` Andrew Morton
  2015-01-13 18:58     ` Matthew Wilcox
  2015-02-05  9:16   ` Yigal Korman
  1 sibling, 1 reply; 60+ messages in thread
From: Andrew Morton @ 2015-01-12 23:09 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-fsdevel, linux-kernel, linux-mm, willy

On Fri, 24 Oct 2014 17:20:36 -0400 Matthew Wilcox <matthew.r.wilcox@intel.com> wrote:

> Currently COW of an XIP file is done by first bringing in a read-only
> mapping, then retrying the fault and copying the page.  It is much more
> efficient to tell the fault handler that a COW is being attempted (by
> passing in the pre-allocated page in the vm_fault structure), and allow
> the handler to perform the COW operation itself.
> 
> The handler cannot insert the page itself if there is already a read-only
> mapping at that address, so allow the handler to return VM_FAULT_LOCKED
> and set the fault_page to be NULL.  This indicates to the MM code that
> the i_mmap_mutex is held instead of the page lock.

Again, the locking gets a bit subtle.  How can we make this clearer to
readers of the core code.  I had a shot but it's a bit lame - DAX uses
i_mmap_lock for what???

If I know that, I'd know whether to have used i_mmap_lock_read() or
i_mmap_lock_write() :(


From: Andrew Morton <akpm@linux-foundation.org>
Subject: mm-allow-page-fault-handlers-to-perform-the-cow-fix

Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/memory.c |   12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff -puN include/linux/mm.h~mm-allow-page-fault-handlers-to-perform-the-cow-fix include/linux/mm.h
diff -puN mm/memory.c~mm-allow-page-fault-handlers-to-perform-the-cow-fix mm/memory.c
--- a/mm/memory.c~mm-allow-page-fault-handlers-to-perform-the-cow-fix
+++ a/mm/memory.c
@@ -2961,7 +2961,11 @@ static int do_cow_fault(struct mm_struct
 			unlock_page(fault_page);
 			page_cache_release(fault_page);
 		} else {
-			mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
+			/*
+			 * DAX doesn't have a page to lock, so it uses
+			 * i_mmap_lock()
+			 */
+			i_mmap_unlock_read(&vma->vm_file->f_mapping);
 		}
 		goto uncharge_out;
 	}
@@ -2973,7 +2977,11 @@ static int do_cow_fault(struct mm_struct
 		unlock_page(fault_page);
 		page_cache_release(fault_page);
 	} else {
-		mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
+			/*
+			 * DAX doesn't have a page to lock, so it uses
+			 * i_mmap_lock()
+			 */
+			i_mmap_unlock_read(&vma->vm_file->f_mapping);
 	}
 	return ret;
 uncharge_out:
_


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v12 06/20] dax,ext2: Replace XIP read and write with DAX I/O
  2014-10-24 21:20 ` [PATCH v12 06/20] dax,ext2: Replace XIP read and write with DAX I/O Matthew Wilcox
@ 2015-01-12 23:09   ` Andrew Morton
  2015-01-13 20:59     ` Matthew Wilcox
  0 siblings, 1 reply; 60+ messages in thread
From: Andrew Morton @ 2015-01-12 23:09 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-fsdevel, linux-kernel, linux-mm, willy

On Fri, 24 Oct 2014 17:20:38 -0400 Matthew Wilcox <matthew.r.wilcox@intel.com> wrote:

> Use the generic AIO infrastructure instead of custom read and write
> methods.  In addition to giving us support for AIO, this adds the missing
> locking between read() and truncate().
> 
> ...
>
> +/*
> + * When ext4 encounters a hole, it returns without modifying the buffer_head
> + * which means that we can't trust b_size.  To cope with this, we set b_state
> + * to 0 before calling get_block and, if any bit is set, we know we can trust
> + * b_size.  Unfortunate, really, since ext4 knows precisely how long a hole is
> + * and would save us time calling get_block repeatedly.
> + */
> +static bool buffer_size_valid(struct buffer_head *bh)
> +{
> +	return bh->b_state != 0;
> +}

Yitch.  Is there a cleaner way of doing this?

> +static ssize_t dax_io(int rw, struct inode *inode, struct iov_iter *iter,
> +			loff_t start, loff_t end, get_block_t get_block,
> +			struct buffer_head *bh)

hm, some documentation would be nice.  I expected "dax_io" to do IO,
but this doesn't.  Is it well named?

> +{
> +	ssize_t retval = 0;
> +	loff_t pos = start;
> +	loff_t max = start;
> +	loff_t bh_max = start;
> +	void *addr;
> +	bool hole = false;
> +
> +	if (rw != WRITE)
> +		end = min(end, i_size_read(inode));
> +
> +	while (pos < end) {
> +		unsigned len;
> +		if (pos == max) {
> +			unsigned blkbits = inode->i_blkbits;
> +			sector_t block = pos >> blkbits;
> +			unsigned first = pos - (block << blkbits);
> +			long size;
> +
> +			if (pos == bh_max) {
> +				bh->b_size = PAGE_ALIGN(end - pos);
> +				bh->b_state = 0;
> +				retval = get_block(inode, block, bh,
> +								rw == WRITE);
> +				if (retval)
> +					break;
> +				if (!buffer_size_valid(bh))
> +					bh->b_size = 1 << blkbits;
> +				bh_max = pos - first + bh->b_size;
> +			} else {
> +				unsigned done = bh->b_size -
> +						(bh_max - (pos - first));
> +				bh->b_blocknr += done >> blkbits;
> +				bh->b_size -= done;
> +			}
> +
> +			hole = (rw != WRITE) && !buffer_written(bh);
> +			if (hole) {
> +				addr = NULL;
> +				size = bh->b_size - first;
> +			} else {
> +				retval = dax_get_addr(bh, &addr, blkbits);
> +				if (retval < 0)
> +					break;
> +				if (buffer_unwritten(bh) || buffer_new(bh))
> +					dax_new_buf(addr, retval, first, pos,
> +									end);
> +				addr += first;
> +				size = retval - first;
> +			}
> +			max = min(pos + size, end);
> +		}
> +
> +		if (rw == WRITE)
> +			len = copy_from_iter(addr, max - pos, iter);
> +		else if (!hole)
> +			len = copy_to_iter(addr, max - pos, iter);
> +		else
> +			len = iov_iter_zero(max - pos, iter);
> +
> +		if (!len)
> +			break;
> +
> +		pos += len;
> +		addr += len;
> +	}
> +
> +	return (pos == start) ? retval : pos - start;
> +}
> +
> +/**
> + * dax_do_io - Perform I/O to a DAX file
> + * @rw: READ to read or WRITE to write
> + * @iocb: The control block for this I/O
> + * @inode: The file which the I/O is directed at
> + * @iter: The addresses to do I/O from or to
> + * @pos: The file offset where the I/O starts
> + * @get_block: The filesystem method used to translate file offsets to blocks
> + * @end_io: A filesystem callback for I/O completion
> + * @flags: See below
> + *
> + * This function uses the same locking scheme as do_blockdev_direct_IO:
> + * If @flags has DIO_LOCKING set, we assume that the i_mutex is held by the
> + * caller for writes.  For reads, we take and release the i_mutex ourselves.
> + * If DIO_LOCKING is not set, the filesystem takes care of its own locking.
> + * As with do_blockdev_direct_IO(), we increment i_dio_count while the I/O
> + * is in progress.

It would be helpful here to explain *why* this code uses i_dio_count:
what is trying to protect (against)?

Oh, is that how it works ;)

Perhaps a few BUG_ON(!mutex_is_locked(&inode->i_mutex)) would clarfiy
and prevent mistakes.

> + */
> +ssize_t dax_do_io(int rw, struct kiocb *iocb, struct inode *inode,
> +			struct iov_iter *iter, loff_t pos,
> +			get_block_t get_block, dio_iodone_t end_io, int flags)
>
> ...
>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v12 07/20] dax,ext2: Replace ext2_clear_xip_target with dax_clear_blocks
  2014-10-24 21:20 ` [PATCH v12 07/20] dax,ext2: Replace ext2_clear_xip_target with dax_clear_blocks Matthew Wilcox
@ 2015-01-12 23:09   ` Andrew Morton
  2015-01-13 21:39     ` Matthew Wilcox
  0 siblings, 1 reply; 60+ messages in thread
From: Andrew Morton @ 2015-01-12 23:09 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-fsdevel, linux-kernel, linux-mm, willy

On Fri, 24 Oct 2014 17:20:39 -0400 Matthew Wilcox <matthew.r.wilcox@intel.com> wrote:

> This is practically generic code; other filesystems will want to call
> it from other places, but there's nothing ext2-specific about it.
> 
> Make it a little more generic by allowing it to take a count of the number
> of bytes to zero rather than fixing it to a single page.  Thanks to Dave
> Hansen for suggesting that I need to call cond_resched() if zeroing more
> than one page.
> 
> ...
>
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -20,8 +20,45 @@
>  #include <linux/fs.h>
>  #include <linux/genhd.h>
>  #include <linux/mutex.h>
> +#include <linux/sched.h>
>  #include <linux/uio.h>
>  
> +int dax_clear_blocks(struct inode *inode, sector_t block, long size)
> +{
> +	struct block_device *bdev = inode->i_sb->s_bdev;
> +	sector_t sector = block << (inode->i_blkbits - 9);
> +
> +	might_sleep();
> +	do {
> +		void *addr;
> +		unsigned long pfn;
> +		long count;
> +
> +		count = bdev_direct_access(bdev, sector, &addr, &pfn, size);
> +		if (count < 0)
> +			return count;
> +		BUG_ON(size < count);
> +		while (count > 0) {
> +			unsigned pgsz = PAGE_SIZE - offset_in_page(addr);
> +			if (pgsz > count)
> +				pgsz = count;
> +			if (pgsz < PAGE_SIZE)
> +				memset(addr, 0, pgsz);
> +			else
> +				clear_page(addr);

Are there any cache issues in all this code? flush_dcache_page(addr)?

> +			addr += pgsz;
> +			size -= pgsz;
> +			count -= pgsz;
> +			BUG_ON(pgsz & 511);
> +			sector += pgsz / 512;
> +			cond_resched();
> +		}
> +	} while (size);
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(dax_clear_blocks);
>
> ...
>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v12 08/20] dax,ext2: Replace the XIP page fault handler with the DAX page fault handler
  2014-10-24 21:20 ` [PATCH v12 08/20] dax,ext2: Replace the XIP page fault handler with the DAX page fault handler Matthew Wilcox
@ 2015-01-12 23:09   ` Andrew Morton
  2015-01-13 21:53     ` Matthew Wilcox
  0 siblings, 1 reply; 60+ messages in thread
From: Andrew Morton @ 2015-01-12 23:09 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-fsdevel, linux-kernel, linux-mm, willy

On Fri, 24 Oct 2014 17:20:40 -0400 Matthew Wilcox <matthew.r.wilcox@intel.com> wrote:

> Instead of calling aops->get_xip_mem from the fault handler, the
> filesystem passes a get_block_t that is used to find the appropriate
> blocks.
> 
> ...
>
> +static int copy_user_bh(struct page *to, struct buffer_head *bh,
> +			unsigned blkbits, unsigned long vaddr)
> +{
> +	void *vfrom, *vto;
> +	if (dax_get_addr(bh, &vfrom, blkbits) < 0)
> +		return -EIO;
> +	vto = kmap_atomic(to);
> +	copy_user_page(vto, vfrom, vaddr, to);
> +	kunmap_atomic(vto);

Again, please check the cache-flush aspects.  copy_user_page() appears
to be reponsible for handling coherency issues on the destination
vaddr, but what about *vto?

> +	return 0;
> +}
> +
> +static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
> +			struct vm_area_struct *vma, struct vm_fault *vmf)
> +{
> +	struct address_space *mapping = inode->i_mapping;
> +	sector_t sector = bh->b_blocknr << (inode->i_blkbits - 9);
> +	unsigned long vaddr = (unsigned long)vmf->virtual_address;
> +	void *addr;
> +	unsigned long pfn;
> +	pgoff_t size;
> +	int error;
> +
> +	mutex_lock(&mapping->i_mmap_mutex);
> +
> +	/*
> +	 * Check truncate didn't happen while we were allocating a block.
> +	 * If it did, this block may or may not be still allocated to the
> +	 * file.  We can't tell the filesystem to free it because we can't
> +	 * take i_mutex here.

(what's preventing us from taking i_mutex?)

>  	   In the worst case, the file still has blocks
> +	 * allocated past the end of the file.
> +	 */
> +	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
> +	if (unlikely(vmf->pgoff >= size)) {
> +		error = -EIO;
> +		goto out;
> +	}

How does this play with holepunching?  Checking i_size won't work there?

> +	error = bdev_direct_access(bh->b_bdev, sector, &addr, &pfn, bh->b_size);
> +	if (error < 0)
> +		goto out;
> +	if (error < PAGE_SIZE) {
> +		error = -EIO;
> +		goto out;

hm, what's going on here.  It's known that bh->b_size >= PAGE_SIZE?  I
don't recall seeing anything which explained that to me.  Help.

> +	}
> +
> +	if (buffer_unwritten(bh) || buffer_new(bh))
> +		clear_page(addr);
> +
> +	error = vm_insert_mixed(vma, vaddr, pfn);
> +
> + out:
> +	mutex_unlock(&mapping->i_mmap_mutex);
> +
> +	if (bh->b_end_io)
> +		bh->b_end_io(bh, 1);
> +
> +	return error;
> +}
> +
> +static int do_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
> +			get_block_t get_block)
> +{
> +	struct file *file = vma->vm_file;
> +	struct address_space *mapping = file->f_mapping;
> +	struct inode *inode = mapping->host;
> +	struct page *page;
> +	struct buffer_head bh;
> +	unsigned long vaddr = (unsigned long)vmf->virtual_address;
> +	unsigned blkbits = inode->i_blkbits;
> +	sector_t block;
> +	pgoff_t size;
> +	int error;
> +	int major = 0;
> +
> +	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
> +	if (vmf->pgoff >= size)
> +		return VM_FAULT_SIGBUS;
> +
> +	memset(&bh, 0, sizeof(bh));
> +	block = (sector_t)vmf->pgoff << (PAGE_SHIFT - blkbits);
> +	bh.b_size = PAGE_SIZE;

ah, there.

PAGE_SIZE varies a lot between architectures.  What are the
implications of this>?

> + repeat:
> +	page = find_get_page(mapping, vmf->pgoff);
> +	if (page) {
> +		if (!lock_page_or_retry(page, vma->vm_mm, vmf->flags)) {
> +			page_cache_release(page);
> +			return VM_FAULT_RETRY;
> +		}
> +		if (unlikely(page->mapping != mapping)) {
> +			unlock_page(page);
> +			page_cache_release(page);
> +			goto repeat;
> +		}
> +		size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
> +		if (unlikely(vmf->pgoff >= size)) {
> +			error = -EIO;

What happened when this happens?

> +			goto unlock_page;
> +		}
> +	}
> +
> +	error = get_block(inode, block, &bh, 0);
> +	if (!error && (bh.b_size < PAGE_SIZE))
> +		error = -EIO;

How could this happen?

> +	if (error)
> +		goto unlock_page;
> +
> +	if (!buffer_mapped(&bh) && !buffer_unwritten(&bh) && !vmf->cow_page) {
> +		if (vmf->flags & FAULT_FLAG_WRITE) {
> +			error = get_block(inode, block, &bh, 1);
> +			count_vm_event(PGMAJFAULT);
> +			mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
> +			major = VM_FAULT_MAJOR;
> +			if (!error && (bh.b_size < PAGE_SIZE))
> +				error = -EIO;
> +			if (error)
> +				goto unlock_page;
> +		} else {
> +			return dax_load_hole(mapping, page, vmf);
> +		}
> +	}
> +
> +	if (vmf->cow_page) {
> +		struct page *new_page = vmf->cow_page;
> +		if (buffer_written(&bh))
> +			error = copy_user_bh(new_page, &bh, blkbits, vaddr);
> +		else
> +			clear_user_highpage(new_page, vaddr);
> +		if (error)
> +			goto unlock_page;
> +		vmf->page = page;
> +		if (!page) {
> +			mutex_lock(&mapping->i_mmap_mutex);
> +			/* Check we didn't race with truncate */
> +			size = (i_size_read(inode) + PAGE_SIZE - 1) >>
> +								PAGE_SHIFT;
> +			if (vmf->pgoff >= size) {
> +				mutex_unlock(&mapping->i_mmap_mutex);
> +				error = -EIO;
> +				goto out;
> +			}
> +		}
> +		return VM_FAULT_LOCKED;
> +	}
> +
> +	/* Check we didn't race with a read fault installing a new page */
> +	if (!page && major)
> +		page = find_lock_page(mapping, vmf->pgoff);
> +
> +	if (page) {
> +		unmap_mapping_range(mapping, vmf->pgoff << PAGE_SHIFT,
> +							PAGE_CACHE_SIZE, 0);
> +		delete_from_page_cache(page);
> +		unlock_page(page);
> +		page_cache_release(page);
> +	}
> +
> +	error = dax_insert_mapping(inode, &bh, vma, vmf);
> +
> + out:
> +	if (error == -ENOMEM)
> +		return VM_FAULT_OOM | major;
> +	/* -EBUSY is fine, somebody else faulted on the same PTE */
> +	if ((error < 0) && (error != -EBUSY))
> +		return VM_FAULT_SIGBUS | major;
> +	return VM_FAULT_NOPAGE | major;
> +
> + unlock_page:
> +	if (page) {
> +		unlock_page(page);
> +		page_cache_release(page);
> +	}
> +	goto out;
> +}
> 
> ...
>


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v12 09/20] dax,ext2: Replace xip_truncate_page with dax_truncate_page
  2014-10-24 21:20 ` [PATCH v12 09/20] dax,ext2: Replace xip_truncate_page with dax_truncate_page Matthew Wilcox
@ 2015-01-12 23:09   ` Andrew Morton
  2015-01-13 21:55     ` Matthew Wilcox
  0 siblings, 1 reply; 60+ messages in thread
From: Andrew Morton @ 2015-01-12 23:09 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-fsdevel, linux-kernel, linux-mm, willy

On Fri, 24 Oct 2014 17:20:41 -0400 Matthew Wilcox <matthew.r.wilcox@intel.com> wrote:

> It takes a get_block parameter just like nobh_truncate_page() and
> block_truncate_page()
> 
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -458,3 +458,47 @@ int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
>  	return result;
>  }
>  EXPORT_SYMBOL_GPL(dax_fault);
> +
> +/**
> + * dax_truncate_page - handle a partial page being truncated in a DAX file
> + * @inode: The file being truncated
> + * @from: The file offset that is being truncated to
> + * @get_block: The filesystem method used to translate file offsets to blocks
> + *
> + * Similar to block_truncate_page(), this function can be called by a
> + * filesystem when it is truncating an DAX file to handle the partial page.
> + *
> + * We work in terms of PAGE_CACHE_SIZE here for commonality with
> + * block_truncate_page(), but we could go down to PAGE_SIZE if the filesystem
> + * took care of disposing of the unnecessary blocks.

But PAGE_SIZE==PAGE_CACHE_SIZE.  Unclear what you're saying here.

> + Even if the filesystem
> + * block size is smaller than PAGE_SIZE, we have to zero the rest of the page
> + * since the file might be mmaped.
> + */
> 
> ...
>


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v12 10/20] dax: Replace XIP documentation with DAX documentation
  2014-10-24 21:20 ` [PATCH v12 10/20] dax: Replace XIP documentation with DAX documentation Matthew Wilcox
@ 2015-01-12 23:10   ` Andrew Morton
  2016-01-21 18:38   ` Jared Hulbert
  1 sibling, 0 replies; 60+ messages in thread
From: Andrew Morton @ 2015-01-12 23:10 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-fsdevel, linux-kernel, linux-mm, Matthew Wilcox

On Fri, 24 Oct 2014 17:20:42 -0400 Matthew Wilcox <matthew.r.wilcox@intel.com> wrote:

> Based on the original XIP documentation, this documents the current
> state of affairs, and includes instructions on how users can enable DAX
> if their devices and kernel support it.

Nice ;)

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v12 18/20] dax: Add dax_zero_page_range
  2014-10-24 21:20 ` [PATCH v12 18/20] dax: Add dax_zero_page_range Matthew Wilcox
@ 2015-01-12 23:10   ` Andrew Morton
  2015-01-12 23:20     ` Ross Zwisler
  0 siblings, 1 reply; 60+ messages in thread
From: Andrew Morton @ 2015-01-12 23:10 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-fsdevel, linux-kernel, linux-mm, willy, Ross Zwisler

On Fri, 24 Oct 2014 17:20:50 -0400 Matthew Wilcox <matthew.r.wilcox@intel.com> wrote:

> Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
> [ported to 3.13-rc2]
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>

I never know what this means :(

I switched it to 

[ross.zwisler@linux.intel.com: ported to 3.13-rc2]
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>

but perhaps that was wrong?



also, coupla typos:


diff -puN fs/dax.c~dax-add-dax_zero_page_range-fix fs/dax.c
--- a/fs/dax.c~dax-add-dax_zero_page_range-fix
+++ a/fs/dax.c
@@ -475,7 +475,7 @@ EXPORT_SYMBOL_GPL(dax_fault);
  * block_truncate_page(), but we could go down to PAGE_SIZE if the filesystem
  * took care of disposing of the unnecessary blocks.  Even if the filesystem
  * block size is smaller than PAGE_SIZE, we have to zero the rest of the page
- * since the file might be mmaped.
+ * since the file might be mmapped.
  */
 int dax_zero_page_range(struct inode *inode, loff_t from, unsigned length,
 							get_block_t get_block)
@@ -514,13 +514,13 @@ EXPORT_SYMBOL_GPL(dax_zero_page_range);
  * @get_block: The filesystem method used to translate file offsets to blocks
  *
  * Similar to block_truncate_page(), this function can be called by a
- * filesystem when it is truncating an DAX file to handle the partial page.
+ * filesystem when it is truncating a DAX file to handle the partial page.
  *
  * We work in terms of PAGE_CACHE_SIZE here for commonality with
  * block_truncate_page(), but we could go down to PAGE_SIZE if the filesystem
  * took care of disposing of the unnecessary blocks.  Even if the filesystem
  * block size is smaller than PAGE_SIZE, we have to zero the rest of the page
- * since the file might be mmaped.
+ * since the file might be mmapped.
  */
 int dax_truncate_page(struct inode *inode, loff_t from, get_block_t get_block)
 {
diff -puN include/linux/fs.h~dax-add-dax_zero_page_range-fix include/linux/fs.h
_


akpm3:/usr/src/linux-3.19-rc4> grep -r mmaped .| wc -l
70
akpm3:/usr/src/linux-3.19-rc4> grep -r mmapped .| wc -l 
107

lol.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v12 18/20] dax: Add dax_zero_page_range
  2015-01-12 23:10   ` Andrew Morton
@ 2015-01-12 23:20     ` Ross Zwisler
  0 siblings, 0 replies; 60+ messages in thread
From: Ross Zwisler @ 2015-01-12 23:20 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Matthew Wilcox, linux-fsdevel, linux-kernel, linux-mm, willy

On Mon, 2015-01-12 at 15:10 -0800, Andrew Morton wrote:
> On Fri, 24 Oct 2014 17:20:50 -0400 Matthew Wilcox <matthew.r.wilcox@intel.com> wrote:
> 
> > Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
> > [ported to 3.13-rc2]
> > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> 
> I never know what this means :(
> 
> I switched it to 
> 
> [ross.zwisler@linux.intel.com: ported to 3.13-rc2]
> Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>

The way that you've interpreted it is correct.  Thanks!

- Ross

> but perhaps that was wrong?
> 
> 
> 
> 
> also, coupla typos:
> 
> 
> diff -puN fs/dax.c~dax-add-dax_zero_page_range-fix fs/dax.c
> --- a/fs/dax.c~dax-add-dax_zero_page_range-fix
> +++ a/fs/dax.c
> @@ -475,7 +475,7 @@ EXPORT_SYMBOL_GPL(dax_fault);
>   * block_truncate_page(), but we could go down to PAGE_SIZE if the filesystem
>   * took care of disposing of the unnecessary blocks.  Even if the filesystem
>   * block size is smaller than PAGE_SIZE, we have to zero the rest of the page
> - * since the file might be mmaped.
> + * since the file might be mmapped.
>   */
>  int dax_zero_page_range(struct inode *inode, loff_t from, unsigned length,
>  							get_block_t get_block)
> @@ -514,13 +514,13 @@ EXPORT_SYMBOL_GPL(dax_zero_page_range);
>   * @get_block: The filesystem method used to translate file offsets to blocks
>   *
>   * Similar to block_truncate_page(), this function can be called by a
> - * filesystem when it is truncating an DAX file to handle the partial page.
> + * filesystem when it is truncating a DAX file to handle the partial page.
>   *
>   * We work in terms of PAGE_CACHE_SIZE here for commonality with
>   * block_truncate_page(), but we could go down to PAGE_SIZE if the filesystem
>   * took care of disposing of the unnecessary blocks.  Even if the filesystem
>   * block size is smaller than PAGE_SIZE, we have to zero the rest of the page
> - * since the file might be mmaped.
> + * since the file might be mmapped.
>   */
>  int dax_truncate_page(struct inode *inode, loff_t from, get_block_t get_block)
>  {
> diff -puN include/linux/fs.h~dax-add-dax_zero_page_range-fix include/linux/fs.h
> _
> 
> 
> akpm3:/usr/src/linux-3.19-rc4> grep -r mmaped .| wc -l
> 70
> akpm3:/usr/src/linux-3.19-rc4> grep -r mmapped .| wc -l 
> 107
> 
> lol.




^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v12 03/20] mm: Fix XIP fault vs truncate race
  2015-01-12 23:09   ` Andrew Morton
@ 2015-01-13 18:50     ` Matthew Wilcox
  0 siblings, 0 replies; 60+ messages in thread
From: Matthew Wilcox @ 2015-01-13 18:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Matthew Wilcox, linux-fsdevel, linux-kernel, linux-mm, willy

On Mon, Jan 12, 2015 at 03:09:29PM -0800, Andrew Morton wrote:
> On Fri, 24 Oct 2014 17:20:35 -0400 Matthew Wilcox <matthew.r.wilcox@intel.com> wrote:
> > Pagecache faults recheck i_size after taking the page lock to ensure that
> > the fault didn't race against a truncate.  We don't have a page to lock
> > in the XIP case, so use the i_mmap_mutex instead.  It is locked in the
> > truncate path in unmap_mapping_range() after updating i_size.  So while
> > we hold it in the fault path, we are guaranteed that either i_size has
> > already been updated in the truncate path, or that the truncate will
> > subsequently call zap_page_range_single() and so remove the mapping we
> > have just inserted.
> > 
> > There is a window of time in which i_size has been reduced and the
> > thread has a mapping to a page which will be removed from the file,
> > but this is harmless as the page will not be allocated to a different
> > purpose before the thread's access to it is revoked.
> > 
> 
> i_mmap_mutex is no more.  I made what are hopefulyl the appropriate
> changes.
> 
> Also, that new locking rule is pretty subtle and we need to find a way
> of alerting readers (and modifiers) of mm/memory.c to DAX's use of
> i_mmap_lock().  Please review my suggested addition for accuracy and
> cmopleteness.

I find the existing locking rules for truncate pretty subtle too!
It's easy to define what the rule is, but "why does it work" is, as you
say, subtle.

> +++ a/mm/filemap_xip.c
> @@ -255,17 +255,20 @@ again:
>  		__xip_unmap(mapping, vmf->pgoff);
>  
>  found:
> -		/* We must recheck i_size under i_mmap_mutex */
> -		mutex_lock(&mapping->i_mmap_mutex);
> +		/*
> +		 * We must recheck i_size under i_mmap_rwsem to prevent races
> +		 * with truncation
> +		 */
> +		i_mmap_lock_read(mapping);

I think this is correct.  The truncate code has a write lock, so it cannot
be running at the same time as a read lock.

> diff -puN mm/memory.c~mm-fix-xip-fault-vs-truncate-race-fix mm/memory.c
> --- a/mm/memory.c~mm-fix-xip-fault-vs-truncate-race-fix
> +++ a/mm/memory.c
> @@ -1327,6 +1327,11 @@ static void unmap_single_vma(struct mmu_
>  			 * safe to do nothing in this case.
>  			 */
>  			if (vma->vm_file) {
> +				/*
> +				 * Note that DAX uses i_mmap_lock to serialise
> +				 * against file truncate - truncate calls into
> +				 * unmap_single_vma().
> +				 */
>  				i_mmap_lock_write(vma->vm_file->f_mapping);
>  				__unmap_hugepage_range_final(tlb, vma, start, end, NULL);
>  				i_mmap_unlock_write(vma->vm_file->f_mapping);
> _
> 

But this comment is in the wrong place!  This code is only for the hugetlbfs
case, and would do nothing to protect the DAX code.  I think you want this
instead:

diff --git a/mm/memory.c b/mm/memory.c
index 54f3a9b..67bbbb7 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2384,7 +2384,7 @@ void unmap_mapping_range(struct address_space *mapping,
 	if (details.last_index < details.first_index)
 		details.last_index = ULONG_MAX;
 
-
+	/* DAX uses i_mmap_lock to serialise file truncate vs page fault */
 	i_mmap_lock_write(mapping);
 	if (unlikely(!RB_EMPTY_ROOT(&mapping->i_mmap)))
 		unmap_mapping_range_tree(&mapping->i_mmap, &details);

Filesystems are obliged to update i_size before calling
truncate_pagecache(), which does:

        unmap_mapping_range(mapping, holebegin, 0, 1);
        truncate_inode_pages(mapping, newsize);
        unmap_mapping_range(mapping, holebegin, 0, 1);

So if we hold i_mmap_lock_read(), we know that unmap_mapping_range()
is blocked waiting for it, and so any page less than i_size is safe to
insert, because it will be removed once unmap_mapping_range() proceeds.

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* Re: [PATCH v12 04/20] mm: Allow page fault handlers to perform the COW
  2015-01-12 23:09   ` Andrew Morton
@ 2015-01-13 18:58     ` Matthew Wilcox
  0 siblings, 0 replies; 60+ messages in thread
From: Matthew Wilcox @ 2015-01-13 18:58 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Matthew Wilcox, linux-fsdevel, linux-kernel, linux-mm, willy

On Mon, Jan 12, 2015 at 03:09:35PM -0800, Andrew Morton wrote:
> On Fri, 24 Oct 2014 17:20:36 -0400 Matthew Wilcox <matthew.r.wilcox@intel.com> wrote:
> > Currently COW of an XIP file is done by first bringing in a read-only
> > mapping, then retrying the fault and copying the page.  It is much more
> > efficient to tell the fault handler that a COW is being attempted (by
> > passing in the pre-allocated page in the vm_fault structure), and allow
> > the handler to perform the COW operation itself.
> > 
> > The handler cannot insert the page itself if there is already a read-only
> > mapping at that address, so allow the handler to return VM_FAULT_LOCKED
> > and set the fault_page to be NULL.  This indicates to the MM code that
> > the i_mmap_mutex is held instead of the page lock.
> 
> Again, the locking gets a bit subtle.  How can we make this clearer to
> readers of the core code.  I had a shot but it's a bit lame - DAX uses
> i_mmap_lock for what???

It's not just DAX ... any fault handler that wants to optimise its COW
can use the same technique.  I could turn this around and ask the mm
people why it is the struct page has to be returned locked; what is it
protecting against?

I'm pretty sure the answer is only truncate, and so (as with the previous
patch), the read lock is perfectly appropriate.

> If I know that, I'd know whether to have used i_mmap_lock_read() or
> i_mmap_lock_write() :(
> 
> 
> From: Andrew Morton <akpm@linux-foundation.org>
> Subject: mm-allow-page-fault-handlers-to-perform-the-cow-fix
> 
> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> ---
> 
>  mm/memory.c |   12 ++++++++++--
>  1 file changed, 10 insertions(+), 2 deletions(-)
> 
> diff -puN include/linux/mm.h~mm-allow-page-fault-handlers-to-perform-the-cow-fix include/linux/mm.h
> diff -puN mm/memory.c~mm-allow-page-fault-handlers-to-perform-the-cow-fix mm/memory.c
> --- a/mm/memory.c~mm-allow-page-fault-handlers-to-perform-the-cow-fix
> +++ a/mm/memory.c
> @@ -2961,7 +2961,11 @@ static int do_cow_fault(struct mm_struct
>  			unlock_page(fault_page);
>  			page_cache_release(fault_page);
>  		} else {
> -			mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
> +			/*
> +			 * DAX doesn't have a page to lock, so it uses
> +			 * i_mmap_lock()
> +			 */
> +			i_mmap_unlock_read(&vma->vm_file->f_mapping);

How about:
			/*
			 * The fault handler has no page to lock, so it
			 * holds i_mmap_lock for read to protect against
			 * truncate.
			 */

>  		}
>  		goto uncharge_out;
>  	}
> @@ -2973,7 +2977,11 @@ static int do_cow_fault(struct mm_struct
>  		unlock_page(fault_page);
>  		page_cache_release(fault_page);
>  	} else {
> -		mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
> +			/*
> +			 * DAX doesn't have a page to lock, so it uses
> +			 * i_mmap_lock()
> +			 */
> +			i_mmap_unlock_read(&vma->vm_file->f_mapping);

(as Jan already pointed out, the indentation needs to be fixed here anyway)

>  	}
>  	return ret;
>  uncharge_out:
> _
> 

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v12 06/20] dax,ext2: Replace XIP read and write with DAX I/O
  2015-01-12 23:09   ` Andrew Morton
@ 2015-01-13 20:59     ` Matthew Wilcox
  0 siblings, 0 replies; 60+ messages in thread
From: Matthew Wilcox @ 2015-01-13 20:59 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Matthew Wilcox, linux-fsdevel, linux-kernel, linux-mm, willy

On Mon, Jan 12, 2015 at 03:09:41PM -0800, Andrew Morton wrote:
> On Fri, 24 Oct 2014 17:20:38 -0400 Matthew Wilcox <matthew.r.wilcox@intel.com> wrote:
> > +/*
> > + * When ext4 encounters a hole, it returns without modifying the buffer_head
> > + * which means that we can't trust b_size.  To cope with this, we set b_state
> > + * to 0 before calling get_block and, if any bit is set, we know we can trust
> > + * b_size.  Unfortunate, really, since ext4 knows precisely how long a hole is
> > + * and would save us time calling get_block repeatedly.
> > + */
> > +static bool buffer_size_valid(struct buffer_head *bh)
> > +{
> > +	return bh->b_state != 0;
> > +}
> 
> Yitch.  Is there a cleaner way of doing this?

I'm hoping to fix ext* and then this problem can go away ...

> > +static ssize_t dax_io(int rw, struct inode *inode, struct iov_iter *iter,
> > +			loff_t start, loff_t end, get_block_t get_block,
> > +			struct buffer_head *bh)
> 
> hm, some documentation would be nice.  I expected "dax_io" to do IO,
> but this doesn't.  Is it well named?

It does do I/O!

> > +		if (rw == WRITE)
> > +			len = copy_from_iter(addr, max - pos, iter);
> > +		else if (!hole)
> > +			len = copy_to_iter(addr, max - pos, iter);
> > +		else
> > +			len = iov_iter_zero(max - pos, iter);

> > + * This function uses the same locking scheme as do_blockdev_direct_IO:
> > + * If @flags has DIO_LOCKING set, we assume that the i_mutex is held by the
> > + * caller for writes.  For reads, we take and release the i_mutex ourselves.
> > + * If DIO_LOCKING is not set, the filesystem takes care of its own locking.
> > + * As with do_blockdev_direct_IO(), we increment i_dio_count while the I/O
> > + * is in progress.
> 
> It would be helpful here to explain *why* this code uses i_dio_count:
> what is trying to protect (against)?

Rather than just referencing the documentation in fs/direct_io.c?  I
find it tends to get stale if we have documentation in multiple places.

> Oh, is that how it works ;)
> 
> Perhaps a few BUG_ON(!mutex_is_locked(&inode->i_mutex)) would clarfiy
> and prevent mistakes.

Perhaps ... although there aren't any in blockdev_direct_IO(), and all the
callers are of the form:

	if (IS_DAX)
		dax_do_io()
	else
		blockdev_direct_IO()

so they've already got their flags and locking sorted out.

> > + */
> > +ssize_t dax_do_io(int rw, struct kiocb *iocb, struct inode *inode,
> > +			struct iov_iter *iter, loff_t pos,
> > +			get_block_t get_block, dio_iodone_t end_io, int flags)
> >
> > ...
> >

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v12 07/20] dax,ext2: Replace ext2_clear_xip_target with dax_clear_blocks
  2015-01-12 23:09   ` Andrew Morton
@ 2015-01-13 21:39     ` Matthew Wilcox
  0 siblings, 0 replies; 60+ messages in thread
From: Matthew Wilcox @ 2015-01-13 21:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Matthew Wilcox, linux-fsdevel, linux-kernel, linux-mm, willy

On Mon, Jan 12, 2015 at 03:09:47PM -0800, Andrew Morton wrote:
> > +int dax_clear_blocks(struct inode *inode, sector_t block, long size)
> > +{
...
> > +			if (pgsz < PAGE_SIZE)
> > +				memset(addr, 0, pgsz);
> > +			else
> > +				clear_page(addr);
> 
> Are there any cache issues in all this code? flush_dcache_page(addr)?

Here, no.  This is only called to initialise a newly allocated block.

Elsewhere, maybe.  When i was originally working on this, I think I had
code that forced mmaps of DAX files to be aligned to SHMLBA, because I
remember noticing a bug in sparc64's remap_file_range().  Unfortunately,
in the various rewrites, that got lost.  So it needs to be put back in.

flush_dcache_page() in particular won't work because it needs a struct
page.


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v12 08/20] dax,ext2: Replace the XIP page fault handler with the DAX page fault handler
  2015-01-12 23:09   ` Andrew Morton
@ 2015-01-13 21:53     ` Matthew Wilcox
  2015-01-13 22:47       ` Andrew Morton
  0 siblings, 1 reply; 60+ messages in thread
From: Matthew Wilcox @ 2015-01-13 21:53 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Matthew Wilcox, linux-fsdevel, linux-kernel, linux-mm, willy

On Mon, Jan 12, 2015 at 03:09:52PM -0800, Andrew Morton wrote:
> On Fri, 24 Oct 2014 17:20:40 -0400 Matthew Wilcox <matthew.r.wilcox@intel.com> wrote:
> 
> > Instead of calling aops->get_xip_mem from the fault handler, the
> > filesystem passes a get_block_t that is used to find the appropriate
> > blocks.
> > 
> > ...
> >
> > +static int copy_user_bh(struct page *to, struct buffer_head *bh,
> > +			unsigned blkbits, unsigned long vaddr)
> > +{
> > +	void *vfrom, *vto;
> > +	if (dax_get_addr(bh, &vfrom, blkbits) < 0)
> > +		return -EIO;
> > +	vto = kmap_atomic(to);
> > +	copy_user_page(vto, vfrom, vaddr, to);
> > +	kunmap_atomic(vto);
> 
> Again, please check the cache-flush aspects.  copy_user_page() appears
> to be reponsible for handling coherency issues on the destination
> vaddr, but what about *vto?

vto is a new kernel address ... if there's any dirty data for that
address, it should have been flushed by the prior kunmap_atomic(), right?

> > +	mutex_lock(&mapping->i_mmap_mutex);
> > +
> > +	/*
> > +	 * Check truncate didn't happen while we were allocating a block.
> > +	 * If it did, this block may or may not be still allocated to the
> > +	 * file.  We can't tell the filesystem to free it because we can't
> > +	 * take i_mutex here.
> 
> (what's preventing us from taking i_mutex?)

We're in a page fault handler, and we may already be holding i_mutex.
We're definitely holding mmap_sem, and to quote from mm/rmap.c:

/*
 * Lock ordering in mm:
 *
 * inode->i_mutex       (while writing or truncating, not reading or faulting)
 *   mm->mmap_sem

> >  	   In the worst case, the file still has blocks
> > +	 * allocated past the end of the file.
> > +	 */
> > +	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
> > +	if (unlikely(vmf->pgoff >= size)) {
> > +		error = -EIO;
> > +		goto out;
> > +	}
> 
> How does this play with holepunching?  Checking i_size won't work there?

It doesn't.  But the same problem exists with non-DAX files too, and
when I pointed it out, it was met with a shrug from the crowd.  I saw a
patch series just recently that fixes it for XFS, but as far as I know,
btrfs and ext4 still don't play well with pagefault vs hole-punch races.

> > +	memset(&bh, 0, sizeof(bh));
> > +	block = (sector_t)vmf->pgoff << (PAGE_SHIFT - blkbits);
> > +	bh.b_size = PAGE_SIZE;
> 
> ah, there.
> 
> PAGE_SIZE varies a lot between architectures.  What are the
> implications of this>?

At the moment, you can only do DAX for blocksizes that are equal to
PAGE_SIZE.  That's a restriction that existed for the previous XIP code,
and I haven't fixed it all for DAX yet.  I'd like to, but it's not high on
my list of things to fix.  Since these are in-mmeory filesystems, there's
not likely to be high demand to move the filesystem between machines.

> > + repeat:
> > +	page = find_get_page(mapping, vmf->pgoff);
> > +	if (page) {
> > +		if (!lock_page_or_retry(page, vma->vm_mm, vmf->flags)) {
> > +			page_cache_release(page);
> > +			return VM_FAULT_RETRY;
> > +		}
> > +		if (unlikely(page->mapping != mapping)) {
> > +			unlock_page(page);
> > +			page_cache_release(page);
> > +			goto repeat;
> > +		}
> > +		size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
> > +		if (unlikely(vmf->pgoff >= size)) {
> > +			error = -EIO;
> 
> What happened when this happens?

This case is where we have a struct page covering a hole in the file from
a read fault and we've raced with a truncate.  It's basically the same code
that's in filemap_fault().

> > +			goto unlock_page;
> > +		}
> > +	}
> > +
> > +	error = get_block(inode, block, &bh, 0);
> > +	if (!error && (bh.b_size < PAGE_SIZE))
> > +		error = -EIO;
> 
> How could this happen?

The only way I can think of is if the filesystem was corrupted.  But it's
worth programming defensively, no?


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v12 09/20] dax,ext2: Replace xip_truncate_page with dax_truncate_page
  2015-01-12 23:09   ` Andrew Morton
@ 2015-01-13 21:55     ` Matthew Wilcox
  0 siblings, 0 replies; 60+ messages in thread
From: Matthew Wilcox @ 2015-01-13 21:55 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Matthew Wilcox, linux-fsdevel, linux-kernel, linux-mm, willy

On Mon, Jan 12, 2015 at 03:09:58PM -0800, Andrew Morton wrote:
> > + * Similar to block_truncate_page(), this function can be called by a
> > + * filesystem when it is truncating an DAX file to handle the partial page.
> > + *
> > + * We work in terms of PAGE_CACHE_SIZE here for commonality with
> > + * block_truncate_page(), but we could go down to PAGE_SIZE if the filesystem
> > + * took care of disposing of the unnecessary blocks.
> 
> But PAGE_SIZE==PAGE_CACHE_SIZE.  Unclear what you're saying here.

The last I heard, some people were trying to resurrect the PAGE_CACHE_SIZE
> PAGE_SIZE patches.  I'd be grateful if the distinction between PAGE_SIZE
and PAGE_CACHE_SIZE went away, tbh.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v12 08/20] dax,ext2: Replace the XIP page fault handler with the DAX page fault handler
  2015-01-13 21:53     ` Matthew Wilcox
@ 2015-01-13 22:47       ` Andrew Morton
  0 siblings, 0 replies; 60+ messages in thread
From: Andrew Morton @ 2015-01-13 22:47 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Matthew Wilcox, linux-fsdevel, linux-kernel, linux-mm

On Tue, 13 Jan 2015 16:53:34 -0500 Matthew Wilcox <willy@linux.intel.com> wrote:

> /*
>  * Lock ordering in mm:
>  *
>  * inode->i_mutex       (while writing or truncating, not reading or faulting)
>  *   mm->mmap_sem
> 
> > >  	   In the worst case, the file still has blocks
> > > +	 * allocated past the end of the file.
> > > +	 */
> > > +	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
> > > +	if (unlikely(vmf->pgoff >= size)) {
> > > +		error = -EIO;
> > > +		goto out;
> > > +	}
> > 
> > How does this play with holepunching?  Checking i_size won't work there?
> 
> It doesn't.  But the same problem exists with non-DAX files too, and
> when I pointed it out, it was met with a shrug from the crowd.  I saw a
> patch series just recently that fixes it for XFS, but as far as I know,
> btrfs and ext4 still don't play well with pagefault vs hole-punch races.

What are the user-visible effects of the race?

> > > +	memset(&bh, 0, sizeof(bh));
> > > +	block = (sector_t)vmf->pgoff << (PAGE_SHIFT - blkbits);
> > > +	bh.b_size = PAGE_SIZE;
> > 
> > ah, there.
> > 
> > PAGE_SIZE varies a lot between architectures.  What are the
> > implications of this>?
> 
> At the moment, you can only do DAX for blocksizes that are equal to
> PAGE_SIZE.  That's a restriction that existed for the previous XIP code,
> and I haven't fixed it all for DAX yet.  I'd like to, but it's not high on
> my list of things to fix.  Since these are in-mmeory filesystems, there's
> not likely to be high demand to move the filesystem between machines.

hm, I guess not.

This means that our users will need to mkfs their filesystems with
blocksize==pagesize.  The "error: unsupported blocksize for dax" printk
should get the message across, but a mention in
Documentation/filesystems/dax.txt's "Shortcomings" section wouldn't
hurt.


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v12 04/20] mm: Allow page fault handlers to perform the COW
  2014-10-24 21:20 ` [PATCH v12 04/20] mm: Allow page fault handlers to perform the COW Matthew Wilcox
  2015-01-12 23:09   ` Andrew Morton
@ 2015-02-05  9:16   ` Yigal Korman
  2015-02-05 21:39     ` Matthew Wilcox
  1 sibling, 1 reply; 60+ messages in thread
From: Yigal Korman @ 2015-02-05  9:16 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-fsdevel, linux-kernel, linux-mm, willy, Andrew Morton

On Sat, Oct 25, 2014 at 12:20 AM, Matthew Wilcox
<matthew.r.wilcox@intel.com> wrote:
> Currently COW of an XIP file is done by first bringing in a read-only
> mapping, then retrying the fault and copying the page.  It is much more
> efficient to tell the fault handler that a COW is being attempted (by
> passing in the pre-allocated page in the vm_fault structure), and allow
> the handler to perform the COW operation itself.
>
> The handler cannot insert the page itself if there is already a read-only
> mapping at that address, so allow the handler to return VM_FAULT_LOCKED
> and set the fault_page to be NULL.  This indicates to the MM code that
> the i_mmap_mutex is held instead of the page lock.

I have a question on a related issue (I think).
I've noticed that for pfn-only mappings (VM_FAULT_NOPAGE)
do_shared_fault only maps the pfn with r/o permissions.
So if I use DAX to write the mmap()-ed pfn I get two faults - first
handled by do_shared_fault and then again for making it r/w in
do_wp_page.
Is this simply a missing optimization like was done here with the
cow_page? or am I missing something?

>
> Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  include/linux/mm.h |  1 +
>  mm/memory.c        | 33 ++++++++++++++++++++++++---------
>  2 files changed, 25 insertions(+), 9 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 02d11ee..88d1ef4 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -209,6 +209,7 @@ struct vm_fault {
>         pgoff_t pgoff;                  /* Logical page offset based on vma */
>         void __user *virtual_address;   /* Faulting virtual address */
>
> +       struct page *cow_page;          /* Handler may choose to COW */
>         struct page *page;              /* ->fault handlers should return a
>                                          * page here, unless VM_FAULT_NOPAGE
>                                          * is set (which is also implied by
> diff --git a/mm/memory.c b/mm/memory.c
> index 1cc6bfb..6dee424 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2002,6 +2002,7 @@ static int do_page_mkwrite(struct vm_area_struct *vma, struct page *page,
>         vmf.pgoff = page->index;
>         vmf.flags = FAULT_FLAG_WRITE|FAULT_FLAG_MKWRITE;
>         vmf.page = page;
> +       vmf.cow_page = NULL;
>
>         ret = vma->vm_ops->page_mkwrite(vma, &vmf);
>         if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE)))
> @@ -2701,7 +2702,8 @@ oom:
>   * See filemap_fault() and __lock_page_retry().
>   */
>  static int __do_fault(struct vm_area_struct *vma, unsigned long address,
> -               pgoff_t pgoff, unsigned int flags, struct page **page)
> +                       pgoff_t pgoff, unsigned int flags,
> +                       struct page *cow_page, struct page **page)
>  {
>         struct vm_fault vmf;
>         int ret;
> @@ -2710,10 +2712,13 @@ static int __do_fault(struct vm_area_struct *vma, unsigned long address,
>         vmf.pgoff = pgoff;
>         vmf.flags = flags;
>         vmf.page = NULL;
> +       vmf.cow_page = cow_page;
>
>         ret = vma->vm_ops->fault(vma, &vmf);
>         if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
>                 return ret;
> +       if (!vmf.page)
> +               goto out;
>
>         if (unlikely(PageHWPoison(vmf.page))) {
>                 if (ret & VM_FAULT_LOCKED)
> @@ -2727,6 +2732,7 @@ static int __do_fault(struct vm_area_struct *vma, unsigned long address,
>         else
>                 VM_BUG_ON_PAGE(!PageLocked(vmf.page), vmf.page);
>
> + out:
>         *page = vmf.page;
>         return ret;
>  }
> @@ -2900,7 +2906,7 @@ static int do_read_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>                 pte_unmap_unlock(pte, ptl);
>         }
>
> -       ret = __do_fault(vma, address, pgoff, flags, &fault_page);
> +       ret = __do_fault(vma, address, pgoff, flags, NULL, &fault_page);
>         if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
>                 return ret;
>
> @@ -2940,26 +2946,35 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>                 return VM_FAULT_OOM;
>         }
>
> -       ret = __do_fault(vma, address, pgoff, flags, &fault_page);
> +       ret = __do_fault(vma, address, pgoff, flags, new_page, &fault_page);
>         if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
>                 goto uncharge_out;
>
> -       copy_user_highpage(new_page, fault_page, address, vma);
> +       if (fault_page)
> +               copy_user_highpage(new_page, fault_page, address, vma);
>         __SetPageUptodate(new_page);
>
>         pte = pte_offset_map_lock(mm, pmd, address, &ptl);
>         if (unlikely(!pte_same(*pte, orig_pte))) {
>                 pte_unmap_unlock(pte, ptl);
> -               unlock_page(fault_page);
> -               page_cache_release(fault_page);
> +               if (fault_page) {
> +                       unlock_page(fault_page);
> +                       page_cache_release(fault_page);
> +               } else {
> +                       mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
> +               }
>                 goto uncharge_out;
>         }
>         do_set_pte(vma, address, new_page, pte, true, true);
>         mem_cgroup_commit_charge(new_page, memcg, false);
>         lru_cache_add_active_or_unevictable(new_page, vma);
>         pte_unmap_unlock(pte, ptl);
> -       unlock_page(fault_page);
> -       page_cache_release(fault_page);
> +       if (fault_page) {
> +               unlock_page(fault_page);
> +               page_cache_release(fault_page);
> +       } else {
> +               mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
> +       }
>         return ret;
>  uncharge_out:
>         mem_cgroup_cancel_charge(new_page, memcg);
> @@ -2978,7 +2993,7 @@ static int do_shared_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>         int dirtied = 0;
>         int ret, tmp;
>
> -       ret = __do_fault(vma, address, pgoff, flags, &fault_page);
> +       ret = __do_fault(vma, address, pgoff, flags, NULL, &fault_page);
>         if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
>                 return ret;
>
> --
> 2.1.1
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v12 04/20] mm: Allow page fault handlers to perform the COW
  2015-02-05  9:16   ` Yigal Korman
@ 2015-02-05 21:39     ` Matthew Wilcox
  2015-02-08 11:48       ` Yigal Korman
  0 siblings, 1 reply; 60+ messages in thread
From: Matthew Wilcox @ 2015-02-05 21:39 UTC (permalink / raw)
  To: Yigal Korman
  Cc: Matthew Wilcox, linux-fsdevel, linux-kernel, linux-mm, willy,
	Andrew Morton

On Thu, Feb 05, 2015 at 11:16:53AM +0200, Yigal Korman wrote:
> I have a question on a related issue (I think).
> I've noticed that for pfn-only mappings (VM_FAULT_NOPAGE)
> do_shared_fault only maps the pfn with r/o permissions.
> So if I use DAX to write the mmap()-ed pfn I get two faults - first
> handled by do_shared_fault and then again for making it r/w in
> do_wp_page.
> Is this simply a missing optimization like was done here with the
> cow_page? or am I missing something?

I have also noticed this behaviour.  I tracked down why it's happening:

DAX calls:
        error = vm_insert_mixed(vma, vaddr, pfn);
which calls:
        return insert_pfn(vma, addr, pfn, vma->vm_page_prot);

If you insert some debugging, you'll notice here that vm_page_prot does
not include PROT_WRITE.

That got cleared during mmap_region() where it does:

        if (vma_wants_writenotify(vma)) {
                pgprot_t pprot = vma->vm_page_prot;
...
                vma->vm_page_prot = vm_get_page_prot(vm_flags & ~VM_SHARED);


And why do we want writenotify (according to the VM)?  Because we have:

        /* The backer wishes to know when pages are first written to? */
        if (vma->vm_ops && vma->vm_ops->page_mkwrite)
                return 1;

We don't really want to be notified on a first write; we want the page to be
inserted write-enabled.  But in the case where we've covered a hole with a
read-only zero page, we need to be notified so we can allocate a page of
storage.

So, how to fix?  We could adjust vm_page_prot to include PROT_WRITE.
I think that should work, since we'll only insert zeroed pages for read
faults, and so the maybe_mkwrite() won't be called in do_set_pte().
I'm just not entirely sure where to set it.  Perhaps a MM person could
make a helpful suggestion?

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v12 04/20] mm: Allow page fault handlers to perform the COW
  2015-02-05 21:39     ` Matthew Wilcox
@ 2015-02-08 11:48       ` Yigal Korman
  0 siblings, 0 replies; 60+ messages in thread
From: Yigal Korman @ 2015-02-08 11:48 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Matthew Wilcox, linux-fsdevel, linux-kernel, linux-mm, Andrew Morton

On Thu, Feb 5, 2015 at 11:39 PM, Matthew Wilcox <willy@linux.intel.com> wrote:
>
> On Thu, Feb 05, 2015 at 11:16:53AM +0200, Yigal Korman wrote:
> > I have a question on a related issue (I think).
> > I've noticed that for pfn-only mappings (VM_FAULT_NOPAGE)
> > do_shared_fault only maps the pfn with r/o permissions.
> > So if I use DAX to write the mmap()-ed pfn I get two faults - first
> > handled by do_shared_fault and then again for making it r/w in
> > do_wp_page.
> > Is this simply a missing optimization like was done here with the
> > cow_page? or am I missing something?
>
> I have also noticed this behaviour.  I tracked down why it's happening:
>
> DAX calls:
>         error = vm_insert_mixed(vma, vaddr, pfn);
> which calls:
>         return insert_pfn(vma, addr, pfn, vma->vm_page_prot);
>
> If you insert some debugging, you'll notice here that vm_page_prot does
> not include PROT_WRITE.
>
> That got cleared during mmap_region() where it does:
>
>         if (vma_wants_writenotify(vma)) {
>                 pgprot_t pprot = vma->vm_page_prot;
> ...
>                 vma->vm_page_prot = vm_get_page_prot(vm_flags & ~VM_SHARED);
>
>
> And why do we want writenotify (according to the VM)?  Because we have:
>
>         /* The backer wishes to know when pages are first written to? */
>         if (vma->vm_ops && vma->vm_ops->page_mkwrite)
>                 return 1;
>
> We don't really want to be notified on a first write; we want the page to be
> inserted write-enabled.  But in the case where we've covered a hole with a
> read-only zero page, we need to be notified so we can allocate a page of
> storage.
>
> So, how to fix?  We could adjust vm_page_prot to include PROT_WRITE.
> I think that should work, since we'll only insert zeroed pages for read
> faults, and so the maybe_mkwrite() won't be called in do_set_pte().
> I'm just not entirely sure where to set it.  Perhaps a MM person could
> make a helpful suggestion?

I was thinking that do_shared_fault should simply call maybe_mkwrite()
in case of VM_FAULT_NOPAGE.
I think it's what do_wp_page does afterwards anyway:

entry = maybe_mkwrite(pte_mkdirty(entry), vma);

But I'm sure it's not the whole picture...
Help from MM would indeed be appreciated.

Y

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v12 10/20] dax: Replace XIP documentation with DAX documentation
  2014-10-24 21:20 ` [PATCH v12 10/20] dax: Replace XIP documentation with DAX documentation Matthew Wilcox
  2015-01-12 23:10   ` Andrew Morton
@ 2016-01-21 18:38   ` Jared Hulbert
  2016-01-22 13:07     ` Wilcox, Matthew R
  1 sibling, 1 reply; 60+ messages in thread
From: Jared Hulbert @ 2016-01-21 18:38 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Linux FS Devel, LKML, Linux Memory Management List,
	Matthew Wilcox, Andrew Morton, Carsten Otte, Chris Brandt

HI!  I've been out of the community for a while, but I'm trying to
step back in here and catch up with some of my old areas of specialty.
Couple questions, sorry to drag up such old conversations.

The DAX documentation that made it into kernel 4.0 has the following
line  "The DAX code does not work correctly on architectures which
have virtually mapped caches such as ARM, MIPS and SPARC."

1) It really doesn't support ARM.....!!!!?  I never had problems with
the old filemap_xip.c stuff on ARM, what changed?
2) Is there a thread discussing this?

On Fri, Oct 24, 2014 at 2:20 PM, Matthew Wilcox
<matthew.r.wilcox@intel.com> wrote:
> From: Matthew Wilcox <willy@linux.intel.com>
>
> Based on the original XIP documentation, this documents the current
> state of affairs, and includes instructions on how users can enable DAX
> if their devices and kernel support it.
>
> Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
> Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
> ---
>  Documentation/filesystems/00-INDEX |  5 ++-
>  Documentation/filesystems/dax.txt  | 89 ++++++++++++++++++++++++++++++++++++++
>  Documentation/filesystems/xip.txt  | 71 ------------------------------
>  3 files changed, 92 insertions(+), 73 deletions(-)
>  create mode 100644 Documentation/filesystems/dax.txt
>  delete mode 100644 Documentation/filesystems/xip.txt
>
> diff --git a/Documentation/filesystems/00-INDEX b/Documentation/filesystems/00-INDEX
> index ac28149..9922939 100644
> --- a/Documentation/filesystems/00-INDEX
> +++ b/Documentation/filesystems/00-INDEX
> @@ -34,6 +34,9 @@ configfs/
>         - directory containing configfs documentation and example code.
>  cramfs.txt
>         - info on the cram filesystem for small storage (ROMs etc).
> +dax.txt
> +       - info on avoiding the page cache for files stored on CPU-addressable
> +         storage devices.
>  debugfs.txt
>         - info on the debugfs filesystem.
>  devpts.txt
> @@ -154,5 +157,3 @@ xfs-self-describing-metadata.txt
>         - info on XFS Self Describing Metadata.
>  xfs.txt
>         - info and mount options for the XFS filesystem.
> -xip.txt
> -       - info on execute-in-place for file mappings.
> diff --git a/Documentation/filesystems/dax.txt b/Documentation/filesystems/dax.txt
> new file mode 100644
> index 0000000..635adaa
> --- /dev/null
> +++ b/Documentation/filesystems/dax.txt
> @@ -0,0 +1,89 @@
> +Direct Access for files
> +-----------------------
> +
> +Motivation
> +----------
> +
> +The page cache is usually used to buffer reads and writes to files.
> +It is also used to provide the pages which are mapped into userspace
> +by a call to mmap.
> +
> +For block devices that are memory-like, the page cache pages would be
> +unnecessary copies of the original storage.  The DAX code removes the
> +extra copy by performing reads and writes directly to the storage device.
> +For file mappings, the storage device is mapped directly into userspace.
> +
> +
> +Usage
> +-----
> +
> +If you have a block device which supports DAX, you can make a filesystem
> +on it as usual.  When mounting it, use the -o dax option manually
> +or add 'dax' to the options in /etc/fstab.
> +
> +
> +Implementation Tips for Block Driver Writers
> +--------------------------------------------
> +
> +To support DAX in your block driver, implement the 'direct_access'
> +block device operation.  It is used to translate the sector number
> +(expressed in units of 512-byte sectors) to a page frame number (pfn)
> +that identifies the physical page for the memory.  It also returns a
> +kernel virtual address that can be used to access the memory.
> +
> +The direct_access method takes a 'size' parameter that indicates the
> +number of bytes being requested.  The function should return the number
> +of bytes that can be contiguously accessed at that offset.  It may also
> +return a negative errno if an error occurs.
> +
> +In order to support this method, the storage must be byte-accessible by
> +the CPU at all times.  If your device uses paging techniques to expose
> +a large amount of memory through a smaller window, then you cannot
> +implement direct_access.  Equally, if your device can occasionally
> +stall the CPU for an extended period, you should also not attempt to
> +implement direct_access.
> +
> +These block devices may be used for inspiration:
> +- axonram: Axon DDR2 device driver
> +- brd: RAM backed block device driver
> +- dcssblk: s390 dcss block device driver
> +
> +
> +Implementation Tips for Filesystem Writers
> +------------------------------------------
> +
> +Filesystem support consists of
> +- adding support to mark inodes as being DAX by setting the S_DAX flag in
> +  i_flags
> +- implementing the direct_IO address space operation, and calling
> +  dax_do_io() instead of blockdev_direct_IO() if S_DAX is set
> +- implementing an mmap file operation for DAX files which sets the
> +  VM_MIXEDMAP flag on the VMA, and setting the vm_ops to include handlers
> +  for fault and page_mkwrite (which should probably call dax_fault() and
> +  dax_mkwrite(), passing the appropriate get_block() callback)
> +- calling dax_truncate_page() instead of block_truncate_page() for DAX files
> +- ensuring that there is sufficient locking between reads, writes,
> +  truncates and page faults
> +
> +The get_block() callback passed to the DAX functions may return
> +uninitialised extents.  If it does, it must ensure that simultaneous
> +calls to get_block() (for example by a page-fault racing with a read()
> +or a write()) work correctly.
> +
> +These filesystems may be used for inspiration:
> +- ext2: the second extended filesystem, see Documentation/filesystems/ext2.txt
> +
> +
> +Shortcomings
> +------------
> +
> +Even if the kernel or its modules are stored on a filesystem that supports
> +DAX on a block device that supports DAX, they will still be copied into RAM.
> +
> +Calling get_user_pages() on a range of user memory that has been mmaped
> +from a DAX file will fail as there are no 'struct page' to describe
> +those pages.  This problem is being worked on.  That means that O_DIRECT
> +reads/writes to those memory ranges from a non-DAX file will fail (note
> +that O_DIRECT reads/writes _of a DAX file_ do work, it is the memory
> +that is being accessed that is key here).  Other things that will not
> +work include RDMA, sendfile() and splice().
> diff --git a/Documentation/filesystems/xip.txt b/Documentation/filesystems/xip.txt
> deleted file mode 100644
> index b774729..0000000
> --- a/Documentation/filesystems/xip.txt
> +++ /dev/null
> @@ -1,71 +0,0 @@
> -Execute-in-place for file mappings
> -----------------------------------
> -
> -Motivation
> -----------
> -File mappings are performed by mapping page cache pages to userspace. In
> -addition, read&write type file operations also transfer data from/to the page
> -cache.
> -
> -For memory backed storage devices that use the block device interface, the page
> -cache pages are in fact copies of the original storage. Various approaches
> -exist to work around the need for an extra copy. The ramdisk driver for example
> -does read the data into the page cache, keeps a reference, and discards the
> -original data behind later on.
> -
> -Execute-in-place solves this issue the other way around: instead of keeping
> -data in the page cache, the need to have a page cache copy is eliminated
> -completely. With execute-in-place, read&write type operations are performed
> -directly from/to the memory backed storage device. For file mappings, the
> -storage device itself is mapped directly into userspace.
> -
> -This implementation was initially written for shared memory segments between
> -different virtual machines on s390 hardware to allow multiple machines to
> -share the same binaries and libraries.
> -
> -Implementation
> ---------------
> -Execute-in-place is implemented in three steps: block device operation,
> -address space operation, and file operations.
> -
> -A block device operation named direct_access is used to translate the
> -block device sector number to a page frame number (pfn) that identifies
> -the physical page for the memory.  It also returns a kernel virtual
> -address that can be used to access the memory.
> -
> -The direct_access method takes a 'size' parameter that indicates the
> -number of bytes being requested.  The function should return the number
> -of bytes that can be contiguously accessed at that offset.  It may also
> -return a negative errno if an error occurs.
> -
> -The block device operation is optional, these block devices support it as of
> -today:
> -- dcssblk: s390 dcss block device driver
> -
> -An address space operation named get_xip_mem is used to retrieve references
> -to a page frame number and a kernel address. To obtain these values a reference
> -to an address_space is provided. This function assigns values to the kmem and
> -pfn parameters. The third argument indicates whether the function should allocate
> -blocks if needed.
> -
> -This address space operation is mutually exclusive with readpage&writepage that
> -do page cache read/write operations.
> -The following filesystems support it as of today:
> -- ext2: the second extended filesystem, see Documentation/filesystems/ext2.txt
> -
> -A set of file operations that do utilize get_xip_page can be found in
> -mm/filemap_xip.c . The following file operation implementations are provided:
> -- aio_read/aio_write
> -- readv/writev
> -- sendfile
> -
> -The generic file operations do_sync_read/do_sync_write can be used to implement
> -classic synchronous IO calls.
> -
> -Shortcomings
> -------------
> -This implementation is limited to storage devices that are cpu addressable at
> -all times (no highmem or such). It works well on rom/ram, but enhancements are
> -needed to make it work with flash in read+write mode.
> -Putting the Linux kernel and/or its modules on a xip filesystem does not mean
> -they are not copied.
> --
> 2.1.1
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* RE: [PATCH v12 10/20] dax: Replace XIP documentation with DAX documentation
  2016-01-21 18:38   ` Jared Hulbert
@ 2016-01-22 13:07     ` Wilcox, Matthew R
  2016-01-22 13:48       ` Chris Brandt
  2016-01-24  9:03       ` Jared Hulbert
  0 siblings, 2 replies; 60+ messages in thread
From: Wilcox, Matthew R @ 2016-01-22 13:07 UTC (permalink / raw)
  To: Jared Hulbert
  Cc: Linux FS Devel, LKML, Linux Memory Management List,
	Matthew Wilcox, Andrew Morton, Carsten Otte, Chris Brandt

Hi Jared,

The old filemap_xip code was living in a state of sin ;-)  It was writing to the kernel's mapping of an address, and then not flushing the cache before telling userspace that the data was updated.  That left userspace able to read stale data, which might actually have been a security hole (had that page previously contained, say, /etc/passwd).

We don't have cache flushing functions that work without a struct page.  So we need to come up with a new solution.  My preferred solution is to explicitly map the memory before using it.  On ARM, MIPS & SPARC, each page should be mapped to an address that is at a multiple of SHMLBA from the address that the user has the page mapped at.  On other architectures, there is no d-cache flush problem, so they can use an identity map.

Or you can just enable the DAX code and continue living in the state of sin that you were in before.  It probably won't bite you ... maybe ...

-----Original Message-----
From: Jared Hulbert [mailto:jaredeh@gmail.com] 
Sent: Thursday, January 21, 2016 10:38 AM
To: Wilcox, Matthew R
Cc: Linux FS Devel; LKML; Linux Memory Management List; Matthew Wilcox; Andrew Morton; Carsten Otte; Chris Brandt
Subject: Re: [PATCH v12 10/20] dax: Replace XIP documentation with DAX documentation

HI!  I've been out of the community for a while, but I'm trying to
step back in here and catch up with some of my old areas of specialty.
Couple questions, sorry to drag up such old conversations.

The DAX documentation that made it into kernel 4.0 has the following
line  "The DAX code does not work correctly on architectures which
have virtually mapped caches such as ARM, MIPS and SPARC."

1) It really doesn't support ARM.....!!!!?  I never had problems with
the old filemap_xip.c stuff on ARM, what changed?
2) Is there a thread discussing this?

On Fri, Oct 24, 2014 at 2:20 PM, Matthew Wilcox
<matthew.r.wilcox@intel.com> wrote:
> From: Matthew Wilcox <willy@linux.intel.com>
>
> Based on the original XIP documentation, this documents the current
> state of affairs, and includes instructions on how users can enable DAX
> if their devices and kernel support it.
>
> Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
> Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
> ---
>  Documentation/filesystems/00-INDEX |  5 ++-
>  Documentation/filesystems/dax.txt  | 89 ++++++++++++++++++++++++++++++++++++++
>  Documentation/filesystems/xip.txt  | 71 ------------------------------
>  3 files changed, 92 insertions(+), 73 deletions(-)
>  create mode 100644 Documentation/filesystems/dax.txt
>  delete mode 100644 Documentation/filesystems/xip.txt
>
> diff --git a/Documentation/filesystems/00-INDEX b/Documentation/filesystems/00-INDEX
> index ac28149..9922939 100644
> --- a/Documentation/filesystems/00-INDEX
> +++ b/Documentation/filesystems/00-INDEX
> @@ -34,6 +34,9 @@ configfs/
>         - directory containing configfs documentation and example code.
>  cramfs.txt
>         - info on the cram filesystem for small storage (ROMs etc).
> +dax.txt
> +       - info on avoiding the page cache for files stored on CPU-addressable
> +         storage devices.
>  debugfs.txt
>         - info on the debugfs filesystem.
>  devpts.txt
> @@ -154,5 +157,3 @@ xfs-self-describing-metadata.txt
>         - info on XFS Self Describing Metadata.
>  xfs.txt
>         - info and mount options for the XFS filesystem.
> -xip.txt
> -       - info on execute-in-place for file mappings.
> diff --git a/Documentation/filesystems/dax.txt b/Documentation/filesystems/dax.txt
> new file mode 100644
> index 0000000..635adaa
> --- /dev/null
> +++ b/Documentation/filesystems/dax.txt
> @@ -0,0 +1,89 @@
> +Direct Access for files
> +-----------------------
> +
> +Motivation
> +----------
> +
> +The page cache is usually used to buffer reads and writes to files.
> +It is also used to provide the pages which are mapped into userspace
> +by a call to mmap.
> +
> +For block devices that are memory-like, the page cache pages would be
> +unnecessary copies of the original storage.  The DAX code removes the
> +extra copy by performing reads and writes directly to the storage device.
> +For file mappings, the storage device is mapped directly into userspace.
> +
> +
> +Usage
> +-----
> +
> +If you have a block device which supports DAX, you can make a filesystem
> +on it as usual.  When mounting it, use the -o dax option manually
> +or add 'dax' to the options in /etc/fstab.
> +
> +
> +Implementation Tips for Block Driver Writers
> +--------------------------------------------
> +
> +To support DAX in your block driver, implement the 'direct_access'
> +block device operation.  It is used to translate the sector number
> +(expressed in units of 512-byte sectors) to a page frame number (pfn)
> +that identifies the physical page for the memory.  It also returns a
> +kernel virtual address that can be used to access the memory.
> +
> +The direct_access method takes a 'size' parameter that indicates the
> +number of bytes being requested.  The function should return the number
> +of bytes that can be contiguously accessed at that offset.  It may also
> +return a negative errno if an error occurs.
> +
> +In order to support this method, the storage must be byte-accessible by
> +the CPU at all times.  If your device uses paging techniques to expose
> +a large amount of memory through a smaller window, then you cannot
> +implement direct_access.  Equally, if your device can occasionally
> +stall the CPU for an extended period, you should also not attempt to
> +implement direct_access.
> +
> +These block devices may be used for inspiration:
> +- axonram: Axon DDR2 device driver
> +- brd: RAM backed block device driver
> +- dcssblk: s390 dcss block device driver
> +
> +
> +Implementation Tips for Filesystem Writers
> +------------------------------------------
> +
> +Filesystem support consists of
> +- adding support to mark inodes as being DAX by setting the S_DAX flag in
> +  i_flags
> +- implementing the direct_IO address space operation, and calling
> +  dax_do_io() instead of blockdev_direct_IO() if S_DAX is set
> +- implementing an mmap file operation for DAX files which sets the
> +  VM_MIXEDMAP flag on the VMA, and setting the vm_ops to include handlers
> +  for fault and page_mkwrite (which should probably call dax_fault() and
> +  dax_mkwrite(), passing the appropriate get_block() callback)
> +- calling dax_truncate_page() instead of block_truncate_page() for DAX files
> +- ensuring that there is sufficient locking between reads, writes,
> +  truncates and page faults
> +
> +The get_block() callback passed to the DAX functions may return
> +uninitialised extents.  If it does, it must ensure that simultaneous
> +calls to get_block() (for example by a page-fault racing with a read()
> +or a write()) work correctly.
> +
> +These filesystems may be used for inspiration:
> +- ext2: the second extended filesystem, see Documentation/filesystems/ext2.txt
> +
> +
> +Shortcomings
> +------------
> +
> +Even if the kernel or its modules are stored on a filesystem that supports
> +DAX on a block device that supports DAX, they will still be copied into RAM.
> +
> +Calling get_user_pages() on a range of user memory that has been mmaped
> +from a DAX file will fail as there are no 'struct page' to describe
> +those pages.  This problem is being worked on.  That means that O_DIRECT
> +reads/writes to those memory ranges from a non-DAX file will fail (note
> +that O_DIRECT reads/writes _of a DAX file_ do work, it is the memory
> +that is being accessed that is key here).  Other things that will not
> +work include RDMA, sendfile() and splice().
> diff --git a/Documentation/filesystems/xip.txt b/Documentation/filesystems/xip.txt
> deleted file mode 100644
> index b774729..0000000
> --- a/Documentation/filesystems/xip.txt
> +++ /dev/null
> @@ -1,71 +0,0 @@
> -Execute-in-place for file mappings
> -----------------------------------
> -
> -Motivation
> -----------
> -File mappings are performed by mapping page cache pages to userspace. In
> -addition, read&write type file operations also transfer data from/to the page
> -cache.
> -
> -For memory backed storage devices that use the block device interface, the page
> -cache pages are in fact copies of the original storage. Various approaches
> -exist to work around the need for an extra copy. The ramdisk driver for example
> -does read the data into the page cache, keeps a reference, and discards the
> -original data behind later on.
> -
> -Execute-in-place solves this issue the other way around: instead of keeping
> -data in the page cache, the need to have a page cache copy is eliminated
> -completely. With execute-in-place, read&write type operations are performed
> -directly from/to the memory backed storage device. For file mappings, the
> -storage device itself is mapped directly into userspace.
> -
> -This implementation was initially written for shared memory segments between
> -different virtual machines on s390 hardware to allow multiple machines to
> -share the same binaries and libraries.
> -
> -Implementation
> ---------------
> -Execute-in-place is implemented in three steps: block device operation,
> -address space operation, and file operations.
> -
> -A block device operation named direct_access is used to translate the
> -block device sector number to a page frame number (pfn) that identifies
> -the physical page for the memory.  It also returns a kernel virtual
> -address that can be used to access the memory.
> -
> -The direct_access method takes a 'size' parameter that indicates the
> -number of bytes being requested.  The function should return the number
> -of bytes that can be contiguously accessed at that offset.  It may also
> -return a negative errno if an error occurs.
> -
> -The block device operation is optional, these block devices support it as of
> -today:
> -- dcssblk: s390 dcss block device driver
> -
> -An address space operation named get_xip_mem is used to retrieve references
> -to a page frame number and a kernel address. To obtain these values a reference
> -to an address_space is provided. This function assigns values to the kmem and
> -pfn parameters. The third argument indicates whether the function should allocate
> -blocks if needed.
> -
> -This address space operation is mutually exclusive with readpage&writepage that
> -do page cache read/write operations.
> -The following filesystems support it as of today:
> -- ext2: the second extended filesystem, see Documentation/filesystems/ext2.txt
> -
> -A set of file operations that do utilize get_xip_page can be found in
> -mm/filemap_xip.c . The following file operation implementations are provided:
> -- aio_read/aio_write
> -- readv/writev
> -- sendfile
> -
> -The generic file operations do_sync_read/do_sync_write can be used to implement
> -classic synchronous IO calls.
> -
> -Shortcomings
> -------------
> -This implementation is limited to storage devices that are cpu addressable at
> -all times (no highmem or such). It works well on rom/ram, but enhancements are
> -needed to make it work with flash in read+write mode.
> -Putting the Linux kernel and/or its modules on a xip filesystem does not mean
> -they are not copied.
> --
> 2.1.1
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* RE: [PATCH v12 10/20] dax: Replace XIP documentation with DAX documentation
  2016-01-22 13:07     ` Wilcox, Matthew R
@ 2016-01-22 13:48       ` Chris Brandt
  2016-01-22 14:39         ` Matthew Wilcox
  2016-01-24  9:03       ` Jared Hulbert
  1 sibling, 1 reply; 60+ messages in thread
From: Chris Brandt @ 2016-01-22 13:48 UTC (permalink / raw)
  To: Wilcox, Matthew R, Jared Hulbert
  Cc: Linux FS Devel, LKML, Linux Memory Management List,
	Matthew Wilcox, Andrew Morton, Carsten Otte

I believe the motivation for the new DAX code was being able to read/write data directly to specific physical memory. However, with the AXFS file system, XIP file mapping was mostly beneficial for direct access to executable code pages, not data. Code pages were XIP-ed, and data pages were copied to RAM as normal. This results in a significant reduction in system RAM, especially when used with an XIP_KERNEL. In some systems, most of your RAM is eaten up by lots of code pages from big bloated shared libraries, not R/W data. (of course I'm talking about smaller embedded system here)


Also, it's up to the file system decide to decide what should be XIP/DAX or not. If your motivation is to DAX/XIP code pages to save RAM, then you don't have to worry about '/etc/password' cache issues, because that file would be handled in a traditional manner.

I think it comes down to what your motivation to DAX is: DAX data or DAX code


Chris



-----Original Message-----
From: Wilcox, Matthew R [mailto:matthew.r.wilcox@intel.com] 
Sent: Friday, January 22, 2016 8:08 AM
To: Jared Hulbert <jaredeh@gmail.com>
Cc: Linux FS Devel <linux-fsdevel@vger.kernel.org>; LKML <linux-kernel@vger.kernel.org>; Linux Memory Management List <linux-mm@kvack.org>; Matthew Wilcox <willy@linux.intel.com>; Andrew Morton <akpm@linux-foundation.org>; Carsten Otte <cotte@de.ibm.com>; Chris Brandt <Chris.Brandt@renesas.com>
Subject: RE: [PATCH v12 10/20] dax: Replace XIP documentation with DAX documentation

Hi Jared,

The old filemap_xip code was living in a state of sin ;-)  It was writing to the kernel's mapping of an address, and then not flushing the cache before telling userspace that the data was updated.  That left userspace able to read stale data, which might actually have been a security hole (had that page previously contained, say, /etc/passwd).

We don't have cache flushing functions that work without a struct page.  So we need to come up with a new solution.  My preferred solution is to explicitly map the memory before using it.  On ARM, MIPS & SPARC, each page should be mapped to an address that is at a multiple of SHMLBA from the address that the user has the page mapped at.  On other architectures, there is no d-cache flush problem, so they can use an identity map.

Or you can just enable the DAX code and continue living in the state of sin that you were in before.  It probably won't bite you ... maybe ...

-----Original Message-----
From: Jared Hulbert [mailto:jaredeh@gmail.com]
Sent: Thursday, January 21, 2016 10:38 AM
To: Wilcox, Matthew R
Cc: Linux FS Devel; LKML; Linux Memory Management List; Matthew Wilcox; Andrew Morton; Carsten Otte; Chris Brandt
Subject: Re: [PATCH v12 10/20] dax: Replace XIP documentation with DAX documentation

HI!  I've been out of the community for a while, but I'm trying to step back in here and catch up with some of my old areas of specialty.
Couple questions, sorry to drag up such old conversations.

The DAX documentation that made it into kernel 4.0 has the following line  "The DAX code does not work correctly on architectures which have virtually mapped caches such as ARM, MIPS and SPARC."

1) It really doesn't support ARM.....!!!!?  I never had problems with the old filemap_xip.c stuff on ARM, what changed?
2) Is there a thread discussing this?

On Fri, Oct 24, 2014 at 2:20 PM, Matthew Wilcox <matthew.r.wilcox@intel.com> wrote:
> From: Matthew Wilcox <willy@linux.intel.com>
>
> Based on the original XIP documentation, this documents the current 
> state of affairs, and includes instructions on how users can enable 
> DAX if their devices and kernel support it.
>
> Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
> Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
> ---
>  Documentation/filesystems/00-INDEX |  5 ++-  
> Documentation/filesystems/dax.txt  | 89 
> ++++++++++++++++++++++++++++++++++++++
>  Documentation/filesystems/xip.txt  | 71 
> ------------------------------
>  3 files changed, 92 insertions(+), 73 deletions(-)  create mode 
> 100644 Documentation/filesystems/dax.txt  delete mode 100644 
> Documentation/filesystems/xip.txt
>
> diff --git a/Documentation/filesystems/00-INDEX 
> b/Documentation/filesystems/00-INDEX
> index ac28149..9922939 100644
> --- a/Documentation/filesystems/00-INDEX
> +++ b/Documentation/filesystems/00-INDEX
> @@ -34,6 +34,9 @@ configfs/
>         - directory containing configfs documentation and example code.
>  cramfs.txt
>         - info on the cram filesystem for small storage (ROMs etc).
> +dax.txt
> +       - info on avoiding the page cache for files stored on CPU-addressable
> +         storage devices.
>  debugfs.txt
>         - info on the debugfs filesystem.
>  devpts.txt
> @@ -154,5 +157,3 @@ xfs-self-describing-metadata.txt
>         - info on XFS Self Describing Metadata.
>  xfs.txt
>         - info and mount options for the XFS filesystem.
> -xip.txt
> -       - info on execute-in-place for file mappings.
> diff --git a/Documentation/filesystems/dax.txt 
> b/Documentation/filesystems/dax.txt
> new file mode 100644
> index 0000000..635adaa
> --- /dev/null
> +++ b/Documentation/filesystems/dax.txt
> @@ -0,0 +1,89 @@
> +Direct Access for files
> +-----------------------
> +
> +Motivation
> +----------
> +
> +The page cache is usually used to buffer reads and writes to files.
> +It is also used to provide the pages which are mapped into userspace 
> +by a call to mmap.
> +
> +For block devices that are memory-like, the page cache pages would be 
> +unnecessary copies of the original storage.  The DAX code removes the 
> +extra copy by performing reads and writes directly to the storage device.
> +For file mappings, the storage device is mapped directly into userspace.
> +
> +
> +Usage
> +-----
> +
> +If you have a block device which supports DAX, you can make a 
> +filesystem on it as usual.  When mounting it, use the -o dax option 
> +manually or add 'dax' to the options in /etc/fstab.
> +
> +
> +Implementation Tips for Block Driver Writers
> +--------------------------------------------
> +
> +To support DAX in your block driver, implement the 'direct_access'
> +block device operation.  It is used to translate the sector number 
> +(expressed in units of 512-byte sectors) to a page frame number (pfn) 
> +that identifies the physical page for the memory.  It also returns a 
> +kernel virtual address that can be used to access the memory.
> +
> +The direct_access method takes a 'size' parameter that indicates the 
> +number of bytes being requested.  The function should return the 
> +number of bytes that can be contiguously accessed at that offset.  It 
> +may also return a negative errno if an error occurs.
> +
> +In order to support this method, the storage must be byte-accessible 
> +by the CPU at all times.  If your device uses paging techniques to 
> +expose a large amount of memory through a smaller window, then you 
> +cannot implement direct_access.  Equally, if your device can 
> +occasionally stall the CPU for an extended period, you should also 
> +not attempt to implement direct_access.
> +
> +These block devices may be used for inspiration:
> +- axonram: Axon DDR2 device driver
> +- brd: RAM backed block device driver
> +- dcssblk: s390 dcss block device driver
> +
> +
> +Implementation Tips for Filesystem Writers
> +------------------------------------------
> +
> +Filesystem support consists of
> +- adding support to mark inodes as being DAX by setting the S_DAX 
> +flag in
> +  i_flags
> +- implementing the direct_IO address space operation, and calling
> +  dax_do_io() instead of blockdev_direct_IO() if S_DAX is set
> +- implementing an mmap file operation for DAX files which sets the
> +  VM_MIXEDMAP flag on the VMA, and setting the vm_ops to include 
> +handlers
> +  for fault and page_mkwrite (which should probably call dax_fault() 
> +and
> +  dax_mkwrite(), passing the appropriate get_block() callback)
> +- calling dax_truncate_page() instead of block_truncate_page() for 
> +DAX files
> +- ensuring that there is sufficient locking between reads, writes,
> +  truncates and page faults
> +
> +The get_block() callback passed to the DAX functions may return 
> +uninitialised extents.  If it does, it must ensure that simultaneous 
> +calls to get_block() (for example by a page-fault racing with a 
> +read() or a write()) work correctly.
> +
> +These filesystems may be used for inspiration:
> +- ext2: the second extended filesystem, see 
> +Documentation/filesystems/ext2.txt
> +
> +
> +Shortcomings
> +------------
> +
> +Even if the kernel or its modules are stored on a filesystem that 
> +supports DAX on a block device that supports DAX, they will still be copied into RAM.
> +
> +Calling get_user_pages() on a range of user memory that has been 
> +mmaped from a DAX file will fail as there are no 'struct page' to 
> +describe those pages.  This problem is being worked on.  That means 
> +that O_DIRECT reads/writes to those memory ranges from a non-DAX file 
> +will fail (note that O_DIRECT reads/writes _of a DAX file_ do work, 
> +it is the memory that is being accessed that is key here).  Other 
> +things that will not work include RDMA, sendfile() and splice().
> diff --git a/Documentation/filesystems/xip.txt 
> b/Documentation/filesystems/xip.txt
> deleted file mode 100644
> index b774729..0000000
> --- a/Documentation/filesystems/xip.txt
> +++ /dev/null
> @@ -1,71 +0,0 @@
> -Execute-in-place for file mappings
> -----------------------------------
> -
> -Motivation
> -----------
> -File mappings are performed by mapping page cache pages to userspace. 
> In -addition, read&write type file operations also transfer data 
> from/to the page -cache.
> -
> -For memory backed storage devices that use the block device 
> interface, the page -cache pages are in fact copies of the original 
> storage. Various approaches -exist to work around the need for an 
> extra copy. The ramdisk driver for example -does read the data into 
> the page cache, keeps a reference, and discards the -original data behind later on.
> -
> -Execute-in-place solves this issue the other way around: instead of 
> keeping -data in the page cache, the need to have a page cache copy is 
> eliminated -completely. With execute-in-place, read&write type 
> operations are performed -directly from/to the memory backed storage 
> device. For file mappings, the -storage device itself is mapped directly into userspace.
> -
> -This implementation was initially written for shared memory segments 
> between -different virtual machines on s390 hardware to allow multiple 
> machines to -share the same binaries and libraries.
> -
> -Implementation
> ---------------
> -Execute-in-place is implemented in three steps: block device 
> operation, -address space operation, and file operations.
> -
> -A block device operation named direct_access is used to translate the 
> -block device sector number to a page frame number (pfn) that 
> identifies -the physical page for the memory.  It also returns a 
> kernel virtual -address that can be used to access the memory.
> -
> -The direct_access method takes a 'size' parameter that indicates the 
> -number of bytes being requested.  The function should return the 
> number -of bytes that can be contiguously accessed at that offset.  It 
> may also -return a negative errno if an error occurs.
> -
> -The block device operation is optional, these block devices support 
> it as of
> -today:
> -- dcssblk: s390 dcss block device driver
> -
> -An address space operation named get_xip_mem is used to retrieve 
> references -to a page frame number and a kernel address. To obtain 
> these values a reference -to an address_space is provided. This 
> function assigns values to the kmem and -pfn parameters. The third 
> argument indicates whether the function should allocate -blocks if needed.
> -
> -This address space operation is mutually exclusive with 
> readpage&writepage that -do page cache read/write operations.
> -The following filesystems support it as of today:
> -- ext2: the second extended filesystem, see 
> Documentation/filesystems/ext2.txt
> -
> -A set of file operations that do utilize get_xip_page can be found in 
> -mm/filemap_xip.c . The following file operation implementations are provided:
> -- aio_read/aio_write
> -- readv/writev
> -- sendfile
> -
> -The generic file operations do_sync_read/do_sync_write can be used to 
> implement -classic synchronous IO calls.
> -
> -Shortcomings
> -------------
> -This implementation is limited to storage devices that are cpu 
> addressable at -all times (no highmem or such). It works well on 
> rom/ram, but enhancements are -needed to make it work with flash in read+write mode.
> -Putting the Linux kernel and/or its modules on a xip filesystem does 
> not mean -they are not copied.
> --
> 2.1.1
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in the body 
> to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v12 10/20] dax: Replace XIP documentation with DAX documentation
  2016-01-22 13:48       ` Chris Brandt
@ 2016-01-22 14:39         ` Matthew Wilcox
  0 siblings, 0 replies; 60+ messages in thread
From: Matthew Wilcox @ 2016-01-22 14:39 UTC (permalink / raw)
  To: Chris Brandt
  Cc: Wilcox, Matthew R, Jared Hulbert, Linux FS Devel, LKML,
	Linux Memory Management List, Andrew Morton, Carsten Otte

On Fri, Jan 22, 2016 at 01:48:08PM +0000, Chris Brandt wrote:
> I believe the motivation for the new DAX code was being able to
> read/write data directly to specific physical memory. However, with
> the AXFS file system, XIP file mapping was mostly beneficial for direct
> access to executable code pages, not data. Code pages were XIP-ed, and
> data pages were copied to RAM as normal. This results in a significant
> reduction in system RAM, especially when used with an XIP_KERNEL. In
> some systems, most of your RAM is eaten up by lots of code pages from
> big bloated shared libraries, not R/W data. (of course I'm talking about
> smaller embedded system here)

OK, I can't construct a failure case for read-only usages.  If you want
to put together a patch-set that re-enables DAX in a read-only way on
those architectures, I'm fine with that.

I think your time would be better spent fixing the read-write problems;
once we see persistent memory on the embedded platforms, we'll need that
code anyway.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v12 10/20] dax: Replace XIP documentation with DAX documentation
  2016-01-22 13:07     ` Wilcox, Matthew R
  2016-01-22 13:48       ` Chris Brandt
@ 2016-01-24  9:03       ` Jared Hulbert
  2016-01-25 16:52         ` Matthew Wilcox
  1 sibling, 1 reply; 60+ messages in thread
From: Jared Hulbert @ 2016-01-24  9:03 UTC (permalink / raw)
  To: Wilcox, Matthew R
  Cc: Linux FS Devel, LKML, Linux Memory Management List,
	Matthew Wilcox, Andrew Morton, Carsten Otte, Chris Brandt

I our defense we didn't know we were sinning at the time.

Can you walk me through the cache flushing hole?  How is it okay on
X86 but not VIVT archs?  I'm missing something obvious here.

I thought earlier that vm_insert_mixed() handled the necessary
flushing.  Is that even the part you are worried about?

vm_insert_mixed()->insert_pfn()->update_mmu_cache() _should_ handle
the flush.  Except of course now that I look at the ARM code it looks
like it isn't doing anything if !pfn_valid().  <sigh>  I need to spend
some more time looking at this again.

What flushing functions would you call if you did have a cache page.
There are all kinds of cache flushing functions that work without a
struct page. If nothing else the specialized ASM instructions that do
the various flushes don't use struct page as a parameter.  This isn't
the first I've run into the lack of a sane cache API.  Grep for
inval_cache in the mtd drivers, should have been much easier.  Isn't
the proper solution to fix update_mmu_cache() or build out a pageless
cache flushing API?

I don't get the explicit mapping solution.  What are you mapping
where?  What addresses would be SHMLBA?  Phys, kernel, userspace?

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v12 10/20] dax: Replace XIP documentation with DAX documentation
  2016-01-24  9:03       ` Jared Hulbert
@ 2016-01-25 16:52         ` Matthew Wilcox
  2016-01-25 21:18           ` Jared Hulbert
  0 siblings, 1 reply; 60+ messages in thread
From: Matthew Wilcox @ 2016-01-25 16:52 UTC (permalink / raw)
  To: Jared Hulbert
  Cc: Wilcox, Matthew R, Linux FS Devel, LKML,
	Linux Memory Management List, Andrew Morton, Carsten Otte,
	Chris Brandt

On Sun, Jan 24, 2016 at 01:03:49AM -0800, Jared Hulbert wrote:
> I our defense we didn't know we were sinning at the time.

Fair enough.  Cache flushing is Hard.

> Can you walk me through the cache flushing hole?  How is it okay on
> X86 but not VIVT archs?  I'm missing something obvious here.
> 
> I thought earlier that vm_insert_mixed() handled the necessary
> flushing.  Is that even the part you are worried about?

No, that part should be fine.  My concern is about write() calls to files
which are also mmaped.  See Documentation/cachetlb.txt around line 229,
starting with "There exists another whole class of cpu cache issues" ...

> What flushing functions would you call if you did have a cache page.

Well, that's the problem; they don't currently exist.

> There are all kinds of cache flushing functions that work without a
> struct page. If nothing else the specialized ASM instructions that do
> the various flushes don't use struct page as a parameter.  This isn't
> the first I've run into the lack of a sane cache API.  Grep for
> inval_cache in the mtd drivers, should have been much easier.  Isn't
> the proper solution to fix update_mmu_cache() or build out a pageless
> cache flushing API?
> 
> I don't get the explicit mapping solution.  What are you mapping
> where?  What addresses would be SHMLBA?  Phys, kernel, userspace?

The problem comes in dax_io() where the kernel stores to an alias of the
user address (or reads from an alias of the user address).  Theoretically,
we should flush user addresses before we read from the kernel's alias,
and flush the kernel's alias after we store to it.

But if we create a new address for the kernel to use which lands on the
same cache line as the user's address (and this is what SHMLBA is used
to indicate), there is no incoherency between the kernel's view and the
user's view.  And no new cache flushing API is needed.

Is that clearer?  I'm not always good at explaining these things in a
way which makes sense to other people :-(

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v12 10/20] dax: Replace XIP documentation with DAX documentation
  2016-01-25 16:52         ` Matthew Wilcox
@ 2016-01-25 21:18           ` Jared Hulbert
  2016-01-27 19:51             ` Jared Hulbert
  0 siblings, 1 reply; 60+ messages in thread
From: Jared Hulbert @ 2016-01-25 21:18 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Wilcox, Matthew R, Linux FS Devel, LKML,
	Linux Memory Management List, Andrew Morton, Carsten Otte,
	Chris Brandt

On Mon, Jan 25, 2016 at 8:52 AM, Matthew Wilcox <willy@linux.intel.com> wrote:
> On Sun, Jan 24, 2016 at 01:03:49AM -0800, Jared Hulbert wrote:
>> I our defense we didn't know we were sinning at the time.
>
> Fair enough.  Cache flushing is Hard.
>
>> Can you walk me through the cache flushing hole?  How is it okay on
>> X86 but not VIVT archs?  I'm missing something obvious here.
>>
>> I thought earlier that vm_insert_mixed() handled the necessary
>> flushing.  Is that even the part you are worried about?
>
> No, that part should be fine.  My concern is about write() calls to files
> which are also mmaped.  See Documentation/cachetlb.txt around line 229,
> starting with "There exists another whole class of cpu cache issues" ...

oh wow.  So aren't all the copy_to/from_user() variants specifically
supposed to handle such cases?

>> What flushing functions would you call if you did have a cache page.
>
> Well, that's the problem; they don't currently exist.
>
>> There are all kinds of cache flushing functions that work without a
>> struct page. If nothing else the specialized ASM instructions that do
>> the various flushes don't use struct page as a parameter.  This isn't
>> the first I've run into the lack of a sane cache API.  Grep for
>> inval_cache in the mtd drivers, should have been much easier.  Isn't
>> the proper solution to fix update_mmu_cache() or build out a pageless
>> cache flushing API?
>>
>> I don't get the explicit mapping solution.  What are you mapping
>> where?  What addresses would be SHMLBA?  Phys, kernel, userspace?
>
> The problem comes in dax_io() where the kernel stores to an alias of the
> user address (or reads from an alias of the user address).  Theoretically,
> we should flush user addresses before we read from the kernel's alias,
> and flush the kernel's alias after we store to it.

Reasoning this out loud here.  Please correct.

For the dax read case:
- kernel virt is mapped to pfn
- data is memcpy'd from kernel virt

For the dax write case:
- kernel virt is mapped to pfn
- data is memcpy'd to kernel virt
- user virt map to pfn attempts to read

Is that right?  I see the x86 does a nocache copy_to/from operation,
I'm not familiar with the semantics of that call and it would take me
a while to understand the assembly but I assume it's doing some magic
opcodes that forces the writes down to physical memory with each
load/store.  Does the the caching model of the x86 arch update the
cache entries tied to the physical memory on update?

For architectures that don't do auto coherency magic...

For reads:
- User dcaches need flushing before kernel virtual mapping to ensure
kernel reads latest data.  If the user has unflushed data in the
dcache it would not be reflected in the read copy.
This failure mode only is a problem if the filesystem is RW.

For writes:
- Unlike the read case we don't need up to date data for the user's
mapping of a pfn.  However, the user will need to caches invalidated
to get fresh data, so we should make sure to writeback any affected
lines in the user caches so they don't get lost if we do an
invalidate.  I suppose uncommitted data might corrupt the new data
written from the kernel mapping if the cachelines get flushed later.
- After the data is memcpy'ed to the kernel virt map the cache, and
possibly the write buffers, should be flushed.  Without this flush the
data might not ever get to the user mapped versions.
- Assuming the user maps were all flushed at the outset they should be
reloaded with fresh data on access.

Do I get it more or less?

> But if we create a new address for the kernel to use which lands on the
> same cache line as the user's address (and this is what SHMLBA is used
> to indicate), there is no incoherency between the kernel's view and the
> user's view.  And no new cache flushing API is needed.

So... how exactly would one force the kernel address to be at the
SHMLBA boundary?

> Is that clearer?  I'm not always good at explaining these things in a
> way which makes sense to other people :-(

Yeah.  I think I'm at 80% comprehension here.  Or at least I think I
am.  Thanks.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v12 10/20] dax: Replace XIP documentation with DAX documentation
  2016-01-25 21:18           ` Jared Hulbert
@ 2016-01-27 19:51             ` Jared Hulbert
  0 siblings, 0 replies; 60+ messages in thread
From: Jared Hulbert @ 2016-01-27 19:51 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Wilcox, Matthew R, Linux FS Devel, LKML,
	Linux Memory Management List, Andrew Morton, Carsten Otte,
	Chris Brandt

On Mon, Jan 25, 2016 at 1:18 PM, Jared Hulbert <jaredeh@gmail.com> wrote:
> On Mon, Jan 25, 2016 at 8:52 AM, Matthew Wilcox <willy@linux.intel.com> wrote:
>> On Sun, Jan 24, 2016 at 01:03:49AM -0800, Jared Hulbert wrote:
>>> I our defense we didn't know we were sinning at the time.
>>
>> Fair enough.  Cache flushing is Hard.
>>
>>> Can you walk me through the cache flushing hole?  How is it okay on
>>> X86 but not VIVT archs?  I'm missing something obvious here.
>>>
>>> I thought earlier that vm_insert_mixed() handled the necessary
>>> flushing.  Is that even the part you are worried about?
>>
>> No, that part should be fine.  My concern is about write() calls to files
>> which are also mmaped.  See Documentation/cachetlb.txt around line 229,
>> starting with "There exists another whole class of cpu cache issues" ...
>
> oh wow.  So aren't all the copy_to/from_user() variants specifically
> supposed to handle such cases?
>
>>> What flushing functions would you call if you did have a cache page.
>>
>> Well, that's the problem; they don't currently exist.
>>
>>> There are all kinds of cache flushing functions that work without a
>>> struct page. If nothing else the specialized ASM instructions that do
>>> the various flushes don't use struct page as a parameter.  This isn't
>>> the first I've run into the lack of a sane cache API.  Grep for
>>> inval_cache in the mtd drivers, should have been much easier.  Isn't
>>> the proper solution to fix update_mmu_cache() or build out a pageless
>>> cache flushing API?
>>>
>>> I don't get the explicit mapping solution.  What are you mapping
>>> where?  What addresses would be SHMLBA?  Phys, kernel, userspace?
>>
>> The problem comes in dax_io() where the kernel stores to an alias of the
>> user address (or reads from an alias of the user address).  Theoretically,
>> we should flush user addresses before we read from the kernel's alias,
>> and flush the kernel's alias after we store to it.
>
> Reasoning this out loud here.  Please correct.
>
> For the dax read case:
> - kernel virt is mapped to pfn
> - data is memcpy'd from kernel virt
>
> For the dax write case:
> - kernel virt is mapped to pfn
> - data is memcpy'd to kernel virt
> - user virt map to pfn attempts to read
>
> Is that right?  I see the x86 does a nocache copy_to/from operation,
> I'm not familiar with the semantics of that call and it would take me
> a while to understand the assembly but I assume it's doing some magic
> opcodes that forces the writes down to physical memory with each
> load/store.  Does the the caching model of the x86 arch update the
> cache entries tied to the physical memory on update?
>
> For architectures that don't do auto coherency magic...
>
> For reads:
> - User dcaches need flushing before kernel virtual mapping to ensure
> kernel reads latest data.  If the user has unflushed data in the
> dcache it would not be reflected in the read copy.
> This failure mode only is a problem if the filesystem is RW.
>
> For writes:
> - Unlike the read case we don't need up to date data for the user's
> mapping of a pfn.  However, the user will need to caches invalidated
> to get fresh data, so we should make sure to writeback any affected
> lines in the user caches so they don't get lost if we do an
> invalidate.  I suppose uncommitted data might corrupt the new data
> written from the kernel mapping if the cachelines get flushed later.
> - After the data is memcpy'ed to the kernel virt map the cache, and
> possibly the write buffers, should be flushed.  Without this flush the
> data might not ever get to the user mapped versions.
> - Assuming the user maps were all flushed at the outset they should be
> reloaded with fresh data on access.
>
> Do I get it more or less?

I assume the silence means I don't get it.

Moving along...

The need to flush kernel aliases and user alias without a struct page
was articulated and cited as the reason why the DAX doesn't work with
ARM, MIPS, and SPARC.

One of the following routines should work for kernel flushing, right?
--  flush_cache_vmap(unsigned long start, unsigned long end)
--  flush_kernel_vmap_range(void *vaddr, int size)
--  invalidate_kernel_vmap_range(void *vaddr, int size)

For user aliases I'm less confident with here, but at first glance I
don't see why these wouldn't work?
-- flush_cache_page(struct vm_area_struct *vma, unsigned long addr,
unsigned long pfn)
-- flush_cache_range(struct vm_area_struct *vma, unsigned long start,
unsigned long end)

Help?!  I missing something here.

>> But if we create a new address for the kernel to use which lands on the
>> same cache line as the user's address (and this is what SHMLBA is used
>> to indicate), there is no incoherency between the kernel's view and the
>> user's view.  And no new cache flushing API is needed.
>
> So... how exactly would one force the kernel address to be at the
> SHMLBA boundary?
>
>> Is that clearer?  I'm not always good at explaining these things in a
>> way which makes sense to other people :-(
>
> Yeah.  I think I'm at 80% comprehension here.  Or at least I think I
> am.  Thanks.

^ permalink raw reply	[flat|nested] 60+ messages in thread

end of thread, other threads:[~2016-01-27 19:51 UTC | newest]

Thread overview: 60+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-10-24 21:20 [PATCH v12 00/20] DAX: Page cache bypass for filesystems on memory storage Matthew Wilcox
2014-10-24 21:20 ` [PATCH v12 01/20] axonram: Fix bug in direct_access Matthew Wilcox
2014-10-24 21:20 ` [PATCH v12 02/20] block: Change direct_access calling convention Matthew Wilcox
2014-10-24 21:20 ` [PATCH v12 03/20] mm: Fix XIP fault vs truncate race Matthew Wilcox
2015-01-12 23:09   ` Andrew Morton
2015-01-13 18:50     ` Matthew Wilcox
2014-10-24 21:20 ` [PATCH v12 04/20] mm: Allow page fault handlers to perform the COW Matthew Wilcox
2015-01-12 23:09   ` Andrew Morton
2015-01-13 18:58     ` Matthew Wilcox
2015-02-05  9:16   ` Yigal Korman
2015-02-05 21:39     ` Matthew Wilcox
2015-02-08 11:48       ` Yigal Korman
2014-10-24 21:20 ` [PATCH v12 05/20] vfs,ext2: Introduce IS_DAX(inode) Matthew Wilcox
2014-10-24 21:20 ` [PATCH v12 06/20] dax,ext2: Replace XIP read and write with DAX I/O Matthew Wilcox
2015-01-12 23:09   ` Andrew Morton
2015-01-13 20:59     ` Matthew Wilcox
2014-10-24 21:20 ` [PATCH v12 07/20] dax,ext2: Replace ext2_clear_xip_target with dax_clear_blocks Matthew Wilcox
2015-01-12 23:09   ` Andrew Morton
2015-01-13 21:39     ` Matthew Wilcox
2014-10-24 21:20 ` [PATCH v12 08/20] dax,ext2: Replace the XIP page fault handler with the DAX page fault handler Matthew Wilcox
2015-01-12 23:09   ` Andrew Morton
2015-01-13 21:53     ` Matthew Wilcox
2015-01-13 22:47       ` Andrew Morton
2014-10-24 21:20 ` [PATCH v12 09/20] dax,ext2: Replace xip_truncate_page with dax_truncate_page Matthew Wilcox
2015-01-12 23:09   ` Andrew Morton
2015-01-13 21:55     ` Matthew Wilcox
2014-10-24 21:20 ` [PATCH v12 10/20] dax: Replace XIP documentation with DAX documentation Matthew Wilcox
2015-01-12 23:10   ` Andrew Morton
2016-01-21 18:38   ` Jared Hulbert
2016-01-22 13:07     ` Wilcox, Matthew R
2016-01-22 13:48       ` Chris Brandt
2016-01-22 14:39         ` Matthew Wilcox
2016-01-24  9:03       ` Jared Hulbert
2016-01-25 16:52         ` Matthew Wilcox
2016-01-25 21:18           ` Jared Hulbert
2016-01-27 19:51             ` Jared Hulbert
2014-10-24 21:20 ` [PATCH v12 11/20] vfs: Remove get_xip_mem Matthew Wilcox
2014-10-24 21:20 ` [PATCH v12 12/20] ext2: Remove ext2_xip_verify_sb() Matthew Wilcox
2014-10-24 21:20 ` [PATCH v12 13/20] ext2: Remove ext2_use_xip Matthew Wilcox
2014-10-24 21:20 ` [PATCH v12 14/20] ext2: Remove xip.c and xip.h Matthew Wilcox
2014-10-24 21:20 ` [PATCH v12 15/20] vfs,ext2: Remove CONFIG_EXT2_FS_XIP and rename CONFIG_FS_XIP to CONFIG_FS_DAX Matthew Wilcox
2014-10-24 21:20 ` [PATCH v12 16/20] ext2: Remove ext2_aops_xip Matthew Wilcox
2014-10-24 21:20 ` [PATCH v12 17/20] ext2: Get rid of most mentions of XIP in ext2 Matthew Wilcox
2014-10-24 21:20 ` [PATCH v12 18/20] dax: Add dax_zero_page_range Matthew Wilcox
2015-01-12 23:10   ` Andrew Morton
2015-01-12 23:20     ` Ross Zwisler
2014-10-24 21:20 ` [PATCH v12 19/20] ext4: Add DAX functionality Matthew Wilcox
2014-10-24 21:20 ` [PATCH v12 20/20] brd: Rename XIP to DAX Matthew Wilcox
2014-12-10 14:03 ` [PATCH v12 00/20] DAX: Page cache bypass for filesystems on memory storage Christoph Hellwig
2014-12-10 14:12   ` Matthew Wilcox
2014-12-10 14:28     ` Jeff Moyer
2014-12-10 20:53     ` Dave Chinner
2015-01-05 18:41     ` Christoph Hellwig
2015-01-06  8:47       ` Andrew Morton
2015-01-08 11:49         ` pread2/ pwrite2 Christoph Hellwig
2015-01-09 19:30           ` Steve French
2015-01-08 16:28         ` [PATCH v12 00/20] DAX: Page cache bypass for filesystems on memory storage Milosz Tanski
2015-01-08 17:36           ` Jeremy Allison
2015-01-12 14:47         ` Matthew Wilcox
2015-01-12 23:09 ` Andrew Morton

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).