All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v6 00/22] Support ext4 on NV-DIMMs
@ 2014-02-25 14:18 ` Matthew Wilcox
  0 siblings, 0 replies; 79+ messages in thread
From: Matthew Wilcox @ 2014-02-25 14:18 UTC (permalink / raw)
  To: linux-kernel, linux-mm, linux-fsdevel, willy; +Cc: Matthew Wilcox

One of the primary uses for NV-DIMMs is to expose them as a block device
and use a filesystem to store files on the NV-DIMM.  While that works,
it currently wastes memory and CPU time buffering the files in the page
cache.  We have support in ext2 for bypassing the page cache, but it
has some races which are unfixable in the current design.  This series
of patches rewrite the underlying support, and add support for direct
access to ext4.

This iteration of the patchset renames the "XIP" support to "DAX".
This fixes the confusion between kernel XIP and filesystem XIP.  It's not
really about executing in-place; it's about direct access to memory-like
storage, bypassing the page cache.  DAX is TLA-compliant, retains the
exciting X, is pronouncable ("Dacks") and is not used elsewhere in
the kernel.  The only major use of DAX outside the kernel is the German
stock exchange, and I think that's pretty unlikely to cause confusion.

Patch 2 *still* wants careful review from the MM people.

I want to particularly credit Ross Zwisler for all the effort he's put
into tracking down bugs.

Testing
~~~~~~~

Seven xfstests still fail reliably with DAX that pass reliably without
DAX: ext4/301 generic/{075,091,112,127,223,263}

Two fail randomly for me whether DAX is enabled or not: generic/{299,300}

Eleven fail reliably without DAX: ext4/{302,303,304}
generic/{219,230,231,232,233,235,270} shared/218

Several are not run, because they need dm_flakey,
CONFIG_FAIL_MAKE_REQUEST, my 3GB ramdisk is too small, they're not
suitable for Linux or they're not suitable for ext4.  My current
score is 18/127 tests fail with -o dax and 11/127 without.

v6:
 - Fix compilation with CONFIG_FS_XIP=n (reported by Ross Zwisler)
 - Removed unused argument from xip_io (patch from Ross Zwisler)
 - Fixed buffer overrun in xip_io (original patch from Ross Zwisler)
 - Prevented reads past i_size (original patch from Ross Zwisler)
 - Fixed documentation errors (reported by Randy Dunlap)
 - Add handling of BH_New (reported by Dave Chinner)
 - Support the way ext4 reports holes (original patch from Ross Zwisler)
 - Zero the *end* of new blocks as well as the beginning (reported by Dave
   Chinner)
 - Rebased on top of 3.14-rc4 plus Kirill's cleanups of __do_fault() which
   are in linux-mm (http://marc.info/?l=linux-mm&m=139206489208546&w=2).
 - Renamed XIP to DAX
 - Fixed writev() so subsequent writes don't overwrite earlier writes
 - Remove code in ext4 to clear blocks in DAX files
 - Fixed dax_zero_page_range() to actually call memset

v5:
 - Improved documentation
 - Fixed a couple of warnings emitted by a newer version of gcc
 - Fixed a bug where we would read/write the wrong sector in xip_io for I/Os
   that were not aligned to PAGE_SIZE
 - Dropped PMD fault patch
 - Added support for unwritten extents


Matthew Wilcox (21):
  Fix XIP fault vs truncate race
  Allow page fault handlers to perform the COW
  axonram: Fix bug in direct_access
  Change direct_access calling convention
  Introduce IS_DAX(inode)
  Replace XIP read and write with DAX I/O
  Replace the XIP page fault handler with the DAX page fault handler
  Replace xip_truncate_page with dax_truncate_page
  Remove mm/filemap_xip.c
  Remove get_xip_mem
  Replace ext2_clear_xip_target with dax_clear_blocks
  ext2: Remove ext2_xip_verify_sb()
  ext2: Remove ext2_use_xip
  ext2: Remove xip.c and xip.h
  Remove CONFIG_EXT2_FS_XIP and rename CONFIG_FS_XIP to CONFIG_FS_DAX
  ext2: Remove ext2_aops_xip
  Get rid of most mentions of XIP in ext2
  xip: Add xip_zero_page_range
  ext4: Make ext4_block_zero_page_range static
  ext4: Fix typos
  dax: Add reporting of major faults

Ross Zwisler (1):
  ext4: Add DAX functionality

 Documentation/filesystems/Locking  |   3 -
 Documentation/filesystems/dax.txt  |  84 +++++++
 Documentation/filesystems/ext4.txt |   2 +
 Documentation/filesystems/xip.txt  |  68 ------
 arch/powerpc/sysdev/axonram.c      |   8 +-
 drivers/block/brd.c                |   8 +-
 drivers/s390/block/dcssblk.c       |  19 +-
 fs/Kconfig                         |  21 +-
 fs/Makefile                        |   1 +
 fs/dax.c                           | 451 ++++++++++++++++++++++++++++++++++
 fs/exofs/inode.c                   |   1 -
 fs/ext2/Kconfig                    |  11 -
 fs/ext2/Makefile                   |   1 -
 fs/ext2/ext2.h                     |   9 +-
 fs/ext2/file.c                     |  45 +++-
 fs/ext2/inode.c                    |  37 +--
 fs/ext2/namei.c                    |  13 +-
 fs/ext2/super.c                    |  48 ++--
 fs/ext2/xip.c                      |  91 -------
 fs/ext2/xip.h                      |  26 --
 fs/ext4/ext4.h                     |   8 +-
 fs/ext4/file.c                     |  53 +++-
 fs/ext4/indirect.c                 |  19 +-
 fs/ext4/inode.c                    |  94 +++++---
 fs/ext4/namei.c                    |  10 +-
 fs/ext4/super.c                    |  39 ++-
 fs/open.c                          |   5 +-
 include/linux/blkdev.h             |   4 +-
 include/linux/fs.h                 |  49 +++-
 include/linux/mm.h                 |   2 +
 mm/Makefile                        |   1 -
 mm/fadvise.c                       |   6 +-
 mm/filemap.c                       |   6 +-
 mm/filemap_xip.c                   | 483 -------------------------------------
 mm/madvise.c                       |   2 +-
 mm/memory.c                        |  44 +++-
 36 files changed, 911 insertions(+), 861 deletions(-)
 create mode 100644 Documentation/filesystems/dax.txt
 delete mode 100644 Documentation/filesystems/xip.txt
 create mode 100644 fs/dax.c
 delete mode 100644 fs/ext2/xip.c
 delete mode 100644 fs/ext2/xip.h
 delete mode 100644 mm/filemap_xip.c

-- 
1.8.5.3


^ permalink raw reply	[flat|nested] 79+ messages in thread

* [PATCH v6 00/22] Support ext4 on NV-DIMMs
@ 2014-02-25 14:18 ` Matthew Wilcox
  0 siblings, 0 replies; 79+ messages in thread
From: Matthew Wilcox @ 2014-02-25 14:18 UTC (permalink / raw)
  To: linux-kernel, linux-mm, linux-fsdevel, willy; +Cc: Matthew Wilcox

One of the primary uses for NV-DIMMs is to expose them as a block device
and use a filesystem to store files on the NV-DIMM.  While that works,
it currently wastes memory and CPU time buffering the files in the page
cache.  We have support in ext2 for bypassing the page cache, but it
has some races which are unfixable in the current design.  This series
of patches rewrite the underlying support, and add support for direct
access to ext4.

This iteration of the patchset renames the "XIP" support to "DAX".
This fixes the confusion between kernel XIP and filesystem XIP.  It's not
really about executing in-place; it's about direct access to memory-like
storage, bypassing the page cache.  DAX is TLA-compliant, retains the
exciting X, is pronouncable ("Dacks") and is not used elsewhere in
the kernel.  The only major use of DAX outside the kernel is the German
stock exchange, and I think that's pretty unlikely to cause confusion.

Patch 2 *still* wants careful review from the MM people.

I want to particularly credit Ross Zwisler for all the effort he's put
into tracking down bugs.

Testing
~~~~~~~

Seven xfstests still fail reliably with DAX that pass reliably without
DAX: ext4/301 generic/{075,091,112,127,223,263}

Two fail randomly for me whether DAX is enabled or not: generic/{299,300}

Eleven fail reliably without DAX: ext4/{302,303,304}
generic/{219,230,231,232,233,235,270} shared/218

Several are not run, because they need dm_flakey,
CONFIG_FAIL_MAKE_REQUEST, my 3GB ramdisk is too small, they're not
suitable for Linux or they're not suitable for ext4.  My current
score is 18/127 tests fail with -o dax and 11/127 without.

v6:
 - Fix compilation with CONFIG_FS_XIP=n (reported by Ross Zwisler)
 - Removed unused argument from xip_io (patch from Ross Zwisler)
 - Fixed buffer overrun in xip_io (original patch from Ross Zwisler)
 - Prevented reads past i_size (original patch from Ross Zwisler)
 - Fixed documentation errors (reported by Randy Dunlap)
 - Add handling of BH_New (reported by Dave Chinner)
 - Support the way ext4 reports holes (original patch from Ross Zwisler)
 - Zero the *end* of new blocks as well as the beginning (reported by Dave
   Chinner)
 - Rebased on top of 3.14-rc4 plus Kirill's cleanups of __do_fault() which
   are in linux-mm (http://marc.info/?l=linux-mm&m=139206489208546&w=2).
 - Renamed XIP to DAX
 - Fixed writev() so subsequent writes don't overwrite earlier writes
 - Remove code in ext4 to clear blocks in DAX files
 - Fixed dax_zero_page_range() to actually call memset

v5:
 - Improved documentation
 - Fixed a couple of warnings emitted by a newer version of gcc
 - Fixed a bug where we would read/write the wrong sector in xip_io for I/Os
   that were not aligned to PAGE_SIZE
 - Dropped PMD fault patch
 - Added support for unwritten extents


Matthew Wilcox (21):
  Fix XIP fault vs truncate race
  Allow page fault handlers to perform the COW
  axonram: Fix bug in direct_access
  Change direct_access calling convention
  Introduce IS_DAX(inode)
  Replace XIP read and write with DAX I/O
  Replace the XIP page fault handler with the DAX page fault handler
  Replace xip_truncate_page with dax_truncate_page
  Remove mm/filemap_xip.c
  Remove get_xip_mem
  Replace ext2_clear_xip_target with dax_clear_blocks
  ext2: Remove ext2_xip_verify_sb()
  ext2: Remove ext2_use_xip
  ext2: Remove xip.c and xip.h
  Remove CONFIG_EXT2_FS_XIP and rename CONFIG_FS_XIP to CONFIG_FS_DAX
  ext2: Remove ext2_aops_xip
  Get rid of most mentions of XIP in ext2
  xip: Add xip_zero_page_range
  ext4: Make ext4_block_zero_page_range static
  ext4: Fix typos
  dax: Add reporting of major faults

Ross Zwisler (1):
  ext4: Add DAX functionality

 Documentation/filesystems/Locking  |   3 -
 Documentation/filesystems/dax.txt  |  84 +++++++
 Documentation/filesystems/ext4.txt |   2 +
 Documentation/filesystems/xip.txt  |  68 ------
 arch/powerpc/sysdev/axonram.c      |   8 +-
 drivers/block/brd.c                |   8 +-
 drivers/s390/block/dcssblk.c       |  19 +-
 fs/Kconfig                         |  21 +-
 fs/Makefile                        |   1 +
 fs/dax.c                           | 451 ++++++++++++++++++++++++++++++++++
 fs/exofs/inode.c                   |   1 -
 fs/ext2/Kconfig                    |  11 -
 fs/ext2/Makefile                   |   1 -
 fs/ext2/ext2.h                     |   9 +-
 fs/ext2/file.c                     |  45 +++-
 fs/ext2/inode.c                    |  37 +--
 fs/ext2/namei.c                    |  13 +-
 fs/ext2/super.c                    |  48 ++--
 fs/ext2/xip.c                      |  91 -------
 fs/ext2/xip.h                      |  26 --
 fs/ext4/ext4.h                     |   8 +-
 fs/ext4/file.c                     |  53 +++-
 fs/ext4/indirect.c                 |  19 +-
 fs/ext4/inode.c                    |  94 +++++---
 fs/ext4/namei.c                    |  10 +-
 fs/ext4/super.c                    |  39 ++-
 fs/open.c                          |   5 +-
 include/linux/blkdev.h             |   4 +-
 include/linux/fs.h                 |  49 +++-
 include/linux/mm.h                 |   2 +
 mm/Makefile                        |   1 -
 mm/fadvise.c                       |   6 +-
 mm/filemap.c                       |   6 +-
 mm/filemap_xip.c                   | 483 -------------------------------------
 mm/madvise.c                       |   2 +-
 mm/memory.c                        |  44 +++-
 36 files changed, 911 insertions(+), 861 deletions(-)
 create mode 100644 Documentation/filesystems/dax.txt
 delete mode 100644 Documentation/filesystems/xip.txt
 create mode 100644 fs/dax.c
 delete mode 100644 fs/ext2/xip.c
 delete mode 100644 fs/ext2/xip.h
 delete mode 100644 mm/filemap_xip.c

-- 
1.8.5.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 79+ messages in thread

* [PATCH v6 01/22] Fix XIP fault vs truncate race
  2014-02-25 14:18 ` Matthew Wilcox
@ 2014-02-25 14:18   ` Matthew Wilcox
  -1 siblings, 0 replies; 79+ messages in thread
From: Matthew Wilcox @ 2014-02-25 14:18 UTC (permalink / raw)
  To: linux-kernel, linux-mm, linux-fsdevel, willy; +Cc: Matthew Wilcox

Pagecache faults recheck i_size after taking the page lock to ensure that
the fault didn't race against a truncate.  We don't have a page to lock
in the XIP case, so use the i_mmap_mutex instead.  It is locked in the
truncate path in unmap_mapping_range() after updating i_size.  So while
we hold it in the fault path, we are guaranteed that either i_size has
already been updated in the truncate path, or that the truncate will
subsequently call zap_page_range_single() and so remove the mapping we
have just inserted.

There is a window of time in which i_size has been reduced and the
thread has a mapping to a page which will be removed from the file,
but this is harmless as the page will not be allocated to a different
purpose before the thread's access to it is revoked.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
---
 mm/filemap_xip.c | 24 ++++++++++++++++++++++--
 1 file changed, 22 insertions(+), 2 deletions(-)

diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
index d8d9fe3..c8d23e9 100644
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -260,8 +260,17 @@ again:
 		__xip_unmap(mapping, vmf->pgoff);
 
 found:
+		/* We must recheck i_size under i_mmap_mutex */
+		mutex_lock(&mapping->i_mmap_mutex);
+		size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >>
+							PAGE_CACHE_SHIFT;
+		if (unlikely(vmf->pgoff >= size)) {
+			mutex_unlock(&mapping->i_mmap_mutex);
+			return VM_FAULT_SIGBUS;
+		}
 		err = vm_insert_mixed(vma, (unsigned long)vmf->virtual_address,
 							xip_pfn);
+		mutex_unlock(&mapping->i_mmap_mutex);
 		if (err == -ENOMEM)
 			return VM_FAULT_OOM;
 		/*
@@ -285,16 +294,27 @@ found:
 		}
 		if (error != -ENODATA)
 			goto out;
+
+		/* We must recheck i_size under i_mmap_mutex */
+		mutex_lock(&mapping->i_mmap_mutex);
+		size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >>
+							PAGE_CACHE_SHIFT;
+		if (unlikely(vmf->pgoff >= size)) {
+			ret = VM_FAULT_SIGBUS;
+			goto unlock;
+		}
 		/* not shared and writable, use xip_sparse_page() */
 		page = xip_sparse_page();
 		if (!page)
-			goto out;
+			goto unlock;
 		err = vm_insert_page(vma, (unsigned long)vmf->virtual_address,
 							page);
 		if (err == -ENOMEM)
-			goto out;
+			goto unlock;
 
 		ret = VM_FAULT_NOPAGE;
+unlock:
+		mutex_unlock(&mapping->i_mmap_mutex);
 out:
 		write_seqcount_end(&xip_sparse_seq);
 		mutex_unlock(&xip_sparse_mutex);
-- 
1.8.5.3


^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [PATCH v6 01/22] Fix XIP fault vs truncate race
@ 2014-02-25 14:18   ` Matthew Wilcox
  0 siblings, 0 replies; 79+ messages in thread
From: Matthew Wilcox @ 2014-02-25 14:18 UTC (permalink / raw)
  To: linux-kernel, linux-mm, linux-fsdevel, willy; +Cc: Matthew Wilcox

Pagecache faults recheck i_size after taking the page lock to ensure that
the fault didn't race against a truncate.  We don't have a page to lock
in the XIP case, so use the i_mmap_mutex instead.  It is locked in the
truncate path in unmap_mapping_range() after updating i_size.  So while
we hold it in the fault path, we are guaranteed that either i_size has
already been updated in the truncate path, or that the truncate will
subsequently call zap_page_range_single() and so remove the mapping we
have just inserted.

There is a window of time in which i_size has been reduced and the
thread has a mapping to a page which will be removed from the file,
but this is harmless as the page will not be allocated to a different
purpose before the thread's access to it is revoked.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
---
 mm/filemap_xip.c | 24 ++++++++++++++++++++++--
 1 file changed, 22 insertions(+), 2 deletions(-)

diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
index d8d9fe3..c8d23e9 100644
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -260,8 +260,17 @@ again:
 		__xip_unmap(mapping, vmf->pgoff);
 
 found:
+		/* We must recheck i_size under i_mmap_mutex */
+		mutex_lock(&mapping->i_mmap_mutex);
+		size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >>
+							PAGE_CACHE_SHIFT;
+		if (unlikely(vmf->pgoff >= size)) {
+			mutex_unlock(&mapping->i_mmap_mutex);
+			return VM_FAULT_SIGBUS;
+		}
 		err = vm_insert_mixed(vma, (unsigned long)vmf->virtual_address,
 							xip_pfn);
+		mutex_unlock(&mapping->i_mmap_mutex);
 		if (err == -ENOMEM)
 			return VM_FAULT_OOM;
 		/*
@@ -285,16 +294,27 @@ found:
 		}
 		if (error != -ENODATA)
 			goto out;
+
+		/* We must recheck i_size under i_mmap_mutex */
+		mutex_lock(&mapping->i_mmap_mutex);
+		size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >>
+							PAGE_CACHE_SHIFT;
+		if (unlikely(vmf->pgoff >= size)) {
+			ret = VM_FAULT_SIGBUS;
+			goto unlock;
+		}
 		/* not shared and writable, use xip_sparse_page() */
 		page = xip_sparse_page();
 		if (!page)
-			goto out;
+			goto unlock;
 		err = vm_insert_page(vma, (unsigned long)vmf->virtual_address,
 							page);
 		if (err == -ENOMEM)
-			goto out;
+			goto unlock;
 
 		ret = VM_FAULT_NOPAGE;
+unlock:
+		mutex_unlock(&mapping->i_mmap_mutex);
 out:
 		write_seqcount_end(&xip_sparse_seq);
 		mutex_unlock(&xip_sparse_mutex);
-- 
1.8.5.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [PATCH v6 02/22] Allow page fault handlers to perform the COW
  2014-02-25 14:18 ` Matthew Wilcox
@ 2014-02-25 14:18   ` Matthew Wilcox
  -1 siblings, 0 replies; 79+ messages in thread
From: Matthew Wilcox @ 2014-02-25 14:18 UTC (permalink / raw)
  To: linux-kernel, linux-mm, linux-fsdevel, willy; +Cc: Matthew Wilcox

Currently COW of an XIP file is done by first bringing in a read-only
mapping, then retrying the fault and copying the page.  It is much more
efficient to tell the fault handler that a COW is being attempted (by
passing in the pre-allocated page in the vm_fault structure), and allow
the handler to perform the COW operation itself.

Where the filemap code protects against truncation of the file until
the PTE has been installed with the page lock, the XIP code use the
i_mmap_mutex instead.  We must therefore unlock the i_mmap_mutex after
inserting the PTE.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
---
 include/linux/mm.h |  2 ++
 mm/memory.c        | 44 ++++++++++++++++++++++++++++++++------------
 2 files changed, 34 insertions(+), 12 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index f28f46e..22260c0 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -205,6 +205,7 @@ struct vm_fault {
 	pgoff_t pgoff;			/* Logical page offset based on vma */
 	void __user *virtual_address;	/* Faulting virtual address */
 
+	struct page *cow_page;		/* Handler may choose to COW */
 	struct page *page;		/* ->fault handlers should return a
 					 * page here, unless VM_FAULT_NOPAGE
 					 * is set (which is also implied by
@@ -1000,6 +1001,7 @@ static inline int page_mapped(struct page *page)
 #define VM_FAULT_HWPOISON 0x0010	/* Hit poisoned small page */
 #define VM_FAULT_HWPOISON_LARGE 0x0020  /* Hit poisoned large page. Index encoded in upper bits */
 
+#define VM_FAULT_COWED	0x0080	/* ->fault COWed the page instead */
 #define VM_FAULT_NOPAGE	0x0100	/* ->fault installed the pte, not return page */
 #define VM_FAULT_LOCKED	0x0200	/* ->fault locked the returned page */
 #define VM_FAULT_RETRY	0x0400	/* ->fault blocked, must retry */
diff --git a/mm/memory.c b/mm/memory.c
index 7f52c46..c7fc9c5 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3288,7 +3288,8 @@ oom:
 }
 
 static int __do_fault(struct vm_area_struct *vma, unsigned long address,
-		pgoff_t pgoff, unsigned int flags, struct page **page)
+			pgoff_t pgoff, unsigned int flags,
+			struct page *cow_page, struct page **page)
 {
 	struct vm_fault vmf;
 	int ret;
@@ -3297,10 +3298,13 @@ static int __do_fault(struct vm_area_struct *vma, unsigned long address,
 	vmf.pgoff = pgoff;
 	vmf.flags = flags;
 	vmf.page = NULL;
+	vmf.cow_page = cow_page;
 
 	ret = vma->vm_ops->fault(vma, &vmf);
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
 		return ret;
+	if (unlikely(ret & VM_FAULT_COWED))
+		goto out;
 
 	if (unlikely(PageHWPoison(vmf.page))) {
 		if (ret & VM_FAULT_LOCKED)
@@ -3314,6 +3318,7 @@ static int __do_fault(struct vm_area_struct *vma, unsigned long address,
 	else
 		VM_BUG_ON_PAGE(!PageLocked(vmf.page), vmf.page);
 
+ out:
 	*page = vmf.page;
 	return ret;
 }
@@ -3351,7 +3356,7 @@ static int do_read_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	pte_t *pte;
 	int ret;
 
-	ret = __do_fault(vma, address, pgoff, flags, &fault_page);
+	ret = __do_fault(vma, address, pgoff, flags, NULL, &fault_page);
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
 		return ret;
 
@@ -3368,6 +3373,12 @@ static int do_read_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	return ret;
 }
 
+/*
+ * If the fault handler performs the COW, it does not return a page,
+ * so cannot use the page's lock to protect against a concurrent truncate
+ * operation.  Instead it returns with the i_mmap_mutex held, which must
+ * be released after the PTE has been inserted.
+ */
 static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		unsigned long address, pmd_t *pmd,
 		pgoff_t pgoff, unsigned int flags, pte_t orig_pte)
@@ -3389,25 +3400,34 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		return VM_FAULT_OOM;
 	}
 
-	ret = __do_fault(vma, address, pgoff, flags, &fault_page);
+	ret = __do_fault(vma, address, pgoff, flags, new_page, &fault_page);
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
 		goto uncharge_out;
 
-	copy_user_highpage(new_page, fault_page, address, vma);
+	if (!(ret & VM_FAULT_COWED))
+		copy_user_highpage(new_page, fault_page, address, vma);
 	__SetPageUptodate(new_page);
 
 	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
-	if (unlikely(!pte_same(*pte, orig_pte))) {
-		pte_unmap_unlock(pte, ptl);
+	if (unlikely(!pte_same(*pte, orig_pte)))
+		goto unlock_out;
+	do_set_pte(vma, address, new_page, pte, true, true);
+	pte_unmap_unlock(pte, ptl);
+	if (ret & VM_FAULT_COWED) {
+		mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
+	} else {
 		unlock_page(fault_page);
 		page_cache_release(fault_page);
-		goto uncharge_out;
 	}
-	do_set_pte(vma, address, new_page, pte, true, true);
-	pte_unmap_unlock(pte, ptl);
-	unlock_page(fault_page);
-	page_cache_release(fault_page);
 	return ret;
+unlock_out:
+	pte_unmap_unlock(pte, ptl);
+	if (ret & VM_FAULT_COWED) {
+		mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
+	} else {
+		unlock_page(fault_page);
+		page_cache_release(fault_page);
+	}
 uncharge_out:
 	mem_cgroup_uncharge_page(new_page);
 	page_cache_release(new_page);
@@ -3424,7 +3444,7 @@ static int do_shared_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	int dirtied = 0;
 	int ret, tmp;
 
-	ret = __do_fault(vma, address, pgoff, flags, &fault_page);
+	ret = __do_fault(vma, address, pgoff, flags, NULL, &fault_page);
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
 		return ret;
 
-- 
1.8.5.3


^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [PATCH v6 02/22] Allow page fault handlers to perform the COW
@ 2014-02-25 14:18   ` Matthew Wilcox
  0 siblings, 0 replies; 79+ messages in thread
From: Matthew Wilcox @ 2014-02-25 14:18 UTC (permalink / raw)
  To: linux-kernel, linux-mm, linux-fsdevel, willy; +Cc: Matthew Wilcox

Currently COW of an XIP file is done by first bringing in a read-only
mapping, then retrying the fault and copying the page.  It is much more
efficient to tell the fault handler that a COW is being attempted (by
passing in the pre-allocated page in the vm_fault structure), and allow
the handler to perform the COW operation itself.

Where the filemap code protects against truncation of the file until
the PTE has been installed with the page lock, the XIP code use the
i_mmap_mutex instead.  We must therefore unlock the i_mmap_mutex after
inserting the PTE.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
---
 include/linux/mm.h |  2 ++
 mm/memory.c        | 44 ++++++++++++++++++++++++++++++++------------
 2 files changed, 34 insertions(+), 12 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index f28f46e..22260c0 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -205,6 +205,7 @@ struct vm_fault {
 	pgoff_t pgoff;			/* Logical page offset based on vma */
 	void __user *virtual_address;	/* Faulting virtual address */
 
+	struct page *cow_page;		/* Handler may choose to COW */
 	struct page *page;		/* ->fault handlers should return a
 					 * page here, unless VM_FAULT_NOPAGE
 					 * is set (which is also implied by
@@ -1000,6 +1001,7 @@ static inline int page_mapped(struct page *page)
 #define VM_FAULT_HWPOISON 0x0010	/* Hit poisoned small page */
 #define VM_FAULT_HWPOISON_LARGE 0x0020  /* Hit poisoned large page. Index encoded in upper bits */
 
+#define VM_FAULT_COWED	0x0080	/* ->fault COWed the page instead */
 #define VM_FAULT_NOPAGE	0x0100	/* ->fault installed the pte, not return page */
 #define VM_FAULT_LOCKED	0x0200	/* ->fault locked the returned page */
 #define VM_FAULT_RETRY	0x0400	/* ->fault blocked, must retry */
diff --git a/mm/memory.c b/mm/memory.c
index 7f52c46..c7fc9c5 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3288,7 +3288,8 @@ oom:
 }
 
 static int __do_fault(struct vm_area_struct *vma, unsigned long address,
-		pgoff_t pgoff, unsigned int flags, struct page **page)
+			pgoff_t pgoff, unsigned int flags,
+			struct page *cow_page, struct page **page)
 {
 	struct vm_fault vmf;
 	int ret;
@@ -3297,10 +3298,13 @@ static int __do_fault(struct vm_area_struct *vma, unsigned long address,
 	vmf.pgoff = pgoff;
 	vmf.flags = flags;
 	vmf.page = NULL;
+	vmf.cow_page = cow_page;
 
 	ret = vma->vm_ops->fault(vma, &vmf);
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
 		return ret;
+	if (unlikely(ret & VM_FAULT_COWED))
+		goto out;
 
 	if (unlikely(PageHWPoison(vmf.page))) {
 		if (ret & VM_FAULT_LOCKED)
@@ -3314,6 +3318,7 @@ static int __do_fault(struct vm_area_struct *vma, unsigned long address,
 	else
 		VM_BUG_ON_PAGE(!PageLocked(vmf.page), vmf.page);
 
+ out:
 	*page = vmf.page;
 	return ret;
 }
@@ -3351,7 +3356,7 @@ static int do_read_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	pte_t *pte;
 	int ret;
 
-	ret = __do_fault(vma, address, pgoff, flags, &fault_page);
+	ret = __do_fault(vma, address, pgoff, flags, NULL, &fault_page);
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
 		return ret;
 
@@ -3368,6 +3373,12 @@ static int do_read_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	return ret;
 }
 
+/*
+ * If the fault handler performs the COW, it does not return a page,
+ * so cannot use the page's lock to protect against a concurrent truncate
+ * operation.  Instead it returns with the i_mmap_mutex held, which must
+ * be released after the PTE has been inserted.
+ */
 static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		unsigned long address, pmd_t *pmd,
 		pgoff_t pgoff, unsigned int flags, pte_t orig_pte)
@@ -3389,25 +3400,34 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		return VM_FAULT_OOM;
 	}
 
-	ret = __do_fault(vma, address, pgoff, flags, &fault_page);
+	ret = __do_fault(vma, address, pgoff, flags, new_page, &fault_page);
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
 		goto uncharge_out;
 
-	copy_user_highpage(new_page, fault_page, address, vma);
+	if (!(ret & VM_FAULT_COWED))
+		copy_user_highpage(new_page, fault_page, address, vma);
 	__SetPageUptodate(new_page);
 
 	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
-	if (unlikely(!pte_same(*pte, orig_pte))) {
-		pte_unmap_unlock(pte, ptl);
+	if (unlikely(!pte_same(*pte, orig_pte)))
+		goto unlock_out;
+	do_set_pte(vma, address, new_page, pte, true, true);
+	pte_unmap_unlock(pte, ptl);
+	if (ret & VM_FAULT_COWED) {
+		mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
+	} else {
 		unlock_page(fault_page);
 		page_cache_release(fault_page);
-		goto uncharge_out;
 	}
-	do_set_pte(vma, address, new_page, pte, true, true);
-	pte_unmap_unlock(pte, ptl);
-	unlock_page(fault_page);
-	page_cache_release(fault_page);
 	return ret;
+unlock_out:
+	pte_unmap_unlock(pte, ptl);
+	if (ret & VM_FAULT_COWED) {
+		mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
+	} else {
+		unlock_page(fault_page);
+		page_cache_release(fault_page);
+	}
 uncharge_out:
 	mem_cgroup_uncharge_page(new_page);
 	page_cache_release(new_page);
@@ -3424,7 +3444,7 @@ static int do_shared_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	int dirtied = 0;
 	int ret, tmp;
 
-	ret = __do_fault(vma, address, pgoff, flags, &fault_page);
+	ret = __do_fault(vma, address, pgoff, flags, NULL, &fault_page);
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
 		return ret;
 
-- 
1.8.5.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [PATCH v6 03/22] axonram: Fix bug in direct_access
  2014-02-25 14:18 ` Matthew Wilcox
@ 2014-02-25 14:18   ` Matthew Wilcox
  -1 siblings, 0 replies; 79+ messages in thread
From: Matthew Wilcox @ 2014-02-25 14:18 UTC (permalink / raw)
  To: linux-kernel, linux-mm, linux-fsdevel, willy; +Cc: Matthew Wilcox

The 'pfn' returned by axonram was completely bogus, and has been since
2008.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
---
 arch/powerpc/sysdev/axonram.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/sysdev/axonram.c b/arch/powerpc/sysdev/axonram.c
index 47b6b9f..830edc8 100644
--- a/arch/powerpc/sysdev/axonram.c
+++ b/arch/powerpc/sysdev/axonram.c
@@ -156,7 +156,7 @@ axon_ram_direct_access(struct block_device *device, sector_t sector,
 	}
 
 	*kaddr = (void *)(bank->ph_addr + offset);
-	*pfn = virt_to_phys(kaddr) >> PAGE_SHIFT;
+	*pfn = virt_to_phys(*kaddr) >> PAGE_SHIFT;
 
 	return 0;
 }
-- 
1.8.5.3


^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [PATCH v6 03/22] axonram: Fix bug in direct_access
@ 2014-02-25 14:18   ` Matthew Wilcox
  0 siblings, 0 replies; 79+ messages in thread
From: Matthew Wilcox @ 2014-02-25 14:18 UTC (permalink / raw)
  To: linux-kernel, linux-mm, linux-fsdevel, willy; +Cc: Matthew Wilcox

The 'pfn' returned by axonram was completely bogus, and has been since
2008.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
---
 arch/powerpc/sysdev/axonram.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/sysdev/axonram.c b/arch/powerpc/sysdev/axonram.c
index 47b6b9f..830edc8 100644
--- a/arch/powerpc/sysdev/axonram.c
+++ b/arch/powerpc/sysdev/axonram.c
@@ -156,7 +156,7 @@ axon_ram_direct_access(struct block_device *device, sector_t sector,
 	}
 
 	*kaddr = (void *)(bank->ph_addr + offset);
-	*pfn = virt_to_phys(kaddr) >> PAGE_SHIFT;
+	*pfn = virt_to_phys(*kaddr) >> PAGE_SHIFT;
 
 	return 0;
 }
-- 
1.8.5.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [PATCH v6 04/22] Change direct_access calling convention
  2014-02-25 14:18 ` Matthew Wilcox
@ 2014-02-25 14:18   ` Matthew Wilcox
  -1 siblings, 0 replies; 79+ messages in thread
From: Matthew Wilcox @ 2014-02-25 14:18 UTC (permalink / raw)
  To: linux-kernel, linux-mm, linux-fsdevel, willy; +Cc: Matthew Wilcox

In order to support accesses to larger chunks of memory, pass in a
'size' parameter (counted in bytes), and return the amount available at
that address.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
---
 Documentation/filesystems/xip.txt | 15 +++++++++------
 arch/powerpc/sysdev/axonram.c     |  6 +++---
 drivers/block/brd.c               |  8 +++++---
 drivers/s390/block/dcssblk.c      | 19 ++++++++++---------
 fs/ext2/xip.c                     | 30 +++++++++++++-----------------
 include/linux/blkdev.h            |  4 ++--
 6 files changed, 42 insertions(+), 40 deletions(-)

diff --git a/Documentation/filesystems/xip.txt b/Documentation/filesystems/xip.txt
index 0466ee5..b62eabf 100644
--- a/Documentation/filesystems/xip.txt
+++ b/Documentation/filesystems/xip.txt
@@ -28,12 +28,15 @@ Implementation
 Execute-in-place is implemented in three steps: block device operation,
 address space operation, and file operations.
 
-A block device operation named direct_access is used to retrieve a
-reference (pointer) to a block on-disk. The reference is supposed to be
-cpu-addressable, physical address and remain valid until the release operation
-is performed. A struct block_device reference is used to address the device,
-and a sector_t argument is used to identify the individual block. As an
-alternative, memory technology devices can be used for this.
+A block device operation named direct_access is used to translate the
+block device sector number to a page frame number (pfn) that identifies
+the physical page for the memory.  It also returns a kernel virtual
+address that can be used to access the memory.
+
+The direct_access method takes a 'size' parameter that indicates the
+number of bytes being requested.  The function should return the number
+of bytes that it can provide, although it must not exceed the number of
+bytes requested.  It may also return a negative errno if an error occurs.
 
 The block device operation is optional, these block devices support it as of
 today:
diff --git a/arch/powerpc/sysdev/axonram.c b/arch/powerpc/sysdev/axonram.c
index 830edc8..1697e29 100644
--- a/arch/powerpc/sysdev/axonram.c
+++ b/arch/powerpc/sysdev/axonram.c
@@ -139,9 +139,9 @@ axon_ram_make_request(struct request_queue *queue, struct bio *bio)
  * axon_ram_direct_access - direct_access() method for block device
  * @device, @sector, @data: see block_device_operations method
  */
-static int
+static long
 axon_ram_direct_access(struct block_device *device, sector_t sector,
-		       void **kaddr, unsigned long *pfn)
+		       void **kaddr, unsigned long *pfn, long size)
 {
 	struct axon_ram_bank *bank = device->bd_disk->private_data;
 	loff_t offset;
@@ -158,7 +158,7 @@ axon_ram_direct_access(struct block_device *device, sector_t sector,
 	*kaddr = (void *)(bank->ph_addr + offset);
 	*pfn = virt_to_phys(*kaddr) >> PAGE_SHIFT;
 
-	return 0;
+	return min_t(unsigned long, size, bank->size - offset);
 }
 
 static const struct block_device_operations axon_ram_devops = {
diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index e73b85c..00da60d 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -361,8 +361,8 @@ out:
 }
 
 #ifdef CONFIG_BLK_DEV_XIP
-static int brd_direct_access(struct block_device *bdev, sector_t sector,
-			void **kaddr, unsigned long *pfn)
+static long brd_direct_access(struct block_device *bdev, sector_t sector,
+			void **kaddr, unsigned long *pfn, long size)
 {
 	struct brd_device *brd = bdev->bd_disk->private_data;
 	struct page *page;
@@ -379,7 +379,9 @@ static int brd_direct_access(struct block_device *bdev, sector_t sector,
 	*kaddr = page_address(page);
 	*pfn = page_to_pfn(page);
 
-	return 0;
+	/* Could optimistically check to see if the next page in the
+	 * file is mapped to the next page of physical RAM */
+	return PAGE_SIZE;
 }
 #endif
 
diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
index ebf41e2..da914b2 100644
--- a/drivers/s390/block/dcssblk.c
+++ b/drivers/s390/block/dcssblk.c
@@ -28,8 +28,8 @@
 static int dcssblk_open(struct block_device *bdev, fmode_t mode);
 static void dcssblk_release(struct gendisk *disk, fmode_t mode);
 static void dcssblk_make_request(struct request_queue *q, struct bio *bio);
-static int dcssblk_direct_access(struct block_device *bdev, sector_t secnum,
-				 void **kaddr, unsigned long *pfn);
+static long dcssblk_direct_access(struct block_device *bdev, sector_t secnum,
+				 void **kaddr, unsigned long *pfn, long size);
 
 static char dcssblk_segments[DCSSBLK_PARM_LEN] = "\0";
 
@@ -866,25 +866,26 @@ fail:
 	bio_io_error(bio);
 }
 
-static int
+static long
 dcssblk_direct_access (struct block_device *bdev, sector_t secnum,
-			void **kaddr, unsigned long *pfn)
+			void **kaddr, unsigned long *pfn, long size)
 {
 	struct dcssblk_dev_info *dev_info;
-	unsigned long pgoff;
+	unsigned long offset, dev_sz;
 
 	dev_info = bdev->bd_disk->private_data;
 	if (!dev_info)
 		return -ENODEV;
+	dev_sz = dev_info->end - dev_info->start;
 	if (secnum % (PAGE_SIZE/512))
 		return -EINVAL;
-	pgoff = secnum / (PAGE_SIZE / 512);
-	if ((pgoff+1)*PAGE_SIZE-1 > dev_info->end - dev_info->start)
+	offset = secnum * 512;
+	if (offset > dev_sz)
 		return -ERANGE;
-	*kaddr = (void *) (dev_info->start+pgoff*PAGE_SIZE);
+	*kaddr = (void *) (dev_info->start + offset);
 	*pfn = virt_to_phys(*kaddr) >> PAGE_SHIFT;
 
-	return 0;
+	return min_t(unsigned long, size, dev_sz - offset);
 }
 
 static void
diff --git a/fs/ext2/xip.c b/fs/ext2/xip.c
index e98171a..fa40091 100644
--- a/fs/ext2/xip.c
+++ b/fs/ext2/xip.c
@@ -13,18 +13,13 @@
 #include "ext2.h"
 #include "xip.h"
 
-static inline int
-__inode_direct_access(struct inode *inode, sector_t block,
-		      void **kaddr, unsigned long *pfn)
+static inline long __inode_direct_access(struct inode *inode, sector_t block,
+				void **kaddr, unsigned long *pfn, long size)
 {
 	struct block_device *bdev = inode->i_sb->s_bdev;
 	const struct block_device_operations *ops = bdev->bd_disk->fops;
-	sector_t sector;
-
-	sector = block * (PAGE_SIZE / 512); /* ext2 block to bdev sector */
-
-	BUG_ON(!ops->direct_access);
-	return ops->direct_access(bdev, sector, kaddr, pfn);
+	sector_t sector = block * (PAGE_SIZE / 512);
+	return ops->direct_access(bdev, sector, kaddr, pfn, size);
 }
 
 static inline int
@@ -53,12 +48,13 @@ ext2_clear_xip_target(struct inode *inode, sector_t block)
 {
 	void *kaddr;
 	unsigned long pfn;
-	int rc;
+	long size;
 
-	rc = __inode_direct_access(inode, block, &kaddr, &pfn);
-	if (!rc)
-		clear_page(kaddr);
-	return rc;
+	size = __inode_direct_access(inode, block, &kaddr, &pfn, PAGE_SIZE);
+	if (size < 0)
+		return size;
+	clear_page(kaddr);
+	return 0;
 }
 
 void ext2_xip_verify_sb(struct super_block *sb)
@@ -77,7 +73,7 @@ void ext2_xip_verify_sb(struct super_block *sb)
 int ext2_get_xip_mem(struct address_space *mapping, pgoff_t pgoff, int create,
 				void **kmem, unsigned long *pfn)
 {
-	int rc;
+	long rc;
 	sector_t block;
 
 	/* first, retrieve the sector number */
@@ -86,6 +82,6 @@ int ext2_get_xip_mem(struct address_space *mapping, pgoff_t pgoff, int create,
 		return rc;
 
 	/* retrieve address of the target data */
-	rc = __inode_direct_access(mapping->host, block, kmem, pfn);
-	return rc;
+	rc = __inode_direct_access(mapping->host, block, kmem, pfn, PAGE_SIZE);
+	return (rc < 0) ? rc : 0;
 }
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 4afa4f8..c6f6210 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1560,8 +1560,8 @@ struct block_device_operations {
 	void (*release) (struct gendisk *, fmode_t);
 	int (*ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
 	int (*compat_ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
-	int (*direct_access) (struct block_device *, sector_t,
-						void **, unsigned long *);
+	long (*direct_access) (struct block_device *, sector_t,
+					void **, unsigned long *pfn, long size);
 	unsigned int (*check_events) (struct gendisk *disk,
 				      unsigned int clearing);
 	/* ->media_changed() is DEPRECATED, use ->check_events() instead */
-- 
1.8.5.3


^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [PATCH v6 04/22] Change direct_access calling convention
@ 2014-02-25 14:18   ` Matthew Wilcox
  0 siblings, 0 replies; 79+ messages in thread
From: Matthew Wilcox @ 2014-02-25 14:18 UTC (permalink / raw)
  To: linux-kernel, linux-mm, linux-fsdevel, willy; +Cc: Matthew Wilcox

In order to support accesses to larger chunks of memory, pass in a
'size' parameter (counted in bytes), and return the amount available at
that address.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
---
 Documentation/filesystems/xip.txt | 15 +++++++++------
 arch/powerpc/sysdev/axonram.c     |  6 +++---
 drivers/block/brd.c               |  8 +++++---
 drivers/s390/block/dcssblk.c      | 19 ++++++++++---------
 fs/ext2/xip.c                     | 30 +++++++++++++-----------------
 include/linux/blkdev.h            |  4 ++--
 6 files changed, 42 insertions(+), 40 deletions(-)

diff --git a/Documentation/filesystems/xip.txt b/Documentation/filesystems/xip.txt
index 0466ee5..b62eabf 100644
--- a/Documentation/filesystems/xip.txt
+++ b/Documentation/filesystems/xip.txt
@@ -28,12 +28,15 @@ Implementation
 Execute-in-place is implemented in three steps: block device operation,
 address space operation, and file operations.
 
-A block device operation named direct_access is used to retrieve a
-reference (pointer) to a block on-disk. The reference is supposed to be
-cpu-addressable, physical address and remain valid until the release operation
-is performed. A struct block_device reference is used to address the device,
-and a sector_t argument is used to identify the individual block. As an
-alternative, memory technology devices can be used for this.
+A block device operation named direct_access is used to translate the
+block device sector number to a page frame number (pfn) that identifies
+the physical page for the memory.  It also returns a kernel virtual
+address that can be used to access the memory.
+
+The direct_access method takes a 'size' parameter that indicates the
+number of bytes being requested.  The function should return the number
+of bytes that it can provide, although it must not exceed the number of
+bytes requested.  It may also return a negative errno if an error occurs.
 
 The block device operation is optional, these block devices support it as of
 today:
diff --git a/arch/powerpc/sysdev/axonram.c b/arch/powerpc/sysdev/axonram.c
index 830edc8..1697e29 100644
--- a/arch/powerpc/sysdev/axonram.c
+++ b/arch/powerpc/sysdev/axonram.c
@@ -139,9 +139,9 @@ axon_ram_make_request(struct request_queue *queue, struct bio *bio)
  * axon_ram_direct_access - direct_access() method for block device
  * @device, @sector, @data: see block_device_operations method
  */
-static int
+static long
 axon_ram_direct_access(struct block_device *device, sector_t sector,
-		       void **kaddr, unsigned long *pfn)
+		       void **kaddr, unsigned long *pfn, long size)
 {
 	struct axon_ram_bank *bank = device->bd_disk->private_data;
 	loff_t offset;
@@ -158,7 +158,7 @@ axon_ram_direct_access(struct block_device *device, sector_t sector,
 	*kaddr = (void *)(bank->ph_addr + offset);
 	*pfn = virt_to_phys(*kaddr) >> PAGE_SHIFT;
 
-	return 0;
+	return min_t(unsigned long, size, bank->size - offset);
 }
 
 static const struct block_device_operations axon_ram_devops = {
diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index e73b85c..00da60d 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -361,8 +361,8 @@ out:
 }
 
 #ifdef CONFIG_BLK_DEV_XIP
-static int brd_direct_access(struct block_device *bdev, sector_t sector,
-			void **kaddr, unsigned long *pfn)
+static long brd_direct_access(struct block_device *bdev, sector_t sector,
+			void **kaddr, unsigned long *pfn, long size)
 {
 	struct brd_device *brd = bdev->bd_disk->private_data;
 	struct page *page;
@@ -379,7 +379,9 @@ static int brd_direct_access(struct block_device *bdev, sector_t sector,
 	*kaddr = page_address(page);
 	*pfn = page_to_pfn(page);
 
-	return 0;
+	/* Could optimistically check to see if the next page in the
+	 * file is mapped to the next page of physical RAM */
+	return PAGE_SIZE;
 }
 #endif
 
diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
index ebf41e2..da914b2 100644
--- a/drivers/s390/block/dcssblk.c
+++ b/drivers/s390/block/dcssblk.c
@@ -28,8 +28,8 @@
 static int dcssblk_open(struct block_device *bdev, fmode_t mode);
 static void dcssblk_release(struct gendisk *disk, fmode_t mode);
 static void dcssblk_make_request(struct request_queue *q, struct bio *bio);
-static int dcssblk_direct_access(struct block_device *bdev, sector_t secnum,
-				 void **kaddr, unsigned long *pfn);
+static long dcssblk_direct_access(struct block_device *bdev, sector_t secnum,
+				 void **kaddr, unsigned long *pfn, long size);
 
 static char dcssblk_segments[DCSSBLK_PARM_LEN] = "\0";
 
@@ -866,25 +866,26 @@ fail:
 	bio_io_error(bio);
 }
 
-static int
+static long
 dcssblk_direct_access (struct block_device *bdev, sector_t secnum,
-			void **kaddr, unsigned long *pfn)
+			void **kaddr, unsigned long *pfn, long size)
 {
 	struct dcssblk_dev_info *dev_info;
-	unsigned long pgoff;
+	unsigned long offset, dev_sz;
 
 	dev_info = bdev->bd_disk->private_data;
 	if (!dev_info)
 		return -ENODEV;
+	dev_sz = dev_info->end - dev_info->start;
 	if (secnum % (PAGE_SIZE/512))
 		return -EINVAL;
-	pgoff = secnum / (PAGE_SIZE / 512);
-	if ((pgoff+1)*PAGE_SIZE-1 > dev_info->end - dev_info->start)
+	offset = secnum * 512;
+	if (offset > dev_sz)
 		return -ERANGE;
-	*kaddr = (void *) (dev_info->start+pgoff*PAGE_SIZE);
+	*kaddr = (void *) (dev_info->start + offset);
 	*pfn = virt_to_phys(*kaddr) >> PAGE_SHIFT;
 
-	return 0;
+	return min_t(unsigned long, size, dev_sz - offset);
 }
 
 static void
diff --git a/fs/ext2/xip.c b/fs/ext2/xip.c
index e98171a..fa40091 100644
--- a/fs/ext2/xip.c
+++ b/fs/ext2/xip.c
@@ -13,18 +13,13 @@
 #include "ext2.h"
 #include "xip.h"
 
-static inline int
-__inode_direct_access(struct inode *inode, sector_t block,
-		      void **kaddr, unsigned long *pfn)
+static inline long __inode_direct_access(struct inode *inode, sector_t block,
+				void **kaddr, unsigned long *pfn, long size)
 {
 	struct block_device *bdev = inode->i_sb->s_bdev;
 	const struct block_device_operations *ops = bdev->bd_disk->fops;
-	sector_t sector;
-
-	sector = block * (PAGE_SIZE / 512); /* ext2 block to bdev sector */
-
-	BUG_ON(!ops->direct_access);
-	return ops->direct_access(bdev, sector, kaddr, pfn);
+	sector_t sector = block * (PAGE_SIZE / 512);
+	return ops->direct_access(bdev, sector, kaddr, pfn, size);
 }
 
 static inline int
@@ -53,12 +48,13 @@ ext2_clear_xip_target(struct inode *inode, sector_t block)
 {
 	void *kaddr;
 	unsigned long pfn;
-	int rc;
+	long size;
 
-	rc = __inode_direct_access(inode, block, &kaddr, &pfn);
-	if (!rc)
-		clear_page(kaddr);
-	return rc;
+	size = __inode_direct_access(inode, block, &kaddr, &pfn, PAGE_SIZE);
+	if (size < 0)
+		return size;
+	clear_page(kaddr);
+	return 0;
 }
 
 void ext2_xip_verify_sb(struct super_block *sb)
@@ -77,7 +73,7 @@ void ext2_xip_verify_sb(struct super_block *sb)
 int ext2_get_xip_mem(struct address_space *mapping, pgoff_t pgoff, int create,
 				void **kmem, unsigned long *pfn)
 {
-	int rc;
+	long rc;
 	sector_t block;
 
 	/* first, retrieve the sector number */
@@ -86,6 +82,6 @@ int ext2_get_xip_mem(struct address_space *mapping, pgoff_t pgoff, int create,
 		return rc;
 
 	/* retrieve address of the target data */
-	rc = __inode_direct_access(mapping->host, block, kmem, pfn);
-	return rc;
+	rc = __inode_direct_access(mapping->host, block, kmem, pfn, PAGE_SIZE);
+	return (rc < 0) ? rc : 0;
 }
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 4afa4f8..c6f6210 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1560,8 +1560,8 @@ struct block_device_operations {
 	void (*release) (struct gendisk *, fmode_t);
 	int (*ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
 	int (*compat_ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
-	int (*direct_access) (struct block_device *, sector_t,
-						void **, unsigned long *);
+	long (*direct_access) (struct block_device *, sector_t,
+					void **, unsigned long *pfn, long size);
 	unsigned int (*check_events) (struct gendisk *disk,
 				      unsigned int clearing);
 	/* ->media_changed() is DEPRECATED, use ->check_events() instead */
-- 
1.8.5.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [PATCH v6 05/22] Introduce IS_DAX(inode)
  2014-02-25 14:18 ` Matthew Wilcox
@ 2014-02-25 14:18   ` Matthew Wilcox
  -1 siblings, 0 replies; 79+ messages in thread
From: Matthew Wilcox @ 2014-02-25 14:18 UTC (permalink / raw)
  To: linux-kernel, linux-mm, linux-fsdevel, willy; +Cc: Matthew Wilcox

Use an inode flag to tag inodes which should avoid using the page cache.
Convert ext2 to use it instead of mapping_is_xip().

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
---
 fs/ext2/inode.c    | 9 ++++++---
 fs/ext2/xip.h      | 2 --
 include/linux/fs.h | 6 ++++++
 3 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 94ed3684..e7d3192 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -731,7 +731,7 @@ static int ext2_get_blocks(struct inode *inode,
 		goto cleanup;
 	}
 
-	if (ext2_use_xip(inode->i_sb)) {
+	if (IS_DAX(inode)) {
 		/*
 		 * we need to clear the block
 		 */
@@ -1201,7 +1201,7 @@ static int ext2_setsize(struct inode *inode, loff_t newsize)
 
 	inode_dio_wait(inode);
 
-	if (mapping_is_xip(inode->i_mapping))
+	if (IS_DAX(inode))
 		error = xip_truncate_page(inode->i_mapping, newsize);
 	else if (test_opt(inode->i_sb, NOBH))
 		error = nobh_truncate_page(inode->i_mapping,
@@ -1273,7 +1273,8 @@ void ext2_set_inode_flags(struct inode *inode)
 {
 	unsigned int flags = EXT2_I(inode)->i_flags;
 
-	inode->i_flags &= ~(S_SYNC|S_APPEND|S_IMMUTABLE|S_NOATIME|S_DIRSYNC);
+	inode->i_flags &= ~(S_SYNC | S_APPEND | S_IMMUTABLE | S_NOATIME |
+				S_DIRSYNC | S_DAX);
 	if (flags & EXT2_SYNC_FL)
 		inode->i_flags |= S_SYNC;
 	if (flags & EXT2_APPEND_FL)
@@ -1284,6 +1285,8 @@ void ext2_set_inode_flags(struct inode *inode)
 		inode->i_flags |= S_NOATIME;
 	if (flags & EXT2_DIRSYNC_FL)
 		inode->i_flags |= S_DIRSYNC;
+	if (test_opt(inode->i_sb, XIP))
+		inode->i_flags |= S_DAX;
 }
 
 /* Propagate flags from i_flags to EXT2_I(inode)->i_flags */
diff --git a/fs/ext2/xip.h b/fs/ext2/xip.h
index 18b34d2..29be737 100644
--- a/fs/ext2/xip.h
+++ b/fs/ext2/xip.h
@@ -16,9 +16,7 @@ static inline int ext2_use_xip (struct super_block *sb)
 }
 int ext2_get_xip_mem(struct address_space *, pgoff_t, int,
 				void **, unsigned long *);
-#define mapping_is_xip(map) unlikely(map->a_ops->get_xip_mem)
 #else
-#define mapping_is_xip(map)			0
 #define ext2_xip_verify_sb(sb)			do { } while (0)
 #define ext2_use_xip(sb)			0
 #define ext2_clear_xip_target(inode, chain)	0
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 6082956..42fe4e5 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1640,6 +1640,7 @@ struct super_operations {
 #define S_IMA		1024	/* Inode has an associated IMA struct */
 #define S_AUTOMOUNT	2048	/* Automount/referral quasi-directory */
 #define S_NOSEC		4096	/* no suid or xattr security attributes */
+#define S_DAX		8192	/* Direct Access, avoiding the page cache */
 
 /*
  * Note that nosuid etc flags are inode-specific: setting some file-system
@@ -1677,6 +1678,11 @@ struct super_operations {
 #define IS_IMA(inode)		((inode)->i_flags & S_IMA)
 #define IS_AUTOMOUNT(inode)	((inode)->i_flags & S_AUTOMOUNT)
 #define IS_NOSEC(inode)		((inode)->i_flags & S_NOSEC)
+#ifdef CONFIG_FS_XIP
+#define IS_DAX(inode)		((inode)->i_flags & S_DAX)
+#else
+#define IS_DAX(inode)		0
+#endif
 
 /*
  * Inode state bits.  Protected by inode->i_lock
-- 
1.8.5.3


^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [PATCH v6 05/22] Introduce IS_DAX(inode)
@ 2014-02-25 14:18   ` Matthew Wilcox
  0 siblings, 0 replies; 79+ messages in thread
From: Matthew Wilcox @ 2014-02-25 14:18 UTC (permalink / raw)
  To: linux-kernel, linux-mm, linux-fsdevel, willy; +Cc: Matthew Wilcox

Use an inode flag to tag inodes which should avoid using the page cache.
Convert ext2 to use it instead of mapping_is_xip().

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
---
 fs/ext2/inode.c    | 9 ++++++---
 fs/ext2/xip.h      | 2 --
 include/linux/fs.h | 6 ++++++
 3 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 94ed3684..e7d3192 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -731,7 +731,7 @@ static int ext2_get_blocks(struct inode *inode,
 		goto cleanup;
 	}
 
-	if (ext2_use_xip(inode->i_sb)) {
+	if (IS_DAX(inode)) {
 		/*
 		 * we need to clear the block
 		 */
@@ -1201,7 +1201,7 @@ static int ext2_setsize(struct inode *inode, loff_t newsize)
 
 	inode_dio_wait(inode);
 
-	if (mapping_is_xip(inode->i_mapping))
+	if (IS_DAX(inode))
 		error = xip_truncate_page(inode->i_mapping, newsize);
 	else if (test_opt(inode->i_sb, NOBH))
 		error = nobh_truncate_page(inode->i_mapping,
@@ -1273,7 +1273,8 @@ void ext2_set_inode_flags(struct inode *inode)
 {
 	unsigned int flags = EXT2_I(inode)->i_flags;
 
-	inode->i_flags &= ~(S_SYNC|S_APPEND|S_IMMUTABLE|S_NOATIME|S_DIRSYNC);
+	inode->i_flags &= ~(S_SYNC | S_APPEND | S_IMMUTABLE | S_NOATIME |
+				S_DIRSYNC | S_DAX);
 	if (flags & EXT2_SYNC_FL)
 		inode->i_flags |= S_SYNC;
 	if (flags & EXT2_APPEND_FL)
@@ -1284,6 +1285,8 @@ void ext2_set_inode_flags(struct inode *inode)
 		inode->i_flags |= S_NOATIME;
 	if (flags & EXT2_DIRSYNC_FL)
 		inode->i_flags |= S_DIRSYNC;
+	if (test_opt(inode->i_sb, XIP))
+		inode->i_flags |= S_DAX;
 }
 
 /* Propagate flags from i_flags to EXT2_I(inode)->i_flags */
diff --git a/fs/ext2/xip.h b/fs/ext2/xip.h
index 18b34d2..29be737 100644
--- a/fs/ext2/xip.h
+++ b/fs/ext2/xip.h
@@ -16,9 +16,7 @@ static inline int ext2_use_xip (struct super_block *sb)
 }
 int ext2_get_xip_mem(struct address_space *, pgoff_t, int,
 				void **, unsigned long *);
-#define mapping_is_xip(map) unlikely(map->a_ops->get_xip_mem)
 #else
-#define mapping_is_xip(map)			0
 #define ext2_xip_verify_sb(sb)			do { } while (0)
 #define ext2_use_xip(sb)			0
 #define ext2_clear_xip_target(inode, chain)	0
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 6082956..42fe4e5 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1640,6 +1640,7 @@ struct super_operations {
 #define S_IMA		1024	/* Inode has an associated IMA struct */
 #define S_AUTOMOUNT	2048	/* Automount/referral quasi-directory */
 #define S_NOSEC		4096	/* no suid or xattr security attributes */
+#define S_DAX		8192	/* Direct Access, avoiding the page cache */
 
 /*
  * Note that nosuid etc flags are inode-specific: setting some file-system
@@ -1677,6 +1678,11 @@ struct super_operations {
 #define IS_IMA(inode)		((inode)->i_flags & S_IMA)
 #define IS_AUTOMOUNT(inode)	((inode)->i_flags & S_AUTOMOUNT)
 #define IS_NOSEC(inode)		((inode)->i_flags & S_NOSEC)
+#ifdef CONFIG_FS_XIP
+#define IS_DAX(inode)		((inode)->i_flags & S_DAX)
+#else
+#define IS_DAX(inode)		0
+#endif
 
 /*
  * Inode state bits.  Protected by inode->i_lock
-- 
1.8.5.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [PATCH v6 06/22] Replace XIP read and write with DAX I/O
  2014-02-25 14:18 ` Matthew Wilcox
@ 2014-02-25 14:18   ` Matthew Wilcox
  -1 siblings, 0 replies; 79+ messages in thread
From: Matthew Wilcox @ 2014-02-25 14:18 UTC (permalink / raw)
  To: linux-kernel, linux-mm, linux-fsdevel, willy; +Cc: Matthew Wilcox

Use the generic AIO infrastructure instead of custom read and write
methods.  In addition to giving us support for AIO, this adds the missing
locking between read() and truncate().

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/Makefile        |   1 +
 fs/dax.c           | 192 +++++++++++++++++++++++++++++++++++++++++++
 fs/ext2/file.c     |   6 +-
 fs/ext2/inode.c    |   7 +-
 include/linux/fs.h |  18 ++++-
 mm/filemap.c       |   6 +-
 mm/filemap_xip.c   | 234 -----------------------------------------------------
 7 files changed, 219 insertions(+), 245 deletions(-)
 create mode 100644 fs/dax.c

diff --git a/fs/Makefile b/fs/Makefile
index 47ac07b..2f194cd 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -29,6 +29,7 @@ obj-$(CONFIG_SIGNALFD)		+= signalfd.o
 obj-$(CONFIG_TIMERFD)		+= timerfd.o
 obj-$(CONFIG_EVENTFD)		+= eventfd.o
 obj-$(CONFIG_AIO)               += aio.o
+obj-$(CONFIG_FS_XIP)		+= dax.o
 obj-$(CONFIG_FILE_LOCKING)      += locks.o
 obj-$(CONFIG_COMPAT)		+= compat.o compat_ioctl.o
 obj-$(CONFIG_BINFMT_AOUT)	+= binfmt_aout.o
diff --git a/fs/dax.c b/fs/dax.c
new file mode 100644
index 0000000..81099f9
--- /dev/null
+++ b/fs/dax.c
@@ -0,0 +1,192 @@
+/*
+ * fs/dax.c - Direct Access filesystem code
+ * Copyright (c) 2013-2014 Intel Corporation
+ * Author: Matthew Wilcox <matthew.r.wilcox@intel.com>
+ * Author: Ross Zwisler <ross.zwisler@linux.intel.com>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+#include <linux/atomic.h>
+#include <linux/blkdev.h>
+#include <linux/buffer_head.h>
+#include <linux/fs.h>
+#include <linux/genhd.h>
+#include <linux/mutex.h>
+#include <linux/uio.h>
+
+static long dax_get_addr(struct inode *inode, struct buffer_head *bh,
+								void **addr)
+{
+	struct block_device *bdev = bh->b_bdev;
+	const struct block_device_operations *ops = bdev->bd_disk->fops;
+	unsigned long pfn;
+	sector_t sector = bh->b_blocknr << (inode->i_blkbits - 9);
+	return ops->direct_access(bdev, sector, addr, &pfn, bh->b_size);
+}
+
+static void dax_new_buf(void *addr, unsigned size, unsigned first,
+					loff_t offset, loff_t end, int rw)
+{
+	loff_t final = end - offset;	/* The final byte in this buffer */
+	if (rw != WRITE) {
+		memset(addr, 0, size);
+		return;
+	}
+
+	if (first > 0)
+		memset(addr, 0, first);
+	if (final < size)
+		memset(addr + final, 0, size - final);
+}
+
+static bool buffer_written(struct buffer_head *bh)
+{
+	return buffer_mapped(bh) && !buffer_unwritten(bh);
+}
+
+static ssize_t dax_io(int rw, struct inode *inode, const struct iovec *iov,
+			loff_t start, loff_t end, get_block_t get_block,
+			struct buffer_head *bh)
+{
+	ssize_t retval = 0;
+	unsigned seg = 0;
+	unsigned len;
+	unsigned copied = 0;
+	loff_t offset = start;
+	loff_t max = start;
+	void *addr;
+	bool hole = false;
+
+	if (rw != WRITE)
+		end = min(end, i_size_read(inode));
+
+	while (offset < end) {
+		void __user *buf = iov[seg].iov_base + copied;
+
+		if (max == offset) {
+			sector_t block = offset >> inode->i_blkbits;
+			unsigned first = offset - (block << inode->i_blkbits);
+			long size;
+			memset(bh, 0, sizeof(*bh));
+			bh->b_size = ALIGN(end - offset, PAGE_SIZE);
+			retval = get_block(inode, block, bh, rw == WRITE);
+			if (retval)
+				break;
+			if (rw == WRITE) {
+				if (!buffer_mapped(bh)) {
+					retval = -EIO;
+					break;
+				}
+				hole = false;
+			} else {
+				hole = !buffer_written(bh);
+			}
+
+			if (hole) {
+				addr = NULL;
+				if (buffer_uptodate(bh))
+					size = bh->b_size - first;
+				else
+					size = (1 << inode->i_blkbits) - first;
+			} else {
+				retval = dax_get_addr(inode, bh, &addr);
+				if (retval < 0)
+					break;
+				if (buffer_unwritten(bh) || buffer_new(bh))
+					dax_new_buf(addr, retval, first,
+						   offset, end, rw);
+				addr += first;
+				size = retval - first;
+			}
+			max = min(offset + size, end);
+		}
+
+		len = min_t(unsigned, iov[seg].iov_len - copied, max - offset);
+
+		if (rw == WRITE)
+			len -= __copy_from_user_nocache(addr, buf, len);
+		else if (!hole)
+			len -= __copy_to_user(buf, addr, len);
+		else
+			len -= __clear_user(buf, len);
+
+		if (!len)
+			break;
+
+		offset += len;
+		copied += len;
+		addr += len;
+		if (copied == iov[seg].iov_len) {
+			seg++;
+			copied = 0;
+		}
+	}
+
+	return (offset == start) ? retval : offset - start;
+}
+
+/**
+ * dax_do_io - Perform I/O to a DAX file
+ * @rw: READ to read or WRITE to write
+ * @iocb: The control block for this I/O
+ * @inode: The file which the I/O is directed at
+ * @iov: The user addresses to do I/O from or to
+ * @offset: The file offset where the I/O starts
+ * @nr_segs: The length of the iov array
+ * @get_block: The filesystem method used to translate file offsets to blocks
+ * @end_io: A filesystem callback for I/O completion
+ * @flags: See below
+ *
+ * This function uses the same locking scheme as do_blockdev_direct_IO:
+ * If @flags has DIO_LOCKING set, we assume that the i_mutex is held by the
+ * caller for writes.  For reads, we take and release the i_mutex ourselves.
+ * If DIO_LOCKING is not set, the filesystem takes care of its own locking.
+ * As with do_blockdev_direct_IO(), we increment i_dio_count while the I/O
+ * is in progress.
+ */
+ssize_t dax_do_io(int rw, struct kiocb *iocb, struct inode *inode,
+		const struct iovec *iov, loff_t offset, unsigned nr_segs,
+		get_block_t get_block, dio_iodone_t end_io, int flags)
+{
+	struct buffer_head bh;
+	unsigned seg;
+	ssize_t retval = -EINVAL;
+	loff_t end = offset;
+
+	for (seg = 0; seg < nr_segs; seg++)
+		end += iov[seg].iov_len;
+
+	if ((flags & DIO_LOCKING) && (rw == READ)) {
+		struct address_space *mapping = inode->i_mapping;
+		mutex_lock(&inode->i_mutex);
+		retval = filemap_write_and_wait_range(mapping, offset, end - 1);
+		if (retval) {
+			mutex_unlock(&inode->i_mutex);
+			goto out;
+		}
+	}
+
+	/* Protects against truncate */
+	atomic_inc(&inode->i_dio_count);
+
+	retval = dax_io(rw, inode, iov, offset, end, get_block, &bh);
+
+	if ((flags & DIO_LOCKING) && (rw == READ))
+		mutex_unlock(&inode->i_mutex);
+
+	inode_dio_done(inode);
+
+	if ((retval > 0) && end_io)
+		end_io(iocb, offset, retval, bh.b_private);
+ out:
+	return retval;
+}
+EXPORT_SYMBOL_GPL(dax_do_io);
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index 44c36e5..ef5cf96 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -81,8 +81,10 @@ const struct file_operations ext2_file_operations = {
 #ifdef CONFIG_EXT2_FS_XIP
 const struct file_operations ext2_xip_file_operations = {
 	.llseek		= generic_file_llseek,
-	.read		= xip_file_read,
-	.write		= xip_file_write,
+	.read		= do_sync_read,
+	.write		= do_sync_write,
+	.aio_read	= generic_file_aio_read,
+	.aio_write	= generic_file_aio_write,
 	.unlocked_ioctl = ext2_ioctl,
 #ifdef CONFIG_COMPAT
 	.compat_ioctl	= ext2_compat_ioctl,
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index e7d3192..f128ebf 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -858,7 +858,11 @@ ext2_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,
 	struct inode *inode = mapping->host;
 	ssize_t ret;
 
-	ret = blockdev_direct_IO(rw, iocb, inode, iov, offset, nr_segs,
+	if (IS_DAX(inode))
+		ret = dax_do_io(rw, iocb, inode, iov, offset, nr_segs,
+				ext2_get_block, NULL, DIO_LOCKING);
+	else
+		ret = blockdev_direct_IO(rw, iocb, inode, iov, offset, nr_segs,
 				 ext2_get_block);
 	if (ret < 0 && (rw & WRITE))
 		ext2_write_failed(mapping, offset + iov_length(iov, nr_segs));
@@ -888,6 +892,7 @@ const struct address_space_operations ext2_aops = {
 const struct address_space_operations ext2_aops_xip = {
 	.bmap			= ext2_bmap,
 	.get_xip_mem		= ext2_get_xip_mem,
+	.direct_IO		= ext2_direct_IO,
 };
 
 const struct address_space_operations ext2_nobh_aops = {
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 42fe4e5..07888b9 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2517,17 +2517,22 @@ extern int generic_file_open(struct inode * inode, struct file * filp);
 extern int nonseekable_open(struct inode * inode, struct file * filp);
 
 #ifdef CONFIG_FS_XIP
-extern ssize_t xip_file_read(struct file *filp, char __user *buf, size_t len,
-			     loff_t *ppos);
 extern int xip_file_mmap(struct file * file, struct vm_area_struct * vma);
-extern ssize_t xip_file_write(struct file *filp, const char __user *buf,
-			      size_t len, loff_t *ppos);
 extern int xip_truncate_page(struct address_space *mapping, loff_t from);
+ssize_t dax_do_io(int rw, struct kiocb *, struct inode *, const struct iovec *,
+		loff_t, unsigned segs, get_block_t, dio_iodone_t, int flags);
 #else
 static inline int xip_truncate_page(struct address_space *mapping, loff_t from)
 {
 	return 0;
 }
+
+static inline ssize_t dax_do_io(int rw, struct kiocb *iocb, struct inode *inode,
+		const struct iovec *iov, loff_t offset, unsigned nr_segs,
+		get_block_t get_block, dio_iodone_t end_io, int flags)
+{
+	return -ENOTTY;
+}
 #endif
 
 #ifdef CONFIG_BLOCK
@@ -2677,6 +2682,11 @@ extern int generic_show_options(struct seq_file *m, struct dentry *root);
 extern void save_mount_options(struct super_block *sb, char *options);
 extern void replace_mount_options(struct super_block *sb, char *options);
 
+static inline bool io_is_direct(struct file *filp)
+{
+	return (filp->f_flags & O_DIRECT) || IS_DAX(file_inode(filp));
+}
+
 static inline ino_t parent_ino(struct dentry *dentry)
 {
 	ino_t res;
diff --git a/mm/filemap.c b/mm/filemap.c
index 7a13f6a..1b7dff6 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1417,8 +1417,7 @@ generic_file_aio_read(struct kiocb *iocb, const struct iovec *iov,
 	if (retval)
 		return retval;
 
-	/* coalesce the iovecs and go direct-to-BIO for O_DIRECT */
-	if (filp->f_flags & O_DIRECT) {
+	if (io_is_direct(filp)) {
 		loff_t size;
 		struct address_space *mapping;
 		struct inode *inode;
@@ -2468,8 +2467,7 @@ ssize_t __generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
 	if (err)
 		goto out;
 
-	/* coalesce the iovecs and go direct-to-BIO for O_DIRECT */
-	if (unlikely(file->f_flags & O_DIRECT)) {
+	if (io_is_direct(file)) {
 		loff_t endbyte;
 		ssize_t written_buffered;
 
diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
index c8d23e9..f7c37a1 100644
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -42,119 +42,6 @@ static struct page *xip_sparse_page(void)
 }
 
 /*
- * This is a file read routine for execute in place files, and uses
- * the mapping->a_ops->get_xip_mem() function for the actual low-level
- * stuff.
- *
- * Note the struct file* is not used at all.  It may be NULL.
- */
-static ssize_t
-do_xip_mapping_read(struct address_space *mapping,
-		    struct file_ra_state *_ra,
-		    struct file *filp,
-		    char __user *buf,
-		    size_t len,
-		    loff_t *ppos)
-{
-	struct inode *inode = mapping->host;
-	pgoff_t index, end_index;
-	unsigned long offset;
-	loff_t isize, pos;
-	size_t copied = 0, error = 0;
-
-	BUG_ON(!mapping->a_ops->get_xip_mem);
-
-	pos = *ppos;
-	index = pos >> PAGE_CACHE_SHIFT;
-	offset = pos & ~PAGE_CACHE_MASK;
-
-	isize = i_size_read(inode);
-	if (!isize)
-		goto out;
-
-	end_index = (isize - 1) >> PAGE_CACHE_SHIFT;
-	do {
-		unsigned long nr, left;
-		void *xip_mem;
-		unsigned long xip_pfn;
-		int zero = 0;
-
-		/* nr is the maximum number of bytes to copy from this page */
-		nr = PAGE_CACHE_SIZE;
-		if (index >= end_index) {
-			if (index > end_index)
-				goto out;
-			nr = ((isize - 1) & ~PAGE_CACHE_MASK) + 1;
-			if (nr <= offset) {
-				goto out;
-			}
-		}
-		nr = nr - offset;
-		if (nr > len - copied)
-			nr = len - copied;
-
-		error = mapping->a_ops->get_xip_mem(mapping, index, 0,
-							&xip_mem, &xip_pfn);
-		if (unlikely(error)) {
-			if (error == -ENODATA) {
-				/* sparse */
-				zero = 1;
-			} else
-				goto out;
-		}
-
-		/* If users can be writing to this page using arbitrary
-		 * virtual addresses, take care about potential aliasing
-		 * before reading the page on the kernel side.
-		 */
-		if (mapping_writably_mapped(mapping))
-			/* address based flush */ ;
-
-		/*
-		 * Ok, we have the mem, so now we can copy it to user space...
-		 *
-		 * The actor routine returns how many bytes were actually used..
-		 * NOTE! This may not be the same as how much of a user buffer
-		 * we filled up (we may be padding etc), so we can only update
-		 * "pos" here (the actor routine has to update the user buffer
-		 * pointers and the remaining count).
-		 */
-		if (!zero)
-			left = __copy_to_user(buf+copied, xip_mem+offset, nr);
-		else
-			left = __clear_user(buf + copied, nr);
-
-		if (left) {
-			error = -EFAULT;
-			goto out;
-		}
-
-		copied += (nr - left);
-		offset += (nr - left);
-		index += offset >> PAGE_CACHE_SHIFT;
-		offset &= ~PAGE_CACHE_MASK;
-	} while (copied < len);
-
-out:
-	*ppos = pos + copied;
-	if (filp)
-		file_accessed(filp);
-
-	return (copied ? copied : error);
-}
-
-ssize_t
-xip_file_read(struct file *filp, char __user *buf, size_t len, loff_t *ppos)
-{
-	if (!access_ok(VERIFY_WRITE, buf, len))
-		return -EFAULT;
-
-	return do_xip_mapping_read(filp->f_mapping, &filp->f_ra, filp,
-			    buf, len, ppos);
-}
-EXPORT_SYMBOL_GPL(xip_file_read);
-
-/*
  * __xip_unmap is invoked from xip_unmap and
  * xip_write
  *
@@ -340,127 +227,6 @@ int xip_file_mmap(struct file * file, struct vm_area_struct * vma)
 }
 EXPORT_SYMBOL_GPL(xip_file_mmap);
 
-static ssize_t
-__xip_file_write(struct file *filp, const char __user *buf,
-		  size_t count, loff_t pos, loff_t *ppos)
-{
-	struct address_space * mapping = filp->f_mapping;
-	const struct address_space_operations *a_ops = mapping->a_ops;
-	struct inode 	*inode = mapping->host;
-	long		status = 0;
-	size_t		bytes;
-	ssize_t		written = 0;
-
-	BUG_ON(!mapping->a_ops->get_xip_mem);
-
-	do {
-		unsigned long index;
-		unsigned long offset;
-		size_t copied;
-		void *xip_mem;
-		unsigned long xip_pfn;
-
-		offset = (pos & (PAGE_CACHE_SIZE -1)); /* Within page */
-		index = pos >> PAGE_CACHE_SHIFT;
-		bytes = PAGE_CACHE_SIZE - offset;
-		if (bytes > count)
-			bytes = count;
-
-		status = a_ops->get_xip_mem(mapping, index, 0,
-						&xip_mem, &xip_pfn);
-		if (status == -ENODATA) {
-			/* we allocate a new page unmap it */
-			mutex_lock(&xip_sparse_mutex);
-			status = a_ops->get_xip_mem(mapping, index, 1,
-							&xip_mem, &xip_pfn);
-			mutex_unlock(&xip_sparse_mutex);
-			if (!status)
-				/* unmap page at pgoff from all other vmas */
-				__xip_unmap(mapping, index);
-		}
-
-		if (status)
-			break;
-
-		copied = bytes -
-			__copy_from_user_nocache(xip_mem + offset, buf, bytes);
-
-		if (likely(copied > 0)) {
-			status = copied;
-
-			if (status >= 0) {
-				written += status;
-				count -= status;
-				pos += status;
-				buf += status;
-			}
-		}
-		if (unlikely(copied != bytes))
-			if (status >= 0)
-				status = -EFAULT;
-		if (status < 0)
-			break;
-	} while (count);
-	*ppos = pos;
-	/*
-	 * No need to use i_size_read() here, the i_size
-	 * cannot change under us because we hold i_mutex.
-	 */
-	if (pos > inode->i_size) {
-		i_size_write(inode, pos);
-		mark_inode_dirty(inode);
-	}
-
-	return written ? written : status;
-}
-
-ssize_t
-xip_file_write(struct file *filp, const char __user *buf, size_t len,
-	       loff_t *ppos)
-{
-	struct address_space *mapping = filp->f_mapping;
-	struct inode *inode = mapping->host;
-	size_t count;
-	loff_t pos;
-	ssize_t ret;
-
-	mutex_lock(&inode->i_mutex);
-
-	if (!access_ok(VERIFY_READ, buf, len)) {
-		ret=-EFAULT;
-		goto out_up;
-	}
-
-	pos = *ppos;
-	count = len;
-
-	/* We can write back this queue in page reclaim */
-	current->backing_dev_info = mapping->backing_dev_info;
-
-	ret = generic_write_checks(filp, &pos, &count, S_ISBLK(inode->i_mode));
-	if (ret)
-		goto out_backing;
-	if (count == 0)
-		goto out_backing;
-
-	ret = file_remove_suid(filp);
-	if (ret)
-		goto out_backing;
-
-	ret = file_update_time(filp);
-	if (ret)
-		goto out_backing;
-
-	ret = __xip_file_write (filp, buf, count, pos, ppos);
-
- out_backing:
-	current->backing_dev_info = NULL;
- out_up:
-	mutex_unlock(&inode->i_mutex);
-	return ret;
-}
-EXPORT_SYMBOL_GPL(xip_file_write);
-
 /*
  * truncate a page used for execute in place
  * functionality is analog to block_truncate_page but does use get_xip_mem
-- 
1.8.5.3


^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [PATCH v6 06/22] Replace XIP read and write with DAX I/O
@ 2014-02-25 14:18   ` Matthew Wilcox
  0 siblings, 0 replies; 79+ messages in thread
From: Matthew Wilcox @ 2014-02-25 14:18 UTC (permalink / raw)
  To: linux-kernel, linux-mm, linux-fsdevel, willy; +Cc: Matthew Wilcox

Use the generic AIO infrastructure instead of custom read and write
methods.  In addition to giving us support for AIO, this adds the missing
locking between read() and truncate().

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/Makefile        |   1 +
 fs/dax.c           | 192 +++++++++++++++++++++++++++++++++++++++++++
 fs/ext2/file.c     |   6 +-
 fs/ext2/inode.c    |   7 +-
 include/linux/fs.h |  18 ++++-
 mm/filemap.c       |   6 +-
 mm/filemap_xip.c   | 234 -----------------------------------------------------
 7 files changed, 219 insertions(+), 245 deletions(-)
 create mode 100644 fs/dax.c

diff --git a/fs/Makefile b/fs/Makefile
index 47ac07b..2f194cd 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -29,6 +29,7 @@ obj-$(CONFIG_SIGNALFD)		+= signalfd.o
 obj-$(CONFIG_TIMERFD)		+= timerfd.o
 obj-$(CONFIG_EVENTFD)		+= eventfd.o
 obj-$(CONFIG_AIO)               += aio.o
+obj-$(CONFIG_FS_XIP)		+= dax.o
 obj-$(CONFIG_FILE_LOCKING)      += locks.o
 obj-$(CONFIG_COMPAT)		+= compat.o compat_ioctl.o
 obj-$(CONFIG_BINFMT_AOUT)	+= binfmt_aout.o
diff --git a/fs/dax.c b/fs/dax.c
new file mode 100644
index 0000000..81099f9
--- /dev/null
+++ b/fs/dax.c
@@ -0,0 +1,192 @@
+/*
+ * fs/dax.c - Direct Access filesystem code
+ * Copyright (c) 2013-2014 Intel Corporation
+ * Author: Matthew Wilcox <matthew.r.wilcox@intel.com>
+ * Author: Ross Zwisler <ross.zwisler@linux.intel.com>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+#include <linux/atomic.h>
+#include <linux/blkdev.h>
+#include <linux/buffer_head.h>
+#include <linux/fs.h>
+#include <linux/genhd.h>
+#include <linux/mutex.h>
+#include <linux/uio.h>
+
+static long dax_get_addr(struct inode *inode, struct buffer_head *bh,
+								void **addr)
+{
+	struct block_device *bdev = bh->b_bdev;
+	const struct block_device_operations *ops = bdev->bd_disk->fops;
+	unsigned long pfn;
+	sector_t sector = bh->b_blocknr << (inode->i_blkbits - 9);
+	return ops->direct_access(bdev, sector, addr, &pfn, bh->b_size);
+}
+
+static void dax_new_buf(void *addr, unsigned size, unsigned first,
+					loff_t offset, loff_t end, int rw)
+{
+	loff_t final = end - offset;	/* The final byte in this buffer */
+	if (rw != WRITE) {
+		memset(addr, 0, size);
+		return;
+	}
+
+	if (first > 0)
+		memset(addr, 0, first);
+	if (final < size)
+		memset(addr + final, 0, size - final);
+}
+
+static bool buffer_written(struct buffer_head *bh)
+{
+	return buffer_mapped(bh) && !buffer_unwritten(bh);
+}
+
+static ssize_t dax_io(int rw, struct inode *inode, const struct iovec *iov,
+			loff_t start, loff_t end, get_block_t get_block,
+			struct buffer_head *bh)
+{
+	ssize_t retval = 0;
+	unsigned seg = 0;
+	unsigned len;
+	unsigned copied = 0;
+	loff_t offset = start;
+	loff_t max = start;
+	void *addr;
+	bool hole = false;
+
+	if (rw != WRITE)
+		end = min(end, i_size_read(inode));
+
+	while (offset < end) {
+		void __user *buf = iov[seg].iov_base + copied;
+
+		if (max == offset) {
+			sector_t block = offset >> inode->i_blkbits;
+			unsigned first = offset - (block << inode->i_blkbits);
+			long size;
+			memset(bh, 0, sizeof(*bh));
+			bh->b_size = ALIGN(end - offset, PAGE_SIZE);
+			retval = get_block(inode, block, bh, rw == WRITE);
+			if (retval)
+				break;
+			if (rw == WRITE) {
+				if (!buffer_mapped(bh)) {
+					retval = -EIO;
+					break;
+				}
+				hole = false;
+			} else {
+				hole = !buffer_written(bh);
+			}
+
+			if (hole) {
+				addr = NULL;
+				if (buffer_uptodate(bh))
+					size = bh->b_size - first;
+				else
+					size = (1 << inode->i_blkbits) - first;
+			} else {
+				retval = dax_get_addr(inode, bh, &addr);
+				if (retval < 0)
+					break;
+				if (buffer_unwritten(bh) || buffer_new(bh))
+					dax_new_buf(addr, retval, first,
+						   offset, end, rw);
+				addr += first;
+				size = retval - first;
+			}
+			max = min(offset + size, end);
+		}
+
+		len = min_t(unsigned, iov[seg].iov_len - copied, max - offset);
+
+		if (rw == WRITE)
+			len -= __copy_from_user_nocache(addr, buf, len);
+		else if (!hole)
+			len -= __copy_to_user(buf, addr, len);
+		else
+			len -= __clear_user(buf, len);
+
+		if (!len)
+			break;
+
+		offset += len;
+		copied += len;
+		addr += len;
+		if (copied == iov[seg].iov_len) {
+			seg++;
+			copied = 0;
+		}
+	}
+
+	return (offset == start) ? retval : offset - start;
+}
+
+/**
+ * dax_do_io - Perform I/O to a DAX file
+ * @rw: READ to read or WRITE to write
+ * @iocb: The control block for this I/O
+ * @inode: The file which the I/O is directed at
+ * @iov: The user addresses to do I/O from or to
+ * @offset: The file offset where the I/O starts
+ * @nr_segs: The length of the iov array
+ * @get_block: The filesystem method used to translate file offsets to blocks
+ * @end_io: A filesystem callback for I/O completion
+ * @flags: See below
+ *
+ * This function uses the same locking scheme as do_blockdev_direct_IO:
+ * If @flags has DIO_LOCKING set, we assume that the i_mutex is held by the
+ * caller for writes.  For reads, we take and release the i_mutex ourselves.
+ * If DIO_LOCKING is not set, the filesystem takes care of its own locking.
+ * As with do_blockdev_direct_IO(), we increment i_dio_count while the I/O
+ * is in progress.
+ */
+ssize_t dax_do_io(int rw, struct kiocb *iocb, struct inode *inode,
+		const struct iovec *iov, loff_t offset, unsigned nr_segs,
+		get_block_t get_block, dio_iodone_t end_io, int flags)
+{
+	struct buffer_head bh;
+	unsigned seg;
+	ssize_t retval = -EINVAL;
+	loff_t end = offset;
+
+	for (seg = 0; seg < nr_segs; seg++)
+		end += iov[seg].iov_len;
+
+	if ((flags & DIO_LOCKING) && (rw == READ)) {
+		struct address_space *mapping = inode->i_mapping;
+		mutex_lock(&inode->i_mutex);
+		retval = filemap_write_and_wait_range(mapping, offset, end - 1);
+		if (retval) {
+			mutex_unlock(&inode->i_mutex);
+			goto out;
+		}
+	}
+
+	/* Protects against truncate */
+	atomic_inc(&inode->i_dio_count);
+
+	retval = dax_io(rw, inode, iov, offset, end, get_block, &bh);
+
+	if ((flags & DIO_LOCKING) && (rw == READ))
+		mutex_unlock(&inode->i_mutex);
+
+	inode_dio_done(inode);
+
+	if ((retval > 0) && end_io)
+		end_io(iocb, offset, retval, bh.b_private);
+ out:
+	return retval;
+}
+EXPORT_SYMBOL_GPL(dax_do_io);
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index 44c36e5..ef5cf96 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -81,8 +81,10 @@ const struct file_operations ext2_file_operations = {
 #ifdef CONFIG_EXT2_FS_XIP
 const struct file_operations ext2_xip_file_operations = {
 	.llseek		= generic_file_llseek,
-	.read		= xip_file_read,
-	.write		= xip_file_write,
+	.read		= do_sync_read,
+	.write		= do_sync_write,
+	.aio_read	= generic_file_aio_read,
+	.aio_write	= generic_file_aio_write,
 	.unlocked_ioctl = ext2_ioctl,
 #ifdef CONFIG_COMPAT
 	.compat_ioctl	= ext2_compat_ioctl,
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index e7d3192..f128ebf 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -858,7 +858,11 @@ ext2_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,
 	struct inode *inode = mapping->host;
 	ssize_t ret;
 
-	ret = blockdev_direct_IO(rw, iocb, inode, iov, offset, nr_segs,
+	if (IS_DAX(inode))
+		ret = dax_do_io(rw, iocb, inode, iov, offset, nr_segs,
+				ext2_get_block, NULL, DIO_LOCKING);
+	else
+		ret = blockdev_direct_IO(rw, iocb, inode, iov, offset, nr_segs,
 				 ext2_get_block);
 	if (ret < 0 && (rw & WRITE))
 		ext2_write_failed(mapping, offset + iov_length(iov, nr_segs));
@@ -888,6 +892,7 @@ const struct address_space_operations ext2_aops = {
 const struct address_space_operations ext2_aops_xip = {
 	.bmap			= ext2_bmap,
 	.get_xip_mem		= ext2_get_xip_mem,
+	.direct_IO		= ext2_direct_IO,
 };
 
 const struct address_space_operations ext2_nobh_aops = {
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 42fe4e5..07888b9 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2517,17 +2517,22 @@ extern int generic_file_open(struct inode * inode, struct file * filp);
 extern int nonseekable_open(struct inode * inode, struct file * filp);
 
 #ifdef CONFIG_FS_XIP
-extern ssize_t xip_file_read(struct file *filp, char __user *buf, size_t len,
-			     loff_t *ppos);
 extern int xip_file_mmap(struct file * file, struct vm_area_struct * vma);
-extern ssize_t xip_file_write(struct file *filp, const char __user *buf,
-			      size_t len, loff_t *ppos);
 extern int xip_truncate_page(struct address_space *mapping, loff_t from);
+ssize_t dax_do_io(int rw, struct kiocb *, struct inode *, const struct iovec *,
+		loff_t, unsigned segs, get_block_t, dio_iodone_t, int flags);
 #else
 static inline int xip_truncate_page(struct address_space *mapping, loff_t from)
 {
 	return 0;
 }
+
+static inline ssize_t dax_do_io(int rw, struct kiocb *iocb, struct inode *inode,
+		const struct iovec *iov, loff_t offset, unsigned nr_segs,
+		get_block_t get_block, dio_iodone_t end_io, int flags)
+{
+	return -ENOTTY;
+}
 #endif
 
 #ifdef CONFIG_BLOCK
@@ -2677,6 +2682,11 @@ extern int generic_show_options(struct seq_file *m, struct dentry *root);
 extern void save_mount_options(struct super_block *sb, char *options);
 extern void replace_mount_options(struct super_block *sb, char *options);
 
+static inline bool io_is_direct(struct file *filp)
+{
+	return (filp->f_flags & O_DIRECT) || IS_DAX(file_inode(filp));
+}
+
 static inline ino_t parent_ino(struct dentry *dentry)
 {
 	ino_t res;
diff --git a/mm/filemap.c b/mm/filemap.c
index 7a13f6a..1b7dff6 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1417,8 +1417,7 @@ generic_file_aio_read(struct kiocb *iocb, const struct iovec *iov,
 	if (retval)
 		return retval;
 
-	/* coalesce the iovecs and go direct-to-BIO for O_DIRECT */
-	if (filp->f_flags & O_DIRECT) {
+	if (io_is_direct(filp)) {
 		loff_t size;
 		struct address_space *mapping;
 		struct inode *inode;
@@ -2468,8 +2467,7 @@ ssize_t __generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
 	if (err)
 		goto out;
 
-	/* coalesce the iovecs and go direct-to-BIO for O_DIRECT */
-	if (unlikely(file->f_flags & O_DIRECT)) {
+	if (io_is_direct(file)) {
 		loff_t endbyte;
 		ssize_t written_buffered;
 
diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
index c8d23e9..f7c37a1 100644
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -42,119 +42,6 @@ static struct page *xip_sparse_page(void)
 }
 
 /*
- * This is a file read routine for execute in place files, and uses
- * the mapping->a_ops->get_xip_mem() function for the actual low-level
- * stuff.
- *
- * Note the struct file* is not used at all.  It may be NULL.
- */
-static ssize_t
-do_xip_mapping_read(struct address_space *mapping,
-		    struct file_ra_state *_ra,
-		    struct file *filp,
-		    char __user *buf,
-		    size_t len,
-		    loff_t *ppos)
-{
-	struct inode *inode = mapping->host;
-	pgoff_t index, end_index;
-	unsigned long offset;
-	loff_t isize, pos;
-	size_t copied = 0, error = 0;
-
-	BUG_ON(!mapping->a_ops->get_xip_mem);
-
-	pos = *ppos;
-	index = pos >> PAGE_CACHE_SHIFT;
-	offset = pos & ~PAGE_CACHE_MASK;
-
-	isize = i_size_read(inode);
-	if (!isize)
-		goto out;
-
-	end_index = (isize - 1) >> PAGE_CACHE_SHIFT;
-	do {
-		unsigned long nr, left;
-		void *xip_mem;
-		unsigned long xip_pfn;
-		int zero = 0;
-
-		/* nr is the maximum number of bytes to copy from this page */
-		nr = PAGE_CACHE_SIZE;
-		if (index >= end_index) {
-			if (index > end_index)
-				goto out;
-			nr = ((isize - 1) & ~PAGE_CACHE_MASK) + 1;
-			if (nr <= offset) {
-				goto out;
-			}
-		}
-		nr = nr - offset;
-		if (nr > len - copied)
-			nr = len - copied;
-
-		error = mapping->a_ops->get_xip_mem(mapping, index, 0,
-							&xip_mem, &xip_pfn);
-		if (unlikely(error)) {
-			if (error == -ENODATA) {
-				/* sparse */
-				zero = 1;
-			} else
-				goto out;
-		}
-
-		/* If users can be writing to this page using arbitrary
-		 * virtual addresses, take care about potential aliasing
-		 * before reading the page on the kernel side.
-		 */
-		if (mapping_writably_mapped(mapping))
-			/* address based flush */ ;
-
-		/*
-		 * Ok, we have the mem, so now we can copy it to user space...
-		 *
-		 * The actor routine returns how many bytes were actually used..
-		 * NOTE! This may not be the same as how much of a user buffer
-		 * we filled up (we may be padding etc), so we can only update
-		 * "pos" here (the actor routine has to update the user buffer
-		 * pointers and the remaining count).
-		 */
-		if (!zero)
-			left = __copy_to_user(buf+copied, xip_mem+offset, nr);
-		else
-			left = __clear_user(buf + copied, nr);
-
-		if (left) {
-			error = -EFAULT;
-			goto out;
-		}
-
-		copied += (nr - left);
-		offset += (nr - left);
-		index += offset >> PAGE_CACHE_SHIFT;
-		offset &= ~PAGE_CACHE_MASK;
-	} while (copied < len);
-
-out:
-	*ppos = pos + copied;
-	if (filp)
-		file_accessed(filp);
-
-	return (copied ? copied : error);
-}
-
-ssize_t
-xip_file_read(struct file *filp, char __user *buf, size_t len, loff_t *ppos)
-{
-	if (!access_ok(VERIFY_WRITE, buf, len))
-		return -EFAULT;
-
-	return do_xip_mapping_read(filp->f_mapping, &filp->f_ra, filp,
-			    buf, len, ppos);
-}
-EXPORT_SYMBOL_GPL(xip_file_read);
-
-/*
  * __xip_unmap is invoked from xip_unmap and
  * xip_write
  *
@@ -340,127 +227,6 @@ int xip_file_mmap(struct file * file, struct vm_area_struct * vma)
 }
 EXPORT_SYMBOL_GPL(xip_file_mmap);
 
-static ssize_t
-__xip_file_write(struct file *filp, const char __user *buf,
-		  size_t count, loff_t pos, loff_t *ppos)
-{
-	struct address_space * mapping = filp->f_mapping;
-	const struct address_space_operations *a_ops = mapping->a_ops;
-	struct inode 	*inode = mapping->host;
-	long		status = 0;
-	size_t		bytes;
-	ssize_t		written = 0;
-
-	BUG_ON(!mapping->a_ops->get_xip_mem);
-
-	do {
-		unsigned long index;
-		unsigned long offset;
-		size_t copied;
-		void *xip_mem;
-		unsigned long xip_pfn;
-
-		offset = (pos & (PAGE_CACHE_SIZE -1)); /* Within page */
-		index = pos >> PAGE_CACHE_SHIFT;
-		bytes = PAGE_CACHE_SIZE - offset;
-		if (bytes > count)
-			bytes = count;
-
-		status = a_ops->get_xip_mem(mapping, index, 0,
-						&xip_mem, &xip_pfn);
-		if (status == -ENODATA) {
-			/* we allocate a new page unmap it */
-			mutex_lock(&xip_sparse_mutex);
-			status = a_ops->get_xip_mem(mapping, index, 1,
-							&xip_mem, &xip_pfn);
-			mutex_unlock(&xip_sparse_mutex);
-			if (!status)
-				/* unmap page at pgoff from all other vmas */
-				__xip_unmap(mapping, index);
-		}
-
-		if (status)
-			break;
-
-		copied = bytes -
-			__copy_from_user_nocache(xip_mem + offset, buf, bytes);
-
-		if (likely(copied > 0)) {
-			status = copied;
-
-			if (status >= 0) {
-				written += status;
-				count -= status;
-				pos += status;
-				buf += status;
-			}
-		}
-		if (unlikely(copied != bytes))
-			if (status >= 0)
-				status = -EFAULT;
-		if (status < 0)
-			break;
-	} while (count);
-	*ppos = pos;
-	/*
-	 * No need to use i_size_read() here, the i_size
-	 * cannot change under us because we hold i_mutex.
-	 */
-	if (pos > inode->i_size) {
-		i_size_write(inode, pos);
-		mark_inode_dirty(inode);
-	}
-
-	return written ? written : status;
-}
-
-ssize_t
-xip_file_write(struct file *filp, const char __user *buf, size_t len,
-	       loff_t *ppos)
-{
-	struct address_space *mapping = filp->f_mapping;
-	struct inode *inode = mapping->host;
-	size_t count;
-	loff_t pos;
-	ssize_t ret;
-
-	mutex_lock(&inode->i_mutex);
-
-	if (!access_ok(VERIFY_READ, buf, len)) {
-		ret=-EFAULT;
-		goto out_up;
-	}
-
-	pos = *ppos;
-	count = len;
-
-	/* We can write back this queue in page reclaim */
-	current->backing_dev_info = mapping->backing_dev_info;
-
-	ret = generic_write_checks(filp, &pos, &count, S_ISBLK(inode->i_mode));
-	if (ret)
-		goto out_backing;
-	if (count == 0)
-		goto out_backing;
-
-	ret = file_remove_suid(filp);
-	if (ret)
-		goto out_backing;
-
-	ret = file_update_time(filp);
-	if (ret)
-		goto out_backing;
-
-	ret = __xip_file_write (filp, buf, count, pos, ppos);
-
- out_backing:
-	current->backing_dev_info = NULL;
- out_up:
-	mutex_unlock(&inode->i_mutex);
-	return ret;
-}
-EXPORT_SYMBOL_GPL(xip_file_write);
-
 /*
  * truncate a page used for execute in place
  * functionality is analog to block_truncate_page but does use get_xip_mem
-- 
1.8.5.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [PATCH v6 07/22] Replace the XIP page fault handler with the DAX page fault handler
  2014-02-25 14:18 ` Matthew Wilcox
@ 2014-02-25 14:18   ` Matthew Wilcox
  -1 siblings, 0 replies; 79+ messages in thread
From: Matthew Wilcox @ 2014-02-25 14:18 UTC (permalink / raw)
  To: linux-kernel, linux-mm, linux-fsdevel, willy; +Cc: Matthew Wilcox

Instead of calling aops->get_xip_mem from the fault handler, the
filesystem passes a get_block_t that is used to find the appropriate
blocks.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
---
 fs/dax.c           | 167 +++++++++++++++++++++++++++++++++++++++++++
 fs/ext2/file.c     |  35 ++++++++-
 include/linux/fs.h |   4 +-
 mm/filemap_xip.c   | 206 -----------------------------------------------------
 4 files changed, 203 insertions(+), 209 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 81099f9..ebcd8fd 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -19,6 +19,8 @@
 #include <linux/buffer_head.h>
 #include <linux/fs.h>
 #include <linux/genhd.h>
+#include <linux/highmem.h>
+#include <linux/mm.h>
 #include <linux/mutex.h>
 #include <linux/uio.h>
 
@@ -32,6 +34,16 @@ static long dax_get_addr(struct inode *inode, struct buffer_head *bh,
 	return ops->direct_access(bdev, sector, addr, &pfn, bh->b_size);
 }
 
+static long dax_get_pfn(struct inode *inode, struct buffer_head *bh,
+							unsigned long *pfn)
+{
+	struct block_device *bdev = bh->b_bdev;
+	const struct block_device_operations *ops = bdev->bd_disk->fops;
+	void *addr;
+	sector_t sector = bh->b_blocknr << (inode->i_blkbits - 9);
+	return ops->direct_access(bdev, sector, &addr, pfn, bh->b_size);
+}
+
 static void dax_new_buf(void *addr, unsigned size, unsigned first,
 					loff_t offset, loff_t end, int rw)
 {
@@ -190,3 +202,158 @@ ssize_t dax_do_io(int rw, struct kiocb *iocb, struct inode *inode,
 	return retval;
 }
 EXPORT_SYMBOL_GPL(dax_do_io);
+
+/*
+ * The user has performed a load from a hole in the file.  Allocating
+ * a new page in the file would cause excessive storage usage for
+ * workloads with sparse files.  We allocate a page cache page instead.
+ * We'll kick it out of the page cache if it's ever written to,
+ * otherwise it will simply fall out of the page cache under memory
+ * pressure without ever having been dirtied.
+ */
+static int dax_load_hole(struct address_space *mapping, struct vm_fault *vmf)
+{
+	unsigned long size;
+	struct inode *inode = mapping->host;
+	struct page *page = find_or_create_page(mapping, vmf->pgoff,
+						GFP_KERNEL | __GFP_ZERO);
+	if (!page)
+		return VM_FAULT_OOM;
+	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
+	if (vmf->pgoff >= size) {
+		unlock_page(page);
+		page_cache_release(page);
+		return VM_FAULT_SIGBUS;
+	}
+
+	vmf->page = page;
+	return VM_FAULT_LOCKED;
+}
+
+static void copy_user_bh(struct page *to, struct inode *inode,
+				struct buffer_head *bh, unsigned long vaddr)
+{
+	void *vfrom, *vto;
+	dax_get_addr(inode, bh, &vfrom);	/* XXX: error handling */
+	vto = kmap_atomic(to);
+	copy_user_page(vto, vfrom, vaddr, to);
+	kunmap_atomic(vto);
+}
+
+static int do_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
+			get_block_t get_block)
+{
+	struct file *file = vma->vm_file;
+	struct inode *inode = file_inode(file);
+	struct address_space *mapping = file->f_mapping;
+	struct buffer_head bh;
+	unsigned long vaddr = (unsigned long)vmf->virtual_address;
+	sector_t block;
+	pgoff_t size;
+	unsigned long pfn;
+	int error;
+
+	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
+	if (vmf->pgoff >= size)
+		return VM_FAULT_SIGBUS;
+
+	memset(&bh, 0, sizeof(bh));
+	block = (sector_t)vmf->pgoff << (PAGE_SHIFT - inode->i_blkbits);
+	bh.b_size = PAGE_SIZE;
+	error = get_block(inode, block, &bh, 0);
+	if (error || bh.b_size < PAGE_SIZE)
+		return VM_FAULT_SIGBUS;
+
+	if (!buffer_written(&bh) && !vmf->cow_page) {
+		if (vmf->flags & FAULT_FLAG_WRITE) {
+			error = get_block(inode, block, &bh, 1);
+			if (error || bh.b_size < PAGE_SIZE)
+				return VM_FAULT_SIGBUS;
+		} else {
+			return dax_load_hole(mapping, vmf);
+		}
+	}
+
+	/* Recheck i_size under i_mmap_mutex */
+	mutex_lock(&mapping->i_mmap_mutex);
+	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
+	if (unlikely(vmf->pgoff >= size)) {
+		mutex_unlock(&mapping->i_mmap_mutex);
+		return VM_FAULT_SIGBUS;
+	}
+	if (vmf->cow_page) {
+		if (buffer_written(&bh))
+			copy_user_bh(vmf->cow_page, inode, &bh, vaddr);
+		else
+			clear_user_highpage(vmf->cow_page, vaddr);
+		return VM_FAULT_COWED;
+	}
+
+	error = dax_get_pfn(inode, &bh, &pfn);
+	if (error > 0)
+		error = vm_insert_mixed(vma, vaddr, pfn);
+	mutex_unlock(&mapping->i_mmap_mutex);
+	if (error == -ENOMEM)
+		return VM_FAULT_OOM;
+	/* -EBUSY is fine, somebody else faulted on the same PTE */
+	if (error != -EBUSY)
+		BUG_ON(error);
+	return VM_FAULT_NOPAGE;
+}
+
+/**
+ * dax_fault - handle a page fault on an XIP file
+ * @vma: The virtual memory area where the fault occurred
+ * @vmf: The description of the fault
+ * @get_block: The filesystem method used to translate file offsets to blocks
+ *
+ * When a page fault occurs, filesystems may call this helper in their
+ * fault handler for XIP files.
+ */
+int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
+			get_block_t get_block)
+{
+	int result;
+	struct super_block *sb = file_inode(vma->vm_file)->i_sb;
+
+	sb_start_pagefault(sb);
+	file_update_time(vma->vm_file);
+	result = do_dax_fault(vma, vmf, get_block);
+	sb_end_pagefault(sb);
+
+	return result;
+}
+EXPORT_SYMBOL_GPL(dax_fault);
+
+/**
+ * dax_mkwrite - convert a read-only page to read-write in an XIP file
+ * @vma: The virtual memory area where the fault occurred
+ * @vmf: The description of the fault
+ * @get_block: The filesystem method used to translate file offsets to blocks
+ *
+ * XIP handles reads of holes by adding pages full of zeroes into the
+ * mapping.  If the page is subsequenty written to, we have to allocate
+ * the page on media and free the page that was in the cache.
+ */
+int dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
+			get_block_t get_block)
+{
+	int result;
+	struct super_block *sb = file_inode(vma->vm_file)->i_sb;
+
+	sb_start_pagefault(sb);
+	file_update_time(vma->vm_file);
+	result = do_dax_fault(vma, vmf, get_block);
+	sb_end_pagefault(sb);
+
+	if (!(result & VM_FAULT_ERROR)) {
+		struct page *page = vmf->page;
+		unmap_mapping_range(page->mapping,
+					(loff_t)page->index << PAGE_CACHE_SHIFT,
+					PAGE_CACHE_SIZE, 0);
+		delete_from_page_cache(page);
+	}
+
+	return result;
+}
+EXPORT_SYMBOL_GPL(dax_mkwrite);
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index ef5cf96..e3ce10d 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -25,6 +25,37 @@
 #include "xattr.h"
 #include "acl.h"
 
+#ifdef CONFIG_EXT2_FS_XIP
+static int ext2_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+	return dax_fault(vma, vmf, ext2_get_block);
+}
+
+static int ext2_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+	return dax_mkwrite(vma, vmf, ext2_get_block);
+}
+
+static const struct vm_operations_struct ext2_dax_vm_ops = {
+	.fault		= ext2_dax_fault,
+	.page_mkwrite	= ext2_dax_mkwrite,
+	.remap_pages	= generic_file_remap_pages,
+};
+
+static int ext2_file_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	if (!IS_DAX(file_inode(file)))
+		return generic_file_mmap(file, vma);
+
+	file_accessed(file);
+	vma->vm_ops = &ext2_dax_vm_ops;
+	vma->vm_flags |= VM_MIXEDMAP;
+	return 0;
+}
+#else
+#define ext2_file_mmap	generic_file_mmap
+#endif
+
 /*
  * Called when filp is released. This happens when all file descriptors
  * for a single struct file are closed. Note that different open() calls
@@ -70,7 +101,7 @@ const struct file_operations ext2_file_operations = {
 #ifdef CONFIG_COMPAT
 	.compat_ioctl	= ext2_compat_ioctl,
 #endif
-	.mmap		= generic_file_mmap,
+	.mmap		= ext2_file_mmap,
 	.open		= dquot_file_open,
 	.release	= ext2_release_file,
 	.fsync		= ext2_fsync,
@@ -89,7 +120,7 @@ const struct file_operations ext2_xip_file_operations = {
 #ifdef CONFIG_COMPAT
 	.compat_ioctl	= ext2_compat_ioctl,
 #endif
-	.mmap		= xip_file_mmap,
+	.mmap		= ext2_file_mmap,
 	.open		= dquot_file_open,
 	.release	= ext2_release_file,
 	.fsync		= ext2_fsync,
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 07888b9..00ad95e 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -48,6 +48,7 @@ struct cred;
 struct swap_info_struct;
 struct seq_file;
 struct workqueue_struct;
+struct vm_fault;
 
 extern void __init inode_init(void);
 extern void __init inode_init_early(void);
@@ -2517,10 +2518,11 @@ extern int generic_file_open(struct inode * inode, struct file * filp);
 extern int nonseekable_open(struct inode * inode, struct file * filp);
 
 #ifdef CONFIG_FS_XIP
-extern int xip_file_mmap(struct file * file, struct vm_area_struct * vma);
 extern int xip_truncate_page(struct address_space *mapping, loff_t from);
 ssize_t dax_do_io(int rw, struct kiocb *, struct inode *, const struct iovec *,
 		loff_t, unsigned segs, get_block_t, dio_iodone_t, int flags);
+int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t);
+int dax_mkwrite(struct vm_area_struct *, struct vm_fault *, get_block_t);
 #else
 static inline int xip_truncate_page(struct address_space *mapping, loff_t from)
 {
diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
index f7c37a1..9dd45f3 100644
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -22,212 +22,6 @@
 #include <asm/io.h>
 
 /*
- * We do use our own empty page to avoid interference with other users
- * of ZERO_PAGE(), such as /dev/zero
- */
-static DEFINE_MUTEX(xip_sparse_mutex);
-static seqcount_t xip_sparse_seq = SEQCNT_ZERO(xip_sparse_seq);
-static struct page *__xip_sparse_page;
-
-/* called under xip_sparse_mutex */
-static struct page *xip_sparse_page(void)
-{
-	if (!__xip_sparse_page) {
-		struct page *page = alloc_page(GFP_HIGHUSER | __GFP_ZERO);
-
-		if (page)
-			__xip_sparse_page = page;
-	}
-	return __xip_sparse_page;
-}
-
-/*
- * __xip_unmap is invoked from xip_unmap and
- * xip_write
- *
- * This function walks all vmas of the address_space and unmaps the
- * __xip_sparse_page when found at pgoff.
- */
-static void
-__xip_unmap (struct address_space * mapping,
-		     unsigned long pgoff)
-{
-	struct vm_area_struct *vma;
-	struct mm_struct *mm;
-	unsigned long address;
-	pte_t *pte;
-	pte_t pteval;
-	spinlock_t *ptl;
-	struct page *page;
-	unsigned count;
-	int locked = 0;
-
-	count = read_seqcount_begin(&xip_sparse_seq);
-
-	page = __xip_sparse_page;
-	if (!page)
-		return;
-
-retry:
-	mutex_lock(&mapping->i_mmap_mutex);
-	vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
-		mm = vma->vm_mm;
-		address = vma->vm_start +
-			((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
-		BUG_ON(address < vma->vm_start || address >= vma->vm_end);
-		pte = page_check_address(page, mm, address, &ptl, 1);
-		if (pte) {
-			/* Nuke the page table entry. */
-			flush_cache_page(vma, address, pte_pfn(*pte));
-			pteval = ptep_clear_flush(vma, address, pte);
-			page_remove_rmap(page);
-			dec_mm_counter(mm, MM_FILEPAGES);
-			BUG_ON(pte_dirty(pteval));
-			pte_unmap_unlock(pte, ptl);
-			/* must invalidate_page _before_ freeing the page */
-			mmu_notifier_invalidate_page(mm, address);
-			page_cache_release(page);
-		}
-	}
-	mutex_unlock(&mapping->i_mmap_mutex);
-
-	if (locked) {
-		mutex_unlock(&xip_sparse_mutex);
-	} else if (read_seqcount_retry(&xip_sparse_seq, count)) {
-		mutex_lock(&xip_sparse_mutex);
-		locked = 1;
-		goto retry;
-	}
-}
-
-/*
- * xip_fault() is invoked via the vma operations vector for a
- * mapped memory region to read in file data during a page fault.
- *
- * This function is derived from filemap_fault, but used for execute in place
- */
-static int xip_file_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
-{
-	struct file *file = vma->vm_file;
-	struct address_space *mapping = file->f_mapping;
-	struct inode *inode = mapping->host;
-	pgoff_t size;
-	void *xip_mem;
-	unsigned long xip_pfn;
-	struct page *page;
-	int error;
-
-	/* XXX: are VM_FAULT_ codes OK? */
-again:
-	size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
-	if (vmf->pgoff >= size)
-		return VM_FAULT_SIGBUS;
-
-	error = mapping->a_ops->get_xip_mem(mapping, vmf->pgoff, 0,
-						&xip_mem, &xip_pfn);
-	if (likely(!error))
-		goto found;
-	if (error != -ENODATA)
-		return VM_FAULT_OOM;
-
-	/* sparse block */
-	if ((vma->vm_flags & (VM_WRITE | VM_MAYWRITE)) &&
-	    (vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) &&
-	    (!(mapping->host->i_sb->s_flags & MS_RDONLY))) {
-		int err;
-
-		/* maybe shared writable, allocate new block */
-		mutex_lock(&xip_sparse_mutex);
-		error = mapping->a_ops->get_xip_mem(mapping, vmf->pgoff, 1,
-							&xip_mem, &xip_pfn);
-		mutex_unlock(&xip_sparse_mutex);
-		if (error)
-			return VM_FAULT_SIGBUS;
-		/* unmap sparse mappings at pgoff from all other vmas */
-		__xip_unmap(mapping, vmf->pgoff);
-
-found:
-		/* We must recheck i_size under i_mmap_mutex */
-		mutex_lock(&mapping->i_mmap_mutex);
-		size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >>
-							PAGE_CACHE_SHIFT;
-		if (unlikely(vmf->pgoff >= size)) {
-			mutex_unlock(&mapping->i_mmap_mutex);
-			return VM_FAULT_SIGBUS;
-		}
-		err = vm_insert_mixed(vma, (unsigned long)vmf->virtual_address,
-							xip_pfn);
-		mutex_unlock(&mapping->i_mmap_mutex);
-		if (err == -ENOMEM)
-			return VM_FAULT_OOM;
-		/*
-		 * err == -EBUSY is fine, we've raced against another thread
-		 * that faulted-in the same page
-		 */
-		if (err != -EBUSY)
-			BUG_ON(err);
-		return VM_FAULT_NOPAGE;
-	} else {
-		int err, ret = VM_FAULT_OOM;
-
-		mutex_lock(&xip_sparse_mutex);
-		write_seqcount_begin(&xip_sparse_seq);
-		error = mapping->a_ops->get_xip_mem(mapping, vmf->pgoff, 0,
-							&xip_mem, &xip_pfn);
-		if (unlikely(!error)) {
-			write_seqcount_end(&xip_sparse_seq);
-			mutex_unlock(&xip_sparse_mutex);
-			goto again;
-		}
-		if (error != -ENODATA)
-			goto out;
-
-		/* We must recheck i_size under i_mmap_mutex */
-		mutex_lock(&mapping->i_mmap_mutex);
-		size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >>
-							PAGE_CACHE_SHIFT;
-		if (unlikely(vmf->pgoff >= size)) {
-			ret = VM_FAULT_SIGBUS;
-			goto unlock;
-		}
-		/* not shared and writable, use xip_sparse_page() */
-		page = xip_sparse_page();
-		if (!page)
-			goto unlock;
-		err = vm_insert_page(vma, (unsigned long)vmf->virtual_address,
-							page);
-		if (err == -ENOMEM)
-			goto unlock;
-
-		ret = VM_FAULT_NOPAGE;
-unlock:
-		mutex_unlock(&mapping->i_mmap_mutex);
-out:
-		write_seqcount_end(&xip_sparse_seq);
-		mutex_unlock(&xip_sparse_mutex);
-
-		return ret;
-	}
-}
-
-static const struct vm_operations_struct xip_file_vm_ops = {
-	.fault	= xip_file_fault,
-	.page_mkwrite	= filemap_page_mkwrite,
-	.remap_pages = generic_file_remap_pages,
-};
-
-int xip_file_mmap(struct file * file, struct vm_area_struct * vma)
-{
-	BUG_ON(!file->f_mapping->a_ops->get_xip_mem);
-
-	file_accessed(file);
-	vma->vm_ops = &xip_file_vm_ops;
-	vma->vm_flags |= VM_MIXEDMAP;
-	return 0;
-}
-EXPORT_SYMBOL_GPL(xip_file_mmap);
-
-/*
  * truncate a page used for execute in place
  * functionality is analog to block_truncate_page but does use get_xip_mem
  * to get the page instead of page cache
-- 
1.8.5.3


^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [PATCH v6 07/22] Replace the XIP page fault handler with the DAX page fault handler
@ 2014-02-25 14:18   ` Matthew Wilcox
  0 siblings, 0 replies; 79+ messages in thread
From: Matthew Wilcox @ 2014-02-25 14:18 UTC (permalink / raw)
  To: linux-kernel, linux-mm, linux-fsdevel, willy; +Cc: Matthew Wilcox

Instead of calling aops->get_xip_mem from the fault handler, the
filesystem passes a get_block_t that is used to find the appropriate
blocks.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
---
 fs/dax.c           | 167 +++++++++++++++++++++++++++++++++++++++++++
 fs/ext2/file.c     |  35 ++++++++-
 include/linux/fs.h |   4 +-
 mm/filemap_xip.c   | 206 -----------------------------------------------------
 4 files changed, 203 insertions(+), 209 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 81099f9..ebcd8fd 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -19,6 +19,8 @@
 #include <linux/buffer_head.h>
 #include <linux/fs.h>
 #include <linux/genhd.h>
+#include <linux/highmem.h>
+#include <linux/mm.h>
 #include <linux/mutex.h>
 #include <linux/uio.h>
 
@@ -32,6 +34,16 @@ static long dax_get_addr(struct inode *inode, struct buffer_head *bh,
 	return ops->direct_access(bdev, sector, addr, &pfn, bh->b_size);
 }
 
+static long dax_get_pfn(struct inode *inode, struct buffer_head *bh,
+							unsigned long *pfn)
+{
+	struct block_device *bdev = bh->b_bdev;
+	const struct block_device_operations *ops = bdev->bd_disk->fops;
+	void *addr;
+	sector_t sector = bh->b_blocknr << (inode->i_blkbits - 9);
+	return ops->direct_access(bdev, sector, &addr, pfn, bh->b_size);
+}
+
 static void dax_new_buf(void *addr, unsigned size, unsigned first,
 					loff_t offset, loff_t end, int rw)
 {
@@ -190,3 +202,158 @@ ssize_t dax_do_io(int rw, struct kiocb *iocb, struct inode *inode,
 	return retval;
 }
 EXPORT_SYMBOL_GPL(dax_do_io);
+
+/*
+ * The user has performed a load from a hole in the file.  Allocating
+ * a new page in the file would cause excessive storage usage for
+ * workloads with sparse files.  We allocate a page cache page instead.
+ * We'll kick it out of the page cache if it's ever written to,
+ * otherwise it will simply fall out of the page cache under memory
+ * pressure without ever having been dirtied.
+ */
+static int dax_load_hole(struct address_space *mapping, struct vm_fault *vmf)
+{
+	unsigned long size;
+	struct inode *inode = mapping->host;
+	struct page *page = find_or_create_page(mapping, vmf->pgoff,
+						GFP_KERNEL | __GFP_ZERO);
+	if (!page)
+		return VM_FAULT_OOM;
+	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
+	if (vmf->pgoff >= size) {
+		unlock_page(page);
+		page_cache_release(page);
+		return VM_FAULT_SIGBUS;
+	}
+
+	vmf->page = page;
+	return VM_FAULT_LOCKED;
+}
+
+static void copy_user_bh(struct page *to, struct inode *inode,
+				struct buffer_head *bh, unsigned long vaddr)
+{
+	void *vfrom, *vto;
+	dax_get_addr(inode, bh, &vfrom);	/* XXX: error handling */
+	vto = kmap_atomic(to);
+	copy_user_page(vto, vfrom, vaddr, to);
+	kunmap_atomic(vto);
+}
+
+static int do_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
+			get_block_t get_block)
+{
+	struct file *file = vma->vm_file;
+	struct inode *inode = file_inode(file);
+	struct address_space *mapping = file->f_mapping;
+	struct buffer_head bh;
+	unsigned long vaddr = (unsigned long)vmf->virtual_address;
+	sector_t block;
+	pgoff_t size;
+	unsigned long pfn;
+	int error;
+
+	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
+	if (vmf->pgoff >= size)
+		return VM_FAULT_SIGBUS;
+
+	memset(&bh, 0, sizeof(bh));
+	block = (sector_t)vmf->pgoff << (PAGE_SHIFT - inode->i_blkbits);
+	bh.b_size = PAGE_SIZE;
+	error = get_block(inode, block, &bh, 0);
+	if (error || bh.b_size < PAGE_SIZE)
+		return VM_FAULT_SIGBUS;
+
+	if (!buffer_written(&bh) && !vmf->cow_page) {
+		if (vmf->flags & FAULT_FLAG_WRITE) {
+			error = get_block(inode, block, &bh, 1);
+			if (error || bh.b_size < PAGE_SIZE)
+				return VM_FAULT_SIGBUS;
+		} else {
+			return dax_load_hole(mapping, vmf);
+		}
+	}
+
+	/* Recheck i_size under i_mmap_mutex */
+	mutex_lock(&mapping->i_mmap_mutex);
+	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
+	if (unlikely(vmf->pgoff >= size)) {
+		mutex_unlock(&mapping->i_mmap_mutex);
+		return VM_FAULT_SIGBUS;
+	}
+	if (vmf->cow_page) {
+		if (buffer_written(&bh))
+			copy_user_bh(vmf->cow_page, inode, &bh, vaddr);
+		else
+			clear_user_highpage(vmf->cow_page, vaddr);
+		return VM_FAULT_COWED;
+	}
+
+	error = dax_get_pfn(inode, &bh, &pfn);
+	if (error > 0)
+		error = vm_insert_mixed(vma, vaddr, pfn);
+	mutex_unlock(&mapping->i_mmap_mutex);
+	if (error == -ENOMEM)
+		return VM_FAULT_OOM;
+	/* -EBUSY is fine, somebody else faulted on the same PTE */
+	if (error != -EBUSY)
+		BUG_ON(error);
+	return VM_FAULT_NOPAGE;
+}
+
+/**
+ * dax_fault - handle a page fault on an XIP file
+ * @vma: The virtual memory area where the fault occurred
+ * @vmf: The description of the fault
+ * @get_block: The filesystem method used to translate file offsets to blocks
+ *
+ * When a page fault occurs, filesystems may call this helper in their
+ * fault handler for XIP files.
+ */
+int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
+			get_block_t get_block)
+{
+	int result;
+	struct super_block *sb = file_inode(vma->vm_file)->i_sb;
+
+	sb_start_pagefault(sb);
+	file_update_time(vma->vm_file);
+	result = do_dax_fault(vma, vmf, get_block);
+	sb_end_pagefault(sb);
+
+	return result;
+}
+EXPORT_SYMBOL_GPL(dax_fault);
+
+/**
+ * dax_mkwrite - convert a read-only page to read-write in an XIP file
+ * @vma: The virtual memory area where the fault occurred
+ * @vmf: The description of the fault
+ * @get_block: The filesystem method used to translate file offsets to blocks
+ *
+ * XIP handles reads of holes by adding pages full of zeroes into the
+ * mapping.  If the page is subsequenty written to, we have to allocate
+ * the page on media and free the page that was in the cache.
+ */
+int dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
+			get_block_t get_block)
+{
+	int result;
+	struct super_block *sb = file_inode(vma->vm_file)->i_sb;
+
+	sb_start_pagefault(sb);
+	file_update_time(vma->vm_file);
+	result = do_dax_fault(vma, vmf, get_block);
+	sb_end_pagefault(sb);
+
+	if (!(result & VM_FAULT_ERROR)) {
+		struct page *page = vmf->page;
+		unmap_mapping_range(page->mapping,
+					(loff_t)page->index << PAGE_CACHE_SHIFT,
+					PAGE_CACHE_SIZE, 0);
+		delete_from_page_cache(page);
+	}
+
+	return result;
+}
+EXPORT_SYMBOL_GPL(dax_mkwrite);
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index ef5cf96..e3ce10d 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -25,6 +25,37 @@
 #include "xattr.h"
 #include "acl.h"
 
+#ifdef CONFIG_EXT2_FS_XIP
+static int ext2_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+	return dax_fault(vma, vmf, ext2_get_block);
+}
+
+static int ext2_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+	return dax_mkwrite(vma, vmf, ext2_get_block);
+}
+
+static const struct vm_operations_struct ext2_dax_vm_ops = {
+	.fault		= ext2_dax_fault,
+	.page_mkwrite	= ext2_dax_mkwrite,
+	.remap_pages	= generic_file_remap_pages,
+};
+
+static int ext2_file_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	if (!IS_DAX(file_inode(file)))
+		return generic_file_mmap(file, vma);
+
+	file_accessed(file);
+	vma->vm_ops = &ext2_dax_vm_ops;
+	vma->vm_flags |= VM_MIXEDMAP;
+	return 0;
+}
+#else
+#define ext2_file_mmap	generic_file_mmap
+#endif
+
 /*
  * Called when filp is released. This happens when all file descriptors
  * for a single struct file are closed. Note that different open() calls
@@ -70,7 +101,7 @@ const struct file_operations ext2_file_operations = {
 #ifdef CONFIG_COMPAT
 	.compat_ioctl	= ext2_compat_ioctl,
 #endif
-	.mmap		= generic_file_mmap,
+	.mmap		= ext2_file_mmap,
 	.open		= dquot_file_open,
 	.release	= ext2_release_file,
 	.fsync		= ext2_fsync,
@@ -89,7 +120,7 @@ const struct file_operations ext2_xip_file_operations = {
 #ifdef CONFIG_COMPAT
 	.compat_ioctl	= ext2_compat_ioctl,
 #endif
-	.mmap		= xip_file_mmap,
+	.mmap		= ext2_file_mmap,
 	.open		= dquot_file_open,
 	.release	= ext2_release_file,
 	.fsync		= ext2_fsync,
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 07888b9..00ad95e 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -48,6 +48,7 @@ struct cred;
 struct swap_info_struct;
 struct seq_file;
 struct workqueue_struct;
+struct vm_fault;
 
 extern void __init inode_init(void);
 extern void __init inode_init_early(void);
@@ -2517,10 +2518,11 @@ extern int generic_file_open(struct inode * inode, struct file * filp);
 extern int nonseekable_open(struct inode * inode, struct file * filp);
 
 #ifdef CONFIG_FS_XIP
-extern int xip_file_mmap(struct file * file, struct vm_area_struct * vma);
 extern int xip_truncate_page(struct address_space *mapping, loff_t from);
 ssize_t dax_do_io(int rw, struct kiocb *, struct inode *, const struct iovec *,
 		loff_t, unsigned segs, get_block_t, dio_iodone_t, int flags);
+int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t);
+int dax_mkwrite(struct vm_area_struct *, struct vm_fault *, get_block_t);
 #else
 static inline int xip_truncate_page(struct address_space *mapping, loff_t from)
 {
diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
index f7c37a1..9dd45f3 100644
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -22,212 +22,6 @@
 #include <asm/io.h>
 
 /*
- * We do use our own empty page to avoid interference with other users
- * of ZERO_PAGE(), such as /dev/zero
- */
-static DEFINE_MUTEX(xip_sparse_mutex);
-static seqcount_t xip_sparse_seq = SEQCNT_ZERO(xip_sparse_seq);
-static struct page *__xip_sparse_page;
-
-/* called under xip_sparse_mutex */
-static struct page *xip_sparse_page(void)
-{
-	if (!__xip_sparse_page) {
-		struct page *page = alloc_page(GFP_HIGHUSER | __GFP_ZERO);
-
-		if (page)
-			__xip_sparse_page = page;
-	}
-	return __xip_sparse_page;
-}
-
-/*
- * __xip_unmap is invoked from xip_unmap and
- * xip_write
- *
- * This function walks all vmas of the address_space and unmaps the
- * __xip_sparse_page when found at pgoff.
- */
-static void
-__xip_unmap (struct address_space * mapping,
-		     unsigned long pgoff)
-{
-	struct vm_area_struct *vma;
-	struct mm_struct *mm;
-	unsigned long address;
-	pte_t *pte;
-	pte_t pteval;
-	spinlock_t *ptl;
-	struct page *page;
-	unsigned count;
-	int locked = 0;
-
-	count = read_seqcount_begin(&xip_sparse_seq);
-
-	page = __xip_sparse_page;
-	if (!page)
-		return;
-
-retry:
-	mutex_lock(&mapping->i_mmap_mutex);
-	vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
-		mm = vma->vm_mm;
-		address = vma->vm_start +
-			((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
-		BUG_ON(address < vma->vm_start || address >= vma->vm_end);
-		pte = page_check_address(page, mm, address, &ptl, 1);
-		if (pte) {
-			/* Nuke the page table entry. */
-			flush_cache_page(vma, address, pte_pfn(*pte));
-			pteval = ptep_clear_flush(vma, address, pte);
-			page_remove_rmap(page);
-			dec_mm_counter(mm, MM_FILEPAGES);
-			BUG_ON(pte_dirty(pteval));
-			pte_unmap_unlock(pte, ptl);
-			/* must invalidate_page _before_ freeing the page */
-			mmu_notifier_invalidate_page(mm, address);
-			page_cache_release(page);
-		}
-	}
-	mutex_unlock(&mapping->i_mmap_mutex);
-
-	if (locked) {
-		mutex_unlock(&xip_sparse_mutex);
-	} else if (read_seqcount_retry(&xip_sparse_seq, count)) {
-		mutex_lock(&xip_sparse_mutex);
-		locked = 1;
-		goto retry;
-	}
-}
-
-/*
- * xip_fault() is invoked via the vma operations vector for a
- * mapped memory region to read in file data during a page fault.
- *
- * This function is derived from filemap_fault, but used for execute in place
- */
-static int xip_file_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
-{
-	struct file *file = vma->vm_file;
-	struct address_space *mapping = file->f_mapping;
-	struct inode *inode = mapping->host;
-	pgoff_t size;
-	void *xip_mem;
-	unsigned long xip_pfn;
-	struct page *page;
-	int error;
-
-	/* XXX: are VM_FAULT_ codes OK? */
-again:
-	size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
-	if (vmf->pgoff >= size)
-		return VM_FAULT_SIGBUS;
-
-	error = mapping->a_ops->get_xip_mem(mapping, vmf->pgoff, 0,
-						&xip_mem, &xip_pfn);
-	if (likely(!error))
-		goto found;
-	if (error != -ENODATA)
-		return VM_FAULT_OOM;
-
-	/* sparse block */
-	if ((vma->vm_flags & (VM_WRITE | VM_MAYWRITE)) &&
-	    (vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) &&
-	    (!(mapping->host->i_sb->s_flags & MS_RDONLY))) {
-		int err;
-
-		/* maybe shared writable, allocate new block */
-		mutex_lock(&xip_sparse_mutex);
-		error = mapping->a_ops->get_xip_mem(mapping, vmf->pgoff, 1,
-							&xip_mem, &xip_pfn);
-		mutex_unlock(&xip_sparse_mutex);
-		if (error)
-			return VM_FAULT_SIGBUS;
-		/* unmap sparse mappings at pgoff from all other vmas */
-		__xip_unmap(mapping, vmf->pgoff);
-
-found:
-		/* We must recheck i_size under i_mmap_mutex */
-		mutex_lock(&mapping->i_mmap_mutex);
-		size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >>
-							PAGE_CACHE_SHIFT;
-		if (unlikely(vmf->pgoff >= size)) {
-			mutex_unlock(&mapping->i_mmap_mutex);
-			return VM_FAULT_SIGBUS;
-		}
-		err = vm_insert_mixed(vma, (unsigned long)vmf->virtual_address,
-							xip_pfn);
-		mutex_unlock(&mapping->i_mmap_mutex);
-		if (err == -ENOMEM)
-			return VM_FAULT_OOM;
-		/*
-		 * err == -EBUSY is fine, we've raced against another thread
-		 * that faulted-in the same page
-		 */
-		if (err != -EBUSY)
-			BUG_ON(err);
-		return VM_FAULT_NOPAGE;
-	} else {
-		int err, ret = VM_FAULT_OOM;
-
-		mutex_lock(&xip_sparse_mutex);
-		write_seqcount_begin(&xip_sparse_seq);
-		error = mapping->a_ops->get_xip_mem(mapping, vmf->pgoff, 0,
-							&xip_mem, &xip_pfn);
-		if (unlikely(!error)) {
-			write_seqcount_end(&xip_sparse_seq);
-			mutex_unlock(&xip_sparse_mutex);
-			goto again;
-		}
-		if (error != -ENODATA)
-			goto out;
-
-		/* We must recheck i_size under i_mmap_mutex */
-		mutex_lock(&mapping->i_mmap_mutex);
-		size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >>
-							PAGE_CACHE_SHIFT;
-		if (unlikely(vmf->pgoff >= size)) {
-			ret = VM_FAULT_SIGBUS;
-			goto unlock;
-		}
-		/* not shared and writable, use xip_sparse_page() */
-		page = xip_sparse_page();
-		if (!page)
-			goto unlock;
-		err = vm_insert_page(vma, (unsigned long)vmf->virtual_address,
-							page);
-		if (err == -ENOMEM)
-			goto unlock;
-
-		ret = VM_FAULT_NOPAGE;
-unlock:
-		mutex_unlock(&mapping->i_mmap_mutex);
-out:
-		write_seqcount_end(&xip_sparse_seq);
-		mutex_unlock(&xip_sparse_mutex);
-
-		return ret;
-	}
-}
-
-static const struct vm_operations_struct xip_file_vm_ops = {
-	.fault	= xip_file_fault,
-	.page_mkwrite	= filemap_page_mkwrite,
-	.remap_pages = generic_file_remap_pages,
-};
-
-int xip_file_mmap(struct file * file, struct vm_area_struct * vma)
-{
-	BUG_ON(!file->f_mapping->a_ops->get_xip_mem);
-
-	file_accessed(file);
-	vma->vm_ops = &xip_file_vm_ops;
-	vma->vm_flags |= VM_MIXEDMAP;
-	return 0;
-}
-EXPORT_SYMBOL_GPL(xip_file_mmap);
-
-/*
  * truncate a page used for execute in place
  * functionality is analog to block_truncate_page but does use get_xip_mem
  * to get the page instead of page cache
-- 
1.8.5.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [PATCH v6 08/22] Replace xip_truncate_page with dax_truncate_page
  2014-02-25 14:18 ` Matthew Wilcox
@ 2014-02-25 14:18   ` Matthew Wilcox
  -1 siblings, 0 replies; 79+ messages in thread
From: Matthew Wilcox @ 2014-02-25 14:18 UTC (permalink / raw)
  To: linux-kernel, linux-mm, linux-fsdevel, willy; +Cc: Matthew Wilcox

It takes a get_block parameter just like nobh_truncate_page() and
block_truncate_page()

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
---
 fs/dax.c           | 52 ++++++++++++++++++++++++++++++++++++++++++++++++----
 fs/ext2/inode.c    |  2 +-
 include/linux/fs.h |  4 ++--
 mm/filemap_xip.c   | 40 ----------------------------------------
 4 files changed, 51 insertions(+), 47 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index ebcd8fd..a6a1adc 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -302,13 +302,13 @@ static int do_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 }
 
 /**
- * dax_fault - handle a page fault on an XIP file
+ * dax_fault - handle a page fault on a DAX file
  * @vma: The virtual memory area where the fault occurred
  * @vmf: The description of the fault
  * @get_block: The filesystem method used to translate file offsets to blocks
  *
  * When a page fault occurs, filesystems may call this helper in their
- * fault handler for XIP files.
+ * fault handler for DAX files.
  */
 int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 			get_block_t get_block)
@@ -326,12 +326,12 @@ int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 EXPORT_SYMBOL_GPL(dax_fault);
 
 /**
- * dax_mkwrite - convert a read-only page to read-write in an XIP file
+ * dax_mkwrite - convert a read-only page to read-write in a DAX file
  * @vma: The virtual memory area where the fault occurred
  * @vmf: The description of the fault
  * @get_block: The filesystem method used to translate file offsets to blocks
  *
- * XIP handles reads of holes by adding pages full of zeroes into the
+ * DAX handles reads of holes by adding pages full of zeroes into the
  * mapping.  If the page is subsequenty written to, we have to allocate
  * the page on media and free the page that was in the cache.
  */
@@ -357,3 +357,47 @@ int dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
 	return result;
 }
 EXPORT_SYMBOL_GPL(dax_mkwrite);
+
+/**
+ * dax_truncate_page - handle a partial page being truncated in a DAX file
+ * @inode: The file being truncated
+ * @from: The file offset that is being truncated to
+ * @get_block: The filesystem method used to translate file offsets to blocks
+ *
+ * Similar to block_truncate_page(), this function can be called by a
+ * filesystem when it is truncating an DAX file to handle the partial page.
+ *
+ * We work in terms of PAGE_CACHE_SIZE here for commonality with
+ * block_truncate_page(), but we could go down to PAGE_SIZE if the filesystem
+ * took care of disposing of the unnecessary blocks.  Even if the filesystem
+ * block size is smaller than PAGE_SIZE, we have to zero the rest of the page
+ * since the file might be mmaped.
+ */
+int dax_truncate_page(struct inode *inode, loff_t from, get_block_t get_block)
+{
+	struct buffer_head bh;
+	pgoff_t index = from >> PAGE_CACHE_SHIFT;
+	unsigned offset = from & (PAGE_CACHE_SIZE-1);
+	unsigned length = PAGE_CACHE_ALIGN(from) - from;
+	int err;
+
+	/* Block boundary? Nothing to do */
+	if (!length)
+		return 0;
+
+	memset(&bh, 0, sizeof(bh));
+	bh.b_size = PAGE_CACHE_SIZE;
+	err = get_block(inode, index, &bh, 0);
+	if (err < 0)
+		return err;
+	if (buffer_written(&bh)) {
+		void *addr;
+		err = dax_get_addr(inode, &bh, &addr);
+		if (err)
+			return err;
+		memset(addr + offset, 0, length);
+	}
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(dax_truncate_page);
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index f128ebf..252481f 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -1207,7 +1207,7 @@ static int ext2_setsize(struct inode *inode, loff_t newsize)
 	inode_dio_wait(inode);
 
 	if (IS_DAX(inode))
-		error = xip_truncate_page(inode->i_mapping, newsize);
+		error = dax_truncate_page(inode, newsize, ext2_get_block);
 	else if (test_opt(inode->i_sb, NOBH))
 		error = nobh_truncate_page(inode->i_mapping,
 				newsize, ext2_get_block);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 00ad95e..02eeeb7 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2518,13 +2518,13 @@ extern int generic_file_open(struct inode * inode, struct file * filp);
 extern int nonseekable_open(struct inode * inode, struct file * filp);
 
 #ifdef CONFIG_FS_XIP
-extern int xip_truncate_page(struct address_space *mapping, loff_t from);
+int dax_truncate_page(struct inode *, loff_t from, get_block_t);
 ssize_t dax_do_io(int rw, struct kiocb *, struct inode *, const struct iovec *,
 		loff_t, unsigned segs, get_block_t, dio_iodone_t, int flags);
 int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t);
 int dax_mkwrite(struct vm_area_struct *, struct vm_fault *, get_block_t);
 #else
-static inline int xip_truncate_page(struct address_space *mapping, loff_t from)
+static inline int dax_truncate_page(struct inode *i, loff_t frm, get_block_t gb)
 {
 	return 0;
 }
diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
index 9dd45f3..6316578 100644
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -21,43 +21,3 @@
 #include <asm/tlbflush.h>
 #include <asm/io.h>
 
-/*
- * truncate a page used for execute in place
- * functionality is analog to block_truncate_page but does use get_xip_mem
- * to get the page instead of page cache
- */
-int
-xip_truncate_page(struct address_space *mapping, loff_t from)
-{
-	pgoff_t index = from >> PAGE_CACHE_SHIFT;
-	unsigned offset = from & (PAGE_CACHE_SIZE-1);
-	unsigned blocksize;
-	unsigned length;
-	void *xip_mem;
-	unsigned long xip_pfn;
-	int err;
-
-	BUG_ON(!mapping->a_ops->get_xip_mem);
-
-	blocksize = 1 << mapping->host->i_blkbits;
-	length = offset & (blocksize - 1);
-
-	/* Block boundary? Nothing to do */
-	if (!length)
-		return 0;
-
-	length = blocksize - length;
-
-	err = mapping->a_ops->get_xip_mem(mapping, index, 0,
-						&xip_mem, &xip_pfn);
-	if (unlikely(err)) {
-		if (err == -ENODATA)
-			/* Hole? No need to truncate */
-			return 0;
-		else
-			return err;
-	}
-	memset(xip_mem + offset, 0, length);
-	return 0;
-}
-EXPORT_SYMBOL_GPL(xip_truncate_page);
-- 
1.8.5.3


^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [PATCH v6 08/22] Replace xip_truncate_page with dax_truncate_page
@ 2014-02-25 14:18   ` Matthew Wilcox
  0 siblings, 0 replies; 79+ messages in thread
From: Matthew Wilcox @ 2014-02-25 14:18 UTC (permalink / raw)
  To: linux-kernel, linux-mm, linux-fsdevel, willy; +Cc: Matthew Wilcox

It takes a get_block parameter just like nobh_truncate_page() and
block_truncate_page()

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
---
 fs/dax.c           | 52 ++++++++++++++++++++++++++++++++++++++++++++++++----
 fs/ext2/inode.c    |  2 +-
 include/linux/fs.h |  4 ++--
 mm/filemap_xip.c   | 40 ----------------------------------------
 4 files changed, 51 insertions(+), 47 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index ebcd8fd..a6a1adc 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -302,13 +302,13 @@ static int do_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 }
 
 /**
- * dax_fault - handle a page fault on an XIP file
+ * dax_fault - handle a page fault on a DAX file
  * @vma: The virtual memory area where the fault occurred
  * @vmf: The description of the fault
  * @get_block: The filesystem method used to translate file offsets to blocks
  *
  * When a page fault occurs, filesystems may call this helper in their
- * fault handler for XIP files.
+ * fault handler for DAX files.
  */
 int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 			get_block_t get_block)
@@ -326,12 +326,12 @@ int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 EXPORT_SYMBOL_GPL(dax_fault);
 
 /**
- * dax_mkwrite - convert a read-only page to read-write in an XIP file
+ * dax_mkwrite - convert a read-only page to read-write in a DAX file
  * @vma: The virtual memory area where the fault occurred
  * @vmf: The description of the fault
  * @get_block: The filesystem method used to translate file offsets to blocks
  *
- * XIP handles reads of holes by adding pages full of zeroes into the
+ * DAX handles reads of holes by adding pages full of zeroes into the
  * mapping.  If the page is subsequenty written to, we have to allocate
  * the page on media and free the page that was in the cache.
  */
@@ -357,3 +357,47 @@ int dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
 	return result;
 }
 EXPORT_SYMBOL_GPL(dax_mkwrite);
+
+/**
+ * dax_truncate_page - handle a partial page being truncated in a DAX file
+ * @inode: The file being truncated
+ * @from: The file offset that is being truncated to
+ * @get_block: The filesystem method used to translate file offsets to blocks
+ *
+ * Similar to block_truncate_page(), this function can be called by a
+ * filesystem when it is truncating an DAX file to handle the partial page.
+ *
+ * We work in terms of PAGE_CACHE_SIZE here for commonality with
+ * block_truncate_page(), but we could go down to PAGE_SIZE if the filesystem
+ * took care of disposing of the unnecessary blocks.  Even if the filesystem
+ * block size is smaller than PAGE_SIZE, we have to zero the rest of the page
+ * since the file might be mmaped.
+ */
+int dax_truncate_page(struct inode *inode, loff_t from, get_block_t get_block)
+{
+	struct buffer_head bh;
+	pgoff_t index = from >> PAGE_CACHE_SHIFT;
+	unsigned offset = from & (PAGE_CACHE_SIZE-1);
+	unsigned length = PAGE_CACHE_ALIGN(from) - from;
+	int err;
+
+	/* Block boundary? Nothing to do */
+	if (!length)
+		return 0;
+
+	memset(&bh, 0, sizeof(bh));
+	bh.b_size = PAGE_CACHE_SIZE;
+	err = get_block(inode, index, &bh, 0);
+	if (err < 0)
+		return err;
+	if (buffer_written(&bh)) {
+		void *addr;
+		err = dax_get_addr(inode, &bh, &addr);
+		if (err)
+			return err;
+		memset(addr + offset, 0, length);
+	}
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(dax_truncate_page);
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index f128ebf..252481f 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -1207,7 +1207,7 @@ static int ext2_setsize(struct inode *inode, loff_t newsize)
 	inode_dio_wait(inode);
 
 	if (IS_DAX(inode))
-		error = xip_truncate_page(inode->i_mapping, newsize);
+		error = dax_truncate_page(inode, newsize, ext2_get_block);
 	else if (test_opt(inode->i_sb, NOBH))
 		error = nobh_truncate_page(inode->i_mapping,
 				newsize, ext2_get_block);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 00ad95e..02eeeb7 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2518,13 +2518,13 @@ extern int generic_file_open(struct inode * inode, struct file * filp);
 extern int nonseekable_open(struct inode * inode, struct file * filp);
 
 #ifdef CONFIG_FS_XIP
-extern int xip_truncate_page(struct address_space *mapping, loff_t from);
+int dax_truncate_page(struct inode *, loff_t from, get_block_t);
 ssize_t dax_do_io(int rw, struct kiocb *, struct inode *, const struct iovec *,
 		loff_t, unsigned segs, get_block_t, dio_iodone_t, int flags);
 int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t);
 int dax_mkwrite(struct vm_area_struct *, struct vm_fault *, get_block_t);
 #else
-static inline int xip_truncate_page(struct address_space *mapping, loff_t from)
+static inline int dax_truncate_page(struct inode *i, loff_t frm, get_block_t gb)
 {
 	return 0;
 }
diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
index 9dd45f3..6316578 100644
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -21,43 +21,3 @@
 #include <asm/tlbflush.h>
 #include <asm/io.h>
 
-/*
- * truncate a page used for execute in place
- * functionality is analog to block_truncate_page but does use get_xip_mem
- * to get the page instead of page cache
- */
-int
-xip_truncate_page(struct address_space *mapping, loff_t from)
-{
-	pgoff_t index = from >> PAGE_CACHE_SHIFT;
-	unsigned offset = from & (PAGE_CACHE_SIZE-1);
-	unsigned blocksize;
-	unsigned length;
-	void *xip_mem;
-	unsigned long xip_pfn;
-	int err;
-
-	BUG_ON(!mapping->a_ops->get_xip_mem);
-
-	blocksize = 1 << mapping->host->i_blkbits;
-	length = offset & (blocksize - 1);
-
-	/* Block boundary? Nothing to do */
-	if (!length)
-		return 0;
-
-	length = blocksize - length;
-
-	err = mapping->a_ops->get_xip_mem(mapping, index, 0,
-						&xip_mem, &xip_pfn);
-	if (unlikely(err)) {
-		if (err == -ENODATA)
-			/* Hole? No need to truncate */
-			return 0;
-		else
-			return err;
-	}
-	memset(xip_mem + offset, 0, length);
-	return 0;
-}
-EXPORT_SYMBOL_GPL(xip_truncate_page);
-- 
1.8.5.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [PATCH v6 09/22] Remove mm/filemap_xip.c
  2014-02-25 14:18 ` Matthew Wilcox
@ 2014-02-25 14:18   ` Matthew Wilcox
  -1 siblings, 0 replies; 79+ messages in thread
From: Matthew Wilcox @ 2014-02-25 14:18 UTC (permalink / raw)
  To: linux-kernel, linux-mm, linux-fsdevel, willy; +Cc: Matthew Wilcox

It is now empty as all of its contents have been replaced by fs/xip.c

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
---
 mm/Makefile      |  1 -
 mm/filemap_xip.c | 23 -----------------------
 2 files changed, 24 deletions(-)
 delete mode 100644 mm/filemap_xip.c

diff --git a/mm/Makefile b/mm/Makefile
index 310c90a..454c176 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -47,7 +47,6 @@ obj-$(CONFIG_SLUB) += slub.o
 obj-$(CONFIG_KMEMCHECK) += kmemcheck.o
 obj-$(CONFIG_FAILSLAB) += failslab.o
 obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o
-obj-$(CONFIG_FS_XIP) += filemap_xip.o
 obj-$(CONFIG_MIGRATION) += migrate.o
 obj-$(CONFIG_QUICKLIST) += quicklist.o
 obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o
diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
deleted file mode 100644
index 6316578..0000000
--- a/mm/filemap_xip.c
+++ /dev/null
@@ -1,23 +0,0 @@
-/*
- *	linux/mm/filemap_xip.c
- *
- * Copyright (C) 2005 IBM Corporation
- * Author: Carsten Otte <cotte@de.ibm.com>
- *
- * derived from linux/mm/filemap.c - Copyright (C) Linus Torvalds
- *
- */
-
-#include <linux/fs.h>
-#include <linux/pagemap.h>
-#include <linux/export.h>
-#include <linux/uio.h>
-#include <linux/rmap.h>
-#include <linux/mmu_notifier.h>
-#include <linux/sched.h>
-#include <linux/seqlock.h>
-#include <linux/mutex.h>
-#include <linux/gfp.h>
-#include <asm/tlbflush.h>
-#include <asm/io.h>
-
-- 
1.8.5.3


^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [PATCH v6 09/22] Remove mm/filemap_xip.c
@ 2014-02-25 14:18   ` Matthew Wilcox
  0 siblings, 0 replies; 79+ messages in thread
From: Matthew Wilcox @ 2014-02-25 14:18 UTC (permalink / raw)
  To: linux-kernel, linux-mm, linux-fsdevel, willy; +Cc: Matthew Wilcox

It is now empty as all of its contents have been replaced by fs/xip.c

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
---
 mm/Makefile      |  1 -
 mm/filemap_xip.c | 23 -----------------------
 2 files changed, 24 deletions(-)
 delete mode 100644 mm/filemap_xip.c

diff --git a/mm/Makefile b/mm/Makefile
index 310c90a..454c176 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -47,7 +47,6 @@ obj-$(CONFIG_SLUB) += slub.o
 obj-$(CONFIG_KMEMCHECK) += kmemcheck.o
 obj-$(CONFIG_FAILSLAB) += failslab.o
 obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o
-obj-$(CONFIG_FS_XIP) += filemap_xip.o
 obj-$(CONFIG_MIGRATION) += migrate.o
 obj-$(CONFIG_QUICKLIST) += quicklist.o
 obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o
diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
deleted file mode 100644
index 6316578..0000000
--- a/mm/filemap_xip.c
+++ /dev/null
@@ -1,23 +0,0 @@
-/*
- *	linux/mm/filemap_xip.c
- *
- * Copyright (C) 2005 IBM Corporation
- * Author: Carsten Otte <cotte@de.ibm.com>
- *
- * derived from linux/mm/filemap.c - Copyright (C) Linus Torvalds
- *
- */
-
-#include <linux/fs.h>
-#include <linux/pagemap.h>
-#include <linux/export.h>
-#include <linux/uio.h>
-#include <linux/rmap.h>
-#include <linux/mmu_notifier.h>
-#include <linux/sched.h>
-#include <linux/seqlock.h>
-#include <linux/mutex.h>
-#include <linux/gfp.h>
-#include <asm/tlbflush.h>
-#include <asm/io.h>
-
-- 
1.8.5.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [PATCH v6 10/22] Remove get_xip_mem
  2014-02-25 14:18 ` Matthew Wilcox
@ 2014-02-25 14:18   ` Matthew Wilcox
  -1 siblings, 0 replies; 79+ messages in thread
From: Matthew Wilcox @ 2014-02-25 14:18 UTC (permalink / raw)
  To: linux-kernel, linux-mm, linux-fsdevel, willy; +Cc: Matthew Wilcox

All callers of get_xip_mem() are now gone.  Remove checks for it,
initialisers of it, documentation of it and the only implementation of it.

Add documentation for writing a filesystem that supports DAX.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
---
 Documentation/filesystems/Locking |  3 --
 Documentation/filesystems/dax.txt | 82 +++++++++++++++++++++++++++++++++++++++
 Documentation/filesystems/xip.txt | 71 ---------------------------------
 fs/exofs/inode.c                  |  1 -
 fs/ext2/inode.c                   |  1 -
 fs/ext2/xip.c                     | 37 ------------------
 fs/ext2/xip.h                     |  3 --
 fs/open.c                         |  5 +--
 include/linux/fs.h                |  2 -
 mm/fadvise.c                      |  6 ++-
 mm/madvise.c                      |  2 +-
 11 files changed, 88 insertions(+), 125 deletions(-)
 create mode 100644 Documentation/filesystems/dax.txt
 delete mode 100644 Documentation/filesystems/xip.txt

diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
index 5b0c083..2780d47 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -194,8 +194,6 @@ prototypes:
 	void (*freepage)(struct page *);
 	int (*direct_IO)(int, struct kiocb *, const struct iovec *iov,
 			loff_t offset, unsigned long nr_segs);
-	int (*get_xip_mem)(struct address_space *, pgoff_t, int, void **,
-				unsigned long *);
 	int (*migratepage)(struct address_space *, struct page *, struct page *);
 	int (*launder_page)(struct page *);
 	int (*is_partially_uptodate)(struct page *, read_descriptor_t *, unsigned long);
@@ -220,7 +218,6 @@ invalidatepage:		yes
 releasepage:		yes
 freepage:		yes
 direct_IO:
-get_xip_mem:					maybe
 migratepage:		yes (both)
 launder_page:		yes
 is_partially_uptodate:	yes
diff --git a/Documentation/filesystems/dax.txt b/Documentation/filesystems/dax.txt
new file mode 100644
index 0000000..06f84e5
--- /dev/null
+++ b/Documentation/filesystems/dax.txt
@@ -0,0 +1,82 @@
+Execute-in-place for file mappings
+----------------------------------
+
+Motivation
+----------
+
+File mappings are usually performed by mapping page cache pages to
+userspace.  In addition, read & write file operations also transfer data
+between the page cache and storage.
+
+For memory backed storage devices that use the block device interface,
+the page cache pages are just copies of the original storage.  The
+execute-in-place code removes the extra copy by performing reads and
+writes directly on the memory backed storage device.  For file mappings,
+the storage device itself is mapped directly into userspace.
+
+
+Implementation Tips for Block Driver Writers
+--------------------------------------------
+
+To support DAX in your block driver, implement the 'direct_access'
+block device operation.  It is used to translate the sector number
+(expressed in units of 512-byte sectors) to a page frame number (pfn)
+that identifies the physical page for the memory.  It also returns a
+kernel virtual address that can be used to access the memory.
+
+The direct_access method takes a 'size' parameter that indicates the
+number of bytes being requested.  The function should return the number
+of bytes that it can provide, although it must not exceed the number of
+bytes requested.  It may also return a negative errno if an error occurs.
+
+In order to support this method, the storage must be byte-accessible by
+the CPU at all times.  If your device uses paging techniques to expose
+a large amount of memory through a smaller window, then you cannot
+implement direct_access.  Equally, if your device can occasionally
+stall the CPU for an extended period, you should also not attempt to
+implement direct_access.
+
+These block devices may be used for inspiration:
+- axonram: Axon DDR2 device driver
+- brd: RAM backed block device driver
+- dcssblk: s390 dcss block device driver
+
+
+Implementation Tips for Filesystem Writers
+------------------------------------------
+
+Filesystem support consists of
+- adding support to mark inodes as being DAX by setting the S_DAX flag in
+  i_flags
+- implementing the direct_IO address space operation, and calling
+  dax_do_io() instead of blockdev_direct_IO() if S_DAX is set
+- implementing an mmap file operation for DAX files which sets the
+  VM_MIXEDMAP flag on the VMA, and setting the vm_ops to include handlers
+  for fault and page_mkwrite (which should probably call dax_fault() and
+  dax_mkwrite(), passing the appropriate get_block() callback)
+- calling dax_truncate_page() instead of block_truncate_page() for DAX files
+- ensuring that there is sufficient locking between reads, writes,
+  truncates and page faults
+
+The get_block() callback passed to the DAX functions may return
+uninitialised extents.  If it does, it must ensure that simultaneous
+calls to get_block() (for example by a page-fault racing with a read()
+or a write()) work correctly.
+
+These filesystems may be used for inspiration:
+- ext2: the second extended filesystem, see Documentation/filesystems/ext2.txt
+
+
+Shortcomings
+------------
+
+Even if the kernel or its modules are stored on a filesystem that supports
+DAX on a block device that supports DAX, they will still be copied into RAM.
+
+Calling get_user_pages() on a range of user memory that has been mmaped
+from a DAX file will fail as there are no 'struct page' to describe
+those pages.  This problem is being worked on.  That means that O_DIRECT
+reads/writes to those memory ranges from a non-DAX file will fail (note
+that O_DIRECT reads/writes _of a DAX file_ do work, it is the memory
+that is being accessed that is key here).  Other things that will not
+work include RDMA, sendfile() and splice().
diff --git a/Documentation/filesystems/xip.txt b/Documentation/filesystems/xip.txt
deleted file mode 100644
index b62eabf..0000000
--- a/Documentation/filesystems/xip.txt
+++ /dev/null
@@ -1,71 +0,0 @@
-Execute-in-place for file mappings
-----------------------------------
-
-Motivation
-----------
-File mappings are performed by mapping page cache pages to userspace. In
-addition, read&write type file operations also transfer data from/to the page
-cache.
-
-For memory backed storage devices that use the block device interface, the page
-cache pages are in fact copies of the original storage. Various approaches
-exist to work around the need for an extra copy. The ramdisk driver for example
-does read the data into the page cache, keeps a reference, and discards the
-original data behind later on.
-
-Execute-in-place solves this issue the other way around: instead of keeping
-data in the page cache, the need to have a page cache copy is eliminated
-completely. With execute-in-place, read&write type operations are performed
-directly from/to the memory backed storage device. For file mappings, the
-storage device itself is mapped directly into userspace.
-
-This implementation was initially written for shared memory segments between
-different virtual machines on s390 hardware to allow multiple machines to
-share the same binaries and libraries.
-
-Implementation
---------------
-Execute-in-place is implemented in three steps: block device operation,
-address space operation, and file operations.
-
-A block device operation named direct_access is used to translate the
-block device sector number to a page frame number (pfn) that identifies
-the physical page for the memory.  It also returns a kernel virtual
-address that can be used to access the memory.
-
-The direct_access method takes a 'size' parameter that indicates the
-number of bytes being requested.  The function should return the number
-of bytes that it can provide, although it must not exceed the number of
-bytes requested.  It may also return a negative errno if an error occurs.
-
-The block device operation is optional, these block devices support it as of
-today:
-- dcssblk: s390 dcss block device driver
-
-An address space operation named get_xip_mem is used to retrieve references
-to a page frame number and a kernel address. To obtain these values a reference
-to an address_space is provided. This function assigns values to the kmem and
-pfn parameters. The third argument indicates whether the function should allocate
-blocks if needed.
-
-This address space operation is mutually exclusive with readpage&writepage that
-do page cache read/write operations.
-The following filesystems support it as of today:
-- ext2: the second extended filesystem, see Documentation/filesystems/ext2.txt
-
-A set of file operations that do utilize get_xip_page can be found in
-mm/filemap_xip.c . The following file operation implementations are provided:
-- aio_read/aio_write
-- readv/writev
-- sendfile
-
-The generic file operations do_sync_read/do_sync_write can be used to implement
-classic synchronous IO calls.
-
-Shortcomings
-------------
-This implementation is limited to storage devices that are cpu addressable at
-all times (no highmem or such). It works well on rom/ram, but enhancements are
-needed to make it work with flash in read+write mode.
-Putting the Linux kernel and/or its modules on a xip filesystem does not mean
-they are not copied.
diff --git a/fs/exofs/inode.c b/fs/exofs/inode.c
index ee4317fa..f9a5bf6 100644
--- a/fs/exofs/inode.c
+++ b/fs/exofs/inode.c
@@ -985,7 +985,6 @@ const struct address_space_operations exofs_aops = {
 	.direct_IO	= exofs_direct_IO,
 
 	/* With these NULL has special meaning or default is not exported */
-	.get_xip_mem	= NULL,
 	.migratepage	= NULL,
 	.launder_page	= NULL,
 	.is_partially_uptodate = NULL,
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 252481f..b156fe8 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -891,7 +891,6 @@ const struct address_space_operations ext2_aops = {
 
 const struct address_space_operations ext2_aops_xip = {
 	.bmap			= ext2_bmap,
-	.get_xip_mem		= ext2_get_xip_mem,
 	.direct_IO		= ext2_direct_IO,
 };
 
diff --git a/fs/ext2/xip.c b/fs/ext2/xip.c
index fa40091..ca745ff 100644
--- a/fs/ext2/xip.c
+++ b/fs/ext2/xip.c
@@ -22,27 +22,6 @@ static inline long __inode_direct_access(struct inode *inode, sector_t block,
 	return ops->direct_access(bdev, sector, kaddr, pfn, size);
 }
 
-static inline int
-__ext2_get_block(struct inode *inode, pgoff_t pgoff, int create,
-		   sector_t *result)
-{
-	struct buffer_head tmp;
-	int rc;
-
-	memset(&tmp, 0, sizeof(struct buffer_head));
-	tmp.b_size = 1 << inode->i_blkbits;
-	rc = ext2_get_block(inode, pgoff, &tmp, create);
-	*result = tmp.b_blocknr;
-
-	/* did we get a sparse block (hole in the file)? */
-	if (!tmp.b_blocknr && !rc) {
-		BUG_ON(create);
-		rc = -ENODATA;
-	}
-
-	return rc;
-}
-
 int
 ext2_clear_xip_target(struct inode *inode, sector_t block)
 {
@@ -69,19 +48,3 @@ void ext2_xip_verify_sb(struct super_block *sb)
 			     "not supported by bdev");
 	}
 }
-
-int ext2_get_xip_mem(struct address_space *mapping, pgoff_t pgoff, int create,
-				void **kmem, unsigned long *pfn)
-{
-	long rc;
-	sector_t block;
-
-	/* first, retrieve the sector number */
-	rc = __ext2_get_block(mapping->host, pgoff, create, &block);
-	if (rc)
-		return rc;
-
-	/* retrieve address of the target data */
-	rc = __inode_direct_access(mapping->host, block, kmem, pfn, PAGE_SIZE);
-	return (rc < 0) ? rc : 0;
-}
diff --git a/fs/ext2/xip.h b/fs/ext2/xip.h
index 29be737..0fa8b7f 100644
--- a/fs/ext2/xip.h
+++ b/fs/ext2/xip.h
@@ -14,11 +14,8 @@ static inline int ext2_use_xip (struct super_block *sb)
 	struct ext2_sb_info *sbi = EXT2_SB(sb);
 	return (sbi->s_mount_opt & EXT2_MOUNT_XIP);
 }
-int ext2_get_xip_mem(struct address_space *, pgoff_t, int,
-				void **, unsigned long *);
 #else
 #define ext2_xip_verify_sb(sb)			do { } while (0)
 #define ext2_use_xip(sb)			0
 #define ext2_clear_xip_target(inode, chain)	0
-#define ext2_get_xip_mem			NULL
 #endif
diff --git a/fs/open.c b/fs/open.c
index 4b3e1ed..4b16abe 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -665,11 +665,8 @@ int open_check_o_direct(struct file *f)
 {
 	/* NB: we're sure to have correct a_ops only after f_op->open */
 	if (f->f_flags & O_DIRECT) {
-		if (!f->f_mapping->a_ops ||
-		    ((!f->f_mapping->a_ops->direct_IO) &&
-		    (!f->f_mapping->a_ops->get_xip_mem))) {
+		if (!f->f_mapping->a_ops || !f->f_mapping->a_ops->direct_IO)
 			return -EINVAL;
-		}
 	}
 	return 0;
 }
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 02eeeb7..c7945fd 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -372,8 +372,6 @@ struct address_space_operations {
 	void (*freepage)(struct page *);
 	ssize_t (*direct_IO)(int, struct kiocb *, const struct iovec *iov,
 			loff_t offset, unsigned long nr_segs);
-	int (*get_xip_mem)(struct address_space *, pgoff_t, int,
-						void **, unsigned long *);
 	/*
 	 * migrate the contents of a page to the specified target. If
 	 * migrate_mode is MIGRATE_ASYNC, it must not block.
diff --git a/mm/fadvise.c b/mm/fadvise.c
index 3bcfd81..1f1925f 100644
--- a/mm/fadvise.c
+++ b/mm/fadvise.c
@@ -28,6 +28,7 @@
 SYSCALL_DEFINE4(fadvise64_64, int, fd, loff_t, offset, loff_t, len, int, advice)
 {
 	struct fd f = fdget(fd);
+	struct inode *inode;
 	struct address_space *mapping;
 	struct backing_dev_info *bdi;
 	loff_t endbyte;			/* inclusive */
@@ -39,7 +40,8 @@ SYSCALL_DEFINE4(fadvise64_64, int, fd, loff_t, offset, loff_t, len, int, advice)
 	if (!f.file)
 		return -EBADF;
 
-	if (S_ISFIFO(file_inode(f.file)->i_mode)) {
+	inode = file_inode(f.file);
+	if (S_ISFIFO(inode->i_mode)) {
 		ret = -ESPIPE;
 		goto out;
 	}
@@ -50,7 +52,7 @@ SYSCALL_DEFINE4(fadvise64_64, int, fd, loff_t, offset, loff_t, len, int, advice)
 		goto out;
 	}
 
-	if (mapping->a_ops->get_xip_mem) {
+	if (IS_DAX(inode)) {
 		switch (advice) {
 		case POSIX_FADV_NORMAL:
 		case POSIX_FADV_RANDOM:
diff --git a/mm/madvise.c b/mm/madvise.c
index 539eeb9..b6a2f52 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -236,7 +236,7 @@ static long madvise_willneed(struct vm_area_struct *vma,
 	if (!file)
 		return -EBADF;
 
-	if (file->f_mapping->a_ops->get_xip_mem) {
+	if (IS_DAX(file_inode(file))) {
 		/* no bad return value, but ignore advice */
 		return 0;
 	}
-- 
1.8.5.3


^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [PATCH v6 10/22] Remove get_xip_mem
@ 2014-02-25 14:18   ` Matthew Wilcox
  0 siblings, 0 replies; 79+ messages in thread
From: Matthew Wilcox @ 2014-02-25 14:18 UTC (permalink / raw)
  To: linux-kernel, linux-mm, linux-fsdevel, willy; +Cc: Matthew Wilcox

All callers of get_xip_mem() are now gone.  Remove checks for it,
initialisers of it, documentation of it and the only implementation of it.

Add documentation for writing a filesystem that supports DAX.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
---
 Documentation/filesystems/Locking |  3 --
 Documentation/filesystems/dax.txt | 82 +++++++++++++++++++++++++++++++++++++++
 Documentation/filesystems/xip.txt | 71 ---------------------------------
 fs/exofs/inode.c                  |  1 -
 fs/ext2/inode.c                   |  1 -
 fs/ext2/xip.c                     | 37 ------------------
 fs/ext2/xip.h                     |  3 --
 fs/open.c                         |  5 +--
 include/linux/fs.h                |  2 -
 mm/fadvise.c                      |  6 ++-
 mm/madvise.c                      |  2 +-
 11 files changed, 88 insertions(+), 125 deletions(-)
 create mode 100644 Documentation/filesystems/dax.txt
 delete mode 100644 Documentation/filesystems/xip.txt

diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
index 5b0c083..2780d47 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -194,8 +194,6 @@ prototypes:
 	void (*freepage)(struct page *);
 	int (*direct_IO)(int, struct kiocb *, const struct iovec *iov,
 			loff_t offset, unsigned long nr_segs);
-	int (*get_xip_mem)(struct address_space *, pgoff_t, int, void **,
-				unsigned long *);
 	int (*migratepage)(struct address_space *, struct page *, struct page *);
 	int (*launder_page)(struct page *);
 	int (*is_partially_uptodate)(struct page *, read_descriptor_t *, unsigned long);
@@ -220,7 +218,6 @@ invalidatepage:		yes
 releasepage:		yes
 freepage:		yes
 direct_IO:
-get_xip_mem:					maybe
 migratepage:		yes (both)
 launder_page:		yes
 is_partially_uptodate:	yes
diff --git a/Documentation/filesystems/dax.txt b/Documentation/filesystems/dax.txt
new file mode 100644
index 0000000..06f84e5
--- /dev/null
+++ b/Documentation/filesystems/dax.txt
@@ -0,0 +1,82 @@
+Execute-in-place for file mappings
+----------------------------------
+
+Motivation
+----------
+
+File mappings are usually performed by mapping page cache pages to
+userspace.  In addition, read & write file operations also transfer data
+between the page cache and storage.
+
+For memory backed storage devices that use the block device interface,
+the page cache pages are just copies of the original storage.  The
+execute-in-place code removes the extra copy by performing reads and
+writes directly on the memory backed storage device.  For file mappings,
+the storage device itself is mapped directly into userspace.
+
+
+Implementation Tips for Block Driver Writers
+--------------------------------------------
+
+To support DAX in your block driver, implement the 'direct_access'
+block device operation.  It is used to translate the sector number
+(expressed in units of 512-byte sectors) to a page frame number (pfn)
+that identifies the physical page for the memory.  It also returns a
+kernel virtual address that can be used to access the memory.
+
+The direct_access method takes a 'size' parameter that indicates the
+number of bytes being requested.  The function should return the number
+of bytes that it can provide, although it must not exceed the number of
+bytes requested.  It may also return a negative errno if an error occurs.
+
+In order to support this method, the storage must be byte-accessible by
+the CPU at all times.  If your device uses paging techniques to expose
+a large amount of memory through a smaller window, then you cannot
+implement direct_access.  Equally, if your device can occasionally
+stall the CPU for an extended period, you should also not attempt to
+implement direct_access.
+
+These block devices may be used for inspiration:
+- axonram: Axon DDR2 device driver
+- brd: RAM backed block device driver
+- dcssblk: s390 dcss block device driver
+
+
+Implementation Tips for Filesystem Writers
+------------------------------------------
+
+Filesystem support consists of
+- adding support to mark inodes as being DAX by setting the S_DAX flag in
+  i_flags
+- implementing the direct_IO address space operation, and calling
+  dax_do_io() instead of blockdev_direct_IO() if S_DAX is set
+- implementing an mmap file operation for DAX files which sets the
+  VM_MIXEDMAP flag on the VMA, and setting the vm_ops to include handlers
+  for fault and page_mkwrite (which should probably call dax_fault() and
+  dax_mkwrite(), passing the appropriate get_block() callback)
+- calling dax_truncate_page() instead of block_truncate_page() for DAX files
+- ensuring that there is sufficient locking between reads, writes,
+  truncates and page faults
+
+The get_block() callback passed to the DAX functions may return
+uninitialised extents.  If it does, it must ensure that simultaneous
+calls to get_block() (for example by a page-fault racing with a read()
+or a write()) work correctly.
+
+These filesystems may be used for inspiration:
+- ext2: the second extended filesystem, see Documentation/filesystems/ext2.txt
+
+
+Shortcomings
+------------
+
+Even if the kernel or its modules are stored on a filesystem that supports
+DAX on a block device that supports DAX, they will still be copied into RAM.
+
+Calling get_user_pages() on a range of user memory that has been mmaped
+from a DAX file will fail as there are no 'struct page' to describe
+those pages.  This problem is being worked on.  That means that O_DIRECT
+reads/writes to those memory ranges from a non-DAX file will fail (note
+that O_DIRECT reads/writes _of a DAX file_ do work, it is the memory
+that is being accessed that is key here).  Other things that will not
+work include RDMA, sendfile() and splice().
diff --git a/Documentation/filesystems/xip.txt b/Documentation/filesystems/xip.txt
deleted file mode 100644
index b62eabf..0000000
--- a/Documentation/filesystems/xip.txt
+++ /dev/null
@@ -1,71 +0,0 @@
-Execute-in-place for file mappings
-----------------------------------
-
-Motivation
-----------
-File mappings are performed by mapping page cache pages to userspace. In
-addition, read&write type file operations also transfer data from/to the page
-cache.
-
-For memory backed storage devices that use the block device interface, the page
-cache pages are in fact copies of the original storage. Various approaches
-exist to work around the need for an extra copy. The ramdisk driver for example
-does read the data into the page cache, keeps a reference, and discards the
-original data behind later on.
-
-Execute-in-place solves this issue the other way around: instead of keeping
-data in the page cache, the need to have a page cache copy is eliminated
-completely. With execute-in-place, read&write type operations are performed
-directly from/to the memory backed storage device. For file mappings, the
-storage device itself is mapped directly into userspace.
-
-This implementation was initially written for shared memory segments between
-different virtual machines on s390 hardware to allow multiple machines to
-share the same binaries and libraries.
-
-Implementation
---------------
-Execute-in-place is implemented in three steps: block device operation,
-address space operation, and file operations.
-
-A block device operation named direct_access is used to translate the
-block device sector number to a page frame number (pfn) that identifies
-the physical page for the memory.  It also returns a kernel virtual
-address that can be used to access the memory.
-
-The direct_access method takes a 'size' parameter that indicates the
-number of bytes being requested.  The function should return the number
-of bytes that it can provide, although it must not exceed the number of
-bytes requested.  It may also return a negative errno if an error occurs.
-
-The block device operation is optional, these block devices support it as of
-today:
-- dcssblk: s390 dcss block device driver
-
-An address space operation named get_xip_mem is used to retrieve references
-to a page frame number and a kernel address. To obtain these values a reference
-to an address_space is provided. This function assigns values to the kmem and
-pfn parameters. The third argument indicates whether the function should allocate
-blocks if needed.
-
-This address space operation is mutually exclusive with readpage&writepage that
-do page cache read/write operations.
-The following filesystems support it as of today:
-- ext2: the second extended filesystem, see Documentation/filesystems/ext2.txt
-
-A set of file operations that do utilize get_xip_page can be found in
-mm/filemap_xip.c . The following file operation implementations are provided:
-- aio_read/aio_write
-- readv/writev
-- sendfile
-
-The generic file operations do_sync_read/do_sync_write can be used to implement
-classic synchronous IO calls.
-
-Shortcomings
-------------
-This implementation is limited to storage devices that are cpu addressable at
-all times (no highmem or such). It works well on rom/ram, but enhancements are
-needed to make it work with flash in read+write mode.
-Putting the Linux kernel and/or its modules on a xip filesystem does not mean
-they are not copied.
diff --git a/fs/exofs/inode.c b/fs/exofs/inode.c
index ee4317fa..f9a5bf6 100644
--- a/fs/exofs/inode.c
+++ b/fs/exofs/inode.c
@@ -985,7 +985,6 @@ const struct address_space_operations exofs_aops = {
 	.direct_IO	= exofs_direct_IO,
 
 	/* With these NULL has special meaning or default is not exported */
-	.get_xip_mem	= NULL,
 	.migratepage	= NULL,
 	.launder_page	= NULL,
 	.is_partially_uptodate = NULL,
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 252481f..b156fe8 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -891,7 +891,6 @@ const struct address_space_operations ext2_aops = {
 
 const struct address_space_operations ext2_aops_xip = {
 	.bmap			= ext2_bmap,
-	.get_xip_mem		= ext2_get_xip_mem,
 	.direct_IO		= ext2_direct_IO,
 };
 
diff --git a/fs/ext2/xip.c b/fs/ext2/xip.c
index fa40091..ca745ff 100644
--- a/fs/ext2/xip.c
+++ b/fs/ext2/xip.c
@@ -22,27 +22,6 @@ static inline long __inode_direct_access(struct inode *inode, sector_t block,
 	return ops->direct_access(bdev, sector, kaddr, pfn, size);
 }
 
-static inline int
-__ext2_get_block(struct inode *inode, pgoff_t pgoff, int create,
-		   sector_t *result)
-{
-	struct buffer_head tmp;
-	int rc;
-
-	memset(&tmp, 0, sizeof(struct buffer_head));
-	tmp.b_size = 1 << inode->i_blkbits;
-	rc = ext2_get_block(inode, pgoff, &tmp, create);
-	*result = tmp.b_blocknr;
-
-	/* did we get a sparse block (hole in the file)? */
-	if (!tmp.b_blocknr && !rc) {
-		BUG_ON(create);
-		rc = -ENODATA;
-	}
-
-	return rc;
-}
-
 int
 ext2_clear_xip_target(struct inode *inode, sector_t block)
 {
@@ -69,19 +48,3 @@ void ext2_xip_verify_sb(struct super_block *sb)
 			     "not supported by bdev");
 	}
 }
-
-int ext2_get_xip_mem(struct address_space *mapping, pgoff_t pgoff, int create,
-				void **kmem, unsigned long *pfn)
-{
-	long rc;
-	sector_t block;
-
-	/* first, retrieve the sector number */
-	rc = __ext2_get_block(mapping->host, pgoff, create, &block);
-	if (rc)
-		return rc;
-
-	/* retrieve address of the target data */
-	rc = __inode_direct_access(mapping->host, block, kmem, pfn, PAGE_SIZE);
-	return (rc < 0) ? rc : 0;
-}
diff --git a/fs/ext2/xip.h b/fs/ext2/xip.h
index 29be737..0fa8b7f 100644
--- a/fs/ext2/xip.h
+++ b/fs/ext2/xip.h
@@ -14,11 +14,8 @@ static inline int ext2_use_xip (struct super_block *sb)
 	struct ext2_sb_info *sbi = EXT2_SB(sb);
 	return (sbi->s_mount_opt & EXT2_MOUNT_XIP);
 }
-int ext2_get_xip_mem(struct address_space *, pgoff_t, int,
-				void **, unsigned long *);
 #else
 #define ext2_xip_verify_sb(sb)			do { } while (0)
 #define ext2_use_xip(sb)			0
 #define ext2_clear_xip_target(inode, chain)	0
-#define ext2_get_xip_mem			NULL
 #endif
diff --git a/fs/open.c b/fs/open.c
index 4b3e1ed..4b16abe 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -665,11 +665,8 @@ int open_check_o_direct(struct file *f)
 {
 	/* NB: we're sure to have correct a_ops only after f_op->open */
 	if (f->f_flags & O_DIRECT) {
-		if (!f->f_mapping->a_ops ||
-		    ((!f->f_mapping->a_ops->direct_IO) &&
-		    (!f->f_mapping->a_ops->get_xip_mem))) {
+		if (!f->f_mapping->a_ops || !f->f_mapping->a_ops->direct_IO)
 			return -EINVAL;
-		}
 	}
 	return 0;
 }
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 02eeeb7..c7945fd 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -372,8 +372,6 @@ struct address_space_operations {
 	void (*freepage)(struct page *);
 	ssize_t (*direct_IO)(int, struct kiocb *, const struct iovec *iov,
 			loff_t offset, unsigned long nr_segs);
-	int (*get_xip_mem)(struct address_space *, pgoff_t, int,
-						void **, unsigned long *);
 	/*
 	 * migrate the contents of a page to the specified target. If
 	 * migrate_mode is MIGRATE_ASYNC, it must not block.
diff --git a/mm/fadvise.c b/mm/fadvise.c
index 3bcfd81..1f1925f 100644
--- a/mm/fadvise.c
+++ b/mm/fadvise.c
@@ -28,6 +28,7 @@
 SYSCALL_DEFINE4(fadvise64_64, int, fd, loff_t, offset, loff_t, len, int, advice)
 {
 	struct fd f = fdget(fd);
+	struct inode *inode;
 	struct address_space *mapping;
 	struct backing_dev_info *bdi;
 	loff_t endbyte;			/* inclusive */
@@ -39,7 +40,8 @@ SYSCALL_DEFINE4(fadvise64_64, int, fd, loff_t, offset, loff_t, len, int, advice)
 	if (!f.file)
 		return -EBADF;
 
-	if (S_ISFIFO(file_inode(f.file)->i_mode)) {
+	inode = file_inode(f.file);
+	if (S_ISFIFO(inode->i_mode)) {
 		ret = -ESPIPE;
 		goto out;
 	}
@@ -50,7 +52,7 @@ SYSCALL_DEFINE4(fadvise64_64, int, fd, loff_t, offset, loff_t, len, int, advice)
 		goto out;
 	}
 
-	if (mapping->a_ops->get_xip_mem) {
+	if (IS_DAX(inode)) {
 		switch (advice) {
 		case POSIX_FADV_NORMAL:
 		case POSIX_FADV_RANDOM:
diff --git a/mm/madvise.c b/mm/madvise.c
index 539eeb9..b6a2f52 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -236,7 +236,7 @@ static long madvise_willneed(struct vm_area_struct *vma,
 	if (!file)
 		return -EBADF;
 
-	if (file->f_mapping->a_ops->get_xip_mem) {
+	if (IS_DAX(file_inode(file))) {
 		/* no bad return value, but ignore advice */
 		return 0;
 	}
-- 
1.8.5.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [PATCH v6 11/22] Replace ext2_clear_xip_target with dax_clear_blocks
  2014-02-25 14:18 ` Matthew Wilcox
@ 2014-02-25 14:18   ` Matthew Wilcox
  -1 siblings, 0 replies; 79+ messages in thread
From: Matthew Wilcox @ 2014-02-25 14:18 UTC (permalink / raw)
  To: linux-kernel, linux-mm, linux-fsdevel, willy; +Cc: Matthew Wilcox

This is practically generic code; other filesystems will want to call
it from other places, but there's nothing ext2-specific about it.

Make it a little more generic by allowing it to take a count of the number
of bytes to zero rather than fixing it to a single page.  Thanks to Dave
Hansen for suggesting that I need to call cond_resched() if zeroing more
than one page.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
---
 fs/dax.c           | 34 ++++++++++++++++++++++++++++++++++
 fs/ext2/inode.c    |  8 +++++---
 fs/ext2/xip.c      | 23 -----------------------
 fs/ext2/xip.h      |  3 ---
 include/linux/fs.h |  6 ++++++
 5 files changed, 45 insertions(+), 29 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index a6a1adc..75328bf 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -22,8 +22,42 @@
 #include <linux/highmem.h>
 #include <linux/mm.h>
 #include <linux/mutex.h>
+#include <linux/sched.h>
 #include <linux/uio.h>
 
+int dax_clear_blocks(struct inode *inode, sector_t block, long size)
+{
+	struct block_device *bdev = inode->i_sb->s_bdev;
+	const struct block_device_operations *ops = bdev->bd_disk->fops;
+	sector_t sector = block << (inode->i_blkbits - 9);
+	unsigned long pfn;
+
+	might_sleep();
+	do {
+		void *addr;
+		long count = ops->direct_access(bdev, sector, &addr, &pfn,
+									size);
+		if (count < 0)
+			return count;
+		while (count >= PAGE_SIZE) {
+			clear_page(addr);
+			addr += PAGE_SIZE;
+			size -= PAGE_SIZE;
+			count -= PAGE_SIZE;
+			sector += PAGE_SIZE / 512;
+			cond_resched();
+		}
+		if (count > 0) {
+			memset(addr, 0, count);
+			sector += count / 512;
+			size -= count;
+		}
+	} while (size);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(dax_clear_blocks);
+
 static long dax_get_addr(struct inode *inode, struct buffer_head *bh,
 								void **addr)
 {
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index b156fe8..a9346a9 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -733,10 +733,12 @@ static int ext2_get_blocks(struct inode *inode,
 
 	if (IS_DAX(inode)) {
 		/*
-		 * we need to clear the block
+		 * block must be initialised before we put it in the tree
+		 * so that it's not found by another thread before it's
+		 * initialised
 		 */
-		err = ext2_clear_xip_target (inode,
-			le32_to_cpu(chain[depth-1].key));
+		err = dax_clear_blocks(inode, le32_to_cpu(chain[depth-1].key),
+						count << inode->i_blkbits);
 		if (err) {
 			mutex_unlock(&ei->truncate_mutex);
 			goto cleanup;
diff --git a/fs/ext2/xip.c b/fs/ext2/xip.c
index ca745ff..132d4da 100644
--- a/fs/ext2/xip.c
+++ b/fs/ext2/xip.c
@@ -13,29 +13,6 @@
 #include "ext2.h"
 #include "xip.h"
 
-static inline long __inode_direct_access(struct inode *inode, sector_t block,
-				void **kaddr, unsigned long *pfn, long size)
-{
-	struct block_device *bdev = inode->i_sb->s_bdev;
-	const struct block_device_operations *ops = bdev->bd_disk->fops;
-	sector_t sector = block * (PAGE_SIZE / 512);
-	return ops->direct_access(bdev, sector, kaddr, pfn, size);
-}
-
-int
-ext2_clear_xip_target(struct inode *inode, sector_t block)
-{
-	void *kaddr;
-	unsigned long pfn;
-	long size;
-
-	size = __inode_direct_access(inode, block, &kaddr, &pfn, PAGE_SIZE);
-	if (size < 0)
-		return size;
-	clear_page(kaddr);
-	return 0;
-}
-
 void ext2_xip_verify_sb(struct super_block *sb)
 {
 	struct ext2_sb_info *sbi = EXT2_SB(sb);
diff --git a/fs/ext2/xip.h b/fs/ext2/xip.h
index 0fa8b7f..e7b9f0a 100644
--- a/fs/ext2/xip.h
+++ b/fs/ext2/xip.h
@@ -7,8 +7,6 @@
 
 #ifdef CONFIG_EXT2_FS_XIP
 extern void ext2_xip_verify_sb (struct super_block *);
-extern int ext2_clear_xip_target (struct inode *, sector_t);
-
 static inline int ext2_use_xip (struct super_block *sb)
 {
 	struct ext2_sb_info *sbi = EXT2_SB(sb);
@@ -17,5 +15,4 @@ static inline int ext2_use_xip (struct super_block *sb)
 #else
 #define ext2_xip_verify_sb(sb)			do { } while (0)
 #define ext2_use_xip(sb)			0
-#define ext2_clear_xip_target(inode, chain)	0
 #endif
diff --git a/include/linux/fs.h b/include/linux/fs.h
index c7945fd..6f7f445 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2516,12 +2516,18 @@ extern int generic_file_open(struct inode * inode, struct file * filp);
 extern int nonseekable_open(struct inode * inode, struct file * filp);
 
 #ifdef CONFIG_FS_XIP
+int dax_clear_blocks(struct inode *, sector_t block, long size);
 int dax_truncate_page(struct inode *, loff_t from, get_block_t);
 ssize_t dax_do_io(int rw, struct kiocb *, struct inode *, const struct iovec *,
 		loff_t, unsigned segs, get_block_t, dio_iodone_t, int flags);
 int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t);
 int dax_mkwrite(struct vm_area_struct *, struct vm_fault *, get_block_t);
 #else
+static inline int dax_clear_blocks(struct inode *i, sector_t blk, long sz)
+{
+	return 0;
+}
+
 static inline int dax_truncate_page(struct inode *i, loff_t frm, get_block_t gb)
 {
 	return 0;
-- 
1.8.5.3


^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [PATCH v6 11/22] Replace ext2_clear_xip_target with dax_clear_blocks
@ 2014-02-25 14:18   ` Matthew Wilcox
  0 siblings, 0 replies; 79+ messages in thread
From: Matthew Wilcox @ 2014-02-25 14:18 UTC (permalink / raw)
  To: linux-kernel, linux-mm, linux-fsdevel, willy; +Cc: Matthew Wilcox

This is practically generic code; other filesystems will want to call
it from other places, but there's nothing ext2-specific about it.

Make it a little more generic by allowing it to take a count of the number
of bytes to zero rather than fixing it to a single page.  Thanks to Dave
Hansen for suggesting that I need to call cond_resched() if zeroing more
than one page.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
---
 fs/dax.c           | 34 ++++++++++++++++++++++++++++++++++
 fs/ext2/inode.c    |  8 +++++---
 fs/ext2/xip.c      | 23 -----------------------
 fs/ext2/xip.h      |  3 ---
 include/linux/fs.h |  6 ++++++
 5 files changed, 45 insertions(+), 29 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index a6a1adc..75328bf 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -22,8 +22,42 @@
 #include <linux/highmem.h>
 #include <linux/mm.h>
 #include <linux/mutex.h>
+#include <linux/sched.h>
 #include <linux/uio.h>
 
+int dax_clear_blocks(struct inode *inode, sector_t block, long size)
+{
+	struct block_device *bdev = inode->i_sb->s_bdev;
+	const struct block_device_operations *ops = bdev->bd_disk->fops;
+	sector_t sector = block << (inode->i_blkbits - 9);
+	unsigned long pfn;
+
+	might_sleep();
+	do {
+		void *addr;
+		long count = ops->direct_access(bdev, sector, &addr, &pfn,
+									size);
+		if (count < 0)
+			return count;
+		while (count >= PAGE_SIZE) {
+			clear_page(addr);
+			addr += PAGE_SIZE;
+			size -= PAGE_SIZE;
+			count -= PAGE_SIZE;
+			sector += PAGE_SIZE / 512;
+			cond_resched();
+		}
+		if (count > 0) {
+			memset(addr, 0, count);
+			sector += count / 512;
+			size -= count;
+		}
+	} while (size);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(dax_clear_blocks);
+
 static long dax_get_addr(struct inode *inode, struct buffer_head *bh,
 								void **addr)
 {
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index b156fe8..a9346a9 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -733,10 +733,12 @@ static int ext2_get_blocks(struct inode *inode,
 
 	if (IS_DAX(inode)) {
 		/*
-		 * we need to clear the block
+		 * block must be initialised before we put it in the tree
+		 * so that it's not found by another thread before it's
+		 * initialised
 		 */
-		err = ext2_clear_xip_target (inode,
-			le32_to_cpu(chain[depth-1].key));
+		err = dax_clear_blocks(inode, le32_to_cpu(chain[depth-1].key),
+						count << inode->i_blkbits);
 		if (err) {
 			mutex_unlock(&ei->truncate_mutex);
 			goto cleanup;
diff --git a/fs/ext2/xip.c b/fs/ext2/xip.c
index ca745ff..132d4da 100644
--- a/fs/ext2/xip.c
+++ b/fs/ext2/xip.c
@@ -13,29 +13,6 @@
 #include "ext2.h"
 #include "xip.h"
 
-static inline long __inode_direct_access(struct inode *inode, sector_t block,
-				void **kaddr, unsigned long *pfn, long size)
-{
-	struct block_device *bdev = inode->i_sb->s_bdev;
-	const struct block_device_operations *ops = bdev->bd_disk->fops;
-	sector_t sector = block * (PAGE_SIZE / 512);
-	return ops->direct_access(bdev, sector, kaddr, pfn, size);
-}
-
-int
-ext2_clear_xip_target(struct inode *inode, sector_t block)
-{
-	void *kaddr;
-	unsigned long pfn;
-	long size;
-
-	size = __inode_direct_access(inode, block, &kaddr, &pfn, PAGE_SIZE);
-	if (size < 0)
-		return size;
-	clear_page(kaddr);
-	return 0;
-}
-
 void ext2_xip_verify_sb(struct super_block *sb)
 {
 	struct ext2_sb_info *sbi = EXT2_SB(sb);
diff --git a/fs/ext2/xip.h b/fs/ext2/xip.h
index 0fa8b7f..e7b9f0a 100644
--- a/fs/ext2/xip.h
+++ b/fs/ext2/xip.h
@@ -7,8 +7,6 @@
 
 #ifdef CONFIG_EXT2_FS_XIP
 extern void ext2_xip_verify_sb (struct super_block *);
-extern int ext2_clear_xip_target (struct inode *, sector_t);
-
 static inline int ext2_use_xip (struct super_block *sb)
 {
 	struct ext2_sb_info *sbi = EXT2_SB(sb);
@@ -17,5 +15,4 @@ static inline int ext2_use_xip (struct super_block *sb)
 #else
 #define ext2_xip_verify_sb(sb)			do { } while (0)
 #define ext2_use_xip(sb)			0
-#define ext2_clear_xip_target(inode, chain)	0
 #endif
diff --git a/include/linux/fs.h b/include/linux/fs.h
index c7945fd..6f7f445 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2516,12 +2516,18 @@ extern int generic_file_open(struct inode * inode, struct file * filp);
 extern int nonseekable_open(struct inode * inode, struct file * filp);
 
 #ifdef CONFIG_FS_XIP
+int dax_clear_blocks(struct inode *, sector_t block, long size);
 int dax_truncate_page(struct inode *, loff_t from, get_block_t);
 ssize_t dax_do_io(int rw, struct kiocb *, struct inode *, const struct iovec *,
 		loff_t, unsigned segs, get_block_t, dio_iodone_t, int flags);
 int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t);
 int dax_mkwrite(struct vm_area_struct *, struct vm_fault *, get_block_t);
 #else
+static inline int dax_clear_blocks(struct inode *i, sector_t blk, long sz)
+{
+	return 0;
+}
+
 static inline int dax_truncate_page(struct inode *i, loff_t frm, get_block_t gb)
 {
 	return 0;
-- 
1.8.5.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [PATCH v6 12/22] ext2: Remove ext2_xip_verify_sb()
  2014-02-25 14:18 ` Matthew Wilcox
@ 2014-02-25 14:18   ` Matthew Wilcox
  -1 siblings, 0 replies; 79+ messages in thread
From: Matthew Wilcox @ 2014-02-25 14:18 UTC (permalink / raw)
  To: linux-kernel, linux-mm, linux-fsdevel, willy; +Cc: Matthew Wilcox

Jan Kara pointed out that calling ext2_xip_verify_sb() in ext2_remount()
doesn't make sense, since changing the XIP option on remount isn't
allowed.  It also doesn't make sense to re-check whether blocksize is
supported since it can't change between mounts.

Replace the call to ext2_xip_verify_sb() in ext2_fill_super() with the
equivalent check and delete the definition.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
---
 fs/ext2/super.c | 33 ++++++++++++---------------------
 fs/ext2/xip.c   | 12 ------------
 fs/ext2/xip.h   |  2 --
 3 files changed, 12 insertions(+), 35 deletions(-)

diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index 20d6697..3a1db39 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -868,9 +868,6 @@ static int ext2_fill_super(struct super_block *sb, void *data, int silent)
 		((EXT2_SB(sb)->s_mount_opt & EXT2_MOUNT_POSIX_ACL) ?
 		 MS_POSIXACL : 0);
 
-	ext2_xip_verify_sb(sb); /* see if bdev supports xip, unset
-				    EXT2_MOUNT_XIP if not */
-
 	if (le32_to_cpu(es->s_rev_level) == EXT2_GOOD_OLD_REV &&
 	    (EXT2_HAS_COMPAT_FEATURE(sb, ~0U) ||
 	     EXT2_HAS_RO_COMPAT_FEATURE(sb, ~0U) ||
@@ -900,11 +897,17 @@ static int ext2_fill_super(struct super_block *sb, void *data, int silent)
 
 	blocksize = BLOCK_SIZE << le32_to_cpu(sbi->s_es->s_log_block_size);
 
-	if (ext2_use_xip(sb) && blocksize != PAGE_SIZE) {
-		if (!silent)
+	if (sbi->s_mount_opt & EXT2_MOUNT_XIP) {
+		if (blocksize != PAGE_SIZE) {
 			ext2_msg(sb, KERN_ERR,
-				"error: unsupported blocksize for xip");
-		goto failed_mount;
+					"error: unsupported blocksize for xip");
+			goto failed_mount;
+		}
+		if (!sb->s_bdev->bd_disk->fops->direct_access) {
+			ext2_msg(sb, KERN_ERR,
+					"error: device does not support xip");
+			goto failed_mount;
+		}
 	}
 
 	/* If the blocksize doesn't match, re-read the thing.. */
@@ -1249,7 +1252,6 @@ static int ext2_remount (struct super_block * sb, int * flags, char * data)
 {
 	struct ext2_sb_info * sbi = EXT2_SB(sb);
 	struct ext2_super_block * es;
-	unsigned long old_mount_opt = sbi->s_mount_opt;
 	struct ext2_mount_options old_opts;
 	unsigned long old_sb_flags;
 	int err;
@@ -1273,22 +1275,11 @@ static int ext2_remount (struct super_block * sb, int * flags, char * data)
 	sb->s_flags = (sb->s_flags & ~MS_POSIXACL) |
 		((sbi->s_mount_opt & EXT2_MOUNT_POSIX_ACL) ? MS_POSIXACL : 0);
 
-	ext2_xip_verify_sb(sb); /* see if bdev supports xip, unset
-				    EXT2_MOUNT_XIP if not */
-
-	if ((ext2_use_xip(sb)) && (sb->s_blocksize != PAGE_SIZE)) {
-		ext2_msg(sb, KERN_WARNING,
-			"warning: unsupported blocksize for xip");
-		err = -EINVAL;
-		goto restore_opts;
-	}
-
 	es = sbi->s_es;
-	if ((sbi->s_mount_opt ^ old_mount_opt) & EXT2_MOUNT_XIP) {
+	if ((sbi->s_mount_opt ^ old_opts.s_mount_opt) & EXT2_MOUNT_XIP) {
 		ext2_msg(sb, KERN_WARNING, "warning: refusing change of "
 			 "xip flag with busy inodes while remounting");
-		sbi->s_mount_opt &= ~EXT2_MOUNT_XIP;
-		sbi->s_mount_opt |= old_mount_opt & EXT2_MOUNT_XIP;
+		sbi->s_mount_opt ^= EXT2_MOUNT_XIP;
 	}
 	if ((*flags & MS_RDONLY) == (sb->s_flags & MS_RDONLY)) {
 		spin_unlock(&sbi->s_lock);
diff --git a/fs/ext2/xip.c b/fs/ext2/xip.c
index 132d4da..66ca113 100644
--- a/fs/ext2/xip.c
+++ b/fs/ext2/xip.c
@@ -13,15 +13,3 @@
 #include "ext2.h"
 #include "xip.h"
 
-void ext2_xip_verify_sb(struct super_block *sb)
-{
-	struct ext2_sb_info *sbi = EXT2_SB(sb);
-
-	if ((sbi->s_mount_opt & EXT2_MOUNT_XIP) &&
-	    !sb->s_bdev->bd_disk->fops->direct_access) {
-		sbi->s_mount_opt &= (~EXT2_MOUNT_XIP);
-		ext2_msg(sb, KERN_WARNING,
-			     "warning: ignoring xip option - "
-			     "not supported by bdev");
-	}
-}
diff --git a/fs/ext2/xip.h b/fs/ext2/xip.h
index e7b9f0a..87eeb04 100644
--- a/fs/ext2/xip.h
+++ b/fs/ext2/xip.h
@@ -6,13 +6,11 @@
  */
 
 #ifdef CONFIG_EXT2_FS_XIP
-extern void ext2_xip_verify_sb (struct super_block *);
 static inline int ext2_use_xip (struct super_block *sb)
 {
 	struct ext2_sb_info *sbi = EXT2_SB(sb);
 	return (sbi->s_mount_opt & EXT2_MOUNT_XIP);
 }
 #else
-#define ext2_xip_verify_sb(sb)			do { } while (0)
 #define ext2_use_xip(sb)			0
 #endif
-- 
1.8.5.3


^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [PATCH v6 12/22] ext2: Remove ext2_xip_verify_sb()
@ 2014-02-25 14:18   ` Matthew Wilcox
  0 siblings, 0 replies; 79+ messages in thread
From: Matthew Wilcox @ 2014-02-25 14:18 UTC (permalink / raw)
  To: linux-kernel, linux-mm, linux-fsdevel, willy; +Cc: Matthew Wilcox

Jan Kara pointed out that calling ext2_xip_verify_sb() in ext2_remount()
doesn't make sense, since changing the XIP option on remount isn't
allowed.  It also doesn't make sense to re-check whether blocksize is
supported since it can't change between mounts.

Replace the call to ext2_xip_verify_sb() in ext2_fill_super() with the
equivalent check and delete the definition.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
---
 fs/ext2/super.c | 33 ++++++++++++---------------------
 fs/ext2/xip.c   | 12 ------------
 fs/ext2/xip.h   |  2 --
 3 files changed, 12 insertions(+), 35 deletions(-)

diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index 20d6697..3a1db39 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -868,9 +868,6 @@ static int ext2_fill_super(struct super_block *sb, void *data, int silent)
 		((EXT2_SB(sb)->s_mount_opt & EXT2_MOUNT_POSIX_ACL) ?
 		 MS_POSIXACL : 0);
 
-	ext2_xip_verify_sb(sb); /* see if bdev supports xip, unset
-				    EXT2_MOUNT_XIP if not */
-
 	if (le32_to_cpu(es->s_rev_level) == EXT2_GOOD_OLD_REV &&
 	    (EXT2_HAS_COMPAT_FEATURE(sb, ~0U) ||
 	     EXT2_HAS_RO_COMPAT_FEATURE(sb, ~0U) ||
@@ -900,11 +897,17 @@ static int ext2_fill_super(struct super_block *sb, void *data, int silent)
 
 	blocksize = BLOCK_SIZE << le32_to_cpu(sbi->s_es->s_log_block_size);
 
-	if (ext2_use_xip(sb) && blocksize != PAGE_SIZE) {
-		if (!silent)
+	if (sbi->s_mount_opt & EXT2_MOUNT_XIP) {
+		if (blocksize != PAGE_SIZE) {
 			ext2_msg(sb, KERN_ERR,
-				"error: unsupported blocksize for xip");
-		goto failed_mount;
+					"error: unsupported blocksize for xip");
+			goto failed_mount;
+		}
+		if (!sb->s_bdev->bd_disk->fops->direct_access) {
+			ext2_msg(sb, KERN_ERR,
+					"error: device does not support xip");
+			goto failed_mount;
+		}
 	}
 
 	/* If the blocksize doesn't match, re-read the thing.. */
@@ -1249,7 +1252,6 @@ static int ext2_remount (struct super_block * sb, int * flags, char * data)
 {
 	struct ext2_sb_info * sbi = EXT2_SB(sb);
 	struct ext2_super_block * es;
-	unsigned long old_mount_opt = sbi->s_mount_opt;
 	struct ext2_mount_options old_opts;
 	unsigned long old_sb_flags;
 	int err;
@@ -1273,22 +1275,11 @@ static int ext2_remount (struct super_block * sb, int * flags, char * data)
 	sb->s_flags = (sb->s_flags & ~MS_POSIXACL) |
 		((sbi->s_mount_opt & EXT2_MOUNT_POSIX_ACL) ? MS_POSIXACL : 0);
 
-	ext2_xip_verify_sb(sb); /* see if bdev supports xip, unset
-				    EXT2_MOUNT_XIP if not */
-
-	if ((ext2_use_xip(sb)) && (sb->s_blocksize != PAGE_SIZE)) {
-		ext2_msg(sb, KERN_WARNING,
-			"warning: unsupported blocksize for xip");
-		err = -EINVAL;
-		goto restore_opts;
-	}
-
 	es = sbi->s_es;
-	if ((sbi->s_mount_opt ^ old_mount_opt) & EXT2_MOUNT_XIP) {
+	if ((sbi->s_mount_opt ^ old_opts.s_mount_opt) & EXT2_MOUNT_XIP) {
 		ext2_msg(sb, KERN_WARNING, "warning: refusing change of "
 			 "xip flag with busy inodes while remounting");
-		sbi->s_mount_opt &= ~EXT2_MOUNT_XIP;
-		sbi->s_mount_opt |= old_mount_opt & EXT2_MOUNT_XIP;
+		sbi->s_mount_opt ^= EXT2_MOUNT_XIP;
 	}
 	if ((*flags & MS_RDONLY) == (sb->s_flags & MS_RDONLY)) {
 		spin_unlock(&sbi->s_lock);
diff --git a/fs/ext2/xip.c b/fs/ext2/xip.c
index 132d4da..66ca113 100644
--- a/fs/ext2/xip.c
+++ b/fs/ext2/xip.c
@@ -13,15 +13,3 @@
 #include "ext2.h"
 #include "xip.h"
 
-void ext2_xip_verify_sb(struct super_block *sb)
-{
-	struct ext2_sb_info *sbi = EXT2_SB(sb);
-
-	if ((sbi->s_mount_opt & EXT2_MOUNT_XIP) &&
-	    !sb->s_bdev->bd_disk->fops->direct_access) {
-		sbi->s_mount_opt &= (~EXT2_MOUNT_XIP);
-		ext2_msg(sb, KERN_WARNING,
-			     "warning: ignoring xip option - "
-			     "not supported by bdev");
-	}
-}
diff --git a/fs/ext2/xip.h b/fs/ext2/xip.h
index e7b9f0a..87eeb04 100644
--- a/fs/ext2/xip.h
+++ b/fs/ext2/xip.h
@@ -6,13 +6,11 @@
  */
 
 #ifdef CONFIG_EXT2_FS_XIP
-extern void ext2_xip_verify_sb (struct super_block *);
 static inline int ext2_use_xip (struct super_block *sb)
 {
 	struct ext2_sb_info *sbi = EXT2_SB(sb);
 	return (sbi->s_mount_opt & EXT2_MOUNT_XIP);
 }
 #else
-#define ext2_xip_verify_sb(sb)			do { } while (0)
 #define ext2_use_xip(sb)			0
 #endif
-- 
1.8.5.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [PATCH v6 13/22] ext2: Remove ext2_use_xip
  2014-02-25 14:18 ` Matthew Wilcox
@ 2014-02-25 14:18   ` Matthew Wilcox
  -1 siblings, 0 replies; 79+ messages in thread
From: Matthew Wilcox @ 2014-02-25 14:18 UTC (permalink / raw)
  To: linux-kernel, linux-mm, linux-fsdevel, willy; +Cc: Matthew Wilcox

Replace ext2_use_xip() with test_opt(XIP) which expands to the same code

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
---
 fs/ext2/ext2.h  | 4 ++++
 fs/ext2/inode.c | 2 +-
 fs/ext2/namei.c | 4 ++--
 3 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h
index d9a17d0..5ecf570 100644
--- a/fs/ext2/ext2.h
+++ b/fs/ext2/ext2.h
@@ -380,7 +380,11 @@ struct ext2_inode {
 #define EXT2_MOUNT_NO_UID32		0x000200  /* Disable 32-bit UIDs */
 #define EXT2_MOUNT_XATTR_USER		0x004000  /* Extended user attributes */
 #define EXT2_MOUNT_POSIX_ACL		0x008000  /* POSIX Access Control Lists */
+#ifdef CONFIG_FS_XIP
 #define EXT2_MOUNT_XIP			0x010000  /* Execute in place */
+#else
+#define EXT2_MOUNT_XIP			0
+#endif
 #define EXT2_MOUNT_USRQUOTA		0x020000  /* user quota */
 #define EXT2_MOUNT_GRPQUOTA		0x040000  /* group quota */
 #define EXT2_MOUNT_RESERVATION		0x080000  /* Preallocation */
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index a9346a9..2e587e2 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -1393,7 +1393,7 @@ struct inode *ext2_iget (struct super_block *sb, unsigned long ino)
 
 	if (S_ISREG(inode->i_mode)) {
 		inode->i_op = &ext2_file_inode_operations;
-		if (ext2_use_xip(inode->i_sb)) {
+		if (test_opt(inode->i_sb, XIP)) {
 			inode->i_mapping->a_ops = &ext2_aops_xip;
 			inode->i_fop = &ext2_xip_file_operations;
 		} else if (test_opt(inode->i_sb, NOBH)) {
diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c
index c268d0a..846c356 100644
--- a/fs/ext2/namei.c
+++ b/fs/ext2/namei.c
@@ -105,7 +105,7 @@ static int ext2_create (struct inode * dir, struct dentry * dentry, umode_t mode
 		return PTR_ERR(inode);
 
 	inode->i_op = &ext2_file_inode_operations;
-	if (ext2_use_xip(inode->i_sb)) {
+	if (test_opt(inode->i_sb, XIP)) {
 		inode->i_mapping->a_ops = &ext2_aops_xip;
 		inode->i_fop = &ext2_xip_file_operations;
 	} else if (test_opt(inode->i_sb, NOBH)) {
@@ -126,7 +126,7 @@ static int ext2_tmpfile(struct inode *dir, struct dentry *dentry, umode_t mode)
 		return PTR_ERR(inode);
 
 	inode->i_op = &ext2_file_inode_operations;
-	if (ext2_use_xip(inode->i_sb)) {
+	if (test_opt(inode->i_sb, XIP)) {
 		inode->i_mapping->a_ops = &ext2_aops_xip;
 		inode->i_fop = &ext2_xip_file_operations;
 	} else if (test_opt(inode->i_sb, NOBH)) {
-- 
1.8.5.3


^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [PATCH v6 13/22] ext2: Remove ext2_use_xip
@ 2014-02-25 14:18   ` Matthew Wilcox
  0 siblings, 0 replies; 79+ messages in thread
From: Matthew Wilcox @ 2014-02-25 14:18 UTC (permalink / raw)
  To: linux-kernel, linux-mm, linux-fsdevel, willy; +Cc: Matthew Wilcox

Replace ext2_use_xip() with test_opt(XIP) which expands to the same code

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
---
 fs/ext2/ext2.h  | 4 ++++
 fs/ext2/inode.c | 2 +-
 fs/ext2/namei.c | 4 ++--
 3 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h
index d9a17d0..5ecf570 100644
--- a/fs/ext2/ext2.h
+++ b/fs/ext2/ext2.h
@@ -380,7 +380,11 @@ struct ext2_inode {
 #define EXT2_MOUNT_NO_UID32		0x000200  /* Disable 32-bit UIDs */
 #define EXT2_MOUNT_XATTR_USER		0x004000  /* Extended user attributes */
 #define EXT2_MOUNT_POSIX_ACL		0x008000  /* POSIX Access Control Lists */
+#ifdef CONFIG_FS_XIP
 #define EXT2_MOUNT_XIP			0x010000  /* Execute in place */
+#else
+#define EXT2_MOUNT_XIP			0
+#endif
 #define EXT2_MOUNT_USRQUOTA		0x020000  /* user quota */
 #define EXT2_MOUNT_GRPQUOTA		0x040000  /* group quota */
 #define EXT2_MOUNT_RESERVATION		0x080000  /* Preallocation */
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index a9346a9..2e587e2 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -1393,7 +1393,7 @@ struct inode *ext2_iget (struct super_block *sb, unsigned long ino)
 
 	if (S_ISREG(inode->i_mode)) {
 		inode->i_op = &ext2_file_inode_operations;
-		if (ext2_use_xip(inode->i_sb)) {
+		if (test_opt(inode->i_sb, XIP)) {
 			inode->i_mapping->a_ops = &ext2_aops_xip;
 			inode->i_fop = &ext2_xip_file_operations;
 		} else if (test_opt(inode->i_sb, NOBH)) {
diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c
index c268d0a..846c356 100644
--- a/fs/ext2/namei.c
+++ b/fs/ext2/namei.c
@@ -105,7 +105,7 @@ static int ext2_create (struct inode * dir, struct dentry * dentry, umode_t mode
 		return PTR_ERR(inode);
 
 	inode->i_op = &ext2_file_inode_operations;
-	if (ext2_use_xip(inode->i_sb)) {
+	if (test_opt(inode->i_sb, XIP)) {
 		inode->i_mapping->a_ops = &ext2_aops_xip;
 		inode->i_fop = &ext2_xip_file_operations;
 	} else if (test_opt(inode->i_sb, NOBH)) {
@@ -126,7 +126,7 @@ static int ext2_tmpfile(struct inode *dir, struct dentry *dentry, umode_t mode)
 		return PTR_ERR(inode);
 
 	inode->i_op = &ext2_file_inode_operations;
-	if (ext2_use_xip(inode->i_sb)) {
+	if (test_opt(inode->i_sb, XIP)) {
 		inode->i_mapping->a_ops = &ext2_aops_xip;
 		inode->i_fop = &ext2_xip_file_operations;
 	} else if (test_opt(inode->i_sb, NOBH)) {
-- 
1.8.5.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [PATCH v6 14/22] ext2: Remove xip.c and xip.h
  2014-02-25 14:18 ` Matthew Wilcox
@ 2014-02-25 14:18   ` Matthew Wilcox
  -1 siblings, 0 replies; 79+ messages in thread
From: Matthew Wilcox @ 2014-02-25 14:18 UTC (permalink / raw)
  To: linux-kernel, linux-mm, linux-fsdevel, willy; +Cc: Matthew Wilcox

These files are now empty, so delete them

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
---
 fs/ext2/Makefile |  1 -
 fs/ext2/inode.c  |  1 -
 fs/ext2/namei.c  |  1 -
 fs/ext2/super.c  |  1 -
 fs/ext2/xip.c    | 15 ---------------
 fs/ext2/xip.h    | 16 ----------------
 6 files changed, 35 deletions(-)
 delete mode 100644 fs/ext2/xip.c
 delete mode 100644 fs/ext2/xip.h

diff --git a/fs/ext2/Makefile b/fs/ext2/Makefile
index f42af45..445b0e9 100644
--- a/fs/ext2/Makefile
+++ b/fs/ext2/Makefile
@@ -10,4 +10,3 @@ ext2-y := balloc.o dir.o file.o ialloc.o inode.o \
 ext2-$(CONFIG_EXT2_FS_XATTR)	 += xattr.o xattr_user.o xattr_trusted.o
 ext2-$(CONFIG_EXT2_FS_POSIX_ACL) += acl.o
 ext2-$(CONFIG_EXT2_FS_SECURITY)	 += xattr_security.o
-ext2-$(CONFIG_EXT2_FS_XIP)	 += xip.o
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 2e587e2..67124f0 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -34,7 +34,6 @@
 #include <linux/aio.h>
 #include "ext2.h"
 #include "acl.h"
-#include "xip.h"
 #include "xattr.h"
 
 static int __ext2_write_inode(struct inode *inode, int do_sync);
diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c
index 846c356..7ca803f 100644
--- a/fs/ext2/namei.c
+++ b/fs/ext2/namei.c
@@ -35,7 +35,6 @@
 #include "ext2.h"
 #include "xattr.h"
 #include "acl.h"
-#include "xip.h"
 
 static inline int ext2_add_nondir(struct dentry *dentry, struct inode *inode)
 {
diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index 3a1db39..752ccb4 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -35,7 +35,6 @@
 #include "ext2.h"
 #include "xattr.h"
 #include "acl.h"
-#include "xip.h"
 
 static void ext2_sync_super(struct super_block *sb,
 			    struct ext2_super_block *es, int wait);
diff --git a/fs/ext2/xip.c b/fs/ext2/xip.c
deleted file mode 100644
index 66ca113..0000000
--- a/fs/ext2/xip.c
+++ /dev/null
@@ -1,15 +0,0 @@
-/*
- *  linux/fs/ext2/xip.c
- *
- * Copyright (C) 2005 IBM Corporation
- * Author: Carsten Otte (cotte@de.ibm.com)
- */
-
-#include <linux/mm.h>
-#include <linux/fs.h>
-#include <linux/genhd.h>
-#include <linux/buffer_head.h>
-#include <linux/blkdev.h>
-#include "ext2.h"
-#include "xip.h"
-
diff --git a/fs/ext2/xip.h b/fs/ext2/xip.h
deleted file mode 100644
index 87eeb04..0000000
--- a/fs/ext2/xip.h
+++ /dev/null
@@ -1,16 +0,0 @@
-/*
- *  linux/fs/ext2/xip.h
- *
- * Copyright (C) 2005 IBM Corporation
- * Author: Carsten Otte (cotte@de.ibm.com)
- */
-
-#ifdef CONFIG_EXT2_FS_XIP
-static inline int ext2_use_xip (struct super_block *sb)
-{
-	struct ext2_sb_info *sbi = EXT2_SB(sb);
-	return (sbi->s_mount_opt & EXT2_MOUNT_XIP);
-}
-#else
-#define ext2_use_xip(sb)			0
-#endif
-- 
1.8.5.3


^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [PATCH v6 14/22] ext2: Remove xip.c and xip.h
@ 2014-02-25 14:18   ` Matthew Wilcox
  0 siblings, 0 replies; 79+ messages in thread
From: Matthew Wilcox @ 2014-02-25 14:18 UTC (permalink / raw)
  To: linux-kernel, linux-mm, linux-fsdevel, willy; +Cc: Matthew Wilcox

These files are now empty, so delete them

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
---
 fs/ext2/Makefile |  1 -
 fs/ext2/inode.c  |  1 -
 fs/ext2/namei.c  |  1 -
 fs/ext2/super.c  |  1 -
 fs/ext2/xip.c    | 15 ---------------
 fs/ext2/xip.h    | 16 ----------------
 6 files changed, 35 deletions(-)
 delete mode 100644 fs/ext2/xip.c
 delete mode 100644 fs/ext2/xip.h

diff --git a/fs/ext2/Makefile b/fs/ext2/Makefile
index f42af45..445b0e9 100644
--- a/fs/ext2/Makefile
+++ b/fs/ext2/Makefile
@@ -10,4 +10,3 @@ ext2-y := balloc.o dir.o file.o ialloc.o inode.o \
 ext2-$(CONFIG_EXT2_FS_XATTR)	 += xattr.o xattr_user.o xattr_trusted.o
 ext2-$(CONFIG_EXT2_FS_POSIX_ACL) += acl.o
 ext2-$(CONFIG_EXT2_FS_SECURITY)	 += xattr_security.o
-ext2-$(CONFIG_EXT2_FS_XIP)	 += xip.o
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 2e587e2..67124f0 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -34,7 +34,6 @@
 #include <linux/aio.h>
 #include "ext2.h"
 #include "acl.h"
-#include "xip.h"
 #include "xattr.h"
 
 static int __ext2_write_inode(struct inode *inode, int do_sync);
diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c
index 846c356..7ca803f 100644
--- a/fs/ext2/namei.c
+++ b/fs/ext2/namei.c
@@ -35,7 +35,6 @@
 #include "ext2.h"
 #include "xattr.h"
 #include "acl.h"
-#include "xip.h"
 
 static inline int ext2_add_nondir(struct dentry *dentry, struct inode *inode)
 {
diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index 3a1db39..752ccb4 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -35,7 +35,6 @@
 #include "ext2.h"
 #include "xattr.h"
 #include "acl.h"
-#include "xip.h"
 
 static void ext2_sync_super(struct super_block *sb,
 			    struct ext2_super_block *es, int wait);
diff --git a/fs/ext2/xip.c b/fs/ext2/xip.c
deleted file mode 100644
index 66ca113..0000000
--- a/fs/ext2/xip.c
+++ /dev/null
@@ -1,15 +0,0 @@
-/*
- *  linux/fs/ext2/xip.c
- *
- * Copyright (C) 2005 IBM Corporation
- * Author: Carsten Otte (cotte@de.ibm.com)
- */
-
-#include <linux/mm.h>
-#include <linux/fs.h>
-#include <linux/genhd.h>
-#include <linux/buffer_head.h>
-#include <linux/blkdev.h>
-#include "ext2.h"
-#include "xip.h"
-
diff --git a/fs/ext2/xip.h b/fs/ext2/xip.h
deleted file mode 100644
index 87eeb04..0000000
--- a/fs/ext2/xip.h
+++ /dev/null
@@ -1,16 +0,0 @@
-/*
- *  linux/fs/ext2/xip.h
- *
- * Copyright (C) 2005 IBM Corporation
- * Author: Carsten Otte (cotte@de.ibm.com)
- */
-
-#ifdef CONFIG_EXT2_FS_XIP
-static inline int ext2_use_xip (struct super_block *sb)
-{
-	struct ext2_sb_info *sbi = EXT2_SB(sb);
-	return (sbi->s_mount_opt & EXT2_MOUNT_XIP);
-}
-#else
-#define ext2_use_xip(sb)			0
-#endif
-- 
1.8.5.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [PATCH v6 15/22] Remove CONFIG_EXT2_FS_XIP and rename CONFIG_FS_XIP to CONFIG_FS_DAX
  2014-02-25 14:18 ` Matthew Wilcox
@ 2014-02-25 14:18   ` Matthew Wilcox
  -1 siblings, 0 replies; 79+ messages in thread
From: Matthew Wilcox @ 2014-02-25 14:18 UTC (permalink / raw)
  To: linux-kernel, linux-mm, linux-fsdevel, willy; +Cc: Matthew Wilcox

The fewer Kconfig options we have the better.  Use the generic
CONFIG_FS_DAX to enable XIP support in ext2 as well as in the core.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
---
 fs/Kconfig         | 21 ++++++++++++++-------
 fs/Makefile        |  2 +-
 fs/ext2/Kconfig    | 11 -----------
 fs/ext2/ext2.h     |  2 +-
 fs/ext2/file.c     |  4 ++--
 fs/ext2/super.c    |  4 ++--
 include/linux/fs.h |  4 ++--
 7 files changed, 22 insertions(+), 26 deletions(-)

diff --git a/fs/Kconfig b/fs/Kconfig
index 7385e54..620ab73 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -13,13 +13,6 @@ if BLOCK
 source "fs/ext2/Kconfig"
 source "fs/ext3/Kconfig"
 source "fs/ext4/Kconfig"
-
-config FS_XIP
-# execute in place
-	bool
-	depends on EXT2_FS_XIP
-	default y
-
 source "fs/jbd/Kconfig"
 source "fs/jbd2/Kconfig"
 
@@ -40,6 +33,20 @@ source "fs/ocfs2/Kconfig"
 source "fs/btrfs/Kconfig"
 source "fs/nilfs2/Kconfig"
 
+config FS_DAX
+	bool "Direct Access support"
+	depends on MMU
+	help
+	  Direct Access (DAX) can be used on memory-backed block devices.
+	  If the block device supports DAX and the filesystem supports DAX,
+	  then you can avoid using the pagecache to buffer I/Os.  Turning
+	  on this option will compile in support for DAX; you will need to
+	  mount the filesystem using the -o xip option.
+
+	  If you do not have a block device that is capable of using this,
+	  or if unsure, say N.  Saying Y will increase the size of the kernel
+	  by about 2kB.
+
 endif # BLOCK
 
 # Posix ACL utility routines
diff --git a/fs/Makefile b/fs/Makefile
index 2f194cd..b7e0a13 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -29,7 +29,7 @@ obj-$(CONFIG_SIGNALFD)		+= signalfd.o
 obj-$(CONFIG_TIMERFD)		+= timerfd.o
 obj-$(CONFIG_EVENTFD)		+= eventfd.o
 obj-$(CONFIG_AIO)               += aio.o
-obj-$(CONFIG_FS_XIP)		+= dax.o
+obj-$(CONFIG_FS_DAX)		+= dax.o
 obj-$(CONFIG_FILE_LOCKING)      += locks.o
 obj-$(CONFIG_COMPAT)		+= compat.o compat_ioctl.o
 obj-$(CONFIG_BINFMT_AOUT)	+= binfmt_aout.o
diff --git a/fs/ext2/Kconfig b/fs/ext2/Kconfig
index 14a6780..c634874e 100644
--- a/fs/ext2/Kconfig
+++ b/fs/ext2/Kconfig
@@ -42,14 +42,3 @@ config EXT2_FS_SECURITY
 
 	  If you are not using a security module that requires using
 	  extended attributes for file security labels, say N.
-
-config EXT2_FS_XIP
-	bool "Ext2 execute in place support"
-	depends on EXT2_FS && MMU
-	help
-	  Execute in place can be used on memory-backed block devices. If you
-	  enable this option, you can select to mount block devices which are
-	  capable of this feature without using the page cache.
-
-	  If you do not use a block device that is capable of using this,
-	  or if unsure, say N.
diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h
index 5ecf570..b30c3bd 100644
--- a/fs/ext2/ext2.h
+++ b/fs/ext2/ext2.h
@@ -380,7 +380,7 @@ struct ext2_inode {
 #define EXT2_MOUNT_NO_UID32		0x000200  /* Disable 32-bit UIDs */
 #define EXT2_MOUNT_XATTR_USER		0x004000  /* Extended user attributes */
 #define EXT2_MOUNT_POSIX_ACL		0x008000  /* POSIX Access Control Lists */
-#ifdef CONFIG_FS_XIP
+#ifdef CONFIG_FS_DAX
 #define EXT2_MOUNT_XIP			0x010000  /* Execute in place */
 #else
 #define EXT2_MOUNT_XIP			0
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index e3ce10d..ae7f000 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -25,7 +25,7 @@
 #include "xattr.h"
 #include "acl.h"
 
-#ifdef CONFIG_EXT2_FS_XIP
+#ifdef CONFIG_FS_DAX
 static int ext2_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 {
 	return dax_fault(vma, vmf, ext2_get_block);
@@ -109,7 +109,7 @@ const struct file_operations ext2_file_operations = {
 	.splice_write	= generic_file_splice_write,
 };
 
-#ifdef CONFIG_EXT2_FS_XIP
+#ifdef CONFIG_FS_DAX
 const struct file_operations ext2_xip_file_operations = {
 	.llseek		= generic_file_llseek,
 	.read		= do_sync_read,
diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index 752ccb4..fdcacf7 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -287,7 +287,7 @@ static int ext2_show_options(struct seq_file *seq, struct dentry *root)
 		seq_puts(seq, ",grpquota");
 #endif
 
-#if defined(CONFIG_EXT2_FS_XIP)
+#ifdef CONFIG_FS_DAX
 	if (sbi->s_mount_opt & EXT2_MOUNT_XIP)
 		seq_puts(seq, ",xip");
 #endif
@@ -549,7 +549,7 @@ static int parse_options(char *options, struct super_block *sb)
 			break;
 #endif
 		case Opt_xip:
-#ifdef CONFIG_EXT2_FS_XIP
+#ifdef CONFIG_FS_DAX
 			set_opt (sbi->s_mount_opt, XIP);
 #else
 			ext2_msg(sb, KERN_INFO, "xip option not supported");
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 6f7f445..19abdb1 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1677,7 +1677,7 @@ struct super_operations {
 #define IS_IMA(inode)		((inode)->i_flags & S_IMA)
 #define IS_AUTOMOUNT(inode)	((inode)->i_flags & S_AUTOMOUNT)
 #define IS_NOSEC(inode)		((inode)->i_flags & S_NOSEC)
-#ifdef CONFIG_FS_XIP
+#ifdef CONFIG_FS_DAX
 #define IS_DAX(inode)		((inode)->i_flags & S_DAX)
 #else
 #define IS_DAX(inode)		0
@@ -2515,7 +2515,7 @@ extern loff_t fixed_size_llseek(struct file *file, loff_t offset,
 extern int generic_file_open(struct inode * inode, struct file * filp);
 extern int nonseekable_open(struct inode * inode, struct file * filp);
 
-#ifdef CONFIG_FS_XIP
+#ifdef CONFIG_FS_DAX
 int dax_clear_blocks(struct inode *, sector_t block, long size);
 int dax_truncate_page(struct inode *, loff_t from, get_block_t);
 ssize_t dax_do_io(int rw, struct kiocb *, struct inode *, const struct iovec *,
-- 
1.8.5.3


^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [PATCH v6 15/22] Remove CONFIG_EXT2_FS_XIP and rename CONFIG_FS_XIP to CONFIG_FS_DAX
@ 2014-02-25 14:18   ` Matthew Wilcox
  0 siblings, 0 replies; 79+ messages in thread
From: Matthew Wilcox @ 2014-02-25 14:18 UTC (permalink / raw)
  To: linux-kernel, linux-mm, linux-fsdevel, willy; +Cc: Matthew Wilcox

The fewer Kconfig options we have the better.  Use the generic
CONFIG_FS_DAX to enable XIP support in ext2 as well as in the core.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
---
 fs/Kconfig         | 21 ++++++++++++++-------
 fs/Makefile        |  2 +-
 fs/ext2/Kconfig    | 11 -----------
 fs/ext2/ext2.h     |  2 +-
 fs/ext2/file.c     |  4 ++--
 fs/ext2/super.c    |  4 ++--
 include/linux/fs.h |  4 ++--
 7 files changed, 22 insertions(+), 26 deletions(-)

diff --git a/fs/Kconfig b/fs/Kconfig
index 7385e54..620ab73 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -13,13 +13,6 @@ if BLOCK
 source "fs/ext2/Kconfig"
 source "fs/ext3/Kconfig"
 source "fs/ext4/Kconfig"
-
-config FS_XIP
-# execute in place
-	bool
-	depends on EXT2_FS_XIP
-	default y
-
 source "fs/jbd/Kconfig"
 source "fs/jbd2/Kconfig"
 
@@ -40,6 +33,20 @@ source "fs/ocfs2/Kconfig"
 source "fs/btrfs/Kconfig"
 source "fs/nilfs2/Kconfig"
 
+config FS_DAX
+	bool "Direct Access support"
+	depends on MMU
+	help
+	  Direct Access (DAX) can be used on memory-backed block devices.
+	  If the block device supports DAX and the filesystem supports DAX,
+	  then you can avoid using the pagecache to buffer I/Os.  Turning
+	  on this option will compile in support for DAX; you will need to
+	  mount the filesystem using the -o xip option.
+
+	  If you do not have a block device that is capable of using this,
+	  or if unsure, say N.  Saying Y will increase the size of the kernel
+	  by about 2kB.
+
 endif # BLOCK
 
 # Posix ACL utility routines
diff --git a/fs/Makefile b/fs/Makefile
index 2f194cd..b7e0a13 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -29,7 +29,7 @@ obj-$(CONFIG_SIGNALFD)		+= signalfd.o
 obj-$(CONFIG_TIMERFD)		+= timerfd.o
 obj-$(CONFIG_EVENTFD)		+= eventfd.o
 obj-$(CONFIG_AIO)               += aio.o
-obj-$(CONFIG_FS_XIP)		+= dax.o
+obj-$(CONFIG_FS_DAX)		+= dax.o
 obj-$(CONFIG_FILE_LOCKING)      += locks.o
 obj-$(CONFIG_COMPAT)		+= compat.o compat_ioctl.o
 obj-$(CONFIG_BINFMT_AOUT)	+= binfmt_aout.o
diff --git a/fs/ext2/Kconfig b/fs/ext2/Kconfig
index 14a6780..c634874e 100644
--- a/fs/ext2/Kconfig
+++ b/fs/ext2/Kconfig
@@ -42,14 +42,3 @@ config EXT2_FS_SECURITY
 
 	  If you are not using a security module that requires using
 	  extended attributes for file security labels, say N.
-
-config EXT2_FS_XIP
-	bool "Ext2 execute in place support"
-	depends on EXT2_FS && MMU
-	help
-	  Execute in place can be used on memory-backed block devices. If you
-	  enable this option, you can select to mount block devices which are
-	  capable of this feature without using the page cache.
-
-	  If you do not use a block device that is capable of using this,
-	  or if unsure, say N.
diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h
index 5ecf570..b30c3bd 100644
--- a/fs/ext2/ext2.h
+++ b/fs/ext2/ext2.h
@@ -380,7 +380,7 @@ struct ext2_inode {
 #define EXT2_MOUNT_NO_UID32		0x000200  /* Disable 32-bit UIDs */
 #define EXT2_MOUNT_XATTR_USER		0x004000  /* Extended user attributes */
 #define EXT2_MOUNT_POSIX_ACL		0x008000  /* POSIX Access Control Lists */
-#ifdef CONFIG_FS_XIP
+#ifdef CONFIG_FS_DAX
 #define EXT2_MOUNT_XIP			0x010000  /* Execute in place */
 #else
 #define EXT2_MOUNT_XIP			0
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index e3ce10d..ae7f000 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -25,7 +25,7 @@
 #include "xattr.h"
 #include "acl.h"
 
-#ifdef CONFIG_EXT2_FS_XIP
+#ifdef CONFIG_FS_DAX
 static int ext2_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 {
 	return dax_fault(vma, vmf, ext2_get_block);
@@ -109,7 +109,7 @@ const struct file_operations ext2_file_operations = {
 	.splice_write	= generic_file_splice_write,
 };
 
-#ifdef CONFIG_EXT2_FS_XIP
+#ifdef CONFIG_FS_DAX
 const struct file_operations ext2_xip_file_operations = {
 	.llseek		= generic_file_llseek,
 	.read		= do_sync_read,
diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index 752ccb4..fdcacf7 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -287,7 +287,7 @@ static int ext2_show_options(struct seq_file *seq, struct dentry *root)
 		seq_puts(seq, ",grpquota");
 #endif
 
-#if defined(CONFIG_EXT2_FS_XIP)
+#ifdef CONFIG_FS_DAX
 	if (sbi->s_mount_opt & EXT2_MOUNT_XIP)
 		seq_puts(seq, ",xip");
 #endif
@@ -549,7 +549,7 @@ static int parse_options(char *options, struct super_block *sb)
 			break;
 #endif
 		case Opt_xip:
-#ifdef CONFIG_EXT2_FS_XIP
+#ifdef CONFIG_FS_DAX
 			set_opt (sbi->s_mount_opt, XIP);
 #else
 			ext2_msg(sb, KERN_INFO, "xip option not supported");
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 6f7f445..19abdb1 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1677,7 +1677,7 @@ struct super_operations {
 #define IS_IMA(inode)		((inode)->i_flags & S_IMA)
 #define IS_AUTOMOUNT(inode)	((inode)->i_flags & S_AUTOMOUNT)
 #define IS_NOSEC(inode)		((inode)->i_flags & S_NOSEC)
-#ifdef CONFIG_FS_XIP
+#ifdef CONFIG_FS_DAX
 #define IS_DAX(inode)		((inode)->i_flags & S_DAX)
 #else
 #define IS_DAX(inode)		0
@@ -2515,7 +2515,7 @@ extern loff_t fixed_size_llseek(struct file *file, loff_t offset,
 extern int generic_file_open(struct inode * inode, struct file * filp);
 extern int nonseekable_open(struct inode * inode, struct file * filp);
 
-#ifdef CONFIG_FS_XIP
+#ifdef CONFIG_FS_DAX
 int dax_clear_blocks(struct inode *, sector_t block, long size);
 int dax_truncate_page(struct inode *, loff_t from, get_block_t);
 ssize_t dax_do_io(int rw, struct kiocb *, struct inode *, const struct iovec *,
-- 
1.8.5.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [PATCH v6 16/22] ext2: Remove ext2_aops_xip
  2014-02-25 14:18 ` Matthew Wilcox
@ 2014-02-25 14:18   ` Matthew Wilcox
  -1 siblings, 0 replies; 79+ messages in thread
From: Matthew Wilcox @ 2014-02-25 14:18 UTC (permalink / raw)
  To: linux-kernel, linux-mm, linux-fsdevel, willy; +Cc: Matthew Wilcox

We shouldn't need a special address_space_operations any more

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
---
 fs/ext2/ext2.h  | 1 -
 fs/ext2/inode.c | 7 +------
 fs/ext2/namei.c | 4 ++--
 3 files changed, 3 insertions(+), 9 deletions(-)

diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h
index b30c3bd..b8b1c11 100644
--- a/fs/ext2/ext2.h
+++ b/fs/ext2/ext2.h
@@ -793,7 +793,6 @@ extern const struct file_operations ext2_xip_file_operations;
 
 /* inode.c */
 extern const struct address_space_operations ext2_aops;
-extern const struct address_space_operations ext2_aops_xip;
 extern const struct address_space_operations ext2_nobh_aops;
 
 /* namei.c */
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 67124f0..7ca76da 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -890,11 +890,6 @@ const struct address_space_operations ext2_aops = {
 	.error_remove_page	= generic_error_remove_page,
 };
 
-const struct address_space_operations ext2_aops_xip = {
-	.bmap			= ext2_bmap,
-	.direct_IO		= ext2_direct_IO,
-};
-
 const struct address_space_operations ext2_nobh_aops = {
 	.readpage		= ext2_readpage,
 	.readpages		= ext2_readpages,
@@ -1393,7 +1388,7 @@ struct inode *ext2_iget (struct super_block *sb, unsigned long ino)
 	if (S_ISREG(inode->i_mode)) {
 		inode->i_op = &ext2_file_inode_operations;
 		if (test_opt(inode->i_sb, XIP)) {
-			inode->i_mapping->a_ops = &ext2_aops_xip;
+			inode->i_mapping->a_ops = &ext2_aops;
 			inode->i_fop = &ext2_xip_file_operations;
 		} else if (test_opt(inode->i_sb, NOBH)) {
 			inode->i_mapping->a_ops = &ext2_nobh_aops;
diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c
index 7ca803f..0db888c 100644
--- a/fs/ext2/namei.c
+++ b/fs/ext2/namei.c
@@ -105,7 +105,7 @@ static int ext2_create (struct inode * dir, struct dentry * dentry, umode_t mode
 
 	inode->i_op = &ext2_file_inode_operations;
 	if (test_opt(inode->i_sb, XIP)) {
-		inode->i_mapping->a_ops = &ext2_aops_xip;
+		inode->i_mapping->a_ops = &ext2_aops;
 		inode->i_fop = &ext2_xip_file_operations;
 	} else if (test_opt(inode->i_sb, NOBH)) {
 		inode->i_mapping->a_ops = &ext2_nobh_aops;
@@ -126,7 +126,7 @@ static int ext2_tmpfile(struct inode *dir, struct dentry *dentry, umode_t mode)
 
 	inode->i_op = &ext2_file_inode_operations;
 	if (test_opt(inode->i_sb, XIP)) {
-		inode->i_mapping->a_ops = &ext2_aops_xip;
+		inode->i_mapping->a_ops = &ext2_aops;
 		inode->i_fop = &ext2_xip_file_operations;
 	} else if (test_opt(inode->i_sb, NOBH)) {
 		inode->i_mapping->a_ops = &ext2_nobh_aops;
-- 
1.8.5.3


^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [PATCH v6 16/22] ext2: Remove ext2_aops_xip
@ 2014-02-25 14:18   ` Matthew Wilcox
  0 siblings, 0 replies; 79+ messages in thread
From: Matthew Wilcox @ 2014-02-25 14:18 UTC (permalink / raw)
  To: linux-kernel, linux-mm, linux-fsdevel, willy; +Cc: Matthew Wilcox

We shouldn't need a special address_space_operations any more

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
---
 fs/ext2/ext2.h  | 1 -
 fs/ext2/inode.c | 7 +------
 fs/ext2/namei.c | 4 ++--
 3 files changed, 3 insertions(+), 9 deletions(-)

diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h
index b30c3bd..b8b1c11 100644
--- a/fs/ext2/ext2.h
+++ b/fs/ext2/ext2.h
@@ -793,7 +793,6 @@ extern const struct file_operations ext2_xip_file_operations;
 
 /* inode.c */
 extern const struct address_space_operations ext2_aops;
-extern const struct address_space_operations ext2_aops_xip;
 extern const struct address_space_operations ext2_nobh_aops;
 
 /* namei.c */
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 67124f0..7ca76da 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -890,11 +890,6 @@ const struct address_space_operations ext2_aops = {
 	.error_remove_page	= generic_error_remove_page,
 };
 
-const struct address_space_operations ext2_aops_xip = {
-	.bmap			= ext2_bmap,
-	.direct_IO		= ext2_direct_IO,
-};
-
 const struct address_space_operations ext2_nobh_aops = {
 	.readpage		= ext2_readpage,
 	.readpages		= ext2_readpages,
@@ -1393,7 +1388,7 @@ struct inode *ext2_iget (struct super_block *sb, unsigned long ino)
 	if (S_ISREG(inode->i_mode)) {
 		inode->i_op = &ext2_file_inode_operations;
 		if (test_opt(inode->i_sb, XIP)) {
-			inode->i_mapping->a_ops = &ext2_aops_xip;
+			inode->i_mapping->a_ops = &ext2_aops;
 			inode->i_fop = &ext2_xip_file_operations;
 		} else if (test_opt(inode->i_sb, NOBH)) {
 			inode->i_mapping->a_ops = &ext2_nobh_aops;
diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c
index 7ca803f..0db888c 100644
--- a/fs/ext2/namei.c
+++ b/fs/ext2/namei.c
@@ -105,7 +105,7 @@ static int ext2_create (struct inode * dir, struct dentry * dentry, umode_t mode
 
 	inode->i_op = &ext2_file_inode_operations;
 	if (test_opt(inode->i_sb, XIP)) {
-		inode->i_mapping->a_ops = &ext2_aops_xip;
+		inode->i_mapping->a_ops = &ext2_aops;
 		inode->i_fop = &ext2_xip_file_operations;
 	} else if (test_opt(inode->i_sb, NOBH)) {
 		inode->i_mapping->a_ops = &ext2_nobh_aops;
@@ -126,7 +126,7 @@ static int ext2_tmpfile(struct inode *dir, struct dentry *dentry, umode_t mode)
 
 	inode->i_op = &ext2_file_inode_operations;
 	if (test_opt(inode->i_sb, XIP)) {
-		inode->i_mapping->a_ops = &ext2_aops_xip;
+		inode->i_mapping->a_ops = &ext2_aops;
 		inode->i_fop = &ext2_xip_file_operations;
 	} else if (test_opt(inode->i_sb, NOBH)) {
 		inode->i_mapping->a_ops = &ext2_nobh_aops;
-- 
1.8.5.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [PATCH v6 17/22] Get rid of most mentions of XIP in ext2
  2014-02-25 14:18 ` Matthew Wilcox
@ 2014-02-25 14:18   ` Matthew Wilcox
  -1 siblings, 0 replies; 79+ messages in thread
From: Matthew Wilcox @ 2014-02-25 14:18 UTC (permalink / raw)
  To: linux-kernel, linux-mm, linux-fsdevel, willy; +Cc: Matthew Wilcox

The only remaining usage is userspace's 'xip' option.
---
 fs/ext2/ext2.h  |  6 +++---
 fs/ext2/file.c  |  2 +-
 fs/ext2/inode.c |  6 +++---
 fs/ext2/namei.c |  8 ++++----
 fs/ext2/super.c | 16 ++++++++--------
 5 files changed, 19 insertions(+), 19 deletions(-)

diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h
index b8b1c11..0e1fe9d 100644
--- a/fs/ext2/ext2.h
+++ b/fs/ext2/ext2.h
@@ -381,9 +381,9 @@ struct ext2_inode {
 #define EXT2_MOUNT_XATTR_USER		0x004000  /* Extended user attributes */
 #define EXT2_MOUNT_POSIX_ACL		0x008000  /* POSIX Access Control Lists */
 #ifdef CONFIG_FS_DAX
-#define EXT2_MOUNT_XIP			0x010000  /* Execute in place */
+#define EXT2_MOUNT_DAX			0x010000  /* Direct Access */
 #else
-#define EXT2_MOUNT_XIP			0
+#define EXT2_MOUNT_DAX			0
 #endif
 #define EXT2_MOUNT_USRQUOTA		0x020000  /* user quota */
 #define EXT2_MOUNT_GRPQUOTA		0x040000  /* group quota */
@@ -789,7 +789,7 @@ extern int ext2_fsync(struct file *file, loff_t start, loff_t end,
 		      int datasync);
 extern const struct inode_operations ext2_file_inode_operations;
 extern const struct file_operations ext2_file_operations;
-extern const struct file_operations ext2_xip_file_operations;
+extern const struct file_operations ext2_dax_file_operations;
 
 /* inode.c */
 extern const struct address_space_operations ext2_aops;
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index ae7f000..f9bcb9b 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -110,7 +110,7 @@ const struct file_operations ext2_file_operations = {
 };
 
 #ifdef CONFIG_FS_DAX
-const struct file_operations ext2_xip_file_operations = {
+const struct file_operations ext2_dax_file_operations = {
 	.llseek		= generic_file_llseek,
 	.read		= do_sync_read,
 	.write		= do_sync_write,
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 7ca76da..3776063 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -1285,7 +1285,7 @@ void ext2_set_inode_flags(struct inode *inode)
 		inode->i_flags |= S_NOATIME;
 	if (flags & EXT2_DIRSYNC_FL)
 		inode->i_flags |= S_DIRSYNC;
-	if (test_opt(inode->i_sb, XIP))
+	if (test_opt(inode->i_sb, DAX))
 		inode->i_flags |= S_DAX;
 }
 
@@ -1387,9 +1387,9 @@ struct inode *ext2_iget (struct super_block *sb, unsigned long ino)
 
 	if (S_ISREG(inode->i_mode)) {
 		inode->i_op = &ext2_file_inode_operations;
-		if (test_opt(inode->i_sb, XIP)) {
+		if (test_opt(inode->i_sb, DAX)) {
 			inode->i_mapping->a_ops = &ext2_aops;
-			inode->i_fop = &ext2_xip_file_operations;
+			inode->i_fop = &ext2_dax_file_operations;
 		} else if (test_opt(inode->i_sb, NOBH)) {
 			inode->i_mapping->a_ops = &ext2_nobh_aops;
 			inode->i_fop = &ext2_file_operations;
diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c
index 0db888c..148f6e3 100644
--- a/fs/ext2/namei.c
+++ b/fs/ext2/namei.c
@@ -104,9 +104,9 @@ static int ext2_create (struct inode * dir, struct dentry * dentry, umode_t mode
 		return PTR_ERR(inode);
 
 	inode->i_op = &ext2_file_inode_operations;
-	if (test_opt(inode->i_sb, XIP)) {
+	if (test_opt(inode->i_sb, DAX)) {
 		inode->i_mapping->a_ops = &ext2_aops;
-		inode->i_fop = &ext2_xip_file_operations;
+		inode->i_fop = &ext2_dax_file_operations;
 	} else if (test_opt(inode->i_sb, NOBH)) {
 		inode->i_mapping->a_ops = &ext2_nobh_aops;
 		inode->i_fop = &ext2_file_operations;
@@ -125,9 +125,9 @@ static int ext2_tmpfile(struct inode *dir, struct dentry *dentry, umode_t mode)
 		return PTR_ERR(inode);
 
 	inode->i_op = &ext2_file_inode_operations;
-	if (test_opt(inode->i_sb, XIP)) {
+	if (test_opt(inode->i_sb, DAX)) {
 		inode->i_mapping->a_ops = &ext2_aops;
-		inode->i_fop = &ext2_xip_file_operations;
+		inode->i_fop = &ext2_dax_file_operations;
 	} else if (test_opt(inode->i_sb, NOBH)) {
 		inode->i_mapping->a_ops = &ext2_nobh_aops;
 		inode->i_fop = &ext2_file_operations;
diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index fdcacf7..8062373 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -288,7 +288,7 @@ static int ext2_show_options(struct seq_file *seq, struct dentry *root)
 #endif
 
 #ifdef CONFIG_FS_DAX
-	if (sbi->s_mount_opt & EXT2_MOUNT_XIP)
+	if (sbi->s_mount_opt & EXT2_MOUNT_DAX)
 		seq_puts(seq, ",xip");
 #endif
 
@@ -393,7 +393,7 @@ enum {
 	Opt_resgid, Opt_resuid, Opt_sb, Opt_err_cont, Opt_err_panic,
 	Opt_err_ro, Opt_nouid32, Opt_nocheck, Opt_debug,
 	Opt_oldalloc, Opt_orlov, Opt_nobh, Opt_user_xattr, Opt_nouser_xattr,
-	Opt_acl, Opt_noacl, Opt_xip, Opt_ignore, Opt_err, Opt_quota,
+	Opt_acl, Opt_noacl, Opt_dax, Opt_ignore, Opt_err, Opt_quota,
 	Opt_usrquota, Opt_grpquota, Opt_reservation, Opt_noreservation
 };
 
@@ -421,7 +421,7 @@ static const match_table_t tokens = {
 	{Opt_nouser_xattr, "nouser_xattr"},
 	{Opt_acl, "acl"},
 	{Opt_noacl, "noacl"},
-	{Opt_xip, "xip"},
+	{Opt_dax, "xip"},
 	{Opt_grpquota, "grpquota"},
 	{Opt_ignore, "noquota"},
 	{Opt_quota, "quota"},
@@ -548,9 +548,9 @@ static int parse_options(char *options, struct super_block *sb)
 				"(no)acl options not supported");
 			break;
 #endif
-		case Opt_xip:
+		case Opt_dax:
 #ifdef CONFIG_FS_DAX
-			set_opt (sbi->s_mount_opt, XIP);
+			set_opt (sbi->s_mount_opt, DAX);
 #else
 			ext2_msg(sb, KERN_INFO, "xip option not supported");
 #endif
@@ -896,7 +896,7 @@ static int ext2_fill_super(struct super_block *sb, void *data, int silent)
 
 	blocksize = BLOCK_SIZE << le32_to_cpu(sbi->s_es->s_log_block_size);
 
-	if (sbi->s_mount_opt & EXT2_MOUNT_XIP) {
+	if (sbi->s_mount_opt & EXT2_MOUNT_DAX) {
 		if (blocksize != PAGE_SIZE) {
 			ext2_msg(sb, KERN_ERR,
 					"error: unsupported blocksize for xip");
@@ -1275,10 +1275,10 @@ static int ext2_remount (struct super_block * sb, int * flags, char * data)
 		((sbi->s_mount_opt & EXT2_MOUNT_POSIX_ACL) ? MS_POSIXACL : 0);
 
 	es = sbi->s_es;
-	if ((sbi->s_mount_opt ^ old_opts.s_mount_opt) & EXT2_MOUNT_XIP) {
+	if ((sbi->s_mount_opt ^ old_opts.s_mount_opt) & EXT2_MOUNT_DAX) {
 		ext2_msg(sb, KERN_WARNING, "warning: refusing change of "
 			 "xip flag with busy inodes while remounting");
-		sbi->s_mount_opt ^= EXT2_MOUNT_XIP;
+		sbi->s_mount_opt ^= EXT2_MOUNT_DAX;
 	}
 	if ((*flags & MS_RDONLY) == (sb->s_flags & MS_RDONLY)) {
 		spin_unlock(&sbi->s_lock);
-- 
1.8.5.3


^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [PATCH v6 17/22] Get rid of most mentions of XIP in ext2
@ 2014-02-25 14:18   ` Matthew Wilcox
  0 siblings, 0 replies; 79+ messages in thread
From: Matthew Wilcox @ 2014-02-25 14:18 UTC (permalink / raw)
  To: linux-kernel, linux-mm, linux-fsdevel, willy; +Cc: Matthew Wilcox

The only remaining usage is userspace's 'xip' option.
---
 fs/ext2/ext2.h  |  6 +++---
 fs/ext2/file.c  |  2 +-
 fs/ext2/inode.c |  6 +++---
 fs/ext2/namei.c |  8 ++++----
 fs/ext2/super.c | 16 ++++++++--------
 5 files changed, 19 insertions(+), 19 deletions(-)

diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h
index b8b1c11..0e1fe9d 100644
--- a/fs/ext2/ext2.h
+++ b/fs/ext2/ext2.h
@@ -381,9 +381,9 @@ struct ext2_inode {
 #define EXT2_MOUNT_XATTR_USER		0x004000  /* Extended user attributes */
 #define EXT2_MOUNT_POSIX_ACL		0x008000  /* POSIX Access Control Lists */
 #ifdef CONFIG_FS_DAX
-#define EXT2_MOUNT_XIP			0x010000  /* Execute in place */
+#define EXT2_MOUNT_DAX			0x010000  /* Direct Access */
 #else
-#define EXT2_MOUNT_XIP			0
+#define EXT2_MOUNT_DAX			0
 #endif
 #define EXT2_MOUNT_USRQUOTA		0x020000  /* user quota */
 #define EXT2_MOUNT_GRPQUOTA		0x040000  /* group quota */
@@ -789,7 +789,7 @@ extern int ext2_fsync(struct file *file, loff_t start, loff_t end,
 		      int datasync);
 extern const struct inode_operations ext2_file_inode_operations;
 extern const struct file_operations ext2_file_operations;
-extern const struct file_operations ext2_xip_file_operations;
+extern const struct file_operations ext2_dax_file_operations;
 
 /* inode.c */
 extern const struct address_space_operations ext2_aops;
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index ae7f000..f9bcb9b 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -110,7 +110,7 @@ const struct file_operations ext2_file_operations = {
 };
 
 #ifdef CONFIG_FS_DAX
-const struct file_operations ext2_xip_file_operations = {
+const struct file_operations ext2_dax_file_operations = {
 	.llseek		= generic_file_llseek,
 	.read		= do_sync_read,
 	.write		= do_sync_write,
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 7ca76da..3776063 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -1285,7 +1285,7 @@ void ext2_set_inode_flags(struct inode *inode)
 		inode->i_flags |= S_NOATIME;
 	if (flags & EXT2_DIRSYNC_FL)
 		inode->i_flags |= S_DIRSYNC;
-	if (test_opt(inode->i_sb, XIP))
+	if (test_opt(inode->i_sb, DAX))
 		inode->i_flags |= S_DAX;
 }
 
@@ -1387,9 +1387,9 @@ struct inode *ext2_iget (struct super_block *sb, unsigned long ino)
 
 	if (S_ISREG(inode->i_mode)) {
 		inode->i_op = &ext2_file_inode_operations;
-		if (test_opt(inode->i_sb, XIP)) {
+		if (test_opt(inode->i_sb, DAX)) {
 			inode->i_mapping->a_ops = &ext2_aops;
-			inode->i_fop = &ext2_xip_file_operations;
+			inode->i_fop = &ext2_dax_file_operations;
 		} else if (test_opt(inode->i_sb, NOBH)) {
 			inode->i_mapping->a_ops = &ext2_nobh_aops;
 			inode->i_fop = &ext2_file_operations;
diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c
index 0db888c..148f6e3 100644
--- a/fs/ext2/namei.c
+++ b/fs/ext2/namei.c
@@ -104,9 +104,9 @@ static int ext2_create (struct inode * dir, struct dentry * dentry, umode_t mode
 		return PTR_ERR(inode);
 
 	inode->i_op = &ext2_file_inode_operations;
-	if (test_opt(inode->i_sb, XIP)) {
+	if (test_opt(inode->i_sb, DAX)) {
 		inode->i_mapping->a_ops = &ext2_aops;
-		inode->i_fop = &ext2_xip_file_operations;
+		inode->i_fop = &ext2_dax_file_operations;
 	} else if (test_opt(inode->i_sb, NOBH)) {
 		inode->i_mapping->a_ops = &ext2_nobh_aops;
 		inode->i_fop = &ext2_file_operations;
@@ -125,9 +125,9 @@ static int ext2_tmpfile(struct inode *dir, struct dentry *dentry, umode_t mode)
 		return PTR_ERR(inode);
 
 	inode->i_op = &ext2_file_inode_operations;
-	if (test_opt(inode->i_sb, XIP)) {
+	if (test_opt(inode->i_sb, DAX)) {
 		inode->i_mapping->a_ops = &ext2_aops;
-		inode->i_fop = &ext2_xip_file_operations;
+		inode->i_fop = &ext2_dax_file_operations;
 	} else if (test_opt(inode->i_sb, NOBH)) {
 		inode->i_mapping->a_ops = &ext2_nobh_aops;
 		inode->i_fop = &ext2_file_operations;
diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index fdcacf7..8062373 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -288,7 +288,7 @@ static int ext2_show_options(struct seq_file *seq, struct dentry *root)
 #endif
 
 #ifdef CONFIG_FS_DAX
-	if (sbi->s_mount_opt & EXT2_MOUNT_XIP)
+	if (sbi->s_mount_opt & EXT2_MOUNT_DAX)
 		seq_puts(seq, ",xip");
 #endif
 
@@ -393,7 +393,7 @@ enum {
 	Opt_resgid, Opt_resuid, Opt_sb, Opt_err_cont, Opt_err_panic,
 	Opt_err_ro, Opt_nouid32, Opt_nocheck, Opt_debug,
 	Opt_oldalloc, Opt_orlov, Opt_nobh, Opt_user_xattr, Opt_nouser_xattr,
-	Opt_acl, Opt_noacl, Opt_xip, Opt_ignore, Opt_err, Opt_quota,
+	Opt_acl, Opt_noacl, Opt_dax, Opt_ignore, Opt_err, Opt_quota,
 	Opt_usrquota, Opt_grpquota, Opt_reservation, Opt_noreservation
 };
 
@@ -421,7 +421,7 @@ static const match_table_t tokens = {
 	{Opt_nouser_xattr, "nouser_xattr"},
 	{Opt_acl, "acl"},
 	{Opt_noacl, "noacl"},
-	{Opt_xip, "xip"},
+	{Opt_dax, "xip"},
 	{Opt_grpquota, "grpquota"},
 	{Opt_ignore, "noquota"},
 	{Opt_quota, "quota"},
@@ -548,9 +548,9 @@ static int parse_options(char *options, struct super_block *sb)
 				"(no)acl options not supported");
 			break;
 #endif
-		case Opt_xip:
+		case Opt_dax:
 #ifdef CONFIG_FS_DAX
-			set_opt (sbi->s_mount_opt, XIP);
+			set_opt (sbi->s_mount_opt, DAX);
 #else
 			ext2_msg(sb, KERN_INFO, "xip option not supported");
 #endif
@@ -896,7 +896,7 @@ static int ext2_fill_super(struct super_block *sb, void *data, int silent)
 
 	blocksize = BLOCK_SIZE << le32_to_cpu(sbi->s_es->s_log_block_size);
 
-	if (sbi->s_mount_opt & EXT2_MOUNT_XIP) {
+	if (sbi->s_mount_opt & EXT2_MOUNT_DAX) {
 		if (blocksize != PAGE_SIZE) {
 			ext2_msg(sb, KERN_ERR,
 					"error: unsupported blocksize for xip");
@@ -1275,10 +1275,10 @@ static int ext2_remount (struct super_block * sb, int * flags, char * data)
 		((sbi->s_mount_opt & EXT2_MOUNT_POSIX_ACL) ? MS_POSIXACL : 0);
 
 	es = sbi->s_es;
-	if ((sbi->s_mount_opt ^ old_opts.s_mount_opt) & EXT2_MOUNT_XIP) {
+	if ((sbi->s_mount_opt ^ old_opts.s_mount_opt) & EXT2_MOUNT_DAX) {
 		ext2_msg(sb, KERN_WARNING, "warning: refusing change of "
 			 "xip flag with busy inodes while remounting");
-		sbi->s_mount_opt ^= EXT2_MOUNT_XIP;
+		sbi->s_mount_opt ^= EXT2_MOUNT_DAX;
 	}
 	if ((*flags & MS_RDONLY) == (sb->s_flags & MS_RDONLY)) {
 		spin_unlock(&sbi->s_lock);
-- 
1.8.5.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [PATCH v6 18/22] xip: Add xip_zero_page_range
  2014-02-25 14:18 ` Matthew Wilcox
@ 2014-02-25 14:18   ` Matthew Wilcox
  -1 siblings, 0 replies; 79+ messages in thread
From: Matthew Wilcox @ 2014-02-25 14:18 UTC (permalink / raw)
  To: linux-kernel, linux-mm, linux-fsdevel, willy; +Cc: Matthew Wilcox, Ross Zwisler

This new function allows us to support hole-punch for XIP files by zeroing
a partial page, as opposed to the xip_truncate_page() function which can
only truncate to the end of the page.  Reimplement xip_truncate_page() as
a macro that calls xip_zero_page_range().

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
[ported to 3.13-rc2]
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 Documentation/filesystems/dax.txt |  1 +
 fs/dax.c                          | 22 +++++++++++++++-------
 include/linux/fs.h                |  9 ++++++++-
 3 files changed, 24 insertions(+), 8 deletions(-)

diff --git a/Documentation/filesystems/dax.txt b/Documentation/filesystems/dax.txt
index 06f84e5..e5706cc 100644
--- a/Documentation/filesystems/dax.txt
+++ b/Documentation/filesystems/dax.txt
@@ -55,6 +55,7 @@ Filesystem support consists of
   for fault and page_mkwrite (which should probably call dax_fault() and
   dax_mkwrite(), passing the appropriate get_block() callback)
 - calling dax_truncate_page() instead of block_truncate_page() for DAX files
+- calling dax_zero_page_range() instead of zero_user() for DAX files
 - ensuring that there is sufficient locking between reads, writes,
   truncates and page faults
 
diff --git a/fs/dax.c b/fs/dax.c
index 75328bf..cdc8012 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -393,13 +393,16 @@ int dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
 EXPORT_SYMBOL_GPL(dax_mkwrite);
 
 /**
- * dax_truncate_page - handle a partial page being truncated in a DAX file
+ * dax_zero_page_range - zero a range within a page of a DAX file
  * @inode: The file being truncated
  * @from: The file offset that is being truncated to
+ * @length: The number of bytes to zero
  * @get_block: The filesystem method used to translate file offsets to blocks
  *
- * Similar to block_truncate_page(), this function can be called by a
- * filesystem when it is truncating an DAX file to handle the partial page.
+ * This function can be called by a filesystem when it is zeroing part of a
+ * page in a DAX file.  This is intended for hole-punch operations.  If
+ * you are truncating a file, the helper function dax_truncate_page() may be
+ * more convenient.
  *
  * We work in terms of PAGE_CACHE_SIZE here for commonality with
  * block_truncate_page(), but we could go down to PAGE_SIZE if the filesystem
@@ -407,12 +410,12 @@ EXPORT_SYMBOL_GPL(dax_mkwrite);
  * block size is smaller than PAGE_SIZE, we have to zero the rest of the page
  * since the file might be mmaped.
  */
-int dax_truncate_page(struct inode *inode, loff_t from, get_block_t get_block)
+int dax_zero_page_range(struct inode *inode, loff_t from, unsigned length,
+							get_block_t get_block)
 {
 	struct buffer_head bh;
 	pgoff_t index = from >> PAGE_CACHE_SHIFT;
 	unsigned offset = from & (PAGE_CACHE_SIZE-1);
-	unsigned length = PAGE_CACHE_ALIGN(from) - from;
 	int err;
 
 	/* Block boundary? Nothing to do */
@@ -427,11 +430,16 @@ int dax_truncate_page(struct inode *inode, loff_t from, get_block_t get_block)
 	if (buffer_written(&bh)) {
 		void *addr;
 		err = dax_get_addr(inode, &bh, &addr);
-		if (err)
+		if (err < 0)
 			return err;
+		/*
+		 * ext4 sometimes asks to zero past the end of a block.  It
+		 * really just wants to zero to the end of the block.
+		 */
+		length = min_t(unsigned, length, PAGE_CACHE_SIZE - offset);
 		memset(addr + offset, 0, length);
 	}
 
 	return 0;
 }
-EXPORT_SYMBOL_GPL(dax_truncate_page);
+EXPORT_SYMBOL_GPL(dax_zero_page_range);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 19abdb1..6a5091a 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2517,6 +2517,7 @@ extern int nonseekable_open(struct inode * inode, struct file * filp);
 
 #ifdef CONFIG_FS_DAX
 int dax_clear_blocks(struct inode *, sector_t block, long size);
+int dax_zero_page_range(struct inode *, loff_t from, unsigned len, get_block_t);
 int dax_truncate_page(struct inode *, loff_t from, get_block_t);
 ssize_t dax_do_io(int rw, struct kiocb *, struct inode *, const struct iovec *,
 		loff_t, unsigned segs, get_block_t, dio_iodone_t, int flags);
@@ -2528,7 +2529,8 @@ static inline int dax_clear_blocks(struct inode *i, sector_t blk, long sz)
 	return 0;
 }
 
-static inline int dax_truncate_page(struct inode *i, loff_t frm, get_block_t gb)
+static inline int dax_zero_page_range(struct inode *inode, loff_t from,
+						unsigned len, get_block_t gb)
 {
 	return 0;
 }
@@ -2541,6 +2543,11 @@ static inline ssize_t dax_do_io(int rw, struct kiocb *iocb, struct inode *inode,
 }
 #endif
 
+/* Can't be a function because PAGE_CACHE_SIZE is defined in pagemap.h */
+#define dax_truncate_page(inode, from, get_block)	\
+	dax_zero_page_range(inode, from, PAGE_CACHE_SIZE, get_block)
+
+
 #ifdef CONFIG_BLOCK
 typedef void (dio_submit_t)(int rw, struct bio *bio, struct inode *inode,
 			    loff_t file_offset);
-- 
1.8.5.3


^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [PATCH v6 18/22] xip: Add xip_zero_page_range
@ 2014-02-25 14:18   ` Matthew Wilcox
  0 siblings, 0 replies; 79+ messages in thread
From: Matthew Wilcox @ 2014-02-25 14:18 UTC (permalink / raw)
  To: linux-kernel, linux-mm, linux-fsdevel, willy; +Cc: Matthew Wilcox, Ross Zwisler

This new function allows us to support hole-punch for XIP files by zeroing
a partial page, as opposed to the xip_truncate_page() function which can
only truncate to the end of the page.  Reimplement xip_truncate_page() as
a macro that calls xip_zero_page_range().

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
[ported to 3.13-rc2]
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 Documentation/filesystems/dax.txt |  1 +
 fs/dax.c                          | 22 +++++++++++++++-------
 include/linux/fs.h                |  9 ++++++++-
 3 files changed, 24 insertions(+), 8 deletions(-)

diff --git a/Documentation/filesystems/dax.txt b/Documentation/filesystems/dax.txt
index 06f84e5..e5706cc 100644
--- a/Documentation/filesystems/dax.txt
+++ b/Documentation/filesystems/dax.txt
@@ -55,6 +55,7 @@ Filesystem support consists of
   for fault and page_mkwrite (which should probably call dax_fault() and
   dax_mkwrite(), passing the appropriate get_block() callback)
 - calling dax_truncate_page() instead of block_truncate_page() for DAX files
+- calling dax_zero_page_range() instead of zero_user() for DAX files
 - ensuring that there is sufficient locking between reads, writes,
   truncates and page faults
 
diff --git a/fs/dax.c b/fs/dax.c
index 75328bf..cdc8012 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -393,13 +393,16 @@ int dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
 EXPORT_SYMBOL_GPL(dax_mkwrite);
 
 /**
- * dax_truncate_page - handle a partial page being truncated in a DAX file
+ * dax_zero_page_range - zero a range within a page of a DAX file
  * @inode: The file being truncated
  * @from: The file offset that is being truncated to
+ * @length: The number of bytes to zero
  * @get_block: The filesystem method used to translate file offsets to blocks
  *
- * Similar to block_truncate_page(), this function can be called by a
- * filesystem when it is truncating an DAX file to handle the partial page.
+ * This function can be called by a filesystem when it is zeroing part of a
+ * page in a DAX file.  This is intended for hole-punch operations.  If
+ * you are truncating a file, the helper function dax_truncate_page() may be
+ * more convenient.
  *
  * We work in terms of PAGE_CACHE_SIZE here for commonality with
  * block_truncate_page(), but we could go down to PAGE_SIZE if the filesystem
@@ -407,12 +410,12 @@ EXPORT_SYMBOL_GPL(dax_mkwrite);
  * block size is smaller than PAGE_SIZE, we have to zero the rest of the page
  * since the file might be mmaped.
  */
-int dax_truncate_page(struct inode *inode, loff_t from, get_block_t get_block)
+int dax_zero_page_range(struct inode *inode, loff_t from, unsigned length,
+							get_block_t get_block)
 {
 	struct buffer_head bh;
 	pgoff_t index = from >> PAGE_CACHE_SHIFT;
 	unsigned offset = from & (PAGE_CACHE_SIZE-1);
-	unsigned length = PAGE_CACHE_ALIGN(from) - from;
 	int err;
 
 	/* Block boundary? Nothing to do */
@@ -427,11 +430,16 @@ int dax_truncate_page(struct inode *inode, loff_t from, get_block_t get_block)
 	if (buffer_written(&bh)) {
 		void *addr;
 		err = dax_get_addr(inode, &bh, &addr);
-		if (err)
+		if (err < 0)
 			return err;
+		/*
+		 * ext4 sometimes asks to zero past the end of a block.  It
+		 * really just wants to zero to the end of the block.
+		 */
+		length = min_t(unsigned, length, PAGE_CACHE_SIZE - offset);
 		memset(addr + offset, 0, length);
 	}
 
 	return 0;
 }
-EXPORT_SYMBOL_GPL(dax_truncate_page);
+EXPORT_SYMBOL_GPL(dax_zero_page_range);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 19abdb1..6a5091a 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2517,6 +2517,7 @@ extern int nonseekable_open(struct inode * inode, struct file * filp);
 
 #ifdef CONFIG_FS_DAX
 int dax_clear_blocks(struct inode *, sector_t block, long size);
+int dax_zero_page_range(struct inode *, loff_t from, unsigned len, get_block_t);
 int dax_truncate_page(struct inode *, loff_t from, get_block_t);
 ssize_t dax_do_io(int rw, struct kiocb *, struct inode *, const struct iovec *,
 		loff_t, unsigned segs, get_block_t, dio_iodone_t, int flags);
@@ -2528,7 +2529,8 @@ static inline int dax_clear_blocks(struct inode *i, sector_t blk, long sz)
 	return 0;
 }
 
-static inline int dax_truncate_page(struct inode *i, loff_t frm, get_block_t gb)
+static inline int dax_zero_page_range(struct inode *inode, loff_t from,
+						unsigned len, get_block_t gb)
 {
 	return 0;
 }
@@ -2541,6 +2543,11 @@ static inline ssize_t dax_do_io(int rw, struct kiocb *iocb, struct inode *inode,
 }
 #endif
 
+/* Can't be a function because PAGE_CACHE_SIZE is defined in pagemap.h */
+#define dax_truncate_page(inode, from, get_block)	\
+	dax_zero_page_range(inode, from, PAGE_CACHE_SIZE, get_block)
+
+
 #ifdef CONFIG_BLOCK
 typedef void (dio_submit_t)(int rw, struct bio *bio, struct inode *inode,
 			    loff_t file_offset);
-- 
1.8.5.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [PATCH v6 19/22] ext4: Make ext4_block_zero_page_range static
  2014-02-25 14:18 ` Matthew Wilcox
@ 2014-02-25 14:18   ` Matthew Wilcox
  -1 siblings, 0 replies; 79+ messages in thread
From: Matthew Wilcox @ 2014-02-25 14:18 UTC (permalink / raw)
  To: linux-kernel, linux-mm, linux-fsdevel, willy; +Cc: Matthew Wilcox

It's only called within inode.c, so make it static, remove its prototype
from ext4.h and move it above all of its callers so it doesn't need a
prototype within inode.c.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
---
 fs/ext4/ext4.h  |  2 --
 fs/ext4/inode.c | 42 +++++++++++++++++++++---------------------
 2 files changed, 21 insertions(+), 23 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index d3a534f..e025c29 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2133,8 +2133,6 @@ extern int ext4_writepage_trans_blocks(struct inode *);
 extern int ext4_chunk_trans_blocks(struct inode *, int nrblocks);
 extern int ext4_block_truncate_page(handle_t *handle,
 		struct address_space *mapping, loff_t from);
-extern int ext4_block_zero_page_range(handle_t *handle,
-		struct address_space *mapping, loff_t from, loff_t length);
 extern int ext4_zero_partial_blocks(handle_t *handle, struct inode *inode,
 			     loff_t lstart, loff_t lend);
 extern int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 6e39895..ce7341c 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3312,33 +3312,13 @@ void ext4_set_aops(struct inode *inode)
 }
 
 /*
- * ext4_block_truncate_page() zeroes out a mapping from file offset `from'
- * up to the end of the block which corresponds to `from'.
- * This required during truncate. We need to physically zero the tail end
- * of that block so it doesn't yield old data if the file is later grown.
- */
-int ext4_block_truncate_page(handle_t *handle,
-		struct address_space *mapping, loff_t from)
-{
-	unsigned offset = from & (PAGE_CACHE_SIZE-1);
-	unsigned length;
-	unsigned blocksize;
-	struct inode *inode = mapping->host;
-
-	blocksize = inode->i_sb->s_blocksize;
-	length = blocksize - (offset & (blocksize - 1));
-
-	return ext4_block_zero_page_range(handle, mapping, from, length);
-}
-
-/*
  * ext4_block_zero_page_range() zeros out a mapping of length 'length'
  * starting from file offset 'from'.  The range to be zero'd must
  * be contained with in one block.  If the specified range exceeds
  * the end of the block it will be shortened to end of the block
  * that cooresponds to 'from'
  */
-int ext4_block_zero_page_range(handle_t *handle,
+static int ext4_block_zero_page_range(handle_t *handle,
 		struct address_space *mapping, loff_t from, loff_t length)
 {
 	ext4_fsblk_t index = from >> PAGE_CACHE_SHIFT;
@@ -3428,6 +3408,26 @@ unlock:
 	return err;
 }
 
+/*
+ * ext4_block_truncate_page() zeroes out a mapping from file offset `from'
+ * up to the end of the block which corresponds to `from'.
+ * This required during truncate. We need to physically zero the tail end
+ * of that block so it doesn't yield old data if the file is later grown.
+ */
+int ext4_block_truncate_page(handle_t *handle,
+		struct address_space *mapping, loff_t from)
+{
+	unsigned offset = from & (PAGE_CACHE_SIZE-1);
+	unsigned length;
+	unsigned blocksize;
+	struct inode *inode = mapping->host;
+
+	blocksize = inode->i_sb->s_blocksize;
+	length = blocksize - (offset & (blocksize - 1));
+
+	return ext4_block_zero_page_range(handle, mapping, from, length);
+}
+
 int ext4_zero_partial_blocks(handle_t *handle, struct inode *inode,
 			     loff_t lstart, loff_t length)
 {
-- 
1.8.5.3


^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [PATCH v6 19/22] ext4: Make ext4_block_zero_page_range static
@ 2014-02-25 14:18   ` Matthew Wilcox
  0 siblings, 0 replies; 79+ messages in thread
From: Matthew Wilcox @ 2014-02-25 14:18 UTC (permalink / raw)
  To: linux-kernel, linux-mm, linux-fsdevel, willy; +Cc: Matthew Wilcox

It's only called within inode.c, so make it static, remove its prototype
from ext4.h and move it above all of its callers so it doesn't need a
prototype within inode.c.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
---
 fs/ext4/ext4.h  |  2 --
 fs/ext4/inode.c | 42 +++++++++++++++++++++---------------------
 2 files changed, 21 insertions(+), 23 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index d3a534f..e025c29 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2133,8 +2133,6 @@ extern int ext4_writepage_trans_blocks(struct inode *);
 extern int ext4_chunk_trans_blocks(struct inode *, int nrblocks);
 extern int ext4_block_truncate_page(handle_t *handle,
 		struct address_space *mapping, loff_t from);
-extern int ext4_block_zero_page_range(handle_t *handle,
-		struct address_space *mapping, loff_t from, loff_t length);
 extern int ext4_zero_partial_blocks(handle_t *handle, struct inode *inode,
 			     loff_t lstart, loff_t lend);
 extern int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 6e39895..ce7341c 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3312,33 +3312,13 @@ void ext4_set_aops(struct inode *inode)
 }
 
 /*
- * ext4_block_truncate_page() zeroes out a mapping from file offset `from'
- * up to the end of the block which corresponds to `from'.
- * This required during truncate. We need to physically zero the tail end
- * of that block so it doesn't yield old data if the file is later grown.
- */
-int ext4_block_truncate_page(handle_t *handle,
-		struct address_space *mapping, loff_t from)
-{
-	unsigned offset = from & (PAGE_CACHE_SIZE-1);
-	unsigned length;
-	unsigned blocksize;
-	struct inode *inode = mapping->host;
-
-	blocksize = inode->i_sb->s_blocksize;
-	length = blocksize - (offset & (blocksize - 1));
-
-	return ext4_block_zero_page_range(handle, mapping, from, length);
-}
-
-/*
  * ext4_block_zero_page_range() zeros out a mapping of length 'length'
  * starting from file offset 'from'.  The range to be zero'd must
  * be contained with in one block.  If the specified range exceeds
  * the end of the block it will be shortened to end of the block
  * that cooresponds to 'from'
  */
-int ext4_block_zero_page_range(handle_t *handle,
+static int ext4_block_zero_page_range(handle_t *handle,
 		struct address_space *mapping, loff_t from, loff_t length)
 {
 	ext4_fsblk_t index = from >> PAGE_CACHE_SHIFT;
@@ -3428,6 +3408,26 @@ unlock:
 	return err;
 }
 
+/*
+ * ext4_block_truncate_page() zeroes out a mapping from file offset `from'
+ * up to the end of the block which corresponds to `from'.
+ * This required during truncate. We need to physically zero the tail end
+ * of that block so it doesn't yield old data if the file is later grown.
+ */
+int ext4_block_truncate_page(handle_t *handle,
+		struct address_space *mapping, loff_t from)
+{
+	unsigned offset = from & (PAGE_CACHE_SIZE-1);
+	unsigned length;
+	unsigned blocksize;
+	struct inode *inode = mapping->host;
+
+	blocksize = inode->i_sb->s_blocksize;
+	length = blocksize - (offset & (blocksize - 1));
+
+	return ext4_block_zero_page_range(handle, mapping, from, length);
+}
+
 int ext4_zero_partial_blocks(handle_t *handle, struct inode *inode,
 			     loff_t lstart, loff_t length)
 {
-- 
1.8.5.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [PATCH v6 20/22] ext4: Add DAX functionality
  2014-02-25 14:18 ` Matthew Wilcox
@ 2014-02-25 14:18   ` Matthew Wilcox
  -1 siblings, 0 replies; 79+ messages in thread
From: Matthew Wilcox @ 2014-02-25 14:18 UTC (permalink / raw)
  To: linux-kernel, linux-mm, linux-fsdevel, willy; +Cc: Ross Zwisler, Matthew Wilcox

From: Ross Zwisler <ross.zwisler@linux.intel.com>

This is a port of the DAX functionality found in the current version of
ext2.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
[heavily tweaked]
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
---
 Documentation/filesystems/dax.txt  |  1 +
 Documentation/filesystems/ext4.txt |  2 ++
 fs/ext4/ext4.h                     |  6 +++++
 fs/ext4/file.c                     | 53 +++++++++++++++++++++++++++++++++-----
 fs/ext4/indirect.c                 | 19 +++++++++-----
 fs/ext4/inode.c                    | 52 ++++++++++++++++++++++++-------------
 fs/ext4/namei.c                    | 10 +++++--
 fs/ext4/super.c                    | 39 +++++++++++++++++++++++++++-
 8 files changed, 149 insertions(+), 33 deletions(-)

diff --git a/Documentation/filesystems/dax.txt b/Documentation/filesystems/dax.txt
index e5706cc..619dab5 100644
--- a/Documentation/filesystems/dax.txt
+++ b/Documentation/filesystems/dax.txt
@@ -66,6 +66,7 @@ or a write()) work correctly.
 
 These filesystems may be used for inspiration:
 - ext2: the second extended filesystem, see Documentation/filesystems/ext2.txt
+- ext4: the fourth extended filesystem, see Documentation/filesystems/ext4.txt
 
 
 Shortcomings
diff --git a/Documentation/filesystems/ext4.txt b/Documentation/filesystems/ext4.txt
index 919a329..9c511c4 100644
--- a/Documentation/filesystems/ext4.txt
+++ b/Documentation/filesystems/ext4.txt
@@ -386,6 +386,8 @@ max_dir_size_kb=n	This limits the size of directories so that any
 i_version		Enable 64-bit inode version support. This option is
 			off by default.
 
+dax			Use direct access if possible
+
 Data Mode
 =========
 There are 3 different data modes:
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index e025c29..00e9b79 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -966,6 +966,11 @@ struct ext4_inode_info {
 #define EXT4_MOUNT_ERRORS_MASK		0x00070
 #define EXT4_MOUNT_MINIX_DF		0x00080	/* Mimics the Minix statfs */
 #define EXT4_MOUNT_NOLOAD		0x00100	/* Don't use existing journal*/
+#ifdef CONFIG_FS_DAX
+#define EXT4_MOUNT_DAX			0x00200	/* Execute in place */
+#else
+#define EXT4_MOUNT_DAX			0
+#endif
 #define EXT4_MOUNT_DATA_FLAGS		0x00C00	/* Mode for data writes: */
 #define EXT4_MOUNT_JOURNAL_DATA		0x00400	/* Write data to journal */
 #define EXT4_MOUNT_ORDERED_DATA		0x00800	/* Flush data before commit */
@@ -2581,6 +2586,7 @@ extern const struct file_operations ext4_dir_operations;
 /* file.c */
 extern const struct inode_operations ext4_file_inode_operations;
 extern const struct file_operations ext4_file_operations;
+extern const struct file_operations ext4_dax_file_operations;
 extern loff_t ext4_llseek(struct file *file, loff_t offset, int origin);
 extern void ext4_unwritten_wait(struct inode *inode);
 
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 1a50739..42a8ccd 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -190,7 +190,7 @@ ext4_file_write(struct kiocb *iocb, const struct iovec *iov,
 		}
 	}
 
-	if (unlikely(iocb->ki_filp->f_flags & O_DIRECT))
+	if (io_is_direct(iocb->ki_filp))
 		ret = ext4_file_dio_write(iocb, iov, nr_segs, pos);
 	else
 		ret = generic_file_aio_write(iocb, iov, nr_segs, pos);
@@ -198,6 +198,27 @@ ext4_file_write(struct kiocb *iocb, const struct iovec *iov,
 	return ret;
 }
 
+#ifdef CONFIG_FS_DAX
+static int ext4_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+	return dax_fault(vma, vmf, ext4_get_block);
+					/* Is this the right get_block? */
+}
+
+static int ext4_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+	return dax_mkwrite(vma, vmf, ext4_get_block);
+}
+
+static const struct vm_operations_struct ext4_dax_vm_ops = {
+	.fault		= ext4_dax_fault,
+	.page_mkwrite	= ext4_dax_mkwrite,
+	.remap_pages	= generic_file_remap_pages,
+};
+#else
+#define ext4_dax_vm_ops	ext4_file_vm_ops
+#endif
+
 static const struct vm_operations_struct ext4_file_vm_ops = {
 	.fault		= filemap_fault,
 	.page_mkwrite   = ext4_page_mkwrite,
@@ -206,12 +227,13 @@ static const struct vm_operations_struct ext4_file_vm_ops = {
 
 static int ext4_file_mmap(struct file *file, struct vm_area_struct *vma)
 {
-	struct address_space *mapping = file->f_mapping;
-
-	if (!mapping->a_ops->readpage)
-		return -ENOEXEC;
 	file_accessed(file);
-	vma->vm_ops = &ext4_file_vm_ops;
+	if (IS_DAX(file_inode(file))) {
+		vma->vm_ops = &ext4_dax_vm_ops;
+		vma->vm_flags |= VM_MIXEDMAP;
+	} else {
+		vma->vm_ops = &ext4_file_vm_ops;
+	}
 	return 0;
 }
 
@@ -609,6 +631,25 @@ const struct file_operations ext4_file_operations = {
 	.fallocate	= ext4_fallocate,
 };
 
+#ifdef CONFIG_FS_DAX
+const struct file_operations ext4_dax_file_operations = {
+	.llseek		= ext4_llseek,
+	.read		= do_sync_read,
+	.write		= do_sync_write,
+	.aio_read	= generic_file_aio_read,
+	.aio_write	= ext4_file_write,
+	.unlocked_ioctl = ext4_ioctl,
+#ifdef CONFIG_COMPAT
+	.compat_ioctl	= ext4_compat_ioctl,
+#endif
+	.mmap		= ext4_file_mmap,
+	.open		= ext4_file_open,
+	.release	= ext4_release_file,
+	.fsync		= ext4_sync_file,
+	.fallocate	= ext4_fallocate,
+};
+#endif
+
 const struct inode_operations ext4_file_inode_operations = {
 	.setattr	= ext4_setattr,
 	.getattr	= ext4_getattr,
diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
index 594009f..dbdacef 100644
--- a/fs/ext4/indirect.c
+++ b/fs/ext4/indirect.c
@@ -686,15 +686,22 @@ retry:
 			inode_dio_done(inode);
 			goto locked;
 		}
-		ret = __blockdev_direct_IO(rw, iocb, inode,
-				 inode->i_sb->s_bdev, iov,
-				 offset, nr_segs,
-				 ext4_get_block, NULL, NULL, 0);
+		if (IS_DAX(inode))
+			ret = dax_do_io(rw, iocb, inode, iov, offset, nr_segs,
+					ext4_get_block, NULL, 0);
+		else
+			ret = __blockdev_direct_IO(rw, iocb, inode,
+					inode->i_sb->s_bdev, iov, offset,
+					nr_segs, ext4_get_block, NULL, NULL, 0);
 		inode_dio_done(inode);
 	} else {
 locked:
-		ret = blockdev_direct_IO(rw, iocb, inode, iov,
-				 offset, nr_segs, ext4_get_block);
+		if (IS_DAX(inode))
+			ret = dax_do_io(rw, iocb, inode, iov, offset, nr_segs,
+					ext4_get_block, NULL, 0);
+		else
+			ret = blockdev_direct_IO(rw, iocb, inode, iov,
+					offset, nr_segs, ext4_get_block);
 
 		if (unlikely((rw & WRITE) && ret < 0)) {
 			loff_t isize = i_size_read(inode);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index ce7341c..9462730 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3140,13 +3140,14 @@ static ssize_t ext4_ext_direct_IO(int rw, struct kiocb *iocb,
 		get_block_func = ext4_get_block_write;
 		dio_flags = DIO_LOCKING;
 	}
-	ret = __blockdev_direct_IO(rw, iocb, inode,
-				   inode->i_sb->s_bdev, iov,
-				   offset, nr_segs,
-				   get_block_func,
-				   ext4_end_io_dio,
-				   NULL,
-				   dio_flags);
+	if (IS_DAX(inode))
+		ret = dax_do_io(rw, iocb, inode, iov, offset, nr_segs,
+				get_block_func, ext4_end_io_dio, dio_flags);
+	else
+		ret = __blockdev_direct_IO(rw, iocb, inode,
+					   inode->i_sb->s_bdev, iov, offset,
+					   nr_segs, get_block_func,
+					   ext4_end_io_dio, NULL, dio_flags);
 
 	/*
 	 * Put our reference to io_end. This can free the io_end structure e.g.
@@ -3311,14 +3312,7 @@ void ext4_set_aops(struct inode *inode)
 		inode->i_mapping->a_ops = &ext4_aops;
 }
 
-/*
- * ext4_block_zero_page_range() zeros out a mapping of length 'length'
- * starting from file offset 'from'.  The range to be zero'd must
- * be contained with in one block.  If the specified range exceeds
- * the end of the block it will be shortened to end of the block
- * that cooresponds to 'from'
- */
-static int ext4_block_zero_page_range(handle_t *handle,
+static int __ext4_block_zero_page_range(handle_t *handle,
 		struct address_space *mapping, loff_t from, loff_t length)
 {
 	ext4_fsblk_t index = from >> PAGE_CACHE_SHIFT;
@@ -3409,6 +3403,22 @@ unlock:
 }
 
 /*
+ * ext4_block_zero_page_range() zeros out a mapping of length 'length'
+ * starting from file offset 'from'.  The range to be zero'd must
+ * be contained with in one block.  If the specified range exceeds
+ * the end of the block it will be shortened to end of the block
+ * that cooresponds to 'from'
+ */
+static int ext4_block_zero_page_range(handle_t *handle,
+		struct address_space *mapping, loff_t from, loff_t length)
+{
+	struct inode *inode = mapping->host;
+	if (IS_DAX(inode))
+		return dax_zero_page_range(inode, from, length, ext4_get_block);
+	return __ext4_block_zero_page_range(handle, mapping, from, length);
+}
+
+/*
  * ext4_block_truncate_page() zeroes out a mapping from file offset `from'
  * up to the end of the block which corresponds to `from'.
  * This required during truncate. We need to physically zero the tail end
@@ -3922,7 +3932,8 @@ void ext4_set_inode_flags(struct inode *inode)
 {
 	unsigned int flags = EXT4_I(inode)->i_flags;
 
-	inode->i_flags &= ~(S_SYNC|S_APPEND|S_IMMUTABLE|S_NOATIME|S_DIRSYNC);
+	inode->i_flags &= ~(S_SYNC | S_APPEND | S_IMMUTABLE | S_NOATIME |
+				S_DIRSYNC | S_DAX);
 	if (flags & EXT4_SYNC_FL)
 		inode->i_flags |= S_SYNC;
 	if (flags & EXT4_APPEND_FL)
@@ -3933,6 +3944,8 @@ void ext4_set_inode_flags(struct inode *inode)
 		inode->i_flags |= S_NOATIME;
 	if (flags & EXT4_DIRSYNC_FL)
 		inode->i_flags |= S_DIRSYNC;
+	if (test_opt(inode->i_sb, DAX))
+		inode->i_flags |= S_DAX;
 }
 
 /* Propagate flags from i_flags to EXT4_I(inode)->i_flags */
@@ -4184,7 +4197,10 @@ struct inode *ext4_iget(struct super_block *sb, unsigned long ino)
 
 	if (S_ISREG(inode->i_mode)) {
 		inode->i_op = &ext4_file_inode_operations;
-		inode->i_fop = &ext4_file_operations;
+		if (test_opt(inode->i_sb, DAX))
+			inode->i_fop = &ext4_dax_file_operations;
+		else
+			inode->i_fop = &ext4_file_operations;
 		ext4_set_aops(inode);
 	} else if (S_ISDIR(inode->i_mode)) {
 		inode->i_op = &ext4_dir_inode_operations;
@@ -4640,7 +4656,7 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr)
 		 * Truncate pagecache after we've waited for commit
 		 * in data=journal mode to make pages freeable.
 		 */
-			truncate_pagecache(inode, inode->i_size);
+		truncate_pagecache(inode, inode->i_size);
 	}
 	/*
 	 * We want to call ext4_truncate() even if attr->ia_size ==
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index d050e04..acb9cca 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -2249,7 +2249,10 @@ retry:
 	err = PTR_ERR(inode);
 	if (!IS_ERR(inode)) {
 		inode->i_op = &ext4_file_inode_operations;
-		inode->i_fop = &ext4_file_operations;
+		if (test_opt(inode->i_sb, DAX))
+			inode->i_fop = &ext4_dax_file_operations;
+		else
+			inode->i_fop = &ext4_file_operations;
 		ext4_set_aops(inode);
 		err = ext4_add_nondir(handle, dentry, inode);
 		if (!err && IS_DIRSYNC(dir))
@@ -2313,7 +2316,10 @@ retry:
 	err = PTR_ERR(inode);
 	if (!IS_ERR(inode)) {
 		inode->i_op = &ext4_file_inode_operations;
-		inode->i_fop = &ext4_file_operations;
+		if (test_opt(inode->i_sb, DAX))
+			inode->i_fop = &ext4_dax_file_operations;
+		else
+			inode->i_fop = &ext4_file_operations;
 		ext4_set_aops(inode);
 		d_tmpfile(dentry, inode);
 		err = ext4_orphan_add(handle, inode);
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 710fed2..c0b7f4c 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1156,7 +1156,7 @@ enum {
 	Opt_usrjquota, Opt_grpjquota, Opt_offusrjquota, Opt_offgrpjquota,
 	Opt_jqfmt_vfsold, Opt_jqfmt_vfsv0, Opt_jqfmt_vfsv1, Opt_quota,
 	Opt_noquota, Opt_barrier, Opt_nobarrier, Opt_err,
-	Opt_usrquota, Opt_grpquota, Opt_i_version,
+	Opt_usrquota, Opt_grpquota, Opt_i_version, Opt_dax,
 	Opt_stripe, Opt_delalloc, Opt_nodelalloc, Opt_mblk_io_submit,
 	Opt_nomblk_io_submit, Opt_block_validity, Opt_noblock_validity,
 	Opt_inode_readahead_blks, Opt_journal_ioprio,
@@ -1218,6 +1218,7 @@ static const match_table_t tokens = {
 	{Opt_barrier, "barrier"},
 	{Opt_nobarrier, "nobarrier"},
 	{Opt_i_version, "i_version"},
+	{Opt_dax, "dax"},
 	{Opt_stripe, "stripe=%u"},
 	{Opt_delalloc, "delalloc"},
 	{Opt_nodelalloc, "nodelalloc"},
@@ -1400,6 +1401,7 @@ static const struct mount_opts {
 	{Opt_min_batch_time, 0, MOPT_GTE0},
 	{Opt_inode_readahead_blks, 0, MOPT_GTE0},
 	{Opt_init_itable, 0, MOPT_GTE0},
+	{Opt_dax, EXT4_MOUNT_DAX, MOPT_SET},
 	{Opt_stripe, 0, MOPT_GTE0},
 	{Opt_resuid, 0, MOPT_GTE0},
 	{Opt_resgid, 0, MOPT_GTE0},
@@ -1638,6 +1640,11 @@ static int handle_mount_opt(struct super_block *sb, char *opt, int token,
 		}
 		sbi->s_jquota_fmt = m->mount_opt;
 #endif
+#ifndef CONFIG_FS_DAX
+	} else if (token == Opt_dax) {
+		ext4_msg(sb, KERN_INFO, "dax option not supported");
+		return -1;
+#endif
 	} else {
 		if (!args->from)
 			arg = 1;
@@ -3560,6 +3567,11 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 				 "both data=journal and dioread_nolock");
 			goto failed_mount;
 		}
+		if (test_opt(sb, DAX)) {
+			ext4_msg(sb, KERN_ERR, "can't mount with "
+				 "both data=journal and dax");
+			goto failed_mount;
+		}
 		if (test_opt(sb, DELALLOC))
 			clear_opt(sb, DELALLOC);
 	}
@@ -3613,6 +3625,19 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 		goto failed_mount;
 	}
 
+	if (sbi->s_mount_opt & EXT4_MOUNT_DAX) {
+		if (blocksize != PAGE_SIZE) {
+			ext4_msg(sb, KERN_ERR,
+					"error: unsupported blocksize for dax");
+			goto failed_mount;
+		}
+		if (!sb->s_bdev->bd_disk->fops->direct_access) {
+			ext4_msg(sb, KERN_ERR,
+					"error: device does not support dax");
+			goto failed_mount;
+		}
+	}
+
 	if (sb->s_blocksize != blocksize) {
 		/* Validate the filesystem blocksize */
 		if (!sb_set_blocksize(sb, blocksize)) {
@@ -4813,6 +4838,18 @@ static int ext4_remount(struct super_block *sb, int *flags, char *data)
 			err = -EINVAL;
 			goto restore_opts;
 		}
+		if (test_opt(sb, DAX)) {
+			ext4_msg(sb, KERN_ERR, "can't mount with "
+				 "both data=journal and dax");
+			err = -EINVAL;
+			goto restore_opts;
+		}
+	}
+
+	if ((sbi->s_mount_opt ^ old_opts.s_mount_opt) & EXT4_MOUNT_DAX) {
+		ext4_msg(sb, KERN_WARNING, "warning: refusing change of "
+			"dax flag with busy inodes while remounting");
+		sbi->s_mount_opt ^= EXT4_MOUNT_DAX;
 	}
 
 	if (sbi->s_mount_flags & EXT4_MF_FS_ABORTED)
-- 
1.8.5.3


^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [PATCH v6 20/22] ext4: Add DAX functionality
@ 2014-02-25 14:18   ` Matthew Wilcox
  0 siblings, 0 replies; 79+ messages in thread
From: Matthew Wilcox @ 2014-02-25 14:18 UTC (permalink / raw)
  To: linux-kernel, linux-mm, linux-fsdevel, willy; +Cc: Ross Zwisler, Matthew Wilcox

From: Ross Zwisler <ross.zwisler@linux.intel.com>

This is a port of the DAX functionality found in the current version of
ext2.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
[heavily tweaked]
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
---
 Documentation/filesystems/dax.txt  |  1 +
 Documentation/filesystems/ext4.txt |  2 ++
 fs/ext4/ext4.h                     |  6 +++++
 fs/ext4/file.c                     | 53 +++++++++++++++++++++++++++++++++-----
 fs/ext4/indirect.c                 | 19 +++++++++-----
 fs/ext4/inode.c                    | 52 ++++++++++++++++++++++++-------------
 fs/ext4/namei.c                    | 10 +++++--
 fs/ext4/super.c                    | 39 +++++++++++++++++++++++++++-
 8 files changed, 149 insertions(+), 33 deletions(-)

diff --git a/Documentation/filesystems/dax.txt b/Documentation/filesystems/dax.txt
index e5706cc..619dab5 100644
--- a/Documentation/filesystems/dax.txt
+++ b/Documentation/filesystems/dax.txt
@@ -66,6 +66,7 @@ or a write()) work correctly.
 
 These filesystems may be used for inspiration:
 - ext2: the second extended filesystem, see Documentation/filesystems/ext2.txt
+- ext4: the fourth extended filesystem, see Documentation/filesystems/ext4.txt
 
 
 Shortcomings
diff --git a/Documentation/filesystems/ext4.txt b/Documentation/filesystems/ext4.txt
index 919a329..9c511c4 100644
--- a/Documentation/filesystems/ext4.txt
+++ b/Documentation/filesystems/ext4.txt
@@ -386,6 +386,8 @@ max_dir_size_kb=n	This limits the size of directories so that any
 i_version		Enable 64-bit inode version support. This option is
 			off by default.
 
+dax			Use direct access if possible
+
 Data Mode
 =========
 There are 3 different data modes:
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index e025c29..00e9b79 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -966,6 +966,11 @@ struct ext4_inode_info {
 #define EXT4_MOUNT_ERRORS_MASK		0x00070
 #define EXT4_MOUNT_MINIX_DF		0x00080	/* Mimics the Minix statfs */
 #define EXT4_MOUNT_NOLOAD		0x00100	/* Don't use existing journal*/
+#ifdef CONFIG_FS_DAX
+#define EXT4_MOUNT_DAX			0x00200	/* Execute in place */
+#else
+#define EXT4_MOUNT_DAX			0
+#endif
 #define EXT4_MOUNT_DATA_FLAGS		0x00C00	/* Mode for data writes: */
 #define EXT4_MOUNT_JOURNAL_DATA		0x00400	/* Write data to journal */
 #define EXT4_MOUNT_ORDERED_DATA		0x00800	/* Flush data before commit */
@@ -2581,6 +2586,7 @@ extern const struct file_operations ext4_dir_operations;
 /* file.c */
 extern const struct inode_operations ext4_file_inode_operations;
 extern const struct file_operations ext4_file_operations;
+extern const struct file_operations ext4_dax_file_operations;
 extern loff_t ext4_llseek(struct file *file, loff_t offset, int origin);
 extern void ext4_unwritten_wait(struct inode *inode);
 
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 1a50739..42a8ccd 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -190,7 +190,7 @@ ext4_file_write(struct kiocb *iocb, const struct iovec *iov,
 		}
 	}
 
-	if (unlikely(iocb->ki_filp->f_flags & O_DIRECT))
+	if (io_is_direct(iocb->ki_filp))
 		ret = ext4_file_dio_write(iocb, iov, nr_segs, pos);
 	else
 		ret = generic_file_aio_write(iocb, iov, nr_segs, pos);
@@ -198,6 +198,27 @@ ext4_file_write(struct kiocb *iocb, const struct iovec *iov,
 	return ret;
 }
 
+#ifdef CONFIG_FS_DAX
+static int ext4_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+	return dax_fault(vma, vmf, ext4_get_block);
+					/* Is this the right get_block? */
+}
+
+static int ext4_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+	return dax_mkwrite(vma, vmf, ext4_get_block);
+}
+
+static const struct vm_operations_struct ext4_dax_vm_ops = {
+	.fault		= ext4_dax_fault,
+	.page_mkwrite	= ext4_dax_mkwrite,
+	.remap_pages	= generic_file_remap_pages,
+};
+#else
+#define ext4_dax_vm_ops	ext4_file_vm_ops
+#endif
+
 static const struct vm_operations_struct ext4_file_vm_ops = {
 	.fault		= filemap_fault,
 	.page_mkwrite   = ext4_page_mkwrite,
@@ -206,12 +227,13 @@ static const struct vm_operations_struct ext4_file_vm_ops = {
 
 static int ext4_file_mmap(struct file *file, struct vm_area_struct *vma)
 {
-	struct address_space *mapping = file->f_mapping;
-
-	if (!mapping->a_ops->readpage)
-		return -ENOEXEC;
 	file_accessed(file);
-	vma->vm_ops = &ext4_file_vm_ops;
+	if (IS_DAX(file_inode(file))) {
+		vma->vm_ops = &ext4_dax_vm_ops;
+		vma->vm_flags |= VM_MIXEDMAP;
+	} else {
+		vma->vm_ops = &ext4_file_vm_ops;
+	}
 	return 0;
 }
 
@@ -609,6 +631,25 @@ const struct file_operations ext4_file_operations = {
 	.fallocate	= ext4_fallocate,
 };
 
+#ifdef CONFIG_FS_DAX
+const struct file_operations ext4_dax_file_operations = {
+	.llseek		= ext4_llseek,
+	.read		= do_sync_read,
+	.write		= do_sync_write,
+	.aio_read	= generic_file_aio_read,
+	.aio_write	= ext4_file_write,
+	.unlocked_ioctl = ext4_ioctl,
+#ifdef CONFIG_COMPAT
+	.compat_ioctl	= ext4_compat_ioctl,
+#endif
+	.mmap		= ext4_file_mmap,
+	.open		= ext4_file_open,
+	.release	= ext4_release_file,
+	.fsync		= ext4_sync_file,
+	.fallocate	= ext4_fallocate,
+};
+#endif
+
 const struct inode_operations ext4_file_inode_operations = {
 	.setattr	= ext4_setattr,
 	.getattr	= ext4_getattr,
diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
index 594009f..dbdacef 100644
--- a/fs/ext4/indirect.c
+++ b/fs/ext4/indirect.c
@@ -686,15 +686,22 @@ retry:
 			inode_dio_done(inode);
 			goto locked;
 		}
-		ret = __blockdev_direct_IO(rw, iocb, inode,
-				 inode->i_sb->s_bdev, iov,
-				 offset, nr_segs,
-				 ext4_get_block, NULL, NULL, 0);
+		if (IS_DAX(inode))
+			ret = dax_do_io(rw, iocb, inode, iov, offset, nr_segs,
+					ext4_get_block, NULL, 0);
+		else
+			ret = __blockdev_direct_IO(rw, iocb, inode,
+					inode->i_sb->s_bdev, iov, offset,
+					nr_segs, ext4_get_block, NULL, NULL, 0);
 		inode_dio_done(inode);
 	} else {
 locked:
-		ret = blockdev_direct_IO(rw, iocb, inode, iov,
-				 offset, nr_segs, ext4_get_block);
+		if (IS_DAX(inode))
+			ret = dax_do_io(rw, iocb, inode, iov, offset, nr_segs,
+					ext4_get_block, NULL, 0);
+		else
+			ret = blockdev_direct_IO(rw, iocb, inode, iov,
+					offset, nr_segs, ext4_get_block);
 
 		if (unlikely((rw & WRITE) && ret < 0)) {
 			loff_t isize = i_size_read(inode);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index ce7341c..9462730 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3140,13 +3140,14 @@ static ssize_t ext4_ext_direct_IO(int rw, struct kiocb *iocb,
 		get_block_func = ext4_get_block_write;
 		dio_flags = DIO_LOCKING;
 	}
-	ret = __blockdev_direct_IO(rw, iocb, inode,
-				   inode->i_sb->s_bdev, iov,
-				   offset, nr_segs,
-				   get_block_func,
-				   ext4_end_io_dio,
-				   NULL,
-				   dio_flags);
+	if (IS_DAX(inode))
+		ret = dax_do_io(rw, iocb, inode, iov, offset, nr_segs,
+				get_block_func, ext4_end_io_dio, dio_flags);
+	else
+		ret = __blockdev_direct_IO(rw, iocb, inode,
+					   inode->i_sb->s_bdev, iov, offset,
+					   nr_segs, get_block_func,
+					   ext4_end_io_dio, NULL, dio_flags);
 
 	/*
 	 * Put our reference to io_end. This can free the io_end structure e.g.
@@ -3311,14 +3312,7 @@ void ext4_set_aops(struct inode *inode)
 		inode->i_mapping->a_ops = &ext4_aops;
 }
 
-/*
- * ext4_block_zero_page_range() zeros out a mapping of length 'length'
- * starting from file offset 'from'.  The range to be zero'd must
- * be contained with in one block.  If the specified range exceeds
- * the end of the block it will be shortened to end of the block
- * that cooresponds to 'from'
- */
-static int ext4_block_zero_page_range(handle_t *handle,
+static int __ext4_block_zero_page_range(handle_t *handle,
 		struct address_space *mapping, loff_t from, loff_t length)
 {
 	ext4_fsblk_t index = from >> PAGE_CACHE_SHIFT;
@@ -3409,6 +3403,22 @@ unlock:
 }
 
 /*
+ * ext4_block_zero_page_range() zeros out a mapping of length 'length'
+ * starting from file offset 'from'.  The range to be zero'd must
+ * be contained with in one block.  If the specified range exceeds
+ * the end of the block it will be shortened to end of the block
+ * that cooresponds to 'from'
+ */
+static int ext4_block_zero_page_range(handle_t *handle,
+		struct address_space *mapping, loff_t from, loff_t length)
+{
+	struct inode *inode = mapping->host;
+	if (IS_DAX(inode))
+		return dax_zero_page_range(inode, from, length, ext4_get_block);
+	return __ext4_block_zero_page_range(handle, mapping, from, length);
+}
+
+/*
  * ext4_block_truncate_page() zeroes out a mapping from file offset `from'
  * up to the end of the block which corresponds to `from'.
  * This required during truncate. We need to physically zero the tail end
@@ -3922,7 +3932,8 @@ void ext4_set_inode_flags(struct inode *inode)
 {
 	unsigned int flags = EXT4_I(inode)->i_flags;
 
-	inode->i_flags &= ~(S_SYNC|S_APPEND|S_IMMUTABLE|S_NOATIME|S_DIRSYNC);
+	inode->i_flags &= ~(S_SYNC | S_APPEND | S_IMMUTABLE | S_NOATIME |
+				S_DIRSYNC | S_DAX);
 	if (flags & EXT4_SYNC_FL)
 		inode->i_flags |= S_SYNC;
 	if (flags & EXT4_APPEND_FL)
@@ -3933,6 +3944,8 @@ void ext4_set_inode_flags(struct inode *inode)
 		inode->i_flags |= S_NOATIME;
 	if (flags & EXT4_DIRSYNC_FL)
 		inode->i_flags |= S_DIRSYNC;
+	if (test_opt(inode->i_sb, DAX))
+		inode->i_flags |= S_DAX;
 }
 
 /* Propagate flags from i_flags to EXT4_I(inode)->i_flags */
@@ -4184,7 +4197,10 @@ struct inode *ext4_iget(struct super_block *sb, unsigned long ino)
 
 	if (S_ISREG(inode->i_mode)) {
 		inode->i_op = &ext4_file_inode_operations;
-		inode->i_fop = &ext4_file_operations;
+		if (test_opt(inode->i_sb, DAX))
+			inode->i_fop = &ext4_dax_file_operations;
+		else
+			inode->i_fop = &ext4_file_operations;
 		ext4_set_aops(inode);
 	} else if (S_ISDIR(inode->i_mode)) {
 		inode->i_op = &ext4_dir_inode_operations;
@@ -4640,7 +4656,7 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr)
 		 * Truncate pagecache after we've waited for commit
 		 * in data=journal mode to make pages freeable.
 		 */
-			truncate_pagecache(inode, inode->i_size);
+		truncate_pagecache(inode, inode->i_size);
 	}
 	/*
 	 * We want to call ext4_truncate() even if attr->ia_size ==
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index d050e04..acb9cca 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -2249,7 +2249,10 @@ retry:
 	err = PTR_ERR(inode);
 	if (!IS_ERR(inode)) {
 		inode->i_op = &ext4_file_inode_operations;
-		inode->i_fop = &ext4_file_operations;
+		if (test_opt(inode->i_sb, DAX))
+			inode->i_fop = &ext4_dax_file_operations;
+		else
+			inode->i_fop = &ext4_file_operations;
 		ext4_set_aops(inode);
 		err = ext4_add_nondir(handle, dentry, inode);
 		if (!err && IS_DIRSYNC(dir))
@@ -2313,7 +2316,10 @@ retry:
 	err = PTR_ERR(inode);
 	if (!IS_ERR(inode)) {
 		inode->i_op = &ext4_file_inode_operations;
-		inode->i_fop = &ext4_file_operations;
+		if (test_opt(inode->i_sb, DAX))
+			inode->i_fop = &ext4_dax_file_operations;
+		else
+			inode->i_fop = &ext4_file_operations;
 		ext4_set_aops(inode);
 		d_tmpfile(dentry, inode);
 		err = ext4_orphan_add(handle, inode);
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 710fed2..c0b7f4c 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1156,7 +1156,7 @@ enum {
 	Opt_usrjquota, Opt_grpjquota, Opt_offusrjquota, Opt_offgrpjquota,
 	Opt_jqfmt_vfsold, Opt_jqfmt_vfsv0, Opt_jqfmt_vfsv1, Opt_quota,
 	Opt_noquota, Opt_barrier, Opt_nobarrier, Opt_err,
-	Opt_usrquota, Opt_grpquota, Opt_i_version,
+	Opt_usrquota, Opt_grpquota, Opt_i_version, Opt_dax,
 	Opt_stripe, Opt_delalloc, Opt_nodelalloc, Opt_mblk_io_submit,
 	Opt_nomblk_io_submit, Opt_block_validity, Opt_noblock_validity,
 	Opt_inode_readahead_blks, Opt_journal_ioprio,
@@ -1218,6 +1218,7 @@ static const match_table_t tokens = {
 	{Opt_barrier, "barrier"},
 	{Opt_nobarrier, "nobarrier"},
 	{Opt_i_version, "i_version"},
+	{Opt_dax, "dax"},
 	{Opt_stripe, "stripe=%u"},
 	{Opt_delalloc, "delalloc"},
 	{Opt_nodelalloc, "nodelalloc"},
@@ -1400,6 +1401,7 @@ static const struct mount_opts {
 	{Opt_min_batch_time, 0, MOPT_GTE0},
 	{Opt_inode_readahead_blks, 0, MOPT_GTE0},
 	{Opt_init_itable, 0, MOPT_GTE0},
+	{Opt_dax, EXT4_MOUNT_DAX, MOPT_SET},
 	{Opt_stripe, 0, MOPT_GTE0},
 	{Opt_resuid, 0, MOPT_GTE0},
 	{Opt_resgid, 0, MOPT_GTE0},
@@ -1638,6 +1640,11 @@ static int handle_mount_opt(struct super_block *sb, char *opt, int token,
 		}
 		sbi->s_jquota_fmt = m->mount_opt;
 #endif
+#ifndef CONFIG_FS_DAX
+	} else if (token == Opt_dax) {
+		ext4_msg(sb, KERN_INFO, "dax option not supported");
+		return -1;
+#endif
 	} else {
 		if (!args->from)
 			arg = 1;
@@ -3560,6 +3567,11 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 				 "both data=journal and dioread_nolock");
 			goto failed_mount;
 		}
+		if (test_opt(sb, DAX)) {
+			ext4_msg(sb, KERN_ERR, "can't mount with "
+				 "both data=journal and dax");
+			goto failed_mount;
+		}
 		if (test_opt(sb, DELALLOC))
 			clear_opt(sb, DELALLOC);
 	}
@@ -3613,6 +3625,19 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 		goto failed_mount;
 	}
 
+	if (sbi->s_mount_opt & EXT4_MOUNT_DAX) {
+		if (blocksize != PAGE_SIZE) {
+			ext4_msg(sb, KERN_ERR,
+					"error: unsupported blocksize for dax");
+			goto failed_mount;
+		}
+		if (!sb->s_bdev->bd_disk->fops->direct_access) {
+			ext4_msg(sb, KERN_ERR,
+					"error: device does not support dax");
+			goto failed_mount;
+		}
+	}
+
 	if (sb->s_blocksize != blocksize) {
 		/* Validate the filesystem blocksize */
 		if (!sb_set_blocksize(sb, blocksize)) {
@@ -4813,6 +4838,18 @@ static int ext4_remount(struct super_block *sb, int *flags, char *data)
 			err = -EINVAL;
 			goto restore_opts;
 		}
+		if (test_opt(sb, DAX)) {
+			ext4_msg(sb, KERN_ERR, "can't mount with "
+				 "both data=journal and dax");
+			err = -EINVAL;
+			goto restore_opts;
+		}
+	}
+
+	if ((sbi->s_mount_opt ^ old_opts.s_mount_opt) & EXT4_MOUNT_DAX) {
+		ext4_msg(sb, KERN_WARNING, "warning: refusing change of "
+			"dax flag with busy inodes while remounting");
+		sbi->s_mount_opt ^= EXT4_MOUNT_DAX;
 	}
 
 	if (sbi->s_mount_flags & EXT4_MF_FS_ABORTED)
-- 
1.8.5.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [PATCH v6 21/22] ext4: Fix typos
  2014-02-25 14:18 ` Matthew Wilcox
@ 2014-02-25 14:18   ` Matthew Wilcox
  -1 siblings, 0 replies; 79+ messages in thread
From: Matthew Wilcox @ 2014-02-25 14:18 UTC (permalink / raw)
  To: linux-kernel, linux-mm, linux-fsdevel, willy; +Cc: Matthew Wilcox

Comment fix only

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
---
 fs/ext4/inode.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 9462730..14a9744 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3691,7 +3691,7 @@ void ext4_truncate(struct inode *inode)
 
 	/*
 	 * There is a possibility that we're either freeing the inode
-	 * or it completely new indode. In those cases we might not
+	 * or it's a completely new inode. In those cases we might not
 	 * have i_mutex locked because it's not necessary.
 	 */
 	if (!(inode->i_state & (I_NEW|I_FREEING)))
-- 
1.8.5.3


^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [PATCH v6 21/22] ext4: Fix typos
@ 2014-02-25 14:18   ` Matthew Wilcox
  0 siblings, 0 replies; 79+ messages in thread
From: Matthew Wilcox @ 2014-02-25 14:18 UTC (permalink / raw)
  To: linux-kernel, linux-mm, linux-fsdevel, willy; +Cc: Matthew Wilcox

Comment fix only

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
---
 fs/ext4/inode.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 9462730..14a9744 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3691,7 +3691,7 @@ void ext4_truncate(struct inode *inode)
 
 	/*
 	 * There is a possibility that we're either freeing the inode
-	 * or it completely new indode. In those cases we might not
+	 * or it's a completely new inode. In those cases we might not
 	 * have i_mutex locked because it's not necessary.
 	 */
 	if (!(inode->i_state & (I_NEW|I_FREEING)))
-- 
1.8.5.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [PATCH v6 22/22] dax: Add reporting of major faults
  2014-02-25 14:18 ` Matthew Wilcox
@ 2014-02-25 14:18   ` Matthew Wilcox
  -1 siblings, 0 replies; 79+ messages in thread
From: Matthew Wilcox @ 2014-02-25 14:18 UTC (permalink / raw)
  To: linux-kernel, linux-mm, linux-fsdevel, willy; +Cc: Matthew Wilcox

If we have to call get_block with the create argument set to 1, then
the filesystem almost certainly had to zero the block. which is an I/O,
which should be reported as a major fault.

Note that major faults on DAX files happen for different reasons than
major faults on non-DAX files.  DAX files behave as if everything except
file holes is already cached.  That's all the more reason to report
major faults when we do have to do I/O; it may be a valuable resource
for sysadmins trying to diagnose performance problems.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
---
 fs/dax.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/fs/dax.c b/fs/dax.c
index cdc8012..79a67c5 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -20,10 +20,12 @@
 #include <linux/fs.h>
 #include <linux/genhd.h>
 #include <linux/highmem.h>
+#include <linux/memcontrol.h>
 #include <linux/mm.h>
 #include <linux/mutex.h>
 #include <linux/sched.h>
 #include <linux/uio.h>
+#include <linux/vmstat.h>
 
 int dax_clear_blocks(struct inode *inode, sector_t block, long size)
 {
@@ -286,6 +288,7 @@ static int do_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 	pgoff_t size;
 	unsigned long pfn;
 	int error;
+	int major = 0;
 
 	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
 	if (vmf->pgoff >= size)
@@ -301,6 +304,9 @@ static int do_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 	if (!buffer_written(&bh) && !vmf->cow_page) {
 		if (vmf->flags & FAULT_FLAG_WRITE) {
 			error = get_block(inode, block, &bh, 1);
+			count_vm_event(PGMAJFAULT);
+			mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
+			major = VM_FAULT_MAJOR;
 			if (error || bh.b_size < PAGE_SIZE)
 				return VM_FAULT_SIGBUS;
 		} else {
@@ -332,7 +338,7 @@ static int do_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 	/* -EBUSY is fine, somebody else faulted on the same PTE */
 	if (error != -EBUSY)
 		BUG_ON(error);
-	return VM_FAULT_NOPAGE;
+	return VM_FAULT_NOPAGE | major;
 }
 
 /**
-- 
1.8.5.3


^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [PATCH v6 22/22] dax: Add reporting of major faults
@ 2014-02-25 14:18   ` Matthew Wilcox
  0 siblings, 0 replies; 79+ messages in thread
From: Matthew Wilcox @ 2014-02-25 14:18 UTC (permalink / raw)
  To: linux-kernel, linux-mm, linux-fsdevel, willy; +Cc: Matthew Wilcox

If we have to call get_block with the create argument set to 1, then
the filesystem almost certainly had to zero the block. which is an I/O,
which should be reported as a major fault.

Note that major faults on DAX files happen for different reasons than
major faults on non-DAX files.  DAX files behave as if everything except
file holes is already cached.  That's all the more reason to report
major faults when we do have to do I/O; it may be a valuable resource
for sysadmins trying to diagnose performance problems.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
---
 fs/dax.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/fs/dax.c b/fs/dax.c
index cdc8012..79a67c5 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -20,10 +20,12 @@
 #include <linux/fs.h>
 #include <linux/genhd.h>
 #include <linux/highmem.h>
+#include <linux/memcontrol.h>
 #include <linux/mm.h>
 #include <linux/mutex.h>
 #include <linux/sched.h>
 #include <linux/uio.h>
+#include <linux/vmstat.h>
 
 int dax_clear_blocks(struct inode *inode, sector_t block, long size)
 {
@@ -286,6 +288,7 @@ static int do_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 	pgoff_t size;
 	unsigned long pfn;
 	int error;
+	int major = 0;
 
 	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
 	if (vmf->pgoff >= size)
@@ -301,6 +304,9 @@ static int do_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 	if (!buffer_written(&bh) && !vmf->cow_page) {
 		if (vmf->flags & FAULT_FLAG_WRITE) {
 			error = get_block(inode, block, &bh, 1);
+			count_vm_event(PGMAJFAULT);
+			mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
+			major = VM_FAULT_MAJOR;
 			if (error || bh.b_size < PAGE_SIZE)
 				return VM_FAULT_SIGBUS;
 		} else {
@@ -332,7 +338,7 @@ static int do_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 	/* -EBUSY is fine, somebody else faulted on the same PTE */
 	if (error != -EBUSY)
 		BUG_ON(error);
-	return VM_FAULT_NOPAGE;
+	return VM_FAULT_NOPAGE | major;
 }
 
 /**
-- 
1.8.5.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [PATCH v6 23/22] Bugfixes
  2014-02-25 14:18 ` Matthew Wilcox
@ 2014-02-26 15:07   ` Matthew Wilcox
  -1 siblings, 0 replies; 79+ messages in thread
From: Matthew Wilcox @ 2014-02-26 15:07 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-kernel, linux-mm, linux-fsdevel

On Tue, Feb 25, 2014 at 09:18:16AM -0500, Matthew Wilcox wrote:
> Seven xfstests still fail reliably with DAX that pass reliably without
> DAX: ext4/301 generic/{075,091,112,127,223,263}

With the patches below, we're down to just two additional failures,
ext4/301 and generic/223.  I'll fold these patches into the right place
for a v7 of the patchset, but I thought it unsporting to send out a new
version of such a large patchset the day after.

I've now had a review with Kirill of the page fault path.  We identified
some ... infelicities in the mm code, but nothing that's worse than the
current XIP code.

commit 714bad38915139f381e28681ab46e2e2f7202556
Author: Matthew Wilcox <willy@linux.intel.com>
Date:   Wed Feb 26 08:00:46 2014 -0500

    Only call get_block when necessary
    
    In the dax_io function, we would call get_block when we got to the end
    of the current block returned by dax_get_addr().  When using a driver
    like PRD, that's fine, but using BRD means that we stop on every page
    boundary.  The problem is that we lose information from the first call
    when doing this.  For example, if a write crosses a page boundary, the
    first time around the loop the filesystem allocates two pages, BH_New is
    set and we zero the start of the first page.  The second time around the
    loop, the filesystem just returns the existing block with BH_New clear,
    so we don't zero the tail of the second page.
    
    This patch adds tracking for how far through the buffer_head we've got,
    and will only call get_block() again once we've got to the end of the
    previous buffer.
    
    Fixes generic/263.  Now generic/{075,091,112,127} also pass; 075 was
    looking like the same problem from Ross's investigation.

commit 1f1fd14eb17dc19ecb757f896dd7573af79b5699
Author: Matthew Wilcox <willy@linux.intel.com>
Date:   Tue Feb 25 14:41:30 2014 -0500

    Clear new or unwritten blocks in page fault handler
    
    Test generic/263 mmaps the end of the file, writes to it, then checks
    the bytes after EOF are zero.  They were not being zeroed before, so we
    must do it.

commit 4d21ffcf353b8c83a599fe09ae8657ba05da1c76
Author: Matthew Wilcox <matthew.r.wilcox@intel.com>
Date:   Tue Feb 25 12:06:42 2014 -0500

    Initialise cow_page in do_page_mkwrite()
    
    We will end up checking cow_page when turning a hole into a written page,
    so it needs to be zero.

 fs/dax.c    | 47 +++++++++++++++++++++++++++++++++++++----------
 mm/memory.c |  1 +
 2 files changed, 38 insertions(+), 10 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 6308197..2640db6 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -100,6 +100,19 @@ static bool buffer_written(struct buffer_head *bh)
 	return buffer_mapped(bh) && !buffer_unwritten(bh);
 }
 
+/*
+ * When ext4 encounters a hole, it likes to return without modifying the
+ * buffer_head which means that we can't trust b_size.  To cope with this,
+ * we set b_state to 0 before calling get_block and, if any bit is set, we
+ * know we can trust b_size.  Unfortunate, really, since ext4 does know
+ * precisely how long a hole is and would save us time calling get_block
+ * repeatedly.
+ */
+static bool buffer_size_valid(struct buffer_head *bh)
+{
+	return bh->b_state != 0;
+}
+
 static ssize_t dax_io(int rw, struct inode *inode, const struct iovec *iov,
 			loff_t start, loff_t end, get_block_t get_block,
 			struct buffer_head *bh)
@@ -110,6 +123,7 @@ static ssize_t dax_io(int rw, struct inode *inode, const struct iovec *iov,
 	unsigned copied = 0;
 	loff_t offset = start;
 	loff_t max = start;
+	loff_t bh_max = start;
 	void *addr;
 	bool hole = false;
 
@@ -119,15 +133,27 @@ static ssize_t dax_io(int rw, struct inode *inode, const struct iovec *iov,
 	while (offset < end) {
 		void __user *buf = iov[seg].iov_base + copied;
 
-		if (max == offset) {
+		if (offset == max) {
 			sector_t block = offset >> inode->i_blkbits;
 			unsigned first = offset - (block << inode->i_blkbits);
 			long size;
-			memset(bh, 0, sizeof(*bh));
-			bh->b_size = ALIGN(end - offset, PAGE_SIZE);
-			retval = get_block(inode, block, bh, rw == WRITE);
-			if (retval)
-				break;
+
+			if (offset == bh_max) {
+				bh->b_size = PAGE_ALIGN(end - offset);
+				bh->b_state = 0;
+				retval = get_block(inode, block, bh,
+								rw == WRITE);
+				if (retval)
+					break;
+				if (!buffer_size_valid(bh))
+					bh->b_size = 1 << inode->i_blkbits;
+				bh_max = offset - first + bh->b_size;
+			} else {
+				unsigned done = bh->b_size - (bh_max -
+							(offset - first));
+				bh->b_blocknr += done >> inode->i_blkbits;
+				bh->b_size -= done;
+			}
 			if (rw == WRITE) {
 				if (!buffer_mapped(bh)) {
 					retval = -EIO;
@@ -140,10 +166,7 @@ static ssize_t dax_io(int rw, struct inode *inode, const struct iovec *iov,
 
 			if (hole) {
 				addr = NULL;
-				if (buffer_uptodate(bh))
-					size = bh->b_size - first;
-				else
-					size = (1 << inode->i_blkbits) - first;
+				size = bh->b_size - first;
 			} else {
 				retval = dax_get_addr(inode, bh, &addr);
 				if (retval < 0)
@@ -209,6 +232,7 @@ ssize_t dax_do_io(int rw, struct kiocb *iocb, struct inode *inode,
 	ssize_t retval = -EINVAL;
 	loff_t end = offset;
 
+	memset(&bh, 0, sizeof(bh));
 	for (seg = 0; seg < nr_segs; seg++)
 		end += iov[seg].iov_len;
 
@@ -314,6 +338,9 @@ static int do_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 		}
 	}
 
+	if (buffer_unwritten(&bh) || buffer_new(&bh))
+		dax_clear_blocks(inode, bh.b_blocknr, bh.b_size);
+
 	/* Recheck i_size under i_mmap_mutex */
 	mutex_lock(&mapping->i_mmap_mutex);
 	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
diff --git a/mm/memory.c b/mm/memory.c
index 4e1bdee..3c6b8b2 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2672,6 +2672,7 @@ static int do_page_mkwrite(struct vm_area_struct *vma, struct page *page,
 	vmf.pgoff = page->index;
 	vmf.flags = FAULT_FLAG_WRITE|FAULT_FLAG_MKWRITE;
 	vmf.page = page;
+	vmf.cow_page = NULL;
 
 	ret = vma->vm_ops->page_mkwrite(vma, &vmf);
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE)))

^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [PATCH v6 23/22] Bugfixes
@ 2014-02-26 15:07   ` Matthew Wilcox
  0 siblings, 0 replies; 79+ messages in thread
From: Matthew Wilcox @ 2014-02-26 15:07 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-kernel, linux-mm, linux-fsdevel

On Tue, Feb 25, 2014 at 09:18:16AM -0500, Matthew Wilcox wrote:
> Seven xfstests still fail reliably with DAX that pass reliably without
> DAX: ext4/301 generic/{075,091,112,127,223,263}

With the patches below, we're down to just two additional failures,
ext4/301 and generic/223.  I'll fold these patches into the right place
for a v7 of the patchset, but I thought it unsporting to send out a new
version of such a large patchset the day after.

I've now had a review with Kirill of the page fault path.  We identified
some ... infelicities in the mm code, but nothing that's worse than the
current XIP code.

commit 714bad38915139f381e28681ab46e2e2f7202556
Author: Matthew Wilcox <willy@linux.intel.com>
Date:   Wed Feb 26 08:00:46 2014 -0500

    Only call get_block when necessary
    
    In the dax_io function, we would call get_block when we got to the end
    of the current block returned by dax_get_addr().  When using a driver
    like PRD, that's fine, but using BRD means that we stop on every page
    boundary.  The problem is that we lose information from the first call
    when doing this.  For example, if a write crosses a page boundary, the
    first time around the loop the filesystem allocates two pages, BH_New is
    set and we zero the start of the first page.  The second time around the
    loop, the filesystem just returns the existing block with BH_New clear,
    so we don't zero the tail of the second page.
    
    This patch adds tracking for how far through the buffer_head we've got,
    and will only call get_block() again once we've got to the end of the
    previous buffer.
    
    Fixes generic/263.  Now generic/{075,091,112,127} also pass; 075 was
    looking like the same problem from Ross's investigation.

commit 1f1fd14eb17dc19ecb757f896dd7573af79b5699
Author: Matthew Wilcox <willy@linux.intel.com>
Date:   Tue Feb 25 14:41:30 2014 -0500

    Clear new or unwritten blocks in page fault handler
    
    Test generic/263 mmaps the end of the file, writes to it, then checks
    the bytes after EOF are zero.  They were not being zeroed before, so we
    must do it.

commit 4d21ffcf353b8c83a599fe09ae8657ba05da1c76
Author: Matthew Wilcox <matthew.r.wilcox@intel.com>
Date:   Tue Feb 25 12:06:42 2014 -0500

    Initialise cow_page in do_page_mkwrite()
    
    We will end up checking cow_page when turning a hole into a written page,
    so it needs to be zero.

 fs/dax.c    | 47 +++++++++++++++++++++++++++++++++++++----------
 mm/memory.c |  1 +
 2 files changed, 38 insertions(+), 10 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 6308197..2640db6 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -100,6 +100,19 @@ static bool buffer_written(struct buffer_head *bh)
 	return buffer_mapped(bh) && !buffer_unwritten(bh);
 }
 
+/*
+ * When ext4 encounters a hole, it likes to return without modifying the
+ * buffer_head which means that we can't trust b_size.  To cope with this,
+ * we set b_state to 0 before calling get_block and, if any bit is set, we
+ * know we can trust b_size.  Unfortunate, really, since ext4 does know
+ * precisely how long a hole is and would save us time calling get_block
+ * repeatedly.
+ */
+static bool buffer_size_valid(struct buffer_head *bh)
+{
+	return bh->b_state != 0;
+}
+
 static ssize_t dax_io(int rw, struct inode *inode, const struct iovec *iov,
 			loff_t start, loff_t end, get_block_t get_block,
 			struct buffer_head *bh)
@@ -110,6 +123,7 @@ static ssize_t dax_io(int rw, struct inode *inode, const struct iovec *iov,
 	unsigned copied = 0;
 	loff_t offset = start;
 	loff_t max = start;
+	loff_t bh_max = start;
 	void *addr;
 	bool hole = false;
 
@@ -119,15 +133,27 @@ static ssize_t dax_io(int rw, struct inode *inode, const struct iovec *iov,
 	while (offset < end) {
 		void __user *buf = iov[seg].iov_base + copied;
 
-		if (max == offset) {
+		if (offset == max) {
 			sector_t block = offset >> inode->i_blkbits;
 			unsigned first = offset - (block << inode->i_blkbits);
 			long size;
-			memset(bh, 0, sizeof(*bh));
-			bh->b_size = ALIGN(end - offset, PAGE_SIZE);
-			retval = get_block(inode, block, bh, rw == WRITE);
-			if (retval)
-				break;
+
+			if (offset == bh_max) {
+				bh->b_size = PAGE_ALIGN(end - offset);
+				bh->b_state = 0;
+				retval = get_block(inode, block, bh,
+								rw == WRITE);
+				if (retval)
+					break;
+				if (!buffer_size_valid(bh))
+					bh->b_size = 1 << inode->i_blkbits;
+				bh_max = offset - first + bh->b_size;
+			} else {
+				unsigned done = bh->b_size - (bh_max -
+							(offset - first));
+				bh->b_blocknr += done >> inode->i_blkbits;
+				bh->b_size -= done;
+			}
 			if (rw == WRITE) {
 				if (!buffer_mapped(bh)) {
 					retval = -EIO;
@@ -140,10 +166,7 @@ static ssize_t dax_io(int rw, struct inode *inode, const struct iovec *iov,
 
 			if (hole) {
 				addr = NULL;
-				if (buffer_uptodate(bh))
-					size = bh->b_size - first;
-				else
-					size = (1 << inode->i_blkbits) - first;
+				size = bh->b_size - first;
 			} else {
 				retval = dax_get_addr(inode, bh, &addr);
 				if (retval < 0)
@@ -209,6 +232,7 @@ ssize_t dax_do_io(int rw, struct kiocb *iocb, struct inode *inode,
 	ssize_t retval = -EINVAL;
 	loff_t end = offset;
 
+	memset(&bh, 0, sizeof(bh));
 	for (seg = 0; seg < nr_segs; seg++)
 		end += iov[seg].iov_len;
 
@@ -314,6 +338,9 @@ static int do_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 		}
 	}
 
+	if (buffer_unwritten(&bh) || buffer_new(&bh))
+		dax_clear_blocks(inode, bh.b_blocknr, bh.b_size);
+
 	/* Recheck i_size under i_mmap_mutex */
 	mutex_lock(&mapping->i_mmap_mutex);
 	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
diff --git a/mm/memory.c b/mm/memory.c
index 4e1bdee..3c6b8b2 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2672,6 +2672,7 @@ static int do_page_mkwrite(struct vm_area_struct *vma, struct page *page,
 	vmf.pgoff = page->index;
 	vmf.flags = FAULT_FLAG_WRITE|FAULT_FLAG_MKWRITE;
 	vmf.page = page;
+	vmf.cow_page = NULL;
 
 	ret = vma->vm_ops->page_mkwrite(vma, &vmf);
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE)))

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 79+ messages in thread

* Re: [PATCH v6 00/22] Support ext4 on NV-DIMMs
  2014-02-25 14:18 ` Matthew Wilcox
@ 2014-02-27 14:01   ` Florian Weimer
  -1 siblings, 0 replies; 79+ messages in thread
From: Florian Weimer @ 2014-02-27 14:01 UTC (permalink / raw)
  To: Matthew Wilcox, linux-kernel, linux-mm, linux-fsdevel, willy

On 02/25/2014 03:18 PM, Matthew Wilcox wrote:
> One of the primary uses for NV-DIMMs is to expose them as a block device
> and use a filesystem to store files on the NV-DIMM.  While that works,
> it currently wastes memory and CPU time buffering the files in the page
> cache.  We have support in ext2 for bypassing the page cache, but it
> has some races which are unfixable in the current design.  This series
> of patches rewrite the underlying support, and add support for direct
> access to ext4.

I'm wondering if there is a potential security issue lurking here.

Some distributions use udisks2 to grant permission to local console 
users to create new loop devices from files.  File systems on these 
block devices are then mounted.  This is a replacement for several file 
systems implemented in user space, and for the users, this is a good 
thing because the in-kernel implementations are generally of higher quality.

What happens if we have DAX support in the entire stack, and an 
enterprising user mounts a file system?  Will she be able to fuzz the 
file system or binfmt loaders concurrently, changing the bits while they 
are being read?

Currently, it appears that the loop device duplicates pages in the page 
cache, so this does not seem to be possible, but DAX support might 
change this.

-- 
Florian Weimer / Red Hat Product Security Team

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v6 00/22] Support ext4 on NV-DIMMs
@ 2014-02-27 14:01   ` Florian Weimer
  0 siblings, 0 replies; 79+ messages in thread
From: Florian Weimer @ 2014-02-27 14:01 UTC (permalink / raw)
  To: Matthew Wilcox, linux-kernel, linux-mm, linux-fsdevel, willy

On 02/25/2014 03:18 PM, Matthew Wilcox wrote:
> One of the primary uses for NV-DIMMs is to expose them as a block device
> and use a filesystem to store files on the NV-DIMM.  While that works,
> it currently wastes memory and CPU time buffering the files in the page
> cache.  We have support in ext2 for bypassing the page cache, but it
> has some races which are unfixable in the current design.  This series
> of patches rewrite the underlying support, and add support for direct
> access to ext4.

I'm wondering if there is a potential security issue lurking here.

Some distributions use udisks2 to grant permission to local console 
users to create new loop devices from files.  File systems on these 
block devices are then mounted.  This is a replacement for several file 
systems implemented in user space, and for the users, this is a good 
thing because the in-kernel implementations are generally of higher quality.

What happens if we have DAX support in the entire stack, and an 
enterprising user mounts a file system?  Will she be able to fuzz the 
file system or binfmt loaders concurrently, changing the bits while they 
are being read?

Currently, it appears that the loop device duplicates pages in the page 
cache, so this does not seem to be possible, but DAX support might 
change this.

-- 
Florian Weimer / Red Hat Product Security Team

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v6 00/22] Support ext4 on NV-DIMMs
  2014-02-27 14:01   ` Florian Weimer
@ 2014-02-27 16:29     ` Matthew Wilcox
  -1 siblings, 0 replies; 79+ messages in thread
From: Matthew Wilcox @ 2014-02-27 16:29 UTC (permalink / raw)
  To: Florian Weimer; +Cc: Matthew Wilcox, linux-kernel, linux-mm, linux-fsdevel

On Thu, Feb 27, 2014 at 03:01:03PM +0100, Florian Weimer wrote:
> On 02/25/2014 03:18 PM, Matthew Wilcox wrote:
> >One of the primary uses for NV-DIMMs is to expose them as a block device
> >and use a filesystem to store files on the NV-DIMM.  While that works,
> >it currently wastes memory and CPU time buffering the files in the page
> >cache.  We have support in ext2 for bypassing the page cache, but it
> >has some races which are unfixable in the current design.  This series
> >of patches rewrite the underlying support, and add support for direct
> >access to ext4.
> 
> I'm wondering if there is a potential security issue lurking here.
> 
> Some distributions use udisks2 to grant permission to local console
> users to create new loop devices from files.  File systems on these
> block devices are then mounted.  This is a replacement for several
> file systems implemented in user space, and for the users, this is a
> good thing because the in-kernel implementations are generally of
> higher quality.

Just to be sure I understand; the user owns the file (so can change any
bit in it at will), and the loop device is used to present that file
to the filesystem as a block device to be mounted?  Have we fuzz-tested
all the filesystems enough to be sure that's safe?  :-)

> What happens if we have DAX support in the entire stack, and an
> enterprising user mounts a file system?  Will she be able to fuzz
> the file system or binfmt loaders concurrently, changing the bits
> while they are being read?
> 
> Currently, it appears that the loop device duplicates pages in the
> page cache, so this does not seem to be possible, but DAX support
> might change this.

I haven't looked at adding DAX support to the loop device, although
that would make sense.  At the moment, neither ext2 nor ext4 (our only
DAX-supporting filesystems) use DAX for their metadata, only for user
data.  As far as fuzzing the binfmt loaders ... are these filesystems not
forced to be at least nosuid?  I might go so far as to make them noexec.

Thanks for thinking about this.  I didn't know allowing users to mount
files they owned was something distros actually did.  Have we considered
prohibiting the user from modifying the file while it's mounted, eg
forcing its permissions to 0 or pretending it's immutable?


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v6 00/22] Support ext4 on NV-DIMMs
@ 2014-02-27 16:29     ` Matthew Wilcox
  0 siblings, 0 replies; 79+ messages in thread
From: Matthew Wilcox @ 2014-02-27 16:29 UTC (permalink / raw)
  To: Florian Weimer; +Cc: Matthew Wilcox, linux-kernel, linux-mm, linux-fsdevel

On Thu, Feb 27, 2014 at 03:01:03PM +0100, Florian Weimer wrote:
> On 02/25/2014 03:18 PM, Matthew Wilcox wrote:
> >One of the primary uses for NV-DIMMs is to expose them as a block device
> >and use a filesystem to store files on the NV-DIMM.  While that works,
> >it currently wastes memory and CPU time buffering the files in the page
> >cache.  We have support in ext2 for bypassing the page cache, but it
> >has some races which are unfixable in the current design.  This series
> >of patches rewrite the underlying support, and add support for direct
> >access to ext4.
> 
> I'm wondering if there is a potential security issue lurking here.
> 
> Some distributions use udisks2 to grant permission to local console
> users to create new loop devices from files.  File systems on these
> block devices are then mounted.  This is a replacement for several
> file systems implemented in user space, and for the users, this is a
> good thing because the in-kernel implementations are generally of
> higher quality.

Just to be sure I understand; the user owns the file (so can change any
bit in it at will), and the loop device is used to present that file
to the filesystem as a block device to be mounted?  Have we fuzz-tested
all the filesystems enough to be sure that's safe?  :-)

> What happens if we have DAX support in the entire stack, and an
> enterprising user mounts a file system?  Will she be able to fuzz
> the file system or binfmt loaders concurrently, changing the bits
> while they are being read?
> 
> Currently, it appears that the loop device duplicates pages in the
> page cache, so this does not seem to be possible, but DAX support
> might change this.

I haven't looked at adding DAX support to the loop device, although
that would make sense.  At the moment, neither ext2 nor ext4 (our only
DAX-supporting filesystems) use DAX for their metadata, only for user
data.  As far as fuzzing the binfmt loaders ... are these filesystems not
forced to be at least nosuid?  I might go so far as to make them noexec.

Thanks for thinking about this.  I didn't know allowing users to mount
files they owned was something distros actually did.  Have we considered
prohibiting the user from modifying the file while it's mounted, eg
forcing its permissions to 0 or pretending it's immutable?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v6 00/22] Support ext4 on NV-DIMMs
  2014-02-27 16:29     ` Matthew Wilcox
@ 2014-02-27 16:36       ` Florian Weimer
  -1 siblings, 0 replies; 79+ messages in thread
From: Florian Weimer @ 2014-02-27 16:36 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Matthew Wilcox, linux-kernel, linux-mm, linux-fsdevel

On 02/27/2014 05:29 PM, Matthew Wilcox wrote:

>> Some distributions use udisks2 to grant permission to local console
>> users to create new loop devices from files.  File systems on these
>> block devices are then mounted.  This is a replacement for several
>> file systems implemented in user space, and for the users, this is a
>> good thing because the in-kernel implementations are generally of
>> higher quality.
>
> Just to be sure I understand; the user owns the file (so can change any
> bit in it at will), and the loop device is used to present that file
> to the filesystem as a block device to be mounted?

Yes, that's a fair summary.

 > Have we fuzz-tested
> all the filesystems enough to be sure that's safe?  :-)

It raised some eyebrows.  But I've looked at some of the userspace 
alternatives, and I can see why we ended up with this.

>> What happens if we have DAX support in the entire stack, and an
>> enterprising user mounts a file system?  Will she be able to fuzz
>> the file system or binfmt loaders concurrently, changing the bits
>> while they are being read?
>>
>> Currently, it appears that the loop device duplicates pages in the
>> page cache, so this does not seem to be possible, but DAX support
>> might change this.
>
> I haven't looked at adding DAX support to the loop device, although
> that would make sense.  At the moment, neither ext2 nor ext4 (our only
> DAX-supporting filesystems) use DAX for their metadata, only for user
> data.  As far as fuzzing the binfmt loaders ... are these filesystems not
> forced to be at least nosuid?

The kernel binfmt parser runs as root even without a SUID bit. :)

 > I might go so far as to make them noexec.

Oh, that's an interesting idea.

> Thanks for thinking about this.  I didn't know allowing users to mount
> files they owned was something distros actually did.  Have we considered
> prohibiting the user from modifying the file while it's mounted, eg
> forcing its permissions to 0 or pretending it's immutable?

Perhaps like "Text file busy" for executables?  How reliable is that in 
practice?

Changing file permissions doesn't affected already open descriptors and 
might not always be possible (the file system might be mounted 
read-only, but still be modifiable beneath).

-- 
Florian Weimer / Red Hat Product Security Team

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v6 00/22] Support ext4 on NV-DIMMs
@ 2014-02-27 16:36       ` Florian Weimer
  0 siblings, 0 replies; 79+ messages in thread
From: Florian Weimer @ 2014-02-27 16:36 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Matthew Wilcox, linux-kernel, linux-mm, linux-fsdevel

On 02/27/2014 05:29 PM, Matthew Wilcox wrote:

>> Some distributions use udisks2 to grant permission to local console
>> users to create new loop devices from files.  File systems on these
>> block devices are then mounted.  This is a replacement for several
>> file systems implemented in user space, and for the users, this is a
>> good thing because the in-kernel implementations are generally of
>> higher quality.
>
> Just to be sure I understand; the user owns the file (so can change any
> bit in it at will), and the loop device is used to present that file
> to the filesystem as a block device to be mounted?

Yes, that's a fair summary.

 > Have we fuzz-tested
> all the filesystems enough to be sure that's safe?  :-)

It raised some eyebrows.  But I've looked at some of the userspace 
alternatives, and I can see why we ended up with this.

>> What happens if we have DAX support in the entire stack, and an
>> enterprising user mounts a file system?  Will she be able to fuzz
>> the file system or binfmt loaders concurrently, changing the bits
>> while they are being read?
>>
>> Currently, it appears that the loop device duplicates pages in the
>> page cache, so this does not seem to be possible, but DAX support
>> might change this.
>
> I haven't looked at adding DAX support to the loop device, although
> that would make sense.  At the moment, neither ext2 nor ext4 (our only
> DAX-supporting filesystems) use DAX for their metadata, only for user
> data.  As far as fuzzing the binfmt loaders ... are these filesystems not
> forced to be at least nosuid?

The kernel binfmt parser runs as root even without a SUID bit. :)

 > I might go so far as to make them noexec.

Oh, that's an interesting idea.

> Thanks for thinking about this.  I didn't know allowing users to mount
> files they owned was something distros actually did.  Have we considered
> prohibiting the user from modifying the file while it's mounted, eg
> forcing its permissions to 0 or pretending it's immutable?

Perhaps like "Text file busy" for executables?  How reliable is that in 
practice?

Changing file permissions doesn't affected already open descriptors and 
might not always be possible (the file system might be mounted 
read-only, but still be modifiable beneath).

-- 
Florian Weimer / Red Hat Product Security Team

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v6 07/22] Replace the XIP page fault handler with the DAX page fault handler
  2014-02-25 14:18   ` Matthew Wilcox
@ 2014-02-28 17:49     ` Toshi Kani
  -1 siblings, 0 replies; 79+ messages in thread
From: Toshi Kani @ 2014-02-28 17:49 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-kernel, linux-mm, linux-fsdevel, willy

On Tue, 2014-02-25 at 09:18 -0500, Matthew Wilcox wrote:
> Instead of calling aops->get_xip_mem from the fault handler, the
> filesystem passes a get_block_t that is used to find the appropriate
> blocks.
 :
> +static int do_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
> +			get_block_t get_block)
> +{
> +	struct file *file = vma->vm_file;
> +	struct inode *inode = file_inode(file);
> +	struct address_space *mapping = file->f_mapping;
> +	struct buffer_head bh;
> +	unsigned long vaddr = (unsigned long)vmf->virtual_address;
> +	sector_t block;
> +	pgoff_t size;
> +	unsigned long pfn;
> +	int error;
> +
> +	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
> +	if (vmf->pgoff >= size)
> +		return VM_FAULT_SIGBUS;
> +
> +	memset(&bh, 0, sizeof(bh));
> +	block = (sector_t)vmf->pgoff << (PAGE_SHIFT - inode->i_blkbits);
> +	bh.b_size = PAGE_SIZE;
> +	error = get_block(inode, block, &bh, 0);
> +	if (error || bh.b_size < PAGE_SIZE)
> +		return VM_FAULT_SIGBUS;

I am learning the code and have some questions.  The original code,
xip_file_fault(), jumps to found: and calls vm_insert_mixed() when
get_xip_mem(,,0,,) succeeded.  If get_xip_mem() returns -ENODATA, it
calls either get_xip_mem(,,1,,) or xip_sparse_page().  In this new
function, it looks to me that get_block(,,,0) returns 0 for both cases
(success and -ENODATA previously), which are dealt in the same way.  Is
that right?  If so, is there any reason for the change?  Also, isn't it
possible to call get_block(,,,1) even if get_block(,,,0) found a block?

Thanks,
-Toshi

> +
> +	if (!buffer_written(&bh) && !vmf->cow_page) {
> +		if (vmf->flags & FAULT_FLAG_WRITE) {
> +			error = get_block(inode, block, &bh, 1);
> +			if (error || bh.b_size < PAGE_SIZE)
> +				return VM_FAULT_SIGBUS;
> +		} else {
> +			return dax_load_hole(mapping, vmf);
> +		}
> +	}
> +



^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v6 07/22] Replace the XIP page fault handler with the DAX page fault handler
@ 2014-02-28 17:49     ` Toshi Kani
  0 siblings, 0 replies; 79+ messages in thread
From: Toshi Kani @ 2014-02-28 17:49 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-kernel, linux-mm, linux-fsdevel, willy

On Tue, 2014-02-25 at 09:18 -0500, Matthew Wilcox wrote:
> Instead of calling aops->get_xip_mem from the fault handler, the
> filesystem passes a get_block_t that is used to find the appropriate
> blocks.
 :
> +static int do_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
> +			get_block_t get_block)
> +{
> +	struct file *file = vma->vm_file;
> +	struct inode *inode = file_inode(file);
> +	struct address_space *mapping = file->f_mapping;
> +	struct buffer_head bh;
> +	unsigned long vaddr = (unsigned long)vmf->virtual_address;
> +	sector_t block;
> +	pgoff_t size;
> +	unsigned long pfn;
> +	int error;
> +
> +	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
> +	if (vmf->pgoff >= size)
> +		return VM_FAULT_SIGBUS;
> +
> +	memset(&bh, 0, sizeof(bh));
> +	block = (sector_t)vmf->pgoff << (PAGE_SHIFT - inode->i_blkbits);
> +	bh.b_size = PAGE_SIZE;
> +	error = get_block(inode, block, &bh, 0);
> +	if (error || bh.b_size < PAGE_SIZE)
> +		return VM_FAULT_SIGBUS;

I am learning the code and have some questions.  The original code,
xip_file_fault(), jumps to found: and calls vm_insert_mixed() when
get_xip_mem(,,0,,) succeeded.  If get_xip_mem() returns -ENODATA, it
calls either get_xip_mem(,,1,,) or xip_sparse_page().  In this new
function, it looks to me that get_block(,,,0) returns 0 for both cases
(success and -ENODATA previously), which are dealt in the same way.  Is
that right?  If so, is there any reason for the change?  Also, isn't it
possible to call get_block(,,,1) even if get_block(,,,0) found a block?

Thanks,
-Toshi

> +
> +	if (!buffer_written(&bh) && !vmf->cow_page) {
> +		if (vmf->flags & FAULT_FLAG_WRITE) {
> +			error = get_block(inode, block, &bh, 1);
> +			if (error || bh.b_size < PAGE_SIZE)
> +				return VM_FAULT_SIGBUS;
> +		} else {
> +			return dax_load_hole(mapping, vmf);
> +		}
> +	}
> +


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v6 07/22] Replace the XIP page fault handler with the DAX page fault handler
  2014-02-28 17:49     ` Toshi Kani
@ 2014-02-28 20:20       ` Matthew Wilcox
  -1 siblings, 0 replies; 79+ messages in thread
From: Matthew Wilcox @ 2014-02-28 20:20 UTC (permalink / raw)
  To: Toshi Kani; +Cc: Matthew Wilcox, linux-kernel, linux-mm, linux-fsdevel

On Fri, Feb 28, 2014 at 10:49:31AM -0700, Toshi Kani wrote:
> On Tue, 2014-02-25 at 09:18 -0500, Matthew Wilcox wrote:
> > Instead of calling aops->get_xip_mem from the fault handler, the
> > filesystem passes a get_block_t that is used to find the appropriate
> > blocks.
>  :
> > +static int do_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
> > +			get_block_t get_block)
> > +{
> > +	struct file *file = vma->vm_file;
> > +	struct inode *inode = file_inode(file);
> > +	struct address_space *mapping = file->f_mapping;
> > +	struct buffer_head bh;
> > +	unsigned long vaddr = (unsigned long)vmf->virtual_address;
> > +	sector_t block;
> > +	pgoff_t size;
> > +	unsigned long pfn;
> > +	int error;
> > +
> > +	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
> > +	if (vmf->pgoff >= size)
> > +		return VM_FAULT_SIGBUS;
> > +
> > +	memset(&bh, 0, sizeof(bh));
> > +	block = (sector_t)vmf->pgoff << (PAGE_SHIFT - inode->i_blkbits);
> > +	bh.b_size = PAGE_SIZE;
> > +	error = get_block(inode, block, &bh, 0);
> > +	if (error || bh.b_size < PAGE_SIZE)
> > +		return VM_FAULT_SIGBUS;
> 
> I am learning the code and have some questions.

Hi Toshi,

Glad to see you're looking at it.  Let me try to help ...

> The original code,
> xip_file_fault(), jumps to found: and calls vm_insert_mixed() when
> get_xip_mem(,,0,,) succeeded.  If get_xip_mem() returns -ENODATA, it
> calls either get_xip_mem(,,1,,) or xip_sparse_page().  In this new
> function, it looks to me that get_block(,,,0) returns 0 for both cases
> (success and -ENODATA previously), which are dealt in the same way.  Is
> that right?  If so, is there any reason for the change?

Yes, get_xip_mem() returned -ENODATA for a hole.  That was a suboptimal
interface because filesystems are actually capable of returning more
information than that, eg how long the hole is (ext4 *doesn't*, but I
consider that to be a bug).

I don't get to decide what the get_block() interface looks like.  It's the
standard way that the VFS calls back into the filesystem and has been
around for probably close to twenty years at this point.  I'm still trying
to understand exactly what the contract is for get_blocks() ... I have
a document that I'm working on to try to explain it, but it's tough going!

> Also, isn't it
> possible to call get_block(,,,1) even if get_block(,,,0) found a block?

The code in question looks like this:

        error = get_block(inode, block, &bh, 0);
        if (error || bh.b_size < PAGE_SIZE)
                goto sigbus;

        if (!buffer_written(&bh) && !vmf->cow_page) {
                if (vmf->flags & FAULT_FLAG_WRITE) {
                        error = get_block(inode, block, &bh, 1);

where buffer_written is defined as:
        return buffer_mapped(bh) && !buffer_unwritten(bh);

Doing some boolean algebra, that's:

	if (!buffer_mapped || buffer_unwritten)

In either case, we want to tell the filesystem that we're writing to
this block.  At least, that's my current understanding of the get_block()
interface.  I'm open to correction here!

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v6 07/22] Replace the XIP page fault handler with the DAX page fault handler
@ 2014-02-28 20:20       ` Matthew Wilcox
  0 siblings, 0 replies; 79+ messages in thread
From: Matthew Wilcox @ 2014-02-28 20:20 UTC (permalink / raw)
  To: Toshi Kani; +Cc: Matthew Wilcox, linux-kernel, linux-mm, linux-fsdevel

On Fri, Feb 28, 2014 at 10:49:31AM -0700, Toshi Kani wrote:
> On Tue, 2014-02-25 at 09:18 -0500, Matthew Wilcox wrote:
> > Instead of calling aops->get_xip_mem from the fault handler, the
> > filesystem passes a get_block_t that is used to find the appropriate
> > blocks.
>  :
> > +static int do_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
> > +			get_block_t get_block)
> > +{
> > +	struct file *file = vma->vm_file;
> > +	struct inode *inode = file_inode(file);
> > +	struct address_space *mapping = file->f_mapping;
> > +	struct buffer_head bh;
> > +	unsigned long vaddr = (unsigned long)vmf->virtual_address;
> > +	sector_t block;
> > +	pgoff_t size;
> > +	unsigned long pfn;
> > +	int error;
> > +
> > +	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
> > +	if (vmf->pgoff >= size)
> > +		return VM_FAULT_SIGBUS;
> > +
> > +	memset(&bh, 0, sizeof(bh));
> > +	block = (sector_t)vmf->pgoff << (PAGE_SHIFT - inode->i_blkbits);
> > +	bh.b_size = PAGE_SIZE;
> > +	error = get_block(inode, block, &bh, 0);
> > +	if (error || bh.b_size < PAGE_SIZE)
> > +		return VM_FAULT_SIGBUS;
> 
> I am learning the code and have some questions.

Hi Toshi,

Glad to see you're looking at it.  Let me try to help ...

> The original code,
> xip_file_fault(), jumps to found: and calls vm_insert_mixed() when
> get_xip_mem(,,0,,) succeeded.  If get_xip_mem() returns -ENODATA, it
> calls either get_xip_mem(,,1,,) or xip_sparse_page().  In this new
> function, it looks to me that get_block(,,,0) returns 0 for both cases
> (success and -ENODATA previously), which are dealt in the same way.  Is
> that right?  If so, is there any reason for the change?

Yes, get_xip_mem() returned -ENODATA for a hole.  That was a suboptimal
interface because filesystems are actually capable of returning more
information than that, eg how long the hole is (ext4 *doesn't*, but I
consider that to be a bug).

I don't get to decide what the get_block() interface looks like.  It's the
standard way that the VFS calls back into the filesystem and has been
around for probably close to twenty years at this point.  I'm still trying
to understand exactly what the contract is for get_blocks() ... I have
a document that I'm working on to try to explain it, but it's tough going!

> Also, isn't it
> possible to call get_block(,,,1) even if get_block(,,,0) found a block?

The code in question looks like this:

        error = get_block(inode, block, &bh, 0);
        if (error || bh.b_size < PAGE_SIZE)
                goto sigbus;

        if (!buffer_written(&bh) && !vmf->cow_page) {
                if (vmf->flags & FAULT_FLAG_WRITE) {
                        error = get_block(inode, block, &bh, 1);

where buffer_written is defined as:
        return buffer_mapped(bh) && !buffer_unwritten(bh);

Doing some boolean algebra, that's:

	if (!buffer_mapped || buffer_unwritten)

In either case, we want to tell the filesystem that we're writing to
this block.  At least, that's my current understanding of the get_block()
interface.  I'm open to correction here!

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v6 07/22] Replace the XIP page fault handler with the DAX page fault handler
  2014-02-28 20:20       ` Matthew Wilcox
  (?)
@ 2014-02-28 22:18         ` Toshi Kani
  -1 siblings, 0 replies; 79+ messages in thread
From: Toshi Kani @ 2014-02-28 22:18 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Matthew Wilcox, linux-kernel, linux-mm, linux-fsdevel

On Fri, 2014-02-28 at 15:20 -0500, Matthew Wilcox wrote:
> On Fri, Feb 28, 2014 at 10:49:31AM -0700, Toshi Kani wrote:
> > On Tue, 2014-02-25 at 09:18 -0500, Matthew Wilcox wrote:
 :
> Glad to see you're looking at it.  Let me try to help ...

Hi Matt,

Thanks for the help.  This is really a nice work, and I am hoping to
help it... (in some day! :-)

> > The original code,
> > xip_file_fault(), jumps to found: and calls vm_insert_mixed() when
> > get_xip_mem(,,0,,) succeeded.  If get_xip_mem() returns -ENODATA, it
> > calls either get_xip_mem(,,1,,) or xip_sparse_page().  In this new
> > function, it looks to me that get_block(,,,0) returns 0 for both cases
> > (success and -ENODATA previously), which are dealt in the same way.  Is
> > that right?  If so, is there any reason for the change?
> 
> Yes, get_xip_mem() returned -ENODATA for a hole.  That was a suboptimal
> interface because filesystems are actually capable of returning more
> information than that, eg how long the hole is (ext4 *doesn't*, but I
> consider that to be a bug).
> 
> I don't get to decide what the get_block() interface looks like.  It's the
> standard way that the VFS calls back into the filesystem and has been
> around for probably close to twenty years at this point.  I'm still trying
> to understand exactly what the contract is for get_blocks() ... I have
> a document that I'm working on to try to explain it, but it's tough going!

Got it.  Yes, get_block() is a beast for file system newbie like me.
Thanks for working on the document.

> > Also, isn't it
> > possible to call get_block(,,,1) even if get_block(,,,0) found a block?
> 
> The code in question looks like this:
> 
>         error = get_block(inode, block, &bh, 0);
>         if (error || bh.b_size < PAGE_SIZE)
>                 goto sigbus;
> 
>         if (!buffer_written(&bh) && !vmf->cow_page) {
>                 if (vmf->flags & FAULT_FLAG_WRITE) {
>                         error = get_block(inode, block, &bh, 1);
> 
> where buffer_written is defined as:
>         return buffer_mapped(bh) && !buffer_unwritten(bh);
> 
> Doing some boolean algebra, that's:
> 
> 	if (!buffer_mapped || buffer_unwritten)

Oh, I see!  When the first get_block(,,,0) succeeded, this buffer is
mapped.  So, it won't go into this path.

> In either case, we want to tell the filesystem that we're writing to
> this block.  At least, that's my current understanding of the get_block()
> interface.  I'm open to correction here!

Thanks again!
-Toshi


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v6 07/22] Replace the XIP page fault handler with the DAX page fault handler
@ 2014-02-28 22:18         ` Toshi Kani
  0 siblings, 0 replies; 79+ messages in thread
From: Toshi Kani @ 2014-02-28 22:18 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Matthew Wilcox, linux-kernel, linux-mm, linux-fsdevel

On Fri, 2014-02-28 at 15:20 -0500, Matthew Wilcox wrote:
> On Fri, Feb 28, 2014 at 10:49:31AM -0700, Toshi Kani wrote:
> > On Tue, 2014-02-25 at 09:18 -0500, Matthew Wilcox wrote:
 :
> Glad to see you're looking at it.  Let me try to help ...

Hi Matt,

Thanks for the help.  This is really a nice work, and I am hoping to
help it... (in some day! :-)

> > The original code,
> > xip_file_fault(), jumps to found: and calls vm_insert_mixed() when
> > get_xip_mem(,,0,,) succeeded.  If get_xip_mem() returns -ENODATA, it
> > calls either get_xip_mem(,,1,,) or xip_sparse_page().  In this new
> > function, it looks to me that get_block(,,,0) returns 0 for both cases
> > (success and -ENODATA previously), which are dealt in the same way.  Is
> > that right?  If so, is there any reason for the change?
> 
> Yes, get_xip_mem() returned -ENODATA for a hole.  That was a suboptimal
> interface because filesystems are actually capable of returning more
> information than that, eg how long the hole is (ext4 *doesn't*, but I
> consider that to be a bug).
> 
> I don't get to decide what the get_block() interface looks like.  It's the
> standard way that the VFS calls back into the filesystem and has been
> around for probably close to twenty years at this point.  I'm still trying
> to understand exactly what the contract is for get_blocks() ... I have
> a document that I'm working on to try to explain it, but it's tough going!

Got it.  Yes, get_block() is a beast for file system newbie like me.
Thanks for working on the document.

> > Also, isn't it
> > possible to call get_block(,,,1) even if get_block(,,,0) found a block?
> 
> The code in question looks like this:
> 
>         error = get_block(inode, block, &bh, 0);
>         if (error || bh.b_size < PAGE_SIZE)
>                 goto sigbus;
> 
>         if (!buffer_written(&bh) && !vmf->cow_page) {
>                 if (vmf->flags & FAULT_FLAG_WRITE) {
>                         error = get_block(inode, block, &bh, 1);
> 
> where buffer_written is defined as:
>         return buffer_mapped(bh) && !buffer_unwritten(bh);
> 
> Doing some boolean algebra, that's:
> 
> 	if (!buffer_mapped || buffer_unwritten)

Oh, I see!  When the first get_block(,,,0) succeeded, this buffer is
mapped.  So, it won't go into this path.

> In either case, we want to tell the filesystem that we're writing to
> this block.  At least, that's my current understanding of the get_block()
> interface.  I'm open to correction here!

Thanks again!
-Toshi

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v6 07/22] Replace the XIP page fault handler with the DAX page fault handler
@ 2014-02-28 22:18         ` Toshi Kani
  0 siblings, 0 replies; 79+ messages in thread
From: Toshi Kani @ 2014-02-28 22:18 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Matthew Wilcox, linux-kernel, linux-mm, linux-fsdevel

On Fri, 2014-02-28 at 15:20 -0500, Matthew Wilcox wrote:
> On Fri, Feb 28, 2014 at 10:49:31AM -0700, Toshi Kani wrote:
> > On Tue, 2014-02-25 at 09:18 -0500, Matthew Wilcox wrote:
 :
> Glad to see you're looking at it.  Let me try to help ...

Hi Matt,

Thanks for the help.  This is really a nice work, and I am hoping to
help it... (in some day! :-)

> > The original code,
> > xip_file_fault(), jumps to found: and calls vm_insert_mixed() when
> > get_xip_mem(,,0,,) succeeded.  If get_xip_mem() returns -ENODATA, it
> > calls either get_xip_mem(,,1,,) or xip_sparse_page().  In this new
> > function, it looks to me that get_block(,,,0) returns 0 for both cases
> > (success and -ENODATA previously), which are dealt in the same way.  Is
> > that right?  If so, is there any reason for the change?
> 
> Yes, get_xip_mem() returned -ENODATA for a hole.  That was a suboptimal
> interface because filesystems are actually capable of returning more
> information than that, eg how long the hole is (ext4 *doesn't*, but I
> consider that to be a bug).
> 
> I don't get to decide what the get_block() interface looks like.  It's the
> standard way that the VFS calls back into the filesystem and has been
> around for probably close to twenty years at this point.  I'm still trying
> to understand exactly what the contract is for get_blocks() ... I have
> a document that I'm working on to try to explain it, but it's tough going!

Got it.  Yes, get_block() is a beast for file system newbie like me.
Thanks for working on the document.

> > Also, isn't it
> > possible to call get_block(,,,1) even if get_block(,,,0) found a block?
> 
> The code in question looks like this:
> 
>         error = get_block(inode, block, &bh, 0);
>         if (error || bh.b_size < PAGE_SIZE)
>                 goto sigbus;
> 
>         if (!buffer_written(&bh) && !vmf->cow_page) {
>                 if (vmf->flags & FAULT_FLAG_WRITE) {
>                         error = get_block(inode, block, &bh, 1);
> 
> where buffer_written is defined as:
>         return buffer_mapped(bh) && !buffer_unwritten(bh);
> 
> Doing some boolean algebra, that's:
> 
> 	if (!buffer_mapped || buffer_unwritten)

Oh, I see!  When the first get_block(,,,0) succeeded, this buffer is
mapped.  So, it won't go into this path.

> In either case, we want to tell the filesystem that we're writing to
> this block.  At least, that's my current understanding of the get_block()
> interface.  I'm open to correction here!

Thanks again!
-Toshi

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v6 00/22] Support ext4 on NV-DIMMs
  2014-02-25 14:18 ` Matthew Wilcox
@ 2014-03-02  8:22   ` Pavel Machek
  -1 siblings, 0 replies; 79+ messages in thread
From: Pavel Machek @ 2014-03-02  8:22 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-kernel, linux-mm, linux-fsdevel, willy

On Tue 2014-02-25 09:18:16, Matthew Wilcox wrote:
> One of the primary uses for NV-DIMMs is to expose them as a block device
> and use a filesystem to store files on the NV-DIMM.  While that works,
> it currently wastes memory and CPU time buffering the files in the page
> cache.  We have support in ext2 for bypassing the page cache, but it
> has some races which are unfixable in the current design.  This series
> of patches rewrite the underlying support, and add support for direct
> access to ext4.
> 
> This iteration of the patchset renames the "XIP" support to "DAX".
> This fixes the confusion between kernel XIP and filesystem XIP.  It's not
> really about executing in-place; it's about direct access to memory-like
> storage, bypassing the page cache.  DAX is TLA-compliant, retains the
> exciting X, is pronouncable ("Dacks") and is not used elsewhere in
> the kernel.  The only major use of DAX outside the kernel is the German
> stock exchange, and I think that's pretty unlikely to cause
> confusion.

It is TLA compliant, but not widely understood, and probably not
googleable. Could we perhaps use some longer name for a while?

>  create mode 100644 Documentation/filesystems/dax.txt

Its not like filename can't be longer than 3.3, you see?

									Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v6 00/22] Support ext4 on NV-DIMMs
@ 2014-03-02  8:22   ` Pavel Machek
  0 siblings, 0 replies; 79+ messages in thread
From: Pavel Machek @ 2014-03-02  8:22 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-kernel, linux-mm, linux-fsdevel, willy

On Tue 2014-02-25 09:18:16, Matthew Wilcox wrote:
> One of the primary uses for NV-DIMMs is to expose them as a block device
> and use a filesystem to store files on the NV-DIMM.  While that works,
> it currently wastes memory and CPU time buffering the files in the page
> cache.  We have support in ext2 for bypassing the page cache, but it
> has some races which are unfixable in the current design.  This series
> of patches rewrite the underlying support, and add support for direct
> access to ext4.
> 
> This iteration of the patchset renames the "XIP" support to "DAX".
> This fixes the confusion between kernel XIP and filesystem XIP.  It's not
> really about executing in-place; it's about direct access to memory-like
> storage, bypassing the page cache.  DAX is TLA-compliant, retains the
> exciting X, is pronouncable ("Dacks") and is not used elsewhere in
> the kernel.  The only major use of DAX outside the kernel is the German
> stock exchange, and I think that's pretty unlikely to cause
> confusion.

It is TLA compliant, but not widely understood, and probably not
googleable. Could we perhaps use some longer name for a while?

>  create mode 100644 Documentation/filesystems/dax.txt

Its not like filename can't be longer than 3.3, you see?

									Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v6 07/22] Replace the XIP page fault handler with the DAX page fault handler
  2014-02-28 20:20       ` Matthew Wilcox
@ 2014-03-02 23:30         ` Dave Chinner
  -1 siblings, 0 replies; 79+ messages in thread
From: Dave Chinner @ 2014-03-02 23:30 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Toshi Kani, Matthew Wilcox, linux-kernel, linux-mm, linux-fsdevel

On Fri, Feb 28, 2014 at 03:20:31PM -0500, Matthew Wilcox wrote:
> On Fri, Feb 28, 2014 at 10:49:31AM -0700, Toshi Kani wrote:
> > On Tue, 2014-02-25 at 09:18 -0500, Matthew Wilcox wrote:
> > > Instead of calling aops->get_xip_mem from the fault handler, the
> > > filesystem passes a get_block_t that is used to find the appropriate
> > > blocks.
> >  :
> > > +static int do_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
> > > +			get_block_t get_block)
> > > +{
> > > +	struct file *file = vma->vm_file;
> > > +	struct inode *inode = file_inode(file);
> > > +	struct address_space *mapping = file->f_mapping;
> > > +	struct buffer_head bh;
> > > +	unsigned long vaddr = (unsigned long)vmf->virtual_address;
> > > +	sector_t block;
> > > +	pgoff_t size;
> > > +	unsigned long pfn;
> > > +	int error;
> > > +
> > > +	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
> > > +	if (vmf->pgoff >= size)
> > > +		return VM_FAULT_SIGBUS;
> > > +
> > > +	memset(&bh, 0, sizeof(bh));
> > > +	block = (sector_t)vmf->pgoff << (PAGE_SHIFT - inode->i_blkbits);
> > > +	bh.b_size = PAGE_SIZE;
> > > +	error = get_block(inode, block, &bh, 0);
> > > +	if (error || bh.b_size < PAGE_SIZE)
> > > +		return VM_FAULT_SIGBUS;
> > 
> > I am learning the code and have some questions.
> 
> Hi Toshi,
> 
> Glad to see you're looking at it.  Let me try to help ...
> 
> > The original code,
> > xip_file_fault(), jumps to found: and calls vm_insert_mixed() when
> > get_xip_mem(,,0,,) succeeded.  If get_xip_mem() returns -ENODATA, it
> > calls either get_xip_mem(,,1,,) or xip_sparse_page().  In this new
> > function, it looks to me that get_block(,,,0) returns 0 for both cases
> > (success and -ENODATA previously), which are dealt in the same way.  Is
> > that right?  If so, is there any reason for the change?
> 
> Yes, get_xip_mem() returned -ENODATA for a hole.  That was a suboptimal
> interface because filesystems are actually capable of returning more
> information than that, eg how long the hole is (ext4 *doesn't*, but I
> consider that to be a bug).
> 
> I don't get to decide what the get_block() interface looks like.  It's the
> standard way that the VFS calls back into the filesystem and has been
> around for probably close to twenty years at this point.  I'm still trying
> to understand exactly what the contract is for get_blocks() ... I have
> a document that I'm working on to try to explain it, but it's tough going!
> 
> > Also, isn't it
> > possible to call get_block(,,,1) even if get_block(,,,0) found a block?
> 
> The code in question looks like this:
> 
>         error = get_block(inode, block, &bh, 0);
>         if (error || bh.b_size < PAGE_SIZE)
>                 goto sigbus;
> 
>         if (!buffer_written(&bh) && !vmf->cow_page) {
>                 if (vmf->flags & FAULT_FLAG_WRITE) {
>                         error = get_block(inode, block, &bh, 1);
> 
> where buffer_written is defined as:
>         return buffer_mapped(bh) && !buffer_unwritten(bh);
> 
> Doing some boolean algebra, that's:
> 
> 	if (!buffer_mapped || buffer_unwritten)
> 
> In either case, we want to tell the filesystem that we're writing to
> this block.  At least, that's my current understanding of the get_block()
> interface.  I'm open to correction here!

I've got a rewritten version on this that doesn't require two calls
to get_block() that I wrote while prototyping the XFS code. It also
fixes all the misunderstandings about what get_block() actually does
and returns so it works correctly with XFS.

I need to port it forward to your new patch set (hopefully later
this week), so don't spend too much time trying to work out exactly
what this code needs to do...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v6 07/22] Replace the XIP page fault handler with the DAX page fault handler
@ 2014-03-02 23:30         ` Dave Chinner
  0 siblings, 0 replies; 79+ messages in thread
From: Dave Chinner @ 2014-03-02 23:30 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Toshi Kani, Matthew Wilcox, linux-kernel, linux-mm, linux-fsdevel

On Fri, Feb 28, 2014 at 03:20:31PM -0500, Matthew Wilcox wrote:
> On Fri, Feb 28, 2014 at 10:49:31AM -0700, Toshi Kani wrote:
> > On Tue, 2014-02-25 at 09:18 -0500, Matthew Wilcox wrote:
> > > Instead of calling aops->get_xip_mem from the fault handler, the
> > > filesystem passes a get_block_t that is used to find the appropriate
> > > blocks.
> >  :
> > > +static int do_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
> > > +			get_block_t get_block)
> > > +{
> > > +	struct file *file = vma->vm_file;
> > > +	struct inode *inode = file_inode(file);
> > > +	struct address_space *mapping = file->f_mapping;
> > > +	struct buffer_head bh;
> > > +	unsigned long vaddr = (unsigned long)vmf->virtual_address;
> > > +	sector_t block;
> > > +	pgoff_t size;
> > > +	unsigned long pfn;
> > > +	int error;
> > > +
> > > +	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
> > > +	if (vmf->pgoff >= size)
> > > +		return VM_FAULT_SIGBUS;
> > > +
> > > +	memset(&bh, 0, sizeof(bh));
> > > +	block = (sector_t)vmf->pgoff << (PAGE_SHIFT - inode->i_blkbits);
> > > +	bh.b_size = PAGE_SIZE;
> > > +	error = get_block(inode, block, &bh, 0);
> > > +	if (error || bh.b_size < PAGE_SIZE)
> > > +		return VM_FAULT_SIGBUS;
> > 
> > I am learning the code and have some questions.
> 
> Hi Toshi,
> 
> Glad to see you're looking at it.  Let me try to help ...
> 
> > The original code,
> > xip_file_fault(), jumps to found: and calls vm_insert_mixed() when
> > get_xip_mem(,,0,,) succeeded.  If get_xip_mem() returns -ENODATA, it
> > calls either get_xip_mem(,,1,,) or xip_sparse_page().  In this new
> > function, it looks to me that get_block(,,,0) returns 0 for both cases
> > (success and -ENODATA previously), which are dealt in the same way.  Is
> > that right?  If so, is there any reason for the change?
> 
> Yes, get_xip_mem() returned -ENODATA for a hole.  That was a suboptimal
> interface because filesystems are actually capable of returning more
> information than that, eg how long the hole is (ext4 *doesn't*, but I
> consider that to be a bug).
> 
> I don't get to decide what the get_block() interface looks like.  It's the
> standard way that the VFS calls back into the filesystem and has been
> around for probably close to twenty years at this point.  I'm still trying
> to understand exactly what the contract is for get_blocks() ... I have
> a document that I'm working on to try to explain it, but it's tough going!
> 
> > Also, isn't it
> > possible to call get_block(,,,1) even if get_block(,,,0) found a block?
> 
> The code in question looks like this:
> 
>         error = get_block(inode, block, &bh, 0);
>         if (error || bh.b_size < PAGE_SIZE)
>                 goto sigbus;
> 
>         if (!buffer_written(&bh) && !vmf->cow_page) {
>                 if (vmf->flags & FAULT_FLAG_WRITE) {
>                         error = get_block(inode, block, &bh, 1);
> 
> where buffer_written is defined as:
>         return buffer_mapped(bh) && !buffer_unwritten(bh);
> 
> Doing some boolean algebra, that's:
> 
> 	if (!buffer_mapped || buffer_unwritten)
> 
> In either case, we want to tell the filesystem that we're writing to
> this block.  At least, that's my current understanding of the get_block()
> interface.  I'm open to correction here!

I've got a rewritten version on this that doesn't require two calls
to get_block() that I wrote while prototyping the XFS code. It also
fixes all the misunderstandings about what get_block() actually does
and returns so it works correctly with XFS.

I need to port it forward to your new patch set (hopefully later
this week), so don't spend too much time trying to work out exactly
what this code needs to do...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v6 07/22] Replace the XIP page fault handler with the DAX page fault handler
  2014-03-02 23:30         ` Dave Chinner
@ 2014-03-03 23:07           ` Ross Zwisler
  -1 siblings, 0 replies; 79+ messages in thread
From: Ross Zwisler @ 2014-03-03 23:07 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Matthew Wilcox, Toshi Kani, Matthew Wilcox, linux-kernel,
	linux-mm, linux-fsdevel

On Mon, 3 Mar 2014, Dave Chinner wrote:
> On Fri, Feb 28, 2014 at 03:20:31PM -0500, Matthew Wilcox wrote:
> > On Fri, Feb 28, 2014 at 10:49:31AM -0700, Toshi Kani wrote:
> > > The original code,
> > > xip_file_fault(), jumps to found: and calls vm_insert_mixed() when
> > > get_xip_mem(,,0,,) succeeded.  If get_xip_mem() returns -ENODATA, it
> > > calls either get_xip_mem(,,1,,) or xip_sparse_page().  In this new
> > > function, it looks to me that get_block(,,,0) returns 0 for both cases
> > > (success and -ENODATA previously), which are dealt in the same way.  Is
> > > that right?  If so, is there any reason for the change?
> > 
> > Yes, get_xip_mem() returned -ENODATA for a hole.  That was a suboptimal
> > interface because filesystems are actually capable of returning more
> > information than that, eg how long the hole is (ext4 *doesn't*, but I
> > consider that to be a bug).
> > 
> > I don't get to decide what the get_block() interface looks like.  It's the
> > standard way that the VFS calls back into the filesystem and has been
> > around for probably close to twenty years at this point.  I'm still trying
> > to understand exactly what the contract is for get_blocks() ... I have
> > a document that I'm working on to try to explain it, but it's tough going!
> > 
> > > Also, isn't it
> > > possible to call get_block(,,,1) even if get_block(,,,0) found a block?
> > 
> > The code in question looks like this:
> > 
> >         error = get_block(inode, block, &bh, 0);
> >         if (error || bh.b_size < PAGE_SIZE)
> >                 goto sigbus;
> > 
> >         if (!buffer_written(&bh) && !vmf->cow_page) {
> >                 if (vmf->flags & FAULT_FLAG_WRITE) {
> >                         error = get_block(inode, block, &bh, 1);
> > 
> > where buffer_written is defined as:
> >         return buffer_mapped(bh) && !buffer_unwritten(bh);
> > 
> > Doing some boolean algebra, that's:
> > 
> > 	if (!buffer_mapped || buffer_unwritten)
> > 
> > In either case, we want to tell the filesystem that we're writing to
> > this block.  At least, that's my current understanding of the get_block()
> > interface.  I'm open to correction here!
> 
> I've got a rewritten version on this that doesn't require two calls
> to get_block() that I wrote while prototyping the XFS code. It also
> fixes all the misunderstandings about what get_block() actually does
> and returns so it works correctly with XFS.
> 
> I need to port it forward to your new patch set (hopefully later
> this week), so don't spend too much time trying to work out exactly
> what this code needs to do...

Here is a writeup from Matthew Wilcox describing the get_block() interface.

He sent this to me before Dave sent out the latest mail in this thread. :)

Corrections and updates are very welcome.

- Ross

================

get_block_t is used by the VFS to ask filesystem to translate logical
blocks within a file to sectors on a block device.

typedef int (get_block_t)(struct inode *inode, sector_t iblock,
                        struct buffer_head *bh_result, int create);

get_block() must not be called simultaneously with the create flag set
for overlapping extents *** or is/was this a bug in ext2? ***

Despite the iblock argument having type sector_t, iblock is actually
in units of the file block size, not in units of 512-byte sectors.
iblock must not extend beyond i_size. *** um, looks like xfs permits
this ... ? ***

If there is no current mapping from the block to the media, one will be
created if 'create' is set to 1.  'create' should not be set to a value
other than '0' or '1'.

On entry, bh_result should have b_size set to the number of bytes that the
caller is interested in and b_state initialised to zero.  b_size should
be a multiple of the file block size.  On exit, bh_result describes a
physically contiguous extent starting at iblock.  b_size will not be
increased by get_block, but it may be decreased if the filesystem extent
is shorter than the extent requested.

If bh_result describes an extent that is allocated, then BH_Mapped will
be set, and b_bdev and b_blocknr will be set to indicate the physical
location on the media.  The filesystem may wish to use map_bh() in order
to set BH_Mapped and initialise b_bdev, b_blocknr and b_size.  If the
filesystem knows that the extent is read into memory (eg because it
decided to populate the page cache as part of its get_block operation),
it should set BH_Uptodate.

If bh_result describes a hole, the filesystem should clear BH_Mapped and
set BH_Uptodate.  It will not set b_bdev or b_blocknr, but it should set
b_size to indicate the length of the hole.  It may also opt to leave
bh_result untouched as described above.  If the block corresponds to
a hole, bh_result *may* be unmodified, but the VFS can optimise some
operations if the filesystem reports the length of the hole as described
below. *** or is this a bug in ext4? ***

If bh_result describes an extent which has data in the pagecache, but
that data has not yet had space allocated on the media (due to delayed
allocation), BH_Mapped, BH_Uptodate and BH_Delay will be set.  b_blocknr
is not set.

If bh_result describes an extent which is reserved for data which has not
yet been written, BH_Unwritten is set, b_bdev and b_blocknr are set.

Other flags that may be set include BH_New (to indicate that the buffer
was newly allocated and may need to be initialised), and BH_Boundary (to
indicate a physical discontinuity after this extent).

The filesystem may choose to set b_private.  Other fields in buffer_head
are legacy buffer-cache uses and will not be modified by get_block.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v6 07/22] Replace the XIP page fault handler with the DAX page fault handler
@ 2014-03-03 23:07           ` Ross Zwisler
  0 siblings, 0 replies; 79+ messages in thread
From: Ross Zwisler @ 2014-03-03 23:07 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Matthew Wilcox, Toshi Kani, Matthew Wilcox, linux-kernel,
	linux-mm, linux-fsdevel

On Mon, 3 Mar 2014, Dave Chinner wrote:
> On Fri, Feb 28, 2014 at 03:20:31PM -0500, Matthew Wilcox wrote:
> > On Fri, Feb 28, 2014 at 10:49:31AM -0700, Toshi Kani wrote:
> > > The original code,
> > > xip_file_fault(), jumps to found: and calls vm_insert_mixed() when
> > > get_xip_mem(,,0,,) succeeded.  If get_xip_mem() returns -ENODATA, it
> > > calls either get_xip_mem(,,1,,) or xip_sparse_page().  In this new
> > > function, it looks to me that get_block(,,,0) returns 0 for both cases
> > > (success and -ENODATA previously), which are dealt in the same way.  Is
> > > that right?  If so, is there any reason for the change?
> > 
> > Yes, get_xip_mem() returned -ENODATA for a hole.  That was a suboptimal
> > interface because filesystems are actually capable of returning more
> > information than that, eg how long the hole is (ext4 *doesn't*, but I
> > consider that to be a bug).
> > 
> > I don't get to decide what the get_block() interface looks like.  It's the
> > standard way that the VFS calls back into the filesystem and has been
> > around for probably close to twenty years at this point.  I'm still trying
> > to understand exactly what the contract is for get_blocks() ... I have
> > a document that I'm working on to try to explain it, but it's tough going!
> > 
> > > Also, isn't it
> > > possible to call get_block(,,,1) even if get_block(,,,0) found a block?
> > 
> > The code in question looks like this:
> > 
> >         error = get_block(inode, block, &bh, 0);
> >         if (error || bh.b_size < PAGE_SIZE)
> >                 goto sigbus;
> > 
> >         if (!buffer_written(&bh) && !vmf->cow_page) {
> >                 if (vmf->flags & FAULT_FLAG_WRITE) {
> >                         error = get_block(inode, block, &bh, 1);
> > 
> > where buffer_written is defined as:
> >         return buffer_mapped(bh) && !buffer_unwritten(bh);
> > 
> > Doing some boolean algebra, that's:
> > 
> > 	if (!buffer_mapped || buffer_unwritten)
> > 
> > In either case, we want to tell the filesystem that we're writing to
> > this block.  At least, that's my current understanding of the get_block()
> > interface.  I'm open to correction here!
> 
> I've got a rewritten version on this that doesn't require two calls
> to get_block() that I wrote while prototyping the XFS code. It also
> fixes all the misunderstandings about what get_block() actually does
> and returns so it works correctly with XFS.
> 
> I need to port it forward to your new patch set (hopefully later
> this week), so don't spend too much time trying to work out exactly
> what this code needs to do...

Here is a writeup from Matthew Wilcox describing the get_block() interface.

He sent this to me before Dave sent out the latest mail in this thread. :)

Corrections and updates are very welcome.

- Ross

================

get_block_t is used by the VFS to ask filesystem to translate logical
blocks within a file to sectors on a block device.

typedef int (get_block_t)(struct inode *inode, sector_t iblock,
                        struct buffer_head *bh_result, int create);

get_block() must not be called simultaneously with the create flag set
for overlapping extents *** or is/was this a bug in ext2? ***

Despite the iblock argument having type sector_t, iblock is actually
in units of the file block size, not in units of 512-byte sectors.
iblock must not extend beyond i_size. *** um, looks like xfs permits
this ... ? ***

If there is no current mapping from the block to the media, one will be
created if 'create' is set to 1.  'create' should not be set to a value
other than '0' or '1'.

On entry, bh_result should have b_size set to the number of bytes that the
caller is interested in and b_state initialised to zero.  b_size should
be a multiple of the file block size.  On exit, bh_result describes a
physically contiguous extent starting at iblock.  b_size will not be
increased by get_block, but it may be decreased if the filesystem extent
is shorter than the extent requested.

If bh_result describes an extent that is allocated, then BH_Mapped will
be set, and b_bdev and b_blocknr will be set to indicate the physical
location on the media.  The filesystem may wish to use map_bh() in order
to set BH_Mapped and initialise b_bdev, b_blocknr and b_size.  If the
filesystem knows that the extent is read into memory (eg because it
decided to populate the page cache as part of its get_block operation),
it should set BH_Uptodate.

If bh_result describes a hole, the filesystem should clear BH_Mapped and
set BH_Uptodate.  It will not set b_bdev or b_blocknr, but it should set
b_size to indicate the length of the hole.  It may also opt to leave
bh_result untouched as described above.  If the block corresponds to
a hole, bh_result *may* be unmodified, but the VFS can optimise some
operations if the filesystem reports the length of the hole as described
below. *** or is this a bug in ext4? ***

If bh_result describes an extent which has data in the pagecache, but
that data has not yet had space allocated on the media (due to delayed
allocation), BH_Mapped, BH_Uptodate and BH_Delay will be set.  b_blocknr
is not set.

If bh_result describes an extent which is reserved for data which has not
yet been written, BH_Unwritten is set, b_bdev and b_blocknr are set.

Other flags that may be set include BH_New (to indicate that the buffer
was newly allocated and may need to be initialised), and BH_Boundary (to
indicate a physical discontinuity after this extent).

The filesystem may choose to set b_private.  Other fields in buffer_head
are legacy buffer-cache uses and will not be modified by get_block.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v6 07/22] Replace the XIP page fault handler with the DAX page fault handler
  2014-03-03 23:07           ` Ross Zwisler
@ 2014-03-04  0:56             ` Dave Chinner
  -1 siblings, 0 replies; 79+ messages in thread
From: Dave Chinner @ 2014-03-04  0:56 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Matthew Wilcox, Toshi Kani, Matthew Wilcox, linux-kernel,
	linux-mm, linux-fsdevel

On Mon, Mar 03, 2014 at 04:07:35PM -0700, Ross Zwisler wrote:
> On Mon, 3 Mar 2014, Dave Chinner wrote:
> > On Fri, Feb 28, 2014 at 03:20:31PM -0500, Matthew Wilcox wrote:
> > > On Fri, Feb 28, 2014 at 10:49:31AM -0700, Toshi Kani wrote:
> > > > The original code,
> > > > xip_file_fault(), jumps to found: and calls vm_insert_mixed() when
> > > > get_xip_mem(,,0,,) succeeded.  If get_xip_mem() returns -ENODATA, it
> > > > calls either get_xip_mem(,,1,,) or xip_sparse_page().  In this new
> > > > function, it looks to me that get_block(,,,0) returns 0 for both cases
> > > > (success and -ENODATA previously), which are dealt in the same way.  Is
> > > > that right?  If so, is there any reason for the change?
> > > 
> > > Yes, get_xip_mem() returned -ENODATA for a hole.  That was a suboptimal
> > > interface because filesystems are actually capable of returning more
> > > information than that, eg how long the hole is (ext4 *doesn't*, but I
> > > consider that to be a bug).
> > > 
> > > I don't get to decide what the get_block() interface looks like.  It's the
> > > standard way that the VFS calls back into the filesystem and has been
> > > around for probably close to twenty years at this point.  I'm still trying
> > > to understand exactly what the contract is for get_blocks() ... I have
> > > a document that I'm working on to try to explain it, but it's tough going!
> > > 
> > > > Also, isn't it
> > > > possible to call get_block(,,,1) even if get_block(,,,0) found a block?
> > > 
> > > The code in question looks like this:
> > > 
> > >         error = get_block(inode, block, &bh, 0);
> > >         if (error || bh.b_size < PAGE_SIZE)
> > >                 goto sigbus;
> > > 
> > >         if (!buffer_written(&bh) && !vmf->cow_page) {
> > >                 if (vmf->flags & FAULT_FLAG_WRITE) {
> > >                         error = get_block(inode, block, &bh, 1);
> > > 
> > > where buffer_written is defined as:
> > >         return buffer_mapped(bh) && !buffer_unwritten(bh);
> > > 
> > > Doing some boolean algebra, that's:
> > > 
> > > 	if (!buffer_mapped || buffer_unwritten)
> > > 
> > > In either case, we want to tell the filesystem that we're writing to
> > > this block.  At least, that's my current understanding of the get_block()
> > > interface.  I'm open to correction here!
> > 
> > I've got a rewritten version on this that doesn't require two calls
> > to get_block() that I wrote while prototyping the XFS code. It also
> > fixes all the misunderstandings about what get_block() actually does
> > and returns so it works correctly with XFS.
> > 
> > I need to port it forward to your new patch set (hopefully later
> > this week), so don't spend too much time trying to work out exactly
> > what this code needs to do...
> 
> Here is a writeup from Matthew Wilcox describing the get_block() interface.
> 
> He sent this to me before Dave sent out the latest mail in this thread. :)
> 
> Corrections and updates are very welcome.
> 
> - Ross
> 
> ================
> 
> get_block_t is used by the VFS to ask filesystem to translate logical
> blocks within a file to sectors on a block device.
> 
> typedef int (get_block_t)(struct inode *inode, sector_t iblock,
>                         struct buffer_head *bh_result, int create);
> 
> get_block() must not be called simultaneously with the create flag set
> for overlapping extents *** or is/was this a bug in ext2? ***

Why not? The filesystem is responsible for serialising such requests
internally, or if it can't handle them preventing them from
occurring. XFS has allowed this to occur with concurrent overlapping
direct IO writes for a long time. In the situations where we need
serialisation to avoid data corruption (e.g. overlapping sub-block
IO), we serialise at the IO syscall context (i.e at a much higher
layer).

> Despite the iblock argument having type sector_t, iblock is actually
> in units of the file block size, not in units of 512-byte sectors.
> iblock must not extend beyond i_size. *** um, looks like xfs permits
> this ... ? ***

Of course.  There is nothing that prevents filesystems from mapping
and allocating blocks beyond EOF if they can do so sanely, and
nothign that prevents getblocks from letting them do so. Callers
need to handle the situation appropriately, either by doing their
own EOF checks or by leaving it up to the filesystems to do the
right thing. Buffered IO does the former, direct IO does the latter.

The code in question in __xfs_get_blocks:

        if (!create && direct && offset >= i_size_read(inode))
	                return 0;

This is because XFS does direct IO differently to everyone else.  It
doesn't update the inode size until after IO completion (see
xfs_end_io_direct_write()), and hence it has to be able to allocate
and map blocks beyond EOF.

i.e. only direct IO reads are disallowed if the requested mapping is
beyond EOF. Buffered reads beyond EOF don't get here, but direct IO
does.

> If there is no current mapping from the block to the media, one will be
> created if 'create' is set to 1.  'create' should not be set to a value
> other than '0' or '1'.

*nod*

Though the patch set I referred to created a third value ;)

> On entry, bh_result should have b_size set to the number of bytes that the
> caller is interested in and b_state initialised to zero.  b_size should
> be a multiple of the file block size.  On exit, bh_result describes a
> physically contiguous extent starting at iblock.  b_size will not be
> increased by get_block, but it may be decreased if the filesystem extent
> is shorter than the extent requested.

*nod*

> If bh_result describes an extent that is allocated, then BH_Mapped will
> be set, and b_bdev and b_blocknr will be set to indicate the physical
> location on the media. 

Definition is wrong. If bh_result has a physical mapping that the
*filesystem understands* and can reuse later, then BH_Mapped will be
set.

> The filesystem may wish to use map_bh() in order
> to set BH_Mapped and initialise b_bdev, b_blocknr and b_size.

That's fine.

> If the
> filesystem knows that the extent is read into memory (eg because it
> decided to populate the page cache as part of its get_block operation),
> it should set BH_Uptodate.

I think your terminology is wrong their - it is not valid to
*populate* the page cache during a get_blocks call. page cache
population occurs before *calls* get_blocks to map a page to a
block.  On read (mpage_readpages) the page is attached the page to
the "map_bh". This allows getblocks to initialise the contents of
the buffer being mapped during the getblocks call, and tell the
callee that they need to map_buffer_to_page() to correctly update
the state of the buffers on the page to reflect the state
returned.

So, it's not *populating* the page cache at all - the page cache is
already populated - it's *initialising data* in the page.

Note that the behaviour on write can be different. For example,
__block_write_begin() uses the page uptodate and BH_Uptodate to
determine if it needs to read in data from disk (i.e. a RMW cycle).
Some filesystems will set BH_Uptodate despite not having
instantiated data into the blocks because the block is not
instantiated on disk (BH_delay) or contains zeros (BH_Unwritten) and
higher layers will zero the buffer data appropriately....

> If bh_result describes a hole, the filesystem should clear BH_Mapped and
> set BH_Uptodate.

This is describing create = 0 behaviour (create = 1 means
allocation is required, and so that's going to return something
different).

in the read case, BH_Uptodate should only be set if the filesystem
initialised the data to zero. Otherwise the data is not uptodate,
and the caller needs to zero the hole. Indeed, XFS never sets
BH_Uptodate on a hole mapping because it needs the caller to zero
the data....

> It will not set b_bdev or b_blocknr, but it should set
> b_size to indicate the length of the hole.  It may also opt to leave
> bh_result untouched as described above.  If the block corresponds to
> a hole, bh_result *may* be unmodified, but the VFS can optimise some
> operations if the filesystem reports the length of the hole as described
> below. *** or is this a bug in ext4? ***

mpage_readpages() remaps the hole on every block, so regardless of
the size of the hole getblocks returns it will do the right thing.

> If bh_result describes an extent which has data in the pagecache, but
> that data has not yet had space allocated on the media (due to delayed
> allocation), BH_Mapped, BH_Uptodate and BH_Delay will be set.  b_blocknr
> is not set.

b_blocknr is undefined. filesystems can set it to whatever they
want; they will interpret it correctly when they see BH_Delay on the
buffer.

Again, BH_Uptodate is used here by XFS to prevent
__block_write_begin() from issuing IO on delalloc buffers that have
no data in them yet. That's specific to the filesystem
implementation (as XFS uses __block_write_begin), so this is
actually a requirement of __block_write_begin() for being used with
delalloc buffers, not a requirement of the getblocks interface.

> If bh_result describes an extent which is reserved for data which has not
> yet been written, BH_Unwritten is set, b_bdev and b_blocknr are set.

And if the getblocks call zeroed the data, BH_Uptodate shoul dbe
set, too.

> Other flags that may be set include BH_New (to indicate that the buffer
> was newly allocated and may need to be initialised), and BH_Boundary (to
> indicate a physical discontinuity after this extent).

BH_New deserved more description than that. Yes, it is used to
indicate a newly allocated block, but you'll note that they require
special handling *everywhere*. The common processing is to clear
underlying block device aliases and invalidate metadata mappings,
but it is also used to zero the regions of new buffers that aren't
getting data written to and to inform error handling to zero regions
that are still marked new when an error is detected.

In fact, because of this data initialisation and handling of errors,
XFS also sets BH_new on unwritten extents and any buffer that maps
beyond EOF so that the callers initialise them to contain zeros
appropriately.

> The filesystem may choose to set b_private.  Other fields in buffer_head
> are legacy buffer-cache uses and will not be modified by get_block.

Precisely why we should be moving this interface to a structure like
this:

struct iomap {
	sector_t	blocknr;
	sector_t	length;
	void		*fsprivate;
	unsigned int	state;
};

That is used specifically for returning block maps and directing
allocation, rather than something that conflates mapping, allocation
and data initialisation all into the one interface...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v6 07/22] Replace the XIP page fault handler with the DAX page fault handler
@ 2014-03-04  0:56             ` Dave Chinner
  0 siblings, 0 replies; 79+ messages in thread
From: Dave Chinner @ 2014-03-04  0:56 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Matthew Wilcox, Toshi Kani, Matthew Wilcox, linux-kernel,
	linux-mm, linux-fsdevel

On Mon, Mar 03, 2014 at 04:07:35PM -0700, Ross Zwisler wrote:
> On Mon, 3 Mar 2014, Dave Chinner wrote:
> > On Fri, Feb 28, 2014 at 03:20:31PM -0500, Matthew Wilcox wrote:
> > > On Fri, Feb 28, 2014 at 10:49:31AM -0700, Toshi Kani wrote:
> > > > The original code,
> > > > xip_file_fault(), jumps to found: and calls vm_insert_mixed() when
> > > > get_xip_mem(,,0,,) succeeded.  If get_xip_mem() returns -ENODATA, it
> > > > calls either get_xip_mem(,,1,,) or xip_sparse_page().  In this new
> > > > function, it looks to me that get_block(,,,0) returns 0 for both cases
> > > > (success and -ENODATA previously), which are dealt in the same way.  Is
> > > > that right?  If so, is there any reason for the change?
> > > 
> > > Yes, get_xip_mem() returned -ENODATA for a hole.  That was a suboptimal
> > > interface because filesystems are actually capable of returning more
> > > information than that, eg how long the hole is (ext4 *doesn't*, but I
> > > consider that to be a bug).
> > > 
> > > I don't get to decide what the get_block() interface looks like.  It's the
> > > standard way that the VFS calls back into the filesystem and has been
> > > around for probably close to twenty years at this point.  I'm still trying
> > > to understand exactly what the contract is for get_blocks() ... I have
> > > a document that I'm working on to try to explain it, but it's tough going!
> > > 
> > > > Also, isn't it
> > > > possible to call get_block(,,,1) even if get_block(,,,0) found a block?
> > > 
> > > The code in question looks like this:
> > > 
> > >         error = get_block(inode, block, &bh, 0);
> > >         if (error || bh.b_size < PAGE_SIZE)
> > >                 goto sigbus;
> > > 
> > >         if (!buffer_written(&bh) && !vmf->cow_page) {
> > >                 if (vmf->flags & FAULT_FLAG_WRITE) {
> > >                         error = get_block(inode, block, &bh, 1);
> > > 
> > > where buffer_written is defined as:
> > >         return buffer_mapped(bh) && !buffer_unwritten(bh);
> > > 
> > > Doing some boolean algebra, that's:
> > > 
> > > 	if (!buffer_mapped || buffer_unwritten)
> > > 
> > > In either case, we want to tell the filesystem that we're writing to
> > > this block.  At least, that's my current understanding of the get_block()
> > > interface.  I'm open to correction here!
> > 
> > I've got a rewritten version on this that doesn't require two calls
> > to get_block() that I wrote while prototyping the XFS code. It also
> > fixes all the misunderstandings about what get_block() actually does
> > and returns so it works correctly with XFS.
> > 
> > I need to port it forward to your new patch set (hopefully later
> > this week), so don't spend too much time trying to work out exactly
> > what this code needs to do...
> 
> Here is a writeup from Matthew Wilcox describing the get_block() interface.
> 
> He sent this to me before Dave sent out the latest mail in this thread. :)
> 
> Corrections and updates are very welcome.
> 
> - Ross
> 
> ================
> 
> get_block_t is used by the VFS to ask filesystem to translate logical
> blocks within a file to sectors on a block device.
> 
> typedef int (get_block_t)(struct inode *inode, sector_t iblock,
>                         struct buffer_head *bh_result, int create);
> 
> get_block() must not be called simultaneously with the create flag set
> for overlapping extents *** or is/was this a bug in ext2? ***

Why not? The filesystem is responsible for serialising such requests
internally, or if it can't handle them preventing them from
occurring. XFS has allowed this to occur with concurrent overlapping
direct IO writes for a long time. In the situations where we need
serialisation to avoid data corruption (e.g. overlapping sub-block
IO), we serialise at the IO syscall context (i.e at a much higher
layer).

> Despite the iblock argument having type sector_t, iblock is actually
> in units of the file block size, not in units of 512-byte sectors.
> iblock must not extend beyond i_size. *** um, looks like xfs permits
> this ... ? ***

Of course.  There is nothing that prevents filesystems from mapping
and allocating blocks beyond EOF if they can do so sanely, and
nothign that prevents getblocks from letting them do so. Callers
need to handle the situation appropriately, either by doing their
own EOF checks or by leaving it up to the filesystems to do the
right thing. Buffered IO does the former, direct IO does the latter.

The code in question in __xfs_get_blocks:

        if (!create && direct && offset >= i_size_read(inode))
	                return 0;

This is because XFS does direct IO differently to everyone else.  It
doesn't update the inode size until after IO completion (see
xfs_end_io_direct_write()), and hence it has to be able to allocate
and map blocks beyond EOF.

i.e. only direct IO reads are disallowed if the requested mapping is
beyond EOF. Buffered reads beyond EOF don't get here, but direct IO
does.

> If there is no current mapping from the block to the media, one will be
> created if 'create' is set to 1.  'create' should not be set to a value
> other than '0' or '1'.

*nod*

Though the patch set I referred to created a third value ;)

> On entry, bh_result should have b_size set to the number of bytes that the
> caller is interested in and b_state initialised to zero.  b_size should
> be a multiple of the file block size.  On exit, bh_result describes a
> physically contiguous extent starting at iblock.  b_size will not be
> increased by get_block, but it may be decreased if the filesystem extent
> is shorter than the extent requested.

*nod*

> If bh_result describes an extent that is allocated, then BH_Mapped will
> be set, and b_bdev and b_blocknr will be set to indicate the physical
> location on the media. 

Definition is wrong. If bh_result has a physical mapping that the
*filesystem understands* and can reuse later, then BH_Mapped will be
set.

> The filesystem may wish to use map_bh() in order
> to set BH_Mapped and initialise b_bdev, b_blocknr and b_size.

That's fine.

> If the
> filesystem knows that the extent is read into memory (eg because it
> decided to populate the page cache as part of its get_block operation),
> it should set BH_Uptodate.

I think your terminology is wrong their - it is not valid to
*populate* the page cache during a get_blocks call. page cache
population occurs before *calls* get_blocks to map a page to a
block.  On read (mpage_readpages) the page is attached the page to
the "map_bh". This allows getblocks to initialise the contents of
the buffer being mapped during the getblocks call, and tell the
callee that they need to map_buffer_to_page() to correctly update
the state of the buffers on the page to reflect the state
returned.

So, it's not *populating* the page cache at all - the page cache is
already populated - it's *initialising data* in the page.

Note that the behaviour on write can be different. For example,
__block_write_begin() uses the page uptodate and BH_Uptodate to
determine if it needs to read in data from disk (i.e. a RMW cycle).
Some filesystems will set BH_Uptodate despite not having
instantiated data into the blocks because the block is not
instantiated on disk (BH_delay) or contains zeros (BH_Unwritten) and
higher layers will zero the buffer data appropriately....

> If bh_result describes a hole, the filesystem should clear BH_Mapped and
> set BH_Uptodate.

This is describing create = 0 behaviour (create = 1 means
allocation is required, and so that's going to return something
different).

in the read case, BH_Uptodate should only be set if the filesystem
initialised the data to zero. Otherwise the data is not uptodate,
and the caller needs to zero the hole. Indeed, XFS never sets
BH_Uptodate on a hole mapping because it needs the caller to zero
the data....

> It will not set b_bdev or b_blocknr, but it should set
> b_size to indicate the length of the hole.  It may also opt to leave
> bh_result untouched as described above.  If the block corresponds to
> a hole, bh_result *may* be unmodified, but the VFS can optimise some
> operations if the filesystem reports the length of the hole as described
> below. *** or is this a bug in ext4? ***

mpage_readpages() remaps the hole on every block, so regardless of
the size of the hole getblocks returns it will do the right thing.

> If bh_result describes an extent which has data in the pagecache, but
> that data has not yet had space allocated on the media (due to delayed
> allocation), BH_Mapped, BH_Uptodate and BH_Delay will be set.  b_blocknr
> is not set.

b_blocknr is undefined. filesystems can set it to whatever they
want; they will interpret it correctly when they see BH_Delay on the
buffer.

Again, BH_Uptodate is used here by XFS to prevent
__block_write_begin() from issuing IO on delalloc buffers that have
no data in them yet. That's specific to the filesystem
implementation (as XFS uses __block_write_begin), so this is
actually a requirement of __block_write_begin() for being used with
delalloc buffers, not a requirement of the getblocks interface.

> If bh_result describes an extent which is reserved for data which has not
> yet been written, BH_Unwritten is set, b_bdev and b_blocknr are set.

And if the getblocks call zeroed the data, BH_Uptodate shoul dbe
set, too.

> Other flags that may be set include BH_New (to indicate that the buffer
> was newly allocated and may need to be initialised), and BH_Boundary (to
> indicate a physical discontinuity after this extent).

BH_New deserved more description than that. Yes, it is used to
indicate a newly allocated block, but you'll note that they require
special handling *everywhere*. The common processing is to clear
underlying block device aliases and invalidate metadata mappings,
but it is also used to zero the regions of new buffers that aren't
getting data written to and to inform error handling to zero regions
that are still marked new when an error is detected.

In fact, because of this data initialisation and handling of errors,
XFS also sets BH_new on unwritten extents and any buffer that maps
beyond EOF so that the callers initialise them to contain zeros
appropriately.

> The filesystem may choose to set b_private.  Other fields in buffer_head
> are legacy buffer-cache uses and will not be modified by get_block.

Precisely why we should be moving this interface to a structure like
this:

struct iomap {
	sector_t	blocknr;
	sector_t	length;
	void		*fsprivate;
	unsigned int	state;
};

That is used specifically for returning block maps and directing
allocation, rather than something that conflates mapping, allocation
and data initialisation all into the one interface...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v6 06/22] Replace XIP read and write with DAX I/O
  2014-02-25 14:18   ` Matthew Wilcox
@ 2014-03-11  0:32     ` Toshi Kani
  -1 siblings, 0 replies; 79+ messages in thread
From: Toshi Kani @ 2014-03-11  0:32 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-kernel, linux-mm, linux-fsdevel, willy

On Tue, 2014-02-25 at 09:18 -0500, Matthew Wilcox wrote:
> Use the generic AIO infrastructure instead of custom read and write
> methods.  In addition to giving us support for AIO, this adds the missing
> locking between read() and truncate().
> 
 :
> +static void dax_new_buf(void *addr, unsigned size, unsigned first,
> +					loff_t offset, loff_t end, int rw)
> +{
> +	loff_t final = end - offset;	/* The final byte in this buffer */

I may be missing something, but shouldn't it take first into account?

	loff_t final = end - offset + first;

Thanks,
-Toshi


> +	if (rw != WRITE) {
> +		memset(addr, 0, size);
> +		return;
> +	}
> +
> +	if (first > 0)
> +		memset(addr, 0, first);
> +	if (final < size)
> +		memset(addr + final, 0, size - final);
> +}




^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v6 06/22] Replace XIP read and write with DAX I/O
@ 2014-03-11  0:32     ` Toshi Kani
  0 siblings, 0 replies; 79+ messages in thread
From: Toshi Kani @ 2014-03-11  0:32 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-kernel, linux-mm, linux-fsdevel, willy

On Tue, 2014-02-25 at 09:18 -0500, Matthew Wilcox wrote:
> Use the generic AIO infrastructure instead of custom read and write
> methods.  In addition to giving us support for AIO, this adds the missing
> locking between read() and truncate().
> 
 :
> +static void dax_new_buf(void *addr, unsigned size, unsigned first,
> +					loff_t offset, loff_t end, int rw)
> +{
> +	loff_t final = end - offset;	/* The final byte in this buffer */

I may be missing something, but shouldn't it take first into account?

	loff_t final = end - offset + first;

Thanks,
-Toshi


> +	if (rw != WRITE) {
> +		memset(addr, 0, size);
> +		return;
> +	}
> +
> +	if (first > 0)
> +		memset(addr, 0, first);
> +	if (final < size)
> +		memset(addr + final, 0, size - final);
> +}



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v6 06/22] Replace XIP read and write with DAX I/O
  2014-03-11  0:32     ` Toshi Kani
@ 2014-03-11 12:53       ` Matthew Wilcox
  -1 siblings, 0 replies; 79+ messages in thread
From: Matthew Wilcox @ 2014-03-11 12:53 UTC (permalink / raw)
  To: Toshi Kani; +Cc: Matthew Wilcox, linux-kernel, linux-mm, linux-fsdevel

On Mon, Mar 10, 2014 at 06:32:38PM -0600, Toshi Kani wrote:
> On Tue, 2014-02-25 at 09:18 -0500, Matthew Wilcox wrote:
> > Use the generic AIO infrastructure instead of custom read and write
> > methods.  In addition to giving us support for AIO, this adds the missing
> > locking between read() and truncate().
> > 
>  :
> > +static void dax_new_buf(void *addr, unsigned size, unsigned first,
> > +					loff_t offset, loff_t end, int rw)
> > +{
> > +	loff_t final = end - offset;	/* The final byte in this buffer */
> 
> I may be missing something, but shouldn't it take first into account?
> 
> 	loff_t final = end - offset + first;

Yes it should.  Thanks!  (Fortunately, this is only a performance problem
as we'll end up zeroing more than we ought to, which is fine as it will
be overwritten by the copy_from_user later)


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v6 06/22] Replace XIP read and write with DAX I/O
@ 2014-03-11 12:53       ` Matthew Wilcox
  0 siblings, 0 replies; 79+ messages in thread
From: Matthew Wilcox @ 2014-03-11 12:53 UTC (permalink / raw)
  To: Toshi Kani; +Cc: Matthew Wilcox, linux-kernel, linux-mm, linux-fsdevel

On Mon, Mar 10, 2014 at 06:32:38PM -0600, Toshi Kani wrote:
> On Tue, 2014-02-25 at 09:18 -0500, Matthew Wilcox wrote:
> > Use the generic AIO infrastructure instead of custom read and write
> > methods.  In addition to giving us support for AIO, this adds the missing
> > locking between read() and truncate().
> > 
>  :
> > +static void dax_new_buf(void *addr, unsigned size, unsigned first,
> > +					loff_t offset, loff_t end, int rw)
> > +{
> > +	loff_t final = end - offset;	/* The final byte in this buffer */
> 
> I may be missing something, but shouldn't it take first into account?
> 
> 	loff_t final = end - offset + first;

Yes it should.  Thanks!  (Fortunately, this is only a performance problem
as we'll end up zeroing more than we ought to, which is fine as it will
be overwritten by the copy_from_user later)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v6 20/22] ext4: Add DAX functionality
       [not found] ` <CF4DEE22.25C8F%matthew.r.wilcox@intel.com>
@ 2014-03-18 18:45     ` Ross Zwisler
  0 siblings, 0 replies; 79+ messages in thread
From: Ross Zwisler @ 2014-03-18 18:45 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-kernel, linux-mm, linux-fsdevel, willy, Ross Zwisler

On Tue, 25 Feb 2014, Matthew Wilcox wrote:
> From: Ross Zwisler <ross.zwisler@linux.intel.com>
> 
> This is a port of the DAX functionality found in the current version of
> ext2.
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
> [heavily tweaked]
> Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>

...

> diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
> index 594009f..dbdacef 100644
> --- a/fs/ext4/indirect.c
> +++ b/fs/ext4/indirect.c
> @@ -686,15 +686,22 @@ retry:
>  			inode_dio_done(inode);
>  			goto locked;
>  		}
> -		ret = __blockdev_direct_IO(rw, iocb, inode,
> -				 inode->i_sb->s_bdev, iov,
> -				 offset, nr_segs,
> -				 ext4_get_block, NULL, NULL, 0);
> +		if (IS_DAX(inode))
> +			ret = dax_do_io(rw, iocb, inode, iov, offset, nr_segs,
> +					ext4_get_block, NULL, 0);
> +		else
> +			ret = __blockdev_direct_IO(rw, iocb, inode,
> +					inode->i_sb->s_bdev, iov, offset,
> +					nr_segs, ext4_get_block, NULL, NULL, 0);
>  		inode_dio_done(inode);
>  	} else {
>  locked:
> -		ret = blockdev_direct_IO(rw, iocb, inode, iov,
> -				 offset, nr_segs, ext4_get_block);
> +		if (IS_DAX(inode))
> +			ret = dax_do_io(rw, iocb, inode, iov, offset, nr_segs,
> +					ext4_get_block, NULL, 0);

We need to pass in a DIO_LOCKING flag to this call to dax_do_io.  This flag is
provided correctly in ext2_direct_IO which is the only other place I found
where we have a call to dax_do_io as an alternative to blockdev_direct_IO.

The other calls to dax_do_io are alternatives to __blockdev_direct_IO, which
has an explicit flags parameter.  I believe all of these cases are being
handled correctly.

- Ross

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v6 20/22] ext4: Add DAX functionality
@ 2014-03-18 18:45     ` Ross Zwisler
  0 siblings, 0 replies; 79+ messages in thread
From: Ross Zwisler @ 2014-03-18 18:45 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-kernel, linux-mm, linux-fsdevel, willy, Ross Zwisler

On Tue, 25 Feb 2014, Matthew Wilcox wrote:
> From: Ross Zwisler <ross.zwisler@linux.intel.com>
> 
> This is a port of the DAX functionality found in the current version of
> ext2.
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
> [heavily tweaked]
> Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>

...

> diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
> index 594009f..dbdacef 100644
> --- a/fs/ext4/indirect.c
> +++ b/fs/ext4/indirect.c
> @@ -686,15 +686,22 @@ retry:
>  			inode_dio_done(inode);
>  			goto locked;
>  		}
> -		ret = __blockdev_direct_IO(rw, iocb, inode,
> -				 inode->i_sb->s_bdev, iov,
> -				 offset, nr_segs,
> -				 ext4_get_block, NULL, NULL, 0);
> +		if (IS_DAX(inode))
> +			ret = dax_do_io(rw, iocb, inode, iov, offset, nr_segs,
> +					ext4_get_block, NULL, 0);
> +		else
> +			ret = __blockdev_direct_IO(rw, iocb, inode,
> +					inode->i_sb->s_bdev, iov, offset,
> +					nr_segs, ext4_get_block, NULL, NULL, 0);
>  		inode_dio_done(inode);
>  	} else {
>  locked:
> -		ret = blockdev_direct_IO(rw, iocb, inode, iov,
> -				 offset, nr_segs, ext4_get_block);
> +		if (IS_DAX(inode))
> +			ret = dax_do_io(rw, iocb, inode, iov, offset, nr_segs,
> +					ext4_get_block, NULL, 0);

We need to pass in a DIO_LOCKING flag to this call to dax_do_io.  This flag is
provided correctly in ext2_direct_IO which is the only other place I found
where we have a call to dax_do_io as an alternative to blockdev_direct_IO.

The other calls to dax_do_io are alternatives to __blockdev_direct_IO, which
has an explicit flags parameter.  I believe all of these cases are being
handled correctly.

- Ross

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v6 07/22] Replace the XIP page fault handler with the DAX page fault handler
  2014-03-04  0:56             ` Dave Chinner
@ 2014-03-20 19:38               ` Matthew Wilcox
  -1 siblings, 0 replies; 79+ messages in thread
From: Matthew Wilcox @ 2014-03-20 19:38 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Ross Zwisler, Toshi Kani, Matthew Wilcox, linux-kernel, linux-mm,
	linux-fsdevel

On Tue, Mar 04, 2014 at 11:56:25AM +1100, Dave Chinner wrote:
> > get_block_t is used by the VFS to ask filesystem to translate logical
> > blocks within a file to sectors on a block device.
> > 
> > typedef int (get_block_t)(struct inode *inode, sector_t iblock,
> >                         struct buffer_head *bh_result, int create);
> > 
> > get_block() must not be called simultaneously with the create flag set
> > for overlapping extents *** or is/was this a bug in ext2? ***
> 
> Why not? The filesystem is responsible for serialising such requests
> internally, or if it can't handle them preventing them from
> occurring. XFS has allowed this to occur with concurrent overlapping
> direct IO writes for a long time. In the situations where we need
> serialisation to avoid data corruption (e.g. overlapping sub-block
> IO), we serialise at the IO syscall context (i.e at a much higher
> layer).

I'm basing this on this commit:

commit 14bac5acfdb6a40be64acc042c6db73f1a68f6a4
Author: Nick Piggin <npiggin@suse.de>
Date:   Wed Aug 20 14:09:20 2008 -0700

    mm: xip/ext2 fix block allocation race
    
    XIP can call into get_xip_mem concurrently with the same file,offset with
    create=1.  This usually maps down to get_block, which expects the page
    lock to prevent such a situation.  This causes ext2 to explode for one
    reason or another.
    
    Serialise those calls for the moment.  For common usages today, I suspect
    get_xip_mem rarely is called to create new blocks.  In future as XIP
    technologies evolve we might need to look at which operations require
    scalability, and rework the locking to suit.

Now, specifically for DAX, reads and writes are serialised by DIO_LOCKING,
just like direct I/O.  *currently*, simultaneous pagefaults are not
serialised against each other at all (or reads/writes), which means we
can call get_block() with create=1 simultaneously for the same block.
For a variety of reasons (scalability, fix a subtle race), I'm going to
implement an equivalent to the page lock for pages of DAX memory, so in
the future the DAX code will also not call into get_block in parallel.

So what should the document say here?  It sounds like it's similar to
the iblock beyond i_size below; that the filesystem and the VFS should
cooperate to ensure that it only happens if the filesystem can make
it work.

> > Despite the iblock argument having type sector_t, iblock is actually
> > in units of the file block size, not in units of 512-byte sectors.
> > iblock must not extend beyond i_size. *** um, looks like xfs permits
> > this ... ? ***
> 
> Of course.  There is nothing that prevents filesystems from mapping
> and allocating blocks beyond EOF if they can do so sanely, and
> nothign that prevents getblocks from letting them do so. Callers
> need to handle the situation appropriately, either by doing their
> own EOF checks or by leaving it up to the filesystems to do the
> right thing. Buffered IO does the former, direct IO does the latter.
> 
> The code in question in __xfs_get_blocks:
> 
>         if (!create && direct && offset >= i_size_read(inode))
> 	                return 0;
> 
> This is because XFS does direct IO differently to everyone else.  It
> doesn't update the inode size until after IO completion (see
> xfs_end_io_direct_write()), and hence it has to be able to allocate
> and map blocks beyond EOF.
> 
> i.e. only direct IO reads are disallowed if the requested mapping is
> beyond EOF. Buffered reads beyond EOF don't get here, but direct IO
> does.

So how can we document this?  'iblock must not be beyond i_size unless
it's allowed to be'?  'iblock may only be beyond end of file for direct
I/O'?

> > If bh_result describes an extent that is allocated, then BH_Mapped will
> > be set, and b_bdev and b_blocknr will be set to indicate the physical
> > location on the media. 
> 
> Definition is wrong. If bh_result has a physical mapping that the
> *filesystem understands* and can reuse later, then BH_Mapped will be
> set.

Ah, yes, fair point.

> > If the
> > filesystem knows that the extent is read into memory (eg because it
> > decided to populate the page cache as part of its get_block operation),
> > it should set BH_Uptodate.
> 
> I think your terminology is wrong their - it is not valid to
> *populate* the page cache during a get_blocks call. page cache
> population occurs before *calls* get_blocks to map a page to a
> block.  On read (mpage_readpages) the page is attached the page to
> the "map_bh". This allows getblocks to initialise the contents of
> the buffer being mapped during the getblocks call, and tell the
> callee that they need to map_buffer_to_page() to correctly update
> the state of the buffers on the page to reflect the state
> returned.
> 
> So, it's not *populating* the page cache at all - the page cache is
> already populated - it's *initialising data* in the page.

Ah!  I misunderstood this comment:

                        /*
                         * get_block() might have updated the buffer
                         * synchronously
                         */
                        if (buffer_uptodate(bh))
                                continue;

I'll fix the text.

> Note that the behaviour on write can be different. For example,
> __block_write_begin() uses the page uptodate and BH_Uptodate to
> determine if it needs to read in data from disk (i.e. a RMW cycle).
> Some filesystems will set BH_Uptodate despite not having
> instantiated data into the blocks because the block is not
> instantiated on disk (BH_delay) or contains zeros (BH_Unwritten) and
> higher layers will zero the buffer data appropriately....
> 
> > If bh_result describes a hole, the filesystem should clear BH_Mapped and
> > set BH_Uptodate.
> 
> This is describing create = 0 behaviour (create = 1 means
> allocation is required, and so that's going to return something
> different).

Sure, but this whole section is describing the state of the returned
buffer.  I agree that you're not going to get a bh_result that describes
a hole when you pass in create=1.

> in the read case, BH_Uptodate should only be set if the filesystem
> initialised the data to zero. Otherwise the data is not uptodate,
> and the caller needs to zero the hole. Indeed, XFS never sets
> BH_Uptodate on a hole mapping because it needs the caller to zero
> the data....

Aha, good to know, thanks!

> > It will not set b_bdev or b_blocknr, but it should set
> > b_size to indicate the length of the hole.  It may also opt to leave
> > bh_result untouched as described above.  If the block corresponds to
> > a hole, bh_result *may* be unmodified, but the VFS can optimise some
> > operations if the filesystem reports the length of the hole as described
> > below. *** or is this a bug in ext4? ***
> 
> mpage_readpages() remaps the hole on every block, so regardless of
> the size of the hole getblocks returns it will do the right thing.

Yes, this is true, but it could be more efficient if it could rely on
get_block to return an accurate length of the hole.

> > If bh_result describes an extent which has data in the pagecache, but
> > that data has not yet had space allocated on the media (due to delayed
> > allocation), BH_Mapped, BH_Uptodate and BH_Delay will be set.  b_blocknr
> > is not set.
> 
> b_blocknr is undefined. filesystems can set it to whatever they
> want; they will interpret it correctly when they see BH_Delay on the
> buffer.

Thanks, I'll fix that.

> Again, BH_Uptodate is used here by XFS to prevent
> __block_write_begin() from issuing IO on delalloc buffers that have
> no data in them yet. That's specific to the filesystem
> implementation (as XFS uses __block_write_begin), so this is
> actually a requirement of __block_write_begin() for being used with
> delalloc buffers, not a requirement of the getblocks interface.

So ... "BH_Mapped and BH_Delay will be set.  BH_Uptodate may be set.
b_blocknr may be used by the filesystem for its own purpose."

> > If bh_result describes an extent which is reserved for data which has not
> > yet been written, BH_Unwritten is set, b_bdev and b_blocknr are set.
> 
> And if the getblocks call zeroed the data, BH_Uptodate shoul dbe
> set, too.
> 
> > Other flags that may be set include BH_New (to indicate that the buffer
> > was newly allocated and may need to be initialised), and BH_Boundary (to
> > indicate a physical discontinuity after this extent).
> 
> BH_New deserved more description than that. Yes, it is used to
> indicate a newly allocated block, but you'll note that they require
> special handling *everywhere*. The common processing is to clear
> underlying block device aliases and invalidate metadata mappings,
> but it is also used to zero the regions of new buffers that aren't
> getting data written to and to inform error handling to zero regions
> that are still marked new when an error is detected.
> 
> In fact, because of this data initialisation and handling of errors,
> XFS also sets BH_new on unwritten extents and any buffer that maps
> beyond EOF so that the callers initialise them to contain zeros
> appropriately.
> 
> > The filesystem may choose to set b_private.  Other fields in buffer_head
> > are legacy buffer-cache uses and will not be modified by get_block.
> 
> Precisely why we should be moving this interface to a structure like
> this:
> 
> struct iomap {
> 	sector_t	blocknr;
> 	sector_t	length;
> 	void		*fsprivate;
> 	unsigned int	state;
> };
> 
> That is used specifically for returning block maps and directing
> allocation, rather than something that conflates mapping, allocation
> and data initialisation all into the one interface...

You missed:
	struct block_device *bdev;

And I agree, but documentation of the current get_block interface is
not the place to launch into a polemic on how things ought to be.  It's
reasonable to point out infelicities ... which I do in a few places :-)

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v6 07/22] Replace the XIP page fault handler with the DAX page fault handler
@ 2014-03-20 19:38               ` Matthew Wilcox
  0 siblings, 0 replies; 79+ messages in thread
From: Matthew Wilcox @ 2014-03-20 19:38 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Ross Zwisler, Toshi Kani, Matthew Wilcox, linux-kernel, linux-mm,
	linux-fsdevel

On Tue, Mar 04, 2014 at 11:56:25AM +1100, Dave Chinner wrote:
> > get_block_t is used by the VFS to ask filesystem to translate logical
> > blocks within a file to sectors on a block device.
> > 
> > typedef int (get_block_t)(struct inode *inode, sector_t iblock,
> >                         struct buffer_head *bh_result, int create);
> > 
> > get_block() must not be called simultaneously with the create flag set
> > for overlapping extents *** or is/was this a bug in ext2? ***
> 
> Why not? The filesystem is responsible for serialising such requests
> internally, or if it can't handle them preventing them from
> occurring. XFS has allowed this to occur with concurrent overlapping
> direct IO writes for a long time. In the situations where we need
> serialisation to avoid data corruption (e.g. overlapping sub-block
> IO), we serialise at the IO syscall context (i.e at a much higher
> layer).

I'm basing this on this commit:

commit 14bac5acfdb6a40be64acc042c6db73f1a68f6a4
Author: Nick Piggin <npiggin@suse.de>
Date:   Wed Aug 20 14:09:20 2008 -0700

    mm: xip/ext2 fix block allocation race
    
    XIP can call into get_xip_mem concurrently with the same file,offset with
    create=1.  This usually maps down to get_block, which expects the page
    lock to prevent such a situation.  This causes ext2 to explode for one
    reason or another.
    
    Serialise those calls for the moment.  For common usages today, I suspect
    get_xip_mem rarely is called to create new blocks.  In future as XIP
    technologies evolve we might need to look at which operations require
    scalability, and rework the locking to suit.

Now, specifically for DAX, reads and writes are serialised by DIO_LOCKING,
just like direct I/O.  *currently*, simultaneous pagefaults are not
serialised against each other at all (or reads/writes), which means we
can call get_block() with create=1 simultaneously for the same block.
For a variety of reasons (scalability, fix a subtle race), I'm going to
implement an equivalent to the page lock for pages of DAX memory, so in
the future the DAX code will also not call into get_block in parallel.

So what should the document say here?  It sounds like it's similar to
the iblock beyond i_size below; that the filesystem and the VFS should
cooperate to ensure that it only happens if the filesystem can make
it work.

> > Despite the iblock argument having type sector_t, iblock is actually
> > in units of the file block size, not in units of 512-byte sectors.
> > iblock must not extend beyond i_size. *** um, looks like xfs permits
> > this ... ? ***
> 
> Of course.  There is nothing that prevents filesystems from mapping
> and allocating blocks beyond EOF if they can do so sanely, and
> nothign that prevents getblocks from letting them do so. Callers
> need to handle the situation appropriately, either by doing their
> own EOF checks or by leaving it up to the filesystems to do the
> right thing. Buffered IO does the former, direct IO does the latter.
> 
> The code in question in __xfs_get_blocks:
> 
>         if (!create && direct && offset >= i_size_read(inode))
> 	                return 0;
> 
> This is because XFS does direct IO differently to everyone else.  It
> doesn't update the inode size until after IO completion (see
> xfs_end_io_direct_write()), and hence it has to be able to allocate
> and map blocks beyond EOF.
> 
> i.e. only direct IO reads are disallowed if the requested mapping is
> beyond EOF. Buffered reads beyond EOF don't get here, but direct IO
> does.

So how can we document this?  'iblock must not be beyond i_size unless
it's allowed to be'?  'iblock may only be beyond end of file for direct
I/O'?

> > If bh_result describes an extent that is allocated, then BH_Mapped will
> > be set, and b_bdev and b_blocknr will be set to indicate the physical
> > location on the media. 
> 
> Definition is wrong. If bh_result has a physical mapping that the
> *filesystem understands* and can reuse later, then BH_Mapped will be
> set.

Ah, yes, fair point.

> > If the
> > filesystem knows that the extent is read into memory (eg because it
> > decided to populate the page cache as part of its get_block operation),
> > it should set BH_Uptodate.
> 
> I think your terminology is wrong their - it is not valid to
> *populate* the page cache during a get_blocks call. page cache
> population occurs before *calls* get_blocks to map a page to a
> block.  On read (mpage_readpages) the page is attached the page to
> the "map_bh". This allows getblocks to initialise the contents of
> the buffer being mapped during the getblocks call, and tell the
> callee that they need to map_buffer_to_page() to correctly update
> the state of the buffers on the page to reflect the state
> returned.
> 
> So, it's not *populating* the page cache at all - the page cache is
> already populated - it's *initialising data* in the page.

Ah!  I misunderstood this comment:

                        /*
                         * get_block() might have updated the buffer
                         * synchronously
                         */
                        if (buffer_uptodate(bh))
                                continue;

I'll fix the text.

> Note that the behaviour on write can be different. For example,
> __block_write_begin() uses the page uptodate and BH_Uptodate to
> determine if it needs to read in data from disk (i.e. a RMW cycle).
> Some filesystems will set BH_Uptodate despite not having
> instantiated data into the blocks because the block is not
> instantiated on disk (BH_delay) or contains zeros (BH_Unwritten) and
> higher layers will zero the buffer data appropriately....
> 
> > If bh_result describes a hole, the filesystem should clear BH_Mapped and
> > set BH_Uptodate.
> 
> This is describing create = 0 behaviour (create = 1 means
> allocation is required, and so that's going to return something
> different).

Sure, but this whole section is describing the state of the returned
buffer.  I agree that you're not going to get a bh_result that describes
a hole when you pass in create=1.

> in the read case, BH_Uptodate should only be set if the filesystem
> initialised the data to zero. Otherwise the data is not uptodate,
> and the caller needs to zero the hole. Indeed, XFS never sets
> BH_Uptodate on a hole mapping because it needs the caller to zero
> the data....

Aha, good to know, thanks!

> > It will not set b_bdev or b_blocknr, but it should set
> > b_size to indicate the length of the hole.  It may also opt to leave
> > bh_result untouched as described above.  If the block corresponds to
> > a hole, bh_result *may* be unmodified, but the VFS can optimise some
> > operations if the filesystem reports the length of the hole as described
> > below. *** or is this a bug in ext4? ***
> 
> mpage_readpages() remaps the hole on every block, so regardless of
> the size of the hole getblocks returns it will do the right thing.

Yes, this is true, but it could be more efficient if it could rely on
get_block to return an accurate length of the hole.

> > If bh_result describes an extent which has data in the pagecache, but
> > that data has not yet had space allocated on the media (due to delayed
> > allocation), BH_Mapped, BH_Uptodate and BH_Delay will be set.  b_blocknr
> > is not set.
> 
> b_blocknr is undefined. filesystems can set it to whatever they
> want; they will interpret it correctly when they see BH_Delay on the
> buffer.

Thanks, I'll fix that.

> Again, BH_Uptodate is used here by XFS to prevent
> __block_write_begin() from issuing IO on delalloc buffers that have
> no data in them yet. That's specific to the filesystem
> implementation (as XFS uses __block_write_begin), so this is
> actually a requirement of __block_write_begin() for being used with
> delalloc buffers, not a requirement of the getblocks interface.

So ... "BH_Mapped and BH_Delay will be set.  BH_Uptodate may be set.
b_blocknr may be used by the filesystem for its own purpose."

> > If bh_result describes an extent which is reserved for data which has not
> > yet been written, BH_Unwritten is set, b_bdev and b_blocknr are set.
> 
> And if the getblocks call zeroed the data, BH_Uptodate shoul dbe
> set, too.
> 
> > Other flags that may be set include BH_New (to indicate that the buffer
> > was newly allocated and may need to be initialised), and BH_Boundary (to
> > indicate a physical discontinuity after this extent).
> 
> BH_New deserved more description than that. Yes, it is used to
> indicate a newly allocated block, but you'll note that they require
> special handling *everywhere*. The common processing is to clear
> underlying block device aliases and invalidate metadata mappings,
> but it is also used to zero the regions of new buffers that aren't
> getting data written to and to inform error handling to zero regions
> that are still marked new when an error is detected.
> 
> In fact, because of this data initialisation and handling of errors,
> XFS also sets BH_new on unwritten extents and any buffer that maps
> beyond EOF so that the callers initialise them to contain zeros
> appropriately.
> 
> > The filesystem may choose to set b_private.  Other fields in buffer_head
> > are legacy buffer-cache uses and will not be modified by get_block.
> 
> Precisely why we should be moving this interface to a structure like
> this:
> 
> struct iomap {
> 	sector_t	blocknr;
> 	sector_t	length;
> 	void		*fsprivate;
> 	unsigned int	state;
> };
> 
> That is used specifically for returning block maps and directing
> allocation, rather than something that conflates mapping, allocation
> and data initialisation all into the one interface...

You missed:
	struct block_device *bdev;

And I agree, but documentation of the current get_block interface is
not the place to launch into a polemic on how things ought to be.  It's
reasonable to point out infelicities ... which I do in a few places :-)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v6 07/22] Replace the XIP page fault handler with the DAX page fault handler
  2014-03-20 19:38               ` Matthew Wilcox
@ 2014-03-20 23:55                 ` Dave Chinner
  -1 siblings, 0 replies; 79+ messages in thread
From: Dave Chinner @ 2014-03-20 23:55 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Ross Zwisler, Toshi Kani, Matthew Wilcox, linux-kernel, linux-mm,
	linux-fsdevel

On Thu, Mar 20, 2014 at 03:38:44PM -0400, Matthew Wilcox wrote:
> On Tue, Mar 04, 2014 at 11:56:25AM +1100, Dave Chinner wrote:
> > > get_block_t is used by the VFS to ask filesystem to translate logical
> > > blocks within a file to sectors on a block device.
> > > 
> > > typedef int (get_block_t)(struct inode *inode, sector_t iblock,
> > >                         struct buffer_head *bh_result, int create);
> > > 
> > > get_block() must not be called simultaneously with the create flag set
> > > for overlapping extents *** or is/was this a bug in ext2? ***
> > 
> > Why not? The filesystem is responsible for serialising such requests
> > internally, or if it can't handle them preventing them from
> > occurring. XFS has allowed this to occur with concurrent overlapping
> > direct IO writes for a long time. In the situations where we need
> > serialisation to avoid data corruption (e.g. overlapping sub-block
> > IO), we serialise at the IO syscall context (i.e at a much higher
> > layer).
> 
> I'm basing this on this commit:
> 
> commit 14bac5acfdb6a40be64acc042c6db73f1a68f6a4
> Author: Nick Piggin <npiggin@suse.de>
> Date:   Wed Aug 20 14:09:20 2008 -0700
> 
>     mm: xip/ext2 fix block allocation race
>     
>     XIP can call into get_xip_mem concurrently with the same file,offset with
>     create=1.  This usually maps down to get_block, which expects the page
>     lock to prevent such a situation.  This causes ext2 to explode for one
>     reason or another.
>     
>     Serialise those calls for the moment.  For common usages today, I suspect
>     get_xip_mem rarely is called to create new blocks.  In future as XIP
>     technologies evolve we might need to look at which operations require
>     scalability, and rework the locking to suit.
> 
> Now, specifically for DAX, reads and writes are serialised by DIO_LOCKING,
> just like direct I/O.

No, filesystems don't necessarily use DIO_LOCKING for direct IO.
That's the whole point of having the flag - so filesystems can use
their own, more efficient serialisation for getblocks calls.  XFS
does it via the internal XFS inode i_ilock, others do it via the
i_alloc_sem, and so on.

Indeed, XFS has extremely good DIO scalability because it uses
shared locks where ever possible inside getblocks - it only takes an
exclusive lock if allocation is actually required - and hence
concurrent lookups don't serialise unless a modification is
required...

> *currently*, simultaneous pagefaults are not
> serialised against each other at all (or reads/writes), which means we
> can call get_block() with create=1 simultaneously for the same block.
> For a variety of reasons (scalability, fix a subtle race), I'm going to
> implement an equivalent to the page lock for pages of DAX memory, so in
> the future the DAX code will also not call into get_block in parallel.

Which is going to be a major scalability issue. You need to allow
concurrent calls into getblock...

> So what should the document say here?  It sounds like it's similar to
> the iblock beyond i_size below; that the filesystem and the VFS should
> cooperate to ensure that it only happens if the filesystem can make
> it work.

"Filesystems are responsible for ensuring coherency w.r.t.
concurrent access to their block mapping routines"

> > > Despite the iblock argument having type sector_t, iblock is actually
> > > in units of the file block size, not in units of 512-byte sectors.
> > > iblock must not extend beyond i_size. *** um, looks like xfs permits
> > > this ... ? ***
> > 
> > Of course.  There is nothing that prevents filesystems from mapping
> > and allocating blocks beyond EOF if they can do so sanely, and
> > nothign that prevents getblocks from letting them do so. Callers
> > need to handle the situation appropriately, either by doing their
> > own EOF checks or by leaving it up to the filesystems to do the
> > right thing. Buffered IO does the former, direct IO does the latter.
> > 
> > The code in question in __xfs_get_blocks:
> > 
> >         if (!create && direct && offset >= i_size_read(inode))
> > 	                return 0;
> > 
> > This is because XFS does direct IO differently to everyone else.  It
> > doesn't update the inode size until after IO completion (see
> > xfs_end_io_direct_write()), and hence it has to be able to allocate
> > and map blocks beyond EOF.
> > 
> > i.e. only direct IO reads are disallowed if the requested mapping is
> > beyond EOF. Buffered reads beyond EOF don't get here, but direct IO
> > does.
> 
> So how can we document this?  'iblock must not be beyond i_size unless
> it's allowed to be'?  'iblock may only be beyond end of file for direct
> I/O'?

"filesystems must reject attempts to map a range beyond EOF if
create is not set".

> > > b_size to indicate the length of the hole.  It may also opt to leave
> > > bh_result untouched as described above.  If the block corresponds to
> > > a hole, bh_result *may* be unmodified, but the VFS can optimise some
> > > operations if the filesystem reports the length of the hole as described
> > > below. *** or is this a bug in ext4? ***
> > 
> > mpage_readpages() remaps the hole on every block, so regardless of
> > the size of the hole getblocks returns it will do the right thing.
> 
> Yes, this is true, but it could be more efficient if it could rely on
> get_block to return an accurate length of the hole.

Sure. But do_mpage_readpage() needs an aenema. It's a complex mess
that relies on block_read_full_page() and bufferhead based IO to
sort out the crap it gets confused about.

As it is, I suspect that this is all going to be academic. I'm in
the process of ripping bufferhead support out of XFS and adding a
new iomapping interface so that we don't have to carry bufferheads
around on pages anymore. As such, I'll have to re-implement whatever
you do with DAX to support bufferhead enabled filesystems....

> > > If bh_result describes an extent which has data in the pagecache, but
> > > that data has not yet had space allocated on the media (due to delayed
> > > allocation), BH_Mapped, BH_Uptodate and BH_Delay will be set.  b_blocknr
> > > is not set.
> > 
> > b_blocknr is undefined. filesystems can set it to whatever they
> > want; they will interpret it correctly when they see BH_Delay on the
> > buffer.
> 
> Thanks, I'll fix that.
> 
> > Again, BH_Uptodate is used here by XFS to prevent
> > __block_write_begin() from issuing IO on delalloc buffers that have
> > no data in them yet. That's specific to the filesystem
> > implementation (as XFS uses __block_write_begin), so this is
> > actually a requirement of __block_write_begin() for being used with
> > delalloc buffers, not a requirement of the getblocks interface.
> 
> So ... "BH_Mapped and BH_Delay will be set.  BH_Uptodate may be set.
> b_blocknr may be used by the filesystem for its own purpose."

The only indication that a buffer contains a delayed allocation map
is BH_Delay. What the filesystem sets in other flags is determined
by the filesystem implementation, not the getblock API. e.g. it
looks like ext4 always sets buffer_new() on a delalloc block, but
XFs only ever sets it on the getblocks call that allocates the
delalloc block. i.e. the use of many of these flags is filesystem
implementation specific, so you can't actually say that the getblock
API has fixed definitions for these combinations of flags.

> > > The filesystem may choose to set b_private.  Other fields in buffer_head
> > > are legacy buffer-cache uses and will not be modified by get_block.
> > 
> > Precisely why we should be moving this interface to a structure like
> > this:
> > 
> > struct iomap {
> > 	sector_t	blocknr;
> > 	sector_t	length;
> > 	void		*fsprivate;
> > 	unsigned int	state;
> > };
> > 
> > That is used specifically for returning block maps and directing
> > allocation, rather than something that conflates mapping, allocation
> > and data initialisation all into the one interface...
> 
> You missed:
> 	struct block_device *bdev;

Sure - it's in the structures I'm using. ;)

> And I agree, but documentation of the current get_block interface is
> not the place to launch into a polemic on how things ought to be.  It's
> reasonable to point out infelicities ... which I do in a few places :-)

Saying how things are broken and ought to be is the first step
towards fixing a mess. XFS is definitely going to be moving away
from this current mess so we can support block size > page size  and
mulit-page writes without major changes needed to either the
filesystem or the page cache.  Moving away from bufferheads and the
getblock interface is a necessary part of that change...

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v6 07/22] Replace the XIP page fault handler with the DAX page fault handler
@ 2014-03-20 23:55                 ` Dave Chinner
  0 siblings, 0 replies; 79+ messages in thread
From: Dave Chinner @ 2014-03-20 23:55 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Ross Zwisler, Toshi Kani, Matthew Wilcox, linux-kernel, linux-mm,
	linux-fsdevel

On Thu, Mar 20, 2014 at 03:38:44PM -0400, Matthew Wilcox wrote:
> On Tue, Mar 04, 2014 at 11:56:25AM +1100, Dave Chinner wrote:
> > > get_block_t is used by the VFS to ask filesystem to translate logical
> > > blocks within a file to sectors on a block device.
> > > 
> > > typedef int (get_block_t)(struct inode *inode, sector_t iblock,
> > >                         struct buffer_head *bh_result, int create);
> > > 
> > > get_block() must not be called simultaneously with the create flag set
> > > for overlapping extents *** or is/was this a bug in ext2? ***
> > 
> > Why not? The filesystem is responsible for serialising such requests
> > internally, or if it can't handle them preventing them from
> > occurring. XFS has allowed this to occur with concurrent overlapping
> > direct IO writes for a long time. In the situations where we need
> > serialisation to avoid data corruption (e.g. overlapping sub-block
> > IO), we serialise at the IO syscall context (i.e at a much higher
> > layer).
> 
> I'm basing this on this commit:
> 
> commit 14bac5acfdb6a40be64acc042c6db73f1a68f6a4
> Author: Nick Piggin <npiggin@suse.de>
> Date:   Wed Aug 20 14:09:20 2008 -0700
> 
>     mm: xip/ext2 fix block allocation race
>     
>     XIP can call into get_xip_mem concurrently with the same file,offset with
>     create=1.  This usually maps down to get_block, which expects the page
>     lock to prevent such a situation.  This causes ext2 to explode for one
>     reason or another.
>     
>     Serialise those calls for the moment.  For common usages today, I suspect
>     get_xip_mem rarely is called to create new blocks.  In future as XIP
>     technologies evolve we might need to look at which operations require
>     scalability, and rework the locking to suit.
> 
> Now, specifically for DAX, reads and writes are serialised by DIO_LOCKING,
> just like direct I/O.

No, filesystems don't necessarily use DIO_LOCKING for direct IO.
That's the whole point of having the flag - so filesystems can use
their own, more efficient serialisation for getblocks calls.  XFS
does it via the internal XFS inode i_ilock, others do it via the
i_alloc_sem, and so on.

Indeed, XFS has extremely good DIO scalability because it uses
shared locks where ever possible inside getblocks - it only takes an
exclusive lock if allocation is actually required - and hence
concurrent lookups don't serialise unless a modification is
required...

> *currently*, simultaneous pagefaults are not
> serialised against each other at all (or reads/writes), which means we
> can call get_block() with create=1 simultaneously for the same block.
> For a variety of reasons (scalability, fix a subtle race), I'm going to
> implement an equivalent to the page lock for pages of DAX memory, so in
> the future the DAX code will also not call into get_block in parallel.

Which is going to be a major scalability issue. You need to allow
concurrent calls into getblock...

> So what should the document say here?  It sounds like it's similar to
> the iblock beyond i_size below; that the filesystem and the VFS should
> cooperate to ensure that it only happens if the filesystem can make
> it work.

"Filesystems are responsible for ensuring coherency w.r.t.
concurrent access to their block mapping routines"

> > > Despite the iblock argument having type sector_t, iblock is actually
> > > in units of the file block size, not in units of 512-byte sectors.
> > > iblock must not extend beyond i_size. *** um, looks like xfs permits
> > > this ... ? ***
> > 
> > Of course.  There is nothing that prevents filesystems from mapping
> > and allocating blocks beyond EOF if they can do so sanely, and
> > nothign that prevents getblocks from letting them do so. Callers
> > need to handle the situation appropriately, either by doing their
> > own EOF checks or by leaving it up to the filesystems to do the
> > right thing. Buffered IO does the former, direct IO does the latter.
> > 
> > The code in question in __xfs_get_blocks:
> > 
> >         if (!create && direct && offset >= i_size_read(inode))
> > 	                return 0;
> > 
> > This is because XFS does direct IO differently to everyone else.  It
> > doesn't update the inode size until after IO completion (see
> > xfs_end_io_direct_write()), and hence it has to be able to allocate
> > and map blocks beyond EOF.
> > 
> > i.e. only direct IO reads are disallowed if the requested mapping is
> > beyond EOF. Buffered reads beyond EOF don't get here, but direct IO
> > does.
> 
> So how can we document this?  'iblock must not be beyond i_size unless
> it's allowed to be'?  'iblock may only be beyond end of file for direct
> I/O'?

"filesystems must reject attempts to map a range beyond EOF if
create is not set".

> > > b_size to indicate the length of the hole.  It may also opt to leave
> > > bh_result untouched as described above.  If the block corresponds to
> > > a hole, bh_result *may* be unmodified, but the VFS can optimise some
> > > operations if the filesystem reports the length of the hole as described
> > > below. *** or is this a bug in ext4? ***
> > 
> > mpage_readpages() remaps the hole on every block, so regardless of
> > the size of the hole getblocks returns it will do the right thing.
> 
> Yes, this is true, but it could be more efficient if it could rely on
> get_block to return an accurate length of the hole.

Sure. But do_mpage_readpage() needs an aenema. It's a complex mess
that relies on block_read_full_page() and bufferhead based IO to
sort out the crap it gets confused about.

As it is, I suspect that this is all going to be academic. I'm in
the process of ripping bufferhead support out of XFS and adding a
new iomapping interface so that we don't have to carry bufferheads
around on pages anymore. As such, I'll have to re-implement whatever
you do with DAX to support bufferhead enabled filesystems....

> > > If bh_result describes an extent which has data in the pagecache, but
> > > that data has not yet had space allocated on the media (due to delayed
> > > allocation), BH_Mapped, BH_Uptodate and BH_Delay will be set.  b_blocknr
> > > is not set.
> > 
> > b_blocknr is undefined. filesystems can set it to whatever they
> > want; they will interpret it correctly when they see BH_Delay on the
> > buffer.
> 
> Thanks, I'll fix that.
> 
> > Again, BH_Uptodate is used here by XFS to prevent
> > __block_write_begin() from issuing IO on delalloc buffers that have
> > no data in them yet. That's specific to the filesystem
> > implementation (as XFS uses __block_write_begin), so this is
> > actually a requirement of __block_write_begin() for being used with
> > delalloc buffers, not a requirement of the getblocks interface.
> 
> So ... "BH_Mapped and BH_Delay will be set.  BH_Uptodate may be set.
> b_blocknr may be used by the filesystem for its own purpose."

The only indication that a buffer contains a delayed allocation map
is BH_Delay. What the filesystem sets in other flags is determined
by the filesystem implementation, not the getblock API. e.g. it
looks like ext4 always sets buffer_new() on a delalloc block, but
XFs only ever sets it on the getblocks call that allocates the
delalloc block. i.e. the use of many of these flags is filesystem
implementation specific, so you can't actually say that the getblock
API has fixed definitions for these combinations of flags.

> > > The filesystem may choose to set b_private.  Other fields in buffer_head
> > > are legacy buffer-cache uses and will not be modified by get_block.
> > 
> > Precisely why we should be moving this interface to a structure like
> > this:
> > 
> > struct iomap {
> > 	sector_t	blocknr;
> > 	sector_t	length;
> > 	void		*fsprivate;
> > 	unsigned int	state;
> > };
> > 
> > That is used specifically for returning block maps and directing
> > allocation, rather than something that conflates mapping, allocation
> > and data initialisation all into the one interface...
> 
> You missed:
> 	struct block_device *bdev;

Sure - it's in the structures I'm using. ;)

> And I agree, but documentation of the current get_block interface is
> not the place to launch into a polemic on how things ought to be.  It's
> reasonable to point out infelicities ... which I do in a few places :-)

Saying how things are broken and ought to be is the first step
towards fixing a mess. XFS is definitely going to be moving away
from this current mess so we can support block size > page size  and
mulit-page writes without major changes needed to either the
filesystem or the page cache.  Moving away from bufferheads and the
getblock interface is a necessary part of that change...

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 79+ messages in thread

end of thread, other threads:[~2014-03-20 23:55 UTC | newest]

Thread overview: 79+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-02-25 14:18 [PATCH v6 00/22] Support ext4 on NV-DIMMs Matthew Wilcox
2014-02-25 14:18 ` Matthew Wilcox
2014-02-25 14:18 ` [PATCH v6 01/22] Fix XIP fault vs truncate race Matthew Wilcox
2014-02-25 14:18   ` Matthew Wilcox
2014-02-25 14:18 ` [PATCH v6 02/22] Allow page fault handlers to perform the COW Matthew Wilcox
2014-02-25 14:18   ` Matthew Wilcox
2014-02-25 14:18 ` [PATCH v6 03/22] axonram: Fix bug in direct_access Matthew Wilcox
2014-02-25 14:18   ` Matthew Wilcox
2014-02-25 14:18 ` [PATCH v6 04/22] Change direct_access calling convention Matthew Wilcox
2014-02-25 14:18   ` Matthew Wilcox
2014-02-25 14:18 ` [PATCH v6 05/22] Introduce IS_DAX(inode) Matthew Wilcox
2014-02-25 14:18   ` Matthew Wilcox
2014-02-25 14:18 ` [PATCH v6 06/22] Replace XIP read and write with DAX I/O Matthew Wilcox
2014-02-25 14:18   ` Matthew Wilcox
2014-03-11  0:32   ` Toshi Kani
2014-03-11  0:32     ` Toshi Kani
2014-03-11 12:53     ` Matthew Wilcox
2014-03-11 12:53       ` Matthew Wilcox
2014-02-25 14:18 ` [PATCH v6 07/22] Replace the XIP page fault handler with the DAX page fault handler Matthew Wilcox
2014-02-25 14:18   ` Matthew Wilcox
2014-02-28 17:49   ` Toshi Kani
2014-02-28 17:49     ` Toshi Kani
2014-02-28 20:20     ` Matthew Wilcox
2014-02-28 20:20       ` Matthew Wilcox
2014-02-28 22:18       ` Toshi Kani
2014-02-28 22:18         ` Toshi Kani
2014-02-28 22:18         ` Toshi Kani
2014-03-02 23:30       ` Dave Chinner
2014-03-02 23:30         ` Dave Chinner
2014-03-03 23:07         ` Ross Zwisler
2014-03-03 23:07           ` Ross Zwisler
2014-03-04  0:56           ` Dave Chinner
2014-03-04  0:56             ` Dave Chinner
2014-03-20 19:38             ` Matthew Wilcox
2014-03-20 19:38               ` Matthew Wilcox
2014-03-20 23:55               ` Dave Chinner
2014-03-20 23:55                 ` Dave Chinner
2014-02-25 14:18 ` [PATCH v6 08/22] Replace xip_truncate_page with dax_truncate_page Matthew Wilcox
2014-02-25 14:18   ` Matthew Wilcox
2014-02-25 14:18 ` [PATCH v6 09/22] Remove mm/filemap_xip.c Matthew Wilcox
2014-02-25 14:18   ` Matthew Wilcox
2014-02-25 14:18 ` [PATCH v6 10/22] Remove get_xip_mem Matthew Wilcox
2014-02-25 14:18   ` Matthew Wilcox
2014-02-25 14:18 ` [PATCH v6 11/22] Replace ext2_clear_xip_target with dax_clear_blocks Matthew Wilcox
2014-02-25 14:18   ` Matthew Wilcox
2014-02-25 14:18 ` [PATCH v6 12/22] ext2: Remove ext2_xip_verify_sb() Matthew Wilcox
2014-02-25 14:18   ` Matthew Wilcox
2014-02-25 14:18 ` [PATCH v6 13/22] ext2: Remove ext2_use_xip Matthew Wilcox
2014-02-25 14:18   ` Matthew Wilcox
2014-02-25 14:18 ` [PATCH v6 14/22] ext2: Remove xip.c and xip.h Matthew Wilcox
2014-02-25 14:18   ` Matthew Wilcox
2014-02-25 14:18 ` [PATCH v6 15/22] Remove CONFIG_EXT2_FS_XIP and rename CONFIG_FS_XIP to CONFIG_FS_DAX Matthew Wilcox
2014-02-25 14:18   ` Matthew Wilcox
2014-02-25 14:18 ` [PATCH v6 16/22] ext2: Remove ext2_aops_xip Matthew Wilcox
2014-02-25 14:18   ` Matthew Wilcox
2014-02-25 14:18 ` [PATCH v6 17/22] Get rid of most mentions of XIP in ext2 Matthew Wilcox
2014-02-25 14:18   ` Matthew Wilcox
2014-02-25 14:18 ` [PATCH v6 18/22] xip: Add xip_zero_page_range Matthew Wilcox
2014-02-25 14:18   ` Matthew Wilcox
2014-02-25 14:18 ` [PATCH v6 19/22] ext4: Make ext4_block_zero_page_range static Matthew Wilcox
2014-02-25 14:18   ` Matthew Wilcox
2014-02-25 14:18 ` [PATCH v6 20/22] ext4: Add DAX functionality Matthew Wilcox
2014-02-25 14:18   ` Matthew Wilcox
2014-02-25 14:18 ` [PATCH v6 21/22] ext4: Fix typos Matthew Wilcox
2014-02-25 14:18   ` Matthew Wilcox
2014-02-25 14:18 ` [PATCH v6 22/22] dax: Add reporting of major faults Matthew Wilcox
2014-02-25 14:18   ` Matthew Wilcox
2014-02-26 15:07 ` [PATCH v6 23/22] Bugfixes Matthew Wilcox
2014-02-26 15:07   ` Matthew Wilcox
2014-02-27 14:01 ` [PATCH v6 00/22] Support ext4 on NV-DIMMs Florian Weimer
2014-02-27 14:01   ` Florian Weimer
2014-02-27 16:29   ` Matthew Wilcox
2014-02-27 16:29     ` Matthew Wilcox
2014-02-27 16:36     ` Florian Weimer
2014-02-27 16:36       ` Florian Weimer
2014-03-02  8:22 ` Pavel Machek
2014-03-02  8:22   ` Pavel Machek
     [not found] ` <CF4DEE22.25C8F%matthew.r.wilcox@intel.com>
2014-03-18 18:45   ` [PATCH v6 20/22] ext4: Add DAX functionality Ross Zwisler
2014-03-18 18:45     ` Ross Zwisler

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.